Validating the Construct of Word in Applied

Applied Linguistics 28/2: 241–265
doi:10.1093/applin/amm010
ß Oxford University Press 2007
Validating the Construct of Word in
Applied Corpus-based Vocabulary
Research: A Critical Survey
DEE GARDNER
Brigham Young University
Corpus-based vocabulary research has had a profound impact on English
language education, and there is abundant evidence that this will remain the
case for the foreseeable future. Perhaps the greatest challenge of such research
is the determination of what constitutes a Word for counting and analysis
purposes. Decisions in this regard have important ramifications not only for the
lexical findings themselves, but also for the pedagogical theories and practices
that derive from them. This article surveys several fields of study in order to
discuss this dilemma, with a particular focus on three problematic areas relating
to computer-processed corpora: (a) morphological relationships between words,
(b) homonymy and polysemy, and (c) multiword items. The article concludes
with recommendations for assessing the validity of the Word construct in applied
corpus-based vocabulary research.
The influence of corpora and corpus-based research on educational theories
and practices is well-established in both first language (L1) and second
language (L2) settings (e.g. Granger et al. 2002; Hunston 2002; Nagy and
Anderson 1984; Nation 1990, 2001; Nation and Waring 1997; Partington
1998; Schmitt 2004; Sinclair 2004a). Perhaps nowhere has this influence
been more apparent than in the areas of vocabulary analysis, training, and
theory-building. For example, the landmark study of words in printed school
English by Nagy and Anderson (1984) was based on the American Heritage
Intermediate corpus (Carroll et al. 1971). Findings from the study were
instrumental in establishing the construct of Incidental Vocabulary Learning, or
learning from context (Nagy et al. 1985, 1987), which, in turn, had a major
impact on the popular practice of Extensive Reading in its various forms
(e.g. wide reading, free reading, pleasure reading, book floods, sustained
silent reading—Elley 1991, 1996; Krashen 1989, 1993; Nagy 1997; Nagy and
Anderson 1984; Nagy et al. 1987).
The vocabulary-levels (VL) research by Nation and his colleagues is
another example of how corpus-based vocabulary investigations have
shaped pedagogical practices (Nation 1990, 2001). Among other instructional
applications, VL research has taken the strong position that relatively
short lists of high frequency words—derived from analyses of large
corpora—should receive direct instruction, as they cover such a large
242
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
percentage of the total words that English language learners will actually
encounter. Proponents further argue that for reading to be an avenue of
protracted vocabulary growth, a learner must first have control of these
high frequency, high coverage words. Otherwise, the lexical demands of
most authentic texts would prevent readers from achieving the necessary
levels of reading comprehension for contextual word learning to take place
(Laufer 1997; Nation 1990, 2001).
However, upon closer examination, the validity of these two seminal
research paradigms, as well as other corpus-based vocabulary applications,
appears to hinge on the various ways that researchers have operationalized
the construct of Word for counting and analysis purposes. Some variables that
raise validity issues include the multifaceted nature of English morphological
word families, the impact of multiple word meanings, and the presence of
multiword items in the English lexicon. Furthermore, when corpus-based
vocabulary findings are used to inform or support actual language
acquisition, there is the additional concern of whether researcher-based
conceptualizations of Word (i.e. the criteria used to group words, count
words, etc.) actually match the psychological realities of Word (i.e. actual
knowledge of or about words in the minds of target language users).
All of these Word validity issues have particular significance for applied
vocabulary research that relies on, or is based on, lexical data from
computer-processed corpora. Examples of research that fits this description
include those that (a) estimate learners’ vocabulary sizes (e.g. Goulden et al.
1990; Nagy and Anderson 1984), (b) assess learners vocabulary competencies
(e.g. Beglar and Hunt 1999; Cameron 2002; Laufer and Goldstein 2004;
Laufer and Nation 1995, 1999; Nation 1983; Schmitt et al. 2001), (c) determine how many words learners need to know to adequately function in
a language (e.g. Nation 1990, 2000; Laufer 1989, 1997; Hazenberg and
Hulstijn 1996—Dutch language; Hsueh-chao and Nation 2000), (d) assess the
lexical characteristics, distributions, or densities of various written or spoken
texts (e.g. Gardner 2004; Hirsh and Nation 1992; Kyongho and Nation 1989;
Nation 1990, 2001; Nation and Kyongho 1995; Sutarsyah et al. 1994),
(e) produce lists of words for instructional or informative purposes
(e.g. Carroll et al. 1971; Coxhead 2000; Francis and Kučera 1982; Xue and
Nation 1984), and (f) hypothesize about the processes involved in actual
vocabulary acquisition (e.g. Nagy and Anderson 1984). There are also
those examples where the findings of corpus-based vocabulary research
(e.g. Nagy and Anderson 1984) have been used to support theories in
other educational venues (e.g. Krashen 1989, 1993).
The purpose of this article is to raise awareness of this Word dilemma in
applied corpus-based vocabulary research through a survey of several
fields of study, and to make recommendations for improving the validity of
such research in informing English language education. My intent is not
to criticize, but to improve the use of corpora in educational research and
practice. In terms of an audience for this discussion, I wish to make a clear
DEE GARDNER
243
distinction between the relatively large multidisciplinary group of applied
vocabulary researchers whom I address in this article (i.e. those who utilize
corpora to inform and support first and second language acquisition) and the
field of Corpus Linguistics proper (i.e. those who utilize corpora to study
and describe language). While evidence from the latter field will be used
to support certain claims of this study, I do not provide an extensive review
of the construct of Word in Corpus Linguistics.
Having said this, however, I also wish to recognize the increasingly porous
boundary between traditional Corpus Linguistics and Language Education, as
evidenced by the numerous research articles and books dealing with various
uses of corpora in language teaching (e.g. Biber et al. 2004; Granger et al. 2002;
Hunston 2002; Hunston and Francis 1998; Nesselhauf 2003; Partington 1998;
Schmitt 2004; Sinclair 2004a). In this regard, certain aspects of the current
discussion may also be applicable, such as matching corpus-based vocabulary
applications (concordancing, etc.) to learners’ skill levels and language
backgrounds, the potential differences in word-meaning redundancy between
large mega-corpora and smaller specialized corpora, and so forth.
Finally, additional audiences for this discussion are those interested
language educators who wish to be better informed about the possibilities
and limitations of corpus-based vocabulary research and applications for their
particular instructional circumstances (cf. Seidlhofer 2002). Along with
language learners themselves, these are the individuals who are most likely
to be affected by the potential lexical oversights discussed in this article.
RESEARCH BASIS FOR CRITIQUING THE CONSTRUCT
OF WORD
Convergent vocabulary and morphology studies from various fields of
inquiry provide a foundation for assessing the construct of Word in corpusbased vocabulary research that attempts to inform English language
education. Three key factors emerging from these studies appear to be
superficially treated in the bulk of applied corpus-based vocabulary research:
(a) the degree to which learners of various language backgrounds and skills
levels can make connections between morphologically-related words; (b) the
impact of homonymy and polysemy; and (c) the impact of multiword items
in the lexicon. I would argue that the lexical distortions resulting from
oversights in any of these areas, or their possible combinations, can have
long-term ramifications for theories and practices in language education.
Morphological relationships between words
The concept of lemma
In addressing morphological relationships between words, many corpus
linguists rely heavily on the concept of lemma, which is defined by Francis
and Kučera (1982: 1) as ‘a set of lexical forms having the same stem and
244
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
belonging to the same major word class, differing only in inflection and/or
spelling’. Examples of lemma sets would be base forms of verbs together with
their inflected forms (e.g. climb, climbs, climbing, climbed). All four verb forms
would be considered under the lemma CLIMB, with capitalization indicating
the actual lemma set (cf. Stubbs 2002). However, the noun derivative climber
would be considered its own lemma, because it represents a different word
class (i.e. noun instead of verb). The lexical set belonging to the noun lemma
CLIMBER would also include the plural form climbers, and possibly the two
possessive forms climber’s and climbers’.
The Francis and Kučera definition of lemma also allows inclusion of
irregular verb forms in a lexical set (e.g. went belongs to GO; flew and flown
belong to FLY; broke and broken belong to BREAK; am, is, are, was, were, been,
and being belong to BE). However, the case of the irregulars poses serious
quandaries relating to the psychological validity of such family relationships—namely, that the opaque spelling and phonological connections
between the lemma headword and the family members will surely
cause more and different learning problems than their more transparent
counterparts.
Also implied in the traditional definitions of lemma is the idea that
the lexical sets must belong to the same grammatical class (all are nouns,
all are verbs, etc.—cf. Hunston 2002). Some corpus linguists also make the
additional distinction—and I would argue that it is a crucial distinction from
an applied perspective—that forms must share the same meanings in addition
to belonging to the same grammatical class (e.g. Stubbs 2002). This definition
puts lemma more on a par with lexeme, another term adopted by corpus
linguists to describe relationships between word forms and their meanings:
lexeme: a group of word forms that share the same basic meaning
(apart from that associated with the inflections that distinguish
them) and belong to the same word class (Biber et al. 1999: 54)
While a discussion of definitions within Corpus Linguistics is not the purpose
of this article, it is crucial to understand how different conceptualizations of
what constitutes a Word for counting purposes can affect the results and
conclusions of applied corpus-based vocabulary research. To begin, it would
be virtually impossible to meet the criteria of same grammatical class and
same meaning in grouping words unless the researcher had access
to a grammatically and semantically tagged corpus or a sophisticated
collocational analysis program—both of which are in their developmental
infancy (cf. Landes et al. 1998; Ravin and Leacock 2000; Sinclair 2004b).
Without such resources, a machine-based frequency count of a lemma like
MAKE might erroneously count lexical verb forms (make, makes, making, made)
with common noun forms (make and makes—as in makes and models of cars),
gerund forms (making—as in the making of the film), and even the
verbal components of multi-meaning phrasal verbs like make out
(makes out, making out, made out), make up (makes up, making up, make up),
DEE GARDNER
245
and make off (makes off, making off, made off ). The distortions in lexical
frequency counts resulting from these types of overgeneralizations
could certainly impact the validity of vocabulary research aimed at
establishing instructional word lists, assessing vocabulary loads in written
and oral texts, determining the vocabulary needed to comprehend texts,
and so forth.
The concept of word families
The definition of what constitutes a word for counting purposes and potential
learning purposes has also been the subject of extensive debate in L1 and L2
instructional contexts for quite some time (Anderson and Freebody 1981;
Anderson and Nagy 1992; Bauer and Nation 1993; Hazenberg and Hulstijn
1996; Goulden et al. 1990; Nagy and Anderson 1984; Wesche and
Paribakht 1996):
What counts as a word will depend upon the researcher’s principal
purposes. However, affixes and derivatives are important elements
of word knowledge, and several questions related to their role
are of considerable interest: In what way does knowledge of
basic or root forms relate to knowledge of the compound forms?
Are entries organized conceptually in the personal dictionary such
that the probability of knowing a compound word is the same as
that of knowing all its family members, basic form included?
Or is the chance of knowing a compound some combination of
the frequencies of the particular compounding elements? Much is
to be gained from research into these issues (Anderson and
Freebody 1981: 98).
In this broad characterization of some of the morphological concerns facing
vocabulary researchers, the experts tie morphology to both learners’ potential
knowledge and to issues of frequency. While there may be strong reasons in
general for analyzing frequency of Word Families (base forms plus inflected
forms and transparent derivatives) instead of frequency of individual word
forms, there is also evidence that word-level morphological relationships
have varying degrees of transparency in terms of graphic similarity
and meaning redundancy, and therefore the extent to which learners of
different ages and language backgrounds are able to associate related words
(Nagy et al. 1993).
Given the prolific nature of inflectional and derivational affixation in
English (i.e. affixed forms outnumber basic forms four to one—Cunningham
1998), it is little wonder that vocabulary researchers have had difficulty
in defining the construct of Word. At one extreme, we might be tempted
to count only unique spellings, as was the case in the American Heritage
Intermediate (AHI) corpus (Carroll et al. 1971). This avoids the role of
morphological affixation altogether, but it loses a great deal of psychological
validity. For instance, it is highly unlikely that average readers in the third
246
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
through ninth grades (the focus of the AHI study) would see no connection
between boy and boys, or between walk and walks.
At the other extreme, however, there may be a tendency to say that
whatever morphological taxonomies we come up with to group related
words must have some correlation to the way learners actually associate
words in their minds. This view is particularly problematic in light of the
various orientations that individual learners have to morphological issues,
and the vast number of morphological variables that must be accounted for.
Bauer and Nation (1993: 257) offer perhaps the best attempt at addressing
several of these morphological variables as they relate to actual learner
knowledge. Their analysis of the Lancaster-Oslo-Bergen (LOB) corpus
resulted in a seven-level categorization scheme which the researchers
consider to be useful for ‘practical’ rather than ‘theoretical’ purposes. They
are also careful to explain that the seven levels (summarized below) are not
discrete or absolute, but represent possible ‘steps along a cline’ of word family
recognition (1993: 257), and that certain affixes can appear in more than one
level depending on restrictive usages:
Level 1: Each form is a different word. The assumption here is that learners
have basically no concept of morphological relationships between words.
The researchers see this as a very ‘pessimistic’ view, but a potentially
useful assumption when multiple meanings for the same word form are
considered—a point that will be addressed later in this paper.
Level 2: Inflectional suffixes. At this level, it is assumed that learners have
to perform ‘minimal morphographemic analysis’ in recognizing most
inflections. Words with these suffixes are roughly equivalent to those
that would fall under the classification of lemma in corpus linguistics.
Level 3: The most frequent and regular derivational affixes. This level includes
affixes such as -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, and un-.
Level 4: Frequent, orthographically regular affixes. Here the researchers
prioritized in terms of frequency over productivity, and orthography
over phonology, given the researchers’ emphasis on written materials
over spoken language. At this level, the researchers found restricted
(in this case, transparent) uses of -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize,
-ment, -ous, and in-.
Level 5: Regular but infrequent affixes. This level adds a number of affixes
that are fairly regular, but they do not individually add greatly to the
number of words that can be understood by learners (e.g. -age, -al, -an,
-hood, -let, anti-, arch-, and bi-).
Level 6: Frequent but irregular affixes. While affixes at this level may be
fairly frequent in the language, they nonetheless create ‘major problems
of segmentation, either because they cause gross (orthographic)
allomorphy in their bases (that is, parts of the base are deleted or
additions besides the suffix are needed), or because there are major
problems involved in segmenting them caused by homography
DEE GARDNER
247
[same form with different meaning].’ Affixes at this level are -able, -ee, -ic,
-ify, -ion, -ist, -ition, -ive, -th, -y, pre-, and re-.
Level 7: Classical roots and affixes. At this level belong all the classical roots
which abound in English words and which occur not only as bound roots
in English (as in embolism), but also as elements in neo-classical
compounds (such as photography). According to Bauer and Nation, both
native speakers and L2 speakers must be taught such forms ‘explicitly.’
However, they also point out that many prefixes at this level are frequent
in English, for example ab-, ad-, com-, de- dis-, ex-, and sub-(Bauer and
Nation 1993: 258–62).
One apparent advantage of this seven-level categorization scheme is that
Word or Word Family can be operationalized at various defensible levels for
analysis and comparative analysis purposes—at least in terms of learners’
abilities to associate morphologically related words. According to Bauer and
Nation, such a framework could foster consistency in research involving
estimations of vocabulary size, age-related morphological development,
lexical storage, and dictionary construction.
However, the framework has some potential weaknesses that deserve
further discussion in the context of applied corpus-based vocabulary
research. First, the fact that many affixed forms are repeated at
different levels (e.g. -able, -y, -ist) suggests that the use of the framework
beyond certain major distinctions (e.g. No Family Relationships vs.
All Family Relationships, or All Inflection vs. All Derivation) could become
extremely problematic, especially if one considers the fact that before
level determinations could be made, case by case assessments of affixed
word forms would be necessary to determine if a prolific derivational
affix was acting transparently or not. While this might not present
a dilemma with smaller corpora containing few different words, it could
certainly pose a major hurdle for larger text samples with many
different words.
Second the researchers tend to lump derivational affixes together without
consideration of the fact that derivational prefixes and derivational
suffixes may present different learning dilemmas for developing readers
(Nagy et al. 1993). For example, derivational prefixes tend to be
paraphrasable (e.g. non- and un- in Bauer and Nation’s Level 3 mean not),
and they do not change the word class of the bases they are added to
(e.g. entity and nonentity are both nouns; do and undo are both verbs).
In contrast, derivational suffixes change word class (e.g. do the verb becomes
doable the adjective when the suffix -able—also listed in Level 3—is added).
Also, the meanings contributed by many derivational suffixes such as ment,
ness, and ish are not only difficult to define, but they may also be difficult
for learners to grasp (cf. Nagy et al. 1993).
A third concern is that the framework does not adequately address the
role of stems in derived forms. For example, studies have shown that
248
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
the definitions of derivatives given by L1 students (approximately nine to
thirteen years old) often center on the meaning of the stems with little or no
attention paid to the meaning of the suffixes (Freyd and Baron 1982;
Wysocki and Jenkins 1987).
In one particular L1 study, Tyler and Nagy (1989) did detect some
low-level learner knowledge of the contribution of suffixes, which the
researchers attributed to the more precise measurement tools used in the
study. However, the researchers also found that the students learned
the contribution of suffixes in derived forms only after they learned to
recognize the stems in those forms.
Hancin-Bhatt and Nagy (1994) also found some low-level student
knowledge of suffixes in their study of 196 Spanish-English bilingual
students at three age levels (roughly nine, eleven, and thirteen). However,
the researchers also concluded that cognate stems were realized more often
in suffixed words than noncognate stems, suggesting that the nature of the
stems, rather than the suffixes, was the major factor in determining whether
suffixation was being utilized by the students or not. Taken together,
these findings suggest that a psychologically valid definition of Word Family
(i.e. from a learner perspective) should address graphic and semantic
transparency issues involving morphemic stems before addressing the relative
contributions of affixes—a topic that I will address in greater detail in later
sections of this article.
A final concern is that the Bauer and Nation framework, like many others
to date, seems to assume that learners’ exposure to and acquisition of
morphologically-related words is somehow linear in nature—in other words,
that language learners acquire base forms before their inflected and derived
family members. However, Biemiller and Slonim (2001) point out that young
children may actually acquire many derived forms before they acquire their
root-form counterparts.
Of particular concern here is the question of whether learners of
various skill levels and language backgrounds are as capable of
recognizing and utilizing the common morphemic stem of a Word
Family when their initial and extensive exposure to that stem may be
through inflected and derived forms, rather than base forms. Extant
research in this regard suggests that such a non-linear process may be
more difficult for learners. For instance, Stahl and Shiel (1992: 233)
conclude that, ‘many children, especially poorer readers, have difficulty
isolating the root word’. Furthermore, while instruction may tend to favor
the presentation of root forms before their affixed relatives (cf. Jiang
2000), there is no way of controlling for such exposure in ‘authentic’
texts and during ‘natural’ reading and conversational experiences.
This would only seem possible through materials and communicative
contexts that have been linguistically engineered to control for vocabulary
presentation.
DEE GARDNER
249
Developmental and language background issues
The issues in the forgoing discussion make it clear that decisions by applied
corpus researchers to group and count words based on morphological
relationships must be tempered by learner variables if results are to be truly
useful in informing language pedagogy and theory-building. Too often
such decisions are based on researcher preference, computer limitations,
or convenience, and the resultant findings and claims tend to be far too
general or abstracted for real-world application.
The following is a brief synthesis of research findings that have considered
developmental and language background issues relating to the acquisition
of English morphological knowledge.
1 School-aged learners acquire English inflectional morphology before derivational
morphology. Berko (1958) and Derwing and Baker (1979) established that
L1 children generally learn inflection and compounding before derivation. In fact, knowledge of inflectional morphology is assumed to be well
underway by the first grade (Berko 1958). On the other hand, knowledge
of derivational morphology begins sometime before the fourth grade
(nine years old—Carlisle 2000; Tyler and Nagy 1989), continues through
the upper-elementary and middle grades (ten to fourteen years old—
Carlisle 2000; Mahony et al. 2000; Singson et al. 2000; Tyler and Nagy
1989), through high school (fifteen to eighteen years old), and possibly
beyond (Nagy et al. 1993; Tyler and Nagy 1990).
Hancin-Bhatt and Nagy (1994) add that bilingual children (roughly nine
to thirteen years old) perform better with inflected forms (same
grammatical category) than with derivational forms (different grammatical category). Additionally, these researchers suggest that an age-related
developmental pattern exists in children’s ‘knowledge of the relationships
between English and Spanish derivational suffixes’ (1994: 305). However,
while increases in morphological knowledge were positively correlated
with age, the overall knowledge of ‘English suffixes and their relationships to Spanish suffixes’, even among the oldest children, remained
‘relatively low’ (1994: 306).
2 L2 adult learners acquire English inflectional morphology before derivational
morphology. Schmitt and Meara (1997) found that adult L2 learners of
English (95 Japanese L1) show earlier and more profound awareness
(receptive and productive) of English inflectional verb suffixes than
derivational verb suffixes. The researchers attribute this finding to the
rule-based nature of English inflectional morphology as opposed to
the more idiosyncratic nature of English derivational morphology.
Schmitt and Zimmerman (2002) found similar results in their study of
106 nonnative English speakers (university students), drawing the
following pedagogical conclusion: ‘This study indicates that teachers
cannot assume that learners will absorb the derivative forms of a word
250
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
family automatically from exposure. Rather, in this area, explicit
attention to form may be of value’ (2000: 163).
3 Individual differences in language skills account for noticeable variation in
morphological knowledge. The findings from the Freyd and Baron (1982)
study also suggest that individual differences in language capacity, rather
than age or grade level, are key predictors of derivational word
knowledge. They found that precocious fifth-graders (ten years old)
exhibited more awareness of derivational morphology than average
eighth-graders (thirteen years old).
Several research studies have also found strong correlations between
morphological knowledge and superior reading skills (Carlisle 2000; Freyd
and Baron 1982; Mahony et al. 2000; Nagy et al. 1993; Singson et al.
2000; Tyler and Nagy 1989, 1990). These correlations strongly suggest
that a reciprocal relationship exists between the two cognitive abilities—
that is, morphological knowledge is both a byproduct of skilled reading
and a contributor to skilled reading.
A recent study also suggests that variability in children’s early acquisition
of English vocabulary is strongly correlated with their morphological
awareness (McBride-Chang et al. 2005). Interestingly, morphological
awareness appeared to be a good predictor of vocabulary knowledge even
when ‘phonological processing, word reading skill, and age were
statistically controlled’ (2005: 428).
4 Explicit training can improve a learner’s ability to utilize morphological
knowledge. A growing body of research indicates that learners’
morphological development can be productively enhanced through
direct instruction of the relationships between root forms and
their affixed family members. This morphological awareness raising
through word analysis techniques appears to be effective for all groups of
learners, including L1 Children (Cunningham 1998; Stahl and Shiel
1992), L2 children (Carlo et al. 2004), and L2 adults (Mirhassani and
Toosi 2000).
Homonymy and polysemy
Grabe (1991) notes that assessments of vocabulary are often hindered by the
fact ‘that each word form is counted as a single word, though in reality, each
word form may represent a number of distinct meanings, some of which
depend strongly on the reading context, and some of which are quite
different from each other in meaning’ (1991: 392).
Meaning variation among words with the same or similar forms has
been debated among semanticists for many years (Anderson and Nagy 1991).
At a very fundamental level, scholars agree that a single word form can have
multiple meanings—for example: ‘Row (in a row of chairs) and row (in row
a boat)’ (Goulden et al. 1990: 344). There is also acknowledgement that
DEE GARDNER
251
a word’s various meanings can be clearly distinct (homonymy), as in the
row example above, or more closely related (polysemy):
Polysemes are the many variants of meaning of a word where it is
clear that the meanings are truly related. The verb break has many
different variants which are related in meaning [He broke his leg;
The cup broke; She broke his heart; She broke the world record, etc.].
The verb put also has an array of polysemes. Homonyms
(sometimes called homographs), on the other hand, are variants
that are spelled alike but which have no obvious commonality in
meaning. The classic example is the word bank, which could
mean a financial institution or the bank of a river or a tier or
row of objects (such as a bank of seats at a baseball game).. . .
One question regarding core research is whether or not polysemes
such as those shown in the break example really are the same
word (whether there is some invariant Gesamtbedeutung that links
them) or whether some should be listed as separate lexical items.
(Hatch and Brown 1995: 49)
This potential for meaning variation (both homonymy and polysemy)
becomes even more convoluted when the morphological word family is
considered. For instance, forms that appear to be related through affixation
may actually be homographs in context (e.g. bear, the animal, and bears/
bearing, the verb meaning to carry), or repetitions of the same affixed
forms may actually be homographs in context (e.g. bears, the plural animal,
and bears, the verb meaning to carry).
The existence of potential polysemes seems to complicate matters even
more (e.g. bear/bears/bearing, the verb meaning ‘to move while holding
up and supporting,’ and bear/bears/bearing, the verb meaning ‘to hold in the
mind’—Webster’s Ninth New Collegiate Dictionary 1988). To make matters even
worse, the past form of the verb bear (either definition above) is bore, which
bears (additional meaning) no orthographic resemblance to the other forms
in the same semantic family. Additionally, the form bore itself has much more
common meanings that are totally unrelated to bear in the senses described
above (to bore a hole, the man is a bore, I don’t want to bore you with this, etc.).
A quick search in WordNet (2003) reveals two senses for the noun bear,
13 senses for the verb bear, two additional senses for the verb bore, and
four senses for the noun bore—a total of 21 different meaning senses for
the two forms bear and bore. This does not even include the two adjective
senses of born which could also be included in the bear/bore family.
Conceivably, a machine-based frequency count of word-family forms could
link all of these forms of bear and bore together (assuming that the researcher
had determined to count bore under bear), but the question remains whether
or not a child or other language learner encountering these various forms
would make any semantic connections between them based on context.
Would they perceive some invariant Gesamtbedeutung that links them all?
One might argue here that I have chosen a rare example for effect,
252
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
but it is easy to find similar examples among the highest frequency forms of
English (e.g. bank—18 senses; pick—21 senses; put—10 senses; fall—44
senses; read—12 senses; call—41 senses). In fact, Ravin and Leacock (2000: 1)
remind us that ‘the most commonly used words tend to be the most
polysemous’ and Biemiller and Slonim (2001) add that general print
frequency of word forms is a poor predictor of learners’ root word knowledge
because high frequency words tend to have variant meanings—a point that
has important ramifications for corpus-based vocabulary findings that rely
heavily on frequency of word forms, without consideration of their variant
meanings, especially when it is assumed that such findings can be easily
transferred to pedagogical applications, or that they accurately represent the
psychological realities in the minds of learners.
Some corpus linguists would also point out that the meaning of a word
is dependent on the other words associated with it in a particular text
(cotext), and that words are only ambiguous when isolated from their cotext
(Sinclair 2004b). From this perspective, the meaning of bore (see example
above) is ambiguous as an isolated word, but not so in the case of to bore
a hole (meaning to drill), and the man is a bore (meaning uninteresting or
tiresome). This line of reasoning brings into serious question the validity
of computerized counts of individual word forms or computer-generated lists
of individual word forms for investigative or instructional purposes, especially
if those words are of higher general frequency in the language.
I would further argue that the notion of lemma and counting lemmas
in much of Corpus Linguistics suffers from the same problem—namely,
that machine-based identifications and calculations of true lemmas (same
grammatical class and same meaning) are virtually impossible without a
grammatically and semantically tagged corpus or a sophisticated collocational
analysis program, both of which, as stated earlier, are in their developmental
infancy (Landes et al. 1998; Ravin and Leacock 2000; Sinclair 2004b).
While manual determinations of true lemmas might be possible on a contextby-context basis (perhaps using concordancing), there are at least two issues
that do not favor such an approach: (a) the overwhelming size of the task,
especially with large mega-corpora; and (b) the difficulty of accurately
assigning words to their particular lemmas.
the metaphorical use of lion (e.g., John is a lion) is likely to be
treated as ‘the same word,’ while the concrete and metaphorical
uses of crane (‘kind of bird’ and machine for lifting heavy objects’)
are more likely to be treated as independent words and therefore
members of different lemmas. If it is difficult to group word
meanings under headwords at the abstract level of the dictionary,
it is much more difficult to assign words in texts unambiguously
to their lemmas. (Knowles and Mohd Don 2004: 70)
These researchers also argue that ‘generalizations about whole lemma become
less and less convincing’ as detailed linguistic examinations of corpus-based
DEE GARDNER
253
data continue to be performed, and that researchers may need to begin
‘to consider individual words’ or ‘actually even individual word meanings’
as the basis for their analyses (Knowles and Mohd Don 2004: 71). Biemiller
and Slonim (2001: 510) add that ‘frequencies of word meanings rather than
word forms might lead to better predictions’ of learners’ root word knowledge,
although such frequencies ‘would be very hard to produce’.
In light of these semantic issues, I would argue that suggested applications
of corpus research based on frequency of word forms, without considerations
of word meanings, will invariably suffer from one of three problems—or
combinations of the three: (a) they will overestimate the true coverage of the
word forms; (b) they will underestimate the actual user knowledge required
to negotiate the word forms; and/or (c) they will underestimate the actual
number of meanings inherent in the word forms. I would further argue that
any of these conditions can have serious ramifications for the pedagogical
applications and theories that rely on such research. Indeed, the apparent
tensions building in Applied Linguistics between more traditional linguists
and corpus linguists seem to hinge on the limitations (e.g. Widdowson 2000)
and possibilities (e.g. Stubbs 2001) of corpus-based research to account for
the semantic and pragmatic realities of the language.
So what can be done to improve this situation or to make corpus-based
vocabulary claims more valid in terms of educational applications? One
useful consideration is that homonymy and weak polysemy among the high
frequency words of English would logically be the greatest in a corpus
consisting of many texts with varied themes and topics. Conversely, there is
less likelihood of semantic disparity within an individual text or a collection
of topically-related texts, as words and their meanings tend to be closely tied
to the themes or topics of those texts (Nagy et al. 1987; Sutarsyah et al. 1994).
However, the trend in modern corpus construction appears to be toward
bigger and broader. If one of the goals of such research is to provide
information for language pedagogy in terms of word coverage, instructional
word lists, and so forth, then much more attention must be paid to semantic
analyses of word forms (cf. Read 2000).
Another consideration is that technical words of the language tend to be
more complex in terms of form (i.e. more letters and more complex
affixations than basic words), but less ambiguous in terms of meaning.
This may be due in part to the fact that they are found in a narrower range
of texts (Bečka 1972; Chung and Nation 2004, Read 2000) and represent
more complex and specialized concepts than basic words, although some
basic word forms also have specialized meanings in some technical contexts
(e.g. supply and demand in an Economics text—Sutarsyah et al. 1994)—
further evidence of the semantic instability of the high frequency (basic)
word forms of the language.
Interestingly, in a study of the Academic Word List (AWL)—a list of
subtechnical words assumed to have high frequency and range in
academic reading materials—the researchers found that only 10 percent
254
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
(60 out of 570 words) were homographs (Ming-Tzu and Nation 2004).
These AWL words are assumed to fall between the general high frequency
and technical words in the vocabulary distribution of academic texts, possibly
suggesting that meaning variation may be on a cline from high frequency
words (high variation) to subtechnical words (moderate variation) to
technical words (low variation).
Technical words also appear to maintain more stable meanings over time.
For instance, it seems unlikely that there would be many instances when the
technical word mitosis (generally a low frequency word) has, or would begin
to have, widely variant meanings (homonymy), or when the technical family
archaeology, archaeologist, archaeologists, archaeological has, or would begin
to have, either unrelated meanings in context (homonymy) or even distantly
related meanings (weak polysemy). Therefore, it would appear that
computerized counts of lower-frequency, technical word families have
a higher degree of semantic validity than counts of higher-frequency,
basic word families of the language.
However, this does not mean that technical word families are more easily
acquired than basic word families, only that there is a higher degree of
semantic consistency among the forms in technical families. In fact, the
semantic load carried by most technical words found in expository reading
materials is so high that contextual word learning is often extremely difficult,
as evidenced in the following conclusion by Anderson (1996):
We found small but highly reliable increments in word knowledge
attributable to reading at all grades and ability levels. The overall
likelihood ranged from better than 1 in 10 when children were
reading easy narratives to near zero when they were reading
difficult expositions [emphases added]. (Anderson 1996: 61)
This suggests that the nature of the text in which an unknown word is
embedded (i.e. narrative vs. expository) can have a profound effect on its
learnability. Gardner (2004) adds that many specialized words tend to be
specific to a major register (narrative fiction vs. expository nonfiction)—that
is, they tend to appear exclusively in narrative texts or exclusively in gradeequivalent expository texts, with no overlap. He provides further evidence
that when a specialized word (e.g. mummy) is shared between major registers,
it is often found in vastly different contexts in those registers, presenting
different semantic connotations for the word. Again, this evidence has
important ramifications for the bulk of applied corpus-based vocabulary
studies that consider frequency and/or a range of word forms without
appropriate consideration of their variant meanings.
The impact of multiword items in the English lexicon
One of the key discoveries of Corpus Linguistics is the major
presence of lexical collocation (Sinclair 1987, 1991). In its broadest sense,
DEE GARDNER
255
collocation includes pairs or groups of words that have strong co-occurring
patterns within a relatively limited amount of discourse (e.g. sentence or
paragraph). Of particular interest to the current topic are those cases of
collocation that are fairly fixed in the language, often referred to as formulaic
language (Wray 2002), formulaic sequences (Schmitt 2004), or multi-word items
(Moon 1997)—the term they are used in this article: ‘A multi-word item is
a vocabulary item which consists of a sequence of two or more words (a
word being simply an orthographic unit). This sequence of words
semantically and/or syntactically forms a meaningful and inseparable unit’
(Moon 1997: 43). Under this definition, Moon includes compounds
(e.g. Prime Minister, crystal ball, collective bargaining), phrasal verbs (e.g. give
up, break off, write down), idioms (e.g. rock the boat, kick the bucket, spill the
beans, have an ax to grind), fixed phrases (e.g. of course, at least, in fact, good
morning, how do you do, excuse me), and prefabs (e.g. the thing is, that reminds
me, I’m a great believer in).
Of particular importance to the current discussion is the fact that
multiword items are composed of more than one word form, and that,
as a collective whole, they cover a substantial portion of spoken and
written English. In fact, Erman and Warren (2000: 29) estimate that prefabs,
their choice for a cover term, account for 58.6 percent of spoken English
and 52.3 percent of written English. Such evidence, they suggest, ‘makes it
impossible to consider idioms and other multi-word combinations as marginal
phenomena’. There is also convergent evidence that the sheer number of
different multiword items may exceed the number of individual words in the
lexicon (Jackendoff 1995; Mel’čuk 1995; Pawley and Syder 1983).
Furthermore, many multiword items have multiple meanings themselves.
For example, Gardner and Davies (in press) found that the 100 most frequent
phrasal verbs in the BNC (e.g. break up, set up, put out) have 559 potential
meaning senses, or an average of 5.6 per phrasal verb. Figures like these
raise serious questions regarding the linguistic and psychological validity
of corpus-based studies that rely primarily on frequency counts of single
word forms.
Sinclair and his associates take an even more radical approach toward
multiword items, suggesting that most lexical meaning is associated
with word patterns rather than with individual words—a concept referred
to as the ‘maximal approach’ (Sinclair 2004c: 280).
[The maximal approach] would be to extend the dimensions of a
unit of meaning until all the relevant patterning was included—all
the patterning that was instigated by the presence of the central
word. At least, to stay on the practical side, we should extend the
unit until the ambiguity disappears (as it does in almost every
case). (Sinclair 2004c: 280)
For Sinclair, a lexical item ‘consists of one or more words that together
make up a unit of meaning’ (2004c: 281). If accurate, this particular
256
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
conceptualization of Word, based on years of corpus investigation, is perhaps
the strongest indictment of traditional applied corpus-based vocabulary
research that has relied heavily on frequency counts of individual word
forms, or has treated grammar and lexis as being independent aspects
of language patterning. Similarly, Biber et al. (1999) provide numerous
examples of the intricate, often inseparable, relationships between lexical and
grammatical information in a corpus, and how variations in these
relationships tend to define and distinguish different macro-registers of
the language (i.e. conversation, fiction, academic prose, and newspaper
reportage).
Additional evidence of the combinatory nature of lexis and grammatical
or syntactical structure comes from applied linguistics under the cover term
Pattern Grammar (Francis et al. 1996; Hunston 2002; Hunston and Francis
1998, 1999) and Cognitive Linguistics under the cover term Construction
Grammar (Croft 2001; Fillmore et al. 1988; Fried and Östman 2004;
Goldberg 1995). The former suggests that the English language is replete with
patterns of codependency between lexical and grammatical items (e.g. account
of þ noun, change in þ noun), and that these patterns often overlap
themselves (accounts of recent changes in English language teaching—emphasis
added, Hunston 2002 146–7). The latter emphasizes the notion of
constructions. While a detailed review of constructions is beyond the scope
of this article, three of the major tenets of Construction Grammar
are certainly worth noting in the context of the current discussion on
multiword items:
1 ‘Constructions are taken to be the basic units of the language’ (Goldberg
1995: 4).
2 ‘Phrasal patterns are considered constructions if something about their
form or meaning is not strictly predictable from the properties of their
component parts or from other constructions’ (Goldberg 1995: 4).
3 ‘In Construction Grammar, no strict division is assumed between the
lexicon and syntax’ (Goldberg 1995: 7).
Interestingly, the Construction Grammarians and Pattern Grammarians
appear to have reached similar conclusions despite starting at opposite ends
of the empirical spectrum: the arguments of the former are built on
observations about rare word patterns, while those of the latter are based on
observations about the most frequent.
Shifting to a pedagogical focus, there is growing evidence that native
speakers of English tend to store and utilize multiword items more
productively than nonnative speakers (Schmitt et al. 2004; Wray 2000).
This L2 acquisition problem is further complicated by the fact that many
classes of multiword items, such as phrasal verbs, are very common
and highly productive in the English language as a whole (Celce-Murcia and
Larsen Freeman 1999; Darwin and Gray 1999; Gardner and Davies in press;
Moon 1997).
DEE GARDNER
257
Some nonnative English speakers actually avoid using phrasal verbs and
other multiword items altogether, especially those learners at the beginning
and intermediate levels of proficiency. (See Liao and Fukuya (2004) for a
review of this topic.) Interestingly, even learners whose native language
actually contains phrasal verbs (e.g. Dutch) often avoid using such forms
when communicating in English (Hulstijn and Marchena 1989).
Nesselhauf (2003: 237) also found that adult German learners of English
have ‘considerable difficulties’ producing the correct verb in various
verb–noun collocations (e.g. make one’s homework, instead of do one’s
homework; take one’s task; instead of carry out/perform one’s task). Her findings
indicate that L1 interference was responsible for approximately half of the
errors. The fact that these were advanced learners of English (i.e. university
students), who were likely aiming for a high degree of English competence,
emphasizes the scope of the multiword problem in English language
education.
The growing research interest in multiword and other collocational
phenomena, together with the actual experiences of language educators, has
also resulted in many instructional methodologies that scrap the traditional
grammar-versus-lexis dichotomy in favor of collocational training (e.g. Lewis
1993, 1997, 2000; Nattinger and DeCarrico 1992; Sinclair 2004a; Willis
1990). However, some of the same validity issues regarding the construct of
Word may apply to these modern corpus-based pedagogical applications,
even if they do highlight multiword phenomena. First, the term collocation
in the Mel’čuk/Cowie sense emphasizes combinations or strings of words
(Cowie 1994: Mel’čuk 1998), while collocation in the broader Sinclair sense
emphasizes co-occurrence of words within a certain distance of each other,
whether or not they constitute set phrases (e.g. Sinclair 1991, 2004a, 2004b).
These are quite different constructs, representing different conceptualizations
of the relationships between words in a text. Potentially, they also represent
different conceptualizations of how vocabulary is acquired and utilized by
language learners, and they are likely to lead to quite different instructional
presentations and emphases. (For more detailed discussions on both sides
of this collocation issue see Howarth 1996, 1998, and Gledhill 2000.)
Second, the acquisition of collocations appears to be affected by age issues,
L1 backgrounds, L1–L2 transfer issues, literacy skill levels, componential
aspects of collocations (e.g. verbs vs. nouns), and so forth (Nesselhauf 2003;
Wray 2000, 2002). Indeed, the popular claims regarding concordancing as
a powerful tool for the language classroom may need to be tempered by
these and other variables. For instance, it is likely that only the most
advanced language learners can take advantage of the intricate semantic
relationships between words that are revealed through concordancing.
Certainly, such an approach to language training presupposes that learners
will know most of the words (cotexts) that surround a key word or phrase in
context (KWIC), and that they can connect their meanings—an assumption
258
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
that seems unreasonable for many groups of language learners (children,
beginning L2 learners, learners with low literacy skills, etc.).
Finally, the size and specialization of corpora will have a definite bearing
on the psychological validity (from a learner perspective) of a particular
Word construct, even if that construct incorporates multiword and other
collocational phenomena. For example, a concordancing display of a KWIC
from a large mega-corpus might include many more collocational possibilities
and patterns than a similar display from a smaller specialized corpus, or even
a subregister within the mega-corpus itself. Is this multiplicity of possibilities
good for language learning? Is it bad? For what learners is it good or bad?
Which of the patterns should be emphasized?
As another example, if our construct for Word assumes a ‘maximal’ posture
(i.e. extending the dimensions of a lexical unit until ambiguity disappears),
who will decide the appropriate size of the lexical unit or the cotext to
be displayed on a concordance window? The linguist? The computer?
The learner? If the linguist or the computer, how do we know that the
learner will understand the same relationships? How much intervention will
be necessary for this knowledge to be available to various learners?
These and similar questions will require much more attention as we continue
to build bridges between corpus-based vocabulary research and its
pedagogical applications, and answers to such questions will undoubtedly
lead to expansions or modifications of the views presented in this article.
SUMMARY AND RECOMMENDATIONS
The wide-ranging evidence surveyed in this article suggests that a summary
of key research findings is in order. This summary will also include
recommendations for improving the validity of the Word construct in applied
corpus-based vocabulary research and its pedagogical extensions, with
particular emphasis on issues relevant to the English language and English
language users.
1 Age, English language skills, English literacy skills, and extent of
morphological training have been shown to affect the ability of English
language users to recognize and utilize morphological relationships
between words. Researchers’ determinations of what constitutes a Word
for counting and analysis purposes using electronic corpora should take
these variables into consideration, with the following specific recommendations for various groups of language users:
a.
Base Forms þ Regular Inflections: younger children; low general English
proficiency; low English literacy skills; no specific morphological
training.
b. Base Forms þ Regular Inflections þ Irregular Inflections þ Derivational Prefixes: older children and adolescents, intermediate general English
DEE GARDNER
c.
259
proficiency, intermediate English literacy skills; some morphological
training.
Base Forms þ Regular Inflections þ Irregular Inflections þ Derivational Prefixes þ Derivational Suffixes (regular) þ Derivational Suffixes (irregular):
adults, high general English proficiency; high English literacy skills;
extensive morphological training or experience.
It is possible that English language users could have different combinations
of these characteristics (e.g. adults, high general English proficiency,
low English literacy skills, no morphological training). From a morphological
perspective, this may necessitate additional conceptualizations of what
constitutes a Word for investigative, instructional, or assessment purposes.
In short, if the findings of a corpus study are being applied to certain groups
of language users (e.g. how much vocabulary they already know, how much
vocabulary they need to know, how much vocabulary they could learn),
then the Word construct (i.e. how it is operationalized) should match the
assumed morphological knowledge of those groups as closely as possible.
This also implies that researchers and practitioners should use caution in
extending the findings of corpus-based vocabulary studies to novel groups of
language users who may have morphological backgrounds and skill-sets that
are incongruent with the definitions of Word used in the original studies.
2 Many English word forms have multiple, context-dependent meanings
that are often obscured or completely lost in the processing of electronic
corpora, especially when the word forms become isolated from their
contexts, or are not considered in their contexts. In light of this fact,
adjustments and disclaimers to the construct of Word should be made,
paying particular attention to the following areas:
a.
the general characteristics of the corpus used in the analysis (i.e. more
potential for word meaning variation in larger corpora, between
multiple registers, and between varied topics).
b. the degree of word tagging utilized (i.e. untagged corpora have the
highest potential for form–meaning variation, followed by grammatically tagged corpora, and then semantically tagged corpora).
c. the general characteristics of the words themselves (i.e. basic high
frequency words have the highest potential for form–meaning
variation, followed by subtechnical words, and then technical words).
Researchers and practitioners should carefully consider these and other
form–meaning issues when evaluating corpus-generated findings that
produce instructional word lists, estimate learners’ vocabulary sizes,
determine how many words learners need to know to adequately
function in the language, access the lexical characteristics, distributions, or
densities of various written or spoken texts, or hypothesize about the
processes involved in actual vocabulary acquisition. In short, the level of
260
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
Word-meaning ambiguity dealt with in a particular corpus analysis or
application will directly affect its internal validity (how much we can trust
the conclusions or applications to be accurate), and/or its external validity
(how applicable the findings or applications are for various groups of
language users—for example L1 vs. L2, adults vs. children).
To date, the bulk of applied corpus studies appear to have treated
this form–meaning issue either superficially, or not at all. Undoubtedly,
it represents one of the greatest challenges for machine-processed language,
and calls for more and better ways to semantically tag electronic corpora,
or for more robust collocation-based computer programs that will allow
forms with similar meanings and forms with different meanings to be
identified and processed accurately.
3 English has many multiword items such as phrasal verbs (call on, chew him
out), idioms (rock the boat), fixed phrases (excuse me) and prefabs (the point
is). The degree to which these are identified and accounted for will
again affect the internal and external validity of a given corpus-based
vocabulary study. The following levels could serve as a potential guide
in determining the extent of multiword phenomena accounted for in
a particular study. The order also provides a rough estimate of the
difficulty associated with computerized identification of multiword items,
ranging from ‘a’ (least difficulty) to ‘e’ (highest difficulty):
a. Zero Level: closed compound nouns (mailman).
b. Restricted Level: closed compound nouns (mailman), and hyphenated
words (acid-free).
c. Moderate Level: closed compound nouns (mailman), hyphenated words
(acid-free), open compound nouns (post office), and nonseparable
phrasal verbs (call on).
d. Expanded Level: closed compound nouns (mailman), hyphenated
words (acid-free), open compound nouns (post office), nonseparable
phrasal verbs (call on), separable phrasal verbs (chew him out), idioms
(rock the boat), fixed phrases (excuse me), and prefabs (the point is).
e. Maximal Level: all co-occurring word patterns (contiguous or
noncontiguous) that form units of meaning (cf. Sinclair 2004c).
The prolific nature of multiword items in English—as well as the difficulty
in electronically identifying and processing them—is further complicated by
the fact that many English language users (e.g. L2 learners, L1 children)
struggle to recognize, acquire, and utilize such items. Researchers’
adjustments and disclaimers to the construct of Word should take these
issues into consideration, especially when corpus findings are being related
to certain groups of language users, or when findings relative to one group
are being extended to another.
Finally, I wish to emphasize that the recommendations above are not rigid
or independent taxonomies. In fact, the three major areas often interact
DEE GARDNER
261
with each other. For instance, an individual word form (e.g. chip)
and its apparent morphological relations (e.g. chips, chipping, chipped) can
have multiple, context dependent meanings (e.g. potato chip, computer chip,
chipping a golf ball, the chipped vase), and they can also combine with other
word forms to create multiword items with their own multiplicity of
meanings (e.g. He chipped in and got the job done; He chipped in the golf ball).
Furthermore, English language users with advanced skills are much more
likely to recognize and utilize the similarities and differences in these forms
(and their meanings) than less skilled users.
CONCLUSION
My primary aim in this article has been to raise awareness of the Word
dilemma that exists in applied corpus-based vocabulary research, especially
with regard to computerized processing of electronic word forms, and the
concomitant lexical oversights (theoretical, empirical, and pedagogical) that
can result from such research. I have also offered some recommendations for
addressing this issue in each of the major areas surveyed: morphological
relationships between word forms, semantic variation within and between
word forms, and the presence of multiword items in the lexicon.
Clearly, much more needs to be done to refine these recommendations,
and to identify and articulate the linguistic (language-based), psychological
(learner-based), and pedagogical (instructional-based) variables surrounding
the construct of Word in Applied Corpus Linguistics, particularly as we
attempt to inform language education in the area of English vocabulary,
which continues to be identified as a major correlate with literacy and other
essential academic skills (Alderson 2000; Read 2000).
I would also suggest that the major Word issues discussed in this article
could serve as a guide for many corpus applications (including interfaces)
that are designed to support language education. Because more and more
language training curricula are inspired by computerized research and
learning tools (Read 2004), there would seem to be a growing implicit
responsibility among those who produce such research and tools to make
them as valid and reliable as possible, or at least to inform practitioners of
their potential limitations. With such a perspective in mind, we may be in
a better position to build useful bridges between applied corpus-based
vocabulary research and Language Education.
Final version received May 2006
REFERENCES
Alderson, J. C. (2000). Assessing Reading.
Cambridge: Cambridge University Press.
Anderson, R. C. (1996). ‘Research foundations
to support wide reading’ in V. Greaney (ed.):
Promoting Reading in Developing Countries.
Newark, DE: International Reading Association,
pp. 55–77.
Anderson, R. C. and P. Freebody. (1981).
‘Vocabulary knowledge’ in J. T. Guthrie (ed.):
Comprehension and Teaching: Research Reviews.
262
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
Newark,
DE:
International
Reading
Association, pp. 77–117.
Anderson, R. C. and W. E. Nagy. (1991).
‘Word meanings’ in R. Barr, M. L. Kamil,
P. B. Mosenthal, and P. D. Pearson, Handbook
of Reading Research, Vol. 2. New York: Longman,
pp. 690–724.
Anderson, R. C. and W. E. Nagy. (1992).
‘The vocabulary conundrum,’ American Educator
Winter: 14–19, 44–47.
Bauer, L. and P. Nation. (1993). ‘Word families,’
International Journal of Lexicography 6/4: 253–79.
Beglar, D. and A. Hunt. (1999). ‘Revising and
validating the 2000 Word Level and University
Word Level vocabulary tests,’ Language Testing
16/2: 131–62.
Berko, J. (1958). ‘The child’s learning of English
morphology,’ Word 14: 150–77.
Bečka, J. V. (1972). ‘The lexical composition of
specialised texts and its quantitative aspect,’
Prague Studies in Mathematical Linguistics 4: 47–64.
Biber, D., S. Conrad, and V. Cortes. (2004).
‘If you look at . . ..: Lexical bundles in university
teaching and textbooks,’ Applied Linguistics 25/3:
371–405.
Biber, D., S. Johansson, G. Leech, S. Conrad,
and E. Finegan. (1999). Longman Grammar of
Spoken and Written English. London: Longman.
Biemiller, A. and N. Slonim. (2001). ‘Estimating
root word vocabulary growth in normative and
advantaged populations: Evidence for a common
sequence of vocabulary acquisition,’ Journal of
Educational Psychology 93/3: 498–520.
Cameron, L. (2002). ‘Measuring vocabulary size
in English as an additional language,’ Language
Teaching Research 6/2: 145–73.
Carlisle, J. F. (2000). ‘Awareness of the structure
and meaning of morphologically complex
words: Impact on reading,’ Reading and Writing:
An Interdisciplinary Journal 12: 169–90.
Carlo, M. S., D. August, B. McLaughlin,
C. E. Snow, C. Dressler, D. N. Lippman,
T. J. Lively, and C. E. White. (2004). ‘Closing
the gap: Addressing the vocabulary needs of
English-language learners in bilingual and
mainstream classroom,’ Reading Research
Quarterly 39/2: 188–215.
Carroll, J. B., P. Davies, and B. Richman. (1971).
The American Heritage Word Frequency Book.
New York: Houghton Mifflin.
Celce-Murcia, M. and D. Larsen-Freeman.
(1999). The Grammar Book: An ESL/EFL Teacher’s
Course 2nd edn. Heinle and Heinle Publishers.
Chung, T. M. and P. Nation. (2004). ‘Identifying
technical vocabulary,’ System 32: 251–63.
Cowie, A. P. (1994). ‘Phraseology’ in R. E. Asher
and J. Simpson (eds): The Encyclopedia of
Language and Linguistics. Oxford: Pergamon
Press, pp. 3168–71.
Coxhead, A. (2000). ‘A new academic word list,’
TESOL Quarterly 34/2: 213–38.
Croft, W. (2001). Radical Construction Grammar:
Syntactic Theory in Typological Perspective. Oxford:
Oxford University Press.
Cunningham, P. (1998). ‘The multisyllabic word
dilemma: Helping students build meaning, spell,
and read ‘‘big’’ words,’ Reading and Writing
Quarterly: Overcoming Learning Difficulties 14:
189–218.
Darwin, C.M. and L. S. Gray. (1999). ‘Going
after the phrasal verb: An alternative approach
to classification,’ TESOL Quarterly 33/1: 65–83.
Derwing, B. L. and W. J. Baker. (1979). ‘Recent
research on the acquisition of English morphology’ in P. Fletcher and M. Garman (eds):
Language Acquisition. Cambridge: Cambridge
University Press, pp. 209–23.
Elley, W. B. (1991). ‘Acquiring literacy in a second
language: The effect of book-based programs,’
Language Learning 41/3: 375–411.
Elley, W. B. (1996). ‘Using book floods to raise
literacy levels in developing countries’ in
V. Greaney (ed.), Promoting Reading in Developing
Countries. Newark, DE: International Reading
Association, pp. 148–62.
Erman, B. and B. Warren. (2000). ‘The idiom
principle and the open choice principle,’
Text 20/1: 29–62.
Fillmore, C. J., P. Kay, and M. C. O’Connor.
(1988). ‘Regularity and idiomaticity in
grammatical constructions: The case of let
alone,’ Language 64/3: 501–38.
Francis, W. N. and H. Kučera. (1982). Frequency
analysis of English Usage. Boston: Houghton
Mifflin.
Francis, G., S. Hunston, and R. Manning.
(1996). Collins COBUILD Grammar Patterns:
Verbs. London: Harper Collins.
Freyd, P. and J. Baron. (1982). ‘Individual
differences in acquisition of derivational
morphology,’ Journal of Verbal Learning and
Verbal Behavior 21: 282–95.
Fried, M. and J. Östman. (eds) (2004) Construction
Grammar in a Cross-language Perspective.
Amsterdam: John Benjamins Publishing
Company.
DEE GARDNER
Gardner, D. (2004). ‘Vocabulary input through
extensive reading: A comparison of words found
in Children’s narrative and expository reading
materials,’ Applied Linguistics 25/1: 1–37.
Gardner, D. and M. Davies. (in press). ‘Pointing
out frequent phrasal verbs: A corpus-based
analysis,’ TESOL Quarterly.
Gledhill, C. J. (2000). Collocations in Science Writing.
Tübingen: Gunter Narr Verlag.
Goldberg, A. E. (1995). Constructions: A Construction
Grammar Approach to Argument Structure. Chicago:
The University of Chicago Press.
Goulden, R., P. Nation, and J. Read. (1990).
‘How large can a receptive vocabulary be?’
Applied Linguistics 11/4: 341–63.
Grabe, W. (1991). ‘Current developments
in second language reading research,’ TESOL
Quarterly 25/3: 375–405.
Granger, S., J. Hung, and S. Petch-Tyson. (eds)
(2002). Computer Learner Corpora, Second Language
Acquisition and Foreign Language Teaching.
Amsterdam: John Benjamins Publishing
Company.
Hancin-Bhatt, B. and W. Nagy. (1994). ‘Lexical
transfer and second language morphological
development,’ Applied Psycholinguistics 15:
289–310.
Hatch, E. and C. Brown. (1995). Vocabulary,
Semantics, and Language Education. New York:
Cambridge.
Hazenberg, S. and J. H. Hulstijn. (1996).
‘Defining a minimal receptive secondlanguage vocabulary for non-native university
students: An empirical investigation,’ Applied
Linguistics 17/2.
Hirsch, D. and P. Nation. (1992). ‘What vocabulary size is needed to read unsimplified texts
for pleasure,’ Reading in a Foreign Language 8/2:
689–96.
Howarth, P. A. (1996). Phraseology in English
Academic Writing. Tübingen: Max Niemeyer
Verlag.
Howarth, P. (1998). ‘Phraseology and second
language proficiency,’ Applied Linguistics 19/1:
24–44.
Hsueh-chao, M. and P. Nation. (2000).
‘Unknown vocabulary density and reading
comprehension,’ Reading in a Foreign Language
13/1: 403–30.
Hulstijn, J. and E. Marchena. (1989).
‘Avoidance:
Grammatical
or
semantic
causes?’ Studies in Second Language Acquisition
11: 241–55.
263
Hunston, S. (2002). Corpora in Applied Linguistics.
Cambridge: Cambridge University Press.
Hunston, S. and G. Francis. (1998). ‘Verbs
observed: A corpus-driven pedagogic grammar,’
Applied Linguistics 19/1: 45–72.
Hunston, S. and G. Francis. (1999). Pattern
Grammar: A Corpus-driven Approach to the Lexical
Grammar of English. Amsterdam: John
Benjamins Publishing Company.
Jackendoff, R. (1995). ‘The boundaries of
the lexicon,’ in M. Everaert, E. van der
Linden, A. Schenk, and R. Schreuder (eds):
Idioms: Structural and Psychological Perspectives.
Hillsdale, NJ: Lawrence Erlbaum Associates,
pp. 133–66.
Jiang, N. (2000). ‘Lexical representation and
development in a second language,’ Applied
Linguistics 21/1: 47–77.
Knowles, G. and Z. Mohd Don. (2004).
‘The notion of a ‘‘lemma’’: Headwords, roots
and lexical sets,’ International Journal of Corpus
Linguistics 9/1: 69–81.
Krashen, S. (1989). ‘We acquire vocabulary and
spelling by reading: Additional evidence for the
input hypothesis,’ The Modern Language Journal
73/4: 440–64.
Krashen, S. (1993). The Power of Reading: Insights
from the Research. Englewood, CO: Libraries
Unlimited.
Kyongho, H. and P. Nation. (1989). ‘Reducing
the vocabulary load and encouraging vocabulary
learning through reading newspapers,’ Reading
in a Foreign Language 6/1: 322–35.
Landes S., C. Leacock, and R. I. Tengi. (1998)
‘Building semantic concordances’ in C. Fellbaum
(ed.): WordNet: An Electronic Lexical Database
Cambridge, MA: The MIT Press, pp. 199–216.
Laufer, B. (1989). ‘What percentage of text-lexis
is essential for comprehension’ in C. Lauren
and M. Nordman (eds): Special Language:
From Humans Thinking to Thinking Machines.
Philadelphia: Multilingual Matters, pp. 316–33.
Laufer, B. (1997). ‘The lexical plight in second
language reading: Words you don’t know,
words you think you know, and words you
can’t guess’ in J. Coady and T. Huckin (eds):
Second
Language
Vocabulary
Acquisition.
Cambridge: Cambridge University Press, pp.
20–34.
Laufer, B. and Z. Goldstein. (2004). ‘Testing
vocabulary knowledge: Size, strength, and
computer adaptiveness,’ Language Learning
54/3: 399–436.
264
WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH
Laufer, B. and P. Nation. (1995). ‘Vocabulary
size and use: Lexical richness in L2 written
production,’ Applied Linguistics 16: 307–22.
Laufer, B. and P. Nation. (1999). ‘A vocabularysize test of controlled productive ability,’
Language Testing 16/1: 33–41.
Lewis, M. (1993). The Lexical Approach. Hove,
England: LTP Publications.
Lewis, M. (1997). Implementing the Lexical Approach:
Putting Theory into Practice. Hove, England:
LTP Publications.
Lewis, M. (ed.) (2000). Teaching Collocation:
Further developments in the Lexical Approach.
Hove, England: LTP Publications.
Liao, Y. and Y. J. Fukuya. (2004). ‘Avoidance
of phrasal verbs: The case of Chinese learners of
English,’ Language Learning 54/2: 193–226.
McBride-Chang, C., R. K. Wagner, A Muse,
B. W. Y. Chow, and H. Shu. (2005). ‘The role
of morphological awareness in children’s
vocabulary acquisition in English,’ Applied
Psycholinguistics 26: 415–35.
Mahony, D., M. Singson, and V. Mann. (2000).
‘Reading ability and sensitivity to morphological
relations,’ Reading and Writing: An Interdisciplinary
Journal 12: 191–218.
Mel’čuk, I. (1995). ‘Phrasemes in language and
phraseology in linguistics’ in M. Everaert, E. van
der Linden, A. Schenk, and R. Schreuder (eds):
Idioms: Structural and Psychological Perspectives.
Hillsdale, NJ: Lawrence Erlbaum Associates,
pp. 167–232.
Mel’čuk, I. (1998). ‘Collocations and lexical
functions’ in A. P. Cowie (ed.): Phraseology:
Theory, Analysis, and Applications. Oxford:
Clarendon Press, pp. 23–53.
Ming-Tzu, K. W. and P. Nation. (2004). ‘Word
meaning in Academic English: Homography in
the Academic Word List,’ Applied Linguistics 25/3:
291–314.
Mirhassani, A. and A. Toosi. (2000). ‘The impact
of word-formation knowledge on reading
comprehension,’ IRAL 38: 301–11.
Moon, R. (1997). ‘Vocabulary connections:
multi-word items in English’ in N. Schmitt
and M. McCarthy (eds.): Vocabulary Description,
Acquisition and Pedagogy. Cambridge: Cambridge
University Press, pp. 40–63.
Nagy, W. (1997). ‘On the role of context in firstand second-language vocabulary learning’ in
N. Schmitt and M. McCarthy (eds): Vocabulary:
Description, Acquisition and Pedagogy. Cambridge:
Cambridge University Press, pp. 64–83.
Nagy, W. E. and R. C. Anderson. (1984). ‘How
many words are there in printed school English?’
Reading Research Quarterly 19/3: 304–30.
Nagy, W. E., R. C. Anderson, and P. A. Herman.
(1987). ‘Learning word meanings from context
during normal reading,’ American Educational
Research Journal 24/2: 237–70.
Nagy, W. E., I. N. Diakidoy, and R. C. Anderson.
(1993). ‘The acquisition of morphology:
Learning the contribution of suffixes to the
meanings of derivatives,’ Journal of Reading
Behavior 25/2: 155–80.
Nagy, W. E., P. A. Herman, and R. C. Anderson.
(1985). ‘Learning words from context,’ Reading
Research Quarterly 20/2: 233–53.
Nation, I. S. P. (1983). ‘Testing and teaching
vocabulary,’ Guidelines 5: 12–25.
Nation, I. S. P. (1990). Teaching and Learning
Vocabulary. New York: Newbury House
Publishers.
Nation, I. S. P. (2001). Learning Vocabulary
in Another Language. Cambridge: Cambridge
University Press.
Nation, P. and H. Kyongho. (1995). ‘Where
would general service vocabulary stop and
special purposes vocabulary begin?’ System
23/1: 35–41.
Nation, P. and R. Waring. (1997). ‘Vocabulary
size, text coverage and word lists’ in N. Schmitt
and M. McCarthy (eds): Vocabulary: Description,
Acquisition and Pedagogy. Cambridge: Cambridge
University Press, pp. 6–19.
Nattinger, J. R. and J. S. DeCarrico. (1992).
Lexical Phrases and Language Teaching. Oxford:
Oxford University Press.
Nesselhauf, N. (2003). ‘The use of collocations
by advanced learners of English and some
implications for teaching,’ Applied Linguistics
24/2: 223–42.
Partington, A. (1998). Patterns and Meanings:
Using Corpora for English Language Research
and Teaching. Amsterdam: John Benjamins
Publishing Company.
Pawley, A. and F. H. Syder. (1983). ‘Two puzzles
for linguistic theory: Nativelike selection and
nativelike fluency’ in J. C. Richards and R. W.
Schmidt (eds): Language and Communication.
London: Longman, pp. 191–225.
Ravin, Y. and C. Leacock. (2000). ‘Polysemy: An
overview’ in Y. Raven and C. Leacock (eds):
Polysemy:
Theoretical
and
Computational
Approaches. Oxford: Oxford University Press,
pp. 1–29.
DEE GARDNER
Read, J. (2000). Assessing Vocabulary. Cambridge:
Cambridge University Press.
Read, J. (2004). ‘Research in teaching vocabulary,’
Annual Review of Applied Linguistics 24: 146–61.
Schmidt, N., S. Grandage, and S. Adolphs.
(2004). ‘Are corpus-derived recurrent clusters
psycholinguistically valid’ in N. Schmitt (ed.):
Formulaic Sequences: Acquisition, Processing and Use.
Amsterdam: John Benjamins Publishing
Company, pp. 127–151.
Schmitt, N. (ed.) (2004). Formulaic Sequences.
Amsterdam: John Benjamins Publishing
Company.
Schmitt, N. and P. Meara. (1997). ‘Researching
vocabulary through a word knowledge framework: Word associations and verbal suffixes,’
SSLA 19/1: 17–35.
Schmitt, N. and C. B. Zimmerman. (2002).
‘Derivative word forms: What do learners
know?’ TESOL Quarterly 36/2: 145–71.
Schmitt, N., D. Schmitt, and C. Clapham.
(2001). ‘Developing and exploring the behaviour
of two new versions of the Vocabulary Levels
Test,’ Language Testing 18: 55–88.
Seidlhofer, B. (2002). ‘Pedagogy and local learner
corpora: Working with learning-driven data’
in S. Granger, J. Hung, and S. Petch-Tyson
(eds): Computer Learner Corpora, Second Language
Acquisition and Foreign Language Teaching.
Amsterdam: John Benjamins Publishing
Company, pp. 213–34.
Sinclair, J. (1987). ‘Collocation: a progress report’
in R. Steele and T. Threadgold (eds): Language
Topics: An International Collection of Papers by
Colleagues, Students and Admirers of Professor
Michael Halliday to Honour Him on His Retirement,
Vol. II. Amsterdam: John Benjamins Publishing
Company, pp. 319–31.
Sinclair, J. (1991). Corpus, Concordance, Collocation.
Oxford: Oxford University Press.
Sinclair, J. (2004a) (ed.). How to Use Corpora in
Language Teaching. Amsterdam: John Benjamins
Publishing Company.
Sinclair, J. (2004b). Trust the Text: Language,
Corpus and Discourse. London: Routledge.
Sinclair, J. (2004c). ‘New evidence, new priorities,
new attitudes’ in J. Sinclair (ed.): How to Use
Corpora in Language Teaching. Amsterdam: John
Benjamins Publishing Company, pp. 271–99.
265
Singson, M., D. Mahony, and V. Mann. (2000).
‘The relation between reading ability and
morphological skills: Evidence from derivational
suffixes,’ Reading and Writing: An Interdisciplinary
Journal 12: 219–52.
Stahl, S. A. and T. G. Shiel. (1992). ‘Teaching
meaning vocabulary: Productive approaches
for poor readers,’ Reading and Writing Quarterly:
Overcoming Learning Difficulties 8: 223–41.
Stubbs, M. (2001). ‘Texts, corpora, and problems
of interpretation: A response to Widdowson,’
Applied Linguistics 22/2: 149–72.
Stubbs, M. (2002). Words and Phrases: Corpus Studies
of Lexical Semantics. Oxford: Blackwell Publishing.
Sutarsyah, C., P. Nation, and G. Kennedy.
(1994). ‘How useful is EAP vocabulary for ESP?
A corpus based case study,’ RELC Journal 25/2:
34–50.
Tyler, A. and W. Nagy. (1989). ‘The acquisition
of English derivational morphology,’ Journal of
Memory and Language 28: 649–67.
Tyler, A. and W. Nagy. (1990). ‘Use of derivational morphology during reading,’ Cognition 36:
17–34.
Webster’s Ninth New Collegiate Dictionary. (1988).
Springfield, MA: Merriam-Webster Inc.
Wesche, M. and T. S. Paribakht. (1996).
‘Assessing second language vocabulary knowledge: Depth versus breadth,’ The Canadian
Modern Language Review 53/1: 13–40.
Widdowson, H. G. (2000). ‘On the limitations of
linguistics applied,’ Applied Linguistics 21/1: 3–25.
Willis, D. (1990). The Lexical Syllabus. London:
Harper Collins.
WordNet (2003). Princeton University electronic
lexical database (Version 2.0). Retrieved
February 11, 2004, from http://wordnet.
princeton.edu/obtain.
Wray, A. (2000). ‘Formulaic sequences in second
language teaching: Principle and practice,’
Applied Linguistics 21/4: 463–89.
Wray, A. (2002). Formulaic Language and the
Lexicon. Cambridge: Cambridge University Press.
Wysocki, K. and J. Jenkins. (1987). ‘Deriving
word meanings through morphological generalizations,’ Reading Research Quarterly 22: 66–81.
Xue, G. and P. Nation. (1984). ‘A university word
list,’ Language Learning and Communication 3/2:
93–242.