Applied Linguistics 28/2: 241–265 doi:10.1093/applin/amm010 ß Oxford University Press 2007 Validating the Construct of Word in Applied Corpus-based Vocabulary Research: A Critical Survey DEE GARDNER Brigham Young University Corpus-based vocabulary research has had a profound impact on English language education, and there is abundant evidence that this will remain the case for the foreseeable future. Perhaps the greatest challenge of such research is the determination of what constitutes a Word for counting and analysis purposes. Decisions in this regard have important ramifications not only for the lexical findings themselves, but also for the pedagogical theories and practices that derive from them. This article surveys several fields of study in order to discuss this dilemma, with a particular focus on three problematic areas relating to computer-processed corpora: (a) morphological relationships between words, (b) homonymy and polysemy, and (c) multiword items. The article concludes with recommendations for assessing the validity of the Word construct in applied corpus-based vocabulary research. The influence of corpora and corpus-based research on educational theories and practices is well-established in both first language (L1) and second language (L2) settings (e.g. Granger et al. 2002; Hunston 2002; Nagy and Anderson 1984; Nation 1990, 2001; Nation and Waring 1997; Partington 1998; Schmitt 2004; Sinclair 2004a). Perhaps nowhere has this influence been more apparent than in the areas of vocabulary analysis, training, and theory-building. For example, the landmark study of words in printed school English by Nagy and Anderson (1984) was based on the American Heritage Intermediate corpus (Carroll et al. 1971). Findings from the study were instrumental in establishing the construct of Incidental Vocabulary Learning, or learning from context (Nagy et al. 1985, 1987), which, in turn, had a major impact on the popular practice of Extensive Reading in its various forms (e.g. wide reading, free reading, pleasure reading, book floods, sustained silent reading—Elley 1991, 1996; Krashen 1989, 1993; Nagy 1997; Nagy and Anderson 1984; Nagy et al. 1987). The vocabulary-levels (VL) research by Nation and his colleagues is another example of how corpus-based vocabulary investigations have shaped pedagogical practices (Nation 1990, 2001). Among other instructional applications, VL research has taken the strong position that relatively short lists of high frequency words—derived from analyses of large corpora—should receive direct instruction, as they cover such a large 242 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH percentage of the total words that English language learners will actually encounter. Proponents further argue that for reading to be an avenue of protracted vocabulary growth, a learner must first have control of these high frequency, high coverage words. Otherwise, the lexical demands of most authentic texts would prevent readers from achieving the necessary levels of reading comprehension for contextual word learning to take place (Laufer 1997; Nation 1990, 2001). However, upon closer examination, the validity of these two seminal research paradigms, as well as other corpus-based vocabulary applications, appears to hinge on the various ways that researchers have operationalized the construct of Word for counting and analysis purposes. Some variables that raise validity issues include the multifaceted nature of English morphological word families, the impact of multiple word meanings, and the presence of multiword items in the English lexicon. Furthermore, when corpus-based vocabulary findings are used to inform or support actual language acquisition, there is the additional concern of whether researcher-based conceptualizations of Word (i.e. the criteria used to group words, count words, etc.) actually match the psychological realities of Word (i.e. actual knowledge of or about words in the minds of target language users). All of these Word validity issues have particular significance for applied vocabulary research that relies on, or is based on, lexical data from computer-processed corpora. Examples of research that fits this description include those that (a) estimate learners’ vocabulary sizes (e.g. Goulden et al. 1990; Nagy and Anderson 1984), (b) assess learners vocabulary competencies (e.g. Beglar and Hunt 1999; Cameron 2002; Laufer and Goldstein 2004; Laufer and Nation 1995, 1999; Nation 1983; Schmitt et al. 2001), (c) determine how many words learners need to know to adequately function in a language (e.g. Nation 1990, 2000; Laufer 1989, 1997; Hazenberg and Hulstijn 1996—Dutch language; Hsueh-chao and Nation 2000), (d) assess the lexical characteristics, distributions, or densities of various written or spoken texts (e.g. Gardner 2004; Hirsh and Nation 1992; Kyongho and Nation 1989; Nation 1990, 2001; Nation and Kyongho 1995; Sutarsyah et al. 1994), (e) produce lists of words for instructional or informative purposes (e.g. Carroll et al. 1971; Coxhead 2000; Francis and Kučera 1982; Xue and Nation 1984), and (f) hypothesize about the processes involved in actual vocabulary acquisition (e.g. Nagy and Anderson 1984). There are also those examples where the findings of corpus-based vocabulary research (e.g. Nagy and Anderson 1984) have been used to support theories in other educational venues (e.g. Krashen 1989, 1993). The purpose of this article is to raise awareness of this Word dilemma in applied corpus-based vocabulary research through a survey of several fields of study, and to make recommendations for improving the validity of such research in informing English language education. My intent is not to criticize, but to improve the use of corpora in educational research and practice. In terms of an audience for this discussion, I wish to make a clear DEE GARDNER 243 distinction between the relatively large multidisciplinary group of applied vocabulary researchers whom I address in this article (i.e. those who utilize corpora to inform and support first and second language acquisition) and the field of Corpus Linguistics proper (i.e. those who utilize corpora to study and describe language). While evidence from the latter field will be used to support certain claims of this study, I do not provide an extensive review of the construct of Word in Corpus Linguistics. Having said this, however, I also wish to recognize the increasingly porous boundary between traditional Corpus Linguistics and Language Education, as evidenced by the numerous research articles and books dealing with various uses of corpora in language teaching (e.g. Biber et al. 2004; Granger et al. 2002; Hunston 2002; Hunston and Francis 1998; Nesselhauf 2003; Partington 1998; Schmitt 2004; Sinclair 2004a). In this regard, certain aspects of the current discussion may also be applicable, such as matching corpus-based vocabulary applications (concordancing, etc.) to learners’ skill levels and language backgrounds, the potential differences in word-meaning redundancy between large mega-corpora and smaller specialized corpora, and so forth. Finally, additional audiences for this discussion are those interested language educators who wish to be better informed about the possibilities and limitations of corpus-based vocabulary research and applications for their particular instructional circumstances (cf. Seidlhofer 2002). Along with language learners themselves, these are the individuals who are most likely to be affected by the potential lexical oversights discussed in this article. RESEARCH BASIS FOR CRITIQUING THE CONSTRUCT OF WORD Convergent vocabulary and morphology studies from various fields of inquiry provide a foundation for assessing the construct of Word in corpusbased vocabulary research that attempts to inform English language education. Three key factors emerging from these studies appear to be superficially treated in the bulk of applied corpus-based vocabulary research: (a) the degree to which learners of various language backgrounds and skills levels can make connections between morphologically-related words; (b) the impact of homonymy and polysemy; and (c) the impact of multiword items in the lexicon. I would argue that the lexical distortions resulting from oversights in any of these areas, or their possible combinations, can have long-term ramifications for theories and practices in language education. Morphological relationships between words The concept of lemma In addressing morphological relationships between words, many corpus linguists rely heavily on the concept of lemma, which is defined by Francis and Kučera (1982: 1) as ‘a set of lexical forms having the same stem and 244 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH belonging to the same major word class, differing only in inflection and/or spelling’. Examples of lemma sets would be base forms of verbs together with their inflected forms (e.g. climb, climbs, climbing, climbed). All four verb forms would be considered under the lemma CLIMB, with capitalization indicating the actual lemma set (cf. Stubbs 2002). However, the noun derivative climber would be considered its own lemma, because it represents a different word class (i.e. noun instead of verb). The lexical set belonging to the noun lemma CLIMBER would also include the plural form climbers, and possibly the two possessive forms climber’s and climbers’. The Francis and Kučera definition of lemma also allows inclusion of irregular verb forms in a lexical set (e.g. went belongs to GO; flew and flown belong to FLY; broke and broken belong to BREAK; am, is, are, was, were, been, and being belong to BE). However, the case of the irregulars poses serious quandaries relating to the psychological validity of such family relationships—namely, that the opaque spelling and phonological connections between the lemma headword and the family members will surely cause more and different learning problems than their more transparent counterparts. Also implied in the traditional definitions of lemma is the idea that the lexical sets must belong to the same grammatical class (all are nouns, all are verbs, etc.—cf. Hunston 2002). Some corpus linguists also make the additional distinction—and I would argue that it is a crucial distinction from an applied perspective—that forms must share the same meanings in addition to belonging to the same grammatical class (e.g. Stubbs 2002). This definition puts lemma more on a par with lexeme, another term adopted by corpus linguists to describe relationships between word forms and their meanings: lexeme: a group of word forms that share the same basic meaning (apart from that associated with the inflections that distinguish them) and belong to the same word class (Biber et al. 1999: 54) While a discussion of definitions within Corpus Linguistics is not the purpose of this article, it is crucial to understand how different conceptualizations of what constitutes a Word for counting purposes can affect the results and conclusions of applied corpus-based vocabulary research. To begin, it would be virtually impossible to meet the criteria of same grammatical class and same meaning in grouping words unless the researcher had access to a grammatically and semantically tagged corpus or a sophisticated collocational analysis program—both of which are in their developmental infancy (cf. Landes et al. 1998; Ravin and Leacock 2000; Sinclair 2004b). Without such resources, a machine-based frequency count of a lemma like MAKE might erroneously count lexical verb forms (make, makes, making, made) with common noun forms (make and makes—as in makes and models of cars), gerund forms (making—as in the making of the film), and even the verbal components of multi-meaning phrasal verbs like make out (makes out, making out, made out), make up (makes up, making up, make up), DEE GARDNER 245 and make off (makes off, making off, made off ). The distortions in lexical frequency counts resulting from these types of overgeneralizations could certainly impact the validity of vocabulary research aimed at establishing instructional word lists, assessing vocabulary loads in written and oral texts, determining the vocabulary needed to comprehend texts, and so forth. The concept of word families The definition of what constitutes a word for counting purposes and potential learning purposes has also been the subject of extensive debate in L1 and L2 instructional contexts for quite some time (Anderson and Freebody 1981; Anderson and Nagy 1992; Bauer and Nation 1993; Hazenberg and Hulstijn 1996; Goulden et al. 1990; Nagy and Anderson 1984; Wesche and Paribakht 1996): What counts as a word will depend upon the researcher’s principal purposes. However, affixes and derivatives are important elements of word knowledge, and several questions related to their role are of considerable interest: In what way does knowledge of basic or root forms relate to knowledge of the compound forms? Are entries organized conceptually in the personal dictionary such that the probability of knowing a compound word is the same as that of knowing all its family members, basic form included? Or is the chance of knowing a compound some combination of the frequencies of the particular compounding elements? Much is to be gained from research into these issues (Anderson and Freebody 1981: 98). In this broad characterization of some of the morphological concerns facing vocabulary researchers, the experts tie morphology to both learners’ potential knowledge and to issues of frequency. While there may be strong reasons in general for analyzing frequency of Word Families (base forms plus inflected forms and transparent derivatives) instead of frequency of individual word forms, there is also evidence that word-level morphological relationships have varying degrees of transparency in terms of graphic similarity and meaning redundancy, and therefore the extent to which learners of different ages and language backgrounds are able to associate related words (Nagy et al. 1993). Given the prolific nature of inflectional and derivational affixation in English (i.e. affixed forms outnumber basic forms four to one—Cunningham 1998), it is little wonder that vocabulary researchers have had difficulty in defining the construct of Word. At one extreme, we might be tempted to count only unique spellings, as was the case in the American Heritage Intermediate (AHI) corpus (Carroll et al. 1971). This avoids the role of morphological affixation altogether, but it loses a great deal of psychological validity. For instance, it is highly unlikely that average readers in the third 246 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH through ninth grades (the focus of the AHI study) would see no connection between boy and boys, or between walk and walks. At the other extreme, however, there may be a tendency to say that whatever morphological taxonomies we come up with to group related words must have some correlation to the way learners actually associate words in their minds. This view is particularly problematic in light of the various orientations that individual learners have to morphological issues, and the vast number of morphological variables that must be accounted for. Bauer and Nation (1993: 257) offer perhaps the best attempt at addressing several of these morphological variables as they relate to actual learner knowledge. Their analysis of the Lancaster-Oslo-Bergen (LOB) corpus resulted in a seven-level categorization scheme which the researchers consider to be useful for ‘practical’ rather than ‘theoretical’ purposes. They are also careful to explain that the seven levels (summarized below) are not discrete or absolute, but represent possible ‘steps along a cline’ of word family recognition (1993: 257), and that certain affixes can appear in more than one level depending on restrictive usages: Level 1: Each form is a different word. The assumption here is that learners have basically no concept of morphological relationships between words. The researchers see this as a very ‘pessimistic’ view, but a potentially useful assumption when multiple meanings for the same word form are considered—a point that will be addressed later in this paper. Level 2: Inflectional suffixes. At this level, it is assumed that learners have to perform ‘minimal morphographemic analysis’ in recognizing most inflections. Words with these suffixes are roughly equivalent to those that would fall under the classification of lemma in corpus linguistics. Level 3: The most frequent and regular derivational affixes. This level includes affixes such as -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, and un-. Level 4: Frequent, orthographically regular affixes. Here the researchers prioritized in terms of frequency over productivity, and orthography over phonology, given the researchers’ emphasis on written materials over spoken language. At this level, the researchers found restricted (in this case, transparent) uses of -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, -ous, and in-. Level 5: Regular but infrequent affixes. This level adds a number of affixes that are fairly regular, but they do not individually add greatly to the number of words that can be understood by learners (e.g. -age, -al, -an, -hood, -let, anti-, arch-, and bi-). Level 6: Frequent but irregular affixes. While affixes at this level may be fairly frequent in the language, they nonetheless create ‘major problems of segmentation, either because they cause gross (orthographic) allomorphy in their bases (that is, parts of the base are deleted or additions besides the suffix are needed), or because there are major problems involved in segmenting them caused by homography DEE GARDNER 247 [same form with different meaning].’ Affixes at this level are -able, -ee, -ic, -ify, -ion, -ist, -ition, -ive, -th, -y, pre-, and re-. Level 7: Classical roots and affixes. At this level belong all the classical roots which abound in English words and which occur not only as bound roots in English (as in embolism), but also as elements in neo-classical compounds (such as photography). According to Bauer and Nation, both native speakers and L2 speakers must be taught such forms ‘explicitly.’ However, they also point out that many prefixes at this level are frequent in English, for example ab-, ad-, com-, de- dis-, ex-, and sub-(Bauer and Nation 1993: 258–62). One apparent advantage of this seven-level categorization scheme is that Word or Word Family can be operationalized at various defensible levels for analysis and comparative analysis purposes—at least in terms of learners’ abilities to associate morphologically related words. According to Bauer and Nation, such a framework could foster consistency in research involving estimations of vocabulary size, age-related morphological development, lexical storage, and dictionary construction. However, the framework has some potential weaknesses that deserve further discussion in the context of applied corpus-based vocabulary research. First, the fact that many affixed forms are repeated at different levels (e.g. -able, -y, -ist) suggests that the use of the framework beyond certain major distinctions (e.g. No Family Relationships vs. All Family Relationships, or All Inflection vs. All Derivation) could become extremely problematic, especially if one considers the fact that before level determinations could be made, case by case assessments of affixed word forms would be necessary to determine if a prolific derivational affix was acting transparently or not. While this might not present a dilemma with smaller corpora containing few different words, it could certainly pose a major hurdle for larger text samples with many different words. Second the researchers tend to lump derivational affixes together without consideration of the fact that derivational prefixes and derivational suffixes may present different learning dilemmas for developing readers (Nagy et al. 1993). For example, derivational prefixes tend to be paraphrasable (e.g. non- and un- in Bauer and Nation’s Level 3 mean not), and they do not change the word class of the bases they are added to (e.g. entity and nonentity are both nouns; do and undo are both verbs). In contrast, derivational suffixes change word class (e.g. do the verb becomes doable the adjective when the suffix -able—also listed in Level 3—is added). Also, the meanings contributed by many derivational suffixes such as ment, ness, and ish are not only difficult to define, but they may also be difficult for learners to grasp (cf. Nagy et al. 1993). A third concern is that the framework does not adequately address the role of stems in derived forms. For example, studies have shown that 248 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH the definitions of derivatives given by L1 students (approximately nine to thirteen years old) often center on the meaning of the stems with little or no attention paid to the meaning of the suffixes (Freyd and Baron 1982; Wysocki and Jenkins 1987). In one particular L1 study, Tyler and Nagy (1989) did detect some low-level learner knowledge of the contribution of suffixes, which the researchers attributed to the more precise measurement tools used in the study. However, the researchers also found that the students learned the contribution of suffixes in derived forms only after they learned to recognize the stems in those forms. Hancin-Bhatt and Nagy (1994) also found some low-level student knowledge of suffixes in their study of 196 Spanish-English bilingual students at three age levels (roughly nine, eleven, and thirteen). However, the researchers also concluded that cognate stems were realized more often in suffixed words than noncognate stems, suggesting that the nature of the stems, rather than the suffixes, was the major factor in determining whether suffixation was being utilized by the students or not. Taken together, these findings suggest that a psychologically valid definition of Word Family (i.e. from a learner perspective) should address graphic and semantic transparency issues involving morphemic stems before addressing the relative contributions of affixes—a topic that I will address in greater detail in later sections of this article. A final concern is that the Bauer and Nation framework, like many others to date, seems to assume that learners’ exposure to and acquisition of morphologically-related words is somehow linear in nature—in other words, that language learners acquire base forms before their inflected and derived family members. However, Biemiller and Slonim (2001) point out that young children may actually acquire many derived forms before they acquire their root-form counterparts. Of particular concern here is the question of whether learners of various skill levels and language backgrounds are as capable of recognizing and utilizing the common morphemic stem of a Word Family when their initial and extensive exposure to that stem may be through inflected and derived forms, rather than base forms. Extant research in this regard suggests that such a non-linear process may be more difficult for learners. For instance, Stahl and Shiel (1992: 233) conclude that, ‘many children, especially poorer readers, have difficulty isolating the root word’. Furthermore, while instruction may tend to favor the presentation of root forms before their affixed relatives (cf. Jiang 2000), there is no way of controlling for such exposure in ‘authentic’ texts and during ‘natural’ reading and conversational experiences. This would only seem possible through materials and communicative contexts that have been linguistically engineered to control for vocabulary presentation. DEE GARDNER 249 Developmental and language background issues The issues in the forgoing discussion make it clear that decisions by applied corpus researchers to group and count words based on morphological relationships must be tempered by learner variables if results are to be truly useful in informing language pedagogy and theory-building. Too often such decisions are based on researcher preference, computer limitations, or convenience, and the resultant findings and claims tend to be far too general or abstracted for real-world application. The following is a brief synthesis of research findings that have considered developmental and language background issues relating to the acquisition of English morphological knowledge. 1 School-aged learners acquire English inflectional morphology before derivational morphology. Berko (1958) and Derwing and Baker (1979) established that L1 children generally learn inflection and compounding before derivation. In fact, knowledge of inflectional morphology is assumed to be well underway by the first grade (Berko 1958). On the other hand, knowledge of derivational morphology begins sometime before the fourth grade (nine years old—Carlisle 2000; Tyler and Nagy 1989), continues through the upper-elementary and middle grades (ten to fourteen years old— Carlisle 2000; Mahony et al. 2000; Singson et al. 2000; Tyler and Nagy 1989), through high school (fifteen to eighteen years old), and possibly beyond (Nagy et al. 1993; Tyler and Nagy 1990). Hancin-Bhatt and Nagy (1994) add that bilingual children (roughly nine to thirteen years old) perform better with inflected forms (same grammatical category) than with derivational forms (different grammatical category). Additionally, these researchers suggest that an age-related developmental pattern exists in children’s ‘knowledge of the relationships between English and Spanish derivational suffixes’ (1994: 305). However, while increases in morphological knowledge were positively correlated with age, the overall knowledge of ‘English suffixes and their relationships to Spanish suffixes’, even among the oldest children, remained ‘relatively low’ (1994: 306). 2 L2 adult learners acquire English inflectional morphology before derivational morphology. Schmitt and Meara (1997) found that adult L2 learners of English (95 Japanese L1) show earlier and more profound awareness (receptive and productive) of English inflectional verb suffixes than derivational verb suffixes. The researchers attribute this finding to the rule-based nature of English inflectional morphology as opposed to the more idiosyncratic nature of English derivational morphology. Schmitt and Zimmerman (2002) found similar results in their study of 106 nonnative English speakers (university students), drawing the following pedagogical conclusion: ‘This study indicates that teachers cannot assume that learners will absorb the derivative forms of a word 250 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH family automatically from exposure. Rather, in this area, explicit attention to form may be of value’ (2000: 163). 3 Individual differences in language skills account for noticeable variation in morphological knowledge. The findings from the Freyd and Baron (1982) study also suggest that individual differences in language capacity, rather than age or grade level, are key predictors of derivational word knowledge. They found that precocious fifth-graders (ten years old) exhibited more awareness of derivational morphology than average eighth-graders (thirteen years old). Several research studies have also found strong correlations between morphological knowledge and superior reading skills (Carlisle 2000; Freyd and Baron 1982; Mahony et al. 2000; Nagy et al. 1993; Singson et al. 2000; Tyler and Nagy 1989, 1990). These correlations strongly suggest that a reciprocal relationship exists between the two cognitive abilities— that is, morphological knowledge is both a byproduct of skilled reading and a contributor to skilled reading. A recent study also suggests that variability in children’s early acquisition of English vocabulary is strongly correlated with their morphological awareness (McBride-Chang et al. 2005). Interestingly, morphological awareness appeared to be a good predictor of vocabulary knowledge even when ‘phonological processing, word reading skill, and age were statistically controlled’ (2005: 428). 4 Explicit training can improve a learner’s ability to utilize morphological knowledge. A growing body of research indicates that learners’ morphological development can be productively enhanced through direct instruction of the relationships between root forms and their affixed family members. This morphological awareness raising through word analysis techniques appears to be effective for all groups of learners, including L1 Children (Cunningham 1998; Stahl and Shiel 1992), L2 children (Carlo et al. 2004), and L2 adults (Mirhassani and Toosi 2000). Homonymy and polysemy Grabe (1991) notes that assessments of vocabulary are often hindered by the fact ‘that each word form is counted as a single word, though in reality, each word form may represent a number of distinct meanings, some of which depend strongly on the reading context, and some of which are quite different from each other in meaning’ (1991: 392). Meaning variation among words with the same or similar forms has been debated among semanticists for many years (Anderson and Nagy 1991). At a very fundamental level, scholars agree that a single word form can have multiple meanings—for example: ‘Row (in a row of chairs) and row (in row a boat)’ (Goulden et al. 1990: 344). There is also acknowledgement that DEE GARDNER 251 a word’s various meanings can be clearly distinct (homonymy), as in the row example above, or more closely related (polysemy): Polysemes are the many variants of meaning of a word where it is clear that the meanings are truly related. The verb break has many different variants which are related in meaning [He broke his leg; The cup broke; She broke his heart; She broke the world record, etc.]. The verb put also has an array of polysemes. Homonyms (sometimes called homographs), on the other hand, are variants that are spelled alike but which have no obvious commonality in meaning. The classic example is the word bank, which could mean a financial institution or the bank of a river or a tier or row of objects (such as a bank of seats at a baseball game).. . . One question regarding core research is whether or not polysemes such as those shown in the break example really are the same word (whether there is some invariant Gesamtbedeutung that links them) or whether some should be listed as separate lexical items. (Hatch and Brown 1995: 49) This potential for meaning variation (both homonymy and polysemy) becomes even more convoluted when the morphological word family is considered. For instance, forms that appear to be related through affixation may actually be homographs in context (e.g. bear, the animal, and bears/ bearing, the verb meaning to carry), or repetitions of the same affixed forms may actually be homographs in context (e.g. bears, the plural animal, and bears, the verb meaning to carry). The existence of potential polysemes seems to complicate matters even more (e.g. bear/bears/bearing, the verb meaning ‘to move while holding up and supporting,’ and bear/bears/bearing, the verb meaning ‘to hold in the mind’—Webster’s Ninth New Collegiate Dictionary 1988). To make matters even worse, the past form of the verb bear (either definition above) is bore, which bears (additional meaning) no orthographic resemblance to the other forms in the same semantic family. Additionally, the form bore itself has much more common meanings that are totally unrelated to bear in the senses described above (to bore a hole, the man is a bore, I don’t want to bore you with this, etc.). A quick search in WordNet (2003) reveals two senses for the noun bear, 13 senses for the verb bear, two additional senses for the verb bore, and four senses for the noun bore—a total of 21 different meaning senses for the two forms bear and bore. This does not even include the two adjective senses of born which could also be included in the bear/bore family. Conceivably, a machine-based frequency count of word-family forms could link all of these forms of bear and bore together (assuming that the researcher had determined to count bore under bear), but the question remains whether or not a child or other language learner encountering these various forms would make any semantic connections between them based on context. Would they perceive some invariant Gesamtbedeutung that links them all? One might argue here that I have chosen a rare example for effect, 252 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH but it is easy to find similar examples among the highest frequency forms of English (e.g. bank—18 senses; pick—21 senses; put—10 senses; fall—44 senses; read—12 senses; call—41 senses). In fact, Ravin and Leacock (2000: 1) remind us that ‘the most commonly used words tend to be the most polysemous’ and Biemiller and Slonim (2001) add that general print frequency of word forms is a poor predictor of learners’ root word knowledge because high frequency words tend to have variant meanings—a point that has important ramifications for corpus-based vocabulary findings that rely heavily on frequency of word forms, without consideration of their variant meanings, especially when it is assumed that such findings can be easily transferred to pedagogical applications, or that they accurately represent the psychological realities in the minds of learners. Some corpus linguists would also point out that the meaning of a word is dependent on the other words associated with it in a particular text (cotext), and that words are only ambiguous when isolated from their cotext (Sinclair 2004b). From this perspective, the meaning of bore (see example above) is ambiguous as an isolated word, but not so in the case of to bore a hole (meaning to drill), and the man is a bore (meaning uninteresting or tiresome). This line of reasoning brings into serious question the validity of computerized counts of individual word forms or computer-generated lists of individual word forms for investigative or instructional purposes, especially if those words are of higher general frequency in the language. I would further argue that the notion of lemma and counting lemmas in much of Corpus Linguistics suffers from the same problem—namely, that machine-based identifications and calculations of true lemmas (same grammatical class and same meaning) are virtually impossible without a grammatically and semantically tagged corpus or a sophisticated collocational analysis program, both of which, as stated earlier, are in their developmental infancy (Landes et al. 1998; Ravin and Leacock 2000; Sinclair 2004b). While manual determinations of true lemmas might be possible on a contextby-context basis (perhaps using concordancing), there are at least two issues that do not favor such an approach: (a) the overwhelming size of the task, especially with large mega-corpora; and (b) the difficulty of accurately assigning words to their particular lemmas. the metaphorical use of lion (e.g., John is a lion) is likely to be treated as ‘the same word,’ while the concrete and metaphorical uses of crane (‘kind of bird’ and machine for lifting heavy objects’) are more likely to be treated as independent words and therefore members of different lemmas. If it is difficult to group word meanings under headwords at the abstract level of the dictionary, it is much more difficult to assign words in texts unambiguously to their lemmas. (Knowles and Mohd Don 2004: 70) These researchers also argue that ‘generalizations about whole lemma become less and less convincing’ as detailed linguistic examinations of corpus-based DEE GARDNER 253 data continue to be performed, and that researchers may need to begin ‘to consider individual words’ or ‘actually even individual word meanings’ as the basis for their analyses (Knowles and Mohd Don 2004: 71). Biemiller and Slonim (2001: 510) add that ‘frequencies of word meanings rather than word forms might lead to better predictions’ of learners’ root word knowledge, although such frequencies ‘would be very hard to produce’. In light of these semantic issues, I would argue that suggested applications of corpus research based on frequency of word forms, without considerations of word meanings, will invariably suffer from one of three problems—or combinations of the three: (a) they will overestimate the true coverage of the word forms; (b) they will underestimate the actual user knowledge required to negotiate the word forms; and/or (c) they will underestimate the actual number of meanings inherent in the word forms. I would further argue that any of these conditions can have serious ramifications for the pedagogical applications and theories that rely on such research. Indeed, the apparent tensions building in Applied Linguistics between more traditional linguists and corpus linguists seem to hinge on the limitations (e.g. Widdowson 2000) and possibilities (e.g. Stubbs 2001) of corpus-based research to account for the semantic and pragmatic realities of the language. So what can be done to improve this situation or to make corpus-based vocabulary claims more valid in terms of educational applications? One useful consideration is that homonymy and weak polysemy among the high frequency words of English would logically be the greatest in a corpus consisting of many texts with varied themes and topics. Conversely, there is less likelihood of semantic disparity within an individual text or a collection of topically-related texts, as words and their meanings tend to be closely tied to the themes or topics of those texts (Nagy et al. 1987; Sutarsyah et al. 1994). However, the trend in modern corpus construction appears to be toward bigger and broader. If one of the goals of such research is to provide information for language pedagogy in terms of word coverage, instructional word lists, and so forth, then much more attention must be paid to semantic analyses of word forms (cf. Read 2000). Another consideration is that technical words of the language tend to be more complex in terms of form (i.e. more letters and more complex affixations than basic words), but less ambiguous in terms of meaning. This may be due in part to the fact that they are found in a narrower range of texts (Bečka 1972; Chung and Nation 2004, Read 2000) and represent more complex and specialized concepts than basic words, although some basic word forms also have specialized meanings in some technical contexts (e.g. supply and demand in an Economics text—Sutarsyah et al. 1994)— further evidence of the semantic instability of the high frequency (basic) word forms of the language. Interestingly, in a study of the Academic Word List (AWL)—a list of subtechnical words assumed to have high frequency and range in academic reading materials—the researchers found that only 10 percent 254 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH (60 out of 570 words) were homographs (Ming-Tzu and Nation 2004). These AWL words are assumed to fall between the general high frequency and technical words in the vocabulary distribution of academic texts, possibly suggesting that meaning variation may be on a cline from high frequency words (high variation) to subtechnical words (moderate variation) to technical words (low variation). Technical words also appear to maintain more stable meanings over time. For instance, it seems unlikely that there would be many instances when the technical word mitosis (generally a low frequency word) has, or would begin to have, widely variant meanings (homonymy), or when the technical family archaeology, archaeologist, archaeologists, archaeological has, or would begin to have, either unrelated meanings in context (homonymy) or even distantly related meanings (weak polysemy). Therefore, it would appear that computerized counts of lower-frequency, technical word families have a higher degree of semantic validity than counts of higher-frequency, basic word families of the language. However, this does not mean that technical word families are more easily acquired than basic word families, only that there is a higher degree of semantic consistency among the forms in technical families. In fact, the semantic load carried by most technical words found in expository reading materials is so high that contextual word learning is often extremely difficult, as evidenced in the following conclusion by Anderson (1996): We found small but highly reliable increments in word knowledge attributable to reading at all grades and ability levels. The overall likelihood ranged from better than 1 in 10 when children were reading easy narratives to near zero when they were reading difficult expositions [emphases added]. (Anderson 1996: 61) This suggests that the nature of the text in which an unknown word is embedded (i.e. narrative vs. expository) can have a profound effect on its learnability. Gardner (2004) adds that many specialized words tend to be specific to a major register (narrative fiction vs. expository nonfiction)—that is, they tend to appear exclusively in narrative texts or exclusively in gradeequivalent expository texts, with no overlap. He provides further evidence that when a specialized word (e.g. mummy) is shared between major registers, it is often found in vastly different contexts in those registers, presenting different semantic connotations for the word. Again, this evidence has important ramifications for the bulk of applied corpus-based vocabulary studies that consider frequency and/or a range of word forms without appropriate consideration of their variant meanings. The impact of multiword items in the English lexicon One of the key discoveries of Corpus Linguistics is the major presence of lexical collocation (Sinclair 1987, 1991). In its broadest sense, DEE GARDNER 255 collocation includes pairs or groups of words that have strong co-occurring patterns within a relatively limited amount of discourse (e.g. sentence or paragraph). Of particular interest to the current topic are those cases of collocation that are fairly fixed in the language, often referred to as formulaic language (Wray 2002), formulaic sequences (Schmitt 2004), or multi-word items (Moon 1997)—the term they are used in this article: ‘A multi-word item is a vocabulary item which consists of a sequence of two or more words (a word being simply an orthographic unit). This sequence of words semantically and/or syntactically forms a meaningful and inseparable unit’ (Moon 1997: 43). Under this definition, Moon includes compounds (e.g. Prime Minister, crystal ball, collective bargaining), phrasal verbs (e.g. give up, break off, write down), idioms (e.g. rock the boat, kick the bucket, spill the beans, have an ax to grind), fixed phrases (e.g. of course, at least, in fact, good morning, how do you do, excuse me), and prefabs (e.g. the thing is, that reminds me, I’m a great believer in). Of particular importance to the current discussion is the fact that multiword items are composed of more than one word form, and that, as a collective whole, they cover a substantial portion of spoken and written English. In fact, Erman and Warren (2000: 29) estimate that prefabs, their choice for a cover term, account for 58.6 percent of spoken English and 52.3 percent of written English. Such evidence, they suggest, ‘makes it impossible to consider idioms and other multi-word combinations as marginal phenomena’. There is also convergent evidence that the sheer number of different multiword items may exceed the number of individual words in the lexicon (Jackendoff 1995; Mel’čuk 1995; Pawley and Syder 1983). Furthermore, many multiword items have multiple meanings themselves. For example, Gardner and Davies (in press) found that the 100 most frequent phrasal verbs in the BNC (e.g. break up, set up, put out) have 559 potential meaning senses, or an average of 5.6 per phrasal verb. Figures like these raise serious questions regarding the linguistic and psychological validity of corpus-based studies that rely primarily on frequency counts of single word forms. Sinclair and his associates take an even more radical approach toward multiword items, suggesting that most lexical meaning is associated with word patterns rather than with individual words—a concept referred to as the ‘maximal approach’ (Sinclair 2004c: 280). [The maximal approach] would be to extend the dimensions of a unit of meaning until all the relevant patterning was included—all the patterning that was instigated by the presence of the central word. At least, to stay on the practical side, we should extend the unit until the ambiguity disappears (as it does in almost every case). (Sinclair 2004c: 280) For Sinclair, a lexical item ‘consists of one or more words that together make up a unit of meaning’ (2004c: 281). If accurate, this particular 256 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH conceptualization of Word, based on years of corpus investigation, is perhaps the strongest indictment of traditional applied corpus-based vocabulary research that has relied heavily on frequency counts of individual word forms, or has treated grammar and lexis as being independent aspects of language patterning. Similarly, Biber et al. (1999) provide numerous examples of the intricate, often inseparable, relationships between lexical and grammatical information in a corpus, and how variations in these relationships tend to define and distinguish different macro-registers of the language (i.e. conversation, fiction, academic prose, and newspaper reportage). Additional evidence of the combinatory nature of lexis and grammatical or syntactical structure comes from applied linguistics under the cover term Pattern Grammar (Francis et al. 1996; Hunston 2002; Hunston and Francis 1998, 1999) and Cognitive Linguistics under the cover term Construction Grammar (Croft 2001; Fillmore et al. 1988; Fried and Östman 2004; Goldberg 1995). The former suggests that the English language is replete with patterns of codependency between lexical and grammatical items (e.g. account of þ noun, change in þ noun), and that these patterns often overlap themselves (accounts of recent changes in English language teaching—emphasis added, Hunston 2002 146–7). The latter emphasizes the notion of constructions. While a detailed review of constructions is beyond the scope of this article, three of the major tenets of Construction Grammar are certainly worth noting in the context of the current discussion on multiword items: 1 ‘Constructions are taken to be the basic units of the language’ (Goldberg 1995: 4). 2 ‘Phrasal patterns are considered constructions if something about their form or meaning is not strictly predictable from the properties of their component parts or from other constructions’ (Goldberg 1995: 4). 3 ‘In Construction Grammar, no strict division is assumed between the lexicon and syntax’ (Goldberg 1995: 7). Interestingly, the Construction Grammarians and Pattern Grammarians appear to have reached similar conclusions despite starting at opposite ends of the empirical spectrum: the arguments of the former are built on observations about rare word patterns, while those of the latter are based on observations about the most frequent. Shifting to a pedagogical focus, there is growing evidence that native speakers of English tend to store and utilize multiword items more productively than nonnative speakers (Schmitt et al. 2004; Wray 2000). This L2 acquisition problem is further complicated by the fact that many classes of multiword items, such as phrasal verbs, are very common and highly productive in the English language as a whole (Celce-Murcia and Larsen Freeman 1999; Darwin and Gray 1999; Gardner and Davies in press; Moon 1997). DEE GARDNER 257 Some nonnative English speakers actually avoid using phrasal verbs and other multiword items altogether, especially those learners at the beginning and intermediate levels of proficiency. (See Liao and Fukuya (2004) for a review of this topic.) Interestingly, even learners whose native language actually contains phrasal verbs (e.g. Dutch) often avoid using such forms when communicating in English (Hulstijn and Marchena 1989). Nesselhauf (2003: 237) also found that adult German learners of English have ‘considerable difficulties’ producing the correct verb in various verb–noun collocations (e.g. make one’s homework, instead of do one’s homework; take one’s task; instead of carry out/perform one’s task). Her findings indicate that L1 interference was responsible for approximately half of the errors. The fact that these were advanced learners of English (i.e. university students), who were likely aiming for a high degree of English competence, emphasizes the scope of the multiword problem in English language education. The growing research interest in multiword and other collocational phenomena, together with the actual experiences of language educators, has also resulted in many instructional methodologies that scrap the traditional grammar-versus-lexis dichotomy in favor of collocational training (e.g. Lewis 1993, 1997, 2000; Nattinger and DeCarrico 1992; Sinclair 2004a; Willis 1990). However, some of the same validity issues regarding the construct of Word may apply to these modern corpus-based pedagogical applications, even if they do highlight multiword phenomena. First, the term collocation in the Mel’čuk/Cowie sense emphasizes combinations or strings of words (Cowie 1994: Mel’čuk 1998), while collocation in the broader Sinclair sense emphasizes co-occurrence of words within a certain distance of each other, whether or not they constitute set phrases (e.g. Sinclair 1991, 2004a, 2004b). These are quite different constructs, representing different conceptualizations of the relationships between words in a text. Potentially, they also represent different conceptualizations of how vocabulary is acquired and utilized by language learners, and they are likely to lead to quite different instructional presentations and emphases. (For more detailed discussions on both sides of this collocation issue see Howarth 1996, 1998, and Gledhill 2000.) Second, the acquisition of collocations appears to be affected by age issues, L1 backgrounds, L1–L2 transfer issues, literacy skill levels, componential aspects of collocations (e.g. verbs vs. nouns), and so forth (Nesselhauf 2003; Wray 2000, 2002). Indeed, the popular claims regarding concordancing as a powerful tool for the language classroom may need to be tempered by these and other variables. For instance, it is likely that only the most advanced language learners can take advantage of the intricate semantic relationships between words that are revealed through concordancing. Certainly, such an approach to language training presupposes that learners will know most of the words (cotexts) that surround a key word or phrase in context (KWIC), and that they can connect their meanings—an assumption 258 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH that seems unreasonable for many groups of language learners (children, beginning L2 learners, learners with low literacy skills, etc.). Finally, the size and specialization of corpora will have a definite bearing on the psychological validity (from a learner perspective) of a particular Word construct, even if that construct incorporates multiword and other collocational phenomena. For example, a concordancing display of a KWIC from a large mega-corpus might include many more collocational possibilities and patterns than a similar display from a smaller specialized corpus, or even a subregister within the mega-corpus itself. Is this multiplicity of possibilities good for language learning? Is it bad? For what learners is it good or bad? Which of the patterns should be emphasized? As another example, if our construct for Word assumes a ‘maximal’ posture (i.e. extending the dimensions of a lexical unit until ambiguity disappears), who will decide the appropriate size of the lexical unit or the cotext to be displayed on a concordance window? The linguist? The computer? The learner? If the linguist or the computer, how do we know that the learner will understand the same relationships? How much intervention will be necessary for this knowledge to be available to various learners? These and similar questions will require much more attention as we continue to build bridges between corpus-based vocabulary research and its pedagogical applications, and answers to such questions will undoubtedly lead to expansions or modifications of the views presented in this article. SUMMARY AND RECOMMENDATIONS The wide-ranging evidence surveyed in this article suggests that a summary of key research findings is in order. This summary will also include recommendations for improving the validity of the Word construct in applied corpus-based vocabulary research and its pedagogical extensions, with particular emphasis on issues relevant to the English language and English language users. 1 Age, English language skills, English literacy skills, and extent of morphological training have been shown to affect the ability of English language users to recognize and utilize morphological relationships between words. Researchers’ determinations of what constitutes a Word for counting and analysis purposes using electronic corpora should take these variables into consideration, with the following specific recommendations for various groups of language users: a. Base Forms þ Regular Inflections: younger children; low general English proficiency; low English literacy skills; no specific morphological training. b. Base Forms þ Regular Inflections þ Irregular Inflections þ Derivational Prefixes: older children and adolescents, intermediate general English DEE GARDNER c. 259 proficiency, intermediate English literacy skills; some morphological training. Base Forms þ Regular Inflections þ Irregular Inflections þ Derivational Prefixes þ Derivational Suffixes (regular) þ Derivational Suffixes (irregular): adults, high general English proficiency; high English literacy skills; extensive morphological training or experience. It is possible that English language users could have different combinations of these characteristics (e.g. adults, high general English proficiency, low English literacy skills, no morphological training). From a morphological perspective, this may necessitate additional conceptualizations of what constitutes a Word for investigative, instructional, or assessment purposes. In short, if the findings of a corpus study are being applied to certain groups of language users (e.g. how much vocabulary they already know, how much vocabulary they need to know, how much vocabulary they could learn), then the Word construct (i.e. how it is operationalized) should match the assumed morphological knowledge of those groups as closely as possible. This also implies that researchers and practitioners should use caution in extending the findings of corpus-based vocabulary studies to novel groups of language users who may have morphological backgrounds and skill-sets that are incongruent with the definitions of Word used in the original studies. 2 Many English word forms have multiple, context-dependent meanings that are often obscured or completely lost in the processing of electronic corpora, especially when the word forms become isolated from their contexts, or are not considered in their contexts. In light of this fact, adjustments and disclaimers to the construct of Word should be made, paying particular attention to the following areas: a. the general characteristics of the corpus used in the analysis (i.e. more potential for word meaning variation in larger corpora, between multiple registers, and between varied topics). b. the degree of word tagging utilized (i.e. untagged corpora have the highest potential for form–meaning variation, followed by grammatically tagged corpora, and then semantically tagged corpora). c. the general characteristics of the words themselves (i.e. basic high frequency words have the highest potential for form–meaning variation, followed by subtechnical words, and then technical words). Researchers and practitioners should carefully consider these and other form–meaning issues when evaluating corpus-generated findings that produce instructional word lists, estimate learners’ vocabulary sizes, determine how many words learners need to know to adequately function in the language, access the lexical characteristics, distributions, or densities of various written or spoken texts, or hypothesize about the processes involved in actual vocabulary acquisition. In short, the level of 260 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH Word-meaning ambiguity dealt with in a particular corpus analysis or application will directly affect its internal validity (how much we can trust the conclusions or applications to be accurate), and/or its external validity (how applicable the findings or applications are for various groups of language users—for example L1 vs. L2, adults vs. children). To date, the bulk of applied corpus studies appear to have treated this form–meaning issue either superficially, or not at all. Undoubtedly, it represents one of the greatest challenges for machine-processed language, and calls for more and better ways to semantically tag electronic corpora, or for more robust collocation-based computer programs that will allow forms with similar meanings and forms with different meanings to be identified and processed accurately. 3 English has many multiword items such as phrasal verbs (call on, chew him out), idioms (rock the boat), fixed phrases (excuse me) and prefabs (the point is). The degree to which these are identified and accounted for will again affect the internal and external validity of a given corpus-based vocabulary study. The following levels could serve as a potential guide in determining the extent of multiword phenomena accounted for in a particular study. The order also provides a rough estimate of the difficulty associated with computerized identification of multiword items, ranging from ‘a’ (least difficulty) to ‘e’ (highest difficulty): a. Zero Level: closed compound nouns (mailman). b. Restricted Level: closed compound nouns (mailman), and hyphenated words (acid-free). c. Moderate Level: closed compound nouns (mailman), hyphenated words (acid-free), open compound nouns (post office), and nonseparable phrasal verbs (call on). d. Expanded Level: closed compound nouns (mailman), hyphenated words (acid-free), open compound nouns (post office), nonseparable phrasal verbs (call on), separable phrasal verbs (chew him out), idioms (rock the boat), fixed phrases (excuse me), and prefabs (the point is). e. Maximal Level: all co-occurring word patterns (contiguous or noncontiguous) that form units of meaning (cf. Sinclair 2004c). The prolific nature of multiword items in English—as well as the difficulty in electronically identifying and processing them—is further complicated by the fact that many English language users (e.g. L2 learners, L1 children) struggle to recognize, acquire, and utilize such items. Researchers’ adjustments and disclaimers to the construct of Word should take these issues into consideration, especially when corpus findings are being related to certain groups of language users, or when findings relative to one group are being extended to another. Finally, I wish to emphasize that the recommendations above are not rigid or independent taxonomies. In fact, the three major areas often interact DEE GARDNER 261 with each other. For instance, an individual word form (e.g. chip) and its apparent morphological relations (e.g. chips, chipping, chipped) can have multiple, context dependent meanings (e.g. potato chip, computer chip, chipping a golf ball, the chipped vase), and they can also combine with other word forms to create multiword items with their own multiplicity of meanings (e.g. He chipped in and got the job done; He chipped in the golf ball). Furthermore, English language users with advanced skills are much more likely to recognize and utilize the similarities and differences in these forms (and their meanings) than less skilled users. CONCLUSION My primary aim in this article has been to raise awareness of the Word dilemma that exists in applied corpus-based vocabulary research, especially with regard to computerized processing of electronic word forms, and the concomitant lexical oversights (theoretical, empirical, and pedagogical) that can result from such research. I have also offered some recommendations for addressing this issue in each of the major areas surveyed: morphological relationships between word forms, semantic variation within and between word forms, and the presence of multiword items in the lexicon. Clearly, much more needs to be done to refine these recommendations, and to identify and articulate the linguistic (language-based), psychological (learner-based), and pedagogical (instructional-based) variables surrounding the construct of Word in Applied Corpus Linguistics, particularly as we attempt to inform language education in the area of English vocabulary, which continues to be identified as a major correlate with literacy and other essential academic skills (Alderson 2000; Read 2000). I would also suggest that the major Word issues discussed in this article could serve as a guide for many corpus applications (including interfaces) that are designed to support language education. Because more and more language training curricula are inspired by computerized research and learning tools (Read 2004), there would seem to be a growing implicit responsibility among those who produce such research and tools to make them as valid and reliable as possible, or at least to inform practitioners of their potential limitations. With such a perspective in mind, we may be in a better position to build useful bridges between applied corpus-based vocabulary research and Language Education. Final version received May 2006 REFERENCES Alderson, J. C. (2000). Assessing Reading. Cambridge: Cambridge University Press. Anderson, R. C. (1996). ‘Research foundations to support wide reading’ in V. Greaney (ed.): Promoting Reading in Developing Countries. Newark, DE: International Reading Association, pp. 55–77. Anderson, R. C. and P. Freebody. (1981). ‘Vocabulary knowledge’ in J. T. Guthrie (ed.): Comprehension and Teaching: Research Reviews. 262 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH Newark, DE: International Reading Association, pp. 77–117. Anderson, R. C. and W. E. Nagy. (1991). ‘Word meanings’ in R. Barr, M. L. Kamil, P. B. Mosenthal, and P. D. Pearson, Handbook of Reading Research, Vol. 2. New York: Longman, pp. 690–724. Anderson, R. C. and W. E. Nagy. (1992). ‘The vocabulary conundrum,’ American Educator Winter: 14–19, 44–47. Bauer, L. and P. Nation. (1993). ‘Word families,’ International Journal of Lexicography 6/4: 253–79. Beglar, D. and A. Hunt. (1999). ‘Revising and validating the 2000 Word Level and University Word Level vocabulary tests,’ Language Testing 16/2: 131–62. Berko, J. (1958). ‘The child’s learning of English morphology,’ Word 14: 150–77. Bečka, J. V. (1972). ‘The lexical composition of specialised texts and its quantitative aspect,’ Prague Studies in Mathematical Linguistics 4: 47–64. Biber, D., S. Conrad, and V. Cortes. (2004). ‘If you look at . . ..: Lexical bundles in university teaching and textbooks,’ Applied Linguistics 25/3: 371–405. Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan. (1999). Longman Grammar of Spoken and Written English. London: Longman. Biemiller, A. and N. Slonim. (2001). ‘Estimating root word vocabulary growth in normative and advantaged populations: Evidence for a common sequence of vocabulary acquisition,’ Journal of Educational Psychology 93/3: 498–520. Cameron, L. (2002). ‘Measuring vocabulary size in English as an additional language,’ Language Teaching Research 6/2: 145–73. Carlisle, J. F. (2000). ‘Awareness of the structure and meaning of morphologically complex words: Impact on reading,’ Reading and Writing: An Interdisciplinary Journal 12: 169–90. Carlo, M. S., D. August, B. McLaughlin, C. E. Snow, C. Dressler, D. N. Lippman, T. J. Lively, and C. E. White. (2004). ‘Closing the gap: Addressing the vocabulary needs of English-language learners in bilingual and mainstream classroom,’ Reading Research Quarterly 39/2: 188–215. Carroll, J. B., P. Davies, and B. Richman. (1971). The American Heritage Word Frequency Book. New York: Houghton Mifflin. Celce-Murcia, M. and D. Larsen-Freeman. (1999). The Grammar Book: An ESL/EFL Teacher’s Course 2nd edn. Heinle and Heinle Publishers. Chung, T. M. and P. Nation. (2004). ‘Identifying technical vocabulary,’ System 32: 251–63. Cowie, A. P. (1994). ‘Phraseology’ in R. E. Asher and J. Simpson (eds): The Encyclopedia of Language and Linguistics. Oxford: Pergamon Press, pp. 3168–71. Coxhead, A. (2000). ‘A new academic word list,’ TESOL Quarterly 34/2: 213–38. Croft, W. (2001). Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford: Oxford University Press. Cunningham, P. (1998). ‘The multisyllabic word dilemma: Helping students build meaning, spell, and read ‘‘big’’ words,’ Reading and Writing Quarterly: Overcoming Learning Difficulties 14: 189–218. Darwin, C.M. and L. S. Gray. (1999). ‘Going after the phrasal verb: An alternative approach to classification,’ TESOL Quarterly 33/1: 65–83. Derwing, B. L. and W. J. Baker. (1979). ‘Recent research on the acquisition of English morphology’ in P. Fletcher and M. Garman (eds): Language Acquisition. Cambridge: Cambridge University Press, pp. 209–23. Elley, W. B. (1991). ‘Acquiring literacy in a second language: The effect of book-based programs,’ Language Learning 41/3: 375–411. Elley, W. B. (1996). ‘Using book floods to raise literacy levels in developing countries’ in V. Greaney (ed.), Promoting Reading in Developing Countries. Newark, DE: International Reading Association, pp. 148–62. Erman, B. and B. Warren. (2000). ‘The idiom principle and the open choice principle,’ Text 20/1: 29–62. Fillmore, C. J., P. Kay, and M. C. O’Connor. (1988). ‘Regularity and idiomaticity in grammatical constructions: The case of let alone,’ Language 64/3: 501–38. Francis, W. N. and H. Kučera. (1982). Frequency analysis of English Usage. Boston: Houghton Mifflin. Francis, G., S. Hunston, and R. Manning. (1996). Collins COBUILD Grammar Patterns: Verbs. London: Harper Collins. Freyd, P. and J. Baron. (1982). ‘Individual differences in acquisition of derivational morphology,’ Journal of Verbal Learning and Verbal Behavior 21: 282–95. Fried, M. and J. Östman. (eds) (2004) Construction Grammar in a Cross-language Perspective. Amsterdam: John Benjamins Publishing Company. DEE GARDNER Gardner, D. (2004). ‘Vocabulary input through extensive reading: A comparison of words found in Children’s narrative and expository reading materials,’ Applied Linguistics 25/1: 1–37. Gardner, D. and M. Davies. (in press). ‘Pointing out frequent phrasal verbs: A corpus-based analysis,’ TESOL Quarterly. Gledhill, C. J. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag. Goldberg, A. E. (1995). Constructions: A Construction Grammar Approach to Argument Structure. Chicago: The University of Chicago Press. Goulden, R., P. Nation, and J. Read. (1990). ‘How large can a receptive vocabulary be?’ Applied Linguistics 11/4: 341–63. Grabe, W. (1991). ‘Current developments in second language reading research,’ TESOL Quarterly 25/3: 375–405. Granger, S., J. Hung, and S. Petch-Tyson. (eds) (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins Publishing Company. Hancin-Bhatt, B. and W. Nagy. (1994). ‘Lexical transfer and second language morphological development,’ Applied Psycholinguistics 15: 289–310. Hatch, E. and C. Brown. (1995). Vocabulary, Semantics, and Language Education. New York: Cambridge. Hazenberg, S. and J. H. Hulstijn. (1996). ‘Defining a minimal receptive secondlanguage vocabulary for non-native university students: An empirical investigation,’ Applied Linguistics 17/2. Hirsch, D. and P. Nation. (1992). ‘What vocabulary size is needed to read unsimplified texts for pleasure,’ Reading in a Foreign Language 8/2: 689–96. Howarth, P. A. (1996). Phraseology in English Academic Writing. Tübingen: Max Niemeyer Verlag. Howarth, P. (1998). ‘Phraseology and second language proficiency,’ Applied Linguistics 19/1: 24–44. Hsueh-chao, M. and P. Nation. (2000). ‘Unknown vocabulary density and reading comprehension,’ Reading in a Foreign Language 13/1: 403–30. Hulstijn, J. and E. Marchena. (1989). ‘Avoidance: Grammatical or semantic causes?’ Studies in Second Language Acquisition 11: 241–55. 263 Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Hunston, S. and G. Francis. (1998). ‘Verbs observed: A corpus-driven pedagogic grammar,’ Applied Linguistics 19/1: 45–72. Hunston, S. and G. Francis. (1999). Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins Publishing Company. Jackendoff, R. (1995). ‘The boundaries of the lexicon,’ in M. Everaert, E. van der Linden, A. Schenk, and R. Schreuder (eds): Idioms: Structural and Psychological Perspectives. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 133–66. Jiang, N. (2000). ‘Lexical representation and development in a second language,’ Applied Linguistics 21/1: 47–77. Knowles, G. and Z. Mohd Don. (2004). ‘The notion of a ‘‘lemma’’: Headwords, roots and lexical sets,’ International Journal of Corpus Linguistics 9/1: 69–81. Krashen, S. (1989). ‘We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis,’ The Modern Language Journal 73/4: 440–64. Krashen, S. (1993). The Power of Reading: Insights from the Research. Englewood, CO: Libraries Unlimited. Kyongho, H. and P. Nation. (1989). ‘Reducing the vocabulary load and encouraging vocabulary learning through reading newspapers,’ Reading in a Foreign Language 6/1: 322–35. Landes S., C. Leacock, and R. I. Tengi. (1998) ‘Building semantic concordances’ in C. Fellbaum (ed.): WordNet: An Electronic Lexical Database Cambridge, MA: The MIT Press, pp. 199–216. Laufer, B. (1989). ‘What percentage of text-lexis is essential for comprehension’ in C. Lauren and M. Nordman (eds): Special Language: From Humans Thinking to Thinking Machines. Philadelphia: Multilingual Matters, pp. 316–33. Laufer, B. (1997). ‘The lexical plight in second language reading: Words you don’t know, words you think you know, and words you can’t guess’ in J. Coady and T. Huckin (eds): Second Language Vocabulary Acquisition. Cambridge: Cambridge University Press, pp. 20–34. Laufer, B. and Z. Goldstein. (2004). ‘Testing vocabulary knowledge: Size, strength, and computer adaptiveness,’ Language Learning 54/3: 399–436. 264 WORD IN APPLIED CORPUS-BASED VOCABULARY RESEARCH Laufer, B. and P. Nation. (1995). ‘Vocabulary size and use: Lexical richness in L2 written production,’ Applied Linguistics 16: 307–22. Laufer, B. and P. Nation. (1999). ‘A vocabularysize test of controlled productive ability,’ Language Testing 16/1: 33–41. Lewis, M. (1993). The Lexical Approach. Hove, England: LTP Publications. Lewis, M. (1997). Implementing the Lexical Approach: Putting Theory into Practice. Hove, England: LTP Publications. Lewis, M. (ed.) (2000). Teaching Collocation: Further developments in the Lexical Approach. Hove, England: LTP Publications. Liao, Y. and Y. J. Fukuya. (2004). ‘Avoidance of phrasal verbs: The case of Chinese learners of English,’ Language Learning 54/2: 193–226. McBride-Chang, C., R. K. Wagner, A Muse, B. W. Y. Chow, and H. Shu. (2005). ‘The role of morphological awareness in children’s vocabulary acquisition in English,’ Applied Psycholinguistics 26: 415–35. Mahony, D., M. Singson, and V. Mann. (2000). ‘Reading ability and sensitivity to morphological relations,’ Reading and Writing: An Interdisciplinary Journal 12: 191–218. Mel’čuk, I. (1995). ‘Phrasemes in language and phraseology in linguistics’ in M. Everaert, E. van der Linden, A. Schenk, and R. Schreuder (eds): Idioms: Structural and Psychological Perspectives. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 167–232. Mel’čuk, I. (1998). ‘Collocations and lexical functions’ in A. P. Cowie (ed.): Phraseology: Theory, Analysis, and Applications. Oxford: Clarendon Press, pp. 23–53. Ming-Tzu, K. W. and P. Nation. (2004). ‘Word meaning in Academic English: Homography in the Academic Word List,’ Applied Linguistics 25/3: 291–314. Mirhassani, A. and A. Toosi. (2000). ‘The impact of word-formation knowledge on reading comprehension,’ IRAL 38: 301–11. Moon, R. (1997). ‘Vocabulary connections: multi-word items in English’ in N. Schmitt and M. McCarthy (eds.): Vocabulary Description, Acquisition and Pedagogy. Cambridge: Cambridge University Press, pp. 40–63. Nagy, W. (1997). ‘On the role of context in firstand second-language vocabulary learning’ in N. Schmitt and M. McCarthy (eds): Vocabulary: Description, Acquisition and Pedagogy. Cambridge: Cambridge University Press, pp. 64–83. Nagy, W. E. and R. C. Anderson. (1984). ‘How many words are there in printed school English?’ Reading Research Quarterly 19/3: 304–30. Nagy, W. E., R. C. Anderson, and P. A. Herman. (1987). ‘Learning word meanings from context during normal reading,’ American Educational Research Journal 24/2: 237–70. Nagy, W. E., I. N. Diakidoy, and R. C. Anderson. (1993). ‘The acquisition of morphology: Learning the contribution of suffixes to the meanings of derivatives,’ Journal of Reading Behavior 25/2: 155–80. Nagy, W. E., P. A. Herman, and R. C. Anderson. (1985). ‘Learning words from context,’ Reading Research Quarterly 20/2: 233–53. Nation, I. S. P. (1983). ‘Testing and teaching vocabulary,’ Guidelines 5: 12–25. Nation, I. S. P. (1990). Teaching and Learning Vocabulary. New York: Newbury House Publishers. Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. Nation, P. and H. Kyongho. (1995). ‘Where would general service vocabulary stop and special purposes vocabulary begin?’ System 23/1: 35–41. Nation, P. and R. Waring. (1997). ‘Vocabulary size, text coverage and word lists’ in N. Schmitt and M. McCarthy (eds): Vocabulary: Description, Acquisition and Pedagogy. Cambridge: Cambridge University Press, pp. 6–19. Nattinger, J. R. and J. S. DeCarrico. (1992). Lexical Phrases and Language Teaching. Oxford: Oxford University Press. Nesselhauf, N. (2003). ‘The use of collocations by advanced learners of English and some implications for teaching,’ Applied Linguistics 24/2: 223–42. Partington, A. (1998). Patterns and Meanings: Using Corpora for English Language Research and Teaching. Amsterdam: John Benjamins Publishing Company. Pawley, A. and F. H. Syder. (1983). ‘Two puzzles for linguistic theory: Nativelike selection and nativelike fluency’ in J. C. Richards and R. W. Schmidt (eds): Language and Communication. London: Longman, pp. 191–225. Ravin, Y. and C. Leacock. (2000). ‘Polysemy: An overview’ in Y. Raven and C. Leacock (eds): Polysemy: Theoretical and Computational Approaches. Oxford: Oxford University Press, pp. 1–29. DEE GARDNER Read, J. (2000). Assessing Vocabulary. Cambridge: Cambridge University Press. Read, J. (2004). ‘Research in teaching vocabulary,’ Annual Review of Applied Linguistics 24: 146–61. Schmidt, N., S. Grandage, and S. Adolphs. (2004). ‘Are corpus-derived recurrent clusters psycholinguistically valid’ in N. Schmitt (ed.): Formulaic Sequences: Acquisition, Processing and Use. Amsterdam: John Benjamins Publishing Company, pp. 127–151. Schmitt, N. (ed.) (2004). Formulaic Sequences. Amsterdam: John Benjamins Publishing Company. Schmitt, N. and P. Meara. (1997). ‘Researching vocabulary through a word knowledge framework: Word associations and verbal suffixes,’ SSLA 19/1: 17–35. Schmitt, N. and C. B. Zimmerman. (2002). ‘Derivative word forms: What do learners know?’ TESOL Quarterly 36/2: 145–71. Schmitt, N., D. Schmitt, and C. Clapham. (2001). ‘Developing and exploring the behaviour of two new versions of the Vocabulary Levels Test,’ Language Testing 18: 55–88. Seidlhofer, B. (2002). ‘Pedagogy and local learner corpora: Working with learning-driven data’ in S. Granger, J. Hung, and S. Petch-Tyson (eds): Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins Publishing Company, pp. 213–34. Sinclair, J. (1987). ‘Collocation: a progress report’ in R. Steele and T. Threadgold (eds): Language Topics: An International Collection of Papers by Colleagues, Students and Admirers of Professor Michael Halliday to Honour Him on His Retirement, Vol. II. Amsterdam: John Benjamins Publishing Company, pp. 319–31. Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press. Sinclair, J. (2004a) (ed.). How to Use Corpora in Language Teaching. Amsterdam: John Benjamins Publishing Company. Sinclair, J. (2004b). Trust the Text: Language, Corpus and Discourse. London: Routledge. Sinclair, J. (2004c). ‘New evidence, new priorities, new attitudes’ in J. Sinclair (ed.): How to Use Corpora in Language Teaching. Amsterdam: John Benjamins Publishing Company, pp. 271–99. 265 Singson, M., D. Mahony, and V. Mann. (2000). ‘The relation between reading ability and morphological skills: Evidence from derivational suffixes,’ Reading and Writing: An Interdisciplinary Journal 12: 219–52. Stahl, S. A. and T. G. Shiel. (1992). ‘Teaching meaning vocabulary: Productive approaches for poor readers,’ Reading and Writing Quarterly: Overcoming Learning Difficulties 8: 223–41. Stubbs, M. (2001). ‘Texts, corpora, and problems of interpretation: A response to Widdowson,’ Applied Linguistics 22/2: 149–72. Stubbs, M. (2002). Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell Publishing. Sutarsyah, C., P. Nation, and G. Kennedy. (1994). ‘How useful is EAP vocabulary for ESP? A corpus based case study,’ RELC Journal 25/2: 34–50. Tyler, A. and W. Nagy. (1989). ‘The acquisition of English derivational morphology,’ Journal of Memory and Language 28: 649–67. Tyler, A. and W. Nagy. (1990). ‘Use of derivational morphology during reading,’ Cognition 36: 17–34. Webster’s Ninth New Collegiate Dictionary. (1988). Springfield, MA: Merriam-Webster Inc. Wesche, M. and T. S. Paribakht. (1996). ‘Assessing second language vocabulary knowledge: Depth versus breadth,’ The Canadian Modern Language Review 53/1: 13–40. Widdowson, H. G. (2000). ‘On the limitations of linguistics applied,’ Applied Linguistics 21/1: 3–25. Willis, D. (1990). The Lexical Syllabus. London: Harper Collins. WordNet (2003). Princeton University electronic lexical database (Version 2.0). Retrieved February 11, 2004, from http://wordnet. princeton.edu/obtain. Wray, A. (2000). ‘Formulaic sequences in second language teaching: Principle and practice,’ Applied Linguistics 21/4: 463–89. Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge: Cambridge University Press. Wysocki, K. and J. Jenkins. (1987). ‘Deriving word meanings through morphological generalizations,’ Reading Research Quarterly 22: 66–81. Xue, G. and P. Nation. (1984). ‘A university word list,’ Language Learning and Communication 3/2: 93–242.
© Copyright 2026 Paperzz