International Journal of Lexicography, Vol. 28 No. 3, pp. 338–350. doi:10.1093/ijl/ecv018

QUANTIFYING FIXEDNESS AND COMPOSITIONALITY IN CHINESE IDIOMS

Feng Zhu: University of Michigan ([email protected])
Christiane Fellbaum: Princeton University ([email protected])

Abstract

Idioms are a distinctive class of multi-word expressions, often characterized as lexically and syntactically fixed and idiosyncratic as well as semantically more or less non-compositional. The work presented here attempts to provide a corpus-based, quantitative perspective on Chinese idioms that is intended as a complement and possible correction to theoretical studies. A number of statistical measures are applied to test for and examine the fixedness and semantic idiosyncrasy of a specific type of idiomatic expression (verb-noun idiomatic collocations, or VNICs) in a Chinese text corpus. The approach is based on two intuitions: first, that the verbs and nouns in a VNIC exhibit a measurably lower degree of semantic similarity to one another than literal verb-noun combinations; second, that the idiom constituents under their literal readings are semantically less related to their contexts than literal phrases. Semantic similarity is measured in terms of co-occurrence overlap in a corpus. Finally, a general approach to representing idioms in a lexical resource in a way that accounts for their attested flexibility is proposed.

1. Introduction

Idioms have long attracted the attention of linguists, lexicographers, translators, language instructors, and 'linguistically naïve' speakers who are aware of the particular properties of expressions like until the cows come home, bite the bullet, and gild the lily. These expressions are characterized by lexical idiosyncrasy and full or partial semantic non-compositionality. Until the availability of large digital corpora, many researchers assumed that idioms were largely fixed both lexically and syntactically.
Consequently, they were regarded as fixed phrases or 'long words' that just happened to have multiple components but no internal structure open to grammatical or lexical operations. However, large-scale corpus studies of selected idioms have revealed considerable lexical and morphosyntactic modifications (Moon 1998; Fellbaum 2006, 2011, 2014, inter alia).

[Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.]

While grammatical variations tend to follow the rules of 'free' language, lexical deviations from the 'canonical form' have been attributed to the metaphorical status, and hence semantic transparency, of particular idiom components. Dobrovol'skij and Piirainen (2005) describe idioms as a radial category, with fully non-compositional, frozen idioms as prototypes, and idioms that are partly compositional and allow lexical substitution and morphosyntactic operations to varying degrees as 'radiating out', i.e., diverging from, the prototype. There is increasing evidence that a strict classification into non-compositional, frozen idioms on the one hand and partly compositional, modifiable idioms on the other does not reflect the way speakers actually use idioms. Corpus data show that fully non-compositional idioms are not frozen, and that semantic compositionality and variation are in fact independent of each other. For example, bucket in the much-discussed idiom kick the bucket has no clear referent, yet several song lyrics include the phrase kick the pail (I ain't yet kicked the pail; Our little brother Willie has kicked the pail), and the idiomatic meaning is readily recognized there. While the lexical substitution here is paradigmatic, substitution often plays on the form of a word rather than its meaning, which can remain opaque or be coerced into a context-dependent, ad-hoc meaning (see Fellbaum 2006 for German corpus examples).
Thus the idiom kein Blatt vor den Mund nehmen (lit. 'put no leaf/sheet in front of one's mouth') means 'to be outspoken'. Blatt here has no referent, yet this constituent is modified in many ways by speakers using the idiom in particular contexts. For example, a review of a lively concert described the performer as 'not putting any sheet music in front of his mouth' (nahm kein Notenblatt vor den Mund). An English example of a lexical variation that plays on both the literal and the idiom-dependent meaning of a word is found in the newspaper headline the Tiger is out of the bag, referring to the golfer Tiger Woods's secret womanizing; while the word cat in the canonical idiom let the cat out of the bag is commonly assumed to be a metaphor for 'secret', the use of Tiger here plays both on the literal meaning of that word ('feline') and its semantic similarity to cat, as well as on the golfer's name. Besides the tense and aspect modifications required for the grammatically appropriate embedding of idioms into sentences, corpus analyses show variations of idiom constituents that are claimed to be semantically opaque; these include passivization, pronominalization, relativization, change of lexical category, reversal of negative polarity, quantification and adjective insertion (Moon 1998; Fellbaum 2006, 2011, 2014, inter alia). Moon (1998), in her corpus-based study of English fixed expressions and idioms, found that 'around 40% of database [idioms] have lexical variations or strongly institutionalized transformations, and around 14% have two or more variations on their canonical form'. She further notes that variation 'is particularly strong with predicate [idioms]' (pp. 120–121). Corpus-based studies of idioms have shown that variation is not only independent of semantic compositionality but falls on a continuum, suggesting that the notion of selectional preferences more appropriately reflects usage.
Thus, bucket is arguably the preferred, most frequent lexeme in the idiom, but it does not categorically rule out the use of its synonym pail. Church and Hanks (1989) showed that fully compositional expressions like strong tea similarly exhibit measurable collocational preferences; powerful tea, which is semantically equivalent, sounds odd simply because the adjective here occurs far less frequently than its near-synonym as a modifier of tea. Similarly, the common expression empty threat is arguably less transparent but by no means fixed; vacant threat can be found numerous times on the Web. Church and Hanks (1989) proposed to measure the collocational strength of words by means of Pointwise Mutual Information (PMI), a statistical measure of the unexpectedness, or 'surprise value', of a word given a preceding word. The PMI of any word pair can be computed on the basis of a large corpus that allows one to extract reliable values for the probabilities of the independent occurrences of the words as well as for that of their co-occurrence. The attested variability of idioms suggests that dichotomies like compositional vs. non-compositional and fixed vs. flexible phrases are somewhat oversimplified. A corpus-based study of idioms must allow for a range of possibly unexpected variations and avoid the limitations and false negatives that could arise from corpus searches that depart from specific forms. A more promising approach is to measure the strength of collocational preferences of words in a corpus in quantifiable ways. This approach makes it possible not only to discover variations of pre-classified idiomatic phrases but also to identify candidates for new idiomatic expressions that may be entering the language. At the same time, representing idiomatic phrases in lexical resources in a way that reflects their flexibility is a challenge (see Section 7.1).
It should be noted that a number of idioms admit both idiomatic and literal readings (for instance kick the bucket or hit the jackpot); Moon's corpus-based study of sixty idioms showed that close to half also have a clear literal reading, and about 40 percent of their usages are literal, a number consistent with that found for German (Fellbaum 2006). These findings make it imperative to devise methods for differentiating between literal and idiomatic readings; such methods would necessarily require more than mappings against a dictionary where idioms are represented as 'long words'.

2. Verb-noun idiomatic collocations in Chinese

The focus of this paper is Chinese verb-noun idiomatic collocations (VNICs). VNICs, which are said to form a 'cross-lingually prominent class of phrasal idioms which are commonly and productively formed' (Cook et al. 2009), consist of a verb and a noun that functions as its direct object. Hundreds of Chinese VNICs have been identified (e.g., Jiao and Stone 2014), though the total number used in the contemporary language can only be estimated. Examples include 炒鱿鱼 (lit. 'to fry [sb.'s] cuttlefish', meaning 'to dismiss [sb.] from a job'), 吃豆腐 (lit. 'to eat [sb.'s] tofu', i.e. 'to sexually harass [sb.]'), or 动脑筋 (lit. 'to move brains', meaning 'to think' or 'to brainstorm'). Zhu and Fellbaum (2014) attempt to automatically identify idiomatic Chinese (Mandarin) verb-noun pairs in a large corpus. The implementation of syntactic and lexical fixedness measures developed for English (Cook et al. 2007, 2009; Fazly and Stevenson 2006) reveals the lexical and syntactic preferences of verb-noun expressions, with syntactic variations mirroring those found for German (Fellbaum 2006) and English (Moon 1998). In contrast to freely composed Chinese verb phrases, idiomatic phrases show a high incidence of single-character verbs, a feature that can be useful in identifying idiomatic strings.
Zhu and Fellbaum suggest that additional measures for detecting strongly associated word pairs, measures that take language-specific properties into consideration, are needed. Corpus-based studies have found variations in Chinese VNICs including the insertion of lexical material such as adjectives or aspect particles, verb reduplication, the addition of phonological particles after the noun, lexical substitution of synonyms or near-synonyms, and re-ordering of constituents. These variations were found to be conditioned by a variety of factors, including phonology and prosody, the need to make the expressions more specific and vivid, to simplify expressions, or to better tailor the idiom to the specific syntactic and semantic context (Zhou 2010).

3. Tools and Methods

The computational resources used and the basic procedures performed are discussed in Zhu and Fellbaum (2014). Below is a brief description of the methods used to quantify fixedness and idiosyncrasy, discussed in more detail in Zhu and Fellbaum (2014), as well as of the methods used to study semantic compositionality and relatedness.

3.1 Syntactic Idiosyncrasy

Despite their attested flexibility, idioms retain some degree of fixedness and idiosyncrasy; or, to put it another way, while lexical and syntactic variations are too common to be ignored altogether, idioms still 'have a strongly-preferred canonical form' (Cook et al. 2009). Psycholinguistic research strongly suggests that speakers store and retrieve idioms and collocations as meaning units in their mental lexicons (Cacciari and Glucksberg 1994, inter alia). One approach to the automatic identification of idiomatic expressions is to make use of the idiosyncrasies of potentially ambiguous idioms to help distinguish between literal and idiomatic usages. Moon (1998) observes that 'very few metaphorical FEI [fixed expressions] are ambiguous between literal and idiomatic meanings.
Either there is no or almost no evidence for literal equivalents, or there are strongly divergent collocational or colligational patterns associated with either literal or idiomatic uses' (Moon 1998: 310). Moon's observation can be quantified in various ways and applied both to semantically non-compositional idioms and to transparent verb-object combinations like have a fit and make a call.

As noted earlier, Pointwise Mutual Information (PMI) is one often-used quantitative measure of fixedness, or collocation strength, inspired by Shannon and Weaver's (1963) Information Theory (Church and Hanks 1989). Specifically, given two random variables X and Y, their PMI at given values X = x, Y = y is defined as

PMI(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}

where the independent and joint probabilities of X and Y are computed on the basis of frequencies in a corpus. In this case the random variables range over word types, taken over all pairs of verbs and their direct nominal objects. If a (verb, noun) pair occurs together frequently, as one would assume in the case of an idiomatic or collocational pair, one would also expect it to have a much higher PMI than a freely composed (verb, noun) pair, as the occurrence of one constituent makes the occurrence of the other more predictable and hence less surprising or 'informative'.

A quantitative measure of idiosyncrasy is the Kullback-Leibler divergence (KL divergence). It measures the distance between two probability distributions: given two discrete probability distributions P and Q, the KL divergence between them is defined as

D_{KL}(P \| Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}

Thus D_{KL}(P \| P) = 0, and if Q \neq P then D_{KL}(P \| Q) > 0.
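As a concrete illustration (not the authors' implementation), both measures can be computed directly from corpus counts; the counts and the syntactic-pattern distributions below are invented for demonstration:

```python
import math

def pmi(pair_count, verb_count, noun_count, total_pairs):
    """Pointwise mutual information of a (verb, noun) pair,
    estimated from corpus frequencies: log p(v, n) / (p(v) p(n))."""
    p_joint = pair_count / total_pairs
    p_v = verb_count / total_pairs
    p_n = noun_count / total_pairs
    return math.log(p_joint / (p_v * p_n))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) ln(P(i) / Q(i)).
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in p if p[i] > 0)

# Toy counts: the idiomatic pair co-occurs far more often than chance,
# so its PMI is large and positive.
print(pmi(pair_count=50, verb_count=100, noun_count=80, total_pairs=10000))

# Invented distributions of an idiom vs. a 'typical' (verb, noun) pair
# over syntactic patterns; a divergent idiom yields D_KL > 0.
idiom   = {"active_no_det": 0.90, "active_def_det": 0.05, "passive": 0.05}
typical = {"active_no_det": 0.40, "active_def_det": 0.40, "passive": 0.20}
print(kl_divergence(idiom, typical))
```

In practice the probabilities would be estimated from all (verb, direct object) pairs extracted by a parser, with the 'typical' distribution pooled over the whole corpus.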
In this case one might take the probability distributions to be distributions of a fixed (verb, noun) pair over syntactic patterns (passive, active with no determiner, active with definite determiner, etc.) or over lexical patterns; one would then expect idiomatic expressions to display distributions significantly different from the distribution of a typical (verb, noun) pair. Based on previous findings, the assumption is made that VNICs are relatively rare compared to syntactically similar literal (verb, noun) pairs, so that the PMI or syntactic pattern distribution of a 'typical' (verb, noun) pair can be assumed to represent the statistics of a literal, non-idiomatic (verb, noun) pair. This assumption is borne out by most corpus studies, including a study of the corpus used here, as shown below. The measures described above were used by Fazly and Stevenson (2006) to design more sophisticated quantitative measures for automatically extracting English VNICs from a corpus, obtaining significant improvements over other state-of-the-art methods of identifying idioms. For a binary classification task in which verb-noun pairs were labelled as either literal or idiomatic, the authors obtained an accuracy of 74%, compared to the 63% accuracy of a simple classifier using only PMI. In a task where verb-noun pairs were ranked by degree of idiomaticity, the combined fixedness measures achieved an interpolated average precision of nearly 85%, as opposed to below 65% for PMI (Fazly and Stevenson 2006).
Zhu and Fellbaum (2014) applied similar measures to Chinese VNICs in a corpus and achieved lower results, most likely due to two key differences. First, instead of testing on a limited set of verbs, as Fazly and Stevenson did, all verbs were considered; second, the potential effectiveness of the measure of lexical fixedness was hampered by the lack of a semantic resource for Chinese comparable in structure and coverage to the English WordNet (Fellbaum 1998), which was used by Fazly and Stevenson.

4. Semantic Similarity and Distributional Similarity: DISCO

As noted above, idioms have a degree of non-compositionality or semantic opaqueness which, as a whole, distinguishes them from more literal language. One possible way to quantify this is to use a measure of semantic similarity or relatedness between words. The measure used here is DISCO (DIStributionally similar words using CO-occurrences), originally formulated by Kolb (2009). This measure represents word forms (types in the corpus) as vectors in a semantic vector space whose dimensions correspond to features, namely frequently occurring words in the vocabulary and/or corpus. The vector corresponding to a word w is determined by counting the co-occurrences of w with each of the features; two words are then deemed similar if their vectors are relatively close together. The linguistic basis for this similarity measure is the distributional hypothesis, which states that 'words that occur in similar context tend to have similar meanings' (Harris 1954; see also Firth 1957). 10,000 feature words were selected by taking the most frequent types in the corpus (excluding a pre-determined list of stop words), together with a window size r. Co-occurrences in a window extending r words (tokens) before and after each word (token) appearing in the corpus were considered.
There were then 10,000 × 2r = 20,000r features, corresponding to the possible choices of the pair (feature word, position in window). To obtain the vector of a word (type) w, the number of times each feature word appears in the rth position relative to w in the corpus was counted. The use of the position within a window as well as the feature word is a distinguishing characteristic of DISCO and is intended to provide a rough model of syntactic dependencies, which may be assumed to be weakly encoded in relative word positions. This assumption appears particularly appropriate for a morphology-poor language such as Chinese, which makes stronger use of word order than English. Weighting the raw counts with a PMI-like measure first proposed by Lin (1998) yields

g(w, r, w') = \log \frac{(f(w, r, w') - 0.95)\; f(*, r, *)}{f(w, r, *)\; f(*, r, w')}

where w and w' represent words (types), r a window position, and f frequencies of occurrence: f(w, r, w') is the number of times w' occurs in the rth position relative to w, f(w, r, *) is the total number of times any feature word occurs in the rth position relative to w, and analogously for the other starred terms. To determine the similarity or 'closeness' of two word vectors, Lin's (1998) information-theoretic measure is applied:

DISCO1(w_1, w_2) = \frac{\sum_{(r, w')} \left( g(w_1, r, w') + g(w_2, r, w') \right)}{\sum_{(r, w')} g(w_1, r, w') + \sum_{(r, w')} g(w_2, r, w')}

where the sum in the numerator ranges over the features (r, w') shared by w_1 and w_2. Vectors of DISCO1 measures, DISCO1(w, w') with w' ranging over all, or some chosen subset, of the words in the vocabulary, are then used as input to compute DISCO2 in a similar fashion, with DISCO1(w, w') playing the role of g:

DISCO2(w_1, w_2) = \frac{\sum_{w'} \left( DISCO1(w_1, w') + DISCO1(w_2, w') \right)}{\sum_{w'} DISCO1(w_1, w') + \sum_{w'} DISCO1(w_2, w')}

The DISCO2 vectors may 'be seen as the "second order" word vector of the given word, containing not only the words which occur together with it, but those that occur in similar contexts' (Kolb 2009: 83).
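A minimal sketch of the first-order computation may make the definitions concrete. The toy co-occurrence counts, the feature inventory, and the function names below are all invented for illustration; the real computation runs over 10,000 feature words and millions of tokens:

```python
import math
from collections import defaultdict

# f maps (word, position, feature_word) -> count; '*' marginalizes a slot.
f = defaultdict(float)

def add(w, r, w2, c):
    """Record c co-occurrences of feature w2 at position r relative to w."""
    f[(w, r, w2)] += c
    f[(w, r, "*")] += c
    f[("*", r, w2)] += c
    f[("*", r, "*")] += c

def weight(w, r, w2):
    """Lin-style weight g(w, r, w') from the counts in f."""
    num = (f[(w, r, w2)] - 0.95) * f[("*", r, "*")]
    den = f[(w, r, "*")] * f[("*", r, w2)]
    return math.log(num / den) if num > 0 and den > 0 else 0.0

def disco1(w1, w2, features, positions):
    """Lin's similarity: shared positive weights over each word's own."""
    num = norm1 = norm2 = 0.0
    for r in positions:
        for w in features:
            g1, g2 = weight(w1, r, w), weight(w2, r, w)
            norm1 += max(g1, 0.0)
            norm2 += max(g2, 0.0)
            if g1 > 0 and g2 > 0:       # feature shared by both words
                num += g1 + g2
    return num / (norm1 + norm2) if norm1 + norm2 > 0 else 0.0

# Toy counts: 'tea' and 'coffee' share contexts; 'bucket' does not.
for w in ("tea", "coffee"):
    add(w, -1, "strong", 8); add(w, -1, "hot", 5); add(w, +1, "cup", 6)
add("bucket", -1, "kick", 9); add("bucket", +1, "water", 4)

feats = ["strong", "hot", "cup", "kick", "water"]
sim_close = disco1("tea", "coffee", feats, (-1, +1))
sim_far = disco1("tea", "bucket", feats, (-1, +1))
print(sim_close, sim_far)  # tea~coffee exceeds tea~bucket
```

DISCO2 would then be computed the same way, with the vector of DISCO1 scores replacing the weighted co-occurrence vector.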
Two sets of DISCO measures were computed, one using the entire corpus with a window size of r = 1, and a second using the first 1.2 million sentences of the corpus with a window size of r = 2. Limited computing resources prevented the use of larger window sizes on larger corpora. In the remainder of this paper, the former set of DISCO measures will be referred to as 'Set 1' and the latter as 'Set 2'.

5. Examining the Compositionality of Idioms with DISCO

The measure of semantic similarity provided by DISCO is used to examine the compositional properties of the verb-predicate noun combinations, in an attempt to distinguish the idiomatic combinations from literal structures in two ways. First, the average DISCO measures (DISCO1 and DISCO2) are computed (1) between the verb and the noun; (2) between the verb and the entire VP as identified by a parser; (3) between the predicate noun and the entire VP; (4) between the verb and the sentence in which the VP occurs; and (5) between the predicate noun and the sentence in which the VP occurs. The hypothesis here is that, given the relative sparsity of idioms, there should be less semantic similarity (i.e. a lower DISCO measure) between the verb and the noun in the case of idiomatic verb-noun combinations. Thus, 'fry someone's cuttlefish' (meaning to dismiss someone from his job) is a phrase that is likely to occur in contexts where employment, work, one's boss, etc. are discussed rather than food preparation. And one might expect, due to their non-compositionality, that verbs and nouns appearing in VNICs are less semantically similar to their surrounding contexts (both the entire VP and the surrounding sentence) than those in literal verb-noun combinations. Second, semantic vectors are computed for (1) the verb + predicate noun, (2) the entire VP, and (3) the sentence in which the VP occurs.
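The first set of comparisons can be sketched as follows. The paper does not spell out how a single word is compared to a multi-word unit; one plausible reading, assumed here, is to average the word-level DISCO score over the unit's other tokens. The `sim` function and its toy scores below stand in for real DISCO1 values and are invented for illustration:

```python
def avg_sim(sim, word, phrase):
    """Average word-level similarity between `word` and the other tokens
    of `phrase` (one plausible reading of 'DISCO between a word and a VP')."""
    others = [t for t in phrase if t != word]
    return sum(sim(word, t) for t in others) / len(others) if others else 0.0

# Stand-in similarity scores in place of real DISCO1 values.
scores = {("fry", "cuttlefish"): 0.01, ("fry", "boss"): 0.02,
          ("cuttlefish", "boss"): 0.01}
def sim(a, b):
    return scores.get((a, b), scores.get((b, a), 0.0))

vp = ["fry", "cuttlefish"]
sentence = ["boss", "fry", "cuttlefish"]
print(avg_sim(sim, "fry", vp))        # comparison (2): verb vs. VP
print(avg_sim(sim, "fry", sentence))  # comparison (4): verb vs. sentence
```

The same helper covers comparisons (3) and (5) with the noun in place of the verb, and comparison (1) reduces to a single `sim` call.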
To this end, a singular-value decomposition (SVD) was performed to reduce the dimensionality of the semantic vector space to 120. Note that in this case, since the matrices containing the DISCO measures are symmetric (DISCO1(w, w') = DISCO1(w', w), and the same is true for DISCO2), this procedure amounts to finding the top 120 eigenvectors of the DISCO matrices. The three vectors are then compared. As stated above, the hypothesis is that VNICs, being non-compositional, can be expected to have lower semantic similarity to their contexts than more literal verb-noun combinations, so that the three vectors should be less similar in the case of an idiomatic verb-noun combination. As an exploratory step, clustering is performed on the three sets of semantic vectors obtained above.

6. Results

Some differences were found between the DISCO scores of a list of 3,052 VPs identified as possible VNICs by checking against a dictionary of known VNICs (labelled 'VNICs') and a list of about 90,000 VPs not checked for possible idiomaticity and assumed to be mostly literal (labelled 'All'). Table 1 shows average DISCO scores over the two lists.

Table 1: Average DISCO scores for the two lists

                          Set 1                  Set 2
                      VNICs      All        VNICs      All
  DISCO1
  (verb, noun)       0.01578   0.01358    0.01302   0.01068
  (verb, VP)         0.04415   0.02822    0.03877   0.03741
  (noun, VP)         0.07737   0.08335    0.05058   0.06581
  (verb, sentence)   0.04753   0.02958    0.04558   0.03783
  (noun, sentence)   0.08516   0.08638    0.05978   0.07019
  DISCO2
  (verb, noun)       0.00081   0.00365    0.00395   0.00504
  (verb, VP)         0.01286   0.02411    0.01013   0.01362
  (noun, VP)         0.02348   0.03556    0.00926   0.01092
  (verb, sentence)   0.01917   0.03759    0.01140   0.01511
  (noun, sentence)   0.03288   0.04659    0.01217   0.01274

The predicate nouns were, on average, less similar to the surrounding VPs and sentences for VNICs according to DISCO1, but the opposite effect was observed for the verbs. The effect appeared to be more pronounced for the Set 2 DISCO1 measures than for the Set 1 DISCO1 measures.
Interestingly, DISCO1(verb, noun) tended to be very slightly higher for VNICs, possibly because of many instances of literal readings. According to DISCO2, both the verbs and the predicate nouns were less similar to the surrounding VPs and sentences for VNICs, and in this case the effect was more pronounced for the Set 1 DISCO2 measures than for the Set 2 DISCO2 measures. DISCO2(verb, noun) also tended to be higher for VNICs.

6.1 DISCO-based semantic vectors

The verb + noun vectors were grouped into 5 clusters and the VP and sentence vectors into 6 to 8 clusters using k-means clustering in R; the number of clusters to use was determined in each case using a scree plot. In all cases the sets of sentence vectors were substantially similar, at least in terms of summary statistics, for both the list of dictionary-identified VNICs and the list of all verb-noun combinations; considering the entire sentence seemed to wash out any semantic idiosyncrasy coming from the verb-noun combination alone. No major difference was observed between the summary statistics for the verb + noun and VP vectors corresponding to the dictionary-identified VNICs and those corresponding to the larger set of all verb-noun combinations when the vectors were computed using Set 1 DISCO1 measures; cumulative plots of coordinate distributions seem to show the vectors similarly distributed, and the two sets of cluster centers show a high degree of similarity when compared using, for example, a cosine similarity measure for vectors.
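The reduction-plus-clustering pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: an invented 30×30 symmetric matrix stands in for a DISCO matrix, 2 retained dimensions stand in for 120, and a plain k-means loop stands in for R's kmeans():

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a symmetric DISCO similarity matrix.
M = rng.random((30, 30))
M = (M + M.T) / 2  # symmetric, as DISCO1/DISCO2 matrices are

# For a symmetric matrix, truncated SVD amounts to keeping the
# eigenvectors with the largest-magnitude eigenvalues.
k = 2
vals, vecs = np.linalg.eigh(M)                    # ascending eigenvalues
top = vecs[:, np.argsort(-np.abs(vals))[:k]]
reduced = M @ top                                 # rows = reduced vectors

def kmeans(X, n_clusters, iters=50):
    """Plain k-means; stands in for R's kmeans() used in the paper."""
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == c].mean(axis=0)
                            if (labels == c).any() else centers[c]
                            for c in range(n_clusters)])
    return centers

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

centers = kmeans(reduced, 3)
print(cosine(centers[0], centers[1]))  # compare two cluster centers
```

Pairwise cosine scores between the VNIC-derived and all-VP-derived cluster centers are what the comparisons in this section report.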
With vectors computed using Set 2 DISCO1 measures, it was observed that the two sets of cluster centers (one set corresponding to clusters of semantic vectors for dictionary-identified VNICs, the other to clusters of semantic vectors for all verb-noun combinations) were now substantially different: cosine similarity scores between pairs of vectors (v, w), with v from the former set and w from the latter, now ranged between 0.3 and 0.5. Cumulative plots of coordinate distributions also differed substantially. It is difficult to visualize these vectors since they lie in a high-dimensional (120-dimensional) space, but the summary statistics suggest that the two sets are different enough that one may, for example, design a classifier to distinguish (probabilistically) between them. The DISCO2 measures produced a more nuanced picture: when using Set 2 DISCO2 measures, some of the cluster centers were similar, but others were not (the cosine similarity scores here ranged between 0.29 and 0.99). Cumulative plots of coordinate distributions were similar in shape. Examining a small sample of VPs whose vectors landed in each cluster did not yield any easily discernible patterns.

7. Discussion

The work described here explored some aspects of the syntactic fixedness of idioms as well as their compositionality using a number of quantitative measures. The differences in the results when using DISCO measures with a larger context window size (Set 2 instead of Set 1) suggest that the compositionality of idiomatic language is better captured, or indeed may only be captured, by quantitative measures that take into account longer-range dependencies, even if only crudely, by looking at collocations over a larger window. Further, the differences between the results obtained with DISCO1 and DISCO2 measures seem to suggest that the more subtle 'second-order' collocations provide information crucial to the determination and understanding of idiomatic language.
Moreover, it appears that none of the measures considered here is sufficiently strong to distinguish between idiomatic and literal combinations on its own, but the measures may be able to serve as weak classifiers, or as features in a bootstrapped cascading classifier, that might more accurately distinguish between idioms and literal verb-noun combinations.

7.1 Consequences for the Lexical Representation of Idioms

Given the demonstrated variability of idiomatic phrases, one must discard the notion of 'fixed' phrases or 'long words' and reconsider the lexical status of idioms. Speakers clearly have 'entries' for idiomatic phrases in their mental lexicons, which allow them to interpret sequences like kick the bucket and steal someone's thunder, both in their 'canonical' and in modified forms. Grammatical modifications like passivization and relativization show that the idiom constituents often have independent word status; lexically modified forms like kick the pail show that the idiom constituents are linked to their literal meanings. Psycholinguistic experiments (Sprenger, Levelt and Kempen 2006) suggest a dual representation with access to both the word and the entire phrase. What, then, is the best way to capture the dual status of idioms in a lexical resource? Clearly, entries that suggest a fixed, immutable phrase do not reflect speakers' use. A learner of English, when encountering an unfamiliar idiom in a 'non-canonical' use, will not find a matching entry. Osherson and Fellbaum (2010) propose a solution for adding many idioms to WordNet. They manually inspected the components of over 200 common English idioms (verb phrases, noun phrases and full sentences) for their metaphoric status. For example, in the phrase old flames die hard, flames arguably refers to feelings of affection, while die here means 'fade' (rather than 'undergo the cessation of all vital functions').
Indeed, a well-known rock music album is entitled Old Loves Die Hard, suggesting a play on the 'canonical' expression. Other attested variations include when an old flame won't burn out and rekindling old flames. Osherson and Fellbaum propose the following representation of such idioms in WordNet. First, an entry is created for the entire idiom in its 'canonical' and most frequently encountered form. Second, a new entry is created for each component that is deemed to have a meaning (e.g., flame). For idioms like spill the beans and let the cat out of the bag, a new synset with the members {cat, beans} is created. Finally, these new synsets are linked via a pointer to existing synsets that capture the appropriate meanings. For example, flame is linked to the synset containing feeling, die to the synset containing fade, and {cat, beans} to the synset containing secret. Importantly, the pointer is bi-directional, so that the words in the newly created synsets are clearly identified as members of idioms, and thus as having a limited distribution and as assuming the appropriate meaning only within the context of the idioms. This precludes the use of fade in a context like my old cat faded last night or the use of beans in let me tell you some beans. The approach proposed by Osherson and Fellbaum is currently being adopted for the Open Multilingual Wordnet, which includes a range of typologically diverse languages (Bond et al. 2014), with the goal of better automatic cross-linguistic mapping and translation.

8. Future Work

The results presented here point to further exploration of semantic vector space models. In particular, neural networks and other deep learning techniques for building and training semantic vector space models have already been used in NLP applications such as semantic compositionality and sentiment analysis (Socher et al. 2013; Pennington et al. 2014).
Such techniques could be applied using objective functions chosen specifically to pick out idioms, and may yield insightful results.

References

A. Dictionaries

Jiao, L. and B. Stone. 2014. Five Hundred Common Chinese Proverbs and Colloquial Expressions: An Annotated Frequency Dictionary. London: Routledge.
Li, Mao. 2004. Hanyu guanyongyu cidian. Shanghai: Hanyu Da Cidian Chubanshe.
Mei, Jiaju, Zhu, Yiji, Gao, Yunqi and Yin, Hongxiang. 1996. Tongyici Cilin. Shanghai.

B. Other Literature

Bond, F., C. Fellbaum, S.-K. Hsieh, C.-R. Huang, A. Pease and P. Vossen. 2014. 'A Multilingual Lexico-Semantic Database and Ontology' In Buitelaar, P. and P. Cimiano (eds), Towards the Multilingual Semantic Web. Heidelberg: Springer, 243–258.
Cacciari, C. and S. Glucksberg. 1994. 'Understanding figurative language' In Gernsbacher, M. A. (ed.), Handbook of Psycholinguistics. New York: Academic Press, 447–477.
Church, K. W. and P. Hanks. 1989. 'Word association norms, mutual information, and lexicography' In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 76–83.
Cook, P., A. Fazly and S. Stevenson. 2007. 'Pulling their weight: exploiting syntactic forms for the automatic identification of idiomatic expressions in context' In Proceedings of the ACL'07 Workshop on A Broader Perspective on Multiword Expressions. Prague, 41–48.
Cook, P., A. Fazly and S. Stevenson. 2009. 'Unsupervised type and token identification of idiomatic expressions.' Computational Linguistics 35: 61–103.
Dobrovol'skij, D. and E. Piirainen. 2005. 'Cognitive theory of metaphor and idiom analysis.' Jezikoslovlje 6.1: 7–35.
Fazly, A. and S. Stevenson. 2006. 'Automatically constructing a lexicon of verb phrase idiomatic combinations' In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Trento, Italy, 337–344.
Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Fellbaum, C.
(ed.). 2006. 'Corpus-Based studies of German idioms and light verbs.' Special Issue, International Journal of Lexicography 19.4.
Fellbaum, C. 2011. 'Idioms and collocations' In von Heusinger, K., C. Maienborn and P. Portner (eds), Semantics: An International Handbook of Natural Language Meaning, Part 1. Berlin: De Gruyter Mouton, 439–454.
Fellbaum, C. 2014. 'Syntax and grammar of idioms and collocations' In Kiss, T. and A. Alexiadou (eds), Handbook of Syntax. Berlin: De Gruyter Mouton, 776–802.
Firth, J. R. 1957. Papers in Linguistics 1934–1951. Oxford University Press.
Harris, Z. 1954. 'Distributional structure.' Word 10.2–3: 146–162.
Huang, C.-R., S. K. Hsieh, J. F. Hong, Y. Z. Chen, I. L. Su, Y. X. Cheng and S. W. Huang. 2008. 'Chinese WordNet: Design, implementation, and application of an infrastructure for crosslingual knowledge processing' In Proceedings of the 9th Chinese Lexical Semantics Workshop, Singapore.
Kolb, P. 2009. 'Experiments on the difference between semantic similarity and relatedness' In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA '09), Odense, Denmark.
Lin, D. 1998. 'Extracting Collocations from Text Corpora' In Proceedings of the Workshop on Computational Terminology. Montreal, Canada, 57–63.
Moon, R. 1998. Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford University Press.
Nunberg, G., I. A. Sag and T. Wasow. 1994. 'Idioms.' Language 70.3: 491–538.
Osherson, A. and C. Fellbaum. 2010. 'The Representation of Idioms in WordNet' In Proceedings of the Fifth Global WordNet Conference, Mumbai, India.
Pennington, J., R. Socher and C. D. Manning. 2014. 'GloVe: Global Vectors for Word Representation' In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Shannon, C. and W. Weaver. 1963. The Mathematical Theory of Communication. University of Illinois Press.
Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng and C.
Potts. 2013. 'Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank' In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1642–1654.
Sprenger, S., W. J. M. Levelt and G. Kempen. 2006. 'Lexical access during the production of idiomatic phrases.' Journal of Memory and Language 54.2: 161–184.
Zhou, R. 2010. Master's thesis, Xiangtan University. Xiangtan, Hunan, China.
Zhu, F. and C. Fellbaum. 2014. 'Automatically Identifying Chinese Verb-Noun Collocations' In Dalmas, M. and E. Piirainen (eds), Figurative Language: Festschrift for D. Dobrovol'skij. Tübingen: Stauffenburg, 187–200.