
International Journal of Lexicography, Vol. 28 No. 3, pp. 338–350
doi:10.1093/ijl/ecv018
QUANTIFYING FIXEDNESS AND
COMPOSITIONALITY IN CHINESE
IDIOMS
Feng Zhu: University of Michigan ([email protected])
Christiane Fellbaum: Princeton University ([email protected])
Abstract
Idioms are a distinctive class of multi-word expressions often characterized as lexically
and syntactically fixed and idiosyncratic, besides being semantically more or less non-compositional. The work presented here attempts to provide a corpus-based, quantitative perspective on Chinese idioms that is intended as a complement and possible correction to theoretical studies. A number of statistical measures are applied to
test for and examine fixedness and semantic idiosyncrasy of a specific type of idiomatic
expressions (verb-noun idiomatic collocations, or VNICs) in a Chinese text corpus. The
approach is based on two intuitions: first, that the verbs and nouns in a VNIC exhibit a
measurably lower degree of semantic similarity to one another than literal verb-noun
combinations; second, that the idiom constituents under their literal readings are
semantically less related to their contexts than literal phrases. Semantic similarity is measured in terms of co-occurrence overlap in a corpus. Finally, a general approach
to representing idioms in a lexical resource in a way that accounts for their attested
flexibility is proposed.
1. Introduction
Idioms have long attracted the attention of linguists, lexicographers, translators, language instructors, and ‘linguistically naïve’ speakers who are aware of
the particular properties of expressions like until the cows come home, bite the
bullet, and gild the lily. These expressions are characterized by lexical idiosyncrasy and full or partial semantic non-compositionality. Until the availability
of large digital corpora, many researchers assumed that idioms were largely
fixed both lexically and syntactically. Consequently, they were regarded as
fixed phrases or ‘long words’ that just happened to have multiple components
but no internal structure open to grammatical or lexical operations. However,
large-scale corpus studies of selected idioms have revealed considerable lexical
and morphosyntactic modifications (Moon 1998; Fellbaum 2006, 2011, 2014,
inter alia). While grammatical variations tend to follow the rules of ‘free’ language, lexical deviations from the ‘canonical form’ have been attributed to the
metaphorical status, and hence semantic transparency, of particular idiom
components. Dobrovol’skij and Piirainen (2005) describe idioms as a radial
category, with fully non-compositional, frozen idioms as prototypes, and
idioms that are partly compositional and allow lexical substitution and morphosyntactic operations to varying degrees as ‘radiating out’, i.e., diverging
from, the prototype.
There is increasing evidence that a strict classification of non-compositional,
frozen idioms on the one hand and partly compositional, modifiable idioms on
the other hand does not reflect the way speakers actually use idioms. Corpus
data show that fully non-compositional idioms are not frozen, and semantic
compositionality and variation are in fact independent of each other. For
example, bucket in the much-discussed idiom kick the bucket has no clear
referent, yet several song lyrics include the phrase kick the pail (I ain’t yet
kicked the pail; Our little brother Willie has kicked the pail); and the idiomatic
meaning is readily recognized here. While the lexical substitution here is paradigmatic, it often plays on the form of a word rather than its meaning, which
can remain opaque or be coerced into a context-dependent, ad-hoc meaning
(see Fellbaum 2006 for German corpus examples). Thus the idiom kein Blatt
vor den Mund nehmen (lit. ‘put no leaf/sheet in front of one’s mouth’) means
‘to be outspoken’. Blatt here has no referent, yet this constituent is modified in
many ways by speakers using the idiom in particular contexts. For example, a
review of a lively concert described the performer as ‘not putting any sheet
music in front of his mouth’ (nahm kein Notenblatt vor den Mund). An English
example of a lexical variation that plays both on the literal and the idiom-dependent meaning of a word is found in the newspaper headline the Tiger is out of the bag, referring to the golfer Tiger Woods's secret womanizing; while
the word cat in the canonical idiom let the cat out of the bag is commonly
assumed to be a metaphor for ‘secret’, the use of Tiger here plays both on the
literal meaning of that word (‘feline’) and its semantic similarity to cat, as well
as on the golfer’s name.
Besides tense and aspect modifications required for the grammatically
appropriate embedding of idioms into sentences, corpus analyses show variations of idiom constituents that are claimed to be semantically opaque; these
include passivization, pronominalization, relativization, change of lexical category, reversal of negative polarity, quantification and adjective insertion
(Moon 1998; Fellbaum 2006, 2011, 2014 inter alia).
Moon (1998) in her corpus-based study of English fixed expressions and
idioms found that ‘around 40% of database [idioms] have lexical variations
or strongly institutionalized transformations, and around 14% have two or
340
Feng Zhu and Christiane Fellbaum
more variations on their canonical form’. She further notes that variation ‘is
particularly strong with predicate [idioms]’ (pp. 120–121).
Corpus-based studies of idioms have shown that variation is not only independent of semantic compositionality but falls on a continuum, suggesting that
the notion of selectional preferences more appropriately reflects usage. Thus,
bucket is arguably the preferred, most frequent lexeme in the idiom, but it does
not categorically rule out the use of its synonym pail. Church and Hanks (1989)
showed that fully compositional expressions like strong tea similarly exhibit
measurable collocational preferences; powerful tea, which is semantically
equivalent, sounds odd simply because the adjective here occurs far less frequently than its near-synonym as a modifier of tea. Similarly, the common
expression empty threat is arguably less transparent but by no means fixed;
vacant threat can be found numerous times on the Web.
Church and Hanks (1989) proposed to measure the collocational strength of
words by means of Pointwise Mutual Information (PMI), a statistical measure
of the unexpectedness, or ‘surprise value’ of a word given a preceding word.
The PMI of any word pair can be computed on the basis of a large corpus that
allows one to extract reliable values for the probability of the independent
occurrences of the words as well as that of their co-occurrence. The attested
variability of idioms suggests that dichotomies like compositional vs. non-compositional and fixed vs. flexible phrases are somewhat oversimplified.
A corpus-based study of idioms must allow for a range of possibly unexpected
variations and avoid the limitations and false negatives that could arise from
corpus searches that depart from specific forms. A more promising approach is
to measure the strength of collocational preferences of words in a corpus in
quantifiable ways. This approach allows one not only to discover variations of
pre-classified idiomatic phrases but also to identify candidates for new idiomatic expressions that may be entering the language. At the same time, a representation of idiomatic phrases in lexical resources that reflects their flexibility is
a challenge (see Section 6.2.).
It should be noted that a number of idioms admit both idiomatic and literal
readings (for instance kick the bucket or hit the jackpot); Moon’s corpus-based
study of sixty idioms showed that close to half also have a clear literal reading,
and about 40 percent of their usages are literal, a number consistent with that
found for German (Fellbaum 2006). These findings make it imperative to devise
methods for differentiating between literal and idiomatic readings; such methods would necessarily require more than mappings against a dictionary where
idioms are represented as ‘long words’.
2. Verb-noun idiomatic collocations in Chinese
The focus of this paper will be on Chinese verb-noun idiomatic collocations
(VNICs). VNICs, which are said to form a ‘cross-lingually prominent class of
Quantifying Fixedness and Compositionality
341
phrasal idioms which are commonly and productively formed’ (Cook et al.
2009), consist of a verb and a noun that functions as its direct object.
Hundreds of Chinese VNICs have been identified (e.g., Jiao and Stone
2014), though the total number used in the contemporary language can only
be estimated. Examples include [
]
(lit. ‘to fry [sb.’s] cuttlefish’,
meaning ‘to dismiss [sb.] from a job’), [
]
(lit. ‘to eat [sb.’s] tofu’,
i.e. to sexually harrass [sb.]), or
(lit. ‘to move brains’, meaning
‘to think’ or ‘to brainstorm’).
Zhu and Fellbaum (2014) attempt to automatically identify idiomatic
Chinese (Mandarin) verb-noun pairs in a large corpus. The implementation
of syntactic and lexical fixedness measures developed for English (Cook et al.
2007, 2009; Fazly and Stevenson 2006) reveals the lexical and syntactic preferences of verb-noun expressions, with syntactic variations mirroring those found
for German (Fellbaum 2006) and English (Moon 1998). In contrast to freely
composed Chinese verb phrases, idiomatic phrases show a high incidence of
single-character verbs, a feature that can be useful in identifying idiomatic
strings. Zhu and Fellbaum suggest that additional measures for detecting
strongly associated word pairs that take language-specific properties into consideration are needed.
Corpus-based studies have found variations in Chinese VNICs including the
insertion of lexical material such as adjectives or aspect particles, verb reduplication, the addition of phonological particles after the noun, lexical substitution of synonyms or near-synonyms and re-ordering of constituents. These
variations were found to be conditioned by a variety of factors, including
phonology and prosody, the need to make the expressions more specific and
vivid, to simplify expressions, or to better tailor the idiom to the specific syntactic and semantic context (Zhou 2010).
3. Tools and Methods
The computational resources used and basic procedures performed are discussed in Zhu and Fellbaum (2014). Below is a brief description of the methods
used to quantify fixedness and idiosyncrasy, discussed in more detail in Zhu
and Fellbaum (2014), as well as methods used to study semantic compositionality and relatedness.
3.1 Syntactic Idiosyncrasy
Despite their attested flexibility, idioms retain some degree of fixedness and
idiosyncrasy; or, to put it another way, while lexical and syntactic variations
are too common to be ignored altogether, idioms still ‘have a strongly preferred canonical form’ (Cook et al. 2009). Psycholinguistic research strongly
suggests that speakers store and retrieve idioms and collocations as meaning
units in their mental lexicons (Cacciari and Glucksberg 1994, inter alia).
One approach to the automatic identification of idiomatic expressions is to
make use of the idiosyncrasies of potentially ambiguous idioms to help distinguish between literal and idiomatic usages. Moon (1998) observes that ‘very
few metaphorical FEI [fixed expressions] are ambiguous between literal and
idiomatic meanings. Either there is no or almost no evidence for literal equivalents, or there are strongly divergent collocational or colligational patterns
associated with either literal or idiomatic uses’ [Moon 1998: 310]. Moon’s
observation can be quantified in various ways and applied to both semantically
non-compositional idioms as well as transparent verb-object combinations like
have a fit and make a call.
As noted earlier, Pointwise Mutual Information (PMI) is one often-used
quantitative measure of fixedness, or collocation strength, inspired by
Shannon and Weaver’s (1963) Information Theory (Church and Hanks
1989). Specifically, given two random variables X and Y, their PMI at given
values X = x, Y = y is defined as
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
where the independent and joint probabilities of X and Y are computed on the
basis of the frequencies in a corpus. In this case the random variables range over word types, with probabilities estimated from frequencies taken over all pairs of verbs and their direct
nominal objects. If some (verb, noun) pairs occur together frequently, as one
would assume in the case of an idiomatic or collocational pair, one would also
expect such a pair to have a much higher PMI than a freely composed (verb, noun) pair,
as the occurrence of one constituent makes the occurrence of the other more
predictable and hence less surprising or ‘informative’.
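To make the computation concrete, the following minimal sketch estimates PMI for (verb, noun) pairs from raw co-occurrence counts; the toy pairs and function names are illustrative only and are not taken from the study.

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """PMI(v, n) = log[ p(v, n) / (p(v) p(n)) ], with probabilities estimated
    from relative frequencies over all extracted (verb, noun) pairs."""
    pair_counts = Counter(pairs)
    verb_counts = Counter(v for v, _ in pairs)
    noun_counts = Counter(n for _, n in pairs)
    total = len(pairs)
    return {
        (v, n): math.log((c / total) /
                         ((verb_counts[v] / total) * (noun_counts[n] / total)))
        for (v, n), c in pair_counts.items()
    }

# Toy data standing in for (verb, direct object) pairs extracted from a parsed corpus.
pairs = [("kick", "bucket"), ("kick", "bucket"), ("kick", "ball"),
         ("fill", "bucket"), ("kick", "door"), ("fill", "glass")]
for pair, score in sorted(pmi_scores(pairs).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```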
A quantitative measure of idiosyncrasy is the Kullback-Leibler divergence
(KL divergence). It measures the distance between two probability distributions: given two discrete probability distributions P and Q, the KL divergence
between them is defined as
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \ln \frac{P(i)}{Q(i)}$$
Thus $D_{\mathrm{KL}}(P \,\|\, Q) = 0$ if $P = Q$, and $D_{\mathrm{KL}}(P \,\|\, Q) > 0$ if $Q \neq P$. In this case one might
take the probability distributions to be distributions of a fixed (verb, noun) pair
over syntactic patterns—passive, active with no determiner, active with definite
determiner, etc.—or over lexical patterns; one might then expect idiomatic
expressions to display distributions significantly different from the distribution
of a typical (verb, noun) pair.
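As an illustration, a short sketch of the KL divergence between one pair's distribution over syntactic patterns and a 'typical' distribution follows; the pattern inventory and counts are hypothetical and only meant to show the computation (a small smoothing constant avoids division by zero).

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """D_KL(P || Q) = sum_i P(i) ln(P(i) / Q(i)), with counts turned into
    smoothed probability distributions over the union of observed patterns."""
    patterns = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(patterns)
    q_total = sum(q_counts.values()) + eps * len(patterns)
    div = 0.0
    for pat in patterns:
        p_i = (p_counts.get(pat, 0) + eps) / p_total
        q_i = (q_counts.get(pat, 0) + eps) / q_total
        div += p_i * math.log(p_i / q_i)
    return div

# Hypothetical syntactic-pattern counts for one (verb, noun) pair versus the
# aggregate distribution over all (verb, noun) pairs in the corpus.
pair_patterns = {"active_no_det": 95, "active_def_det": 3, "passive": 2}
typical_patterns = {"active_no_det": 40, "active_def_det": 35, "passive": 25}
print(round(kl_divergence(pair_patterns, typical_patterns), 3))
```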
Based on previous findings, the assumption is made that VNICs are relatively rare compared to syntactically similar literal (verb, noun) pairs, so that
the PMI or syntactic pattern distribution of a ‘typical’ (verb, noun) pair can be
assumed to represent the statistics for a literal non-idiomatic (verb, noun) pair.
This assumption is borne out by most corpus studies, including a study of the
corpus used here, as shown below.
The measures described above were used by Fazly and Stevenson (2006) to
design more sophisticated quantitative measures for automatically extracting
English VNICs from a corpus, obtaining significant improvements compared
to other state-of-the-art methods of identifying idioms. For a binary classification task in which verb-noun pairs were labelled as either literal or idiomatic,
the authors obtained an accuracy of 74%, compared to the 63% accuracy
obtained by a simple classifier using only PMI. In a task where verb-noun
pairs were ranked by degree of idiomaticity, the combined fixedness measures
achieved an interpolated average precision of nearly 85%, as opposed to below
65% for PMI (Fazly and Stevenson 2006).
Zhu and Fellbaum (2014) applied similar measures to Chinese VNICs in a
corpus and achieved lower results, which are most likely due to two key differences. First, instead of testing on a limited set of verbs, as Fazly and
Stevenson did, all verbs were considered; second, the potential effectiveness
of the measure of lexical fixedness was hampered by the lack of a semantic
resource for Chinese comparable in structure and coverage to the English
WordNet (Fellbaum 1998), which was used by Fazly and Stevenson.
4. Semantic Similarity and Distributional Similarity: DISCO
As noted above, idioms have a degree of non-compositionality or semantic
opaqueness which, as a whole, distinguishes them from more literal language.
One possible way to quantify this is to use a measure of semantic similarity or
relatedness between words. The measure used here is DISCO (DIStributionally
similar words using COoccurrences), originally formulated by Kolb (2009).
This measure represents word forms (types in the corpus) as vectors in a
semantic vector space, the dimensions of which represent features corresponding to frequently occurring words in the vocabulary and/or corpus. The vector
corresponding to a word w is determined by counting the co-occurrences of w
with each of the features; two words are then deemed similar if their vectors are
relatively close together. The linguistic basis for this similarity measure is the
distributional hypothesis, which states that ‘words that occur in similar context
tend to have similar meanings’ (Harris 1954; see also Firth 1957).
10,000 feature words were selected by taking the most frequent types in the corpus (excluding a pre-determined list of stop words), and a window size r was chosen.
Co-occurrences in a window extending r words (tokens) before and after each
word (token) appearing in the corpus were considered. There were then 10,000
× 2r = 20,000r features, corresponding to possible choices for the pair (feature
word, position in window). To obtain the vector of a word (type) w, the number
of times each feature word appears in the rth position relative to w in the corpus
was counted. The use of the position within a window as well as the feature
word is a distinguishing characteristic of DISCO and is intended to provide a
rough model of syntactic dependencies, which may be assumed to be weakly
encoded in relative word positions. This assumption appears particularly
appropriate for a morphology-poor language such as Chinese, which makes
stronger use of word order than English.
Weighting with a PMI-like measure first proposed by Lin (1998) yields
$$g(w, r, w') = \log \frac{\bigl(f(w, r, w') - 0.95\bigr)\, f(*, r, *)}{f(w, r, *)\, f(*, r, w')}$$
where w and w′ represent words (types) and r a window position, and f represents the frequency of occurrence. f(w, r, w′) represents the number of times w′ occurs in the rth position relative to w, and f(w, r, *) represents the total number of times any feature word occurs in the rth position relative to w.
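The sketch below illustrates how such position-sensitive co-occurrence counts and the weighting g might be computed; the tokenised toy corpus, the feature-word set and the function names are placeholders rather than the resources actually used in the study.

```python
import math
from collections import Counter

def count_cooccurrences(sentences, feature_words, r=1):
    """f[(w, pos, w2)]: how often feature word w2 appears pos tokens away from w,
    for pos in -r..-1 and 1..r (the window positions described above)."""
    f = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for pos in range(-r, r + 1):
                j = i + pos
                if pos != 0 and 0 <= j < len(tokens) and tokens[j] in feature_words:
                    f[(w, pos, tokens[j])] += 1
    return f

def weight(f):
    """g(w, pos, w2) = log[(f(w,pos,w2) - 0.95) * f(*,pos,*) / (f(w,pos,*) * f(*,pos,w2))]"""
    f_w_pos, f_pos_w2, f_pos = Counter(), Counter(), Counter()
    for (w, pos, w2), c in f.items():
        f_w_pos[(w, pos)] += c       # f(w, pos, *)
        f_pos_w2[(pos, w2)] += c     # f(*, pos, w2)
        f_pos[pos] += c              # f(*, pos, *)
    return {
        (w, pos, w2): math.log(((c - 0.95) * f_pos[pos]) /
                               (f_w_pos[(w, pos)] * f_pos_w2[(pos, w2)]))
        for (w, pos, w2), c in f.items()
    }

# Toy pre-tokenised corpus; in practice the feature words would be the 10,000
# most frequent types and the corpus would be segmented Chinese text.
sentences = [["he", "kicked", "the", "bucket"],
             ["she", "kicked", "the", "ball"],
             ["he", "filled", "the", "bucket"]]
features = {"he", "she", "kicked", "filled", "the", "bucket", "ball"}
g = weight(count_cooccurrences(sentences, features, r=1))
```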
To determine the similarity or ‘closeness’ of two word vectors, Lin’s (1998)
information-theoretic measure is applied:
$$\mathrm{DISCO1}(w_1, w_2) = \frac{\sum_{(r, w')} \bigl[\, g(w_1, r, w') + g(w_2, r, w')\,\bigr]}{\sum_{(r, w')} g(w_1, r, w') + \sum_{(r, w')} g(w_2, r, w')}$$

where the sum in the numerator runs over the (position, feature word) pairs (r, w′) shared by the two words.
Vectors of DISCO1 measures—DISCO1(w, w′), where w′ ranges over all, or
some chosen subset, of the words in the vocabulary—are then used as input to
compute DISCO2 in a similar fashion:
$$\mathrm{DISCO2}(w_1, w_2) = \frac{\sum_{w'} \bigl[\, \mathrm{DISCO1}(w_1, w') + \mathrm{DISCO1}(w_2, w')\,\bigr]}{\sum_{w'} \mathrm{DISCO1}(w_1, w') + \sum_{w'} \mathrm{DISCO1}(w_2, w')}$$
where DISCO1 now plays the role of the weighting g and the numerator sum again runs over the words w′ shared by the two words being compared. The DISCO2 vectors may ‘be seen as the “second order” word vector of the given word, containing not only the words which occur together with it, but those that occur in similar contexts’ (Kolb 2009: 83).
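Continuing the sketch above (and reusing the weights g computed there), DISCO1 and DISCO2 might then be implemented roughly as follows; this is a simplified illustration of the measures rather than Kolb's actual implementation.

```python
def disco1(g, w1, w2):
    """Lin-style first-order similarity: shared (position, feature word) weights
    in the numerator, all weights of both words in the denominator.
    Only positive weights are kept, as is usual for PMI-style weighting."""
    feats1 = {(pos, fw): val for (w, pos, fw), val in g.items() if w == w1 and val > 0}
    feats2 = {(pos, fw): val for (w, pos, fw), val in g.items() if w == w2 and val > 0}
    shared = feats1.keys() & feats2.keys()
    denom = sum(feats1.values()) + sum(feats2.values())
    return sum(feats1[k] + feats2[k] for k in shared) / denom if denom else 0.0

def disco2(g, w1, w2, vocabulary):
    """Second-order similarity: the same ratio applied to DISCO1 vectors."""
    v1 = {w: disco1(g, w1, w) for w in vocabulary}
    v2 = {w: disco1(g, w2, w) for w in vocabulary}
    shared = [w for w in vocabulary if v1[w] > 0 and v2[w] > 0]
    denom = sum(v1.values()) + sum(v2.values())
    return sum(v1[w] + v2[w] for w in shared) / denom if denom else 0.0

vocab = {"kicked", "filled", "bucket", "ball"}
print(disco1(g, "bucket", "ball"), disco2(g, "bucket", "ball", vocab))
```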
Two sets of DISCO measures were computed, one using the entire corpus
with a window size of r = 1, and a second using the first 1.2 million sentences of
the corpus with a window size r = 2. Limited computing resources prevented
the use of larger window sizes on larger corpora. In the remainder of this paper,
the former set of DISCO measures will be referred to as ‘Set 1’ and the latter as
‘Set 2’.
5. Examining the Compositionality of Idioms with DISCO
The measure of semantic similarity provided by DISCO is used to examine the
compositional properties of our verb + predicate noun combinations, in an
attempt to distinguish the idiomatic combinations from literal structures in
two ways.
First, the average DISCO scores (DISCO1 and DISCO2) are computed (1) between
the verb and the noun; (2) between the verb and the entire VP as identified by a
parser; (3) between the predicate noun and the entire VP; (4) between the verb
and the sentence in which the VP occurs; (5) between the predicate noun and
the sentence in which the VP occurs. The hypothesis here is that given the
relative sparsity of idioms there should be less semantic similarity (i.e. a
lower DISCO measure) between the verb and the noun in the case of idiomatic
verb-noun combinations. Thus, ‘fry someone’s cuttlefish’ (meaning dismiss
someone from his job) is a phrase that is likely to occur in contexts where
employment, work, one’s boss, etc. are discussed rather than food preparation.
And one might expect, due to their non-compositionality, that verbs and nouns
appearing in VNICs are less semantically similar to their surrounding contexts
(both the entire VP and the surrounding sentence) as compared to literal verb-noun combinations.
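A minimal sketch of these five comparisons for a single candidate is given below; it assumes a pairwise similarity function such as the disco1/disco2 sketches above, and it simply averages word-to-word similarities to compare a word with a multi-word context, which is one plausible reading of the procedure rather than the exact method used.

```python
def mean_similarity(word, context_tokens, sim):
    """Average similarity between one word and every other token in a context."""
    others = [t for t in context_tokens if t != word]
    return sum(sim(word, t) for t in others) / len(others) if others else 0.0

def candidate_profile(verb, noun, vp_tokens, sentence_tokens, sim):
    """The five DISCO comparisons listed above for one (verb, noun) candidate."""
    return {
        "verb-noun": sim(verb, noun),
        "verb-VP": mean_similarity(verb, vp_tokens, sim),
        "noun-VP": mean_similarity(noun, vp_tokens, sim),
        "verb-sentence": mean_similarity(verb, sentence_tokens, sim),
        "noun-sentence": mean_similarity(noun, sentence_tokens, sim),
    }

# Usage with the toy data above, e.g.:
# candidate_profile("kicked", "bucket", ["kicked", "the", "bucket"],
#                   ["he", "kicked", "the", "bucket"], lambda a, b: disco1(g, a, b))
```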
Second, semantic vectors are computed for (1) verb + predicate noun, (2) the
entire VP, and (3) the sentence in which the VP occurs. To this end, a singular-value decomposition (SVD) was performed to reduce the dimensionality of the semantic vector space to 120. Note that in this case, since the matrices containing the DISCO measures are symmetric (DISCO1(w, w′) = DISCO1(w′, w), and the same is true for DISCO2), this procedure amounts to finding the top
120 eigenvectors for the DISCO matrices. The three vectors are then compared. As stated above, the hypothesis is that VNICs, being non-compositional, can be expected to have lower semantic similarity to their contexts
than more literal verb-noun combinations, so that the three vectors should
be less similar in the case of an idiomatic verb-noun combination.
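The reduction and comparison step can be sketched as follows with numpy; since a symmetric matrix's SVD coincides with its eigendecomposition, keeping the top 120 components amounts to keeping the leading eigenvectors, as noted above. The random symmetric matrix and the mean-pooling of word vectors into phrase and sentence vectors are stand-ins for illustration, not the study's actual data or composition method.

```python
import numpy as np

def reduce_symmetric(sim_matrix, k=120):
    """Embed words in k dimensions from a symmetric similarity matrix by keeping
    the k largest eigenvalues/eigenvectors (a classical-MDS-style truncation)."""
    eigvals, eigvecs = np.linalg.eigh(sim_matrix)          # ascending order
    top = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))      # drop negative eigenvalues
    return eigvecs[:, top] * scale                         # one k-dim row per word

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in for a symmetric DISCO matrix over a 500-word vocabulary.
rng = np.random.default_rng(0)
m = rng.random((500, 500))
vectors = reduce_symmetric((m + m.T) / 2.0, k=120)

# For illustration, build verb+noun, VP and sentence vectors by mean-pooling
# the reduced word vectors of their tokens, then compare them.
verb_noun = vectors[[3, 7]].mean(axis=0)
vp = vectors[[3, 7, 12]].mean(axis=0)
sentence = vectors[[3, 7, 12, 40, 41]].mean(axis=0)
print(cosine(verb_noun, vp), cosine(verb_noun, sentence))
```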
As an exploratory step, clustering is performed on the three sets of semantic
vectors obtained above.
6. Results
Some differences were found between the DISCO scores of a list of 3,052 VPs
identified as possible VNICs by checking against a dictionary of known VNICs
(labelled ‘VNIC’) and a list of about 90,000 VPs not checked for possible
idiomaticity and assumed to be mostly literal. Table 1 shows average DISCO scores over the two lists.

Table 1: Average DISCO scores for two lists

                        Set 1                   Set 2
                        VNICs      All          VNICs      All
DISCO1
(verb, noun)            0.01578    0.01358      0.01302    0.01068
(verb, VP)              0.04415    0.02822      0.03877    0.03741
(noun, VP)              0.07737    0.08335      0.05058    0.06581
(verb, sentence)        0.04753    0.02958      0.04558    0.03783
(noun, sentence)        0.08516    0.08638      0.05978    0.07019
DISCO2
(verb, noun)            0.00081    0.00365      0.00395    0.00504
(verb, VP)              0.01286    0.02411      0.01013    0.01362
(noun, VP)              0.02348    0.03556      0.00926    0.01092
(verb, sentence)        0.01917    0.03759      0.01140    0.01511
(noun, sentence)        0.03288    0.04659      0.01217    0.01274
The predicate nouns were, on average, less similar to the surrounding VPs
and sentence for VNICs according to DISCO1 but the opposite effect was
observed for the verbs. The effect appeared to be more pronounced for the
Set 2 DISCO1 measures compared to the Set 1 DISCO1 measures.
Interestingly, DISCO1(verb, noun) tended to be very slightly higher for
VNICs—possibly because of many instances of literal readings.
According to DISCO2, both the verbs and the predicate nouns were less
similar to the surrounding VPs and sentence for VNICs, and in this case the
effect was more pronounced for the Set 1 DISCO2 measures than for the Set 2
DISCO2 measures. DISCO2(verb, noun) also tended to be higher for VNICs.
6.1 DISCO-based semantic vectors
The verb + noun vectors were grouped into 5 clusters and the VP and sentence
vectors into 6 to 8 clusters using k-means clustering in R; the number of
clusters to use was determined in each case using a scree plot.
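The clustering itself was done in R; the sketch below is a rough Python analogue using scikit-learn, with random stand-in vectors, an elbow-style plot of within-cluster sums of squares in place of the scree plot, and cluster centers kept for the later cosine comparisons.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.random((3052, 120))   # stand-in for the 120-dimensional semantic vectors

# Plot within-cluster sum of squares (inertia) against k to pick the elbow.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors).inertia_
            for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.savefig("cluster_scree.png")

# Fit the chosen number of clusters (e.g. 5 for the verb + noun vectors)
# and keep the cluster centers for comparison across the two vector sets.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vectors)
centers = kmeans.cluster_centers_
```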
In all cases the sets of sentence vectors were substantially similar, at least in
terms of summary statistics, for both the list of dictionary-identified VNICs
and for the list of all verb-noun combinations; considering the entire sentence
seemed to wash out any semantic idiosyncrasy coming from the verb-noun
combination alone.
No major difference was observed between the summary statistics for the
verb + noun / VP vectors corresponding to the dictionary-identified VNICs,
and for the verb + noun / VP vectors corresponding to the larger set of all verb-noun combinations when the vectors were computed using Set 1 DISCO1
measures; cumulative plots of coordinate distributions seem to show the vectors similarly distributed, and the two sets of cluster centers show a high degree
of similarity when e.g. compared using a cosine similarity measure for vectors.
With vectors computed using Set 2 DISCO1 measures, it was observed that
the two sets of cluster centers—one set corresponding to clusters of semantic
vectors for dictionary-identified VNICs, the other corresponding to clusters of
semantic vectors for all verb-noun combinations—were now substantially different: cosine similarity scores between pairs of vectors (v,w) with v from the
former set of vectors and w from the latter set now ranged between 0.3 and 0.5.
Cumulative plots of coordinate distributions also differed substantially. It
is difficult to visualize these vectors since they lie in high-dimensional
(120-dimensional) space, but the summary statistics suggest that the two sets
are different enough so that e.g. one may design a classifier to distinguish
(probabilistically) between them.
The DISCO2 measures produced a more nuanced picture: when using Set 2
DISCO2 measures, some of the cluster centers were similar, but others were
not—the cosine similarity scores here ranged between 0.29 and 0.99.
Cumulative plots of coordinate distributions were similar in shape.
Examining a small sample of VPs whose vectors landed in each cluster did
not yield any easily discernible patterns.
7. Discussion
The work described explored some aspects of the syntactic fixedness of
idioms as well as their compositionality using a number of quantitative measures. The differences in the results when using DISCO measures with a larger
context window size (Set 2 instead of Set 1) suggest that the compositionality of
idiomatic language is better captured, or indeed may only be captured, by
quantitative measures that take into account longer-range dependencies, even
if only crudely by looking at collocations over a larger window. Further, the
differences in the results obtained with DISCO1 versus DISCO2 measures
seem to suggest that more subtle ‘second-order’ collocations provide information crucial to the determination and understanding of idiomatic language.
Moreover, it appears that none of the measures considered here is sufficiently strong to distinguish between idiomatic and literal combinations on
its own, but the measures may be able to serve as weak classifiers or features
in a bootstrapped cascading classifier that might be able to more accurately
distinguish between idioms and literal verb-noun combinations.
7.1 Consequences for the Lexical Representation of Idioms
Given the demonstrated variability of idiomatic phrases, one must discard the
notion of ‘fixed’ phrases or ‘long words’ and reconsider the lexical status of
idioms. Speakers clearly have ‘entries’ for idiomatic phrases in their mental
lexicons, which allow them to interpret sequences like kick the bucket and
steal someone’s thunder, both in their ‘canonical’ and modified forms.
Grammatical modifications like passivization and relativization show that
the idiom constituents often have independent word status; lexically modified
forms like kick the pail show that the idiom constituents are linked to the literal
meanings. Psycholinguistic experiments (Sprenger, Levelt and Kempen 2006)
suggest a dual representation with access to both the word and the entire
phrase.
What, then, is the best way to capture the dual status of idioms in a lexical
resource? Clearly, entries that suggest a fixed, immutable phrase do not reflect speakers’ use. A learner of English, when encountering an unfamiliar idiom
in a ‘non-canonical’ use, will not find a matching entry. Osherson and
Fellbaum (2010) propose a solution for adding many idioms to WordNet.
They manually inspected the components of over 200 common English
idioms (verb phrases, noun phrases and full sentences) for their metaphoric
status. For example, in the phrase old flames die hard, flames arguably refers to
feelings of affection, while die here means ‘fade’ (rather than ‘undergo the
cessation of all vital functions’). Indeed, a well-known rock music album is
entitled Old Loves Die Hard, suggesting a play on the ‘canonical’ expression.
Other attested variations include when an old flame won’t burn out and rekindling old flames. Osherson and Fellbaum propose the following representation of such idioms in WordNet. First, an entry is created for the entire idiom in its ‘canonical’ and most frequently encountered form. Second, a new entry is created for each
component that is deemed to have a meaning (e.g., flame). For idioms like spill
the beans and let the cat out of the bag, a new synset with the members {cat,
beans} is created. Finally, these new synsets are linked via a pointer to existing
synsets that capture the appropriate meanings. For example, flame is linked to
the synset containing feeling, die to the synset with fade, and {cat, beans} to the
synset containing {secret}. Importantly, the pointer is bi-directional, so that the
uses of the words in the newly created synsets are clearly identified as members
of idioms and thus as having a limited distribution and as assuming the appropriate meaning only within the context of the idioms. This precludes the use
of fade in a context like my old cat faded last night or the use of beans in let me
tell you some beans.
The approach proposed by Osherson and Fellbaum is currently being
adopted for the Open Multilingual Wordnet, which includes a range of typologically diverse languages (Bond et al. 2014), with the goal of better automatic
cross-linguistic mapping and translation.
8. Future Work
The results presented here point to further exploration of semantic vector space
models. In particular, neural networks or other deep learning techniques to
build and train semantic vector space models have already been used in
NLP applications such as semantic compositionality and sentiment analysis
(Socher et al. 2013, Pennington et al. 2014). Such techniques could be applied
using objective functions that are chosen specifically to pick out idioms and
may yield insightful results.
References
A. Dictionaries
Jiao, L. and B. Stone. 2014. Five Hundred Common Chinese Proverbs and Colloquial
Expressions: An Annotated Frequency Dictionary. London: Routledge.
Li, Mao. 2004. Hanyu guanyongyu cidian. Shanghai: Hanyu Da Cidian Chubanshe.
Mei, Jiaju, Zhu, Yiji, Gao, Yunqi and Yin, Hongxiang. 1996. Tongyici Cilin. Shanghai.
B. Other Literature
Bond, F., C. Fellbaum, S.-K. Hsieh, C.-R. Huang, A. Pease and P. Vossen. 2014.
‘A Multilingual Lexico-Semantic Database and Ontology’ In Buitelaar, P. and P. Cimiano (eds), Towards the Multilingual Semantic Web. Heidelberg: Springer,
243–258.
Cacciari, C. and S. Glucksberg. 1994. ‘Understanding figurative language’ In Gernsbacher, M. A. (ed.), Handbook of Psycholinguistics. New York: Academic
Press, 447–477.
Church, K.W. and P. Hanks. 1989. ‘Word association norms, mutual information, and
lexicography’ In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 76–83.
Cook, P., A. Fazly and S. Stevenson. 2007. ‘Pulling their weight: exploiting syntactic
forms for the automatic identification of idiomatic expressions in context’ In
Proceedings of the ACL’07 Workshop on A Broader Perspective on Multiword
Expressions. Prague, 41–48.
Cook, P., A. Fazly and S. Stevenson. 2009. ‘Unsupervised type and token identification
of idiomatic expressions.’ Computational Linguistics 35: 61–103.
Dobrovol’skij, D. and E. Piirainen. 2005. ‘Cognitive theory of metaphor and idiom
analysis.’ Jezikoslovlje 6(1): 7–35.
Fazly, A. and S. Stevenson. 2006. ‘Automatically constructing a lexicon of verb phrase
idiomatic combinations’ In Proceedings of the 11th Conference of the European
Chapter of the Association for Computational Linguistics (EACL). Trento, Italy,
337–344.
Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT
Press.
Fellbaum, C. (ed.). 2006. ‘Corpus-Based studies of German idioms and light verbs.’
Special Issue, International Journal of Lexicography 19.4.
Fellbaum, C. 2011. ‘Idioms and collocations’ In von Heusinger, K., C. Maienborn and P. Portner (eds), Semantics: An International Handbook of Natural Language Meaning,
Part 1. Berlin: De Gruyter Mouton, 439–454.
Fellbaum, C. 2014. ‘Syntax and grammar of idioms and collocations’ In Kiss, T. and A. Alexiadou (eds), Handbook of Syntax. Berlin: De Gruyter Mouton, 776–802.
Firth, J. R. 1957. Papers in Linguistics 1934–1951. Oxford University Press.
Harris, Z. 1954. ‘Distributional structure.’ Word 10.2–3: 146–162.
Huang, C.-R., S. K. Hsieh, J. F. Hong, Y. Z. Chen, I. L. Su, Y. X. Cheng and S.
W. Huang. 2008. ‘Chinese WordNet: Design, implementation, and application of
an infrastructure for crosslingual knowledge processing’ In Proceedings of the 9th
Chinese Lexical Semantics Workshop, Singapore.
Kolb, P. 2009. ‘Experiments on the difference between semantic similarity and relatedness’ In Proceedings of the 17th Nordic Conference on Computational Linguistics –
NODALIDA ’09, Odense, Denmark.
Lin, D. 1998. ‘Extracting Collocations from Text Corpora’ In Proceedings of the
Workshop on Computational Terminology. Montreal, Canada, 57–63.
Moon, R. 1998. Fixed Expressions and Idioms in English: A Corpus-Based Approach.
Oxford University Press.
Nunberg, G., I. A. Sag, and T. Wasow. 1994. ‘Idioms.’ Language 70.3: 491–538.
Osherson, A. and C. Fellbaum. 2010. ‘The Representation of Idioms in WordNet’ In
Proceedings of the Fifth Global WordNet Conference, Mumbai, India.
Pennington, J., R. Socher and C. D. Manning. 2014. ‘GloVe: Global Vectors for Word
Representation’ In Proceedings of the Conference on Empirical Methods in Natural
Language Processing 12.
Shannon, C. and W. Weaver. 1963. Mathematical theory of communication. University of
Illinois Press.
Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng and C. Potts. 2013.
‘Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank’
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 1642–1654.
Sprenger, S., W. J. M. Levelt and G. Kempen 2006. ‘Lexical access during the production of idiomatic phrases.’ Journal of Memory and Language 54.2: 161–184.
Zhou, R. 2010. Master’s thesis, Xiangtan University, Xiangtan, Hunan, China.
Zhu, F. and C. Fellbaum. 2014. ‘Automatically Identifying Chinese Verb-Noun
Collocations’ In Dalmas M. and E.Piirainen (eds), Figurative Language: Festschrift
for D. Dobrovol’skij. Tübingen: Stauffenburg, 187–200.