Word-pair extraction for lexicography
Chris Brew & David McKelvie*
Language Technology Group
Human Communication Research Centre
2 Buccleuch Place
EDINBURGH EH8 9LW
SCOTLAND
Abstract. We describe an application of sentence alignment techniques and approximate string matching to the problem of extracting lexicographically interesting word-word pairs from multilingual corpora. Since our interest is in support systems for lexicographers rather than in fully automatic construction of lexicons, we would like to provide access to parameters allowing a tunable trade-off between precision and recall. We evaluate two techniques for doing this. Since sentence alignment tends to associate semantically similar words, while approximate string matching draws attention to orthographic similarities, they can be used to serve different lexicographic purposes, as can the combination of the two techniques, which amounts, inter alia, to a tool for uncovering faux amis. We conclude by sketching a simple and flexible means for allowing lexicographers to provide information which has the potential to improve system performance.
1 Introduction
One of the central challenges of computational lexicography is to design computational tools which allow lexicographers to do what they have always done, only better. This has received far less attention than the almost unrelated task of taking machine-readable dictionaries and pressing them into service for computational linguistics, which is well described by Wilks, Slator and Guthrie [19]. While the use of corpora is a common thread, the priorities of lexicographers are, unsurprisingly, different from those of computationalists (see [18] for an account of how frequency information is used, and [3] for a description of the use of mutual information and other statistics to uncover sense information). We present techniques which are applicable to multilingual corpora, describing our evaluation of them as information sources for those involved in the construction of bilingual dictionaries.2
At minimum, a bilingual dictionary needs to indicate which words and phrases of one language are possible translations of words and phrases in the other language. A helpful bilingual dictionary will also draw the reader's attention to information which goes beyond this, such as facts about which words and phrases are commonly mistranslated by reason of orthographic similarity (i.e. faux amis). It is for the lexicographer to determine whether information about faux amis is important enough to include in a dictionary, but several dictionaries do, including the Cambridge International Dictionary of English [14], which we will abbreviate as CIDE.
* We are grateful to the UK EPSRC's Integrated Lexical Database project for funding this work, to the European Commission and the MULTEXT project for providing access to tools and corpora, and to the UK ESRC and the Universities of Edinburgh, Glasgow and Durham for providing infrastructure and support.
2 We are grateful to Diane Nicholls of Cambridge Language Services for annotating some of the data.
As computationalists, we want to aid the lexicographer by filtering out pairs of words and phrases which are extremely unlikely to be worthy of comment, presenting only those pairs for which human intuition is needed. In information retrieval (IR) terminology this is an attempt to maximise the precision of the retrieval process. Of course, since the primary purpose is to augment the capability of the lexicographer, it is important to present some pairs which would not otherwise have been considered. In IR terms this requires us to maximise the recall of the retrieval process. As always, there is a trade-off between precision and recall. Increased recall can always be obtained by accepting a larger percentage of false positives, and the attendant responsibility of checking many screenfuls of largely spurious output.
2 Plan of the paper
Our experiments use the English and French components of a multilingual corpus of parliamentary questions taken from the MLCC Multilingual Corpus (not yet published, but to be made publicly available in the near future). In this paper we restrict our attention to verbs, largely because these receive finely articulated classification in the Cambridge International Dictionary of English [14], as well as elsewhere [12]. A crucial secondary reason is a remark made by Briscoe and Carroll:

    over half the analysis failures for unseen corpus material were caused by incorrect subcategorisation for predicate valency [4].

One consequence of this, pointed out by Briscoe and Carroll, is that NLP systems would benefit greatly from any techniques which can extract useful argument structure information from corpora. We think that the same is true for lexicography, and are currently carrying out experiments to test the coverage of the verb-frame annotation scheme provided in CIDE.
2.1 A corpus of word-pairs
We begin with an un-annotated corpus of English and French. The French part contains
1,221,579 tokens (including punctuation) and the English part 1,046,420 tokens (again including
punctuation).
Our first step was to pass the French and English texts through an HMM-based part-of-speech tagger [2] using lexicons and transition tables prepared under the MULTEXT project. Different tag sets were used for French and English. For each tag-set we identified the verbal categories and lemmatised the verbs to their base forms. This process identified 98,713 verb tokens in the French part of the corpus and 96,202 verb tokens in the English part. This corresponds to 1818 verb types for French and 2065 verb types for English.
Having identified verbs on both sides of the presumed translation equivalence relation, we use sentence alignment [8]. We used Afzal Ballim's implementation of Gale and Church's alignment algorithm, yielding a sentence-level alignment of the texts. However, the performance of the Gale-Church algorithm is highly dependent on cross-language regularities in the use of punctuation and spacing [6, pp. 67-74]. For some documents, notably those for which formatting conventions are different in the two languages, or where there are substantial blocks of extraneous text in one of the language pairs, the unmodified Gale-Church algorithm performs abysmally. Davis et al. suggest a scheme for weighting different sources of evidence to improve alignment. Among the factors which they consider is the matching of Arabic numbers and English phrases, commenting that these increasingly appear in texts in non-European languages. Since the Gale-Church algorithm works on the lengths of the sentences, one can modify the algorithm by filtering the sentences before aligning, e.g. by replacing words with their POS tags, or by removing all words except for some subset of the words. Our heuristic was to remove all words except punctuation and proper names and to use just these to align. This seems to work best for the corpus of parliamentary questions which we used. This is an interesting corollary of Davis's result, indicating that when consistent formatting information is present, it can pay to throw away everything else.
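As a concrete illustration of this filtering heuristic, the following is a minimal Python sketch; the whitespace tokenisation and the capitalisation test for proper names are our own simplifications, not the implementation actually used.

```python
import re

def filter_for_alignment(sentence):
    """Keep only punctuation and (crudely guessed) proper names.

    The idea is that consistent formatting survives translation better
    than ordinary words, so a length-based aligner such as Gale-Church
    is run on the filtered text only.  The initial-capital test is a
    rough proxy for proper names and will wrongly keep sentence-initial
    words; a real system would use the POS tags already assigned.
    """
    kept = []
    for token in sentence.split():
        if re.fullmatch(r"[.,;:!?()\"'-]+", token):
            kept.append(token)   # pure punctuation token
        elif token[0].isupper():
            kept.append(token)   # capitalised: proper-name proxy
    return " ".join(kept)
```

For example, `filter_for_alignment("the commission asked Mr Delors , twice .")` keeps only `"Mr Delors , ."`, which is the kind of skeleton the heuristic aligns on.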
Having established sentence alignments, we generate a set of word-word pairs by simply enumerating all possible pairings (fr, eng) which do not cross the boundaries of an alignment unit. If we assume that the alignment process is infallible, and that verbs in French are always translated by verbs in English, then we can assume that the relation generated by this procedure is a superset of the true translation relation about which we want to inform ourselves. We make these assumptions for their heuristic value only, since there is ample evidence that both are incorrect. If we were to make the rash claim that this method were appropriate for machine translation, we would be open to the charge that we are simply using a grotesquely oversimplified variant of the IBM approach [5]. For our purposes the criticism has less force, because:
- We are interested in uncovering relations of potential inter-translatability between word types, whereas machine translation requires that we hypothesise translation relations between word tokens, which is a much more demanding task.
- We can afford to tolerate considerable error in the assessment of token alignments as long as the type alignments are useful.
- Because of the presence of a human supervisor, there is no pressing need for even the type alignments to be uniformly correct.
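The pair-enumeration step described above can be sketched as follows; the representation of alignment units as parallel lists of verbs is an illustrative assumption, not the paper's actual data structure.

```python
from collections import Counter

def verb_pairs(aligned_units):
    """Enumerate all (French verb, English verb) pairs that do not
    cross an alignment-unit boundary, counting occurrences of each
    pair type over the whole corpus.

    aligned_units: iterable of (french_verbs, english_verbs) tuples,
    one per sentence-alignment unit.
    """
    counts = Counter()
    for fr_verbs, en_verbs in aligned_units:
        # Cartesian product within the unit only
        for f in fr_verbs:
            for e in en_verbs:
                counts[(f, e)] += 1
    return counts
```

The resulting counts over pair types are exactly what the collocation statistics of section 3 consume.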
Note that the relation of potential inter-translatability between word types is similar in flavour to the relationship between the components of lexical collocations in monolingual work. Our reduction of the corpus to a collection of token-token pairs makes it possible for us to press the technology of collocation analysis into the service of our current goals. Potential inter-translatability is a symmetric relation, since the criterion for admission is that sufficient sentences can be found in which the English word seems like a translation of the French word.
The corpus of word-pairs uncovered by sentence alignment contains 161,430 distinct types, corresponding to 430,132 tokens (a mean of 2.668 occurrences per type, with maximum 995, minimum 1, standard deviation 7.61). English verbs align with a mean of 78.62 different French verbs (maximum 888, minimum 1, standard deviation 118.00). French verbs align with a mean of 90.14 English verbs (maximum 1255, minimum 1, standard deviation 145.235).
Along with sentence alignment we chose a second means of uncovering relations between English and French words, namely approximate string matching [13]. The relationship we are really interested in is that which holds between words which look as if they might be translations. In lieu of that we fall back on various heuristic measures of orthographic similarity. Both techniques pick out (and order) a subset of the Cartesian product of all the words in the corpus, so each word pair falls into one of the categories shown in table 1. The rest of the paper is concerned with the task of assigning word-pairs to the appropriate cells.
2.2 Evaluation
Since there are 1818 × 2065 = 3,754,170 pairs it was not feasible to manually evaluate every pair.

                Cognate +        Cognate -
   Aligned +    Cognates         Translations
   Aligned -    False Friends    Unrelated

   Table 1. Overall design of the paper

The evaluation was carried out by one of the authors, relying on a good reading knowledge of French, the Collins Pocket English-French dictionary [10] and an appreciation of the likely topics of discussion in the European Parliament. For even this relatively small portion of the space of
verbs it was too time consuming to consult the corpus for every pair, although this would have been ideal given unlimited time and patience. We evaluated all pairs which gave a likelihood ratio score of greater than 7, and some of the other pairs thrown up by the various methods of approximate string matching. This made a total of 16,051 distinct verb-verb pairs, of which we judged 14,067 to be impossible as translation pairs, and 1983 to be possible. The evaluation effort took the best part of six working days. In a pilot study, working with a smaller corpus, we checked for the possibility that there were large numbers of potentially correct translations which we had not considered, and found nothing new in a sample of 1000 random translations. It is certain that some correct translations have been missed, and we do not have data on the reproducibility of the judgements.
Not all verbs will have an alignment with another verb, since some will align with semantically null support verbs. We evaluated on the basis that the alignment of "inform" with "faire" is incorrect, although the alignment of "faire part" with "inform" is correct. Equally, many verbs have more than one correct alignment. From French to English "interdire" can be "ban", "bar", "forbid", "outlaw" or "prohibit", while from English to French "show" can be "attester", "démontrer", "manifester", "montrer", "prouver", "révéler" or "témoigner".
3 Search for inter-translatability in alignment data
In this section we report results of an effort to find inter-translatable words in the corpus of word-pairs. In section 4 we cover the use of approximate string-matching to uncover effects of orthographic similarity. In either case, we conceptualise the task as using some statistic (or, in some cases, quasi-statistic) as an ordering heuristic for a set of pairs too large to examine in its entirety. We rely on the statistics for illumination rather than support.

The most effective ordering heuristic for this data turned out to be the binomial likelihood ratio statistic that has been used for collocation analysis by Dunning [7]. The first strategy which we investigated was that of simply taking the best match for each verb in each direction. From French to English this yields 864 correct (48.2%), 920 incorrect (51.3%) and 7 unevaluated (0.3%). From English to French the corresponding figures are 953 correct (47.25%), 1063 incorrect (52.7%) and 1 unevaluated (0.05%). Recall is difficult to evaluate definitively, because of the number of pairs that would need to be scanned. It certainly cannot be better than 864/1983 = 0.43 for French to English or 953/1983 = 0.48 for English to French.
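For concreteness, Dunning's −2 log λ statistic for a single word pair can be computed from a 2×2 contingency table of counts. This sketch follows the standard formulation of the statistic in [7]; the function and variable names are our own.

```python
from math import log

def llr(k11, k12, k21, k22):
    """Dunning's binomial likelihood ratio (-2 log lambda) for a 2x2
    contingency table: k11 = alignment units pairing the two verbs,
    k12 and k21 = units containing one verb but not the other,
    k22 = units containing neither.  Larger values indicate stronger
    association; 0 means the counts look exactly independent.
    """
    def slog(ks):
        # sum of k * log(k), with the 0 * log(0) = 0 convention
        return sum(k * log(k) for k in ks if k > 0)

    n = k11 + k12 + k21 + k22
    return 2 * (slog([k11, k12, k21, k22])       # individual cells
                - slog([k11 + k12, k21 + k22])   # row sums
                - slog([k11 + k21, k12 + k22])   # column sums
                + slog([n]))                     # grand total
```

For perfectly independent counts such as `llr(10, 10, 10, 10)` the statistic is 0; strongly associated counts such as `llr(100, 0, 0, 100)` give a large positive value, which is what the thresholds in tables 2 and 3 are applied to.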
3.1 Improving Precision
It is for lexicographers to decide whether these levels of precision and recall are appropriate for their preferred working style; if the hypotheses thrown up in the correct 50% are interesting enough, the 50% of noise may well be tolerable. Nevertheless, we assume that this level of noise is not adequate, and investigate ways of reducing it. The first tactic which we investigate is that of imposing a threshold on the likelihood figure.3
   min(−2 log λ)   N      Correct   Precision
   612             33     33        100%
   144             256    243       95.3%
   53              502    453       90.2%
   24              816    659       80.8%
   13              1363   831       61.0%
   0               1791   864       48.2%

   Table 2. The effect of thresholding (French to English)
   min(−2 log λ)   N      Correct   Precision
   613             32     32        100%
   91              354    337       95.2%
   43              590    532       90.2%
   23              878    705       80.3%
   13              1454   898       61.76%
   0               2017   953       47.24%

   Table 3. The effect of thresholding (English to French)
Tables 2 and 3 show the result of thresholding the data at various points in the scale defined by the likelihood ratio statistic. Over half the correct translation pairs are available by cutting off at a level which gives 90% precision. Another simple approach is to consider as equivalent only those word pairs which satisfy the condition that each is the other's most preferred partner. This is the Best Match condition used by Gaussier, Langé and Meunier [9]. There are 826 pairs which meet this condition in the current data, of which 631 (76.4%) are correct and 195 (23.6%) incorrect. Thresholding this data at a likelihood ratio of −2 log λ > 24 produces 571 answers of which 519 are correct (90.9% precision). In order to attain 90% precision on the complete French-to-English data we had to threshold at −2 log λ > 53, yielding only 453 correct answers. However, for English to French we had to threshold at −2 log λ > 43, yielding 532 correct answers. If we require 95% precision we can threshold the symmetrical part of the data at −2 log λ > 33, yielding 460 correct answers out of 483. To achieve the 95% level on the French-to-English data we had to threshold at −2 log λ > 144, yielding only 243 correct answers. For English to French the 95% threshold is −2 log λ > 91, yielding 337 correct answers.
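The combination of the symmetry (Best Match) criterion with thresholding can be sketched as follows. The dictionary-of-scores input format is an illustrative assumption; any mapping from candidate pairs to likelihood-ratio scores would do.

```python
def mutual_best_matches(scores, threshold=0.0):
    """Gaussier-style symmetry filter: keep (f, e) only when e is f's
    highest-scoring English partner AND f is e's highest-scoring
    French partner, and the score clears the threshold.

    scores: dict mapping (french, english) -> likelihood-ratio score.
    """
    best_en = {}   # french word  -> (best score, english partner)
    best_fr = {}   # english word -> (best score, french partner)
    for (f, e), s in scores.items():
        if s > best_en.get(f, (-1.0, None))[0]:
            best_en[f] = (s, e)
        if s > best_fr.get(e, (-1.0, None))[0]:
            best_fr[e] = (s, f)
    return {(f, e) for (f, e), s in scores.items()
            if s > threshold
            and best_en[f][1] == e and best_fr[e][1] == f}
```

Raising `threshold` moves along the precision/recall curve exactly as in the tables: fewer pairs survive, but a larger proportion of the survivors are correct.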
3 In the work on cognate-hood, McEnery and Oakes [13] proceed analogously, giving data on the way in which estimated probability of cognate-hood varies with threshold levels of a summary figure of merit.
3.2 Increasing recall at the expense of precision
We tried increasing recall by relaxing the condition that the pairs considered are the best matches for some word. Allowing second-best matches in the French-to-English data produces 1183 matches from 3517 attempts (33% precision) compared with 864 from 1791 (48% precision) using only the best matches. For English to French we get 1258/4001 (31% precision) rather than 953/2017 (47% precision). Adding Best (or rather K-Best) Match to this gets us 1008/1873 (53.8% precision). The threshold for 90% precision on this data is −2 log λ > 64, yielding 482 correct answers from 532 guesses. This is inferior to the corresponding result using best matches only.

We conclude that relaxation of the best match criterion is useful mainly to those lexicographers who are prepared to tolerate a relatively low level of precision (in other words, those who do not find the scanning of lists of word-pairs too onerous). For those, it may be of interest to know that 1364 word pairs can be had by allowing the best 3 matches from French to English, finding 177 more than by using the best two matches, at the cost of scanning 5145 rather than 3517 pairs. This is a drop in overall precision from 33% to 24%. For English to French a further 148 pairs can be found by scanning 5845 rather than 4001 pairs, a drop in precision from 31% to 24%. It is a matter of taste whether one finds this prospect attractive.
3.3 Conclusions on the use of alignment information
The use of the log-likelihood statistic in order to derive word-word relations from sentence alignments seems to work satisfactorily in diagnosing potential inter-translatability. There is the usual trade-off between precision and recall, with the best results obtained from a combination of thresholding and the use of Gaussier's symmetry criterion. About 30% of the known translation pairs can be found with a precision of close to 90%. These techniques have so far been developed on a single corpus, and for one particular language pair. We intend to remedy this by applying the techniques to other language pairs within the nine-language parallel corpus.

It should be straightforward to extend this work to parts-of-speech other than verbs. We plan to carry out the same analysis for nouns and adjectives. We expect to use the heuristic that interesting verb-noun equivalences probably occur for verbs and nouns which do not have high-scoring verb-verb or noun-noun equivalences. This may make it possible to uncover an interesting range of phrasal and support verbs without the need to enumerate pairs and combinations of words.4
4 Approximate string matching for cognate extraction
The next task is to identify cognates and false friends. Our strategy is to use the translation pair information to classify orthographically similar words. We therefore need a way of detecting orthographic similarity. We began from the work of McEnery and Oakes [13], who use a variant of Dice's coefficient originally due to Adamson and Boreham [1]. We report results from six variants of this method.
4 But see the discussion of false friends later for evidence that it is hard to detect the absence of translation equivalence.
1. dice The original Adamson and Boreham method. This counts the number of shared character bigrams. The formula is:

       2 |bigrams(x) ∩ bigrams(y)| / (|bigrams(x)| + |bigrams(y)|)

   where bigrams is the function which reduces a word to a multi-set of character bigrams. Order of the bigrams is insignificant.
2. xdice A variant of dice which allows "extended bigrams". By this we mean ordered letter pairs which are formed by deleting the middle letter from a three-letter substring of the word. The formula is:

       2 |xbigrams(x) ∩ xbigrams(y)| / (|xbigrams(x)| + |xbigrams(y)|)

   where xbigrams denotes the operation of forming a set of bigrams (as before) and adding the extended bigrams.
3. wdice As dice, except that bigrams are weighted in inverse proportion to their frequency. The weights were assigned by the formula:

       weight(bigram_i) = (Ntokens + Ntypes) / (freq(bigram_i) + 1)

   where Ntokens stands for the number of bigram tokens seen in the corpus of English and French, and Ntypes stands for the number of distinct bigram types. This is a simple variant of a standard frequency weighting scheme from information retrieval. We changed only the numerator of the dice formula, so the coefficient is no longer certain to fall between 0 and 1. Applying the same transformation to the denominator would have fixed the problem.
4. wxdice The algorithm which is to xdice as wdice is to dice.
5. xxdice Another transformation of the numerator, this time using string positions as well as extended bigrams. When a bigram is found to be shared, its contribution to the numerator is

       2 / (1 + (pos(x) − pos(y))²)

   where pos returns the string position at which the bigram was found, rather than simply 2. This penalises matches between bigrams from different parts of the word.
6. lcs Longest common subsequence. The formula is:5

       2 |lcs(x, y)| / (|x| + |y|)
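The following sketch implements dice, xdice, xxdice and the lcs similarity. The handling of repeated bigrams and of bigram positions in xxdice is under-specified in the text, so the choices made here (multi-sets via Counter, first-occurrence positions) are our own assumptions; wdice and wxdice would follow by weighting the shared counts.

```python
from collections import Counter

def bigrams(w):
    """Multi-set of character bigrams of a word."""
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def xbigrams(w):
    """Bigrams plus 'extended' bigrams: each three-letter substring
    with its middle letter deleted."""
    c = bigrams(w)
    c.update(w[i] + w[i + 2] for i in range(len(w) - 2))
    return c

def _dice(bx, by):
    shared = sum((bx & by).values())   # multi-set intersection size
    return 2 * shared / (sum(bx.values()) + sum(by.values()))

def dice(x, y):
    return _dice(bigrams(x), bigrams(y))

def xdice(x, y):
    return _dice(xbigrams(x), xbigrams(y))

def _positions(w):
    """First position of each (extended) bigram -- our assumption."""
    p = {}
    for i in range(len(w) - 1):
        p.setdefault(w[i:i + 2], i)
    for i in range(len(w) - 2):
        p.setdefault(w[i] + w[i + 2], i)
    return p

def xxdice(x, y):
    """Positional variant: a shared bigram contributes
    2 / (1 + (pos_x - pos_y)**2) rather than simply 2."""
    px, py = _positions(x), _positions(y)
    num = sum(2 / (1 + (px[b] - py[b]) ** 2) for b in px if b in py)
    return num / (len(px) + len(py))

def lcs_sim(x, y):
    """2|lcs(x, y)| / (|x| + |y|), with the longest common subsequence
    length found by simple dynamic programming (not Hunt-Szymanski)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[-1]))
        prev = cur
    return 2 * prev[-1] / (len(x) + len(y))
```

All four return 1.0 for identical words and decay towards 0 as the words diverge, which is what the thresholding in section 4.2 exploits.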
4.1 Results Using Gaussier's Best Match Filter
Using Gaussier's best match filter as before, lcs found 553 translation pairs out of 767 (72.1% precision). The dice coefficient found 581 from 860 (67.6% precision); wdice found 582 from 862 guesses (67.5%, not much help); xdice found 602 from 867 (69.4%); wxdice found 602 from 866 (again a small benefit); xxdice finds 587 from 902 (65% precision). The move from bigrams (dice) to extended bigrams (xdice) is useful, but the term-weighting strategies have not paid off much for the symmetrical part of the data.
5 We used Hunt and Szymanski's algorithm [11], relying on the description given by Graham Stephen [17]. An earlier version of this paper described a more elaborate algorithm of our own devising which worked on tries. This is potentially time-efficient, but involves large data structures and proved impractical for our application.
4.2 Results using thresholding
Tables 4, 5 and 6 indicate how the precision and recall of the various string similarity metrics
can be manipulated by use of thresholding. In every case we restrict our attention to the top
20000 pairs thrown up by the ordering. The data suggests that dice works better than wdice
at the high end of the scales, but note that wdice recovers 38 more correct pairs if one is
prepared to tolerate 50% precision. The results for dice are not directly comparable with those
of McEnery and Oakes [13] because of our use of part-of-speech tagging to pick out only verbs.
xdice is marginally inferior to wxdice throughout. The introduction of gappy bigrams seems
to be advantageous, since xdice outperforms dice and wxdice outperforms wdice. However,
table 6 shows that xxdice outperforms all the other methods on this data, yielding to wdice
only in yield at very low thresholds. It is unlikely that this notional superiority of wdice would
ever matter, while the high precision of xxdice in more accessible regions of the space is clearly
important.
   min(dice)  N      Correct  Precision     min(wdice)  N      Correct  Precision
   0.90       114    110      96.5%         5.35        87     82       94.3%
   0.85       219    193      88.12%        5.00        159    136      85.5%
   0.80       299    255      85.3%         4.65        284    227      79.9%
   0.70       860    450      52.3%         3.8         932    488      52.3%
   0.43       20000  763      3.81%         2.03        20000  784      3.92%

   Table 4. The effect of thresholding on dice and wdice
   min(xdice)  N      Correct  Precision     min(wxdice)  N      Correct  Precision
   0.90        73     70       95.9%         5.3          127    119      93.7%
   0.85        169    153      90.5%         5.07         192    173      90.1%
   0.80        300    263      87.7%         4.7          333    272      81.7%
   0.65        1021   496      48.8%         3.7          1088   558      51.3%
   0.4         20000  772      3.86%         0.4          20000  799      4.0%

   Table 5. The effect of thresholding on xdice and wxdice
5 Combining Sentence Alignment and Approximate String Matching
We now return to the original purpose of the paper: finding translations and false friends for lexicographers. Since we are interested in finding pairs which look like translations, we fix on the string comparison method which has the best success in finding translations, namely xxdice. We collected all the pairs which have either a likelihood ratio greater than 13 or an xxdice coefficient greater than 0.5 (thresholding close to 50% precision in each case). There are 1278 of these. 339 are over threshold on both criteria, 778 are orthographically similar but below
   min(xxdice)  N      Correct  Precision     min(lcs)  N      Correct  Precision
   0.813        83     82       98.7%         0.924     74     72       97.3%
   0.778        145    143      98.6%         0.90      74     72       97.3%
   0.715        332    306      92%           0.86      278    232      83.45%
   0.50         1127   576      51.1%         0.79      1100   526      47.8%
   0.22         20000  749      3.74%         0.615     20000  787      3.94%

   Table 6. The effect of thresholding on xxdice and lcs
threshold on alignment, 151 are aligned but not orthographically similar. There are 880 distinct
English verbs and 908 distinct French ones.
A reliable method for finding translation pairs. The pairs for which both sentence alignment and xxdice are above threshold are highly likely to be translations. Of the 339 pairs falling into this category, 338 are translations.
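Putting the two thresholds together, the assignment of pairs to the cells of table 1 can be sketched as follows; the mapping of pairs to (likelihood ratio, xxdice) score tuples is an illustrative assumption.

```python
def classify(pair_stats, llr_threshold=13.0, sim_threshold=0.5):
    """Assign each word pair to a cell of table 1, using the two
    thresholds quoted in the text (-2 log lambda > 13, xxdice > 0.5).

    pair_stats: dict mapping (french, english) -> (llr_score, sim).
    Pairs that are both aligned and orthographically similar land in
    'cognates' (the highly reliable translations of this section);
    similar-but-unaligned pairs are the false-friend candidates.
    """
    cells = {"cognates": [], "translations": [],
             "false friends": [], "unrelated": []}
    for pair, (llr_score, sim) in pair_stats.items():
        aligned = llr_score > llr_threshold
        similar = sim > sim_threshold
        if aligned and similar:
            cells["cognates"].append(pair)
        elif aligned:
            cells["translations"].append(pair)
        elif similar:
            cells["false friends"].append(pair)   # candidate faux amis
        else:
            cells["unrelated"].append(pair)
    return cells
```

Note that, as the next subsection discusses, a pair in the "false friends" cell may merely be a translation unattested in the corpus, so the cell is a pool of candidates for the lexicographer rather than a verdict.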
A method for finding false friends. We now turn to a means of finding false friends. In principle, all that is required is to look for pairs which have good orthographic similarity, but which are not matched by the likelihood ratio technique.

Unfortunately, there are two possible reasons why pairs should not appear in the aligned corpus:

1. They are not in fact translations.
2. They are potential translations, but the translation was not used in the corpus at hand. This is a sparse data effect.
The first results which we show are for French-to-English. There are 45 cases where the best orthographic match is better than the orthographic match for the pair having the highest alignment score.
There are useful suggestions in the 45, notably that "contenir" might mean "contain" rather than "contend", that "confondre" might mean "confuse" rather than "conform", that "commander" might mean "commission" rather than "commandeer" and that "alléger" means "alleviate" rather than "allege". All of this needs careful monitoring by the lexicographer, because along with the successful corrections come pairs of words both of which are translations (for example that "allouer" can be translated by "allot" or by "allocate", and that "stocker" is either "stock" or "store"). More dubiously, we are told that "réduire" is not "require" but "reduce", that "saturer" is not "mature" but "saturate" and that "saper" is not "paper" but "sap". None of these confusions would ever have arisen for a human reader: they are unwelcome artefacts of the approximate string matching. xxdice is not noticeably better or worse than any of the others in throwing up this sort of nonsense. Finally, we are informed that "inverser" means "notice" rather than "reverse", which is counter-productive.
Going from English to French, there are 48 suggestions. Good ones include the suggestion that "comporter" is a better translation of "comprise" than "comprimer" (which actually means "compress") and that "allege" is to be translated by "alléguer" rather than "alléger". We are also told that "refléter" (which means "reflect as in a mirror") is a less good translation of "reflect" than "réfléchir" (to reflect on some matter), which, while untrue in general, is undoubtedly a faithful reflection of what one expects to find in a corpus of parliamentary language.
Viewed solely as techniques for detecting false friends, the methods in this paper are at an early stage of development. It looks as if the main limiting factor is the unselectivity of the methods for detecting orthographic similarity, which throw up very many "confusions" that would never trouble a human reader. If this is fixed, the next limiting factor will be the sparsity of the data available in bilingual corpora, closely followed by the limited linguistic range of the corpora currently available.
6 Conclusions
The research in this paper has all been aimed towards the generation of suitably ordered lists of word pairs which a lexicographer can use as input. The main features of the effort have been:

1. The use of likelihood ratio techniques for picking out word-pairs whose alignment in a corpus is significant.
2. The use of k-best matching to get back some of the recall lost by using Gaussier's Best Match criterion to increase precision. As far as we know the combination of k-best and the symmetry criterion is new.
3. An evaluation of several potentially useful variants of Dice's coefficient.
4. The xxdice coefficient, combined with the use of likelihood ratio, forms an extremely high precision tool for the detection of translation pairs in our corpus.
5. A preliminary technique, needing extensive human supervision, for pulling out false friend information from corpus data.
An emphasis on the needs of the lexicographer is the driving force behind the work reported, so we are hoping to carry the same principle into extensions of the work. We are aware of various arbitrary choices which we have made, including the use of likelihood ratio as an ordering and thresholding criterion. While we are provisionally satisfied with this criterion, we would prefer to avoid hard-wiring any criterion. An obvious option is to tune an initial presentation order by allowing the lexicographer to mark word pairs which are deemed of particular interest, adjusting the presentation order in such a way as to minimise the number of distractors which are presented before the items of interest, and maximise the probability that interesting pairs are found in the early part of the ordering. Such techniques are common in document retrieval applications [15]. Their primary advantage is an ability to provide a powerful tunable mechanism from whose complexities the user is almost entirely sheltered, since they need do nothing more than mark interesting pairs. Adding a "More pairs" button would provide analogous feedback relevant to the precision/recall trade-off discussed above. It might even be useful to generate new variations of the ordering criteria on the fly by means of some suitable adaptive process. With or without the fulfilment of this last, highly speculative possibility, the aim is to provide a system capable of attuning itself to the lexicographer's idea of lexicographic interest.
References
1. G.W. Adamson and J. Boreham. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Information Storage and Retrieval (10), pp. 253-60, 1974.
2. Susan Armstrong, Graham Russell, Dominique Petitpierre and Gilbert Robert. An Open Architecture for Multilingual Text Processing. In From Texts to Tags: Issues in Multilingual Language Analysis: Proceedings of the ACL SIGDAT Workshop, Dublin, Ireland, March 27, 1995.
3. Simon Baugh. ILD Corpus and Lexicon Search System User Manual. Unpublished manuscript, Cambridge Language Services, 1996.
4. Edward Briscoe and John Carroll. Toward Automatic Extraction of Argument Structure from Corpora. Rank Xerox Research Centre Technical Report MLTT-006, Meylan, 1994.
5. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2), pp. 263-311, 1993.
6. Mark Davis, Ted Dunning and Bill Ogden. Aligning noisy corpora. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pp. 67-74, March 27-31 1995, University College Dublin, Belfield, Dublin, Ireland.
7. Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), pp. 61-74, 1993.
8. William A. Gale and Kenneth W. Church. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics 19(1), pp. 75-102, 1993.
9. E. Gaussier, J.-M. Langé and F. Meunier. Towards Bilingual Terminology. In Proceedings of the ALLC/ACH Conference, pp. 121-24, OUP, Oxford, 1992.
10. Pierre-Henri Cousin. Collins Pocket French Dictionary. Collins, London and Glasgow, 1987.
11. J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM 20(5), pp. 350-3, May 1977.
12. Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, 1993.
13. Tony McEnery and Michael Oakes. Sentence and word alignment in the CRATER Project. In Thomas & Short (eds), Using Corpora for Language Research, pp. 211-231.
14. Paul Procter (ed). Cambridge International Dictionary of English. Cambridge University Press, 1995.
15. G. Salton and C. Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41(4), pp. 288-297, 1990.
16. Jenny Thomas and Mick Short (eds). Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. Longman, London and New York, 1996.
17. Graham A. Stephen. String Search. Technical Report TR-92-gas-01, School of Electronic Engineering Science, University College of North Wales, Bangor, UK, 1992.
18. Della Summers. Computer Lexicography: the importance of representativeness in relation to frequency. In Thomas & Short, pp. 260-266.
19. Yorick A. Wilks, Brian M. Slator and Louise M. Guthrie. Electric Words: Dictionaries, Computers and Meanings. MIT Press, Cambridge, MA, 1996.