A Comparison of Spelling-Correction Methods for
the Identification of Word Forms in Historical Text
Databases*
ALEXANDER M. ROBERTSON and PETER WILLETT
University of Sheffield, UK
Abstract
This paper discusses the application of algorithmic spelling-correction techniques to the identification of those words in databases of sixteenth-, seventeenth-, and eighteenth-century English texts that are most similar to a query word in modern English. The experiments involved the n-gram matching, phonetic and non-phonetic coding, and dynamic-programming methods for spelling correction, and demonstrate the general effectiveness of this approach to the identification of historical word forms. The best results are given by the editcost-distance and longest-common-subsequence methods, which use a dynamic-programming algorithm; however, these are only slightly more effective than the digram-matching method, which is far faster in operation.
1. Introduction
An increasing amount of historical text is being stored
in text retrieval systems. The resulting databases
employ the original spellings, which are often very
different from the spelling used today owing to the
changes that have taken place over the centuries
(Vallins, 1965). In addition, spelling today has become
relatively standardized, but the concept of 'correct'
spelling is quite a modern one (Burgess, 1975), and
thus a given word might well appear in several equally
valid forms in a historical database that contains documents from the same time period. This presents a problem to users who wish to carry out searches of historical
text databases. Specialist users who are familiar with
older spelling conventions would be unlikely to be
aware of the full range of possible variants for a query
word that exist in such a database, while a non-specialist or casual user would be likely to submit a
query in modern spelling. In either case, records will
only be retrieved if they contain words which are spelled
in the same way as the query, and the user will therefore not be able to retrieve material that involves alternative spellings. This problem could be alleviated by
the development of computational techniques that
could transform the users' search terms into the historical forms used in the database. In what follows, we
shall refer to the present-day and historical spellings of
Correspondence: Peter Willett, Department of Information
Studies, University of Sheffield, Western Bank, Sheffield
S10 2TN, UK
*Some of the material in this paper was first presented at the
Fifteenth International Conference on Research and Development in Information Retrieval (Copenhagen, June 1992) and at
the Sixteenth International Online Information Meeting (London,
December 1992).
Literary and Linguistic Computing, Vol. 8, No. 3, 1993
2. Identification of Variant Spellings
2.1 Introduction
A spelling checker detects which words in a text may be
incorrectly spelt by comparing each word with a stored
© Oxford University Press 1993
a word as a modern-form and an old-form, respectively.
The identification of the old-forms that are associated with a query modern-form is a specific example of a
more general problem in information retrieval, namely
conflation (i.e. the equivalencing of different forms of
the same natural language word so that documents can
be retrieved even if they are not indexed by the precise
form of a word that has been specified in a query).
There are many types of word variant that can be
processed given an appropriate conflation procedure;
for example, morphological variants can be conflated
by the use of a stemming algorithm, alternative American and English spellings by means of rules that define
the acceptable equivalences, and spelling mistakes and
their correct spellings by an appropriate correction
algorithm. In this paper, we discuss the use of the last
of these conflation techniques, spelling correction, for
the conflation of modern-forms and their related old-forms. Thus, given a query of a historical text database
that has been specified in modern-day English, we wish
to identify related old-forms that could then be added
to the query to maximize the retrieval of relevant documents.
Our work has taken as its starting-point an M.Sc.
dissertation project (Rogers and Willett, 1991) that
used the reverse-error and phonetic-coding methods of
spelling correction on a test collection drawn from the
Hartlib Papers (Leslie, 1990). This work demonstrated
the potential of this general approach to the conflation
of old-forms and modern-forms, and showed that the
phonetic coding methods that were tested (Russell,
1918, 1922; Davidson, 1962; Gadd, 1988, 1990) gave a
better level of performance than the reverse-error
method (Damerau, 1964). The present paper reports
the results of using further historical text databases and
several additional correction methods that encompass
the full range of techniques that are currently available
for the automatic correction of spelling mistakes. Specifically, in addition to the phonetic-coding methods
studied previously, we have evaluated n-gram matching (Freund and Willett, 1982; Angell et al., 1983), the
SPEEDCOP non-phonetic coding method (Pollock and
Zamora, 1984), and two dynamic-programming
methods (Needleman and Wunsch, 1970; Wagner and
Fischer, 1974; Kruskal and Sankoff, 1983).
Table 1 Modern-forms and their equivalent old-forms from the Hartlib dataset.

Modern-Form     Old-Form(s)
ACCOUNT         ACCOMPT, ACCOUMPTE, ACCOUNTE
AFFAIRS         AFAYRES, AFFAYES, AFFAYRES
ARITHMETIC      AERITHMATICKE
BREASTWORK      BRESTWOORKE
CANVAS          CANUAISE
FRIEND          FFRINDE, FREIND, FREND, FRINDE
FUEL            FEWELL
JUSTLY          IUSTELY, IUSTLIE
LIKELIHOOD      LIKELYHODE, LIKELYHOODE, LYKELYHOOD
MANUFACTURE     MANIFACTURIE
MEEKNESS        MEKNES
NEIGHBOUR       NEYGHBOR
PROFIT          PROFFITT
RASPBERRIES     RESBERYES
RECRUIT         RECREUT, RECRUITE
REPUBLICS       RESPUBLIQUES
SABBATH         SABOTH
THYME           TIME
UNHAPPINESS     VNHAPINES
VALUE           VALEWE
YIELD           YEILDE
2.2 Reverse-error Method
Analyses of spelling errors in modern, machine-readable text files show that about 80% of errors fall
into one of four categories (Damerau, 1964; Pollock
and Zamora, 1983):
(1) Insertion errors: one extra character inserted,
e.g. CRAOB for CRAB.
(2) Deletion (or omission) errors: one character
deleted, e.g. CONTNT for CONTENT.
(3) Substitution errors: an incorrect character
substituted for a correct one, e.g. INPIT for
INPUT.
(4) Transposition errors: two adjacent characters
transposed, e.g. KYEED for KEYED.
An additional category may be defined that includes
those misspellings which contain more than one of the
four classes of error above: other categorizations of
spelling mistakes have been reported by Joseph and
Wong (1979), Mitton (1987) and Yannakoudakis and
Fawthrop (1983). Damerau's classification forms the
basis for the reverse-error method of spelling correction, which involves testing systematically for each
possible error type and then assuming that the correct
spelling has been identified if the reversal of one of
the operations above results in a word that is in the
dictionary (Damerau, 1964). We have not studied
this approach in the work reported here in view of its
poor performance (relative to the phonetic-coding
methods) in the initial study of historical text searching
by Rogers and Willett (1991); however, we shall return
to Damerau's classification when discussing the
SPEEDCOP and dynamic-programming methods.
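The reverse-error test can be sketched as follows; this is a minimal illustration of Damerau's idea, not the published implementation, and the function names are ours:

```python
# Sketch of the reverse-error method: generate every string obtainable by
# reversing one of the four single-error operations, then keep those that
# appear in the dictionary.

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def single_error_reversals(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}                # reverses an insertion error
    inserts = {l + c + r for l, r in splits for c in ALPHABET}   # reverses a deletion error
    subs = {l + c + r[1:] for l, r in splits if r for c in ALPHABET if c != r[0]}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    return deletes | inserts | subs | transposes

def correct(misspelling, dictionary):
    """Dictionary words reachable by reversing exactly one error."""
    return sorted(single_error_reversals(misspelling) & set(dictionary))
```

For example, `correct("CRAOB", {"CRAB", "CRIB"})` returns `["CRAB"]`, since deleting the extra O reverses a single insertion error.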
2.3 Coding Methods
Many coding methods have been reported in the literature, all of which involve coding both dictionary words
and misspellings and then matching them if they have
identical, or near-identical, codes. Codes retain the
characters that are assumed to be the most significant,
i.e. known to be least likely to be involved in an error:
these are generally the initial letter and the consonants
(Yannakoudakis and Fawthrop, 1983). The code is normally of fixed length, with codes that are longer than
this threshold being truncated and shorter codes being
padded with blanks.
2.3.1 The Hartlib Code. The oldest and best-known
phonetic-coding method is the Soundex code, which
was originally developed for matching similar surnames
in the US census (Russell, 1918, 1922) and which involves conflating groups of similar-sounding characters.
Since the introduction of Soundex, there have been
many modifications of it, such as the Davidson Consonant Code (Davidson, 1962), which was devised for
name-management applications in airline booking systems, and the Phonix code (Gadd, 1988, 1990), which
was devised for use in on-line public-access catalogues.
The main feature of the Phonix code is its use of a list of some
150 phonetic substitutions that are applied before the
coding itself is carried out, these substitutions consistLiterary and Linguistic Computing, Vol. 8, No. 3, 1993
dictionary and then flagging any words which do not
appear (Pollock and Zamora, 1984; Damerau and
Mays, 1989). More sophisticated checkers additionally
employ algorithms to identify possible correct versions
of the misspelt words. Correction methods make two
main assumptions. The first is that there is a correct
version of the word, and the second that an acceptable
variant spelling would be recognizable to the author: if
either of these is not so, then correction becomes much
more problematical. The first assumption is certainly
correct in the context of this paper, since what we call
the 'correct' spelling corresponds to the modern-form
of the word. The second assumption is reasonable for
the sixteenth-, seventeenth-, and eighteenth-century
spellings considered in the present paper, as is exemplified by many of the old-forms and modern-forms
shown in Table 1 (which is drawn from the Hartlib
Papers dataset described in Section 3.1.1). The second
assumption would not be generally correct for English
text from earlier periods (unless the retrieval system
was being used by an experienced searcher who was
intimately acquainted with the spelling conventions of
the time period that was being searched).
Pollock (1982) divides error correction methods into
absolute methods and relative methods. Absolute
methods are based on previous knowledge of likely
errors, while relative methods involve selecting words
which are in some sense similar to the incorrect form.
The former are of limited use in that most spelling
errors do not recur in a reasonable text span (Pollock
and Zamora, 1983) (and there is the added problem in
the present context that there can be any number of
old-forms corresponding to a single modern-form).
Most research has thus focused on the relative
methods, which require a dictionary from which to
select possible correct versions of misspelt words (i.e.
in the present context, the old-forms that are equivalent to a query modern-form). There are three main
classes of relative correction methods: reverse-error methods, phonetic and non-phonetic coding methods, and string-similarity methods.
Table 2 Phonetic substitutions list (Rogers and Willett, 1991)
for the Hartlib phonetic-coding method.

Location   Substitution        Location   Substitution
S          Vc → U              ME         IGHT → IT
S          IN → EN             ME         YGHT → IT
S          IM → EM             ME         GHT → GH
S          Yc → I              ME         LOUGH → LOW
S          J → I               ME         OUGH → OF
M          DC → G              SME        vUv → V
ME         CE → SE             ME         RUv → RV
ME         CK → K              ME         LUv → LV
M          vQv → KW            ME         DUv → DV
M          vJv → Y             M          J → I
M          OWG → OUG           E          ETH → S
ME         GTH → GHT           ME         MPT → MT
• For all characters except the first, strip all vowels,
and occurrences of H, W, and Y.
• Reduce all contiguous multiple character occurrences to single occurrences.
• For all characters except the first, substitute numerals as follows:
1 for C or K;
2 for F or V;
3 for S or Z.
• The first eight remaining characters (padded with
trailing blanks as necessary) constitute this code.
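A minimal sketch of this coding step (applied after the substitutions of Table 2) might look as follows; the handling of case, run reduction, and padding is our reading of the description above, not the authors' Pascal implementation:

```python
def hartlib_code(word, length=8):
    """Sketch of the Hartlib coding step described above, applied after
    the phonetic substitutions of Table 2."""
    word = word.upper()
    first, rest = word[0], word[1:]
    # For all characters except the first, strip vowels and H, W, Y.
    rest = "".join(c for c in rest if c not in "AEIOUHWY")
    # Reduce contiguous runs of the same character to a single occurrence.
    reduced = ""
    for c in first + rest:
        if not reduced or c != reduced[-1]:
            reduced += c
    # For all characters except the first, substitute numerals:
    # 1 for C or K; 2 for F or V; 3 for S or Z.
    digits = {"C": "1", "K": "1", "F": "2", "V": "2", "S": "3", "Z": "3"}
    coded = reduced[0] + "".join(digits.get(c, c) for c in reduced[1:])
    # First eight remaining characters, padded with trailing blanks.
    return coded[:length].ljust(length)
```

Under this reading, ACCOUNT codes to "A1NT" padded to eight characters: the vowels O and U are stripped, the double C collapses, and the surviving C becomes the numeral 1.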
2.3.2 The SPEEDCOP Codes. SPEEDCOP is an
example of a non-phonetic coding method that is in
operation on a day-to-day basis for the identification of
spelling errors in the many scientific databases produced by Chemical Abstracts Service (Pollock and
Zamora, 1983, 1984). Like all coding methods, it is
based on the idea that a code captures the essence of a
word, so that the codes of a misspelling and the corresponding correct word resemble each other more closely
than do the originals; ideally, the codes should be identical. If a code is generated for each word and the set
of such codes sorted into alphabetical order, then the
Literary and Linguistic Computing, Vol. 8, No. 3, 1993
distance that entries sort apart is a measure of the
similarity of the words from which the codes were
generated. The SPEEDCOP workers investigated two
types of code, which they referred to as keys: the skeleton key and the omission key. Three principles underlie
both of these keys: (1) the key must retain the fundamental features of a word or its misspelling; (2) the key
must be similar to the word, but not too similar; (3) the
key must be insensitive to typical spelling-error operations.
The skeleton key contains the word's (or the misspelling's) first letter, then the unique consonants in
their order of occurrence in the word, and then the
unique vowels in their order of occurrence in the word.
'Unique' has a slightly different meaning for vowels and
consonants in that if the first letter is a vowel, then this
will reappear in the key if it occurs later, whereas a
consonant never appears more than once in the key.
The skeleton key relies on the early consonants being
correct, so that the nearer a wrong consonant is to the
start of the word, the further apart will be the word and
its misspelling in a list sorted by the key.
The vulnerability of the skeleton key to early consonant damage led to the development of the omission key. Analysis of the results of earlier spelling-correction experiments showed that consonants tend to
be omitted from modern words in the frequency order
RSTNLCHDPGMFBYWVZXQKJ
and the omission key for a string is constructed by
ordering its unique consonants according to the reverse
of this sequence, i.e.
JKQXZVWYBFMGPDHCLNTSR
and then appending the unique vowels in their original
order. Letter content, rather than letter order, is thus
the basis of this key. Examples of the two types of key
are given in Table 3.
Table 3 Words and associated SPEEDCOP codes for some
of the words in Table 1.

Modern-Form    Skeleton Key   Omission Key
ACCOUNT        ACNTOU         CNTAOU
BREASTWORK     BRSTWKEAO      KWBTSREAO
LIKELIHOOD     LKHDIEO        KDHLIEO
MANUFACTURE    MNFCTRAUE      FMCNTRAUE
MEEKNESS       MKNSE          KMNSE
NEIGHBOUR      NGHBREIOU      BGHNREIOU
PROFIT         PRFTOI         FPTROI
RASPBERRIES    RSPBAEI        BPSRAEI
SABBATH        SBTHA          BHTSA
VALUE          VLAUE          VLAUE
YIELD          YLDIE          YDLIE
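The two keys can be sketched directly from the description above; the treatment of Y as a consonant, and of a first-letter vowel that recurs later, follows our reading of the text rather than the SPEEDCOP source:

```python
VOWELS = "AEIOU"
# Consonants ordered from least- to most-often omitted (reverse of RSTN...J).
OMISSION_ORDER = "JKQXZVWYBFMGPDHCLNTSR"

def skeleton_key(word):
    """First letter, then unique consonants in order of occurrence,
    then unique vowels in order of occurrence."""
    word = word.upper()
    key = word[0]
    for c in word[1:]:                      # consonants, each at most once
        if c not in VOWELS and c not in key:
            key += c
    vowels = ""
    for c in word[1:]:                      # a leading vowel may reappear here
        if c in VOWELS and c not in vowels:
            vowels += c
    return key + vowels

def omission_key(word):
    """Unique consonants sorted by reverse omission frequency, then unique
    vowels in original order; letter content, not letter order, matters."""
    word = word.upper()
    cons, vowels = "", ""
    for c in word:
        if c in VOWELS:
            if c not in vowels:
                vowels += c
        elif c not in cons:
            cons += c
    return "".join(sorted(cons, key=OMISSION_ORDER.index)) + vowels
```

These reproduce the Table 3 entries, e.g. `skeleton_key("ACCOUNT")` gives ACNTOU and `omission_key("ACCOUNT")` gives CNTAOU.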
2.4 String-similarity Methods
String-similarity measures provide a quantitative
measure of the degree of resemblance between a pair of
strings. Thus, when they are used for spelling correction, the degree of similarity is calculated between an
ing of pairs of character substrings that are phonetically
similar; in addition, the code replaces initial vowels
with 'v', rather than the vowel itself. Once any appropriate substitutions have been made, the code is created
using a Soundex-like scheme, in which the digits 1-8
replace groups of similar sounding characters.
The Soundex, Davidson, and Phonix codes were all
tested in our initial study of historical text searching
(Rogers and Willett, 1991), as was a new phonetic-coding method, called Hartlib, that was developed
from Phonix specifically for searching for historical
word variants. The Hartlib code uses a set of phonetic
substitutions that are appropriate for the conflation of
character substrings in modern and seventeenth-century English (specifically, the Hartlib Papers dataset that is discussed further in Section 3.1.1, below),
rather than for the correction of modern misspellings as
in the original Phonix code. Once any appropriate substitutions have been made, as detailed in Table 2, the
Hartlib code is created as follows:
input misspelling and each word in a dictionary, and the
most similar word is then assumed to be the correct
spelling.
2.4.1 N-grams. An n-gram is a substring of length n
characters that is derived from a word of length not less
than n (Freund and Willett, 1982; Angell et al., 1983);
although it is possible to derive n-grams whose letters
are not adjacent, we have considered only those where
the letters are adjacent. Two different lengths of n-gram were used in the tests: digrams with a length of
two, and trigrams with a length of three. One padding
space was added to both ends of each word before the
generation of digrams, and two padding spaces before
the generation of trigrams. Thus, the word SUBSTRING results in the generation of the digrams

*S, SU, UB, BS, ST, TR, RI, IN, NG, G*

and the trigrams
**S, *SU, SUB, UBS, BST, STR, TRI, RIN,
ING, NG*, G**
where * denotes a padding space. There are n + 1 such
digrams and n + 2 such trigrams in a word containing
n characters.
In this method, the greater the number of n-grams
that two words have in common, the more similar the
words are judged to be. In the present context, old-forms having the largest numbers of n-grams in common with a given modern-form are expected to be
associated with it, and should thus be retrieved when
that modern-form is used as a query term during a
search of a text database.
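The generation of padded n-grams, and the count of n-grams shared by two words, can be sketched as follows (a minimal illustration; the paper's own implementation was in Pascal):

```python
from collections import Counter

def ngrams(word, n):
    """Adjacent n-grams of a word, padded with n - 1 spaces at each end
    (one space for digrams, two for trigrams, as described above)."""
    padded = " " * (n - 1) + word + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngrams(a, b, n=2):
    """Number of n-grams two words have in common (multiset intersection)."""
    return sum((Counter(ngrams(a, n)) & Counter(ngrams(b, n))).values())
```

A word of n characters yields n + 1 digrams and n + 2 trigrams, as the text states; SUBSTRING, with nine characters, yields ten digrams and eleven trigrams.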
2.4.2 Dynamic-programming Methods. Dynamic programming is a technique that is very widely used to calculate the homology, i.e. the degree of correspondence, between two sequences (Kruskal and Sankoff,
1983). Many different measures of homology have been
described; we have used the editcost distance, i.e. the
number of insertions, deletions, and substitutions
needed to convert one sequence into the other, and the
longest common subsequence (LCS) between two sequences (Needleman and Wunsch, 1970; Wagner and
Fischer, 1974).
Our initial tests used the algorithm of Needleman and Wunsch (1970), which represents all possible character matches between a pair of words in a two-dimensional array, D, as shown in Fig. 1 for the modern-form USUAL and the old-form VSULLE. In
this example, each cell of the array is set to zero if a
character in one sequence is the same as a character in
the other, and to one otherwise (although other weights
can be used). The Needleman-Wunsch algorithm
calculates the maximum match, which is the largest
number of characters from one sequence that can be
matched with those of another while allowing for all
possible interruptions in either sequence. This gives
rise to a large number of comparisons, but excludes
those comparisons that cannot contribute to the maximum match. Even so, the algorithm proved to be very
time-consuming and our dynamic-programming experiments thus used the algorithm due to Wagner and
• The cost of transforming A(i - 1) to B(j - 1),
plus the cost of changing A(i) to B(j).
• The cost of transforming A(i - 1) to B(j), plus
the cost of deleting A(i).
• The cost of transforming A(i) to B(j - 1), plus
the cost of inserting B(j).
More formally, let γ(A(i) → B(j)), γ(A(i) → Λ), and
γ(Λ → B(j)) be the cost functions for a substitution, a
deletion, and an insertion, respectively, where Λ is the
null character. Each element of D, D(i,j) (1 ≤ i ≤ m,
1 ≤ j ≤ n, where m and n are the lengths of the old-form
and modern-form, respectively) is then calculated using
the following recurrence formula:

D(i,j) = min { D(i-1, j-1) + γ(A(i) → B(j)),
               D(i-1, j) + γ(A(i) → Λ),
               D(i, j-1) + γ(Λ → B(j)) }

The editcost value is the value in cell D(m,n). The
editcost is zero when the words that are being compared are identical.
The initial values of the elements of the array D need
not be 0 or 1 (as in Fig. 1). If

γ(A(i) → B(j)) ≥ γ(A(i) → Λ) + γ(Λ → B(j)),

i.e. if the cost of a substitution is greater than or equal
to the sum of the costs of a deletion and an insertion,
then it is possible to calculate the length of the LCS by
a simple backtracking through the array. If A and C are
strings of length m and n, respectively, such that C could
be obtained by deleting zero or more elements from A,
then C is a subsequence of A; thus, COURSE, for
Fig. 1 Dynamic programming array showing (a) initial and (b) final values for the Wagner-Fischer editcost-distance method.
Fischer (1974), which uses the same array data structure but which is about three times as fast for the data
used here. This algorithm calculates the editcost distance, which is a measure of the number of editing
operations required to convert one string into the
other. Three operations are recognized, these being the
insertion, deletion, and substitution operations that
have been described in Section 2.2 when discussing the
reverse-error correction method (the transposition
operation can be defined in terms of the other three
operations).
Given two strings, A and B, such as those shown in
Fig. 1, the Wagner-Fischer algorithm involves the recursive calculation of the minimum of:
example, is a subsequence of COMPUTER SCIENCE.
String C is a common subsequence of the strings A and
B if it is a subsequence of both: it is the LCS if it is a
common subsequence and if it is as long as any other
common subsequence of A and B. The LCS provides
an alternative measure of similarity to the editcost distance, and one that is, perhaps, intuitively more understandable to the user than a measure based on abstract
editing operations.
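The LCS length can be computed with the same array arrangement as the editcost calculation; this is the standard formulation, sketched here for illustration:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1   # extend the common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```

For the text's example, COURSE is itself a subsequence of COMPUTER SCIENCE, so the LCS of the two strings has length six.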
3. Experimental Details
3.1.2 Eighteenth-Century Short-Title Catalogue. The
Eighteenth-Century Short-Title Catalogue (ECSTC) is
an international project which was established at the
British Library in 1977 and which aims to create an
estimated half a million machine-readable records of
books, pamphlets, and ephemeral material printed in
the eighteenth century and contained in the collections
of over 1,000 libraries worldwide (Crump and Harris,
1983; Crump, 1989). There are approximately 325,000
entries to date. Much of the material has never previously been catalogued, and includes, inter alia:
society memberships, lists of items for sale and of
commodities; advertisements, notices, and circulars;
shipping lists and transport timetables; and election
ephemera.
The ECSTC database can be searched on-line using
the BLAISE-LINE host system provided by the British
Library and this database was used to download a total
of 23,141 titles, in batches of 500 titles at a time. The
3.1.3 Canterbury Cathedral Library Catalogue. This
is a catalogue of pre-1901 books in MARC format that
uses facilities and software developed at the University
of Kent (Shaw, 1991). It can be consulted on-line
through JANET, the UK academic computer network.
In addition, the software has been used for cataloguing
other collections of early printed books, mainly other
cathedral libraries or local collections at the university
or elsewhere in East Kent. The system runs on a DEC
VAX-Cluster mainframe under the VMS operating
system.
The catalogue contains about 28,000 records. In this
work, we used all of the English-language titles from
the sixteenth and seventeenth centuries. The files from
both centuries were sufficiently small so as to permit
them to be processed in the same manner as the Hartlib
letters, i.e. the titles were read and the variant spellings
identified manually. Pairs and master dictionaries were
created for each file in the same manner as for the other
two datasets. The sixteenth-century dataset contains
604 and 1,906 entries and the seventeenth-century dataset 856 and 5,437 entries, for the pairs and master
dictionaries, respectively. The sizes of all of the dictionaries used in the experiments are listed in Table 4.
Table 4 Numbers of old-forms in the pairs and master dictionaries used in the experiments.

Dataset          Pairs    Master
Hartlib          2,620    12,191
Canterbury-16      604     1,906
Canterbury-17      856     5,437
ECSTC            3,755    30,103
3.2 Ranking of Old-Forms
Each of the modern-forms in the pairs dictionary was
used as a query for a search of the master dictionary
to find the ten or twenty old-forms that were most
similar to it. The search procedure was as follows, each
of the methods being implemented in Pascal on an IBM
3083 mainframe computer operating under VM/CMS
at the University of Sheffield Academic Computing
Services:
3.1 Test Datasets
3.1.1 Hartlib Papers. The initial data used in this
study derives from the Hartlib Papers Project, which
has been under way in Sheffield since 1987 (Leslie,
1990). The aim of the Project is to transcribe and edit
about 20 million words of seventeenth-century manuscripts in the form of letters, memoranda, and treatises,
these being the surviving working papers of Samuel
Hartlib, and to make this information available as a
searchable database.
The dataset used here was derived from 210 of
Hartlib's English-language letters, and was processed
to generate two dictionaries: the 'pairs dictionary' and
the 'master dictionary'. The pairs dictionary was
created by taking a sample of eighty-eight of these
letters and noting all words in it that did not conform
to modern-day spelling, as represented by the Oxford
English Dictionary. The modern-form was matched
with the corresponding old-form, and the resulting file
of word pairs, i.e. one modern-form and one old-form,
sorted so as to bring together all variants of a modern-form. The resulting file contained a total of 2,620
unique word pairs, representing 2,195 modern-forms
and 2,620 associated old-forms, i.e. some of the
modern-forms had more than one associated old-form.
The old-forms associated with a given modern-form are
referred to subsequently as appropriate old-forms. The
master dictionary contained all the 12,191 distinct words
in the set of 210 letters, with the 2,620 old-forms in the
pairs dictionary being a subset of these.
resulting dataset was far too large to be read through,
one record at a time, to identify the old-forms (as had
been done with the much smaller Hartlib collection).
Accordingly, old-fashioned spellings were identified by
the use of the spelling checker in the Microsoft
WORKS word-processing package and flagged as possible candidates for inclusion in the dataset; note that
this means that any words which are correct modern
words, but which are variant spellings in an eighteenth-century context, are not included in the dataset unless
the person processing the file happened to spot them.
In all other respects, the ECSTC test collection was
created in an identical manner to the Hartlib test collection, with the pairs and master dictionaries here having
3,755 and 30,103 entries, respectively.
The similarities for the n-gram-matching and
dynamic-programming methods were calculated using
the Dice coefficient; this coefficient was adopted because it is simple to understand and to compute, it is
widely used in tests of document retrieval systems
(Salton and McGill, 1983) and has been found to be at
least as effective as other similarity measures for historical text searching (Frith et al., 1993). If two words
are of lengths Y and Z, and X is the score (i.e. the
number of n-grams in common for the n-gram-matching method, or the editcost distance or the length
of the LCS for the dynamic-programming methods)
then the Dice coefficient is:
2X / (Y + Z)
Values of the coefficient range between zero and unity,
these corresponding to words having nothing at all in
common and to identical words, or words containing
identical sets of n-grams, respectively. An exception is
its use in conjunction with the editcost calculation of
Wagner and Fischer (1974), where the calculated score
is bounded by the length of the longer sequence, and
can thus assume a value greater than unity. Note also
that the Dice coefficient value for the editcost distance
is the inverse of that for the other calculations in that
identical sequences would give a value of zero.
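Taking Y and Z as the numbers of digrams in each padded word (one natural reading of the definition above), the digram-based Dice similarity can be sketched as:

```python
from collections import Counter

def digrams(word):
    """Padded digrams: one space added at each end of the word."""
    padded = f" {word} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def dice_digram_similarity(a, b):
    """Dice coefficient 2X/(Y+Z), with X the number of shared digrams and
    Y, Z the numbers of digrams in each word."""
    ga, gb = digrams(a), digrams(b)
    shared = sum((Counter(ga) & Counter(gb)).values())
    return 2 * shared / (len(ga) + len(gb))
```

Identical words give a value of one; as the note above explains, the editcost-based variant behaves inversely, with zero for identical sequences.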
Once the ten or twenty most similar old-forms had
been identified, they were then compared with the sets
of appropriate old-forms from the pairs dictionary that
corresponded to the query modern-form, and the effectiveness of the search evaluated.
3.3 Evaluation of Retrieval Effectiveness
The effectiveness of retrieval was evaluated in a manner
analogous to that used for the evaluation of experimental document retrieval systems (Salton and McGill,
1983). The recall of a search for a given modern-form is
defined to be the percentage of the appropriate old-forms that are retrieved. If a modern-form has a set, A,
of appropriate old-forms and a set, B, of appropriate
old-forms has been retrieved, then the recall is defined
to be:
(|B| / |A|) x 100.
In a real text-searching application, some number of
the most similar old-forms for a query modern-form
would be displayed at a terminal, so that the user could
select appropriate old-forms for inclusion in the query
(Frith et al., 1993). The recall here was calculated for
all of the methods using a fixed cut-off of the ten or
twenty most similar old-forms.
Each modern-form in the pairs dictionary for a dataset was used as a query for a search of the corresponding master dictionary, and the recall calculated in each
case, using the information about the appropriate
old-forms that is included in the pairs dictionary. The
overall performance of a correction method was then
obtained as the mean of the individual recall values,
when averaged over the entire set of modern-form
queries associated with a dataset.
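The evaluation described above can be sketched as follows (set-based, with hypothetical inputs; the actual evaluation used the pairs and master dictionaries):

```python
def recall(appropriate, retrieved):
    """Recall of one search: percentage of the appropriate old-forms
    that appear in the retrieved set."""
    hits = appropriate & retrieved
    return 100.0 * len(hits) / len(appropriate)

def mean_recall(queries):
    """Mean recall over (appropriate, retrieved) pairs, one pair per
    modern-form query."""
    return sum(recall(a, r) for a, r in queries) / len(queries)
```

For instance, retrieving one of two appropriate old-forms gives a recall of 50 for that query, and the overall figure is the mean over all modern-form queries.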
We note here that some of the searches may have
underestimated the true recall owing to the fact that not
all of the appropriate old-forms may have been known
for some of the modern-forms in the Hartlib and
ECSTC datasets. The Hartlib modern-form queries
were drawn from the eighty-eight letter subset of the
full set of 210 Hartlib letters, and it is hence likely that
some appropriate old-forms were not registered as
such, since they did not occur in this sample. Thus some
of the old-forms retrieved should undoubtedly have
been considered appropriate, as should some of the
old-forms that were not retrieved. This situation is
analogous to that pertaining in many document test
collections where only some of the documents have
been evaluated for relevance purposes and where the
recall figures therefore refer to a recall base that is less
than complete. This problem does not occur with the
Canterbury datasets, since all of the appropriate old-forms should have been identified during the analyses
of the texts (because each one of the records was read
in toto). In the case of the ECSTC dataset, appropriate
old-forms may have been missed if they were also valid
modern spellings (because they would have been
missed by the WORKS spelling checker).
The searches themselves were carried out as follows:

• In the case of the digram and trigram experiments, the modern-form was broken up into its constituent n-grams, the number of n-grams in common with each of the words in the master dictionary identified, the Dice coefficient value calculated (as described below), and the ten or twenty most similar old-forms retrieved.
• In the case of the Hartlib phonetic-coding and SPEEDCOP experiments, the appropriate code was generated for the modern-form. The most similar old-forms were then obtained by retrieving the old-forms for which the corresponding codes occurred in the ten- or twenty-code window centred on the code for the query modern-form.
• In the case of the dynamic-programming methods, the matrix operations were carried out as detailed in Section 2 to calculate the editcost or LCS values; these values were then used for the calculation of the Dice coefficient (as described below), and the ten or twenty most similar old-forms retrieved.

Literary and Linguistic Computing, Vol. 8, No. 3, 1993

3.4 Use of Substitutions

The Hartlib phonetic-coding method improves the performance of the Phonix method by replacing certain character strings by others in the old-forms and modern-forms before generating the codes themselves (Rogers and Willett, 1991). It was decided to test these phonetic substitutions to see if they could also improve the performance of the other correction methods studied here.

Each substitution in Table 2 details a search string, a replacement string, and the necessary location of the search string. Location is indicated by S, M, or E (for start, middle, and end of a word), where 'middle' means any position in the word so long as the first and last characters are not involved. The lower-case characters v and c in search strings are wild cards indicating any vowel or consonant, respectively. The replacement strings replace only the upper-case characters in the corresponding search strings. Thus S CLv KL states that CL is replaced by KL if the letters CL start a word and are followed immediately by a vowel. The substitution instructions are obeyed in the order shown in Table 2. When the location is in the middle of the word and the search string appears in more than one middle position, then it is replaced at each occurrence.

Two sets of experiments were carried out. The results listed under M1 in the Results (see Section 4, below) used the correction methods as described in Section 2, while those listed under M2 involved applying the substitutions in Table 2 to the query modern-form and to each of the old-forms in the master dictionary, prior to matching.
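To make the rule format concrete, the following Python sketch (our own illustration: the rule set shown is invented and is not the actual Table 2 list) applies S/M/E-coded substitutions with the v and c wild cards, assuming upper-cased words and that the replaced (upper-case) characters form a contiguous prefix of each search string, as in the S CLv KL example above:

```python
# Illustrative sketch of Table 2-style substitution rules (our own code;
# the rule set below is invented and is NOT the actual Table 2 list).
VOWELS = set("AEIOU")

def matches(word, pos, pattern):
    """Check whether `pattern` (upper-case literals plus the wild cards
    'v' = any vowel, 'c' = any consonant) matches `word` at index `pos`."""
    if pos + len(pattern) > len(word):
        return False
    for ch, p in zip(word[pos:pos + len(pattern)], pattern):
        if p == 'v':
            if ch not in VOWELS:
                return False
        elif p == 'c':
            if not ch.isalpha() or ch in VOWELS:
                return False
        elif ch != p:
            return False
    return True

def apply_rule(word, location, pattern, replacement):
    """Apply one substitution rule; only the upper-case prefix of the
    pattern is replaced, and wild-card positions are kept unchanged."""
    lit = sum(1 for p in pattern if p.isupper())  # length of replaced prefix
    if location == 'S':                  # anchored at the start of the word
        if matches(word, 0, pattern):
            word = replacement + word[lit:]
    elif location == 'E':                # anchored at the end of the word
        pos = len(word) - len(pattern)
        if pos >= 0 and matches(word, pos, pattern):
            word = word[:pos] + replacement + word[pos + lit:]
    else:                                # 'M': every internal occurrence,
        pos = 1                          # never touching the first or last
        while pos + len(pattern) < len(word):   # character of the word
            if matches(word, pos, pattern):
                word = word[:pos] + replacement + word[pos + lit:]
                pos += len(replacement)
            else:
                pos += 1
    return word

# Invented rules for illustration, obeyed in order (cf. S CLv KL above).
RULES = [('S', 'CLv', 'KL'), ('M', 'PH', 'F')]

def substitute(word):
    for location, pattern, replacement in RULES:
        word = apply_rule(word, location, pattern, replacement)
    return word
```

Under these invented rules, 'CLAY' becomes 'KLAY' (the CL is replaced but the vowel matched by the wild card is retained), while 'PHONE' is untouched because its PH occupies a start rather than a middle position.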
4. Results and Discussion

As described in Section 3, four datasets were used in our experiments. All of the correction methods were tested initially on the Hartlib data, and the better-performing methods were then tested on the ECSTC and Canterbury datasets. In what follows, we present the main results of the very extensive series of experiments that were carried out: full details are presented by Robertson and Willett (1992).

4.1 Hartlib Dataset Results

Table 5 lists the mean recall, averaged over all of the query modern-forms, for searches of the Hartlib dataset using digram and trigram matching, using the two SPEEDCOP keys, and using the dynamic-programming LCS method. Two figures are listed in each case: the first of these is the mean recall when the ten most similar old-forms are retrieved in response to a query modern-form, and the second when the twenty most similar old-forms are retrieved.

Table 5 Mean recall (averaged over all of the query modern-forms) of the correction methods when they are applied to the Hartlib dataset. M1 employed no preprocessing, while M2 applied phonetic substitutions to both the modern-forms and old-forms. The first and second figures in each case correspond to, respectively, the retrieval of the top ten and the top twenty old-forms for a query modern-form.

Method               M1           M2
Digram Matching      90.5  94.5   91.7  95.0
Trigram Matching     86.3  88.8   72.2  74.5
SPEEDCOP Skeleton    57.5  67.6   63.2  76.2
SPEEDCOP Omission    46.9  57.8   55.2  68.4
LCS                  92.2  95.4   93.4  95.8

An inspection of Table 5 shows that both of the n-gram methods can give high recall, with the superior recall of the digram searches being achieved at the cost of an increase of about 20% in the overall execution time when compared with the times for the trigram searches. It is interesting that the digrams perform better than trigrams for this application, since the converse applies when they are used for searching for word variants in modern-day text databases (Freund and Willett, 1982).

The skeleton key was found to outperform the omission key in all of the SPEEDCOP experiments, the best results being obtained when the skeleton key was used with the phonetic-substitutions list (the column marked M2); however, even this result is substantially inferior to the recall figures for all of the n-gram searches. The poor results with the omission key may be due to the fact that the reverse ordering described in Section 2.3.5 is based on the analysis of a very large body of modern text (Pollock and Zamora, 1983). However, the corpus available here was not large enough for a comparable statistical analysis to be carried out; moreover, even if sufficient seventeenth-century data were available, the SPEEDCOP results are so poor that such an analysis would be unlikely to improve the performance of this method sufficiently to make it competitive with the other methods detailed in the table.

The experiments with the LCS method generally gave very impressive results, with the recall here being consistently greater than with digram matching (the next best method). The performance of the editcost-distance method depends on the values that are chosen for the three cost functions: the effect of variations in these costs is illustrated in Table 6, where it will be seen that the best performance is obtained by assigning equal costs to the three editing operations.

Table 6 Effect of parameter values on the mean recall (averaged over all of the query modern-forms) of the editcost-distance dynamic-programming method when it is applied to the Hartlib dataset. The first and second figures in each case correspond to, respectively, the retrieval of the top ten and the top twenty old-forms for a query modern-form. The three weights listed in each row of the first column of the table are the cost functions for insertion, deletion, and substitution, respectively, as defined in Section 2.4.2.

Cost Functions   M1           M2
1/1/1            93.4  96.4   93.2  96.3
1/1/2            91.9  94.2   93.0  95.0
1/2/1            81.9  88.7   82.2  89.4
1/2/2            86.6  87.4   88.2  89.1
2/1/1            91.0  94.2   91.0  94.1
2/1/2            90.7  92.1   91.8  93.2
2/2/1            84.3  88.7   84.9  89.3

Use of the phonetic substitutions generally results in an increase in search performance, with the exception of the trigram searches, where the recall is noticeably less.

The precise level of effectiveness of the phonetic-coding methods depends on the length of the code that is used. Table 7 illustrates the effect of variations in the length of the Hartlib code. It will be seen that the best results are obtained with the shortest code (and a similar conclusion is reached if the other, less-effective Soundex, Davidson, and Phonix codes are studied (Robertson and Willett, 1992)); however, the variation is very small and thus any length could be selected.

Table 7 Effect of the code length on the mean recall (averaged over all of the query modern-forms) of the Hartlib phonetic-coding method when it is applied to the Hartlib dataset. The first and second figures in each case correspond to, respectively, the retrieval of the top ten and the top twenty old-forms for a query modern-form.

Code Length   Recall
5             89.0  93.7
6             89.1  93.3
7             89.0  93.1
8             88.9  93.0

4.2 ECSTC and Canterbury Dataset Results

The best results with the Hartlib dataset are thus obtained with the digram-matching, dynamic-programming, and Hartlib coding methods (using a code of length five characters), and we have further tested these methods on the ECSTC and Canterbury datasets, as detailed in Table 8. In this table, the editcost-distance values are those obtained with the 1/1/1 parameter setting, i.e. where equal cost functions are used for the three editing operations since, as with the Hartlib dataset, these parameter settings gave the best recall. No value is listed under M1 in Table 8 for the Hartlib phonetic-coding method since the use of phonetic substitutions is inherent in this method.

Table 8 Mean recall (averaged over all of the query modern-forms) of the correction methods when they are applied to the ECSTC and Canterbury datasets. M1 employed no preprocessing, while M2 applied phonetic substitutions to both the modern-forms and old-forms. The first and second figures in each case correspond to, respectively, the retrieval of the top ten and the top twenty old-forms for a query modern-form.

                    Canterbury-16            Canterbury-17            ECSTC
Method              M1          M2           M1          M2           M1          M2
Digram Matching     95.5  96.7  95.5  96.7   92.0  93.6  92.4  94.0   85.7  90.7  84.6  90.3
LCS                 97.1  98.1  97.1  98.3   95.7  98.1  96.1  98.2   92.1  95.8  92.5  95.6
Editcost Distance   96.7  98.4  96.3  98.4   91.9  92.9  91.8  92.7   91.8  95.6  91.8  95.3
Hartlib Code        -           89.4  89.4   -           87.0  87.9   -           68.3  73.9

It will be seen that the best overall level of performance is again given by the two dynamic-programming methods, with the LCS method usually, but not always, giving the highest recall figures of all of the methods. The Hartlib codes are consistently inferior to the other methods tested here, and digram matching does not perform well with the large ECSTC dataset.

4.3 Implementation of the LCS Method

The results in Tables 5, 6, and 8 lead us to conclude that the LCS method is the most effective method of those that we have tested; however, the dynamic-programming methods are the least efficient, e.g. the processing of the Hartlib dataset using the Wagner-Fischer algorithm took about thirty times as long as using digram matching (and the Needleman-Wunsch algorithm was still slower), with the matching of a single query modern-form against the Hartlib master dictionary taking at least thirty central-processing-unit seconds on an IBM 3083 BX mainframe using this algorithm.

It is possible to increase the speed of the LCS dynamic-programming algorithm by means of an upperbound technique. Specifically, if two words, A and B, contain Y and Z characters and have X characters in common, then the length of the LCS cannot possibly be greater than X (since a longer LCS would imply that it contained more characters than A and B have in common). Accordingly, an upperbound to the true Dice coefficient based on the LCS is given by

    2X / (Y + Z)

Thus, if the value of X is known for a master-dictionary old-form and a query modern-form, it is possible to calculate an upperbound to the value of the Dice coefficient that would be obtained if the full LCS calculation were to be carried out. There is a well-known and highly efficient inverted-file algorithm that allows the calculation of the number of words common to a natural-language query and each of the documents in a text database (Frakes and Baeza-Yates, 1992). It is very simple to modify this algorithm to identify the common characters for pairs of words (Robertson and Willett, 1992), and hence to calculate an upperbound to the LCS Dice coefficient: the full dynamic-programming algorithm, which is based on the matching of common substrings (as described in Section 2.4.2) rather than just common characters, then need only be applied to those old-forms that have the highest upperbound values. The resulting two-stage search procedure is detailed in Fig. 2.

(1) Initialization: Assume that M old-forms are to be retrieved by the search; then initialize an M-element data structure that will contain the identifiers and calculated similarities for the M nearest neighbours.
(2) Upperbound search:
    (a) The inverted-file algorithm (Frakes and Baeza-Yates, 1992) is used to calculate the number of characters in common between the query modern-form and each of the old-forms in the master dictionary.
    (b) These numbers are used to calculate the corresponding upperbound values, DICE-UPB, and the dictionary is ranked in decreasing order of the DICE-UPB values.
(3) Dynamic-programming search:
    (a) If the value of DICE-UPB for the next old-form in the sorted dictionary is not greater than the actual Dice coefficient, DICE-LCS, for the Mth most similar old-form seen thus far, then go to Step 4 (since it is then not possible for any of the as yet unprocessed old-forms to have a sufficiently large similarity for them to be included in the final set of M nearest neighbours).
    (b) Carry out the full LCS calculation using the Wagner-Fischer algorithm to obtain the actual Dice coefficient, DICE-LCS; if this is greater than the DICE-LCS value for the Mth nearest neighbour, then update the list of nearest-neighbour old-forms accordingly. If there are more old-forms to be processed, then go to Step 3(a).
(4) Output the top M old-forms to the user for inspection and possible inclusion in the query.

Fig. 2 Upperbound implementation of the LCS method. In our experiments, M was set to either ten or twenty.

The upperbound algorithm was tested on the Canterbury seventeenth-century dataset, to see whether the use of the initial upperbound search could reduce the number of LCS calculations sufficiently to permit efficient searching of large dictionaries. On average, it was found that the full LCS algorithm needed to be applied to only 17.7% (965 out of 5,437) of the old-forms in the master dictionary, i.e. that use of the upperbound algorithm resulted in the elimination of over 80% of the dictionary from the full search. This result is very satisfying; however, the computational requirements of dynamic programming are such (Sedgewick, 1988) as to mean that the overall search is still far too slow for interactive processing of a large dictionary unless some type of parallel processor were to be used (as has been done when dynamic-programming algorithms are used for searching databases of protein sequences (Collins and Coulson, 1984)).

5. Conclusions

The experiments reported here and in our previous paper (Rogers and Willett, 1991) demonstrate clearly that methods that were originally developed for the correction of misspellings in modern English text are also well-suited to the identification of old-forms in historical text. More specifically, we draw the following conclusions regarding the effectiveness of the techniques we have evaluated:

• The Hartlib phonetic-coding method achieves a much higher level of performance than the non-phonetic SPEEDCOP codes.
• Digram matching is consistently superior to trigram matching for this application and is also superior to the Hartlib method, which was the best of the various coding methods in our earlier experiments.
• The LCS is the better of the two dynamic-programming methods; both of these are (usually) superior to digram matching. The best editcost results are obtained with the simplest set of cost functions.
• The use of the historical phonetic substitutions usually, but not invariably, results in some improvement in performance.

The increasing effectiveness of the techniques is achieved at the cost of a concomitant decrease in efficiency, with the coding techniques being faster than n-gram matching, which is, in its turn, very much faster than dynamic programming. We hence conclude that digram matching is, at present, the most appropriate method for implementation in an operational environment where a large dictionary (i.e. several tens of thousands of old-forms) is to be searched; the LCS method would be the most appropriate method if a much smaller dictionary were to be searched.

The most obvious way of extending the work reported here would be to investigate further ways of increasing the speed of the dynamic-programming methods, possibly by establishing a tighter upperbound for the two-stage search procedure described in Section 4.3. It would also be of interest to improve the list of phonetic substitutions that was used. This list was created by inspection of the old-forms in the Hartlib dataset, and it seems to work reasonably well not only for this but also for the ECSTC and Canterbury datasets. However, we would expect that search performance could be further improved by tuning the list with substitutions based on text from the precise period for which searches were to be carried out.

In conclusion, we note that spelling-correction methods take no account of the meanings of words or of the grammatical structure of the text that is being processed. Accordingly, our approach to the searching of historical text is applicable, in principle, to texts in many languages and from many different periods, subject only to the constraint that the modern-forms that are being searched for are sufficiently similar to the old-forms in the source database for ready comprehension by the searcher. This is normally the case for the historical English texts considered here (as is evidenced by the modern-forms and old-forms in Table 1), but would be much less true with more ancient texts.

Acknowledgements

We thank Alastair Allan, Ian Bruno, Cuna Ekmekcioglu, Heather Rogers, David Shaw, and Marie Willett for the provision and analysis of the text databases used in this study. The work was funded under grant number RDD/G/114 from the British Library Research and Development Department.

References

Angell, R. C., Freund, G. E. and Willett, P. (1983). Automatic Spelling Correction Using a Trigram Similarity Measure, Information Processing and Management, 19: 255-61.
Barber, C. L. (1972). The Story of Language. Pan Books, London.
Burgess, A. (1975). Language Made Plain. Fontana Paperbacks, London.
Collins, J. F. and Coulson, A. F. W. (1984). Applications of Parallel Processing Algorithms for DNA Sequence Analysis, Nucleic Acids Research, 12: 181-92.
Crump, M. (1989). Searching ESTC on BLAISE-LINE: A Brief Guide, Factotum: Newsletter of the XVIIIth Century STC, Occasional Paper 6.
Crump, M. and Harris, M. (eds.) (1983). Searching the Eighteenth Century. British Library, London.
Damerau, F. J. (1964). Techniques for Computer Detection and Correction of Spelling Errors, Communications of the ACM, 7: 171-6.
Damerau, F. J. and Mays, E. (1989). An Examination of Undetected Typing Errors, Information Processing and Management, 25: 659-64.
Davidson, L. (1962). Retrieval of Misspelled Names in an Airline's Passenger Record System, Communications of the ACM, 5: 169-71.
Frakes, W. B. and Baeza-Yates, R. (eds.) (1992). Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
Freund, G. E. and Willett, P. (1982). Online Identification of Word Variants and Arbitrary Truncation Searching Using a String Similarity Measure, Information Technology: Research and Development, 1: 177-87.
Frith, A. R., Robertson, A. M. and Willett, P. (1993). Effectiveness of Similarity Measures and of Query Expansion Techniques for Searching Databases of 16th-, 17th- and 18th-Century English Text, Journal of Document and Text Retrieval, 1: 97-114.
Gadd, T. N. (1988). 'Fisching Fore Werds': Phonetic Retrieval of Written Text in Information Systems, Program, 22: 222-37.
Gadd, T. N. (1990). PHONIX: the Algorithm, Program, 24: 363-6.
Joseph, D. M. and Wong, R. L. (1979). Correction of Misspellings and Typographical Errors in a Free-Text Medical English Information Storage and Retrieval System, Methods of Information in Medicine, 18: 228-34.
Kruskal, J. B. and Sankoff, D. (1983). Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
Leslie, M. (1990). The Hartlib Papers Project: Text Retrieval in Large Datasets, Literary and Linguistic Computing, 5: 58-69.
Mitton, R. (1987). Spelling Checkers, Spelling Correctors and the Misspellings of Poor Spellers, Information Processing and Management, 23: 495-505.
Needleman, S. B. and Wunsch, C. D. (1970). A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins, Journal of Molecular Biology, 48: 443-53.
Pollock, J. J. (1982). Spelling Error Detection and Correction by Computer: Some Notes and a Bibliography, Journal of Documentation, 38: 282-91.
Pollock, J. J. and Zamora, A. (1983). Correction and Characterization of Spelling Errors in Scientific and Scholarly Text, Journal of the American Society for Information Science, 34: 51-8.
Pollock, J. J. and Zamora, A. (1984). Automatic Spelling Correction in Scientific and Scholarly Text, Communications of the ACM, 27: 358-68.
Robertson, A. M. and Willett, P. (1992). Identification of Word Variants in Historical Text Databases. Final Project Report to the British Library Research and Development Department. British Library, London.
Rogers, H. J. and Willett, P. (1991). Searching for Historical Word Forms in Text Databases Using Spelling-Correction Methods: Reverse Error and Phonetic Coding Methods, Journal of Documentation, 47: 333-53.
Russell, R. C. (1918). United States Patent 1261167. United States Patent Office, Washington.
Russell, R. C. (1922). United States Patent 1435663. United States Patent Office, Washington.
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
Sedgewick, R. (1988). Algorithms. Addison-Wesley, Reading, MA.
Shaw, D. (1991). MARC Catalogues of Early-Printed Books at the University of Kent, Program, 25: 339-47.
Vallins, G. H. (1965). Spelling. Andre Deutsch, London.
Wagner, R. A. and Fischer, M. J. (1974). The String-to-String Correction Problem, Journal of the ACM, 21: 168-73.
Yannakoudakis, E. J. and Fawthrop, D. (1983). The Rules of Spelling Errors, Information Processing and Management, 19: 101-8.