Decision Support Systems 35 (2003) 385–397
www.elsevier.com/locate/dsw

Design of an interactive spell checker: optimizing the list of offered words

Robert Garfinkel a,*, Elena Fernandez b, Ram Gopal a

a Operations and Information Management, School of Business, University of Connecticut, New Business Administration Unit 1041-411M, Storrs, CT, USA
b Statistics and Operations Research Department, Technical University of Catalonia, Barcelona, Spain

Received 1 January 2002; accepted 1 April 2002

* Corresponding author. E-mail address: [email protected] (R. Garfinkel).

Abstract

Interactive spell checking is an instance of a general problem that is characterized by doubt as to the correctness of an input, and by the need to correct the input or offer a set of possible corrections. Our contention is that a well-defined model can provide a basis for improving the performance of spell checkers and of other similar applications such as voice recognition, historical document searches, and data imputation, among others. In this paper, we introduce a model designed to help the spell checkers within word processors place the correct word at, or near, the top of the list of offered words. A prototype spell checker (SP) is developed based on the model. It is tested on documents typed by subjects and found to be very accurate. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Word processors; Spell checking; Optimization

0167-9236/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S0167-9236(02)00115-X

1. Introduction

With the advent of the personal computer, it can be assumed that mistyping of words has become increasingly prevalent since, even in the business world, professional typists no longer do the overwhelming majority of the typing. Thus, for many of us who do our own typing, the spell checkers in our word processors or other software have become indispensable. These programs are used to indicate that the correctness of a word is in doubt. Spell checkers are distinguished from grammar checkers, which indicate whether a sentence follows acceptable grammatical rules. The basic steps required of any spell checker are outlined in Ref. [11].

Various studies (e.g. Refs. [4,9,13]) have indicated a variety of average error rates depending on the skill of the typist. In general, the studies, including the results of our own experiments, yield nonword error rates (the incorrect "word" is not found in a dictionary) of between 0.2% and 6%. Of course, error rates can also be expected to depend on factors other than typing skill. For instance, one would expect to type more accurately when transcribing than when composing. Error rates based on misspelling rather than mistyping should also decrease dramatically in a phonetic language like Spanish or Hindi as compared to a nonphonetic language such as English. By a phonetic language we mean one for which there is more or less a one-to-one correspondence between spelling and pronunciation. To give one example of just how nonphonetic English is, consider that the vowel string ou can have at least five different pronunciations, as indicated by the words rough, soul, bought, cougar, and foul. To make matters worse, each of these five pronunciations is also found in other vowel strings, including u, o, aw, oo, and ow in the words ruff, sole, awl, fool, and fowl.

Spell checking generally consists of two basic tasks.
The first is to determine if a word is in the dictionary of the spell checker. The second, in automatic mode, is to seek the most likely replacement word and to effect the replacement. In interactive mode, it is simply to offer an ordered list of possible replacements. Automatic spell checking seems to have been the subject of the vast majority of the research on spell checking (see, for example, Ref. [8]). It is clearly the appropriate model when time, cost, or other considerations preclude the option of letting the user make the final corrections. Interactive spell checking has received much less research attention, despite being the mode in which most of us work. Of course, the two models are closely related in that any technique that generally produces the correct replacement word will undoubtedly produce "good" lists as well. On the other hand, we note that in very sensitive applications, where it is critically important that the replacement word be correct, it may be appropriate to devise complex and relatively time-consuming algorithms. These may, for instance, take the context of the writing into account. In this work, since we assume that the typist will always have the chance to correct any mistakes, and since the algorithm would be part of a commercial word processing package, fast heuristics would seem to be the most appropriate.

All modern word processors have spell checkers, and although they generally work quite well, their algorithms for choosing the list of replacement words are not well documented. Note that the spell checkers within commercial word processors are generally attributed to other companies. For instance, Microsoft Word 97 (hereafter abbreviated as MSW97) attributes its spell checker to International CorrectSpell [6].

To some extent, our interest in spell checkers was motivated by the observation that, from time to time, the list of offered words seems counterintuitive. For instance, MSW97 does not include idea in the list of offerings when the typed word is iedea. This seems strange because the two words differ by only one letter and, furthermore, most English speakers presented with iedea would probably assume that the typist meant to type idea. Stranger yet, the only word offered is the French word idee. Table 5 in Section 4.3 contains a number of similar anomalies. For the moment, we leave open the precise definition of such an anomaly, hoping that the above example makes the concept clear. This omission will be rectified in Section 4, once a model has been presented that defines the closeness of two words.

We have discovered the same sorts of anomalies in all word processors that we have encountered. We focus on MSW97 when we refer to commercial word processors only because it was one of the respected standards of the field at the time of our experiments. We also emphasize that we have compared our prototype spell checker (SP) to MSW97 for the same reason. Thus, it was chosen as a benchmark to see if our approach was viable. Our hope was simply that, by the measures listed below, SP would be competitive with MSW97. It should be noted that since the completion of the experiments, WORD 2000 and WORD XP have appeared. We have observed that many of the anomalies of MSW97 noted in Section 4.3 are not anomalies of MSW00 and MSWXP. That is an encouraging observation, since spell checkers are being continuously improved, and perhaps models similar to the one we develop later are being used to achieve this improvement.
If that were true (and we have not been able to confirm this conjecture with Microsoft), it would serve as further validation that such models can be useful in this and the other similar applications listed below.

In Section 2.2, we develop a systematic way to construct the replacement list based on a likelihood function. The goal is to be able to consistently place the correct word not only in the list, but near the top. The maximum size of the list is one of the parameters of the prototype spell checker and is described in Section 3. In the experiments of Section 4, the actual size never exceeds 10. We define the following measures of good lists.

AV: the average location of the correct word in the lists.
PA: the percentage of lists in which the correct word appears at all.
P1: the percentage of lists in which the correct word appears first.
P5: the percentage of lists in which the correct word is among the first five.

The rationale for PA and AV is clear. P1 is of particular importance in the realm of automatic correction since, presumably, the first word on the list would be the replacement. We add P5 since, if one right-clicks on a misspelled word in MSW97, the first five elements of the list are given. The experimental results of Section 4 will show that our model is quite successful in terms of these four measures. Even so, it would be too simplistic to suggest that commercial word processors should discard their spell checkers and use one based on our prototype. Our intent is simply to suggest a well-defined model and algorithm that could be the basis of commercial spell checkers of the future.

The model-based approach outlined in the paper can also be employed in a number of related application areas. Such applications all have in common the presence of (1) doubt as to the correctness of an input and (2) the need to correct the input or offer a set of possible corrections. Some of these application areas are presented below.

Search engines on the Internet. Spell checking user queries can improve the efficiency (of search) and effectiveness (of obtaining useful documents). This may grow in importance as the number of non-native English speakers using the Internet grows.

Enhancing communication for speech- or language-impaired individuals [9]. Text-to-speech conversion is an important issue in facilitating effective communication via telecommunication devices. Spell checking is critical to producing intelligible speech.

Historical document search [12]. Over the centuries, the English language and word spelling have evolved. There was little proofreading of earlier documents, and variant spellings were fairly common. Performing an effective search on these documents requires an awareness of spelling errors.

Genetic research applications [10]. Properties and behavior of molecules depend on the sequence of their basic elements (much the same way as words are sequences of letters of the alphabet). Techniques similar to ours can be applied to study the relationships between molecules.

Data imputation [3]. Responses to some queries of a questionnaire may be known to be in error because they fail one or more "edits". The problem is to determine the true intended response.

Search on music databases [14]. Recent growth in the digital representation of music has led to the creation of large music databases. There is considerable interest in efficient retrieval techniques for such music collections.
"Query-by-humming" systems permit a user to hum a tune and retrieve songs with similar tunes. A significant issue that arises in this context is that users vary significantly in their vocal abilities.

The model used to evaluate possible replacement words is introduced in Section 2. The prototype spell checker, SP, is described in Section 3. Section 4 contains the results of experiments with SP and a comparison to MSW97. Results and possible future research are the subject of Section 5.

2. The model

Define a typed word m to be a sequence of consecutive symbols (throughout the paper, italics indicate the input to or output of a spell checker). In general, these symbols will be letters, but for the moment, other symbols such as numbers or apostrophes can be considered to be contained in m. The symbol m stands for "mistyped", and is used for the typed word because it will be of interest to us only if m is wrong in the sense that it does not appear in the relevant dictionary D. It is important for the testing of the model of Section 2.2 that there be a "correct" word w* ∈ D, which is the word that the typist intended to type. Because of this concept of a correct word, the requirement that the sequence of symbols in m be consecutive is quite important. For instance, if the typist created a split word by inserting a space, or a run-on word by omitting one, then it is not clear which word is w*. We avoid these ambiguities in the computational testing of Section 4 by removing split and run-on words. An extension of our proposed model could deal with split and run-on words by concatenating a word with its neighbors and performing multiword searches on the dictionary.

Once it is determined that m ∉ D, an offering list L(m) = {w1(m), w2(m), ...} is created. In the model of Section 2.2, the order of the words in the list of offerings is intended to indicate a corresponding order in the probability of being the correct word, with the first word having the greatest probability. Although the same statement is not made explicitly, an implication of similar intent is found in International CorrectSpell [6]. Its documentation states that it "...provides the desired correction for a misspelled word early (usually first) in a short list of candidates...".

2.1. Some factors that may influence the list of offered words

The cause of mistyping is naturally attributed to typing errors or spelling errors. Typing errors are simply physical mistakes, while a spelling error is made because the typist does not know how to spell the word. An examination of the offerings of MSW97 makes it clear that both categories are considered. Typing errors are exemplified by m = jsut resulting in L(m) = (just, jut).
It would be hard to imagine that, if either element of L(m) were correct, the error could be attributed to spelling. On the other hand, MSW97 seems to take spelling mistakes into account as distinct from typing mistakes. For example, for m = ofr, over ∈ L(m) but our ∉ L(m), implying that the typist was searching for a word sounding like over. Similarly, for m = typ, tip ∈ L(m) but top ∉ L(m).

Our model, as described in Section 2.2, does not specifically take spelling errors into account. The reason is that our aim is to present a general framework of a model and then to see how well a simple realization works. If that is successful, and depending on the application, further research can look at various levels of increased complexity, including spelling errors.

Typing errors are much simpler to model than spelling errors. Each of us can hit the key for the letter q when we meant to hit the neighboring key for the letter a, and it signifies nothing more than a physical mistake. A rudimentary approach that simply looks at the set of words that can be found by reversing a typing error involving a single letter is described in Ref. [2]. It is reported that 80% of words can be corrected by this device. On the other hand, there are many possible reasons for spelling errors. It may be that two different combinations of letters sound the same, or possibly even that the typist does not know how to say the word correctly. In Ref. [5], the efficacy of some commercial spell checkers with regard only to spelling errors is tested. Despite not considering spelling errors, we will see from the results of Section 4 that our model performs quite well. Perhaps that is because the population tested is connected to a university, so that the subjects might be less likely to make spelling errors (even in English) than is the population in general.

2.2. A flexible likelihood function

Here, we develop a likelihood function g(w,m) to be used in determining the placement and ordering of the elements of L(m). In fact, g(w,m) could better be called an anti-likelihood function in the sense that, as g(w,m) increases, our likelihood that w is the correct replacement for m decreases. That is, w1 is considered to be more likely to be the correct word than w2 if g(w1,m) < g(w2,m). Therefore, if w1 and w2 both merit placement in L(m), w1 will appear before w2. There are many options for the form of such a function. The likelihood function that we use is quite simple and, as the experiments of Section 4 show, a particular realization has proven quite effective.

2.2.1. The subarguments

The arguments w and m of g are broken down into three subarguments. These are the following.

e_wm: the fewest single errors that could transform the word w into the word m. The possible individual errors are: delete a letter (del.); insert a letter (ins.); substitute one letter for another (sub.); interchange two adjacent letters (int.). Note that if e_wm = 1, the single error is unique. This is not true in general if e_wm > 1. For example, if w = think and m = tihk, then e_wm = 2. Two possibilities for the intermediate word after the first error would be tihnk and thihk. In the first case, m would result from w after an interchange and a deletion, while in the second, it would result from a substitution and a deletion.

c_w: the commonness of w. In our program, c_w is measured on a scale from zero to nine, with zero being assigned to the most common words (e.g. a and the).

f_wm: zero if w and m have the same first letter and one otherwise.
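The subargument e_wm is, in modern terminology, the edit distance with adjacent transpositions that traces back to Damerau [2], and it is commonly computed by dynamic programming. The paper gives no code for it, so the following is only a minimal illustrative sketch in Python (the prototype itself was written in FORTRAN; see Section 4):

    def e_wm(w: str, m: str) -> int:
        """Fewest single errors (delete, insert, substitute, or
        interchange of two adjacent letters) transforming w into m."""
        n, k = len(w), len(m)
        # d[i][j] = distance between the first i letters of w and the
        # first j letters of m.
        d = [[0] * (k + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                       # i deletions
        for j in range(k + 1):
            d[0][j] = j                       # j insertions
        for i in range(1, n + 1):
            for j in range(1, k + 1):
                cost = 0 if w[i - 1] == m[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete
                              d[i][j - 1] + 1,         # insert
                              d[i - 1][j - 1] + cost)  # substitute
                if (i > 1 and j > 1 and w[i - 1] == m[j - 2]
                        and w[i - 2] == m[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # interchange
        return d[n][k]

    assert e_wm("think", "tihk") == 2   # the example from the text
    assert e_wm("the", "teh") == 1      # one interchange

The two asserts reproduce the examples from the text: think/tihk at distance 2, and the/teh at distance 1.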
2.2.2. Why these subarguments?

It is simple to defend e_wm as a model of "closeness". A factor that may seem less fundamental is the "commonness" of w. It seems logical that, if all else is equal, a more common word is more likely to be correct than a less common word. Of course, the definition of commonness is quite flexible. A word that is common to you may not even be part of the vocabulary of someone else. Two examples from MSW97 illustrate that commonness does not seem to play a consistently important role in the ordering of its lists. First, w1(alos) = Laos, while the seemingly more obvious choice is also = w2(alos). Similarly, w2(nerver) is the obscure word nervure, while the more obvious choices are nerve = w3(nerver) and never = w6(nerver). Related to this point is another quote from Ref. [6]:

"...The quality of a spelling correction system does not increase in direct proportion to the size of its lexicons: bigger is not always better. A wordlist that is too large can include obscure words that coincide with misspellings of common words. For example, calendar could be misspelled as calender, which is a technical term for a machine that smoothes paper or fabric. Smaller is not always better either. A small lexicon can come at the expense of thorough coverage. A wordlist that is too small will not recognize correct words because the database is missing certain commonly used words. Small spelling correction databases often lack adverbs, geographical and cultural terms, and abbreviations and acronyms. International CorrectSpell's databases include words that comprise the core of contemporary written language...."

The measure c_w is, to us, one of the keys to improving the performance of spell checkers. Note that International CorrectSpell addresses this point when they say that a comprehensive dictionary may be detrimental to the attempt to find the correct word, in that there may be very many choices, some of which are too obscure to be likely replacements. However, limiting the size of the dictionary only affects the existence or lack of existence of a word on the list. Our definition of commonness within the likelihood function influences the order of the placement. Furthermore, it allows for individual flexibility in the assignment of commonness values.

The third subargument, f_wm, is harder to defend than the first two. In Ref. [9], it is reported that in some studies, up to 93% of erroneous words have correct first letters. In our experiments, the percentage reaches 90%. Thus, there may be some justification in having more confidence in the first letter than in other letters. Another reason why we included f_wm is that, especially for long words, we have observed that L(m) from WORD is generally unlikely to contain elements w with a first letter different from that of m.

2.2.3. The function

In the prototype spell checker SP, we use the linear function

    g(w, m) = e(e_wm − 1) + c·c_w + f·f_wm,    (1)

where e, c, and f are nonnegative constants. For example, if c_the = 0, then g(the, teh) = 0, since only one interchange is needed to transform the into teh and since the two words both begin with the letter t. Thus, the would, at worst, be tied for the most likely replacement for teh.
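Eq. (1) translates directly into code. The sketch below is ours, not the authors': it assumes the e_wm routine sketched above, a Python dict c_w mapping words to commonness values (defaulting, by our own assumption, to 9, the rarest class, for unlisted words), and it anticipates the list parameters L+ and g+ that Section 3 introduces.

    def g(w, m, c_w, e=10, c=1, f=0):
        """Anti-likelihood function of Eq. (1); smaller is more likely."""
        f_wm = 0 if w[:1] == m[:1] else 1     # first letters agree?
        return e * (e_wm(w, m) - 1) + c * c_w.get(w, 9) + f * f_wm

    def offering_list(m, candidates, c_w, L_plus=10, g_plus=20):
        """Order candidate words by g, keeping at most L_plus words
        whose score does not exceed g_plus."""
        scored = sorted((g(w, m, c_w), w) for w in candidates)
        return [w for score, w in scored if score <= g_plus][:L_plus]

With c_w = {"the": 0}, g("the", "teh", c_w) evaluates to 0, matching the worked example above.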
2.3. Customizing and learning

An appealing aspect of creating L(m) from a function g(w,m) such as Eq. (1) is the possibility of customizing the function based on the type of errors made by an individual or group of individuals. Here, there are clearly many options. Different sets of weights for the subarguments of Section 2.2 may be more appropriate under different conditions. For instance, the weights associated with e_wm could be differentiated by the type of error. An individual who is most likely to make insertion errors would give higher weight to that type than to deletions, etc. Even more finely, one could differentiate among substitutions of letters on neighboring keys, etc. Similarly, not only could individuals use their own measures of commonness c_w, but they could estimate them by keeping track of the relative frequencies of the words they use.

A further option would be to allow the function to be updated by training it based on typing experience. Once the arguments have been chosen, "optimal" weights can be determined by "training" the function via any of a number of artificial intelligence techniques. This could be done within customizing for individuals or groups, or to try to find "universal constants".

3. A prototype spell checker: SP

3.1. The dictionary used in SP

In the interest of simplicity and control, SP uses its own dictionary (DICT) rather than a standard one. The elements of DICT come primarily from two sources: emails received by the authors and the complete set of words offered by commercial spell checkers in our experiments. Thus, every word offered by WORD in the experiments of the next section is in DICT. Currently, DICT contains about 3500 entries.

Every word in DICT is a string of lowercase letters. Thus, capitalized letters are changed to lowercase, and dashes, apostrophes, numbers, etc. are simply ignored. We do that because all of those symbols can change the meaning of a word and would therefore require a more complex code than that of SP to find the appropriate list of offerings. Since SP is simply a prototype, these kinds of complications are left for future developments (note that proper names and words containing symbols that impart meaning, e.g. apostrophes, are not considered in our experiments when we compare SP to MSW97).

Attached to each word is its commonness rating. For instance, two entries in DICT are the 0 and represtinate 9. The commonness values are assigned to words whenever SP does not find a word in DICT and the user chooses to add the word to DICT. For the sake of consistency, all values have been assigned by the same person, namely one of the authors. They are done in a fairly ad hoc manner. There is no intent to say, with any scientific justification, that a word with c_w = 4 is more common than a word with c_w = 5. On the other hand, such distinctions could easily be made in future versions of the program. For us, the intention is at least to be able to distinguish among the four categories: very common, fairly common, relatively uncommon, and very rare. If a spelling corresponds to two or more words, the commonness will go with the most common. For instance, if the entry quick 4 appeared, it would refer to the concept of speed rather than to a painfully sensitive part of the body, such as the area under the fingernails. As usual, the words of DICT are ordered alphabetically, and if w1 and w2 are any two words, then w1 < w2 means that w1 would precede w2 in DICT.

3.2. The SP algorithm

On initiation of the algorithm, the user is asked to input the following parameters:

L+: the maximum number of words in L(m);
g+: an upper bound on the objective value of any word in L(m);
e, c, and f: the parameters of the function g in Eq. (1).

An overview of the algorithm is given below.

Procedure Spellcheck
  While not end of document do:
    Read next word from text: t_word
    Select a word from DICT: d_word
    If t_word ≠ d_word then
      If user declines to add t_word to DICT then
        Give list of possible alternatives to t_word
      Endif
    Endif
End

Every t_word is compared to a single d_word, where

    d_word = min{w : w ≥ t_word and w ∈ DICT}.

If t_word = d_word, then t_word is already in DICT. Otherwise, since it is possible that t_word is a correct word not yet present in DICT, the user is given the option to add t_word to DICT. If that option is declined, t_word is assumed to be mistyped. That is, m = t_word and the list L(m) is generated.
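Since DICT is ordered alphabetically, the lookup d_word = min{w : w ≥ t_word and w ∈ DICT} amounts to a single binary search [1,7]. A sketch, under the assumption that DICT is held as a sorted Python list:

    from bisect import bisect_left

    def lookup(t_word, dict_words):
        """Return d_word = min{w : w >= t_word and w in DICT}, where
        dict_words is DICT as an alphabetically sorted list; returns
        None if t_word sorts after every entry."""
        i = bisect_left(dict_words, t_word)
        return dict_words[i] if i < len(dict_words) else None

    # t_word is in DICT exactly when lookup(t_word, dict_words) == t_word,
    # which is the test that Procedure Spellcheck performs.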
In the prototype SP, every w ∈ L(m) satisfies e_wm ≤ 2. Of course, e_wm > 2 could easily be incorporated, but the computational burden within our algorithmic framework would increase (see Section 3.3). Since the input to SP is m rather than w, we define four basic transformations that can operate on m in the search for the correct word w. Because of symmetry, these are, of course, the same basic operations defined in Section 2.2, namely delete, insert, substitute, and interchange. From e_wm ≤ 2, w ∈ L(m) only if w can be obtained from m by a single basic transformation, or by a composition of any two basic transformations.

The basic and composite transformations are defined below. Let n_m denote the number of letters of a word m and S_m,i the ith letter of m. Also, let S* and S′ be any two letters in A = {a, ..., z}. Two of the arguments of each composite transformation are numbers i and j, each between 1 and n_m. In all cases, they refer to places in the original word m rather than in the word that results from the first of the two basic transformations. All distinct basic and composite transformations are listed below. For composite transformations, it is assumed that they are applied in the order indicated.

Basic transformations

del(i, m): delete S_m,i.
ins(S*, i, m): insert S* directly before S_m,i. When i = n_m + 1, S* is inserted at the end of m.
sub(S*, i, m): if S_m,i ≠ S*, let S_m,i = S*.
int(i, m): if S_m,i ≠ S_m,i+1, interchange S_m,i and S_m,i+1, where i < n_m.

Composite transformations

del_del(i, j, m): del(i, m) and del(j, m), where i ≠ j.
del_ins(i, S*, j, m): del(i, m) and ins(S*, j, m), where i ≠ j, else it is equivalent to sub(S*, i, m).
del_sub(i, S*, j, m): del(i, m) and sub(S*, j, m), where i ≠ j, else it is equivalent to sub(S*, i, m).
del_int(i, j, m): del(i, m) and int(j, m), where j < i − 1 or j > i, else it is equivalent to del(i, m). Also j < n_m.
ins_ins(S*, i, S′, j, m): ins(S*, i, m) and ins(S′, j, m). If i = j, then S* precedes S′.
ins_sub(S*, i, S′, j, m): ins(S*, i, m) and sub(S′, j, m).
ins_int(S*, i, j, m): ins(S*, i, m) and int(j, m), where i ≠ j + 1, else it is equivalent to ins(S*, j, m). Also j < n_m.
sub_sub(S*, i, S′, j, m): sub(S*, i, m) and sub(S′, j, m), where i ≠ j, else it is equivalent to sub(S′, j, m).
sub_int(S*, i, j, m): sub(S*, i, m) and int(j, m), where i ≠ j, else it is equivalent to sub(S*, j, m). Valid for any other i and j, but only applied when i ≠ j + 1 and j ≠ i + 1, else the same result can be obtained with del_ins. Also j < n_m.
int_int(i, j, m): int(i, m) and int(j, m), where i ≠ j. Also i < n_m and j < n_m.

The list of possible alternatives is generated in two phases, corresponding to e_wm = 1 and e_wm = 2. Each time a candidate word is considered, it is tested for possible inclusion in L(m) by the function Include described below.

Include(w, m)
  Let n_L(m) be the number of words currently in L(m)
  Let g* be the value of the last word (word n_L(m)) in L(m)
  Calculate g(w, m)
  If g(w, m) ≤ g+ do
    If n_L(m) < L+ or g(w, m) < g*, then place w in L(m) ordered by g
  Endif
End

For convenience, we use the loose notation w = trans(m) to mean that w is the word that results from the specified transformation. Thus, for instance, the = int(2, teh).
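Before giving the two phases, it may help to see the same idea in executable form. The sketch below is our illustration, not the prototype's code; DICT is assumed to be a Python set of words. It generates every string reachable from m by one basic transformation and intersects the result with the dictionary, which is exactly the candidate set that Phase 1 passes through Include:

    import string

    A = string.ascii_lowercase

    def single_transformations(m):
        """Every string reachable from m by one del, ins, sub, or int."""
        out = set()
        for i in range(len(m)):
            out.add(m[:i] + m[i + 1:])                        # del(i, m)
            if i + 1 < len(m) and m[i] != m[i + 1]:           # int(i, m)
                out.add(m[:i] + m[i + 1] + m[i] + m[i + 2:])
            for s in A:
                if s != m[i]:
                    out.add(m[:i] + s + m[i + 1:])            # sub(s, i, m)
        for i in range(len(m) + 1):                           # ins(s, i, m)
            for s in A:
                out.add(m[:i] + s + m[i:])
        return out

    def phase1(m, dictionary):
        """Candidate words w with e_wm = 1."""
        return single_transformations(m) & dictionary

Phase 1 itself, shown next, interleaves the generation with the Include test instead of collecting a set first.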
Phase 1

L(m) = ∅
For i = 1, n_m do
  If w = del(i, m) ∈ DICT, then Include(w, m)
  If w = int(i, m) ∈ DICT, then Include(w, m)
  For S* ∈ A do
    If w = ins(S*, i, m) ∈ DICT, then Include(w, m)
    If w = sub(S*, i, m) ∈ DICT, then Include(w, m)
  End for
End for

In some cases, it may not be necessary to enter Phase 2. In particular, if n* (the number of words in L(m) after Phase 1) = L+, it is possible that the best word that could result from Phase 2 would be no better than the word currently residing in place L+ of L(m). Let c* and f* be the values for commonness and first letter of the last word in L(m), respectively. Then, Phase 2 should be entered only if

    c·c* + f·f* − e > 0.    (2)

This follows since the left-hand side of Eq. (2) is the maximum improvement that any word found in Phase 2 can have over the worst word currently in L(m).

Phase 2

For i = 1, n_m do
  For j = 1, n_m do
    If w = del_del(i, j, m) ∈ DICT, then Include(w, m)
    If w = del_int(i, j, m) ∈ DICT, then Include(w, m)
    If w = int_int(i, j, m) ∈ DICT, then Include(w, m)
    For S* ∈ A do
      If w = del_ins(i, S*, j, m) ∈ DICT, then Include(w, m)
      If w = del_sub(i, S*, j, m) ∈ DICT, then Include(w, m)
      If w = ins_int(S*, i, j, m) ∈ DICT, then Include(w, m)
      If w = sub_int(S*, i, j, m) ∈ DICT, then Include(w, m)
      For S′ ∈ A do
        If w = ins_ins(S*, i, S′, j, m) ∈ DICT, then Include(w, m)
        If w = ins_sub(S*, i, S′, j, m) ∈ DICT, then Include(w, m)
        If w = sub_sub(S*, i, S′, j, m) ∈ DICT, then Include(w, m)
      End for
    End for
  End for
End for
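An equivalent, if blunter, way to obtain the Phase 2 candidates in code is to apply the single-transformation generator twice and filter through DICT. SP instead enumerates the composite transformations directly, which avoids generating the full neighborhoods of every intermediate string; the sketch below (again ours, reusing the functions sketched earlier) only illustrates the set being searched:

    def phase2(m, dictionary):
        """Candidate words w with e_wm = 2, obtained by composing two
        single transformations. Words already reachable in one step,
        and m itself, are excluded."""
        out = set()
        for t in single_transformations(m):   # intermediate strings
            out |= single_transformations(t) & dictionary
        return out - phase1(m, dictionary) - {m}

The pruning test of Eq. (2) carries over unchanged: once L(m) already holds L+ words, Phase 2 is worth entering only if c·c* + f·f* − e > 0.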
3.3. Complexity of the SP algorithm

Procedure Spellcheck requires one dictionary lookup per typed word. This is a relatively simple special case of a string-matching problem combined with a search in the data structure representing DICT. Techniques for this kind of lookup are discussed in Refs. [1,7,8]. Let the time needed for the lookup be t_Look. Assuming that each word is stored as an ordered list of letters, the basic transformations are, broadly speaking, done in constant time. In Phase 1, delete and interchange are performed once each, and insert and substitute |A| and |A| − 1, or 26 and 25, times, respectively. Since a dictionary lookup accompanies each transformation, the order of Phase 1 is approximately 2|A|·n_m·t_Look = 52·n_m·t_Look. Since we impose e_wm ≤ 2, the order of Phase 2 is dominated by the three composite transformations sub_sub, ins_sub, and ins_ins. Each takes about twice the time of the basic transformations. Thus, whenever Phase 2 is entered, the overall complexity of SP is approximately 6·t_Look·(26·n_m)². While efficiencies could be achieved by clever data structures, we have found that the list is offered virtually instantaneously, so that incorporation within a word processor should be feasible with respect to time.

4. The experiments

SP was programmed in FORTRAN on a Sun workstation. Experiments were carried out on typing by staff, faculty, and doctoral students of the business school at the University of Connecticut.

4.1. The data

The following request for help was sent electronically to all staff, faculty, and doctoral students of the Business School of the University of Connecticut.

Dear SBA colleague: We are doing a research project on spell checkers with colleagues of the Polytechnic Institute of Catalonia in Barcelona, Spain. The objective is to try to design an "optimal" spell checker in the sense that if you type an incorrect word, the correct word will be offered to you as one of the first alternatives. We are partially motivated by the sometimes strange options given by commercial word processors. We have designed a prototype and would like to test it based on the typing of real people like yourselves. We are therefore requesting that, if you can find the time, you do the following for us: type this letter (i.e. actually retype what you're reading right now); and type a note of about the same size of your own creation, without copying it from an existing document. The subject matter is irrelevant! Send us, either electronically or on a disk or on paper, your version of this letter and the note. For the note, we will also need an indication of what the correct words were if you misspelled any. We want you to type at your normal pace, and we ask you not to correct any errors that you make. We want to see what kinds of mistakes people make! I'm an incredibly bad typist, so it's unlikely you can do worse than me. Any help you can give us will be greatly appreciated. We'd like to start our analysis as quickly as possible, but certainly better late than never.

We hoped to be able to compare various typing abilities on the same document so as to eliminate the variability inherent in having each individual work on a different document. In addition, we asked for a copy of their own typing so as to be able to contrast typing on original against copied documents, the assumption being that error rates would increase when correct spellings are not present. We received a total of 29 documents from 21 people: 5 professors, 10 staff, and 6 doctoral students. The typing results are summarized in Table 1.

Table 1
Individual typing summary

                          Total    Percent
Total words                6721
Errors                      344     5.12
e_wm = 1                    309    94.21
e_wm = 2                     18     5.49
e_wm > 2                      1     0.30
Deletions                    68    22.01
Insertions                   96    31.07
Substitutions                68    22.01
Interchanges                 77    24.92
Split words                   6     1.74
Run-on words                 10     2.91
First letter incorrect       35    10.17

The 344 total mistyped words include split and run-on words, for which there is no correct word w. The remaining mistyped words are partitioned by path length; within e_wm = 1, where the path is unique, they are further partitioned by error type. Ninety-four percent of the words could be corrected by limiting the search to e_wm = 1. This is at the high end of the estimates in the literature [8], which range from 69% to 94%. Further, there was only one instance of e_wm > 2, and that was for w = polytechnic and m = polyytechnique, with e_wm = 3. Despite the presence of original documents, it became clear from careful examination of all errors that polytechnique was also the only error that could plausibly be attributed to spelling. Furthermore, since the misspelled word is the French equivalent of the original, even that classification is doubtful. It is intuitively likely that the lack of spelling errors would account for the lack of instances of e_wm > 2. Again, this can most likely be explained by the fact that the subjects were all associated with a university.

About 10% of errors involved the first letter. The estimates from Ref. [9] range from 7% to 15%. This would lead one to believe that some f > 0 in Eq. (1) might lead to an improvement in performance, although we did not find this in our experiments. The overall error rate, including "good" words, of 5.12% varied, but not dramatically, by classification of the subjects. The error rates were 5.5%, 6.8%, and 4.9% for faculty, students, and staff, respectively.
Thus, although the staff had the lowest rate, there was not the kind of dramatic difference that would have been common in the days of secretary/typists. Finally, and perhaps surprisingly, the incidences of the four basic typing errors were fairly uniform, ranging from a high of 31.1% for insert to a low of 22% for delete and substitute. This contrasts with the results in Ref. [9], for which the range runs from 51.9% for insert down to 7.6% for interchange.

4.2. The results

Prior to the final experiments, a large number of runs were performed on all of the data to determine the best universal values in SP of the parameters e, c, and f from Eq. (1), in terms of the measures AV, PA, P1, and P5 introduced in Section 1. A range of all three parameters was tested, and it became clear that, not only for average performance but also for virtually all typing instances, (e, c, f) = (10, 1, 0) gave the best results. These numbers can be interpreted as follows. First, for the range of parameters that we considered, there was no advantage in considering a typing error in the first letter to be less likely than errors in other letters. That is, despite the fact (see Table 1) that only about 10% of errors were made in the first letter, e_wm and c_w dominated f_wm, so we set f = 0 in our experiments. This leaves open the possibility that, for some data sets, f > 0 may be appropriate. Second, the values (e, c) = (10, 1), when the range of c_w is [0, 9], mean that all words w with e_wm = 1 are offered before those with e_wm = 2, and that within each value of e_wm, words are ordered by c_w. Finally, we note that the parameters (10, 1, 0) completely dominated (10, 0, 0) in every document. That is, ordering by commonness provided significant advantages in predicting the correct word. Thus, in all the results of this section, the parameters (e, c, f) = (10, 1, 0) are used uniformly to make comparisons to MSW97 straightforward.

The 266 distinct words m (some were duplicated), for which the correct words were defined, were checked by MSW97 and by SP using the following parameters:

L+: the maximum number of words in L(m), = 10;
g+: the upper bound on the objective value of any word in L(m), = 20;
e, c, and f: the parameters of the function g in Eq. (1), = 10, 1, and 0.

Again, there was no intent to show that one program is "better" than the other, but only to see if model (1) is reasonable, as compared to a respected benchmark. L+ = 10 was selected because 10 words is the longest list that was offered by MSW97 in our experiments. If L+ were decreased, at some point the probability of w ∈ L(m) would decrease, although in this case there would have been no decrease until L+ reached six. A value of g+ = 20 means that no word was discarded because of its score, since g(w, m) cannot exceed 19 in our model. The influence of this parameter is discussed further in the analysis of Table 2 below.

Table 2
Test results (total words = 266)

                                             MSW97    SP
P1                                            0.74   0.85
P5                                            0.90   0.98
PA                                            0.92   1.00
Average (AV)                                  2.64   1.36
Standard deviation                            2.87   1.11
Average ignoring missing words (AV*)          1.35   1.30
Standard deviation ignoring missing words     1.43   0.97

The last four rows of Table 2 concern the average placement of w*. Statistics on placement are duplicated because of the dilemma of how to count missing words. In the AV row, missing words were given the score 11, since L+ = 10 in SP, and we are not aware that there is a parameter equivalent to g+ in MSW97. AV* is the average placement ignoring missing words.
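The four measures, together with the score of 11 for missing words used in the AV row, are simple to compute from the experimental output. A sketch of our own, assuming results is a list of (offered list, correct word) pairs:

    def measures(results):
        """Return (P1, P5, PA, AV, AV_star) as defined in Section 1.
        Missing words score 11 in AV; AV_star ignores them. Assumes
        at least one list contains its correct word."""
        n = len(results)
        p1 = p5 = pa = 0
        ranks, found = [], []
        for offered, w_star in results:
            if w_star in offered:
                r = offered.index(w_star) + 1   # 1-based placement
                pa += 1
                p1 += (r == 1)
                p5 += (r <= 5)
                found.append(r)
                ranks.append(r)
            else:
                ranks.append(11)                # missing-word score
        return (p1 / n, p5 / n, pa / n,
                sum(ranks) / n, sum(found) / len(found))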
From the data of Table 2, it is clear that SP is viable using MSW97 as a benchmark. In fact, based on these experiments, SP seems to outperform MSW97, as shown by the hypothesis tests of Table 3, where P refers to the population probability. Again, we recognize that the four significant comparisons are mainly due to MSW97 not including the correct word in L(m) from time to time, and that that event may very well not occur in MSW00 and MSWXP. The list of words not found by MSW97 in our experiments is given in Table 4.

Table 3
Hypothesis tests

Hypotheses accepted            t-Statistic    Confidence
P1(SP) > P1(MSW97)             1.995          0.997
P5(SP) > P5(MSW97)             4.134          0.99998
PA(SP) > PA(MSW97)             4.477          0.999996
AV(SP) < AV(MSW97)             4.705          0.999998
AV*(SP) < AV*(MSW97)           0.437          0.669

Table 4
Not found by MSW97

m          w*          g(w*,m)   SP placement   First three WORD offerings      Range of g(w,m)
ahfve      have        11        1              (no words offered)              -
artially   partially    6        1              artily                          18
aspaces    spaces       5        1              aspics, asperses, asepses       17-29
bsed       based        5        5              bed                             3
chekerd    checkers    17        7              checkered, cheered              15-17
fooice     office      15        1              foci                            28
longwe     longer      13        1              longs, Longview                 13-29
motation   notation     8        2              mutation, motivation, motion    6-26
mt         my           1        3              mat, met                        2-4
oace       pace         5        4              ace, oak                        5-15
od         of           0        1              do, odd, old, ode               0-8
owrkds     words       14        1              orchids                         36
ptions     options      6        1              pitons                          28
ral        real         3        2              al, rail, rally                 5-15
ve         be           0        1              ver, vet, vex                   6-9
versom     version     17        4              verso, versos                   9
weokrd     word        14        1              worked                          23
ypte       type        15        8              yapped, yipped                  37

Column 4 gives the placement of w* in L(m) for SP, with ties indicated by "to". The last column gives g(w,m) for L(m) from MSW97. In the first row, L(m) was empty. Here, the role of g+ becomes clear, since in 8 of the 18 cases, e_w*m = 2. Also, the relatively small dictionary used by SP may have been an advantage, although it does seem striking that in 10 of the 18 cases, SP placed the correct word at worst tied for first in its list. Only one word was not found by SP. The word was m = polytechnique when w* = polytechnic. It was not found since e_wm = 3. As mentioned earlier, this was also the only error undoubtedly not caused by mistyping in all the experiments!

4.3. Anomalies

As indicated in Section 1, the anomalies were compiled over time by the typing of the authors.
There, we used a very loose definition of an anomaly, but now, to us, it is the failure to offer a relatively common word with e_wm = 1, especially if less common words, or words having e_wm > 1, are offered. Anomalies are of particular interest to us because they would seem to indicate that the algorithm used by MSW97 does not resemble that of SP, in the sense that it does not appear to be based on a model using parameters similar to ours. Some of these anomalies are given in Table 5, along with the performance of SP on them. Note that if MSW97 were using a model based heavily on spelling rather than on typing, it would still be difficult to account for the failure to correct the words forom, neice, and sulutions. SP correctly identifies all of these words, with 23 out of 30 being at worst tied for first.

Table 5
Anomalies of MSW97

m           w*          g(w*,m)   SP placement   First three WORD offerings      g(w,m)
ajbout      about       1         1              (no words offered)              -
dal         deal        5         3              al, dale, Daly                  6-9
dpoing      doing       0         1              doping                          6
forom       from        0         1              form, forum                     4-6
fpr         for         0         1              fr                              9
furst       first       3         1              frusta, furs                    4-19
hpe         hope        4         2              pH, he                          1-9
iedea       idea        4         1              idee                            19
ih          in          0         1              hi                              5
kelled      killed      4         1              keeled, Kelley, celled          6-8
lst         last        3         2              st                              9
nd          and         0         1              Ned, nod                        5-8
neice       nice        2         1              niece                           5
oa          of          0         1              au, oak, oaf                    5-19
og          of          0         1              go, Aug                         0-17
oht         out         1         1              hot, oh, oho                    3-8
olver       over        1         1              lover, Oliver, older            3-8
ou          you         0         1              au, our, out                    2-9
psition     position    4         1              piston, spittoon                16-28
ral         real        3         2              al, rail, rally                 9-15
reating     rating      6         7              reassign, reacting, reading     3-26
resduced    reduced     6         1              rescued                         15
sauares     squares     5         1              saucers, soars                  15-25
sedction    section     4         1              seduction, sedation             6
sulutions   solutions   6         1              sultans                         37
syn         sun         3         1              sin, sync, sys                  4-8
tlake       take        1         1              talkie, talk                    12-16
typ         top         3         1              tip, type, typo                 4-5
wdo         who         1         2              do                              0
yar         year        4         3              ear, yard, yarn                 3-5

5. Conclusions and future research

The main contention of this paper is that a systematic, model-based approach can provide a basis to improve the performance of spell checkers and of other similar applications. We have introduced a model designed to help the spell checkers within word processors place the correct word at, or near, the top of the list of offered words. The proposed model also introduces the concept of "commonness", which incorporates the notion that, if all else is equal, a more common word is more likely to be correct than a less common word. Based on the proposed model, a prototype spell checker has been developed, tested, and found to be quite accurate.

A number of issues deserve further investigation. An immediate extension is to expand the model to incorporate additional factors, including those related to spelling errors. Another useful avenue for further research is to explore customizing and learning possibilities, along the lines detailed in Section 2.3. Other application areas detailed in the paper, including query-by-humming techniques over music databases, can benefit from the model-based approach developed in the paper. With the growing use of online music, there is an important need to develop techniques for consumers to effectively search for and experience music offerings. We are currently working on some of these issues.

Acknowledgements

The authors are grateful for the helpful comments of the referees and the area editor. The authors received support from TECI, the Treibick Electronic Commerce Initiative, Department of Operations and Information Management, University of Connecticut. The research of the first author was supported in part by Grant SAB94-0115 from the Spanish Interministerial Commission of Science and Technology. Computational resources were provided by the Booth Research Center for Computer Applications and Research of the University of Connecticut and by the Department of Statistics and Operations Research at the Polytechnic University of Catalonia, Spain. The authors are especially grateful to the students, staff, and faculty of the University of Connecticut who volunteered their time.

References

[1] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1994.
[2] F.J. Damerau, A technique for computer detection and correction of spelling errors, Communications of the ACM 7 (1964) 171-176.
[3] R.S. Garfinkel, G.E. Liepins, A.S. Kunnathur, Error localization for erroneous data: a survey, TIMS Studies in the Management Sciences 19 (1982) 205-219.
[4] J. Grudin, Error patterns in skilled and novice transcription typing, in: W.E. Cooper (Ed.), Cognitive Aspects of Skilled Typewriting, Springer-Verlag, New York, 1983, pp. 121-143.
[5] D.G. Hendry, T.R.G. Green, Spelling mistakes: how well do correctors perform? INTERACT '93 and CHI '93 Conference Companion on Human Factors in Computing Systems, Association for Computing Machinery, New York, 1993, pp. 83-84.
[6] International CorrectSpell, web page at http://www.lhs.com/tech/icm/proofing/cs.asp (2000).
[7] D.E. Knuth, Sorting and Searching, The Art of Computer Programming, vol. 3, Addison-Wesley, Reading, MA, 1973.
[8] K. Kukich, Techniques for automatically correcting words in text, ACM Computing Surveys 24 (1992) 378-439.
[9] K. Kukich, Spelling correction for the telecommunication network for the deaf, Communications of the ACM 35 (1992) 80-90.
[10] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (1970) 443-453.
[11] J.L. Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (1980) 676-684.
[12] A.B. Robertson, P. Willett, Searching for historical word-forms in a database of 17th-century English text using spelling-correction methods, 15th Annual International SIGIR, Denmark, Association for Computing Machinery, New York, 1992, pp. 165-256.
[13] Y. Tsao, A lexical study of sentences typed by hearing-impaired TDD users, Proceedings of the 13th International Symposium on Human Factors in Telecommunications, Turin, Italy, Information Gatekeepers Incorporated, Boston, MA, 1990, pp. 197-201.
[14] A. Uitdenbogerd, J. Zobel, Melodic matching techniques for large music databases, Proceedings: ACM Multimedia, Orlando, Florida, Association for Computing Machinery, New York, 1999, pp. 57-66.

Robert Garfinkel is Professor of Operations and Information Management in the School of Business, University of Connecticut. He was previously on the faculties of the Universities of Rochester and Tennessee. His theoretical research focuses on such operations research problems as integer programming, network flows, vehicle routing, facility location, and combinatorial optimization. His current applied interests are mainly in data security and economic control of data streams in networks. His research has appeared in Operations Research, Management Science, INFORMS Journal on Computing, Mathematical Programming, Networks, Journal of the ACM, Transportation Science, and other journals. He has also published the book "Integer Programming".

Elena Fernandez is an Associate Professor in the Statistics and Operations Research Department at the Technical University of Catalonia in Barcelona, Spain. Her research interests include integer and combinatorial optimization, discrete location and routing problems, and the application of metaheuristic methods to combinatorial optimization problems. Her papers have appeared in Operations Research, European Journal of Operational Research, and other journals.

Ram D. Gopal is GE Capital Endowed Professor of Business and Associate Professor of Operations and Information Management in the School of Business, University of Connecticut. He currently serves as the PhD director for the department. His current research interests include economics of information systems management, data security, economic and ethical issues relating to intellectual property rights, and multimedia applications.
His research has appeared in Management Science, Operations Research, INFORMS Journal on Computing, Information Systems Research, Communications of the ACM, IEEE Transactions on Knowledge and Data Engineering, Journal of Management Information Systems, Decision Support Systems, and other journals and conference proceedings.