Searching Proper Names in Databases

Ulrich Pfeifer, Thomas Poersch, Norbert Fuhr
University of Dortmund, Lehrstuhl Informatik VI, D-44221 Dortmund
email: {pfeifer, poersch, fuhr}@example.com

Zusammenfassung

The identification of names — e.g. author and company names — is an open problem. In this paper we give an overview of known similarity measures. These measures are based on phonetic similarity, the number of typing errors, and plain string similarity. We show experimentally that all three approaches lead to significantly better retrieval quality than simple identity. Furthermore, we show that combinations of different similarity measures yield even better results than any single method.

Abstract

Identifying names — e.g., author names or company names — is still an open problem. In this paper we review known similarity measures. These measures deal with phonetic similarity, typing errors and plain string similarity. We show experimentally that all three approaches lead to significantly better retrieval quality than plain identity. Furthermore, we demonstrate that combinations of different similarity measures perform even better than any single technique.

1 Introduction

In our view, modern information retrieval systems are characterized by their ability to deal with two features: the vagueness of queries and the uncertainty of knowledge. In this paper, we focus on searching proper names. Here the vagueness stems from the limited knowledge a user has about his information need. If, for example, he is searching for papers by a certain author in a bibliographic database, he will not succeed if he misspells the author's name in his query. The second aspect, uncertainty of knowledge, refers to possible typing errors or different transliterations within the database. If a user searches for an author name and the database contains the misspelled name, the right entry will not be found.

These problems mostly arise with author names, company names etc. Therefore, we concentrate on single-word strings, namely surnames.

Some approaches have been made to take care of vagueness and uncertainty within information retrieval systems. In most commercial information retrieval systems, the user can only mask his query with elaborate wildcards in order to find derived or similar words. A few systems – e.g. ADABAS – implement phonetic searches. Some systems employ stemming algorithms for reducing words to their basic or stem forms, i.e., they remove the suffixes or transform the words to their infinitive forms ([Porter 80] [Salton & McGill 83]). This method is interesting for verbs, nouns and adjectives, but it is not appropriate for name searching.

Most non-linguistic similarity measures can be subdivided into three categories: (plain) string similarity, similarity with respect to typing errors, and phonetic similarity. In the first category, n-grams are the most common technique ([Angell et al. 83] [Hall & Dowling 80] [Pollock & Zamora 84] [Salton 88] [Zamora et al. 81]). An example of the second category is the Damerau-Levenstein metric ([Damerau 64] [Salton 88]), which counts the insertions, deletions, substitutions and transpositions needed to transform one string into another. While these two classes compare words without regard to the language used, the third kind of similarity measure is very language dependent: the algorithms Soundex and Phonix compare words with respect to their phonetic similarity ([Gadd 88] [Gadd 90]).
We implemented access paths for similarity measures from all three categories for the object-oriented information retrieval system NOSFERATU [Burghardt et al. 93]. The experiments presented here resemble work done by Robertson and Willett ([Robertson & Willett 92]), who used most of the techniques mentioned here for searching historical word-forms. However, they were mainly concerned with efficiency, while we put our focus on effectiveness. This seems a valid approach since names are not too frequent in text databases. We also tried combinations of the methods, which naturally decreased efficiency while increasing effectiveness significantly.

The following sections 2, 3 and 4 describe the different similarity measures examined in detail. Section 5 describes the experiments and their analysis. Finally, in section 6 the results are summarized and an outlook on further work is given.

2 Phonetic similarity

As an example, assume that a native English speaker, verbally instructed to search for a German author named "Maier", might look for "MYA", which is pronounced similarly. A German might start with "MEIER" or "MEYER". Presumably, neither will get the expected result.

T. N. Gadd describes in his papers [Gadd 88] and [Gadd 90] the two algorithms Soundex and Phonix. Both algorithms compute phonetic codes for the given names; names sharing the same code are assumed to be similar. Soundex's phonetic modelling is restricted to collecting similar-sounding consonants into classes, whereas the algorithm for computing the Phonix codes uses elaborate substitution rules. Both algorithms have been developed for the English language. For applying them to other languages, the character classes or the substitution rules, respectively, have to be adapted. For mixed-language databases (e.g. a literature database with authors from different countries), the Soundex algorithm is better suited because of its simplicity.

2.1 Soundex

        Soundex                      Phonix
  B F P V          → 1         B P              → 1
  C G J K Q S X Z  → 2         C G J K Q        → 2
  D T              → 3         D T              → 3
  L                → 4         L                → 4
  M N              → 5         M N              → 5
  R                → 6         R                → 6
                               F V              → 7
                               S X Z            → 8

Table 1: Substitution of letters with numbers

Soundex works as follows:

1. Remove all vowels, the consonants H, W, Y and all duplicate consecutive characters. The first letter is always left unaltered.
2. Create the Soundex code by concatenating the first letter with the following 3 letters replaced by their numeric codes according to table 1.

Two given words may 1) be identical, 2) differ but share a Soundex code, or 3) not be related at all. Based on this classification, we can assign each word in the database one of these three ranks with respect to a query.
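The two steps above can be turned into a short program. The following is a minimal sketch in Python of the Soundex coding as described here; the function name, the handling of non-letter characters and the padding of short codes with zeros are our own choices, not prescribed by [Gadd 88].

```python
# Minimal sketch of the Soundex coding described in section 2.1.
SOUNDEX_CODE = {letter: digit
                for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                       "4": "L", "5": "MN", "6": "R"}.items()
                for letter in letters}          # table 1, Soundex column

def soundex(name: str, length: int = 4) -> str:
    word = "".join(c for c in name.upper() if c.isalpha())
    if not word:
        return ""
    # Step 1: keep the first letter, remove vowels and H, W, Y, ...
    kept = word[0] + "".join(c for c in word[1:] if c not in "AEIOUHWY")
    # ... and collapse duplicate consecutive characters.
    collapsed = kept[0]
    for c in kept[1:]:
        if c != collapsed[-1]:
            collapsed += c
    # Step 2: first letter followed by the numeric codes of the following letters.
    code = word[0] + "".join(SOUNDEX_CODE.get(c, "") for c in collapsed[1:])
    return (code + "0" * length)[:length]

# The names of the introductory example all share the code M600 and would
# therefore fall into the "similar" rank:
print(soundex("Maier"), soundex("Meier"), soundex("Meyer"))   # M600 M600 M600
```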
2.2 Phonix

Phonix is far more complex than Soundex. While Soundex only removes vowels, some consonants and duplicate letters and carries out the numerical substitution, Phonix performs more extensive processing:

1. Perform the phonetic substitution, i.e., replace certain letter groups by other letter groups.
2. Replace the first letter by 'V' if it is a vowel or the consonant Y.
3. Strip the ending-sound from the word (roughly the part after the last vowel or 'Y').
4. Remove all vowels, the consonants H, W, Y and all duplicate consecutive characters.
5. Create the Phonix code of the word without its ending-sound by replacing every but the first remaining letter by its numerical value according to table 1. The maximum length of a Phonix code is restricted to 8 characters.
6. Create the Phonix code of the ending-sound by replacing every letter by its numerical value. The maximum length of a Phonix code for an ending-sound is restricted to 8 characters.

We can now assign each word of the database to one of the three ranks (1) identical, 2) similar, 3) unrelated), as we did for Soundex codes. If we take advantage of the ending-sounds computed in the third step, we can split the similar rank into three ranks. These contain the words that also 2a) agree on the ending-sounds, 2b) agree on a prefix of the ending-sounds and 2c) have different ending-sounds.

We implemented access paths for both phonetic similarity measures by an inverted index over the codes; the original names and, for Phonix codes, the ending-sounds are stored in the inverted file. Since there is no need to access the documents themselves, a query can be answered very fast.

3 Plain string similarity

Another well-known method of comparing strings is the use of n-grams ([Angell et al. 83] [Hall & Dowling 80] [Pollock & Zamora 84] [Salton 88] [Zamora et al. 81]). N-grams are language independent, i.e., this technique only compares the letters of words regardless of the language used. If two strings are compared with respect to their n-grams, the sets of n-grams are calculated for both strings. Then these sets are compared: the more n-grams occur in both sets, the more similar the two strings are.

Table 2 shows an example for the use of trigrams taken from [Salton 88]: the user is searching for the misspelled word RECEIEVE, but the database only contains the five similar words RECEIVE, RECEIVER, REPRIEVE, RETRIEVE and REACTIVE. Now the trigrams are calculated, and, for example, the words RECEIEVE and RECEIVE have 3 of their 8 different trigrams in common. Therefore the retrieved term RECEIVE gets the similarity coefficient 3/8 with respect to the searched term RECEIEVE. In general, the similarity coefficient is computed by equation (1), where N_1 and N_2 are the n-gram sets of the two compared words:

    similarity coefficient := \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}    (1)

  word        trigrams                         in common with RECEIEVE   similarity coefficient
  RECEIEVE    REC ECE CEI EIE IEV EVE          (searched word)           -
  RECEIVE     REC ECE CEI EIV IVE              3                         3/8
  RECEIVER    REC ECE CEI EIV IVE VER          3                         3/9
  REPRIEVE    REP EPR PRI RIE IEV EVE          2                         2/10
  RETRIEVE    RET ETR TRI RIE IEV EVE          2                         2/10
  REACTIVE    REA EAC ACT TIV IVE              0                         0/11

Table 2: Example for the use of a trigram analysis

Choice of the parameter n: According to [Salton 88] and [Zamora et al. 81], trigrams and digrams achieve the best results in retrieving words similar to a given word.

Use of additional blanks: Furthermore, the n-gram analysis can make use of additional blanks that are appended to the start and the end of a word. This technique emphasizes the first and last letters of the words. While the word "IDLE" possesses only two trigrams when no blanks are used (see table 3), the use of two blanks gives six different trigrams.

  IDLE       → IDL DLE
  _IDLE_     → _ID IDL DLE LE_
  __IDLE__   → __I _ID IDL DLE LE_ L__

Table 3: Example for the use of additional blanks

As an access path assisting the computation of the similarity coefficient we used an inverted index similar to the one used for the phonetic codes and exhaustively processed all n-gram lists. Even though this worked well for our small database, larger applications will need more sophisticated processing strategies or even other data structures in order to achieve reasonable performance.
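The similarity coefficient of equation (1) and the padding blanks of table 3 are easy to state as code. The following is a small illustrative sketch; the helper names are ours, and this is not the inverted-index access path used in NOSFERATU.

```python
# Sketch of the n-gram similarity coefficient of equation (1) with optional
# padding blanks as in table 3.

def ngrams(word: str, n: int, blanks: int = 0) -> set:
    """Set of n-grams of `word`, padded with `blanks` blanks on each side."""
    padded = " " * blanks + word.upper() + " " * blanks
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2, blanks: int = 1) -> float:
    """Equation (1): |N1 intersect N2| / |N1 union N2|."""
    n1, n2 = ngrams(a, n, blanks), ngrams(b, n, blanks)
    return len(n1 & n2) / len(n1 | n2)

# The trigram example of table 2 (no additional blanks):
print(ngram_similarity("RECEIEVE", "RECEIVE", n=3, blanks=0))   # 3/8 = 0.375
# The variation that performs best in section 5: digrams with one blank.
print(ngram_similarity("MAIER", "MEYER", n=2, blanks=1))
```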
4 Typing errors

In his paper of 1964, Fred J. Damerau describes the comparison of words with respect to the following four types of typing errors [Damerau 64] (the words in parentheses show examples for the word DAMERAU):

- An additional letter is inserted (e.g. DAHMERAU).
- A letter is deleted (e.g. DAMRAU).
- A letter is substituted by another letter (e.g. DANERAU).
- Two adjacent letters are transposed (e.g. DAMERUA).

Damerau uses the so-called Damerau-Levenstein metric for calculating the minimum number of errors between two words s and t:

    f(0,0) := 0
    f(i,j) := \min \{ f(i-1,j) + 1,\; f(i,j-1) + 1,\; f(i-1,j-1) + d(s_i,t_j),\; f(i-2,j-2) + d(s_{i-1},t_j) + d(s_i,t_{j-1}) + 1 \}    (2)

The function d is a distance measure for letters. A simple measure is the non-identity used in the following, but more elaborate measures drawn from statistical analyses of typing errors or from the geometry of keyboards can be used instead:

    d(s_i,t_j) := \begin{cases} 0 & \text{if } s_i = t_j \\ 1 & \text{if } s_i \neq t_j \end{cases}    (3)

The function f(i,j) calculates the minimum number of errors that distinguish the first i characters of the first word from the first j characters of the second word. Consequently, two words s and t with lengths l_s and l_t differ by f(l_s, l_t) errors.

4.1 Skeleton-Key and Omission-Key

In [Pollock & Zamora 84] two techniques called Skeleton-Key and Omission-Key were presented:

Definition: The Skeleton-Key of a word consists of its first letter, the remaining consonants in order of appearance and the remaining vowels in order of appearance. This key contains every letter at most once.

Experiments showed that consonants were omitted in the following frequency order: RSTNLCHDPGMFBYWVZXQKJ, i.e., R is omitted more often than any other letter, J less often than any other letter.

Definition: The Omission-Key of a word consists of the consonants in the reverse of the above frequency order and the vowels in order of appearance. This key contains every letter at most once.

Pollock and Zamora then used the distance of two keys in an alphabetically sorted list as a distance measure for the original words. The distance based on Skeleton-Keys reflects the fact that consonants carry more information than vowels. Using the Omission-Key, the computed similarity takes advantage of the frequency with which certain consonants are omitted when typed. In our experiments, we tested only a variation of the two techniques in which we used them as an algorithm for normalizing a word before applying the Damerau-Levenstein metric.

Retrieval using the Damerau-Levenstein metric was implemented by a scan over the full database. For real-life applications this is prohibitive. Clustering methods might be used in order to preselect the parts of the database where searching for names related to the query is most promising. But – due to the nature of the measure – there is a limit to the possible increase of performance.
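A direct transcription of equations (2) and (3), together with the Skeleton-Key normalization, is sketched below. The function names are ours, the boundary values f(i,0) = i and f(0,j) = j are implied by the recurrence, and this shows only the bare metric, not the database scan used for retrieval.

```python
# Sketch of the Damerau-Levenstein metric of equations (2) and (3) and the
# Skeleton-Key of section 4.1.

def d(a: str, b: str) -> int:
    """Equation (3): 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1

def dl_distance(s: str, t: str) -> int:
    """f(l_s, l_t) of equation (2): minimum number of typing errors between s and t."""
    s, t = s.upper(), t.upper()
    ls, lt = len(s), len(t)
    f = [[0] * (lt + 1) for _ in range(ls + 1)]
    for i in range(1, ls + 1):
        f[i][0] = i                                    # i deletions
    for j in range(1, lt + 1):
        f[0][j] = j                                    # j insertions
    for i in range(1, ls + 1):
        for j in range(1, lt + 1):
            candidates = [f[i - 1][j] + 1,                          # deletion
                          f[i][j - 1] + 1,                          # insertion
                          f[i - 1][j - 1] + d(s[i - 1], t[j - 1])]  # substitution / match
            if i > 1 and j > 1:                                     # transposition term
                candidates.append(f[i - 2][j - 2]
                                  + d(s[i - 2], t[j - 1]) + d(s[i - 1], t[j - 2]) + 1)
            f[i][j] = min(candidates)
    return f[ls][lt]

def skeleton_key(word: str) -> str:
    """Skeleton-Key: first letter, remaining consonants, then remaining vowels (each once)."""
    word = word.upper()
    seen, consonants, vowels = {word[0]}, [], []
    for c in word[1:]:
        if c not in seen:
            seen.add(c)
            (vowels if c in "AEIOU" else consonants).append(c)
    return word[0] + "".join(consonants) + "".join(vowels)

# The four single-error examples for DAMERAU all have distance 1:
for variant in ("DAHMERAU", "DAMRAU", "DANERAU", "DAMERUA"):
    print(variant, dl_distance("DAMERAU", variant))
# The SKELETON variation of our experiments: normalize first, then compare.
print(dl_distance(skeleton_key("Maier"), skeleton_key("Meyer")))
```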
5 Experiments

For our experiments, we extracted surnames from a couple of sources manually; automatic extraction is beyond the scope of this paper. Especially for English texts, detection of names could be achieved by heuristics looking at words starting with capital letters, but more sophisticated methods using dictionaries and/or NLP techniques may also be applied.

The sources used were parts of the TREC collection [Harman 93], the CACM collection from the SMART system ([Buckley 85]), the phonebook of the University of Dortmund and a local bibliographic database (see table 4). The phonebook and the bibliographic database provide non-English names; some of the TREC sources also contain spelling errors.

  name        terms   origin   description
  AP           1190   TREC     AP Newswire (1989)
  BIB          1282            literature database (1993)
  CACM         2208   SMART    ACM abstracts
  FR           2051   TREC     Federal Register (1989)
  TELEFON      2510            phone book database of the University of Dortmund (1993)
  WSJ          1026   TREC     Wall Street Journal (1988)
  ZIFF         8125   TREC     Information from Computer Select Disks (Ziff-Davis Publishing)
  COMPLETE    14972            combination of all databases

Table 4: Databases used for the experiments

All of these source databases were scanned for surnames, which were stored in test databases with sizes varying from about 1000 to 8000 terms and containing only unique surnames. Finally, all those test databases were combined into one large test database called COMPLETE with some 14000 names. After the creation of the experimental databases, the queries to these databases were determined: first, 90 names were randomly chosen from the databases and, second, the sets of relevant terms were manually determined for each of those 90 queries. The COMPLETE database contains an average of 13.1 relevant names per query.

5.1 Quality measures

For the comparison of the different techniques we used recall-precision graphs. Since the ranking of the answer set is not a linear ordering, we get only a few recall-precision points for each query, namely one after each rank. To be able to average over the query set, we need a method for interpolation. We use the probability-of-relevance method proposed in [Raghavan et al. 89]. The method assumes that the user examines the result ranking, randomly selecting a document from the topmost not yet completely read rank. At a given stage during this process, the precision is defined as the probability that a randomly examined document is relevant; recall is the fraction of relevant documents examined.

The corresponding algorithm works as follows. Given that a rank contains r relevant and i non-relevant documents, a user who wants s ≤ r relevant documents from this rank will examine the following expected number of non-relevant documents:

    esl_{r,i}(s) = \frac{s \cdot i}{r+1}    (4)

For the situation with more than one rank, we call the rank currently under inspection l and further define:

    j  : number of non-relevant documents in ranks 1 … l-1
    s  : number of relevant documents picked from rank l
    r  : number of relevant documents in rank l
    i  : number of non-relevant documents in rank l
    NR : number of required relevant documents

Then equation (4) extends to:

    esl(NR) = j + esl_{r,i}(s) = j + \frac{s \cdot i}{r+1}    (5)

Finally, by combining (5) with the definition of the probability of relevance (PRR), we get:

    PRR(NR) = \frac{NR}{NR + esl(NR)} = \frac{NR}{NR + j + \frac{s \cdot i}{r+1}}    (6)

If we also allow non-integer values for NR, we get a smooth interpolation which can be averaged over the query set.
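The interpolation of equations (4) to (6) can be sketched as a small program. The names below and the handling of fractional NR are our own reading of [Raghavan et al. 89], intended only as an illustration of the formulas above.

```python
# Sketch of the probability-of-relevance interpolation of section 5.1.
# `ranking` is a list of (relevant, non_relevant) counts per rank, best rank
# first; NR is the number of required relevant documents.

def prr(ranking, NR: float) -> float:
    """PRR(NR) = NR / (NR + j + s*i/(r+1)), equations (5) and (6)."""
    j = 0.0      # non-relevant documents in the completely examined ranks
    found = 0.0  # relevant documents in the completely examined ranks
    for r, i in ranking:
        if found + r >= NR:                # rank l: NR is reached within this rank
            s = NR - found                 # relevant documents still to pick from rank l
            esl = j + s * i / (r + 1)      # equation (5)
            return NR / (NR + esl)         # equation (6)
        found += r
        j += i
    return 0.0                             # NR exceeds the relevant documents retrieved

# Example: rank 1 with 2 relevant / 1 non-relevant name, rank 2 with 3 / 10.
ranking = [(2, 1), (3, 10)]
for nr in (1, 2, 3, 4, 5):
    print(nr, round(prr(ranking, nr), 3))
```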
5.2 Analysis

Because the test queries were drawn from the database COMPLETE, there is no query without a relevant entry in COMPLETE. Real applications, however, have to cope with the situation where a user searches for a name that is not present in the database. Therefore, we present here the experimental results for the database ZIFF, where the number of relevant entries per query ranges from 0 (3 queries) to 26. On average, there are 6.6 relevant documents per query.

Figure 1: Soundex/Phonix (recall-precision graph for Soundex, Phonix4, Phonix8, PhonixE and identity matching)

Figure 1 contains the results of the four variations of the Soundex and Phonix algorithms described below:

    SOUNDEX : Soundex algorithm with codes of length 4
    PHONIX4 : Phonix algorithm with codes of length 4 (no ending-sounds)
    PHONIX8 : Phonix algorithm with codes of length 8 (no ending-sounds)
    PHONIXE : Phonix algorithm with codes of length 8 (ending-sounds)

Obviously, SOUNDEX is the worst variation; its values for the expected precision are about 0.1 lower than those of PHONIXE. PHONIX8 delivers precision values that are better than SOUNDEX but still worse than PHONIX4 and PHONIXE. The latter two variations perform best in this experiment: while PHONIXE shows higher precision values than PHONIX4 for recall levels below 0.5, the relationship is reversed for recall levels above 0.5. So it depends on the application which method should be preferred.

The behaviour of these variations can be explained easily: SOUNDEX is the worst variation because of the missing phonetic substitution; this feature of Phonix is a very important improvement. PHONIX8 shows worse performance than PHONIX4 since it uses codes that are too long and therefore too discriminative. PHONIXE alleviates this problem by additionally regarding ending-sounds, so similar words have a second chance of being discovered. Furthermore, figure 1 contains the graph representing the performance of retrieving only identical terms. The difference between this graph and the similarity measures shows that our work is worthwhile.

The technique using n-grams was analyzed for n = 1, …, 5. Each of these variations was examined using no, one and two additional blanks (of course, digrams cannot make use of two blanks). The results show clear tendencies: figure 2 shows that the more additional blanks are used, the better the trigrams perform; this is also true for the other n-gram variations. Figure 3 reveals that performance decreases with increasing n, independently of the number of blanks used. In conclusion, these two figures show that digrams with one blank are the best variation of n-grams. Trigrams with two blanks are a little worse but still a good alternative to the digrams. A closer look at the graphs reveals that digrams are even better than the best Phonix variation (see fig. 1): at a recall of 0.5, PHONIXE shows precision values of about 0.53, while digrams achieve a precision of 0.62 at the same recall level.

The Damerau-Levenstein metric was analyzed for the ordinary metric (DAMERAU), the metric using the Skeleton-Key (SKELETON) and the metric using the Omission-Key (OMISSION). For reasons of efficiency, we restricted the maximum number of errors to values between 1 and 3; figure 4 shows that admitting more than 3 typing errors would not increase effectiveness further. The following conclusions can be drawn with the help of figures 5 and 6: the more errors are allowed, the better the different variations perform. With a maximum of one error, SKELETON shows the best performance. With a maximum of two or three errors, DAMERAU is superior to SKELETON and OMISSION. The best results can be achieved with a maximum of three errors (see fig. 6); here DAMERAU is better than SKELETON (by about 0.08 at a recall of 0.5), and SKELETON is far better than OMISSION (by about 0.18 at a recall of 0.5). Finally, fig. 7 compares the best variations of these techniques.
The differences are significant: digrams with one blank are the best similarity measure. The Phonix algorithm with ending-sounds is worse than the digrams but better than the Damerau-Levenstein metric in the more interesting recall interval from 0 to 0.67.

Figure 2: Trigrams with different numbers of blanks (recall-precision graph; 0, 1 and 2 blanks)

Figure 3: n-grams with one blank (recall-precision graph; 2-grams to 5-grams)

Figure 4: Damerau variation SKELETON with at most five errors (recall-precision graph; max. 1 to 5 errors)

Figure 5: Damerau-Levenstein metric with at most one error (recall-precision graph; Damerau, Skeleton, Omission)

Figure 6: Damerau-Levenstein metric with at most three errors (recall-precision graph; Damerau, Skeleton, Omission)

Figure 7: Comparison of the best variations (recall-precision graph; PhonixE, Digram1, Damerau3)

5.3 Combinations

In this section, combinations of the three different techniques are analyzed. For the combinations, only the best variation of each technique was chosen, i.e., PHONIXE, digrams with one additional blank (DIGRAM1) and the Damerau-Levenstein metric with a maximum of three errors (DAMERAU3).

The first problem is that the rankings produced by the different methods have to be mapped onto a single linear scale. The obvious choice would be the probability that a word in a given rank will be judged relevant by the user. Calculating such estimates requires statistical data which we did not have the time to generate. So we chose a rather crude mapping: the n-gram similarity coefficient is used without change; the Damerau coefficient f is mapped to 1/(1 + f); the ranks for Soundex and for Phonix without ending-sounds are mapped to 1, 0.9 and 0; the ranks for Phonix with ending-sounds are mapped to 1, 0.9, 0.8, 0.7 and 0.

There are still two open questions: Which function should be applied for combining the different weights into one new weight? Which similarity measures should be combined? The first question is addressed by an experiment using the following weighting functions: median, maximum, arithmetic mean and geometric mean. The results in figure 8 show that maximum is the worst of these four functions; its precision values are about 0.11 lower than the values of the other graphs at a recall of 0.5. Within the recall interval from 0 to 0.8, the two mean functions are slightly better than the median, and the difference between the arithmetic mean and the geometric mean can be neglected. Within the recall interval from 0.8 to 1, the median is significantly better than the arithmetic or geometric mean. Summarizing, either the arithmetic mean or the geometric mean should be used to combine the different weights into one weight when low recall levels (< 0.8) are important. For applications which put more emphasis on high recall levels, the robust combination function (median) might be a good choice.
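To make the crude mapping and the weighting functions concrete, the following sketch restates them as code. The weight values follow the text above, while the function names and the example inputs are illustrative assumptions of ours.

```python
# Sketch of the weight mapping and the combination functions of section 5.3.
from statistics import median

def damerau_weight(errors: int) -> float:
    """Damerau coefficient f mapped to 1/(1 + f)."""
    return 1.0 / (1.0 + errors)

# Ranks of Phonix with ending-sounds (identical, same ending-sound, prefix of
# the ending-sound, different ending-sound, unrelated) mapped to weights:
PHONIXE_WEIGHT = {1: 1.0, 2: 0.9, 3: 0.8, 4: 0.7, 5: 0.0}

def arithmetic_mean(weights):
    return sum(weights) / len(weights)

def geometric_mean(weights):
    product = 1.0
    for w in weights:
        product *= w
    return product ** (1.0 / len(weights))

# Combining the three kinds of evidence for one candidate name: an (invented)
# digram similarity of 0.5, agreement of Phonix codes and ending-sounds, and
# two typing errors.
weights = [0.5, PHONIXE_WEIGHT[2], damerau_weight(2)]
for name, combine in (("median", median), ("maximum", max),
                      ("arithmetic mean", arithmetic_mean),
                      ("geometric mean", geometric_mean)):
    print(name, round(combine(weights), 3))
```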
The drop in performance at the 0.8 recall level is probably due to the poor mapping of the Phonix ranks, which disturbs the mean combinations while the median is more robust. So, given a better mapping of ranks to a linear scale, the above statements might have to be reviewed.

When choosing the similarity measures to be combined, efficiency and effectiveness have to be considered. For some measures very fast implementations are available, whereas others are computationally demanding. Secondly, the more measures are combined, the more efficiency suffers. On the other hand, the question is whether adding a new measure to a set of measures increases performance in every case.

Figure 8: Combination Digrams1–PhonixE–Damerau3 with different weighting functions (recall-precision graph; median, maximum, arithmetic mean, geometric mean)

Figure 9: Different combinations with geometric means (recall-precision graph; Digrams1 & Damerau3, Digrams1 & PhonixE, Digrams1 & PhonixE & Damerau3, PhonixE & Damerau3)

So, in another set of experiments, we looked for a minimal set of measures with reasonably fast implementations. Figure 9 shows the results of four different combinations: the worst alternative is the combination of the two techniques PHONIXE and DAMERAU3; its performance is significantly lower than that of the other combinations. The best combination is the use of all three techniques. The combination DIGRAMS1–PHONIXE is slightly worse than the combination PHONIXE–DIGRAMS1–DAMERAU3, but about 0.4 better than DIGRAMS1–DAMERAU3. Summarizing these results (and other results not presented here), these experiments show that the more evidence is combined, the better the results, and the more the combined measures differ, the better the combination performs. An interesting result is that although PHONIXE–DIGRAMS1–DAMERAU3 performs best, the difference to DIGRAMS1–PHONIXE is rather small. So by dropping the Damerau measure, effectiveness will drop only slightly, while efficiency will benefit greatly, since this measure can hardly be implemented efficiently.

6 Conclusion

In this paper we have shown that an information retrieval system using a similarity measure for searching proper names will perform much better than a system using only exact-match searches. Three different techniques for calculating the similarity of strings have been introduced and compared. The results show that digrams using one blank perform best, while PHONIXE is only slightly better than the Damerau-Levenstein metric with at most three errors. Much better than the use of a single technique are combinations of different techniques; here the combination of digrams with one additional blank and PHONIXE can be recommended.

Although the approach of combining two or three different similarity measures seems very promising, the additional work for maintaining and searching one or even two additional techniques has to be considered. A solution might be to compute weights for an additional similarity measure (e.g. the Damerau-Levenstein metric) only for the answers already found. Though no new answers will be found this way, the ranking of the retrieved answers might be improved. Future work will concentrate on this problem, i.e., the most efficient combination of some of the techniques.
Furthermore, the choice of the best weighting function for the combinations is to be analyzed in more detail.

References

Angell, R.; Freund, G.; Willett, P. (1983). Automatic Spelling Correction Using a Trigram Similarity Measure. Information Processing and Management 19(4), pages 255–261.

Buckley, C. (1985). Implementation of the SMART Information Retrieval System. Technical Report 85-686, Department of Computer Science, Cornell University, Ithaca, NY.

Burghardt, A.; Fuhr, N.; Großjohann, K.; Pfeifer, U.; Spielmann, H.; Stach, O. (1993). NOSFERATU — an Integrated Database Management and Information Retrieval System Based on Data Streams (in German). In: Proceedings 1. GI-Fachtagung Information Retrieval, pages 27–40. Universitätsverlag Konstanz, Konstanz.

Damerau, F. (1964). A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM 7, pages 171–176.

Gadd, T. (1988). 'Fisching for Werds': Phonetic Retrieval of Written Text in Information Retrieval Systems. Program 22(3), pages 222–237.

Gadd, T. (1990). PHONIX: The Algorithm. Program 24(4), pages 363–366.

Hall, P.; Dowling, G. (1980). Approximate String Matching. Computing Surveys 12(4), pages 381–402.

Harman, D. (1993). Overview of the First Text REtrieval Conference. In: Harman, D. (ed.): The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, Md. 20899.

Pollock, J.; Zamora, A. (1984). Automatic Spelling Correction in Scientific and Scholarly Text. Communications of the ACM 27, pages 358–368.

Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program 14, pages 130–137.

Raghavan, V. V.; Bollmann, P.; Jung, G. S. (1989). Retrieval System Evaluation Using Recall and Precision: Problems and Answers. In: Belkin, N.; van Rijsbergen, C. J. (eds.): Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59–68. ACM, New York.

Robertson, A.; Willett, P. (1992). Searching for Historical Word-Forms in a Database of 17th-Century English Text Using Spelling-Correction Methods. In: Belkin, N.; Ingwersen, P.; Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 256–265. ACM, New York.

Salton, G.; McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.

Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts.

Zamora, E.; Pollock, J.; Zamora, A. (1981). The Use of Trigram Analysis for Spelling Error Detection. Information Processing and Management 17(6), pages 305–316.