
Searching Proper Names in Databases
Ulrich Pfeifer
Thomas Poersch
Norbert Fuhr
University of Dortmund
Lehrstuhl Informatik VI
D-44221 Dortmund
Zusammenfassung
The identification of names – e.g. author names and company names – is an open problem. In
this paper we give an overview of known similarity measures. These measures are based on
phonetic similarity, the number of typing errors and plain string similarity. We show
experimentally that all three approaches lead to significantly better retrieval quality than
plain identity. Furthermore, we show that combinations of different similarity measures yield
even better results than any single method.
Abstract
Identifying names — e.g., author names or company names — is still an open problem. In
this paper we review known similarity measures. These measures deal with phonetic similarity,
typing errors and plain string similarity. We show experimentally that all three approaches
lead to significantly better retrieval quality than plain identity. Furthermore, we demonstrate that
combinations of different similarity measures perform even better than any single technique.
1 Introduction
In our view, modern information retrieval systems are characterized by their ability to deal with
two features: vagueness of queries and uncertainty of knowledge. In this paper, we focus on
searching proper names. Here the vagueness stems from the limited knowledge a user has about his
information need. If he is, e.g., searching for papers of a certain author in a bibliographic database,
he will not be successful if he misspells the author's name in his query. The other item, uncertainty
of knowledge, refers to possible typing errors or different transliterations within the database. If
a user is searching for an author name and the database contains the name in a misspelled form,
the right entry will not be found.
email: pfeifer, poersch, [email protected]
These problems mostly arise with author names, company names, etc. Therefore, we concentrated
on single-word strings, namely surnames.
Some approaches have been made to deal with vagueness and uncertainty within information
retrieval systems. In most commercial information retrieval systems, the only possibility the user
has is to mask his query with elaborate wildcards in order to find derived or similar words. A few
systems – e.g. ADABAS – implement phonetic searches.
Some systems employing stemming algorithms have been developed for reducing words to their
basic forms or stem forms, i.e., they remove the suffixes or transform the words to their infinitive
forms ([Porter 80] [Salton & McGill 83]). This method is interesting for verbs, nouns and adjectives,
but it is not appropriate for name searching.
Most non-linguistic similarity measures can be subdivided into three different categories: (plain)
string similarity, similarity with respect to typing errors, and phonetic similarity. In the first
category, n-grams are most common ([Angell et al. 83] [Hall & Dowling 80] [Pollock & Zamora 84]
[Salton 88] [Zamora et al. 81]). An example of the second category is the Damerau-Levenstein-metric
([Damerau 64] [Salton 88]), which counts the insertions, deletions, substitutions and transpositions
needed to transform one string into another. While these two classes compare words without
regarding the language used, the third kind of similarity measure is very language dependent: The
algorithms Soundex and Phonix compare words with regard to their phonetic similarity ([Gadd 88]
[Gadd 90]).
We implemented access paths for similarity measures from all three categories for the object-oriented information retrieval system NOSFERATU [Burghardt et al. 93].
The experiments presented here resemble work done by Robertson and Willett ([Robertson & Willett 92]),
who used most of the techniques mentioned here for searching historical word-forms. But they
were mainly concerned with efficiency, while we put our focus on effectiveness. This seems a valid
approach since names are not too frequent in text databases. We also tried combinations of the
methods, which naturally decreased efficiency while increasing effectiveness significantly.
Sections 2, 3 and 4 describe the different similarity measures examined in detail.
Section 5 presents the experiments and their analysis. Finally, section 6 summarizes the results
and gives an outlook on further work.
2 Phonetic similarity
As an example, assume that a native English speaker, verbally instructed to search for a German
author named “Maier”, might look for “MYA”, which is pronounced similarly. A German might
start with “MEIER” or “MEYER”. Presumably, neither will get the expected result.
T.N. Gadd describes in his papers [Gadd 88] and [Gadd 90] the two algorithms Soundex and Phonix.
Both algorithms calculate phonetic codes for the given names; names sharing the same code are
assumed to be similar. Soundex's phonetic capability is restricted to grouping similar-sounding
consonants into classes, while the algorithm for computing the Phonix codes uses elaborate
substitution rules.
Both algorithms have been developed for the English language. For applying them to other languages,
the character classes or the substitution rules, respectively, have to be adapted. For mixed-language
databases (e.g. a literature database with authors from different countries), the Soundex algorithm
is better suited because of its simplicity.
2.1 Soundex
    Soundex                        Phonix
    B F P V          → 1           B P              → 1
    C G J K Q S X Z  → 2           C G J K Q        → 2
    D T              → 3           D T              → 3
    L                → 4           L                → 4
    M N              → 5           M N              → 5
    R                → 6           R                → 6
                                   F V              → 7
                                   S X Z            → 8

    Table 1: Substitution of letters with numbers
Soundex works as follows:
1. Remove all vowels, the consonants H, W, Y and all duplicate consecutive characters. The first
letter is always left unaltered.
2. Create the Soundex code by concatenating the first letter with the following 3 letters, each
replaced by its numeric code according to table 1.
Two given words may (1) be identical, (2) differ but share a Soundex code, or (3) not be related at all.
Based on this classification, we can assign each word in the database one of these three ranks with
respect to a query.
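As an illustration, the following Python sketch implements the two steps above. It is only a sketch, not the implementation used in NOSFERATU; the handling of lower-case input and the absence of zero padding for short codes are our own assumptions, since the paper does not specify them.

    # Sketch of the Soundex variant described above (steps 1 and 2).
    SOUNDEX_CODES = {
        **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
        **dict.fromkeys("DT", "3"), "L": "4",
        **dict.fromkeys("MN", "5"), "R": "6",
    }

    def soundex(name):
        word = name.upper()
        # Step 1: drop vowels, H, W, Y (the first letter is kept unaltered)
        # and collapse runs of identical consecutive characters.
        kept = [word[0]]
        for ch in word[1:]:
            if ch in "AEIOUHWY":
                continue
            if ch != kept[-1]:
                kept.append(ch)
        # Step 2: the first letter followed by the numeric codes (table 1)
        # of the remaining letters; the code is cut off at length 4.
        digits = "".join(SOUNDEX_CODES.get(ch, "") for ch in kept[1:])
        return (kept[0] + digits)[:4]

    # Different spellings of the same name share a code:
    assert soundex("Meier") == soundex("Meyer")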
2.2 Phonix
Phonix is far more complex than Soundex. While Soundex only removes vowels, some consonants
and duplicate letters and carries out the numerical substitution, Phonix does considerably more:
1. Perform the phonetic substitution, i.e., replace certain letter groups by other letter groups.
2. Replace the first letter by 'V' if it is a vowel or the consonant Y.
3. Strip the ending-sound from the word (roughly the part after the last vowel or 'Y').
4. Remove all vowels, the consonants H, W, Y and all duplicate consecutive characters.
5. Create the Phonix code of the word without its ending-sound by replacing every but the first
remaining letter by its numerical value according to table 1. The maximum length of a Phonix
code is restricted to 8 characters.
6. Create the Phonix code of the ending-sound by replacing every letter by its numerical value.
The maximum length of a Phonix code for an ending-sound is restricted to 8 characters.
We can now assign each word of the database to one of the three ranks ((1) identical, (2) similar,
(3) unrelated), as we did for Soundex codes. If we take advantage of the ending-sounds computed
in the third step, we can split the similar rank into three ranks, containing the words that also
(2a) agree on the ending-sounds, (2b) agree on a prefix of the ending-sounds, and (2c) have different
ending-sounds.
We implemented access paths for both phonetic similarity measures as an inverted index over the
codes, storing the original names and, for the Phonix codes, the ending-sounds in the inverted file.
Since there is no need to access the documents themselves, a query can be answered very fast.
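A minimal sketch of such an access path (our own illustration, not the NOSFERATU implementation) is an in-memory inverted index that maps a phonetic code to the list of original names carrying that code; it uses the soundex function from the sketch in section 2.1.

    from collections import defaultdict

    def build_index(names, code):
        """Inverted index: phonetic code -> original names with that code."""
        index = defaultdict(list)
        for name in names:
            index[code(name)].append(name)
        return index

    index = build_index(["Maier", "Meier", "Meyer", "Smith", "Smythe"], soundex)
    # All stored names sharing the query's code, without touching the documents:
    print(index[soundex("Mayr")])    # e.g. ['Maier', 'Meier', 'Meyer']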
3 Plain string similarity
Another well-known method of comparing strings is the use of n-grams ([Angell et al. 83] [Hall &
Dowling 80] [Pollock & Zamora 84] [Salton 88] [Zamora et al. 81]). N-grams are language
independent, i.e., this technique only compares the letters of the words, regardless of the language used.
If two strings are compared with respect to their n-grams, the sets of n-grams are calculated for
both strings. These sets are then compared: the more n-grams the two sets have in common, the
more similar the two strings are.
Table 2 shows an example for the use of trigrams taken from [Salton 88]: the user is searching for
the misspelled word RECEIEVE, but the database only contains the five similar words RECEIVE,
RECEIVER, REPRIEVE, RETRIEVE and REACTIVE. Now the trigrams are calculated, and, for
example, the words RECEIEVE and RECEIVE have 3 of their 8 different trigrams in common.
Therefore the retrieved term RECEIVE gets the similarity coefficient 3/8 with respect to the searched
term RECEIEVE.
In general, the similarity coefficient is calculated according to equation (1), where N_1 and N_2
are the n-gram sets of the two compared words.

    similarity coefficient := |N_1 ∩ N_2| / |N_1 ∪ N_2|        (1)
Choice of the parameter n: According to [Salton 88] and [Zamora et al. 81], trigrams and digrams
achieve the best results in retrieving words similar to a given word.
Use of additional blanks: Furthermore, the n-gram analysis can make use of additional blanks that
are appended to the start and the end of a word. This technique makes it possible to emphasize the
first and last letters of the words.
                 searched word                  retrieved words
    trigram      RECEIEVE    RECEIVE  RECEIVER  REPRIEVE  RETRIEVE  REACTIVE
    REC             1           1        1
    ECE             1           1        1
    CEI             1           1        1
    EIE             1
    IEV             1                               1         1
    EVE             1                               1         1
    EIV                         1        1
    IVE                         1        1                               1
    VER                                  1
    REP                                             1
    EPR                                             1
    PRI                                             1
    RIE                                             1         1
    RET                                                       1
    ETR                                                       1
    TRI                                                       1
    REA                                                                 1
    EAC                                                                 1
    ACT                                                                 1
    TIV                                                                 1
    coefficient                3/8      3/9        2/10      2/10      0/11

    Table 2: Example for the use of a trigram analysis
While the word “IDLE” possesses only two trigrams when no blanks are used (see
table 3), the use of two blanks gives six different trigrams.
    IDLE        →  IDL  DLE
    _IDLE_      →  _ID  IDL  DLE  LE_
    __IDLE__    →  __I  _ID  IDL  DLE  LE_  L__

    (here "_" denotes an appended blank)

    Table 3: Example for the use of additional blanks
As an access path assisting the computation of the similarity coefficient we used an inverted index
similar to the one used for the phonetic codes and exhaustively processed all n-gram lists. Even
though this worked well for our small database, larger applications will need more sophisticated
processing strategies or even other data structures in order to achieve reasonable performance.
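The following Python sketch illustrates the similarity coefficient of equation (1) together with the optional blanks of table 3; the function names and the padding parameter are our own choices.

    def ngrams(word, n=2, blanks=1):
        """Set of n-grams of a word padded with `blanks` blanks on each side."""
        padded = " " * blanks + word.upper() + " " * blanks
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def ngram_similarity(a, b, n=2, blanks=1):
        na, nb = ngrams(a, n, blanks), ngrams(b, n, blanks)
        return len(na & nb) / len(na | nb)    # |N1 ∩ N2| / |N1 ∪ N2|

    # The example of table 2 (trigrams, no additional blanks):
    print(ngram_similarity("RECEIEVE", "RECEIVE", n=3, blanks=0))   # 3/8 = 0.375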
4 Typing errors
In his paper of 1964, Fred J. Damerau describes the comparison of words with respect to the following
four types of typing errors [Damerau 64] (the words in parentheses show examples for the word
DAMERAU):
- An additional letter is inserted (e.g. DAHMERAU).
- A letter is deleted (e.g. DAMRAU).
- A letter is substituted by another letter (e.g. DANERAU).
- Two adjacent letters are transposed (e.g. DAMERUA).
Damerau uses the so-called Damerau-Levenstein-metric for calculating the minimum number of
errors for the two words s and t.
    f(0, 0) := 0
    f(i, j) := min{ f(i-1, j) + 1,
                    f(i, j-1) + 1,
                    f(i-1, j-1) + d(s_i, t_j),
                    f(i-2, j-2) + d(s_{i-1}, t_j) + d(s_i, t_{j-1}) + 1 }        (2)
The function d is a distance measure for letters. A simple measure is non-identity, which is used in
the following; but more elaborate measures drawn from statistical analyses of typing errors or from
the geometry of keyboards can be used instead.
    d(s_i, t_j) := 0 if s_i = t_j,  1 if s_i ≠ t_j        (3)
The function f(i, j) calculates the minimum number of errors that distinguish the first i characters
of the first word from the first j characters of the second word. Consequently, two words s and t
with lengths l_s and l_t differ by f(l_s, l_t) errors.
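A small Python sketch of this computation follows, implementing equations (2) and (3) with the simple non-identity distance; the initialisation f(i, 0) = i and f(0, j) = j is the usual assumption, since equation (2) only fixes f(0, 0).

    def dl_distance(s, t):
        """Damerau-Levenstein distance between the words s and t."""
        def d(i, j):                       # equation (3), letters 1-indexed
            return 0 if s[i - 1] == t[j - 1] else 1
        f = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(len(s) + 1):
            f[i][0] = i
        for j in range(len(t) + 1):
            f[0][j] = j
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                f[i][j] = min(f[i - 1][j] + 1,              # deletion
                              f[i][j - 1] + 1,              # insertion
                              f[i - 1][j - 1] + d(i, j))    # substitution
                if i > 1 and j > 1:                         # transposition term
                    f[i][j] = min(f[i][j],
                                  f[i - 2][j - 2] + d(i - 1, j) + d(i, j - 1) + 1)
        return f[len(s)][len(t)]

    print(dl_distance("DAMERAU", "DAMERUA"))    # 1: one transposition
    print(dl_distance("DAMERAU", "DAHMERAU"))   # 1: one inserted letter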
4.1 Skeleton-Key and Omission-Key
In [Pollock & Zamora 84] two techniques called Skeleton-Key and Omission-Key were presented:
Definition: The Skeleton-Key of a word consists of its first letter, the remaining consonants in order of appearance and the remaining vowels in order of appearance. This key
contains every letter at most once.
Experiments showed that consonants were omitted in the following frequency order: RSTNLCHDPGMFBYWVZXQKJ, i.e., R is more often omitted than any other letter, J less than any other letter.
Definition: The Omission-Key of a word consists of the consonants in the reverse
of the above frequency order and the vowels in order of appearance. This key contains
every letter at most once.
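The following Python sketch computes both keys as defined above; treating Y as a consonant and restricting the vowels to A, E, I, O, U are our own assumptions.

    VOWELS = set("AEIOU")
    # Consonants ordered by how often they are omitted (from the text above):
    OMISSION_ORDER = "RSTNLCHDPGMFBYWVZXQKJ"

    def _unique(letters):
        """Keep only the first occurrence of each letter, preserving order."""
        seen = []
        for ch in letters:
            if ch not in seen:
                seen.append(ch)
        return seen

    def skeleton_key(word):
        word = word.upper()
        consonants = [c for c in word[1:] if c not in VOWELS]
        vowels = [c for c in word[1:] if c in VOWELS]
        return "".join(_unique([word[0]] + consonants + vowels))

    def omission_key(word):
        word = word.upper()
        consonants = _unique(c for c in word if c not in VOWELS)
        # Reverse frequency order: rarely omitted consonants come first.
        consonants.sort(key=OMISSION_ORDER.index, reverse=True)
        vowels = _unique(c for c in word if c in VOWELS)
        return "".join(consonants) + "".join(vowels)

    print(skeleton_key("CHEMICAL"), omission_key("CHEMICAL"))   # CHMLEIA MHCLEIA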
Pollock and Zamora then used the distance of two keys in an alphabetically sorted list as a distance
measure for the original words. The distance based on Skeleton-Keys reflects the fact that consonants
carry more information than the vowels. Using the Omission-Key, the computed similarity takes
advantage of the frequency by which certain consonants are omitted when typed.
    name       terms   origin   description
    AP          1190   TREC     AP Newswire (1989)
    BIB         1282            literature database (1993)
    CACM        2208   SMART    ACM abstracts
    FR          2051   TREC     Federal Register (1989)
    TELEFON     2510            phone book database of the University of Dortmund (1993)
    WSJ         1026   TREC     Wall Street Journal (1988)
    ZIFF        8125   TREC     Information from Computer Select Disks (Ziff-Davis Publishing)
    COMPLETE   14972            combination of all databases

    Table 4: Databases used for the experiments
In our experiments, we tested only a variation of the two techniques where we used them as an
algorithm for normalizing the word, before applying the Damerau-Levenstein-metric.
Retrieval using the Damerau-Levenstein-metric was implemented by a scan over the full database. For
real-life applications this is prohibitive. Clustering methods might be used to preselect the
parts of the database where searching for names related to the query is most promising. But – due
to the nature of the measure – there is a limit to the possible increase in performance.
5 Experiments
For our experiments, we extracted surnames from a couple of sources manually; automatic
extraction is beyond the scope of this paper. Especially for English texts, detection of names could
be achieved by heuristics looking for words starting with capital letters, but more sophisticated
methods using dictionaries and/or NLP techniques may be applied as well.
The sources used were parts of the TREC collection [Harman 93], the CACM collection from
the SMART system ([Buckley 85]), the phone book of the University of Dortmund and a local
bibliographic database (see table 4). The phone book and the bibliographic database provide
non-English names; some of the TREC sources also contain spelling errors.
All of these source databases were scanned for surnames, which were stored in test databases with
sizes varying from about 1000 to 8000 terms and containing only unique surnames. Finally all those
test databases were combined into one large test database called COMPLETE with some 14000 names.
After the creation of the experimental databases, the queries to these databases were determined:
first, 90 names were randomly chosen from the databases and, second, the sets of relevant terms were
manually determined for each of those 90 queries. In the COMPLETE database there are on average
13.1 relevant names per query.
5.1 Quality measures
For the comparison of the different techniques we used recall-precision-graphs.
Since the ranking of the answer set is not a linear ordering, we get only a few recall-precision points
for each query, namely after each rank. To have a means for averaging over the query set, we need a
method for interpolation.
We use the probability of relevance method proposed in [Raghavan et al. 89]. The method assumes
that the user examines the result ranking, randomly selecting a document from the topmost not yet
completely read rank. At a given stage during this process, the precision is defined as the probability
that a randomly examined document is relevant. Recall is the fraction of relevant documents examined.
The corresponding algorithm works as follows:
Given that a rank contains r relevant and i nonrelevant documents, a user who wants s < r relevant
documents will get esl irrelevant documents:

    esl_{r,i}(s) = (s · i) / (r + 1)        (4)
For coping with the situation with more than one rank, we call the rank currently under inspection l
and further define:

    j  : number of non-relevant documents of ranks 1 ... l-1
    s  : number of relevant documents picked from rank l
    r  : number of relevant documents of rank l
    i  : number of non-relevant documents of rank l
    NR : number of required relevant documents
Then equation 4 extends to:

    esl(NR) = j + esl_{r,i}(s) = j + (s · i) / (r + 1)        (5)
Finally, by combining equation 5 with the definition of the probability of relevance (PRR), we get:

    PRR(NR) = NR / (NR + esl(NR)) = NR / (NR + j + (s · i) / (r + 1))        (6)

If we also allow non-integer values for NR, we get a smooth interpolation which can be averaged
over the query set.
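As an illustration, the following Python sketch computes this interpolated precision for a single query; representing a result ranking as a list of (relevant, non-relevant) document counts per rank is our own choice.

    def prr(ranking, nr):
        """Precision at the point where nr relevant documents are required."""
        j = 0        # non-relevant documents in the completely read ranks
        found = 0    # relevant documents in the completely read ranks
        for r, i in ranking:
            if found + r >= nr:
                s = nr - found                    # picked from the current rank
                esl = j + (s * i) / (r + 1)       # equations (4) and (5)
                return nr / (nr + esl)            # equation (6)
            found += r
            j += i
        return None  # fewer than nr relevant documents in the ranking

    # Example: rank 1 holds 2 relevant and 1 non-relevant documents,
    # rank 2 holds 1 relevant and 3 non-relevant documents.
    print(prr([(2, 1), (1, 3)], nr=3))    # ~0.55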
5.2 Analysis
Because the test queries were drawn from the database COMPLETE there is no query with no
relevant entry in COMPLETE. Real applications would have to cope with the situation where a user
searches for a name that is not present in the database.
Therefore, we present here the experimental results for database ZIFF where the number of relevant
entries ranges from 0 (3 queries) to 26. On average, there are 6.6 relevant documents w.r.t. a single
query.
[Figure 1: Soundex/Phonix – recall-precision graph for SOUNDEX, PHONIX4, PHONIX8, PHONIXE and plain identity]
Figure 1 contains the results of the four variations of the Soundex and Phonix algorithms described
below:

    SOUNDEX : Soundex algorithm with codes of length 4
    PHONIX4 : Phonix algorithm with codes of length 4 (no ending-sounds)
    PHONIX8 : Phonix algorithm with codes of length 8 (no ending-sounds)
    PHONIXE : Phonix algorithm with codes of length 8 (ending-sounds)
Obviously, SOUNDEX is the worst variation. Its values for the expected precision are about 0.1 lower
than the values of PHONIXE. PHONIX8 delivers precision values that are better than SOUNDEX but still
worse than PHONIX4 and PHONIXE. The latter two variations perform best in this experiment: while
PHONIXE shows higher precision values than PHONIX4 for recall lower than 0.5, this relationship is
reversed for recall higher than 0.5. So it depends on the application which method should be preferred.
The behavior of these variations can be explained easily: SOUNDEX is the worst variation because it
lacks the phonetic substitution; this feature of Phonix is a very important improvement. PHONIX8
shows worse performance than PHONIX4 since it uses codes that are too long and therefore too
discriminative. PHONIXE alleviates this problem by additionally regarding the ending-sounds, so
similar words have a second chance of being discovered.
Furthermore, figure 1 contains the graph representing the performance of retrieving only identical
terms. The difference between this graph and the graphs of the similarity measures shows that our
work is worthwhile.
The technique using n-grams was analyzed for n = 1, ..., 5. Each of these variations was examined
using no, one and two additional blanks (of course, digrams are not capable of using two blanks).
The results show clear tendencies:
- Figure 2 shows that the more additional blanks are used, the better the trigrams perform. This
is also true for the other n-gram variations.
- Figure 3 reveals that performance decreases with increasing n. This behavior is independent
of the number of blanks used.
In conclusion, these two figures show that digrams with one blank are the best variation of n-grams.
Trigrams with two blanks are a little worse but still a good alternative to the digrams.
A closer look at the graphs reveals that digrams are even better than the best Phonix variation (see
fig. 1): at a recall of 0.5, PHONIXE shows a precision of about 0.53, while digrams achieve a
precision of 0.62 at the same recall level.
The Damerau-Levenstein-metric was analyzed for the ordinary metric (DAMERAU), the metric using
the Skeleton-Key (SKELETON) and the metric using the Omission-Key (OMISSION). For efficiency,
we confined the maximum number of errors to values from 1 to 3. Figure 4 shows that admitting
more than 3 typing errors does not increase the effectiveness.
The following conclusions can be drawn with the help of fig. 5 and 6:
- The more errors are allowed, the better the different variations perform.
- With a maximum of one error, SKELETON shows the best performance.
- With a maximum of two or three errors, DAMERAU is superior to SKELETON and OMISSION.
- The best results are achieved with a maximum of three errors (see fig. 6). Here DAMERAU
is better than SKELETON (by about 0.08 at a recall of 0.5), and SKELETON is far better than
OMISSION (by about 0.18 at a recall of 0.5).
Finally, fig. 7 compares the best variations of these techniques. The differences are significant:
digrams with one blank are the best similarity measure. The Phonix algorithm with ending-sounds is
worse than the digrams but better than the Damerau-Levenstein-metric in the more interesting recall
interval from 0 to 0.67.
[Figure 2: Trigrams with different numbers of blanks – recall-precision graph for 0, 1 and 2 additional blanks]
[Figure 3: n-grams with one blank – recall-precision graph for 2-, 3-, 4- and 5-grams]
[Figure 4: Damerau variation SKELETON with at most five errors – recall-precision graph for maxima of 1 to 5 errors]
[Figure 5: Damerau-Levenstein-metric with at most one error – recall-precision graph for DAMERAU, SKELETON and OMISSION]
[Figure 6: Damerau-Levenstein-metric with at most three errors – recall-precision graph for DAMERAU, SKELETON and OMISSION]
[Figure 7: Comparison of the best variations – recall-precision graph for PHONIXE, DIGRAM1 and DAMERAU3]
5.3 Combinations
In this section combinations of the three different techniques are analyzed. For the combinations
only the best variation of each technique was chosen, i.e., PHONIXE, digrams with one additional
blank (DIGRAM1) and the Damerau-Levenstein-metric with a maximum of three errors (DAMERAU3).
The first problem arising is that the rankings produced by the different methods have to be mapped
onto a single linear scale. The obvious choice would be the probability that a word in a given rank
will be judged relevant by the user. Calculating such estimates requires statistical data which we
did not have the time to generate, so we chose a rather crude mapping:
- The trigram similarity coefficient is used without change.
- The Damerau coefficient f is mapped to 1/(1 + f).
- The ranks for Soundex and Phonix without ending-sounds are mapped to 1, 0.9 and 0.
- The ranks for Phonix with ending-sounds are mapped to 1, 0.9, 0.8, 0.7 and 0.
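The following Python sketch (not the authors' code) shows this crude mapping together with the four weighting functions compared below; all names and the example weights are our own.

    from statistics import mean, median
    import math

    def damerau_weight(f):
        """Map a Damerau-Levenstein distance f to 1 / (1 + f)."""
        return 1.0 / (1.0 + f)

    # Ranks of Phonix with ending-sounds: identical, same ending-sound,
    # ending-sound prefix, different ending-sound, unrelated.
    PHONIXE_WEIGHTS = {1: 1.0, 2: 0.9, 3: 0.8, 4: 0.7, 5: 0.0}

    def combine(weights, how="geometric mean"):
        """Combine the weights of the single measures into one weight."""
        if how == "median":
            return median(weights)
        if how == "maximum":
            return max(weights)
        if how == "arithmetic mean":
            return mean(weights)
        # Geometric mean: drops to 0 as soon as one measure assigns 0.
        return math.prod(weights) ** (1.0 / len(weights))

    # Example: digram coefficient 0.62, PhonixE rank 2, two typing errors.
    weights = [0.62, PHONIXE_WEIGHTS[2], damerau_weight(2)]
    print(combine(weights), combine(weights, "median"))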
There are still two open problems:
- Which function is to be applied for combining the different weights into one new weight?
- Which similarity measures are to be combined?
The first question is addressed by an experiment using the following weighting functions: median,
maximum, arithmetic mean and geometric mean. The results in figure 8 show:
- Maximum is the worst of these four functions. Its precision values are about 0.11 lower than
those of the other graphs at a recall of 0.5.
- Within the recall interval from 0 to 0.8, the two mean functions are slightly better than the median.
The difference between the arithmetic mean and the geometric mean can be neglected.
- Within the recall interval from 0.8 to 1, the median is significantly better than the arithmetic
or geometric mean.
These results may be summarized as follows: either the arithmetic mean or the geometric mean should
be used to combine the different weights into one weight when low recall levels (< 0.8) are important.
For applications which put more emphasis on high recall levels, the robust combination function
(median) might be a good choice.
The drop in performance at the 0.8 recall level is probably due to the poor mapping of the Phonix
ranks, which disturbs the mean combinations, while the median is more robust. So with a better
mapping of ranks to a linear scale the above statements might have to be revised.
When choosing the similarity measures to be combined, efficiency and effectiveness have to be
considered. For some measures there are very fast implementations available, whereas others are
computationally demanding. Secondly, the more measures are combined, the more efficiency suffers.
On the other hand, the question is whether adding a new measure to a set of measures increases
performance in every case.
[Figure 8: Combination Digrams1–PhonixE–Damerau3 with different weighting functions (median, maximum, arithmetic mean, geometric mean)]
[Figure 9: Different combinations with geometric means – recall-precision graph for Digrams1 & Damerau3, Digrams1 & PhonixE, Digrams1 & PhonixE & Damerau3, and PhonixE & Damerau3]
So, in another set of experiments, we are looking for a minimal set of measures with a reasonably
fast implementation.
Figure 9 shows the results of four different combinations:
- The worst alternative is the combination of the two techniques PHONIXE and DAMERAU3. Its
performance is significantly lower than that of the other combinations.
- The best combination is the use of all three techniques.
- The combination DIGRAMS1–PHONIXE is slightly worse than the combination PHONIXE–
DIGRAMS1–DAMERAU3, but about 0.4 better than DIGRAMS1–DAMERAU3.
Summarizing these results (and other results not presented here), these experiments show that the
more evidence is combined, the better the results that are achieved, and the more the combined
measures differ, the better the combination performs.
An interesting result is that although PHONIXE–DIGRAMS1–DAMERAU3 performs best, the difference
to DIGRAMS1–PHONIXE is rather small. So by dropping the Damerau measure, effectiveness will drop
only slightly, while efficiency will benefit greatly, since this measure can hardly be implemented
efficiently.
6 Conclusion
In this paper we have shown that an information retrieval system using a kind of similarity measure
for searching proper names will perform much better than a system using only exact-match searches.
Three different techniques for calculating the similarity of strings have been introduced and compared.
The results show that digrams using one blank perform best, while PHONIXE is only slightly better
than the Damerau-Levenstein-metric allowing at most three errors.
Much better than the use of a single technique are the combinations of different techniques. Here the
combination of digrams with one additional blank and PHONIXE can be recommended. Although the
approach of combining two or three different similarity measures seems to be very promising, the
additional work for maintaining and searching one or even two more techniques has to be considered.
A solution to this might be to compute weights for an additional similarity measure (e.g. the
Damerau-Levenstein-metric) only for the answers already found. Though no new answers will be found
this way, the ranking of the retrieved answers might be improved.
Future work will concentrate on this problem, i.e., the most efficient combination of some of the
techniques. Furthermore, the choice of the best weighting function for the combinations is to be
analyzed in more detail.
References
Angell, R.; Freund, G.; Willett, P. (1983). Automatic Spelling Correction using a Trigram
Similarity Measure. Information Processing and Management 19(4), pages 255–261.
Buckley, C. (1985). Implementation of the SMART Information Retrieval System. Technical Report
85-686, Department of Computer Science, Cornell University, Ithaca, NY.
Burghardt, A.; Fuhr, N.; Großjohann, K.; Pfeifer, U.; Spielmann, H.; Stach, O. (1993).
NOSFERATU — an Integrated Database Management and Information Retrieval system based
on Data Streams (in German). In: Proceedings 1. GI-Fachtagung Information Retrieval, pages
27–40. Universitätsverlag Konstanz, Konstanz.
Damerau, F. (1964). A Technique for Computer Detection and Correction of Spelling Errors.
Communications of the ACM 7, pages 171–176.
Gadd, T. (1988). ’Fisching for Werds’. Phonetic Retrieval of written text in Information Retrieval
Systems. Program 22(3), pages 222–237.
Gadd, T. (1990). PHONIX: The Algorithm. Program 24(4), pages 363–366.
Hall, P.; Dowling, G. (1980). Approximate String Matching. Computing Surveys 12(4), pages
381–402.
Harman, D. (1993). Overview of the First Text REtrieval Conference. In: Harman, D. (ed.):
The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology
Special Publication 500-207, Gaithersburg, Md. 20899.
Pollock, J.; Zamora, A. (1984). Automatic Spelling Correction in Scientific and Scholarly Text.
Communications of the ACM 27, pages 358–368.
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program 14, pages 130–137.
Raghavan, V. V.; Bollmann, P.; Jung, G. S. (1989). Retrieval System Evaluation Using Recall
and Precision: Problems and Answers. In: Belkin, N.; van Rijsbergen, C. J. (eds.): Proceedings
of the Twelfth Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 59–68. ACM, New York.
Robertson, A.; Willett, P. (1992). Searching for Historical Word-Forms in a Database of 17th-Century English Text Using Spelling-Correction Methods. In: Belkin, N.; Ingwersen, P.; Pejtersen,
M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 256–265. ACM, New York.
Salton, G.; McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill,
New York.
Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis and Retrieval of
Information by Computer. Addison-Wesley, Reading, Massachusetts.
Zamora, E.; Pollock, J.; Zamora, A. (1981). The Use of Trigram Analysis for Spelling Error
Detection. Information Processing and Management 17(6), pages 305–316.