Improving Precision and Recall
Using a Spellchecker in a Search Engine

Mansour Sarr

TRITA-NA-E04022

Master's Thesis in Computer Science (20 credits)
within the First Degree Programme in Mathematics and Computer Science,
Stockholm University 2004

Supervisor at Nada was Hercules Dalianis
Examiner was Kerstin Severinson-Eklundh

Department of Numerical Analysis and Computer Science (NADA)
Royal Institute of Technology (KTH)
SE-100 44 Stockholm, Sweden
Abstract
Search engines constitute a key to finding specific information on the
fast growing World Wide Web. Users query a search engine by using
natural language to extract documents that refer to the desired subject.
Sometimes no information is found because they make spelling and
typing mistakes while entering their queries. Earlier reports suggest that
10-12 percent of all questions to a search engine are misspelled.
The issue is how much does the use of a query spellchecker affect
the performance of a search engine?
This Master’s thesis presents an evaluation of how much a query
spellchecker improves precision and recall in information retrieval for
Swedish texts.
Evaluation results indicate that spellchecking improved precision and
recall by 4 and 11.5 percent, respectively.
Evaluation of a Spelling Support for a Search Engine
Summary
Search engines are a key to finding specific information on the fast-growing Internet. Users typically pose natural-language queries to a search engine to find the information they are interested in. Sometimes the search fails because the user happens to misspell or mistype. Earlier studies show that 10-12 percent of all queries posed to a search engine are misspelled.
The question is: how does the spelling support affect the search results?
This Master's thesis evaluates how much a spellchecker can improve precision and recall in information retrieval in Swedish.
The results show that the spellchecker improved precision and recall by 4 and 11.5 percent, respectively.
Acknowledgments
This report is my Master’s thesis, and was carried out at the Department
of Numerical Analysis and Computing Science (NADA) at the Royal
Institute of Technology (KTH).
First of all I would like to thank my supervisor, Dr. Hercules
Dalianis, for providing a lot of support, encouragement and advice.
I would also like to thank some people who helped me along the
way: Johan Carlberger at Euroling AB for providing technical support
on the search engine SiteSeeker when needed, and Martin Hassel at
NADA/KTH for his assistance before and during the experimental
phase.
Finally I thank all the students in the language engineering class who
took part in the experiment.
Mansour Sarr
Contents

1 Introduction
  1.1 Some background terminology
    1.1.1 Query spellchecker
    1.1.2 Spelling Errors
    1.1.3 Search engine
    1.1.4 Precision and Recall
  1.2 Motivation
2 Previous research on SiteSeeker
  2.1 The stemming experiment
    2.1.1 Stemming
    2.1.2 Experiment and evaluation
  2.2 The spelling support experiment
  2.3 Other related work
  2.4 Error detection and correction techniques
    2.4.1 Error detection techniques
    2.4.2 Error correction techniques
3 Experiment
  3.1 The test participants
  3.2 Building the questions and answers forms
  3.3 The experiment rules
4 Evaluation
  4.1 Evaluation Metrics
  4.2 Evaluating retrieved documents and their Relevance judgements
  4.3 Evaluation Results
    4.3.1 Analysis of results
    4.3.2 Discussion
5 Conclusions and future improvements
References
Appendices
  Appendix A Search engines: how they work
  Appendix B Ranking web pages
1 Introduction
This report evaluates a query spellchecker connected to the search engine SiteSeeker1 based on precision and recall. It is intended as an extension to the stemming experiment that is described in detail in
Carlberger et al. (2001) and performed at the Department of Numerical
Analysis and Computer Science (NADA) at the Royal Institute of
Technology (KTH).
In their experiment they compared precision and recall with and
without a stemmer2. They concluded that stemming improved both
precision and recall, by 15 and 18 percent respectively, for Swedish texts
having an average length of 181 words. This is described in more detail
later in this report.
The goal of my work was to set up an appropriate test experiment,
evaluate it and subsequently analyse whether or not the use of the
query spellchecker improved search results based on precision and
recall.
1.1 Some background terminology
In this section we introduce some terminology, which will be used
throughout this report.
1.1.1 Query spellchecker
Query spellcheckers generally process words as strings of characters.
They identify misspellings by matching the character string of each word
against the character strings of the words contained in a dictionary. If no
match is found, the query spellchecker flags the word as a misspelling
and generates a list of possible replacements. The word choices provided
in the replacement list for a given misspelling are identified by matching
the initial string of letters and by applying morphological rules to possible
replacements in the program dictionary. Two factors influence the
identification of the replacement word list: phonetic match and correct
sequence of letters. The less severe the phonetic mismatch with the target
word, the closer the query spellchecker will come to identifying the target
word within the list of possible replacements, because the string of
characters of the misspelling will be similar to those of the target word.

1 http://www.siteseeker.se
2 Software that uses a technique to transform different inflections and derivations of the same word into one common least denominator.
1.1.2 Spelling Errors
A word error can belong to one of two distinct categories: nonword
errors and real-word errors (Kukich 1992). Let a string of characters
separated by spaces or punctuation marks be called a candidate string.
A candidate string is a valid word if it has a meaning; otherwise it is a
nonword. By real-word error we mean a word that is valid but not the
intended word in the sentence, thus making the sentence syntactically or
semantically ill-formed or incorrect.
For instance, in "the wote the paper" (instead of "he wrote the paper"),
"wote" is a nonword error and "the" is a real-word error.
In both cases the problem is to detect the erroneous word and suggest
correct alternatives. See Section 2.4 for further information on error
detection and correction.
In the case of typed text there are three kinds of nonword misspellings
according to Kukich (1992): (1) typographic errors, (2) cognitive errors,
and (3) phonetic errors.
In the case of typographic errors, it is assumed that the writer knows the
correct spelling but accidentally presses the wrong key, presses two
keys, presses the keys in the wrong order, etc.
For instance: spell → speel, typing speel when spell was meant.
The source of cognitive errors is presumed to be a misconception or
a lack of knowledge on the part of the writer.
For instance: receive → recieve, minute → minite.
In the case of phonetic errors it is assumed that the writer substitutes a
phonetically correct but orthographically incorrect sequence of letters
for the intended word.
For instance: naturally → nacherly, two → to.
1.1.3 Search engine
According to Webopedia (2002) a search engine is a program that
searches documents for specified keywords and returns a list of the
documents where the keywords were found. Although a search engine is
really a general class of programs, the term is often used to specifically
describe systems that enable users to search for documents on the World
Wide Web.
Typically, a search engine works by sending out a crawler1 or a
spider to fetch as many documents as possible. Another program, called
an indexer, then reads these documents and creates an index based on
the words contained in each document.
Each search engine uses a proprietary algorithm to create its indices
such that, ideally, only meaningful results are returned for each query.
Search engines of the web are of different types. See Appendix A for
an in depth description of how they work. Major2 search engines like
Google3 and AllTheWeb4 are crawler-based. They have, simply put,
three major elements. First is the spider, also called the crawler. It visits
a web page, reads it, and then follows links to other pages within the
site. The spider returns to the site on a regular basis to look for changes.
Everything the spider finds goes into the second part of the search engine, the index. It is like a giant book containing a copy of every web
page that the spider finds. If a web page changes, then the index is
updated with new information. Search engine software is the third part
of a search engine. This is, as described above, the program that sifts
through the millions of pages recorded in the index to find matches to a
search and ranks them in order of what it believes is most relevant. More
on how search engines rank hit pages can be read in Appendix B.

1 A program that automatically fetches web pages.
2 Well-known, well-used and commercially backed. They are more likely to be well maintained and upgraded when necessary, to keep pace with the growing web.
3 http://www.google.com
4 http://www.alltheweb.com
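As a rough illustration of the crawler-indexer-search pipeline described above, the sketch below builds a toy inverted index that maps each word to the documents containing it and answers a query from that index. The documents and the simple AND semantics are assumptions for illustration only.

```python
# Toy indexer and searcher: queries are answered from the index,
# not by scanning the documents themselves. Illustrative only.
from collections import defaultdict

def build_index(documents):
    """documents: {doc_id: text}. Returns {word: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the documents containing every query word (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

docs = {"d1": "white roses in the garden", "d2": "red roses"}
index = build_index(docs)
print(search(index, "white roses"))   # {'d1'}
```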
1.1.4 Precision and Recall
These terms are metrics that traditionally define the “quality” of the retrieved documents set. In an ideal world, a perfect search will attain both
high precision and high recall. In the real world of the Internet, a balance
is struck, hopefully an optimum one.
Precision is defined as the number of relevant documents retrieved
divided by the total number of documents retrieved (Dalianis 2002).
For example, suppose there are 60 documents relevant to white roses
in the collection. A query returns 40 documents, 30 of which are about
white roses. The 30 relevant documents retrieved divided by the 40
documents returned gives a precision of 75 %.
One hundred percent precision is an obtainable goal, since the system could
be programmed to return just one completely relevant document (giving
a 1/1 precision rate). However, returning just one document might not
resolve the seeker's query. Therefore, a search system attempts to
maximise both precision and recall simultaneously.
Recall is defined as the number of relevant documents retrieved
divided by the total number of relevant documents in the database
(Dalianis 2002).
For example, suppose there are 60 documents relevant to white roses in
the database. A query returns 48 documents, 30 of which are about white
roses. The 30 relevant documents retrieved divided by the 60 relevant
documents in the database gives a recall of 50 %.
Trying to accomplish a high recall rate on the Internet is difficult,
due to the enormous volume of information that must be searched. In
an ideal world, recall is 100 percent. However, this is impossible to
achieve in practice, since the system attempts to maximise both recall
and precision simultaneously.
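The two definitions above can be summarised in a few lines of code; the sketch below recomputes the white-roses example with made-up document identifiers.

```python
# Precision and recall over sets of retrieved and relevant documents.
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# 60 relevant documents in the collection; 40 retrieved, 30 of them relevant.
relevant = {f"rose{i}" for i in range(60)}
retrieved = {f"rose{i}" for i in range(30)} | {f"other{i}" for i in range(10)}
print(precision(retrieved, relevant))   # 0.75
print(recall(retrieved, relevant))      # 0.5
```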
1.2 Motivation
The vast majority of users navigate the Web via search engines
(Webopedia 2002). Yet searching can sometimes be a frustrating
activity. Users type in one or more keywords or phrases and,
unfortunately, sometimes no documents are retrieved, or they get
thousands of responses, only a handful of which are close to what
they are looking for. The reason, according to Dalianis (2002), is that
the word is not in the index, that the user misspelled the word, or that
the user did not know the right inflection of the word as it is written
in the index.
According to Kukich (1992), approximately 80 % of all misspelled
words contain a single instance of one of four error types: insertion (an
extra character is introduced), deletion (a character is omitted),
substitution (a character is replaced by another) and transposition (two
adjacent characters are switched).
It is a known fact that 10-12 percent of all questions posed to a search
engine are misspelled; these results are also supported in Stolpe
(2003) and in a Google press release (2002).
These observations, together with the claim made by Kann et al. (1998)
that up to a third of the search terms given to web-based dictionaries are
misspelled, made us believe that attaching a spellchecker would assist the
users in terms of perhaps increasing either precision or recall, or both.
A spellchecker is required to identify errors in queries where little or
no contextual information is available and, using some measure of
similarity, recommend the words that are most similar to each misspelled
word. This error checking would save computational processing and the
users' time, and hopefully make our system score high on precision and
recall.
The query spellchecker that is used in the present experiment
performs a presence check on words against a stored lexicon, identifies
spelling errors and recommends alternative spellings.
2 Previous research on SiteSeeker
Two major experiments have been conducted using the SiteSeeker
search engine. More elaborate details on these can be found in
Carlberger et al. (2001) and Dalianis (2002). I will nevertheless present
a quick overview of these experiments, since our experiment builds on
them. Sections 2.1 and 2.2 describe the earlier experiments
done with the SiteSeeker search engine. Section 2.3 looks at earlier work
that has been done in the area of spellcheckers and search engines.
Section 2.4 describes error detection and correction techniques.
2.1 The stemming experiment
Carlberger et al. (2001) performed the stemming experiment and their
goal was to evaluate how much stemming improves precision in information retrieval for Swedish texts. They built an information retrieval
tool with optional stemming and created a corpus of over 54,000
Swedish news articles.
2.1.1 Stemming
Stemming as described in Carlberger et al. (2000) is a technique to transform different inflections and derivations of the same word to one common “stem” (the least common denominator for the morphological
variants). It can mean both prefix and suffix removal. Stemming can, for
example, be used to ensure that the greatest number of relevant matches
is included in search results. A word’s stem is its most basic form; the
stem of a plural noun is the singular; the stem of a past tense verb is the
present tense. For example “barnen” (children) is stemmed to “barn”
(child). The past tense verb “sköt” (shot) is stemmed to “skjut” (shoot).
The stemming algorithm for Swedish that is discussed above used
about 150 stemming rules. The technique modifies the original
word into an appropriate stem by applying a small set of suffix rules in a
number of steps.
The stemming is done in one to four steps, and in each step no more
than one rule from a set of rules is applied. This means that 0-4 rules are
applied to each word passing through the stemmer. Each rule consists of
a lexical pattern to match with the suffix of the word being stemmed and
a set of modifiers, or commands. Further details on this matter can be
read in Carlberger et al. (2000).
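As a rough illustration of suffix-rule stemming, the sketch below applies a single pass over a handful of invented rules. It is not the roughly 150-rule Swedish stemmer of Carlberger et al. (2000), which applies up to four rules in sequence together with additional modifiers.

```python
# Illustrative suffix rules only; the real stemmer uses about 150 rules
# applied in up to four steps, each with its own modifiers.
RULES = [          # longest suffixes first
    ("arna", ""),  # "bilarna" (the cars)     -> "bil"
    ("en",   ""),  # "barnen"  (the children) -> "barn"
    ("ar",   ""),  # "bilar"   (cars)         -> "bil"
    ("et",   ""),  # "huset"   (the house)    -> "hus"
]

def stem(word):
    """Apply at most one matching rule (a single stemming step)."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print(stem("barnen"), stem("bilarna"), stem("huset"))   # barn bil hus
```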
2.1.2 Experiment and evaluation
Over 54,000 news articles from the KTH News Corpus were selected
(Hassel 2001). Among these, 100 texts were randomly selected and
manually annotated with a question and answer pair central to each text.
Three test participants did the experiment where a rotating
questioning-answering evaluation schema was used. Each of the three
participants answered 33, 33 and 34 questions respectively with and
without stemming functionality.
After going through all of the 100 questions and finding answers to
them, that is roughly 33 questions each, the work was rotated and the
participants became each other's evaluators, assessing how many of the
top ten hit pages returned by the search engine for each question were
correct and how many were wrong.
Of the 100 questions, the test participants found 96 answers, two
questions did not give any answers at all and two other questions gave
unreadable files. Each of the asked queries had an average length of
roughly 2.7 words. The texts containing the answer had an average
length of 181 words.
Increases of 15 percent in precision and 18 percent in relative recall
were found on the first 10 hits for stemming versus no stemming.
2.2 The spelling support experiment
In this experiment, performed by Dalianis (2002), the website of the
Swedish National Tax Board (RSV, Riksskatteverket) was used as a
testing domain. This website is extensively used by the public to perform
searches on, among other things, tax and income issues. The goal was to
assess and evaluate the effectiveness of the query spellchecker from
April to September 2001.
It should be mentioned that the spellchecking algorithm used stems
from Stava and Granska (Domeij et al. 1994, Carlberger & Kann 1999,
2000, Knutsson 2001) and makes use of the Edit-distance techniques
described in (Kukich 1992).
During the five-month period, the Swedish National Tax Board,
whose site contained roughly 6,000 documents, used EuroSeek's search
engine with a built-in stemmer and a dynamic query spellchecker. More
than one million queries came in, out of which 101,446 were erroneous.
Of the 100 most common spelling errors, the system gave approximately
92 percent good suggestions; 40 percent of these involved split compound
words1, 22 percent were spelling errors and the remaining 30 percent
represented alternative spellings.
2.3 Other related work
A number of experiments have been carried out on information retrieval
and spellchecking. However, no academic work could be found that
details an evaluation of a spellchecker connected to a search engine and
the resulting improvement in precision and recall. Instead, many of the
spellcheckers that have been built and evaluated are general-purpose
ones. These are suitable for any spellchecking application, including
isolated-word error correction such as spellchecking user queries in a
search engine.
An interactive spellchecker based on phonetic matching and aimed at
use in a search engine can be found in Hodge & Austin (2001a). The
spellchecker uses a hybrid approach to overcome phonetic spelling
errors and the four main forms of typing errors: insertion, deletion,
substitution and transposition. It uses a Soundex2-type coding approach
coupled with transformation rules to overcome phonetic spelling errors.
It also uses an n-gram approach (see Section 2.4.1) to overcome the first
two forms of typing error (insertion and deletion) and an integrated
approach to overcome substitution and transposition errors. It aims at
high recall at the expense of precision. Compared to several benchmark
spellcheckers, the hybrid spellchecker showed high figures: it had the
highest recall rate, 93.9 % for a large 45,243-word lexicon, and an
increased recall of 98.5 % for a smaller lexicon.
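Since the hybrid approach above builds on Soundex-style phonetic coding, a simplified sketch of the classic Soundex scheme may help illustrate the idea; it is not the coding actually used by Hodge & Austin.

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":          # h and w do not separate equal codes
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163 -- a phonetic match
```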
A method for detecting and correcting spelling errors in Swedish text
was devised in Domeij et al. (1994). It was refined in Kann et al. (1998),
where a ranking of corrections using word frequencies and edit distance
was implemented. These and other techniques are discussed at length in
the next section.
According to the latter, their spelling correction can be used in
information retrieval, where the users would, for example, be offered
interactive spelling correction of misspelled search terms. This, they
claim, would improve search results as regards both precision and recall.

1 Incorrect division or fusion of words (e.g., my self for myself).
2 Soundex uses codes based on the sound of each letter to translate a string into a canonical form.
The spellchecker we are using has, in fact, been put to the test twice.
The first time it was used was in a Swedish-English web dictionary,
which contains 28,500 Swedish words. About 20,000 queries were
posed to the web dictionary every day. Of these, 20 % were misspelled.
For 33 % of the misspellings a single word was at the closest distance to
the misspelling, so the query could be corrected automatically.
The second time our spellchecker was used is described in Section 2.2.
In Hodge & Austin (2001b) a phonetic spellchecker called Phonetex
is evaluated. This spellchecker is intended to be integrated with an
existing typographical spellchecker. It was compared against other
phonetic and benchmark spellcheckers. Different lexicon sizes were used
to investigate the effect of lexicon size on recall accuracy, using a test set
of 360 phonetic misspellings. It was shown that Phonetex had the highest
recall, failing to find only eight words out of 360. It maintains high recall
across lexicon sizes for best-match retrieval, ranging from 0.96 for the
large 45,000-word dictionary to 0.98 for the smaller dictionaries.
2.4 Error detection and correction techniques
Often, when studying a spellchecker in a search engine, a distinction
is made between the task of error detection and the task of error correction.
The latter is considered a much harder problem to solve, because it
deals with, among other things, locating candidate words and ranking
them, while the former deals with checking whether words or strings
appear in a given index, dictionary or lexicon. In reality, a user of a
search engine often does not notice that producing a good suggestion is
generally a two-step task, as stated above. The success of these tasks
determines how good precision and recall will be.
In this section we give a short survey of detection and correction
techniques that have been used by others. For a more general
description, see Kukich (1992).
2.4.1 Error detection techniques
Research on error detection has mostly been done on nonword errors
(see the definition in Section 1.1.2).
The error detection process usually consists of checking whether an
input string is a valid index or dictionary word.
Efficient techniques have been devised for detecting such errors.
The two best-known techniques are n-gram analysis and
dictionary lookup. Spellcheckers rely mostly on dictionary lookup, while
text recognition1 systems rely on n-gram techniques.
N-gram Analysis
In Kukich (1992) n-gram analysis is described as a method to find
incorrectly spelled words in a mass of text. Instead of comparing each
entire word in a text to a dictionary, only its n-grams are checked. The
check is done by using an n-dimensional matrix where real n-gram
frequencies are stored. If a non-existent or rare n-gram is found, the word
is flagged as a misspelling, otherwise not.
An n-gram is a set of consecutive characters taken from a string, with
a length of whatever n is set to. If n is set to one the term used is a
unigram, if n is two the term is a digram, if n is three the term is a
trigram, and a quadrigram exists when n is set to four. One benefit of the
n-gram approach is that it allows strings with differing prefixes to match,
and the algorithm is also tolerant of misspellings. Each string that is
involved in the comparison process is split up into sets of adjacent
n-grams. N-gram algorithms have the major advantage that they require
no knowledge of the language they are used with, and are therefore often
called language-independent, or neutral, string matching algorithms.
Using n-grams to calculate, for example, the similarity between two
strings is achieved by discovering the number of unique n-grams that
they share and then calculating a similarity coefficient: twice the number
of n-grams in common (the intersection) divided by the total number of
unique n-grams in the two words.
An example is shown in Figure 1.
String 1 – "statistics"
String 2 – "statistical"

If n is set to 2 (digrams are being extracted), the similarity of the two
strings is calculated as follows.

Initially, the two strings are split into n-grams:
statistics  - st ta at ti is st ti ic cs      (9 digrams)
statistical - st ta at ti is st ti ic ca al   (10 digrams)

Then any n-grams that are replicated within a term are removed, so that
only unique digrams remain. Repeated digrams are removed to prevent
strings that are different but contain repeated digrams from matching:
statistics  - st ta at is ti ic cs            (7 unique digrams)
statistical - st ta at ti is ic ca al         (8 unique digrams)

The n-grams that are left over are then arranged alphabetically:
statistics  - at cs ic is st ta ti
statistical - al at ca ic is st ta ti

The number of unique n-grams that the two terms share is then calculated.
The shared unique digrams are: at ic is st ta ti, in total 6 shared
unique digrams.

A similarity measure is then calculated using the similarity coefficient
2C / (A + B), where
A = the number of unique n-grams in term 1,
B = the number of unique n-grams in term 2,
C = the number of unique n-grams appearing in both term 1 and term 2.

The above example gives (2*6) / (7+8) = 12/15 = 0.80.

Figure 1. An example of a similarity measure.
1 Systems that focus on hand-printed, hand-written or machine-printed text.
A possible extension to this technique consists of adding blanks at
the beginning and end of the two words involved, thus increasing the
total number of n-grams and giving a more precise result, especially
for short words. For statistical this would give the new digrams 0s
and l0, where the zeros represent the blanks.
N-grams are used in Domeij et al. (1994) and in Kann et al. (1998)
with a precompiled table called a “graphotactical table” to limit the
number of false words accepted by their stored word list and to increase
the speed.
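The digram similarity worked through in Figure 1, including the optional blank padding, can be sketched as follows; the code is a direct transcription of the 2C / (A + B) coefficient above, not any particular published implementation.

```python
def ngrams(word, n=2, pad=False):
    """Unique n-grams of a word; '0' stands for the optional added blank."""
    if pad:
        word = "0" + word + "0"
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(a, b, n=2, pad=False):
    """2C / (A + B): shared unique n-grams over the total unique n-grams."""
    ga, gb = ngrams(a, n, pad), ngrams(b, n, pad)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(round(similarity("statistics", "statistical"), 2))   # 0.8, as in Figure 1
```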
Dictionary Lookup
A dictionary is a list of words that are assumed to be correct. Dictionaries can be represented in many ways, each with their own
characteristics like speed and storage requirements.
At first one might think that the more words a dictionary contains, the
better it is. The problem is that when more words are added to the
dictionary, the risk of real-word errors increases.
A good substitute for a large dictionary might be a dictionary with the
most common words, combined with a set of additional dictionaries for
specific topics such as computer science or economics.
A large dictionary also uses more space and may take longer to search.
Nonword errors can be detected, as mentioned above, by checking each
word against a dictionary. The drawbacks of this method are the
difficulties of keeping such a dictionary up to date and sufficiently
extensive to cover all the words in a text, while at the same time keeping
the system response time down. Dictionary lookup and construction
techniques must be tailored to the purpose of the dictionary. Too small a
dictionary can give the user too many false rejections of valid words;
too large a dictionary can accept a high number of misspellings that
coincide with valid but low-frequency words.
Hash tables are the most commonly used technique for gaining fast
access to a dictionary. In order to look up a string, one computes its hash
address and retrieves the word stored at that address in the preconstructed
hash table. If the word stored at the hash address is different from the
input string, a misspelling is flagged.
The main advantage of hash tables is their random-access nature, which
eliminates the large number of comparisons otherwise needed to search
the dictionary. The main disadvantage is the need to devise a clever hash
function that avoids collisions.
The spellchecker described in Domeij et al. (1994) uses a method called
a Bloom filter to store its dictionary. A Bloom filter, in contrast to a hash
table, has only two operations, checking if a word is already in it and
adding a new word.
A Bloom filter is, according to them, a vector of boolean values and
a number of hash functions. To store a word in the dictionary you calculate each hash function for the word and set the vector entries corresponding to the calculated values to true. To find out if a word belongs to
the dictionary, you calculate the hash values for that word and look in
the vector. If all entries corresponding to the values are true, then the
word belongs to the dictionary, otherwise it does not.
The good thing about Bloom filters, they claim, is that they are compact
and that the time required to find out whether a word belongs to the
dictionary or not is independent of the size of the dictionary.
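A minimal Bloom filter along the lines just described might look as follows; the vector size, the number of hash functions and the use of MD5 to derive them are arbitrary assumptions, not the choices made by Domeij et al. (1994).

```python
import hashlib

class BloomFilter:
    """Boolean vector plus several hash functions, as described above."""
    def __init__(self, size=100000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, word):
        # Derive several hash values from salted MD5 digests (one simple scheme).
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{word}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def __contains__(self, word):
        # True means "probably in the dictionary"; False is always certain.
        return all(self.bits[pos] for pos in self._positions(word))

lexicon = BloomFilter()
for w in ("barn", "bok", "hylla"):
    lexicon.add(w)
print("bok" in lexicon, "bokk" in lexicon)   # True, (almost certainly) False
```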
2.4.2 Error correction techniques
Correction of spelling errors is an old problem. Much research has been
done in this area over the years. Most existing spelling correction
techniques focus on isolated words, without taking into account the
textual context in which the string appears. One should remember that
isolated-word correction techniques are not able to detect real-word
errors (see the definition in Section 1.1.2).
Often error detection alone is not enough; according to
SearchEngineWatch (2003), users have come to expect, for example, that
the spellcheckers in their favourite search engines suggest good
corrections for the misspellings they detect. The problem of isolated-word
error correction thus consists of two steps: the generation of candidate
corrections and the ranking of candidate corrections. The candidate
generation process usually makes use of a precompiled table of legal
n-grams to locate one or more potential correction terms (for more details
see Kann et al. (1998)). The ranking process usually invokes some
lexical similarity measure between the misspelled string and the
candidates, or a probabilistic estimate of the likelihood of the correction,
to rank-order the candidates. These two steps are most of the time treated
as separate processes and executed in sequence. Some techniques omit
the second step, though, leaving the ranking and final selection
to the user.
The isolated-word methods that will be described here are the most
studied spelling correction algorithms (Kukich 1992): edit distance,
similarity keys, rule-based techniques, n-gram-based techniques,
probabilistic techniques and neural networks. All of these methods can
be thought of as calculating a distance between the misspelled word and
each word in the dictionary or index. The shorter the distance, the higher
the dictionary word is ranked.
Edit Distance
Edit distance is a simple technique. The distance between two words is
the number of editing operations required to transform one of the words
into the other. The allowed editing operations are: remove one character,
insert one character, replace one character with another or transpose two
adjacent characters.
Edit distance is useful for correcting errors resulting from keyboard
input, since these are often of the same kind as the allowed edit operations. It is not quite as good for correcting phonetic spelling errors, especially if the difference between spelling and pronunciation is big as in
English or French. It works better for some common Swedish
misspellings.
A variation on edit distance is to assign different distances for different editing operations and the letters involved in these operations.
This leads to a finer distinction between common and uncommon
typographic errors.
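A sketch of the edit distance just described, allowing the four operations listed above and ranking dictionary words by that distance, is shown below; it is a generic Damerau-Levenshtein-style computation, not the Stava/Granska implementation.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions and
    transpositions of adjacent characters needed to turn a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]

def suggest(misspelling, dictionary, max_suggestions=5):
    """Rank dictionary words by distance (shorter distance = higher rank)."""
    return sorted(dictionary, key=lambda w: edit_distance(misspelling, w))[:max_suggestions]

print(suggest("skjutt", ["skjut", "skott", "sket"]))   # "skjut" ranked first
```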
Similarity Keys
Similarity keys are based on some method to transform a word into a
similarity key. This key should reflect the characteristics of the word.
The most important characteristics are placed first in the key. All words
in the dictionary are transformed and sorted by their similarity keys.
A misspelled word is then also transformed, and similar keys can be
efficiently searched for in the sorted list.
This technique gives a linear ordering of all words. Words that are
close to each other in this ordering are considered similar.
With a good transformation algorithm this method can handle both keyboard errors and phonetic errors. The problem is to design this
algorithm.
Rule-based Techniques
Rule-based methods are interesting. They work by having a set of rules
that capture common spelling and typographic errors and applying these
rules to the misspelled word. Intuitively these rules are the “inverses” of
common errors. Each correct word generated by this process is taken as
a correction suggestion. The rules also have probabilities, making it possible to rank the suggestions by accumulating the probabilities for the
applied rules.
Edit distance can be viewed as a special case of a rule-based method
with limitation on the possible rules.
N-gram-Based Techniques
N-grams can be used in two ways, either without a dictionary or together
with a dictionary.
Used without a dictionary, n-grams are employed to find in which position in the misspelled word the error occurs. If there is a unique way to
change the misspelled word so that it contains only valid n-grams, this is
taken as the correction. The performance of this method is limited.
Its main virtue is that it is simple and does not require any dictionary.
Together with a dictionary, n-grams are used to define the distance
between words, but the words are always checked against the dictionary.
This can be done in several ways, for example by checking how many
n-grams the misspelled word and a dictionary word have in common,
weighted by the length of the words.
Probabilistic techniques
Probabilistic techniques are, simply put, based on statistical features of
the language. Two common methods use transition probabilities and
confusion probabilities.
Transition probabilities are similar to n-grams. They give us the
probability that a given letter or sequence of letters is followed by
another given letter. Transition probabilities are not very useful when we
have access to a dictionary or index.
Confusion probabilities give us the probability that a given letter has
been substituted for another letter.
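As a simplified illustration of confusion probabilities, the sketch below estimates how often one letter is typed in place of another from a set of observed misspelling pairs that differ in exactly one position; the example pairs are made up.

```python
from collections import Counter

def substitution_counts(error_pairs):
    """Count (typed letter, intended letter) substitutions in
    (misspelling, intended word) pairs that differ in exactly one position."""
    counts = Counter()
    for wrong, right in error_pairs:
        if len(wrong) == len(right):
            diffs = [(w, r) for w, r in zip(wrong, right) if w != r]
            if len(diffs) == 1:
                counts[diffs[0]] += 1
    return counts

def confusion_probability(typed, intended, counts):
    """Estimate P(typed | intended) from the observed counts."""
    total = sum(n for (t, i), n in counts.items() if i == intended)
    return counts[(typed, intended)] / total if total else 0.0

pairs = [("vok", "bok"), ("vil", "bil"), ("hes", "hus")]   # made-up observations
counts = substitution_counts(pairs)
print(confusion_probability("v", "b", counts))   # 1.0 in this tiny sample
```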
In Tong & Evans (1996) a word correction system based on statistical
language modelling and applied to OCR is described. Given a sentence to
be corrected, the system decomposes each string in the sentence into letter
n-grams and retrieves word candidates from the lexicon by comparing
string n-grams with lexicon-entry n-grams. The retrieved candidates are
ranked by the conditional probability of matches with the string, given
character confusion probabilities.
Finally, a word-bigram model and a certain algorithm are used to
determine the best-scoring word sequence for the sentence. They claim
that the system can correct non-word errors as well as real-word errors
and achieves a 60.2 % error reduction rate for real OCR text.
Neural Networks
Neural networks are also an interesting and promising technique, but it
seems that they have to mature a bit more before they can be used
generally.
The current methods are based on back-propagation networks, using
one output node for each word in the dictionary and an input node for
every possible n-gram in every position of the word, where n usually is
one or two. Normally only one of the outputs should be active,
indicating which dictionary word the network suggests as a correction.
This method works for small (< 1,000 words) dictionaries, but it does
not scale well. The time requirements are too big on traditional hardware,
especially in the learning phase.
3 Experiment
From a sub-corpus of over 70,000 news articles fetched from the KTH
News Corpus (Hassel 2001), 100 texts were randomly selected.
Each text was manually annotated with a question and answer pair, as
described in Carlberger et al. (2001), and used in our experiment as well.
The search engine used these XML-like texts in order to find answers to
questions asked by the participants. During the actual experiment,
though, the participants were given natural-language texts. An example
of an annotated text is shown in Figure 2.
In this section, the experimental setting that is used to evaluate the
performance of our search engine is presented. Section 3.1 describes our
test participants. Section 3.2 shows how the question and answer forms
were built, and Section 3.3 describes the set of rules that the test
participants had to follow during the experiment.
Question
<top>
<num> Number: 35
<desc> Description: (Natural Language question)
Vem är koncernchef på Telenor? (Who is CEO at Telenor?)
</top>
Answer
<top>
<num> Number: 35
<answer> Answer: Tormod Hermansen
<file> File: KTH
NewsCorpus/Aftonbladet/Ekonomi/0108238621340_EKO__00.html
<person> Person: Tormod Hermansen
<location> Location: Norden
<organization> Organization: Telenor
<time> Time: onsdagen
<keywords> Keywords: Telenor; koncernchef; teleföretag; mobilmarknaden; uppköp
</top>
Figure 2. Questioning and answering annotation scheme.
3.1 The test participants
To simulate actual people querying a search engine on the Web,
we asked the students taking a course in language engineering to carry
out the experiment.
They were a total of 20 students, predominantly male. Their ages were
estimated to range between 19 and 50 years. Around 95 percent of
them have Swedish as their first language or speak Swedish fluently, and
the rest do not have Swedish as their first language. There were French,
English and South African students among these 5 percent. The reason
why no specific figures can be given is that demographic issues were
not considered important or relevant in this experiment.
They are familiar with search engines and have good knowledge of,
among other things, natural language processing.
Their assignment was, simply stated, to query the search engine in
order to find the answers to the questions asked. We split them into
ten groups, each group comprising two participants.
3.2 Building the questions and answers forms
Since our test panel was divided into ten groups of two, our original
hundred questions, mentioned at the beginning of this chapter, were also
divided into ten sets of ten questions. Group one would handle the first
ten questions, group two the second ten questions, and so on.
These questions were made available to them only when the experiment
began.
We then built two types of answering forms for all 100 questions:
one for answers from searches with spellchecking and another for
answers from searches without spellchecking.
The spellchecking form contained six fields. The first is a name field
where the test participants would write their names, and the second one
is where they would write the query words used. A third one was for any
query words suggested by SiteSeeker that they used. A fourth one was
for the number of hits returned by the search engine. Finally, there were
two fields where they would write down the answer found and the link
that gave that answer.
The non-spellchecking form does not have the third field mentioned
above.
These answering forms, the location of the search engine and relevant
instructions to follow (see Section 3.3 for details) were made available
to the test participants via a web page.
3.3 The experiment rules
The experiment had to be conducted in two hours. This was considered
enough time for the participants to answer all questions; moreover, two
hours was the duration of a normal language engineering lecture. During
the first hour the test participants in each group had to query the search
engine (with the spellchecker) and find the answers to their ten questions.
In practice, one of the two in each group read the questions while his or
her partner did the searching and filled in the answers they found in the
form. The participant who was querying the search engine could not see
how the words were spelled. This, according to our estimation, simulates
an ordinary web search where a user, wanting to find information on a
news event or a subject, just types in the terms that come to mind.
A set of rules was made that the test participants had to follow
while querying:
1. They were not allowed to do more than five trials on each
question to find the answer.
2. They were not allowed to use queries longer than five words.
3. No background knowledge was allowed; that is, only the words
used in natural language as they were written in the questionnaire
were allowed. So even if a test participant was very familiar with
a certain event and probably knew the answer to a certain
question, he or she was only allowed to use the words that were
read to him or her.
4. Boolean expressions and phrase searches were allowed.
5. Each question had to be read no more than five times.
6. The full URL of each found answer to a question had to be
written down for use during the evaluation process.
After having finished this first part, we rotated the question forms to
avoid training effects from running the same set of questions; that is, we
did not want the test participants to query the search engine with a set of
questions whose answers they already knew. During the second hour we
switched the question forms: group one's question forms were given to
group two and vice versa, group three's question forms to group four and
vice versa, and so on. This time each group had to query the search
engine without the spellchecker, using the non-spellchecking answering
form described in Section 3.2. The roles of the participants in each group
were also swapped in the second part of the experiment: those who,
during the first hour, read the questions to their partner now queried the
search engine, and vice versa.
When all groups had finished the second part, we browsed through the
two answering forms to make sure that they had been filled in as
correctly and legibly as possible.
These forms were then used to evaluate our search engine based on
precision and recall.
4 Evaluation
In this chapter we present the evaluation of the experiment and our
findings. Section 4.1 details the evaluation metrics for the performance
of our query spellchecker. Section 4.2 introduces the criteria used to
judge the relevance of the documents returned by the search engine for a
query. Our final results are presented in Section 4.3.
4.1 Evaluation Metrics
Information retrieval systems are usually compared based on the “quality” of the retrieved document sets. This “quality” is traditionally quantified using two metrics, recall and precision (SearchEngineWatch 2003).
Each document in the collection at hand is judged to be either relevant or
non-relevant for a query. Precision is calculated as the fraction of
relevant documents among the documents retrieved, and recall measures
the coverage of the system as the fraction of all relevant documents in
the collection that the system retrieved. See Section 1.1.4 for a more
elaborate definition. Since this method is well known and widely used
(Carlberger et al. 2001; Dalianis 2002), no evaluation methods other than
the present one were considered.
In this experiment, we believe that the method used will in fact measure
what we want to measure.
To measure recall over a collection we need to mark every document
in the collection as either relevant or non-relevant for each evaluation
query. This, of course, is a daunting task for any large document collection like ours, and is essentially impossible for the web, which contains
billions of documents.
Due to the difficulties of calculating recall, relative recall will be
used instead. Relative recall can be defined as the fraction of known
relevant items retrieved by the system. The following example will help
to clarify this term:
1. System A retrieves four items: D1, D2, D3 and D4.
2. System B retrieves three items: D2, D3 and D5.
3. The known unique relevant items D1, D2, D3, D4 and D5 are
assumed to approximate the total number of relevant items (5 in
this case).
4. Relative recall for A: 4/5 = 80 %
5. Relative recall for B: 3/5 = 60 %
In this report, the results found with each search method are pooled and
assumed to approximate the total number of relevant documents, and the
relative recall of each search method is calculated from this pool.
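The pooling procedure just described can be sketched as follows; the document identifiers match the System A and System B example above.

```python
def relative_recall(retrieved_per_system):
    """Pool all retrieved (assumed relevant) items and compute, per system,
    the fraction of the pool that the system retrieved."""
    pool = set().union(*retrieved_per_system.values())
    return {name: len(items) / len(pool)
            for name, items in retrieved_per_system.items()}

systems = {"A": {"D1", "D2", "D3", "D4"}, "B": {"D2", "D3", "D5"}}
print(relative_recall(systems))   # {'A': 0.8, 'B': 0.6}
```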
4.2 Evaluating retrieved documents
and their Relevance judgements
What is a "relevant" document, and who determines the relevance of
retrieved documents? In reality, the only person who can determine
whether a document is relevant to an information need is the person who
has the information need; in this case, that is the test participants.
The answering forms described in Section 3.2 that the students handed
in were carefully scrutinised, since these would determine how much
precision and recall had increased. In order to verify the correctness of
the answers given in the answering forms, we used, for each question,
the actual query words that each group wrote down and queried the
search engine again. That way we could, besides the document that
contained the answer, also check the other nine of the top ten documents
(or all documents, if fewer than 10) returned by SiteSeeker.
An evaluation scheme that presents the result pages for a query one
page at a time was set up. It allows us to judge each page as "good",
"bad" or "ignore" for the query.
The following criteria were used to judge documents. A good page is
one that contains an answer to the test question on the page. If a page is
not relevant, or only contains links to the useful information, even if the
links are to other documents on the same site, the page is judged "bad".
If a page could not be viewed for any reason (e.g., server error, broken
link), the result is "ignored". This scale of judgement was found
sufficient because in this experiment it is a matter of a returned page
yielding a right or a wrong answer.
4.3 Evaluation Results
In this section we report the results of the experimental evaluation using
the methodology described in the previous section.
For each test question, the top ten retrieved documents were put into an
Excel file, both for the searches without spellchecking and for the
searches with spellchecking.
Making use of the relevance judgements described in Section 4.2, we
awarded each "good" page or link 1 point; a "bad" one was awarded
0 points. Using these figures and the metrics described in Section 4.1, we
calculated precision and relative recall in both cases. Subsequently, we
found increases in precision and relative recall of 4 percent and
11.5 percent, respectively. The present results were also presented in Sarr
(2003) at the NODALIDA conference in May 2003 in Reykjavik,
Iceland. Table 1 details our findings.
Table 1. No Spellchecking versus Spellchecking (precision and relative recall at the first 10 hits).

                                 No Spellchecking    Spellchecking
Number of texts                  79,000              79,000
Number of questions              100                 100
Average precision                0.485               0.505
Increase in precision                                4 %
Average relative recall          0.52                0.58
Increase in relative recall                          11.5 %
4.3.1 Analysis of results
In the answering form mentioned in Section 3.2 there was a field that the
test participants had to fill in if they used query words suggested by
SiteSeeker. According to these forms, and the verifications that were
done, the number of misspellings that SiteSeeker gave suggestions for
was exactly 14.
Looking at the nature of these misspellings, we noticed that six were of
typographic type, three of cognitive type, four of phonetic type, and one
could not be classified. Of these, 13 were good suggestions, that is, they
led to the correct answer. In 12 of these 13 cases, more than one
suggestion was given. Interestingly, only twice was the correct answer
not the first suggestion, and in both of these cases the correct answer was
the second suggestion. This tells us that our system is good at ranking
corrections: it gave the most probable correction as the first alternative in
10 cases out of 12. Only in three cases did a search with the spellchecker
fail to return hit pages or return only irrelevant hits. In two of these three
cases no documents were returned at all. In one case four documents
were returned, but none of them were relevant to the query words.
4.3.2 Discussion
Our spellchecker did a good job of detecting, correcting and suggesting
alternatives to the spelling errors made by the test participants. The
figures shown in Section 4.3.1 have, partly or wholly, contributed to the
increase in precision and relative recall. The fact that only 14 out of 100
queries were erroneous can be interpreted as meaning that the test panel
in general had no difficulties spelling the words that were read to them,
or that the words in the questionnaires were simply not "difficult"
enough to be misspelled while entering them in the search field.
On the other hand, one can wonder what is considered to be a "difficult"
word; whether a word is considered obscure and difficult to spell is a
subjective matter. At the end of the experiment phase we asked the test
participants to write down their thoughts on using the search engine with
and without the spellchecker. The comment that kept coming up was "we
did not have a lot of use for the spellchecker; on some occasions we got
suggestions for words that we almost spelled right." By checking the
words that were posed to the search engine just after the experiment, we
could indeed confirm the comment made by some of the test participants.
The words that they most often had difficulty with were those describing
either the name of a person, like "Whitney Houston", or the name of a
place, like "Shebaa in Syria". Since there were maybe ten out of a
hundred of these in the questionnaires, this can partly explain why the
test participants (at least those who were familiar with the names
mentioned above) felt that the spellchecker was not of much help.
An interesting observation was that among the test participants there
were five exchange students whose first language is not Swedish.
According to them, the spellchecker was of good assistance.
As a final note, we can say that the results obtained in this experiment
are very promising but should not be exaggerated. The reasons are
twofold:
1. The corpus used in our experiment was static, but in the world
of the Internet, corpora are seldom static. On the other hand,
a comparison of two experiments would not be fair if the
corpus were not static.
2. Only 100 questions were posed to our search engine. On a
typical day, the search engines of the Internet are asked hundreds
of thousands, if not millions, of queries.
5 Conclusions and future improvements
We found in our experiment that the spellchecker connected to our
search engine SiteSeeker does in fact improve both precision and relative
recall, by 4 and 11.5 percent respectively.
Though the improvement in precision and relative recall is not as
dramatic as we had hoped, it nevertheless tells us that it is wise and
helpful to use a spellchecker in a search engine. But would we have the
same results if we set up the same experiment with different types of
questionnaires and with a different test panel?
As an answer to the questions above, we propose the following future
steps in order to discover the full potential of our search engine:
1. Run the same experiment but with fewer than five trials allowed,
and see whether the scores remain high.
2. Run the same experiment but with different test panels and
compare the results, to see how much precision and relative
recall vary. We have noticed that our test panel made very few
spelling errors; it would be interesting to run the same experiment
with an "error-prone" test panel.
3. Introduce more complex words, like technical terms and obscure
names, into the questionnaires to query for.
The goal would also be to see how the spellchecker handles misspellings
not discussed in this report, such as run-on words1 and split words2, and
how these would influence the search results. These types of spelling
errors have become more common in Swedish, probably due to English
influence.
1 Typing "laddaned" instead of "ladda ned" (download).
2 Typing "bok hylla" instead of "bokhylla" (bookshelf).
References
(All WWW references last visited 15 November 2003)
(Carlberger et al. 2001) Carlberger, J., H. Dalianis, M. Hassel and
O. Knutsson. Improving Precision in Information Retrieval for
Swedish using Stemming. In Proceedings of NODALIDA '01 –
13th Nordic Conference on Computational Linguistics, May
21-22, 2001, Uppsala, Sweden, 2001.
ftp://ftp.nada.kth.se/IPLab/TechReports/IPLab-194.pdf
(Carlberger and Kann 2000) Carlberger, J. and V. Kann
Some applications of a statistical tagger for Swedish. Proc. 4:th
conference of the International Quantitative Linguistics Association (Qualico-2000), pp. 51-52, August 2000
http://www.nada.kth.se/theory/projects/granska/rapporter/
taggerapplications20000215.pdf
(Carlberger and Kann 1999) Carlberger, J. And V. Kann.
Implementing an efficient part-of-speech tagger. Software
Practice and Experience, 29, pp. 815-832, 1999.
ftp://ftp.nada.kth.se/pub/documents/Theory/Viggo-Kann/
tagger.pdf
(Dalianis 2002) Dalianis, H.
Evaluating a Spelling Support in a Search Engine. In Natural Language Processing and Information Systems, 6th International
Conference on Applications of Natural Language to Information
Systems, NLDB 2002 (Eds.) B. Andersson, M. Bergholtz, P.
Johannesson, Stockholm, Sweden, June 27-28, 2002. Lecture Notes in Computer Science. Vol. 2553. pp. 183-190. Springer Verlag,
2002.
http://www.nada.kth.se/~hercules/papers/SpellingIR.pdf
(Domeij et al. 1994) Domeij, R., J. Hollman, and V. Kann.
Detection of spelling errors in Swedish not using a word list en
Clair. Journal of Quantitative Linguistics 1:195-201, 1994.
(Google 2002) Google press release written by Dean, J.
(Engineer at Google). Google’s future plans. (December 2002).
http://www.searchengineshowdown.com/newsarchive/
000611.shtml
(Hassel 2001) Hassel, M. 2001.
Internet as Corpus – Automatic Construction of a Swedish News
Corpus. NODALIDA '01 – 13th Nordic Conference on Computational Linguistics, May 21-22 2001, Uppsala, Sweden.
http://www.nada.kth.se/~xmartin/papers/
NewsCorpus_NODALIDA01.pdf
(Hodge & Austin 2001a) Hodge J., Victoria & Jim Austin.
An Evaluation of Standard Spell Checking Algorithms and a Binary Neural Approach. Accepted for, IEEE Transactions on
Knowledge and Data Engineering.
(Hodge & Austin 2001b) Hodge J. Victoria and Jim Austin.
An Evaluation of Phonetic Spell Checkers, Technical Report YCS
338(2001), Department of Computer Science, University of York,
2001.
(Kann et al. 1998) Kann, V., R. Domeij, J. Hollman, M. Tillenius.
Implementation aspects and applications of a spelling correction
algorithm. NADA TRITA-NA-9813, NADA, Royal Institute of
Technology, KTH, 1998.
ftp://ftp.nada.kth.se/pub/documents/Theory/Viggo-Kann/
TRITA-NA-9813.pdf
(Knutsson 2001) Knutsson, O.
Automatisk språkgranskning av svensk text (in Swedish; Automatic Proofreading of Swedish Text). Licentiate thesis, IPLab-NADA, Royal Institute of Technology, KTH, Stockholm, 2001.
(Kukich 1992) Kukich, K.
Techniques for automatically correcting words in text, ACM Computing Surveys. Vol.24, No. 4 (Dec. 1992), pp. 377-439, 1992.
(Sarr 2003) Sarr, M.
Improving precision and recall using a spellchecker in a search engine. In Proceedings of NODALIDA 2003, the 14th Nordic Conference of Computational Linguistics, Reykjavik, May 30-31, 2003.
http://www.nada.kth.se/~hercules/papers/SarrNodalida2003.pdf
(SearchEngineWatch 2003) SearchEngineWatch.com
Sullivan, D.: How search engines rank web pages (July 2003)
http://www.searchenginewatch.com/webmasters/article.php/2167961
(Stolpe 2003) Stolpe, D.
Högre kvalitet med automatisk textbehandling? En utvärdering av
SUNETs Webbkatalog (In Swedish) (Higher Quality by Automatic
Text Processing? An Evaluation of the SUNET Web Catalogue).
Forthcoming Master’s thesis, NADA.
(Tong & Evans 1996) Tong, X. and Evans, D. A.
A Statistical Approach to Automatic OCR Error Correction in
Context. In: Proceedings of the fourth Workshop on Very Large
Corpora, 88-100, Copenhagen, Denmark, 1996.
(Webopedia 2002)
Webopedia web definition: Search Engine (October 2002)
http://www.webopedia.com/TERM/s/search_engine.html
Appendices
Appendix A Search engines: how they work
According to Webopedia (2002) there are basically three types of search
engines: Those that are powered by crawlers, or spiders; those that are
powered by human submissions; and those that are a combination of the
two.
Crawler-based engines, such as Google, send crawlers, or spiders,
out into cyberspace. These crawlers visit a Web site, read the information on the site itself, read the site’s meta tags¹ and also follow the links that the site connects to. The crawler then returns all that information to a central repository, where the data is indexed.
The crawler will periodically return to the sites to check for any information that has changed, and the frequency with which this happens is
determined by the administrators of the search engine.
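As an illustration of the crawl-and-index cycle described above, the following sketch visits pages breadth-first, stores their text in a central collection and queues the outgoing links for later visits. The helpers fetch_page and extract_links are assumed to exist and stand in for real HTTP fetching and HTML parsing; real crawlers also apply politeness rules and scheduling that are omitted here.

# A minimal sketch of the crawl-and-index cycle described above.
# fetch_page() and extract_links() are hypothetical helpers standing in
# for real HTTP fetching and HTML parsing.
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    """Visit pages breadth-first and return a mapping url -> page text."""
    index = {}                     # the "central repository" of collected data
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(index) < max_pages:
        url = queue.popleft()
        text = fetch_page(url)     # read the information on the site
        if text is None:
            continue
        index[url] = text          # hand the content over for indexing
        for link in extract_links(text):   # follow the links the site connects to
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index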
Human-powered search engines, such as the Open Directory, rely on
humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index. A
search looks for matches only in the descriptions submitted.
Hybrid search engines combine the two. In the web’s early days, it
used to be that a search engine either presented crawler-based results or
human-powered listings. Today, it is extremely common for both types
of results to be presented. Usually, a hybrid search engine will favor one
type of listings over another. For example, MSN Search² is more likely to present human-powered listings from LookSmart³. However, it does also present crawler-based results (as provided by Inktomi⁴), especially
for more obscure queries.
In all these cases, when a user queries a search engine to locate information, he or she is actually searching through the index that the search engine has created; he or she is not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, sometimes returns results that are in fact dead links. Since the search results are based on the index, if the index has not been updated since a Web page became invalid, the search engine treats the page as still being an active link even though it no longer is. It will remain that way until the index is updated.
¹ A special HTML tag that provides information about a Web page.
² http://search.msn.com
³ http://www.looksmart.com
⁴ http://www.inktomi.com
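To make the role of the index concrete, the sketch below builds a toy inverted index and answers queries from it alone, which is why a page that has disappeared from the Web can still be returned until the index is rebuilt. The data structure is a textbook simplification, not the storage format of any particular commercial engine.

# A toy inverted index: queries are answered from the stored index,
# not from the live Web, so deleted pages remain until re-indexing.
from collections import defaultdict

def build_index(pages):
    """pages: dict url -> text. Returns term -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return urls containing every query term (simple AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

if __name__ == "__main__":
    pages = {"http://example.se/a": "ladda ned program",
             "http://example.se/b": "bokhylla av trä"}
    idx = build_index(pages)
    print(search(idx, "ladda ned"))  # still returned even if page /a later disappears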
So why does the same search on different search engines produce different results? Part of the answer is that not all indices are exactly the same; it depends on what the spiders find or what the humans submitted. But more importantly, not every search engine uses
the same algorithm to search through the indices. The algorithm is what
the search engines use to determine the relevance of the information in
the index to what the user is searching for.
Appendix B Ranking web pages
Often, when a user does a search using a crawler-based search engine, the search engine will, nearly instantly, sort through the millions of pages it knows about and present the user with the ones that match his or her topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines do not always find the specific information you want. Non-relevant pages make it through, and sometimes it
may take a little more digging to find the information needed. But, by
and large, search engines do an amazing job.
So, how do crawler-based search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort
through? According to SearchEngineWatch (2003) they follow an algorithm. Exactly how a particular search engine's algorithm works is a
closely kept trade secret.
However, all major search engines follow the general rule of location and frequency of keywords on a web page, better known as the location/frequency method. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant to the topic than others.
Search engines will also check to see if the search keywords appear
near the top of a web page, such as in the headline or in the first few
paragraphs of text. They assume that any page relevant to the topic will
mention those words right from the beginning.
Frequency is the other major factor in how search engines determine
relevancy. A search engine will analyse how often keywords appear in relation to other words in a web page, and how they are distributed over all the documents that are indexed. Pages with a higher keyword frequency are often judged more relevant than other web pages.
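The sketch below gives one possible reading of the location/frequency idea: a score that combines keyword frequency with boosts for terms that appear in the title or near the top of the page. The weights and parameter names are invented for the example; no real engine publishes its actual formula.

# A hypothetical location/frequency scorer: keyword frequency plus
# boosts for the title and the opening of the page. Weights are arbitrary.
def location_frequency_score(query_terms, title, body, top_chars=300,
                             title_boost=0.5, top_boost=0.25):
    """Higher score = assumed more relevant. Purely illustrative weights."""
    body_words = body.lower().split()
    title_words = set(title.lower().split())
    top_words = set(body.lower()[:top_chars].split())
    score = 0.0
    for term in (t.lower() for t in query_terms):
        # frequency: how often the keyword appears relative to other words
        score += body_words.count(term) / max(len(body_words), 1)
        # location: keyword in the HTML title tag
        if term in title_words:
            score += title_boost
        # location: keyword near the top of the page
        if term in top_words:
            score += top_boost
    return score

if __name__ == "__main__":
    print(location_frequency_score(["sökmotor"],
                                   title="Om sökmotorer",
                                   body="En sökmotor indexerar webbsidor och rankar dem"))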
All the major search engines follow the location/frequency method to some degree, in the same way that cooks may follow a standard recipe.
But cooks like to add their own secret ingredients. In the same way,
search engines add spice to the location/frequency method. Nobody does
it exactly the same, which is one reason why the same search on different search engines produces different results.
To begin with, some search engines index more web pages than
others. Some search engines also index web pages more often than
others. The result is that no search engine has the exact same collection
of web pages to search through. That naturally produces differences when comparing their results.
Search engines may also penalise pages or exclude them from the index if they detect search engine “spamming”. An example is when a word is repeated hundreds of times on a page to increase the frequency
and propel the page higher in the listings. Search engines watch for
common spamming methods in a variety of ways, including following
up on complaints from their users.
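As a simple illustration of the kind of check that could catch the keyword-repetition example above, the sketch below flags a page when a single term accounts for an implausibly large share of its words. The threshold is arbitrary; real spam detection combines many more signals.

# A toy keyword-stuffing check: flag a page if one term makes up an
# implausibly large share of its words. The 20% threshold is arbitrary.
from collections import Counter

def looks_like_keyword_stuffing(body, max_share=0.20, min_words=50):
    words = body.lower().split()
    if len(words) < min_words:
        return False                      # too little text to judge
    _, count = Counter(words).most_common(1)[0]
    return count / len(words) > max_share

if __name__ == "__main__":
    spammy = ("billig resa " * 100) + "boka nu"
    print(looks_like_keyword_stuffing(spammy))   # True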
Crawler-based search engines have plenty of experience now with
webmasters who constantly rewrite their web pages in an attempt to gain
better rankings. Some sophisticated webmasters may even go to great
lengths to “reverse engineer” the location/frequency systems used by a
particular search engine. Because of this, all major search engines now
also make use of “off the page” ranking criteria (SearchEngineWatch
2003).
Off the page factors are those that a webmaster cannot easily influence. Chief among these is link analysis. By analysing how pages link to
each other, a search engine can both determine what a page is about and
whether that page is deemed to be “important” and thus deserving of a
ranking boost. In addition, sophisticated techniques are used to screen
out attempts by webmasters to build “artificial” links designed to boost
their rankings.
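One well-known form of link analysis is PageRank-style iteration, sketched below in a heavily simplified form: a page receives a higher score the more, and the more important, pages link to it. The damping factor and iteration count are conventional textbook values, not the settings of any real search engine.

# A heavily simplified PageRank-style link analysis: a page is "important"
# if important pages link to it. Textbook damping factor, not a real engine's.
def simple_pagerank(links, damping=0.85, iterations=20):
    """links: dict page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(simple_pagerank(links))   # "c" ends up with the highest score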
Another off the page factor is clickthrough measurement. In short,
this means that a search engine may watch what results someone selects
for a particular search, then eventually drop high-ranking pages that are
not attracting clicks, while promoting lower-ranking pages that do pull
in visitors. As with link analysis, systems are used to compensate for
artificial clicks generated by eager webmasters.
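To show what such a clickthrough signal might look like in its simplest form, the sketch below nudges a page's relevance score up or down according to its observed clickthrough rate. The blending weight is purely illustrative, and real systems also filter out suspicious click patterns before using the data.

# An illustrative clickthrough adjustment: blend the base relevance score
# with the observed clickthrough rate (clicks / impressions). Weight is arbitrary.
def adjust_by_clickthrough(base_score, clicks, impressions, weight=0.3):
    if impressions == 0:
        return base_score                     # no usage data yet
    ctr = clicks / impressions
    return (1 - weight) * base_score + weight * ctr

if __name__ == "__main__":
    # Two pages with the same base score: the one attracting clicks is promoted.
    print(adjust_by_clickthrough(0.6, clicks=300, impressions=1000))  # 0.51
    print(adjust_by_clickthrough(0.6, clicks=5, impressions=1000))    # ~0.42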