Improving Precision and Recall Using a Spellchecker in a Search Engine

Mansour Sarr

TRITA-NA-E04022

Department of Numerical Analysis and Computer Science (NADA, Numerisk analys och datalogi), Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden

Master's Thesis in Computer Science (20 credits) within the First Degree Programme in Mathematics and Computer Science, Stockholm University, 2004. Supervisor at NADA was Hercules Dalianis. Examiner was Kerstin Severinson-Eklundh.

Abstract

Search engines constitute a key to finding specific information on the fast growing World Wide Web. Users query a search engine in natural language to retrieve documents on the desired subject. Sometimes no information is found because users make spelling and typing mistakes while entering their queries. Earlier reports suggest that 10-12 percent of all questions put to a search engine are misspelled. The issue is how much the use of a query spellchecker affects the performance of a search engine. This Master's thesis presents an evaluation of how much a query spellchecker improves precision and recall in information retrieval for Swedish texts. The evaluation results indicate that spellchecking improved precision and recall by 4 and 11.5 percent respectively.

Evaluering av ett stavningsstöd till en sökmotor (Evaluation of a spelling support for a search engine)

Sammanfattning (Swedish summary)

Search engines are a key to finding specific information on the rapidly growing Internet. Users typically pose natural language queries to a search engine in order to find the information they are interested in. Sometimes the search fails because the user happens to misspell or mistype a word. Earlier studies show that 10-12 percent of all questions put to a search engine are misspelled. The question is how the spelling support affects the search results. This Master's thesis evaluates how much a spellchecker can improve precision and recall in information retrieval in Swedish. The results show that the spellchecker improved both precision and recall, by 4 and 11.5 percent respectively.

Acknowledgments

This report is my Master's thesis and was carried out at the Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH). First of all I would like to thank my supervisor, Dr. Hercules Dalianis, for providing a lot of support, encouragement and advice. I would also like to thank some people who helped me along the way: Johan Carlberger at Euroling AB for providing technical support on the search engine SiteSeeker when needed, and Martin Hassel at NADA/KTH for his assistance before and during the experimental phase. Finally I thank all the students in the language engineering class who took part in the experiment.
Mansour Sarr

Contents

1 Introduction
  1.1 Some background terminology
    1.1.1 Query spellchecker
    1.1.2 Spelling Errors
    1.1.3 Search engine
    1.1.4 Precision and Recall
  1.2 Motivation
2 Previous research on SiteSeeker
  2.1 The stemming experiment
    2.1.1 Stemming
    2.1.2 Experiment and evaluation
  2.2 The spelling support experiment
  2.3 Other related work
  2.4 Error detection and correction techniques
    2.4.1 Error detection techniques
    2.4.2 Error correction techniques
3 Experiment
  3.1 The test participants
  3.2 Building the questions and answers forms
  3.3 The experiment rules
4 Evaluation
  4.1 Evaluation Metrics
  4.2 Evaluating retrieved documents and their relevance judgements
  4.3 Evaluation Results
    4.3.1 Analysis of results
    4.3.2 Discussion
5 Conclusions and future improvements
References
Appendices
  Appendix A Search engines: how they work
  Appendix B Ranking web pages
1 Introduction

This report evaluates a query spellchecker connected to the search engine SiteSeeker (http://www.siteseeker.se), based on precision and recall. It is intended as an extension to the stemming experiment that is described in detail in Carlberger et al. (2001) and was performed at the Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH). In their experiment they compared precision and recall with and without a stemmer, that is, software that transforms different inflections and derivations of the same word to one common least denominator. They concluded that stemming improved precision and recall by 15 and 18 percent respectively for Swedish texts having an average length of 181 words. This is described in more detail later in this report. The goal of my work was to set up an appropriate test experiment, evaluate it, and subsequently analyse whether or not the use of the query spellchecker improved search results in terms of precision and recall.

1.1 Some background terminology

In this section we introduce some terminology that will be used throughout this report.

1.1.1 Query spellchecker

Query spellcheckers generally process words as strings of characters. They identify misspellings by matching the string of characters of a word against the strings of characters of words contained in a dictionary. If no match is found, the query spellchecker flags the word as a misspelling and generates a list of possible replacements. The word choices provided in the replacement list for a given misspelling are identified by matching the initial string of letters and applying morphological rules to possible replacements in the program dictionary. Two factors influence the identification of the replacement word list: phonetic match and correct sequence of letters. The less severe the phonetic mismatch is to the target word, the closer the query spellchecker will come to identifying the target word within the list of possible replacements, because the string of characters of the misspelling will be similar to that of the target word.

1.1.2 Spelling Errors

A word error can belong to one of two distinct categories, namely nonword errors and real-word errors (Kukich 1992). Let a string of characters separated by spaces or punctuation marks be called a candidate string. A candidate string is a valid word if it has a meaning; otherwise it is a nonword. By real-word error we mean a word that is valid but is not the intended word in the sentence, thus making the sentence syntactically or semantically ill-formed or incorrect. For instance, in "the wote the paper" (instead of "he wrote the paper"), "wote" is a nonword error and "the" is a real-word error. In both cases the problem is to detect the erroneous word and suggest correct alternatives. See Section 2.4 for further information on error detection and correction.

In the case of typed text there are three kinds of nonword misspelling according to Kukich (1992): (1) typographic errors, (2) cognitive errors, and (3) phonetic errors. In the case of typographic errors, it is assumed that the writer knows the correct spelling but accidentally presses the wrong key, presses two keys, presses the keys in the wrong order, and so on. For instance: spell → speel, i.e. typing speel when spell was meant. The source of cognitive errors is presumed to be a misconception or a lack of knowledge on the part of the writer.
For instance: receive → recieve, minute → minite. In the case of phonetic errors it is assumed that the writer substitutes a phonetically plausible but orthographically incorrect sequence of letters for the intended word. For instance: naturally → nacherly, two → to.

1.1.3 Search engine

According to Webopedia (2002) a search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found. Although a search engine is really a general class of programs, the term is often used to specifically describe systems that enable users to search for documents on the World Wide Web. Typically, a search engine works by sending out a crawler, or spider (a program that automatically fetches web pages), to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query. Search engines on the web are of different types; see Appendix A for an in-depth description of how they work.

Major search engines, that is, well-known, well-used and commercially backed ones that are likely to be well maintained and upgraded to keep pace with the growing web, such as Google (http://www.google.com) and AllTheWeb (http://www.alltheweb.com), are crawler-based. They have, simply put, three major elements. First is the spider, also called the crawler. It visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis to look for changes. Everything the spider finds goes into the second part of the search engine, the index. It is like a giant book containing a copy of every web page that the spider finds. If a web page changes, the index is updated with the new information. Search engine software is the third part of a search engine. This is, as described above, the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in the order of what it believes is most relevant. More on how search engines rank hit pages can be read in Appendix B.

1.1.4 Precision and Recall

These terms are metrics that traditionally define the "quality" of the retrieved document set. In an ideal world, a perfect search attains both high precision and high recall. In the real world of the Internet, a balance is struck, hopefully an optimal one.

Precision is defined as the number of relevant documents retrieved divided by the total number of documents retrieved (Dalianis 2002). For example, suppose there are 60 documents relevant to white roses in the collection. A query returns 40 documents, 30 of which are about white roses. 30 relevant documents divided by 40 retrieved documents equals 75 % precision. One hundred percent precision is an obtainable goal, since the system could be programmed to return just one completely relevant document (giving a 1/1 precision rate). However, returning just one document might not resolve the seeker's query. Therefore, a search system attempts to maximise both precision and recall simultaneously.

Recall is defined as the number of relevant documents retrieved divided by the total number of relevant documents in the database (Dalianis 2002). For example, suppose there are 60 documents relevant to white roses in the database. A query returns 48 documents, 30 of which are about white roses. Recall is then calculated as 30 relevant documents retrieved divided by the 60 relevant documents in the database, which equals 50 % recall. Trying to accomplish a high recall rate on the Internet is difficult, due to the enormous volume of information that must be searched. In an ideal world, recall is 100 percent. However, this is impossible to achieve, since the system attempts to maximise both recall and precision simultaneously.
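To make the two measures concrete, the following short Python sketch recomputes the white-roses figures above; the function names are illustrative only and not part of any system described in this report.

    def precision(relevant_retrieved: int, total_retrieved: int) -> float:
        """Fraction of the retrieved documents that are relevant."""
        return relevant_retrieved / total_retrieved

    def recall(relevant_retrieved: int, total_relevant: int) -> float:
        """Fraction of all relevant documents in the collection that were retrieved."""
        return relevant_retrieved / total_relevant

    # The white-roses example: 60 relevant documents exist in the collection.
    print(precision(relevant_retrieved=30, total_retrieved=40))  # 0.75 -> 75 % precision
    print(recall(relevant_retrieved=30, total_relevant=60))      # 0.50 -> 50 % recall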
1.2 Motivation

The vast majority of users navigate the Web via search engines (Webopedia 2002). Yet searching can sometimes be a frustrating activity. Users type in one or more keywords or phrases and, unfortunately, sometimes no documents are retrieved, or they get thousands of responses of which only a handful are close to what they are looking for. The reason, according to Dalianis (2002), is that the word is not in the index, that the user misspelled the word, or that the user did not know the right inflection of the word as written in the index. According to Kukich (1992), approximately 80 % of all misspelled words contain a single instance of one of four error types: insertion (an extra character is introduced), deletion (a character is omitted), substitution (a character is replaced by one it resembles) and transposition (two adjacent characters are switched).

It is well established that 10-12 percent of all questions to a search engine are misspelled; this is supported in Stolpe (2003) and in a Google press release (2002). These observations, and a claim made by Kann et al. (1998) that up to a third of the search terms given to web-based dictionaries are misspelled, made us believe that attaching a spellchecker would assist the users by perhaps increasing precision, recall, or both. A spellchecker is required to identify errors in queries where little or no contextual information is available and, using some measure of similarity, recommend the words that are most similar to each misspelled word. This error checking would prevent wasted computational processing, prevent wasted user time and hopefully make our system score high on precision and recall. The query spellchecker that is used in the present experiment performs a presence check on query words against a stored lexicon, identifies spelling errors and recommends alternative spellings.

2 Previous research on SiteSeeker

Two major experiments have been conducted using the SiteSeeker search engine. More elaborate details on these can be found in Carlberger et al. (2001) and Dalianis (2002). I will nevertheless present a quick overview of these experiments, since our experiment builds on them. Sections 2.1 and 2.2 describe the two earlier experiments done with the SiteSeeker search engine. Section 2.3 looks at earlier work that has been done in the area of spellcheckers and search engines. Section 2.4 describes error detection and correction techniques.

2.1 The stemming experiment

Carlberger et al. (2001) performed the stemming experiment; their goal was to evaluate how much stemming improves precision in information retrieval for Swedish texts. They built an information retrieval tool with optional stemming and created a corpus of over 54,000 Swedish news articles.

2.1.1 Stemming
Stemming, as described in Carlberger et al. (2000), is a technique to transform different inflections and derivations of the same word to one common "stem" (the least common denominator of the morphological variants). It can mean both prefix and suffix removal. Stemming can, for example, be used to ensure that the greatest number of relevant matches is included in search results. A word's stem is its most basic form; the stem of a plural noun is the singular, and the stem of a past tense verb is the present tense. For example, "barnen" (the children) is stemmed to "barn" (child), and the past tense verb "sköt" (shot) is stemmed to "skjut" (shoot).

The stemming algorithm for Swedish discussed above uses about 150 stemming rules. The technique modifies the original word into an appropriate stem with a small set of suffix rules applied in a number of steps. The stemming is done in one to four steps, and in each step no more than one rule from a set of rules is applied. This means that 0-4 rules are applied to each word passing through the stemmer. Each rule consists of a lexical pattern to match against the suffix of the word being stemmed, and a set of modifiers, or commands. Further details on this matter can be read in Carlberger et al. (2000).
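The roughly 150 rules themselves are not reproduced in this report, but the following Python sketch illustrates the stepwise suffix-rule idea with a handful of invented rules; neither the rules nor the guard condition are taken from Carlberger et al. (2000), and a real stemmer handles far more cases.

    # Illustrative only: a toy stepwise suffix stemmer in the spirit of the
    # algorithm described above. Each step applies at most one rule, and at
    # most three steps are applied here. The rules are invented examples.
    RULE_STEPS = [
        [("orna", "a"), ("arna", "e"), ("en", ""), ("et", "")],  # definite endings
        [("ade", "a"), ("ar", "a"), ("er", "")],                 # verb/noun endings
        [("s", "")],                                             # genitive/passive -s
    ]

    def stem(word: str) -> str:
        for rules in RULE_STEPS:                  # at most one rule per step
            for suffix, replacement in rules:
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    word = word[: -len(suffix)] + replacement
                    break                         # go on to the next step
        return word

    print(stem("barnen"))     # -> "barn"   (the children -> child)
    print(stem("flickorna"))  # -> "flicka" (the girls -> girl)
    print(stem("kastade"))    # -> "kasta"  (threw -> throw)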
2.1.2 Experiment and evaluation

Over 54,000 news articles from the KTH News Corpus were selected (Hassel 2001). Among these, 100 texts were randomly selected and manually annotated with a question and answer pair central to each text. Three test participants carried out the experiment, where a rotating questioning-answering evaluation schema was used. The three participants answered 33, 33 and 34 questions respectively, with and without stemming functionality. After all 100 questions had been gone through and answered, the work was rotated and the participants became each other's evaluators, assessing how many of the top ten hit pages returned by the search engine for each question were correct and how many were wrong. Of the 100 questions, the test participants found 96 answers; two questions did not give any answers at all and two other questions gave unreadable files. The queries asked had an average length of roughly 2.7 words. The texts containing the answers had an average length of 181 words. Increases of 15 percent in precision and 18 percent in relative recall were found on the first 10 hits for stemming versus no stemming.

2.2 The spelling support experiment

In this experiment, performed by Dalianis (2002), the website of the Swedish National Tax Board (RSV, Riksskatteverket) was used as the testing domain. This website is extensively used by the public to perform searches on, among other things, tax and income issues. The goal was to assess and evaluate the effectiveness of the query spellchecker from April to September 2001. It should be mentioned that the spellchecking algorithm used stems from Stava and Granska (Domeij et al. 1994, Carlberger & Kann 1999, 2000, Knutsson 2001) and makes use of the edit-distance techniques described in Kukich (1992). During this period, the Swedish National Tax Board, whose site contained roughly 6,000 documents, used EuroSeek's search engine with a built-in stemmer and dynamic query spellchecker. More than one million queries came in, out of which 101,446 were erroneous. For the 100 most common spelling errors, the system gave approximately 92 percent good suggestions; of these, 40 percent concerned split compound words (incorrect division or fusion of words, e.g. "my self" for "myself"), 22 percent were spelling errors and the remaining 30 percent represented alternative spellings.

2.3 Other related work

A number of experiments have been carried out on information retrieval and spellchecking. However, no academic work detailing an evaluation of a spellchecker connected to a search engine, measuring the improvement in precision and recall, could be found. Instead, many of the spellcheckers that have been built and evaluated are general-purpose ones. These are suitable for any spellchecking application, including isolated-word error correction such as spellchecking user queries in a search engine.

An interactive spellchecker based upon phonetic matching, aimed at being used in a search engine, is described in Hodge & Austin (2001a). The spellchecker uses a hybrid approach to overcome phonetic spelling errors and the four main forms of typing errors: insertion, deletion, substitution and transposition. It uses a Soundex-type coding approach (Soundex uses codes based on the sound of each letter to translate a string into a canonical form) coupled with transformation rules to overcome phonetic spelling errors. It also uses an n-gram approach (see Section 2.4.1) to overcome the first two forms of typing error (insertion and deletion) and an integrated approach to overcome substitution and transposition errors. It aims at high recall at the expense of precision. Compared to several benchmark spellcheckers, the hybrid spellchecker showed high figures: it had the highest recall rate at 93.9 % for a large 45,243-word lexicon and an increased recall of 98.5 % for a smaller lexicon.

A method for detecting and correcting spelling errors in Swedish text was devised in Domeij et al. (1994). It was refined in Kann et al. (1998), where a ranking of corrections using word frequencies and editing distance was implemented. These and other techniques are discussed at length in the next section. According to the latter, their spelling correction can be used in information retrieval, where the users would for example be offered interactive spelling correction of misspelled search terms. This, they claim, would improve search results with regard to both precision and recall. The spellchecker we are using has, in fact, been put to the test twice. The first time it was used was in a Swedish-English web dictionary, which contains 28,500 Swedish words. About 20,000 queries were posed to the web dictionary every day, of which 20 % were misspelled. For 33 % of the misspellings a single word is at the closest distance to the misspelling, so the query could be corrected automatically. The second time our spellchecker was used is described in Section 2.2.

In Hodge & Austin (2001b) a phonetic spellchecker called Phonetex is evaluated. This spellchecker is intended to integrate with an existing typographical spellchecker. It was compared against other phonetic and benchmark spellcheckers. Different lexicon sizes were used to investigate the effect of lexicon size on recall accuracy, using a test set of 360 phonetic misspellings. It was shown that Phonetex has the highest recall, only failing to find eight words out of 360.
It maintains high recall across lexicon sizes for best-match retrieval, ranging from 0.96 for the large 45,000-word dictionary to 0.98 for the smaller dictionaries.

2.4 Error detection and correction techniques

When studying a spellchecker in a search engine, a differentiation is often made between the task of error detection and that of error correction. The latter is considered a much harder problem to solve, because it deals with issues such as locating candidate words and ranking them, while the former deals with checking whether words or strings appear in a given index, dictionary or lexicon. In practice, a user of a search engine rarely notices that producing a good suggestion is generally the two-step task described above. The success of these tasks determines how good precision and recall will be. In this section we give a short survey of detection and correction techniques that have been used by others. For a more general description, see Kukich (1992).

2.4.1 Error detection techniques

Research on error detection is mostly done on nonword errors (see the definition in Section 1.1.2). The error detection process usually consists of checking whether an input string is a valid index or dictionary word. Efficient techniques have been devised for detecting such errors. The two best-known techniques are n-gram analysis and dictionary lookup. Spellcheckers rely mostly on dictionary lookup, while text recognition systems (systems that process hand-printed, hand-written or machine-printed text) rely mostly on n-gram techniques.

N-gram Analysis

In Kukich (1992) n-gram analysis is described as a method to find incorrectly spelled words in a mass of text. Instead of comparing each entire word in a text to a dictionary, just the n-grams are checked. The check is done using an n-dimensional matrix where real n-gram frequencies are stored. If a non-existent or rare n-gram is found, the word is flagged as a misspelling, otherwise not.

An n-gram is a set of consecutive characters taken from a string, with a length of whatever n is set to. If n is set to one the term used is a unigram, if n is two the term is a digram, if n is three the term is a trigram, and a quadrigram exists when n is set to four. One benefit of the n-gram approach is that it allows strings that have differing prefixes to match, and it is also tolerant of misspellings. Each string involved in the comparison is split up into sets of adjacent n-grams. N-gram algorithms have the major advantage that they require no knowledge of the language they are used with, and so they are often called language-independent or neutral string matching algorithms. Using n-grams to calculate, for example, the similarity between two strings is done by discovering the number of unique n-grams that they share and then calculating a similarity coefficient: the number of n-grams in common (the intersection) divided by the total number of n-grams in the two words (the union). An example is shown in Figure 1.

String 1 - "statistics"
String 2 - "statistical"

If n is set to 2 (digrams are being extracted), the similarity of the two strings is calculated as follows. Initially, the two strings are split into n-grams:

Statistics  - st ta at ti is st ti ic cs      (9 digrams)
Statistical - st ta at ti is st ti ic ca al   (10 digrams)

Then any n-grams that are replicated within a term are removed, so that only unique digrams remain. Repeated digrams are removed to prevent strings that are different but contain repeated digrams from matching.

Statistics  - st ta at is ti ic cs        (7 unique digrams)
Statistical - st ta at ti is ic ca al     (8 unique digrams)

The n-grams that are left over are then arranged alphabetically.

Statistics  - at cs ic is st ta ti
Statistical - al at ca ic is st ta ti

The number of unique n-grams that the two terms share is then calculated. The shared unique digrams are: at ic is st ta ti, in total 6 shared unique digrams. A similarity measure is then calculated using the similarity coefficient

    similarity = 2C / (A + B)

where
A = the number of unique n-grams in term 1,
B = the number of unique n-grams in term 2,
C = the number of unique n-grams appearing in both term 1 and term 2.

The above example gives (2*6) / (7+8) = 12/15 = 0.80.

Figure 1. An example of a similarity measure.
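As a complement to Figure 1, the following Python sketch computes the same digram similarity coefficient; it is a minimal illustration of the technique, not the implementation used in SiteSeeker, and the function names are illustrative.

    def ngrams(word: str, n: int = 2) -> set[str]:
        """Unique n-grams of a word (duplicates removed, as in Figure 1)."""
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def ngram_similarity(term1: str, term2: str, n: int = 2) -> float:
        """Similarity coefficient 2C / (A + B) over unique n-grams."""
        a, b = ngrams(term1, n), ngrams(term2, n)
        shared = a & b                        # C: unique n-grams in common
        return 2 * len(shared) / (len(a) + len(b))

    print(ngram_similarity("statistics", "statistical"))  # 0.8, as in Figure 1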
A possible extension to this technique consists in adding blanks at the beginning and end of the two words involved, thus increasing the total number of n-grams and giving a more precise result, especially for short words. For "statistical" this would give the new digrams 0s and l0, where the zeros represent the blanks. N-grams are used in Domeij et al. (1994) and in Kann et al. (1998) together with a precompiled table, called a "graphotactical table", to limit the number of false words accepted by their stored word list and to increase the speed.

Dictionary Lookup

A dictionary is a list of words that are assumed to be correct. Dictionaries can be represented in many ways, each with its own characteristics such as speed and storage requirements. At first one might think that the more words a dictionary contains, the better it is. The problem is that when more words are added to the dictionary, the risk of real-word errors increases. A good substitute for a large dictionary might be a dictionary of the most common words combined with a set of additional dictionaries for specific topics such as computer science or economy. A big dictionary also uses more space and may take longer to search. Nonword errors can be detected, as mentioned above, by checking each word against a dictionary. The drawbacks of this method are the difficulties in keeping such a dictionary up to date and sufficiently extensive to cover all the words in a text, while at the same time keeping system response time down. Dictionary lookup and construction techniques must be tailored to the purpose of the dictionary. Too small a dictionary gives the user too many false rejections of valid words; too large a dictionary can accept a high number of valid but low-frequency words.

Hash tables are the most commonly used technique to gain fast access to a dictionary. To look up a string, one computes its hash address and retrieves the word stored at that address in the preconstructed hash table. If the word stored at the hash address is different from the input string, a misspelling is flagged. The main advantage of hash tables is their random-access nature, which eliminates the large number of comparisons needed to search the dictionary. The main disadvantage is the need to devise a clever hash function that avoids collisions.

The spellchecker described in Domeij et al. (1994) uses a method called a Bloom filter to store its dictionary. A Bloom filter, in contrast to a hash table, has only two operations: checking whether a word is in it and adding a new word. A Bloom filter is, according to them, a vector of boolean values and a number of hash functions. To store a word in the dictionary you calculate each hash function for the word and set the vector entries corresponding to the calculated values to true. To find out if a word belongs to the dictionary, you calculate the hash values for that word and look in the vector. If all entries corresponding to the values are true, then the word belongs to the dictionary, otherwise it does not. The good thing about Bloom filters, they claim, is that they are compact and that the time required to find out whether a word belongs to the dictionary is independent of the size of the dictionary.
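The following Python sketch illustrates the Bloom filter idea just described; the bit-vector size, the number of hash functions and the hashing scheme are arbitrary choices for the example and are not those of the Stava spellchecker.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch: a boolean vector plus k hash functions."""

        def __init__(self, size: int = 10_000, num_hashes: int = 4):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, word: str):
            # Derive num_hashes positions from salted MD5 digests (illustrative only).
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{word}".encode("utf-8")).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, word: str) -> None:
            for pos in self._positions(word):
                self.bits[pos] = True

        def __contains__(self, word: str) -> bool:
            # True if all positions are set; false positives are possible,
            # false negatives are not.
            return all(self.bits[pos] for pos in self._positions(word))

    lexicon = BloomFilter()
    for w in ["hus", "barn", "skatt"]:
        lexicon.add(w)
    print("barn" in lexicon)   # True
    print("bran" in lexicon)   # almost certainly False (flagged as a misspelling)

Note that the lookup time depends only on the number of hash functions, not on the number of stored words, which is the property emphasised above.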
2.4.2 Error correction techniques

Correction of spelling errors is an old problem, and much research has been done in this area over the years. Most existing spelling correction techniques focus on isolated words, without taking into account the textual context in which the string appears. One should remember that isolated-word correction techniques are not able to detect real-word errors (see the definition in Section 1.1.2). Error detection alone is often not enough; according to SearchEngineWatch (2003), users have come to expect that the spellcheckers in their favourite search engines suggest good corrections for the misspellings they detect. The problem of isolated-word error correction thus consists of two steps: the generation of candidate corrections and the ranking of candidate corrections. The candidate generation process usually makes use of a precompiled table of legal n-grams to locate one or more potential correction terms (for more details see Kann et al. (1998)). The ranking process usually invokes some lexical similarity measure between the misspelled string and the candidates, or a probabilistic estimate of the likelihood of each correction, to rank-order the candidates. These two steps are most of the time treated as separate processes and executed in sequence. Some techniques omit the second step, though, leaving the ranking and final selection to the user.

The isolated-word methods described here are the most studied spelling correction algorithms (Kukich 1992): edit distance, similarity keys, rule-based techniques, n-gram-based techniques, probabilistic techniques and neural networks. All of these methods can be thought of as calculating a distance between the misspelled word and each word in the dictionary or index. The shorter the distance, the higher the dictionary word is ranked.

Edit Distance

Edit distance is a simple technique. The distance between two words is the number of editing operations required to transform one of the words into the other. The allowed editing operations are: remove one character, insert one character, replace one character with another, or transpose two adjacent characters. Edit distance is useful for correcting errors resulting from keyboard input, since these are often of the same kind as the allowed edit operations. It is not quite as good for correcting phonetic spelling errors, especially if the difference between spelling and pronunciation is big, as in English or French. It works better for some common Swedish misspellings. A variation on edit distance is to assign different distances to different editing operations and to the letters involved in these operations. This leads to a finer distinction between common and uncommon typographic errors.
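A minimal Python sketch of the distance just described, in the restricted variant that also counts a transposition of two adjacent characters as one operation, might look as follows; it uses unit costs and is not the weighted variant mentioned above.

    def edit_distance(a: str, b: str) -> int:
        """Edit distance with unit costs for insertion, deletion, replacement
        and transposition of two adjacent characters."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i                      # delete all of a[:i]
        for j in range(len(b) + 1):
            d[0][j] = j                      # insert all of b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # replacement or match
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    print(edit_distance("speel", "spell"))      # 1 (transposition of 'e' and 'l')
    print(edit_distance("recieve", "receive"))  # 1 (transposition of 'i' and 'e')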
Similarity Keys

Similarity key techniques are based on some method of transforming a word into a similarity key. This key should reflect the characteristics of the word, with the most important characteristics placed first in the key. All words in the dictionary are transformed and sorted according to their similarity keys. A misspelled word is then transformed in the same way, and similar keys can be searched for efficiently in the sorted list. This technique gives a linear ordering of all words; words that are close to each other in this ordering are considered similar. With a good transformation algorithm this method can handle both keyboard errors and phonetic errors. The problem is to design that algorithm.

Rule-based Techniques

Rule-based methods work by having a set of rules that capture common spelling and typographic errors and applying these rules to the misspelled word. Intuitively, these rules are the "inverses" of common errors. Each correct word generated by this process is taken as a correction suggestion. The rules also have probabilities, making it possible to rank the suggestions by accumulating the probabilities of the applied rules. Edit distance can be viewed as a special case of a rule-based method with limitations on the possible rules.

N-gram-Based Techniques

N-grams can be used in two ways, either without a dictionary or together with a dictionary. Used without a dictionary, n-grams are employed to find the position in the misspelled word where the error occurs. If there is a unique way to change the misspelled word so that it contains only valid n-grams, this is taken as the correction. The performance of this method is limited; its main virtue is that it is simple and does not require any dictionary. Together with a dictionary, n-grams are used to define the distance between words, but the words are always checked against the dictionary. This can be done in several ways, for example by checking how many n-grams the misspelled word and a dictionary word have in common, weighted by the length of the words.

Probabilistic Techniques

These are, simply put, based on statistical features of the language. Two common methods are transition probabilities and confusion probabilities. Transition probabilities are similar to n-grams: they give the probability that a given letter or sequence of letters is followed by another given letter. Transition probabilities are not very useful when we have access to a dictionary or index. Confusion probabilities give the probability that a given letter has been substituted for another letter. In Tong & Evans (1996) a word correction system based on a statistical language model and applied to OCR is described. Given a sentence to be corrected, the system decomposes each string in the sentence into letter n-grams and retrieves word candidates from the lexicon by comparing string n-grams with lexicon-entry n-grams. The retrieved candidates are ranked by the conditional probability of matches with the string, given character confusion probabilities. Finally, a word-bigram model and a search algorithm are used to determine the best-scoring word sequence for the sentence. They claim that the system can correct nonword errors as well as real-word errors and achieves a 60.2 % error reduction rate for real OCR text.

Neural Networks

Neural networks are also an interesting and promising technique, but they seem to need to mature a bit more before they can be used generally. The current methods are based on back-propagation networks, using one output node for each word in the dictionary and an input node for every possible n-gram in every position of the word, where n usually is one or two. Normally only one of the outputs should be active, indicating which dictionary word the network suggests as a correction. This method works for small (< 1,000 words) dictionaries, but it does not scale well. The time requirements are too big on traditional hardware, especially in the learning phase.
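To tie the detection and correction steps together, here is a small Python sketch of isolated-word correction against a lexicon, in the spirit of the two-step process described above: candidates are taken from the lexicon and ranked with the digram similarity coefficient from Section 2.4.1, with an invented word-frequency list used to break ties. The lexicon, the frequencies and the weighting are made up for the example and do not describe the SiteSeeker or Stava implementations.

    WORD_FREQUENCIES = {        # invented lexicon with invented corpus frequencies
        "skatt": 900, "skatta": 120, "skatter": 400, "katt": 800, "skott": 150,
    }

    def digrams(word: str) -> set[str]:
        padded = f" {word} "                  # pad with blanks (see Section 2.4.1)
        return {padded[i:i + 2] for i in range(len(padded) - 1)}

    def suggest(misspelling: str, max_suggestions: int = 3) -> list[str]:
        """Generate and rank correction candidates for an isolated word."""
        if misspelling in WORD_FREQUENCIES:   # detection: the word is in the lexicon
            return []
        query = digrams(misspelling)
        scored = []
        for word, freq in WORD_FREQUENCIES.items():
            candidate = digrams(word)
            overlap = 2 * len(query & candidate) / (len(query) + len(candidate))
            scored.append((overlap, freq, word))  # similarity first, frequency as tie-breaker
        scored.sort(reverse=True)
        return [word for _, _, word in scored[:max_suggestions]]

    print(suggest("skkatt"))   # e.g. ['skatt', 'skatta', 'katt']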
3 Experiment

From a sub-corpus of over 70,000 news articles fetched from the KTH News Corpus (Hassel 2001), 100 texts were randomly selected. Each text was manually annotated with a question and answer pair, as described in Carlberger et al. (2001), and used in our experiment as well. The search engine used these XML-like texts in order to find answers to the questions asked by the participants. During the actual experiment, though, the participants were given natural language texts. An example of the annotated type of text is shown in Figure 2. In this chapter, the experimental setting used to evaluate the performance of our search engine is presented. Section 3.1 describes our test participants, Section 3.2 shows how the question and answer forms were built, and Section 3.3 describes the set of rules that the test participants had to follow during the experiment.

Question
<top>
<num> Number: 35
<desc> Description: (Natural language question)
Vem är koncernchef på Telenor? (Who is CEO at Telenor?)
</top>

Answer
<top>
<num> Number: 35
<answer> Answer: Tormod Hermansen
<file> File: KTH NewsCorpus/Aftonbladet/Ekonomi/0108238621340_EKO__00.html
<person> Person: Tormod Hermansen
<location> Location: Norden
<organization> Organization: Telenor
<time> Time: onsdagen
<keywords> Keywords: Telenor; koncernchef; teleföretag; mobilmarknaden; uppköp
</top>

Figure 2. Questioning and answering annotation scheme.

3.1 The test participants

To simulate real users querying a search engine on the Web, we asked the students taking a course in language engineering to carry out the experiment. They were a total of 20 students, predominantly male, with ages estimated to range between 19 and 50 years. Around 95 percent of them have Swedish as their first language or speak Swedish fluently, while the rest do not have Swedish as their first language; there were French, English and South African students among these 5 percent. The reason no more specific figures can be given is that demographic issues were not considered important or relevant in this experiment. The participants are familiar with search engines and have good knowledge of, among other things, natural language processing. Their assignment was, simply stated, to query the search engine in order to find the answers to a set of questions. We split them into ten groups, each group comprising two participants.

3.2 Building the questions and answers forms

Since our test panel was divided into ten groups of two, the original hundred questions mentioned at the beginning of this chapter were also divided into ten sets of ten questions: group one would handle the first ten questions, group two the second ten questions, and so on. These questions were made available to them only when the experiment began. We then built two types of answering forms for all 100 questions: one for answers from searches with spellchecking and another for answers from searches without spellchecking. The form for searches with spellchecking contained six fields.
The first is a name field where the test participants would write their names; the second is where they would write the query words they used; the third is for any query words suggested by SiteSeeker that they used; the fourth is for the number of hits returned by the search engine; and finally there are two fields where they would write down the answer they found and the link that gave that answer. The non-spellchecking form does not have the third field mentioned above. These answering forms, the location of the search engine and the relevant instructions to follow (see Section 3.3 for details) were made available to the test participants via a web page.

3.3 The experiment rules

The experiment had to be conducted in two hours. It was decided that the experiment would take two hours because this was considered enough for the participants to answer all the questions; moreover, two hours was the duration of a normal language engineering lecture. During the first hour the test participants in each group had to query the search engine (with the spellchecker) and find the answers to their ten questions. One of the two in each group read the questions, while his or her partner did the searching and filled in the answers they found in the form. The participant who was querying the search engine could not see how the words were spelled. This, in our estimation, simulates an ordinary web search where a user, wanting to find information on a news event or a subject, just types in the terms that come to mind. The test participants had to follow these rules while querying:

1. They were not allowed to do more than five trials on each question to find the answer.
2. They were not allowed to use queries longer than five words.
3. No background knowledge was allowed; that is, only the natural language words as written in the questionnaire were allowed. So even if a test participant was very familiar with a certain event and probably knew the answer to a certain question, he or she was only allowed to use the words that were read to him or her.
4. Boolean expressions and phrase searches were allowed.
5. Each question could be read no more than five times.
6. The full URL of each found answer to a question had to be written down for use during the evaluation process.

After this first part was finished, we rotated the question forms to avoid training effects from running the same set of questions; that is, we did not want the test participants to query the search engine with a set of questions whose answers they already knew. During the second hour we therefore switched the question forms: group one's question forms were given to group two and vice versa, group three's question forms to group four and vice versa, and so on. This time each group had to query the search engine without the spellchecker, using the non-spellchecking answering form described in Section 3.2. The roles of the participants in each group were also inverted in the second part of the experiment: those who, during the first hour, read the questions to their partner now queried the search engine, and vice versa. When all groups had finished the second part, we made sure, by browsing through them, that they had filled in the two answering forms as correctly and legibly as possible. These forms were then used to evaluate our search engine based on precision and recall.

4 Evaluation

In this chapter we present the evaluation of the experiment and our findings.
Section 4.1 details the evaluation metrics for the performance of our query spellchecker. Section 4.2 introduces the criteria used to judge the relevance of the documents returned by the search engine for a query. Our final results are presented in Section 4.3.

4.1 Evaluation Metrics

Information retrieval systems are usually compared based on the "quality" of the retrieved document sets. This "quality" is traditionally quantified using two metrics, recall and precision (SearchEngineWatch 2003). Each document in the collection at hand is judged to be either relevant or non-relevant for a query. Precision is calculated as the fraction of relevant documents among the documents retrieved, and recall measures the coverage of the system as the fraction of all relevant documents in the collection that the system retrieved. See Section 1.1.4 for a more elaborate definition. Since this method is well known and well used (Carlberger et al. 2001; Dalianis 2002), no evaluation method other than the present one was considered. We believe that in this experiment the method used will in fact measure what we want to measure.

To measure recall over a collection we need to mark every document in the collection as either relevant or non-relevant for each evaluation query. This, of course, is a daunting task for any large document collection like ours, and is essentially impossible for the web, which contains billions of documents. Due to the difficulties of calculating recall, relative recall will be used instead. Relative recall can be defined as the fraction of known relevant items retrieved by the system. The following example helps to clarify this term:

1. System A retrieves four items: D1, D2, D3 and D4.
2. System B retrieves three items: D2, D3 and D5.
3. The known unique relevant items D1, D2, D3, D4 and D5 are assumed to approximate the total number of relevant items (5 in this case).
4. Relative recall for A: 4/5 = 80 %.
5. Relative recall for B: 3/5 = 60 %.

In this report, the results found with each search method are pooled and assumed to approximate the total number of relevant documents, and the relative recall of each search method is calculated from this.
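The pooling step can be made concrete with a short Python sketch; the document identifiers follow the example above and the function name is illustrative only.

    def relative_recall(retrieved: set[str], pooled_relevant: set[str]) -> float:
        """Fraction of the pooled (known) relevant items that a system retrieved."""
        return len(retrieved & pooled_relevant) / len(pooled_relevant)

    system_a = {"D1", "D2", "D3", "D4"}
    system_b = {"D2", "D3", "D5"}
    pool = system_a | system_b          # pooled relevant items approximate the true set

    print(relative_recall(system_a, pool))  # 0.8 -> 80 %
    print(relative_recall(system_b, pool))  # 0.6 -> 60 %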
4.2 Evaluating retrieved documents and their relevance judgements

What is a "relevant" document, and who determines the relevance of retrieved documents? In reality, the only person who can determine whether a document is relevant to an information need is the person who has the information need, in this case the test participants. The answering forms described in Section 3.2 that the students handed in were carefully scrutinised, since these determine how much precision and recall have increased. In order to verify the correctness of the answers given in the answering forms, we used, for each question, the actual query words that each group wrote down and queried the search engine again. That way we could, besides the document that contained the answer, also check the other nine of the top ten documents (or all documents if fewer than ten) returned by SiteSeeker. An evaluation scheme that presents the result pages for a query one page at a time was set up. It allows us to judge each page as "good", "bad" or "ignore" for the query. The following criteria were used to judge documents. A good page is one that contains an answer to the test question. If a page is not relevant, or only contains links to the useful information, even if the links are to other documents on the same site, the page is judged "bad". If a page could not be viewed for any reason (e.g., a server error or a broken link), the result is "ignored". This scale of judgement was found to be sufficient, because in this experiment it is a matter of a returned page yielding a right or a wrong answer.

4.3 Evaluation Results

In this section we report the results of the experimental evaluation using the methodology described in the previous section. An Excel file was used in which we put, for each test question, the top ten retrieved documents both when spellchecking was used and when it was not. Making use of the relevance judgements described in Section 4.2, we awarded each "good" page or link 1 point and each "bad" one 0 points. Using these figures and the metrics described in Section 4.1, we calculated precision and relative recall in both cases. We found increases in precision and relative recall of 4 percent and 11.5 percent respectively. These results have also been presented in Sarr (2003) at the NODALIDA conference in May 2003 in Reykjavik, Iceland. Table 1 details our findings.

Table 1. No spellchecking versus spellchecking (precision and recall at the first 10 hits).

                                 No spellchecking    Spellchecking
Number of texts                  79,000              79,000
Number of questions              100                 100
Average precision                0.485               0.505
Increase in precision                                4 %
Average relative recall          0.52                0.58
Increase in relative recall                          11.5 %

4.3.1 Analysis of results

The answering form mentioned in Section 3.2 had a field that the test participants were to fill in if they used query words suggested by SiteSeeker. According to these fields, and the verifications that were done, the number of misspellings that SiteSeeker gave suggestions for was exactly 14. Looking at the nature of these misspellings, we noticed that six were of the typographic type, three of the cognitive type and four of the phonetic type, while one could not be classified. In 13 of these cases the suggestions were good, that is, they led to the correct answer. In 12 of these 13 cases, more than one suggestion was given. Interestingly, only twice was the correct answer not the first suggestion, and in both of these cases the correct answer was the second suggestion. This tells us that our system is good at ranking corrections: it gave the most probable correction as the first alternative in 10 cases out of 12. Only in three cases did a search with the spellchecker fail, returning either no hit pages or only irrelevant hits. In two of these three cases no documents were returned at all; in one case four documents were returned, but none of them was relevant to the query words.

4.3.2 Discussion

Our spellchecker has done a good job of detecting, correcting and suggesting alternatives to the spelling errors made by the test participants. The figures shown in Section 4.3.1 have partly or wholly contributed to the increase in precision and relative recall. The fact that only 14 out of 100 queries were erroneous can be interpreted to mean that the test panel in general had no difficulties spelling the words that were read to them, or that the words in the questionnaires were simply not "difficult" enough to be misspelled when entered in the search field. On the other hand, one can wonder what is considered a "difficult" word; whether a word is considered obscure and difficult to spell is a subjective matter. At the end of the experiment phase we asked the test participants to write down their thoughts on using the search engine with and without the spellchecker.
The comment that kept coming up was: "We did not have a lot of use for the spellchecker; on some occasions we got suggestions for words that we had almost spelled right." By checking the words that were put to the search engine just after the experiment, we could indeed confirm the comment made by some of the test participants. The words they most often had difficulty with were those describing either the name of a person, like "Whitney Houston", or the name of a place, like "Shebaa in Syria". Since there were maybe ten such questions out of the hundred on the questionnaires, this can partly explain why the test participants (at least those who were familiar with the names mentioned above) felt that the spellchecker was not of much help. An interesting observation was that among the test participants there were five exchange students whose first language is not Swedish; according to them, they had good assistance from the spellchecker.

As a final note we can say that the results obtained in this experiment are very promising but should not be exaggerated. The reasons are twofold:

1. The corpus used in our experiment was static, but in the world of the Internet corpora are seldom static. On the other hand, a comparison of the two experiments would not be fair if the corpus were not static.
2. Only 100 questions were asked of our search engine. On a usual day the search engines of the Internet are asked hundreds of thousands, if not millions, of queries.

5 Conclusions and future improvements

We found in our experiment that the spellchecker connected to our search engine SiteSeeker does in fact improve both precision and relative recall, by 4 and 11.5 percent respectively. Though the improvement in precision and relative recall is not as dramatic as we had hoped, it nevertheless tells us that it is wise and helpful to use a spellchecker in a search engine. But would we get the same results if we set up the same experiment with different types of questionnaires and with a different test panel? As an answer to these questions, we propose the following future steps in order to discover the full potential of our search engine:

1. Run the same experiment but with fewer than five trials per question, and see whether a high score can still be maintained.
2. Run the same experiment with different test panels and compare the results, to see how much precision and relative recall would vary. We have noticed that our test panel made very few spelling errors; it would be interesting to run the same experiment with an "error-prone" test panel.
3. Introduce more complex words, such as technical terms and obscure names, in the questionnaires to query for. The goal would also be to see how the spellchecker handles misspellings not discussed in this report, such as run-on words (typing "laddaned" instead of "ladda ned", download) and split words (typing "bok hylla" instead of "bokhylla", bookshelf), and how these would influence the search results. These types of spelling errors have become more common in Swedish, probably due to English influence.

References

(All WWW references last visited 15 November 2003)

(Carlberger et al. 2001) Carlberger, J., H. Dalianis, M. Hassel and O. Knutsson. Improving Precision in Information Retrieval for Swedish using Stemming. In Proceedings of NODALIDA '01 – 13th Nordic Conference on Computational Linguistics, May 21-22, 2001, Uppsala, Sweden, 2001. ftp://ftp.nada.kth.se/IPLab/TechReports/IPLab-194.pdf

(Carlberger and Kann 2000) Carlberger, J. and V. Kann. Some applications of a statistical tagger for Swedish.
In Proceedings of the 4th Conference of the International Quantitative Linguistics Association (Qualico-2000), pp. 51-52, August 2000. http://www.nada.kth.se/theory/projects/granska/rapporter/taggerapplications20000215.pdf

(Carlberger and Kann 1999) Carlberger, J. and V. Kann. Implementing an efficient part-of-speech tagger. Software Practice and Experience, 29, pp. 815-832, 1999. ftp://ftp.nada.kth.se/pub/documents/Theory/Viggo-Kann/tagger.pdf

(Dalianis 2002) Dalianis, H. Evaluating a Spelling Support in a Search Engine. In Natural Language Processing and Information Systems, 6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002, (Eds.) B. Andersson, M. Bergholtz, P. Johannesson, Stockholm, Sweden, June 27-28, 2002. Lecture Notes in Computer Science, Vol. 2553, pp. 183-190. Springer Verlag, 2002. http://www.nada.kth.se/~hercules/papers/SpellingIR.pdf

(Domeij et al. 1994) Domeij, R., J. Hollman and V. Kann. Detection of spelling errors in Swedish not using a word list en clair. Journal of Quantitative Linguistics 1:195-201, 1994.

(Google 2002) Google press release written by Dean, J. (engineer at Google): Google's future plans. December 2002. http://www.searchengineshowdown.com/newsarchive/000611.shtml

(Hassel 2001) Hassel, M. Internet as Corpus – Automatic Construction of a Swedish News Corpus. In Proceedings of NODALIDA '01 – 13th Nordic Conference on Computational Linguistics, May 21-22, 2001, Uppsala, Sweden. http://www.nada.kth.se/~xmartin/papers/NewsCorpus_NODALIDA01.pdf

(Hodge & Austin 2001a) Hodge, V. J. and J. Austin. An Evaluation of Standard Spell Checking Algorithms and a Binary Neural Approach. Accepted for IEEE Transactions on Knowledge and Data Engineering.

(Hodge & Austin 2001b) Hodge, V. J. and J. Austin. An Evaluation of Phonetic Spell Checkers. Technical Report YCS 338 (2001), Department of Computer Science, University of York, 2001.

(Kann et al. 1998) Kann, V., R. Domeij, J. Hollman and M. Tillenius. Implementation aspects and applications of a spelling correction algorithm. TRITA-NA-9813, NADA, Royal Institute of Technology (KTH), 1998. ftp://ftp.nada.kth.se/pub/documents/Theory/Viggo-Kann/TRITA-NA-9813.pdf

(Knutsson 2001) Knutsson, O. Automatisk språkgranskning av svensk text (in Swedish; Automatic Proofreading of Swedish Text). Licentiate thesis, IPLab, NADA, Royal Institute of Technology (KTH), Stockholm, 2001.

(Kukich 1992) Kukich, K. Techniques for automatically correcting words in text. ACM Computing Surveys, Vol. 24, No. 4 (Dec. 1992), pp. 377-439, 1992.

(Sarr 2003) Sarr, M. Improving precision and recall using a spellchecker in a search engine. In Proceedings of NODALIDA 2003, the 14th Nordic Conference of Computational Linguistics, Reykjavik, May 30-31, 2003. http://www.nada.kth.se/~hercules/papers/SarrNodalida2003.pdf

(SearchEngineWatch 2003) Sullivan, D. How search engines rank web pages. SearchEngineWatch.com, July 2003. http://www.searchenginewatch.com/webmasters/article.php/2167961

(Stolpe 2003) Stolpe, D. Högre kvalitet med automatisk textbehandling? En utvärdering av SUNETs webbkatalog (in Swedish; Higher Quality by Automatic Text Processing? An Evaluation of the SUNET Web Catalogue). Forthcoming Master's thesis, NADA.

(Tong & Evans 1996) Tong, X. and D. A. Evans. A Statistical Approach to Automatic OCR Error Correction in Context. In Proceedings of the Fourth Workshop on Very Large Corpora, pp. 88-100, Copenhagen, Denmark, 1996.
(Webopedia 2002) Webopedia web definition: Search Engine (October 2002). http://www.webopedia.com/TERM/s/search_engine.html

Appendices

Appendix A Search engines: how they work

According to Webopedia (2002) there are basically three types of search engines: those that are powered by crawlers, or spiders; those that are powered by human submissions; and those that are a combination of the two.

Crawler-based engines, such as Google, send crawlers, or spiders, out into cyberspace. These crawlers visit a Web site, read the information on the actual site, read the site's meta tags¹ and also follow the links that the site connects to. The crawler returns all that information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for information that has changed; how often this happens is determined by the administrators of the search engine.

Human-powered search engines, such as the Open Directory, rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index, and a search looks for matches only in the descriptions submitted.

Hybrid search engines combine the two. In the web's early days, a search engine either presented crawler-based results or human-powered listings. Today, it is extremely common for both types of results to be presented, and a hybrid search engine will usually favor one type of listing over the other. For example, MSN Search² is more likely to present human-powered listings from LookSmart³. However, it does also present crawler-based results (as provided by Inktomi⁴), especially for more obscure queries.

In all these cases, when a user queries a search engine to locate information, he or she is actually searching through the index that the search engine has created; he or she is not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, sometimes returns results that are in fact dead links. Since the search results are based on the index, if the index has not been updated since a Web page became invalid, the search engine treats the page as still being an active link even though it no longer is, and it will remain that way until the index is updated.

So why will the same search on different search engines produce different results? Part of the answer is that not all indices are exactly the same; it depends on what the spiders find or what the humans submit. But more importantly, not every search engine uses the same algorithm to search through the indices. The algorithm is what the search engine uses to determine the relevance of the information in the index to what the user is searching for.

1 A special HTML tag that provides information about a Web page.
2 http://search.msn.com
3 http://www.looksmart.com
4 http://www.inktomi.com
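As a concrete, much simplified illustration of the crawl-and-index idea described above, the following Python sketch fetches a single page, collects its meta tags and outgoing links, and adds its words to a small inverted index. The start URL is only a placeholder, and a real crawler would of course also respect robots.txt, revisit pages on a schedule, and handle errors, none of which is shown here.

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        """Collects text, meta tags and links from one HTML page."""
        def __init__(self):
            super().__init__()
            self.text, self.meta, self.links = [], {}, []
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content", "")
            elif tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        def handle_data(self, data):
            self.text.append(data)

    def crawl(url, index):
        """Fetch one page and add its words to a simple inverted index."""
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = PageParser()
        parser.feed(html)
        for word in " ".join(parser.text).lower().split():
            index.setdefault(word, set()).add(url)
        # Return absolute links so the caller can decide which pages to visit next.
        return [urljoin(url, link) for link in parser.links]

    if __name__ == "__main__":
        index = {}
        frontier = crawl("http://example.com/", index)   # placeholder start URL
        print(len(index), "indexed words;", len(frontier), "links found")

When a user later searches, it is this index (the dictionary built above, in a much more sophisticated form) that is queried, not the Web itself, which is exactly why dead links can still show up in the results.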
Appendix B Ranking web pages

When a user does a search with a crawler-based search engine, the engine will, nearly instantly, sort through the millions of pages it knows about and present the ones that match the user's topic. The matches will even be ranked, so that the most relevant ones come first. Of course, search engines do not always find the specific information the user wants; non-relevant pages make it through, and sometimes it may take a little more digging to find the information needed. But, by and large, search engines do an amazing job.

So, how do crawler-based search engines go about determining relevancy when confronted with hundreds of millions of web pages to sort through? According to SearchEngineWatch (2003) they follow an algorithm. Exactly how a particular search engine's algorithm works is a closely kept trade secret. However, all major search engines follow the general rule of location and frequency of keywords on a web page, better known as the location/frequency method.

Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant to the topic than others. Search engines will also check whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text; they assume that any page relevant to the topic will mention those words right from the beginning.

Frequency is the other major factor in how search engines determine relevancy. A search engine will analyse how often keywords appear in relation to other words in a web page, and how they are distributed across all indexed documents. Pages with a higher keyword frequency are often judged more relevant than other web pages.
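As a rough illustration of the location/frequency idea (the real ranking formulas are, as noted, trade secrets), the following sketch gives a document extra weight when a query term occurs in its title or near the top of its body, and otherwise scores it by relative frequency. The field names and weights are invented for the example and do not come from any particular search engine.

    def location_frequency_score(query_terms, doc, top_n=50):
        """Toy location/frequency score; the weights are invented, not any engine's real formula."""
        title = doc["title"].lower().split()
        body = doc["body"].lower().split()
        score = 0.0
        for term in (t.lower() for t in query_terms):
            if term in title:
                score += 3.0                            # location: term in the HTML title tag
            if term in body[:top_n]:
                score += 2.0                            # location: term near the top of the page
            if body:
                score += body.count(term) / len(body)   # frequency: share of words that match
        return score

    if __name__ == "__main__":
        docs = [
            {"url": "page-a", "title": "Ladda ned program", "body": "ladda ned gratis program"},
            {"url": "page-b", "title": "Bokhyllor", "body": "en katalog du kan ladda ned"},
        ]
        query = ["ladda", "ned"]
        for doc in sorted(docs, key=lambda d: -location_frequency_score(query, d)):
            print(doc["url"], round(location_frequency_score(query, doc), 3))

Changing the weights or the size of the "top of the page" window changes the ranking, which is one simple way to see why the same search can be ranked differently by different engines.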
All the major search engines follow the location/frequency method to some degree, much as cooks may follow a standard recipe. But cooks like to add their own secret ingredients, and in the same way search engines add spice to the location/frequency method. Nobody does it exactly the same way, which is one reason why the same search on different search engines produces different results.

To begin with, some search engines index more web pages than others, and some index web pages more often than others. The result is that no search engine has exactly the same collection of web pages to search through, which naturally produces differences when comparing their results.

Search engines may also penalise pages, or exclude them from the index, if they detect search engine "spamming". An example is when a word is repeated hundreds of times on a page to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.

Crawler-based search engines have plenty of experience by now with webmasters who constantly rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters may even go to great lengths to "reverse engineer" the location/frequency systems used by a particular search engine. Because of this, all major search engines now also make use of "off the page" ranking criteria (SearchEngineWatch 2003).

Off the page factors are those that a webmaster cannot easily influence. Chief among these is link analysis. By analysing how pages link to each other, a search engine can both determine what a page is about and whether that page is deemed to be "important" and thus deserving of a ranking boost. In addition, sophisticated techniques are used to screen out attempts by webmasters to build "artificial" links designed to boost their rankings.

Another off the page factor is clickthrough measurement. In short, this means that a search engine may watch which results someone selects for a particular search, then eventually drop high-ranking pages that are not attracting clicks while promoting lower-ranking pages that do pull in visitors. As with link analysis, systems are used to compensate for artificial clicks generated by eager webmasters.
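To make the link-analysis idea more concrete, here is a deliberately simplified sketch in the spirit of PageRank: each page's score is repeatedly shared out over the pages it links to until the scores stabilise. The tiny link graph and the damping factor are invented for the illustration; real engines combine such link scores with many other signals and keep the details secret.

    def link_scores(graph, damping=0.85, iterations=20):
        """Simplified PageRank-style scoring over a dict {page: [pages it links to]}.
        Dangling pages simply leak score here, which a real implementation would handle."""
        pages = list(graph)
        score = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, links in graph.items():
                for target in links:
                    if target in new:
                        # Each page shares a damped fraction of its score with its link targets.
                        new[target] += damping * score[page] / len(links)
            score = new
        return score

    if __name__ == "__main__":
        graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}   # invented link graph
        for page, s in sorted(link_scores(graph).items(), key=lambda x: -x[1]):
            print(page, round(s, 3))

Here page "c" ends up with the highest score because the most pages link to it, which is the intuition behind treating well-linked pages as "important" and giving them a ranking boost.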