Web Systems and Algorithms: Searching and Indexing
Chris Brooks
Department of Computer Science, University of San Francisco

Searching
Once we've collected a large body of Web pages (or other documents), we would probably like to find information within this collection. There are a few criteria we would like to satisfy:
- Expressivity: can we ask complex queries?
- Scalability: can we handle large document collections?
- Relevance: do the documents retrieved satisfy our need? Are they ranked?

Information Needs
A user typically has an information need. The job of an IR system (such as a search engine) is to translate that need into a query and then find documents that satisfy it.
We can evaluate a search engine by its effectiveness at satisfying this information need.

Queries
What sorts of queries might we want to ask?
- Boolean queries: "cat AND dog", "(cat OR dog) AND fish", "cat AND NOT mouse"
- Proximity queries: "cat NEAR dog", "cat, dog within the same sentence"
- Phrases: "the lazy dog", "the cat in the hat"
- Full text: the user just enters keywords: "cat dog fish"
- Synonymy: documents with words like "cat"
- Similarity: documents similar to a given document

Boolean queries
Let's start simple, with Boolean queries: the user provides one or more keywords, and we must find documents containing those keywords.
Given this query language, how can we structure our document collection? We want to avoid a linear search through all documents.

Inverted Index
The standard way to solve this is to construct an inverted index: a hashtable mapping tokens to the documents they appear in. Construction is very easy:

    index = {}
    for document in collection:
        for word in set(document):        # each distinct word in the document
            if word in index:
                index[word].append(document)
            else:
                index[word] = [document]

(Note: a document is added only once per word, even if the word occurs in it multiple times.)

Inverted Index
Retrieval is also easy.
- For single-word queries, return the list of documents the word maps to.
- For multi-word queries, compute the intersection (for AND), the union (for OR), or the set difference (for NOT) of the terms' lists.

Intersection
How do we efficiently compute the intersection of two postings lists?

Intersection

    # assume both postings lists are sorted by document ID
    def intersect(l1, l2):
        i, j, result = 0, 0, []
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                result.append(l1[i])
                i += 1
                j += 1
            elif l1[i] < l2[j]:
                i += 1
            else:
                j += 1
        return result

Union and NOT
Union and NOT are handled by very similar merges.
For more complex queries, we want to take care to merge lists in the right order: start with the least frequent term, then move in order of increasing frequency.
Nested queries can be much more challenging.
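The union (OR) and set-difference (AND NOT) merges follow the same pattern as the intersection above; a minimal sketch, assuming sorted postings lists of document IDs:

    def union(l1, l2):
        # OR: every document ID that appears in either list, kept in sorted order
        i, j, result = 0, 0, []
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                result.append(l1[i])
                i += 1
                j += 1
            elif l1[i] < l2[j]:
                result.append(l1[i])
                i += 1
            else:
                result.append(l2[j])
                j += 1
        result.extend(l1[i:])
        result.extend(l2[j:])
        return result

    def and_not(l1, l2):
        # AND NOT: document IDs that appear in l1 but not in l2
        i, j, result = 0, 0, []
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                i += 1
                j += 1
            elif l1[i] < l2[j]:
                result.append(l1[i])
                i += 1
            else:
                j += 1
        result.extend(l1[i:])
        return result

As the slide notes, for a query with several AND-ed terms it pays to intersect the shortest postings lists first, since the intermediate result can only shrink.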
Processing Documents
We actually skipped a step: we need to go from a document, which is a string, to a list of tokens, which are the keys in the inverted index. We need to decide:
- How do we separate tokens?
- Which tokens do we retain?
- Can tokens be combined or grouped?

Tokenizing
The simplest thing to do is to split on whitespace. But:
- What about punctuation or non-alphanumeric characters? Throw them away? What about "that's", "aren't", or "C++"?
- What about "San Francisco" or "Tiger Woods"?
- What about "anti-social" vs. "antisocial"? Dates? Phone numbers?
- What about languages like Chinese or Japanese, which do not delimit words with whitespace?

Tokenizing
Choices include:
- A fixed set of rules, written as regular expressions
- Rules learned from labeled data
- A segmentation algorithm, such as Viterbi
The choice depends on anticipated information needs, experience, and efficiency concerns.

Removing non-useful tokens
We would also like to avoid indexing tokens that do not help us find documents.
- Markup. We can probably strip this with an HTML parser.
- Stopwords. These are words that carry little useful semantic content: a, an, the, are, among, about, he, she, they, etc. We can either use a fixed list or identify them through frequency analysis.
Removing stopwords may cause problems with phrase search: "As We May Think", "University of San Francisco".

Stemming
We may also want to remove prefixes and suffixes, for example when a user searches for "car" and the document contains the token "cars". This process is known as stemming.
Most stemmers use a fixed set of rules for removing suffixes. This can introduce errors, due to inconsistencies in English: "university" and "universal" both stem to "univers".
In Web search, stemming can also cause problems with acronyms and jargon: "gated", "SOCKS", "ides".

Conflation
Stemming is a special case of what is called normalization or conflation: terms with the same meaning are grouped into an equivalence class, e.g. car, auto, automobile.
If we have a thesaurus such as WordNet that contains synonyms, we can handle this in two different ways:
- During querying, look up synonyms and add them to the query as a disjunction: "car" becomes "car" OR "auto" OR "automobile".
- During indexing, store a document under the entry for each of its synonyms.
The tradeoff is space vs. time.

Normalization
Other issues include:
- Accents and diacritical marks. Less important in English, more so in other languages.
- Case. Do we want to convert everything to lower case? What about proper names (Bush, Windows)? Acronyms?
- Misspellings, dates, regionalisms (color vs. colour).
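Putting the last few slides together, here is a minimal sketch of a tokenizing and normalization pipeline. The regular expression, stopword list, and suffix rules are toy illustrations, not a real tokenizer or stemmer (a production system would use something like the Porter stemmer):

    import re

    STOPWORDS = {"a", "an", "the", "of", "and", "or", "are", "he", "she", "they"}
    SUFFIXES = ("ing", "ed", "es", "s")   # toy suffix-stripping rules

    def tokenize_and_normalize(text):
        # split on runs of non-alphanumeric characters and fold case
        tokens = [t.lower() for t in re.split(r"\W+", text) if t]
        # drop stopwords
        tokens = [t for t in tokens if t not in STOPWORDS]
        # crude stemming: strip the first suffix that matches
        stemmed = []
        for t in tokens:
            for suffix in SUFFIXES:
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    t = t[: -len(suffix)]
                    break
            stemmed.append(t)
        return stemmed

    # tokenize_and_normalize("The universities of San Francisco")
    #   -> ['universiti', 'san', 'francisco']

The resulting tokens are what would actually be stored as keys in the inverted index; note that even this toy stemmer produces non-word stems, just as the "univers" example above does.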
Measuring performance
There are two basic metrics for measuring retrieval performance:
- Precision: what fraction of the documents returned actually meet the user's information need?
- Recall: what fraction of the documents that would meet the user's information need are returned?
Often we trade one against the other.

Ranked Retrieval
We probably don't want to just give the user an unsorted list of results; we would like to present the best results first. What does "best" mean?
- Best match to the query
- Most authoritative

Ranked Retrieval
If we want to score documents according to how well they match the query, we can use TF-IDF weights and cosine similarity.
- TF-IDF assigns each word in a document a weight according to how frequently it occurs in that document and how rare it is across the collection.
- Cosine similarity measures the angle between two term-weight vectors; this produces a score between 0 and 1.
This is nice, but perhaps not effective for short queries.
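A minimal sketch of TF-IDF weighting and cosine similarity, assuming documents have already been tokenized into lists of terms; the log(N / df) weighting used here is one common variant among several:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: list of token lists; returns one {term: weight} dict per document
        N = len(docs)
        df = Counter()                       # document frequency of each term
        for doc in docs:
            df.update(set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)                # term frequency within this document
            vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
        return vectors

    def cosine(v1, v2):
        # cosine of the angle between two sparse vectors; in [0, 1] for tf-idf weights
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        norm1 = math.sqrt(sum(w * w for w in v1.values()))
        norm2 = math.sqrt(sum(w * w for w in v2.values()))
        if norm1 == 0.0 or norm2 == 0.0:
            return 0.0
        return dot / (norm1 * norm2)

To rank results, the query is treated as a very short document, weighted with the collection's IDF values, and the candidate documents are sorted by their cosine score against the query vector.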
Ranked Retrieval
An alternative approach is to use a document's place in the Web graph to determine its rank.
Intuition: each document has prestige, which is a function of the prestige of the documents that link to it. This is the basis of PageRank.
One advantage is that a document's owner cannot "game" the system by stuffing extra words into the document.

PageRank
The idea behind PageRank is this: what is the probability of winding up on a web page x if one is surfing "at random"?
Let's assume the web is strongly connected and that every page has out-degree >= 1.
To have clicked once and wound up at x, a surfer must have:
- been at a page y that links to x, and
- chosen the link from y to x out of the E(y) outgoing links from y. Assuming a uniform distribution over those links, the probability of this choice is N_y = 1/E(y).

PageRank
Since several different pages y may link to x,
    p1[x] = sum over all pages y linking to x of p0[y] * N_y
(the probability of being at y and then choosing to go to x).
We can follow this computation out to determine the stationary distribution of p. This is a page's prestige, or PageRank. We'll go through this more carefully next week.

PageRank
What's interesting about PageRank is that a document's "value" is determined by the value of its neighbors.
In practice, we would first find pages that match a query, then use PageRank to order these results, perhaps in combination with document-level rankings.
A criticism of PageRank is that there's not necessarily a connection between quality and popularity.

Index compression
An inverted index can take up a large amount of space, particularly if positional information about where a term occurred in a document is also kept.
For performance reasons, it is preferable to keep the index small enough to be held in memory. We can do this through index compression.
If our decompression algorithm is fast enough, the cost of decompression will be less than the cost of retrieving an entry from disk.

Index compression
A simple thing we can do is to store the gaps (offsets) between document IDs rather than the IDs themselves. For example, if we have
    foo: 13, 100, 150
we store
    foo: 13, 87, 50
For large document collections, the gaps will usually be smaller than the raw IDs and can be encoded in fewer bits.
We can also construct a hierarchy of indices, using a tree-based approach: each identifier maps into a second-level index.
Compression makes updating more challenging.

Relevance Feedback
Retrieval can be very challenging in the Web domain: queries are very short, and there are lots of potential results.
If we can get the user to give us some help in understanding their information need, we can improve performance. This is called relevance feedback.

Relevance Feedback
The basic idea is this:
- The user submits a query q.
- We return some documents.
- The user labels each returned document as "yes" (relevant) or "no" (not relevant).
- We then use features of the good documents to do a second search.

Rocchio's method
A simple way to do this is Rocchio's method:
- Compute each document's score as before.
- Compute a weight for the most useful words in the "good" documents and a weight for the most useful words in the "bad" documents.
- Use these weights to adjust each document's score.

Probabilistic feedback
If we have "good" and "bad" documents, we can also use them to build a Naive Bayes classifier that predicts the likelihood of a document matching a search query.
We can also build a probabilistic model directly from our collection and use it to predict the likelihood that a document will satisfy a query:
    P(d_x = True | q = "cat dog tree")

Relevance Feedback
Relevance feedback has not been widely used in web search. Issues:
- Users don't like labeling data.
- Users don't want to search twice.
- It is mostly useful for boosting recall, which is less important than precision for this task.
We can approximate "good" documents by mining clickstream data, or by treating documents that are linked to from many pages as good.

Handling metadata
The standard bag-of-words approach ignores metadata and document structure.
The simplest approach is to add extra weight to terms that occur inside a tag of interest.
The META tag was originally useful for this, but it is an obvious target for spammers.
The anchor text inside links to a particular page can be particularly helpful.

Complex queries
The approach we've seen so far works for Boolean and free-text searches. What about phrases or sentences?
The simplest approach would be to construct a separate index for phrases of length n or less. We don't want to keep all phrases, though, just those that are statistically interesting.
To find these, we estimate the frequencies of tokens t1 and t2 from our collection. If "t1 t2" is an interesting phrase, then
    P(t1 t2) >> P(t1) * P(t2)   or   P(t1 t2) << P(t1) * P(t2)

Positional indices
An alternative, more scalable approach is to keep track of the positions in a document where a term occurs.
To process a phrase query, we treat it as a Boolean query, but in computing the intersection we only admit adjacent words.
For example, suppose we had the query "fat cat" and the following postings:
    fat: d1: (34, 72, 103), d2: (44, 61)
    cat: d1: (35, 88, 104), d2: (42, 99)
We would conclude that d1 has two matches and d2 has none.
We can also use this technique to implement proximity or sentence queries: "cat w/3 mat".
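A minimal sketch of this positional merge for a two-word phrase, assuming each term's postings map document IDs to sorted position lists:

    def phrase_matches(postings1, postings2):
        # postings: {doc_id: sorted list of positions of the term in that document}
        # returns, per document, the positions where term1 is immediately followed by term2
        matches = {}
        for doc in postings1.keys() & postings2.keys():
            positions2 = set(postings2[doc])
            hits = [p for p in postings1[doc] if p + 1 in positions2]
            if hits:
                matches[doc] = hits
        return matches

    # Using the slide's example:
    # fat = {"d1": [34, 72, 103], "d2": [44, 61]}
    # cat = {"d1": [35, 88, 104], "d2": [42, 99]}
    # phrase_matches(fat, cat) -> {"d1": [34, 103]}   (two matches in d1, none in d2)

Relaxing the p + 1 test to "some position of the second term within k positions of p" gives the proximity queries ("cat w/3 mat") mentioned above.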
Approximate matches
We might also want to find documents that contain tokens very close to a query term: misspellings, regional dialects, translation errors. Options:
- Use an aggressive conflation service.
- Reduce words to n-grams and match on those.
- Mine query logs, match them to terms in our index using q-grams, and use the result to update our conflation rules.
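A minimal sketch of the n-gram idea, using character trigrams and Jaccard overlap; the padding character and the 0.3 threshold are arbitrary illustrative choices:

    def ngrams(term, n=3):
        # character n-grams, padded so that prefixes and suffixes are represented
        padded = "$" + term + "$"
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def ngram_similarity(t1, t2, n=3):
        # Jaccard overlap of the two terms' n-gram sets
        g1, g2 = ngrams(t1, n), ngrams(t2, n)
        return len(g1 & g2) / len(g1 | g2)

    # ngram_similarity("colour", "color") == 0.375
    # ngram_similarity("colour", "cat")   == 0.0

    def close_terms(query_term, vocabulary, threshold=0.3):
        # candidate index terms "close to" the query term
        return [t for t in vocabulary if ngram_similarity(query_term, t) >= threshold]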
Summary
Inverted indexes are used to point from terms into documents.
Supporting rich queries can require a great deal of preprocessing of the text.
We can rank results either according to goodness of fit or according to prestige (or both).