Computing scores in a complete search system

Recap: Term frequency weighting
The log frequency weight of term t in document d is defined as
  w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}  if \mathrm{tf}_{t,d} > 0, and w_{t,d} = 0 otherwise.

Recap: idf weight
\mathrm{df}_t is the document frequency of t: the number of documents that t occurs in. df is an inverse measure of the informativeness of the term; idf is a direct measure of it. The idf weight of term t is
  \mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t),
where N is the number of documents in the collection.

Recap: tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10}(N / \mathrm{df}_t).
It is the best known weighting scheme in information retrieval.

Recap: Cosine similarity between query and document
  \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \, |\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \, \sqrt{\sum_{i=1}^{|V|} d_i^2}}
Here q_i is the tf-idf weight of term i in the query, d_i is the tf-idf weight of term i in the document, and |\vec{q}|, |\vec{d}| denote the lengths of the two vectors.

Recap: Cosine similarity illustrated (figure omitted)

Efficient scoring and ranking
Example: q = jealous gossip
Observations:
- The unit vector v(q) has only two non-zero components.
- In the absence of any weighting for query terms, these non-zero components are equal – in this case, both equal 0.707 (= 1/√2).
- For any two documents d1, d2: cos(q, d1) > cos(q, d2) exactly when v(q) · v(d1) > v(q) · v(d2), so we can rank documents by the inner product with the query vector and skip normalizing the query.

A faster algorithm
(Algorithm figure omitted: with unweighted query terms, cosine scores can be accumulated term at a time directly from the postings lists.)

How do we compute the top k in ranking?
In many applications, we don't need a complete ranking; we just need the top k for a small k (e.g., k = 100). If we don't need a complete ranking, is there an efficient way of computing just the top k?
Naive approach:
- Compute scores for all N documents.
- Sort.
- Return the top k.
What's bad about this? Sorting costs O(N log N) even though we only need k results. Alternative?

Use a heap for selecting the top k
A heap efficiently implements a priority queue: a binary tree in which each node's value is greater than the values of its children. It takes O(J) operations to construct (where J is the number of documents with nonzero score), and the k winners can then be read off in O(k log J) steps.
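To make the recap concrete, here is a minimal sketch of log-tf, idf, tf-idf and cosine scoring. The toy collection `docs` and all helper names are invented for illustration; this is a sketch of the formulas above, not a reference implementation.

```python
import math
from collections import Counter

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term, docs):
    """Inverse document frequency: log10(N / df_t)."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tfidf_vector(tokens, docs):
    """Sparse tf-idf vector {term: weight} for a token list."""
    return {t: log_tf(c) * idf(t, docs) for t, c in Counter(tokens).items()}

def cosine(v, w):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = math.sqrt(sum(x * x for x in v.values())) \
         * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

docs = [["jealous", "gossip", "jealous"], ["gossip", "queen"], ["tired", "light"]]
q = tfidf_vector(["jealous", "gossip"], docs)
scores = sorted(((cosine(q, tfidf_vector(d, docs)), i) for i, d in enumerate(docs)),
                reverse=True)
print(scores)  # document 0 scores highest: it matches both query terms
```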
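The heap selection from the last slide takes a few lines with Python's heapq, matching the complexities quoted there: heapify all J scored documents in O(J), then pop the k best. The scores below are made up.

```python
import heapq

# J documents with nonzero score, as (doc_id, score) pairs (invented numbers).
scored = [(0, 0.31), (1, 0.07), (2, 0.55), (3, 0.29), (4, 0.83)]
k = 3

# heapq is a min-heap, so store negated scores to pop the largest first.
heap = [(-score, doc_id) for doc_id, score in scored]
heapq.heapify(heap)                                 # O(J)
winners = [heapq.heappop(heap) for _ in range(k)]   # O(k log J)
print([(doc_id, -neg) for neg, doc_id in winners])  # [(4, 0.83), (2, 0.55), (0, 0.31)]
```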
Inexact top-K document retrieval
Goal: produce K documents that are likely to be among the K highest-scoring documents for a query, lowering the cost of computing those K documents. The heuristics follow a two-step scheme:
1. Find a set A of contender documents, with K < |A| ≪ N. A does not necessarily contain the K top-scoring documents for the query, but is likely to contain many documents with scores near those of the top K.
2. Return the K top-scoring documents in A.
Three such heuristics: index elimination; champion lists, static quality scores and ordering; cluster pruning.

Index elimination
- Only consider documents containing terms whose idf exceeds a preset threshold. Benefit: the postings lists of low-idf terms are generally long, so skipping them eliminates many candidates.
- Only consider documents that contain many of the query terms.
A danger of this scheme is that we may end up with fewer than K candidate documents in the output.

Champion lists, static quality scores and ordering
Champion lists: for each term t, precompute the set of the r documents with the highest weights for t; at query time, take the union of the champion lists for each of the terms comprising q.
Static quality scores: a measure of quality g(d) for each document d that is query-independent. The net score for a document d is some combination of g(d) together with the query-dependent score.
Combining the two: for a well-chosen value r, we maintain for each term t a global champion list of the r documents with the highest values for g(d) + tf-idf_{t,d}.

Cluster pruning
Preprocessing:
1. Pick √N documents at random from the collection; call these leaders.
2. For each document that is not a leader, compute its nearest leader. The documents attached to a leader are its followers.
Query processing:
1. Given a query q, find the leader L that is closest to q. This entails computing cosine similarities from q to each of the √N leaders.
2. The candidate set A consists of L together with its followers. We compute the cosine scores for all documents in this candidate set.

Cluster pruning (figure omitted)

Cluster pruning: variations
Variations of cluster pruning introduce additional parameters b1 and b2, both positive integers:
- Attach each follower to its b1 closest leaders, rather than to a single closest leader.
- Consider the b2 leaders closest to the query q.
The basic scheme above corresponds to b1 = b2 = 1; a sketch of it follows.
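Here is the promised sketch of basic cluster pruning (b1 = b2 = 1), assuming documents are already tf-idf vectors stored as sparse dicts. The toy vectors, the fixed random seed, and the helper names are invented for the example.

```python
import math
import random

def cosine(v, w):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = math.sqrt(sum(x * x for x in v.values())) \
         * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def preprocess(doc_vectors):
    """Pick sqrt(N) random leaders; attach every other doc to its nearest leader."""
    n = len(doc_vectors)
    leaders = random.sample(range(n), max(1, int(math.sqrt(n))))
    followers = {ld: [] for ld in leaders}
    for i, v in enumerate(doc_vectors):
        if i not in followers:
            nearest = max(leaders, key=lambda ld: cosine(v, doc_vectors[ld]))
            followers[nearest].append(i)
    return followers

def prune_search(q_vec, doc_vectors, followers, k):
    """Find the leader closest to q, then score only that leader's cluster."""
    best = max(followers, key=lambda ld: cosine(q_vec, doc_vectors[ld]))
    candidates = [best] + followers[best]
    scored = sorted(((cosine(q_vec, doc_vectors[d]), d) for d in candidates),
                    reverse=True)
    return scored[:k]

random.seed(0)  # reproducible leader choice for the demo
docs = [{"a": 1.0}, {"a": 0.9, "b": 0.2}, {"b": 1.0}, {"b": 0.8, "c": 0.3}]
flw = preprocess(docs)
print(prune_search({"b": 1.0}, docs, flw, k=2))
```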
The complete search system (overview figure omitted)

Tiered indexes
Basic idea: create several tiers of indexes, corresponding to the importance of indexing terms. During query processing, start with the highest-tier index. If the highest-tier index returns at least k (e.g., k = 100) results, stop and return the results to the user. If we've found fewer than k hits, repeat for the next index in the tier cascade. (A sketch of this cascade appears at the end of the section.)
Example: a two-tier system
- Tier 1: index of all titles.
- Tier 2: index of the rest of each document.
Pages containing the search words in the title are better hits than pages containing the search words only in the body of the text.

Tiered index (figure omitted)

The use of tiered indexes is believed to be one of the reasons that Google's search quality was significantly higher initially (2000/01) than that of its competitors.

Query-term proximity
Users prefer a document in which most or all of the query terms appear close to each other. Let ω be the width, in words, of the smallest window in a document d that contains all the query terms.
Example: for the query "strained mercy" and the sentence "The quality of mercy is not strained", ω = 4.
Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to "soft conjunctive" semantics. They are used by Google and other web search engines. (A sketch for computing ω appears at the end of the section.)

Query parser and scoring functions
Query parser: translates the user-specified keywords into a query with various operators that is executed against the underlying indexes.
Example (for a three-term user query):
1. Run the user-generated query string as a phrase query; rank the matches by vector space scoring.
2. If fewer than ten documents contain the phrase, run the two 2-term phrase queries; rank these using vector space scoring.
3. If we still have fewer than ten results, run the vector space query consisting of the three individual query terms.
Scoring: combine vector space scoring, static quality, proximity weighting and potentially other factors.

Complete search system (figure omitted)

Components we haven't covered yet
- Document cache: we need this for generating snippets (= dynamic summaries).
- Zone indexes: separate indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.
- Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other).

Components we haven't covered yet: Query parser
IR systems often guess what the user intended:
- The two-term query London tower (without quotes) may be interpreted as the phrase query "London tower".
- The query 100 Madison Avenue, New York may be interpreted as a request for a map.
How do we "parse" the query and translate it into a formal specification containing phrase and proximity operators?

Vector space retrieval: complications
- How do we combine phrase retrieval with vector space retrieval? We do not want to compute document frequency / idf for every possible phrase. Why?
- How do we combine Boolean retrieval with vector space retrieval? For example: "or"-constraints and "not"-constraints. Postfiltering is simple, but can be very inefficient – no easy answer.
- How do we combine wild cards with vector space retrieval? Again, no easy answer.
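Two closing sketches were promised above. First, the tiered-index cascade: fall through the tiers until at least k hits accumulate. The `ListIndex` stub and its `search` method are hypothetical stand-ins for real postings lookups.

```python
def tiered_search(query, tiers, k=100):
    """Descend the tier cascade until at least k hits accumulate."""
    seen, results = set(), []
    for tier in tiers:
        for doc_id in tier.search(query):
            if doc_id not in seen:   # a document may appear in several tiers
                seen.add(doc_id)
                results.append(doc_id)
        if len(results) >= k:        # enough hits: stop descending
            break
    return results[:k]

class ListIndex:
    """Toy index: a dict from term to a list of doc ids."""
    def __init__(self, postings):
        self.postings = postings
    def search(self, query):
        return self.postings.get(query, [])

tiers = [ListIndex({"ir": [1, 2]}),        # tier 1, e.g., title index
         ListIndex({"ir": [2, 3, 4]})]     # tier 2, rest of the documents
print(tiered_search("ir", tiers, k=3))     # [1, 2, 3]
```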
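Second, the proximity width ω: a quadratic scan over the positions of query terms in a document. Tokenization here is naive whitespace splitting; a real system would read positions from a positional index instead.

```python
def smallest_window(doc_tokens, query_terms):
    """Width, in words, of the smallest window containing all query terms.

    Returns float('inf') if some query term does not occur in the document.
    """
    wanted = set(query_terms)
    positions = [i for i, tok in enumerate(doc_tokens) if tok in wanted]
    best = float("inf")
    for j, start in enumerate(positions):
        seen = set()
        for end in positions[j:]:
            seen.add(doc_tokens[end])
            if seen == wanted:               # window covers every query term
                best = min(best, end - start + 1)
                break
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))  # 4, matching the slide
```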