Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.

From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

How do we make acquired documents searchable? How do we best convert documents to their index terms?

The simplest approach is find, which requires no text transformation. It is useful in user applications, but not in search (why?). One optional transformation is handled during the find operation: case sensitivity.

English documents are statistically predictable:
▪ The top two most frequently occurring words are "the" and "of" (10% of word occurrences)
▪ The top six most frequently occurring words account for 20% of word occurrences
▪ The top fifty most frequently occurring words account for 50% of word occurrences
▪ Given all unique words in a (large) document, approximately 50% occur only once

George Kingsley Zipf (1902-1950) observed this regularity. Zipf's law: rank words in order of decreasing frequency; the rank (r) of a word times its frequency (f) is approximately equal to a constant (k):

    r × f = k

In other words, the frequency of the rth most common word is inversely proportional to r.

The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document. Revise Zipf's law as:

    r × Pr = c

For English, c ≈ 0.1.

Verify Zipf's law using the AP89 dataset, a collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

    Total documents                      84,678
    Total word occurrences           39,749,179
    Vocabulary size                     198,763
    Words occurring > 1000 times          4,169
    Words occurring once                 70,064

[Table: top 50 words of AP89]

As the corpus grows, so does vocabulary size, though fewer new words appear when the corpus is already large. The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps' law:

    v = k × n^β

Constants k and β vary; typically 10 ≤ k ≤ 100 and β ≈ 0.5.

[Figure: Heaps' law fit for web pages crawled from .gov in early 2004; note the fitted values of k and β]

Word occurrence statistics can be used to estimate the result set size of a user query. Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes

Given three query terms a, b, and c, the probability of a document containing all three is the product of the individual probabilities for each query term:

    P(a ∩ b ∩ c) = P(a) × P(b) × P(c)

P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring.

We assume the search engine knows the number of documents that each word occurs in; call these na, nb, and nc.
▪ Note that the book uses fa, fb, and fc

Estimate the individual query term probabilities:

    P(a) = na / N
    P(b) = nb / N
    P(c) = nc / N

Given P(a), P(b), and P(c), we estimate the result set size as:

    nabc = N × (na / N) × (nb / N) × (nc / N)
         = (na × nb × nc) / N²

This estimation sounds good, but is lacking due to our query term independence assumption. Using the GOV2 dataset with N = 25,205,179, it gives poor results, precisely because of that independence assumption. We could instead use word co-occurrence data...

Alternatively, extrapolate based on the size of the current result set:
▪ The current result set is the subset of documents that have been ranked thus far
▪ Let C be the number of documents found thus far containing all the query words
▪ Let s be the proportion of the total documents ranked (use the least frequently occurring term)
▪ Estimate the result set size via nabc = C / s

Given the example query tropical fish aquarium, the least frequently occurring term is aquarium, which occurs in 26,480 documents. After ranking 3,000 documents, 258 documents contain all three query terms. Thus:

    nabc = C / s = 258 / (3,000 / 26,480) = 2,277

After processing 20% of the documents, the estimate is 1,778, which overshoots the actual value of 1,529.
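The rank-frequency bookkeeping behind Zipf's law can be sketched in a few lines of Python. This is an illustrative sketch: the `zipf_table` and `heaps_estimate` helper names and the toy word list are mine, not from the AP89 data, and a toy input is far too small to exhibit c ≈ 0.1.

```python
from collections import Counter

def zipf_table(words):
    """Rank words by decreasing frequency and compute r * Pr for each.

    Zipf's law predicts r * Pr is roughly constant (c ~ 0.1 for
    large English corpora; tiny toy inputs will not show this).
    """
    total = len(words)
    counts = Counter(words)
    table = []
    for r, (word, f) in enumerate(counts.most_common(), start=1):
        pr = f / total                      # probability of occurrence Pr
        table.append((r, word, f, r * pr))
    return table

def heaps_estimate(n, k=50, beta=0.5):
    """Heaps' law v = k * n**beta; k=50, beta=0.5 are typical
    midrange values, not fitted constants."""
    return k * n ** beta

# Toy demonstration:
words = "the of the and a the of to the in of a the and of".split()
for r, word, f, r_pr in zipf_table(words):
    print(r, word, f, round(r_pr, 3))
```

`Counter.most_common` breaks frequency ties by first-encounter order, which is fine here since Zipf's law only concerns the frequencies themselves.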
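The two result-set-size estimates above can be sketched as follows. The function names are mine; the numbers in the tropical fish aquarium example (C = 258, 3,000 documents ranked, aquarium in 26,480 documents) are from the text.

```python
def independence_estimate(N, term_counts):
    """Estimate the number of documents containing all query terms as
    N * prod(n_i / N), assuming terms occur independently.
    For three terms this is nabc = (na * nb * nc) / N**2."""
    est = N
    for n in term_counts:
        est *= n / N
    return est

def extrapolation_estimate(C, ranked, n_least_frequent):
    """Estimate result set size as nabc = C / s, where
    s = ranked / n_least_frequent is the proportion of the
    least frequent term's documents processed so far."""
    s = ranked / n_least_frequent
    return C / s

# tropical fish aquarium example: after ranking 3,000 documents,
# 258 contain all three terms; aquarium occurs in 26,480 documents.
print(round(extrapolation_estimate(258, 3_000, 26_480)))  # 2277
```

Note that the extrapolation estimate needs no independence assumption, which is why it tracks the true result set size more closely as more documents are ranked.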