Introduction to information retrieval (IR)
Haim Levkowitz, University of Massachusetts Lowell
© Copyright 2006-2008 Haim Levkowitz
http://www.cs.uml.edu/~haim/teaching/iws/ir/presentations/

Index construction

Plan
• Previous: Tolerant retrieval
  • Wildcards
  • Spell correction
  • Soundex
• Now: Index construction

Index construction
• How do we construct an index?
• What strategies can we use with limited main memory?

Overview
• Intro
• Block merge indexing
• Distributed indexing
• Dynamic indexing
• Other indices

Recall index construction
• Documents are parsed to extract words, and each word is saved with its document ID

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

(Term, Doc #) pairs in order of occurrence:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Key step: sort
• After all documents are parsed, the inverted file is sorted by term (and by doc ID within each term)
• This is the core indexing step; for a realistic collection it is a huge sort (667M items in one example)

(Term, Doc #) pairs after sorting:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2

Index construction
• Index construction in main memory is simple and fast. But:
  • As we build the index, it is hard to use compression
  • We parse docs one at a time, so the final postings list for any term is incomplete until the end
  • With compression things get much more complex (later)
• At 10-12 bytes per postings entry, we need several gigabytes of temporary space -- i.e., the hard disk
• A small code sketch of this in-memory approach follows the Reuters statistics below

Example: The Reuters collection

Symbol   Statistic                                              Value
N        documents                                              800,000
L        avg. # word tokens per document                        200
M        (distinct) word types                                  400,000
         avg. # bytes per word token (incl. spaces/punct.)      6
         avg. # bytes per word token (without spaces/punct.)    4.5
         avg. # bytes per word type                              7.5
         non-positional postings                                 100,000,000

• Why such a difference between the average bytes per token (4.5) and the average bytes per type (7.5)?
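Below is a minimal, illustrative sketch (not part of the original slides) of the in-memory construction described above, using the two example documents: emit (term, doc ID) pairs while parsing, sort them, and collapse duplicates into postings lists. All names in it are made up for the example.

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# 1. Parse: emit one (term, doc_id) pair per token occurrence.
pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# 2. Core indexing step: sort by term, then by doc ID.
pairs.sort()

# 3. Collapse the sorted pairs into postings lists (duplicates removed).
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

print(postings["caesar"])     # [1, 2]
print(postings["ambitious"])  # [2]
```

The in-memory sort of all (term, doc ID) pairs is exactly the step that stops scaling once the collection no longer fits in RAM, which motivates the block merge approach that follows.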
System parameters for design
• Disk seek: ~10 milliseconds
• Block transfer from disk: ~1 microsecond per byte (following a seek)
• All other ops: ~10 microseconds
  • E.g., compare two postings entries and decide their merge order

Bottleneck
• Parse and build postings entries one doc at a time
• Now sort the postings entries by term, then by doc within each term
• Sorting 100M records on disk is too slow -- too many disk seeks
• Exercise: if every comparison took 2 disk seeks, and N items can be sorted with N log2 N comparisons, how long would this take?

Sorting with fewer disk seeks
• 12-byte (4+4+4) records (term, doc, freq), generated as we parse the docs
• Must now sort 100M such 12-byte records by term
• Define a block as ~1.6M such records; we can "easily" fit a couple of blocks into memory
• We will have 64 such blocks to start with (an example; blocks are much larger on modern hardware)
• Accumulate postings for each block, sort, write to disk
• Then merge the blocks into one long sorted order

Sorting 64 blocks of 1.6M records
• First, read each block and sort within it:
  • Quicksort takes 2N ln N expected steps
  • Here: 2 × (1.6M ln 1.6M) steps
• Exercise: estimate the total time to read each block from disk and quicksort it
• 64 times this estimate gives 64 sorted runs of 1.6M records each
• We need 2 copies of the data on disk throughout

Merging 64 sorted runs
• Merge tree of log2 64 = 6 layers
• During each layer, read runs into memory in blocks of 1.6M records, merge, and write back to disk
• [Figure: runs being merged are read from disk, merged in memory, and the merged run is written back to disk]

Merge tree
• Bottom level of the tree: 64 sorted runs of 1.6M records each
• 32 runs, 3.2M/run
• 16 runs, 6.4M/run
• 8 runs, 12.8M/run
• 4 runs, ?/run
• 2 runs, ?/run
• 1 run

Block merge indexing

Merging 64 runs
• Time estimate for disk block transfer:
  • Each layer reads and writes all the data: (64 blocks × 1.6M records × 12 bytes × 10^-6 sec/byte) × 2 (read + write) ≈ 41 min per layer
  • With 6 layers in the merge tree, the total disk transfer time is ≈ 6 × 41 min ≈ 4 hours
• Why is this an overestimate?
• Work out how these transfers are staged, and the total time for merging

Exercise -- fill in this table
Step   Description                                      Time
1      64 initial quicksorts of 1.6M records each       ?
2      Read 2 sorted blocks for merging, write back     ?
3      Merge 2 sorted blocks                            ?
4      Add (2) + (3) = time to read/merge/write         ?
5      64 times (4) = total merge time                  ?
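A compact sketch (my illustration, not the slides' code) of this blocked sort-based approach: accumulate a block of (term, doc ID) records, sort it, write it out as a run, and finally merge the runs. For simplicity it uses a single k-way merge via heapq rather than the 6-layer binary merge tree shown above; the file layout and names are assumptions.

```python
import heapq, os, tempfile

def write_sorted_run(records, run_id, run_dir):
    """Sort one in-memory block of (term, doc_id) records and write it as a run."""
    records.sort()
    path = os.path.join(run_dir, f"run{run_id}.txt")
    with open(path, "w") as f:
        for term, doc_id in records:
            f.write(f"{term}\t{doc_id}\n")
    return path

def read_run(path):
    with open(path) as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)

def block_merge_index(record_stream, block_size=1_600_000):
    """Blocked sort-based indexing: sorted runs on disk, then one k-way merge."""
    run_dir = tempfile.mkdtemp()
    runs, block = [], []
    for rec in record_stream:                 # records arrive as documents are parsed
        block.append(rec)
        if len(block) >= block_size:
            runs.append(write_sorted_run(block, len(runs), run_dir))
            block = []
    if block:
        runs.append(write_sorted_run(block, len(runs), run_dir))
    # Merge all sorted runs in a single pass.
    merged = heapq.merge(*(read_run(p) for p in runs))
    postings, last = {}, None
    for term, doc_id in merged:
        if (term, doc_id) != last:            # drop duplicate (term, doc) pairs
            postings.setdefault(term, []).append(doc_id)
            last = (term, doc_id)
    return postings
```

The block_size parameter is the key knob: it would be chosen to fill the available memory, which is exactly the decision discussed on the next slide.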
Large memory indexing
• Suppose instead we had 16GB of memory for the above indexing task
• The key decision in block merge indexing is the block size
• Exercise: what initial block sizes would you choose, and what index time results? Repeat with a couple of values of n, m
• In practice, spidering (crawling) is often interlaced with indexing
• Spidering is bottlenecked by WAN speed and many other factors -- more later

Distributed indexing
• For web-scale indexing (don't try this at home!) we must use a distributed computing cluster
• Individual machines are fault-prone: they can unpredictably slow down or fail
• How do we exploit such a pool of machines?

Distributed indexing
• Maintain a master machine directing the indexing job; it is considered "safe"
• Break up indexing into sets of (parallel) tasks
• The master assigns each task to an idle machine from the pool

Parallel tasks
• We will use two sets of parallel tasks: parsers and inverters
• Break the input document corpus into splits; each split is a subset of documents
• The master assigns a split to an idle parser machine
• A parser reads one document at a time and emits (term, doc) pairs

Parallel tasks
• The parser writes the pairs into j partitions, each covering a range of terms' first letters
  • E.g., a-f, g-p, q-z (here j = 3)
• Now to complete the index inversion

Data flow
• [Figure: the master assigns splits to parsers; each parser writes its (term, doc) pairs into segment files for the ranges a-f, g-p, q-z; one inverter per range turns its segment into postings]

Inverters
• Collect all (term, doc) pairs for one partition
• Sort them and write out the postings lists
• Each partition contains its own set of postings
• The process flow above is a special case of MapReduce

Schema in MapReduce
• [Figure]

Distributed indexing
• MapReduce is a simple interface for distributing complex tasks
• Several indexing steps at Google are computed with MapReduce
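A single-process toy sketch (my illustration; certainly not Google's implementation) of the parser/inverter flow just described: the "map" step emits (term, doc ID) pairs routed to letter-range partitions, and the "reduce" step sorts each partition into postings. The partition boundaries, function names, and sample splits are all assumptions.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # j = 3 term-range partitions

def partition_of(term):
    for lo, hi in PARTITIONS:
        if lo <= term[0] <= hi:
            return (lo, hi)
    return PARTITIONS[-1]          # digits/punctuation: dump into the last partition

def parse(split):
    """Parser (map): emit (term, doc_id) pairs, routed to a partition."""
    out = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            out[partition_of(term)].append((term, doc_id))
    return out

def invert(pairs):
    """Inverter (reduce): sort one partition's pairs and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return postings

# "Master": hand each split to a parser, then each partition to an inverter.
splits = [[(1, "so let it be with caesar")],
          [(2, "brutus told you caesar was ambitious")]]
partitioned = defaultdict(list)
for split in splits:
    for part, pairs in parse(split).items():
        partitioned[part].extend(pairs)
index = {part: invert(pairs) for part, pairs in partitioned.items()}
```

In a real MapReduce run, each parse() call would execute on a different parser machine and each invert() call on a different inverter machine, with the master handling assignment and machine failures.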
Dynamic indexing
• Docs come in over time
  • Postings updates for terms already in the dictionary
  • New terms added to the dictionary
• Docs also get deleted

Simplest approach
• Maintain a "big" main index
• New docs go into a "small" auxiliary index
• Search across both and merge the results
• Deletions:
  • Keep an invalidation bit vector for deleted docs
  • Filter the docs returned by a search through this bit vector
• Periodically, re-index everything into one main index

Issue with big and small indices
• Corpus-wide statistics are hard to maintain
• E.g., spell correction: which of several corrected alternatives do we present to the user?
  • We said: pick the one with the most hits
  • How do we maintain the top ones with multiple indices?
  • One possibility: ignore the small index for such ordering
• We will see more such statistics used in results ranking

Issue with big and small indices
• Frequent merges
• Poor performance during a merge (at least if implemented naively)

Logarithmic merge
• Maintain a series of indices, each twice as large as the previous one
• Keep the smallest (J0) in memory; the larger ones (I0, I1, ...) live on disk
• If J0 gets too big (> n postings), write it to disk as I0, or merge it with I0 (if I0 already exists) and write the merger to I1
• And so on up the series
• [Figure]

Logarithmic merge
• The number of indices is bounded by O(log T), where T is the number of postings read so far
• So query processing requires merging results from at most O(log T) indices
• Each posting is merged at most O(log T) times, so construction cost is O(T log T)
• With a single auxiliary index, construction time is O(T^2) in the worst case
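A small sketch (illustrative only; real systems merge on-disk postings files rather than Python dicts, and the class and helper names are mine) of the logarithmic merging scheme above: the in-memory index J0 is flushed like a carry in binary addition, merging into successively larger on-disk indices.

```python
def merge(a, b):
    """Merge two indices (term -> sorted postings list)."""
    out = dict(a)
    for term, plist in b.items():
        out[term] = sorted(set(out.get(term, []) + plist))
    return out

class LogarithmicIndexer:
    def __init__(self, n=1000):
        self.n = n          # capacity of the in-memory index J0 (in postings)
        self.J0 = {}        # in-memory index
        self.disk = []      # disk[i] holds index I_i of capacity n * 2**i, or None

    def add(self, term, doc_id):
        self.J0.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.J0.values()) > self.n:
            self._flush()

    def _flush(self):
        carry, self.J0 = self.J0, {}
        i = 0
        # Like binary addition: keep merging while slot I_i is occupied.
        while i < len(self.disk) and self.disk[i] is not None:
            carry = merge(carry, self.disk[i])
            self.disk[i] = None
            i += 1
        if i == len(self.disk):
            self.disk.append(None)
        self.disk[i] = carry

    def search(self, term):
        """A query must consult J0 plus all O(log T) on-disk indices."""
        hits = list(self.J0.get(term, []))
        for idx in self.disk:
            if idx:
                hits += idx.get(term, [])
        return sorted(set(hits))
```

Because a flush that cascades through k occupied levels produces an index of capacity n·2^k, each posting participates in O(log T) merges, matching the bound above.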
Dynamic indexing at large search engines
• Often a combination:
  • Frequent incremental changes
  • Occasional complete rebuild

Building positional indices
• Why?
• It is still a sorting problem (but a larger one)
• Exercise: given 1GB of memory, how would you adapt the block merge described earlier?

Building n-gram indices
• As the text is parsed, enumerate its n-grams
• For each n-gram, we need pointers to all dictionary terms containing it -- the "postings"
• NB: the same "postings entry" can arise repeatedly while parsing the docs
  • We need an efficient hash to keep track
  • E.g., that the trigram uou occurs in the term deciduous will be rediscovered on each text occurrence of deciduous

Building n-gram indices
• Once all (n-gram, term) pairs are enumerated, we must sort them for inversion
• Recall that the average English dictionary term is ≈8 characters
  • So about 6 trigrams per term on average
• For a vocabulary of 500K terms this gives ~3 million pointers, which can be compressed
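A tiny illustrative sketch (mine, not the slides') of building such a trigram index over the dictionary; a set per trigram plays the role of the "efficient hash" that suppresses duplicate postings entries.

```python
from collections import defaultdict

def trigrams(term):
    """Character trigrams of a term, e.g. 'deciduous' -> dec, eci, cid, ..."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def build_trigram_index(dictionary):
    index = defaultdict(set)          # trigram -> set of dictionary terms
    for term in dictionary:
        for g in trigrams(term):
            index[g].add(term)        # set membership deduplicates repeats
    return index

idx = build_trigram_index(["deciduous", "duomo", "vacuous"])
print(sorted(idx["uou"]))             # ['deciduous', 'vacuous']
```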
Index on disk vs. memory
• Most retrieval systems keep the dictionary in memory and the postings on disk
• Web search engines frequently keep both in memory
  • A massive memory requirement
  • Feasible for large Web service installations
  • Less so for commercial installations, where query loads are lighter

Indexing in the real world
• Typically, not all documents live on a local file system
• Documents need to be spidered
• They may be dispersed over a WAN with varying connectivity
• We must schedule distributed spiders (we already discussed distributed indexers)
• Documents may also be (secure) content in:
  • Databases
  • Content management applications
  • Email applications

Content residing in applications
• Mail systems/groupware and content management systems contain many of the most "valuable" documents
• HTTP is often not the most efficient way to fetch these documents; fetch through the native API instead
• Specialized, repository-specific connectors do this
• They also facilitate viewing the document when a search result is selected

Secure documents
• Each document is accessible to a subset of users, specified by Access Control Lists (ACLs)
• Search users are authenticated
• A query retrieves a doc only if the user may access it
• So if there are docs matching your search but you are not privy to them: "Sorry, no results found"
• E.g., as a lowly employee in a company, you get "No results" for the query "salary roster"

Users in groups, docs from groups
• Index the ACLs and filter results by them
• Think of a documents × users matrix with 0/1 entries: 0 if the user can't read the doc, 1 otherwise
• Often, user membership in an ACL group is verified at query time, which causes a slowdown
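An assumed, minimal sketch (not from the slides) of ACL-based filtering in the spirit of that 0/1 documents × users matrix: the index stores, per document, the principals allowed to read it, and results are filtered at query time. The documents, users, and group lookup are invented for illustration.

```python
# Per-document ACL: doc_id -> set of user/group ids allowed to read the doc.
acls = {
    1: {"alice", "hr"},
    2: {"alice", "bob"},
    3: {"hr"},
}

postings = {"salary": [1, 3], "caesar": [2]}

def user_groups(user):
    # In a real system, group membership is resolved against a directory
    # service at query time (the slowdown mentioned above); here it is a stub.
    return {"bob": {"bob"}, "alice": {"alice"}, "carol": {"hr"}}.get(user, set())

def search(term, user):
    """Return only the matching docs this user is allowed to see."""
    principals = {user} | user_groups(user)
    return [d for d in postings.get(term, []) if acls.get(d, set()) & principals]

print(search("salary", "carol"))  # [1, 3] -- carol is in the hr group
print(search("salary", "bob"))    # []     -- matches exist but stay hidden
```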
Exercise
• Can spelling suggestion compromise such document-level security?
• Consider the case where there are documents matching my query, but I lack access to them

Compound documents
• What if a doc consists of components, each with its own ACL?
• Search should return the doc only if:
  • The query matches one of its components
  • And the user has access to that component
• More generally: the doc may be assembled from computations on components
  • E.g., in Lotus databases or in content management systems
• How do we index such docs? No good answers ...

"Rich" documents
• How do we index images?
• Query By Image Content (QBIC) systems: "show pictures similar to this CAT scan"
  • Vector space retrieval (later)
• Currently, image search is usually based on metadata
  • E.g., the file name, such as monalisa.jpg
• Newer approaches exploit social tagging, e.g., flickr.com
  • Still metadata, not image features

Passage/sentence retrieval
• Suppose we want to retrieve not the entire document matching a query, but only a passage/sentence/fragment
  • E.g., within a very long document
• We can index passages/sentences as mini-documents
• But what should the index units be?
• ==> XML search

Resources
• IIR, Chapter 4
• MG, Chapter 5