Indexing and Complexity

Agenda
• Inverted indexes
• Computational complexity

Some Interesting Questions
• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?

A Cautionary Tale
• Searching is easy - just ask Microsoft!
  – “Find” can search my 1 GB disk in 30 seconds
    • Well, actually it only looks at the file names...
• How long do you think find would take for
  – The 100 GB disk we just got?
  – For the World Wide Web?
• Computers are getting faster, but…
  – How does AltaVista give answers in 5 seconds?

The “Inverted File” Trick
• Organize the bag of words matrix by terms
  – You know the terms that you are looking for
• Look up terms like you search phone books
  – For each letter, jump directly to the right spot
    • For terms of reasonable length, this is very fast
  – For each term, store the document identifiers
    • For every document that contains that term
• At query time, use the document identifiers
  – Consult a “postings file”

An Example
[Figure: inverted file keyed by term prefix (A, AI, AL, B, BA, BR, C, D, F, G, J, L, M, N, O, P, Q, T, TH, TI), pointing into the term table below]

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid       0      0      0      1      0      0      0      1     4, 8
all       0      1      0      1      0      1      0      0     2, 4, 6
back      1      0      1      0      0      0      1      0     1, 3, 7
brown     1      0      1      0      1      0      1      0     1, 3, 5, 7
come      0      1      0      1      0      1      0      1     2, 4, 6, 8
dog       0      0      1      0      1      0      0      0     3, 5
fox       0      0      1      0      1      0      1      0     3, 5, 7
good      0      1      0      1      0      1      0      1     2, 4, 6, 8
jump      0      0      1      0      0      0      0      0     3
lazy      1      0      1      0      1      0      1      0     1, 3, 5, 7
men       0      1      0      1      0      0      0      1     2, 4, 8
now       0      1      0      0      0      1      0      1     2, 6, 8
over      1      0      1      0      1      0      1      1     1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1     6, 8
quick     1      0      1      0      0      0      0      0     1, 3
their     1      0      0      0      1      0      1      0     1, 5, 7
time      0      1      0      1      0      1      0      0     2, 4, 6

The Finished Product
[Figure: inverted file keyed by term prefix (A, AI, AL, B, BA, BR, C, D, F, G, J, L, M, N, O, P, Q, T, TH, TI), pointing into the term table below]

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6

What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked Retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
    • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)

How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents
    • If an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!
• But access is fast - you know where to look

Building an Inverted Index
• Simplest solution is a single sorted array
  – Fast lookup using binary search
  – But sorting large files on disk is very slow
  – And adding one document means starting over
• Tree structures allow easy insertion
  – But the worst case lookup time is linear
• Balanced trees provide the best of both
  – Fast lookup and easy insertion
  – But they require 45% more disk space

Starting a B+ Tree Inverted File
[Figure: B+ tree built from the text "Now is the time for all good …"; node labels as extracted: aaaaa, all, good, now, now, time]

Adding a New Term
[Figure: the same B+ tree after adding "men" from "Now is the time for all good men …"; node labels as extracted: aaaaa, aaaaa, all, good, now, men, men, now, time]

How Big Is the Inverted Index?
• Typically smaller than the postings file
  – Depends on number of terms, not documents
• Eventually almost all terms will be indexed
  – But the postings file will continue to grow
• Postings dominate asymptotic space complexity
  – Linear in the number of documents
    • Assuming that the documents remain about the same size

Some Facts About Disks
• It takes a long time to get the first byte
  – A Pentium can do 1,000,000 operations in 10 ms
• But you can get 1,000 bytes just about as fast
  – 40 MB/sec transfer rates are typical
• So it pays to put related stuff in each “block”
  – M-ary B+ trees are better than binary B+ trees
• Time complexity is measured in disk blocks read
  – Since computing time is negligible by comparison

Time Complexity
• Indexing
  – Walk the inverted file, splitting if needed
  – Insert into the postings file in sorted order
  – Hours or days for large collections
• Query processing
  – Walk the inverted file
  – Read the postings file
  – Seconds, even for enormous collections

Summary
• Slow indexing yields fast query processing
• We use extra disk space to save query time
  – Index space is in addition to document space
  – Time and space complexity must be balanced
• Disk block reads are the critical resource
  – Fast disks are more useful than fast computers

A Question
• If insertions are more common than queries (for example, filtering news stories as they arrive and then never looking at them again), what kind of an index should you build?
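
The two steps on the Time Complexity slide (insert into the postings in sorted order when indexing; walk the inverted file and read the postings when querying) can be sketched in a few lines of Python. This is only a minimal in-memory sketch under stated assumptions, not the lecture's implementation: a sorted Python list stands in for the inverted file (the slides use a B+ tree on disk), a dict stands in for the postings file, and the names ToyInvertedIndex, intersect, document 9, and the term "zebra" are hypothetical. The postings themselves are copied from the Finished Product slide.

```python
from bisect import bisect_left, insort

# Postings copied from the "Finished Product" slide: term -> sorted document numbers.
SLIDE_POSTINGS = {
    "aid": [4, 8], "all": [2, 4, 6], "back": [1, 3, 7], "brown": [1, 3, 5, 7],
    "come": [2, 4, 6, 8], "dog": [3, 5], "fox": [3, 5, 7], "good": [2, 4, 6, 8],
    "jump": [3], "lazy": [1, 3, 5, 7], "men": [2, 4, 8], "now": [2, 6, 8],
    "over": [1, 3, 5, 7, 8], "party": [6, 8], "quick": [1, 3],
    "their": [1, 5, 7], "time": [2, 4, 6],
}


class ToyInvertedIndex:
    """Hypothetical in-memory stand-in for the inverted file plus postings file."""

    def __init__(self, postings):
        # Copy each list so the caller's data is never modified.
        self.postings = {term: sorted(docs) for term, docs in postings.items()}
        # Sorted term list: the "single sorted array" option from the slides.
        self.terms = sorted(self.postings)

    def lookup(self, term):
        """Binary-search the term list, then return that term's postings."""
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.postings[term]
        return []

    def add_document(self, doc_id, words):
        """Indexing step: add doc_id to each term's postings, keeping them sorted."""
        for term in set(words):
            if term not in self.postings:
                insort(self.terms, term)  # a new term enters the inverted file
                self.postings[term] = []
            plist = self.postings[term]
            j = bisect_left(plist, doc_id)
            if j == len(plist) or plist[j] != doc_id:
                plist.insert(j, doc_id)


def intersect(a, b):
    """Boolean AND: merge two sorted postings lists in one pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


index = ToyInvertedIndex(SLIDE_POSTINGS)
print(intersect(index.lookup("quick"), index.lookup("brown")))  # [1, 3]
index.add_document(9, ["good", "zebra"])  # document 9 is made up for illustration
print(index.lookup("good"))  # [2, 4, 6, 8, 9]
```

Because both postings lists are kept sorted, the AND query reads each list once from front to back, which is the access pattern that suits reading whole disk blocks.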