INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones  Email: [email protected]
TA: Brennen Smith  Email: [email protected]
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wed., 1:30 – 2:20 pm, MGH 030

For this Week 3 (10/13/2013) (Basics of Search)
  2.2 T
    Add Word exercise in class
    Boolean search vs. the vector space model
    B-trees
  2.2 W
    One-minute madness – each team gets one minute to describe progress on lab exercises & issues encountered.
    On-going work in lab.

William Jones, a 2013 1.2 T INFO 320

And also for this Week 3 (of 10/13)
  2.2 Th
    Cool tool presentations; Essay review
    Wrap-up
    Guest speaker on SEO
  2.2 F
    Quiz on Module 2.

Components of a web crawler
  [Figure] from Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15

Parsing a document
  What format is it in?
    pdf/word/excel/html?
  What language is it in?
    How to handle “and”?
  What character set is in use?
  …

What you see…
  *from http://en.wikipedia.org/wiki/Lawrence_Massacre

Is not what the crawler gets
  *from http://en.wikipedia.org/wiki/Lawrence_Massacre

What character set is in use?
  ISO-8859-1. Latin alphabet part 1; covers North America, Western Europe, Latin America, the Caribbean, Canada, Africa; the default character set for Web pages in HTML 4.
  UTF-8. A character encoding of Unicode.
    A character in UTF-8 can be from 1 to 4 bytes long.
    UTF-8 can represent any character in the Unicode standard.
    UTF-8 is backwards compatible with ASCII.
    UTF-8 is the preferred encoding for e-mail and web pages.
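The UTF-8 properties listed above (1 to 4 bytes per character, ASCII compatibility) can be checked directly. The following is a minimal sketch using only Python's built-in codecs; the sample characters and the helper names `utf8_length` and `fits_latin1` are illustrative choices, not part of the lecture.

```python
# Hypothetical sample characters, chosen so that each needs a different
# number of bytes in UTF-8: "A", e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_latin1(ch: str) -> bool:
    """True if the character can be represented in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


for ch in samples:
    print(f"U+{ord(ch):04X}  utf-8 bytes: {utf8_length(ch)}  "
          f"in ISO-8859-1: {fits_latin1(ch)}")

# Backwards compatibility with ASCII: pure-ASCII text produces
# identical bytes under both encodings.
print("search".encode("utf-8") == "search".encode("ascii"))
```

A crawler that guesses the wrong character set will decode these bytes into the wrong characters, which is why the parser has to settle the character set before tokenizing.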
*from http://www.w3schools.com/TAGS/ref_charactersets.asp

An HTML sample
  *from http://en.wikipedia.org/wiki/Lawrence_Massacre

Typical Stop Word List

Ambiguity of Natural Language (NL)
  Synonymy: Different Words, Same Meaning
    “car” ~= “automobile”
    “stomach pain after eating” ~= “post-prandial abdominal discomfort”
  Polysemy: Same Words, Different Meanings
    “jaguar” as animal vs. kind of automobile.
    “juvenile victims of crime” vs. “victims of juvenile crime”
    Venetian blinds vs. blind Venetians

How to handle synonyms? car = automobile
  When the document contains “automobile”, index it under “car” as well (and vice versa).
  Or expand the query: when the query contains “automobile”, look under “car” too.
  Or form a concept, <automobile>.
    When “car” is encountered, index under “<automobile>” (and “car” too?). Likewise for “automobile”.
    When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.

Term Weighting
  Binary – presence or absence of term
  TF
    Simple count
    “Sublinear” TF scaling
  IDF
  TF·IDF

A matrix as a way to understand the index, the vector model and more.

            Doc 1   Doc 2   Doc 3   Doc 4   Doc 5   Doc 6
  Term 1      1       0       1       1       0       0
  Term 2      0       1       0       1       0       1
  Term 3      0       0       0       0       1       0
  Term 4      1       1       …
  Term 5      1
  Term 6      0       …

Cells can have weights. Terms can be composites. Documents can have sections…
  [Table: rows William.title, William.abstract, William.intro; columns Doc 1.1, Doc 1.2, Doc 2.1, Doc 2.2, …, Doc 6; cells hold weights]

The index has 3 essential components
  1. A term list – structured for fast access to individual terms.
  2. For each term, a list of associations to documents.
  3. A list of documents that are indexed.

The index can store information for each component
  1. For terms – overall frequency in corpus (IDF), methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.)
  2. For term-to-doc associations – weights, # of occurrences, occurrence offsets, etc.
  3. For documents – address (by which to access content), summary, overall popularity (e.g., using PageRank).

Methods for fast access to terms
  Simple sort
    If updates are few; or term list can reside in RAM.
  Hashing*
  B-trees (more next Thursday)
  *From http://wapedia.mobi/en/Hash_function

Zipf’s law
  If the terms of a corpus are ranked (r) by the frequency (f) of their occurrence, then approximately…
    r·f = k  (k a constant)
  Relates to the Pareto principle, aka the “80-20 rule”.
  Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze (2008) Introduction to Information Retrieval

A sample Zipf distribution
  The graph “hugs” the y- and x-axes.
  Much is accounted for by top-ranked items but much is also hidden in a looong tail.
  *from http://www.celtnet.org.uk/info/long_tail.php

Questions?
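The three index components and the term weighting options from the slides above can be tied together in a small example. This is a minimal sketch under stated assumptions, not the course's implementation: the toy corpus, the document addresses, and the names `build_index`, `tf_weight`, `idf`, and `tf_idf` are all hypothetical, whitespace splitting stands in for real parsing, and IDF is taken as log(N / document frequency), one common formulation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc id -> (address, text).
# Component 3 of the index: the list of indexed documents.
documents = {
    "doc1": ("http://example.org/1", "the car is fast"),
    "doc2": ("http://example.org/2", "the automobile is red"),
    "doc3": ("http://example.org/3", "the jaguar is a fast car"),
}


def build_index(docs):
    """Build term -> {doc_id: occurrence count}.

    The dict keys act as the term list (component 1); each value is the
    list of term-to-document associations (component 2).
    """
    postings = defaultdict(dict)
    for doc_id, (_address, text) in docs.items():
        for term, count in Counter(text.lower().split()).items():
            postings[term][doc_id] = count
    return postings


def tf_weight(count, sublinear=False):
    """Simple count, or sublinear scaling 1 + log(count) for count > 0."""
    if count == 0:
        return 0.0
    return 1.0 + math.log(count) if sublinear else float(count)


def idf(term, postings, n_docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    return math.log(n_docs / len(postings[term]))


def tf_idf(term, doc_id, postings, n_docs, sublinear=False):
    """TF·IDF weight of a term in a document; 0.0 if the term is absent."""
    count = postings.get(term, {}).get(doc_id, 0)
    if count == 0:
        return 0.0
    return tf_weight(count, sublinear) * idf(term, postings, n_docs)


index = build_index(documents)
n = len(documents)
for term in ("the", "car", "jaguar"):
    weights = {d: round(tf_idf(term, d, index, n), 3) for d in documents}
    print(term, sorted(index[term]), weights)
```

In this sketch a term that occurs in every document, such as "the", gets an IDF of zero and so contributes nothing to ranking, which is the same intuition behind a stop word list. A Python dict is a hash table, so the term list here uses the hashing option for fast access; a sorted list or a B-tree would be the alternatives named on the slide.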