Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Information Retrieval: Document Parsing

Basic indexing pipeline
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer → token stream: Friends, Romans, Countrymen
- Linguistic modules → modified tokens: friend, roman, countryman
- Indexer → inverted index

Parsing a document
- What format is it in? pdf/word/excel/html?
- What language is it in?
- What character set is in use? Plain ASCII, UTF-8, UTF-16, …
Each of these is a classification problem, with many complications…

Tokenization: Issues
- Chinese/Japanese have no spaces between words: a unique tokenization is not always guaranteed
- Dates/amounts appear in multiple formats
  フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  (Katakana, Hiragana, Kanji and "Romaji" mixed in one sentence)
- What about DNA sequences? ACCCGGTACGCAC...
- The definition of tokens determines what you can search!

Case folding
- Reduce all letters to lower case
- Many exceptions
  e.g., General Motors
  e.g., USA vs. usa
  e.g., Morgen will ich in MIT … Is this the German "mit"?

Stemming
- Reduce terms to their "roots"; language dependent
  e.g., automate(s), automatic, automation all reduced to automat
  e.g., casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case reduced to cas
- Originally used to reduce the dictionary size, now…

Porter's algorithm
- Commonest algorithm for stemming English
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: of the rules in a compound command, select the one that applies to the longest suffix
- Sample rules: sses → ss, ies → i, ational → ate, tional → tion
- Full morphological analysis: modest benefit!
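The longest-suffix convention quoted for Porter's algorithm can be illustrated with a minimal sketch. It assumes the four sample rules form a single compound command. The full algorithm has many more rules, spread over its phases and guarded by conditions on the remaining stem, none of which are modelled here. The names `SAMPLE_RULES` and `apply_longest_suffix_rule` are hypothetical.

```python
# Hypothetical rule table: only the four sample rules quoted in the slides.
# The real Porter stemmer has many more rules plus conditions on the stem.
SAMPLE_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
]


def apply_longest_suffix_rule(word, rules=SAMPLE_RULES):
    """Rewrite `word` using the rule whose suffix is the longest match.

    If no rule matches, the word is returned unchanged.
    """
    best = None  # (suffix, replacement) of the longest matching rule so far
    for suffix, replacement in rules:
        if word.endswith(suffix):
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, replacement)
    if best is None:
        return word
    suffix, replacement = best
    # Drop the matched suffix and append its replacement.
    return word[: len(word) - len(suffix)] + replacement
```

With these rules, `apply_longest_suffix_rule("relational")` matches both `tional` and `ational`; the longer suffix wins, giving `relate`, whereas `conditional` matches only `tional` and becomes `condition`.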
Thesauri
- Handle synonyms and polysemy
- Hand-constructed equivalence classes
  e.g., car = automobile
  e.g., macchina = automobile = spider
- For each word, a thesaurus specifies a list of correlated words (usually synonyms, polysemous words, or phrases for complex concepts)
- Co-occurrence pattern: BT (broader term), NT (narrower term)
  Vehicle (BT) → Car → Fiat 500 (NT)
- How to use it in a search engine (SE)?

Dmoz Directory

Yahoo! Directory

Information Retrieval: Statistical Properties of Documents

Statistical properties of texts
- Tokens are not distributed uniformly; they follow the so-called "Zipf Law"
  - Few tokens are very frequent
  - A middle-sized set has medium frequency
  - Many are rare
- The first 100 tokens sum up to 50% of the text
- Many of these tokens are stopwords

[Figure: an example of "Zipf curve"]

[Figure: Zipf's law, log-log plot]

The Zipf Law, in detail
- The k-th most frequent term has frequency approximately 1/k; equivalently, the product of the frequency (f) of a token and its rank (r) is almost a constant:
  $r \cdot f = c\,|T|$, that is, $f = c\,|T| / r$
- General law: $f = c\,|T| / r^{s}$, with $s$ between 1.5 and 2.0
- Scale-invariant: $f(br) = b^{-s} \cdot f(r)$

Distribution vs. cumulative distribution
- The cumulative distribution is a power law with a smaller exponent
- Sum after the k-th element is $\le f_k \, k/(s-1)$
- Sum up to the k-th element is $\ge f_k \, k$

Consequences of Zipf Law
- There exist very frequent tokens that do not discriminate: the so-called "stop words"
  English: to, from, on, and, the, ...
  Italian: a, per, il, in, un, …
- There exist many tokens that occur once in a text and thus discriminate poorly (possibly errors)
  English: Calpurnia
  Italian: Precipitevolissimevolmente (o, paklo)
- Words with medium frequency are the words that discriminate
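To check the rank-frequency relation on a concrete token stream, one possible sketch is the following. It assumes tokens are already available as a list of strings. The function names and the default values `c=0.1` and `s=1.0` are illustrative assumptions, not values prescribed by the slides.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, token, frequency, rank * frequency) rows.

    Rows are ordered from the most to the least frequent token; ties are
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, token, freq, rank * freq)
        for rank, (token, freq) in enumerate(ranked, start=1)
    ]


def zipf_frequency(rank, total_tokens, c=0.1, s=1.0):
    """Frequency predicted by the general law f = c * |T| / r**s.

    The defaults for c and s are illustrative assumptions only.
    """
    return c * total_tokens / rank ** s
```

If the last column of `rank_frequency` stays roughly constant over a large text, the text is close to the basic law with s = 1. The scale-invariance property can be checked numerically by comparing `zipf_frequency(b * r, n, s=s)` with `b ** -s * zipf_frequency(r, n, s=s)`.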
Other statistical properties of texts
- The number of distinct tokens grows as $|T|^{\beta}$, where $\beta < 1$ (the so-called "Heaps Law")
- Hence the token length is $\Omega(\log |T|)$
- Interesting words are the ones with medium frequency (Luhn)

[Figure: frequency vs. term significance (Luhn)]
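The sublinear vocabulary growth stated by Heaps' Law can be observed empirically with a short sketch like the one below. It assumes a token list as input. The multiplicative constant `k` and the default `beta=0.5` are assumptions added for illustration; the slides only state the growth rate $|T|^{\beta}$ with $\beta < 1$.

```python
def vocabulary_growth(tokens):
    """Return the number of distinct tokens seen after each position."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_estimate(n_tokens, k=1.0, beta=0.5):
    """Vocabulary size predicted by k * n**beta.

    k and beta are illustrative assumptions; in practice they would be
    fitted to the collection at hand.
    """
    return k * n_tokens ** beta
```

Plotting `vocabulary_growth` against the position on log-log axes should give a roughly straight line whose slope estimates β.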