Course overview and introduction
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Fall 2012
Some slides have been adapted from: Profs. Nayak & Raghavan (CS-276, Stanford)

Course info
- Instructor: Mahdieh Soleymani
- Email: [email protected]
- Website: http://ce.sharif.edu/cources/91-92/1/ce324-1
- Lectures: Sun-Tue (10:30-12)
- Office hours: before the MIR class, or contact me to schedule an appointment.

Course info
- TAs: Seddigheh Zolaktaf, Ghazaleh Beigi, Sahba Ezami, Mahsa Mohammadi
- Use these addresses:
  - Group: https://groups.google.com/d/forum/MIR_fall_2012
  - Email for assignments: [email protected]

Textbooks
- Main: Introduction to Information Retrieval, C.D. Manning, P. Raghavan, and H. Schuetze, Cambridge University Press, 2008. A free online version is available at: http://informationretrieval.org/
- Recommended:
  - Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley Longman, 1999.
  - Managing Gigabytes: Compressing and Indexing Documents and Images, I.H. Witten, A. Moffat, and T.C. Bell, Second Edition, Morgan Kaufmann Publishing, 1999.
  - Information Retrieval: Implementing and Evaluating Search Engines, S. Büttcher, C.L.A. Clarke, and G.V. Cormack, MIT Press, 2010.

Marking scheme
- Midterm exam: 25%
- Final exam: 35%
- Project: 20%
- Homework: 10%
- Two quizzes: 10%

Course main topics
- Introduction
- Indexing & text operations
- IR models: Boolean, vector space, probabilistic
- Ranking
- Evaluation of IR systems
- Query operations
- Web IR
- Machine learning in IR: classification, clustering, and ranking
- Advanced topics

Information Retrieval (IR)
- Finding documents relevant to a user's request.
- Deals with the representation, storage, and organization of, and access to, information items.
- Retrieving documents relevant to a query, especially from large sets of documents, efficiently.
Information Retrieval
- Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Definitions
- Document: the unit we decide to build a retrieval system over.
  - Textual: a sequence of words, punctuation, etc. that expresses ideas about some topic in a natural language.
- Query: a request for docs pertaining to some topic; the formulation of a user's information need.
  - Textual: a string of text describing the information that the user is seeking.

Typical IR system
- Given: a corpus and a user query.
- Find: a ranked set of documents relevant to the query.
- Corpus: a collection of documents.
- Query → IR system (over the document corpus) → a list of ranked documents.

Minimize search overhead
- Search overhead: time spent in all steps leading to the reading of items containing the needed information.
  - Steps: query generation, query execution, scanning results, reading non-relevant items, etc.
- The amount of online data has grown at least as quickly as the speed of computers.

Data retrieval vs. information retrieval
- Data retrieval:
  - Which items contain a set of keywords, or satisfy the conditions specified in a (regular-expression-like) user query?
  - Well-defined semantics.
  - A single erroneous object implies failure!
- Information retrieval:
  - Information about a subject or topic.
  - Semantics is frequently loose.
  - Small errors are tolerated.

Heuristic nature of IR
- Problem: semantic disconnect between query and documents.
  - A document is relevant if the user perceives that it contains the needed information.
- Solution: the IR system must interpret documents and rank them according to their degree of relevance to the user's query.
- "The notion of relevance is at the center of IR."

How good are the retrieved docs?
- Precision: the fraction of retrieved docs that are relevant to the user's information need.
  - Precision = |Relevant ∩ Retrieved| / |Retrieved|
- Recall: the fraction of relevant docs that are retrieved.
  - Recall = |Relevant ∩ Retrieved| / |Relevant|

Condensing the data (indexing)
- Naïve search solution: linearly scanning the docs.
  - Computationally expensive for large collections.
- Indexing the corpus speeds up the search task.
  - Indexing depends on the query language and the IR model.
- Term (index unit): a word, phrase, or other group of symbols used for retrieval.
  - Index terms are useful for remembering the document themes.

Typical IR system architecture
[Architecture diagram: the user's need enters through the user interface and passes through text operations to form a query; query operations refine it using user feedback; text operations on the corpus feed indexing, which builds the index; searching retrieves docs against the index, and ranking produces the ranked docs.]

IR system components
- Text operations: form index terms (tokenization, stop-word removal, stemming, ...).
- Indexing: constructs an index for the doc corpus.
- Searching: retrieves documents that are related to the query.
- Ranking: scores retrieved documents according to their relevance.

IR system components (continued)
- User interface: manages interaction with the user.
  - Query input and visualization of results.
  - Relevance feedback.
- Query operations: transform the query to improve retrieval.
  - Query expansion using a thesaurus, or query transformation using relevance feedback.

Structured vs. unstructured docs
- Unstructured text (free text): a continuous sequence of tokens.
- Structured text (fielded text): text broken into sections that are distinguished by tags or other markup.
- Semi-structured text: e.g., a web page.

Databases vs. IR: structured vs. unstructured data
- Structured data tends to refer to information in "tables".
  - e.g., Salary < 60000 AND Manager = Smith

Semi-structured data
- In fact, almost no data is "unstructured".
  - e.g., this slide has distinctly identified zones such as the Title and Bullets.
- This facilitates "semi-structured" search such as: Title contains data AND Bullets contain search
- ... to say nothing of linguistic structure.

More sophisticated semi-structured search
- Title is about Object Oriented Programming AND Author something like stro*rup
- Issues:
  - * is the wild-card operator.
  - How do you process "about"?
  - How do you rank results?
- This is the focus of XML search (IIR chapter 10).

Web search
- Application of IR to HTML documents on the World Wide Web.

Web IR
- A web crawler collects the corpus; the IR system answers a query with a list of ranked pages.

The web and its challenges
- Web collection properties:
  - Unusual and diverse documents.
  - Unusual and diverse users, queries, and information needs.
  - Documents change uncontrollably.
- Web IR:
  - Collecting the document corpus by crawling the web.
  - Can exploit the structural layout of docs.
  - Beyond terms, exploit the link structure (ideas from social networks): link analysis, clickstreams, ...
- How do search engines work, and how can we make them better?

Some main trends in IR models
- Boolean models: exact matching.
- Vector space model: ranking docs by similarity to the query.
- PageRank: ranking of matches by importance of documents.
- Combinations of methods.

More sophisticated information retrieval
- Cross-language information retrieval
- Question answering
- Summarization
- Text mining
- ...
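Several of the ideas above can be sketched together on a toy example: building an inverted index, exact Boolean matching, vector-space ranking by cosine similarity, and evaluating the result with precision and recall. The following is a minimal illustrative sketch, not part of the course material; the corpus, the whitespace tokenizer, and the TF-IDF weighting scheme are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: doc id -> text (an assumption for illustration).
docs = {
    1: "information retrieval finds relevant documents",
    2: "databases answer structured queries over tables",
    3: "ranking documents by relevance to a query",
}

def tokenize(text):
    # Minimal text operation: lowercase and split on whitespace.
    return text.lower().split()

# Indexing: inverted index mapping each term to the set of docs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

# Boolean model: exact matching, docs that contain ALL query terms.
def boolean_and(query):
    postings = [index[t] for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Vector space model: weight terms by TF-IDF, rank docs by cosine similarity.
N = len(docs)

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / len(index[t])) for t in tf if t in index}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vectors = {d: tfidf_vector(tokenize(t)) for d, t in docs.items()}

def rank(query):
    qv = tfidf_vector(tokenize(query))
    scores = {d: cosine(qv, dv) for d, dv in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Evaluation: precision and recall against a set of relevance judgments.
def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)
```

On this corpus, `boolean_and("relevant documents")` returns only the doc containing both terms, while `rank` orders all docs by similarity to the query, illustrating the exact-match vs. ranked-retrieval distinction drawn above.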