
Course overview and introduction
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2012
Some slides have been adapted from Profs. Nayak & Raghavan (CS-276, Stanford)

Course info
Instructor: Mahdieh Soleymani
Email: [email protected]
Website: http://ce.sharif.edu/cources/91-92/1/ce324-1
Lectures: Sun-Tue (10:30-12:00)
Office hours: before the MIR class, or contact me to schedule an appointment.

Course info

TAs: Seddigheh Zolaktaf, Ghazaleh Beigi, Sahba Ezami, Mahsa Mohammadi
Use these addresses:
  Group: https://groups.google.com/d/forum/MIR_fall_2012
  Email for assignments: [email protected]

Textbooks
Main:
  Introduction to Information Retrieval, C.D. Manning, P. Raghavan, and H. Schuetze, Cambridge University Press, 2008.
  A free online version is available at http://informationretrieval.org/
Recommended:
  Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley Longman, 1999.
  Managing Gigabytes: Compressing and Indexing Documents and Images, I.H. Witten, A. Moffat, and T.C. Bell, Second Edition, Morgan Kaufmann Publishing, 1999.
  Information Retrieval: Implementing and Evaluating Search Engines, S. Büttcher, C.L.A. Clarke, and G.V. Cormack, MIT Press, 2010.

Marking scheme
  Midterm exam: 25%
  Final exam: 35%
  Project: 20%
  Homeworks: 10%
  Two quizzes: 10%

Course main topics
  Introduction
  Indexing & text operations
  IR models: Boolean, vector space, probabilistic
  Ranking
  Evaluation of IR systems
  Query operations
  Web IR
  Machine learning in IR: classification, clustering, and ranking
  Advanced topics

Information Retrieval (IR)
  Finding documents relevant to the user's request.
  Deals with the representation, storage, and organization of, and access to, information items.
  Retrieving documents relevant to a query, especially from large sets of documents, efficiently.

Information Retrieval
  Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Definitions
  Document: a unit over which we decide to build a retrieval system.
    Textual document: a sequence of words, punctuation, etc. that expresses ideas about some topic in a natural language.
  Query: a request for docs pertaining to some topic; the formulation of a user's information need.
    Textual query: a string of text describing the information that the user is seeking.

Typical IR system
  Given: a corpus and a user query
  Find: a ranked set of documents relevant to the query
  Corpus: a collection of documents
  [Diagram: the query and the document corpus are fed to the IR system, which outputs a list of ranked documents.]

Minimize search overhead
  Search overhead: time spent in all steps leading to the reading of items containing the needed information
    Steps: query generation, query execution, scanning results, reading non-relevant items, etc.
  The amount of online data has grown at least as quickly as the speed of computers.

Data retrieval vs. information retrieval
  Data retrieval
    Which items contain a set of keywords, or satisfy the conditions specified in a (regular-expression-like) user query?
    Well-defined semantics
    A single erroneous object implies failure!
  Information retrieval
    Information about a subject or topic
    Semantics is frequently loose
    Small errors are tolerated

Heuristic nature of IR
  Problem: semantic disconnect between the query and the documents
    A document is relevant if the user perceives that it contains the needed information.
  Solution: the IR system must interpret documents and rank them according to their degree of relevance to the user's query.
  "The notion of relevance is at the center of IR."

How good are the retrieved docs?
  Precision: fraction of retrieved docs that are relevant to the user's information need
    Precision = relevant retrieved / total retrieved = |Retrieved ∩ Relevant| / |Retrieved|
  Recall: fraction of relevant docs that are retrieved
    Recall = relevant retrieved / all relevant = |Retrieved ∩ Relevant| / |Relevant|
  [Venn diagram: the Retrieved and Relevant sets overlap in the region Retrieved ∩ Relevant.]

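As a concrete illustration, here is a minimal Python sketch (not from the slides) that computes precision and recall for a toy result set; the document IDs and relevance judgments are hypothetical.

    # Toy precision/recall computation (hypothetical document IDs)
    retrieved = {"d1", "d2", "d3", "d4"}   # docs returned by the system
    relevant = {"d2", "d4", "d5"}          # docs judged relevant by the user

    overlap = retrieved & relevant         # Retrieved ∩ Relevant
    precision = len(overlap) / len(retrieved)  # 2/4 = 0.50
    recall = len(overlap) / len(relevant)      # 2/3 ≈ 0.67

    print(f"precision={precision:.2f}, recall={recall:.2f}")
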
Condensing the data (indexing)
  Search: the naïve solution is to linearly scan the docs
    Computationally expensive for large collections
  Indexing the corpus speeds up the searching task
    Indexing depends on the query language and the IR model
  Term (index unit): a word, phrase, or other group of symbols used for retrieval
    Index terms are useful for remembering the document themes

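To make the idea concrete, here is a minimal sketch (not from the slides) of a toy inverted index in Python: each term maps to the set of documents containing it, and a conjunctive (Boolean AND) query is answered by intersecting posting sets. The corpus and query are hypothetical.

    from collections import defaultdict

    # Hypothetical toy corpus: doc_id -> text
    docs = {
        1: "information retrieval finds relevant documents",
        2: "a database stores structured data",
        3: "web search is information retrieval at scale",
    }

    # Build the inverted index: term -> set of doc ids (postings)
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def boolean_and(*terms):
        """Answer a conjunctive query by intersecting posting sets."""
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    print(boolean_and("information", "retrieval"))  # {1, 3}
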
Typical IR system architecture
  [Diagram: the user's need enters through the user interface as text; text operations are applied to both the query and the corpus text; indexing builds an index from the corpus; searching runs the query against the index and returns retrieved docs; ranking orders them into ranked docs; query operations use user feedback to reformulate the query.]

IR System Components
  Text operations form index terms
    Tokenization, stop word removal, stemming, … (a small sketch follows below)
  Indexing constructs an index for the doc corpus.
  Searching retrieves documents that are related to the query.
  Ranking scores retrieved documents according to their relevance.

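As a rough illustration of the text operations listed above, here is a minimal Python sketch (not from the slides); the stop-word list and the suffix-stripping "stemmer" are simplified assumptions, not a real stemming algorithm.

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to"}  # tiny illustrative list

    def text_operations(text):
        """Tokenize, remove stop words, and apply a crude suffix-stripping 'stemmer'."""
        tokens = re.findall(r"[a-z]+", text.lower())             # tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]      # stop word removal
        stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # naive stemming
        return stems

    print(text_operations("The indexing of documents speeds up searching."))
    # ['index', 'document', 'speed', 'up', 'search']
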
IR System Components (continued)
  User interface manages interaction with the user:
    Query input and visualization of results
    Relevance feedback
  Query operations transform the query to improve retrieval:
    Query expansion using a thesaurus, or query transformation using relevance feedback (see the sketch below)

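As a hedged illustration of query expansion, here is a minimal Python sketch (not from the slides) that expands query terms using a small hand-built thesaurus; the thesaurus entries are hypothetical, and a real system would use a resource such as WordNet or corpus co-occurrence statistics.

    # Hypothetical toy thesaurus: term -> related terms
    THESAURUS = {
        "car": ["automobile", "vehicle"],
        "fast": ["quick", "rapid"],
    }

    def expand_query(query_terms):
        """Add related terms from the thesaurus to the original query."""
        expanded = list(query_terms)
        for term in query_terms:
            expanded.extend(THESAURUS.get(term, []))
        return expanded

    print(expand_query(["fast", "car"]))
    # ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']
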
Structured vs. unstructured docs
  Unstructured text (free text): a continuous sequence of tokens
  Structured text (fielded text): text is broken into sections that are distinguished by tags or other markup
  Semi-structured text
    e.g., a web page

Databases vs. IR: structured vs. unstructured data
  Structured data tends to refer to information in "tables"
    e.g., Salary < 60000 AND Manager = Smith

Semi-structured data
  In fact, almost no data is "unstructured"
    E.g., this slide has distinctly identified zones such as the Title and Bullets
  Facilitates "semi-structured" search such as
    Title contains data AND Bullets contain search
  … to say nothing of linguistic structure

More sophisticated semi-structured search
  Title is about Object Oriented Programming AND Author something like stro*rup
    (* is the wild-card operator; see the matching sketch below)
  Issues:
    How do you process "about"?
    How do you rank results?
  This is the focus of XML search (IIR chapter 10).

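As a small aside, here is a minimal Python sketch (not from the slides) of how a wild-card pattern such as stro*rup could be matched against candidate author strings; the candidate names are hypothetical.

    import fnmatch

    # Hypothetical candidate author strings from an index
    authors = ["stroustrup", "strostrup", "stravinsky", "strachey"]

    # fnmatch treats * as "any sequence of characters", like the wild-card above
    matches = [a for a in authors if fnmatch.fnmatch(a, "stro*rup")]
    print(matches)  # ['stroustrup', 'strostrup']
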
Web Search
  Application of IR to HTML documents on the World Wide Web.

Web IR
  [Diagram: a crawler collects pages from the Web into a corpus; the IR system evaluates the query against this corpus and returns a list of ranked pages.]

The web and its challenges
  Web collection properties
    Unusual and diverse documents
    Unusual and diverse users, queries, and information needs
    Documents change uncontrollably
  Web IR
    Collecting the document corpus by crawling the web
    Can exploit the structural layout of docs
    Beyond terms, exploit the link structure (ideas from social networks)
      Link analysis, clickstreams, ...
  How do search engines work? How can we make them better?

Some main trends in IR models
  Boolean models: exact matching
  Vector space model: ranking docs by similarity to the query (a small sketch follows below)
  PageRank: ranking of matches by importance of documents
  Combinations of methods

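To give a flavor of the vector space model, here is a minimal Python sketch (not from the slides) that scores hypothetical documents against a query by cosine similarity over raw term-frequency vectors; a real system would use tf-idf weighting and an inverted index.

    import math
    from collections import Counter

    def cosine(q, d):
        """Cosine similarity between two term-frequency dictionaries."""
        dot = sum(q[t] * d.get(t, 0) for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0

    # Hypothetical toy corpus
    docs = {
        "d1": "information retrieval ranks relevant documents",
        "d2": "databases answer structured queries exactly",
    }
    query = "relevant information"

    q_vec = Counter(query.split())
    scores = {doc_id: cosine(q_vec, Counter(text.split())) for doc_id, text in docs.items()}
    for doc_id, score in sorted(scores.items(), key=lambda x: -x[1]):
        print(doc_id, round(score, 3))
    # d1 scores higher because it shares terms with the query
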
More sophisticated information retrieval
  Cross-language information retrieval
  Question answering
  Summarization
  Text mining
  …