INFO624 - Week 2
Models of Information Retrieval
Dr. Xia Lin
Associate Professor
College of Information Science and Technology
Drexel University
Review of Last Week
Challenges of Information Retrieval
Translate user’s information needs to
queries.
Match queries to stored information.
Evaluate if the query results match the
user’s information needs
Differences between
Data, information, and knowledge
Data retrieval and information retrieval
Assignment 1
Some of my favorite Search Software Packages
IBM’s Content Management (high-cost)
AOL PLS Search Engine (free)
GreenStone Digital Library Software (opensource)
SWISH (open source)
mnoGoSearch (free)
Apache Lucene (open source components)
Documents
Documents are logical units of text
Units of records (text & other components)
Units that can be stored, retrieved, and
displayed as a unique entity
Units of semantic entity
units of text grouped together for a purpose
Units of unformatted text
Text as written by authors of documents.
Document Models
Documents need to be processed and represented
in concise and identifiable formats/structures.
Documents are full of text.
Not every word of the text is meaningful for
searching/retrieval.
Documents themselves do not have identifiable
attributes such as authors and titles.
Figure 1.2: Logical view of a document: from full text to a
set of index terms.
Document Representation
Documents should be represented to help users
identify and receive information from the system.
to identify authors and titles
to identify subjects
to provide summaries/abstracts
to classify subject categories
Document Surrogates
Each document should have one or more short and
descriptive labels/attributes
Level 1:
Title:
Author:
Keywords:
Level 2:
Level 1 + Abstract
Level 3:
Level 2 + full text
A Formal IR Model
An information retrieval model is a quadruple
(D, Q, F, R(qi, dj)) where
D is a set composed of logical views (or representations)
for the documents in the collection.
Q is a set composed of logical views (or representations)
for the information needs. Such representations are called
queries.
F is a framework for modeling document representations,
queries, and their relationships
R(qi, dj) is a ranking function which associates a real
number with a query qi and a document representation dj.
Such ranking defines an ordering among the documents
with regard to the query qi.
Computerized Indexing
Title indexing:
Sort all the titles alphabetically
Ignore a leading “a” or “the”
Convert all letters to uppercase.
Matching always starts from the beginning
of the title (not individual words).
Most early IR systems (such as library
catalogs) used title indexing
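The title-indexing rules above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library catalog was built; the function names (`title_key`, `build_title_index`, `lookup`) and the sample titles are made up for the example.

```python
import bisect

# Leading articles that are ignored when filing a title (per the rules above).
LEADING_ARTICLES = ("A ", "THE ")

def title_key(title):
    """Uppercase the title and drop one leading 'A' or 'THE'."""
    key = title.strip().upper()
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def lookup(index, prefix):
    """Match from the beginning of the title only, not on individual words."""
    key = title_key(prefix)
    keys = [k for k, _ in index]
    start = bisect.bisect_left(keys, key)
    hits = []
    for k, original in index[start:]:
        if not k.startswith(key):
            break
        hits.append(original)
    return hits

# Hypothetical titles, for illustration only.
titles = [
    "The Art of Indexing",
    "A Guide to Retrieval",
    "Information Retrieval",
    "Art History",
]
index = build_title_index(titles)
print(lookup(index, "art"))        # titles filed under ART...
print(lookup(index, "retrieval"))  # empty: no title begins with RETRIEVAL
```

Note that searching for "retrieval" finds nothing even though two titles contain the word, which is the limitation that word indexing addresses.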
Word indexing
Parse every individual word from documents
First decision: What is a word?
Are digits words?
• How about the letter and digit combination:
B6, B12
• Is F-16 one word or two words?
Hyphens
• Online, on-line, on line ?
• F-16
Singular or plural ?
List all the words alphabetically with pointers back to
documents – inverted indexing.
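The "what is a word?" questions above have no single right answer; each system has to pick a rule. As one possible set of decisions (an assumption for illustration, not a recommendation from the slides), the sketch below treats letter-digit combinations such as B12 as words, keeps hyphenated forms such as F-16 as one word, and lowercases everything. Singular/plural handling is left out.

```python
import re

# One possible definition of a "word": runs of letters/digits,
# optionally joined by internal hyphens (so "F-16" and "on-line" stay whole).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    """Split text into lowercase words under the rule above."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]

print(tokenize("The F-16 needs vitamin B12 on-line."))
```

Under this rule "on-line", "online", and "on line" still produce different index entries, so a real system would need a further normalization decision.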
Inverted Indexing
Inverted indexing consists of an ordered list
of indexing terms. Each indexing term is
associated with the identification numbers
of the documents that contain it.
Retrieval is done by first searching in the
ordered list to find the indexing term, then
using the document identification numbers
to locate documents
Example: Create an inverted indexing for
the following:
Document Number    Terms
001                T3, T4, T6, T12, T15
002                T1, T3, T4, T7, T9, T13
003                T5, T12, T15
004                T11, T12, T15, T15
005                T2, T3, T5, T7, T8, T12
006                T1, T4, T5
007                T3, T5, T6, T7
008                T1, T2, T7, T9, T12
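One way to work this exercise is sketched below: walk through each document, and for every term record the document number in that term's posting list. The helper names are illustrative; the data is the eight documents listed above (the repeated T15 in document 004 is kept as given and collapses to a single posting).

```python
from collections import defaultdict

# Document number -> terms, copied from the exercise above.
documents = {
    "001": ["T3", "T4", "T6", "T12", "T15"],
    "002": ["T1", "T3", "T4", "T7", "T9", "T13"],
    "003": ["T5", "T12", "T15"],
    "004": ["T11", "T12", "T15", "T15"],
    "005": ["T2", "T3", "T5", "T7", "T8", "T12"],
    "006": ["T1", "T4", "T5"],
    "007": ["T3", "T5", "T6", "T7"],
    "008": ["T1", "T2", "T7", "T9", "T12"],
}

def term_number(term):
    """Sort T2 before T12 by comparing the numeric part."""
    return int(term[1:])

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    ordered_terms = sorted(postings, key=term_number)
    return {term: sorted(postings[term]) for term in ordered_terms}

index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(term, "->", ", ".join(doc_ids))
```

Retrieval then follows the two steps described earlier: find the term in the ordered list, then use its document numbers to fetch the documents.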
Boolean Logic
Logical operators defined on sets
True and false:
A set is a collection of items with certain common
characteristics.
Any item either belongs to the set (true) or does not belong to
the set (false)
AND
combine two sets, A and B, to create a smaller (or at
least not larger) set C.
any items in C must be in BOTH set A and set B.
OR
Union of two sets, A and B, to create a larger set C.
any item in C must be either in set A or in set B.
NOT
to exclude items in a set.
Example:
Given:
A={1, 3, 7, 12, 14, 25, 36}
B={1, 2, 3,4,5,7,8,12,13, 14, 15, 25, 26}
C={2,4,6,8,10,11,12,13,14}
Derive:
A AND B
A OR B
A AND B AND C
(A AND B) NOT C
(A AND B) OR C
(A OR B) AND C
A AND (B OR C)
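The exercise above can be checked with Python's built-in sets, where AND is intersection (`&`), OR is union (`|`), and NOT is set difference (`-`). The dictionary keys are just labels for printing.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

results = {
    "A AND B": A & B,
    "A OR B": A | B,
    "A AND B AND C": A & B & C,
    "(A AND B) NOT C": (A & B) - C,
    "(A AND B) OR C": (A & B) | C,
    "(A OR B) AND C": (A | B) & C,
    "A AND (B OR C)": A & (B | C),
}

for expression, items in results.items():
    print(f"{expression}: {sorted(items)}")
```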
Boolean Logic
Venn Diagram
graphical representation of Boolean logic
A and (B or C)
A and B or (C and D)
Boolean Query
Terms connected by Boolean operators
The system retrieves a set of documents
based on the Boolean logic of the query.
Examples:
(network or networks or structured or
system or systems) and (information or
retrieval)
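A query like the one above can be evaluated directly on an inverted index: OR the posting sets inside each parenthesized group, then AND the two groups. The sketch below assumes a tiny made-up set of postings (the document numbers are hypothetical) just to show the mechanics.

```python
# Hypothetical postings: term -> set of document numbers containing it.
postings = {
    "network": {1, 4},
    "networks": {2},
    "structured": {3},
    "system": {4, 5},
    "systems": {6},
    "information": {1, 2, 5, 7},
    "retrieval": {2, 3, 7},
}

def any_of(terms):
    """OR: union of the posting sets of all listed terms."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

left = any_of(["network", "networks", "structured", "system", "systems"])
right = any_of(["information", "retrieval"])

# AND: a document must appear in both groups to be retrieved.
hits = left & right
print(sorted(hits))
```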
Advantages of Boolean Search
Simple and specific
Effective
AND reduces the number of hits very
quickly
OR expands search scope
Strongly logic-based
proven mathematical foundations
Problems of Boolean Search:
Boolean search is an exact search
either retrieving or not retrieving a
document.
Requesting “computer” would not find
“computing” unless more programming is
done
No weighting can be done on terms
in query, A and B, you can’t specify A is
more important than B.
No Ranking
Retrieved sets cannot be ordered based on
the Boolean logic.
Every retrieved document is treated
equally.
Possible order confusion
A AND B OR C
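The confusion is that "A AND B OR C" gives different results depending on how it is grouped. A quick check with the sets from the earlier example (repeated here so the snippet stands alone) shows the two readings disagree. In Python itself, `&` binds more tightly than `|`, so the unparenthesized form is read as the first grouping; other systems may choose differently.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

and_first = (A & B) | C   # read as (A AND B) OR C
or_first = A & (B | C)    # read as A AND (B OR C)
unparenthesized = A & B | C  # Python applies & before |

print(sorted(and_first))
print(sorted(or_first))
print(and_first == or_first)
```

Writing the parentheses explicitly in a query avoids relying on any one system's precedence rule.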
Vectors
A numerical representation for a point in a
multi-dimensional space.
(x1, x2, … … xn)
Dimensions of the space need to be
defined
A measure of the space needs to be
defined.
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$$
$$x_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$$
Distance
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$
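Reading the distance measure on this slide as the usual Euclidean distance (the square root of the summed squared coordinate differences), a minimal sketch looks like this. The function name and the sample points are illustrative.

```python
import math

def euclidean_distance(xi, xj):
    """Distance between two points given as equal-length coordinate tuples."""
    if len(xi) != len(xj):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance((0, 0), (3, 4)))
print(euclidean_distance((1, 0, 0), (0, 1, 0)))
```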
Vector Representation of
Document Space
Each indexing term is a dimension
Each document is a vector
Di = (ti1, ti2, ti3, ti4, ... tin)
Dj = (tj1, tj2, tj3, tj4, ..., tjn)
Document similarity is defined as
$$\mathrm{Similarity}(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik} \, t_{jk}}{\sqrt{\sum_{k=1}^{n} t_{ik}^2 \cdot \sum_{k=1}^{n} t_{jk}^2}}$$
Example:
A document Space is defined by three terms:
hardware, software, user
A set of documents is defined as:
A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
If the Query is “hardware and software”
what documents should be retrieved?
In Boolean query matching:
documents A4 and A7 are retrieved by “ANDing”
the two query terms
documents A1, A2, A4, A5, A6, A7, A8, A9 are retrieved
if the two query terms are “ORed” together.
In Vector query matching:
q=(1, 1, 0)
S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
Document retrieved set (with order)=
{A4, A7, A1, A2, A5, A6, A8, A9}
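The vector-matching numbers in this example can be reproduced with the cosine form of the similarity measure. The sketch below is one way to do the arithmetic; the function and variable names are illustrative, and ties are left in document-number order.

```python
import math

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Dimensions: (hardware, software, user)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

scores = {name: cosine(q, vec) for name, vec in docs.items()}

# Keep documents with non-zero similarity, highest score first.
# sorted() is stable, so tied documents stay in A1..A9 order.
ranked = sorted(
    (name for name, s in scores.items() if s > 0),
    key=lambda name: -scores[name],
)

for name in ranked:
    print(name, round(scores[name], 2))
```

Unlike the Boolean result, every document sharing at least one query term is returned, and the ordering shows which ones match best.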
Weights in the Vector Space
A main advantage of Vector representation is that
items in vectors don’t have to be just 0 or 1 (true or
false).
A1=(0.7, 0.5, 0.3)
A2=(0.5, 0.2, 0.7)
A3=(0.3, 0.6, 0.9)
A4=(0.7, 0.9, 1.0)
Queries may also be weighted:
Q=(0.7, 0.3, 0)
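As a sketch of how weighted vectors change the outcome, the snippet below ranks the four weighted documents against the weighted query using the same cosine measure as in the previous example. The slide does not give the resulting scores, so the code prints them rather than quoting figures.

```python
import math

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

weighted_docs = {
    "A1": (0.7, 0.5, 0.3),
    "A2": (0.5, 0.2, 0.7),
    "A3": (0.3, 0.6, 0.9),
    "A4": (0.7, 0.9, 1.0),
}
Q = (0.7, 0.3, 0)

scores = {name: cosine(Q, vec) for name, vec in weighted_docs.items()}
ranked = sorted(scores, key=lambda name: -scores[name])

for name in ranked:
    print(name, round(scores[name], 3))
```

Because the query puts most of its weight on the first term, documents whose weight is concentrated on that term rank higher than documents with larger weights overall.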
TF and IDF
TF – term frequency
number of times a term occurs in a
document
DF – document frequency
number of documents that contain the term.
IDF – inverse document frequency
=log(N/ni)
N – the total number of documents
ni – number of documents that contain
term i.
Salton’s Vector Space
A document is represented as a vector:
(W1, W2, … … , Wn)
Binary:
Wi= 1 if the corresponding term is in the
document
Wi= 0 if the term is not in the document
TF: (Term Frequency)
Wi= tfi where tfi is the number of times the term
occurred in the document
TF*IDF: (Inverse Document Frequency)
Wi = tfi*idfi = tfi*(1+log(N/dfi)) where dfi is the
number of documents that contain term i, and N is
the total number of documents in the collection.
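The TF*IDF weighting above, Wi = tfi*(1+log(N/dfi)), can be sketched as follows. The slide does not state the base of the logarithm, so the natural log is assumed here; the four-document collection is made up for illustration.

```python
import math
from collections import Counter

def document_frequencies(collection):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for terms in collection:
        df.update(set(terms))
    return df

def tf_idf_vector(doc_terms, vocabulary, df, n_docs):
    """Wi = tf_i * (1 + log(N / df_i)); 0 for terms absent from the document."""
    tf = Counter(doc_terms)
    return [
        tf[term] * (1 + math.log(n_docs / df[term])) if tf[term] else 0.0
        for term in vocabulary
    ]

# Hypothetical collection over the three-term vocabulary used earlier.
vocabulary = ["hardware", "software", "user"]
collection = [
    ["hardware", "software", "hardware"],
    ["software", "user"],
    ["user", "user", "hardware"],
    ["software"],
]

df = document_frequencies(collection)
vectors = [tf_idf_vector(doc, vocabulary, df, len(collection)) for doc in collection]

for vec in vectors:
    print([round(w, 3) for w in vec])
```

A term that occurs often in one document but in few documents overall gets the largest weight, which is the point of combining the two factors.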
In vector space, documents and queries are treated
the same.
It is easier to do similarity search:
“find documents like this one”
It is easier to do document clusters:
“group documents into categories and
subcategories”
It’s easier to display search results graphically
“Giving meaning to place or location in the
multi-dimensional space”
Web Indexing
Most web indexing is Vector-based indexing, with
variances:
robot indexing software keeps traversing the web
to collect more pages and terms
Servers establish a huge inverted indexing and
vector indexing database
Search engines conduct different types of vector
query matching
only a few search engines implement true
Boolean query matching
The real differences among different search
engines are
their indexing weight schemes
their query process methods
their ranking algorithms
None of these are published by any of the
search engine firms.
Alternative IR Models
Probabilistic Model
Given a document d, how likely would the user
consider it relevant?
How likely would the user consider it not
relevant?
If these two are known, Similarity of document d
and query q can be defined as:
$$S(d, q) = \frac{P(d \text{ is relevant to } q)}{P(d \text{ is not relevant to } q)}$$
Examples:
If a document is 80% likely to be relevant to
query q, what is its (probabilistic)
similarity?
If a document is only 30% likely to be
relevant, what is the similarity?
If there are 100 documents, 10 are relevant to a
query,
what is the probability of relevance for a
randomly selected document?
What is the similarity of this document to the
query?
Any retrieval system must do much better than
that.
In general, retrieval systems should retrieve
those documents with S > 1
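The example questions above come down to one ratio: the probability of relevance divided by the probability of non-relevance. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def odds_similarity(p_relevant):
    """S = P(relevant) / P(not relevant), for 0 <= P(relevant) < 1."""
    if not 0 <= p_relevant < 1:
        raise ValueError("probability must be in [0, 1)")
    return p_relevant / (1 - p_relevant)

# 80% likely to be relevant
print(round(odds_similarity(0.8), 2))
# 30% likely to be relevant
print(round(odds_similarity(0.3), 2))
# Random pick when 10 of 100 documents are relevant
print(round(odds_similarity(10 / 100), 2))
```

A document that is more likely relevant than not has S above 1, which is the threshold the slide suggests; a random pick from this collection falls well below it.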
Advantages of the Probabilistic model
Documents can be ranked by their relevance
probability.
Relevance probability can be improved through
the interaction process.
Good mathematical model
Disadvantages:
Involves many assumptions
Not very practical
Fuzzy Set Model
Fuzzy Set Theory
Extension of Boolean set theory
Instead of a binary membership
definition, fuzzy set membership is
continuously defined between 0 and 1.
Example:
• { Male students in our class}
• {tall students in our class}
• One is Boolean set and one is fuzzy
set.
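To make the contrast in the example concrete, the sketch below gives a Boolean membership test for "male students" and a graded membership function for "tall students". The height thresholds (160 cm and 190 cm) and the linear ramp between them are arbitrary assumptions chosen only for illustration; fuzzy set theory does not prescribe them.

```python
def is_male(student):
    """Boolean set: membership is either 1 or 0."""
    return 1 if student["sex"] == "M" else 0

def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

# Hypothetical students.
students = [
    {"name": "S1", "sex": "M", "height_cm": 175},
    {"name": "S2", "sex": "F", "height_cm": 190},
    {"name": "S3", "sex": "M", "height_cm": 158},
]

for s in students:
    print(s["name"], is_male(s), round(tall_membership(s["height_cm"]), 2))
```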
The set of retrieved documents should be
considered as a fuzzy set.
Documents are not just relevant or not relevant.
Documents can be somewhat relevant.
Documents can be 80% likely to be
relevant.
Good Mathematical Models but not widely
implemented and tested.
Latent Semantic Indexing Model
Map documents from a high-dimensional
space to a lower dimensional space, while
maintaining document relationships.
For clustering
For visualization
It’s a popular advanced retrieval technique.
It’s computationally expensive.
Neural Network Model
Organize the document collection as a semantic
network through learning
Use known queries/relevant documents to
train the network, and later allow the network to
predict relevance for new queries. (supervised
learning)
Use document-document relationships to “self-organize” the network and move relevant
documents close to each other. (un-supervised
learning).
The Fusion Model
Retrieve documents based on text indexing
(Boolean model or Vector Space Model, etc.)
Retrieve documents based on link models
(Citations, Google’s PageRank, etc.)
Retrieve documents based on classification models
(The classification schemes, thesauri, Yahoo
categories, etc).
“Fuse” the results together before responding to the
user
Models for Browsing
Flat Model
No particular organization of materials
Hierarchical model
Assign documents into a hierarchical
structure.
Hypertext Model
Define appropriate links among related
documents.