INSYS 300 - Week 2 Document Representation

INFO624 - Week 2
Models of Information Retrieval
Dr. Xia Lin
Associate Professor
College of Information Science and Technology
Drexel University
Review of Last Week

Challenges of Information Retrieval
- Translate the user's information needs into queries.
- Match queries to stored information.
- Evaluate whether the query results match the user's information needs.

Differences between:
- Data, information, and knowledge
- Data retrieval and information retrieval

Assignment 1

Some of My Favorite Search Software Packages
- IBM Content Management (high-cost)
- Greenstone Digital Library Software (open source)
- AOL PLS Search Engine (free)
- SWISH (open source)
- mnoGoSearch (free)
- Apache Lucene (open-source components)
Documents

Documents are logical units of text:
- Units of records (text & other components)
- Units that can be stored, retrieved, and displayed as a unique entity
- Units of semantic entity
  - units of text grouped together for a purpose
- Units of unformatted text
  - text as written by the authors of documents
Document Models

Documents need to be processed and represented in concise and identifiable formats/structures:
- Documents are full of text.
- Not every word of the text is meaningful for searching/retrieval.
- Documents themselves do not have identifiable attributes such as authors and titles.

Figure 1.2: Logical view of a document: from full text to a set of index terms.
Document Representation

Documents should be represented to help users identify and receive information from the system:
- to identify authors and titles
- to identify subjects
- to provide summaries/abstracts
- to classify subject categories

Document Surrogates

Each document should have one or more short and descriptive labels/attributes:
- Level 1:
  - Title
  - Author
  - Keywords
- Level 2:
  - Level 1 + Abstract
- Level 3:
  - Level 2 + full text
A Formal IR Model

An information retrieval model is a quadruple (D, Q, F, R(qi, dj)) where:
- D is a set composed of logical views (or representations) for the documents in the collection.
- Q is a set composed of logical views (or representations) for the information needs. Such representations are called queries.
- F is a framework for modeling document representations, queries, and their relationships.
- R(qi, dj) is a ranking function which associates a real number with a query qi and a document representation dj. Such a ranking defines an ordering among the documents with regard to the query qi.
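To make the quadruple concrete, here is a minimal sketch (all names and the overlap-based ranking are illustrative assumptions, not from the lecture): documents and queries are represented as sets of terms, the code itself plays the role of the framework F, and R is a simple term-overlap ranking function.

```python
# Illustrative sketch of the IR-model quadruple (D, Q, F, R).
# The overlap-based ranking below is an assumption for demonstration only.

def rank(query_terms, doc_terms):
    """R(q, d): associate a real number with a query and a document.
    Here: the fraction of query terms that appear in the document."""
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

# D: logical views of the documents (here, sets of index terms)
D = [{"information", "retrieval", "model"},
     {"database", "query", "language"}]

# Q: logical views of information needs (queries)
q = {"information", "retrieval"}

# R(q, d) defines an ordering among the documents for this query
ranking = sorted(range(len(D)), key=lambda i: rank(q, D[i]), reverse=True)
```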
Computerized Indexing

Title indexing:
- Sort all the titles alphabetically.
- Ignore a leading "a" or "the".
- Convert all letters to uppercase.
- Matching always starts from the beginning of the title (not individual words).
- Most early IR systems (such as library catalogs) used title indexing.
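The normalization steps above can be sketched as a sort key (a minimal illustration; the slide mentions only "a" and "the" as leading articles):

```python
# Sketch of a title-indexing sort key: fold to uppercase and drop a
# leading article, so matching starts from the beginning of the title.

ARTICLES = ("A ", "THE ")

def title_key(title):
    key = title.strip().upper()
    for article in ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key

titles = ["The Art of Computer Programming", "A Brief History of Time",
          "Information Retrieval"]
sorted_titles = sorted(titles, key=title_key)
```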
Word Indexing

Parse every individual word from the documents.
- First decision: What is a word?
  - Are digits words?
    - How about letter-and-digit combinations: B6, B12?
    - Is F-16 one word or two words?
  - Hyphens
    - Online, on-line, on line?
    - F-16
  - Singular or plural?
- List all the words alphabetically with pointers back to documents – inverted indexing.
Inverted Indexing

An inverted index consists of an ordered list of indexing terms; each indexing term is associated with a set of document identification numbers.
- Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate documents.

Example: Create an inverted index for the following:
Document Number   Terms
001               T3, T4, T6, T12, T15
002               T1, T3, T4, T7, T9, T13
003               T5, T12, T15
004               T11, T12, T15, T15
005               T2, T3, T5, T7, T8, T12
006               T1, T4, T5
007               T3, T5, T6, T7
008               T1, T2, T7, T9, T12
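The inverted index for the table above can be built as a sketch like this: map each term to the set of document numbers that contain it, then retrieve by looking the term up.

```python
# Build an inverted index for the example table: each term points
# back to the documents that contain it.

docs = {
    "001": ["T3", "T4", "T6", "T12", "T15"],
    "002": ["T1", "T3", "T4", "T7", "T9", "T13"],
    "003": ["T5", "T12", "T15"],
    "004": ["T11", "T12", "T15", "T15"],
    "005": ["T2", "T3", "T5", "T7", "T8", "T12"],
    "006": ["T1", "T4", "T5"],
    "007": ["T3", "T5", "T6", "T7"],
    "008": ["T1", "T2", "T7", "T9", "T12"],
}

inverted = {}
for doc_id, terms in docs.items():
    for term in set(terms):          # set() ignores repeats within a document
        inverted.setdefault(term, set()).add(doc_id)

# Retrieval: find the indexing term, then use its posting list
# to locate the documents.
print(sorted(inverted["T12"]))  # → ['001', '003', '004', '005', '008']
```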
Boolean Logic

Logical operators defined on sets:
- True and false:
  - A set is a collection of items with certain common characteristics.
  - Any item either belongs to the set (true) or does not belong to the set (false).
- AND
  - combines two sets, A and B, to create a smaller (or at least not larger) set C.
  - any item in C must be in BOTH set A and set B.
- OR
  - the union of two sets, A and B, creates a larger (or at least not smaller) set C.
  - any item in C must be either in set A or in set B.
- NOT
  - excludes items in a set.
Example:

Given:
- A = {1, 3, 7, 12, 14, 25, 36}
- B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
- C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

Derive:
- A AND B
- A OR B
- A AND B AND C
- (A AND B) NOT C
- (A AND B) OR C
- (A OR B) AND C
- A AND (B OR C)
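The derivations above can be checked directly with Python's set operators (`&` for AND, `|` for OR, `-` for NOT):

```python
# Boolean set operations on the given sets:
# & = AND (intersection), | = OR (union), - = NOT (difference).

A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

print(A & B)        # A AND B → {1, 3, 7, 12, 14, 25}
print(A | B)        # A OR B
print(A & B & C)    # A AND B AND C → {12, 14}
print((A & B) - C)  # (A AND B) NOT C → {1, 3, 7, 25}
print((A & B) | C)  # (A AND B) OR C
print((A | B) & C)  # (A OR B) AND C
print(A & (B | C))  # A AND (B OR C)
```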

Boolean Logic

Venn Diagram
- a graphical representation of Boolean logic
  - A and (B or C)
  - A and B or (C and D)

Boolean Query
- Terms connected by Boolean operators.
- The system retrieves a set of documents based on the Boolean logic of the query.
- Example:
  - (network or networks or structured or system or systems) and (information or retrieval)

Advantages of Boolean Search
- Simple and specific
- Effective
  - AND reduces the number of hits very quickly.
  - OR expands the search scope.
- Strong logical basis
  - proven mathematical foundations
Problems of Boolean Search
- Boolean search is an exact search:
  - a document is either retrieved or not retrieved.
  - Requesting "computer" would not find "computing" unless more programming is done.
- No weighting can be done on terms:
  - in the query "A AND B", you can't specify that A is more important than B.
- No ranking:
  - Retrieved sets cannot be ordered based on the Boolean logic.
  - Every retrieved document is treated equally.
- Possible order confusion:
  - A AND B OR C (is it (A AND B) OR C, or A AND (B OR C)?)

Vectors

A numerical representation for a point in a multi-dimensional space:
- (x1, x2, ..., xn)
- Dimensions of the space need to be defined.
- A measure of the space needs to be defined.

  xi = (xi1, xi2, ..., xin)
  xj = (xj1, xj2, ..., xjn)

Distance:

  d(xi, xj) = sqrt( Σ (xik − xjk)² ), summed over k = 1..n
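The Euclidean distance above can be sketched in a few lines:

```python
# Euclidean distance between two vectors, matching the formula above:
# d(xi, xj) = sqrt( sum over k of (xik - xjk)^2 ).

import math

def distance(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

d = distance((1, 0, 0), (1, 1, 0))
# → 1.0
```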
Vector Representation of Document Space

- Each indexing term is a dimension.
- Each document is a vector:
  - Di = (ti1, ti2, ti3, ti4, ..., tin)
  - Dj = (tj1, tj2, tj3, tj4, ..., tjn)
- Document similarity is defined as:

  Similarity(Di, Dj) = Σ (tik · tjk) / ( sqrt(Σ tik²) · sqrt(Σ tjk²) ), with all sums over k = 1..n

  (the cosine of the angle between the two document vectors)
Example:

A document space is defined by three terms: hardware, software, user.
A set of documents is defined as:
- A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
- A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
- A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)

If the query is "hardware and software", what documents should be retrieved?

In Boolean query matching:
- documents A4 and A7 will be retrieved by "ANDing" the two query terms
- A1, A2, A4, A5, A6, A7, A8, A9 are retrieved if the two query terms are "ORed" together.

In vector query matching:
- q = (1, 1, 0)
- S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
- S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
- S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
- Document retrieved set (with order) = {A4, A7, A1, A2, A5, A6, A8, A9}
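The vector-matching scores above can be reproduced with the cosine similarity formula from the previous slide:

```python
# Cosine similarity between a query vector and each document vector,
# reproducing the worked example's scores.

import math

def similarity(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

docs = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
        "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
        "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1)}
q = (1, 1, 0)   # query: hardware, software

scores = {name: similarity(q, d) for name, d in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(round(scores["A4"], 2), round(scores["A7"], 2), round(scores["A1"], 2))
# → 1.0 0.82 0.71
```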
Weights in the Vector Space

A main advantage of the vector representation is that items in vectors don't have to be just 0 or 1 (true or false):
- A1 = (0.7, 0.5, 0.3)
- A2 = (0.5, 0.2, 0.7)
- A3 = (0.3, 0.6, 0.9)
- A4 = (0.7, 0.9, 1.0)

Queries may also be weighted:
- Q = (0.7, 0.3, 0)
TF and IDF
- TF – term frequency
  - the number of times a term occurs in a document
- DF – document frequency
  - the number of documents that contain the term
- IDF – inverse document frequency
  - idf = log(N/ni)
  - N – the total number of documents
  - ni – the number of documents that contain term i

Salton's Vector Space

A document is represented as a vector (W1, W2, ..., Wn):
- Binary:
  - Wi = 1 if the corresponding term is in the document
  - Wi = 0 if the term is not in the document
- TF (term frequency):
  - Wi = tfi, where tfi is the number of times the term occurs in the document
- TF*IDF (term frequency × inverse document frequency):
  - Wi = tfi · idfi = tfi · (1 + log(N/dfi)), where dfi is the number of documents that contain term i, and N is the total number of documents in the collection.
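The three weighting schemes above can be sketched for a tiny, made-up collection (the documents below are illustrative; the TF*IDF variant is the one from the slide, wi = tfi · (1 + log(N/dfi))):

```python
# Binary, TF, and TF*IDF weights for a tiny illustrative collection.

import math
from collections import Counter

collection = [
    ["information", "retrieval", "information"],
    ["database", "retrieval"],
    ["information", "database", "query"],
]
N = len(collection)

# document frequency: number of documents containing each term
df = Counter()
for doc in collection:
    df.update(set(doc))

def weights(doc):
    tf = Counter(doc)
    binary = {t: 1 for t in tf}
    tfidf = {t: tf[t] * (1 + math.log(N / df[t])) for t in tf}
    return binary, dict(tf), tfidf

binary, tf, tfidf = weights(collection[0])
```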

In vector space, documents and queries are treated the same:
- It is easier to do similarity search:
  - "find documents like this one"
- It is easier to do document clustering:
  - "group documents into categories and subcategories"
- It is easier to display search results graphically:
  - "giving meaning to place or location in the multi-dimensional space"
Web Indexing

Most web indexing is vector-based indexing, with variations:
- Robot indexing software keeps traversing the web to collect more pages and terms.
- Servers build a huge inverted-index and vector-index database.
- Search engines conduct different types of vector query matching.
- Only a few search engines implement truly Boolean query matching.

The real differences among search engines are:
- their indexing weight schemes
- their query processing methods
- their ranking algorithms
None of these are published by any of the search engine firms.
Alternative IR Models

Probabilistic Model
- Given a document d, how likely would the user consider it relevant?
- How likely would the user consider it not relevant?
- If these two are known, the similarity of document d and query q can be defined as:

  S(d, q) = P(d is relevant to q) / P(d is not relevant to q)
Examples:
- If a document is 80% likely to be relevant to query q, what is its (probabilistic) similarity?
- If a document is only 30% likely to be relevant, what is the similarity?
- If there are 100 documents and 10 are relevant to a query:
  - what is the probability of relevance for a randomly selected document?
  - What is the similarity of this document to the query?
  - Any retrieval system must do much better than that.
- In general, retrieval systems should retrieve those documents with S > 1.
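The example questions above can be answered directly from the definition S = P(relevant) / P(not relevant):

```python
# Probabilistic similarity as the odds of relevance:
# S = P(relevant) / (1 - P(relevant)).

def prob_similarity(p_relevant):
    return p_relevant / (1 - p_relevant)

print(round(prob_similarity(0.8), 2))       # 80% likely relevant → 4.0
print(round(prob_similarity(0.3), 2))       # 30% likely relevant → 0.43
print(round(prob_similarity(10 / 100), 2))  # random pick, 10 of 100 → 0.11
```

Note that only the first document clears the S > 1 retrieval threshold mentioned above.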


Advantages of the probabilistic model:
- Documents can be ranked by their relevance probability.
- Relevance probability can be improved through the interaction process.
- Good mathematical model.
Disadvantages:
- Involves many assumptions.
- Not very practical.
Fuzzy Set Model

Fuzzy Set Theory
- An extension of Boolean set theory.
- Instead of a binary membership definition, fuzzy set membership is continuously defined between 0 and 1.
- Example:
  - {male students in our class}
  - {tall students in our class}
  - One is a Boolean set and one is a fuzzy set.

The set of retrieved documents should be considered a fuzzy set:
- Documents are not just relevant or not relevant.
- Documents can be somewhat relevant.
- Documents can be 80% likely to be relevant.
- Good mathematical models, but not widely implemented and tested.

Latent Semantic Indexing Model
- Maps documents from a high-dimensional space to a lower-dimensional space, while maintaining document relationships:
  - for clustering
  - for visualization
- It's a popular advanced retrieval technique.
- It's computationally expensive.

Neural Network Model

Organize the document collection as a semantic network through learning:
- Use known queries/relevant documents to train the network, and later allow the network to predict relevance for new queries (supervised learning).
- Use document-document relationships to "self-organize" the network and move relevant documents close to each other (unsupervised learning).
The Fusion Model
- Retrieve documents based on text indexing (Boolean model, Vector Space Model, etc.).
- Retrieve documents based on link models (citations, Google's PageRank, etc.).
- Retrieve documents based on classification models (classification schemes, thesauri, Yahoo categories, etc.).
- "Fuse" the results together before responding to the user.
Models for Browsing
- Flat Model
  - No particular organization of materials.
- Hierarchical Model
  - Assign documents into a hierarchical structure.
- Hypertext Model
  - Define appropriate links among related documents.