LIS618 lecture 1

LIS618 lecture 3
Thomas Krichel
2004-02-14
Structure
• Happy Valentine’s day!
• Theory: discussion of the Boolean model
• Theory: the vector model
• Practice: introducing Nexis
• More Nexis next week
advantages of Boolean model
• supposedly easy to grasp by the user
• precise semantics of queries
• implemented in the majority of commercial
systems
problems of Boolean model
• sharp distinction between relevant and
irrelevant documents
• no ranking possible
• users find it difficult to formulate Boolean
queries
• users find it difficult to resolve Boolean
queries
vector model
• associates weights with each index term
appearing in the query and in each
database document.
• relevance can be calculated as the cosine
between the two vectors, i.e. their dot
product divided by the product of their
lengths (the length of a vector is the square
root of the sum of its squared weights).
With non-negative weights this measure
varies between 0 and 1.
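The cosine measure can be sketched in a few lines of Python. This is only an illustration: the function name cosine_similarity and the example weights are made up for this sketch, and returning 0 for an all-zero vector is a convention chosen here, not something the model prescribes.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights over three index terms
query = [1.0, 0.0, 1.0]
document = [0.5, 0.8, 0.0]
print(cosine_similarity(query, document))
```

Identical directions give 1, vectors with no term in common give 0.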
tf/idf
• stands for term frequency / inverse
document frequency
• This refers to a technique that gives a term a
high weight in a document if
– the term appears frequently in a document
– the term does not appear frequently in other
documents
• We will look at each component one at a
time.
absolute & maximum term
frequency
• Let F_t_d be the number of times term t
appears in the document d. This is its
absolute term frequency in the document.
• Let m_d be the maximum absolute term
frequency achieved by any term in
document d. Examples
– Document 1: a b a a b c c d
m_1 = 3, because "a" appears 3 times
– Document 2: a b a f f f e d f a a
m_2 = 4, because "a" and "f" each appear 4 times
relative document term frequency
• The relative term frequency f_t_d is given
by
f_t_d = F_t_d / m_d
that is the absolute term frequency of term
t in document d divided by the maximum
absolute term frequency of document d.
• This completes the "term frequency" part of
the tf/idf formula.
• Let us look at this part through an example.
main example, part I
• Consider three documents
– 1: a b c a f o n l p o f t y x
– 2: a m o e e e n n n a n p l
– 3: r a e e f n l i f f f f x l
• First, look at the maximum frequency
achieved by any term in a given document.
m_1 = 2 ("a", "f" and "o" are there twice)
m_2 = 4 ("n" is there four times)
m_3 = 5 ("f" is there five times)
main example part II
• Now look at some examples of absolute
term frequency
F_a_1 = 2
F_e_2 = 3
F_x_3 = 1
• and some examples of relative term
frequency
f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
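The term frequency calculations above can be checked with a short Python sketch. It assumes that each letter of a document is one term; the names documents, absolute_tf and relative_tf are chosen for this illustration only.

```python
from collections import Counter

# The three documents of the main example; each letter is one term.
documents = {
    1: "abcafonlpoftyx",
    2: "amoeeennnanpl",
    3: "raeefnliffffxl",
}

def absolute_tf(doc):
    """F_t_d for every term t that appears in the document."""
    return Counter(doc)

def relative_tf(doc):
    """f_t_d = F_t_d / m_d for every term t that appears in the document."""
    counts = absolute_tf(doc)
    m_d = max(counts.values())  # maximum absolute term frequency
    return {term: count / m_d for term, count in counts.items()}

for d, text in documents.items():
    print("m_%d =" % d, max(absolute_tf(text).values()))

print("f_a_1 =", relative_tf(documents[1])["a"])
print("f_e_2 =", relative_tf(documents[2])["e"])
print("f_x_3 =", relative_tf(documents[3])["x"])
```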
inverse document frequency
• Let N be the number of documents in the
database. N=3 in our example.
• Let n_t be the number of documents
where the term t appears. In our example
n_a = 3
n_e = 2
n_x = 2
• N/n_t is an indication of the inverse document
frequency of a term. The fewer documents in
the database a term appears in, the larger it
is.
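A minimal Python sketch of the document counts, again assuming one letter per term. The function name document_frequency is made up for this illustration.

```python
# The three documents of the main example; each letter is one term.
documents = {
    1: "abcafonlpoftyx",
    2: "amoeeennnanpl",
    3: "raeefnliffffxl",
}
N = len(documents)

def document_frequency(term):
    """n_t: number of documents that contain the term at least once."""
    return sum(1 for text in documents.values() if term in text)

for term in "aex":
    n_t = document_frequency(term)
    print("n_%s = %d, N/n_%s = %.2f" % (term, n_t, term, N / n_t))
```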
intermezzo: the logarithm
• The logarithm, written log() is a
mathematical function. You should know
that
– log() is an increasing function, i.e. the bigger
x is, the bigger log(x) is.
– log(1) = 0
– log(x) > 0 if x > 1
• Your calculator will tell you what the
logarithm of a number is.
tf/idf formula
• Term frequency and inverse document
frequency have to be combined.
• The final formula for the weight combines
the terms as follows
w_t_d = f_t_d * log( N / n_t )
main example part III
N=3
w_a_1 = 1 * log(3/3) = log(1) = 0!
w_e_2 = 0.75 * log(3/2)
w_x_3 = 0.2 * log(3/2)
where log(3/2) = 0.176, approximately, using the base-10 logarithm
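The whole weight formula can be sketched in Python as follows. The base-10 logarithm is used because it reproduces log(3/2) = 0.176; the function name weight, and the choice to return 0 for a term that is absent from the document, are assumptions of this sketch.

```python
import math
from collections import Counter

# The three documents of the main example; each letter is one term.
documents = {
    1: "abcafonlpoftyx",
    2: "amoeeennnanpl",
    3: "raeefnliffffxl",
}
N = len(documents)

def weight(term, d):
    """w_t_d = f_t_d * log10(N / n_t); 0 if the term is not in document d."""
    counts = Counter(documents[d])
    if term not in counts:
        return 0.0
    f_t_d = counts[term] / max(counts.values())
    n_t = sum(1 for text in documents.values() if term in text)
    return f_t_d * math.log10(N / n_t)

print("w_a_1 =", weight("a", 1))
print("w_e_2 =", round(weight("e", 2), 3))
print("w_x_3 =", round(weight("x", 3), 3))
```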
practical operation
• The computer will search the documents for the
query term and return the documents where the
weight of the term in the index for that document
is strictly positive, ordered by weight, highest to
lowest.
• If there are several query terms the computer
will perform a more complicated operation that
we will not further study here, so we limit
ourselves to the case of one query term.
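The one-term case can be sketched in Python like this. It is an illustration under the assumptions of the main example (one letter per term, base-10 logarithm); the names weight and search are made up here, and a real system would look the weights up in an index rather than recompute them.

```python
import math
from collections import Counter

# The three documents of the main example; each letter is one term.
documents = {
    1: "abcafonlpoftyx",
    2: "amoeeennnanpl",
    3: "raeefnliffffxl",
}
N = len(documents)

def weight(term, d):
    """w_t_d = f_t_d * log10(N / n_t); 0 if the term is not in document d."""
    counts = Counter(documents[d])
    if term not in counts:
        return 0.0
    f_t_d = counts[term] / max(counts.values())
    n_t = sum(1 for text in documents.values() if term in text)
    return f_t_d * math.log10(N / n_t)

def search(term):
    """Documents with strictly positive weight, highest weight first."""
    hits = [(d, weight(term, d)) for d in documents]
    hits = [(d, w) for d, w in hits if w > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

for d, w in search("x"):
    print("document %d, weight %.3f" % (d, w))
```

For the query term "x", document 1 comes before document 3, because "x" is relatively more frequent in document 1.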
practical tests
• You ask the computer to query the term
"a" in our example. What documents are
being returned?
– Compare with the result of the Boolean
model.
• You ask the computer to query the term
"e". What documents are being returned,
and in what order?
advantages of vector model
• term weighting improves performance
• sorting is possible
• easy to compute, therefore fast
• results are difficult to improve without
– query expansion
– user feedback circle
Lexis/Nexis
• Lexis is a specialized legal research
service
• Nexis is primarily a news service
• adds an important temporal component to
all its contents
• restricts contents as compared to Dialog
• potentially bad competition from Google
• lives at http://www.nexis.com
compilation of Nexis
• Uses a number of news sources such as
newspapers.
• Uses company reports databases
• Uses web sites, the URLs of which are
found in the news sources. Some of the
material there can be of low value
(remember the comments in the first
lecture)
SmartIndexing
• There is a controlled vocabulary of indexing
terms
• A document is indexed
– In full text view (except web sites)
– With automatic addition of index terms that
correspond to the document.
• Index terms are added
• Weight of index terms is calculated
• http://www.lexisnexis.com/infopro/products/index/ has more on
it.
equivalents
• Nexis has a number of "equivalents" where, depending
on sources, it replaces one with the other. Contrary to
their claims they also work in quick search.
• First (second, third, etc.) = 1st (2nd, 3rd, etc.)
• Monday (All days ex. Sunday) = Mon (Tues, Weds, etc.)
• January (Abbreviations work) = Jan (Feb, Mar, etc.)
• One (all numbers < 20) = 1 (2, 3, etc.)
• and = &
• company = co
• corporation = corp
• incorporated = inc
Six interfaces to Nexis
• Quick search
• Subject directory
• Power search
• Personal news
• Search forms
• Real time news
• In the remainder of the lecture I will go
through some of these
Quick search
• Implicit OR between terms
• Use quotes to require adjacency of terms
• You can select from a drop-down box of sources
• You can set the date range, though it is unclear
what it means
• It seems to OR a plural to your search term.
• Sometimes returns documents with none of the
search terms. Example: “she is the one”
Quick search
• It is not clear what parts of documents are
being searched
• Apparently it does not search the full text.
• But it seems to prioritize
– TERM, i.e. smart keywords extracted,
– HLEAD for news
– TITLE for legal documents
– WEB-SEARCH-TEXT for web pages
relevance ranking concerns
• where terms appear within the document
• how many occurrences of the terms
appear in the document
• how often those search terms appear
throughout the document
• apparently not how much they occur; for
example, search for "the" or "the the"
• it seems that they keep the algorithm a secret
Subject directory
• you can follow the subject tree but
• there seems to be only a tiny amount of
documents
• categories are not particularly deep or
developed
• there is a "more like this" feature of limited
use, Thomas finds
Power search
• You can first create a customized set of
sources to search
• Do this at the start: browse a menu,
then click “done, search now”
• This is a lot more efficient than trying to
build a search strategy on a large set.
power search truncation
• * represents a single character, present or
absent
– wom*n
– labo*r
• ! truncates to the end of the word
– bookk!
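One way to get a feel for the two truncation characters is to translate them into regular expressions. The Python sketch below does that; it is not how Nexis works internally, and the names nexis_pattern_to_regex and matches are hypothetical.

```python
import re

def nexis_pattern_to_regex(pattern):
    """Translate * and ! into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w?")   # a single character, present or absent
        elif ch == "!":
            parts.append(r"\w*")   # any number of characters to the end of the word
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word is covered by the pattern."""
    return nexis_pattern_to_regex(pattern).fullmatch(word) is not None

for pattern, word in [("wom*n", "woman"), ("wom*n", "women"),
                      ("labo*r", "labour"), ("labo*r", "labor"),
                      ("bookk!", "bookkeeping")]:
    print(pattern, word, matches(pattern, word))
```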
Power search connectors
• OR
• AND
• AND NOT
• PRE/n, n is a number, ordered proximity
• W/n, n is a number, unordered proximity
• W/S words in the same sentence
• W/P words in the same paragraph
• Use parentheses!
• There is no implicit OR as in the simple search,
so forget about the double quotes.
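The difference between ordered and unordered proximity can be illustrated with a small Python sketch. It assumes that "within n words" means the word positions differ by at most n, and it splits text on whitespace only; how Nexis actually counts words is not known here. The function names positions, within and pre are made up for this illustration.

```python
def positions(words, term):
    """Indexes at which the term occurs in the list of words."""
    return [i for i, w in enumerate(words) if w == term]

def within(text, a, b, n):
    """W/n: a and b occur at most n words apart, in either order."""
    words = text.lower().split()
    return any(0 < abs(i - j) <= n
               for i in positions(words, a)
               for j in positions(words, b))

def pre(text, a, b, n):
    """PRE/n: a occurs before b, at most n words earlier."""
    words = text.lower().split()
    return any(0 < j - i <= n
               for i in positions(words, a)
               for j in positions(words, b))

text = "the library opened a new digital archive"
print(within(text, "archive", "library", 5))  # unordered: order does not matter
print(pre(text, "library", "archive", 5))     # ordered: library comes first
print(pre(text, "archive", "library", 5))     # ordered: wrong order
```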
Power search expressions
• Parentheses group terms together
• * for one or no letter
• ! for any number of letters
• ATLEAST n (term), where n is a minimum
number of occurrences
• PLURAL (term) only the plural of term
• SINGULAR (term) only the singular of term
• ALLCAPS (term) only capitals of term
• NOCAPS (term) no capitals of term
• CAPS (term) capitalized term only
http://openlib.org/home/krichel
Thank you for your attention!
