Overview of an Information Retrieval System: Terrier

Ashish
Overview
• Structural view
– Indexing
– Retrieval
• Extend
• Setup
• Run
IR Systems
• Terrier
– Academic/research
– Open source
• Lucene-Nutch
– Commercial/research
– Open source
Terrier
• Being developed at the University of Glasgow
• Open source
• OS independent: Java
• Easy to learn
• Easy to extend
– Modular
Subfolders -1
• etc/
– Configuration files
• bin/
– Scripts to compile and run Terrier
• lib/
– Java library: jar files containing the Terrier system
Subfolders -2
• src/
– Java source files and user-written plugins
• doc/
– Javadocs for Terrier and for extended components
• var/
– index/
• Index files
– results/
• Results and evaluation
• share/
– Shared resources such as stopword lists, lexicon, etc.
Indexing
Tokenization
• Identifying words
– Based on space
– Handling special characters such as -, $, digits, etc.
– Sometimes space is not a word separator
• German, Chinese
– Agglutinative languages
• Marathi
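The splitting rules above can be sketched in a few lines. This is a minimal illustration, not Terrier's tokeniser: the function name `tokenize` and the rule "any run of letters, combining marks, or digits is one token" are assumptions made for the example. It does not address languages where space is not a word separator.

```python
import unicodedata

def tokenize(text):
    """Split text into lowercase tokens; anything that is not a letter,
    combining mark, or digit acts as a separator."""
    tokens, current = [], []
    for ch in text:
        # Unicode general categories: L* letters, M* combining marks
        # (needed for Devanagari vowel signs), N* digits.
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("State-of-the-art IR costs $100"))
print(tokenize("मुंबई, ता. २१"))
```

Treating combining marks as part of a word matters for Marathi: without it, every vowel sign would split a word in two.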
Term Pipelining
• Stemming/finding the root
– ate -> eat
• Stopword removal
– is, was, I, in, etc.
• Abbreviations
– Dr -> Doctor
• Normalisation
– color vs. colour
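A term pipeline passes each token through a chain of stages, and any stage may drop the term. The sketch below illustrates the idea only; the lookup tables and function names are hypothetical stand-ins (a real system would use a proper stemmer and a full stopword list), and this is not Terrier's pipeline code.

```python
STOPWORDS = {"is", "was", "i", "in"}      # tiny illustrative stopword list
ABBREVIATIONS = {"dr": "doctor"}          # abbreviation expansion table
VARIANTS = {"colour": "color"}            # spelling normalisation table
ROOTS = {"ate": "eat"}                    # stand-in for a real stemmer

def expand_abbreviation(term):
    return ABBREVIATIONS.get(term, term)

def normalise(term):
    return VARIANTS.get(term, term)

def remove_stopword(term):
    # Returning None tells the pipeline to drop the term.
    return None if term in STOPWORDS else term

def find_root(term):
    return ROOTS.get(term, term)

PIPELINE = [expand_abbreviation, normalise, remove_stopword, find_root]

def run_pipeline(terms, stages=PIPELINE):
    """Pass each term through every stage in order; skip dropped terms."""
    kept = []
    for term in terms:
        for stage in stages:
            term = stage(term)
            if term is None:
                break
        if term is not None:
            kept.append(term)
    return kept

print(run_pipeline(["dr", "smith", "was", "in", "colour", "ate"]))
```

Because the stages are just a list, adding or reordering a stage does not touch the others, which is the point of a modular pipeline.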
Index – data structures
• Direct Index
– stores the identifiers of terms that appear in each document and
the corresponding frequencies.
• Document Index
– stores information about each document, for example the
document length and identifier.
• Inverted Index
– stores the posting lists, i.e. the identifiers of the documents and
their corresponding term frequencies.
• Lexicon
– stores the collection vocabulary and the corresponding
document and term frequencies.
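The four structures can be built in one pass over the tokenised documents. The sketch below uses plain Python containers to show what each structure holds; the function name `build_index` and the in-memory layout are assumptions for illustration, and Terrier's on-disk formats are different.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: list of (docno, list_of_terms) pairs.
    Returns (lexicon, document_index, direct_index, inverted_index)."""
    lexicon = {}                        # term -> {"id", "df", "tf"}
    document_index = []                 # doc id -> {"docno", "length"}
    direct_index = []                   # doc id -> [(term id, frequency), ...]
    inverted_index = defaultdict(list)  # term id -> [(doc id, frequency), ...]

    for doc_id, (docno, terms) in enumerate(docs):
        document_index.append({"docno": docno, "length": len(terms)})
        doc_terms = []
        for term, freq in sorted(Counter(terms).items()):
            # A term gets the next free id the first time it is seen.
            entry = lexicon.setdefault(term, {"id": len(lexicon), "df": 0, "tf": 0})
            entry["df"] += 1            # one more document contains the term
            entry["tf"] += freq         # occurrences across the collection
            doc_terms.append((entry["id"], freq))
            inverted_index[entry["id"]].append((doc_id, freq))
        direct_index.append(doc_terms)
    return lexicon, document_index, direct_index, dict(inverted_index)

docs = [("d1", ["president", "election", "president"]),
        ("d2", ["election", "result"])]
lexicon, document_index, direct_index, inverted_index = build_index(docs)
print(lexicon["election"])      # term id plus document and term frequency
print(inverted_index[lexicon["president"]["id"]])
```

The direct index answers "which terms are in this document", the inverted index answers "which documents contain this term"; retrieval mainly needs the second.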
Extending the indexing process
• Tokenisation:
– uk.ac.gla.terrier.indexing.*Document
• Term Pipelines:
– uk.ac.gla.terrier.terms.*
Retrieval
[Diagram labels: query, index]
Scoring and Ranking
• Score: S(d_i, q_j) for document d_i and query q_j
• Documents are ranked (sorted) according to the score
• Presented to the user in decreasing order of S(d_i, q_j)
– Scoring model
• e.g. TF-IDF
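As an illustration of scoring and ranking, the sketch below computes a score per document as the sum over query terms of tf * log(N / df) and sorts by decreasing score. This is one textbook TF-IDF variant chosen for brevity, not necessarily the formula Terrier implements; the function name `tf_idf_rank` and the toy collection are assumptions.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """docs: dict mapping docno -> list of terms.
    Returns [(docno, score), ...] sorted by decreasing score."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for docno, terms in docs.items():
        tf = Counter(terms)             # term frequency within this document
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in query_terms if df[t] > 0)
        if score > 0:                   # keep only documents that match
            scores[docno] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {"d1": ["president", "election", "president"],
        "d2": ["election", "result"],
        "d3": ["cricket", "result"]}
ranked = tf_idf_rank(["president", "election"], docs)
for docno, score in ranked:
    print(docno, round(score, 3))
```

Here d1 ranks first because it contains the rarer term "president" twice; d3 matches no query term and is left out of the result set.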
Matching Process
• Input
– Query and weighting model
• Output
– Ranked resultset
• Weighting model
– Hiemstra-LM
• Uses
– Term Score Modifiers
• uk.ac.gla.terrier.matching.tsms
– Document Score Modifiers
• uk.ac.gla.terrier.matching.dsms
• Extend
– uk.ac.gla.terrier.matching.models
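The flow of the matching process can be sketched as: score each query term with the weighting model, let term score modifiers adjust the per-term scores, sum them per document, then let document score modifiers adjust the totals before ranking. Everything below is a hypothetical Python illustration of that flow; the function names and signatures are assumptions and do not mirror the Java classes in the packages listed above.

```python
def match(query_terms, inverted_index, weighting_model,
          term_modifiers=(), doc_modifiers=()):
    """inverted_index: term -> {docno: term frequency}.
    Returns a ranked list of (docno, score)."""
    scores = {}
    for term in query_terms:
        postings = inverted_index.get(term, {})
        term_scores = {doc: weighting_model(tf, len(postings))
                       for doc, tf in postings.items()}
        for modify in term_modifiers:   # adjust scores of this term only
            term_scores = modify(term, term_scores)
        for doc, score in term_scores.items():
            scores[doc] = scores.get(doc, 0.0) + score
    for modify in doc_modifiers:        # adjust accumulated document scores
        scores = modify(scores)
    # Decreasing score; ties broken by docno for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

def raw_tf(tf, df):
    """Trivial weighting model: the score is the term frequency."""
    return float(tf)

def boost_president(term, term_scores):
    """Example term score modifier: double the weight of one term."""
    factor = 2.0 if term == "president" else 1.0
    return {doc: score * factor for doc, score in term_scores.items()}

def drop_weak(scores):
    """Example document score modifier: discard documents scoring below 2."""
    return {doc: score for doc, score in scores.items() if score >= 2.0}

index = {"president": {"d1": 2, "d2": 1}, "election": {"d1": 1, "d3": 1}}
result = match(["president", "election"], index, raw_tf,
               term_modifiers=[boost_president], doc_modifiers=[drop_weak])
print(result)
```

Swapping `raw_tf` for another function changes the weighting model without touching the matching loop, which is the kind of extension point the slide refers to.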
Input
• Corpus
– Very large set of documents
• Topics
– Queries representing a user need
• Relevance judgements
– One judgement per query-document pair
Document format
<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text> राज्यपालांनी घेतली राष्ट्रपती, उपराष्ट्रपतींची भेट
मुंबई, ता. २१ - राज्यपाल एस. एम. कृष्णा यांनी आज राष्ट्रपती प्रतिभा पाटील आणि उपराष्ट्रपती डॉ. हमीद अन्सारी यांची दिल्ली येथे भेट घेतली.
राष्ट्रपती, उपराष्ट्रपतिपदी निवड झाल्याच्या पार्श्वभूमीवर राज्यपालांनी भेट घेऊन त्यांचे स्वागत केले. आज दुपारी राष्ट्रपती भवन येथे श्रीमती प्रतिभा पाटील यांची भेट घेतल्यानंतर त्यांनी हरियाना भवन येथे जाऊन उपराष्ट्रपतींची भेट घेतली.
</text>
</doc>
Topic format
<top>
<num>5
<title>भारतीय राष्ट्रपती निवडणूक २००७
<desc>भारताच्या राष्ट्रपती निवडणूकीशी संबंधित मुद्दे व घटना.
<narr>राष्ट्रपतींची निवडणूक, उमेदवारांविरूध्द केलेली / गलिच्छ राजकीय चिखलफेक आणि आपल्या निकटच्या उमेदवाराचा पराभव करून प्रतिभा पाटील ह्यांचे भारताच्या सर्वप्रथम महिला राष्ट्रपती (अध्यक्ष) म्हणून निवडून येणे ह्या-विषयीची माहिती संबंधित कागदपत्रात असावयास हवी.
</top>
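In this topic format the field tags (num, title, desc, narr) are opened but never closed, so a general XML parser will not read it. A small sketch of one way to read such topics is shown below; the function name `parse_topics` is hypothetical, the sample holds only the first three fields, and Terrier itself uses its own Java topic reader.

```python
import re

# Field tags that open a new field inside a <top> block.
FIELD = re.compile(r"<(num|title|desc|narr)>")

def parse_topics(text):
    """Return one dict per <top>...</top> block, keyed by field tag."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        # split() with a capturing group yields
        # [text_before_first_tag, tag1, value1, tag2, value2, ...]
        parts = FIELD.split(block)
        topic = {tag: " ".join(value.split())       # collapse line breaks
                 for tag, value in zip(parts[1::2], parts[2::2])}
        topics.append(topic)
    return topics

sample = """<top>
<num>5
<title>भारतीय राष्ट्रपती निवडणूक २००७
<desc>भारताच्या राष्ट्रपती निवडणूकीशी संबंधित मुद्दे व घटना.
</top>"""

topics = parse_topics(sample)
print(topics[0]["num"], topics[0]["title"])
```

Each field runs until the next field tag or the closing top tag, which is why values that wrap over several lines are joined back into one string.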
Relevance Judgement
Each line: query id, Q0, document id, relevance judgement (0 or 1)
...
13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1
...
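To show how the judgements are used, the sketch below reads lines in the four-column layout above and computes precision at k for a ranked list. The function names and the ranked list are hypothetical; in practice the evaluation step of Terrier produces these measures.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Return {query_id: {docno: relevance}} from 'qid Q0 docno rel' lines."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue                    # skip blank or elided lines
        query_id, _unused, docno, relevance = fields
        qrels[query_id][docno] = int(relevance)
    return dict(qrels)

def precision_at_k(ranked_docnos, judgements, k):
    """Fraction of the top k retrieved documents judged relevant.
    Documents without a judgement count as non-relevant."""
    top = ranked_docnos[:k]
    return sum(judgements.get(docno, 0) for docno in top) / k

lines = [
    ".",
    "13 Q0 1100019.cms.txt 0",
    "13 Q0 1104312.cms.txt 1",
    "13 Q0 1124813.cms.txt 1",
    "13 Q0 1126006.cms.txt 1",
]
qrels = parse_qrels(lines)

# Hypothetical system output for query 13, best document first.
ranked = ["1104312.cms.txt", "1100019.cms.txt",
          "1124813.cms.txt", "1126006.cms.txt"]
print(precision_at_k(ranked, qrels["13"], 4))
```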
Configuration files
• etc/terrier.properties
– UTF-8 settings, stemmer, index name, etc.
• etc/trec.topics.list
– Lists the topic (query) files
• etc/trec.models
– Sets the matching/retrieval model
• etc/trec.qrels
– Sets the path of the relevance judgement file
Running Terrier
• Already compiled
• To recompile
– bin/compile.sh
• Set up the corpus
– bin/trec_setup.sh "<corpus folder path>"
• Index
– bin/trec_terrier.sh -i
• Retrieval
– bin/trec_terrier.sh -r
• Evaluate
– bin/trec_terrier.sh -e "<result file>"
References
• http://ir.dcs.gla.ac.uk/terrier/doc/
• http://ir.dcs.gla.ac.uk/wiki/Terrier