Overview of an Information Retrieval System: Terrier
Ashish

Overview
• Structural view
  – Indexing
  – Retrieval
• Extend
• Setup
• Run

IR Systems
• Terrier
  – Academic/research
  – Open source
• Lucene-Nutch
  – Commercial/research
  – Open source

Terrier
• Being developed at the University of Glasgow
• Open source
• OS independent: Java
• Easy to learn
• Easy to extend – modular

Subfolders (1)
• etc/
  – Configuration files
• bin/
  – Scripts to compile and run Terrier
• lib/
  – Java libraries: jar files containing the Terrier system

Subfolders (2)
• src/
  – The Java source files, user-written plugins
• doc/
  – Javadocs for Terrier and for extended components
• var/
  – Index/
    • Index files
  – Results/
    • Results and evaluation
• share/
  – Shared resources such as stopwords, lexicon etc.

Indexing

Tokenization
• Identifying words
  – Based on spaces
  – Handling special characters such as -, $, digits etc.
  – Sometimes a space is not a word separator
    • German, Chinese
  – Agglutinative languages
    • Marathi

Term Pipelining
• Stemming/finding the root
  – ate -> eat
• Stopword removal
  – is, was, I, in etc.
• Abbreviations
  – Dr -> Doctor
• Normalisation
  – color vs colour

Index – data structures
• Direct Index
  – Stores the identifiers of the terms that appear in each document and the corresponding frequencies.
• Document Index
  – Stores information about each document, for example the document length and identifier.
• Inverted Index
  – Stores the posting lists, i.e. the identifiers of the documents that contain each term and the corresponding term frequencies.
• Lexicon
  – Stores the collection vocabulary and the corresponding document and term frequencies.

Extending the indexing process
• Tokenisation:
  – uk.ac.gla.terrier.indexing.*Document
• Term Pipelines:
  – uk.ac.gla.terrier.terms.*

Retrieval
[Figure: query, index]

Scoring and Ranking
• Score: S(d_i, q_j)
• Documents are ranked (sorted) according to the score
• Presented to the user in decreasing order of S(d_i, q_j)
  – Scoring model, e.g.
TF-IDF

Matching Process
• Input
  – Query and weighting model
• Output
  – Ranked result set
• Weighting model
  – Hiemstra-LM
• Uses
  – Term Score Modifiers
    • uk.ac.gla.terrier.matching.tsms
  – Document Score Modifiers
    • uk.ac.gla.terrier.matching.dsms
• Extend
  – uk.ac.gla.terrier.matching.models

Input
• Corpus
  – Very large set of documents
• Topics
  – Queries representing the user need
• Relevance results
  – Set of judgements per query per document

Document format
<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text>
राज्यपालाांनी घेतली राष्ट्रपती, उपराष्ट्रपतीांची भेट मांबई, ता. २१ - राज्यपाल एस. एम. कृष्ट्णा याांनी आज राष्ट्रपती प्रततभा पाटील आणण उपराष्ट्रपती डॉ. हमीद अन्सारी याांची ददल्ली येथे भेट घेतली. राष्ट्रपती, उपराष्ट्रपततपदी तनवड झाल्याच्या पार्वश् भम ू ीवर राज्यपालाांनी भेट घेऊन तयाांचे स्वागत केले. आज दपारी राष्ट्रपती भवन येथे श्रीमती प्रततभा पाटील याांची भेट घेतल्यानांतर तयाांनी हररयाना भवन येथे जाऊन उपराष्ट्रपतीांची भेट घेतली.
</text>
</doc>

Topic format
<top>
<num>5
<title>भारतीय राष्ट्रपती तनवडणक ू २००७
<desc>भारताच्या राष्ट्रपती तनवडणक ू ीर्ी सांबांधित मद्दे व घटना.
<narr>राष्ट्रपतीांची तनवडणक ू , उमेदवाराांववरूध्द केलेली / गललच्छ राजकीय धचखलफेक आणण आपल्या तनकटच्या उमेदवाराचा पराभव करून प्रततभा पाटील हयाांचे भारताच्या सव्प्रथम मदहला राष्ट्रपती (अध्यक्ष) म्हणन ू तनवडून येणे हया-ववषयीची मादहती सांबांधित कागदपत्रात असावयास हवी.
</top>

Relevance Judgement
• Query id: first field of each line
. . .
13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1
. . . .
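The judgement lines shown above are whitespace-separated, so they are easy to load into a lookup table. The following is a minimal Python sketch of that idea, not Terrier code: the function name `parse_qrels` and the variable names are hypothetical, and it assumes every valid line has exactly four fields with a numeric judgement in the last one.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Return {query id: {document id: judgement}} from qrels-style lines."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        # Skip blank lines and filler such as ". . . ." (judgement must be a number).
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        query_id, _second_field, doc_id, judgement = fields
        qrels[query_id][doc_id] = int(judgement)
    return dict(qrels)


# Lines copied from the sample above.
sample = [
    ". . .",
    "13 Q0 1100019.cms.txt 0",
    "13 Q0 1104312.cms.txt 1",
    "13 Q0 1110418.cms.txt 0",
    "13 Q0 1124813.cms.txt 1",
    ". . . .",
]
judgements = parse_qrels(sample)
relevant_docs = sorted(
    doc_id for doc_id, judgement in judgements["13"].items() if judgement == 1
)
print(relevant_docs)
```

Keeping the judgements keyed by query id and then by document id makes it cheap to check whether a retrieved document was judged relevant for a given query.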
• Relevance judgement (last field of each line): 0 or 1
• Document id: third field of each line

Configuration files
• etc/terrier.properties
  – UTF-8 settings, stemmer, index name, etc.
• etc/trec.topic.list
  – Set topics/queries
• etc/trec.models
  – Set matching/retrieval model
• etc/trec.qrels
  – Set relevance judgement file path

Running Terrier
• Already compiled
• To recompile
  – bin/compile.sh
• Set up corpus
  – bin/trec_setup.sh "<corpus folder path>"
• Index
  – bin/trec_terrier.sh -i
• Retrieval
  – bin/trec_terrier.sh -r
• Evaluate
  – bin/trec_terrier.sh -e "<result file>"

Reference
• http://ir.dcs.gla.ac.uk/terrier/doc/
• http://ir.dcs.gla.ac.uk/wiki/Terrier
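The Term Pipelining slide lists four stages that each term passes through. As a rough illustration, they can be sketched as a chain of small functions in Python. This is not Terrier's implementation (Terrier's pipeline classes live under uk.ac.gla.terrier.terms): every name below is hypothetical, the lookup tables contain only the examples from the slide, and the tokenizer assumes plain English text separated by spaces and punctuation.

```python
import re

# Toy lookup tables holding only the examples from the slide (assumptions).
STOPWORDS = {"is", "was", "i", "in"}
ABBREVIATIONS = {"dr": "doctor"}
SPELLING_VARIANTS = {"color": "colour"}
ROOTS = {"ate": "eat"}


def tokenize(text):
    # English-only assumption: a token is a run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def expand_abbreviations(tokens):
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def normalise(tokens):
    return [SPELLING_VARIANTS.get(t, t) for t in tokens]


def stem(tokens):
    # A real stemmer applies rules; this toy version only looks up known roots.
    return [ROOTS.get(t, t) for t in tokens]


# Each stage consumes the output of the previous one.
PIPELINE = [remove_stopwords, expand_abbreviations, normalise, stem]


def run_pipeline(text):
    tokens = tokenize(text)
    for stage in PIPELINE:
        tokens = stage(tokens)
    return tokens


terms = run_pipeline("Dr Rao was in Mumbai. I ate. Color is nice.")
print(terms)  # ['doctor', 'rao', 'mumbai', 'eat', 'colour', 'nice']
```

Because each stage takes and returns a list of tokens, stages can be reordered or swapped without touching the others, which is the modularity the slides attribute to the term pipeline.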
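To make the four index data structures and the scoring step more concrete, here is a small Python sketch that builds them in memory and ranks documents by descending score. It is only an illustration under stated assumptions: the function names and the three toy documents are hypothetical, and the score is the textbook sum of tf × log(N/df), which is not necessarily the exact formula of any Terrier weighting model.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps a document id to its list of terms (after the term pipeline)."""
    # Direct index: document id -> {term: frequency in that document}.
    direct = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document index: document id -> document length in terms.
    document_index = {doc_id: len(terms) for doc_id, terms in docs.items()}
    # Inverted index: term -> posting list {document id: term frequency}.
    inverted = defaultdict(dict)
    for doc_id, term_freqs in direct.items():
        for term, freq in term_freqs.items():
            inverted[term][doc_id] = freq
    # Lexicon: term -> (document frequency, collection term frequency).
    lexicon = {
        term: (len(postings), sum(postings.values()))
        for term, postings in inverted.items()
    }
    return direct, document_index, dict(inverted), lexicon


def rank(query_terms, inverted, lexicon, num_docs):
    """Score each matching document and return (doc id, score), best first."""
    scores = defaultdict(float)
    for term in query_terms:
        if term not in inverted:
            continue  # a term missing from the vocabulary contributes nothing
        doc_freq, _collection_freq = lexicon[term]
        idf = math.log(num_docs / doc_freq)
        for doc_id, tf in inverted[term].items():
            scores[doc_id] += tf * idf
    # Decreasing score; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection.
docs = {
    "d1": ["president", "election", "president"],
    "d2": ["governor", "visit", "president"],
    "d3": ["election", "result"],
}
direct, document_index, inverted, lexicon = build_index(docs)
ranking = rank(["president", "election"], inverted, lexicon, len(docs))
print([doc_id for doc_id, _score in ranking])
```

Only the inverted index and the lexicon are consulted at query time in this sketch; the direct index and document index are built to mirror the slide and would be needed by models that use document length or per-document term lists.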