Information Retrieval
Assignment 4: Synonym Expansion with Lucene
Ulf Leser
Ulf Leser: Information Retrieval, Exercises

Synonym Expansion
• Idea: When a user searches for a term K, implicitly also search for all synonyms of K
  – “S AND T” -> “(S OR S’ OR …) AND (T OR T’ OR …)”
• Popular method
• Usually increases recall and decreases precision
• Requires a high-quality synonym lexicon
• Can be extended to also include hyponyms

WordNet
• Lexical database with semantic relationships
• Maintained since 1985
• ~150,000 words, ~120,000 synsets
  – Synset: group of cognitive synonyms
• Different relationship types: hypernymy, hyponymy, causation, antonymy, …
• Download at http://wordnet.princeton.edu/
• Extract synsets from the database files
  – Manual, e.g. http://wordnet.princeton.edu/man/wndb.5WN.html

Task
• Implement synonym expansion within Lucene
• Two modes
  – No expansion
  – Expand with synonyms
• Use WordNet as lexicon
  – Current release, WordNet 3.1
• Do not use the WordNet method available for Lucene
• Only Boolean keyword search, no phrase search anymore

Query Expansion in Lucene
• Two options
  – At indexing time: Add all expansions of all terms of a document d when indexing d
  – At search time: When searching for a keyword K, rewrite the query into a disjunction of all expansions of K
• Note: If K is part of more than one synset, use all of them
  – No disambiguation

Complications
• Use only single-token synonyms
  – Ignore all synonyms with more than one token
• Merge the synsets of words appearing as verbs, nouns, …
• Build a symmetric synonym lexicon
  – Synonyms in WordNet are not symmetric
  – Change this: read once and, when storing A->B, also store B->A
  – Do not apply this rule transitively
• Synonym relationships in WordNet do not form an equivalence relation
  – And we don’t change this

Program
• Submit one JAR with three functionalities
  – Indexing
    ./java -jar myjar -index <corpusXMLfile>
  – Searching without expansion
    ./java -jar myjar -search <keywords>
  – Searching with synonym expansion
    ./java -jar myjar -synonym <keywords>

Output
• The program must be ready to run on GRUENAU2
• The uncompressed folder “dict” from the WordNet distribution will be in the folder where we place the JAR file
• For each query, return the following (on STDOUT)
  – The PubMed IDs of all matching abstracts
  – The total number of documents matching the query

Competition
• We will measure both indexing time and query times for 10 queries
• Query times will count much more (how much more will be determined empirically)

Deliverables
• By Monday, 26.1.2015, 9:00 o’clock
  – Roughly three weeks
• Send by mail as ASCII to leser@informatik …
  – Java source code and executable JAR
    • Include the Lucene JAR
  – Results for the following queries (next slide)

Some True Results

Query                   No expansion   Synonym expansion
significant                     1126                1751
monster                            0                  57
information retrieval              1                  17
expose                             4                 204
transcription factor               3                  23
break                             16                1517
activity                        1742                2195
genes                            107                 107
gene expression                   27                  36
chair                              1                 335
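
The lexicon rules listed under Complications and the search-time rewriting listed under Query Expansion in Lucene can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: it assumes the synsets have already been extracted from the WordNet database files into lists of words, and it builds a Boolean query string instead of calling Lucene. The class name SynonymExpansionSketch, both method names, and the toy synsets in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class; none of these names come from the assignment or from Lucene.
class SynonymExpansionSketch {

    /**
     * Builds a symmetric, non-transitive synonym lexicon.
     * Assumption: each inner list holds the words of one already-extracted synset.
     * Synsets from all parts of speech can be passed in together, which merges them.
     */
    static Map<String, Set<String>> buildLexicon(List<List<String>> synsets) {
        Map<String, Set<String>> lexicon = new TreeMap<>();
        for (List<String> synset : synsets) {
            // Keep only single-token entries; WordNet joins multi-token entries with '_'.
            List<String> singleTokens = new ArrayList<>();
            for (String word : synset) {
                if (!word.contains("_") && !word.contains(" ")) {
                    singleTokens.add(word.toLowerCase(Locale.ROOT));
                }
            }
            // Store A->B and B->A for every pair inside the same synset.
            // Words are linked only if they share a synset, so nothing is added transitively.
            for (String a : singleTokens) {
                for (String b : singleTokens) {
                    if (!a.equals(b)) {
                        lexicon.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
                    }
                }
            }
        }
        return lexicon;
    }

    /**
     * Rewrites the keywords K1, K2, ... into "(K1 OR K1' OR ...) AND (K2 OR K2' OR ...)".
     * A keyword without synonyms is left as a bare term.
     */
    static String expandQuery(List<String> keywords, Map<String, Set<String>> lexicon) {
        List<String> clauses = new ArrayList<>();
        for (String keyword : keywords) {
            String k = keyword.toLowerCase(Locale.ROOT);
            Set<String> terms = new LinkedHashSet<>();
            terms.add(k);
            terms.addAll(lexicon.getOrDefault(k, Collections.emptySet()));
            clauses.add(terms.size() == 1 ? k : "(" + String.join(" OR ", terms) + ")");
        }
        return String.join(" AND ", clauses);
    }
}
```

With the two toy synsets [car, auto, motor_vehicle] and [auto, automobile] (illustrative data, not taken from WordNet), the lexicon maps car to {auto} and auto to {car, automobile}. The entry motor_vehicle is dropped as a multi-token synonym, and car is not linked to automobile because the two never share a synset. The keywords car and engine are then rewritten to (car OR auto) AND engine.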