Praktikum Text Analytics - HU

Information Retrieval
Assignment 4: Synonym Expansion with Lucene
Ulf Leser
Synonym Expansion
• Idea: When a user searches for a term K, implicitly also search
for all synonyms of K
– “S AND T” -> “(S OR S’ OR …) AND (T OR T’ OR …)”
• Popular method
• Usually increases recall and decreases precision
• Requires a high-quality synonym lexicon
• Can be extended to also include hyponyms
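The rewriting rule above can be sketched in plain Java, without any Lucene classes. The class name `QueryExpansionSketch`, the method `expand`, and the tiny hand-made synonym map are assumptions for illustration only; a real solution would build Lucene query objects rather than a string.

```java
import java.util.*;

final class QueryExpansionSketch {

    /** Rewrites "S AND T" into "(S OR S' OR ...) AND (T OR T' OR ...)". */
    static String expand(List<String> keywords, Map<String, Set<String>> synonyms) {
        List<String> clauses = new ArrayList<>();
        for (String keyword : keywords) {
            // LinkedHashSet keeps the original keyword first and drops duplicates.
            Set<String> alternatives = new LinkedHashSet<>();
            alternatives.add(keyword);
            alternatives.addAll(synonyms.getOrDefault(keyword, Collections.emptySet()));
            clauses.add("(" + String.join(" OR ", alternatives) + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // Hypothetical lexicon entry, not taken from WordNet.
        Map<String, Set<String>> synonyms = new HashMap<>();
        synonyms.put("car", new TreeSet<>(Arrays.asList("auto", "automobile")));
        System.out.println(expand(Arrays.asList("car", "fast"), synonyms));
        // prints: (car OR auto OR automobile) AND (fast)
    }
}
```

A keyword without synonyms simply becomes a one-element disjunction, so the same code path serves both search modes.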
Ulf Leser: Information Retrieval, Exercises
WordNet
• Lexical database with semantic relationships
• Maintained since 1985
• ~150,000 words, ~120,000 synsets
– Synset: a group of cognitive synonyms
• Different relationship types: hypernymy, hyponymy,
causation, antonymy, …
• Download at http://wordnet.princeton.edu/
• Extract synsets from database files
– Manual, e.g. http://wordnet.princeton.edu/man/wndb.5WN.html
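As a rough sketch of the extraction step, the following Java method reads one line of a WordNet data file (data.noun, data.verb, ...) as described in the wndb manual: the fourth field is the word count as a two-digit hexadecimal number, followed by pairs of word and lex_id. The class and method names are assumptions, and the sample line in `main` is abbreviated and purely illustrative (its pointer section is left empty).

```java
import java.util.*;

final class WordNetDataLineSketch {

    /** Returns the single-token words of one synset line, lower-cased. */
    static List<String> parseSynsetWords(String line) {
        List<String> words = new ArrayList<>();
        // The license header lines of the data files start with spaces.
        if (line.isEmpty() || line.startsWith(" ")) {
            return words;
        }
        String[] fields = line.split(" ");
        int wordCount = Integer.parseInt(fields[3], 16); // w_cnt is hexadecimal
        for (int i = 0; i < wordCount; i++) {
            String word = fields[4 + 2 * i]; // word, then lex_id, then next word ...
            // Adjectives may carry a syntactic marker such as "(p)"; strip it.
            int marker = word.indexOf('(');
            if (marker >= 0) {
                word = word.substring(0, marker);
            }
            // Multi-token entries use '_' instead of spaces; ignore them.
            if (word.contains("_")) {
                continue;
            }
            words.add(word.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        // Abbreviated, illustrative line in the data-file layout.
        String line = "00001740 29 v 04 breathe 0 take_a_breath 0 respire 0 suspire 3 "
                + "000 | draw air into and expel out of the lungs";
        System.out.println(parseSynsetWords(line));
        // prints: [breathe, respire, suspire]
    }
}
```

Whether hyphenated words count as a single token depends on the tokenizer used for the corpus; this sketch keeps them.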
Task
• Implement synonym expansion within Lucene
• Two modes
– No expansion
– Expand with synonyms
• Use WordNet as lexicon
– Current release, WordNet 3.1
• Do not use the WordNet method available for Lucene
• Only Boolean keyword search, no phrase search anymore
Query Expansion in Lucene
• Two options
– At indexing time: Add the expansions of all terms of a
document d when indexing d
– At search time: When searching for a keyword K, rewrite the
query into a disjunction of K and all expansions of K
• Note: If K is part of more than one synset, use all of them
– No disambiguation
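The indexing-time option can be illustrated with a small plain-Java sketch: for each document, the set of terms handed to the index is the set of its tokens plus all of their synonyms. The names `IndexTimeExpansionSketch` and `termsToIndex` and the sample lexicon entry are hypothetical; the lexicon entry of a word is assumed to already be the union over all synsets containing it, so no disambiguation takes place.

```java
import java.util.*;

final class IndexTimeExpansionSketch {

    /** Returns the terms to index for one document: its tokens plus their synonyms. */
    static Set<String> termsToIndex(List<String> tokens, Map<String, Set<String>> lexicon) {
        Set<String> terms = new TreeSet<>();
        for (String token : tokens) {
            terms.add(token);
            // Union over all synsets of the token, without disambiguation.
            terms.addAll(lexicon.getOrDefault(token, Collections.emptySet()));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical lexicon entry, not taken from WordNet.
        Map<String, Set<String>> lexicon = new HashMap<>();
        lexicon.put("break", new TreeSet<>(Arrays.asList("interrupt", "pause")));
        System.out.println(termsToIndex(Arrays.asList("coffee", "break"), lexicon));
        // prints: [break, coffee, interrupt, pause]
    }
}
```

Expanding at indexing time makes the index larger and indexing slower but leaves queries unchanged, while expanding at search time keeps the index small and makes each query larger.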
Complications
• Use only single-token synonyms
– Ignore all synonyms with more than one token
• Merge the synsets of words that appear as verb, noun, …
• Build a symmetric synonym lexicon
– Synonyms in WordNet are not symmetric
– Change this: read once and, when storing A->B, also store B->A
– Do not apply this rule transitively
  • Synonym relationships in WordNet do not form an equivalence relation
  • And we don’t change this
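One possible way to build such a lexicon is sketched below, assuming the synsets have already been read as lists of single-token words. All names (`SymmetricLexiconSketch`, `addPair`, `addSynset`, `synonymsOf`) and the sample words are assumptions for illustration. Synsets from the noun, verb, adjective, and adverb files end up merged because they all go into the same map.

```java
import java.util.*;

final class SymmetricLexiconSketch {

    private final Map<String, Set<String>> lexicon = new HashMap<>();

    /** Stores A->B and, to make the lexicon symmetric, also B->A. */
    void addPair(String a, String b) {
        if (a.equals(b)) {
            return; // a word is not listed as its own synonym
        }
        lexicon.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        lexicon.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    /** Adds all pairs of one synset; no transitive closure across synsets. */
    void addSynset(List<String> words) {
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                addPair(words.get(i), words.get(j));
            }
        }
    }

    Set<String> synonymsOf(String word) {
        return lexicon.getOrDefault(word, Collections.emptySet());
    }

    public static void main(String[] args) {
        SymmetricLexiconSketch lex = new SymmetricLexiconSketch();
        // Two hypothetical synsets sharing the word "b".
        lex.addSynset(Arrays.asList("a", "b"));
        lex.addSynset(Arrays.asList("b", "c"));
        System.out.println(lex.synonymsOf("b")); // prints: [a, c]
        System.out.println(lex.synonymsOf("a")); // prints: [b]
    }
}
```

In the example, "a" and "c" are both synonyms of "b" but not of each other, which is exactly the non-transitive behaviour the slide asks for.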
Program
• Submit one JAR with three functionalities
– Indexing
./java -jar myjar -index <corpusXMLfile>
– Searching without expansion
./java -jar myjar -search <keywords>
– Searching with synonym expansion
./java -jar myjar -synonym <keywords>
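A minimal sketch of how the entry point of the JAR might dispatch on the three flags is shown below. The class `CommandLineSketch`, the enum `Mode`, and the placeholder output are assumptions; the actual indexing and search logic is left out.

```java
import java.util.*;

final class CommandLineSketch {

    enum Mode { INDEX, SEARCH, SYNONYM }

    /** Maps the first command-line argument to one of the three modes. */
    static Mode parseMode(String flag) {
        switch (flag) {
            case "-index":   return Mode.INDEX;
            case "-search":  return Mode.SEARCH;
            case "-synonym": return Mode.SYNONYM;
            default: throw new IllegalArgumentException("Unknown mode: " + flag);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: -index <corpusXMLfile> | -search <keywords> | -synonym <keywords>");
            System.exit(1);
        }
        Mode mode = parseMode(args[0]);
        // Everything after the flag is the corpus file or the list of keywords.
        List<String> rest = Arrays.asList(args).subList(1, args.length);
        // Placeholder: a real program would call the indexer or searcher here.
        System.out.println(mode + " " + rest);
    }
}
```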
Output
• The program must be ready to run on GRUENAU2
• Uncompressed folder “dict” from WordNet distribution will
be in the folder where we place the JAR file
• For each query, return the following (on STDOUT)
– The PubMed IDs of all matching abstracts
– The total number of documents matching the query
Competition
• We will measure both indexing time and query times for 10
queries
• Query times will count much more (how much more will be
determined empirically)
Deliverables
• By Monday, 26.1.2015, 9:00
– Roughly three weeks
• Send by mail as ASCII to leser@informatik …
– Java source code and executable JAR
  • Include the Lucene JAR
– Results for the following queries (next slide)
Some True Results
Query                   No expansion   Synonym expansion
significant             1126           1751
monster                 0              57
information retrieval   1              17
expose                  4              204
transcription factor    3              23
break                   16             1517
activity                1742           2195
genes                   107            107
gene expression         27             36
chair                   1              335