Information Retrieval - HU

Information Retrieval
Assignment 4:
Synonym Expansion with Lucene and WordNet
Patrick Schäfer ([email protected])
Marc Bux ([email protected])
Synonym Expansion
• Idea: When a user searches a term K, implicitly search for all synonyms of K
– S AND T -> (S OR S’ OR ..) AND (T OR T’ …)
• Popular method
• Usually increases recall and decreases precision
• Requires a high-quality synonym lexicon
• Can be extended to also include hyponyms (‘banana’ is a hyponym of ‘fruit’).
Schäfer, Bux: Assignment 4
WordNet
• Lexical database with semantic relationships
• Maintained since 1985
• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets).
• ~66.000 words, ~180.000 Synsets
• Contains different relationship types: hypernymy, hyponymy, causation, antonymy, holonymy, meronymy …
Some Relationship Types
• Antonyms are words with opposite meanings:
bad is an antonym of good
• Hyponyms are specific instances of a category:
red is a hyponym of color
• Hypernyms describe categories of instances:
color is a hypernym of red
• Holonyms denote the whole in a part-whole relationship between terms:
tree is a holonym of trunk
• Meronyms are the opposite of holonyms (they denote the part):
trunk is a meronym of tree
Task
• Implement synonym expansion within Lucene (v6.3) for the IMDB movie plots.
• You can reuse your existing code from assignment 3 (using word tokenization and stop word removal, but no stemming).
• Use WordNet as lexicon
– current release, WordNet 3.1
• For simplicity, we will only consider Boolean (AND, OR, NOT) term search.
• No phrase or proximity search anymore
Example Synsets from WordNet
[well]: [considerably][intimately][easily][comfortably][wellspring][substantially]
[advantageously][good][swell][fountainhead]
[good]: [commodity][expert][sound][respectable][secure][estimable][effective]
[honest][serious][ripe][near][unspoiled][dear][just][salutary][goodness][proficient]
[skilful][adept][thoroughly][soundly][unspoilt][dependable][right][upright]
[beneficial][safe][well][honorable][full][practiced][skillful]
[better]: [expert][meliorate][sound][respectable][best][secure][good][estimable]
[wagerer][effective][honest][serious][ripe][easily][near][unspoiled][dear][just]
[salutary][proficient][skilful][adept][break][bettor][amend][considerably][intimately]
[unspoilt][dependable][comfortably][right][upright][ameliorate][improve][beneficial]
[safe][well][punter][substantially][advantageously][honorable][full][practiced]
[skillful]
WordNet
• You can search synsets directly at WordNet:
http://wordnetweb.princeton.edu/perl/webwn
Query Expansion in Lucene
• There are two options:
• At indexing time: Add all expansions to all terms of a document d when indexing d.
• At search time: When searching a keyword K, rewrite the query into a disjunction of all expansions of K.
– Query: plot:Berlin AND plot:wall AND type:television
• plot:berlin AND (plot:bulwark OR plot:fence OR plot:palisade OR plot:paries OR plot:rampart OR plot:surround OR plot:wall) AND (type:telecasting OR type:television OR type:telly OR type:tv OR type:video)
• Note: If K is part of more than one synset, use all
– No disambiguation
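The search-time option can be illustrated on the string level. The following is a minimal sketch, not the reference solution: the class and method names are hypothetical, the synonym map is assumed to be already built from WordNet, and only flat AND queries are handled (OR, NOT and parentheses are left out). A real solution would more likely build Lucene query objects instead of strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, used only for illustration.
class QueryExpansionSketch {

    /** Expands one "field:term" clause into a disjunction over the term and its synonyms. */
    static String expandClause(String clause, Map<String, Set<String>> synonyms) {
        int colon = clause.indexOf(':');
        String field = clause.substring(0, colon);
        // Assumption: terms are lowercased, as in the example on the slide.
        String term = clause.substring(colon + 1).toLowerCase();

        // Sorted set holding the term itself plus all of its synonyms (no disambiguation).
        Set<String> expansions = new TreeSet<>();
        expansions.add(term);
        expansions.addAll(synonyms.getOrDefault(term, Collections.emptySet()));

        if (expansions.size() == 1) {
            return field + ":" + term;
        }
        List<String> parts = new ArrayList<>();
        for (String e : expansions) {
            parts.add(field + ":" + e);
        }
        return "(" + String.join(" OR ", parts) + ")";
    }

    /** Rewrites a purely conjunctive query such as "plot:Berlin AND plot:wall". */
    static String expandQuery(String query, Map<String, Set<String>> synonyms) {
        List<String> clauses = new ArrayList<>();
        for (String clause : query.split("\\s+AND\\s+")) {
            clauses.add(expandClause(clause.trim(), synonyms));
        }
        return String.join(" AND ", clauses);
    }
}
```

With a synonym map that contains wall -> {fence, rampart}, the query plot:Berlin AND plot:wall would be rewritten to plot:berlin AND (plot:fence OR plot:rampart OR plot:wall).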
Getting Started
• Download WordNet 3.1 files at
– http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz
• Extract noun, verb, adj, adv files:
– data.[noun,verb,adj,adv] (synsets)
– [noun,verb,adj,adv].exc (base forms)
• Parse synsets from these plain files using the syntax described at:
– http://wordnet.princeton.edu/man/wndb.5WN.html
Data File Format
• Each data file begins with a copyright notice. Skip this.
• Each synset is encoded in one line.
• Each line has the format:
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
• w_cnt: Two-digit hexadecimal integer indicating the number of words.
• Example line (synset):
00007846 03 n 06 person 0 individual 0 someone 0 somebody 0 mortal 0 soul 0421 @ 00004475 n 0000 @ 00007347 n 0000 #m 07958392 n 0000 + 01562007 a 0501 + 00388736 v 0203 + 04626138 n 0101 + 00729535 v 0101 %p 04624919 n 0000 %p …
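Reading the words of a synset from such a line could look as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, fields are assumed to be separated by single spaces, and lines of the copyright notice are assumed to start with whitespace rather than a digit. Pointers, frames and the gloss are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for illustration.
class DataLineParserSketch {

    /** Returns the words of one synset line, or an empty list for a header line. */
    static List<String> parseWords(String line) {
        List<String> words = new ArrayList<>();
        // Assumption: synset lines start with the numeric synset_offset, header lines do not.
        if (line.isEmpty() || !Character.isDigit(line.charAt(0))) {
            return words;
        }
        String[] fields = line.split(" ");
        // fields[0] = synset_offset, [1] = lex_filenum, [2] = ss_type, [3] = w_cnt,
        // followed by w_cnt pairs of (word, lex_id).
        int wordCount = Integer.parseInt(fields[3], 16); // w_cnt is hexadecimal
        for (int i = 0; i < wordCount; i++) {
            words.add(fields[4 + 2 * i].toLowerCase());
        }
        return words;
    }
}
```

Parsing w_cnt with radix 16 matters for large synsets: a value of 0a means ten words, not zero.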
Exception List File Format
• The first field of each line is an inflected form, followed by a space-separated list of one or more base forms of the word.
• Examples:
better good well
bigger big
• Meaning: all synsets of good and well apply to better (but not the reverse).
Complications I
• Use only single-token synonyms
– Ignore all synonyms with more than one token
– These are marked by a “_” in the name (e.g., house_of_cards)
• Special adjective syntax
– Remove (p), (a) and (ip) from adjectives (e.g. galore(ip)).
– https://wordnet.princeton.edu/man/wninput.5WN.html
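Both rules can be applied to each word right after it is read from a data file. The sketch below is one possible way to do this; the class and method names are hypothetical, and lowercasing is an assumption carried over from the tokenization used for the index.

```java
// Hypothetical helper class, used only for illustration.
class WordFilterSketch {

    /** Returns the cleaned single-token word, or null if the entry should be ignored. */
    static String normalize(String raw) {
        // Strip a trailing adjective marker: (p), (a) or (ip).
        String word = raw.replaceAll("\\((p|a|ip)\\)$", "");
        // Multi-token entries contain '_' and are skipped.
        if (word.contains("_")) {
            return null;
        }
        return word.toLowerCase();
    }
}
```

Here galore(ip) becomes galore, while house_of_cards is dropped.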
Complications II
• Merge synsets of words appearing in the verb, noun, adj, adv files, such as reason (noun) and reason (verb).
• Consider a synset as a set
– Example: Synset of cause = {reason, grounds}
– Create the following synonym relations: cause-reason, cause-grounds, reason-grounds and all reverse relations reason-cause, grounds-cause, grounds-reason.
• BUT do not apply this rule transitively
– Example: cause = {grounds} and grounds = {earth} should not create cause-earth!
– Syn-relationships in WordNet do not form an equivalence relation
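One way to obtain symmetric but non-transitive relations is to store, for every word, the union of the other members of each synset it occurs in, and never to follow those links further. A minimal sketch with hypothetical class and method names:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class SynonymMapSketch {

    /** Adds synonym relations between all pairs of distinct words in one synset. */
    static void addSynset(Map<String, Set<String>> synonyms, List<String> synset) {
        for (String word : synset) {
            Set<String> others = synonyms.computeIfAbsent(word, k -> new HashSet<>());
            for (String other : synset) {
                if (!other.equals(word)) {
                    others.add(other); // direct relation only, no transitive closure
                }
            }
        }
    }
}
```

After adding the synsets {cause, grounds} and {grounds, earth}, the entry for grounds contains both cause and earth, but the entry for cause contains only grounds.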
Complications III
• The exception lists are not symmetric. The inflected form is merged with all synsets of its base forms but not the reverse.
• An exception given in adj.exc only adds the synsets defined in the data.adj file. An exception in noun.exc only adds the synsets defined in the data.noun file.
• So you have to keep the synsets in noun, adj, adv, verb separated for the exception lists.
• I.e., given an exception in adj.exc: better good well
syns(better) := syns_adj(better) ∪ syns_adj(good) ∪ syns_adj(well) ∪ {good} ∪ {well}
But not syns(well) := syns_adj(better) ∪ …
And not syns(better) := syns_noun(better) ∪ … syns_noun(well)
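The asymmetric rule can be sketched as a function that processes one line of a .exc file against the synonym map of the same part of speech only. The names below are hypothetical; removing the inflected form from its own synonym set is an assumption, not something the slides prescribe.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class ExceptionListSketch {

    /**
     * Applies one exception line, e.g. "better good well" from adj.exc.
     * posSynonyms holds only the synonyms parsed from the matching data file
     * (data.adj for adj.exc); results are written into merged.
     */
    static void applyException(String line,
                               Map<String, Set<String>> posSynonyms,
                               Map<String, Set<String>> merged) {
        String[] forms = line.trim().split("\\s+");
        String inflected = forms[0];
        Set<String> target = merged.computeIfAbsent(inflected, k -> new HashSet<>());
        target.addAll(posSynonyms.getOrDefault(inflected, Collections.emptySet()));
        for (int i = 1; i < forms.length; i++) {
            String base = forms[i];
            target.add(base); // the base form itself counts as a synonym
            target.addAll(posSynonyms.getOrDefault(base, Collections.emptySet()));
        }
        // Assumption: a word is not listed as its own synonym.
        target.remove(inflected);
        // Nothing is added to the entries of the base forms (no reverse direction).
    }
}
```

Only the entry of the inflected form changes; the entries of good and well are left untouched, which mirrors the "but not the reverse" rule.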
Complications IV
• The exception files define base and inflected forms for irregular words. WordNet applies lemmatization for regular words based on rules like big, bigger, biggest. But you can skip this.
https://wordnet.princeton.edu/man/morphy.7WN.html
• Some true results for reference
Only synsets: 60993 words with 153394 synonyms.
Synsets & exception lists: 66126 words with 176476 synonyms.
Getting started
• In BooleanQueryWordnet.java, implement the functions:
– public void buildSynsets(String wordnetDir)
(used to parse the WordNet files and build the synonym index)
– public void buildIndices(String plotFile)
(used to parse the file and build the Lucene index)
– public Set<String> booleanQuery(String queryString)
(parses the query string and returns the title lines of any entries in the plotFile matching the query)
– public void close()
(can be used to close the Lucene index, thread pool, etc.)
Test your Program
• We provide you with modified versions of the following:
– queries_wordnet.txt file containing exemplary queries
– results_wordnet.txt file containing the expected results of running these queries
– main method for testing your code (which expects as parameters the corpus file, the queries file and the results file)
• You can check your synonym expansion for plausibility on the WordNet website:
– http://wordnetweb.princeton.edu/perl/webwn
Deliverables
• by Thursday, 26.01.17, 23:59 (midnight)
• submission: archive (zip, tar.gz)
– contains Java source files, any used libraries, and your compiled jar named BooleanQueryWordnet.jar
– file name (of submitted archive): your group name
• upload to https://hu.berlin/24377
– if this doesn’t work, [email protected]
• test your jar before submitting by running our queries on gruenau2
– java -jar BooleanQueryWordnet.jar <plot list file> <wordnetDir> <queries file> <results file>
– you might have to increase the JVM’s heap size (e.g., -Xmx8g)
– your jar must run and answer all test queries in ‘queries_wordnet.txt’ correctly
Presentation of Solutions
• You are able to pick when and what you’d like to present (first-come-first-served):
– Monday: https://dudle.inf.tu-dresden.de/inforet_ue4_mo/
– Tuesday: https://dudle.inf.tu-dresden.de/inforet_ue4_tu/
• Presentations will be given on 30./31.01.17
• One team can present their Lucene WordNet Indexer.
• Two teams can present their Lucene Query Expansion.
Competition
• Search as fast as possible.
• Stay under 40 GB memory usage.
• We will call the program using our eval tool:
– we will use different queries and the -Xmx40g parameter
• We will evaluate two criteria:
a) The total query time.
b) The total time for building the index.
Checklist
Again, before submitting your results, make sure that you
1. did not change or remove any code from BooleanQueryWordnet.java
2. did not alter the functions’ signatures (types of params, return values)
3. only use the default constructor and don’t change its parameters
4. did not change the class or package name
5. named your jar BooleanQueryWordnet.jar
6. tested your jar on gruenau2 by running
java -jar BooleanQueryWordnet.jar plot.list wordNetDir queries_wordnet.txt results_wordnet.txt
(you might have to increase Java heap space, e.g. -Xmx6g)
7. ascertained that the queries in queries_wordnet.txt were answered correctly
8. uploaded a zip file named after your group name.
Next Steps
• this week: evaluation of assignment 3
• next weeks: Q/A sessions for assignment 4.
• Upload your solution by Thursday, 26.01.17, 23:59 (midnight)