Information Retrieval
Assignment 4: Synonym Expansion with Lucene and WordNet
Patrick Schäfer ([email protected])
Marc Bux ([email protected])

Synonym Expansion
• Idea: When a user searches a term K, implicitly search for all synonyms of K
  – S AND T -> (S OR S' OR ..) AND (T OR T' …)
• Popular method
• Usually increases recall and decreases precision
• Requires a high-quality synonym lexicon
• Can be extended to also include hyponyms ('banana' is a hyponym of 'fruit').

Schäfer, Bux: Assignment 4

WordNet
• Lexical database with semantic relationships
• Maintained since 1985
• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets).
• ~66,000 words, ~180,000 synsets
• Contains different relationship types: hypernymy, hyponymy, causation, antonymy, holonymy, meronymy, …

Some Relationship Types
• Antonyms are words with opposite meanings: bad is an antonym of good
• Hyponyms are specific instances of a category: red is a hyponym of color
• Hypernyms describe categories of instances: color is a hypernym of red
• Holonyms define a part-whole relationship between terms (one is part of the other): tree is a holonym of trunk
• Meronyms are the opposite of holonyms: trunk is a meronym of tree

Task
• Implement synonym expansion within Lucene (v6.3) for the IMDB movie plots.
• You can reuse your existing code from assignment 3 (using word tokenization and stop word removal, but no stemming).
• Use WordNet as lexicon
  – current release, WordNet 3.1
• For simplicity, we will only consider Boolean (AND, OR, NOT) term search.
• No phrase or proximity search anymore

Example Synsets from WordNet
[well]: [considerably] [intimately] [easily] [comfortably] [wellspring] [substantially] [advantageously] [good] [swell] [fountainhead]
[good]: [commodity] [expert] [sound] [respectable] [secure] [estimable] [effective] [honest] [serious] [ripe] [near] [unspoiled] [dear] [just] [salutary] [goodness] [proficient] [skilful] [adept] [thoroughly] [soundly] [unspoilt] [dependable] [right] [upright] [beneficial] [safe] [well] [honorable] [full] [practiced] [skillful]
[better]: [expert] [meliorate] [sound] [respectable] [best] [secure] [good] [estimable] [wagerer] [effective] [honest] [serious] [ripe] [easily] [near] [unspoiled] [dear] [just] [salutary] [proficient] [skilful] [adept] [break] [bettor] [amend] [considerably] [intimately] [unspoilt] [dependable] [comfortably] [right] [upright] [ameliorate] [improve] [beneficial] [safe] [well] [punter] [substantially] [advantageously] [honorable] [full] [practiced] [skillful]

WordNet
• You can search synsets directly at WordNet: http://wordnetweb.princeton.edu/perl/webwn

Query Expansion in Lucene
• There are two options:
• At indexing time: Add all expansions to all terms of a document d when indexing d.
• At search time: When searching a keyword K, rewrite the query into a disjunction of all expansions of K.
  – Query: plot:Berlin AND plot:wall AND type:television
    • plot:berlin AND (plot:bulwark OR plot:fence OR plot:palisade OR plot:paries OR plot:rampart OR plot:surround OR plot:wall) AND (type:telecasting OR type:television OR type:telly OR type:tv OR type:video)
• Note: If K is part of more than one synset, use all of them
  – No disambiguation

Getting Started
• Download the WordNet 3.1 files at
  – http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz
• Extract the noun, verb, adj, adv files:
  – data.[noun,verb,adj,adv] (synsets)
  – [noun,verb,adj,adv].exc (base forms)
• Parse the synsets from these plain files using the syntax described at:
  – http://wordnet.princeton.edu/man/wndb.5WN.html

Data File Format
• Each data file begins with a copyright notice. Skip this.
• Each synset is encoded in one line.
• Each line has the format:
  synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
• w_cnt: Two-digit hexadecimal integer indicating the number of words.
• Example line (synset):
  00007846 03 n 06 person 0 individual 0 someone 0 somebody 0 mortal 0 soul 0421 @ 00004475 n 0000 @ 00007347 n 0000 #m 07958392 n 0000 + 01562007 a 0501 + 00388736 v 0203 + 04626138 n 0101 + 00729535 v 0101 %p 04624919 n 0000 %p …

Exception List File Format
• The first field of each line is an inflected form, followed by a space-separated list of one or more base forms of the word.
• Examples:
  better good well
  bigger big
• Meaning: all synsets of good and well apply to better (but not the reverse).

Complications I
• Use only single-token synonyms
  – Ignore all synonyms with more than one token
  – These contain a "_" in the name (e.g., house_of_cards)
• Special adjective syntax
  – Remove (p), (a) and (ip) from adjectives (e.g., galore(ip)).
  – https://wordnet.princeton.edu/man/wninput.5WN.html

Complications II
• Merge the synsets of words appearing in several of the verb, noun, adj, adv files, such as reason (noun) and reason (verb).
• Consider a synset as a set
  – Example: Synset of cause = {reason, grounds}
  – Create the following synonym relations: cause-reason, cause-grounds, reason-grounds and all reverse relations reason-cause, grounds-cause, grounds-reason.
• BUT do not apply this rule transitively
  – Example: cause = {grounds} and grounds = {earth} should not create cause-earth!
  – The synonym relation in WordNet is not an equivalence relation

Complications III
• The exception lists are not symmetric. The inflected form is merged with all synsets of its base forms, but not the reverse.
• An exception given in adj.exc only adds the synsets defined in the data.adj file. An exception in noun.exc only adds the synsets defined in the data.noun file.
• So you have to keep the synsets in noun, adj, adv, verb separated for the exception lists.
• I.e., given an exception in adj.exc: better good well
  syns(better) := syns_adj(better) ∪ syns_adj(good) ∪ syns_adj(well) ∪ {good, well}
  But not: syns(well) := syns_adj(better) ∪ …
  And not: syns(better) := syns_noun(better) ∪ … syns_noun(well)

Complications IV
• The exception files define base and inflected forms for irregular words. WordNet applies lemmatization for regular words based on rules like big, bigger, biggest. But you can skip this.
  https://wordnet.princeton.edu/man/morphy.7WN.html
• Some true results for reference
  Only synsets: 60993 words with 153394 synonyms.
  Synsets & exception lists: 66126 words with 176476 synonyms.

Getting Started
• In BooleanQueryWordnet.java, implement the functions:
  – public void buildSynsets(String wordnetDir)
    (used to parse the WordNet files and build the synonym index)
  – public void buildIndices(String plotFile)
    (used to parse the file and build the Lucene index)
  – public Set<String> booleanQuery(String queryString)
    (parses the query string and returns the title lines of any entries in the plotFile matching the query)
  – public void close()
    (can be used to close the Lucene index, thread pool, etc.)
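
The rules from Complications I to III can be sketched in plain Java, without Lucene. This is only a minimal sketch under stated assumptions, not the reference solution: the class name SynonymIndexSketch and its methods are hypothetical and not part of the provided skeleton, words are lowercased on the assumption that the index terms are lowercased too, and header lines are assumed to start with two spaces. The sample lines in main are made up and merely follow the documented line layout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the provided skeleton. */
final class SynonymIndexSketch {

    /** Returns the usable single-token words of one synset line. */
    static List<String> wordsOf(String line) {
        String[] f = line.split(" ");
        int wCnt = Integer.parseInt(f[3], 16);          // w_cnt is a two-digit hex number
        List<String> words = new ArrayList<>();
        for (int i = 0; i < wCnt; i++) {
            String w = f[4 + 2 * i];                     // each word is followed by its lex_id
            w = w.replaceAll("\\((p|a|ip)\\)$", "");     // drop adjective markers such as (ip)
            if (!w.contains("_")) {                      // ignore multi-token synonyms
                words.add(w.toLowerCase());              // assumption: index terms are lowercased
            }
        }
        return words;
    }

    /** Builds the synonym map for ONE data file (noun, verb, adj or adv). */
    static Map<String, Set<String>> fromDataLines(List<String> lines) {
        Map<String, Set<String>> syn = new HashMap<>();
        for (String line : lines) {
            // assumption: copyright header lines start with two spaces
            if (line.isEmpty() || line.startsWith("  ")) continue;
            List<String> words = wordsOf(line);
            for (String a : words) {
                Set<String> s = syn.computeIfAbsent(a, k -> new TreeSet<>());
                for (String b : words) {
                    if (!a.equals(b)) s.add(b);          // pairs within one synset only, never transitive
                }
            }
        }
        return syn;
    }

    /** Applies ONE exception file to the map of the SAME word class, in one direction only. */
    static void applyExceptions(Map<String, Set<String>> syn, List<String> excLines) {
        Map<String, Set<String>> additions = new HashMap<>();
        for (String line : excLines) {
            String[] f = line.trim().toLowerCase().split("\\s+");
            if (f[0].isEmpty() || f[0].contains("_")) continue;
            Set<String> extra = additions.computeIfAbsent(f[0], k -> new TreeSet<>());
            for (int i = 1; i < f.length; i++) {
                if (f[i].contains("_")) continue;
                extra.add(f[i]);                                  // the base form itself
                extra.addAll(syn.getOrDefault(f[i], Set.of()));   // plus the base form's synonyms
            }
        }
        // Write back only after reading, so one exception never feeds into another.
        additions.forEach((w, extra) -> {
            extra.remove(w);
            syn.computeIfAbsent(w, k -> new TreeSet<>()).addAll(extra);
        });
    }

    /** Merges the four per-file maps into the final synonym index. */
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perFile) {
        Map<String, Set<String>> all = new HashMap<>();
        for (Map<String, Set<String>> m : perFile) {
            m.forEach((w, s) -> all.computeIfAbsent(w, k -> new TreeSet<>()).addAll(s));
        }
        return all;
    }

    public static void main(String[] args) {
        // Made-up lines that only follow the documented layout; they are not real WordNet entries.
        Map<String, Set<String>> noun = fromDataLines(List.of(
            "00000001 03 n 03 cause 0 reason 0 grounds 0 000 | hypothetical gloss",
            "00000002 03 n 02 grounds 0 earth 0 000 | hypothetical gloss"));
        Map<String, Set<String>> adj = fromDataLines(List.of(
            "00000003 00 a 02 good 0 sound 0 000 | hypothetical gloss",
            "00000004 00 a 02 well 0 galore(ip) 0 000 | hypothetical gloss"));
        applyExceptions(adj, List.of("better good well"));
        System.out.println(merge(List.of(noun, adj)));
    }
}
```

Keeping one map per data file until the exception lists have been applied is what prevents an entry in adj.exc from pulling in noun synsets; the merge across word classes happens only at the very end. A real buildSynsets(String wordnetDir) would presumably read the lines from the data.* and *.exc files instead of using in-memory lists.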

Test your Program
• We provide you with a modified:
  – queries_wordnet.txt file containing exemplary queries
  – results_wordnet.txt file containing the expected results of running these queries
  – main method for testing your code (which expects as parameters the corpus file, the queries file and the results file)
• You can check your synonym expansion for plausibility on the WordNet website:
  – http://wordnetweb.princeton.edu/perl/webwn

Deliverables
• by Thursday, 26.01.17, 23:59 (midnight)
• submission: archive (zip, tar.gz)
  – contains Java source files, any used libraries, and your compiled jar named BooleanQueryWordnet.jar
  – file name (of submitted archive): your group name
• upload to https://hu.berlin/24377
  – if this doesn't work, [email protected]
• test your jar before submitting by running our queries on gruenau2
  – java -jar BooleanQueryWordnet.jar <plotlist file> <wordnetDir> <queries file> <results file>
  – you might have to increase the JVM's heap size (e.g., -Xmx8g)
  – your jar must run and answer all test queries in 'queries_wordnet.txt' correctly

Presentation of Solutions
• You are able to pick when and what you'd like to present (first-come-first-served):
  – Monday: https://dudle.inf.tu-dresden.de/inforet_ue4_mo/
  – Tuesday: https://dudle.inf.tu-dresden.de/inforet_ue4_tu/
• Presentations will be given on 30./31.01.17
• One team can present their Lucene WordNet Indexer.
• Two teams can present their Lucene Query Expansion.

Competition
• Search as fast as possible.
• Stay under 40 GB memory usage.
• We will call the program using our eval tool:
  – we will use different queries and the -Xmx40g parameter
• We will evaluate two measures:
  a) The total query time.
  b) The total time for building the index.

Checklist
Again, before submitting your results, make sure that you
1. did not change or remove any code from BooleanQueryWordnet.java
2. did not alter the functions' signatures (types of params, return values)
3. only use the default constructor and don't change its parameters
4. did not change the class or package name
5. named your jar BooleanQueryWordnet.jar
6. tested your jar on gruenau2 by running
   java -jar BooleanQueryWordnet.jar plot.list wordNetDir queries_wordnet.txt results_wordnet.txt
   (you might have to increase Java heap space, e.g. -Xmx6g)
7. ascertained that the queries in queries_wordnet.txt were answered correctly
8. uploaded a zip file named after your group name.

Next Steps
• this week: evaluation of assignment 3
• next weeks: Q/A sessions for assignment 4.
• Upload your solution by Thursday, 26.01.17, 23:59 (midnight)
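
Supplement to the slide on query expansion in Lucene: the search-time option (rewriting each keyword K into a disjunction of all its expansions) can be illustrated on plain query strings. This is a hedged sketch only. The class QueryExpansionSketch and its methods are hypothetical, the synonym map is assumed to have been built beforehand, terms are lowercased as in the slide's example, and only whitespace-separated field:word terms with the operators AND, OR, NOT and optional parentheses are handled. A real solution would presumably construct Lucene query objects rather than strings.

```java
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeSet;

/** Hypothetical helper that rewrites a Boolean query string at search time. */
final class QueryExpansionSketch {

    /** Expands every term of the query and leaves the operators untouched. */
    static String expand(String query, Map<String, Set<String>> synonyms) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND") || token.equals("OR") || token.equals("NOT")) {
                out.add(token);
                continue;
            }
            // Peel off parentheses glued to the term, e.g. "(plot:wall" or "plot:wall)".
            int start = 0;
            int end = token.length();
            while (start < end && token.charAt(start) == '(') start++;
            while (end > start && token.charAt(end - 1) == ')') end--;
            String term = token.substring(start, end);
            out.add(token.substring(0, start) + expandTerm(term, synonyms) + token.substring(end));
        }
        return out.toString();
    }

    /** Turns "field:word" into "(field:syn1 OR field:syn2 OR field:word ...)". */
    static String expandTerm(String term, Map<String, Set<String>> synonyms) {
        int colon = term.indexOf(':');
        String field = term.substring(0, colon + 1);          // empty if the term has no field
        String word = term.substring(colon + 1).toLowerCase(); // assumption: lowercased index terms
        Set<String> all = new TreeSet<>(synonyms.getOrDefault(word, Set.of()));
        all.add(word);                                         // the keyword itself stays in the query
        if (all.size() == 1) {
            return field + word;                               // nothing to expand
        }
        StringJoiner or = new StringJoiner(" OR ", "(", ")");
        for (String w : all) {
            or.add(field + w);
        }
        return or.toString();
    }

    public static void main(String[] args) {
        // Synonyms copied from the example on the query expansion slide.
        Map<String, Set<String>> synonyms = Map.of(
            "wall", Set.of("bulwark", "fence", "palisade", "paries", "rampart", "surround"),
            "television", Set.of("telecasting", "telly", "tv", "video"));
        System.out.println(expand("plot:Berlin AND plot:wall AND type:television", synonyms));
    }
}
```

Because every expansion of K is looked up directly in the synonym map and never expanded again, the rewrite stays consistent with the rule that synonym relations must not be applied transitively.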