Word Sense Disambiguation of Adjectives Using Probabilistic Networks

Gerald Chao and Michael G. Dyer
Computer Science Department, University of California, Los Angeles
Los Angeles, California 90095
[email protected], [email protected]

Abstract

In this paper, word sense disambiguation (WSD) accuracy achievable by a probabilistic classifier, using very minimal training sets, is investigated. We made the assumption that there are no tagged corpora available and identified what information, needed by an accurate WSD system, can and cannot be automatically obtained. The lessons learned can then be used to focus on what knowledge needs manual annotation. Our system, named the Bayesian Hierarchical Disambiguator (BHD), uses the Internet, arguably the largest corpus in existence, to address the sparse data problem, and uses WordNet's hierarchy for semantic contextual features. In addition, Bayesian networks are automatically constructed to represent knowledge learned from training sets by modeling the selectional preference of adjectives. These networks are then applied to disambiguation by performing inferences on unseen adjective-noun pairs. We demonstrate that this system is able to disambiguate adjectives in unrestricted text at good initial accuracy rates without the need for tagged corpora. The learning and extensibility aspects of the model are also discussed, showing how tagged corpora and additional context can be incorporated easily to improve accuracy, and how this technique can be used to disambiguate other types of word pairs, such as verb-noun and adverb-verb pairs.

1 Introduction

Word sense disambiguation (WSD) remains an open problem in Natural Language Processing (NLP). Being able to identify the correct sense of an ambiguous word is important for many NLP tasks, such as machine translation, information retrieval, and discourse analysis. The WSD problem is exacerbated by the large number of senses of commonly used words and by the difficulty in determining the relevant contextual features most suitable to the task. The absence of semantically tagged corpora makes probabilistic techniques, shown to be very effective by speech recognition and syntactic tagging research, difficult to employ due to the sparse data problem.

Early NLP systems limited their domain and required manual knowledge engineering. More recent works take advantage of machine readable dictionaries such as WordNet (Miller, 1990) and Roget's Online Thesaurus. Statistical techniques, both supervised learning from tagged corpora (Yarowsky, 1992), (Ng and Lee, 1996), and unsupervised learning (Yarowsky, 1995), (Resnik, 1997), have been investigated. There are also hybrid models that incorporate both statistical and symbolic knowledge (Wiebe et al., 1998), (Agirre and Rigau, 1996). Supervised models have shown promising results, but the lack of sense tagged corpora often requires laboriously tagging training sets manually. Depending on the technique, unsupervised models can result in ill-defined senses. Many have not been evaluated with large vocabularies or full sets of senses. Hybrid models, using various heuristics, have demonstrated good accuracy but are difficult to compare due to variations in the evaluation procedures, as discussed in Resnik and Yarowsky (1997).

In our Bayesian Hierarchical Disambiguator (BHD) model, we attempt to address some of the main issues faced by today's WSD systems, namely: 1) the sparse data problem; 2) the selection of a feature set that can be trained upon easily without sacrificing accuracy; and 3) the scalability of the system to disambiguate unrestricted text. The first two problems can be attributed to the lack of tagged corpora, while the third results from the need for hand-annotated text as a method of circumventing the first two problems. We will address the first two issues by identifying contexts in which knowledge can be obtained automatically, as opposed to those that require minimal manual tagging. The effectiveness of the BHD model is then tested on unrestricted text, thus addressing the third issue.

2 Problem Formulation

WSD can be described as a classification task, where the i-th sense (W#i) of a word (W) is classified as the correct tag, given the word and usually some surrounding context. For example, to disambiguate the adjective "great" in the sentence "The great hurricane devastated the region", a WSD system should disambiguate "great" as large in size rather than the good or excellent meaning. Using probability notation, this procedure can be stated as max_i(Pr(great#i | "great", "the", "hurricane", "devastated", "the", "region")). That is, given the word "great" and its context, classify the sense great#i with the highest probability as the correct one. However, a large context, such as the whole sentence, is rarely used, due to the difficulty in estimating the probability of this particular set of words occurring. Therefore, the context is usually narrowed, such as to a number of surrounding words. Additionally, surrounding syntactic features and semantic knowledge are sometimes used. The difficulty is in choosing the right context, or set of features, that will optimize the classification. A larger context improves the classification accuracy at the expense of increasing the number of parameters (typically learned from large training data).

In our BHD model, a minimal context composed of only the adjective, the noun, and the noun's semantic features obtained from WordNet is used. Using the above example, only "great", "hurricane" and hurricane's features encoded in WordNet's hierarchy, as in hurricane ISA cyclone ISA windstorm ISA violent storm ..., are used as context.
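This minimal context can be sketched in a few lines. The tiny ISA table below is a hand-made stand-in for WordNet's hierarchy (an assumption for illustration only, not WordNet's actual data or API):

```python
# Sketch of the context BHD uses: the adjective, the noun, and the noun's
# semantic features (its ISA chain). The ISA table is an invented stand-in
# for a small slice of the WordNet hierarchy.
ISA = {
    "hurricane": "cyclone",
    "cyclone": "windstorm",
    "windstorm": "violent storm",
    "violent storm": "storm",
    "storm": "atmospheric phenomenon",
}

def noun_features(noun):
    """Walk the ISA links upward, collecting the noun's semantic features."""
    chain = []
    while noun in ISA:
        noun = ISA[noun]
        chain.append(noun)
    return chain

# The full context for disambiguating "great hurricane":
context = ("great", "hurricane", noun_features("hurricane"))
```

In a real system the chain would be read from WordNet itself; the point here is only that the feature set is a single upward traversal per noun.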
Therefore, the classification performed by BHD can be written as max_i(Pr(great#i | "great", "hurricane", cyclone, windstorm, ...)), or more generically, max_i(Pr(adj#i | adj, noun, <NFs>)), where <NFs> denotes the noun features. By using the Bayesian inversion formula, this equation becomes

    max_i( Pr(adj, noun, <NFs> | adj#i) × Pr(adj#i) / Pr(adj, noun, <NFs>) )    (1)

This context is chosen because it does not need an annotated training set, and these semantic features are used to build a belief about the nouns an adjective sense typically modifies, i.e., the selectional preferences of adjectives. For example, having learned about hurricane, the system can infer the most probable disambiguation of "great typhoon", "great tornado", or more distal concepts such as earthquakes and floods.

3 Establishing the Parameters

As shown in equation 1, BHD requires two parameters: 1) the likelihood term Pr(adj, noun, <NFs> | adj#i) and 2) the prior term Pr(adj#i). The prior term represents the knowledge of how frequently a sense of an adjective is used without any contextual information. For example, if great#2 (sense: good, excellent) is used frequently while great#1 is less commonly used, then Pr(great#2) would be larger than Pr(great#1), in proportion to the usage of the two senses. Although WordNet orders the senses of a polysemous word according to usage, the actual proportions are not quantified. Therefore, to compute the priors, one could iterate over all English nouns and sum the instances of great#1-noun versus great#2-noun pairs. But since we assume that no training set exists (the worst possible case of the sparse data problem), these counts need to be estimated from indirect sources.

3.1 The Sparse Data Problem

The technique used to address data sparsity, as first proposed by Mihalcea and Moldovan (1998), treats the Internet as a corpus to automatically disambiguate word pairs. Using the previous example, to disambiguate the adjective in "great hurricane", two synonym lists, ("great, large, big") and ("great, neat, good"), are retrieved from WordNet. (Some synonyms and other senses are omitted here for brevity.) Two queries, ("great hurricane" or "large hurricane" or "big hurricane") and ("great hurricane" or "neat hurricane" or "good hurricane"), are issued to Altavista, which reports that 1100 and 914 pages contain these terms, respectively. The query with the higher count (#1) is classified as the correct sense. For further details, please refer to Mihalcea and Moldovan (1998).

In our model, the counts from Altavista are incorporated as parameter estimations within our probabilistic framework. In addition to disambiguating the adjectives, we also need to estimate the usage of the adjective#i-noun pair. For simplicity, the counts from Altavista are assigned wholesale to the disambiguated adjective sense, e.g., the usage of great#1-hurricane is 1100 times and great#2-hurricane is zero times. This is a great simplification, since in many adjective-noun pairs multiple meanings are likely. For instance, in "great steak", both senses of "great" (large steak vs. tasty steak) are equally likely. However, given no other information, this simplification is used as a gross approximation of Counts(adj#i-noun), which becomes Pr(adj#i-noun) by dividing the counts by a normalizing constant, Σ Counts(adj#i-all nouns). These probabilities are then used to compute the priors, described in the next section.
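The winner-take-all count assignment described above can be sketched as follows. The counts are the paper's "great hurricane" example; the count-fetching step against a live search engine is assumed away, and the normalization is shown over a single noun for brevity:

```python
# Sketch of turning raw web counts into Pr(adj#i-noun) estimates, under the
# simplification in the text: a noun's whole count goes to the sense whose
# synonym-substituted query scored highest.
counts = {
    ("great#1", "hurricane"): 1100,  # "great/large/big hurricane" query
    ("great#2", "hurricane"): 914,   # "great/neat/good hurricane" query
}

def assign_counts(counts, noun):
    """Winner-take-all: give the noun's total to the highest-counted sense."""
    pairs = [p for p in counts if p[1] == noun]
    best = max(pairs, key=lambda p: counts[p])
    return {p: (counts[p] if p == best else 0) for p in pairs}

def pair_probability(assigned, sense, noun):
    """Pr(adj#i-noun) = Counts(adj#i-noun) / normalizing constant."""
    total = sum(assigned.values())
    return assigned[(sense, noun)] / total if total else 0.0

assigned = assign_counts(counts, "hurricane")
# great#1-hurricane keeps 1100; great#2-hurricane is zeroed
```

In the full model the normalizing constant sums Counts(adj#i-noun) over all nouns paired with the adjective, not just one.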
Using this technique, two major problems are addressed. Not only are the adjectives automatically disambiguated, but the number of occurrences of the word pairs is also estimated. The need for hand-annotated semantic corpora is thus avoided. However, the statistics gathered by this technique are approximations, so the noise they introduce does require supervised training to minimize error, as will be described.

3.2 Computing the Priors

Using the methods described above, the priors can be automatically computed by iterating over all nouns and summing the counts for each adjective sense. Unfortunately, the automatic disambiguation of the adjective is not reliable enough and results in inaccurate priors. Therefore, manual classification of nouns into one of the adjective senses is needed, constituting the first of two manual tasks needed by this model. However, instead of classifying all English nouns, Altavista is again used to provide collocation data on 5,000 nouns for each adjective. The collocation frequency is then sorted and the top 100 nouns are manually classified. For example, the top 10 nouns that collocate after "great" are "deal", "site", "job", "place", "time", "way", "American", "page", "book", and "work". They are then all classified as being modified by the great#2 sense except for the last one, which is classified into another sense, as defined by WordNet. The prior for each sense is then computed by summing the counts from pairing the adjective with the nouns classified into that sense and dividing by the sum of all adjective-noun pairs. The top 100 collocated nouns for each adjective are used as an approximation for all adjective-noun pairs, since considering all nouns would be impractical.

To validate these priors, a Naive Bayes classifier that computes

    max_i( Pr(adj, noun | adj#i) × Pr(adj#i) / Pr(adj, noun) )

is used, with the noun as the only context. This simpler likelihood term is approximated by the same Internet counts used to establish the priors, i.e., Counts(adj#i-noun) / normalizing constant. In Table 1, the accuracy of disambiguating 135 adjective-noun pairs from the br-a01 file of the semantically tagged corpus SemCor (Miller et al., 1993) is compared to the baseline, which was calculated by using the first WordNet sense of the adjective.

    Accuracy
    Before Prior    56.3%
    Prior Only      77.0%
    Combined        77.8%
    Baseline        75.6%

Table 1: Accuracy rates from using a Naive Bayes classifier to validate the priors. These results show that the priors established in this model are as accurate as WordNet's ordering according to sense usage (Baseline).

As mentioned earlier, disambiguating using simply the highest count from Altavista ("Before Prior" in Table 1) achieved a low accuracy of 56%, whereas using the sense with the highest prior ("Prior Only") is slightly better than the baseline. This result validates the fact that the priors established here preserve WordNet's ordering of sense usage, with the improvement that the relative usages between senses are now quantified. Combining both the prior and the likelihood terms did not significantly improve or degrade the accuracy. This would indicate either that the likelihood term is uniformly distributed across the i senses, which is contradicted by the accuracy without the priors ("Before Prior") being significantly higher than chance given the average of 3.98 senses per adjective, or, more likely, that this parameter is subsumed by the priors due to the limited context. Therefore, more contextual information is needed to improve the model's performance.

3.3 Contextual Features

Instead of adding other types of context, such as the surrounding words and syntactic features, the semantic features of the noun (as encoded in the WordNet ISA hierarchy) are investigated for their effectiveness.
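The Naive Bayes validation described above reduces to an argmax over the senses, since the denominator Pr(adj, noun) is constant in i. A minimal sketch, with invented probabilities standing in for the paper's actual count-based estimates:

```python
# Sketch of the prior-validation classifier: combine the prior Pr(adj#i)
# with the count-based likelihood Pr(noun | adj#i) and pick the argmax.
# All numbers below are illustrative stand-ins, not the paper's estimates.
priors = {"great#1": 0.2, "great#2": 0.8}  # from top-100 collocated nouns
likelihoods = {                             # normalized Counts(adj#i-noun)
    ("great#1", "hurricane"): 0.9,
    ("great#2", "hurricane"): 0.1,
}

def classify(noun):
    """argmax_i Pr(noun | adj#i) * Pr(adj#i); the denominator cancels."""
    scores = {sense: likelihoods.get((sense, noun), 0.0) * priors[sense]
              for sense in priors}
    return max(scores, key=scores.get)
```

With these numbers the likelihood overrides the strong prior for great#2, so classify("hurricane") selects great#1.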
These features are readily available and are organized into a well-defined structure. The hierarchy provides a systematic and intuitive method of distance measurement between feature vectors, i.e., the semantic distance between concepts. This property is very important for inferring the classification of the novel pair "great flood" into the sense that contains hurricane as a member of its prototypical nouns. These prototypical nouns describe the selectional preferences of the adjective senses of "great", and the semantic distance between them and a new noun measures the "semantic fit" between the concepts. The closer they are, as with "hurricane" and "flood", the higher the probability of the likelihood term, whereas distal concepts such as "hurricane" and "taste" would have a lower value. Representing these prototypical nouns probabilistically, however, is difficult due to the exponential number of probabilities with respect to the number of features. For example, representing hurricane being present in a selectional preference list requires 2^8 probabilities, since there are 8 features, or ISA parents, in the WordNet hierarchy. In addition, the sparse data problem resurfaces because each one of the 2^8 probabilities has to be quantified. To address these two issues, belief networks are used, as described in detail in the next section.

4 Probabilistic Networks

There are many advantages to using Bayesian networks over traditional probabilistic models. The most notable is that the number of probabilities needed to represent the distribution can be significantly reduced by making independence assumptions between variables, with each node conditionally dependent upon only its parents (Pearl, 1988). Figure 1 shows an example Bayesian network representing the distribution P(A,B,C,D,E,F). Instead of having one large table with 2^6 probabilities (with all Boolean nodes), the distribution is represented by the conditional probability tables (CPTs) at each node, such as P(B|E,F), requiring a total of only 24 probabilities. Not only do the savings become more significant with larger networks, but the sparse data problem becomes more manageable as well. The training set no longer needs to cover all permutations of the feature sets, but only smaller subsets dictated by the sets of variables of the CPTs.

Figure 1: An example of a Bayesian network and the probabilities at each node that define the relationships between a node and its parents. The equation at the bottom, P(A,B,C,D,E,F) = P(A|B,C) × P(B|E,F) × P(C|F) × P(D|E) × P(E) × P(F), shows how the distribution across all of the variables is computed.

Figure 2: The structure of the belief network that represents the selectional preference of great#1. The leaf nodes are the nouns within the training set, and the intermediate nodes reflect the ISA hierarchy from WordNet. The probabilities at each node are used to disambiguate novel adjective-noun pairs.

The network shown in Figure 1 looks similar to a portion of the WordNet hierarchy for a reason. In BHD, belief networks with the same structure as the WordNet hierarchy are automatically constructed to represent the selectional preference of an adjective sense.
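The parameter savings from the factorization in Figure 1 can be made concrete: the four CPTs hold 8 + 8 + 4 + 4 = 24 entries (plus two root priors), versus 2^6 = 64 for the full joint, and any joint probability is one multiplication per node. A sketch with invented probabilities:

```python
# Sketch of the factored joint from Figure 1:
#   P(A,B,C,D,E,F) = P(A|B,C) P(B|E,F) P(C|F) P(D|E) P(E) P(F)
# All probability values are invented for illustration.
P_E, P_F = 0.6, 0.3
P_D_given_E = {True: 0.7, False: 0.2}
P_C_given_F = {True: 0.5, False: 0.1}
P_B_given_EF = {(True, True): 0.9, (True, False): 0.4,
                (False, True): 0.6, (False, False): 0.05}
P_A_given_BC = {(True, True): 0.8, (True, False): 0.3,
                (False, True): 0.5, (False, False): 0.02}

def joint_all_true():
    """P(A=t,B=t,C=t,D=t,E=t,F=t): one CPT lookup per node, multiplied."""
    return (P_A_given_BC[(True, True)] * P_B_given_EF[(True, True)]
            * P_C_given_F[True] * P_D_given_E[True] * P_E * P_F)
```

The all-true instantiation shown here is exactly the special-case query BHD uses later, which is why only the all-true CPT rows ever need to be specified.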
Specifically, the network represents the probabilistic distribution over all of the prototypical nouns of an adjective#i and the nouns' semantic features, i.e., P(proto nouns, <proto NFs> | adj#i). The use of Bayesian networks for WSD has been proposed by others, such as Wiebe et al. (1998), but a different formulation is used in this model. The construction of the networks in BHD can be divided into three steps: defining 1) the training sets, 2) the structure, and 3) the probabilities, as described in the following sections.

4.1 Training Sets

The training set for each of the adjective senses is constructed by extracting the exemplary adjective-noun pairs from the WordNet glossary. The glossary contains the example usages of the adjectives, and the nouns from them are taken as the training sets for the adjectives. For example, the nouns "auk", "oak", "steak", "delay" and "amount" compose the training set for great#1 (sense: large in size). Note that WordNet included "steak" in the glossary of great#1, but it appears that the good or excellent sense would be more appropriate. Nevertheless, the lists of exemplary nouns are systematically retrieved and not edited. The sets of prototypical nouns for each adjective sense have to be disambiguated, because the semantic features differ between ambiguous nouns. Since these nouns cannot be automatically disambiguated with high accuracy, they have to be done manually. This is the second part of the manual process needed by BHD, since the WordNet glossary is not semantically tagged.

4.2 Belief Network Structure

The belief networks have the same structure as the WordNet ISA hierarchy, with the exception that the edges are directed from the child nodes to their parents. Illustrated in Figure 2, the BHD-constructed network represents the selectional preference of the top level node, great#1. The leaf nodes are the evidence nodes from the training set, and the intermediate nodes are the semantic features of the leaf nodes. This organization enables the belief gathered from the leaf nodes to be propagated up to the top level node during inferencing, as described in a later section. But first, the probability table accompanying each node needs to be constructed.

4.3 Quantifying the Network

The two parameters the belief networks require are the CPTs for each intermediate node and the priors of the leaf nodes, such as P(great#1, hurricane). The latter is estimated by the counts obtained from Altavista, as described earlier, and a shortcut is used to specify the CPTs. Normally, the CPTs in a fully specified Bayesian network contain all instantiations of the child and parent values and their corresponding probabilities. For example, the CPT at node D in Figure 1 would have four rows: Pr(D=t|E=t), Pr(D=t|E=f), Pr(D=f|E=t), and Pr(D=f|E=f). This is needed to perform full inferencing, where queries can be issued for any instantiation of the variables. However, since the networks in this model are used only for one specific query, where all nodes are instantiated to true, only the row with all variables equal to true, e.g., Pr(D=t|E=t), has to be specified. The nature of this query will be described in more detail in the next section.

To calculate the probability that an intermediate node and all of its parents are true, one divides the number of parents present by the number of possible parents as specified in WordNet. In Figure 2, the small dotted nodes denote the absent parents, which determine how the probabilities are specified at each node. Recall that the parents in the belief network are actually the children in the WordNet hierarchy, so this probability can be seen as the percentage of children actually present. Intuitively, this probability is a form of assigning weights to parts of the network where more related nouns are present in the training set, similar to the concept of semantic density. The probability, in conjunction with the structure of the belief network, also implicitly encodes the semantic distance between concepts without necessarily penalizing concepts with deep hierarchies. A discount is taken at each ancestral node during inferencing (next section) only when some of its WordNet children are absent in the network. Therefore, the semantic distance can be seen as the number of traversals up the network weighted by the number of siblings present in the tree (and not by direct edge counting).

4.4 Querying the Network

With the probabilities between nodes specified, the network becomes a representation of the selectional preference of an adjective sense, with features from the WordNet ISA hierarchy providing additional knowledge on both semantic densities and semantic distances. Each adjective-sense network can then be queried for the likelihood Pr(adj, noun, <noun NFs>, proto nouns, <proto NFs> | adj#i) of a noun it has never encountered before. To perform these inferences, the noun and its features are temporarily inserted into the network according to the WordNet hierarchy (if not already present). The prior for this "hypothetical evidence" is obtained the same way as for the training set, i.e., by querying Altavista, and the CPTs are updated to reflect this new addition. To calculate the probability at the top node, any Bayesian network inferencing algorithm can be used. However, a query where all nodes are instantiated to true is a special case, since the probability can be computed by multiplying together all priors and the CPT entries where all variables are true.

In Figure 3, the network for great#1 is shown with "flood" as the hypothetical evidence added on the right. The CPT of the node "natural phenomenon" is updated to reflect the newly added evidence. The propagation of the probabilities from the leaf nodes up the network illustrates how discounts are taken at each intermediate node. Whenever more related concepts are present in the network, such as "typhoon" and "tornado", fewer discounts are taken and thus a higher probability results at the root node. Conversely, with a distal concept such as "taste" (which is in a completely different branch), the knowledge about "hurricane" will have little or no influence on disambiguating "great taste". The calculation above can be computed in linear time with respect to the depth of the query noun node (depth=5 in the case of flood#1), and not the number of nodes in the network. This is important for scaling the network to represent the large number of nouns needed to accurately model the selectional preferences of adjective senses. The only cost incurred is storage for a summary probability of the children at each intermediate node, and time for updating these values when a new piece of evidence is added, which is also linear with respect to the depth of the node. Finally, the probabilities computed by the inference algorithm are combined with the priors established in the earlier section. The combined probabilities represent P(adj#i | adj, noun, <NFs>), and the one with the highest probability is classified by BHD as the most plausible sense of the adjective.
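Under the all-true query, the root probability for an inserted noun reduces to a product along its ISA path: the leaf's prior times one discount per ancestor, where each discount is the fraction of that ancestor's WordNet children present in the network. A toy sketch, with an invented hierarchy and counts standing in for WordNet:

```python
# Toy sketch of BHD's special-case inference: with every node instantiated
# to true, the root probability is the leaf's prior multiplied by one
# discount per ancestor (fraction of WordNet children present). The
# hierarchy, counts, and prior below are invented for illustration.
children_in_wordnet = {"storm": 7, "natural phenomenon": 4}  # possible children
children_present = {"storm": 2, "natural phenomenon": 1}     # in this network

def root_probability(leaf_prior, path):
    """Multiply the leaf prior by each ancestor's children-present fraction.
    Linear in the depth of the leaf, not in the size of the network."""
    prob = leaf_prior
    for node in path:
        prob *= children_present[node] / children_in_wordnet[node]
    return prob

# e.g., "flood" inserted under storm -> natural phenomenon (path truncated)
p = root_probability(0.02, ["storm", "natural phenomenon"])
```

Adding a sibling under "storm" would raise children_present["storm"] and shrink that discount, which is how semantic density boosts closely related concepts.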
To disambiguate a novel adjective-noun pair such as "great flood", the great#1 and great#2 networks (along with 7 other great#i networks not shown here) each infer the likelihood that "flood" belongs to the selectional preference they represent, even though neither network has ever encountered the noun "flood" before.

Figure 3: Query of the great#1 belief network to infer the probability of flood being modified by great#1. The left branch of the network has been omitted for clarity.

4.5 Evaluation

To test the accuracy of BHD, the same procedure described earlier was used. The same 135 adjective-noun pairs from SemCor were disambiguated by BHD and compared to the baseline. Table 2 shows the accuracy results from evaluating either the first sense of the nouns or all senses of the nouns.

    Context                       Without Prior   With Prior   Baseline
    noun only (1st noun sense)    56.3%           77.8%        75.6%
    noun only (all noun senses)   60.0%           80.0%        75.6%
    +SP (1st noun sense)          53.3%           77.8%        75.6%
    +SP (all noun senses)         60.0%           81.4%        75.6%

Table 2: Accuracy results from the selectional preference model (+SP), showing the improvements over the baseline by either considering the first noun sense or all noun senses.

    Model                          Results   Baseline
    BHD                            81.4%     75.6%
    Mihalcea and Moldovan (1999)   79.8%     81.8%
    Stetina et al. (1998)          83.6%     81.9%

Table 3: Comparison of adjective disambiguation accuracy with other models.

The accuracy results without the priors Pr(adj#i) indicate the improvements provided by the likelihood term alone. The improvement gained from the additional contextual features shows the effectiveness of the belief networks. Even with only 3 prototypical nouns per adjective sense on average (hardly a complete description of the selectional preferences), the gain is very encouraging.
With the priors factored in, BHD improved even further (81%), significantly surpassing the baseline (75.6%), a feat accomplished by only one other model that we are aware of (Stetina and Nagao, 1998). Note that the best accuracy was achieved by evaluating all senses of the nouns, as expected, since the selectional preference is modeled through the semantic features of the glossary nouns, not just their word forms. The reason for the good accuracy from using only the first noun sense is that 72% of them happen to be the first sense. These results are very encouraging, since no tagged corpus and minimal training data were used. We believe that with a bigger training set, BHD's performance will improve even further.

4.6 Comparison with Other Models

To our knowledge, there are only two other systems that disambiguate adjective-noun pairs from unrestricted text. Results from both models were evaluated against SemCor, and thus a comparison is meaningful. In Table 3, each model's accuracy (as well as the baseline) is provided, since different adjective-noun pairs were evaluated. We find the BHD results comparable, if not better, especially when the amount of improvement over the baseline is considered. The model by Stetina (1998) was trained on SemCor merged with a full sentential parse tree, the determination of which is considered a difficult problem of its own (Collins, 1997). We believe that by incorporating the data from SemCor (discussed in the future work section), the performance of our system will surpass Stetina's.

5 Conclusion and Future Work

We have presented a probabilistic disambiguation model that is systematic, accurate, and requires manual intervention in only two places.
The more time consuming of the two manual tasks is to classify the top 100 nouns needed for the priors. The other task, of disambiguating prototypical nouns, is relatively simple due to the limited number of glossary nouns per sense. However, it would be straightforward to incorporate semantically tagged corpora, such as SemCor, to avoid these manual tasks. The priors are the number of instances of each adjective sense divided by all of the adjectives in the corpus. The disambiguated adjective#i-noun#j pairs from the corpus can be used as training sets to build better representations of selectional preferences, by inserting the noun#j node and the accompanying features into the belief network of adjective#i. The insertion is the same procedure used to add the hypothetical evidence during the inferencing stage. The updated belief networks could then be used for disambiguation with improved accuracy. Furthermore, the performance of BHD could also be improved by expanding the context or using statistical learning methods such as the EM algorithm (Dempster et al., 1977). Using Bayesian networks gives the model flexibility to incorporate additional contexts, such as syntactic and morphological features, without incurring exorbitant costs.

It is possible that, with an extended model that accurately disambiguates adjective-noun pairs, the selectional preference of adjective senses could be automatically learned. Having improved knowledge about the selectional preferences would then provide better parameters for disambiguation. The model can be seen as a bootstrapping learning process for disambiguation, where the information gained from one part (selectional preference) is used to improve the other (disambiguation) and vice versa, reminiscent of the work by Riloff and Jones (1999) and Yarowsky (1995). Lastly, the techniques used in this paper could be scaled to disambiguate not only all adjective-noun pairs, but also other word pairs, such as subject-verb, verb-object, and adverb-verb, by obtaining most of the parameters from the Internet and WordNet. If the information from SemCor is also used, then the system could be automatically trained to perform disambiguation tasks on all content words within a sentence.

In this paper, we have addressed three of what we believe to be the main issues faced by current WSD systems. We demonstrated the effectiveness of the techniques used, while identifying two manual tasks that don't necessarily require a semantically tagged corpus. By establishing accurate priors and small training sets, our system achieved good initial disambiguation accuracy. The same methods could be fully automated to disambiguate all content word pairs if information from semantically tagged corpora is used. Our goal is to create a system that can disambiguate all content words to an accuracy level sufficient for automatic tagging with human validation, which could then be used to improve or facilitate new probabilistic semantic taggers accurate enough for other NLP applications.

References

Eneko Agirre and German Rigau. 1996. Word sense disambiguation using conceptual density. In Proceedings of COLING-96, Copenhagen.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23, Madrid, Spain.

A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.

Rada Mihalcea and Dan Moldovan. 1998. Word sense disambiguation based on semantic density. In Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal, Canada, July.

G. Miller, C. Leacock, and R. Tengi. 1993. A semantic concordance. In Proceedings of ARPA Human Language Technology, Princeton.

G. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4).

Hwee Tou Ng and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz, June.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Philip Resnik and David Yarowsky. 1997. A perspective on word sense disambiguation methods and their evaluation. In ANLP Workshop on Tagging Text with Lexical Semantics, Washington, D.C., June.

Philip Resnik. 1997. Selectional preference and sense disambiguation. In ANLP Workshop on Tagging Text with Lexical Semantics, Washington, D.C., June.

Ellen Riloff and Rosie Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of AAAI-99, Orlando, Florida.

Jiri Stetina, Sadao Kurohashi, and Makoto Nagao. 1998. General word sense disambiguation method based on a full sentential context. In Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal, Canada, July.

Janyce Wiebe, Tom O'Hara, and Rebecca Bruce. 1998. Constructing Bayesian networks from WordNet for word-sense disambiguation: Representational and processing issues. In Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal, Canada, July.

David Yarowsky. 1992. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of COLING-92, Nantes, France.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the ACL.