C00-1023 - ACL Anthology Reference Corpus

Word Sense Disambiguation of Adjectives
Using Probabilistic Networks
G e r a l d C h a o a n d M i c h a e l G. D y e r
Coinl)uter Science Det)artment,
University of California, Los Angeles
L o s A n g e l e s , C a l i f o r n i a 90095
[email protected], dyer~cs.ucla.edu
Abstract
quired manual knowledge engineering. More recent
works take advantage of machine readable dictionaries such as WordNet (Miller, 1990) and Roget's
Online Thesaurus. Statistical techniques, both supervised learning from tagged corpora (Yarowsky,
1992), (Ng and Lee, 1.996), and unsupervised learning (Yarowsky, 1995), (Resnik, 1997), have been investigated. There are also hybrid inodcls that incorporate both statistical and symbolic knowledge
(Wiebe et al., 1998), (Agirre and I{igau, 1996).
Supervised models have shown promising results,
but the lack of sense tagged corpora often requires
the need tbr laboriously tagging trailfing sets manually. Depending on the technique, unsupervised
models can result in ill-defined senses. Many have
not been evaluated with large vocabularies or flfll
sets of senses. Hybrid models, using various heuristics, have demonstrated good accuracy but are ditficult to compare clue to variations in the evahlation
procedures, as discussed in Resnik and Yarowsky
(]997).
In our Bayesian Hierarchical Disambiguator
(BHD) model, we attempt to address some of the
main issues faced by today's WSD systelns, namely:
1) the sparse data problem; 2) the selection of a
fi;ature set that can be trained upon easily without
sacrificing accuracy; and 3) the scalability of the systein to disambiguate um'estricted text. The first two
problems can be attributed to the lack of tagged
corpora, while the third results from the need for
lland-annotated text as a method of circumventing
the first two problems. We will address the first two
issues by identiflying contexts in which knowledge
can be obtained automatically, as opposed to those
that require minimal manual tagging. The effectivehess of the BHD model is then tested on unrestricted
text, thus addressing the third issue.
In this paper, word sense dismnbiguation (WSD) accuracy achievable by a probabilistic classifier, using
very milfimal training sets, is investigated. \Ve made
the assuml)tiou that there are no tagged corpora
available and identified what information, needed by
an accurate WSD system, can and cmmot be automatically obtained. The lesson learned can then be
used to locus on what knowledge needs malmal annotation. Our system, named Bayesian Hierarchical
Disambiguator (BHD), uses the Internet, arguably
tile largest corlms in existence, to address the st)arse
data problem, and uses WordNet's hierarchy tbr semantic contextual features. In addition, Bayesian
networks are automatically constructed to represent
knowledge learned from training sets by lnodeling
the selectional i)retbrence of adjectives. These networks are then applied to disaml)iguation by pertbrming inferences on unseen adjective-noun pairs.
We demonstrate that this system is able to disambiguate adjectives in um'estricl;ed text at good initial
accuracy rates without tile need tbr tagged corpora.
The learning and extensibility aspects of the model
are also discussed, showing how tagged corpora and
additional context can be incorporated easily to improve accm'acy, and how this technique can be used
to disambiguate other types of word pairs, such as
verb-noun and adverb-verb pairs.
1
Introduction
Word sense disambiguation (WSD) remains an open
probleln in Natural Language Processing (NLP). Being able to identify tile correct sense of an mnbiguous word is ilnportant for many NLP tasks, such as
machine translation, information retrieval, and discourse analysis. The WSD problem is exacerbated
by the large number of senses of colmnonly used
words and by the difficulty in determining relevant
contextual features most suitable to the task. Tile
absence of semantically tagged corpora makes probabilistic techniques, shown to be very effective by
speech recognition and syntactic tagging research,
difficult to employ due to the sparse data problem.
Early NLP systeins limited their domain and re-
2
Problem Formulation
WSD can t)e described as a classification task,
where the i th sense ( W # i ) of a word (W) is classified as the correct tag, given a word and usually
some surrounding context. For example, to disambiguate the adjective "great" in the sentence "Tile
152
great hm'riealm devastated the region", a \VSI) system should disambiguate "great" as left.q(: in .s'izc
rather than the good or c:cccllent meaning. Using l)robability notations, this t)rocedure can be
st~ted as m a x i ( P r ( g r e a t # i [ "great", "ihe", "lnn'ricane", "devastated", "the", "region")). T h a t is,
given the word "great" and its context, elassit~y the
sense g r e a t # / with the highest probability as the
correct o11e. However, a large COllteX[;~ sll(:h as the
whole sentence, is rarely used, due to the dilliculty
in estimating the 1)robability of this particular set of
words oceurring. Therefore, the context is usually
narrowed, such as ,z number of surromMing words.
Additionally, surrollnding syntactic tbatures aim semantic knowledge are sometimes used. The difficulty is in choosing the right context, or the set
of features, that will ol)timize the classification. A
larger (:ontext intl)roves t;11(!classiti(:ation a(:(:ura(:y al
the exl)ense of increasing the mlml)er of l)arameters
(typically learned fr(nn large training data).
in our BHI) model, a minimal context (:Oml)OSe(l
of only the adje(:tive, noun and the llOllll;S Selllalltie features obtained fl'()m \VordNel is used. Using the al)ove examl)le , only "great", "hurri(:ane"
and hun'icane's t)atures encoded in WordNet's hierarchy, as in hurricane lISA cyclone lISA windstorm ISA violent storm..., are used its (;Olltext.
Therefore, the classitieation 1)ertbrnmd 1)y Bill) (:an
t)e written as nlaxi(lh(great¢~i [ "great", "]nlrricane", cyt:loltc, windslorm ...)), or more generically, ,m(,,:~( l ' r ( , d j # i l ( , d j , ,,>,,.,,, < N ],',; >)), ,vher(,
<N/Zs> denotes the n(mn features. By using the
Bayesian inversion fornmla, this equati()n l)e(:onw.s
P r ( g r e a t # l ) , in l)rOl)ortion to the usage of the
two senses. Although WordNet orders the senses
of a t)olysenl(ms word according to usage, the
actual t)rot)ortions are not quantified. Therefore, to
(:omtmte the priors, one can iterate over all English
nouns and sunl the instmlces of g r e a t # l - n o u n
versus great#-2-noun t)airs. But siilce we assume
that no training set exists (the worst possible ease
of the sparse data l)rol)lem), these counts need to
1)e estinmted from ilMireet sources.
3.1
The techuique used to address data sI)arsity, as first
proposed by Mihalcea and Moldovan (1998), treal;s
the Internet as a cortms to automatically (lisambiguate word 1)airs. Using the previous examl)le , to
disambiguate the adjective in "great hurricane", two
synonym lists of ("great, large, t)ig") and ("great,
neat, good") are retrieved from V¢ordNet. (SOllle
s y n o n y m s ~111(1 el;her SellSeS i/re olllitted here for
1)revity.) Two queries, ("great hurricane" or "large
hurrit:ane?' or "t)ig hurricane") and ("great hurricane" or "neat hurricane" or "good hurriemm"), are
issued to Altavista, which reporl, s that 1100 and 914:
1)ages contain these terlllS, respectively. The query
with the higher count (#1) is classitied its the correct
sense. For fllrther details, please refer to Mihaleca
anti Moldovau (1998).
.,,,.<,:,.~(Pr(,.4i,'l,.o,.,.,
< N],'.s > I . , ! i # i ) x P,(,,4i#i)).
Pr(,d:i,'no'u,., < Nt;'.s" >)
(1)
This eonte×t is chosen because it does not; need an
annotated training set, and these semantic fe.atures
are used to l)uiht a belief about the nouns ,:ill adjective SellSe typically lnodifies, i.e., the se]ectional preferences of a(ljectiv('.s. For examlfle , having learned
about lmrrieane, the system can infer the most probable dismnbiguation of "great typhoon", "great tornado", or ltlore distal concepts such as earthquakes
and tloods.
3
As
two
Establishing the Parameters
shown in equation 1,
BH1) requires
parameters:
1) the likelihood
term
Pr(adj, n o u n , < N F s > ladj#i) and 2) the prior
term P r ( a d j # i ) .
The prior term represents the
knowledge of how frequently a sense of an adjective
is used without any contextual infornmtion. Dn"
example, if g r e a t # 2 (sense: 9ood, excellent) is
nsed frequently while g r e a t # l is less commonly
used, then P r ( g r e a t # 2 ) wtmld be larger than
The Sparse Data Problem
In our luo(lel, the (:omd;s front Altavista are im:orl)orated as [)aralltel;er estilllatiollS within our probabilislh: frlt111(~work. Jill addition to disaml)igualing
lhe adjectives, we also need to estimale lhe usage of the a(ljec|;ive#i-noml pair. For simt)licil;y,
the counts fronl Altavista are assigued wholesale 1o
the disambiguated adjective sense, e.g., the usage
of g r e a t # l - h u r r i c a n e is 1100 times and g r e a t # 2 hurricane is zero times. This is a great simplification since in m a n y adjective-noun pairs nmltil)le
meanings are likely. For instance, in "great sl;eak",
both sense of '"great" (large steak vs. tasty steak)
are e,qually likely. However, given no other information, this sinlt)lification is used as a gross apt)roximation of Counts(adj#i-noun), which becoines
Pr(adj#i-noun) by dividing the counts by a nornmlizing constant, ~ Com~ts(adj#i-all nouns). These
prolmbilities are then used to compute the priors,
described in the next section.
Using this technique, two major probleins are addressed. Not only are the adjectives automatically
disambiguated, but the number of occurrenees of the
word pairs is also estimated. The need lot haudannotated semantic corpora is thus avoided. However, the statistics gathered by this technique are
at)proxinmtions, so the noise they introduce does require supervised training to nfinimize error, as will
lm described.
153
3.2 C o m p u t i n g t h e Priors
Using the methods described above, the priors cats
be automatically computed by iterating over all
nouns and smnnfing tile counts tbr each adjective
sense. Untbrtunately, the automatic disambiguation
of tim adjective is not reliable enough and results it,
inaccurate priors. Therefore, manual classification
of assigning nouns into one of the adjective senses
is needed, constituting the first of two manual tasks
needed by this model. However, instead of classifying all English nouns, Altavista is again used to
provide collocation data on 5,000 nouns for each adjective. The collocation frequency is then sorted and
the top 100 nouns are manually classified. For example, the top 10 nouns that collocate after "great"
are "deal", "site", "job", "place", "time", "way",
"American", "page", "book", and "work". They are
then all classified as being modified by the g r e a t # 2
sense except for tile last one, which is classified into
another sense, as defined by WordNet. Tim prior
for each sense is then coinputed by smmning the
counts Don, t)airing the adjective with the nouns
classified into that sense and dividing by the sum
of all adjective-noun pairs. The top 100 collocated
nouns fbr each adjective are used as an al)proximation for all adjective-noun pairs since considering all
nouns would be impractical.
To validate these priors, a Naive Bayes classifier
that comt)utes
v,.azi Pr(adj, ,!,o'l,r@,dj#i) x Pr(,,dj #i)
P r ( a4i, ~ o , . ,.)
is used, with the noun as the only context. This
simpler likelihood term is approxinmted by the same
Internet counts used to establish the priors, i.e.,
Counts(adj#i-noun) / normalizing constant. In Table 1, the accuracy of disambiguating 135 adjectivenoun pairs Dora the br-a01 file of the semantically
tagged corlms SemCor (Miller et al., 1993) is compared to the baseline, wlfich was calculated by using
the first WordNet sense of the adjective. As mentioned earlier, disambiguating using, simply the highest count Dot** Altavista ("Before Prior" it, Table
1) achieved a low accuracy of 56%, whereas using
the sense with the highest prior ("Prior Only") is
slightly better than the baseline. This result validates the fact that the priors established here preserve WordNet's ordering of sense usage, with tile
improvement that tim relative usages between senses
are now quantified.
Combining both the prior and the likelihood terms
did not significantly improve or degrade tile accuracy. This would indicate that either the likelihood
term is uniformly distributed across tile i senses,
which is contradicted by the accuracy without the
priors (second row) being significantly higher than
the average number of senses per adjective of 3.98,
Accur&cy
Before Prior
Prior Only
Combined
Baseline
56.3%
77.0%
77.8%
75.6%
Table 1: Accuracy rates from using a Naive Bayes
classifier to wflidate the priors. These results show
that the priors established in this model are as accurate as the WordNet's ordering according to sense
usage (Baseline).
or, more likely that this parameter is subsumed by
the priors due to the limited context. Therefore,
more contextual infbrmation is needed to improve
tim model's peribrmance.
3.3 C o n t e x t u a l Features
Instead of adding other types of context such as
the surrounding words and syntactic features, tlm
semantic features of tim noun (as encoded in the
WordNet ISA hierarchy) is investigated for its effectiveness. These features are readily available and are
organized into a well-defined structus'e. Tim hierarchy provides a systematic and intuitive method of
distance measurements between feature vectors, i.e.,
the semantic distance between concepts. This property is very important for inthrring the classification
of the novel pair "great flood" into the sense tha.t
contains hurricane as a member of its prototypical
nouns. These prototypical nouns describe tile selectional preferences of adjective senses of "great",
and the semantic distance between them and a new
noun ineasures 1;11o "semantic fit" between the concepts. Tile closer the.y are, as with "hurricane" and
"flood", the higher the prot)ability of the likelihood
term, whereas distal concepts such as "hurricane"
and "taste" would have a lower value.
Representing these prototyt)ical 11011218prot)abilistically, however, is difficult due to the exponential
number of 1)robal)ilit, ies with respect to the nmnber
of features. For exmnl)le, ret)resenting hurricmm 1)eing I)resent in a selectional t)reference list requires
2 s probabilities since there are 8 features, or ISA
l)arents, in the WordNet hierarchy. It, addition , tile
sparse d a t a t)roblem resurfaces because each one of
the 2 s probabilities has to l)e quantified. To address these two issues, belief networks are used, as
described it, detail in the next section.
4
Probabilistic
Network's
There are m a n y advantages to using Bayesian networks over the traditional probabilistic models. The
most notable is that the mlmber of probabilities
needed to represent the distribution can be signif
icantly reduced by making independence assumptions between variables, with each node condition-
154
~ A ~ p(AIB, c )
"
,
-\
t'"il
/>,
/ ' \
(?.*ib,,}~
/
"
'""@t
:':,,'..,,,.,:p, ........
H
,
(,,,,:,,,,@,,.
P(h,B,C3)3¢,l~)=l'(All~,C)xl'(BIb,i;)xP(Cl}))
(,.,,~,)
xP(])llV,)xl)(l~)xl)(F)
\
a~",'"')
Figure :1: A n eXamlih; of a Bayesiml nel;work and the
t)rolialiilities at ('a(:h node, t h a i (h'lin(; lhe rehtl.ionshills 1)('tweeu a no(h; and its parents. ']'he equation
al; the l)ott()m shows how t.h(, (listrilmtion across all
(if the variat)les is (:Oml)Ut(',(l.
,," ...............~...............
/
(.,,,5
(.,,,4-. ..... ~...............
Figure 2: T h e stru(:ture (if the belief n(,l;w(/rk that
i'el)resenls the, s(~h~ctional l)refer(!n(:(~ (if flrcal,#[.
T h e leaf nodes are 1.he nouns within the training set,
mid lhe int(~rm('xlial.(; no(les r('th;(:l; the ISA hierar(:hy
fr()m \\;()r(1Nel;. T h e 1)rol)ahilities al each node at(',
used i;o (lisanfl)iguat(! novel a(lj(,(:l;ive-noun 1)airs.
ally d(,l/endenl; ltl)on only il;s t)arenls (]'earl, 1988).
]Pigllr(! ] shows all (LxCallll)](; ]{ayesiall ll(H:Woll( rel)r(;senl;ing l;h(' disl;ribul;i()n ])(;\,B;(',,1);F,,]"). ]nsl;('a(l
of having oue large (,able wilh 2" lirol)al)ilit.i(~s (with
all ]}o()](;an nod(!s), the (lisl;ribution is rel)resenl(;(t
by the (:onditional I)rolial)ility tal)les (C,])Ts) at ea('h
nod(,, su(:h as I>(B I l), F), requiring a (olal of only 2d
prol)al)ilitie~s. Not only (lo th(, savings he(:ome more
significant with larger networks, lmi; (.he sparse d a t a
t/r()l)h;m b(!(:omes m()r(; manag(;abh, as well. Th(!
I;l';/illitlg~ s(!l; 11o lollg:er lice(is Ix) (:()v(!r all l)(!l'llllltal;ions (if l;he f(!a(;m'e sets, lml only smaller sul)sets
dictated l)y l.he sets of Valial)h!s (if lira C P T s .
T h e n(;l.w(/rk shown in Figure 1 lo()ks simihn to any pOll;ion of th(~ \Vor(1N(~l; hierar(:hy for
a reason.
In BII]), belief networks with the
same stru(:tur(; as the \Vor(tNet hierar(:hy are automatically (:onstru(:t;ed to rel)l"esent the seh'cl;ional
I)reference of an a(1.iective sense.
S1)ecifically,
the network rei)resents the prol)aliilistic (tistribul;ion over all of the i)rotol;yl)i(:al nouns of an a(1j e c t i v e ~ i mid the nouns' semanti(: t'ealures, i.e.,
for th(, a(1.i('(:iives. For exmnl)h! , the nouns "auk",
"oak", "steak", "delay" and " a m o u n t " (:ompose the
training set fl)r g r e a t # l (SellSC" lalT/e iv, size,). Note
l]lal; W o r d N e t in(:huh'd "sl;e~al~" ill l;he glossary of
greal;#], l)ul: il; al)l)ears thai 1;]le 9ood or e:ccelhe'H,l,
5(!lise would lm lliOl'(! aplWol)rial;e. Neverlheh~ss, the
}isis of exelnl)lary m m n s are sysl:(;malit:aily rel;rh'vcd
and nol ('dile(l.
Th(" sets (if l/r()tolypi(:al n(/mlS f(/r each a(lj(~(:liv(~
sense have to lie (lisaml)igual;ed lie(:ause, the S(~lllallli(: features (lifter 1)etween ambiguous nouns. Since
these n(mns cmmol; lie autonmti(:ally disamlAguated
with high accuracy, lhey have to be done mamm]ly.
This is the second t)art of (,lie m m m a l process need(;d
1)3" BIID sitl(:(! the W(/rdNel; gh)ssary is not selnallti(:ally tagged.
P(v,.ot,,,,,,,,..,., < v,.o~,oNF, > I,-!i#i). 'rh(; ,,s(, of
4.2
Bayesian networks for WSI) has I)een l)rop()sed by
others such as W i e b e eL al (1.998), but a different
fl)rmulation is used in this mod('l. Th(, construction
of the networks in B H D can be divided into three
steps: defining 1) the training sets, 2) the structure,
and 3) the probabilities, as described in the tbllowing
sections.
4.1
(,,&
Belief Network
Struetm'e
T h e belief networks have the same structure as the
\VordNet 1SA hierarchy with the ext;et)tion t h a t the
edges are direcl;ed from the child nodes to their parents. Ilhlstrated in Figure 2, t h e B I t D - e o n s t r u c t e d
network represents the selectional t)referelme of the
to I) level nod(;, g r e a t # l . T h e leaf nodes are tile evi(len(:e nodes from the training set mM the internm(liale ilo(les are |;t1(; sema.ntie featm'es (if the leaf nodes.
This organization emd)les the belief gathered from
the leaf nodes 1;o lie tirot)agated u I) to the tol) level
node during inferencing, as described in a later secLioIl. 1}11|; first, th(' l)robability table ae(:oml)anyiug
ea(:h node needs to be constructed.
~lYaining S e t s
T h e training set for each of the adje(:tive s(',nses is
(;onstructed by ('.xi;ra(:l;ing l;he exenll)lary adje(:tivenoun pairs from tile WordNet glossary. T h e glossary
contains the examlih; usage of the a(lje(:tiv('~s, and
l;]l(', n o u n s from thein are taken as the training sets
155
4.3
Quantifying the Network
Pr(great, flood, <flood N F s > , proto nouns, <l)roto
N F s > I adj #i), even though neither network has ever
encountered the noun "flood" before.
To perform these inferences, the noun and its features are tenlporarily inserted into the network according to the WordNet hierarchy (if not already
present). The prior for this "hypothetical evidence"
is obtained the same way as the training set, i.e., by
querying Altavista, and the C P T s are updated to
reflect this new addition. To calculate the probability at the top node, any Bayesian network inferencing algorithm can be used. However, a query where
all nodes are instantiated to true is a special case
since the probability can be comlmted by multiplying together all priors and the C P T entries where all
variables are true.
Ill Figure 3, tile network for g r e a t # l is shown with
"flood" as tile hypothetical evidence added on the
right. The C P T of the node "natural phenomenon"
is updated to reflect the newly added evidence. The
propagation of the probabilities from the leaf nodes
up the network is shown and illustrates how discounts a r t taken at each intermediate node. Whenever more related concepts are 1)resent in the network, such as "typhoon" and "tornado '~, less discounts are taken and thus a higher probability will
result at the root node. Converso.ly, one can see that
with a distal concept, such as "taste" (which is ill a
completely different branch), the knowledge about
"hurricane" will have little or no influence on disambiguating "great taste".
The calculation above can be computed in linear
time with respect to the det/th of tlle query noun
node (depth=5 in the case of f l o o d # l ) and not the
the nmnber of nodes ill the network. This is important for scaling the network to represent the large
nuinber of nouns needed to accurately model tile selectional preferences of adjective senses. Tile only
cost incurred is storage for a s u m m a r y probability
of the children at each internlediate node and time
for ut)dating these values when a new piece of evidence is added, which is also linear with respect to
the depth of the node.
Finally, the probabilities comt)uted by the inference algorithnl are combilmd with the priors established in the earlier section. Tile combined probabilities represent P ( a d j # i ] adj, noun, < N F s > ) , and
tilt one with the highest probability is classified by
BHD as tile most plausible sense of the adjective.
The two parameters the belief networks require are
the C P T s tbr each intermediate node and the priors of the leaf nodes, such as P ( g r e a t # l , hurricane). The latter is estilnated by tile counts obtained fronl Altavista, as described earlier, and a
shortcut is used to specit~y the CPTs. Normally
the C P T s ill a flflly specified Bayesian network contain all instantiations of the child and parent values
and their corresponding probabilities. For example,
the C P T at node D in Figure 1 would have four
rows: e r ( D = t l E : t ) , P r ( D = t l E = f ) , P r ( D = f l E = t ) ,
and P r ( D = f l E = f ). This is needed to perform flfll
inferencing, where queries can be issued for any instantiation of the variables. However, since the networks in this model are used only for one specific
query, where all nodes are instantiated to be true,
only the row with all w~riables equal to true, e.g.,
P r ( D = t l E = t ) , has to be specified. The nature of
this query will be described in more detail in tile
next section.
To calculate the probability that an intermediate
node and all of its parents are true, one divides the
number of parents present by the number of possible parents as specified ill WordNet. hi Figure 2,
the small clotted nodes denote the absent parents,
which deternfine how the probabilities are specified
at each node. Recall that tile parents in the belief
network are actually the children in the. WordNet
hierarchy, so this probability can be seen as the percentage of children actually present, hltuitively, this
probability is a form of assigning weights to parts of
the network where more related nouns are present
in the training set, silnilar to the concept of semantic density. Tile probability, in conjunction with the
structure of the belief network, also implicitly encodes the semantic distance between concepts without necessarily 1)enalizing concepts with deep hierarchies. A discount is taken at each ancestral node
during inferencing (next section) only wlmn some
of its WordNet children are absent in the network.
Therefore, the semantic distance can be seen as the
number of traversals 11I)the network weighted by the
number of siblings present in tile tree (and not by
direct edge counting).
4.4
Querying the Network
With the probability between nodes specified, the
network becomes a representation of the selectional
prefbrence of an adjective sense, with features from
the WordNet ISA hierarchy providing additional
knowledge on both semantic densities and semantic
distances. To disambiguate a novel adjective-noun
pair such as "great flood", the g r e a t # l and g r e a t # 2
networks (along with 7 other g r e a t # / n e t w o r k s not
shown here) infer the likelihood that "flood" belongs to the network by comtmting the probability
4.5 E v a h m t i o n
To test the accuracy of BHD, the same procedure
described earlier was used. Tile same 135 adjectivenoun pairs from SelnCor were disambiguated by
BHD and compared to the baseline. Table 2 shows
the accuracy results froln evaluating either the first
sense of the nouns or all senses of the nouns. The results of the accuracy without the priors P r ( a d j # i ) in-
156
Context
1. st IIOllll
all noun
SOIISO
\'Vithout
lh'ior
With
1}l.i{}r
Baseline
1/16x 1/30x
¢i¢al) 2/-$
11Ol111only
S{BIIH(~S
56.3%
6O.O%
77.8%
80.{}%
75.6%
+SP
11Ol111
ollly
+SP
53.3%
60.0%
77.8%
81.4%
75.6%
,S/cLmst x 1/~
Ta.1}le 2: A c c u r a c y results from the seleetional preference model ( + S P ) , showing the improvements over
the baseline by either considering the first llOllIl
sense or all noun senses.
Illka,d} 1/8
'I'
t tlood) - 1770~
x 1/7 x I/I
Model
HIII)
Mihah:ea and Moldovan
(1999)
SW.tina el; al. (1998)
Results [ B a s e l i n e
81.4%
75.6%
79.8%
81.8%
83.6%
81.9%
Table 3: Colnt/arison of atljectiv(~ disalnl}iguation at:curay with other inodels.
Figure 3: (~ll{2ry of {;he flrcrtt#-I 1}elief netw{)rl(
t(} infe.r l;h{', probal}ility of tlood being m{}{lified l}y
.q'reat#l.. T h e left branch of the network has 1}{'.en
o m i t t e d for clarity.
dicate~ the imI}rovements l}rovided by the likeliho{}{t
t e r m alone. T h e itnl}r(}vement gain(~{l from the addii.ional conl;extual featm:es shows th(~ ell'{~{:liv{m{~ss
of the belief networks, l!'~v(m with only 3 t}rol{)tyt)ica.l m m n s l}er a(ljecl,ive se.nse (}n av{u'ag{~ (hardly a
COml)lete deseril)tion of the sel(B(:tional pr(ff(w(u]{:{;s);
the gain is very encouraging. Wit;h the 1)ri(}rs t'a{:tored in, 13IID iml}r(}ved even flu:ttmr (81%), signifi('antly surllassing the baseline (75.6%), a feat a{:{:omplished 1)3; {rely (me (}ther m{}del t h a t we are aware (}f
(airi Sl;el;iIla and Nagao, 1998). Not;e thai; l;h{! l)esl
~/.C{;lll;}/{;y \v~ls a{'hieved by evaluating all senses of th{~
ll(}llllS} }IS exi)e{:ted, since the sele{:l;iollal t)r{~.fer{,Jl{'e
is modele{t l;hr{mgh senmnti(: feai;mes (}t" the glossary nouns, not just their word forms. The. r{;as{m
for the good a c c u r a c y from using only the tirst noun
sense is because 72% of t h e m hal}pen to be the first;
S0,IlSe. r]_~heso results are very ('JH;ouragillg si]lco 11o
tagged corl)us and minimal training d a t a were. used.
\¥e believe t h a t with a bigger training set, ]3HD's
t)erf(}rnlance will iml}rove even further.
4.6 C o l n l ) a r i s o n w i t h O t h e r Mo{h,Js
rE() our kn{}wledge, there are {rely two oLher sysl;enls
t h a t (lisanfl)iguate a.(lj{~{:tive-noml llairs from unrest.ri{:l;e{l l;ex{;. Results fl'om both models were evaluated against S e m C o r and thus a comparison is meaningful. In Table 3, each model's accuracy (as well as
157
lhe baseline) is provided since different adjet:tivenoun pairs were e,valuated. We find t;he BIt]) resuits conlpa.ralfle, if not bel;ter, espcc, ially when ihe
}IlllOllll{; Of inq)rovenw, nl; ()vet th0, ])aseline is eonsi(lered. T h e Ino<lel 1)3, ,SI;e,l;ina (1.998) was {,rai]md tm
SemCor t h a t was merged with a flfll senl;ential parse
tree, the d e t e r m i n a t i o n of which is considered a difficult l)rolflem of its own (Collins, 1997). \Ve belie, re
thai; by int:ori}oral;ing tim data from SemC(n (dis(:llSse(l ill I;he fllI;llre work sc,ct;ion)~ {;11o,]}erforlll}lllCe
(ff our sysi;em will surpass Stetina's.
5
Conclusion
and
Future
Work
We have 1)resenl;etl a t)rol}~fl}ilistic disanfl}iguation
model t h a t ix sysl;e, ln~l;i{,, a(tcllral;e, all(l require llHlllual intervention ill only ill two places. The more
l;illle (:OllSlllllill~ of tim l;wo manual l;asks is to (:]assitS' th(~ toll 100 nouns needed for the priors. T h e
el;her task~ of disanfl)igual;ing l)rol;olTpical llOllllS, is
relal;ively simple due to the limited nunfl)er of glossary nouns per sense. IIowever, it would l}e straightforward to i n c o r p o r a t e semantically tagged corpora,
such as SemCor, to avoid these m a m m l tasks. T h e
priors are the n u m b e r of instances of each adjective
sense divided by all of the adjectives in the corpus.
T h e d i s a m b i g u a t e d adjectiveT~i-noun#.] pairs from
the corpus ean be used as training sets to build bet{e,r ret/resental;ion of selectional preferences l}y inserting tim nounT~j node mid the ac(:omf}any featm'es
into the l}elief network of a{ljectivegfii. T h e insertion
is the same prot:c/hue used to add the hyllothe|;ical
evidence dm'ing the inferoncing stage. T h e Ul)(lated
belief networks could then be, used tbr disambigualion wii;h improve.d at:curacy. Furthernlore, the performance of B I I D (:(mid a.lso be improved by exl)and-
Sadao Kurohashi Jiri Stetina and Makoto Nagao.
1998. General word sense disambiguation method
based oll a flfll sentential context. In Proceedings
ing the context or using statistical learning methods
such as the EM algorithln (DemI)ster et al., 1977).
Using Bayesian networks gives the model ttexibility
to incorporate additional contexts, such as syntactical and morphological features, without incurring
exorbitant costs.
It is l)ossible that, with an extended model that
accurately disambiguates adjective-noun pairs, the
selectional preference of adjective senses coutd be automatically learned. Having all improved knowledge
al)out the selectional 1)references would then provide
better parameters for disanfl)iguation. The model
can be seen as a bootstrapping learning process tbr
disambiguation, where the information gained from
one part (selectional preference) is used to improve
tile other (disambiguation) and vice versa, reminiscent of the work by Riloff and Jones (1.999) and
Yarowsky (1995).
Lastly, the techniques used in this paper could be
scaled to disambiguate not only all adjective-noun
pairs, but also other word pairs, such as subjectverb, verb-object, adverb-verb, by obtaining most of
the paraineters from the Internet and WordNet. If
the information fi'oln SemCor is also used, then the
system could be automatically trained to pertbrm
disambiguation tasks on all content words within a
of COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal,
Canada, .July.
Rada Mihalcea and Dan Moldovan. 1998. Word
sense disambiguation base(! on semantic density.
In Proceedings of COLING-ACL Workshop on Usage of WordNct in Natural Language Proecssing,
Montreal, Canada, July.
G. Miller, C. Leacock, and R. Tengi. 1993. A semantic concordance. In Procccdings of ARPA I]uTnan
Language Technology, Princeton.
G. Miller. 1990. WordNet: An on-line lexical
database. International Journal of Lexicography,
3(4:).
Hwee Tou Ng and Hian Beng Lee. 1996. Integrating
inultiple knowledge sources to disambiguate word
sense: An exemplar-based approaclL In Proceed-
ings of the 3/tth, Annual Meeting of ACL, Santa
Cruz, June.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmalm, San Mateo, CA.
Philit) Resnik an(1 David Yarowsky. 1997. A perspective ell word sense disambiguation methods
mid their evaluation. In ANLP Workshop on Tag-
SellteI1Ce.
In this paper, we have addressed three of what we
believe to be the main issues timed 1)y current WSD
systems. We demonstrated the effectiveness of the
teclmiques used, while identii~ying two mmmal tasks
that don't necessarily require a semantically tagged
corpus. By establishing accurate priors a.nd small
training sets, our system achieved good initial disambiguation accuracy. The salne methods could 1)e
flflly automated to disami)iguate all content word
pairs if infbrmation from semantically tagged corpora is used. Our goal is to create a system that can
disambiguate all content words to an accuracy level
sufficient for automatic tagging with tummn validation, which could then be used to improve or facilitate new probabilistic semantic taggers accurate
enough for other NLP applications.
ging Text with, Lexical Semantics, Washington,
D.C., June.
Philip Resnik. 1997. Selectional preference and
sense disambiguation. In ANLP Worksh.op on
Tagging Text with, Lcxical Semantics, Wash,ington, D.C., June.
Ellen Riloff and Rosie Jones. 1999. Learning dictionaries fbr information extraction by multi-level
bootstrapping. In Proceedings of AAAI-O9, O f
lando, Florida.
aanyce Wiebe, Tom O'Hara, and Rebecca Bruce.
1998. Constru(:ting bayesian networks from WordNet for word-sense disambiguation: I/.el)resentational and I)rocessing issues. In Proceedings of
COLING-ACL Workshop on Usage of WordNct in
Natural Language Processing, Montreal, Canada,
July.
David Yarowsky. 1992. Word-sense disambiguation using statistical model of Roget's categories trained on large corpora. In Proceedings of
References
Eneko Agirre and German Rigau. 1996. Word sense
dismnbiguation using conceptual density. In Pro-
COLING-92, Nantes, France.
ceedings of COLING-96, Copenhagen.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods. In
Michael Collins. 1997. Three generative, lexicalised
models for statistical 1)arsing. In Proceedings of
the 351h Annual Meeting of the ACL, pages 1623, Madrid, SI)ain.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977.
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical
Society, 39 (B): 1-38.
Proccedings of the 33rd Annual Meeting of the
ACL.
158