W89-0118 - Association for Computational Linguistics

G udrun M agnusdottir
Collocations in Knowledge Based
Machine Translation
A bstract
In order to generate colloquial language within the computational lin­
guistics paradigm the problems o f co-occurrence must be solved. As of
yet research on co-occurrence has mostly focused on problems of syntax
and selectional restrictions to describe the contextual relation within the
sentence. Collocations and idioms have been neatly put aside as a unified
problem to be dealt with in the lexicon or not at all.
In this paper collocations are defined according to the principles of
semantics and a suggestion as to how to work on the retrieval of coUoca^
tions, focussing on adjective noun constructions, from a text corpus will
be made.
The research was carried out at the Center for Machine Translation,
Carnegie Mellon University, together with Sergei Nirenburg and heavily
inspired by Professor AUen’s (Allén et al’s 1975) work on collocations.
1
Defining Collocations
Id io m s a n d c o llo c a t io n s a re tw o v e ry differen t p r o b le m s o n a sem a n tic level.
T h e e a r ly d e fin itio n o f c o llo c a tio n s (F ir th 1957, B e n s o n 1985) as b e in g a n y th in g
th a t fre q u e n tly c o -o c c u r s ca n n o lo n g e r b e a c c e p te d . T h is d efin ition d isca rd s th e
p r in c ip le o f s y n ta c tic a to m s a n d w o u ld th u s in c lu d e su ch freq u en t p a tte rn s as
‘ it is ’ e tc . A d d in g th e co n s tra in t o f a to m ic it y w o u ld elim in a te su ch p a tte rn s bu t
w o u ld n g t b e su fficien t t o d istin g u ish b etw een id io m s an d c o llo c a tio n s .
C o llo c a t io n s a re a strin g o f w o rd s th a t c o -o c c u r u n d er re strictio n s n o t d e­
fin a b le b y s y n ta x n o r s e le c tio n a l r e s trictio n s a lo n e . T h e s e re strictio n s ca n b e
re ferre d t o as le x ic a l r e s trictio n s sin ce th e se le ctio n o f th e le x ica l unit is n o t
c o n c e p tu a l, th u s s y n o n y m s c a n n o t re p la ce th e c o llo c a te . T h e m ea n in g o f a c o l­
lo c a t io n is c o m p o s itio n a l w h erea s th e m e a n in g o f an id io m is n o t.
C o llo c a t io n s are c o m p o s itio n a l w ith h ie ra rch ica l rela tion s a m o n g th e lex ica l
u n its. T h e p re v io u s s tr u c tu r a l su rfa ce d e fin itio n in clu d in g a h ead , as th e m ain
w o r d in th e c o n s tr u c tio n , a n d a c o llo c a t e , as th e s u p p o r tin g w o rd is a cce p ta b le
as is.
204
Proceedings of NODALIDA 1989, pages 204-207
GuSriin Magnusdottir: Collocations in M T
2
205
Detecting Collocations in a Text
A m u lti-w o rd id io m o fte n v io la te s s e le ctio n a l re s trictio n s d u e t o m e ta p h o r ic a l
use o f w o rd s w h erea s a c o llo c a tio n w ill n o t. T h u s an id io m m a y b e d e te c te d in
fa ilin g p arses w h e re a c o llo c a tio n w ill b e p a rsed u n d e te c te d . C o llo c a t io n s are
‘ p e rm itte d p a tte r n s ’ in co n tra st t o id io m s th a t are o fte n ‘ p r o h ib ite d p a tte r n s ’
w ith in th e se le ctio n a l r e strictio n fra m e.
P e r m itte d p a tte rn : ‘ la rg e c o k e ’
P r o h ib ite d p a tte rn : ‘ as la rg e as life ’
T h e b o r d e r lin e b e tw e e n th e tw o is d ifficu lt t o d ra w . S u ch q u e stio n s as is ‘ sh in ­
in g tr u th ’ a c o llo c a tio n o r an id io m axe in d e e d n o t e a sily an sw ered . O b je c t s th a t
‘ sh in e’ h ave th e p r o p e r ty TO_REFLECT_LIGHT a n d a re +C0NCRETE. T h e p r o p e r ty
list o f ‘ T r u t h ’ d o e s n o t in clu d e th ese a n d sh o u ld th e y b e a d d e d it w o u ld resu lt
in e x tr e m e o v e r-g e n e ra tio n to g e th e r w ith th e p o s s ib ility o f fa u lty p arses. T h u s
‘ T r u th ’ c a n n o t ‘sh in e’ e x c e p t in a id io m a tic sen se a n d w ill th e re fo re b e tre a te d
as an id io m .
A m b ig u itie s m a y arise b e tw een an id io m a tic m e a n in g a n d n o n -id io m a tic .
SimUaxly, a m b ig u ities m a y arise b etw ee n a c o llo c a t io n a l m e a n in g a n d n on c o llo ­
ca tio n a l m ea n in g as in th e e x a m p le :
D e c id e o n a b o a t.
W h e r e ‘ o n ’ is a p r e p o s itio n o r a c o llo c a te . O u t o f c o n te x t th e se n te n ce has
tw o m ea n in g s a n d th ere is n o w a y o f d e c id in g w h ich is th e righ t o n e . In su ch ca ses
c o n te x tu a l in fo r m a tio n is th e o n ly d is a m b ig u a tin g fa c t o r . T h u s th e m a n u a l la b o r
in v o lv e d w h en w o rk in g o n c o llo c a tio n s w ill in v o lv e g o in g th r o u g h th e c o r p u s to
d e te c t th e c o llo c a tio n s as w ell as sy s te m a tic a lly e n terin g th e m in a le x ic o n .
R e tr ie v in g c o llo c a tio n s is n o t a n e a sy ta sk fo r a n a tiv e sp e a k er, s im p ly d u e
t o th e fa c t th a t a c o llo c a tio n is th e n a tu ra l w a y o f ex p re ssio n th a t is m o r e ea sily
d e te c te d th r o u g h v io la tio n s in g e n e ra tio n , in th e o u tp u t fr o m a n a tu ra l la n g u a g e
sy stem o r a n o n -n a tiv e sp ea k er. It w o u ld b e fu tile t o p r o v id e a h u m a n u ser w ith
th e sa m e in te ra ctiv e k n o w le d g e a cq u isitio n t o o l t o w o rk o n b o t h c o llo c a t io n s a n d
id iom s.
C o n sid e r th e sen ten ces
T h e r e is a little lig h t in th e w in d o w .
T h e r e is a sm a ll lig h t in th e w in d o w .
T h e le m m a ‘ lig h t’ h as th ree d ifferen t le x e m e s in L o n g m a n ’s, le x e m e o n e , th e
n o u n , has six teen sen ses. O f th ese th e first fiv e a re m o st lik ely t o a p p e a r in
te ch n ica l te x ts .
L ig h t (c a t NOUN)
sen se 1: n a tu ra l fo r c e U
p r o p e r ty : QUANTITY
sen se 2: s o u rce o f ligh t C
p r o p e r ty : SIZ E
sen se 3: su p p ly o f ligh t U
p r o p e r ty : STRENGTH, QUALITY
sen se 4: light (a s tim e ) U
p r o p e r ty : QUANTITY
sen se 5: set b u rn in g C
p r o p e r ty : QUANTITY
Proceedings of NODALIDA 1989
205
206
Computational Linguistics — Reykjavik 1989
T h e a b b r e v ia t io n ‘ U ’ sta n d s fo r u n c o u n ta b le cind ‘ C ’ fo r c o u n ta b le . T h e a d ­
je c t iv e little h as fo u r sen ses, lin ked t o th ree scales.
L ittle — (c a t ADJ)
sen se 1: sm a ll
sca le: SIZ E
sen se 2: sh o r t - tim e
sen se 3: y o u n g
sca le: TINE
sca le: AGE
sen se 4: (id io m a tic )
T h e sca le QUANTITY, w h ich is p r o b a b ly th e m o st fre q u e n tly u sed m ea n in g o f
‘ little ’ , is n o t p resen t in th e d ic t io n a r y d e fin itio n . It m a tch e s senses o n e , fou r,
a n d fiv e o f ‘ lig h t’ .
T h e a d je c tiv e ‘ s m a ll’ has seven sen ses in L o n g m a n ’s. T h e last fo u r have n o
re le v a n ce as t o sca ler m ea n in g s.
S m a ll (c a t A D J )
sen se 1:
sen se 2:
little in size.
sca le: SIZE
w eig h t.
sca le: WEIGHT
fo r c e .
scale: STRENGTH
im p o r ta n c e
sca le: IMPORTANCE
d o in g o n ly a lim ite d
am ount o f X
sen se 3:
sca le: ACTIVITY
v e r y little, sligh t
(w ith U n o u n s)
sca le: QUANTITY
S en se o n e o f ‘s m a ll’ is v e r y h e a v ily c o m p ile d , fo r w h a te v e r rea son . T h e scales
o f th e tw o a d je c tiv e s ca n b e lin k ed t o th e p r o p e r tie s o f th e tw o c o n c e p ts in a )
a n d b ).
a ) L ig h t($ I S -T 0 K E N -0 F LIGHT)
b ) L ig h t($ I S -T 0 K E N -0 F LIGHT-BULB)
W h e r e a ) rep resen ts sen se o n e w ith th e p r o p e r ty QUANTITY, th a t g iv en a
m o d ifie r w ith a q u a n tita tiv e m e a n in g , w ill g iv e a cce ss t o a certa in p o s itio n on
th e s ca le QUANTITY. In th is ca se th e p o s it io n is rep resen ted b y ‘s m a ll’ a n d ‘ little ’
as e q u iv a len t s y n o n y m s .
S en se tw o is re p resen ted b y b ) w ith th e p r o p e r ty S IZ E th a t sim ila rly giv es
th e p o s it io n o n th e sca le S IZ E , re p resen tin g th e eq u iva len ts ‘s m a ll’ a n d ‘ little ’ .
A s fo r a n a ly sis th is seem s fu tile, sin ce in p u t te x ts are reg a rd e d as c o r r e c t
a n d th e a d je c tiv e is a lre a d y p resen t. H ow e v er it is n o t p o ssib le t o a ccess an
u n a m b ig u o u s resu lt. P a rs in g a t its b e s t, a n d le x ic a l m a p p in g i.e. m a p p in g su rfa ce
e x p re s s io n s t o c o n c e p ts , w ill g iv e tw o p o s s ib le c o n c e p ts , in a ca se w h e re th ere is
n o a m b ig u ity . T h u s le x ica l r e s trictio n s w o u ld d isa m b ig u a te b etw ee n th e co n c e p ts .
In g e n e r a tio n th e w r o n g c h o ic e o f a d je c tiv e s w ill lea d t o a w r o n g in terpretar
tio n b y th e rea d e r.
In o r d e r t o b e a b le t o g e n e ra te a n y in fo r m a tio n re g a rd in g th ese lin ks th e
se n te n ce s n eed t o b e a n a ly z e d c o n c e p tu a lly , th a t so rt o f a n a lysis is o n ly p ro v id e d
b y k n o w le d g e b a se d fo rm a lis m s u sin g o n to lo g ie s a n d h u m an in te ra ctio n t o ensure
u n a m b ig u o u s resu lts fr o m th e s o u r c e la n g u a g e an alysis.
Proceedings of NODALIDA 1989
206
Guörun Magnusdöttir: Collocations in M T
3
207
Knowledge Based Machine Translation
A t th e C e n ter fo r M ^lchine T ra n s la tio n , C a rn e g ie M e llo n U n iversity , research has
been ca rried o u t fo r s o m e y ea rs o n k n o w le d g e b a se d m a ch in e tra n sla tio n . T h e
m ost recen t resu lt is a p r o t o t y p e s y ste m ( K B M T -8 9 ) d e liv e re d t o th e fin a n cie r,
IB M J a p a n , in fe b r u a r y 1989.
T h e p r o t o t y p e s y s te m co n sists o f differen t m o d u le s fo r n a tu ra l la n g u a g e a n a l­
ysis, k n ow led g e a c q u is itio n , a n d n a tu ra l la n g u a g e g e n e ra tio n . T h e s y s te m is an
in terlin g u a sy stem re ly in g o n h um an in te ra ctio n fo r d is a m b ig u a tio n o f m u ltip le
p arsin g resu lts. T h e user is a id e d in th e d is a m b ig u a tio n p r o c e s s b y th e A u g m e n tor, a s p e c ia lly b u ilt in te ra ctio n c o m p o n e n t th a t a llo w s th e user t o c h o o s e
betw een th e a m b ig u itie s a t h a n d . T h e resu lt is a m e a n in g re p re se n ta tio n (in te r­
lin gu a ) th a t is th en th e in p u t o f th e g e n e ra tio n c o m p o n e n t.
T h e a n a ly sis is b a se d o n a L F G like g ra m m a r to g e th e r w ith th e se m a n ­
tics p resen t in th e k n ow led g e b a se o r th e o n to lo g y , th e su rfa ce e x p re ssio n s are
m a p p e d o n t o th e c o n c e p ts in th e o n t o lo g y g iv in g o p tim a l g r o u n d s fo r k n o w le d g e
a cq u isition .
D u e t o th e fa c t th a t th e resu lt fr o m th e a n a ly sis a n d h u m an in te ra ctio n
is u n a m b ig u ou s w ith linlcs t o th e c o r r e c t c o n c e p ts , a filte r fo r g e n e ra tio n a n d
d isa m b ig u a tio n fo r th e a n a ly z e d la n g u a g e ca n b e g e n e ra te d . H o w th is is t o b e
sy ste m ize d w ill b e p u b lish ed in a fo r t h c o m in g p a p e r.
References
Allén, S. et al. 1975. Nusvensk frekvensordbok baserad på tidningstext. 3. Ordförbindels­
er. (Frequency Dictionary o f Present-Day Swedish. 3. Collocations.) Stockholm.
Firth, J.R. 1957. Papers in Linguistics 1934-1951. Oxford: Oxford University Press.
Chomsky, N. 1965. Aspects o f the Theory o f Syntax. Cambridge, Mass.: M IT Press.
Cumming, S. 1987. The Lexicon in Text Generation. Dupl. 1987 Linguistic Institute
Workshop: The Lexicon in Theoretical and Computational Perspective. Stanford
University 1987.
Longman Dictionary o f Contemporary English. 1987. Longman Group Ltd.
Magnusdöttir, G. 1987. Fastcat Pilot-Study Report: Translation Systems and Transla­
tor Interviews. Språkdata.
Magnusdöttir, G. 1988. Problems of Lexical Access in Machine Translation. In Stud­
ies in Computer-Aided Lexicology. Data Linguistica 18. Stockholm: Almqvist &
Wiksell International.
Nirenburg et al. 1989. KBM T-89, project report. Center for Machine Translation.
Carnegie Mellon University.
D e p a r tm e n t o f C o m p u t a tio n a l L in g u istics,
U n iv e rs ity o f G o th e n b u r g ,
4 1 2 98 G ö t e b o r g ,
S w eden .
gu drim O hu m . g u . s e
Proceedings of NODALIDA 1989
207