G udrun M agnusdottir Collocations in Knowledge Based Machine Translation A bstract In order to generate colloquial language within the computational lin guistics paradigm the problems o f co-occurrence must be solved. As of yet research on co-occurrence has mostly focused on problems of syntax and selectional restrictions to describe the contextual relation within the sentence. Collocations and idioms have been neatly put aside as a unified problem to be dealt with in the lexicon or not at all. In this paper collocations are defined according to the principles of semantics and a suggestion as to how to work on the retrieval of coUoca^ tions, focussing on adjective noun constructions, from a text corpus will be made. The research was carried out at the Center for Machine Translation, Carnegie Mellon University, together with Sergei Nirenburg and heavily inspired by Professor AUen’s (Allén et al’s 1975) work on collocations. 1 Defining Collocations Id io m s a n d c o llo c a t io n s a re tw o v e ry differen t p r o b le m s o n a sem a n tic level. T h e e a r ly d e fin itio n o f c o llo c a tio n s (F ir th 1957, B e n s o n 1985) as b e in g a n y th in g th a t fre q u e n tly c o -o c c u r s ca n n o lo n g e r b e a c c e p te d . T h is d efin ition d isca rd s th e p r in c ip le o f s y n ta c tic a to m s a n d w o u ld th u s in c lu d e su ch freq u en t p a tte rn s as ‘ it is ’ e tc . A d d in g th e co n s tra in t o f a to m ic it y w o u ld elim in a te su ch p a tte rn s bu t w o u ld n g t b e su fficien t t o d istin g u ish b etw een id io m s an d c o llo c a tio n s . C o llo c a t io n s a re a strin g o f w o rd s th a t c o -o c c u r u n d er re strictio n s n o t d e fin a b le b y s y n ta x n o r s e le c tio n a l r e s trictio n s a lo n e . T h e s e re strictio n s ca n b e re ferre d t o as le x ic a l r e s trictio n s sin ce th e se le ctio n o f th e le x ica l unit is n o t c o n c e p tu a l, th u s s y n o n y m s c a n n o t re p la ce th e c o llo c a te . T h e m ea n in g o f a c o l lo c a t io n is c o m p o s itio n a l w h erea s th e m e a n in g o f an id io m is n o t. C o llo c a t io n s are c o m p o s itio n a l w ith h ie ra rch ica l rela tion s a m o n g th e lex ica l u n its. T h e p re v io u s s tr u c tu r a l su rfa ce d e fin itio n in clu d in g a h ead , as th e m ain w o r d in th e c o n s tr u c tio n , a n d a c o llo c a t e , as th e s u p p o r tin g w o rd is a cce p ta b le as is. 204 Proceedings of NODALIDA 1989, pages 204-207 GuSriin Magnusdottir: Collocations in M T 2 205 Detecting Collocations in a Text A m u lti-w o rd id io m o fte n v io la te s s e le ctio n a l re s trictio n s d u e t o m e ta p h o r ic a l use o f w o rd s w h erea s a c o llo c a tio n w ill n o t. T h u s an id io m m a y b e d e te c te d in fa ilin g p arses w h e re a c o llo c a tio n w ill b e p a rsed u n d e te c te d . C o llo c a t io n s are ‘ p e rm itte d p a tte r n s ’ in co n tra st t o id io m s th a t are o fte n ‘ p r o h ib ite d p a tte r n s ’ w ith in th e se le ctio n a l r e strictio n fra m e. P e r m itte d p a tte rn : ‘ la rg e c o k e ’ P r o h ib ite d p a tte rn : ‘ as la rg e as life ’ T h e b o r d e r lin e b e tw e e n th e tw o is d ifficu lt t o d ra w . S u ch q u e stio n s as is ‘ sh in in g tr u th ’ a c o llo c a tio n o r an id io m axe in d e e d n o t e a sily an sw ered . O b je c t s th a t ‘ sh in e’ h ave th e p r o p e r ty TO_REFLECT_LIGHT a n d a re +C0NCRETE. T h e p r o p e r ty list o f ‘ T r u t h ’ d o e s n o t in clu d e th ese a n d sh o u ld th e y b e a d d e d it w o u ld resu lt in e x tr e m e o v e r-g e n e ra tio n to g e th e r w ith th e p o s s ib ility o f fa u lty p arses. T h u s ‘ T r u th ’ c a n n o t ‘sh in e’ e x c e p t in a id io m a tic sen se a n d w ill th e re fo re b e tre a te d as an id io m . A m b ig u itie s m a y arise b e tw een an id io m a tic m e a n in g a n d n o n -id io m a tic . SimUaxly, a m b ig u ities m a y arise b etw ee n a c o llo c a t io n a l m e a n in g a n d n on c o llo ca tio n a l m ea n in g as in th e e x a m p le : D e c id e o n a b o a t. W h e r e ‘ o n ’ is a p r e p o s itio n o r a c o llo c a te . O u t o f c o n te x t th e se n te n ce has tw o m ea n in g s a n d th ere is n o w a y o f d e c id in g w h ich is th e righ t o n e . In su ch ca ses c o n te x tu a l in fo r m a tio n is th e o n ly d is a m b ig u a tin g fa c t o r . T h u s th e m a n u a l la b o r in v o lv e d w h en w o rk in g o n c o llo c a tio n s w ill in v o lv e g o in g th r o u g h th e c o r p u s to d e te c t th e c o llo c a tio n s as w ell as sy s te m a tic a lly e n terin g th e m in a le x ic o n . R e tr ie v in g c o llo c a tio n s is n o t a n e a sy ta sk fo r a n a tiv e sp e a k er, s im p ly d u e t o th e fa c t th a t a c o llo c a tio n is th e n a tu ra l w a y o f ex p re ssio n th a t is m o r e ea sily d e te c te d th r o u g h v io la tio n s in g e n e ra tio n , in th e o u tp u t fr o m a n a tu ra l la n g u a g e sy stem o r a n o n -n a tiv e sp ea k er. It w o u ld b e fu tile t o p r o v id e a h u m a n u ser w ith th e sa m e in te ra ctiv e k n o w le d g e a cq u isitio n t o o l t o w o rk o n b o t h c o llo c a t io n s a n d id iom s. C o n sid e r th e sen ten ces T h e r e is a little lig h t in th e w in d o w . T h e r e is a sm a ll lig h t in th e w in d o w . T h e le m m a ‘ lig h t’ h as th ree d ifferen t le x e m e s in L o n g m a n ’s, le x e m e o n e , th e n o u n , has six teen sen ses. O f th ese th e first fiv e a re m o st lik ely t o a p p e a r in te ch n ica l te x ts . L ig h t (c a t NOUN) sen se 1: n a tu ra l fo r c e U p r o p e r ty : QUANTITY sen se 2: s o u rce o f ligh t C p r o p e r ty : SIZ E sen se 3: su p p ly o f ligh t U p r o p e r ty : STRENGTH, QUALITY sen se 4: light (a s tim e ) U p r o p e r ty : QUANTITY sen se 5: set b u rn in g C p r o p e r ty : QUANTITY Proceedings of NODALIDA 1989 205 206 Computational Linguistics — Reykjavik 1989 T h e a b b r e v ia t io n ‘ U ’ sta n d s fo r u n c o u n ta b le cind ‘ C ’ fo r c o u n ta b le . T h e a d je c t iv e little h as fo u r sen ses, lin ked t o th ree scales. L ittle — (c a t ADJ) sen se 1: sm a ll sca le: SIZ E sen se 2: sh o r t - tim e sen se 3: y o u n g sca le: TINE sca le: AGE sen se 4: (id io m a tic ) T h e sca le QUANTITY, w h ich is p r o b a b ly th e m o st fre q u e n tly u sed m ea n in g o f ‘ little ’ , is n o t p resen t in th e d ic t io n a r y d e fin itio n . It m a tch e s senses o n e , fou r, a n d fiv e o f ‘ lig h t’ . T h e a d je c tiv e ‘ s m a ll’ has seven sen ses in L o n g m a n ’s. T h e last fo u r have n o re le v a n ce as t o sca ler m ea n in g s. S m a ll (c a t A D J ) sen se 1: sen se 2: little in size. sca le: SIZE w eig h t. sca le: WEIGHT fo r c e . scale: STRENGTH im p o r ta n c e sca le: IMPORTANCE d o in g o n ly a lim ite d am ount o f X sen se 3: sca le: ACTIVITY v e r y little, sligh t (w ith U n o u n s) sca le: QUANTITY S en se o n e o f ‘s m a ll’ is v e r y h e a v ily c o m p ile d , fo r w h a te v e r rea son . T h e scales o f th e tw o a d je c tiv e s ca n b e lin k ed t o th e p r o p e r tie s o f th e tw o c o n c e p ts in a ) a n d b ). a ) L ig h t($ I S -T 0 K E N -0 F LIGHT) b ) L ig h t($ I S -T 0 K E N -0 F LIGHT-BULB) W h e r e a ) rep resen ts sen se o n e w ith th e p r o p e r ty QUANTITY, th a t g iv en a m o d ifie r w ith a q u a n tita tiv e m e a n in g , w ill g iv e a cce ss t o a certa in p o s itio n on th e s ca le QUANTITY. In th is ca se th e p o s it io n is rep resen ted b y ‘s m a ll’ a n d ‘ little ’ as e q u iv a len t s y n o n y m s . S en se tw o is re p resen ted b y b ) w ith th e p r o p e r ty S IZ E th a t sim ila rly giv es th e p o s it io n o n th e sca le S IZ E , re p resen tin g th e eq u iva len ts ‘s m a ll’ a n d ‘ little ’ . A s fo r a n a ly sis th is seem s fu tile, sin ce in p u t te x ts are reg a rd e d as c o r r e c t a n d th e a d je c tiv e is a lre a d y p resen t. H ow e v er it is n o t p o ssib le t o a ccess an u n a m b ig u o u s resu lt. P a rs in g a t its b e s t, a n d le x ic a l m a p p in g i.e. m a p p in g su rfa ce e x p re s s io n s t o c o n c e p ts , w ill g iv e tw o p o s s ib le c o n c e p ts , in a ca se w h e re th ere is n o a m b ig u ity . T h u s le x ica l r e s trictio n s w o u ld d isa m b ig u a te b etw ee n th e co n c e p ts . In g e n e r a tio n th e w r o n g c h o ic e o f a d je c tiv e s w ill lea d t o a w r o n g in terpretar tio n b y th e rea d e r. In o r d e r t o b e a b le t o g e n e ra te a n y in fo r m a tio n re g a rd in g th ese lin ks th e se n te n ce s n eed t o b e a n a ly z e d c o n c e p tu a lly , th a t so rt o f a n a lysis is o n ly p ro v id e d b y k n o w le d g e b a se d fo rm a lis m s u sin g o n to lo g ie s a n d h u m an in te ra ctio n t o ensure u n a m b ig u o u s resu lts fr o m th e s o u r c e la n g u a g e an alysis. Proceedings of NODALIDA 1989 206 Guörun Magnusdöttir: Collocations in M T 3 207 Knowledge Based Machine Translation A t th e C e n ter fo r M ^lchine T ra n s la tio n , C a rn e g ie M e llo n U n iversity , research has been ca rried o u t fo r s o m e y ea rs o n k n o w le d g e b a se d m a ch in e tra n sla tio n . T h e m ost recen t resu lt is a p r o t o t y p e s y ste m ( K B M T -8 9 ) d e liv e re d t o th e fin a n cie r, IB M J a p a n , in fe b r u a r y 1989. T h e p r o t o t y p e s y s te m co n sists o f differen t m o d u le s fo r n a tu ra l la n g u a g e a n a l ysis, k n ow led g e a c q u is itio n , a n d n a tu ra l la n g u a g e g e n e ra tio n . T h e s y s te m is an in terlin g u a sy stem re ly in g o n h um an in te ra ctio n fo r d is a m b ig u a tio n o f m u ltip le p arsin g resu lts. T h e user is a id e d in th e d is a m b ig u a tio n p r o c e s s b y th e A u g m e n tor, a s p e c ia lly b u ilt in te ra ctio n c o m p o n e n t th a t a llo w s th e user t o c h o o s e betw een th e a m b ig u itie s a t h a n d . T h e resu lt is a m e a n in g re p re se n ta tio n (in te r lin gu a ) th a t is th en th e in p u t o f th e g e n e ra tio n c o m p o n e n t. T h e a n a ly sis is b a se d o n a L F G like g ra m m a r to g e th e r w ith th e se m a n tics p resen t in th e k n ow led g e b a se o r th e o n to lo g y , th e su rfa ce e x p re ssio n s are m a p p e d o n t o th e c o n c e p ts in th e o n t o lo g y g iv in g o p tim a l g r o u n d s fo r k n o w le d g e a cq u isition . D u e t o th e fa c t th a t th e resu lt fr o m th e a n a ly sis a n d h u m an in te ra ctio n is u n a m b ig u ou s w ith linlcs t o th e c o r r e c t c o n c e p ts , a filte r fo r g e n e ra tio n a n d d isa m b ig u a tio n fo r th e a n a ly z e d la n g u a g e ca n b e g e n e ra te d . H o w th is is t o b e sy ste m ize d w ill b e p u b lish ed in a fo r t h c o m in g p a p e r. References Allén, S. et al. 1975. Nusvensk frekvensordbok baserad på tidningstext. 3. Ordförbindels er. (Frequency Dictionary o f Present-Day Swedish. 3. Collocations.) Stockholm. Firth, J.R. 1957. Papers in Linguistics 1934-1951. Oxford: Oxford University Press. Chomsky, N. 1965. Aspects o f the Theory o f Syntax. Cambridge, Mass.: M IT Press. Cumming, S. 1987. The Lexicon in Text Generation. Dupl. 1987 Linguistic Institute Workshop: The Lexicon in Theoretical and Computational Perspective. Stanford University 1987. Longman Dictionary o f Contemporary English. 1987. Longman Group Ltd. Magnusdöttir, G. 1987. Fastcat Pilot-Study Report: Translation Systems and Transla tor Interviews. Språkdata. Magnusdöttir, G. 1988. Problems of Lexical Access in Machine Translation. In Stud ies in Computer-Aided Lexicology. Data Linguistica 18. Stockholm: Almqvist & Wiksell International. Nirenburg et al. 1989. KBM T-89, project report. Center for Machine Translation. Carnegie Mellon University. D e p a r tm e n t o f C o m p u t a tio n a l L in g u istics, U n iv e rs ity o f G o th e n b u r g , 4 1 2 98 G ö t e b o r g , S w eden . gu drim O hu m . g u . s e Proceedings of NODALIDA 1989 207
© Copyright 2026 Paperzz