Structural Lexical Heuristics in the Automatic Analysis of

Structural Lexical Heuristics
in the Automatic Analysis of Portuguese
Eckhard Bick
D e p a rtm e n t o f L in g u istic s, A rh u s U n iv e rsity , W ille m o e sg a d e 15 D , D K -8 2 0 0 Å rh u s N
tel: + 45 - 8 9 4 2 2 1 3 1 , fax: + 45 - 86 2 8 1 3 9 7 , e -m a il: lin e b @ h u m .a a u .d k
h ttp ://v isl.h u m .o u .d k /L in g u istic s.h tm l
Abstract
T h e p a p e r d isc u sse s, o n th e le x ic a l lev el, th e in te g ra tio n o f h e u ris tic so lu tio n s in to a le x ic o n
b a se d a n d ru le g o v e rn e d sy ste m fo r th e a u to m a tic a n a ly sis o f u n re s tric te d P o rtu g u e se te x t. In
p a rtic u la r, a m o rp h o lo g y b a se d a n a ly tic a p p ro a c h to le x ic a l h e u ris tic s is p re se n te d a n d e v a lu a te d .
T h e ta g g e r in v o lv e d u se s a 5 0 .0 0 0 en try b ase fo rm le x ic o n as w e ll as p re fix -, su ffix - a n d
in fle x io n e n d in g s le x ic a to a s s ig n p a rt o f sp e e c h a n d o th e r m o rp h o lo g ic a l ta g s to e v e ry w ordform .
in th e tex t, w ith re c a ll ra te s b e tw e e n 9 9 .6 % a n d 9 9 .7 % . M u ltip le re a d in g s are s u b se q u e n tly
d is a m b ig u a te d b y u s in g g ra m m a tic a l ru le s fo rm u la te d in th e C o n s tra in t G ra m m a r fo rm alism . O n
th e n e x t lev el o f a n a ly sis, ta g s fo r sy n ta c tic a l fo rm a n d fu n c tio n a lte rn a tiv e s a re m a p p e d o n to th e
w o rd fo n n s a n d d is a m b ig u a te d in a sim ila r w ay . In sp ite o f u sin g a h ig h ly d iffe re n tia te d ta g set,
th e p a rs e r y ie ld s c o rre c tn e ss ra te s - o n ru n n in g u n re stric te d a n d u n k n o w n te x t - o f o v e r 9 9 % fo r
m o rp h o lo g y /P o S a n d 9 7 -9 8 % fo r sy n tax . A te st site w ith a v a rie ty o f a p p lic a tio n s (p a rsin g ,
c o rp u s se arch es, in te ra c tiv e g ra m m a r te a c h in g a n d - e x p e rim e n ta l - M T h a s b e e n e sta b lish e d a t
h ttp ://v is l.h u m .o u .d k /L in g u is tic s .h tm l.
1 Background
In c o rp u s lin g u istic s, m o s t sy ste m s o f a u to m a tic a n a ly sis c a n b e c la ssifie d b y m e a su rin g th e m
a g a in st th e b ip o la rity o f ru le b a se d v e rsu s p ro b a b ilis tic a p p ro a c h e s. T h u s K a rlsso n (1 9 9 5 )
d is tin g u is h e s b e tw e e n “ p u re ” ru le b a se d o r p ro b a b ilis tic s y ste m s, h y b rid sy ste m s a n d c o m p o u n d
s y ste m s, i.e. ru le b a s e d sy ste m s su p p le m e n te d w ith p ro b a b ilis tic m o d u le s, o r p ro b a b ilis tic
s y ste m s w ith ru le b a s e d “b ia s ” o r p o stp ro c e ssin g . A s a se c o n d p a ra m e te r, le x ic o n d e p e n d e n c y
m ig h t b e ad d ed , sin c e b o th ru le s b a se d a n d p ro b a b ilistic s y ste m s d iffe r in te rn a lly a s to h o w m u c h
u se th e y m a k e o f e x te n siv e le x ic a , b o th in te rm s o f le x ic a l c o v e ra g e a n d g ra n u la rity o f le x ic a l
in fo rm atio n .
T h e c o n s tr a in t g r a m m a r (C G ) fo rm a lism (e.g. K a rlsso n e t al., 1 9 9 5 ), w h ic h I h a v e b e e n u s in g in
m y o w n s y ste m ’ fo r th e a u to m a tic a n a ly sis o f u n re s tric te d P o rtu g u e s e te x t (B ic k , 1996 [1] a n d
1997 [2]), is b o th ru le g o v e rn e d a n d le x ic o n b a se d , fo c u s in g o n d isa m b ig u a tio n o f m u ltip ly
‘ The system was developed in the framework o f a Ph.D.-project at Arhus University, over a period o f three
years, drawing on lexicographic research on Portuguese from an earlier Master's Thesis.
NODALIDA, Copenhagen, January 1998
a s s ig n e d le x ic a l a n d stru c tu ra l re a d in g s as th e m a in to o l o f a n a ly sis. R e a d in g s a re ex p re sse d as
s e ts o f w o rd b a se d m o d u la r ta g s. S y n ta c tic stru c tu re is c o v e re d b y u s in g fu n c tio n tags an d
d e p e n d e n c y m a rk e rs (B ick , 1997 [1]), b u t I w ill h ere c o n c e n tra te o n th e le x ic o -m o rp h o lo g ic a l
le v el. B e fo re a n y C o n stra in t G ra m m a r ru le s c a n apply, all (m o rp h o lo g ic a lly ) p o s s ib le read in g s
h a v e to b e id e n tifie d , a n d I h a v e to th is e n d d ev elo p ed a p re p ro c e sso r, th a t id e n tifie s w o rd fo rm s,
p o ly le x ic a l u n its a n d se n te n c e b o u n d a rie s, as w e ll as a m o rp h o lo g ic a l a n a ly se r f o r P o rtu g u e se
u s in g a n a d a p te d e le c tro n ic v e rs io n o f a p re v io u sly p u b lish e d d ic tio n a ry (B ic k , 1993) in
c o m b in a tio n w ith affix - a n d in fle c tio n e n d in g s le x ic a su p p le m e n te d b y c o rre sp o n d in g altern atio n
ru le s fo r w o rd fo rm a tio n (B ic k , 1995). In th e a n a ly se r's o u tp u t, e v e ry w o rd fo rm is fo llo w ed b y
a s m a n y ta g lin es a s th e re a re p o te n tia l re a d in g s:
( 1)
"< rev ista> "
"rev ista" < + n > < C P > < rr> N F S
"rev estir" < v t> < d e ^ v tp > < d e ^ v rp > V P R 1/3S S U B J V F IN
"rev istar" < v t> V IM P 2S V F IN
"rev istar" < v t> V P R 3S IN D V F IN
"rever" < v t> < v i> V P C
W ith a C G -te rm , su ch an a m b ig u o u s list o f re a d in g s is c a lle d a c o h o rt. In th e e x a m p le , the w o rd
fo rm 're v ista ' h a s o n e n o u n -re a d in g (fe m a le sin g u la r) a n d fo u r (!) v e rb -re a d in g s, the la tte r
c o v e rin g th re e d iffe re n t b a se fo rm s, su b ju n c tiv e , im p e ra tiv e , in d ic a tiv e p re s e n t te n se a n d
p a rtic ip le re a d in g s. C o n v e n tio n a lly , P o S a n d m o rp h o lo g ic a l featu res a re re g a rd e d as p rim ary ta g s
a n d c o d e d b y c a p ita l le tters. In a d d itio n th e re c a n b e se c o n d a ry le x ic a l in fo rm a tio n a b o u t v a le n c y
a n d se m a n tic a l c lass, m a rk e d b y
b ra c k e tin g .
A c o n s tr a in t g r a m m a r ru le b rin g s th e a m b ig u ity p ro b le m to th e fo re g ro u n d b y sp e c ify in g w h ic h
re a d in g (o u t o f a c o h o rt o f a m b ig u o u s re a d in g s fo r a g iv e n w o rd ) is im p o ss ib le (a n d th u s to b e
d is c a rd e d ) o r m a n d a to ry (a n d th u s to b e c h o se n ) in a g iv e n se n te n c e -c o n te x t. F o r in sta n ce, a ru le
m ig h t d is c a rd a fin ite v e rb re a d in g a fte r a p re p o sitio n (2 a) , o r w h e n a n o th e r - u n a m b ig u o u s fin ite v e rb is a lre a d y fo u n d in th e sa m e c la u se , w ith n o c o o rd in a to rs p re s e n t (2b).^
(2 a) R E M O V E (V F IN ) IF (-1 P R P )
[d isc a rd fin ite v erb re a d in g s (V F IN ) i f th e first w o rd to th e le ft (-1 ) is a p re p o sitio n P R P )]
(2 b ) R E M O V E (V F IN ) IF (* 1 C V F IN B A R R IE R C L B O R K C ) (N O T * - l C L B -W O R D )
[d isc a rd V F IN i f th e re is a n o th e r u n a m b ig u o u s (C ) fin ite v e rb (V F IN ) a n y w h e re to th e
rig h t (* 1 ) w ith n o c la u se -b o u n d a ry (C L B ) a n d c o o rd in a tin g c o n ju n c tio n (K C ) in terferin g
(B A R R IE R ). D isc a rd o n ly i f th e re is n o su b o rd in a to r (C L B -W O R D ) a n y w h e re to th e le ft
( * - l) ]
2 Ordinarily, this disambiguation process works on whole cohort lines, i.e. distinguishes between PoS, base form
and inflection, but tolerates competing valency options. However, on a higher level o f analysis, I have introduced
valency and semantical disambiguation, too. This can be very useful for polysemy resolution, like in "rever", where
the transitive <vt> - intransitive <vi> distinction has a meaning correlate: 'tomar a ver' [see again] vs. 'transudar'
[leak through]. Likewise, "revista" followed by a name <+n> or being read (semantical class <rr>) is more likely to
be a newspaper than an inspection (semantical class <CP> for action; +CONTROL, +PERFECTIVE).
NODALIDA, Copenhagen, January 1998
45
W ith c u rre n t so ftw a re , b e fo re a n an a ly sis ru n , a ll ru le s a re tra n s la te d in to a fin ite state n e tw o rk
b y a c o m p ile r p ro g ra m , y ie ld in g th e a c tu a l p a rse r. T h e P o rtu g u e s e g ra m m a r w a s o rig in a lly
w ritte n in th e fo rm a lism s u g g e ste d b y P asi T a p a n a in e n 's firs t c o m p ile r im p le m e n ta tio n , b u t la te r
re w ritte n to m a tc h th e n o ta tio n u se d in h is n e w C G -2 p a rs e r c o m p ile r (T a p a n a in e n , 1996).
B y a p p ly in g th e ru le se t se v e ra l tim e s, th e p a rs e r re n d e rs m o re a n d m o re w o rd s in th e se n te n c e
u n a m b ig u o u s, a n d in th e e n d , o n ly on e re a d in g is le ft fo r e v e ry w o rd . S in c e th e in d iv id u a l ru le
c a n b e m a d e v e ry " c a u tio u s" b y a d d in g m o re c o n te x t c o n d itio n s, a n d sin c e th e la s t su rv iv in g
read in g w ill n e v e r b e d isc a rd e d , th e fo rm a lism is v e ry ro b u st. E v e n im p e rfe c t in p u t w ill y ie ld
s o m e p arse. U n lik e p ro b a b ilis tic sy ste m s, w h e re " m a n u a l in te rfe re n c e " a s in th e in tro d u c tio n o f
b ia s o n b e h a lf o f irre g u la r p h e n o m e n a o fte n h a s a n a d v e rse sid e -e ffe c t o n th e o v e ra ll
p e rfo rm a n c e o f th e p a rs e r (d u e to in te rfe re n c e w ith th e o rd in a ry sta tistic a l "ru les" biised o n th e
r e g u la r
"m ajo rity "
p h e n o m e n a ),
C o n stra in t G ra m m a r to le ra te s
and
e v e n e n c o u ra g e s
th e
in c re m e n ta l " p ie c e m e a l " a d d itio n o f e x c e p tio n s a n d c o n te x t c o n d itio n s fo r in d iv id u a l ru le s (F o r
a c o m p a ris o n o f sta tistic a l a n d c o n stra in t-b a se d m e th o d s se e C h a n o d & T a p a n a in e n , 1994).
2. System Performance
I f th e y c a n b e m a d e to w o rk o n free te x t, ru le b a se d sy ste m s c a n a c h ie v e v e ry lo w e rro r rate s.
W h ile s ta te -o f-th e -a rt p ro b a b ilis tic ta g g e rs still h a v e e rro r ra te s o f o v e r th re e p e rc e n t^ e v e n fo r
P o S ta g g in g , C G b a s e d sy ste m s fare s o m e w h a t b etter. F o r E n g lis h w o rd cla ss e rro r rate s o f u n d e r
0 .3 % h a v e b e e n re p o rte d a t a d isa m b ig u a tio n le v el o f 9 4 -9 7 % (V o u tila in e n , 1992), F o r m y o w n
P o rtu g u e se C G sy ste m , te st ru n s ru n s w ith n e a r 100% d is a m b ig u a tio n o n fic tio n an d n ew s te x ts
su g g e st a c o rre c tn e ss ra te o f o v e r 9 9 % fo r m o rp h o lo g y a n d p a rt o f sp e e c h , w h e n a n a ly sin g
u n k n o w n u n re s tric te d text'*. F o r sy n ta x th e fig u re s a re 9 8 %
fo r c la ssic a l lite ra ry p ro se (E 9 a d e
Q u eiro z, "O te so u ro " ) a n d 9 7 % fo r th e m o re in v e n tiv e " jo u rn a le se " o f n e w s m a g a z in e te x ts
(V E J A ,9 .1 2 .1 9 9 2 ),a s sh o w n in ta b le (3 );
^ Compare, for English, (Garside et.al, 1987) on the HMM based CLAWS system, (Francis and Kucera, 1992)
on recovering PoS tags from the Brown corpus, Ratnaparkhi's maximum-entropy tagger trained on the Penn
Treebank (Marcus et al., 1993) or Brill's stochastic tagger using automated learning (Brill, 1992). For German the
Morphy system described in (Lezius et. al., 1996) achieved an accuracy o f 95.9%.
"The test texts used were not part of the benchmark corpus used to develop the rules, and fresh text chunks were
used for every new test. The present grammar, however, still being improved, does incorporate changes made as a
result of test run errors.
46
NOD ALIDA, Copenhagen, January 1998
(3) System performance on the PoS and syntactic levels:
T e x t:
E r r o r ty p e s :
,'
., > L.
P a r t- o f - s p e e c h e r r o r s
B a s e -fo rm & fle x io n e r r o r s
A ll m o r p h o l o g i c a l e r r o r s
s y n ta c tic : w o rd & p h r a s e s
s y n ta c tic : s u b c la u s e s
A ll s y n t a c t i c e r r o r s
"local" s y n ta c tic e r r o r s d u e to
P o S /m o rp h o lo g ic a l e r r o r s
P u re ly s y n ta c tic e r r o r s
0 tesouro
V E JA J
V EJA 2
c a . 2 5 0 0 w o rd s
c o rre c terro rs
ness .
16
1
9 9 .3 %
17
54
10
9 7 .4 %
64
ca. 4800 w o rd s
e rro rs
c o r r e c tn ess
15
2
17
9 9 .7 %
118
11
129
9 7 .3 %
-2 7
- 23
c a . 3 1 4 0 w o rd s
erro rs
c o rre c tness
24
2
26
9 9 ,2 %
101
13
114
9 6 .4 %
-2 8
37
9 8 .5 %
106
9 7 .8 %
86
9 7 .3 %
3. Lexico-morphological heuristics
Y e t e v e n in a ru le b a se d C G sy ste m , h e u ristic s can b e q u ite usefiil (fo r E n g lish , see K a rlsso n et.
al., 1995). T h u s ru les are u su a lly g ro u p e d acc o rd in g to th e ir "sa fe ty " , i.e. th e ir statistic al
te n d e n c y to m a k e errors. L e ss safe ru le s c a n b e ad d ed as a h e u ristic le v e l o n to p o f a k ern el o f
s a fe ru le s , a n d w ill b e a p p lie d a fte r th e se . A lso , statistic a l in sp ire d " ra rity tag s" (< R a re> ) c a n be
a d d e d to c e rta in less p ro b a b le re a d in g s in th e le x ico n , a n d th e n re fe rre d to b y co n tex tu al
d is a m b ig u a tio n ru les. A th ird fie ld fo r th e a p p lic a tio n o f h e u ristic s is o n th e a n a ly se r level, i.e.
c o n c e rn s th e (le x ic o -m o rp h o lo g ic a l) in p u t o f th e d isa m b ig u a tio n ru le sy ste m . It is th is th ird ty p e
o f h e u ris tic s I a m c o n c e rn e d w ith h e re .
S in c e th e h ig h e r le v els o f th e p a rs in g sy ste m (fo r e x a m p le , P o S a n d sy n ta x ) a re te c h n ic a lly ru le
b a s e d d isa m b ig u a to rs, th e y n e e d s o m e re a d in g fo r ev e ry w o rd to w o rk o n , w h ic h is w h y e v e n
u n a n a ly z a b le w o rd fo rm s (i.e. w o rd fo rm s th a t can n o t b e re d u c e d to a ro o t fo u n d in th e
a n a ly s e r’s le x ico n ) n e e d to b e g iv e n o n e o r m o re h e u ristic re a d in g s w ith re g a rd to w o rd class a n d
fle x io n m o rp h o lo g y . T h e m a jo rity o f su c h c a se s is a c c o u n te d fo r b y u n k n o w n p ro p e r n o u n s (12 % o f a ll w o rd s, d e p e n d in g o n te x t ty p e ), a n d can b e h a n d le d b y a s s ig n in g h e u ristic P R O P ta g s
to a ll c a p ita lise d w o rd s in c e rta in c o n te x ts, a n d th e n a d d in g a n y c o m p e tin g a n a ly tic a l an aly sis,
e s p e c ia lly in th e case o f s e n te n c e in itia l p o sitio n , le a v in g th e fin a l d e c isio n to th e ru le b a se d
d is a m b ig u a tio n m o d u le. T h is w a y , th o u g h 8 0 % o f a ll n o u n s in m y c o rp u s n e e d h e u ristic a l
tre a tm e n t, th e e rro r ra te fo r th e P R O P c la ss as a w h o le c a n b e k e p t a t 2 % , n o t to o fa r fro m th e
ta g g e rs o v e ra ll P o S e rro r ra te (< 1% ).
NODALIDA, Copenhagen, January 1998
47
3.1 "Unanalyzable® words": typology and statistics
T h o u g h a c c o u n tin g fo r o n ly 0 .3 -0 -5 % o f w o rd fo rm s in ru n n in g te x t, o th e r ty p e s o f a n a ly sis
fa ilu re s (i.e. w o rd fo rm s th a t c a n n o t b e re d u c e d to a ro o t fo u n d in th e a n a ly s e r’s le x ic o n ) are
m o re d iffic u lt to h a n d le , d u e to th e ir fim c tio n a l d iv e rs ity a n d th e la c k o f a c le a r m o rp h o lo g ic a l
m a rk e r. T a b le (4 ) p ro v id e s a n e rro r ty p o lo g y f o r a 1 3 L 9 8 1 w o rd lite ra tu re a n d s e c o n d a ry
lite ra tu re c o rp u s (T h e R N P d e p o sito ry o f B ra z ilia n literatu re), c o n ta in in g 6 0 4 u n a n a ly z a b le
w o rd s in th e te s t ru n .
T h re e m a in g ro u p s m a y b e d istin g u ish e d , c o m p risin g o f ro u g h ly o n e th ird o f th e c a se s each :
a)
o rth o g ra p h ic a l e rro rs (sh a d e d in th e ta b le , a n d p a rtia lly c o rre c te d b e fo re h e u ris tic s p ro p e r
b y a n a c c e n t m o d u le re c o g n isin g re g u la r re g io n a l sp e llin g v a ria tio n s)
b)
u n k n o w n a n d u n d e riv a b le P o rtu g u e se w o rd s o r a b b re v ia tio n s
c)
u n k n o w n fo re ig n lo a n w o rd s
(4 ) E rro r ty p e s in " u n a n a ly z a b le " w o rd s:
DO M A IN
F o r e ig n
o r th o g r a p h ic v a r ia tio n
(E u r o p e a n /a c c e h tu a tio n )
o th e r p o r t. O r th o g r a p h ic
n o n -c a p ita lis e d n a m e s a n d
a b b r e v ia tio n s
n a m e s a n d n a m e r o o ts
a b b r e v ia tio n s
r o o t n o t fo u n d in le x ic o n
fo u n d in A u relio^
n o t fo u n d in A u r e lio
d e r iv a tio n /fle x io n p r o b le m
s u ffix
p r e fix
H ex io n e n d in g
a lte r n a t io n in fo r m a tio n
o th e r
SUM
NUM BER OF
TO K ENS
232
125
PERCENTAGE
74
37
12.3
6.1
18
19
119
91
28
3.0
3.1
19.7
15.1
4.6
15
2.5
8
3
2
2
2
604
38.4
20.7
1.3
0.5
0.3
0.3
Correctables
M isspellings
E n cy c lo p a ed ic
lexicon fa ilu r e s
Core lexicon
fa ilu r e s
A ffix lexicon
fa ilu r e s
0.3
100.0
3.2 Analytical morphological heuristics
F o r o p tim a l p e rfo rm a n c e , th e th re e g ro u p s m e n tio n e d a b o v e w o u ld re q u ire d iffe re n t stra te g ie s.
F o re ig n w o rd s a p p e a rin g in ru n n in g P o rtu g u e s e te x t a re ty p ic a lly n o u n s o r n o u n p h ra se s , a n d *
* In this paper I intend "unanalysable word forms" to mean word forms that cannot - by derivation and/or
inflexional analysis - be reduced to a root found in the analyser's lexicon. Of course, only part o f these - typing
errors and foreign language quotes - are really unanalysable, while others might be covered by enlarging the lexicon
or enhancing the scientific derivation list.
* "Novo Dicion^io Aurelio" is the largest monolingual dictionary of Brazilian Portuguese.
48
NODALIDA, Copenhagen, January 1998
try in g to id e n tify v e rb a l e le m e n ts o n ly cau ses tro u b le . In "real" P o rtu g u e se \v o rd s w ith o u t
s p e llin g erro rs, stru c tu ra l c lu e s - lik e fle x io n e n d in g s a n d su ffix e s - sh o u ld b e em p h asised . T h ese
w ill b e m e a n in g fu l in m iss p e lle d P o rtu g u e se w o rd s, to o , b u t, in a d d itio n , sp e c ific ru les a b o u t
le tte r m a n ip u la tio n (d o u b lin g o f le tte rs, m issin g le tters, le tte r in v e rsio n , m issin g b la n k s etc.) a n d
e v e n k n o w le d g e a b o u t k e y b o a rd c h a ra c te ristic s m ig h t m a k e a d iffe re n c e .
M o tiv a te d b y a g ra m m a tic a l p e rsp e c tiv e ra th e r th a n p ro b a b ilistic s, m y a p p ro a c h h a s b e e n to
e m p h a sis e g ro u p s (a) a n d (b ) a n d lo o k fo r P o rtu g u e se m o rp h o lo g ic a l clu es in w o rd s w ith
u n k n o w n stem s. S in ce p re fix e s h a v e v e ry little b e a rin g o n th e p ro b a b ility o f a w o rd 's w o rd class
o r fle x io n a l categ o ries, o n ly th e fle x io n e n d in g s an d su ffix le x ic a a re u sed . A s it a lso d o es in
o rd in a ry m o rp h o lo g ic a l a n a ly sis, th e ta g g e r trie s to id e n tify a w o rd fro m th e right, i.e.
b a c k w a rd s, cu ttin g o f f p o te n tia l e n d in g s o r su ffix es a n d c h e c k in g f o r th e re m a in in g stem in th e
ro o t le x ic o n (th e m a in le x ic o n ). N o rm a lly , fo r (m u ltip ly ) a n a ly s e d w o rd s, u s in g K a rlsso n 's la w
(K a rls s o n 1992, 1 995)’, th e P o rtu g u e s e a n a ly se r w o u ld try to m a k e th e ro o t as lo n g as p o ssib le,
a n d to u s e as few d e riv a tio n a l layers® as p o ssib le . F o r (sy ste m -in te rn a lly ) u n a n a ly za b le w o rd s,
h o w e v e r, I u se th e o p p o site strateg y ; S in c e I am lo o k in g fo r a h y p o th e tic a l ro o t, fle x io n e n d in g s
a n d s u ffix e s are all I'v e g o t, a n d I try to m a k e th e ir h a l f o f th e w o rd (th e rig h t h an d p a rt) as larg e
a s p o ssib le .
W o rk in g w ith a m in im a l ro o t le n g th o f 3 letters, an d c a llin g m y h y p o th e tic a l ro o t 'xx x ', I w ill
sta rt b y re p la c in g o n ly th e first 3 le tte rs o f th e w o rd in q u e stio n b y 'x x x ' a n d try fo r a n an aly sis,
th e n I w ill rep la ce th e firs t 4 le tte rs b y 'x x x ', a n d so o n , u n til - i f n e c e ss a ry - th e w h o le w o rd is
re p la c e d b y 'x x x '.’ F o r a w o rd lik e o n to g e n e tic a m e n te th e re w ritin g re c o rd w ill y ie ld the c h ain
b e lo w . H ere, th e full c h a in is g iv e n , w ith all re a d in g s it w o u ld e n c o u n te r o n its w ay . In th e real
ca se , h o w e v e r, th e ta g g e r - p re fe rrin g lo n g d e riv a tio n s/e n d in g s to s h o rt o n es - w o u ld sto p
s e a rc h in g at th e x x x tic a m e n te -le v e l, w h e re th e first g ro u p o f re a d in g s is fo u n d . In fact, th e
a d v e rb ia l u se o f a n a d je c tiv e ly su ffix e d w o rd is m u c h m o re lik e ly th a n h ittin g u p o n , say, a "ro o to n ly " n o u n w h o se la st 9 le tte rs h a p p e n to in c lu d e b o th th e '-ic o ' a n d th e '-m e n te ' le tte r c h ain s b y
ch an c e.
(5 ) o n to e e n e tic a m e n te -> n o a n a ly sis
xxxogeneticamente
xxxgeneticamente
xxxeneticamente
xxxneticamente*
’’ Karlsson's law states, that of two morpological analyses o f different derivational complexity, the one with fewer
elements is almost always the correct one.
* Karlsson's law can be applied to any string o f free (i.e. compounding), derivational or inflexional morphemes,
but the frequency of ambiguity types with respect to these three elements will differ from language to language thus, in Portuguese, compounding is much rarer than in most germanic languages, while Swedish, the language for
which Karlsson's law was originally fomulated, does have compounding, but not as rich an inflexion morphology.
’ A similar method of partial morphological recognition and circumstantial categorization might be responsible
for a human being's successful inflectional and syntactic treatment of unknown words in a known language; the
Portuguese word games "coHorido" (president Collor & colorido - 'coloured') and "tucanagem" (the party of the
tucanos & sacanagem - 'dirty work'), for instance, will not be understood by a cultural novice in Brazil, even if he is
a native speaker of European Portuguese - but he will still be able to identify both as singular, the first as a past
participle ('-do') and the second as an abstract noun ('-agem') of the feminine gender.
NODALIDA, Copenhagen, January 1998
49
xxxeticamente
xxxticamente
-> su ffix -ico ' (v a ria tio n '-tic o ') + ad v e rb ia l e n d in g '-m e n te '
xxxicamente
"o n to g en e" < D E R S -ic o [A T T R ]> < d e a d j> A D V
-> su ffix '-ico ' + a d v e rb ia l e n d in g '-m en te'
"o n to g en e" < D E R S -ic o [A T T R ]> < d e a d j> A D V
xxxcamente
xxxamente
~> ad v erb ial e n d in g '-m e n te ' (v a ria tio n '-a m e n te ')
"o n to g en etico " < x x x o > < d e a d j> A D V
xxxmente
xxxente
~> "p re se n t p a rtic ip le " -su ffix '-e n te '
"o n to g e n e tic a m e r" < D E R S -e n te [P A R T .P R ]> A D J M /F S
" o n to g e n e tic a m e r" < D E R S -e n te [A G E N T ]> N M /F' S
-> c au sativ e su ffix '-e n ta r'’° + v e rb a l fle x io n e n d in g ’-e'
"o n to g en etica m " < D E R S -a r [C A U S E ]> V P R 1/3S S U B J V F IN
xxxnte
xxxte
xxxe
-> v erb al fle x io n e n d in g '-e'
" o n to g e n e tic a m e n te r" < x x x e r> V IM P 2S V F IN
" o n to g e n e tic a m e n tir" < x x x ir> V IM P 2 S V F IN # # #
" o n to g e n e tic a m e n te r" < x x x e r> V P R 3 S I N D V F IN
"o n to g e n e tic a m e n tir" < x x x ir> V P R 3 S IN D V F IN # # #
" o n to g e n e tic a m e n ta r" < x x x a r> V P R 1/3S S U B J V F IN
XXX
-> no d e riv a tio n o r fle x io n
"o n to g e n e tic a m e n te " < x x x > N F S
" o n to g e n e tic a m e n te " < x x x > N M S
R o o ts w ith 'x x x ' a re p re s e n t in th e c o re le x ic o n a lo n g sid e th e "real" ro o ts, in c lu d in g th e n e c e ss a ry
ste m a lte rn a tio n s " f o r v e rb s (h ere, B b C c fo r d iffe re n t ro o t-stre sse d fo rm s a n d A a iD f o r e n d in g ss tre s s e d fo rm s):
( 6)
ro o t
w o rd c la ss
XXX <sm>
xxxar <am£>
xxxar- <vt>
a lte rn a tio n su b c la ss
A a iD
le x e m e ID
ta rg e t o f an aly sis_________ _
54573
m a sc u lin e n o u n , ty p ic a lly fo re ig n
59547
P o rtu g u e se '-a r'-a d je c tiv e *
54578
e n d in g s-stre sse d fo rm s o f '- a r 'v e rb s
xxxer <sm>
xxxia <sf>
xxxo <sm>
54666
m a sc u lin e n o u n , ty p ic a lly E n g lish *
54665
fe m in in e n o u n , L a tin -P o rtu g u e s e *
54582
m a sc u lin e n o u n , ty p ic a lly
P o rtu g u e se
This suffix is regarded as a variant of'-ar', and therefore normalized in the DER-tag: <DERS -ar [CAUSE]>.
Here, BbCc for different root-stressed forms and AaiD for endings-stressed forms, with D, for example,
meaning a root to be used with future subjunctive endings.
50
NODALIDA, Copenhagen, January 1998
B e s id e s th e ty p ic a l ste in s e n d in g in '-o ', '-a' a n d '-r', d e fa u lt ste m s c o n s is tin g o f a p la in 'xxx' h a v e
b e e n e n te re d to a c c o m m o d a te fo r fo re ig n n o u n s w ith " u n -P o rtu g u e se " sp ellin g . L ik e m an y o th e r
la n g u a g e s, P o rtu g u e se w ill fo rc e its o w n g e n d e r sy ste m e v e n o n to fo re ig n lo a n w o rd s, so a
m a sc u lin e an d a fe m in in e ca se m u s t b e d istin g u ish e d , fo r la te r u s e in th e ta g g e r's d isa m b ig u a tio n
m o d u le .
S in ce th e an a ly se r's h e u ristic s fo r u n k n o w n w o rd s p re fe rs re a d in g s w ith e n d in g s (o r su ffix es) to
th o s e w ith o u t, a n d lo n g e r o n e s to sh o rte r o n es, v e rb a l re a d in g s (e sp e c ia lly th o se w ith in fle x io n
m o rp h e m e s in 'r', 'a' o r 'o') h a v e a "n atu ral" a d v a n ta g e o v e r w h a t re a lly sh o u ld b e n o u n s o r
a d je c tiv e s , e sp e c ia lly w h e n th e s e a p p e a r in th e ir u n in fle c te d sin g u la r b a se fo rm . L e x ic o n -w ise ,
th is te n d e n c y is c o u n te re d b y a d d in g th re e o f th e m o s t c o m m o n ly ig n o re d n o m in al c a se s
s p e c ific a lly in to th e le x ic o n : (a) E n g lish '-er' n o u n s o th e rw is e o n ly ta k e n as P o rtu g u e se
in fin itiv e s, (b) L a tin -P o rtu g u e se '-ia' n o u n s o th e rw ise o n ly re a d as v e rb a l fo rm s in th e im p erfeito
te n se , a n d (c) '-ar' a d je c tiv e s o th e rw ise a n a ly se d o n ly a s in fin itiv e s.
R u le -w ise , v e rb a l re a d in g s a lo n e are n o t a llo w e d to sto p th e h e u ristic s-m a c h in e , it w ill p ro c e e d
u n til it fin d s a re a d in g w ith a n o th e r w o rd class. S o, th e p ro c e s s is set to ig n o re v e rb a l re a d in g s o n
its w a y d o w n th e c h a in o f h y p o th e tic a l w o rd fo rm s w ith e v e r sh o rte r su ffix /e n d in g s-p a rts. T h u s,
th e h e u ris tic s-m a c h in e w ill r e c o r d v erb al read in g s, b u t o n ly sto p i f a n o u n , ad je c tiv e o r ad v e rb
re a d in g is fo u n d in th a t le v e l's c o h o rt (list o f re a d in g s). In th is c o n te x t, p a rtic ip le s a n d g e ru n d s th o u g h v erb al - are tre a te d as "ad jectiv es" a n d "ad v erb s", re sp e c tiv e ly , b e c a u se th e y featu re v e ry
c h a ra c te ristic e n d in g s ('-ad o ', '-id o ', '-an d o ', '-en d o ', '-in d o ').
T h is ra ise s th e p o s sib ility o f th e h e u ristic s-m a c h in e p ro g re s s in g fro m m u lti-d e riv e d an aly se s
(w ith o n e o r m o re su ffix e s) to sim p le an a ly se s (w ith o u t su ffix e s) b e fo re it en co u n ters a n o n ­
v e rb a l read in g . In th is ca se , th e a p p lic a tio n o f K a rlsso n 's la w d o e s still m a k e sen se, a n d w h e n th e
h e u ris tic s -m a c h in e h a n d s its re s u lts o v e r to th e lo c al d is a m b ig u a tio n m o d u le , th is w ill se le c t th e
re a d in g s o f lo w e st d e riv a tio n a l c o m p le x ity , w e e d in g o u t a ll (read : v e rb a l!) re a d in g s c o n ta in in g
m o re (read : v e rb a l!) su ffix e s th a n th e g ro u p selec ted . In th e m iss p e lle d F re n c h w o rd 'enta en te',
fo r e x am p le, th e v e rb a l re a d in g :
(7 a)
"en ta" < D E R S -(e n t)a r [C A U S E ]> V P R 1/3S S U B J V F IN ,
fro m th e 'x x x a e n te '-le v e l, is re m o v e d , le av in g o n ly u n d e riv e d v e rb a l re a d in g s - fro m th e 'x x x e 'le v el, a lo n g w ith th e d e sire d n o u n sin g u la r re a d in g fro m th e 'x x x '-le v e l.
(7 b )
e n ta e n te A L T x x x a e n te A L T x x x e A L T x x x
" en tae n ter" < x x x e r> V IM P 2S V F IN
" en tae n ter" < x x x e r> V P R 3S IN D V F IN
" en tae n tar" < x x x a r> V P R I/3 S S U B J V F IN
"en tae n te" < x x x > N F S
"en tae n te" < x x x > N M S
S in c e a ll d isa m b ig u a tio n n o t re la te d to K a rlsso n 's la w is re fe rre d to th e C G -m o d u Ie, th e w o rd
c la ss c h o ic e b e tw e e n V a n d N w ill b e c o n te x tu a l (a n d ru le b a se d ), a s w e ll as th e m o rp h o lo g ic a l
su b -c h o ic e o f m o d e (IM P - P R ) fo r th e v e rb , a n d g e n d e r (M - F ) fo r th e n o u n . In th e p ro to ty p ic a l
c a s e o f a p re c e d in g a rtic le , th e v e rb re a d in g is ru le d o u t by:
(8 a)
R E M O V E (V ) IF (-1 A R T )
a n d th e g e n d e r c h o ic e is th e n ta k e n b y a g re e m e n t ru le s su c h as:
NODALIDA, Copenhagen, January 1998
51
(8 b )
R E M O V E (N M ) IF ( - 1 C D E T ) (N O T -1 M )
R E M O V E (N F ) IF ( - 1 C D E T ) (N O T -1 F )
C o n s id e r th e fo llo w in g ex a m p le s o f " u n a n a ly z a b le " w o rd s fro m r e a l c o rp u s se n te n c e s, w h e re th e
fin a l o u tp u t, a fte r m o rp h o lo g ic a l c o n te x tu a l d isa m b ig u a tio n , is g iv e n ;
(9 a )
in v e n tim a n h a s A L T x x x a s
(also : o n e A D J a n d th re e ra re V -re a d in g s)
" in v e n tim a n h a " < x x x > N F P 'tric k s'
(9 b )
a ra ra q u a re n se s A L T x x x e n se s
(3 o th e r A D J read in g s re m o v e d b y lo c al
d is a m b ig u a tio n )
(9 c)
"a ra ra q u a i" < D E R S -e n se [P A T R ]> < jh > < jn > A D J M /F P 'fro m A ra ra q u a ra '
s o m b ra n c e lh a s A L T x x x a s
(also : o n e A D J a n d th re e ra re V -re a d in g s)
" so m b ra n c e lh a " < x x x > N F P '= s o b ra n c e lh a s - ey e b ro w s'
(9 d )
cast A L T xxx
(also : N F S )
"cast" < * 1 > < * 2 > < x x x > N M S 'E n g lish : cast'
In (9 a) a n d (9 b ) th e p a rs e r a ssig n s c o rre c t re a d in g s to u n k n o w n , b u t w e llfo rm e d P o rtu g u e se
w o rd s. D e p e n d in g o n th e o rth o d o x y o f th e fu s io n p ro c e ss, th e se a ffix e s m a y b e re c o g n ise d (9 b ),
o r n o t (9 a). T h a t a ffix re c o g n itio n is im p o rta n t, c a n b e seen fro m th e fa c t th a t a ll c o m p e tin g
a n a ly se s in (9 b ) - b u t n o t in (9 a) - h a v e th e c o rre c t P o S tag. W h a t is sp e c ia l a b o u t (9c), is th e
(p h o n e tic a l? ) m iss p e llin g ('so m b ra n c e lh a s') o f a n o th e rw ise o rd in a ry P o rtu g u e se w o rd . E v e n so,
w ith th e h e lp o f th e su rv iv in g m o rp h o lo g ic a l c lu e s a n d c o n te x tu a l d is a m b ig u a tio n , th e p a rs e r is
a b le to a s s ig n th e rig h t a n a ly sis in m o s t ca se s, e sp e c ia lly i f th e w o rd s still lo o k P o rtu g u e s e . (9 d ),
fin a lly , is th e h a rd czise - fo re ig n lo a n w o rd s. E n g lis h 'cast' d o e s n o t f it w ith a n y P o rtu g u e se
fle x io n e n d in g , th e re fo re th e d e fa u lt re a d in g N is assig n e d , g e n d e r d isa m b ig u a tio n re ly in g o n
N P -co n tex t„
In o rd e r to te s t th e p a rse r's p e rfo rm a n c e a n d to id e n tify th e stre n g th s a n d w e a k n e s s e s o f th e
h e u ris tic s s tra te g y o f th e p arser, I h a v e m a n u a lly in sp e c te d 7 5 7 "ru n n in g " in sta n c e s'^ o f lo w e r
ca se w o rd fo rm s w h e re th e p a rse r's d is a m b ig u a tio n m o d u le re c e iv e d its in p u t fro m th e ta g g e r's
h e u ris tic s m o d u le . T h e first c o lu m n sh o w s th e w o rd c la ss a n a ly sis c h o se n , a n d in sid e th e th ree
g ro u p s (e rro rs, P o rtu g u e se , fo re ig n ) th e le ft c o lu m n g iv e s th e n u m b e r o f c o rre c t a n a ly se s,
w h e re a s th e rig h t c o lu m n o ffers sta tistic s a b o u t th e m ista k e s, sp e c ify in g - a n d q u a n tify in g - w h a t
th e a n a ly s is s h o u ld h a v e been.
The words comprise all "unanalysable" word forms in my corpus, that begin with the letters 'a' and 'b'. Since
the relative distribution of foreing loan words and Portuguese words depends on which initial letters one works on
('a', for one, is over-representative o f Portuguese words, whereas 'x'. 'w' and y are English-only domains), no
conclusions can be drawn about these two groups' relative percentages. Inside the Portuguese group, however, the
distribution between real words and misspellings may be assumed to be fairly alphabet-independent. Any way, the
sampling technique has no significance for error frequencies or distribution in relation to word class, which was the
main objective in this case.
52
NOD ALIDA, Copenhagen, January 1998
(1 0 ) W o rd c la ss d is trib u tio n a n d p a rs e r p e rfo rm a n c e in " u n a n a ly z a b le " w o rd s (V E JA n ew s te x t)
analysis
N
ADJ
ADV
W IN
PCP
GER
INF
A) orthographical
errors
other
correct
ADJ 8
119
ADV 8
W IN 3
PRO Nl
DET 1
PRP 1
N8
25
GER 2
3
N4
13
PCP 1
ADV 1
10
3
9
182
38
(17.3%)
B) Portuguese words C) foreign words^®
aU
correct
212
other
ADJ 3
correct
226
correct
557
other
43
95
N7
8
128
17
-
-
5
9
16
-
4
341
other
ADV 11
ADJ 3
PRON2
PRP 2
N7
ADJ 1
N4
ADJ 2
-
-
-
-
-
-
16
(4.5%)
234
-
N4
30
(11.4%)
8
22
26
3
13
757
-
20
-
4
84
(10.0%)
T h e ta b le sh o w s th a t, w h e n u sin g le x ical h e u ristic s, th e p a rs e r p e rfo rm s b e st - n o t e n tire ly
su rp risin g ly - fo r w e llfo rm e d P o rtu g u e se w o rd s (B ). O f 323 n o u n s a n d ad je c tiv e s in g ro u p B,
o n ly 16 (5 % ) w e re m is a n a ly s e d as false p o sitiv e s o r fa lse n e g a tiv e s. T h e p ro b a b ility fo r a n
a ssig n e d N -ta g b e in g c o rre c t is as h ig h as 9 8 .6 % , fo r th e u n d e rre p re se n te d a d v erb a n d n o n -fm ite
v e rb a l c la ss e v e n 100% . A ll fa lse p o sitiv e n o m in a l re a d in g s (N a n d A D J) are still in th e n o m in a l
c lass, a fa c t th a t is q u ite fa v o u ra b le fo r la te r sy n ta c tic a n a ly sis.
F ig u re s are lo w e r fo r g ro u p C , u n k n o w n lo a n w o rd s, w h e re th e c h a n c e o f a n N -ta g b e in g c o rre c t
is o n ly 9 2 .6 % , e v e n w h e n a llo w in g fo r a n a m e -c h a in -lik e N -a n a ly sis o f E n g lish a d je c tiv e s
in te g ra te d in n o u n c lu ste rs o f th e ty p e 'b ig b o ss'. F in ite v e rb re a d in g s, th o u g h ra re (d u e to la c k in g
fle x io n in d ic a to rs), are o f c o u rse a ll failu res, a n d o n ly th e little a d je c tiv e g ro u p w a s a h it, th e fe w
cases b e in g trig g e re d b y m o rp h o lo g ic a lly "P o rtu g u e sish " S p a n is h o r Ita lia n w o rd s.
T h e re su lts in g ro u p A (m iss p e llin g s ) re se m b le
th o se o f g ro u p B , w ith a g o o d p e rfo rm a n c e fo r
c la sse s w ith c le a r e n d in g s, i.e. n o n -fm ite v e rb s a n d '-m e n te '-a d v e rb s, a n d a b ad p e rfo rm a n c e fo r
fin ite v e rb fo rm s. F o r th e la rg e n o m in a l g ro u p s fig u re s a re so m e w h a t lo w er: 8 4 .4 % o f N -ta g s ,
a n d o n ly 7 1 .4 % o f A D J -ta g s a re c o rre c t - th o u g h m o s t fa ls e p o s itiv e A D J-ta g s are still w ith in th e
n o m in a l ra n g e . T h e lo w e r fig u re s c a n b e p a rtly e x p la in e d b y th e fa c t th a t m issp e lle d c lo se d cla ss
w o rd s (a d v e rb s, p ro n o u n s a n d th e like) w ill g e t th e (d e fa u lt, b u t w ro n g ) n o u n re a d in g - a
te c h n iq u e th a t w o rk s so m e w h a t b e tte r a n d m o re n a tu ra lly fo r fo re ig n lo a n w o rd s (C ), w h ic h o fte n
a re "term s" im p o rte d to g e th e r w ith th e th in g o r c o n c e p t th e y sta n d fo r, o r n£imes. A lso , th e
Only individual words and short integrated groups are treated, foreign language sentences or syntactically
complex quotations are treated as "corpus fall-out" in this table.
NODALIDA, Copenhagen, January 1998
53
p e rc e n ta g e o f "sim p lex '^" w o rd s w ith o u t a ffix e s is m u c h h ig h e r a m o n g th e m iss p e llin g s in g ro u p
A th a n in g ro u p B , w h ere a ll s im p le x w o rd s - b e in g sp e lle d c o rre c tly - w o u ld h a v e b e e n
re c o g n is e d in th e le x ico n a n y w a y , d u e to th e g o o d le x ico n c o v e ra g e b e fo re g e ttin g to th e
h e u ris tic s m o d u le . T h ere fo re , n o u n s a n d a d je c tiv e s in g ro u p A la c k th e stru c tu ra l in fo rm a tio n o f
s u ffix e s th a t h e lp s th e p a rse r in g ro u p B : 'xxxo* lo o k s d e fin ite ly le s s a d je c tiv a l th a n 'x x x fstico '. In
p a rtic u la r, 'x x x o ' in v ite s th e N /A D J -c o n fu s io n , w h ereas m a n y s u ffix e s a re c le a rly N o r A D J.
T h u s , '-istic o ' y ie ld s a safe a d je c tiv e re a d in g ,
4. Special - ”deviant” - word class probabilities for the heuristics module
Is it p o s sib le , a p a rt fro m m o rp h o lo g ic a l-stru c tu ra l clu es, to u se " p ro b a b ilis tic s p u re" fo r d e c id in g
o n w o rd c la ss ta g s fo r "u n an aly z ab le" w o rd s ? In o rd e r to a n sw e r th is q u e stio n , I w ill - in ta b le
(1 4 ) - re a rra n g e in fo rm a tio n fro m ta b le (1 3 ) a n d c o m p a re it to w h o le te x t d a ta (in th is case, fiom.
a 1 9 7 .0 2 9 w o rd stre tc h o f th e m ix e d g e n re B o rb a -R a m se y c o rp u s). H e re , I w ill o n ly be
c o n c e rn e d w ith th e o p e n w o rd c la sse s, n o m in a l, v erb al a n d '-m e n te '-a d v e rb ia l.
(1 1 )
analyse
s
N
ADJ
ADV15
W IN
PCP
GER
INF
O p e n w o rd cla ss fre q u e n c y fo r "u n a n a ly z a b le " w o rd s as c o m p a re d to w h o le te x t fig u res
whole text "unanalysable" wore s
Portuguese
orthographical
errors
words
%
cases
%
cases
%
47.38
12.79
1.26
24.96
4.96
2.47
6.17
131
33
3 (+9)
16
11
3
9
206
63.59
16.02
1.46
7.77
5.34
1.46
4.37
232
100
5
9
16
63.39
27.32
1.37
2.46
4.37
-
4
366
foreign words
cases
%
cases
%
237
12
95.18
4.82
600
145
8
25
27
3
13
821
73.08
17.66
0.97
3.05
3.29
0.37
1.58
- (+11)
-
-
-
1.09
all heuristics
-
249
-
A m o n g o th e r th in g s, th e ta b le sh o w s th a t th e n o u n b ia s in "u n a n a ly z a b le " w o rd s is m u c h stro n g e r
th a n in P o rtu g u e s e te x t as a w h o le , th e d iffe re n c e b e in g m o st m a rk e d in fo re ig n lo a n w o rd s. T h e
o p p o s ite is tru e o f fin ite v e rb s w h ic h s h o w a stro n g te n d e n c y to b e a n a ly sa b le . F in ite v erb s are
v irtu a lly a b s e n t fro m th e u n k n o w n lo a n w o rd g ro u p . F o r th e n o n -fin ite v e rb a l c la sse s th e
d is trib u tio n p a tte rn is fairly u n ifo rm , a g a in w ith th e e x c e p tio n o f fo re ig n lo a n w o rd s.
''' "Simplex'' words are here defined as words that can be found in the root lexicon without prior removal of
prefixes or suffixes. Of course, the larger the lexicon the higher the likelihood o f an (etymologically) affix-bearing
word appearing in the lexicon, - and thus not needing "live" derivation from the parser.
Only deadjectival '-mente'-adverbs can meaningfully be guessed at heuristically, and therefore only they should
enter into the statistics for word class guessing. Also the base line figure o f 1.26% for normal text is for '-mente'adverbs only, the overall ADV frequency is nearly 12 times as high. Since non-'mente'-adverbs are a closed class in
Portuguese, the latter will be absent from the heuristics class of wellformed unknown Portuguese words, but in the
foreign loan word group and the orthographical error group they will appear in the false positive section of other
word classes (numbers given here in parentheses). In the orthographical error group, both '-mente'-adverbs and
closed class adverbs can occur, the first as correct ADV-hits, the other usually as false positive nouns (for instance,
'aiinda').
54
NODALIDA, Copenhagen, January 1998
A s m ig h t b e e x p e c te d , a m o n g th e " u n a n a ly z a b le " w o rd s, o rth o g ra p h ic a l e rro rs a n d c o rre c t
P o rtu g u e se w o rd s sh o w a re m a rk a b ly sim ila r w o rd c la ss d istrib u tio n .
A le sso n fro m th e a b o v e fin d in g s m ig h t b e to o p t fo r n o u n re a d in g s a n d a g a in st fin ite v erb
re a d in g s in "u n a n a ly sa b le " w o rd s, w h e n in d o u b t, e s p e c ia lly w h e re n o P o rtu g u e se fle x io n e n d in g
o r su ffix c a n b e fo u n d , s u g g e stin g fo re ig n m a te ria l. A s a m a tte r o f fact, th is stra te g y h as sin ce
b e e n im p le m e n te d in th e sy ste m , in th e fo rm o f h e u ris tic a l d isa m b ig u a tio n ru les, th a t d isc a rd
V F IN re a d in g s a n d c h o se N re a d in g s fo r < M O R P -H E U R > w o rd s, w h e re lo w e r le v e l (i.e. safe)
C G -ru le s h a v e n 't b e e n a b le to d e c id e th e ca se c o n te x tu a lly .
5. Conclusion
I t can b e sh o w n th a t le x ic o -m o rp h o lo g ic a l h e u ris tic s - at le a st fo r a m o rp h o lo g y -ric h la n g u a g e
lik e P o rtu g u e se - c a n b e b a s e d o n stru c tu ra l clu es a n d th e sy ste m a tic e x p lo ita tio n o f d e riv a tio n a l
an d in fle c tio n a l su b le x ic a . A p p lie d to im p ro v e a n a ly se r re c a ll o n th e in p u t le v el o f a C o n s tra in t
G ra m m a r sy ste m , th e d e sc rib e d te c h n iq u e p o sitiv e ly c o n trib u te d to th e o v erall p e rfo rm a n c e o f a
le x ico n b a se d ru le g o v e rn e d ta g g e r/p a rse r. C o rre c tn e ss ra te s o f m o re th a n 9 9 % w e re a c h ie v e d fo r
th e m o rp h o lo g ic a l/P o S ta g g e r m o d u le, w ith h e u ristic e rro r ra te s ru n n in g at 2 % fo r p ro p e r n a m e
h e u ristic s a n d 4 .5 % fo r th e h e u ristic a l
a n a ly sis o f o th e r u n re c o g n ise d , b u t c o rre c tly sp elled
P o rtu g u e se w o rd fo rm s. In all, h e u ristic a n a ly sis w a s n e e d e d fo r 80% o f all p ro p e r n o u n s
(a m o u n tin g to ca. 2 % o f ru n n in g w o rd fo rm s in n e w s te x t), b u t fo r le ss th a n 0 .4 % o f n o n -n a m e
w o rd
fo rm s.
F in a lly ,
w o rd
class
fre q u e n c y
c o u n ts
su g g e st
th a t
PoS
p ro b a b ilitie s
fo r
"u n an aly z ab le" w o rd s in P o rtu g u e se te x ts are q u ite d iffe re n t fro m th o s e fo r th e la n g u a g e o n th e
w h o le.
References
B ick , E c k h a rd , P o r tu g is is k - D a n s k O rdbog, M n e m o , Å rh u s, 1 9 9 3 ,1 9 9 5 ,1 9 9 7 .
B ick , E c k h a rd , T he P a r s in g S ystem "P alavras", D o c u m e n ta tio n , u n p u b lish e d P h .D . p ro je c t
e v a lu a tio n , 1995, 1997.
B ick , E c k h a rd , " A u to m a tic P a rs in g o f P o rtu g u e se ", in P ro c e e d in g s o f th e S e c o n d W o rk sh o p o n
C o m p u ta tio n a l P r o c e s s in g o f W ritten P o rtu g u e se , C u ritib a , 1996.
B ick , E c k h a rd , " D e p e n d e n sstru k tu re r i C o n s tra in t G ra m m a r S y n ta k s fo r P o rtu g isisk ",
in:
B rø n d ste d , T o m & L y tje , In g e r (ed s), S p r o g o g M u ltim e d ie r, A a lb o rg , 1997.
B ic k ,
E c k h a rd ,
" A u to m a tis k
a n a ly se
a f p o rtu g isisk
sk riftsp ro g " ,
in;
Je n se n , P e r A n k e r
& Jø rg e n se n , S tig . W . & H ø m in g , A n e tte (ed s). D a n sk e p h d - p r o j e k t e r i d a ta lin g vistik, fo r m e l
lin g v is tik o g sp ro g te k n o lo g i, p p . 2 2 -2 0 , K o ld in g , 1997.
B rill, E ric , "A S im p le R u le -b a se d P a rt o f S p e e c h T a g g e r", in P ro c e e d in g s o f th e T h ir d
C o n fe re n c e o n A p p lie d N a tu r a l L a n g u a g e P ro c e ssin g , A C L , T re n to , Ita ly , 1992.
NODALIDA, Copenhagen, January 1998
55
C h a n o d , J e a n -P ie rre & T a p a n a in e n , P a si, " T a g g in g F re n c h - c o m p a rin g a sta tistic a l an d
a c o n s tra in t-b a s e d m e th o d ", a d a p te d fro m : S ta tis tic a l a n d C o n s tr a in t-b a s e d T a g g e rs f o r F ren ch ,
T e c h n ic a l re p o rt M L T T -0 1 6 , R a n k X e ro x R e s e a rc h C en tre, G re n o b le , 1994.
F ra n c is , W .N . & K u cera, F ., F r e q u e n c y A n a ly s is o f E n g lish U sage, H o u g h to n M ifflin , 1982.
G a rs id e , R o g e r & L eech , G e o ffre y & S a m p so n , G e o ffre y (ed s.). T h e C o m p u ta tio n a l A n a ly s is o f
E n g lish . A C o r p u s -B a se d A p p ro a c h , L o n d o n , 1987.
K a rls s o n , F re d , "S W E T W O L : A C o m p re h e n siv e M o rp h o lo g ic a l A n a ly s e r fo r S w e d ish ", in
N o r d ic J o u r n a l o f L in g u istic s 15, 1992, p p . 1-45.
K a rls s o n , F re d & V o u tila in e n , A tro & H e ik k ila , Ju k a & A n ttila , A rto (e d s.), C o n str a in t
G ra m m a r, A L a n g u a g e -In d e p e n d e n t S y s te m f o r P a r s in g U n re stric te d Text, M o u to n d e G ru y ter,
B e rlin 1995.
L e z iu s, W o lfg a n g & R ap p , R e in h a rd & W e ttle r, M an fred , "A M o rp h o lo g y -S y ste m an d P a rt-o f
S p e e c h T a g g e r fo r G erm an ", ", in; D a fy d d G ib b o n (ed,): N a tu r a l L a n g u a g e P r o c e s s in g a n d
S p e e c h T e c h n o lo g y , B erlin , 1996.
M a rc u s , M itc h e ll, "N e w tre n d s in n a tu ra l la n g u a g e p ro c e ssin g : S ta tistic a l n a tu ra l la n g u ag e
p ro c e s s in g " , p a p e r p re se n te d a t th e c o llo q u iu m H u m a n -M a c h in e C o m m u n ic a tio n b y Voice,
o rg a n iz e d b y L a w re n c e R . R a b in e r, h e ld b y th e N a tio n a l A c a d e m y o f S c ie n c e s a t T he A r n o ld a n d
M a b e l B e c k m a n C en ter in Irv in e , U S A , F eb , 8-9, 1993.
T a p a n a in e n , P a si, "T h e C o n s tra in t G ra m m a r P a rs e r C G -2 ", U n iv e rs ity o f H e lsin k i, D e p a rtm e n t
o f L in g u is tic s, P u b lic a tio n s n o . 2 7 , 1996.
V o u tila in e n , A tro & H e ik k ila , J u k a & A n ttila , A rto , C o n s tr a in t G ra m m a r o f E n g lish , A
P e r fo r m a n c e -O r ie n te d In tro d u c tio n ,
P u b lic a tio n N o , 2 1 , D e p a rtm e n t o f G e n e ra l L in g u istic s,
U n iv e rs ity o f H e lsin k i, 1992.
56
NOD ALIDA, Copenhagen, January 1998