Recognition and Tagging of Compound Verb Groups in Czech
Eva Žáčková, Luboš Popelínský and Miloš Nepil
Faculty of Informatics, Masaryk University
Brno, Czech Republic
Example
Mrzí mě, že jsem o té konferenci nevěděla, byla bych se jí zúčastnila.
(I am sorry that I did not know about the conference, I would have
participated in it.)
<vg> Mrzí </vg> mě, že
I <vg> am sorry </vg> that
<vg> jsem o té konferenci nevěděla </vg>,
I <vg> did not know </vg> about the conference,
<vg> byla bych se jí zúčastnila </vg>.
I <vg> would have participated </vg> in it.
Recognition and Tagging of Compound Verb Groups in Czech
In more than half of Czech sentences the predicate contains a compound verb group.
Compound verb groups are usually tagged word by word, which can be confusing.
Finding all parts of a compound verb group and tagging the group as a whole is
fundamental for further analysis of a sentence.
in such a case he could participate (in that competition)
mohl by se v tom případě zúčastnit (té soutěže)
by se v tom případě mohl zúčastnit
zúčastnit by se v tom případě mohl
v tom případě by se zúčastnit mohl
Compound verb groups
comprise compound forms of tense, mood and (im)perfective aspect, forms with the
reflexive pronoun se, and forms with modal verbs (or other verbs with similar behaviour);
can contain gaps;
have several word order variants.
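The four word order variants above can be compared mechanically. The following is a minimal sketch, assuming a toy lexicon VERB_PARTS that lists the words belonging to the verb group; both the lexicon and the helper group_parts are hypothetical and only serve this illustration.

```python
# Toy lexicon (assumption): the words that make up the verb group in the example.
VERB_PARTS = {"mohl", "by", "se", "zúčastnit"}

variants = [
    "mohl by se v tom případě zúčastnit",
    "by se v tom případě mohl zúčastnit",
    "zúčastnit by se v tom případě mohl",
    "v tom případě by se zúčastnit mohl",
]

def group_parts(sentence):
    """Return the verb-group words of a sentence in their surface order."""
    return [word for word in sentence.split() if word in VERB_PARTS]

orders = [group_parts(v) for v in variants]
for order in orders:
    print(" ".join(order))
```

Every variant contains the same four verb-group words, but in a different order and with the gap (v tom případě) in a different position, which is why word-by-word matching on a fixed order is not enough.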
Learning verb rules
Input: annotated sentences from corpus DESAM
Output: DCG rules for compound verb groups - verb rules
Algorithm
1. Find boundaries of verb groups and eliminate gaps
2. Generalise (lemmata, grammatical categories, grammatical agreement
constraints)
3. Synthesise verb rules
126 verb rules covering all frequent verb groups in Czech
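Steps 1 and 2 of the algorithm can be illustrated as follows. This is only a sketch under stated assumptions: tokens are given as (word, is_part) pairs as if taken from an annotated corpus, and TOY_CATEGORIES is a hypothetical hand-written mapping to the category names used in the verb rules (be, cond, reflex_pron, k5). It is not the authors' implementation.

```python
from typing import List, Tuple

# (word form, is it part of the verb group?) -- hypothetical annotation format
Token = Tuple[str, bool]

# Toy mapping from word forms to rule categories (assumption, for illustration only).
TOY_CATEGORIES = {
    "byla": "be",
    "bych": "cond",
    "se": "reflex_pron",
    "zúčastnila": "k5",
}

def find_group(tokens: List[Token]):
    """Step 1: locate the group boundaries and separate parts from gaps."""
    idx = [i for i, (_, is_part) in enumerate(tokens) if is_part]
    if not idx:
        return [], []
    span = tokens[idx[0]:idx[-1] + 1]          # from first to last group member
    parts = [w for w, is_part in span if is_part]
    gaps = [w for w, is_part in span if not is_part]
    return parts, gaps

def generalise(parts: List[str]) -> List[str]:
    """Step 2 (simplified): replace each word form by its category."""
    return [TOY_CATEGORIES[w] for w in parts]

clause = [("byla", True), ("bych", True), ("se", True),
          ("jí", False), ("zúčastnila", True)]
parts, gaps = find_group(clause)
pattern = generalise(parts)
print(parts, gaps, pattern)
```

The resulting category sequence is the kind of skeleton from which a verb rule could then be synthesised in step 3; grammatical categories and agreement constraints are left out of this sketch.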
byla bych se jí zúčastnila
(I would have participated in it)
verb_group(vg(Be,Cond,Se,Verb), Gaps) -->
    be(Be,_,P,N,tM,mP,_),            % byla
    cond(Cond,_,_,Ncond,tP,mC,_),    % bych
    {check_num(N,Ncond,Cond,Vy)},    % grammatical agreement in number
    reflex_pron(Se,xX,_,_),          % reflexive pronoun
    gap([],Gaps),                    % gap
    k5(Verb,_,_,P,N,tM,mP,_).        % full-meaning verb
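For readers less familiar with DCG notation, the rule above can be approximated in Python. This is a rough sketch, not a translation: the token format (form, category, number) and the function match_rule are assumptions, only agreement in number is modelled, and the gap is taken to be everything between the reflexive pronoun and the first full-meaning verb.

```python
def match_rule(tokens):
    """Match be + cond + reflexive pronoun + gap + full-meaning verb.

    tokens: list of (form, category, number) triples (hypothetical format).
    Returns (parts, gap) on success, otherwise None.
    """
    if len(tokens) < 3:
        return None
    (be, c1, n1), (cond, c2, n2), (se, c3, _) = tokens[:3]
    if (c1, c2, c3) != ("be", "cond", "reflex_pron"):
        return None
    if n1 != n2:                      # agreement in number between be and cond
        return None
    rest = tokens[3:]
    for i, (form, cat, num) in enumerate(rest):
        if cat == "k5":               # first full-meaning verb closes the group
            if num != n1:             # the verb shares its number with be
                return None
            gap = [f for f, _, _ in rest[:i]]
            return [be, cond, se, form], gap
    return None

example = [
    ("byla", "be", "S"),
    ("bych", "cond", "S"),
    ("se", "reflex_pron", None),
    ("jí", "other", "S"),
    ("zúčastnila", "k5", "S"),
]
result = match_rule(example)
print(result)
```

Unlike the DCG rule, this sketch does not unify person, tense or mood; it only shows how the fixed sequence, the agreement check and the gap fit together.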
Accuracy:
on unannotated text 86.8%
Question 1:
How can the accuracy be increased?
Most errors are caused by incorrect recognition of verbs.
To fix it:
elimination of very rare lemmata
lemma disambiguation by inductive logic programming (ILP)
(Pavelek & Popelínský 1999)
DESAM: Results for unannotated text

                          correct rules (%)
                          > 1 verb     all
number of examples        349          600
original method           86.8         92.3
+ infrequent lemmata      91.1         94.8
+ ILP                     92.8         95.8
Question 2:
Are the learned rules valid for other corpora (other types of corpora)?
PTB: Results for unannotated text

                          correct rules (%)
                          > 1 verb     all
number of examples        284          467
original method           87.0         92.1
+ infrequent lemmata      91.6         94.9
+ ILP                     92.6         95.5
Tagging
notation: SGML
how to characterize the group: attribute tag
problem: tagging unannotated texts
Example
byla bych se jí zúčastnila (I would have participated in it)
<vg tag="eApFnStPmCaPr1v0" fmverb="zúčastnit">
<vgp>byla</vgp>
<vgp>bych</vgp>
<vgp>se</vgp>
jí
<vgp>zúčastnila</vgp>
</vg>
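Producing this markup from a recognised group is straightforward. The sketch below assumes the group is available as (form, is_part) pairs and that the tag string and the full-meaning verb lemma have already been computed; the function to_sgml is hypothetical, and the tag value is treated as an opaque string copied from the example.

```python
from html import escape

def to_sgml(tokens, tag, fmverb):
    """Render one verb group: parts go inside <vgp>, gap words stay bare."""
    lines = [f'<vg tag="{escape(tag)}" fmverb="{escape(fmverb)}">']
    for form, is_part in tokens:
        text = escape(form)
        lines.append(f"<vgp>{text}</vgp>" if is_part else text)
    lines.append("</vg>")
    return "\n".join(lines)

group = [("byla", True), ("bych", True), ("se", True),
         ("jí", False), ("zúčastnila", True)]
markup = to_sgml(group, tag="eApFnStPmCaPr1v0", fmverb="zúčastnit")
print(markup)
```

Gap words such as jí remain inside the <vg> element but outside any <vgp>, so the original word order of the sentence is preserved.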
Ongoing research
finding sentence boundaries (accuracy 90.8%)
Future
fully automatic tool (minimum ad hoc techniques)
building a partial parser for Czech (noun phrases: Smrž & Žáčková 1998)
semantic analysis