parsing german - Association for Computational Linguistics

COLING~ , ~ Horec~ /~.)
North-Holland PubltJhing Company
Accdcm~ 1982
PARSING
GERMAN
Ingeborg Steinacker, H a r a l d Trost
D e p a r t m e n t of Medical C y b e r n e t i c s
U n i v e r s i t y of Vienna
The first part of this paper is d e d i c a t e d to an o v e r v i e w
of the
parser of the system V I E - L A N G (Viennese Language
U n d e r s t a n d i n g System).
The
parser
is
a
production
system which
uses
an
interleaved method that combines
syntax and
semantics.
It parses
directly
into
the
internal r e p r e s e n t a t i o n of the system, w i t h o u t producing
an i n t e r m e d i a t e
syntactic
structure.
The
last
part
d i s c u s s e s the r e l a t i o n s h i p between some special features
of the G e r m a n language, and
properties
of
the
parser
that o r i g i n a t e in the language.
GENERAL
APPROACH
A sentence is parsed word per word, from left to right.
The
parser
is l a r g e l y a d a t a - d r i v e n p r o d u c t i o n system.
P r o d u c t i o n s involve the
use of s y n t a c t i c and s e m a n t i c information at all major stages of the
process.
Noun
phrases, for example, are r e c o g n i z e d by an ATN which
v e r i f i e s the result of s y n t a c t i c analysis s e m a n t i c a l l y .
It
returns
s e m a n t i c a l l y valid
NPs
only.
The
parser b e l o n g s to the class of
semantic p a r s e r s as s u g g e s t e d by [I], [4], [7].
It
has
two
main
sources of information:
one is a semantic net, which p r o p a g a t e s the
information
about
selectional
restrictions,
the
other
is
the
parsing-lexicon,
which
for
each
word
contains
different
senses
a s s o c i a t e d with
the
information n e c e s s a r y to d i s t i n g u i s h one sense
from the others.
I n f o r m a t i o n includes
syntactic
features
of
the
sentence (infinitive, s u r f a c e - c a s e s of d e p e n d e n t n o u n phrases .... ),
semantic
restrictions
and
words
that
occur
together
with
the
input-word.
The p r o d u c t i o n s make
use
of
a correspondence
between
syntactic
i n f o r m a t i o n in
the
sentence
and the roles of the net (see chapter
internal r e p r e s e n t a t i o n for an e x p l a n a t i o n of
roles).
Productions
are used
not
only
for
generating
the internal r e p r e s e n t a t i o n of
c o n s t i t u e n t s but also as e x p e c t a t i o n s that guide the a n a l y s i s of the
rest of the sentence.
The g e n e r a t i o n
of
the
internal
structure
corresponding
to
the
sentence is centered
around the verb.
Since the r e p r e s e n t a t i o n of
other c o n s t i t u e n t s can be initiated i n d e p e n d e n t l y of the
verb,
the
parser builds
a s e m a n t i c s t r u c t u r e i m m e d i a t e l y after a c o n s t i t u e n t
is recognized.
These s t r u c t u r e s are stored in
a
list,
until
the
main verb
of the sentence has been found.
Then the parser tries to
fill the c a s e - s l o t s of the verb
with
the
given
structures.
The
s e m a n t i c c a t e g o r i e s of the s t r u c t u r e s have to be m a t c h e d against the
value r e s t r i c t i o n s of the roles of the verb.
INTERNAL REPRESENTATION
The source of
semantic
information
Net [2].
This
net
formalism
has
e p i s t e m o l o g i c a l l y clear and explicit.
365
is
a
Structural
the
advantage
S I - N e t s are based
Inheritance
of
being
on a s t r i c t
I. STEINACKER and H. TROST
366
d i s c r i m i n a t i o n b e t w e e n few s t r u c t u r a l c o m p o n e n t s , and their
content
(what is
represented).
Real world k n o w l e d g e is r e p r e s e n t e d in the
form of c o n c e p t s and roles.
Roles
explain
relationships
between
concepts.
A
concept
is d e f i n e d by its a t t r i b u t e s w h i c h c o n s i s t of
two parts:
the
role
and
the
value
restriction.
The
value
r e s t r i c t i o n is a c o n c e p t w h i c h d e f i n e s the range of p o s s i b l e f i l l e r s
for the
attribute,
the
role d e f i n e s the f u n c t i o n of a filler with
regard to the c o n c e p t being defined.
Role-filler
concepts
can
be
r e g a r d e d as s e m a n t i c c a t e g o r i e s .
Generic concepts
are
organized
in
a
hierarchy
of
superand
subconcepts.
A
subconcept
inherits
the
attributes
of
the
superconcept.
If a
concept
has
more
than
one
superconcept
it
inherits the
combined
set of a t t r i b u t e s .
W h e n p r o c e s s i n g an input
i n d i v i d u a l s of
the
addressed
concepts
are
instantiated.
These
i n d i v i d u a l s c o n s t i t u t e the e p i s o d i c layer of the net.
A word sense a d d r e s s e s either
a
concept
or
the
attribute
of
a
concept.
If
an
input word relates to a concept, as most nouns and
v e r b s do, that c o n c e p t is i n s t a n t i a t e d .
If it c o r r e s p o n d s to a role
both the c o n c e p t and the a t t r i b u t e are
instantiated,
i.
e.
the
g e n e r i c concept,
the
role
defining
the
attribute
and the value
restriction.
Most a d j e c t i v e s and most p r e p o s i t i o n s are mapped
into
roles (size,
colour,
location,
time,
.... )
but also some nouns
(e.g.
father is the role of a p e r s o n in the c o n c e p t family).
The net is s t r u c t u r e d in a way that f a c i l i t a t e s the i n c o r p o r a t i o n of
r e s u l t s g a i n e d in l i n g u i s t i c s :
a t t r i b u t e s of a c t i o n s are d e f i n e d in
a way c o r r e s p o n d i n g to cases of a case g r a m m a r .
This
can
best
be
i l l u s t r a t e d by an example:
A c t i o n s are r e p r e s e n t e d as n e t - c o n c e p t s ,
e.g.
DO.
The
c o n c e p t DO is d e f i n e d by a t t r i b u t e s with roles like
AGENT, OBJECT,
GOAL,
RESULT,
that
are
restricted
by
adequate
role-filler
concepts.
By
defining
attributes
in
this
way
a
c o r r e s p o n d e n c e b e t w e e n s u r f a c e cases in a s e n t e n c e and roles of
the
net can e a s i l y be e s t a b l i s h e d .
THE
PARSING-LEXICON
In the
parsing-lexicon
each
word-sense
is
associated
with
productions.
These
productions
r e f l e c t the c o r r e s p o n d e n c e b e t w e e n
s u r f a c e cases of the s e n t e n c e and s e m a n t i c
cases
within
the
net.
The number
of
tests
in
a p r o d u c t i o n c o r r e l a t e s to the number of
s e n s e s of a word.
By e x e c u t i n g these tests
the
parser
gains
the
information necessary
to
choose
the
correct
reading
of a word.
T e s t s check the s y n t a c t i c and the s e m a n t i c c o n t e x t in which an input
w o r d is
found.
Sometimes
morphological
information
and
the
o c c u r r e n c e of
certain
w o r d s have to be taken into c o n s i d e r a t i o n as
well.
The range of tests r e f l e c t s our g e n e r a l a p p r o a c h to
parsing:
c o m b i n i n g syntax
and s e m a n t i c s at all s t a g e s of the p a r s i n g p r o c e s s
[8].
D e p e n d i n g on the stage of the p r o c e s s
the
failure
of
a
i n t e r p r e t e d in
two
ways.
If
the
end
of
the s e n t e n c e
e n c o u n t e r e d the result is taken as false, if p a r s i n g is in
the test is r e p e a t e d at later stages of the process.
Actions associated
structure-building
with
the
procedures.
tests
Some
mostly
actions
test
is
has been
progress
deal
with
semantic
are u s e d to c o n t r o l the
367
PARSING GERMAN
p a r s i n g process.
Usually
the semantic s t r u c t u r e for a c o n s t i t u e n t
of the sentence is built after the
constituent
is
recognized
but
a c t i o n s can
delay
the c r e a t i o n of n e t - s t r u c t u r e s .
The reasons for
such a d e l a y are e x p l a i n e d in the following chapter.
A verb-sense
is
recognized
by
taking
into
consideration
the
syntactic surroundings
of the verb and the s e m a n t i c c a t e g o r i e s that
match the s e l e c t i o n a l r e s t r i c t i o n s defined by
the
verb.
After
a
v e r b - s e n s e has
been
chosen
expectations
are
built
up r e g a r d i n g
missing constituents.
The o c c u r r e n c e of c e r t a i n
surface-structures
also leads
to
the f o r m a t i o n of e x p e c t a t i o n s .
T h e r e f o r e tests that
are a s s o c i a t e d with verbs first check the surface structure
of
the
sentence (cases,
prepositions...).
The
c o n s t i t u e n t s that s a t i s f y
these s y n t a c t i c
tests
have
to
fulfill
semantic
selectional
restrictions.
After
having
passed these tests, actions create the
semantic r e p r e s e n t a t i o n for the verb and fill
its
roles
with
the
selected c o n s t i t u e n t s .
Unless
object
an entry in the lexicon includes a test r e g a r d i n g subject and
of a sentence the
following
default
actions
are
executed
automatically:
the
subject
of a sentence is m a p p e d onto the A G E N T
and the object (accusative)
is m a p p e d onto the O B J E C T of the action.
A Detailed
Example
The two
senses
of
'gehen'
in
the
following
example
can
be
d i s a m b i g u a t e d by
using
the
entries
in the p a r s i n g - l e x i c o n listed
below (parts of the entry which are irrelevant to
the
example
are
left out).
These
sample
entries include important kinds of tests
and actions.
(i)
(2)
'Ich gehe in den Park.'
'Der Bus geht nach Wien.'
gehenl (move along)
C ( ( C A S E NOM) AND (RESTRICTION
A(CRI LOCOMOTION)
g e h e n 2 (bound for)
C ( ( C A S E NOM) AND (RESTRICTION
A(CRI (PUBL.-TRANSPORT))
C(PLOC) ->
A (CRV(+,DESTINATION,*)).
(I walk into the garden.)
(The bus is bound
for
Vienna.)
ANIMATE))
->
PUBL.-TRANSPORT.))
->
In the example the '+' p a r a m e t e r is an
individual
of
the
concept
PUBL.-TRANSPORT,
the
'*' p a r a m e t e r is the l o c a t i o n e x p r e s s e d by the
p r e p o s i t i o n a l phrase,
namely
Vienna.
The
nounphrase
'Ich'
(I)
fulfills
the
restriction
ANIMATE,
because
speakers
are
always
interpreted as humans.
Surface-tests:
C a s e - t e s t s search for an NP of the
surface-case
indicated
by
the
second parameter.
If an NP is found that s a t i s f i e s the condition,
the tests that are c o n n e c t e d by AND
or
OR
to
the
case-test
are
executed.
The c o n s t i t u e n t of the sentence which s a t i s f i e s the tests
is referred to with an asterix '*' in the a s s o c i a t e d action(s).
The test PLOC refers
location.
It
is
to a p r e p o s i t i o n a l p h r a s e that
a test
which
uses
syntactic
indicates
some
and
semantic
368
I. STEINACKER and H. TROST
information.
Restriction-tests:
These s e m a n t i c tests are used
to
check
selectional
restrictions.
They are
often
used
in c o m b i n a t i o n with s y n t a c t i c tests.
If both
tests are met by a c o n s t i t u e n t this is a s i g n i f i c a n t indicator, that
the c o r r e c t i n t e r p r e t a t i o n has been s e l e c t e d .
Structure-building
Actions:
The a c t i o n C R I ( c o n c e p t ) c r e a t e s an i n d i v i d u a l of the
concept.
The
action CRV(pl,p2,p3)
individuates
an
a t t r i b u t e of the c o n c e p t pl.
The c o n c e p t pl, the role p2 and the c o n c e p t P3 as
value-restriction
are i n s t a n t i a t e d .
If
pl
or p3 are a d d r e s s e d by '+' the p a r a m e t e r
refers to the first c o n c e p t that was
individuated
when
processing
this p a r t i c u l a r
entry
in
the
parsing-lexicon.
A
'*'-parameter
refers to the
semantic
representation
of
the
constituent
which
s a t i s f i e s the first test of the p r o d u c t i o n .
SPECIAL
FEATURES
Morphological
OF G E R M A N
Ambiguities
We b e l i e v e that m a k i n g use of the
interaction
between
syntax
and
s e m a n t i c s has m a n y a d v a n t a g e s over a s t r i c t l y s e q u e n t i a l a p p r o a c h to
parsing.
Introducing
semantic
information
helps
to r e s o l v e some
a m b i g u i t i e s at an e a r l y stage of the
analysis
and
thus
to
avoid
unnecessary backtracking.
T y p i c a l l y , m o r p h o l o g i c a l a m b i g u i t i e s can
be resolved by such an i n t e r a c t i o n .
The G e r m a n language is rich in
inflectional
forms,
therefore
the
morphological component
often
comes up with m o r e than one p o s s i b l e
stem for an input word.
These stems
usually
belong
to
different
c a t e g o r i e s of words, e.g.
'meinen' can be i n t e r p r e t e d as a verb (to
suppose) or it can be reduced to the p o s s e s s i v e p r o n o u n 'mein' (my).
Syntax restricts
the
type of a c o n s t i t u e n t , w h i c h is e x p e c t e d at a
g i v e n point in the
analysis.
Usually
it
is
sufficient
to
use
syntactic information
to
d i s a m b i g u a t e m o r p h o l o g i c a l a m b i g u i t i e s of
this kind.
If a word is reduced to two d i f f e r e n t stems of the same c a t e g o r y
of
words, s e l e c t i o n a l
restrictions
in
the
semantic
net are used to
c h o o s e one stem.
The
parsing-lexicon
relates
surface
cases
to
semantic restrictions
of
the
attributes
of
the action.
In most
cases this i n f o r m a t o n is s u f f i c i e n t for disambiguation.
The inflected
(to hear) and
form 'gehoert' is reduced to the
'gehoeren' (to b e l o n g to).
two
verbs
'hoeren'
(3) D i e s e s Buch g e h o e r t mir.
(This is my book.)
(4) Hast du d i e s e s G e r a e u s c h g e h o e r t ?
(Did you hear that n o i s e ? )
In (3) the s u b j e c t of the s e n t e n c e has to be a ' P O S S E S S I B L E O B J E C T ' ,
in (4) the object of has to be a s u b c o n c e p t of 'SOUND'.
A violation
of s e l e c t i o n a l
r e s t r i c t i o n s , is
a
clear
i n d i c a t o r that the wrong
i n t e r p r e t a t i o n of the v e r b has been chosen.
PARSING GERMAN
Disconnected
369
Constituents
A n o t h e r c h a r a c t e r i s t i c f e a t u r e of the G e r m a n language
is
the
verb
second p h e n o m e n o n .
In G e r m a n
a verb
can o c c u p y three d i f f e r e n t
positions within a sentence:
the first in q u e s t i o n s
and
commands,
the second
in
main
clauses,
and the last in s u b o r d i n a t e clauses.
C o m p o u n d p r e d i c a t e s are d i v i d e d into two parts.
The
auxiliary or
the modal
verb
hold
%he
place
of
the verb, and the rest of the
p r e d i c a t e is put at the end of the sentence.
One has to deal with a
t w o - p i e c e p r e d i c a t e w h e n e v e r c o m p o u n d tenses are used, in s t r u c t u r e s
i n v o l v i n g the i n f i n i t i v e etc.
For a parser
that
uses
a
traditional
approach
of
sequential
s y n t a c t i c and
semantic
processing
these
f e a t u r e s cause e x t e n s i v e
backtracking.
The
method
of
combinig
syntactic
and
semantic
analysis
does
not
avoid
backtracking
completely
but
it
makes
r e - i n t e r p r e t a t i o n easier.
This claim is s u p p o r t e d in the
following
p a r a g r a p h using the e x a m p l e of a c o m p o u n d p r e d i c a t e .
(5) Mein Bruder
hat
das
Buch, yon dem du mir e r z a e h l t hast,
schon g e l e s e n .
(My b r o t h e r a l r e a d y read the book, w h i c h you told me about.)
In (5)
the
object
and a r e l a t i v e c l a u s e s e p a r a t e the two parts of
the p r e d i c a t e .
One p o s s i b l e r e a d i n g
of
the
verb
'haben'
is
to
possess.
The
object
'das Buch' s a t i s f i e s the s e m a n t i c r e s t r i c t i o n
' P O S S E S S I B L E - O B J E C T ' , t h e r e f o r e 'hat' is taken as the p r e d i c a t e
and
a possess
relation
is
e s t a b l i s h e d b e t w e e n the r e p r e s e n t a t i o n s for
s u b j e c t and
object.
When
the
past
participle
'gelesen'
is
e n c o u n t e r e d at
the
end
of
the
sentence
this d e c i s i o n has to be
revised in favour of the c o m p o u n d p r e d i c a t e 'hat g e l e s e n ' .
The p o s s e s s r e l a t i o n which was e s t a b l i s h e d has to be replaced by the
c o n c e p t that is a d d r e s s e d by 'lesen, n a m e l y
'INFORMATION-TRANSFER'.
The s e m a n t i c
representations
of
the
o b j e c t book and the r e l a t i v e
clause are not a f f l i c t e d by this change.
Book also
fits
into
the
hierarchy
of
'INFORMATION-SOURCE'
and
therefore
satisfies
the
s e l e c t i o n a l r e s t r i c t i o n s for the
object
of
'INFORMATION-TRANSFER'
also.
S e p a r a b l e p r e f i x e s also add to the
problem
of
finding
the
right
verb.
Syntactically
verbadjuncts
are
p a r t i c l e s , that are part of
the verb.
In some tenses a v e r b a d j u n c t b e c o m e s s e p a r a t e d
from
the
v e r b and
is put at the end of the clause.
V e r b a d j u n c t s can s p e c i f y
the verb, but s o m e t i m e s they change its sense c o m p l e t e l y
(aufhoeren
= to stop, h o e r e n = to hear).
(6) Das Kind hoert
(After an hour
nach
einer
Stunde e n d l i c h zu w e i n e n
the child f i n a l l y stops crying.)
auf.
Such features either cause d e l a y in t h ~ c o n s t r u c t i o n of the internal
r e p r e s e n t a t i o n for
a
sentence,
or
they
result
in
backtracking
b e c a u s e the
c o r r e c t m e a n i n g of the verb b e c o m e s a p p a r a n t at the end
of the sentence.
CONCLUSION
The s t r u c t u r e of the G e r m a n language adds some d i f f i c u l t i e s
to
the
general problem
of
parsing
n a t u r a l language.
Flexible word-order
370
I. STEINACKER and H. TROST
and multiple sources for ambiguities led us to choose a data-driven
approach.
Syntactic
and
semantic
information
are
used
for
disambiguation of existing
structures
and
for expectations
that
control processing of new input.
Since backtracking is inevitable in some cases we tried to make
it
as efficient
as possible.
The integration of syntax and semantics
facilitates backtracking
to a large
degree
because
semantic
representations for all
constituents
are built independently.
If
backtracking occurs e.g.
after having selected a wrong
verb-sense,
the parser
has to destroy the existing semantic representation and
replace it with the one indicated by the new verb.
The slots of the
instantiation of the concept for the new verb have to be filled with
the already existing structures instead of having to start the parse
all over again.
ACKNOWLEDGEMENTS
This research is part of the project 'Development of a NLU
System
with regard
to Medical Applications' (supervision R.Trappl), which
is supported
by
the
Austrian
'Fonds
zur
Foerderung
der
wissenschaftlichen Forschung', grant #4158.
REFERENCES
[I] Boguraev, B.K., Automatic Resolution of Linguistic
Ambiguities,
University of Cambridge, (1979).
[2] Brachmann, R.J.,
A
Structural
Paradigm
for
Representing
Knowledge (Bolt Seranek and Newman, 1978).
[3] Buchberger, E.,
Steinacker, I.,
Trappl, R.,
Trost, H.,
Leinfellner, E., A NLU
System for Medical Applications, SIGART
79 (1982), 146-147.
[4] Gershman, A.V.,
Knowledge-Based
Parsing,
Yale
Univ.,
RR-156,(1979).
[5] .den, G., On the Use of Semantic
Constraints
in
Guiding
Syntactic Analysis, Univ.
of Wisconsin, WR-3, (1978).
[6] Palmer, M., A Case
for Rul~-driven
Semantic
Processing,
in:
Proceedings of the
19th Ann.
Meeting
of the ACL, Stanford,
(1981).
[7] Riesbeck, C.K.
and
Schank, R.C.,
Comprehension
by Computer:
Expectation-based
Analysis
of Sentences
in Context,
Yale
University, RR-78, (1976).
[8] Steinacker, I.,
Parsing
between
Syntax
and
Semantics,
Automatische Sprachverarbeitung 81, Potsdam, (1981).
[9] Trost, H. and Steinacker, I., The Role of Roles,
Some Aspects
of World Knowledge Representation, in:
Proc.
of the 7th Int'l.
Joint Conf.
on Artificial Intelligence, Vancouver, (1981).
[l@]Wilks,
Y.,
Processing
Case,
Am.
J.
of
Computational
Linguistics 4,(1976).