Connective-based Local
Coherence Analysis: A Lexicon
for Recognizing Causal
Relationships
Manfred Stede
University of Potsdam (Germany)
email: [email protected]
Abstract
Local coherence analysis is the task of deriving the (most likely) coherence relation holding between two elementary discourse units or, recursively, larger spans of text. The primary source of information for this
step is the connectives provided by a language for, more or less explicitly, signaling the relations. Focusing here on causal coherence relations,
we propose a lexical resource that holds both lexicographic and corpus-statistic information on German connectives. It can serve as the central repository of information needed for identifying and disambiguating
connectives in text, including determining the coherence relations being
signaled. We sketch a procedure performing this task, and describe a
manually-annotated corpus of causal relations (also in German), which
serves as reference data.
1 Introduction
“Text parsing” aims at deriving a structural description of a text, often a tree in the
spirit of Rhetorical Structure Theory (Mann and Thompson, 1988). For automating
this task (see, e.g., Sumita et al. (1992); Corston-Oliver (1998); Marcu (2000)), the
central source of information are the connectives that the author employed to more
or less specifically signal the type of coherence relation between adjacent spans. For
illustration, consider this short text:1
Because well-formed XML does not permit raw less-than signs and ampersands, if you use a character
reference such as &#60; or the entity reference &lt; to insert the < character, the formatter will
output &lt; or perhaps &#60;.
Supposing that we are able to identify the connectives and punctuation symbols correctly (here in particular: note that to is not a spatial preposition; distinguish between
commas in enumerations and those finishing clauses), we can identify the “scaffold”
of this short text as the following:
Because A, if B or C to D, E or F
with A to F representing the minimal units of analysis. Next, fairly simple rules will
be sufficient to guess the most likely overall bracketing of this string:
(Because A, (if ((B or C) to D)), (E or F))
And finally, it happens that the connectives because, if, to and or are quite reliable signals of the coherence relations Reason, Condition, Purpose and Disjunction, respectively. Combining this information with the bracketing, we can obtain a tree structure
in the spirit of RST.
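For illustration, this labeling step can be sketched in a few lines of code. The relation table and the nested-tuple encoding of the bracketing are our own simplification for this example, not part of the system described later in the paper:

```python
# Toy illustration (not the actual system): map reliable connectives
# to coherence relations and label the bracketed scaffold recursively.
RELATIONS = {"because": "Reason", "if": "Condition",
             "to": "Purpose", "or": "Disjunction"}

# (Because A, (if ((B or C) to D)), (E or F)) as a nested tuple:
tree = ("because", "A",
        ("if", ("to", ("or", "B", "C"), "D"), ("or", "E", "F")))

def annotate(node):
    """Replace each connective by the relation it signals."""
    if isinstance(node, str):          # a minimal unit A..F
        return node
    connective, *spans = node
    label = RELATIONS.get(connective.lower(), "Unknown")
    return (label,) + tuple(annotate(s) for s in spans)

print(annotate(tree))
# ('Reason', 'A', ('Condition', ('Purpose', ('Disjunction', 'B', 'C'), 'D'),
#                  ('Disjunction', 'E', 'F')))
```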
Texts of this level of complexity could be handled by early text parsers (see Section 2). But, obviously, not too many texts behave as nicely as our example does. In
general, constructing a discourse tree is highly complicated even without trying to find
semantic/pragmatic labels for the relationships; the discussion by Polanyi et al. (2004)
demonstrates that just the structural decisions are often very difficult to make. Taking
a different viewpoint, this author argues in Stede (2008) that constructing “the” tree
structure for a text should not be regarded as such an important goal and that coherence should rather be explained as the interplay of different levels of (possibly partial)
description, such as referential and thematic structure, intentional structure, and a level
of local coherence analysis that records the clearly recognizable relationships between
adjacent text spans but does not aim at constructing a complete and well-formed tree.
In the present paper, this viewpoint is taken to the task of automatic analysis, which
aims at identifying individual coherence relations and the spans related. We restrict
ourselves here to causal relationships and moreover to those that are explicitly signaled by a connective. The central resource used in our approach is a lexicon that
collects the information associated with individual connectives and makes it available
to applications such as a coherence analysis or text generation.
The paper is organized as follows. After reviewing some earlier research on text
parsing in Section 2, we turn to connectives in Section 3 and point out a number of
problems that sophisticated coherence analyzers have to reckon with. Then, Section 4
explains the connective lexicon we developed, and Section 5 describes a corpus we
collected and annotated manually for causal connectives and the relations they signal.
1 Source: http://www.cafeconleche.org/books/bible2/chapters/ch17.html
It serves as a reference for designing the analysis procedure, which is finally sketched
in Section 6. Our analysis and implementation target German text, but most of the
phenomena apply equally to English.
2 Related Work
In the late 1990s, the best-known work on “text parsing” was that of Marcu, which
is collected in Marcu (2000). He had used surface-based and statistical methods to
identify elementary discourse units, hypothesize coherence relations between adjacent
segments, and finally compute the most likely overall “rhetorical tree” for the text.
Surface-based methods were highly popular at the time, but with the recent advances
in robust and wide-coverage sentence parsing, it seems sensible to cast local coherence
analysis as a problem of linguistic analysis, drawing on the results of syntactic parsing
(or even, on top of that, semantic analysis).
An early approach in this spirit was implemented in the RASTA analyzer (Corston-Oliver, 1998). It perused the output of the ‘Microsoft English Grammar’ to guess the
presence of coherence relations on the basis of accumulated evidence from a variety
of more or less deep linguistic features. For instance, a hypotactic clause would always figure as the satellite of some nucleus-satellite relation in RST terms. For some
relations (e.g., Elaboration), the type of referring expressions, especially in subject
position, was considered a predictive feature. In general, RASTA employed a set of
necessary criteria for each relation to hold in a particular context, and for those relations passing the filter, a voting scheme accumulated evidence to decide on the most
likely relation. The system worked on Encarta articles, hence on expository text; 13
relations were being used.
While RASTA employed a relation-centric approach, the recent work by Lüngen
et al. (2006) places the connectives at the center of the analysis, recording information
about them in a specific lexicon (similar to our own earlier work (Stede, 2002)). In the
lexicon used by Lüngen et al., an entry consists of three zones: the identification zone
gives the textual representation of the connective, its lemma and part-of-speech tag;
the filter zone encodes necessary conditions for particular discourse relations, in the
form of context descriptions; the allocation zone then specifies a default relation to be
assumed if no other relation can be derived on the basis of further (soft) conditions. It
also encodes constraints on the size of units to be related, the nuclearity assignment,
and the information whether the segment including the connective attaches to the left
or to the right in the text. Each entry gives rise to a rule used by a shift-reduce parser
that tries to build a complete rhetorical tree. This parser works in close cooperation
with a module identifying logical document structure, and the context conditions specified in lexicon entries often refer to this level of structure, or to a syntactic dependency
analysis provided by the Connexor parser.2
We share with these approaches (and with that of Polanyi et al. (2004)) the desire to
derive as much information about discourse relations as possible without resorting to
non-linguistic knowledge, so that the role of local coherence analysis in effect can be
seen as extending the realm of robust sentence parsing. Our approach is to represent
as much of the necessary information as possible in a declarative resource: a lexicon
2 http://www.connexor.com
of connectives.
3 Complications with Connectives
Connectives are closed-class lexical items that can belong to four different syntactic
categories: coordinating and subordinating conjunction, adverbial, and preposition
(such as despite or due to). They have in common that semantically they denote two-place relations, and the text spans they relate can at least potentially be expressed as
full clauses (Pasch et al., 2003). As mentioned in the beginning, they are not always
as easy to interpret as in our “well-formed XML” example. In this section, we suggest
an inventory of the complications that a thorough local coherence analysis procedure
needs to deal with. We group them into four categories.
Ambiguity. Here we need to distinguish two kinds: (i) ambiguity as to whether a
word is used as a connective or not, and (ii) ambiguity as to the semantic reading of a
connective. Certain cases of (i) correspond to the distinction between ‘sentential use’
and ‘discourse use’ that Hirschberg and Litman (1994) had proposed not for connectives but more generally for ‘cue phrases’ in spoken language. For example, German
denn can be a coordinating conjunction (sentential use) or a particle often used in
questions without a recognizable semantic effect (discourse use). Other cases of (i)
reflect ambiguity between different ‘sentential’ uses. Sometimes this coincides with a
syntactic difference (e.g., English as is a connective only when used as subordinator),
but with many adverbials it does not (e.g., German daher can be a locative adverbial ‘from there’ or a causal adverbial ‘therefore’). Also, sometimes the distinction
coincides with semantic scope, as with the focus particle / connective nur (‘only’):
(1) Es war ein schöner Sommertag. Nur die Vögel sangen nicht.
(‘It was a nice summer day. Only the birds weren’t singing.’)
In a narrow-scope reading of ‘only’, the message is that everybody was singing except
for the birds; in a wide-scope reading, ‘only’ connects the two sentences and signals
a restrictive elaboration. Ambiguity of type (i) is more widespread than one might
think; in Dipper and Stede (2006), we report that 42 out of 135 frequent German
connectives also have a non-connective reading, and we point out that many of the
problems cannot be handled with off-the-shelf part-of-speech taggers.
Concerning ambiguity (ii), some connectives can have more than one semantic
reading, which we regard as a difference in the coherence relation being signaled.
Sometimes, the relation can be established on different levels of linguistic description
(see, e.g., Sweetser (1990)). For example, finally can be used to report the last one
in a sequence of events, or it can be used by the author as a device for structuring
the discourse (“and my last point is...”). Interestingly, the very similar German word
schließlich in addition has a third reading: It can also be an argumentative marker
conveying that a presented reason is definitive or self-evident, which in English may
be signaled with ‘after all’: Vertraue ihr. Sie ist schließlich die Chefin. (‘Trust her.
She is the boss, after all.’)
Pragmatic features. In addition to the relational differences, connectives can sometimes be distinguished by more fine-grained pragmatic features, which are usually not
modeled as a difference in coherence relation. A well-known case in point is the difference between because and since (corresponding to German weil / da), where only
the latter has a tendency to mark the following information as hearer-old (not necessarily discourse-old). The same pair of connectives serves to illustrate the feature of
non-/occurrence within the scope of focus particles:
(2) Nur weil/?da es regnet, nehme ich das Auto
(‘Only because/?since it’s raining, I take the car.’)
While in German, the da variant is hardly acceptable at all, in English there is a tendency for since to be interpreted in its temporal reading when used within the scope
of only.
Also, connectives can convey largely the same information yet differ in terms of
stylistic nuances, for instance in degree of formality. Thus a concessive relation in
English may be signaled in a standard way with although, or with a rather formal, and
in that sense “marked” notwithstanding construction.
Form. While the majority of connectives consist of a single word, some of them have
two parts. Well-known instances are either .. or and if .. then. For the German version
of the latter (wenn .. dann), a coherence analyzer must account for the possibility
of its occurring in reverse order: Dann nehme ich eben das Auto, wenn Du so bettelst. (‘Then I’ll take the car, if you’re begging so much’.) Further, looking at highly
frequent collocations such as even though or even if, it is difficult to decide whether
we are dealing with a single-word connective and a focus particle, or with a complex
connective; one solution is to check in such cases whether the meaning is in fact derived compositionally and then to prefer the focus particle analysis. From “regular”
two-word connectives it is only a small step to the shady area of phrasal connectives,
which can allow for almost open-ended variation and modification: for this reason /
for these reasons / for all these very good reasons / ....
For German, we have dealt with the issue of differentiating between types of multi-token connectives in a separate paper (Stede and Irsig, 2008).
Discourse structure. As is well-known, the structural description of a text can also
be more complicated than in our “well-formed XML” example shown at the beginning.
For one thing, discourse units can be embedded into one another, using parenthetical
material or appositions. Further, connectives can occasionally link text segments that
are non-adjacent — a phenomenon that has been studied intensively by Webber et al.
(2003) and also by Wolf and Gibson (2005). An example from Webber et al.: John
loves Barolo. So he ordered three cases of the ’97. But he had to cancel the order
because then he discovered he was broke. Here, the then is to be understood as linking
the discovery event back to the ordering event rather than to the (adjacent) canceling.
In German, many adverbial connectives have an overt anaphoric affix (e.g., deswegen,
daher, trotzdem), and the ability to link non-adjacent segments appears to be restricted
to these. Non-adjacency also leads to the issue of crossing dependencies, which is also
discussed by the two teams of authors mentioned above. It correlates with the problem
of two connectives occurring in the same clause, as it happens in the Barolo example
(because then), which renders the parsing task significantly more complex than in the
“well-formed XML” example.
A different problem is to be found in situations where a single coherence relation
is signaled twice, by two different connectives, where one typically is to be read cataphorically:
(3) Ich nehme deshalbi das Auto, weili Du so bettelst.
(‘I take the car (for that reason)i becausei you’re begging so much.’)
This phenomenon is difficult to reproduce in English; again, in German it is also
limited to a certain class of connectives that can serve as cataphoric ‘correlates’. Obviously, in such examples, a coherence analyzer will have to be very careful not to
hypothesize two separate causal relationships. The same danger applies when multiple causes are enumerated for the same consequence, or multiple consequences arising
from the same cause. The mere insertion of the focus particle auch (‘also’) in example
3 can fundamentally change the discourse structure to stating two reasons for taking
the car:
(4) Es regnet sehr stark. Ich nehme deshalb das Auto, auch weil Du so bettelst.
(‘It’s raining heavily. I therefore take the car, also because you’re begging so
much.’)
Finally, it is to be noted that certain connectives convey information about the discourse structure beyond the local relation between two segments. A case in point is
the first word of this paragraph, which not only makes a ‘List’ or ‘Enumeration’ relation explicit, but also provides the information that this very list is now coming to an
end. A smart coherence analyzer could thus reduce the search space for linking the
subsequent text segment — it will definitely not be part of the same ‘List’ configuration.
4 A Rich Lexical Resource for Connectives
For building programs to perform local coherence analysis on texts that display the
complexities discussed above, our approach is to clearly divide the labor between
a declarative connective lexicon on the one hand, and a flexible analysis procedure
on the other. In this section, we describe our Discourse Marker Lexicon ( DIMLEX ),
whose first version was described in Stede (2002). At the time, it was used for relatively simple text parsing as outlined at the beginning of the paper, and also for a language generation application. The multi-functionality results from using a rather abstract XML encoding for the “master” lexicon, which is transformed by XSLT scripts
to the format needed by a specific application — both in terms of technical format
(e.g., programming language) and the amount and granularity of information needed
for the application. With our current focus on causal relations, we extended the DIMLEX entries of the causal connectives to a richer scheme, which will gradually be
transferred to the remaining connectives as well.
It is not trivial to define an inventory of causal connectives, due to the grey area
of words marking a semantic relationship that readers can also interpret causally —
after all, causality is very often not explicitly signaled but being left for the reader to
reconstruct. For example, in The wind shook the shed for a few seconds, and then
it collapsed there certainly is causality involved in the relationship between the sentences, but we would not want to treat and or then as causal connectives. With the
help of the ‘Handbook of German Connectives’ (Pasch et al., 2003), we determined a
set of 66 German connectives that primarily convey causality.
The DIMLEX entries for these connectives consist of the following zones of information: (1) orthography, syntax, and structural features; (2) non-/connective disambiguation rules; (3) semantic and pragmatic features, including information on disambiguating different readings, and on role linking. As for the type of information,
entries contain both binary features and probabilities derived from corpus analyses.
Orthography and syntax. Orthographic variants that we store in the lexicon result
from the recent official German spelling reform and from frequent mistakes made
by speakers/authors (as found in corpora). Also, we list both upper and lower case
spellings because this difference plays a role in many disambiguation rules (see below). Each variant has a unique identifier that is being used in those rules. Also, one
of the variants is marked as ‘canonical’ for co-reference purposes. Here is a sample
excerpt from the entry for aufgrund, corresponding to the English due to:
<orth type="simple" canon="1" onr="k2v1">
<part type="cont">aufgrund</part> </orth>
<orth type="complex" canon="0" onr="k2v2">
<part type="cont">auf Grund</part> </orth>
<orth type="simple" canon="0" onr="k2v3">
<part type="cont">Aufgrund</part> </orth>
<orth type="complex" canon="0" onr="k2v4">
<part type="cont">Auf Grund</part> </orth>
Each orth is of type ‘simple’ or ‘complex’, depending on the number of tokens
involved. For simple connectives (single tokens), the part type is always ‘cont’
(continuous), whereas for complex connectives it may also be ‘discontinuous’ if linguistic material can intervene between the parts (which is not the case for the two
complex variants above).
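A consumer of these entries can read off the variants with any standard XML library. The following sketch assumes a wrapping <entry> element around the excerpt shown above; that wrapper and the Python representation are our own, only the orth/part structure and its attributes come from the lexicon excerpt:

```python
import xml.etree.ElementTree as ET

# Sketch (assumed wrapper element "entry"): collect the orthographic
# variants of a DIMLEX entry and pick out the canonical one.
ENTRY = """<entry word="aufgrund">
  <orth type="simple"  canon="1" onr="k2v1"><part type="cont">aufgrund</part></orth>
  <orth type="complex" canon="0" onr="k2v2"><part type="cont">auf Grund</part></orth>
  <orth type="simple"  canon="0" onr="k2v3"><part type="cont">Aufgrund</part></orth>
  <orth type="complex" canon="0" onr="k2v4"><part type="cont">Auf Grund</part></orth>
</entry>"""

def variants(entry_xml):
    """Map each variant identifier (onr) to (spelling, is_canonical)."""
    root = ET.fromstring(entry_xml)
    return {o.get("onr"): (o.findtext("part"), o.get("canon") == "1")
            for o in root.iter("orth")}

v = variants(ENTRY)
canonical = next(text for text, canon in v.values() if canon)
print(canonical)  # aufgrund
```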
Syntactically, connectives can be subordinating conjunctions; Postponierer; pre-, post- and circumpositions; and adverbials, some of which can occur only in specific positions (characterized in accordance with the Feldermodell that is often used
to describe German sentence structure in terms of Vorfeld, Mittelfeld, Nachfeld). We
encode this information following the classification by Pasch et al. (2003), whose
primary criterion is whether the connective can be integrated into the clause, and if
so, at what positions it can occur. Here is the information for the prepositional adverb
(‘padv’) dadurch (‘by means of this’):
<padv>
<vorfeld>1</vorfeld>
<mittelfeld>1</mittelfeld>
<nacherst>0</nacherst>
<nachfeld>1</nachfeld>
<nullstelle>0</nullstelle>
<nachnachfeld>0</nachnachfeld>
<satzklammer>0</satzklammer>
</padv>
The binary features say that the connective can be in the Vorfeld (preceding the
finite verb or auxiliary: Dadurch ist es geschehen), Mittelfeld (between auxiliary and
verb: Es ist dadurch geschehen), and Nachfeld (following the verb phrase: Es ist
geschehen dadurch).
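Read as data, these flags directly license or rule out placements for a candidate occurrence. A minimal sketch (the dictionary mirrors the entry above; the function and its name are our own illustration):

```python
# Positional features of the prepositional adverb "dadurch",
# transcribed from the binary Feldermodell flags in its lexicon entry.
DADURCH_POSITIONS = {"vorfeld": 1, "mittelfeld": 1, "nacherst": 0,
                     "nachfeld": 1, "nullstelle": 0, "nachnachfeld": 0,
                     "satzklammer": 0}

def licensed(position):
    """True iff the connective may occur in the given topological field."""
    return bool(DADURCH_POSITIONS[position])

print(licensed("vorfeld"))   # True
print(licensed("nacherst"))  # False
```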
As a representation more directly usable for computational purposes, we also specify patterns of the connective being situated in a syntax tree in TIGER format (Brants
et al., 2004). This format is used both in large hand-annotated German corpora as well
as in an automatic parser3. The idea of the patterns in the lexical entry thus is to find
instances of the word in a TIGER-tree, whether coming from a treebank or from a
parser. For illustration, here is the pattern for the complex connective so .. dass (‘so ..
that’):
(#avp:[cat="AVP"] > [lemma="so"])
&
((#avp > #s:[cat="S"])
|
((#avp > #cs:[cat="CS"]) &
(#cs > #s:[cat="S"]))
)
&
(#s > [lemma=("dass")])
This expression looks for an adverbial phrase (AVP) that dominates both so and a sentence (S), or a coordination of sentences (CS) that in turn dominate dass. Between the
so and dass, any material can intervene. An example matched by this expression in
the TIGER corpus is: Der Kanzler hat China so gern , daß er ihm sogar die höchsten
Berge der Welt zu schenken vermöchte. (‘The chancellor likes China so much, that he
even wants to give the world’s highest mountains as a present to the country.’)
Besides the syntactic structure of individual conjuncts, we also need to represent
the possibilities for the linear order of the conjuncts. This is also based on the terminology
of Pasch et al. (2003), who distinguish between the internal conjunct (the clause or
phrase that the connective syntactically belongs to) and the external one. Sometimes,
this is a hard constraint: With the conjunction denn (causal ‘for’), the internal conjunct
can only follow the external one. With other connectives, e.g., weil (‘because’), both
orderings are possible, i.e., the because-clause giving a reason can precede or follow
the clause giving the effect. In these cases we include probabilities derived from a
corpus analysis, which the coherence analysis module can use for disambiguating
scope when it has no other information available.
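As a sketch, such a fallback can be little more than a probability lookup. The numbers below are invented for illustration; in the lexicon they are derived from corpus counts:

```python
# Sketch with invented probabilities: P(internal conjunct precedes
# the external one). In the lexicon, these values are corpus-derived.
ORDER_PROB = {
    "denn": 0.0,   # hard constraint: internal conjunct only follows
    "weil": 0.3,   # both orders attested; following is more frequent
}

def preferred_order(connective):
    """Default ordering guess when no other evidence is available."""
    return "internal-first" if ORDER_PROB[connective] > 0.5 else "internal-last"

print(preferred_order("denn"))  # internal-last
print(preferred_order("weil"))  # internal-last
```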
The syntactic representations become somewhat more complicated in the case of complex connectives. For instance, there is a variant of dadurch that co-occurs with a subsequent (but not necessarily adjacent!) complement clause headed by dass (‘that’).
Similarly, as shown in the previous section, certain causal conjunctions and adverbials
can co-occur and redundantly mark the same relation. Our lexicon entries contain features representing those possible pairings. For a more general discussion on German
complex connectives, see Stede and Irsig (2008).
Finally, we include a feature stating whether the connective can be in the scope of a
focus particle. This information can sometimes support non-/connective disambiguation.
3 http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
Non-/connective disambiguation. In Dipper and Stede (2006), we reported on an
approach to disambiguating non-/connective use for nine connectives by incrementally training a Brill tagger, which led to F-measures of 81% (+connective) and 95%
(–connective) in the best of four training scenarios. During this work it became clear
that the part-of-speech context of the word often indeed provides enough information
for making the decision. The main reason why off-the-shelf taggers, however, do not
perform very well is that tagsets do not reflect the distinction — recall the syntactic
heterogeneity of the “class” of connectives. From our findings we thus constructed for
each connective a set of patterns over part-of-speech and lemma information, leading
to regular expressions associated with probabilities (again gathered from corpus studies). These expressions become part of the DIMLEX entries and can be used by the coherence analyzer. Starting from the Dipper/Stede results, we manually created classes
of connectives with apparently-equivalent behavior, rather than studying each of the
66 connectives in detail. For illustration, here is the pattern set for daher, which can
be a causal connective (‘therefore’) or a locative adverb (‘from there’):
<conn-disambi>
<pros>
<pro value="90" ref="k5v2"> $. $$/PROAV </pro>
<pro value="90" ref="k5v1"> VVFIN $$/PROAV </pro>
</pros>
<cons>
<con value="99" ref="k5v1 k5v2">
$$/PROAV $, {’dass’}/KOUS
</con>
<con value="95" ref="k5v2">
$. $$/PROAV .* {’kommen’ ’ruehren’} .+ $, {’dass’}/KOUS
</con>
<con value="99" ref="k5v1"> $$/PROAV $. </con>
</cons>
</conn-disambi>
Weights range from 0 to 100, so 99 represents basically a strict rule. Notice the ref
attribute, which restricts the rules to orthographic variants (in this case to upper and
lower case ones). The first two rules support a +connective reading: daher tagged as
pronominal adverb (PROAV) following a full stop or a finite verb, respectively. The
following three rules support a –connective reading: daher followed by the subordinating conjunction (KOUS) dass; occurring in a collocation like kommt daher, dass
(‘stems from’); occurring before a full stop, i.e., sentence-final.
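The effect of such rules can be sketched as weighted regular-expression matching over a tagged context string. The patterns below paraphrase the daher entry; the scoring machinery and the flat "word/POS" token encoding are our own simplification:

```python
import re

# Sketch: weighted pro/con evidence for the connective reading of
# "daher". Patterns paraphrase the lexicon rules above; the scoring
# code itself is our own illustration. Tokens are "word/POS"; "$."
# and "$," stand for sentence-final and comma punctuation tags.
PRO = [  # evidence FOR the +connective reading
    (90, r"\$\. \S+/PROAV"),        # daher right after a full stop
    (90, r"\S+/VVFIN \S+/PROAV"),   # daher right after a finite verb
]
CON = [  # evidence FOR the -connective reading
    (99, r"\S+/PROAV \$, dass/KOUS"),  # "daher , dass ..."
    (99, r"\S+/PROAV \$\."),           # sentence-final daher
]

def connective_score(context):
    """Positive score favors +connective, negative favors -connective."""
    score = 0
    for weight, pattern in PRO:
        if re.search(pattern, context):
            score += weight
    for weight, pattern in CON:
        if re.search(pattern, context):
            score -= weight
    return score

print(connective_score("$. daher/PROAV ist/VAFIN"))  # 90
print(connective_score("daher/PROAV $, dass/KOUS"))  # -99
```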
Semantics and Pragmatics. As stated earlier, we identify a difference in readings
with a difference in coherence relation signaled by the connective. As for the inventory of relations, we take inspiration from Mann and Thompson (1988), Asher
and Lascarides (2003), and especially for the causal relations, from the taxonomic
approach of Sanders et al. (1992). Not every distinction made in the literature can
be traced to connectives; so, for instance, we do not follow RST’s distinction between
‘Volitional Cause’ and ‘Non-volitional Cause’ in DIMLEX. But we find differences in
connective use for semantic versus pragmatic causal relations (Sanders et al., 1992).
For instance, the denn used in (5) below is quite typical for pragmatic relations (see,
e.g., Pasch, 1989).
(5) Er wird bestimmt pünktlich kommen, denn er ist doch immer so gewissenhaft.
(‘Surely he will arrive on time, for he is always so assiduous.’)
Thus, in the realm of causality we use coherence relations labeled ‘Argument-Claim’
(pragmatic) and ‘Reason-Consequence’ (semantic). Further, if the consequence is a
yet-unrealized intended effect, we assign the relation ‘Purpose’, as has been suggested by Mann and Thompson (1988). The connectives associated with Purpose are
mostly quite specific (e.g., English in order to; German um .. zu), but there can also
be ambiguity between Purpose and “other” causality (e.g., English so that; German
damit).
Disambiguation between the semantic and the pragmatic relation is usually very
difficult and thus a matter of heuristically weighing the evidence. Similar to our handling of non-/connective disambiguation (see above), we use a scheme of weight accumulation for features indicating the presence of a relation. For example, for the
connective schließlich we found that with the main verb of the clause elided, the pragmatic reading is very unlikely; on the other hand, if the verb is in present tense and the
Aktionsart is ‘state’, it very likely signals the pragmatic ‘Argument-Claim’ relation.
Other evidence for this relation includes modal particles signaling the epistemic status
of the proposition(s), often in conjunction with present or future tense. This is illustrated in example 5 above, where the speaker expresses her confidence that the event
will materialize with bestimmt (‘surely’), while doch in the second clause marks the
information as hearer-old, so that the difference between claim and argument in this
case is quite transparent. Other features we modeled are inspired by the empirical
work of Frohning (2007). They include position, tense and aspect of the clause, mood
and modality, and lexical collocations; Frohning derived their weights from corpus
analyses.
Often, however, no compelling evidence for any of the three relations can be
found, and for these cases we use a neutral relation called ‘Cause-Caused’, which is
thus meant to subsume the two others.
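The accumulation scheme itself is simple. The following sketch uses invented features and weights for schließlich (the real ones are corpus-derived) and falls back to the neutral relation when no evidence clears a threshold:

```python
# Sketch with invented features and weights for "schließlich":
# each observed feature votes for a relation; below a threshold,
# we fall back to the neutral 'Cause-Caused' relation.
FEATURES = {  # feature -> (relation it supports, weight)
    "verb_elided":         ("Reason-Consequence", 60),
    "present_tense_state": ("Argument-Claim", 70),
    "modal_particle":      ("Argument-Claim", 50),
}
THRESHOLD = 50

def classify(observed):
    """Accumulate evidence over observed features and pick a relation."""
    scores = {}
    for feature in observed:
        relation, weight = FEATURES[feature]
        scores[relation] = scores.get(relation, 0) + weight
    if not scores or max(scores.values()) < THRESHOLD:
        return "Cause-Caused"   # neutral fallback
    return max(scores, key=scores.get)

print(classify({"present_tense_state", "modal_particle"}))  # Argument-Claim
print(classify(set()))                                      # Cause-Caused
```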
In addition to relation(s), a lexicon entry specifies the role linking for connectives:
the mapping from the syntactically internal or external conjunct (see above) to its
function in the relation. We label these functions in accordance with the relations:
‘Argument’, ‘Claim’, ‘Reason’, and so forth. Since causal relations are directed, and
the mapping cannot be predicted from syntactic features, it is crucial to represent this
information explicitly.
Besides, we use a number of more idiosyncratic features to represent information
that is relevant only for certain connectives, in particular to distinguish very similar
ones. An example mentioned in the previous section is the information-structural difference between weil (‘because’) and da (‘since’). For other families of connectives,
this “miscellaneous features” section is more important; with temporal connectives,
for instance, we specify in addition to the coarse-grained coherence relation more
fine-grained distinctions such as whether the time spans of the related events meet or
not, etc.
Having discussed our treatment of syntax and semantics separately, we now have
to attend to the relationship between the two, i.e., to the issues of ambiguity and polysemy. The majority of connectives have one syntactic description and can convey
one or two similar coherence relations (the typical ambiguity between semantic and
pragmatic reading). We do, however, also find other configurations:
• Two syntactic descriptions: weil used to be a subordinating conjunction, but in
spoken German is now widely accepted as a coordinating conjunction as well.
Since the meaning is the same, it suffices to simply list both syntactic variants
in DIMLEX.
• One syntactic description, many coherence relations: When used as an adverb,
the connective damit can signal Purpose (‘so that’) or Reason-Consequence
(‘thus’). This situation is similar to the previous one: We provide a disjunction
of semantic readings (including the disambiguation information) and a single
syntactic description.4
• Two syntactic descriptions, several coherence relations: These cases are the
only serious complications, as a difference in syntax can correlate with one in
semantics, so that we cannot simply specify disjunctions for the syntactic and
semantic descriptions. Instead, we use multiple lexical entries, in accordance
with the intuition that we are dealing with fairly unrelated items (polysemy).
An example is dann (‘then’), which on the one hand is a temporal adverbial,
and on the other hand can express a Condition relation (optionally with a corresponding wenn (‘if’) in the other clause). In the latter case, it does not behave
as an adverb, though, but it governs a verb-second clause. So, distributing the
information across two separate lexicon entries seems to be appropriate.
Finally, to enhance the maintainability of DIMLEX, we include with the entries
a range of linguistic examples that illustrate the relevant distinctions, and we also
cite information that is provided by standard dictionaries — especially in those cases
where our formalization is not yet complete. One of the XSLT scripts for converting
DIMLEX maps the base lexicon to an HTML format that allows for inspecting the
entries, including the information just mentioned, which is intended for the human
eye rather than for automatic parsers or generators.
5 A Corpus Annotated with Causal Relations
As a preparatory step for implementing a local coherence analyzer that aims specifically at identifying causal relations, we built a corpus with causal connectives annotated manually. We selected 200 short texts from a product review web site5 , where
travelers comment on various tourist destinations. Since they often give reasons for
the opinions they express, this genre offers more instances of causal connectives than,
say, newspaper text. On the other hand, there is the undeniable drawback of frequent
4 As a matter of fact, the situation is more difficult: Damit is one of the most complicated words in
our lexicon, as it also has a reading as subordinator where it signals Purpose, as well as a non-connective
adverbial reading (‘with it/that’).
5 http://www.dooyoo.de
mistakes in grammar and orthography, which makes any automatic analysis quite hard,
and also sometimes poses challenges to the human annotator.
Creating the corpus involved several steps. First, potential causal connectives were
searched automatically (using the list from DIMLEX) and manually filtered. Subsequently, identifying causal connectives was not an issue for the annotation process, as
they were already presented to annotators as “anchors” for their task. We then designed annotation guidelines with instructions for identifying causes and effects. As
for the length of spans, annotators were encouraged to prefer a shorter span in cases
where the boundary of a cause or effect is not quite clear. At the same time, they were
asked to mark two discontinuous spans in cases where a cause/effect was interrupted
by extraneous material such as authors’ remarks on their own text production. Thus,
in the following example, the C1 and C2 indices mark the intended cause, and E the
intended effect.
(6) [The beach was not very pleasant]E , as [it was,]C1 I just have to say this here,
[utterly littered with remains of picnics.]C2
When multiple reasons are given for the same effect (or vice versa), annotators had to
mark them separately, so that each cause-effect pair can be derived individually from
the annotated data. Sometimes this multiplicity can involve separate connectives, as
in the following example. In such cases, annotators had to choose a central connective
(the one linking the adjacent cause and effect) and then add additional ones as secondary connectives, possibly forming a chain. This ensures easy retrievability of all
pairs from the data.
(7)
[We reached the hotel late]E [due to]Co1 [the flight’s delay]Ca and also
[because]Co2 [it took so long to find a cab.]Cb
Further, annotators had to identify possible redundant markings of the same cause-effect pair (as with the cataphoric correlates discussed above) as well as focus particles
that modify connectives. Thus, in example (7), they would mark also and link it to the
modified because.
Our first version of the annotation guidelines was subject to an informal evaluation with annotators who had not been involved in the project. On the basis of the
results we clarified several aspects in the guidelines and thus wrote the final version.
Furthermore, we prepared two instructional videos: one for using the annotation tool
MMAX26 , and one for our specific annotation scenario, illustrating the handling of a
fairly complicated text passage. In the formal evaluation, the two annotators received no training beyond the guidelines and the two videos. Of 78 connectives,
34 were analyzed identically. The vast majority of the mismatches (36 of 44) resulted from differing span lengths: There was overlap between the spans chosen by the
annotators, but the boundaries were not exactly identical. Other mismatches, which
occurred only a few times, included different decisions on secondary connectives and
the resulting chains of causes/effects.
Finally, with the guidelines having become stable, experienced annotators created
the “official” annotation of the entire corpus of 200 texts (containing some 1,200
6 http://mmax2.sourceforge.net
causal connectives). It is now available as a resource for training and evaluation
of automatic procedures. We also developed a web-based viewer (essentially translating the MMAX2 format to HTML and Javascript) that allows for manually browsing
the corpus comfortably.7
6 Towards Recognizing Causal Relations Automatically
Having described DIMLEX as the central resource for local coherence analysis, and
the corpus as reference and evaluation tool, we now briefly sketch a procedure for
recognizing causal relations, whose implementation is currently under way in our text
analysis workbench (Chiarcos et al., 2008), a standoff XML architecture for fusing
linguistic annotations coming from different manual or automatic annotation tools. In
this highly modular approach, the output of each individual analysis module is stored
in a separate layer, using our standoff XML format PAULA (Dipper, 2005). Analysis
tools can use previously computed layers for their own task, which usually involves
creating one or more new layers.
In this setting, the task of local coherence analysis involves the following layers.
The first four are to be built in the pre-processing phase, and the last two are the result
of the coherence analyzer:
1. Token layer (including sentence boundaries)
2. Part-of-Speech
3. Logical document structure (headlines, paragraph breaks, etc.)
4. Dependency syntax analysis
5. Elementary discourse units
6. Connectives and (sets of) EDUs they relate
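As a rough illustration of the standoff idea, each module writes its output into its own named layer over the shared token sequence. The classes and span encoding below are invented for illustration and are much simpler than the actual PAULA format:

```python
# Minimal sketch of a standoff-layer store. The real PAULA format is
# XML-based and considerably richer; this only illustrates the idea that
# every analysis module contributes a separate, independently stored layer.
class Layer:
    def __init__(self, name):
        self.name = name
        self.annotations = []  # (start_token, end_token, label) triples

    def add(self, start, end, label):
        self.annotations.append((start, end, label))

class Document:
    """Modules read existing layers and write their results into new ones."""
    def __init__(self):
        self.layers = {}

    def layer(self, name):
        return self.layers.setdefault(name, Layer(name))

doc = Document()
doc.layer("tokens").add(0, 0, "weil")
doc.layer("pos").add(0, 0, "KOUS")          # STTS tag for a subordinator
doc.layer("connectives").add(0, 0, "causal")  # result of a later module
```

A later module, such as the coherence analyzer, would read the token and PoS layers and add the EDU and connective layers on top, without modifying any existing layer.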
The procedure of coherence analysis consists of the following three sequential
steps, which at various points make use of information from DIMLEX:
Connective identification. All the words listed in DIMLEX as some orthographic
variant of a causal connective are identified in the text. This includes a check for
complex connectives as listed in the lexicon, i.e., two corresponding words in adjacent
clauses (amongst others, the if .. then type). It also includes a check for correlates,
i.e., a connective that according to DIMLEX can be a correlate occurring in a clause
immediately preceding a subordinate clause governed by a connective that according
to DIMLEX can have a correlate (the deshalb .. weil type). For these checks, the syntax
layer (4) is used to identify adjacent clauses.
Next, the single-word connective candidates are run through the disambiguation
filters, i.e., the PoS/token regular expressions specified in their lexical entries are
matched against the text’s PoS representation on the corresponding layer (2). Those
items that appear to be words in non-connective use are removed from the connective
list. Finally, a new layer (6) is created, for now holding only the words that were
recognized as connectives.
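The PoS-based disambiguation filtering might be sketched as follows; the lexicon pattern, the example, and the STTS-style tags are invented for illustration and do not reproduce actual DIMLEX entries:

```python
import re

# Hypothetical disambiguation filter: a candidate connective survives only
# if the PoS context matches the pattern stored in its lexical entry.
# The pattern below is invented; real DIMLEX regular expressions differ.
LEXICON = {
    # "während" as connective: subordinator (KOUS), not preposition (APPR)
    "während": re.compile(r"\bKOUS\b"),
}

def filter_connectives(tokens, pos_tags, candidates):
    """Keep only candidate indices whose PoS tag matches their entry."""
    kept = []
    for i in candidates:
        pattern = LEXICON.get(tokens[i].lower())
        if pattern and pattern.search(pos_tags[i]):
            kept.append(i)
    return kept

tokens = ["Während", "des", "Essens", "schlief", "er", "."]
pos = ["APPR", "ART", "NN", "VVFIN", "PPER", "$."]
# Prepositional use of "während": filtered out as a non-connective.
print(filter_connectives(tokens, pos, [0]))  # -> []
```

Only items that pass this filter are written to the new connective layer (6).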
7 All material can be found at http://www.ling.uni-potsdam.de/~stede/kausalkorpus.html.
Segmentation. The basic idea of our approach follows that of the module implemented by Lüngen et al. (2006) for German. We first overgenerate, guessing segment
boundaries at every possible position, according to the dependency parse result; then,
contextual rules remove those boundaries that appear to be wrong (e.g., commas in
enumerations). We are, however, using somewhat different definitions of segments,
namely a variant of Jasinskaja et al. (2007), and the corpus annotated according to
those segmentation guidelines will be used to evaluate our module. One issue where
we diverge from both Lüngen et al. and from Jasinskaja et al. is in our handling of
prepositions: We do admit certain prepositional phrases as elementary discourse units,
but only those that are headed by a preposition listed in DIMLEX, e.g., the causal
markers wegen (‘due to’) or durch (‘through’). The resulting sequence of segments is
represented on a new layer of analysis (5).
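The overgenerate-then-filter strategy can be illustrated with a toy contextual rule for enumeration commas; the actual rules operate on the dependency parse and are considerably more elaborate:

```python
# Sketch of overgenerate-and-filter segmentation: hypothesize a boundary
# at every punctuation position, then remove boundaries that a contextual
# rule flags as spurious. The enumeration rule below is a toy stand-in.
def guess_boundaries(tokens):
    return {i for i, t in enumerate(tokens) if t in {",", ";", "."}}

def is_enumeration_comma(tokens, i):
    # Toy rule: a comma directly between two capitalized words (German
    # nouns) is treated as an enumeration comma, not a clause boundary.
    return (tokens[i] == ","
            and 0 < i < len(tokens) - 1
            and tokens[i - 1][0].isupper()
            and tokens[i + 1][0].isupper())

def segment_boundaries(tokens):
    return {i for i in guess_boundaries(tokens)
            if not is_enumeration_comma(tokens, i)}

tokens = ["Wir", "kauften", "Brot", ",", "Käse", ",", "Wein", "."]
print(sorted(segment_boundaries(tokens)))  # -> [7]
```

The two enumeration commas are overgenerated as boundaries and then filtered away, leaving only the sentence-final boundary.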
Relation and scope identification. Next, the connective layer (6) is extended with
information on relations and scopes: Every connective is associated with one or more
attribute-value structures listing possible coherence relations along with probabilities.
To this end, all relations stored with the connective in DIMLEX are recorded as hypotheses, and weights are accumulated as the result of evaluating the associated disambiguation rules, which largely operate on the syntax layer, as explained in Section 4.
Finally, for each relation we also hypothesize its scope: the thematic roles are associated with sequences of minimal units from layer (5). Given a reliable syntactic
analysis, scope determination is usually straightforward for coordinating and subordinating conjunctions. For adverbials, we hypothesize different solutions and rank them
according to size: The most narrow interpretation is taken as most likely. In this step,
we also consider the layer of logical document structure in order to avoid segments
that would stretch across paragraphs or other kinds of boundaries. Similarly, a layer
with the results of “text tiling” (breakdown of the text in terms of thematic units, in the
tradition of Hearst (1994)) could be used for this purpose, as well as an ‘attribution’
layer that identifies those modal contexts that attribute a span of text to a particular
source (as in indirect speech).
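The weight accumulation over relation hypotheses might look roughly like this, with invented integer weights standing in for the results of evaluating the lexicon's disambiguation rules:

```python
# Sketch of hypothesis weighting for an ambiguous connective. The weights
# are invented; real ones would come from evaluating the DIMLEX
# disambiguation rules against the syntax layer.
def rank_relations(lexicon_relations, rule_results):
    """Accumulate a weight per relation from the fired rules and rank."""
    scores = {rel: 0 for rel in lexicon_relations}
    for rel, weight in rule_results:
        scores[rel] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

# 'damit' as adverb: Purpose vs. Reason-Consequence (cf. Section 4).
hypotheses = rank_relations(
    ["Purpose", "Reason-Consequence"],
    [("Purpose", 6), ("Reason-Consequence", 3), ("Purpose", 2)],
)
print(hypotheses)  # -> [('Purpose', 8), ('Reason-Consequence', 3)]
```

The ranked list, together with the hypothesized scopes, is what gets recorded on the connective layer (6) for downstream modules to combine.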
In this way, the module will generate hypotheses of coherence relations and related
spans, for the time being solely on the basis of connectives occurring in the text. As
explained, this information is represented in two additional analysis layers. Modules
following in the processing chain may combine the various hypotheses into the most
likely overall relational tree structure for the paragraph (or a set of such tree structures, see Reitter and Stede (2003)), or they may use the hypotheses directly for some
application that does not rely on a spanning tree.
7 Discussion
The central idea behind the separation of the declarative DIMLEX resource and the (ongoing) implementation of an analysis procedure is to facilitate a smooth extensibility
of the overall approach towards further kinds of connectives and coherence relations.
When the lexicon is extended — while the underlying scheme remains unchanged
— coverage of the analyzer grows without adaptations to the analysis procedure. An
important benefit of the XML-based organization of the lexicon is its suitability for a
variety of applications (parsing, generation, lexicography), which can each select from
the master lexicon exactly those types of information that are relevant for them. On
the other hand, an obvious drawback of the present “flat” XML format is a relatively
high degree of redundancy. The good reasons for introducing inheritance-based representation formalisms in “standard” computational lexicons of content words largely
apply to the realm of connectives (and possibly to other function words) as well. For
the time being, however, the more mundane task of lexical description still offers a
great many open questions for individual connectives and families thereof; the issue
of more intelligent storage should become prominent later, when the groundwork has
stabilized.
As with the vast majority of coherence relations, causal ones often need not be
explicitly signaled at the linguistic surface by a connective. Thus the approach proposed in this paper will of course only partially solve the problem of local coherence
analysis. An important challenge for future work is to identify linguistic features of
discourse units other than connectives that can also serve to at least constrain the range
of admissible coherence relations (see, e.g. Asher and Lascarides, 2003). Investigating
these with empirical methods is an important next step in the overall program of partially deriving coherence relations in authentic text without resorting to non-linguistic
knowledge.
Acknowledgments
The following people from the Potsdam Applied Computational Linguistics Group
contributed to the work described in this paper (in alphabetical order): André Herzog
(causality corpus); Kristin Irsig (DiMLex entries for causal connectives); Andreas
Peldszus (causality corpus and annotation guidelines); Uwe Küssner (implementation
of LCA module).
References
Asher, N. and A. Lascarides (2003). Logics of Conversation. Cambridge: Cambridge
University Press.
Brants, S., S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lezius, C. Rohrer,
G. Smith, and H. Uszkoreit (2004). TIGER: Linguistic interpretation of a German
corpus. Research on Language and Computation 2(4), 597–620.
Chiarcos, C., S. Dipper, M. Götze, J. Ritz, and M. Stede (2008). A flexible framework for integrating annotations from different tools and tagsets. In Proc. of the
First International Conference on Global Interoperability for Language Resources,
Hong Kong.
Corston-Oliver, S. (1998). Computing of Representations of the Structure of Written
Discourse. Ph. D. thesis, University of California at Santa Barbara.
Dipper, S. (2005). XML-based stand-off representation and exploitation of multilevel linguistic annotation. In R. Eckstein and R. Tolksdorf (Eds.), Proceedings of
Berliner XML Tage, pp. 39–50.
Dipper, S. and M. Stede (2006). Disambiguating potential connectives. In M. Butt
(Ed.), Proceedings of KONVENS ’06, Konstanz, pp. 167–173.
Frohning, D. (2007). Kausalmarker zwischen Pragmatik und Kognition. Korpusbasierte Analysen zur Variation im Deutschen. Tübingen: Niemeyer. (In press).
Hearst, M. A. (1994). Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las
Cruces/NM.
Hirschberg, J. and D. J. Litman (1994). Empirical studies on the disambiguation of
cue phrases. Computational Linguistics 19(3), 501–530.
Jasinskaja, K., J. Mayer, J. Boethke, A. Neumann, A. Peldszus, and K. J. Rodríguez
(2007). Discourse tagging guidelines for German radio news and newspaper commentaries. Ms., Universität Potsdam.
Lüngen, H., H. Lobin, M. Bärenfänger, M. Hilbert, and C. Puskas (2006). Text parsing
of a complex genre. In B. Martens and M. Dobreva (Eds.), Proc. of the Conference
on Electronic Publishing (ELPUB 2006), Bansko, Bulgaria.
Lüngen, H., C. Puskas, M. Bärenfänger, M. Hilbert, and H. Lobin (2006). Discourse
segmentation of German written text. In T. Salakoski, F. Ginter, S. Pyysalo, and
T. Pahikkala (Eds.), Proceedings of the 5th International Conference on Natural
Language Processing (FinTAL 2006), Berlin/Heidelberg/New York. Springer.
Mann, W. and S. Thompson (1988). Rhetorical structure theory: Towards a functional
theory of text organization. TEXT 8, 243–281.
Marcu, D. (2000). The theory and practice of discourse parsing and summarization.
Cambridge/MA: MIT Press.
Pasch, R. (1989). Adverbialsätze – Kommentarsätze – adjungierte Sätze. Eine Hypothese zu den Typen der Bedeutungen von ‘weil’, ‘da’ und ‘denn’. In W. Motsch
(Ed.), Wortstruktur und Satzstruktur, Linguistische Studien des ZISW: Reihe A –
Arbeitsberichte 194, pp. 141–158. Berlin: Akademie der Wissenschaften der DDR.
Pasch, R., U. Brauße, E. Breindl, and U. H. Waßner (2003). Handbuch der deutschen
Konnektoren. Berlin/New York: Walter de Gruyter.
Polanyi, L., C. Culy, M. van den Berg, G. L. Thione, and D. Ahn (2004). A rule
based approach to discourse parsing. In Proceedings of the SIGDIAL ’04 Workshop, Cambridge/MA. Assoc. for Computational Linguistics.
Reitter, D. and M. Stede (2003). Step by step: underspecified markup in incremental
rhetorical analysis. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC), Budapest.
Sanders, T., W. Spooren, and L. Noordman (1992). Toward a taxonomy of coherence
relations. Discourse Processes 15, 1–35.
Stede, M. (2002). DiMLex: A lexical approach to discourse markers. In Exploring
the Lexicon - Theory and Computation. Alessandria: Edizioni dell’Orso.
Stede, M. (2008). RST revisited: Disentangling nuclearity. In C. Fabricius-Hansen
and W. Ramm (Eds.), ‘Subordination’ versus ‘coordination’ in sentence and text.
Amsterdam: John Benjamins.
Stede, M. and K. Irsig (2008). Identifying complex connectives: Complications for
local coherence analysis. In A. Benz, P. Kühnlein, and M. Stede (Eds.), Proceedings of the Workshop on Constraints in Discourse, Potsdam, pp. 77–84.
Sumita, K., K. Ono, T. Chino, T. Ukita, and S. Amano (1992). A discourse structure
analyzer for Japanese text. In Proceedings of the International Conference on Fifth
Generation Computer Systems, pp. 1133–1140.
Sweetser, E. (1990). From etymology to pragmatics. Cambridge: Cambridge University Press.
Webber, B., M. Stone, A. Joshi, and A. Knott (2003). Anaphora and discourse structure. Computational Linguistics 29(4), 545–587.
Wolf, F. and E. Gibson (2005). Representing discourse coherence: a corpus-based
study. Computational Linguistics 31(2), 249–287.