Building a Discourse-annotated Dutch Text Corpus

Nynke van der Vliet, Ildikó Berzlánovich, Gosse Bouma,
Markus Egg, Gisela Redeker
Beyond Semantics
DGfS Workshop, Göttingen, February 23-25, 2011
Overview
• Introduction MTO Project
• Corpus design
  - Text selection
  - Segmentation
  - Annotation
    - Discourse structure
    - Genre structure
    - Discourse connectives
    - Lexical cohesion
• Preliminary results and future work
Introduction
Modeling Textual Organization (MTO) Program
Build a Dutch text corpus, annotated for discourse structure,
genre structure, lexical cohesion, coreference, and discourse
connectives
Project Goals:
• Investigate the genre-dependent interaction between discourse structure and lexical cohesion (Project 1, Ildikó Berzlánovich)
• Investigate the mechanisms that establish coherence in text and develop algorithms for discourse parsing (Project 2, Nynke van der Vliet)
http://www.let.rug.nl/mto/
Corpus design
• Provide a reliably annotated “gold standard” resource covering a range of genres
• Core corpus: 80 texts (length: 190-400 words)
  - expository texts:
    20 encyclopedia texts
    20 popular science news texts
  - persuasive texts:
    20 fundraising letters
    20 advertisements
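
The corpus design above can be written down as a small data structure, which makes the genre balance and the length restriction easy to check. This is only an illustrative sketch: the dictionary keys, the function names, and the use of whitespace tokenization for counting words are assumptions, not part of the MTO project's tooling.

```python
# Planned composition of the core corpus, as listed on the slide.
# The key names are hypothetical labels chosen for this sketch.
CORE_CORPUS = {
    "expository": {"encyclopedia": 20, "popular_science_news": 20},
    "persuasive": {"fundraising_letter": 20, "advertisement": 20},
}

# Length restriction from the slide (in words).
MIN_WORDS, MAX_WORDS = 190, 400


def total_texts(design):
    """Sum the number of texts over all text types and genres."""
    return sum(n for genres in design.values() for n in genres.values())


def fits_length_range(text):
    """Check the length restriction, counting whitespace-separated tokens.

    Whitespace tokenization is an assumption; the slides do not say
    how words were counted.
    """
    n_words = len(text.split())
    return MIN_WORDS <= n_words <= MAX_WORDS


print(total_texts(CORE_CORPUS))
```

A candidate text would be passed to `fits_length_range` before being assigned to one of the four genres.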
Text Selection (1)
Preparation: selection of text material, stripping off
‘text-external’ elements
• Exclude pictures and picture captions
• Exclude genre-specific elements that are not related to
rhetorical choices
Text Selection (2) Example
Segmentation (1)
• EDU (elementary discourse unit) ~ simple sentence
  Each donation is valuable!
• EDU ~ finite clause
  You can build with us by donating, ][ but you can also build with us literally.
• EDU ~ fragment functioning as complete utterance
  Nice gadgets.
• EDU ~ non-restrictive relative clause
  This gap is caused by one of the moons of Saturn, Mimas, ][ which disturbs the rings.
Segmentation (2)
• EDU ~ embedded discourse unit
  However during the night, [ which can last for months on Mercury, ] the temperature decreases to about -185 degrees Celsius.
• EDU ~ coordinated VP
  At a young age a cataract in her eye was diagnosed ][ and treated.
• EDU ~ elliptical clause
  The planet turns around its axis in 58.6 days ][ and around the sun in 88.0 days.
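
The segmentation examples use two notations: `][` marks a boundary between adjacent EDUs, and `[ ... ]` encloses an embedded unit. A minimal sketch of how such annotated strings could be read back into units is given below. The function names are hypothetical, and the sketch assumes that a string uses only one of the two notations at a time, as in the examples on the slides; it is not the project's segmentation program.

```python
import re


def split_edus(annotated):
    """Split a string on the '][' boundary marker into adjacent EDUs."""
    return [part.strip() for part in annotated.split("][") if part.strip()]


def extract_embedded(annotated):
    """Separate '[ ... ]' embedded units from the host unit around them.

    Returns (host_text, list_of_embedded_units).
    """
    embedded = [m.strip() for m in re.findall(r"\[(.*?)\]", annotated)]
    # Remove each bracketed stretch and normalize the remaining whitespace.
    host = re.sub(r"\s*\[.*?\]\s*", " ", annotated)
    host = re.sub(r"\s+", " ", host).strip()
    return host, embedded


clause_example = ("You can build with us by donating, ][ "
                  "but you can also build with us literally.")
embedded_example = ("However during the night, [ which can last for months "
                    "on Mercury, ] the temperature decreases to about "
                    "-185 degrees Celsius.")

print(split_edus(clause_example))
print(extract_embedded(embedded_example))
```

Keeping the host unit intact in `extract_embedded` reflects the idea that the embedded unit interrupts, rather than ends, the surrounding EDU.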
Discourse Structure (1)
Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)
• Full hierarchical text structure
• Extended Classic RST (30 relations)
• Semantic and pragmatic relations
• Non-binary trees
Discourse Structure (2)
(37) P.S. The enclosed cards are a thank you gift for reading my
letter about the malaria epidemic in Africa. (38) Help us now in
our fight against malaria (39) by donating today (40) - within an
hour more than 120 children will die needlessly from this deadly
disease. (41) Give generously (42) and do this today, (43) so that
we can help more children (44) before it is too late.
Discourse Structure (3)
Multi-satellite construction (non-binary tree)
[Figure: RST tree over span 1-4; relation labels shown: Motivation, Justify, and Means (within span 2-3)]
1 P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
2 Help us now in our fight against malaria
3 by donating today
4 - within an hour more than 120 children will die needlessly from this deadly disease.
Restriction to binary trees yields implausible analyses
[Figure: two binary alternatives over the same units 1-4, one with an intermediate span 2-4 and one with an intermediate span 1-3; relation labels shown: Motivation, Justify, and Means (within span 2-3)]
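
The contrast between the multi-satellite analysis and its binary alternatives can be made concrete with a small tree structure. The sketch below is an assumption-laden illustration, not the annotation format used in the project: the `Node` class and its fields are hypothetical, and the assignment of Justify to unit 1 and Motivation to unit 4 is one reading of the flattened figure and should be checked against the original slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    span: Tuple[int, int]            # first and last EDU covered by this node
    relation: Optional[str] = None   # relation of a satellite to its nucleus
    children: List["Node"] = field(default_factory=list)

    def max_branching(self) -> int:
        """Largest number of children found anywhere in the tree."""
        if not self.children:
            return 0
        return max(len(self.children),
                   max(child.max_branching() for child in self.children))

    def is_binary(self) -> bool:
        """True if no node has more than two children."""
        return self.max_branching() <= 2


# Span 2-3: unit 3 ("by donating today") attached to unit 2 by Means.
means = Node((2, 3), children=[Node((2, 2)), Node((3, 3), "Means")])

# Multi-satellite analysis: units 1 and 4 both attach directly to span 2-3.
# Which of the two carries Justify and which Motivation is an assumption here.
multi = Node((1, 4), children=[
    Node((1, 1), "Justify"),
    means,
    Node((4, 4), "Motivation"),
])

# One binary alternative: span 2-4 is built first, then unit 1 attaches to it.
binary = Node((1, 4), children=[
    Node((1, 1), "Justify"),
    Node((2, 4), children=[means, Node((4, 4), "Motivation")]),
])

print(multi.is_binary(), binary.is_binary())
```

In the binary version unit 1 has to attach to span 2-4 as a whole, so it is forced to take scope over unit 4 as well, which is the kind of unintended claim the slide calls implausible.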
Genre structure (1)
Move analysis (Upton and Cohen, 2009)
• Moves = functional components in the text
• Each genre has a particular set of move types
• The moves create a linear, non-hierarchical partition of the text
Genre structure (2)
Encyclopedia texts
• Name
• Define
• Describe

Fundraising letters
• Get attention
• Introduce the cause
• Establish credentials of organization
• Solicit response
• Offer incentive
• Reference insert
• Express gratitude
• Conclude with pleasantries
Genre structure (3)
Genre structure (4)
Mapping the move structure onto RST structure
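
Because moves form a linear, non-hierarchical partition of the text, a move annotation can be represented as a sequence of EDU ranges, which is also a convenient form for comparing it with RST spans. The following sketch checks that a move sequence really is such a partition. The tuple layout, the function name, and the EDU ranges in the example are hypothetical; only the move names come from the fundraising-letter inventory.

```python
from typing import List, Tuple

# (move type, first EDU, last EDU), with 1-based inclusive EDU numbers.
Move = Tuple[str, int, int]


def is_linear_partition(moves: List[Move], n_edus: int) -> bool:
    """Check that the moves cover EDUs 1..n_edus in order, without gaps or overlap."""
    expected_start = 1
    for _, first, last in moves:
        if first != expected_start or last < first:
            return False
        expected_start = last + 1
    return expected_start == n_edus + 1


# Hypothetical annotation of a 12-EDU fundraising letter.
letter_moves = [
    ("Get attention", 1, 2),
    ("Introduce the cause", 3, 7),
    ("Solicit response", 8, 10),
    ("Express gratitude", 11, 11),
    ("Conclude with pleasantries", 12, 12),
]

print(is_linear_partition(letter_moves, 12))
```

Each move range can then be compared with the RST spans to see whether a move boundary coincides with the boundary of a subtree.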
Discourse Connectives (1)
Why annotate connectives?
• At least at the intra-sentential level (but probably also across sentences), connectives should be valuable cues to coherence relations.
• Frequencies of connectives may differ between genres and thus provide a cue for genre classification.
• Genre information may help the parser by biasing the disambiguation of multifunctional connectives, e.g., toward a semantic meaning for expository texts and a pragmatic one for persuasive texts.
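
The second point, that connective frequencies may differ between genres, can be explored with a simple count. The sketch below is a toy illustration under clear assumptions: the connective list is a small hand-picked set of Dutch single-word connectives, the tokenization is a plain regular expression, and multi-word connectives are ignored. It is not the annotation procedure used for the corpus.

```python
import re
from collections import Counter

# A small, hand-picked set of Dutch connectives (an assumption for this sketch).
CONNECTIVES = {"maar", "omdat", "dus", "want", "daarom"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def connective_counts(text):
    """Count occurrences of the listed connectives in one text."""
    return Counter(tok for tok in tokenize(text) if tok in CONNECTIVES)


def per_thousand_tokens(texts):
    """Relative frequency of each connective per 1000 tokens over a set of texts."""
    totals = Counter()
    n_tokens = 0
    for text in texts:
        totals += connective_counts(text)
        n_tokens += len(tokenize(text))
    if n_tokens == 0:
        return {}
    return {conn: 1000 * count / n_tokens for conn, count in totals.items()}


sample = "Maar dat is nodig, want het helpt. Maar hoe?"
print(connective_counts(sample))
```

Running `per_thousand_tokens` separately on the texts of each genre gives comparable profiles that could serve as features for genre classification.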
Discourse Connectives (2)
(16) With the help of research much has already been
achieved. (17) But to protect you and others from the
consequences of diabetes (18) more research is
needed. (19) That is why we keep asking for your
support.
Lexical cohesion (1)
• Lexical cohesive items build up graph structures in the text
• For each lexical item, lexical links to items in preceding and following EDUs are identified
[The sun is about five billion years ] EDU5
[After the forming of the sun and the solar system, our star began its long existence as a so-called dwarf star ] EDU6
[In the dwarf phase of its life, the energy that the sun gives off is generated in its core through the fusion of hydrogen into helium.] EDU7
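
The graph structure built up by lexical cohesive items can be sketched with a plain adjacency map. In the example below, each node is an (EDU number, lexical item) pair taken from the sun passage; the specific links are hypothetical and only illustrate the data structure, since the slides do not list the annotated links.

```python
from collections import defaultdict

# Hypothetical cohesive links between lexical items in EDUs 5-7.
LINKS = [
    ((5, "sun"), (6, "sun")),
    ((5, "sun"), (6, "star")),
    ((6, "sun"), (7, "sun")),
    ((6, "dwarf star"), (7, "dwarf phase")),
]


def build_graph(links):
    """Build an undirected graph: each item maps to the items it is linked to."""
    graph = defaultdict(set)
    for item_a, item_b in links:
        graph[item_a].add(item_b)
        graph[item_b].add(item_a)
    return graph


def links_between_edus(links):
    """Count how many lexical links connect each pair of EDUs."""
    counts = defaultdict(int)
    for (edu_a, _), (edu_b, _) in links:
        counts[tuple(sorted((edu_a, edu_b)))] += 1
    return dict(counts)


graph = build_graph(LINKS)
print(links_between_edus(LINKS))
```

Aggregating links per EDU pair, as in `links_between_edus`, is one way to compare cohesive ties with the discourse structure over the same EDUs.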
Annotations (1)
Segmentation
• Detailed manual with rules and examples
• Reliability (checked on 25% of the material): K = 0.98 (fundraising letters and encyclopedia texts)
Coherence analysis (RST)
• Relation definitions as published on the RST website
• Consensus procedure: each final analysis is based on at least two independent first versions and intensive team discussion (Berzlánovich, Redeker, van der Vliet)
• Reliability: K = 0.88 for spans, 0.82 for nuclearity, and 0.57 for labeling
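
The reliability figures above are agreement coefficients. The slides do not say which coefficient K stands for; the sketch below assumes Cohen's kappa for two annotators and applies it to toy segmentation decisions (1 = boundary after this token, 0 = no boundary). The data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy boundary decisions of two annotators over ten candidate positions.
annotator_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

Because kappa corrects for chance agreement, a task with many labels, such as relation labeling, can yield a lower value than span or nuclearity decisions even when raw agreement looks similar.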
Annotations (2)
Genre structure (move analysis)
• Detailed manual
• Final analysis by consensus between two coders (Berzlánovich, Redeker)
• Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Lexical cohesion
• Detailed manual, training of the coders
• Final analysis by consensus between two coders (Berzlánovich with Rensema/Wagenaar)
• Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Coherence, Cohesion, and Genre
Preliminary results on genre-sensitivity of coherence and cohesion (comparing encyclopedia texts and fundraising letters)
Genre difference in coherence
• Presentational relations are much more frequent in persuasive texts than in expository texts.
• Different discourse connectives in expository and persuasive texts.
Genre difference in cohesion
• Systematic semantic relations are more frequent in expository texts than in persuasive texts.
Genre-specific interaction of coherence and cohesion
• Coherence and cohesion are closely aligned in expository texts, but not in persuasive texts.
Future plans
Automatic discourse parsing
- automatic segmentation (a basic program already achieves a precision of 0.72 and a recall of 0.75)
- determine the validity of genre, connectives, coreference, and lexical cohesion relations as cues for the recognition of RST relations (using machine learning)
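
Precision and recall for automatic segmentation can be computed by comparing predicted EDU boundaries with the gold-standard ones. The sketch below assumes that boundaries are represented as sets of token positions and that only exact matches count; the slides do not describe how the reported scores were obtained, and the example positions are invented.

```python
def boundary_scores(gold, predicted):
    """Precision and recall of predicted boundaries against gold boundaries."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented token positions after which a boundary occurs.
gold_boundaries = {3, 7, 12, 18}
predicted_boundaries = {3, 7, 10, 18, 21}

p, r = boundary_scores(gold_boundaries, predicted_boundaries)
print(p, r, f1(p, r))
```

Applied to the reported precision of 0.72 and recall of 0.75, `f1` gives a value of about 0.73; that combined score is derived here and is not stated on the slide.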
Thank you