Building a Discourse-annotated Dutch Text Corpus
Nynke van der Vliet, Ildikó Berzlánovich, Gosse Bouma, Markus Egg, Gisela Redeker
Beyond Semantics, DGfS Workshop, Göttingen, February 23-25, 2011

Overview
• Introduction
• MTO Project
• Corpus design
• Text selection
• Segmentation
• Annotation
• Discourse structure
• Genre structure
• Discourse connectives
• Lexical cohesion
• Preliminary results and future work

Introduction
Modeling Textual Organization (MTO)
Program: Build a Dutch text corpus, annotated for discourse structure, genre structure, lexical cohesion, coreference, and discourse connectives
Project goals:
• Investigate the genre-dependent interaction between discourse structure and lexical cohesion (Project 1, Ildikó Berzlánovich)
• Investigate the mechanisms that establish coherence in text and develop algorithms for discourse parsing (Project 2, Nynke van der Vliet)
http://www.let.rug.nl/mto/

Corpus design
Provide a reliably annotated “gold standard” resource covering a range of genres
Core corpus: 80 texts (length: 190-400 words)
• expository texts:
  20 encyclopedia texts
  20 popular scientific news texts
• persuasive texts:
  20 fundraising letters
  20 advertisements

Text Selection (1)
Preparation: selection of text material, stripping off ‘text-external’ elements
• Exclude pictures and picture captions
• Exclude genre-specific elements that are not related to rhetorical choices

Text Selection (2)
Example

Segmentation (1)
EDU ~ simple sentence
  Each donation is valuable!
EDU ~ finite clause
  You can build with us by donating, ][ but you can also build with us literally.
EDU ~ fragment functioning as complete utterance
  Nice gadgets.
EDU ~ non-restrictive relative clause
  This gap is caused by one of the moons of Saturn, Mimas, ][ which disturbs the rings.

Segmentation (2)
EDU ~ embedded discourse unit
  However during the night, [ which can last for months on Mercury, ] the temperature decreases to about -185 degrees Celsius.
EDU ~ coordinated VP
  At a young age a cataract in her eye was diagnosed ][ and treated.
EDU ~ elliptical clause
  The planet turns around its axis in 58.6 days ][ and around the sun in 88.0 days.

Discourse Structure (1)
Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)
  Full hierarchical text structure
  Extended Classic RST (30 relations)
  Semantic and pragmatic relations
  Non-binary trees

Discourse Structure (2)
(37) P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
(38) Help us now in our fight against malaria
(39) by donating today
(40) - within an hour more than 120 children will die needlessly from this deadly disease.
(41) Give generously
(42) and do this today,
(43) so that we can help more children
(44) before it is too late.

Discourse Structure (3)
Multi-satellite construction (non-binary tree)
[Figure: RST tree over units 1-4 with the relations Motivation, Justify, and Means; units 2-3 form a Means span]
1 P.S. The enclosed cards are a thank you gift for reading my letter about the malaria epidemic in Africa.
2 Help us now in our fight against malaria
3 by donating today
4 - within an hour more than 120 children will die needlessly from this deadly disease.
Restriction to binary trees yields implausible analyses
[Figure: two alternative binary RST trees over the same units 1-4, one with the intermediate span 2-4 and one with the intermediate span 1-3]

Genre structure (1)
Move analysis (Upton and Cohen, 2009)
• Moves = functional components in the text
• Each genre has a particular set of move types
• The moves create a linear, non-hierarchical partition of the text

Genre structure (2)
Encyclopedia texts
  Name
  Define
  Describe
Fundraising letters
  Get attention
  Introduce the cause
  Establish credentials of organization
  Solicit response
  Offer incentive
  Reference insert
  Express gratitude
  Conclude with pleasantries

Genre structure (3)

Genre structure (4)
Mapping the move structure onto RST structure

Discourse Connectives (1)
Why annotate connectives?
• At least at intra-sentential level (but probably also across sentences), connectives should be valuable cues to coherence relations.
• Frequencies of connectives may differ between genres and thus provide a cue for genre classification.
• Genre information may help the parser by biasing the disambiguation of multifunctional connectives, e.g., toward a semantic meaning for expository texts and a pragmatic one for persuasive texts.

Discourse Connectives (2)
(16) With the help of research much has already been achieved.
(17) But to protect you and others from the consequences of diabetes
(18) more research is needed.
(19) That is why we keep asking for your support.

Lexical cohesion (1)
• Lexical cohesive items build up graph structures in the text
• For each lexical item, lexical links to items in preceding and following EDUs are identified

EDU5 [After the forming of the sun and the solar system, our star began its long existence as a so-called dwarf star ]
EDU6 [In the dwarf phase of its life, the energy that the sun gives off is generated in its core through the fusion of hydrogen into helium.]
EDU7 [The sun is about five billion years ]

Annotations (1)
Segmentation
  Detailed manual with rules and examples
  Reliability: 25% of the material, K = 0.98 (fundraising letters and encyclopedia)
Coherence analysis (RST)
  Relation definitions as published on the RST website
  Consensus procedure: each final analysis is based on at least two independent first versions and intensive team discussion (Berzlánovich, Redeker, van der Vliet)
  Reliability: K = 0.88 for the spans, 0.82 for nuclearity and 0.57 for labeling

Annotations (2)
Genre structure (move analysis)
  Detailed manual
  Final analysis by consensus among two coders (Berzlánovich, Redeker)
  Reliability: K will be calculated on 20% of the corpus (4 texts per genre)
Lexical cohesion
  Detailed manual, training of the coders
  Final analysis by consensus among two coders (Berzlánovich with Rensema/Wagenaar)
  Reliability: K will be calculated on 20% of the corpus (4 texts per genre)

Coherence, Cohesion, and Genre
Preliminary results on genre-sensitivity of coherence and cohesion (comparing encyclopedia texts and fundraising letters)
Genre difference in coherence / Genre difference in cohesion
  Presentational relations are much more frequent in persuasive texts than in expository texts.
  Different discourse connectives in expository and persuasive texts.
  Systematic semantic relations are more frequent in expository texts than in persuasive texts.
Genre-specific interaction of coherence and cohesion
  Coherence and cohesion are closely aligned in expository texts, but not in persuasive texts.

Future plans
Automatic discourse parsing
- automatic segmentation (basic program already achieves good precision (0.72) and recall (0.75))
- determine the validity of genre, connectives, coreference and lexical cohesion relations as cues for the recognition of RST relations (using machine learning)

Thank you
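
Note on the reliability figures
The K values on the Annotations slides are agreement coefficients. As a minimal sketch of how such a value can be obtained, assuming K denotes Cohen's kappa for two coders (the slides do not say which variant was used), the Python below computes kappa from two parallel label sequences. The function name cohen_kappa and the toy relation labels are hypothetical illustrations and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each coder's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: relation labels given by two coders to six spans.
coder_1 = ["Motivation", "Means", "Justify", "Means", "Motivation", "Justify"]
coder_2 = ["Motivation", "Means", "Means", "Means", "Motivation", "Justify"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

For the toy data the coders agree on five of six items (observed agreement 5/6) and the chance agreement is 12/36, so kappa works out to 0.75. Correcting for chance is what separates kappa from raw percentage agreement.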