Using semantic roles to improve summaries Diana Trandabăț Faculty of Computer Science University “Alexandru Ioan Cuza” Iasi Iasi, Romania [email protected] Abstract This paper describes preliminary analysis on the influence of the semantic roles in summary generation. The proposed method involves three steps: first, the named entities in the original text are identified using a named entity recognizer; secondly, the sentences are parsed and semantic roles are extracted; thirdly, selection of the sentences containing specific semantic roles for the most relevant entities in text. Although the method is language independent, in order to check its viability, we tested the proposed approach for Romanian summaries. 1 Introduction Text summarization refers to the task of shortening a long text. There are two major directions in text summarisation: the extractive and the abstractive paradigm (Mani, 2001). The first approach in creating summaries (most common) is based on identifying important words in texts by using their frequencies, and determining those sentences that contain a bigger number of important words. These sentences are extracted from the original text, and taken to constitute the summary. In this paradigm, the summarization is performed through sentence extraction: the summary is a subset of the sentences in the original text. An alternative approach is to build a summary consisting of sentences that don’t necessarily have to show up in that specific form in the source text. This requires a certain amount of deeper understanding of the text. This method can also be applied in the case of very large texts, such as a whole novel, where neither the determination of most significant sentences based on occurrences of frequent words, nor building discourse structures could be of help. In these cases, other methods, mainly expanding a collection of predefined flexible summary patterns (based for instance on the genre of the novel, or on some data on the main characters of the novel, a time and place positioning, and a rather shallow sketch of the initiation of the action) could be applied. Our approach to summary building uses the first method, sentence extraction. However, the novelty of our approach consists in basing the extraction of different sentences from the original text on semantic role analysis, an association which is not yet explored at its full potential. The method is language independent, provided that named entity and semantic roles extraction modules are available. The next Section introduces the sentence extraction phase of the summary generation using semantic roles. Section 3 presents the named entity recognition system use to identify entities in the initial text, while Section 4 presents the semantic role labeling procedure. The last section presents preliminary results obtained on 20 summaries, and discusses further development of the system. 2 Generating Summaries Semantic Roles based on The natural language processing community has recently experienced a growth of interest in semantic roles, since they describe WHO did WHAT to WHOM, WHEN, WHERE, WHY, HOW etc. for a given situation, and contribute to the construction of meaning. If for text analysis, semantic roles have gained their way into natural 164 Proceedings of the 13th European Workshop on Natural Language Generation (ENLG), pages 164–169, c Nancy, France, September 2011. 2011 Association for Computational Linguistics language analysis systems (see for instance Lluis et al., 2008; Surdeanu et al., 2003), they are rarely used at their full potential for text generation. Christopherson (1981) was among the first to investigate the usefulness of semantic roles in summaries. More recently, Suanmali et al. (2010) used semantic roles and WordNet (Fellbaum, 1998) to compute the semantic similarity of two sentences in order to decide if the sentences are to be kept or not in the summary. The proposed method is a further step in this direction, combining semantic roles and named entity for sentence extraction. The overall pipeline architecture of the proposed method is presented in Figure 1. Hercules, of all of Zeus’s illegitimate children seemed to be the focus of Hera’s anger. She sent a two-headed serpent to attack him when he was just an infant. Named Entity Recognition systems to be adapted to new domains and perform very well for coarse-grained classification, but require large training data. Thus, as a preprocessing module for our summary generation system, we used a Named Entity Recognition component for Romanian, based on linguistic grammar-based techniques and a set of resources. The NER system is based on two modules, the named entity identification module and the named entity classification module. After the named entity (NE) candidates are marked for each input text, each candidate is classified into one of the considered categories, such as Person, Organization, Place, Country, etc. The major drawback of the sentence extraction Semantic role labeling Selecting relevant sentences Hercules seemed to be the focus of Hera’s anger Figure 1. Overall architecture of the proposed summary generation method based on semantic roles The method presented in this paper works in three steps: first, the original text is parsed for named entities; secondly, semantic roles are extracted from the sentences containing named entities; thirdly, sentences are selected to be kept in the summary, based on the semantic role the named entity has. Each module is detailed in the Sections below. 2.1 Identifying entities In order to identify the semantic role a specific entity express, the entity must be first identified in the text. This is the task of named entity recognition (NER). NER systems typically use linguistic grammar-based techniques or statistical models (an overview is presented in (Nadeau and Satoshi Sekine. 2007)). Hand-crafted grammarbased systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Besides, they are hard to adapt to new domains. Statistical NER systems typically require a large amount of manually annotated training data. Machine learning techniques, such as the ones discussed in (Scurtu et al., 2009) or (Nadeanu, 2007), allow 165 approach for summaries generation is that it ignores the referential expressions that could occur in the initial text and should have been kept in the summary. Thus, due to the elimination of previous sentences, their antecedents may not be present anymore, resulting in incomprehensive readings. For example, consider the following text to be summarized: Hercules, of all of Zeus’s illegitimate children seemed to be the focus of Hera’s anger. She sent a two-headed serpent to attack him when he was just an infant. The summary of this very short fragment, using the sentence elimination method, could (hypothetically) be: She sent a attack him. two-headed serpent to which is really incomprehensible if no explanation is provided of who is “she” or “him”. One way to increase the coherence of such summaries is to derive first the discourse structure of the text and to guide the selection of the sentences to be included into the summary by a score that considers both the relevance of the sentence in a discourse tree and the coherence of the text1, as given by solving anaphoric references. For the summary example above, solving anaphoric references means identifying “she” as Hera and “him” as Hercules. Thus, the provided summary becomes readable: Hera sent a two-headed serpent to attack Hercules. Therefore, after identifying named entities and their types (person, organization, place, etc.), a simple anaphora resolution method, based on a set of reference rules, is applied to our input text, in order to link all entities to their referees. The anaphoric system we used is a basic rulebased one, focusing on named entity anaphoric relations. Thus, we developed a rule-based system that performs the following actions: identifies a subset of a named entity with the full named entity, if it appears as such in the same text. For instance, Caesar is identified with Julius Caesar if both entities appear in the same text. Similarly, the President of Romania and the President are considered anaphoric relations of the same entity, if they appear in a narrow word window in the text. solves acronyms using a gazetteer we have initially built over the Internet, and which is continuously growing in size. For instance, United States of America and USA are co-references. searches for different addressing modalities and matches the ones that are similar. For instance, John Smith is co-referenced with Mr. Smith, and Mary and John Smith is co-referenced with The Smiths, or The Smith Family. solve pronominal anaphora in a simplistic way. Thus, if a pronoun (i.e. she, he, him, his etc.) is found in the text, and in the preceding sentence a named entity with the entity type person is found, then we create an anaphoric link between the pronoun and its antecedent. A similar rule exists for companies, where the pronoun it may be linked to the Insurance Company, for instance. Lists stating these correspondences are presently used and, although the rules are limited so far, our tests show that the overall accuracy of the summarization system benefits from this simple anaphoric resolution system for named entities. The next step is the identification of the semantic roles that each named entity plays. 1 A detailed analysis of the coherence of different texts is presented in (Cristea and Iftene, 2011). 166 2.2 Identifying semantic roles Fillmore in (Fillmore, 1968) defined six semantic roles: Agent, Instrument, Dative, Factive, Object and Location, also called deep cases. His later work on lexical semantics led to the conviction that a small fixed set of deep case roles was not sufficient to characterize the combinatorial properties of lexical items, therefore he added Experiencer, Comitative, Location, Path, Source, Goal and Temporal, and then other cases. This ultimately led to the theory of Frame Semantics (Fillmore, 1982), which later evolved into the FrameNet project2. In the last decades, hand-tagged corpora that encode such information for the English language were developed (VerbNet 3 (Levin and Rappaport, 2005), FrameNet (Baker et al., 1998) and PropBank 4 (Palmer et al., 2005)). For other languages, such as German, Spanish, and Japanese, semantic roles resources are being developed. For Romanian, Trandabăț and Husarciuc (2008) have started to automatically build such a resource. For role semantics to become relevant for language technology, robust and accurate methods for automatic semantic role assignment are needed. With the SensEval-3 competition5 and the CONLL Shared Tasks 6 , Automatic Labeling of Semantic Roles, identifying frame elements within a sentence and tag them with appropriate semantic roles given a sentence (Lluis et al., 2008), has become increasingly present among researchers worldwide. In recent years, a number of studies, such as (Chen and Rambow, 2003) and (Gildea and Jurafsky, 2002), has investigated this task on the FrameNet corpus. Role assignment has generally been modeled as a classification task. While using different statistical frameworks, most studies have largely converged on a common set of features to base their decisions on, namely syntactic information (path from predicate to constituent, phrasal type of constituent) and lexical 2 FrameNet web page: http://framenet.icsi.berkeley.edu/ VerbNet web page: http://verbs.colorado.edu/~mpalmer/projects/verbnet/do wnloads.html 4 PropBank web page: http://verbs.colorado.edu/~mpalmer/projects/ace.html 5 SemEval web address: http://www.senseval.org/ 6 ConLL web address: http://ifarm.nl/signll/conll/ 3 information (head word of the constituent, predicate). Semantic roles are classified in terms of how central they are to a particular verb. Arguments (or core semantic roles) instantiate required roles, which are in a close relation to the verb whose sense they complete, and adjuncts (or non-core semantic roles), which are more general roles that can apply to any verb. Adjuncts represent circumstantial objects and can be of the following types: directions, locatives, temporal, manner, extent, reciprocals, secondary predication, purpose, cause, discourse, adverbials, modals, negation. For instance, temporal and locative adjuncts can be found in both sentences below: John broke the window school]LOC [yesterday]TMP. John visited his kids school]LOC [yesterday]TMP. [at the [at the An important drawback in this domain is that most researches focus on text analysis, and text generation applications using semantic roles are not so well developed. In this context, using the semantic role labeling system presented in (Trandabat, 2010), we annotated the sentences containing entities from the input text with the semantic roles these entities play, and passed to the third step. The semantic role system we used for Romanian was obtained by training 12 machine translation algorithms (see Trandabat, 2010) from the Weka framework (Hall et al., 2009) with different feature sets. After running all the classifiers for different modules (the module that separately identifies the semantic roles and classify them, or the module that jointly identifies semantic roles and classify them), their performance is compared, and the module that obtains the highest performance is considered the best configuration. The models for this best configuration are saved, and the best path is written to a configuration file. This configuration can then be used at a later time to annotate new texts with the developed SRL system. The 10 fold cross-validation results of all classifiers are also saved since they provide a confusion matrix that can be used to see which classes were correctly predicted by different classifiers. The output of the system presented in (Trandabat, 2010) is a Semantic Role Labeling 167 System, a sequence of trained models which can be used to annotate new texts.. 2.3 Selecting relevant sentences The third module of the summary generation system implies selecting, among the list of sentences from which summaries can be generated, the ones in which the entity has core semantic roles. The proposed method involves four main steps: Identifying the main character Extract sentences containing the main character Keep sentences with core roles for the specific character. Simplify sentences There are two possible ways of identifying the main character: the easiest one is when the central character of the text is a-priori given as argument (in case a character-oriented summary is requested). Otherwise, the main character is considered to be the named entity having the higher number of occurrences in the text (including references, see Section 2.1). For the example below, the main character is considered to be Alcmene, with 9 occurrences. Hercules was the son of Zeus and Alcmene. Alcmene's husband Amphiteryon was out avenging her brother's death at the hands of pirates. Zeus, disguised as Amphiteryon, came to her and told her stories of how he killed the pirates to avenge her brother's death. That night Zeus went to bed with Alcmene and impregnated her. The next day the real Amphiteryon returned with his stories of avenging the pirates, and he could not understand why his wife was irritated with him and seemed disinterested in the stories. It was then that Amphiteryon consulted a blind seer and became aware of what Zeus did. For the extraction of the sentences containing the main character, both the entity as if, and its references, are considered. For the example above, the last sentence is kept out, as not containing the character or a reference to it. The distinction between the situations when the main character has core and non-core semantic roles (or adjuncts vs. arguments) represents the backbone of our system. Thus, when the entity considered for the summary has a semantic role that is mandatory for a sentence meaning (it is a core semantic role, such as an Agent), the sentence containing it is kept. In contrast, if a sentence contains the entity in a non-core position (expressing temporal, spatial, modal, etc. circumstances), then its meaning is not essential for the summary, and the sentence containing the entity will be discarded from the summary. As an example, in the sentence below, Alcmene (refered as his wife) is only part of a non-core semantic role (Content for the verb understand), so this sentence will be discarded and not kept for the final summary: The next day the real Amphiteryon returned with his stories of avenging the pirates, and [he]Cognizer [could not understand]TARGET [why his wife was irritated with him and seemed disinterested in the stories]Content. The last step involved a simplification of the sentences. This simplification is based on a set of heuristics using semantic roles. Thus, in a sentence, not only one verb requiring semantic roles may appear. In order to simplify these complex sentences, we only keep the predicate7 for which the entity is a semantic role. To give an example, consider the sentence below: [Alcmene's]Partner1 [husband]TARGET [Amphiteryon]Partner2 was out avenging her brother's death at the hands of pirates. We evaluated the method on 20 summaries extracted from the Legend of the Olympus novel. In a first batch, 5 volunteers received full version of the 20 texts, and were asked to generate short summaries (about 10% of the size if the full text). A second batch of 5 volunteers received the initial text marked with semantic roles, and were instructed to create short summaries (the same 10%) using the semantic role information. Although the evaluation was only intended to give a feedback on the method, and a proper evaluation is still to be developed, the volunteers reported that knowing the semantic roles of entities and guiding the summary on it makes the summary generation task easier. Acknowledgments The research presented in this paper was funded by the Sectoral Operational Programme for Human Resources Development through the project "Development of the innovation capacity and increasing of the research impact through postdoctoral programs" POSDRU/89/1.5/S/49944. References Baker G., Collin F., Fillmore, Charles J., and Lowe, John B. 1998. The Berkeley FrameNet project. In Proceedings of the COLING-ACL, Montreal, Canada. 1998 In this case, two predicates are annotated with semantic roles: husband as a relationship predicate (according to FrameNet), and avenging as an activity predicate. Simplifying this sentence means keeping only the semantic roles for the first predicate (husband), for which the main character plays a semantic role, i.e. keeping only “Alcmene’s husband, Amphiteryon was out”. Chen J. and O. Rambow. 2003. Use of deep linguistic features for the recognition and labeling of semantic arguments. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003. 3 Cristea and Iftene. 2011. If you want your talk be fluent, think lazy! Grounding coherence properties of discourse. Invited talk at SPED-2011, University of Brasov, May Discussion and Further Work In this paper, we presented a summary generation system based on semantic roles. The main components of the system are dedicated to identifying named entities, marking semantic roles, and selecting the sentences of the text to be kept in the summary. 7 In general, predicates are associated with verbs. However, semantic roles theories have recently accepted the existence of predicate-like nouns and adjectives, which can gather around them semantic roles, just like verbs do. 168 Christopherson, Steven L . 1981. Effects of knowledge of semantic roles on summarizing written prose. Contemporary Educational Psychology, Vol 6(1), Jan 1981, 59-65 Fillmore Charles J. 1968. The case for case. In Bach and Harms, editors, Universals in Linguistic Theory, pages 1-88. Holt, Rinehart, and Winston, New York, 1968. Fillmore Charles J. 1982. Frame semantics, în Linguistics in the Morning Calm, Hanshin Publishing, Seoul , 1982, 111-137. Gildea Daniel and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288, 2002 Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1 Levin B. and M. Rappaport Hovav. 2005. Argument Realization. Research Surveys in Linguistics Series. Cambridge University Press, Cambridge, UK, 2005. Mani Inderjeet, automatic summarization, John Benjamins Pub Co; ISBN: 1588110591 (hardcover), 1588110605 (paperback), 2001. Marquez Lluis, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: An introduction to the Special Issue. Computational Linguistics, 34(2):145-159, 2008. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification, Linguisticae Investigationes 30, no. 1, 3{26, Publisher: John Benjamin’s Publishing Company David Nadeau. 2007. Semi-supervised named entity recognition: Learning to recognize 100 entity types with little supervision, PhD Thesis. Palmer Martha, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71106, 2005. Scurtu V., Stepanov E., Mehdad, Y. 2009. Italian named entity recognizer participation in NER task@evalita 09, 2009. Suanmali, L., Salim, N. and Binwahlan, M. S.. 2010. SRL-GSM : A Hybrid Approach based on Semantic Role Labeling and General Statistic Method for Text Summarization. Journal of Applied Sciences 10(3) : 166-173 Trandabăţ D. 2010. Natural Language Processing Using Semantic Frames, PhD Thesis, University Al. I. Cuza Iasi, Romania Trandabat Diana and Maria Husarciuc. 2008. Romanian semantic role resource. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, may 2008. 169
© Copyright 2026 Paperzz