3rd National Natural Language Processing Symposium - Building Language Tools and Resources

SIMTEXT: Text Simplification of Medical Literature

Jerwin Jan S. Damay [email protected]
Gerard Jaime D. Lojico [email protected]
Dex B. Tarantan [email protected]
Kimberly Amanda L. Lu [email protected]
Ethel C. Ong [email protected]

College of Computer Studies, De La Salle University, 2401 Taft Ave., Manila, Philippines 1004

Abstract

Difficulty in comprehending written text should not hinder the general population from accessing medical literature. Text simplification is the process of transforming complex sentences into a set of equivalent simpler sentences, with the goal of making the resulting text easier to read for some target group. This paper presents SimText, a text simplification system that accepts a document at a specified reading level and converts it to a target reading level using natural language processing techniques. The paper starts with an overview of the current state of text simplification. This is followed by a presentation of a text simplification algorithm and a discussion of the different processes involved, as adopted by SimText. The third part of the paper discusses lexical simplification, a component of text simplification, in detail.

Keywords. Text Simplification, Natural Language Processing, Syntactic Simplification, Lexical Simplification, English Grammar

1. INTRODUCTION

Text simplification is the process of transforming complex sentences into a set of equivalent simpler sentences while preserving the original meaning. The goal is to make the resulting text easier for human readers to comprehend or for other programs to process.

Simplified text can also be used by other software applications. The quality of the translations generated by machine translation systems, for example, can be improved by simpler sentential structures that reduce ambiguity [1]. Information extraction and retrieval in information-seeking applications is likewise easier when complicated sentences are made simpler and easier to understand [7].

Text simplification systems have a wide range of applications. The Practical Simplification of English Text (PSET) system [2] simplifies newspaper stories so that they may be better understood by aphasic people, who have difficulty comprehending text with long sentences, infrequent words, and complicated grammatical constructs, including embedded clauses and the passive voice. Inui et al. [3] are developing a reading assistance system that performs syntactic and lexical paraphrasing of text to make it more comprehensible for congenitally deaf junior high school students.

Text simplification systems can also generate texts of a given target linguistic complexity to aid adults learning English, non-native English speakers surfing a predominantly English Internet, and users of limited-channel devices that display text in short sentences to fit their small screens [8].

Maintenance manuals used by aircraft companies follow a Simplified English to make instructions clearer and simpler to follow. Printed materials concerning health and safety would also benefit from a simplified English that increases their readability. In a separate study conducted by health specialists [6], the average reading level of instruction manuals on health and safety was found to be 10th grade, too difficult for the 80 percent of adult readers in the U.S. whose average reading level is that of 7th grade.

Dubay [6] cited that the two important elements of communication are the reading skills of the audience and the readability of the text. The advent of the Internet has made medical literature easily accessible to the general population.
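Grade-level figures such as those above are produced by readability formulas. The paper does not say which formula underlies the levels it cites, so as a purely illustrative sketch (not part of SimText), the widely used Flesch-Kincaid grade level can be computed as below; the syllable counter is a deliberately crude vowel-group heuristic.

```python
# Illustrative readability estimate (hypothetical helper, not SimText code).
# Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59

def count_syllables(word: str) -> int:
    """Crude estimate: count groups of consecutive vowels."""
    vowels = "aeiouy"
    groups, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(1, groups)

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a plain-text passage."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    syllables = sum(count_syllables(w.strip(".,;:!?")) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```

Longer sentences and more syllables per word push the estimated grade up, which is exactly what syntactic and lexical simplification aim to reverse.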
But medical literature, just like other technical documents, is fairly complex when evaluated against readability metrics [1]. Once the readability level of a text exceeds the comprehension level of its readers, they stop reading. This led to the development of SimText, a text simplification system that simplifies technical documents to make them easier for college students to read and comprehend.

2. TEXT SIMPLIFICATION

There are two approaches to text simplification, namely lexical simplification and syntactic simplification. Lexical simplification involves replacing difficult words with their more frequently used synonyms or paraphrasing them with their dictionary definition.

Certain sentence constructs, such as compound and complex sentences, pose difficulties for readers. Syntactic simplification involves restructuring these complex sentences: separating a conjoined sentence into two or more shorter sentences, identifying embedded clauses that can be extracted and converted into stand-alone simpler sentences, changing the passive voice to the active voice, and resolving and replacing pronouns.

3. SIMTEXT

The architecture of SimText is shown in Figure 1. This architecture is based on the text simplification architecture proposed by Siddharthan [4]. It is composed of three major modules, namely Analysis, Syntactic Simplification, and Lexical Simplification. The following sections discuss each of these modules in detail.

Figure 1. Architecture of SimText

3.1 Analysis Module

The simplification process involves two major subtasks, namely sentence structure analysis and simplified text generation. Sentence structure analysis involves identifying the syntactic complexity of a given text and determining the existence of components that can be extracted or simplified.

The text simplification process begins when an input text document is fed to SimText. The Analysis module performs various tasks on the text: sentence boundary detection, text segmentation, part-of-speech tagging, noun chunking, grammatical-function determination, clause and appositive identification and attachment, third-person pronoun resolution, and difficulty tagging. The goal is to derive a structural representation of each sentence in the text and to determine which sentence structures can be simplified.

The output of the Analysis module has the following specifications [8]:

a. The text has been segmented into sentences.
b. Words have their corresponding part-of-speech tags.
c. Elementary noun phrases are marked up and annotated with grammatical function information.
d. Boundaries and attachment have been marked up for the clauses and phrases to be simplified.
e. Pronouns co-refer to their antecedents.
f. Difficult words are tagged.

Note that the sixth item has been added to the original five listed by Siddharthan. This sixth specification is needed by the Lexical Simplification module in order to identify the words that are to be lexically simplified, whether by synonym substitution or by paraphrasing with a dictionary definition.
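As an illustration of specifications (a) through (f), the analyzed form of a single sentence might look like the sketch below. The field names, tag values, and example sentence are hypothetical; the paper does not describe SimText's internal representation at this level of detail.

```python
# Hypothetical shape of the Analysis module's output for one sentence,
# mirroring specifications (a)-(f). Illustrative only, not SimText internals.
analyzed_sentence = {
    # (a) the text has been segmented into sentences
    "text": "John missed the dose because he overlooked the symptoms.",
    # (b) words carry part-of-speech tags (Penn Treebank style)
    "pos_tags": [("John", "NNP"), ("missed", "VBD"), ("the", "DT"),
                 ("dose", "NN"), ("because", "IN"), ("he", "PRP"),
                 ("overlooked", "VBD"), ("the", "DT"), ("symptoms", "NNS")],
    # (c) elementary noun phrases annotated with grammatical function
    "noun_chunks": [{"span": "John", "function": "subject"},
                    {"span": "the dose", "function": "object"},
                    {"span": "the symptoms", "function": "object"}],
    # (d) clause boundaries and attachment marked for simplification
    "clauses": [{"span": "because he overlooked the symptoms",
                 "type": "subordinate",
                 "attaches_to": "John missed the dose"}],
    # (e) pronouns resolved to their antecedents
    "pronouns": [{"pronoun": "he", "antecedent": "John"}],
    # (f) difficult words tagged via the lexicon's difficulty entries
    "difficult_words": ["overlooked"],
}
```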
3.2 Syntactic Simplification Module

After being analyzed, the text document undergoes Transformation and Regeneration in the Syntactic Simplification module. A document will typically undergo a series of transformations and regenerations until all sentences in the document are deemed syntactically simplified, before being passed to the next module.

Utilizing a set of rules from the system's knowledge source, the Transformation phase recursively applies appropriate rules to a sentence until no further simplification is possible. The stopping condition is dictated by the set of rules available from the knowledge source, as well as by the user model, which represents a computational model of the reading level of the target readers. The knowledge sources of SimText are further discussed in Section 4 of this document.

A transformed sentence is then forwarded to the Regeneration phase. This phase takes into consideration inter-sentential discourse, which is necessary in order to preserve the cohesion and meaning of the original text [8]; otherwise, the simplification process will not be useful. Regeneration may involve the following activities: cue word selection, sentence order determination, referring expression generation, determiner selection, and anaphoric link preservation.

3.3 Lexical Simplification Module

Once the document is syntactically simplified, it undergoes Paraphrasing and Synonym Substitution in the Lexical Simplification module. This involves paraphrasing difficult words with their dictionary definition, or replacing them with their synonym counterparts. Note that only words are modified in this module; the sentence structures generated by the Syntactic Simplification module are preserved.

Paraphrasing identifies difficult words based on the contents of the lexicon. If a word is found to be difficult, a dictionary lookup in the lexicon occurs and the word is paraphrased with its equivalent dictionary definition. Tagging a word as difficult is subjective in SimText: a word is considered difficult if it has a positive value for its difficulty entry in the lexicon.

Synonym substitution, on the other hand, replaces difficult words with their synonym counterparts. This involves a thesaurus lookup for the synonyms of words in the lexicon.

4. KNOWLEDGE SOURCES

The level of syntactic and lexical simplification that SimText hopes to achieve depends on the syntactic and lexical information stored in the knowledge sources used by the system. There are three knowledge sources, namely the Transformation Rules, the Lexicon, and the User Model.

4.1 Transformation Rules

Transformation Rules are used by the Syntactic Simplification module to convert complex sentence structures into simpler ones. A good formalism for representing transformation rules facilitates their manual refinement and maintenance [3]. Transformation rules are represented in SimText as pattern-effect pairs in XML: sentence patterns identify sentences in certain complex configurations, and rule effects are the new sentence formats to be applied to form simpler configurations.

In addition, by changing the transformation rules, the user model of SimText can also be changed. This feature makes SimText extensible, able to cater to varying user types and domains. Currently, the static user model of SimText represents a computational model of the English proficiency level of college students.

4.2 Lexicon

Lexical simplification involves replacing difficult words with their more commonly used synonyms or paraphrasing them with a definition of the word. Thus, the system must maintain a thesaurus and a dictionary for synonym replacement and word definition, respectively. The dictionary is also employed by the Analysis module to tag the part of speech of each word, annotate it with its grammatical function, and tag it as difficult. The lexicon can be updated either by replacing the entire lexicon or by importing new entries.
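A minimal sketch of such a lexicon is shown below, using an in-memory SQLite table as a stand-in for the system's MySQL database. The column names are assumptions, not the actual SimText schema; each word carries at most one synonym, one dictionary definition, and a difficulty value, with a positive value marking the word as difficult.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL-backed lexicon (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lexicon (
        word       TEXT PRIMARY KEY,
        pos        TEXT,     -- part-of-speech tag used by the Analysis module
        synonym    TEXT,     -- single synonym entry (may be NULL)
        definition TEXT,     -- single dictionary definition entry
        difficulty INTEGER   -- positive value => word is tagged difficult
    )
""")
conn.execute("INSERT INTO lexicon VALUES (?, ?, ?, ?, ?)",
             ("asthma", "NN", None,
              "a respiratory disorder characterized by wheezing", 1))

def is_difficult(word: str) -> bool:
    """A word is difficult iff its difficulty entry is positive."""
    row = conn.execute("SELECT difficulty FROM lexicon WHERE word = ?",
                       (word,)).fetchone()
    return bool(row) and row[0] > 0

def definition_of(word: str):
    """Dictionary lookup used when a difficult word is to be paraphrased."""
    row = conn.execute("SELECT definition FROM lexicon WHERE word = ?",
                       (word,)).fetchone()
    return row[0] if row else None
```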
5. LEXICAL SIMPLIFICATION MODULE

The Lexical Simplification module is not part of Siddharthan's original work; however, he proposed a revised architecture that includes it.

A subtask of the simplification process involves identifying whether a given text is difficult (in terms of readability or comprehensibility) for the target reader. Williams [5] defines readability as directly related to the reader's performance on the reading task, i.e., reading speed, ability to answer comprehension questions, and ability to recall content. According to Dubay [6], readability is what makes some texts easier to read than others.

One factor contributing to the difficulty of a text is the difficulty of the words used in it. In SimText, a word is considered difficult only if its difficulty tag in the lexicon has been set. This tag is used by the Analysis module to identify difficult words and mark them accordingly.

A word that has been tagged difficult can be simplified in two ways: paraphrasing and synonym substitution. In paraphrasing, a difficult word is simplified by inserting its dictionary definition into the text. For example, given the sentence below:

Jan overlooked one important detail.

overlooked has been tagged difficult. After paraphrasing, the new text will be:

Jan did not notice one important detail.

Notice that the dictionary definition has been inserted into the text by paraphrasing overlooked as did not notice.

There are certain cases, however, when replacing a word directly with its dictionary definition may not be an ideal option. A solution is to retain the original word and insert the definition as an embedded clause. For example, given the sentence below:

Anti-inflammatory medications are now the single most effective therapy for adults with asthma.

asthma has been tagged difficult. After paraphrasing, the new text will be:

Anti-inflammatory medications are now the single most effective therapy for adults with asthma, a respiratory disorder characterized by wheezing.

However, this paraphrasing method produces a sentence that lacks the cohesion we aim for; a possible solution to this issue is discussed below.

In synonym substitution, difficult words are replaced by their simpler and more common counterparts. The synonyms are determined through a thesaurus lookup. The thesaurus is part of the lexicon and contains administrator-defined synonyms, on which SimText depends to produce a more readable text. A word may have only one synonym entry and one dictionary definition entry, and SimText assumes that these entries are valid. For example, given the sentence below:

In a survey, 29.5% of pregnant smokers had experienced problems associated with a previous pregnancy, including cot death.

cot death has been tagged difficult. After synonym substitution, the new text will be:

In a survey, 29.5% of pregnant smokers had experienced problems associated with a previous pregnancy, including infant death.

Notice that cot death was substituted with its synonym counterpart infant death.

The Lexical Simplification module is implemented in Microsoft C#.NET. The lexical database containing the lexicon has been designed in MySQL.

A number of issues arose after preliminary testing of this module. The first involves deciding which of the lexical simplification tasks to perform first. The proponents decided to give priority to synonym substitution: the synonym of a word is looked up first, and in the event that a synonym is not available from the thesaurus, the dictionary definition is used to paraphrase the sentence in which the word occurs.
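The priority scheme just described (synonym first, dictionary definition as fallback) can be sketched as follows. The dictionary-based lexicon here is an assumed stand-in for the system's MySQL-backed lexicon, and the plain string replacement is a simplification of the real pipeline, which operates on the Analysis module's tagged output.

```python
# Sketch of the lexical simplification priority: synonym substitution first,
# paraphrasing with the dictionary definition as fallback. Lexicon contents
# are taken from the paper's examples; the data structure itself is assumed.
lexicon = {
    "cot death": {"synonym": "infant death",
                  "definition": "the sudden death of an infant"},
    "overlooked": {"synonym": None,
                   "definition": "did not notice"},
}

def simplify_word(word: str) -> str:
    entry = lexicon.get(word)
    if entry is None:
        return word                 # not tagged difficult: leave unchanged
    if entry["synonym"]:            # priority 1: synonym substitution
        return entry["synonym"]
    return entry["definition"]      # priority 2: paraphrase with definition

def lexically_simplify(sentence: str) -> str:
    """Replace every difficult word in the sentence, preserving its structure."""
    for difficult in lexicon:
        if difficult in sentence:
            sentence = sentence.replace(difficult, simplify_word(difficult))
    return sentence
```

For instance, `lexically_simplify("Jan overlooked one important detail.")` reproduces the paraphrasing example above, since overlooked has no synonym entry and falls back to its definition.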
A second concern is that lexical simplification, specifically paraphrasing, generates very long sentences, as shown in the example below, where the definitions of the biomedical terms agonists and asthma were inserted into the sentence:

Short-acting inhaled beta-2 agonists, drugs that can combine with a receptor on a cell to produce a physiological reaction, are most effective for speedy relief of asthma, a respiratory disorder characterized by wheezing, symptoms.

A possible solution to this problem is to modify the system architecture of SimText. If the sentences generated by the Lexical Simplification module are long and possibly in need of syntactic simplification, then lexical simplification can be performed first, before the text is syntactically simplified. This would ensure that all sentences, including the long ones generated by the Lexical Simplification module, pass through the Syntactic Simplification module.

6. CONCLUSION

The field of natural language processing (NLP) in our country is still very young, more so with studies pertaining to text generation, text summarization, and text simplification.

A text simplification process identifies components of a sentence that may be separated out and transforms each of these into free-standing simpler sentences. Some nuances of meaning from the original text may be lost in the simplification process, since sentence-level syntactic restructuring can alter the meaning of a sentence. Thus, simplification is not appropriate for texts, such as legal documents, where it is important not to lose any nuance. But there are other areas of natural language processing where such simplification would be of great use [1].

The SimText system is currently in the implementation phase. This paper reported on the current progress of the SimText system in simplifying technical documents for college students, and presented the Lexical Simplification module and the issues that arose during development. Further work involves implementing the Analysis and Syntactic Simplification modules, and addressing the research issues that must be resolved in order to realize this application: readability assessment, paraphrase representation, specialized transformation rules, etc.

7. REFERENCES

[1] Chandrasekar, R., Doran, C., Srinivas, B. (1996). Motivations and Methods for Text Simplification. In Proceedings of COLING 1996, Copenhagen.

[2] Devlin, S., Tait, J., Canning, Y., Carroll, J., Minnen, G., Pearce, D. (2000). The Application of Assistive Technology in Facilitating the Comprehension of Newspaper Text by Aphasic People. In Buhler and Knops (eds.), Assistive Technology on the Threshold of the New Millennium, Assistive Technology Research Series, Volume 6. IOS Press, The Netherlands.

[3] Inui, K., Fujita, A., Takahashi, T., Iida, R. (2003). Text Simplification for Reading Assistance: A Project Note. Available online: http://acl.ldc.upenn.edu/acl2003/iwp/pdf/InuiFujita.pdf.

[4] Siddharthan, A. (2004). Syntactic Simplification and Text Cohesion. Available online: http://www.cs.columbia.edu/nlp/papers/2004/siddharthan_04.pdf.

[5] Williams, S., Reiter, E., Osman, L. (2003). Experiments with Discourse-Level Choices and Readability. In Proceedings of the 9th European Workshop on Natural Language Generation, Budapest.

[6] Dubay, W. (2004). The Principles of Readability. Impact Information, CA. August 2004.

[7] Klebanov. (2004). Text Simplification for Information-Seeking Applications.

[8] Siddharthan, A. (2002). An Architecture for a Text Simplification System. In Proceedings of the Language Engineering Conference 2002 (LEC 2002).