3rd National Natural Language Processing Symposium - Building Language Tools and Resources
SIMTEXT
Text Simplification of Medical Literature
Jerwin Jan S. Damay
[email protected]
Gerard Jaime D. Lojico
[email protected]
Dex B. Tarantan
[email protected]
Kimberly Amanda L. Lu
[email protected]
Ethel C. Ong
[email protected]
College of Computer Studies, De La Salle University
2401 Taft Ave. Manila Philippines 1004
Abstract

Difficulty in comprehending written text should not hinder the general population from accessing medical literature. Text simplification is the process of transforming complex sentences into a set of equivalent simpler sentences with the goal of making the resulting text easier to read by some target group. This paper presents SimText, a text simplification system that accepts a document at a specified reading level and converts it to a target reading level using natural language processing techniques. The paper starts with an overview of the current state of text simplification. This is followed by a presentation of a text simplification algorithm and a discussion of the different processes involved, as adopted by SimText. The third part of the paper discusses lexical simplification, a component of text simplification, in detail.

Keywords. Text Simplification, Natural Language Processing, Syntactic Simplification, Lexical Simplification, English Grammar

1. INTRODUCTION

Text simplification is the process of transforming complex sentences into a set of equivalent simpler sentences while preserving the original meaning. The goal is to make the resulting text easier for human readers to comprehend or for other programs to process.

Simplified text can also be used by other software applications. The quality of the translation generated by machine translation systems, for example, can be improved by simpler sentential structures that reduce ambiguity [1]. Information extraction and retrieval are likewise easier for information-seeking applications when complicated sentences are made simpler and easier to understand [7].

Text simplification systems have a wide range of applications. The Practical Simplification of English Text (PSET) system [2] simplifies newspaper stories so that they may be better understood by aphasic people, who have difficulty comprehending text with long sentences, infrequent words, and complicated grammatical constructs, including embedded clauses and passive voice. Inui et al. [3] are developing a reading assistance system that performs syntactic and lexical paraphrasing of text to make it more comprehensible for congenitally deaf junior high school students.

Text simplification systems can also generate texts of a certain target linguistic complexity to aid adults learning English, non-native English speakers surfing a predominantly English Internet, and users of limited-channel devices that display text in short sentences to fit their small screens [8]. Maintenance manuals used by aircraft companies follow a Simplified English to make instructions clearer and simpler to follow. Printed materials concerning health and safety would also benefit from a simplified English that increases their readability. In a study conducted by health specialists [6], the average reading level of instruction manuals on health and safety was found to be the 10th grade, too difficult for the 80 percent of adult readers in the U.S. whose average reading level is that of the 7th grade.
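The grade-level figures above come from standard readability formulas. As a hedged illustration (not part of SimText itself), the widely used Flesch-Kincaid grade level can be estimated from sentence and word counts; the syllable counter below is a rough heuristic, not a dictionary-backed one:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels,
    # discounting a common silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid grade = 0.39*(words/sentences)
    #                      + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```

Longer average sentences and more syllables per word push the estimated grade up, which is how health manuals written in technical language end up scored at the 10th-grade level.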
Dubay [6] cites two important elements of communication: the reading skills of the audience and the readability of the text. The advent of the Internet has made medical literature easily accessible to the general population. But medical literature, just like other technical documents, is fairly complex when evaluated against readability metrics [1]. Once the readability level of a text exceeds the comprehension level of its readers, they stop reading. This led to the development of SimText, a text simplification system that simplifies technical documents so that college students can read and comprehend them more easily.
2. TEXT SIMPLIFICATION
There are two approaches to text simplification, namely lexical simplification and syntactic simplification. Lexical simplification involves replacing difficult words with more frequently used synonyms, or paraphrasing them with their dictionary definitions.
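As a minimal sketch of this idea (the word list and synonyms here are invented for illustration, not SimText's actual lexicon), lexical simplification can be modeled as a lookup-and-replace over tokens that leaves sentence structure alone:

```python
import re

# Hypothetical lexicon: difficult word -> more frequently used synonym.
SYNONYMS = {
    "utilize": "use",
    "commence": "begin",
    "physician": "doctor",
}

def lexical_simplify(sentence: str) -> str:
    # Replace each difficult word with its simpler synonym,
    # leaving the rest of the sentence untouched.
    def repl(match):
        word = match.group(0)
        simple = SYNONYMS.get(word.lower())
        if simple is None:
            return word
        # Preserve the capitalization of the original word.
        return simple.capitalize() if word[0].isupper() else simple
    return re.sub(r"[A-Za-z-]+", repl, sentence)
```

A real system would draw these pairs from a thesaurus rather than a hard-coded table, but the substitution step itself is this simple.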
Certain sentence constructs, such as compound and complex sentences, pose difficulties for readers. Syntactic simplification involves restructuring these complex sentences: separating a conjoined sentence into two or more shorter sentences, identifying embedded clauses that can be extracted and converted into stand-alone simpler sentences, changing passive voice to active voice, and resolving and replacing pronouns.
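The conjoined-sentence case can be sketched with a toy rule (far simpler than a full syntactic simplifier, which would operate on parse trees rather than punctuation):

```python
import re

def split_conjoined(sentence: str) -> list[str]:
    # Toy syntactic simplification: split "X, and Y." into
    # "X." and "Y.", capitalizing the second clause.
    sentence = sentence.rstrip(".")
    parts = re.split(r",\s*(?:and|but)\s+", sentence, maxsplit=1)
    if len(parts) == 1:
        return [sentence + "."]
    first, second = parts
    return [first.strip() + ".", second[:1].upper() + second[1:].strip() + "."]
```

Even this crude rule shows why regeneration matters: the split clauses may need new cue words or referring expressions to stay cohesive.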
3. SIMTEXT

The architecture of SimText is shown in Figure 1. This architecture is based on the text simplification architecture proposed by Siddharthan [4]. It is composed of three major modules, namely Analysis, Syntactic Simplification, and Lexical Simplification. The following sections discuss each of these modules in detail.

Figure 1. Architecture of SimText

3.1 Analysis Module

The simplification process involves two major subtasks, namely sentence structure analysis and simplified text generation. Sentence structure analysis involves identifying the syntactic complexity of a given text and determining the existence of components that can be extracted or simplified.

The text simplification process begins when an input text document is fed to SimText. The Analysis module performs various tasks on the text: sentence boundary detection, text segmentation, part-of-speech tagging, noun chunking, grammatical-function determination, clause and appositive identification and attachment, third-person pronoun resolution, and difficulty tagging. The goal is to derive a structural representation of each sentence in the text and to determine sentence structures that can be simplified.
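As a hedged sketch of the first and last of these analysis tasks (sentence boundary detection and difficulty tagging; the lexicon fragment and its difficulty values are invented for illustration):

```python
import re

# Hypothetical lexicon fragment: word -> difficulty value.
# In SimText, a positive difficulty entry marks the word as difficult.
LEXICON = {"asthma": 1, "agonist": 1, "doctor": 0}

def segment_sentences(text: str) -> list[str]:
    # Naive sentence boundary detection on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tag_difficult(sentence: str) -> list[tuple[str, bool]]:
    # Mark each word whose lexicon difficulty entry is positive.
    words = re.findall(r"[A-Za-z-]+", sentence)
    return [(w, LEXICON.get(w.lower(), 0) > 0) for w in words]
```

The remaining tasks (POS tagging, noun chunking, pronoun resolution) need linguistic resources well beyond this sketch, but they produce annotations of the same shape: per-word or per-span tags attached to each sentence.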
The output of the Analysis module has the following specifications [8]:
a. The text has been segmented into sentences.
b. Words have their corresponding part-of-speech tags.
c. Elementary noun phrases are marked up and annotated with grammatical function information.
d. Boundaries and attachment have been marked up for the clauses and phrases to be simplified.
e. Pronouns co-refer to their antecedents.
f. Difficult words are tagged.

Note that the sixth item has been added to the original five listed by Siddharthan. This sixth specification is needed by the Lexical Simplification module in order to identify the words that are to be lexically simplified, either by synonym replacement or by paraphrasing with a dictionary definition.

3.2 Syntactic Simplification Module

After being analyzed, the text document undergoes Transformation and Regeneration in the Syntactic Simplification module. A document will typically undergo a series of transformations and regenerations until all sentences in the document are deemed to be syntactically simplified, before being passed to the next module.

Utilizing a set of rules from the system's knowledge source, the Transformation phase recursively applies appropriate rules to a sentence until no further simplification is possible. The stopping condition is dictated by the set of rules available from the knowledge source, as well as by the user model, which represents a computational model of the reading level of the target readers. The knowledge sources of SimText are further discussed in Section 3.4.

A transformed sentence is then forwarded to the Regeneration phase. This phase takes into consideration inter-sentential discourse, which is necessary in order to preserve the cohesion and meaning of the original text [8]; otherwise, the simplification process would not be useful. Regeneration may involve the following activities: cue word selection, sentence order determination, referring expression generation, determiner selection, and anaphoric link preservation.

3.3 Lexical Simplification Module

Once the document is syntactically simplified, it undergoes Paraphrasing and Synonym Substitution in the Lexical Simplification module. This involves paraphrasing difficult words with their dictionary definition, or replacing them with their synonym counterparts. Note that only words are modified in this module; the sentence structures generated by the Syntactic Simplification module are preserved.

Paraphrasing identifies difficult words based on the contents of the lexicon. If a word is found to be difficult, a dictionary lookup in the lexicon occurs and the word is paraphrased with its equivalent dictionary definition. Tagging a word as difficult is subjective in SimText; a word is considered difficult if it has a positive value for its difficulty entry in the lexicon.

Synonym substitution, on the other hand, replaces difficult words with their synonym counterparts. This involves a thesaurus lookup for the synonyms of words in the lexicon.

3.4 Knowledge Sources

The level of syntactic and lexical simplification that SimText hopes to achieve depends on the syntactic and lexical information stored in the knowledge sources used by the system. There are three knowledge sources, namely the Transformation Rules, the Lexicon, and the User Model.

3.4.1 Transformation Rules

Transformation Rules are used by the Syntactic Simplification module to convert complex sentence structures into simpler ones. A good formalism for representing transformation rules facilitates their manual refinement and maintenance [3]. Transformation rules are represented in SimText as pattern-effect pairs in XML. Sentence patterns identify sentences in certain complex configurations, and rule effects are the new sentence formats to be applied to form simpler configurations.

In addition, the user model of SimText can be changed by changing the transformation rules. This feature makes SimText extensible, allowing it to cater to varying user types and domains. Currently, the static user model of SimText represents a computational model of the English proficiency level of college students.

3.4.2 Lexicon

Lexical simplification involves replacing difficult words with their more commonly used synonym or
paraphrasing them with a definition of the word. Thus, the system must maintain a thesaurus and a dictionary for synonym replacement and word definition, respectively. The dictionary is also employed by the Analysis module to tag the part of speech of each word, annotate words with their grammatical function, and tag them as difficult. The lexicon can be built either by replacing the entire lexicon or by importing new entries.

4. LEXICAL SIMPLIFICATION MODULE

The Lexical Simplification module is not part of Siddharthan's original work; however, he proposed a revised architecture that includes such a module.

A subtask of the simplification process involves identifying whether a given text is difficult (in terms of readability or comprehensibility) for the target reader. Williams et al. [5] define readability as directly related to the reader's performance on the reading task, i.e., reading speed, ability to answer comprehension questions, and ability to recall content. According to Dubay [6], readability is what makes some texts easier to read than others.

A factor contributing to the difficulty of a text is the difficulty of the words used in it. In SimText, a word is considered difficult only if its difficulty tag in the lexicon has been set. This tag is used by the Analysis module to identify difficult words and mark them accordingly.

A word that has been tagged difficult can be simplified in two ways: paraphrasing and synonym substitution. In paraphrasing, a difficult word is simplified by inserting its dictionary definition into the text. For example, given the sentence below:

Jan overlooked one important detail.

overlooked has been tagged difficult. After paraphrasing, the new text will be:

Jan did not notice one important detail.

Notice that the dictionary definition has been inserted into the text by paraphrasing overlooked as did not notice. However, this paraphrasing method can produce a sentence that lacks the cohesion we aim for, thus presenting an issue. In such cases, replacing a word directly with its dictionary definition may not be an ideal option. A solution is to retain the original word and insert the definition as an embedded clause. For example, given the sentence below:

Anti-inflammatory medications are now the single most effective therapy for adults with asthma.

asthma has been tagged difficult. After paraphrasing, the new text will be:

Anti-inflammatory medications are now the single most effective therapy for adults with asthma, a respiratory disorder characterized by wheezing.

In synonym substitution, the difficult words are replaced by their simpler and more common counterparts. The synonyms are determined through a thesaurus lookup. The thesaurus is part of the lexicon and contains administrator-defined synonyms; SimText depends on these synonyms to produce more readable text. Words may only have one synonym entry and one dictionary definition entry, and SimText assumes that these entries are valid. For example, given the sentence below:

In a survey, 29.5% of pregnant smokers had experienced problems associated with previous pregnancy, including cot death.

cot death has been tagged difficult. After synonym substitution, the new text will be:

In a survey, 29.5% of pregnant smokers had experienced problems associated with previous pregnancy, including infant death.

Notice that cot death was substituted with its synonym counterpart infant death.
The Lexical Simplification Module is implemented in
Microsoft C#.Net. The Lexical Database containing
the lexicon needed has been designed in MySQL. A number of issues arose after preliminary testing of this module. The first involves deciding which of the lexical simplification tasks should be performed first. The proponents decided to give priority to synonym substitution. The synonym of a word is looked up first, and in the event that a synonym is not available from the thesaurus, the dictionary definition is used to paraphrase the sentence where the word is located.

A second concern is that lexical simplification, specifically paraphrasing, generates very long sentences, as shown in the example below, where the definitions of the biomedical terms agonists and asthma were inserted into the sentence:

Short-acting inhaled beta-2 agonists, drugs that can combine with a receptor on a cell to produce a physiological reaction, are most effective for speedy relief of asthma, a respiratory disorder characterized by wheezing, symptoms.

A possible solution to this problem is to modify the system architecture of SimText. If the sentences being generated by the Lexical Simplification module are long and possibly in need of syntactic simplification, then lexical simplification can be performed first, before syntactically simplifying the text. This will ensure that all sentences, including the long sentences generated by the Lexical Simplification module, pass through the Syntactic Simplification module.

5. CONCLUSION

The field of natural language processing (NLP) in our country is still very young, more so with studies pertaining to text generation, text summarization, and text simplification.

A text simplification process identifies components of a sentence that may be separated out, and transforms each of these into free-standing simpler sentences. Some nuances of meaning from the original text may be lost in the simplification process, since sentence-level syntactic restructuring can alter the meaning of a sentence. Thus, simplification is not appropriate for texts, such as legal documents, where it is important not to lose any nuance. But there are other areas of natural language processing where such simplification would be of great use [1].

The SimText system is currently in the implementation phase. This paper reported on the current progress of the SimText system in simplifying technical documents for college students, and presented the Lexical Simplification module and the issues that arose during development. Further work involves implementing the Analysis and Syntactic Simplification modules, and includes the research issues that need to be addressed in order to realize this application: readability assessment, paraphrase representation, specialized transformation rules, etc.

6. REFERENCES

[1] Chandrasekar, R., Doran, C., Srinivas, B. (1996). Motivations and Methods for Text Simplification. In Proceedings of COLING 1996, Copenhagen.

[2] Devlin, S., Tait, J., Canning, Y., Carroll, J., Minnen, G., Pearce, D. (2000). The Application of Assistive Technology in Facilitating the Comprehension of Newspaper Text by Aphasic People. In Buhler and Knops (eds.), Assistive Technology on the Threshold of the New Millennium, Assistive Technology Research Series, Volume 6. IOS Press, The Netherlands.

[3] Inui, K., Fujita, A., Takahashi, T., Iida, R. (2003). Text simplification for reading assistance: A project note. Available online: http://acl.ldc.upenn.edu/acl2003/iwp/pdf/InuiFujita.pdf.

[4] Siddharthan, A. (2004). Syntactic simplification and text cohesion. Available online: http://www.cs.columbia.edu/nlp/papers/2004/siddharthan_04.pdf.

[5] Williams, S., Reiter, E., Osman, L. (2003). Experiments with Discourse-Level Choices and Readability. In Proceedings of the 9th European Workshop on Natural Language Generation, Budapest.

[6] Dubay, W. (2004). The Principles of Readability. Impact Information, CA. August 2004.

[7] Klebanov (2004). Text Simplification for Information-Seeking Applications.

[8] Siddharthan, A. (2002). An Architecture for a Text Simplification System. In Proceedings of the Language Engineering Conference 2002 (LEC 2002).