Automatic Simplification of Spanish Text for
e-Accessibility
Stefan Bott and Horacio Saggion
Universitat Pompeu Fabra,
Department of Information and Communication Technologies
C/Tanger 122, 08018 Barcelona, Spain
{stefan.bott, horacio.saggion}@upf.edu
Abstract. In this paper we present an automatic text simplification
system for Spanish which aims to make texts more accessible for users
with cognitive disabilities. The system reduces the structural
complexity of Spanish sentences by converting complex sentences into
two or more simple sentences, thereby reducing reading difficulty.
Keywords: Automatic Text Simplification, Natural Language Processing, e-Accessibility
1 Introduction
The United Nations’ Convention on the Rights of Persons with Disabilities requires that the signing countries promote access to information for persons with
disabilities [1]. But the reality is usually different: for people with cognitive disabilities, access to textual information is often hard because texts written for
the general public are too difficult for them to read. Adapting only the format of the
text is not the solution in this case. One possibility to enable access to textual
information for people with cognitive problems is to adapt and simplify texts
manually. One set of guidelines under the umbrella name of the “easy-to-read”
methodology is generally used to adapt already existing textual material or to
produce content following the proposed guidelines. There are some organizations
which are dedicated to the production of such material, such as the Asociación
Facil Lectura [2], but the primary problem is that manual adaptation is very
costly in terms of human labour because of the time and knowledge required
to produce simplifications. For this reason, making easy-to-read versions of the
current volume of textual information (or even a small proportion of it) would
be impractical with human efforts alone.
Automatic text simplification is a technology for producing adapted texts by
reducing their syntactic and lexical complexity so that they become readable for
a target user group. Automatic text simplification products can be considered a
kind of e-Accessibility device with the potential to help various user groups,
including elderly people, second language learners, and immigrants.
Our research is concerned with the development of an automatic simplification system for Spanish. It forms part of the Simplext project (footnote 1), which aims
to provide people with intellectual disabilities with access to textual information anytime and anywhere with the help of web and mobile applications. In this paper
we report the results of the research conducted so far for the development of the
first text simplification system for Spanish. We briefly describe our efforts to
create language resources and a working prototype which is already operational.
2 Text Simplicity and Automatic Simplification
Even if the concept of “easy-to-read” is not universal, it is possible in a number
of specific contexts to write a text that will suit the abilities of most people with
literacy and comprehension problems. This easy-to-read material is generally
characterized by the following features:
– The text is usually shorter than a standard text and redundant content and
details which do not contribute to the general understanding of the topic are
eliminated.
– Ideally, each sentence should only contain one piece of information. Hence,
easy texts are written in fairly short sentences, avoiding subordinate clauses
whenever possible.
– Previous knowledge is not taken for granted. Background, difficult words
and context are explained but in such a way that it does not disturb the
flow of the text.
– Easy-to-read is always easier than standard language. There are differences
of level in different texts, all depending on the target group in mind.
While the problem of automatic text simplification has been studied for some
other languages, there are no simplification tools for Spanish. Work on automatic
text simplification has followed either rule-based paradigms, where rules were designed
following linguistic intuitions [5, 6], or statistical machine learning approaches
[7], which require a considerable volume of training data. Although automatic
text simplification has sometimes been studied without paying attention to the
user of the simplification, most research has particular user groups in mind. For
example, the PSET project [8] studied simplification for people with aphasia,
while the PorSimples project [9, 10] looked into simplification for people with
low literacy. One important issue in text simplification research is the set of
factors that make a text more or less readable for a target user group, so research
on measuring text readability is especially relevant [11–13].
One factor that makes the development of automatic simplification tools
difficult is a noticeable lack of linguistic resources (parallel corpora and lexical
resources). This is especially true for the Spanish language. In the larger context
of our research project we are creating a specialized corpus of 200 news texts.
Simplified versions are created by human experts and aligned to the original
texts [14]. The simplified texts have been especially prepared for the target user
group, following theoretically motivated guidelines, based on previous studies on
readability of Spanish texts [15, 16]. The corpus of manual simplifications serves
us as a model of texts which are apt for our target users.
Footnote 1: Simplext: An automatic system for text simplification [3]. The idea for creating such
a tool for Spanish was born from the Prodis Foundation [4].
In order to determine what kinds of text simplification operations our system would have to cover, we carried out an initial corpus study. The details of
this study can be found in [17]. We examined 145 simplified sentences which
constituted the first available part of our corpus, contrasting them with the non-simplified versions. In addition to determining the needs of users, we wanted to
assess how far the necessary text adjustments could be carried out automatically.
In the study we identified different editing operations and evaluated their frequency, together with the possibility of automating them. We found that many
operations human editors perform are either very hard to classify, since they use
rather free re-wording instead of clear-cut editing operations, or require complex
reasoning on the basis of real-world knowledge. But we also found various editing
operations which were regular enough to be automated.
On the basis of this corpus study we decided to concentrate on the automatic
treatment of lexical simplification, deletion operations, sentence split operations,
and the insertion of definitions. These four categories are very different in nature
and in the resources they require when automated. Within our research project
we are developing different simplification modules for these different types of
simplification operations. At the moment only the module for syntactic simplification (carrying out the sentence splitting operations) is in an operational stage.
We are aware of the fact that this is only one aspect of text simplification, but
we believe it to be an important one. Other modules, namely the modules for
lexical simplification and deletions, are under development and will be integrated
into our simplification system in the near future.
3 System Components and Evaluation
3.1 The Simplification Grammar
The current version of the prototype concentrates purely on the reduction of
structural complexity (footnote 2). The core of the simplification system in its current state
consists of a hand-crafted transduction grammar which operates on dependency
trees produced by a dependency parser [18]. The output of the parser is a tree
which represents natural language syntax in the form of dependency relations
between words. Figure 1 is an example of such a dependency tree and corresponds
to example (1a), which we will discuss below. The parser also associates each
node in the tree with its morphological information, such as its part of speech and
agreement information (footnote 3).
Footnote 2: Further details about the system architecture can be found in [17].
Footnote 3: Figure 1 is a visualization of the tree produced by the MATE development environment. Morphological information is inherent in the nodes, but it is not shown in the image.
Fig. 1. A target structure containing a relative clause.
(1)
a. Las lluvias torrenciales, que comenzaron el pasado 1 de octubre (. . . )
hicieron que los ríos se desbordaran (. . . )
The torrential rains that began on October 1 (. . . ), caused rivers to
overflow (. . . )
b. Las lluvias torrenciales hicieron que los ríos se desbordaran (. . . )
Estas lluvias comenzaron el pasado 1 de octubre.
The torrential rains caused rivers to overflow (. . . ) These rains began
on October 1.
The grammar itself is being developed within the MATE framework [19]. MATE
is a tree transduction tool which was created with the mapping between different layers of linguistic representation in mind and is especially useful for text
generation. In our context, however, we use MATE as a tool that maps syntactic dependency structures which we detect as requiring simplification onto
simplified versions of these structures.
The simplification grammar typically splits one sentence into two or more
shorter ones. In this process some parts of the syntactic tree may be copied,
deleted or split. Some reordering operations also apply. The process is done in
two steps: first the grammar identifies a target structure which appears to lend
itself to simplification. In a second step the actual simplification is carried out.
This strategy allows for a hybrid approach in which we can let a statistical
classifier decide whether a simplification should be carried out or not after identification. This is particularly important since most of the target structures are
ambiguous in one way or another.
Footnote 3 (continued): A further type of information which is implicit but not shown in the image is the
linear ordering of the words.
Relative clauses, such as (1a), are a good example of the typical application
of sentence split rules. Relative clauses often express information about a nominal referent which can be expressed in a separate sentence. Such a separation
results in shorter and less complex sentences, especially in cases with multiple
and recursive subordination. Our grammar is able to detect such relative clause
structures and turn them into an output like (1b). In order to manipulate the
input text, the grammar first has to identify a matching part of the tree shown
in Figure 1. More specifically, the grammar looks for a verbal node (in this case
comenzaron/started) which depends on a noun (lluvias/rains) and in turn dominates a relative pronoun (que/that). Expressed informally, the rule then cuts
the whole subtree dominated by this verb and turns it into a separate sentence
which follows the main clause. In order to convert a relative clause into an independent sentence, the relative pronoun has to be replaced by the form of the
head noun of the clause (lluvias/rains) and an appropriate determiner (in this
case estas/these) has to be inserted.
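The relative clause rule described above can be illustrated with a minimal Python sketch. It is not the MATE grammar itself: the `Node` class, the coarse tags (`NOUN`, `VERB`, `RELPRON`), the `DEMONSTRATIVES` table and the hand-built tree for a shortened, punctuation-free version of (1a) are all assumptions made for illustration. The `should_apply` argument is a hypothetical hook that marks where a statistical filter could accept or reject a detected target structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Demonstrative determiners by (gender, number); one replaces the relative pronoun.
DEMONSTRATIVES = {("f", "sg"): "esta", ("f", "pl"): "estas",
                  ("m", "sg"): "este", ("m", "pl"): "estos"}


@dataclass(eq=False)
class Node:
    """One word of a toy dependency tree (hypothetical format, not MATE's)."""
    form: str
    pos: str                         # coarse tag: NOUN, VERB, RELPRON, DET, ...
    index: int                       # linear position in the input sentence
    agr: Tuple[str, str] = ("", "")  # (gender, number), only needed on head nouns
    children: List["Node"] = field(default_factory=list)

    def subtree(self) -> List["Node"]:
        """Return this node and every node it dominates."""
        nodes = [self]
        for child in self.children:
            nodes.extend(child.subtree())
        return nodes


Match = Tuple[Node, Node, Node]      # (head noun, clause verb, relative pronoun)


def find_relative_clauses(root: Node) -> List[Match]:
    """Step 1: find a verb that depends on a noun and dominates a relative pronoun."""
    return [(noun, verb, pron)
            for noun in root.subtree() if noun.pos == "NOUN"
            for verb in noun.children if verb.pos == "VERB"
            for pron in verb.children if pron.pos == "RELPRON"]


def linearize(nodes: List[Node]) -> str:
    """Order words by original position and format them as a sentence."""
    text = " ".join(n.form for n in sorted(nodes, key=lambda n: n.index))
    return text[0].upper() + text[1:] + "."


def split_relative_clause(root: Node, match: Match) -> List[str]:
    """Step 2: cut the clause subtree and turn it into a separate sentence."""
    noun, verb, pron = match
    noun.children.remove(verb)                       # detach the relative clause
    verb.children.remove(pron)                       # drop the relative pronoun
    det = Node(DEMONSTRATIVES[noun.agr], "DET", -2)  # inserted determiner
    head = Node(noun.form, "NOUN", -1)               # copy of the head noun
    return [linearize(root.subtree()), linearize([det, head] + verb.subtree())]


def simplify(root: Node,
             should_apply: Callable[[Match], bool] = lambda m: True) -> List[str]:
    """Apply the rule to the first accepted match; otherwise keep the sentence."""
    for match in find_relative_clauses(root):
        if should_apply(match):      # hook where a statistical filter could decide
            return split_relative_clause(root, match)
    return [linearize(root.subtree())]


def build_example() -> Node:
    """Hand-built tree for a shortened version of example (1a), without punctuation."""
    date = Node("1", "NUM", 7, children=[
        Node("el", "DET", 5), Node("pasado", "ADJ", 6),
        Node("de", "ADP", 8, children=[Node("octubre", "NOUN", 9)])])
    clause = Node("comenzaron", "VERB", 4,
                  children=[Node("que", "RELPRON", 3), date])
    rains = Node("lluvias", "NOUN", 1, ("f", "pl"),
                 [Node("Las", "DET", 0), Node("torrenciales", "ADJ", 2), clause])
    rivers = Node("ríos", "NOUN", 13, ("m", "pl"), [Node("los", "DET", 12)])
    overflow = Node("desbordaran", "VERB", 15,
                    children=[rivers, Node("se", "PRON", 14)])
    comp = Node("que", "SCONJ", 11, children=[overflow])
    return Node("hicieron", "VERB", 10, children=[rains, comp])


if __name__ == "__main__":
    for sentence in simplify(build_example()):
        print(sentence)
```

Keeping detection (`find_relative_clauses`) separate from transformation (`split_relative_clause`) mirrors the two-step design: a filter can reject a match, for instance a restrictive clause, without touching the tree.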
Apart from relative clauses, the current version of our grammar is able to
simplify gerundive and participle constructions (e.g. (2)) and coordinations (e.g. (3),
an example of verb phrase coordination), and to carry out a special operation which we encountered very often in the corpus and call quote inversion (e.g. (4)). All of these
examples are taken from our corpus and the simplified versions have been produced by our grammar.
(2)
a. Los participantes (. . . ) recibirán como obsequio un libro editado por
el Ayuntamiento (. . . )
The participants (. . . ) will receive a book as a present, edited by the
town council (. . . )
b. Los participantes (. . . ) recibirán como obsequio un libro. Este libro
está editado por el Ayuntamiento (. . . )
The participants (. . . ) will receive a book as a present. This book is
edited by the town council (. . . )
(3)
a. (. . . ) los precios se han disparado y mucha gente no puede permitirse
el lujo de comprar alimentos.
(. . . ) the prices have exploded and many people cannot afford the
luxury of buying food.
b. (. . . ) los precios se han disparado. Mucha gente no puede permitirse
el lujo de comprar alimentos.
(. . . ) the prices have exploded. Many people cannot afford the luxury
of buying food.
(4)
a. “Se necesita más apoyo que nunca antes”, apuntó (. . . )
(. . . ) “More support than ever is necessary”, he pointed out (. . . )
b. Apuntó: “Se necesita más apoyo que nunca antes”.
He pointed out: “More support than ever is necessary”.
What these cases have in common is that the sentences resulting from simplification are much shorter and, more importantly, structurally less complex. This
corresponds to the idea that, whenever possible, one simplified sentence should
not express more than one idea.

Operation                    Precision   Recall    Frequency
Relative Clauses             39.34%      66.07%    20.65%
Gerundive Constructions      63.64%      20.59%    2.48%
Quotation Inversion          78.95%      100%      2.14%
Object coordination          42.03%      58.33%    7.79%
VP and clause coordination   64.81%      50%       6.09%
Table 1. Precision, recall and frequency of application per rule type
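As a rough illustration of the quote inversion in example (4), the following Python sketch works on the surface string with a regular expression. This is a deliberate simplification and an assumption on our part: the grammar described here operates on dependency trees, not on strings, and the pattern `QUOTE_FIRST` and the function `invert_quote` are hypothetical names. The sketch only handles a sentence consisting of one quotation followed by a comma and a reporting clause.

```python
import re

# A quotation (straight or curly double quotes), a comma, then a reporting clause.
QUOTE_FIRST = re.compile(
    r'^\s*["“”](?P<quote>[^"“”]+)["“”]\s*,\s*(?P<report>[^"“”]+?)\s*\.?\s*$'
)


def invert_quote(sentence: str) -> str:
    """Move the reporting clause in front of the quotation.

    Sentences that do not match the pattern are returned unchanged.
    """
    m = QUOTE_FIRST.match(sentence)
    if m is None:
        return sentence
    report = m.group("report")
    report = report[0].upper() + report[1:]   # the clause now starts the sentence
    return f'{report}: “{m.group("quote")}”.'


if __name__ == "__main__":
    print(invert_quote('“Se necesita más apoyo que nunca antes”, apuntó.'))
```

A tree-based rule can use the parser's analysis to find the reporting verb and its arguments, which a surface pattern like this one cannot do reliably.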
3.2 Evaluation
An evaluation of the performance of the different simplification operations is
given in Table 1. This evaluation was carried out over 886 sentences. We counted
places where the rule had produced a felicitous output, ignoring minor grammaticality issues which can be solved with further fine-tuning of the grammar
rules. Precision is defined here as the percentage of rule applications that were correct. For the calculation of recall we manually annotated 262 sentences for
target structures that could be simplified. The frequency of rule application is given as the percentage of sentences affected by a
rule.
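The three measures can be written down as a small Python sketch. The function names and all counts below are hypothetical and are not taken from the evaluation. In particular, the sketch assumes that recall is the share of manually annotated target structures that the rule handled correctly, which the text does not spell out.

```python
from typing import Dict


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


def rule_scores(correct: int, applied: int,
                found_targets: int, annotated_targets: int,
                affected: int, total_sentences: int) -> Dict[str, float]:
    """Per-rule precision, recall and frequency of application."""
    return {
        # correct applications among all applications of the rule
        "precision": percentage(correct, applied),
        # annotated target structures that were handled (assumed definition)
        "recall": percentage(found_targets, annotated_targets),
        # sentences touched by the rule among all evaluated sentences
        "frequency": percentage(affected, total_sentences),
    }


if __name__ == "__main__":
    # Hypothetical counts for one rule, chosen only to show the arithmetic.
    print(rule_scores(correct=30, applied=40,
                      found_targets=30, annotated_targets=50,
                      affected=40, total_sentences=200))
```

Because precision and recall are computed over different samples (886 sentences and 262 annotated sentences), the counts behind them are kept as separate arguments.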
In interpreting Table 1 it is important to note that no statistical filtering has
been applied yet in order to resolve structural ambiguities, such as the defining
vs non-defining difference for relative clauses. Turning a restrictive clause into
an independent sentence leads to an infelicitous output and in Spanish this distinction is usually not reflected in the syntax. Such errors constitute 57% of all
errors in the application of this rule. In addition, parse errors are a serious problem and propagate into the simplification module. These constitute a large part
of all errors, up to 37% in the category of gerundive constructions. Finally, error
analysis showed us that there is still much room for improvement of precision
and recall with further grammar engineering.
We have also implemented a series of support vector machine classifiers [20] (footnote 4)
to address specific problems which would be difficult to deal with using a rule-based
approach alone. The task of the classifier is to help decide whether or not the
application of a rule would be correct. We have concentrated on very specific
problems such as deciding if a sentence should be split or whether a full sentence
should be deleted. For sentence deletions we obtain an F-score of 76.03%, but a
simple baseline (deleting the last sentence of each text in the specific text genre)
is nearly as accurate (73.00%). In the case of deciding on sentence splitting our
classifier can improve significantly over the baseline, which only considers sentence length: the classifier yields an F-score of 80.06%, while the baseline only
reaches 40.00%. The experiments were only performed on a sample of 40 documents. Although the results are still modest, we believe that with further data
and carefully selected features the performance of the classifiers will improve.
Footnote 4: We used the support vector machine implementation provided in the GATE framework [21, 22].
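To make the comparison with the length baseline concrete, the following Python sketch shows a sentence-length baseline for the split decision and an F-score computation. Several points are assumptions: the text does not say which F-score variant was used, so the balanced F1 is shown, and the token threshold, the whitespace tokenization and the function names are hypothetical.

```python
from typing import List, Sequence


def length_baseline(sentences: List[str], max_tokens: int = 25) -> List[bool]:
    """Predict 'split' for every sentence longer than max_tokens tokens.

    Tokens are whitespace-separated; the threshold is a hypothetical value.
    """
    return [len(s.split()) > max_tokens for s in sentences]


def f_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Balanced F1 of the positive ('split') class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: gold split decisions and the predictions of some classifier.
    gold = [True, True, False, False]
    predicted = [True, False, True, False]
    print(f_score(gold, predicted))
```

A trained classifier would replace `length_baseline` as the source of the predictions, and `f_score` would be applied to both in the same way.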
4 Conclusion and Outlook
In this paper we have described the prototype of a text simplification system for
Spanish which concentrates on the reduction of syntactic complexity. This prototype is the result of an ongoing research project. The syntactic simplification
module is at an operational stage, but we still see much room for improvement.
Other components of what we plan to be the final simplification system are still
under development and need to be integrated, more specifically a lexical simplification module, a statistical filter for rule application and deletion operations, and
also a module for the addition of clarifying definitions. The system presented
here is part of a software architecture, described in [23], which includes web
services, a web browser plugin and mobile phone applications.
In the near future we will carry out an extrinsic evaluation with the help of
twenty intellectually disabled persons, in order to compare the reading comprehension of original and simplified texts.
Acknowledgements. The research described in this paper arises from a Spanish research project called Simplext: An automatic system for text simplification
[3]. Simplext is led by Technosite and partially funded by the Ministry of Industry, Tourism and Trade of the Government of Spain, by means of the National
Plan of Scientific Research, Development and Technological Innovation (I+D+i),
within the Strategic Action of Telecommunications and Information Society (Avanza
Competitiveness, with file number TSI-020302-2010-84). We are grateful for fellowship RYC-2009-04291 from Programa Ramón y Cajal 2009, Ministerio de
Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo
e Innovación, Spain.
References
1. United Nations: Convention on the rights of persons with disabilities. http://www2.ohchr.org/english/law/disabilities-convention.htm
2. Asociación Facil Lectura: Social space for research and innovation. http://www.lecturafacil.net
3. Simplext: An automatic system for text simplification. http://www.simplext.es
4. Prodis Foundation: Social space for research and innovation. http://www.fundacionprodis.org/
5. Chandrasekar, R., Doran, C., Srinivas, B.: Motivations and methods for text simplification. In: Proceedings of the International Conference on Computational
Linguistics. (1996) 1041–1044
6. Siddharthan, A.: An architecture for a text simplification system. In: Proceedings
of the Language Engineering Conference. (2002) 64–71
7. Zhu, Z., Bernhard, D., Gurevych, I.: A monolingual tree-based translation model
for sentence simplification. In: Proceedings of the International Conference on
Computational Linguistics, Beijing, China (Aug 2010) 1353–1361
8. Carroll, J., Minnen, G., Canning, Y., Devlin, S., Tait, J.: Practical simplification of
English newspaper text to assist aphasic readers. In: Proceedings of the AAAI-98
Workshop on Integrating Artificial Intelligence and Assistive Technology. (1998)
7–10
9. Aluísio, S.M., Specia, L., Pardo, T.A.S., Maziero, E.G., de Mattos Fortes, R.P.:
Towards Brazilian Portuguese automatic text simplification systems. In: ACM
Symposium on Document Engineering. (2008) 240–248
10. Gasperin, C., Maziero, E.G., Aluísio, S.M.: Challenging choices for text simplification. In: The International Conference on Computational Processing of Portuguese.
(2010) 40–50
11. Flesch, R.: A new readability yardstick. Journal of applied psychology 32(3) (1948)
221–233
12. Graesser, A.C., McNamara, D.S., Louwerse, M.M., Cai, Z.: Coh-Metrix: Analysis
of text on cohesion and language. Behavior Research Methods, Instruments and
Computers 36(2) (2004) 193–202
13. Feng, L., Jansche, M., Huenerfauth, M., Elhadad, N.: A comparison of features for
automatic readability assessment. In: Proceedings of the International Conference
on Computational Linguistics (Posters). (2010) 276–284
14. Bott, S., Saggion, H.: An unsupervised alignment algorithm for text simplification
corpus construction. In: ACL Workshop on Monolingual Text-To-Text Generation,
Portland, Oregon (2011)
15. Anula, A.: Tipos de textos, complejidad lingüística y facilitación lectora. In:
Actas del Sexto Congreso de Hispanistas de Asia, Seúl (2007) 45–61
16. Anula, A.: Lecturas adaptadas a la enseñanza del español como L2: variables
lingüísticas para la determinación del nivel de legibilidad. In Cesteros, S.P., Roca,
S., eds.: La evaluación en el aprendizaje y la enseñanza del español como LE/L2,
Alicante (2008) 162–170
17. Bott, S., Saggion, H.: Text simplification tools for Spanish. In: Proceedings of the
International Conference on Language Resources and Evaluation. (2012)
18. Bohnet, B.: Efficient parsing of syntactic and semantic dependency structures. In:
Proceedings of the Conference on Natural Language Learning (CoNLL), Boulder,
Colorado, Association for Computational Linguistics (2009) 67–72
19. Bohnet, B., Langjahr, A., Wanner, L.: A development environment for MTT-based
sentence generators. Revista de la Sociedad Española para el Procesamiento del
Lenguaje Natural (2000)
20. Li, Y., Zaragoza, H., Herbrich, R., Shawe-Taylor, J., Kandola, J.: The Perceptron Algorithm with Uneven Margins. In: Proceedings of the 9th International
Conference on Machine Learning (ICML-2002). (2002) 379–386
21. Maynard, D., Tablan, V., Cunningham, H., Ursu, C., Saggion, H., Bontcheva, K.,
Wilks, Y.: Architectural Elements of Language Engineering Robustness. Journal
of Natural Language Engineering – Special Issue on Robust Methods in Analysis
of Natural Language Data 8(2/3) (2002) 257–274
22. GATE: General architecture for text engineering. http://gate.ac.uk
23. Saggion, H., Bott, S., Mille, S., Bourg, L., Figueroa, D., Santos, J., Etayo, E.,
Madrid-Sánchez, J., Gómez-Martínez, E., Anula, A.: Facilitating information access through automatic text simplification. (submitted)