WDS'07 Proceedings of Contributed Papers, Part I, 151–156, 2007. ISBN 978-80-7378-023-4 © MATFYZPRESS

Sentence Synthesis in Machine Translation

J. Ptáček
Charles University, Institute of Formal and Applied Linguistics, Prague, Czech Republic.

Abstract. We report work in progress on a complex system generating Czech sentences that express the meaning of input syntactic-semantic structures. Such a component is usually referred to as a realizer in the domain of Natural Language Generation. Existing realizers usually take advantage of a background linguistic theory. We introduce the Functional Generative Description, the framework of our choice, conceived in the 1960s by Petr Sgall. This language theory lays out the foundations of the formalism in which our input syntactic-semantic structures are specified. The structure definition was further elaborated and refined during the annotation of the Prague Dependency Treebank, now available in its second version. A section of the paper is devoted to the description of another theoretical framework suitable for the task of Natural Language Generation, the Meaning-Text Theory. We explore state-of-the-art realizers deployed in real-life applications, describe the common architecture of a generation system, and highlight the strengths and weaknesses of our approach. Finally, the preliminary output of our surface realizer is compared against a baseline solution.

1. Introduction

In this paper we deal with the following problem. Let there be a meaning to be conveyed to the reader. The meaning is specified by means of an abstract, semantically oriented data structure. We aim to build a device mapping every given data structure to the corresponding Czech sentence. The resulting sentence should be grammatically correct and has to convey the same meaning as specified by the input structure. The desired mapping device is referred to as a (linguistic, surface, or syntactic-and-morphological) realizer (as in Reiter and Dale [1997]; Bateman and Zock [2003]).
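The required mapping can be made concrete with a toy sketch (in Python; the node attributes are our own simplification, not the actual input format). A dependency tree comes in, and a naive baseline simply concatenates the lemmas in tree order, which is essentially the baseline we compare against in Section 5:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A highly simplified dependency-tree node (illustration only)."""
    lemma: str
    children: list = field(default_factory=list)

def linearize(node: Node) -> str:
    """Baseline 'realization': concatenate lemmas in depth-first order.

    A real realizer must additionally handle agreement, word order,
    auxiliaries, prepositions, vocalization, and punctuation.
    """
    parts = [node.lemma]
    for child in node.children:
        parts.append(linearize(child))
    return " ".join(parts)

# A toy tree yielding a "trvat rok 1928" style baseline output
tree = Node("trvat", [Node("rok", [Node("1928")])])
print(linearize(tree))  # trvat rok 1928
```

The gap between such a lemma string and a grammatical Czech sentence is exactly what the realizer described below has to bridge.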
During the computation of an output sentence, several linguistic phenomena are addressed. In order to deliver an acceptable sentence, the realizer has to deal with syntactic constructions, auxiliary words, prepositions, conjunctions, agreement, word order, vocalization, and punctuation. Generating word forms, i.e. the morphology, is usually treated separately: either the realizer calls a morphological module through a public interface, or the surface realizer prepares data to be processed by a morphological tool afterwards.

2. Motivation

A surface realizer can be used in all automatic systems where textual messages are delivered to the user. The obtained texts, enriched with additional metadata, can also serve as an input to speech synthesis (Pan and McKeown [1997]).

Sometimes the set of produced messages is not large, or the messages follow a simple pattern. Under such circumstances it is advantageous to prepare the messages in advance or to deploy a simple template system instead of a fully-fledged generation system. The mail-merge feature of the Microsoft Word text processor (first introduced in version 6.0 and still shipped with the current version) is an example of a commonly used template system generating business letters out of a template and a database of recipients.

In numerous applications it is unavoidable to assemble the message on the fly. A feasible solution is to specify the content of the message using a more abstract notation and to integrate a surface realizer module (such as ours) in order to obtain the final textual message. This method of text production is referred to as Natural Language Generation (NLG in short). Because our work fits into this field of study, we describe NLG later in greater detail. Here we just note that the NLG approach demands more linguistic, engineering, and coding expertise to deploy; in return, the produced texts are of higher quality than mail-merged output.
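The mail-merge style of templating mentioned above can be sketched in a few lines; note that no grammar knowledge is involved, which is its key limitation. The template text and recipient fields below are invented for illustration:

```python
from string import Template

# A minimal mail-merge sketch: one fixed template, one database of
# recipients. (Template text and field names are invented.)
letter = Template("Dear $name,\nyour order #$order will arrive on $date.")

recipients = [
    {"name": "J. Novák", "order": "1024", "date": "May 5"},
    {"name": "E. Dvořák", "order": "1025", "date": "May 6"},
]

letters = [letter.substitute(r) for r in recipients]
print(letters[0].splitlines()[0])  # Dear J. Novák,
```

Because the surrounding text is fixed, any change in number, gender, or case of a filled-in value would make the output ungrammatical, particularly in a highly inflected language such as Czech; this is where a fully-fledged realizer pays off.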
Moreover, an NLG system extends well when new types of messages need to be communicated.

We see Machine Translation (MT in the sequel) as an automatic system producing textual messages as well. However, the MT problem shows additional complexity when compared to NLG. Under the NLG scenario, the content of the produced message is already stored in the system in a machine-readable format. This is not the case in Machine Translation: source-language analysis has to be performed in order to obtain the message, which is in the end presented in the target language by a surface realization component, as illustrated in Figure 1.

Figure 1. Two use cases for a surface realization component: the third step of Machine Translation and the sixth step of Natural Language Generation.

2.1. Natural Language Generation

In this section we examine the six steps (listed in Figure 1) that a common NLG system performs in order to produce a natural language text from a non-linguistic representation of information. Text generation is illustrated with a hypothetical train information system adopted from Reiter and Dale [1997]. The purpose of the system is to respond to queries like "When is the next train to Glasgow?" with appropriate answers. Based on a daily schedule, the system may reply: "There are 20 trains each day from Aberdeen to Glasgow. The next train is the Caledonian Express; it leaves Aberdeen at 10am. It is due to arrive in Glasgow at 1pm."

Content determination comes first. Parsing the question and selecting the relevant data is highly application-dependent. However, we can assume that the selected data will be organized into messages. Within messages we distinguish entities, concepts, and relations. Important nouns in the domain of interest (such as specific trains, times, and places) are the entities. Similarly, both domain-specific verbs (to depart, to arrive) and common relations (identity) are relations we are interested in.
Concepts are shifters, deictic expressions for entities (e.g., the next train) denoting important properties of entities. The messages are usually stored as attribute-value matrices. Each matrix corresponds to a relation that holds between its arguments, which are either entities or concepts.

When the set of messages (the matrices) is determined, we proceed to Discourse planning. A hierarchy is introduced among the messages, reflecting the order and discourse relations in the emerging story. The result of discourse planning therefore has the shape of a tree.

As the name suggests, in the Sentence aggregation phase sentence boundaries are established. Messages that share the same constituents are grouped together into a single sentence. Though this phase is not mandatory, it has a positive impact on the fluency of the resulting text.

Particular lexemes are chosen in two subsequent steps, Lexicalization and Referring expression generation. The first step takes care of words and phrases expressing domain concepts and relations. The second task is to name the entities; pronominalization is a subtask of entity naming.

When all the listed steps are completed, only Linguistic realization remains. Here the knowledge of the target language grammar is applied to form correct sentences. There is no wide agreement on the particular data format that serves as input for surface realization. Applications differ according to the background linguistic theory they build on. We adhere to the Functional Generative Description (FGD in short) (Sgall [1967], Sgall et al. [1986]), specifically to the tectogrammatical description of the sentence that is used in the Prague Dependency Treebank (PDT in the sequel) (Hajič et al. [2006]).

2.2. Machine Translation

A more prominent motivation that drives our effort to implement a surface realization component lies in Machine Translation. There are several methods used today.
Shallow MT based on morphosyntactic analysis is well suited for translation between related languages (Hajič et al. [2000]). The noisy-channel model and phrase translation are exploited by statistical means.1 Statistical phrase-based MT systems currently achieve top results.2 Our component is of use in so-called transfer-based MT, which breaks the work into three well-defined, separate parts (analysis, transfer, and realization). Statistical and non-statistical approaches can be mixed in a transfer-based MT. The idea is to perform a syntactic (perhaps even semantic) analysis on the side of the source language in order to obtain a more abstract description of the translated meaning. Presumably, the source- and target-language abstract descriptions will not be as distant as the original sequences of word forms; thus the transfer is supposed to be easier to perform than in direct translation. The obtained data structure is then mapped into the resulting sentence by a surface realizer component for the target language.

Our surface realizer is currently tested in a transfer-based MT environment, and that presents a valuable source of feedback. The quality of the output (as measured by the BLEU score described later) steadily improves thanks to the reported problems. The experiment makes use of a statistical tree-to-tree rewriter. As a consequence, the trees we get on input do not necessarily fulfill all constraints imposed on tectogrammatical trees in the Prague Dependency Treebank. These automatically analyzed data serve as an example of real-world input, and testing on them improves the robustness of the surface realizer.

3. Theoretical Frameworks

For the task of NLG, some language theories are more directly applicable than others. A Chomskian approach is geared toward the analysis of a sentence and deciding whether the sentence in question conforms to a grammar. On the other hand, works dealing with the functions of language and the corresponding surface forms bring more insight into the process of NLG.
Systemic Functional Linguistics (Halliday and Martin [1981]) and the Functional Generative Description (Sgall [1967]) (FGD in short) both represent this standpoint. Also the Meaning-Text Theory (Mel’čuk [1988]) describes the transition from a semantic structure into a sequence of phonemes. So far we have studied two theoretical frameworks making allowance for NLG, the Functional Generative Description and the Meaning-Text Theory.

1 Manning and Schütze [1999] give an introduction to the field of statistical MT.
2 See the NIST evaluation: http://www.nist.gov/speech/tests/mt/.

3.1. Functional Generative Description (and the Prague Dependency Treebank)

The functional and dependency-based approach founded by Sgall [1967], Sgall et al. [1986] uses a number of layers to describe a language system. We are particularly interested in a recent formalism that was applied to annotate a considerable amount of text, the Prague Dependency Treebank 2.0 (abbreviated as PDT). In this formalism there are three layers of description: the morphological, analytical, and tectogrammatical layers. Our realizer is capable of generating surface sentences from tectogrammatical trees.

The PDT data consist of 7,129 Czech documents containing 116,065 manually annotated sentences. On the morphological layer, each token is lemmatized and tagged. On the analytical layer, a tree manifesting surface-syntactic relations is built. The most abstract description of the sentence is given on the tectogrammatical layer. The tectogrammatical annotation is done for 44% of the sentences in PDT 2.0 and was achieved in four stages:

• building the dependency tree structure of autosemantic words in the sentence, labeling the dependency relations, and annotating valency,
• topic/focus annotation,
• annotation of coreference (i.e., relations between nodes referring to the same entity),
• annotation of grammatemes and related attributes.
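The four annotation stages above can be summarized by the kind of record a single tectogrammatical node carries. The following sketch is our own simplification; the attribute names only loosely follow the PDT conventions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TNode:
    """Simplified tectogrammatical node (attributes loosely follow PDT)."""
    t_lemma: str                      # lemma of the autosemantic word
    functor: str                      # dependency relation, e.g. ACT, PAT
    tfa: str = "f"                    # topic/focus: 't', 'c', or 'f'
    grammatemes: dict = field(default_factory=dict)  # e.g. tense, number
    coref: Optional["TNode"] = None   # link to a coreferential antecedent
    children: list = field(default_factory=list)

# A toy clause: 'problém' as Patient of 'překonat' (cf. example (1) below)
verb = TNode("překonat", "PRED", grammatemes={"tense": "ant"})
obj = TNode("problém", "PAT", tfa="t", grammatemes={"number": "sg"})
verb.children.append(obj)
print(verb.children[0].functor)  # PAT
```

The realizer's job is to expand such records back into inflected word forms together with the auxiliary material that the tectogrammatical layer deliberately omits.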
The tectogrammatical tree contains only autosemantic words, i.e., words with full lexical meaning. Auxiliary verbs, subordinating conjunctions, and prepositions are omitted, and their meanings are noted as properties of the governing autosemantic words.

3.2. Meaning-Text Theory

The comparably old Meaning-Text Theory (MTT) by Mel’čuk [1988] addresses the encoding of meanings in texts. It shares the dependency and stratificational viewpoint with the FGD framework. There are seven layers of representation: one semantic layer and surface/deep pairs of the syntactic, morphological, and phonetic representations. Each of the seven layers consists of several structures.

The semantic structure (part of the semantic representation) is a network of semantemes. Directed edges point to arguments, i.e., semantemes filling the valency slots of predicates. The notion of valency is similar to valency on the tectogrammatical layer in FGD, though the exact definitions differ significantly. The MTT defines two more structures concerning communicative and rhetorical features as part of the semantic representation.

The main structure of the next (deep syntactic) layer is usually compared to the tree structure of the tectogrammatical layer in FGD. Both are trees of autosemantic (full-meaning) words. Morphological attributes of nodes imposed by government and agreement are excluded in both frameworks. And finally there is an additional structure encoding coreferential relations.

In spite of these similarities, mentioned in Žabokrtský [2005], we do not come to the same conclusions. We examine the passivization phenomenon, comparing the structures side by side. We observe that the transformation leaves the semantic structure almost intact but causes a major reorganization on the deep syntactic layer. In FGD, the situation in the tectogrammatical tree is very similar to that in the MTT semantic structure: all the passivization changes happen on the analytical layer.
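The asymmetry just described can be made explicit in a schematic sketch; the attribute names and values below are our own simplified rendering, not actual PDT or MTT data:

```python
# Active:  'Pes kousl muže.'       (A dog bit a man.)
# Passive: 'Muž byl kousnut psem.' (A man was bitten by a dog.)
# Flat, simplified renderings of the two PDT layers (illustration only).

# Deep (tectogrammatical-like) layer: the ACT/PAT functors stay put;
# only a voice-like attribute on the verb differs between the sentences.
t_active = {"pred": "kousnout", "ACT": "pes", "PAT": "muž", "voice": "act"}
t_passive = {"pred": "kousnout", "ACT": "pes", "PAT": "muž", "voice": "pass"}

# Analytical (surface-syntactic) layer: the subject (Sb) and object (Obj)
# functions are reorganized and an auxiliary 'být' node appears.
a_active = {"Pred": "kousl", "Sb": "pes", "Obj": "muže"}
a_passive = {"Pred": "kousnut", "AuxV": "byl", "Sb": "muž", "Obj": "psem"}

# The deep description differs in a single attribute, while the surface
# description differs in structure: the asymmetry discussed above.
changed_t = {k for k in t_active if t_active[k] != t_passive[k]}
print(sorted(changed_t))  # ['voice']
```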
We conclude that the connection between the tectogrammatical layer and the MTT semantic representation should not be disregarded.

4. State of the Art

It is worth noting that a number of today's state-of-the-art realizers are based on the MTT. The translation system ETAP-3 (Apresian et al. [2003]) and the generator RealPro (Lavoie and Rambow [1997]) are examples of the MTT in practice. Both these systems include a realization component, but since they do not support the Czech language, a direct comparison with our realizer is not possible. AlethGen (Coch [1996]) is an MTT-based generator of texts from non-linguistic data stored in a database or obtained interactively. Unlike our surface realizer, the AlethGen system is domain-specific and also deals with text and sentence planning. The systemic grammar approach (Eggins [2004]) has produced two widespread open-source surface realizers, KPML (Bateman [1997]) and SURGE (Elhadad and Robin [1996]).

The mechanism of grammar rule application is also important. A graph rewriting approach suggested by Mel’čuk [1988] dominates here. Such an approach treats the grammar as a separable resource and needs a nontrivial framework (such as MATE by Bohnet and Wanner [2001]) for its processing. Our grammar of Czech is 'hardwired': encoded in the Perl programming language and not available for immediate reuse. It is, however, modularized and uses pluggable resources. The procedural design allows quick prototyping and highlights the natural order of operations.

5. Preliminary Evaluation

We list sample output sentences here to provide a more concrete notion of the realizer's performance. The O lines contain the original PDT 2.0 sentence, the B lines present the baseline output (just linearized input trees), and the R lines represent the automatically realized sentences.

(1) O: Trvalo to až do roku 1928, než se tento problém podařilo překonat.
    B: trvat až rok 1928 podařit se tento problém překonat
    R: Trvalo až do roku 1928, že se podařilo tento problém překonat.

(2) O: Stejně tak si je i adresát výtky podle ostrosti a výšky tónu okamžitě jist nejen tím, že jde o něj, ale i tím, co skandál vyvolalo.
    B: stejně tak být i adresát výtka ostrost a výška tón okamžitý jistý nejen jít ale i skandál vyvolat co
    R: Stejně tak je i adresát výtky podle ostrosti a podle výšky tónu okamžitě jistý, nejen že jde o něj, ale i co skandál vyvolalo.

The sequence of annotation and surface realization is treated as if it were a Czech-to-Czech translation. We measure the difference between the original sentence and the realized sentence by means of the BLEU score (Papineni et al. [2001]). We are not aware of any other system capable of generating the same set of evaluation sentences; the Czech language is not supported by the existing range of surface realizers. Because of this limitation, we compare ourselves with the baseline. When evaluating the realization system on 4,700 sentences from the PDT 2.0 evaluation data, the obtained BLEU score is 0.478 (with the theoretically possible maximum for a realization problem being 1). This result seems very promising; moreover, the obtained score would be even higher if more alternative reference translations were available. Note that the baseline solution reaches only 0.03 on the same data.

Acknowledgments. The present work was supported by projects 1ET101120503, 1ET201120505 and by the Charles University Grant Agency under Contract 7643/2007.

References

Apresian, J., Boguslavsky, I., Iomdin, L., Lazursky, A., Sannikov, V., Sizov, V., and Tsinman, L., ETAP-3 Linguistic Processor: a Full-Fledged NLP Implementation of the MTT, MTT 2003, First International Conference on Meaning-Text Theory, pp. 16–18, 2003.
Bateman, J., Enabling technology for multilingual natural language generation: the KPML development environment, Natural Language Engineering, 3, 15–55, 1997.
Bateman, J. and Zock, M., Natural language generation, Oxford Handbook of Computational Linguistics, pp. 284–304, 2003.
Bohnet, B. and Wanner, L., On Using a Parallel Graph Rewriting Grammar Formalism in Generation, Proceedings of the 8th European Natural Language Generation Workshop at the Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001.
Coch, J., Overview of AlethGen, Demonstrations and Posters of the Eighth International Natural Language Generation Workshop (INLG96), pp. 25–28, 1996.
Eggins, S., An Introduction to Systemic Functional Linguistics, New York, 2004.
Elhadad, M. and Robin, J., An overview of SURGE: A reusable comprehensive syntactic realization component, Eighth International Natural Language Generation Workshop, Demonstrations and Posters, pp. 1–4, 1996.
Hajič, J., Kuboň, V., and Hric, J., Česílko – an MT system for closely related languages, pp. 7–8, 2000.
Hajič, J., et al., Prague Dependency Treebank 2.0, Linguistic Data Consortium, CAT LDC2006T01, ISBN 1-58563-370-4, 2006.
Halliday, M. and Martin, J., Readings in Systemic Linguistics, Batsford Academic and Educational, 1981.
Lavoie, B. and Rambow, O., RealPro – a fast, portable sentence realizer, Proceedings of the Conference on Applied Natural Language Processing (ANLP97), 1997.
Manning, C. and Schütze, H., Foundations of Statistical Natural Language Processing, MIT Press, 1999.
Mel’čuk, I., Dependency Syntax: Theory and Practice, State University of New York Press, 1988.
Pan, S. and McKeown, K., Integrating language generation with speech synthesis in a concept to speech system, Concept to Speech Generation Systems, pp. 23–28, 1997.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J., Bleu: a Method for Automatic Evaluation of Machine Translation, Tech. rep., IBM, 2001.
Reiter, E. and Dale, R., Building applied natural language generation systems, Natural Language Engineering, 3, 57–87, 1997.
Sgall, P., Generativní popis jazyka a česká deklinace, Academia, 1967.
Sgall, P., Hajičová, E., and Panevová, J., The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, D. Reidel Publishing Company, Dordrecht, 1986.
Žabokrtský, Z., Resemblances between Meaning-Text Theory and Functional Generative Description, in Proceedings of the 2nd International Conference of Meaning-Text Theory, edited by Ju. D. Apresjan and L. L. Iomdin, pp. 549–557, Slavic Culture Languages Publishers House, Moscow, Russia, June 23-25, 2005.