WDS'07 Proceedings of Contributed Papers, Part I, 151–156, 2007. ISBN 978-80-7378-023-4 © MATFYZPRESS

Sentence Synthesis in Machine Translation

J. Ptáček
Charles University, Institute of Formal and Applied Linguistics, Prague, Czech Republic.

Abstract. We report work in progress on a complex system generating Czech sentences that express the meaning of input syntactic-semantic structures. Such a component is usually referred to as a realizer in the domain of Natural Language Generation. Existing realizers usually take advantage of a background linguistic theory. We introduce the Functional Generative Description, the framework of our choice, conceived in the 1960s by Petr Sgall. This language theory lays out the foundations of the formalism in which our input syntactic-semantic structures are specified. The structure definition was further elaborated and refined during the annotation of the Prague Dependency Treebank, now available in its second version. A section of the paper is devoted to the description of another theoretical framework suitable for the task of Natural Language Generation, the Meaning-Text Theory. We explore state-of-the-art realizers deployed in real-life applications, describe the common architecture of a generation system, and highlight the strengths and weaknesses of our approach. Finally, the preliminary output of our surface realizer is compared against a baseline solution.

1. Introduction

In this paper we deal with the following problem. Let there be a meaning to be conveyed to the reader. The meaning is specified by means of an abstract, semantically oriented data structure. We aim to build a device mapping every given data structure to the corresponding Czech sentence. The resulting sentence should be grammatically correct and has to convey the same meaning as specified by the input structure. The desired mapping device is referred to as a (linguistic, surface, or syntactic-and-morphological) realizer (as in Reiter and Dale [1997]; Bateman and Zock [2003]).
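The required mapping can be made concrete with a toy sketch (in Python; the node attributes are our own simplification, not the actual input format). A dependency tree comes in, and a naive baseline simply concatenates the lemmas in tree order, which is essentially the baseline we compare against in Section 5:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A highly simplified dependency-tree node (illustration only)."""
    lemma: str
    children: list = field(default_factory=list)

def linearize(node: Node) -> str:
    """Baseline 'realization': concatenate lemmas in depth-first order.

    A real realizer must additionally handle agreement, word order,
    auxiliaries, prepositions, vocalization, and punctuation.
    """
    parts = [node.lemma]
    for child in node.children:
        parts.append(linearize(child))
    return " ".join(parts)

# A toy tree yielding a "trvat rok 1928" style baseline output
tree = Node("trvat", [Node("rok", [Node("1928")])])
print(linearize(tree))  # trvat rok 1928
```

The gap between such a lemma string and a grammatical Czech sentence is exactly what the realizer described below has to bridge.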
During the computation of an output sentence, several linguistic phenomena are addressed. In order to deliver an acceptable sentence, the realizer has to deal with syntactic constructions, auxiliary words, prepositions, conjunctions, agreement, word order, vocalization, and punctuation. Generating word forms, i.e. the morphology, is usually treated separately: either the realizer calls a morphological module through a public interface, or the surface realizer prepares data to be processed by a morphological tool afterwards.

2. Motivation

A surface realizer can be used in all automatic systems where textual messages are delivered to the user. The obtained texts, enriched with additional metadata, can also serve as an input to speech synthesis (Pan and McKeown [1997]).

Sometimes the set of produced messages is not large, or the messages follow a simple pattern. Under such circumstances it is advantageous to prepare the messages in advance or to deploy a simple template system instead of a fully-fledged generation system. The mail-merge feature of the Microsoft Word text processor (first introduced in version 6.0 and still shipped with the current version) is an example of a commonly used template system generating business letters out of a template and a database of recipients.

In numerous applications it is unavoidable to assemble the message on the fly. A feasible solution is to specify the content of the message using a more abstract notation and to integrate a surface realizer module (such as ours) in order to obtain the final textual message. This method of text production is referred to as Natural Language Generation (NLG in short). Because our work fits into this field of study, we describe NLG later in greater detail. Here we just note that the NLG approach demands more linguistic, engineering, and coding expertise to deploy; in return, the produced texts are of higher quality than mail-merged output.
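The mail-merge style of templating mentioned above can be sketched in a few lines; note that no grammar knowledge is involved, which is its key limitation. The template text and recipient fields below are invented for illustration:

```python
from string import Template

# A minimal mail-merge sketch: one fixed template, one database of
# recipients. (Template text and field names are invented.)
letter = Template("Dear $name,\nyour order #$order will arrive on $date.")

recipients = [
    {"name": "J. Novák", "order": "1024", "date": "May 5"},
    {"name": "E. Dvořák", "order": "1025", "date": "May 6"},
]

letters = [letter.substitute(r) for r in recipients]
print(letters[0].splitlines()[0])  # Dear J. Novák,
```

Because the surrounding text is fixed, any change in number, gender, or case of a filled-in value would make the output ungrammatical, particularly in a highly inflected language such as Czech; this is where a fully-fledged realizer pays off.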
Moreover, an NLG system extends well when new types of messages need to be communicated.

We see Machine Translation (MT in the sequel) as an automatic system producing textual messages as well. However, the MT problem shows additional complexity when compared to NLG. Under the NLG scenario, the content of the produced message is already stored in the system in a machine-readable format. This is not the case in Machine Translation: source-language analysis has to be performed in order to obtain the message, which is in the end presented in the target language by a surface realization component, as illustrated in Figure 1.

Figure 1. Two use cases for a surface realization component: the third step of Machine Translation and the sixth step of Natural Language Generation.

2.1. Natural Language Generation

In this section we examine the six steps (listed in Figure 1) that a common NLG system performs in order to produce a natural language text from a non-linguistic representation of information. Text generation is illustrated with a hypothetical train information system adopted from Reiter and Dale [1997]. The purpose of the system is to respond to queries like "When is the next train to Glasgow?" with appropriate answers. Based on a daily schedule, the system may reply: "There are 20 trains each day from Aberdeen to Glasgow. The next train is the Caledonian Express; it leaves Aberdeen at 10am. It is due to arrive in Glasgow at 1pm."

Content determination comes first. Parsing the question and selecting the relevant data is highly application-dependent. However, we can assume that the selected data will be organized into messages. Within messages we distinguish entities, concepts, and relations. Important nouns in the domain of interest (such as specific trains, times, and places) are the entities. Similarly, both domain-specific verbs (to depart, to arrive) and common relations (identity) are relations we are interested in.
Concepts are shifters, deictic expressions for entities (e.g., the next train) denoting important properties of entities. The messages are usually stored as attribute-value matrices. Each matrix corresponds to a relation that holds between its arguments, which are either entities or concepts.

When the set of messages (the matrices) is determined, we proceed to Discourse planning. A hierarchy is introduced among the messages, reflecting the order and discourse relations in the emerging story. The result of discourse planning therefore has the shape of a tree.

As the name suggests, in the Sentence aggregation phase sentence boundaries are established. Messages that share the same constituents are grouped together into a single sentence. Though this phase is not mandatory, it has a positive impact on the fluency of the resulting text.

Particular lexemes are chosen in two subsequent steps, Lexicalization and Referring expression generation. The first step takes care of words and phrases expressing domain concepts and relations. The second task is to name the entities; pronominalization is a subtask of entity naming.

When all the listed steps are completed, only Linguistic realization remains. Here the knowledge of the target language grammar is applied to form correct sentences. There is no wide agreement on the particular data format that serves as input for surface realization. Applications differ according to the background linguistic theory they build on. We adhere to the Functional Generative Description (FGD in short) (Sgall [1967], Sgall et al. [1986]), specifically to the tectogrammatical description of the sentence that is used in the Prague Dependency Treebank (PDT in the sequel) (Hajič et al. [2006]).

2.2. Machine Translation

A more prominent motivation that drives our effort to implement a surface realization component lies in Machine Translation. There are several methods used today.
Shallow MT based on morphosyntactic analysis is well suited for translation between related languages (Hajič et al. [2000]). The noisy-channel model and phrase translation are exploited by statistical means.1 Statistical phrase-based MT systems currently achieve top results.2 Our component is of use in so-called transfer-based MT, which breaks the work into three well-defined, separate parts (analysis, transfer, and realization). Statistical and non-statistical approaches can be mixed in a transfer-based MT. The idea is to perform a syntactic (perhaps even semantic) analysis on the side of the source language in order to obtain a more abstract description of the translated meaning. Presumably, the source- and target-language abstract descriptions will not be as distant as the original sequences of word forms; thus the transfer is supposed to be easier to perform than in direct translation. The obtained data structure is then mapped into the resulting sentence by a surface realizer component for the target language.

Our surface realizer is currently tested in a transfer-based MT environment, and that presents a valuable source of feedback. The quality of the output (as measured by the BLEU score described later) steadily improves thanks to the reported problems. The experiment makes use of a statistical tree-to-tree rewriter. As a consequence, the trees we get on input do not necessarily fulfill all constraints imposed on tectogrammatical trees in the Prague Dependency Treebank. These automatically analyzed data serve as an example of real-world input, and testing on them improves the robustness of the surface realizer.

3. Theoretical Frameworks

For the task of NLG, some language theories are more directly applicable than others. A Chomskian approach is geared toward the analysis of a sentence and deciding whether the sentence in question conforms to a grammar. On the other hand, works dealing with the functions of language and the corresponding surface forms bring more insight into the process of NLG.
Systemic Functional Linguistics (Halliday and Martin [1981]) and the Functional Generative Description (Sgall [1967]) (FGD in short) both represent this standpoint. Also the Meaning-Text Theory (Mel’čuk [1988]) describes the transition from a semantic structure into a sequence of phonemes. So far we have studied two theoretical frameworks making allowance for NLG, the Functional Generative Description and the Meaning-Text Theory.

1 Manning and Schütze [1999] give an introduction to the field of statistical MT.
2 See the NIST evaluation: http://www.nist.gov/speech/tests/mt/.

3.1. Functional Generative Description (and the Prague Dependency Treebank)

The functional and dependency-based approach founded by Sgall [1967], Sgall et al. [1986] uses a number of layers to describe a language system. We are particularly interested in a recent formalism that was applied to annotate a considerable amount of text, the Prague Dependency Treebank 2.0 (abbreviated as PDT). In this formalism there are three layers of description: the morphological, analytical, and tectogrammatical layers. Our realizer is capable of generating surface sentences from tectogrammatical trees.

The PDT data consist of 7,129 Czech documents containing 116,065 manually annotated sentences. On the morphological layer, each token is lemmatized and tagged. On the analytical layer, a tree manifesting surface-syntactic relations is built. The most abstract description of the sentence is given on the tectogrammatical layer. The tectogrammatical annotation is done for 44% of the sentences in PDT 2.0 and was achieved in four stages:

• building the dependency tree structure of autosemantic words in the sentence, labeling the dependency relations, and annotating valency,
• topic/focus annotation,
• annotation of coreference (i.e., relations between nodes referring to the same entity),
• annotation of grammatemes and related attributes.
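The four annotation stages above can be summarized by the kind of record a single tectogrammatical node carries. The following sketch is our own simplification; the attribute names only loosely follow the PDT conventions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TNode:
    """Simplified tectogrammatical node (attributes loosely follow PDT)."""
    t_lemma: str                      # lemma of the autosemantic word
    functor: str                      # dependency relation, e.g. ACT, PAT
    tfa: str = "f"                    # topic/focus: 't', 'c', or 'f'
    grammatemes: dict = field(default_factory=dict)  # e.g. tense, number
    coref: Optional["TNode"] = None   # link to a coreferential antecedent
    children: list = field(default_factory=list)

# A toy clause: 'problém' as Patient of 'překonat' (cf. example (1) below)
verb = TNode("překonat", "PRED", grammatemes={"tense": "ant"})
obj = TNode("problém", "PAT", tfa="t", grammatemes={"number": "sg"})
verb.children.append(obj)
print(verb.children[0].functor)  # PAT
```

The realizer's job is to expand such records back into inflected word forms together with the auxiliary material that the tectogrammatical layer deliberately omits.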
The tectogrammatical tree contains only autosemantic words, i.e., words with full lexical meaning. Auxiliary verbs, subordinating conjunctions, and prepositions are omitted, and their meanings are noted as properties of the governing autosemantic words.

3.2. Meaning-Text Theory

The comparably old Meaning-Text Theory (MTT) by Mel’čuk [1988] addresses the encoding of meanings in texts. It shares the dependency and stratificational viewpoint with the FGD framework. There are seven layers of representation: one semantic layer and surface/deep pairs of the syntactic, morphological, and phonetic representations. Each of the seven layers consists of several structures.

The semantic structure (part of the semantic representation) is a network of semantemes. Directed edges point to arguments, i.e., semantemes filling the valency slots of predicates. The notion of valency is similar to valency on the tectogrammatical layer in FGD, though the exact definitions differ significantly. The MTT defines two more structures concerning communicative and rhetorical features as part of the semantic representation.

The main structure of the next (deep syntactic) layer is usually compared to the tree structure of the tectogrammatical layer in FGD. Both are trees of autosemantic (full-meaning) words. Morphological attributes of nodes imposed by government and agreement are excluded in both frameworks. And finally there is an additional structure encoding coreferential relations.

In spite of these similarities, mentioned in Žabokrtský [2005], we do not come to the same conclusions. We examine the passivization phenomenon, comparing the structures side by side. We observe that the transformation leaves the semantic structure almost intact but causes a major reorganization on the deep syntactic layer. In FGD, the situation in the tectogrammatical tree is very similar to that in the MTT semantic structure: all the passivization changes happen on the analytical layer.
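The asymmetry just described can be made explicit in a schematic sketch; the attribute names and values below are our own simplified rendering, not actual PDT or MTT data:

```python
# Active:  'Pes kousl muže.'       (A dog bit a man.)
# Passive: 'Muž byl kousnut psem.' (A man was bitten by a dog.)
# Flat, simplified renderings of the two PDT layers (illustration only).

# Deep (tectogrammatical-like) layer: the ACT/PAT functors stay put;
# only a voice-like attribute on the verb differs between the sentences.
t_active = {"pred": "kousnout", "ACT": "pes", "PAT": "muž", "voice": "act"}
t_passive = {"pred": "kousnout", "ACT": "pes", "PAT": "muž", "voice": "pass"}

# Analytical (surface-syntactic) layer: the subject (Sb) and object (Obj)
# functions are reorganized and an auxiliary 'být' node appears.
a_active = {"Pred": "kousl", "Sb": "pes", "Obj": "muže"}
a_passive = {"Pred": "kousnut", "AuxV": "byl", "Sb": "muž", "Obj": "psem"}

# The deep description differs in a single attribute, while the surface
# description differs in structure: the asymmetry discussed above.
changed_t = {k for k in t_active if t_active[k] != t_passive[k]}
print(sorted(changed_t))  # ['voice']
```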
We conclude that the connection between the tectogrammatical layer and the MTT semantic representation should not be disregarded.

4. State of the Art

It is worth noting that a number of today's state-of-the-art realizers are based on the MTT. The translation system ETAP-3 (Apresian et al. [2003]) and the generator RealPro (Lavoie and Rambow [1997]) are examples of the MTT in practice. Both these systems include a realization component, but since they do not support the Czech language, a direct comparison with our realizer is not possible. AlethGen (Coch [1996]) is an MTT-based generator of texts from non-linguistic data stored in a database or obtained interactively. Unlike our surface realizer, the AlethGen system is domain-specific and also deals with text and sentence planning. The systemic grammar approach (Eggins [2004]) has produced two widespread open-source surface realizers, KPML (Bateman [1997]) and SURGE (Elhadad and Robin [1996]).

The mechanism of grammar rule application is also important. A graph rewriting approach suggested by Mel’čuk [1988] dominates here. Such an approach treats the grammar as a separable resource and needs a nontrivial framework (such as MATE by Bohnet and Wanner [2001]) for its processing. Our grammar of Czech is 'hardwired': encoded in the Perl programming language and not available for immediate reuse. It is, however, modularized and uses pluggable resources. The procedural design allows quick prototyping and highlights the natural order of operations.

5. Preliminary Evaluation

We list sample output sentences here to provide a more concrete notion of the realizer's performance. The O lines contain the original PDT 2.0 sentence, the B lines present the baseline output (just linearized input trees), and the R lines represent the automatically realized sentences.

(1) O: Trvalo to až do roku 1928, než se tento problém podařilo překonat.
    B: trvat až rok 1928 podařit se tento problém překonat
    R: Trvalo až do roku 1928, že se podařilo tento problém překonat.

(2) O: Stejně tak si je i adresát výtky podle ostrosti a výšky tónu okamžitě jist nejen tím, že jde o něj, ale i tím, co skandál vyvolalo.
    B: stejně tak být i adresát výtka ostrost a výška tón okamžitý jistý nejen jít ale i skandál vyvolat co
    R: Stejně tak je i adresát výtky podle ostrosti a podle výšky tónu okamžitě jistý, nejen že jde o něj, ale i co skandál vyvolalo.

The sequence of annotation and surface realization is treated as if it were a Czech-to-Czech translation. We measure the difference between the original sentence and the realized sentence by means of the BLEU score (Papineni et al. [2001]). We are not aware of any other system capable of generating the same set of evaluation sentences; the Czech language is not supported by the existing range of surface realizers. Because of this limitation, we compare ourselves with the baseline. When evaluating the realization system on 4,700 sentences from the PDT 2.0 evaluation data, the obtained BLEU score is 0.478 (with the theoretically possible maximum for a realization problem being 1). This result seems very promising; moreover, the obtained score would be even higher if more alternative reference translations were available. Note that the baseline solution reaches only 0.03 on the same data.

Acknowledgments. The present work was supported by projects 1ET101120503, 1ET201120505 and by the Charles University Grant Agency under Contract 7643/2007.

References

Apresian, J., Boguslavsky, I., Iomdin, L., Lazursky, A., Sannikov, V., Sizov, V., and Tsinman, L., ETAP-3 Linguistic Processor: a Full-Fledged NLP Implementation of the MTT, MTT 2003, First International Conference on Meaning-Text Theory, pp. 16–18, 2003.
Bateman, J., Enabling technology for multilingual natural language generation: the KPML development environment, Natural Language Engineering, 3, 15–55, 1997.
Bateman, J. and Zock, M., Natural language generation, Oxford Handbook of Computational Linguistics, pp. 284–304, 2003.
Bohnet, B. and Wanner, L., On Using a Parallel Graph Rewriting Grammar Formalism in Generation, Proceedings of the 8th European Natural Language Generation Workshop at the Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001.
Coch, J., Overview of AlethGen, Demonstrations and Posters of the Eighth International Natural Language Generation Workshop (INLG96), pp. 25–28, 1996.
Eggins, S., An Introduction to Systemic Functional Linguistics, New York, 2004.
Elhadad, M. and Robin, J., An overview of SURGE: A reusable comprehensive syntactic realization component, Eighth International Natural Language Generation Workshop, Demonstrations and Posters, pp. 1–4, 1996.
Hajič, J., Kuboň, V., and Hric, J., Česílko – an MT system for closely related languages, pp. 7–8, 2000.
Hajič, J., et al., Prague Dependency Treebank 2.0, Linguistic Data Consortium, CAT LDC2006T01, ISBN 1-58563-370-4, 2006.
Halliday, M. and Martin, J., Readings in Systemic Linguistics, Batsford Academic and Educational, 1981.
Lavoie, B. and Rambow, O., RealPro – a fast, portable sentence realizer, Proceedings of the Conference on Applied Natural Language Processing (ANLP97), 1997.
Manning, C. and Schütze, H., Foundations of Statistical Natural Language Processing, MIT Press, 1999.
Mel’čuk, I., Dependency Syntax: Theory and Practice, State University of New York Press, 1988.
Pan, S. and McKeown, K., Integrating language generation with speech synthesis in a concept to speech system, Concept to Speech Generation Systems, pp. 23–28, 1997.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J., Bleu: a Method for Automatic Evaluation of Machine Translation, Tech. rep., IBM, 2001.
Reiter, E. and Dale, R., Building applied natural language generation systems, Natural Language Engineering, 3, 57–87, 1997.
Sgall, P., Generativní popis jazyka a česká deklinace, Academia, 1967.
Sgall, P., Hajičová, E., and Panevová, J., The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, D. Reidel Publishing Company, Dordrecht, 1986.
Žabokrtský, Z., Resemblances between Meaning-Text Theory and Functional Generative Description, in Proceedings of the 2nd International Conference of Meaning-Text Theory, edited by Ju. D. Apresjan and L. L. Iomdin, pp. 549–557, Slavic Culture Languages Publishers House, Moscow, Russia, June 23-25, 2005.