Rich bitext projection features for parse reranking

Morphosyntactic correspondence: a
progress report on bitext parsing
Alexander Fraser, Renjing Wang, Hinrich Schütze
Institute for NLP
University of Stuttgart
INFuture2009: Digital Resources and Knowledge Sharing
Nov 4th 2009, Zagreb
Outline
 The Institute for Natural Language Processing
at the University of Stuttgart
 Bitext parsing
 Using morphosyntactic correspondence
IfNLP Stuttgart
 The Institute for Natural Language Processing (IfNLP/IMS) at
the University of Stuttgart
 Dogil (Phonetics and Speech)
 Large department
 Kuhn/Rohrer (LFG syntax and semantics)
 Cahill (LFG generation)
 Heid (Terminology extraction, morphology)
 Padó (Semantics, lexical semantics)
 Schütze (Statistical NLP and Information Retrieval)
 More on next slide
IfNLP – Statistical NLP Group
 Hinrich Schütze (director since 2004)
 Bernd Möbius – Speech recognition and synthesis
 Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar)
 Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics
 Michael Walsh – Speech, exemplar theoretic syntax
 Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval
 General department areas of research
 New statistical NLP models and methods
 Semi-supervised and active learning
 Cognitive/linguistic representation models
 Applied to: NLP, retrieval, MT, speech, e-learning, …
IfNLP - Partnerships
 Partnerships
 Stuttgart: large projects with linguistics, computer science, EE signal
processing, high performance computing
 Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing)
 International: large French-led European project (6 universities, 4
industrial partners), collaborations on South African languages,
Edinburgh, CLARIN
 Industrial: various projects with publishers (many focusing on
terminology)
Outline
 The Institute for Natural Language Processing
at the University of Stuttgart
 Bitext parsing
 Using morphosyntactic correspondence
What is bitext parsing?
 Bitext: a text and its translation
 Sentences and their translations are aligned
 Sometimes called a parallel corpus
 Syntactic parsing: automatically find the syntactic structure of a sentence
(syntactic parse)
 Bitext parsing: automatically find the syntactic structure of the parallel
sentences in a bitext
 We will use the complementarity of the syntax of the two languages to obtain
improved parses
Motivation for bitext parsing
 Many advances in syntactic parsing come from better modeling
 But the overall bottleneck is the size of the treebank
 Our research asks a different question:
 Where can we (cheaply) obtain additional information, which helps to
supplement the treebank?
 A new information source for resolving ambiguity is a translation
 The human translator understands the sentence and disambiguates for us!
 Our research goal was to build large databases of improved parses to help
establish preferences for difficult phenomena like PP-attachment
Clause attachment ambiguity
Parse 1: high attachment (wrong)
Parse 2: low attachment (correct)
Not ambiguous in German
 Number agreement disambiguates
 FRAU (woman) and HATTE (had) agree
 Unambiguous low attachment
Parse reranking of bitext
 Goal: improve English parsing accuracy
 Parse English sentence, obtain list of 100 best parse candidates
 Parse German sentence, obtain single best parse
 Determine the correspondence of German to English words using a word
alignment
 Calculate syntactic divergence of each English parse candidate and the
projection of the German parse
 Choose probable English parse candidate with low syntactic divergence
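A minimal sketch of the divergence step, under strong simplifying assumptions: each parse is reduced to a set of inclusive word spans, German spans are projected onto English through the word alignment, and divergence is counted as the number of crossing brackets. The crossing-bracket count is only a stand-in for the features used in the actual work, and all names (`project_span`, `divergence`, `pick_candidate`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) word indices


def project_span(span: Span, alignment: Dict[int, Set[int]]) -> Optional[Span]:
    """Map a German span to the smallest English span covering its aligned words."""
    targets = [e for g in range(span[0], span[1] + 1) for e in alignment.get(g, ())]
    if not targets:
        return None  # no aligned words: nothing to project
    return (min(targets), max(targets))


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap without one containing the other."""
    return a[0] < b[0] <= a[1] < b[1] or b[0] < a[0] <= b[1] < a[1]


def divergence(english: Sequence[Span], german: Sequence[Span],
               alignment: Dict[int, Set[int]]) -> int:
    """Count English brackets that cross a projected German bracket."""
    projected = []
    for span in german:
        p = project_span(span, alignment)
        if p is not None:
            projected.append(p)
    return sum(crosses(e, p) for e in english for p in projected)


def pick_candidate(candidates: List[Sequence[Span]], german: Sequence[Span],
                   alignment: Dict[int, Set[int]]) -> int:
    """Index of the candidate with the fewest crossings.

    Ties go to the lower index, i.e. the candidate the parser ranked higher.
    """
    scores = [divergence(c, german, alignment) for c in candidates]
    return min(range(len(candidates)), key=lambda i: (scores[i], i))


# Toy data (invented): six words, identity alignment.
alignment = {i: {i} for i in range(6)}
german_spans = [(0, 5), (2, 5)]
candidates = [
    [(0, 5), (0, 3), (4, 5)],  # bracket (0, 3) crosses the projected (2, 5)
    [(0, 5), (0, 1), (2, 5)],  # consistent with the German bracketing
]
best = pick_candidate(candidates, german_spans, alignment)
```

In this toy case the second candidate is selected because none of its brackets cross the projected German brackets. The reranker described on these slides also takes the baseline parser probability into account rather than relying on divergence alone.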
Measuring syntactic divergence
 Define features to capture different (overlapping) aspects of
syntactic divergence. Functions of:
 Candidate English parse e
 German parse g
 Word alignment a
 Combine in log-linear model
$$P(e \mid g) = \frac{\exp \sum_m \lambda_m h_m(g, e, a)}{\sum_{e'} \exp \sum_m \lambda_m h_m(g, e', a)}$$
 Discriminatively train λ parameters to maximize parsing
accuracy on a training set (minimum error rate training)
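As a sketch of how the log-linear model on this slide scores an n-best list, assume each candidate English parse has already been turned into a vector of feature values h_m(g, e, a). The function names, feature values and weights below are invented for illustration; in the work described here the weights come from minimum error rate training.

```python
import math
from typing import List, Sequence


def loglinear_posteriors(features: Sequence[Sequence[float]],
                         weights: Sequence[float]) -> List[float]:
    """Normalized P(e | g) for every candidate in the n-best list."""
    scores = [sum(w * h for w, h in zip(weights, hs)) for hs in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # the sum over candidates in the denominator
    return [x / z for x in exps]


def best_candidate(features: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> int:
    """Index of the candidate with the highest posterior."""
    probs = loglinear_posteriors(features, weights)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy n-best list (invented values): h_1 = parser log-probability,
# h_2 = number of brackets crossing the projected German parse.
features = [[-10.0, 1.0], [-10.5, 0.0]]
weights = [1.0, -1.0]  # hypothetical lambdas
probs = loglinear_posteriors(features, weights)
choice = best_candidate(features, weights)
```

Because the normalizer is shared by all candidates, picking the best parse only requires the weighted feature sum; the normalization matters when the posteriors themselves are needed.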
Rich bitext projection features
 Defined 36 features by looking at common English parsing errors
 No monolingual features, except baseline parser probability
 General features
 Is there a probable label correspondence between German and the
hypothesized English parse?
 How expected is the size of each constituent in the hypothesized
English parse given the German parse?
 Specific features
 Are coordinations realized identically?
 Is the NP structure the same?
 Mix of probabilistic and heuristic features
Training
 Use BitPar syntactic forest parser
 English BitPar trained on Penn Treebank
 German BitPar trained on Tiger Treebank
 Probabilistic feature functions built using large parallel text
(Europarl)
 Weights on feature functions (lambda vector) trained on
portion of the Penn Treebank together with its translation into
German
 Minimum error rate training using F score
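The training objective is the F score of the selected parses against the treebank. A minimal sketch of labeled bracketing F1 for one sentence follows, assuming each parse is given as a list of (label, start, end) constituents; the function name is hypothetical and the conventions of standard parser evaluation tools are not reproduced here.

```python
from collections import Counter
from typing import Iterable, Tuple

Bracket = Tuple[str, int, int]  # (label, start, end)


def bracket_f1(candidate: Iterable[Bracket], gold: Iterable[Bracket]) -> float:
    """Labeled bracketing F1 of a candidate parse against a gold parse."""
    cand_counts = Counter(candidate)
    gold_counts = Counter(gold)
    # Multiset intersection: a bracket matches at most as often as it occurs in gold.
    matched = sum((cand_counts & gold_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(cand_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy parses (invented): only the S bracket matches.
gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)]
candidate = [("S", 0, 5), ("NP", 0, 3), ("VP", 4, 5)]
score = bracket_f1(candidate, gold)
```

Minimum error rate training then searches for the lambda vector whose selected candidates maximize this score over the training set.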
Reranking English parses
 Difficult task
 German is difficult to parse
 Our knowledge source, the German parser, is out-of-domain (poor performance)
 Baseline English parser we are trying to improve is in-domain (good performance)
 Test set has long sentences
 Result: 0.70% F1 improvement on test data (statistically
significant)
New results
 Reranking German parses
 We needed German gold standard parses (and English translations)
 Sebastian Padó has made a small parallel treebank for Europarl available
 No engineering on German yet
 We are using the same syntactic divergence features which were designed to
improve English parsing
 There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the
cat chases the mouse”)
 But easier task because the parser we are trying to improve is weaker (German is
hard to parse, Europarl is out of domain)
 Currently a 2.3% F1 improvement; we think this can be further improved
Summary: bitext parsing
 I showed you an approach for bitext parsing
 Reranking the parses of English to minimize syntactic divergence with
an automatically generated German parse
 I then showed our first results for reranking German parses
using a single English parse
 The approach we used for this kind of morphosyntactic
correspondence is more general than just parse reranking
 Machine translation involves morphosyntactic correspondence
 And this is where we are interested in looking at Croatian
Outline
 The Institute for Natural Language Processing
at the University of Stuttgart
 Bitext parsing
 Using morphosyntactic correspondence
Morphosyntactic processing
 I am co-PI of a new IfNLP project funded by the DFG (German Science
Foundation)
 Project: morphosyntactic modeling for statistical machine translation
(SMT)
 SMT research, up until recently, has been dominated by translation into
English
 English expresses a lot of information through word order, very little through
inflection
 Approaches to translating morphologically rich languages to English
are preprocessing based
Present: linguistic preprocessing
 Linguistic preprocessing for SMT (stat. machine translation)
 From: freer syntax, morphologically rich language
 To: rigid syntax, morphologically poor language
 Existing examples: German to English, Czech to English
Present: linguistic preprocessing
 How this works
 Produce morphosyntactic analysis of German (or Czech)
 Reorder words in the German/Czech sentence to be in English order
 Reduce morphological inflection (for instance, remove case marking,
remove all agreement on adjectives, etc)
 For Czech: insert pseudo-words (e.g. indicate PRO-drop pronouns)
 Use statistics on this “simplified” German or Czech to map directly to
English using SMT
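A toy sketch of two of the preprocessing steps above, assuming the German sentence already comes with lemmas and STTS part-of-speech tags from a morphosyntactic analysis. The single reordering rule (move a clause-final participle next to its auxiliary) and the choice of tags to lemmatize are illustrative assumptions, not the rules of the Stuttgart system; the function names and the hand-written analysis are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Token(NamedTuple):
    form: str   # inflected surface form
    lemma: str  # base form from the morphological analysis
    pos: str    # STTS part-of-speech tag


def move_participle_after_auxiliary(tokens: Sequence[Token]) -> List[Token]:
    """Move the last past participle directly after the first finite auxiliary."""
    aux = next((i for i, t in enumerate(tokens) if t.pos == "VAFIN"), None)
    part = next((i for i in range(len(tokens) - 1, -1, -1)
                 if tokens[i].pos == "VVPP"), None)
    if aux is None or part is None or part <= aux + 1:
        return list(tokens)  # nothing to move
    reordered = list(tokens)
    participle = reordered.pop(part)
    reordered.insert(aux + 1, participle)
    return reordered


def reduce_inflection(tokens: Sequence[Token],
                      reduce_pos: Sequence[str] = ("ART", "ADJA")) -> List[str]:
    """Replace articles and attributive adjectives by their lemma, dropping case marking."""
    return [t.lemma if t.pos in reduce_pos else t.form for t in tokens]


# Hand-written analysis of "Ich habe dem alten Mann geholfen"
# ("I have helped the old man"), for illustration only.
sentence = [
    Token("Ich", "ich", "PPER"),
    Token("habe", "haben", "VAFIN"),
    Token("dem", "der", "ART"),
    Token("alten", "alt", "ADJA"),
    Token("Mann", "Mann", "NN"),
    Token("geholfen", "helfen", "VVPP"),
]
simplified = reduce_inflection(move_participle_after_auxiliary(sentence))
```

The resulting token sequence follows English word order and no longer carries dative marking on the article and adjective, which is the kind of "simplified" German the SMT system is then trained on.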
Present: linguistic preprocessing
 How well does this work?
 German to English SMT with linguistic preprocessing
(Stuttgart system)
 Results from 2008 ACL workshop on machine translation (extensive
human evaluation)
 The only system restricted to the organizers’ data that was competitive with:
 The best system of 5 rule-based MT systems
 Saarbrücken hybrid rule-based/SMT system
 Google Translate, which does not use linguistic preprocessing but does
use vastly more data
Future: modeling
 What about translating from English to German or to Slavic
languages?
 Problem: morphological generation is more difficult
 It is easy to reduce multiple inflections to one (for instance, stemming)
 Harder to learn to generate the right inflection
Future: modeling
 Current work on morphological generation
 Work at Charles University in Prague on Czech
 Tectogrammatical representation is not (yet) competitive with simple
statistics (little explicit knowledge of morphology or syntax)
 Best English to German SMT systems also use little or no
morphological knowledge
 And they are much worse than rule-based English to German systems
 Challenge: to use morphosyntactic knowledge with statistical
approaches requires more than just linguistic preprocessing
 morphosyntactic modeling
Morphosyntactic correspondence
 In fact, all multilingual problems involve morphosyntactic
correspondence:
 If we have a source parse tree, and source text, and we would like a
target text, this is machine translation
 If we have a source parse tree, source text and target text, and we
would like a target parse, this is bitext parsing
 If we would like to know which word in the target text is a translation
of a particular word in the source text and we use morphosyntactic
analysis, this is syntactic word alignment
 The same thinking can be used for cross-lingual information retrieval
 Very relevant when one of the languages is morphologically rich
Conclusion
 I introduced the IfNLP Stuttgart
 I presented a new approach to improving parsing using morphosyntactic
correspondence: bitext parsing
 I discussed the general challenge of using morphosyntactic correspondence,
focusing on statistical machine translation
 Biggest challenge is translating into freer word order, morphologically rich languages (e.g., German
and particularly Slavic languages)
 We are interested in the challenge of building systems to translate to Croatian
 To do this: we need partners who are working on Croatian analysis!
 We also request that you think about multilingual applications when producing Croatian
NLP resources
 The type of approach I showed for bitext parsing is useful for other multilingual
applications
Thank you!
Statistical Approach
 Using statistical models
 Create many alternatives, called hypotheses
 Give a score to each hypothesis
 Find the hypothesis with the best score through search
 Disadvantages
 Difficulties handling structurally rich models (math and computation)
 Need data to train the model parameters
 Difficult to understand decision process made by system
 Advantages
 Avoid hard decisions
 Speed can be traded for quality, no all-or-nothing
 Works better in the presence of unexpected input
 Learns automatically as more data becomes available
Modified from Vogel
Morphosyntactic knowledge
 We use: morphological analyzers & treebanks, which are combined in
parsing models learned from treebanks
 English models have little morphological analysis (suffix analysis to determine
POS for unknown words)
 German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological
Analyzer)
 Given inflected form, SMOR returns possible fine-grained POS tags
 E.g., for nouns/adjectives: POS, case, gender, number, definiteness
 BitPar puts possible analyses in the chart, and disambiguates
 Slavic languages require even more morphological knowledge than German
Transferring syntactic knowledge
 Need knowledge source!
 English syntactic parser
 About 90% bracketing accuracy
 Mapping
 Requires bitext
 Work discussed here uses German/English Europarl
(European Parliament Proceedings)
 Resource for Croatian: Acquis Communautaire
 Automatically generated word alignment
Additional details in the paper
 Formalization of bitext parsing as a parse reranking task
 Definitions of bitext feature functions
 Analysis of feature functions through feature selection
 Comparison of MERT (minimum error rate training) with SVMRank