Kyoto University Participation to the 3rd Workshop on Asian Translation
Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Overview of our submissions
• 2 Systems
• KyotoEBMT
• Example-Based Machine Translation
• Uses Dependency analysis for both source and target side
• Some small incremental improvements over our last year's participation
• KyotoNMT
• Our new implementation of the Neural MT paradigm
• Sequence-to-Sequence model with Attention Mechanism
• As first introduced by (Bahdanau et al., 2015)
• For the tasks:
• ASPEC Ja -> En
• ASPEC En -> Ja
• ASPEC Ja -> Zh
• ASPEC Zh -> Ja
KyotoEBMT
KyotoEBMT Overview
• Example-Based MT paradigm
• Needs a parallel corpus
• Few language-specific assumptions
• but still a few language-specific rules
• Tree-to-Tree Machine Translation
• Maybe the least commonly used variant of x-to-x
• Sensitive to parsing quality of both source and target languages
• Maximizes the chances of preserving information
• Dependency trees (toy example after this slide)
• Less commonly used than Constituent trees
• Most natural for Japanese
• Should contain all important semantic information
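To make the dependency-tree setting a bit more concrete, below is a toy, purely illustrative sketch (plain Python, not KyotoEBMT's actual data structures) of a bunsetsu-level dependency analysis of 私は学生です ("I am a student"), where the phrase 私は depends on the predicate phrase 学生です; all field names are hypothetical.

```python
# Toy illustration only: a bunsetsu-level dependency tree for
# "私は 学生です" ("I am a student"). The phrase 私は ("I") depends on
# the predicate phrase 学生です ("am a student").
# The field names are hypothetical, not KyotoEBMT's actual format.
example_tree = {
    "id": 1,
    "surface": "学生です",   # head phrase (predicate)
    "children": [
        {"id": 0, "surface": "私は", "children": []},  # dependent phrase
    ],
}

def traverse(node, depth=0):
    """Print the dependency tree, dependents indented under their head."""
    print("  " * depth + node["surface"])
    for child in node["children"]:
        traverse(child, depth + 1)

traverse(example_tree)
```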
KyotoEBMT pipeline
• A somewhat classic pipeline (sketched at the end of this slide)
• 1- Preprocessing of the parallel corpus
• 2- Processing of input sentence
• 3- Decoding/Tuning/Reranking
• Tuning and reranking done with kbMira
• seems to work better than PRO for us
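Below is a minimal, purely hypothetical sketch of the shape of this three-step pipeline; none of the function names exist in the actual KyotoEBMT toolkit.

```python
# Hypothetical sketch of the three-stage shape described on this slide.
# Every function below is a made-up stub for illustration only.

def preprocess(parallel_corpus):
    """1- Parse both sides of the parallel corpus, align them, extract examples."""
    return ["<translation examples>"]

def process_input(sentence, examples):
    """2- Parse the input sentence and retrieve/combine matching examples."""
    return "<translation hypergraph>"

def decode_tune_rerank(hypergraph, weights):
    """3- Decode, then rerank the n-best list (tuning/reranking uses kbMira)."""
    return "<best translation>"

weights = "<feature weights tuned with kbMira on the dev set>"
examples = preprocess("<parallel corpus>")
print(decode_tune_rerank(process_input("<input sentence>", examples), weights))
```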
KyotoNMT
KyotoNMT Overview
• Uses the sequence-to-sequence with attention model
• as proposed in (Bahdanau et al., 2015)
• with other subsequent improvements
• UNK-tag replacement (Luong et al., 2015)
• Adam training, sub-word units, …
• Hopefully we can add more original ideas in the future
• Implemented in Python using the Chainer library
• A version is open-sourced under the GPL
Sequence-to-Sequence with Attention (Bahdanau et al., 2015)
[Architecture diagram: the source words (私 / は / 学生 / です) go through source embeddings (size 620) into a bidirectional LSTM encoder (hidden size 1000 per direction); an attention model combines the previous decoder state with the encoder states to produce the current context; the decoder LSTM (size 1000), a maxout layer (size 500) and a softmax over the target vocabulary then predict the new word, which is fed back through the target embeddings (size 620) as the previously generated word.]
• Depending on the experiment, the recurrent units were GRUs, LSTMs, or 2-layer LSTMs; the other values were the same for all experiments
• Source vocabulary size: 30,000 – 200,000
• Target vocabulary size: 30,000 – 50,000
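As a rough reference for the figure above, here is a minimal NumPy sketch of a Bahdanau-style attention step with the sizes shown in the figure; it is only an illustration, not the actual KyotoNMT (Chainer) code, and the size of the attention MLP (500 here) is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sizes taken from the figure: encoder states of size 2*1000 (bidirectional),
# decoder state of size 1000. The attention-MLP size (500) is an assumption.
H, D, A = 2000, 1000, 500
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.01, size=(A, D))   # projects the previous decoder state
W_h = rng.normal(scale=0.01, size=(A, H))   # projects each encoder state
v   = rng.normal(scale=0.01, size=(A,))     # scoring vector

def attention(prev_state, encoder_states):
    """Return the attention weights and the current context vector."""
    # score_j = v . tanh(W_s s_{t-1} + W_h h_j)   (Bahdanau-style MLP attention)
    scores = np.array([v @ np.tanh(W_s @ prev_state + W_h @ h) for h in encoder_states])
    alphas = softmax(scores)                      # attention weights over source positions
    context = (alphas[:, None] * encoder_states).sum(axis=0)  # weighted sum of encoder states
    return alphas, context

# Toy usage: 4 source positions (私 / は / 学生 / です)
encoder_states = rng.normal(size=(4, H))
prev_state = rng.normal(size=(D,))
alphas, context = attention(prev_state, encoder_states)
print(alphas.round(3), context.shape)
```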
Regularization
• Weight Decay
• Choosing a good value seemed quite important
• 1e-6 worked noticeably better than 1e-5 or 1e-7
• Early Stopping
• Keep the parameters with the best loss on dev set
• Or keep the parameters with the best BLEU on dev set
• “best BLEU” works better, but it is even better to ensemble “best BLEU” and “best loss” (sketched after this slide)
• Dropout
• Only used between LSTM layers (when using multi-layer LSTMs)
• 20% dropout
• Noise on target word embeddings
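A minimal sketch of two of these points in plain NumPy (not the actual Chainer-based KyotoNMT code): L2 weight decay folded into the gradient, with plain SGD standing in for the Adam optimizer actually used, and "early stopping" implemented as keeping the parameter snapshots with the best dev loss and the best dev BLEU.

```python
import copy
import numpy as np

WEIGHT_DECAY = 1e-6   # the value that worked best in our experiments

def sgd_step_with_weight_decay(params, grads, lr=0.1):
    """One plain SGD step with L2 weight decay folded into the gradient.
    (SGD is only a stand-in here; the real system used Adam.)"""
    for name in params:
        grads[name] = grads[name] + WEIGHT_DECAY * params[name]  # weight-decay term
        params[name] -= lr * grads[name]

# "Early stopping" as described on the slide: keep the best snapshots.
best = {"loss": (np.inf, None), "bleu": (-np.inf, None)}

def maybe_keep(params, dev_loss, dev_bleu):
    """Remember the parameters with the best dev loss and the best dev BLEU;
    the two snapshots can later be ensembled together."""
    if dev_loss < best["loss"][0]:
        best["loss"] = (dev_loss, copy.deepcopy(params))
    if dev_bleu > best["bleu"][0]:
        best["bleu"] = (dev_bleu, copy.deepcopy(params))
```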
Noise on target word embedding
[Same architecture diagram as before, highlighting the embedding of the previously generated word that is fed back into the decoder.]
• At training time, we always give the correct previous word, but not at translation time
• This part can be the source of cascading errors at translation time
• Idea: add random noise to this embedding at training time, to force the network to not rely too much on this information
• Seems to work (+1.5 BLEU)
• But is it actually because the network became less prone to cascading errors? Or simply a regularization effect?
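A minimal NumPy sketch of the idea: at training time only, Gaussian noise is added to the embedding of the previously generated word before it is fed back to the decoder. The noise scale used here is an arbitrary illustrative value, not the one from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 620                        # target embedding size from the figure
target_embeddings = rng.normal(scale=0.1, size=(30000, EMB_DIM))  # toy embedding table

def previous_word_embedding(word_id, train=True, noise_scale=0.1):
    """Embedding of the previously generated word, with additive Gaussian noise
    at training time so the decoder does not rely too much on this input.
    (noise_scale is an arbitrary illustrative value.)"""
    emb = target_embeddings[word_id]
    if train:
        emb = emb + rng.normal(scale=noise_scale, size=emb.shape)
    return emb
```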
Translation
• Translation with beam-search
• Large beam (maximum 100)
• Although other authors have mentioned issues with large beams, it worked well for us
• Normalization of the score by the length of the sentence
• The final n-best candidates are pruned by their average loss per word (see the sketch after this list)
• UNK words replaced with a dictionary using the attention values
• Dictionary extracted from the aligned training corpus
• The attention is not always very precise, but it does help
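Two of these steps as a minimal sketch in plain Python (not the actual KyotoNMT implementation): picking the candidate with the best length-normalized score, and replacing <UNK> tokens using the attention weights and a dictionary.

```python
def best_candidate(nbest):
    """nbest: list of (target_words, total_negative_log_prob).
    Pick the candidate with the lowest average loss per word."""
    return min(nbest, key=lambda c: c[1] / max(len(c[0]), 1))

def replace_unk(target_words, attentions, source_words, dictionary):
    """For each <UNK>, look at the source position with the highest attention
    weight and substitute the dictionary translation of that source word
    (falling back to copying the source word itself)."""
    out = []
    for i, w in enumerate(target_words):
        if w == "<UNK>":
            src = source_words[max(range(len(source_words)),
                                   key=lambda j: attentions[i][j])]
            w = dictionary.get(src, src)
        out.append(w)
    return out
```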
Ensembling
• Ensembling is known to improve Neural-MT results substantially.
• We could confirm this, using three types of ensembling:
• “Normal” Ensembling
• Train different models and ensemble over them
• Self-Ensembling
• Ensembling of several parameter snapshots taken at different steps of the same training session
• Mixed Ensembling
• Train several models, and use several parameter snapshots for each model
• Observations:
• Ensembling does help a lot
• Mixed > Normal > Self
• Diminishing returns (typically +2-3 BLEU going from one to two models, less than +0.5 going from three to four models)
• Geometric averaging of probabilities worked better than arithmetic averaging (see the sketch below)
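A minimal NumPy sketch of the last point: combining the next-word distributions of several models by arithmetic averaging versus geometric averaging (average of log-probabilities, renormalized).

```python
import numpy as np

def arithmetic_ensemble(probs):
    """probs: array of shape (n_models, vocab_size); simple average of probabilities."""
    return probs.mean(axis=0)

def geometric_ensemble(probs, eps=1e-12):
    """Geometric average: mean of log-probabilities, then renormalize.
    This is what worked better in our experiments."""
    log_avg = np.log(probs + eps).mean(axis=0)
    p = np.exp(log_avg)
    return p / p.sum()

# Toy usage with two models over a vocabulary of 4 words
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.4, 0.3, 0.2, 0.1]])
print(arithmetic_ensemble(probs), geometric_ensemble(probs))
```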
The question of segmentation
• Several options for segmentation:
• Natural (i.e. “words” for English)
• Subword units, using e.g. BPE (Sennrich et al., 2015) (sketched after this slide)
• Automatic segmentation tools (JUMAN, SKP)
• -> Trade-off between sentence size, generalization capacity and computation efficiency
• English
• Word units
• Subword units with BPE
• Japanese
• JUMAN segmentation
• Subword units with BPE
• Chinese
• SKP segmentation
• “short units” segmentation
• Subword units with BPE
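As a rough illustration of the subword option, here is a minimal sketch of applying a learned list of BPE merge operations to one word, in the spirit of Sennrich et al. (2015); this is a simplified toy, not the actual tool used for our experiments, and the merge list in the usage line is made up.

```python
def apply_bpe(word, merges):
    """word: string; merges: list of (left, right) pairs, ordered by priority
    as learned on the training corpus. Returns the list of subword units."""
    symbols = list(word) + ["</w>"]          # end-of-word marker, as in BPE
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                            # no learned merge applies any more
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols

# Toy usage with hypothetical merge operations
print(apply_bpe("lower", [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]))
```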
Results
Results for WAT 2016
Ja -> En

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    21.22  59.52  -            -
NMT 1   24.71  56.27  47.0 (3/9)   3.89 (1/3)
NMT 2   26.22  55.85  44.25 (4/9)  -

• In terms of BLEU, ensembling 4 simple models (NMT 2) beats the larger NMT system (NMT 1)
• In terms of AM-FM, the EBMT system ranks higher
• In terms of human evaluation, the larger NMT model actually has a slightly better score

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         200k (JUMAN)       52k (BPE)          -
NMT 2   1         30k (JUMAN)        30k (words)        x4
Results for WAT 2016
En -> Ja

        BLEU   AM-FM  Pairwise      JPO Adequacy
EBMT    31.03  74.75  -             -
NMT 1   36.19  73.87  55.25 (1/10)  4.02 (1/4)

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         52k (BPE)          52k (BPE)          -
Results for WAT 2016
Ja -> Zh

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    30.27  76.42  30.75 (3/5)  -
NMT 1   31.98  76.33  58.75 (1/5)  3.88 (1/3)

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         30k (JUMAN)        30k (KyotoMorph)   -
Results for WAT 2016
Zh -> Ja

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    36.63  76.71  -            -
NMT 1   46.04  78.59  63.75 (1/9)  3.94 (1/3)
NMT 2   44.29  78.44  56.00 (2/9)  -

        # layers  Source Vocabulary   Target Vocabulary  Ensembling
NMT 1   2         30k (KyotoMorph)    30k (JUMAN)        x2
NMT 2   2         200k (KyotoMorph)   50k (JUMAN)        -
EBMT vs NMT
Src:  本フローセンサーの型式と基本構成,規格を図示, 紹介。
Ref:  Shown here are type and basic configuration and standards of this flow with some diagrams.
EBMT: This flow sensor type and the basic composition, standard is illustrated, and introduced.
NMT:  This paper introduces the type, basic configuration, and standards of this flow sensor.
• NMT vs EBMT:
• NMT seems more fluent
• NMT sometimes adds parts not in the source (over-translation)
• NMT sometimes forgets to translate some parts of the source (under-translation)
Conclusion
• Neural MT proved to be very effective
• Especially for Zh -> Ja (almost +10 BLEU compared with EBMT)
• NMT vs EBMT:
• NMT output is more fluent and readable
• NMT more often has issues of under- or over-translation
• NMT takes longer to train but can be faster to translate
• Finding the optimal settings for NMT is very tricky
• Many hyper-parameters
• Each training takes a long time on a single GPU