Kyoto University Participation to WAT 2016

Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
[email protected], [email protected], [email protected], [email protected]

KyotoNMT
• Essentially an implementation of (Bahdanau et al., 2015)
• Implemented in Python with the Chainer library

[Figure: the attention model. Source word embeddings are read by a bidirectional LSTM encoder; at each decoding step an attention mechanism combines the encoder states with the previous decoder state into the current context, and the decoder LSTM uses this context together with the embedding of the previously generated word, followed by a maxout layer and a softmax over the target vocabulary, to produce the new word and the new state. Example input: ウイスキーはオオムギから製造される; example output: "whisky is produced from barley".]
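As a rough illustration of the attention step in the figure (not the actual KyotoNMT/Chainer code; all names and dimensions below are our own placeholders), additive attention as in (Bahdanau et al., 2015) can be sketched in plain numpy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_states, prev_state, W_enc, W_dec, v):
    """One additive-attention step (Bahdanau et al., 2015).

    enc_states: (src_len, enc_dim)  encoder state for each source word
    prev_state: (dec_dim,)          previous decoder LSTM state
    Returns the context vector and the attention weights.
    """
    # score each source position against the previous decoder state
    scores = np.tanh(enc_states @ W_enc + prev_state @ W_dec) @ v  # (src_len,)
    weights = softmax(scores)                                      # attention distribution
    context = weights @ enc_states                                 # weighted sum of encoder states
    return context, weights

# toy usage with random parameters
rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 5, 8, 8, 6
enc_states = rng.normal(size=(src_len, enc_dim))
prev_state = rng.normal(size=dec_dim)
W_enc = rng.normal(size=(enc_dim, att_dim))
W_dec = rng.normal(size=(dec_dim, att_dim))
v = rng.normal(size=att_dim)
context, weights = attention_step(enc_states, prev_state, W_enc, W_dec, v)
print(weights.round(3), context.shape)
```

In the full model, the resulting context vector is then combined with the previous decoder state and the previously generated word before the maxout and softmax layers, as in the figure.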
• We mostly used the network sizes from the original paper
• Depending on the experiment, we changed (see the Results tables for details):
  • multi-layer LSTMs
  • larger source and target vocabulary sizes (see the note below)
• Training was done on an NVIDIA Titan X (Maxwell) GPU
  • from 2 days for a single-layer model on ASPEC Ja-Zh
  • to 2 weeks for a multi-layer model on ASPEC Ja-En
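A note on the vocabulary sizes above (e.g. 30k or 200k source words): only the most frequent tokens of the segmented training data get their own embedding, and everything else is mapped to an unknown-word token. A minimal sketch of that preprocessing step (our own illustration, not the actual KyotoNMT code; the special symbols are assumptions):

```python
from collections import Counter

def build_vocab(segmented_sentences, max_size=30000):
    """Keep the max_size most frequent tokens; everything else becomes <unk>."""
    counts = Counter(tok for sent in segmented_sentences for tok in sent)
    vocab = {"<unk>": 0, "<bos>": 1, "<eos>": 2}  # reserved special symbols
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab

def to_ids(sent, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in sent]

# toy example with pre-segmented Japanese (JUMAN-style tokens)
corpus = [["私", "は", "学生", "です"],
          ["ウイスキー", "は", "オオムギ", "から", "製造", "さ", "れる"]]
vocab = build_vocab(corpus, max_size=30000)
print(to_ids(corpus[1], vocab))
```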
The important details
• During our experiments, we found that using these settings appropriately had a significant impact on the final results:
• Regularization
  • weight decay
  • dropout
  • early stopping
  • random noise on the previous word embedding (see below)
• Beam-search
  • normalizing the loss by length (see the sketch after this list)
• Training algorithm
  • ADAM
• Ensembling
  • ensembling of several models, or self-ensembling
• Segmentation
  • automatic segmentation with JUMAN or KyotoMorph
  • or subword units with BPE
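The beam-search point above refers to length normalization: hypothesis scores are sums of per-word log-probabilities, so an unnormalized beam search is biased toward short translations, and dividing the loss by the hypothesis length removes that bias. A minimal sketch (our own illustration, not the exact KyotoNMT scoring code; the toy log-probabilities are made up):

```python
import numpy as np

def hypothesis_score(log_probs, normalize_by_length=True):
    """Score a beam-search hypothesis from its per-word log-probabilities.

    Without normalization, longer hypotheses accumulate more negative
    log-probability, so the beam tends to prefer short outputs.
    """
    total = float(np.sum(log_probs))
    return total / len(log_probs) if normalize_by_length else total

short_hyp = [-0.9, -1.0]                   # 2 words, each fairly uncertain
long_hyp = [-0.5, -0.4, -0.5, -0.4, -0.5]  # 5 words, each more confident

for normalize in (False, True):
    best = max([short_hyp, long_hyp], key=lambda h: hypothesis_score(h, normalize))
    print("normalize =", normalize, "->", "short" if best is short_hyp else "long", "hypothesis wins")
```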
Random noise on the previous word embedding
• as shown in the attention-model figure above, the decoder is fed the embedding of the previously generated word
• in the hope of reducing cascading errors at translation time, we add noise to this target word embedding at training time
• it works well, but the gain may simply come from a regularization effect
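A minimal sketch of the trick described above, assuming the decoder receives the embedding of the previous target word at each step (our own illustration, not the actual KyotoNMT code; the noise scale and embedding size below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def previous_word_input(embedding, sigma=0.1, training=True):
    """Return the previous-word embedding fed to the decoder.

    At training time the previous word comes from the reference, while at
    translation time it comes from the model's own (possibly wrong) output;
    adding Gaussian noise during training is meant to make the decoder more
    robust to an imperfect previous word.
    """
    if not training:
        return embedding  # no noise at translation time
    return embedding + rng.normal(scale=sigma, size=embedding.shape)

prev_word_embedding = rng.normal(size=620)  # placeholder target embedding
decoder_input = previous_word_input(prev_word_embedding, sigma=0.1, training=True)
print(decoder_input.shape)
```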
Results

Ja -> En
         BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     21.22   59.52   -             -
NMT 1    24.71   56.27   47.0 (3/9)    3.89 (1/3)
NMT 2    26.22   55.85   44.25 (4/9)   -

         # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          200k (JUMAN)        52k (BPE)           -
NMT 2    1          30k (JUMAN)         30k (words)         x4

En -> Ja
         BLEU    AM-FM   Pairwise       JPO Adequacy
EBMT     31.03   74.75   -              -
NMT 1    36.19   73.87   55.25 (1/10)   4.02 (1/4)

         # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          52k (BPE)           52k (BPE)           -

Ja -> Zh
         BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     30.27   76.42   30.75 (3/5)   -
NMT 1    31.98   76.33   58.75 (1/5)   3.88 (1/3)

         # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          30k (JUMAN)         30k (KyotoMorph)    -

Zh -> Ja
         BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     36.63   76.71   -             -
NMT 1    46.04   78.59   63.75 (1/9)   3.94 (1/3)
NMT 2    44.29   78.44   56.00 (2/9)   -

         # layers   Source Vocabulary    Target Vocabulary   Ensembling
NMT 1    2          30k (KyotoMorph)     30k (JUMAN)         x2
NMT 2    2          200k (KyotoMorph)    50k (JUMAN)         -
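The "Ensembling" column above ("x2", "x4") indicates how many independently trained models were combined at decoding time. A common way to do this, shown here as our own sketch (not necessarily the exact combination rule used in KyotoNMT), is to average the next-word distributions of the individual models at every decoding step:

```python
import numpy as np

def ensemble_next_word_distribution(per_model_probs):
    """Average the next-word distributions predicted by each model of the ensemble.

    per_model_probs: list of arrays of shape (vocab_size,), each summing to 1.
    """
    return np.mean(np.stack(per_model_probs), axis=0)

# toy example: 4 models ("x4"), vocabulary of 5 words
rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(5)) for _ in range(4)]
combined = ensemble_next_word_distribution(models)
print(combined.round(3), "best word id:", int(combined.argmax()))
```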
KyotoEBMT
• Example-Based Machine Translation
• Tree-to-Tree: uses dependency trees on both the source and target side
EBMT vs NMT
• EBMT: less fluent
• NMT: more under/over-translation issues

Src:  本フローセンサーの型式と基本構成,規格を図示, 紹介。
Ref:  Shown here are type and basic configuration and standards of this flow with some diagrams.
EBMT: This flow sensor type and the basic composition, standard is illustrated, and introduced.
NMT:  This paper introduces the type, basic configuration, and standards of this flow sensor.
Conclusion and Future Work
• Very good results with Neural Machine Translation
  • especially for Zh -> Ja
• Long training times mean that we could not test every combination of settings for each language pair
• Some possible future improvements:
  • adding more linguistic aspects
  • adding newly proposed mechanisms (copy mechanism, etc.)
Code available (GPL)
• KyotoEBMT: http://lotus.kuee.kyoto-u.ac.jp/~john/kyotoebmt.html
• KyotoNMT: https://github.com/fabiencro/knmt