Kyoto University Participation to the 3rd Workshop on Asian Translation
Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Overview of our submissions
• 2 systems
  • Kyoto-EBMT
    • Example-Based Machine Translation
    • Uses dependency analysis for both the source and target sides
    • Some small incremental improvements over our last year's participation
  • Kyoto-NMT
    • Our new implementation of the Neural MT paradigm
    • Sequence-to-sequence model with attention mechanism, as first introduced by (Bahdanau et al., 2015)
• For the tasks: ASPEC Ja->En, ASPEC En->Ja, ASPEC Ja->Zh, ASPEC Zh->Ja

KyotoEBMT

KyotoEBMT Overview
• Example-Based MT paradigm
  • Needs a parallel corpus
  • Few language-specific assumptions (still a few language-specific rules)
• Tree-to-tree machine translation
  • Maybe the least commonly used variant of x-to-x
  • Sensitive to the parsing quality of both the source and target languages
  • Maximizes the chances of preserving information
• Dependency trees
  • Less commonly used than constituent trees
  • Most natural for Japanese
  • Should contain all important semantic information

KyotoEBMT pipeline
• Somewhat classic pipeline:
  1. Preprocessing of the parallel corpus
  2. Processing of the input sentence
  3. Decoding / tuning / reranking
• Tuning and reranking done with kbMira (seems to work better than PRO for us)

KyotoNMT

KyotoNMT Overview
• Uses the sequence-to-sequence with attention model
  • As proposed in (Bahdanau et al., 2015)
  • With other subsequent improvements: UNK-tag replacement (Luong et al., 2015), ADAM training, sub-word units, …
  • Hopefully we can add more original ideas in the future
• Implemented in Python using the Chainer library
• A version is open-sourced under the GPL

Sequence-to-Sequence with Attention
• [Architecture figure (Bahdanau et al., 2015): a bidirectional recurrent encoder over 620-dimensional source embeddings produces 1000-unit states for each source word ("私 は 学生 です"); given the previous decoder state, an attention model combines these states into the current context; the decoder state (1000 units), the context and the embedding of the previously generated word feed a 500-unit maxout layer and a softmax over the target vocabulary to predict the new word ("I am a student").]
• Depending on experiments, the recurrent units were GRU, LSTM, or 2-layer LSTM
• Source vocabulary size: 30,000 – 200,000
• Target vocabulary size: 30,000 – 50,000
• Other values were the same for all experiments
• (A minimal sketch of the attention step follows below.)
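To make the attention step in the figure above concrete, here is a minimal NumPy sketch of how the context vector for one decoder step could be computed with Bahdanau-style additive scoring, using the dimensions from the slide (1000-unit hidden states, bidirectional encoder). It is only an illustration with toy weights; the function and variable names (attention_context, W_enc, W_dec, v) are assumptions of this sketch, not the actual KyotoNMT code.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, prev_decoder_state, W_enc, W_dec, v):
    """One attention step with Bahdanau-style additive scoring.

    encoder_states:     (src_len, 2*hidden) forward+backward encoder states
    prev_decoder_state: (hidden,) previous decoder state
    Returns the attention weights over source positions and the context vector.
    """
    # score each source position j: v . tanh(W_enc h_j + W_dec s_{i-1})
    scores = np.tanh(encoder_states @ W_enc + prev_decoder_state @ W_dec) @ v
    weights = softmax(scores)           # distribution over the source words
    context = weights @ encoder_states  # weighted sum of encoder states
    return weights, context

# toy sizes matching the slide: 1000-unit hidden states, 4 source words ("私 は 学生 です")
hidden, att_units, src_len = 1000, 1000, 4
rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((src_len, 2 * hidden))
prev_state = rng.standard_normal(hidden)
W_enc = 0.01 * rng.standard_normal((2 * hidden, att_units))
W_dec = 0.01 * rng.standard_normal((hidden, att_units))
v = 0.01 * rng.standard_normal(att_units)

weights, context = attention_context(encoder_states, prev_state, W_enc, W_dec, v)
print(weights.round(3), weights.sum())  # four weights summing to ~1
print(context.shape)                    # (2000,): the "current context" fed to the decoder
```

These attention weights are also what is reused later (see the Translation slide) for UNK replacement, since they indicate which source word each generated target word attended to.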
Regularization
• Weight decay
  • Choosing a good value seemed quite important: 1e-6 worked noticeably better than 1e-5 or 1e-7
• Early stopping
  • Keep the parameters with the best loss on the dev set, or the parameters with the best BLEU on the dev set
  • "Best BLEU" works better, but it is even better to ensemble "best BLEU" and "best loss"
• Dropout
  • Only used between LSTM layers (when using a multi-layer LSTM)
  • 20% dropout
• Noise on target word embeddings (next slide)

Noise on target word embeddings
• Idea: add random noise to the embedding of the previously generated word at training time, to force the network not to rely too much on this information
• At training time we always give the correct previous word, but not at translation time, so this part of the network can be a source of cascading errors when translating
• Seems to work (+1.5 BLEU)
• But is it actually because the network becomes less prone to cascading errors, or simply a regularization effect?

Translation
• Translation with beam search
  • Large beam (maximum 100)
  • Although other authors mention issues with large beams, it worked for us
• Normalization of the score by the length of the sentence
  • The final n-best candidates are pruned by their average loss per word
• UNK words replaced with a dictionary, using the attention values
  • Dictionary extracted from the aligned training corpus
  • The attention is not always very precise, but it does help

Ensembling
• Ensembling is known to improve neural MT results substantially
• We could confirm this, using three types of ensembling:
  • "Normal" ensembling: train different models and ensemble over them
  • Self-ensembling: ensemble several parameter sets taken at different steps of the same training session
  • Mixed ensembling: train several models, and use several parameter sets for each model
• Observations:
  • Ensembling does help a lot
  • Mixed > Normal > Self
  • Diminishing returns (typically +2-3 BLEU going from one to two models, less than +0.5 going from three to four models)
  • Geometric averaging of probabilities worked better than arithmetic averaging (a small sketch of the difference follows the results tables below)

The question of segmentation
• Several options for segmentation:
  • Natural (i.e. "words" for English)
  • Subword units, using e.g. BPE (Sennrich et al., 2015)
  • Automatic segmentation tools (JUMAN, SKP)
  • -> Trade-off between sentence size, generalization capacity and computational efficiency
• English: word units; subword units with BPE
• Japanese: JUMAN segmentation; subword units with BPE
• Chinese: SKP segmentation; "short units" segmentation; subword units with BPE

Results

Results for WAT 2016 Ja -> En

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    21.22  59.52  -            -
NMT 1   24.71  56.27  47.0 (3/9)   3.89 (1/3)
NMT 2   26.22  55.85  44.25 (4/9)  -

• In terms of BLEU, ensembling 4 simple models beats the larger NMT system
• In terms of human evaluation, the larger NMT model ranks higher
• In terms of AM-FM, the EBMT system actually has a slightly better score

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         200k (JUMAN)       52k (BPE)          -
NMT 2   1         30k (JUMAN)        30k (words)        x4

Results for WAT 2016 En -> Ja

        BLEU   AM-FM  Pairwise      JPO Adequacy
EBMT    31.03  74.75  -             -
NMT 1   36.19  73.87  55.25 (1/10)  4.02 (1/4)

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         52k (BPE)          52k (BPE)          -

Results for WAT 2016 Ja -> Zh

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    30.27  76.42  30.75 (3/5)  -
NMT 1   31.98  76.33  58.75 (1/5)  3.88 (1/3)

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         30k (JUMAN)        30k (KyotoMorph)   -

Results for WAT 2016 Zh -> Ja

        BLEU   AM-FM  Pairwise     JPO Adequacy
EBMT    36.63  76.71  -            -
NMT 1   46.04  78.59  63.75 (1/9)  3.94 (1/3)
NMT 2   44.29  78.44  56.00 (2/9)  -

        # layers  Source Vocabulary  Target Vocabulary  Ensembling
NMT 1   2         30k (KyotoMorph)   30k (JUMAN)        x2
NMT 2   2         200k (KyotoMorph)  50k (JUMAN)        -
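The ensembled runs in the tables above (the x2 and x4 entries) combine the per-word output distributions of several models; as noted on the Ensembling slide, geometric averaging of the probabilities worked better than arithmetic averaging. The sketch below only illustrates that difference on made-up toy distributions and is not the KyotoNMT implementation.

```python
import numpy as np

def arithmetic_ensemble(probs):
    # plain average of the models' next-word distributions
    return probs.mean(axis=0)

def geometric_ensemble(probs, eps=1e-12):
    # average in log space, then renormalize so the result sums to 1
    p = np.exp(np.log(probs + eps).mean(axis=0))
    return p / p.sum()

# made-up next-word distributions from 3 models over a 5-word vocabulary
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.60, 0.25, 0.05, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],  # one model strongly disagrees
])

print(arithmetic_ensemble(probs).round(3))
print(geometric_ensemble(probs).round(3))
```

Averaging in log space means a candidate word must receive reasonable probability from every model, so words that any single model finds implausible are suppressed; that is one plausible intuition for why it worked better here, though the slides do not analyze it.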
EBMT vs NMT

Src:  本フローセンサーの型式と基本構成,規格を図示,紹介。
Ref:  Shown here are type and basic configuration and standards of this flow with some diagrams.
EBMT: This flow sensor type and the basic composition, standard is illustrated, and introduced.
NMT:  This paper introduces the type, basic configuration, and standards of this flow sensor.

• NMT vs EBMT:
  • NMT seems more fluent
  • NMT sometimes adds parts that are not in the source (over-translation)
  • NMT sometimes forgets to translate part of the source (under-translation)

Conclusion
• Neural MT proved to be very effective
  • Especially for Zh -> Ja (almost +10 BLEU compared with EBMT)
• NMT vs EBMT:
  • NMT output is more fluent and readable
  • NMT more often has issues of under- or over-translation
  • NMT takes longer to train but can be faster at translation time
• Finding the optimal settings for NMT is very tricky
  • Many hyper-parameters
  • Each training run takes a long time on a single GPU