Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
Facebook AI Research (2017.05.12)
Presented by Shiyu Zhang, 2017.05.18
Classic RNN seq2seq
• Encoder-decoder with soft attention
• Encoder: an RNN reads the source sentence into a sequence of hidden states
• Decoder: an RNN generates target words one at a time, attending over the encoder states at each step
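For reference, a generic Bahdanau-style formulation of this classic model (my own summary, not copied from these slides or the paper): the encoder RNN produces states h_j, the decoder attends over them with weights a_ij, and the context c_i feeds the next decoder state:

\[
\begin{aligned}
h_j &= \mathrm{RNN}_{\mathrm{enc}}(x_j,\ h_{j-1}) \\
a_{ij} &= \operatorname{softmax}_j\!\big(\mathrm{score}(s_{i-1},\ h_j)\big),
\qquad c_i = \textstyle\sum_j a_{ij}\, h_j \\
s_i &= \mathrm{RNN}_{\mathrm{dec}}(y_{i-1},\ c_i,\ s_{i-1}),
\qquad p(y_i \mid y_{<i}, x) = \operatorname{softmax}(W_o\, s_i + b_o)
\end{aligned}
\]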
How about CNN seq2seq?
• Advantages:
• Computations do not depend on previous time steps => easy parallelization over all elements
• The hierarchical structure provides a shorter path for capturing long-range dependencies
• Path length between distant positions: O(n) for an RNN vs. O(n/k) for stacked convolutions with kernel width k (see the arithmetic sketch below)
(Figure: stacking convolution layers lets the upper layers see the whole sentence)
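A tiny sketch of the arithmetic behind the O(n/k) claim (the helper receptive_field is mine, not from the paper's code): each stacked layer with kernel width k grows the input field by k − 1, so 6 blocks with k = 5 already cover 25 input positions.

def receptive_field(num_layers, kernel_width):
    # Each extra convolution layer extends the input field by (kernel_width - 1).
    return num_layers * (kernel_width - 1) + 1

# 6 stacked blocks with kernel width 5 see 6 * 4 + 1 = 25 input words, so relating
# positions n apart takes on the order of n / k layers instead of n recurrent steps.
print(receptive_field(6, 5))  # 25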
Architecture
• Encoder
• Input: word embeddings + position embeddings
• Kernel parameters: W ∈ R^(2d×kd) (maps k concatenated d-dim inputs to 2d outputs), b_w ∈ R^(2d)
• GLU (gated linear units): v([A; B]) = A ⊗ σ(B)
• The non-linearity lets the network exploit the full input field or focus on fewer elements
• The gates σ(B) control which inputs A of the current context are relevant
• z_i^l = v( W^l [z_{i-k/2}^{l-1}, ..., z_{i+k/2}^{l-1}] + b_w^l ) + z_i^{l-1}
• Residual connection: the + z_i^{l-1} term (see the sketch below)
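A minimal PyTorch sketch of the encoder block above (the class name EncoderConvBlock and the channel-first tensor layout are my assumptions, not fairseq's actual code): convolution to 2d channels, GLU, and the residual connection scaled by √0.5 as described later under Strategies, assuming an odd kernel width k.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderConvBlock(nn.Module):
    # One encoder block: convolution to 2d channels, GLU, residual connection.
    # Assumes an odd kernel width k so "same" padding keeps the sequence length.
    def __init__(self, d, k):
        super().__init__()
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

    def forward(self, z):                  # z: (batch, d, src_len)
        out = F.glu(self.conv(z), dim=1)   # GLU: A * sigmoid(B), back to d channels
        return (out + z) * math.sqrt(0.5)  # residual sum, scaled to preserve variance

# Usage (hypothetical sizes)
block = EncoderConvBlock(d=512, k=3)
z = torch.randn(4, 512, 20)
print(block(z).shape)                      # torch.Size([4, 512, 20])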
Architecture
• Decoder
• h_i^l = v( W^l [h_{i-k/2}^{l-1}, ..., h_{i+k/2}^{l-1}] + b_w^l ) + h_i^{l-1}
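A matching sketch for the decoder block above. The decoder must not see future target words; this hypothetical DecoderConvBlock enforces that with left-only padding by k − 1, one common implementation choice rather than necessarily the paper's exact padding scheme.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderConvBlock(nn.Module):
    # Causal decoder block: position i must not look at future target words.
    def __init__(self, d, k):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k)  # no built-in padding

    def forward(self, h):                        # h: (batch, d, tgt_len)
        padded = F.pad(h, (self.k - 1, 0))       # pad on the left only => causal
        out = F.glu(self.conv(padded), dim=1)    # same conv + GLU as the encoder
        return (out + h) * math.sqrt(0.5)        # residual connection

# Usage: DecoderConvBlock(d=512, k=3)(torch.randn(4, 512, 10)) has shape (4, 512, 10)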
Architecture
• Attention
• A separate attention mechanism for each decoder layer (multi-step attention)
• d_i^l = W_d^l h_i^l + b_d^l + g_i, with a_ij^l = softmax_j( d_i^l · z_j^u )
• c_i^l = Σ_j a_ij^l ( z_j^u + e_j ); c_i^l is simply added to h_i^l (sketch below)
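A sketch of one decoder layer's attention step following the formulas above (the function name, tensor shapes, and the proj argument standing in for W_d^l, b_d^l are my assumptions); it also applies the m·√(1/m) scaling mentioned later under Strategies.

import math
import torch
import torch.nn.functional as F

def multistep_attention(h, g, z, e, proj):
    # h: (batch, tgt_len, d) decoder states of this layer
    # g: (batch, tgt_len, d) embeddings of the previous target words
    # z: (batch, src_len, d) outputs of the top encoder block
    # e: (batch, src_len, d) source word + position embeddings
    # proj: torch.nn.Linear(d, d), playing the role of W_d^l, b_d^l
    d_state = proj(h) + g                            # d_i^l = W_d^l h_i^l + b_d^l + g_i
    scores = torch.bmm(d_state, z.transpose(1, 2))   # dot products d_i^l . z_j^u
    attn = F.softmax(scores, dim=-1)                 # a_ij^l
    c = torch.bmm(attn, z + e)                       # c_i^l = sum_j a_ij^l (z_j^u + e_j)
    m = z.size(1)                                    # number of source elements
    c = c * (m * math.sqrt(1.0 / m))                 # attention scaling from "Strategies"
    return h + c                                     # c is simply added to h

# Usage (hypothetical shapes)
proj = torch.nn.Linear(512, 512)
h = torch.randn(2, 7, 512); g = torch.randn(2, 7, 512)
z = torch.randn(2, 9, 512); e = torch.randn(2, 9, 512)
print(multistep_attention(h, g, z, e, proj).shape)   # torch.Size([2, 7, 512])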
Architecture
• Output: p(y_{i+1} | y_1, ..., y_i, x) = softmax( W_o h_i^L + b_o )
Strategies
• To stabilize learning: maintain the variance of activations throughout the
forward and backward passes.
• Normalization
• Multiply the sum of a residual block's input and output by √0.5 to halve the variance
• Multiply the attention conditional input (a weighted sum of m vectors) by m·√(1/m)
• Initialization
• Layers whose output is not fed to a GLU: initialize weights from N(0, 1/n_l), where n_l is the number of input connections to the layer
• Layers whose output is fed to a GLU: the GLU output variance is ¼ of its input variance, so initialize weights from N(0, 4/n_l)
• With dropout (retain probability p), these become N(0, p/n_l) and N(0, 4p/n_l) (see the initialization sketch below)
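A sketch of these initialization rules (the function name convs2s_init is mine; it assumes convolutions feed GLUs while linear layers do not, p is the dropout retain probability, and the N(0, v) variances above are written as standard deviations sqrt(v)).

import math
import torch.nn as nn

def convs2s_init(module, p=1.0):
    # p is the dropout *retain* probability; use p = 1.0 when no dropout is applied.
    if isinstance(module, nn.Linear):
        # Output not fed to a GLU: variance p / n_l with n_l input connections.
        n_l = module.in_features
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(p / n_l))
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv1d):
        # Output fed to a GLU, which quarters the variance: variance 4p / n_l.
        n_l = module.in_channels * module.kernel_size[0]
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(4.0 * p / n_l))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(lambda m: convs2s_init(m, p=0.9))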
Experiments
• Summarization
Reference
• Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N.
Convolutional Sequence to Sequence Learning. ArXiv e-prints (May
2017).
• Gehring, J., Auli, M., Grangier, D., and Dauphin, Y. N. A Convolutional
Encoder Model for Neural Machine Translation. ArXiv e-prints (Nov.
2016).
• https://github.com/facebookresearch/fairseq
Questions
• Parallelization at the decoder?