Convolutional Sequence to Sequence Learning
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
Facebook AI Research (2017.05.12)
Presented by Shiyu Zhang, 2017.05.18

Classic RNN seq2seq
• Encoder-decoder with a soft attention mechanism
• Encoder: (diagram omitted)
• Decoder: (diagram omitted)

How about CNN seq2seq?
• Advantages:
  • No dependence on previous time steps => parallelization
  • The hierarchical structure provides a shorter path for capturing long-range dependencies: O(n) -> O(n/k)
  • Can see the whole sentence through stacked layers

Architecture: Encoder
• Input: word embedding + position embedding
• Kernel parameters: $W \in \mathbb{R}^{2d \times kd}$, $b_w \in \mathbb{R}^{2d}$
• GLU (gated linear units): $v([A\,B]) = A \otimes \sigma(B)$
  • Non-linearities allow the network to exploit the full input field or to focus on fewer elements
  • The gates $\sigma(B)$ control which inputs $A$ are relevant
• Convolutional block with a residual connection (a code sketch is given at the end of these notes):
  $z_i^l = v\big(W^l [z_{i-k/2}^{l-1}, \dots, z_{i+k/2}^{l-1}] + b_w^l\big) + z_i^{l-1}$

Architecture: Decoder
• Same block structure as the encoder:
  $h_i^l = v\big(W^l [h_{i-k/2}^{l-1}, \dots, h_{i+k/2}^{l-1}] + b_w^l\big) + h_i^{l-1}$

Architecture: Attention
• A separate attention for each decoder layer (multi-step attention)
• Decoder summary combines the current state with the previous target embedding: $d_i^l = W_d^l h_i^l + b_d^l + g_i$
• Weights are dot products with the top encoder outputs: $a_{ij}^l = \frac{\exp(d_i^l \cdot z_j^u)}{\sum_{t=1}^m \exp(d_i^l \cdot z_t^u)}$
• The conditional input $c_i^l = \sum_j a_{ij}^l (z_j^u + e_j)$ is simply added to $h_i^l$ (see the attention sketch at the end of these notes)

Architecture: Output
• $p(y_{i+1} \mid y_1, \dots, y_i, x) = \mathrm{softmax}(W_o h_i^L + b_o)$ over the top decoder states

Strategies
• To stabilize learning: maintain the variance of activations throughout the forward and backward passes.
• Normalization
  • Multiply the sum of the input and output of a residual connection by $\sqrt{0.5}$
  • Multiply the conditional input of the attention (a weighted sum over $m$ source vectors) by $m\sqrt{1/m}$
• Initialization
  • Layers not followed by a GLU: initialize weights from $\mathcal{N}(0, \sqrt{1/n_l})$, where $n_l$ is the number of input connections
  • Layers followed by a GLU: the GLU output variance is 1/4 of the input variance, so initialize weights from $\mathcal{N}(0, \sqrt{4/n_l})$
  • With dropout (inputs retained with probability $p$), these become $\mathcal{N}(0, \sqrt{p/n_l})$ and $\mathcal{N}(0, \sqrt{4p/n_l})$

Experiments
• Machine translation results (result tables not reproduced in these notes)
• Summarization results

Reference
• Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional Sequence to Sequence Learning. arXiv e-prints (May 2017).
• Gehring, J., Auli, M., Grangier, D., and Dauphin, Y. N. A Convolutional Encoder Model for Neural Machine Translation. arXiv e-prints (Nov. 2016).
• https://github.com/facebookresearch/fairseq

Questions
• Parallelization at the decoder?
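The encoder and decoder blocks above (a convolution expanding d channels to 2d, a GLU, a residual connection scaled by $\sqrt{0.5}$, and the GLU-aware initialization from the Strategies slide) fit in a few lines of PyTorch. The following is a minimal sketch with names of my own (ConvGLUBlock and its arguments), not the fairseq implementation; it also assumes an odd kernel width k so that symmetric padding preserves the sequence length.

```python
# Minimal sketch of one ConvS2S-style convolutional block:
# Conv1d (d -> 2d channels), GLU non-linearity, residual connection
# scaled by sqrt(0.5), and N(0, sqrt(4p / n_l)) weight initialization.
# Class and argument names are illustrative, not fairseq's.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGLUBlock(nn.Module):
    def __init__(self, d: int, k: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = dropout
        # Kernel parameters: maps k*d inputs to 2d outputs per position,
        # i.e. W of size (2d x kd) and bias b_w of size 2d.
        # k is assumed odd, so padding k // 2 keeps the length unchanged.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)
        # Layer feeding a GLU: initialize from N(0, sqrt(4p / n_l)),
        # where n_l = k * d and p is the retain probability (1 - dropout).
        std = math.sqrt(4.0 * (1.0 - dropout) / (k * d))
        nn.init.normal_(self.conv.weight, mean=0.0, std=std)
        nn.init.zeros_(self.conv.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, d, length) -- states z^{l-1} of the previous layer.
        residual = z
        x = F.dropout(z, p=self.dropout, training=self.training)
        x = self.conv(x)          # (batch, 2d, length)
        x = F.glu(x, dim=1)       # A * sigmoid(B) -> (batch, d, length)
        # Residual connection; sqrt(0.5) halves the variance of the sum.
        return (x + residual) * math.sqrt(0.5)


# Quick shape check: a batch of 8 sequences, 20 tokens, d = 512.
block = ConvGLUBlock(d=512, k=3)
out = block(torch.randn(8, 512, 20))
assert out.shape == (8, 512, 20)
```

Scaling the residual sum by $\sqrt{0.5}$ is the variance-preserving trick from the Strategies slide: the sum of two signals of equal variance has twice the variance, and the scaling undoes that as blocks are stacked.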
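The per-layer attention from the Attention slide can be sketched in the same spirit. This is again an illustrative sketch with assumed names (MultiStepAttention and its tensor arguments); it omits the padding masks and the causal shifting that the real decoder needs during training.

```python
# Sketch of the per-decoder-layer ("multi-step") attention:
# d_i^l = W_d h_i^l + b_d + g_i, dot-product scores against the top
# encoder outputs z^u, conditional input c_i^l = sum_j a_ij (z_j + e_j),
# scaled by m * sqrt(1/m) and added back to h_i^l.
# Names are illustrative, not fairseq's.
import math

import torch
import torch.nn as nn


class MultiStepAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)   # W_d^l, b_d^l

    def forward(self, h, g, z, e):
        # h: (batch, tgt_len, d)  decoder states of layer l
        # g: (batch, tgt_len, d)  embeddings of the previous target tokens
        # z: (batch, src_len, d)  top encoder outputs z^u
        # e: (batch, src_len, d)  source embeddings (word + position)
        d_l = self.proj(h) + g                      # d_i^l
        scores = torch.bmm(d_l, z.transpose(1, 2))  # d_i^l . z_j^u
        a = torch.softmax(scores, dim=-1)           # attention weights a_ij^l
        c = torch.bmm(a, z + e)                     # c_i^l = sum_j a_ij^l (z_j^u + e_j)
        m = z.size(1)
        c = c * (m * math.sqrt(1.0 / m))            # variance-preserving scaling
        return h + c                                # c is simply added to h
```

The $m\sqrt{1/m}$ factor scales the averaged conditional input back up to roughly the size of a single encoder state, under the assumption that the attention weights are close to uniform.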