Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio

Eliel Hojman, Advanced Seminar in Deep Learning, Hebrew University, January 3, 2016

CONTENTS
Introduction
Image Caption Generation with Attention Mechanism
◦ Encoder
◦ Decoder
"Hard" vs. "Soft" Attention
◦ Stochastic "Hard"
◦ Deterministic "Soft"
Training
Experiments
◦ Qualitative analysis

INTRODUCTION
Image captioning is a very hard problem
◦ It sits at the heart of scene understanding, a primary goal of computer vision
Evolution of image captioning:
◦ Caption templates filled in based on object detections and attribute discovery (2013)
◦ Retrieval of similar captioned images, modifying the retrieved captions to fit the query
◦ Feed-forward ANN, no use of templates (2014)
◦ RNN with LSTM
  - Showing the image only at the beginning
  - Showing the image features at every step
  - A three-step pipeline incorporating object/visual concept detections

PROPOSED METHOD ADVANTAGES
"Using representations such as those from the very top layer of a convnet has the drawback of losing information which could be useful for richer, more descriptive captions"
"Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed"
"Working with these features necessitates a powerful mechanism to steer the model to information important to the task at hand"
"The proposed attention framework learns latent alignments from scratch"

CONTRIBUTIONS OF THE PAPER
Introduction of two attention-based image caption generators under a common framework
◦ "Soft" deterministic attention
◦ "Hard" stochastic attention
Show how we can gain insight by visualizing "where" and "what" the attention is focused on
Achieve state-of-the-art results on three benchmark datasets: Flickr8k, Flickr30k, MS COCO

Image Caption Generation with Attention Mechanism

ENCODER-DECODER MODEL
Split the problem into two tasks
◦ Encode the input into a fixed-size representation vector
◦ Decode the representation vector into the output sequence

MODEL DETAILS - ENCODER
A CNN is used to extract the features of the image
The feature map used is 14 x 14 x 512 (fourth convolutional layer before max pooling, Oxford VGG CNN)
Feature (annotation) vectors:
◦ a = {a_1, ..., a_L}, a_i ∈ R^D
◦ L is the number of attention locations (L = 196 = 14 x 14)
◦ D is the number of features per location (D = 512)

MODEL DETAILS - DECODER
LSTM network
Decoder output:
◦ y = {y_1, ..., y_C}, y_t ∈ R^K
◦ K is the vocabulary size, C is the caption length
Every word is conditioned on
◦ z_t, the context vector: visual information selected by attention
◦ h_{t-1}, the previous hidden state
◦ y_{t-1}, the previously generated word
E ∈ R^{m x K} is an embedding matrix; the embedding captures proximity between words

MODEL DETAILS - CONTEXT VECTOR
z_t is a dynamic representation of the relevant part of the image at time t
For each location i, an attention score is computed by f_att, a multilayer perceptron conditioned on the previous hidden state:
  e_{t,i} = f_att(a_i, h_{t-1})
The scores are normalized with a softmax:
  α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{L} exp(e_{t,k})
α can be interpreted as
◦ the probability that location i is the right place to focus ("hard" attention)
◦ the relative importance of location i when blending all the locations ("soft" attention)
The context vector is then
  z_t = φ({a_i}, {α_{t,i}})
where φ returns a single vector given the annotation vectors and their weights
(Slide diagram: each a_i together with h_{t-1} is scored by an MLP to give e_{t,i}; a softmax over e_{t,1}, ..., e_{t,L} gives α_{t,1}, ..., α_{t,L}; the weighted annotations form z_t, which is fed to the LSTM.)

MODEL DETAILS - FINAL RESULT
We want to compute the output word probability as
  p(y_t | a, y_1^{t-1}) ∝ exp(L_o(E y_{t-1} + L_h h_t + L_z z_t))
where L_o ∈ R^{K x m}, L_h ∈ R^{m x n}, L_z ∈ R^{m x D}, E ∈ R^{m x K} are learned parameters initialized randomly
The model also implements a deep output layer
◦ "This allows the hidden state of the model to be more compact and may result in the model being able to summarize the history of previous inputs more efficiently"
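To make the context-vector and output-word slides above concrete, here is a minimal NumPy sketch of one soft-attention decoding step. It is not the authors' code: the attention MLP has a single hidden layer, a plain tanh cell stands in for the LSTM, and all sizes except L = 196 and D = 512 are made-up placeholders.

```python
import numpy as np

# Illustrative sizes: L = 196 annotation vectors of dimension D = 512 (as in the slides);
# the recurrent size n, embedding size m, attention size d_att and vocabulary K are made up here.
L, D, n, m, K, d_att = 196, 512, 256, 128, 1000, 128

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention_step(a, h_prev, y_prev_id, p):
    """One decoding step of the soft-attention caption generator (sketch).

    a         : (L, D) annotation vectors from the CNN encoder
    h_prev    : (n,)   previous hidden state
    y_prev_id : int    index of the previously generated word
    """
    # e_{t,i} = f_att(a_i, h_{t-1}): a small MLP scores every location
    scores = np.tanh(a @ p["W_a"].T + p["W_h"] @ h_prev) @ p["w_e"]              # (L,)
    alpha = softmax(scores)                                                       # attention weights

    # Soft attention: z_t is the expectation of the annotation vectors under alpha
    z = alpha @ a                                                                 # (D,)

    # Recurrent update (a plain tanh cell stands in for the LSTM of the paper)
    h = np.tanh(p["W_rec"] @ np.concatenate([p["E"][:, y_prev_id], h_prev, z]))  # (n,)

    # Deep output layer: p(y_t | a, y_{<t}) ∝ exp(L_o (E y_{t-1} + L_h h_t + L_z z_t))
    logits = p["L_o"] @ (p["E"][:, y_prev_id] + p["L_h"] @ h + p["L_z"] @ z)     # (K,)
    return softmax(logits), alpha, h

rng = np.random.default_rng(0)
p = {"W_a": 0.01 * rng.standard_normal((d_att, D)),
     "W_h": 0.01 * rng.standard_normal((d_att, n)),
     "w_e": 0.01 * rng.standard_normal(d_att),
     "E":   0.01 * rng.standard_normal((m, K)),
     "L_o": 0.01 * rng.standard_normal((K, m)),
     "L_h": 0.01 * rng.standard_normal((m, n)),
     "L_z": 0.01 * rng.standard_normal((m, D)),
     "W_rec": 0.01 * rng.standard_normal((n, m + n + D))}
a = rng.standard_normal((L, D))                    # stand-in for the 14 x 14 x 512 VGG features
p_y, alpha, h = soft_attention_step(a, np.zeros(n), y_prev_id=0, p=p)
print(p_y.shape, alpha.sum())                      # (1000,) 1.0
```

The α returned here is exactly what the two variants below differ on: the "hard" model samples a single location from it, while the "soft" model uses it as blending weights.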
"HARD" vs. "SOFT" ATTENTION

STOCHASTIC "HARD" ATTENTION
s_t is a location variable indicating where the model decides to focus when generating the t-th word
s_{t,i} is a one-hot indicator variable over the L locations
Treating the attention locations as intermediate latent variables, we have
  p(s_{t,i} = 1 | s_{j<t}, a) = α_{t,i}
The context vector then becomes the random variable
  z_t = Σ_i s_{t,i} a_i

STOCHASTIC "HARD" ATTENTION II
Define a new objective L_s, a variational lower bound on the marginal log-likelihood, with s the location variable:
  L_s = Σ_s p(s | a) log p(y | s, a) ≤ log Σ_s p(s | a) p(y | s, a) = log p(y | a)
Its gradient with respect to the parameters W is
  ∂L_s/∂W = Σ_s p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ]

STOCHASTIC "HARD" ATTENTION III
REINFORCE-style gradient approximation using Monte Carlo sampling
s^n = (s_1^n, s_2^n, ...) are sampled attention location sequences, with s_t^n ~ Multinoulli_L(α_t^n)
  ∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y | s^n, a)/∂W + log p(y | s^n, a) ∂ log p(s^n | a)/∂W ]

STOCHASTIC "HARD" ATTENTION IV
The term log p(y | s^n, a) can make the gradient estimate have high variance
To reduce it, a moving-average baseline is used; for the k-th mini-batch the baseline is estimated as
  b_k = 0.9 b_{k-1} + 0.1 log p(y | s_k, a)

STOCHASTIC "HARD" ATTENTION V
The final learning rule also adds an entropy term H[s] over the Multinoulli distribution (H[X] = E[-ln P(X)]):
  ∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y | s^n, a)/∂W + λ_r (log p(y | s^n, a) - b) ∂ log p(s^n | a)/∂W + λ_e ∂H[s^n]/∂W ]
λ_r and λ_e are hyper-parameters set by cross-validation (a small numerical sketch of this sampling and baseline follows the datasets slide below)
"In order to further improve the robustness of this learning rule, with probability 0.5 for a given image, we set the sampled attention location s to its expected value α"

DETERMINISTIC "SOFT" ATTENTION
Instead of sampling, we take the expected value of the context vector
  E_{p(s_t | a)}[z_t] = Σ_{i=1}^{L} α_{t,i} a_i
The resulting model is smooth and differentiable, so it can be trained using standard backpropagation

DOUBLY STOCHASTIC ATTENTION
By construction
◦ Σ_i α_{t,i} = 1
The model also introduces a regularization term encouraging
◦ Σ_t α_{t,i} ≈ 1
so the model is encouraged to pay attention to every part of the image during the caption generation
This was found to improve the BLEU scores as well as to deliver richer descriptions
Besides the α values, a gating scalar β is also predicted from the previous hidden state h_{t-1}:
  z_t = β Σ_{i=1}^{L} α_{t,i} a_i,  with β = σ(f_β(h_{t-1}))
Including the β scalar was shown to put more emphasis on the objects in the image

END-TO-END TRAINING
The soft model is trained end to end by minimizing the penalized negative log-likelihood (also sketched after the datasets slide):
  L_d = -log p(y | a) + λ Σ_{i=1}^{L} (1 - Σ_{t=1}^{C} α_{t,i})²
with a = {a_1, ..., a_L}, a_i ∈ R^D and y = {y_1, ..., y_C}, y_t ∈ R^K

TRAINING PROCEDURE
Both attention models were trained with SGD using adaptive learning-rate algorithms:
◦ RMSProp for Flickr8k
◦ Adam for Flickr30k and MS COCO
Oxford VGGnet used as the encoder
◦ Pretrained on ImageNet
Mini-batches of size 64
◦ The samples in a batch were chosen to have a similar number of words in their captions
◦ This avoids wasting time during training
Regularization
◦ Dropout
◦ Early stopping on BLEU score

DATASETS
Example images with their reference captions:
Example image 1 (a BMX rider in mid-air):
◦ A guy in a yellow shirt with a bike gets very high in the air.
◦ A man in mid-air holding his bike's handle bars.
◦ A man is performing a trick high in the air with a bicycle.
◦ A man wearing a yellow shirt is doing a trick high in the air with his bike.
◦ Man holding onto bike handlebars while in mid-air.
Example image 2 (a firefighter putting out a car fire):
◦ A firefighter extinguishes a fire under the hood of a car.
◦ a fireman spraying water into the hood of small white car on a jack
◦ A fireman sprays inside the open hood of small white car, on a jack.
◦ A fireman using a firehose on a car engine that is up on a carjack.
◦ Firefighter uses water to extinguish a car that was on fire
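A small NumPy sketch of the stochastic "hard" attention machinery from the slides above: sampling the locations s_t ~ Multinoulli_L(α_t), updating the moving-average baseline, and forming the REINFORCE-style weight λ_r(log p(y | s^n, a) - b). It only shows how these pieces fit together; the log-likelihoods are random stand-ins and no backpropagation through the recurrent network is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_locations(alpha):
    """Draw one attention location per time step: s_t ~ Multinoulli_L(alpha_t)."""
    return np.array([rng.choice(len(a_t), p=a_t) for a_t in alpha])

def update_baseline(b_prev, batch_log_p):
    """Moving-average baseline over mini-batches: b_k = 0.9 b_{k-1} + 0.1 log p(y | s_k, a)."""
    return 0.9 * b_prev + 0.1 * float(np.mean(batch_log_p))

# Toy setup: N = 3 sampled location sequences for a T = 4 word caption over L = 6 locations
T, L_loc, N = 4, 6, 3
alpha = rng.dirichlet(np.ones(L_loc), size=T)            # (T, L) attention distributions
samples = [sample_locations(alpha) for _ in range(N)]     # N sampled sequences s^n
log_p_y = rng.normal(-8.0, 1.0, size=N)                   # stand-in for log p(y | s^n, a)

b = update_baseline(b_prev=-8.0, batch_log_p=log_p_y)
lambda_r = 1.0
reinforce_weights = lambda_r * (log_p_y - b)
# In the full learning rule, reinforce_weights[n] multiplies d log p(s^n | a)/dW,
# d log p(y | s^n, a)/dW is added directly, and lambda_e * dH[s]/dW rewards entropy;
# with probability 0.5 the sampled s is replaced by its expectation alpha.
print(samples[0], reinforce_weights)
```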
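The penalized loss from the END-TO-END TRAINING slide can be written down directly. Below is a sketch that assumes the per-word probabilities and the full C x L matrix of attention weights have already been computed by the decoder; the λ value is illustrative.

```python
import numpy as np

def caption_loss(word_probs, alphas, lam=0.01):
    """L_d = -log p(y | a) + lam * sum_i (1 - sum_t alpha_{t,i})^2

    word_probs : (C,)   probability the model assigns to each ground-truth word y_t
    alphas     : (C, L) attention weights; each row sums to 1 by the softmax
    lam        : penalty strength (a hyper-parameter; the value here is illustrative)
    """
    nll = -np.sum(np.log(word_probs))
    coverage = alphas.sum(axis=0)              # total attention received by each location i
    penalty = np.sum((1.0 - coverage) ** 2)    # pushes sum_t alpha_{t,i} toward 1
    return nll + lam * penalty

# Toy example: a 5-word caption over the L = 196 image locations
rng = np.random.default_rng(0)
alphas = rng.dirichlet(np.ones(196), size=5)
word_probs = rng.uniform(0.05, 0.9, size=5)
print(caption_loss(word_probs, alphas))
```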
RESULTS
(Benchmark results: scores on Flickr8k, Flickr30k and MS COCO, where the attention models achieve state-of-the-art performance.)

QUALITATIVE ANALYSIS
The added attention component gives a new layer of interpretability
◦ In particular, the model can attend to "non-object" salient regions
The input to the convnet is a resized 224 x 224 image with preserved aspect ratio
◦ After 4 max-pooling layers we end up with a 14 x 14 feature map
◦ For the visualization of the attended region, the attention weights are upsampled by a factor of 2^4 = 16 and a Gaussian filter is applied
◦ Note that the receptive fields of the 14 x 14 units are highly overlapping
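The visualization step described above (upsampling the 14 x 14 attention map by 2^4 = 16 back to the 224 x 224 input resolution and smoothing it) can be sketched as follows. This uses SciPy's gaussian_filter; the smoothing width is an illustrative choice, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_to_heatmap(alpha, upsample=16, sigma=8.0):
    """Turn a (196,) attention vector into a 224 x 224 heatmap for overlaying on the image.

    alpha    : attention weights for one generated word, one value per 14 x 14 location
    upsample : 2**4 = 16, the total stride of the four max-pooling layers
    sigma    : Gaussian smoothing width in pixels (an illustrative choice)
    """
    grid = alpha.reshape(14, 14)
    heat = np.kron(grid, np.ones((upsample, upsample)))   # nearest-neighbour upsampling to 224 x 224
    heat = gaussian_filter(heat, sigma=sigma)             # smooth the blocky map
    return heat / heat.max()                              # normalize to [0, 1] for display

alpha = np.random.default_rng(0).dirichlet(np.ones(196))  # stand-in attention weights for one word
print(attention_to_heatmap(alpha).shape)                  # (224, 224)
```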
THANK YOU