
Show, Attend and Tell:
Neural Image Caption Generation with Visual Attention
Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio
ELIEL HOJMAN
ADVANCED SEMINAR IN DEEP LEARNING, HEBREW UNIVERSITY
JANUARY 3, 2016
CONTENTS
Introduction
Image Caption Generation with Attention Mechanism
◦ Encoder
◦ Decoder
“Hard” vs “Soft” Attention
◦ Stochastic “Hard”
◦ Deterministic “Soft”
Training
Experiments
◦ Qualitative analysis
INTRODUCTION
A very hard problem
◦ At the heart of scene understanding, a primary goal of computer vision
Evolution of image captioning
◦ Caption templates filled in based on object detections and attribute discovery (2013)
◦ Retrieve similar captioned images and modify the retrieved captions to fit the query
◦ Feed-forward ANN, no use of templates (2014)
◦ RNN with LSTM
  ◦ Show the image only at the beginning
  ◦ Show the image features at every step
  ◦ Three-step pipeline, incorporating object/visual concept detections
PROPOSED METHOD ADVANTAGES
“Using representations such as those from the very top layer of a convnet has the drawback of losing
information which could be useful for richer, more descriptive captions”
“Rather than compress an entire image into a static representation, attention allows for salient
features to dynamically come to the forefront as needed”
“Working with these features necessitates a powerful mechanism to steer the model to
information important to the task at hand”
“The proposed attention framework learns latent alignments from scratch”
CONTRIBUTIONS OF THE PAPER
Introduction of two attention-based image caption generators under a common framework
◦ “Soft” deterministic
◦ “Hard” stochastic
Show how we can gain insight by visualizing “where” and “what” the attention focuses on
Achieve state-of-the-art results on three benchmark datasets: Flickr8k, Flickr30k, MSCOCO
Image Caption Generation with
Attention Mechanism
ENCODER-DECODER MODEL
Split the problem into two tasks
◦ Encode the input sequence into a fixed-size representation vector
◦ Decode the representation vector into the output sequence
MODEL DETAILS - ENCODER
A CNN is used to extract the features of the image
A 14 × 14 × 512 feature map is used (fourth convolutional layer before max pooling, Oxford VGG CNN)
Feature vectors (annotation vectors), extracted as sketched below:
◦ $a = \{a_1, \dots, a_L\},\ a_i \in \mathbb{R}^D$
◦ L is the number of attention locations (L = 14 × 14 = 196)
◦ D is the number of features per location (D = 512)
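As an illustration only (not from the slides or the authors' code), a minimal sketch of extracting such annotation vectors with a pretrained VGG network in PyTorch; the exact layer cut, the torchvision weights enum, and the preprocessing are assumptions:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# VGG-19 pretrained on ImageNet (the paper uses the Oxford VGGnet).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Drop the final max-pooling layer so a 224x224 input yields a 14x14x512 map
# (which layer to cut at is an assumption made for this sketch).
encoder = torch.nn.Sequential(*list(vgg.children())[:-1])

preprocess = T.Compose([
    T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode(image):
    """Return the L = 196 annotation vectors a_i, each of dimension D = 512."""
    x = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        fmap = encoder(x)                       # (1, 512, 14, 14)
    return fmap.flatten(2).transpose(1, 2)      # (1, 196, 512) = (1, L, D)
```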
MODEL DETAILS - DECODER
LSTM network
Decoder output
◦ $y = \{y_1, \dots, y_C\},\ y_i \in \mathbb{R}^K$
◦ C is the caption length; K is the size of the vocabulary
Every word is conditioned on
◦ $z_t$, a context vector: visual information selected by the attention mechanism
◦ $h_{t-1}$, the previous hidden state
◦ $y_{t-1}$, the previously generated word
$E \in \mathbb{R}^{m \times K}$ is an embedding matrix
The embedding matrix captures similarity between words (a one-step decoder sketch follows below)
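A minimal sketch (an assumption, not the authors' code) of one decoder step: the LSTM input here concatenates the embedded previous word and the context vector, and the hidden state carries the rest of the conditioning:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One LSTM step conditioned on y_{t-1} (via the embedding E), h_{t-1} and the context z_t."""
    def __init__(self, vocab_K, embed_m, hidden_n, feat_D):
        super().__init__()
        self.embed = nn.Embedding(vocab_K, embed_m)           # plays the role of E
        self.lstm = nn.LSTMCell(embed_m + feat_D, hidden_n)   # input = [E y_{t-1}; z_t]

    def forward(self, y_prev, state, z_t):
        # y_prev: (batch,) word indices, state: (h_{t-1}, c_{t-1}), z_t: (batch, D)
        x = torch.cat([self.embed(y_prev), z_t], dim=1)
        h_t, c_t = self.lstm(x, state)
        return h_t, (h_t, c_t)
```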
MODEL DETAILS – CONTEXT VECTOR
$z_t$: a dynamic representation of the relevant part of the image at time t
$e_{t,i} = f_{att}(a_i, h_{t-1})$
◦ $f_{att}$: the attention model, computed by a multilayer perceptron (MLP)
$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})}$ (softmax over the L locations)
$\alpha$ can be interpreted as:
◦ the probability that location i is the best place to focus (“hard”)
◦ the relative importance of location i when blending all the locations (“soft”)
$z_t = \phi(\{a_i\}, \{\alpha_{t,i}\})$
[Slide diagram: each $a_i$ together with $h_{t-1}$ is fed to the MLP to produce $e_{t,i}$; a softmax over $e_{t,1}, \dots, e_{t,L}$ yields $\alpha_{t,1}, \dots, \alpha_{t,L}$; these weights combine $a_1, \dots, a_L$ into $z_t$, which is fed to the LSTM. A code sketch follows below.]
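A minimal sketch of the attention step above (layer sizes are assumptions; $f_{att}$ is implemented as a small MLP, as stated on the slide, and $\phi$ as the soft weighted sum):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """e_{t,i} = f_att(a_i, h_{t-1}); alpha_t = softmax(e_t); z_t = sum_i alpha_{t,i} a_i."""
    def __init__(self, feat_D=512, hidden_n=1024, attn_dim=512):
        super().__init__()
        self.proj_a = nn.Linear(feat_D, attn_dim)    # projects each annotation vector a_i
        self.proj_h = nn.Linear(hidden_n, attn_dim)  # projects the previous hidden state h_{t-1}
        self.score = nn.Linear(attn_dim, 1)          # scalar score e_{t,i} per location

    def forward(self, a, h_prev):
        # a: (batch, L, D) annotation vectors, h_prev: (batch, hidden_n)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)   # (batch, L), sums to 1 over the L locations
        z = (alpha.unsqueeze(-1) * a).sum(dim=1)      # (batch, D) context vector
        return z, alpha
```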
MODEL DETAILS – FINAL RESULT
We want to compute the output word probability as (see the sketch below):
$p(y_t \mid a, y_1^{t-1}) \propto \exp\big(L_o (E y_{t-1} + L_h h_t + L_z z_t)\big)$
where $L_o \in \mathbb{R}^{K \times m}$, $L_h \in \mathbb{R}^{m \times n}$, $L_z \in \mathbb{R}^{m \times D}$, $E \in \mathbb{R}^{m \times K}$ are learned parameters initialized randomly.
The model also implements a deep output layer.
◦ “This allows the hidden state of the model to be more compact and may result in the model being able to
summarize the history of previous inputs more efficiently”
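A sketch of this output distribution, with the matrix shapes from the slide (using nn.Linear/nn.Embedding layers here is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class DeepOutput(nn.Module):
    """p(y_t | a, y_{<t}) proportional to exp(L_o (E y_{t-1} + L_h h_t + L_z z_t))."""
    def __init__(self, vocab_K, embed_m, hidden_n, feat_D):
        super().__init__()
        self.E = nn.Embedding(vocab_K, embed_m)              # word embedding, role of E
        self.L_h = nn.Linear(hidden_n, embed_m, bias=False)  # L_h in R^{m x n}
        self.L_z = nn.Linear(feat_D, embed_m, bias=False)    # L_z in R^{m x D}
        self.L_o = nn.Linear(embed_m, vocab_K)               # L_o in R^{K x m}

    def forward(self, y_prev, h_t, z_t):
        logits = self.L_o(self.E(y_prev) + self.L_h(h_t) + self.L_z(z_t))
        return torch.log_softmax(logits, dim=-1)             # log p(y_t | a, y_{<t})
```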
“HARD” vs. “SOFT” ATTENTION
STOCHASTIC “HARD” ATTENTION
$s_t$: a location variable representing where the model decides to focus when generating the t-th word
$s_{t,i}$: a one-hot indicator variable over the L locations
Treating the attention locations as intermediate latent variables, we have:
◦ $p(s_{t,i} = 1 \mid s_{j<t}, a) = \alpha_{t,i}$
We can then define the random context vector:
◦ $z_t = \sum_i s_{t,i}\, a_i$
STOCHASTIC “HARD” ATTENTION II
Define a new objective function $L_s$, a variational lower bound on the marginal log-likelihood, with $s$ the sequence of location variables:
$L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log \sum_s p(s \mid a)\, p(y \mid s, a) = \log p(y \mid a)$
$\frac{\partial L_s}{\partial W} = \sum_s p(s \mid a) \left[ \frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a)\, \frac{\partial \log p(s \mid a)}{\partial W} \right]$
STOCHASTIC “HARD” ATTENTION III
Gradient approximated with Monte Carlo sampling (a REINFORCE-style learning rule)
$\tilde{s}^n = (s_1^n, s_2^n, \dots)$: sampled attention locations
$\tilde{s}_t^n \sim \mathrm{Multinoulli}_L(\alpha_t^n)$
$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a)\, \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]$
STOCHASTIC “HARD” ATTENTION IV
The term $\log p(y \mid \tilde{s}^n, a)$ can give a high-variance gradient estimate.
To reduce this variance, a moving-average baseline technique is used.
For the k-th mini-batch the moving average is estimated as:
$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, a)$
STOCHASTIC “HARD” ATTENTION V
The final learning rule for the model is:
$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \lambda_r \big( \log p(y \mid \tilde{s}^n, a) - b \big) \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W} \right]$
◦ $H[s]$, entropy of the Multinoulli distribution: $H[X] = E[-\ln p(X)]$
◦ $\lambda_r, \lambda_e$ are hyper-parameters set by cross-validation
“In order to further improve the robustness of this learning rule, with probability 0.5 for a given
image, we set the sampled attention location 𝑠 to its expected value 𝛼"
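As an illustration only (a simplified assumption, not the authors' code), a sketch of sampling a hard attention location and of a surrogate loss whose gradient matches the learning rule above for a single sample (N = 1):

```python
import torch

def hard_attention_sample(a, alpha):
    """Sample a one-hot location s_t ~ Multinoulli(alpha_t) and build z_t = sum_i s_{t,i} a_i.
    a: (batch, L, D) annotation vectors, alpha: (batch, L) attention probabilities."""
    dist = torch.distributions.Categorical(probs=alpha)  # Multinoulli over the L locations
    s = dist.sample()                                     # (batch,) sampled location indices
    z = a[torch.arange(a.size(0)), s]                     # (batch, D) selected annotation vector
    return z, dist.log_prob(s), dist.entropy()

def hard_attention_loss(log_p_y, log_p_s, entropy, b, lambda_r=1.0, lambda_e=0.01):
    """Surrogate loss: minimizing it yields the gradient from the slide (one sample).
    log_p_y: log p(y | s~, a) of the caption under the sampled locations
    log_p_s: sum over t of log p(s~_t | a); entropy: sum over t of H[s~_t]
    b: moving-average baseline, treated as a constant."""
    reward = (log_p_y - b).detach()                       # (log p(y | s~, a) - b), no gradient through it
    return -(log_p_y + lambda_r * reward * log_p_s + lambda_e * entropy)
```

Per mini-batch the baseline would then be updated as b = 0.9 * b + 0.1 * float(log_p_y.mean()), matching the moving-average rule two slides back.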
DETERMINISTIC “SOFT” ATTENTION
We consider the expected value of the context vector:
$\mathbb{E}_{p(s_t \mid a)}[z_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$
The resulting model is smooth and differentiable, so it can be trained end-to-end with standard backpropagation
DOUBLE STOCHASTIC ATTENTION
By construction:
◦ $\sum_i \alpha_{t,i} = 1$
The model also introduces a regularization term encouraging:
◦ $\sum_t \alpha_{t,i} \approx 1$
The model is thus encouraged to pay attention to every part of the image over the course of caption generation
This was found to improve the BLEU scores as well as to deliver richer descriptions
Besides the $\alpha$ values, a gating scalar $\beta$ is also predicted from the previous hidden state $h_{t-1}$ (see the sketch below):
$z_t = \beta \sum_{i=1}^{L} \alpha_{t,i}\, a_i, \quad \beta = \sigma(f_\beta(h_{t-1}))$
Including the $\beta$ scalar was shown to put more emphasis on the objects in the image
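A possible sketch of the gated soft context vector just described (implementing $f_\beta$ as a single linear layer is an assumption):

```python
import torch
import torch.nn as nn

class GatedSoftContext(nn.Module):
    """z_t = beta * sum_i alpha_{t,i} a_i, with beta = sigmoid(f_beta(h_{t-1}))."""
    def __init__(self, hidden_n=1024):
        super().__init__()
        self.f_beta = nn.Linear(hidden_n, 1)   # predicts the gating scalar from h_{t-1}

    def forward(self, a, alpha, h_prev):
        # a: (batch, L, D), alpha: (batch, L), h_prev: (batch, hidden_n)
        beta = torch.sigmoid(self.f_beta(h_prev))         # (batch, 1) gating scalar
        z = beta * (alpha.unsqueeze(-1) * a).sum(dim=1)   # (batch, D) gated context vector
        return z, beta
```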
END-TO-END TRAINING
The full training loss is the negative log-likelihood of the caption plus the doubly stochastic penalty (a code sketch follows below):
$L_d = -\log p(y \mid a) + \lambda \sum_{i=1}^{L} \Big( 1 - \sum_{t=1}^{C} \alpha_{t,i} \Big)^2$
where $a = \{a_1, \dots, a_L\},\ a_i \in \mathbb{R}^D$ and $y = \{y_1, \dots, y_C\},\ y_i \in \mathbb{R}^K$
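A sketch of this penalized loss (tensor shapes and the batching convention are assumptions):

```python
import torch

def caption_loss(log_probs, targets, alphas, lam=1.0):
    """log_probs: (batch, C, K) log p(y_t | ...) per caption position
    targets:   (batch, C)    ground-truth word indices
    alphas:    (batch, C, L) attention weights for every time step and location."""
    # Negative log-likelihood of the ground-truth caption: -log p(y | a)
    nll = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1).sum(dim=1)
    # Doubly stochastic penalty: lambda * sum_i (1 - sum_t alpha_{t,i})^2
    reg = lam * ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1)
    return (nll + reg).mean()
```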
TRAINING PROCEDURE
Both attention models were trained with SGD using adaptive learning rate algorithms:
◦ RMSProp for Flickr8k
◦ Adam for Flickr30k, MSCOCO
Oxford VGGnet convnet used
◦ Pretrained on ImageNet
Mini-batches of size 64
◦ The samples in a batch were chosen to have a similar number of words in their captions
◦ This is done to avoid wasting time during training
Regularization
◦ Dropout
◦ Early stopping on BLEU score
DATASETS
[Slide examples: two images, each shown with its five reference captions]
Example 1 (man on a bike):
◦ A guy in a yellow shirt with a bike gets very high in the air.
◦ A man in mid-air holding his bike's handle bars
◦ A man is performing a trick high in the air with a bicycle.
◦ A man wearing a yellow shirt is doing a trick high in the air with his bike.
◦ Man holding onto bike handlebars while in mid-air
Example 2 (firefighter and car):
◦ A firefighter extinguishes a fire under the hood of a car.
◦ a fireman spraying water into the hood of small white car on a jack
◦ A fireman sprays inside the open hood of small white car, on a jack.
◦ A fireman using a firehose on a car engine that is up on a carjack.
◦ Firefighter uses water to extinguish a car that was on fire
RESULTS
QUALITATIVE ANALYSIS
A new layer of interpretability is gained from the added attention component
◦ In this way the model can attend to “non-object” salient regions
The input to the convnet is a 224 × 224 resized image with preserved aspect ratio
◦ After 4 max-pooling layers we end up with a 14 × 14 feature map
◦ To visualize the attention region, the weights are upsampled by a factor of 2⁴ = 16 and a Gaussian filter is applied (sketched below)
◦ Note that the receptive fields of the 14 × 14 units are highly overlapping
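A sketch of this visualization step (the SciPy upsampling/smoothing calls and the filter width are assumptions; the slide only specifies the 2⁴ = 16 upsampling and a Gaussian filter):

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def attention_heatmap(alpha, upscale=16, sigma=8):
    """Turn a 14x14 attention weight map into a smooth 224x224 heat map for overlaying."""
    alpha = np.asarray(alpha).reshape(14, 14)
    heat = zoom(alpha, upscale, order=1)        # upsample by a factor of 2**4 = 16 -> 224x224
    heat = gaussian_filter(heat, sigma=sigma)   # smooth; the 14x14 receptive fields overlap
    return heat / (heat.max() + 1e-8)           # normalize to [0, 1]
```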
THANK YOU