Show and Tell: A Neural Image Caption Generator
Vinyals et al. (Google)
The IEEE Conference on Computer Vision and Pattern Recognition, 2015
The Problem
- Image caption generation: automatically describe the content of an image
- Image → natural language
- Combines computer vision and NLP
- Much more difficult than image classification/recognition
Background
- Success in image classification/recognition
- Close to human-level performance
- Deep CNNs, big datasets
- Image mapped to a fixed-length vector
Background
- Machine translation
- Language-generating RNNs
- Encoder-decoder framework
- Maximize likelihood of the target sentence
Idea
- Combine a vision CNN with a language RNN
- Deep CNN as encoder
- Language-generating RNN as decoder
- End-to-end model I → S
- Maximize p(S|I)
The Model
Neural Image Caption (NIC)
- CNN: 22-layer GoogLeNet
- LSTM for modeling log p(S|I) = ∑_{t=0}^{N} log p(St | I, S0, ..., St−1)
- Word embedding We
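The factorization above says the sentence log-likelihood is the sum of per-word conditional log-probabilities. A toy numpy sketch (hypothetical probability values, not from the paper) makes the equivalence with the product of probabilities concrete:

```python
import numpy as np

# Hypothetical per-step probabilities the decoder assigns to each
# ground-truth word St given the image and the previous words.
step_probs = np.array([0.5, 0.25, 0.125])

# log p(S|I) = sum_t log p(St | I, S0, ..., St-1)
log_likelihood = np.sum(np.log(step_probs))

# Equivalently, the sentence probability is the product of the steps.
sentence_prob = np.prod(step_probs)

assert np.isclose(log_likelihood, np.log(sentence_prob))
```

Working in log-space avoids underflow: even a modest-length caption multiplies many probabilities well below 1.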
Language LSTM
- Predicts the next word in the sentence
- Memory cell for longer-range dependencies
- Words St as one-hot vectors, plus START/END tokens
- x−1 = CNN(I), xt = We St, pt+1 = LSTM(xt)
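The unrolling in the last bullet can be sketched in plain numpy. This is a minimal toy model, not the paper's implementation: dimensions, weights, and the stand-in CNN feature are all made up, and the LSTM cell is a standard single-layer formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 8, 16  # toy vocab size, embedding dim, hidden dim

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# LSTM parameters: all four gates computed from [x, h_prev] at once.
W = rng.normal(scale=0.1, size=(4 * H, E + H))
b = np.zeros(4 * H)
W_out = rng.normal(scale=0.1, size=(V, H))  # hidden state -> word logits
We = rng.normal(scale=0.1, size=(E, V))     # word embedding matrix

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # memory cell update
    h = sigmoid(o) * np.tanh(c)
    return h, c

cnn_feature = rng.normal(size=E)  # stand-in for CNN(I)

# x_{-1} = CNN(I): the image is shown to the LSTM once, before any word.
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(cnn_feature, h, c)

# x_t = We St: a one-hot St selects a column of We; then p_{t+1} follows.
caption_ids = [0, 3, 7]  # S0 = START token, then arbitrary word ids
for t in caption_ids:
    h, c = lstm_step(We[:, t], h, c)
    p_next = softmax(W_out @ h)  # p_{t+1}: distribution over the vocab

assert np.isclose(p_next.sum(), 1.0)
```

Note the design choice from the paper: the image embedding enters only once, at step −1, rather than being re-fed at every step.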
Training
- Loss function L(I, S) = −∑_{t=1}^{N} log pt(St)
- CNN pre-trained on ImageNet
- Minimize w.r.t. the LSTM parameters, We, and the CNN top layer
- SGD on mini-batches
- Dropout and ensembling
- 512-dimensional word embeddings
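The loss is the negative log-likelihood of the ground-truth words under the per-step softmax distributions. A small numpy sketch with made-up logits and target word ids:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 10, 4  # toy vocab size and caption length

# One row of word logits per time step, softmaxed into pt.
logits = rng.normal(size=(N, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

targets = [2, 5, 1, 9]  # hypothetical ground-truth word ids S1..SN

# L(I, S) = -sum_t log pt(St): pick out each target word's probability.
loss = -sum(np.log(probs[t, w]) for t, w in enumerate(targets))
assert loss > 0
```

Minimizing this cross-entropy drives each pt to put mass on the correct next word; the gradient flows back through the LSTM, We, and the CNN top layer.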
Generation
- Set x−1 = CNN(I)
- x0 = We S0, with S0 the START token
- Sample the next word S1
- Feed We S1 back into the LSTM; repeat until END
- BeamSearch with beam size 20
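Beam search keeps the best partial captions by total log-probability instead of greedily committing to one word per step. The sketch below uses a deterministic toy next-word model (not the NIC LSTM) and a small beam; vocabulary, START/END ids, and the scoring function are all assumptions for illustration.

```python
import numpy as np
from heapq import nlargest

V, START, END = 6, 0, 5  # toy vocabulary with START/END token ids

def next_word_logprobs(prefix):
    """Stand-in for the LSTM: a deterministic log-softmax over words."""
    logits = np.sin(np.add.outer(prefix, np.arange(V))).sum(axis=0)
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_size=3, max_len=5):
    beams = [(0.0, [START])]  # (total log-prob, word sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for lp, seq in beams:
            for w, wlp in enumerate(next_word_logprobs(seq)):
                cand = (lp + wlp, seq + [w])
                # Hypotheses that emit END are done; others stay live.
                (finished if w == END else candidates).append(cand)
        # Keep only the beam_size best live hypotheses.
        beams = nlargest(beam_size, candidates, key=lambda b: b[0])
    return max(finished + beams, key=lambda b: b[0])

score, caption = beam_search()
assert caption[0] == START
```

With beam size 1 this degenerates to greedy decoding; the paper's beam of 20 trades compute for captions with higher overall likelihood.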
Results
- MSCOCO dataset: 80k train, 40k eval and test
- 5 human-written captions per image
- M1–M5 human judgements
Results
- Improved BLEU scores on Flickr8k, Flickr30k, and PASCAL
- Better evaluation metrics are needed
- 80% of top-1 generated captions also occur in the training set
- 50% of top-15 generated captions also occur in the training set
- Similar diversity to human captions
Results
- Trained word embeddings We
- Capture semantics from the language data
- Independent of vocabulary size
Summary
- End-to-end model (encoder-decoder)
- Vision CNN + language-generating RNN
- Maximize the likelihood of S given I
- State-of-the-art results on major datasets
- Datasets are limiting: unsupervised approaches?