Show and Tell: A Neural Image Caption Generator Vinyals et al. (Google) The IEEE Conference on Computer Vision and Pattern Recognition, 2015 The Problem I Image Caption Generation I Automatically describe content of an image I Image → Natural Language I Computer Vision + NLP I Much more difficult than image classification/recognition Background I Success in image classification/recognition I Close to human level performance I Deep CNN’s, Big Datasets I Image to fixed length vector Background I Machine Translation I Language generating RNN’s I Decoder-Encoder framework I Maximize likelihood of target sentence Idea I Combine Vision CNN with Language RNN I Deep CNN as encoder I Language Generating RNN as decoder I End to end model I → S I Maximize p(S|I ) The Model Neural Image Caption (NIC) I CNN: 22 layer GoogleNet I LSTM for modeling P log p(S|I ) = N t=0 log p(St |I , S0 , . . . , St−1 ) I Word embedding We Language LSTM I I I I Predicts next word in sentence Memory cell for longer memory St one-hot vectors + START/END token x−1 = CNN(I ), xt = We St , pt+1 = LSTM(xt ) (a) (b) Training PN I Loss function L(I , S) = − I CNN pre-trained on ImageNet I Minimize w.r.t. LSTM parameters, We and CNN top layer I SGD on mini-batches I Dropout and ensembling I 512 dimensional embedding t=1 log pt (St ) Generation I Give x−1 = CNN(I) I x0 = We S0 , S0 START token I Sample word S1 I Feed We S1 to LSTM I BeamSearch, beam size 20 Results I MSCOCO dataset: 80k train, 40k eval and test I 5 human made captions per image I M1-M5 human judgements Results Results I Improved Flickr8k, Flickr30k, PASCAL BLEU scores I Need better evaluation metrics I 80% of top-1 in training set I 50% of top-15 in training set I Similiar diversity as human captions Results I Trained word embeddings We I Captures semantics from the language data I Independent of vocubulary size Summary I End-to-end model (Encoder-Decoder) I Vision CNN + Language generating RNN I Maximize likelihood of S given I I State of the art results on major datasets I Datasets are limiting: Unsupervised approaches?
© Copyright 2026 Paperzz