Generating image captions with neural networks

Ryan Kiros
University of Toronto
Joint work with Ruslan Salakhutdinov & Richard Zemel
The problem to solve
INPUT: an image
OUTPUT: "A man skiing down the snow covered mountain with a dark sky in the background."
This requires:
- Identifying and detecting objects, scenes, people, etc.
- Reasoning about spatial relationships and properties of objects
- Combining several sources of information into a coherent sentence
Caption generation is like machine translation (MT)
A cat is sitting behind some books
- MT: translate from one language to another
- Caption generation: translate from an image to a description
- Similar notion of “words”, “phrases” and “alignments”
- translation feature functions:
...but it has some differences
Granularity of captions:
Denotation graph (Young et al., 2014)
Size vs. quality of datasets:
Flickr 8K → Flickr 30K → SBU 1M → Flickr 1M → Yahoo 100M
(left to right: increase in images, decrease in caption quality)
Encoder-decoder models for MT
"The end of the world" → (encoder) → Sentence feature vector → (decoder) → "La fin du monde"
Encode text in English to a distributed representation
Decode text into French by conditioning on the encoded vector
- Kalchbrenner and Blunsom (2013): ConvNet encoder, RNN decoder
- Cho et al. (2014): RNN encoder, RNN decoder
- Sutskever et al. (2014): LSTM encoder, LSTM decoder
An image-text encoder-decoder
Image → ConvNet (encoder) → Image feature vector → Neural LM (decoder) → "steam ship in the water"
- Multimodal Neural Language Models (Kiros et al., 2014)
Conditional log-bilinear model:
- Use the image features to additively bias the prediction of the next word representation
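A minimal sketch of the additive-bias idea, assuming a log-bilinear form in which the predicted next-word representation is a sum of linearly transformed context word vectors plus a linear map of the image features. All names here (`R`, `C`, `C_img`, `bias`, `next_word_distribution`) are illustrative and not taken from the slides.

```python
import math

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def next_word_distribution(context_ids, image_feat, R, C, C_img, bias):
    """Return P(next word | context words, image) over the vocabulary.

    R      : word representation table, one vector per word (assumed name)
    C      : one context matrix per context position (assumed name)
    C_img  : matrix mapping image features into word space (assumed name)
    bias   : one scalar bias per word (assumed name)
    """
    dim = len(R[0])
    r_hat = [0.0] * dim
    # Text part: sum of position-specific linear maps of the context words.
    for C_i, w in zip(C, context_ids):
        r_hat = [a + b for a, b in zip(r_hat, matvec(C_i, R[w]))]
    # Image part: the image features shift the prediction additively.
    r_hat = [a + b for a, b in zip(r_hat, matvec(C_img, image_feat))]
    # Score every word by inner product with the predicted representation.
    scores = [sum(a * b for a, b in zip(r_hat, R[j])) + bias[j]
              for j in range(len(R))]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With the image term set to zero the function reduces to a plain log-bilinear language model, which is one way to see that the image acts only as a bias on the predicted representation.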
A new encoder-decoder model
Image → ConvNet (encoder) → Joint space (shared with words such as "ship" and "water") → Neural LM (decoder) → "steam ship in the water"
Learn a joint embedding space of images and text:
- This allows us to condition on anything (images, words, phrases, etc)
- Natural definition of a scoring function (inner products in the joint space)
- Use a new language model that incorporates additional structure
- Supplement the language model with large monolingual corpora (optional)
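As a rough illustration of scoring by inner products in the joint space, the sketch below ranks candidate descriptions against an image vector. The unit-length normalisation and the function names are assumptions made for the example; the slides only state that inner products are used.

```python
import math

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def score(a, b):
    """Inner product of two unit-normalised joint-space vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def rank(query_vec, candidates):
    """Return candidate names sorted from best to worst match.

    candidates maps a name (e.g. a caption) to its joint-space vector.
    """
    return sorted(candidates,
                  key=lambda name: score(query_vec, candidates[name]),
                  reverse=True)
```

Because images, words and phrases all live in the same space, the same `rank` function can serve image-to-text or text-to-image retrieval by swapping which side is the query.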
A joint image-text embedding
[Figure: images and text (e.g. "A castle and reflecting water", "A ship sailing in the ocean") mapped into a joint space]
Minimize the following objective:
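The objective itself is not reproduced in this text. For orientation only, the sketch below shows one commonly used objective for joint image-text embeddings, a max-margin pairwise ranking loss; it is an assumption and may differ from the objective on the slide. The margin value and all names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def pairwise_ranking_loss(image_vecs, text_vecs, margin=0.2):
    """Max-margin ranking loss over matched (image, text) pairs.

    image_vecs[i] and text_vecs[i] are assumed to describe the same item;
    every other index is treated as a contrastive (non-matching) example.
    """
    total = 0.0
    n = len(image_vecs)
    for i in range(n):
        positive = dot(image_vecs[i], text_vecs[i])
        for k in range(n):
            if k == i:
                continue
            # The image should score higher with its own text than with text k.
            total += max(0.0, margin - positive + dot(image_vecs[i], text_vecs[k]))
            # The text should score higher with its own image than with image k.
            total += max(0.0, margin - positive + dot(text_vecs[i], image_vecs[k]))
    return total
```

The loss is zero once every matching pair beats every non-matching pair by at least the margin, which is what pulls matching images and sentences together in the joint space.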
Convex semantic combinations
[Figure labels: Detector; ImageNet predictions ("chimpanzee, chimp, Pan troglodytes"; "gorilla, Gorilla gorilla"; "Three, monkey"); embedding (one per prediction); Encoder]
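One possible reading of "convex semantic combination" is a probability-weighted average of the embeddings of the top predicted labels. The sketch below follows that reading as an assumption; `convex_combination`, `top_k` and the toy inputs in the usage note are illustrative, not from the slides.

```python
def convex_combination(predictions, word_vecs, top_k=2):
    """Embed an image as a convex combination of label embeddings.

    predictions : dict mapping a label to its predicted probability
    word_vecs   : dict mapping a label to its embedding vector
    top_k       : number of highest-probability labels to keep (assumed)
    """
    top = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in top)
    dim = len(word_vecs[top[0][0]])
    combined = [0.0] * dim
    for label, p in top:
        weight = p / total  # renormalise so the weights sum to one
        for d in range(dim):
            combined[d] += weight * word_vecs[label][d]
    return combined
```

For example, with predictions of 0.6 for "chimpanzee" and 0.3 for "gorilla", the image embedding would sit two thirds of the way toward the "chimpanzee" vector.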
Train globally, retrieve locally
beach
snow
tower, building, cathedral, dome, castle
bowl, cup, soup, cups, coffee
kitchen, stove, oven, refrigerator, microwave
ski, skiing, skiers, skiiers, snowmobile
Adjectives
Nearest images
fluffy
delicious
adorable
sexy
Multimodal linguistic regularities
Nearest images
- dog + cat =
- cat + dog =
- plane + bird =
- man + woman =
Colours
Nearest images
- blue + red =
- blue + yellow =
- yellow + red =
- white + red =
Some interesting examples
Nearest images
- day + night =
- flying + sailing =
- bowl + box =
- box + bowl =
Sanity check
Nearest images
night
sailing
box
bowl
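The queries above can be sketched as vector arithmetic in the joint space: start from an image embedding, subtract one word vector, add another, and retrieve the nearest images. The sanity check corresponds to querying with the added word alone. Cosine similarity and all names below are assumptions for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy_query(image_vec, minus_word, plus_word, word_vecs):
    """Build the query: image - minus_word + plus_word."""
    return [x - m + p for x, m, p in
            zip(image_vec, word_vecs[minus_word], word_vecs[plus_word])]

def nearest_images(query, image_vecs, k=1):
    """Return the names of the k images closest to the query."""
    ranked = sorted(image_vecs,
                    key=lambda name: cosine(query, image_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

Comparing `nearest_images(analogy_query(...))` against `nearest_images(word_vecs[plus_word], ...)` shows whether the result retains anything from the original image or is driven by the added word alone.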
Structure-topic NLMs
Each step shows the word context, a blank for the n-th word, the POS context in parentheses, and the POS tag of the n-th word:
__________ (NN VBN IN DT NN) → DT
A __________ (VBN IN DT NN -) → NN
A bicycle __________ (IN DT NN - -) → VBN
A bicycle parked __________ (DT NN - - -) → IN
A bicycle parked on __________ (NN - - - -) → DT
A bicycle parked on the __________ (- - - - -) → NN
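The stepping pattern in this example can be reproduced with a short sketch: at step n the model sees the words generated so far, the tag of the word to predict, and a fixed-size window of upcoming tags padded with "-". The window size of five is read off the example; the function name and tuple layout are hypothetical.

```python
def structure_trace(words, tags, lookahead=5, pad="-"):
    """List (word context, current tag, POS context) for each step.

    words     : words generated so far, in order
    tags      : the full planned POS tag sequence
    lookahead : number of future tags shown in the POS context (assumed)
    """
    steps = []
    for n, tag in enumerate(tags):
        word_context = " ".join(words[:n])
        future = list(tags[n + 1:n + 1 + lookahead])
        future += [pad] * (lookahead - len(future))  # pad past the end
        steps.append((word_context, tag, " ".join(future)))
    return steps

if __name__ == "__main__":
    plan = ["DT", "NN", "VBN", "IN", "DT", "NN"]
    generated = ["A", "bicycle", "parked", "on", "the"]
    for context, tag, pos_context in structure_trace(generated, plan):
        print(f"{context} __________ ({pos_context}) -> {tag}")
```

The sketch only lays out the conditioning information; how the language model combines the word context with the POS context is not specified here.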
Fill in the blanks
The __________ is in the box . (NN) → cat
The cat is in the __________ . (NN) → box
The cat is __________ in the box . (VBG) → sitting
The __________ cat is in the box . (JJ) → cute
This is a __________ . (NN) → bus
The bus is __________ . (JJ) → parked
There is a __________ behind the bus . (NN) → car
The tree is __________ the bus . (IN) → on
Fill in the blanks
This is a __________ . (NN) → tower
This is a __________ building . (JJ) → tall
The grass is __________ the tower . (IN) → above
The sky is __________ the tower . (IN) → near
This is a __________ . (NN) → boat
This is a __________ boat . (JJ) → paddle
The boat is __________ on the water . (VBG) → sitting
The dock is to the __________ of the boat . (RB) → back
TAGS:
castle, palace, monastery, moated, motte
DENOTATIONS:
stone castle
old castle
building castle
large castle
construct castle
ADJECTIVES:
majestic, verdant, ancient, wooden, quaint
TOP-5 MODEL SAMPLES:
an ancient castle near the ruins .
an ancient stone castle in prague .
one of an ancient buildings near castle .
ancient wooden houses near the castle .
built along the ancient castle .
TAGS:
cup, bowl, coffee, soup, cups
DENOTATIONS:
coffee cup
cup of coffee
espresso cup
styrofoam coffee cup
cup of espresso
ADJECTIVES:
yummy, delicious, plastic, foamy, savory
TOP-5 MODEL SAMPLES:
from a plastic cup of tea .
sweet cup of tea in my kitchen .
we had a cup of delicious .
cups of tea in red wine .
cup of red wine in our kitchen .
TAGS:
store, supermarket, grocery, supermarkets, stores
DENOTATIONS:
grocery store
oriental grocery store
local grocery store
small grocery store
large grocery store
ADJECTIVES:
wholesale, bustling, festive, colorful, organic
TOP-5 MODEL SAMPLES:
from the local grocery shop window .
christmas fruit is in the grocery store .
red fruit and veg in the market .
lots of fruit boxes in the store .
christmas shop in the grocery market .
TAGS:
cat, ferret, hamster, weasel, puppy
DENOTATIONS:
cat hide
toy cat
cat swat
baby pull cat
cat doll
ADJECTIVES:
cute, furry, cuddly, adorable, naughty
TOP-5 MODEL SAMPLES:
my cat who lives in a box .
i put his cat in the world .
cute little cat in the box .
hanging around the cat in santa monica .
kitty cat toys in the box .
TAGS:
spider, spiders, arachnid, insects, insect
DENOTATIONS:
spider web
giant spider
have spider web
toy spider
hold spider
ADJECTIVES:
male, female, creepy, spooky, elfin
TOP-5 MODEL SAMPLES:
giant spider found in the netherlands .
look at the new spider web .
this was near the black spider web .
i like the spider .
the pattern of one spider web .
Current and future work
Experimentation:
- Fill-in-the-blank ranking task
- Caption ranking task of Hodosh et al.
- User study
- Quantitative evaluation for vector arithmetic?
Future work:
- Other similar tasks (abstractive summarization, image segmentation)
- RNN decoders
- A multilingual, multimodal space (generate captions in different languages!)
- Long term: Reasoning about images (Q&A)