Generating image captions with neural networks
Ryan Kiros
University of Toronto
Joint work with Ruslan Salakhutdinov & Richard Zemel

The problem to solve
INPUT: [image]
OUTPUT: A man skiing down the snow covered mountain with a dark sky in the background.
This requires:
- Identifying and detecting objects, scenes, people, etc.
- Reasoning about spatial relationships and properties of objects
- Combining several sources of information into a coherent sentence

Caption generation is like MT
Example caption: A cat is sitting behind some books
- MT: translate from one language to another
- Caption generation: translate from an image to a description
- Similar notion of “words”, “phrases” and “alignments”
- Translation feature functions:

...but it has some differences
Granularity of captions: Denotation graph (Young et al., 2014)
Size vs. quality of datasets: Flickr 8K, Flickr 30K, SBU 1M, Flickr 1M, Yahoo 100M
Increase in images / Decrease in caption quality

Encoder-decoder models for MT
[Figure: "The end of the world" -> sentence feature vector (encoder) -> "La fin du monde" (decoder)]
Encode text in English to a distributed representation
Decode text into French by conditioning on the encoded vector
- Kalchbrenner and Blunsom (2013): ConvNet encoder, RNN decoder
- Cho et al. (2014): RNN encoder, RNN decoder
- Sutskever et al. (2014): LSTM encoder, LSTM decoder

An image-text encoder-decoder
[Figure: image -> ConvNet (encoder) -> image feature vector -> Neural LM (decoder) -> "steam ship in the water"]
- Multimodal Neural Language Models (Kiros et al., 2014)
Conditional log-bilinear model:
- Use the image features to additively bias the prediction of the next word representation

A new encoder-decoder model
[Figure: ConvNet (encoder) -> joint space -> Neural LM (decoder); example words "ship", "water"; example caption "steam ship in the water"]
Learn a joint embedding space of images and text:
- This allows us to condition on anything (images, words, phrases, etc.)
- Natural definition of a scoring function (inner products in the joint space)
- Use a new language model that incorporates additional structure
- Supplement the language model with large monolingual corpora (optional)

A joint image-text embedding
Minimize the following objective:
[Equation]
[Figure: images and text mapped into a joint space; example captions "A castle and reflecting water" and "A ship sailing in the ocean"]

Convex semantic combinations
[Figure labels: ImageNet predictions; Encoder; Detector; embedding]
Example outputs: chimpanzee, chimp, Pan troglodytes, gorilla, Gorilla gorilla; Three, monkey

Train globally, retrieve locally
beach
snow
tower, building, cathedral, dome, castle
bowl, cup, soup, cups, coffee
kitchen, stove, oven, refrigerator, microwave
ski, skiing, skiers, skiiers, snowmobile

Adjectives
Nearest images for: fluffy, delicious, adorable, sexy

Multimodal linguistic regularities
[image] - dog + cat = [nearest images]
[image] - cat + dog = [nearest images]
[image] - plane + bird = [nearest images]
[image] - man + woman = [nearest images]

colours
[image] - blue + red = [nearest images]
[image] - blue + yellow = [nearest images]
[image] - yellow + red = [nearest images]
[image] - white + red = [nearest images]

Some interesting examples
[image] - day + night = [nearest images]
[image] - flying + sailing = [nearest images]
[image] - bowl + box = [nearest images]
[image] - box + bowl = [nearest images]

Sanity check
Nearest images for: night, sailing, box, bowl

Structure-topic NLMs
word context                          POS context          n-th word
__________                            (NN VBN IN DT NN)    DT
A __________                          (VBN IN DT NN -)     NN
A bicycle __________                  (IN DT NN - -)       VBN
A bicycle parked __________           (DT NN - - -)        IN
A bicycle parked on __________        (NN - - - -)         DT
A bicycle parked on the __________    (- - - - -)          NN

Fill in the blanks
The __________ is in the box .              NN    (cat)
The cat is in the __________ .              NN    (box)
The cat is __________ in the box .          VBG   (sitting)
The __________ cat is in the box .          JJ    (cute)
This is a __________ .                      NN    (bus)
The bus is __________ .                     JJ    (parked)
There is a __________ behind the bus .      NN    (car)
The tree is __________ the bus .            IN    (on)

Fill in the blanks
This is a __________ .                      NN    (tower)
This is a __________ building .             JJ    (tall)
The grass is __________ the tower .         IN    (above)
The sky is __________ the tower .           IN    (near)
This is a __________ .                      NN    (boat)
This is a __________ boat .                 JJ    (paddle)
The boat is __________ on the water .       VBG   (sitting)
The dock is to the __________ of the boat . RB    (back)

TAGS: castle, palace, monastery, moated, motte
DENOTATIONS: stone castle old castle building castle large castle construct castle
ADJECTIVES: majestic, verdant, ancient, wooden, quaint
TOP-5 MODEL SAMPLES:
an ancient castle near the ruins .
an ancient stone castle in prague .
one of an ancient buildings near castle .
ancient wooden houses near the castle .
built along the ancient castle .

TAGS: cup, bowl, coffee, soup, cups
DENOTATIONS: coffee cup cup of coffee espresso cup styrofoam coffee cup cup of espresso
ADJECTIVES: yummy, delicious, plastic, foamy, savory
TOP-5 MODEL SAMPLES:
from a plastic cup of tea .
sweet cup of tea in my kitchen .
we had a cup of delicious .
cups of tea in red wine .
cup of red wine in our kitchen .

TAGS: store, supermarket, grocery, supermarkets, stores
DENOTATIONS: grocery store oriental grocery store local grocery store small grocery store large grocery store
ADJECTIVES: wholesale, bustling, festive, colorful, organic
TOP-5 MODEL SAMPLES:
from the local grocery shop window .
christmas fruit is in the grocery store .
red fruit and veg in the market .
lots of fruit boxes in the store .
christmas shop in the grocery market .

TAGS: cat, ferret, hamster, weasel, puppy
DENOTATIONS: cat hide toy cat cat swat baby pull cat cat doll
ADJECTIVES: cute, furry, cuddly, adorable, naughty
TOP-5 MODEL SAMPLES:
my cat who lives in a box .
i put his cat in the world .
cute little cat in the box .
hanging around the cat in santa monica .
kitty cat toys in the box .

TAGS: spider, spiders, arachnid, insects, insect
DENOTATIONS: spider web giant spider have spider web toy spider hold spider
ADJECTIVES: male, female, creepy, spooky, elfin
TOP-5 MODEL SAMPLES:
giant spider found in the netherlands .
look at the new spider web .
this was near the black spider web .
i like the spider .
the pattern of one spider web .

Current and future work
Experimentation:
- Fill-in-the-blank ranking task
- Caption ranking task of Hodosh et al.
- User study
- Quantitative evaluation for vector arithmetic?
Future work:
- Other similar tasks (abstractive summarization, image segmentation)
- RNN decoders
- A multilingual, multimodal space (generate captions in different languages!)
- Long term: Reasoning about images (Q&A)
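
Illustrative sketch (not from the slides): the joint-space slides describe a scoring function based on inner products, and the "linguistic regularities" slides query the space with image - word + word. The Python below is a minimal sketch of that idea under stated assumptions. The function names (normalize, score, arithmetic_query, nearest_images) are hypothetical, and the 3-dimensional vectors are hand-made toy values standing in for learned embeddings; they are not outputs of the model in the talk.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit L2 norm along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score(a, b):
    """Inner product of unit-normalised vectors (cosine similarity)."""
    return normalize(a) @ normalize(b).T

def arithmetic_query(image_vec, minus_word, plus_word, word_vecs):
    """Build the query 'image - minus_word + plus_word' in the joint space."""
    q = (normalize(image_vec)
         - normalize(word_vecs[minus_word])
         + normalize(word_vecs[plus_word]))
    return normalize(q)

def nearest_images(query, image_vecs, k=1):
    """Return indices of the k gallery images with the highest score."""
    sims = score(query, image_vecs)
    return [int(i) for i in np.argsort(-sims)[:k]]

# Toy joint space (assumption): axes loosely mean "car", "blue", "red".
word_vecs = {"blue": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]}
gallery = np.array([
    [1.0, 1.0, 0.0],   # index 0: stands in for a blue car image
    [1.0, 0.0, 1.0],   # index 1: stands in for a red car image
    [0.0, 1.0, 0.0],   # index 2: stands in for a plain blue image
])

# Query: (blue car image) - "blue" + "red"
query = arithmetic_query(gallery[0], "blue", "red", word_vecs)
print(nearest_images(query, gallery, k=1))
```

In this toy setup the top-ranked gallery entry is index 1, the vector standing in for a red car. With real embeddings the gallery would hold ConvNet image features projected into the joint space, and the word vectors would come from the text side of the same space.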