Leveraging Visual Question Answering for Image-Caption Ranking
Xiao Lin, Devi Parikh
Virginia Tech

Overview

Image-caption ranking:
• Given an image, retrieve its caption from K (=1,000) captions.
• Given a caption, retrieve its image from K (=1,000) images.

Example image with caption "A batter up at the plate in a baseball game", and VQA question-answer pairs that describe it:
• What sport is this? Baseball.
• What is the batter about to do? Hit the ball.
• What is the brown thing on the kid's hand? Gloves.
The answers' probabilities given the caption [0.75, 0.99, 0.80] and given the image [1.00, 0.95, 0.83] agree, signaling a compatible pair.

Key ideas:
• VQA question-answer pairs provide many different perspectives from which to interpret images and captions.
• Knowledge in the VQA dataset improves image-caption ranking.

Approach

VQA-"agnostic" baseline
• Learn a scoring function S(I, C) that predicts whether image I and caption C are compatible.
• Baseline features t_I, t_C: "fc7"-style embeddings extracted using an existing state-of-the-art image-caption ranking model; t_I for the image (CNN followed by a linear layer), t_C for the caption (RNN).
• Baseline score: S_t(I, C) = dot product of t_I and t_C.

VQA-"aware" approach
• VQA as "feature" extraction: the VQA features u_I for image I and u_C for caption C are the log probabilities of a fixed set of N (=3,000) question-answer pairs (Q_i, A_i):
    u_I^(i) = log P_{Q-I}(A_i | Q_i, I),   u_C^(i) = log P_{Q-C}(A_i | Q_i, C),   i = 1, 2, ..., N.
  Example pairs: "Q: What are the men wearing on their heads? A: Helmets", "Q: Is it clean? A: Yes".
• Score-level fusion: combine the VQA and baseline features at the score level. A "VQA-only" score S_v(I, C) is the dot product of linearly transformed u_I and u_C; the final score combines S_v(I, C) with the baseline score S_t(I, C).
• Representation-level fusion: combine the VQA and baseline feature representations. The baseline features (t_I, t_C) and linearly transformed VQA features (u_I, u_C) are merged per modality, and the final score S(I, C) is the dot product of the fused image and caption representations.

Results

MSCOCO-1k test:

    Approach                          Caption Retrieval       Image Retrieval
                                      R@1   R@5   R@10        R@1   R@5   R@10
    Baseline (Kiros et al. 2014)      43.4  75.7  85.8        31.0  66.7  79.9
    VQA-only                          37.0  67.9  79.4        26.2  60.1  74.3
    Score-level fusion                46.9  78.6  88.9        35.8  70.3  83.6
    Representation-level fusion       50.5  80.1  89.7        37.0  70.9  82.9
      + better VQA (60.5% acc)        53.6  86.3  91.3        39.8  73.8  85.8

The fusion rows use the baseline VQA model (58.0% accuracy); the last row swaps in a better VQA model (60.5% accuracy), yielding gains over the VQA-agnostic baseline of +10.1 caption-retrieval R@1 and +8.8 image-retrieval R@1.

• State-of-the-art on MSCOCO.
• VQA consistently improves image-caption ranking across different amounts of image-caption ranking training data.

[Figure: Recall@1 vs. number of training captions per image (1, 3, 5), for caption and image retrieval; the VQA-aware curves stay above the VQA-agnostic ones throughout.]

Which facts does my model want to verify for better image-caption ranking?
• Select the (Q, A) pairs that have large mutual information between the validity V_{Q,A,I} of the pair and the relevance C_I of captions given image I:
    s(Q, A) = MI(V_{Q,A,I}; C_I) = H(C_I) − H(C_I | V_{Q,A,I}).
• Approximate P(C_I, V_{Q,A,I}) by taking an expectation over the model space Θ using dropout.

Top facts the machine would like to verify:
• What kind of food is this? Cake.
• What is the brown object in the foreground of the picture? Train.
• Where was this picture taken? Beach.
• What activity is the man doing? Skateboarding.
• What is the table made of? Wood.
• What color is the train? Red.
• Where is this place? Beach.
• What is this person throwing? Frisbee.
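The ranking task scores every image against every candidate caption with a dot product and reports Recall@K. A minimal sketch of that evaluation, assuming pre-computed, row-aligned feature matrices (the function name `recall_at_k` and the toy data are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(image_feats, caption_feats, k):
    """Recall@k for caption retrieval: for each image, score all
    candidate captions by dot product and check whether the true
    caption (the one at the same row index) ranks in the top k."""
    scores = image_feats @ caption_feats.T        # (K, K) score matrix
    hits = 0
    for i, row in enumerate(scores):
        top_k = np.argsort(-row)[:k]              # indices of the k best captions
        hits += int(i in top_k)
    return hits / len(scores)

# Toy example: 4 image/caption pairs in a 3-d embedding space,
# L2-normalized so the matching pair always scores highest.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
assert recall_at_k(feats, feats, 1) == 1.0
```

With k equal to the full candidate pool, recall is trivially 1.0, which is a quick sanity check for the loop.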
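The two fusion strategies differ in where the VQA features enter: at the score level (combine two scores) or at the representation level (combine features before one dot product). A sketch under simplifying assumptions — the dimensions, the random placeholder weights `W_I`, `W_C`, and the mixing weight `alpha` are all illustrative, and the paper's learned fusion is richer than this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: baseline embedding d_t, N question-answer pairs.
d_t, n_qa, d_u = 8, 12, 8
t_I, t_C = rng.normal(size=d_t), rng.normal(size=d_t)    # baseline features
u_I, u_C = rng.normal(size=n_qa), rng.normal(size=n_qa)  # VQA log-prob features

# Linear maps on the VQA features (learned in practice; random placeholders here).
W_I, W_C = rng.normal(size=(d_u, n_qa)), rng.normal(size=(d_u, n_qa))

def score_level_fusion(alpha=0.5):
    """Combine the baseline score S_t and the VQA-only score S_v."""
    s_t = t_I @ t_C                     # baseline: dot product of t features
    s_v = (W_I @ u_I) @ (W_C @ u_C)     # VQA-only: dot product of mapped u features
    return alpha * s_t + (1 - alpha) * s_v

def representation_level_fusion():
    """Concatenate baseline and mapped VQA features per modality,
    then score the pair with a single dot product."""
    f_I = np.concatenate([t_I, W_I @ u_I])
    f_C = np.concatenate([t_C, W_C @ u_C])
    return f_I @ f_C
```

Note that with plain concatenation and fixed maps, representation-level fusion reduces to an equally weighted score sum (the dot product of concatenated vectors splits into S_t + S_v); the benefit of fusing at the representation level comes from training all the transforms jointly.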
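The fact-selection score s(Q, A) = MI(V; C) = H(C) − H(C|V) can be computed from an (approximated) joint distribution over the pair's validity V and caption relevance C. A small sketch, assuming both variables have been reduced to binary and the joint table is given (the paper estimates it via dropout sampling; the function names here are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def fact_score(joint):
    """s(Q,A) = MI(V; C) = H(C) - H(C|V), where joint[v, c] = P(V=v, C=c)
    over binary validity V and binary caption relevance C."""
    joint = np.asarray(joint, dtype=float)
    p_c = joint.sum(axis=0)                       # marginal over C
    p_v = joint.sum(axis=1)                       # marginal over V
    # H(C|V) = sum_v P(v) * H(C | V=v)
    h_c_given_v = sum(
        p_v[v] * entropy(joint[v] / p_v[v])
        for v in range(len(p_v)) if p_v[v] > 0
    )
    return entropy(p_c) - h_c_given_v

# A fact whose validity pins down caption relevance is maximally informative:
assert np.isclose(fact_score([[0.5, 0.0], [0.0, 0.5]]), 1.0)
# A fact independent of caption relevance carries no information:
assert np.isclose(fact_score([[0.25, 0.25], [0.25, 0.25]]), 0.0)
```

Ranking candidate (Q, A) pairs by this score surfaces the facts whose verification would most reduce uncertainty about which captions fit the image.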