Leveraging Visual Question Answering for Image-Caption Ranking
Xiao Lin, Devi Parikh
Virginia Tech
Overview

Image-caption ranking:
• Given an image, retrieve its caption from a pool of K (= 1,000) captions
• Given a caption, retrieve its image from a pool of K (= 1,000) images

Example questions and answers a VQA model provides about an image from a baseball game:
• What sport is this? Baseball?
• What is the batter about to do? Hit the ball?
• What is the brown thing on the kid's hand? Gloves?

Approach
➤ VQA-"agnostic" baseline
• Learn a scoring function S_t(I, C) that predicts whether image I and caption C are compatible (e.g., an image of a batter and the caption "A batter up at the plate in a baseball game")
• Baseline features t_I for the image and t_C for the caption: "fc7" features extracted using an existing state-of-the-art image-caption ranking model
• Architecture: image -> CNN -> linear -> t_I; caption -> RNN -> identity -> t_C; S_t(I, C) is the dot product of t_I and t_C

➤ VQA question-answer pairs provide many different perspectives from which to interpret images and captions
➤ Knowledge in the VQA dataset improves image-caption ranking
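The VQA-agnostic baseline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random vectors stand in for the CNN/RNN embeddings t_I and t_C, and the L2 normalization before the dot product is an assumption (the poster only shows a dot product).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned embeddings: on the poster,
# t_I comes from image -> CNN -> linear and t_C from caption -> RNN.
D = 8                          # embedding dimension (illustrative)
t_I = rng.normal(size=D)       # one image embedding
t_C = rng.normal(size=(5, D))  # K = 5 candidate caption embeddings

def score_t(t_i, t_cands):
    """Baseline score S_t(I, C): dot product of the embeddings,
    L2-normalized here so scores are cosine similarities (assumption)."""
    t_i = t_i / np.linalg.norm(t_i)
    t_cands = t_cands / np.linalg.norm(t_cands, axis=-1, keepdims=True)
    return t_cands @ t_i

scores = score_t(t_I, t_C)     # one compatibility score per caption
ranking = np.argsort(-scores)  # caption retrieval: best match first
```

Image retrieval is symmetric: score one caption against K candidate images and sort.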
➤ VQA-"aware" approach

VQA as "feature" extraction:
• VQA features u_I for image I and u_C for caption C: negative log probabilities of a fixed set of N (= 3,000) question-answer pairs (Q_i, A_i):

  u_I^(i) = -log P_VQA-I(A_i | Q_i, I)
  u_C^(i) = -log P_VQA-C(A_i | Q_i, C),    i = 1, 2, ..., N

• Score-level fusion: combine VQA and baseline features at the score level
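The feature-extraction step can be sketched as follows. The `p_vqa` function is a hypothetical stand-in: in the real system it would be a trained VQA model scoring answer A_i for question Q_i conditioned on the image (or on the caption, for u_C).

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4  # number of (Q_i, A_i) pairs (3,000 on the poster; tiny here)

def p_vqa(qa_index, conditioning):
    """Hypothetical stand-in for P_VQA-I(A_i | Q_i, I) or
    P_VQA-C(A_i | Q_i, C): a real VQA model would return the
    probability it assigns to answer A_i for question Q_i."""
    return rng.uniform(0.05, 1.0)

# VQA features: element i is the negative log probability of (Q_i, A_i),
# so unlikely facts get large feature values.
u_I = np.array([-np.log(p_vqa(i, "image")) for i in range(N)])
u_C = np.array([-np.log(p_vqa(i, "caption")) for i in range(N)])
```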
Score-level fusion architecture: the VQA features u_I and u_C (e.g., from Q: What are the men wearing on their heads? A: Helmets) each pass through a linear layer, and a dot product gives the "VQA-only" score S_v(I, C); the baseline branch (image -> CNN -> linear -> t_I; caption -> RNN -> identity -> t_C) gives the "Baseline" score S_t(I, C) via a dot product; the final score S_s(I, C) combines S_v and S_t.
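A minimal sketch of score-level fusion, under stated assumptions: the linear layers `W_ui`, `W_uc` and the weight `alpha` are hypothetical placeholders for learned parameters, and the weighted sum is an assumed combination rule (the poster shows only that S_v and S_t are combined into S_s).

```python
import numpy as np

rng = np.random.default_rng(1)

D_t, D_u = 8, 4  # baseline / VQA feature dimensions (illustrative)

# Hypothetical learned parameters of the fusion model.
W_ui = rng.normal(size=(D_u, D_u))  # linear layer on u_I
W_uc = rng.normal(size=(D_u, D_u))  # linear layer on u_C
alpha = 0.7                         # assumed score-mixing weight

def score_fusion(t_i, t_c, u_i, u_c):
    """Score-level fusion S_s(I, C): combine the baseline score
    S_t = t_I . t_C with the VQA-only score S_v = (W u_I) . (W u_C)."""
    s_t = t_i @ t_c
    s_v = (W_ui @ u_i) @ (W_uc @ u_c)
    return alpha * s_t + (1 - alpha) * s_v

s = score_fusion(rng.normal(size=D_t), rng.normal(size=D_t),
                 rng.normal(size=D_u), rng.normal(size=D_u))
```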
• Representation-level fusion: combine VQA and baseline feature representations
Representation-level fusion architecture: the image representation is formed by passing t_I and u_I (e.g., from Q: Is it clean? A: Yes) through linear layers and combining them; the caption representation is formed in the same way from t_C and u_C; a dot product of the two fused representations gives S_r(I, C).
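Representation-level fusion can be sketched similarly. The projection matrices are hypothetical stand-ins for learned parameters, and summing the two projected features is an assumption; the poster only shows linear layers feeding a dot product.

```python
import numpy as np

rng = np.random.default_rng(2)

D_t, D_u, D = 8, 4, 6  # baseline dim, VQA dim, joint embedding dim

# Hypothetical learned projections into a joint embedding space.
W_ti, W_ui = rng.normal(size=(D, D_t)), rng.normal(size=(D, D_u))
W_tc, W_uc = rng.normal(size=(D, D_t)), rng.normal(size=(D, D_u))

def score_repr(t_i, u_i, t_c, u_c):
    """Representation-level fusion S_r(I, C): fuse baseline and VQA
    features into one embedding per side (sum of linear projections,
    an assumption), then score with a dot product."""
    img = W_ti @ t_i + W_ui @ u_i  # fused image representation
    cap = W_tc @ t_c + W_uc @ u_c  # fused caption representation
    return img @ cap

s = score_repr(rng.normal(size=D_t), rng.normal(size=D_u),
               rng.normal(size=D_t), rng.normal(size=D_u))
```

Unlike score-level fusion, the VQA and baseline features can interact before scoring, which matches the larger gains this variant shows in the results.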
Results

MSCOCO-1k test (VQA-aware rows use the baseline VQA model, 58.0% accuracy, unless noted):

                                          Caption Retrieval      Image Retrieval
Approach                                  R@1   R@5   R@10       R@1   R@5   R@10
Baseline (Kiros et al. 2014)              43.4  75.7  85.8       31.0  66.7  79.9
VQA-only                                  37.0  67.9  79.4       26.2  60.1  74.3
Score-level fusion                        46.9  78.6  88.9       35.8  70.3  83.6
Representation-level fusion               50.5  80.1  89.7       37.0  70.9  82.9
Representation-level fusion
  + better VQA (60.5% accuracy)           53.6  86.3  91.3       39.8  73.8  85.8

➤ State-of-the-art on MSCOCO: caption retrieval R@1 improves by +10.1 and image retrieval R@1 by +8.8 over the baseline
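The R@K metric used in the table can be computed with a small helper; a minimal sketch on toy scores (not the paper's data):

```python
import numpy as np

def recall_at_k(scores, gt, k):
    """R@K: percentage of queries whose ground-truth item is ranked
    in the top K by score (higher score = better match)."""
    hits = 0
    for row, g in zip(scores, gt):
        topk = np.argsort(-row)[:k]  # indices of the k best candidates
        hits += int(g in topk)
    return 100.0 * hits / len(gt)

# Toy example: 3 queries, 4 candidates each; gt holds the index of
# each query's true match.
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.7, 0.1],
                   [0.1, 0.6, 0.2, 0.9]])
gt = [0, 2, 3]
r1 = recall_at_k(scores, gt, 1)  # query 1's true match is only ranked 2nd
```

With K = 1,000 candidates, as in the MSCOCO-1k protocol, `scores` would be a 1,000-column matrix per direction (caption retrieval and image retrieval).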
➤ Which facts does my model want to verify for better image-caption ranking?
• Select the (Q, A) pairs that have large mutual information between their validity V_{Q,A,I} and the relevance of captions C_I given image I
• Approximate the joint P(C_I, V_{Q,A,I}) by taking an expectation over the model space Θ using dropout (both C_I and V_{Q,A,I} depend on the sampled model parameters Θ)

  argmax_{(Q,A)} H(C_I) - H(C_I | V_{Q,A,I}) = argmax_{(Q,A)} MI(V_{Q,A,I}; C_I)

➤ VQA consistently improves image-caption ranking across different amounts of image-caption ranking training data
[Figure: Recall@1 vs. number of training captions per image (1, 3, 5), for caption and image retrieval; the VQA-aware curves stay above the VQA-agnostic ones in every setting]

Top facts that the machine would like to verify:
• What kind of food is this? Cake.
• Where was this picture taken? Beach.
• What is the table made of? Wood.
• What color is the train? Red.
• What is the brown object in the foreground of the picture? Train.
• What activity is the man doing? Skateboarding.
• Where is this place? Beach.
• What is this person throwing? Frisbee.
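The MI-based fact selection can be sketched as below. This is an illustrative estimator only: random bits stand in for the dropout samples, which in the real system would come from stochastic forward passes with dropout left on; fact 0 is wired to be perfectly informative so the selection is visible.

```python
import numpy as np

rng = np.random.default_rng(3)

def mutual_info(joint):
    """Mutual information of a 2x2 joint distribution P(C, V), in nats."""
    pc, pv = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for c in range(2):
        for v in range(2):
            if joint[c, v] > 0:
                mi += joint[c, v] * np.log(joint[c, v] / (pc[c] * pv[v]))
    return mi

# Hypothetical dropout samples: for each of N candidate (Q, A) facts,
# draw S model instantiations and record whether the sampled model
# judges the caption relevant (C_I) and the fact valid (V_{Q,A,I}).
N, S = 5, 1000
C_samples = rng.integers(0, 2, size=S)       # caption relevance C_I
V_samples = rng.integers(0, 2, size=(N, S))  # fact validity V_{Q,A,I}
V_samples[0] = C_samples                     # fact 0: perfectly informative

def select_fact(C, V):
    """argmax over (Q, A) of MI(V_{Q,A,I}; C_I), estimated from samples."""
    mis = []
    for v in V:
        joint = np.zeros((2, 2))
        for c, vv in zip(C, v):
            joint[int(c), int(vv)] += 1
        mis.append(mutual_info(joint / len(C)))
    return int(np.argmax(mis)), mis

best, mis = select_fact(C_samples, V_samples)
```

Facts whose validity is independent of caption relevance get near-zero MI, so the model asks only about facts whose answers would actually change the ranking.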