Generating Chinese Captions for Flickr30K Images
Hao Peng
Indiana University, Bloomington
Nianhen Li
Indiana University, Bloomington
[email protected]
[email protected]
Abstract

We trained a multimodal recurrent neural network on the Flickr30K dataset with Chinese sentences. The RNN model is from Karpathy and Fei-Fei, 2015 [6]. Because a Chinese sentence has no spaces between words, we implemented the model on the Flickr30K dataset in two ways. In the first setting, we tokenized each Chinese sentence into a list of words and fed them to the RNN; in the second, we split each Chinese sentence into a list of characters and fed them into the same model. We compared the BLEU scores achieved by our two methods with those achieved by [6]. We found that the RNN model trained with the character-level method for Chinese captions outperforms the word-level one, and that the character-level model performs very close to the model trained on English captions by [6]. This leads to the conclusion that the RNN model works universally well, or at least equally well, for image captioning in different languages.
1. Introduction

Humans are good at describing and understanding the visual scene in an image with just a glance, but it is a tough task for computers to describe the context, or even just to recognize all the objects, in one image. An automated image captioning system is therefore helpful in many ways: self-driving cars and VR glasses both need this technology to build up their functionality, and such tools could also potentially be used to provide richer descriptions of images for people who are blind or visually impaired.

The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [4, 11]. However, while closed visual "words" or "vocabularies" constitute a reasonable modeling assumption, they are vastly limited when compared to the descriptions articulated by humans.

Recently, much research on image captioning has been devoted to RNN models, as they are said to be very effective at modeling sequential data and at capturing context and semantic relations in language. However, all these models are trained on images with English captions, so we do not know how they perform in other languages, or whether the method works universally. In this paper we test it on a Chinese captioning system.

Because a Chinese sentence has no spaces between words, it is very different from English. We implemented the RNN model with the same architecture used by [6] on the Flickr30K dataset with Chinese captions in two scenarios. The Chinese captions are obtained by translating the original English captions with the Google Translation API. Our experiments show that the generated Chinese sentences align quite well with the translated Chinese captions. We also report the BLEU [10] score computed with the coco-caption code [1], a metric that evaluates a candidate sentence by measuring how well it matches a set of five reference sentences written by humans.
2. Related Work

Researchers have explored vision-to-language problems extensively, examining image captioning (Lin et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; Chen et al., 2015; Young et al., 2014; Elliott and Keller, 2013), question answering (Antol et al., 2015; Ren et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014), visual phrases (Sadeghi and Farhadi, 2011), video understanding (Ramanathan et al., 2013), and visual concepts (Krishna et al., 2016; Fang et al., 2015).

To build visual description systems, recent state-of-the-art work [6, 14] has used multimodal recurrent neural networks (RNNs) to create "sequence to sequence" systems similar to those other researchers have used for machine translation. In this case, however, instead of translating from, say, French to English, the system is trained to translate from images to sentences.

Multiple closely related works have also used RNNs to generate image descriptions [9, 14, 3, 8, 5, 2], but [6] claims their model is simpler than most of the previous approaches. We therefore decided to apply their model to our Chinese captioning task on the same image dataset, Flickr30K. We also quantify the performance in our experiments and compare it with their original results.
3. Our Model
As noted above, the architecture of our RNN model is the same as the one used in [6], because we want to make a direct performance comparison in this paper. Some of the descriptions in this section (the training/testing process and the optimization) are therefore adapted from [6].

The RNN model accepts an image vector and outputs a corresponding sentence description. Each sentence is split into a sequence of elements and fed into the RNN (since we implemented both a word-level and a character-level method, we refer to words or characters generically as elements here). The model generates elements by defining a probability distribution over the next element in a sequence, given the current element and the context from previous time steps. At the first time step, it conditions this probability only on the input image vector. At test time, the model can predict a variable-length sequence of elements given an image.

Specifically, our RNN model is trained as follows. It takes the image pixels I and a sequence of one-hot encoded word vectors (x_1, x_2, ..., x_T). It then computes a sequence of hidden states (h_1, h_2, ..., h_t) and a sequence of outputs (y_1, y_2, ..., y_t) by iterating the following formulas for t = 1 to T:
    b_v = W_hi [CNN_θc(I)]                                      (1)
    h_t = f(W_hx x_t + W_hh h_{t-1} + b_h + 1(t = 1) ⊙ b_v)     (2)
    y_t = softmax(W_oh h_t + b_o)                               (3)

Figure 1. The image vector produced from the VGG net.
In the above equations, W_hi, W_hx, W_hh, W_oh, x_t and b_h, b_o are learnable parameters that are updated during training, and CNN_θc(I) is the output of the last layer of the VGG net [12] (as shown in Figure 1). In our training, the image encoding size, word encoding size and hidden size are all set to 256, which means x_t, b_v, h_t, b_h and b_o are all 256-dimensional vectors. The output vector y_t holds the log probabilities of the words in the vocabulary, plus one additional dimension for a special END token. We feed the image encoding vector b_v into the RNN only at the first iteration.
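To make the recurrence concrete, the following is a minimal NumPy sketch of one step of equations (1)-(3). The random initialization, the choice f = tanh, the toy vocabulary size and the assumed 4096-dimensional CNN feature are illustrative only; in the real model these parameters are learned and the image feature comes from the VGG net.

    import numpy as np

    # Illustrative sizes: 256-dim hidden state (as in the paper) and a toy
    # vocabulary of 1000 elements; a 4096-dim VGG feature is assumed here.
    H, V, D = 256, 1000, 4096
    rng = np.random.default_rng(0)
    W_hi = rng.normal(0, 0.01, (H, D))   # image feature -> bias b_v
    W_hx = rng.normal(0, 0.01, (H, V))   # one-hot input element -> hidden
    W_hh = rng.normal(0, 0.01, (H, H))   # hidden -> hidden
    W_oh = rng.normal(0, 0.01, (V, H))   # hidden -> output scores
    b_h, b_o = np.zeros(H), np.zeros(V)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, cnn_feat=None):
        """One step of equations (1)-(3); cnn_feat is passed only at t = 1."""
        b_v = W_hi @ cnn_feat if cnn_feat is not None else 0.0   # eq. (1)
        h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h + b_v)    # eq. (2), f = tanh
        y_t = softmax(W_oh @ h_t + b_o)                          # eq. (3)
        return h_t, y_t

    # One illustrative call: a one-hot START element plus the image feature.
    x1 = np.zeros(V); x1[0] = 1.0
    h1, y1 = rnn_step(x1, np.zeros(H), cnn_feat=rng.normal(size=D))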
Figure 2. Illustration of the RNN sentence-generation process.

3.1. Training process
Our RNN model is trained to predict the next word y_t based on the input word x_t and the previous context (hidden state) h_{t-1}. We simply treat the image encoding vector b_v as a bias term on the first iteration. The training process is illustrated in Figure 2: we set h_0 = 0 and x_1 to a special START vector, and we expect y_1 to be close to the first word in the sequence. Similarly, we set x_2 to the first word vector and expect the network to predict the second word, and so on. Finally, x_T is the last word vector in the sequence, and we expect the RNN to predict a special END token. The goal is to maximize the log probability assigned to the target labels.
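This objective (maximize the log probability assigned to the target labels) can be sketched as follows; the toy vocabulary and the helper name are illustrative, not taken from the paper's code.

    import numpy as np

    def sequence_log_likelihood(probs_per_step, target_indices):
        """Sum of log probabilities assigned to the target elements.

        probs_per_step: the softmax outputs y_1 .. y_T produced by the RNN
                        (including the prediction for the END token).
        target_indices: indices of the target elements, ending with END.
        Training maximizes this quantity (i.e. minimizes its negative).
        """
        return sum(np.log(p[i]) for p, i in zip(probs_per_step, target_indices))

    # Toy usage: three time steps over a 5-element vocabulary, END has index 4.
    toy_probs = [np.full(5, 0.2)] * 3
    print(sequence_log_likelihood(toy_probs, [1, 3, 4]))  # 3 * log(0.2)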
3.2. Testing process
To predict a sentence, we compute the image encoding vector b_v, set h_0 = 0 and x_1 to the START vector, and compute the distribution over the first word, y_1. We sample a word from that distribution, set its embedding vector as the next input word x_2, and repeat this process until the END token is generated or the length of the generated sequence exceeds 20. We also report the BLEU score obtained with different beam sizes.
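The generation loop described above can be sketched as below. The function and parameter names are our own illustrations; a greedy argmax is shown where one could equally sample from the distribution, and beam search (used for some of the reported BLEU scores) would instead keep the top-k partial sentences at each step.

    import numpy as np

    def generate_caption(step_fn, cnn_feat, start_vec, embed_fn, end_idx, max_len=20):
        """Decode a caption: feed the image only at t = 1, then feed back the
        chosen element until END is produced or 20 elements are generated.

        step_fn(x_t, h_prev, cnn_feat) -> (h_t, y_t) is the RNN step (e.g. the
        earlier sketch); embed_fn(idx) maps an element index to its input
        vector. All names here are illustrative, not taken from released code.
        """
        h = np.zeros(256)          # h_0 = 0, hidden size 256 as in the paper
        x = start_vec              # x_1 = START
        caption = []
        for t in range(max_len):
            h, y = step_fn(x, h, cnn_feat if t == 0 else None)
            idx = int(np.argmax(y))    # greedy choice; sampling y also works
            if idx == end_idx:
                break
            caption.append(idx)
            x = embed_fn(idx)
        return caption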
3.3. Optimization
As we are going to compare the performance of the RNN model on Chinese sentence generation with its performance on English sentence generation in [6], we keep the RNN architecture and training parameters the same as in [6]. We use SGD with mini-batches of 100 image-sentence pairs and a momentum of 0.9 to optimize the alignment model, and we cross-validate the learning rate and the weight decay. We also use dropout regularization in all layers except the recurrent layers, and we achieved the best results using RMSprop [13].
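For reference, a minimal NumPy sketch of the two update rules mentioned above (SGD with momentum 0.9, and RMSprop [13]); the learning rate, decay and epsilon values are illustrative defaults, not the cross-validated values.

    import numpy as np

    def sgd_momentum(param, grad, velocity, lr=1e-3, momentum=0.9):
        """One SGD step with momentum 0.9, as used for the alignment model."""
        velocity = momentum * velocity - lr * grad
        return param + velocity, velocity

    def rmsprop(param, grad, cache, lr=1e-3, decay=0.99, eps=1e-8):
        """One RMSprop step [13]; decay and eps are illustrative defaults."""
        cache = decay * cache + (1 - decay) * grad ** 2
        return param - lr * grad / (np.sqrt(cache) + eps), cache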
Figure 3. Some examples of English captions and their translated Chinese. The translated sentences are obtained using the Google Translation API.

Figure 4. An example of sentence segmentation in the word-level method.

Figure 5. An example of sentence segmentation in the character-level method.

Figure 6. A Chinese caption generated by the character-level RNN during test. For reference, the Chinese sentence at the bottom of the figure means "a young girl is wearing a red shirt and black trousers".
4. Experiments
Figure 7. A Chinese caption generated by the word-level RNN during test. For reference, the Chinese sentence at the bottom of the figure means "a man and a woman are dancing".
4.1. Dataset processing
We experiment on the Flickr30K [15] dataset, which contains 31,000 images, each paired with 5 Chinese sentences. Note that the original captions are in English; we obtained the Chinese captions using the Google Translation API. Some examples are shown in Figure 3. For Flickr30K, we use 1,000 images for validation, another 1,000 images for testing, and the rest for training (the same setting as [6]).
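A minimal sketch of this preprocessing is shown below. The translation call is represented by a placeholder translate_fn rather than the actual Google Translation API client, and the random shuffle is only an assumption, since the exact split protocol (beyond the 1,000/1,000/rest sizes) is not detailed here.

    import random

    def split_flickr30k(image_ids, seed=0):
        """1,000 validation, 1,000 test and the remaining images for training,
        matching the split sizes used in the paper. The shuffle is an
        assumption; the exact split protocol is not described here."""
        ids = list(image_ids)
        random.Random(seed).shuffle(ids)
        return {"val": ids[:1000], "test": ids[1000:2000], "train": ids[2000:]}

    def translate_captions(captions, translate_fn):
        """captions: {image_id: [five English sentences]}; translate_fn is a
        stand-in for the Google Translation API call used in the paper."""
        return {img: [translate_fn(s) for s in sents]
                for img, sents in captions.items()}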
4.2. Methods
As can be seen from Figure 3, Chinese is very different from English: there are no spaces between Chinese characters. We therefore trained our model with two different methods. In the first method, we tokenized each Chinese sentence into a list of words; an example is shown in Figure 4. In the second method, we split each Chinese caption into a list of Chinese characters; an example is shown in Figure 5.
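A minimal sketch of the two segmentation methods, assuming jieba as the word segmenter (the paper does not name the tokenizer it used, so this is only one common choice):

    # jieba is one common Chinese word segmenter; it serves here only as an
    # illustrative choice, since the paper does not name its tokenizer.
    import jieba

    sentence = "一个年轻的女孩穿着红色的衬衫"   # "a young girl is wearing a red shirt"

    words = jieba.lcut(sentence)   # word-level method: a list of Chinese words
    chars = list(sentence)         # character-level method: a list of characters

    print(words)   # e.g. ['一个', '年轻', '的', '女孩', '穿着', '红色', '的', '衬衫']
    print(chars)   # ['一', '个', '年', '轻', '的', ...]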
Figure 8. An example of different captions generated by the word-level (left) and character-level (right) methods. For reference, the corresponding English sentences are shown at the bottom of the figure.
4.3. Model evaluation and comparison

We first trained the RNN model in the two settings; it produces reasonable descriptions of test images with both methods. Figure 6 shows an example of a Chinese caption for a test image generated by the character-level RNN model, and Figure 7 shows an example generated by the word-level RNN model.

We also compared the captions generated by the two methods on the same test images. The interesting part is that, although the two captions differ slightly, each generated sentence still makes sense. An example is shown in Figure 8.

Figure 9. BLEU scores for the RNN model on English caption generation (Fei-Fei's), and for the word-level and character-level methods on Chinese caption generation, on the Flickr30K image dataset.

Before coming to a conclusion, we also compared the performance of both methods quantitatively with the results achieved by [6] (Fei-Fei's model) on English caption generation. We therefore report the BLEU [10] scores (see Figure 9) for both methods with a beam size of 7, computed using the coco-caption code [1], which is the same setting as in [6].
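The actual scores are computed with the coco-caption code [1]; as a rough, illustrative stand-in, the sketch below computes corpus-level BLEU-1 to BLEU-4 with NLTK, scoring each candidate caption against its five reference captions (tokenized at the word or character level to match the method being evaluated).

    # NLTK is used here only as a rough stand-in for the coco-caption code [1].
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    def bleu_1_to_4(candidates, references):
        """candidates: {image_id: [tokens]};
        references: {image_id: [[tokens], ...]} with five references per image.
        Returns corpus-level BLEU-1..BLEU-4 over the whole test set."""
        ids = sorted(candidates)
        hyps = [candidates[i] for i in ids]
        refs = [references[i] for i in ids]
        smooth = SmoothingFunction().method1   # smoothing choice is illustrative
        return [corpus_bleu(refs, hyps,
                            weights=tuple([1.0 / n] * n),
                            smoothing_function=smooth)
                for n in (1, 2, 3, 4)]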
From the BLEU scores in Figure 9, we can see that the RNN model trained with the character-level method for Chinese captions outperforms the model trained with the word-level method. The character-level method performs very close to the original model trained on English captions [6], while the word-level method performs slightly worse.

We therefore conclude that this RNN model works universally well for image captioning in different languages. Before our work in this paper, we had not seen any application of this RNN model to image caption generation in a language other than English, so it was not clear whether this sequential model carries over to other languages. We tested the model on Chinese and arrived at a conclusion that we believe is fair.

One surprise is that the character-level method works better than the word-level one in this task. As some researchers have shown, character-level convolutional neural networks can work even better than word-level ones for text and sentiment classification [16, 7]. Our finding opens further research into the performance of character-level methods with RNN models in other tasks, such as LSTMs for text classification.
5. Limitation and Future work

We used the Google Translation API to obtain the Chinese captions due to limited resources. However, automatic machine translation is currently not very accurate. The reported BLEU scores may not be influenced too much, since BLEU measures the relative similarity between the generated sentence and the reference ones, but the quality of translation may compromise the quality of the generated image descriptions.

In the future, instead of using the sentences translated by Google, we could review a few thousand images and their translated sentences and manually correct them ourselves. With that small set of clean data, we could train the model again to see whether it works better, or fine-tune the hyperparameters of the RNN model to see if it yields an even better result.

Acknowledgments

We highly appreciate the help we received from Professor David and all the AIs in this great course. Most of the knowledge and techniques used in this paper were learnt from the vision course. The idea of training an RNN model for image caption generation in Chinese was inspired by Professor David, who provided us with many valuable suggestions and much feedback.

We want to thank all the course staff for hosting attentive office hours and for their efforts on the course development and the poster session. We really learnt a lot from it.

References

[1] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[2] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2009.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
[6] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[7] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[8] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[9] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
[10] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.
[14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[16] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.