Associating Textual Features with Visual Ones to Improve Affective Image Classification
Ningning Liu1, Emmanuel Dellandréa1, Bruno Tellez2, and Liming Chen1
1 Université de Lyon, CNRS, Ecole Centrale de Lyon, LIRIS, UMR5205, F-69134, France
{ningning.liu,emmanuel.dellandrea,liming.chen}@ec-lyon.fr
2 Université Lyon 1, LIRIS, UMR5205, F-69622, France
[email protected]
Abstract. Many images carry strong emotional semantics. In recent years, several investigations have been conducted to automatically identify the emotions that may be induced in viewers looking at images, based on low-level image properties. Since these features can only capture the image atmosphere, they may fail when the emotional semantics are carried by objects. Additional information is therefore needed, and in this paper we propose to make use of textual information describing the image, such as tags. We have developed two textual features to capture the emotional meaning of the text: one is based on a semantic distance matrix between the text and an emotional dictionary, and the other carries the valence and arousal meanings of words. Experiments have been conducted on two datasets to evaluate visual and textual features and their fusion. The results show that our textual features can improve the classification accuracy of affective images.
Keywords: textual feature, visual feature, affective image classification, fusion.
1 Introduction
Recently, with more and more photos being published and shared online, providing effective methods to retrieve semantic information from huge image collections has become essential. Currently, most commercial systems use textual indexable labels to find the relevant images for a given query. In order to avoid manual tagging, several content-based image retrieval (CBIR) and photo annotation systems have been proposed that rely on low-level visual features such as color, shape and texture [1]. However, these works remain at a cognitive level since they aim at automatically detecting the presence of particular objects in the image ("face", "animal", "bus", etc.) or identifying the type of scene ("landscape", "sunset", "indoor", etc.). Very few works deal with the affective level, which can be described as identifying the emotion that is expected to arise in humans when looking at an image, also called affective image classification.
Even though contributions in this emerging research area remain rare, it is gaining more and more attention in the research community [2,3,4,8,17]. Indeed, many applications can benefit from this affective information, such as image retrieval systems or, more generally, "intelligent" human/computer interfaces. Moreover, affective image classification is extremely challenging due to the semantic gap between the low-level features extracted from images and the high-level concepts, such as emotions, that need to be identified. This research area, which is at its beginning stage [3], lies at the crossroads of computer vision, pattern recognition, artificial intelligence, psychology and cognitive science, which makes it particularly interesting.
One of the initial works is from Colombo et al. [6], who developed expressive and emotional level features based on Itten's theory [7] and semiotic principles, and evaluated their retrieval performance on art images and commercial videos. Yanulevskaya et al. [9] proposed an emotion categorization approach for art works based on the assessment of local image statistics using support vector machines. Wang et al. [10] first developed an orthogonal three-dimensional emotional model using 12 pairs of emotional words, and then predicted the emotional factors using SVM regression based on three fuzzy histograms. Recently, Machajdik et al. [11] studied features for affective image classification, including color, texture, composition and content features, and conducted experiments using the IAPS dataset and two other small collected datasets.
All these works rely on visual features covering color, shape, texture and composition, e.g. color harmony features based on Itten's color theory [6], fuzzy histograms based on the Luminance Chroma Hue (LCH) color space [10], face counting and skin area, and aesthetic features [11]. These visual features have generally been designed to capture the image atmosphere, which plays a role of first importance in the emotion induced in viewers. However, these approaches may fail when the emotional semantics are carried by objects in the image, such as a child crying, or a whale dying on a beautiful beach. Therefore additional information is needed, and we propose in this paper to make use of textual information describing the image, which is often provided by photo management and sharing systems in the form of tags. Thus, we have developed two textual features designed to capture the emotional meaning of image tags. Moreover, we have proposed an approach based on the evidence theory to combine these textual features with visual ones, with the goal of improving the affective image classification accuracy.
The contributions of the work presented in this paper can be summarized as follows:
– Proposition of two textual features to represent emotional semantics: one is based on a semantic distance matrix between the text and an emotional dictionary; the other carries the valence and arousal meanings expressed in the text.
– A combination method based on Dempster-Shafer's theory of evidence has been proposed to combine these textual features with visual ones.
– A discretization of the dimensional model of emotions has been considered for the classification of affective images, which is well adapted to image collection navigation and visualization use cases.
– Different types of features, classifiers and fusion methods have been evaluated on two datasets.
The rest of this paper is organized as follows. Our textual features and four groups of visual features for representing emotional semantics are presented in Section 2. Experiments are described in Section 3. Finally, conclusions and future work are drawn in Section 4.
2 Features for Emotional Semantics
Feature extraction is a key issue for concept recognition in images, particularly for emotions. In this work, we propose to use two types of features to identify the emotion induced by an image: visual features to capture the global image atmosphere, and textual features to capture the emotional meaning of the text associated with the image (in the form of tags, for example), which we expect to be helpful when the induced emotion is mainly due to the presence in the image of an object with a strong emotional connotation.
2.1 Textual Features
We propose in this section two new textual features designed to capture the emotional meaning of the text associated with an image. Indeed, most photo management and sharing systems provide textual information for images, including a title, tags and sometimes a caption.
Method 1 (textM1): The basic idea is to compute the semantic distance between the text associated with an image and an emotional dictionary, based on path similarity, which denotes how similar two word senses are according to the shortest path that connects them in a taxonomy. First, a dictionary is built from Kate Hevner's Adjective Circle, which consists of 66 adjectives [13] such as exciting, happy and sad. After a preprocessing step (the text is cleaned by removing irrelevant words), the semantic distance matrix between the text associated with the image and the dictionary is computed by applying the path distance based on WordNet, using the Natural Language Toolkit [14]. Finally, the semantic distance feature is built from this word semantic distance matrix; it expresses the emotional meaning of the text according to the emotional dictionary. The procedure of Method 1 is detailed below.
Method 2 (textM2): The idea is to directly estimate emotional ratings along the valence and arousal dimensions by using the Affective Norms for English Words (ANEW) [15]. This word set was developed to provide normative emotional ratings (including valence and arousal) for a large number of English words. Thus, the semantic similarity between the image text and the ANEW words is computed to measure the emotional meaning of the text along the valence and arousal dimensions. The procedure for Method 2 is detailed below.
Method 1
Input: tag data W and dictionary D = {d_i} with |D| = d.
Output: text feature f; |f| = d, 0 < f_i < 1.
– Preprocess the tags using a stop-words filter.
– If the image has no tags (W = ∅), return f with f_i = 1/2.
– For each word w_t ∈ W:
  1. If no path between w_t and d_i can be found, set S(t, i) = 0.
  2. Calculate the path distance dist(w_t, d_i), where dist is a simple node count along the path from w_t to d_i.
  3. Calculate the path similarity as S(t, i) = 1/(dist(w_t, d_i) + 1).
– Calculate the feature f as f_i = Σ_t S(t, i), and normalize it to [0, 1].
Method 2
Input: tag data W, dictionary D, and ratings of valence V and arousal A for each word in D; ratings vary from 1 to 9; |D| = |V| = |A| = d.
Output: text feature f; |f| = 2, 0 < f_i < 9.
– Preprocess the tags using a stop-words filter.
– If the image has no tags (W = ∅), return f with f_i = 5.
– For each word w_t ∈ W:
  1. If no path between w_t and d_i can be found, set S(t, i) = 0.
  2. Calculate the path distance dist(w_t, d_i), where dist is a simple node count along the path from w_t to d_i.
  3. Calculate the path similarity as S(t, i) = 1/(dist(w_t, d_i) + 1).
– Calculate the distance vector m_i = Σ_t S(t, i), and normalize it to [0, 1].
– Calculate the feature f as f_1 = (1/d) Σ_i (m_i · V_i) and f_2 = (1/d) Σ_i (m_i · A_i).
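For illustration, the two procedures above can be prototyped with NLTK's WordNet interface, whose path_similarity already equals 1/(shortest path length + 1). The sketch below is a minimal, assumption-laden example: the dictionary, tags and valence/arousal ratings are placeholders rather than the actual Hevner adjectives or ANEW norms, and max-normalization is only one possible way to map the similarity sums to [0, 1], since the exact scheme is not specified here.

```python
# Minimal sketch of textM1 and textM2, assuming NLTK with the WordNet and stopwords
# corpora installed (nltk.download("wordnet"), nltk.download("stopwords")).
from nltk.corpus import stopwords, wordnet as wn

STOP = set(stopwords.words("english"))

def best_path_similarity(word_a, word_b):
    """Highest WordNet path similarity over all sense pairs; 0.0 when no path exists.
    NLTK's path_similarity equals 1 / (shortest path length + 1), matching S(t, i)."""
    best = 0.0
    for sa in wn.synsets(word_a):
        for sb in wn.synsets(word_b):
            sim = sa.path_similarity(sb)
            if sim is not None and sim > best:
                best = sim
    return best

def text_m1(tags, dictionary):
    """Method 1: one component per dictionary adjective, normalized to [0, 1]."""
    words = [w.lower() for w in tags if w.lower() not in STOP]
    if not words:                                   # no tags: neutral feature
        return [0.5] * len(dictionary)
    f = [sum(best_path_similarity(w, d) for w in words) for d in dictionary]
    top = max(f) or 1.0
    return [v / top for v in f]                     # max-normalization (one possible choice)

def text_m2(tags, dictionary, valence, arousal):
    """Method 2: two components (valence, arousal) weighted by the similarity vector."""
    words = [w.lower() for w in tags if w.lower() not in STOP]
    if not words:                                   # no tags: mid-scale ratings
        return [5.0, 5.0]
    m = [sum(best_path_similarity(w, d) for w in words) for d in dictionary]
    top = max(m) or 1.0
    m = [v / top for v in m]
    d = len(dictionary)
    f1 = sum(mi * vi for mi, vi in zip(m, valence)) / d
    f2 = sum(mi * ai for mi, ai in zip(m, arousal)) / d
    return [f1, f2]

# Toy usage with a three-word dictionary and made-up ratings on the 1-9 scale.
dico = ["happy", "sad", "exciting"]
print(text_m1(["sunset", "beach", "joy"], dico))
print(text_m2(["sunset", "beach", "joy"], dico,
              valence=[8.2, 2.1, 7.5], arousal=[6.5, 4.1, 7.2]))
```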
2.2 Visual Features
In order to capture the global image atmosphere, which may play an important role in the emotion communicated by the image, we propose to use various visual features covering color, texture, shape and aesthetic information. They are listed in Table 1.
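As a concrete illustration of the first entry of Table 1, the sketch below computes the three central moments per HSV channel. It is an assumption-laden simplification (OpenCV + NumPy, a single image level instead of the pyramidal decomposition that yields 144 values, and the signed cube root as one common skewness convention), not the exact implementation used in the experiments.

```python
# Illustrative color-moments sketch: mean, standard deviation, skewness per HSV channel.
import cv2
import numpy as np

def color_moments(image_bgr):
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float64)
    feats = []
    for c in range(3):                      # H, S, V channels
        chan = hsv[:, :, c].ravel()
        mu = chan.mean()
        sigma = chan.std()
        third = ((chan - mu) ** 3).mean()   # third central moment
        skew = np.sign(third) * abs(third) ** (1.0 / 3.0)
        feats.extend([mu, sigma, skew])
    return np.array(feats)                  # 9 values at a single level

# feats = color_moments(cv2.imread("photo.jpg"))
```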
3 Experiments and Results
3.1 Emotion Models
There are generally two approaches to model emotions: the discrete one and the dimensional one. The first considers adjectives or nouns to specify the emotions, such as happiness, sadness, fear, anger, disgust and surprise. In contrast, with the dimensional model, emotions are described along one or several dimensions such as valence, arousal and control. The choice of an emotion model is generally guided by the application. We propose in this paper to use a hybrid representation, namely a discretization of the dimensional model made of the valence and arousal dimensions. We believe that it is particularly well suited to applications such as image indexing and retrieval, since it allows characterizing images according to their valence and arousal independently, which improves the applicability to navigation and visualization use cases [17].
Table 1. Summary of the visual features used in this work

Category | Features (Short name) | # | Short Description
Color | Color moments (C_M) | 144 | Three central moments (mean, standard deviation and skewness) on HSV channels computed on a pyramidal representation of the image.
Color | Color histogram (C_H) | 192 | Histograms of 64 bins for each HSV channel, concatenated.
Color | Color correlograms (C_C) | 256 | A three-dimensional table representing the spatial correlation of colors in an image.
Texture | Tamura (T_T) | 3 | Features from Tamura [16] including coarseness, contrast and directionality.
Texture | Grey level co-occurrence matrix (T_GCM) | 16 | Described by Haralick (1973): defined over an image as the distribution of co-occurring values at a given offset.
Texture | Local binary pattern (T_LBP) | 256 | A compact multi-scale texture descriptor analyzing textures at multiple scales.
Shape | Histogram of line orientations (S_HL) | 12 | Different orientations of lines detected by a Hough transform.
High level | Harmony (H_H) | 1 | Describes the color harmony of images based on Itten's color theory [8].
High level | Dynamism (H_D) | 1 | Ratio of oblique lines (which communicate dynamism and action) to horizontal and vertical lines (which rather communicate calmness and relaxation) [8].
High level | Y. Ke (H_Ke) | 5 | Ke's aesthetic criteria including spatial distribution of edges, hue count, blur, contrast and brightness [19].
High level | R. Datta (H_Da) | 44 | Datta's aesthetic features (44 of 56), excluding those related to the IRM (integrated region matching) technique [18].
Thus, six emotion classes are considered by discretizing each dimension into three levels: low, neutral and high. This model is illustrated in Figure 1.
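For concreteness, this discretization can be expressed as a small helper mapping the mean valence and arousal ratings of an image to the six class labels. The cut points used below (3.7 and 6.3 on the 1-9 scale) are illustrative assumptions only, since the exact thresholds are not stated here.

```python
# Minimal sketch: map mean (valence, arousal) ratings on a 1-9 scale to the six
# classes LV/NV/HV and LA/NA/HA. LOW_T and HIGH_T are illustrative assumptions.
LOW_T, HIGH_T = 3.7, 6.3

def discretize(score, low_label, neutral_label, high_label):
    if score < LOW_T:
        return low_label
    if score > HIGH_T:
        return high_label
    return neutral_label

def emotion_classes(valence, arousal):
    """Return the pair of labels, e.g. ('HV', 'LA'), for one image."""
    return (discretize(valence, "LV", "NV", "HV"),
            discretize(arousal, "LA", "NA", "HA"))

print(emotion_classes(7.1, 2.8))   # -> ('HV', 'LA')
```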
3.2 Datasets
For training and testing, we have used two data sets. The International Affective Picture System (IAPS) [21] consists of 1182 documentary-style images, each characterized by three scores for valence, arousal and dominance. We have also considered another dataset, which we call Mirflickr, made of 1172 creative-style photographs extracted randomly from the MIRFLICKR25000 collection [22]. This collection supplies all the original tag data provided by the Flickr users, with an average of 8.94 tags per image. In order to obtain ground truth for the emotional semantic ratings, we organized an annotation campaign within our laboratory in which 20 people (researchers) participated, which allowed us to obtain an average of 10 annotations per image.
Fig. 1. The dimensional emotion model. (a) The model includes two dimensions: valence (ranging from pleasant to unpleasant) and arousal (ranging from calm to excited); each blue point represents an image from IAPS. (b) We build six classes by dividing the valence and arousal dimensions into three levels: low arousal (LA), neutral arousal (NA), high arousal (HA), low valence (LV), neutral valence (NV) and high valence (HV).
Each image annotation consists of two ratings according to the emotion communicated by the image, in terms of valence (from negative to positive) and arousal (from passive to active). The description of the two data sets is given in Table 2.
Table 2. The description of the two data sets used in our experiments

Database | Size | Text | LV | NV | HV | LA | NA | HA
IAPS | 1182 | No tags | 340 | 492 | 350 | 291 | 730 | 161
MirFlickr2000 | 1172 | 8.93 tags/image | 257 | 413 | 502 | 261 | 693 | 218
3.3 Experimental Setup
The experimental setup was as follows: we built six Support Vector Machine (SVM) classifiers, each one dedicated to an emotion class, following a one-against-all strategy. A 5-fold cross-validation was conducted to obtain the results. Our objectives with these experiments were to evaluate: a) the performance of the different features on the two data sets; b) the performance of fusion strategies, including max-score, min-score, mean-score, majority voting and the Evidence Theory, on IAPS; c) the performance of the combination of visual and textual features on Mirflickr2000 using the Evidence Theory.
More specifically, in a) the LIBSVM tool was employed, and the input features were normalized to [0, 1] to train an RBF kernel; in b) each feature set was used to train a classifier c_n, which produces a measurement vector y^n giving the degree of belief that the input belongs to each class, and the classifiers were then combined by adjusting the evidence of the different classifiers so as to minimize the MSE on the training
data, following [5]. Meanwhile, a comparison with different combination approaches has been made, including min-score: z_k = min(y_k^1, y_k^2, ..., y_k^N); mean-score: z_k = (1/N) Σ_{n=1}^{N} y_k^n; max-score: z_k = max(y_k^1, y_k^2, ..., y_k^N); and majority vote: z_k = argmax(y_k^1, y_k^2, ..., y_k^N), where y_k^n represents the k-th measurement of classifier c_n.
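These score-level rules can be written compactly. The sketch below assumes a scores matrix with one row per classifier and one column per class, and reads the majority vote as counting, per class, how many classifiers rank that class first, which is one common interpretation of the argmax formulation above.

```python
# Minimal sketch of the score-level fusion rules, for a scores array of shape
# (N classifiers, K classes) holding the per-class belief of each classifier.
import numpy as np

def min_score(scores):   return scores.min(axis=0)    # z_k = min_n y_k^n
def max_score(scores):   return scores.max(axis=0)    # z_k = max_n y_k^n
def mean_score(scores):  return scores.mean(axis=0)   # z_k = (1/N) sum_n y_k^n

def majority_vote(scores):
    """One vote per classifier for its top class; returns vote counts per class."""
    n_classes = scores.shape[1]
    winners = scores.argmax(axis=1)
    return np.bincount(winners, minlength=n_classes)

scores = np.array([[0.2, 0.5, 0.3],     # classifier 1
                   [0.1, 0.6, 0.3],     # classifier 2
                   [0.4, 0.3, 0.3]])    # classifier 3
print(mean_score(scores), majority_vote(scores))
```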
3.4 Results: Performance of Different Features
Figure 2 shows the performance of the different features on the two data sets. For IAPS, it appears that the texture features (LBP, co-occurrences, Tamura) are the most efficient ones among the visual features. However, for the Mirflickr data set, which is composed of professional creative photographs, the aesthetic features from Datta [18] perform better on the pleasantness (valence) dimension, while the color correlograms, color moments and aesthetic features perform better on the arousal dimension. This suggests that aesthetic features are related to pleasant feelings, particularly for photographs, and that colors affect arousal perception in a certain way. One can note that the textual features do not perform as well as the visual features, which may be explained by the fact that, even if the text contains important information, it is not sufficient on its own and should be combined with visual features, as evaluated in the next experiment. Finally, the high-level features dynamism and harmony may at first seem to give lower performance, but since each consists of a single value, their efficiency is in fact remarkable.
Fig. 2. Feature performance on the two databases. The measurement is the average classification accuracy, i.e., the accuracy averaged over the three levels of each emotion dimension.
3.5 Results: Performance of Different Combination Approaches on IAPS
The results for the different combination methods on the IAPS data set are given in Table 3. They show that the combination of classifiers based on the Evidence Theory, with an average classification accuracy of 58%, outperforms the other combination methods for affective image classification. This may be explained by its ability to handle knowledge that is ambiguous, conflicting and uncertain, which is particularly valuable when dealing with emotions.
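To make the Evidence Theory fusion more concrete, the sketch below implements the basic step of Dempster's rule of combination for two mass functions over a small frame of discernment. It illustrates only the plain combination rule, not the weighted, MSE-trained variant of [5] that is actually used in the experiments, and the example masses are made up.

```python
# Minimal sketch of Dempster's rule of combination for two mass functions given as
# dicts {frozenset(hypotheses): mass}.
from itertools import product

def dempster_combine(m1, m2):
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb            # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# Two classifiers giving beliefs over {emotion, not_emotion} plus ignorance (full frame).
E, NE = frozenset({"emotion"}), frozenset({"not_emotion"})
THETA = E | NE
m_visual = {E: 0.6, NE: 0.1, THETA: 0.3}
m_text   = {E: 0.5, NE: 0.2, THETA: 0.3}
print(dempster_combine(m_visual, m_text))
```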
Table 3. The performance of classifier combination techniques on IAPS. LV, NV and HV represent the low, neutral and high levels of valence; LA, NA and HA represent the low, neutral and high levels of arousal, respectively. Results are given as average classification accuracy in percent.

Class | Early fusion (%) | Max-score (%) | Min-score (%) | Mean-score (%) | Majority vote (%) | Evidence theory (%)
LV | 52.4 | 51.4 | 47.7 | 53.4 | 49.9 | 50.5
NV | 51.3 | 53.2 | 51.6 | 52.3 | 51.0 | 55.3
HV | 49.8 | 52.1 | 50.1 | 55.0 | 52.1 | 54.8
LA | 60.6 | 61.7 | 57.2 | 62.7 | 58.7 | 62.7
NA | 62.3 | 62.5 | 54.6 | 61.8 | 63.1 | 66.4
HA | 53.7 | 54.1 | 52.1 | 50.9 | 53.6 | 58.3
Average | 55.0 | 55.8 | 52.2 | 56.0 | 54.7 | 58.0
Table 4. The performance with different settings on the Mirflickr2000 dataset, combining textual and visual features based on the Evidence Theory. Text M1 and Text M2 refer to textual feature methods 1 and 2; Color+Text M1 refers to the visual classifiers trained on the color feature group combined with the text classifier trained on the text M1 feature. The best performance in each panel is indicated in bold.

Features | LV (%) | NV (%) | HV (%) | LA (%) | NA (%) | HA (%)
Text M1 | 20.1 | 36.2 | 25.2 | 25.2 | 37.3 | 30.4
Text M2 | 27.2 | 35.5 | 33.8 | 34.4 | 40.2 | 34.3
Color | 39.5 | 42.8 | 36.7 | 51.6 | 54.7 | 48.3
Color+Text M1 | 39.1 | 45.2 | 38.1 | 52.7 | 55.2 | 47.7
Color+Text M2 | 39.4 | 52.8 | 37.3 | 56.6 | 57.1 | 50.5
Texture | 43.1 | 44.2 | 40.1 | 46.8 | 48.4 | 43.1
Texture+Text M1 | 44.2 | 42.2 | 42.3 | 50.1 | 52.0 | 46.7
Texture+Text M2 | 45.3 | 46.8 | 44.3 | 51.3 | 55.5 | 50.9
Shape | 28.7 | 34.2 | 25.8 | 26.7 | 27.2 | 24.8
Shape+Text M1 | 29.7 | 34.6 | 29.5 | 29.1 | 36.5 | 30.7
Shape+Text M2 | 31.5 | 38.4 | 33.7 | 37.3 | 41.8 | 36.4
High level | 48.7 | 55.1 | 44.5 | 51.3 | 56.3 | 46.0
High level+Text M1 | 49.4 | 53.4 | 45.2 | 54.6 | 52.3 | 47.4
High level+Text M2 | 52.1 | 56.2 | 47.7 | 55.7 | 58.7 | 54.6
All visual | 54.1 | 56.8 | 45.5 | 54.6 | 57.2 | 55.9
All visual+Text M1 | 55.4 | 57.2 | 44.1 | 56.8 | 58.4 | 56.1
All visual+Text M2 | 56.2 | 59.0 | 49.7 | 61.1 | 62.5 | 59.7
All visual+Text M1&M2 | 59.5 | 62.2 | 50.2 | 63.8 | 63.7 | 62.5
3.6 Results: Performance of Textual Features on Mirflickr2000
The results obtained with the different combination strategies using textual and visual features on the Mirflickr data set are given in Table 4. These results show that the textual feature textM2 performs better than textM1, except for the neutral valence level. As pointed out in Section 3.4, the textual features do not outperform the visual features when considered independently. However, the combination of textual and visual features improves the classification accuracy. Indeed, the combination of high-level features and textual features performs better on the valence dimension, while the color features combined with the textual features perform well on the arousal dimension. When combined with textual features, the performance of the shape feature group improves markedly. Moreover, the combination of all the visual features with the textual features significantly improves the classification accuracy for all classes. These results show that by using the Evidence Theory as the fusion method to combine visual features with our proposed textual features, the identification of the emotion that may arise in image viewers can be greatly improved compared to methods that rely only on visual information.
4 Conclusion
In this paper, we have proposed two textual features designed to capture the emotional connotation of the text associated with images for the problem of affective image classification. Our motivation was to provide additional information to enrich visual features, which can only capture the global atmosphere of images. This may be particularly useful when the induced emotion is mainly due to objects in the image rather than to global image properties. Our experimental results have shown that the combination of visual features with our proposed textual features can significantly improve the classification accuracy. However, this assumes that text is available for the images, in the form of tags or a caption. Therefore, our future research directions will include the proposition of strategies to overcome this limitation, for example by using automatic image annotation approaches or by exploiting the text associated with visually similar images. Moreover, we will investigate solutions to exploit text information as much as possible when it is noisy and not completely reliable.
Acknowledgment. This work is partly supported by the French ANR under the project VideoSense ANR-09-CORD-026.
References
1. Smeulders, A.W.M., et al.: Content-based Image Retrieval: the end of the early
years. IEEE Trans. PAMI 22(12), 1349–1380 (2000)
2. Zeng, Z., et al.: A survey of affect recognition methods: audio, visual and spontaneous expressions. IEEE Transactions PAMI 31(1), 39–58 (2009)
3. Wang, W., He, Q.: A survey on emotional semantic image retrieval. In: ICIP, pp.
117–120 (2008)
4. Wang, S., Wang, X.: Emotion semantics image retrieval: a brief overview. In: Tao,
J., Tan, T., Picard, R.W. (eds.) ACII 2005. LNCS, vol. 3784, pp. 490–497. Springer,
Heidelberg (2005)
5. Al-Ani, A., Deriche, M.: A new technique for combining multiple classifiers using the Dempster-Shafer theory of evidence. J. Artif. Intell. Res. 17, 333–361 (2002)
6. Colombo, C., Del Bimbo, A., Pala, P.: Semantics in visual information retrieval. IEEE Multimedia 6(3), 38–53 (1999)
7. Itten, J.: The art of colour. Otto Maier Verlag, Ravensburg, Germany (1961)
8. Dellandréa, E., Liu, N., Chen, L.: Classification of affective semantics in images
based on discrete and dimensional models of emotions. In: CBMI, pp. 99–104 (2010)
9. Yanulevskaya, V., et al.: Emotional valence categorization using holistic image
features. In: ICIP, pp. 101–104 (2008)
10. Weining, W., Yinlin, Y., Shengming, J.: Image retrieval by emotional semantics:
A study of emotional space and feature extraction. ICSMC 4, 3534–3539 (2006)
11. Machajdik, J., Hanbury, A.: Affective image classification using features inspired
by psychology and art theory. ACM Multimedia (2010)
12. Wang, G., Hoiem, D., Forsyth, D.: Building text features for object image classification. In: CVPR, pp. 1367–1374 (2009)
13. Hevner, K.: Experimental studies of the elements of expression in music. American
Journal of Psychology 48(2), 246–268 (1936)
14. Natural language toolkit, http://www.nltk.org
15. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW). Tech. Rep
C-1, GCR in Psychophysiology, University of Florida (1999)
16. Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual
perception. IEEE Transactions on SMC 8(6), 460–473 (1978)
17. Liu, N., Dellandréa, E., Tellez, B., Chen, L.: Evaluation of Features and Combination Approaches for the Classification of Emotional Semantics in Images. VISAPP
(2011)
18. Datta, R., Li, J., Wang, J.Z.: Content-based image retrieval: approaches and trends
of the new age. In: ACM Workshop MIR (2005)
19. Ke, Y., Tang, X., Jing, F.: The Design of High-Level Features for Photo Quality
Assessment. In: CVPR (2006)
20. Dunker, P., Nowak, S., Begau, A., Lanz, C.: Content-based mood classification for
photos and music. In: ACM MIR, pp. 97–104 (2008)
21. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: The IAPS: Technical manual and affective ratings. Tech. Rep A-8., GCR in Psychophysiology, Unv. of Florida (2008)
22. Huiskes, M.J., Lew, M.S.: The MIR Flickr Retrieval Evaluation. In: ACM Multimedia Information Retrieval, MIR 2008 (2008)