Multimedia Tools and Applications, 13, 93–118, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
Retrieval of Commercials by Semantic Content:
The Semiotic Perspective
CARLO COLOMBO
[email protected]
ALBERTO DEL BIMBO
[email protected]
PIETRO PALA
[email protected]
Dipartimento di Sistemi e Informatica, Università di Firenze, via Santa Marta 3, I-50139 Firenze, Italy
Abstract. Video information processing and retrieval is a key aspect of future multimedia technologies and
applications. Commercial videos encode several planes of expression through a rich and dense use of colors,
editing effects, viewpoints and rhythms, which are exploited together to attract potential purchasers. Databases
of commercials can be accessed in order to analyze how a commercial has been developed, retrieve commercials
similar to an example, catalog commercials according to the kind of message conveyed to the user. In this paper,
we present a system allowing the retrieval of commercial streams based on their salient semantics. Semantics is
regarded from the semiotics perspective: collections of signs and semantic features like colors, editing effects,
motion, etc. are used as basic blocks with which the meaning of a commercial is constructed. In our system, it is
possible to retrieve commercials according both to the meaning they convey and to their similarity to examples.
Keywords: video content analysis, video libraries, video indexing, semiotics
1. Introduction
The extraction and manipulation of the information embedded in video data is a key challenge for future multimedia technologies and emerging applications such as digital libraries
and interactive video analysis. The expansion of low cost mass storage devices, improved
compression techniques, the availability of high transmission rates and computing power
have all contributed to make video archives accessible through digital networks and available on everyday desktop computers. However, the advances of computer systems towards
becoming “true” multimedia applications largely depend on the availability of tools for
manipulating video information oriented to cataloguing videos into a content-searchable
database. Much more than text, video conveys information through a multiplicity of planes
of communication which encompass what is represented in the images, how the images are
linked together, how the subject is imaged and so on.
Several recent papers have addressed aspects and problems related to the access and retrieval by content of video streams. Research on automatic segmentation has been presented
by several authors [2, 7, 13, 16, 18, 20]. All of them analyze interframe differences to
automatically detect sharp and gradual shot transitions (cuts, fades and dissolves) as well
as other special transition effects (such as mattes and wipes), and to estimate the motion of
camera and objects in the scene. Automatic segmentation into higher level aggregates has
also been addressed by some authors. In [1], a set of rules to identify macro segments of
a video was proposed. Algorithms to extract story units from video are described in [22].
Also in [8, 19] the specific characteristics of a video type are exploited to build higher level
aggregates of shots.
While significant research efforts have been devoted to the analysis of movies, news
reports, and sport videos, the analysis of commercial videos has been virtually neglected by
the research community. Only quite recently have a few works explicitly addressed commercials,
investigating the possibility of detecting and extracting advertising content from a stream of
video data [6, 15]. The automatic characterization of a commercial video is complicated
by the intrinsic and peculiar features of its time-varying content. First, the effectiveness of
a commercial is related more to its perceptual impact than to its mere content or explicit
message: the way colors are chosen and modified throughout a spot, characters are coupled
and shooting techniques are selected create a large part of the message in a commercial, while
the extraction of canonical contents (e.g., imaged objects) has less conceptual relevance than
in other contexts. Second, strict time requirements compel the director to make a condensed
use of color, rhythm, camerawork, sound, etc. Finally, in commercials, traditional editing
effects are augmented with novel and specific artifacts (e.g., computer graphics, cartoons,
etc.), so as to best draw the audience’s attention and emotions to the product being
promoted.
Until recently, the design and production of commercials has been a discipline based on a set
of usually effective yet empirical rules. Formalized by professionals in the marketing field,
such rules associate each single induced impression and emotion with given combinations
of editing effects [12]. The use of color effects provides a significant example: color editing
is the discipline or method of working with, creating, manipulating, and/or selecting specific colors for the explicit purpose of improving a product’s saleability through aesthetics,
decorative composition or function and quality. A color designer must be able to understand
all that affects a product’s color from composition to marketing strategy and to assimilate
this information providing the most appropriate colors for a product. Quite recently, marketing companies have introduced semiotic methodologies in the process of spot making,
so as to ground the principles of advertising into a solid scientific context, and to better
combine the artistic quality of a commercial with its communication effectiveness.
Semiotics is more concerned with the sense being conveyed than with the set of induced
emotions. According to semioticians, the analysis (and production) of commercials must
be focused on the same narrative mechanisms and structures fairy-tales are based on: this
leads to characterize a commercial according to four broad semiotic classes, or categories
[4, 10].
In this paper, we address the problem of retrieval by semantic content of commercials
based on the message conveyed at the semiotic level of the signification hierarchy. The
choice of semiotic level retrieval is motivated by the peculiar characteristics of commercials.
Strategies based on higher semantic levels, such as retrieval by external keywords, would
prove less appropriate or even impossible for this purpose, because the meaning
of a commercial is mainly conveyed at a syntactic level. In fact, commercials usually
involve and steer the audience through a rich use and combination of editing effects, colors,
graphics, camera movements, etc., while the emphasis on the story is quite poor.
In our system, a four-category semiotic characterization of commercials is produced;
the four semiotic categories are then used to access a database of commercial videos both
by explicit query and by similarity with a template. Perceptual level features of visual
content—colors, motion, cuts, dissolves, etc.—are extracted through both standard and
ad hoc visual processing techniques. The mapping rules connecting perceptual features
to semiotic categories have been developed by conducting an extensive empirical model
validation starting from heuristics provided by semiotics and marketing experts.
The paper is organized as follows: in Section 2 the basic semiotic categories of commercials are introduced. In Section 3 the perceptual features characterizing commercials
are described. In Section 4 the rules used to construct the semiotic level starting from
the feature level are presented and motivated. In Section 5 the system and its interface
are described, and experimental results of commercials retrieval are discussed. Finally, in
Section 6 conclusions are drawn and future work is addressed.
2. Semiotics and commercials
Semiotics is the science of signs as carriers of sense. In semiotics, a “sign” is anything which
conveys a sense: words, pictures, sounds, gestures, clothes, etc. Semiotics suggests that
signs are related to their meaning by social conventions (or, in the semiotics jargon, codes),
i.e., by a specific cultural context.
Semiotic principles provide a useful reference framework for the analysis of videos.
Codes in a video include genre, editing effects (cuts, fades, dissolves, cutting rate and
rhythm), camerawork (shot size, focus, camera movement, angle, slope of framing), manipulation of time (compression, flashbacks, flashforward, slow motion), and well defined
choices of lighting, color, sound, graphics (text or cartoons) and narrative style. The existence of these codes is hard to notice mainly because we are often
so used to them that the meaning of signs deceptively seems to be natural and univocal.
Through a semiotic analysis, the nature and use of each code can be highlighted, thus making it explicit how signs are properly organized for the construction of sense. Commercials
are a particular kind of videos where, due to time constraints, the link between sign and
sense is particularly stressed, so as to obtain the best quality and effectiveness for the conveyed message. Therefore it is not surprising that, quite recently, semiotics studies have
appeared explicitly addressing the analysis of commercial videos and their characteristics
[4]. According to research in this field, the narrative structure of commercials conforms to a
four-element morphology closely related to the one introduced by Propp for fairy-tales [10].
Semiotics introduces a classification of commercials into four different categories, related to
the narrative element which is prominent with respect to the others. Practical commercials emphasize
the qualities of a product according to a common set of values. These commercials represent
everyday life scenes, commonplaces that are recognized by the audience. The product is
described in a familiar environment so that the audience naturally perceives it as useful in
everyday life. Critical commercials introduce a hierarchy of reference values. In this kind
of commercials, the product is the subject of the story. It is a real story, allowing the audience to focus
Figure 1. The semiotic square for commercials.
on the qualities of the product through an apparently objective description of its features.
Utopic commercials provide the definite evidence that the product is able to succeed in
critical tests. In this kind of commercials, the story doesn’t follow a realistic plot: rather,
situations are presented as in a dream. Wide scenarios are used to present the product,
which is shown to succeed in critical conditions often in a fantastic and unrealistic way.
Playful commercials emphasize the match between user’s needs and product’s qualities.
These commercials represent a manifest parody of the other typologies of commercials:
it is clearly stated to the audience that they are watching advertising material. Situations
and places are visibly different from everyday life, and deformed in such a caricatural and
grotesque fashion that the agreement between product qualities and purchaser’s needs is
often stressed in an ironic way (e.g., an old woman driving a Ferrari at 30 km/h).
A common representation of commercial categories uses the semiotic square [11]. This
square combines in pairs four semiotic objects at the same semantic
level according to three basic relationships: opposition, completion, contradiction (see
figure 1). Of these, only the last one has a quantitative characterization: this implies that
the objects placed at opposite ends of the square diagonals are strongly related to each other,
being complementary.
In Section 4 it is shown how all the conceptual categories introduced here come into play
in the creative process of spot making—the “spot language.”
3. Feature extraction
In this section we introduce the perceptual-level features which are significant for a semiotic-level characterization, and summarize the algorithms devised to extract such features automatically from a commercial video.
3.1. Video Segmentation
The primary task of video analysis is its segmentation, i.e., the identification of the start
and end points of each shot that has been edited, in order to characterize the entire shot
through its most representative keyframes. The automatic recognition of the beginning and
end of each shot implies solving two problems: i) avoiding incorrect identification of shot
changes due to rapid motion or sudden lighting change in the scene; ii) identification of
sharp transitions (cuts) as well as gradual transitions (dissolves).
3.1.0.1. Cuts. Rapid motion in the scene and sudden change in lighting yield low correlation between contiguous frames especially in the case in which a high temporal subsampling
rate is adopted. To avoid false cut detection, a metric has been studied which proves highly
insensitive to such variations, while remaining reliable in detecting “true” cuts [8]. To this
end, each frame has been partitioned into nine subframes. Each subframe is represented
by considering the color histograms in the HSI color space. More precisely, to improve
independence with respect to lighting conditions, the histogram takes into account only hue
H and saturation S properties. It can thus be represented as a function $\mathcal{H}(H, S) : \Re^2 \to \Re^+$.
Cut detection is performed by considering the volume of the difference of subframe histograms
in two consecutive frames. If $\mathcal{H}_i^j(H, S)$ is the histogram computed for the $j$-th
subframe ($j$ ranging from 1 to 9) of the $i$-th video frame, for each subframe the following
quantity is computed:
$$v_i^j = \iint \left[ \mathcal{H}_i^j(H, S) - \mathcal{H}_{i+1}^j(H, S) \right] dH \, dS. \qquad (1)$$
Such quantity represents for the $j$-th subframe the volume of the difference of histograms
in two consecutive frames. The presence of a cut at frame $i$ is detected by thresholding the
average value of $v_i^j$ for the nine subframes.
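As an illustration of the cut test in Eq. (1), the following Python sketch compares the hue/saturation histograms of the nine subframes of two consecutive frames. It is a minimal sketch under stated assumptions, not the implementation used in the system: frames are taken to be lists of rows of (hue, saturation) pairs already scaled to [0, 1), the bin count and the threshold are arbitrary illustrative values, and the bin-wise difference is taken in absolute value so that the discrete volume does not cancel out for normalized histograms.

```python
def hs_histogram(pixels, bins=8):
    """Normalized 2-D hue/saturation histogram of (h, s) pairs in [0, 1)."""
    hist = [[0.0] * bins for _ in range(bins)]
    for h, s in pixels:
        hi = min(int(h * bins), bins - 1)
        si = min(int(s * bins), bins - 1)
        hist[hi][si] += 1.0
    n = max(len(pixels), 1)
    return [[v / n for v in row] for row in hist]


def subframes(frame, grid=3):
    """Split a frame (list of rows of (h, s) pairs) into grid x grid pixel lists."""
    rows, cols = len(frame), len(frame[0])
    parts = []
    for gy in range(grid):
        for gx in range(grid):
            r0, r1 = gy * rows // grid, (gy + 1) * rows // grid
            c0, c1 = gx * cols // grid, (gx + 1) * cols // grid
            parts.append([frame[r][c] for r in range(r0, r1) for c in range(c0, c1)])
    return parts


def cut_score(frame_a, frame_b, bins=8):
    """Average over the nine subframes of the histogram-difference volume."""
    volumes = []
    for pa, pb in zip(subframes(frame_a), subframes(frame_b)):
        ha, hb = hs_histogram(pa, bins), hs_histogram(pb, bins)
        # Assumption: absolute bin-wise difference, summed as a discrete integral.
        volumes.append(sum(abs(a - b)
                           for ra, rb in zip(ha, hb)
                           for a, b in zip(ra, rb)))
    return sum(volumes) / len(volumes)


def is_cut(frame_a, frame_b, threshold=0.5):
    """Hypothetical threshold; the text does not state the value used."""
    return cut_score(frame_a, frame_b) > threshold
```

With this normalization the score ranges from 0 (identical subframe histograms) to 2 (completely disjoint ones), which makes the threshold easy to interpret.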
3.1.0.2. Dissolves. The dissolve effect merges two sequences by partly overlapping them.
Dissolve detection in commercials is particularly difficult because dissolves there are very
short. Due to this peculiarity, existing approaches to dissolve detection (developed for
movie segmentation purposes) have shown a very poor performance. We have developed
instead an original method, based on corner statistics, as a means to detect dissolves [5].
Indeed, during the dissolve, the first sequence gradually fades out (i.e., is darkened) while
the second sequence fades in. Therefore, during the editing effect, corners associated with the
first sequence gradually disappear and those associated with the second sequence gradually
appear. This yields a local minimum in the number of corners detected during the dissolve.
Corner detection is based on the algorithm presented in [17]. An image location (x, y)
is defined as a corner if the intensity gradient ∇ I in a patch (x + u, y + v) around it is
not isotropic, i.e., it is distributed along two preferred directions. Operationally speaking, a
corner is characterized by large and distinct values of λ1 (x, y) and λ2 (x, y), the eigenvalues
of the gradient auto-correlation matrix
$$A(x, y) = \begin{pmatrix} \langle I_x^2 \rangle & \langle I_x I_y \rangle \\ \langle I_y I_x \rangle & \langle I_y^2 \rangle \end{pmatrix}, \qquad (2)$$
where $(I_x, I_y) = \nabla I$ and $\langle \cdot \rangle$ denotes Gaussian smoothing over the patch. The overall feature
related to the presence of cuts and dissolves in a video is obtained as $F_{cuts} = \frac{\#cuts}{\#cuts + \#dissolves}$
and $F_{dissolves} = \frac{\#dissolves}{\#cuts + \#dissolves}$, with $F_{cuts}, F_{dissolves} \in [0, 1]$.
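The corner test and the local-minimum criterion for dissolves can be sketched as follows. This is only an illustrative sketch: the smoothed gradient products are assumed to be already available for each location, corner counts per frame are assumed to be given, and the names `min_eig`, `half_window` and `drop` are hypothetical parameters that do not come from the text.

```python
import math


def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean + radius, mean - radius


def is_corner(ixx, ixy, iyy, min_eig=1.0):
    """Corner test on the smoothed products <Ix^2>, <Ix Iy>, <Iy^2>.

    Assumption: a location is a corner when the smaller eigenvalue exceeds
    min_eig, i.e. the gradient is strong along two directions.
    """
    _, lam2 = eigenvalues_2x2(ixx, ixy, iyy)
    return lam2 > min_eig


def dissolve_candidates(corner_counts, half_window=2, drop=0.6):
    """Frames where the corner count has a pronounced local minimum.

    Assumption: a minimum is 'pronounced' when it falls below `drop` times
    the smaller of the counts at the two ends of the window.
    """
    hits = []
    for i in range(half_window, len(corner_counts) - half_window):
        window = corner_counts[i - half_window:i + half_window + 1]
        edges = min(window[0], window[-1])
        if corner_counts[i] == min(window) and corner_counts[i] < drop * edges:
            hits.append(i)
    return hits
```

Using the smaller eigenvalue as the criterion rejects straight edges, where only one eigenvalue is large.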
3.1.0.3. Rhythm. Another relevant video editing characteristic is the rhythm of a sequence
of shots, as related to shot duration and to the use of cuts and dissolves to join shots. For
instance, a sequence of short shots can be used to keep the audience’s attention
continuously alive and emphasize dynamism, modernity, etc., while a sequence of gradually shorter
shots can induce an increase of tension. The rhythm $r(i_1, i_2) \in [0, 1]$ of a video sequence
over a frame interval $[i_1, i_2]$ is defined as
$$r(i_1, i_2) = \frac{\#cuts + \#dissolves}{i_2 - i_1 + 1}, \qquad (3)$$
where #cuts and #dissolves are measured in the same interval. A simple feature measuring
the internal rhythm of an entire sequence is the average rhythm, as related to the overall number of breaks: Fbreaks = r (1, #frames). This is a normalized quantity, such that
Fbreaks = Fcuts + Fdissolves . The absence of breaks is obviously described by the dual feature
Fcontinuous = 1 − Fbreaks .
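The editing features can be computed directly from the break counts, as in the sketch below. It assumes that the numbers of cuts and dissolves have already been detected, takes F_cuts and F_dissolves as the fractions of breaks that are cuts and dissolves, and returns zero for both when a sequence has no breaks at all (a convention chosen here, not stated in the text).

```python
def rhythm(n_cuts, n_dissolves, i1, i2):
    """Eq. (3): breaks per frame over the inclusive frame interval [i1, i2]."""
    return (n_cuts + n_dissolves) / (i2 - i1 + 1)


def editing_features(n_cuts, n_dissolves, n_frames):
    """Editing features of a whole sequence of n_frames frames."""
    breaks = n_cuts + n_dissolves
    # Assumption: with no breaks, both ratios are reported as 0.
    f_cuts = n_cuts / breaks if breaks else 0.0
    f_dissolves = n_dissolves / breaks if breaks else 0.0
    f_breaks = rhythm(n_cuts, n_dissolves, 1, n_frames)
    return {
        "F_cuts": f_cuts,
        "F_dissolves": f_dissolves,
        "F_breaks": f_breaks,
        "F_continuous": 1.0 - f_breaks,
    }
```

For instance, a 450-frame spot with 12 cuts and 3 dissolves has F_cuts = 0.8, F_dissolves = 0.2 and an average rhythm of 15/450 breaks per frame.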
3.2. Shot content
Once a video has been fragmented into shots and video editing features have been extracted,
the content of each shot needs to be internally described. To this end, features are extracted
from each shot keyframe describing characteristics such as the presence and distribution
of relevant colors in the scene, and the distribution and orientation of lines highlighting
specific camera takes.
3.2.0.4. Colors. A description of the shot chromatic content is obtained by performing
keyframe color segmentation, thus highlighting its main color regions. In our system, image
segmentation is carried out by color cluster analysis: segmentation is then achieved by backprojecting cluster centroids (feature space) onto the image [9]. The use of the LUV color
space ensures that small feature distances correspond to similar colors in the perceptual
domain. Clustering in the 3-dimensional feature space is obtained using an improved
version of the standard K-means algorithm [14], which avoids convergence to non-optimal
solutions. Competitive learning is adopted as the basic technique for grouping points in
the color space as in [21]. The chromatic content of a video sequence is expressed using
a set of eight numbers ci ∈ [0, 1] with i denoting one color out of the set {red, orange,
yellow, green, blue, purple, white, black}, each number quantifying the presence in the
keyframe of a region exhibiting the i-th color. Color related features used in the system are
Frecurrent ∈ [0, 1] (expressing the ratio between colors which recur in a high percentage of
keyframes and the overall number of significant colors in the video sequence) and Fsaturated
(expressing the relative presence of saturated colors in the scene). Their duals are, respectively,
Fsporadic = 1 − Frecurrent and Funsaturated = 1 − Fsaturated .
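One possible reading of the recurrent-color feature is sketched below. The text does not specify what counts as a "significant" color or as a "high percentage" of keyframes, so the two thresholds `presence` and `recur_fraction` are hypothetical values introduced only for illustration; each keyframe is assumed to be described by a dictionary mapping color names to the values c_i.

```python
COLORS = ("red", "orange", "yellow", "green", "blue", "purple", "white", "black")


def recurrent_ratio(keyframe_colors, presence=0.1, recur_fraction=0.6):
    """Ratio of recurrent colors to significant colors over a list of keyframes.

    Assumptions: a color is significant if it reaches `presence` in at least
    one keyframe, and recurrent if it does so in at least `recur_fraction`
    of the keyframes.
    """
    n = len(keyframe_colors)
    significant = 0
    recurrent = 0
    for color in COLORS:
        hits = sum(1 for kf in keyframe_colors if kf.get(color, 0.0) >= presence)
        if hits > 0:
            significant += 1
            if hits / n >= recur_fraction:
                recurrent += 1
    return recurrent / significant if significant else 0.0


keyframes = [
    {"red": 0.5, "blue": 0.3},
    {"red": 0.6, "green": 0.2},
    {"red": 0.4, "blue": 0.4},
]
f_recurrent = recurrent_ratio(keyframes)
f_sporadic = 1.0 - f_recurrent
```

In this toy example red and blue recur while green appears only once, so two of the three significant colors are recurrent.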
3.2.0.5. Lines. The detection of significant line slopes in the keyframe is accomplished
by exploiting the Hough transform [3] to generate a line slope histogram. The feature
Fslanted ∈ [0, 1] gives the ratio of slanted (i.e., with a slope neither horizontal nor vertical)
lines with respect to the overall number of lines in the sequence. Its dual is Fhor/vert =
1 − Fslanted .
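Given the orientations of the detected lines, the slanted-line feature reduces to a simple count. The sketch below assumes that line angles (in degrees) have already been obtained, for example from the peaks of a Hough accumulator, and uses a hypothetical tolerance `tolerance_deg` to decide when a line counts as horizontal or vertical.

```python
def slanted_ratio(angles_deg, tolerance_deg=10.0):
    """Fraction of lines whose slope is neither horizontal nor vertical."""
    if not angles_deg:
        return 0.0

    def is_axis_aligned(angle):
        a = angle % 180.0
        # Distance to the nearest of 0, 90 and 180 degrees.
        return min(a, abs(a - 90.0), 180.0 - a) <= tolerance_deg

    slanted = sum(1 for a in angles_deg if not is_axis_aligned(a))
    return slanted / len(angles_deg)


def hor_vert_ratio(angles_deg, tolerance_deg=10.0):
    """Dual feature: fraction of horizontal or vertical lines."""
    return 1.0 - slanted_ratio(angles_deg, tolerance_deg)
```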
4. Mapping features onto semiotic categories
We summarize below the mapping between the set of perceptual features (lower semantic
level) and each of the four semiotic categories (higher semantic level) of commercials.
Such a mapping makes it possible to organize a database of commercials on the basis of their semiotic
content and provide content-based access facilities. The idea is to express the degree of
similarity S of each video with a given query by a simple weighted average of the partial
scores expressing the match with each individual semiotic category:
$$S = q_1 S_{practical} + q_2 S_{playful} + q_3 S_{utopic} + q_4 S_{critical}, \qquad (4)$$
the query being expressed in terms of the set of weights {qi }.
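Eq. (4) translates into a few lines of code. The sketch below ranks a small database for a query given as category weights; the video names and score values are invented for illustration and are not data from the system.

```python
CATEGORIES = ("practical", "playful", "utopic", "critical")


def similarity(query_weights, scores):
    """Eq. (4): weighted combination of the four partial scores."""
    return sum(query_weights[c] * scores[c] for c in CATEGORIES)


def rank(query_weights, database):
    """Video names sorted by decreasing match with the query."""
    return sorted(database,
                  key=lambda name: similarity(query_weights, database[name]),
                  reverse=True)


# Hypothetical database entries (illustrative values only).
database = {
    "video_a": {"practical": 0.10, "playful": 0.90, "utopic": 0.20, "critical": 0.15},
    "video_b": {"practical": 0.85, "playful": 0.10, "utopic": 0.05, "critical": 0.50},
}
purely_playful = {"practical": 0.0, "playful": 1.0, "utopic": 0.0, "critical": 0.0}
ranking = rank(purely_playful, database)
```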
The way partial scores are obtained highlights the language link for each semiotic category. Language links have been formalized with the support of experts in the semiotics
and marketing fields, who provided a number of heuristics also used in the practice of
commercial production (a good survey can be found in [4]; see also [12]). The heuristics
are used to identify the perceptual features contributing to each semiotic category. Specifically, we assume that each partial score is obtained as a linear combination of perceptual
features according to a set of weights $\{w_{ij}\} \in [0, 1]$, $i = 1, \ldots, 4$, $j = 1, \ldots, 3$. Weights are
then empirically adjusted by linear regression based on ground truth data provided by the
experts. The rationale for the mapping follows.
Practical commercials have a linear narrative style: everything in the video must appear
real and close to everyday experience. Camera takes are usually frontal, and care is taken
that all transitions take place in a smooth and natural way. This implies choosing long
dissolves for merging shots (short dissolves are deliberately interpreted by the system as
cuts), and the prevalence of horizontal and vertical lines (giving, respectively, the impression
of relaxation and solidity) over slanted lines:
$$S_{practical} = w_{11} F_{dissolves} + w_{12} F_{hor/vert} + w_{13} F_{unsaturated}. \qquad (5)$$
In playful commercials, the presence of the camera is always emphasized, and all possible
effects are used to stimulate the active participation of the audience in the creation of
sense. Everything looks strange and “false” (colors are unnatural, camera takes are usually
improbable, etc.). Hence
$$S_{playful} = w_{21} F_{cuts} + w_{22} F_{slanted} + w_{23} F_{saturated}. \qquad (6)$$
The main characteristic of utopic commercials is to present the product as part of a world
which looks real but doesn’t resemble everyday life (i.e., it is a realistically rendered ideal
world). For this reason, care is taken to produce a movie-like atmosphere, with a set of
dominant colors defining a closed chromatic world and with all of the traditional editing
effects (cuts, dissolves) possibly taking place:
$$S_{utopic} = w_{31} F_{cuts} + w_{32} F_{dissolves} + w_{33} F_{recurrent}. \qquad (7)$$
Table 1. Typical ranges of values of the perceptual features for the four semiotic categories. The symbol – means
that the feature is not relevant for the category.
Perceptual feature   Practical   Playful    Utopic     Critical
Fdissolves           –           [0, 1]     –          [0, 0.5]
Fcuts                –           [0.6, 1]   [0, 0.5]   –
Fslanted             –           [0.3, 1]   –          –
Fhor/vert            [0, 1]      –          –          [0, 1]
Funsaturated         [0.5, 1]    –          –          –
Fsaturated           –           [0.5, 1]   –          –
Frecurrent           –           –          [0.3, 1]   –
Fsporadic            –           –          –          [0.3, 1]
Fcontinuous          –           –          –          [0.3, 1]
Critical commercials spend most of the spot time displaying the product (typically in central
and frontal views), while the audio comment continues listing its qualities: the scene has to
appear “more realistic than reality itself.” For this reason, the number of breaks is kept low,
while the ever changing colors in the background due to smooth camera motions contribute
to draw the attention to the (constant) color of the product:
$$S_{critical} = w_{41} F_{continuous} + w_{42} F_{sporadic} + w_{43} F_{hor/vert}. \qquad (8)$$
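The four linear combinations of Eqs. (5)–(8) can be collected in a single table-driven function, as sketched below. The regression-fitted weights are not reported in the text, so equal weights of 1/3 are used here purely as a placeholder, and the feature values are invented for illustration.

```python
# Features entering each partial score, in the order of Eqs. (5)-(8).
FEATURES_BY_CATEGORY = {
    "practical": ("F_dissolves", "F_hor_vert", "F_unsaturated"),
    "playful": ("F_cuts", "F_slanted", "F_saturated"),
    "utopic": ("F_cuts", "F_dissolves", "F_recurrent"),
    "critical": ("F_continuous", "F_sporadic", "F_hor_vert"),
}

# Assumption: placeholder weights, NOT the values obtained by the authors.
EQUAL_WEIGHTS = {cat: (1 / 3, 1 / 3, 1 / 3) for cat in FEATURES_BY_CATEGORY}


def partial_scores(features, weights=EQUAL_WEIGHTS):
    """Partial score of each semiotic category as a linear combination."""
    return {
        cat: sum(w * features[name] for w, name in zip(weights[cat], names))
        for cat, names in FEATURES_BY_CATEGORY.items()
    }


# Hypothetical feature values for a fast-paced, saturated video.
features = {
    "F_cuts": 0.9, "F_dissolves": 0.1,
    "F_slanted": 0.7, "F_hor_vert": 0.3,
    "F_saturated": 0.8, "F_unsaturated": 0.2,
    "F_recurrent": 0.4, "F_sporadic": 0.6,
    "F_continuous": 0.95,
}
scores = partial_scores(features)
```

With these values the playful score is 0.8 and the practical score is 0.2, reflecting the duality between the two categories.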
The discussion above validates the usual semiotic square representation of commercial
categories (see again figure 1). Specifically, the pairs practical-playful and utopic-critical
appear to be strongly complementary: notice for instance that the features characterizing the
“playful” category (Eq. (6)) are virtually dual to those of the “practical” category (Eq. (5)).
Table 1 reports typical ranges of values of the perceptual features for the four semiotic
categories.
5. Implementation and results
5.1. System setup and implementation
The retrieval system has been implemented on a stand-alone platform, featuring a Silicon
Graphics R4600 133 MHz processor, 128 MB Memory and IRIX 5.3 Operating System.
The system is composed of three main components: i) a feature extraction engine, ii) a
retrieval engine and iii) a graphic interface. At database population time, each commercial
is automatically processed through the feature extraction engine. This associates with each
video a set of four scores (Spractical , Splayful , Sutopic , Scritical ) representing the extent to which
the video conforms to the four semiotic categories (as expounded in Sections 3 and 4).
Processing of a 450-frame video requires about 5 minutes. Scores extracted from all the
database videos are stored in an index signature file associating each set of scores with the
video they refer to. Presently the system includes over 150 commercial videos digitized
from several Italian TV channels.
The retrieval engine and the graphic interface (see figure 3) have been developed using
JDK 1.1.3. The graphic interface is designed to support video browsing (upper left part of
the interface) and two retrieval modalities:
– the user can select the degree to which the video should conform to the four basic semiotic
categories (upper central part of the interface);
– the user can select one of the videos from the database and query for similar videos
(upper right part of the interface).
In the first case, a set of four weights is extracted according to the values selected by the
user for the degree of conformity to each category. Categories are arranged according to
the semiotic square of figure 1: the relevance of each category, ranging from 0% to 100%,
can be selected through a scroll-bar. The matching score S is computed for each video in
the database according to Eq. (4), and videos are presented in decreasing order of match
Figure 2. Representation of database commercials semiotic features (see text).
in the lower left part of the interface. In the case of search by global similarity, matching
scores are computed by considering the correlation between the characteristic features of
the video used as reference and those of the rest of the database. Videos are then presented
to the user in decreasing order of similarity. Also, by selecting a video from the output
list, the video can be viewed either in full or through its most salient keyframes in a movie
player application. In both cases, retrieval is performed in about 4 sec.
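Search by global similarity can be sketched as a ranking by correlation between feature vectors. The text does not state which correlation measure is used; the sketch below assumes the Pearson coefficient, and the database entries are invented for illustration.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length feature vectors (assumed measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        return 0.0  # constant vector: correlation is undefined, report 0
    return sum(a * b for a, b in zip(dx, dy)) / norm


def rank_by_similarity(template, database):
    """Other videos sorted by decreasing correlation with the template."""
    others = [name for name in database if name != template]
    return sorted(others,
                  key=lambda name: correlation(database[template], database[name]),
                  reverse=True)


# Hypothetical feature vectors (illustrative values only).
database = {
    "template": [0.1, 0.9, 0.2],
    "similar": [0.2, 0.8, 0.3],
    "opposite": [0.9, 0.1, 0.8],
}
ranking = rank_by_similarity("template", database)
```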
5.2. Experimental results
Several tests were conducted to assess system performance and conformity with human
experts’ judgement. Each video was classified in advance by a team composed of 5 experts
in the semiotic and marketing fields. The experts were asked to classify the commercials
by associating with each commercial a position on the semiotic square (figure 2(a) shows
the practical-playful and critical-utopic diagonals of the square as coordinate axes). To
ease the classification task, 25 rectangular regions were identified in the square through a
regular partition, supporting the definition of three bands of conformity of a commercial
to a generic category (the bands correspond to conformity degrees of 0–20%, 20–60%,
60–100% respectively).
Figure 3. Retrieval of playful commercials.
Figure 4. Some of the frames for the first ranked spot in figure 3 (“Tacchini”).
Figure 5. Some of the frames for the second ranked spot in figure 3 (“Audi”).
Table 2. Agreement (in terms of city block distance) between system and experts classification of commercials
with reference to queries for purely practical, critical, utopic and playful commercials.
        City block distance (Dist.) per query
Rank    Practical   Critical   Utopic   Playful
1       0           0          1        0
2       5           3          0        1
3       4           1          4        0
4       3           3          3        1
5       4           3          0        0
Figure 2 shows how database commercials were classified by the semiotic experts
(figure 2(b)) and by the system (figure 2(c)): each region in the square is associated with
a vertical bar, whose height is proportional to the percentage of commercials located in
that region. Notice that many database commercials conform to the playful category,
thus showing a general trend in contemporary commercials towards an unconventional
and “smart” way of presenting the product. A more accurate quantitative measure of
system effectiveness is shown in Table 2. This table shows the agreement between the
system and experts with reference to queries for purely practical, critical, utopic and playful
commercials. For each query, the five best ranked commercials are considered. For a generic
commercial, the agreement between the system and experts is measured as the city block
distance between the two blocks in the semiotic square where the system and the experts
located the commercial. The best average agreement corresponds to the query for playful
commercials, showing the effectiveness of the features used to model this category of
commercials. The worst performance corresponds to queries for practical commercials: in
this case, many commercials that are classified as practical by the system are classified as
critical by the experts. The reason why the experts classify these commercials as critical is
that these commercials feature a remarkable presence of foreground views of the promoted
product. However, this is not a feature which can be coped with by the system—presently
the system isn’t able to check whether the scene represents a foreground object or not—and
consequently the system classifies as practical all those critical commercials exhibiting as
distinguishing feature the foreground view of the promoted product.
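The agreement measure can be reproduced from the distances listed in Table 2. In the sketch below a cell of the semiotic square is assumed to be identified by a (row, column) pair on the 5 × 5 partition; the per-query averages are computed from the tabulated distances.

```python
def city_block(cell_a, cell_b):
    """City block distance between two (row, column) cells of the square."""
    return abs(cell_a[0] - cell_b[0]) + abs(cell_a[1] - cell_b[1])


# Distances of the five best ranked commercials for each query (Table 2).
TABLE_2 = {
    "practical": [0, 5, 4, 3, 4],
    "critical": [0, 3, 1, 3, 3],
    "utopic": [1, 0, 4, 3, 0],
    "playful": [0, 1, 0, 1, 0],
}


def mean_distance(distances):
    """Average system/expert disagreement over the ranked commercials."""
    return sum(distances) / len(distances)


averages = {query: mean_distance(d) for query, d in TABLE_2.items()}
best_query = min(averages, key=averages.get)
worst_query = max(averages, key=averages.get)
```

The averages are 3.2 for practical, 2.0 for critical, 1.6 for utopic and 0.4 for playful queries, consistent with the best and worst cases discussed above.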
Figure 3 shows the output of the retrieval system in response to a query for purely playful
commercials. At the right of the list of retrieved items, a bar-graph displays the degree to
which each shot of the best ranked video—cuts and dissolves are represented by white thin
vertical lines—belongs to the playful category. The best ranked videos in this category all
advertise products for “young and smart” people (sportswear, sport watches, blue jeans etc.):
this is not accidental, but reflects instead the common marketing practice of tailoring a
commercial to a specific audience.
Figures 4 and 5 report some of the keyframes for the two best ranked spots. The first
ranked commercial shows a strong conformity to the playful category and a weak conformity to the critical one (Splayful = 90% and Scritical = 15%). In contrast, the second ranked
commercial shows a strong conformity to the playful category and a weak conformity
to the utopic one (Splayful = 89% and Sutopic = 21%).
The best ranked spot (advertising sportswear) presents all the typical features of a
playful commercial, including a very fast rhythm, non orthodox camera takes—and situations, like skating in a tennis court, reflecting semantic issues at a higher level w.r.t. the level
of our computer analysis—and very saturated colors. Similar features are present in the
second ranked spot. Concerning the second ranked spot (figure 5), notice the presence of
quasi-identical keyframes (e.g., the close-ups of the man), which are typical of a non linear
Figure 6. Feature values for the “Tacchini” commercial.
development of the story. Indeed, in this spot the camera frenetically switches between
close-up views of the man and his dogs, almost never alternating details and global views
like a utopic spot would do.
Figures 6 and 7 show, respectively, the diagrams of the characteristic features
(saturation, hue, luminance, line slopes, number of cuts and dissolves) for the two best
ranked commercials in figure 3. Notice how in playful commercials cuts are by far the
prevailing editing effect, while dissolves are almost absent. Also, as evident from the figures,
line slopes are typically slanted, and colors are saturated.
Figure 7. Feature values for the “Audi” commercial.
Figure 8. Retrieval of practical commercials.
Figure 9. Some of the frames for the first ranked spot in figure 8 (“Knorr”).
Figure 8 shows the output of the retrieval system in response to a query for purely practical
commercials. This time, the best ranked videos obtained in response to the query all
advertise typical family products (the first three retrieved commercials advertise a soup,
a soap and a kind of rice, respectively): this again highlights the fact that the kind of
promoted product often drives the choice of the semiotic category to use. Figures 9 and 10
show some relevant frames for the two best ranked commercials in figure 8. The first ranked
commercial features a relevant conformity to the practical category and a weak conformity
to the critical one (Spractical = 86% and Scritical = 20%). The second ranked commercial
features a relevant conformity to the practical category and a less relevant conformity to the
critical one (Spractical = 85% and Scritical = 50%).
From a rapid inspection of these frame sequences, it is evident that practical spots have a quite linear narrative structure, making it easy to reconstruct the story told in the spot and fill in the “semantic gaps” from frame to frame: this would be a very hard task with the playful sequences of figures 4 and 5. Figures 11 and 12 show the feature distributions for the two best ranked commercials in figure 8. Notice how, as opposed to playful commercials, in practical commercials long dissolves prevail over cuts (see again figure 9), horizontal/vertical lines are dominant, and colors are non-saturated.
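As a rough illustration of the qualitative contrast described above (cuts vs. dissolves, slanted vs. horizontal/vertical lines, saturated vs. non-saturated colors), the following Python sketch checks the three cues on a per-spot feature summary. The field names, the 0.5 thresholds and the all-or-nothing decision are assumptions made for this example only; they are not the rules used by the system to compute the conformity scores.

```python
from dataclasses import dataclass


@dataclass
class SpotFeatures:
    """Perceptual summary of one commercial (field names are illustrative)."""
    cuts: int                 # number of cuts detected in the spot
    dissolves: int            # number of dissolves detected in the spot
    slanted_line_frac: float  # fraction of detected lines that are slanted, in [0, 1]
    mean_saturation: float    # mean color saturation, in [0, 1]


def qualitative_profile(f: SpotFeatures, threshold: float = 0.5) -> str:
    """Label a spot by the three cues discussed in the text.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    playful_cues = [
        f.cuts > f.dissolves,             # cuts prevail over dissolves
        f.slanted_line_frac > threshold,  # slanted lines dominate
        f.mean_saturation > threshold,    # colors are saturated
    ]
    practical_cues = [
        f.dissolves > f.cuts,             # dissolves prevail over cuts
        f.slanted_line_frac < threshold,  # horizontal/vertical lines dominate
        f.mean_saturation < threshold,    # colors are non-saturated
    ]
    if all(playful_cues):
        return "playful-like"
    if all(practical_cues):
        return "practical-like"
    return "mixed"
```

A spot with many cuts, mostly slanted lines and saturated colors is labeled "playful-like"; any spot on which the three cues disagree is reported as "mixed" rather than forced into one class.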
Figure 10. Some of the frames for the second ranked spot in figure 8 (“Dixan”).
Figure 13 shows retrieval results for a query where a database commercial is used as template. In this case, the system output lists commercials in decreasing order of similarity w.r.t. the template. Figures 14, 15 and 16 report some keyframes for the template and for the two best ranked commercials, respectively. Similarity between commercials is evaluated by global correlation between perceptual level features. The commercial used as template in figure 13 is a practical commercial (floor wax), featuring long dissolves, lines with horizontal slopes, and non-saturated colors (see figure 17). Although their feature distributions look different from those of the template, the two best ranked commercials reach a quite high similarity on the characteristic features. As an example, notice that color saturation and cut density in figure 18 are quite similar to those of figure 17. Similarly, despite a higher value for the saturation, figure 19 has hue and luminance distributions very close to those of the template commercial, thus reaching a high similarity score by linear superposition of the partial scores.
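The idea of a global score obtained by linear superposition of partial scores can be sketched as follows. This is a minimal Python illustration under stated assumptions: each commercial is described by one histogram per perceptual feature, the partial score is a normalized correlation (cosine) between histograms, and all features are weighted equally by default. The function names, the choice of correlation measure and the weights are hypothetical and are not taken from the system described here.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Features = Dict[str, Sequence[float]]  # feature name -> histogram


def partial_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized correlation between two same-length, non-negative histograms."""
    if len(a) != len(b):
        raise ValueError("histograms must have the same number of bins")
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def global_similarity(template: Features, candidate: Features,
                      weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the partial scores over the shared features."""
    names = sorted(template.keys() & candidate.keys())
    if not names:
        raise ValueError("no features in common")
    if weights is None:
        weights = {n: 1.0 for n in names}  # assumption: equal weights
    total = sum(weights[n] for n in names)
    return sum(weights[n] * partial_score(template[n], candidate[n])
               for n in names) / total


def rank_by_similarity(template: Features,
                       archive: Dict[str, Features]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs in decreasing order of similarity."""
    scored = [(name, global_similarity(template, feats))
              for name, feats in archive.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the global score is a sum of per-feature terms, a candidate that differs from the template on one feature (for instance saturation) can still rank high when its other features (for instance hue and luminance) match closely, which is the behavior noted for figure 19.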
Figure 11. Feature values for the “Knorr” commercial.
Figure 12. Feature values for the “Dixan” commercial.
Figure 13. Retrieval by similarity.
Figure 14. Some of the frames for the commercial used as template in figure 13 (“Emulsio”).
Figure 15. Some of the frames for the first best ranked commercial in figure 13 (“Barilla”).
Figure 16. Some of the frames for the second best ranked commercial in figure 13 (“Egitto”).
Figure 17. Feature values for the “Emulsio” commercial.
Figure 18. Feature values for the “Barilla” commercial.
Figure 19. Feature values for the “Egitto” commercial.
6.
Conclusions and future work
In this paper, we have addressed the problem of the semantic characterization of commercial videos from a semiotic perspective. A specific set of rules allows us to map low-level (perceptual) features of a commercial into higher-level semantic categories, thus making it possible to represent and retrieve commercials based both on their dominant semiotic characterization and on their global similarity w.r.t. a template video.
Our system is intended to support marketing professionals both in the creation process (as a source of inspiration for choosing a message with predefined characteristics) and in the detection of possible “sense overlaps” and conflicts occurring when two or more semiotic categories happen to coexist in the same advertising campaign or even in the same spot.
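One simple way to flag a possible “sense overlap” within a single spot is to look for a secondary category whose conformity score is also high. The Python sketch below is only an illustration of that idea: the function name and the 40% threshold are hypothetical, and the paper does not state that overlaps are detected in this way. The example scores are the ones reported above for the two best ranked practical commercials.

```python
from typing import Dict, List, Tuple


def find_sense_overlaps(scores: Dict[str, float],
                        threshold: float = 40.0) -> List[Tuple[str, str]]:
    """Return (dominant, secondary) pairs for secondary categories that are also strong.

    `scores` maps a semiotic category to its conformity score in percent.
    `threshold` is an arbitrary cut-off chosen for this example.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    dominant = ranked[0][0]
    return [(dominant, name) for name, value in ranked[1:] if value >= threshold]


# Conformity scores quoted in the text for the two best ranked practical spots.
first_practical = {"practical": 86.0, "critical": 20.0}
second_practical = {"practical": 85.0, "critical": 50.0}
```

With this threshold, only the second spot would be flagged, since its critical score (50%) is well above that of the first spot (20%). Whether such a coexistence is a deliberate choice or an actual conflict remains a judgment for the marketing expert.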
Future work will address the inclusion of a larger number of perceptual features in the system, the handling of very large archives (e.g., some thousands of videos), and a further analysis of the effectiveness and usability of the system with the aid of experts in the marketing and semiotics fields.
Acknowledgments
The authors warmly thank Bruno Bertelli, Laura Lombardi, Mauro Caliani and Jacopo
M. Corridoni for their help in the development of this research.
References
1. P. Aigrain, P. Joly, P. Lepain, and V. Longueville, “Medium knowledge-based macro segmentation of video
sequences,” in Intelligent Multimedia Information Retrieval, M. Maybury (Ed.), 1996.
2. F. Arman, A. Hsu, and M. Chi, “Feature management for large video databases,” in Conf. on Storage and
Retrieval for Image and Video Databases, W. Niblack (Ed.), San Jose, CA, May 1993, pp. 2–12.
3. D.H. Ballard and C.M. Brown, Computer Vision, Prentice-Hall: Engelwood Cliffs, NJ, 1982.
4. B. Bertelli, “La pubblicitazione: la citazione delle avanguardie storiche nelle pubblicità video e stampa,”
Ph.D. Thesis, Dept. of Visual Arts, Univerity of Siena, Italy, 1997.
5. M. Caliani, “ Ricerca per contenuto di filmati pubblicitari,” Master’s Thesis, Dept. of Systems and Informatics,
University of Florence, Italy, 1997.
6. M. Caliani, C. Colombo, A. Del Bimbo, and P. Pala, “Commercial video retrieval by induced semantics,” in
Proc. IEEE Int’l Workshop on Content-based Access of Image and Video Databases CAIVD’98, Bombay,
India, Jan. 1998, pp. 72–80.
7. J.M. Corridoni and A. Del Bimbo, “Film editing reconstruction and semantic analysis,” in Proc. CAIP’95,
Prague, Czech Republic, Sept. 1995.
8. J.M. Corridoni and A. Del Bimbo, “Structured digital video indexing,” in Proc. 13th Int’l Conf. on Pattern
Recognition ICPR’96, Wien, Austria. Aug. 1996, pp. (III):125–129.
9. J.M. Corridoni, A. Del Bimbo, and P. Pala, “Sensations and psychological effects in color image databases,”
ACM Multimedia Systems Journal, Vol. 7, No. 3, May 1999, pp. 175–183.
10. J.-M. Floch, Sémiotique, marketing et communication. Sous les signes, les stratégies. Presses Universitaires
de France, Paris, France, 1990.
11. A.J. Greimas, Sémantique structurale. Larousse, Paris, 1966.
12. C.R. Haas, Pratique de la publicité, Bordas: Paris, France, 1988.
13. A. Hampapur, R. Jain, and T. Weymouth, “Digital video segmentation,” in 2nd Annual ACM Multimedia
Conference and Exposition, San Francisco, CA, Oct. 1994.
RETRIEVAL OF COMMERCIALS BY SEMANTIC CONTENT
117
14. A.K. Jain, Algorithms for Clustering Data, Prentice Hall: Englewood Cliffs, NJ, 1991.
15. R. Lienhart, C. Kuhmünch, and W. Effelsberg, “On the detection and recognition of television commercials,”
in Proc. Int’l Conf. on Multimedia Computing and Systems, Ottawa, Canada, June 1997, pp. 509–516.
16. A. Nagasaka and Y. Tanaka, “Automatic video indexing and full video search for object appearances,” in IFIP
Trans., Visual Database Systems II, W.E. Knuth (Ed.), 1992, pp. 113–128.
17. J.M. Pike and C.G. Harris, “A combined corner and edge detector,” in Proc. Fourth Alvey Vision Conference,
1988, pp. 147–151.
18. S. Smoliar and H. Zhang, “Content-based video indexing and retrieval,” IEEE Multimedia, Vol. 2, No. 1, pp.
63–75, Summer 1994.
19. D. Swanberg, C.F. Shu, and R. Jain, “Knowledge guided parsing in video databases,” in Conf. on Storage and
Retrieval for Image and Video Databases, W. Niblack (Ed.), San Jose, CA, May 1993, pp. 13–24.
20. Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, “Structured video computing,” IEEE Multimedia,
Vol. 1, No. 3, pp. 35–43, Fall 1994.
21. T. Uchiyama and M.A. Arbib, “Color image segmentation using competitive learning,” IEEE Trans. on Pattern
Analysis and Machine Intelligence, Vol. 16, No. 12, pp. 1197–1206, Dec. 1994.
22. M. Yeung, B.L. Yeo, and B. Liu, “Extracting story units from long programs for video browsing and navigation,” in Proc. IEEE Int’l Conf. on Multimedia Computing and Systems, Hiroshima, Japan, June 1996, pp.
296–305.
Carlo Colombo holds an M.S. in electronic engineering from the University of Florence, Italy (1992) and a Ph.D. in robotics from the Scuola Superiore di Studi Universitari e di Perfezionamento Sant’Anna, Pisa, Italy (1996). He is presently an assistant professor at the Faculty of Engineering of the University of Florence, Italy. His main research interests are in computer vision and its applications to advanced human-machine interfaces, robotics and multimedia.
Alberto Del Bimbo is Full Professor and Director of the Department of Sistemi e Informatica at the Università degli Studi di Firenze, Italy. He is also the Director of the Master in Multimedia at the same University. His scientific interests and activity have addressed the subject of Image Technology and Multimedia, with particular reference to object recognition and image sequence analysis, content-based retrieval for image and video databases, visual languages and advanced man-machine interaction. Prof. Del Bimbo is the author of over 150 publications, which have appeared in the most distinguished international journals and conference proceedings, and is the author of the monograph “Visual Information Retrieval”, published by Morgan Kaufmann in 1999. He has also been the Guest Editor of several special issues of international journals and the Chairman of several conferences in the field of Image Processing, Image Databases and Multimedia. He is an IAPR fellow and presently a Member of the Steering Committee of IEEE ICME, the Int. Conference on Multimedia and Expo, and of the VISUAL conference series. From 1996 to 2000 he was the President of the Italian Chapter of IAPR, the International Association for Pattern Recognition. Since 1999 he has been a Member of the IEEE Publications Board. He presently serves as Associate Editor of IEEE Trans. on Multimedia, IEEE Trans. on Pattern Analysis and Machine Intelligence, Pattern Recognition, Journal of Visual Languages and Computing, and Multimedia Tools and Applications.
Pietro Pala holds a Laurea degree in Electronic Engineering (1994) and a Ph.D. in Computer Science (1998), both from the Università di Firenze, Italy. Presently he is an Assistant Professor at the Dipartimento Sistemi e Informatica of the Università di Firenze, Italy. His main research interests include pattern recognition, image and video databases, neural networks and related applications.