Cartoon Recognition and Classification
E. Humphrey
EEN571
University of Miami
Abstract – It is becoming ever more evident that, given the overwhelming and continual increase in available multimedia, content-based analysis and retrieval methods are attractive options for maintaining enormous databases. Here, the task of
automatic video genre classification is investigated
for cartoons using low-level descriptors, classifying a
query as either cartoon or non-cartoon. A neural
network is developed by training its weights with a genetic algorithm on a compiled database of
ground-truth features. The system performs relatively well on training data compared to previous work, but does not exhibit the same classification accuracy on test data. Reasons for such shortcomings are discussed,
and several conclusions are reached regarding the
body of work presented.
1. Introduction
Genre classification of multimedia has become a
very popular research topic in recent years,
particularly with the success of Internet video
services like YouTube. In general, metadata alone is
rarely sufficient for the correct and efficient
cataloging, indexing, and retrieval of content.
Automating this process positively impacts all facets
of the multimedia entertainment experience. Queries
would return relevant material, regardless of
metadata accuracy or completeness. Additionally,
that very metadata would no longer need to be
supplied by human users. Playback devices could potentially cooperate to enhance the observer’s experience by presenting content in the manner best suited to it, or by filtering content based on user-defined settings.
This generalized problem of genre classification is
by no means a trivial task. Human annotators themselves do not always reach unanimous agreement when delineating genres, owing to the occasionally subjective nature of the problem. Some video genres,
like news broadcasts, contain elements of different
genres. For example, a news segment may contain
shots of both sports and interviews, to name only
two. Distinguishing between these genres requires a
significant computational effort, as high-level events
and actions must be detected and represented via
object tracking, segmentation, and so forth.
Cartoons as a genre are significantly different from
all other video genres, and it is the motivation of the
work here to use low-level, computationally simple
video features to determine whether or not a query is a cartoon. Using an equally distributed variety
of both cartoon and non-cartoon video from the
Internet, a database of nearly 100 single-shot clips is
compiled, uniformly sampled to QVGA (320x240)
resolution at 25 fps. A system is developed to best fit the training data, and its performance is quantified using a second set of similarly processed clips. The
remainder of the paper is arranged as follows: Section 2 provides an overview of the proposed system and the visual descriptors used to generate the feature space of all analyzed video; Section 3 details the process of developing and training the neural network; Section 4 presents the results of the system; and Section 5 discusses the system’s performance and future work.
2. System Design
The classification system was developed as a three-stage process: a feature space is defined and
calculated for each training video, the neural network
weights are trained accordingly, and queries are
classified and compared to the actual classification. A
high-level system overview is shown in Figure 1.
2.1 Previous Work
As mentioned, video classification and retrieval is
by no means a new topic, and much research and
effort has been invested in the field. A review of relevant literature is conducted in [1], which provides a thorough survey of video classification techniques
and methods. Functional methods include text-, audio-, and visual-based approaches, and while using cues from various domains is clearly advantageous, only visual information is considered in the context of this work. With this in mind, prior visual content analysis systems are investigated.

Family of Descriptors
Brightness
Motion Activity
Saturation
Color Nuance
Edge Prominence
HSV Histograms
YCbCr DCT Centroids
Table 1 – Family of Descriptors used in the proposed system.
Motivation for the descriptor families employed in this system is presented and discussed in [2], [3], [7], and [9]. While most of the work reviewed exhibits
overlap between descriptors, there is no commonality
in the literature with regard to the best-suited
classifier for the application. Varied classification
performance resulted from using HMM [2], MLP [3],
SVM [5], C4.5 decision tree [6], fuzzy integrals [8],
and PCA [9]. With no definitive classifier
demonstrated in the literature, the decision is made to
investigate the merit of neural networks in video, and
specifically cartoon, recognition and classification.
Figure 1 – High-level System Diagram
2.2 Video Descriptors
After a review of relevant literature, seven
descriptor families were deemed appropriate for
compactly summarizing the content of each clip. In
general, cartoons are expected to exhibit patches of
bright, saturated color, low levels of motion activity,
and little texture variance.
2.2.1 Brightness
Three video features are calculated from the
brightness descriptor for each frame over the entirety
of a shot. Below, the definitions are given for average
brightness (Eq 1), the percentage of image brighter
than a set threshold (Eq 2), and the change in frame
brightness (Eq 3).
For an M × N frame k with brightness plane $V_k(x, y)$ and threshold $T$:

$B_k = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} V_k(x, y)$    (1)

$P_k = \frac{1}{MN} \sum_{x,y} \mathbf{1}\left[ V_k(x, y) > T \right]$    (2)

$\Delta B_k = \left| B_k - B_{k-1} \right|$    (3)
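As a concrete illustration, a minimal numpy sketch of these three features; the threshold value is an assumption, as the paper does not report the one used:

```python
import numpy as np

def brightness_features(value_planes, thresh=0.75):
    """value_planes: (K, M, N) array of per-frame brightness in [0, 1].
    thresh is an assumed threshold; the paper's value is not stated."""
    avg = value_planes.mean(axis=(1, 2))             # Eq. (1), one value per frame
    pct = (value_planes > thresh).mean(axis=(1, 2))  # Eq. (2), fraction above threshold
    diff = np.abs(np.diff(avg))                      # Eq. (3), frame-to-frame change
    return avg, pct, diff
```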
2.2.2 Saturation

Similar to the descriptors computed for brightness, the average saturation, frame differential saturation, and the percentage of saturation above a defined threshold are derived. However, in this case, the threshold is considered jointly with the value plane of the HSV color space, thereby quantifying the percentage of bright, highly saturated color.

2.2.3 Motion Activity
A single descriptor is used to represent the motion
information of a shot, and can be loosely described as
the magnitude of the distance between pixels in
consecutive frames.
$MA_k = \frac{1}{MN} \sum_{x,y} \left| I_k(x, y) - I_{k-1}(x, y) \right|$    (4)

2.2.4 Edge Prominence
Given the notion that cartoons will generally exhibit far fewer strong edges, a Canny edge detector is applied to the frames in a shot, and the resulting binary images are summed. The descriptor serves to represent the strength and existence of edges in a video clip.
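A short sketch of the motion and edge descriptors under the definitions above, using OpenCV's Canny detector; the Canny thresholds are placeholder assumptions, as the paper does not specify them:

```python
import numpy as np
import cv2

def motion_activity(frames):
    """frames: (K, M, N) uint8 grayscale frames of one shot.
    Returns Eq. (4) for each pair of consecutive frames."""
    f = frames.astype(np.float32)
    return np.abs(np.diff(f, axis=0)).mean(axis=(1, 2))

def edge_prominence(frames, lo=100, hi=200):
    """Sum of binary Canny edge maps over the shot (thresholds assumed)."""
    edge_maps = [(cv2.Canny(frame, lo, hi) > 0) for frame in frames]
    return np.sum(edge_maps, axis=0)  # per-pixel count of edge detections
```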
2.2.5 Color Nuance
Disregarding boundary pixels, the mean color
distance, defined below, is computed for the 3x3
neighborhood surrounding each pixel in a frame, and
averaged over the length of a shot.
Treating hue as an angle, the distance between a center pixel $p$ and each of its eight neighbors $q$ is

$d(p, q) = \sqrt{ (s_p \cos h_p - s_q \cos h_q)^2 + (s_p \sin h_p - s_q \sin h_q)^2 + (v_p - v_q)^2 }$    (5)

and the per-pixel nuance is the mean $\frac{1}{8} \sum_{q} d(p, q)$ over the neighborhood.
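A direct, unoptimized numpy rendering of this definition, assuming hue is expressed in radians and s, v are scaled to [0, 1]:

```python
import numpy as np

def color_nuance(h, s, v):
    """Mean 3x3-neighborhood color distance for one frame, Eq. (5).
    h (radians), s, v: (M, N) HSV planes; boundary pixels disregarded."""
    # Hue is an angle, so colors are compared in (s*cos h, s*sin h, v) space
    x, y = s * np.cos(h), s * np.sin(h)
    M, N = h.shape
    center = np.s_[1:M - 1, 1:N - 1]
    total = np.zeros((M - 2, N - 2))
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nb = np.s_[1 + dr:M - 1 + dr, 1 + dc:N - 1 + dc]
            total += np.sqrt((x[center] - x[nb]) ** 2 +
                             (y[center] - y[nb]) ** 2 +
                             (v[center] - v[nb]) ** 2)
    return (total / 8.0).mean()  # averaged over all interior pixels
```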
Figure 2 – Nodal representation of a two-class Neural Network
2.2.6 HSV Histograms
Motivated by the Scalable Color Descriptor
defined in [7], color content is described by three
histograms in the HSV domain. Accordingly, 16 bins
are used for the hue channel, and 4 each are used for
the saturation and value channels.
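A sketch with the stated bin counts, assuming OpenCV's 8-bit HSV convention (H in [0, 180), S and V in [0, 256)); normalizing by pixel count is an added assumption so clips of different sizes remain comparable:

```python
import numpy as np

def hsv_histograms(hsv):
    """hsv: (M, N, 3) uint8 image in OpenCV's HSV convention."""
    n_pixels = hsv.shape[0] * hsv.shape[1]
    h_hist, _ = np.histogram(hsv[..., 0], bins=16, range=(0, 180))  # 16 hue bins
    s_hist, _ = np.histogram(hsv[..., 1], bins=4, range=(0, 256))   # 4 saturation bins
    v_hist, _ = np.histogram(hsv[..., 2], bins=4, range=(0, 256))   # 4 value bins
    return h_hist / n_pixels, s_hist / n_pixels, v_hist / n_pixels
```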
2.2.7 YCbCr DCT Centroid
A modified version of the Color Layout Descriptor, also given in [7], is compactly defined as the center of
mass for an unwrapped 8x8 DCT matrix in the
YCbCr color space. Each color plane is reduced to 64
uniformly distributed blocks and the average value
calculated. The DCT is then computed for each color
plane, and reshaped into a 1-dimensional vector by
zigzag scanning to maintain the relationships
between resulting DCT values. Centroids are then
calculated for each DCT vector by weighting the
value of the DCT with its corresponding index.
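A sketch of this descriptor for a single color plane. The zigzag scan is the standard JPEG ordering, scipy supplies the 2-D DCT, and weighting by coefficient magnitude is an assumption made here to keep the center of mass well-defined:

```python
import numpy as np
from scipy.fft import dctn

def zigzag_order(n=8):
    """Standard JPEG-style zigzag ordering of an n x n matrix."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 == 0 else -p[1]))

def dct_centroid(plane):
    """plane: (M, N) array for one YCbCr channel, M and N divisible by 8."""
    M, N = plane.shape
    # Reduce the plane to 64 uniformly distributed block averages
    blocks = plane.reshape(8, M // 8, 8, N // 8).mean(axis=(1, 3))
    coeffs = dctn(blocks, norm='ortho')                 # 8x8 DCT
    vec = np.array([coeffs[i, j] for i, j in zigzag_order()])
    weights = np.abs(vec)                               # magnitude (assumed)
    return (weights * np.arange(64)).sum() / weights.sum()
```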
2.3 Feature Space
From the seven descriptor families, a feature vector
is calculated that describes the content and evolution
of each video clip. The mean, standard deviation, or
both are computed for the descriptor vectors, indexed
by the frame, over all frames in the video. A resulting
40-point feature vector is taken as the input to the
classifier.
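Conceptually, the final reduction is simple; since the text does not itemize which descriptors contribute a mean, a standard deviation, or both, this sketch computes both for each:

```python
import numpy as np

def clip_feature_vector(descriptor_tracks):
    """descriptor_tracks: dict mapping descriptor name -> per-frame values.
    Returns one fixed-length feature vector per clip (40 points in the
    actual system, depending on which statistics are kept)."""
    parts = []
    for name in sorted(descriptor_tracks):          # fixed ordering across clips
        track = np.asarray(descriptor_tracks[name], dtype=float)
        parts.extend([track.mean(), track.std()])   # summarize the clip's evolution
    return np.array(parts)
```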
3. Developing the Classifier
The second stage of the system, once the feature
space is defined, is the implementation of a suitable
classifier to differentiate between cartoons and non-cartoons. As noted, the diversity in this area provides
an opportunity to investigate a mechanism of interest.
A three-layer neural network is chosen as the
classifier, and the weights are adjusted to a set of
training data via a genetic algorithm.
3.1 Neural Networks
Taking cues from nature and biology, a neural
network operates on similar principles to those
understood about the functionality and mechanisms
of the brain. Simply put, neurons are interconnected by synapses and fire, or send electrical impulses, when activation criteria are met. It is through these relatively simple mechanisms that the entire brain produces memories, thoughts, and emotions.
The brain is also capable of distinguishing between
difficult, abstract, and even noisy information.
Artificial neural networks, referred to as ANNs,
attempt to computationally mimic the firing of
synapses in the brain by computing the weighted sum
of neural layers to determine the activation of each
neuron. As seen in Figure 2, each node in the second layer, denoted by $h_i$, sums a weighted version of every node in the previous layer, denoted by $f_i$. These
nodes can be activated in one of two manners: a hard activation threshold or a sigmoid response. The former is very straightforward: should the activation threshold be met or exceeded, the neuron fires a ‘1,’ and otherwise a ‘0.’ Alternatively, the sigmoid function is
used to determine the activation energy of a neuron,
which effectively serves as a continuous response
curve. The network employed opts for a sigmoid
response to facilitate the training of the weights, such
that small improvements are noticed by the genetic
algorithm.
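For reference, the sigmoid response used in such networks is conventionally the logistic function (the paper does not write out its exact form, so the standard choice is shown here):

$\sigma(x) = \frac{1}{1 + e^{-x}}$

which maps any weighted sum smoothly into (0, 1), so that small improvements in the weights produce small, measurable changes at the output.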
There are multiple aspects of the neural network that must be adjusted to suit the application. The
number of hidden layers, as well as the number of
nodes in those hidden layers, must be determined.
The general consensus, however, is that intuition and
trial-and-error are the best means of optimizing a
neural network’s performance, and here one hidden
layer is used.
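A minimal numpy sketch of the forward pass for a network of the kind described (one hidden layer, sigmoid activations throughout). The six hidden nodes below are an inference from the 48-neuron total reported in Section 5 (40 inputs + 6 hidden + 2 outputs), not a figure stated explicitly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features, w_hidden, w_output):
    """Each node sums a weighted version of every node in the
    previous layer, then applies the sigmoid response."""
    hidden = sigmoid(features @ w_hidden)   # h_i activations
    return sigmoid(hidden @ w_output)       # one output per class

# Illustration: 40-point feature vector, 6 hidden nodes, 2 classes
rng = np.random.default_rng(0)
w_h = rng.normal(size=(40, 6))
w_o = rng.normal(size=(6, 2))
outputs = forward(rng.random(40), w_h, w_o)
label = "cartoon" if outputs[0] > outputs[1] else "non-cartoon"
```

The decision rule matches Section 3.2: the class whose output node is highest wins.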
3.2 Genetic Algorithm
Topology of the neural network is only half the
required effort, as the weights themselves must be
derived. In keeping consistent with biomimetic
computing, a genetic algorithm is used as a search mechanism to arrive at the weights that produce the
best results. Other neural network training mechanisms include gradient descent and back-propagation, but the genetic algorithm is invoked on
the principle that it can quickly optimize a large
search space.
Modeling reproduction and biological evolution, a population of “chromosomes” is randomly generated. In training the neural network, each chromosome is a set of weights, where each weight has a uniform bit depth and is encoded in a consecutive binary string.
The fitness of the members in the population is
computed by an evaluation function, and those that
perform better are given a better chance of survival.
Chromosomes in the population are then selected and
paired at random, and offspring are produced by
crossing over parts of the paired chromosomes. Other biological processes, such as mutations or inversions, may be implemented, and serve to maintain diversity in the evolving population.
For training the neural network, an evaluation function that determines the fitness of each chromosome is defined as the percentage of the pre-classified training data the chromosome, or set of weights, classifies correctly. Ideally, the winning chromosome would have a fitness of 1, meaning that the set of weights it represents correctly classifies all of the training feature vectors. A response is counted as correct when the output of the true class is higher than that of the other class.
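As an illustration, a compact sketch of this training scheme: binary chromosomes, fitness equal to training accuracy, fitness-proportional selection, single-point crossover, and bit-flip mutation. The bit depth, weight range, topology, and GA parameters below are assumptions for the sketch (a vectorized MATLAB implementation is cited as [10]); fitness_fn would wrap the forward pass of Section 3.1 and return training accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
BITS = 8                      # uniform bit depth per weight (value assumed)
N_WEIGHTS = 40 * 6 + 6 * 2    # weights of an assumed 40-6-2 topology

def decode(chrom):
    """Binary chromosome -> real-valued weights in [-4, 4] (range assumed)."""
    bits = chrom.reshape(N_WEIGHTS, BITS)
    ints = bits @ (2 ** np.arange(BITS - 1, -1, -1))
    return -4.0 + 8.0 * ints / (2 ** BITS - 1)

def evolve(fitness_fn, pop_size=50, generations=200, p_mut=0.01):
    """fitness_fn(weights) -> fraction of training clips classified correctly."""
    pop = rng.integers(0, 2, size=(pop_size, N_WEIGHTS * BITS))
    for _ in range(generations):
        fit = np.array([fitness_fn(decode(c)) for c in pop])
        # Better-performing chromosomes get a better chance of survival
        # (assumes at least one chromosome has nonzero fitness)
        parents = pop[rng.choice(pop_size, size=pop_size, p=fit / fit.sum())]
        # Pair parents off and cross over their tails at a random cut point
        cut = rng.integers(1, N_WEIGHTS * BITS)
        children = parents.copy()
        children[0::2, cut:] = parents[1::2, cut:]
        children[1::2, cut:] = parents[0::2, cut:]
        # Bit-flip mutation maintains diversity in the population
        flips = rng.random(children.shape) < p_mut
        pop = np.where(flips, 1 - children, children)
    fit = np.array([fitness_fn(decode(c)) for c in pop])
    return decode(pop[np.argmax(fit)]), fit.max()
```

A chromosome in this sketch is 2,016 bits long (252 weights × 8 bits), which gives a sense of the size of the search space the algorithm must cover.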
4. Results

In the course of developing the system, there are three discrete stages that produce quantifiable performance results. Using the feature vectors calculated for the training videos, the highest fitness value achieved by optimizing the neural network weights was found to be 0.8469. The significance of this observation will be discussed later, but it should be noted here that, for the training data, 15% of the videos were misclassified. For a binary classification task (two output classes), however, this is roughly 35 percentage points more accurate than a random process. It was also found that, for the data collected and analyzed here, the number of nodes in the hidden layer did not significantly alter classifier performance. Expanding beyond a single hidden layer, however, was not explored.

Before continuing, it is crucial to observe the misclassifications of the training data to better understand the system’s performance on test videos. Interestingly, the majority of incorrect classifications were of real (i.e., non-cartoon) video clips rather than cartoons: only 3 cartoon training videos were misclassified, compared to 12 non-cartoon misclassifications. This gives the distinct impression that, in developing the feature space, it might be necessary to normalize the data globally. Real video is expected to produce drastically different feature vectors than its cartoon counterparts, and this result signifies that more could be done to enhance performance. Pinpointing the source of this confusion, however, proved to be an involved process.

A collection of 40 video segments (19 cartoon, 21 non-cartoon) was compiled for testing the system. Overall classification accuracy was found to be about 64%, with 63.2% for cartoons and 64.5% for non-cartoons. This is admittedly worse than the training data would lead one to expect, but can be broadly attributed to a few key factors. First, it was found that, for reasons not fully understood, many of the non-cartoon test videos were slightly corrupted in preprocessing (shot segmentation, resolution adjustment, and decompression). While the system was designed to make the best of this data, the fact remains that the features calculated for these clips are unreliable. Ignoring these videos raises the classification accuracy to over 75% for non-cartoons, and 70% overall. Additionally, the test videos were created from a different library of content (i.e., different television shows), such that the training process may have inherently biased the weights toward the training library. Establishing a more diverse training database would presumably help alleviate this issue.
5. Discussion
Observing the performance of the system, with
regard to both successes and shortcomings, there are
several interesting conclusions to be drawn.
Ultimately the entire system depends on the quality
and relevance of the extracted features, and the
majority of classification error can be roughly
attributed to this fact. While the features used here
are somewhat adequate, more could be done to
further describe texture, frequency content, and color
composition, to start. Additionally, it is intuitive to
assume that to some degree the features of cartoon
and non-cartoon video will exhibit significant
overlap, and separating the principal components
prior to classifying would no doubt prove beneficial.
On a purely logistical note, it is also necessary to
highlight the heavy reliance on the quality, and
quantity, of training data. This takes into account two
different, albeit related, aspects. The video selection
and resulting diversity will affect the degree to which
the classifier can distinguish between content.
Succinctly, compiling an appropriate database of
clean data is essential to system performance, and
was found to be relatively difficult given prevalent
copyright protection. It was also observed that the
software used in converting video into workable
formats (uncompressed .avi files) often produced
extremely noisy data. Artifacts ranged from discoloration to frame jitter and much in between, often without apparent provocation or cause, and inevitably
degraded system performance. That being said, this
observation further underscores two giant hurdles in
automatic video classification. First, a human
observer would easily be able to identify the
corrupted video files processed, and consequently
misclassified, by the system as cartoon or not.
Secondly, real world applications will invariably
require the processing of less-than-perfect data that
should, to some degree, still be handled by a good
classification system.
Not only could the feature space be improved
upon, but more work could definitely be done to the
development and training of the neural network. It
would be of great interest to see how the system
performance changes as additional layers are added
to the network topology, as it is expected that
performance would greatly improve. The network
used here contained only 48 neurons, whereas the
human brain is composed of an estimated one
hundred billion. Also, it was observed that the confidence margins of correct classifications were generally larger by an order of magnitude than those of misclassifications. This
lends the notion that additional layers would serve to
widen the gap between the neurons at the output layer
and further distinguish between different content.
In the same breath, the implementation of the
genetic algorithm would also stand to benefit from
creative modification. There may be better means by
which to evaluate the fitness of a set of weights than
simply measuring the classification accuracy. Also,
the combination and crossover mechanism of
chromosomes employed is suboptimal. Ideally, the genetic algorithm should converge to a maximum, but in this case it typically stagnated just below the best values found. This problem stems from trying
to simultaneously optimize a high-dimensional function, where slight changes in different weights
can produce extremely different results. Again, a
more comprehensive training space would aid in the
development of a more universal classifier, but a
fundamental component of the genetic algorithm is
the creativity with which it is applied.
Despite the shortcomings of the developed system,
it is apparent that low-level video descriptors can be
used with some degree of reliability to categorize
certain classes of video. Obviously higher level
descriptors would be necessary to differentiate
between more complex genres like sports and music
videos, but the fact remains that computationally
simple features can serve as viable cues in
classification tasks. Future work would include automatic shot segmentation, rather than the manual segmentation performed here, as shot duration itself is another crucial element of genre classification. A larger, more
comprehensive feature space is advocated, as well as
exploratory work with the implementation of a
genetic algorithm for tuning neural networks. It has
been shown that it can be a powerful tool when used
properly, though it may not always be immediately
clear what the proper implementation is. Regardless,
as further efforts are made in the field of automated
video and, on a grander scale, multimedia
classification, we will not only learn more about
artificial intelligence and machine learning but also
gain a deeper understanding of the way in which
humans recognize and process information.
6. References
[1] D. Brezeale and D. Cook, “Automatic Video Classification: A Survey of the Literature,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 38, no. 3, pp. 416-430, May 2008.
[2] R. Glasberg, S. Schmiedeke, M. Mocigemba, and T. Sikora, “New Real-Time Approaches for Video-Genre-Classification using High-Level Descriptors and a Set of Classifiers,” IEEE Conf. on Semantic Computing, pp. 120-127, 2008.
[3] R. Glasberg, A. Samour, K. Elazouzi, and T. Sikora, “Cartoon-Recognition Using Video & Audio Descriptors,” Proc. EUSIPCO, 2005.
[4] B. Ionescu, P. Lambert, D. Coquin, and L. Darlea, “Color-based Semantic Characterization of Cartoons,” International Symposium on Signals, Circuits, and Systems (ISSCS 05), vol. 1, pp. 223-226, July 2005.
[5] T. Ianeva, A. de Vries, and H. Rohrig, “Detecting Cartoons: A Case Study in Automatic Video-Genre Classification,” Proc. IEEE International Conf. on Multimedia and Expo, vol. 1, pp. 449-452, July 2003.
[6] B. Truong, S. Venkatesh, and C. Dorai, “Automatic Genre Identification for Content-Based Video Categorization,” Proc. 15th International Conf. on Pattern Recognition, pp. 230-233, 2000.
[7] B. Manjunath, J. Ohm, V. Vasudevan, and A. Yamada, “Color and Texture Descriptors,” IEEE Trans. Circuits and Systems for Video Tech., vol. 11, no. 6, June 2001.
[8] A. Roma, F. Tarres, and L. Sanchez, “Cartoon Detection using Fuzzy Integral,” 8th International Workshop on Image Analysis for Multimedia Interactive Services, 2007.
[9] L. Xu and Y. Li, “Video Classification Using Spatial-Temporal Features and PCA,” Proc. ICME 2003, pp. 485-488, 2003.
[10] K. Burjorjee, VectorGA – Vectorized Genetic Algorithm MATLAB Implementation, available online: http://code.google.com/p/vector-ga/