Cartoon Recognition and Classification
E. Humphrey
EEN571, University of Miami

Abstract – It is becoming ever more evident that, given the overwhelming and continual increase in available multimedia, content-based analysis and retrieval methods are attractive options for maintaining enormous databases. Here, the task of automatic video genre classification is investigated for cartoons using low-level descriptors, classifying a query as either cartoon or non-cartoon. A neural network is developed by training its weights with a genetic algorithm on a purpose-built database of ground-truth features. The system performs relatively well on training data compared to previous work, but does not exhibit the same classification accuracy in testing. Reasons for these shortcomings are discussed, and several conclusions are reached regarding the body of work presented.

1. Introduction

Genre classification of multimedia has become a popular research topic in recent years, particularly with the success of Internet video services like YouTube. In general, metadata alone is rarely sufficient for the correct and efficient cataloging, indexing, and retrieval of content. Automating this process positively impacts all facets of the multimedia entertainment experience. Queries would return relevant material regardless of metadata accuracy or completeness. Additionally, that metadata would no longer need to be supplied by human users. Playback devices could cooperate to enhance the observer's experience by presenting content in the manner best suited to it, or filter content based on user-defined settings.

This generalized problem of genre classification is by no means a trivial task. Delineation of genres may not always result in unanimous agreement among human observers, owing to the occasionally subjective nature of the problem. Some video genres, like news broadcasts, contain elements of other genres; a news segment may contain shots of both sports and interviews, to name only two. Distinguishing between these genres requires significant computational effort, as high-level events and actions must be detected and represented via object tracking, segmentation, and so forth.

Cartoons as a genre are significantly different from all other video genres, and the motivation of this work is to use low-level, computationally simple video features to determine whether or not a query is a cartoon. Using an evenly distributed variety of both cartoon and non-cartoon video from the Internet, a database of nearly 100 single-shot clips is compiled, uniformly sampled to QVGA (320x240) resolution at 25 fps. A system is developed to best fit the training data, and its performance is then qualified using a second set of similarly processed clips.

The remainder of the paper is arranged as follows: Section 2 provides an overview of the proposed system and the visual descriptors used to generate the feature space of all analyzed video, Section 3 details the process of developing and training the neural network, Section 4 presents the results of the system, and Section 5 discusses the system's performance and future work.

2. System Design

The classification system was developed as a three-stage process: a feature space is defined and calculated for each training video, the neural network weights are trained accordingly, and queries are classified and compared to the actual classification. A high-level system overview is shown in Figure 1.

Figure 1 – High-level system diagram
2.1 Previous Work

As mentioned, video classification and retrieval is by no means a new topic, and much research and effort has been invested in the field. A review of relevant literature is conducted in [1], which provides a thorough survey of video classification techniques and methods. Functional methods include text-, audio-, and visual-based approaches, and while using cues from various domains is clearly advantageous, only visual information is considered in the context of this work.

With this in mind, prior visual content analysis systems are investigated. Motivation for the descriptor families employed in this system is presented and discussed in [2], [3], [7], and [9]. While most of the work reviewed exhibits overlap between descriptors, there is no commonality in the literature with regard to the best-suited classifier for the application. Varied classification performance resulted from using HMMs [2], MLPs [3], SVMs [5], the C4.5 decision tree [6], fuzzy integrals [8], and PCA [9]. With no definitive classifier demonstrated in the literature, the decision is made to investigate the merit of neural networks in video, and specifically cartoon, recognition and classification.

Table 1 – Family of descriptors used in the proposed system:
Brightness; Motion Activity; Saturation; Color Nuance; Edge Prominence; HSV Histograms; YCbCr DCT Centroids.

2.2 Video Descriptors

After a review of relevant literature, seven descriptor families (Table 1) were deemed appropriate for compactly summarizing the content of each clip. In general, cartoons are expected to exhibit patches of bright, saturated color, low levels of motion activity, and little texture variance.

2.2.1 Brightness

Three video features are calculated from the brightness descriptor for each frame over the entirety of a shot: the average brightness (Eq. 1), the percentage of the image brighter than a set threshold (Eq. 2), and the change in brightness between frames (Eq. 3). For an $M \times N$ frame with brightness (value) plane $V_t(x, y)$ at frame $t$ and threshold $\tau$:

$$\bar{B}_t = \frac{1}{MN} \sum_{x,y} V_t(x,y) \qquad (1)$$

$$P_t = \frac{1}{MN} \sum_{x,y} \mathbf{1}\left[ V_t(x,y) > \tau \right] \qquad (2)$$

$$\Delta B_t = \left| \bar{B}_t - \bar{B}_{t-1} \right| \qquad (3)$$

2.2.2 Saturation

Similar to the descriptors computed for brightness, the average saturation, frame-differential saturation, and percentage of the frame above a saturation threshold are derived. In this case, however, the threshold is considered jointly with the value plane of the HSV color space, thereby quantifying the percentage of bright, highly saturated color.

2.2.3 Motion Activity

A single descriptor is used to represent the motion information of a shot, and can be loosely described as the average magnitude of the difference between corresponding pixels in consecutive frames:

$$A_t = \frac{1}{MN} \sum_{x,y} \left| V_t(x,y) - V_{t-1}(x,y) \right|$$

2.2.4 Edge Prominence

Given the notion that cartoons will generally exhibit far fewer strong edges, a Canny edge detector is applied to the frames in a shot, and the resulting binary images are summed. The descriptor serves to represent the strength and existence of edges in a video clip.

2.2.5 Color Nuance

Disregarding boundary pixels, the mean color distance is computed over the 3x3 neighborhood surrounding each pixel in a frame, and averaged over the length of a shot. Treating each HSV pixel $(h, s, v)$ as the point $(s \cos h,\ s \sin h,\ v)$, the nuance at pixel $p$ is the mean distance to its eight neighbors $q$:

$$D(p) = \frac{1}{8} \sum_{q} \sqrt{ (s_p \cos h_p - s_q \cos h_q)^2 + (s_p \sin h_p - s_q \sin h_q)^2 + (v_p - v_q)^2 }$$

2.2.6 HSV Histograms

Motivated by the Scalable Color Descriptor defined in [7], color content is described by three histograms in the HSV domain. Accordingly, 16 bins are used for the hue channel, and 4 each are used for the saturation and value channels.
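Before moving to the final descriptor family, the following is a minimal sketch of a few of the per-frame descriptors defined above, assuming frames arrive as NumPy HSV arrays of shape (H, W, 3) with all channels scaled to [0, 1]. The function names and the threshold value are illustrative, not taken from the original implementation.

```python
import numpy as np

def brightness_features(frames, tau=0.8):
    """Mean brightness (Eq. 1), percent above threshold (Eq. 2),
    and frame-differential brightness (Eq. 3)."""
    v = np.stack([f[..., 2] for f in frames])   # value plane, shape (T, H, W)
    mean_b = v.mean(axis=(1, 2))                # per-frame average brightness
    pct_b = (v > tau).mean(axis=(1, 2))         # fraction of bright pixels
    diff_b = np.abs(np.diff(mean_b))            # change between frames
    return mean_b, pct_b, diff_b

def motion_activity(frames):
    """Average absolute pixel difference between consecutive frames."""
    v = np.stack([f[..., 2] for f in frames])
    return np.abs(np.diff(v, axis=0)).mean(axis=(1, 2))

def hsv_histograms(frame, bins=(16, 4, 4)):
    """16/4/4-bin histograms over the H, S, and V channels (Section 2.2.6)."""
    return [np.histogram(frame[..., c], bins=b, range=(0.0, 1.0))[0]
            for c, b in enumerate(bins)]
```

Per Section 2.3 below, such per-frame values would then be collapsed into per-clip statistics (mean and/or standard deviation) to form the feature vector.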
2.2.7 YCbCr DCT Centroids

A modified version of the Color Layout Descriptor, also given in [7], is compactly defined as the center of mass of an unwrapped 8x8 DCT matrix in the YCbCr color space. Each color plane is first reduced to 64 uniformly distributed blocks, and the average value of each block is calculated. The DCT is then computed for each reduced color plane and reshaped into a one-dimensional vector by zigzag scanning, which maintains the relationships between neighboring DCT coefficients. A centroid is then calculated for each DCT vector by weighting each coefficient by its corresponding index.

2.3 Feature Space

From the seven descriptor families, a feature vector is calculated that describes the content and evolution of each video clip. The mean, the standard deviation, or both are computed for the frame-indexed descriptor vectors over all frames in the video. The resulting 40-point feature vector is taken as the input to the classifier.

3. Developing the Classifier

The second stage of the system, once the feature space is defined, is the implementation of a suitable classifier to differentiate between cartoons and non-cartoons. As noted, the diversity in this area provides an opportunity to investigate a mechanism of interest. A three-layer neural network is chosen as the classifier, and its weights are fit to a set of training data via a genetic algorithm.

3.1 Neural Networks

Taking cues from nature and biology, a neural network operates on principles similar to those understood about the functionality and mechanisms of the brain. Simply put, neurons are interconnected by synapses, which fire, or send electrical impulses, when activation criteria are met. It is through these relatively simple mechanisms that the brain produces memories, thoughts, and emotions, and is able to distinguish between difficult, abstract, and even noisy information.

Artificial neural networks (ANNs) attempt to computationally mimic the firing of synapses in the brain by computing weighted sums across neural layers to determine the activation of each neuron. As seen in Figure 2, each node in the second layer, denoted hi, sums a weighted version of every node in the previous layer, denoted fi.

Figure 2 – Nodal representation of a two-class neural network

These nodes can be activated in one of two manners: by an activation threshold or by a sigmoid response. The former is straightforward: should the activation threshold be met or exceeded, the neuron fires a '1', and otherwise a '0'. Alternatively, the sigmoid function determines the activation energy of a neuron, effectively serving as a continuous response curve. The network employed here opts for a sigmoid response to facilitate the training of the weights, such that small improvements are noticed by the genetic algorithm.

Multiple aspects of the neural network are left to be adjusted to the application: the number of hidden layers, as well as the number of nodes in those layers, must be determined. The general consensus, however, is that intuition and trial and error are the best means of optimizing a neural network's performance, and here one hidden layer is used.
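As a concrete illustration, the sketch below implements the forward pass of such a network: one hidden layer with sigmoid activations and two output nodes (cartoon, non-cartoon). The 40-point input matches Section 2.3; the hidden-layer size of six is an assumption, chosen only so that the total node count lines up with the 48 neurons mentioned in the discussion if input nodes are counted. Bias terms are omitted for brevity.

```python
import numpy as np

# Forward pass of a one-hidden-layer sigmoid network (Section 3.1).
# Layer sizes are assumptions: 40 inputs, 6 hidden nodes, 2 outputs.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features, w_hidden, w_out):
    """features: (40,); w_hidden: (6, 40); w_out: (2, 6)."""
    h = sigmoid(w_hidden @ features)   # hidden-layer activations
    return sigmoid(w_out @ h)          # one continuous response per class

# A query is labeled with whichever class output is higher:
rng = np.random.default_rng(0)
w_h, w_o = rng.normal(size=(6, 40)), rng.normal(size=(2, 6))
scores = forward(rng.random(40), w_h, w_o)
label = "cartoon" if scores[0] > scores[1] else "non-cartoon"
```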
3.2 Genetic Algorithm

The topology of the neural network is only half the required effort, as the weights themselves must still be derived. Keeping consistent with biomimetic computing, a genetic algorithm is used as a search mechanism to arrive at the weights that produce the best results. Other neural network training mechanisms include gradient descent and backpropagation, but the genetic algorithm is invoked on the principle that it can quickly optimize over a large search space.

Modeling reproduction and biological evolution, a population of "chromosomes" is randomly generated. In training the neural network, each chromosome is a set of weights, where each weight has a uniform bit depth and is expressed as a consecutive binary string. The fitness of the members of the population is computed by an evaluation function, and those that perform better are given a better chance of survival. Chromosomes in the population are then selected and paired at random, and offspring are produced by crossing over parts of the paired chromosomes. Other biological processes, such as mutation or inversion, may also be implemented, and serve to produce further evolutionary variety.

For training the neural network, the evaluation function that determines the fitness of each chromosome is defined as the percentage of the pre-classified training data that the chromosome, or set of weights, classifies correctly. Ideally, the winning chromosome would have a fitness of 1, meaning that the set of weights it represents correctly classifies every training feature vector. A response is counted as correct when the output of the true class is higher than that of the other class.
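A minimal sketch of this training loop follows, assuming binary-encoded weights, roulette-wheel survival, single-point crossover, and bit-flip mutation. The bit depth, population size, weight range, and mutation rate are illustrative choices, not values reported here; `net` stands for any function mapping (features, flat weight vector) to the two class scores, such as the forward pass sketched in Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
BITS, N_WEIGHTS, POP, W_RANGE = 8, 40 * 6 + 6 * 2, 30, 5.0  # assumptions

def decode(chrom):
    """Map a flat bit string to real-valued weights in [-W_RANGE, W_RANGE]."""
    ints = np.packbits(chrom.reshape(-1, BITS), axis=1)
    return (ints.ravel() / 255.0 - 0.5) * 2 * W_RANGE

def fitness(chrom, X, y, net):
    """Fraction of training clips classified correctly (Section 3.2)."""
    w = decode(chrom)
    preds = np.array([net(x, w).argmax() for x in X])
    return np.mean(preds == y)

def evolve(X, y, net, generations=100, p_mut=0.01):
    pop = rng.integers(0, 2, size=(POP, N_WEIGHTS * BITS), dtype=np.uint8)
    for _ in range(generations):
        fit = np.array([fitness(c, X, y, net) for c in pop]) + 1e-9
        parents = pop[rng.choice(POP, size=POP, p=fit / fit.sum())]
        cut = rng.integers(1, N_WEIGHTS * BITS)       # single-point crossover
        children = parents.copy()
        children[::2, cut:], children[1::2, cut:] = (
            parents[1::2, cut:], parents[::2, cut:])
        flip = rng.random(children.shape) < p_mut     # bit-flip mutation
        pop = np.where(flip, 1 - children, children).astype(np.uint8)
    best = np.argmax([fitness(c, X, y, net) for c in pop])
    return decode(pop[best])
```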
4. Results

In the course of developing the system, there are three discrete stages that produce quantifiable performance results. Using the feature vectors calculated for the training videos, the highest fitness value achieved by optimizing the neural network weights was found to be 0.8469. The significance of this observation is discussed later, but it should be noted here that 15% of the training videos were misclassified. For a binary classification task (two output classes, A or B), however, this is roughly 35 percentage points more accurate than a random process. It was also found that, for the data collected and analyzed here, the number of nodes in the hidden layer did not significantly alter classifier performance; expanding beyond a single hidden layer, however, was not explored.

Before continuing, it is crucial to observe the misclassifications of the training data to better understand the system's performance on test videos. Interestingly, the majority of incorrect classifications were of real (i.e., non-cartoon) video clips rather than cartoons: only 3 cartoon training videos were misclassified, compared to 12 non-cartoon misclassifications. This gives the distinct impression that, in developing the feature space, it might be necessary to normalize the data to a global level. Real video is expected to produce drastically different feature vectors than its cartoon counterparts, and this result signifies that more could be done to enhance performance. Pinpointing the source of this confusion, however, proved to be an involved process.

A collection of 40 video segments (19 cartoon, 21 non-cartoon) was compiled for testing the system. Overall classification accuracy was found to be about 64%, with 63.2% for cartoons and 64.5% for non-cartoons. This is admittedly worse than the training data would lead one to expect, but can be broadly attributed to a few key factors. First, it was found that, for reasons not fully understood, many of the non-cartoon test videos were slightly corrupted in preprocessing (shot segmentation, resolution adjustment, and decompression). While the system was designed to make the best of this data, the fact remains that the features calculated for these clips are unreliable; ignoring these videos raises the classification accuracy to over 75% for non-cartoons and 70% overall. Second, the test videos were drawn from a different library of content (i.e., different television shows), such that the training process may have inherently biased the weights toward the training library. Establishing a more diverse training database would presumably help alleviate this issue.

5. Discussion

Observing the performance of the system, with regard to both its successes and shortcomings, there are several interesting conclusions to be drawn. Ultimately, the entire system depends on the quality and relevance of the extracted features, and the majority of classification error can be roughly attributed to this fact. While the features used here are somewhat adequate, more could be done to further describe texture, frequency content, and color composition, to start. Additionally, it is intuitive to assume that the features of cartoon and non-cartoon video will exhibit significant overlap to some degree, and separating out the principal components prior to classification would likely prove beneficial.

On a purely logistical note, it is also necessary to highlight the heavy reliance on the quality, and quantity, of training data. This takes into account two different, albeit related, aspects. The video selection and resulting diversity will affect the degree to which the classifier can distinguish between content. Succinctly, compiling an appropriate database of clean data is essential to system performance, and this was found to be relatively difficult given prevalent copyright protection. It was also observed that the software used to convert video into workable formats (uncompressed .avi files) often produced extremely noisy data. Artifacts ranged from discoloration to frame jitter and a good deal in between, without provocation or cause, and inevitably degraded system performance. This observation further underscores two giant hurdles in automatic video classification. First, a human observer would easily be able to identify the corrupted video files processed, and consequently misclassified, by the system as cartoon or not. Second, real-world applications will invariably require the processing of less-than-perfect data that a good classification system should, to some degree, still handle.

Not only could the feature space be improved upon, but more work could also be done on the development and training of the neural network. It would be of great interest to see how system performance changes as additional layers are added to the network topology, as it is expected that performance would improve considerably. The network used here contained only 48 neurons, whereas the human brain is composed of an estimated one hundred billion. It was also observed that the confidence values of correct classifications were generally larger, by an order of magnitude, than those of misclassifications. This suggests that additional layers would serve to widen the gap between the neurons at the output layer and further distinguish between different content. In the same breath, the implementation of the genetic algorithm would also stand to benefit from creative modification; there may be better means of evaluating the fitness of a set of weights than simply measuring classification accuracy.
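One hypothetical direction, suggested by the confidence-gap observation above, would be to reward not just correct decisions but also the margin between the two output nodes. The following is a sketch only; the function name and the weighting term alpha are illustrative and not part of the original system.

```python
import numpy as np

def margin_fitness(scores, y, alpha=0.1):
    """Hypothetical fitness: accuracy plus a bonus for the output margin.

    scores: (n_clips, 2) array of network outputs for one candidate weight
    set; y: true class indices; alpha: illustrative margin weighting."""
    correct = scores.argmax(axis=1) == y
    margin = np.abs(scores[:, 0] - scores[:, 1])   # gap between output nodes
    return correct.mean() + alpha * np.mean(margin * correct)
```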
Also, the combination and crossover mechanism employed for the chromosomes is suboptimal. Ideally, the genetic algorithm should always find a maximum, but in this case it typically degenerates to fitness values just below the best previously found. This problem stems from trying to simultaneously optimize a high-dimensional function, where small changes to different weights can produce extremely different results. Again, a more comprehensive training space would aid in the development of a more universal classifier, but a fundamental component of the genetic algorithm is the creativity with which it is applied.

Despite the shortcomings of the developed system, it is apparent that low-level video descriptors can be used with some degree of reliability to categorize certain classes of video. Higher-level descriptors would obviously be necessary to differentiate between more complex genres like sports and music videos, but the fact remains that computationally simple features can serve as viable cues in classification tasks. Future work would include automatic shot segmentation, rather than the manual segmentation performed here, as shot duration itself is another crucial element of genre classification. A larger, more comprehensive feature space is advocated, as well as further exploratory work on genetic algorithms for tuning neural networks; it has been shown that the genetic algorithm can be a powerful tool when used properly, though it may not always be immediately clear what the proper implementation is. Regardless, as further efforts are made in the field of automated video and, on a grander scale, multimedia classification, we will not only learn more about artificial intelligence and machine learning but also gain a deeper understanding of the way in which humans recognize and process information.

6. References

[1] D. Brezeale and D. Cook, "Automatic Video Classification: A Survey of the Literature," IEEE Trans. on Systems, Man, and Cybernetics, vol. 38, no. 3, pp. 416-430, May 2008.
[2] R. Glasberg, S. Schmiedeke, M. Mocigemba, and T. Sikora, "New Real-Time Approaches for Video-Genre-Classification Using High-Level Descriptors and a Set of Classifiers," IEEE Conf. on Semantic Computing, pp. 120-127, 2008.
[3] R. Glasberg, A. Samour, K. Elazouzi, and T. Sikora, "Cartoon-Recognition Using Video & Audio Descriptors," Proc. of EUSIPCO 2005.
[4] B. Ionescu, P. Lambert, D. Coquin, and L. Darlea, "Color-Based Semantic Characterization of Cartoons," International Symposium on Signals, Circuits, and Systems (ISSCS 05), vol. 1, pp. 223-226, July 2005.
[5] T. Ianeva, A. de Vries, and H. Rohrig, "Detecting Cartoons: A Case Study in Automatic Video-Genre Classification," Proc. IEEE International Conf. on Multimedia and Expo, vol. 1, pp. 449-452, July 2003.
[6] B. Truong, S. Venkatesh, and C. Dorai, "Automatic Genre Identification for Content-Based Video Categorization," Proc. 15th International Conf. on Pattern Recognition, pp. 230-233, 2000.
[7] B. Manjunath, J. Ohm, V. Vasudevan, and A. Yamada, "Color and Texture Descriptors," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 6, June 2001.
[8] A. Roma, F. Tarres, and L. Sanchez, "Cartoon Detection Using Fuzzy Integral," 8th International Workshop on Image Analysis for Multimedia Interactive Services, 2007.
[9] L. Xu and Y. Li, "Video Classification Using Spatial-Temporal Features and PCA," Proc. ICME 2003, pp. 485-488, 2003.
[10] K. Burjorjee, VectorGA – Vectorized Genetic Algorithm MATLAB Implementation, available online: http://code.google.com/p/vector-ga/