Content-enriched Classifier for Web Video Classification
Bin Cui¹  Ce Zhang¹  Gao Cong²
¹Department of Computer Science & Key Lab of High Confidence Software Technologies (Ministry of Education), Peking University
{bin.cui,zhangce}@pku.edu.cn
²School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]
ABSTRACT
With the explosive growth of online videos, automatic real-time
categorization of Web videos plays a key role in organizing, browsing and retrieving the huge number of videos on the Web. Previous
work shows that, in addition to text features, content features of
videos are also useful for Web video classification. Unfortunately,
extracting content features is computationally prohibitive for real-time video classification. In this paper, we propose a novel video
classification framework that is able to exploit both content and
text features for video classification while avoiding the expensive
computation of extracting content features at classification time.
The main idea of our approach is to utilize the content features extracted from training data to enrich the text-based semantic kernels, yielding content-enriched semantic kernels. The content-enriched semantic kernels enable us to utilize both content and text features for classifying new videos without extracting their content features.
The experimental results show that our approach significantly outperforms the state-of-the-art video classification methods.
Categories and Subject Descriptors
H.3 [Information Systems]: Information Storage and Retrieval
General Terms
Algorithms, Experimentation, Performance
Keywords
Web, Video, Content, Text, Classification
1. INTRODUCTION
Recent years have witnessed an explosive growth of online video
sharing and many Web sites provide video sharing services such as
YouTube, GoogleVideo, YahooVideo, MySpace, and ClipShack.
These video sharing Web sites often organize online videos into
categories so that users can browse videos by categories and search
within a certain category.
SIGIR’10, July 19–23, 2010, Geneva, Switzerland.
The category labels in the video sharing Web sites are usually
provided by users when they upload videos to the Web sites. However, the manual annotation of category labels for Web videos suffers from several problems. For example, it can be burdensome for users to choose an appropriate category for a video; moreover, users generally have different understandings of video categories, and thus the user-labeled categories are often inconsistent. Hence, it is highly desirable to automatically suggest categories based on the text description and content of a video, even if
the suggestion may not be perfect. Additionally, some Web sites,
e.g., Google Video, collect not only the Web videos uploaded by
users in video sharing Web sites, but also the videos embedded in
news websites, blogs, etc. For the latter, the videos usually do not
have category information. Hence, automatic video categorization
is essential to determine categories of such videos so that the videos
without user labeled categories can also be organized in the same
way as the videos having category labels.
The semantics of videos are described by both content features
and textual descriptions. On one hand, most Web videos are associated with text descriptions: almost every video sharing Web site
(e.g., YouTube) requires users to attach tags and descriptions; most
videos from other sources, e.g., the blogosphere and news Web
sites, are also surrounded by text. On the other hand, Web videos
are visual content-rich, e.g., color, texture, shape, etc. The previous work on automatic video classification [4, 26, 16] exploits both
text features and content features for video categorization, and the
results reported in the existing work show that text features consistently outperform content features for video classification; however, using both text and content features performs better than using either of them alone.
Although content features can contribute to video classification,
extracting content features is computationally expensive. The time
cost of extracting content features is generally in the same order
of magnitude as the video length. For example, it is reported
[9] that the speed of extracting some visual features, e.g., SIFT, is
less than 10 frames per second. The expensive computational cost
renders the previous work utilizing content features impractical in
two typical application scenarios: 1) to automatically categorize
huge volumes of Web videos, and 2) to provide real-time category
suggestions for a newly uploaded video on a video sharing Web site.
The existing work on video classification [4, 26, 16] simply treats
the text description of a video as a bag of words to build classifiers as
in text classification for general documents. However, the text associated with Web video has special characteristics compared with
normal text documents. First, the text descriptions of videos are
usually quite short (tens of words). Additionally, the descriptions of different videos usually share very few identical words. That is, the feature space is extremely sparse. Second, the text descriptions
(often generated from users) of Web videos often contain special
words, such as names of persons or organizations, acronyms (e.g.
“TNA” for “Total Nonstop Action”) and Web language.
The extreme sparsity limits the performance of most existing text
classification methods. One would be tempted to employ the classification methods based on semantic kernels using WordNet or
word co-occurrence for Web video classification to alleviate the sparsity problem. However, many special words are not covered by WordNet, and this limits the performance of the classification method [3] using WordNet-based semantic kernels; the
method [22] based on co-occurrences of words in a collection itself
is limited by the feature sparsity problem.
To this end, in this paper we propose a new video classification
framework that exploits features from both text and visual content
in a novel way. Specifically, we integrate visual content features extracted from the training data into the computation of the semantic
similarity between words to enrich the text-based semantic similarity. Based on this novel similarity measure, we construct content-enriched semantic kernels and employ them to build a text-based classifier. At classification time, we do not need to extract the content
features of videos to be classified, which is computationally expensive, while we are still able to implicitly utilize the content features
to promote classification performance.
Compared with existing work on video classification, the proposed framework has a salient feature: it incorporates content features into building a text-based classifier without sacrificing efficiency at the classification stage. To further improve the classification performance, we also employ Multi-Kernel SVM techniques to combine multiple kernels, including the proposed content-enriched semantic kernel and text-based semantic kernels. In summary, this
paper makes the following contributions:
1. We present a novel framework that is able to exploit both
content features (e.g., visual features) and text features extracted from training data, without requiring the extraction of content
features at the classification stage.
2. We implement this framework by introducing Content-Enriched
Similarity (CES) between words, which integrates the visual
features to enhance the semantic similarity between words.
We also theoretically justify that the CES can effectively capture the similarity between words by establishing its connection with Pointwise Mutual Information, which is developed to measure similarity in information theory and has a solid foundation in statistics.
3. We conduct an extensive performance study on a large real-life dataset containing more than 10,000 videos (more than
500 hours) with text descriptions downloaded from YouTube.
The experimental results demonstrate the superiority of our
method over state-of-the-art approaches for online video classification.
The rest of this paper is organized as follows. Section 2 discusses related work. In Section 3, we present the proposed video
classification framework and Content-Enriched Similarity (CES).
The proposed approach is evaluated in Section 4. We conclude this
paper in Section 5.
2. PRELIMINARIES
2.1 Related Work on Video Classification
With the increasing availability of online videos, automatic video
categorization has attracted increasing attention recently [4, 26, 15,
16, 19, 23]. We summarize the existing work into three video classification frameworks.
Figure 1: Existing frameworks for video classification: (a) text-based; (b) content-based; (c) fusion of text and content.
Text-based Framework: The framework is given in Figure 1(a).
Existing approaches in this framework cast video classification as
a text classification task. Each video is represented with “bag-of-words” features, and text classification models are used to build classifiers. Approaches in this framework avoid the expensive computation of extracting content features and can be efficient at classification time. However, the video classification methods based on text features alone have two disadvantages. First, they cannot leverage the rich information contained in the video content. Second, the
sparsity of text features for online videos limits the performance of
these approaches as discussed in Introduction.
Content-based Framework: Figure 1(b) shows the framework.
The approaches in this framework use visual content features, e.g.,
color, texture and edge, to construct the classifiers [15, 26, 19, 23].
Different classification models are employed, such as rule-based models [15], decision trees [23] and SVM [19]. Several categorization methods are used in [4, 26], such as Naive Bayes, Maximum Entropy, SVM, etc., and it is shown that SVM obtains the best performance. The content features are in general less effective than text features, although they perform better on some video categories. In addition, extracting content features is very time consuming, which limits their application to very large video databases.
Fusion Framework using Both Text and Content Features:
The framework is shown in Figure 1(c), and is adopted in the recent work [4, 26]. These approaches build separate classifiers using text features or different content features. At classification time, these approaches extract the content features of a new video, and the results from the separate classifiers are fused to determine the category of the video. As for the fusion methods, [4] uses a voting scheme, [26] bases the classification on a linear combination of the results of the separate classifiers, and [16] uses the judgment for each class from each classifier as features and then builds a meta-classifier to make the final decision. These approaches [4, 26] perform better than the approaches using text features or content features alone. However, this framework also suffers from the expensive computational cost of extracting content features at classification time. Moreover, they build separate classifiers independently and then employ fusion methods to combine the results. This ignores the correlations among different features for classification.
2.2 Related Work on Text Classification
Since text-based classifiers play significant roles in video classification [26, 4], we also introduce related work on text classification. Many techniques have been proposed in the text categorization field [8, 21, 27]. According to the surveys in [21, 27], SVM often outperforms other methods. There is a large body of work on semantic
kernels using SVM for categorization to explore the semantic similarity between words. Bloehdorn et al. [2] summarized previous
work on semantic kernels and proposed a general framework for
semantic kernels. The framework [2] divides the methods of semantic kernels into two categories. First, syntactic structures of
sentences are extracted and Tree Kernels [18] have been proposed
to exploit the parse trees of sentences. Second, ontology and term
co-occurrence have been employed to compute the semantic similarity between words and this semantic similarity is used in kernel
functions. WordNet [10] is used in [3] and term co-occurrence [6]
is used in [22] to derive a Latent Semantic Kernel. The idea behind
them is to incorporate a similarity matrix that records the similarity
between tokens (words or latent semantic words), and then use this
matrix for semantic expansion of original feature space.
To our knowledge, none of the previous work on video classification employs semantic kernels. Even if semantic kernels were used, they could only partly address the feature sparsity problem for Web video classification: the text descriptions of online videos contain a large number of words that are not in WordNet, and this limits the classification performance of semantic kernels based on WordNet. Additionally, the video descriptions
are usually short and generated by users. Users may not often use
two words with similar senses in a short description. For example,
consider the two synonyms bunny and rabbit. Most users will not use both words to describe a rabbit in a short video description. Therefore, this limits the effectiveness of semantic kernels using co-occurrence similarity for Web video classification.
We briefly introduce SVM, kernel functions and Multi-Kernel SVM [7, 17] to provide background for the proposed approach. SVM trains a hyper-plane to separate data of two categories while minimizing the empirical risk. Various non-linear kernels have been proposed
for classification problems where the data in different categories is
not linearly separable. The idea is to apply a non-linear mapping
of data to a feature space, in which the linear SVM method can be
used. This mapping is defined as φ : X → F , where X is the
original space, and F is the feature space. Because it is difficult to build the feature space directly, kernel functions are instead used to implicitly define the feature space.
DEFINITION 1 (KERNEL FUNCTION). A kernel is a function K such that for all x, z ∈ X, K(x, z) = φ(x) · φ(z), where φ is a mapping from X to an (inner product) feature space F.
Work on semantic kernels aims at defining different mappings φ(·) satisfying the above definition, using knowledge from WordNet or co-occurrence statistics.
Two word-similarity matrices are popularly used in defining semantic kernels. First, several semantic similarity measures [25, 14] between words are defined based on WordNet. Second, semantic similarity between words can also be computed from word co-occurrence. Intuitively, two words are similar when they frequently appear together in some documents. The relationship between words occurring in the same document is called co-occurrence. Term Co-Occurrence (COO) is a popular similarity measure for words. If
we represent each word as a vector of term frequency (tf ) in each
document, then COO of two words can be computed by the cosine
similarity of their vectors.
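To make this concrete, the following minimal Python sketch (not from the original paper; the document-word tf matrix X is a hypothetical input) computes the pairwise cosine similarity between word tf vectors, i.e., the COO similarity:

    import numpy as np

    def coo_similarity(X):
        # X: document-word term-frequency matrix of shape (num_docs, num_words);
        # each column of X is a word's tf vector over the document collection.
        W = X.T.astype(float)                            # word-by-document tf vectors
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        norms[norms == 0] = 1.0                          # guard against all-zero words
        W_unit = W / norms                               # unit-normalize each word vector
        return W_unit @ W_unit.T                         # pairwise cosine similarities (COO)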
3. CONTENT-ENRICHED CLASSIFIER
In this section, we first present the proposed framework for video
classification, then introduce the core component of the proposed
framework, namely content-enriched semantic kernels, and finally
present the algorithm for Web video classification.
3.1 The Proposed Framework
Our proposed framework for online video classification is shown
in Figure 2. The framework consists of the following steps: given the training video data, we extract both text and visual features from it.
After that, we obtain Content-Enriched Similarity (CES) between
words (to be explained in the next subsection), and extend the semantic kernel technique to the CES to build a video classifier. At
the classification stage, this classifier classifies a new video using
its text features (but not its content features).
The idea is that we incorporate visual content information extracted from training data into the semantic similarity between words.
That is, the classifier is built using kernels based on the semantic
similarity that considers both text and content information. The design mechanism enables us to utilize visual clues of Web videos to
obtain more reasonable semantic similarity among words.
Figure 2: The proposed content-enriched framework
Although the proposed framework makes use of both text features and content features as the fusion framework in Figure 1(c),
it is very different from the fusion framework in the following two
aspects. First, in the fusion methods [4, 16] content features are employed as features to build classifiers, and thus the content features
of video to be classified need to be extracted at classification time.
In contrast, in the proposed framework content features of training
data are used to calculate Content-Enriched Similarity (CES) between words and the similarity is integrated into semantic kernels
to build classifier. We avoid extracting content features from the
video to be classified at classification time. Second, the text classifiers built using the fusion methods [4, 16] for video categorization do not consider the semantic information of words.
Compared with existing video classification frameworks, the proposed framework has two salient advantages. First, our framework
is comparable to the text-based framework in Figure 1(a) in terms
of classification efficiency. Neither framework needs to extract content features at the classification stage. This is essential for
a framework to be applied to real-time online video classification
and to classifying online videos at a large scale. Second, our framework is able to achieve better classification accuracy than existing
approaches, because our approach can not only take advantage of
both text and content features, but also effectively address the problem of sparse features in Web video categorization.
3.2 Content-enriched Semantic Kernel
In the existing classification work, only the text information is
taken into consideration for kernel construction, which is insufficient for Web video classification as we discussed previously.
The motivation of proposing Content-Enriched Similarity is based
on two key observations. First, the effects of text feature and content feature are typically complementary: for some categories (e.g.,
news video, music video), text-based classifiers have better accuracy than content-based classifiers; while for other categories, like
films, content-based classifiers work better [4, 26]. The combination of two types of features is more effective for video classification as shown in [4, 26], although text-based approaches generally
perform better than content-based approaches. Second, utilizing content features at the video classification stage is computationally expensive, as content feature extraction is time consuming.
Ideally, we would have a way to leverage content features without extracting them at classification time. This might
sound impossible at first glance. Our idea is to use content features extracted from training data to enhance the text features to
obtain content-enriched semantic kernels for classification. This
idea is inspired by the way that semantic similarity is used to enhance the text features in semantic kernel. However, we will see
that we compute the content-enriched semantic kernels in a very
different fashion from existing work on semantic kernels.
We proceed to introduce the key idea of semantic kernels, and
present the idea of integrating content features into kernels for classification. The key component of semantic kernels is a word similarity matrix. We denote the matrix by P, where each matrix entry Pij represents the semantic similarity between words i and j.
The semantic similarity between two words can be computed using
WordNet or term co-occurrence as presented in Section 2.2. Following the work on text classification using semantic kernel, e.g.,
[3], we can compute the semantic kernels using similarity matrix P
as follows:
K = X × P × P^T × X^T,    (1)

where X is the document-word matrix and X_ij gives the term frequency (tf) of word j in document i.
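As an illustrative sketch (under the assumption that P has already been computed; this is not the authors' code), Equation 1 amounts to a few matrix products:

    import numpy as np

    def semantic_kernel(X, P):
        # X: document-word tf matrix (num_docs x num_words)
        # P: word-word semantic similarity matrix (num_words x num_words)
        XP = X @ P            # map each document into the semantically expanded space
        return XP @ XP.T      # K = X P P^T X^T; one entry per pair of documents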
Although WordNet or co-occurrence based similarity measures
can capture some semantic similarities between words, they are
still not sufficient for online video classification and they ignore
the content information of videos.
Our idea is to exploit the content features to enhance the similarity matrix P, so that we are able to utilize the content features. If
we have the Content-Enriched Similarity matrix, we can define the
content-enriched semantic kernel in the same way as in Equation 1.
We next present how to compute Content-Enriched Similarity between words and how to derive matrix P using Content-Enriched
Similarity.
We observe that similar words often appear in the text descriptions of similar videos although they may not appear in the text description of a single video. Here, we use the visual content information of videos to decide whether two videos are similar. If two
videos are similar in terms of their content information, the words
in their text description would be somehow similar. It is expected
that we can capture the similarity between words even if they do
not appear in the text description of a single video, e.g., bunny and
rabbit, since they may appear in the text descriptions of different
videos that all contain visual features of animal rabbit. We notice
that the similarity between bunny and rabbit can also be captured
by WordNet (but not term co-occurrence). However the coverage
of WordNet is only about 50% in our video collection downloaded
from YouTube. As another example, Content-Enriched Similarity can also capture the word similarity between or favorite and
favourite. Note that, for the example of bunny and rabbit, content
features like shape and color can associate these two words. For the
example of favorite and favourite, there are no specified features;
however, as these two words are often used interchangeably, the visual characteristics of their associated videos are statistically similar. These
relations cannot be extracted by WordNet and term co-occurrence
(more examples will be given in Section 4.2).
We proceed to present the method of computing Content-Enriched
Similarity. We first extract content features of training data, and
then cluster videos based on visual content features to obtain k
clusters, C_1, ..., C_k, where k is a parameter to be determined on
training data. Generally, we expect that two words are similar if
they appear in the same cluster, within which the videos are similar in terms of content. This statistical co-occurrence information
implies the semantic tightness between two words, i.e., the higher
frequency of co-occurrence represents closer similarity. The parameter k, the number of clusters, will affect the similarity measure between words. With a small k we will get large clusters, and
thus two dissimilar videos may be included in the same cluster. This will degrade the similarity implied by co-occurrence. On the contrary, if too many clusters are generated with a large k, some words
may be only related with weak connections indicated by infrequent
co-occurrences. The weak connections are not always informative:
they may introduce noise into the Content-Enriched Similarity; however, we will miss similarity relationships if we prune all weak connections. Hence, we need to use an appropriate value for the parameter
k, which can be set empirically using a development set, if any, or
using cross-validation on the training set.
After we have k clusters, we next compute the word similarity in the space of the k clusters. Each word w is associated with a vector in document (video) space, i.e., w = <tf_{w,1}, tf_{w,2}, ..., tf_{w,|D|}>, where tf_{w,i} is the frequency of term w in the i-th text description, and |D| is the number of text descriptions (videos) in the collection. In order to compute the frequencies of word w in each cluster, we project the vector w from document space to cluster space. To do this, we define a video-cluster relation matrix VS, where each entry VS_{ij} = 1 if video i is in cluster C_j, and VS_{ij} = 0 otherwise. Then we can project w to w_c as follows:

w_c = w × VS    (2)

where w_c = <tf_{w_c,1}, tf_{w_c,2}, ..., tf_{w_c,k}> and tf_{w_c,i} is the frequency of word w in cluster i.
Given two words x and y, together with their term frequency vectors in document space, x = <tf_{x,1}, tf_{x,2}, ..., tf_{x,|D|}> and y = <tf_{y,1}, tf_{y,2}, ..., tf_{y,|D|}>, we compute their term frequency vectors in cluster space according to Equation 2. Then we can compute their semantic similarity using cosine similarity as follows:

Sim(x, y) = cos(x_c, y_c) = cos(x × VS, y × VS)    (3)

where VS is the video-cluster relation matrix presented earlier.
In practice, we can use matrix calculation to compute the content-enriched semantic similarity for all word pairs. Given the document-word matrix X and the video-cluster relation matrix VS, we first compute the product of the two matrices Y = X^T × VS. Matrix Y is a word-cluster matrix, and we need to normalize it to compute cosine similarity. We normalize each row vector y of Y by √(Σ_i y_i²), y_i ∈ y, and obtain the normalized matrix Ỹ. The Content-Enriched Similarity matrix CES can then be calculated as follows:

CES = Ỹ × Ỹ^T    (4)

It is natural to derive an explanation for Equation 4. Y = X^T × VS gives the distribution of words in the cluster space, and according to Equation 4, word pairs occurring frequently in the same clusters will have larger similarity scores. The visual content information of videos is incorporated in the matrix VS without introducing expensive computation in matrix processing.
The Content-Enriched Similarity matrix CES is used to replace
the matrix P in Equation 1 to compute the Content-Enriched Similarity kernel K.
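A minimal sketch of the CES computation (Equations 2-4) follows; it is not the authors' implementation, it assumes the visual features of the training videos have already been extracted, and it uses k-means from scikit-learn as one plausible choice for the clustering step:

    import numpy as np
    from sklearn.cluster import KMeans

    def content_enriched_similarity(X, content_features, k):
        # X: document-word tf matrix (num_videos x num_words)
        # content_features: visual feature vectors of the training videos (num_videos x d)
        # k: number of content clusters
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(content_features)
        VS = np.zeros((X.shape[0], k))                  # video-cluster relation matrix
        VS[np.arange(X.shape[0]), labels] = 1.0
        Y = X.T @ VS                                    # word-cluster tf matrix (Equation 2 for every word)
        norms = np.linalg.norm(Y, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        Y_norm = Y / norms                              # row-normalize Y
        return Y_norm @ Y_norm.T                        # CES = normalized(Y) x normalized(Y)^T (Equation 4)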
3.3 Theoretical Justification of CES
In the previous section, we proposed a novel Content-Enriched
Similarity measurement to evaluate the similarity between words.
To theoretically justify that our approach can effectively capture the similarity information, we start from Pointwise Mutual Information (PMI) in information theory [5], which is a measure of association/similarity between two objects (words) and is defined as SI(x, y) = log [p(x, y) / (p(x) p(y))]. We start from PMI because it is built on solid statistical theory.
Let x, y be two words in the vocabulary set V, and d a video in the video dataset D. Then P(x, y) can be derived by:

P(x, y) = P(x ∈ d, y ∈ d) = P(x ∈ d_1, y ∈ d_2, d_1 = d_2)
        ∝ Σ_{i,j} P(d_i = d_j) P(x ∈ d_i, y ∈ d_j | d_i = d_j)    (5)
where ∝ denotes that the values are proportional, and P(d_i = d_j) is the probability that videos d_i and d_j are equal. Then, we can compute P(x) using this P(x, y), i.e.,

P(x) = P(x, x) ∝ Σ_{i,j} P(d_i = d_j) P(x ∈ d_i, x ∈ d_j | d_i = d_j).    (6)
Substituting Formulas 5 and 6 into PMI, we can get

SI(x, y) ∝ P(x, y) / (P(x) P(y))
         ∝ Σ_{i,j} P(d_i = d_j) P(x ∈ d_i, y ∈ d_j | d_i = d_j)
           / [ Σ_{i,j} P(d_i = d_j) P(x ∈ d_i, x ∈ d_j | d_i = d_j) × Σ_{i,j} P(d_i = d_j) P(y ∈ d_i, y ∈ d_j | d_i = d_j) ]

After normalizing x and y to unit norm [12], i.e., |x| = |y| = 1, we can get

SI(x, y) ∝ cos(x, y) = Sim′(x, y).    (7)

Now, the remaining task is to estimate the values of P(d_i = d_j) and P(x ∈ d_i, y ∈ d_j | d_i = d_j).
In the traditional co-occurrence similarity calculation, only the
words appearing in the same video text description are considered
to be similar, i.e., i must be equal to j. In our proposed CES mechanism, P(d_i = d_j) can be seen as the probability that videos d_i and d_j are similar, e.g., d_i and d_j are in the same cluster. A straightforward method can be used to estimate P(d_i = d_j), which is denoted by e_ij, as:

P(d_i = d_j) ∝ e_ij = { 1, if d_i and d_j are similar; 0, if d_i and d_j are not similar }    (8)
To compute the value of P(x ∈ d_i, y ∈ d_j | d_i = d_j), we can adopt a model based on term frequency (tf). There exist many reasonable estimation methods, e.g., the two popularly used models as follows:

P(x ∈ d_i, y ∈ d_j | d_i = d_j) ∝ tf(x, d_i) × tf(y, d_j)    (9)

or

P(x ∈ d_i, y ∈ d_j | d_i = d_j) ∝ √(tf(x, d_i) × tf(y, d_j))    (10)
Integrating Formula 9 into SI(x, y), we have:

SI(x, y) ∝ Σ_{i,j} e_ij tf(x, d_i) tf(y, d_j)
           / [ Σ_{i,j} e_ij tf(x, d_i) tf(x, d_j) × Σ_{i,j} e_ij tf(y, d_i) tf(y, d_j) ]    (11)
The above theory is a natural extension of state-of-the-art approaches, introducing a new parameter e_ij, i.e., P(d_i = d_j). There may exist cases where i ≠ j but P(d_i = d_j) ≠ 0; in other words, similar videos can contribute to the word similarity computation.
Recall the aforementioned co-occurrence similarity computation, which uses the inner product as the semantic similarity measurement. The inner product of the co-occurrence measurement is based on the assumption that e_ij is not equal to 0 if and only if i = j. Our mechanism relaxes this restriction: we apply the inner product in a non-orthogonal space, where e_ij is not equal to 0 if videos i and j are similar, e.g., the videos are in the same cluster.
Let x and y represent the tf vectors of words x and y respectively. We can redefine the dot product to compute the similarity:

Sim′(x, y) = cos(x, y) = ⟨x, y⟩ / (|x| × |y|)    (12)

where ⟨x, y⟩ = Σ_{i,j} tf(x, d_i) tf(y, d_j) e_ij, |x| = √(Σ_{i,j} e_ij tf(x, d_i) tf(x, d_j)), and |y| = √(Σ_{i,j} e_ij tf(y, d_i) tf(y, d_j)). This equation can easily be proved in a 2-norm Euclidean space. Obviously, Sim′(x, y) ≠ 0 does not require words x and y to appear in the same video. However, it relies on e_ij, which represents the similarity between videos i and j.
Revisiting Formula 11, if x and y in SI(x, y) are also represented by two tf vectors, we have

SI(x, y) ∝ ⟨x, y⟩ / (|x|² |y|²)    (13)
         = Sim′(x, y) / (|x| × |y|)    (14)
The above formula shows that our similarity measure is consistent with the PMI model, i.e., if there is a tighter “semantic” association between words x and y, the similarity value will be higher. In other words, the Content-Enriched Similarity is capable of capturing the semantic relation between words, and hence can be applied to build a semantic kernel for the video classifier. Moreover, our discussion is not limited to the 2-norm Euclidean space; one can derive a formula in another space, e.g., the 1-norm space, by making a different estimation of p(x, y) and p(x), using Equation 10 instead of 9.
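For illustration, the non-orthogonal inner product of Equation 12 can be sketched as below (hypothetical inputs, not the authors' code); note that with E = VS × VS^T this reduces exactly to the cluster-based CES cosine of Section 3.2:

    import numpy as np

    def sim_prime(x, y, E):
        # x, y: tf vectors of two words over the video collection (length |D|)
        # E: |D| x |D| matrix with E[i, j] = e_ij (1 if videos i and j are similar, else 0)
        dot = x @ E @ y                                # <x, y> = sum_{i,j} tf(x, d_i) tf(y, d_j) e_ij
        nx = np.sqrt(x @ E @ x)
        ny = np.sqrt(y @ E @ y)
        return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0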
3.4 The Classification Process
In this subsection, we summarize the procedure of building classifier using the content-enriched mechanism and applying classifier
at the classification stage.
The process for building classifiers is presented in Algorithm 1.
To build classifiers using the Content-enriched semantic kernels,
three steps are performed on the training dataset.
Feature extraction (line 1): We extract the text features from the text description, and the visual content features, such as color and texture, from the video.
CES computation (lines 3-10): We first employ the k-means clustering method to cluster the training data into k clusters using the content features extracted from the training data. Then, we construct a video-cluster relation matrix VS and compute the Content-Enriched Similarity between the words in the vocabulary by Formula 4.
Building Classifiers (lines 11-12): For each category in the training dataset, we build an SVM classifier employing the Content-Enriched kernel. As Web video sets are multi-class, we use the One-Against-All [11, 17] method, one of the most widely used methods for
multi-class classification, to train classifiers. Thus, each category
will have a classifier.
Algorithm 1: Algorithm for building CES classifiers
Input: Labeled training video data, and parameter k
Result: A set of classifiers
1  Extract text and content features;
2  Cluster the dataset into k groups;
3  foreach video v_i do
4      if v_i is in cluster_j then
5          VS_ij = 1;    /* VS: video-cluster matrix */
6      else
7          VS_ij = 0;
8  foreach word pair in the vocabulary set do
9      Compute the content-enriched similarity of the word pair by Formula 4;    /* build word similarity matrix */
10 Construct a CES kernel by Formula 1;
11 foreach category do
12     Build a One-Against-All SVM classifier using the CES kernel;
13 return Classifiers.
At the classification stage, we use only the text features of a new video, not its content features, as the input to the classifiers. Each classifier returns a score indicating the likelihood of the video belonging to its category; the class label of the classifier that
returns the highest score is assigned to the video.
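As a rough sketch of this training and classification process (using scikit-learn with a precomputed kernel; the variable names are hypothetical and this is not the authors' implementation):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier

    def train_and_classify(X_train, labels, X_test, CES):
        # X_train: document-word tf matrix of the training videos
        # labels:  their category labels
        # X_test:  tf matrix of new videos (text features only)
        # CES:     content-enriched word similarity matrix from the training stage
        M = CES @ CES.T
        K_train = X_train @ M @ X_train.T                       # Equation 1 on the training videos
        K_test = X_test @ M @ X_train.T                         # kernel between new and training videos
        clf = OneVsRestClassifier(SVC(kernel='precomputed'))    # One-Against-All SVMs
        clf.fit(K_train, labels)
        return clf.predict(K_test)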
3.5 Multi-Kernel Enhancement
We proceed to extend the proposed content-enriched framework
using Multi-Kernel SVM techniques [13, 20]. When various kernel functions exist for a classification job, Multi-Kernel learning
optimization can take advantage of the individual kernel functions
and converge towards a more reasonable solution. For example,
[13] proposed the Multi-Kernel technique that defines a new optimization function and solves it as Semi-definite Programming and
Quadratically Constrained Quadratic Programming.
Multi-Kernel SVM techniques can be applied to the proposed
CES scheme that only uses text features at the classification stage.
Therefore, we can generate different types of kernels for multiple
kernel optimization, e.g., the content-enriched semantic kernels and
text-based semantic kernels using WordNet or term co-occurrence.
In order to assign appropriate weights to different types of kernels, we adopt a standard Multi-Kernel scheme [20]. Given several
kernels created using different word pair-wise similarity matrices
as we discussed above, we first train a linear combination of them, and then use the combined result as the final kernel. The Multi-Kernel optimization can effectively exploit the advantages of the various similarity measures, and it has no obvious effect on classification efficiency, as only text features are considered at the classification stage of our framework.
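A minimal sketch of such a kernel combination is given below; the fixed weights stand in for those that an MKL solver such as SimpleMKL [20] would learn on the training data:

    import numpy as np

    def combine_kernels(kernels, weights):
        # kernels: list of precomputed kernel matrices (e.g., the CES kernel and a
        #          co-occurrence-based semantic kernel), all of the same shape
        # weights: non-negative mixing weights (learned by an MKL solver in practice)
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                          # normalize to a convex combination
        return sum(wi * K for wi, K in zip(w, kernels))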
4. PERFORMANCE STUDY
4.1 Experimental Setup
4.1.1 Data preparation
In order to evaluate the performance of our approach, we collect real-life video data from YouTube (http://www.youtube.com). During the process of crawling the video data, we collect the videos uploaded to YouTube every 5 minutes via the YouTube API on Sep 23 & 24, 2009, denoted as “YT923” and “YT924” respectively. This makes the two collected data sets representative of the distribution of YouTube videos. Finally, 5149 videos from “YT923” and 4447 videos from “YT924” are collected. For our study, the “YT923” and “YT924” datasets are taken as our training set and testing set, respectively. The YouTube videos are organized into 15 categories, and the videos belonging to “comedy”, “music” and “entertainment” attract relatively more interest.
4.1.2 Feature Extraction
Both the text descriptions and visual content features of videos
are extracted for the Web video classification task. In our study, the text features extracted from videos include the video title and the descriptions provided by the users who uploaded the videos. We stem these text features using a standard WordNet stemmer and remove stop words. Employing a filtering mechanism similar to that in [3], we eliminate words with frequency less than 5 to remove noisy text features in our experiments. For the visual content features of the videos, we extract the following features: color (24D), texture (8D) and edge (180D).
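The exact visual descriptors are not detailed in the text; as one hedged illustration of the kind of low-level color feature involved, a per-keyframe hue histogram could be computed with OpenCV as follows (an example only, not the paper's 24-D color descriptor):

    import cv2
    import numpy as np

    def hue_histogram(frame, bins=24):
        # frame: a BGR keyframe as a numpy array; returns a normalized hue histogram.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).flatten()
        return hist / (hist.sum() + 1e-9)              # normalize so the bins sum to 1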
4.1.3 Performance Metrics
We evaluate both effectiveness and efficiency in our performance
study. For efficiency, we are mainly concerned with the time cost at the classification stage, as the classifiers are generated offline. Effectiveness of our video classification is measured by the F-score of the classification results. It is defined as (2pr)/(p + r), where p is the precision (the number of correct results divided by the number of all returned results) and r is the recall (the number of correct results divided by the number of results that should have been returned). The F-score can be calculated for each category and then averaged, or calculated over all classification decisions. The former is called Macro-F, and the latter is called micro-F.
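For instance (an illustrative computation with scikit-learn and made-up labels, not the paper's results):

    from sklearn.metrics import f1_score

    # hypothetical true and predicted category labels for a handful of videos
    y_true = ['music', 'comedy', 'news', 'music', 'sports']
    y_pred = ['music', 'news',   'news', 'music', 'comedy']

    macro_f = f1_score(y_true, y_pred, average='macro')   # F per category, then averaged (Macro-F)
    micro_f = f1_score(y_true, y_pred, average='micro')   # pooled over all decisions (micro-F)
    print(macro_f, micro_f)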
4.1.4 Parameter Tuning
In this section, we evaluate our proposed method for online video classification by comparing it with state-of-the-art classification techniques. All experiments were conducted on a PC running Linux with a 2.4 GHz CPU and 3 GB of memory.
In the proposed content-enriched Classifier, we first generate the
clusters based on the visual content feature of video as introduced
in Section 3. The number of clusters k is the key parameter to compute the matrix, which may affect the similarity between words and
hence the classification performance. We use the k-means clustering algorithm and compute the enriched similarity matrices using the training data of 5149 videos. We have observed that the classification performance is poor when the number of clusters is too large or too small. If k is small, i.e., each cluster contains more videos on average, two dissimilar videos may be included in the same cluster, which degrades the similarity implied by word co-occurrence within a cluster. Additionally, more word relationships are generated, which makes the kernel space denser and more difficult to separate by SVM
classifiers. On the contrary, if the number of clusters is large, the
clustering based mechanism cannot extract enough similarity relationships. Based on the experimental results, we set the number
of clusters for generating enriched semantic kernels as 100, which
will be used as default parameter in the rest of our experiments.
The detailed tuning results are omitted due to the space constraint.
4.2 Comparison on word similarity
The content-enriched word similarity is the core component in
our CES framework. The CES based on content features can capture
some interesting relationships between words through clustering of
visual information, which cannot be discovered by other text-based
methods. Some examples are illustrated in Table 1. We also show
the results of WordNet and co-occurrence (COO) based methods
which are generally used for text classifiers.
Table 1: Example of Word Similarity by Different Approaches
word1       word2       CES   COO   WordNet
bird        goose       √            √
wombat      animal      √            √
rose        fragrance   √            √
jasmine     fragrance   √
francesco   francisco   √
favorite    favourite   √
We observe that the word pairs in Table 1 are quite “content-enriched” in several meaningful ways. First, we can extract WordNet-like type-of relations, e.g., (goose, bird) and (wombat, animal), because they often point to similar objects, which can be analyzed
easily using visual features instead of text features. Second, we
can extract Visual Similarities between words, which can be represented by own-properties-of relations, e.g. (jasmine, fragrance)
and (rose, fragrance). These relations may not be simply extracted
by only using text features. For example, visual similarity between
videos captures the relationship for flower jasmine and rose, and
this relationship can be utilized to further predict their relation with
fragrance. Finally, it is interesting that our method can even discover some typos and synonyms across different languages. For
example, (francesco, francisco) is mined by CES, which seems like
a common typo; (favorite, favourite) is also detected: one of them
is American English, and the other is British English. It is difficult
for text-based approaches to discover such relationships, because it
seems unlikely that a user uses both words to describe the video.
Note that there exists some noise in the extracted word relationships, and some discovered relationships may not precisely represent the similarity between words. However, it is difficult, if not impossible, to evaluate the performance quantitatively, due to the absence of gold-standard word similarities/relationships and evaluation methods, as the text descriptions of Web videos are generally free text with diverse content. The results in Table 1 show that the relationships discovered by CES are meaningful and agree with common sense, and the classification results
reflect the superiority of our proposed methods.
4.3 Comparison on Web video Classification
In this subsection, we report our experimental results on the
comparisons among four frameworks including our proposal and
three existing frameworks introduced in Section 2.1. We compare
our proposed CES method with other competitors, i.e., Text [3], which represents the classifier based on the text information of Web videos; Content, which represents the classifier based on the visual content of videos; and Fusion, which explores multi-modal information of videos by fusing the results from the Text and Content classifiers [4, 26]. Note that the Text classifier can expand word similarity to form SVM kernels based on WordNet and word co-occurrence as introduced in Section 2.2, which performs better than a linear SVM based on text features. The WordNet-based kernel yields similar performance to the co-occurrence-based kernel in our experiments, so we integrate the word co-occurrence-based kernel in the Text classifier using the approach of [3] in our study.
4.3.1 Effect on effectiveness
Table 2 shows the classification effectiveness of the different frameworks. We find that the performance when considering only content features is in general almost 50% lower than that of utilizing text features. There exists a semantic gap between visual content features and video information; thus, the direct application of visual features cannot achieve satisfactory results.

Table 2: Performance Comparison with Different Frameworks
          Text   Content       Fusion       CES
Macro-F   .21    .11 (-47%)    .22 (+5%)    *.25 (+20%)
Micro-F   .28    .13 (-53%)    .30 (+7%)    *.32 (+15%)
The Fusion framework using fused classifiers outperforms the frameworks using text or content features alone, and it achieves a 5% increase compared with Text in terms of Macro-F score. As the Content classifier is much weaker than the Text classifier, the fusion may also introduce noise, which degrades the benefit of fusion.
The proposed CES classifier clearly outperforms the existing methods, e.g., by 20% over the Text approach. In the CES approach, we
incorporate visual information into the semantic similarity measure
between words rather than into the classifier directly. This design
mechanism is expected to make use of visual clues of Web videos
to obtain more reasonable semantic relations among words, which
can better bridge the semantic gap between visual feature and video
content. CES exploits the video similarity based on visual content
and integrates both text and visual content feature for video classification. The effects of text and content features are typically complementary, and the combination can improve overall effectiveness.
Additionally, the clustering of videos can help extract relations between words from different videos, and thus effectively addresses
the problem of over-sparse kernel matrix in text categorization.
Table 3: F-Score (%) on Various Categories
Category        Text    Fusion   CES
howto           15.5    *17.6    15.4
news            *45.4   39.5     40.7
comedy          10.4    15.9     *28.1
music           29.6    29       *30.2
travel          14.2    *31.3    24.6
animals         3.6     7.2      *7.4
education       15.7    20.7     *21
sports          20.7    18.4     *20.8
tech            18.6    23.6     *23.9
film            21.7    *22.9    20
entertainment   24.7    26.8     *28.4
games           3.7     3        *13.2
people          17.3    19.3     *19.7
nonprofit       23.4    18.5     *27.8
autos           49.1    47.5     *53.1
Macro-F         20.9    22.2     *25.1
Micro-F         27.7    29.5     *31.7
Table 3 shows the detailed F-score performance on all the categories of the video dataset. Since the Content classifier performs much worse than the other methods, we exclude it from the comparison. We
can see that CES performs the best in 11 out of 15 categories. The
detailed performance comparison on F-score demonstrates the advantage of our methods.
4.3.2 Effect on Efficiency
Compared with classification accuracy, time cost is usually considered to be less important [21]. However, in the scenario of online
video classification, it is essential for us to take the efficiency into
consideration. This is because only with high efficiency can a real-time categorization approach provide a satisfactory user experience. In this section, we only evaluate the time cost of the classification stage, as building the classifiers is done offline.
The text-based classifiers and the CES classifier yield comparable efficiency, and the average classification time for an incoming video is less than 0.01 second. There are some marginal differences in computational cost due to the kernel density in the SVM classifier. The approaches based on content features, i.e., Content and Fusion, perform significantly worse due to the expensive content feature extraction. The time cost of content feature extraction from a video is typically of the same order of magnitude as the video length, which makes content-based video classification methods unacceptable for Web video classification. Take YouTube as an example: more than 20 hours of video are uploaded per minute on average.
In summary, compared with the state-of-the-art approaches for video classification, our CES-based video classification method performs superiorly in terms of both effectiveness and efficiency. It utilizes the content features at the training stage, but it does not need to extract content features when classifying a new video on
the Web. The proposed content-enriched kernel provides a promising solution for online video classification.
4.4 Performance on Multi-kernel
In the last set of experiments, we evaluate the classification performance of the multi-kernel enhancement. Since the content-based methods, e.g., the Content classifier and Fusion, perform much worse in terms of time efficiency, and thus are not applicable in the online Web video classification scenario, we only present the results on the combination of two approaches, i.e., Text and CES.
Table 4: Performance Comparison with Multi-Kernel Solution
                Text   CES   Macro-F       Micro-F
Single Kernel   √            .21           .28
Single Kernel          √     *.25 (+20%)   *.32 (+15%)
Multi-Kernel    √      √     *.28 (+33%)   *.34 (+21%)
The results on classification effectiveness are shown in Table 4.
As shown in the table, the Multi-Kernel method that takes all of the similarity measures into consideration performs best for video classification. The Multi-Kernel SVM obtains the best performance, which is 33% better than the text-based classifier. This improvement not only demonstrates the success of Multi-Kernel SVM in fusing different semantic kernels, but also indicates that the different similarity measures capture different, complementary features and together can provide high-performance video classification. Note that the Multi-Kernel optimization will not introduce extra time cost at the classification
stage, as only one joint kernel will be generated after Multi-kernel
learning in the training stage.
5. CONCLUSION
In this paper, we presented a novel framework that efficiently
exploits both visual content and text features to facilitate online
video categorization. Within this framework, we proposed an effective content-enriched semantic kernel, which extracts word relationships by clustering videos based on visual content features. Extensive experimental results on a large dataset demonstrate
the superior performance of the proposed method in terms of both
effectiveness and efficiency for video classification.
This work opens up several interesting directions for future work. Notably, it is interesting to investigate the performance of the current framework when using other content features of videos, such
as motion and audio features, to generate CES. Moreover, it will be
interesting to adapt and apply CES for query expansion/suggestion
in video retrieval.
6. ACKNOWLEDGMENTS
This research was supported by the National Natural Science
Foundation of China under Grant No. 60933004 and 60811120098.
7. REFERENCES
[1] F. Bach, G. Lanckriet and M. Jordan. Multiple kernel learning, conic
duality, and the SMO algorithm. In Proc. of ICML Conference, 2004.
[2] S. Bloehdorn and A. Moschitti. Structure and semantics for
expressive text kernels. In Proc. of CIKM conference, 2007.
[3] M. Cammisa, S. Bloehdorn, R. Basili and A. Moschitti. Semantic
kernels for text classification based on topological measures of
feature similarity. In Proc. of IEEE ICDM Conference, 2006.
[4] J. H. Chow, W. Dai, R. F. Zhang, R. Sarukkai and Z. F. Zhang. Joint
categorization of queries and clips for Web-based video search. In
Proc. of ACM MM Workshop on MIR, 2006.
[5] K. W. Church and P. Hanks. Word association norms, mutual
information, and lexicography. In Computational Linguistics 16,
1990.
[6] C. Ciro, B. Dominik, H. Andreas and S. Gerd. Semantic Grounding
of Tag Relatedness in Social Bookmarking Systems. In Proc. of
ISWC Conference, 2008.
[7] N. Cristianini and J. S. Taylor. An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods. Cambridge
University Press, 2000.
[8] F. J. Damerau, C. Apte and S. M. Weiss. Automated learning of
decision rules for text categorization. In ACM Trans. Information
Systems, vol. 12, no. 3, pp. 233-251, 1994.
[9] Z. Dong, G. Zhang, J. Jia and H. Bao. Keyframe-Based Real-Time
Camera Tracking. In Proc. of IEEE ICCV Conference, 2009.
[10] C. D. Fellbaum. WordNet: An Electronic Lexical Database. MIT
Press, 1998.
[11] C. W. Hsu and C. J. Lin. A comparison of methods for multiclass
support vector machines. In IEEE Trans. on Neural Networks, vol.
13, no. 2, pp. 415-425, 2002.
[12] A. B. A. Graf and S. Borer. Normalization in Support Vector
Machines. In Proc. of DAGM-Symposium on Pattern Recognition,
2001.
[13] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui and M. Jordan.
Learning the Kernel Matrix with Semidefinite Programming. In
Journal of Machine Learning Research, Vol. 5, pp 27-72, 2004.
[14] C. Leacock and M. Chodorow. Combining local context and wordnet
similarity for word sense identification. MITPress, 1998.
[15] R. Lienhart, S. Fischer and W. Effelsberg. Automatic recognition of
film genres. In Proc. of ACM MM Conference, 1995.
[16] W. H. Lin and A. Hauptmann. News video classification using
svm-based multimodal classifiers and combination strategies. In
Proc. of ACM MM Conference, 2002.
[17] Y. Liu and Y. F. Zheng. One-against-all multi-class svm classification
using reliability measures. In Proc. of IJCNN Conference, 2005.
[18] A. Moschitti. Efficient convolution kernels for dependency and
constituent syntactic trees. In Proc. of ECML Conference , 2006.
[19] T. Mei, X. S. Hua, X. Yuan, W. Lai and X. Q. Wu. Automatic video
genre categorization using hierarchical svm. In Proc. of ICIP
Conference, 2006.
[20] A. Rakotomamonjy, F. Bach, Y. Grandvalet and S. Canu.
SimpleMKL. In Journal of Machine Learning Research, Vol. 9, pp
2491-2521, 2008
[21] F. Sebastiani. Machine learning in automated text categorization. In
ACM Computing Surveys, 2002.
[22] J. Shawe-Taylor, N. Cristianini and H. Lodhi. Latent semantic
kernels. In Journal of Intelligent Information Systems,
18(2-3):127-152, 2002.
[23] B. T. Truong, S. Venkatesh and C. Dorai. Automatic Genre
Identification for Content-based Video Categorization. In Proc. of
ICPR Conference, 2000.
[24] P. Wang and C. Domeniconi. Building Semantic Kernels for Text
Classification using Wikipedia. In Proc. of SIGKDD Conference,
2008.
[25] Z. Wu and M. Palmer. Verb semantic and lexical selection. In Proc.
of ACL Conference, 1994.
[26] X. Yang, X.-S. Hua, L. Yang and J. Liu. Multi-modality Web video
categorization. In Proc. of ACM MM Workshop on MIR, 2007.
[27] Y. Yang and X. Liu. A re-examination of text categorization
methods. In Proc. of SIGIR Conference, 1998.