A Robust SIFT-Based Descriptor for Video Classification
Raziyeh Salarifard, Mahshid Alsadat Hosseini, Mahmood Karimian and Shohreh Kasaei
Department of Computer Engineering, Sharif University of Technology
Tehran, Iran
ABSTRACT
The voluminous amount of video in today's world has made objective (or semi-objective) classification of videos a very popular subject. Among the various descriptors used for video classification, SIFT and LIFT can lead to highly accurate classifiers, but the SIFT descriptor does not consider video motion and LIFT is time-consuming. In this paper, a robust descriptor for semi-supervised classification based on video content is proposed. It retains the benefits of the LIFT and SIFT descriptors and overcomes their shortcomings to some extent. To extract this descriptor, the SIFT descriptor is first applied and the motion of the extracted keypoints is then employed to improve the accuracy of the subsequent classification stage. As the SIFT descriptor is scale invariant, the proposed method is also robust toward zooming. In addition, using the global motion of keypoints helps to neglect the local motions introduced by the cameraman during video capture. In comparison with other works that consider motion, the proposed descriptor requires fewer computations. Results obtained on the TRECVID 2006 dataset show that the proposed method is about 15 percent more accurate than SIFT in content-based video classification.
Keywords: Robust Video Descriptor, SIFT, Video Classification, LIFT.
1. INTRODUCTION
With recent technological advances and the huge boost in video capturing devices, video data has grown
exponentially. This calls for fast and accurate classification methods for indexing and retrieving these data. Human-based video classification is costly and time-consuming, which is why automatic video classification has attracted many researchers. In video classification, appropriate descriptors are first extracted and the video class is then determined based on them. The better the extracted descriptors capture the differences among various types of videos, the more accurate the classification will be. Since a video is a sequence of frames, any descriptor extractable from still images can also be extracted from its frames. Most video classification methods use such frame-based descriptors independently and therefore ignore the motion trajectory; employing the motion trajectory can thus lead to a more accurate classification.
Visual features of video can generally be classified into static descriptors extracted from keyframes, descriptors extracted from video objects, and motion descriptors [1]. Static keyframe descriptors involve color-based, texture-based, and shape-based descriptors [2, 3]. These static descriptors only describe the visual appearance of the video and are weak in describing other aspects such as objects and motion. Researchers use various methods to extract objects from video; for example, Visser [4] uses the Kalman filter, while Zhang [5] uses spatio-temporal independent component analysis (stICA) and multi-scale analysis.
Motion descriptors are used in [6, 7, 8, 9], each work exploiting different information in the video. For example, [6] uses motion vectors embedded in the MPEG bitstream as a video descriptor, while [7] extracts a motion descriptor from the motion vector field. Also, [8] extracts a video motion descriptor based on local and global motion information, and [9] exploits the spatio-temporal distribution within a shot for video indexing and retrieval. In addition to motion descriptors, [10] combines static descriptors with the SIFT descriptor to generate a new descriptor. However, using many features for video retrieval is time-consuming and can be impractical in time-sensitive applications.
One of the most important descriptors in content-based video classification is the SIFT descriptor, a scale-invariant feature [11]. Image and video classification using the SIFT descriptor achieves high accuracy. However, SIFT is applied to independent frames and thus ignores the motion in videos. In [12], a descriptor called local invariant feature tracks (LIFT) is presented, which tracks the SIFT descriptor across consecutive frames of each shot. It considers the dynamism of video and consequently leads to better results. However, to equalize the lengths of its descriptor vectors, LIFT uses complicated and time-consuming calculations that are not appropriate for online video classification. In this work, a LIFT-like descriptor is extracted that tracks SIFT keypoints across consecutive frames and obtains the final descriptors by mapping tracks of different lengths to vectors of equal length. The proposed descriptor is as accurate as LIFT while it uses a very simple method to equalize the descriptor length, which reduces the time complexity.
2. NOTATIONS AND FORMULATIONS
Before explaining the proposed descriptor extraction algorithm, the notations and formulations used in this paper are described in this section. A brief description of the notations is listed in Table 1.
Table 1: Notations used in implementation.
F_i          ith sampled frame
X_ij, Y_ij   Coordinates of the jth keypoint in the ith frame
S_ij         SIFT descriptor of the jth keypoint in the ith frame
σ            Half of the square spatial window side
A_xk         Transmission matrix to the X curve of the kth track
A_yk         Transmission matrix to the Y curve of the kth track
X_k          Matrix containing the coefficients of the X curve
Y_k          Matrix containing the coefficients of the Y curve
Z            Matrix containing the indices of the tracked points
Each video contains a number of shots, and every shot consists of frames that are presented within a short time interval. In this paper, F_i is the ith frame, selected out of every 25 successive frames. Each F_i has many keypoints whose coordinates are denoted by (X_ij, Y_ij); the SIFT descriptor of each keypoint is denoted by S_ij. Many tracks are extracted from each shot. To extract a track, a set of consecutive matching points is found, where σ is half of the side of the square spatial window used to search for matches. As shown in (1), the A_xk matrix is calculated using the X coordinates of the points of the kth track; the A_yk matrix is calculated in the same way. The A_xk and A_yk matrices map the kth track to a twentieth-degree polynomial curve, where X_k and Y_k contain the coefficients of the curves:
A_{xk} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{1k} & x_{2k} & \cdots & x_{nk} \\ \vdots & \vdots & & \vdots \\ x_{1k}^{20} & x_{2k}^{20} & \cdots & x_{nk}^{20} \end{bmatrix} .   (1)
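As a concrete illustration, the matrix of equation (1) can be built with NumPy as sketched below. This is a minimal sketch under our own assumptions: the function name and the use of NumPy are ours, not from the paper's C implementation, and the row of ones corresponds to the first row of (1).

```python
import numpy as np

def power_matrix(coords, degree=20):
    """Build the matrix of equation (1): row p holds coords**p for
    p = 0..degree, with one column per tracked point.
    Note: high powers of raw pixel coordinates are ill-conditioned;
    a practical implementation may normalize coords first."""
    coords = np.asarray(coords, dtype=float)
    return np.vstack([coords ** p for p in range(degree + 1)])

# Example: x coordinates of a keypoint followed over 5 sampled frames.
x_track = [12.0, 14.5, 17.1, 19.8, 22.0]
A_xk = power_matrix(x_track)   # shape (21, 5)
```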
3. PROPOSED ROBUST VIDEO DESCRIPTOR
In this section, the proposed robust descriptor for video classification is described. The descriptor is extracted to classify shots. As shown in Figure 1, some frames are first sampled from each shot and the keypoints of these sampled frames are extracted. A SIFT descriptor is then computed for each keypoint. Among the points in the neighborhood of a keypoint in the next frame, the point with the most similar SIFT descriptor is selected. Continuing this procedure generates sequences of points with similar locations and descriptors. These tracks of points have different lengths; thus, by mapping each track to a polynomial curve of constant degree and saving the curve coefficients, a vector of constant length is formed. This vector, along with the average of the SIFT descriptors, forms the semi-final descriptor. Using the bag-of-words method, some of the extracted descriptors are then selected to represent the others. These are the final descriptors used in the video classification stage. In the following, descriptor extraction and shot classification are described in detail.
Figure 1. Proposed robust video descriptor.
3.1 Frame Sampling
For videos at 25 frames per second, assuming that the probability of the video content changing in less than a second is very low, just one frame out of every 25 is selected as representative. Thus, as shown in (2), a shot is represented by a set of consecutive sampled frames:

Sh = \{F_1, F_2, \ldots, F_j, \ldots, F_n\} .   (2)
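For reference, this sampling step can be written with OpenCV as follows. This is our illustration only: the paper does not specify how frames are decoded, so the OpenCV-based reading loop is an assumption.

```python
import cv2

def sample_frames(video_path, step=25):
    """Keep one frame out of every `step` consecutive frames (eq. 2)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:   # one frame per second at 25 fps
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```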
3.2 SIFT Descriptor
To extract the SIFT descriptors, keypoints are first detected in each frame. Keypoint candidates are the pixels that are stable toward scale and rotation changes across all scales. As shown in Figure 2, the pixels surrounding each keypoint (located in the middle of the small blocks) are divided into 4 parts. Then, using a Gaussian weighting function, shown by the circle in the figure, a weight is assigned to each gradient vector in these 4 parts. Finally, a histogram with 8 orientation bins is formed from the vectors in each part [13]. After SIFT extraction, each sampled frame F_i yields a set of keypoints, each with a location (X_ij, Y_ij) and a SIFT descriptor S_ij.
Figure 2. SIFT descriptor extraction [13].
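As a concrete illustration, keypoint locations and 128-element SIFT descriptors can be obtained with OpenCV as sketched below. The paper's implementation is in C and its SIFT code is not given, so this is our assumption of an equivalent extraction step.

```python
import cv2

def extract_sift(frame):
    """Return keypoint coordinates (X_ij, Y_ij) and descriptors S_ij."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    coords = [kp.pt for kp in keypoints]   # (x, y) per keypoint
    return coords, descriptors             # descriptors: N x 128 array
```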
3.3 Robust Video Descriptor
The details of each stage of the proposed descriptor are given next.
3.3.1 Motion Estimation
To extract a track, for each keypoint in a frame, a similar keypoint in the next frame is found. A similar keypoint, indexed k in frame i+1, satisfies the following conditions:

| X_{ij} - X_{(i+1)k} | \le \sigma   (3)

| Y_{ij} - Y_{(i+1)k} | \le \sigma   (4)

| S_{ij} - S_{(i+1)k} | = \min_m | S_{ij} - S_{(i+1)m} | .   (5)
Equations (3) and (4) state that a similar keypoint must be located in a square spatial window of side 2σ, and equation (5) states that it has the most similar SIFT descriptor among the keypoints in that neighborhood. If such a point is found, it is added to the track. This search continues until no similar point can be found in the next frame for the last point of the track. For each track, the average of the SIFT descriptors of its points is also saved. Since there are a number of keypoints in each frame, and each track can terminate anywhere in the following frames, many tracks with different lengths are generated.
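The matching rule of equations (3)-(5) can be sketched as follows. This is our illustration; in particular, the Euclidean norm between SIFT descriptors is an assumption, since the paper writes the distance as an absolute difference without naming a norm.

```python
import numpy as np

def match_keypoint(xy, desc, next_coords, next_descs, sigma=10):
    """Find, in the next frame, the keypoint inside the 2*sigma square
    window (eqs. 3-4) whose SIFT descriptor is closest (eq. 5).
    Returns the index of the match, or None if the window is empty."""
    x, y = xy
    best_idx, best_dist = None, np.inf
    for k, ((xk, yk), sk) in enumerate(zip(next_coords, next_descs)):
        if abs(x - xk) <= sigma and abs(y - yk) <= sigma:
            d = np.linalg.norm(desc - sk)
            if d < best_dist:
                best_idx, best_dist = k, d
    return best_idx
```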
3.3.2 Curve Estimation
As shown in the previous subsection, the lengths of the tracks differ. Thus, to map these tracks to feature vectors, a vector of constant length should be extracted from each track. For each track, the sequences of X_i and Y_i elements along the time dimension are mapped to curves, and each of these curves is fitted with a twentieth-degree polynomial. The X_k and Y_k matrices that hold the coefficients of the curves are calculated from
X_k A_{xk} = Z   (6)

Y_k A_{yk} = Z .   (7)
As described in Table 1, Z is a matrix containing the indices of the tracked keypoints, and A_xk and A_yk are the matrices that map X_k and Y_k to polynomial curves; X_k and Y_k are obtained by solving these linear systems in the least-squares sense. The coefficients of the two polynomial curves, along with the average of the SIFT descriptors of all points in a track, form a 168-element vector that constitutes the semi-final descriptor. Therefore, with a very simple method and few calculations, a descriptor is extracted from the tracks of a shot that represents the motion features of the video well.
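A least-squares solution of (6) and (7) can be sketched with NumPy as below. This is our illustration only: the paper reports a twentieth-degree fit and a 168-element final vector, whereas a full degree-20 fit yields 21 coefficients per curve (2 x 21 + 128 = 170 elements), so the exact coefficient layout in the paper may differ slightly from this sketch.

```python
import numpy as np

def track_descriptor(xs, ys, sift_descs, degree=20):
    """Map one track to a fixed-length vector: polynomial coefficients
    of the X and Y curves (eqs. 6-7) plus the average SIFT descriptor.
    Z holds the frame indices of the tracked points (Table 1)."""
    Z = np.arange(len(xs), dtype=float)
    A_x = np.vstack([np.asarray(xs, float) ** p for p in range(degree + 1)])
    A_y = np.vstack([np.asarray(ys, float) ** p for p in range(degree + 1)])
    # Least-squares solutions of X_k A_xk = Z and Y_k A_yk = Z.
    X_k, *_ = np.linalg.lstsq(A_x.T, Z, rcond=None)
    Y_k, *_ = np.linalg.lstsq(A_y.T, Z, rcond=None)
    S_avg = np.mean(sift_descs, axis=0)    # 128-element average SIFT
    return np.concatenate([X_k, Y_k, S_avg])
```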
3.3.3 Bag-of-Words
A number of 168-element vectors are extracted for each shot. To classify a shot, a constant number of vectors should be selected to represent it. The bag-of-words method is used to choose a specific number of vectors among all input vectors. In this method, all 168-element vectors are mapped to a 168-dimensional space and a clustering method is applied there; K-means clustering is used in this paper. It groups the vectors into K clusters and selects one vector from each cluster to represent that cluster. Thus, K 168-element vectors are selected to represent the shot, and every shot is described by a vector with the same number of elements.
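A bag-of-words step of this kind can be sketched with scikit-learn as below. This is our illustration; the paper names neither a K-means implementation nor a value of K, so k=50 and the choice of the member closest to each centroid are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def shot_signature(track_vectors, k=50):
    """Cluster a shot's 168-element track descriptors into k groups and
    keep, per cluster, the member closest to the centroid (requires at
    least k vectors). k=50 is illustrative; the paper does not report K."""
    vectors = np.asarray(track_vectors)
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    reps = []
    for c in range(k):
        members = vectors[km.labels_ == c]
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.concatenate(reps)   # fixed-length shot descriptor
```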
3.4 Shot Classification
To classify shots, supervised classification is used. To do so, we apply 10-fold cross-validation with a support vector machine (SVM) with an RBF kernel as the classifier.
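This classification stage can be reproduced in outline with scikit-learn as sketched below. The paper's C implementation and SVM hyperparameters are not given, so default RBF-SVM settings and the macro-averaged precision score are our assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate(shot_descriptors, labels):
    """10-fold cross-validation of an RBF-kernel SVM, reporting the
    average per-label precision used in Section 4."""
    clf = SVC(kernel="rbf")
    scores = cross_val_score(clf, shot_descriptors, labels,
                             cv=10, scoring="precision_macro")
    return scores.mean()
```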
4. EXPERIMENTAL RESULTS
In this section, the proposed descriptor is compared with SIFT in terms of precision and computational complexity. To evaluate the proposed method, the TRECVID 2006 dataset is used. The proposed method is implemented in the C programming language and tested on a personal laptop with an Intel Core 2 Duo processor running at 2.40 GHz.
One of the most prevalent criteria for assessing video classification is precision. Thus, the criterion for evaluating the proposed descriptor is the average video classification precision for each label.
Figure 3 shows the effect of σ on the descriptor extraction precision. As σ increases, the number of adjacent keypoints rises and consequently the probability of finding a match increases as well. Thus, as shown in Figure 3, growth in σ increases the average classification precision. But when σ exceeds 10 pixels, the number of irrelevant keypoints added to the tracks increases and the average precision decreases. Therefore, σ = 10 is chosen in the experimental setup.
Figure 4 shows the average classification precision of the proposed and SIFT descriptors for various contents. According to this figure, for videos with mid and high motion the precision of the proposed method is about 15 percent higher than that of SIFT, and for videos with low motion (such as airplane and explosion) the precisions are the same. For videos with no motion (such as building exterior, waterscape, and smoke) the precision of the proposed descriptor is about 10 percent less than that of SIFT. An analysis of the descriptor extraction execution time on 2000 shots shows that SIFT takes 200 milliseconds while the proposed descriptor takes 215 milliseconds; this increase in execution time is negligible.
Figure 3. Effect of σ on average precision.
Figure 4. Average precision of proposed and SIFT descriptors.
5. CONCLUSION
In this paper, a robust motion-based descriptor built on SIFT is proposed. It is simple and fast, and it uses the motion trajectories in videos to improve the accuracy of the subsequent classification stage. The experimental results show that the proposed method is efficient for content-based video classification with negligible time overhead. As future work, to obtain effective classification for all video contents, plain SIFT can be selected as the descriptor for motionless video contents.
REFERENCES
[1] W. Hu et al., "A survey on visual content-based video indexing and retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 41, no. 6, pp. 797-819, 2011.
[2] R. Yan and A. G. Hauptmann, "A review of text and image retrieval approaches for broadcast news video," Information Retrieval, vol. 10, no. 4-5, pp. 445-484, 2007.
[3] A. Amir et al., "IBM research TRECVID-2003 video retrieval system," NIST TRECVID-2003, 2003.
[4] R. Visser, N. Sebe, and E. M. Bakker, "Object recognition for video retrieval," in Proc. Int. Conf. Image and Video Retrieval, London, U.K., Jul. 2002, pp. 262-270.
[5] X. P. Zhang and Z. Chen, "An automated video object extraction system based on spatiotemporal independent component analysis and multiscale segmentation," EURASIP Journal on Applied Signal Processing, vol. 2006, p. 184, 2006.
[6] M.-S. Dao, F. G. B. De Natale, and A. Massa, "Video retrieval using video object-trajectory and edge potential function," in Proc. Int. Symposium on Intelligent Multimedia, Video and Speech Processing, IEEE, 2004.
[7] C.-W. Su et al., "Motion flow-based video retrieval," IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1193-1201, 2007.
[8] Y.-F. Ma and H.-J. Zhang, "Motion texture: a new motion based video representation," in Proc. 16th International Conference on Pattern Recognition, vol. 2, IEEE, 2002.
[9] R. Fablet, P. Bouthemy, and P. Pérez, "Nonparametric motion characterization using causal probabilistic models for video indexing and retrieval," IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 393-407, 2002.
[10] A. Basharat, Y. Zhai, and M. Shah, "Content based video matching using spatiotemporal volumes," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 360-377, 2008.
[11] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. 7th IEEE International Conference on Computer Vision, vol. 2, IEEE, 1999.
[12] V. Mezaris, A. Dimou, and I. Kompatsiaris, "Local invariant feature tracks for high-level video feature extraction," in Analysis, Retrieval and Delivery of Multimedia Content, Springer New York, 2013, pp. 165-180.
[13] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.