
Int J Multimed Info Retr
DOI 10.1007/s13735-014-0069-5
REGULAR PAPER
Video classification with Densely extracted HOG/HOF/MBH
features: an evaluation of the accuracy/computational efficiency
trade-off
J. Uijlings · I. C. Duta · E. Sangineto · Nicu Sebe
Received: 16 July 2014 / Revised: 11 September 2014 / Accepted: 11 September 2014
© Springer-Verlag London 2014
Abstract The current state-of-the-art in video classification is based on Bag-of-Words using local visual descriptors.
Most commonly these are histogram of oriented gradients
(HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is
very powerful for classification, it is also computationally
expensive. This paper addresses the problem of computational efficiency. Specifically: (1) We propose several speedups for densely sampled HOG, HOF and MBH descriptors
and release Matlab code; (2) We investigate the trade-off
between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of Optical
Flow method; (3) We investigate the trade-off between accuracy and computational efficiency for computing the feature vocabulary, using and comparing most of the commonly
adopted vector quantization techniques: k-means, hierarchical k-means, Random Forests, Fisher Vectors and VLAD.
Keywords Video classification · HOG · HOF · MBH ·
Computational efficiency
J. Uijlings
University of Edinburgh, Edinburgh, UK
e-mail: [email protected]
I. C. Duta · E. Sangineto · N. Sebe (B)
DISI, University of Trento, Trento, Italy
e-mail: [email protected]
I. C. Duta
e-mail: [email protected]
E. Sangineto
e-mail: [email protected]
1 Introduction
The Bag-of-Words method [10,37] has been successfully
adapted from the domain of still images to the domain of
video using local, visual, space-time descriptors (e.g. [13,22,
25,35,36,45]). Successful applications range from Human
Action Recognition [24,25,32] to Event Detection [38] and
Concept Classification [38,39]. However, analysing video is
even more computationally expensive than analysing images.
Hence, to deal with the enormous and growing amount of digitized video, it is important to have not only accurate, but
also computationally efficient methods.
In this paper, we take a powerful, commonly used
Bag-of-Words pipeline for video classification and investigate how we can make it more computationally efficient
while sacrificing as little accuracy as possible. The general pipeline is visualised in Fig. 1. In this pipeline we
focus on densely sampled local visual descriptors only,
since dense sampling has been found to be more accurate
than keypoint-based sampling, both in images [20] and in
video [46]. As local visual descriptors, we focus
on the standard ones, which are based on local 3D volumes of histogram of oriented gradients (HOG) [11], histogram of optical flow (HOF) [12,25] and motion boundary histograms (MBH) [12]. For transforming the set of
local descriptors extracted from a video into a fixed-length
vector necessary for classification, we compare a variety of techniques: k-means, hierarchical k-means, Random Forests [4,16], Fisher Vectors [31] and Vector of
Locally Aggregated Descriptors (VLAD) [19]. Starting from
this pipeline, this evaluation paper makes the following
contributions:
Fast dense HOG/HOF/MBH We exploit the nature of densely
sampled descriptors to speed-up their computation. HOG,
Fig. 1 General framework for video classification using a Bag-of-Words pipeline. The methods evaluated in this paper are instantiated in this
diagram
HOF and MBH descriptors are created from subvolumes.
These subvolumes can be shared by different descriptors similar to what was done in [42]. In this paper, we generalize
their idea of reusing subregions to three dimensions. Matlab
source code will be made available.1
Evaluation of frame subsampling Videos consist of many
frames, making them computationally expensive to analyse.
However, subsequent frames also largely carry the same
information. In this paper, we evaluate the trade-off between
accuracy and computational efficiency when subsampling
video frames.
Evaluation of optical flow Calculating optical flow is generally expensive and takes up much of the total HOF and MBH
descriptor extraction time. But for optical flow there is also
a trade-off between computational efficiency and accuracy.
Moreover, optical flow methods are generally tested against
optical flow benchmarks such as [2,7], but it is not immediately obvious that methods which perform well on these
benchmarks would automatically also yield better HOF and
MBH descriptors. Therefore, in this paper, we evaluate optical flow methods directly in our task of interest: video classification. Specifically, we compare the optical flow methods
of Lucas-Kanade [28], Horn–Schunck [17], Farnebäck [15],
Brox 04 [5], and Brox 11 [6].
Evaluation of descriptor encoding The classical way of
transforming a set of local visual descriptors into a single
fixed-length vector is using a k-means visual vocabulary and
assigning local descriptors to the mean of the nearest cluster (e.g. [10]). However, both hierarchical k-means and Random Forests [30,42] are viable fast alternatives. Furthermore,
the Fisher Vector [31] significantly outperforms classical kmeans representation in many tasks, whereas VLAD [19]
can be considered a simplified non-probabilistic version
of the Fisher Vector [33] and it is computationally more
efficient. In this paper, we evaluate the accuracy/efficiency
trade-off of all five methods above in the context of video
classification.
1 http://homepages.inf.ed.ac.uk/juijling/index.php#page=software.
2 Related work
The most used local spatio-temporal descriptors are modelled after SIFT [27]: each local video volume is divided into
blocks, for each block one aggregates responses (either oriented gradients or optical flow), and the final descriptor is a
concatenation of the aggregated responses of several adjacent
blocks. Both Dalal et al. [12] and Laptev et al. [25] proposed
to aggregate 2D oriented gradient responses (HOG) and optical flow responses (HOF). In addition, Dalal et al. [12] also
proposed to calculate changes of optical flow, or MBH. Both
Scovanner et al. [36] and Kläser et al. [22] proposed to measure oriented gradients also in the temporal dimension, resulting in 3-dimensional gradient responses. Everts et al. [14]
extended [22] to include colour channels. As Wang et al. [46]
found little evidence that the 3D responses of [22] are better than HOG, in this evaluation paper we implemented and
evaluated the descriptors which are most widely used: HOG,
HOF and MBH.
Wang et al. [46] evaluated several interest point selection
methods and several spatio-temporal descriptors. They found
that dense sampling methods generally outperform interest
points, especially on more difficult datasets. As this result was
earlier found in image analysis [20,34], this paper focuses
on dense sampling for videos. In [46] the evaluation was on
accuracy only. In contrast, this paper focuses on the trade-off
between computational efficiency and accuracy.
Recently, Wang et al. [45] proposed to use dense trajectories. In their method, the local video volume moves spatially
through time; it tries to stay on the same part of the object. In
addition, they use changes in optical flow rather than the optical flow itself. They show good improvements over normal
HOG, HOF and MBH descriptors. Nevertheless, combining
their dense trajectory descriptors with normal HOG,
HOF and MBH descriptors still gives significant improvements over dense trajectories alone [21,45]. In this paper we
focus on HOG, HOF and MBH. Note that we evaluate the
accuracy/efficiency trade-off for several optical flow methods
which may be of interest also when using dense trajectories.
In [34], Sangineto proposes to use Integral Images [44]
to efficiently compute densely extracted SURF features [3]
in still images. The work of Uijlings et al. [42] proposes
several methods to speed-up the Bag-of-Words classification
pipeline for image classification and provides a detailed evaluation on the trade-off between computational efficiency and
classification accuracy. In this paper we perform such evaluation on video classification. Inspired by [42] we propose
accelerated densely extracted HOG, HOF and MBH descriptors and provide efficient Matlab implementations. In addition, we evaluate various video-specific aspects such as frame
sampling rate and the choice of optical flow method.
The Fisher Vector [31] has been shown to outperform standard vector quantization methods such as k-means in the
context of Bag-of-Words. On the other hand, the recently
proposed VLAD descriptors [19] can be seen as a nonprobabilistic version of Fisher Vectors which are faster
to compute [19,33]. In this paper, we evaluate the accuracy/efficiency trade-off using Fisher Vector and VLAD in
the context of video classification.
3 Bag-of-Words for video
In this section we explain in detail the pipeline that we use. We
mostly use off-the-shelf yet state-of-the-art components to
construct our Bag-of-Words pipeline, which is necessary for
a good evaluation paper. In addition, we explain how to create
a fast implementation of densely sampled HOG and HOF
descriptors, and thus also implicitly for MBH, since MBH is based on HOG and optical flow. We make the HOG/HOF/MBH
descriptor code publicly available.
3.1 Descriptor extraction
In this section we describe the details of our implementation
for dense extraction of HOG, HOF and MBH descriptors.
Specifically, in Sect. 3.1.1 we show how HOG and HOF can
be efficiently extracted and aggregated from video blocks.
Then, in Sect. 3.1.2 we deal with MBHs, which are largely
based on HOG. Finally, since in this paper we compare
our implementation with the widely used available code of
Laptev [25], in Sect. 3.1.3, we show the parameters we have
adopted in using Laptev’s code. Both our system and Laptev’s work on grey values only. Note that Laptev’s implementation does not include MBH descriptors, thus the comparison performed in our experiments only concerns HOG
and HOF.
3.1.1 Fast dense HOG/HOF descriptors
For both HOG and HOF descriptors, there are several
steps. First one needs to calculate either gradient magnitude
responses in horizontal and vertical directions (for HOG), or
optical flow displacement vectors in horizontal and vertical
directions (for HOF). Both result in a 2-dimensional vector field per frame.

Fig. 2 Blocks in a video volume can be reused for descriptor extraction. In our paper descriptors consist of 3 × 3 blocks in space and 2 blocks in time, shown in blue (color figure online)

Then for each response the magnitude is
quantized in o orientations, usually o = 8. Afterwards, one
needs to aggregate these responses over blocks of pixels in
both spatial and temporal directions. The next step is to concatenate responses of several adjacent pixel blocks. Finally,
descriptors have to be normalized and sometimes PCA is
performed to reduce their dimensionality, often leading to
computational benefits or improved accuracy.
To calculate gradient magnitude responses we use HAAR features. These are faster to compute than Gaussian Derivatives and have proven to work better for HOG [11]. Quantization in o orientations is done by dividing each response
magnitude linearly over two adjacent orientation bins.
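For illustration, a minimal Matlab sketch of this soft orientation binning follows. This is our own example, not the released code; the array sizes and the atan2-based orientation computation are assumptions made only for the illustration.

% Sketch (our illustration, not the released code): soft binning of responses
% into o orientations, splitting each magnitude linearly over two adjacent bins.
o   = 8;                                      % number of orientation bins
gx  = randn(240, 320);  gy = randn(240, 320); % example horizontal/vertical responses
mag = sqrt(gx.^2 + gy.^2);                    % response magnitude
ang = mod(atan2(gy, gx), 2*pi);               % orientation in [0, 2*pi)

pos  = ang / (2*pi) * o;                      % continuous bin position in [0, o)
lo   = floor(pos);                            % lower bin index (0-based)
w_hi = pos - lo;                              % fraction assigned to the upper bin
w_lo = 1 - w_hi;                              % fraction assigned to the lower bin

binned = zeros(240, 320, o);                  % per-pixel, per-orientation responses
for k = 0:o-1
    binned(:, :, k+1) = mag .* (w_lo .* (lo == k) + w_hi .* (mod(lo+1, o) == k));
end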
We use the classical Horn–Schunck [17] method for optical flow responses as a default. We use the version implemented by the Matlab Computer Vision System Toolbox. In addition, we evaluate four other optical flow methods: Lucas-Kanade [28], also using the Matlab Computer Vision System Toolbox, the method of Farnebäck [15], using OpenCV2 with the mexopencv interface,3 Brox 04 [5], and Brox 11 [6] using the authors’ publicly available code.
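As a rough illustration of how per-frame flow fields can be obtained in Matlab, the sketch below uses the Computer Vision Toolbox. The opticalFlowHS/estimateFlow interface and the file name are assumptions on our side (current toolbox releases expose Horn–Schunck this way; the exact interface used for the paper may differ).

% Sketch (our illustration): per-frame Horn-Schunck optical flow with the
% Computer Vision Toolbox.
reader = VideoReader('example.avi');        % hypothetical input video
hs     = opticalFlowHS;                     % Horn-Schunck flow estimator
while hasFrame(reader)
    frame = rgb2gray(readFrame(reader));
    flow  = estimateFlow(hs, frame);        % flow.Vx, flow.Vy: per-pixel components
    % flow.Vx and flow.Vy are the inputs to the HOF (and MBH) computation
end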
Both HOG and HOF descriptors are created out of blocks.
By choosing the sampling rate equal to the size of a
single block, one can reuse these blocks. Figure 2 shows an
example on how a video volume can be divided into blocks.
Once responses per block are computed, descriptors can be
formed by concatenating adjacent blocks. In this paper we
use descriptors of 3 × 3 blocks in the spatial domain and 2
blocks in the temporal domain, as shown in blue in Fig. 2, but
these parameters can be easily changed. Hence each block is
reused 18 times (except for the blocks on the borders of the
video volume).
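The following Matlab sketch (ours, with made-up array sizes) illustrates this reuse: given the grid of per-block orientation histograms, each descriptor simply concatenates a 3 × 3 × 2 neighbourhood of blocks.

% Sketch (our illustration): forming descriptors by concatenating 3 x 3 x 2
% adjacent blocks from a grid of per-block orientation histograms.
o = 8;
A = rand(30, 40, 20, o);                    % example B_N x B_M x B_T x o block histograms
[BN, BM, BT, ~] = size(A);
nDesc = (BN-2) * (BM-2) * (BT-1);
descr = zeros(nDesc, 3*3*2*o);              % 144-dimensional descriptors
n = 0;
for t = 1:BT-1
  for i = 1:BN-2
    for j = 1:BM-2
      block = A(i:i+2, j:j+2, t:t+1, :);    % one 3 x 3 x 2 neighbourhood of blocks
      n = n + 1;
      descr(n, :) = block(:)';              % each block is reused by up to 18 descriptors
    end
  end
end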
2 http://opencv.org.
3 https://github.com/kyamagu/mexopencv.
To aggregate responses over space we use the Matlab-friendly method proposed by [42]: let R be an N × M matrix containing responses in a single orientation (be it gradient magnitude or optical flow magnitude). Let B_N and B_M be the number of elementary blocks from which HOG/HOF features are composed. Now it is possible to construct (sparse) matrices O and P of size B_N × N and M × B_M respectively, such that ORP = A, where A is a B_N × B_M matrix containing the aggregated responses for each block. O and P resemble diagonal matrices but are rectangular, and the filled-in elements follow the ‘diagonal’ of the rectangle instead of positions (i, i). By proper instantiation of these matrices we
perform interpolation between blocks, which provides the
descriptors some translation invariance. For integration over
time we add the responses of the frames belonging to a single block. For more details we refer the reader to the work
of [42].
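A minimal sketch of this aggregation is given below. It is our own illustration with hard, non-interpolated block boundaries and an example block size of 8; the actual implementation uses interpolating (fractional) weights in O and P, as described above.

% Sketch (our illustration): block aggregation of an N x M response map R into
% a B_N x B_M matrix A = O*R*P using sparse projection matrices.
b = 8;                                      % block size in pixels (example value)
R = rand(240, 320);                         % example response map for one orientation
[N, M] = size(R);
BN = floor(N / b);  BM = floor(M / b);

rowBlock = ceil((1:BN*b) / b);              % block index of every pixel row
colBlock = ceil((1:BM*b) / b);              % block index of every pixel column
O = sparse(rowBlock, 1:BN*b, 1, BN, N);     % B_N x N aggregation matrix
P = sparse(1:BM*b, colBlock, 1, M, BM);     % M x B_M aggregation matrix

A = O * R * P;                              % B_N x B_M aggregated block responses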
In this paper, we extract descriptors on a single scale where
blocks consist of 8 × 8 pixels × 6 frames, which at the
same time is our dense sampling rate. Descriptors consist
of 3 × 3 × 2 blocks. Both for HOG and HOF the magnitude responses are divided into 8 orientations, resulting in
144-dimensional descriptors. PCA is performed to reduce
the dimensionality by 50 % resulting in 72-dimensional vectors. Afterwards, normalization is performed by the L1-norm
followed by the square root, which effectively means that
Euclidean distances between descriptors in fact reflect the
often superior Hellinger distance [1].
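For illustration, the normalization step amounts to the following sketch (ours); using a sign-preserving square root to handle the negative values that PCA projection can produce is our assumption.

% Sketch (our illustration): L1-normalization followed by a square root, so that
% Euclidean distances between descriptors behave like the Hellinger distance.
d = randn(72, 1);                           % example PCA-reduced descriptor
d = d / (sum(abs(d)) + eps);                % L1-normalization
d = sign(d) .* sqrt(abs(d));                % square root, keeping the sign (our assumption)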
3.1.2 Motion boundary histograms descriptor
Another commonly used descriptor for video classification
tasks is MBH, proposed by Dalal et al. [12], who proved its
robustness to camera and background motion. The intuitive
idea of MBH is to represent the oriented gradients computed
over the vertical and the horizontal optical flow components.
The advantage of such a representation is that constant camera movements tend to disappear and the description focuses on optical flow differences between frames (motion boundaries).
In more detail, the optical flow’s horizontal and vertical components are separately represented using two scalar
maps, which can be seen as grey-level “images” of the
motion components. HOG are then computed for each of
the two optical flow component images, using the same
approach used for computing HOG in still images. By taking into account only flow differences, the information about changes
in motion boundaries is kept and the constant motion information is removed, which leads to the cancellation of most
of the effects of camera motion.
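A minimal sketch of this idea follows (our illustration, with placeholder flow fields): the two flow components are treated as grey-level images and oriented gradients are computed on each of them.

% Sketch (our illustration) of the MBH idea: compute oriented gradients on the
% horizontal and vertical optical flow components separately.
flow_x = randn(240, 320);  flow_y = randn(240, 320);                  % placeholder flow fields

[gxx, gxy] = gradient(flow_x);              % gradients of the horizontal component
[gyx, gyy] = gradient(flow_y);              % gradients of the vertical component

mag_x = sqrt(gxx.^2 + gxy.^2);  ang_x = mod(atan2(gxy, gxx), 2*pi);   % MBHx responses
mag_y = sqrt(gyx.^2 + gyy.^2);  ang_y = mod(atan2(gyy, gyx), 2*pi);   % MBHy responses
% These magnitude/orientation maps are then binned and block-aggregated exactly
% as for HOG (Sect. 3.1.1).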
In our MBH implementation we follow the pipeline suggested in [12] and mentioned above. Once the horizontal and vertical optical flow components have been computed, HOG are
computed on each image component using the same efficient approach and the same parameters shown in Sect. 3.1.1.
The block-based aggregation step is also analogous to what is described in Sect. 3.1.1.
The outcome of this process is a pair of horizontal (MBHx)
and vertical (MBHy) descriptors [12], each one composed
of 144 dimensions. We separately apply PCA to both MBHx
and MBHy and we obtain two vectors of 72 dimensions
each. The (PCA-reduced) MBHx and MBHy vectors can
then be either separately used in the subsequent visual word
assignment and classification stages (Fig. 1) or combined in
order to get a unique descriptor. In [45] the authors state that
late fusion of MBHx and MBHy gives a better performance
than concatenating the two descriptors before the visual word
assignment step. Hence in this paper we will report results
for MBHx and MBHy separately, and a late fusion of the two
which we simply denote as MBH. This late fusion combines
the outcomes of the two (independent) classifications with
equal weights. Finally, in Sect. 4.6, we will also show results
concerning a late fusion strategy involving all the descriptors
(MBHx, MBHy, HOG and HOF).
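As an illustration, equal-weight late fusion of two such classifiers reduces to averaging their per-class scores; the sketch below is ours, with made-up scores.

% Sketch (our illustration): equal-weight late fusion of the MBHx and MBHy
% classifier outputs; the prediction is the class with the highest fused score.
scores_x = randn(1, 50);                    % per-class SVM scores from the MBHx classifier
scores_y = randn(1, 50);                    % per-class SVM scores from the MBHy classifier
fused    = 0.5 * scores_x + 0.5 * scores_y; % equal-weight late fusion
[~, predictedClass] = max(fused);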
3.1.3 Existing HOG/HOF descriptors
We use the existing implementation of Laptev et al. [25]. We
use the default parameters as suggested by the authors, which
compared to our descriptors are as follows: They perform a
dense sampling at multiple scales. At the finest scale, blocks
are 12 × 12 pixels × 6 frames, sampling rate is every 16
pixels by every 6 frames. They consider 8 spatial scales and
2 temporal scales for a total of 16 scales, where each scale increases the descriptor size by a factor of √2. In the end, they generate around 33 % fewer descriptors than our single
scale dense sampling method.
Unlike our descriptor extraction, the implementation
of [25] uses 4 orientations for HOG and 5 orientations
for HOF, resulting in respectively 72- and 90-dimensional
descriptors.
3.2 Visual word assignment
We use five different ways of creating a single feature representation of a set of descriptors extracted from a single video:
k-means, hierarchical k-means, Random Forests [4,16],
VLAD [19] and Fisher Vectors [31].
For hierarchical k-means we use the implementation made
available by VLFeat [43]. For the regular k-means assignment, we make use of the fact that the descriptors are L2normalized: Euclidean distances are proportional to dot products (cosine of angles) between the vectors. Hence finding
the minimal euclidean distance is equivalent to finding the
maximal dot product, yet more efficient to compute [42].
For both hierarchical k-means and regular k-means, we use
4,096 visual words. For hierarchical k-means, we learn a
hierarchical tree of depth 2 with 64 branches per node of
the tree (preliminary experiments showed a large decrease
in accuracy when using a higher depth with fewer branches,
but only marginal improvements in computational efficiency,
data not shown). We normalize the resulting frequency histograms using the square root, which discounts frequently
occurring visual words, followed by L1-normalization.
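A minimal sketch of this assignment and of the resulting histogram normalization is given below (our illustration, with random data in place of real descriptors and a learned vocabulary).

% Sketch (our illustration): hard assignment of L2-normalized descriptors to a
% k-means vocabulary via maximal dot products, followed by square-root and
% L1-normalization of the visual word histogram.
D = randn(72, 5000);                                  % example descriptors (columns)
D = bsxfun(@rdivide, D, sqrt(sum(D.^2, 1)) + eps);    % L2-normalize
C = randn(72, 4096);                                  % example vocabulary of 4,096 visual words
C = bsxfun(@rdivide, C, sqrt(sum(C.^2, 1)) + eps);

[~, word] = max(C' * D, [], 1);             % nearest word = maximal dot product
h = accumarray(word', 1, [4096, 1]);        % visual word frequency histogram
h = sqrt(h);                                % square root discounts frequent words
h = h / sum(h);                             % L1-normalization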
Random Forests are ensembles of binary decision trees which are learned in a supervised way by randomly picking several descriptor dimensions at each node, trying several random thresholds, and choosing the split with the highest entropy gain.
We follow the recommendations of [42], using 4 binary decision trees of depth 10, resulting in 4,096 visual words. The
resulting vector is normalized by taking the square root followed by L1.
The Fisher Vector [18] as used in [31] encodes a set of
descriptors D with respect to a Gaussian mixture model
(GMM) which is trained to be a generative model of these
descriptors. Specifically, the set of descriptors is represented
as the gradient with respect to the parameters of the GMM.
This can be intuitively explained in terms of the EM algorithm
for GMMs: let G_λ be the learned GMM with parameters λ. Now use the E-step to assign the set of descriptors D to G_λ.
Then the M-step yields a vector F with adjustments on how
λ should be updated to fit the data (i.e. how the GMM clusters should be adjusted). This vector F is exactly the Fisher
Vector representation. We follow [31] and normalize the vector using the square root of the absolute values while keeping the original sign (sign(f_i)√|f_i|), followed by L2. In
this paper we use two common cluster sizes for the GMM:
64 and 256 clusters [31]. Without a spatial pyramid [26], for
our 72-dimensional HOG/HOF/MBHx/MBHy features this
will yield vectors of 9,216 and 36,864 dimensions, respectively. While this dimensionality is not comparable with that of the other methods, Fisher Vectors (and VLAD) allow for linear Support Vector Machines rather than Histogram Intersection or χ²-kernels. Hence, efficiency-wise, the simpler classifiers will
compensate for the larger dimension of the feature vectors.
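For illustration, the Fisher Vector normalization described above amounts to the following (our sketch on a random vector).

% Sketch (our illustration): power (signed square root) normalization of a
% Fisher Vector followed by L2-normalization.
f = randn(36864, 1);                        % example Fisher Vector (256 clusters, 72-dim descriptors)
f = sign(f) .* sqrt(abs(f));                % signed square root
f = f / (norm(f) + eps);                    % L2-normalization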
The recently proposed VLAD [19] representation can be
seen as a simplification of the Fisher Vector [19,33] in which:
(1) a spherical GMM is used, (2) the soft assignment is
replaced with a hard assignment and (3) only the gradient
of G_λ with respect to the mean is considered (first order
statistics). This leads to a lower-dimensional representation with half the dimensions of a Fisher Vector (which also uses second-order statistics). Following [19] we use for
VLAD the same normalization scheme used for Fisher Vectors: We square root the VLAD vectors while keeping their
sign, followed by L2-normalization. For good comparison to
the Fisher Vectors, we use a dictionary of 128 and 512 clusters
respectively, leading to features of dimensionality identical
to the Fisher Vectors: 9,216 and 36,864 dimensions.
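A minimal sketch of VLAD encoding with hard assignment is given below (our illustration; the cluster centres would be learned beforehand), followed by the same normalization as for the Fisher Vector above.

% Sketch (our illustration): VLAD encoding with hard assignment. Residuals of
% the descriptors w.r.t. their nearest cluster centre are accumulated.
d = 72;  K = 128;
D = randn(d, 5000);                         % local descriptors (columns)
C = randn(d, K);                            % K cluster centres (learned beforehand)

dist2 = bsxfun(@minus, sum(C.^2, 1)', 2 * (C' * D));        % squared distances up to a constant
[~, nn] = min(dist2, [], 1);                % nearest centre per descriptor

V = zeros(d, K);
for k = 1:K
    idx = (nn == k);
    if any(idx)
        V(:, k) = sum(D(:, idx), 2) - nnz(idx) * C(:, k);   % residual sum for centre k
    end
end
v = V(:);                                   % d*K dimensions (9,216 for d = 72, K = 128)
v = sign(v) .* sqrt(abs(v));                % signed square root
v = v / (norm(v) + eps);                    % L2-normalization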
We use the Spatial Pyramid [26] in all our experiments. Specifically, we divide each video volume into the
whole video and into three horizontal parts, which intuitively correspond roughly to a ground, object, and sky division (in
outdoor scenes).
3.3 Classification
For classification we use Support Vector Machines which
are powerful and widely used in a Bag-of-Words context
(e.g. [10,26,42,43]). For k-means, hierarchical k-means, and
Random Forests, we use SVMs with the Histogram Intersection kernel, using the fast classification method as proposed
by [29]. For the Fisher Vector and VLAD, we use linear
SVMs. For both types of SVMs, we make use of the publicly available LIBSVM library [8] and the fast Histogram
Intersection classification of [29].
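To make the setup concrete, the sketch below shows how a histogram intersection kernel could be passed to LIBSVM’s Matlab interface as a precomputed kernel. The data, labels and the ‘-c’ value are placeholders; ‘-t 4’ is LIBSVM’s documented precomputed-kernel option, and this is only our illustration, not the exact training code used in the paper (which uses the fast method of [29]).

% Sketch (our illustration): SVM training with a precomputed histogram
% intersection kernel through LIBSVM's Matlab interface ('-t 4' = precomputed
% kernel; the first column of the kernel argument is the sample index).
X = rand(100, 4096);                        % example bag-of-words vectors (rows = videos)
y = randi(2, 100, 1);                       % example labels
n = size(X, 1);

K = zeros(n, n);
for i = 1:n
    K(i, :) = sum(min(repmat(X(i, :), n, 1), X), 2)';       % histogram intersection
end

model = svmtrain(y, [(1:n)' K], '-t 4 -c 1');                % '-c' tuned per class/fold in practice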
4 Experiments
Our baseline consists of densely sampled HOG, HOF and
MBH(x/y) descriptors, all consisting of blocks of 8 × 8 pixels × 6 frames. For HOF and MBH(x/y), optical flow is calculated using Horn–Schunck. Gradient and flow magnitude
responses are quantized in 8 bins. The final descriptors consist of 3 × 3 × 2 blocks. PCA always reduces dimensionality
of descriptors by 50 %. We use a spatial pyramid division of
1 × 1 × 1 and 1 × 3 × 1 [26] (we have no temporal division).
Normalization after word assignment is done by either taking the square root while keeping the sign followed by L2
for the Fisher Vector and VLAD, or by the square root plus L1 for all other methods. We use SVMs for classification, with either a linear kernel for the Fisher Vector and VLAD, or a histogram intersection kernel for all other visual word assignment methods.
Starting from our baseline we perform four experiments:
(1) We compare five different visual word assignment
methods: k-means, hierarchical k-means, Random Forests,
VLAD and the Fisher Kernel; (2) We compare our densely
extracted descriptors with the descriptors provided by Laptev
et al. [25]; (3) We evaluate the efficiency/accuracy tradeoff by subsampling video frames for the descriptor extraction process; (4) For HOF and MBH(x/y) descriptors, we
compare five different optical flow implementations: Horn–Schunck, Lucas-Kanade, Farnebäck [15], Brox 04 [5] and
Brox 11 [6].
All timing experiments are performed on a single core of
an Intel(R) Xeon(R) CPU E5620 2.40GHz. We use mainly
Matlab, but most toolboxes used by us have mex-interfaces to
C++ implementations for critical functions. All implementations are heavily optimized for speed. Since the computation involves many common operations that use standardized
and optimized libraries (e.g. convolutions, matrix multiplications) on large quantities of data, virtually the entire time is
spent on core calculations while the overhead is negligible;
using only C++ will not result in noticeable differences in
the overall timing results presented in this paper.
Based on our experiments we provide two recommendations, one for real-time video classification and one for accurate video classification. Finally we give a comparison with
the state-of-the-art.
4.1 Dataset
We perform all experiments on the UCF50 Human Action
Recognition dataset [32]. This dataset contains 6,600 realistic videos taken from YouTube and as such has large variations
in camera motion, object appearance and pose, illumination
conditions, scale, etc. The 50 human action categories are
mutually exclusive and include actions such as biking, diving, drumming, and fencing. The frames of the videos are 320
× 240 pixels. The video clips are relatively short with a length
that varies roughly between 70 and 200 frames. The dataset is divided into
25 predefined groups. Following the standard procedure we
perform a leave-one-group-out cross-validation and report
the average classification accuracy over all 25 folds. Optimization of the SVM slack parameter is done for every class
for every fold on the training set (containing 24 groups).
4.2 Visual word assignment
In this experiment we compare the following visual word
assignment methods: k-means, hierarchical k-means, Random Forests, VLAD and Fisher Vector. k-means, hierarchical
k-means and Random Forests are similar in the sense that the
final vector represents visual word counts. To compare these
methods we ensure that all have 4,096 visual words. For k-means this means performing clustering with k = 4,096. For
hierarchical k-means we use a hierarchy of depth 2 with 64
branches at each node. The Random Forest consists of 4 trees
of depth 10. We choose to base our Fisher Vectors on standard
sizes for the number of clusters: 64 and 256 clusters [9,31].
While Fisher Vectors are of higher dimensionality, the vectors work with linear classifiers. This means that Fisher Vectors are best compared with the other visual word assignment
methods in terms of the accuracy/efficiency trade-off. Similarly, we adopted two standard cluster sizes for VLAD: 128
and 512 clusters, respectively [19], and we used linear
classifiers as well.
The accuracy and computational efficiency for the various
word assignment methods for our HOG, HOF and MBH(x/y)
features are presented in Fig. 3 and Table 1.

Fig. 3 Accuracy/efficiency trade-off for various word assignment methods and features. For better readability of the figure, we omitted the results concerning MBHx and MBHy (see Table 1)

The first thing to notice is that the Fisher Vector with 256 clusters has the best accuracy: 0.765 for HOG, 0.795 for HOF, 0.796 for MBHx and 0.804 for MBH, while taking 3.39 s/video (per descriptor type). k-means also has good accuracy at 0.728 for HOG,
0.791 for HOF, 0.782 for MBHx and 0.8 for MBH. However,
the computational time is at 1.81 s/video. This means that
the Fisher Vector (with 256 clusters) for video classification
is superior in accuracy but slightly slower compared to kmeans. For computational efficiency, the Random Forest is
by far the fastest and takes 0.1 s/video. The hierarchical kmeans (hk-means) is four times slower at 0.47 s/video, and
performs slightly worse on HOG (0.718 hk-means vs. 0.729
RF) but significantly better on HOF (0.780 hk-means vs.
0.732 RF) and on MBHx, MBHy and MBH (respectively,
0.774 vs. 0.738, 0.763 vs. 0.739 and 0.791 vs. 0.765).
In terms of classification time per video, we measure 0.017
s/video when using the fast Histogram Intersection-based
classification for SVMs [29] for k-means, hk-means, and
Random Forests. We measure 0.001 s/video for the linear
classifier used on the Fisher Vector representation with 256
clusters. This means that the classification time is negligible
compared to the word assignment time and is of little concern
for video classification.
For the remainder of this paper, we choose to perform
our evaluation on two word assignment methods: the Fisher
Vector, which yields the most accurate results, and hk-means,
which is the second fastest after Random Forests, while its
accuracy for HOF and MBH(x/y) is much higher than using
Random Forests.
4.3 Comparison with Laptev et al.
In this experiment we compare the publicly available code
from [25] with our implementation. We compare only to the
dense sampling option, as [46] has already proven that dense sampling outperforms the use of space-time interest points. Moreover, only HOG and HOF features are used for comparison because the code in [25] does not include any implementation for MBH features. Results are presented in Figs. 4 and 5 and in Table 2.

Table 1 Trade-off accuracy/efficiency for the following visual word assignment methods: k-means, hierarchical k-means (hk-means), Random Forest (RF), Fisher Vector with 64 and 256 clusters (FV 64 and FV 256), and VLAD with 128 and 512 clusters (VLAD 128 and VLAD 512)

            HOG Acc   HOF Acc   MBHx Acc   MBHy Acc   MBH Acc   s/video   Frame/s
k-means     0.728     0.791     0.782      0.772      0.800     1.81      108
hk-means    0.718     0.780     0.774      0.763      0.791     0.51      387
RF          0.729     0.732     0.738      0.739      0.765     0.10      1,910
FV 64       0.746     0.779     0.767      0.759      0.786     1.10      180
FV 256      0.765     0.795     0.796      0.787      0.804     3.39      58
VLAD 128    0.653     0.751     0.749      0.737      0.769     0.19      1,011
VLAD 512    0.671     0.783     0.774      0.765      0.792     0.47      415

Assignment time for HOG and HOF is the same

Fig. 4 Accuracy comparison between [25] and our HOG/HOF descriptors

Fig. 5 Computational efficiency comparison between [25] and our HOG/HOF descriptors

Table 2 Comparing the dense HOG/HOF implementation of [25] and ours

         hk-means           FV 256             Efficiency
         HOG      HOF       HOG      HOF       s/vid    Frame/s
[25]     0.657    0.590     0.670    0.725     141      1.4
Ours     0.718    0.780     0.765    0.795     15       12.8

The descriptor extraction time is measured for extracting both HOG and HOF features, as the binary provided by [25] always does both. Descriptor extraction time is independent of the visual word assignment method (hk-means or FV 256)
The results show that for all settings there is a significant
difference in accuracy between the dense implementation
of [25] and our method. For the Fisher Vector, HOG descriptors yield 0.670 accuracy for [25] and 0.765 accuracy for our
implementation and HOF descriptors yield 0.725 accuracy
for [25] and 0.795 accuracy for our implementation. These
are accuracy increases of 9 and 7 % respectively. Similar
differences are obtained using hk-means. Part of the difference can be explained by the fact that we sample differently:
because we reuse blocks of the descriptors, our sampling rate
is defined by the size of a single block. This means we sample
descriptors every 8 pixels and every 6 frames at a single scale,
whereas [25] samples every 16 pixels and every 6 frames at
10 increasingly coarse scales. For our method this yields
around 150 descriptors per frame or around 29,000 descriptors per video whereas [25] generates around 90 descrip-
tors per frame or around 17,500 descriptors per video, which
means we generate 66 % more descriptors. While this may
seem unfair towards [25], in this paper we are interested in
the trade-off between accuracy and computational efficiency,
which makes the exact locations from where descriptors are
sampled irrelevant.
In terms of computational efficiency our method is more
than 9 times faster: their method takes 141 s/video while our
method takes 15 s/video. Our method is faster because we
reuse blocks in our dense descriptor extraction method. Note
that because the method of [25] samples fewer descriptors,
visual word assignment time is faster. But using [25] the
overall computation time will be completely dominated by
descriptor extraction.
To conclude, our implementation is significantly faster
and significantly more accurate than the version of [25].
4.4 Subsampling video frames

In video, subsequent video frames largely contain the same information. As the time for descriptor extraction is the largest bottleneck in video classification, we investigate how the accuracy behaves if we subsample video frames and hence speed up the descriptor extraction process.

For a fair comparison, we want the descriptors always to describe the same video volume. In our baseline, each descriptor block consists of 8 × 8 pixels × 6 frames. To subsample in such a way that every block describes the same video volume regardless of the sampling rate, we do the following: if we sample every 2 frames, we aggregate responses over 3 frames (i.e. frames 2, 4 and 6); when sampling every 3 frames, we aggregate responses over 2 frames (i.e. frames 2 and 5); and when sampling every 6 frames we only consider a single frame per descriptor block (i.e. frame 3). Results are presented in Fig. 6 and Table 3.

Fig. 6 Trade-off accuracy/efficiency when varying the sampling rate. F stands for frames per block and is directly related to the sampling rate

Table 3 Trade-off between frame sampling rate and accuracy

       Frames/block   6       3       2       1
       samplerate     1       2       3       6
HOG    hk-means       0.718   0.716   0.712   0.719
       FV 256         0.765   0.759   0.760   0.762
       s/vid          6.5     4.5     3.9     3.3
       Frame/s†       30.2    43.7    50.3    58.9
HOF    hk-means       0.780   0.773   0.766   0.762
       FV 256         0.795   0.791   0.784   0.763
       s/vid          8.9     5.9     4.8     3.8
       Frame/s†       22.1    33.5    40.7    51.8
MBHx   hk-means       0.774   0.767   0.769   0.758
       FV 256         0.796   0.794   0.788   0.771
       s/vid          9.4     6.1     5.0     3.9
       Frame/s†       20.9    32.1    39.4    50.7
MBHy   hk-means       0.763   0.757   0.752   0.741
       FV 256         0.787   0.785   0.772   0.750
       s/vid          9.4     6.1     5.0     3.9
       Frame/s†       20.9    32.1    39.4    50.7
MBH    hk-means       0.791   0.788   0.787   0.772
       FV 256         0.804   0.803   0.800   0.775
       s/vid          13.7    8.3     6.5     4.6
       Frame/s†       14.3    23.6    30.1    42.9

We keep the video volumes from which descriptors are extracted the same for all sampling rates
† Frame/s is measured in terms of the total number of frames of the video, not in terms of how many frames are actually processed during descriptor extraction

For HOG descriptors, subsampling video frames has surprisingly little effect on the accuracy, both for hk-means and Fisher Vectors: using Fisher Vectors, a sampling rate of 1 yields an accuracy of 0.765 while a sampling rate of 6 yields 0.762 accuracy. The result of hk-means is basically constant, with slight oscillations. In terms of computational efficiency, a significant speed-up is achieved: sampling every 6 frames instead of every frame gives a speed-up from 6.5 s/video to 3.3 s/video.

For HOF descriptors, subsampling has a bigger impact: for the Fisher Vector, the accuracy is 0.795 using a sampling rate of 1, maintains a respectable 0.791 accuracy at a subsampling rate of 2 frames, while dropping significantly to 0.763 for sampling every 6 frames. Accuracy with hk-means is less affected and drops from 0.78 at sample rate of 1 to 0.762
at sample rate of 6. Again, a good speed-up is obtained by
subsampling. While descriptor extraction takes 8.9 s when
using every frame, a sampling rate of 2 yields a factor 1.5
speed-up while sampling every 6 frames yields a factor 2.34
speed-up.
The remaining rows of Table 3 present results obtained
with different combinations of the vertical and the horizontal components of the MBH (Sect. 3.1.2). Note that when
calculating both components of the MBH features, the optical flow has to be calculated only once, so computation time
is faster than simply adding the times of MBHx and MBHy.
We observe a particular order of accuracy among these
three combinations: using only the horizontal component (MBHx) always results in a higher accuracy than using only the vertical component (MBHy), independently of whether
Fisher Vectors or hk-means is used as word assignment
method. This sharp difference is probably due to the fact
that in the test videos the horizontal motion is more frequent
than the vertical one. Moreover, as expected, late fusion of the
two components (MBH) always outperforms using MBHx
only. Concerning the drop of accuracy depending on the sample rate, for all the three descriptor combinations (MBHx,
MBHy, MBH) and both word assignment methods (Fisher
Vectors and hk-means), the accuracy loss as a function of
the sample rate is similar to what happens with HOF and
much higher than with HOG. We believe that this is due to the
fact that HOG are basically “static” features, representing
the appearance of a given image window independently of
possible motion information. As a consequence, they are less
affected by errors in the optical flow [which is used to compute both HOF and MBH(x/y)] and better exploit the redundancy of
consecutive video frames.
As with HOG and HOF, also for MBH(x/y) and MBH
we observe a significant computational efficiency gain using
subsampling. For instance, sampling every 6 frames yields a
factor of 2.4 speed-up for MBH(x/y) and a factor of 3 speedup for MBH with respect to using all the frames.
To conclude, HOG descriptors can be sampled every 6
frames with negligible loss of accuracy yielding a speed-up
of a factor 2. HOF and MBH descriptors can be sampled every
2 frames with negligible loss of accuracy yielding a speedup of a factor 1.5 and 1.7 respectively. When speed is more
important than accuracy, both HOF and MBH descriptors can
also be sampled every 6 frames leading to 1–3 % accuracy
loss while gaining a significant speed-up of a factor 2.3–3.
4.5 Choice of optical flow
The results reported in the previous section show that both the
HOF and the MBH(x/y) descriptors are much more expensive to extract than the HOG descriptors (Table 3). This
is because calculating the optical flow is computationally
expensive. In addition, not much research has been done
Table 4 Comparison of different optical flow methods used to compute HOF features

           Horn–Schunck   Lucas-Kanade   Farnebäck
hk-means   0.780          0.750          0.652
FV 256     0.795          0.747          0.641
s/video    8.8            7.2            19.0
Frame/s    22             27             10

Results obtained with no frame subsampling and at full original spatial resolution (320 × 240 pixels)
Table 5 Comparison of different optical flow methods used to compute HOF features

           Horn–Schunck   Lucas-Kanade   Farnebäck   Brox 04   Brox 11
hk-means   0.713          0.681          0.529       0.548     0.552
FV 256     0.718          0.697          0.542       0.638     0.652
s/video    2.9            2.8            0.76        7.2       12.4
Frame/s    68             69             257         27.4      16

Results obtained subsampling one frame every 6 and at reduced spatial resolution (80 × 60 pixels)
on how different optical flow methods affect HOF/MBH
descriptors. Therefore, in this experiment we evaluate five
available optical flow implementations to investigate both
their computational efficiency and accuracy. In particular,
we compare: (1) Farnebäck [15] from OpenCV using the
mexopencv interface, (2) Lucas-Kanade [28] and (3) Horn–Schunck [17] from the Matlab Computer Vision System Toolbox, (4) Brox 04 [5] and (5) Brox 11 [6] using the authors’ publicly available code.
Results are presented in Tables 4 and 5. Specifically, while
in Table 4 we used the same setting adopted in the other
experiments of this paper, in Table 5 we downscaled the frame
resolution of all the videos by a factor of 4 (i.e., using 80 ×
60 pixel frames) and we subsampled every 6 frames (see
Sect. 4.4). This scale and time subsampling was necessary
to process our large video dataset with both Brox 04 and
Brox 11, two state-of-the-art dense optical flow methods not
able to process videos in real time. In fact, processing all
the frames of our 6,600 videos at full spatial resolution with
Brox 11 would require a few months.
With the original frame resolution (Table 4), and with both
hk-means and Fisher Vectors, the three computationally feasible optical flow methods have the same ranking in terms
of accuracy. For the Fisher Vector, Horn–Schunck performs
best at an accuracy of 0.795, followed by Lucas-Kanade at
an accuracy of 0.747, while the method of Farnebäck performs relatively poorly with an accuracy of 0.641. These
results show that the optical flow method is crucial to the
performance of the HOF descriptor: the choice of optical
flow affects the results by up to 15 %(!).
In terms of computational efficiency, Lucas-Kanade is
the fastest at 27 frames/s, followed by Horn–Schunck at 22 frames/s, while Farnebäck is slower at 10 frames/s. However, while Lucas-Kanade is faster, its trade-off between efficiency and accuracy is not good: as seen in Table 3, Horn–Schunck with a frame sampling rate of 2 outperforms the Lucas-Kanade results in Table 4 in both speed (33 frames/s vs. 27 frames/s) and accuracy (0.77 vs. 0.75).
Table 5 reports results when we subsample frames and
reduce the frame size by a factor 4, enabling comparison
with the Brox methods. Note that for a fair comparison these
times include the computation for reducing the frame sizes
(although these times are negligible compared to the total
descriptor extraction time). It can be seen that both Brox methods are better than Farnebäck, but surprisingly not better than the Horn–Schunck and Lucas-Kanade methods. One
explanation is that this is due to the low resolution of the
frames, which makes dense optical flow extraction not sufficiently accurate. Another possibility is that optical flow methods performing better on optical flow benchmarks are not
necessarily optimal for use in classification; reducing mistakes in most parts of the flow may introduce artefacts elsewhere that negatively affect results in a classification framework.
In terms of computational efficiency, Brox 11 is the slowest, followed by Brox 04: even subsampled on reduced frames
Brox 04 still processes only 27 frames/s. In contrast to results
without downsampling, Farnebäck is here the fastest method.
Apparently, there is some overhead in the Matlab optical flow
implementations.
To conclude, the choice of optical flow method drastically
influences the power of the resulting HOF descriptor and it
is not necessarily correlated with the performance on optical
flow benchmarks. In addition, many optical flow methods
aim for accuracy rather than computational efficiency (e.g.
Sun et al. [41] provide a very good overview for accuracy but
do not report computational efficiency). Indeed, except the
Horn–Schunck, Lucas-Kanade, and Farnebäck methods, we did not find any other freely available optical flow method fast enough for use in our classification pipeline. Our evaluation shows that the Horn–Schunck method has the best trade-off between accuracy and computational efficiency and that subsampling every two frames works better than switching to Lucas-Kanade optical flow. Horn–Schunck is therefore the
current method of choice.
4.6 Recommendations for practitioners
Based on the results of the previous experiments, we can now
give several recommendations when accuracy or computational efficiency is preferred. For calculating Optical Flow,
Sect. 4.5 showed that the Matlab implementation of Horn–
Schunck is always the method of choice. In terms of frame
sampling rate, for HOG descriptors we always recommend a
sampling rate of every six frames. For the HOF descriptor, if one
wants accuracy we recommend a sampling rate of every two
frames and if one wants computational efficiency we recommend a sampling rate of 6. The same holds for MBH(x/y)
descriptors. For the word assignment method, the Fisher Vector is the method of choice for accuracy. For computational
efficiency there are two candidates: hierarchical k-means and
the Random Forest. Observe first that the descriptor extraction time is the most costly phase of the pipeline: Extracting
HOF descriptors with a sampling rate of 6 frames takes 3.8
s/video to compute. And while the Random Forest is five
times faster than hierarchical k-means, the difference is only
0.41 s/video, which is very small compared to the descriptor extraction phase. Furthermore, Table 1 showed a significant drop of accuracy from 0.780 for hierarchical k-means to
0.732 for Random Forests [and a similar drop of accuracy is
observed with MBH(x/y)]. Therefore we recommend using
hierarchical k-means for a fast video classification pipeline.
We found that late fusion of the classifier outputs gave
slightly better results than early fusion of the descriptors (e.g.
concatenating HOG and HOF). Hence in our recommendations we perform a late fusion with equal weights.
We tested different descriptor combinations, using equal-weight late fusion and with the goal of selecting:
(1) the most accurate set of descriptors, possibly taking into
account the complementarity of appearance/motion information of different features, and (2) the fastest solution with a
sufficiently good accuracy. The final recommended
pipelines are visualised in Figs. 7 and 8.
The most accurate pipeline (Fig. 7) combines all the
descriptors we adopted in this paper: HOG, HOF, MBHx
and MBHy. HOG are extracted using all the frames, while
HOF and MBH(x/y) are extracted with a sampling rate of 2.
The word assignment method used in this case is the Fisher
Vector. Using this pipeline we can process 11 frames/s (for
video frames of 320 × 240 pixels) at an accuracy of 0.818 on
UCF50. Conversely, our recommended pipeline for computational efficiency (Fig. 8) is based on late fusion of only HOG
and HOF, both extracted with a sampling rate of 6 and using
hk-means. This second pipeline can process 28 frames/s at a
respectable accuracy of 0.790.
4.7 Comparison with state-of-the-art
In this section we compare our descriptors to the state-of-the-art. Results of several recent works are given in Table 6. This comparison is done in terms of accuracy only, as most compared methods evaluate accuracy only. The entry “This paper” in Table 6 refers to the late fusion of all the descriptors [HOG, HOF,
MBH(x/y)]: see Sect. 4.6 and Fig. 7.
Fig. 7 Recommended pipeline for accurate video classification. This pipeline yields an accuracy of 0.818 on UCF50 while processing 9 frames/s

Fig. 8 Recommended pipeline for real-time video classification. This pipeline yields an accuracy of 0.790 on UCF50 while processing 28 frames/s

Table 6 Comparison with the state-of-the-art

Method                     Accuracy
Wang et al. [45]           0.856
This paper                 0.818
Reddy et al. [32]          0.769
Solmaz et al. [40]         0.737
Everts et al. [14]         0.729
Kliper-Gross et al. [23]   0.727

As can be seen, the method of [45] yields the best results. This method is a combination of Dense Trajectories and
STIP features [25]. As our results are better than [25], we
expect that a combination of dense trajectories with our
method would increase results further. In general, our method
yields good performance compared to many recently proposed methods, which shows that we provide a strong implementation of densely sampled HOG, HOF and MBH(x/y)
descriptors.
5 Conclusion
This paper presented an evaluation of the trade-off between
computational efficiency and accuracy for video classification using a Bag-of-Words pipeline with HOG, HOF and
MBH descriptors. Our first contribution is a strong and fast
Matlab implementation of densely sampled HOG, HOF and
MBH descriptors, which we make publicly available.
In terms of visual word assignment, the most accurate
method is the Fisher Kernel. Hierarchical k-means is more
than six times faster while yielding an accuracy loss of less
than 2 % and is the method of choice for a fast video classification pipeline. HOG descriptors can be subsampled every
six frames with a negligible loss in accuracy, while being
two times faster. HOF and MBH descriptors can be subsampled every two frames with negligible loss in accuracy, being
1.5–1.7 times faster. When speed is essential, HOF and MBH
descriptors may be subsampled every six frames.
For the HOF and MBH descriptors, we showed that the
choice of optical flow algorithm has a large impact on the
final performance. The difference between the best method,
Horn–Schunck, and the second best method, Lucas-Kanade, is already 5 %, while the difference with Farnebäck is a full
15 %. Brox 04 and Brox 11 are computationally very demanding, and cannot be used in a real-time video classification
scenario.
Compared to the state-of-the-art, the Dense Trajectory
method of [45] obtains better results. Nevertheless, the huge
difference caused by the choice of optical flow method suggests that this choice would also influence dense trajectories. Furthermore,
Dense Trajectories still benefit from a combination with normal HOG, HOF and MBH descriptors [21,45]. Finally, com-
parisons with other recent methods on UCF50 show that we
provide a strong implementation of dense HOG, HOF and
MBH descriptors to the community.
Acknowledgments This work was supported by the European 7th
Framework Program, under grant xLiMe (FP7-611346) and by the FIRB
project S-PATTERNS.
References
1. Arandjelović R, Zisserman A (2012) Three things everyone should
know to improve object retrieval. In: CVPR
2. Baker S, Scharstein D, Lewis JP, Roth S, Black MJ, Szeliski R
(2011) A database and evaluation methodology for optical flow.
Int J Comput Vis 92:1–31
3. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-Up Robust
Features (SURF). Comput Vis Image Underst 110:346–359
4. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
5. Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy
optical flow estimation based on a theory for warping. In: ECCV,
pp 25–36
6. Brox T, Malik J (2011) Large displacement optical flow: descriptor
matching in variational motion estimation. PAMI 33(3):500–513
7. Butler DJ, Wulff J, Stanley GB, Black MJ (2012) A naturalistic
open source movie for optical flow evaluation. In: ECCV
8. Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector
machines. ACM Trans Intell Syst Technol. http://www.csie.ntu.edu.tw/cjlin/libsvm
9. Chatfield K, Lempitsky V, Vedaldi A, Zisserman A (2011) The devil
is in the details: an evaluation of recent feature encoding methods.
In: BMVC
10. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual
categorization with bags of keypoints. In: ECCV international
workshop on statistical learning in computer vision, Prague
11. Dalal N, Triggs B (2005) Histograms of oriented gradients for
human detection. In: CVPR
12. Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: ECCV
13. Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: VS-PETS
14. Everts I, van Gemert J, Gevers T (2013) Evaluation of color STIPs
for human action recognition. In: CVPR
15. Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: Scandinavian conference on image analysis
16. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees.
Mach Learn 63(1):3–42
17. Horn B, Schunck B (1981) Determining optical flow. Artif Intell
17:185–203
18. Jaakkola T, Haussler D (1999) Exploiting generative models in
discriminative classifiers. In: NIPS
19. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local
descriptors into a compact image representation. In: CVPR, pp
3304–3311
20. Jurie F, Triggs B (2005) Creating efficient codebooks for visual
recognition. In: ICCV
21. Karaman S, Seidenari L, Bagdanov A, del Bimbo A (2013)
L1-regularized logistic regression stacking and transductive CRF
smoothing for action recognition in video. In: ICCV workshop on
action recognition with a large number of classes
22. Kläser A, Marszalek M, Schmid C (2008) A spatio-temporal
descriptor based on 3d-gradients. In: BMVC
23. Kliper-Gross O, Gurovich Y, Hassner T, Wolf L (2012) Motion
interchange patterns for action recognition in unconstrained videos.
In: ECCV
24. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB:
a large video database for human motion recognition. In: ICCV
25. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning
realistic human actions from movies. In: CVPR
26. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR
27. Lowe DG (2004) Distinctive image features from scale-invariant
keypoints. IJCV 60:91–110
28. Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: International joint
conference on artificial intelligence
29. Maji S, Berg AC, Malik J (2008) Classification using intersection
kernel support vector machines is efficient. In: CVPR
30. Moosmann F, Nowak E, Jurie F (2008) Randomized clustering
forests for image classification. IEEE Trans Pattern Anal Mach
Intell 9:1632–1646
31. Perronnin F, Sanchez J, Mensink T (2010) Improving the Fisher
kernel for large-scale image classification. In: ECCV
32. Reddy K, Shah M (2013) Recognizing 50 human action categories
of web videos. Mach Vis Appl 24(5):971–981
33. Sánchez J, Perronnin F, Mensink T, Verbeek JJ (2013) Image classification with the fisher vector: theory and practice. Int J Comput
Vis 105(3):222–245
34. Sangineto E (2013) Pose and expression independent facial landmark localization using dense-SURF and the Hausdorff distance.
IEEE Trans Pattern Anal Mach Intell 35(3):624–638
35. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions:
a local svm approach. In: ICIP
36. Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor
and its application to action recognition. In: ACM MM
37. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach
to object matching in videos. In: ICCV
38. Smeaton AF, Over P, Kraaij W (2006) Evaluation campaigns and
TRECVID. In: ACM SIGMM international workshop on multimedia information retrieval (MIR)
39. Snoek CGM, Worring M, Gemert J, Geusebroek J, Smeulders
A (2006) The challenge problem for automated detection of 101
semantic concepts in multimedia. In: ACM MM
40. Solmaz B, Assari SM, Shah M (2013) Classifying web videos using
a global video descriptor. Mach Vis Appl 24(7):1473–1485
41. Sun D, Roth S, Black M (2014) A quantitative analysis of current
practices in optical flow estimation and the principles behind them.
Int J Comput Vis 106:115–137
42. Uijlings JRR, Smeulders AWM, Scha RJH (2010) Real-time visual
concept classification. IEEE Trans Multimed 12(7):665–681
43. Vedaldi A, Fulkerson B (2010) VLFeat—an open and portable
library of computer vision algorithms. In: ACM MM
44. Viola P, Jones M (2001) Rapid object detection using a boosted
cascade of simple features. Proc CVPR 1:511–518
45. Wang H, Kläser A, Schmid C, Liu C (2013) Dense trajectories and
motion boundary descriptors for action recognition. Int J Comput
Vis 103:60–79
46. Wang H, Ullah M, Kläser A, Laptev I, Schmid C (2009) Evaluation
of local spatio-temporal features for action recognition. In: BMVC