Elastic Sequence Correlation for Human Action Analysis

Li Wang, Li Cheng, Member, IEEE, and Liang Wang
Abstract—This paper addresses the problem of automatically
analyzing and understanding human actions from video footage.
An “action correlation” framework, elastic sequence correlation
(ESC), is proposed to identify action subsequences from a database
of (possibly long) video sequences that are similar to a given query
video action clip. In particular, we show that two well-known algorithms, namely approximate pattern matching in computer and
information sciences and dynamic time warping (DTW) method
in signal processing, are special cases of our ESC framework. The
proposed framework is applied to two important real-world applications: action pattern retrieval, as well as action segmentation
and recognition, where, on average, its run time speed (in matlab)
is about 3.3 frames per second. In addition, comparing with the
state-of-the-art algorithms on a number of challenging data sets,
our approach is demonstrated to perform competitively.
Index Terms—Action correlation, action pattern retrieval, action recognition, approximate pattern matching, dynamic time warping, edit distance.

Manuscript received September 21, 2009; revised October 06, 2010; accepted December 14, 2010. Date of publication December 23, 2010; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Miles Wernick.
L. Wang is with the Department of Computing Science, Nanjing Forestry University, 210037 Nanjing, China.
L. Cheng is with the Bioinformatics Institute, A*STAR, Singapore.
L. Wang is with the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, 100190 Beijing, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2010.2102043
I. INTRODUCTION
With the ubiquitous presence of video data in everyday
life, it becomes increasingly demanding nowadays to automatically analyze and understand human actions from large
amounts of video footage, which is strongly driven by a wide
range of applications including automatic visual surveillance,
smart human–machine interface, sports event interpretation, and
video browsing and retrieval.
In this paper, we consider the tasks of video action analysis
and understanding from an “action correlation” viewpoint. A
central question is: given a query action sequence $Q$ of length $m$
and a database video sequence $T$ of length $n$ (generally,
$m \ll n$), identify the locations where $Q$ matches a subsequence of $T$ with bounded correlation cost. As will be shown
in greater detail later, this question is formulated as solving a
minimization problem [cf. (1)], which naturally gives rise to a
dynamic programming (DP) formula [cf. (3)], the cornerstone
of the proposed elastic sequence correlation (ESC) algorithm
(i.e., Algorithm 1). In particular, we show that our framework
includes as special cases two widely used algorithms: dynamic
time warping (DTW) [1] from the signal processing community
as well as approximate pattern matching (string edit distance, or
Levenshtein distance) [2], [3] from the computer and information sciences community. These connections directly provide
us access to a number of dedicated techniques developed over
the years in either community, which conveniently leads to
possible variants of ESC to deal with specific scenarios, and as
examples we present two such variants in this paper.
The proposed ESC algorithm is rather flexible in terms of
accommodating either local or global feature representations. In
this paper, we focus particularly on the local feature representations that aim to capture the local salient aspect of image and
video gradients for representing image and video context. This
choice is motivated by a recent neuro-psychological finding [4]
that the visual and motor cortices of the human perception system are
more responsible than the semantic ones for the retrieval of visual
patterns. The proposed approach is further examined in two
practical applications: action pattern retrieval as well as action
segmentation and recognition, which are often addressed by
methods including probabilistic methods [e.g., hidden Markov
model (HMM)] [5], [6], and support vector machines (SVMs)
[7], [8]. As will be demonstrated later on a number of challenging
data sets (cf. Figs. 2, 5, and 7), our approach performs competitively compared with state-of-the-art methods.
Action Retrieval: As shown in Fig. 1, this application
studies the retrieval of action subsequences that are similar to
the query clip. This is in practice a major technical challenge
for the emerging industry of content-based video retrieval
from internet sources (e.g., Google and Yahoo video search,
VideoSurf,^1 Blinkx,^2 CastTV,^3 Pixsy,^4 and Viewdle^5). A variety
of spatiotemporal interest points such as [9]–[13] have been
devised and utilized in action video retrieval. In addition,
DeMenthon and Doermann [14] propose to work with 3-D
spatiotemporal volume of pixels. The work of Laptev and
Perez [7] adopts a learning-based approach aiming to retrieve
a specific action type (“drinking”) from film clips. Interested
readers might refer to the work of Moeslund et al. [15] or Poppe
[16] for a detailed survey of research developments in this field.
Action Segmentation & Recognition: There is a vast and
growing literature on this topic, so we restrict our description to a few works that we feel are most relevant or representative. Established methods for modeling and analyzing human
1 [Online]. Available: http://www.videosurf.com
2 [Online]. Available: http://www.blinkx.com
3 [Online]. Available: http://www.casttv.com
4 [Online]. Available: http://www.pixsy.com
5 [Online]. Available: http://www.viewdle.com
Fig. 1. Example that illustrates the retrieval of a set of hairpin net shot actions from a badminton playing video sequence, with one query clip.
actions are predominantly generative statistical methods, especially Markov models [5], [6], [17], [18], e.g., HMMs and their variants such as coupled HMMs [5], [6]. Recently, discriminative learning schemes have also been extended to structured outputs [e.g., the support vector machine with semi-Markov model output space (SVM-SMM) of [19] and conditional random fields (CRFs) [20], [21]], and encouraging results have been obtained for action segmentation and recognition. In this paper, we reduce the inference problem to matching a query against an existing set of annotated databases, a matching problem nicely solved by the dynamic programming component that is the cornerstone of our ESC framework. While conceptually simple, empirical experiments suggest that our method performs competitively compared with state-of-the-art methods that often rely on training sophisticated parametric models.
Our Contribution: The major contributions of this paper are as follows.
1) A new correlation-based framework (ESC) is proposed for action sequence analysis, which bears close connections to the established DTW and approximate pattern matching methods. By exploiting existing dedicated techniques developed for either DTW or edit distance, two ESC variants are further developed to address specific scenarios.
2) We examine ESC in two important real-world applications
and conduct extensive experiments where ESC is shown to
perform competitively.
As mentioned above, the ESC algorithm is closely related to approximate pattern matching methods (e.g., [3], [22]–[33]), where the string edit (or Levenshtein) distance is predominantly used. Although these methods have demonstrated sound robustness against observation noise [3], they are designed for combinatorial pattern matching over a finite alphabet, which unfortunately precludes measuring distances between real-valued feature vectors (thus an uncountable alphabet), which are often used by current action analysis methods and are readily dealt with by our ESC algorithm. Our algorithm also bears a strong connection to dynamic time warping (DTW), e.g., [1], [34]–[41], and in particular to elastic matching, e.g., [42], and deformable template, e.g., [43], methods in computer vision. In video action analysis, a number of recent works [10]–[12] are also related to this line of DTW approaches. However, DTW is known to be sensitive to noise (e.g., large feature deviations in a small cluster of frames), which often leads to extra misses or false alarms [1]. By utilizing the matching techniques developed in approximate pattern matching, our algorithm is empirically shown (cf. Figs. 2 and 6) to be rather robust to such noise. Moreover, our ESC framework exploits more of the temporal perspective for an action sequence, and this greatly differentiates ESC from those of [10]–[12] that emphasize a particular feature design of either spatiotemporal interest points or space–time volumes.

Fig. 2. Three queries in a synthetic data set. In each query, the left of the red (online version) vertical bar shows the query sequence, and the right is the database sequence. Each sequence is a 1-D time-series line drawing starting from left to right: at each time step, a 2-D point is observed and connected to the existing sequence. The retrieved subsequences are highlighted in green. More precisely, (a) illustrates the effectiveness of ESC when the subsequences vary slightly from the query template; (b) demonstrates the efficiency of adopting ESC-S (a 29% speed-up) while maintaining the same accuracy; and (c) presents an example of different scales. See the text for more details. (a) Robust to noise. (b) 29% speed-up with ESC-S. (c) Invariant to scales.
II. ESC FRAMEWORK
We present here the main idea underlying the proposed ESC framework, as well as its variants that address more specific issues such as varying temporal scale-spaces and speed-up with nonrepetitive patterns. This is followed by a formal analysis of the related algorithmic properties: the time and space complexities, as well as the correlation probability that characterizes how many subsequences would be identified using a query action pattern.
A. Main Idea
The main idea of ESC is to formulate the possible correlations
of the query and the database actions into a correlation matrix,
and the key ingredient is the utilization of a dynamic programming procedure to efficiently identify the optimal solution from a search space defined by this correlation matrix. More formally, assume that each video sequence is captured by sampling in the time domain a stream of frames at a certain sample rate. Assume that the query sequence $Q = (q_1, \ldots, q_m)$ of length $m$ and the database sequence $T = (t_1, \ldots, t_n)$ of length $n$ are obtained using the same sample rate, and let $r$ upper-bound the number of consecutive frames to which one frame from the opposite sequence could be matched. Denote $C(Q, T)$ the amount of correlation cost of matching between sequences $Q$ and $T$, and let $\kappa$^6 be an upper bound of the correlation cost. The problem now becomes finding all positions $j$ of $T$ such that there exists a suffix of $T_{1 \ldots j}$ matching $Q$ by

$$\min_{1 \le j' \le j} C(Q, T_{j' \ldots j}) \le \kappa. \qquad (1)$$

As a consequence, this gives the set of feasible action subsequences as

$$\mathcal{S} = \{\, T_{j' \ldots j} : \text{(1) is satisfied} \,\}. \qquad (2)$$

We define a query frame index $i$ and a database video index $j$, and denote $(j', j)$ as a (start, end) pair of the current correlation subsequence, together with $l$ as a zero-based index of this subsequence; in other words, $l = j - j'$. Now let $M$ be an $(m+1) \times (n+1)$ matrix with $i = 0, \ldots, m$ indexing the rows and $j = 0, \ldots, n$ indexing the columns, respectively, and each entry $M(i, j)$ storing the partial correlation cost between sequences $Q_{1 \ldots i}$ and $T_{j' \ldots j}$. We consider three elementary operations, compression, expansion, and substitution, and devise a DP procedure to compute the correlation matrix recursively as

$$M(i,j) = \min \begin{cases} M(i-1, j) + \gamma + \lambda\, \bar{d}(q_i, t_j) & \text{(compression)} \\ M(i, j-1) + \gamma + \lambda\, \bar{d}(q_i, t_j) & \text{(expansion)} \\ M(i-1, j-1) + \bar{d}(q_i, t_j) & \text{(substitution)} \end{cases} \qquad (3)$$

where $\gamma$ and $\lambda$ are nonnegative constants. Denote $x$ a frame observation, thus the $i$th frame is $x_i$, and let $d(x_i, x_j)$ measure the cost of a pair of frames $(x_i, x_j)$ induced from the feature representation (we will revisit this in greater detail in Section II-D). We further introduce an upper-bounded cost $\bar{d}(x_i, x_j) = \min\{d(x_i, x_j), d_{\max}\}$. Now, $C$ is robust to perturbation from a few noisy inputs, as a local outlier frame has only a bounded effect on the global correlation cost. The upper bound $d_{\max}$ is set using two small constants $\omega$ and $\varepsilon$.^7 It follows that the solution to the problem in (1) is exactly those locations $j$ where $M(m, j) \le \kappa L$, where $L$ denotes the length of the path from $(0, 0)$ to the current place $(m, j)$. As an example, Fig. 2(a) illustrates a pedagogical result of this procedure.

This proposed DP formula is a generalization of both DTW, e.g., [1] in the signal processing community, and approximate pattern matching (or edit/Levenshtein distance), e.g., [2], [3] in the computer and information sciences community, as given here.
• Setting $\gamma = 1$ and $\lambda = 0$, (3) becomes the standard form of approximate pattern matching;
• letting $\gamma = 0$ and $\lambda = 1$ gives back the familiar DTW formula.

In their standard forms, DTW measures similarities between two sequences when the two ends of each sequence are known a priori, while the finite-alphabet assumption of approximate pattern matching is not consistent with the feature representations of the action sequences adopted in this paper. These incompatibility issues are addressed in the ESC algorithm (i.e., Algorithm 1) that integrates this proposed DP formula.

As presented in Algorithm 1, rather than utilizing the full $(m+1) \times (n+1)$ matrix $M$, we instead work with a much smaller matrix whose number of columns is on the order of $m$: this saves a significant amount of storage space since usually $m \ll n$ for a very long database video, and the only payoff is that we have to reset the matrix once a new subsequence of $T$ is about to be screened. The local frame tolerance $\delta$ judges whether a current pair of frames is sufficiently correlated: a loose tolerance value $\delta$ is set to allow the occurrence of multiple matches (i.e., Algorithm 1 allows mismatches of several pairs of frames during the sequence correlation of $Q$ and $T$, as long as the average cost does not exceed $\kappa$); when no single error is allowed, such as in Algorithm 2, $\delta$ will instead be set to a much tighter value (i.e., close to zero).

Algorithm 1 Elastic Sequence Correlation (ESC)
Input: a query action video $Q$ and a database video $T$
Output: a set of subsequences $\mathcal{S}$
Initialize the correlation cost matrix $M$ to $\infty$; set $\kappa$, $\delta$, $\gamma$, $\lambda$ to some positive values; set database video index $j \leftarrow 1$ and query frame index $i \leftarrow 1$.
while $j \le n$ do
  if $\bar{d}(q_1, t_j) \le \delta$ then
    Attempt a new correlation: $j' \leftarrow j$, $l \leftarrow 0$
    while $j' + l \le n$ do
      for $i \leftarrow 1$ to $m$ do
        if $|i - l| \le r$ then apply (3)
        end if
      end for
      if $\exists\, i$ s.t. $M(i, l) \le \kappa L$ then $l \leftarrow l + 1$
      else
        break
      end if
    end while
    if $M(m, l) \le \kappa L$ then
      Add $T_{j' \ldots j'+l}$ to $\mathcal{S}$; backtrack to pick up the correlation path from $(m, l)$ up to $(0, 0)$; reset this subsequence of $M$ to $\infty$
    end if
    $j \leftarrow j' + l + 1$
  else
    $j \leftarrow j + 1$
  end if
end while

^6 By default, $\kappa$ is set to 0.18 during the experiments.
^7 In this paper, $\omega$ and $\varepsilon$ are both fixed to 0.1 by default.
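To make the recursion concrete, the following is a minimal Python sketch of the banded DP of (3) evaluated on one candidate database window. The function and parameter names are ours, the boundary handling is simplified, and the cost cap is taken as a plain argument; it illustrates the recursion under these assumptions rather than reproducing the authors' MATLAB implementation.

import numpy as np

def esc_dp(query, window, frame_dist, gamma, lam, d_max, r):
    """Banded DP for recursion (3) on one candidate window (a sketch).
    query: (m, d) array of per-frame features; window: (L, d) array.
    gamma/lam weight the compression and expansion operations, d_max caps
    the per-frame cost, and r is the Sakoe-Chiba band radius."""
    m, L = len(query), len(window)
    M = np.full((m + 1, L + 1), np.inf)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - r), min(L, i + r) + 1):  # band |i - j| <= r
            c = min(frame_dist(query[i - 1], window[j - 1]), d_max)
            M[i, j] = min(M[i - 1, j] + gamma + lam * c,   # compression
                          M[i, j - 1] + gamma + lam * c,   # expansion
                          M[i - 1, j - 1] + c)             # substitution
    return M

With gamma = 0, lam = 1 the recursion behaves like DTW, while gamma = 1, lam = 0 together with a 0/1 thresholded frame cost behaves like the edit distance, mirroring the two special cases above.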
B. Variants of ESC
Variants of the proposed ESC algorithm have been derived to address several practical scenarios.
1) Interpolation to Deal With the Scale-Space Issue: One important observation is that the same action type might be performed at distinct speeds by different persons. This is often due to a number of factors: camera hardware, age difference, gender, and health status of the subject. For example, an elderly female patient tends to walk much more slowly than a young healthy man. Scale-space theory has been well developed in the image spatial domain and successfully deployed in, e.g., the object detection problem [44]. Here, it can also be naturally extended to address this varying temporal scale-space phenomenon, where the sample rates of $Q$ and $T$ might be vastly different.
Similar to the scale-space search problem in object detection [44], interpolation is used to deal with this scale-space issue. More specifically, an iterative interpolation procedure is adopted: the query $Q$ is interpolated (scaled up or down) by a set of scaling factors $\{s\}$, where each scaling factor $s$ produces a sequence $Q^s$ of length $\lceil sm \rceil$. In other words, the interpolation from $Q$ of length $m$ to $Q^s$ of length $\lceil sm \rceil$ is achieved by

$$q^s_i = q_{\lceil i/s \rceil}, \qquad i = 1, \ldots, \lceil sm \rceil \qquad (4)$$

where $q^s_i$ denotes the $i$th frame of $Q^s$, and $q_{\lceil i/s \rceil}$ is the $\lceil i/s \rceil$th frame of $Q$.
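The following is a minimal sketch of this temporal rescaling for per-frame feature sequences, assuming the nearest-neighbor mapping written in (4); the names and the chosen set of scaling factors are illustrative.

import numpy as np

def rescale_query(query, s):
    """Temporally rescale a query by factor s (nearest-neighbor, cf. (4)).
    query: (m, d) array of per-frame features -> (ceil(s*m), d) array."""
    m = len(query)
    m_s = int(np.ceil(s * m))
    # Map each target frame index back to a source frame index in [0, m-1].
    src = np.minimum((np.arange(m_s) / s).astype(int), m - 1)
    return query[src]

# Probe a small pyramid of temporal scales, as in the iterative procedure.
scales = [0.5, 0.75, 1.0, 1.5, 2.0]      # illustrative scaling factors
query = np.random.rand(30, 60)           # toy query: 30 frames, 60-D features
pyramid = {s: rescale_query(query, s) for s in scales}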
2) Speed-Up Using a Partial Correlation Table: The complexity of a naive algorithm amounts to $O(mn)$, linear in the cardinalities of both the database video and the query clip. This is acceptable for a small database, but not for real-life scenarios where we need to deal with large-scale databases, and it is often costly to compute the entries of the matrix. In this respect, Algorithm 1 provides an efficient solution by adopting a well-known idea from the signal processing community used to speed up DTW, where the search path in the matrix $M$ is constrained to lie within a banded matrix, e.g., the Sakoe–Chiba band [45] of radius $r$. In addition, we exploit the ideas of filtering [2], [3] from the computer science community for approximate pattern matching, to prune off impossible subsequences as quickly as possible before entering the more thorough but computationally demanding DP procedure.

We note that the proposed ESC (Algorithm 1) is an error-tolerant algorithm. In practice, there are cases in which we would like to enforce strict matching, where no single error is allowed. ESC can be easily adapted to deal with this particular case and is then termed ESC-S (comprising Algorithms 2 and 3). For a query index $i$, PCT($i$) returns the corresponding index in the database video of the first matched frame for frame $i$ in $Q$. Its efficiency is noticeably improved by introducing the partial correlation table (PCT) built in Algorithm 3, motivated by the Knuth–Morris–Pratt (KMP) algorithm [46] in pattern matching. The idea is to "prescan" the query sequence itself and compile a list of possible prefix positions that bypass as many as possible of those frames at which correlation cannot succeed, while not sacrificing any potential correlations. This trick provides a significant speed-up for a query sequence $Q$, especially when it contains nonrepetitive patterns.

Algorithm 2 ESC-S Main Algorithm
Input: a query action video $Q$ and a database video $T$
Output: a set of subsequences $\mathcal{S}$
Initialize the correlation cost matrix $M$ to $\infty$ and the row-wise progress flag to 0; set $\gamma$ and $\lambda$ to positive values and $\delta$ to a tight tolerance; set database video index $j \leftarrow 1$ and query frame index $i \leftarrow 1$.
while $j \le n$ do
  if $\bar{d}(q_1, t_j) \le \delta$ then
    Attempt a new correlation: $j' \leftarrow j$, $l \leftarrow 0$
    while $j' + l \le n$ do
      for $i \leftarrow 1$ to $m$ do
        if $|i - l| \le r$ then apply (3); set the progress flag
        end if
      end for
      if the current row still makes progress then $l \leftarrow l + 1$
      else
        Shift index $j$ using the PCT; break
      end if
    end while
    if $M(m, l) \le \kappa L$ then
      Add $T_{j' \ldots j'+l}$ to $\mathcal{S}$; backtrack the correlation path of this subsequence; reset this subsequence of $M$ to $\infty$
    end if
  else
    $j \leftarrow j + 1$
  end if
end while

Algorithm 3 Build Partial Correlation Table (PCT)
Input: a query action video $Q$
Output: the table PCT
Initialize $k \leftarrow 0$, $i \leftarrow 2$, and PCT(1) $\leftarrow 0$
repeat
  if $\bar{d}(q_i, q_{k+1}) \le \delta$ then
    $k \leftarrow k + 1$, PCT($i$) $\leftarrow k$, $i \leftarrow i + 1$
  else
    if $k > 0$ then
      $k \leftarrow$ PCT($k$)
    else
      PCT($i$) $\leftarrow 0$, $i \leftarrow i + 1$
    end if
  end if
until $i > m$
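As a point of reference, the following Python sketch builds a KMP-style prefix table over the query, with exact symbol equality replaced by a frame-distance test against the tolerance delta; the function name and the exact table semantics are our illustrative assumptions.

import numpy as np

def build_pct(query, frame_dist, delta):
    """KMP-style partial correlation table (a sketch): pct[i] is the length
    of the longest proper prefix of query[:i+1] that also matches, within
    tolerance delta, a suffix ending at frame i."""
    m = len(query)
    pct = [0] * m
    k = 0                                  # length of current matched prefix
    for i in range(1, m):
        # On a mismatch, fall back through shorter candidate prefixes.
        while k > 0 and frame_dist(query[i], query[k]) > delta:
            k = pct[k - 1]
        if frame_dist(query[i], query[k]) <= delta:
            k += 1
        pct[i] = k
    return pct

# Toy usage with a Euclidean frame distance over feature vectors.
dist = lambda a, b: float(np.linalg.norm(a - b))
table = build_pct(np.random.rand(20, 8), dist, delta=0.1)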
C. Algorithmic Analysis

The space complexity is $O(m)$ for ESC (Algorithm 1) as well as for ESC-S (Algorithms 2 and 3 together),^8 by exploiting the Sakoe–Chiba band matrix of radius $r$ and the filtering tricks. The time complexity of ESC is $O(mn)$ in the worst case, with the accumulative matching errors determined by $\kappa$, $\delta$, and $r$. For a restricted class of problems, ESC-S leads to an efficient algorithm of $O(m + n)$ time complexity, in both the average and the worst cases. The efficiency comes from the partial correlation table: when there is a mismatch between two frames, rather than beginning the search again at the next database frame, we move on directly, resetting the query index according to the table and keeping the database index in place.

A fundamental question besides the time and space complexities is the correlation probability, which is the probability that any subsequence would be identified by a query action pattern. Given a query action pattern of length $m$ and a database video sequence of length $n$, as well as the correlation cost upper bound $\kappa$ incurred from the three elementary operations (compression, expansion, and substitution), we define $\alpha = \kappa / m$ as the error ratio of the sequence correlation. In addition, the usage of the local frame tolerance $\delta$ implies a quantization of the action feature space as $\Sigma$, with its size $\sigma = |\Sigma|$. Now, let $P(m, n, \alpha)$ be the probability that a query $Q$ matches a subsequence of $T$ with at most error rate $\alpha$ (or, equivalently, cost upper-bounded by $\kappa$). The following lemma (adopted from [3]) suggests that this correlation probability decreases exponentially as $m$ increases, as long as $\alpha < 1 - e/\sqrt{\sigma}$.

Lemma 1 (Correlation Probability): Assume that the frame action features of any video sequence are i.i.d. generated. Then the correlation probability is $P(m, n, \alpha) = O(n \zeta^m)$, with

$$\zeta = \frac{1}{\sigma^{1-\alpha}\, \alpha^{2\alpha}\, (1-\alpha)^{2(1-\alpha)}}. \qquad (5)$$

Clearly, $\zeta < 1$ for $\alpha < 1 - e/\sqrt{\sigma}$, as it becomes $1/\sigma$ as $\alpha \to 0$ and grows up to 1 as $\alpha$ grows to 1. Although not a particularly tight bound, this lemma nevertheless indicates the sparseness of the potential matches as $m$ increases; in particular, it clearly suggests that, as the length of the query sequence grows, it is crucial to drop irrelevant subsequences from the database sequence as early as possible, which in turn supports our usage of, e.g., Sakoe–Chiba bands and filtering methods for an efficient solution. Empirically, the average running speed of our approach is about 3.3 frames per second for midsize (320 × 240) videos. This is obtained on a desktop PC with AMD Athlon 2.6-GHz dual CPUs and 2-GB memory. As our current implementation is in MATLAB, a more efficient implementation in a programming language such as C++ would lead to further improvement of the run-time performance.

^8 It is in fact on the scale of $O(m)$ for both ESC and ESC-S, as the band radius $r \ll m$ in practice.

D. Feature Representation and Distance Measure
Neuro-psychological findings such as [4] suggest that the visual and motor cortices of the human perception system are more responsible than the semantic ones for the retrieval and recognition of visual action patterns. This motivates us to represent action features by a set of key points that capture the salient aspects of spatial and temporal video gradients. It is worth mentioning that, although we focus on local feature representations in this paper, the proposed ESC framework is also able to accommodate global features. In what follows, we consider in particular the idea of point-set frame representation and matching between pairs of frames, which is achieved by adapting the shape context (SC) features (Belongie et al. [47]) to our context, as follows.
Without loss of generality, we assume that there exists exactly one object in any given frame. An object can be sufficiently represented by a set of its key points.^9 The proposed measuring scheme, which is illustrated in Fig. 3, consists of three steps.
Step 1) Key point correspondence: Given two objects $X$ and $Y$, solve the point-set correspondence problem by finding an optimal one-to-one match of the points.
Step 2) Key point alignment: To align the correspondences found so far, estimate an alignment transformation such that each point from object $Y$ is spatially aligned with its corresponding point from $X$.
Step 3) Key points distance: Compute key points distance as
a sum of the effort spent in the key points alignment
(the bending energy), plus the remaining key points
correspondence cost after aligning the two key point
sets.
This scheme can be described as follows. For each point $p$, a key point descriptor is defined using its shape context descriptor, which is a local log-polar histogram that compactly encodes the spatial configuration of the remaining points. The $\chi^2$ test statistic is then used to determine the cost $c(p, q)$ of matching two given points $p$ and $q$ that come from objects $X$ and $Y$, respectively. This cost function enables a direct computation of the set of costs between all pairs of points $p_i \in X$ and $q_j \in Y$. The key point correspondence problem is thus formulated and solved as a standard bipartite graph matching by minimizing the total matching cost

$$H(\pi) = \sum_i c\big(p_i, q_{\pi(i)}\big) \qquad (6)$$

where $\pi$ denotes a permutation.
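A minimal sketch of this matching step is given below, assuming equal-size point sets with normalized log-polar histograms; the Hungarian solver from SciPy stands in for the standard bipartite graph matching, and all names are ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_cost(h1, h2, eps=1e-10):
    """Chi-squared matching cost between two normalized shape-context
    histograms, in the spirit of Belongie et al. [47]."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def match_point_sets(H_x, H_y):
    """Solve the key-point correspondence of (6) as bipartite matching.
    H_x, H_y: (K, B) arrays of log-polar histograms, one row per key point.
    Returns the permutation pi and the total matching cost H(pi)."""
    K = H_x.shape[0]
    cost = np.array([[chi2_cost(H_x[i], H_y[j]) for j in range(K)]
                     for i in range(K)])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return cols, cost[rows, cols].sum()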
The key point alignment transformation $\mathcal{T}$ is estimated by solving a dedicated linear system of equations. The thin-plate spline (TPS), which generalizes the cubic spline to two dimensions, is used to recover the transformation between the curves. The bending energy $E_b$ (as in, e.g., [47]) measures the magnitude of this transformation. We omit a detailed explanation of the alignment and bending energy, and instead present an illustrative example in Fig. 3.

After applying the alignment transformation on the point set of object $Y$ (denoted as $\mathcal{T}(q_i)$, $i = 1, 2, \ldots$), the remaining key point correspondence cost is defined as

$$E_c = \sum_i c\big(p_i, \mathcal{T}(q_{\pi(i)})\big). \qquad (7)$$

The key point distance between the two objects, $X$ and $Y$, is thus a sum of the two costs, i.e.,

$$d(X, Y) = E_c + \beta E_b \qquad (8)$$
where $\beta$ is a constant to trade off between the two costs and is fixed to 1 in this paper.

^9 As shown in Fig. 3, the object is represented as a set of two-dimensional points, where each point is encoded as a local histogram. This is essentially a nonvectorial representation of the object, since the points do not necessarily respect an order.

Fig. 3. Default local feature representation used in, e.g., the WBD data set. The left panel shows two objects, which are then represented as point sets around the objects shown in the middle panels. In each point set, a point is further encoded as a local log-polar histogram with radius $r$ and angle $\theta$, shown in the top right panel. The points in different silhouettes are not synchronized until (shown in the bottom right panel) the key point correspondences are found.
Thus far, a limitation of this distance measure is that it only works on pairs of frames, while it is desirable to incorporate the temporal information that characterizes the local motion flow. Directly utilizing motion flow computation, however, is often computationally demanding. Instead, we adopt a simple temporal sliding-window method as follows. For the $i$th frame, denote $W_i$ a window of $2w + 1$ frames centered around it, and similarly $W_j$ for frame $j$. Let $u$ and $v$ index the current frames in $W_i$ and $W_j$, respectively, and synchronize them such that they always point to the same location of the window. Now, introduce a distance function as a convex combination of $d$ over the temporal windows

$$D(x_i, y_j) = \sum_{k=-w}^{w} \eta_k\, d(x_{i+k}, y_{j+k}) \qquad (9)$$

where the weights $\eta_k \ge 0$ and $\sum_k \eta_k = 1$. In practice, we let $2w + 1 = 7$ and fix the weight vector to be all 1/7.
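A short sketch of this windowed distance follows, using the uniform 1/7 weights; clamping the window indices at the sequence boundaries is our assumption, as the paper does not state how edges are handled.

import numpy as np

def windowed_distance(X, Y, i, j, frame_dist, w=3):
    """Temporal sliding-window distance of (9): a convex combination of
    frame-level costs over a (2w+1)-frame window centered at frames i, j.
    Uniform weights 1/(2w+1) match the paper's default of all 1/7."""
    n_x, n_y = len(X), len(Y)
    weight = 1.0 / (2 * w + 1)
    total = 0.0
    for k in range(-w, w + 1):
        # Clamp window indices at the sequence boundaries (our assumption).
        u = min(max(i + k, 0), n_x - 1)
        v = min(max(j + k, 0), n_y - 1)
        total += weight * frame_dist(X[u], Y[v])
    return total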
The main differences between our feature representation and that of, e.g., Video Google [48] are: 1) we use the shape context feature as a robust representation; and 2) a temporal local window is utilized to incorporate spatio-temporal context during the local key point measurement. We note in passing
that, while a specific local feature representation, shape context
feature [47], is adopted here as the default feature scheme, ESC
is flexible and can work with other feature representations. This
is demonstrated later in one set of empirical experiments where
a different spatiotemporal feature representation [49] is used.
III. APPLICATION I: ACTION RETRIEVAL
Here, we concentrate on the application of our ESC framework to action pattern retrieval, which has received increasing attention in recent years from both the multimedia and the computer vision communities.
This application can be formally characterized as follows. Given a query action video $Q$ and a database containing videos $\{T^1, \ldots, T^N\}$, the task of action pattern retrieval is to retrieve the set of action subsequences that are similar to $Q$, namely

$$\mathcal{S} = \bigcup_{k=1}^{N} \big\{\, \text{subsequences of } T^k \text{ satisfying (1)} \,\big\}. \qquad (10)$$
This is a major technical issue for the emerging industry of video
retrieval from internet sources as mentioned previously.
It is clear from (10) that this application is intimately related
to the key problem we consider in this paper, as evidenced by (1)
and (2). In light of this observation, we adopt a straightforward
methodology where the proposed ESC framework is applied to
each database video clip, and then the desired results are naturally obtained as a union of all of the retrieved subsequences.
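A sketch of this methodology is given below, assuming an esc(query, video) routine that returns the subsequences retrieved from a single video by Algorithm 1; all names here are illustrative.

def retrieve(query, database, esc):
    """Action retrieval per (10): run ESC against every database video and
    pool the retrieved subsequences. esc(query, video) is assumed to
    return a list of (start, end, cost) tuples for one video."""
    results = []
    for k, video in enumerate(database):
        for start, end, cost in esc(query, video):
            results.append({"video": k, "start": start,
                            "end": end, "cost": cost})
    return results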
A. Experimental Results and Analysis
Our approach is first evaluated on a synthetic data set to
demonstrate the applicability of ESC as well as its two variants. We further conduct experiments on two representative
real-world data sets, where ESC is compared with two special
cases: DTW (by setting $\gamma = 0$ and $\lambda = 1$) and approximate pattern matching ($\gamma = 1$ and $\lambda = 0$). During these experiments, the remaining parameters are set empirically, as detailed per data set below.
1) Synthetic Data Set: As displayed in Fig. 2, a synthetic data set is generated and three experiments are conducted. For each experiment, the left side shows the query sequence, while the database sequence is on the right. Each sequence consists of a 1-D time-series line drawing starting from left to right: at each time step, a 2-D point is produced and connected to the existing sequence. To illustrate this, we present at the bottom of each subplot in Fig. 2 a flattened one-dimensional time-series line with a color code from dark purple to light green to indicate the transition from start to end. For the database sequence, we use black as the base color and, when a subsequence is matched with the query, use the above-mentioned color code to indicate the element-by-element matches. The first experiment (query 1) aims to show that ESC is error tolerant: three subsequences from the database sequence are identified as well correlated to the query pattern subject to the bound $\kappa$ of (1). Moreover, the number of retrieved subsequences gracefully reduces to one (i.e., the middle subsequence) as we decrease the value of $\kappa$. In scenarios where we would like to retrieve without significant errors, ESC-S is shown to deliver the same results with a 29% reduction in running time, as demonstrated in query 2. In the third experiment, while query 3 is presented at a scale entirely different from that of the database sequence, our algorithm is still able to work well after adopting the iterative interpolation method of (4). The $\kappa$ values used in these three queries are 0.7, 0.3, and 0.3, respectively. In addition, the robustness of the retrieval performance is demonstrated in Fig. 4 following the experiment
protocol of Fig. 2(a), where the detection rate does not change significantly when we vary the value of $\kappa$ between 0.0 and 1.0.

Fig. 4. Demonstration of the robustness of the performance when varying the parameter $\kappa$. This experiment is conducted on the synthetic data set using the same query as in Fig. 2(a).
2) Beach Data Set: The beach data set contains one database video of 456 frames of size 360 × 180, taken from beach scenes. Some sample frames are displayed in Fig. 5. One "walking" query of six frames is shown in the top-left panel of Fig. 5. The highlighted objects displayed in Fig. 5 are from the retrieved subsequences identified as similar to the query. As a preprocessing step for the beach and the badminton data sets, the foreground objects are segmented from the dynamic backgrounds using an efficient background subtraction technique [50], and the local feature points are restricted to lie inside or sufficiently close to the foreground objects. Based on these local features, the shape context features of the object of interest are obtained. The quantitative ROC curve is presented in Fig. 6(a): compared with the two special cases, DTW and approximate pattern matching, the proposed approach is shown to achieve better performance by allowing the parameters to be tuned to dedicated values (here set to 0.9). Since repetitive walking behavior can be continuously identified using our approach, the length of the retrieved subsequences may vary significantly; e.g., in this data set it ranges from three frames to 58 frames.
3) Badminton Data Set: We build a badminton data set that contains a sequence of 9218 frames collected from a badminton match, where the size of each image frame is 360 × 288. Three query sequences are created by manually picking three short subsequences from this database sequence, including one hairpin net shot action as well as two smash actions. Fig. 7 displays nine sample images from this data set: the three query actions are given in the first row, where the temporal action pattern of each query is overlaid onto one image; the second row displays three random images and three exemplar retrieved frames (each from one retrieved action subsequence). Note that the human foregrounds are highlighted with three distinct colors to illustrate to which query sequence each frame corresponds. As an example, we present in Fig. 1 a retrieval flowchart for one short query clip of a hairpin net shot action. As presented in the ROC curve of Fig. 6(b), the DTW and approximate pattern matching methods give comparable results, while both are inferior to our ESC algorithm.
Fig. 5. Gallery of the beach data set. Left: the query walking pattern, where the temporal walking behaviors are overlaid onto the image. The rest are example frames from the data set, where the highlighted objects are from frames identified as matching the query in their respective subsequences.
Fig. 6. ROC curves of the two data sets. (a) Beach. (b) Badminton.
Fig. 7. Sample frames of the badminton data set. The top row presents three query sequences, where the action pattern of each query is overlaid onto one image. The bottom row displays randomly selected frames and three exemplar retrieved frames, each from one retrieved subsequence.
During these two experiments, $\kappa$ is set to 0.4 to allow several matching errors.
IV. APPLICATION II: ACTION RECOGNITION
Action recognition is one of the key topics in action analysis and understanding, and has a wide range of promising applications such as visual surveillance, video event analysis, and intelligent interfaces. Here, we focus on elementary actions such as running, walking, and drawing on a board. In particular, we consider three specific scenarios: 1) action recognition without segmentation; 2) joint action recognition and action-cycle segmentation; and 3) jointly segmenting and identifying actions from a video sequence in which one person performs a sequence of continuous different actions.
A. Experimental Results and Analysis
In what follows, we examine the three scenarios by conducting experiments on four standard data sets, where the proposed ESC algorithm is augmented with simple 1NN (nearest-neighbor) strategies to recognize (and segment when necessary) action subsequences. During these experiments, the parameters of the proposed approach are set empirically, and we compare its performance with those of state-of-the-art algorithms.
Fig. 8. Sample frames of one person engaging in six types of actions in the KTH data set.
Fig. 9. Cuboid features on the KTH data set.
1) Action Recognition (Single Action Per Sequence): We consider a relatively simple action recognition scenario in which we have a training database of action sequences containing a number of action classes, where each action sequence possesses exactly one action. Now, an unseen test action sequence is categorized into one of the actions by querying through each of the training sequences, followed by a simple 1NN classifier to identify the best (sub)sequence correlation.
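A minimal sketch of this strategy follows, assuming an esc_cost(a, b) routine that returns the best (lowest) subsequence correlation cost found by ESC between two clips; all names are illustrative.

def recognize_1nn(test_seq, train_db, esc_cost):
    """Single-action recognition (a sketch): correlate the test sequence
    against every training sequence and return the label of the nearest
    neighbor under the ESC correlation cost."""
    best_label, best_cost = None, float("inf")
    for seq, label in train_db:          # train_db: list of (clip, label)
        cost = esc_cost(test_seq, seq)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label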
TABLE I
COMPARISONS OF ACTION RECOGNITION RATES ON KTH DATA SET
To demonstrate the flexibility of the proposed framework, a different local feature representation (i.e., the "cuboid" representation of [49]) is adopted here, which is essentially an extension of the SIFT descriptor [51] to the spatiotemporal domain. More specifically, this detector is tuned to fire whenever variations in local image intensities contain distinguishing spatiotemporal characteristics. At each detected interest point location, a 3-D cuboid is then extracted and represented as a flattened vector that contains the spatiotemporally windowed information, including normalized pixel values, brightness gradients, and windowed optical flow. In [49], a codebook representation is constructed using k-means clustering, similar to the visual vocabulary approach of [48] and [8], and each cuboid is further projected into this codebook space as a codeword. To fit into our framework, each action frame is now represented as the set of 3-D cuboid codewords intersecting the current frame, and the frame cost $d$ is computed as the sum of Euclidean distances between the two sets of cuboids, where the correspondence between two cuboids (each from one frame) is made using 1NN. Then we still use (7) and (9) to compute the distance between pairs of frames.
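The following is a minimal sketch of this set-to-set frame cost, matching each cuboid to its Euclidean nearest neighbor in the other frame; the handling of empty sets is our assumption.

import numpy as np

def cuboid_frame_dist(C_x, C_y):
    """Frame-level cost from cuboid descriptors (a sketch): each frame is a
    set of cuboid codeword vectors; match each cuboid in C_x to its nearest
    neighbor in C_y and sum the Euclidean distances."""
    C_x, C_y = np.asarray(C_x), np.asarray(C_y)
    if len(C_x) == 0 or len(C_y) == 0:
        # Our assumption: two empty frames match; otherwise no match.
        return 0.0 if len(C_x) == len(C_y) else np.inf
    total = 0.0
    for c in C_x:
        total += float(np.min(np.linalg.norm(C_y - c, axis=1)))  # 1NN match
    return total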
The first data set is the KTH data set used in [52]. There are 25
individuals engaged in six actions: running, walking, jogging,
boxing, handclapping, and handwaving, under four different environment conditions. Together this amounts to 600 action video
clips. Fig. 8 displays example frames of one person performing
each of the six actions. We used the training, validation, and
testing splits as proposed in [52]. Their cuboid representations
are illustrated in Fig. 9.
Fig. 10(a) reports the confusion matrix of the approach of Dollar et al. [49] on the left and that of ESC on the right. Overall, the results of ESC improve over those of Dollar et al.: for example, 40% of the time handclapping is wrongly labeled as boxing by Dollar et al., while in ESC the same kind of error occurs only 15% of the time. The average recognition accuracy of ESC across the six types of actions is 0.86, which outperforms the 0.81 of Dollar et al. [49]. Furthermore, Table I presents a list of state-of-the-art methods in terms of recognition accuracy on this KTH data set. Our method is shown to perform favorably compared with methods that also utilize "cuboid" features (e.g., [49] and the SVM of [53]) reported in the literature.
^10 The experiments are performed using the publicly available implementation of Dollar et al. [49]. [Online]. Available: http://vision.ucsd.edu/~pdollar/research/cuboids_code/cuboids_Apr19_2006.zip

Fig. 10. Confusion matrices on the two data sets in (a) and (b), respectively. In each data set, the left matrix is the result of Dollar et al. [49] and the right is that of ESC. (a) KTH. (b) Facial expression.
The second is a facial expression data set compiled by [49]. We use a subset of this data set that contains two individuals, each expressing six different emotions under the same illumination. The expressions are anger, disgust, fear, joy, sadness, and surprise, as illustrated in Fig. 11; while expressions such as joy and sadness are quite distinct, others (e.g., anger and disgust) are very similar. Each individual repeats each of the six expressions eight times, which gives a total of 96 video clips. Each sequence contains about 120 frames, each of size 152 × 194, where the subject always starts with a neutral expression, expresses an emotion, and then returns to neutral.

Fig. 10(b) reports the result of the approach of Dollar et al. [49] (left) and that of ESC (right), both averaged over five repetitions. Again, ESC significantly outperforms Dollar et al.^10 on this data set. For example, anger is entirely confused with disgust (63% of the time) and joy (38% of the time) by Dollar et al., while it is correctly labeled by ESC. A similar improvement is observed for surprise as well. Overall, the average recognition accuracy of ESC across the six types of actions is 0.83, which outperforms the 0.62 of Dollar et al.
Fig. 11. Sample frames of one person expressing six types of emotions in the facial expression data set.

2) Action Segmentation and Recognition (Multiple Repeated Actions Per Sequence): This is a different scenario: each database video sequence contains exactly one type of action, and it is assumed that it can be partitioned into action cycles. The task now becomes segmenting the unseen test sequence into action cycles and, at the same time, recognizing the action type of the entire test sequence. To meet this goal, we use each action cycle from the training database sequences as a query pattern and exploit our ESC algorithm as follows. For each feasible action type, a simple 1NN classification is employed to retrieve the optimal set of correlated action cycles, which gives a segmentation of the test sequence, as well as the average correlation cost (i.e., the total correlation cost divided by the number of segments). The final action type is then identified as the one with the least average correlation cost.
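A short sketch of this selection rule follows, assuming an esc_retrieve(cycles, seq) routine that returns the retrieved (start, end, cost) segments for a set of query cycles; all names are illustrative.

def segment_and_recognize(test_seq, cycle_db, esc_retrieve):
    """Joint cycle segmentation and sequence-level recognition (a sketch):
    for each action type, segment the test sequence with that type's
    training cycles as queries, then pick the type with the least average
    correlation cost per segment."""
    best = None
    for action, cycles in cycle_db.items():  # cycle_db: type -> query cycles
        segments = esc_retrieve(cycles, test_seq)
        if not segments:
            continue
        avg = sum(c for _, _, c in segments) / len(segments)
        if best is None or avg < best[1]:
            best = (action, avg, segments)
    return best   # (action type, average cost, segmentation)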
We conduct experiments on the CMU MoBo data set [56], which contains 24 individuals^11 walking on a treadmill. As illustrated in Fig. 12, each subject performs in a video clip one of four different actions: slow walk, fast walk, incline walk, and slow walk with a ball. Each clip has been preprocessed to contain several cycles of a single action. Following [8], the boundary positions of these cycles are manually labeled as ground truth, and we evaluate the action recognition and segmentation performance separately, as the recognition rate and the $F$-score, respectively. The SIFT codebook feature representation of [8] is incorporated into ESC to facilitate a direct comparison. The $F$-score, given by $F = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$, is adopted to measure the segmentation performance.
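A small sketch of this segmentation score is given below; the matching tolerance between predicted and ground-truth cycle boundaries is our assumption, as the paper does not state one.

def segmentation_f_score(pred_bounds, true_bounds, tol=2):
    """F-score for cycle-boundary detection (a sketch, with an assumed
    matching tolerance of tol frames): a predicted boundary counts as
    correct if it lies within tol frames of an unmatched ground-truth one."""
    matched, used = 0, set()
    for p in pred_bounds:
        for k, t in enumerate(true_bounds):
            if k not in used and abs(p - t) <= tol:
                used.add(k)
                matched += 1
                break
    precision = matched / len(pred_bounds) if pred_bounds else 0.0
    recall = matched / len(true_bounds) if true_bounds else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)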
The top two rows of Table II list the recognition and segmentation results of the comparison algorithms besides ESC, namely, 1NN, SVM, as well as the recent SVM-HMM and SVM-SMM [8]. Overall, ESC clearly outperforms these comparison algorithms. This is to be expected, as the cyclic motion patterns are better preserved by the ESC algorithm.

^11 The data set originally consists of 25 subjects. We drop the last person since we had problems obtaining the sequences of this individual walking with a ball.

Fig. 12. Sample frames of subjects, each performing one of the four actions (slow walk, fast walk, incline walk, and walk with a ball) in an action sequence of the CMU MoBo data set.
3) Action Segmentation and Recognition (Multiple Different Actions Per Sequence): In this case, each database sequence contains multiple actions, and we would like to partition an unseen test sequence into action segments as well as identify their corresponding action labels. Similar to the previous scenario, Algorithm 1 is used, where each action segment from the training database sequences serves as a query pattern. A 1NN classification scheme is then used iteratively to retrieve the optimal set of action segments, subject to the constraints that these segments are mostly nonoverlapping and that their union covers most of the entire action sequence. As before, optimality is determined in the sense of the least correlation cost.
The Walk-Bend-Draw (WBD) data set from [8] is an indoor video data set that contains three subjects, each performing six action sequences at 30 frames per second; each sequence consists of three continuous actions: slow walk, bend body, and draw on board, and on average each action lasts about 2.5 s. We
TABLE II
COMPARISONS OF PERFORMANCE ON THE MOBO AND THE WBD DATASETS
Fig. 13. Sample frames of three subjects each engaging in a continuous sequence of three actions: walk, bend, and draw, in the WBD data set.
subsample each sequence to obtain 30 key frames and manually label the ground-truth actions. Figs. 3 and 13 present various sample frames, and Fig. 13 shows three subjects each performing the continuous WBD actions in one video sequence. During this experiment, we adopt the same feature representation as [8] for a direct comparison.
In Table II, the bottom row presents the average recognition accuracy over the three types of actions, where ESC performs on par with SVM-SMM and clearly outperforms SVM, 1NN, and SVM-HMM.
V. OUTLOOK AND DISCUSSION
We have proposed a simple yet powerful sequence correlation framework, namely ESC, for the tasks of video action analysis and understanding. In particular, we devise a generalized DP formula that enables the exploitation of useful techniques from both the DTW and the approximate pattern matching research communities, and that is convenient to integrate with various local feature representation schemes. We evaluate our approach in two related applications, action pattern retrieval and action segmentation and recognition, where performance comparable to the state of the art is obtained in empirical evaluations on a number of video action data sets. Future work includes incorporating geometrically invariant feature representations to deal with the issue of multiple views, as well as extensions to kernel learning. In particular, we are interested in applying the proposed framework to the problem of unusual video activity detection.
REFERENCES
[1] C. S. Myers and L. R. Rabiner, “A comparative study of several dynamic time-warping algorithms for connected word recognition,” Bell
Syst. Tech. J., vol. 60, no. 7, pp. 1389–1409, 1981.
[2] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer
Science and Computational Biology. Cambridge, U.K.: Cambridge
Univ. Press, 1997.
[3] G. Navarro, “A guided tour to approximate string matching,” ACM
Comput. Surv., vol. 33, no. 1, pp. 31–88, 2001.
[4] J. Phillips, G. Humphreys, U. Noppeney, and C. Price, “The neural
substrates of action retrieval: An examination of semantic and visual
routes to action,” Visual Cognition, vol. 9, no. 4–5, pp. 662–685,
2002.
[5] J. Yamato, J. Ohya, and K. Ishii, “Recognizing human action in timesequential images using hidden Markov model,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., 1992, pp. 379–385.
[6] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models
for complex action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1997, p. 994.
[7] I. Laptev and P. Perez, “Retrieving actions in movies,” in Proc. Int.
Conf. Comput. Vis., 2007, pp. 1–8.
[8] Q. Shi, L. Wang, L. Cheng, and A. Smola, “Discriminative human action segmentation and recognition using semi-Markov model,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[9] L. Zelnik-Manor and M. Irani, “Event-based analysis of video,” in
Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2001, pp. 123–130.
[10] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in Proc. Int. Conf. Comput. Vis., 2005, pp.
1395–1402.
[11] E. Shechtman and M. Irani, “Space-time behavior-based correlation,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 11, pp. 2045–2056,
Nov. 2007.
[12] I. Laptev, B. Caputo, C. Schüldt, and T. Lindeberg, “Local velocityadapted motion events for spatio-temporal recognition,” Comput. Vis.
Image Underst., vol. 108, no. 3, pp. 207–229, 2007.
[13] T. Thi, L. Cheng, J. Zhang, L. Wang, and S. Satoh, “Human action
recognition and localization in video using structured learning of local
space-time features,” in Proc. Int. Conf. Adv. Video Signal Based
Surveillance, 2010, pp. 1–8.
[14] D. DeMenthon and D. Doermann, “Video retrieval using spatio-temporal descriptors,” in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp.
508–517.
[15] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in
vision-based human motion capture and analysis,” Comput. Vis. Image
Underst., vol. 104, no. 2, pp. 90–126, 2006.
[16] R. Poppe, “A survey on vision-based human action recognition,” Image
Vision Comput., vol. 28, pp. 976–990, 2010.
[17] F. Lv and R. Nevatia, “Recognition and segmentation of 3-d human
action using HMM and multi-class Adaboost,” in Proc. Eur. Conf.
Comput. Vis., 2006, pp. IV: 359–IV: 372.
[18] A. Kale, A. Sundaresan, A. Rajagopalan, N. Cuntoor, A. RoyChowdhury, V. Kruger, and R. Chellappa, “Identification of humans using
gait,” IEEE Trans. Image Process., vol. 13, no. 9, pp. 1163–1173, Sep.
2004.
[19] Q. Shi, L. Cheng, L. Wang, and A. Smola, “Discriminative human action segmentation and recognition using SMMs,” Int. J. Comput. Vis.,
vol. 93, no. 1, pp. 22–32, 2010.
[20] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, “Conditional
models for contextual human motion recognition,” in Proc. IEEE Int.
Conf. Comput. Vis., 2005, pp. 1808–1815.
[21] L. Wang and D. Suter, “Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[22] J. Kärkkäinen, G. Navarro, and E. Ukkonen, “Approximate string
matching on Ziv-Lempel compressed text,” J. Discrete Algorithms,
vol. 1, no. 3–4, pp. 313–338, 2003.
[23] K. Fredriksson and G. Navarro, “Average-optimal single and multiple
approximate string matching,” J. Exp. Algorithmics, vol. 9, pp. 1.4–1.4,
2004.
[24] S. Deorowicz, “Speeding up transposition-invariant string matching,”
Inf. Process. Lett., vol. 100, no. 1, pp. 14–20, 2006.
[25] T. N. D. Huynh, W.-K. Hon, T.-W. Lam, and W.-K. Sung, “Approximate string matching using compressed suffix arrays,” Theor. Comput.
Sci., vol. 352, no. 1, pp. 240–249, 2006.
[26] C. du Mouza, P. Rigaux, and M. Scholl, “Parameterized pattern
queries,” Data Knowl. Eng., vol. 63, no. 2, pp. 433–456, 2007.
[27] S. Grabowski and K. Fredriksson, "Bit-parallel string matching under Hamming distance in $O(n \lceil m/w \rceil)$ worst case time," Inf. Process. Lett., vol. 105, no. 5, pp. 182–187, 2008.
[28] X. Yan, F. Zhu, P. S. Yu, and J. Han, “Feature-based similarity search
in graph structures,” ACM Trans. Database Syst., vol. 31, no. 4, pp.
1418–1453, 2006.
[29] C.-F. Cheung, J. X. Yu, and H. Lu, “Constructing suffix tree for gigabyte sequences with megabyte memory,” IEEE Trans. Knowl. Data
Eng., vol. 17, no. 1, pp. 90–105, Jan. 2005.
[30] H. Lee, R. T. Ng, and K. Shim, “Extending q-grams to estimate selectivity of string matching with low edit distance,” in Proc. 33rd Int.
Conf. Very Large Data Bases, 2007, pp. 195–206.
[31] M. Kurucz, A. A. Benczúr, T. Kiss, I. Nagy, A. Szabó, and B. Torma,
“KDD cup 2007 task 1 winner report,” SIGKDD Explor. Newsl., vol.
9, no. 2, pp. 53–56, 2007.
[32] S. Mihov and K. U. Schulz, “Fast approximate search in large dictionaries,” Comput. Linguist., vol. 30, no. 4, pp. 451–477, 2004.
[33] F. Mandreoli, R. Martoglia, and P. Tiberio, “Extra: A system for example-based translation assistance,” Mach. Translation, vol. 20, no. 3,
pp. 167–197, 2006.
[34] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition.
Upper Saddle River, NJ: Prentice-Hall, 1993.
[35] E. J. Keogh and M. J. Pazzani, “Scaling up dynamic time warping to
massive dataset,” in PKDD ’99: Proc. 3rd Eur. Conf. Principles of Data
Mining and Knowledge Discovery, London, U.K., 1999, pp. 1–11.
[36] X. Ge and P. Smyth, “Deformable Markov model templates for timeseries pattern matching,” in KDD ’00: Proc. 6th ACM SIGKDD Int.
Knowl. Discovery Data Mining, 2000, pp. 81–90.
[37] T. Oates, L. Firoiu, and P. R. Cohen, “Using dynamic time warping
to bootstrap HMM-based clustering of time series,” in Sequence
Learning—Paradigms, Algorithms, and Applications. London,
U.K.: Springer-Verlag, 2001, pp. 35–52.
[38] Z. Bar-Joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon, "Continuous representations of time series gene expression data," J. Computat. Biol., vol. 10, no. 3–4, 2003.
[39] J. Yang, W. Wang, and P. S. Yu, “Mining surprising periodic patterns,”
Data Min. Knowl. Discov., vol. 9, no. 2, pp. 189–216, 2004.
[40] M. Vlachos, G. Kollios, and D. Gunopulos, “Elastic translation invariant matching of trajectories,” Mach. Learn., vol. 58, no. 2–3, pp.
301–334, 2005.
[41] A. Efrat, Q. Fan, and S. Venkatasubramanian, “Curve matching, time
warping, and light fields: New algorithms for computing similarity between curves,” J. Math. Imaging Vis., vol. 27, no. 3, pp. 203–216, 2007.
[42] R. K. Bajcsy and C. Broit, “Matching of deformed images,” in Proc.
ICPR, 1982, pp. 351–353.
[43] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape
models: Their training and application,” Comput. Vis. Image Underst.,
vol. 61, no. 1, pp. 38–59, 1995.
[44] P. Viola and M. Jones, “Robust real-time object detection,” Int. J.
Comput. Vis., 2001.
[45] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal
Process., vol. ASSP-26, no. 1, pp. 43–49, Jan. 1978.
[46] D. Knuth, J. Morris, and V. Pratt, “Fast pattern matching in strings,”
SIAM J. Computing, vol. 6, no. 2, pp. 323–350, Jun. 1977.
[47] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object
recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[48] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to
object matching in videos,” in Proc. Int. Conf. Comput. Vis., 2003, vol.
2, pp. 1470–1477.
[49] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in Proc. 2nd Joint IEEE Int.
Workshop Visual Surveill. Performance Eval. Tracking Surveill., 2005,
pp. 65–72.
[50] L. Cheng and M. Gong, “Realtime background subtraction from dynamic scenes,” in Proc. ICCV, 2009, pp. 1–8.
[51] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[52] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A
local SVM approach,” in Proc. Int. Conf. Pattern Recognit., 2004, pp.
32–36.
[53] S. Nowozin, G. Bakir, and K. Tsuda, “Discriminative subsequence
mining for action classification,” in Proc. Int. Conf. Comput. Vis.,
2007, pp. 1919–1923.
[54] Y. Ke, R. Sukthankar, and M. Hebert, “Efficient visual event detection
using volumetric features,” in Proc. Int. Conf. Comput. Vis., Oct. 2005,
vol. 1, pp. 166–173.
[55] S. Wong, T. Kim, and R. Cipolla, “Learning motion categories using
both semantic and structural information,” in Proc. IEEE Conf. CVPR,
2007, pp. 1–6.
[56] R. Gross and J. Shi, The CMU Motion of Body (MoBo) Database
Robotics Inst. Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep.
CMU-RI-TR-01-18, 2001.
Li Wang received the M.S. and Ph.D. degrees from the Institute of Automation, Southeast University, Nanjing, China, in 2005 and 2009, respectively.
He is currently a Lecturer with Nanjing Forestry University, Nanjing, China. His research
interests include human action recognition, human
detection and tracking, as well as machine learning
on computer vision.
Li Cheng (M’04) received the Ph.D. degree in
computer science from the University of Alberta,
AB, Canada.
He is a Research Scientist with the Bioinformatics
Institute (BII), A*STAR, Singapore. Prior to joining
BII in July 2010, he was with the Statistical Machine
Learning group of NICTA, Australia, TTI-Chicago,
IL, and the University of Alberta. His research expertise is mainly on machine learning and computer
vision.
Liang Wang received the B.Eng. and M.Eng.
degrees from Anhui University in 1997 and 2000,
respectively, and the Ph.D. degree from the Institute
of Automation, Chinese Academy of Sciences
(CAS), Beijing, China, in 2004.
From 2004 to 2010, he was a Research Assistant
with Imperial College London, London, U.K., and
Monash University, Australia, a Research Fellow
with the University of Melbourne, Australia, and
a Lecturer with the University of Bath, U.K., respectively. Currently, he is a Professor of Hundred
Talents Program at the National Lab of Pattern Recognition, Institute of
Automation, Chinese Academy of Sciences, Beijing, China. His major research
interests include machine learning, pattern recognition, and computer vision.
He has widely published in highly ranked international journals and leading
international conferences. He is an associate editor for the International Journal of Image and Graphics, Signal Processing, Neurocomputing, and the International Journal of Cognitive Biometrics.
Prof. Wang is a member of BMVA. He was the recipient of the Special
Prize of the Presidential Scholarship of Chinese Academy of Sciences. He
is an associate editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN AND
CYBERNETICS—PART B. He has been a guest editor for four special issues, a
coeditor of five edited books, and a cochair of six international workshops.