IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 6, JUNE 2011

Elastic Sequence Correlation for Human Action Analysis

Li Wang, Li Cheng, Member, IEEE, and Liang Wang

Abstract—This paper addresses the problem of automatically analyzing and understanding human actions from video footage. An "action correlation" framework, elastic sequence correlation (ESC), is proposed to identify action subsequences from a database of (possibly long) video sequences that are similar to a given query video action clip. In particular, we show that two well-known algorithms, namely approximate pattern matching in the computer and information sciences and the dynamic time warping (DTW) method in signal processing, are special cases of our ESC framework. The proposed framework is applied to two important real-world applications: action pattern retrieval, as well as action segmentation and recognition, where, on average, its run-time speed (in MATLAB) is about 3.3 frames per second. In addition, compared with state-of-the-art algorithms on a number of challenging data sets, our approach is demonstrated to perform competitively.

Index Terms—Action correlation, action pattern retrieval, action recognition, approximate pattern matching, dynamic time warping, edit distance.

I. INTRODUCTION

With the ubiquitous presence of video data in everyday life, it has become increasingly demanding to automatically analyze and understand human actions from large amounts of video footage, a need strongly driven by a wide range of applications including automatic visual surveillance, smart human–machine interfaces, sports event interpretation, and video browsing and retrieval. In this paper, we consider the tasks of video action analysis and understanding from an "action correlation" viewpoint. A central question is: given a query action sequence Q of length m and a database video sequence D of length n (generally, n is much larger than m), identify the locations where Q matches a subsequence of D with bounded correlation cost.
As will be shown in greater detail later, this question is formulated as a minimization problem [cf. (1)], which naturally gives rise to a dynamic programming (DP) formula [cf. (3)], the cornerstone of the proposed elastic sequence correlation (ESC) algorithm (i.e., Algorithm 1). In particular, we show that our framework includes as special cases two widely used algorithms: dynamic time warping (DTW) [1] from the signal processing community, as well as approximate pattern matching (string edit, or Levenshtein, distance) [2], [3] from the computer and information sciences community. These connections directly give us access to a number of dedicated techniques developed over the years in either community, which conveniently lead to possible variants of ESC for specific scenarios; as examples, we present two such variants in this paper. The proposed ESC algorithm is rather flexible in terms of accommodating either local or global feature representations. In this paper, we focus particularly on local feature representations that aim to capture the locally salient aspects of image and video gradients for representing image and video context.

Manuscript received September 21, 2009; revised October 06, 2010; accepted December 14, 2010. Date of publication December 23, 2010; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Miles Wernick. L. Wang is with the Department of Computing Science, Nanjing Forestry University, 210037 Nanjing, China. L. Cheng is with the Bioinformatics Institute, A*STAR, Singapore. L. Wang is with the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, 100190 Beijing, China. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2010.2102043
This choice is motivated by a recent neuro-psychological finding [4] that the visual and motor cortices of the human perception system are more responsible than the semantic ones for the retrieval of visual patterns. The proposed approach is further examined in two practical applications: action pattern retrieval, as well as action segmentation and recognition, which are often addressed by methods including probabilistic models [e.g., the hidden Markov model (HMM)] [5], [6] and support vector machines (SVMs) [7], [8]. As demonstrated later on a number of challenging data sets (cf. Figs. 2, 5, and 7), our approach performs competitively compared with state-of-the-art methods.

Action Retrieval: As shown in Fig. 1, this application studies the retrieval of action subsequences that are similar to the query clip. This is in practice a major technical challenge for the emerging industry of content-based video retrieval from internet sources (e.g., Google and Yahoo video search, VideoSurf,1 Blinkx,2 CastTV,3 Pixsy,4 and Viewdle5). A variety of spatiotemporal interest points, such as [9]–[13], have been devised and utilized in action video retrieval. In addition, DeMenthon and Doermann [14] propose to work with 3-D spatiotemporal volumes of pixels. The work of Laptev and Perez [7] adopts a learning-based approach aiming to retrieve a specific action type ("drinking") from film clips. Interested readers may refer to the work of Moeslund et al. [15] or Poppe [16] for a detailed survey of research developments in this field.

Action Segmentation & Recognition: There is a vast and growing literature on this topic, so we restrict our description to a few works that we feel are most relevant or representative. Established methods for modeling and analyzing human

1[Online]. Available: http://www.videosurf.com
2[Online]. Available: http://www.blinkx.com
3[Online]. Available: http://www.casttv.com
4[Online]. Available: http://www.pixsy.com
5[Online]. Available: http://www.viewdle.com

Fig. 1. An example illustrating the retrieval of a set of hairpin net shot actions from a badminton playing video sequence, given one query clip.

actions are predominantly generative statistical methods, especially Markov models [5], [6], [17], [18], e.g., HMMs and variants such as coupled HMMs [5], [6]. Recently, discriminative learning schemes have also been extended to structured outputs [e.g., the support vector machine with semi-Markov model output space (SVM-SMM) of [19] and conditional random fields (CRFs) [20], [21]], and encouraging results have been obtained for action segmentation and recognition. In this paper, we reduce the inference problem to matching a query against an existing set of annotated databases, a matching problem nicely solved by the proposed dynamic programming component, which is the cornerstone of our ESC framework. While conceptually simple, our method is suggested by empirical experiments to perform competitively compared with state-of-the-art methods that often rely on training sophisticated parametric models.

Our Contribution: The major contributions of this paper are as follows. 1) A new correlation-based framework (ESC) is proposed for action sequence analysis, which bears close connections to the established work on DTW and approximate pattern matching. By exploiting existing dedicated techniques developed for either DTW or edit distance, two ESC variants are further developed to address specific scenarios. 2) We examine ESC in two important real-world applications and conduct extensive experiments where ESC is shown to perform competitively.

As mentioned above, the ESC algorithm is closely related to approximate pattern matching methods (e.g., [3], [22]–[33]), where the string edit (or Levenshtein) distance is predominantly used.
Although these methods have demonstrated sound robustness against observation noise [3], they are designed for combinatorial pattern matching over a finite alphabet, which unfortunately precludes measuring distances between real-valued feature vectors (an uncountable alphabet) that are commonly used by current action analysis methods and that are readily handled by our ESC algorithm. Our algorithm also bears a strong connection to dynamic time warping (DTW), e.g., [1], [34]–[41], and in particular to elastic matching, e.g., [42], and deformable template, e.g., [43], methods in computer vision. In video action analysis, a number of recent works [10]–[12] are also related to this line of DTW approaches. However, DTW is known to be sensitive to noise (e.g., large feature deviations in a small cluster of frames), which often leads to extra miss or false alarm cases [1]. By utilizing the matching techniques developed in approximate pattern matching, our algorithm is empirically shown (cf. Figs. 2 and 6) to be rather robust to such noise. Moreover, our ESC framework exploits more of the temporal perspective for an

Fig. 2. Three queries in a synthetic data set. In each query, the left of the red (online version) vertical bar shows the query sequence, and the right is the database sequence. Each sequence is a 1-D time series line drawing proceeding from left to right: at each time step, a 2-D point is observed and connected to the existing sequence. The retrieved subsequences are highlighted in green. More precisely, (a) illustrates the effectiveness of ESC when the subsequences vary slightly from the query template, (b) demonstrates the efficiency of adopting ESC-S (29% speed-up) while maintaining the same accuracy, and (c) presents an example of different scales. See text for more details. (a) Robust to noise. (b) 29% speed-up with ESC-S. (c) Invariant to scales.
action sequence, and this greatly differentiates ESC from the approaches of [10]–[12], which emphasize a particular feature design of either spatiotemporal interest points or space–time volumes.

II. ESC FRAMEWORK

We present here the main idea underlying the proposed ESC framework, as well as its variants addressing more specific issues such as varying temporal scale-spaces and speed-up with nonrepetitive patterns. This is followed by a formal analysis of the related algorithmic properties: its time and space complexities, as well as the correlation probability that characterizes how many subsequences would be identified using a query action pattern.

A. Main Idea

The main idea of ESC is to formulate the possible correlations of the query and the database actions into a correlation matrix, and the key ingredient is the utilization of a dynamic programming procedure to identify the optimal solution efficiently from the search space defined by this correlation matrix. More formally, assume that each video sequence is captured by sampling a stream of frames in the time domain under a certain sample rate. Assume that the query sequence Q of length m and the database sequence D of length n are obtained using the same sample rate, and let a band radius upper-bound the number of consecutive frames to which one frame from the opposite sequence could be matched. Denote by C(Q, D) the correlation cost of matching sequences Q and D, and let δ6 be an upper bound on the correlation cost. The problem now becomes finding all positions j of D such that there exists a suffix of D(1..j) matching Q, i.e.,

C(Q, D(i..j)) ≤ δ for some i ≤ j. (1)

As a consequence, this gives the set of feasible action subsequences for which (1) is satisfied

{D(i..j) : C(Q, D(i..j)) ≤ δ}. (2)

We define a query frame index and a database video index, and denote by (start, end) the pair delimiting the current correlation subsequence, together with a zero-based index into this subsequence.
Now let M be a matrix with the query frames indexing the rows and the database frames indexing the columns, each entry storing the partial correlation cost between the two sequences. We consider three elementary operations, compression, expansion, and substitution, and devise a DP procedure, given in (3), to compute the correlation matrix recursively. This recursion connects directly to approximate pattern matching (or edit/Levenshtein distance), e.g., [2], [3], in the computer science community:
• Under one setting of its constants, (3) becomes the standard form of approximate pattern matching;
• Another setting gives back the familiar DTW formula.
In their standard forms, DTW measures similarities between two sequences whose two ends are known a priori, while the finite-alphabet assumption of approximate pattern matching is not consistent with the feature representations of the action sequences adopted in this paper. These incompatibility issues are addressed in the ESC algorithm (i.e., Algorithm 1), which integrates the proposed DP formula. As presented in Algorithm 1, rather than utilizing the full m × n matrix, we work with a much smaller banded matrix. This saves a significant amount of storage space, since n is usually very large for a long database video, and the only price is that we have to reset the matrix once a new subsequence of D is about to be screened. The local frame tolerance ε judges whether a current pair of frames is sufficiently correlated: a loose tolerance value allows the occurrence of multiple matches (i.e., Algorithm 1 allows mismatches of several pairs of frames during the sequence correlation of Q and D as long as the average cost does not exceed δ); when no single error is allowed, as in Algorithm 2, ε will instead be set to a much tighter value.
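The exact constants of (3) are not preserved in this text, but the shape of the recursion can be illustrated. The following is a minimal Python sketch of a subsequence-matching DP with the three elementary operations, a bounded per-frame cost, and an average-cost threshold; the function name, the uniform treatment of the per-operation cost, and the average-cost normalization are illustrative assumptions rather than the paper's exact formulation.

```python
def esc_correlate(query, db, frame_cost, delta, cap):
    """Return 1-indexed end positions j in db where some subsequence of db
    ending at j correlates with the whole query under average cost <= delta.
    Elementary operations: substitution (diagonal move), compression (an
    extra database frame is absorbed) and expansion (a query frame is
    repeated)."""
    m, n = len(query), len(db)
    INF = float("inf")
    # M[i][j]: best partial cost of matching query[:i] to a db subsequence ending at j.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        M[0][j] = 0.0  # a match may start at any database position
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # bounded local cost: a single noisy frame has only bounded effect
            d = min(frame_cost(query[i - 1], db[j - 1]), cap)
            M[i][j] = min(M[i - 1][j - 1] + d,  # substitution
                          M[i][j - 1] + d,      # compression
                          M[i - 1][j] + d)      # expansion
    # report end positions whose average cost is within the tolerance delta
    return [j for j in range(1, n + 1) if M[m][j] / m <= delta]
```

With delta set to 0 the routine reports only exact occurrences; loosening delta admits subsequences with a few mismatched frames, mirroring the error-tolerant behavior described above.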
Algorithm 1: Elastic Sequence Correlation (ESC). Input: a query action video Q and a database video D. Output: a set of matched subsequences. [The pseudocode initializes the correlation cost matrix, then scans D: at each database position it attempts a new correlation, applies the DP update (3) over the banded matrix for each query frame, and, whenever the accumulated average cost satisfies the bound δ, backtracks to recover the correlation path and reports the corresponding subsequence of D before resetting the matrix.]

In (3), denote by x a frame observation, and let d(·, ·) measure the cost of a pair of frames as induced by the feature representation (we revisit this in greater detail in Section II-D). We further introduce an upper bound on the per-frame cost, so that the correlation cost is robust to perturbation from a few noisy inputs: a single noisy frame has only a bounded effect on the global correlation cost. This upper bound is set proportionally to a constant.7 It follows that the solution to the problem in (1) is exactly those locations where the accumulated cost, normalized by the length of the path from (0, 0) to the current cell, falls within the bound. As an example, Fig. 2(a) illustrates a pedagogical result of this procedure. The proposed DP formula is a generalization of both DTW, e.g., [1], from the signal processing community and approximate pattern matching, e.g., [2], [3], from the computer science community.

6By default, δ is set to 0.18 during the experiments.
7In this paper, the two constants are both fixed to 0.1 by default.

In addition, we exploit the Sakoe–Chiba band [45] of a fixed radius, together with the ideas of filtering [2], [3] from the approximate pattern matching literature, to prune impossible subsequences as quickly as possible before entering the more thorough but computationally demanding DP procedure. We note that the proposed ESC (Algorithm 1) is an error-tolerant algorithm. In practice, there are cases in which we would like to enforce strict matching, where no single error is allowed. ESC can easily be adapted to this particular case; the resulting variant is termed ESC-S (comprising Algorithms 2 and 3).
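The special-case claim above can be made concrete with a small sketch. The following global-alignment form of the same recursion, with an illustrative `dp_distance` helper and a 0/1 substitution cost over a finite alphabet with unit gap costs, reproduces the string edit (Levenshtein) distance; the particular cost functions are assumptions chosen to exhibit the reduction, not the paper's defaults.

```python
def dp_distance(a, b, sub_cost, gap_cost):
    """Global-alignment form of the recursion: substitution on the diagonal,
    compression/expansion as horizontal/vertical moves of cost gap_cost."""
    m, n = len(a), len(b)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap_cost
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(M[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                          M[i - 1][j] + gap_cost,   # expansion
                          M[i][j - 1] + gap_cost)   # compression
    return M[m][n]

# 0/1 substitution cost and unit gaps give the string edit (Levenshtein) distance
lev = lambda s, t: dp_distance(s, t, lambda x, y: 0 if x == y else 1, 1)
```

Conversely, letting the horizontal and vertical moves carry the local frame cost d instead of a constant yields the familiar DTW recursion, mirroring the two parameter settings noted above.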
For a query frame index, the table returns the corresponding index in the database video of the first matched frame. The efficiency of ESC-S is noticeably improved by introducing the partial correlation table built in Algorithm 3, motivated by the Knuth–Morris–Pratt (KMP) algorithm [46] in pattern matching. The idea is to "prescan" the query sequence itself and compile a list of possible prefix positions, so as to bypass as many impossible frames as possible when applying correlations, while not sacrificing any potential correlations. This trick provides a significant speed-up for a query sequence, especially when it contains nonrepetitive patterns.

B. Variants of ESC

Variants of the proposed ESC algorithm have been derived to address several practical scenarios.

1) Interpolation to Deal With the Scale-Space Issue: One important observation is that the same action type might be performed at distinct speeds by different persons. This is often due to a number of factors: camera hardware and the age, gender, and health status of the subject. For example, an elderly female patient tends to walk much slower than a young healthy man. Scale-space theory has been well developed in the image spatial domain and successfully deployed in, e.g., the object detection problem [44]. Here, it can be naturally extended to address this varying temporal scale-space phenomenon, where the sample rates of the query and the database might be vastly different. Similar to the scale-space search problem in object detection [44], interpolation is used to deal with this issue. More specifically, an iterative interpolation procedure is adopted: the query is interpolated (scaled up or down) by a set of scaling factors, where each scaling factor produces a query sequence of correspondingly scaled length.
In other words, the interpolation from the original query to its rescaled version is achieved by (4), which maps each frame of the rescaled sequence back to the corresponding fractional position of the original sequence.

Algorithm 2: ESC-S Main Algorithm. Input: a query action video Q and a database video D. Output: a set of subsequences. [The pseudocode follows Algorithm 1, but on a frame mismatch it shifts the database index using the partial correlation table (PCT) instead of advancing one frame at a time, and it backtracks the correlation path only for error-free correlations.]

2) Speed-Up Using the Partial Correlation Table: The complexity of a naive algorithm amounts to O(mn) for a query of length m and a database of length n, i.e., linear in the cardinalities of both. This is acceptable for a small database, but not for real-life scenarios where we need to deal with large-scale databases, and it is often costly to compute the entries of the matrix. In this respect, Algorithm 1 provides an efficient solution by adopting a well-known idea from the signal processing community used to speed up DTW, where the search path in the matrix is constrained to lie within a banded matrix, e.g., the Sakoe–Chiba band.

Algorithm 3: Build the Partial Correlation Table (PCT). [The pseudocode prescans the query against itself and records, for each query position, the prefix position to resume from after a mismatch.]

For a restricted class of problems, ESC-S leads to an efficient algorithm in both the average and worst cases. The efficiency comes from the partial correlation table: if there is a mismatch between two frames, rather than beginning the search again from scratch, we move on to the next frame using the precomputed shift. A fundamental question, besides the time and space complexities, is the correlation probability: the probability that any subsequence would be identified by a query action pattern.
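The thresholded table construction of Algorithm 3 is not recoverable from this text, but its KMP ancestry can be sketched. The following builds the classical KMP failure table over a frame-equivalence predicate; the names `build_pct` and `same` are illustrative, and real ESC-S frames would be compared via the tolerance test rather than exact equality.

```python
def build_pct(query, same):
    """KMP-style failure table: pct[i] is the length of the longest proper
    prefix of query[:i+1] that is also a suffix of it, where frame
    equivalence is decided by the predicate `same`."""
    n = len(query)
    pct = [0] * n
    k = 0  # length of the current matched prefix
    for i in range(1, n):
        # fall back through shorter prefixes until one extends, or none does
        while k > 0 and not same(query[i], query[k]):
            k = pct[k - 1]
        if same(query[i], query[k]):
            k += 1
        pct[i] = k
    return pct
```

On a mismatch at query position i, the scan resumes from prefix length pct[i-1] instead of restarting, which is exactly the "shift index using PCT" step of Algorithm 2; for a nonrepetitive query the table is mostly zeros and the shift skips the largest possible stretch of frames.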
Given a query action pattern of length m and a database video sequence of length n, as well as the correlation cost upper bound δ incurred from the three elementary operations (compression, expansion, and substitution), we define the error ratio of the sequence correlation as δ normalized by the query length. In addition, the usage of the local frame tolerance implies a quantization of the action feature space into a finite alphabet of a certain size. Now, consider the probability that a query matches a subsequence of the database with at most this error ratio (or, equivalently, with cost upper-bounded by δ). The following lemma (adapted from [3]) suggests that this correlation probability decreases exponentially as m increases, as long as the error ratio is sufficiently small.

Lemma 1 (Correlation Probability): Assume that the frame features of any video sequence are i.i.d. generated. Then the correlation probability decays exponentially with the query length m. (5)

Clearly, the bound vanishes as m grows, and grows up to 1 as the error ratio grows to 1. Although not a particularly tight bound, this lemma nevertheless indicates the sparseness of the potential matches as m increases. In particular, it clearly suggests that, as the length of the query sequence grows, it is crucial to drop irrelevant subsequences of the database sequence as early as possible, which in turn supports our usage of, e.g., Sakoe–Chiba bands and filtering methods for an efficient solution. Empirically, the average running speed of our approach is about 3.3 frames per second for midsize (320 × 240) videos. This is obtained on a desktop PC with AMD Athlon 2.6-GHz dual CPUs and 2-GB memory. As our current implementation is in MATLAB, a more efficient implementation in a language such as C++ would lead to further improvement of the run-time performance.

C.
Algorithmic Analysis

The space complexity is linear in the query length for both ESC (Algorithm 1) and ESC-S (Algorithms 2 and 3 together),8 by exploiting the Sakoe–Chiba banded matrix of fixed radius and the filtering tricks. The time complexity of ESC is bounded by the product of the query and database lengths, with the accumulative matching errors determined by the tolerance parameters and the band radius.

8In fact, the exact expressions carry additional constant factors for ESC and ESC-S; both are of the same order in practice.

D. Feature Representation and Distance Measure

Neuro-psychological findings such as [4] suggest that the visual and motor cortices of the human perception system are more responsible than the semantic ones for the retrieval and recognition of visual action patterns. This motivates us to represent action features by a set of key points that capture the salient aspects of spatial and temporal video gradients. It is worth mentioning that, although we focus on local feature representations in this paper, the proposed ESC framework is also able to accommodate global features. In what follows, we consider in particular the idea of a point-set frame representation and matching between pairs of frames, achieved by adapting the shape context (SC) features of Belongie et al. [47] to our context, as follows. Without loss of generality, we assume that there exists exactly one object in any given frame. An object can be sufficiently represented by a set of key points of the object.9 The proposed measuring scheme, which is illustrated in Fig. 3, consists of three steps.

Step 1) Key point correspondence: Given two objects, solve the point-set correspondence problem by finding an optimal one-to-one match of points.

Step 2) Key point alignment: To align the correspondences found so far, estimate an alignment transformation such that each point from one object is spatially aligned with its corresponding point from the other.
Step 3) Key point distance: Compute the key point distance as the sum of the effort spent in the key point alignment (the bending energy) plus the remaining key point correspondence cost after aligning the two key point sets.

This scheme can be described as follows. For each point, a key point descriptor is defined using its shape context descriptor, a local log-polar histogram that compactly encodes the spatial configuration of the remaining points. The χ² test statistic is then used to determine the cost of matching two given points, one from each object. This cost function enables a direct computation of the set of costs between all pairs of points. The key point correspondence problem is thus formulated and solved as standard bipartite graph matching by minimizing the total matching cost over permutations (6).

The key point alignment transformation is obtained by solving a dedicated linear system. The thin plate spline (TPS), which generalizes the cubic spline to two dimensions, is used to recover the transformation between the point sets. The bending energy (as in, e.g., [47]) measures the magnitude of this transformation. We omit a detailed explanation of the alignment and bending energy, and instead present an illustrative example in Fig. 3. After applying the alignment transformation to the point set of one object, the remaining key point correspondence cost is defined as the total cost over the aligned matched pairs (7). The key point distance between the two objects is thus a weighted sum of the two costs (8).

9As shown in Fig. 3, the object is represented as a set of two-dimensional points, where each point is encoded as a local histogram. This is essentially a nonvectorial representation of the object, since the points do not necessarily respect an order.

Fig. 3. Default local feature representation used in, e.g., the WBD data set. The left panel shows two objects, which are then represented as the point sets around the objects shown in the middle panels.
In each point set, a point is further encoded as a local log-polar histogram over radius and angle bins, shown in the top right panel. The points in different silhouettes are not synchronized until the key point correspondences are found (shown in the bottom right panel).

In (8), a constant trades off between the two costs and is fixed to 1 in this paper. Thus far, a limitation of this distance measure is that it only works on pairs of frames, while it is desirable to incorporate the temporal information that characterizes the local motion flow. Directly computing the motion flow, however, is often computationally demanding. Instead, we adopt a simple temporal sliding window method as follows. For each frame, take a window of frames centered around it, and similarly for the corresponding frame in the other sequence. Two indices track the current frames within the two windows and are synchronized such that they always point to the same location of the window. Now, introduce a distance function as a convex combination of the frame costs over the temporal windows (9), where the weights are nonnegative and sum to one. In practice, we use a window of seven frames and fix the weight vector to be all 1/7. The main differences of our feature representation compared with that of, e.g., Video Google [48] are: 1) we use the shape context feature as a robust representation and 2) a temporal local window is utilized to incorporate spatiotemporal context during the local key point measurement. We note in passing that, while a specific local feature representation, the shape context feature [47], is adopted here as the default feature scheme, ESC is flexible and can work with other feature representations. This is demonstrated later in one set of empirical experiments where a different spatiotemporal feature representation [49] is used.
III. APPLICATION I: ACTION RETRIEVAL

Here, we concentrate on the application of our ESC framework to action pattern retrieval, which has received increasing attention in recent years from both the multimedia and computer vision communities. This application can be formally characterized as follows. Given a query action video Q and a database containing multiple videos, the task of action pattern retrieval is to retrieve, over the whole database, the set of action subsequences that are similar to Q (10). This is a major technical issue for the emerging industry of video retrieval from internet sources, as mentioned previously. It is clear from (10) that this application is intimately related to the key problem we consider in this paper, as evidenced by (1) and (2). In light of this observation, we adopt a straightforward methodology: the proposed ESC framework is applied to each database video clip, and the desired results are naturally obtained as the union of all retrieved subsequences.

A. Experimental Results and Analysis

Our approach is first evaluated on a synthetic data set to demonstrate the applicability of ESC as well as its two variants. We further conduct experiments on two representative real-world data sets, where ESC is compared with its two special cases, DTW and approximate pattern matching, each obtained by fixing the corresponding constants of (3). During these experiments the tolerance parameters are set empirically.

1) Synthetic Data Set: As displayed in Fig. 2, a synthetic data set is generated and three experiments are conducted. For each experiment, the left side is the query sequence, while the database sequence is on the right. Each sequence consists of a 1-D time series line drawing proceeding from left to right: at each time step, a 2-D point is produced and connected to the existing sequence. To illustrate this, we present at the bottom of each subplot in Fig. 2 a flattened one-dimensional time series line with a color code from dark purple to light green to indicate the transition from start to end.
For the database sequence, we use black as the base color and, when a subsequence is matched with the query, use the above-mentioned color code to indicate the element-by-element matches. The first experiment (query 1) aims to show that ESC is error tolerant: three subsequences from the database sequence are identified as well correlated with the query pattern subject to the cost bound of (1). Moreover, the number of retrieved subsequences gracefully reduces to one (i.e., the middle subsequence) as we decrease the tolerance value. In scenarios where we would like to retrieve without significant errors, ESC-S is shown to deliver the same results with a 29% time reduction, as demonstrated in query 2. In the third experiment, although query 3 is presented at a scale entirely different from that of the database sequence, our algorithm still works well after adopting the iterative interpolation method of (4). The tolerance values used in these three queries are 0.7, 0.3, and 0.3, respectively. In addition, the robustness of the retrieval performance is demonstrated in Fig. 4, following the experimental protocol of Fig. 2(a): the detection rate does not change significantly when we vary the tolerance parameter between 0.0 and 1.0.

Fig. 4. Demonstration of the robustness of the performance when varying the tolerance parameter. This experiment is conducted on the synthetic data set using the same query as in Fig. 2(a).

2) Beach Data Set: The beach data set contains one database video of 456 frames of size 360 × 180 taken from beach scenes. Some sample frames are displayed in Fig. 5. One "walking" query of six frames is shown in the top-left panel of Fig. 5. The highlighted objects displayed in Fig. 5 are from the retrieved subsequences identified as similar to the query.
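The iterative interpolation of Section II-B, applied above to query 3, amounts to resampling the query along the time axis by each scaling factor. A minimal sketch for scalar-valued frames follows; the function name `rescale` and the choice of plain linear interpolation are illustrative assumptions, since the exact form of (4) is not preserved in this text.

```python
def rescale(seq, s):
    """Linearly resample a sequence of scalar frames to length round(s * len(seq))."""
    m = len(seq)
    new_len = max(1, round(s * m))
    if new_len == 1:
        return [seq[0]]
    out = []
    for i in range(new_len):
        # position of the i-th new frame on the original time axis
        t = i * (m - 1) / (new_len - 1)
        lo = int(t)
        hi = min(lo + 1, m - 1)
        frac = t - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out
```

Running ESC once per scaling factor on the rescaled query then covers queries recorded at sample rates very different from the database's.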
As a preprocessing step for the beach and badminton data sets, the foreground objects are segmented from the dynamic backgrounds using an efficient background subtraction technique [50], and the local feature points are restricted to lie inside or sufficiently close to the foreground objects. Based on these local features, the shape context features of the object of interest are obtained. The quantitative ROC curve is presented in Fig. 6(a): compared with the two special cases, DTW and approximate pattern matching, the proposed approach achieves better performance by allowing the parameters to be tuned to dedicated values (here the tolerance is set to 0.9). Since repetitive walking behavior can be continuously identified using our approach, the length of the retrieved subsequences may vary significantly; in this data set it ranges from 3 to 58 frames.

3) Badminton Data Set: We build a badminton data set that contains a sequence of 9218 frames collected from a badminton match, where the size of each image frame is 360 × 288. Three query sequences are created by manually picking three short subsequences from this database sequence, including one hairpin net shot action as well as two smash actions. Fig. 7 displays nine sample images from this data set: three query actions are given in the first row, where the temporal action pattern of each query is overlaid onto one image; the second row displays three random images and three exemplar retrieved frames (each from one retrieved action subsequence). Note that the human foregrounds are highlighted with three distinct colors to illustrate to which query sequence each frame corresponds. As an example, we present in Fig. 1 a retrieval flowchart for one short query clip of a hairpin net shot action. As presented in the ROC curve of Fig. 6(b), DTW and approximate pattern matching give comparable results, while both are inferior to our ESC algorithm.
Fig. 5. Gallery of the beach data set. Left: the query walking pattern, where the temporal walking behaviors are overlaid onto this image. The rest are example frames from the data set, where the highlighted objects are from the frames identified as matching the query in their respective subsequences. Fig. 6. ROC curves of the two data sets. (a) Beach. (b) Badminton. Fig. 7. Sample frames of the badminton data set. The top row presents three query sequences, where the action pattern of each query is overlaid onto one image. The bottom row displays randomly selected frames and three exemplar retrieved frames, each from one retrieved subsequence. During these two experiments, the threshold is set to 0.4 to allow several matching errors. IV. APPLICATION II: ACTION RECOGNITION Action recognition is one of the key topics in action analysis and understanding and has a wide range of promising applications such as visual surveillance, video event analysis, and intelligent interfaces. Here, we focus on elementary actions such as running, walking, and drawing on a board. In particular, we consider three specific scenarios: 1) action recognition without segmentation; 2) joint action recognition and action-cycle segmentation; and 3) jointly segmenting and identifying actions from a video sequence in which one person performs a sequence of continuous different actions. A. Experimental Results and Analysis In what follows, we examine the three scenarios by conducting experiments on four standard data sets, where the proposed ESC algorithm is augmented with simple 1NN (nearest-neighbor) strategies to recognize (and segment when necessary) action subsequences. During these experiments we empirically set the threshold for the proposed approach and compare its performance to those of the state-of-the-art algorithms. Fig. 8.
Sample frames of one person engaging in six types of actions in the KTH data set. Fig. 9. Cuboid features on the KTH data set. 1) Action Recognition (Single Action Per Sequence): We consider a relatively simple action recognition scenario where we have a training database of action sequences containing multiple classes of actions, and each action sequence possesses exactly one action. An unseen test action sequence is then categorized into one of the action classes by querying through each of the training sequences, followed by a simple 1NN classifier to identify the best (sub)sequence correlation. TABLE I COMPARISONS OF ACTION RECOGNITION RATES ON KTH DATA SET To demonstrate the flexibility of the proposed framework, a different local feature representation (i.e., the “cuboid” representation of [49]) is adopted here, which is essentially an extension of the SIFT descriptor [51] to the spatiotemporal domain. More specifically, this detector is tuned to fire whenever variations in local image intensities contain distinguishing spatiotemporal characteristics. At each detected interest point location, a 3-D cuboid is then extracted and represented as a flattened vector that contains the spatiotemporal windowed information, including normalized pixel values, brightness gradients, and windowed optical flow. In [49], a codebook representation is constructed using k-means clustering, similar to the visual vocabulary approach of [48] and [8], and each cuboid is further projected into this codebook space as a codeword. To fit into our framework, each action frame is now represented as the set of 3-D cuboid codewords intersecting the current frame, and the distance between two frames is computed as the sum of Euclidean distances between the two sets of cuboids, where the correspondence between two cuboids (each from one frame) is made using 1NN. We then still use (7) and (9) to compute the distance between pairs of frames.
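The frame-to-frame distance just described (a sum of Euclidean distances under a 1NN cuboid correspondence) can be sketched as follows; the function name and the array layout (one descriptor per row) are illustrative assumptions, not the authors' code:

```python
import numpy as np

def frame_distance(cuboids_a, cuboids_b):
    """Hypothetical sketch of the set-to-set frame distance: for each
    cuboid descriptor in frame a, find its nearest neighbor in frame b
    by Euclidean distance, and sum these 1NN distances.
    Inputs are 2-D arrays with one cuboid descriptor per row."""
    if len(cuboids_a) == 0 or len(cuboids_b) == 0:
        # degenerate frames: identical if both empty, else incomparable
        return 0.0 if len(cuboids_a) == len(cuboids_b) else float("inf")
    # pairwise Euclidean distance matrix of shape (|a|, |b|)
    diff = cuboids_a[:, None, :] - cuboids_b[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # 1NN correspondence: each cuboid in a picks its closest match in b
    return float(d.min(axis=1).sum())
```

Note this quantity is not symmetric in its arguments; a symmetric variant would average the two directions.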
The first data set is the KTH data set used in [52]. There are 25 individuals engaged in six actions: running, walking, jogging, boxing, handclapping, and handwaving, under four different environment conditions. Together this amounts to 600 action video clips. Fig. 8 displays example frames of one person performing each of the six actions. We used the training, validation, and testing splits proposed in [52]. The cuboid representations are illustrated in Fig. 9. Fig. 10(a) reports the confusion matrix of the existing Dollar et al. approach [49] on the left and that of ESC on the right. Overall, the results of ESC improve over those of Dollar et al.: for example, 40% of the time handclapping is wrongly labeled as boxing in Dollar et al., which is significantly improved in ESC, where the same kind of error now occurs only 15% of the time. The average recognition accuracy of ESC across the six types of actions is 0.86, which outperforms the 0.81 of Dollar et al. [49]. Furthermore, Table I presents a list of state-of-the-art methods in terms of recognition accuracy on the KTH data set. Our method is shown to perform favorably compared to methods that also utilize “cuboid” features (e.g., [49] and the SVM of [53]) reported in the literature. (Footnote 10: The experiments are performed using the publicly available implementation of Dollar et al. [49]. [Online]. Available: http://vision.ucsd.edu/~pdollar/research/cuboids_code/cuboids_Apr19_2006.zip.) Fig. 10. Confusion matrices of the two data sets in (a) and (b), respectively. For each data set, left is the result of Dollar et al. and right is ESC. (a) KTH. (b) Facial expression. The second is a facial expression data set compiled by [49]. We use a subset of this data set that contains two individuals, each expressing six different emotions under the same illumination. The expressions are anger, disgust, fear, joy, sadness, and surprise, as illustrated in Fig.
11, where expressions such as joy and sadness are quite distinct, while others (e.g., anger and disgust) are very similar. Each individual repeats each of the six expressions eight times, which gives a total of 96 video clips. Each sequence contains about 120 frames of size 152 × 194, where the subject always starts with a neutral expression, expresses an emotion, and then returns to neutral. Fig. 10(b) reports the result of the approach of Dollar et al. [49] (left) and that of ESC (right), both averaged over five repetitions. Again, ESC significantly outperforms Dollar et al. on this data set. For example, anger is entirely confused with disgust (63% of the time) and joy (38% of the time) in Dollar et al., whereas it is correctly labeled by ESC. A similar improvement is observed for surprise as well. Overall, the average recognition accuracy of ESC across the six types of actions is 0.83, which outperforms the 0.62 of Dollar et al. 2) Action Segmentation and Recognition (Multiple Repeated Actions Per Sequence): This is a different scenario: each database video sequence contains exactly one type of action and is assumed to be partitionable into action cycles. Fig. 11. Sample frames of one person expressing six types of emotions in the facial expression data set. The task now becomes segmenting the unseen test sequence into action cycles and at the same time recognizing the action type of the entire test sequence. To meet this goal, we use each action cycle from the training database sequences as a query pattern and exploit our ESC algorithm as follows. For each feasible action type, a simple 1NN classification is employed to retrieve the optimal set of correlated action cycles, which gives a segmentation of the test sequence as well as the average correlation cost (i.e., the average cost per segment). The final action type is then identified as the one with the least average correlation cost.
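The cycle-level decision rule above can be sketched as follows, assuming a hypothetical mapping from each candidate action label to the list of per-segment correlation costs returned by the retrieval step:

```python
def classify_by_avg_cost(costs_per_class):
    """Sketch of the 1NN decision rule described in the text: pick the
    action type whose retrieved segmentation has the least average
    correlation cost (total cost divided by the number of segments).
    `costs_per_class` maps an action label to a list of per-segment
    costs (a hypothetical structure for illustration)."""
    best_label, best_avg = None, float("inf")
    for label, seg_costs in costs_per_class.items():
        if not seg_costs:
            continue                       # no cycles retrieved for this label
        avg = sum(seg_costs) / len(seg_costs)
        if avg < best_avg:
            best_label, best_avg = label, avg
    return best_label
```

Averaging over segments, rather than summing, keeps the criterion comparable across labels whose cycles segment the test sequence into different numbers of pieces.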
We conduct experiments on the CMU MoBo data set [56], which contains 24 individuals walking on a treadmill. (The data set originally consisted of 25 subjects; we drop the last person since we had problems obtaining the sequences of this individual walking with a ball.) As illustrated in Fig. 12, each subject performs one of four different actions in a video clip: slow walk, fast walk, incline walk, and slow walk with a ball. Each clip has been preprocessed to contain several cycles of a single action. Following [8], the boundary positions of these cycles are manually labeled as ground truth, and we evaluate the action recognition and segmentation performance separately, as the recognition rate and the F-score, respectively. The SIFT codebook feature representation of [8] is incorporated into ESC to facilitate a direct comparison. The F-score, given by F = 2 · precision · recall / (precision + recall), is adopted to measure segmentation performance. Fig. 12. Sample frames of subjects each performing one of the four actions: slow walk, fast walk, incline walk, and walk with a ball, in an action sequence of the CMU MoBo data set. The top two rows of Table II list the recognition and segmentation results of the comparison algorithms besides ESC, namely 1NN, SVM, as well as the recent SVM-HMM and SVM-SMM [8]. Overall, ESC clearly outperforms these comparison algorithms. This is to be expected, as the cyclic motion patterns are better preserved by the ESC algorithm. 3) Action Segmentation and Recognition (Multiple Different Actions Per Sequence): In this case, each database sequence contains multiple actions, and we would like to partition an unseen test sequence into action segments as well as identify their corresponding action labels. Similar to the previous scenario, Algorithm 1 is used, where each action segment from the training database sequences is used as a query pattern.
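For reference, the F-score combining boundary precision and recall can be computed with a small helper; the counts-based interface here is an illustrative assumption:

```python
def f_score(num_correct, num_detected, num_true):
    """F = 2PR/(P+R), where precision P is the fraction of detected
    boundaries that are correct and recall R is the fraction of
    ground-truth boundaries that are detected.  Hypothetical helper
    for the segmentation evaluation described in the text."""
    if num_detected == 0 or num_true == 0:
        return 0.0
    p = num_correct / num_detected   # precision
    r = num_correct / num_true       # recall
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

In practice a detected boundary is usually counted as correct when it falls within a small tolerance window of a ground-truth boundary.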
A 1NN classification scheme is then used iteratively to retrieve the optimal set of action segments, subject to the constraints that these segments are mostly nonoverlapping and that their union covers most of the entire action sequence. As before, optimality is determined in the sense of the least correlation cost. The Walk-Bend-Draw (WBD) data set from [8] is an indoor video data set that contains three subjects, each performing six action sequences at 30 frames per second; each sequence consists of three continuous actions, slow walk, bend body, and draw on board, and on average each action lasts about 2.5 s. TABLE II COMPARISONS OF PERFORMANCE ON THE MOBO AND THE WBD DATA SETS Fig. 13. Sample frames of three subjects each engaging in a continuous sequence of three actions: walk, bend, and draw, in the WBD data set. We subsample each sequence to obtain 30 key frames and manually label the ground truth actions. Figs. 3 and 13 present various sample frames; in particular, Fig. 13 shows three subjects each performing the continuous WBD actions in one video sequence. During this experiment, we adopt the same feature representation as [8] for a direct comparison. In Table II, the bottom row presents the average recognition accuracy over the three types of actions, where ESC performs on par with SVM-SMM and clearly outperforms SVM, 1NN, and SVM-HMM. V. OUTLOOK AND DISCUSSION We have proposed a simple yet powerful sequence correlation framework, namely ESC, for the tasks of video action analysis and understanding. In particular, we devise a generalized DP formula that enables the exploitation of useful techniques from both the DTW and approximate pattern matching research communities, and that is convenient to integrate with various local feature representation schemes.
We evaluate our approach in two related applications, action pattern retrieval and action segmentation and recognition, where performance comparable to the state of the art is obtained during empirical evaluations on a number of video action data sets. Future work includes incorporating geometrically invariant feature representations to deal with the issue of multiple views, and extensions to kernel learning. In particular, we are interested in applying the proposed framework to the problem of unusual video activity detection. REFERENCES [1] C. S. Myers and L. R. Rabiner, “A comparative study of several dynamic time-warping algorithms for connected word recognition,” Bell Syst. Tech. J., vol. 60, no. 7, pp. 1389–1409, 1981. [2] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997. [3] G. Navarro, “A guided tour to approximate string matching,” ACM Comput. Surv., vol. 33, no. 1, pp. 31–88, 2001. [4] J. Phillips, G. Humphreys, U. Noppeney, and C. Price, “The neural substrates of action retrieval: An examination of semantic and visual routes to action,” Visual Cognition, vol. 9, no. 4–5, pp. 662–685, 2002. [5] J. Yamato, J. Ohya, and K. Ishii, “Recognizing human action in time-sequential images using hidden Markov model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1992, pp. 379–385. [6] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1997, p. 994. [7] I. Laptev and P. Perez, “Retrieving actions in movies,” in Proc. Int. Conf. Comput. Vis., 2007, pp. 1–8. [8] Q. Shi, L. Wang, L. Cheng, and A. Smola, “Discriminative human action segmentation and recognition using semi-Markov model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8. [9] L. Zelnik-Manor and M. Irani, “Event-based analysis of video,” in Proc. Int. Conf. Comput. Vis.
Pattern Recognit., 2001, pp. 123–130. [10] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in Proc. Int. Conf. Comput. Vis., 2005, pp. 1395–1402. [11] E. Shechtman and M. Irani, “Space-time behavior-based correlation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 11, pp. 2045–2056, Nov. 2007. [12] I. Laptev, B. Caputo, C. Schüldt, and T. Lindeberg, “Local velocityadapted motion events for spatio-temporal recognition,” Comput. Vis. Image Underst., vol. 108, no. 3, pp. 207–229, 2007. [13] T. Thi, L. Cheng, J. Zhang, L. Wang, and S. Satoh, “Human action recognition and localization in video using structured learning of local space-time features,” in Proc. Int. Conf. Adv. Video Signal Based Surveillance, 2010, pp. 1–8. [14] D. DeMenthon and D. Doermann, “Video retrieval using spatio-temporal descriptors,” in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 508–517. [15] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” Comput. Vis. Image Underst., vol. 104, no. 2, pp. 90–126, 2006. [16] R. Poppe, “A survey on vision-based human action recognition,” Image Vision Comput., vol. 28, pp. 976–990, 2010. [17] F. Lv and R. Nevatia, “Recognition and segmentation of 3-d human action using HMM and multi-class Adaboost,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. IV: 359–IV: 372. [18] A. Kale, A. Sundaresan, A. Rajagopalan, N. Cuntoor, A. RoyChowdhury, V. Kruger, and R. Chellappa, “Identification of humans using gait,” IEEE Trans. Image Process., vol. 13, no. 9, pp. 1163–1173, Sep. 2004. [19] Q. Shi, L. Cheng, L. Wang, and A. Smola, “Discriminative human action segmentation and recognition using SMMs,” Int. J. Comput. Vis., vol. 93, no. 1, pp. 22–32, 2010. [20] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, “Conditional models for contextual human motion recognition,” in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. 1808–1815. 
[21] L. Wang and D. Suter, “Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8. [22] J. Kärkkäinen, G. Navarro, and E. Ukkonen, “Approximate string matching on Ziv-Lempel compressed text,” J. Discrete Algorithms, vol. 1, no. 3–4, pp. 313–338, 2003. [23] K. Fredriksson and G. Navarro, “Average-optimal single and multiple approximate string matching,” J. Exp. Algorithmics, vol. 9, pp. 1.4–1.4, 2004. [24] S. Deorowicz, “Speeding up transposition-invariant string matching,” Inf. Process. Lett., vol. 100, no. 1, pp. 14–20, 2006. [25] T. N. D. Huynh, W.-K. Hon, T.-W. Lam, and W.-K. Sung, “Approximate string matching using compressed suffix arrays,” Theor. Comput. Sci., vol. 352, no. 1, pp. 240–249, 2006. [26] C. du Mouza, P. Rigaux, and M. Scholl, “Parameterized pattern queries,” Data Knowl. Eng., vol. 63, no. 2, pp. 433–456, 2007. [27] S. Grabowski and K. Fredriksson, “Bit-parallel string matching under Hamming distance in O(n⌈m/w⌉) worst case time,” Inf. Process. Lett., vol. 105, no. 5, pp. 182–187, 2008. [28] X. Yan, F. Zhu, P. S. Yu, and J. Han, “Feature-based similarity search in graph structures,” ACM Trans. Database Syst., vol. 31, no. 4, pp. 1418–1453, 2006. [29] C.-F. Cheung, J. X. Yu, and H. Lu, “Constructing suffix tree for gigabyte sequences with megabyte memory,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 1, pp. 90–105, Jan. 2005. [30] H. Lee, R. T. Ng, and K. Shim, “Extending q-grams to estimate selectivity of string matching with low edit distance,” in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 195–206. [31] M. Kurucz, A. A. Benczúr, T. Kiss, I. Nagy, A. Szabó, and B. Torma, “KDD cup 2007 task 1 winner report,” SIGKDD Explor. Newsl., vol. 9, no. 2, pp. 53–56, 2007. [32] S. Mihov and K. U. Schulz, “Fast approximate search in large dictionaries,” Comput.
Linguist., vol. 30, no. 4, pp. 451–477, 2004. [33] F. Mandreoli, R. Martoglia, and P. Tiberio, “Extra: A system for example-based translation assistance,” Mach. Translation, vol. 20, no. 3, pp. 167–197, 2006. [34] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Upper Saddle River, NJ: Prentice-Hall, 1993. [35] E. J. Keogh and M. J. Pazzani, “Scaling up dynamic time warping to massive dataset,” in PKDD ’99: Proc. 3rd Eur. Conf. Principles of Data Mining and Knowledge Discovery, London, U.K., 1999, pp. 1–11. [36] X. Ge and P. Smyth, “Deformable Markov model templates for timeseries pattern matching,” in KDD ’00: Proc. 6th ACM SIGKDD Int. Knowl. Discovery Data Mining, 2000, pp. 81–90. [37] T. Oates, L. Firoiu, and P. R. Cohen, “Using dynamic time warping to bootstrap HMM-based clustering of time series,” in Sequence Learning—Paradigms, Algorithms, and Applications. London, U.K.: Springer-Verlag, 2001, pp. 35–52. [38] Z. Bar-joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon, “Continuous representations of time series gene expression data,” J. Computat. Biol., vol. 10, pp. 3–4, 2003. [39] J. Yang, W. Wang, and P. S. Yu, “Mining surprising periodic patterns,” Data Min. Knowl. Discov., vol. 9, no. 2, pp. 189–216, 2004. [40] M. Vlachos, G. Kollios, and D. Gunopulos, “Elastic translation invariant matching of trajectories,” Mach. Learn., vol. 58, no. 2–3, pp. 301–334, 2005. [41] A. Efrat, Q. Fan, and S. Venkatasubramanian, “Curve matching, time warping, and light fields: New algorithms for computing similarity between curves,” J. Math. Imaging Vis., vol. 27, no. 3, pp. 203–216, 2007. [42] R. K. Bajcsy and C. Broit, “Matching of deformed images,” in Proc. ICPR, 1982, pp. 351–353. [43] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models: Their training and application,” Comput. Vis. Image Underst., vol. 61, no. 1, pp. 38–59, 1995. [44] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. Comput. 
Vis., 2001. [45] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-26, no. 1, pp. 43–49, Jan. 1978. [46] D. Knuth, J. Morris, and V. Pratt, “Fast pattern matching in strings,” SIAM J. Computing, vol. 6, no. 2, pp. 323–350, Jun. 1977. [47] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002. [48] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. Int. Conf. Comput. Vis., 2003, vol. 2, pp. 1470–1477. [49] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in Proc. 2nd Joint IEEE Int. Workshop Visual Surveill. Performance Eval. Tracking Surveill., 2005, pp. 65–72. [50] L. Cheng and M. Gong, “Realtime background subtraction from dynamic scenes,” in Proc. ICCV, 2009, pp. 1–8. [51] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004. [52] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proc. Int. Conf. Pattern Recognit., 2004, pp. 32–36. [53] S. Nowozin, G. Bakir, and K. Tsuda, “Discriminative subsequence mining for action classification,” in Proc. Int. Conf. Comput. Vis., 2007, pp. 1919–1923. [54] Y. Ke, R. Sukthankar, and M. Hebert, “Efficient visual event detection using volumetric features,” in Proc. Int. Conf. Comput. Vis., Oct. 2005, vol. 1, pp. 166–173. [55] S. Wong, T. Kim, and R. Cipolla, “Learning motion categories using both semantic and structural information,” in Proc. IEEE Conf. CVPR, 2007, pp. 1–6. [56] R. Gross and J. Shi, The CMU Motion of Body (MoBo) Database, Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-RI-TR-01-18, 2001. Li Wang received the M.S. and Ph.D.
degrees from the Institute of Automation, Southeast University, Nanjing, China, in 2005 and 2009, respectively. He is currently a Lecturer with Nanjing Forestry University, Nanjing, China. His research interests include human action recognition, human detection and tracking, as well as machine learning for computer vision. Li Cheng (M’04) received the Ph.D. degree in computer science from the University of Alberta, AB, Canada. He is a Research Scientist with the Bioinformatics Institute (BII), A*STAR, Singapore. Prior to joining BII in July 2010, he was with the Statistical Machine Learning group of NICTA, Australia, TTI-Chicago, IL, and the University of Alberta. His research expertise is mainly in machine learning and computer vision. Liang Wang received the B.Eng. and M.Eng. degrees from Anhui University in 1997 and 2000, respectively, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China, in 2004. From 2004 to 2010, he was a Research Assistant with Imperial College London, London, U.K., and Monash University, Australia, a Research Fellow with the University of Melbourne, Australia, and a Lecturer with the University of Bath, U.K., respectively. Currently, he is a Professor under the Hundred Talents Program at the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His major research interests include machine learning, pattern recognition, and computer vision. He has published widely in highly ranked international journals and leading international conferences. He is an associate editor for the International Journal of Image and Graphics, Signal Processing, Neurocomputing, and the International Journal of Cognitive Biometrics. Prof. Wang is a member of BMVA. He was the recipient of the Special Prize of the Presidential Scholarship of the Chinese Academy of Sciences. He is an associate editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS—PART B.
He has been a guest editor for four special issues, a coeditor of five edited books, and a cochair of six international workshops.