Author's personal copy Int J Comput Vis DOI 10.1007/s11263-013-0666-4 Multi-Target Tracking by Online Learning a CRF Model of Appearance and Motion Patterns Bo Yang · Ramakant Nevatia Received: 21 December 2012 / Accepted: 7 October 2013 © Springer Science+Business Media New York 2013 Abstract We introduce an online learning approach for multi-target tracking. Detection responses are gradually associated into tracklets in multiple levels to produce final tracks. Unlike most previous approaches which only focus on producing discriminative motion and appearance models for all targets, we further consider discriminative features for distinguishing difficult pairs of targets. The tracking problem is formulated using an online learned CRF model, and is transformed into an energy minimization problem. The energy functions include a set of unary functions that are based on motion and appearance models for discriminating all targets, as well as a set of pairwise functions that are based on models for differentiating corresponding pairs of tracklets. The online CRF approach is more powerful at distinguishing spatially close targets with similar appearances, as well as in tracking targets in presence of camera motions. An efficient algorithm is introduced for finding an association with low energy cost. We present results on four public data sets, and show significant improvements compared with several stateof-art methods. Keywords Multi-target tracking · Online learned CRF · Appearance and motion patterns · Association based tracking Electronic supplementary material The online version of this article (doi:10.1007/s11263-013-0666-4) contains supplementary material, which is available to authorized users. B. Yang (B) · R. Nevatia Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA, USA e-mail: [email protected]; [email protected] R. Nevatia e-mail: [email protected] 1 Introduction Automatic tracking of multiple targets is an important problem in computer vision, and is helpful for many real applications and high level analysis, e.g., event detection, crowd analysis, robotics. The aim of the task is to automatically locate concerned targets, give a unique id to each, and compute their trajectories. Due to significant improvements in object detection techniques in recent years, many researchers have proposed tracking methods that associate object detection responses into tracks, i.e., Association Based Tracking (ABT) methods (Perera et al. 2006; Xing et al. 2009; Andriyenko and Schindler 2011; Pirsiavash et al. 2011; Yang and Nevatia 2012a). They often apply an offline learned detector to each frame to provide detection responses, and then link these responses into tracklets, i.e., track fragments; tracklets are further associated into longer tracks in one or multiple steps. The probability of association of two tracklets is based on the motion smoothness and appearance similarity; a global solution with maximum probability is often found by Hungarian algorithm (Perera et al. 2006; Xing et al. 2009), MCMC (Yu et al. 2007; Song et al. 2010), or network flow analysis (Zhang et al. 2008; Pirsiavash et al. 2011), etc. Association based approaches are powerful at dealing with extended occlusions between targets and the complexity is polynomial in the number of targets. To associate each target, motion and appearance information are often adopted to produce discriminative descriptors. 
Motion descriptors are often based on speed and distance between tracklet pairs, while appearance descriptors are often based on global or part based color histograms to distinguish different targets. However, how to better distinguish nearby targets and how to deal with motion in moving cameras remain key issues that limit the performance of ABT. In this paper, we propose 123 Author's personal copy Int J Comput Vis Fig. 1 Examples of tracking results by our approach an online learned conditional random field (CRF) model to reliably track targets under camera motions and to better discriminate different targets, especially difficult pairs, which are spatially near targets with similar appearance. Figure 1 shows some tracking examples by our approach, which is able to track targets under heavy occlusion, camera motion, and crowded scenes. In most previous ABT work, appearance models are predefined (Xing et al. 2009; Pirsiavash et al. 2011) or online learned to discriminate all targets from one another (Kuo et al. 2010; Shitrit et al. 2011) or to discriminate one target from all others (Breitenstein et al. 2011; Kuo and Nevatia 2011). Though such learned appearance models are able to distinguish many targets, they are not so effective at differentiating difficult pairs, i.e., close targets with similar appearances. The discriminative features for a difficult pair are possibly quite different than those for distinguishing with all other targets. Linear motion models have been widely used in previous tracking work (Stalder et al. 2010; Yang et al. 2011; Breitenstein et al. 2011); linking probabilities between tracklets are often based on how well a pair of tracklets satisfies a linear motion assumption, i.e., assuming that targets move without changing directions and with constant speeds. However, as shown in the first row of Fig. 2, if the view angle changes due to camera motion, motion is not linear; this could conceivably be compensated for by camera motion inference techniques, but this is a challenging task by itself. Relative positions between targets are less influenced by view angles, as all targets shift together when the camera moves; they may be more stable than linear motion models for estimation of 123 positions under camera motions. For static cameras, relative positions are still helpful, as shown in the second row of Fig. 2; some targets may not follow a linear motion model, and relative positions between co-moving targets are useful for recovering errors of a linear motion model. Based on above observations, we propose an online learning approach, which formulates the multi-target tracking problem as one of inference in a CRF model as shown in Fig. 3. Our CRF framework incorporates global models for distinguishing all targets and pairwise models for differentiating difficult pairs of targets. Given detection responses for a video, low level tracklets are first built from reliable response pairs in neighboring frames as shown in Fig. 3. All linkable tracklet pairs form the nodes in this CRF model, and labels of each node (1 or 0) indicate whether two tracklets may be linked or not. The energy cost for each node is estimated based on its observation, e.g., the green rectangles in Fig. 3. The observation is composed of global motion and appearance affinities, as similar in Kuo et al. (2010). As shown in the global models in Fig. 
3, the global motion affinity is based on linear motion models between the tracklet pairs in CRF nodes, e.g., using Δs1 and Δs2 , the distances between estimated locations and actual locations, to compute motion probability. The global appearance affinity is based on an online learned appearance model which best distinguishes all tracklets; we can see that the positive pairs, e.g., those labeled with +1, are responses from the same tracklet, and the negative pairs, e.g., those labeled with −1, are responses from temporal overlapping tracklets; the learned appearance model produce high and Author's personal copy Int J Comput Vis Fig. 2 Examples of relative positions and linear estimations. In the 2D map, filled circles indicate positions of person at the earlier frames; dashed circles and solid circles indicate estimated positions by linear motion models and real positions at the later frames respectively Fig. 3 Tracking framework of our approach. In the CRF model, each node denotes a possible link between a tracklet pair and has a unary energy cost based on global appearance and motion models; each edge denotes a correlation between two nodes and has a pairwise energy cost based on discriminative appearance and motion models specifically for the two nodes. Colors of detection responses indicate their belonging tracklets. For clarity purpose, we only show four nodes in the CRF model. Best viewed in color (Color figure online) 123 Author's personal copy Int J Comput Vis low probability scores for positive and negative pairs, respectively. The same global motion and appearance models are used for all CRF nodes. Energy cost for an edge is based on discriminative pairwise models, i.e., appearance and motion descriptors, that are learned online for distinguishing tracklets in the two connected CRF nodes. As shown in Fig. 3, positive pairs for the pairwise appearance models are extracted from only the concerned two tracklets, e.g., the green and the blue tracklets in pairwise model 1, and each negative pair consists of one response in the green tracklet and one response in the blue tracklet. The pairwise appearance model is for discriminating specific pair of targets without considering others. Pairwise motion model is based on the relative distances between concerned two targets; for example, in pairwise model 1 in Fig. 3, we evaluate the relative distance Δs between the green and the blue tracklets after Δt, and use Δs to match other close-by tracklet pairs (see Sect. 4 for details). This considers relative distance between specific pair of targets that may move together. Global and pairwise models are used to produce unary and pairwise energy functions respectively, and the tracking problem is transformed into an energy minimization task. The contributions of this paper are: – A CRF framework for modeling both global tracklet affinity models and pairwise motion and discriminative appearance models. – An online learning approach for producing unary and pairwise energy functions in a CRF model. – An approximation algorithm for efficiently finding good tracking solutions with low energy costs. – An improved tracking algorithm with evaluation on standard data sets. Some parts of this work were presented in Yang and Nevatia (2012b); we provide more technical details as well as more experiments in this paper. The rest of the paper is organized as follows: related work is discussed in Sect. 2; our tracking framework is given in Sect. 3; Sect. 
4 describes the online learning approach for a CRF model; experiments are shown in Sect. 5, followed by conclusion in Sect. 6. 2 Related Work Multi-target tracking has been an active topic in computer vision research. The main challenges focus on how to distinguish targets from the background and from each other, and how to deal with missed detections, false alarms, and occlusions. Category free tracking (CFT) methods, often called visual tracking approaches, focus on tracking a single object or mul- 123 tiple objects separately (Kalal et al. 2010; Wang et al. 2011; Zhang et al. 2012a,b; Holzer et al. 2012). These methods are able to track target without regard to its category, from a manual initialization; no offline trained detector is needed, and tracking is usually based on appearance only to deal with abrupt motions. From the given label in the first frame, an appearance model is learned to discriminate the target from all other regions; the appearance model is usually online updated to deal with illumination and view angle changes. Meanshift (Comaniciu et al. 2000) or particle filtering (Isard and Blake 1998) approaches are often adopted to make tracking decisions. Spatially correlated regions are often considered together to reduce the risk of drifting (Grabner et al. 2010; Duan et al. 2012). CFT methods are able to track targets under severe motions and large appearance changes, but once the result drifts it is very difficult to recover. Most ABT methods focus on tracking multiple objects of a pre-known class simultaneously (Perera et al. 2006; Song et al. 2010; Andriyenko and Schindler 2011; Xing et al. 2011). They usually associate detection responses produced by a pre-trained detector into long tracks, and find a global optimal solution for all targets. The key issue in ABT is how to measure association probabilities between tracklets; the probability is often based on appearance and motion cues. In some ABT work, appearance and motion models are predefined (Perera et al. 2006; Huang et al. 2008; Pirsiavash et al. 2011; Breitenstein et al. 2011; Andriyenko et al. 2012). To automatically find appropriate weights for different cues, Li et al. (2009) proposed an offline learning framework to integrate multiple cues; then in order to better differentiate different targets, online learned discriminative appearance models are proposed to distinguish multiple targets globally (Kuo et al. 2010; Kuo and Nevatia 2011; Shitrit et al. 2011). Association probabilities between tracklets are usually assumed to be independent of each other, Yang et al. (2011) introduced an offline trained CRF model to relax this assumption. However, global appearance models are not necessarily able to differentiate difficult pairs of targets, i.e., those that are spatially close and have similar appearances. For example, the best global appearance model may be the color histogram of upper body, but the most discriminative models for a pair of near-by persons may be the histogram of lower body. In addition, linear motion models between tracklet pairs (Stalder et al. 2010; Breitenstein et al. 2011; Zamir et al. 2012) are often adopted to constrain motion smoothness. However, camera motion may invalidate the linear motion assumption, and targets may sometimes follow non-linear paths. Our online CRF approach considers both global and pairwise appearance and motion models. Note that CRF models have also been utilized in Yang et al. (2011). Both Yang et al. 
(2011) and our approach relax the assumption that associations between tracklet pairs are independent of each other. However, Yang et al. (2011) focused Author's personal copy Int J Comput Vis on modeling association dependencies, while our approach aims at a better distinction between difficult pairs of targets and therefore the meanings of edges in the CRF are different. In addition, Yang et al. (2011) is an offline approach that integrates multiple cues on pre-labeled ground truth data, but our approach is an online learning method that finds discriminative models automatically without pre-labeled data. 3 ABT Framework Our tracking framework is composed of three modules: object detection to provide responses for association, a sliding window method to improve efficiency and reduce latency, and CRF tracking in one sliding window. Given an input video, an object detector is first applied to each frame to produce detection responses. Then based on appearance and motion cues, these responses are gradually linked into tracks; some of the missed detections are inserted by interpolation, and some of the false detections are removed. 3.1 Object Detection ABT associates detection responses into tracks; therefore, the detection performance is an important factor in the quality of the resulting tracks. Though much progress has been made on detection techniques, state-of-art detectors are still far from perfect, and there is always a trade-off between precision and recall. In our framework, we prefer a detector that has high precision and reasonable recall, e.g., over 80 % precision and over 70 % recall. In association based framework, gaps between responses can be filled by measuring their appearance similarity and motion smoothness; however, spatially and temporally close false detections may produce false tracks that are much more difficult to remove. In this paper, we focus on automatically tracking multiple pedestrians. We adopt the JRoG pedestrian detector (Huang and Nevatia 2010) due to its high accuracy and computational efficiency; also it is a part based detector and is thus able to deal with partial occlusions to some extent. 3.2 Sliding Window Method for Tracking a Long Video To reduce latency, we track targets in sliding windows instead of in the whole video. In the whole video, there may be many targets, most of which may not have temporal overlap. Therefore, global optimization on the whole video is impractical due to the huge search space and long latency, and is also unnecessary as many targets only exist for some period but not the whole video. To produce tracks across sliding windows, we use overlapping neighboring windows. An example is given in Fig. 4. Given an input video, we first apply the object detector in the first sliding window and track within the window. Then we move the window with a step of half of its size, so that the new window has a 50 % overlap with the previous one, and apply the detector in the new frames (frames 9–12 in this example). We further evenly divide the overlapped part into two; in the first 25 % of the new window (frames 5, 6), we keep the tracking results produced from the previous window; in the 25–50 % of the new window (frames 7, 8), we break all tracking results into original detection responses. After that, the tracking algorithm is applied in the new window; this approach can produce tracks across sliding windows. 
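To make the window bookkeeping concrete, the following is a minimal C++ sketch of the sliding-window procedure described above; the window length handling, the Detection/Track types, and the helper functions detectFrames and trackWindow are assumptions made for illustration only, not the authors' actual implementation.

```cpp
#include <algorithm>
#include <vector>

struct Detection { int frame; /* position, size, score, ... */ };
struct Track { std::vector<Detection> members; };

// Hypothetical interfaces assumed for this sketch.
std::vector<Detection> detectFrames(int firstFrame, int lastFrame);
std::vector<Track> trackWindow(const std::vector<Track>& keptTracks,
                               const std::vector<Detection>& freeDetections);

// Windows of `windowLen` frames with 50% overlap: results in the first quarter
// of each new window are kept from the previous window, results in the second
// quarter are broken back into raw detections and re-associated.
void runSlidingWindows(int totalFrames, int windowLen) {
  std::vector<Track> previous;                       // tracks from the last window
  for (int start = 0; start < totalFrames; start += windowLen / 2) {
    int end = std::min(start + windowLen, totalFrames);
    int detFrom = (start == 0) ? 0 : start + windowLen / 2;   // only new frames
    std::vector<Detection> freed = detectFrames(detFrom, end - 1);

    std::vector<Track> kept;                         // frozen first-quarter results
    for (const Track& t : previous) {
      Track head;
      for (const Detection& d : t.members) {
        if (d.frame < start + windowLen / 4) head.members.push_back(d);
        else freed.push_back(d);                     // second quarter: re-associate
      }
      if (!head.members.empty()) kept.push_back(head);
    }
    previous = trackWindow(kept, freed);             // associate within this window
  }
}
```

With an eight second window and a four second step, as used in the experiments, each new window revisits the last half of the previous one in exactly this way.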
Note that the tracking results in frames 7 and 8 could be different in neighboring windows; we trust the results in the later window, as there are more frames around frames 7 and 8 in the new window, meaning that more information is used for making the association decisions.

Properly setting the size of a sliding window creates a trade-off between latency and accuracy. More frames in a sliding window bring more information, and therefore the tracking results are more accurate; however, this also means more cached frames in a window, which leads to longer latency. In the experiments reported later, we fix the sliding window to be eight seconds long with a 50 % overlap between neighboring windows; therefore, our algorithm has a two second latency but is able to produce robust results even in complex scenes.

Fig. 4 Illustration for the sliding window approach. Circles denote detection responses, and the colors indicate their appearances. There is a 50 % overlap between neighboring sliding windows. See text for details (Color figure online)

3.3 CRF Formulation for Tracking in One Temporal Window

Given detection responses in a temporal window, our task is to produce reliable tracks within this window. Similar to Kuo et al. (2010), we adopt a low level association process to connect detection responses in neighboring frames into reliable tracklets, and then associate the tracklets progressively in multiple levels. The low level association only links responses in neighboring frames that have high appearance similarities and small spatial distances. There are two advantages to associating tracklets rather than the raw detections: (1) the search space is reduced, as we explore all linkable tracklet pairs instead of all detection pairs; (2) motion cues are extracted more reliably from tracklets than from single frame detections.

We formulate the notation to be used as follows:

– S = {T1, T2, ..., Tn}: the input tracklets produced in the previous level; low level association results are used as the input for the first level.
– Ti = {d_i^{t_i^s}, ..., d_i^{t_i^e}}: a tracklet, defined as a set of detection or interpolated responses in consecutive frames, starting at frame t_i^s and ending at frame t_i^e. It is possible that t_i^s = t_i^e.
– d_i^t = {p_i^t, s_i^t, v_i^t}: a detection or interpolated response at frame t, at position p_i^t, with size s_i^t and velocity vector v_i^t.
– G = {V, E}: the CRF graph created for the current level.
– V = {vi = (Ti1 → Ti2)}: all possible association pairs of tracklets in the current level.
– E = {ej = (vj1, vj2)}: all possible correlated pairs of nodes in V.
– L = {l1, l2, ..., lm}: the labels for all possible association pairs of tracklets; li = 1 indicates that Ti1 is linked to Ti2, and li = 0 indicates the opposite.

At each level of association, we aim to find the best set of associations with the highest probability. We formulate the tracking problem as finding the best L given S:

L* = argmax_L P(L|S) = argmax_L (1/Z) exp(−Ψ(L|S))   (1)

where Z is a normalization factor and Ψ is a cost function.
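Before turning to the energy terms, the following C++ sketch shows one possible way the notation above could be represented in code; all type and field names here are illustrative assumptions made for this paper summary, not the authors' actual implementation.

```cpp
#include <utility>
#include <vector>

// d_i^t: a detection or interpolated response at one frame.
struct Response {
  int frame;
  double x, y;        // position p_i^t
  double size;        // size s_i^t
  double vx, vy;      // velocity vector v_i^t
};

// T_i: responses in consecutive frames, from startFrame (t_i^s) to endFrame (t_i^e).
struct Tracklet {
  int startFrame = 0, endFrame = 0;
  std::vector<Response> responses;
};

// v_i = (T_i1 -> T_i2): one possible association of two tracklets, i.e., a CRF node.
struct CrfNode {
  int from = -1, to = -1;   // indices of T_i1 and T_i2 in the tracklet set S
  double unaryEnergy = 0.0; // U(l_i = 1 | S), filled in by the global models
};

// The CRF graph G = (V, E) for one association level.
struct CrfGraph {
  std::vector<CrfNode> nodes;                 // V
  std::vector<std::pair<int, int>> edges;     // E: pairs of node indices
};

// L: one label per node; labels[i] == 1 means T_i1 is linked to T_i2.
using Labels = std::vector<int>;
```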
Assuming that the joint distributions of more than two associations do not contribute to P(L|S), we have

L* = argmin_L Ψ(L|S) = argmin_L ( Σ_i U(li|S) + Σ_ij B(li, lj|S) )   (2)

where U(li|S) = −ln P(li|S) and B(li, lj|S) = −ln P(li, lj|S) denote the unary and pairwise energy functions respectively. In Eq. 2, U(li|S) defines the linking probability between two tracklets based on the global appearance and motion models, while B(li, lj|S) defines the correlation between tracklet pairs based on discriminative models learned for the corresponding pairs of tracklets.

We model the tracking problem as finding appropriate labels in a CRF model, which is learned repeatedly in each sliding window. For simplicity, we assume that one tracklet cannot be associated with more than one tracklet; this makes energy minimization efficient. In the detection process, nearby responses are merged in a conservative way, so multiple targets are unlikely to be detected as one object, but multiple responses may exist for one target. Any valid label set L should therefore satisfy

Σ_{vi ∈ Head_i1} li ≤ 1   &   Σ_{vi ∈ Tail_i2} li ≤ 1   (3)

where Head_i1 = {(Ti1 → Tj) ∈ V, ∀Tj ∈ S} and Tail_i2 = {(Tj → Ti2) ∈ V, ∀Tj ∈ S}. The first constraint limits any tracklet Ti1 to link to at most one other tracklet, and the second constraint ensures that at most one tracklet may be linked to any tracklet Ti2.

4 Online Learning of CRF Models

In this section, we describe our tracking approach in several steps, including the CRF creation, the online learning of unary and pairwise terms, and how to find an association label set with the lowest overall energy as in Eq. 2.

4.1 CRF Creation for Tracklets Association

In order to create the CRF graph, we first find all linkable tracklet pairs. Tracklet Ti is linkable to Tj if the gap between the end of Ti and the beginning of Tj satisfies

0 < t_j^s − t_i^e < tmax   (4)

where tmax is a threshold on the maximum gap between any linkable pair of tracklets, and t_j^s and t_i^e denote the start and end frames of Tj and Ti respectively. The "> 0" constraint prevents temporally overlapping tracklets from being connected, as one target cannot appear in two positions at the same time; the "< tmax" constraint prevents tracklets with large temporal gaps from being connected, as it is less safe to assume that tracklets separated by a long gap belong to the same target. In our implementation, tmax is fixed to be 4 s long, i.e., half of our sliding window size. If a target disappears or is occluded for more than 4 s, it is treated as a new target. We create the set of nodes V in the CRF to model all linkable tracklet pairs as

V = {vi = (Ti1 → Ti2)}   s.t. Ti1 is linkable to Ti2   (5)

Before defining the set of edges E in the CRF, we first define head-close and tail-close pairs of tracklets.

Fig. 5 Examples of head-close and tail-close tracklet pairs. The x-axis denotes the frame number, and the y-axis denotes the position

Two tracklets Ti and Tj are a head-close pair if they satisfy (supposing t_i^s ≥ t_j^s)

t_i^s + ζ < t_i^e   &   t_i^s + ζ < t_j^e   &   ∀t ∈ [t_i^s, t_i^s + ζ]: ||p_i^t − p_j^t|| < γ · min{s_i^t, s_j^t}   (6)

where γ is a distance control factor (set to 3 in our experiments), and ζ is a temporal length threshold set to one second in our experiments; see Fig. 5 for an illustration. Note that in Eq. 6 the distance is normalized by the size of the target. This definition indicates that the head part of Ti is spatially close to Tj for a certain period, suggesting possible co-movement of the two targets.
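As a concrete illustration of Eq. 6, a head-close test could be implemented roughly as below, reusing the Response and Tracklet sketches from Sect. 3.3; the helper respAt and the frame-based ζ are assumptions of this sketch rather than the authors' code. The tail-close test, defined next, is symmetric and uses the last ζ frames instead of the first.

```cpp
#include <algorithm>
#include <cmath>

// Uses the Response and Tracklet sketches from Sect. 3.3.
inline const Response& respAt(const Tracklet& t, int frame) {
  return t.responses[frame - t.startFrame];   // responses cover consecutive frames
}

// Head-close test of Eq. 6 for tracklets ti and tj (zeta in frames, gamma ~ 3).
bool headClose(const Tracklet& ti, const Tracklet& tj, int zeta, double gamma) {
  const Tracklet& a = (ti.startFrame >= tj.startFrame) ? ti : tj;  // later start
  const Tracklet& b = (ti.startFrame >= tj.startFrame) ? tj : ti;
  if (a.startFrame + zeta >= a.endFrame) return false;   // t_i^s + zeta < t_i^e
  if (a.startFrame + zeta >= b.endFrame) return false;   // t_i^s + zeta < t_j^e
  for (int t = a.startFrame; t <= a.startFrame + zeta; ++t) {
    const Response& ra = respAt(a, t);
    const Response& rb = respAt(b, t);
    double dist = std::hypot(ra.x - rb.x, ra.y - rb.y);
    if (dist >= gamma * std::min(ra.size, rb.size)) return false;  // Eq. 6 distance
  }
  return true;
}
```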
The definition of tail-close is similar; two tracklets Tk and Tm are a tail-close pair if they satisfy (supposing t_k^e ≥ t_m^e)

t_m^e − ζ > t_k^s   &   t_m^e − ζ > t_m^s   &   ∀t ∈ [t_m^e − ζ, t_m^e]: ||p_k^t − p_m^t|| < γ · min{s_k^t, s_m^t}   (7)

We define the set of edges E as

E = {(vi, vj)}   ∀vi, vj ∈ V   s.t. Ti1 and Tj1 are tail-close, or Ti2 and Tj2 are head-close   (8)

The cost of edges is based on pairwise motion and appearance models, as shown in Fig. 3. Such pairs are often more difficult to differentiate than spatially far-away targets; moreover, if they follow the same motion patterns, e.g., the same linear or non-linear motion path, they are helpful for the motion estimation of each other. Note that a CRF model is also used in Yang et al. (2011); however, there, the edges are used for modeling association dependencies and not for possible co-moving targets.

4.2 Learning of Unary Terms

Fig. 6 Global motion models for unary terms in CRF. The x-axis denotes the frame number, and the y-axis denotes the position

Unary terms in Eq. 2 define the energy cost for associating pairs of tracklets. As defined in Sect. 3.3, U(li|S) = −ln P(li|S). We further divide the probability into a motion based probability Pm(·) and an appearance based probability Pa(·) as

U(li = 1|S) = −ln( Pm(Ti1 → Ti2|S) · Pa(Ti1 → Ti2|S) )   (9)

Pm is a function of the distance between the positions estimated by the linear motion model and the real positions, as in Kuo et al. (2010), Yang et al. (2011), and Kuo and Nevatia (2011). As shown in Fig. 6, the distances are computed as Δp1 = p_head − v_head·Δt − p_tail and Δp2 = p_tail + v_tail·Δt − p_head. The motion probability between the two tracklets is defined as

Pm(Ti1 → Ti2|S) = G(Δp1, Σp) · G(Δp2, Σp)   (10)

where G(·, Σ) is the zero-mean Gaussian function, and Δt is the frame difference between p_tail and p_head.

For Pa(·), we adopt the online learned discriminative appearance models (OLDAMs) defined in Kuo et al. (2010), which focus on learning appearance models with good global discriminative abilities between targets. Responses from the same tracklet are used as positive samples, and those from temporally overlapping tracklets are used as negative samples. RealBoost is adopted to learn a single global appearance model which provides high Pa(·) for responses in positive samples and low Pa(·) for responses in negative samples. Features include color, texture, and shape for different regions of the targets, as in Kuo et al. (2010).

4.3 Learning of Pairwise Terms

Fig. 7 Pairwise motion models for pairwise terms in CRF. The x-axis denotes the frame number, and the y-axis denotes the position

Similar to the unary terms, pairwise terms are also decomposed into motion based and appearance based parts. Motion based probabilities are defined based on relative distances between tracklet pairs. Take two nodes (T1 → T3) and (T2 → T4) as an example. Suppose T1 and T2 are a tail-close pair; therefore, there is an edge between the two nodes. Let tx = min{t_1^e, t_2^e} and ty = max{t_3^s, t_4^s}. As T1 and T2 are tail-close, we estimate the positions of both targets at frame ty, as shown in dashed circles in Fig. 7. Then we can get the estimated relative distance between the positions of T1 and T2 at frame ty as

Δp1 = ( p_1^{t_1^e} + V_1^tail (ty − t_1^e) ) − ( p_2^{t_2^e} + V_2^tail (ty − t_2^e) )   (11)

where V_1^tail and t_1^e are the tail velocity and end frame of T1; V_2^tail and t_2^e are defined similarly for T2. Equation 11 indicates that if T1 and T2 do not change their motion directions and speeds, they should be separated by Δp1 at frame ty. We compare the estimated relative distance with the observed one, Δp2, and compute the pairwise motion probability as G(Δp1 − Δp2, Σp), where G is the zero-mean Gaussian function. As T1 and T2 are tail-close, they are likely to follow a similar non-linear motion path or to undergo similar position shifts caused by camera motion; in either case, their relative distance changes little. As shown in Fig. 7, the difference between Δp1 and Δp2 is small. This indicates that if T1 is associated to T3, there is a high probability that T2 is associated to T4, and vice versa. Note that if T3 and T4 in Fig. 7 are head-close, we also compute a similar probability as G(Δp3 − Δp4, Σp), where Δp3 is computed as in Eq. 12 and Δp4 is as shown in Fig. 7; the final motion probability is taken as the average of both.

Δp3 = ( p_3^{t_3^s} − V_3^head (t_3^s − tx) ) − ( p_4^{t_4^s} − V_4^head (t_4^s − tx) )   (12)

Pairwise appearance models are designed for differentiating specific close-by pairs. For example, in Fig. 7, T1 and T2 are a tail-close pair; we want to produce an appearance model that best distinguishes the two targets without considering other targets or the background. For this, we collect positive and negative samples online from the two concerned tracklets only. Positive samples are selected from responses in the same tracklet; any pair of these responses is expected to have high appearance similarity, as they belong to the same target. For a tail-close tracklet pair T1 and T2, the positive sample set S+ is defined as

S+ = {(d_k^{t1}, d_k^{t2})},   k ∈ {1, 2},   ∀t1, t2 ∈ [max{t_k^s, t_k^e − θ}, t_k^e]   (13)

where θ is a threshold on the number of frames used for computing appearance models (set to 10 in our experiments), as after some time a target may change appearance considerably due to illumination, view angle, or pose changes. Negative samples are sampled from responses in different tracklets; we want to find an appearance model that produces low appearance similarity for any pair of these responses, as they belong to different targets. The negative sample set S− between T1 and T2 is defined as

S− = {(d_1^{t1}, d_2^{t2})},   ∀t1 ∈ [max{t_1^s, t_1^e − θ}, t_1^e],   ∀t2 ∈ [max{t_2^s, t_2^e − θ}, t_2^e]   (14)

Sample collection for head-close pairs follows a similar procedure, but detection responses are taken from the first θ frames of each tracklet. With the positive and negative sample sets, we use the standard RealBoost algorithm to learn an appearance model that best distinguishes T1 and T2; the same features are used as in the unary terms. This learned model is used to produce appearance based probabilities for (T1 → T3) and (T2 → T4). If T3 and T4 are a head-close pair, we adopt a similar learning approach to get appearance probabilities based on discriminative models for T3 and T4, and use the average of both scores as the final pairwise appearance probability for (T1 → T3) and (T2 → T4). Note that the discriminative appearance models for T1 and T2 are learned only once for all edges of the form {(T1 → Tx), (T2 → Ty)}, ∀Tx, Ty ∈ S. Therefore, the complexity is O(n^2), where n is the number of tracklets. Moreover, as only a few tracklet pairs are likely to be spatially close, the actual number of pairwise models learned is typically much smaller than n^2.
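To make the relative-distance reasoning of Eqs. 11 and 12 concrete, the following C++ sketch shows how the pairwise motion probability for a tail-close pair might be computed; the Vec2 type, the isotropic parameter sigmaP, and the argument layout are illustrative assumptions (in practice the tail velocities would be estimated from the last responses of each tracklet), not the published implementation.

```cpp
#include <cmath>

struct Vec2 { double x, y; };
inline Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
inline Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
inline Vec2 operator*(Vec2 a, double s) { return {a.x * s, a.y * s}; }

// Zero-mean isotropic Gaussian affinity G(d, sigmaP), unnormalized.
inline double gaussianAffinity(Vec2 d, double sigmaP) {
  return std::exp(-(d.x * d.x + d.y * d.y) / (2.0 * sigmaP * sigmaP));
}

// Pairwise motion probability for an edge between nodes (T1->T3) and (T2->T4),
// following Eq. 11: extrapolate the tails of the tail-close pair (T1, T2) to
// frame ty and compare the predicted relative offset with the observed one.
double pairwiseMotionProb(Vec2 p1TailEnd, Vec2 v1Tail, int t1End,
                          Vec2 p2TailEnd, Vec2 v2Tail, int t2End,
                          Vec2 p3AtTy,    Vec2 p4AtTy,  int ty,
                          double sigmaP) {
  Vec2 est1 = p1TailEnd + v1Tail * double(ty - t1End);
  Vec2 est2 = p2TailEnd + v2Tail * double(ty - t2End);
  Vec2 dpEstimated = est1 - est2;           // Delta p1 in Eq. 11
  Vec2 dpObserved  = p3AtTy - p4AtTy;       // Delta p2: observed relative offset
  return gaussianAffinity(dpEstimated - dpObserved, sigmaP);
}
```

If T3 and T4 are also head-close, the symmetric term of Eq. 12 would be computed analogously from the heads and averaged with this value.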
4.4 Energy Minimization

For CRF models with submodular energy functions (where B(0, 0) + B(1, 1) ≤ B(0, 1) + B(1, 0)), a globally optimal solution can be found by the graph cut algorithm. However, due to the constraints in Eq. 3, the energy function in our formulation is not submodular. In addition, the one-to-one assumption in Eq. 3 makes regular non-submodular algorithms, such as QPBO (Hammer et al. 1984) and Loopy Belief Propagation (LBP) (Pearl 1998), less efficient and less effective, as most solutions found by these methods would violate the constraint. Instead, we introduce a heuristic algorithm to find a solution in polynomial time.

The unary terms in the CRF model have been shown to be effective for non-difficult pairs by previous work (Kuo et al. 2010). Considering this, we first use the Hungarian algorithm (Perera et al. 2006) to find a globally optimal solution by considering only the unary terms while satisfying the constraints in Eq. 3. Then, we sort the selected associations, i.e., nodes with labels of 1, according to their unary term energies from least to most as A = {vi = (Ti1 → Ti2)}. Next, for each selected node, we consider switching the label of this node and each neighboring node, i.e., a node that is connected with the current node by an edge in the CRF model; if the energy is lower, we keep the change. Similar to Kuo et al. (2010), the unary energy term for self association is set to +∞. Self association denotes that a tracklet is either a false alarm or a complete track and therefore should not be associated with any other tracklet. With such a definition, the algorithm will always select an association for each tracklet no matter how small the link probability is. To avoid over-association, we set a threshold for linking. For each selected tracklet pair, e.g., a node vi with label li = 1, we define its linking cost as

Ci = U(li = 1) + (1/2) Σ_{vj ∈ Ni} B(li = 1, lj = 1)   (15)

Ni = {vj}   s.t.   (vi, vj) ∈ E  &  lj = 1   (16)

When the algorithm chooses a node vi with label li = 1, we only link the two corresponding tracklets in the node when Ci < τ, where τ is a threshold. In our experiments, we have four levels of association to gradually link tracklets into longer tracks; the thresholds are fixed to −ln 0.8, −ln 0.5, −ln 0.1, and −ln 0.05, yet the performance improves on all data sets, indicating that our algorithm is not sensitive to these thresholds. The lower the thresholds, the less likely the tracklets are to be associated, leading to fewer ID switches but more fragments (definitions of these metrics are given in Sect. 5).

Algorithm 1 Finding labels with low energy cost.
Input: tracklets from the previous level S = {T1, T2, ..., Tn}; CRF graph G = (V, E).
1. Find the label set L with the lowest unary energy cost by the Hungarian algorithm, and evaluate its overall energy Ψ by Eq. 2.
2. Sort the nodes with labels of 1 according to their unary energy costs from least to most as {v1, ..., vk}.
3. For i = 1, ..., k do:
   – Set the updated energy Ψ' = +∞.
   – For each vj such that (vi, vj) ∈ E do:
     – Switch the labels li and lj under the constraints in Eq. 3, and evaluate the new energy Ω.
     – If Ω < Ψ', set Ψ' = Ω.
   – If Ψ' < Ψ, update L with the corresponding switch, and set Ψ = Ψ'.
Output: the set of labels L on the CRF graph.

The energy minimization procedure is shown in Algorithm 1. It is a greedy process, and will stop at some local minimum; however, it is guaranteed to produce associations with an energy lower than (or, in the worst case, the same as) associations produced by using unary terms only. Note that the Hungarian algorithm used in the first step of Algorithm 1 has a time complexity of O(n^3), while our heuristic search process has a complexity of O(|E|) = O(n^4); the overall complexity is still polynomial. In addition, as nodes are only defined between tracklets with a proper time gap and edges are only defined between nodes with head-close or tail-close tracklet pairs, the actual number of edges is typically much smaller than n^4. In our experiments, the run time is almost linear in the number of targets, which is less than 30 in typical scenarios.
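A compact sketch of the greedy label-switching step of Algorithm 1 is given below, reusing the CrfGraph, CrfNode, and Labels sketches from Sect. 3.3; the overallEnergy function is assumed to evaluate Eq. 2 and to penalize label sets violating Eq. 3, and the Hungarian initialization of the labels is assumed to have been performed already. It is an illustrative sketch of the heuristic, not the authors' implementation.

```cpp
#include <algorithm>
#include <vector>

// Assumed to exist for this sketch: evaluates Eq. 2 for a full label set and
// returns +infinity if the one-to-one constraints of Eq. 3 are violated.
double overallEnergy(const CrfGraph& g, const Labels& labels);

// Greedy refinement: try switching the label of each selected node against each
// of its CRF neighbors; keep a switch only if it lowers the overall energy.
void refineLabels(const CrfGraph& g, Labels& labels) {
  double best = overallEnergy(g, labels);

  // Nodes currently labeled 1, sorted by unary energy from least to most.
  std::vector<int> selected;
  for (int i = 0; i < (int)g.nodes.size(); ++i)
    if (labels[i] == 1) selected.push_back(i);
  std::sort(selected.begin(), selected.end(), [&](int a, int b) {
    return g.nodes[a].unaryEnergy < g.nodes[b].unaryEnergy;
  });

  for (int i : selected) {
    int bestJ = -1;
    double bestSwapped = best;
    for (const auto& e : g.edges) {
      int j = (e.first == i) ? e.second : (e.second == i) ? e.first : -1;
      if (j < 0) continue;
      std::swap(labels[i], labels[j]);              // tentative switch
      double cand = overallEnergy(g, labels);
      std::swap(labels[i], labels[j]);              // undo
      if (cand < bestSwapped) { bestSwapped = cand; bestJ = j; }
    }
    if (bestJ >= 0) { std::swap(labels[i], labels[bestJ]); best = bestSwapped; }
  }
}
```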
5 Experiments

We evaluated our approach on four public pedestrian data sets: the TUD data set (Andriyenko and Schindler 2011), PETS (2009), Trecvid (2008), and the ETH mobile pedestrian data set (Ess et al. 2009). We show quantitative comparisons with state-of-the-art methods, as well as visualized results of our approach. Though frame rates, resolutions, and densities differ across these data sets, we used the same parameter setting, and performance improves compared to previous methods for all of them. This indicates that our approach has low sensitivity to parameters. All data used in our experiments are publicly available at http://iris.usc.edu/people/yangbo/downloads.html.

5.1 Evaluation Metrics

As it is difficult to use a single score to evaluate tracking performance, we adopt the evaluation metrics defined in Li et al. (2009), including:

– Recall(↑): correctly matched detections / total detections in ground truth.
– Precision(↑): correctly matched detections / total detections in the tracking result.
– FAF(↓): average false alarms per frame.
– GT: the number of trajectories in ground truth.
– MT(↑): the ratio of mostly tracked (MT) trajectories, which are successfully tracked for more than 80 %.
– PT: the ratio of partially tracked trajectories, i.e., 1 − MT − ML.
– ML(↓): the ratio of mostly lost trajectories, which are successfully tracked for less than 20 %.
– Frag(↓): fragments, the number of times that a ground truth trajectory is interrupted.
– IDS(↓): ID switches, the number of times that a tracked trajectory changes its matched id.

For items with ↑, higher scores indicate better results; for those with ↓, lower scores indicate better results.

5.2 Results on Static Camera Videos

We first present our results on data sets captured by static cameras: TUD (Andriyenko and Schindler 2011), PETS (2009), and Trecvid (2008). For fair comparison, we use the same TUD-Stadtmitte and PETS 2009 data set as used in Andriyenko and Schindler (2011). Scenes in these two data sets are not crowded, but people are sometimes occluded by other humans or other objects. Quantitative results are shown in Tables 1 and 2. Note that the approach in Kuo et al. (2010) can be viewed as a baseline; if our pairwise terms are disabled, we only use online learned global models, and our approach degrades to Kuo et al. (2010). By comparing our approach and Kuo et al. (2010), we are able to see the effectiveness of the proposed pairwise terms. We can see that the online CRF tracking results are much improved over those in Andriyenko and Schindler (2011) in both data sets and better than PRIMPT in PETS 2009, but the improvement is not so obvious compared with Kuo et al.
(2010) and Kuo and Nevatia (2011) in the TUD data set: we have higher MT and recall and lower ID switches, but OLDAMs and PRIMPT have higher precision. This is because the online CRF model focuses on better differentiating difficult pairs of targets, but there are not many people in the TUD data set.

Table 1 Comparison of results on TUD dataset

Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 79.9 | 98.8 | 0.014 | 10 | 60.0 | 40.0 | 0.0 | 3 | 0
Energy minimization (Andriyenko and Schindler 2011) | – | – | – | 9 | 60.0 | 30.0 | 0.0 | 4 | 7
PRIMPT (Kuo and Nevatia 2011) | 81.0 | 99.5 | 0.028 | 10 | 60.0 | 30.0 | 10.0 | 0 | 1
Online CRF tracking | 87.0 | 96.7 | 0.184 | 10 | 70.0 | 30.0 | 0.0 | 1 | 0

The OLDAMs and PRIMPT results are provided by Kuo and Nevatia (2011). Our ground truth includes all persons appearing in the video, and has one more person than that in Andriyenko and Schindler (2011).

Table 2 Comparison of results on PETS 2009 dataset

Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 89.6 | 97.9 | 0.110 | 19 | 73.7 | 26.3 | 0.0 | 28 | 1
Energy minimization (Andriyenko and Schindler 2011) | – | – | – | 23 | 82.6 | 17.4 | 0.0 | 21 | 15
PRIMPT (Kuo and Nevatia 2011) | 89.5 | 99.6 | 0.020 | 19 | 78.9 | 21.1 | 0.0 | 23 | 1
Online CRF tracking | 93.0 | 95.3 | 0.268 | 19 | 89.5 | 10.5 | 0.0 | 13 | 0

The OLDAMs and PRIMPT results are provided by Kuo and Nevatia (2011). Our ground truth is more strict than that in Andriyenko and Schindler (2011).

Fig. 8 Tracking examples on TUD (Andriyenko and Schindler 2011) and PETS 2009 (Pets 2009) data sets

Some visual results are shown in Fig. 8; our approach is able to keep correct identities while targets are quite close, such as person 0, person 1, and person 2 in the first row, and person 6, person 11, and person 12 in the second row.

To see the effectiveness of our approach, we further evaluated it on the more difficult Trecvid 2008 data set. There are nine video clips in this data set, each of which has 5,000 frames; these videos are captured in a busy airport, and have a high density of people with frequent occlusions. There are many close track interactions in this data set, leading to a huge number of edges in the CRF graph (more than 200 edges in most cases). The comparison results are shown in Table 3.

Table 3 Comparison of tracking results on Trecvid 2008 dataset

Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
Offline CRF tracking (Yang et al. 2011) | 79.2 | 85.8 | 0.996 | 919 | 78.2 | 16.9 | 4.9 | 319 | 253
OLDAMs (Kuo et al. 2010) | 80.4 | 86.1 | 0.992 | 919 | 76.1 | 19.3 | 4.6 | 322 | 224
PRIMPT (Kuo and Nevatia 2011) | 79.2 | 86.8 | 0.920 | 919 | 77.0 | 17.7 | 5.2 | 283 | 171
Online CRF tracking | 79.8 | 87.8 | 0.857 | 919 | 75.5 | 18.7 | 5.8 | 240 | 147

The human detection results are the same as used in Yang et al. (2011), Kuo et al. (2010), Kuo and Nevatia (2011), and are provided by courtesy of the authors of Kuo and Nevatia (2011).

Fig. 9 Tracking examples on the Trecvid 2008 data set (National Institute of Standards and Technology 2008)

Compared with up-to-date approaches, our online CRF achieves the best performance on the precision, FAF, fragments, and ID switches metrics, while keeping recall and MT as high as 79.8 and 75.5 %. Compared with Kuo and Nevatia (2011), our approach reduces the fragments and the ID switches by about 15 and 14 % respectively. As OLDAMs (Kuo et al. 2010) is equivalent to using only the unary terms in our model, the obvious improvement indicates the utility of the pairwise terms. Figure 9 shows some tracking examples by our approach. We can see that when targets with similar appearances get close, the online CRF can still find discriminative features to distinguish these difficult pairs. However, global appearance
and motion models are not effective enough in such cases, such as person 106 and person 109 in the second row of Fig. 9, who are both in white, move in similar directions, and are quite close.

5.3 Results on Moving Camera Videos

We further evaluated our approach on the ETH data set (Ess et al. 2009), which is captured by a pair of cameras on a moving stroller in busy street scenes. The stroller is mostly moving forward, but sometimes has panning motions, which makes the motion affinity between two tracklets less reliable. Many people have speeds of 2–4 pixels per frame; however, the camera frequently induces motion of more than 20 pixels per frame. This makes linear motion estimation imprecise. Nevertheless, as all people shift according to the same camera motion, relative positions do not change much in one frame. For fair comparison, we choose the "BAHNHOF" and "SUNNY DAY" sequences used in Kuo and Nevatia (2011) for evaluation. They have 999 and 354 frames respectively; people are under frequent occlusions due to the low view angles of the cameras. As in Kuo and Nevatia (2011), we also use the sequence from the left camera; no depth or ground plane information is used. The quantitative results are shown in Table 4.

Table 4 Comparison of tracking results on ETH dataset

Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 52.0 | 91.0 | 0.398 | 125 | 25.8 | 38.7 | 35.5 | 20 | 25
PRIMPT (Kuo and Nevatia 2011) | 76.8 | 86.6 | 0.891 | 125 | 58.4 | 33.6 | 8.0 | 23 | 11
Online CRF tracking | 79.0 | 90.4 | 0.637 | 125 | 68.0 | 24.8 | 7.2 | 19 | 11

The human detection results are the same as used in Kuo et al. (2010), Kuo and Nevatia (2011), and are provided by Kuo and Nevatia (2011).

We can see that our approach achieves better or the same performance on all evaluation scores. Compared with the state-of-the-art method (Kuo and Nevatia 2011), the MT score is improved by about 10 %; fragments are reduced by 17 %; recall and precision are improved by about 2 and 4 % respectively. The obvious improvement in the MT and fragment scores indicates that our approach can better track targets under moving camera conditions, where traditional motion models are less reliable.

Fig. 10 Tracking examples on the ETH data set (Ess et al. 2009)

Fig. 11 Tracking examples using only unary terms of our approach on the same frames as shown in Fig. 2

Figure 10 shows some visual tracking results by our online CRF approach. Both examples show obvious panning movements of the cameras. Traditional motion models, i.e., the unary motion models in our online CRF, would produce low affinity scores for tracklets belonging to the same targets. However, by considering pairwise terms, relative positions are helpful for connecting the correct tracklets into one. Figure 11 shows tracking examples using only unary terms on the same frames as shown in Fig. 2. In the first row of Fig. 11, we can see that an ID switch occurs between two close humans due to camera motions and similar appearances. However, by using both unary and pairwise terms, we are able to find discriminative appearance models to distinguish them and build more reliable motion affinities; thus, we can track them correctly as shown in Fig. 2. In the second row of Fig.
11, both person 135 and 136 are tracked at frame 3,692; however, person 136 is lost at frame 3,757, because the person makes a sharp direction change causing the linear motion model failed. By further considering pairwise terms, we are able to locate the non-linear path and to track both persons correctly as shown in the second row of Fig. 2. 5.4 Computational Speed As discussed in Sect. 4.4, the complexity of our algorithm is polynomial in the number of tracklets in the worst case. Our experiments were performed on an Intel 3.0 GHz PC with 8 GB memory, and the code is implemented in C++. For the less crowded TUD, PETS 2009, and ETH data sets, the speed is 10, 14, and 10 fps, respectively, not including detection time cost; for crowded Trecvid 2008 data set, the speed is about 6 fps. Compared with the speed of 7 fps for Trecvid 2008 reported in Kuo and Nevatia (2011), the online CRF does not add much to the computation cost. 3 6 Conclusion We described an online CRF framework for multi-target tracking. This CRF considers both global descriptors for distinguishing different targets as well as pairwise descriptors for differentiating difficult pairs. Unlike global descriptors, pairwise motion and appearance models are learned from corresponding difficult pairs, and represented by pairwise terms in the CRF energy function. An effective algorithm is introduced to efficiently find associations with low energy, and the experiments show significantly improved results compared with previous methods. Future improvement can be achieved by adding camera motion inference into pairwise motion models. 3 Detection time costs are not included in either measurements. 123 Author's personal copy Int J Comput Vis Acknowledgments Research was sponsored, in part, by Office of Naval Research under Grant number N00014-10-1-0517 and by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0063. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. References Andriyenko, A., & Schindler, K. (2011). Multi-target tracking by continuous energy minimization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, pp. 1265–1272. Andriyenko, A., Schindler, K., & Roth, S. (2012). Discrete–continuous optimization for multi-target tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp. 1926–1933. Breitenstein, M. D., Reichlin, F., Leibe, B., Koller-Meier, E., & Gool, L. V. (2011). Online multi-person tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1820–1833. Comaniciu, D., Ramesh, V., & Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hilton Head Island, SC, USA, pp. 142–149. Duan, G., Ai, H., Cao, S., & Lao, S. (2012). Group tracking: Exploring mutual relations for multiple object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, pp. 129–143. Ess, A., Leibe, B., Schindler, K., & van Gool, L. (2009). 
Robust multiperson tracking from a mobile platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1831–1846. Grabner, H., Matas, J., Gool, L. V., & Cattin, P. (2010). Tracking the invisible: Learning where the object might be. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 1285–1292. Hammer, P. L., Hansen, P., & Simeone, B. (1984). Roof duality, complementation and persistency in quadratic 0–1 optimization. Mathematical Programming, 28(2), 121–155. Holzer, S., Pollefeys, M., Ilic, S., Tan, D., & Navab, N. (2012). Online learning of linear predictors for real-time tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, pp. 470–483. Huang, C., & Nevatia, R. (2010). High performance object detection by collaborative learning of joint ranking of granule features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 41–48. Huang, C., Wu, B., & Nevatia, R. (2008). Robust object tracking by hierarchical association of detection responses. In Proceedings of the European Conference on Computer Vision (ECCV), Marseille, France, pp. 788–801. Isard, M., & Blake, A. (1998). Condensation—conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1), 5–28. Kalal, Z., Matas, J., & Mikolajczyk, K. (2010). P–N learning: Bootstrapping binary classifiers by structural constraints. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 49–56. Kuo, C.-H., & Nevatia, R. (2011). How does person identity recognition help multi-person tracking? In Proceedings of IEEE Confer- 123 ence on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, pp. 1217–1224. Kuo, C.-H., Huang, C., & Nevatia, R. (2010). Multi-target tracking by on-line learned discriminative appearance models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 685–692. Li, Y., Huang, C., & Nevatia, R. (2009). Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, pp. 2953–2960. National Institute of Standards and Technology: Trecvid 2008 evaluation for surveillance event detection. Retrieved October 1, 2012 from http://www.nist.gov/speech/tests/trecvid/2008/. Pearl, J. (1998). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann. Perera, A. G. A., Srinivas, C., Hoogs, A., Brooksby, G., & Hu, W. (2006). Multi-object tracking through simultaneous long occlusions and split-merge conditions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, USA, pp. 666–673. Pets 2009 dataset. Retrieved October 1, 2012 from http://www.cvg.rdg. ac.uk/PETS2009. Pirsiavash, H., Ramanan, D., & Fowlkes, C. C. (2011). Globally-optimal greedy algorithms for tracking a variable number of objects. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, pp. 1201–1208. Shitrit, H. B., Berclaz, J., Fleuret, F., & Fua, P. (2011). Tracking multiple people under global appearance constraints. In Proceedings of IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, pp. 137–144. Song, B., Jeng, T. Y., Staudt, E., & Roy-Chowdhury, A. K. (2010). 
A stochastic graph evolution framework for robust multi-target tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Crete, Greece, pp. 605–619. Stalder, S., Grabner, H., & Gool, L. V. (2010). Cascaded confidence filtering for improved tracking-by-detection. In Proceedings of the European Conference on Computer Vision (ECCV), Crete, Greece, pp. 369–382. Wang, S., Lu, H., Yang, F., & Yang, M.-H. (2011). Superpixel tracking. In Proceedings of IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, pp. 1323–1330. Xing, J., Ai, H., & Lao, S. (2009). Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, pp. 1200–1207. Xing, J., Ai, H., Liu, L., & Lao, S. (2011). Multiple players tracking in sports video: A dual-mode two-way bayesian inference approach with progressive observation modeling. IEEE Transaction on Image Processing, 20(6), 1652–1667. Yang, B., & Nevatia, R. (2012a). Online learned discriminative partbased appearance models for multi-human tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, pp. 484–498. Yang, B., & Nevatia, R. (2012b). An online learned CRF model for multi-target tracking. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp. 2034–2041. Yang, B., Huang, C., & Nevatia, R. (2011). Learning affinities and dependencies for multi-target tracking using a CRF model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, pp. 1233–1240. Yu, Q., Medioni, G., & Cohen, I. (2007). Multiple target tracking using spatio-temporal markov chain monte carlo data association. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, pp. 1–8. Author's personal copy Int J Comput Vis Zamir, A. R., Dehghan, A., & Shah, M. (2012). GMCP-Tracker: Global multi-object tracking. Using generalized minimum clique graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, pp. 343–356. Zhang, L., Li, Y., & Nevatia, R. (2008). Global data association for multi-object tracking using network flows. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, pp. 1–8. Zhang, T., Ghanem, B., Liu, S., & Ahuja, N. (2012a). Robust visual tracking via multi-task sparse learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp. 2042–2049. Zhang, K., Zhang, L., & Yang, M.-H. (2012b). Real-time compressive tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, pp. 864–877. 123