Int J Comput Vis
DOI 10.1007/s11263-013-0666-4
Multi-Target Tracking by Online Learning a CRF Model
of Appearance and Motion Patterns
Bo Yang · Ramakant Nevatia
Received: 21 December 2012 / Accepted: 7 October 2013
© Springer Science+Business Media New York 2013
Abstract We introduce an online learning approach for
multi-target tracking. Detection responses are gradually associated into tracklets in multiple levels to produce final tracks.
Unlike most previous approaches which only focus on producing discriminative motion and appearance models for all
targets, we further consider discriminative features for distinguishing difficult pairs of targets. The tracking problem is
formulated using an online learned CRF model, and is transformed into an energy minimization problem. The energy
functions include a set of unary functions that are based on
motion and appearance models for discriminating all targets,
as well as a set of pairwise functions that are based on models for differentiating corresponding pairs of tracklets. The
online CRF approach is more powerful at distinguishing spatially close targets with similar appearances, as well as at tracking targets in the presence of camera motion. An efficient
algorithm is introduced for finding an association with low
energy cost. We present results on four public data sets, and
show significant improvements compared with several state-of-the-art methods.
Keywords Multi-target tracking · Online learned CRF ·
Appearance and motion patterns · Association based
tracking
Electronic supplementary material The online version of this
article (doi:10.1007/s11263-013-0666-4) contains supplementary
material, which is available to authorized users.
B. Yang (B) · R. Nevatia
Institute for Robotics and Intelligent Systems,
University of Southern California, Los Angeles, CA, USA
e-mail: [email protected]; [email protected]
R. Nevatia
e-mail: [email protected]
1 Introduction
Automatic tracking of multiple targets is an important problem in computer vision, and is helpful for many real-world applications and for high-level analysis, e.g., event detection, crowd analysis, and robotics. The aim of the task is to automatically locate the targets of interest, give a unique id to each, and compute their trajectories. Due to significant improvements in
object detection techniques in recent years, many researchers
have proposed tracking methods that associate object detection responses into tracks, i.e., Association Based Tracking (ABT) methods (Perera et al. 2006; Xing et al. 2009;
Andriyenko and Schindler 2011; Pirsiavash et al. 2011; Yang
and Nevatia 2012a). They often apply an offline learned
detector to each frame to provide detection responses, and
then link these responses into tracklets, i.e., track fragments;
tracklets are further associated into longer tracks in one or
multiple steps. The probability of association of two tracklets is based on motion smoothness and appearance similarity; a global solution with maximum probability is often found by the Hungarian algorithm (Perera et al. 2006; Xing et
al. 2009), MCMC (Yu et al. 2007; Song et al. 2010), or network flow analysis (Zhang et al. 2008; Pirsiavash et al. 2011),
etc.
Association based approaches are powerful at dealing with extended occlusions between targets, and their complexity is polynomial in the number of targets. To associate each target, motion and appearance information is often used to
produce discriminative descriptors. Motion descriptors are
often based on speed and distance between tracklet pairs,
while appearance descriptors are often based on global or
part based color histograms to distinguish different targets.
However, how to better distinguish nearby targets and how
to deal with motion in moving cameras remain key issues
that limit the performance of ABT. In this paper, we propose
Fig. 1 Examples of tracking results by our approach
an online learned conditional random field (CRF) model to
reliably track targets under camera motions and to better discriminate different targets, especially difficult pairs, which
are spatially near targets with similar appearance. Figure 1
shows some tracking examples by our approach, which is
able to track targets under heavy occlusion, camera motion,
and crowded scenes.
In most previous ABT work, appearance models are predefined (Xing et al. 2009; Pirsiavash et al. 2011) or online
learned to discriminate all targets from one another (Kuo et al.
2010; Shitrit et al. 2011) or to discriminate one target from
all others (Breitenstein et al. 2011; Kuo and Nevatia 2011).
Though such learned appearance models are able to distinguish many targets, they are not so effective at differentiating
difficult pairs, i.e., close targets with similar appearances. The
discriminative features for a difficult pair are possibly quite different from those for distinguishing them from all other targets.
Linear motion models have been widely used in previous
tracking work (Stalder et al. 2010; Yang et al. 2011; Breitenstein et al. 2011); linking probabilities between tracklets are
often based on how well a pair of tracklets satisfies a linear
motion assumption, i.e., assuming that targets move without
changing directions and with constant speeds. However, as
shown in the first row of Fig. 2, if the view angle changes
due to camera motion, motion is not linear; this could conceivably be compensated for by camera motion inference
techniques, but this is a challenging task by itself. Relative
positions between targets are less influenced by view angles,
as all targets shift together when the camera moves; they may
be more stable than linear motion models for estimation of
positions under camera motions. For static cameras, relative
positions are still helpful, as shown in the second row of
Fig. 2; some targets may not follow a linear motion model,
and relative positions between co-moving targets are useful
for recovering errors of a linear motion model.
Based on the above observations, we propose an online learning approach, which formulates the multi-target tracking
problem as one of inference in a CRF model as shown in
Fig. 3. Our CRF framework incorporates global models for
distinguishing all targets and pairwise models for differentiating difficult pairs of targets.
Given detection responses for a video, low level tracklets are first built from reliable response pairs in neighboring
frames as shown in Fig. 3. All linkable tracklet pairs form
the nodes in this CRF model, and labels of each node (1 or
0) indicate whether two tracklets may be linked or not. The
energy cost for each node is estimated based on its observation, e.g., the green rectangles in Fig. 3. The observation
is composed of global motion and appearance affinities, similar to Kuo et al. (2010). As shown in the global models
in Fig. 3, the global motion affinity is based on linear motion
models between the tracklet pairs in CRF nodes, e.g., using
Δs1 and Δs2 , the distances between estimated locations and
actual locations, to compute motion probability. The global
appearance affinity is based on an online learned appearance
model which best distinguishes all tracklets; we can see that
the positive pairs, e.g., those labeled with +1, are responses
from the same tracklet, and the negative pairs, e.g., those
labeled with −1, are responses from temporally overlapping tracklets; the learned appearance model produces high and low probability scores for positive and negative pairs, respectively.
Fig. 2 Examples of relative positions and linear estimations. In the 2D map, filled circles indicate positions of persons at the earlier frames; dashed circles and solid circles indicate positions estimated by linear motion models and real positions at the later frames, respectively
Fig. 3 Tracking framework of our approach. In the CRF model, each
node denotes a possible link between a tracklet pair and has a unary
energy cost based on global appearance and motion models; each edge
denotes a correlation between two nodes and has a pairwise energy cost
based on discriminative appearance and motion models specifically for
the two nodes. Colors of detection responses indicate their belonging
tracklets. For clarity purpose, we only show four nodes in the CRF
model. Best viewed in color (Color figure online)
The same global motion and appearance models are
used for all CRF nodes. Energy cost for an edge is based on
discriminative pairwise models, i.e., appearance and motion
descriptors, that are learned online for distinguishing tracklets in the two connected CRF nodes. As shown in Fig. 3, positive pairs for the pairwise appearance models are extracted
from only the concerned two tracklets, e.g., the green and
the blue tracklets in pairwise model 1, and each negative
pair consists of one response in the green tracklet and one
response in the blue tracklet. The pairwise appearance model
is for discriminating a specific pair of targets without considering others. The pairwise motion model is based on the relative distance between the two concerned targets; for example, in
pairwise model 1 in Fig. 3, we evaluate the relative distance
Δs between the green and the blue tracklets after Δt, and
use Δs to match other close-by tracklet pairs (see Sect. 4
for details). This considers the relative distance between a specific pair of targets that may move together. Global and pairwise
models are used to produce unary and pairwise energy functions respectively, and the tracking problem is transformed
into an energy minimization task.
The contributions of this paper are:
– A CRF framework for modeling both global tracklet affinity models and pairwise motion and discriminative appearance models.
– An online learning approach for producing unary and pairwise energy functions in a CRF model.
– An approximation algorithm for efficiently finding good
tracking solutions with low energy costs.
– An improved tracking algorithm with evaluation on standard data sets.
Some parts of this work were presented in Yang and Nevatia (2012b); we provide more technical details as well as more
experiments in this paper.
The rest of the paper is organized as follows: related work
is discussed in Sect. 2; our tracking framework is given in
Sect. 3; Sect. 4 describes the online learning approach for a
CRF model; experiments are shown in Sect. 5, followed by
conclusion in Sect. 6.
2 Related Work
Multi-target tracking has been an active topic in computer
vision research. The main challenges focus on how to distinguish targets from the background and from each other,
and how to deal with missed detections, false alarms, and
occlusions.
Category free tracking (CFT) methods, often called visual
tracking approaches, focus on tracking a single object or mul-
tiple objects separately (Kalal et al. 2010; Wang et al. 2011;
Zhang et al. 2012a,b; Holzer et al. 2012). These methods
are able to track a target without regard to its category, given a manual initialization; no offline trained detector is needed,
and tracking is usually based on appearance only to deal
with abrupt motions. From the given label in the first frame,
an appearance model is learned to discriminate the target
from all other regions; the appearance model is usually updated online to deal with illumination and view angle changes.
Meanshift (Comaniciu et al. 2000) or particle filtering (Isard
and Blake 1998) approaches are often adopted to make tracking decisions. Spatially correlated regions are often considered together to reduce the risk of drifting (Grabner et al.
2010; Duan et al. 2012). CFT methods are able to track targets under severe motions and large appearance changes, but
once the result drifts it is very difficult to recover.
Most ABT methods focus on tracking multiple objects of
a pre-known class simultaneously (Perera et al. 2006; Song et
al. 2010; Andriyenko and Schindler 2011; Xing et al. 2011).
They usually associate detection responses produced by a
pre-trained detector into long tracks, and find a global optimal solution for all targets. The key issue in ABT is how
to measure association probabilities between tracklets; the
probability is often based on appearance and motion cues.
In some ABT work, appearance and motion models are predefined (Perera et al. 2006; Huang et al. 2008; Pirsiavash et
al. 2011; Breitenstein et al. 2011; Andriyenko et al. 2012). To
automatically find appropriate weights for different cues, Li
et al. (2009) proposed an offline learning framework to integrate multiple cues; then in order to better differentiate different targets, online learned discriminative appearance models
are proposed to distinguish multiple targets globally (Kuo
et al. 2010; Kuo and Nevatia 2011; Shitrit et al. 2011). Association probabilities between tracklets are usually assumed to be independent of each other; Yang et al. (2011) introduced
an offline trained CRF model to relax this assumption. However, global appearance models are not necessarily able to
differentiate difficult pairs of targets, i.e., those that are spatially close and have similar appearances. For example, the
best global appearance model may be the color histogram of
upper body, but the most discriminative models for a pair of
near-by persons may be the histogram of lower body. In addition, linear motion models between tracklet pairs (Stalder et
al. 2010; Breitenstein et al. 2011; Zamir et al. 2012) are often
adopted to constrain motion smoothness. However, camera
motion may invalidate the linear motion assumption, and targets may sometimes follow non-linear paths. Our online CRF
approach considers both global and pairwise appearance and
motion models.
Note that CRF models have also been utilized in Yang et
al. (2011). Both Yang et al. (2011) and our approach relax the
assumption that associations between tracklet pairs are independent of each other. However, Yang et al. (2011) focused
on modeling association dependencies, while our approach
aims at a better distinction between difficult pairs of targets
and therefore the meanings of edges in the CRF are different. In addition, Yang et al. (2011) is an offline approach that
integrates multiple cues on pre-labeled ground truth data, but
our approach is an online learning method that finds discriminative models automatically without pre-labeled data.
3 ABT Framework
Our tracking framework is composed of three modules:
object detection to provide responses for association, a sliding window method to improve efficiency and reduce latency,
and CRF tracking in one sliding window. Given an input
video, an object detector is first applied to each frame to
produce detection responses. Then based on appearance and
motion cues, these responses are gradually linked into tracks;
some of the missed detections are inserted by interpolation,
and some of the false detections are removed.
3.1 Object Detection
ABT associates detection responses into tracks; therefore,
the detection performance is an important factor in the quality of the resulting tracks. Though much progress has been
made on detection techniques, state-of-the-art detectors are still far from perfect, and there is always a trade-off between precision and recall. In our framework, we prefer a detector that has high precision and reasonable recall, e.g., over 80 % precision and over 70 % recall. In an association based framework, gaps between responses can be filled by measuring
their appearance similarity and motion smoothness; however,
spatially and temporally close false detections may produce
false tracks that are much more difficult to remove.
In this paper, we focus on automatically tracking multiple
pedestrians. We adopt the JRoG pedestrian detector (Huang
and Nevatia 2010) due to its high accuracy and computational
efficiency; also it is a part based detector and is thus able to
deal with partial occlusions to some extent.
3.2 Sliding Window Method for Tracking a Long Video
To reduce latency, we track targets in sliding windows instead
of in the whole video. In the whole video, there may be many
targets, most of which may not have temporal overlap. Therefore, global optimization on the whole video is impractical
due to the huge search space and long latency, and is also
unnecessary as many targets only exist for some period but
not the whole video.
To produce tracks across sliding windows, we use overlapping neighboring windows. An example is given in Fig. 4.
Given an input video, we first apply the object detector in the
first sliding window and track within the window. Then we
move the window with a step of half of its size, so that the new
window has a 50 % overlap with the previous one, and apply
the detector in the new frames (frames 9–12 in this example). We further evenly divide the overlapped part into two;
in the first 25 % of the new window (frames 5, 6), we keep
the tracking results produced from the previous window; in
the 25–50 % of the new window (frames 7, 8), we break all
tracking results into original detection responses. After that,
the tracking algorithm is applied in the new window; this
approach can produce tracks across sliding windows. Note
that the tracking results in frames 7, 8 could be different in
neighboring windows; we trust the results in the later window, as there are more frames around frames 7, 8 in the new window, providing more information for making the association decisions.
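As a concrete illustration of this scheme, the sketch below enumerates the windows together with the frame ranges that are kept from the previous window or broken back into detections; it is illustrative only (the paper's implementation is in C++), and the small window size is chosen purely for readability.

```python
def sliding_window_plan(num_frames, window_size):
    """Sketch of the sliding-window schedule for tracking a long video.
    Each new window overlaps the previous one by 50%: results in its first
    quarter are kept (frozen), results in its second quarter are broken back
    into raw detections and re-associated within the new window."""
    step = window_size // 2        # move the window by half of its size
    quarter = window_size // 4
    plan, start = [], 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        if start == 0:
            frozen, re_assoc = None, None               # nothing to reuse yet
        else:
            frozen = (start, start + quarter)           # keep previous results here
            re_assoc = (start + quarter, start + step)  # re-associate detections here
        plan.append({"window": (start, end), "keep": frozen, "re_associate": re_assoc})
        start += step
    return plan

# Example with an 8-frame window, mirroring Fig. 4 (the paper uses 8 s windows).
for w in sliding_window_plan(num_frames=20, window_size=8):
    print(w)
```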
Setting the size of a sliding window creates a trade-off between latency and accuracy. More frames in a sliding window bring more information, and therefore the tracking results are more accurate; however, this requires more cached frames in a window, which leads to longer latency. In the experiments reported later, we fix the sliding window to be eight seconds long with a 50 % overlap between neighboring windows; therefore, our algorithm has a two second latency but is able to produce robust results even in complex scenes.
3.3 CRF Formulation for Tracking in One Temporal
Window
Given detection responses in a temporal window, our task
is to produce reliable tracks within this window. Similar to
Kuo et al. (2010), we adopt a low level association process
to connect detection responses in neighboring frames into
reliable tracklets, and then associate the tracklets progressively in multiple levels. The low level association only links
responses in neighboring frames that have high appearance
similarities and small spatial distances. There are two advantages to associating tracklets rather than the raw detections:
(1) the search space is reduced, as we explore all linkable
tracklet pairs instead of all detection pairs; (2) motion cues
are extracted more reliably from tracklets than from single
frame detections.
We formulate notations to be used as follows:
– S = {T1 , T2 , . . . , Tn }: the input tracklets produced in the
previous level; low level association results are used as the
input for the first level.
– T_i = {d_i^{t_i^s}, . . . , d_i^{t_i^e}}: a tracklet, defined as a set of detection or interpolated responses in consecutive frames, starting at frame t_i^s and ending at frame t_i^e. It is possible that t_i^s = t_i^e.
Fig. 4 Illustration for the sliding window approach. Circles denote detection responses, and the colors indicate their appearances. There is a 50 %
overlap between neighboring sliding windows. See text for details (Color figure online)
– d_i^t = {p_i^t, s_i^t, v_i^t}: a detection or interpolated response at frame t, at position p_i^t, with size s_i^t and velocity vector v_i^t.
– G = {V, E}: the CRF graph created for the current level.
– V = {vi = (Ti1 → Ti2 )}: all possible association pairs of
tracklets in the current level.
– E = {e j = (v j1 , v j2 )}: all possible correlated pairs of
nodes in V .
– L = {l1 , l2 , . . . , lm }: the labels for all possible association pairs of tracklets; li = 1 indicates Ti1 is linked to Ti2 ,
and li = 0 indicates the opposite.
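To make the notation above concrete, one possible in-memory representation of responses, tracklets, and CRF nodes is sketched below; this is an illustrative reading of the definitions, not the authors' data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Response:
    """A detection or interpolated response d_i^t at frame t."""
    frame: int
    position: Tuple[float, float]   # p_i^t
    size: float                     # s_i^t
    velocity: Tuple[float, float]   # v_i^t

@dataclass
class Tracklet:
    """T_i: responses in consecutive frames, from frame t_i^s to t_i^e."""
    responses: List[Response]

    @property
    def start_frame(self) -> int:   # t_i^s
        return self.responses[0].frame

    @property
    def end_frame(self) -> int:     # t_i^e
        return self.responses[-1].frame

@dataclass
class Node:
    """v_i = (T_i1 -> T_i2): a possible link between two tracklets; its
    label l_i = 1 means T_i1 is linked to T_i2, l_i = 0 means it is not."""
    i1: int         # index of T_i1 in S
    i2: int         # index of T_i2 in S
    label: int = 0  # l_i

# An edge e_j = (v_j1, v_j2) is simply a pair of node indices.
Edge = Tuple[int, int]
```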
At each level of association, we aim to find the best set of associations with the highest probability. We formulate the tracking problem as finding the best L given S:

L* = argmax_L P(L|S) = argmax_L (1/Z) exp(−Ψ(L|S))   (1)

where Z is a normalization factor, and Ψ is a cost function. Assuming that the joint distributions of more than two associations do not make contributions to P(L|S), we have

L* = argmin_L Ψ(L|S) = argmin_L ( Σ_i U(l_i|S) + Σ_{ij} B(l_i, l_j|S) )   (2)

where U(l_i|S) = −ln P(l_i|S) and B(l_i, l_j|S) = −ln P(l_i, l_j|S) denote the unary and pairwise energy functions respectively. In Eq. 2, U(l_i|S) defines the linking probability between two tracklets based on global appearance and motion models, while B(l_i, l_j|S) defines the correlation between tracklet pairs based on discriminative models learned for the corresponding pairs of tracklets.
We model the tracking problem by finding appropriate labels in a CRF model, which is learned in each sliding window repeatedly. For simplicity, we assume that one tracklet cannot be associated with more than one tracklet; this makes energy minimization efficient. In the detection process, nearby responses are merged in a conservative way, so multiple targets are unlikely to be detected as one object, but multiple responses may exist for one target. Any valid label set L should satisfy

Σ_{v_i ∈ Head_{i1}} l_i ≤ 1  &  Σ_{v_i ∈ Tail_{i2}} l_i ≤ 1   (3)

where Head_{i1} = {(T_{i1} → T_j) ∈ V} ∀T_j ∈ S and Tail_{i2} = {(T_j → T_{i2}) ∈ V} ∀T_j ∈ S; the first constraint limits any tracklet T_{i1} to link to at most one other tracklet, and the second constraint limits that at most one tracklet may be linked to any tracklet T_{i2}.

4 Online Learning of CRF Models

In this section, we describe our tracking approach in several steps, including the CRF creation, the online learning of unary and pairwise terms, as well as how to find an association label set with the lowest overall energy as in Eq. 2.
4.1 CRF Creation for Tracklet Association
In order to create the CRF graph, we first find all linkable
tracklet pairs. Tracklet Ti is linkable to T j if the gap between
the end of Ti and the beginning of T j satisfies
0 < t_j^s − t_i^e < t_max   (4)

where t_max is a threshold on the maximum gap between any linkable pair of tracklets, and t_j^s and t_i^e denote the start frame of T_j and the end frame of T_i respectively. The "> 0" constraint prevents temporally overlapping tracklets from being connected, as one target cannot appear in two positions at the same time; the "< t_max" constraint prevents tracklets with large temporal gaps from being connected, as it is less safe to assume that tracklets separated by a long gap belong to the same target. In our implementation, t_max is fixed to 4 s, i.e., half of our sliding window size. If a target disappears or is occluded for more than 4 s, it is treated as a new target.
Fig. 5 Examples of head-close and tail-close tracklet pairs. The x-axis denotes the frame number, and the y-axis denotes the position
We create the set of nodes V in the CRF to model all linkable tracklet pairs as

V = {v_i = (T_{i1} → T_{i2})} s.t. T_{i1} is linkable to T_{i2}   (5)
Before defining the set of edges E in CRF, we first define
head-close and tail-close pairs of tracklets. Two tracklets Ti
and T_j are a head-close pair if they satisfy (suppose t_i^s ≥ t_j^s)

t_i^s + ζ < t_i^e  &  t_i^s + ζ < t_j^e  &  ∀t ∈ [t_i^s, t_i^s + ζ]: ||p_i^t − p_j^t|| < γ min{s_i^t, s_j^t}   (6)
where γ is a distance control factor (set to 3 in our experiments), and ζ is a temporal length threshold set to one second long in our experiments; see Fig. 5 for an illustration.
Note that in Eq. 6, the distance is normalized to the size of
target. This definition indicates that the head part of Ti is
spatially close to T j for a certain period, suggesting possible
co-movement of the two targets. The definition of tail-close
is similar; two tracklets Tk and Tm are a tail-close pair if they
satisfy (suppose t_k^e ≥ t_m^e)

t_m^e − ζ > t_k^s  &  t_m^e − ζ > t_m^s  &  ∀t ∈ [t_m^e − ζ, t_m^e]: ||p_k^t − p_m^t|| < γ min{s_k^t, s_m^t}   (7)
We define the set of edges E as

E = {(v_i, v_j)}  ∀v_i, v_j ∈ V  s.t. T_{i1} and T_{j1} are tail-close, or T_{i2} and T_{j2} are head-close.   (8)
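For concreteness, the construction of V and E in Eqs. 4–8 could be sketched as follows; the tracklet accessors (start_frame, end_frame, position_at, size_at) are hypothetical, and t_max, ζ and γ take the values given in the text (4 s, 1 s and 3).

```python
import math

def linkable(Ti, Tj, t_max):
    """Eq. 4: T_i is linkable to T_j if 0 < t_j^s - t_i^e < t_max."""
    gap = Tj.start_frame - Ti.end_frame
    return 0 < gap < t_max

def close_over(Ta, Tb, frames, gamma):
    """Both targets stay within gamma * min(size) over the given frames (Eqs. 6-7)."""
    for t in frames:
        pa, pb = Ta.position_at(t), Tb.position_at(t)
        if math.dist(pa, pb) >= gamma * min(Ta.size_at(t), Tb.size_at(t)):
            return False
    return True

def head_close(Ti, Tj, zeta, gamma):
    """Eq. 6 (with t_i^s >= t_j^s): the first zeta frames of T_i stay close to T_j."""
    if Ti.start_frame < Tj.start_frame:
        Ti, Tj = Tj, Ti
    if not (Ti.start_frame + zeta < Ti.end_frame and Ti.start_frame + zeta < Tj.end_frame):
        return False
    return close_over(Ti, Tj, range(Ti.start_frame, Ti.start_frame + zeta + 1), gamma)

def tail_close(Tk, Tm, zeta, gamma):
    """Eq. 7 (with t_k^e >= t_m^e): the last zeta frames of T_m stay close to T_k."""
    if Tk.end_frame < Tm.end_frame:
        Tk, Tm = Tm, Tk
    if not (Tm.end_frame - zeta > Tk.start_frame and Tm.end_frame - zeta > Tm.start_frame):
        return False
    return close_over(Tk, Tm, range(Tm.end_frame - zeta, Tm.end_frame + 1), gamma)

def build_crf_graph(S, t_max, zeta, gamma):
    """Nodes: linkable tracklet pairs (Eq. 5).  Edges: node pairs whose first
    tracklets are tail-close or whose second tracklets are head-close (Eq. 8)."""
    V = [(i1, i2) for i1 in range(len(S)) for i2 in range(len(S))
         if i1 != i2 and linkable(S[i1], S[i2], t_max)]
    E = []
    for a in range(len(V)):
        for b in range(a + 1, len(V)):
            (i1, i2), (j1, j2) = V[a], V[b]
            if tail_close(S[i1], S[j1], zeta, gamma) or head_close(S[i2], S[j2], zeta, gamma):
                E.append((a, b))
    return V, E
```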
The cost of edges is based on pairwise motion and appearance
models as shown in Fig. 3. Such pairs are often more difficult
to differentiate than spatially far-away targets; on the other hand, if they follow the same motion patterns, e.g., the same linear or non-linear motion path, each is helpful for estimating the motion of the other. Note that a CRF model is also used in Yang et al.
(2011); however, there, the edges are used for modeling association dependencies and not for possible co-moving targets.
Fig. 6 Global motion models for unary terms in the CRF. The x-axis denotes the frame number, and the y-axis denotes the position
4.2 Learning of Unary Terms
Unary terms in Eq. 2 define the energy cost for associating pairs of tracklets. As defined in Sect. 3.3, U (li |S) =
− ln P(li |S). We further divide the probability into a motion
based probability Pm (·) and an appearance based probability
Pa (·) as
U(l_i = 1|S) = −ln( P_m(T_{i1} → T_{i2}|S) · P_a(T_{i1} → T_{i2}|S) )   (9)
Pm is a function of the distance between estimations of
positions using the linear motion model and the real positions as in Kuo et al. (2010), Yang et al. (2011), and Kuo
and Nevatia (2011). As shown in Fig. 6, the distances are
computed as Δp_1 = p^head − v^head·Δt − p^tail and Δp_2 = p^tail + v^tail·Δt − p^head. The motion probability between tracklets T_{i1} and T_{i2} is defined as

P_m(T_{i1} → T_{i2}|S) = G(Δp_1, Σ_p) · G(Δp_2, Σ_p)   (10)

where G(·, Σ) is the zero-mean Gaussian function, and Δt is the frame difference between p^tail and p^head.
For Pa (·), we adopt the online learned discriminative
appearance models (OLDAMs) defined in Kuo et al. (2010),
which focus on learning appearance models with good global
discriminative abilities between targets. Responses from the
same tracklet are used as positive samples, and those from temporally overlapping tracklets are used as negative samples. RealBoost is adopted to learn a single global appearance model which provides high P_a(·) for responses in positive samples and low P_a(·) for responses in negative samples. Features include color, texture, and shape for different
regions of targets as in Kuo et al. (2010).
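As an illustration of Eqs. 9–10, a minimal sketch of the unary affinity is given below; the isotropic Σ_p, the tracklet accessors (tail_position, tail_velocity, head_position, head_velocity, start_frame, end_frame) and the externally supplied appearance probability P_a are assumptions, not the authors' API.

```python
import numpy as np

def gaussian_zero_mean(x, sigma):
    """Zero-mean Gaussian density G(x, Sigma_p) with an isotropic std sigma."""
    x = np.asarray(x, dtype=float)
    return ((2 * np.pi * sigma ** 2) ** (-x.size / 2)
            * np.exp(-float(np.dot(x, x)) / (2 * sigma ** 2)))

def unary_motion_prob(T1, T2, sigma_p):
    """Eq. 10: P_m(T_i1 -> T_i2 | S) from the linear motion model, evaluated
    forwards from T1's tail and backwards from T2's head."""
    p_tail = np.asarray(T1.tail_position); v_tail = np.asarray(T1.tail_velocity)
    p_head = np.asarray(T2.head_position); v_head = np.asarray(T2.head_velocity)
    dt = T2.start_frame - T1.end_frame       # frame gap between tail and head
    dp1 = p_head - v_head * dt - p_tail      # backward extrapolation error
    dp2 = p_tail + v_tail * dt - p_head      # forward extrapolation error
    return gaussian_zero_mean(dp1, sigma_p) * gaussian_zero_mean(dp2, sigma_p)

def unary_energy(T1, T2, appearance_prob, sigma_p):
    """Eq. 9: U(l_i = 1 | S) = -ln(P_m * P_a); appearance_prob stands in for
    the learned global appearance score P_a."""
    p = unary_motion_prob(T1, T2, sigma_p) * appearance_prob
    return -np.log(max(p, 1e-12))            # clamp to avoid log(0)
```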
4.3 Learning of Pairwise Terms
Similar to unary terms, pairwise terms are also decomposed
into motion based and appearance based parts. Motion based
probabilities are defined based on relative distances between
tracklet pairs. Take two nodes (T1 , T3 ) and (T2 , T4 ) as an
example. Suppose T1 and T2 are a tail-close pair; therefore,
there is an edge between the two nodes. Let t_x = min{t_1^e, t_2^e} and t_y = max{t_3^s, t_4^s}.
Fig. 7 Pairwise motion models for pairwise terms in the CRF. The x-axis denotes the frame number, and the y-axis denotes the position
As T_1 and T_2 are tail-close, we estimate the positions of both targets at frame t_y, as shown in dashed circles in Fig. 7. Then we can get the estimated relative distance between the positions of T_1 and T_2 at frame t_y as

Δp_1 = ( p_1^{t_1^e} + V_1^{tail}(t_y − t_1^e) ) − ( p_2^{t_2^e} + V_2^{tail}(t_y − t_2^e) )   (11)
where V_1^{tail} and t_1^e are the tail velocity and end frame of T_1; V_2^{tail} and t_2^e are defined similarly. Equation 11 indicates that if T_1 and T_2
do not change their motion directions and speeds, they should
have a distance of Δp1 at frame t y . We compare the estimated
relative distance with the observed one Δp2 , and compute the
pairwise motion probability as G(Δp1 − Δp2 , Σ p ), where
G is the zero-mean Gaussian function.
As T_1 and T_2 are tail-close, they likely follow similar non-linear motion paths or undergo similar position shifts caused by camera motion; in either case, their relative distance changes little. As shown in Fig. 7, the difference between Δp_1 and Δp_2 is small. This indicates that
if T1 is associated to T3 , there is a high probability that T2
is associated to T4 and vice versa. Note that if T3 and T4 in
Fig. 7 are head-close, we also compute a similar probability
as G(Δp3 − Δp4 , Σ p ) where Δp3 is computed as in Eq. 12
and Δp4 is as shown in Fig. 7; the final motion probability
is taken as the average of both.
Δp_3 = ( p_3^{t_3^s} − V_3^{head}(t_3^s − t_x) ) − ( p_4^{t_4^s} − V_4^{head}(t_4^s − t_x) )   (12)
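A sketch of this pairwise motion term for the edge between nodes (T1 → T3) and (T2 → T4) is given below, following Eqs. 11–12; the tracklet accessors and the reading of the observed distances Δp2 and Δp4 (taken from the responses at frames t_y and t_x, as suggested by Fig. 7) are assumptions.

```python
import numpy as np

def _gauss(x, sigma):
    """Zero-mean isotropic Gaussian density, standing in for G(., Sigma_p)."""
    x = np.asarray(x, dtype=float)
    return ((2 * np.pi * sigma ** 2) ** (-x.size / 2)
            * np.exp(-float(np.dot(x, x)) / (2 * sigma ** 2)))

def pairwise_motion_prob(T1, T2, T3, T4, sigma_p, t3_t4_head_close=False):
    """Pairwise motion probability for nodes (T1 -> T3) and (T2 -> T4),
    where (T1, T2) is a tail-close pair."""
    t_y = max(T3.start_frame, T4.start_frame)
    # Eq. 11: relative distance predicted if T1 and T2 keep their tail velocities
    p1 = np.asarray(T1.tail_position) + np.asarray(T1.tail_velocity) * (t_y - T1.end_frame)
    p2 = np.asarray(T2.tail_position) + np.asarray(T2.tail_velocity) * (t_y - T2.end_frame)
    dp1 = p1 - p2
    # observed relative distance of the candidate continuations at frame t_y
    dp2 = np.asarray(T3.position_at(t_y)) - np.asarray(T4.position_at(t_y))
    prob = _gauss(dp1 - dp2, sigma_p)
    if t3_t4_head_close:
        t_x = min(T1.end_frame, T2.end_frame)
        # Eq. 12: extrapolate the heads of T3 and T4 backwards to frame t_x
        q3 = np.asarray(T3.head_position) - np.asarray(T3.head_velocity) * (T3.start_frame - t_x)
        q4 = np.asarray(T4.head_position) - np.asarray(T4.head_velocity) * (T4.start_frame - t_x)
        dp3 = q3 - q4
        # observed relative distance of T1 and T2 at frame t_x
        dp4 = np.asarray(T1.position_at(t_x)) - np.asarray(T2.position_at(t_x))
        prob = 0.5 * (prob + _gauss(dp3 - dp4, sigma_p))   # average of both estimates
    return prob
```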
Pairwise appearance models are designed for differentiating specific close-by pairs. For example, in Fig. 7, T1 and
T2 are a tail-close pair; we want to produce an appearance
model that best distinguishes the two targets without considering other targets or backgrounds.
For this, we collect positive and negative samples online from the two concerned tracklets only. Positive samples are selected from responses in the same tracklet; any pair of these responses is expected to have high appearance similarity, as they belong to the same target. For a tail-close tracklet pair T_1 and T_2, the positive sample set S_+ is defined as

S_+ = { (d_k^{t_1}, d_k^{t_2}) }  k ∈ {1, 2},  ∀t_1, t_2 ∈ [max{t_k^s, t_k^e − θ}, t_k^e]   (13)
where θ is a threshold for the number of frames used for
computing appearance models (set to 10 in our experiments),
as after some time a target may change appearance a lot due
to illumination, view angle, or pose changes.
Negative samples are sampled from responses in different
tracklets; we want to find an appearance model that produces
low appearance similarity for any pair of these responses as
they belong to different targets. The negative sample set S−
between T1 and T2 is defined as
S_− = { (d_1^{t_1}, d_2^{t_2}) }  ∀t_1 ∈ [max{t_1^s, t_1^e − θ}, t_1^e], ∀t_2 ∈ [max{t_2^s, t_2^e − θ}, t_2^e]   (14)
Sample collection for head-close pairs follows a similar
procedure, but detection responses are from the first θ frames
of each tracklet.
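The sample collection of Eqs. 13–14 can be sketched as follows (θ = 10 as in the text); the response_at accessor is hypothetical, and the RealBoost training that consumes these pairs is not reproduced here.

```python
def collect_pairwise_samples(T1, T2, theta=10, tail_close=True):
    """Positive pairs come from within T1 and within T2 (Eq. 13); negative
    pairs combine one response from T1 with one from T2 (Eq. 14).  The last
    theta frames are used for a tail-close pair, the first theta frames for
    a head-close pair."""
    def window(T):
        if tail_close:
            lo = max(T.start_frame, T.end_frame - theta)
            return [T.response_at(t) for t in range(lo, T.end_frame + 1)]
        hi = min(T.end_frame, T.start_frame + theta)
        return [T.response_at(t) for t in range(T.start_frame, hi + 1)]

    r1, r2 = window(T1), window(T2)
    positives = [(a, b) for group in (r1, r2) for a in group for b in group if a is not b]
    negatives = [(a, b) for a in r1 for b in r2]
    return positives, negatives
```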
With the positive and negative sample sets, we use the
standard RealBoost algorithm to learn appearance models for
best distinguishing T1 and T2 ; the same features are used as in
unary terms. This learned model is used to produce appearance based probabilities for (T1 , T3 ) and (T2 , T4 ). If T3 and T4
are a head-close pair, we adopt a similar learning approach to
get appearance probabilities based on discriminative models
for T3 and T4 , and use the average of both scores as the final
pairwise appearance probabilities for (T1 , T3 ) and (T2 , T4 ).
Note that the discriminative appearance models for T1 and
T2 are only learned once for all edges like {(T1 , Tx ), (T2 , Ty )}
∀T_x, T_y ∈ S. Therefore, the complexity is O(n^2), where n is the number of tracklets. Moreover, as only a few tracklet pairs are likely to be spatially close, the actual number of models learned is typically much smaller than n^2.
4.4 Energy Minimization
For CRF models with submodular energy functions (where B(0, 0) + B(1, 1) ≤ B(0, 1) + B(1, 0)), a globally optimal solution can be found by the graph cut algorithm. However, due to the constraints in Eq. 3, the energy function in our formulation is not submodular. In addition, the one-to-one assumption in Eq. 3 makes standard non-submodular algorithms, such as QPBO (Hammer et al. 1984) and Loopy Belief Propagation (LBP) (Pearl 1998), less efficient and less effective, as most solutions found by these methods would violate the constraint. Instead, we introduce a heuristic algorithm to
find a solution in polynomial time.
The unary terms in the CRF model have been shown to
be effective for non-difficult pairs by previous work (Kuo
et al. 2010). Taking this into account, we first use the Hungarian algorithm (Perera et al. 2006) to find a globally optimal solution considering only the unary terms and satisfying the constraints in Eq. 3. Then, we sort the selected associations, i.e., the nodes with label 1, according to their unary term energies from least to most as A = {v_i = (T_{i1} → T_{i2})}. Next, for each selected node, we consider switching the labels of this node and each of its neighboring nodes, i.e., nodes connected to the current node by an edge in the CRF model; if the energy becomes lower, we keep the change.
Similar to Kuo et al. (2010), the unary energy term for
self association is set to +∞. Self association denotes that a
tracklet is either a false alarm or is a complete track and therefore should not associate with any other tracklet. With such
a definition, the algorithm will always select an association
for each tracklet no matter how small the link probability.
To avoid over association, we set a threshold for linking. For
each selected tracklet pair, e.g. a node vi with label li = 1,
we define its linking cost as
C_i = U(l_i = 1) + (1/2) Σ_{v_j ∈ N_i} B(l_i = 1, l_j = 1)   (15)

N_i = {v_j} s.t. (v_i, v_j) ∈ E & l_j = 1   (16)
When this algorithm chooses a node v_i with label l_i = 1, we only link the two corresponding tracklets in the node when C_i < τ, where τ is a threshold. In our experiments, we have four levels of association to gradually link tracklets into longer tracks; the thresholds are fixed to −ln 0.8, −ln 0.5, −ln 0.1, and −ln 0.05, yet performance improves on all data sets, indicating that our algorithm is not sensitive to these thresholds. The lower the thresholds, the less likely tracklets are to be associated, leading to fewer ID switches but more fragments (definitions are given in Sect. 5).
The energy minimization algorithm is shown in Algorithm 1.

Algorithm 1 Finding labels with low energy cost.
Input: Tracklets from the previous level S = {T_1, T_2, . . . , T_n}; CRF graph G = (V, E).
Find the label set L with the lowest unary energy cost by the Hungarian algorithm, and evaluate its overall energy Ψ by Eq. 2.
Sort the nodes with label 1 according to their unary energy costs from least to most as {v_1, . . . , v_k}.
For i = 1, . . . , k do:
– Set the updated energy Ψ' = +∞.
– For each j such that (v_i, v_j) ∈ E do:
  – Switch the labels l_i and l_j under the constraints in Eq. 3, and evaluate the new energy Ω.
  – If Ω < Ψ', set Ψ' = Ω.
– If Ψ' < Ψ, update L with the corresponding switch, and set Ψ = Ψ'.
Output: The set of labels L on the CRF graph.
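Algorithm 1 could be realized along the following lines; this is a sketch rather than the authors' C++ implementation: it uses SciPy's Hungarian solver with a large finite constant standing in for the +∞ self-association cost, assumes the unary and pairwise costs are precomputed, and treats B as 0 whenever either label is 0 (so the pairwise dictionary only stores B(1, 1) values for edges between nodes that appear in the unary dictionary).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e6   # stands in for the +infinity self-association cost

def one_to_one(lab):
    """Eq. 3: each tracklet has at most one outgoing and one incoming link."""
    outs = [i for (i, j), l in lab.items() if l == 1]
    ins = [j for (i, j), l in lab.items() if l == 1]
    return len(outs) == len(set(outs)) and len(ins) == len(set(ins))

def minimize_energy(n, unary, edges, pairwise):
    """n: number of tracklets; unary: {(i, j): U(l=1)} for linkable pairs;
    edges: list of node pairs (va, vb), each node being a key of unary;
    pairwise: {(va, vb): B(1, 1)}.  Returns the selected links (label-1 nodes)."""
    # Step 1: Hungarian algorithm on the unary terms only.
    cost = np.full((n, n), BIG)
    for (i, j), u in unary.items():
        cost[i, j] = u
    rows, cols = linear_sum_assignment(cost)
    labels = {(i, j): 1 for i, j in zip(rows, cols) if cost[i, j] < BIG}

    def energy(lab):
        e = sum(unary[v] for v, l in lab.items() if l == 1)
        return e + sum(b for (va, vb), b in pairwise.items()
                       if lab.get(va, 0) == 1 and lab.get(vb, 0) == 1)

    best = energy(labels)
    # Step 2: visit selected nodes by increasing unary cost and try switching
    # their labels with those of neighboring nodes in the CRF graph.
    for v in sorted([v for v, l in labels.items() if l == 1], key=lambda v: unary[v]):
        best_trial, best_e = None, best
        for va, vb in edges:
            if v not in (va, vb):
                continue
            other = vb if v == va else va
            trial = dict(labels)
            trial[v], trial[other] = trial.get(other, 0), trial.get(v, 0)  # swap labels
            if not one_to_one(trial):
                continue
            e = energy(trial)
            if e < best_e:
                best_trial, best_e = trial, e
        if best_trial is not None:
            labels, best = best_trial, best_e
    return {v for v, l in labels.items() if l == 1}
```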
Our energy minimization algorithm is a greedy process, and will stop at some local minimum; however, it is guaranteed to produce associations with an energy lower than, or in the worst case equal to, that of the associations produced by using unary terms only.
Note that the Hungarian algorithm used in the first step of Algorithm 1 has a time complexity of O(n^3), while our heuristic search process has a complexity of O(|E|) = O(n^4); the overall complexity is still polynomial. In addition, as nodes are only defined between tracklets with a proper time gap, and edges are only defined between nodes with head-close or tail-close tracklet pairs, the actual number of edges is typically much smaller than n^4. In our experiments, the run time is almost linear in the number of targets, which is less than 30 in typical scenarios.
5 Experiments

We evaluated our approach on four public pedestrian data sets: the TUD data set (Andriyenko and Schindler 2011), PETS (2009), Trecvid (2008), and the ETH mobile pedestrian data set (Ess et al. 2009). We show quantitative comparisons with state-of-the-art methods, as well as visualized results of our approach. Though frame rates, resolutions, and densities are different in these data sets, we used the same parameter setting, and performance improves compared to previous methods for all of them. This indicates that our approach has low sensitivity to parameters. All data used in our experiments are publicly available at http://iris.usc.edu/people/yangbo/downloads.html.
5.1 Evaluation Metrics

As it is difficult to use a single score to evaluate tracking performance, we adopt the evaluation metrics defined in Li et al. (2009), including:
– Recall(↑) correctly matched detections / total detections in ground truth.
– Precision(↑) correctly matched detections / total detections in the tracking result.
– FAF(↓) Average false alarms per frame.
– GT The number of trajectories in ground truth.
– MT (↑) The ratio of mostly tracked (MT) trajectories, which are successfully tracked for more than 80 %.
– ML(↓) The ratio of mostly lost trajectories, which are successfully tracked for less than 20 %.
– PT The ratio of partially tracked trajectories, i.e., 1 − MT − ML.
– Frag(↓) fragments, the number of times that a ground truth
trajectory is interrupted.
– IDS(↓) ID switch, the number of times that a tracked trajectory changes its matched id.
For items with ↑, higher scores indicate better results; for
those with ↓, lower scores indicate better results.
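For illustration, the trajectory-level ratios could be computed from per-trajectory coverage values as in the minimal sketch below; how tracker output is matched against ground truth to obtain these coverage values is assumed to be handled elsewhere.

```python
def trajectory_level_metrics(coverage):
    """coverage: one value per ground-truth trajectory, giving the fraction of
    its frames that were successfully tracked (0.0 - 1.0).  Returns the GT
    count and the MT / PT / ML ratios as defined above."""
    gt = len(coverage)
    mt = sum(c > 0.8 for c in coverage) / gt    # mostly tracked: more than 80 %
    ml = sum(c < 0.2 for c in coverage) / gt    # mostly lost: less than 20 %
    pt = 1.0 - mt - ml                          # partially tracked
    return {"GT": gt, "MT": mt, "PT": pt, "ML": ml}

# Example with 10 ground-truth trajectories:
print(trajectory_level_metrics([0.95, 0.9, 0.85, 0.82, 0.7, 0.6, 0.5, 0.4, 0.15, 0.1]))
```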
5.2 Results on Static Camera Videos
We first present our results on data sets captured by static cameras: TUD (Andriyenko and Schindler 2011), PETS
(2009), and Trecvid (2008).
For fair comparison, we use the same TUD-Stadtmitte and PETS 2009 data sets as used in Andriyenko and Schindler
(2011). Scenes in these two data sets are not crowded, but
people are sometimes occluded by other humans or other
objects.
Quantitative results are shown in Tables 1 and 2. Note that
the approach in Kuo et al. (2010) can be viewed as a baseline; if our pairwise terms are disabled, we only use online
learned global models, and our approach degrades to Kuo
et al. (2010). By comparing our approach and Kuo et al.
(2010), we are able to see the effectiveness of the proposed
pairwise terms.
We can see that the online CRF tracking results are much
improved over those in Andriyenko and Schindler (2011)
in both data sets and better than PRIMPT in PETS 2009,
but the improvement is not so obvious compared with Kuo
et al. (2010) and Kuo and Nevatia (2011) in the TUD data
set: we have higher MT and recall and lower ID switches,
but OLDAMs and PRIMPT have higher precisions. This is
because the online CRF model focuses on better differentiating difficult pairs of targets, but there are not many people
in the TUD data set. Some visual results are shown in Fig. 8;
our approach is able to keep correct identities while targets
are quite close, such as person 0, person 1, and person 2 in the
first row, and person 6, person 11, and person 12 in the second
row.
To see the effectiveness of our approach, we further evaluated our approach on the more difficult Trecvid 2008 data set.
There are nine video clips in this data set, each of which has
5,000 frames; these videos are captured in a busy airport, and
have high density of people with frequent occlusions. There
are many close track interactions in this data set, indicating a huge number of edges in the CRF graph (more than 200 edges
in most cases). The comparison results are shown in Table 3.
Compared with up-to-date approaches, our online CRF achieves the best performance on the precision, FAF, fragments, and ID switch metrics, while keeping recall and MT as high as
79.8 and 75.5 %. Compared with Kuo and Nevatia (2011),
our approach reduces the fragments and the ID switches by
about 15 and 14 % respectively. As OLDAMs (Kuo et al. 2010) is equivalent to using unary terms only in our model, the obvious improvement indicates the utility of the pairwise
terms.
Figure 9 shows some tracking examples by our approach.
We can see that when targets with similar appearances get
close, the online CRF can still find discriminative features to
distinguish these difficult pairs. However, global appearance
Table 1 Comparison of results on TUD dataset
Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 79.9 | 98.8 | 0.014 | 10 | 60.0 | 40.0 | 0.0 | 3 | 0
Energy minimization (Andriyenko and Schindler 2011) | – | – | – | 9 | 60.0 | 30.0 | 0.0 | 4 | 7
PRIMPT (Kuo and Nevatia 2011) | 81.0 | 99.5 | 0.028 | 10 | 60.0 | 30.0 | 10.0 | 0 | 1
Online CRF tracking | 87.0 | 96.7 | 0.184 | 10 | 70.0 | 30.0 | 0.0 | 1 | 0
The OLDAMs and PRIMPT results are provided by Kuo and Nevatia (2011). Our ground truth includes all persons appearing in the video, and has one more person than that in Andriyenko and Schindler (2011)
Table 2 Comparison of results on PETS 2009 dataset
Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 89.6 | 97.9 | 0.110 | 19 | 73.7 | 26.3 | 0.0 | 28 | 1
Energy minimization (Andriyenko and Schindler 2011) | – | – | – | 23 | 82.6 | 17.4 | 0.0 | 21 | 15
PRIMPT (Kuo and Nevatia 2011) | 89.5 | 99.6 | 0.020 | 19 | 78.9 | 21.1 | 0.0 | 23 | 1
Online CRF tracking | 93.0 | 95.3 | 0.268 | 19 | 89.5 | 10.5 | 0.0 | 13 | 0
The OLDAMs and PRIMPT results are provided by Kuo and Nevatia (2011). Our ground truth is more strict than that in Andriyenko and Schindler (2011)
Fig. 8 Tracking examples on TUD (Andriyenko and Schindler 2011) and PETS 2009 (Pets 2009) data sets
Table 3 Comparison of tracking results on Trecvid 2008 dataset
Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
Offline CRF tracking (Yang et al. 2011) | 79.2 | 85.8 | 0.996 | 919 | 78.2 | 16.9 | 4.9 | 319 | 253
OLDAMs (Kuo et al. 2010) | 80.4 | 86.1 | 0.992 | 919 | 76.1 | 19.3 | 4.6 | 322 | 224
PRIMPT (Kuo and Nevatia 2011) | 79.2 | 86.8 | 0.920 | 919 | 77.0 | 17.7 | 5.2 | 283 | 171
Online CRF tracking | 79.8 | 87.8 | 0.857 | 919 | 75.5 | 18.7 | 5.8 | 240 | 147
The human detection results are the same as used in Yang et al. (2011), Kuo et al. (2010), Kuo and Nevatia (2011), and are provided by courtesy of the authors of Kuo and Nevatia (2011)
Fig. 9 Tracking examples on the Trecvid 2008 data set (National institute of standards and technology 2008)
and motion models are not effective enough in such cases, e.g., for persons 106 and 109 in the second row of Fig. 9, who
are both in white, move in similar directions, and are quite
close.
5.3 Results on Moving Camera Videos
We further evaluated our approach on the ETH data set (Ess
et al. 2009), which is captured by a pair of cameras on a
moving stroller in busy street scenes. The stroller is mostly
moving forward, but sometimes has panning motions, which
makes the motion affinity between two tracklets less reliable.
Many people have speeds of 2–4 pixels per frame; however,
the camera frequently induces motion of more than 20 pixels
per frame. This makes linear motion estimation imprecise.
Nevertheless, as all people shift according to the same camera motion, relative positions do not change much in one
frame.
For fair comparison, we choose the “BAHNHOF” and
“SUNNY DAY” sequences used in Kuo and Nevatia (2011)
for evaluation. They have 999 and 354 frames respectively;
people are under frequent occlusions due to the low view
angles of cameras. As in Kuo and Nevatia (2011), we also
use the sequence from the left camera; no depth and ground
plane information is used.
The quantitative results are shown in Table 4. We can see
that our approach achieves better or the same performances
on all evaluation scores. Compared with the state-of-the-art method (Kuo and Nevatia 2011), the MT score is improved by about 10 %; fragments are reduced by 17 %; recall and precision are improved by about 2 and 4 % respectively. The obvious improvements in the MT and fragment scores indicate that our approach can better track targets under moving camera conditions, where traditional motion models are less
reliable.
Figure 10 shows some visual tracking results by our online
CRF approach. Both examples show obvious panning movements of cameras. Traditional motion models, i.e., unary
motion models in our online CRF, would produce low affinity
scores for tracklets belonging to the same targets. However,
by considering pairwise terms, relative positions are helpful
for connecting correct tracklets into one.
Figure 11 shows tracking examples using only unary terms
on the same frames as shown in Fig. 2. In the first row of
Fig. 11, we can see that an id switch occurs between two
close humans due to camera motions and similar appear-
Table 4 Comparison of tracking results on ETH dataset
Method | Recall (%) | Precision (%) | FAF | GT | MT (%) | PT (%) | ML (%) | Frag | IDS
OLDAMs (Kuo et al. 2010) | 52.0 | 91.0 | 0.398 | 125 | 25.8 | 38.7 | 35.5 | 20 | 25
PRIMPT (Kuo and Nevatia 2011) | 76.8 | 86.6 | 0.891 | 125 | 58.4 | 33.6 | 8.0 | 23 | 11
Online CRF tracking | 79.0 | 90.4 | 0.637 | 125 | 68.0 | 24.8 | 7.2 | 19 | 11
The human detection results are the same as used in Kuo et al. (2010), Kuo and Nevatia (2011), and are provided by Kuo and Nevatia (2011)
Fig. 10 Tracking examples on the ETH data set (Ess et al. 2009)
Fig. 11 Tracking examples using only unary terms of our approach on the same frames as shown in Fig. 2
ances. However, by using both unary and pairwise terms, we
are able to find discriminative appearance models to distinguish them and build more reliable motion affinities; thus,
we can track them correctly as shown in Fig. 2. In the second
row of Fig. 11, both person 135 and 136 are tracked at frame
3,692; however, person 136 is lost at frame 3,757, because
the person makes a sharp direction change, causing the linear motion model to fail. By further considering pairwise terms,
we are able to locate the non-linear path and to track both
persons correctly as shown in the second row of Fig. 2.
5.4 Computational Speed
As discussed in Sect. 4.4, the complexity of our algorithm
is polynomial in the number of tracklets in the worst case.
Our experiments were performed on an Intel 3.0 GHz PC
with 8 GB memory, and the code is implemented in C++.
For the less crowded TUD, PETS 2009, and ETH data sets,
the speed is 10, 14, and 10 fps, respectively, not including
detection time cost; for the crowded Trecvid 2008 data set, the speed is about 6 fps. Compared with the speed of 7 fps for Trecvid 2008 reported in Kuo and Nevatia (2011), the online CRF does not add much to the computation cost (detection time is not included in either measurement).
6 Conclusion
We described an online CRF framework for multi-target
tracking. This CRF considers both global descriptors for distinguishing different targets as well as pairwise descriptors
for differentiating difficult pairs. Unlike global descriptors,
pairwise motion and appearance models are learned from corresponding difficult pairs, and represented by pairwise terms
in the CRF energy function. An effective algorithm is introduced to efficiently find associations with low energy, and
the experiments show significantly improved results compared with previous methods. Future improvement can be
achieved by adding camera motion inference into pairwise
motion models.
Acknowledgments Research was sponsored, in part, by Office of
Naval Research under Grant number N00014-10-1-0517 and by the
Army Research Laboratory and was accomplished under Cooperative
Agreement Number W911NF-10-2-0063. The views and conclusions
contained in this document are those of the authors and should not
be interpreted as representing the official policies, either expressed or
implied, of the Army Research Laboratory or the U.S. Government. The
U.S. Government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation herein.
References
Andriyenko, A., & Schindler, K. (2011). Multi-target tracking by continuous energy minimization. In Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Colorado
Springs, USA, pp. 1265–1272.
Andriyenko, A., Schindler, K., & Roth, S. (2012). Discrete–continuous
optimization for multi-target tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp. 1926–1933.
Breitenstein, M. D., Reichlin, F., Leibe, B., Koller-Meier, E., & Gool, L.
V. (2011). Online multi-person tracking-by-detection from a single,
uncalibrated camera. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 33(9), 1820–1833.
Comaniciu, D., Ramesh, V., & Meer, P. (2000). Real-time tracking of
non-rigid objects using mean shift. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hilton
Head Island, SC, USA, pp. 142–149.
Duan, G., Ai, H., Cao, S., & Lao, S. (2012). Group tracking: Exploring
mutual relations for multiple object tracking. In Proceedings of the
European Conference on Computer Vision (ECCV), Florence, Italy,
pp. 129–143.
Ess, A., Leibe, B., Schindler, K., & van Gool, L. (2009). Robust multiperson tracking from a mobile platform. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 31(10), 1831–1846.
Grabner, H., Matas, J., Gool, L. V., & Cattin, P. (2010). Tracking
the invisible: Learning where the object might be. In Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), San Francisco, USA, pp. 1285–1292.
Hammer, P. L., Hansen, P., & Simeone, B. (1984). Roof duality, complementation and persistency in quadratic 0–1 optimization. Mathematical Programming, 28(2), 121–155.
Holzer, S., Pollefeys, M., Ilic, S., Tan, D., & Navab, N. (2012). Online
learning of linear predictors for real-time tracking. In Proceedings
of the European Conference on Computer Vision (ECCV), Florence,
Italy, pp. 470–483.
Huang, C., & Nevatia, R. (2010). High performance object detection
by collaborative learning of joint ranking of granule features. In
Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), San Francisco, USA, pp. 41–48.
Huang, C., Wu, B., & Nevatia, R. (2008). Robust object tracking by
hierarchical association of detection responses. In Proceedings of
the European Conference on Computer Vision (ECCV), Marseille,
France, pp. 788–801.
Isard, M., & Blake, A. (1998). Condensation—conditional density
propagation for visual tracking. International Journal of Computer
Vision, 29(1), 5–28.
Kalal, Z., Matas, J., & Mikolajczyk, K. (2010). P–N learning: Bootstrapping binary classifiers by structural constraints. In Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), San Francisco, USA, pp. 49–56.
Kuo, C.-H., & Nevatia, R. (2011). How does person identity recognition help multi-person tracking? In Proceedings of IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), Colorado
Springs, USA, pp. 1217–1224.
Kuo, C.-H., Huang, C., & Nevatia, R. (2010). Multi-target tracking by
on-line learned discriminative appearance models. In Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), San Francisco, USA, pp. 685–692.
Li, Y., Huang, C., & Nevatia, R. (2009). Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Miami, FL, USA, pp. 2953–2960.
National Institute of Standards and Technology: Trecvid 2008 evaluation for surveillance event detection. Retrieved October 1, 2012 from
http://www.nist.gov/speech/tests/trecvid/2008/.
Pearl, J. (1998). Probabilistic reasoning in intelligent systems: Networks
of plausible inference. San Francisco: Morgan Kaufmann.
Perera, A. G. A., Srinivas, C., Hoogs, A., Brooksby, G., & Hu, W.
(2006). Multi-object tracking through simultaneous long occlusions
and split-merge conditions. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), New York, USA,
pp. 666–673.
Pets 2009 dataset. Retrieved October 1, 2012 from http://www.cvg.rdg.
ac.uk/PETS2009.
Pirsiavash, H., Ramanan, D., & Fowlkes, C. C. (2011). Globally-optimal
greedy algorithms for tracking a variable number of objects. In
Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Colorado Springs, USA, pp. 1201–1208.
Shitrit, H. B., Berclaz, J., Fleuret, F., & Fua, P. (2011). Tracking multiple
people under global appearance constraints. In Proceedings of IEEE
International Conference on Computer Vision (ICCV), Barcelona,
Spain, pp. 137–144.
Song, B., Jeng, T. Y., Staudt, E., & Roy-Chowdhury, A. K. (2010). A
stochastic graph evolution framework for robust multi-target tracking. In Proceedings of the European Conference on Computer Vision
(ECCV), Crete, Greece, pp. 605–619.
Stalder, S., Grabner, H., & Gool, L. V. (2010). Cascaded confidence
filtering for improved tracking-by-detection. In Proceedings of the
European Conference on Computer Vision (ECCV), Crete, Greece,
pp. 369–382.
Wang, S., Lu, H., Yang, F., & Yang, M.-H. (2011). Superpixel tracking. In Proceedings of IEEE International Conference on Computer
Vision (ICCV), Barcelona, Spain, pp. 1323–1330.
Xing, J., Ai, H., & Lao, S. (2009). Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with
detection responses. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, pp.
1200–1207.
Xing, J., Ai, H., Liu, L., & Lao, S. (2011). Multiple players tracking
in sports video: A dual-mode two-way bayesian inference approach
with progressive observation modeling. IEEE Transaction on Image
Processing, 20(6), 1652–1667.
Yang, B., & Nevatia, R. (2012a). Online learned discriminative partbased appearance models for multi-human tracking. In Proceedings
of the European Conference on Computer Vision (ECCV), Florence,
Italy, pp. 484–498.
Yang, B., & Nevatia, R. (2012b). An online learned CRF model for
multi-target tracking. In Proceedings IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp.
2034–2041.
Yang, B., Huang, C., & Nevatia, R. (2011). Learning affinities and
dependencies for multi-target tracking using a CRF model. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Colorado Springs, USA, pp. 1233–1240.
Yu, Q., Medioni, G., & Cohen, I. (2007). Multiple target tracking
using spatio-temporal markov chain monte carlo data association.
In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, pp. 1–8.
Zamir, A. R., Dehghan, A., & Shah, M. (2012). GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs.
In Proceedings of the European Conference on Computer Vision
(ECCV), Florence, Italy, pp. 343–356.
Zhang, L., Li, Y., & Nevatia, R. (2008). Global data association for
multi-object tracking using network flows. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
Anchorage, AK, USA, pp. 1–8.
Zhang, T., Ghanem, B., Liu, S., & Ahuja, N. (2012a). Robust visual
tracking via multi-task sparse learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, pp. 2042–2049.
Zhang, K., Zhang, L., & Yang, M.-H. (2012b). Real-time compressive
tracking. In Proceedings of the European Conference on Computer
Vision (ECCV), Florence, Italy, pp. 864–877.