tracking of multiple soccer players using a 3d particle filter based on

Advances in Computer Science and Engineering
Volume 6, Number 1, 2011, Pages 93-104
Published Online: February 22, 2011
This paper is available online at http://pphmj.com/journals/acse.htm
© 2011 Pushpa Publishing House
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D
PARTICLE FILTER BASED ON DETECTOR CONFIDENCE
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
Graduate School of Engineering
Kobe University
e-mail: [email protected]
[email protected]
[email protected]
Abstract
In this paper, a novel method for tracking multiple soccer players based
on a support vector machine (SVM) and a particle filter is proposed for a
monocular fixed soccer video. This paper considers three main topics.
First, the state of a particle filter presented in a 3D world coordinate
space, instead of a 2D image coordinate space, provides a more realistic
model tracking because depth considerations are taken into account.
Second, by integrating SVM into the particle filter framework, the initial
positions of the players can be detected or rediscovered when the tracker
has lost the players. Third, the feature histogram and correlation are
combined when the likelihood for a particle is computed. The
effectiveness of this method has been confirmed by experiments that
actually tracked multiple soccer players.
1. Introduction
When professional cameramen video or edit a sports game, they are required to
make it interesting for the TV or video audience. In order to address this requirement
by computer, automatic production systems including digital camera work have been
studied (ex. [1, 2]). A typical video production technology involves shooting the
Keywords and phrases: particle filter, soccer, tracking, earth mover’s distance.
Received October 11, 2010
94
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
sports game using a fixed camera, and then clipping and connecting the video
frames. This technology allows videos to be edited according to individual audience
preferences.
An automatic production system is composed of several image recognition
techniques and requires the semantic analysis of the process and strategy of the
game. The position and trajectory of players in the game can provide useful
information to facilitate the semantic analysis. In order to obtain such information,
detection and tracking of objects (players, the ball, and the lines) are required.
Many tracking methods have been proposed to date, and one of the popular
methods of tracking-by-detection has been reported, where a detection framework is
integrated into a tracking algorithm [3, 4, 6]. It makes it easy to get the initial
positions of targets or rediscover them when the tracker has lost the targets. For
object detection, there are two types of detection algorithms; SVM [3, 4] and
AdaBoost [5, 6]. Both methods can detect their targets correctly if an abundance of
training data is used. In the tracking algorithm, a particle filter is often employed
because of its robustness with respect to occlusion. Therefore, our tracking algorithm
also employs tracking-by-detection with a particle filter for robust tracking despite
occlusion.
In our system, at first, players are detected on the first frame. Next, tracking
using a particle filter is carried out, where the particle filter approximates the
probabilistic model discretely using a large number of particles. Each particle is
translated by a state transition model and is weighted by an observation model, and
then the state is propagated. Due to the detection results, the particles are weighted
more accurately, where the proposed method computes the likelihood for a particle
using a linear combination of the feature histogram and correlation.
2. Player Detection
For each video frame, player detection is first performed. The classifier that
detects players is trained in advance using the training data. The number of training
data is 3,000 positive samples and 1,000 negative samples. Examples of positive and
negative samples are shown in the upper part and the lower part of Fig. 1,
respectively. The gray value at each pixel on a gray scale image was used as a
feature. In the detection step, the multiple detectors search the video frame by sliding
the window and changing the size, and the classifier outputs the detected regions
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D …
95
with an SVM score c(det) for a detector det. The SVM score is defined as the
distance between the window image and the decision boundary of the classifier.
Figure 1. Training samples.
Figure 2 shows the detection results. Player regions are correctly detected, but in
the case where multiple players are crossing in front of each other, an individual
player cannot be detected.
Figure 2. Detection results.
3. Tracking Algorithm
3.1. 3D particle filter
A tracking algorithm is based on estimating the distribution of each state of a
particle filter. The world coordinate is fixed to the soccer stadium, and the state x at
time t is defined as Eq. (1) using the 3D coordinate.
x = [ p x , p y , v x , v y , a x , a y ]T .
(1)
The state x consists of the 3D world position ( p x , p y ) , the velocity (v x , v y ) and
the acceleration (a x , a y ). The state presented in the 3D world coordinate space,
rather than the 2D image coordinate space, makes tracking more robust owing to
depth considerations when occlusion occurs. Here, the height of the players is not
considered, and ( p x , p y ) represents the position of a player’s centroid.
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
96
Then, we assumed that the state transition model from time t − τ to t is an
accelerated motion model as follows:
x (t ) = Cx (t − τ ) + ϒ
 I 2 ∗2 τI 2∗2

C =  O2 ∗2 I 2 ∗2
O2 ∗2O2 ∗2

( τ 2 2 ) I 2 ∗2 
τI 2 ∗2
I 2 ∗2




ϒ = [ε p I1∗2 , ε v I1∗2 , ε a I1∗2 ]T .
The process noises ε p , εv , ε a for state variables are independently drawn from
zero-mean normal distributions. The variance σ2p for the position noise changes
with the size of the tracking target, whereas the variances σ2v and σ2a for the
velocity and acceleration noise are inversely proportional to the number of
successfully tracked frames. Hence, we can expect that longer a target is tracked
successfully, the less the particles are spread.
3.2. Assigning detection to tracker
In order to integrate the detector into the tracking framework using SVM, a
detector/tracker pair must be decided. For this pairing decision, the algorithm
described in [3] is utilized. The matching algorithm works as follows. First, a
matching score s(tr , det ) for each pair (tr, det ) of the tracker tr and the detector
det is computed as shown in Eq. (2):
N



s(tr , det ) = g (tr , det ) ⋅  c(det ) + α
p N (det − p )  ,


p∈tr


(2)
 sizetr − sizedet 
g (tr , det ) = p( sizedet | tr ) = p N 
,
sizetr


(3)
∑
where
p N (det − p ) ~ N (det − p; 0, σ 2det )
denotes
the
normal
distribution
evaluated for the distance between the detector det and each particle p of tracker tr,
and g (tr , det ) is a gating function which additionally assesses each detector. sizetr
denotes the size of tracker located at p y , and sizedet denotes the window size of the
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D …
97
detector det. The values of α and σ 2det are set experimentally. The meaning of Eq.
(2) is that a pair consisting of tracker tr and detector det gets a higher matching score
s(tr , det ) , if their average distance and window size difference are smaller. Finally,
the matching matrix function S (the number of trackers × the number of detectors),
whose element is s(tr , det ) , is computed.
Then, the pair (tr ∗ , det ∗ ) with the maximum score in S is iteratively selected,
and the rows and columns belonging to the tracker tr∗ and the detector det ∗ in S are
deleted. This is repeated until no further valid pair is available.
3.3. Observation model
To compute the weight ωtr, p for a particle p of the tracker tr, our algorithm
estimates the conditional likelihood of the new observation yt of the propagated
particle, where the state is xt . For this purpose, we combine different sources of
information, namely the likelihood based on the detector det ∗ , the histogram
distance and the correlation as follows:
ωtr, p = p( yt | xt )
= p D + β ⋅ p H + (1 − β) ⋅ pC .
(4)
The first term pD denotes the likelihood based on the detector det ∗ , which is
computed as follows:
p D = I (tr ) (φ ⋅ c( p ) + ψ ⋅ p N ( p − det ∗ )) ,
1 s(tr , det ∗ ) > threshold
I (tr ) = 
,
0 other
(5)
where c ( p ) denotes the SVM score at the particle position, and p N ( p − det ∗ )
denotes the distance between the particle p and the associated detector det ∗ ,
evaluated under a normal distribution p N . I (tr ) is the indicator function that
returns 1 if a detector is associated with the tracker, namely the matching score
s(tr , det ∗ ) is larger than threshold. Otherwise, 0 is returned. When a matching
98
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
detection is found, this term robustly guides the particles. The parameters φ, ψ and
threshold are set experimentally.
The second and third terms ( pH and pC ) denote the likelihood based on the
histogram distance and the likelihood based on the correlation, respectively. The
former is robust to occlusion or pose change of players, whereas the variance of
particle grows more than the necessity. On the other hand, the latter is weak to
occlusion but it has an advantage in that the particle variance is small. Therefore, the
weight for the former increases, when the distance between players is small (when
occlusion of players tends to occur). In contrast, when the distance is long, the
weight for the latter increases. Due to this method, the players can be tracked
accurately for a long time. In this paper, we employ the Earth Mover’s Distance
(EMD) [7] as the histogram distance and Normalized Cross-Correlation as the
correlation. They are defined as follows:
pH = pH (distEMD ) ,
pC =
∑
∑
i, j
i, j
(6)
{I (i, j ) − I } ⋅ {T (i, j ) − T }
{I (i, j ) − I }2 ⋅
β = p N (min disti ) ,
i∈ P
∑
i, j
{I (i, j ) − T }2
,
(7)
(8)
where distEMD in Eq. (6) denotes EMD.
The normalized cross-correlation pC is defined by Eq. (7). Here, I ( x, y ) and
T ( x, y ) are gray values at point ( x, y ) in the search area and the template image,
respectively. I and T are averaged gray value of the search area and the template
image, respectively.
Eq. (8) denotes the weight of pH and pC . It is defined as the distance between
the target particle and the closest player, evaluated under a normal distribution p N .
disti denotes the distance between the target player and player i, and P denotes a set
of players.
3.4. Calibration
The perspective projection of a pinhole camera model is needed to project world
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D …
99
coordinates ( X w , Yw , Z w ) of state x to image coordinates ( x f , y f ) as shown in
Eq. (9). A is an intrinsic parameter, R is the rotation matrix, and t is the translation
vector. R and t are extrinsic parameters.
Xw
x f 
 
 y  = A[R | t ]  Yw 
f
 
 Zw 
 1 
 1 
 
(9)
A soccer field has known details, such as the length of lines in the penalty area
and the goal area, etc. Therefore, several points on an image are manually associated
with the world coordinates of the soccer field, and then the camera parameters are
calibrated using the Tsai-method [8]. Fig. 3 shows the points used for calibration.
Figure 3. The points used for calibration (red points).
4. Experiment
4.1. Experiment environment
We videoed a film of a soccer game played during the 38th National High
School Soccer Championship in Japan. The image size was 1,280 × 720 pixels with
24-bit color. The image size of the player template was 20 × 40 pixels. The number
of particles was 300, and the frame rate was 30 fps. As the experiment data, the
video sequence as shown in Fig. 4 was used, where the left half of the field located
in the lower half of the image. We extracted 12 video clips of 330 frames (on
average), and evaluated the results in terms of tracking accuracy as defined by Eq.
(10).
1
Tracking Accuracy =
P
P
of correctly tracked frames for player i
∑ Number
Number of frames including player i
i =1
(10)
100
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
Figure 4. Frame shot used in experiment.
If the distance between the correct 2D image coordinate obtained manually and the
coordinate given to a 2D image from a tracking result obtained by Eq. (9) was five
pixels or less, it was defined as correct tracking. P denotes the number of players.
The camera parameters estimated by calibration are shown in Table 1.
Table 1. Estimated camera parameters
Extrinsic parameters
R
t
Intrinsic parameters
Rx
93.29 [deg]
f
3.16 [m]
Ry
5.39 [deg]
k
3.61e-02
Ry
–0.30 [deg]
sx
1.0
tx
–28.38 [m]
sy
1.0
ty
12.25 [m]
Cx
640 [pixel]
tz
71.53 [m]
Cy
360 [pixel]
4.2. Experiment results
Fig. 5 shows the tracking results obtained using the proposed 3D particle filter
compared with the traditional 2D particle filter [1], where we used the normalized
cross-correlation as the likelihood evaluation. The average tracking accuracy was
improved significantly from 63.1% to 70.3%, so it can be said that modeling based
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D …
101
on the 3D world coordinates is effective. But the tracking accuracy decreased for
sample 13, sample 12, sample 09 and sample 05, where the players hardly move or
simply move linearly. It appears that the accuracy decreased in the proposed method,
even though the motion was quite simple, because complex 3D modeling is being
carried out.
Figure 5. Tracking accuracy of the proposed 3D tracking compared with 2D
tracking.
We compared the performance of pD , pH and pC in Eq. (4). The results are
shown in Table 2, where the circle indicates the utilization of the corresponding
likelihood. In
pH , the Bhattacharyya distance and EMD were compared.
Comparing the second line with the third line in Table 2, EMD showed better
performance than the Bhattacharyya distance. It can be said that EMD is more
effective than the Bhattacharyya distance with respect to occlusion. Comparing the
second line with the fourth line in Table 2, pD improved the tracking accuracy
significantly.
Table 2. Tracking accuracy in terms of observation model
pD
pH
pC
Tracking Accuracy [%]
×
×
0
70.30
×
Bhattacharyya
0
72.41
×
EMD
0
73.64
0
Bhattacharyya
0
75.22
102
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
Fig. 6 shows the tracking results for video clips with narrow windows viewed
close-up. A number of occlusion incidents occurred during a short time, but the
Figure 6. Tracking results (close-up view).
proposed method was able to track the players without problems. Fig. 7 shows the
tracking results for video clips with wide windows view from further away. In both
the cases, the players can be tracked accurately even when occlusion occurs.
Figure 7. Tracking results (distant view).
TRACKING OF MULTIPLE SOCCER PLAYERS USING A 3D …
103
5. Conclusion
In this paper, we proposed a novel method for detecting and tracking multiple
objects soccer players in a fixed monocular soccer video. By making use of the state
in the 3D world coordinate space in a particle filter, we were able to achieve a more
realistic model. In addition, the observation model incorporating the different
information sources makes tracking more robust when occlusion or pose changes
occur.
In future, we are planning to carry out work focusing on the detection of more
complex soccer events (e.g., offsides and fouls) using the 3D positional information
of players acquired by the proposed method.
References
[1]
Y. Ariki, T. Takiguchi and K. Yano, Digital camera work for soccer video production
with event recognition and accurate ball tracking by switching search method, IEEE
International Conference on Multimedia & Expo, pp. 889-892, Hannover, Germany,
June 2008.
[2]
T. Nishino, T. Takiguchi and Y. Ariki, Situation recognition using 3D positional
information of ball from monocular soccer image sequence, International Conference
on Multimedia, Information Technology and its Applications (MITA’2009) (2009)
109-112.
[3]
M. Breitenstein, F. Reichin, B. Leibe, E. Koller-Meier and L. V. Gool, Robust
tracking-by-detection using a detector confidence particle filter, The 12th IEEE
International Conference on Computer Vision (ICCV), pp. 1515-1522, Kyoto, Japan,
Sept. 2009.
[4]
G. Zhu, C. Xu, Q. Huang and W. Gao, Automatic multi player detection and tracking
in broadcast sports video using support vector machine and particle filter, IEEE
International Conference on Multimedia & Expo (ICME), pp. 1629-1632, Toronto,
Canada, July 2006.
[5]
Y. Wu, X. Tong, Y. Zhang and H. Lu, Boosted interactively distributed particle filter
for automatic multi-object tracking, IEEE International Conference on Image
Processing (ICIP), pp. 1844-1847, San Diego, U. S. A., Oct. 2008.
[6]
K. Okuma, A. Taleghani, N. D. Freitas, J. J. Littele and D. G. Lowe, A boosted
particle filter: multi target detection and tracking, The 8th European Conference on
Computer Vision (ECCV), pp. 28-39, Prague, Czech, May 2004.
104
TAKURO NISHINO, TETSUYA TAKIGUCHI and YASUO ARIKI
[7]
Y. Rubner, C. Tomasi and L. J. Guibas, The earth mover’s distance a metric for image
retrieval, International Journal of Computer Vision (IJCV) 40(2) (2000), 99-121.
[8]
R. Y. Tsai, An efficient and accurate camera calibration technique for 3D machine
vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
364-374, Miami, U. S. A., June 1986.
[9]
Xinguo Yu, Xin Yan and Yiqun Li, Tool-aided semantics acquisition for live soccer
video, IEEE International Conference on Multimedia & Expo, pp. 893-896, Hannover,
Germany, June 2008.