Recent Advances in Circuits, Communications and Signal Processing
ISBN: 978-1-61804-164-7

Mean-Shift based Object Tracking Algorithm using SURF Features

SOURAV GARG
Innovation Lab, Tata Consultancy Services
Noida, Uttar Pradesh, India
[email protected]

SWAGAT KUMAR
Innovation Lab, Tata Consultancy Services
Noida, Uttar Pradesh, India
[email protected]

Abstract: Mean-shift tracking is primarily used for carrying out a localized search on an image frame using colour histograms. The application of mean-shift tracking directly to SURF features is limited by the small number of key points available for a given object. This paper proposes a method called re-projection to overcome this limitation, so that the mean-shift algorithm can be used directly with SURF descriptors for tracking an object in a video recorded from a non-stationary camera. Since the SURF features are computed only for the object being tracked, the computational requirement is small enough to allow real-time tracking of the object. The efficacy of the approach is demonstrated through various simulation results.

Key-Words: SURF, Mean-shift, Object Tracking, re-projection

1 Introduction

Object tracking in a video sequence is an important problem in computer vision, with applications in areas like video surveillance, vehicle navigation, perceptual user interfaces and augmented reality [1]. It also forms an integral part of vision-based robot tracking technologies [2]. Mean-shift tracking is a local search algorithm based on colour histogram matching [3]. The method is simple and easy to implement, which makes it very popular among colour-based tracking methods [4]. However, colour-based tracking methods are sensitive to variation in illumination conditions and require non-matching backgrounds [5]. This has prompted researchers to look for more distinctive features like SIFT [6] and SURF [7], which have been shown to be robust to photometric and geometric distortions. These robust local point features are being increasingly used for visual object tracking applications [8] [9]. Since SURF is computationally more efficient than SIFT, we focus on object tracking methods that make use of SURF features.

Many works have been reported in the literature which use SURF features for visual object tracking. These works may be broadly classified into two groups: one using SURF features to improve the robustness of colour-based object tracking algorithms, as in [10] [4], and the other using SURF features directly for object tracking, as in [11] [12] [13]. The latter approaches are becoming more popular with the appearance of algorithms that can achieve real-time (near-frame-rate) performance [14] even with these computationally heavy features.

In this paper, we use the mean-shift algorithm directly with SURF features to track objects in a video sequence. This method belongs to the category of interest point based tracking methods [11], which use a method of object recognition based on SURF correspondence. This approach does not require estimation of an object or feature motion model, unlike other approaches that use optical flow or a Kalman filter for such estimations or predictions [8] [13]. However, such motion models may become indispensable if one needs to track partially or fully occluded objects.

Application of the mean-shift algorithm for tracking requires a histogram of the object template, which will be searched for in subsequent frames. The template histogram is formed by creating a fixed number of clusters with the SURF features of the object template. This is similar to the object recognition method used in a bag-of-words approach [15] [16]. This histogram is used by the mean-shift algorithm to localize the object in the next frame. The approach is not straightforward, and one has to address various issues such as the availability of very few descriptors for the object model, depletion of matching features over subsequent frames, presence of outliers, scaling of the tracking window and so on. These problems have apparently discouraged researchers from applying the mean-shift algorithm directly to SURF descriptors. Most of these problems arise because the histogram, created with the limited number of key points available for a given object, may not represent the true pdf of the object model. We overcome this problem by using a method called re-projection, where the histogram of the object template is updated on-line for every frame. The method of re-projection aims to enrich the source histogram by making a homographic projection of the matching points from the target window on to the source window (object model) at the end of each mean-shift convergence. This increases the number of key points in the source window and thus improves the pdf of the object model. This, in turn, overcomes several of the other problems mentioned above.

The proposed method has several advantages. First, the tracking can be carried out in real-time, as the SURF features needed for tracking are computed only over the region containing the target. Second, tracking can be carried out in a video recorded from a non-stationary camera where the background is not static. This is because our approach does not make use of any foreground/background classification methods as used by many other authors [14] [10]. However, we do not consider the cases of partial or full occlusion [13] of the target in this paper; this forms the future scope of the work.

The main contribution of this paper is a method that enables us to use the mean-shift algorithm to track an object using SURF features directly, without using any additional information about the object. According to our literature survey, such a work has not been reported so far and hence we consider this to be a novel contribution in this field.

The rest of this paper is organized as follows. The problem definition is provided in the next section. The mean-shift algorithm for implementing object tracking is provided in Section 3. The simulation and experimental results are provided in Section 4, followed by the conclusion in Section 5.

2 Problem Definition

Consider a set of frames $I_i$, $i = 0, 1, 2, \ldots, N$ of a video sequence where an object identified by the user in the first frame is to be tracked over all the frames. The object is identified by the user by selecting a rectangular region on the first frame. Let this rectangular region be denoted by $W_0$, corresponding to the first image $I_0$. Let $V(I, W) = \{(x_1, v_1), (x_2, v_2), \ldots, (x_n, v_n)\}$ be the set of SURF key points of an image $I$ within the window $W$, where $x_i$ is the 2-dimensional key point location of the SURF descriptor $v_i$. We use $X$ to denote the set of key point locations in the source window (the object model) and $Y$ to denote the corresponding set in the target window. The tracking window $W$ is represented by $W = (c, w, h)$, where $c = (c_x, c_y)$ is the centre of the window with width $w$ and height $h$. Given $I_0$, $W_0$ and $V_0(I_0, W_0)$, the task is to compute the tracking window $W_i(c_i, w_i, h_i)$ for all image frames $i = 1, 2, \ldots, N$.

3 The Method

In this paper, we use the mean-shift algorithm [3] directly on SURF features to track the object in subsequent frames. This is different from other mean-shift based approaches as in [4] [9], where the mean-shift algorithm is used with colour histograms and SURF features are used only for improving its performance based on point correspondences. Mean-shift tracking requires an object histogram model, which is used for searching for the object in the next frame based on histogram matching. The pseudo-code for the tracking algorithm is provided in Table 1. The tracking method consists of the following four steps:

1. Creating object histograms using SURF descriptors. This is done only for the first frame, as described in lines 2-6 of the pseudo-code.

2. Searching for the object in the new frame through histogram matching and localizing the target window using the mean-shift algorithm. The mean-shift iteration is carried out as shown in lines 10-17 of the pseudo-code.

3. Scaling the target window to reflect the correct size of the tracked object. The computation of the scaling coefficient $\alpha$ and the positioning of the target window are done in lines 18 and 19 respectively.

4. Re-projecting the matching key points of the target window (obtained after mean-shift convergence) on to the source window using homography. This is done in lines 20-23 of the pseudo-code. The re-projected point locations are represented by $X'$. The histogram is updated only if the locations of the projected points are close to the original points $X$.

We use the following notation in the remaining part of this section. The source window ($W_s$) refers to the window on the first frame which contains the object to be tracked. The target window ($W_t$) refers to the window on the destination frame where the target is to be searched for or is found.

3.1 Creating Histogram with SURF features

Object Model/Source Window: The SURF descriptors are computed for the source window $W_s = W_0$. The 64-dimensional SURF feature vectors are then clustered into $M$ clusters using the k-means algorithm. These centres become the bins for the object (source) histogram $H_s$. For every SURF descriptor belonging to a centre, the count of the corresponding bin is incremented.
The membership of a descriptor in a cluster is decided based on its minimum distance from all cluster centres. This is similar to the histogram creation process in a bag-of-words approach used for object recognition [15] [16]. The clustering of SURF descriptors is done only once, for the source window. Hence there is no clustering burden during on-line tracking.

Target Window: For a target window on the destination image frame, compute all the SURF descriptors lying within this rectangular region. Now create the target histogram $H_t$ by considering the clusters in the source window as the bins. A descriptor belongs to a particular source cluster if its Euclidean distance to this cluster centre is minimum. Note that no clustering is done for the target window. The target histogram $H_t$ is computed by assigning the target descriptors to the source bins.

Since the mean-shift algorithm works on histogram matching, the outliers (features corresponding to background objects) will pollute the target histogram, as shown in Figure 1. Figure 1(a) is the source window or the object model which is to be tracked in the destination frame. Figures 1(b) and 1(c) are two possible target windows one may come across during the mean-shift search. Target window 2 shows a case where a part of the background is selected as well. The histograms for these three windows are shown in Figure 1(d). It can be seen that the histograms for the source window and target window 1 are similar, while that of target window 2 is significantly different. The points having the same colour within each window refer to the SURF key points which belong to the same cluster. There are 5 clusters representing the 5 bins of the histogram. This difference between the histograms of the source and target window 2 would eventually cause the mean-shift algorithm to drift the target window in a direction where the dissimilarity will decrease.
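The histogram construction of Section 3.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cluster centres are assumed to be already available from k-means on the source descriptors, the toy descriptors are 2-dimensional instead of 64-dimensional, and the function names `nearest_centre` and `build_histogram` are hypothetical. The bin counts are normalized so that the source and target histograms are comparable, which is an assumption in line with the normalized frequencies plotted in Figure 1(d).

```python
import math


def nearest_centre(descriptor, centres):
    """Return the index of the cluster centre closest (Euclidean) to a descriptor."""
    return min(range(len(centres)), key=lambda m: math.dist(descriptor, centres[m]))


def build_histogram(descriptors, centres):
    """Normalized histogram with one bin per source cluster centre.

    The same centres are used for the source and the target window,
    so no clustering is needed for the target.
    """
    counts = [0] * len(centres)
    for d in descriptors:
        counts[nearest_centre(d, centres)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(centres)  # window without any key points
    return [c / total for c in counts]


# Toy data (assumed): two cluster centres and a handful of descriptors.
centres = [(0.0, 0.0), (1.0, 1.0)]
source_descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8)]
target_descriptors = [(0.2, 0.1), (0.1, -0.1), (0.8, 0.9), (1.2, 1.0)]

h_s = build_histogram(source_descriptors, centres)  # source histogram H_s
h_t = build_histogram(target_descriptors, centres)  # target histogram H_t
print(h_s, h_t)
```

Background key points that fall inside the target window are assigned to the nearest source bin like any other descriptor, which is how outliers end up distorting the target histogram.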
Figure 1: Effect of outliers on histogram matching. Panels: (a) Source Window, (b) Target Window 1, (c) Target Window 2, (d) Histogram for three windows (normalized frequency versus bin index). The first row shows the locations of descriptors in the three windows. (a) is the object model which is to be tracked in the next frame. (b) is a target window which is similar to the source window. (c) is a target window which contains a part of the background and hence contributes outliers. Points of the same colour belong to the same cluster. (d) shows the SURF histograms for all three windows. One can see that the histograms for the first two windows are similar compared to that of the third window.

3.2 Mean-Shift Algorithm

The mean-shift algorithm uses mean-shift iterations to find the target window which is most similar to a given object model (source window), with the similarity being expressed by a metric based on the Bhattacharyya coefficient [3]. This, in turn, requires computing histograms of the source as well as the target windows. Once the histograms are computed using SURF features, the mean-shift procedure is identical to that used for the colour histogram based method. The details of the mean-shift algorithm are omitted from this paper for brevity; instead, readers are referred to the original paper by Comaniciu et al. [3]. There is, however, a slight difference between the current implementation and the one with colour histograms, as explained below.

The centre of the new target window computed by the mean-shift algorithm is given by:

$$ z = \frac{\sum_{i=1}^{n} x_i \, w_i \, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} w_i \, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} \qquad (1) $$

where $g(x) = -k'(x)$ is the negative derivative of the kernel profile $k$ and $w_i$ is the weight associated with each key point location $i$ of the source window which has a correspondence in the target window. In the argument of $g$, $x$ is the current window centre and $h$ is the kernel bandwidth, as in [3]. The new centre location depends on the number of correspondences $n$ between the source and the target window. In the colour-based mean-shift algorithm, the correspondences include each and every pixel location within the source window. In our method, which depends on SURF descriptors, the weighted average is computed over the $n$ SURF correspondences available between the two windows. The SURF correspondences between the windows are computed using a minimum distance criterion, with RANSAC for removing outliers.

Table 1: Pseudo-code for updating the tracker window using the Mean-Shift Algorithm

Require: A sequence of frames: {I_i}, i = 0, 1, 2, ..., N
 1: for i = 0 to N do
 2:   if i = 0 then {first frame}
 3:     Select the object using a rectangular window: W_0
 4:     Extract the SURF features of the object: V_0(I_0, W_0)
 5:     Create M clusters using k-means
 6:     Create the object histogram by treating clusters as bins. Denote this object histogram by H_s
 7:   else {for other frames}
 8:     k ← 0 {counter for mean-shift loop}
 9:     Initialize tracker window: W_i(c_i(k), w_i(k), h_i(k)) = W_{i-1}(c_{i-1}, w_{i-1}, h_{i-1})
10:     repeat
11:       k ← k + 1
12:       Compute SURF descriptors V(I_i, W_i(k)) for the target window.
13:       Compute target histogram H_t using the source cluster centres as bins.
14:       Find the set of matching descriptors between the source and target windows.
15:       Compute the new window centre using the matching descriptors: c_i(k) = mean-shift(H_s, H_t)
16:       Update the tracker window on frame I_i: W_i(k) = (c_i(k), α w_i(k − 1), α h_i(k − 1))
17:     until ||c_i(k) − c_i(k − 1)|| ≤ ε
18:     Compute scaling coefficient: α(W_i)
19:     Draw the target window on the image.
20:     Compute re-projected points X_i′ using homography (RANSAC)
21:     if X_i ∼ X_i′ then
22:       Update V_0(I_0, W_0) and H_s
23:     end if
24:   end if
25: end for
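The similarity measure and the centre update of Eq. (1) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the key-point weights follow the colour-histogram formulation of [3] (square root of the ratio of source to target bin frequency), the kernel is assumed to have an Epanechnikov profile so that $g$ is constant inside the bandwidth, and the matched key-point locations are assumed to be given in the coordinates of the current frame. All function names are hypothetical.

```python
import math


def bhattacharyya(h_s, h_t):
    """Bhattacharyya coefficient between two normalized histograms."""
    return sum(math.sqrt(p * q) for p, q in zip(h_s, h_t))


def keypoint_weights(bins, h_s, h_t):
    """Weight of each matched key point from the bin it falls into.

    Assumption: w_i = sqrt(H_s[b] / H_t[b]) as in colour-based mean-shift.
    """
    return [math.sqrt(h_s[b] / h_t[b]) if h_t[b] > 0 else 0.0 for b in bins]


def flat_g(r2):
    """g = -k' for an Epanechnikov profile (constant up to a factor that cancels)."""
    return 1.0 if r2 <= 1.0 else 0.0


def mean_shift_centre(centre, points, weights, bandwidth, g=flat_g):
    """Evaluate Eq. (1) once over the n matched key-point locations."""
    num_x = num_y = den = 0.0
    for (px, py), w in zip(points, weights):
        r2 = ((centre[0] - px) ** 2 + (centre[1] - py) ** 2) / bandwidth ** 2
        k = w * g(r2)
        num_x += k * px
        num_y += k * py
        den += k
    if den == 0.0:
        return centre  # no usable correspondence: leave the window in place
    return (num_x / den, num_y / den)


# Toy example (assumed values): three matched key points near the window.
points = [(10.0, 10.0), (12.0, 10.0), (11.0, 13.0)]
weights = [1.0, 1.0, 2.0]
new_centre = mean_shift_centre((10.0, 10.0), points, weights, bandwidth=5.0)
print(new_centre)
print(bhattacharyya([0.5, 0.5], [0.5, 0.5]))
```

In the tracker of Table 1 this update would be repeated, with the descriptors, the target histogram and the correspondences recomputed for each new window position, until the centre moves by less than the threshold ε.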
3.3 Scaling

Since SURF correspondences are used between the source and the target window, it is difficult to find a bounding box for the object being tracked on the destination image. The solution is to scale the original window based on how much the matching points have scaled up or down in the target window [4] [3]. We use the method described in [4] for scaling the target window. The scaling factor is given by

$$ \alpha = \sum_{k=1}^{\binom{n}{2}} s_k \qquad (2) $$

$$ s_k = \frac{\| y_i - y_j \|}{\| x_i - x_j \|}, \quad (i, j) \to k, \quad k = 1, 2, \ldots, \binom{n}{2} \qquad (3) $$

The locations of matching key points on the source window are represented by $x_j$ and those on the target window by $y_j$, where $j = 1, 2, \ldots, n$ and $n$ is the number of matching descriptors. There are $\binom{n}{2}$ unique pairs of interest points in each window, and $s_k$ is the scaling value for the pair $(i, j) \to k$.

3.4 Improving the source histogram using Re-Projection

The mean-shift tracking algorithm requires a histogram of the object model. Since the number of SURF key points obtained from the source window $W_s$ is limited, the histogram created with this set of key points may not truly represent the object model. In the bag-of-words approach for object recognition [15], a large set of images containing the same object is used for computing the object histogram. It is not possible to have a large number of samples in our case, as the tracking is carried out on-line. To overcome this issue, we propose a method called re-projection, where the matching key points of the final tracking window (obtained after mean-shift convergence) are projected back onto the source window. Since homography (using RANSAC) is used for avoiding wrong correspondences, the projected points lie close to the original key points on the source window, as shown in Figures 2(a) and 2(b). In the worst case, with very poor correspondence (see Figure 2(c)), the projected points may not lie close to the original points, as shown in Figure 2(d). Such points are discarded and not included in the updated histogram. Hence re-projection improves the source histogram by appending projected key points and descriptors to the original set. The improved source histogram leads to faster mean-shift convergence, thereby improving the real-time performance of the algorithm.

The number of key points in the source window increases over time. The dominant key points will have more descriptors in their vicinity compared to others. To keep a check on the total number of descriptors in the source window, we allow at most two re-projections at a given location in the source window. Since there are more descriptors at a given location, the chances of obtaining matching correspondences become higher, which in turn leads to better tracking. The improvement obtained due to re-projection is discussed further in the simulation section, where many of these conjectures are validated.

Figure 2: Understanding re-projection: (a) and (b) refer to the best case with good correspondence, where the re-projected points (red colour) overlap with the original key point locations (cyan colour); (c) and (d) refer to the worst case with poor correspondence, where the re-projected points lie far away from the original point locations.

4 Simulation Results

In order to test our algorithm, we record a video from a non-stationary camera which is mounted on a mobile robot platform following an object. Note that in this video the foreground as well as the background is dynamic, and hence methods based on background subtraction cannot be used [5]. Since our tracking algorithm is based on SURF features, it is necessary to have a significant number of key points within the object model. In order to develop the theory of this paper, we choose a black and white checkered pattern for the target object. However, we will also show that the tracking works well with other natural objects, with similar outcomes.

The mean-shift tracker is initialized by selecting a rectangular region containing the object to be tracked in the first frame. The centre of the tracking window for all subsequent frames is computed using the mean-shift algorithm as described in Table 1. The mean-shift algorithm is considered to have converged if the Euclidean distance between two consecutive centres is less than 3 pixels. The mean-shift iteration loop is stopped whenever it does not converge within 50 iterations.

The tracking performance is expressed in terms of the percentage overlap between the window obtained from the ground truth and the converged target window obtained from our algorithm, as given by

$$ \%\,\text{Overlap} = \frac{|A \cap B|}{|A \cup B|} \times 100 \qquad (4) $$

where $A$ and $B$ represent the sets of pixels in the window from the ground truth and the converged target window respectively, and $|\cdot|$ denotes the number of pixels in a set. A higher value of this quantity represents better tracking performance. Some snapshots of the tracker window along with the ground truth are shown in Figure 4. The red window represents the ground truth while the blue window is obtained from our algorithm. The first row in this figure shows cases where the mean-shift tracker achieves proper scaling and hence leads to correct tracking of the object. The second row shows some of the poor cases where the algorithm tracks the object with improper scaling.

Figure 3: Effect of re-projection on tracking performance (percentage overlap versus frame index, with and without re-projection). Re-projection leads to better tracking, with a higher value of percentage overlap between the ground truth and the target window obtained from our algorithm.

The performance of the tracking algorithm in terms of percentage overlap is shown in Figure 3.
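The scaling value of Eq. (3) and the overlap measure of Eq. (4) can be sketched as follows. Two assumptions are made and should be read as such: the scaling factor is taken here as the mean of the pairwise ratios $s_k$ (the sum of Eq. (2) divided by the number of pairs), since a raw sum would grow with the number of matches, and the windows are treated as continuous axis-aligned rectangles $(c_x, c_y, w, h)$ rather than as discrete pixel sets. The function names are hypothetical.

```python
import math
from itertools import combinations


def scaling_factor(src_pts, tgt_pts):
    """Mean ratio of pairwise distances between matched key points.

    src_pts[i] and tgt_pts[i] are the matched locations x_i and y_i.
    Assumption: the ratios s_k are averaged over all usable pairs.
    """
    ratios = []
    for i, j in combinations(range(len(src_pts)), 2):
        d_src = math.dist(src_pts[i], src_pts[j])
        if d_src > 0.0:  # skip coincident source points
            ratios.append(math.dist(tgt_pts[i], tgt_pts[j]) / d_src)
    return sum(ratios) / len(ratios) if ratios else 1.0


def percent_overlap(win_a, win_b):
    """Percentage overlap of two windows given as (cx, cy, w, h)."""
    def bounds(win):
        cx, cy, w, h = win
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax0, ay0, ax1, ay1 = bounds(win_a)
    bx0, by0, bx1, by1 = bounds(win_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = win_a[2] * win_a[3] + win_b[2] * win_b[3] - inter
    return 100.0 * inter / union if union > 0 else 0.0


# Toy example (assumed values): the target points are the source points doubled.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tgt = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
alpha = scaling_factor(src, tgt)

ground_truth = (5.0, 5.0, 10.0, 10.0)
tracked = (10.0, 5.0, 10.0, 10.0)
overlap = percent_overlap(ground_truth, tracked)
print(alpha, overlap)
```

With fewer than two matched points no ratio can be formed, so the sketch falls back to a factor of 1.0 and leaves the window size unchanged; this fallback is also an assumption.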
As Figure 3 shows, the re-projection method leads to better tracking performance compared to the case when re-projection is not used.

Figure 4: Results for mean-shift tracking based on SURF features. The red window is the ground truth and the blue window is obtained using our algorithm. The first row shows the best cases and the second row shows the worst cases.

The effect of re-projection on our tracking algorithm can be better understood by analyzing Figures 5, 6 and 7. As explained in Section 3.4, re-projection increases the number of SURF descriptors in the source window, which increases the chances of getting better correspondences. This effect is shown in Figure 5, where one can see that re-projection leads to a larger number of matching points between the source and the final target window. With the availability of an improved histogram, it becomes easier to find the target object using the mean-shift algorithm. Figure 6 shows the similarity between the source window and the final tracking window obtained from mean-shift convergence in terms of the Bhattacharyya coefficient. One can conclude that the final target windows obtained with re-projection are more similar to the original object model compared to the case when re-projection is not used. One should also note that the source histogram stabilizes over the subsequent frames, as shown in Figure 7. This means that, as time progresses, the histogram obtained with re-projection comes closer to the actual object description. The improvement in the histogram leads to faster convergence of the mean-shift iterations, as shown in Figure 8. As one can see, with re-projection the mean-shift algorithm takes approximately 2 to 3 steps to converge to the final tracking window.
On the other hand, the number of steps needed for mean-shift convergence increases over time if re-projection is not used.

Figure 5: Effect of re-projection on the number of matching points for each image (number of matching points versus frame index, with and without re-projection). With re-projection, the average number of matching points between the source and target window remains more or less constant for all frames. This number decreases monotonically if re-projection is not used. The trajectory for the re-projection case is shown in green colour.

Figure 6: Effect of re-projection on the Bhattacharyya coefficient (Bhattacharyya coefficient versus frame index). Re-projection leads to a higher value of the Bhattacharyya coefficient (shown by the dashed line) between the source window and the converged target window. Hence the final window is more similar to the object model compared to the case when re-projection is not used (solid line). The y-axis values are averaged over the frame count.

Figure 7: Effect of re-projection on the PDF (normalized frequency of bins, i.e. cluster centres, versus frame index, for Bin-1 to Bin-5). The bin frequency (number of points in each bin) tends to stabilize over the frames. This means that, as more points are added to the source window through re-projection, the resulting histogram represents the true pdf model of the object. The source histogram has 5 bins which represent the clusters created with k-means.

Figure 8: Effect of re-projection on mean-shift convergence (iterations for mean-shift convergence versus frame index). The algorithm with re-projection (dashed line) needs fewer iterations for mean-shift convergence compared to the one without re-projection (solid line). The y-axis values shown are averaged over the number of frames.

In order to corroborate the above findings, the performance of our tracking algorithm is tested on two different example videos, as shown in Figures 9 and 10. In the first video, a human torso is tracked based on the SURF features obtained from the clothes worn by the subject. The second video shows a difficult case where we track a box containing a few letters. In this case, the number of descriptors available in the object model is quite small and they are concentrated over a very small region within the box. In both cases, we are able to track the object satisfactorily. Since it was time consuming to draw the ground truth manually for all the frames, we only show the tracking window obtained from our algorithm in these two figures. The tracking videos will be made available on request.

Figure 9: Mean-shift tracking results for Example 2, where a human torso is being tracked based on the SURF features obtained from the clothes worn by the subject. The blue window is the tracking window obtained from our algorithm.

Figure 10: Mean-shift tracking results for Example 3, where a box with very few SURF descriptors is used for tracking. Scaling is affected due to the availability of fewer descriptors. The blue window is the tracking window obtained from our algorithm.

5 Conclusion

The mean-shift algorithm is a popular method for tracking objects based on colour histograms. Its application to SURF descriptors is limited due to the unavailability of sufficient key points which can be used for computing a reliable histogram for the object model. This paper proposes a mean-shift based object tracking algorithm that uses SURF descriptors for creating histograms. The problem associated with the availability of a small number of key points is resolved by using an approach called re-projection, and it is shown to provide significant improvement in the tracking performance.

References:

[1] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4), December 2006.

[2] Zhigang Bing, Yongxia Wang, Jinsheng Hou, Hailong Lu, and Hongda Chen. Research of tracking robot based on SURF features. In International Conference on Natural Computation (ICNC), pages 3523–3527, Yantai, Shandong, 2010. IEEE.

[3] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean-shift. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pages 142–149, vol. 2, Hilton Head Island, SC, 2000. IEEE.

[4] Jian Zhang, Jun Fang, and Jin Lu. Mean-shift algorithm integrating with SURF for tracking. In Natural Computation (ICNC), pages 960–963, Shanghai, 2011. IEEE.

[5] M. Gupta, L. Behera, and V. K. Subramanian. A novel approach of human motion tracking with mobile robotic platform. In UKSIM Int. Conf. on Computer Modeling and Simulation, pages 218–223, Cambridge, UK, 2011. IEEE.

[6] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, January 2004.

[7] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, Elsevier, 110:346–359, December 2008.

[8] Yuichi Motai, Sumit Kumar Jha, and Daniel Kruse. Human tracking from a mobile agent: Optical flow and Kalman filter arbitration. Signal Processing: Image Communication, 27(1):83–95, January 2012.

[9] Huiyu Zhou, Yuan Yuan, and Chunmei Shi. Object tracking using SIFT features and mean shift. Computer Vision and Image Understanding, 113(3):345–352, March 2009.

[10] S. Haner and I. Y. Gu. Combining foreground / background feature points and anisotropic mean shift for enhanced visual object tracking. In International Conference on Pattern Recognition (ICPR), pages 3488–3491, Istanbul, 2010. IEEE.

[11] Werner Kloihofer and Martin Kampel. Interest point based tracking. In International Conference on Pattern Recognition (ICPR), pages 3549–3552. ACM, 2010.

[12] Duy-Nguyen Ta, Wei-Chao Chen, Natasha Gelfand, and Kari Pulli. SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2937–2944, Miami, FL, 2009. IEEE.

[13] Wei He, T. Yamashita, Lu Hongtao, and Shihong Lao. SURF tracking. In International Conference on Computer Vision, pages 1586–1592, Kyoto, 2009. IEEE.

[14] Steve Gu, Ying Zheng, and Carlo Tomasi. Efficient visual object tracking with online nearest neighbor classifier. In 10th Asian Conference on Computer Vision (ACCV), pages 271–282, Queenstown, New Zealand, 2010. Springer Berlin Heidelberg.

[15] A. Ahmadi, M. R. Daliri, A. Nodehi, and A. Qorbani. Objects recognition using the histogram based on descriptors of SIFT and SURF. Journal of Basic and Applied Scientific Research, 2(9):8612–8616, 2012.

[16] Bart Thomee, Erwin M. Bakker, and Michael S. Lew. TOP-SURF: a visual words toolkit. In Proc. of International Conference on Multimedia, pages 1473–1476, New York, 2010. ACM.