Ball Detection via Machine Learning

RAFAEL OSORIO

Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering,
Royal Institute of Technology, year 2009
Supervisor at CSC: Örjan Ekeberg
Examiner: Anders Lansner

TRITA-CSC-E 2009:004
ISRN-KTH/CSC/E--09/004--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

This thesis evaluates a method for real-time detection of footballs in low-resolution images. The company Tracab uses a system of 8 camera pairs that cover the whole pitch during a football match. By using stereo vision it is possible to track the players and the ball in order to extract statistical data. In this report a method proposed by Viola and Jones is evaluated to see if it can be used to detect footballs in the images extracted by the cameras. The method is based on a boosting algorithm called Adaboost and has mainly been used for face detection. A cascade of boosted classifiers is trained from positive and negative example images of footballs. In this report the objects are much smaller than the typical objects that the method was developed for, and a question that this thesis tries to answer is whether the method is applicable to objects of such small sizes. The Support Vector Machine (SVM) method has also been tested, to see if the performance of the classifier can be improved. Since the SVM is a time-consuming method it has been tested as a last step in the classifier cascade, using features selected by the boosting process as input. In addition, a database of images of footballs from 6 different matches has been produced, consisting of 10317 images used for training and 2221 images used for testing. Results show that detection can be made with improved performance compared to Tracab's existing software.
Content Introduction .............................................................................................................. 1 1.1 Background ........................................................................................................... 1 1.2 Objective of the thesis .......................................................................................... 2 1.3 Hit rate vs. false positive rate ............................................................................... 4 1.4 Related Work ........................................................................................................ 5 1.5 Thesis Outline ....................................................................................................... 9 Image Database ....................................................................................................... 10 2.1 Ball tool ............................................................................................................... 10 2.2 Images ................................................................................................................. 11 2.2.1 Training set .................................................................................................. 11 2.2.2 Negatives ..................................................................................................... 12 2.2.3 Test set ........................................................................................................ 12 2.2.4 Five-a-side.................................................................................................... 13 2.2.5 Correctness .................................................................................................. 13 Theoretical background ........................................................................................... 15 3.1 Overview ............................................................................................................. 15 3.2 Features .............................................................................................................. 16 3.2.1 Haar features ............................................................................................... 16 3.2.2 Integral Image .............................................................................................. 17 3.5 AdaBoost ............................................................................................................. 19 3.5.1 Analysis ........................................................................................................ 19 3.5.2 Weak classifiers ........................................................................................... 20 3.5.3 Boosting ....................................................................................................... 21 3.6 Cascade ............................................................................................................... 22 3.6.1 Bootstrapping .............................................................................................. 23 3.7 Support Vector Machine ..................................................................................... 24 3.7.1 Overfitting ................................................................................................... 24 3.7.2 Non-linearly separable data ........................................................................ 25 3.7.3 Features extracted with Adaboost .............................................................. 
  3.8 Tying it all together
Method
  4.1 Training
  4.2 Step size and scaling
  4.3 Masking out the audience
  4.4 Number of stages
  4.5 Brightness threshold
  4.6 SVM
  4.7 OpenCV
Results
  5.1 ROC-curves
  5.2 Training results
  5.3 Using different images for training
    5.3.1 Image size
    5.3.2 Image sets
    5.3.3 Negative images
  5.4 Step Size
  5.5 Real and Gentle Adaboost
  5.6 Minimum hit rate and max false alarm rate
  5.7 Brightness threshold
  5.8 Number of stages
  5.9 Support Vector Machine
  5.10 Compared to existing detection
  5.11 Five-a-side
  5.12 Discussion
Conclusions and future work
  6.1 Conclusions
  6.2 Future work
Bibliography
Appendix 1
  Training set
  Test set 1

Chapter 1
Introduction

In this chapter the circumstances of the problem are presented, as well as the goal of the thesis. Related work is described and an outline of the thesis is given.

1.1 Background

This Master's thesis was performed at Svenska Tracab AB. Tracab has developed real-time camera-based technology for locating the positions of football players and the ball during football matches. Eight pairs of cameras are installed around the pitch, controlled by a cluster of computers. Fig 1 shows how pairs of cameras give stereo vision and how this makes it possible to calculate the X and Y coordinates of an object on the pitch.

Fig 1 - Eight camera pairs cover the pitch giving stereo vision.

With this information it is possible to extract statistics such as the total distance covered by a player, a heat map where a warmer color means that the player has spent more time in that area of the pitch, completed passes, speed and acceleration of the ball and of the players, and much more. The whole process is carried out in real time (25 times per second). The system is semi-automatic and is staffed with operators during the game. All moving objects that are player-like are shown as targets by the system. The operators need to assign the players to a target, since no face recognition or shirt number identification is done to identify the players. They must also remove targets that are not subject to tracking, e.g. medics and ball boys. One big advantage of the system is that it does not interfere with the game in any way. No transmitters or any other kind of device on the players or the ball are used.

1.2 Objective of the thesis

The objective of this Master's thesis is to improve the ball detection using machine learning techniques. Today the existing ball tracking method primarily uses the movement of an object to recognize the ball, rather than its appearance. In this report we will see if it is possible to shift the focus from using the movement to doing object detection in every frame. A key requirement of the method used is that it has to be fast enough for real-time usage. Tracab's technology is already good at detecting moving balls against a static background, so an aim for this project is to produce reasonable ball hypotheses in more difficult situations such as:

- The ball is partially occluded by players.
- The lighting conditions are uneven, especially when the sun only lights up a part of the pitch.
- Other objects, like the head or the socks of a player, look like the ball.
- The ball is still, e.g. at a free kick.

A classifier is to be trained to detect footballs, based on a labeled data set of ball / non-ball image regions from images captured by Tracab's cameras. When talking about image regions in this report, a smaller sub-window that is part of the whole image is what is meant (left of figure 2). When only talking about an image, the whole image is meant (right of figure 2).

Fig 2 Example of an image region and an image captured by Tracab's cameras.

The classifier needs to be somewhat robust to changes in ball size, and preferably also ball color, since these differ between situations.
One big difference between this project and previous studies of object detection, such as the paper by Viola and Jones, is the size of the object [32]. Here it is very small, only a few pixels wide. A big question is whether the method presented in this report can be applied to objects of this size.

Even with reasonably good detection of the ball it is difficult to tell the ball apart from other objects using only techniques based on the analysis of still images. One way of solving this is to examine the trajectory of the object in a sequence of images, discarding objects that do not move like a ball. Also, if the classifier detects the ball most of the time, only missing a few frames at a time, it is possible to do post-processing to calculate the most likely ball path between two detections. These two steps are already used today at Tracab and are not a part of this thesis. Hopefully results from this thesis can be used to provide more accurate detections, thus improving the data input to these steps and reducing the amount of calculation that needs to be done in them.

The aim of this thesis is to evaluate whether a machine learning approach can be used to detect a small object such as a ball. More specifically, different algorithms based on the work of Viola and Jones, which uses Adaboost, will be evaluated [32]. In addition, an extension of their work that uses a SVM at the last stage has been tested and evaluated, inspired by the study by Le and Satoh [13]. An overview of how the methods are combined can be seen in fig 3.

Fig 3 Overview. Image regions of an image are extracted and a cascade of classifiers trained with Adaboost is run for each image region. A bright pixel is searched for before running a SVM classifier on the regions that have not been rejected in earlier stages.

1.3 Hit rate vs. false positive rate

While it is possible to get close to 100% in hit rate, this will probably also lead to a very high false positive rate. Having a hit rate of 80% and 0.1% false positives could be great in some situations and for some applications; in other situations it is not acceptable. In a medical environment, for example, this may not be good enough, because you really want to be certain before giving some risky medicine. In this application there is no fixed limit that has to be reached. Instead it is the ratio between the hit rate and the false alarm rate that is interesting. Results will therefore be presented using Receiver Operating Characteristic (ROC) curves, with the hit rate on the y-axis and the false positive rate on the x-axis. The different operating points are obtained by varying a sensitivity parameter. In many applications a threshold is varied to get different detection rates. In this thesis the number of stages used for detection and the step size used during detection are varied to get the different rates. ROC-curves are commonly used to present this kind of results, which makes it easier to compare these results with others. A more detailed description of the ROC-curves is given in Section 5.1 along with the results. Somewhat promising results have been achieved, as seen in fig 4. Compared to the existing method at Tracab we can see small improvements. More results can be seen in Chapter 5.

Fig 4 Results compared to an existing method used by Tracab.
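To make the construction of such a curve concrete, the sketch below computes one ROC operating point from detection counts; it is a minimal illustration with made-up numbers, not output from the actual experiments.

```python
# Minimal sketch: computing one ROC operating point (hit rate, false
# positive rate) from detection counts. All counts here are hypothetical.

def roc_point(true_positives, false_negatives, false_positives, negatives_scanned):
    """Hit rate = fraction of real balls detected; false positive rate =
    fraction of scanned non-ball regions wrongly accepted."""
    hit_rate = true_positives / (true_positives + false_negatives)
    false_positive_rate = false_positives / negatives_scanned
    return hit_rate, false_positive_rate

# One point per sensitivity setting, e.g. per number of cascade stages used.
print(roc_point(true_positives=1800, false_negatives=421,
                false_positives=950, negatives_scanned=1_000_000))
```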
The open source project OpenCV has been used for most of the algorithms in this project [38]. Preprocessing of the image data has been done using Matlab. The SVM part has been done using libSVM [6], which has been integrated with OpenCV.

1.4 Related Work

Much of the research in object detection focuses on face detection and texture classification. Little work has been done on ball detection. Research on face detection and texture classification is therefore also presented. More work has been done on ball tracking in multiple image sequences, but this is not of interest for my work and will thus not be considered. The organization of this chapter is as follows: first, specific research on ball detection is presented; then, the most interesting methods found for face detection and texture recognition.

Much of the research on color-independent ball detection has been done using some kind of edge detection. A Circle Hough Transform (CHT) is used by D'Orazio and Guaragnella for circle detection [9]. Edges are first obtained by applying the Canny edge detector. Each edge point contributes a circle of radius R to an output accumulator space. When the radius is not known in advance, the algorithm needs to be run for all possible radii, which requires a large amount of computation. Another disadvantage is that this method handles noisy images poorly. The sub-windows found by the CHT step are then used to train a neural network with different wavelet filters as features. The Haar wavelet was found to be the best among Daubechies 2, Daubechies 3 and Daubechies 4 with different decomposition levels. A faster version of the Circle Hough Transform is used for ball detection by Scaramuzza et al [27]. Coath and Musumeci use an edge-based arc detection to detect partially occluded balls [7].

Ville Lumikero thresholds a grayscale image into binary form [19]. He applies morphological operations such as dilation and erosion to "clean" the images from noise and to fill up holes in objects. The ball candidates are then thresholded by size and color. The remaining ball candidates are further processed by a tracking algorithm not described here.

Ancona et al [1, 3] implement an example-based method for ball detection by training a Support Vector Machine (SVM; see Burges' tutorial for more information [5]) with positive and negative example images of a ball. Images of footballs, 20x20 pixels large, are used as input to the SVM. The images are preprocessed with histogram equalization to reduce variations in image brightness and contrast. The SVM is a robust algorithm that searches for the hyperplane that maximizes the minimum distance from the hyperplane to the closest training point. SVMs are described in more detail in Section 3.7.

For face detection, a bootstrapping technique is used by Osuna et al to improve the performance of a SVM [23]. Misclassified images that do not contain faces are stored and used as negative examples in later training phases. The importance of this step was first shown by Sung and Poggio [30]. To reduce the computation time of the SVM, a reduced set of vectors is introduced by Romdhani et al [26]. By applying the reduced vectors one after the other it may not be necessary to evaluate all of them. If an image is considered very unlikely to be a face early in the chain, it can be discarded quickly.
Regular detectors based on a single classifier, such as the SVM without the reduced vector set, are slow because they spend as much time evaluating negative regions as positive ones. The approach of a cascaded, coarse-to-fine detector has been used widely in the literature. One of these cascade approaches is the algorithm proposed by Viola and Jones [31, 32]. Their algorithm uses a cascade of boosted classifiers to detect faces and mimics a decision tree. In the first stages of the cascade only a few features are evaluated. Further down the cascade the stages become more complex; thus a lot of non-faces can be thrown away with little effort, while face candidates are examined more thoroughly. Viola and Jones have shown that their final classifier is very efficient and well suited for real-time applications (15 frames of 384x288 pixels per second on a 700 MHz Pentium III).

Much of their work is based on Papageorgiou et al, who did not work directly on pixel values [24]. Instead they use a set of Haar filters as features. These features are selected by evaluating them in a statistical way and are then used as input to a SVM, which classifies new images. The features in Viola and Jones' report are selected with the help of a boosting technique called Adaboost. Adaboost is a boosting algorithm that enhances the performance of a simple, so-called "weak" classifier. Haar rectangles are used as features, so the task for the weak learner is to select the rectangle with the lowest classification error on the examples. After each round of learning, the examples are re-weighted so that the examples that were misclassified are given greater importance in the next step. Interestingly, it has been shown in a study by Freund and Schapire that this approach drives the training error down exponentially in the number of rounds [10].

Viola and Jones also use a fast way of computing the Haar features by pre-calculating the Integral Image. The Integral Image entry II(x, y) is the sum of the rectangle to the left of and above the point (x, y). This makes it possible to calculate any rectangular sum with four array references. Thanks to this, the resulting detector is easily shifted in scale and location. A simple tradeoff between processing time and detection rate can be made by experimenting with the step size. This report is largely based on the work by Viola and Jones, but also considers extensions of their work. This is mainly due to the efficiency of the method shown in their study [32], as well as the promising proofs about the error that are explained in Section 3.5.1.

Many extensions to the work of Viola and Jones have been made, for example in the study by Mitri et al, where it is used for ball detection [21]. First, images are preprocessed by extracting edges using a Sobel filter. These images are then used as input to train the classifier. A variant of Adaboost called Gentle Adaboost is used, as proposed by Lienhart et al [15]. There it is shown that, for the face detection problem, Gentle Adaboost outperforms both Discrete Adaboost (the original Adaboost is nowadays often referred to as the discrete version) and Real Adaboost, which was introduced by Schapire and Singer as an extension of the original version [28]. Lienhart et al also extend the work done by Viola and Jones by introducing the Rotated Integral Image [14].
With the help of this, an extended set of Haar features is used, which improves the detection rate of the classifier while the required computation time does not increase to the same extent.

Some improvements to the weak classifiers are made by Rasolzadeh [25]. The feature responses of the Haar wavelets for both the positive and the negative examples used by Viola and Jones are modeled using normal distributions. By doing this it is possible to improve the discriminative power of the weak classifiers by using multiple thresholds (i.e. two new thresholds are introduced where the two distributions intersect). In Viola and Jones' algorithm a single threshold is used by the weak classifiers to separate the two classes. They further show that this multi-thresholding procedure is a specific implementation of response binning. By calculating two histograms of the feature response, for the positive and the negative examples, over a certain number of bins, it is possible to determine multiple thresholds without modeling the response as normal distributions. The weak classifier hypothesis then consists of comparing the two histograms of the feature response. This change can be implemented by replacing the old weak classifiers directly, without any other major changes to the algorithm. Their results suggest some improvement of the detection rate while keeping the computing time low.

Another extension of the Viola and Jones algorithm, which combines their work with a Support Vector Machine at the last stage of the cascade, is presented in a study by Le and Satoh [13]. Two new stages are added to the cascade of classifiers. First, a stage to reject non-face regions more quickly has been added. It does so by using a larger window size and a larger step size. As the last stage of the new cascade a SVM is used. The Haar features that were selected by Adaboost in the previous stages are used to train the SVM. As they are already calculated, no recalculation is needed. In contrast to the extension of the Haar features proposed by Lienhart et al [14], a reduction of these features is implemented here. This is mainly done to reduce training time, and their experiments show that no rejection efficiency is lost.

Wu et al [33] have been able to devise an algorithm for training the cascade that is roughly 100 times faster than that of Viola and Jones. The difference is that Wu et al only train each weak classifier once per node, instead of once for each feature in the cascade. Liu et al map the feature response into two histograms (one for the positive set and one for the negative set) and search for the feature that maximizes the divergence of the two classes [17]. This is done by using the Kullback-Leibler (KL) divergence as the margin of the data. Results are promising compared to Adaboost, but the final classifier is slower (2.5 frames of 320x240 pixels per second on a P4 1.8 GHz).

Lin and Liu train 8 different cascades to handle 8 different types of occluded faces [16]. If a sample is detected as a non-face, only the trained classifiers that contain features that do not intersect with the occluded part are evaluated. A majority of these classifiers should give positive responses if the sample was indeed a face. The sample is then evaluated with one of the new cascades. These additional stages result in a three times longer computing time (18 frames of 320x240 pixels per second on a P4 3.06 GHz). To avoid overfitting, the mechanism for selecting weak learners during boosting is reconsidered.
Influenced by the Kullback-Leibler Boost, they use the Bhattacharyya distance as a measure of the separability of the two classes. They claim that the Bhattacharyya coefficient is much easier to compute than the Kullback-Leibler distance while maintaining the same performance.

A totally different approach to feature detection has been developed by Lowe, called SIFT [18]. It is a highly appreciated method that has been widely used and proven effective [20]. It has a high matching percentage and is robust to lighting variations compared to other local feature methods. The most interesting part of this work is how the features are described. Points of interest are found by looking for peaks in the gradient image. Descriptors representing the local image gradients are extracted from the area around these points of interest. A 4x4 grid is constructed, and for each of these "bins" a histogram of 8 gradient orientations is computed. This representation has the advantage that it is good at capturing the small distinctive spatial patterns of an object. Rotation invariance is achieved by relating all the gradient orientations to a reference orientation. When a new image is to be classified, these descriptors are matched, using a nearest neighbor method, to the descriptors in the database trained with example images. For further reading, a study by Mikolajczyk and Schmid evaluates an improvement to SIFT and compares different local descriptors [20].

A similar approach to SIFT is a feature descriptor called SURF [4]. The main difference is that instead of gradients, Haar features are calculated around a point of interest and represented as vectors. By using the Integral Image, the calculations can be made faster.

Local Binary Patterns is another approach, used for texture classification by Ojala et al [22]. For each pixel, an occurrence histogram of features is calculated. A feature is represented by the signs of the differences in value between the center pixel and its neighbors (positive or negative responses). The neighbor pixels are chosen as equally spaced pixels on a circle of radius R. In the report it has been tested with different values of R and different numbers of neighbors. With this representation the output can take as many as 2^P different values, P being the number of neighbor points. Certain local binary patterns are overrepresented in the textures, and these patterns share the property of having few spatial transitions. These are called uniform, and the definition is that they contain less than three 0/1 changes in the pattern. By making the representation rotationally invariant and by only considering uniform patterns, this number is reduced significantly. Noticeably, the occurrence histogram does not save the spatial layout of the image; it only stores information about the frequency of the local features.

1.5 Thesis Outline

The rest of the thesis is organized into five chapters. In Chapter 2, the database of images needed for training and testing is described. Chapter 3 explains the theory behind the methods used, both the method proposed by Viola and Jones, based on a cascade of classifiers trained with Adaboost, and the method of using a Support Vector Machine as a classifier. It also describes how these two methods can be combined in different ways. Chapter 4 describes how the methods were used in this specific problem setting. Experimental results are shown and discussed in Chapter 5, and conclusions are drawn in Chapter 6, along with proposed future work.
Chapter 2
Image Database

The image database was produced semi-automatically from example images taken from the Tracab system cameras. This chapter describes how this was done and how the image sets have been created.

2.1 Ball tool

The images are extracted using a tool that Eric Hayman at Tracab developed. The procedure was to look at an image, click on the ball and try to center the position as exactly as possible. At the same time the images were labeled with the degree of freeness of the ball and with the contrast in the image. A free ball is not in contact with anything and is completely visible. The higher the number on the freeness scale, the closer the ball is to other objects. In the same way, the higher the contrast number, the harder it is to distinguish the ball from the background. Examples can be seen in fig 5. The resulting information is a text file holding the path to the image, the x and y position of the ball in the image, the degree of freeness and the degree of contrast of the ball. Later this information is used to extract image regions that consist of the ball in the center along with some of the background. This way it is easy to do tests using image regions of different sizes for training, as described in Section 5.3.1.

Example (row created by the ball tool):

  Image path                          x       y      free  contrast
  "C:\HEIF-GAIS\ball.000001.04.jpg"   116.53  25.10  2     3

Description of the two scales:

Freeness
  1. Free
  2. Close to a player/line but not in contact
  3. In contact with the player/line but not over it
  4. Over the player/line
  5. Partially occluded by a player

Contrast
  The contrast scale ranges from 1 to 4, where 1 represents a very high contrast where the ball is easily distinguishable from the background, and 4 represents a very low contrast where it is difficult to separate the football from the background.

2.2 Images

The images taken by the 16 cameras and saved by the ball tool all have a resolution of 352x288 pixels. The whole images are saved by the ball tool as RGB images but are converted into grayscale for training and detection. The transformation from RGB to grayscale has been done in the same way as when calculating the luminance Y of the Y'UV color space [35]: Y = 0.299R + 0.587G + 0.114B. Each channel has color values in the range [0..255], so Y lies in [0..255] as well.

The images have been split into two groups: a training set and a test set. The training set, along with the negative samples, is used to train the cascade, while the test set is used to measure performance.

2.2.1 Training set

10317 positive example images were extracted for training from 6 different matches in Allsvenskan, the Champions League and an international match from 2007, all with different lighting conditions. Table 1 shows the number of images that have a value equal to or lower than the contrast/freeness measures indicated on the left and at the top of the table.

Table 1 The different types of images in the training set (cumulative counts).

  Contrast\Free     1      2      3      4      5
  1                75     83     83     84     84
  2              3258   4988   5898   5982   6031
  3              4651   7447   9194   9446   9582
  4              4840   7845   9777  10120  10317

2.2.2 Negatives

To construct negatives, the same images are used as for the positive samples. The ball is removed from the image by setting the pixels in the area of the ball to black. The whole 352x288 image is then saved and labeled as a negative image. During training, image regions that are detected as positives are then extracted from these images and used as negatives, since we know that there is no football in them. This procedure is called bootstrapping and is described further in Section 3.6.1.
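The sketch below illustrates the two preprocessing steps just described: the RGB to grayscale conversion and the construction of a negative image by blacking out the area around the labeled ball position. The function names and the blacked-out radius are illustrative assumptions, not Tracab's actual code.

```python
# Minimal sketch of the preprocessing described in Sections 2.2 and 2.2.2.
# The blacked-out square around the ball is an approximation of "the area
# of the ball"; the radius of 5 pixels is an assumption.
import numpy as np

def to_grayscale(rgb):
    """Y'UV luminance: Y = 0.299R + 0.587G + 0.114B, kept in [0..255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def make_negative(image, ball_x, ball_y, radius=5):
    """Return a copy of a frame with the ball region set to black, so the
    whole frame can be labeled as a negative example."""
    neg = image.copy()
    h, w = neg.shape[:2]
    x0, x1 = max(0, int(ball_x) - radius), min(w, int(ball_x) + radius + 1)
    y0, y1 = max(0, int(ball_y) - radius), min(h, int(ball_y) + radius + 1)
    neg[y0:y1, x0:x1] = 0
    return neg

frame = np.random.randint(0, 256, (288, 352, 3), dtype=np.uint8)  # stand-in frame
gray = to_grayscale(frame)
negative = make_negative(gray, ball_x=116.53, ball_y=25.10)       # ball tool row values
```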
2.2.3 Test set

2221 images were extracted from other sequences in the same matches, and testing is done on these. The image set is expected to have the same ratio of different kinds of images as a match has in general. Tests are done on images from the same matches as used in training, to avoid the problem of not having enough variety in the training data. This set is called test set 1, and the distribution of different images can be seen in table 2. The ratio of the different image types is similar to the ratio in the training set. The ratios in percent for the two sets can be seen in Appendix 1.

Table 2 The different types of images in test set 1 (cumulative counts).

  Contrast\Free     1      2      3      4      5
  1                 6     12     12     12     12
  2               659   1071   1277   1284   1287
  3               951   1612   2097   2131   2139
  4               972   1643   2149   2202   2221

Tests have also been done on a match not used in training, though we should not expect good results in general on matches not used during training when there are only 6 different matches to train on. From this match there are 884 images. This set is called test set 2.

Fig 5 Examples of image regions with different properties. (a) 2 on the free scale and 3 on the contrast scale. (b) 1 on the free scale and 2 on the contrast scale.

2.2.4 Five-a-side

New images have been collected from a five-a-side match where the cameras are positioned closer to the pitch. This is possible since the pitch in a five-a-side match is about 16 times smaller than a normal pitch. This setup gives images of the footballs in higher resolution, since they come closer to the cameras. The footballs are now between 2 and 8 pixels in radius in the images, which is significantly larger than before, and the texture of the ball now becomes visible. The training set contains a total of 5937 images and the test set contains 2068 images, extracted in the same way as the training set. For this set no analysis of the quality (freeness and contrast) of the images has been done. Also, to save time, the process of extracting footballs from the images has been made faster by mostly including easy targets.

2.2.5 Correctness

The data set contains football images of variable size and with a wide range of lighting conditions. Balls that were close to the cameras are larger than those far away from the cameras; they can vary by up to a couple of pixels in diameter. It is questionable whether the variation in lighting conditions in the extracted images is enough to capture the variance there is in reality between all different matches. The optimum would be to also have images from a much wider range of matches, to be able to generalize completely. This has not been done, due to the large amount of time it takes to extract footballs from images manually. The same can be said about the problem of having to deal with different kinds of footballs. They are not always white; some are black and white checkered and others are even red. This could be solved by training several cascades. It is also uncertain whether the test set represents a general set of images. To be able to detect the football as often as possible it is optimal to have a training set that represents all the different image types that are present during a game.
Hopefully this is achieved automatically when taking a wide range of images without any special selection process. Another thing that could affect the results in a negative way is the labeling of the data in table 1 and table 2. This labeling is, as always when humans are involved, a result of subjective reasoning. Also, according to research done by gestalt psychologists, the eye is easily fooled [37].

Chapter 3
Theoretical background

This chapter gives an overview of the general approach used in the project and describes the theory needed to understand the method. It covers the following areas: training a boosted classifier using Adaboost, combining the trained classifiers into a coarse-to-fine cascade, and training a Support Vector Machine to be used as the last stage of the classifier.

3.1 Overview

The algorithm used in this report is largely based on the work of Viola and Jones from 2001 [31]. It is a popular method that has been widely used (see Related Work). Some proofs about its generalization ability and the bound on the error have been made, which makes the algorithm very interesting. More about this can be found in the analysis of the method in Section 3.5.1. The method has mainly been developed and evaluated for face detection rather than ball detection.

The algorithm works in the following way. A classifier is trained using positive and negative image regions of objects of the same size. The classifier consists of several so-called weak classifiers (3.5.2), built from Haar-like features (3.2.1), which are trained using a boosting technique called Adaboost (3.5). The boosted weak classifiers are combined into a cascade of coarse-to-fine classifiers (3.6). The idea is to reject a lot of non-objects in the early stages, where the computation is light, reducing processing time, while positives are processed further. When classifying an image region, the classifier outputs whether the object is detected or not, like a binary classifier. The classifier is easily scaled and shifted to be able to detect objects at different locations and of different sizes in an image.

This algorithm is combined with a Support Vector Machine (SVM) at the end, as described in Section 3.7. SVMs have been reported to be good classifiers [26]. The SVM only needs to evaluate the image regions selected by the Adaboost classifier in the previous stage, which makes it faster. Otherwise, a big disadvantage of the SVM method is that it may be too slow for real time [23]. An overview of how the method works can be seen in fig 6.

Fig 6 System overview. A cascade of classifiers trained with Adaboost is combined with a brightness threshold and a SVM classifier as the last stage. Image regions that make it through the system without being rejected are classified as footballs. Notice: same figure as fig 3.

3.2 Features

A feature is the characteristic that is used to distinguish objects from non-objects. The two main reasons why features are used instead of the pixel values directly are that they improve speed (explained in more detail in Section 3.2.2 about Integral Images) and that they can capture different kinds of properties in an image. Any feature can be used, such as the total sum of an area, variance, gradients etc.

3.2.1 Haar features

In this thesis, differences in pixel value between adjacent rectangles are used as features.
The features can be seen in fig 7. The sum of the pixels in the white rectangle is subtracted from the sum of the pixels in the black rectangle. The resulting difference is called the feature response.

As shown in fig 7, 14 different features are used. With a base resolution of the detector of 12x12 pixels, the total number of possible features in my setup is 8893. This is a large number of features, but as we will see only a portion of the entire set will be needed. The features are called Haar-like features because they mimic the behavior of the Haar wavelet basis functions. Much like gradients, they capture the change in pixel value rather than the pixel value itself. They are insensitive to differences in mean intensity and to scale.

Fig 7 The extended set of features as Lienhart et al [14] suggested.

Features 1a-b, 2a-d and 3a in figure 7 were in the original set of features used by Viola and Jones. The rest of the features were introduced by Lienhart et al [14]. The new set consists of 7 additional rectangle features that have been rotated by 45 degrees. Having a larger set of features makes it possible to capture the properties of the object more accurately, but it also affects the time it takes to train the classifier, since there are more features to evaluate. However, it does not automatically mean that a larger number of features will be used by the final classifier, which is what would affect the speed of the final classifier. To speed up the calculation of the features, an integral image is used.

3.2.2 Integral Image

An Integral Image is a matrix made to simplify the calculation of the sum of an upright rectangular area in an image. It is a pre-calculation step made to speed up other calculations. The value of the Integral Image (II) at II(x,y) is the sum of all the pixels of the original image (OI) up and to the left of OI(x,y). An example can be seen in fig 8.

  Original image      Integral Image
  155 201 226         155 356 582
   98  78  48         253 532 806
   14 111  44         267 657 975

Fig 8 The original image (left) and the corresponding Integral Image (right).

This makes the calculation of the sum of any rectangle in the image faster. The formula is:

  II(x,y) = II(x-1,y) + II(x,y-1) - II(x-1,y-1) + OI(x,y)    (1)

where OI(x,y) is the original image and II(x,y) is the Integral Image. Once the Integral Image is calculated, the rectangle D in figure 9 can be computed from the four corner points as:

  D = 4 - 3 - 2 + 1 = (A+B+C+D) - (A+C) - (A+B) + A    (2)

Fig 9 Any rectangle D can be computed from the Integral Image by 4 - 3 - 2 + 1.

With the help of the Integral Image, any rectangle can thus be calculated with only 4 array references. Differences between two adjacent rectangles (the edge features) can be calculated with six array references, while three adjacent rectangles (the line features) require eight array references. Since the rectangles are small (the training samples are 8-15 pixels in both width and height), and therefore so are the features, the Integral Image does not help in all cases. When the rectangles are small enough, it is faster to do the calculations directly on the pixels; this is true when the rectangles are smaller than 3 pixels. But the loss in time from using the Integral Image in those cases is very small. The calculations with the Integral Image are done in O(1).

For the rotated rectangles a different Integral Image is needed, called the Rotated Integral Image. The idea is the same as before, and following Lienhart et al [14] it can be calculated with the recurrence:

  RII(x,y) = RII(x-1,y-1) + RII(x+1,y-1) - RII(x,y-2) + OI(x,y) + OI(x,y-1)    (3)

where RII is the Rotated Integral Image.
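To make the bookkeeping concrete, here is a small sketch, assuming nothing beyond formulas (1) and (2) above, that builds an upright Integral Image and uses it to evaluate a rectangle sum and a two-rectangle edge feature. The feature layout at the end is illustrative.

```python
# Minimal sketch of the upright Integral Image and its use, following
# formulas (1) and (2) above (cumulative sums replace the recurrence).
import numpy as np

def integral_image(oi):
    """II(x,y) = sum of all pixels up and to the left, inclusive."""
    return oi.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of a rectangle using 4 array references: 4 - 3 - 2 + 1."""
    def at(y, x):
        return ii[y, x] if y >= 0 and x >= 0 else 0
    return (at(top + height - 1, left + width - 1)   # point 4
            - at(top + height - 1, left - 1)         # point 3
            - at(top - 1, left + width - 1)          # point 2
            + at(top - 1, left - 1))                 # point 1

oi = np.array([[155, 201, 226],
               [ 98,  78,  48],
               [ 14, 111,  44]], dtype=np.int64)
ii = integral_image(oi)           # matches fig 8: ii[2, 2] == 975
total = rect_sum(ii, 0, 0, 3, 3)  # 975, the whole image

# A vertical two-rectangle (edge) feature: white sum subtracted from black.
white = rect_sum(ii, 0, 0, 3, 1)
black = rect_sum(ii, 0, 1, 3, 1)
response = black - white          # the feature response
```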
3.5 AdaBoost

The principal idea of boosting is to combine many weak classifiers into a more powerful one. This is motivated by the observation that it is difficult to find one single highly accurate classifier. The T weak classifiers h_t are combined into a strong classifier by:

  H(x) = sign( Σ_{t=1..T} α_t h_t(x) )    (4)

where the weights α_t are found during boosting. The weak classifiers are used to select, among the large number of possible features, the features that best separate the two classes: objects and non-objects. This kind of feature selection was first done in a statistical manner by Papageorgiou et al [24]. By using Adaboost this selection process can be optimized when the best feature is not obvious. The boosting step of the algorithm is done by re-weighting the examples, putting more weight on the difficult ones. A new round of feature selection is then done with the new distribution. The weak classifiers add up to strong classifiers, which are combined to construct a cascade of strong classifiers.

3.5.1 Analysis

It has been proven by Schapire and Freund that the error of the final classifier drops exponentially fast if it is possible to find weak classifiers that classify more than 50% of the examples correctly [29]. The training error of the final classifier is at most:

  Π_{t=1..T} [ 2 sqrt( ε_t (1 - ε_t) ) ] ≤ exp( -2 Σ_{t=1..T} γ_t² )    (5)

where ε_t = 1/2 - γ_t is the error of the t-th weak hypothesis. The bound on the error of the final classifier improves whenever any of the weak classifiers is improved. In the same article it was also shown that the generalization error of the final classifier is, with high probability, bounded in terms of the training error. This means that the final classifier is likely to generalize well on samples it has not seen before. They show that, with high probability, the generalization error is less than

  Pr[ H(x) ≠ y ] + O( sqrt( T d / m ) )    (6)

where Pr[] is the empirical probability on the training sample, T is the number of rounds of boosting, m is the size of the sample and d is the VC-dimension (Vapnik-Chervonenkis dimension) of the space of base classifiers. The VC-dimension of a hypothesis space H defined over an instance space X is the size of the largest finite subset of X shattered by H. Further explanation of the VC-dimension can, for example, be found in the tutorial by Sewell [34]. Schapire and Freund's analysis implies that overfitting may be a problem if training is run for too many rounds. However, their tests showed that boosting does not overfit even after thousands of rounds. They also found that the generalization error decreases even after the training error has reached zero. These are promising results that motivate the use of this method.

3.5.2 Weak classifiers

A weak classifier is a simple classifier with only two prerequisites: it must be better than chance, i.e. classify more than 50% of the samples correctly, and it must be able to handle a set of weights over the training examples. The weights are needed in the boosting step. In this case the weak classifier consists of one feature along with a threshold:

  h_j(x) = 1 if p_j f_j(x) < p_j θ_j, otherwise 0    (7)

where f_j(x) is the feature response, θ_j is the threshold and the parity p_j indicates the direction of the inequality sign. In order to find the best weak classifier at each round of training, the feature responses of all samples are calculated, and by applying a threshold it is possible to separate the samples into two classes. During training the optimal threshold is determined for each feature, optimal meaning the threshold that minimizes the classification error of that feature.
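The following sketch shows one way to carry out this threshold search for a single feature; it is a minimal illustration of formula (7), not the actual OpenCV training routine.

```python
# Minimal sketch of training one weak classifier (formula 7): find the
# threshold and parity for a single feature that minimize the weighted
# classification error. Illustrative only.
import numpy as np

def train_stump(responses, labels, weights):
    """responses: feature response per sample; labels: 1 ball / 0 non-ball;
    weights: current Adaboost weights. Returns (threshold, parity, error)."""
    best = (0.0, 1, float("inf"))
    for theta in np.unique(responses):          # candidate thresholds
        for parity in (1, -1):                  # direction of the inequality
            pred = (parity * responses < parity * theta).astype(int)
            err = weights[pred != labels].sum() # weighted classification error
            if err < best[2]:
                best = (float(theta), parity, float(err))
    return best

rng = np.random.default_rng(0)
responses = rng.normal(size=200)
labels = (responses > 0.3).astype(int)          # toy, mostly separable data
weights = np.full(200, 1 / 200)
theta, parity, err = train_stump(responses, labels, weights)
```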
In each step of boosting, the feature (with its corresponding threshold) that has the lowest classification error is selected, along with a weight α that decreases with the classification error of that feature. This weight can be seen as a measure of the importance of that particular weak classifier. The error is calculated with respect to the weights of the examples, i.e. the error is the sum of the weights of the misclassified examples:

  ε = Σ_i w_i |h(x_i) - y_i|    (8)

where y_i is the true label of example i.

3.5.3 Boosting

In order to train and combine several weak classifiers, instead of using one more complex classifier, Adaboost repeats the training step with a modified distribution over the training set of examples. In each round, more emphasis is put on the more difficult examples. Those examples which were wrongly classified by the previous weak classifier are given higher weights than the correctly classified examples. The weights are then normalized. This is done at each round of training until the total number of rounds is reached.

Different variants of Adaboost have been evaluated for face detection, and two of them have been used and compared for ball detection in this report [15]. Discrete Adaboost is the original version proposed by Schapire and Freund [29]. According to Lienhart et al [15], Gentle Adaboost is the most successful algorithm for face detection. Real Adaboost uses class probability estimates to construct real-valued contributions. They are all similar in computational complexity during classification, but differ somewhat during learning in the way they update the weights at each round of boosting. The main idea is still the same in all three cases:

General pseudo-code for Adaboost:

  Initialize weights w = 1/m (normalized)
  For t = 1..T:
    Train the weak learners using distribution w, fitting the weak
    classifiers to the data and calculating their errors with respect
    to the weights.
    Choose the weak classifier with the lowest error and update the
    weights, increasing the weights of the misclassified examples.
  Output the final hypothesis as a weighted combination (related to the
  errors) of the selected weak classifiers.

A runnable version of this loop is sketched below. For more information on the different variants of Adaboost, see the comparative study made by Friedman et al [11].
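As promised above, here is a compact sketch of the Discrete Adaboost loop, reusing the train_stump helper from the previous sketch. The re-weighting rule follows the standard discrete variant; the actual OpenCV implementation differs in detail.

```python
# Minimal sketch of the Discrete Adaboost loop from the pseudo-code above.
# Each "feature" is one column of a response matrix; train_stump is the
# helper defined in the previous sketch. Illustrative only.
import numpy as np

def adaboost(feature_responses, labels, rounds):
    """feature_responses: (n_samples, n_features) matrix of responses.
    Returns a list of (feature_index, threshold, parity, alpha)."""
    n = len(labels)
    w = np.full(n, 1.0 / n)                    # initialize weights 1/m
    strong = []
    for _ in range(rounds):
        # Pick the feature/threshold pair with the lowest weighted error.
        best = None
        for j in range(feature_responses.shape[1]):
            theta, parity, err = train_stump(feature_responses[:, j], labels, w)
            if best is None or err < best[4]:
                pred = (parity * feature_responses[:, j] < parity * theta).astype(int)
                best = (j, theta, parity, pred, err)
        j, theta, parity, pred, err = best
        err = max(err, 1e-10)                  # guard against division by zero
        beta = err / (1.0 - err)
        alpha = np.log(1.0 / beta)             # importance of this weak classifier
        # Down-weight correctly classified examples, then renormalize, so the
        # misclassified examples gain relative importance in the next round.
        w *= beta ** (pred == labels)
        w /= w.sum()
        strong.append((j, theta, parity, alpha))
    return strong
```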
3.6 Cascade

By combining several strong classifiers into a cascade, ordered from simple to complex, it is possible to reduce computation time. In the first stages of the cascade, simpler and faster classifiers are used. Since the majority of the image regions going into the first stage are non-objects, many of them are easily rejected by the early stages, while the majority of the positives are let through. Once an image region has been rejected at some stage, it is discarded as a non-object for the rest of the cascade and thus not evaluated further. A positive image region goes through the whole cascade and is evaluated at every stage, requiring further processing, but in total this is a rare event. Deeper down the cascade the classifiers get more and more complex, requiring more computation time. Also, with increasing stage number, the number of weak classifiers needed to achieve the desired false alarm rate at the given hit rate increases.

The cascade of classifiers is trained by introducing goals in terms of positive detections and number of false positives. For example, to achieve a final classifier with a hit rate of 90% and a false positive rate of 0.1%, each stage in a classifier with 10 stages needs a hit rate of 99% (0.99^10 ≈ 0.9), but only needs a maximum false positive rate of 50% (0.5^10 ≈ 0.001). Each stage reduces both values, but since the per-stage hit rate is close to one, the product stays close to one, while the product of the much smaller false positive rates rapidly decreases towards zero. This is all under the assumption that the different stages in the classifier are independent of each other. The cascade is formed by setting a minimum hit rate and a maximum false positive rate for every stage. Each stage is trained, and features are added, until the desired hit rate and false positive rate have been reached. By specifying these goals it is possible to get a classifier of your choice (the per-stage arithmetic is illustrated by the sketch at the end of this section).

3.6.1 Bootstrapping

A new negative set is constructed for each stage by selecting image regions that were falsely detected by the classifier using all previous stages. A false detection like the one in figure 10 would be added to the negative set. This method is called bootstrapping. Intuitively this makes sense, as we expect the new examples to help us get away from the current mistakes. Since the classifier becomes more and more accurate at each stage, it becomes more and more difficult to find false positives. Also, the false positives become more and more similar to the true detections, making the separation task harder. As a result, deeper stages are more likely to have a high rate of false positives.

Fig 10 A typical hit along with a false detection. The image region of the player's shoe is used as a negative sample in the training of the next stage.
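The sketch below works through the stage-goal arithmetic from Section 3.6 and shows the rejection loop a candidate region passes through. The stage-classifier interface is a made-up stand-in.

```python
# Sketch of the per-stage arithmetic from Section 3.6 and the cascade
# rejection loop. The stage classifiers here are made-up stand-ins.

def cascade_goals(min_hit_rate, max_false_alarm, n_stages):
    """Overall rates if every stage exactly meets its goals and the
    stages are independent: both rates simply multiply per stage."""
    return min_hit_rate ** n_stages, max_false_alarm ** n_stages

hit, fa = cascade_goals(0.99, 0.5, 10)   # ~0.904 overall hit rate,
print(hit, fa)                           # ~0.00098 false positive rate

def run_cascade(stages, region):
    """A region is a ball candidate only if no stage rejects it."""
    for stage in stages:
        if not stage(region):            # rejected: stop immediately
            return False
    return True                          # survived all stages
```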
3.7 Support Vector Machine

Support Vector Machines (SVM) are used for data classification. The basics of SVMs needed to understand the method are presented here. In the same way as in the Adaboost case, a training set and a test set are needed to train and evaluate the SVM. Given a set of labeled data points (the training set) belonging to one of two classes, the SVM finds the hyperplane that best separates the two classes. It does this by maximizing the margin between the two classes. In the left image of fig 11 we see an example of a hyperplane that separates the two classes with a small margin. In the right image, the hyperplane that maximizes the margin has been found by the SVM. The points that constrain the margin are called support vectors.

Fig 11 SVM finds the plane that maximizes the margin. The classifier in the right image is considered to have greater generalization capability. Image taken from DTREG's homepage [36].

3.7.1 Overfitting

Fig 12 shows how a classifier that is fitted closely to the training set may not generalize well. In image (a) the classifier has learnt to classify all training examples correctly. As seen in image (b), this results in some wrongly classified examples on the test set. In image (c), however, we see a classifier that, although it classifies one example from the training set wrongly, classifies all the examples in the test set correctly (as seen in image (d)). The latter classifier generalizes better because it allowed a wrongly classified example during training. This can be handled by introducing a penalty parameter C that weights the samples according to how they were classified. Misclassifying a sample now costs C, and by increasing C the cost of misclassifying an example increases, making the model fit the training data more closely.

Fig 12 An overfitting classifier and a more general classifier: (a) training data and an overfitting classifier, (b) applied on test data, (c) training data and a better classifier, (d) applied on test data. Images from the libSVM Guide [6].

3.7.2 Non-linearly separable data

The examples in fig 11 show two linearly separable classes. With more complicated data, a line may not be enough to separate the two classes. To cope with this problem, the data is mapped into a higher (maybe infinite) dimensional space by a function φ [6]. The function φ can take many forms. In this new space it may be possible to find a plane that separates the data. The problem with going into a higher dimensional space is that the calculations get more expensive, which makes the method slow. Therefore the kernel trick, first introduced by Aizerman et al, is used to solve this [1]. Since all SVM calculations can be done using the dot product <x, y> between the training samples, the operations in the high dimensional space do not have to be performed. Instead we can try to find a function K(x, y) = <φ(x), φ(y)>. This function is called the kernel function. Examples of popular kernels are the polynomial kernel, the Radial Basis Function (RBF), the linear kernel and the sigmoid kernel. As proposed by libSVM, the RBF is a good choice to start with:

  K(x, y) = exp( -γ ||x - y||² ), γ > 0    (9)

The linear and the sigmoid kernels behave like special cases of the RBF for certain values of the parameters (C, γ) [6]. The polynomial kernel is more complex in terms of the number of parameters to select. When using the RBF kernel there are two parameters to select: C and γ. Since it may not be useful to achieve high training accuracy, these parameters have been evaluated by doing cross-validation on the training set. This is done by dividing the training set into two parts, one for training and the other for testing, and repeating this with different partitions to get a more accurate result. The values of the parameters have been tested by increasing them logarithmically and then doing cross-validation to measure the performance. The cross-validation helps us get around the problem of overfitting.
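As an illustration of this parameter search, the sketch below evaluates a logarithmic grid of (C, γ) values with simple two-fold cross-validation. scikit-learn's SVC, which is built on libSVM, is used here as a stand-in for the thesis's libSVM/OpenCV integration, and the data is synthetic.

```python
# Sketch of the logarithmic (C, gamma) grid search with cross-validation
# described above. scikit-learn's SVC stands in for the libSVM
# integration; the data is synthetic and illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))               # stand-in feature responses
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # a non-linearly separable label

best = (None, None, 0.0)
for C in [2.0 ** k for k in range(-3, 6, 2)]:         # logarithmic grids
    for gamma in [2.0 ** k for k in range(-7, 2, 2)]:
        scores = []
        for fold in (0, 1):                           # simple 2-fold CV
            test = np.arange(len(y)) % 2 == fold
            clf = SVC(C=C, gamma=gamma, kernel="rbf")
            clf.fit(X[~test], y[~test])
            scores.append(clf.score(X[test], y[test]))
        if np.mean(scores) > best[2]:
            best = (C, gamma, float(np.mean(scores)))

print("best C=%g gamma=%g cv-accuracy=%.3f" % best)
```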
3.7.3 Features extracted with Adaboost

A good classifier is of little use unless the data points represent something meaningful. The idea is to use features extracted from some stage of the cascade constructed with Adaboost. Fig 12 could then be interpreted as having the feature response of one feature on the x-axis and the feature response of another feature on the y-axis. Unlike the examples in fig 11 and fig 12, however, more than two features need to be used. This does not change any of the theory except that we move into a higher dimensional space. The samples used for training were gathered by letting an Adaboost-trained classifier classify the image regions in the training set from Chapter 2. The feature responses from the samples classified as positives were chosen as the positive training set, and the feature responses from some of the false positives in each image were selected for the negative set. A detailed explanation of how this was done is given in Section 4.6.

3.8 Tying it all together

To be able to add SVM as the last stage of the classifier we need to decide from which stage to take the features and how many features to use. Results in a face detection study by Le and Satoh suggest that the switch from the Adaboost classifier to the SVM classifier can be made at any stage [13]. The same study also shows a big increase in performance when going from 25 to 75 features, while the difference between using 75 and 200 features is not significant. Since the objects to detect differ between this report and the study by Le and Satoh, there is no guarantee that the optimal number of features is the same. The speed of the classifier depends very much on the number of features used, so it is important to find a good tradeoff here.

Chapter 4 Method

This chapter describes how the boosted classifiers described in Chapter 3 have been trained and how they are used on the specific task of detecting footballs. It describes how the classifiers are shifted in both location and scale across the image during detection. To reduce some of the false positives a brightness threshold is introduced, and a mask is used to restrict detection to the area where it is interesting to search for the ball: the pitch. A description of how the features for the SVM have been collected and tested is also given.

4.1 Training

Several different cascades are trained as described in Chapter 3, and the performance of these classifiers can be seen in Chapter 5. Image regions of sizes between 8x8 and 15x15 pixels have been used to train four different classifiers. Bigger image regions result in training samples that include more of the background. If no background were included in the image regions used for training, the classifier would only learn the texture of the ball. Since the resolution is low, it is very difficult to distinguish any texture on the footballs. The idea is therefore to include some of the background to give the classifier more information to work with. By including the background the classifier has the possibility of finding the difference between the dark background and the bright ball. How much of the background should be included in the samples is not clear. With too little background the classifier may not be able to capture the property that the ball is white and round compared to the darker background. On the other hand, with too much background the classifier will probably base its detections mainly on the background instead.

The difference between using different parts of the training set has also been evaluated. One classifier has been trained with easier images and another with harder images. The so-called easier images are those labeled with contrast 1 and 2 and with freeness 1, 2 and 3. The harder set adds 1747 images labeled with contrast 3. Using harder images, where the ball is occluded and the contrast is bad, should result in better detection when the ball is close to a player or occluded in some other way, but it also makes it harder to distinguish between a ball and a non-ball. The rejection process will be more forgiving, letting more examples through the cascade, since the training images have a wider diversity. One can expect more false detections, requiring a higher number of stages to reach the same level of false detections.

Two classifiers have been trained to evaluate the importance of using a high number of negative samples: 2000 and 5000 false positives have been extracted for use as negative samples in the bootstrapping step. Finally, Discrete Adaboost and Gentle Adaboost, two different kinds of boosting algorithms, are evaluated, and the minimum hit rate and maximum false alarm rate are varied to train three new classifiers. An overview of how the classifier is used can be seen in fig 6.
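To make the per-stage goals concrete, the following is a schematic Python sketch of how one stage could be grown until it meets a minimum hit rate and a maximum false alarm rate. It uses scikit-learn's AdaBoost with decision stumps as a stand-in for OpenCV's haartraining, so the names, the simple rate check and the data are assumptions, not the thesis implementation (a real cascade also adjusts the stage threshold to reach its hit rate).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    def train_stage(X, y, min_hit_rate=0.995, max_false_alarm=0.5,
                    max_weak=200):
        # Boost up to max_weak decision stumps in one go ...
        boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=max_weak).fit(X, y)
        pos, neg = (y == 1), (y == 0)
        # ... then find the smallest number of weak classifiers that
        # already meets the per-stage goals (staged_predict yields the
        # prediction after 1, 2, ... weak classifiers).
        for n, pred in enumerate(boost.staged_predict(X), start=1):
            hit_rate = np.mean(pred[pos] == 1)     # kept footballs
            false_alarm = np.mean(pred[neg] == 1)  # accepted background
            if hit_rate >= min_hit_rate and false_alarm <= max_false_alarm:
                return boost, n
        return boost, max_weak

    X, y = make_classification(n_samples=2000, random_state=0)  # dummy data
    stage, n_used = train_stage(X, y)
    print(n_used, "weak classifiers in this stage")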
4.2 Step size and scaling

As mentioned in Chapter 3, detection is done by sweeping a window of different sizes over the image, running the classifier at each image region. Since the footballs are not perfectly aligned and vary slightly in position and size, the trained detector is somewhat invariant to small shifts. An object can therefore be detected even though it is not perfectly centered. However, if not all possible image regions are examined, some objects are likely to be missed. The step size also affects the detector speed. With a step size of 1 pixel and 10 different window sizes, there are around one million image regions that the classifier needs to be run on. By simply increasing the step size to 2 (hence skipping one pixel at each step) the number of image regions is halved, and thereby also the total classification time. The step size is therefore a tradeoff between detection rate and time. With objects as small as those in this report, it is likely that a small step size is required. The shifts in location and window size have been tested with different step sizes, and results are shown in Section 5.4.

Since the balls we want to detect range between 3 and 7 pixels in diameter, there is no reason to search for objects of other sizes. The detection window is therefore scaled until it is just larger than the biggest possible object, but not more. Scaling can be done either by scaling the image region itself or by scaling the features. In this case the features are scaled, since this comes at no cost (see the section on the integral image, which shows that the size of a rectangle does not affect its calculation time), while scaling the image region is time-consuming.
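Why feature scaling is free can be seen in a few lines of NumPy; this is a minimal sketch, not the OpenCV implementation, and the coordinates are arbitrary.

    import numpy as np

    def integral_image(img):
        # ii[y, x] holds the sum of all pixels above and to the left,
        # i.e. img[:y, :x]; the extra zero row/column simplifies lookups.
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_sum(ii, x, y, w, h):
        # Four lookups, regardless of how large the rectangle is.
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    img = np.random.randint(0, 256, size=(720, 1280))
    ii = integral_image(img)

    # The same two-rectangle feature evaluated at two scales: only the
    # corner coordinates change, the cost per rectangle does not.
    for scale in (1.0, 1.5):
        w = h = int(round(12 * scale))            # 12x12 base window
        top = box_sum(ii, 100, 50, w, h // 2)     # darker upper half
        bottom = box_sum(ii, 100, 50 + h // 2, w, h - h // 2)
        print(scale, bottom - top)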
4.3 Masking out the audience

Since the ball only needs to be detected when it is in play, there is no need to perform detection outside the pitch. Since the camera system has a model of the pitch, it is easy to get the limits from there. For each match 16 different mask images are constructed, one for each camera (to the right in fig 13). When stepping through the x and y coordinates of an image, the mask is first consulted to see whether detection should be made at this position or not. By doing this a more accurate result can be achieved. The result can be seen to the left in fig 13: no detections are made outside the boundaries of the mask.

Fig 13 Detections (left) and the corresponding mask (right).

4.4 Number of stages

What has been noticed when running the trained classifiers on the test data is that images from different matches respond very differently to the cascade. Images from a bright match may give a lot of false detections when running the cascade with many stages, while images from another match give hardly any false detections. But when the number of stages is increased, positive detections are lost from the latter while only the false positives of the former are reduced. This means that the optimal number of stages used for classification differs from match to match. Another way of producing the ROC-curves would therefore be to set a limit on how many false detections the classifier is allowed to make in an image, and run the classifier until this limit has been reached. This increases the performance of the classifier when testing on a range of matches, and a test made this way is presented in Chapter 5. During testing of other parameters this method would not be practicable, however, since it would be impossible to get comparison data that depends only on the tested parameter.

4.5 Brightness threshold

A brightness threshold has been used to eliminate some of the false detections. Since the features used do not capture the pixel values themselves but only how the pixel values relate to each other, some detections have been found on grass areas that are completely green. These can easily be rejected by looking at the brightness of the pixels in the detected area: if no single pixel is bright enough, the detection can be ruled out.

Fig 14 Left: image of false alarms on grass. Right: mask indicating the two different thresholds when an umbra is present.

To the left in fig 14 we see two false detections that have the same feature responses as a ball. These two detections consist of a brighter circular area in the center with darker pixels around it, but the areas clearly do not contain any pixels as white as a ball would be. To find the optimal threshold value for each individual match, a histogram of the intensity of the pixels in the whole image has been used. Usually a peak in this histogram indicates the color of the grass. When part of the pitch lies in shadow (umbra) because of the sun, it should be possible to find two peaks indicating the color of the grass. A mask of where the umbra lies can be extracted, as seen to the right in fig 14, and from the brightness histogram a corresponding threshold can be found that is optimal for each of the two regions: on the dark sections of the pitch the threshold is close to one peak, on the bright sections it is close to the other. To the right in fig 14 a separate threshold has been chosen for each region.

4.6 SVM

Two different types of features have been used to train and evaluate the SVM method: the pixel values of the image regions directly, and the feature responses from a classifier trained with Adaboost. False positives have been extracted in the same way. Both methods use images from the training set and test set 1 described in Chapter 2. The training and testing procedure for the SVM using feature responses is as follows:

Training - Let a cascade with some low number of stages run on the training set discussed in Chapter 2. The feature responses of around 200 features, starting from some stage, are calculated for both positives and negatives (one false positive is taken from each image to get an equal amount of positives and negatives in the resulting training data). How many features should be used is not known, but results from the study by Le and Satoh indicate that 200 is a good number [13]; some different values are tested. The Support Vector Machine is then trained with these feature responses. By using feature responses from later stages it should be possible to make the SVM classify more difficult samples better. At the same time it will classify the easier samples worse, but the cascade of boosted classifiers run before the SVM step is meant to take care of those.
Testing - Let the same cascade classify test set 1 discussed in Chapter 2. Calculate the same (around 200) feature responses starting from the chosen stage, label the regions classified as detections as positives and the rest as negatives, then run the SVM on the extracted feature responses and evaluate the performance. By adding the SVM after different numbers of stages it is possible to construct a ROC-curve. Run the cascade alone up to stages that give a similar rejection or detection rate as the SVM and evaluate the performance. We can then compare how well the two methods perform on the same positive and negative sets.

As mentioned in the LibSVM guide, scaling of the data is very important to achieve good results [6]. Without scaling the data it was not possible to get any acceptable results. The main reason why scaling is important is to avoid large numbers dominating small numbers. Scaling also simplifies the calculations and thus improves the efficiency of the classifier. All data, both the training and the testing data, has been scaled in the same way to a fixed range.

4.7 OpenCV

The detection part of OpenCV was developed with face detection in mind. As a result it has been somewhat optimized for larger objects than the footballs in this thesis, and modifications to the code have been made to search for smaller objects.

Chapter 5 Results

In this chapter results from the different trained classifiers are presented. Comparisons are made using ROC-curves. The performance is only shown for the values of interest, since the training of later stages requires a lot of time.

5.1 ROC-curves

To see the ratio between hit rate and false alarm rate, ROC-curves are used to present the results. Hit rates are shown in percent, while the false alarm rate shows the actual count of detected false positives per image. A detection is counted as a positive hit if:

- the distance between the center of the detection and the center of the actual football is less than 30% of the width of the actual football, and
- the width of the detection window is within ±50% of the actual football width.

Other detections are regarded as false alarms. A test with a perfect result has a ROC-curve that passes through the upper left corner, meaning 100% hit rate and 0 false alarms. The closer the curve is to this point, the better the accuracy of the classifier. Points that fall below the dotted line in fig 15 are the results of a classifier that is worse than chance.

Fig 15 A ROC-curve. The closer the curve is to the top left corner, the better.

To get the curves, the number of stages used for detection is varied. More stages means a more specific classifier, while fewer stages let more of the image regions through the cascade as positives. The choice of the number of stages influences the performance differently on different matches: a number of stages that produces many detections in one match may produce very few in another. Since the test set contains images from different matches, this is a problem. A test has therefore been made where the number of stages is adjusted during detection: depending on the number of detections made in the previous image, more or fewer stages are used for the next image. This improves performance, as seen in Section 5.8.
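The hit criterion above translates directly into a small predicate; this sketch uses hypothetical (center, width) tuples for the detection and the ground truth.

    def is_hit(det, truth):
        # det and truth are (cx, cy, w): window center and width in pixels.
        cx, cy, w = det
        tx, ty, tw = truth
        close = ((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5 < 0.3 * tw
        sized = 0.5 * tw <= w <= 1.5 * tw
        return close and sized

    # A 6-pixel ball at (412, 318); a slightly off-center 7-pixel detection
    # still counts as a hit, anything else is a false alarm.
    print(is_hit((413, 319, 7), (412, 318, 6)))  # True
    print(is_hit((430, 318, 7), (412, 318, 6)))  # False: too far away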
5.2 Training results

Training and testing have been done on a 2.16 GHz computer with 2 GB of RAM. In general it has taken on the order of days to train a classifier up to stage 35. Some training sessions had to be stopped earlier because the training was too time consuming, but it has always been possible to get a ROC-curve that can be used to compare the performance with the others. The variables that affect training time are: the number of negatives used in the bootstrapping step, the size of the image regions, the minimum hit rate and the maximum false alarm rate. The first of these can be very time consuming, so it is important to have a good set of negative images from which the algorithm can find false positives. The last three attributes increase the number of features needed to reach the goal at each stage. Some examples of features chosen by the algorithm can be seen in fig 16.

Fig 16 Examples of features selected by Adaboost in early stages.

5.3 Using different images for training

Adaboost has been reported to be sensitive to noisy data [8]. Here some tests with different images used during training are reported.

5.3.1 Image size

Image regions of sizes between 8x8 and 15x15 pixels have been used to train four different classifiers; examples are given in fig 17. Results show only small differences between the classifier trained on 10x10 image regions and the one trained on 12x12 regions. Using 8x8 image regions gives worse performance, and using regions bigger than 12x12 does not improve performance, as seen in fig 18.

Fig 17 Image regions of different sizes: 8x8, 10x10, 12x12 and 15x15.

Fig 18 The best performance is achieved with 12x12 image regions, although the difference between the classifiers is not large.

5.3.2 Image sets

The next test shows the importance of having a good image database. Two classifiers trained with different images are compared, one of them trained with images in which the ball is more occluded and the contrast is worse. The results can be seen in fig 19. Around 4 more stages were required to get down to the same false alarm rate. At the same time, the classifier trained with harder images shows better performance overall.

Fig 19 A classifier trained with images with less contrast and where the ball is partially occluded shows better performance than a classifier trained only on clearly visible footballs.

5.3.3 Negative images

Another modification of the image set used in training is to use more negatives in the bootstrapping step (see Section 3.6.1). The bootstrapping step selects a number of false positives, classified by the currently available classifier, as examples of negatives. Until now the algorithm has used only 2000 negative samples at each stage. By letting the training procedure extract 5000 negative samples at each stage it should be possible to improve performance. One problem with using a high number of negatives in this step is that as training reaches later stages it becomes more and more difficult to find a large number of false positives. Increasing this number also immediately increases the time needed for training: as a reference, it took 705 s to find 2000 false positives at stage 39 and 2105 s to find 5000, while in the first stages both took under a second and the difference was not noticeable.
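A schematic sketch of this false-positive mining; cascade.accepts is a hypothetical predicate standing in for running all stages trained so far, and the scan mirrors the sliding window of Section 4.2.

    def mine_false_positives(cascade, negative_images, n_wanted,
                             window=12, step=1):
        # The images are known to contain no footballs, so every accepted
        # region is a false positive and becomes a negative sample for
        # the next stage (Section 3.6.1).
        mined = []
        for img in negative_images:
            h, w = img.shape
            for y in range(0, h - window + 1, step):
                for x in range(0, w - window + 1, step):
                    patch = img[y:y + window, x:x + window]
                    if cascade.accepts(patch):
                        mined.append(patch)
                        if len(mined) == n_wanted:
                            return mined
        return mined  # later stages may come up short, as noted above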
Fig 20 Using a higher number of negative samples at each stage of training increases performance by a couple of percentage points.

The comparison between using 5000 negatives at each stage and using 2000 can be seen in fig 20. The two curves follow each other, with the curve for the classifier trained with 5000 negatives a couple of percentage points higher. One question that arises is how using a higher number of negatives affects the generalization performance. Is it only positive, or could the new classifier become too specialized to the training data? If the negatives are very similar to the positives, the trained classifier is likely to have a decision boundary that lies very close to the positives. By running the classifiers on the test set 2 described in Chapter 2 we get some indication of how well the two classifiers generalize. The difference in performance between the two classifiers is very small; it is not big enough to draw any conclusions about the generalization ability. As seen in fig 21, the detection rates are very low.

Fig 21 The classifier trained with 5000 negatives performs better in the lower regions of the ROC-curve when testing on a game that has not been used in training.

5.4 Step size

During detection a window is swept across the image at different locations and at different scales. By shifting the window a few pixels at a time it is possible to scan the whole image. A step factor is used to increase both the window size and the step size: if the step factor has grown the step size to s, the window is shifted s pixels at a time. This means that when the detection window is large it is shifted more than one pixel at a time. Since the image regions used in training are not perfectly centered, a small amount of translational variability is trained into the classifier. While speeding up the classifier substantially, shifting more than one pixel at a time has resulted in decreased performance. With a step factor of 1.2 the classifier classified around 24 images per second (it took between 87 and 96 seconds to classify all 2221 images). With a step factor of 1.1 it took about double the time, which corresponds to about 13 images per second. Due to the higher performance with a small step factor, a fixed step size of 1 pixel, i.e. not applying the step factor to the step size, has also been tested. The window size is still increased with the step factor until the window is large enough. This way the classifier only processes about 9 images per second.

Fig 22 A smaller step factor increases performance but also increases processing time.

The results in fig 22 clearly show a big increase in hit rate when the step size is decreased. In the forthcoming results a fixed step size of 1 pixel is used along with a step factor of 1.1 for the window size. A pre-calculation step, which used a step size of 2 pixels as a first pass, has also been removed. Removing this step improves the detection rate, as shown in fig 23, while decreasing the number of images processed per second to 8.5.

Fig 23 The old step size as in fig 22 compared with the result after removing a pre-calculation step.

5.5 Real and Gentle Adaboost

Two variants of the Adaboost algorithm have been evaluated. The Discrete Adaboost was not able to finish training and is therefore left out of the comparison.
What happened was that the boosting step was not able to improve performance by updating the weights, so the process got stuck. Lienhart likewise reported convergence problems when using LogitBoost for face detection and was not able to evaluate that method [14]. Lienhart could also show that Gentle Adaboost was the best of the Real, Discrete and Gentle variants, at least for face detection. Le and Satoh also state that Discrete Adaboost is too weak a booster for a hard-to-separate dataset [13].

Fig 24 The performance of Real Adaboost and Gentle Adaboost is essentially the same.

Figure 24 shows that the difference in performance between the two variants of Adaboost, Real and Gentle, is minimal.

5.6 Minimum hit rate and maximum false alarm rate

As described in Section 3.6, the minimum hit rate and the maximum false alarm rate set up the properties of the cascade: they are the values each stage needs to reach in order to move on to the next stage. We see that increasing the maximum false alarm rate improves performance significantly. Even better performance is achieved by using a higher minimum hit rate during training. It is worth noting that a higher minimum hit rate alone gives as good performance as raising both the minimum hit rate and the maximum false alarm rate. To make it easier to refer back to it later in the report, the classifier with the best performance in fig 25 is called classifier 1.

Fig 25 Comparison between different values of the minimum hit rate and the false alarm rate during training. Better performance is achieved with a higher minimum hit rate.

A question that arises is whether this procedure deteriorates the generalization performance of the classifier. Do these results perhaps reflect a too specific classifier, too well adjusted to the training data? To test this, the performance of this classifier is compared with that of the classifier using a minimum hit rate of 0.995 on test set 2, which contains images from a game not used for training. The results in fig 26 give a weak indication that the classifier has reduced its generalization ability: the classifier trained with an increased minimum hit rate and a decreased maximum false alarm rate still performs better than the classifier trained with the default values, but the difference is much smaller. A downside of changing these limits is that training can suffer. It can take longer for the classifier to reach the limits, increasing the total time needed to train the cascade; some training sessions never finished because the limits could not be reached. In addition, more features are usually needed to reach higher limits, which means that the final classifier will be slower.

Fig 26 The classifier trained with a higher minimum hit rate and a lower false alarm rate still performs better on test set 2, but the difference between the two classifiers is minimal.

5.7 Brightness threshold

Looking at the false positives detected by the classifiers, it can be seen that false detections are sometimes made on the green grass. These regions clearly do not contain any white pixels, but are classified as footballs anyway; see Section 4.5 for more information. A simple way of removing them is to reject a detection if it does not contain any white pixels. This is done by introducing a brightness threshold that rules out a detection if no pixel in the detected area is brighter than the threshold.
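A minimal sketch of this rejection step, assuming a grayscale image and (x, y, width) detections; the threshold of 125 anticipates the value found for test set 1 below.

    import numpy as np

    def bright_enough(gray, det, threshold=125):
        # Keep a detection only if at least one pixel in its window is
        # brighter than the threshold; grass-only regions have none.
        x, y, w = det
        return int(gray[y:y + w, x:x + w].max()) > threshold

    gray = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
    detections = [(100, 50, 7), (300, 200, 6)]
    kept = [d for d in detections if bright_enough(gray, d)]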
The performance when testing different values of the brightness threshold can be seen in fig 27.

Fig 27 How performance changes when different threshold values are used to remove detections that are not bright enough.

The images are saved by the ball tool in RGB space but later transformed into grayscale with one color channel; the luminance Y of the Y'UV color space is used as the grayscale value, as described in more detail in Section 2.2. The brightness threshold is added as the last step of detection. The results show that many of the false detections can be ruled out by this threshold without decreasing the detection rate of the classifier. These tests were made on test set 1, and the classifier used was again classifier 1. The best result achieved without losing any hits used a threshold of 125: with 20 stages of the cascade no positive detections were lost, while the number of false detections decreased from 5.4 to 4.3 per image.

Further tests show that the threshold can be optimized even more. When the threshold value is increased, only images from one match suffer from weaker detection, while the detection rate on the other matches remains unchanged. Fittingly, this match had difficult lighting conditions and the football was often darker than normal. These results suggest that the threshold value can be optimized for each individual match. By using the threshold mask described in Section 4.5, the threshold can automatically be adapted to the current lighting conditions. The comparison in fig 27 between cascaded classifier 1 with and without this threshold shows that, as expected, the performance increases even further.

5.8 Number of stages

By increasing or decreasing the number of stages used for detection in an image, depending on how many detections were made in the previous image, it is possible to adjust the classifier to each game, since the images in the test set are ordered. An upper and a lower limit for when to lower and raise the number of stages are needed; by varying these two limits it is possible to get a ROC-curve, as in fig 28.

Fig 28 Adapting the number of stages according to the number of detections made in the previous image gives better results.

The results show that the classifier benefits from being adjusted to each game to maximize performance.
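A sketch of this adaptive scheme; the limit values and stage bounds here are made up, since varying the two limits is exactly what traces out the ROC-curve in fig 28.

    def adapt_stages(n_stages, n_detections, lower=2, upper=12,
                     min_stages=10, max_stages=35):
        # Too many detections in the previous image: be stricter.
        if n_detections > upper:
            return min(n_stages + 1, max_stages)
        # Too few: be more permissive.
        if n_detections < lower:
            return max(n_stages - 1, min_stages)
        return n_stages

    # The test images are ordered by match, so the stage count settles
    # at a level suited to each game's conditions after a few images.
    n = 20
    for count in (3, 1, 0, 15, 18, 9):
        n = adapt_stages(n, count)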
5.9 Support Vector Machine

As mentioned in Section 4.6, two parameters need to be selected when using the RBF kernel for SVM classification: C and γ. Unfortunately there is no way to generalize the selection of these parameters; every data set has its own optimal choice of values. These values have been found using cross-validation (see Section 3.7.2). Only the performances of the best classifiers are shown in this section. In these tests the SVM has been integrated as the last step after the cascaded classifier, and the ROC-curve has been constructed by varying the stage at which the SVM takes over the classification task.

The first test uses an SVM model trained with 283 feature responses from stages 16 and 17 of the best classifier from Section 5.6 (classifier 1). As seen in fig 29, the results of the SVM model stay on the negative side of the cascaded classifier in the ROC-curve. The results from a similar classifier trained on the pixel values directly are even worse and are therefore left out of the report.

Fig 29 Using SVM as the last stage after different numbers of cascade stages. The SVM does not show better results than the cascaded classifier when trained on 283 features.

Possible explanations for the SVM results being worse than those of the boosted classifier alone are that too few or too many features have been used, or that the features have been taken from stages too late or too early in the cascade. The next tests examine these possibilities. Two further tests have been made with classifiers trained on the same number of feature responses (283) but with feature responses taken from earlier and later stages of the cascade, starting from stage 2 and stage 26 respectively. The same cascaded classifier has been used in order to compare the effect of using more or less discriminative features. Using features from later stages for training should result in a classifier that is better at separating harder samples. In fig 30 the results show that the overall performance of the two new classifiers is worse than that of the previous classifier. A higher number of features does not help the SVM classifier either, as seen in fig 31. On the other hand, the performance increases when the number of features is lowered. The results in fig 31 come from classifiers trained with features taken from stage 15 and onward, until the wanted number of features was reached. A zero-mean normalization has also been applied to each image region, but without signs of improvement.

Fig 30 Training the SVM on feature responses from earlier and later stages results in poor performance.

Fig 31 Using additional features does not improve the performance of the SVM classifier. Good performance is shown by the classifiers trained with very few features.

5.10 Compared to existing detection

The main method for finding the ball today is based on tracking the movement of the ball. Since there has not been enough time to integrate the method proposed in this report with the existing software used by Tracab, the two methods have not been compared on the task of producing a final ball hypothesis. Instead, a specific detection step in Tracab's system is compared with the method proposed in this report. The comparison has been done by running the two methods on test set 1. A first step in Tracab's algorithm extracts possible ball candidates from the test set by looking for image regions that are brighter than the background; these regions are then stereo-matched. The resulting set of image regions is saved and used as the database in the comparison test. The next step of Tracab's method narrows down the candidates further, using a correlation method that extracts ball-like candidates. This results in a detection rate and a false detection rate that are compared with the cascaded classifier in fig 32. The cascaded classifier has been run on the same database with different numbers of stages to get a ROC-curve.

Fig 32 The method proposed in this report compared with a method in the existing system.

As seen in fig 32, the performance of the cascaded classifier is slightly higher.

5.11 Five-a-side

Five-a-side is a variant of the game with only 5 players in each team, played on a pitch that is much smaller, around 16 times smaller than a regular football pitch.
This test can be compared with having the same setup as before but with better cameras of higher resolution. Since technology evolves rapidly and prices fall, there is reason to believe that better cameras will be used in the near future. The cascade has been trained as before, as described in Chapter 3, but with the images described in Section 2.2.4. Due to the long training time, the training parameters have been set to a maximum false alarm rate of 0.4, a minimum hit rate of 0.995 and 2000 negatives for each stage. The SVM has been trained with 197 feature responses from stage 16 and is added as the last step of the cascade. Fig 34 shows the performance of the classifier run with and without the SVM on the test set described in Section 2.2.4. The hit rate is over 95% even at low false alarm rates. As before, using the SVM does not increase performance; the two curves in fig 34 follow each other closely.

Fig 33 The system set up at a five-a-side pitch. The cameras are closer to the ball, giving footballs of higher resolution; the texture of the ball is now distinguishable.

Fig 34 Very good performance is shown by the classifier trained on images of footballs of higher resolution. No improvement can be seen when using SVM as the last stage.

In the same way as in Section 5.10, a comparison has also been made between the classifier proposed in this section and a present detection method used by Tracab. This can be seen in fig 35.

Fig 35 The method proposed in this report compared with a method in the existing system at Tracab.

As seen, the cascaded classifier outperforms the current method. Again, it is important to remember that this is not the only method used by Tracab's system. In addition to the higher detection rate, the most positive result in this test is the low false alarm rate shown by both methods.

5.12 Discussion

Overall, the results show that the detection task set up for this thesis can be solved with pleasing results. Judging by the features it selects, the boosting procedure seems capable of extracting the information available in the sample images. In early stages the features are understandable: they capture the property of the football being bright in the middle and darker towards the sides. What the features in later stages represent is less obvious. This could be a sign that the process overfits to the data, but studies have shown the method to be robust to overfitting [29].

As expected, the results in Section 5.3 indicate that the image database is crucial for good performance. It is a little surprising that the best performance is achieved with as much background as in the 12x12 image regions. The increased performance when using harder images is probably due to the training set relating better to test set 1, which was used in both tests. The results confirm the expectation that more stages are required when training on harder images in order to reject as many samples as when training only on the easier images.

Also as expected, the results show a big difference in performance when the step size is reduced. This is because the objects dealt with in this report are small. Unfortunately the step size is directly related to the time needed to classify an image, but fortunately it is very easy to trade processing time against performance.

The tests with a brightness threshold show how the boosting only trains the classifier on relative features, not exact pixel values. This is why the brightness threshold can be successful.
The brightness threshold tests, together with the tests varying the number of stages, show the importance of adjusting the classifier to each game and lighting condition.

The most disappointing results in this report come from the Support Vector Machine method. The results in this thesis contradict those of the related study by Le and Satoh, which suggest that a higher number of features extracted by a boosted classifier makes it easier for the SVM to separate the two classes [13]. This may be due to overfitting (Section 3.7.1). These results are discouraging, since better results were expected from the SVM method.

One of the goals of this thesis was to evaluate whether the football detection of today could be improved. In the comparison one has to bear in mind that Tracab uses additional techniques to find the final ball hypothesis. Both classifiers show good performance in the comparison made in Section 5.10. Another comparison would have been to include harder images, such as footballs that are partially occluded by a player and therefore visible in only one camera, but this has not been possible. One advantage of the cascaded classifier, visible in fig 32, is that it makes it easy to rate detections by confidence: a detection that makes it through a high number of stages is more likely to be the actual ball than one that is thrown away in an earlier stage.

Even higher detection rates are shown in the test made on the five-a-side match. As the cameras are closer to the pitch, the texture of the ball becomes visible and the classifier has more to work with. This is part of the explanation for the good results in fig 34. When comparing the performance one should bear in mind that this test set is not the same as before, so the results cannot be compared straight off. The labeling of the images with respect to contrast and how free the footballs are has not been done for this set, which makes comparison with the first test set even more difficult. Also, to save time, the process of extracting footballs from the images was speeded up by mostly including easy targets and by removing detections in areas around tracked players. This, of course, helps explain the high hit rates of the classifiers. The SVM method shows similarly disappointing results here as in the earlier SVM tests: although its performance seems to have increased, it still does not perform better than the cascaded classifier. This can also be taken as an indication that the cascaded classifier performs well.

Chapter 6 Conclusions and future work

This chapter gives an overview of the results and the conclusions that can be drawn from them, together with some thoughts on what needs to be improved in the future.

6.1 Conclusions

In this report a method for object detection has been used to detect small footballs in real time. Finding these footballs is a hard task, mainly because the footballs are very small; the method has not previously been used on objects of this size. Because of the size of the balls, a smaller spatial step size has been needed to achieve a desirable hit rate than what has been reported in previous studies. This results in a much slower detector: with the best classifier a speed of 8.5 images per second is achieved. On the other hand, no optimization for speed has been done.
By applying the brightness threshold before the classifier, the processing time could be reduced. It is also easy to trade processing time against performance. Tests made on images of footballs in higher resolution show increased performance, but so does the method available at Tracab today. The overall performance shown by the classifier in the tests is promising, but since the method has not been implemented to produce a single final hypothesis of the best ball candidate, it has been difficult to make fair comparisons with the method available at Tracab today. It is therefore difficult to say whether this method, implemented as a final ball detector, would be better or worse than the current one. The idea of using a classifier such as SVM as the last stage has been shown not to work perfectly. The decreased performance when using a higher number of features during training of the SVM contradicts results in previous studies [13]; this may be due to overfitting.

6.2 Future work

The natural next step would be to integrate the method into the existing system to see if it can improve the performance of finding a final ball hypothesis. This is the only way of getting a true comparison with the existing methods at Tracab.

Big differences in performance can be seen when using different image sets (Section 5.3.2). The image set can therefore probably be improved and should definitely be revised. The current classifiers have been trained on 6 different matches. This small set is not enough to get a variety of images that covers all possible conditions regarding illumination and color. To get a classifier that generalizes well to any kind of new data it is necessary to use a wider range of matches, and with such a data set it will be necessary to test the classifier on a wide range of data from matches not used during training. Another approach would be to train a cascade that is optimized for one setup. This could be done by training only on images with certain lighting conditions or images of a specific football. During classification one would then, for each match or maybe even for each image region, start by examining which of the several trained classifiers to use.

The results from the SVM method suggest that the feature selection can be done in a better way. Are the features relevant, or is one feature worth more than another and should be weighted up? These are some of the questions that need to be answered, and as a starting point one can read a survey addressing the problem of feature selection [12].

In the near future cameras of higher resolution will probably be used, so it is natural to continue the research in that direction. The five-a-side test was a first step towards testing this.

Bibliography

1. M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, pp. 821-837, 1964.
2. N. Ancona, A. Branca. Example based object detection in cluttered background with Support Vector Machines. Istituto di Elaborazione Segnali ed Immagini, Bari, Italy, 2000.
3. N. Ancona, G. Cicirelli, E. Stella, A. Distante. Ball Detection in Static Images with SVM for Classification. Image and Vision Computing 21, pp. 675-692, 2003.
4. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Proceedings of the Ninth European Conference on Computer Vision, 2006.
5. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2, pp. 121-167, 1998.
6. C. Chang, C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
7. G. Coath, P. Musumeci. Adaptive Arc Fitting for Ball Detection in RoboCup. APRS Workshop on Digital Image Analysing, 2003.
8. T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, pp. 1-22, 1999.
9. T. D'Orazio, C. Guaragnella, M. Leo, A. Distante. A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognition 37, pp. 393-408, 2003.
10. Y. Freund, R. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. European Conference on Computational Learning Theory, 1995.
11. J. Friedman, T. Hastie, R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.
12. I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.
13. D. Le, S. Satoh. A Multi-Stage Approach to Fast Face Detection. IEICE Transactions on Information and Systems, Vol. E89-D, No. 7, 2006.
14. R. Lienhart, J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002, Vol. 1, pp. 900-903, 2002.
15. R. Lienhart, A. Kuranov, V. Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. MRL Technical Report, 2002.
16. Y. Lin, T. Liu. Fast Object Detection with Occlusions. The 8th European Conference on Computer Vision (ECCV 2004), Prague, 2004.
17. C. Liu, H. Shum. Kullback-Leibler Boosting. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 587-594, 2003.
18. D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), pp. 91-110, 2004.
19. V. Lumikero. Football Tracking in Wide-Screen Video Sequences. Master's Thesis in Computer Science, School of Electrical Engineering, Royal Institute of Technology, Stockholm, 2004.
20. K. Mikolajczyk, C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 10, 2005.
21. S. Mitri, K. Pervölz, H. Surmann, A. Nüchter. Fast Color-Independent Ball Detection for Mobile Robots. Fraunhofer Institute for Autonomous Intelligent Systems (AIS), Sankt Augustin, Germany, 2004.
22. T. Ojala, M. Pietikäinen, T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, 2001.
23. E. Osuna, R. Freund, F. Girosi. Training Support Vector Machines: An Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Puerto Rico, 1997.
24. C. Papageorgiou, M. Oren, T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998.
25. B. Rasolzadeh. Response Binning: Improved Weak Classifiers for Boosting. Intelligent Vehicles Symposium, pp. 344-349, 2006.
26. S. Romdhani, P. Torr, B. Schölkopf, A. Blake. Computationally Efficient Face Detection. Proceedings of the 8th International Conference on Computer Vision, 2001.
27. D. Scaramuzza, S. Pagnottelli, P. Valigi. Ball Detection and Predictive Ball Following Based on a Stereoscopic Vision System. Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 2005.
28. R. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3), pp. 297-336, 1999.
29. Y. Freund, R. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, pp. 119-139, 1997.
30. K. Sung, T. Poggio. Example-based Learning for View-based Human Face Detection. A.I. Memo 1521, MIT A.I. Lab, 1994.
31. P. Viola, M. Jones. Robust Real-Time Object Detection. IEEE ICCV Workshop on Statistical and Computational Theories of Vision, 2001.
32. P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 511-518, 2001.
33. J. Wu, J. M. Rehg, M. D. Mullin. Learning a rare event detection cascade by direct feature selection. Advances in Neural Information Processing Systems (NIPS), 2003.
34. M. Sewell. http://www.svms.org/vc-dimension/ Accessed 2008-05-15.
35. C. Poynton. Frequently Asked Questions about Color. http://www.poynton.com/PDFs/ColorFAQ.pdf Accessed 2008-05-15.
36. http://www.dtreg.com/svm.htm Accessed 2008-05-28.
37. http://www.gestalttheory.net/ Accessed 2008-05-28.
38. OpenCV library. http://opencvlibrary.sourceforge.net/ Accessed 2008-06-27.

Appendix 1

Percentage of images that have a value equal to or lower than the one at the side and at the top of the table.

Training set:

Contrast/Free     1      2      3      4
      1         0.7   31.0   45.0   46.9
      2         0.8   48.3   72.2   76.0
      3         0.8   57.1   89.1   94.7
      4         0.8   58.0   91.6   98.0
      5         0.8   58.5   92.9    100

Test set 1:

Contrast/Free     1      2      3      4
      1         0.2   29.7   42.9   43.8
      2         0.5   48.2   72.6   74.0
      3         0.5   57.5   94.4   96.8
      4         0.5   57.8   95.9   99.1
      5         0.5   57.9   96.3    100