ROBOTIC PERSON FOLLOWING USING STEREO DEPTH SENSING AND PERSON RECOGNITION

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree
Master of Science in Computer Science

by
Michal Pasamonik
Fall 2016

SAN DIEGO STATE UNIVERSITY

The Undersigned Faculty Committee Approves the Thesis of Michal Pasamonik: Robotic Person Following Using Stereo Depth Sensing and Person Recognition

Mahmoud Tarokh, Chair
Department of Computer Science

Xiaobai Liu
Department of Computer Science

Ashkan Ashrafi
Department of Electrical and Computer Engineering

Approval Date

Copyright © 2016 by Michal Pasamonik
All Rights Reserved

DEDICATION

I dedicate this thesis to my two children, Alice and Kazuma.

ABSTRACT OF THE THESIS

Robotic Person Following Using Stereo Depth Sensing and Person Recognition
by Michal Pasamonik
Master of Science in Computer Science
San Diego State University, 2016

This thesis tested the viability and effectiveness of using stereoscopic imaging with inexpensive components for person detection and following. Stereoscopic cameras can produce a depth map, much like active imaging systems such as the Microsoft Kinect, but are not subject to the same environmental limitations and may be configured to work in all types of lighting. A stereoscopic imaging rig was built from two wide-angle USB cameras mounted to an aluminum channel and driven by a low-power compute platform. The compute platform, mounted on top of the robot, controlled the imaging hardware, detected and tracked human targets using a combination of histograms of oriented gradients, Hough transforms, and color and depth tracking, and sent motion commands to a secondary computer mounted on the bottom of the robot. The secondary computer controlled the robot chassis and converted the primary computer's motion commands into actual motion.

The Histogram of Oriented Gradients (HOG) algorithm was used as the primary means of person detection due to its low false positive rate, invariance to color and lighting, and ability to detect humans in various poses. Because the HOG algorithm produces a bounding box with predictable margins around the human subject, the upper one fourth of each potential detection was further processed with a circle Hough transform to find the head position, which was used to fine-tune the person's location and aid in removing false positives. Finally, the locations of each subject's shirt and pants were identified and a feature vector describing their coloration was constructed. The feature vector was used to identify individuals within the group, chiefly the primary tracking subject. The system was optimized by tracking the trajectory of individuals through time and limiting the search space to their expected locations.

The person following robot's software was written in C++ and ran on an embedded version of Ubuntu Linux, though the software was designed to be platform independent. The robot was tested in indoor and outdoor environments, under varying lighting conditions, and with a varying number of people in the scene.
Testing revealed that even though the robot could be made to work in both indoor and outdoor environments, it benefited from a lens change when moving from one environment to the other. Non-IR-blocking lenses worked best in low-light indoor environments, and IR-blocking lenses worked best outdoors. Differences in terrain type proved to have no effect on performance. The results showed that stereoscopic imaging can be a cheap, robust and effective solution for person following.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER
1 INTRODUCTION
   1.1 Person Detection and Following
   1.2 Challenges of Person Detection and Following
   1.3 Contributions of This Thesis
   1.4 Summary of the Following Chapters
2 BACKGROUND
   2.1 First Generation
   2.2 Second Generation
   2.3 Third Generation
   2.4 Fourth Generation
   2.5 Fifth Generation
   2.6 Further Fifth Generation Experiments
3 METHOD OF PERSON DETECTION
   3.1 Stereo Block Matching
   3.2 Histogram of Oriented Gradients
   3.3 Hough Transform for Head Detection
   3.4 Color Matching
   3.5 Summary
4 SIXTH GENERATION HARDWARE AND SOFTWARE
   4.1 Stereoscopic Imaging System
   4.2 Compute Hardware and Programming Environment
   4.3 Initial Person Detection and Target Tracking
   4.4 Location Refinement and False Positive Removal
   4.5 Person Identification
   4.6 Speed and Direction Calculations
   4.7 Client-Server Communication
   4.8 Summary
5 TESTS AND RESULTS
   5.1 Stereoscopic Imaging Tests
   5.2 Hough Tests and Alternate Ranging
   5.3 Single Person Indoor Performance
   5.4 Indoor Tracking with Fixed Depth
   5.5 Outdoor Tests
   5.6 Testing with IR-Blocking Lenses and Multiple Subjects
   5.7 Summary
6 CONCLUSION
   6.1 Summary of Results
   6.2 Comparison to Prior Generations
   6.3 Future Work
REFERENCES

LIST OF TABLES

Table 4.1. Comparison of Interpupillary Distances and Their Accuracy
Table 4.2. Priority Lost Logic for Selecting a Tracking Target
Table 4.3. List of Valid Commands Supported by the Robot's Server
Table 5.1. Comparison of Interpupillary Distances and Their Accuracy
Table 5.2. Head-Radius Distance Estimation Results
Table 5.3. Stereo Block Matching Distance Estimation Results
Table 5.4. Head-Radius Distance Measurement Errors
Table 5.5. Head-Radius Distance Measurement Ranges

LIST OF FIGURES
Figure 3.1. A sliding window, shown in red, is moved horizontally across the right image until the same block of pixels is found as in the left image, as shown in blue.
Figure 3.2. The horizontal disparity between two stereo images, along with knowledge of the camera locations, can be used to triangulate distances to objects in the scene.
Figure 3.3. An example of the construction of an HOG feature vector.
Figure 3.4. A visualization of the Lab color space [21].
Figure 4.1. Pre-rectified stereoscopic image used in the initial exploratory program.
Figure 4.2. The resulting disparity maps after 7 seconds, 31 seconds, and 207 minutes of processing, respectively.
Figure 4.3. Screenshot of the camera rectification software discovering the camera's intrinsic and extrinsic parameters by analyzing the horizontal lines of the checkerboard pattern. The colored lines indicate the horizontal areas of the board that the software is analyzing for lens distortion.
Figure 4.4. The stereoscopic imaging system made from two USB camera boards.
Figure 4.5. The NVidia Jetson TK1 development board selected for image processing [15].
Figure 4.6. This block diagram shows the program flow of the sixth generation robot's software. Initialization and program entry are shown in a yellow square. The first two steps, which use both cameras to capture images and build a depth map, are shown in blue. The red squares denote steps which constitute person detection and identification, and only use the left camera's output. Finally, the green square denotes the actual person following step, when motor commands are sent to the robot chassis and the software's main loop starts anew.
Figure 4.7. A snapshot of the robot's onboard video stream showing the bounding box from the HOG algorithm indicated in white, the Hough transform head detection search area indicated in light blue, the approximate head location as determined by the Hough transform indicated in yellow, and the approximate locations of the shirt and pants indicated in pink.
Figure 4.8. A rectified image, shown on the right, is processed with the unwarping matrix to return the image to its original coordinate space, as shown on the left.
Figure 4.9. A screenshot of the companion Android app used to take direct control of the robot.
Figure 5.1. An office chair in profile mistaken for a human being by the HOG algorithm.
Figure 5.2. A bounding box adjusted to fit a squatting figure by discovering the location of the head. The yellow circle indicates the circle that is most likely the head, while every other circle discovered by the Hough transform is marked in purple. The blue square denotes the search area for the head, while the green square denotes the bounding box for the human as discovered by the HOG algorithm.
Figure 5.3. A building is identified as human by the HOG algorithm. The red bounding box indicates that the Hough transform was unable to find a head, and the object is probably not a human.
Figure 5.4. A street scene in which advertisements on the side of a distant building are misidentified as human by both the HOG and Hough algorithms.
Figure 5.5. Over-exposed video stream when entering a brightly lit area due to hardware malfunction.
Figure 5.6. A person walks away from the robot wearing a shirt colored similarly to the environment.
Figure 5.7. Onboard video from the robot passing through the same brightly lit environment as before, with auto-exposure working properly.
Figure 5.8. Over-exposed video stream due to intense sunlight. Each human appears to be wearing identically colored clothing from the robot's perspective, even though the outfits were quite different.
Figure 5.9. Infrared radiation caused all persons to appear purple, rendering identification by color matching impossible.
Figure 5.10. Because of the intensity of the light, shade and shadowy areas appear pitch black.
Figure 5.11. After infrared-blocking lenses were installed, the robot was able to properly distinguish individual human subjects outdoors.
Figure 5.12. Onboard footage of the robot detecting both humans and correctly choosing the tracking target in an indoor environment.
ACKNOWLEDGEMENTS

I am forever grateful to the many friends, family members and colleagues, without whom this thesis would not have been possible, for their support, advice, and contributions. Foremost, I would like to thank my advisor and committee chair, Dr. Mahmoud Tarokh, for his guidance and support. His expert knowledge of the subject matter, suggestions and feedback were invaluable throughout the research and development of this thesis and motivated me to work beyond my abilities. I would also like to thank Dr. Xiaobai Liu and Dr. Ashkan Ashrafi for taking time from their busy schedules to serve on my thesis committee.

Many months of debugging were saved thanks to Shalom Cohen, a thesis student. I would like to give my appreciation to Shalom for meeting with me and showing me how to use the Segway RMP chassis, along with its various quirks and oddities. His examples and explanations were integral to combining my software with the Segway hardware in a timely manner.

Finally, I would like to express my utmost gratitude and appreciation to my family. I am sincerely thankful to my wife, Rina, for her continuous encouragement and moral support, as well as for giving me the time and space necessary to develop the software, hardware and thesis paper. And lastly, I would like to thank my brother, Paul, who was always cheerful and willing to travel to the lab on weekends to aid in testing. This project has truly been a collaborative effort, and I am humbled by the kindness, encouragement and support of those who contributed.

CHAPTER 1
INTRODUCTION

Broadly, person following in robotics is the task in which a robot must detect humans in its environment, recognize an individual human, and follow that human target as they move through the environment. While person following is trivial for a human, many factors, such as changing lighting conditions, complex environments, and variations in the appearance of humans, make it a challenging task for robots. This chapter discusses person detection, person following, and the challenges they present in the context of robotics.
1.1 Person Detection and Following

In order for a robot to follow a person, the person must first be detected in the given scene. The scene must be imaged in some way by sensors attached to the robot, whether through passive imaging using a standard digital camera or thermal camera, active imaging using an RGB-D camera or laser scanner, or by some other means. The sensor data must then be processed to separate the humans in the scene from the background and other foreground objects. Once the humans have been extracted, the robot must select a single human to follow and adjust its speed and direction accordingly.

Various techniques exist for person detection, each with advantages and drawbacks. Motion-based person detection typically involves frame subtraction, where the contents of the current frame are subtracted from the contents of the previous frame, thus eliminating static areas and revealing the silhouettes of objects in motion [26]. Since frame subtraction is computationally cheap, such person detection techniques tend to be very efficient [32]. Pierard et al. used this technique to great effect by combining frame subtraction with a shape-filling and pixel-voting heuristic, where each silhouette is filled with large rectangles that act as a signature for the type of object the silhouette represents, and achieved a high degree of accuracy in detecting humans [1].

Parts-based human detection algorithms work by detecting individual body parts whose proximity and placement relative to one another act as a heuristic for detecting the presence of a human [25]. Such algorithms have the advantage of being able to detect humans that are not in motion and who may be in a variety of poses or states of partial occlusion. Mikolajczyk et al. demonstrated the effectiveness of one such algorithm by modeling the human body as a collection of seven body parts. Each individual part was detected using a Scale-Invariant Feature Transform (SIFT) algorithm and a cascading classifier. The cascading classifier uses the scale and location of each body part as a confidence measure to determine whether any neighboring body parts exist [2]. Once a certain number of body parts have been detected, the region is marked as containing a human figure [2]. Experiments using the MIT-CMU test set revealed that the detector performed comparably to or better than other state-of-the-art detectors at the time of writing [2].

Shape and color recognition algorithms, such as the Chamfer system and color matching, have been used for human detection in video with varying degrees of success [24][32]. Color- and shape-based human detection algorithms, which use the shape and color of the human target's head and clothing for detection, have also been used in previous generations of the person following robot, which will be discussed in detail in the next chapter.
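To make the motion-based approach described above concrete, the sketch below shows a minimal frame-differencing loop in C++ using OpenCV. It is illustrative only: the camera index, threshold value, and display code are arbitrary choices, and the shape-filling heuristic of Pierard et al. is indicated only by a comment.

```cpp
#include <opencv2/opencv.hpp>

// Minimal frame-differencing sketch: static background cancels out and
// moving silhouettes remain. The threshold value and camera index are
// arbitrary placeholders, not tuned values.
int main() {
    cv::VideoCapture cap(0);               // any camera source
    cv::Mat frame, gray, prevGray, diff, mask;

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (!prevGray.empty()) {
            cv::absdiff(gray, prevGray, diff);              // |current - previous|
            cv::threshold(diff, mask, 30, 255, cv::THRESH_BINARY);
            // 'mask' now holds the silhouettes of moving objects; a shape
            // heuristic (e.g., rectangle filling as in [1]) would be applied
            // here to decide whether a silhouette represents a person.
            cv::imshow("motion mask", mask);
            if (cv::waitKey(1) == 27) break;                // Esc to quit
        }
        gray.copyTo(prevGray);
    }
    return 0;
}
```

Because the mask is computed from consecutive frames, any motion of the camera itself produces widespread differences unrelated to the person, which is precisely the limitation discussed in the next section and encountered by the first generation robot in Chapter 2.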
1.2 Challenges of Person Detection and Following

Though humans can recognize other humans with relative ease, the task is quite difficult for a robot. Problems in human detection primarily stem from three areas: environment-related, human-related and motion-related factors.

If a person following robot is to be used practically, then it must be able to cope with complex environments. Environments can be plain and orderly, such as an empty hallway, or complex and organic, such as an area containing heavy vegetation. Such environments may prevent algorithms that rely on shape detection and color matching from detecting humans, as the contours of a complex backdrop may confuse shape detection routines, and plain, featureless environments may render humans in similarly colored clothing undetectable. Furthermore, the lighting of the environment must be taken into consideration. As the illumination in the environment varies, so too does the appearance of colors. As the human target moves from one location in the environment to another, their coloration will appear to change, presenting yet another problem for color matching algorithms.

The robot's own mobility may also prevent person detection from working correctly. When the robot is in motion, both foreground and background will appear to change, rendering frame-subtraction-based person detection techniques ineffective. Likewise, as the robot makes abrupt turns, drives over uneven terrain, or introduces significant judder or motion blur to the video stream, the human target may become difficult or impossible to detect. Rough terrain proved to be a major problem for previous generations of the person following robot.

Finally, humans themselves tend to present a significant challenge to human detection. Human appearance is highly variable, as humans may be clothed in a wide variety of colors and patterns, may appear in a wide range of poses and may be partially occluded. Parts-based algorithms are able to function despite imperfect lighting and partial occlusion of the human target, but are far too computationally expensive to be used in a real-time system, as is evidenced by the work of Mikolajczyk et al., whose algorithm required approximately ten seconds to process a single 640x480 image [2]. Ideally, a person following robot must be able to detect humans regardless of their coloration, pose or completeness in the frame while also being power efficient and maintaining a high framerate. The robot must also be able to cope with video instability induced by rough terrain and be able to recover from tracking loss.

1.3 Contributions of This Thesis

The work presented in this thesis tests the viability of a robot vision system built with off-the-shelf components that is cheap, power efficient, and able to function with an acceptable level of accuracy in sub-optimal lighting and environmental conditions. This thesis attempts to overcome or significantly mitigate all of the problems in person detection and following through the use of a stereoscopic imaging rig built from two ordinary CMOS camera boards and a small, power-efficient computer (NVidia TK1). A combination of algorithms is used to efficiently detect and track humans in three-dimensional space. Humans are initially detected using the Histogram of Oriented Gradients algorithm, a shape-based detection method that is invariant to lighting changes. A depth map of the scene is created using a standard block matching algorithm. Then, each person's location is fine-tuned by searching for the head using a circle Hough transform, which also serves to filter out false positives, and by combining the person's two-dimensional location with depth map data. Finally, additional processing steps are taken to further reduce false positives before motor commands are sent to the underlying robot chassis. This thesis also includes many optimizations that allow the software to run on a mobile processor with an acceptable level of accuracy and framerate.
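The following sketch illustrates the general shape of the detection chain just described, using OpenCV's stock HOG person detector followed by a circle Hough transform over the upper portion of each candidate. It is a simplified approximation rather than the robot's actual implementation: the window stride, padding, and Hough thresholds are placeholder values, and the depth-map lookup, clothing feature vector, and trajectory tracking steps are omitted.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Simplified sketch: HOG proposes person candidates, then a circle Hough
// transform over the upper quarter of each candidate looks for a head.
// All numeric parameters are illustrative placeholders.
std::vector<cv::Rect> detectPeople(const cv::Mat& frameBgr) {
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> candidates, confirmed;
    hog.detectMultiScale(frameBgr, candidates, 0, cv::Size(8, 8),
                         cv::Size(16, 16), 1.05, 2);

    cv::Mat gray;
    cv::cvtColor(frameBgr, gray, cv::COLOR_BGR2GRAY);

    for (const cv::Rect& r : candidates) {
        // Search only the top quarter of the bounding box, clipped to the image.
        cv::Rect top = cv::Rect(r.x, r.y, r.width, r.height / 4) &
                       cv::Rect(0, 0, frameBgr.cols, frameBgr.rows);
        if (top.area() == 0) continue;

        cv::Mat roi = gray(top).clone();
        cv::GaussianBlur(roi, roi, cv::Size(9, 9), 2.0);

        std::vector<cv::Vec3f> circles;
        cv::HoughCircles(roi, circles, cv::HOUGH_GRADIENT, 1,
                         roi.rows,              // expect at most one head here
                         100, 20,               // Canny / accumulator thresholds
                         roi.rows / 6, roi.rows / 2);
        if (!circles.empty())
            confirmed.push_back(r);             // head found: keep the detection
        // Otherwise the candidate is treated as a likely false positive.
    }
    return confirmed;
}
```

In the full system described in Chapter 4, each confirmed bounding box would then be combined with the stereo disparity map to obtain a three-dimensional position and matched against the stored clothing feature vector of the tracking target.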
Results show that the software is able to detect humans even in severely adverse lighting conditions, in indoor and outdoor environments, and irrespective of human-related factors.

1.4 Summary of the Following Chapters

Chapter 2 will contain a description and analysis of previous generations of the person following robot, including what approach was used, what problems were solved and what problems persisted, as background information. Chapter 3 will discuss the software architecture and implementation details of the person detection algorithm proposed in this thesis. Chapter 4 will cover the hardware implementation details of the person following robot. Chapter 5 presents the testing that was performed and the results of those tests. The thesis will conclude with Chapter 6, where achievements are summarized and potential solutions for remaining problems are presented as future work.

CHAPTER 2 BACKGROUND

Five generations of person following robot have been developed at San Diego State University's Intelligent Machines and Systems Lab, each improving on the previous generation's achievements. Each generation will be described in detail as background information for the reader to gain an understanding of the evolution of the person following robot project.

2.1 First Generation

The first generation of person following robot, developed by P. Ferrari and Dr. Tarokh, consisted of a Pioneer 2DX mobility platform, a remote laptop computer, a transmitter and receiver, and a Polar Industries GFP-5005 CCD wireless camera [3]. Initial software was developed, which utilized thresholding and region growing as a means of person detection, and acted as the means for steering the robot. A basic description of the program flow is as follows: The CCD camera, which was mounted on the top of the Pioneer 2DX robot chassis, captured and transmitted images wirelessly to a remote laptop. The laptop processed each image by segmenting the tracking subject from the background, and then calculating a center of mass for the segmented region. The size of the segmented region was used as an estimate of the distance to the tracking subject. The combination of the center of mass and size of the segmented region was used to determine in which direction the robot should travel and how fast. Finally, the newly computed motor commands were transmitted wirelessly back to the mobility platform, which put the robot into motion [3]. As each image was received, the tracking subject was segmented from the background using image thresholding. Each pixel in the image data is compared against a predetermined range of values that act as a threshold. If the pixel under examination falls outside of the threshold, then it is discarded, leaving only pixels that fall within a specific range of values. In the first generation robot, the tracking subject's shirt color range was used as the threshold value, resulting in the segmentation of the person's shirt. To further isolate the subject's shirt, region growing was used to examine the segmented region's eccentricity, compactness and circularity [3]. In the event that several regions were segmented in the initial thresholding step, the compactness, eccentricity and circularity measures would be compared against those of the training data to choose the region that most likely represented the tracking subject. Once a segmented region was selected, its size relative to the size of the frame and its center of mass were calculated.
The size and center of mass values were used with a fuzzy controller to determine the robot's motion. The horizontal location of the center of mass within the frame was used to determine the direction that the robot should rotate. If the center of mass appeared generally to the left, then the robot would rotate counter-clockwise, and clockwise if the person appeared generally to the right. The vertical position of the center of mass and the size of the segmented region were used as distance estimators and controlled the robot's forward motion. If the center of mass was high in the frame or the segmented region was small, then the person was expected to be further away and the robot would speed up. Conversely, if the segmented region was large, then the person was expected to be close and the robot would slow down. By employing these simple rules, the fuzzy controller attempted to keep the segmented region in the center of the frame [3]. The authors initially attempted to track the segmented region from frame to frame using a frame subtraction technique, where the image data of the current frame was subtracted from that of the previous frame. The idea was to help isolate the tracking subject, which was expected to be in motion, by eliminating the static background. However, the authors found that this technique was unusable when the robot itself was in motion, as the motion of the robot would produce many differences in the resulting image that were unrelated to the tracking subject. The final version of the software also had several important limitations. Though the robot was shown to be able to track a target in ideal conditions, the color thresholding technique had difficulty isolating the tracking subject when the background was of a similar color or when multiple similarly dressed persons were present in the frame. The hardware was also limited in its range by the wireless modems, and the framerate of the software was bottlenecked by the wireless transmission speed of the camera. Finally, the fixed camera prevented the robot from tracking a subject that was making quick directional changes, as the subject would exit the camera's field of view.

2.2 Second Generation

The second generation robot, built by J. Kuo and Dr. Tarokh, improved on the first generation robot largely through hardware changes [4]. In order to improve the robot's performance in tracking fast moving subjects, a new color CCD camera system was added to the robot that supported pan-tilt motion. The robot chassis itself was replaced with an all-terrain version from the same vendor to improve performance on rough terrain. The wireless transmitters and receivers were also removed to eliminate network-induced lag and increase the range of the robot. The new hardware additions necessitated changes to the software. With the introduction of a color camera, the robot was now able to take advantage of detecting objects by color instead of just grayscale level. The authors elected to use the HSI color model, which describes color by hue, saturation and intensity, with the intensity component discarded when performing color matching in order to compensate for lighting changes. Like the previous generation, the image data was filtered by threshold values to isolate the tracking subject. However, the second generation robot used value ranges for hue and saturation as thresholds that were derived from an initial training phase. Once the subject was segmented, the size and center of mass were calculated as with the first generation robot.
However, the new pan-tilt hardware required that additional fuzzy controllers be created to rotate the camera. Since the pan-tilt hardware could react faster than the robot chassis, two extra fuzzy sets were created to track the target horizontally and vertically in the frame. If the target appeared off center vertically or horizontally, the camera would pan and tilt appropriately to recenter the target. The second set of fuzzy controllers, which controlled the robot's motion, used a five frame history of the pan-tilt motion to determine which direction the robot should travel and how fast. It was shown in testing that while the pan-tilt apparatus and wired architecture improved the range and tracking of fast moving targets, the system still suffered from tracking loss when changes in lighting occurred. Outdoor testing was performed during twilight, as bright sunlight created too great a contrast between sunlit and shaded areas. Environmental reflections also altered the coloration of clothing, reducing the robot's ability to track by color. Finally, the all-terrain robot chassis introduced new problems as it was unable to rotate smoothly, and the jittery motion caused motion blur, which further reduced the robot's ability to track a human subject [4].

2.3 Third Generation

The third generation robot, built by P. Merloti and Dr. Tarokh, improved on the work of the second generation with several hardware and software changes [5]. On the hardware side, a new two-wheeled robotic platform and camera system were used. The robotic platform was changed from a four wheel all-terrain vehicle to a self-balancing Segway RMP, which gave the robot the ability to navigate tight corners and turn in place. The camera system was also changed to one that had a slightly higher resolution at 640x480 and allowed manual control of the shutter and gain. As the new camera hardware allowed manual control of shutter and gain, the developers attempted to compensate for lighting changes by manually adjusting the exposure and gain settings relative to the region of interest containing the subject's shirt, rather than the overall image [5]. The software was also changed to track the subject's centroid from frame to frame. Thus, when multiple subjects were in the frame, the software attempted to find the correct subject by comparing the mass and location of centroids, and the region's color, to that of the training data and previously detected regions. The software controlling the motion of the robot was also changed. First, the fuzzy controllers for the pan-tilt hardware, speed and steering were changed to PID controllers, which proved to work better at tracking the subject [5]. When the pan-tilt could no longer track the subject, the robot would turn in the direction of the camera to recenter the subject. In the event of tracking loss, software was added to move the camera system in a sinusoidal fashion in an attempt to rediscover the subject. Though the hardware and software changes improved the overall performance of the system in areas where the previous generations were deficient, such as tracking in gradual lighting changes, recovering from tracking loss, and navigation in tight spaces, testing revealed that tracking performance was still inadequate when the subject's clothing color matched that of the background [5]. Finding a solution to this single problem would be the inspiration for the major shift in strategy found in the fourth generation robot.
2.4 Fourth Generation

The fourth generation robot, developed by R. Shenoy and Dr. Tarokh, attempted to solve the background coloration problem common to the first three generations of person following robot by shape detection rather than color tracking [6]. Even though the first three generations used various other metrics, such as eccentricity, circularity, compactness, and centroid location to aid in tracking, detection was primarily done by color matching to a trained sample. Thus, it is only natural that if the subject and background were similarly colored, the robot would have trouble detecting the subject. In order to get around this problem, the authors changed detection strategies from color matching to shape detection. Instead of looking for regions similarly colored to training data, the new software attempted to locate the subject's head by using a Hough Circle transform to detect all circular objects inside the image [6]. The software would first process the image with a Canny edge detector to remove all color information and keep only the edge information. Then, the edge detected image would be processed with a Hough transform, which attempted to discover all circular shapes in the image at three preselected radii. Initial testing showed that even though the Hough transform was generally good at detecting the head, it also produced many false positives. To compensate for false positives, a color matching based torso detection system was added that would look directly under each detected circle for a shirt similarly colored to that of a training sample [6]. With the addition of the torso detection system, most false positives were removed and the software was able to track subjects under most circumstances. Testing performed on video data from previous generations showed that the fourth generation could track subjects more accurately with fewer false positives. Even though the fourth generation solved the major common problem of the first three generations and showed much promise in tracking ability, it was never tested with actual robotics hardware [6]. The software simply required too much processing time to be practical, requiring several seconds per frame. Because of this, all of the testing of the software was performed on prerecorded videos from previous generation robots.

2.5 Fifth Generation

The fifth generation robot, developed by R. Vu and Dr. Tarokh, utilized the Microsoft Kinect to detect and track humans in three dimensions [7]. Since the Microsoft Kinect projects infrared light as a means of imaging the environment, human detection should not be affected by lighting changes. The Microsoft Kinect is a Structured Light system that works by projecting a proprietary 640x480 speckle pattern onto the environment. The speckle pattern is created in such a way that no two regions of the pattern are similar. An infrared sensor on the Kinect device receives the reflected speckle pattern, from which it calculates a depth map in hardware. Finally, the depth map is analyzed for human figures using a machine learning algorithm [8]. The Kinect was mounted to the top of an iRobot Create robotic chassis, along with a laptop for processing [7]. This robot was tested in varying light conditions, ranging from a pitch black environment to a bright day outdoors. The robot was also tested on different surfaces to determine how well the Kinect performs with terrain induced judder.
The results of lighting experiments showed that the robot was able to detect and track human subjects in a pitch dark environment, but performance began to suffer in a 1200 to 2300 lux environment typical of early morning sunlight. The volume in which humans can be detected was measured to be between 500mm to 4000mm in optimal lighting conditions. As the outdoor illumination became brighter, the Kinect failed to produce a depth map in midday shaded environments, measured at 17000 lux, and prevented the software from extracting human figures. The authors concluded that infrared from the sun was interfering with the Kinect’s own infrared projector and sensor system. The robot was tested on five different surfaces: laminate, tile, carpet, sidewalk and asphalt. Tests were performed on each surface type to determine how well the robot followed the subject in a straight line, from side to side, and in a zigzag pattern. The results showed that the robot performed best on carpet and worst on asphalt, with the other three surface types achieving mediocre results in all tests. The hard surfaces would induce motion judder, which caused loss of tracking. It was thus determined that this generation of robot performed worse than previous generations on non-optimal surface types. It was also observed that tracking suffered greatly when the subject was facing away from the camera, irrespective of terrain or lighting conditions, as the Kinect was designed to detect people facing toward it. The authors concluded that the iRobot Create was a poor choice of chassis, as it was too lightweight and produced instability in motion. The Kinect was also found to be ineffective in outdoor use, when tracking subjects turned away from the camera, and when the Kinect itself was put in motion. 2.6 Further Fifth Generation Experiments Using the same fifth generation chassis, V. Anuvakya and Dr. Tarokh performed further experiments using a different strategy for detection, tracking and tracking loss recovery. The authors attempted to recognize a single person among a group of people by training the software to remember biometric information [19]. The software ran in one of three possible modes. The first mode, dubbed Learning Mode, was the initial mode that the software was in and was responsible for detecting and learning the features of the tracking target [19]. The user would stand in front of the robot with both hands at an upward 90 degree angle. Pose recognition algorithms would recognize the person as the tracking target, and begin to learn their features. The person’s clothing color was memorized, along with their joint positions, relative bone lengths and overall height [19]. The Kinect SDK also returns a unique index number for every person present in the scene, which is consistent from frame to frame. The tracking target’s index number is stored along with their biometric and color information. Lastly, a value from a hardware light sensitive resistor was also memorized, which is later used to account for changes in lighting [19]. Once the tracking target is identified, the software switches to a Tracking Mode. As the index number is consistent from frame to frame, tracking was largely performed by the Kinect libraries. The Kinect libraries report the index number and position of each person in the scene, and thus the software would choose the correct person to track by comparing against the memorized index number from the initial learning phase [19]. 
Once the correct tracking target was identified, their center of mass was calculated and the offset of the center of mass from the center of the image frame was used to determine the robot’s direction of travel. Likewise, the person’s distance offset from a fixed value was used to determine the robot’s forward speed. If there was a change in lighting condition, as detected by the light resistor, then new lighting and color information was learned once the tracking subject was determined to be facing toward the camera. The person’s orientation to the camera was determined by the distance between their wrist joints [19]. Tracking by index number was unreliable in the event of tracking loss. If tracking was lost, such as in the event that the tracking subject moved out of the frame, then the Kinect SDK would assign the person a new index number and thus the person would have to be identified by other means [19]. To reacquire the correct tracking target, the heights of all people in the scene were compared against that of the initially learned biometric data. The software would then compare each person against initially learned color data and bone lengths to further distinguish between people of the same height. A total weighted sum of differences for these three features was used to determine which person to follow. Once a person was identified as the tracking target, the software switched back to Tracking Mode [19]. Results from experimentation showed that matching color histograms was an accurate way of identifying people by color regardless of changes in lighting [19]. The results also showed that the robot’s Recognition Mode was able to correctly determine the correct tracking subject 83% of the time after tracking was lost [19]. However, the robot experienced tracking problems and was unable to locate the person when their clothing color was similar to the background or when the person was standing next to a large object, in sunlight or in flickering light. The software also failed to correctly track the subject when they were walking at a constant speed next to another person, or when they made rapid changes in speed [19]. CHAPTER 3 METHOD OF PERSON DETECTION This thesis attempts to build on the strengths and weaknesses of prior person following robot projects by taking a relatively unique approach to imaging and person detection. The robot described in this thesis uses a stereoscopic camera to passively image the environment and generate a depth map. The image data is used for person detection through the Histogram of Oriented Gradients algorithm. The upper third of every detected person is further processed with a Hough circle transform to detect heads and remove false positives. Finally, the tracking subject is identified by the color of their clothing, whose location can be estimated with a high degree of accuracy from the prior two processing steps. The person’s location in 2D space and depth data are used to determine the robot’s forward speed and direction of travel. 3.1 Stereo Block Matching Given two images of the same scene taken from two cameras at slightly different positions, it is possible to recover depth information from the two dimensional images by examining their differences. Intuitively, one can understand the concepts behind this algorithm simply by looking at the way humans see their world in three dimensions. As an object moves closer to one’s face, the object will appear differently to the left eye than to the right. 
Likewise, if an object is distant, then the left and right eyes will both see the object in approximately the same way. By analyzing the differences between what the left and right eye sees, the brain can triangulate an approximate depth to any object in the person's field of view. This too can be mimicked in software, and by using information about the location, lens focal length and geometry of the cameras, the depth to any region in the image can be triangulated [9][27][30]. Just as humans have two eyes, the Stereo Block Matching algorithm operates on two grayscale images taken from two cameras that are a slight distance apart. For each pixel in the left image, a horizontal search is conducted for the same pixel in the right image. Since single pixels are usually nondescript, the block matching algorithm, as the name implies, operates on blocks of pixels, usually 5x5 or 7x7 [15]. For each pixel in the left image, the block of pixels surrounding it is compared against a horizontally sliding window of the same size on the right image. For each window position in the right image, a sum of absolute differences is computed to measure how closely the pixels in the left image's block match the pixels in the right image's window, per Equation 3.1 [29].

$$\text{sum of absolute differences} = \sum_{y=0}^{height} \sum_{x=0}^{width} \left| Left_{x,y} - Right_{x,y} \right| \qquad (3.1)$$

As one can see, Equation 3.1 operates on a block of pixels of dimension width by height. The gray level for every pixel in the left image's block, given by $Left_{x,y}$, is subtracted from the corresponding pixel at the same x and y location in the right image's sliding window, given by $Right_{x,y}$. The sum of the absolute value of the differences represents the overall difference between the left and right blocks. Naturally, the closest match will be the window that produced the lowest sum of absolute differences [15]. Once the closest match has been identified in the right image, the horizontal offset, or disparity, is written into a disparity map [9].

Figure 3.1. A sliding window, shown in red, is moved horizontally across the right image until the same block of pixels is found as in the left image, as shown in blue.

If the location of each camera is known, and the disparity between every corresponding pixel from the left and right image is known, then the depth to any pixel can be triangulated by drawing a ray projecting from each camera through the corresponding left and right pixels. The point at which the rays converge is the depth of that pixel, which can be saved in a separate depth map [9]. The concept is demonstrated in Figure 3.2.

Figure 3.2. The horizontal disparity between two stereo images, along with knowledge of the camera locations, can be used to triangulate distances to objects in the scene.

Generally, the distance between the two cameras, shown as Baseline b, and the focal length of the lenses, shown as f in Figure 3.2, are known constants [27][29]. For any region of interest in the scene, the horizontal offset from the left side of the image frame, shown as $X_l$ and $X_r$ in Figure 3.2, can be discovered in both the left and right images using the standard Stereo Block Matching algorithm given in Equation 3.1. Likewise, the disparity for the region can be calculated simply by subtracting the horizontal offset in the left image frame from the horizontal offset in the right image frame, as shown in Equation 3.2.
$$Disparity = \left| X_l - X_r \right| \qquad (3.2)$$

Finally, we can use the disparity information, along with knowledge of the distance between the cameras and the focal length of the lenses, to triangulate the depth to the object, as shown in Equation 3.3 [30].

$$Z = \frac{f \cdot b}{Disparity} \qquad (3.3)$$

Before images can be analyzed for disparity information, the images must first be processed to correct for lens distortion, rotation and positional misalignment of the cameras [31][33]. As each right image is scanned horizontally for matching pixels from the left image, it is critical that corresponding pixels in the left and right images appear on the same horizontal plane [9]. In order to ensure that corresponding pixels appear on the same scanline, the cameras must be calibrated before the stereo block matching algorithm is started. A checkerboard pattern of a known size can be used to build a model of the stereoscopic camera and discover the camera's intrinsic and extrinsic parameters, which include each camera's focal length, lens distortion parameters, and location and orientation relative to the other camera [9][31][33]. This thesis makes use of OpenCV's stereo camera calibration facilities to discover the intrinsic and extrinsic parameters, which, once discovered, are saved to a file for later use [31].

3.2 Histogram of Oriented Gradients

The first four generations of the person following robot extracted humans from the image data by color thresholding or circle detection, which proved to be problematic when the background was the same color as the person or when other circular non-human objects were present. The fifth generation of person following robot used proprietary libraries for person detection, which required the person to be facing toward the camera. As such, the sixth generation robot requires a person detection algorithm that is independent of color, can consistently detect humans in a variety of poses, and produces a sufficiently low number of false positives. The Histogram of Oriented Gradients (HOG) approach to person detection meets these requirements and is able to detect people in various poses, with various types of clothing, and even in partial states of occlusion. The detection method is also not affected by subject coloration or changes in lighting, making it ideal for a robot that may travel through environments with changing illumination. In essence, the HOG feature vector describes the silhouette of the average human figure. The vector is constructed in stages, the first of which is edge detection over the entire image. Though many complex methods for edge detection are available, the creators of the HOG algorithm, Navneet Dalal and Bill Triggs, found that convolving the image with a simple [-1 0 1] gradient kernel produced the best results [10]. Once the edge detection is completed in both vertical and horizontal directions, the resulting edge maps are divided into small 8x8 cells. For each pixel in each cell, Orientation Binning is performed. Orientation Binning is essentially the creation of a histogram of quantized gradient directions [10]. Using the horizontal and vertical edge maps produced in the first step, it is possible to calculate the gradient orientation and magnitude of each edge using Equations 3.4 and 3.5 respectively.

$$Orientation = \arctan\left(\frac{G_y}{G_x}\right) \qquad (3.4)$$

$$Magnitude = \sqrt{G_x^2 + G_y^2} \qquad (3.5)$$

Each orientation is converted from radians to degrees, and then quantized into 40 degree increments. Each bin in the histogram represents a 40 degree increment, resulting in 9 bins total.
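To make the binning rule concrete, the following is a minimal sketch (not the thesis implementation) of how the histogram for a single cell could be computed, assuming OpenCV is available for the filtering. atan2 is used here to obtain the full 0 to 360 degree range implied by the nine 40-degree bins; all function and variable names are illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Build the 9-bin orientation histogram for one 8x8 grayscale cell, following the
// magnitude-weighted, 40-degree binning rule described above.
std::vector<float> cellHistogram(const cv::Mat& cellGray)
{
    // Horizontal and vertical gradients using the simple [-1 0 1] kernel.
    cv::Mat kernelX = (cv::Mat_<float>(1, 3) << -1, 0, 1);
    cv::Mat kernelY = (cv::Mat_<float>(3, 1) << -1, 0, 1);
    cv::Mat gx, gy;
    cv::filter2D(cellGray, gx, CV_32F, kernelX);
    cv::filter2D(cellGray, gy, CV_32F, kernelY);

    std::vector<float> hist(9, 0.0f);   // nine bins of 40 degrees each
    for (int y = 0; y < cellGray.rows; ++y) {
        for (int x = 0; x < cellGray.cols; ++x) {
            float dx = gx.at<float>(y, x);
            float dy = gy.at<float>(y, x);
            float magnitude = std::sqrt(dx * dx + dy * dy);                     // Eq. 3.5
            float degrees = static_cast<float>(std::atan2(dy, dx) * 180.0 / CV_PI); // Eq. 3.4
            if (degrees < 0.0f) degrees += 360.0f;   // map to [0, 360)
            int bin = static_cast<int>(degrees / 40.0f) % 9;
            hist[bin] += magnitude;   // stronger edges contribute more to the vote
        }
    }
    return hist;
}
```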
For example, a gradient direction of 23 degrees would contribute to the first bin of the histogram, which has a range of 0 to 39 degrees, and an orientation of 44 degrees would contribute to the second bin, which has a range of 40 to 79 degrees, and so forth. The amount that each gradient orientation contributes to the cell's histogram is equal to its magnitude, giving prominent edges more influence than weaker edges [10]. After a histogram of quantized gradient orientations has been created for each cell, the cells are grouped into 2x2 blocks, with a 50% overlap between blocks [10][34]. Each cell is then contrast normalized relative to the other cells in the block to prevent aliasing [10][34]. The entire process of creating the HOG feature vector is shown in Figure 3.3.

Figure 3.3. An example of the construction of an HOG feature vector.

Finally, all of the cells in each block are concatenated together to form the HOG feature vector [10][34]. When the standard 64x128 descriptor size is used as described by Dalal and Triggs, the image is broken into 8x8 cells, 8 cells horizontally and 16 cells vertically, totaling 128 cells [10]. Since 4 cells make up a 2x2 block of 16x16 pixels and each block has a 50% overlap, the descriptor has a total of 105 blocks, 7 horizontally and 15 vertically [10]. Since each block has 4 histograms, one for each cell, and each histogram has 9 bins, the total feature vector length after all of the histograms are concatenated is 3780 values [10]. These feature vectors are used in conjunction with a classifier system, such as a Support Vector Machine or Neural Network, to determine whether each feature vector contains a human or some other object. During supervised training, feature vectors describing positive examples of humans and negative examples of other objects are created from 64x128 pixel images and are shown to the classifier system [10]. In the case of a neural network, back propagation can be used to train the classifier on the training data. After training, any region of an image to be tested for the presence of a human is scaled to 64x128 and a feature vector is created. The feature vector is then used as input to the classifier system, and a boolean describing whether a human was detected is returned. By scaling the image frame to various sizes, humans can be detected at a range of distances from the camera. Since the HOG algorithm operates on 64x128 regions, it is natural that people who are close to the camera will appear larger than these dimensions, and people who are distant from the camera will appear smaller. In order to detect everyone in the frame, the entire image is scaled to various sizes, and the HOG algorithm is performed on each new image, increasing the chance that the humans in the scene will fit in the 64x128 window [10].

3.3 Hough Transform for Head Detection

A common fundamental problem of computer vision is that of detecting shapes in image data. The Hough Transform is a feature extraction technique used to detect lines or shapes which can be described by an equation, and has the added benefit of being able to detect shapes which may be incomplete or otherwise only partially visible [12]. As with the Histogram of Oriented Gradients algorithm, the Hough Transform operates on edge detected images, which have a value of 0 when no edge is present and some value other than 0 when an edge is present [13].
Given an edge detected image, which shows only the perimeter of shapes, one can already intuitively understand the basic principle behind the Hough transform – that all non-zero valued points that are part of a specific shape should be solutions to the equation describing that shape [12]. Let us examine the simple case of a line, given by Equation 3.6, where m denotes the slope of the line and b denotes the y-intercept, and together they form the line parameters.

$$y = mx + b \qquad (3.6)$$

Given an edge detected image, every non-zero pixel that is part of the same line will share a common value for m and b. Therefore, if we had a way of tracking how many points in the image data fall on a line for every value of m and b, we could use the highest scoring combinations of m and b as evidence of the presence of a line [13]. We can create a parameter space (b, m), sometimes called Hough space, in which every point corresponds to the parameters of the shape's equation. In the simple case of a line using Equation 3.6, the equation can be rewritten such that x and y are constant and b and m are the variable parameters, as shown in Equation 3.7.

$$b = -mx + y \qquad (3.7)$$

The Hough space would be defined as a two dimensional array of which one dimension denotes the m values and the other dimension denotes the b values. Each index into the Hough space is called an accumulator, whose initial value is 0 [13]. Therefore, the Hough transform algorithm works by iterating over every non-zero pixel in the edge map, calculating every b and m value for the lines that pass through that pixel, and incrementing the corresponding accumulators in the Hough space [13]. Since pixels that lie on the same line will share a common value for m and b, it is only natural that the corresponding accumulator at index (m, b) in Hough space will be incremented multiple times. The algorithm concludes with a search for the highest scoring accumulator cells, which provide evidence of the existence of a line [13]. The Hough transform can be extended to discover the presence of circles simply by replacing the equation of a line with that of a circle, given in Equation 3.8 [11].

$$x = a + R\cos\theta, \qquad y = b + R\sin\theta \qquad (3.8)$$

Therefore, the Hough space would be extended to three dimensions to accommodate the three parameters (a, b, R), where parameters (a, b) denote the center of the circle, and R denotes the radius [20]. As before, the edge detected image is iterated over and for every non-zero pixel all possible values of a, b and R are computed. For every value of (a, b, R) that satisfies Equation 3.8 for the given x and y, the Hough space accumulator at index (a, b, R) is incremented [11]. When all pixels in the edge map have been examined, the Hough space is iterated over to find the global maxima, which indicate the presence of a circle in the image data [11]. Since finding all possible values to satisfy Equation 3.8 for a three dimensional parameter space is computationally expensive, the radius parameter is usually limited to a small range of values.

3.4 Color Matching

Sometimes it is desirable to segment an image by a specific color or color range. In most applications that require segmentation by color, a threshold is defined for the desired color range and all pixels that fall within the threshold are kept, while all pixels that fall outside of the threshold are discarded. In environments where coloration of clothing may be affected by changes in lighting, it is advisable to use a color space that takes illumination into account.
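As a practical aside on the circle transform described above, OpenCV exposes this search directly; a minimal sketch of a circle search restricted to a narrow radius range, as just discussed, might look like the following. The OpenCV 2.4-style constant and the parameter values shown are illustrative, not those used by the thesis software.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Search a grayscale region of interest for circles within a narrow radius range.
// Each returned entry is (center x, center y, radius). Parameter values are illustrative.
std::vector<cv::Vec3f> findCircles(const cv::Mat& roiGray, int minRadius, int maxRadius)
{
    cv::Mat blurred;
    cv::GaussianBlur(roiGray, blurred, cv::Size(3, 3), 0);   // suppress noise before voting

    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(blurred, circles, CV_HOUGH_GRADIENT,
                     1,                 // accumulator resolution (same as the image)
                     roiGray.rows / 2,  // minimum distance between detected centers
                     100, 20,           // Canny high threshold, accumulator threshold
                     minRadius, maxRadius);
    return circles;
}
```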
Previous generations of the person following robot used the RGB color space, which proved inadequate, and then later the HSI color space, which consists of hue, saturation and intensity components, with the intensity component being ignored to account for changes in lighting [6]. However, this too proved inadequate as the systems were unable to cope with variable lighting. This thesis uses the Lab color space, which consists of a lightness component L* and two color-opponent components a* and b*, and attempts to approximate human perception of color and luminosity [14]. In the Lab color space, color balance is adjusted by modifying the a* and b* curves, while lightness is adjusted with the L* component. A graphical representation of the Lab color space is shown in Figure 3.4.

Figure 3.4. A visualization of the Lab color space [21].

Since Lab color space is designed to mimic not only the way humans perceive color, but also the way humans perceive changes in color, any two colors in Lab color space can be compared for similarity with a distance function, given in Equation 3.9 [14],

$$E = \sqrt{\Delta L^{*2} + \Delta a^{*2} + \Delta b^{*2}} \qquad (3.9)$$

where E is a measure of similarity [22][23]. The smaller the resulting E value, the more similar the two colors are [28]. In the case of tracking a color over time, the L* component of the color of interest, which represents lightness, can be adjusted as changes in environmental lighting occur. Therefore, in the case of person tracking, as the person is detected from frame to frame, the reference Lab color information which identifies them is also updated.

3.5 Summary

This chapter discussed the algorithms and software concepts used in the sixth generation person following robot. It was shown how Stereo Block Matching can be used to discover corresponding pixels between images captured by two different cameras, and how the horizontal disparity of the pixels can then be triangulated into a depth map. An explanation of the Histogram of Oriented Gradients algorithm was given, and how it can be used to detect humans at various distances from the camera. The Hough Transform for the detection of circles was discussed, which is used by the sixth generation robot to detect heads. Finally, the Lab color space was discussed and a method for detecting the similarity of two colors in Lab color space was presented. These four algorithms make up the core of the sixth generation robot's person following software. In the next chapter, the hardware components of the sixth generation person following robot are discussed, including the development and testing of the stereoscopic imaging rig, power efficient compute hardware, and client-server communication model. The program flow of the sixth generation robot's software, from initial person detection to the physical motion of the robot, is analyzed. Lastly, an auxiliary Android application for taking manual control of the robot is presented.

CHAPTER 4 SIXTH GENERATION HARDWARE AND SOFTWARE

The sixth generation robot has gone through several revisions of hardware and software over the course of its development. The software was primarily written in C++ on an Ubuntu Linux-based desktop computer and tested on a development board also running an embedded version of Ubuntu Linux. The final hardware platform consists of a stereoscopic imaging rig built from two wide angle cameras and a Segway RMP chassis. The robot also features two computers dedicated to image processing and communication with the Segway RMP chassis.
4.1 Stereoscopic Imaging System Before a stereoscopic camera was built, an exploratory program was created to determine the feasibility of a stereoscopic system for depth estimation. The initial program was written as a single threaded Java program and operated on pre-rectified stereo images obtained from the internet. The program would operate on left and right images, an example of which is shown in Figure 4.1, to produce a disparity map using the standard block matching algorithm, shown in Figure 4.2. The program would load the left image and right image into memory, and then create grayscale versions of both images to simplify the block matching process. Since color data was thrown away, only similarity between gray levels was considered. The software would then enter a loop that iterated over the left image. For every pixel in the left image, a window of a predetermined size was placed around the pixel and then the right image was scanned along the same horizontal plane with a sliding window of identical dimensions, as per usual. For every window position in the right image, a sum-of-differences calculation, given in Equation 3.1, was performed against the left image’s window, with the lowest scoring position considered as the closest match. However, in order to speed up processing, an area larger than a single pixel was written into the disparity. The size of the area was usually equal to that of the scanning window. Thus, a large window size would require a larger amount of processing time than a smaller window size in order to find a corresponding pixel in the right image, but the number of pixels needing to be scanned was reduced as the software operated on window boundaries. The effect can be seen in Figure 4.2, in which the grayscale values denote the size of the disparity between the left and right image, with whiter the pixels denoting a larger disparity and darker ones denoting a smaller disparity. The disparity map which required 7 seconds of processing appears to be lower resolution than the one which required 207 minutes because a block of data was written to the disparity map for every corresponding pixel that was found. The software would try to find the next corresponding block in successive iterations rather than the next corresponding pixel. Testing revealed that the exploratory program was indeed able to produce disparity maps that visually appear correct. In the sample image shown in Figure 4.1, the wooden walkway is closest to the camera, followed by the river and rock face on the right. The green hillside on the left half of the image is furthest from the camera. This is reflected in the disparity maps that were generated, shown in Figure 4.2, as the area with the wooden walkway appears lightest, the river and rock face appearing a darker shade of gray, and finally the hillside appearing black or nearly black. Similar results were obtained with other stereoscopic images found on the internet, with the areas closest to the camera correctly showing as lighter than those further from the camera. Even though the exploratory program confirmed that stereo imaging was a viable method of depth sensing, it was not efficient enough for real-time use. In a real-time system, many frames would need to be processed each second as the robot passed through its environment for both person following and to avoid collisions with nearby objects. It became evident that the program needed to be highly parallelized. 
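To make the search loop concrete, a minimal C++ sketch of the naive per-block SAD search described above is given below. The original exploratory program was written in Java; the names, window size and disparity range here are illustrative, and the block-sized writes to the disparity map mirror the speed-up described in the text.

```cpp
#include <opencv2/opencv.hpp>
#include <climits>
#include <cstdlib>

// Naive block matching on a pair of rectified grayscale images. For every block in the
// left image, slide a window of the same size along the same scanline of the right image
// and keep the offset with the smallest sum of absolute differences (Equation 3.1).
cv::Mat naiveDisparity(const cv::Mat& left, const cv::Mat& right,
                       int window = 7, int maxDisparity = 64)
{
    cv::Mat disparity = cv::Mat::zeros(left.size(), CV_8U);
    for (int y = 0; y + window <= left.rows; y += window) {
        for (int x = 0; x + window <= left.cols; x += window) {
            int bestOffset = 0;
            long bestSad = LONG_MAX;
            // Slide the window leftward along the same horizontal plane of the right image.
            for (int d = 0; d <= maxDisparity && x - d >= 0; ++d) {
                long sad = 0;
                for (int wy = 0; wy < window; ++wy)
                    for (int wx = 0; wx < window; ++wx)
                        sad += std::abs(left.at<uchar>(y + wy, x + wx) -
                                        right.at<uchar>(y + wy, x - d + wx));
                if (sad < bestSad) { bestSad = sad; bestOffset = d; }
            }
            // Write the winning disparity into the whole block, mirroring the block-sized
            // writes used by the exploratory program to reduce processing time.
            disparity(cv::Rect(x, y, window, window)).setTo(bestOffset);
        }
    }
    return disparity;
}
```

Each block's search is independent of every other block, which is what makes the algorithm such a good fit for the GPU parallelization discussed next.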
Fortunately, Stereo Block Matching lends itself to parallelization as each operation in the block matching phase is the same operation performed every pixel in the block. The differences for each block can be calculated in parallel on the GPU. Furthermore, if enough processing units are available on the GPU, the differences can be calculated for every block on the horizontal plane at the same time. OpenCV has both CUDA and OpenCL implementations of the standard Stereo Block Matching algorithm that are highly parallelized and execute on the GPU. The sixth generation person following robot uses OpenCV’s GPU implementation of the Stereo Block Matching algorithm. Figure 4.1. Pre-rectified stereoscopic image used in the initial exploratory program. Figure 4.2. The resulting disparity maps after 7 seconds, 31 seconds, and 207 minutes of processing respectively. Once it was determined that a stereoscopic system can indeed be used to produce a disparity map that reflects the depths in the scene, a prototype stereoscopic camera was built from two Logitech C160 USB webcams to test the accuracy of depth measurement. The webcams were mounted to a cardboard box at varying distances from each other. Each webcam was aligned horizontally and vertically to point in the same direction as to allow the stereo block matching algorithm to discover corresponding pixels on the same horizontal plane. To facilitate this, another piece of software, of which screenshots are shown in Figure 4.3, was created with the OpenCV libraries to rectify each webcam and discover each camera’s intrinsic and extrinsic parameters. The software accesses both cameras and looks for a checkerboard pattern. Since checkerboards have a well-defined shape, consisting of straight horizontal and vertical lines, the software analyzes the distortion in the horizontal lines of the checkerboard to estimate how the lenses are warping the image. Once the checkerboard is visible to each camera, the software takes 30 images of the board and attempts to correct for the distortion of the lenses, such that the horizontal lines of the checkerboard appear straight. The intrinsic and extrinsic parameter values, which describe the camera model in terms of focal length, sensor size, projection matrix and lens correction matrices are written to a file, so that the calibration software need only to be run once each time the hardware is changed [24]. Figure 4.3. Screenshot of the camera rectification software discovering the camera’s intrinsic and extrinsic parameters by analyzing the horizontal lines of the checkerboard pattern. The colored lines indicate the horizontal areas of the board that the software is analyzing for lens distortion. The stereo block matching software was rewritten in C++ and also utilized the OpenCV libraries to perform block matching on the GPU, which brought the average processing speed up to 30 frames per second when limited by the camera’s refresh rate, or 41 frames per second when operating on static images. Depth measurements were taken of a chair placed in predetermined locations to which the distances were known. This test, to which the results are given in Table 4.1, determined how accurate a basic stereoscopic imaging system could be and what the optimal interpupillary distance between the cameras is, where interpupillary distance is defined as the distance from the center of the left camera lens to the center of the right camera lens. 
Interpupillary Distance:    7.62cm                 12cm                   22cm
Min. Distance:              37.6cm                 72.5cm                 135cm
Actual / Measured:          254cm / 234cm          254cm / 220cm          254cm / 236cm
                            162cm / 150cm          162cm / 145cm          162cm / 145cm
                            89cm / 86cm            89cm / 93cm            89cm / Failed
                            71cm / 70cm            71cm / 76cm            71cm / Failed

Table 4.1. A comparison of depth measuring accuracy between three different interpupillary distances.

From the table above, one can see that the optimal interpupillary distance for the cameras is between approximately 8cm and 12cm, in terms of smallest depth measurement error and shortest detectable distance. The final iteration of the stereoscopic imaging system was built from two high refresh rate USB camera modules sourced from Amazon and mounted to an aluminum channel with polycarbonate standoffs, as shown in Figure 4.4.

Figure 4.4. The stereoscopic imaging system made from two USB camera boards.

Each camera has a maximum refresh rate of 120Hz and came with 170 degree fisheye lenses. Cameras with a high refresh rate were specifically sought to decrease the time between image capture intervals to maximize framerates. Since the Logitech webcams had a 30Hz refresh rate, if image processing took more than 1/30th of a second, then the program would idle until the next frame became available, dropping the framerate to 15fps. However, a 120Hz camera would allow the software to only wait a maximum of 1/120th of a second for image data to become available, thus raising the overall framerate even without changes to the software. Another important consideration was the camera's low light performance. Cameras with small lenses and small sensors usually suffer in low light, requiring the camera to decrease the shutter speed in order to capture the image. Therefore, in order to mitigate the effects of poor lighting, cameras without infrared filters on the sensor or lenses were chosen. However, it was discovered in testing that the 170 degree wide angle lenses produced far too much distortion to be practical, even when rectification matrices were applied. The rectification software was unable to fully account for the lens distortion and the Stereo Block Matching algorithm would fail to find corresponding pixels on the same horizontal plane as a result. The 170 degree lenses were replaced with 2.8mm 92 degree field of view lenses, in both IR-pass and IR-block varieties for indoor and outdoor use. With the 92 degree lenses, the rectification software was able to produce matrices that removed the lens distortion from all but the edges of the frames, allowing the Stereo Block Matching algorithm to function as expected. IR-pass lenses were used for testing in indoor environments, as they allowed all frequencies of light to pass through to the sensor for a brighter exposure in what would typically be a lower light environment. However, it was discovered in outdoor testing that IR from the sun washed out all color information. Since shutter speed is guaranteed to be high in bright daylight, it was decided that IR-blocking lenses should be used outdoors to cut out IR from the sun and preserve color data. Doing so would allow the robot to correctly identify the tracking target by color. The effects of sunlight on color matching are discussed in greater detail in the next chapter.
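To summarize how the pieces of the imaging pipeline described in this section fit together, the following sketch wires the saved calibration data, rectification and block matching into one function using the OpenCV 2.4-style API. The calibration file name, matrix labels and parameter values are assumptions for illustration, and the CPU StereoBM interface is shown for brevity; the robot itself uses OpenCV's GPU implementation as described earlier.

```cpp
#include <opencv2/opencv.hpp>

// Rectify a stereo pair using previously saved intrinsic/extrinsic parameters, run the
// standard block matcher, and convert disparity to depth per Equation 3.3.
cv::Mat depthFromStereo(const cv::Mat& leftRaw, const cv::Mat& rightRaw,
                        float focalLengthPx, float baseline)
{
    // Load the camera model produced by the checkerboard calibration step.
    cv::FileStorage fs("stereo_calibration.yml", cv::FileStorage::READ);
    cv::Mat M1, D1, R1, P1, M2, D2, R2, P2;
    fs["M1"] >> M1; fs["D1"] >> D1; fs["R1"] >> R1; fs["P1"] >> P1;
    fs["M2"] >> M2; fs["D2"] >> D2; fs["R2"] >> R2; fs["P2"] >> P2;

    // Build the rectification maps and undistort both frames.
    cv::Mat map1x, map1y, map2x, map2y, leftRect, rightRect;
    cv::initUndistortRectifyMap(M1, D1, R1, P1, leftRaw.size(), CV_16SC2, map1x, map1y);
    cv::initUndistortRectifyMap(M2, D2, R2, P2, rightRaw.size(), CV_16SC2, map2x, map2y);
    cv::remap(leftRaw, leftRect, map1x, map1y, cv::INTER_LINEAR);
    cv::remap(rightRaw, rightRect, map2x, map2y, cv::INTER_LINEAR);

    // Standard block matching on the rectified pair (8-bit grayscale input expected).
    cv::Mat disparity;
    cv::StereoBM bm(cv::StereoBM::BASIC_PRESET, 64 /*disparities*/, 9 /*SAD window*/);
    bm(leftRect, rightRect, disparity, CV_32F);

    // Depth per Equation 3.3: Z = (f * b) / disparity; zero-disparity pixels are invalid.
    cv::Mat depth;
    cv::divide(focalLengthPx * baseline, disparity, depth);
    return depth;   // in the same units as the baseline
}
```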
4.2 Compute Hardware and Programming Environment

The software for the person following robot was almost entirely developed on a desktop computer with an NVidia GTX 980Ti graphics card, an Intel Xeon X5680 processor and 18 gigabytes of RAM, and was designed to execute on the GPU for maximum efficiency. Initial parts of the software were written in Java, but, due to performance concerns, the final iteration of the software was entirely written in C++. In order to maintain platform independence, the software makes heavy use of the OpenCV 2.4.11 libraries for accessing camera and GPU hardware. As the NVidia Jetson TK1 runs an embedded version of Ubuntu Linux, the desktop computer on which the software was developed also ran Ubuntu Linux to maintain compatibility. However, since OpenCV is available for Linux, Windows, OSX and Android, the software can readily be ported to any of the major platforms and run as a native binary. The software was compiled with gcc and Visual Studio 10 and 13, and run on Ubuntu 16.04, Ubuntu 14.04 and Windows 7 and 10. Preprocessor flags allow easy configuration of the software to best suit the underlying hardware, and the software can be compiled to run on Intel and AMD graphics hardware through OpenCL, NVidia graphics hardware through CUDA, or run in a CPU-only mode that does not make use of the GPGPU capabilities of the graphics hardware. Though the desktop hardware was more than adequate to handle the compute heavy algorithms of the sixth generation robot, it used a prohibitively large amount of energy and was thus unsuitable for mobile use. The Segway RMP chassis has a computer attached to its frame, which was used by previous generations of the person following robot for both image processing and communication with the chassis; however, this computer has a low power x86-based processor and no discrete GPU capable of general processing. As a result, a second mobile computer, the NVidia Jetson TK1 development board, was purchased and used solely for image processing tasks. The NVidia Jetson TK1 development board, shown in Figure 4.5, contains a quad-core 2.3GHz ARM Cortex-A15 CPU, a 192-core NVidia Kepler GK20a GPU capable of GPGPU programming, 2 gigabytes of RAM and 16 gigabytes of storage [15]. It also includes a USB 3.0 port, which has the necessary bandwidth to simultaneously stream uncompressed video data from two USB cameras at a high refresh rate, and an Ethernet port, which was used to send motion commands to the Segway RMP's computer.

Figure 4.5. The NVidia Jetson TK1 development board selected for image processing [15].

4.3 Initial Person Detection and Target Tracking

Though this thesis uses a stereoscopic imaging rig built from two USB camera boards, only the left camera's output is used for person detection. At each iteration of the software, both cameras capture an image of the scene, the images are processed to remove lens distortion, and disparity and depth maps are created from the rectified images. After that, the right image frame is discarded, and the unrectified left image frame is further processed to discover any humans that may be in the scene. The program flow is shown in Figure 4.6 below, where the program entry point and initialization are shown in yellow, the depth sensing stage is shown in blue, the person detection and identification are shown in red, and the physical robot hardware being put into motion is shown in green.

Figure 4.6. This block diagram shows the program flow of the sixth generation robot's software.
Initialization and program entry are shown in yellow square. The first two steps, which use both cameras to capture images and build a depth map are shown in blue. The red squares denote steps which constitute person detection and identification, and only use the left camera’s output. Finally, the green square denotes the actual person following step, when motor commands are sent to the robot chassis and the software’s main loop starts anew. The Histogram of Oriented Gradients method is used for initial person detection and is indicated in red in Figure 4.6 above. This algorithm was specifically chosen for its color invariance, resilience to poor lighting conditions and low false positive rate. Another benefit of the HOG algorithm is that once a person is detected, the location of their head and torso can easily be approximated, as the location is a constant offset from the upper left corner of the bounding box. For every person that is detected, their head will appear in the upper 1/3rd and middle 30% of the bounding box. Their upper body appears below the head, and is in the middle 20% of the bounding box, with a height of approximately 16% of the height of the bounding box. The lower body appears below the upper body, is square shaped and has an approximate width of 25% of the bounding box width. Being able to approximate the locations of the head, upper body and lower body within the bounding box facilitate further processing of the human subjects by making additional steps, such as shape-matching to discover the same features, unnecessary. In this thesis, the HOG algorithm can run in two different modes to improve performance. In the first mode, the left image frame is scaled to different sizes and each image is processed by the HOG algorithm to allow detection of human figures at various distances from the camera. This mode is only used when no humans have previously been detected. Every human that is detected in this mode is added to a tracker, which tracks the person’s current location in three dimensions, bounding box dimensions, clothing color, head location, detection timestamp, and a 300 millisecond history of previous locations and attributes. The second mode is used if data exists in the tracker. Since humans were previously detected, scaling the source image and running the HOG algorithm multiple times is no longer a necessity. Instead, a trajectory is calculated for each subject in the tracker and a region of interest based on the person’s expected location is copied from the image data. Since the subject’s expected location is calculated as a velocity from their location history and previous detection timestamps, the resulting location is invariant to any changes in framerate that might have occurred throughout the person’s detection history. The region of interest is then rescaled to the size of the expected HOG feature vector length, which in the case of this thesis is 64x128 pixels, and tested for the presence of a human using the standard HOG algorithm. If a person is detected, the person’s tracker entry is updated with new location information and timestamp, and the previous location information is added to their history. If a person is not found at the expected location, then the entry remains in the tracker for a period of time in case the person reappears. If the person does not reappear after an expected period of time (300 milliseconds was used as the default value), then the person’s entry is removed from the tracker. 
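As a concrete illustration of the tracker just described, a hypothetical entry structure and trajectory extrapolation might look like the following sketch. The field and function names are invented for illustration and are not taken from the thesis code; only the kinds of data listed above (3D location, bounding box, clothing color, head location, timestamp and a roughly 300 millisecond history) are assumed.

```cpp
#include <opencv2/core/core.hpp>
#include <deque>
#include <cstdint>

// Hypothetical per-person tracker entry reflecting the attributes described above.
struct TrackedPerson {
    cv::Point3f position;        // current location in three dimensions
    cv::Rect    boundingBox;     // HOG bounding box in the left image
    cv::Point   headCenter;      // head location from the Hough step
    cv::Vec3f   shirtColorLab;   // clothing colors in Lab space
    cv::Vec3f   pantsColorLab;
    int64_t     lastSeenMs;      // timestamp of the most recent detection

    struct Observation { cv::Point3f position; int64_t timeMs; };
    std::deque<Observation> history;   // roughly the last 300 ms of observations

    // Extrapolate where the person should appear now, using the velocity implied by the
    // oldest and newest history entries; because real timestamps are used, the prediction
    // is independent of any framerate changes.
    cv::Point3f predictedPosition(int64_t nowMs) const {
        if (history.size() < 2) return position;
        const Observation& oldest = history.front();
        const Observation& newest = history.back();
        float dt = static_cast<float>(newest.timeMs - oldest.timeMs);
        if (dt <= 0.0f) return position;
        cv::Point3f velocity = (newest.position - oldest.position) * (1.0f / dt);
        return newest.position + velocity * static_cast<float>(nowMs - newest.timeMs);
    }
};
```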
Lastly, if the last person is removed from the tracker, the person detection system switches back to the first mode to rescan the scene for human targets.

4.4 Location Refinement and False Positive Removal

The next step in the person tracking pipeline is to further process the tracker data with the Hough transform. Early testing revealed that certain non-human objects, such as office chairs viewed in profile, would consistently show up as false positives in the HOG step. To remedy this, the Hough transform was introduced as an additional step in the person tracking pipeline. For each person that appears in the tracker, the person's region of interest is analyzed for the presence of circles. Since the person's bounding box was defined by the HOG algorithm, the relative locations of the person's body features are well known, including their head, which is expected to be located in the middle of the upper 1/3rd of the bounding box. The radius of the head can also be inferred from the size of the bounding box. As such, the circle Hough transform need not be applied to the entire region of interest as defined by the person's bounding box, but instead can be efficiently applied to a small region in the upper 1/3rd of the bounding box and with a few radii. By limiting the search space for circles, the Hough transform's impact on framerate becomes negligible. A 3x3 Gaussian blur is applied to the region before the Hough transform is performed to improve detection. If a head is detected, then the person's tracker information is updated with an adjusted location relative to their head. If a head is not detected, then the person's tracker entry is flagged as not having a head and, depending on program settings, may be removed from the tracker or may be omitted from consideration for person following later. Testing showed that the Hough transform had some success in removing false positives, such as in the case of the office chair in profile, but failed to remove other false positives, such as certain street signs containing the letter O, which were often mistaken for a head.

4.5 Person Identification

Because the bounding box margin widths surrounding each target from the HOG algorithm are largely constant, it is easy to determine the location of a person's shirt and pants. Since the locations of their upper and lower body are well defined, colors can be sampled from both areas and used as a feature vector to identify an individual within a group. More importantly, since the critical steps of person detection have already been completed and color matching is only used for identification, it is not beholden to the same problems that previous generations of the person following robot had – chiefly, that surrounding colors can have a profound impact on the quality of person detection. Before the main person following program is started, the human to be followed stands in front of the robot and a training program is run. The training program samples pixels from the areas containing the shirt and pants, as indicated by the pink squares in Figure 4.7, to create two feature vectors.

Figure 4.7. A snapshot of the robot's onboard video stream showing the bounding box from the HOG algorithm indicated in white, Hough transform head detection search area indicated in light blue, approximate head location as determined by Hough transform indicated in yellow, and approximate location of shirt and pants indicated in pink.
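To illustrate how such clothing feature vectors might be built and compared, the following is a minimal sketch that converts a sampled clothing region to Lab space and accumulates the Equation 3.9 distance for each sample; this per-sample distance is the basis of the comparison given as Equation 4.1 below. The names, sampling scheme and sample count handling are illustrative rather than the thesis implementation.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Sample a number of Lab-space colors from a clothing region of a BGR frame.
std::vector<cv::Vec3f> sampleLab(const cv::Mat& bgrFrame, const cv::Rect& region, int samples)
{
    cv::Mat lab;
    cv::cvtColor(bgrFrame(region), lab, cv::COLOR_BGR2Lab);
    lab.convertTo(lab, CV_32FC3);

    std::vector<cv::Vec3f> features;
    cv::RNG rng(12345);   // fixed seed keeps the sketch deterministic
    for (int i = 0; i < samples; ++i) {
        int x = rng.uniform(0, lab.cols);
        int y = rng.uniform(0, lab.rows);
        features.push_back(lab.at<cv::Vec3f>(y, x));
    }
    return features;
}

// Sum of per-sample Lab distances (Equation 3.9) between a candidate's feature vector and
// the training vector; smaller totals mean more similar coloration.
float colorDistance(const std::vector<cv::Vec3f>& candidate,
                    const std::vector<cv::Vec3f>& training)
{
    float total = 0.0f;
    size_t n = std::min(candidate.size(), training.size());
    for (size_t i = 0; i < n; ++i) {
        cv::Vec3f d = candidate[i] - training[i];
        total += std::sqrt(d[0] * d[0] + d[1] * d[1] + d[2] * d[2]);
    }
    return total;
}
```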
Each feature vector, one for the pants and one for the shirt, contains the sampled colors converted to Lab color space and is saved to a file, along with the camera parameters that were used when the feature vectors were created, such as exposure, brightness, contrast, and color saturation settings. When the main person following program is started, the training data is loaded from the training file and the camera parameters are adjusted to match those that were used when the training data was created. Processing continues per usual until after the HOG and Hough steps have filled the tracker with potential person following targets. The tracker entries are iterated over and shirt and pants feature vectors are created for each target, where each feature vector consists of 96 color samples. The feature vectors of each target are then compared against the training data using a simple distance function to determine which target most closely resembles the training data in their clothing coloration, as given in Equation 4.1.

distance = \sum_{a=0}^{Features} \sqrt{(L_a - L_t)^2 + (a_a - a_t)^2 + (b_a - b_t)^2}        (4.1)

In Equation 4.1, the subscript a indexes the color samples of the candidate's feature vector and the subscript t denotes the corresponding sample of the training data. For each person in the tracker, Equation 4.1 is used to calculate the similarity of the person’s shirt color, pants color, and an average of shirt and pants color to the training data. The index of the person identified as the tracking target in the previous frame is also noted. The tracking target is then selected by a priority system to increase tracking reliability in scenarios where two or more people might be wearing similarly colored clothing. The parameters for the priority system are shown in Table 4.2.

Priority 1. Condition: Closest Aggregate Index = Previously Identified Index. The person with the shirt and pants color most similar to the training data is also the same person that was identified in the previous frame as the tracking target.
Priority 2. Condition: Closest Shirt Index = Previously Identified Index. The person with the shirt color most similar to the training data is also the same person that was identified in the previous frame as the tracking target.
Priority 3. Condition: Closest Pants Index = Previously Identified Index. The person with the pants color most similar to the training data is also the same person that was identified in the previous frame as the tracking target.
Priority 4. Condition: Closest Aggregate < Threshold. The average of the shirt and pants color distance measurements for the most similarly colored person is below a constant threshold of 20.0.
Priority 5. Condition: Closest Pants < (0.80 * Closest Shirt) < Threshold. The pants color distance measurement for the most similarly colored person is 20% more similar to the training data than the person's shirt is, and both values are below the constant threshold.
Priority 6. Condition: Closest Shirt < Threshold. The person with the closest matching shirt to the training data has a color distance measurement below the threshold of 20.0.
Priority 7. Condition: -1 (no match). The color distance measurements for the shirt, pants and average of shirt and pants for all people were greater than the threshold of 20.0.

Table 4.2. This table shows the conditions of the priority system for selecting which person to follow. The algorithm favors following the same person from frame to frame, even if only part of their clothing matches the training data, and only falls back on pure color matching if the previously tracked person cannot be found by shirt or pants color.

The priority system prefers to follow the same person from frame to frame rather than purely the person who is most similarly colored to the training data.
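A direct transcription of Equation 4.1 into code might look like the following sketch; the function name and the use of cv::Vec3f for Lab samples are assumptions. The thesis does not state whether the sum is normalized by the number of samples before being compared against the 20.0 threshold of Table 4.2, so the raw sum is returned here.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Color distance between a candidate's sampled Lab colors and the training
// data, per Equation 4.1. Both vectors hold the same number of samples
// (96 in this thesis).
double colorDistance(const std::vector<cv::Vec3f>& candidate,
                     const std::vector<cv::Vec3f>& training) {
    double sum = 0.0;
    for (size_t a = 0; a < candidate.size() && a < training.size(); ++a) {
        double dL = candidate[a][0] - training[a][0];
        double da = candidate[a][1] - training[a][1];
        double db = candidate[a][2] - training[a][2];
        sum += std::sqrt(dL * dL + da * da + db * db);   // CIE76-style delta E per sample
    }
    return sum;   // dividing by the sample count would give a per-sample distance
}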
If the person who has the most similarly colored shirt and pants to the training data is also the same person that was previously being followed, they are once again selected as the tracking target. If, however, the most similarly colored person is not the previous tracking target, but another person with a closely matching shirt is, then the person with the closely matching shirt is once again selected as the tracking target. This is also true for pants color, and a person with very similar pants color to the training data, who was also being tracked in previous frames, will be selected as the tracking target again even if other people’s clothing matches the training data better. The priority system does not select purely by color similarity unless the tracking target in the previous frame cannot be identified by clothing color. In the case that the tracking target from the previous frame cannot be identified, the software will choose who to follow purely on the basis of color similarity. The top priority is someone whose shirt and pants are similar to the training data. The next priority is someone whose pants are at least 20% more similar to the training data than the closest matching shirt color. Finally, the person with the closest matching shirt is selected as the tracking target, provided that their shirt’s color distance is below the constant threshold. If the shirt and pants color distances for all people in the scene are greater than the threshold, then it is assumed that the person to be followed has left the scene and the robot should perform its tracking loss actions to rediscover the correct person. This priority system, combined with the tracking software’s ability to track people from frame to frame by trajectory and velocity, helps reduce tracking loss or confusion in situations such as people crossing paths or people standing in close proximity to each other. In each new frame, the robot will always prefer to track the same person that was being followed in the previous frames, even if their coloration is not an exact match. By choosing a person to follow based on how similar they are not only to the training data but also to the person that was being followed in the previous frame, the robot is less susceptible to jerky motion or rapid changes in lighting, as would be the case if the robot selected a tracking target purely by color from a group of similarly colored people. Another component of the person identification algorithm is the periodic retraining of the training set feature vectors. Retraining happens after a timeout of 100 milliseconds, rather than per frame. At every 100 millisecond interval, one percent of the selected target’s color data is alpha blended into the training set, such that after 10 seconds the training set is completely retrained. This continual retraining of the feature data aids in adjusting to changes in lighting, especially if the robot is used in a different environment than where the training data was originally created. After the software chooses a tracking target, it must determine a direction and speed. 4.6 Speed and Direction Calculations The speed and direction of the robot are calculated separately according to different criteria, and the robot may perform a range of actions depending on whether the tracking target is visible in the camera frame, recently exited the camera frame, or has not been seen for a while. Under normal circumstances, the person to be followed will be standing in front of the robot, in full view of the stereoscopic imaging rig.
In this scenario, forward speed is determined from the target’s distance as read from the depth map. Because the depth map is created from two rectified images, the shape of the depth map itself appears corrected for lens distortions. However, the HOG and Hough algorithms that determine the person’s location within the camera frame operate on unrectified images. As such, the coordinates of pixels in the image data, such as those located at the person’s center of mass, will not correspond to those pixels’ depth at the same coordinates in the depth map. To be able to sample depth data accurately, the depth map must be preprocessed with an unwarping matrix that, in effect, reverses the image rectification process and returns the depth data to the same coordinate space as the unrectified camera images. The unwarping matrix is created in the initialization stage of the program flow. When the program starts up, all of the necessary rectification parameters and matrices to correct for lens distortion are loaded. Since the camera’s frame size is known, a new table of the same dimensions as the camera frame is created. For every x and y coordinate in the camera frame, its lens-corrected location is calculated by a multiplication with the lens correction matrix. This location is then used as an index into the new table, where the original x and y values are written. This table, called the unwarping matrix, acts as a lookup table where the location of every pixel from a lens distortion corrected frame acts as an index into the table, from which the original pre-corrected coordinates of the same pixel can be retrieved. A demonstration of this concept can be seen in Figure 4.8, which shows rectified image data from the right camera unwarped using the generated unwarping matrix and output to the left window. Figure 4.8. A rectified image, shown on the right, is processed with the unwarping matrix to return the image to its original coordinate space, as shown on the left. Since the depth map may contain errors or fluctuations in depth for any given area, the depth data is sampled from the target’s chest area and the median depth value is selected as the person’s depth. Originally, the robot’s forward speed was directly proportional to the person’s distance from the robot. However, to reduce jerky motion induced by constant acceleration and deceleration of the Segway robot, the speed is now determined by a fuzzy logic system with two membership groups. The person’s depth data is compared against a threshold of 200 centimeters. If the person is further from the robot than 200cm, then the robot gradually accelerates to a predetermined high speed. If the person is at 200cm or below, then the robot decelerates to a predetermined low speed. The direction of travel is determined by the person’s center of mass relative to the center of the image frame, and is calculated using Equation 4.2.

direction = ((x_position + bounding_box_width / 2) / (frame_width / 2) - 1.0) \times 1000.0        (4.2)

The person’s position within the camera frame is calculated as the horizontal offset of the left edge of their bounding box, shown as x_position in Equation 4.2, plus half the width of the bounding box itself. The person’s position is then divided by half the frame width to scale their horizontal position between 0.0 and 2.0.
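The construction of the unwarping lookup table described earlier in this section might look like the following minimal sketch, assuming calibration outputs from OpenCV. The function name and the per-pixel use of cv::undistortPoints are illustrative assumptions about one way to realize the described table, not the thesis implementation.

#include <opencv2/opencv.hpp>
#include <vector>

// Builds a lookup table ("unwarping matrix") that maps a rectified pixel
// location back to the original, unrectified pixel coordinates.
// cameraMatrix, distCoeffs, R and P are the calibration parameters loaded at
// startup; frame is the camera frame size.
cv::Mat buildUnwarpTable(const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                         const cv::Mat& R, const cv::Mat& P, cv::Size frame) {
    // Two channels: the (x, y) of the original pixel, stored at its rectified location.
    cv::Mat table(frame, CV_32FC2, cv::Scalar(-1, -1));

    for (int y = 0; y < frame.height; ++y) {
        for (int x = 0; x < frame.width; ++x) {
            // Where does the original pixel (x, y) land after rectification?
            std::vector<cv::Point2f> src(1, cv::Point2f((float)x, (float)y)), dst;
            cv::undistortPoints(src, dst, cameraMatrix, distCoeffs, R, P);
            int rx = cvRound(dst[0].x), ry = cvRound(dst[0].y);
            if (rx >= 0 && rx < frame.width && ry >= 0 && ry < frame.height)
                table.at<cv::Vec2f>(ry, rx) = cv::Vec2f((float)x, (float)y);
        }
    }
    return table;   // in practice the points would be mapped in one batched call
}

// Usage: for a point (rx, ry) in the rectified depth map, the matching pixel in
// the unrectified camera image is table.at<cv::Vec2f>(ry, rx).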
Since the Segway RMP expects values between -1000.0 and 1000.0 for a full speed left turn and a full speed right turn, the person’s position is further scaled to fit this range by subtracting 1.0, bringing the range from -1.0 to 1.0, and then multiplying by 1000.0 to fit the value range that the Segway expects. Since the Segway RMP is a two wheeled platform, turning left or right is performed by accelerating one wheel and decelerating the other wheel. It was discovered that changes in direction also affect the robot’s forward speed, as one wheel is either accelerated or decelerated to make the turn. This too would result in undesirable motion if the robot was constantly adjusting its orientation toward the tracking subject. A dead zone was added to suppress unnecessary turns, and thus unnecessary changes in speed, until the tracking subject was on the far edges of the image frame. Therefore, after the direction of travel value is calculated, the value is checked to see if it falls within the dead zone, which constitutes the middle 40% of the image frame. If the person’s center of mass is within the dead zone, then the robot makes no attempt to turn toward the person and continues its forward motion. If the person’s center of mass falls either in the left 30% or right 30% of the image frame, then the robot will turn left or right respectively at a speed proportional to the person’s proximity to the frame edge. If the person manages to escape the robot’s field of view by exiting either the left or right edge of the frame, then the robot will continue turning in the direction where the person was last seen until the tracking subject is recaptured. This feature allows the robot to quickly recover from tracking loss by turning in the direction of the person, even if they are no longer visible to the robot. Tracking loss can also occur without the person leaving the frame, such as in the case of the person being occluded by an object. In such a case, since the robot has no knowledge of where the person went, it decelerates to a stop and idles for a predetermined amount of time. If the person does not reappear, then the robot begins to slowly turn in a circle to the left until the tracking subject reappears. 4.7 Client-Server Communication When the actual speed and direction have been determined, the Jetson TK1 computer that is responsible for image processing must communicate these intentions to the older secondary computer, which is responsible for controlling the Segway RMP chassis. The secondary computer runs a small server program on boot, which the primary computer connects to on program startup. The client and server are usually connected by TCP over a wired Ethernet connection for reliability, but can also connect wirelessly. If the client disconnects, either intentionally or due to a network error, the server continues to listen for new connections until a shutdown command is sent. A simple text-based communications protocol was developed to allow the client to signal its intentions to the server, but also to simplify any future development involving the Segway RMP. The communication is one way only, from client to server, and the server does not send any responses to the client. When the client software would like to move the robot in a particular direction, it simply sends one of the supported commands to the server, as given in Table 4.3.

Command: move <-1000.0 - 1000.0>. Example: move 200.0. Description: Moves the robot forward at 20% of maximum speed.
Command: turn <-1000.0 - 1000.0> <-500.0 - 500.0>. Example: turn 300.0 -100.0. Description: Moves the robot forward at 30% speed and turns left gradually.
Command: brake. Example: brake. Description: Stops the robot immediately.
Command: shutdown. Example: shutdown. Description: Stops the robot and shuts down the server, preventing further connections.

Table 4.3. Shows all of the valid commands supported by the server program.
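To make Sections 4.6 and 4.7 concrete, the sketch below computes the direction value of Equation 4.2, applies the dead zone, and sends a command in the text protocol of Table 4.3 over TCP. The host address, port, placeholder depth and speed values, and the mapping of the middle 40% of the frame onto the scaled range are illustrative assumptions; only the command strings come from Table 4.3.

#include <string>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

// Scale the person's horizontal position to the -1000..1000 range expected by
// the Segway RMP (Equation 4.2).
double direction(double xPos, double boxWidth, double frameWidth) {
    return ((xPos + boxWidth / 2.0) / (frameWidth / 2.0) - 1.0) * 1000.0;
}

// Send one text command ("move ...", "turn ...", "brake", "shutdown").
void sendCommand(int sock, const std::string& cmd) {
    std::string line = cmd + "\n";
    send(sock, line.c_str(), line.size(), 0);
}

int main() {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5555);                         // placeholder port
    inet_pton(AF_INET, "192.168.1.2", &addr.sin_addr);     // placeholder server address
    if (connect(sock, (sockaddr*)&addr, sizeof(addr)) != 0) return 1;

    // Example frame: bounding box left edge at 200 px, 80 px wide, 320 px frame.
    double dir     = direction(200.0, 80.0, 320.0);
    double depthCm = 350.0;                                // depth sampled at the chest (placeholder)
    double speed   = (depthCm > 200.0) ? 600.0 : 200.0;    // two-level fuzzy speed rule

    // Dead zone: the middle 40% of the frame maps to roughly -400..400 after
    // scaling, so small offsets produce straight forward motion only.
    if (dir > -400.0 && dir < 400.0)
        sendCommand(sock, "move " + std::to_string(speed));
    else
        sendCommand(sock, "turn " + std::to_string(speed) + " " + std::to_string(dir));
    close(sock);
    return 0;
}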
A companion Android app was also created that allows a person to take direct control of the robot. The intention of the app was to allow the user to drive the robot manually in complex environments that would be too hazardous for the robot to navigate alone. The Android app has an easy to use interface and connects to the server program mentioned above over Wi-Fi. A screenshot of the app is shown in Figure 4.9. Figure 4.9. A screenshot of the companion Android app used to take direct control of the robot. 4.8 Summary This chapter detailed the development of the stereoscopic imaging software and hardware, as well as the image rectification process. Tests to determine the stereoscopic imaging system’s optimal interpupillary distance were also discussed. An explanation of the sixth generation robot’s compute hardware was also given. This chapter also analyzed the general program flow of the sixth generation person following robot’s software, including initialization steps, person detection using the Histogram of Oriented Gradients algorithm, persistent frame-to-frame person tracking by trajectory, and person location refinement with the Hough transform for head detection. False positive removal through head detection and bounding box size was also discussed. The method by which a person is identified amongst a group was also explained, which included a discussion of the Lab color space, color matching with a distance formula, and a priority system that prefers tracking the same person from frame to frame over exact color matching. A method for determining the robot’s forward speed and direction was shown, which uses the tracking subject’s depth and horizontal position within the image frame. Finally, the sixth generation person following robot’s client-server model was presented and an explanation was given of how the speed and direction information is sent over a network to the Segway RMP hardware, which translates the data into physical motion. A companion Android app was also discussed, which allows the user to take manual control of the robot by connecting to the robot’s server software. The next chapter presents tests performed with the sixth generation person following robot under various lighting conditions, types of terrain, and with multiple people in the scene in various poses. Tests were also performed to determine if there are alternate methods for estimating the distance to a person without the use of stereoscopic block matching. The results are presented and analyzed in the next chapter. CHAPTER 5 TESTS AND RESULTS Throughout the robot’s hardware and software development, various tests were performed to gauge the viability and accuracy of the proposed solutions. The stereoscopic imaging system was tested for accuracy in depth estimation and compared against alternative methods of depth sensing. The HOG algorithm was tested in various lighting conditions. The robot’s ability to track a human subject was tested indoors and outdoors, through rapidly changing illumination and on various terrain. The results from these tests are presented in this chapter.
5.1 Stereoscopic Imaging Tests As the stereoscopic imaging hardware was being developed, the effects of the interpupillary distance between the two cameras on depth sensing were explored. The cameras were placed at varying distances apart and the depth to a target at a known distance was measured. Measurements were taken using the standard stereo block matching algorithm to create a disparity map, and reprojection was used to calculate the depth to the target in centimeters. The results were then compared to discover the optimal interpupillary distance for the camera system. The results, given in Table 4.1, are reproduced for the reader in Table 5.1.

Interpupillary Distance | 7.62cm (Min. Distance: 37.6cm) | 12cm (Min. Distance: 72.5cm) | 22cm (Min. Distance: 135cm)
Actual: 254cm | Measured: 234cm | 220cm | 236cm
Actual: 162cm | Measured: 150cm | 145cm | 145cm
Actual: 89cm | Measured: 86cm | 93cm | Failed
Actual: 71cm | Measured: 70cm | 76cm | Failed

Table 5.1. Shows a comparison of depth measuring accuracy between three different interpupillary distances.

The measurements were fairly consistent at longer distances for all three interpupillary distances. However, when the distance to the target was reduced, wide distances between the two camera lenses prevented the system from capturing depth accurately. The furthest spacing between the two cameras, 22cm, was unable to measure distances to objects closer than 135cm from the camera lenses. As there was very little difference in the depth sensing accuracy between the remaining two interpupillary distances, a spacing of 7.62cm was chosen for the final stereoscopic imaging rig as it was able to resolve depth at shorter distances. 5.2 Hough Tests and Alternate Ranging The Hough transform for head detection was added as a component of the robot’s software to help remove false positives and improve positional tracking of targets. Before the Hough transform was added, the HOG algorithm would occasionally confuse objects that have a human-like silhouette with humans and pollute the tracker module with false positives. An example of this, in which an office chair viewed in profile is mistaken for a human, is shown in Figure 5.1. The HOG algorithm also failed to adjust the bounding box of previously detected human subjects as they performed different poses, such as squatting or tilting to the side, which can indicate the person’s direction of travel and aid in calculating an accurate trajectory. Thus, the upper 1/3rd of each bounding box discovered by the HOG algorithm is scanned for the head using the Hough transform for circles. If at least one circle is found, then the person’s entry in the tracker is marked as having a head. If multiple circles are found, then the circle closest to the center of the search area is considered the head. An example of this is shown in Figure 5.2. Figure 5.1. An office chair in profile mistaken as a human being by the HOG algorithm. Figure 5.2. A bounding box adjusted to fit a squatting figure by discovering the location of the head. The yellow circle indicates the circle that is most likely the head, while every other circle discovered by the Hough transform is marked in purple. The blue square denotes the search area for the head, while the green square denotes the bounding box for the human as discovered by the HOG algorithm. Testing of the Hough and HOG algorithms was done on YouTube videos of street scenes from New York rather than static images [16][17].
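The depth figures in Table 5.1 were obtained by block matching followed by reprojection; a minimal OpenCV sketch of that measurement pipeline is shown below. The matcher parameters and the assumption that the reprojection matrix Q is expressed in centimeters are illustrative, not the exact thesis settings.

#include <opencv2/opencv.hpp>

// Measures the depth at a target pixel from a rectified stereo pair. Q is the
// 4x4 reprojection matrix produced by cv::stereoRectify during calibration.
// The left and right images must be 8-bit single-channel and rectified.
float depthAt(const cv::Mat& leftGray, const cv::Mat& rightGray,
              const cv::Mat& Q, cv::Point target) {
    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64 /*disparities*/, 21 /*block size*/);
    cv::Mat disparity16, disparity;
    bm->compute(leftGray, rightGray, disparity16);          // fixed-point result
    disparity16.convertTo(disparity, CV_32F, 1.0 / 16.0);   // to real disparities

    cv::Mat xyz;
    cv::reprojectImageTo3D(disparity, xyz, Q);               // metric 3-D points
    cv::Vec3f p = xyz.at<cv::Vec3f>(target.y, target.x);
    return p[2];   // Z, in the calibration units (centimeters here)
}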
Videos were used because they include both people in motion and motion of the camera, complete with physical translation, panning, tilting and motion blur. The videos were meant to be an analog for what a robot would see when operating in a crowded environment. The results were mixed. Certain false positives, such as tall buildings in the distance, were discovered by the HOG algorithm and successfully flagged as false positives by the Hough transform, as shown in Figure 5.3. However, other complex areas, such as advertisements on buildings, were discovered by the HOG and incorrectly marked by the Hough transform as having a head, as shown in Figure 5.4. Since many of the false positives seemed to be generated by distant objects, a filter was added to the tracker module that further improves removal of false positives, including those incorrectly marked as being human by the Hough transform. The tracker filter rejects any bounding box whose dimensions are smaller than predefined constants, or whose depth is farther than what would be considered a reasonable tracking distance. The filter works under the assumption that the person to be followed will be within some acceptable range of the robot and appear large in the camera’s field of view. For a 320x240 image, the rejection criterion is any bounding box that is smaller than 60x120 pixels or whose distance is greater than 500cm from the camera. Thus, any distant objects, including distant humans, are eliminated from the tracker. In practice, the filter was shown to be very beneficial for removing false positives. After the tracker filter was added, false positives became something of a rarity. Figure 5.3. A building is identified as human by the HOG algorithm. The red bounding box indicates that the Hough transform was unable to find a head, and the object is probably not a human. Figure 5.4. A street scene in which advertisements on the side of a distant building are misidentified as human by both the HOG and Hough algorithms. The head radius as detected by the Hough transform was also explored as an alternate method of depth estimation. The motivation behind this experiment was to further increase the framerate of the software. Since the Hough transform requires less processing power than the stereoscopic block matching algorithm, a sizeable increase in performance could be gained simply by estimating the depth by head size. Intuitively, if someone is closer to the camera, the radius of their head should appear to be larger than if they are further from the camera. Using this principle, fixed points of a known distance were marked on the ground and the distance to a human subject was measured at each point, both using head radius and the traditional stereo block matching method. The test was carried out three times for each method of depth estimation. The results were compared and are presented in Tables 5.2 and 5.3.

Head-Radius Depth Estimation
Position | Attempt 1 | Attempt 2 | Attempt 3 | Actual Distance
1 | 99-178cm | 139-187cm | 104-189cm | 270cm
2 | 243-275cm | 276-342cm | 313-446cm | 360cm
3 | 354-380cm | 408-498cm | 339-596cm | 480cm
4 | 381-437cm | 496-571cm | 528-596cm | 560cm
5 | 331-359cm | 443-536cm | 449-495cm | 480cm
6 | 294-307cm | 297-330cm | 342-390cm | 360cm
7 | 136cm | 63-108cm | 97-151cm | 270cm

Table 5.2. This table shows the results of depth estimation by examining changes in head radius.
Stereo Block Matching Depth Estimation
Position | Attempt 1 (med) | Attempt 2 (med) | Attempt 3 (avg) | Actual Distance
1 | 277.2cm | 277.2cm | 251-282cm | 270cm
2 | 332.2cm | 332.2cm | 363-397cm | 360cm
3 | 556.6cm | 415.8-556.6cm | 486-556cm | 480cm
4 | 556.6cm | 556.6cm | 569-664cm | 560cm
5 | 556.6cm | 556.6cm | 428-443cm | 480cm
6 | 332.2cm | 332.2cm | 334-351cm | 360cm
7 | 277.2cm | 277.2cm | 277-279cm | 270cm

Table 5.3. This table shows the results of depth estimation as estimated by the stereo block matching algorithm and reprojection.

Since a block of pixels is sampled from the depth map, stereo block matching depth estimation can be calculated in two ways. The first two attempts used the median value of the depth samples at the person’s center of mass, while the last attempt used an average of the depth samples as the final depth value. The data in Table 5.2 above seems to follow the pattern that measurements increase as the subject moves further from the camera, which implies that head radius can be used as an alternative to stereoscopic imaging for depth estimation. However, we can see that sometimes the actual distance falls inside the measured range, and sometimes the measurements can be incorrect by as much as 162cm, as shown in Table 5.4 below.

Head Radius Measurement Error
Position | Attempt 1 | Attempt 2 | Attempt 3
1 | 92 cm | 83 cm | 81 cm
2 | 85 cm | 18 cm | 0 cm
3 | 100 cm | 0 cm | 0 cm
4 | 123 cm | 0 cm | 0 cm
5 | 121 cm | 0 cm | 0 cm
6 | 53 cm | 30 cm | 0 cm
7 | 134 cm | 162 cm | 119 cm

Table 5.4. This table shows the difference between the actual distance and the closest number in the measured range for each attempt.

Table 5.4 shows the difference between the actual distance and the nearest number in the measured range. Therefore, if the measurement fluctuated between 99cm and 178cm, as is the case for Attempt 1 at Position 1, then the measurement would have to be adjusted by at least 92cm to include the actual depth of 270cm within the measurement range. If the actual distance already falls within the measurement, then the error is considered to be 0cm. If the distances estimated from the head radius exhibited a consistent measurement error, then the data could simply be adjusted to compensate for the error and the measurements could be considered reliable. However, even though neither the subject’s location nor the hardware were modified between successive attempts, the error in measurement is inconsistent from attempt to attempt. Since the size of the error is unpredictable between successive attempts, the depth estimation cannot be corrected simply by adding or subtracting a constant. There is also the problem of the estimated range itself. At each position, the distance from the camera to the person was estimated from their head radius. However, the measurement was not consistent from frame to frame, and therefore a minimum and maximum observed distance were recorded. The ranges of the estimations for each position are shown in Table 5.5.

Head Radius Measurement Range
Position | Attempt 1 | Attempt 2 | Attempt 3
1 | 79 cm | 48 cm | 85 cm
2 | 32 cm | 66 cm | 136 cm
3 | 26 cm | 90 cm | 257 cm
4 | 56 cm | 75 cm | 68 cm
5 | 28 cm | 93 cm | 41 cm
6 | 13 cm | 33 cm | 48 cm
7 | 0 cm | 45 cm | 54 cm

Table 5.5. This table shows the difference between the maximum estimated distance and minimum estimated distance per position for all three attempts.

The data in Table 5.5 can be interpreted as the precision of using head radius as a depth estimator. Even if the actual depth falls within the estimated depth range, the range itself can sometimes be quite large. In the worst case, the range was 2.57 meters.
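The two stereo sampling strategies compared in Table 5.3, the median and the average of the depth samples taken around the person's chest, might be realized as in the following sketch. The block size and the convention that invalid depth values are non-positive are illustrative assumptions.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Reduces a block of depth samples (in centimeters, CV_32F) to a single
// distance, either by median or by mean.
float blockDepth(const cv::Mat& depthCm, cv::Rect block, bool useMedian) {
    block &= cv::Rect(0, 0, depthCm.cols, depthCm.rows);
    std::vector<float> samples;
    for (int y = block.y; y < block.y + block.height; ++y)
        for (int x = block.x; x < block.x + block.width; ++x) {
            float d = depthCm.at<float>(y, x);
            if (d > 0.0f) samples.push_back(d);      // skip holes in the depth map
        }
    if (samples.empty()) return -1.0f;
    if (useMedian) {
        std::nth_element(samples.begin(),
                         samples.begin() + samples.size() / 2, samples.end());
        return samples[samples.size() / 2];
    }
    double sum = 0.0;
    for (float d : samples) sum += d;
    return (float)(sum / samples.size());
}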
The range also reflects the separation needed between two people for them to be considered individuals. If one person is, for example, 40 centimeters in front of another person, but the range of the head radius estimation at that position is 50 centimeters, then the software cannot be sure that the two people are in fact individuals and not the same person. Conversely, when looking at Table 5.3, it is apparent that both methods of stereoscopic depth estimation performed remarkably well for short distances. Selecting a depth value from the sampled data by median was prone to errors for distances greater than 480cm from the camera. However, when selecting depth as an average of the samples, the estimated distances were quite close to the actual distances for each point. Furthermore, the range of values when using the average of samples method for depth estimation was very small for all but the furthest distances from the camera. Though the stereo block matching performed better than head-radius based depth estimation in terms of accuracy, it did so at a performance cost. During testing, it was observed that the stereo block matching algorithm combined with the usual HOG and Hough transform algorithms used for person detection ran at approximately 10 to 15 frames per second. On the other hand, the head-radius based depth estimation, combined with HOG for person detection, ran at approximately 20-27 frames per second. Though the head-radius based method was less accurate, sizeable performance gains were achieved through its use. 5.3 Single Person Indoor Performance The person following robot was tested indoors in the hallway of the computer science building. The hallway has a smooth tile floor, white walls, tight corners, doorways, and variable lighting, both from overhead lights and windows. The robot was trained in an adjacent lab, and then manually driven to the center of the hallway where the main person following software was started. The robot was initially tested only in its ability to track a single human. The human wore a green shirt to stand out against the white background and stood approximately two meters in front of the robot when the software was started. Several tests were then carried out that included: Tracking the human in a straight line. Tracking the human in a zig-zag pattern. Tracking a human in profile or facing away from the robot. Tracking a human making sudden movements (jumping from side to side). Recovering from tracking loss when the human moves out of the robot’s field of view. Tracking the human through variable lighting conditions. Navigating around corners and through doorways. The results showed that the robot was able to successfully track the human in a straight line, zig-zag pattern, and when the human was making sudden jumping movements. The robot was able to detect and track the human in all orientations and also able to keep a walking pace. When the human would move out of the robot’s field of view, the robot rotated in the direction of the human’s last known location and was able to recover tracking in a reasonable amount of time, usually within a few seconds of tracking loss. A problem was discovered with the robot’s hardware during the variable lighting tests. To test the robot’s ability to track in variable lighting, the human walked from one end of the hallway to the other, moving from shadowy areas to areas illuminated by overhead lamps, and finally to an area illuminated by sunlight from a window. 
The amount of light was measured with a lux meter at 25 lux in the shadows, 200 lux under the overhead lamps, and approximately 1200 lux near the window. The robot had no problems tracking the human between the shadows and overhead lamps, but lost tracking when entering the sunlit area. Onboard video recorded by the robot revealed that the right camera has a hardware problem with its auto-exposure function and was unable to adjust itself automatically to accommodate the new lighting conditions. The image became greatly over exposed, as shown in Figure 5.5, and the robot was unable to consistently detect the silhouette of the human. Figure 5.5. Over-exposed video stream when entering a brightly lit area due to hardware malfunction. Though the robot was eventually able to detect the human and continue navigating, the results showed that the defective hardware would be an obstacle in future testing. It was also discovered that the robot’s simple person following logic was inadequate for an environment with tight corners. The robot was able to follow the human around corners, but with aid and careful positioning of the human. The robot would occasionally bump into corner edges. Finally, the robot is unable to follow a person through a doorway if the door needs to be held open by the human. If the door is open, then the human and robot can pass through the doorway without loss of tracking. However, if the human is required to hold the door open, then the distance between the human and robot is reduced to the point where the human no longer fits entirely in the frame and thus cannot be detected. 5.4. Indoor Tracking with Fixed Depth A second round of testing was performed to test the effects of auto-exposure and higher framerates on person tracking. The software was modified to clone the video stream from the working left camera as the right camera’s output. Block matching was disabled, and a fixed distance of 150cm was assumed to every point in space to allow the robot to follow any detected human at a constant speed. The effect would be both functional auto-exposure and a framerate increase of approximately 10 frames per second, bringing the average framerate up to 20-27fps. In order to test how well the software could detect and segment humans from the environment, the human subject wore a white colored shirt to match the white walls of the hallway, which was a problematic scenario in previous generations of the person following robot. Testing revealed that the coloration of the person’s clothing made no difference in the robot’s ability to detect and follow, as shown in Figure 5.6. Figure 5.6. A person walks away from the robot wearing a shirt colored similarly to the environment. Testing also revealed that the robot was able to cope with drastic changes in lighting. As the robot moved from a 25 lux environment to an approximate 1200 lux environment near the window and then back to a 25 lux environment, the auto-exposure feature of the camera was able to adjust the exposure settings to properly image the scene. The robot never failed to track the human and was able to follow the human through the environment without a problem. A screenshot from the onboard video taken in the brightly lit environment is shown in Figure 5.7 for reference. Figure 5.7. Onboard video from the robot passing through the same brightly lit environment as before, with auto-exposure working properly. The increase in framerate was also shown to have a positive effect on tracking. 
Since the robot has more information about the person’s location per second, it can calculate a more accurate trajectory to predict the person’s motion. This results in a lower miss rate in the HOG algorithm, which means that the entire image does not have to be rescanned for human targets. Since the robot has a better understanding of where the person will be in the immediate future, the software has to perform fewer calculations to detect the human, and thus can maintain a consistently high framerate. The robot was also tested with a higher resolution 640x480 video stream. Since four times as many pixels needed to be processed, the framerate dropped to approximately 10 frames per second when the robot was able to predict the person’s motion, and 4 frames per second when the image needed to be rescanned. Because the robot had less temporal information about the person’s whereabouts, the robot had trouble predicting the person’s movements if they were not traveling in a straight line and would easily lose tracking. Once the robot lost tracking due to low framerate, it was nearly impossible for the robot to regain a lock on the human target. It was also observed that, perhaps due to motion blur or a lack of hardware synchronization between the two cameras, distance estimation using the block matching algorithm was prone to errors, sometimes by hundreds of centimeters, when the robot was rotating left or right. 5.5 Outdoor Testing After indoor testing showed positive results, the robot was taken outdoors to be tested in intense sunlight and on various types of terrain. As before, the robot was trained in the lab on a person’s clothing. Training was performed indoors as it required the human subject to stand in front of the robot in isolation. The training program selects the first human detected by the HOG algorithm and memorizes their clothing color, which is visually verified by the person with the use of a monitor attached to the robot. The monitor allows the user to verify that they were indeed selected as the tracking target, and that the camera settings were adequate for a good exposure. After training was complete, the robot was manually driven outside using the companion Android app. Once outside, the robot was positioned on a cement surface and the main person following software was started. The software was also configured to use stereoscopic imaging for depth estimation, rather than using the head radius or assuming a fixed distance. Preliminary testing revealed that due to the drastic illumination differential between the indoor environment where the robot was trained and outdoors where the main software was started, the robot was unable to match humans to the training data. Though the robot was still able to detect humans, onboard video showed that the scene was greatly over exposed. Since the camera hardware is unable to set the exposure automatically, the robot needed to be reprogrammed in the field to accommodate the new lighting conditions. The level of over exposure can be seen in Figure 5.8. The onboard video also revealed another problem with the camera hardware. Because neither the lenses nor the sensors have infrared filters, the infrared radiation from the sun washed out all color information in the scene, making it impossible for the robot to identify the tracking subject by clothing color. To allow the test to continue, the exposure settings were adjusted manually and the threshold for identifying the tracking subject was reduced to include every person in the scene. 
Figure 5.8. Over exposed video stream due to intense sunlight. Each human appears to be wearing identically colored clothing from the robot’s perspective, even though the outfits were quite different. Once the exposure level was reduced and the identification threshold was adjusted, the robot was briefly tested on concrete before being moved to a grassy area. The robot was able to identify and track human subjects, and move without difficulty on the concrete and grass surfaces. Detection and tracking performance was unaffected by surface type. However, because multiple people were in the scene and every person appeared to be wearing purple clothing, the robot would occasionally become confused and follow the wrong person, as shown in Figure 5.9. The extreme lighting also made shadows appear to be pitch black, as shown in Figure 5.10. As people in shaded areas appear as featureless silhouettes, the robot would have had significant trouble properly identifying the correct tracking target. Figure 5.9. Infrared radiation caused all persons to appear purple, rendering identification by color matching impossible. Figure 5.10. Because of the intensity of the light, shade and shadowy areas appear pitch black. The illumination in direct sunlight was measured at approximately 95,000 lux while the illumination in shaded areas was measured at approximately 23,000 lux. The results of the outdoor tests showed that while the robot was able to detect and follow humans over rough terrain, infrared interference made it impractical in its current form. It was concluded that in order for the robot to become practical outdoors, new IR-blocking lenses would need to be installed. 5.6 Testing with IR-Blocking Lenses and Multiple Subjects New IR-blocking lenses were installed on the robot and the robot was once again tested indoors and outdoors. Two human subjects were used, one of which was wearing a black shirt and the other wearing a white shirt. As before, the robot was trained indoors and then manually driven to the same grassy area outside using the companion Android app. The first test tasked the robot to follow a single human without others in the scene through a shaded area to one illuminated by direct sunlight. Due to the new lenses, shadows no longer appeared as pitch black areas and the robot was able to follow the human subject without issue. The next test required the robot to continue following a human in a straight line while a second human, who was not the intended tracking target, moved across the screen, cutting between the robot and the intended target. This test was performed several times, and each time the robot was able to maintain a lock on the intended target, as shown in Figure 5.11. Figure 5.11. After infrared blocking lenses were installed, the robot was able to properly distinguish individual human subjects outdoors. There were several instances in which the robot appeared to briefly confuse the intended tracking target with the second person. This was due to the fact that the robot was trained indoors and then moved outdoors where the lighting conditions were drastically different. In order to allow the robot to initially recognize the same target human outdoors, the identification threshold had to be lowered to account for the lighting differences. Once the robot acquired a lock on the intended target, the training data set is continually retrained to promote better tracking performance as the subject moves between shaded and brightly lit environments. 
However, the identification threshold is not scaled automatically and remains low. Thus, when the second subject in the black shirt moves to an area with direct sunlight, his shirt and pants appear to be a light gray, similar to those of the intended target in the shade. Despite this, the robot was able to maintain tracking because it also takes trajectory, depth and prior locations into account when deciding which human to follow. The same tests were performed indoors in the hallway with the IR-blocking lenses still attached, where lighting was more consistent and closer to the lighting in the laboratory where the robot was initially trained. The intended target walked in a straight line and the second human subject crossed his path to try to confuse the robot. A second test was also performed in which both subjects walked in parallel away to see if the robot would occasionally favor the wrong person. The results showed that the robot performed better indoors than outdoors and was able to maintain tracking of the correct subject, despite the efforts of the second human. The robot would still occasionally confuse the two subjects, but usually for not long enough to alter the robot’s own direction of travel. It was observed that in one instance the robot confused the second human as the correct tracking subject for long enough to turn toward him. However, the robot quickly corrected its path when tracking of the intended target was reestablished, as shown in Figure 5.12. Figure 5.12. Onboard footage of the robot detecting both humans and correctly choosing the tracking target in an indoor environment. 5.7 Summary This chapter presented the various tests that were performed with the robot and discussed the results. Tests to determine the optimal interpupillary distance between the two cameras in the stereoscopic imaging rig were carried out. The results showed that the distance between the cameras had little impact on the accuracy of the system for distances up to 2.5 meters. However, the configuration in which the cameras were closest together were the most accurate when the target was closest to the cameras. The Hough transform for head detection was tested as a method for removing false positives. The upper region of every detection from the HOG algorithm was tested for the presence of a circle, which would indicate the presence of a head and confirm the object as a human. The results were mixed. Some false positives, such as a desk chair viewed in profile, were removed as false positives, but others, such as complex geometry on distant skyscrapers, were not. False positive removal was improved by introducing a filter into the tracker, which removed any object that was smaller than the expected dimensions of a human within 5 meters of the camera. Head detection was also shown to improve person tracking by fine-tuning the location of each person’s bounding box relative to the location of their head, which improved tracking in situations when the person was in an uncommon pose, such as a squat. Use of the head radius as an alternative method of depth estimation was also explored and compared against stereoscopic image reprojection. The results showed that while the radius of the head could be used to estimate depth, the method was not as reliable or accurate as stereoscopic imaging. However, it was observed that estimating distance by head radius was much less computationally expensive and increased the overall framerate of the software by 10 to 15 frames per second. 
Tests were carried out indoors with a single tracking subject to test the robot’s ability to track a single person in a variety of situations. It was shown that the robot was able to track a single person reliably in a straight line, zig-zag pattern, and when the person was making sharp changes in direction. The robot was also able to track the person in a variety of poses and reliably recover from tracking loss in a reasonable amount of time. It was discovered in testing that one of the cameras had a hardware defect preventing it from automatically adjusting exposure levels. Testing with a single working camera and fixed distance value revealed that the robot can also cope with changes in lighting and continue to track the human subject. Outdoor testing initially revealed hardware problems with the non-IR blocking lenses. IR interference from the sun washed out the image and color information, preventing the robot from properly exposing the image and identifying individual people. This was solved by introducing IR-blocking lenses. The following chapter concludes this thesis by reiterating the results and comparing them with those of previous generations of the person following robot. The next chapter also discusses potential improvements to the system for future work. CHAPTER 6 CONCLUSION This thesis investigated the viability of stereoscopic imaging together with the Histogram of Oriented Gradients and Hough Transform algorithms for person detection and following. The stereoscopic imaging system was built from two off the shelf digital camera boards and mounted to a Segway RMP chassis. The robot was tested indoors and outdoors in a variety of lighting conditions, on different types of surface, and with multiple people in the scene. The robot’s performance in each scenario was recorded by onboard video and any areas of deficient performance, tracking loss or failed person detection were identified and analyzed. 6.1 Summary of Results Testing of the stereoscopic imaging rig showed that even a stereoscopic system made from cheap off the shelf components can have good accuracy in estimating distance. The stereoscopic imaging rig was able to work in a variety of lighting conditions, from direct sunlight to pitch dark environments. It was also tested against estimating distance by head radius and proved to be more accurate, but computationally more expensive. Indoor testing demonstrated that the robot is able to detect and follow a specific person through an environment where small changes in lighting may occur. The robot was able to consistently detect and follow a single human target regardless of the color of their clothing or the color of the surrounding environment. The robot was also able to detect humans regardless of their pose or orientation relative to the robot. When multiple people were present, the robot generally followed the correct person, though would occasionally confuse the intended tracking target with a similarly colored person nearby. Finally, it was demonstrated that the robot is able to track humans with erratic motion, and quickly recover from tracking loss should the human subject move out of the robot’s field of view. However, indoor testing also revealed a few deficiencies in the system. Due to simple path planning routines, the robot’s software is not sophisticated enough to safely navigate around sharp corners. Though the robot was able to follow a human throughout the building, bumping into walls and intervention from the human were not uncommon. 
It was also discovered that one of the cameras has a hardware problem preventing it from automatically adjusting exposure levels to accommodate changes in lighting, which prevented the robot from smoothly navigating through areas of extreme shifts in illumination. Testing with a single working camera showed that if both cameras worked properly, the robot would have little trouble operating in the same areas. Initial outdoor testing had significant problems due to infrared radiation from the sun. When the robot was moved to direct sunlight from an indoor environment, the video stream was greatly over exposed and the robot was unable to correctly identify the tracking target. After the exposure settings and identification threshold were manually adjusted, the robot was able to track a person but would often get confused when multiple people entered the scene. Because the lenses did not filter IR, every colored object, including people, appeared purple in color to the robot, making matching by color impossible. After IR-blocking lenses were installed on the robot, it was able to detect and follow a human subject on a variety of surfaces and through extreme changes in lighting. Finally, testing at higher video resolutions showed that the robot’s software does not perform well when the framerate dips below approximately 10 frames per second. When the frame size was increased from 320x240 to 640x480, the framerate dropped to approximately 4 frames per second. Because of this, the robot frequently lost tracking of the human when they made erratic movements and had a very difficult time recovering from tracking loss. It was also noted that distance estimation worsens as framerate decreases due to a lack of hardware synchronization between the two cameras and the introduction of more motion blur at lower framerates. At the higher resolution, distance estimation was sometimes off by a few hundred centimeters. 6.2 Comparison to Prior Generations The sixth generation person following robot solved many of the problems that were inherent in prior generations. The first three generations relied solely on color matching for person detection and were affected by the color of the background environment and changes in lighting. The fourth generation robot used Hough transforms to detect the head and color matching in the region immediately below to identify a specific human, but was also affected by false positives from circular non-human object. The performance was prohibitively slow and the software could not be used in a realtime system. The fifth generation, which used the Microsoft Kinect depth sensor to image the scene in three dimensions, worked well indoors if the person was facing the robot, but failed to detect humans that were facing away from the robot or moving erratically through the scene. The robot had trouble on rough terrain and would frequently lose tracking due to motion blur. The sixth generation robot was demonstrated to be unaffected by changes in lighting, environmental color, or terrain type. The robot can work indoors or outdoors and has a very low false positive rate. However, although color matching in Lab space generally works, it has been shown to be the weakest link in the sixth generation robot’s software as with prior generation robots. 6.3 Future Work The sixth generation robot could immediately benefit from better hardware. 
It was shown in both indoor and outdoor testing that higher framerate and the ability to automatically adjust exposure greatly increase the robot’s ability to predict the tracking subject’s future movements. Simply replacing the onboard computer with a more powerful one should produce immediate positive results. Likewise, higher quality cameras with functional auto-exposure, or a professional stereoscopic imaging system like the ZED stereo camera from Stereo Labs should solve many of the robot’s imaging problems [17]. On the software side, the identification of a person within a group of individuals proved to be the most problematic. A simple solution to improve performance could be the addition of an annealing function for the identification threshold. This would allow the robot, when moved to an area of significantly different lighting, to identify the human before it as the tracking subject and gradually increase the threshold automatically until no other person can be confused with them. A more robust solution might be to replace the color matching identification module with a neural network solution. The neural network may be trained on various features of the human, such as facial features, hair color, clothing color and design, and so forth. Then, the neural network could be used to discriminate between the intended tracking subject and other people in the scene. The sixth generation robot uses a shortest-path algorithm to determine how to best follow the human target. This could be replaced with path planning software, which would allow it to detect obstacles in the depth data and choose the best route, though not necessarily the shortest route, to reach the target. The addition of path planning software would allow the robot to safely navigate around less sterile environments that include furniture, tight corners or narrow corridors. REFERENCES [1] S. Piérard, A. Lejeune, and M. Van Droogenbroeck. "A probabilistic pixel-based approach to detect humans in video streams" IEEE International Conference on Acoustics, Speech and Signal Processing, pages 921-924, 2011 [2] K. Mikolajczyk, C. Schmid, and A. Zisserman, "Human detection based on a probabilistic assembly of robust part detectors", The European Conference on Computer Vision, volume 3021/2004, pages 69-82, 2005 [3] M. Tarokh and P. Ferrari, “Person following by a mobile robot using computer vision and fuzzy logic”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2001. [4] M. Tarokh and J. Kuo, “Vision based person tracking and following in unstructured environments”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2004. [5] M. Tarokh, P. Merloti, and J. Duddy, “Vision-based robotic person following under light variations and difficult walking maneuvers”, 2008. [6] R. Shenoy and M. Tarokh, “Enhancing the autonomous robotic person detection and following using modified Hough transform”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2013. [7] R. Vu and M. Tarokh, “Investigating The Use Of Microsoft Kinect 3d Imaging For Robotic Person Following”, M.S. Thesis, Dept. of Computer Science, Department of Computer Science, San Diego State University, San Diego, CA, 2014. [8] J. MacCormick, “How does the Kinect work?” Internet: http://users.dickinson.edu/~jmac/selected-talks/kinect.pdf [9] “Obtaining Depth Information from Stereo Images”, Ensenso and IDS Imaging Systems, GmbH, 2012. [10] N. Dalal and B. 
Triggs, “Histograms of Oriented Gradients for Human Detection” in CVPR, pages 886-893, 2005
[11] H. Rhody, “Lecture 10: Hough Circle Transform”, Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, New York, Oct. 11, 2015.
[12] R. Fisher et al, “Hough Transform”, Internet: http://homepages.inf.ed.ac.uk/rbf/HIPR2/hough.htm, 2003
[13] A. Solberg, “Hough Transform”, University of Oslo, Internet: http://www.uio.no/studier/emner/matnat/ifi/INF4300/h09/undervisningsmateriale/hough09.pdf, Oct. 21, 2009
[14] A. Jain, Fundamentals of Digital Image Processing. New Jersey, United States of America: Prentice Hall. pp. 68, 71, 73
[15] C. McCormick, “Stereo Vision Tutorial”, Internet: http://mccormickml.com/2014/01/10/stereo-vision-tutorial-part-i/, Jan. 10, 2014
[16] T. Kupaev, “Walk down the times square in New York”, Internet: https://www.youtube.com/watch?v=ezyrSKgcyJw, Jul. 10, 2012
[17] “HD 720p Walking in Shibuya, Tokyo”, Internet: https://www.youtube.com/watch?v=xY5A8KwdlDk, Apr. 21, 2009
[18] “Jetson TK1”, Internet: http://elinux.org/Jetson_TK1
[19] V. Anuvakya and M. Tarokh, “Robotic Person Following in a Crowd Using Infrared and Vision Sensors”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2014.
[20] D. Ballard and C. Brown, “The Hough Method for Curve Detection” in Computer Vision, Rochester, New York, Prentice Hall, 1982, pp. 123-131.
[21] S. Blathur, “CIE-Lab Color Space”, Internet: http://sheriffblathur.blogspot.com/2013/07/cie-lab-color-space.html, Jul 18, 2013
[22] “Color Differences and Tolerances”, datacolor, Lawrenceville, New Jersey, Jan. 2013
[23] D. Brainard, “Color Appearance and Color Difference Specification” in The Science of Color, Santa Barbara, California, University of California, Santa Barbara, pp. 192-213
[24] D. Gavrila, “A Bayesian, Exemplar-Based Approach to Hierarchical Shape Matching” in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 29, No. 8, August 2007.
[25] P. Felzenszwalb et al, “Object Detection with Discriminatively Trained Part Based Model” in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 32, No. 9, September 2010.
[26] M. Barnich, S. Jodogne, and M. Droogenbroech, “Robust Analysis of Silhouettes by Morphological Size Distributions” in Lecture Notes on Computer Science, Volume 4179, pp. 734-745, 2006.
[27] N. Navab and C. Unger, “Stereo Vision I: Rectification and Disparity” Technische Universitat Munchen, Internet: http://campar.in.tum.de/twiki/pub/Chair/TeachingWs11Cv2/3D_CV2_WS_2011_Rectification_Disparity.pdf
[28] Z. Schuessler, “Delta E 101” Internet: https://zschuessler.github.io/DeltaE/learn/
[29] R. Rao, “Lecture 16: Stereo and 3D Vision” University of Washington, Internet: https://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect16.pdf, Mar. 6, 2009
[30] L. Iocchi, “Stereo Vision: Triangulation” Universita di Roma, Internet: http://www.dis.uniroma1.it/~iocchi/stereo/triang.html, Apr. 6, 1998
[31] A. Barry, “Wide Angle Lens Stereo Calibration with OpenCV” Internet: http://abarry.org/wide-angle-lens-stereo-calibration-with-opencv/
[32] N. Ogale, “A Survey of Techniques for Human Detection from Video” University of Maryland, College Park, MD
[33] A. Elgammal, “CS 534: Computer Vision Camera Calibration” Rutgers University, Internet: http://www.cs.rutgers.edu/~elgammal/classes/cs534/lectures/Calibration.pdf
[34] V. Prisacariu and I.
Reid, “fastHOG – A Real-Time GPU Implementation of HOG”, University of Oxford, Oxford, UK, Jul. 14, 2009