
ROBOTIC PERSON FOLLOWING USING STEREO DEPTH SENSING
AND PERSON RECOGNITION
_______________
A Thesis
Presented to the
Faculty of
San Diego State University
_______________
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Computer Science
_______________
by
Michal Pasamonik
Fall 2016
SAN DIEGO STATE UNIVERSITY
The Undersigned Faculty Committee Approves the
Thesis of Michal Pasamonik:
Robotic Person Following Using Stereo Depth Sensing and Person Recognition
_____________________________________________
Mahmoud Tarokh, Chair
Department of Computer Science
_____________________________________________
Xiaobai Liu
Department of Computer Science
_____________________________________________
Ashkan Ashrafi
Department of Electrical and Computer Engineering
______________________________
Approval Date
Copyright © 2016
by
Michal Pasamonik
All Rights Reserved
DEDICATION
I dedicate this thesis to my two children, Alice and Kazuma.
ABSTRACT OF THE THESIS
Robotic Person Following Using Stereo Depth Sensing and Person
Recognition
by
Michal Pasamonik
Master of Science in Computer Science
San Diego State University, 2016
This thesis tested the viability and effectiveness of using stereoscopic imaging with
inexpensive components for person detection and following. Stereoscopic cameras
can produce a depth map, much like active imaging systems such as the Microsoft Kinect
can, but are not subject to the same environmental limitations and may be configured to work
in all types of lighting. A stereoscopic imaging rig was built from two wide-angle USB
cameras mounted to an aluminum channel and driven by a low-power compute platform. The
compute platform, which was mounted to the top of the robot, controlled the imaging
hardware, detected and tracked human targets using a combination of histograms of oriented
gradients, Hough transforms, and color and depth tracking, and sent motion commands to a
secondary computer, which was mounted on the bottom of the robot. The secondary
computer controlled the robot chassis and converted the primary computer's motion
commands into actual motion.
The Histogram of Oriented Gradients algorithm was used as the primary means of
person detection due to its low false positive rate, invariance to color and lighting, and ability
to detect humans in various poses. As the Histogram of Oriented Gradients produces a
bounding box with predictable margins around the human subject, the upper one-fourth of
each potential human detection was further processed with a circle Hough transform to detect
the head position, which was used to fine-tune the person's location and aid in removing false
positives. Finally, the locations of each subject’s shirt and pants were identified and a feature
vector describing their coloration was constructed. The feature vector was used to identify
individuals from the group, chiefly the primary tracking subject. The system was optimized
by tracking the trajectory of individuals through time and limiting the search space to their
expected location.
The person following robot’s software was written in C++ and ran on an embedded
version of Ubuntu Linux, though the software was designed to be platform independent. The
robot was tested in indoor and outdoor environments, under varying lighting conditions, and
with a varying number of people in the scene. Testing revealed that even though the robot
could be made to work in both indoor and outdoor environments, it benefited from a lens
change when moving from one environment to the other. Non-IR-blocking lenses worked
best in low-light indoor environments, and IR-blocking lenses worked best outdoors.
Differences in terrain type proved to have no effect on performance. The results showed that
stereoscopic imaging can be an inexpensive, robust, and effective solution for person following.
TABLE OF CONTENTS
PAGE
ABSTRACT ............................................................................................................................. vi
LIST OF TABLES ................................................................................................................... ix
LIST OF FIGURES ...................................................................................................................x
ACKNOWLEDGEMENTS .................................................................................................... xii
CHAPTER
1  INTRODUCTION .........................................................................................................1
1.1 Person Detection And Following ...........................................................................1
1.2 Challenges Of Person Detection And Following ...................................................2
1.3 Contributions Of This Thesis .................................................................................3
1.4 Summary Of The Following Chapters ...................................................................4
2  BACKGROUND ...........................................................................................................5
2.1 First Generation .....................................................................................................5
2.2 Second Generation .................................................................................................7
2.3 Third Generation ....................................................................................................8
2.4 Fourth Generation...................................................................................................9
2.5 Fifth Generation ...................................................................................................10
2.6 Further Fifth Generation Experiments .................................................................11
3  METHOD OF PERSON DETECTION.......................................................................13
3.1 Stereo Block Matching ........................................................................................13
3.2 Histogram Of Oriented Gradients ........................................................................16
3.3 Hough Transform For Head Detection .................................................................19
3.4 Color Matching.....................................................................................................21
3.5 Summary ..............................................................................................................22
4  SIXTH GENERATION HARDWARE AND SOFTWARE .......................................24
4.1 Stereoscopic Imaging System ..............................................................................24
4.2 Compute Hardware and Programming Environment ...........................................30
4.3 Initial Person Detection And Target Tracking .....................................................31
4.4 Location Refinement And False Positive Removal .............................................34
4.5 Person Identification.............................................................................................34
4.6 Speed And Direction Calculations .......................................................................38
4.7 Client-Server Communication ..............................................................................41
4.8 Summary ..............................................................................................................43
5  TESTS AND RESULTS ..............................................................................................45
5.1 Stereoscopic Imaging Tests .................................................................................45
5.2 Hough Tests And Alternate Ranging ...................................................................46
5.3 Single Person Indoor Performance .......................................................................53
5.4 Indoor Tracking With Fixed Depth ......................................................................55
5.5 Outdoor Tests .......................................................................................................58
5.6 Testing With IR-Blocking Lenses And Multiple Subjects ...................................61
5.7 Summary ..............................................................................................................63
6  CONCLUSION ............................................................................................................65
6.1 Summary Of Results ............................................................................................65
6.2 Comparison To Prior Generations ........................................................................66
6.3 Future Work .........................................................................................................67
REFERENCES ........................................................................................................................68
LIST OF TABLES
PAGE
Table 4.1. Comparison of Interpupillary Distances and their Accuracy ..................................28
Table 4.2. Priority Lost Logic for Selecting a Tracking Target...............................................37
Table 4.3. List of Valid Commands Supported by the Robot's Server ....................................41
Table 5.1. Comparison of Interpupillary Distances and their Accuracy ..................................45
Table 5.2. Head-Radius Distance Estimation Results .............................................................50
Table 5.3. Stereo Block Matching Distance Estimation Results .............................................50
Table 5.4. Head-Radius Distance Measurement Errors ...........................................................51
Table 5.5. Head-Radius Distance Measurement Ranges .........................................................52
LIST OF FIGURES
PAGE
Figure 3.1. Stereo Block Matching Sliding Window Example ...............................................14
Figure 3.2. Triangulation of Stereoscopic Images to Calculate Depth ....................................15
Figure 3.3. The Construction of an HOG Feature Vector........................................................18
Figure 3.4. A Visualization of the Lab Color Space ................................................................21
Figure 4.1. Pre-rectified Stereo Images Used for Initial Testing .............................................26
Figure 4.2. Disparity Maps Created from the Pre-rectified Images .........................................26
Figure 4.3. Screenshot of Camera Rectification Software .......................................................27
Figure 4.4. Stereoscopic Imaging Rig Made from Two USB Camera Boards ........................29
Figure 4.5. The NVidia Jetson TK1 Development Board Selected for Image Processing ......31
Figure 4.6. Program Flow of the Sixth Generation Robot's Software .....................................32
Figure 4.7. Onboard Video Stream Showing Detection and Tracking Overlays .....................36
Figure 4.8. Unwarping a Rectified Image to Its Original Coordinate Space ...........................40
Figure 4.9. The Companion Android App Used to Take Direct Control of the Robot ...........43
Figure 5.1. An Office Chair in Profile Mistaken for a Human by the HOG Algorithm ..........47
Figure 5.2. A Bounding Box Adjusted to Fit a Squatting Figure via Head Detection .............47
Figure 5.3. A Building Identified as Human by the HOG Algorithm .....................................49
Figure 5.4. Distant Advertisements Misidentified as Human by the HOG and Hough
       Algorithms ................................................................................................................49
Figure 5.5. Over-Exposed Video Stream Due to Hardware Malfunction ................................55
Figure 5.6. A Person Wearing a Shirt Colored Similarly to the Environment ........................56
Figure 5.7. Onboard Video with Auto-Exposure Working Properly .......................................57
Figure 5.8. Over-Exposed Video Stream Due to Intense Sunlight ..........................................59
Figure 5.9. Infrared Radiation Rendering Identification by Color Matching Impossible .......60
Figure 5.10. Shaded Areas Appearing Pitch Black Due to Intense Light ................................60
Figure 5.11. Individual Subjects Correctly Distinguished Outdoors with IR-Blocking
       Lenses .......................................................................................................................62
Figure 5.12. Onboard Footage of Correct Target Selection in an Indoor Environment ..........63
ACKNOWLEDGEMENTS
I am forever grateful to the many friends, family members, and colleagues whose
support, advice, and contributions made this thesis possible.
Foremost, I would like to thank my advisor and committee chair, Dr. Mahmoud
Tarokh, for his guidance and support. His expert knowledge on the subject matter,
suggestions and feedback were invaluable throughout the research and development of this
thesis and motivated me to work beyond my abilities. I would also like to thank Dr. Xiaobai
Liu and Dr. Ashkan Ashrafi for taking time from their busy schedules to serve on my thesis
committee.
Many months of debugging were saved thanks to Shalom Cohen, a thesis student. I
would like to give my appreciation to Shalom for meeting with me and showing me how to
use the Segway RMP chassis, along with its various quirks and oddities. His examples and
explanations were integral to combining my software with the Segway hardware in a timely
manner.
Finally, I would like to express my utmost gratitude and appreciation to my family. I
am sincerely thankful to my wife, Rina, for her continuous encouragement and moral
support, as well as for giving me the time and space necessary to develop the software,
hardware and thesis paper. And lastly, I would like to thank my brother, Paul, who was
always cheerful and willing to travel to the lab on weekends to aid in testing.
This project has truly been a collaborative effort and I am humbled by the kindness,
encouragement and support of those who contributed.
CHAPTER 1
INTRODUCTION
Broadly, person following in robotics describes the task in which a robot must detect
humans in the environment, recognize an individual human, and follow that human target as
they move through the environment. While person following is trivial for a human, many
factors, such as changing lighting conditions, complex environments, and variations in human
appearance, make person following a challenging task for robots. This chapter discusses
person detection, person following, and the challenges both present in the context of robotics.
1.1 Person Detection and Following
In order for a robot to follow a person, the person must first be detected in the given
scene. The scene must be imaged in some way by sensors attached to the robot, whether
through passive imaging using a standard digital camera or thermal camera, active imaging
using an RGB-D camera or laser scanner, or by some other means. Then, the sensor data
must be processed to separate the humans in the scene from the background and other
foreground objects. Once the humans have been extracted, the robot must select a single
human to follow and adjust its speed and direction accordingly.
Various techniques exist for person detection, each with advantages and drawbacks.
Motion-based person detection typically involves frame subtraction, in which the contents of
the current frame are subtracted from the contents of the previous frame, thus eliminating
static areas and revealing the silhouettes of objects in motion [26]. Since frame subtraction is
computationally inexpensive, such person detection techniques tend to be very efficient [32]. Pierard
et al. used this technique to great effect by combining frame subtraction with a shape-filling
and pixel-voting heuristic, in which each silhouette is filled with large rectangles that act as a
signature for the type of object the silhouette represents, and were able to achieve a high
degree of accuracy in detecting humans [1].
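To make the frame subtraction idea concrete, the following is a minimal illustrative sketch in C++ using OpenCV (an assumption made here for convenience, and not the detection method ultimately adopted in this thesis); the camera index, threshold, and minimum blob area are arbitrary placeholder values.

// Illustrative frame-subtraction motion detector (OpenCV assumed; placeholder values).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::VideoCapture cap(0);                      // hypothetical camera index
    cv::Mat frame, gray, prevGray, diff, mask;

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (prevGray.empty()) { prevGray = gray.clone(); continue; }

        // Subtract the previous frame from the current one; static regions cancel out,
        // leaving the silhouettes of moving objects.
        cv::absdiff(gray, prevGray, diff);
        cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);

        // Group the remaining pixels into connected blobs and box the large ones.
        std::vector<std::vector<cv::Point> > contours;
        cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (size_t i = 0; i < contours.size(); ++i)
            if (cv::contourArea(contours[i]) > 500.0)
                cv::rectangle(frame, cv::boundingRect(contours[i]), cv::Scalar(0, 0, 255), 2);

        prevGray = gray.clone();
        cv::imshow("motion", frame);
        if (cv::waitKey(1) == 27) break;          // ESC quits
    }
    return 0;
}

As noted above, such a detector only responds to motion, which is precisely why it breaks down once the camera itself is moving, a limitation discussed in Section 1.2.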
Parts-based human detection algorithms work by detecting individual body parts
whose proximity and placement relative to one another act as a heuristic for detecting the
presence of a human [25]. Such algorithms have the advantage of being able to detect
humans that are not in motion, who may be in a variety of poses or states of partial occlusion.
Mikolajczyk et al. demonstrated the effectiveness of one such algorithm by modeling the
human body as a collection of seven body parts. Each individual part was detected using a
Scale-Invariant Feature Transform (SIFT) algorithm and a cascading classifier. The
cascading classifier uses the scale and location of each body part as a confidence measure to
determine if there exist any neighboring body parts [2]. Once a certain number of body parts
have been detected, the region is marked as containing a human figure [2]. Experiments
using the MIT-CMU test set revealed that the detector performed comparably to or better
than other state-of-the-art detectors at the time of writing [2].
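As a generic illustration of the parts-based idea only (not a reproduction of Mikolajczyk et al.'s actual SIFT and cascading-classifier detector), the sketch below accepts a region as containing a person once enough independently detected body parts lie near their expected positions relative to the head; the part labels, expected offsets, and thresholds are hypothetical values chosen for the example.

// Hypothetical parts-voting heuristic: count body-part detections that agree with the
// expected human layout and accept the region once enough parts support it.
#include <cmath>
#include <vector>

struct PartDetection {
    int partId;        // e.g. 0 = head, 1 = torso, ..., 6 = lower leg (hypothetical labels)
    float x, y;        // detected location in image coordinates
    float confidence;  // score reported by the per-part classifier
};

// Expected vertical offset of each part below the head, in units of head height (assumed).
static const float kExpectedOffsetY[7] = {0.0f, 1.5f, 1.2f, 1.2f, 3.0f, 3.0f, 4.5f};

bool regionContainsPerson(const std::vector<PartDetection>& parts,
                          float headX, float headY, float headSize,
                          int minParts = 4, float tolerance = 0.75f) {
    int supporting = 0;
    for (size_t i = 0; i < parts.size(); ++i) {
        const PartDetection& p = parts[i];
        if (p.partId < 0 || p.partId >= 7 || p.confidence < 0.5f) continue;  // skip weak/invalid
        float dx = std::fabs(p.x - headX) / headSize;
        float dy = std::fabs(p.y - (headY + kExpectedOffsetY[p.partId] * headSize)) / headSize;
        if (dx < tolerance && dy < tolerance) ++supporting;  // part is roughly where expected
    }
    return supporting >= minParts;  // enough agreeing parts: treat the region as a person
}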
Shape and color recognition algorithms, such as the Chamfer system and color
matching, have been used for human detection in video with varying degrees of success
[24][32]. Color- and shape-based human detection algorithms, which use the shape and color
of the human target’s head and clothing for detection, have also been used in previous
generations of the person following robot, which will be discussed in detail in the next
chapter.
1.2 Challenges of Person Detection and Following
Though humans can recognize other humans with relative ease, the task is quite
difficult for a robot. Problems in human detection primarily stem from three areas:
environment-related, human-related, and motion-related factors.
If a person following robot is to be used practically, then it must be able to cope with
complex environments. Environments can be plain and orderly, such as an empty hallway, or
complex and organic, such as an area containing heavy vegetation. Such environments may
prohibit algorithms that rely on shape detection and color matching from achieving their goal
of detecting humans, as contours of a complex backdrop may confuse shape detection
routines and plain, featureless environments may render humans in similarly colored clothing
undetectable. Furthermore, the lighting of the environment must be taken into consideration.
As the illumination in the environment varies, so too does the appearance of colors. As the
human target moves from one location in the environment to another, their coloration will
appear to change, presenting yet another problem for color matching algorithms.
The robot's own mobility may prevent person detection from working correctly.
When the robot is in motion, both the foreground and the background appear to change,
rendering frame-subtraction-based person detection techniques ineffective. Likewise, as the
robot makes abrupt turns, drives over uneven terrain, or introduces significant judder or
motion blur to the video stream, the human target may become difficult or impossible to
detect. Rough terrain proved to be a major problem for previous generations of the person
following robot.
Finally, humans themselves tend to present a significant challenge to human
detection. Human appearance is highly variable, as humans may be clothed in a wide variety
of colors and patterns, may appear in a wide range of poses and may be in partial states of
occlusion. Parts-based algorithms are able to function despite imperfect lighting and partial
occlusion of the human target, but are far too computationally expensive to be used in a real-time system, as is evidenced by the work of Mikolajczyk et al., whose algorithm required
approximately ten seconds to process a single 640x480 image [2].
Ideally, a person following robot must be able to detect humans despite their
coloration, pose or completeness in the frame while also being power efficient and
maintaining a high framerate. The robot must also be able to cope with rough terrain induced
video instability and be able to recover from tracking loss.
1.3 Contributions of this Thesis
The work presented in this thesis tests the viability of a robot vision system built with
off-the-shelf components that is cheap, power efficient, and able to function with an
acceptable level of accuracy in sub-optimal lighting and environmental conditions. This
thesis attempts to overcome or significantly mitigate all of the problems in person detection
and following through the use of a stereoscopic imaging rig built from two ordinary CMOS
camera boards and a small, power efficient computer (NVidia TK1).
A combination of algorithms is used to efficiently detect and track humans in three
dimensional space. Humans are initially detected using the Histogram of Oriented Gradients
algorithm, which is a shape-based detection method that is invariant to lighting changes. A
depth map of the scene is created using a standard block matching algorithm. Then, the
person's location is fine-tuned by searching for the location of the head using a Hough circle
transform, which is also used to filter out false positives, and combining the person's two
dimensional location with depth map data. Finally, additional processing steps are taken to
further reduce false positives before motor commands are sent to the underlying robot
chassis.
This thesis also includes many optimizations to allow the software to run on a mobile
processor with an acceptable level of accuracy and framerate. Results show that the software
is able to detect humans even in severely adverse lighting conditions, in indoor and outdoor
environments, and irrespective of human-related factors.
1.4 Summary of the Following Chapters
Chapter 2 will contain a description and analysis of previous generations of the
person following robot, including the approaches used, the problems solved, and the problems
that persisted, provided as background information. Chapter 3 will discuss the software
architecture and implementation details of the person detection algorithm proposed in this
thesis. Chapter 4 will cover the hardware implementation details of the person following
robot. Chapter 5 presents the testing that was performed and the results of those tests. The
thesis will conclude with Chapter 6, where achievements are summarized and potential
solutions for remaining problems are presented as future work.
CHAPTER 2
BACKGROUND
Five generations of person following robot have been developed at San Diego State
University’s Intelligent Machines and Systems Lab, each improving on the previous
generation’s achievements. Each generation will be described in detail as background
information for the reader to gain an understanding of the evolution of the person following
robot project.
2.1 First Generation
The first generation of person following robot, developed by P. Ferrari and Dr.
Tarokh, consisted of a Pioneer 2DX mobility platform, a remote laptop computer, a
transmitter and receiver, and a Polar Industries GFP-5005 CCD wireless camera [3]. Initial
software was developed, which utilized thresholding and region growing as a means of
person detection, and acted as the means for steering the robot.
A basic description of the program flow is as follows: The CCD camera, which was
mounted on the top of the Pioneer 2DX robot chassis, captured and transmitted images
wirelessly to a remote laptop. The laptop processed each image by segmenting the tracking
subject from the background, and then calculating a center of mass for the segmented region.
The size of the segmented region was used as an estimate of the distance to the tracking
subject. The combination of the center of mass and size of the segmented region was used to
determine in which direction the robot should travel and how fast. Finally, the newly
computed motor commands were then transmitted wirelessly back to the mobility platform,
which put the robot into motion [3].
As each image was received, the tracking subject was segmented from the
background using image thresholding. Each pixel in the image data is compared against a
predetermined range of values that act as a threshold. If the pixel under examination falls
outside of the threshold, then it is discarded, leaving only pixels that fall within a specific
range of values. In the first generation robot, the tracking subject’s shirt color range was used
as the threshold value, resulting in the segmentation of the person’s shirt.
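The first generation predates the software described later in this thesis, but the underlying idea can be illustrated with a brief sketch using OpenCV's inRange function; the threshold bounds below are placeholder values rather than the first generation's trained shirt color range.

#include <opencv2/opencv.hpp>

// Illustrative color thresholding: keep only pixels whose values fall inside a
// trained range, producing a binary mask of the subject's shirt. The bounds
// here are placeholders, not the values learned by the first generation robot.
cv::Mat segmentShirt(const cv::Mat& frameBGR)
{
    const cv::Scalar lower(0, 0, 100);    // assumed lower bound (B, G, R)
    const cv::Scalar upper(80, 80, 255);  // assumed upper bound (B, G, R)
    cv::Mat mask;
    cv::inRange(frameBGR, lower, upper, mask);  // 255 where in range, 0 elsewhere
    return mask;
}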
To further isolate the subject’s shirt, region growing was used to examine the
segmented region’s eccentricity, compactness and circularity [3]. In the event that several
regions were segmented in the initial thresholding step, the compactness, eccentricity and
circularity measures would be compared against those of the training data to choose a region
that most likely represents the tracking subject. Once a segmented region was selected, its
size relative to the size of the frame and center of mass were calculated.
The size and center of mass values were used with a fuzzy controller to determine the
robot's motion. The horizontal location of the center of mass within the frame was used to
determine the direction that the robot should rotate. If the center of mass appeared generally
to the left, then the robot would rotate counter-clockwise, and would rotate clockwise if the person
appeared generally to the right. The vertical position of the center of mass and size of the
segmented region were used as distance estimators and controlled the robot’s forward
motion. If the center of mass was high in the frame or the segmented region was small, then
the person was expected to be further away and the robot would speed up. Conversely, if the
segmented region was large, then the person was expected to be close and the robot would
slow down. By employing these simple rules, the fuzzy controller attempted to keep the
segmented region in the center of the frame [3].
The authors initially attempted to track the segmented region from frame to frame
using a frame subtraction technique, where the image data of the current frame was
subtracted from that of the previous frame. The idea was to help isolate the tracking subject,
which was expected to be in motion, by eliminating the static background. However, the
authors found that this technique was unusable when the robot itself was in motion, as the
motion of the robot would produce many differences in the resulting image that were
unrelated to the tracking subject.
The final version of the software also had several important limitations. Though the
robot was shown to be able to track a target in ideal conditions, the color thresholding
technique had difficulty isolating the tracking subject when the background was of a similar
color or when multiple similarly dressed persons were present in the frame. The hardware
was also limited in its range by the wireless modems and the framerate of the software was
bottlenecked by the wireless transmission speed of the camera. Finally, the fixed camera
prevented the robot from tracking a subject that was making quick directional changes, as the
subject would exit the camera’s field of view.
2.2 Second Generation
The second generation robot built by J. Kuo and Dr. Tarokh improved on the first
generation robot largely through hardware changes [4]. In order to improve the robot's
performance in tracking fast moving subjects, a new color CCD camera system was added to
the robot that supported pan-tilt motion. The robot chassis itself was replaced with an all-terrain version from the same vendor to improve performance on rough terrain. The wireless
transmitters and receivers were also removed to eliminate network-induced lag and increase
the range of the robot.
The new hardware additions necessitated changes to the software. With the
introduction of a color camera, the robot was now able to take advantage of detecting objects
by color instead of just grayscale level. The authors elected to use the HSI color model,
which describes color by hue, saturation and intensity, with the intensity component
discarded when performing color matching in order to compensate for lighting changes. Like
the previous generation, the image data was filtered by threshold values to isolate the
tracking subject. However, the second generation robot used value ranges for hue and
saturation as thresholds that were derived from an initial training phase.
Once the subject was segmented, the size and center of mass were calculated as with
the first generation robot. However, the new pan-tilt hardware required that additional fuzzy
controllers be created to rotate the camera. Since the pan-tilt hardware could react faster than
the robot chassis, two extra fuzzy sets were created to track the target horizontally and
vertically in the frame. If the target appeared off center vertically or horizontally, the camera
would pan and tilt appropriately to recenter the target. The second set of fuzzy controllers,
which controlled the robot’s motion, used a five frame history of the pan-tilt motion to
determine which direction the robot should travel and how fast.
It was shown in testing that while the pan-tilt apparatus and wired architecture
improved the range and tracking of fast moving targets, the system still suffered from
tracking loss when changes in lighting occurred. Outdoor testing was performed during
twilight as bright sunlight created too great of a contrast between sunlit and shaded areas.
Environmental reflections also altered the coloration of clothing, reducing the robot’s ability
to track by color. Finally, the all-terrain robot chassis introduced new problems as it was
unable to rotate smoothly, and the jittery motion caused motion blur, which further reduced
the robot’s ability to track a human subject [4].
2.3 Third Generation
The third generation robot, built by P. Merloti and Dr. Tarokh, improved on the work
of the second generation with several hardware and software changes [5]. On the hardware
side, a new two-wheeled robotic platform and camera system were used. The robotic
platform was changed from a four wheel all-terrain vehicle to a self-balancing Segway RMP,
which would allow the robot to navigate tight corners and turn in place. The camera system
was also changed to one that had a slightly higher resolution at 640x480 and allowed manual
control of the shutter and gain.
As the new camera hardware allowed manual control of shutter and gain, the
developers attempted to compensate for lighting changes by manually adjusting the exposure
and gain settings relative to the region of interest containing the subject’s shirt, rather than
the overall image [5]. The software was also changed to track the subject’s centroid from
frame to frame. Thus, when multiple subjects were in the frame, the software attempted to
find the correct subject by comparing the mass and location of centroids, and the region’s
color to that of the training data and previously detected regions.
The software controlling the motion of the robot was also changed. First, the fuzzy
controllers for the pan-tilt hardware, speed and steering were changed to PID controllers,
which proved to work better at tracking the subject [5]. When the pan-tilt could no longer
track the subject, then the robot would turn in the direction of the camera to recenter the
subject. In the event of tracking loss, software was added to move the camera system in a
sinusoidal fashion in an attempt to rediscover the subject.
Though the hardware and software changes improved the overall performance of the
system in areas where the previous generations were deficient, such as tracking in gradual
lighting changes, recovering from tracking loss, and navigation in tight spaces, testing
revealed that tracking performance was still inadequate when the subject’s clothing color
matched that of the background [5]. Finding a solution to this single problem would be the
inspiration for the major shift in strategy found in the fourth generation robot.
2.4 Fourth Generation
The fourth generation robot, developed by R. Shenoy and Dr. Tarokh, attempted to
solve the background coloration problem common to the first three generations of person
following robot by shape detection rather than color tracking [6]. Even though the first three
generations used various other metrics, such as eccentricity, circularity, compactness, and
centroid location to aid in tracking, detection was primarily done by color matching to a
trained sample. Thus, it is only natural that if the subject and background were similarly
colored, the robot would have trouble detecting the subject.
In order to get around this problem, the authors changed detection strategies from
color matching to shape detection. Instead of looking for regions similarly colored to training
data, the new software attempted to locate the subject’s head by using a Hough Circle
transform to detect all circular objects inside the image [6]. Initially, the software would first
process the image with a Canny edge detector to remove all color information and only keep
the edge information. Then, the edge detected image would be processed with a Hough
transform, which attempted to discover all circular shapes in the image at three preselected
radii.
Initial testing showed that even though the Hough transform was generally good at
detecting the head, it also produced many false positives. To compensate for false positives, a
color matching based torso detection system was added that would look directly under each
detected circle for a shirt similarly colored to that of a training sample [6]. With the addition
of the torso detection system, most false positives were removed and the software was able to
track subjects under most circumstances. Testing performed on video data from previous
generations showed that the fourth generation could track subjects more accurately with
fewer false positives.
Even though the fourth generation solved the major common problem of the first
three generations and showed much promise in tracking ability, it was never tested with
actual robotics hardware [6]. The software simply required too much processing time to be
practical, requiring several seconds of processing time per frame. Because of this, all of the
testing of the software was performed on prerecorded videos from previous generation
robots.
2.5 Fifth Generation
The fifth generation robot, developed by R. Vu and Dr. Tarokh, utilized the Microsoft
Kinect to detect and track humans in three dimensions [7]. Since the Microsoft Kinect
projects infrared light as a means of imaging the environment, human detection should
not be affected by lighting changes.
The Microsoft Kinect is a Structured Light system that works by projecting a
proprietary 640x480 speckle pattern onto the environment. The speckle pattern is created in
such a way that no two regions of the pattern are similar. An infrared sensor on the Kinect
device receives the reflected speckle pattern, from which it calculates a depth map in
hardware. Finally, the depth map is analyzed for human figures using a machine learning
algorithm [8].
The Kinect was mounted to the top of an iRobot Create robotic chassis, along with a
laptop for processing [7]. This robot was tested in varying light conditions, ranging from a
pitch black environment to a bright day outdoors. The robot was also tested on different
surfaces to test how well the Kinect performs with terrain induced judder.
The results of lighting experiments showed that the robot was able to detect and track
human subjects in a pitch dark environment, but performance began to suffer in a 1200 to
2300 lux environment typical of early morning sunlight. The volume in which humans can be
detected was measured to be between 500mm to 4000mm in optimal lighting conditions. As
the outdoor illumination became brighter, the Kinect failed to produce a depth map in midday shaded environments, measured at 17000 lux, and prevented the software from extracting
human figures. The authors concluded that infrared from the sun was interfering with the
Kinect’s own infrared projector and sensor system.
The robot was tested on five different surfaces: laminate, tile, carpet, sidewalk and
asphalt. Tests were performed on each surface type to determine how well the robot followed
the subject in a straight line, from side to side, and in a zigzag pattern. The results showed
that the robot performed best on carpet and worst on asphalt, with the other three surface
types achieving mediocre results in all tests. The hard surfaces would induce motion judder,
which caused loss of tracking. It was thus determined that this generation of robot performed
worse than previous generations on non-optimal surface types.
It was also observed that tracking suffered greatly when the subject was facing away
from the camera, irrespective of terrain or lighting conditions, as the Kinect was designed to
detect people facing toward it.
The authors concluded that the iRobot Create was a poor choice of chassis, as it was
too lightweight and produced instability in motion. The Kinect was also found to be
ineffective in outdoor use, when tracking subjects turned away from the camera, and when
the Kinect itself was put in motion.
2.6 Further Fifth Generation Experiments
Using the same fifth generation chassis, V. Anuvakya and Dr. Tarokh performed
further experiments using a different strategy for detection, tracking and tracking loss
recovery. The authors attempted to recognize a single person among a group of people by
training the software to remember biometric information [19].
The software ran in one of three possible modes. The first mode, dubbed Learning
Mode, was the initial mode that the software was in and was responsible for detecting and
learning the features of the tracking target [19]. The user would stand in front of the robot
with both hands at an upward 90 degree angle. Pose recognition algorithms would recognize
the person as the tracking target, and begin to learn their features. The person’s clothing color
was memorized, along with their joint positions, relative bone lengths and overall height
[19]. The Kinect SDK also returns a unique index number for every person present in the
scene, which is consistent from frame to frame. The tracking target’s index number is stored
along with their biometric and color information. Lastly, a value from a hardware light
sensitive resistor was also memorized, which is later used to account for changes in lighting
[19]. Once the tracking target is identified, the software switches to a Tracking Mode.
As the index number is consistent from frame to frame, tracking was largely
performed by the Kinect libraries. The Kinect libraries report the index number and position
of each person in the scene, and thus the software would choose the correct person to track
by comparing against the memorized index number from the initial learning phase [19]. Once
the correct tracking target was identified, their center of mass was calculated and the offset of
the center of mass from the center of the image frame was used to determine the robot’s
direction of travel. Likewise, the person’s distance offset from a fixed value was used to
determine the robot’s forward speed. If there was a change in lighting condition, as detected
by the light resistor, then new lighting and color information was learned once the tracking
subject was determined to be facing toward the camera. The person’s orientation to the
camera was determined by the distance between their wrist joints [19].
Tracking by index number was unreliable in the event of tracking loss. If tracking
was lost, such as in the event that the tracking subject moved out of the frame, then the
Kinect SDK would assign the person a new index number and thus the person would have to
be identified by other means [19]. To reacquire the correct tracking target, the heights of all
people in the scene were compared against that of the initially learned biometric data. The
software would then compare each person against initially learned color data and bone
lengths to further distinguish between people of the same height. A total weighted sum of
differences for these three features was used to determine which person to follow. Once a
person was identified as the tracking target, the software switched back to Tracking Mode
[19].
Results from experimentation showed that matching color histograms was an accurate
way of identifying people by color regardless of changes in lighting [19]. The results also
showed that the robot’s Recognition Mode was able to correctly determine the correct
tracking subject 83% of the time after tracking was lost [19]. However, the robot experienced
tracking problems and was unable to locate the person when their clothing color was similar
to the background or when the person was standing next to a large object, in sunlight or in
flickering light. The software also failed to correctly track the subject when they were
walking at a constant speed next to another person, or when they made rapid changes in
speed [19].
CHAPTER 3
METHOD OF PERSON DETECTION
This thesis attempts to build on the strengths and address the weaknesses of prior person
following robot projects by taking a relatively novel approach to imaging and person
detection. The robot described in this thesis uses a stereoscopic camera to passively image
the environment and generate a depth map. The image data is used for person detection
through the Histogram of Oriented Gradients algorithm. The upper third of every detected
person is further processed with a Hough circle transform to detect heads and remove false
positives. Finally, the tracking subject is identified by the color of their clothing, whose
location can be estimated with a high degree of accuracy from the prior two processing steps.
The person’s location in 2D space and depth data are used to determine the robot’s forward
speed and direction of travel.
3.1 Stereo Block Matching
Given two images of the same scene taken from two cameras at slightly different
positions, it is possible to recover depth information from the two dimensional images by
examining their differences. Intuitively, one can understand the concepts behind this
algorithm simply by looking at the way humans see their world in three dimensions. As an
object moves closer to one’s face, the object will appear differently to the left eye than to the
right. Likewise, if an object is distant, then the left and right eyes will both see the object in
approximately the same way. By analyzing the differences between what the left and right
eye sees, the brain can triangulate an approximate depth to any object in the person’s field of
view. This too can be mimicked in software, and by using information about the location,
lens focal length and geometry of the cameras, the depth to any region in the image can be
triangulated [9][27][30].
Just as humans have two eyes, the Stereo Block Matching algorithm operates on two
grayscale images taken from two cameras that are a slight distance apart. For each pixel in
the left image, a horizontal search is conducted for the same pixel in the right image. Since
single pixels are usually nondescript, the block matching algorithm, as the name implies,
operates on blocks of pixels, usually 5x5 or 7x7 [15]. For each pixel in the left image, the
block of pixels surrounding it is compared against a horizontally sliding window of the same
size on the right image. For each window position in the right image, a sum of absolute
differences is computed to compare how closely the pixels in the left image’s block compare
to the pixels in the right image’s window, per Equation 3.1 [29].
\text{sum of absolute differences} = \sum_{y=0}^{height} \sum_{x=0}^{width} \left| Left_{x,y} - Right_{x,y} \right| \qquad (3.1)
As one can see, Equation 3.1 operates on a block of pixels of dimension width by
height. The gray level for every pixel in the left image's block, given by Left_{x,y}, is
subtracted from every corresponding pixel at the same x and y location in the right image's
sliding window, given by Right_{x,y}. The sum of the absolute value of differences represents
the overall difference between the left and right blocks. Naturally, the closest match will be
the window that produced the lowest sum of absolute differences [15]. Once the closest
match has been identified in the right image, the horizontal offset, or disparity, is written into
a disparity map [9].
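To make Equation 3.1 concrete, the following C++ sketch computes the sum of absolute differences between a block in the left image and the candidate block in the right image at disparity d; it is an illustrative fragment that omits image border handling and the outer per-pixel disparity search.

#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences (Equation 3.1) between a block centered at
// (cx, cy) in the left image and the block centered at (cx - d, cy) in the
// right image. Images are row-major grayscale buffers of width x height pixels.
int sumOfAbsoluteDifferences(const std::vector<uint8_t>& left,
                             const std::vector<uint8_t>& right,
                             int width, int cx, int cy, int d, int half)
{
    int sad = 0;
    for (int y = -half; y <= half; ++y) {
        for (int x = -half; x <= half; ++x) {
            int l = left[(cy + y) * width + (cx + x)];
            int r = right[(cy + y) * width + (cx + x - d)];
            sad += std::abs(l - r);
        }
    }
    return sad;  // the disparity d giving the smallest sum is the best match
}

In a full block matcher this function would be evaluated for every candidate disparity along the scanline, and the disparity producing the smallest sum written into the disparity map.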
Figure 3.1. A sliding window, shown in red, is moved horizontally across the right
image until the same block of pixels is found as in the left image, as shown in blue.
If the location of each camera is known, and the disparity between every
corresponding pixel from the left and right image is known, then the depth to any pixel can
be triangulated by drawing a ray projecting from each camera through the corresponding left
and right pixels. The point at which the rays converge is the depth of that pixel, which can be
saved in a separate depth map [9]. The concept is demonstrated in Figure 3.2.
Figure 3.2. The horizontal disparity between two stereo images, along with knowledge
of the camera locations, can be used to triangulate distances to objects in the scene.
Generally, the distance between the two cameras, shown as Baseline b, and focal
length of the lenses, shown as f in Figure 3.2, are known constants [27][29]. For any region
of interest in the scene, the horizontal offset from the left side of the image frame, shown as
X_l and X_r in Figure 3.2, can be discovered in both the left and right images using the
standard Stereo Block Matching algorithm given in Equation 3.1. Likewise, the disparity for
the region can be calculated simply by subtracting the horizontal offset in the left image
frame from the horizontal offset in the right image frame, as shown in Equation 3.2.
\text{Disparity} = \left| X_l - X_r \right| \qquad (3.2)
Finally, we can use the disparity information, along with knowledge of the distance
between the cameras and focal length of the lenses to triangulate the depth to the object, as
shown in Equation 3.3 [30].
Z = \frac{f \cdot b}{\text{Disparity}} \qquad (3.3)
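As an illustrative example with assumed values rather than measurements from the final rig, a camera pair with a focal length of 700 pixels and a baseline of 12 cm observing a point with a disparity of 30 pixels would, by Equation 3.3, place that point at Z = (700 × 12) / 30 = 280 cm. Halving the disparity to 15 pixels doubles the estimated depth to 560 cm, which illustrates why depth resolution degrades for distant objects.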
Before images can be analyzed for disparity information, the images must first be
processed to correct for lens distortion, rotation and positional misalignment of the cameras
[31][33]. As each right image is scanned horizontally for matching pixels from the left
image, it is critical that corresponding pixels in the left and right images appear on the same
horizontal plane [9]. In order to ensure that corresponding pixels appear on the same
scanline, the cameras must be calibrated before the stereo block matching algorithm is
started. A checkerboard pattern of a known size can be used to build a model of the
stereoscopic camera and discover the camera’s intrinsic and extrinsic parameters, which
include each camera’s focal length, lens distortion parameters, location and orientation
relative to the other cameras [9][31][33]. This thesis makes use of OpenCV’s stereo camera
calibration facilities to discover the intrinsic and extrinsic parameters, which once discovered
are saved to a file for later use [31].
3.2 Histogram of Oriented Gradients
The first four generations of the person following robot extracted humans from the
image data by color thresholding or circle detection, which proved to be problematic when
the background was the same color as the person or when other circular non-human objects
were present. The fifth generation of person following robot used proprietary libraries for
person detection, which required the person to be facing toward the camera. As such, the
sixth generation robot requires a person detection algorithm that is independent of color, can
consistently detect humans in a variety of poses, and produces a sufficiently low number of
false positives. The Histogram of Oriented Gradients (HOG) approach to person detection
meets these requirements and is able to detect people in various poses, with various types of
clothing, and even in partial states of occlusion. The detection method is also not affected by
subject coloration or changes in lighting, making it ideal for a robot that may travel through
environments with changing illumination.
In essence, the HOG feature vector describes the silhouette of the average human
figure. The vector is constructed in stages, the first of which is edge detection over the entire
image. Though many complex methods for edge detection are available, the creators of the
HOG algorithm, Navneet Dalal and Bill Triggs, found that convolving the image with a
simple [-1 0 1] Sobel kernel produced the best results [10].
Once the edge detection is completed in both vertical and horizontal directions, the
resulting edge maps are divided into small 8x8 cells. For each pixel in each cell, Orientation
Binning is performed. Orientation Binning is essentially the creation of a histogram of
quantized gradient directions [10]. Using the horizontal and vertical edge maps produced in
the first step, it is possible to calculate the gradient orientation and magnitude of each edge
using Equations 3.4 and 3.5 respectively.
\text{Orientation} = \arctan\left( \frac{G_y}{G_x} \right) \qquad (3.4)
\text{Magnitude} = \sqrt{G_x^2 + G_y^2} \qquad (3.5)
Each orientation is converted from radians to degrees, and then quantized in 40
degree increments. Each bin in the histogram represents a 40 degree increment, resulting in 9
bins total. Thus, a gradient direction of 23 degrees would contribute to the first bin of the
histogram, which has a range of 0 to 39 degrees, and an orientation of 44 degrees would
contribute to the second bin, which has a range of 40 to 79, and so forth. The amount that
each gradient orientation contributes to the cell’s histogram is equal to its magnitude, giving
prominent edges more influence than weaker edges [10]. After a histogram of quantized
gradient orientations has been created for each cell, the cells are grouped into 2x2 blocks,
with a 50% overlap between blocks [10][34]. Each cell is then contrast normalized relative to
the other cells in the block to prevent aliasing [10][34]. The entire process of creating the
HOG feature vector is shown in Figure 3.3.
Figure 3.3. An example of the construction of an HOG feature vector.
Finally, all of the cells in each block are concatenated together to form the HOG
feature vector [10][34]. When the standard 64x128 descriptor size is used as described by
Dalal and Triggs, the image is broken into 8x8 cells, 8 cells horizontally and 16 cells
vertically, totaling 128 cells [10]. Since 4 cells make up a 2x2 block of 16x16 pixels and each
block has a 50% overlap, the descriptor has a total of 105 blocks, 7 horizontally and 15
vertically [10]. Since each block has 4 histograms, one for each cell, and each histogram has 9
bins, then the total feature vector length after all of the histograms are concatenated is 3780
integer values [10].
These feature vectors are used in conjunction with a classifier system, such as a
Support Vector Machine or Neural Network, to determine whether each feature vector
contains a human or some other object. During supervised training, feature vectors describing
positive examples of humans and negative examples of other objects are created from
64x128 pixel images and are shown to the classifier system [10]. Back propagation can be
used to train the classifier on the training data. After training, any region of an image to be
tested for the presence of a human is scaled to 64x128 and a feature vector is created. The
feature vector is then used as input to the classifier system, and a boolean describing whether
a human was detected is returned.
By scaling the image frame to various sizes, humans can be detected at a range of
distances from the camera. Since the HOG algorithm operates on 64x128 regions, it is natural
that people who are close to the camera will appear larger than these dimensions, and people
who are distant from the camera will appear smaller. In order to detect everyone in the frame,
the entire image is scaled to various sizes, and the HOG algorithm is performed on each new
image, increasing the chance that the humans in the scene will fit in the 64x128 window [10].
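The detection procedure described above is available through OpenCV's HOGDescriptor class; the sketch below uses the library's pretrained pedestrian SVM and illustrative parameter values, and approximates rather than reproduces the exact configuration used in this thesis.

#include <opencv2/opencv.hpp>
#include <vector>

// Multi-scale HOG person detection using OpenCV's pretrained linear SVM.
// detectMultiScale internally rescales the image so that the 64x128 detection
// window can match people at various distances from the camera.
std::vector<cv::Rect> detectPeople(const cv::Mat& frame)
{
    cv::HOGDescriptor hog;  // default 64x128 window, 8x8 cells, 2x2 blocks, 9 bins
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> found;
    hog.detectMultiScale(frame, found,
                         0,                 // SVM hit threshold
                         cv::Size(8, 8),    // window stride
                         cv::Size(32, 32),  // padding around each window
                         1.05);             // scale step between pyramid levels
    return found;  // bounding boxes with predictable margins around each person
}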
3.3 Hough Transform for Head Detection
A common fundamental problem of computer vision is that of detecting shapes in
image data. The Hough Transform is a feature extraction technique used to detect lines or
shapes which can be described by an equation, and has the added benefit of being able to
detect shapes which may be incomplete or otherwise only partially visible [12].
As with the Histogram of Oriented Gradients algorithm, the Hough Transform
operates on edge detected images, which have a value of 0 when no edge is present and some
value other than 0 when an edge is present [13]. Given an edge detected image, which shows
only the perimeter of shapes, one can already intuitively understand the basic principle
behind the Hough transform – that all non-zero valued points that are part of a specific shape
should be solutions to the equation describing that shape [12]. Let us examine the simple case
of a line, given by Equation 3.6, where m denotes the slope of the line and b denotes the y-intercept, which together form the line parameters.
y = mx + b \qquad (3.6)
Given an edge detected image, every non-zero pixel that is part of the same line will
share a common value for m and b. Therefore, if we had a way of tracking how many points
in the image data fall on a line for every value of m and b, we could use the highest scoring
combinations of m and b as evidence of the presence of a line [13].
We can create a parameter space (b, m), sometimes called Hough space, in which
every point corresponds to the parameters of the shape’s equation. In the simple case of a line
using Equation 3.6, the equation can be rewritten such that x and y are constant and b and m
are the variable parameters, as shown in Equation 3.7.
b = -mx + y \qquad (3.7)
The Hough space would be defined as a two dimensional array in which one
dimension denotes the m values and the other dimension denotes the b values. Each index
into the Hough space is called an accumulator, whose initial value is 0 [13]. Therefore, the
Hough transform algorithm works by iterating over every non-zero pixel in the edge map,
calculating every b and m value for the lines that pass through that pixel, and incrementing
the corresponding accumulators in the Hough space [13]. Since pixels that lie on the same
line will share a common value for m and b, it is only natural that the corresponding
accumulator at index (m,b) in Hough space will be incremented multiple times. The
algorithm concludes by a search for the highest scoring accumulator cells, which provide
evidence of the existence of a line [13].
The Hough transform can be extended to discover the presence of circles simply by
replacing the equation of a line with that of a circle, given in Equation 3.8 [11].
x = a + R\cos\theta, \quad y = b + R\sin\theta \qquad (3.8)
Therefore, the Hough space would be extended to three dimensions to accommodate
the three parameters (a, b, R), where parameters (a, b) denote the center of the circle, and R
denotes the radius [20]. As before, the edge detected image is iterated over and for every
non-zero pixel all possible values of a, b and R are computed. For every value of (a, b, R)
that satisfies Equation 3.8 for the given x and y, the Hough space accumulator at index (a, b,
R) is incremented [11]. When all pixels in the edge map have been examined, the Hough
space is iterated over to find the global maxima, which indicate the presence of a circle in the
image data [11].
Since finding all possible values to satisfy Equation 3.8 for a three dimensional
parameter space is computationally expensive, the radius parameter is usually limited to a
small range of values.
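OpenCV exposes the circle Hough transform through the HoughCircles function; the sketch below shows the general shape of the call with illustrative thresholds, where restricting minRadius and maxRadius keeps the three dimensional (a, b, R) accumulator small.

#include <opencv2/opencv.hpp>
#include <vector>

// Detect circles in a grayscale region of interest. Each returned Vec3f holds
// the circle parameters (a, b, R). The thresholds below are illustrative.
std::vector<cv::Vec3f> findCircles(const cv::Mat& gray, int minRadius, int maxRadius)
{
    cv::Mat blurred;
    cv::GaussianBlur(gray, blurred, cv::Size(3, 3), 0);  // suppress noise edges

    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(blurred, circles, CV_HOUGH_GRADIENT,
                     1,              // accumulator resolution relative to the image
                     gray.rows / 4,  // minimum distance between circle centers
                     100,            // upper threshold for the internal Canny step
                     30,             // accumulator threshold for center detection
                     minRadius, maxRadius);
    return circles;
}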
3.4 Color Matching
Sometimes it is desirable to segment an image by a specific color or color range. In
most applications that require segmentation by color, a threshold is defined for the desired
color range and all pixels that fall within the threshold are kept, while all pixels that fall
outside of the threshold are discarded. In environments where coloration of clothing may be
affected by changes in lighting, it is advisable to use a color space that takes illumination into
account. Previous generations of the person following robot used the RGB color space,
which proved inadequate, and then later the HSI color space, which consists of hue,
saturation and intensity components, with the intensity component being ignored to account
for changes in lighting [6]. However, this too proved inadequate as the systems were unable
to cope with variable lighting.
This thesis uses the Lab color space, which consists of a lightness component L*, and
two color-opponent components a* and b* and attempts to approximate human perception of
color and luminosity [14]. In the Lab color space, color balance is adjusted by modifying the
a* and b* curves, while lightness is adjusted with the L* component. A graphical
representation of the Lab color space is shown in Figure 3.4.
Figure 3.4. A visualization of the Lab color space [21].
Since Lab color space is designed to mimic not only the way humans perceive color,
but also the way humans perceive changes in color, any two colors in Lab color space can be
compared for similarity with a distance function, given in Equation 3.9 [14],
E = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2} \qquad (3.9)
where E is a measure of similarity [22][23]. The smaller the resulting E value, the
more similar the two colors are [28].
When tracking a color over time, the L* component of the color of interest, which
represents lightness, can be adjusted as changes in environmental lighting occur. Therefore,
in the case of person tracking, as the person is detected from frame to frame, the reference
Lab color information that identifies them is also updated.
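A brief sketch of the comparison in Equation 3.9 using OpenCV's Lab conversion follows; note that OpenCV maps 8-bit Lab channels into the 0 to 255 range, so the absolute E values differ in scale from the CIE definition but remain directly comparable to one another.

#include <opencv2/opencv.hpp>
#include <cmath>

// Compare two BGR colors by converting them to Lab and measuring the Euclidean
// distance of Equation 3.9. Smaller values indicate more similar colors.
double labDistance(const cv::Vec3b& bgrA, const cv::Vec3b& bgrB)
{
    cv::Mat a(1, 1, CV_8UC3, cv::Scalar(bgrA[0], bgrA[1], bgrA[2]));
    cv::Mat b(1, 1, CV_8UC3, cv::Scalar(bgrB[0], bgrB[1], bgrB[2]));
    cv::Mat labA, labB;
    cv::cvtColor(a, labA, CV_BGR2Lab);
    cv::cvtColor(b, labB, CV_BGR2Lab);

    cv::Vec3b pa = labA.at<cv::Vec3b>(0, 0);
    cv::Vec3b pb = labB.at<cv::Vec3b>(0, 0);
    double dL = double(pa[0]) - pb[0];
    double da = double(pa[1]) - pb[1];
    double db = double(pa[2]) - pb[2];
    return std::sqrt(dL * dL + da * da + db * db);
}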
3.5 Summary
This chapter discussed the algorithms and software concepts used in the sixth
generation person following robot. It was shown how Stereo Block Matching can be used to
discover corresponding pixels between images captured by two different cameras, and how
the horizontal disparity of those pixels can be triangulated into a depth map. An explanation of the
Histogram of Oriented Gradients algorithm was given, and how it can be used to detect
humans at various distances from the camera. The Hough Transform for the detection of
circles was discussed, which is used by the sixth generation robot to detect heads. Finally, the
Lab color space was discussed and a method for detecting the similarity of two colors in Lab
color space was presented. These four algorithms make up the core of the sixth generation
robot’s person following software.
In the next chapter, the hardware components of the sixth generation person
following robot are discussed, including the development and testing of the stereoscopic imaging
rig, the power efficient compute hardware, and the client-server communication model. The program
flow of the sixth generation robot’s software, from initial person detection to the physical
motion of the robot, is analyzed. Lastly, an auxiliary Android application for taking manual
control of the robot is presented.
CHAPTER 4
SIXTH GENERATION HARDWARE AND SOFTWARE
The sixth generation robot has gone through several revisions of hardware and
software over the course of its development. The software was primarily written in C++ on
an Ubuntu Linux-based desktop computer and tested on a development board also running an
embedded version of Ubuntu Linux. The final hardware platform consists of a stereoscopic
imaging rig built from two wide angle cameras and a Segway RMP chassis. The robot also
features two computers dedicated to image processing and communication with the Segway
RMP chassis.
4.1 Stereoscopic Imaging System
Before a stereoscopic camera was built, an exploratory program was created to
determine the feasibility of a stereoscopic system for depth estimation. The initial program
was written as a single threaded Java program and operated on pre-rectified stereo images
obtained from the internet.
The program would operate on left and right images, an example of which is shown
in Figure 4.1, to produce a disparity map using the standard block matching algorithm,
shown in Figure 4.2. The program would load the left image and right image into memory,
and then create grayscale versions of both images to simplify the block matching process.
Since color data was thrown away, only similarity between gray levels was considered. The
software would then enter a loop that iterated over the left image. For every pixel in the left
image, a window of a predetermined size was placed around the pixel and then the right
image was scanned along the same horizontal plane with a sliding window of identical
dimensions, as described in the previous chapter. For every window position in the right image, a sum-of-differences
calculation, given in Equation 3.1, was performed against the left image’s window, with the
lowest scoring position considered as the closest match. However, in order to speed up
processing, an area larger than a single pixel was written into the disparity map. The size of the
area was usually equal to that of the scanning window. Thus, a large window size would
require a larger amount of processing time than a smaller window size in order to find a
corresponding pixel in the right image, but the number of pixels needing to be scanned was
reduced as the software operated on window boundaries.
The effect can be seen in Figure 4.2, in which the grayscale values denote the size of
the disparity between the left and right image, with whiter pixels denoting a larger
disparity and darker ones denoting a smaller disparity. The disparity map which required 7
seconds of processing appears to be lower resolution than the one which required 207
minutes because a block of data was written to the disparity map for every corresponding
pixel that was found. The software would try to find the next corresponding block in
successive iterations rather than the next corresponding pixel.
Testing revealed that the exploratory program was indeed able to produce disparity
maps that visually appear correct. In the sample image shown in Figure 4.1, the wooden
walkway is closest to the camera, followed by the river and rock face on the right. The green
hillside on the left half of the image is furthest from the camera. This is reflected in the
disparity maps that were generated, shown in Figure 4.2, as the area with the wooden
walkway appears lightest, the river and rock face appearing a darker shade of gray, and
finally the hillside appearing black or nearly black. Similar results were obtained with other
stereoscopic images found on the internet, with the areas closest to the camera correctly
showing as lighter than those further from the camera.
Even though the exploratory program confirmed that stereo imaging was a viable
method of depth sensing, it was not efficient enough for real-time use. In a real-time system,
many frames would need to be processed each second as the robot passed through its
environment for both person following and to avoid collisions with nearby objects. It became
evident that the program needed to be highly parallelized. Fortunately, Stereo Block
Matching lends itself to parallelization as each operation in the block matching phase is the
same operation performed on every pixel in the block. The differences for each block can be
calculated in parallel on the GPU. Furthermore, if enough processing units are available on
the GPU, the differences can be calculated for every block on the horizontal plane at the
same time. OpenCV has both CUDA and OpenCL implementations of the standard Stereo
Block Matching algorithm that are highly parallelized and execute on the GPU. The sixth
generation person following robot uses OpenCV’s GPU implementation of the Stereo Block
Matching algorithm.
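The exact class names differ between OpenCV branches (cv::gpu with CUDA and cv::ocl with OpenCL in the 2.4 series, cv::cuda in later releases), so the sketch below uses the 2.4 CPU interface purely to show the shape of the call; the GPU variants take the same rectified inputs and produce the same disparity map.

#include <opencv2/opencv.hpp>

// Block matching on a rectified grayscale stereo pair with OpenCV 2.4's
// StereoBM. The number of disparities and SAD window size are illustrative.
cv::Mat computeDisparity(const cv::Mat& rectifiedLeft, const cv::Mat& rectifiedRight)
{
    cv::StereoBM bm(cv::StereoBM::BASIC_PRESET,
                    64,   // number of disparities to search, a multiple of 16
                    21);  // SAD block size in pixels
    cv::Mat disparity;
    bm(rectifiedLeft, rectifiedRight, disparity, CV_32F);
    return disparity;  // per-pixel horizontal disparity, convertible to depth via Equation 3.3
}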
Figure 4.1. Pre-rectified stereoscopic image used in the initial exploratory program.
Figure 4.2. The resulting disparity maps after 7 seconds, 31 seconds, and 207 minutes of
processing respectively.
Once it was determined that a stereoscopic system can indeed be used to produce a
disparity map that reflects the depths in the scene, a prototype stereoscopic camera was built
from two Logitech C160 USB webcams to test the accuracy of depth measurement. The
webcams were mounted to a cardboard box at varying distances from each other. Each
webcam was aligned horizontally and vertically to point in the same direction as to allow the
stereo block matching algorithm to discover corresponding pixels on the same horizontal
plane. To facilitate this, another piece of software, of which screenshots are shown in Figure
4.3, was created with the OpenCV libraries to rectify each webcam and discover each
camera’s intrinsic and extrinsic parameters. The software accesses both cameras and looks
for a checkerboard pattern. Since checkerboards have a well-defined shape, consisting of
straight horizontal and vertical lines, the software analyzes the distortion in the horizontal
lines of the checkerboard to estimate how the lenses are warping the image. Once the
checkerboard is visible to each camera, the software takes 30 images of the board and
attempts to correct for the distortion of the lenses, such that the horizontal lines of the
checkerboard appear straight.
The intrinsic and extrinsic parameter values, which describe the camera model in
terms of focal length, sensor size, projection matrix and lens correction matrices are written
to a file, so that the calibration software need only be run once each time the hardware is
changed [24].
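A condensed sketch of that calibration workflow with OpenCV follows; it assumes the checkerboard corner detections from the roughly 30 captured board views have already been collected into objectPoints, imagePointsL and imagePointsR, omits the capture loop and rectification map generation, and uses a placeholder output file name.

#include <opencv2/opencv.hpp>
#include <vector>

// Discover the intrinsic and extrinsic parameters of the stereo rig from
// checkerboard views, then save them for later use by the tracking software.
void calibrateStereoRig(const std::vector<std::vector<cv::Point3f> >& objectPoints,
                        const std::vector<std::vector<cv::Point2f> >& imagePointsL,
                        const std::vector<std::vector<cv::Point2f> >& imagePointsR,
                        cv::Size imageSize)
{
    cv::Mat K1, D1, K2, D2;  // camera matrices and lens distortion coefficients
    std::vector<cv::Mat> rvecs, tvecs;

    // Calibrate each camera individually (intrinsics), then solve for the
    // rotation R and translation T between the two cameras (extrinsics).
    cv::calibrateCamera(objectPoints, imagePointsL, imageSize, K1, D1, rvecs, tvecs);
    cv::calibrateCamera(objectPoints, imagePointsR, imageSize, K2, D2, rvecs, tvecs);

    cv::Mat R, T, E, F;
    cv::stereoCalibrate(objectPoints, imagePointsL, imagePointsR,
                        K1, D1, K2, D2, imageSize, R, T, E, F);

    cv::FileStorage fs("stereo_calibration.yml", cv::FileStorage::WRITE);
    fs << "K1" << K1 << "D1" << D1 << "K2" << K2 << "D2" << D2 << "R" << R << "T" << T;
}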
Figure 4.3. Screenshot of the camera rectification software discovering the camera’s
intrinsic and extrinsic parameters by analyzing the horizontal lines of the checkerboard
pattern. The colored lines indicate the horizontal areas of the board that the software is
analyzing for lens distortion.
The stereo block matching software was rewritten in C++ and also utilized the
OpenCV libraries to perform block matching on the GPU, which brought the average
processing speed up to 30 frames per second when limited by the camera’s refresh rate, or 41
frames per second when operating on static images. Depth measurements were taken of a
chair placed in predetermined locations to which the distances were known. This test, the
results of which are given in Table 4.1, determined how accurate a basic stereoscopic
imaging system could be and what the optimal interpupillary distance between the cameras
is, where interpupillary distance is defined as the distance from the center of the left camera
lens to the center of the right camera lens.
Interpupillary Distance:        7.62cm          12cm            22cm
Min. Distance:                  37.6cm          72.5cm          135cm

Actual          Measured        Measured        Measured
254cm           234cm           220cm           236cm
162cm           150cm           145cm           145cm
89cm            86cm            93cm            Failed
71cm            70cm            76cm            Failed

Table 4.1. A comparison of depth measuring accuracy between three different
interpupillary distances.
From the table above, one can see that the optimal interpupillary distance for the
cameras is between approximately 8cm and 12cm, in terms of smallest depth measurement
error and shortest detectable distance.
The final iteration of the stereoscopic imaging system was built from two high refresh
rate USB camera modules sourced from Amazon, which were mounted to an aluminum channel
with polycarbonate standoffs, as shown in Figure 4.4.
Figure 4.4. The stereoscopic imaging system made from two USB camera boards.
Each camera has a maximum refresh rate of 120Hz and came with 170 degree fisheye
lenses. Cameras with a high refresh rate were specifically sought to decrease the time
between image capture intervals to maximize framerates. Since the Logitech webcams had a
30Hz refresh rate, if image processing took more than 1/30th of a second, then the program
would idle until the next frame became available, dropping the framerate to 15fps. However,
a 120Hz camera would allow the software to only wait a maximum of 1/120th of a second for
image data to become available, thus raising the overall framerate even without changes to
the software. Another important consideration was the camera’s low light performance.
Cameras with small lenses and small sensors usually suffer in low light, requiring the camera
to decrease the shutter speed in order to capture the image. Therefore in order to mitigate the
effects of poor lighting, cameras without infrared filters on the sensor or lenses were chosen.
However, it was discovered in testing that the 170 degree wide angle lenses produced
far too much distortion to be practical, even when rectification matrices were applied. The
rectification software was unable to fully account for the lens distortion and the Stereo Block
Matching algorithm would fail to find corresponding pixels on the same horizontal plane as a
result. The 170 degree lenses were replaced with 2.8mm 92 degree field of view lenses, both
in IR-pass and IR-block varieties for indoor and outdoor use. With the 92 degree lenses, the
rectification software was able to produce matrices that removed the lens distortion from all
but the edges of the frames, allowing the Stereo Block Matching algorithm to function as
expected. IR-pass lenses were used for testing in indoor environments, as they allowed all
frequencies of light to pass through to the sensor for a brighter exposure in what would
typically be a lower light environment. However, it was discovered in outdoor testing that IR
from the sun washed out all color information. Since shutter speed is guaranteed to be high in
bright daylight, it was decided that IR-blocking lenses should be used outdoors to cut out IR
from the sun and preserve color data. Doing so would allow the robot to correctly identify the
tracking target by color. The effects of sunlight on color matching are discussed in greater
detail in the next chapter.
4.2 Compute Hardware and Programming Environment
The software for the person following robot was almost entirely developed on a
desktop computer with an NVidia GTX 980Ti graphics card, Intel Xeon X5680 processor
and 18 gigabytes of RAM, and was designed to execute on the GPU for maximum efficiency.
Initial parts of the software were written in Java, but, due to performance concerns, the final
iteration of the software was entirely written in C++. In order to maintain platform
independence, the software makes heavy use of the OpenCV 2.4.11 libraries for accessing
camera and GPU hardware. As the NVidia Jetson TK1 runs an embedded version of Ubuntu
Linux, the desktop computer on which the software was developed also ran Ubuntu Linux to
maintain compatibility. However, since OpenCV is available for Linux, Windows, OSX and
Android, the software can effortlessly be ported to any of the major platforms and run as a
native binary. The software was compiled with gcc and Visual Studio 10 and 13, and run on
Ubuntu 16.04, Ubuntu 14.04 and Windows 7 and 10. Preprocessor flags allow easy
configuration of the software to best suit the underlying hardware, and the software can be
compiled to run on Intel and AMD graphics hardware through OpenCL, NVidia graphics
hardware through CUDA, or run in a CPU-only mode that does not make use of the GPGPU
capabilities of the graphics hardware.
Though the desktop hardware was more than adequate to handle the compute heavy
algorithms of the sixth generation robot, it used a prohibitively large amount of energy and
was thus unsuitable for mobile use. The Segway RMP chassis has a computer attached to its
frame, which was used by previous generations of the person following robot for both image
processing and communication with the chassis, however this computer has a low power
x86-based processor and no discrete GPU capable of general processing. As a result, a
second mobile computer, the NVidia Jetson TK1 development board, was purchased and
used solely for image processing tasks.
The NVidia Jetson TK1 development board, shown in Figure 4.5, contains a quad-core 2.3GHz ARM Cortex-A15 CPU, a 192-core NVidia Kepler GK20a GPU capable of
GPGPU programming, 2 gigabytes of RAM and 16 gigabytes of storage [15]. It also includes
a USB 3.0 port, which has the necessary bandwidth to simultaneously stream uncompressed
video data from two USB cameras at a high refresh rate, and an Ethernet port, which was
used to send motion commands to the Segway RMP’s computer.
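The wire protocol between the two computers is not reproduced here; the fragment below is only a hypothetical illustration of the client side of such an Ethernet link, with a made-up address, port, and "speed,turn" message format standing in for the actual commands understood by the Segway-side computer.

#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical motion-command client: connects to the chassis computer over
// TCP and sends a plain-text "speed,turn" message. The format is illustrative.
bool sendMotionCommand(const std::string& host, int port, double speed, double turn)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) return false;

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host.c_str(), &addr.sin_addr);

    bool ok = connect(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    if (ok) {
        char msg[64];
        std::snprintf(msg, sizeof(msg), "%.2f,%.2f\n", speed, turn);
        ok = write(sock, msg, std::strlen(msg)) == static_cast<ssize_t>(std::strlen(msg));
    }
    close(sock);
    return ok;
}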
Figure 4.5. The NVidia Jetson TK1 development board selected for image processing
[15].
4.3 Initial Person Detection and Target Tracking
Though this thesis uses a stereoscopic imaging rig built from two USB camera
boards, only the left camera’s output is used for person detection. At each iteration of the
software, both cameras capture an image of the scene, the images are processed to remove lens
distortion, and disparity and depth maps are created from the rectified images. After that, the
right image frame is discarded, and the unrectified left image frame is further processed to
discover any humans that may be in the scene. The program flow is shown in Figure 4.6
below, where the program entry point and initialization are shown in yellow, the depth
sensing stage is shown in blue, the person detection and identification are shown in red, and
the physical robot hardware being put into motion is shown in green.
Figure 4.6. This block diagram shows the program flow of the sixth generation robot’s
software. Initialization and program entry are shown in the yellow square. The first two
steps, which use both cameras to capture images and build a depth map, are shown in
blue. The red squares denote steps which constitute person detection and identification,
and only use the left camera’s output. Finally, the green square denotes the actual
person following step, when motor commands are sent to the robot chassis and the
software’s main loop starts anew.
The Histogram of Oriented Gradients method is used for initial person detection and
is indicated in red in Figure 4.6 above. This algorithm was specifically chosen for its color
invariance, resilience to poor lighting conditions and low false positive rate. Another benefit
of the HOG algorithm is that once a person is detected, the location of their head and torso
can easily be approximated, as the location is a constant offset from the upper left corner of
the bounding box. For every person that is detected, their head will appear in the upper 1/3rd
and middle 30% of the bounding box. Their upper body appears below the head, and is in the
middle 20% of the bounding box, with a height of approximately 16% of the height of the
bounding box. The lower body appears below the upper body, is square shaped and has an
approximate width of 25% of the bounding box width. Being able to approximate the
locations of the head, upper body and lower body within the bounding box facilitates further
processing of the human subjects by making additional steps, such as shape-matching to
discover the same features, unnecessary. In this thesis, the HOG algorithm can run in two
different modes to improve performance.
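The region arithmetic just described can be expressed directly in terms of the HOG bounding box; the sketch below applies the fractions listed above (the two detection modes are described next).

#include <opencv2/opencv.hpp>

// Approximate head, upper body and lower body regions inside a HOG bounding
// box, using the fixed proportions described above.
struct BodyRegions { cv::Rect head, upperBody, lowerBody; };

BodyRegions splitDetection(const cv::Rect& box)
{
    BodyRegions r;
    // Head: middle 30% of the width, upper third of the height.
    r.head = cv::Rect(box.x + int(0.35 * box.width), box.y,
                      int(0.30 * box.width), box.height / 3);
    // Upper body: middle 20% of the width, roughly 16% of the height, below the head.
    r.upperBody = cv::Rect(box.x + int(0.40 * box.width), box.y + box.height / 3,
                           int(0.20 * box.width), int(0.16 * box.height));
    // Lower body: a square roughly 25% of the box width, below the upper body.
    int side = int(0.25 * box.width);
    r.lowerBody = cv::Rect(box.x + (box.width - side) / 2,
                           r.upperBody.y + r.upperBody.height, side, side);
    return r;
}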
In the first mode, the left image frame is scaled to different sizes and each image is
processed by the HOG algorithm to allow detection of human figures at various distances
from the camera. This mode is only used when no humans have previously been detected.
Every human that is detected in this mode is added to a tracker, which tracks the person’s
current location in three dimensions, bounding box dimensions, clothing color, head location,
detection timestamp, and a 300 millisecond history of previous locations and attributes.
The second mode is used if data exists in the tracker. Since humans were previously
detected, scaling the source image and running the HOG algorithm multiple times is no
longer a necessity. Instead, a trajectory is calculated for each subject in the tracker and a
region of interest based on the person’s expected location is copied from the image data.
Since the subject’s expected location is calculated as a velocity from their location history
and previous detection timestamps, the resulting location is invariant to any changes in
framerate that might have occurred throughout the person’s detection history. The region of
interest is then rescaled to the expected HOG detection window size, which in the
case of this thesis is 64x128 pixels, and tested for the presence of a human using the standard
HOG algorithm. If a person is detected, the person’s tracker entry is updated with new
location information and timestamp, and the previous location information is added to their
history. If a person is not found at the expected location, then the entry remains in the tracker
for a period of time in case the person reappears. If the person does not reappear after an
expected period of time (300 milliseconds was used as the default value), then the person’s
entry is removed from the tracker. Lastly, if the last person is removed from the tracker, then
the person detection system switches back to the first mode to rescan the scene for human
targets.
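The velocity-based prediction used in the second mode can be sketched as follows. This is an illustrative reconstruction, and the TrackedPerson structure and its field names are hypothetical stand-ins for the attributes listed above.

```cpp
#include <opencv2/core.hpp>
#include <deque>
#include <utility>

// Hypothetical tracker entry holding the attributes described above.
struct TrackedPerson {
    cv::Point2f position;   // last detected image position
    double timestamp;       // time of last detection, in seconds
    std::deque<std::pair<cv::Point2f, double>> history;  // ~300 ms of (position, time)
};

// Predict where the person should appear now, using a velocity estimated from
// the oldest history entry. Because the velocity is computed from timestamps,
// the prediction is unaffected by fluctuations in framerate.
cv::Point2f predictPosition(const TrackedPerson& p, double now) {
    if (p.history.empty())
        return p.position;
    const auto& oldest = p.history.front();
    double dt = p.timestamp - oldest.second;
    if (dt <= 0.0)
        return p.position;
    cv::Point2f velocity = (p.position - oldest.first) * (float)(1.0 / dt);
    return p.position + velocity * (float)(now - p.timestamp);
}
```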
4.4 Location Refinement and False Positive Removal
The next step in the person tracking pipeline is to further process the tracker data with
the Hough transform. Early testing revealed that certain non-human objects, such as office
chairs viewed in profile, would consistently show up as false positives in the HOG step. To
remedy this, the Hough transform was introduced as an additional step in the person tracking
pipeline. For each person that appears in the tracker, the person’s region of interest is
analyzed for the presence of circles. Since the person’s bounding box was defined by the
HOG algorithm, the relative location of the person’s body features are well known, including
their head which is expected to be located in the middle of the upper 1/3rd of the bounding
box. The radius of the head can also be inferred from the size of the bounding box. As such,
the circle Hough transform need not be applied to the entire region of interest as defined by
the person’s bounding box, but instead can be efficiently applied to a small region in the
upper 1/3rd of the bounding box and with a few radii. By limiting the search space for circles,
the Hough transform’s impact on framerate becomes negligible. A 3x3 Gaussian blur is
applied to the region before the Hough transform is performed to improve detection.
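A sketch of this restricted head search, assuming OpenCV's HoughCircles and illustrative radius bounds derived from the bounding box size, is shown below; it is not the thesis source code.

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Search the middle of the upper third of the HOG bounding box for circles.
// Returns true and fills `head` (x, y, radius in full-frame coordinates) if a
// circle is found. The radius bounds are assumed values for illustration.
bool detectHead(const cv::Mat& grayFrame, const cv::Rect& box, cv::Vec3f& head) {
    cv::Rect searchArea(box.x + box.width / 4, box.y,
                        box.width / 2, box.height / 3);
    searchArea &= cv::Rect(0, 0, grayFrame.cols, grayFrame.rows);
    cv::Mat roi = grayFrame(searchArea).clone();

    // A 3x3 Gaussian blur improves circle detection, as described above.
    cv::GaussianBlur(roi, roi, cv::Size(3, 3), 0);

    int minRadius = box.width / 10, maxRadius = box.width / 4;  // assumed bounds
    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(roi, circles, cv::HOUGH_GRADIENT, 1,
                     roi.rows / 4.0, 100, 30, minRadius, maxRadius);
    if (circles.empty())
        return false;

    head = circles[0];
    head[0] += searchArea.x;   // convert back to full-frame coordinates
    head[1] += searchArea.y;
    return true;
}
```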
If a head is detected, then the person’s tracker information is updated with an adjusted
location relative to their head. If a head is not detected, then the person’s tracker entry is
flagged as not having a head and, depending on program settings, may be removed from the
tracker or may be omitted from consideration for person following later.
Testing showed that the Hough transform had some success in removing false
positives, such as in the case of the office chair in profile, but failed to remove other false
positives, such as certain street signs containing the letter O, which were often mistaken for a
head.
4.5 Person Identification
Because the bounding box margin widths surrounding each target from the HOG
algorithm are largely constant, it is easy to determine the location of a person’s shirt and
pants. Since the location of their chest and torso area is well defined, colors can be sampled
from both areas and used as a feature vector to identify an individual within a group. More
importantly, since the critical steps of person detection have already been completed and
color matching is only used for identification, it is not beholden to the same problems that
previous generations of the person following robot had, chiefly that surrounding colors can
have a profound impact on the quality of the person detection.
Before the main person following program is started, the human to be followed stands
in front of the robot and a training program is run. The training program samples pixels from
the area containing the shirt and pants, as indicated by the pink squares in Figure 4.7, to
create two feature vectors.
Figure 4.7. A snapshot of the robot’s onboard video stream showing the bounding box
from the HOG algorithm indicated in white, Hough transform head detection search
area indicated in light blue, approximate head location as determined by Hough
transform indicated in yellow, and approximate location of shirt and pants indicated in
pink.
Each feature vector, one for the pants and one for the shirt, contains the sampled
colors converted to Lab color space and is saved to a file, along with the camera parameters
that were used when the feature vectors were created, such as exposure, brightness, contrast,
and color saturation settings. When the main person following program is started, the training
data is loaded from the training file and the camera parameters are adjusted to match those
that were used when the training data was created.
Processing continues as usual until after the HOG and Hough steps have filled the
tracker with potential person following targets. The tracker entries are iterated over and shirt
and pants feature vectors are created for each target, where each feature vector consists of 96
color samples. The feature vectors of each target are then compared against the training data
using a simple distance function to determine which target most closely resembles the
training data in their clothing coloration, as given in Equation 4.1.

\[
\sum_{a=0}^{\text{Features}} \sqrt{(L_a - L_t)^2 + (a_a - a_t)^2 + (b_a - b_t)^2} \qquad (4.1)
\]
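A minimal sketch of the distance in Equation 4.1, assuming the feature vectors are stored as OpenCV Vec3f samples in (L, a, b) order, is given below; it is illustrative rather than the thesis code.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Sum of Euclidean distances in Lab space between a candidate feature vector
// and the training feature vector, per Equation 4.1. Both vectors are assumed
// to hold the same number of samples (96 in this thesis).
double colorDistance(const std::vector<cv::Vec3f>& candidate,
                     const std::vector<cv::Vec3f>& training) {
    double total = 0.0;
    size_t n = std::min(candidate.size(), training.size());
    for (size_t i = 0; i < n; ++i) {
        cv::Vec3f d = candidate[i] - training[i];  // (L, a, b) differences
        total += std::sqrt(d[0] * d[0] + d[1] * d[1] + d[2] * d[2]);
    }
    return total;
}
```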
For each person in the tracker, Equation 4.1 is used to calculate the similarity of the
person’s shirt color, pants color, and an average of shirt and pants color to the training data.
The index of the person identified as the tracking target in the previous frame is also noted.
The tracking target is then selected by a priority system to increase tracking reliability in
scenarios where two or more people might be wearing similarly colored clothing. The
parameters for the priority system are shown in Table 4.2.
| Priority | Condition | Description |
|----------|-----------|-------------|
| 1 | Closest Aggregate Index = Previously Identified Index | The person with the shirt and pants color most similar to the training data is also the same person that was identified in the previous frame as the tracking target. |
| 2 | Closest Shirt Index = Previously Identified Index | The person with the shirt color most similar to the training data is also the same person that was identified in the previous frame as the tracking target. |
| 3 | Closest Pants Index = Previously Identified Index | The person with the pants color most similar to the training data is also the same person that was identified in the previous frame as the tracking target. |
| 4 | Closest Aggregate < Threshold | The average of the shirt and pants color distance measurements for the most similarly colored person is below a constant threshold of 20.0. |
| 5 | Closest Pants < (0.80 * Closest Shirt) < Threshold | The pants color distance measurement for the most similarly colored person is 20% more similar than the person’s shirt is to the training data, and both values are below the constant threshold. |
| 6 | Closest Shirt < Threshold | The person with the closest matching shirt to the training data has a color distance measurement below the threshold of 20.0. |
| 7 | -1 | The color distance measurements for the shirt, pants and average of shirt and pants for all people were greater than the threshold of 20.0. |
Table 4.2. This table shows the conditions of the priority system for selecting which
person to follow. The algorithm favors following the same person from frame to frame,
even if only part of their clothing matches the training data, and only falls back on pure
color matching if the previously tracked person cannot be found by shirt or pants color.
The priority system prefers to follow the same person from frame to frame rather than
purely the person who is most similarly colored to the training data. If the person who has the
most similarly colored shirt and pants to the training data is also the same person that was
previously being followed, they are once again selected as the tracking target. If, however,
the most similarly colored person is not the previous tracking target, but another person with
a closely matching shirt is, then the person with the closely matching shirt is once again
selected as the tracking target. This is also true for pants color, and a person with very similar
pants color to the training data, who was also being tracked in previous frames, will be
selected as the tracking target again even if other people’s clothing matches the training data
better. The priority system does not select purely by color similarity unless the tracking target
in the previous frame cannot be identified by clothing color.
In the case that the tracking target from the previous frame cannot be identified, the
software will choose whom to follow purely on the basis of color similarity. The top priority is
someone whose shirt and pants are similar to the training data. The next priority is someone
whose pants are at least 20% more similar to the training data than the closest matching shirt
color. Finally, the person with the closest matching shirt is selected as the tracking target,
provided that their shirt’s color distance is below the constant threshold. If the color distances
of the shirt, pants and their average for all people in the scene are greater than the threshold,
then it is assumed that the person to be followed has left the scene and the robot should
perform its tracking loss actions to rediscover the correct person.
This priority system, combined with the tracking software’s ability to track people
from frame to frame by trajectory and velocity, helps reduce tracking loss or confusion in
situations such as people crossing paths or people standing in close proximity to each other.
In each new frame, the robot will always prefer to track the same person that was being
followed in the previous frames, even if their coloration is not an exact match. By choosing a
person to follow based on how similar they are not only to the training data but also to the
person that was being followed in the previous frame, the robot is less susceptible to jerky
motion or rapid changes in lighting, as would be the case if the robot selected a tracking
target purely by color from a group of similarly colored people.
Another component of the person identification algorithm is the periodic retraining of
the training set feature vectors. Retraining happens after a timeout of 100 milliseconds, rather
than per frame. At every 100 millisecond interval, one percent of the selected target’s color
data is alpha blended into the training set, such that after 10 seconds the training set is
completely retrained. This continual retraining of the feature data aids in adjusting to changes
in lighting, especially if the robot is used in a different environment than where the training
data was originally created.
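The retraining step can be sketched as a simple alpha blend of the current target's samples into the stored training samples; the 0.01 blend factor corresponds to the one percent figure above, and the function below is an illustrative sketch rather than the thesis code.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

// Blend one percent of the currently selected target's color samples into the
// training data. Called every 100 ms, this gradually adapts the training set
// to changing lighting conditions.
void retrainFeatures(std::vector<cv::Vec3f>& training,
                     const std::vector<cv::Vec3f>& current,
                     float alpha = 0.01f) {
    size_t n = std::min(training.size(), current.size());
    for (size_t i = 0; i < n; ++i)
        training[i] = training[i] * (1.0f - alpha) + current[i] * alpha;
}
```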
After the software chooses a tracking target, it must determine a direction and speed.
4.6 Speed and Direction Calculations
The speed and direction of the robot are calculated separately according to different
criteria and the robot may perform a range of actions depending on whether the tracking
target is visible in the camera frame, recently exited the camera frame, or has not been seen
for a while.
Under normal circumstances, the person to be followed will be standing in front of
the robot, in full view of the stereoscopic imaging rig. In this scenario, forward speed is
determined from the target’s distance as read from the depth map. Because the depth map is
created from two rectified images, the shape of the depth map itself appears corrected for
lens distortions. However, the HOG and Hough algorithms that determine the person’s
location within the camera frame operate on unrectified images. As such, the coordinates of
pixels in the image data, such as those located at the person’s center of mass, will not
correspond to those pixels’ depth at the same coordinates in the depth map. To be able to
sample depth data accurately, the depth map must be preprocessed with an unwarping matrix
that, in effect, reverses the image rectification process and returns the depth data to the same
coordinate space as the unrectified camera images.
The unwarping matrix is created in the initialization stage of the program flow. When
the program starts up, all of the necessary rectification parameters and matrices to correct for
lens distortion are loaded. Since the camera’s frame size is known, a new table of the same
dimensions as the camera frame is created. For every x and y coordinate in the camera frame,
its lens-corrected location is calculated by a multiplication with the lens correction matrix.
This location is then used as an index into the new table, where the original x and y values
are written. This table, called the unwarping matrix, acts as a lookup table where the location
of every pixel from a lens distortion corrected frame acts as an index into the table, from
which the original pre-corrected coordinates of the same pixel can be retrieved.
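A minimal sketch of building such a lookup table, assuming OpenCV's undistortPoints and hypothetical names for the calibration inputs, is given below; it follows the scatter-write construction described above rather than reproducing the thesis code.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

// Build an "unwarping matrix": every original (distorted) pixel is projected
// to its lens-corrected location, and the original coordinates are written at
// that corrected location. Indexing the table with rectified coordinates then
// returns the corresponding pre-rectification coordinates.
cv::Mat buildUnwarpTable(const cv::Size& frameSize,
                         const cv::Mat& cameraMatrix,
                         const cv::Mat& distCoeffs,
                         const cv::Mat& R,   // rectification rotation
                         const cv::Mat& P)   // rectified projection matrix
{
    cv::Mat unwarp(frameSize, CV_32FC2, cv::Scalar(-1, -1));

    std::vector<cv::Point2f> src, dst;
    src.reserve((size_t)frameSize.width * frameSize.height);
    for (int y = 0; y < frameSize.height; ++y)
        for (int x = 0; x < frameSize.width; ++x)
            src.emplace_back((float)x, (float)y);

    // Map every original pixel to its lens-corrected (rectified) location.
    cv::undistortPoints(src, dst, cameraMatrix, distCoeffs, R, P);

    for (size_t i = 0; i < src.size(); ++i) {
        int rx = cvRound(dst[i].x), ry = cvRound(dst[i].y);
        if (rx >= 0 && rx < frameSize.width && ry >= 0 && ry < frameSize.height)
            unwarp.at<cv::Vec2f>(ry, rx) = cv::Vec2f(src[i].x, src[i].y);
    }
    return unwarp;
}
```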
A demonstration of this concept can be seen in Figure 4.8, which shows rectified
image data from the right camera unwarped using the generated unwarping matrix and output
to the left window.
Figure 4.8. A rectified image, shown on the right, is processed with the unwarping
matrix to return the image to its original coordinate space, as shown on the left.
Since the depth map may contain errors or fluctuations in depth for any given area,
the depth data is sampled from the target’s chest area and the median depth value is selected
as the person’s depth.
Originally, the robot’s forward speed was directly proportional to the person’s
distance from the robot. However, to reduce jerky motion induced by constant acceleration
and deceleration of the Segway robot, the speed is now determined by a fuzzy logic system
with two membership groups. The person’s depth data is compared against a threshold of 200
centimeters. If the person is further from the robot than 200cm, then the robot gradually
accelerates to a predetermined high speed. If the person is at 200cm or below, then the robot
decelerates to a predetermined low speed.
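A sketch of this two-group speed rule is given below; the high and low speed set points and the per-iteration step are assumed values for illustration, not the constants used by the robot.

```cpp
#include <algorithm>

// Gradually ramp the forward speed toward a high or low set point depending on
// whether the target is beyond the 200 cm threshold. The set points and the
// per-iteration step are illustrative constants.
double updateSpeed(double currentSpeed, double targetDepthCm) {
    const double highSpeed = 400.0, lowSpeed = 100.0, step = 20.0;  // assumed values
    double desired = (targetDepthCm > 200.0) ? highSpeed : lowSpeed;
    if (currentSpeed < desired)
        return std::min(currentSpeed + step, desired);   // accelerate gradually
    return std::max(currentSpeed - step, desired);       // decelerate gradually
}
```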
The direction of travel is determined by the person’s center of mass relative to the
center of the image frame, and is calculated using Equation 4.2.

\[
\text{direction} = \left( \frac{x\ \text{position} + \frac{\text{bounding box width}}{2}}{\frac{\text{frame width}}{2}} - 1.0 \right) \times 1000.0 \qquad (4.2)
\]
The person’s position within the camera frame is calculated as the horizontal offset of
the left edge of their bounding box, shown as x position in Equation 4.2, plus half the width
of the bounding box itself. The person’s position is then divided by half the frame width to
scale their horizontal position between 0.0 and 2.0. Since the Segway RMP expects values
between -1000.0 and 1000.0 for full speed left turn and full speed right turn, the person’s
position is further scaled to fit this range by subtracting 1.0, bringing the range from -1.0 to
1.0, and then multiplied by 1000.0 to fit the value range that the Segway expects.
Since the Segway RMP is a two wheeled platform, turning left or right is performed
by accelerating one wheel and decelerating the other wheel. It was discovered that changes
in direction also affect the robot’s forward speed, as one wheel is either accelerated or
decelerated to make the turn. This too would result in undesirable motion if the robot was
constantly adjusting its orientation toward the tracking subject. A dead zone was added to
suppress unnecessary turns and thus unnecessary changes in speed until the tracking subject
was on the far edges of the image frame. Therefore, after the direction of travel value is
calculated, the value is checked to see if it falls within the dead zone, which constitutes the
middle 40% of the image frame. If the person’s center of mass is within the dead zone, then
the robot makes no attempt to turn toward the person and continues its forward motion. If the
person’s center of mass falls either in the left 30% or right 30% of the image frame, then the
robot will turn left or right respectively at a speed proportional to the person’s proximity to
the frame edge.
If the person manages to escape the robot’s field of view by exiting either the left or
right edge of the frame, then the robot will continue turning in the direction where the person
was last seen until the tracking subject is recaptured. This feature allows the robot to quickly
recover from tracking loss by turning into the direction of the person, even if they are no
longer visible to the robot. Tracking loss can also occur without the person leaving the frame,
such as in the case of the person being occluded by an object. In such a case, since the robot
has no knowledge of where the person went, it decelerates to a stop and idles for a
predetermined amount of time. If the person does not reappear, then the robot begins to
slowly turn in a circle to the left until the tracking subject reappears.
4.7 Client-Server Communication
When the actual speed and direction have been determined, the Jetson TK1 computer
that is responsible for image processing must communicate these intentions to the older
secondary computer, which is responsible for controlling the Segway RMP chassis. The
secondary computer runs a small server program on boot, which the primary computer
connects to on program startup. The client and server are usually connected by TCP over a
wired Ethernet connection for reliability, but can also connect wirelessly. If the client
disconnects, either intentionally or due to a network error, the server continues to listen for
new connections until a shutdown command is sent.
A simple text-based communications protocol was developed to allow the client to
signal its intentions to the server, but also to simplify any future development involving the
Segway RMP. The communication is one way only, from client to server, and the server does
not send any responses to the client. When the client software would like to move the robot
in a particular direction, it simply sends one of the supported commands to the server that are
given in Table 4.3.
| Command | Example | Description |
|---------|---------|-------------|
| move <-1000.0 - 1000.0> | move 200.0 | Moves the robot forward at 20% of maximum speed. |
| turn <-1000.0 - 1000.0> <-500.0 - 500.0> | turn 300.0 -100.0 | Moves the robot forward at 30% speed and turns left gradually. |
| brake | brake | Stops the robot immediately. |
| shutdown | shutdown | Stops the robot and shuts down the server, preventing further connections. |
Table 4.3. This table shows all of the valid commands supported by the server program.
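A minimal client-side sketch of this protocol using POSIX sockets is given below. The command strings follow Table 4.3, while the port number, the newline terminator, and the one-connection-per-command simplification are assumptions made for illustration; the actual client holds a single persistent connection to the server.

```cpp
#include <string>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

// Connect to the motion server and send one text command, e.g. "turn 300.0 -100.0".
bool sendCommand(const std::string& host, int port, const std::string& command) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host.c_str(), &addr.sin_addr);

    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) { close(fd); return false; }

    std::string line = command + "\n";   // assumed line terminator
    bool ok = send(fd, line.c_str(), line.size(), 0) == (ssize_t)line.size();
    close(fd);
    return ok;
}
```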
A companion Android app was also created that allows a person to take direct control
of the robot. The intention of the app was to allow the user to drive the robot manually in
complex environments that would be too hazardous for the robot to navigate alone. The
Android app has an easy to use interface and connects to the server program mentioned
above over Wi-Fi. A screenshot of the app is shown in Figure 4.9.
Figure 4.9. A screenshot of the companion Android app used to take direct control of
the robot.
4.8 Summary
This chapter detailed the development of the stereoscopic imaging software and
hardware, as well as the image rectification process. Tests to determine the stereoscopic
imaging system’s optimal interpupillary distances were also discussed. An explanation of the
sixth generation robot’s compute hardware was also given.
This chapter also analyzed the general program flow of the sixth generation person
following robot’s software, including initialization steps, person detection using the
Histogram of Oriented Gradients algorithm, persistent frame-to-frame person tracking by
trajectory, and person location refinement with the Hough transform for head detection. False
positive removal through head detection and bounding box size was also discussed. The
method by which a person is identified amongst a group was also explained, which included
a discussion of the Lab color space, color matching with a distance formula, and a priority
system that prefers tracking the same person from frame to frame over exact color matching.
A method for determining the robot’s forward speed and direction was shown, which
uses the tracking subject’s depth and horizontal position within the image frame. Finally, the
sixth generation person following robot’s client-server model was presented and an
explanation was given of how the speed and distance information is sent over a network to
the Segway RMP hardware, which translates the data into physical motion. A companion
Android app was also discussed, which allows the user to take manual control of the robot by
connecting to the robot’s server software.
The next chapter presents tests performed with the sixth generation person following
robot under various lighting conditions, types of terrain, and with multiple people in the
scene in various poses. Tests were also performed to determine if there are alternate methods
for estimating the distance to a person without the use of stereoscopic block matching. The
results are presented and analyzed in the next chapter.
CHAPTER 5
TESTS AND RESULTS
Throughout the robot’s hardware and software development, various tests were
performed to gauge the viability and accuracy of the proposed solutions. The stereoscopic
imaging system was tested for accuracy in depth estimation and compared against alternative
methods of depth sensing. The HOG algorithm was tested in various lighting conditions. The
robot’s ability to track a human subject was tested indoors and outdoors, through rapidly
changing illumination and on various terrain. The results from these tests are presented in
this chapter.
5.1 Stereoscopic Imaging Tests
As the stereoscopic imaging hardware was being developed, the effects of
interpupillary distance between the two cameras on depth sensing were explored. The
cameras were placed at varying distances apart and the depth to a target of a known distance
was measured. Measurements were taken using the standard stereo block matching algorithm
to create a disparity map and reprojection was used to calculate the depth to the target in
centimeters. The results were then compared to discover the optimal interpupillary distance
for the camera system. The results, given in Table 4.1, are reproduced for the reader in Table
5.1.
| Actual Distance | Measured, IPD 7.62cm (Min. Distance: 37.6cm) | Measured, IPD 12cm (Min. Distance: 72.5cm) | Measured, IPD 22cm (Min. Distance: 135cm) |
|-----------------|----------------------------------------------|--------------------------------------------|-------------------------------------------|
| 254cm | 234cm | 220cm | 236cm |
| 162cm | 150cm | 145cm | 145cm |
| 89cm | 86cm | 93cm | Failed |
| 71cm | 70cm | 76cm | Failed |
Table 5.1. This table shows a comparison of depth measuring accuracy between three different
interpupillary distances.
The measurements were fairly consistent at longer distances for all three
interpupillary distances. However, when the distance to the target was reduced, wide
distances between the two camera lenses prevented the system from capturing depth
accurately. The furthest spacing between the two cameras, 22cm, was unable to measure
distances to objects closer than 135cm from the camera lenses. As there was very little
difference in the depth sensing accuracy between the remaining two interpupillary distances,
a spacing of 7.62cm was chosen for the final stereoscopic imaging rig as it was able to
resolve depth at shorter distances.
5.2 Hough Tests and Alternate Ranging
The Hough transform for head detection was added as a component of the robot’s
software to help remove false positives and improve positional tracking of targets. Before the
Hough transform was added, the HOG algorithm would occasionally confuse objects that
may have a human-like silhouette as humans and pollute the tracker module with false
positives. An example of this, in which an office chair viewed in profile is mistaken for a
human, is shown in Figure 5.1. The HOG algorithm also failed to adjust the bounding box of
previously detected human subjects as they performed different poses, such as squatting or
tilting to the side, which can indicate the person’s direction of travel and aid in calculating an
accurate trajectory. Thus, the upper 1/3rd of each bounding box discovered by the HOG
algorithm is scanned for the head using the Hough transform for circles. If at least one circle
is found, then the person’s entry in the tracker is marked as having a head. If multiple circles
are found, then the circle closest to the center of the search area is considered the head. An
example of this is shown in Figure 5.2.
Figure 5.1. An office chair in profile mistaken as a human being by the HOG algorithm.
Figure 5.2. A bounding box adjusted to fit a squatting figure by discovering the location
of the head. The yellow circle indicates the circle that is most likely the head, while
every other circle discovered by the Hough transform is marked in purple. The blue
square denotes the search area for the head, while the green square denotes the
bounding box for the human as discovered by the HOG algorithm.
Testing of the Hough and HOG algorithms was done on YouTube videos of street
scenes from New York rather than static images [16][17]. Videos were used because they
include both people in motion and motion of the camera, complete with physical translation,
panning, tilting and motion blur. The videos were meant to be an analog for what a robot
would see when operating in a crowded environment.
The results were mixed. Certain false positives, such as tall buildings in the distance,
were discovered by the HOG algorithm and successfully flagged as false positives by the
Hough transform, as shown in Figure 5.3. However other complex areas, such as
advertisements on buildings, were discovered by the HOG and incorrectly marked by the
Hough transform as having a head, as shown in Figure 5.4. Since many of the false positives
seemed to be generated by distant objects, a filter was added to the tracker module that
further improves removal of false positives, including those incorrectly marked as being
human by the Hough transform. The tracker filter rejects any bounding box whose
dimensions are smaller than predefined constants, or whose depth is farther than what would
be considered a reasonable tracking distance. The filter works under the assumption that the
person to be followed will be within some acceptable range of the robot and appear large in
the camera’s field of view. For a 320x240 image, the rejection criterion is any bounding box
that is smaller than 60x120 or whose distance is greater than 500cm from the camera. Thus,
any distant objects, including distant humans, are eliminated from the tracker. In practice, the
filter was shown to be very beneficial for removing false positives. After the tracker filter
was added, false positives became something of a rarity.
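The tracker filter can be expressed as a single rejection predicate, sketched below; the 60x120 minimum size and 500cm maximum depth come from the text above, and the function is an illustration rather than the thesis code.

```cpp
#include <opencv2/core.hpp>

// Reject detections that are too small or too far away to be the person being
// followed. The constants follow the 320x240 criteria given above.
bool rejectDetection(const cv::Rect& box, double depthCm) {
    const int minWidth = 60, minHeight = 120;
    const double maxDepthCm = 500.0;
    return box.width < minWidth || box.height < minHeight || depthCm > maxDepthCm;
}
```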
Figure 5.3. A building is identified as human by the HOG algorithm. The red bounding
box indicates that the Hough transform was unable to find a head, and the object is
probably not a human.
Figure 5.4. A street scene in which advertisements on the side of a distant building are
misidentified as human by both the HOG and Hough algorithms.
The head radius as detected by the Hough transform was also explored as an alternate
method of depth estimation. The motivation behind this experiment was to further increase
the framerate of the software. Since the Hough transform requires less processing power than
the stereoscopic block matching algorithm, a sizeable increase in performance could be
gained simply by estimating the depth by head size.
Intuitively, if someone is closer to the camera, the radius of their head should appear
to be larger than if they are further from the camera. Using this principle, fixed points of a
known distance were marked on the ground and the distance to a human subject was
measured at each point, both using head radius and the traditional stereo block matching
method. The test was carried out three times for each method of depth estimation. The results
were compared and are presented in Tables 5.2 and 5.3.
Head-Radius Depth Estimation
| Position | Attempt 1 | Attempt 2 | Attempt 3 | Actual Distance |
|----------|-----------|-----------|-----------|-----------------|
| 1 | 99-178cm | 139-187cm | 104-189cm | 270cm |
| 2 | 243-275cm | 276-342cm | 313-446cm | 360cm |
| 3 | 354-380cm | 408-498cm | 339-596cm | 480cm |
| 4 | 381-437cm | 496-571cm | 528-596cm | 560cm |
| 5 | 331-359cm | 443-536cm | 449-495cm | 480cm |
| 6 | 294-307cm | 297-330cm | 342-390cm | 360cm |
| 7 | 136cm | 63-108cm | 97-151cm | 270cm |
Table 5.2. This table shows the results of depth estimation by examining changes in
head radius.
Stereo Block Matching Depth Estimation
| Position | Attempt 1 (med) | Attempt 2 (med) | Attempt 3 (avg) | Actual Distance |
|----------|-----------------|-----------------|-----------------|-----------------|
| 1 | 277.2cm | 277.2cm | 251-282cm | 270cm |
| 2 | 332.2cm | 332.2cm | 363-397cm | 360cm |
| 3 | 556.6cm | 415.8-556.6cm | 486-556cm | 480cm |
| 4 | 556.6cm | 556.6cm | 569-664cm | 560cm |
| 5 | 556.6cm | 556.6cm | 428-443cm | 480cm |
| 6 | 332.2cm | 332.2cm | 334-351cm | 360cm |
| 7 | 277.2cm | 277.2cm | 277-279cm | 270cm |
Table 5.3. This table shows the results of depth estimation as estimated by the stereo
block matching algorithm and reprojection.
Since a block of pixels is sampled from the depth map, stereo block matching depth
estimation can be calculated in two ways. The first two attempts used the median value of
depth samples at the person’s center of mass, while the last attempt used an average of the
depth samples as the final depth value.
The data in Table 5.2 above seems to follow the pattern that measurements increase
as the subject moves further from the camera and implies that head radius can be used as an
alternative to stereoscopic imaging for depth estimation. However, the actual distance falls
inside the measured range only some of the time, and the measurements can be incorrect by
as much as 162cm, as shown in Table 5.4 below.
Head Radius Measurement Error
| Position | Attempt 1 | Attempt 2 | Attempt 3 |
|----------|-----------|-----------|-----------|
| 1 | 92 cm | 83 cm | 81 cm |
| 2 | 85 cm | 18 cm | 0 cm |
| 3 | 100 cm | 0 cm | 0 cm |
| 4 | 123 cm | 0 cm | 0 cm |
| 5 | 121 cm | 0 cm | 0 cm |
| 6 | 53 cm | 30 cm | 0 cm |
| 7 | 134 cm | 162 cm | 119 cm |
Table 5.4. This table shows the difference between the actual distance and the closest
number in the measured range for each attempt.
Table 5.4 shows the difference between the actual distance and the nearest number in
the measured range. Therefore, if the measurement fluctuated between 99cm and 178cm, as
is the case for Attempt 1 at Position 1, then the measurement would have to be adjusted by at
least 92cm to include the actual depth of 270cm within the measurement range. If the actual
distance already falls within the measurement, then the error is considered to be 0cm. If the
distances estimated from the head radius exhibited a consistent measurement error, then the
data could simply be adjusted to compensate for the error and the measurements could be
considered as reliable. However, even though neither the subject’s location nor the hardware
were modified between successive attempts, the error in measurement is inconsistent from
attempt to attempt. Since the size of the error is unpredictable between successive attempts,
the depth estimation cannot be corrected simply by adding or subtracting a constant. There is
also the problem of the estimated range itself.
At each position, the distance from the camera to the person was estimated by their
head radius. However, the measurement was not consistent from frame to frame, and
therefore a minimum and maximum observed distance were recorded. The range of the
estimations for each position are shown in Table 5.5.
Head Radius Measurement Range
| Position | Attempt 1 | Attempt 2 | Attempt 3 |
|----------|-----------|-----------|-----------|
| 1 | 79 cm | 48 cm | 85 cm |
| 2 | 32 cm | 66 cm | 136 cm |
| 3 | 26 cm | 90 cm | 257 cm |
| 4 | 56 cm | 75 cm | 68 cm |
| 5 | 28 cm | 93 cm | 41 cm |
| 6 | 13 cm | 33 cm | 48 cm |
| 7 | 0 cm | 45 cm | 54 cm |
Table 5.5. This table shows the difference between the maximum estimated distance and
minimum estimated distance per position for all three attempts.
The data in Table 5.5 can be interpreted as a measure of the precision of using head radius as a
depth estimator. Even if the actual depth falls within the estimated depth range, the range
itself can sometimes be quite large. In the worst case, the range was 2.57 meters. The range
also reflects the separation needed between two people for them to be considered individuals.
If one person is, for example, 40 centimeters in front of another person, but the range of the
head radius estimation at that position is 50 centimeters, then the software cannot be sure that
the two people are in fact individuals and not the same person.
Conversely, when looking at Table 5.3, it is apparent that both methods of
stereoscopic depth estimation performed remarkably well for short distances. Selecting a
depth value from the sampled data by median was prone to errors for distances greater than
480cm from the camera. However, when selecting depth as an average of the samples, the
estimated distances were quite close to the actual distances for each point. Furthermore, the
range of values when using the average of samples method for depth estimation was very
small for all but the furthest distances from the camera.
Though the stereo block matching performed better than head-radius based depth
estimation in terms of accuracy, it did so at a performance cost. During testing, it was
observed that the stereo block matching algorithm combined with the usual HOG and Hough
transform algorithms used for person detection ran at approximately 10 to 15 frames per
second. On the other hand, the head-radius based depth estimation, combined with HOG for
person detection, ran at approximately 20-27 frames per second. Though the head-radius
based method was less accurate, sizeable performance gains were achieved through its use.
5.3 Single Person Indoor Performance
The person following robot was tested indoors in the hallway of the computer science
building. The hallway has a smooth tile floor, white walls, tight corners, doorways, and
variable lighting, both from overhead lights and windows. The robot was trained in an
adjacent lab, and then manually driven to the center of the hallway where the main person
following software was started.
The robot was initially tested only in its ability to track a single human. The human
wore a green shirt to stand out against the white background and stood approximately two
meters in front of the robot when the software was started. Several tests were then carried out
that included:
• Tracking the human in a straight line.
• Tracking the human in a zig-zag pattern.
• Tracking a human in profile or facing away from the robot.
• Tracking a human making sudden movements (jumping from side to side).
• Recovering from tracking loss when the human moves out of the robot’s field of view.
• Tracking the human through variable lighting conditions.
• Navigating around corners and through doorways.
The results showed that the robot was able to successfully track the human in a
straight line, zig-zag pattern, and when the human was making sudden jumping movements.
The robot was able to detect and track the human in all orientations and also able to keep a
walking pace. When the human would move out of the robot’s field of view, the robot rotated
in the direction of the human’s last known location and was able to recover tracking in a
reasonable amount of time, usually within a few seconds of tracking loss.
A problem was discovered with the robot’s hardware during the variable lighting
tests. To test the robot’s ability to track in variable lighting, the human walked from one end
of the hallway to the other, moving from shadowy areas to areas illuminated by overhead
lamps, and finally to an area illuminated by sunlight from a window. The amount of light was
measured with a lux meter at 25 lux in the shadows, 200 lux under the overhead lamps, and
approximately 1200 lux near the window. The robot had no problems tracking the human
between the shadows and overhead lamps, but lost tracking when entering the sunlit area.
Onboard video recorded by the robot revealed that the right camera had a hardware problem
with its auto-exposure function and was unable to adjust itself automatically to accommodate
the new lighting conditions. The image became greatly over exposed, as shown in Figure 5.5,
and the robot was unable to consistently detect the silhouette of the human.
Figure 5.5. Over-exposed video stream when entering a brightly lit area due to
hardware malfunction.
Though the robot was eventually able to detect the human and continue navigating,
the results showed that the defective hardware would be an obstacle in future testing. It was
also discovered that the robot’s simple person following logic was inadequate for an
environment with tight corners. The robot was able to follow the human around corners, but
with aid and careful positioning of the human. The robot would occasionally bump into
corner edges.
Finally, the robot is unable to follow a person through a doorway if the door needs to
be held open by the human. If the door is open, then the human and robot can pass through
the doorway without loss of tracking. However, if the human is required to hold the door
open, then the distance between the human and robot is reduced to the point where the
human no longer fits entirely in the frame and thus cannot be detected.
5.4 Indoor Tracking with Fixed Depth
A second round of testing was performed to test the effects of auto-exposure and
higher framerates on person tracking. The software was modified to clone the video stream
from the working left camera as the right camera’s output. Block matching was disabled, and
a fixed distance of 150cm was assumed to every point in space to allow the robot to follow
any detected human at a constant speed. The effect would be both functional auto-exposure
and a framerate increase of approximately 10 frames per second, bringing the average
framerate up to 20-27fps.
In order to test how well the software could detect and segment humans from the
environment, the human subject wore a white colored shirt to match the white walls of the
hallway, which was a problematic scenario in previous generations of the person following
robot. Testing revealed that the coloration of the person’s clothing made no difference in the
robot’s ability to detect and follow, as shown in Figure 5.6.
Figure 5.6. A person walks away from the robot wearing a shirt colored similarly to the
environment.
Testing also revealed that the robot was able to cope with drastic changes in lighting.
As the robot moved from a 25 lux environment to an approximate 1200 lux environment near
the window and then back to a 25 lux environment, the auto-exposure feature of the camera
was able to adjust the exposure settings to properly image the scene. The robot never failed
to track the human and was able to follow the human through the environment without a
problem. A screenshot from the onboard video taken in the brightly lit environment is shown
in Figure 5.7 for reference.
Figure 5.7. Onboard video from the robot passing through the same brightly lit
environment as before, with auto-exposure working properly.
The increase in framerate was also shown to have a positive effect on tracking. Since
the robot has more information about the person’s location per second, it can calculate a
more accurate trajectory to predict the person’s motion. This results in a lower miss rate in
the HOG algorithm, which means that the entire image does not have to be rescanned for
human targets. Since the robot has a better understanding of where the person will be in the
immediate future, the software has to perform fewer calculations to detect the human, and
thus can maintain a consistently high framerate.
The robot was also tested with a higher resolution 640x480 video stream. Since four
times as many pixels needed to be processed, the framerate dropped to approximately 10
frames per second when the robot was able to predict the person’s motion, and 4 frames per
second when the image needed to be rescanned. Because the robot had less temporal
information about the person’s whereabouts, the robot had trouble predicting the person’s
movements if they were not traveling in a straight line and would easily lose tracking. Once
the robot lost tracking due to low framerate, it was nearly impossible for the robot to regain a
lock on the human target. It was also observed that, perhaps due to motion blur or a lack of
hardware synchronization between the two cameras, distance estimation using the block
matching algorithm was prone to errors, sometimes by hundreds of centimeters, when the
robot was rotating left or right.
5.5 Outdoor Testing
After indoor testing showed positive results, the robot was taken outdoors to be tested
in intense sunlight and on various types of terrain. As before, the robot was trained in the lab
on a person’s clothing. Training was performed indoors as it required the human subject to
stand in front of the robot in isolation. The training program selects the first human detected
by the HOG algorithm and memorizes their clothing color, which is visually verified by the
person with the use of a monitor attached to the robot. The monitor allows the user to verify
that they were indeed selected as the tracking target, and that the camera settings were
adequate for a good exposure. After training was complete, the robot was manually driven
outside using the companion Android app. Once outside, the robot was positioned on a
cement surface and the main person following software was started. The software was also
configured to use stereoscopic imaging for depth estimation, rather than using the head radius
or assuming a fixed distance.
Preliminary testing revealed that due to the drastic illumination differential between
the indoor environment where the robot was trained and outdoors where the main software
was started, the robot was unable to match humans to the training data. Though the robot was
still able to detect humans, onboard video showed that the scene was greatly over exposed.
Since the camera hardware is unable to set the exposure automatically, the robot needed to be
reprogrammed in the field to accommodate the new lighting conditions. The level of over
exposure can be seen in Figure 5.8. The onboard video also revealed another problem with
the camera hardware. Because neither the lenses nor the sensors have infrared filters, the
infrared radiation from the sun washed out all color information in the scene, making it
impossible for the robot to identify the tracking subject by clothing color. To allow the test to
continue, the exposure settings were adjusted manually and the threshold for identifying the
tracking subject was reduced to include every person in the scene.
Figure 5.8. Over exposed video stream due to intense sunlight. Each human appears to
be wearing identically colored clothing from the robot’s perspective, even though the
outfits were quite different.
Once the exposure level was reduced and the identification threshold was adjusted,
the robot was briefly tested on concrete before being moved to a grassy area. The robot was
able to identify and track human subjects, and move without difficulty on the concrete and
grass surfaces. Detection and tracking performance was unaffected by surface type.
However, because multiple people were in the scene and every person appeared to be
wearing purple clothing, the robot would occasionally become confused and follow the
wrong person, as shown in Figure 5.9. The extreme lighting also made shadows appear to be
pitch black, as shown in Figure 5.10. As people in shaded areas appear as featureless
silhouettes, the robot would have had significant trouble properly identifying the correct
tracking target.
Figure 5.9. Infrared radiation caused all persons to appear purple, rendering
identification by color matching impossible.
Figure 5.10. Because of the intensity of the light, shade and shadowy areas appear pitch
black.
The illumination in direct sunlight was measured at approximately 95,000 lux while
the illumination in shaded areas was measured at approximately 23,000 lux. The results of
the outdoor tests showed that while the robot was able to detect and follow humans over
rough terrain, infrared interference made it impractical in its current form. It was concluded
that in order for the robot to become practical outdoors, new IR-blocking lenses would need
to be installed.
5.6 Testing with IR-Blocking Lenses and Multiple Subjects
New IR-blocking lenses were installed on the robot and the robot was once again
tested indoors and outdoors. Two human subjects were used, one of which was wearing a
black shirt and the other wearing a white shirt. As before, the robot was trained indoors and
then manually driven to the same grassy area outside using the companion Android app.
The first test tasked the robot to follow a single human without others in the scene
through a shaded area to one illuminated by direct sunlight. Due to the new lenses, shadows
no longer appeared as pitch black areas and the robot was able to follow the human subject
without issue.
The next test required the robot to continue following a human in a straight line while
a second human, who was not the intended tracking target, moved across the screen, cutting
between the robot and the intended target. This test was performed several times, and each
time the robot was able to maintain a lock on the intended target, as shown in Figure 5.11.
Figure 5.11. After infrared blocking lenses were installed, the robot was able to
properly distinguish individual human subjects outdoors.
There were several instances in which the robot appeared to briefly confuse the
intended tracking target with the second person. This was due to the fact that the robot was
trained indoors and then moved outdoors where the lighting conditions were drastically
different. In order to allow the robot to initially recognize the same target human outdoors,
the identification threshold had to be lowered to account for the lighting differences. Once
the robot acquired a lock on the intended target, the training data set is continually retrained
to promote better tracking performance as the subject moves between shaded and brightly lit
environments. However, the identification threshold is not scaled automatically and remains
low. Thus, when the second subject in the black shirt moves to an area with direct sunlight,
his shirt and pants appear to be a light gray, similar to those of the intended target in the
shade. Despite this, the robot was able to maintain tracking because it also takes trajectory,
depth and prior locations into account when deciding which human to follow.
The same tests were performed indoors in the hallway with the IR-blocking lenses
still attached, where lighting was more consistent and closer to the lighting in the laboratory
where the robot was initially trained. The intended target walked in a straight line and the
second human subject crossed his path to try to confuse the robot. A second test was also
performed in which both subjects walked away from the robot in parallel to see if the robot would
occasionally favor the wrong person.
The results showed that the robot performed better indoors than outdoors and was
able to maintain tracking of the correct subject, despite the efforts of the second human. The
robot would still occasionally confuse the two subjects, but usually for not long enough to
alter the robot’s own direction of travel. It was observed that in one instance the robot
confused the second human as the correct tracking subject for long enough to turn toward
him. However, the robot quickly corrected its path when tracking of the intended target was
reestablished, as shown in Figure 5.12.
Figure 5.12. Onboard footage of the robot detecting both humans and correctly
choosing the tracking target in an indoor environment.
5.7 Summary
This chapter presented the various tests that were performed with the robot and
discussed the results. Tests to determine the optimal interpupillary distance between the two
cameras in the stereoscopic imaging rig were carried out. The results showed that the
distance between the cameras had little impact on the accuracy of the system for distances up
to 2.5 meters. However, the configuration in which the cameras were closest together was
the most accurate when the target was closest to the cameras.
The Hough transform for head detection was tested as a method for removing false
positives. The upper region of every detection from the HOG algorithm was tested for the
presence of a circle, which would indicate the presence of a head and confirm the object as a
human. The results were mixed. Some false positives, such as a desk chair viewed in profile,
were removed as false positives, but others, such as complex geometry on distant
skyscrapers, were not. False positive removal was improved by introducing a filter into the
tracker, which removed any object that was smaller than the expected dimensions of a human
within 5 meters of the camera. Head detection was also shown to improve person tracking by
fine-tuning the location of each person’s bounding box relative to the location of their head,
which improved tracking in situations when the person was in an uncommon pose, such as a
squat.
Use of the head radius as an alternative method of depth estimation was also explored
and compared against stereoscopic image reprojection. The results showed that while the
radius of the head could be used to estimate depth, the method was not as reliable or accurate
as stereoscopic imaging. However, it was observed that estimating distance by head radius
was much less computationally expensive and increased the overall framerate of the software
by 10 to 15 frames per second.
Tests were carried out indoors with a single tracking subject to test the robot’s ability
to track a single person in a variety of situations. It was shown that the robot was able to
track a single person reliably in a straight line, zig-zag pattern, and when the person was
making sharp changes in direction. The robot was also able to track the person in a variety of
poses and reliably recover from tracking loss in a reasonable amount of time. It was
discovered in testing that one of the cameras had a hardware defect preventing it from
automatically adjusting exposure levels. Testing with a single working camera and fixed
distance value revealed that the robot can also cope with changes in lighting and continue to
track the human subject.
Outdoor testing initially revealed hardware problems with the non-IR blocking lenses.
IR interference from the sun washed out the image and color information, preventing the
robot from properly exposing the image and identifying individual people. This was solved
by introducing IR-blocking lenses.
The following chapter concludes this thesis by reiterating the results and comparing
them with those of previous generations of the person following robot. The next chapter also
discusses potential improvements to the system for future work.
CHAPTER 6
CONCLUSION
This thesis investigated the viability of stereoscopic imaging together with the
Histogram of Oriented Gradients and Hough Transform algorithms for person detection and
following. The stereoscopic imaging system was built from two off the shelf digital camera
boards and mounted to a Segway RMP chassis. The robot was tested indoors and outdoors in
a variety of lighting conditions, on different types of surface, and with multiple people in the
scene. The robot’s performance in each scenario was recorded by onboard video and any
areas of deficient performance, tracking loss or failed person detection were identified and
analyzed.
6.1 Summary of Results
Testing of the stereoscopic imaging rig showed that even a stereoscopic system made
from cheap off the shelf components can have good accuracy in estimating distance. The
stereoscopic imaging rig was able to work in a variety of lighting conditions, from direct
sunlight to pitch dark environments. It was also tested against estimating distance by head
radius and proved to be more accurate, but computationally more expensive.
Indoor testing demonstrated that the robot is able to detect and follow a specific
person through an environment where small changes in lighting may occur. The robot was
able to consistently detect and follow a single human target regardless of the color of their
clothing or the color of the surrounding environment. The robot was also able to detect
humans regardless of their pose or orientation relative to the robot. When multiple people
were present, the robot generally followed the correct person, though would occasionally
confuse the intended tracking target with a similarly colored person nearby. Finally, it was
demonstrated that the robot is able to track humans with erratic motion, and quickly recover
from tracking loss should the human subject move out of the robot’s field of view.
However, indoor testing also revealed a few deficiencies in the system. Due to simple
path planning routines, the robot’s software is not sophisticated enough to safely navigate
around sharp corners. Though the robot was able to follow a human throughout the building,
bumping into walls and intervention from the human were not uncommon. It was also
discovered that one of the cameras has a hardware problem preventing it from automatically
adjusting exposure levels to accommodate changes in lighting, which prevented the robot
from smoothly navigating through areas of extreme shifts in illumination. Testing with a
single working camera showed that if both cameras worked properly, the robot would have
little trouble operating in the same areas.
Initial outdoor testing had significant problems due to infrared radiation from the sun.
When the robot was moved to direct sunlight from an indoor environment, the video stream
was greatly over exposed and the robot was unable to correctly identify the tracking target.
After the exposure settings and identification threshold were manually adjusted, the robot
was able to track a person but would often get confused when multiple people entered the
scene. Because the lenses did not filter IR, every colored object, including people, appeared
purple in color to the robot, making matching by color impossible. After IR-blocking lenses
were installed on the robot, it was able to detect and follow a human subject on a variety of
surfaces and through extreme changes in lighting.
Finally, testing at higher video resolutions showed that the robot’s software does not
perform well when the framerate dips below approximately 10 frames per second. When the
frame size was increased from 320x240 to 640x480, the framerate dropped to approximately
4 frames per second. Because of this, the robot frequently lost tracking of the human when
they made erratic movements and had a very difficult time recovering from tracking loss. It
was also noted that distance estimation worsens as framerate decreases due to a lack of
hardware synchronization between the two cameras and the introduction of more motion blur
at lower framerates. At the higher resolution, distance estimation was sometimes off by a few
hundred centimeters.
6.2 Comparison to Prior Generations
The sixth generation person following robot solved many of the problems that were
inherent in prior generations. The first three generations relied solely on color matching for
person detection and were affected by the color of the background environment and changes
in lighting. The fourth generation robot used Hough transforms to detect the head and color
matching in the region immediately below to identify a specific human, but was also affected
by false positives from circular non-human objects. The performance was prohibitively slow
and the software could not be used in a realtime system. The fifth generation, which used the
Microsoft Kinect depth sensor to image the scene in three dimensions, worked well indoors if
the person was facing the robot, but failed to detect humans that were facing away from the
robot or moving erratically through the scene. The robot had trouble on rough terrain and
would frequently lose tracking due to motion blur.
The sixth generation robot was demonstrated to be unaffected by changes in lighting,
environmental color, or terrain type. The robot can work indoors or outdoors and has a very
low false positive rate. However, although color matching in Lab space generally works, it
has been shown to be the weakest link in the sixth generation robot’s software as with prior
generation robots.
6.3 Future Work
The sixth generation robot could immediately benefit from better hardware. Both indoor and outdoor testing showed that a higher framerate and the ability to automatically adjust exposure would greatly improve the robot's ability to predict the tracking subject's future movements. Simply replacing the onboard computer with a more powerful one should produce immediate positive results. Likewise, higher quality cameras with functional auto-exposure, or a professional stereoscopic imaging system such as the ZED stereo camera from Stereo Labs, should solve many of the robot's imaging problems [17].
On the software side, identifying a specific person within a group of individuals proved to be the most problematic task. A simple improvement could be the addition of an annealing function for the identification threshold. When the robot moves into an area with significantly different lighting, the threshold would be relaxed so that the human directly in front of it could still be accepted as the tracking subject, and then automatically tightened again, step by step, until no other person could be confused with the target.
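A minimal sketch of how such an annealing function might look is given below; the class name, member names, and constants are hypothetical and are not part of the existing software. The threshold is dropped once when lighting changes abruptly and is then raised by a small step each frame until it returns to its normal strict value.

    // Hypothetical sketch of an annealed identification threshold.
    class IdentificationThreshold {
    public:
        // Call once when lighting shifts abruptly or tracking confidence collapses.
        void relax() { current_ = relaxed_; }

        // Call every frame; anneals the threshold back toward the strict value.
        void update() {
            current_ += annealStep_;
            if (current_ > strict_)
                current_ = strict_;
        }

        // A candidate is accepted when its color-feature similarity clears the threshold.
        bool accepts(double similarity) const { return similarity >= current_; }

    private:
        double strict_     = 0.80;   // normal (strict) similarity required for a match
        double relaxed_    = 0.40;   // lenient value used immediately after relax()
        double annealStep_ = 0.01;   // amount the threshold rises per frame
        double current_    = 0.80;
    };

A caller would invoke relax() when the lighting change is detected, call update() once per frame, and test each candidate with accepts(), so that only the person standing in front of the robot is matched at first and stricter matching resumes automatically.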
A more robust solution might be to replace the color matching identification module with a neural network. The network could be trained on various features of the human, such as facial features, hair color, and clothing color and design, and then used to discriminate between the intended tracking subject and other people in the scene.
The sixth generation robot uses a shortest-path algorithm to determine how best to follow the human target. This could be replaced with path planning software, which would allow the robot to detect obstacles in the depth data and choose the best route, though not necessarily the shortest one, to reach the target. The addition of path planning would allow the robot to safely navigate in less structured environments that include furniture, tight corners, or narrow corridors.
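As a rough indication of what such path planning could build on, the sketch below (hypothetical, not part of the current software) scans the central corridor of the depth map for anything closer than a clearance distance while ignoring the region occupied by the tracked person; a planner would use a test of this kind to reject blocked headings and pick an alternative route.

    #include <vector>

    struct DepthMap {
        int width  = 0;
        int height = 0;
        std::vector<float> meters;  // row-major depth in meters; <= 0 means "no data"
        float at(int x, int y) const { return meters[y * width + x]; }
    };

    // Returns true if an obstacle lies within 'clearance' meters inside the central
    // corridor of the image, ignoring the columns occupied by the tracked person.
    bool pathBlocked(const DepthMap& d, float clearance,
                     int personLeft, int personRight) {
        const int corridorLeft  = d.width / 3;
        const int corridorRight = 2 * d.width / 3;
        const int bandTop       = d.height / 2;   // lower half: floor-level obstacles

        for (int y = bandTop; y < d.height; ++y) {
            for (int x = corridorLeft; x < corridorRight; ++x) {
                if (x >= personLeft && x <= personRight)
                    continue;                      // do not treat the target as an obstacle
                const float z = d.at(x, y);
                if (z > 0.0f && z < clearance)
                    return true;
            }
        }
        return false;
    }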
REFERENCES
[1] S. Piérard, A. Lejeune, and M. Van Droogenbroeck, “A probabilistic pixel-based approach to detect humans in video streams”, IEEE International Conference on Acoustics, Speech and Signal Processing, pages 921-924, 2011.
[2] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection based on a probabilistic assembly of robust part detectors”, The European Conference on Computer Vision, volume 3021/2004, pages 69-82, 2005.
[3] M. Tarokh and P. Ferrari, “Person following by a mobile robot using computer vision and fuzzy logic”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2001.
[4] M. Tarokh and J. Kuo, “Vision based person tracking and following in unstructured environments”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2004.
[5] M. Tarokh, P. Merloti, and J. Duddy, “Vision-based robotic person following under light variations and difficult walking maneuvers”, 2008.
[6] R. Shenoy and M. Tarokh, “Enhancing the autonomous robotic person detection and following using modified Hough transform”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2013.
[7] R. Vu and M. Tarokh, “Investigating the Use of Microsoft Kinect 3D Imaging for Robotic Person Following”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2014.
[8] J. MacCormick, “How does the Kinect work?”, Internet: http://users.dickinson.edu/~jmac/selected-talks/kinect.pdf
[9] “Obtaining Depth Information from Stereo Images”, Ensenso and IDS Imaging Systems, GmbH, 2012.
[10] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection”, in CVPR, pages 886-893, 2005.
[11] H. Rhody, “Lecture 10: Hough Circle Transform”, Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, New York, Oct. 11, 2015.
[12] R. Fisher et al., “Hough Transform”, Internet: http://homepages.inf.ed.ac.uk/rbf/HIPR2/hough.htm, 2003.
[13] A. Solberg, “Hough Transform”, University of Oslo, Internet: http://www.uio.no/studier/emner/matnat/ifi/INF4300/h09/undervisningsmateriale/hough09.pdf, Oct. 21, 2009.
[14] A. Jain, Fundamentals of Digital Image Processing. New Jersey, United States of America: Prentice Hall, pp. 68, 71, 73.
[15] C. McCormick, “Stereo Vision Tutorial”, Internet: http://mccormickml.com/2014/01/10/stereo-vision-tutorial-part-i/, Jan. 10, 2014.
[16] T. Kupaev, “Walk down the times square in New York”, Internet: https://www.youtube.com/watch?v=ezyrSKgcyJw, Jul. 10, 2012.
[17] “HD 720p Walking in Shibuya, Tokyo”, Internet: https://www.youtube.com/watch?v=xY5A8KwdlDk, Apr. 21, 2009.
[18] “Jetson TK1”, Internet: http://elinux.org/Jetson_TK1
[19] V. Anuvakya and M. Tarokh, “Robotic Person Following in a Crowd Using Infrared and Vision Sensors”, M.S. Thesis, Dept. of Computer Science, San Diego State University, San Diego, CA, 2014.
[20] D. Ballard and C. Brown, “The Hough Method for Curve Detection”, in Computer Vision, Rochester, New York, Prentice Hall, 1982, pp. 123-131.
[21] S. Blathur, “CIE-Lab Color Space”, Internet: http://sheriffblathur.blogspot.com/2013/07/cie-lab-color-space.html, Jul. 18, 2013.
[22] “Color Differences and Tolerances”, datacolor, Lawrenceville, New Jersey, Jan. 2013.
[23] D. Brainard, “Color Appearance and Color Difference Specification”, in The Science of Color, Santa Barbara, California, University of California, Santa Barbara, pp. 192-213.
[24] D. Gavrila, “A Bayesian, Exemplar-Based Approach to Hierarchical Shape Matching”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 29, No. 8, August 2007.
[25] P. Felzenszwalb et al., “Object Detection with Discriminatively Trained Part-Based Models”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 32, No. 9, September 2010.
[26] M. Barnich, S. Jodogne, and M. Van Droogenbroeck, “Robust Analysis of Silhouettes by Morphological Size Distributions”, in Lecture Notes on Computer Science, Volume 4179, pp. 734-745, 2006.
[27] N. Navab and C. Unger, “Stereo Vision I: Rectification and Disparity”, Technische Universitat Munchen, Internet: http://campar.in.tum.de/twiki/pub/Chair/TeachingWs11Cv2/3D_CV2_WS_2011_Rectification_Disparity.pdf
[28] Z. Schuessler, “Delta E 101”, Internet: https://zschuessler.github.io/DeltaE/learn/
[29] R. Rao, “Lecture 16: Stereo and 3D Vision”, University of Washington, Internet: https://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect16.pdf, Mar. 6, 2009.
[30] L. Iocchi, “Stereo Vision: Triangulation”, Universita di Roma, Internet: http://www.dis.uniroma1.it/~iocchi/stereo/triang.html, Apr. 6, 1998.
[31] A. Barry, “Wide Angle Lens Stereo Calibration with OpenCV”, Internet: http://abarry.org/wide-angle-lens-stereo-calibration-with-opencv/
[32] N. Ogale, “A Survey of Techniques for Human Detection from Video”, University of Maryland, College Park, MD.
[33] A. Elgammal, “CS 534: Computer Vision Camera Calibration”, Rutgers University, Internet: http://www.cs.rutgers.edu/~elgammal/classes/cs534/lectures/Calibration.pdf
[34] V. Prisacariu and I. Reid, “fastHOG – A Real-Time GPU Implementation of HOG”, University of Oxford, Oxford, UK, Jul. 14, 2009.