A PARALLEL ARCHITECTURE FOR HARDWARE FACE DETECTION
T. Theocharides
[email protected]
University of Cyprus

N. Vijaykrishnan, M. J. Irwin
{vijay, mji}@cse.psu.edu
Penn State University
ABSTRACT
Face detection is a very important application in the field of
machine vision. In this paper, we present a scalable parallel
architecture that performs face detection using the AdaBoost
algorithm. Experimental results show that the proposed
architecture detects faces with the same accuracy as the software
implementation while processing real-time video at a frame rate
of 52 frames per second.
1. INTRODUCTION
Face detection is defined as the process of identifying all
image regions that contain a face, regardless of the position, the
orientation, and the environmental conditions in the image. Face
detection is a necessary operation in a wide range of applications:
from control and security applications to identification systems, it
plays a primary role. It is the primary step towards face
recognition [2] and serves as a precursor to multiple applications
such as identification, monitoring, tracking, etc. Face
detection algorithms have been heavily researched, and have
improved drastically both in terms of performance and speed. So
far, however, the operation has been performed largely in software
[2, 4]. State-of-the-art software face detection achieves up to 15
image frames per second [5] under favorable circumstances, and as
such is not well suited to real-time video deployment. A fast
hardware implementation that can be integrated either into a
generic processor or as part of a larger system, directly attached to
the video source (such as a security camera or a robot's camera), is
therefore desirable. There have been a few attempts at hardware
implementations of face detection on multi-FPGA boards and
multiprocessor platforms using programmable hardware
[1, 3, 6-9], but these do not achieve real-time frame rates at high
accuracy. A stand-alone, accurate, real-time-capable detection
system that can connect to existing video interfaces is therefore
preferable. Such a system can be mapped either onto an FPGA or
onto an Application-Specific Integrated Circuit (ASIC), and can be
placed on a camera, used as a co-processor, integrated into an
embedded platform, or deployed as an entirely stand-alone system. In this
paper, we present a scalable parallel architecture that implements
one of the most widely accepted face detection algorithms, the
AdaBoost face detection technique [5]. Accepted by the computer
vision community as the state-of-the-art in terms of speed and
accuracy [2], the AdaBoost algorithm uses the boosting
classification approach [5]. The approach presents two significant
advantages: it quickly eliminates non-face regions, and the
classification process itself is extremely fast. The classifiers used
are called features, each feature consisting of a set of black
and white rectangles. The result of a feature is the sum of the
pixels under the white rectangle minus the sum of the pixels under
the black rectangle. A group of features composes a stage, the
outcome of which is the sum of the feature outcomes. The outcome
of a stage determines whether the examined image region
contains a face or not. The classification process benefits from
using the integral and integral squared images, a transformation of
the original image. A location in the integral image holds the sum
of the intensity values of the pixels located above and to the left of
the corresponding location in the original image. The integral image
and the rectangle computation that benefits from it are shown in Figure 1.
This work was supported in part by grants from NSF CAREER
0093085 and MARCO/DARPA GSRC-PAS.
Figure 1: Rectangles, features, the integral image, and the
rectangle computation
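
To make the feature and integral image definitions concrete, the following minimal Python sketch (our own illustration, not the paper's hardware) builds the integral image and evaluates a two-rectangle feature using the four-corner lookup of Figure 1; all function and variable names are ours:

# Minimal software sketch of the integral image and a two-rectangle
# feature (illustrative only; names and layout are our own assumptions).

def integral_image(img):
    # ii[y][x] = sum of img over all rows <= y and columns <= x
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of the w-by-h rectangle with upper-left corner (x, y),
    # obtained from four corner lookups as in Figure 1.
    a = ii[y - 1][x - 1] if x > 0 and y > 0 else 0  # above-left
    b = ii[y - 1][x + w - 1] if y > 0 else 0        # above-right
    c = ii[y + h - 1][x - 1] if x > 0 else 0        # below-left
    d = ii[y + h - 1][x + w - 1]                    # below-right
    return d - b - c + a

def feature_value(ii, white, black):
    # Feature outcome: white rectangle sum minus black rectangle sum.
    return rect_sum(ii, *white) - rect_sum(ii, *black)

For instance, with white = (0, 0, 2, 2) and black = (2, 0, 2, 2), feature_value returns the sum of the left 2×2 block minus the sum of the adjacent 2×2 block, each obtained in constant time regardless of rectangle size.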
The size of the feature determines the size of the image region
being searched. Once the base size has been evaluated over all
regions, the features are enlarged in subsequent scales and
evaluated at each scale, enabling the detection of larger faces.
Additional algorithm details are omitted due to space limitations;
readers are encouraged to refer to [5] for details. The algorithm
outline is shown in Figure 2. Next, we describe the proposed architecture.
[Figure 2 flowchart: the source image is converted into an integral
image; starting at scale 1.0, scaled features from the feature pool
are evaluated over each search window position (x, y); a window must
pass stages 1 through N, with any failing stage rejecting it; windows
passing all stages are reported as faces at (x, y; scale); once all
positions are covered, the scale is multiplied by 1.2, until the
maximum scale is reached.]
Figure 2: Algorithm Outline
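
In software terms, the outline of Figure 2 corresponds to the loop structure sketched below in Python; this is our own hedged rendering, in which the 24-pixel base window, the stage/feature objects, and the maximum scale are illustrative placeholders rather than the paper's interface:

# Illustrative sketch of the detection loop of Figure 2 (our own
# simplification; 'stages', 'f.evaluate', and the constants are placeholders).

def detect_faces(ii, rows, cols, stages, base=24, max_scale=8.0):
    faces = []
    scale = 1.0
    while scale <= max_scale:
        win = int(base * scale)               # current search-window side
        for y in range(rows - win + 1):
            for x in range(cols - win + 1):
                if passes_cascade(ii, x, y, scale, stages):
                    faces.append((x, y, scale))
        scale *= 1.2                          # enlarge the features (Figure 2)
    return faces

def passes_cascade(ii, x, y, scale, stages):
    # A window must pass every stage; the first failing stage rejects it
    # immediately, which is how non-face regions are eliminated quickly.
    for stage in stages:
        total = sum(f.evaluate(ii, x, y, scale) for f in stage.features)
        if total < stage.threshold:
            return False
    return True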
2. PROPOSED ARCHITECTURE
The algorithm requires parallel access to the integral and
integral squared images. As such, we structure our architecture as a
grid array processor. The array serves both as memory, storing the
computation data, and as a data transfer unit, providing parallel
access to the integral image. Essentially, the system consists of
three major components: the collection and data transfer units
(CDTUs), the multiplication and evaluation units (MEUs), and the
control units (CUs). The CDTUs serve as data storage elements for
the integral image. In our architecture, the size of the array depends
on the size of the image: an image of m×n pixels requires an m×n
array of CDTUs. While this might suggest that the architecture
requires excessive resources, techniques such as image scaling can
be used to reduce the input image frame and thus keep the size of
the array within reasonable limits (for example, scaling a 320×240
frame down by a factor of two in each dimension reduces the
required array from 76,800 to 19,200 CDTUs). Additionally, image
partitioning can be used in conjunction with image scaling to
produce image sizes that fit the hardware budget. A floorplan of the
system is shown in Figure 3, illustrating the location of each unit
and the data movement across the system. Due to space limitations
we omit the detailed description of each unit; however, we outline
the computation next.
Figure 3: Architecture Floorplan

The operation is partitioned into the following stages:
configuration; computation of the integral and integral squared
images and of the image variance; computation of the rectangles
per feature; feature computation; stage evaluation; and image
evaluation. When a feature size increases (i.e., when a search
window size increases), the computation is essentially treated as a
new one. When the image has been searched at all search window
sizes, the system is ready for the next image frame. In each case,
all three units collaborate to perform the computation. Incoming
pixels stream into the processor in parallel along all of its rows
and are shifted in row-wise every cycle. First, the integral image is
computed. The computation consists of horizontal and vertical
shifts and additions. Incoming pixels are shifted inside the array on
each row. Depending on the current pixel column, each of the
computation units performs one of three operations: it adds the
incoming pixel value to its stored sum, or it propagates the
incoming value to the next-in-row processing element while either
shifting and adding the accumulated sum in the vertical dimension
(downwards) or doing nothing in the vertical dimension. The entire
computation takes 2 × [m + (m-1) + (n-1)] cycles for an input
image of n rows by m columns.
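
As a software-level model of this dataflow (a simplification of ours, not the RTL), the integral image computation can be viewed as a horizontal running-sum phase followed by a vertical one, with the cycle count quoted above:

# Software model of the grid computation of the integral image
# (our simplification of the CDTU dataflow, not the actual hardware).

def integral_image_grid(img):
    rows, cols = len(img), len(img[0])
    # Horizontal phase: pixels shift in row-wise and each CDTU
    # accumulates a running sum along its row.
    h = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        acc = 0
        for x in range(cols):
            acc += img[y][x]
            h[y][x] = acc
    # Vertical phase: accumulated sums shift and add downwards.
    ii = [[0] * cols for _ in range(rows)]
    for x in range(cols):
        acc = 0
        for y in range(rows):
            acc += h[y][x]
            ii[y][x] = acc
    return ii

def integral_cycles(m, n):
    # Cycle count quoted in the text for an n-row by m-column image.
    return 2 * (m + (m - 1) + (n - 1))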
Next, the rectangle computation takes place. For each rectangle in
each feature, each corner point is shifted towards its collection
point. The points move one at a time, but in parallel for all
rectangles in the array. At each collection point, the arriving point
is either added to or subtracted from an accumulated sum, and the
rectangle value is complete once all points of the rectangle have
arrived. Each point requires dx+dy cycles to reach the collection
point, where dx and dy are the offset coordinates of the point with
respect to the upper left corner of the search window. Once the
rectangle sums for a single feature have been collected, they are
stored in the CDTU that represents the feature's starting corner.
Next, all the collected sums are shifted leftwards towards the
MEUs, one sum at a time per MEU, until all sums have arrived in
their MEU, where the rectangle sums are combined with the
training data of each feature in order to evaluate the feature. The
feature result is shifted in a toroidal fashion to the CDTU on the far
right of the grid, to continue the computation. Eventually, when all
feature results are computed, they are stored back into the CDTUs
in the grid and the computation resumes with the next feature.
When all stages complete for a single scale, the flagged locations
contain a face. If the scale computed is the last one, the
computation ends, and each search window containing a face still
has its flag bit set inside the representing CDTU. Each location that
contains a face is then shifted to the right and out of the grid array,
to the output of the processor, for the host application to use.
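
Two small sketches, under our own simplifying assumptions, summarize the timing and the evaluation just described. The collection latency of a feature is set by its slowest corner, and the exact form of the "training data" (the weights, threshold, and polarity below) is a hypothetical placeholder, since the paper does not spell out the MEU arithmetic:

# Collection latency for one feature: corner points travel in parallel,
# each needing dx + dy hops to reach its collection point (simplified model).
def collection_cycles(corners):
    # corners: list of (dx, dy) offsets from the search-window origin
    return max(dx + dy for dx, dy in corners)

# Hypothetical MEU-style feature evaluation: a weighted combination of
# rectangle sums compared against a trained threshold.
def evaluate_feature(rect_sums, weights, threshold, polarity=1):
    value = sum(w * s for w, s in zip(weights, rect_sums))
    return polarity * value < polarity * threshold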
3. PERFORMANCE EVALUATION
To evaluate the performance of the proposed implementation,
we designed and verified the architecture using Verilog HDL and
ModelSim®. We then synthesized the architecture using a
commercial 90 nm library, targeting a 500 MHz clock frequency.
We used the Intel Open Source Computer Vision Library
(OpenCV) [4] for the training data, and evaluated our design using
sample images containing various numbers and sizes of faces. Our
synthesized design consumes an area of approximately 115 mm².
We computed the average number of cycles per test frame, which
at the 500 MHz target corresponds to roughly 9.6 million cycles
per frame, or approximately 52 frames per second.
4. CONCLUSION
This paper presented a parallel architecture that performs face
detection using the AdaBoost algorithm. The architecture targets
both parallel computation and parallel data movement, and is
capable of processing on average 52 frames per second.
5. REFERENCES
1. R. McCready, “Real-Time Face Detection on a Configurable
Hardware System”, International Symposium on Field
Programmable Gate Arrays, October 2000.
2. E. Hjelmås, B. K. Low, “Face Detection: A Survey”, Computer
Vision and Image Understanding, Vol. 83, No. 3, September 2001.
3. B. Srijanto, “Implementing a Neural Network based face
detection onto a reconfigurable computing system using
Champion”, MS Thesis, University of Tennessee, August 2002.
4. Intel Open Source Computer Vision Library, Oct. 2005,
http://www.intel.com/technology/computing/opencv/index.htm.
5. P. Viola and M. Jones, “Robust Real-time Object Detection”,
International Journal of Computer Vision, October 2002.
6. M. Reuvers, “Face Detection on the INCA+ System”, Masters
Thesis, University of Amsterdam, 2004.
7. T. Theocharides et al., “Embedded Hardware Face Detection”,
Proc. International Conference on VLSI Design, Mumbai, India, January 2004.
8. V. Kianzad et al., “An Architectural Level Design Methodology
for Embedded Face Detection”, Proceedings of the International
Conference on Hardware/Software Codesign and System
Synthesis, New York City, September 2005.
9. H. Broers et al., “Face Detection and Recognition on a Smart
Camera”, Proceedings of the Advanced Concepts for Intelligent
Vision Systems Conference, Belgium, August 2004.