A PARALLEL ARCHITECTURE FOR HARDWARE FACE DETECTION

T. Theocharides, University of Cyprus, [email protected]
N. Vijaykrishnan and M. J. Irwin, Penn State University, {vijay, mji}@cse.psu.edu

This work was supported in part by grants from NSF CAREER 0093085 and MARCO/DARPA GSRC-PAS.

ABSTRACT

Face detection is an important application in the field of machine vision. In this paper, we present a scalable parallel architecture that performs face detection using the AdaBoost algorithm. Experimental results show that the proposed architecture detects faces with the same accuracy as the software implementation, and processes real-time video at a rate of 52 frames per second.

1. INTRODUCTION

Face detection is the process of identifying all image regions that contain a face, regardless of the position, orientation, and environmental conditions in the image. It is a necessary operation in a wide range of applications, from control and security applications to identification systems. It is the primary step towards face recognition [2] and a precursor to applications such as identification, monitoring, and tracking. Face detection algorithms have been heavily researched and have improved drastically in both accuracy and speed. So far, however, the operation has been performed almost exclusively in software [2, 4]. State-of-the-art software face detection achieves up to 15 image frames per second [5] under favorable circumstances, which is not sufficient for real-time video deployment. A fast hardware implementation that can be integrated either on a generic processor or as part of a larger system, directly attached to the video source, such as a security camera or a robot's camera, is therefore desirable. There have been a few attempts at hardware implementations of face detection on multi-FPGA boards and multiprocessor platforms using programmable hardware [1, 3, 6-9], but they do not achieve real-time frame rates together with high accuracy. A stand-alone, accurate, real-time-capable detection system that can interface with existing video interfaces is therefore more desirable. Such a system can be mapped either on an FPGA or as an Application-Specific Integrated Circuit (ASIC), and can be placed on a camera, used as a co-processor, integrated into an embedded platform, or deployed as an entirely stand-alone system.

In this paper, we present a scalable parallel architecture that implements one of the most widely accepted face detection algorithms, the AdaBoost face detection technique [5]. Accepted by the computer vision community as the state of the art in terms of speed and accuracy [2], the AdaBoost algorithm uses the boosting classification approach [5]. The approach has two significant advantages: it quickly eliminates non-face regions, and the classification process itself is extremely fast. The classifiers used are called features, each consisting of a set of black and white rectangles. The result of a feature is the sum of the pixels under the white rectangles minus the sum of the pixels under the black rectangles. A group of features composes a stage, whose outcome is the sum of the feature outcomes; the stage outcome determines whether the examined image region contains a face. The classification process benefits from the integral and integral squared images, transformations of the original image. A location in the integral image holds the sum of the intensity values of all pixels above and to the left of that location in the original image, so the sum of the pixels under any rectangle can be obtained from only four integral-image lookups. The integral image and the rectangle computation that benefits from it are shown in Figure 1.

Figure 1: Rectangles, features, the integral image, and rectangle computation
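To make the integral-image computation concrete, the following minimal sketch (plain C++, our illustration rather than the paper's hardware description; the function and variable names are ours) builds the integral image in a single pass, after which any rectangle sum costs four lookups regardless of rectangle size:

```cpp
#include <cstdint>
#include <vector>

// Integral image: entry (y, x) holds the sum of all source pixels above
// and to the left of that location. A one-element border of zeros
// simplifies the boundary cases.
std::vector<std::vector<uint64_t>> integralImage(
        const std::vector<std::vector<uint8_t>>& img) {
    const size_t rows = img.size(), cols = img[0].size();
    std::vector<std::vector<uint64_t>> ii(
        rows + 1, std::vector<uint64_t>(cols + 1, 0));
    for (size_t y = 1; y <= rows; ++y)
        for (size_t x = 1; x <= cols; ++x)
            ii[y][x] = img[y - 1][x - 1]
                     + ii[y - 1][x] + ii[y][x - 1] - ii[y - 1][x - 1];
    return ii;
}

// Sum of the pixels inside the rectangle with top-left corner (x, y),
// width w, and height h: four lookups, independent of rectangle size.
uint64_t rectSum(const std::vector<std::vector<uint64_t>>& ii,
                 size_t x, size_t y, size_t w, size_t h) {
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
}
```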
The size of the feature determines the size of the image region being searched. Once the base feature size has been evaluated over all regions, the features are enlarged and re-evaluated at each subsequent scale, enabling the detection of larger faces. Additional algorithm details are omitted due to space limitations; readers are referred to [5]. The algorithm outline is shown in Figure 2.

Figure 2: Algorithm outline (the source image is converted to an integral image; for each search window position and scale, rectangle corners are computed from the scaled feature pool and the window is evaluated through cascade stages 1 to N, exiting early on a stage failure; a window that passes all stages is reported as a face at (x, y; scale); the scale starts at 1.0 and is multiplied by 1.2 until the maximum scale is reached)

2. PROPOSED ARCHITECTURE

It is evident that we need parallel data access to the integral and integral squared images. We therefore propose a grid array processor as the structure of our architecture. The array serves both as memory, storing the computation data, and as a data transfer unit, providing parallel access to the integral image. The system consists of three major components: the collection and data transfer units (CDTUs), the multiplication and evaluation units (MEUs), and the control units (CUs). The CDTUs serve as data storage elements for the integral image. In our architecture, the size of the array depends on the size of the image: an image of m×n pixels requires an m×n array of CDTUs. While this might suggest that the architecture requires excessive resources, techniques such as image scaling can be used to reduce the input frame size and keep the array within reasonable limits. Additionally, image partitioning can be used in conjunction with image scaling to produce image sizes that fit the hardware budget. A floorplan of the system, illustrating the location of each unit and the data movement across the system, is shown in Figure 3. Due to space limitations we omit the detailed description of each unit; however, we outline the computation next.

Figure 3: Architecture floorplan

The operation is partitioned into the following stages: configuration; computation of the integral and integral squared images; computation of the image variance; computation of the rectangles of each feature; feature computation; stage evaluation; and image evaluation. When a feature size increases (i.e., when the search window size increases), the computation is treated as a new one. When the image has been searched at all search window sizes, the system is ready for the next image frame. In each stage, all three unit types collaborate to perform the computation.

Incoming pixels stream into the processor in parallel along all rows and are shifted in row-wise every cycle. First, the integral image is computed. The computation consists of horizontal and vertical shifts and additions. Incoming pixels are shifted along each row and, depending on the current pixel column, each computation unit performs one of three operations: it adds the incoming pixel value to its stored sum; it propagates the incoming value to the next-in-row processing element while shifting and adding the accumulated sum in the vertical dimension (downwards); or it propagates the value while doing nothing in the vertical dimension. The entire computation takes 2[m + (m − 1) + (n − 1)] cycles for an input image of n rows by m columns (for example, a 320×240 frame would take 2[320 + 319 + 239] = 1,756 cycles).

Next comes the rectangle computation. For each rectangle in each feature, each corner point is shifted towards a collection point. The points move one at a time, but in parallel for all rectangles in the array. At each collection point, the arriving value is either added to or subtracted from an accumulated sum, and the rectangle value is complete once all corner points of the rectangle have arrived. Each point requires dx + dy cycles to reach its collection point, where dx and dy are the offsets of the point relative to the upper-left corner of the search window. When the rectangle sums for a single feature have been collected, they are stored in the CDTU that represents the starting corner of that feature. All collected sums are then shifted leftwards towards the MEUs, one sum at a time per MEU. Moving from left to right, all sums eventually arrive at an MEU, where the rectangle sums are combined with the training data of the feature in order to evaluate it. The feature result is shifted in a toroidal fashion to the CDTU on the far right of the grid so that the computation can continue. When all feature results have been computed, they are stored back into the CDTUs in the grid and the computation resumes with the next feature. When all stages complete for a single scale, the flagged locations contain a face. If the scale just computed is the last one, the computation ends, and each search window containing a face still has its flag bit set in the corresponding CDTU. Each location that contains a face is then shifted rightwards out of the grid array to the processor output, for the host application to use.
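The evaluation performed in the MEUs amounts to the standard AdaBoost cascade test from [5]: each feature's weighted rectangle sums are compared against a trained threshold, the per-feature outcomes are accumulated into a stage sum, and the stage sum is compared against the stage threshold. The following sketch shows that logic in software form; the struct layout, field names, and thresholds are our illustration, not the paper's actual training-data format:

```cpp
#include <vector>

// One weak classifier: a weighted combination of rectangle sums compared
// against a trained threshold (fields are illustrative).
struct Feature {
    std::vector<double> rectSums;    // filled in by the rectangle collection
    std::vector<double> rectWeights; // trained weights (white +, black -)
    double threshold;                // trained feature threshold
    double passValue, failValue;     // contribution to the stage sum
};

// A stage passes if the accumulated feature outcomes exceed its threshold.
struct Stage {
    std::vector<Feature> features;
    double threshold;
};

bool evaluateStage(const Stage& stage) {
    double stageSum = 0.0;
    for (const Feature& f : stage.features) {
        double v = 0.0;
        for (size_t i = 0; i < f.rectSums.size(); ++i)
            v += f.rectWeights[i] * f.rectSums[i];
        stageSum += (v < f.threshold) ? f.failValue : f.passValue;
    }
    return stageSum >= stage.threshold;  // fail => window rejected early
}
```

The early-exit property of the cascade comes from this test: a window that fails any stage is discarded immediately, which is why most non-face regions are eliminated after only a few features.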
3. PERFORMANCE EVALUATION

To evaluate the performance of the proposed implementation, we designed and verified the architecture using Verilog HDL and ModelSim®. We then synthesized the architecture using a commercial 90 nm library, targeting a 500 MHz clock frequency. We used the Intel Open Source Computer Vision Library (OpenCV) [4] for the training data, and evaluated the design on sample images containing various numbers and sizes of faces. The synthesized design occupies an area of approximately 115 mm². From the average number of cycles per test frame, we obtained a throughput estimate of roughly 52 frames per second.
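As a rough sanity check (our arithmetic from the stated clock frequency and frame rate, not a figure reported above), the numbers imply an average per-frame cycle budget of

\[
\frac{500 \times 10^{6}\ \text{cycles/s}}{52\ \text{frames/s}} \approx 9.6 \times 10^{6}\ \text{cycles per frame}.
\]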
4. CONCLUSION

This paper presented a parallel architecture that performs face detection using the AdaBoost algorithm. The architecture targets both parallel computation and parallel data movement, and is capable of processing on average 52 frames per second.

5. REFERENCES

1. R. McCready, "Real-Time Face Detection on a Configurable Hardware System", International Symposium on Field Programmable Gate Arrays, October 2000.
2. E. Hjelmås and B. K. Low, "Face Detection: A Survey", Computer Vision and Image Understanding, Vol. 83, No. 3, September 2001.
3. B. Srijanto, "Implementing a Neural Network Based Face Detection onto a Reconfigurable Computing System Using Champion", Master's Thesis, University of Tennessee, August 2002.
4. Intel Open Source Computer Vision Library, October 2005, http://www.intel.com/technology/computing/opencv/index.htm.
5. P. Viola and M. Jones, "Robust Real-time Object Detection", International Journal of Computer Vision, October 2002.
6. M. Reuvers, "Face Detection on the INCA+ System", Master's Thesis, University of Amsterdam, 2004.
7. T. Theocharides et al., "Embedded Hardware Face Detection", Proceedings of the International Conference on VLSI Design, Mumbai, India, January 2004.
8. V. Kianzad et al., "An Architectural Level Design Methodology for Embedded Face Detection", Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, New York City, September 2005.
9. H. Broers et al., "Face Detection and Recognition on a Smart Camera", Proceedings of the Advanced Concepts for Intelligent Vision Systems Conference, Belgium, August 2004.