An Efficient GPGPU Implementation of a Viola-Jones Classifier Based Face Detection Algorithm
Sharmila Shridhar, Vinay Gangadhar, Ram Sai Manoj
ECE 759 Project Presentation, Fall 2015
University of Wisconsin - Madison

Executive Summary
• Viola-Jones classifier based face detection implemented on a GPU
• Detection done in 2 phases:
  – Nearest Neighbor and Integral Image
  – Scanning Window and Haar Feature Detection
• Optimizations applied to both phases
  – Shared memory, bank-conflict elimination, thread-block (TB) divergence, etc.
• Up to 5.3x speedup compared to single-threaded CPU performance
• GPU performs better for larger images

Introduction
• Face detection is a widely deployed algorithm
  – Auto-tagging pictures in Facebook
  – Easy search in Google Photos
  – Biometric security access
  – Threat activity detection
  – Face mapping
• Human faces share similar properties, which Haar features capture
• Different types of algorithms exist
  – Motion based
  – Color based
• We implement the Viola-Jones classifier based algorithm

Motivation
• Face detection algorithms have a large amount of data-level parallelism (DLP)
• Each window is processed independently to detect a face
• GPU resources can therefore be utilized efficiently
• Goal: a performance- and energy-efficient face detection implementation
• This application can be used to
  – showcase the benefits of a GPGPU implementation
  – showcase the incremental benefits of fine-tuning the code (optimizations)

Outline
• Introduction
• Motivation
• Viola-Jones Background
• Nearest Neighbor and Integral Image
• Haar Based Cascade Classifier Implementation
• Evaluation and Results
• Conclusion

Background - Viola-Jones Algorithm (1)
• Haar feature selection
  – Each feature consists of white and black rectangles
  – Subtract the white region's pixel sum from the black region's pixel sum
  – If the difference exceeds a threshold, the region contains the Haar feature
• We use pre-selected Haar features as classifiers for face detection

Background - Viola-Jones Algorithm (2)
• Integral image calculation
  – The pixel-sum computation for each Haar rectangle dominates the work
  – Sped up by using an integral image
  – Integral sum at any pixel: IS(x, y) = Σ_{x'≤x, y'≤y} v(x', y'), where v(x', y') is the value of pixel (x', y')
  – With the integral image, any rectangle sum takes only 4 lookups: sum = IS(D) - IS(B) - IS(C) + IS(A), where A, B, C, D are the rectangle's corner coordinates

Background - Viola-Jones Algorithm (3)
• Cascade classifier built with the AdaBoost algorithm
  – Outputs of several weak classifiers (Haar features) are combined into a weighted sum to form a strong classifier
  – Strong classifiers are then concatenated into a cascade classifier

Background - Image Pyramid
• We use a 25x25 window size for each feature
• But a face in the image can be of any size
• Use an image pyramid to make face detection scale invariant

Implementation Flow
1. Read the source image and the cascade classifier parameters
2. Nearest Neighbor: downscale the image by a factor of 1.2 per iteration, while (image / scaling factor) >= 25 (the 25x25 detection window)
3. Integral Image: compute the sum of pixels from [0,0] to [x,y] and the sum of squares of pixels from [0,0] to [x,y]
4. Set the image up for Haar detection: compute the image coordinates for each Haar feature
5. Run the cascade classifier, shifting the detection window across the image
   – If the integral sum is below the threshold for any stage, skip this window for further stages
   – If the integral sum exceeds the threshold for all stages, a face is detected; store the coordinates
6. Group the rectangles and draw them around the faces, producing the image with detected faces

Nearest Neighbor (NN)
• Computes the image pixels for the downscaled (DS) image
• Scaling factor of 1.2 used in our implementation
• Detection window of 25x25 pixels
• The image is downscaled until it equals the detection window
• Why downscale? To make detection scale invariant (the image pyramid above)
• Parallelization scope (a sketch follows below)
  – Each DS image pixel is computed from the scale-factor offset and the width and height of the source image
  – Map each pixel position (or several) to a single thread
• Example: an 8x8 image (pixel values 0..63, row major) downscaled with a scaling factor of 2 keeps every other pixel in each dimension:

  0  2  4  6
  16 18 20 22
  32 34 36 38
  48 50 52 54
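A minimal CUDA sketch of the one-thread-per-output-pixel mapping described above; the kernel name, signature, and float-based index computation are illustrative assumptions, not the project's actual code.

```cuda
// Nearest-neighbor downscale: one thread computes one destination pixel.
// Illustrative sketch only -- names and signature are assumptions.
__global__ void nearestNeighborKernel(const unsigned char *src, int srcW, int srcH,
                                      unsigned char *dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the destination pixel back to its nearest source pixel
    // (scale factor = srcW / dstW, e.g. 1.2 per pyramid iteration).
    int sx = min((int)(x * (float)srcW / dstW), srcW - 1);
    int sy = min((int)(y * (float)srcH / dstH), srcH - 1);
    dst[y * dstW + x] = src[sy * srcW + sx];
}
```

Launched with a 2-D grid covering the downscaled image, e.g. `dim3 block(16, 16); dim3 grid((dstW + 15) / 16, (dstH + 15) / 16);`.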
Integral Image (II)
• II(x, y) is the sum of all pixels above and to the left of (x, y), for each (x, y) in the image
• Integral image = prefix scan along each row, then along each column
• Parallelization scope
  – The prefix scan of each row is independent of the other rows: RowScan (RS)
  – The prefix scan along the columns: ColumnScan (CS)
• Similarly compute the square integral sum (sum of squares)
  – Why? The variance of the pixels is needed to normalize the Haar rectangle values: Var(X) = E(X²) - (E(X))²
• Example (continuing the 4x4 downscaled image above):

  0  2  4  6        0  2   6   12        0  2   6   12
  16 18 20 22  RS   16 34  54  76    CS  16 36  60  88
  32 34 36 38  -->  32 66  102 140  -->  48 102 162 228
  48 50 52 54       48 98  150 204       96 200 312 432

NN + II Implementation
• NN and the II sums are implemented in 4 separate kernels
• Kernel 1: Nearest Neighbor (NN) & RowScan (RS) - downscaled image ready
• Kernel 2: Matrix transpose
• Kernel 3: RowScan (acts as the ColumnScan on the transposed image)
• Kernel 4: Matrix transpose
• Integral sum & square integral sum are ready at the end of kernel 4

Kernel 1: Nearest Neighbor (NN) & RowScan (RS)
• Combining NN & RS eliminates one round trip to global memory (no intermediate store between kernels)
• Kernel configuration (w, h = width & height of the downscaled image):
  – Threads per block = smallest power of 2 >= w (a constraint of the RowScan algorithm)
  – Blocks = h
  – Shared memory: [2 * BLOCKSIZE + 1] elements
• RowScan = inclusive prefix scan of each row of the image
  – Harris-Sengupta-Owen algorithm (upsweep & downsweep); a simplified scan sketch follows below
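The slides cite the work-efficient Harris-Sengupta-Owen upsweep/downsweep scan; as a simpler stand-in with the same one-block-per-row, power-of-two-threads launch structure, here is a minimal inclusive Hillis-Steele row scan. All names are illustrative assumptions.

```cuda
// Illustrative inclusive row scan (Hillis-Steele), one thread block per row.
// The project used the Harris-Sengupta-Owen (upsweep/downsweep) scan; this
// simpler variant only shows the launch structure.
// Assumes blockDim.x is a power of two >= width.
__global__ void rowScanInclusive(const int *in, int *out, int width)
{
    extern __shared__ int buf[];              // 2 * blockDim.x ints at launch
    int *cur = buf, *nxt = buf + blockDim.x;  // double buffer in shared memory
    int tid = threadIdx.x, row = blockIdx.x;

    cur[tid] = (tid < width) ? in[row * width + tid] : 0;
    __syncthreads();

    for (int off = 1; off < blockDim.x; off <<= 1) {
        // Each pass adds in the partial sum 'off' positions to the left.
        nxt[tid] = (tid >= off) ? cur[tid] + cur[tid - off] : cur[tid];
        __syncthreads();
        int *t = cur; cur = nxt; nxt = t;     // swap buffer roles each pass
    }
    if (tid < width) out[row * width + tid] = cur[tid];
}
```

With extern shared memory the buffer size is supplied at launch, e.g. `rowScanInclusive<<<h, tpb, 2 * tpb * sizeof(int)>>>(d_img, d_sum, w);` with tpb = w rounded up to a power of two; this launch-time sizing is exactly what the "extern shared memory" optimization discussed later relies on.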
Kernels 2, 3 & 4: ColumnScan (CS)
• The ColumnScan is replaced with kernels 2, 3 & 4
• Why? It could have been done directly as a RowScan followed by a ColumnScan - the straightforward implementation - but a direct ColumnScan is slow, as shown next

Alternate Method for ColumnScan - Motivation for Transpose
• ColumnScan performs a prefix scan along each column, so threads T0, T1, T2, T3 each walk down one column
• In the row-major global memory (GM) layout, the simultaneous accesses of consecutive threads land a full row apart at every step
• These global memory reads are not coalesced
• Time consuming (400-500 cycles per access!), and each thread needs a separate access

ColumnScan Breakdown
• Downscaled image after RowScan --> Transpose --> RowScan --> Transpose --> image after integral sum:

  0  2   6   12       0  16 32  48        0  16 48  96        0  2   6   12
  16 34  54  76   T   2  34 66  98    RS  2  36 102 200   T   16 36  60  88
  32 66  102 140 -->  6  54 102 150  -->  6  60 162 312  -->  48 102 162 228
  48 98  150 204      12 76 140 204       12 88 228 432       96 200 312 432

Matrix Transpose
• Implemented in a tiled fashion - coalesced reads & writes to GM
• Further optimizations continue below

Optimizations in Transpose & RS Kernels
Matrix transpose
• GM coalescing is only possible for row-wise accesses
  – Naive transpose: reads are coalesced along rows, but writes walk down columns of the output (uncoalesced)
  – Optimized version: stage each tile in shared memory [BLOCKSIZE][BLOCKSIZE], so both the GM read and the GM write are row-wise; the transposition happens inside shared memory
• Reading the shared-memory tile along a column - any SM bank conflicts?
  – There are 32 banks of SM; a conflict occurs when threads in a warp access the same bank
  – With BLOCKSIZE = 16, changing SM[Ty][Tx] to SM[Tx][Ty] means element (Tx, Ty) maps to bank (Tx * 16 + Ty) % 32, so e.g. (0, 0) & (2, 0) both map to bank 0 - a conflict
  – Eliminated by padding: shared memory [BLOCKSIZE][BLOCKSIZE + 1] (sketch below)
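A compact sketch of the padded tiled transpose described above; the tile size, kernel name, and signature are assumptions. The +1 column of padding spreads a column read of the tile across 16 different banks instead of hitting one bank repeatedly.

```cuda
#define TILE 16

// Tiled matrix transpose: coalesced global reads AND writes, with the
// [TILE][TILE + 1] padding described above to avoid shared-memory bank
// conflicts. Illustrative sketch -- names and signature are assumptions.
__global__ void transposeKernel(const int *in, int *out, int w, int h)
{
    __shared__ int tile[TILE][TILE + 1];   // +1 pad: kills bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
    __syncthreads();

    // Write the transposed tile; block indices are swapped so the GM write
    // is again row-wise (coalesced) in the output matrix.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // column read of tile
}
```

Launched with `dim3 block(TILE, TILE)` and a grid of (w/TILE) x (h/TILE) tiles. Without the +1 pad, the column-wise read `tile[threadIdx.x][threadIdx.y]` would stride by 16 ints and map threads 16 apart onto the same bank, exactly as computed above.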
Optimization in the RowScan-Only Kernel
RowScan only - row-wise prefix scan
• Use extern shared memory (SM) so that a hardcoded size does not become an occupancy constraint
• Kernel execution configuration
  – Shared memory size required: 2 * (width of image) [one buffer each for the integral sum & the square integral sum]
  – At a downscaled size of 256x256 (from a 1Kx1K source): 1-D, 256 threads per block (TPB), 256 blocks --> 6 blocks alive per SM --> 100% occupancy (1536 threads)
  – Hardcoding SM to the max case of 1K TPB (8 kB SM) decreases this to 5 blocks (only ~84% occupancy)
• Face detection using the cascade classifier (CC) follows

Implementation: Scan Window Processing
• Classifier size: 25x25 (fixed); image size: 1024x1024 (can vary)
• Some specs of the classifiers
  – Each Haar classifier has up to 3 rectangles
  – Each stage consists of up to 211 Haar classifiers
  – Our cascade has 25 stages with 2913 Haar classifiers in total
• For each image, we consider a 25x25 moving scan window
  – The next scan window is one pixel over (pixel++)
  – Each scan window is processed independently
• For a 1024x1024 image there are 1000 x 1000 = 10^6 scan windows

Baseline Implementation: Scan Window Processing
• Each thread operates on one scan window
• Each scan window is processed through all 25 stages to detect a face
• A bit vector keeps track of rejected scan windows
  – e.g. scan window (20,30): BV[20][30] = True after stage 1, True after stage 2, ..., False once any stage rejects it
• The bit vector is copied back to host memory

Optimizations (1): Scan Window Processing
Use shared memory
• The classifier information is common to all threads:
  – Indices of the Haar classifiers
  – Weights of each rectangle in a Haar classifier
  – Threshold of each Haar classifier
  – Threshold of each stage
• Bring these data into shared memory - shared across the thread block
• Due to the shared memory limit, the scan window processing is split into 12 kernels [2913 classifiers total across the 12 kernels]
• Each kernel uses approximately 19 kB of shared memory

Optimizations (2): Scan Window Processing
Use pinned host memory
• Replace malloc with cudaMallocHost
Use fast math
• The window normalization computes sqrtf(square sum) - a special function unit (SFU) in the GPU is used
Do not use maxrregcount
• Our kernel needs around 26 registers
• If we restrict it to 20, register spilling occurs
• Occupancy is not always a measure of performance

Optimizations (3): Scan Window Processing
Use block divergence
• If a scan window fails at the end of any kernel, it is rejected
• This results in thread divergence - but more importantly, it leads to block divergence: a thread block in which every window has already been rejected does no work at all (see the sketch below)
• This block divergence is the common case for most image windows - only a small fraction of windows survives the early stages
• Hence, following Amdahl's law, this optimization gave a huge performance benefit
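A condensed sketch tying together the baseline one-thread-per-window structure, the early block exit behind the block-divergence optimization, and the sqrtf normalization. Everything here is a simplifying assumption: weak classifiers are reduced to a single rectangle (real ones have up to 3), one kernel stands in for the project's 12, the integral images are assumed to carry the usual one-pixel border so corner lookups stay in bounds, and all names are hypothetical.

```cuda
// Simplified weak classifier: a single weighted rectangle. Illustrative only.
struct HaarClassifier {
    int   x, y, w, h;          // rectangle inside the 25x25 window
    float weight;              // rectangle weight
    float thresh;              // classifier threshold (pre-normalization)
    float passVal, failVal;    // contribution to the stage sum
};

// 4-corner integral-image lookup: sum over [x, x+w) x [y, y+h).
__device__ float rectSum(const int *ii, int stride, int x, int y, int w, int h)
{
    return (float)(ii[(y + h) * stride + (x + w)] - ii[y * stride + (x + w)]
                 - ii[(y + h) * stride + x]       + ii[y * stride + x]);
}

// One thread = one 25x25 scan window.
__global__ void cascadeKernel(const int *ii, const int *sqii, int stride,
                              int winPerRow, int numWin,
                              const HaarClassifier *cls, const int *stageStart,
                              const float *stageThresh, int numStages,
                              unsigned char *alive)
{
    int win = blockIdx.x * blockDim.x + threadIdx.x;

    // Block-divergence optimization: the whole block exits immediately if
    // none of its windows is still alive.
    __shared__ int blockAlive;
    if (threadIdx.x == 0) blockAlive = 0;
    __syncthreads();
    if (win < numWin && alive[win]) blockAlive = 1;   // benign race: all write 1
    __syncthreads();
    if (!blockAlive) return;
    if (win >= numWin || !alive[win]) return;         // per-thread divergence

    int wx = win % winPerRow, wy = win / winPerRow;

    // Normalize by the window's standard deviation, from the integral and
    // squared-integral images; sqrtf maps to the GPU's SFU with fast math.
    float invArea = 1.0f / (25.0f * 25.0f);
    float mean = rectSum(ii, stride, wx, wy, 25, 25) * invArea;
    float var  = rectSum(sqii, stride, wx, wy, 25, 25) * invArea - mean * mean;
    float norm = sqrtf(fmaxf(var, 1.0f));

    for (int s = 0; s < numStages; ++s) {
        float stageSum = 0.0f;
        for (int c = stageStart[s]; c < stageStart[s + 1]; ++c) {
            HaarClassifier hc = cls[c];
            float v = rectSum(ii, stride, wx + hc.x, wy + hc.y, hc.w, hc.h) * hc.weight;
            stageSum += (v < hc.thresh * norm) ? hc.failVal : hc.passVal;
        }
        if (stageSum < stageThresh[s]) { alive[win] = 0; return; }  // reject
    }
    // Window survived every stage handled here; alive[win] stays set.
}
```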
Evaluation
• Used a GTX 480 GPU for evaluation - 15 SMs, 1.5 GB global memory
• Shared memory usage
  – 8.2 kB for the NN and II kernels
  – 19 kB for the Haar kernels
• Occupancy
  – 100% for NN and II
  – 66.67% for the Haar kernels
• 1024x1024 image size for evaluation --> 21 downsampling iterations
• Compared with single-threaded CPU performance

Performance of NN and II Kernels
[Four charts: execution time (ms, log scale) vs. downsampling iterations for the NN + RowScan, Transpose 1, RowScan-only, and Transpose 2 kernels, each comparing the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions]

NN + II Overall Performance
[Charts: per-kernel execution time for the 4 kernels under the 3 optimization versions; GPU exclusive time vs. CPU]
• Overall NN and II speedup = 1.46x

Performance of Haar Kernels
[Charts: speedup over the baseline GPU implementation for the Haar kernels, with the Shared Mem, Pinned Host Mem, Fast Math, No Maxrregcount, and TB Divergence optimizations applied cumulatively]
• Per-kernel speedups over the baseline GPU with all optimizations: 29.7x, 82.7x, 128.8x, 155.9x, 161.1x, 135.1x, 137.8x, 139.2x, 144x, 221.3x, 212.7x

Speedup Over Iterations
[Chart: speedup over CPU across downsampling iterations for the baseline and each cumulative optimization]

Scanning Window Speedup Comparison
[Chart: inclusive- and exclusive-time speedup over CPU for Baseline, Shared Mem, Pinned Host Mem, Fast Math, No Maxrregcount, and TB Divergence]
• Overall scanning window speedup = 5.47x

Overall Face Detection Speedup
[Chart: inclusive- and exclusive-time face detection speedup across the same optimizations]
• Overall face detection speedup = 5.35x

GPU Speedup Over Varying Image Sizes
[Chart: speedup over CPU for image sizes from 25x25 up to 1024x1024, reaching 5.35x at 1024x1024 - the GPU performs better for larger images]

GPU Face Detection Accuracy
  Faces in image:      1    2    4    8     16   32
  Detection rate (%):  100  100  100  87.5  100  93.75
• Average detection rate: 96.875%

Lessons Learned & Future Work
• GPU provides performance and energy benefits over CPU for parallelizable workloads
• But this comes at a cost: one needs to understand the bottlenecks
• Finer-grained optimizations can reap further benefits
Future work
• Compare GPU performance with equivalent OpenMP and MPI code
• The OpenCV library provides CUDA APIs for object detection - compare the performance of our implementation with those
• Detection accuracy can be improved with a more robust version

Conclusion
• Face detection is a good candidate for parallelization
• Optimizations help in increasing GPU performance
• Up to 5.3x performance improvement on the GPU
• Further improvements are possible with careful analysis and hand-tuning