A 0.5V Power and Area Efficient Laplacian Pyramid Processing Engine using FIFO with Adaptive Data Compression

Seyed Mohammad Ali Zeinolabedin1, Jun Zhou2, Xin Liu2 and Tony T. Kim1
1 Nanyang Technological University, Singapore, 2 Institute of Microelectronics, Singapore
[email protected]

Abstract: This paper proposes a power and area efficient Laplacian Pyramid processing engine (LPPE) for multiresolution image representation in image/video processing. In the proposed LPPE, a novel FIFO architecture with adaptive data compression is proposed to reduce the power and area consumption of the LPPE. A new filtering extension method is also proposed to reduce the output errors. At the circuit level, near-threshold design is adopted to further reduce the power consumption by supply voltage scaling. The proposed LPPE, fabricated in a 0.18 µm CMOS process technology, can process 112 frames per second at 3.68 MHz and 0.5 V while consuming only 452 µW.

I. INTRODUCTION

Recent developments in integrated circuit (IC) design technology enable us to realize complex algorithms on silicon. Most of the image/video processing algorithms implemented in ICs for real-time performance are memory dominant. Moreover, some need to be ultra-low power, especially for biomedical and mobile applications. Therefore, one of the main bottlenecks in realizing an ultra-low power image/video processor lies in on-chip memory such as SRAM and FIFO. According to the studies in [1], on-chip memory accounts for 50%-80% of the overall power consumption.

Laplacian Pyramid (LP) is one of the popular multiresolution (multiscale) image representations in image/video processing applications such as image fusion, compression, feature extraction and object recognition [2-4]. Various LP hardware designs have been proposed [5-7]. However, they all focus on performance rather than power consumption. As portable applications become more and more popular, there is an increasing demand for system miniaturization and low power consumption, which is shifting the design focus from performance to area and power optimization.

As a heavily used component in the LP processing engine (LPPE) as well as in other image/video processors, the FIFO consumes a significant portion of the area and power. The techniques available in the literature for dealing with FIFOs can be categorized as architecture-level and circuit-level techniques. A recent work published in [8] is a general architecture-level approach introducing the area and energy efficient FIFO through error-reduced data compression (FERDC). FERDC reduces dynamic power, leakage power, and area by 28.84%, 32.73%, and 34.91%, respectively. However, due to its non-adaptive nature, the mean square error (MSE) increases significantly over a wide range of input images. Moreover, further FIFO size reduction was achieved only at the cost of a large MSE, which is not acceptable in practical implementations. Other architecture-level techniques mostly focus on algorithm-dependent optimizations [9, 10], while circuit-level techniques mainly include power optimization of the flip-flops/SRAMs used for FIFO implementation, memory-splitting techniques, and clock gating techniques [11, 12]. However, these approaches are difficult to generalize to different applications.

This paper proposes a novel power and area efficient LPPE with various architecture and circuit level techniques such as a FIFO with adaptive data compression for area and power reduction, a new extension method for output MSE improvement, and near-threshold circuit operation for further reducing the overall power consumption.

978-1-4673-7472-9/15/$31.00 ©2015 IEEE

II. PREVIOUS LAPLACIAN PYRAMID (LP) ARCHITECTURE

LP comprises decimation and interpolation functions as shown in Fig. 1.
The decimation part, denoted as Gaussian pyramid (GP), produces low-frequency coefficients (C1) from an input image (C0). The output of the decimation part is further processed by the interpolation part to calculate an estimated version of the input image. Subsequently, high-frequency coefficients (D1) are obtained by subtracting the estimated image from the original input image.

In our previous LP design [7], illustrated in Fig. 2, a novel LP architecture using poly-phase representations and noble identities was proposed to reverse the order of resampling and filtering and hence avoid redundant and zero-operand operations, which account for 50% of the overall operations. E0 and E1 are filter blocks, and R0 and R1 are poly-phase components. It is worth noting that two rows (Row2i and Row2i-1) of an image are processed in Fig. 2 because the equivalent “Column” block in Fig. 2 takes its second input from the next row. Each “Column” block (Fig. 2) requires a FIFO block to hold the required amount of data generated by the “Row” block. Its size is determined by the filter length, the input image width and the input data bit-width. Moreover, since the “Delay” blocks synchronize the original input data with the outputs of the “Column” block in the interpolation part, they are also realized by FIFOs. Their size is defined by the input data bit-width and the total delay introduced by the decimation and interpolation parts. As reported in [7], FIFOs dominate the area and power of the LPPE; therefore, smarter FIFOs are needed to further improve the area and the power of the LPPE.

Fig. 1. General Laplacian Pyramid structure.

Fig. 2. Proposed LP architecture in [7].

III. PROPOSED LP PROCESSING ENGINE (LPPE)

A. FIFO with Adaptive Error Reduced Data Compression

In this work, we propose a FIFO with adaptive error reduced data compression (FAERDC). The operation principle of FAERDC in a 5th-order filter is illustrated in Fig. 3(a). It utilizes a differential predictor (Fig. 3(b)) to remove the horizontal correlation between subsequent data and generates the energy-compacted output (yi). Before yi is sent to the quantizer (Q. in Fig. 3(c)) to reduce the data width, it is updated by the quantization error of the previous data to prevent the quantization errors from accumulating at the output (xiq1). However, this input-independent data compression causes a large MSE at the outputs. To address this, a Quantizer Parameters Adaptive Tuning Block is designed to make the compression adaptive to the input data so as to reduce the MSE. The four quantizer parameters fL, sL, Step1 and Step2 are updated as a function of the maximum value of the present row to adapt the quantizer to the next incoming data spectrum. After the quantizer parameters are updated, it is still possible for some values in the next row to fall outside of [-sL, sL]. Therefore, while the next row is processed, the corresponding errors from the quantizer are averaged and used to update the retrieved data.

FAERDCs are used in the “FIFO” and “Delay” blocks to reduce the area and power. Fig. 3(a) illustrates the proposed FIFO architecture for a sample 5th-order filter. In Fig. 3(a), the output of each FIFO block, named yisjD where j varies from 1 to 4, is connected to the input of the next “FIFO Core”. Therefore, only one “Encoding Part” is required. Moreover, the “Quantizer Parameters Adaptive Tuning” block only calculates the quantizer parameters for “FIFO Core1”, which holds the present row. The calculated parameters are then stored to be used in the subsequent “FIFO Corej”, where j is 2 to 4.

Fig. 3. (a) A block diagram of FAERDC. (b) Differential and I. Differential Predictors. (c) Quantizer for positive and negative numbers.

Fig. 4. Proposed power and area efficient 4W-delay data synchronization.
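The encoding path described above (differential predictor, quantization error fed back into the next sample, then an inverse differential predictor on the read side) can be illustrated with a minimal behavioural sketch. Everything named here is hypothetical: `quantize`, `encode_row`, `decode_row` and the single `step` parameter are illustrative only, and a plain uniform quantizer stands in for the paper's quantizer with parameters fL, sL, Step1 and Step2, whose exact transfer characteristic is not spelled out in the text. The sign convention of the feedback is also an assumption. The adaptive tuning block is not modelled.

```python
def quantize(value, step):
    """Stand-in uniform quantizer (assumption, not the paper's fL/sL/Step1/Step2 quantizer).

    Returns the nearest multiple of `step`; hardware would store value // step
    in fewer bits than the original sample.
    """
    return step * round(value / step)


def encode_row(row, step, feedback=True):
    """Differential predictor followed by quantization with optional error feedback."""
    prev_x = 0   # register holding the previous input sample
    err = 0      # quantization error of the previous sample
    codes = []
    for x in row:
        y = x - prev_x          # differential predictor output
        if feedback:
            y += err            # fold the previous quantization error back in
        q = quantize(y, step)
        err = y - q             # error made on this sample
        prev_x = x
        codes.append(q)
    return codes


def decode_row(codes):
    """Inverse differential predictor: running sum of the quantized differences."""
    out, acc = [], 0
    for q in codes:
        acc += q
        out.append(acc)
    return out


def max_abs_error(row, step, feedback):
    """Largest reconstruction error over one row."""
    decoded = decode_row(encode_row(row, step, feedback))
    return max(abs(a - b) for a, b in zip(row, decoded))


# A slow ramp (3, 6, ..., 120): every difference is smaller than half a step.
row = [3 * i for i in range(1, 41)]
```

With feedback, the decoded sample equals the input minus only the most recent quantization error, so the error stays within half a step. Without feedback, the per-sample errors add up along the row, which is the accumulation the text refers to.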
However, each row needs a separate “Decoding Part” to generate the desired inputs for the filter. In data synchronization applications whose input-output delay is larger than the input image width (W), the proposed FIFO architecture can be easily utilized. For instance, the “Delay” blocks in Fig. 2 require a four-row (4W) delay between the input and output, which is realized as shown in Fig. 4. Since there is only one output, only one decoding part, for the last FIFO, is required. This reduces the power and the area significantly compared to the previous design in [8].

B. Semi-Periodic Extension Method

The image boundary pixel issue arises in filtering when the filter window is not completely covered by the input image. This case generates undesired artifacts at the image boundary [13]. Various extension methods have been developed to address this problem [13]. Popular extension methods for hardware implementation are constant extension, duplication and mirroring (also called wavelet extension) as described in Fig. 5(a) [13]. The original extension method of LP introduced in [14] is periodic and only suitable for non-causal software implementation where the input image is completely stored in the memory. A new extension method named semi-periodic extension, which uses the image periodicity across the image boundary, is proposed and shown in Fig. 5(b). For instance, in a sample 5th-order filter, the first two data of each row are added to the end of the row and the last two data of each row are added to the beginning of the next row. For the column filtering, the wavelet extension method is used. The LP simulation result using over 100 standard test images in Fig. 6 reveals that the proposed extension method improves the MSE of the low frequency output (LFC) by 41.2% and 76.6% compared to the wavelet and constant extension methods, respectively. The MSE of the high frequency output (HFC) is also improved by 45.1% and 86.2% compared to the wavelet and constant extension methods, respectively.

Fig. 5. Proposed semi-periodic extension method for row filtering.

Fig. 6. MSE comparison between the proposed extension and other extension methods over 100 standard test images.

C. Near-threshold Design

Besides the proposed new architecture and adaptive FIFO, near-threshold operation is adopted to further reduce the power of the LPPE. To achieve this, the standard cell library is simulated and a new library for near-threshold operation is constructed by excluding the cells with a fan-in of more than 4, which cause large variations at ultra-low voltage. After that, the new library is characterized at 0.5 V in the selected process technology to obtain timing information for synthesis and back-end design [15]. Ultra-low voltage level shifters [16] are also adopted in the design to interface between the near-threshold core and the I/Os.

D. Implementation of LPPE

Fig. 7 explains the pipeline hardware implementation of the LPPE employing the proposed FIFO (FAERDC). It consists of eight pipeline stages indicated as S1 to S8. It also incorporates other pipeline stages within the filter and FIFO blocks. Fig. 7 shows that two inputs are required to start the LP process. The first two blocks, indicated as D1 and D2, perform decimation on their inputs. They provide two subsequent data at the input of the filter blocks located in the next stage. Data arrival is therefore controlled by a clock (Clk1) that is twice as fast as the second clock (Clk2). Clk2 is also connected to the remaining blocks of the decimation part. The decimation operation between S3 and S4 is implicitly applied by providing two data from neighboring rows through FIFOs. In addition, interpolation between S5 and S6 is performed by a multiplexer, which sends its inputs to the output one after another using the faster clock (Clk1). From that point on, all following blocks are controlled by Clk1. As a result, the low frequency output (LFC) and the two high frequency outputs (HFC) are streamed out with Clk2 and Clk1, respectively. The “Control Unit” is responsible for resetting the blocks at the beginning, disabling/enabling the adaptive part of the proposed FIFO and controlling the whole image processing flow.

Fig. 7. Pipeline implementation of LPPE. There are only eight pipeline stages shown here. Other pipeline stages are within the filters and FIFOs blocks.

IV. EXPERIMENTAL RESULTS

The extensive simulation results over 100 standard test images show that the MSE of the proposed LPPE with FAERDC outperforms the MSE of the LPPE with the FIFO architectures in [8] by more than 90% for LFC and more than 86% for HFC, as presented in Fig. 8. They also reveal that employing FAERDC yields a standard deviation in MSE of 0.85 and 0.43 for LFC and HFC, respectively, which is 98% better than those of the others.

The proposed LPPE is designed in a 0.18 µm CMOS process technology. The placement and routing (PnR) was done by applying the CPF (common power format) flow for low power design. Fig. 9 shows the chip micrograph of the proposed LPPE with the size of 4 × 4 mm. Detailed specifications of the test chip are summarized in Table I. The test chip demonstrated a performance of 3.68 MHz at 0.50 V for processing 112 images (256 × 256) per second. With the “Cameraman” input image, the measured power with the proposed FIFO is 452 µW, while it is 445 µW with the FIFO introduced in [8] (Fig. 10(a)). The “Quantizer Parameters Update” block for adaptive operation contributes this power overhead, 1.55% of the total power. However, the MSE of the proposed LPPE is better by more than 95% (Fig. 10(b)). Fig. 11(a) shows the low and high frequency outputs of the proposed LPPE for the “Cameraman” input image. Fig. 11(b) illustrates the difference between the proposed adaptive output and the conventional non-adaptive output. Note that the proposed adaptive mode outperforms the non-adaptive one at the edges, which convey vital information of the image. The total area improvement estimated for 256 × 256, 512 × 512 and 1024 × 1024 input image sizes is summarized in Fig. 11(c).

Fig. 8. MSE performance and standard deviation of MSE of LPPE outputs, applying FIFO architectures in [8] and FAERDC with 7-bit wide FIFOs over 100 standard test images.

Fig. 9. Chip micrograph.

TABLE I
TEST CHIP SUMMARY
Technology: 0.18 µm
VDD: 0.50 V
Frames per second (fps): 112
Power (µW): 452
Frequency (MHz): 3.68
Energy/cycle (pJ/cycle): 122.83
Image size: 256 × 256

Fig. 10. (a) Power measurement result for LP operating at 0.5 V with both non-adaptive and adaptive modes. (b) MSE comparison of the “Cameraman” input image in non-adaptive and adaptive modes.

Fig. 11. (a) The LPPE output for Cameraman Image. (b) The difference between the outputs of adaptive and non-adaptive modes. (c) Total area improvement vs. different input image size.

V. CONCLUSION

This paper presented a power and area efficient Laplacian Pyramid processing engine (LPPE) for image/video processing applications. In the proposed LPPE, a novel FIFO architecture with adaptive data compression is proposed to reduce the area by 18.98%~36.06% and improve the power by 16%. In addition, a new filtering extension method is introduced to considerably improve the output accuracy compared to other extension methods. At the circuit level, near-threshold design is adopted to further reduce the power consumption. Measurement results show that the proposed LPPE achieves an operating frequency of 3.68 MHz at 0.50 V for processing 112 fps while consuming only 452 µW.

REFERENCES

[1] K. Danckaert, K. Masselos, F. Catthoor, H. J. De Man, and C. Goutis, "Strategy for power-efficient design of parallel systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 2, pp. 258-265, Jun. 1999.
[2] P. Burt and E. Adelson, "The Laplacian Pyramid as a Compact Image Code," IEEE Trans. Commun., vol. 31, no. 4, pp. 532-540, Apr. 1983.
[3] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[4] S. Chen, Q. Guo, H. Leung, and E. Bosse, "A Maximum Likelihood Approach to Joint Image Registration and Fusion," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1363-1372, May 2011.
[5] Y. Song, K. Gao, G. Ni, and R. Lu, "Implementation of real-time Laplacian pyramid image fusion processing based on FPGA," Proceedings of the SPIE, 2007, pp. 683316-683316.
[6] V. Popovic, K. Seyid, A. Schmid, and Y. Leblebici, "Real-time hardware implementation of multi-resolution image blending," in IEEE Int. Conf. ICASSP, Vancouver, BC, 2013, pp. 2741-2745.
[7] S. M. A. Zeinolabedin, N. Karimi, and S. Samavi, "Low computational complexity hardware implementation of Laplacian Pyramid," in Proc. IEEE 18th Iranian Conf. Electr. Eng., 2010, pp. 465-470.
[8] S. M. A. Zeinolabedin, J. Zhou, X. Liu, and T. T. Kim, "An Area- and Energy-Efficient FIFO Design Using Error-Reduced Data Compression and Near-Threshold Operation for Image/Video Applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., in press.
[9] S. L. Chen, "VLSI Implementation of a Low-Cost High-Quality Image Scaling Processor," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 1, pp. 31-35, Jan. 2013.
[10] B. K. Mohanty, P. K. Meher, S. Al-Maadeed, and A. Amira, "Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite Impulse Response Filters," IEEE Trans. Circuit Syst. I, Reg. Papers, vol. 61, no. 1, pp. 120-133, Jan. 2014.
[11] D. Jeon, Y. Kim, I. Lee, Z. Zhang, D. Blaauw, and D. Sylvester, "A 470mV 2.7mW feature extraction-accelerator for micro-autonomous vehicle navigation in 28nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2013, pp. 166-167.
[12] T. T. Kim, J. Liu, J. Keane, and C. H. Kim, "Circuit techniques for ultralow power subthreshold SRAMs," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Seattle, Washington, 2008, pp. 2574-2577.
[13] D. G. Bailey, Design for Embedded Image Processing on FPGAs. Singapore: John Wiley & Sons (Asia), 2011.
[14] M. N. Do and M. Vetterli, "The contourlet transform: an efficient directional multiresolution image representation," IEEE Trans. Image Process., vol. 14, no. 12, pp. 2091-2106, Dec. 2005.
[15] X. Liu, J. Zhou, X. Liao, C. Wang, J. Luo, M. Madihian, et al., "Ultralow-energy near-threshold biomedical signal processor for versatile wireless health monitoring," in Proc. IEEE Asian Solid State Circuits Conf. (A-SSCC), 2012, pp. 381-384.
[16] J. Zhou, C. Wang, X. Liu, X. Zhang, and M. Je, "A fast and energy-efficient level shifter with wide shifting range from sub-threshold up to I/O voltage," in Proc. IEEE Asian Solid State Circuits Conf. (A-SSCC), 2013, pp. 137-140.
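As a quick consistency check on the measured figures reported in Table I, the energy per cycle and the cycle budget per frame follow directly from the reported power, frequency and frame rate. The symbols P, f, N_cycles and N_pixels below are introduced here for illustration only. Reading the result as roughly two pixels per clock cycle, in line with the two rows fed in parallel, is an interpretation rather than something the paper states, since the text does not say which of Clk1 or Clk2 runs at 3.68 MHz.

```latex
\begin{align*}
E_{\text{cycle}} &= \frac{P}{f}
  = \frac{452\ \mu\text{W}}{3.68\ \text{MHz}}
  \approx 122.83\ \text{pJ/cycle} \\[4pt]
N_{\text{cycles/frame}} &= \frac{f}{\text{fps}}
  = \frac{3.68 \times 10^{6}}{112}
  \approx 32\,857 \\[4pt]
N_{\text{pixels/frame}} &= 256 \times 256 = 65\,536 \\[4pt]
\frac{N_{\text{cycles/frame}}}{N_{\text{pixels/frame}}}
  &\approx 0.50\ \text{cycles per pixel}
\end{align*}
```

The first line reproduces the 122.83 pJ/cycle entry of Table I.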