A 0.5V Power and Area Efficient Laplacian Pyramid
Processing Engine using FIFO with Adaptive Data
Compression
Seyed Mohammad Ali Zeinolabedin¹, Jun Zhou², Xin Liu² and Tony T. Kim¹
¹Nanyang Technological University, Singapore; ²Institute of Microelectronics, Singapore
[email protected]
Abstract— This paper proposes a power- and area-efficient
Laplacian Pyramid processing engine (LPPE) for multiresolution
image representation in image/video processing. The proposed
LPPE uses a novel FIFO architecture with adaptive data
compression to reduce its power and area. A new filtering
extension method is also proposed to reduce the output errors.
At the circuit level, near-threshold design is adopted to
further reduce the power consumption through supply voltage
scaling. The proposed LPPE, fabricated in a 0.18 µm CMOS
process technology, can process 112 frames per second at
3.68 MHz and 0.5 V while consuming only 452 µW.
I. INTRODUCTION
Recent developments in integrated circuit (IC) design
technology make it possible to realize complex algorithms on
silicon. Most image/video processing algorithms implemented
in ICs for real-time performance are memory-dominant.
Moreover, some need to be ultra-low power, especially for
biomedical and mobile applications. Therefore, one of the
main bottlenecks in realizing an ultra-low-power image/video
processor lies in on-chip memory such as SRAM and FIFO.
According to the studies in [1], on-chip memory accounts for
50%-80% of the overall power consumption.
Laplacian Pyramid (LP) is one of the popular multiresolution
(multiscale) image representations in image/video
processing applications such as image fusion, compression,
feature extraction and object recognition [2-4]. Various LP
hardware designs have been proposed [5-7]. However, they all
focus on performance rather than power consumption.
As portable applications become more and more popular, there
is an increasing demand for system miniaturization and low
power consumption, which is shifting the design focus from
performance to area and power optimization.
As a heavily used component in the LP processing engine
(LPPE) as well as in other image/video processors, the FIFO
accounts for a significant portion of the area and power
consumption. The techniques available in the literature for
dealing with FIFOs can be categorized as architecture-level
and circuit-level techniques. A recent work published in [8]
is a general architecture-level approach introducing the area
and energy efficient FIFO through error-reduced data
compression (FERDC). FERDC reduces dynamic power, leakage
power, and area by 28.84%, 32.73%, and 34.91%, respectively.
However, due to its non-adaptive nature, the mean square
error (MSE) increases significantly over a wide range of input
images. Moreover, larger FIFO size reductions were achieved
only at the cost of a large MSE, which is not acceptable for
practical implementations. Other architecture-level techniques
mostly focus on algorithm-dependent optimizations [9, 10],
while circuit-level techniques mainly include power
optimization of the flip-flops/SRAMs used for FIFO
implementation, memory-splitting techniques, and clock gating
techniques [11, 12]. However, these approaches are difficult
to generalize to different applications.
This paper proposes a novel power- and area-efficient LPPE
with various architecture- and circuit-level techniques such as
FIFO with adaptive data compression for area and power
reduction, a new extension method for output MSE
improvement, and near-threshold circuit operation for further
reducing the overall power consumption.
978-1-4673-7472-9/15/$31.00 ©2015 IEEE
II. PREVIOUS LAPLACIAN PYRAMID (LP) ARCHITECTURE
LP comprises decimation and interpolation functions as
shown in Fig. 1. The decimation part, denoted as Gaussian
pyramid (GP), produces low frequency coefficients (C1) from
an input image (C0). The output of the decimation part is
further processed by the interpolation part to calculate an
estimated version of the input image. Subsequently, high
frequency coefficients (D1) are obtained by subtracting the
estimated image from the original input image.
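The decimation/interpolation structure described above can be sketched in a few lines of Python. This is a minimal 1-D illustration under stated assumptions, not the hardware design: the 5-tap binomial kernel, the mirrored border handling, and the names `filt` and `lp_level` are all hypothetical choices made for the sketch, since the text does not give the filter coefficients.

```python
# Assumed 5-tap binomial low-pass kernel (the paper does not list its coefficients).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def filt(x, kernel):
    """Same-length FIR filtering with mirrored borders (an assumed border rule)."""
    r = len(kernel) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i + k - r
            if j < 0:
                j = -j                      # mirror at the left border
            if j >= n:
                j = 2 * (n - 1) - j         # mirror at the right border
            acc += h * x[j]
        out.append(acc)
    return out


def lp_level(c0):
    """One LP level: returns (C1, D1, estimate) for the input C0."""
    c1 = filt(c0, KERNEL)[::2]              # decimation: low-pass, keep even samples
    up = [0.0] * len(c0)
    up[::2] = c1                            # interpolation: insert zeros ...
    est = filt(up, [2 * h for h in KERNEL]) # ... and low-pass with a gain of 2
    d1 = [a - b for a, b in zip(c0, est)]   # high-frequency coefficients D1
    return c1, d1, est
```

Adding `est` and `d1` sample by sample returns the input, which is the property that makes the LP an invertible representation regardless of the kernel chosen.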
Fig. 1. General Laplacian Pyramid structure.
Fig. 2. Proposed LP architecture in [7].
In our previous LP design [7], illustrated in Fig. 2, a novel
LP architecture using poly-phase representations and noble
identities is proposed to reverse the order of resampling and
filtering and hence avoid redundant and zero-operand
operations, which account for 50% of the overall operations.
E0 and E1 are filter blocks, and R0 and R1 are poly-phase
components. Note that two rows (Row2i and Row2i-1) of an image
are processed in Fig. 2 because the equivalent “Column” block
requires its second input from the next row. Each
“Column” block (Fig. 2) requires a FIFO block to hold the
required amount of data generated by the “Row” block. Its
size is determined by the filter length, the input image width
and the input data bit-width. Moreover, since the “Delay”
blocks synchronize the original input data with the outputs of
the “Column” block in the interpolation part, they are also
realized by FIFOs. Their size is determined by the input data
bit-width and the total delay introduced by the decimation and
interpolation parts. As reported in [7], FIFOs dominate the
area and power of the LPPE; therefore, smarter FIFOs are
needed to further improve the area and power of the LPPE.
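The saving obtained from poly-phase representations and noble identities can be checked numerically. The sketch below is a generic textbook illustration, not the circuit of [7]: it compares filtering at the full rate followed by downsampling with an equivalent poly-phase form that filters only at the low rate. The function names and the integer test kernel are assumptions made for the example.

```python
def fir(x, h):
    """Causal FIR filter y[n] = sum_k h[k] * x[n-k] with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(x, h):
    """Filter at the full rate, then discard every second output."""
    return fir(x, h)[::2]


def decimate_polyphase(x, h):
    """Split h into even/odd phases and filter at the low rate only."""
    e0, e1 = h[0::2], h[1::2]            # even and odd poly-phase components of h
    x_even = x[0::2]                     # x[2m]
    x_odd = ([0] + x[1::2])[:len(x_even)]  # x[2m-1], taking x[-1] = 0
    y0 = fir(x_even, e0)
    y1 = fir(x_odd, e1)
    return [a + b for a, b in zip(y0, y1)]
```

Both functions return the same samples, but the poly-phase form never computes the outputs that the downsampler would discard, which is consistent with the roughly 50% of operations reported as avoidable above.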
III. PROPOSED LP PROCESSING ENGINE (LPPE)
A. FIFO with Adaptive Error Reduced Data Compression
In this work, we propose a FIFO with adaptive error-reduced
data compression (FAERDC). The operating principle of
FAERDC for a 5th-order filter is illustrated in Fig. 3(a). It
uses a differential predictor (Fig. 3(b)) to remove the
horizontal correlation between consecutive data and
generates the energy-compacted output (yi). Before yi is sent
to the quantizer (Q. in Fig. 3(c)) to reduce the data width, it
is updated with the quantization error of the previous data to
prevent quantization errors from accumulating at the
output (xiq1). However, such input-independent data
compression causes a large MSE at the outputs. To address
this, a Quantizer Parameters Adaptive Tuning block is designed
to make the compression adaptive to the input data so as to
reduce the MSE. The four quantizer parameters fL, sL, Step1
and Step2 are updated as a function of the maximum value of
the present row to adapt the quantizer to the next
incoming data spectrum. After the quantizer parameters are
updated, some values in the next row may still fall outside
[-sL, sL]. Therefore, while the next row is processed, the
corresponding quantizer errors are averaged and used to
update the retrieved data.
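The role of the error feedback in front of the quantizer can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: a single uniform quantizer with a fixed `step` stands in for the two-step quantizer of Fig. 3(c), the adaptive tuning of fL, sL, Step1 and Step2 is not modeled, and the function names are hypothetical.

```python
def encode_row(row, step, feedback=True):
    """Differential predictor followed by a quantizer, optionally with error feedback."""
    prev, err, codes = 0, 0, []
    for x in row:
        y = x - prev                    # differential predictor output
        if feedback:
            y += err                    # add the quantization error of the previous data
        q = round(y / step)             # assumed uniform quantizer (stand-in for Q.)
        err = y - q * step              # error carried over to the next data
        codes.append(q)
        prev = x
    return codes


def decode_row(codes, step):
    """Inverse differential predictor: accumulate the dequantized differences."""
    acc, out = 0, []
    for q in codes:
        acc += q * step
        out.append(acc)
    return out
```

With feedback, the decoded value differs from the input only by the most recent quantization error, so the error stays within half a step. Without feedback, a slow ramp whose differences all round to zero is decoded as a constant, and the error grows with every sample.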
FAERDCs are used in the “FIFO” and “Delay” blocks to
reduce the area and power. Fig. 3(a) illustrates the proposed
FIFO architecture for a sample 5th-order filter. In Fig. 3(a),
the output of each FIFO block, named yisjD where j varies
from 1 to 4, is connected to the input of the next “FIFO Core”.
Therefore, only one “Encoding Part” is required. Moreover, the
“Quantizer Parameters Adaptive Tuning” block only
calculates the quantizer parameters for “FIFO Core1”, which
holds the present row. The calculated parameters are then
stored for use in the subsequent “FIFO Corej”, where j is 2 to 4.
Fig. 3. (a) A block diagram of FAERDC. (b) Differential and I. Differential
Predictors. (c) Quantizer for positive and negative numbers.
Fig. 4. Proposed power and area efficient 4W-delay data synchronization.
However, each row needs a separate “Decoding Part” to
generate the desired inputs for the filter. The proposed FIFO
architecture can also be readily used in data
synchronization applications whose input-output delay is
larger than the input image width (W). For instance, the
“Delay” blocks in Fig. 2 require a four-row (4W) delay between
the input and output, which is realized as shown in Fig. 4.
Since there is only one output, only one decoding part, for
the last FIFO, is required. This reduces the power and area
significantly compared to the previous design in [8].
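A behavioral sketch of the 4W-delay arrangement is given below. It is an illustration under assumptions rather than the circuit of Fig. 4: each FIFO core is modeled as a `deque` holding one row of codes, a fixed uniform quantizer step replaces the adaptive quantizer, and the name `delay_line_4w` is hypothetical. The point it shows is structural: one encoding part feeds the first core and one decoding part follows the last core.

```python
from collections import deque


def delay_line_4w(samples, width, step=4):
    """Delay `samples` by 4 * width through four chained FIFO cores of codes."""
    cores = [deque([0] * width) for _ in range(4)]   # one row of codes per core
    prev, err, acc = 0, 0, 0
    out = []
    for x in samples:
        # Encoding part: used once, at the input of the first core.
        y = x - prev + err
        code = round(y / step)          # assumed uniform quantizer
        err = y - code * step
        prev = x
        # The output of each core feeds the next core directly (no re-encoding).
        for core in cores:
            core.append(code)
            code = core.popleft()
        # Decoding part: used once, at the output of the last core.
        acc += code * step
        out.append(acc)
    return out
```

For a row width of 4, the first 16 outputs are the initial zeros and every later output reproduces the input from 16 samples earlier to within half a quantizer step.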
Fig. 5. Proposed semi-periodic extension method for row filtering.
Fig. 6. MSE comparison between the proposed extension and other
extension methods over 100 standard test images.
Fig. 7. Pipeline implementation of LPPE. There are only eight pipeline
stages shown here. Other pipeline stages are within the filter and FIFO
blocks.
decimation on their inputs. They provide two consecutive data
at the input of the filter blocks located in the next stage.
Data arrival is therefore controlled by a clock (Clk1) that is
twice as fast as the second clock (Clk2). Clk2 is also
connected to the remaining blocks of the decimation part. The
decimation operation between S3 and S4 is implicitly applied
by providing two data from neighboring rows through FIFOs. In
addition, interpolation between S5 and S6 is performed by a
multiplexer. It sends its inputs to the output one after
another using the faster clock (Clk1). From this point on, all
following blocks are controlled by Clk1. As a result, the low
frequency output (LFC) and the two high frequency outputs
(HFC) are streamed out with Clk2 and Clk1, respectively. The
“Control Unit” is responsible for resetting the blocks at the
beginning, disabling/enabling the adaptive part of the
proposed FIFO and controlling the whole image processing flow.
B. Semi-Periodic Extension Method
The image boundary problem arises in filtering when the
filter window is not completely covered by the input image.
This generates undesired artifacts at the image
boundary [13]. Various extension methods have been
developed to address this problem [13]. Popular extension
methods for hardware implementation are constant extension,
duplication and mirroring (also called wavelet extension), as
described in Fig. 5(a) [13]. The original extension method of
LP introduced in [14] is periodic and only suitable for
non-causal software implementation where the input image is
completely stored in memory. A new extension method
named semi-periodic extension, which uses the image
periodicity across the image boundary, is proposed and shown
in Fig. 5(b). For instance, for a sample 5th-order filter, the
first two data of each row are appended to the end of the row
and the last two data of each row are added to the beginning
of the next row. For column filtering, the wavelet extension
method is used. The LP simulation results using over 100
standard test images in Fig. 6 show that the proposed
extension method improves the MSE of the low frequency output
(LFC) by 41.2% and 76.6% compared to the wavelet and constant
extension methods, respectively. The MSE of the high frequency
output (HFC) is also improved by 45.1% and 86.2% compared to the
wavelet and constant extension methods, respectively.
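The row extension rule described above can be written down directly. The sketch below follows the 5th-order example from the text (two data added on each side). Two points are assumptions of the sketch and are not stated in the source: the function name `semi_periodic_rows`, and the treatment of the first row, which has no preceding row and here simply borrows its own last data.

```python
def semi_periodic_rows(image, taps=5):
    """Extend each row of `image` (a list of rows) for a filter with `taps` taps."""
    r = taps // 2                        # data added on each side (2 for 5 taps)
    extended = []
    for i, row in enumerate(image):
        # Assumption: the first row has no predecessor, so it reuses its own tail.
        prev_row = image[i - 1] if i > 0 else row
        left = prev_row[-r:]             # last data of the previous row
        right = row[:r]                  # first data of the current row
        extended.append(left + row + right)
    return extended
```

For example, with rows `[1, 2, 3, 4, 5, 6]` and `[7, 8, 9, 10, 11, 12]`, the second row is extended to `[5, 6, 7, 8, 9, 10, 11, 12, 7, 8]`. Both borders use only data that have already arrived in raster-scan order, which is why the scheme suits a streaming implementation.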
C. Near-threshold Design
Besides the proposed new architecture and adaptive FIFO,
near-threshold operation is adopted to further reduce the
power of the LPPE. To achieve this, the standard cell library is
simulated and a new library for near-threshold operation is
constructed by excluding cells with a fan-in of more than 4,
which cause large variations at ultra-low voltage. The
new library is then characterized at 0.5 V in the selected
process technology to obtain timing information for synthesis
and back-end design [15]. Ultra-low-voltage level shifters [16]
are also adopted in the design to interface between the
near-threshold core and the I/Os.
D. Implementation of LPPE
Fig. 7 shows the pipelined hardware implementation of the
LPPE employing the proposed FIFO (FAERDC). It consists of
eight pipeline stages, indicated as S1 to S8, and incorporates
further pipeline stages within the filter and FIFO blocks.
Fig. 7 shows that two inputs are required to start the LP
process. The first two blocks, indicated as D1 and D2, perform
decimation on their inputs.
IV. EXPERIMENTAL RESULTS
Extensive simulation results over 100 standard test
images show that the MSE of the proposed LPPE with FAERDC
is lower than that of the LPPE with the FIFO architectures in
[8] by more than 90% for LFC and more than 86% for HFC, as
presented in Fig. 8. They also show that employing FAERDC
yields standard deviations of 0.85 and 0.43 in MSE for
LFC and HFC, respectively, which is 98% better than those of
the others. The proposed LPPE is designed in a 0.18 µm CMOS
process technology. Placement and routing (PnR) was
done by applying the CPF (Common Power Format) flow
for low-power design. Fig. 9 shows the chip micrograph of
the proposed LPPE with a size of 4 × 4 mm. Detailed
specifications of the test chip are summarized in Table I.
The test chip demonstrated a performance of 3.68 MHz at
0.50 V, processing 112 images (256 × 256) per second.
With the “Cameraman” input image, the measured power
with the proposed FIFO is 452 µW, while it is 445 µW with
the FIFO introduced in [8] (Fig. 10(a)). The “Quantizer
Parameters Update” block for adaptive operation contributes
this power overhead, 1.55% of the total power. However, the
MSE of the proposed LPPE is better by more than 95% (Fig.
10(b)). Fig. 11(a) shows the low and high frequency outputs of
the proposed LPPE for the “Cameraman” input image. Fig. 11(b)
illustrates the difference between the proposed adaptive output
and the conventional non-adaptive output. Note that the
proposed adaptive mode outperforms the non-adaptive one at
the edges, which convey vital information of the image. The
total area improvement estimated for 256 × 256, 512 × 512
and 1024 × 1024 input image sizes is summarized in Fig.
11(c).
Fig. 8. MSE performance and standard deviation of MSE of LPPE outputs,
applying FIFO architectures in [8] and FAERDC with 7-bit wide FIFOs
over 100 standard test images.
V. CONCLUSION
This paper presented a power- and area-efficient Laplacian
Pyramid processing engine (LPPE) for image/video
processing applications. In the proposed LPPE, a novel FIFO
architecture with adaptive data compression reduces the area
by 18.98%~36.06% and improves the power by 16%. In addition,
a new filtering extension method is introduced to considerably
improve the output accuracy compared to other extension
methods. At the circuit level, near-threshold design is adopted
to further reduce the power consumption. Measurement results
show that the proposed LPPE achieves an operating frequency of
3.68 MHz at 0.50 V, processing 112 fps while consuming only
452 µW.
Fig. 9. Chip micrograph.
TABLE I TEST CHIP SUMMARY
Technology                 0.18 µm
VDD                        0.50 V
Frames per second (fps)    112
Power (µW)                 452
Frequency (MHz)            3.68
Energy/cycle (pJ/cycle)    122.83
Image size                 256 × 256
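The entries of Table I can be cross-checked with a short calculation. The sketch below uses only the reported power, frequency, frame rate and image size. The function name is hypothetical, and the reading of the last figure (about two pixels per clock cycle, which would be consistent with two rows entering the engine in parallel) is an inference made here, not a statement from the measurements.

```python
def table_i_checks(power_uw=452.0, freq_mhz=3.68, fps=112, width=256, height=256):
    """Derive per-cycle and per-frame figures from the Table I entries."""
    power_w = power_uw * 1e-6
    freq_hz = freq_mhz * 1e6
    energy_per_cycle_pj = power_w / freq_hz * 1e12   # energy spent in one clock cycle
    cycles_per_frame = freq_hz / fps                 # clock cycles available per frame
    pixels_per_cycle = width * height / cycles_per_frame
    energy_per_frame_uj = power_w / fps * 1e6        # energy spent on one frame
    return energy_per_cycle_pj, cycles_per_frame, pixels_per_cycle, energy_per_frame_uj
```

With the default arguments, the energy per cycle rounds to 122.83 pJ, matching the table, and the throughput works out to just under two pixels per cycle.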
REFERENCES
[1] K. Danckaert, K. Masselos, F. Catthoor, H. J. De Man, and C. Goutis,
"Strategy for power-efficient design of parallel systems," IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 2, pp. 258-265, Jun.
1999.
[2] P. Burt and E. Adelson, "The Laplacian Pyramid as a Compact Image
Code," IEEE Trans. Commun., vol. 31, no. 4, pp. 532-540, Apr. 1983.
[3] D. G. Lowe, "Distinctive Image Features from Scale-Invariant
Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp.
91-110, 2004.
[4] S. Chen, Q. Guo, H. Leung, and E. Bosse, "A Maximum Likelihood
Approach to Joint Image Registration and Fusion," IEEE Trans. Image
Process., vol. 20, no. 5, pp. 1363-1372, May 2011.
[5] Y. Song, K. Gao, G. Ni, and R. Lu, "Implementation of real-time
Laplacian pyramid image fusion processing based on FPGA,"
Proceedings of the SPIE, 2007, pp. 683316-683316.
[6] V. Popovic, K. Seyid, A. Schmid, and Y. Leblebici, "Real-time
hardware implementation of multi-resolution image blending," in IEEE
Int. Conf. ICASSP, Vancouver, BC, 2013, pp. 2741-2745.
[7] S. M. A. Zeinolabedin, N. Karimi, and S. Samavi, "Low computational
complexity hardware implementation of Laplacian Pyramid," in Proc.
IEEE 18th Iranian Conf. Electr. Eng., 2010, pp. 465-470.
[8] S. M. A. Zeinolabedin, J. Zhou, X. Liu, and T. T. Kim, "An Area- and
Energy-Efficient FIFO Design Using Error-Reduced Data Compression
and Near-Threshold Operation for Image/Video Applications," IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., in press.
[9] S. L. Chen, "VLSI Implementation of a Low-Cost High-Quality Image
Scaling Processor," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60,
no. 1, pp. 31-35, Jan. 2013.
[10] B. K. Mohanty, P. K. Meher, S. Al-Maadeed, and A. Amira, "Memory
Footprint Reduction for Power-Efficient Realization of 2-D Finite
Impulse Response Filters," IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
61, no. 1, pp. 120-133, Jan. 2014.
[11] D. Jeon, Y. Kim, I. Lee, Z. Zhang, D. Blaauw, and D. Sylvester, "A
470mV 2.7mW feature extraction-accelerator for micro-autonomous
vehicle navigation in 28nm CMOS," in IEEE ISSCC Dig. Tech. Papers,
Feb. 2013, pp. 166-167.
[12] T. T. Kim, J. Liu, J. Keane, and C. H. Kim, "Circuit techniques for
ultra-low power subthreshold SRAMs," in Proc. IEEE Int. Symp. Circuits
Syst. (ISCAS), Seattle, Washington, 2008, pp. 2574-2577.
[13] D. G. Bailey, Design for Embedded Image Processing on FPGAs.
Singapore: John Wiley & Sons (Asia), 2011.
[14] M. N. Do and M. Vetterli, "The contourlet transform: an efficient
directional multiresolution image representation," IEEE Trans. Image
Process., vol. 14, no. 12, pp. 2091-2106, Dec. 2005.
[15] X. Liu, J. Zhou, X. Liao, C. Wang, J. Luo, M. Madihian, et al.,
"Ultra-low-energy near-threshold biomedical signal processor for versatile
wireless health monitoring," in Proc. IEEE Asian Solid State Circuits
Conf. (A-SSCC), 2012, pp. 381-384.
[16] J. Zhou, C. Wang, X. Liu, X. Zhang, and M. Je, "A fast and
energy-efficient level shifter with wide shifting range from sub-threshold
up to I/O voltage," in Proc. IEEE Asian Solid State Circuits Conf. (A-SSCC),
2013, pp. 137-140.
Fig. 10. (a) Power measurement result for LP operating at 0.5 V with both
non-adaptive and adaptive modes. (b) MSE comparison of the
“Cameraman” input image in non-adaptive and adaptive modes.
Fig. 11. (a) The LPPE output for the “Cameraman” image. (b) The difference
between the outputs of adaptive and non-adaptive modes. (c) Total area
improvement vs. different input image size.