PAPER
Special Section on VLSI Design and CAD Algorithms
Formal Model for the Reduction of the Dynamic Energy
Consumption in Multi-Layer Memory Subsystems∗
Hongwei ZHU† , Nonmember, Ilie I. LUICAN†† , Student Member, Florin BALASA†††a) ,
and Dhiraj K. PRADHAN†††† , Nonmembers
SUMMARY
In real-time data-dominated communication and multimedia processing applications, a multi-layer memory hierarchy is typically
used to enhance the system performance and also to reduce the energy consumption. Savings of dynamic energy can be obtained by accessing frequently used data from smaller on-chip memories rather than from large
background memories. This paper focuses on the reduction of the dynamic
energy consumption in the memory subsystem of multidimensional signal
processing systems, starting from the high-level algorithmic specification
of the application. The paper presents a formal model which identifies those parts of arrays that are more intensely accessed, also taking into account the relative lifetimes of the signals. Tested on a two-layer memory hierarchy,
this model led to savings of dynamic energy from 40% to over 70% relative
to the energy used in the case of flat memory designs.
key words: multimedia processing applications, memory allocation, dynamic energy consumption, signal-to-memory mapping

Manuscript received February 29, 2008. Manuscript revised June 24, 2008.
† The author is with the Physical IP Division, ARM Inc., Sunnyvale, California, U.S.A.
†† The author is with the Dept. of Computer Science, Univ. of Illinois at Chicago, U.S.A.
††† The author is with the Dept. of Computer Science and Information Systems, Southern Utah University, U.S.A.
†††† The author is with the Dept. of Computer Science, Bristol University, U.K.
∗ The content is based on the conference papers published in the proceedings of ICCAD 2006 [19] and SASIMI 2007 [20].
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e91-a.12.3559
1. Introduction
A typical architecture of an embedded signal processing
system includes programmable hardware (e.g., DSP core),
custom hardware (application-specific accelerator datapaths
and logic), controller, and a distributed memory organization [7]. The on-chip memory subsystem is often complemented by an external (off-chip) memory for storing the
large amounts of data during the execution of the application.
The memory subsystem is typically a major contributor to the overall energy budget of the system [7]. Savings of dynamic energy (which is expended only when memory accesses occur) at the level of the whole memory subsystem can mainly be obtained by accessing frequently used
data from smaller on-chip memories rather than from large
background (off-chip) memories. As on-chip storage, the
scratch-pad memories (SPMs) — software-controlled static
or dynamic random-access memories, more energy-efficient
than caches — are widely used in embedded systems, in which the flexibility of caches in terms of workload adaptability is often unnecessary, whereas power consumption and cost play a much more critical role. Unlike caches, the SPM occupies one distinct part of the virtual address space, with the rest of the address space occupied by the main memory. One consequence is that there is no need to check for the availability of the data in the SPM; hence, the SPM needs no comparator or miss/hit acknowledging circuitry [3], which contributes to a significant energy (as well as area) reduction. Another consequence is that in cache memory systems the mapping of data to the cache is done at run-time, whereas in SPM-based systems this can be done either manually by the designer or automatically, by a compiler, using a suitable algorithm.
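As a small illustration of this address-space split, consider the following sketch (the base address and SPM size are made-up illustrative values, not from the paper): because data placed in the SPM range is always found there, no tag comparison or hit/miss logic is needed, and a single static range check decides which memory services an access.

# Illustrative sketch: the SPM covers one fixed range of the address space,
# so locating a datum is a range test rather than an associative tag lookup.
SPM_BASE, SPM_SIZE = 0x0000, 2048        # hypothetical 2-KB scratch-pad

def services_access(addr):
    return "SPM" if SPM_BASE <= addr < SPM_BASE + SPM_SIZE else "main memory"

print(services_access(0x0100), services_access(0x4000))   # SPM main memory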
With the scaling of the technology below 0.1 µm, the
static energy due to leakage currents has become increasingly important. While leakage is a problem for any transistor, it is even more critical for memories: their high density of integration translates into a higher power density that
increases temperature, which in turn increases leakage currents significantly. As technology scales, the importance of
static energy consumption increases even when memories
are idle (not accessed). To reduce the static energy, proper
schemes to put a memory block into a dormant (sleep)
state with negligible energy spending are required. These
schemes normally imply a timing overhead: transitioning a
memory block into and, especially, out of the dormant state
consumes energy and time. Putting a memory block into
the dormant state should be done only if the cost in extra
energy and decrease of performance can be amortized. For
dealing with dynamic energy, we are interested only in the total number of accesses, not in their distribution over time. Introducing the time dimension makes the problem of
energy reduction much more complex.
1.1 Previous Works
The energy-efficient assignment of signals to the on- and
off-chip memories has been studied since the late nineties.
These previous works focused on partitioning the arrays into
copy candidates and on the optimal selection and mapping
of these into the memory hierarchy [6], [15], [26]. The general idea was to identify the data (arrays or parts of arrays)
that are most frequently accessed in each loop nest. Copying
these heavily accessed data from the large off-chip memory
to a smaller on-chip memory can potentially save energy
(since most accesses will take place on the smaller copy and
not on the large, more energy consuming, original array)
and also improve performance† . Many different possibilities exist for deciding on which parts of the arrays should
be copy candidates and, also, for selecting among the candidates those which will be instantiated as copies and their assignment to the different memory layers. For instance, Kandemir and Choudhary analyze and exploit the temporal locality by inserting local copies [16]. Their layer assignment
builds a separate hierarchy per loop nest and then combines
them into a single hierarchy. However, the approach lacks a
global view on the (part of) arrays lifetimes in applications
having imperfect nested loops. Brockmeyer et al. use the
steering heuristic of assigning the arrays having the lowest
access number over size ratio to the cheapest memory layer
first, followed by incremental reassignments [6]. Hu et al.
can use parts of arrays as copies, but they typically use cuts
along the array dimensions [15] (like rows and columns of
matrices).

† Note that this problem is basically different from caching for performance [12], where the question is how to fill the cache such that the data needed have been loaded in advance from the main memory.
The energy-aware partitioning of an on-chip memory in multiple banks has been studied by several research
groups, as well. Techniques of an exploratory nature analyze possible partitions, matching them against the access
patterns of the application [8], [23]. Other approaches exploit the properties of the dynamic energy cost and the resulting structure of the partitioning space to come up with
algorithms able to derive the optimal partition for a given
access pattern [1], [5]. Leakage-aware partitioning of memory structures is addressed at the circuit level — especially for
caches. The cache-decay architecture turns off the cache
lines during the time they are not used [18]. The drowsy
cache architecture puts the cache lines into state-preserving
low-power modes based on usage statistics [13]. These techniques, together with dynamic resizing, require the modification of the internal structure of caches, which are normally highly-optimized designs. On a higher level of abstraction, Kandemir et al. exploit bank locality for maximizing the idleness, thus ensuring maximal amortization of the
energy spent on memory re-activation [17]. Golubeva et al.
proposed a trace-based architectural approach, considering
both the dynamic and static energy consumption [14].
The storage allocation in a hierarchical organization
(on- and off-chip) [21] and the energy-aware partitioning of
the SPM into several blocks must be complemented with a
comprehensive solution for mapping the multidimensional
arrays from the code of the application into the physical
memories. This operation aims (a) to map the arrays into
an amount of data storage as small as possible, (b) to use mapping functions simple enough to keep the address generation hardware reasonably simple, and (c) to ascertain that distinct scalar signals (array elements) that are simultaneously alive are mapped to distinct storage
locations. Good overviews of mapping techniques are given
in [10] and [27].
1.2 Research Motivation and Contributions
The motivation of this research is summarized as follows:
1. The lack of a general model for identifying those parts of arrays from a given application code that are more intensely accessed, in order to steer their assignment to different memory layers such that the dynamic energy consumption is reduced, while also taking into account the relative lifetimes of signals in order to reduce the data storage on each hierarchical level of the memory subsystem.
Such a model could be extended to also steer the partitioning of the memory on each hierarchical layer in order to further reduce the dynamic energy consumption. In a later phase, the static energy could be considered as well, by taking into account the distribution of the memory accesses over time. Note that existing works use as input either a memory trace, which can be very inefficient for applications requiring a very long sequence of datapath instructions (as in the case of deep loop nests, where the ranges of the iterators are large), or specifications that are overly constrained from a syntactical point of view (like specifications having only perfectly nested loops).
2. The lack of a model for mapping the multidimensional arrays (signals) to the data memory that takes into account the distributed structure of the memory subsystem and also exploits the possible memory sharing between the elements of different arrays.
The existing signal-to-memory mapping techniques focus on shrinking the mapping window of each array separately, implicitly allowing memory sharing only between elements of the same array. This usually leads to excessive data storage allocation. In addition, the existing mapping techniques do not take into account that the memory subsystem may have a distributed structure.
This paper presents a formal model based on lattices [22], which makes it possible to identify with accuracy those parts of arrays from a given application code that are more intensely accessed for read or write operations. Storing these parts of arrays on-chip yields the highest reduction of the dynamic energy consumption in a hierarchical memory subsystem. Since the mathematical model is very general, the proposed approach is able to handle the entire class of affine specifications [7]: the code is organized in sequences of loop nests having as boundaries linear functions of the outer loop iterators; conditional instructions where the conditions may be either data-dependent or data-independent (relational and/or logical operators of linear functions of loop iterators); and multidimensional signals whose array references have
(possibly complex) linear indices. For the time being, the model was tested assuming two memory layers — on-chip scratch-pad and off-chip memories — focusing on the reduction of the dynamic energy consumption due to memory accesses. Extensions of the model to the exploitation of memory banking, to an arbitrary number of memory layers, and to taking the leakage energy consumption into account will be considered in the future.
In addition, this methodology solves in a consistent way the problem of mapping the multidimensional arrays to the physical memory†. Similarly to [25], our approach computes bounding boxes for the live elements in the index space of arrays; but since this algorithm works not only for entire arrays but also for parts of arrays (for instance, array references or, more generally, sets of array elements represented by lattices), this signal-to-memory mapping technique can also be applied in a multi-layer memory hierarchy [27].

† Although our methodology for memory management takes into account the mapping problem as well, this topic is only marginally addressed in this paper. A thorough presentation of our mapping model able to cope with memory hierarchy is given in [27].
The rest of the paper is organized as follows. Section 2
presents a formal model used to detect the parts of the arrays intensely accessed during memory operations when the
high-level specification code of the application is executed.
Section 3 discusses the assignment of these array parts to
the memory layers of the hierarchical memory subsystem
such that the reduction of the dynamic energy consumption
in the hierarchical memory subsystem be maximized subject to the on-chip storage constraints. Section 4 discusses
implementation aspects and presents experimental results.
Finally, Sect. 5 summarizes the main conclusions of this research.
2. Polyhedral Framework for the Reduction of the Dynamic Energy Consumption in the Memory Subsystem
2.1 Basic Definitions
An array reference $M[x_1(i_1, \ldots, i_n)] \cdots [x_m(i_1, \ldots, i_n)]$ of an m-dimensional signal M, in the scope of a nest of n loops having the iterators $i_1, \ldots, i_n$, is characterized by an iterator space and an index space. The iterator space signifies the set of all iterator vectors $\mathbf{i} = (i_1, \ldots, i_n) \in Z^n$ in the scope of the array reference. The index (or array) space is the set of all index vectors $\mathbf{x} = (x_1, \ldots, x_m) \in Z^m$ of the array reference. When the indices of an array reference are linear expressions with integer coefficients of the loop iterators, the index space consists of one or several linearly bounded lattices [24]:

$$\{\, \mathbf{x} = T \cdot \mathbf{i} + \mathbf{u} \in Z^m \mid A \cdot \mathbf{i} \ge \mathbf{b},\ \mathbf{i} \in Z^n \,\} \qquad (1)$$

where $\mathbf{x} \in Z^m$ is the index vector of the m-dimensional signal and $\mathbf{i} \in Z^n$ is an n-dimensional iterator vector. For instance,
in the nested loops

for (i = 0; i <= 2; i++)
    for (j = 0; j <= 3; j++)
        ... A[3i+j][5i+2j] ...

the index space of the array reference A[x][y] is

$$\left\{ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \;\middle|\; \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \ge \begin{bmatrix} 0 \\ -2 \\ 0 \\ -3 \end{bmatrix} \right\}$$
Without loss of generality, it will be assumed throughout this paper that each array reference can be represented as a single lattice, although several lattices may be needed in the general case (for instance, an array reference in the scope of a condition if (i ≠ j) has two lattices, one for i ≥ j + 1 and one for i ≤ j − 1).
The specifications are considered to be procedural, therefore the execution ordering is induced by the loop structure and is thus fixed††. The goal is to identify the parts of the arrays in the given algorithmic specification that are heavily accessed during the code execution. This can be accomplished (as will be seen) by partitioning each index space into sets which are all (linearly bounded) lattices.

†† The search space becomes much larger still when the available freedom in loop organization is also incorporated. If the original loop ordering is not optimally suited to exploit data locality, code transformations should be applied (as in [15], for instance) in an earlier phase to increase the locality.
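To make definition (1) concrete, the following minimal executable sketch (the class name and the brute-force enumeration helper are ours, not the paper's implementation, which manipulates the lattices symbolically) represents a linearly bounded lattice and instantiates it for the array reference A[3i+j][5i+2j] with 0 ≤ i ≤ 2, 0 ≤ j ≤ 3 used above:

# A linearly bounded lattice (1): x = T*i + u for all integer iterator
# vectors i satisfying A*i >= b (matrices stored row-major as lists).
from dataclasses import dataclass
from itertools import product

@dataclass
class LBL:
    T: list; u: list; A: list; b: list

    def points(self, box):
        # Enumerate index vectors x = T*i + u over a bounding box of
        # iterator values (enumeration is for illustration only).
        for i in product(*[range(lo, hi + 1) for lo, hi in box]):
            if all(sum(a*v for a, v in zip(row, i)) >= bk
                   for row, bk in zip(self.A, self.b)):
                yield tuple(sum(t*v for t, v in zip(row, i)) + uk
                            for row, uk in zip(self.T, self.u))

ref = LBL(T=[[3, 1], [5, 2]], u=[0, 0],
          A=[[1, 0], [-1, 0], [0, 1], [0, -1]], b=[0, -2, 0, -3])
print(sorted(set(ref.points([(0, 2), (0, 3)]))))   # the 12 index points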
2.2 The Index Space of an Array Reference
Let $\{\, \mathbf{x} = T \cdot \mathbf{i} + \mathbf{u} \mid A \cdot \mathbf{i} \ge \mathbf{b} \,\}$ be the linearly bounded lattice of a given array reference. This section will show how to model the index space of an array reference, that is, which relations are satisfied by the coordinates $\mathbf{x}$ of the points in this set. After the theoretical part, illustrative examples will be provided.
For any matrix $T \in Z^{m \times n}$ having $\mathrm{rank}\, T = r$, and assuming the first r rows of T are linearly independent†††, there exists a unimodular matrix†††† $S \in Z^{n \times n}$ such that

$$T \cdot S = \begin{bmatrix} H_{11} & 0 \\ H_{21} & 0 \end{bmatrix}$$

where $H_{11} \in Z^{r \times r}$ is a lower triangular matrix with positive diagonal elements, and $H_{21} \in Z^{(m-r) \times r}$ [22]. The block matrix is called the reduced Hermite form of matrix T.

Let $S^{-1} \mathbf{i} = \mathbf{j} \stackrel{\mathrm{def}}{=} \begin{bmatrix} \mathbf{j}_1 \\ \mathbf{j}_2 \end{bmatrix}$, where $\mathbf{j}_1$, $\mathbf{j}_2$ are r-, respectively (n − r)-dimensional vectors. Then

$$\mathbf{x} = T\mathbf{i} + \mathbf{u} = TS\mathbf{j} + \mathbf{u} = \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} \mathbf{j}_1 + \mathbf{u} \qquad (2)$$

Denoting $\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}$ and $\mathbf{u} = \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}$ (where $\mathbf{x}_1$, $\mathbf{u}_1$ are r-dimensional vectors), it follows that $\mathbf{x}_1 = H_{11} \mathbf{j}_1 + \mathbf{u}_1$. As $H_{11}$ is nonsingular (being lower triangular of rank r), $\mathbf{j}_1$ can be obtained explicitly:

$$\mathbf{j}_1 = H_{11}^{-1} (\mathbf{x}_1 - \mathbf{u}_1) \qquad (3)$$
††† This assumption does not decrease the generality: it is made only to simplify the formulas, which would otherwise be affected by a row permutation matrix.
†††† A square matrix with integer elements whose determinant is ±1.
The iterator vector $\mathbf{i}$ results with a simple substitution:

$$\mathbf{i} = S \begin{bmatrix} \mathbf{j}_1 \\ \mathbf{j}_2 \end{bmatrix} = \begin{bmatrix} S_1 & S_2 \end{bmatrix} \begin{bmatrix} H_{11}^{-1}(\mathbf{x}_1 - \mathbf{u}_1) \\ \mathbf{j}_2 \end{bmatrix} = S_1 H_{11}^{-1}(\mathbf{x}_1 - \mathbf{u}_1) + S_2 \mathbf{j}_2$$

where $S_1$ and $S_2$ are the submatrices of S containing the first r, respectively the last n − r, columns of S. As the iterator vector must represent a point inside the iterator space $A \cdot \mathbf{i} \ge \mathbf{b}$, it follows that:

$$A S_1 H_{11}^{-1} \mathbf{x}_1 + A S_2 \mathbf{j}_2 \ge \mathbf{b} + A S_1 H_{11}^{-1} \mathbf{u}_1 \qquad (4)$$
If r < n, the n − r variables of j2 can be eliminated
with the Fourier-Motzkin technique [9].
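As an illustration of that elimination step, here is a minimal Fourier-Motzkin sketch in our own notation (the cited work [9] presents the general technique). Applied to the system 0 ≤ t ≤ 2, 4 ≤ x − 3t ≤ 7, which reappears in Example 3 below, eliminating t leaves 4 ≤ x ≤ 13:

def fm_eliminate(rows, var):
    # rows encode sum(c[k]*z[k]) <= d as pairs (c, d); variable 'var' is eliminated.
    lower, upper, rest = [], [], []
    for c, d in rows:
        (lower if c[var] < 0 else upper if c[var] > 0 else rest).append((c, d))
    out = list(rest)
    for cl, dl in lower:               # pair every lower bound with every upper bound
        for cu, du in upper:
            fl, fu = -cl[var], cu[var] # positive integer scaling factors
            out.append(([fu*a + fl*b for a, b in zip(cl, cu)], fu*dl + fl*du))
    return out

# 0 <= t <= 2 and 4 <= x - 3t <= 7, with variables z = (t, x):
rows = [([-1, 0], 0), ([1, 0], 2), ([3, -1], -4), ([-3, 1], 7)]
for c, d in fm_eliminate(rows, 0):     # eliminate t
    print(c, "<=", d)   # yields -x <= -4 and x <= 13 (plus two trivial 0 <= const rows)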
As the rows of matrix $H_{11}$ are r linearly independent r-dimensional vectors, each row of $H_{21}$ is a linear combination of the rows of $H_{11}$. Then, from (2), it results that there exists a matrix $C \in Z^{(m-r) \times r}$ such that†

$$\mathbf{x}_2 - \mathbf{u}_2 = C \cdot (\mathbf{x}_1 - \mathbf{u}_1) \qquad (5)$$

† The coefficients of matrix C are determined by backward substitutions from the equations $H_{21}.\mathrm{row}(i) = \sum_{j=1}^{r} c_{ij} \cdot H_{11}.\mathrm{row}(j)$ for any $i = 1, \ldots, m - r$.
Taking into account that the elements of $\mathbf{j}_1$ must be integers, it follows (by multiplying and dividing the right member of (3) by $\det H_{11}$) that the points $\mathbf{x}$ inside the index space must additionally satisfy the divisibility constraints

$$\det H_{11} \,\big|\, \mathbf{h}_i^T (\mathbf{x}_1 - \mathbf{u}_1), \quad \forall i = 1, \ldots, r \qquad (6)$$

where $\mathbf{h}_i^T$ are the rows of the matrix with integer coefficients $\det H_{11} \cdot H_{11}^{-1}$, and $a \mid b$ means "a divides b." According to (6), when r = n, the points $\mathbf{x}$ are uniformly spaced along the r linearly independent coordinates, the sizes of the gaps in these dimensions being equal to the diagonal elements of $H_{11}$: if $h_{ii}$ are the diagonal elements of matrix $H_{11}$, it can be verified that the divisibility constraints (6) are not affected when $\mathbf{x}_1$ is subject to translations by the vectors $\mathbf{v}_i = [0 \cdots h_{ii} \cdots 0]^T$, $\forall i = 1, \ldots, r$. Indeed, $\mathbf{h}_i^T(\mathbf{x}_1 - \mathbf{u}_1 + \mathbf{v}_i) = \mathbf{h}_i^T(\mathbf{x}_1 - \mathbf{u}_1) + \mathbf{h}_i^T \mathbf{v}_i = \mathbf{h}_i^T(\mathbf{x}_1 - \mathbf{u}_1) + \det H_{11}$.
The system of inequalities (4), the equations (5), and
the divisibility conditions (6) characterize the index space
of the given array reference. Several examples will illustrate
the generality of this model.
Example 1: for (i = 0; i <= 2; i++)
               for (j = 0; j <= 3; j++)
                   ... A[3i][5i+2j] ...

Since $T = H_{11} = \begin{bmatrix} 3 & 0 \\ 5 & 2 \end{bmatrix}$, $\mathbf{u} = \mathbf{u}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $S = S_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\det H_{11} = 6$, and $\det H_{11} \cdot H_{11}^{-1} = \begin{bmatrix} 2 & 0 \\ -5 & 3 \end{bmatrix}$ ($S_2$, $\mathbf{j}_2$ do not exist since $n - r = 2 - 2 = 0$; $H_{21}$, $\mathbf{x}_2$, $\mathbf{u}_2$ do not exist since $m - r = 2 - 2 = 0$), the inequalities (4) with $\mathbf{x}_1 = \begin{bmatrix} x \\ y \end{bmatrix}$ are $6 \ge x \ge 0$ and $18 \ge -5x + 3y \ge 0$, representing the first quadrilateral in Fig. 1. Not all the lattice points in the quadrilateral have as coordinates the index values of the array reference: only the lattice points also satisfying the divisibility conditions (6), namely $6 \mid 2x$ (that is, $3 \mid x$) and $6 \mid -5x + 3y$, belong to the index space. Note also that these lattice points, colored black in the figure, are uniformly spaced along the two axes Ox and Oy, the sizes of the gaps in these dimensions being 3 and 2, the diagonal elements of $H_{11}$.

Fig. 1  The index spaces of the array references in Example 1 and Example 2.
Example 2: for (i = 0; i <= 2; i++)
               for (j = 0; j <= 3; j++)
                   ... A[3i+j][5i+2j] ...

Here $T = \begin{bmatrix} 3 & 1 \\ 5 & 2 \end{bmatrix}$, $H_{11} = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix}$, $S = \begin{bmatrix} 0 & 1 \\ 1 & -3 \end{bmatrix}$, and $S_1 = S$; $S_2$, $\mathbf{j}_2$ do not exist since $n - r = 2 - 2 = 0$, and $H_{21}$, $\mathbf{x}_2$, $\mathbf{u}_2$ do not exist since $m - r = 2 - 2 = 0$. The inequalities (4) are $2 \ge 2x - y \ge 0$ and $3 \ge -5x + 3y \ge 0$. As $|\det H_{11}| = 1$, there are no divisibility conditions (6).
Example 3: for (i = 0; i <= 2; i++)
               for (j = 0; j <= 3; j++)
                   ... A[3i+j+4][6i+2j+7] ...

Since $T = \begin{bmatrix} 3 & 1 \\ 6 & 2 \end{bmatrix}$, $\mathbf{u} = \begin{bmatrix} 4 \\ 7 \end{bmatrix}$, and $S = [\, S_1 \ \ S_2 \,] = \begin{bmatrix} 0 & 1 \\ 1 & -3 \end{bmatrix}$, it follows that $T \cdot S = \begin{bmatrix} 1 & 0 \\ 2 & 0 \end{bmatrix}$, $H_{11} = [1]$, and $H_{21} = [2]$. Since $m - r = 2 - 1 = 1$, the points $(x, y)$ in the index space satisfy one equation of type (5): $y - 7 = 2(x - 4)$. The system of inequalities (4) is $2 \ge t \ge 0$, $7 \ge x - 3t \ge 4$, where the vector $\mathbf{j}_2$ has one element (since $n - r = 1$) denoted $t$. The system of inequalities can be reduced here to $13 \ge x \ge 4$. There are no divisibility conditions (6) since $\det H_{11} = 1$.
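The column reduction behind these examples can be sketched in a few lines. The following is a minimal sketch under simplifying assumptions (small dense integer matrices, no attention to coefficient growth); all names are ours, and the unimodular S it returns may differ from the matrices quoted above by column sign flips, which are themselves unimodular operations:

def ext_gcd(a, b):
    # Return (g, p, q) with p*a + q*b == g == gcd(a, b) >= 0.
    if b == 0:
        return (abs(a), 1 if a >= 0 else -1, 0)
    g, p, q = ext_gcd(b, a % b)
    return (g, q, p - (a // b) * q)

def reduced_hermite(T):
    # Compute a unimodular S with T*S = [H | 0], H lower triangular
    # with positive diagonal (the reduced Hermite form of Sect. 2.2).
    m, n = len(T), len(T[0])
    T = [row[:] for row in T]                       # work on a copy
    S = [[int(i == j) for j in range(n)] for i in range(n)]

    def combine(j, k, a, b, c, d):
        # Right-multiply by a unimodular matrix acting on columns j, k:
        # col_j <- a*col_j + c*col_k ; col_k <- b*col_j + d*col_k.
        for M in (T, S):
            for row in M:
                row[j], row[k] = a*row[j] + c*row[k], b*row[j] + d*row[k]

    r = 0                                           # current pivot column
    for i in range(m):
        for k in range(r + 1, n):                   # zero out row i right of the pivot
            if T[i][k]:
                g, p, q = ext_gcd(T[i][r], T[i][k])
                combine(r, k, p, -T[i][k] // g, q, T[i][r] // g)
        if r < n and T[i][r]:
            if T[i][r] < 0:                         # make the pivot positive
                for M in (T, S):
                    for row in M:
                        row[r] = -row[r]
            r += 1
    return T, S                                     # T is now [H | 0]

print(reduced_hermite([[3, 0], [5, 2]]))   # Example 1: H = T, S = identity
print(reduced_hermite([[3, 1], [6, 2]]))   # Example 3: H11 = [1], H21 = [2]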
While the basic operations with lattices are performed using their definition (1) as images of iterator polytopes, it is the array (index) space of the signals that has to be partitioned (based on the intensity of memory accesses). The mapping of the iterator space into the array space of a signal performed by a linearly bounded lattice (or, in particular, by an array reference) — described and exemplified in this section — is used in the array space partitioning for all the signals in the algorithmic specification (see the next section).

Fig. 2  (a) Illustrative algorithmic specification. (b) The partitions (disjoint lattices) of the array space of the signal A — obtained as explained in Sect. 2.3 — and their numbers of memory accesses. The index space of each lattice is represented (see Sect. 2.2) only by inequalities (4) (as the equality and divisibility conditions (5) and (6) do not appear in this example). (c) The gain factors of the different lattices partitioning the array space of signal A. The darker partitions are more heavily accessed.
2.3 The Partitioning of the Array Space of a Signal
The decomposition of the array space of every indexed signal into disjoint bounded lattices can be performed analytically by a recursive intersection, starting from the array references in the code. Let S be a multidimensional signal in
the algorithmic specification. A high-level pseudo-code of
its array space partitioning is given below:
let LS be the initial collection of lattices of signal S;
// these are the lattices of the array references of S
do
for (each pair of lattices (L1 ,L2 ) in the collection LS )
compute L = L1 ∩ L2 ;
if (the intersection L is not empty)
add the new lattice L to the collection LS ;
compute L1 − L2 and L2 − L1 ;
add each new lattice in the differences to LS ;
end if;
end for;
until (no new lattice can be added to the collection LS );
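The loop below is a small executable illustration of this refinement, in a simplified variant that directly replaces each overlapping pair by its disjoint pieces. Real lattice intersection and difference are computed symbolically, as detailed in [2]; here, purely for illustration, each "lattice" is a finite set of index points, so the two operations are plain set operations:

def partition(refs):
    # Refine a collection of (possibly overlapping) point sets into
    # disjoint pieces, mirroring the pseudo-code above.
    parts = {frozenset(r) for r in refs if r}
    changed = True
    while changed:                       # until no new piece can be produced
        changed = False
        for p, q in [(p, q) for p in parts for q in parts if p is not q]:
            if p & q:
                parts -= {p, q}          # replace the pair by its disjoint pieces
                parts |= {p & q, p - q, q - p} - {frozenset()}
                changed = True
                break                    # restart the scan on the new collection
    return parts

# Two 1-D "array references" overlapping on 3..5 split into three pieces:
refs = [range(0, 6), range(3, 9)]
print(sorted(tuple(sorted(p)) for p in partition(refs)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]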
The intersection and difference operations with linearly
bounded lattices are thoroughly explained in [2]. Figure 2(b)
shows the result of the array space decomposition for the
2-dimensional signal A from the illustrative (affine) algorithmic specification in Fig. 2(a). The array space of A was
partitioned into 11 disjoint lattices† A0, A1, . . ., A10 — as
shown in Figs. 2(b)–(c); each array reference in the code is
either a disjoint lattice itself or it can be written as a union
of disjoint lattices. For instance,
A[4][j] (1st loop nest) = A8
A[k][i] (1st loop nest) = A0 ∪ A2 ∪ A4 ∪ . . . ∪ A10
A[5][j] (2nd loop nest) = A10
A[k][i] (2nd loop nest) = A2 ∪ A3 ∪ . . . ∪ A10
A[6][j] (3rd loop nest) = A9
A[k][i] (3rd loop nest) = A1 ∪ A3 ∪ . . . ∪ A10
The partitioning of the array space into disjoint lattices makes it possible to build a map of the intensity of memory accesses and to detect with precision the zones of the array space which are the most heavily accessed.

† Since there are no divisibility conditions (6), the lattices are actually Z-polytopes [11] in the array space.
3. Hierarchical Memory Layer Assignment
The decomposition of the index space of each multidimensional signal makes it possible to compute the number of memory accesses when addressing the different parts of the arrays. In
order to determine the number of memory accesses to the
array elements covered by a certain lattice, the following
computation scheme is used:
#accesses = 0;
for (all the array references including the given lattice)
    select an array reference and find the set of iterators
        mapped by the array reference into that lattice;
    #accesses += size of this set;
end for;
Example: Compute the number of memory (read) accesses to the partition (lattice) A10 (see Fig. 2(b)).
Since A10 is included in the array references A[k][i] from all the three loop nests and, in addition, it coincides with the operand A[5][j] from the second loop nest, the contributions of these 4 array references must be computed. Since the set A10 is $\{x = 5,\ y = t \mid 256 \ge t \ge 32\}$ and the lattice of the array reference A[k][i] in the first loop nest is $\{x = k,\ y = i \mid 256 \ge j \ge 32,\ 8 \ge k \ge 0,\ j + 32 \ge i \ge j - 32\}$, the set of the iterators mapped by the array reference A[k][i] into A10 is $\{j = t_1,\ k = 5,\ i = t_2 \mid 256 \ge t_1, t_2 \ge 32,\ t_1 + 32 \ge t_2 \ge t_1 - 32\}$. The size of this set is 13,569 [4], [11].
The contributions of the other two array references A[k][i] are also 13,569 accesses each. Since the set of the iterators mapped by the array reference A[5][j] into A10 is $\{j = t_1,\ k = t_2,\ i = t_3 \mid 256 \ge t_1 \ge 32,\ 9 \ge t_2 \ge 1,\ t_1 + 32 \ge t_3 \ge t_1 - 32\}$, the contribution of A[5][j] is 131,625 accesses — the size of this set. In conclusion, the total number of accesses to the lattice A10 is 13,569 × 3 + 131,625 = 172,332. Figure 2(b) also displays the number of read accesses for each of the 11 lattices partitioning the array space of signal A.
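As a sanity check, both counts can be reproduced by brute-force enumeration over the iterator polytopes quoted above (illustration only; the paper evaluates such counts with the polynomial-time lattice-point counting techniques of [4], [11]):

# Iterations of A[k][i] in the first nest that land in A10 (x = k = 5,
# 32 <= y = i <= 256), and iterations of A[5][j] in the second nest
# (every one of them reads an element of A10).
n1 = sum(1 for j in range(32, 257) for k in range(0, 9)
           for i in range(j - 32, j + 33)
           if k == 5 and 32 <= i <= 256)
n2 = sum(1 for j in range(32, 257) for k in range(1, 10)
           for i in range(j - 32, j + 33))
print(n1, n2)    # 13569 131625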
Similarly to [15], the potential benefit of loading a partition into the scratch-pad memory is quantified by a gain factor, the ratio between the number of memory accesses to the partition and its size; for instance, the lattice A10 (225 elements, 172,332 accesses) has a gain factor of 172,332 / 225 = 765.92.
In Fig. 2(c) the gain factors are indicated on the index
space of signal A, the darker areas being those zones more
heavily accessed. Note that the memory accesses are not
uniformly distributed inside the partitions. (For instance,
A[0][48] in the partition A0 is accessed 49 times, whereas
A[0][148] is accessed 65 times.) However, this approach makes it possible to identify those parts of arrays that are more accessed than others without descending to the scalar level — which could be computationally prohibitive. Unlike other previous works, the parts of arrays that are candidates to be loaded into the SPM are not restricted to array references and their cuts along the coordinates: they are the partitions of the array space exhibiting high gain factors. It can be seen from
Fig. 2(c) that the array reference A[k][i] in the first loop nest,
covering the columns 0 to 8 of the array space, has zones
accessed with very different intensities. Even cutting along
the dimensions x and y, the cut lines would intersect areas of
very different gain factors. The exploration of the partitions
having high gain factors leads to a better reduction of the
dynamic energy, as shown in Sect. 4.
Note that every element of the arrays D and Dopt is accessed exactly once for reading and once for writing. Actually, it can be shown that the storage requirement for each of the signals D and Dopt is only one memory location: two registers in the datapath represent the best memory allocation solution for these two signals.
The mapping of the signal partitions to the off-chip and
scratch-pad memories is done using a model thoroughly described in [27]. According to it, for each n-dimensional sig-
nal A, a maximal window WA = (w1 , . . . , wn ) bounding the
simultaneously alive elements is computed; any access to an
element A[index1 ] . . . [indexn ] of the array is then redirected
to a memory location WA [index1 mod w1 ] . . . [indexn mod
wn ] relative to a base address. The bounding window is computed such that any two distinct live array elements cannot
be mapped by the modulo operations to the same location
into the physical memory. Due to our lattice-based framework, the bounding windows are computed not only for
the whole arrays, but also for the partitions (lattices) to be
loaded in the SPM (their windows being typically “smaller”
than the corresponding array window).
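A minimal sketch of this redirection follows (the row-major layout inside the window is our assumption for illustration; the actual window computation is the subject of [27]):

def window_address(base, window, index):
    # Redirect A[i1]...[in] to W_A[i1 mod w1]...[in mod wn], laid out
    # row-major relative to a base address.
    offset = 0
    for w, i in zip(window, index):
        offset = offset * w + (i % w)
    return base + offset

# With a hypothetical 2-by-3 window at base 0: A[5][7] -> 4 and A[6][8] -> 2,
# while A[7][10] -> 4 would collide with A[5][7]; the window is sized so that
# two simultaneously alive elements never collide this way.
print(window_address(0, (2, 3), (5, 7)),
      window_address(0, (2, 3), (6, 8)),
      window_address(0, (2, 3), (7, 10)))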
4. Experimental Results
A hierarchical memory allocation tool has been implemented in C++, incorporating the polyhedral model described in this paper. For the time being, the tool supports
only a two-level memory hierarchy, where an SPM is used
between the main memory and the processor core. The dynamic energy is computed based on the number of accesses
to each memory layer. In computing the dynamic energy
consumptions for SPM and main (off-chip) memory, the
CACTI v4.2 power model [28] is used†. In general, the ratio between the energy consumed by a main memory access and by an SPM access is between one and two orders of magnitude. The energy per access for an SPM is not constant,
but a size-dependent function — the energy per access tends
to increase as the SPM size grows; however, for small SPM
sizes up to a few KBytes the energy per access is relatively
constant. Typical SPM and main memory energy values for
read accesses are 0.048 nJ and 3.57 nJ, respectively (assuming memory sizes used in the illustrative example). The dynamic energy values for write accesses are slightly higher.
Figure 3 displays the dynamic energy consumption as a function of the SPM size. The first bar is the reference and corresponds to a "flat" memory design, in which all operands have to be retrieved from the main memory.

Fig. 3  Variation of the dynamic energy consumption with the SPM size for the illustrative example in Fig. 2(a).

† The CACTI model can be used to estimate the energy consumption for SPMs as explained and exemplified in [3] (Appendix A).
Table 1  Experimental results.

Application   | #Array refs. | #Scalars   | #Memory accesses | Mem. size | Dyn. energy 1-layer [µJ] | SPM size | Energy saved [6] | Energy saved [15] | Energy saved | CPU [sec]
Motion estim. | 13           | 265,633    | 864,900          | 2,465     | 3,088                    | 1,416    | 38.7%            | 40.7%             | 50.7%        | 23
Durbin alg.   | 21           | 252,499    | 1,004,993        | 1,249     | 3,588                    | 764      | 55.2%            | 58.5%             | 73.2%        | 28
SVD updating  | 85           | 3,045,447  | 6,227,124        | 34,950    | 22,231                   | 12,672   | 35.9%            | 38.4%             | 46.0%        | 37
Vocoder       | 236          | 33,619     | 200,000          | 11,890    | 714                      | 3,879    | 30.8%            | 32.5%             | 39.5%        | 8
Dyn. prog.    | 3,992        | 21,082,751 | 83,834,000       | 124,751   | 299,287                  | 27,316   | 43.3%            | 46.6%             | 56.1%        | 47
The second bar shows the energy used when
the partition A8 (of size 225 and gain factor 765.92) is
copied from the main memory to the SPM and it is accessed
afterwards from there. Note that placing A8, A9, and A10
in the SPM leads to almost 65% energy savings for an SPM
size of 675 bytes (the 4th bar). The last two columns correspond to an SPM large enough to contain the partition A4,
as well as A5. Furthermore, this approach allows savings
in the overall data memory size: since some of the lattices
have different lifetimes, different such partitions of the array
space can share the same memory locations. For instance,
the partition (lattice) A0 is read only in the first loop nest (as
part of the operand A[k][i]), whereas A3 is read only in
the 2nd and 3rd loop nests; hence, A3 can replace A0 in the
data memory without any increase of its size, this being also the case for A2 (part of the operands A[k][i] in the 1st and 2nd loop nests) and A1 — read only in the 3rd loop
nest. In conclusion, although the total number of scalars in
the illustrative example is 399,405, a data memory of 3,179
locations would suffice since the signals D and Dopt can be
stored in two datapath registers. Moreover, due to the possible memory sharing by the lattices A0 − A3 and A2 − A1, the
data memory needs at most 3,179 − 2 × 289=2,601 locations
for the entire signal A.
Storing all the signals on-chip is, obviously, the most desirable scenario from the point of view of dynamic energy consumption. We assumed that the SPM size is constrained to smaller values than the overall storage requirement. In our tests, we computed the ratio between the dynamic energy reduction and the SPM size; the SPM size maximizing this ratio was selected, the idea being to obtain the maximum benefit for the smallest SPM size. In this example, an SPM of 675 locations (and a main memory of
2,601 − 675 = 1,926 locations) satisfies this condition.
Note that Brockmeyer et al. [6] perform the assignment
of the multidimensional signals to the memory layers at the
level of entire arrays. The quantitative measure of their assignment approach is the average number of accesses per
array element. For the illustrative example in Fig. 2(a), the
average number of accesses to signal A is 248.42 (since there
are 789,750 read accesses to the 3,179 A-elements). As shown in Fig. 2(c), the number of memory accesses can be very unevenly distributed in the array space†. Our model clearly offers a higher flexibility, identifying those parts of arrays that are heavily accessed, which can be significantly smaller than the whole array space (only about 21% in the example from Fig. 2).

† In this illustrative example, the number of accesses varies from 1 (e.g., for the element A[0][0]) to 780 (e.g., for A[5][144]).
Let us assume now that the SPM candidates can be
combinations of cuts along the array dimensions, as in [15].
If the size of the SPM is limited to 675 bytes (the total size
of the lattices A8, A9, A10), up to 61 rows of signal A can
be stored in the SPM. Selecting the "middle" 61 rows, 114–174 (since they are the most accessed), each one having 3,510
accesses, the dynamic energy saving versus a flat 1-layer
(off-chip) memory subsystem is 26.7% — since 214,110 accesses will be directed to the SPM. On the other hand, storing the lattices A8, A9, A10 in the SPM entails 3 × 172,332
= 516,996 accesses to the SPM and, consequently, a reduction of dynamic energy consumption by 64.6%.
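Both percentages can be reproduced from the access counts under stated assumptions (only the 789,750 read accesses to A are counted, at the per-access energies quoted earlier in this section):

E_SPM, E_MAIN = 0.048, 3.57      # nJ per read access (values quoted above)
TOTAL = 789_750                  # read accesses to signal A

def saving(spm_accesses):
    # Relative saving versus the flat design, where every access pays E_MAIN.
    flat = TOTAL * E_MAIN
    split = spm_accesses * E_SPM + (TOTAL - spm_accesses) * E_MAIN
    return 1 - split / flat

print(f"{saving(214_110):.1%}")  # rows 114-174 in the SPM -> 26.7%
print(f"{saving(516_996):.1%}")  # A8, A9, A10 in the SPM  -> 64.6%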
Table 1 summarizes the results of our experiments, carried out on a PC with a 1.85 GHz Athlon XP processor and
512 MB memory. The benchmarks used are algebraic kernels (like Durbin’s algorithm for solving Toeplitz systems)
and algorithms used in multimedia applications (like, for
instance, an MPEG4 motion estimation algorithm). The
table displays the numbers of array references, scalar signals, and memory accesses; the data memory size (in storage locations/bytes) and the dynamic energy consumption
assuming only one (off-chip) memory layer; the SPM size
and the savings of dynamic energy applying, respectively,
a previous model steered by the gain factors of whole arrays [6], a previous model steered by the most accessed array rows/columns [15], and the current model, versus the
single-layer memory scenario; the CPU times.
The sizes of the main memory and of the SPM were
evaluated after the mapping of the signals into the physical
memories. The mapping algorithm [27] computes maximal
bounding windows (the general idea is explained in the last
paragraph of Sect. 3) for each signal based on the bounding
windows of all the lattices extracted from the code. The
main memory size is the total size of these windows. Since
in our model the SPM stores the intensely-accessed parts of
the signals represented also by lattices, the same mapping
algorithm [27] operating with only that subset of lattices was
used to compute the size of the SPM.
Our experiments show that the savings of dynamic energy consumption are from 40% to over 70% relative to the
energy used in the case of a flat memory design. Although
previous models produce important energy savings as well
([15] yields better results than [6] due to a higher flexibility in selecting the ‘copy candidates’), our model led to 20–
33% better savings than them. The energy consumptions for the motion estimation benchmark were, respectively, 1894, 1832, and 1522 µJ; the saved energies relative to the energy in column 6 are displayed as percentages in columns 8–10.
5. Conclusions

This paper has presented a formal model which makes it possible to partition the index space of the arrays from data-dominated applications such that the heavily accessed array parts are identified and stored in scratch-pad memories, in order to diminish the dynamic energy consumption due to memory accesses. This model led to energy savings of, typically, 40–70% and proved to be better than previous models based on array references and their cuts along the main dimensions.
References
[1] F. Angiolini, L. Benini, and A. Caprara, “An efficient profile-based algorithm for scratchpad memory partitioning,” IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol.24, no.11, pp.1660–
1676, Nov. 2005.
[2] F. Balasa, H. Zhu, and I.I. Luican, “Computation of storage requirements for multi-dimensional signal processing applications,” IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., vol.15, no.4, pp.447–
460, April 2007.
[3] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P.
Marwedel, “Comparison of cache- and scratch-pad based memory
systems with respect to performance, area and energy consumption,”
Technical Report #762, University of Dortmund, Sept. 2001.
[4] A.I. Barvinok, “A polynomial time algorithm for counting integral
points in polyhedra when the dimension is fixed,” Math. Operations
Res., vol.19, no.4, pp.769–779, 1994.
[5] L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino,
“Layout-driven memory synthesis for embedded systems-on-chip,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.10, no.2,
pp.96–105, April 2002.
[6] E. Brockmeyer, M. Miranda, H. Corporaal, and F. Catthoor, “Layer
assignment techniques for low energy in multi-layered memory organisations,” Proc. ACM/IEEE Design Automation and Test in Europe, pp.1070–1075, Munich, Germany, March 2003.
[7] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele,
and A. Vandecapelle, Custom Memory Management Methodology:
Exploration of Memory Organization for Embedded Multimedia
System Design, Kluwer Acad. Publ., Boston, 1998.
[8] S. Coumeri and D.E. Thomas, “Memory modeling for system synthesis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.8,
no.3, pp.327–334, June 2000.
[9] G.B. Dantzig and B.C. Eaves, “Fourier-Motzkin elimination and its
dual,” J. Comb. Theory A, vol.14, pp.288–297, 1973.
[10] A. Darte, R. Schreiber, and G. Villard, “Lattice-based memory allocation,” IEEE Trans. Comput., vol.54, no.10, pp.1242–1257, Oct.
2005.
[11] J.A. De Loera, R. Hemmecke, J. Tauzer, and R. Yoshida, “Effective lattice point counting in rational convex polytopes,” J. Symbolic
Computation, vol.38, no.4, pp.1273–1302, 2004.
[12] J.Z. Fang and M. Lu, “An iteration partition approach for cache or
local memory thrashing on parallel processing,” IEEE Trans. Comput., vol.42, no.5, pp.529–546, 1993.
[13] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy
caches: Simple techniques for reducing leakage power,” Proc.
Symp. Computer Architecture, pp.148–157, May 2002.
[14] O. Golubeva, M. Loghi, M. Poncino, and E. Macii, “Architectural
leakage-aware management of partitioned scratchpad memories,”
Proc. ACM/IEEE Design Automation and Test in Europe, pp.1665–
1670, Nice, France, April 2007.
[15] Q. Hu, A. Vandecapelle, M. Palkovic, P.G. Kjeldsberg, E.
Brockmeyer, and F. Catthoor, “Hierarchical memory size estimation
for loop fusion and loop shifting in data-dominated applications,”
Proc. Asia-S. Pacific Design Automation Conf., pp.606–611, Yokohama, Japan, 2006.
[16] M. Kandemir and A. Choudhary, “Compiler-directed scratch-pad
memory hierarchy design and management,” Proc. 39th ACM/IEEE
Design Automation Conf., pp.690–695, Las Vegas, NV, June 2002.
[17] M. Kandemir, M.J. Irwin, G. Chen, and I. Kolcu, “Compiler-guided leakage optimization for banked scratch-pad memories,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.13, no.10,
pp.1136–1146, Oct. 2005.
[18] S. Kaxiras, Z. Hu, and M. Martonosi, “Cache decay: Exploiting
generational behavior to reduce cache leakage power,” Proc. Symp.
Computer Arch., pp.240–251, June 2001.
[19] I.I. Luican, H. Zhu, and F. Balasa, “Formal model of data reuse analysis for hierarchical memory organizations,” Proc. IEEE/ACM Int.
Conf. on Comp.-Aided Design, pp.595–600, San Jose, CA, Nov.
2006.
[20] I.I. Luican, H. Zhu, F. Balasa, and D.K. Pradhan, “Reducing the dynamic energy consumption in the multi-layer memory of embedded
multimedia processing systems,” Proc. 14th Workshop on Synthesis and Syst. Integration of Mixed Inform. Technologies, pp.42–48,
Sapporo, Japan, Oct. 2007.
[21] P.R. Panda, N. Dutt, and A. Nicolau, “On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based
systems,” ACM Trans. Des. Autom. Electron. Syst., vol.5, no.3,
pp.682–704, July 2000.
[22] A. Schrijver, Theory of Linear and Integer Programming, John Wiley, New York, 1986.
[23] W. Shiue and C. Chakrabarti, “Memory exploration for low-power
embedded systems,” Proc. 35th ACM/IEEE Design Automation
Conf., pp.140–145, June 1998.
[24] L. Thiele, “Compiler techniques for massive parallel architectures,”
in State-of-the-art in Computer Science, ed. P. Dewilde, Kluwer
Acad. Publ., 1992.
[25] R. Tronçon, M. Bruynooghe, G. Janssens, and F. Catthoor, “Storage
size reduction by in-place mapping of arrays,” in Verification, Model
Checking and Abstract Interpretation, ed. A. Cortesi, pp.167–181,
2002.
[26] S. Wuytack, J.-P. Diguet, F. Catthoor, and H. De Man, “Formalized
methodology for data reuse exploration for low-power hierarchical
memory mappings,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol.6, no.4, pp.529–537, Dec. 1998.
[27] H. Zhu, I.I. Luican, and F. Balasa, “Mapping multi-dimensional
signals into hierarchical memory organizations,” Proc. ACM/IEEE
Design Automation and Test in Europe, pp.385–390, Nice, France,
April 2007.
[28] CACTI v4.2, [Online]. Available: http://quid.hpl.hp.com:9081/cacti/detailed.y
Hongwei Zhu
received the B.S. degree in
Electrical Engineering from Xi’an Jiaotong University, P.R. China, in 1996, the M.S. degree
in Electrical & Electronic Engineering from
Nanyang Technological University, Singapore,
in 2001, and the Ph.D. degree in Computer Science from the University of Illinois at Chicago,
U.S.A., in 2007. Currently, he is a software
engineer at ARM Inc., Sunnyvale, California,
U.S.A. His research interests are digital design
methodologies, memory management for signal
processing, and combinatorial optimization in VLSI CAD.
Ilie I. Luican
received the B.S. and M.S.
degrees in Computer Science from the Polytechnical University of Bucharest (PUB), Romania,
in 2002 and 2003, respectively. Currently, he is
a Ph.D. candidate in the Dept. of Computer Science, University of Illinois at Chicago, U.S.A.
His research interests are high-level synthesis
and memory management for real-time multidimensional signal processing.
Florin Balasa
received the M.S. and Ph.D.
degrees in Computer Science from the Polytechnical University of Bucharest in 1981 and 1994,
respectively. He also received the Ph.D. degree
in Electrical Engineering from the Katholieke
Universiteit Leuven, Belgium, in 1995. Dr.
Balasa is, currently, an associate professor of
Computer Science at the Southern Utah University. He was also an assistant professor of Computer Science at the Univ. of Illinois at Chicago.
Dr. Balasa is a recipient of the National Science
Foundation Career Award. He is an associate editor for J. Computers &
Electrical Eng. (Elsevier).
Dhiraj K. Pradhan
received the M.S. degree in Computer Science from Brown University in 1970 and the Ph.D. degree from the University of Iowa, U.S.A., in 1972. Dr. Pradhan
currently holds a Chair in Computer Science at the University of Bristol, U.K. He had
been a professor of Electrical and Computer Engineering at Oregon State University, Corvallis.
Previous to this, Dr. Pradhan had held the COE
Endowed Chair Professorship in Computer Science at Texas A&M University, College Station,
also serving as founder of their Laboratory of Computer Systems. Prior, Dr.
Pradhan held a Professorship at the University of Massachusetts, Amherst, where he also served as coordinator of Computer Engineering. Prof. Pradhan has served as co-author and editor of various books, including Fault-Tolerant Computing: Theory and Techniques, Vols. I, II (Prentice-Hall, 1986), Fault-Tolerant Computer Systems Design (Prentice-Hall, 1996, 2nd ed. 2003), and IC Manufacturability: The Art of Process and Design Integration (IEEE Press, 2000). He continues to serve as an editor for prestigious journals, including IEEE Transactions. A Fellow of the ACM, the IEEE, and the Japan Society for the Promotion of Science, Prof. Pradhan is also the
recipient of a Humboldt Prize, Germany. In 1997, he was also awarded the
Fulbright-Flad Chair in Computer Science and the 1996 IEEE TCAD Best
Paper Award.