IEICE TRANS. FUNDAMENTALS, VOL.E91-A, NO.12 DECEMBER 2008

PAPER  Special Section on VLSI Design and CAD Algorithms

Formal Model for the Reduction of the Dynamic Energy Consumption in Multi-Layer Memory Subsystems∗

Hongwei ZHU†, Nonmember, Ilie I. LUICAN††, Student Member, Florin BALASA†††a), and Dhiraj K. PRADHAN††††, Nonmembers

SUMMARY  In real-time data-dominated communication and multimedia processing applications, a multi-layer memory hierarchy is typically used to enhance the system performance and also to reduce the energy consumption. Savings of dynamic energy can be obtained by accessing frequently used data from smaller on-chip memories rather than from large background memories. This paper focuses on the reduction of the dynamic energy consumption in the memory subsystem of multidimensional signal processing systems, starting from the high-level algorithmic specification of the application. The paper presents a formal model which identifies those parts of arrays that are more intensely accessed, taking also into account the relative lifetimes of the signals. Tested on a two-layer memory hierarchy, this model led to savings of dynamic energy from 40% to over 70% relative to the energy used in the case of flat memory designs.

key words: multimedia processing applications, memory allocation, dynamic energy consumption, signal-to-memory mapping

Manuscript received February 29, 2008. Manuscript revised June 24, 2008.
† The author is with the Physical IP Division, ARM Inc., Sunnyvale, California, U.S.A.
†† The author is with the Dept. of Computer Science, Univ. of Illinois at Chicago, U.S.A.
††† The author is with the Dept. of Computer Science and Information Systems, Southern Utah University, U.S.A.
†††† The author is with the Dept. of Computer Science, Bristol University, U.K.
∗ The content is based on the conference papers published in the proceedings of ICCAD 2006 [19] and SASIMI 2007 [20].
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e91-a.12.3559

1. Introduction

A typical architecture of an embedded signal processing system includes programmable hardware (e.g., a DSP core), custom hardware (application-specific accelerator datapaths and logic), a controller, and a distributed memory organization [7]. The on-chip memory subsystem is often complemented by an external (off-chip) memory storing the large amounts of data processed during the execution of the application. The memory subsystem is typically a major contributor to the overall energy budget of the system [7]. Savings of dynamic energy (which is spent only when memory accesses occur) at the level of the whole memory subsystem can be obtained mainly by accessing frequently used data from smaller on-chip memories rather than from large background (off-chip) memories.

As on-chip storage, scratch-pad memories (SPMs), software-controlled static or dynamic random-access memories that are more energy-efficient than caches, are widely used in embedded systems, in which the flexibility of caches in terms of workload adaptability is often unnecessary, whereas power consumption and cost play a much more critical role. Different from caches, the SPM occupies one distinct part of the virtual address space, the rest of the address space being occupied by the main memory. The consequence is that there is no need to check for the availability of the data in the SPM. Hence, the SPM does not possess a comparator or the miss/hit acknowledging circuitry [3]. This contributes to a significant energy (as well as area) reduction.
Another consequence is that, in cache memory systems, the mapping of data to the cache is done at run-time, whereas in SPM-based systems this can be done either manually by the designer or automatically by a compiler, using a suitable algorithm.

With the scaling of the technology below 0.1 µm, the static energy due to leakage currents has become increasingly important. While leakage is a problem for any transistor, it is even more critical for memories: their high density of integration translates into a higher power density that increases temperature, which in turn increases leakage currents significantly. As technology scales, the static energy consumption grows in importance even when memories are idle (not accessed). To reduce the static energy, proper schemes to put a memory block into a dormant (sleep) state with negligible energy spending are required. These schemes normally imply a timing overhead: transitioning a memory block into and, especially, out of the dormant state consumes energy and time. Putting a memory block into the dormant state should be done only if the cost in extra energy and decrease of performance can be amortized. For dealing with dynamic energy, we are interested only in the total number of accesses, and not in their distribution over time. Introducing the time dimension makes the problem of energy reduction much more complex.

1.1 Previous Works

The energy-efficient assignment of signals to the on- and off-chip memories has been studied since the late nineties. These previous works focused on partitioning the arrays into copy candidates and on the optimal selection and mapping of these into the memory hierarchy [6], [15], [26]. The general idea was to identify the data (arrays or parts of arrays) that are most frequently accessed in each loop nest. Copying these heavily accessed data from the large off-chip memory to a smaller on-chip memory can potentially save energy (since most accesses will take place on the smaller copy and not on the large, more energy consuming, original array) and also improve performance†.

† Note that this problem is basically different from caching for performance [12], where the question is how to fill the cache such that the data needed have been loaded in advance from the main memory.

Many different possibilities exist for deciding which parts of the arrays should be copy candidates and, also, for selecting among the candidates those which will be instantiated as copies, together with their assignment to the different memory layers. For instance, Kandemir and Choudhary analyze and exploit the temporal locality by inserting local copies [16]. Their layer assignment builds a separate hierarchy per loop nest and then combines these hierarchies into a single one. However, the approach lacks a global view on the lifetimes of (parts of) arrays in applications having imperfectly nested loops. Brockmeyer et al. use the steering heuristic of assigning the arrays having the lowest ratio between the number of accesses and the size to the cheapest memory layer first, followed by incremental reassignments [6]. Hu et al. can use parts of arrays as copies, but they typically use cuts along the array dimensions [15] (like rows and columns of matrices).

The energy-aware partitioning of an on-chip memory into multiple banks has been studied by several research groups as well. Techniques of an exploratory nature analyze possible partitions, matching them against the access patterns of the application [8], [23].
Other approaches exploit the properties of the dynamic energy cost and the resulting structure of the partitioning space to come up with algorithms able to derive the optimal partition for a given access pattern [1], [5].

Leakage-aware partitioning of memory structures is addressed at circuit level, especially for caches. The cache-decay architecture turns off the cache lines during the time they are not used [18]. The drowsy cache architecture puts the cache lines into state-preserving low-power modes based on usage statistics [13]. These techniques, together with dynamic resizing, require the modification of the internal structure of caches, which are normally highly-optimized designs. On a higher level of abstraction, Kandemir et al. exploit bank locality for maximizing the idleness, thus ensuring maximal amortization of the energy spent on memory re-activation [17]. Golubeva et al. proposed a trace-based architectural approach, considering both the dynamic and the static energy consumption [14].

The storage allocation in a hierarchical (on- and off-chip) organization [21] and the energy-aware partitioning of the SPM into several blocks must be complemented with a comprehensive solution for mapping the multidimensional arrays from the code of the application into the physical memories. This operation aims (a) to map the arrays into an amount of data storage as small as possible, (b) to use mapping functions simple enough to ensure an address generation hardware of reasonable complexity, and (c) to ascertain that any distinct scalar signals (array elements) simultaneously alive are mapped to distinct storage locations. Good overviews of mapping techniques are given in [10] and [27].

1.2 Research Motivation and Contributions

The motivation of this research is summarized as follows:

1. The lack of a general model for identifying those parts of arrays from a given application code that are more intensely accessed, in order to steer their assignment to different memory layers so that the dynamic energy consumption is reduced, while also taking into account the relative lifetimes of signals in order to reduce the data storage on each hierarchical level of the memory subsystem. Such a model could be extended to steer also the partitioning of the memory on each hierarchical layer in order to further reduce the dynamic energy consumption. In a later phase, the static energy could be considered as well, by taking into account the distribution of the memory accesses over time. Note that the existing works use as input either a memory trace, which can be very inefficient for applications requiring a very long sequence of datapath instructions (as in the case of deep loop nests, where the ranges of the iterators are large), or specifications that are syntactically over-constrained (like specifications having only perfectly nested loops).

2. The lack of a model for mapping the multidimensional arrays (signals) to the data memory that takes into account the distributed structure of the memory subsystem and also exploits the possible memory sharing between the elements of different arrays. The existing signal-to-memory mapping techniques focus on shrinking the mapping window of each array separately, implicitly allowing memory sharing only between elements of a same array. This usually leads to excessive data storage allocation. In addition, the existing mapping techniques do not take into account that the memory subsystem may have a distributed structure.
This paper presents a formal model based on lattices [22], which makes it possible to identify accurately those parts of arrays from a given application code that are more intensely accessed for read or write operations. Storing these parts of arrays on-chip yields the highest reduction of the dynamic energy consumption in a hierarchical memory subsystem. Since the mathematical model is very general, the proposed approach is able to handle the entire class of affine specifications [7]: the code is organized in sequences of loop nests having as boundaries linear functions of the outer loop iterators; the conditional instructions may have conditions that are either data-dependent or data-independent (relational and/or logical operators of linear functions of loop iterators); and the multidimensional signals have array references with (possibly complex) linear indices. The model was tested for the time being assuming two memory layers (on-chip scratch-pad and off-chip memories), focusing on the reduction of the dynamic energy consumption due to memory accesses. Extensions of the model to the exploitation of memory banking, to an arbitrary number of memory layers, and to taking the leakage energy consumption into account as well will be considered in the future.

In addition, this methodology solves in a consistent way the problem of mapping the multidimensional arrays to the physical memory†. Similarly to [25], our approach computes bounding boxes for the live elements in the index space of arrays; but, since this algorithm works not only for entire arrays, but also for parts of arrays (like, for instance, array references or, more generally, sets of array elements represented by lattices), this signal-to-memory mapping technique can also be applied in a multi-layer memory hierarchy [27].

† Although our methodology for memory management takes into account the mapping problem as well, this topic is only marginally addressed in this paper. A thorough presentation of our mapping model, able to cope with memory hierarchy, is given in [27].

The rest of the paper is organized as follows. Section 2 presents a formal model used to detect the parts of the arrays intensely accessed during memory operations when the high-level specification code of the application is executed. Section 3 discusses the assignment of these array parts to the memory layers of the hierarchical memory subsystem such that the reduction of the dynamic energy consumption is maximized subject to the on-chip storage constraints. Section 4 discusses implementation aspects and presents experimental results. Finally, Sect. 5 summarizes the main conclusions of this research.

2. Polyhedral Framework for the Reduction of the Dynamic Energy Consumption in the Memory Subsystem

2.1 Basic Definitions

An array reference $M[x_1(i_1, \ldots, i_n)] \cdots [x_m(i_1, \ldots, i_n)]$ of an m-dimensional signal M, in the scope of a nest of n loops having the iterators $i_1, \ldots, i_n$, is characterized by an iterator space and an index space. The iterator space signifies the set of all iterator vectors $i = (i_1, \ldots, i_n) \in \mathbb{Z}^n$ in the scope of the array reference. The index (or array) space is the set of all index vectors $x = (x_1, \ldots, x_m) \in \mathbb{Z}^m$ of the array reference.
When the indices of an array reference are linear expressions with integer coefficients of the loop iterators, the index space consists of one or several linearly bounded lattices [24]:

$$\{\, x = T \cdot i + u \in \mathbb{Z}^m \;|\; A \cdot i \ge b,\; i \in \mathbb{Z}^n \,\} \qquad (1)$$

where $x \in \mathbb{Z}^m$ is the index vector of the m-dimensional signal and $i \in \mathbb{Z}^n$ is an n-dimensional iterator vector. For instance, in the nested loops

    for (i = 0; i <= 2; i++)
      for (j = 0; j <= 3; j++)
        ... A[3i + j][5i + 2j] ...

the index space of the array reference A[x][y] is

$$\left\{ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \;\middle|\; \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \ge \begin{bmatrix} 0 \\ -2 \\ 0 \\ -3 \end{bmatrix} \right\}$$
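As a quick sanity check of the representation (1), the lattice above can be enumerated directly as the image of its iterator polytope. The following Python fragment is a minimal illustrative sketch (it is not part of the authors' tool); the matrices T, A and the vectors u, b are transcribed from the example above.

    # Enumerate the linearly bounded lattice (1) of A[3i+j][5i+2j] as the
    # image of the iterator polytope { i | A.i >= b } under x = T.i + u.
    import numpy as np

    T = np.array([[3, 1], [5, 2]])   # index mapping of A[3i+j][5i+2j]
    u = np.array([0, 0])             # no constant offset in this reference
    A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
    b = np.array([0, -2, 0, -3])     # A.i >= b encodes 0 <= i <= 2, 0 <= j <= 3

    points = set()
    for i in range(3):
        for j in range(4):
            it = np.array([i, j])
            if np.all(A @ it >= b):           # polytope test (trivially true
                points.add(tuple(T @ it + u)) # for these loop bounds)

    print(sorted(points))   # the 12 index vectors (x, y) of the lattice

Since det T = 1 here, the mapping is injective and the 3 × 4 = 12 iterator points yield 12 distinct index points.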
Without loss of generality, it will be assumed throughout this paper that each array reference can be represented as a single lattice, although several lattices may be needed in the general case (for instance, an array reference in the scope of a condition if (i ≠ j) has two lattices, one for i ≥ j + 1 and one for i ≤ j − 1). The specifications are considered to be procedural; therefore, the execution ordering is induced by the loop structure and is thus fixed††.

†† The search space becomes much larger still when the available freedom in loop organization is also incorporated. If the original loop ordering is not optimally suited to exploit data locality, code transformations should be applied in an earlier phase (like in [15], for instance) to increase it.

The goal is to identify the parts of the arrays in the given algorithmic specification that are heavily accessed during the code execution. This can be accomplished (as will be seen) by a partitioning of each index space into sets which are all (linearly bounded) lattices.

2.2 The Index Space of an Array Reference

Let $\{\, x = T \cdot i + u \;|\; A \cdot i \ge b \,\}$ be the linearly bounded lattice of a given array reference. This section will show how to model the index space of an array reference, that is, what relations are satisfied by the coordinates x of the points in this set. After the theoretical part, illustrative examples will be provided.

For any matrix $T \in \mathbb{Z}^{m \times n}$ having rank $T = r$, and assuming the first r rows of T are linearly independent†††, there exists a unimodular matrix†††† $S \in \mathbb{Z}^{n \times n}$ such that

$$T \cdot S = \begin{bmatrix} H_{11} & 0 \\ H_{21} & 0 \end{bmatrix}$$

where $H_{11} \in \mathbb{Z}^{r \times r}$ is a lower triangular matrix with positive diagonal elements, and $H_{21} \in \mathbb{Z}^{(m-r) \times r}$ [22]. The block matrix is called the reduced Hermite form of matrix T.

††† This assumption does not decrease the generality: it is made only to simplify the formulas, which would otherwise be affected by a row permutation matrix.
†††† A square matrix with integer elements whose determinant is ±1.

Let $S^{-1} i = j \stackrel{\text{def}}{=} \begin{bmatrix} j_1 \\ j_2 \end{bmatrix}$, where $j_1$, $j_2$ are r- and, respectively, (n − r)-dimensional vectors. Then

$$x = T i + u = T S j + u = \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} j_1 + u \qquad (2)$$

Denoting $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ (where $x_1$, $u_1$ are r-dimensional vectors), it follows that $x_1 = H_{11} j_1 + u_1$. As $H_{11}$ is nonsingular (being lower triangular of rank r), $j_1$ can be obtained explicitly:

$$j_1 = H_{11}^{-1} (x_1 - u_1) \qquad (3)$$

The iterator vector i results with a simple substitution:

$$i = S j = [\, S_1 \;\; S_2 \,] \begin{bmatrix} j_1 \\ j_2 \end{bmatrix} = S_1 H_{11}^{-1} (x_1 - u_1) + S_2 j_2$$

where $S_1$ and $S_2$ are the submatrices of S containing the first r and, respectively, the last n − r columns of S. As the iterator vector must represent a point inside the iterator space $A \cdot i \ge b$, it follows that:

$$A S_1 H_{11}^{-1} x_1 + A S_2 j_2 \ge b + A S_1 H_{11}^{-1} u_1 \qquad (4)$$

If r < n, the n − r variables of $j_2$ can be eliminated with the Fourier-Motzkin technique [9].

As the rows of matrix $H_{11}$ are r linearly independent r-dimensional vectors, each row of $H_{21}$ is a linear combination of the rows of $H_{11}$. Then, from (2), it follows that there exists a matrix $C \in \mathbb{Z}^{(m-r) \times r}$ such that†

$$x_2 - u_2 = C \cdot (x_1 - u_1) \qquad (5)$$

† The coefficients of matrix C are determined by backward substitutions from the equations $H_{21}.\text{row}(i) = \sum_{j=1}^{r} c_{ij} \cdot H_{11}.\text{row}(j)$, for any $i = 1, \ldots, m - r$.

Taking into account that the elements of $j_1$ must be integers, it follows (by multiplying and dividing the right member of (3) by $\det H_{11}$) that the points x inside the index space must additionally satisfy the divisibility constraints

$$\det H_{11} \;\big|\; h_i^T (x_1 - u_1), \quad \forall i = 1, \ldots, r \qquad (6)$$

where $h_i^T$ are the rows of the integer matrix $\det H_{11} \cdot H_{11}^{-1}$, and a | b means "a divides b."

According to (6), when r = n, the points x are uniformly spaced along the r linearly independent coordinates, the sizes of the gaps in these dimensions being equal to the diagonal elements of $H_{11}$: if $h_{ii}$ are the diagonal elements of matrix $H_{11}$, it can be verified that the divisibility constraints (6) are not affected when $x_1$ is subject to translations by the vectors $v_i = [0 \cdots h_{ii} \cdots 0]^T$, $\forall i = 1, \ldots, r$. Indeed, $h_i^T (x_1 - u_1 + v_i) = h_i^T (x_1 - u_1) + h_i^T v_i = h_i^T (x_1 - u_1) + \det H_{11}$.

The system of inequalities (4), the equations (5), and the divisibility conditions (6) characterize the index space of the given array reference. Several examples will illustrate the generality of this model.

Example 1:

    for (i = 0; i <= 2; i++)
      for (j = 0; j <= 3; j++)
        ... A[3i][5i + 2j] ...

Since $T = H_{11} = \begin{bmatrix} 3 & 0 \\ 5 & 2 \end{bmatrix}$, $u = u_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\det H_{11} = 6$, $\det H_{11} \cdot H_{11}^{-1} = \begin{bmatrix} 2 & 0 \\ -5 & 3 \end{bmatrix}$, and $S = S_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ ($S_2$, $j_2$ do not exist since n − r = 2 − 2 = 0; $H_{21}$, $x_2$, $u_2$ do not exist since m − r = 2 − 2 = 0), the inequalities (4) with $x_1 = \begin{bmatrix} x \\ y \end{bmatrix}$ are: 6 ≥ x ≥ 0, 18 ≥ −5x + 3y ≥ 0, representing the first quadrilateral in Fig. 1. Not all the lattice points in the quadrilateral have as coordinates the index values of the array reference. Only the lattice points also satisfying the divisibility conditions (6), that is, 6 | 2x (equivalently, 3 | x) and 6 | (−5x + 3y), belong to the index space. Note also that these lattice points, colored black in the figure, are uniformly spaced along the two axes Ox and Oy, the sizes of the gaps in these dimensions being 3 and 2, the diagonal elements of $H_{11}$.

Fig. 1  The index spaces of the array references in Example 1 and Example 2.

Example 2:

    for (i = 0; i <= 2; i++)
      for (j = 0; j <= 3; j++)
        ... A[3i + j][5i + 2j] ...

Here $T = \begin{bmatrix} 3 & 1 \\ 5 & 2 \end{bmatrix}$, $H_{11} = H_{11}^{-1} = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix}$, $S = S_1 = \begin{bmatrix} 0 & 1 \\ 1 & -3 \end{bmatrix}$ ($S_2$, $j_2$ do not exist since n − r = 2 − 2 = 0; $H_{21}$, $x_2$, $u_2$ do not exist since m − r = 2 − 2 = 0). The inequalities (4) are: 2 ≥ 2x − y ≥ 0, 3 ≥ −5x + 3y ≥ 0. As $|\det H_{11}| = 1$, there are no divisibility conditions (6).

Example 3:

    for (i = 0; i <= 2; i++)
      for (j = 0; j <= 3; j++)
        ... A[3i + j + 4][6i + 2j + 7] ...

Since $T = \begin{bmatrix} 3 & 1 \\ 6 & 2 \end{bmatrix}$, $u = \begin{bmatrix} 4 \\ 7 \end{bmatrix}$, and $S = [\, S_1 \;\; S_2 \,] = \begin{bmatrix} 0 & 1 \\ 1 & -3 \end{bmatrix}$, it follows that $T \cdot S = \begin{bmatrix} 1 & 0 \\ 2 & 0 \end{bmatrix}$, $H_{11} = [1]$, and $H_{21} = [2]$. Since m − r = 2 − 1 = 1, the points (x, y) in the index space satisfy one equation of type (5): y − 7 = 2(x − 4). The system of inequalities (4) is: 2 ≥ t ≥ 0, 7 ≥ x − 3t ≥ 4, where the vector $j_2$ has a single element (since n − r = 1), denoted t. The system of inequalities can be reduced here, by Fourier-Motzkin elimination of t, to 13 ≥ x ≥ 4. There are no divisibility conditions (6) since $\det H_{11} = 1$.
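The reduced Hermite form used above can be obtained by unimodular column operations driven by the extended Euclidean algorithm. The Python routine below is a hedged sketch of one standard way to compute it (the actual computation in the authors' tool may differ); note that the resulting S may differ from the matrices printed in the examples by column signs, since the reduction only fixes S up to such unimodular variations.

    def extgcd(a, b):
        # returns (g, x, y) with a*x + b*y == g == gcd(a, b), g >= 0
        if b == 0:
            return (abs(a), 1 if a >= 0 else -1, 0)
        g, x1, y1 = extgcd(b, a % b)
        return (g, y1, x1 - (a // b) * y1)

    def reduced_hermite(T):
        """Column reduction T*S = [H11 0; H21 0] by unimodular column ops."""
        m, n = len(T), len(T[0])
        H = [row[:] for row in T]
        S = [[int(i == j) for j in range(n)] for i in range(n)]
        r = 0
        for row in range(m):
            if r == n:
                break
            for c in range(r + 1, n):   # zero out H[row][c] against H[row][r]
                a, b = H[row][r], H[row][c]
                if b == 0:
                    continue
                g, x, y = extgcd(a, b)
                p, q = a // g, b // g
                for M in (H, S):        # apply the same column op to H and S
                    for i in range(len(M)):
                        vr, vc = M[i][r], M[i][c]
                        M[i][r] = x * vr + y * vc    # new pivot column
                        M[i][c] = -q * vr + p * vc   # 2x2 op has determinant 1
            if H[row][r] != 0:
                if H[row][r] < 0:       # enforce a positive diagonal element
                    for i in range(m): H[i][r] = -H[i][r]
                    for i in range(n): S[i][r] = -S[i][r]
                r += 1
        return H, S

    H, S = reduced_hermite([[3, 1], [5, 2]])   # the matrix T of Example 2
    print(H, S)   # H = [[1, 0], [2, 1]], S = [[0, -1], [1, 3]]

For Example 2 this yields $H_{11} = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix}$ and $S = \begin{bmatrix} 0 & -1 \\ 1 & 3 \end{bmatrix}$, a sign-normalized variant of the matrices printed above; both satisfy $T \cdot S = H$ with $|\det S| = 1$ and lead to the same inequalities (4).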
While the basic operations with lattices are performed using their definition (1), as images of iterator polytopes, it is the array (index) space of the signals that has to be partitioned (based on the intensity of the memory accesses). The mapping of the iterator space into the array space of a signal performed by a linearly bounded lattice (or, in particular, by an array reference), described and exemplified in this section, is used in the array space partitioning for all the signals in the algorithmic specification (see the next section).

2.3 The Partitioning of the Array Space of a Signal

The decomposition of the array space of every indexed signal into disjoint bounded lattices can be performed analytically by a recursive intersection, starting from the array references in the code. Let S be a multidimensional signal in the algorithmic specification. A high-level pseudo-code of its array space partitioning is given below; an executable illustration on explicit point sets follows at the end of this section.

    let LS be the initial collection of lattices of signal S;
    // these are the lattices of the array references of S
    do
      for (each pair of lattices (L1, L2) in the collection LS)
        compute L = L1 ∩ L2;
        if (the intersection L is not empty)
          add the new lattice L to the collection LS;
          compute L1 − L2 and L2 − L1;
          add each new lattice in the differences to LS;
        end if;
      end for;
    until (no new lattice can be added to the collection LS);

The intersection and difference operations with linearly bounded lattices are thoroughly explained in [2]. Figure 2(b) shows the result of the array space decomposition for the 2-dimensional signal A from the illustrative (affine) algorithmic specification in Fig. 2(a). The array space of A was partitioned into 11 disjoint lattices† A0, A1, . . . , A10, as shown in Figs. 2(b)-(c); each array reference in the code is either a disjoint lattice itself, or it can be written as a union of disjoint lattices. For instance,

    A[4][j] (1st loop nest) = A8
    A[k][i] (1st loop nest) = A0 ∪ A2 ∪ A4 ∪ . . . ∪ A10
    A[5][j] (2nd loop nest) = A10
    A[k][i] (2nd loop nest) = A2 ∪ A3 ∪ . . . ∪ A10
    A[6][j] (3rd loop nest) = A9
    A[k][i] (3rd loop nest) = A1 ∪ A3 ∪ . . . ∪ A10

† Since there are no divisibility conditions (6), the lattices are actually Z-polytopes [11] in the array space.

Fig. 2  (a) Illustrative algorithmic specification. (b) The partitions (disjoint lattices) of the array space of the signal A, obtained as explained in Sect. 2.3, and their numbers of memory accesses. The index space of each lattice is represented (see Sect. 2.2) only by inequalities (4), as the equality and divisibility conditions (5) and (6) do not appear in this example. (c) The gain factors of the different lattices partitioning the array space of signal A. The darker partitions are more heavily accessed.

The partitioning of the array space into disjoint lattices makes it possible to build a map of the intensity of the memory accesses and to detect with precision the zones of the array space which are the most heavily accessed.
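For intuition, the fixpoint computed by the pseudo-code of this section can be reproduced on small, explicit point sets: each element of the array space ends up in the partition determined by the exact subset of array references covering it. The Python sketch below only illustrates this end result on finite sets; the actual tool manipulates the symbolic lattice representation with the intersection/difference operations of [2].

    def partition(refs):
        """Disjoint partitions of the space covered by the references:
        points are grouped by the exact set of references containing them."""
        atoms = {}
        for p in set().union(*refs):
            key = frozenset(k for k, r in enumerate(refs) if p in r)
            atoms.setdefault(key, set()).add(p)
        return list(atoms.values())

    # Toy usage: two overlapping 3x3 references produce 3 disjoint partitions
    # (the intersection and the two differences).
    R1 = {(x, y) for x in range(0, 3) for y in range(0, 3)}
    R2 = {(x, y) for x in range(1, 4) for y in range(1, 4)}
    for part in partition([R1, R2]):
        print(sorted(part))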
3. Hierarchical Memory Layer Assignment

The decomposition of the index space of each multidimensional signal makes it possible to compute the number of memory accesses when addressing the different parts of the arrays. In order to determine the number of memory accesses to the array elements covered by a certain lattice, the following computation scheme is used:

    #accesses = 0;
    for (all the array references including the given lattice)
      select an array reference and find the set of iterators
        mapped by the array reference into that lattice;
      #accesses += size of this set;
    end for;

Example: Compute the number of memory (read) accesses to the partition (lattice) A10 (see Fig. 2(b)). Since A10 is included in the array references A[k][i] from all three loop nests and, in addition, it coincides with the operand A[5][j] from the second loop nest, the contributions of these 4 array references must be computed. Since the set A10 is {x = 5, y = t | 256 ≥ t ≥ 32} and the lattice of the array reference A[k][i] in the first loop nest is {x = k, y = i | 256 ≥ j ≥ 32, 8 ≥ k ≥ 0, j + 32 ≥ i ≥ j − 32}, the set of the iterators mapped by the array reference A[k][i] into A10 is {j = t1, k = 5, i = t2 | 256 ≥ t1, t2 ≥ 32, t1 + 32 ≥ t2 ≥ t1 − 32}. The size of this set is 13,569 [4], [11]. The contributions of the other two array references A[k][i] are also 13,569 accesses each. Since the set of the iterators mapped by the array reference A[5][j] into A10 is {j = t1, k = t2, i = t3 | 256 ≥ t1 ≥ 32, 9 ≥ t2 ≥ 1, t1 + 32 ≥ t3 ≥ t1 − 32}, the contribution of A[5][j] is 131,625 accesses, the size of this set. In conclusion, the total number of accesses to the lattice A10 is 13,569 × 3 + 131,625 = 172,332.

Figure 2(b) also displays the number of read accesses for each of the 11 lattices partitioning the array space of signal A. Similarly as in [15], the potential benefit of loading a partition into the scratch-pad memory is quantified by a gain factor: the ratio between the number of accesses to the partition and the partition size (e.g., the partition A8, of size 225 and accessed 172,332 times, has a gain factor of 172,332 / 225 = 765.92). In Fig. 2(c) the gain factors are indicated on the index space of signal A, the darker areas being the zones more heavily accessed. Note that the memory accesses are not uniformly distributed inside the partitions. (For instance, A[0][48] in the partition A0 is accessed 49 times, whereas A[0][148] is accessed 65 times.) However, this approach makes it possible to identify those parts of arrays more accessed than others without descending to the scalar level, which could be computationally prohibitive.

Different from other previous works, the parts of arrays that are candidates to be loaded into the SPM are not restricted to array references and their cuts along the coordinates: the candidates are the partitions of the array space exhibiting high gain factors. It can be seen from Fig. 2(c) that the array reference A[k][i] in the first loop nest, covering the columns 0 to 8 of the array space, has zones accessed with very different intensities. Even when cutting along the dimensions x and y, the cut lines would intersect areas of very different gain factors. The exploration of the partitions having high gain factors leads to a better reduction of the dynamic energy, as shown in Sect. 4.

Note that every element of the arrays D and Dopt is accessed exactly once for reading and once for writing. Actually, it can be shown that the storage requirement for either of the signals D and Dopt is only one memory location: two registers in the datapath represent the best memory allocation solution for these two signals.
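Returning to the access-count scheme at the beginning of this section: the sizes of the parametric iterator sets are obtained with lattice-point counting techniques [4], [11]. On explicit point sets, the scheme reduces to the following hedged Python sketch, in which each array reference is modeled by the list of index points it touches, one entry per executed iteration (this point-set modeling and all names are illustrative assumptions, not the authors' implementation):

    from collections import Counter

    def accesses_per_partition(partitions, reference_traces):
        """#accesses of each partition: every executed iteration of every
        array reference contributes one access to the (unique) partition
        covering the index point it touches."""
        where = {p: k for k, part in enumerate(partitions) for p in part}
        hits = Counter()
        for trace in reference_traces:   # one trace per array reference
            for point in trace:          # one entry per executed iteration
                hits[where[point]] += 1
        return hits                      # partition id -> #accesses

The gain factor of a partition k is then hits[k] / len(partitions[k]), mirroring the 172,332 / 225 = 765.92 computation above.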
The mapping of the signal partitions to the off-chip and scratch-pad memories is done using a model thoroughly described in [27]. According to it, for each n-dimensional signal A, a maximal window WA = (w1, . . . , wn) bounding the simultaneously alive elements is computed; any access to an element A[index1] . . . [indexn] of the array is then redirected to a memory location WA[index1 mod w1] . . . [indexn mod wn], relative to a base address. The bounding window is computed such that no two distinct array elements simultaneously alive can be mapped by the modulo operations to the same location in the physical memory. Due to our lattice-based framework, the bounding windows are computed not only for whole arrays, but also for the partitions (lattices) to be loaded into the SPM (their windows being typically "smaller" than the corresponding array window).
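As an illustration of the modulo redirection, the sketch below linearizes an access through an assumed window WA; the window computation itself (which guarantees that distinct simultaneously alive elements never collide) follows [27] and is not reproduced here. The window (2, 64) used in the usage example is purely hypothetical.

    def window_address(index, window, base=0):
        """Location of A[index] redirected through the bounding window
        (row-major layout inside the window), relative to a base address."""
        addr = 0
        for idx, w in zip(index, window):
            addr = addr * w + (idx % w)   # index_k mod w_k, then linearize
        return base + addr

    # A[5][100] and A[7][36] fall on the same location under window (2, 64);
    # this is legal only if the two elements are never simultaneously alive.
    print(window_address((5, 100), (2, 64)))   # -> 100
    print(window_address((7, 36), (2, 64)))    # -> 100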
4. Experimental Results

A hierarchical memory allocation tool has been implemented in C++, incorporating the polyhedral model described in this paper. For the time being, the tool supports only a two-level memory hierarchy, where an SPM is used between the main memory and the processor core. The dynamic energy is computed based on the number of accesses to each memory layer. In computing the dynamic energy consumption of the SPM and of the main (off-chip) memory, the CACTI v4.2 power model [28] is used†. In general, the energy consumed by a main memory access is between one and two orders of magnitude larger than the energy consumed by an SPM access. The energy per access of an SPM is not constant, but a size-dependent function: the energy per access tends to increase as the SPM size grows; however, for small SPM sizes, up to a few KBytes, the energy per access is relatively constant. Typical SPM and main memory energy values for read accesses are 0.048 nJ and 3.57 nJ, respectively (assuming the memory sizes used in the illustrative example). The dynamic energy values for write accesses are slightly higher.

† The CACTI model can be used to estimate the energy consumption of SPMs as explained and exemplified in [3] (Appendix A).

Figure 3 displays the dynamic energy consumption as a function of the SPM size. The first bar is the reference and corresponds to a "flat" memory design, in which all operands have to be retrieved from the main memory. The second bar shows the energy used when the partition A8 (of size 225 and gain factor 765.92) is copied from the main memory to the SPM and is accessed afterwards from there. Note that placing A8, A9, and A10 in the SPM leads to almost 65% energy savings for an SPM size of 675 bytes (the 4th bar). The last two columns correspond to an SPM large enough to also contain the partitions A4 and A5.

Fig. 3  Variation of the dynamic energy consumption with the SPM size for the illustrative example in Fig. 2(a).

Furthermore, this approach allows savings in the overall data memory size: since some of the lattices have different lifetimes, different such partitions of the array space can share the same memory locations. For instance, the partition (lattice) A0 is read only in the first loop nest (as part of the operand A[k][i]), whereas A3 is read only in the 2nd and 3rd loop nests; hence, A3 can replace A0 in the data memory without any increase of its size. This is also the case for A2 (part of the operands A[k][i] in the 1st and 2nd loop nests) and A1, read only in the 3rd loop nest. In conclusion, although the total number of scalars in the illustrative example is 399,405, a data memory of 3,179 locations would suffice, since the signals D and Dopt can be stored in two datapath registers. Moreover, due to the possible memory sharing by the lattice pairs (A0, A3) and (A2, A1), the data memory needs at most 3,179 − 2 × 289 = 2,601 locations for the entire signal A.

Storing all the signals on-chip is, obviously, the most desirable scenario from the point of view of the dynamic energy consumption. We assumed that the SPM size is constrained to smaller values than the overall storage requirement. In our tests, we computed the ratio between the dynamic energy reduction and the SPM size; the SPM size maximizing this ratio was selected, the idea being to obtain the maximum benefit for the smallest SPM size. In this example, an SPM of 675 locations (and a main memory of 2,601 − 675 = 1,926 locations) satisfies this condition.

Note that Brockmeyer et al. [6] perform the assignment of the multidimensional signals to the memory layers at the level of entire arrays. The quantitative measure of their assignment approach is the average number of accesses per array element. For the illustrative example in Fig. 2(a), the average number of accesses to signal A is 248.42 (since there are 789,750 read accesses to the 3,179 A-elements). As shown in Fig. 2(c), the number of memory accesses can be very unevenly distributed in the array space†. Our model clearly offers a higher flexibility, identifying those parts of arrays that are heavily accessed, which can be significantly smaller than the whole array space (only about 21% in the example from Fig. 2).

† In this illustrative example, the number of accesses varies from 1 (e.g., for the element A[0][0]) to 780 (e.g., for A[5][144]).

Let us assume now that the SPM candidates can be combinations of cuts along the array dimensions, as in [15]. If the size of the SPM is limited to 675 bytes (the total size of the lattices A8, A9, A10), up to 61 rows of signal A can be stored in the SPM. Selecting the "middle" 61 rows 114-174 (since they are the most accessed), each one having 3,510 accesses, the dynamic energy saving versus a flat one-layer (off-chip) memory subsystem is 26.7%, since 214,110 accesses will be directed to the SPM. On the other hand, storing the lattices A8, A9, A10 in the SPM entails 3 × 172,332 = 516,996 accesses to the SPM and, consequently, a reduction of the dynamic energy consumption by 64.6%.
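These two figures can be checked to first order from the per-access energies quoted earlier in this section (0.048 nJ for the SPM, 3.57 nJ for the main memory), counting only the read accesses to signal A and neglecting the one-time cost of copying the partitions on-chip; a small Python check:

    E_SPM, E_MAIN = 0.048, 3.57   # nJ per read access (values quoted above)
    TOTAL = 789_750               # read accesses to signal A

    def saving(spm_accesses):
        flat = TOTAL * E_MAIN     # flat (1-layer) design: everything off-chip
        hier = spm_accesses * E_SPM + (TOTAL - spm_accesses) * E_MAIN
        return 1 - hier / flat

    print(f"61 middle rows in SPM: {saving(61 * 3_510):.1%}")    # ~26.7%
    print(f"A8, A9, A10 in SPM:    {saving(3 * 172_332):.1%}")   # ~64.6%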
Table 1 summarizes the results of our experiments, carried out on a PC with a 1.85 GHz Athlon XP processor and 512 MB memory. The benchmarks used are algebraic kernels (like Durbin's algorithm for solving Toeplitz systems) and algorithms used in multimedia applications (like, for instance, an MPEG4 motion estimation algorithm).

Table 1  Experimental results.

    Application    #Array refs.  #Scalars    #Memory accesses  Mem. size  Dyn. energy 1-layer [µJ]  SPM size  Energy saved [6]  Energy saved [15]  Energy saved (ours)  CPU [sec]
    Motion estim.  13            265,633     864,900           2,465      3,088                     1,416     38.7%             40.7%              50.7%                23
    Durbin alg.    21            252,499     1,004,993         1,249      3,588                     764       55.2%             58.5%              73.2%                28
    SVD updating   85            3,045,447   6,227,124         34,950     22,231                    12,672    35.9%             38.4%              46.0%                37
    Vocoder        236           33,619      200,000           11,890     714                       3,879     30.8%             32.5%              39.5%                8
    Dyn. prog.     3,992         21,082,751  83,834,000        124,751    299,287                   27,316    43.3%             46.6%              56.1%                47

The table displays the numbers of array references, scalar signals, and memory accesses; the data memory size (in storage locations/bytes) and the dynamic energy consumption assuming only one (off-chip) memory layer; the SPM size and the savings of dynamic energy obtained by applying, respectively, a previous model steered by the gain factors of whole arrays [6], a previous model steered by the most accessed array rows/columns [15], and the current model, versus the single-layer memory scenario; and the CPU times. The sizes of the main memory and of the SPM were evaluated after the mapping of the signals into the physical memories. The mapping algorithm [27] computes maximal bounding windows (the general idea was explained at the end of Sect. 3) for each signal, based on the bounding windows of all the lattices extracted from the code. The main memory size is the total size of these windows. Since in our model the SPM stores the intensely accessed parts of the signals, represented also by lattices, the same mapping algorithm [27], operating with only that subset of lattices, was used to compute the size of the SPM.

Our experiments show that the savings of dynamic energy consumption range from 40% to over 70% relative to the energy used in the case of a flat memory design. Although previous models produce important energy savings as well ([15] yields better results than [6] due to a higher flexibility in selecting the 'copy candidates'), our model led to 20-33% better savings than them. The energy consumptions for the motion estimation benchmark were, respectively, 1894, 1832, and 1522 µJ; the saved energies relative to the energy in column 6 are displayed as percentages in columns 8-10.

5. Conclusions

This paper has presented a formal model which allows the partitioning of the index space of the arrays from data-dominated applications such that those array parts heavily accessed are identified and stored in scratch-pad memories, in order to diminish the dynamic energy consumption due to memory accesses. This model led to energy savings of, typically, 40-70% and proved to be better than previous models based on array references and their cuts along the main dimensions.

References

[1] F. Angiolini, L. Benini, and A. Caprara, "An efficient profile-based algorithm for scratchpad memory partitioning," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.24, no.11, pp.1660-1676, Nov. 2005.
[2] F. Balasa, H. Zhu, and I.I. Luican, "Computation of storage requirements for multi-dimensional signal processing applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.15, no.4, pp.447-460, April 2007.
[3] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Comparison of cache- and scratch-pad based memory systems with respect to performance, area and energy consumption," Technical Report #762, University of Dortmund, Sept. 2001.
[4] A.I. Barvinok, "A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed," Math. Operations Res., vol.19, no.4, pp.769-779, 1994.
[5] L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino, "Layout-driven memory synthesis for embedded systems-on-chip," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.10, no.2, pp.96-105, April 2002.
[6] E. Brockmeyer, M. Miranda, H. Corporaal, and F. Catthoor, "Layer assignment techniques for low energy in multi-layered memory organisations," Proc. ACM/IEEE Design Automation and Test in Europe, pp.1070-1075, Munich, Germany, March 2003.
[7] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecapelle, Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design, Kluwer Acad. Publ., Boston, 1998.
[8] S. Coumeri and D.E. Thomas, "Memory modeling for system synthesis," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.8, no.3, pp.327-334, June 2000.
[9] G.B. Dantzig and B.C. Eaves, "Fourier-Motzkin elimination and its dual," J. Comb. Theory A, vol.14, pp.288-297, 1973.
[10] A. Darte, R. Schreiber, and G. Villard, "Lattice-based memory allocation," IEEE Trans. Comput., vol.54, no.10, pp.1242-1257, Oct. 2005.
[11] J.A. De Loera, R. Hemmecke, J. Tauzer, and R. Yoshida, "Effective lattice point counting in rational convex polytopes," J. Symbolic Computation, vol.38, no.4, pp.1273-1302, 2004.
[12] J.Z. Fang and M. Lu, "An iteration partition approach for cache or local memory thrashing on parallel processing," IEEE Trans. Comput., vol.42, no.5, pp.529-546, 1993.
[13] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy caches: Simple techniques for reducing leakage power," Proc. Symp. Computer Architecture, pp.148-157, May 2002.
[14] O. Golubeva, M. Loghi, M. Poncino, and E. Macii, "Architectural leakage-aware management of partitioned scratchpad memories," Proc. ACM/IEEE Design Automation and Test in Europe, pp.1665-1670, Nice, France, April 2007.
[15] Q. Hu, A. Vandecapelle, M. Palkovic, P.G. Kjeldsberg, E. Brockmeyer, and F. Catthoor, "Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications," Proc. Asia-S. Pacific Design Automation Conf., pp.606-611, Yokohama, Japan, 2006.
[16] M. Kandemir and A. Choudhary, "Compiler-directed scratch-pad memory hierarchy design and management," Proc. 39th ACM/IEEE Design Automation Conf., pp.690-695, Las Vegas, NV, June 2002.
[17] M. Kandemir, M.J. Irwin, G. Chen, and I. Kolcu, "Compiler-guided leakage optimization for banked scratch-pad memories," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.13, no.10, pp.1136-1146, Oct. 2005.
[18] S. Kaxiras, Z. Hu, and M. Martonosi, "Cache decay: Exploiting generational behavior to reduce cache leakage power," Proc. Symp. Computer Arch., pp.240-251, June 2001.
[19] I.I. Luican, H. Zhu, and F. Balasa, "Formal model of data reuse analysis for hierarchical memory organizations," Proc. IEEE/ACM Int. Conf. on Comp.-Aided Design, pp.595-600, San Jose, CA, Nov. 2006.
[20] I.I. Luican, H. Zhu, F. Balasa, and D.K. Pradhan, "Reducing the dynamic energy consumption in the multi-layer memory of embedded multimedia processing systems," Proc. 14th Workshop on Synthesis and Syst. Integration of Mixed Inform. Technologies, pp.42-48, Sapporo, Japan, Oct. 2007.
[21] P.R. Panda, N. Dutt, and A. Nicolau, "On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems," ACM Trans. Des. Autom. Electron. Syst., vol.5, no.3, pp.682-704, July 2000.
[22] A. Schrijver, Theory of Linear and Integer Programming, John Wiley, New York, 1986.
[23] W. Shiue and C. Chakrabarti, "Memory exploration for low-power embedded systems," Proc. 35th ACM/IEEE Design Automation Conf., pp.140-145, June 1998.
[24] L. Thiele, "Compiler techniques for massive parallel architectures," in State-of-the-art in Computer Science, ed. P. Dewilde, Kluwer Acad. Publ., 1992.
[25] R. Tronçon, M. Bruynooghe, G. Janssens, and F. Catthoor, "Storage size reduction by in-place mapping of arrays," in Verification, Model Checking and Abstract Interpretation, ed. A. Coresi, pp.167-181, 2002.
[26] S. Wuytack, J.-P. Diguet, F. Catthoor, and H. De Man, "Formalized methodology for data reuse exploration for low-power hierarchical memory mappings," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.6, no.4, pp.529-537, Dec. 1998.
[27] H. Zhu, I.I. Luican, and F. Balasa, "Mapping multi-dimensional signals into hierarchical memory organizations," Proc. ACM/IEEE Design Automation and Test in Europe, pp.385-390, Nice, France, April 2007.
[28] CACTI v4.2, [Online]. Available: http://quid.hpl.hp.com:9081/cacti/detailed.y

Hongwei Zhu received the B.S. degree in Electrical Engineering from Xi'an Jiaotong University, P.R. China, in 1996, the M.S. degree in Electrical & Electronic Engineering from Nanyang Technological University, Singapore, in 2001, and the Ph.D. degree in Computer Science from the University of Illinois at Chicago, U.S.A., in 2007. Currently, he is a software engineer at ARM Inc., Sunnyvale, California, U.S.A. His research interests are digital design methodologies, memory management for signal processing, and combinatorial optimization in VLSI CAD.

Ilie I. Luican received the B.S. and M.S. degrees in Computer Science from the Polytechnical University of Bucharest (PUB), Romania, in 2002 and 2003, respectively. Currently, he is a Ph.D. candidate in the Dept. of Computer Science, University of Illinois at Chicago, U.S.A. His research interests are high-level synthesis and memory management for real-time multidimensional signal processing.

Florin Balasa received the M.S. and Ph.D. degrees in Computer Science from the Polytechnical University of Bucharest in 1981 and 1994, respectively. He also received the Ph.D. degree in Electrical Engineering from the Katholieke Universiteit Leuven, Belgium, in 1995. Dr. Balasa is currently an associate professor of Computer Science at Southern Utah University. He was previously an assistant professor of Computer Science at the Univ. of Illinois at Chicago. Dr. Balasa is a recipient of the National Science Foundation Career Award. He is an associate editor for J. Computers & Electrical Eng. (Elsevier).

Dhiraj K. Pradhan received the M.S. degree in Computer Science from Brown University in 1970 and the Ph.D. degree from the University of Iowa, U.S.A., in 1972. Dr. Pradhan currently holds a Chair in Computer Science at the University of Bristol, U.K. He had been a professor of Electrical and Computer Engineering at Oregon State University, Corvallis. Previous to this, Dr. Pradhan held the COE Endowed Chair Professorship in Computer Science at Texas A&M University, College Station, also serving as founder of their Laboratory of Computer Systems. Prior to that, Dr. Pradhan held a Professorship at the University of Massachusetts, Amherst, where he also served as coordinator of Computer Engineering. Prof. Pradhan has served as co-author and editor of various books, including Fault-Tolerant Computing: Theory and Techniques, Vols. I, II (Prentice-Hall, 1986), Fault-Tolerant Computer Systems Design (Prentice-Hall, 1996, 2nd ed. 2003), and IC Manufacturability: The Art of Process and Design Integration (IEEE Press, 2000). He continues to serve as an editor for prestigious journals, including IEEE Transactions.
A Fellow of the ACM, the IEEE, and the Japan Society for the Promotion of Science, Prof. Pradhan is also the recipient of a Humboldt Prize, Germany. In 1997, he was also awarded the Fulbright-Flad Chair in Computer Science, and he received the 1996 IEEE TCAD Best Paper Award.