
Embedded Systems Seminar

Heterogeneous Memory Management for Embedded Systems
By O. Avissar, R. Barua and D. Stewart
Presented by Kumar Karthik
Heterogeneous Memory

• Heterogeneous = a mix of different memory types.
• Embedded systems come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM and large amounts of EEPROM (Flash memory).
Relative RAM Costs and Latencies

• Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM
• Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM
Caches in Embedded Chips

• Caches are power-hungry.
• Cache-miss penalties make it hard to give real-time performance guarantees.
• Solution: do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).
Memory Allocation in Embedded Systems

• Memory allocation for program data is done by the embedded-systems programmer, in software, as current compilers are not capable of allocating across heterogeneous memory units.
• The code is written in assembly: tedious and non-portable.
• Solution: an intelligent compilation strategy that can achieve optimal memory allocation in embedded systems.
Memory Allocation Example
The Need for Profiling

• Recall the relative RAM latencies.
• Allocation is best when the most frequently accessed program data is stored in the memory unit with the lowest latency.
• The access frequencies of memory references therefore need to be measured.
• Solution: profiling.
Intelligent Compilers

The intelligent compiler must be able to:
1. Optimally allocate memory to program data
2. Base memory allocation on frequency estimates collected through profiling
3. Correlate memory accesses with the variables they access

Task 3 demands inter-procedural pointer analysis, which is costly.
Profiling

• Instead of pointer analysis, a more efficient statistical method is used: each accessed address is checked against a table of address ranges for the different variables.
• This provides exact statistics, as opposed to pointer analysis.
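
A minimal sketch of this address-range check, under assumptions: the names VarRange and profile_access and the example variables a and b are invented for illustration, not taken from the paper.

/* Minimal sketch of the address-range check, assuming the profiler keeps
 * one [start, end) range per program variable. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name;          /* variable name                     */
    uintptr_t start, end;      /* [start, end) address range        */
    unsigned reads, writes;    /* profiled counts N_r(v) and N_w(v) */
} VarRange;

static int a[100], b;          /* example program variables */
static VarRange table[2];      /* one entry per variable    */

/* Conceptually invoked on every instrumented memory access. */
static void profile_access(uintptr_t addr, int is_write)
{
    for (int i = 0; i < 2; i++) {
        if (addr >= table[i].start && addr < table[i].end) {
            if (is_write) table[i].writes++;
            else          table[i].reads++;
            return;
        }
    }
}

int main(void)
{
    table[0] = (VarRange){ "a", (uintptr_t)a,  (uintptr_t)(a + 100), 0, 0 };
    table[1] = (VarRange){ "b", (uintptr_t)&b, (uintptr_t)(&b + 1),  0, 0 };

    profile_access((uintptr_t)&a[3], 0);   /* simulate a read of a[3] */
    profile_access((uintptr_t)&b, 1);      /* simulate a write of b   */

    for (int i = 0; i < 2; i++)
        printf("%s: %u reads, %u writes\n",
               table[i].name, table[i].reads, table[i].writes);
    return 0;
}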
Memory Access Times

• The total access time (the sum over all memory accesses in the program) needs to be minimized.
• The formulation is first defined for global variables and then extended for stack and heap variables.
Formulation for Global Variables

Key terms:
• T_rj·N_r(v_i) – total time taken for the N_r(v_i) reads of variable v_i when it is stored on memory unit j (T_rj is the read latency of unit j).
• T_wj·N_w(v_i) – total time taken for the N_w(v_i) writes of variable v_i when it is stored on memory unit j (T_wj is the write latency of unit j).
• I_j(v_i) – a 0/1 integer variable: 1 if v_i is allocated to unit j, 0 otherwise.
Formulation for Global Variables

Total access time = ∑(j=1 to U) ∑(i=1 to G) I_j(v_i)·[T_rj·N_r(v_i) + T_wj·N_w(v_i)]

U = number of memory units
G = number of global variables

T_rj·N_r(v_i) + T_wj·N_w(v_i) contributes to the inner sum only if variable v_i is stored in memory unit j (if not, I_j(v_i) = 0 and the whole term is 0).
0/1 Integer Linear Program Solver

• The 0/1 integer linear program solver searches over the combinations of the 0/1 variables to arrive at the lowest total memory access time and returns this solution to the compiler.
• The solution is the optimal memory allocation.
• MATLAB is used as the solver in this paper.
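
For intuition only, here is a brute-force sketch of that search on a tiny, made-up instance (two units, three variables). All latencies, capacities, sizes and access counts are invented; the paper itself uses MATLAB's integer linear program solver rather than enumeration.

/* Brute-force 0/1 allocation search over a made-up instance. */
#include <stdio.h>
#include <limits.h>

#define U 2   /* memory units: 0 = SRAM, 1 = DRAM */
#define G 3   /* global variables                 */

static const int lat[U]      = { 1, 10 };          /* per-access latency     */
static const int cap[U]      = { 8, 1024 };        /* unit capacity in bytes */
static const int size[G]     = { 4, 8, 4 };        /* variable sizes         */
static const int accesses[G] = { 1000, 10, 500 };  /* profiled N_r + N_w     */

int main(void)
{
    long best = LONG_MAX;
    int best_alloc[G] = { 0 };

    int total = 1;
    for (int i = 0; i < G; i++) total *= U;   /* U^G candidate allocations */

    for (int code = 0; code < total; code++) {
        int alloc[G], used[U] = { 0 };
        long cost = 0;
        /* Decode the candidate as a base-U number: digit i = unit of v_i. */
        for (int i = 0, c = code; i < G; i++, c /= U) {
            alloc[i] = c % U;
            used[alloc[i]] += size[i];
            cost += (long)lat[alloc[i]] * accesses[i];
        }
        /* Keep the cheapest allocation that respects every capacity. */
        int ok = 1;
        for (int j = 0; j < U; j++) ok &= used[j] <= cap[j];
        if (ok && cost < best) {
            best = cost;
            for (int i = 0; i < G; i++) best_alloc[i] = alloc[i];
        }
    }

    printf("minimum total access time: %ld\n", best);
    for (int i = 0; i < G; i++)
        printf("v%d -> %s\n", i, best_alloc[i] ? "DRAM" : "SRAM");
    return 0;
}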
Constraints

The following constraints also hold:
• The embedded processor allows at most one memory access per cycle; overlapping memory latencies are not considered.
• Every variable is allocated to only one memory unit.
• The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of the unit.
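
In the notation of the formulation (writing S(v_i) for the size of v_i and S_j for the size of unit j, both assumed symbols), the last two constraints can plausibly be written as:

∑(j=1 to U) I_j(v_i) = 1   for every variable v_i
∑(i=1 to G) I_j(v_i)·S(v_i) ≤ S_j   for every memory unit j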

Stack Variables

• The formulation is now extended to local variables, procedure parameters and return variables (collectively known as stack variables).
• Stacks are sequentially allocated abstractions, much like arrays.
• Distributing the stack over heterogeneous memory units improves the memory allocation.
Stack split example
Distributed Stacks

• Multiple stack pointers are needed: in the example, 2 stack pointers have to be incremented on procedure entry (one for each split of the stack) and 2 have to be decremented on leaving the procedure.
• Maintaining 2 stack pointers induces overhead.
Distributed Stacks

• The software overhead is tolerated for long-running procedures and eliminated for short procedures by allocating each stack frame to a single memory unit (one stack pointer per procedure).
• Distributed stacks are implemented by the compiler for ease of use: the abstraction of the stack as a contiguous data structure is maintained for the programmer.
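
A minimal sketch of what the compiler-maintained split might look like, assuming two software-managed stack regions with their own pointers; the region names, sizes and variable placement are invented for illustration.

/* One stack frame split across a fast and a slow region, each with its
 * own stack pointer.  A real compiler would emit this in the procedure
 * prologue and epilogue. */
#include <stdio.h>

static _Alignas(8) char sram_stack[256];   /* fast on-chip region  */
static _Alignas(8) char dram_stack[4096];  /* slow off-chip region */
static char *sram_sp = sram_stack + sizeof sram_stack;  /* grow downward */
static char *dram_sp = dram_stack + sizeof dram_stack;

static void callee(void)
{
    /* Prologue: bump both stack pointers, one per split of the frame. */
    int  *hot  = (int *)(sram_sp -= sizeof(int)); /* hot scalar -> SRAM */
    char *cold = (dram_sp -= 64);                 /* big buffer -> DRAM */

    *hot = 42;
    cold[0] = 'x';
    printf("hot local = %d, cold buffer starts with '%c'\n", *hot, cold[0]);

    /* Epilogue: restore both stack pointers on exit. */
    sram_sp += sizeof(int);
    dram_sp += 64;
}

int main(void) { callee(); return 0; }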
Comparison to Globals

• Stack variables have limited lifetimes compared to globals: they are 'live' only while their procedure is executing and can be reclaimed once the procedure exits.
• Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.
Formulation for Stack Frames

There are 2 ways of extending the method to handle stack variables. In the first, each procedure's stack frame is stored in a single memory unit:
• No multiple stack pointers are needed.
• The stack is still distributed, as different stack frames may be allocated to different memory units.
Stack-Extended Formulation

Total access time = time taken to access global variables + time taken to access stack variables.

The f_i range over the functions in the program (as each function has a stack frame).
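
A plausible reconstruction of the formula, mirroring the global formulation and treating each stack frame f_k like a variable (with F the number of functions; the symbols are assumed, not copied from the paper):

Total access time = ∑(j=1 to U) ∑(i=1 to G) I_j(v_i)·[T_rj·N_r(v_i) + T_wj·N_w(v_i)]
                  + ∑(j=1 to U) ∑(k=1 to F) I_j(f_k)·[T_rj·N_r(f_k) + T_wj·N_w(f_k)]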
Constraints

• Each stack frame may be stored in at most one memory unit.
• The stack reaches its maximum size when a call-graph leaf node is reached.
• A call-graph leaf node is the most deeply nested procedure called; the program's allocation fits into memory if, along every path from the root to a leaf of the call graph, all of the stack frames on the path can be allocated together.
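
In symbols (a hedged sketch; size(f) for the frame size of procedure f and S_j for the size of unit j are assumed names), the capacity constraint becomes, for every memory unit j and every root-to-leaf path P in the call graph:

∑(f ∈ P) I_j(f)·size(f) ≤ S_j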
Stack-Extended Formulation

The 2nd alternative:
• Stack variables from the same procedure can be mapped to different memory units.
• Stack variables are thus treated like globals, with the same form of total-access-time expression as before.
• However, the memory requirements are relaxed, as in the stack-frame case, based on the disjoint lifetimes of the stack variables.
Heap-Extended Formulation

• Heap data cannot be allocated statically, as the allocation frequencies and block sizes are unknown at compile time.
• Calls such as malloc() fall into this category.
• The allocation has to be estimated using a good heuristic.
• Each static heap-allocation site is treated as a variable v in the formulation.
Heap-Extended Formulation

• The number of references to each site is counted through profiling.
• The variable's size is bounded as a finite multiple of the total size of memory allocated at that site.
• Example: if a malloc() site allocates 20 bytes 8 times over in a program, the size of v is 160 bytes, which is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.
Heap-Extended Formulation

• This optimizes for the common case.
• Calls like malloc() are cloned for each memory level, and each clone in turn maintains a free list.
• If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from a slower and larger memory is returned instead.
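
A minimal sketch of such per-level cloning with fallback. The names sram_malloc/dram_malloc, the bump allocation and the pool sizes are all assumptions for illustration; the paper's clones maintain real free lists.

/* Per-memory-level malloc clones with fallback to slower memory. */
#include <stddef.h>
#include <stdio.h>

static char sram_pool[320];   /* 320 bytes reserved for this site (2 x 160) */
static size_t sram_used;
static char dram_pool[4096];  /* slower, larger fallback memory             */
static size_t dram_used;

static void *sram_malloc(size_t n)   /* fast-memory clone */
{
    if (sram_used + n > sizeof sram_pool) return NULL;  /* site overflow */
    void *p = sram_pool + sram_used;
    sram_used += n;
    return p;
}

static void *dram_malloc(size_t n)   /* slow-memory fallback clone */
{
    if (dram_used + n > sizeof dram_pool) return NULL;
    void *p = dram_pool + dram_used;
    dram_used += n;
    return p;
}

int main(void)
{
    /* Simulate the call site allocating 20 bytes 20 times: the first
     * 16 blocks fit the reserved SRAM, the rest spill into DRAM. */
    for (int i = 0; i < 20; i++) {
        const char *where = "SRAM";
        void *p = sram_malloc(20);
        if (!p) { p = dram_malloc(20); where = "DRAM"; }
        printf("block %2d -> %s (%p)\n", i, where, p);
    }
    return 0;
}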
Heap-Extended Formulation

• The latency of a heap access is therefore at most the latency of the slowest memory.
• If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.
Experiment

• The compiler was implemented as an extension to the commonly used GCC cross-compiler, targeting the Motorola M-Core processor.
• The benchmarks used represent code in typical applications.
• Runtimes were normalized against a configuration using only the fastest memory type (SRAM); slower memories were then introduced for subsequent tests and the runtimes measured.
Results
Results


• Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, and without much of a performance loss.
• This shows that (at least for the benchmark programs) the memory allocation is optimal. The FIB benchmark, which computes Fibonacci numbers with a linear recurrence, is the exception, as it has an equal number of accesses to all its variables.
Experiment 2

• Ample DRAM and EEPROM were provided while the SRAM size was varied for each of the benchmark programs.
• This helps determine the minimum amount of SRAM needed to maintain performance reasonably close to the 100% SRAM case.
FIR Benchmark
Matrix multiplication benchmark
Fibonacci series benchmark
Byte to ASCII converter
Results

• It is clear that the most frequently accessed data makes up only 10-20% of the entire program.
• This portion is successfully placed in SRAM through the profile-based optimization.
Comparing Stack Frames and Stack Variables
Results

• The BMM benchmark is used as it has the largest number of functions/procedures (and hence the largest number of stack frames/variables).
• Allocating stack variables individually to different units performs better in theory, owing to the finer granularity and thus a more custom allocation.
• The difference is apparent at the smaller SRAM sizes.
Applications

The approach in the paper can be used to determine an optimal trade-off between the minimum SRAM size and the performance requirements that must be met.
Adapting to Pre-emption

• In context-switching environments, the data of every live program has to be resident in some memory at any given time.
• The variables of all the live programs are therefore combined, and the formulation is solved after weighting each context's variables by the relative frequency of that context. An optimal allocation is achieved in this case as well.
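
As a hedged sketch of that weighting (w_c, the relative frequency of context c, is an assumed symbol), the objective plausibly becomes:

Total access time = ∑(over contexts c) w_c · ∑(j=1 to U) ∑(v_i in c) I_j(v_i)·[T_rj·N_r(v_i) + T_wj·N_w(v_i)]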
Summary

• A compiler method to distribute program data efficiently among heterogeneous memories
• No caching hardware is used
• Static allocation across memory units
• Stack distribution
• Optimality guarantee
• Runtime depends on relative access frequencies
Related Work

• There is not much work on cache-less embedded chips with heterogeneous memory units.
• The memory-allocation task is usually left to the programmer.
• The compiler method is better for larger, more complex programs.
• It is error-free and also portable over different systems with minor modifications to the compiler.
Related Work

• Panda et al. and Sjodin et al. have researched memory allocation in cached embedded chips.
• Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, and offer no optimality guarantee.
• Earlier studies take only 2 memory levels into account (SRAM and DRAM), while this formulation can be extended to N levels of memory.
Related Work

• Dynamic allocation strategies are also possible but are not explored here.
• Software caching (emulation of a cache in fast memory) is an option.
• Methods to overcome the software overhead need to be devised.
• The inability to provide real-time guarantees should be addressed.
THE END