
Parallelism versus Memory Allocation
in Pipelined Router Forwarding Engines
Student: Jia Mao ([email protected])
Faculty: Fan Chung, Ron Graham, George Varghese
CSE Department, University of California, San Diego, CA
Abstract
A crucial problem that needs to be addressed is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this also minimizes memory sharing. Ideal sharing occurs by using a single shared memory for all processors, but this maximizes contention.

Instead, here we show that perfect sharing of memory can be achieved with a collection of *two*-port memories, as long as the number of processors is less than the number of memories.

We show that the allocation problem is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2 of optimal. The proof utilizes a new bin-packing model, which is interesting in its own right. Further, for important special cases that arise in practice, the approximation algorithm is in fact optimal. We also describe a dynamic memory allocation algorithm that provides good memory utilization while allowing fast updates.
Background
Parallel processors are often used to solve time-consuming
problems, such as prefix-matching in IP lookup schemes. Typically,
each processor has some memory where it stores computation
data.
[Figure: a pipeline of processors P1 through P7, each with its own memory.]
To minimize contention and maximize speed, each memory should be read by exactly one processor.

Unfortunately, if the tasks assigned to processors vary wildly in memory usage, this is not an efficient use of memory. The opposite extreme, a single memory shared by all processors, maximizes sharing, but memory access becomes very slow due to time-multiplexing.

New memory model

Faced with two unacceptable extremes, it is natural to consider intermediate forms. We propose a collection of Y-port memories, where Y < n. Here the "collection" is achieved by a partial crossbar, so that each processor has access to a variable number of memory banks, and Y constrains the maximum number of processors that can share one memory bank.
Our final choice is Y = 2, so each processor has access to a collection of 2-port memories. Each memory has 2 ports that can be allocated to any two processors. Memory access speeds slow down by a factor of at most 2.
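As an illustration of this model, the following minimal Python sketch (ours, not from the poster; names and data layout are assumptions) checks that an assignment of processors to memory banks respects the two-port constraint and satisfies each processor's demand.

    # assignment: bank id -> {processor id: units allocated from that bank}
    def valid_two_port(assignment, demands, bank_size=1.0, ports=2):
        for users in assignment.values():
            if len(users) > ports:              # at most 2 processors per bank
                return False
            if sum(users.values()) > bank_size + 1e-12:
                return False                    # bank must not be overfull
        got = {}
        for users in assignment.values():
            for p, units in users.items():
                got[p] = got.get(p, 0.0) + units
        # the partial crossbar lets a processor's demand span several banks
        return all(got.get(p, 0.0) >= d for p, d in demands.items())

    demands = {"P1": 0.6, "P2": 0.9}
    assignment = {0: {"P1": 0.6, "P2": 0.4}, 1: {"P2": 0.5}}
    print(valid_two_port(assignment, demands))  # True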
Bad news
The general offline memory allocation problem in this model, for arbitrary b and n, is NP-complete.

We can abstract it as a bin-packing problem with a "2-type" constraint and bin (memory) size normalized to 1:

Given: b bins and a list of n weights W = (w1, w2, …, wn), all of different types, with wi ≥ 0.

Question: If each weight can be partitioned into parts, can W be packed into b bins such that each bin has parts of at most 2 types?

This new bin-packing problem can be shown to be NP-complete by a transformation from 3-PARTITION.
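For concreteness, here is a small Python sketch (ours; the function name and data layout are illustrative assumptions) that checks a proposed fractional packing against the 2-type constraint. It also shows why splitting matters: three weights of 0.6 fit into two unit bins only because the middle weight is divided between them.

    # Each bin maps a weight index to the fraction of that weight it holds.
    def is_valid_2type_packing(weights, bins, capacity=1.0):
        for b in bins:
            if len(b) > 2:        # parts of at most 2 types per bin
                return False
            load = sum(weights[i] * f for i, f in b.items())
            if load > capacity + 1e-12:
                return False      # no bin may overflow
        placed = [0.0] * len(weights)
        for b in bins:
            for i, f in b.items():
                placed[i] += f
        # every weight must be packed completely
        return all(abs(p - 1.0) < 1e-9 for p in placed)

    weights = [0.6, 0.6, 0.6]
    bins = [{0: 1.0, 1: 2/3}, {1: 1/3, 2: 1.0}]
    print(is_valid_2type_packing(weights, bins))  # True

    # Capacity alone is not the whole story: three tiny weights cannot
    # share a single bin, since one bin holds parts of at most 2 types.
    print(is_valid_2type_packing([0.1, 0.1, 0.1], [{0: 1.0, 1: 1.0, 2: 1.0}]))  # False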
Good news
We have a good approximation algorithm that runs in linear time, is within a factor of 3/2 of optimal, and is optimal when b ≥ n (# memories ≥ # processors).

Approximation algorithm

Every packing has an associated graph representation, where vertices correspond to weights and edges correspond to bins. For a partially filled bin, a so-called "weak edge" is used, denoted by a dotted line. We also distinguish between "cycles" (2 or more vertices) and "loops".

For example, given a list of weights, more than one packing may be valid. [Figure: an example weight list with two valid packings, Packing 1 and Packing 2.]

Our algorithm is based on the following observations:
1. We can repack to break cycles in an associated graph without using more bins.
2. We can repack to remove weak edges in the associated graph until each connected component has at most one weak edge, without using more total bins.
3. We can repack to remove weak loops in the associated graph until there is at most one weak loop, without using more bins.
4. A non-weak loop can always be 'absorbed' into another connected component with a weak edge, without using more bins.

These observations yield the following procedure (a sketch of the greedy phase follows below):

for i = 1 to n
    pack wi greedily
remove cycles if present
remove weak edges till every CC has ≤ 1 weak edge
remove weak loops till there's ≤ 1 weak loop
merge non-weak loops with other CC's

After running this algorithm, we are left with a cycle-free graph in which each connected component has at most one weak edge and there is at most one weak loop. Moreover, if there is a non-weak loop, then all other connected components have no weak edges.

This algorithm runs in linear time and generates a packing within a factor of 3/2 of optimum; it is optimal when b is no less than n. We can also show that 3/2 is a tight bound for this particular algorithm.
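The repacking steps are graph operations that we do not reproduce here, but the greedy packing phase admits a short Python rendering under one reading of "pack wi greedily" (a bin is closed once it is full or already holds parts of 2 types); this is an illustration, not the authors' exact procedure.

    # Sketch of the greedy phase only; the cycle/weak-edge/weak-loop
    # repacking steps from the pseudocode above are omitted.
    def greedy_pack(weights, capacity=1.0):
        bins = []                    # each bin: {weight index: amount packed}
        cur, free = {}, 0.0
        for i, w in enumerate(weights):
            left = w
            while left > 1e-12:
                # Open a fresh bin if the current one is full or already
                # holds parts of 2 types (the "2-type" constraint).
                if free <= 1e-12 or len(cur) >= 2:
                    cur, free = {}, capacity
                    bins.append(cur)
                put = min(left, free)
                cur[i] = cur.get(i, 0.0) + put
                free -= put
                left -= put
        return bins

    result = greedy_pack([0.6, 0.6, 0.6])
    print(len(result))  # 2 bins suffice because the middle weight is split
    print([{i: round(a, 2) for i, a in b.items()} for b in result])
    # [{0: 0.6, 1: 0.4}, {1: 0.2, 2: 0.6}]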
Dynamic allocation algorithm

So far we have only dealt with approximation and exact algorithms for static memory allocation. On the more practical side, how can we get good dynamic memory allocation algorithms that maintain overall efficiency? Upon each new memory request, allocate or de-allocate, we are now allowed to repack previously assigned memory units besides handling the new request. In this situation we have a tradeoff between memory utilization and the cost of repacking or compaction. To capture and analyze this tradeoff, we define the compaction ratio of any online memory allocation algorithm A to be

rA = max_t {M_t / W_t} = max_t {total memory units moved up to time t / total memory units allocated up to time t}
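As a concrete reading of this definition, the bookkeeping behind rA might look like the following hypothetical Python helper (class and method names are ours).

    class CompactionMeter:
        """Tracks r_A = max over t of (units moved) / (units allocated)."""
        def __init__(self):
            self.moved = 0.0       # total memory units moved by repacking
            self.allocated = 0.0   # total memory units allocated so far
            self.ratio = 0.0       # running value of r_A

        def allocate(self, units):
            self.allocated += units
            self._update()

        def repack(self, units_moved):
            self.moved += units_moved
            self._update()

        def _update(self):
            if self.allocated > 0:
                self.ratio = max(self.ratio, self.moved / self.allocated)

    m = CompactionMeter()
    m.allocate(10)     # a request for 10 units arrives
    m.repack(5)        # serving it forces 5 units to move
    print(m.ratio)     # 0.5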
Depending on our requirement on the compaction ratio, we have several different cases to consider:
Case 1: rA can be arbitrarily large.
This happens when repacking or compaction is assumed to have negligible cost and we have unlimited computing power. We can simply solve the offline allocation problem optimally every time a new memory request comes in, and perform repacking whenever needed. In practice, this is an idealized case: repacking almost certainly has a cost because it performs memory operations, and the computation cost is also high and cannot be neglected.
Case 2: rA is bounded from above; in particular, rA ≤ c, where c is a constant.
This is the most interesting case, and we have a 3/2-competitive algorithm with compaction ratio bounded from above by 1.
Case 3: rA = 0, i.e., no compaction is allowed.
In this case, no online algorithm can have an approximation ratio better than 2. We have a simple online algorithm that achieves ratio 2.
Conclusion
The most important lesson is that it is possible to share memory across parallel stages in an almost perfect manner (regardless of individual demands) if we use two-port instead of one-port memories, each of which can be assigned to a stage using some form of partial crossbar switch. In practice, one would simply choose the parameters so that the number of memories is larger than the number of processor stages; in that case, the approximation algorithm we presented provides 100% efficiency.

In essence, we are finessing a difficult problem (allocating across one-port memories) by changing the model. The new models are practical: we know of at least one implementation of one of our models that scales to multiple OC-768 speeds. On the theoretical front, we could consider the general case of packing bins so that each bin contains parts of at most r types, for some fixed integer r. In this case, we would formulate the associated hypergraphs of a packing instead of just associated graphs.