Parallelism versus Memory Allocation in Pipelined Router Forwarding Engines

Student: Jia Mao ([email protected])
Faculty: Fan Chung, Ron Graham, George Varghese
CSE Department, University of California, San Diego, CA

Abstract

A crucial problem in pipelined router forwarding engines is the allocation of memory to the processors in the pipeline. Ideally, the processor memories should be completely separate (i.e., one-port memories) in order to minimize contention; however, this minimizes memory sharing. Ideal sharing occurs with a single memory shared by all processors, but this maximizes contention. Instead, we show that perfect memory sharing can be achieved with a collection of *two*-port memories, as long as the number of processors is less than the number of memories. We show that the allocation problem is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2 of optimal. The proof uses a new bin-packing model, which is interesting in its own right. Further, for important special cases that arise in practice, the approximation algorithm is optimal. We also describe a dynamic memory allocation algorithm that provides good memory utilization while allowing fast updates.

Background

Parallel processors are often used to solve time-consuming problems, such as prefix matching in IP lookup schemes. Typically, each processor has some memory where it stores computation data.

[Figure: a pipeline of processors P1 through P7, each with its own memory bank.]

To minimize contention and maximize speed, each memory should be read by exactly one processor. Unfortunately, if the tasks assigned to the processors vary wildly in memory usage, this is not an efficient use of memory. At the other extreme, a single shared memory maximizes sharing, but memory access becomes very slow due to time-multiplexing.

New memory model

Faced with two unacceptable extremes, it is natural to consider intermediate forms. We propose a collection of Y-port memories, where Y < n. Here the "collection" is achieved by a partial crossbar, so that each processor has access to a variable number of memory banks, and Y bounds the maximum number of processors that can share one memory bank.

Our final choice is Y = 2, so each processor has access to a collection of 2-port memories. Each memory has 2 ports that can be allocated to any two processors, and memory access speeds slow down by a factor of at most 2.

Bad news

The general offline memory allocation problem in this model, for arbitrary b and n, is NP-complete. We can abstract it as a bin-packing problem with a "2-type" constraint and bin (memory) size normalized to 1:

Given: b bins and a list of n weights W = (w1, w2, ..., wn), all of different types, with wi ≥ 0.
Question: Allowing each weight to be partitioned into parts, can W be packed into b bins such that each bin contains parts of at most 2 types?

This new bin-packing problem can be shown to be NP-complete by a transformation from 3-PARTITION.

Good news

We have an approximation algorithm that
- runs in linear time,
- produces a packing within a factor of 3/2 of optimal, and
- is optimal when b ≥ n (# memories ≥ # processors).

Approximation algorithm

Every packing has an associated graph representation, where vertices correspond to weights and edges correspond to bins. For a partially filled bin, a so-called "weak edge" is used, denoted by a dotted line. We also distinguish between "cycles" (2 or more vertices) and "loops" (an edge on a single vertex).

[Figure: an example (Packing 1) together with two valid alternative packings and their associated graphs.]

The algorithm is based on the following observations:
1. We can repack to break cycles in the associated graph without using more bins.
2. We can repack to remove weak edges in the associated graph until each connected component has at most one weak edge, without using more total bins.
3. We can repack to remove weak loops in the associated graph until there is at most one weak loop, without using more bins.
4. A non-weak loop can always be "absorbed" into another connected component with a weak edge, without using more bins.

The algorithm itself, at a high level (concrete sketches follow below):

    for i = 1 to n
        pack wi greedily
    remove cycles if present
    remove weak edges till every CC has ≤ 1 weak edge
    remove weak loops till there's ≤ 1 weak loop
    merge non-weak loops with other CC's

After running this algorithm, we have a cycle-free graph in which each connected component has at most one weak edge, and there is at most one weak loop. Moreover, if there is a non-weak loop, then all other connected components have no weak edges. The algorithm runs in linear time, generates a packing within a factor of 3/2 of optimal, and is optimal when b is no less than n. We can also show that 3/2 is a tight bound for this particular algorithm.
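To make the "2-type" constraint concrete, here is a minimal feasibility check in Python for a fractional packing; all names are illustrative and not part of the poster. It verifies that every weight is fully assigned, no bin exceeds its unit capacity, and no bin holds parts of more than 2 types.

    # Sketch only: packing[i][j] is the part of weight i placed in bin j.
    def is_valid_packing(weights, num_bins, packing, eps=1e-9):
        # Every weight must be fully (and nonnegatively) split across bins.
        for i, w in enumerate(weights):
            if any(p < -eps for p in packing[i]):
                return False
            if abs(sum(packing[i]) - w) > eps:
                return False
        for j in range(num_bins):
            parts = [packing[i][j] for i in range(len(weights))]
            if sum(parts) > 1 + eps:  # bin size is normalized to 1
                return False
            if sum(1 for p in parts if p > eps) > 2:  # the "2-type" constraint
                return False
        return True

In the router setting, a "weight" is a processor's memory demand, a "bin" is a memory bank, and the 2-type constraint is exactly the two-port limit.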
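The greedy first phase can be read as: pour the weights into bins in order, splitting a weight when its bin fills, and opening a fresh bin whenever the current bin already touches two types. Under that reading (an assumption; the poster does not spell out the greedy rule), a sketch is:

    def greedy_pack(weights, num_bins, eps=1e-9):
        """Greedy pass only; the repacking phases (cycle, weak-edge,
        and weak-loop removal) are not shown here."""
        packing = [[0.0] * num_bins for _ in weights]
        j, free, types_in_bin = 0, 1.0, 0
        for i, w in enumerate(weights):
            remaining = w
            while remaining > eps:
                # Open a fresh bin if this one is full or already has 2 types.
                if free <= eps or types_in_bin == 2:
                    j, free, types_in_bin = j + 1, 1.0, 0
                if j >= num_bins:
                    raise ValueError("greedy pass ran out of bins")
                placed = min(remaining, free)
                packing[i][j] += placed
                remaining -= placed
                free -= placed
                types_in_bin += 1
        return packing

For example, greedy_pack([0.8, 0.8], 2) fills the first bin with 0.8 of the first weight plus 0.2 of the second, puts the remaining 0.6 in the second bin, and both bins satisfy the 2-type constraint.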
Dynamic allocation algorithm

So far we have dealt only with approximation and exact algorithms for static memory allocation. On the more practical side, how can we get good dynamic memory allocation algorithms that maintain overall efficiency? Upon each new memory request, allocate or deallocate, we are now allowed to repack previously assigned memory units in addition to handling the new request. This creates a tradeoff between memory utilization and the cost of repacking, or compaction. To capture and analyze this tradeoff, we define the compaction ratio of an online memory allocation algorithm A as

    rA = maxt { Mt / Wt },

where Mt is the total number of memory units moved up to time t and Wt is the total number of memory units allocated up to time t.

Depending on the requirement placed on the compaction ratio, there are several cases to consider:

Case 1: rA can be arbitrarily large. This happens when repacking is assumed to have negligible cost and we have unlimited computing power: we simply solve the offline allocation problem optimally every time a new memory request comes in, repacking whenever needed. In practice this is only an ideal case. Repacking almost certainly has a cost, because it performs memory operations, and the computation cost is also high and cannot be neglected.

Case 2: rA is bounded from above, in particular rA ≤ c where c is a constant. This is the most interesting case, and we have a 3/2-competitive algorithm whose compaction ratio is bounded above by 1.

Case 3: rA = 0, i.e., no compaction is allowed. In this case, no online algorithm can have approximation ratio better than 2, and we have a simple online algorithm that achieves ratio 2.
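The compaction ratio is simple running bookkeeping, as the following Python sketch shows (the class and method names are illustrative). An algorithm for Case 2 with rA ≤ 1, for instance, never moves more memory units in total than it has allocated in total.

    class CompactionTracker:
        """Tracks rA = max over t of Mt / Wt for an online allocator."""
        def __init__(self):
            self.moved = 0        # Mt: units moved by repacking so far
            self.allocated = 0    # Wt: units allocated so far
            self.ratio = 0.0      # running maximum of Mt / Wt

        def on_allocate(self, units):
            self.allocated += units
            self._update()

        def on_repack(self, units_moved):
            self.moved += units_moved
            self._update()

        def _update(self):
            if self.allocated > 0:
                self.ratio = max(self.ratio, self.moved / self.allocated)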
Conclusion

The most important lesson is that it is possible to share memory across parallel stages in an almost perfect manner (regardless of individual demands) if we use two-port instead of one-port memories, each of which can be assigned to a stage using some form of partial crossbar switch. In practice, one would simply choose the parameters so that the number of memories is larger than the number of processor stages; in that case, the approximation algorithm we presented provides 100% efficiency. In essence, we finesse a difficult problem (allocating across one-port memories) by changing the model. The new models are practical: we know of at least one implementation of one of our models that scales to multiple OC-768 speeds.

On the theoretical front, we could consider the general case of packing in which each bin contains parts of at most r types, for some fixed integer r. In this case, we would consider the associated hypergraphs of a packing instead of just associated graphs.