Wormhole RTR FPGAs with Distributed Configuration Decompression
CSE-670 Final Project Presentation
A Joint Project Presentation by: Ali Mustafa Zaidi and Mustafa Imran Ali

Introduction
"An FPGA configuration architecture supporting distributed control and fast context switching."
Aim of the joint project:
- Explore the potential for dynamically reconfiguring FPGAs by adapting the Wormhole RTR (WRTR) approach.
- Enable high-speed reconfiguration using optimized logic blocks and distributed configuration decompression techniques.
- The study focuses on datapath-oriented FPGAs.

Project Methodology
- In-depth study of issues.
- Definition of basic architecture models.
- Definition of area models (for estimating relative overhead w.r.t. other RTR schemes).
- Design of reconfigurable systems around the designed architecture models.
- Identification of FPGA resource allocation methods, i.e. ways of distributing FPGA area between multiple applications/hosts (at runtime, compile time, or system design time).
- Selection of benchmarks for testing and simulation of the various approaches.
- Experimentation with WRTR systems and a corresponding PRTR system without distributed configuration (i.e. a single host/application) to compare baseline performance; evaluate resource utilization, normalized reconfiguration overhead, etc.
- Experimentation with all systems with distributed configurations (i.e. multiple hosts/applications); the PRTR system's configuration port is time-multiplexed between the applications; evaluate resource utilization, normalized reconfiguration overhead, etc.

Configuration Issues with Ultra-high Density FPGAs
- FPGA densities are rising dramatically with process technology improvements.
- Configuration time for the serial method is becoming prohibitively large.
- FPGAs are increasingly used as compute engines, implementing data-intensive portions of applications directly in hardware.
- Without an efficient method for dynamically reconfiguring large FPGAs, the available resources will be underutilized.
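As a back-of-the-envelope illustration of the serial-configuration bottleneck, the sketch below estimates the time to load a full bitstream through a 1-bit serial port. The bitstream sizes and the 50 MHz configuration clock are illustrative assumptions, not figures from the presentation.

```python
# Rough estimate of serial configuration time: one bit per clock on a
# single configuration pin.  All numeric values here are assumptions.
def serial_config_time_ms(bitstream_bits: int, config_clock_hz: float) -> float:
    """Time (in ms) to load a full bitstream through a 1-bit serial port."""
    return bitstream_bits / config_clock_hz * 1000.0

# Hypothetical device generations: bitstream size roughly tracks density.
for name, bits in [("small", 1_000_000), ("medium", 10_000_000),
                   ("large", 100_000_000)]:
    t = serial_config_time_ms(bits, 50e6)  # assumed 50 MHz config clock
    print(f"{name}: {t:.1f} ms")
```

Even under these optimistic assumptions, a hundred-megabit bitstream takes on the order of seconds to load serially, which is what motivates parallel, distributed configuration.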
Scalability Issues with Multi-context RTR
Concept: while one plane is operating, configure the other planes serially; latency is hidden by overlapping configuration with computation.
- As FPGA size, and thus configuration time, grows, the multi-context approach becomes less effective at hiding latency.
- Configuring more bits in parallel for each context is only a stopgap solution: only so many pins can be dedicated to configuration.
Overheads in the multi-context approach:
- The number of SRAM cells used for configuration grows linearly with the number of contexts.
- Multiplexing circuitry associated with each configurable unit.
- Global, low-skew context-select wires.

Scalability Issues with Partial RTR
Concept: the configuration memory is addressable like standard random-access memory.
Overheads in the PRTR approach:
- Long global cell-select and data buses, plus vertical and horizontal decoding circuitry, represent a centralized control resource: area overhead, and issues with single-cycle signal transmission as wires grow relative to logic.
- The memory can be accessed only sequentially by different user applications (one application at a time), with potential for hardware underutilization as FPGA density increases.
- One solution could be to design the RAM as a multi-ported memory, but RAM area increases quadratically with the number of ports, and only so many dedicated configuration ports can be provided. Not a long-term scalable solution.

What is Wormhole RTR?
WRTR is a method for reconfiguring a configurable device in an entirely distributed fashion: routing and configuration are handled at the local instead of the global level.

Advertised Benefits of WRTR
- WRTR is a distributed paradigm: it allows different parts of the same resource to be configured independently and simultaneously, dramatically increasing the configuration bandwidth of a device.
- Lack of a centralized controller means fewer single-point failures that can lead to total system failure (e.g. a broken configuration pin), and increased resilience: routing around faults may improve chip yields.
- Distributed control provides scalability and eliminates the configuration bottleneck.

Origins of Wormhole RTR
- The concept was developed in the late 90s at Virginia Tech, intended as a method of rapidly creating and modifying 'custom computational pathways' using a distributed control scheme.
- Essence of the WRTR concept: independent, self-steering streams. Streams carry both programming information and operand data, and interact with the architecture to perform computation. (See diagram.)
- Programming information configures both the pathway of the stream through the system and the operations performed by the computational elements along the path.
- Heterogeneous architectures are supported by these streams.
- Runtime determination of a stream's path is possible, allowing allocation of resources as they become available.

Adapting WRTR for Conventional FPGAs
Our aims:
- Achieve fast, parallel reconfiguration with minimum area overhead and minimum constraints imposed on the underlying FPGA architecture.
- The configuration architecture is completely decoupled from the FPGA architecture.
- The WRTR model is used as inspiration for developing a new paradigm for dynamic reconfiguration; it is not necessary to follow the WRTR method to the letter.

Issues Associated with Using WRTR for Conventional FPGAs
- The original WRTR was intended for coarse-grained dataflow architectures with localized communications; thus operand data was appended to the streams immediately after the programming header.
- In conventional FPGAs, dataflow patterns are unrelated to configuration flow, and there is no restriction localizing communications. Therefore wormhole routing is used only for configuration (it cannot be used for data).
- The original model was intended to establish linear pipelines through the system.
This makes run-time direction determination feasible. For conventional FPGAs, however, the implemented functions have arbitrary structures, so the configuration stream cannot change direction arbitrarily (i.e. directions are fixed at compile time).
- Due to the need for a large number of configuration ports, I/O ports must be shared/multiplexed; active circuits may therefore need to be stalled to load configurations. This should not be a severe issue for high-performance computing oriented tasks.
- WRTR should impose minimum constraints on the underlying FPGA architecture: the constraints applicable in an FPGA with WRTR are the same as those for any PRTR architecture.

A Possible System Architecture for a WRTR FPGA
- Many configuration/I/O ports, divided between multiple host processors. (See diagram.)
- Internally, the FPGA is divided into partitions, usable by each of the hosts.
- Partition boundaries may be determined at system design time, or at runtime based on the requirements of each host at any given time.

The Various WRTR Models Derived
Our aim was to devise a high-speed, distributed configuration model with all the benefits of the original WRTR concept, but with minimum overhead. To this end, three models have been devised:
- Basic: full internal routing.
- Second: perimeter-only routing.
- Third: packetized (parallel) configuration streams, with no internal routing.

Basic WRTR Model: Internal Routing
- Each configurable block or "tile" is accompanied by a simple configuration stream router. (See diagram.)
- Overhead scales linearly with FPGA size.
- Expected issues with this model: complicated routers, arbitration and prioritization overhead, potential deadlock conditions, etc. May be restricted to coarser-grained designs.
- Without data routing, do we really need internal routing?

Second: WRTR with Perimeter-only Routing
The primary requirement for achieving parallel configuration is multiple input ports. So why not restrict routing to the chip boundary?
(See diagram.) Overhead scaling is improved (similar to the PRTR model).
Highlights:
- Internal routing is not a mandatory requirement.
- Finer configuration granularity is achievable.
- Significantly lower overheads as FPGA sizes grow (ratio of perimeter to area).
Issues:
- Longer time to reach inner parts of the FPGA as its size grows.
- Reduced configuration parallelism because of potentially greater arbitration delays at the boundary routers.

Third: Packet-based Distribution of Configuration
One solution to the increased boundary-arbitration issues: use packets instead of streams. (See diagram.)
- A single configuration from each application is generated as a stream (a worm), as in the previous models.
- Before entering the device, configuration packets from the different streams are grouped according to their target rows.
- Benefit: no need at all for routers in the fabric itself.
- Drawbacks: increases overhead on the host system, and implies either a centralized external controller or a limited crossbar interconnect within the FPGA. Parallel reconfiguration is still possible, but with limited multitasking.
- This model may be suitable for embedded systems with low configuration parallelism but high resource requirements.

Basic Area Model
A baseline model is required to identify the overheads associated with each RTR model.
Basic building block: (see diagram)
- A basic array of SRAM cells, configured by an arbitrary number of scan chains.
Assumptions for a fair comparison of overhead:
- Each RTR model studied has exactly the same amount of logic resources to configure (rows and columns).
- Each model can be configured at exactly the same granularity.
- The given array of logic resources (see diagram) has an area equivalent to a serially configurable FPGA.
AREA of Basic Model = (A * B) * (x * y)

The PRTR FPGA Area Model
(See diagram.) Configuration granularity is decided by A and B.
AREA = Area of Basic Model + Overheads, where
Overheads = Area of 'log2(x)-to-x' row-select decoder
          + Area of 'log2(y)-to-y' n-bit column de-multiplexer
          + Area of one n-bit bus * y

The Basic WRTR FPGA Area Model
(See diagram.)
AREA = Area of Basic Model + Overheads, where
Overheads = Area of one n-bit bus * 2x
          + Area of one n-bit bus * 2y
          + Area of one 4-D router block * (x * y)

The Perimeter-Routing WRTR FPGA Area Model
(See diagram.)
AREA = Area of Basic Model + Overheads, where
Overheads = Area of one n-bit bus * 2x
          + Area of one n-bit bus * 2y
          + Area of one 3-D router block * (2(x + y) - 1)
This model can also be made one-dimensional to further reduce overheads (other constraints will apply).

The Packet-based WRTR FPGA Area Model
(See diagram.)
AREA = Area of Basic Model + Overheads, where
Overheads = Area of one n-bit bus * 2x
          + Area of one n-bit bus * 2y
Additional overheads may appear in the host system.
This model can also be made one-dimensional to further reduce overheads (other constraints will apply).

Parameters Defined and Their Impact
The number of buses (x and y):
- The number of buses varies with reconfiguration granularity: for fixed logic capacity, A and B increase with decreasing x and y, i.e. coarser granularity.
- Impact of coarser granularity: reduced overhead, but reduced reconfiguration flexibility and increased reconfiguration time per block. Thus it is better to have finer granularity.
The width of the buses (n bits):
- The smaller the width, the smaller the overhead (for a fixed number of buses), but the longer the reconfiguration times.
- It is possible to achieve finer granularity without increasing overhead at the cost of bus width (and hence reconfiguration time per block).
Impact of coarse-grained vs. fine-grained configurability: methods of handling hazards in the underlying FPGA fabric.
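The overhead terms of the area models above can be compared numerically. In the sketch below the per-unit area constants (bus segment, decoder output, router block) are purely illustrative assumptions; only the scaling expressions follow the models described above.

```python
# Relative configuration-overhead scaling of the four models.  The constants
# BUS, DEC, R4, R3 are illustrative assumptions, not measured areas.
BUS = 1.0          # area of one n-bit bus segment (assumed)
DEC = 2.0          # decoder/demux area per output line (assumed)
R4, R3 = 8.0, 6.0  # 4-direction and 3-direction router block areas (assumed)

def prtr_overhead(x: int, y: int) -> float:
    # row-select decoder + n-bit column de-multiplexer + one n-bit bus per column
    return DEC * x + DEC * y + BUS * y

def wrtr_basic_overhead(x: int, y: int) -> float:
    # buses along all four sides + one 4-D router per tile
    return BUS * 2 * x + BUS * 2 * y + R4 * x * y

def wrtr_perimeter_overhead(x: int, y: int) -> float:
    # buses along all four sides + 3-D routers on the perimeter only
    return BUS * 2 * x + BUS * 2 * y + R3 * (2 * (x + y) - 1)

def wrtr_packet_overhead(x: int, y: int) -> float:
    # buses only; routing work moves to the host system
    return BUS * 2 * x + BUS * 2 * y

for x in (8, 32, 128):
    print(x, prtr_overhead(x, x), wrtr_basic_overhead(x, x),
          wrtr_perimeter_overhead(x, x), wrtr_packet_overhead(x, x))
```

Whatever constants are chosen, the structure of the formulas shows the point made above: the basic WRTR model's per-tile router term grows as x·y, while the perimeter and packet models grow only as x + y.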
- Coarse-grained configuration places minimum constraints on the FPGA architecture.
- Fine-grained reconfiguration is subject to all the issues associated with partial-RTR systems.

Approaches to Router Design
Active routing mechanism:
- Similar to conventional networks: routing of streams depends on the stream-specified destination as well as network metrics (e.g. congestion, deadlock).
- Hazards and conflicts may be dealt with at run time.
- Significantly more complicated routing logic is required; most likely restricted to very coarse-grained systems.
Passive routing mechanism:
- Routing of streams depends only on the stream-specified direction.
- Hazards and conflicts are avoided by compile-time optimization.
We have selected the passive routing mechanism for our WRTR models.

Passive Router Details
- Must be able to handle streams from four different directions.
- Includes mechanisms for stalling streams in case of conflict: streams from different directions are stalled only if there is a conflict in the outgoing direction; back-pressure is detected and applied.
- The routing circuitry for one port is defined (see diagram); for a 4-D router this design is replicated four times, for a 3-D router three times.

Utilizing Variable-Length Configuration Mechanisms: Support Hardware and Logic Block Issues

Configuration Overhead
- Configuration data is huge for large FPGAs and has to be reduced in order to have fast context switching.
- Initial pointers in this direction: variable-length configurations, and a default configuration on power-up.

Variable-Length Configurations
Ideally:
- Change only the minimum number of bits for a new configuration.
- Use short configurations for frequently used configurations.
- Start from a default configuration and change the minimum number of bits to reach the new state.
Hurdles:
- Logic blocks always require full configurations to be specified, so configuration sizes cannot be varied.
- Knowing only what to change requires keeping track of what was configured before: a difficult issue when multiple dynamic applications are switching.
- A default power-up configuration can hardly be useful for all application cases.

Configuration Overhead Minimization
How to do it? Remove redundancy in the configuration data, i.e. "compact" the contents of the configuration stream. This minimizes the information that must be conveyed during configuration or reconfiguration.
Configuration compression: apply some form of compression to the configuration data stream.

Configuration Decompression Approaches
- Centralized approach: decompress the configuration stream at the boundary of the FPGA.
- Distributed approach (a new paradigm): decompress the stream at the boundary of the logic blocks or logic clusters.

Centralized Approach: Advantages
- Requires hardware only at the boundary of the device, where the configuration data enters.
- Significant reduction in configuration size can be achieved; run-length coding and Lempel-Ziv based compression have been used.
- Examples: the Atmel 6000 series; Zhiyuan Li and Scott Hauck, "Configuration Compression for Virtex FPGAs", IEEE Symposium on FPGAs for Custom Computing Machines, 2001.

Centralized Approach: Limitations
- More efficient variable-length coding is not easy to use because of the large number of symbol possibilities.
- It is difficult to quantify symbols in the configuration stream of heterogeneous devices, which can have different types of blocks.

Decentralized Approach: Advantages
- Decompressing at the logic-block boundary enables configurations to be easily symbolized, and hence variable-length coding (VLC) to be used. In other words, we know exactly what we are coding, so Huffman-like codes can be used based on the frequency of configuration occurrences.
- Also has advantages specific to Wormhole RTR (discussed next).

Decentralized Approach: Limitations
- The decompression hardware has to be replicated.
- Optimality issue: over how much programmable logic area should the decompression hardware be amortized?
In other words, the granularity of the logic area should be determined for an optimal cost/benefit ratio.

Suitability of the Decentralized Approach to WRTR
- If worms are decompressed at the device boundary, long internal worms result, leading to greater arbitration problems and worm blockages.
- The decentralized approach thus favors shorter worms, allowing parallel worms to traverse the fabric with fewer blockages.

Variable-Length Configuration: Overall Idea
Frequently used configurations of a logic block should have small codes; variable-length coding such as Huffman coding can be adapted.

Configuration Frequency Analysis
How to decide upon the frequencies?
- Hardwired: fixed by the designer through benchmark analysis.
- Generic: determined by the software generating the configuration stream.
Hardwired determination will be inferior: little benefit is gained because applications vary. Software that generates the configurations can optimally identify a given number of frequently used configurations for the set of applications to be executed, so code determination should be done by the configuration-generating software for optimal codes.

Decoding Hardware Approaches
- Huffman-coding the configurations themselves: a hardwired Huffman decoder, or an adaptive decoder (the code table can be changed).
- Using a table of frequently used configurations with an address decoder.
- Huffman-coding the table addresses: static or adaptive coding.

Decoding Hardware Features
- Static Huffman decoders coding full configurations require a very wide decoder.
- Using an address decoder only: reduced hardware, but less compression (fixed-size codes).
- Coding the decoder inputs (table addresses): requires a relatively smaller Huffman decoder.

Some Points to Note
- The decompression approach is decoupled from any specific logic-block architecture, though certain logic blocks will favor more compression (discussed later).
- Not every possible configuration will be coded.
- In particular, random-logic portions will require all their bits to be transmitted; a special code will prefix a random-logic configuration to identify it for separate handling.

Logic Block Selection: High-Level Issues
- Should be datapath oriented, with efficient support for random-logic implementation.
- High functionality with minimum configuration bits, to support dense implementation while reducing reconfiguration overhead.
- Well-defined datapath functionality (configuration), to aid in quantifying the frequently-used-configurations idea.
Candidates: a chosen high-functionality logic block, and a low-functionality logic block.

Logic Block Considerations
- High-functionality blocks are good for datapath implementations; low-functionality blocks give less dense datapath implementations.
- Low-functionality blocks have fewer configuration bits per block, and vice versa. The cost of the frequently-used-configurations memory depends on the size of the configurations stored, so the decoder hardware overhead will be smaller for low-functionality blocks.
- What about the configuration time overhead?
- Less dense functionality means more blocks to configure, which leads to longer configuration streams.
- Assumption: random logic does not benefit from one block type or the other.
- Consequence: compared with low-functionality blocks, datapath-oriented designs will require fewer high-functionality blocks to configure and achieve higher compression, but at larger overhead for random-logic implementations.

Logic Block Issue: Conclusions
Proper logic-block selection for a particular application affects decoder hardware size and configuration compression ratio. Since random logic is not compressed, using high-functionality blocks for less datapath-oriented applications will result in:
- high decoder overhead that goes unutilized;
- less compression and longer configuration streams.

Huffman Decoder Hardware
- Basic Huffman decoding hardware is sequential in nature: variable input rate, variable output rate.
- Sequential decoding is not suitable for WRTR: the worm would have to be stalled, negating the benefit of fast reconfiguration.
- The hardware should be able to process N bits at a time, where N = bus width. This requires a constant-input-rate architecture with a variable number of codes processed per cycle.

Constant-Input-Rate PLA-based Architecture
- Input rate: K bits/cycle; a PLA undertakes the table-lookup process.
- The input bits determine one unique path along the Huffman tree.
- The next state is fed back to the input of the PLA to indicate the final residing state.
- An indicator gives the number of symbols decoded in the cycle.
(The constant-input-rate PLA-based architecture for the VLC decoder.)
Ref: Shih-Fu Chang and David G. Messerschmitt, "VLSI Designs for High Speed Huffman Decoders".

Decoder Model (FSM)
- FSM implementation: ROM or PLA; a PLA gives lower complexity and higher speed.
- Implementation results: the area is a function of the number of inputs and outputs, along with the input rate.

Hardware Area Estimates
- The number of inputs and outputs depends on the maximum code length and the symbol sizes.
- Typically, for a 16-entry table: code sizes range from 1 to 8 bits, and the symbol size equals the decoder input width, i.e. 4.

Handling Multiple Symbol Outputs
- More than one code may be decoded per cycle, with a maximum of N codes per cycle.
- To take full advantage of parallel decoding, multiple configuration chains can be employed: a counter cycles between the chains and outputs one configuration per chain, with a maximum of N chains.

Decoder Hardware
For parallel decoding of N codes with M-bit decoded symbols:
- a Huffman decoder (discussed before);
- N M-bit decoders;
- an N-port ROM table;
- N parallel configuration chains.

Concluding Points
- The hardware overhead of the Huffman decoding mechanism discussed has to be incorporated into the area model presented earlier.
- Empirical determination of reconfiguration speed-up versus area overhead remains to be done for a sampling of benchmarks.
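The frequency-based coding and constant-input-rate decoding ideas above can be sketched in software. This is a behavioral analogue only, not the PLA hardware: the configuration symbols and their frequencies are hypothetical, and the chunked decode loop mimics the "fixed input bits per cycle, state fed back" behavior of the constant-input-rate architecture.

```python
# Software analogue of frequency-based configuration coding and
# constant-input-rate decoding.  Symbols and frequencies are hypothetical.
import heapq

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from {symbol: frequency}."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tick = len(heap)  # tiebreaker so tuples never compare the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

def decode_stream(bits, code, chunk=4):
    """Consume a fixed number of input bits per step (the 'bus width'),
    carrying the tree-walk state across steps, so 0..chunk symbols are
    emitted per step -- the constant-input-rate decoding behavior."""
    inverse = {b: s for s, b in code.items()}
    symbols, state = [], ""
    for i in range(0, len(bits), chunk):
        for bit in bits[i:i + chunk]:   # one "cycle" of chunk input bits
            state += bit
            if state in inverse:        # a leaf of the Huffman tree reached
                symbols.append(inverse[state])
                state = ""              # state fed back as the root
    return symbols

# Hypothetical frequently used configurations of a datapath block.
freqs = {"ADD": 40, "PASS": 30, "SUB": 20, "MUL": 10}
code = huffman_code(freqs)
msg = ["ADD", "PASS", "ADD", "MUL"]
bits = "".join(code[s] for s in msg)
assert decode_stream(bits, code) == msg
```

As the slides argue, the most frequent configuration gets the shortest code (here "ADD" gets a 1-bit code), so streams dominated by common datapath configurations shrink, while rarely used configurations cost more bits.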