Aristotle University of Thessaloniki

Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support

Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis
Section of Electronics and Computers, Department of Physics,
Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
E-mail: [email protected]

Outline
- Introduction
- Target Architecture Overview
- Partial Predicated Execution Enhancement
- Virtual Opcode Enhancement
- Development Framework
- Experimental Results
- Conclusions

Introduction
- Characteristics of modern embedded applications
  - Diversity of algorithms
  - Rapid evolution of standards
  - High performance demands
- To amortize cost over high production volumes embedded systems must:
  - Exhibit high levels of flexibility => fast Time-to-Market
  - Exhibit high levels of adaptability => increased reusability
- An appealing option => couple a reconfigurable hardware (RH) to a typical processor
  - Processor => bulk of the flexibility
  - RH => adaptation to the target application
- Support by a development framework that hides RH related issues
  - Maintain flexibility
  - Continue to target software-oriented group of users

Target Architecture
- Reconfigurable Instruction Set Processor (RISP)
- Core processor
  - 32-bit single issue RISC
  - 5 pipeline stages
- Reconfigurable Functional Unit (RFU)
  - 1-D array of coarse-grain processing elements (PEs)
- An interface that tightly couples the RFU to the core
  - Explicit communication
[Figure: block diagram of the core pipeline (register file, ALU, multiplier, shifter, data memory, control logic, pipeline registers) and the RFU (core/RFU interface, configuration layer, processing and interconnect layers). Interface signals: Re, opcode, operands, status signals, configuration bits, 1st stage result, 2nd stage result.]

Target Architecture - ISA
- 32-Bit Instruction Word Format: Re | OpCode | Source 1 | Source 2 | Destination | Source 3 | Source 4
- Re=‘0’ => Standard Instruction Set
  - Flexibility to execute any program
- Re=‘1’ =>
  Reconfigurable Instruction Set Extensions
  - Offers the adaptation to the target application
- Three types of Reconfigurable Instructions
  - Complex computational operations
  - Complex addressing modes
  - Complex control flow operations

Target Architecture - RFU
- 1-D Array of coarse-grain PEs
- Executes Reconfigurable Instructions
  - Multiple-Input-Single-Output (MISO) clusters of primitive operations
- Un-registered output
  - Chain of operations in the same clock cycle
- Registered output
  - Chain of pipelined operations
- Floating PEs => Can operate in both core pipeline stages on demand
  - Better utilization of the available hardware
[Figure: RFU structure. The input network feeds 1st stage operands, 2nd stage operands and constants to the PEs; the output network delivers the 1st stage result and the 2nd stage result; a feedback network connects PE results back to the inputs. PE basic structure: operand select for Operand1 and Operand2, function select, spatial-temporal select, output register, PE result.]

Target Architecture – Configuration
- Local configuration memory
  - Multi-context
  - No overhead to select a context
- Array of coarse-grain PEs => Small number of configuration bits per instruction
[Figure: an external configuration memory feeds the configuration controller, which loads the local storage (Configuration 0 to Configuration 3) and supplies the configuration bits.]

Target Architecture – Synthesis Results
- A hardware model (VHDL) was designed

  Configuration                         Value
  Granularity                           32-bits (16x16 Multiplier)
  Number of Processing Elements         8
  Processing Elements Functionality     ALU, Shifter, Multiplier
  Configuration Contexts                16 words of 134 bits
  Local Memory Size                     8 constants of 32-bits
  Number of Provided Local Operands     4

- Synthesis results with STM 0.13um

  Component                     Area (mm2)
  Processor Core                0.134
  RFU Processing Layer          0.186
  RFU Interconnection Layer     0.125
  RFU Configuration Layer       0.137
  RFU Total                     0.448

- Reasonable area overhead
- No overhead to core critical path

Enhancement with Partial Predicated Execution
- Predication
  - Eliminate branches from an
    instruction stream
  - Conditional execution of an instruction
  - Utilized to expose Instruction Level Parallelism
- Our approach => partial predicated execution to eliminate the branch in an “if-then-else” statement
    if a<0 then g=b+c; else g=d-f;
- Large clusters of operations => increased performance
[Figure: data-flow graph of the statement above. A CMP node compares a with 0, b+c and d-f are both computed, and a MUX (SELECT instruction) driven by the comparison result produces g.]

Support of Partial Predicated Execution
- The available output network can be utilized
- Extensions
  - Two configuration bits
  - Two multiplexers
  - Hardwired connections to PEs
- Selection of the RFU output
  - Controlled by configuration bits => no predication
  - Controlled by comparison result => predicated execution
- Comparison => implemented in a PE
[Figure: output network extension. The results of the 1st to the n-th PE enter the output network, which produces the 1st stage result and the 2nd stage result. For each of the two outputs, a MUX chooses between the output configuration bits and the CMP result, under control of a select bit.]

Enhancement with Virtual Opcode
- Explicit communication between Core and RFU
  - Opcode explosion problem
- Proposed solution => “Virtual” opcode
  - Virtual opcode = Natural opcode + code region
- Overhead => Configuration memory size
  - Coarse grain => Small configuration size => 136 bits per instr.
- In general, virtual opcode can be performed by flushing and reloading the whole local memory
  - Large performance overhead
  - Applicable for different applications

Support of Virtual Opcode
- Local Configuration memory => extended with extra level of contexts
  - First level = K contexts of locally available reconfigurable instructions
  - Second level = L copies of the first level for different code regions
- For each code region only one of L contexts is active
- The same natural opcode in a different region context forms a virtual opcode
- Partitioning of regions and issue of activation performed by the compiler
- One cycle overhead to activate a context
- Configuration memory size = K*L*Conf. Bits per Instr.
[Figure: configuration memory organized as Context 1 to Context L, each holding Instruction 1 to Instruction K. "Set Active Context" drives the context select, the opcode selects the instruction, and the selected entry supplies the config bits.]
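The two-level organization above can be sketched in a few lines of C. This is only an illustrative model, not the hardware described in the slides: the type ConfigMemory, the cm_* function names and the context-major word layout (context * K + opcode) are assumptions introduced here. The slides only state that one of the L contexts is active per region, that activation costs one cycle, and that the memory size is K*L*Conf. Bits per Instr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two-level local configuration memory.
 * k = reconfigurable instructions per context (first level),
 * l = number of contexts, one per code region (second level). */
typedef struct {
    size_t k;                    /* instructions per context */
    size_t l;                    /* number of contexts */
    size_t active;               /* currently active context, 0 .. l-1 */
    unsigned long switch_cycles; /* cycles spent activating contexts */
} ConfigMemory;

void cm_init(ConfigMemory *cm, size_t k, size_t l)
{
    assert(k > 0 && l > 0);
    cm->k = k;
    cm->l = l;
    cm->active = 0;
    cm->switch_cycles = 0;
}

/* Activating a context costs one cycle, as stated in the slides. */
void cm_set_active_context(ConfigMemory *cm, size_t context)
{
    assert(context < cm->l);
    cm->active = context;
    cm->switch_cycles += 1;
}

/* Resolve a natural opcode to a configuration word index.
 * The pair (active context, natural opcode) plays the role of the
 * virtual opcode; the context-major layout is an assumption. */
size_t cm_word_index(const ConfigMemory *cm, size_t natural_opcode)
{
    assert(natural_opcode < cm->k);
    return cm->active * cm->k + natural_opcode;
}

/* Configuration memory size in bits = K * L * bits per instruction. */
size_t cm_size_bits(const ConfigMemory *cm, size_t bits_per_instr)
{
    return cm->k * cm->l * bits_per_instr;
}
```

For example, with K = 4, L = 8 and 136 configuration bits per instruction, cm_size_bits gives 4 * 8 * 136 = 4352 bits, i.e. 544 bytes. After cm_set_active_context(&cm, 2), natural opcode 3 resolves to word 2 * 4 + 3 = 11, whereas in context 0 the same natural opcode resolves to word 3.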
Development Framework
- Automated framework for the development of C/C++ applications on the architecture
- Transparent incorporation of the reconfigurable instruction set extensions
- Based on the SUIF/MachSUIF compiler infrastructure
[Figure: tool flow. Front-End (MachSUIF) produces an optimized IR in CDFG form. Profiling: instrumentation, m2c, basic block profiling results. Instr. Gen.: pattern generation and mapping, guided by user defined parameters. Instruction Selection produces statistics and the instruction extensions. The Back-End produces the executable code.]

Dev. Framework – Front End / Profiling
- Application source code translated into CDFG (SUIFvm operations)
- Perform machine independent optimizations
  - If-conversion for partial predicated execution can be applied
- CDFG instrumented with profiling annotations, translated to equivalent C code, compiled and executed on the host
- Profiling information is collected
  - Regions execution frequency
    ...
    dfg1++; //profiling code
    vr1=a+b;
    vr2=c+d;
    e=vr1+vr2;
    ...
[Figure: application source code translated into data-flow graphs DFG #1 and DFG #2.]

Dev. Framework – Instruction Generation
- First step = Pattern Generation
  - In-house tool for the identification of MISO clusters of operations, based on the MaxMISO algorithm
- Second step = Mapping of MISO in the RFU
  1. Place the SUIFvm nodes in PEs / Route the 1-D array
  2. Analyze paths and set the output of a PE (reg./unreg.) to minimize delay
  3. Report candidate instruction semantics
[Figure: two candidate patterns with register and constant inputs (Candidate1: NEG, SHIFT, ADD; Candidate2: SUB, NEG, SHIFT), and Candidate2 mapped onto PE1, PE2 and PE3.]
    Candidate2 src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4
    {
      region: func1 – dfg1
      PE1: sub, output: reg
      PE2: neg, output: un-reg
      ………………………………………
      edg1: in1-PE1, in2-PE1………….
      ……………………………………….
      latency: 1 cycle
      type: comp
      static gain: 2
    }

Dev.
Framework – Instruction Selection (1/2)
- No Virtual opcode
  - Consider the whole application space
  - Perform pair-wise graph isomorphism to identify identical candidate instructions
  - Calculate dynamic gain offered by each candidate
    - Dynamic = Static x Frequency
  - Rank candidate instructions based on dynamic gain
  - Select best L instructions
    - L defined by the number of supported instructions

Dev. Framework – Instruction Selection (2/2)
- With Virtual opcode enabled
  - Partition application code into regions
    - Currently supporting only procedures
  - Perform graph isomorphism per region
  - Calculate dynamic gain offered by each candidate for each region
  - Calculate overhead to set active the region contexts
  - Rank regions and candidate instructions based on dynamic gain
  - Select best K regions and best L instructions from each region
    - L, K defined by the supported contexts and instructions per context

Experimental Results
- Prove the performance improvements offered by the proposed architecture
- Evaluate the efficiency of the enhancements
- A complete MPEG-2 encoding application is used
  - Source code from MediaBench benchmark suite
  - Input data => a video sequence consisting of 12 frames with resolution of 144x176 pixels

Exp. Results – SpeedUp Analysis
- Speedup analysis for the most time-consuming functions of MPEG2 enc.
- Accelerate only critical regions => small overall speedup (Amdahl)
- Our approach accelerates the whole application’s space => overall speedup is preserved

  Function          Instr. Count (10^6) (No RFU)   SpeedUp   SpeedUp (Incremental)
  SAD               589.0                          6.6       1.5
  dist1             1206.0                         3.4       2.3
  fullsearch        73.5                           2.0       2.5
  bdist1            18.0                           2.0       2.5
  putbits           16.3                           2.3       2.6
  fdct              15.6                           2.3       2.6
  quant             13.1                           2.6       2.7
  idctcol           11.4                           2.4       2.7
  dct               10.4                           2.3       2.7
  pred_comp         10.1                           1.9       2.7
  iquant            9.9                            1.8       2.8
  add_pred          8.0                            2.0       2.8
  bdist2            7.3                            1.8       2.8
  idctrow           7.0                            2.2       2.8
  putnonintrablk    6.9                            1.8       2.8
  sub_pred          6.6                            1.8       2.9
  Overall           1448.7                         2.9       2.9

Exp. Results – Evaluation of predication
               SAD Speedup   Overall Speedup
  No predic.   1.7           1.7
  Predic.      6.6           2.9

[Figure: Instruction 1, Instruction 2 and Instructions 3a & 3b, drawn as data-flow graphs over register and constant inputs with add, subtract, shift, CMP and MUX nodes.]
- Example of four instructions derived using if-conversion and partial predicated execution
- These instructions implement the SAD function
- Significant performance improvements are offered

Exp. Results – Evaluation of Virtual Opcode
[Figure: SpeedUp (1.5 to 3.1) versus instructions per context (4, 8, 16, 32, 64), with one curve per number of contexts (16, 12, 8, 4, 2) and one for the unified organization.]

  Memory Organization (inst. x cont.)   Speedup   Memory Size (KB)
  4x8                                   1.7       0.5
  8x8                                   2.0       1.1
  16x12                                 2.8       3.2
  32x12                                 3.0       8.7
  Unconstr.                             3.1       -

- Virtual opcode can be used to preserve speedups for architectures with limited opcode space
- Reasonable overhead for the local configuration memory size
- Finer partitioning of regions could lead to more impressive results

Conclusions
- Two enhancements to a previously proposed RISP architecture have been presented
  - Partial predicated execution => increases performance
  - Virtual opcode => relaxes opcode space pressure
- An automated development framework has been presented
  - Hides the reconfigurable hardware from the user
  - Supports the two enhancements
- The efficiency of the RISP and the enhancements has been demonstrated using an MPEG2 encoding application
- Future research
  - Support full predication for further performance improvements
  - Support finer partitioning of regions for better utilization of virtual opcode

Thank you! Questions?