
Aristotle University of Thessaloniki
Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support
Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
E-mail: [email protected]
Outline
 Introduction
 Target Architecture Overview
 Partial Predicated Execution Enhancement
 Virtual Opcode Enhancement
 Development Framework
 Experimental Results
 Conclusions
Introduction
 Characteristics of modern embedded applications
 Diversity of algorithms
 Rapid evolution of standards
 High performance demands
 To amortize cost over high production volumes, embedded systems must:
 Exhibit high levels of flexibility => fast Time-to-Market
 Exhibit high levels of adaptability => increased reusability
 An appealing option => couple reconfigurable hardware (RH) to a typical processor
 Processor => bulk of the flexibility
 RH => adaptation to the target application
 Supported by a development framework that hides RH-related issues
 Maintain flexibility
 Continue to target a software-oriented group of users
Target Architecture
 Reconfigurable Instruction Set Processor (RISP)
 Core processor
 32-bit single issue RISC
 5 pipeline stages
 Reconfigurable Functional Unit (RFU)
 1-D array of coarse-grain processing elements (PEs)
 An interface that tightly couples the RFU to the core
 Explicit communication
[Figure: block diagram of the RISP – the core pipeline (register file, ALU, multiplier, shifter, data memory, write back, pipeline registers) coupled to the RFU (processing & interconnect layers plus a configuration layer) through the core/RFU interface, which carries the Re opcode, operands, control and status signals, and configuration bits]
Target Architecture - ISA
32-bit instruction word format: Re | OpCode | Source 1 | Source 2 | Destination | Source 3 | Source 4
 Re=‘0’ => Standard Instruction Set
 Flexibility to execute any program
 Re=‘1’ => Reconfigurable Instruction Set Extensions
 Offers the adaptation to the target application
 Three types of Reconfigurable Instructions
 Complex computational operations
 Complex addressing modes
 Complex control flow operations
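As an illustration of the instruction word above, a minimal C decode sketch follows. Only the field names and the Re bit come from the slide; the bit positions and widths are assumptions chosen so that the fields fill the 32-bit word.

/* Hypothetical field layout: 1-bit Re, 6-bit opcode, five 5-bit register
   specifiers. The real encoding is not given on the slide. */
#include <stdint.h>

typedef struct {
    unsigned re;       /* 0 = standard instruction set, 1 = reconfigurable ext. */
    unsigned opcode;   /* natural opcode                                        */
    unsigned src1, src2, dst, src3, src4;   /* register specifiers              */
} InstrWord;

static InstrWord decode(uint32_t w)
{
    InstrWord i;
    i.re     = (w >> 31) & 0x1;
    i.opcode = (w >> 25) & 0x3f;
    i.src1   = (w >> 20) & 0x1f;
    i.src2   = (w >> 15) & 0x1f;
    i.dst    = (w >> 10) & 0x1f;
    i.src3   = (w >> 5)  & 0x1f;
    i.src4   =  w        & 0x1f;
    return i;
}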
Target Architecture - RFU
 1-D Array of coarse-grain PEs
 Executes Reconfigurable Instructions
 Multiple-Input-Single-Output (MISO) clusters of primitive operations
 Un-registered output
 Chain of operations in the same clock cycle
 Registered output
 Chain of pipelined operations
 Floating PEs => Can operate in both core pipeline stages on demand
 Better utilization of the available hardware
[Figure: basic PE structure – operands selected from the input network, feedback network, constants, and the 1st/2nd stage operands; the PE result can optionally be registered; function select and spatial-temporal select configuration controls; the output network produces the 1st and 2nd stage results]
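To summarize the per-PE options described above, here is an illustrative C record of a PE configuration; the field names are assumptions, not the actual configuration format.

/* Illustrative sketch of one PE's configuration, capturing the options on
   this slide: the selected function, operand sources, whether the result is
   registered (pipelined chaining) or un-registered (combinational chaining
   in the same cycle), and the core pipeline stage a floating PE serves. */
typedef enum { PE_ALU_ADD, PE_ALU_SUB, PE_SHIFT, PE_MULT } PeFunction;

typedef struct {
    PeFunction function;      /* Function Sel                               */
    unsigned   operand1_sel;  /* input-network selection for operand 1      */
    unsigned   operand2_sel;  /* input-network selection for operand 2      */
    int        registered;    /* 1: registered output, 0: un-registered     */
    int        stage;         /* 1 or 2: pipeline stage for a floating PE   */
} PeConfig;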
Target Architecture – Configuration
 Local configuration memory
 Multi-context
 No overhead to select a context
 Array of coarse-grain PEs => small configuration bit-stream per instruction
[Figure: configuration layer – an external configuration memory loads, through the configuration controller, the local storage of configuration bits, organized as multiple contexts (Configuration 0-3)]
Target Architecture – Synthesis Results
 A hardware model (VHDL) was designed
 Synthesis results with STM 0.13um
 Reasonable area overhead
 No overhead to core critical path

Configuration                          Value
Granularity                            32 bits (16x16 multiplier)
Number of Processing Elements          8
Processing Elements Functionality      ALU, Shifter, Multiplier
Configuration Contexts                 16 words of 134 bits
Local Memory Size                      8 constants of 32 bits
Number of Provided Local Operands      4

Component                      Area (mm2)
Processor Core                 0.134
RFU Processing Layer           0.186
RFU Interconnection Layer      0.125
RFU Configuration Layer        0.137
RFU Total                      0.448
Enhancement with Partial Predicated Execution
 Predication
 Eliminate branches from an instruction stream
 Conditional execution of an instruction
 Utilized to expose Instruction Level Parallelism
 Our approach => partial predicated execution to eliminate the branch in an "if-then-else" statement
 Example: if a<0 then g=b+c; else g=d-f;
[Figure: dataflow graph of the resulting SELECT instruction – a CMP node (a < 0) drives a MUX that chooses between b+c and d-f to produce g]
 Large clusters of operations => increased performance
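A minimal C sketch of the if-conversion applied to this example: both sides of the branch are computed and the comparison result selects the value, which is what the SELECT-style reconfigurable instruction performs in one cluster. Function and variable names are illustrative.

/* Hypothetical sketch of if-conversion for partial predicated execution:
   the branch of the "if-then-else" is replaced by computing both sides and
   selecting the result with the comparison outcome. */
int select_example(int a, int b, int c, int d, int f)
{
    int then_val = b + c;      /* "then" side: g = b + c          */
    int else_val = d - f;      /* "else" side: g = d - f          */
    int pred     = (a < 0);    /* CMP result used as the predicate */
    /* branch-free select, analogous to the MUX in the figure */
    int g = pred ? then_val : else_val;
    return g;
}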
Support of Partial Predicated Execution
 The available output network can be utilized
 Extensions
 Two configuration bits
 Two multiplexers
 Hardwired connections to PEs
 Selection of the RFU output
 Controlled by configuration bits => no predication
 Controlled by comparison result => predicated execution
 Comparison => implemented in a PE
[Figure: extended output network – for each of the 1st and 2nd stage results, a MUX over the PE results is controlled either by the output configuration bits or by the CMP result, as chosen by the added selection bit]
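A behavioral sketch of the extended output selection, with assumed names: an added selection bit decides whether the stage result is picked statically by the output configuration or dynamically by the CMP result computed in a PE.

/* Illustrative model of one stage's output selection. */
unsigned stage_result(const unsigned pe_results[],
                      int sel_bit,          /* added configuration bit        */
                      unsigned output_cfg,  /* static selection (no predic.)  */
                      unsigned cmp_result)  /* comparison computed in a PE    */
{
    /* sel_bit = 0: the configuration picks which PE drives the stage result;
       sel_bit = 1: predicated execution, the CMP result picks it instead.    */
    unsigned control = sel_bit ? cmp_result : output_cfg;
    return pe_results[control];   /* control assumed to be a valid PE index */
}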
Enhancement with Virtual Opcode
 Explicit communication between core and RFU
 Opcode explosion problem
 Proposed solution => "virtual" opcode
 Virtual opcode = natural opcode + code region
 Overhead => configuration memory size
 Coarse grain => small configuration size => 136 bits per instruction
 In general, a virtual opcode could be realized by flushing and reloading the whole local memory
 Large performance overhead
 Applicable across different applications
Support of Virtual Opcode
 Local configuration memory => extended with an extra level of contexts
 First level = K contexts of locally available reconfigurable instructions
 Second level = L copies of the first level for different code regions
 For each code region only one of the L contexts is active
 The same natural opcode in different region contexts forms a virtual opcode
 Partitioning of regions and issuing of the activation performed by the compiler
 One cycle overhead to activate a context
 Configuration memory size = K * L * configuration bits per instruction
[Figure: two-level configuration memory – a "set active context" operation selects one of Context 1..L, and the opcode then indexes Instruction 1..K within the active context to read the configuration bits]
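A minimal sketch, under assumed names and example sizes, of how the two-level context memory resolves a virtual opcode without flushing the local memory; the leading comment also works out the memory-size formula for a 16x12 organization at 136 bits per instruction, which matches the roughly 3.2 KB reported in the evaluation slide.

/* Two-level context memory sketch. With, e.g., K = 16 instructions per
   context, L = 12 region contexts and 136 configuration bits per
   instruction, the memory holds 16 * 12 * 136 = 26112 bits (~3.2 KB).   */
#define K_INSTRUCTIONS 16          /* instructions per context (example)     */
#define L_CONTEXTS     12          /* region contexts kept locally (example) */
#define CFG_WORDS       5          /* 32-bit words covering ~136 cfg bits    */

static unsigned cfg_memory[L_CONTEXTS][K_INSTRUCTIONS][CFG_WORDS];
static unsigned active_context;    /* written by the compiler-inserted
                                      "set active context" operation
                                      (one cycle of overhead)                */

void set_active_context(unsigned region_context)
{
    active_context = region_context;
}

const unsigned *fetch_config(unsigned natural_opcode)
{
    /* virtual opcode = (active context, natural opcode) */
    return cfg_memory[active_context][natural_opcode];
}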
Development Framework
 Automated framework for the development of applications on the architecture
 Transparent incorporation of the reconfigurable instruction set extensions
 Based on the SUIF/MachSUIF compiler infrastructure
[Figure: tool flow – C/C++ source passes through the MachSUIF front-end to an optimized IR in CDFG form; instrumentation, m2c translation, and profiling produce basic-block profiling results; pattern generation and mapping (Instr. Gen.) feed instruction selection, guided by user-defined parameters and statistics; the selected instruction extensions are handed to the back-end, which emits executable code]
Dev. Framework – Front End / Profiling
 Application source code translated into a CDFG (SUIFvm operations)
 Machine-independent optimizations are performed
 If-conversion for partial predicated execution can be applied
 CDFG instrumented with profiling annotations
 Translated to equivalent C code
 Compiled and executed on the host
 Profiling information is collected
 Region execution frequencies
[Figure: example DFGs (DFG #1, DFG #2) extracted from the application source code, and the corresponding instrumented C code:]
...
dfg1++; //profiling code
vr1=a+b;
vr2=c+d;
e=vr1+vr2;
...
Dev. Framework – Instruction Generation
 First step = Pattern Generation
 In-house tool for the identification of MISO clusters of operations based on the MaxMISO algorithm (the idea is sketched below)
 Second step = Mapping of the MISO clusters onto the RFU
 1. Place the SUIFvm nodes in PEs / route the 1-D array
 2. Analyze paths and set the output of each PE (reg./unreg.) to minimize delay
 3. Report candidate instruction semantics
[Figure: two example candidates – Candidate1 (NEG, SHIFT, ADD over register/constant inputs) and Candidate2 (SUB, NEG, SHIFT mapped onto PE1-PE3)]
Example of a reported candidate:
Candidate2 src1: $vr1 src2: $vr1 src3: $vr3 dst: $vr4
{
  region: func1 - dfg1
  PE1: sub, output: reg
  PE2: neg, output: un-reg
  ...
  edg1: in1-PE1, in2-PE1, ...
  ...
  latency: 1 cycle
  type: comp
  static gain: 2
}
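A minimal C sketch of the MaxMISO-style clustering idea behind Pattern Generation, under an assumed array-based DFG representation; it illustrates the general technique only and is not the in-house tool itself.

/* Partition a DFG into maximal multiple-input single-output clusters. */
#define MAX_NODES 64
#define MAX_PREDS  4

typedef struct {
    int num_nodes;
    int fanout[MAX_NODES];             /* number of consumers of each node  */
    int num_preds[MAX_NODES];          /* number of operands (predecessors) */
    int preds[MAX_NODES][MAX_PREDS];   /* predecessor node ids              */
} Dfg;

/* Absorb every predecessor that has exactly one consumer into the cluster
   rooted at `node`; values used by several nodes stay cluster inputs. */
static void grow_miso(const Dfg *g, int node, int id, int *cluster)
{
    cluster[node] = id;
    for (int i = 0; i < g->num_preds[node]; i++) {
        int p = g->preds[node][i];
        if (g->fanout[p] == 1 && cluster[p] < 0)
            grow_miso(g, p, id, cluster);
    }
}

/* A node roots a MaxMISO when its value feeds several consumers or is a
   DFG output (fanout != 1); every node ends up in exactly one cluster.   */
void max_miso_partition(const Dfg *g, int *cluster)
{
    int next_id = 0;
    for (int n = 0; n < g->num_nodes; n++)
        cluster[n] = -1;
    for (int n = 0; n < g->num_nodes; n++)
        if (g->fanout[n] != 1)
            grow_miso(g, n, next_id++, cluster);
}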
Dev. Framework – Instruction Selection (1/2)
No Virtual opcode
 Consider the whole application space
 Perform pair-wise graph isomorphism to identify identical candidate
instructions
 Calculate dynamic gain offered by each candidate
 Dynamic = Static x Frequency
 Rank candidate instructions based on dynamic gain
 Select best L instructions
 L defined by the number of supported instructions
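A minimal C sketch of this ranking step, under assumed data structures; merging of isomorphic candidates is presumed to have happened already.

/* Rank candidates by dynamic gain = static gain x frequency and keep the
   best L, where L is the number of supported reconfigurable instructions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    long static_gain;    /* cycles saved per execution               */
    long frequency;      /* executions observed during profiling     */
    long dynamic_gain;   /* static_gain * frequency                  */
} Candidate;

static int by_dynamic_gain_desc(const void *a, const void *b)
{
    const Candidate *ca = a, *cb = b;
    if (cb->dynamic_gain != ca->dynamic_gain)
        return (cb->dynamic_gain > ca->dynamic_gain) ? 1 : -1;
    return 0;
}

void select_instructions(Candidate *cands, int n, int L)
{
    for (int i = 0; i < n; i++)
        cands[i].dynamic_gain = cands[i].static_gain * cands[i].frequency;
    qsort(cands, n, sizeof(Candidate), by_dynamic_gain_desc);
    for (int i = 0; i < L && i < n; i++)
        printf("selected: %s (dynamic gain %ld)\n",
               cands[i].name, cands[i].dynamic_gain);
}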
Dev. Framework – Instruction Selection (2/2)
With Virtual opcode enabled
 Partition application code into regions
 Currently supporting only procedures
 Perform Graph isomorphism per region
 Calculate dynamic gain offered by each candidate for each region
 Calculate the overhead of activating the region contexts
 Rank regions and candidate instructions based on dynamic gain
 Select best K regions and best L instructions from each region
 L, K defined by the supported contexts and instructions per context
Experimental Results
 Demonstrate the performance improvements offered by the proposed architecture
 Evaluate the efficiency of the enhancements
 A complete MPEG-2 encoding application is used
 Source code from MediaBench benchmark suite
 Input data => a video sequence consisting of 12 frames with resolution
of 144x176 pixels
Exp. Results – SpeedUp Analysis
 Speedup analysis for the most time-consuming functions of the MPEG-2 encoder
 Accelerating only critical regions => small overall speedup (Amdahl's law)
 Our approach accelerates the whole application space => the overall speedup is preserved
Function          Instr. Count (10^6, No RFU)   SpeedUp   SpeedUp (Incremental)
SAD               589.0                         6.6       1.5
dist1             1206.0                        3.4       2.3
fullsearch        73.5                          2.0       2.5
bdist1            18.0                          2.0       2.5
putbits           16.3                          2.3       2.6
fdct              15.6                          2.3       2.6
quant             13.1                          2.6       2.7
idctcol           11.4                          2.4       2.7
dct               10.4                          2.3       2.7
pred_comp         10.1                          1.9       2.7
iquant            9.9                           1.8       2.8
add_pred          8.0                           2.0       2.8
bdist2            7.3                           1.8       2.8
idctrow           7.0                           2.2       2.8
putnonintrablk    6.9                           1.8       2.8
sub_pred          6.6                           1.8       2.9
Overall           1448.7                        2.9       2.9
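As a quick check on the first row (under the hedged reading that the incremental column reports the overall speedup when only the functions down to that row are accelerated), Amdahl's law with SAD alone accelerated reproduces the reported 1.5:

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1-f) + f/s}
= \frac{1}{(1-0.41) + 0.41/6.6} \approx 1.5,
\qquad f = \frac{589.0}{1448.7} \approx 0.41,\; s = 6.6
\]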
Exp. Results – Evaluation of predication
 Example of four instructions derived using if-conversion and partial predicated execution (Instruction 1, Instruction 2, Instructions 3a & 3b)
 These instructions implement the SAD function
 Significant performance improvements are offered

              SAD Speedup   Overall Speedup
No predic.    1.7           1.7
Predic.       6.6           2.9

[Figure: dataflow graphs of the four SAD instructions – chains of additions, subtractions and shifts over register/constant operands, with CMP results driving MUXes that select the predicated values]
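For reference, a minimal C version of the SAD kernel with the inner branch removed by if-conversion, mirroring the CMP-driven selection above; loop bounds and operand types are illustrative assumptions.

/* Sum of absolute differences with a branch-free |diff|: the comparison
   result selects between the two precomputed alternatives, like the
   CMP-driven MUX in the instructions on this slide. */
int sad_if_converted(const unsigned char *blk1, const unsigned char *blk2,
                     int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++) {
        int diff = blk1[i] - blk2[i];
        int abs_diff = (diff < 0) ? -diff : diff;   /* predicated select */
        sum += abs_diff;
    }
    return sum;
}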
Exp. Results – Evaluation of Virtual Opcode
Memory Organization (instr. x contexts)   Speedup   Memory Size (KB)
4x8                                       1.7       0.5
8x8                                       2.0       1.1
16x12                                     2.8       3.2
32x12                                     3.0       8.7
Unconstrained                             3.1       -

[Figure: speedup vs. instructions per context (4 to 64), with curves for 2, 4, 8, 12 and 16 contexts and for a unified opcode space; speedup ranges from about 1.5 to 3.1]

 Virtual opcode can be used to preserve speedups for architectures with limited opcode space
 Reasonable overhead for the local configuration memory size
 Finer partitioning of regions could yield even better results
Conclusions
 Two enhancements to a previously proposed RISP architecture have been proposed
 Partial predicated execution => increases performance
 Virtual opcode => relaxes opcode space pressure
 An automated development framework has been presented
 Hides the reconfigurable hardware from the user
 Supports the two enhancements
 The efficiency of the RISP and its enhancements has been demonstrated using an MPEG-2 encoding application
 Future research
 Support full predication for further performance improvements
 Support finer partitioning of regions for better utilization of the virtual opcode
Thank you !!!
Questions ??