Codesign Extended Applications
Brian Grattan, Greg Stitt, Frank Vahid*
Dept of Computer Science & Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation
and by NEC C&C Research Labs
Outline

Introduction: Hardware/Software Partitioning
  And the common assumption of a single specification
Different Algorithms in Hardware/Software
Codesign Extended Applications
Experiments
Future Work and Conclusions
CODES’02 – Codesign Extended Applications
Brian Grattan, Greg Stitt, Frank Vahid, Univ. of California, Riverside
Introduction – Hw/Sw Partitioning

Hw/sw partitioning can speed up software
  Shown by numerous researchers
    E.g., Balboni, Fornaciari, Sciuto CODES’96; Eles, Peng, Kuchcinski, Doboli DAES’97; Gajski, Vahid, Narayan, Gong Prentice-Hall 1997; Grode, Knudsen, Madsen DATE’98; many others
  1.5 to 10x common
  Some examples like image processing get 100-800x speedup
    E.g., Cameron project, FCCM’02
Can reduce energy too
  E.g.
    Henkel, Li CODES’98
    Wan, Ichikawa, Lidsky, Rabaey CICC’98
    Stitt, Grattan, Villarreal, Vahid FCCM’02
  60-80% energy savings measured on real single-chip uP/FPGA devices

Hw/Sw Partitioning on Single-Chip Platforms

Numerous single-chip commercial devices with uP and FPGA
  Triscend E5 (shown)
  Triscend A7
  Atmel FPSLIC
  Xilinx Virtex II Pro
  Altera Excalibur
  More sure to come…
Make hw/sw partitioning even more attractive

[Figure: Triscend E5 chip with regions labeled configurable logic, uP and peripherals, and cache/memory]

Hw/Sw Partitioning – Commercial Tools Evolving

Commercial products evolving
  Synopsys’ Nimble compiler (2000) attempt
  Proceler
    Microprocessor Report’s 2001 Technology of the Year Award
  Others coming…

Hw/Sw Partitioning – Single-Spec Assumption

Assumption – Start from a single specification
  Typically sw source
Partitioning
  Find critical sw kernels, map some to hw
This assumption is made in most research efforts as well as commercial tools

[Figure: design flow. Specification → Hw/sw partitioner → Sw and Hw; Sw → Compilation → Binaries; Hw → Synthesis → Netlists]

Digital Camera Example

Developed with intent of exploring hw/sw tradeoffs
  Captures images, compresses, uploads to PC
Soon found that a single specification wasn’t reasonable
  Two key functions had different hw/sw algorithms
    CRC
    DCT

[Figure: digital camera block diagram with DCT, Huffman encoder, Controller, CCD Pre-Processor, Communications, and CRC calculation blocks]

Digital Camera Example

Results in weak hw design
  We would have written CRC and DCT differently had we known they’d be mapped to hw
  Yet, we’d keep the original algorithms if they ended up in software

[Figure: design flow. Spec (DCT, Huffman, CRC, CCD, Ctrl) → Hw/sw partitioner → Sw (Huff., CCD, Ctrl) and Hw (CRC, DCT); Sw → Compilation → Binaries; Hw → Synthesis → Netlists, marked "Weak"]

Different Algorithms in Hw vs. Sw

The single-specification assumption doesn’t always hold
Key observation
  Designers often use very different algorithms if a behavior is mapped to hardware versus if that behavior is mapped to software
Widely known by designers
  In textbooks
  Also known in parallel processing – sequential and parallel algorithms

Different Algorithms – Sorting Example

Suppose desired behavior fills a buffer, sorts the buffer, and transmits the sorted list
Sort() in software – QuickSort
  Simple and fast in sw
  Poor in hw, can’t be parallelized well
Sort() in hardware – Parallel Mergesort
  Very fast in hardware
  Slow in sw (if sequential) due to overhead
Derive one from the other?

[Figure: Fill() → Sort() → Transmit() flow, a Quicksort block, and a tree of MS (mergesort) units]

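The two Sort() choices can be sketched in C. This is only an illustration under stated assumptions: `sort_sw`, `sort_hw_model`, and `merge_unit` are hypothetical names, the library `qsort()` stands in for QuickSort (its actual algorithm is implementation-defined), and the "hardware" version is a sequential model in which each pass corresponds to one level of the MS tree. In real hardware the merge units of one level would operate concurrently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort(): ascending ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Software-style Sort(): hand the buffer to the C library sort. */
void sort_sw(int *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    qsort(buf, n, sizeof buf[0], cmp_int);
}

/* One "MS" unit: merge sorted runs src[lo..mid) and src[mid..hi) into dst. */
static void merge_unit(const int *src, int *dst,
                       size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Hardware-style Sort(), modeled sequentially as a bottom-up mergesort.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sort_hw_model(int *buf, size_t n)
{
    int *tmp, *src = buf, *dst;
    size_t width, lo;

    assert(buf != NULL || n == 0);
    if (n < 2) return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1;
    dst = tmp;

    for (width = 1; width < n; width *= 2) {      /* one tree level per pass */
        for (lo = 0; lo < n; lo += 2 * width) {   /* independent merge units */
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_unit(src, dst, lo, mid, hi);
        }
        { int *t = src; src = dst; dst = t; }     /* output feeds next level */
    }
    if (src != buf) memcpy(buf, src, n * sizeof *buf);
    free(tmp);
    return 0;
}
```

The inner loop of `sort_hw_model` has no dependence between iterations, which is what makes the merge tree attractive in hardware; run sequentially, the extra passes and the scratch buffer are pure overhead compared with `sort_sw`.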
Different Algorithms – CRC Example

CRC – Cyclic Redundancy Check
  Used for error checking during communication, stronger than parity
  Mathematically, divides the data by a constant (the generator polynomial) and saves the remainder

[Figure: Main Function calls crc() with parameters init_crc (initial value), *data (pointer to data), len (length of data), and jinit (initializing options); crc() returns the value of the CRC for the given data. A message layout is labeled crc/data/data/data.]

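The "divide and keep the remainder" view can be written directly as a bit-serial loop. The sketch below assumes the 16-bit CCITT generator x^16 + x^12 + x^5 + 1 (0x1021), processed most significant bit first; the slides do not name the polynomial, and `crc16_bitserial` is a hypothetical name, not the deck's crc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC16_POLY 0x1021u   /* assumed generator: x^16 + x^12 + x^5 + 1 */

/* Modulo-2 long division of the message by the generator polynomial.
 * crc is the starting remainder; the final remainder is returned. */
uint16_t crc16_bitserial(uint16_t crc, const uint8_t *data, size_t len)
{
    size_t i;
    int bit;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);        /* bring in the next 8 bits */
        for (bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)                  /* top bit set: subtract    */
                crc = (uint16_t)((crc << 1) ^ CRC16_POLY);
            else                                /* top bit clear: shift only */
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

With a zero starting value, the nine ASCII bytes "123456789" produce 0x31C3, the commonly published check value for this polynomial and bit order.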
Different Algorithms – CRC in Hardware

Hardware Version
  Knowing the generator polynomial, one can calculate the XORs for each individual bit
  Each CRC value is the result of bit-wise XORs with the data and the previous CRC value
  Synthesizes to hw very nicely; but getting bits and shifting are inefficient in sw

unsigned short crc_hw(…)
{
    unsigned short j, crc_value = init_crc;
    unsigned short new_crc_value;
    if (jinit >= 0) crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
    /* data[] is indexed from 1; bit(n, x) is assumed to extract bit n of x */
    for (j = 1; j <= len; j++) {
        new_crc_value = bit(4,data[j]) ^ bit(0,data[j]) ^ bit(8,crc_value) ^ bit(12,crc_value); // bit 0
        new_crc_value = new_crc_value | (bit(5,data[j]) ^ bit(1,data[j]) ^ bit(9,crc_value) ^ bit(13,crc_value)) << 1;  // bit 1
        new_crc_value = new_crc_value | (bit(6,data[j]) ^ bit(2,data[j]) ^ bit(10,crc_value) ^ bit(14,crc_value)) << 2; // bit 2
        … continue for the remaining bits …
        crc_value = new_crc_value;  /* feed the result into the next character */
    }
    return (crc_value);
}

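A compact way to check such per-bit equations is to compare them against the bit-serial definition. The sketch below again assumes the CCITT generator 0x1021 (an assumption; the XOR terms shown for bits 0 to 2 are consistent with it). `crc16_step_parallel` folds the whole XOR network for one character into shifts and XORs, so every output bit is a fixed XOR of input bits, for example bit 0 = d4 ^ d0 ^ c8 ^ c12. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: one character through the bit-serial shift register. */
uint16_t crc16_step_serial(uint16_t crc, uint8_t d)
{
    int bit;
    crc ^= (uint16_t)(d << 8);
    for (bit = 0; bit < 8; bit++)
        crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                              : (uint16_t)(crc << 1);
    return crc;
}

/* Same update as a fixed XOR network, with no loop and no data-dependent
 * branch. t[i] = c[8+i] ^ d[i]; x[i] = t[i] ^ t[i+4] for i < 4, else t[i]. */
uint16_t crc16_step_parallel(uint16_t crc, uint8_t d)
{
    uint8_t t = (uint8_t)((crc >> 8) ^ d);
    uint8_t x = (uint8_t)(t ^ (t >> 4));
    return (uint16_t)((crc << 8) ^ (x << 12) ^ (x << 5) ^ x);
}

/* CRC of a whole buffer using the parallel step, one character per call. */
uint16_t crc16_parallel(uint16_t crc, const uint8_t *data, size_t len)
{
    size_t i;
    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++)
        crc = crc16_step_parallel(crc, data[i]);
    return crc;
}
```

In hardware each term of the return expression is just wiring into XOR gates, which matches the one-character-per-cycle behavior the slide describes; in software the same work costs several shift and mask instructions per character.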
Different Algorithms – CRC in Software

Software Version
  Before doing any calculations, create an initialization table that holds the CRC for each individual character
  Use data as index into initialization table and execute two XORs
  Requires lookups, but faster for a sequential calculation

unsigned short crc_sw(…) // Source: Numerical Recipes in C
{
    unsigned short initialize_table(unsigned short crc, unsigned char one_char);
    static unsigned short icrctb[256], init = 0;
    unsigned short tmp1, j, crc_value = init_crc;
    if (!init) {  /* build the 256-entry table on the first call only */
        init = 1;
        for (j = 0; j <= 255; j++) {
            icrctb[j] = initialize_table(j << 8, (uchar) 0);
        }
    }
    if (jinit >= 0) crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
    for (j = 1; j <= len; j++) {  /* data[] is indexed from 1 */
        tmp1 = data[j] ^ HIBYTE(crc_value);
        crc_value = icrctb[tmp1] ^ (LOBYTE(crc_value) << 8);
    }
    return (crc_value);
}

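For readers who want to run the table-driven idea, here is a self-contained sketch. It is not the Numerical Recipes routine: it assumes the CCITT generator 0x1021, uses 0-based indexing, drops the jinit option, and builds the table with its own bit loop. `crc16_table` and `crc16_init_table` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t crc_table[256];
static int crc_table_ready = 0;

/* Precompute the CRC of every possible character (with zero low byte). */
static void crc16_init_table(void)
{
    unsigned byte;
    int bit;
    for (byte = 0; byte < 256; byte++) {
        uint16_t r = (uint16_t)(byte << 8);
        for (bit = 0; bit < 8; bit++)
            r = (r & 0x8000u) ? (uint16_t)((r << 1) ^ 0x1021u)
                              : (uint16_t)(r << 1);
        crc_table[byte] = r;
    }
    crc_table_ready = 1;
}

/* One table lookup and two XORs per character. */
uint16_t crc16_table(uint16_t crc, const uint8_t *data, size_t len)
{
    size_t i;
    assert(data != NULL || len == 0);
    if (!crc_table_ready)
        crc16_init_table();
    for (i = 0; i < len; i++) {
        uint8_t idx = (uint8_t)((crc >> 8) ^ data[i]);   /* first XOR  */
        crc = (uint16_t)(crc_table[idx] ^ (crc << 8));   /* second XOR */
    }
    return crc;
}
```

The 256-entry table costs 512 bytes of memory, cheap on a microprocessor with a cache but comparatively expensive in configurable logic, which is one reason this formulation suits software better than hardware.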
Different Algorithms – DCT

DCT – Discrete Cosine Transform
  Computationally intensive, numerous matrix multiplies
  Accounts for perhaps 70% of JPEG encoding time
Dozens of possible algorithms
  Best algorithm depends largely on computational resources
  Certainly different for sw and hw
Doing multiplications in floating-point vs. fixed-point
  Multiplication by a constant can be efficiently mapped to hardware, but accuracy will be lost by not using floating-point

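The floating-point versus fixed-point tradeoff can be illustrated with a single constant multiply. The sketch assumes an 8-bit sample scaled by cos(π/4) ≈ 0.70710678, a constant that appears in 8-point DCT factorizations, and an 8-bit fractional format; the format, the constant choice, and the names `scale_float` and `scale_fixed` are all illustrative assumptions, not the deck's DCT.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8
#define C_FIXED   181u   /* round(0.70710678 * 2^8) = 10110101 in binary */

/* Floating-point version: natural in software when an FPU is available. */
double scale_float(uint8_t sample)
{
    return 0.70710678 * (double)sample;
}

/* Fixed-point version: the constant multiply reduces to shifts and adds,
 * which in hardware are wiring plus a few adders. */
uint32_t scale_fixed(uint8_t sample)
{
    uint32_t x = sample;
    /* 181 = 128 + 32 + 16 + 4 + 1 */
    uint32_t product = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x;
    assert(product == x * C_FIXED);
    return product >> FRAC_BITS;   /* discard the fraction (truncates) */
}
```

For a sample of 100 the floating-point result is about 70.71 while the fixed-point result is 70; that rounding loss, accumulated over many multiplies, is the accuracy cost the slide refers to.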
Codesign Extended Applications (CEAs)

Basic idea:
  Write two versions of certain functions
    Only the critical functions, and
    Only those with different sw and hw algorithms
  Typically only a handful of these
    Most time is spent in just a few critical functions
  Include both function versions in the specification
    But use compiler flags to include either sw or hw version

main()
{
    …
    crc();
    …
}

unsigned short crc(…)
{
#ifdef cea_crc_hw
    return crc_hw(…);
#else
    return crc_sw(…);
#endif
}

% gcc -Dcea_crc_hw main.c

CEAs when using C/C++ and VHDL

C code

crc_hw(…inputs…)
    /* Hardware crc... */
    for (j = 1; j <= len; j++) {
        TSHORT(to_hw) = data[j];   /* write one character to the hw input */
        TBYTE(enable) = 1;         /* pulse enable to start the update */
        TBYTE(enable) = 0;
    }
    crc_value = TSHORT(result);    /* read back the final CRC */
    return (crc_value);

VHDL code

    if (rst = '1') then
        crc <= "0000000000000000";
        done <= '0';
    elsif (clk'event and clk = '1') then
        if (enable = '1') then
            if done = '0' then
                crc <= nextCRC16_D8(input, crc);
                done <= '1';
            end if;
        else
            done <= '0';
            output <= crc;
        end if;
    end if;

CEAs Enable Hw/Sw Partitioning Tool

Traditional hw/sw partitioner
  Compiler, estimators, search heuristics, technology files, etc.
  Drawback: heavy impact on tool flow
CEAs plus platforms result in simple partitioner
  Script uses existing compiler, synthesis, and evaluation (simulation or physical measurement)
  Drawbacks: must write two versions of critical functions, script may use simpler search function
  Different partitioners for different domains

[Figure, traditional flow: Specification → Hw/sw partitioner (essentially a compiler, search heuristic, and estimator; heavy-duty tool) → Sw and Hw; Sw → Compilation → Binaries; Hw → Synthesis → Netlists]

[Figure, CEA flow: CEA → Script (search heuristic and tool control; lightweight tool) → Sw and Hw; Sw → Compilation → Binaries; Hw → Synthesis → Netlists; Binaries and Netlists → Evaluator]

Experiments

Compared hw and sw CRC algorithms
  Synthesized to FPGA
  Compiled to MIPS uP
Demonstrates need for different algorithms

Sw and hw CRC algorithms in FPGA.

                          Size (Blocks)   Delay (clock cycles/character)
Hardware CRC algorithm    19              1
Software CRC algorithm    44              3

Sw and hw CRC algorithms on a microprocessor.

                          Size (Assembly Lines)   Clock Cycles
Software CRC Algorithm    1061                    180,000
Hardware CRC Algorithm    1298                    814,000

Experiments

Wrote small signal processing example as CEA
  Wrote sw and hw versions of core functions
    In this case, algorithms were similar
Set up power measurement for two real platforms
  XS40 (board with microcontroller chip and Xilinx FPGA chip)
  E5 (single chip with microcontroller and FPGA)
Partitioning script automatically partitioned and measured power and cycles (overnight – due to place & route time)
  Demonstrates how CEAs enable simple yet practical hw/sw partitioning
  Easily migrates to different platforms, different chips

Energy (Joules) on E5 device, by partitioning

Sum   Multiply   Bit-Share   Energy (Joules)
SW    SW         SW          12.4
SW    SW         HW          8.6
SW    HW         SW          8.8
HW    SW         SW          8.0
SW    HW         HW          4.8
HW    SW         HW          Does not Route
HW    HW         SW          Does not Route
HW    HW         HW          Does not Route

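The search performed by such a partitioning script can be sketched as an exhaustive walk over all hw/sw assignments. Everything here is a hedged illustration: the real script's language and tool invocations are not given in the slides, `evaluate()` is a stand-in that looks up the E5 energy figures instead of compiling, synthesizing, and measuring, and the flag names follow the `cea_crc_hw` pattern but are hypothetical. The energy values are read on the assumption that the table's Sum, Multiply, and Bit-Share columns are listed in that order.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_FUNCS 3
#define NO_ROUTE  (-1.0)   /* marks a partition that did not route */

/* Bit f of a partition mask set means function f is mapped to hardware. */
static const char *const func_names[NUM_FUNCS] = { "sum", "multiply", "bitshare" };

/* Measured energy in Joules, indexed by mask (bit 0 = Sum, bit 1 = Multiply,
 * bit 2 = Bit-Share). Stand-in for a real compile/synthesize/measure run. */
static const double measured_energy[1 << NUM_FUNCS] = {
    12.4, 8.0, 8.8, NO_ROUTE, 8.6, NO_ROUTE, 4.8, NO_ROUTE
};

static double evaluate(unsigned mask)
{
    return measured_energy[mask];
}

/* Build the -D flags that would select the hw versions for this mask. */
void build_flags(unsigned mask, char *out, size_t cap)
{
    int f;
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (f = 0; f < NUM_FUNCS; f++) {
        if (mask & (1u << f)) {
            size_t used = strlen(out);
            snprintf(out + used, cap - used, "-Dcea_%s_hw ", func_names[f]);
        }
    }
}

/* Try every partition and keep the lowest-energy one that routes. */
unsigned best_partition(void)
{
    unsigned mask, best = 0;
    double best_e = evaluate(0);
    for (mask = 1; mask < (1u << NUM_FUNCS); mask++) {
        double e = evaluate(mask);
        if (e >= 0.0 && (best_e < 0.0 || e < best_e)) {
            best = mask;
            best_e = e;
        }
    }
    return best;
}
```

With three functions there are only eight candidates, so exhaustive search is practical even when each evaluation involves place and route; with more functions the script would need a pruning heuristic, which is the "simpler search function" drawback noted earlier in the deck.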
Issues and Future Work

Issues
  What if hw versions not used after partitioning? Wasted effort?
  Verification of all possible combinations?
  Must use wisely or problem grows unwieldy
Future work
  More examples, more platforms
  Several versions of the same function
    One hardware area-conscious
    One hardware speed-conscious
    One software code-size-conscious
    One software speed-conscious
    …more…
  Experimenting with communication between hardware and software
    DMA transfer, wide-access memories, …

Conclusions

Basic hw/sw partitioning assumption of a single specification doesn’t always hold
Codesign Extended Applications help support different algorithms
CEAs enable hw/sw partitioning in existing tool flows
  Utilizes existing compilation, synthesis, mapping, evaluation tools, and platforms
  Simple yet effective approach to hw/sw partitioning