Implementation of VLD and Constant Division on PAC DSP Platform

Student: Chung-Yen Tsai
Advisor: Prof. David W. Lin
Date: 2005.12.22
Outline

- Introduction to PAC DSP
- VLD Implementation on PAC
- Implementation of Constant Division on PAC
- Conclusion

Reference

ITRI STC/M310, PACDSP2S0000, “PACDSP v2.0 Instruction Set Menu”, June 2005.

Introduction to PAC DSP

- PAC is a Very Long Instruction Word (VLIW) processor
  - Supports Single Instruction Multiple Data (SIMD) instructions
- One scalar unit and two clusters
  - Each cluster contains a load/store unit (L/S) and an arithmetic unit (AU)
- Ping-pong register file architecture

The Architecture of PACDSP
References

[1] S. Sriram and C. Y. Hung, “MPEG-2 video decoding on the TMS320C6X DSP architecture,” IEEE Signal Systems Computer Conf., vol. 2, Nov. 1998, pp. 1735-1739.
[2] C. Fogg, “Survey of software and hardware VLC architectures,” SPIE vol. 2186, Image and Video Compression, 1994.

VLD Problem 1: It Does Not Pay to Use Both Clusters

- Because the length of the next code to be decoded is not known in advance, the benefit of the two-cluster architecture cannot be exploited
- If we do not use the block-based coding type, the program flow is simpler
  - Because there are fewer branches

VLD Problem 2: Memory vs. Performance

- A tradeoff exists between the memory size required for the VLD lookup table and the number of cycles used
  - Different VLD methods have been proposed in the literature
  - We analyze them in later slides
  - On deeply pipelined processors, performance is limited by significant memory access time [1]

VLD Methods [1], [2]

- Bit-by-bit matching
  - Read the bitstream bit by bit, and check for a match after each read
- Multiple-pass lookups
  - Separate the table into 3 parts, and read the table at most 3 times
- Bounded multiple-pass lookups
  - Also 3 tables, but read the bitstream only once
- One-table lookup
  - Only one table, read only once
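
To make the two extremes of this list concrete, here is a minimal Python sketch contrasting bit-by-bit matching with a one-table lookup. The four-entry prefix code `CODES` and all function names are hypothetical and chosen only for illustration; they are not the MPEG-4 VLC table and not the PAC implementation.

```python
# Hypothetical toy prefix code, used only for illustration (not an MPEG-4 table).
CODES = {"1": "A", "01": "B", "001": "C", "000": "D"}
MAX_LEN = max(len(code) for code in CODES)


def decode_bit_by_bit(bits):
    """Read one bit at a time and test for a match after every bit."""
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in CODES:          # one check (and one branch) per bit
            symbols.append(CODES[current])
            current = ""
    return symbols


def build_one_table():
    """Build a table indexed by MAX_LEN bits; each entry is (symbol, code length)."""
    table = [None] * (1 << MAX_LEN)
    for code, symbol in CODES.items():
        pad = MAX_LEN - len(code)
        base = int(code, 2) << pad
        for low in range(1 << pad):   # every bit pattern that starts with this code
            table[base + low] = (symbol, len(code))
    return table


def decode_one_table(bits, table):
    """Peek MAX_LEN bits, do a single lookup, then advance by the code length."""
    symbols, pos = [], 0
    padded = bits + "0" * MAX_LEN     # padding so the last window is full
    while pos < len(bits):
        window = int(padded[pos:pos + MAX_LEN], 2)
        symbol, length = table[window]
        symbols.append(symbol)
        pos += length
    return symbols


if __name__ == "__main__":
    stream = "1" + "01" + "001" + "000" + "1"
    print(decode_bit_by_bit(stream))
    print(decode_one_table(stream, build_one_table()))
```

The one-table variant needs 2^MAX_LEN entries but only one lookup per symbol, while the bit-by-bit variant needs almost no table memory but one test per bit. The multiple-pass methods sit between these two points.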
Bit-by-bit Matching: Method

Test VLC Table from MPEG-4 Standard
Comparison

[Figure: “Comparison between All Methods”. Cycles (vertical axis, 0 to 800) versus code pattern (horizontal axis) for four methods: bit-by-bit matching, multiple-pass, bounded multiple-pass, and one table.]

Conclusion of VLD on PAC

- Branch and jump instructions degrade performance
- Bitstream reads and memory accesses also cost many cycles, so we should try to reduce their frequency
- The bounded multiple-pass method appears to offer the best tradeoff between required memory size and speed among the analyzed methods

References

[1] D. A. Patterson and J. L. Hennessy, “Computer Organization & Design: The Hardware/Software Interface”, sec. 4.7 “Division”.
[2] M. D. Ercegovac and T. Lang, “Digital Arithmetic”, sec. 1.6 “Basic Division Algorithms”.

Why do we need efficient constant division?

- Disadvantages of a hardware divider
  - Larger area and higher power consumption
  - Several cycles required per division
- Several DSPs have no hardware divider
  - That is, no division instructions are supported
- Algorithms that perform division using addition and multiplication are therefore necessary

If we use a table lookup

- The most efficient method when multiplication is supported
- Disadvantages
  - Unknown divisor
    - QP is a user-defined value, so the table must include all possible QP values (3 to 9 bits)
  - Precision
    - Example: dividend = 0xFFFF (65535), divisor = 0x1C (28)
    - Exact result: q = 2340
    - 1/28 = 0.035714285 → 1170 (with scale 32768)
    - → 65535 x 1170 / 32768 = 2339
    - Can be adopted if the round-to-nearest-integer rule is used
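
The arithmetic in the precision example can be reproduced with a short Python sketch of a scaled-reciprocal table. The names `RECIPROCAL`, `divide_truncating`, and `divide_rounding`, and the 9-bit divisor limit `MAX_DIVISOR`, are assumptions made for illustration; only the scale of 32768 and the example operands come from the slide.

```python
SCALE_BITS = 15        # scale factor 2**15 = 32768, as in the example above
MAX_DIVISOR = 511      # assumption: divisors fit in 9 bits

# RECIPROCAL[d] holds floor(32768 / d); index 0 is unused.
RECIPROCAL = [0] + [(1 << SCALE_BITS) // d for d in range(1, MAX_DIVISOR + 1)]


def divide_truncating(dividend, divisor):
    """Multiply by the scaled reciprocal and drop the fractional bits."""
    return (dividend * RECIPROCAL[divisor]) >> SCALE_BITS


def divide_rounding(dividend, divisor):
    """Same product, but add half of the scale before shifting (round to nearest)."""
    half = 1 << (SCALE_BITS - 1)
    return (dividend * RECIPROCAL[divisor] + half) >> SCALE_BITS


if __name__ == "__main__":
    print(RECIPROCAL[0x1C])                    # scaled reciprocal of 28
    print(divide_truncating(0xFFFF, 0x1C))     # truncated table-lookup result
    print(divide_rounding(0xFFFF, 0x1C))       # rounded table-lookup result
    print(0xFFFF // 0x1C)                      # exact integer quotient
```

For this operand pair the truncating version yields 2339 and the rounding version yields 2340. The sketch does not show that rounding removes the error for every dividend and divisor.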
Simple Idea But Bad Result

- Idea
  - For positive integers, repeatedly subtract the divisor from the dividend, checking after each subtraction whether the result has become negative
- Result
  - With dividend 0x8000 and divisor 0x8
    - A single division takes 0x1000 (4096) iterations
    - Very inefficient
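
The iteration count above can be checked with a minimal Python sketch of division by repeated subtraction. The function name is hypothetical, and the loop tests before subtracting rather than subtracting and then checking the sign, which is a small simplification of the idea on the slide.

```python
def divide_by_subtraction(dividend, divisor):
    """Return (quotient, remainder, iterations) for non-negative dividend, positive divisor."""
    quotient = 0
    remainder = dividend
    iterations = 0
    while remainder >= divisor:   # one subtraction per unit of the quotient
        remainder -= divisor
        quotient += 1
        iterations += 1
    return quotient, remainder, iterations


if __name__ == "__main__":
    print(divide_by_subtraction(0x8000, 0x8))
```

The number of iterations equals the quotient itself, which is why the cost grows with the size of the result rather than with the word length.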
Introduction to the Algorithms

- Restoring algorithms
  - “Grammar School Algorithm Ver.1” [1]
  - “Grammar School Algorithm Ver.2” [1]
  - “Grammar School Algorithm Ver.3” [1]
  - “Algorithm Restoring Divide (RD)” [2]
  - “Algorithm Non-performing Divide (NPD)” [2]
- Non-restoring algorithm
  - “Algorithm Non-restoring Divide (NRD)” [2]

Use the Idea of Long Division
-- The Grammar School Algorithm

                    1001        Quotient
Divisor   1000 ) 1001010        Dividend
                -1000
                    10
                    101
                    1010
                   -1000
                      10        Remainder

Grammar School Algorithm Ver.1 [1]

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend
[Recurrence]
for j = 0…n
    rem = rem - d^;
    if rem >= 0
        quo = quo << 1; quo[0] = 1;
    else
        rem = rem + d^; quo = quo << 1; quo[0] = 0;
    d^ = d^ >> 1;
endfor
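
As a cross-check of the recurrence, the following Python sketch runs Ver.1 on unbounded integers. The function name and the default word length `n = 16` are assumptions for illustration; it models only the arithmetic, not the PAC registers or instruction scheduling.

```python
def grammar_school_v1(dividend, divisor, n=16):
    """Restoring division with a right-shifting divisor; n is the assumed word length."""
    d_hat = divisor << n          # divisor aligned to the MSB half
    rem = dividend
    quo = 0
    for _ in range(n + 1):        # j = 0 ... n, i.e. n + 1 iterations
        rem -= d_hat
        if rem >= 0:
            quo = (quo << 1) | 1  # subtraction succeeded: quotient bit 1
        else:
            rem += d_hat          # restore the remainder: quotient bit 0
            quo = quo << 1
        d_hat >>= 1
    return quo, rem


if __name__ == "__main__":
    print(grammar_school_v1(0xFFFF, 0x1C))
    print(divmod(0xFFFF, 0x1C))
```

Both calls should print the same quotient and remainder pair as long as the dividend fits in n bits and the divisor is positive.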
Grammar School Algorithm Ver.2

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend;
rem = rem << 1;
[Recurrence]
for j = 0…n-1
    rem = rem - d^;
    if rem >= 0
        rem = rem << 1; quo = quo << 1; quo[0] = 1;
    else
        rem = rem + d^; rem = rem << 1; quo = quo << 1; quo[0] = 0;
endfor
[Correction]
rem = rem >> 16

Grammar School Algorithm Ver.3

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend;
rem = rem << 1;
[Recurrence]
for j = 0…n-1
    LHS(rem) = LHS(rem) - d^;
    if rem >= 0
        rem = rem << 1; rem[0] = 1;
    else
        LHS(rem) = LHS(rem) + d^; rem = rem << 1; rem[0] = 0;
endfor
[Correction]
quo = rem & 0xFFFF; rem = rem >> 17
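
The idea of Ver.3, sharing one register between the remainder and the quotient, can be sketched in Python as follows. The function name and the parameter `n` are assumptions; the sketch subtracts the MSB-aligned divisor from the whole register, which has the same effect as subtracting from its left half, and it reads the quotient out before shifting the remainder down.

```python
def grammar_school_v3(dividend, divisor, n=16):
    """Restoring division where quotient bits are shifted into the low half of rem."""
    d_hat = divisor << n              # divisor aligned to the MSB half
    mask = (1 << n) - 1               # 0xFFFF when n = 16
    rem = dividend << 1               # initial left shift
    for _ in range(n):                # j = 0 ... n-1
        rem -= d_hat                  # acts on the left half only
        if rem >= 0:
            rem = (rem << 1) | 1      # keep the difference, shift in a 1
        else:
            rem += d_hat              # restore
            rem = rem << 1            # shift in a 0
    quo = rem & mask                  # low half holds the quotient
    rem = rem >> (n + 1)              # left half, shifted once too often, holds the remainder
    return quo, rem


if __name__ == "__main__":
    print(grammar_school_v3(0xFFFF, 0x1C))
    print(divmod(0xFFFF, 0x1C))
```

With n = 16 the final shift is 17 bits and the mask is 0xFFFF, which matches the correction step listed for Ver.3.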
Algorithm RD

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend;
[Recurrence]
for j = 0…n-1
    rem’ = 2*rem - d^;
    if rem’ >= 0
        rem = rem’; quo = quo << 1; quo[0] = 1;
    else
        rem = 2*rem; quo = quo << 1; quo[0] = 0;
endfor

Algorithm NPD

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend;
[Recurrence]
for j = 0…n-1
    if (2*rem - d^) >= 0
        quo = quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
        quo = quo << 1; quo[0] = 0; rem = 2*rem;
endfor
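
A minimal Python sketch of the non-performing recurrence is given below. The function name, the parameter `n`, and the final right shift that extracts the integer remainder are assumptions for illustration; the slide itself does not list a correction step for NPD.

```python
def non_performing_divide(dividend, divisor, n=16):
    """Non-performing division: the remainder is only updated when the trial succeeds."""
    d_hat = divisor << n              # divisor aligned to the MSB half
    rem = dividend
    quo = 0
    for _ in range(n):                # j = 0 ... n-1
        trial = 2 * rem - d_hat
        if trial >= 0:
            quo = (quo << 1) | 1
            rem = trial               # accept the trial subtraction
        else:
            quo = quo << 1
            rem = 2 * rem             # no restore needed, just shift
    return quo, rem >> n              # assumed final shift to get the remainder


if __name__ == "__main__":
    print(non_performing_divide(0xFFFF, 0x1C))
    print(divmod(0xFFFF, 0x1C))
```

Compared with RD, the trial difference is never written back on failure, so no restoring addition is required.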
Algorithm NRD

Note: d^ denotes the divisor aligned to the MSB half of the register

[Initialize]
rem = dividend;
rem = 2*rem - d^;
[Recurrence]
for j = 0…n-1
    if rem >= 0
        quo = quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
        quo = quo << 1; quo[0] = 0; rem = 2*rem + d^;
endfor
[Correction]
if rem < 0
    quo[0] = 0; rem = rem + d^;
else
    quo[0] = 1;
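
The non-restoring recurrence can be sketched in Python as follows. This is one possible reading, not necessarily the loop structure used on PAC: the sketch assumes that the initial subtraction counts as the first step, so the recurrence runs n-1 times and the correction step supplies the last quotient bit. The function name, `n`, and the final right shift are likewise assumptions.

```python
def non_restoring_divide(dividend, divisor, n=16):
    """Non-restoring division: add or subtract d_hat depending on the sign of rem."""
    d_hat = divisor << n              # divisor aligned to the MSB half
    rem = 2 * dividend - d_hat        # initial step always subtracts
    quo = 0
    for _ in range(n - 1):            # assumed: n-1 recurrence steps after the initial one
        if rem >= 0:
            quo = (quo << 1) | 1
            rem = 2 * rem - d_hat
        else:
            quo = quo << 1
            rem = 2 * rem + d_hat     # a negative remainder is fixed by adding next time
    # Correction: the sign of the final remainder decides the last quotient bit.
    if rem < 0:
        quo = quo << 1
        rem += d_hat                  # single restoring addition at the very end
    else:
        quo = (quo << 1) | 1
    return quo, rem >> n              # assumed final shift to get the remainder


if __name__ == "__main__":
    print(non_restoring_divide(0xFFFF, 0x1C))
    print(divmod(0xFFFF, 0x1C))
```

Each step performs exactly one addition or subtraction, and only the final correction may need an extra addition.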
Comparison between Different Versions of the Grammar School Algorithm

- Simulation results
  - Version 1: 168 cycles
  - Version 2: 161 cycles
  - Version 3: 162 cycles
- Why does Version 3 not give a significant improvement?
  - The limitation arises from the instruction latencies and delay slots of PAC
  - Thus, the other three algorithms (RD, NPD, NRD) can never be better

Grammar School Algorithm Ver.3 performance: 162 cycles

Conclusion

- Grammar School Algorithm Ver.2 and Ver.3 have almost the same performance because of the latencies and delay slots
  - If the latency of the comparison instructions were lower, Ver.3 would be better
  - Ver.3 needs one more cycle to get the quotient
- Ver.2 is better than Ver.1 because it requires fewer iterations
- The table-lookup method may be adopted while the implementation is still on the simulator