Implementation of VLD and Constant Division on PAC DSP Platform
Student: Chung-Yen Tsai
Advisor: Prof. David W. Lin
Date: 2005.12.22

Outline
- Introduction to PAC DSP
- VLD Implementation on PAC
- Implementation of Constant Division on PAC
- Conclusion

Reference
- ITRI STC/M310, PACDSP2S0000, “PACDSP v2.0 Instruction Set Menu”, June 2005.

Introduction to PAC DSP
- PAC is a Very Long Instruction Word (VLIW) processor
- It supports Single Instruction Multiple Data (SIMD) instructions
- It has one scalar unit and two clusters
- Each cluster contains a Load/Store unit (L/S) and an Arithmetic unit (AU)
- Ping-pong register file architecture

[Figure: The Architecture of PACDSP]

VLD Implementation on PAC

References
[1] S. Sriram and C. Y. Hung, “MPEG-2 video decoding on the TMS320C6X DSP architecture,” IEEE Signal Systems Computer Conf., vol. 2, Nov. 1998, pp. 1735-1739.
[2] C. Fogg, “Survey of software and hardware VLC architectures,” SPIE vol. 2186, Image and Video Compression, 1994.

VLD Problem 1: It Does Not Pay to Use Both Clusters
- Because the length of each code is not known before it is decoded, the benefit of the two-cluster architecture cannot be utilized
- If we do not use the block-based coding type, the program flow is simpler because there are fewer branches

VLD Problem 2: Memory vs. Performance
- A tradeoff exists between the required memory size (for the VLD lookup table) and the cycles used
  - Different VLD methods have been proposed in the literature
  - Some analysis is given in later slides
- Performance is limited on deeply pipelined processors because of significant memory access time [1]

VLD Methods [1], [2]
- Bit-by-bit matching
  - Read the bitstream bit by bit, and check after each read
- Multiple-pass lookups
  - Separate the table into 3 parts, and read the table at most 3 times
- Bounded multiple-pass lookups
  - Also 3 tables, but the bitstream is read only once
- One-table lookup
  - Only one table, read only once

Bit-by-bit Matching: Method
- Test VLC table from the MPEG-4 standard

Comparison
[Figure: Comparison between All Methods. Cycles (axis ticks 200 to 800) versus code pattern for Bit-by-Bit Matching, Multiple-Pass, Bounded Multiple-Pass, and One Table.]

Conclusion of VLD on PAC
- Branch and jump instructions degrade performance
- Bitstream reading and memory accesses also cost many cycles, so we should try to reduce their frequency
- Among the analyzed methods, the bounded multiple-pass method appears to offer the best tradeoff between required memory size and speed

Implementation of Constant Division on PAC

References
[1] D. A. Patterson and J. L. Hennessy, “Computer Organization and Design: The Hardware/Software Interface”, sec. 4.7 “Division”
[2] M. D. Ercegovac and T. Lang, “Digital Arithmetic”, sec. 1.6 “Basic Division Algorithms”

Why do we need efficient constant division?
- Disadvantages of a hardware divider
  - Larger area and more power consumption
  - Several cycles required for a division
- Several DSPs have no hardware divider, that is, no division instructions are supported
- Algorithms that perform division using addition and multiplication are therefore necessary

If we use a table lookup
- The most efficient method when multiplication is supported
- Disadvantages
  - Unknown divisor
  - Precision
- QP is a user-defined value, so a table covering all possible QP values (3 to 9 bits) is needed
- Example: dividend = 0xFFFF (65535); divisor = 0x1C (28)
  - Exact result: q = 2340
  - 1/28 = 0.035714285, stored as 1170 (with scale 32768)
  - 65535 x 1170 / 32768 = 2339
  - Can be adopted if the round-to-nearest-integer rule is used

Simple Idea But Bad Result
- Idea
  - For a positive integer, repeatedly subtract the divisor from the dividend, and check whether the result has become negative
- Result
  - With dividend 0x8000 and divisor 0x8, one division takes 0x1000 (4096) iterations
  - Very inefficient

Introduction to the Algorithms
- Restoring algorithms
  - “Grammar School Algorithm Ver.1” [1]
  - “Grammar School Algorithm Ver.2” [1]
  - “Grammar School Algorithm Ver.3” [1]
  - “Algorithm Restoring Divide (RD)” [2]
  - “Algorithm Non-performing Divide (NPD)” [2]
- Non-restoring algorithm
  - “Algorithm Non-restoring Divide (NRD)” [2]

Use the Idea of Long Division: The Grammar School Algorithm
[Figure: binary long division of dividend 1001010 by divisor 1000, giving quotient 1001 and remainder 10.]

In the algorithms that follow, d^ means the divisor aligned to the MSB half.

Grammar School Algorithm Ver.1 [1]
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0 ... n
    rem = rem - d^;
    if rem >= 0
      quo << 1; quo[0] = 1;
    else
      rem = rem + d^;
      quo << 1; quo[0] = 0;
    d^ >> 1;
  endfor

Grammar School Algorithm Ver.2
[Initialize]
  rem = dividend;
  rem << 1;
[Recurrence]
  for j = 0 ... n-1
    rem = rem - d^;
    if rem >= 0
      rem << 1; quo << 1; quo[0] = 1;
    else
      rem = rem + d^;
      rem << 1; quo << 1; quo[0] = 0;
  endfor
[Correction]
  rem = rem >> 16

Grammar School Algorithm Ver.3
[Initialize]
  rem = dividend;
  rem << 1;
[Recurrence]
  for j = 0 ... n-1
    LHS(rem) = LHS(rem) - d^;
    if rem >= 0
      rem << 1; rem[0] = 1;
    else
      LHS(rem) = LHS(rem) + d^;
      rem << 1; rem[0] = 0;
  endfor
[Correction]
  quo = rem & 0xFFFF;
  rem = rem >> 17

Algorithm RD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0 ... n-1
    rem' = 2*rem - d^;
    if rem' >= 0
      rem = rem'; quo << 1; quo[0] = 1;
    else
      rem = 2*rem; quo << 1; quo[0] = 0;
  endfor

Algorithm NPD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0 ... n-1
    if (2*rem - d^) >= 0
      quo << 1; quo[0] = 1;
      rem = 2*rem - d^;
    else
      quo << 1; quo[0] = 0;
      rem = 2*rem;
  endfor

Algorithm NRD
[Initialize]
  rem = dividend;
  rem = 2*rem - d^;
[Recurrence]
  for j = 0 ... n-1
    if rem >= 0
      quo << 1; quo[0] = 1;
      rem = 2*rem - d^;
    else
      quo << 1; quo[0] = 0;
      rem = 2*rem + d^;
  endfor
[Correction]
  if rem < 0
    quo[0] = 0; rem = rem + d^;
  else
    quo[0] = 0;

Comparison between Different Versions of the Grammar School Algorithm
- Simulation results
  - Version 1: 168 cycles
  - Version 2: 161 cycles
  - Version 3: 162 cycles
- Why can't we get a significant improvement with the Version 3 algorithm?
  - The limitation arises from the latencies and delay slots of PAC
  - Thus, the other 3 algorithms (RD, NPD, NRD) can never be better

Grammar School Algorithm Ver.3 performance: 162 cycles

Conclusion
- Grammar School Algorithm Ver.2 and Ver.3 have almost the same performance because of the latencies and delay slots
  - If the latencies of the comparison instructions were lower, Ver.3 would be better
  - Ver.3 needs one more cycle to get the quotient
- Ver.2 is better than Ver.1 because fewer iterations are required
- The table-lookup method may be adopted while the implementation is still on the simulator
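The Ver.3 recurrence and the scaled-reciprocal example from the table-lookup slide can be sketched in plain C as below. This is only an illustration under stated assumptions: the names divres16_t, restoring_div16 and recip_mul_q15 are hypothetical and do not come from the slides, operands are assumed to be 16-bit unsigned with a nonzero divisor, and a 64-bit working register is used so the shifts cannot overflow. It does not model PAC instructions, latencies, delay slots or cycle counts.

```c
#include <stdint.h>

/* Hypothetical result type: quotient and remainder of a 16-bit division. */
typedef struct {
    uint16_t quo;
    uint16_t rem;
} divres16_t;

/*
 * Shift-and-subtract restoring division in the style of Grammar School
 * Algorithm Ver.3: one working register holds the partial remainder in its
 * upper part while quotient bits are shifted into its lower 16 bits.
 * Assumption: divisor != 0 (the result is meaningless otherwise).
 */
divres16_t restoring_div16(uint16_t dividend, uint16_t divisor)
{
    uint64_t reg = (uint64_t)dividend << 1;    /* [Initialize] rem << 1 */
    uint64_t d_hat = (uint64_t)divisor << 16;  /* divisor aligned to the upper half */
    divres16_t out;
    int j;

    for (j = 0; j < 16; j++) {
        if (reg >= d_hat) {
            /* Subtraction would stay non-negative: keep it, shift in a 1. */
            reg = ((reg - d_hat) << 1) | 1u;
        } else {
            /* Subtraction would go negative: leave rem as is, shift in a 0. */
            reg <<= 1;
        }
    }

    /* [Correction] take the quotient first, then undo the extra shift. */
    out.quo = (uint16_t)(reg & 0xFFFFu);
    out.rem = (uint16_t)((reg >> 17) & 0xFFFFu);
    return out;
}

/*
 * Division by a constant via a precomputed reciprocal scaled by 32768.
 * recip_q15 is assumed to be 32768/divisor stored as an integer
 * (1170 for divisor 28). If round_nearest is nonzero, half of the scale
 * is added before the shift, which rounds to the nearest integer.
 */
uint32_t recip_mul_q15(uint16_t dividend, uint16_t recip_q15, int round_nearest)
{
    uint32_t prod = (uint32_t)dividend * recip_q15;
    if (round_nearest) {
        prod += 1u << 14;
    }
    return prod >> 15;
}
```

For the slide's example, restoring_div16(0xFFFF, 28) gives quotient 2340 and remainder 15, while recip_mul_q15(0xFFFF, 1170, 0) gives 2339 and recip_mul_q15(0xFFFF, 1170, 1) gives 2340, which matches the remark that the table-lookup method is usable with rounding to the nearest integer. Comparing against the aligned divisor before subtracting avoids the explicit restore step; it is equivalent to the subtract-test-restore form written in the pseudocode.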