IMPLEMENTATION OF FINITE FIELD INVERSION Debdeep Mukhopadhyay Chester Rebeiro Dept. of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA Finite Field Inverse 23-27 May 2011 Anurag Labs, DRD0 2 Itoh-Tsujii Method for Binary Fields 23-27 May 2011 Anurag Labs, DRD0 3 The Steps 23-27 May 2011 Anurag Labs, DRD0 4 How do we do a Squaring Consider (again) the field GF(24), with irreducible polynomial x4+x+1. What is (x3+x2+1)2 in this field ? 23-27 May 2011 Anurag Labs, DRD0 5 Squaring Squaring can be represented in the form of a matrix multiplication T.a 23-27 May 2011 Anurag Labs, DRD0 6 Quad Operation Quad operation can be done by two squaring operations. Quad operation can be written in the form T2.a 23-27 May 2011 Anurag Labs, DRD0 7 Advantage of using Quad Operations Quad circuits have better LUT utilization compared to Squarer circuits 23-27 May 2011 Anurag Labs, DRD0 8 Generalization of the Itoh-Tsujii Algorithm 23-27 May 2011 Anurag Labs, DRD0 9 Theorem 1 23-27 May 2011 Anurag Labs, DRD0 10 Theorem 2 23-27 May 2011 Anurag Labs, DRD0 11 Quad Itoh-Tsujii Inversion Algorithm 23-27 May 2011 Anurag Labs, DRD0 12 A Circuit for Inversion At every clock cycle, either the multiplier or the quadblock is active. The output of the multiplier is stored in mout register 23-27 May 2011 Anurag Labs, DRD0 13 Finding the Inverse 23-27 May 2011 Anurag Labs, DRD0 14 Finding the Inverse Step 2 23-27 May 2011 Anurag Labs, DRD0 15 Finding the Inverse Step 2 23-27 May 2011 Anurag Labs, DRD0 16 Control Signals for the Inverse 23-27 May 2011 Anurag Labs, DRD0 17 Performance Charts 23-27 May 2011 Anurag Labs, DRD0 18 Higher Powered Itoh-Tsujii • We seen that Quad circuits utilize LUTs in a better way compared to squarer circuits. • Also LUT size is increasing as silicon technology reduces • We have seen 4-LUT become 6-LUT, and now 8-LUT • This gives us a motivation to investigate using higher powers other than quad circuits 23-27 May 2011 Anurag Labs, DRD0 19 Revisiting the Theorems 23-27 May 2011 Anurag Labs, DRD0 20 2n Itoh-Tsujii Inversion These are the overheads Higher Powered 23-27 May 2011 Anurag Labs, DRD0 21 Overhead in 2n Itoh-Tsujii • Computation of . • Using addition chain for , can be computed in cycles, where is the length of addition chain for . • Computation of , for • Using addition chain for , that contains computed during computation, because 23-27 May 2011 clock Anurag Labs, DRD0 , can be . 22 2n Itoh-Tsujii Design 23-27 May 2011 Anurag Labs, DRD0 23 Building the Optimal Design For a given field and a given FPGA how do decide the optimal design ? Configurable Parameters • Addition chain. • Power circuit used in power block. • Number of cascaded power circuits in the power block. • These have an effect on – Number of clock cycles. – Critical path delay. 23-27 May 2011 Anurag Labs, DRD0 24 Estimating AREA required on an FPGA • A k input LUT (k-LUT) can implement any functionality of maximum k input variables. • Total number of k-LUTs to implement a function with variables can be expressed as 23-27 May 2011 Anurag Labs, DRD0 25 Estimating Delay of a Design in an FPGA • Delay in FPGAs comprise of LUT delay and routing delay.. • For this ITA architecture, we have experimentally found, total delay is proportional to number of LUTs in critical path. • We denote number of LUTs in a delay path as maxlutpath. • In k-LUT, maxlutpath of an variable function is 23-27 May 2011 Anurag Labs, DRD0 26 Recap : Karatsuba Multiplier 23-27 May 2011 Anurag Labs, DRD0 27 Hybrid Karatsuba Multiplier for GF(2233) Note that the school book multiplier has replaced the general Karatsuba Multiplier School Book Multiplier 23-27 May 2011 Anurag Labs, DRD0 28 Estimating LUT Requirement for Hybrid Karatsuba Multiplier • The field multiplier is a hybrid Karatsuba multiplier. • A bit hybrid Karatsuba multiplier consists of two bit and one bit multipliers. This happens in recursive manner. • In threshold ( ) level, School-Book multiplier is invoked. • Total area of bit hybrid Karatsuba multiplier is given by • Total area for the School-Book multiplier is 23-27 May 2011 Anurag Labs, DRD0 29 Estimating Delay of Hybrid Karatsuba Multiplier • The hybrid Karatsuba multiplier is distributed in smaller multipliers like a tree. Height of the tree is • Each level of the Simple Karatsuba tree introduces one LUT delay. • In threshold ( ) level, School-Book multiplier delay is added. • Delay of School-Book multiplier is • Delay of the entire multiplier in LUTs is given by 23-27 May 2011 Anurag Labs, DRD0 30 Estimating Area & Delay for Modular Reduction • For fields generated by trinomials, area of modular reduction is almost equal to field size and delay is one LUT considering LUT size . • For fields generated by pentanomials, – and 2 LUT for – and 2 LUT for 23-27 May 2011 . . Anurag Labs, DRD0 31 Area & Delay Estimates for 2n Circuit • The output of a 2n circuit, which raises an input be expressed as , where is and , • LUT requirement per output bit • Total LUT requirement for the 2n circuit is can binary field matrix is • LUT delay per output bit is • Since all bits are in parallel, delay of 2n circuit is 23-27 May 2011 Anurag Labs, DRD0 32 Area & Delay Estimates for Multiplexer • For a 2s : 1 MUX, there are s selection lines and thus the output is a function of 2s + s variables. • For a MUX in , each of the 2s input lines is of width m bits. • Total LUT requirement is • Total LUT delay of the MUX is • When upper bound 23-27 May 2011 number of inputs to MUX Anurag Labs, DRD0 , the above gives a close 33 Area & Delay of PowerBlock • • • • Let the Powerblock contains us number of cascaded 2n circuits. The has selection lines, where LUT requirement for is Total LUT requirement for Powerblock is • Delay of is • Total LUT delay of Powerblock in 23-27 May 2011 Anurag Labs, DRD0 34 Area & Delay for the Entire Architecture • LUT estimate for the entire architecture is • There are two parallel delay paths. – LUT delay of first path is – LUT delay of second path is – LUT delay of entire architecture is 23-27 May 2011 Anurag Labs, DRD0 35 Optimal Number of Cascades • For a given field and based FPGA, Powerblock can be configured with different power circuits and cascades . • Increase in reduces clock cycles, but increases delay of Powerblock. • is fixed, but depends on and . • is minimum when • Minimum delay of the ITA architecture is thus 23-27 May 2011 Anurag Labs, DRD0 36 Power Circuit Selection to achieve Minimum Clock Cycles • Number of clock cycles for the inversion can be approximated as • • • • Number of clock cycles for increases linearly with . The term reduces with increase in . When is small, the reduction in is significant for increase in . But, for large values of n, the increase in dominates over the decrease in • So, increases with increase in for large values of . 23-27 May 2011 Anurag Labs, DRD0 37 Tuning Design for Optimality • The performance metric is • Minimization of without increasing gives best performance. Area remains almost same. • The following steps are performed to achieve optimal performance • The optimal architecture is given by 23-27 May 2011 Anurag Labs, DRD0 38 Validation of Theoretical Estimates • Our estimation model uses maxlutpath to find LUT delay. • Routing delay is difficult to model in FPGAs. • To get overall delay, we have used experimental results for a reference ITA architecture. • Total delay of reference architecture is the • Let LUT delay of reference architecture is • Total delay of any other ITA architecture in the same field is approximately • Here is a constant and depends on FPGA technology. • In 4-LUT based and 6-LUT based Xilinx FPGAs, has values 0.2 and 0.1 respectively. 23-27 May 2011 Anurag Labs, DRD0 39 Validation on 4-input LUT FPGAs 23-27 May 2011 Anurag Labs, DRD0 40 Validation on 6-input LUT FPGAs 23-27 May 2011 Anurag Labs, DRD0 41 Experimental Results 23-27 May 2011 Anurag Labs, DRD0 42 Comparison Charts 23-27 May 2011 Anurag Labs, DRD0 43
© Copyright 2026 Paperzz