High Speed VLSI Architecture for General Linear Feedback Shift Register (LFSR) Structures

ABSTRACT

A linear feedback shift register (LFSR) is a shift register whose input bit is a linear function of its previous state. The only linear function of single bits is exclusive-or (XOR), thus it is a shift register whose input bit is driven by the XOR of some bits of the overall shift register value. The initial value of the LFSR is called the seed, and because the operation of the register is deterministic, the stream of values produced by the register is completely determined by its current (or previous) state. Likewise, because the register has a finite number of possible states, it must eventually enter a repeating cycle. LFSR structures are widely used in digital signal processing and communication systems, for example in BCH and CRC circuits. Many common functions such as scrambling, convolutional coding and CRC, and even CORDIC or the Fast Fourier Transform, can be expressed as LFSR structures. In high-rate digital systems such as optical communication systems, a throughput of 1 Gbps is usually desired. The serial input/output operation of the LFSR structure is a bottleneck in such systems, so a parallel LFSR architecture is required. This work presents an improved three-step high-speed VLSI architecture for LFSR structures, offering both higher hardware efficiency and higher speed. The architecture can be applied to any LFSR structure for high-speed parallel implementation.

1. Introduction

1.1 Error Coding

In recent years there has been an increasing demand for digital transmission and storage systems. This demand has been accelerated by the rapid development and availability of VLSI technology and digital processing. It is frequently the case that a digital system must be fully reliable, as a single error may shut down the whole system or cause unacceptable corruption of data, e.g. in a bank account. In situations such as this, error control must be employed so that an error may be detected and afterwards corrected. The simplest way of detecting a single error is a parity checksum, which can be implemented using only exclusive-or gates. But in some applications this method is insufficient and a more sophisticated error control strategy must be implemented. If a transmission system can transfer data in both directions, an error control strategy may be built around detecting an error and then, if an error has occurred, retransmitting the corrupted data. These systems are called automatic repeat request (ARQ) systems. If transmission takes place in only one direction, e.g. information recorded on a compact disc, the only way to accomplish error control is with forward error correction (FEC). In FEC systems some redundant data is concatenated with the information data in order to allow for the detection and correction of the corrupted data without having to retransmit it. One of the most important classes of FEC codes is linear block codes. In block codes, data is transmitted and corrected within one block (codeword). That is, the data preceding or following a transmitted codeword does not influence the current codeword. Linear block codes are described by the integer n, the total number of symbols in the associated codeword. Block codes are also described by the number k of information symbols within a codeword, and the number of redundant (check) symbols n - k. In error control, it is crucial to understand the sources of errors.
Each transmitted bit has probability p > 0 of being received incorrectly. On memoryless channels every transmitted symbol may be considered independently, so only random errors occur. Unfortunately, most channels have memory, and usually several successive symbols are corrupted. These kinds of errors are called burst errors [29]. Burst errors can be most efficiently corrected through use of burst error correcting codes, e.g. Reed-Solomon (RS) codes [44]. Because the structure of burst error correcting codes is usually complicated, multiple random error correcting codes are often employed. In order to improve burst error correction, the transmitted codewords are also rearranged by interleaving. The resulting code is called an interleaved code. In this way the burst errors scatter into several codewords and look like random errors. Other operations on block codes are also available to improve the error correcting ability or to adapt a code to a specified requirement. For example codes may be shortened, extended, concatenated or interleaved [2,5]. The simplest block codes are Hamming codes. They are capable of correcting only one random error and therefore are not practically useful, unless a simple error control circuit is required. More sophisticated error correcting codes are the Bose, Chaudhuri and Hocquenghem (BCH) codes, which are a generalisation of the Hamming codes for multiple-error correction. In this thesis the subclass of binary, random error correcting BCH codes is considered, hereafter called BCH codes. BCH codes operate over finite fields or Galois fields. The mathematical background concerning finite fields is well specified, and in recent years the hardware implementation of finite field arithmetic has been extensively studied. Furthermore, any BCH code can be defined by only two fundamental parameters, and these parameters can be selected by the designer. These parameters are crucial to the design, and the question arises whether it is possible to develop a tool that will automatically generate any BCH codec description, just by providing the code size n and the number of errors to be corrected t. This design automation would considerably reduce BCH codec design cost and time, and increase the ease with which BCH codecs with different design parameters are generated. This is an important motivation, since the architectures of BCH codecs with different parameters can vary remarkably.

1.2 Hardware solutions

BCH codes employ sophisticated algorithms and their implementation is rather burdensome. The safe solution, both in terms of cost and time, is a software solution. But as BCH codes operate over finite fields, a standard microprocessor's arithmetic is not suitable, and a software solution is therefore rather slow. Another kind of solution is to employ a specialist digital signal processing (DSP) unit, but this option requires rather expensive and sophisticated hardware and can be adopted only when a small number of devices is to be produced. Overall, software solutions are therefore slower, consume more power and are less reliable than hardware implementations. In recent years the Programmable Logic Device (PLD) has been developed and the PLD subclass of Field Programmable Gate Arrays (FPGAs) has been introduced. This has revolutionised hardware design and its implementation. The advantages of an FPGA solution are as follows. The FPGA is fully reprogrammable, and a design can be automatically converted from the gate level into the layout structure by the place and route software.
Therefore design changes can be made almost as easily as software ones. Simulation at the layout level, where the design is tied to the internal FPGA structure, is also possible (back annotation). This enables not only the logical functionality but also the timing characteristics of the design to be simulated. Xilinx Inc. offers a wide range of components [55]. For example the XC3000 family offers 1,300 to 9,000 gate complexity and 256 to 1,320 flip-flops, so even a relatively complex design can be implemented. (A range of other manufacturers also market FPGA devices, including Actel and Altera.) In conclusion, a hardware solution can be easily implemented, and the differences between hardware and software solutions have become blurred. Unfortunately, although FPGA solutions are easy to introduce and verify, they are rather expensive and therefore not economical for mass production. In this case, a full or semi-custom Application Specific Integrated Circuit (ASIC) might be more appropriate. An ASIC solution is more complex and its implementation takes much longer than an FPGA. On the other hand, although an ASIC is characterised by high starting costs, it allows for a lower cost per chip in mass production. However an ASIC solution cannot be modified easily or cheaply, due to the high cost of layout masks and the long time required for their development.

1.3 Verilog HDL and synthesis

The development of VLSI and PLDs has stimulated a demand for a hardware description language (HDL) with a well-defined syntax and semantics. This requirement led to the development of languages such as the Verilog Hardware Description Language [1,25,26]. VERILOG describes a digital circuit as a set of design modules. A module contains an input/output port interface and a description of the module's behaviour or structure. The language supports different data classes, namely parameters (constants), variables and nets, and there are also different data formats available, for instance bits, integers and real numbers. VERILOG also supports numerous operators on these data types, such as addition, multiplication, exponentiation and modulo reduction [26]. VERILOG offers the opportunity for design at different levels. This is a crucial feature of the language, as it enables design partitioning and simulation at different levels; thus the design can be hierarchical. In addition, VERILOG allows a design to be described in different domains [25,34]. There are three different domains for describing digital systems. The behavioural domain describes the system without stating how the specified functionality is to be implemented. The structural domain describes a network of components. The physical domain describes how a system is actually to be built. VERILOG models of digital systems can be written in each of these domains. These models can then be simulated using Electronic Computer Aided Design (ECAD) tools. VERILOG has subsequently become a standard [26] and has been widely adopted throughout the electronics industry. ECAD tools have long been available which convert gate level descriptions of circuits into descriptions which can be accepted by ASIC manufacturers. One of the key recent developments has been the design of automatic synthesis tools which convert higher level textual descriptions of digital circuits into lower level or gate level descriptions. These synthesis tools therefore allow high level descriptions of circuits to be translated into hardware much more quickly and cheaply than was previously the case.
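As a minimal illustration of the module construct (an added sketch; the module and signal names are illustrative and not taken from the codec generator described later), the following behavioural VERILOG description specifies only the port interface and the required function, leaving the gate-level implementation to a synthesis tool:

    module parity_gen (
      input  wire [7:0] data,    // input port: one data byte
      output wire       parity   // output port: even parity over the byte
    );
      // Behavioural description: the XOR reduction states what is
      // computed, not how the XOR tree is to be built.
      assign parity = ^data;
    endmodule

A structural description of the same circuit would instead instantiate and interconnect individual XOR gate components.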
By virtue of being a standard, there are numerous proprietary VERILOG synthesis tools available. Synthesis may be considered as high level, logic level or layout level synthesis, depending on the level of abstraction involved. The highest level of design abstraction is the system level, where the design specification and performance are defined and a system is described, for example, as a set of processors, memories, controllers and buses. Below this is the algorithmic level, where the focus is on data structures and the computations performed by individual processors. Next comes the register transfer level (RTL), where the system is viewed as a set of interconnected storage elements and functional blocks. Below this is the logic level, where the system is described as a network of gates and flip-flops. The lowest level of abstraction is the circuit level, which views the system in terms of the individual transistors or other elements of which it is composed. High level synthesis [36] takes place at the algorithmic level and the RTL. Usually there are different structures that can be used to realise a given behaviour, and one of the tasks of high level synthesis is to find the structure that best realises that behaviour. There are a number of reasons why high level synthesis should be considered. For example, high level synthesis reduces design times and allows for the possibility of searching the design space for different trade-offs between cost, speed and power. Unfortunately, in practice high level synthesis tools are rather difficult to develop. Furthermore, a hand-crafted design is often more hardware efficient. As a result, the design is usually synthesised at a lower level of abstraction. Logic level synthesis is much simpler because the digital blocks have already been determined; therefore one of the most important aspects of this process is optimisation. Logic synthesis is often associated with a target technology, because the final logic form for different technologies is different. The intention at this level may also be to minimise the delay through the circuit [9] and/or to minimise the hardware requirements [11]. This task may be even more complicated, as only a few signals may be optimised with respect to time delay whilst others may be required to have reduced hardware levels. Layout level synthesis has been carried out for many years now [18] and is well understood. For example the place and route software associated with Xilinx FPGA devices can be considered to carry out layout level synthesis. One of the most significant problems for a synthesis tool is that the number of possible solutions increases rapidly with an increase in logic complexity. Usually synthesis problems are NP-complete, that is, the synthesis execution time grows exponentially with the size of the problem. Therefore the time required to find the best solution is usually considerable. Consequently, algorithms producing inexact but close to optimum solutions are employed - so-called heuristics [13]. Design synthesis is a very powerful tool, in theory saving a considerable amount of design time, as the design need not be developed at the gate level but instead at a higher level. In addition, the synthesis tool optimises the final design according to the specified technology and predefined criteria such as minimum area and speed. Unfortunately, synthesis tools are very complex and difficult to develop.
Various commercial synthesis tools are available, usually operating at the RTL, but seldom higher. The problem for a BCH codec designer therefore is that he has to have a high level of understanding of BCH codes before he can write these RTL descriptions in the first place. It is therefore the aim of this project to develop a high level synthesis tool for the design of BCH codecs. This tool will accept the parameters n and t of a BCH code and then generate the VERILOG description of the resulting BCH encoder and decoder. These VERILOG descriptions will be written at the RTL/logic level to facilitate their synthesis to gate level using a standard synthesis tool.

1.4 Overview of thesis

The structure of this thesis is as follows. Chapter 2 presents finite fields and their arithmetic. It considers how to construct finite fields, and bit-serial and bit-parallel multipliers for the dual, normal and polynomial bases. In addition, finite field inversion and exponentiation are considered and a new approach for raising field elements to the third power is presented. This chapter further presents a new hardware-efficient architecture generating the sum of products, and a new dual-polynomial basis multiplier. Chapter 3 introduces BCH codes, and algorithms for encoding and decoding BCH codes are presented. Chapter 4 describes the BCH codec synthesis system.

2. Finite Fields and Field Operators

2.1 Introduction

In this chapter finite fields and finite field arithmetic operators are introduced. The definitions and main results underlying finite field theory are presented and it is shown how to derive extension fields. The various finite field arithmetic operators are reviewed. In addition, new circuits are presented carrying out frequently used arithmetic operations in decoders. These operators are shown to have faster operating speeds and lower hardware requirements than their equivalents, and consequently have been used extensively throughout this project.

Finite fields

Error control codes rely to a large extent on powerful and elegant algebraic structures called finite fields. A field is essentially a set of elements in which it is possible to add, subtract, multiply and divide field elements and always obtain another element within the set. A finite field is a field containing a finite number of elements. A well known example of a field is the infinite field of real numbers.

2.2 Field definitions and basic features

The concept of a field is now more formally introduced. A field F is a non-empty set of elements with two operators usually called addition and multiplication, denoted '+' and '*' respectively. For F to be a field a number of conditions must hold:

1. Closure: For every a, b in F
   c = a + b;  d = a * b;   (2.1)
   where c, d ∈ F.

2. Associative: For every a, b, c in F
   a + (b + c) = (a + b) + c  and  a * (b * c) = (a * b) * c.   (2.2)

3. Identity: There exists an identity element '0' for addition and '1' for multiplication that satisfy
   0 + a = a + 0 = a  and  a * 1 = 1 * a = a   (2.3)
   for every a in F.

4. Inverse: If a is in F, there exist elements b and c in F such that
   a + b = 0,  a * c = 1.   (2.4)
   Element b is called the additive inverse, b = (-a); element c is called the multiplicative inverse, c = a^-1 (a ≠ 0).

5. Commutative: For every a, b in F
   a + b = b + a  and  a * b = b * a.   (2.5)

6. Distributive: For every a, b, c in F
   (a + b) * c = a * c + b * c.   (2.6)

The existence of a multiplicative inverse a^-1 enables the use of division. This is because for a, b, c ∈ F, c = b/a is defined as c = b * a^-1. Similarly, the existence of an additive inverse (-a) enables the use of subtraction. In this case for a, b, c ∈ F, c = b - a is defined as c = b + (-a).
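As a brief illustration of conditions 4-6 (a worked example added here, anticipating the prime fields introduced below), consider arithmetic modulo 5 on the set {0, 1, 2, 3, 4}. The additive inverse of 2 is 3, since 2 + 3 = 5 ≡ 0 (mod 5), and the multiplicative inverse of 2 is also 3, since 2 * 3 = 6 ≡ 1 (mod 5). Division by 2 is therefore multiplication by 3; for example 4/2 = 4 * 3 = 12 ≡ 2 (mod 5), as expected.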
It can be shown that the set of integers {0, 1, 2, ..., p-1}, where p is a prime, together with modulo p addition and multiplication forms a field [30]. Such a field is called the finite field of order p, or GF(p), in honour of Evariste Galois [48]. In this thesis only binary arithmetic is considered, where p is constrained to equal 2. This is because, as shall be seen, by starting with GF(2) the representation of finite field elements maps conveniently into the digital domain. Arithmetic in GF(2) is therefore defined modulo 2. It is from the base field GF(2) that the extension field GF(2^m) is generated.

2.2.1 The extension field GF(2^m)

Before introducing GF(2^m), some definitions are required. A polynomial p(x) of degree m over GF(2) is a polynomial of the form

p(x) = p0 + p1x + p2x^2 + ... + pmx^m   (2.7)

where the coefficients pi are elements of GF(2) = {0,1}. Polynomials over GF(2) can be added, subtracted, multiplied and divided in the usual way [29]. A useful property of polynomials over GF(2) is that ([29], pp. 29)

p^2(x) = (p0 + p1x + ... + pnx^n)^2 = p0 + p1x^2 + ... + pnx^2n = p(x^2).   (2.8)

The notion of an irreducible polynomial is now introduced.

Definition 2.1. A polynomial p(x) over GF(2) of degree m is irreducible if p(x) is not divisible by any polynomial over GF(2) of degree less than m and greater than zero.

To generate the extension field GF(2^m), an irreducible, monic polynomial of degree m over GF(2) is chosen, p(x) say. Then the set of 2^m polynomials of degree less than m over GF(2) is formed and denoted F. It can then be proven that when addition and multiplication of these polynomials is taken modulo p(x), the set F forms a field of 2^m elements, denoted GF(2^m) [30]. Note that GF(2^m) is extended from GF(2) in an analogous way to that in which the complex numbers C are formed from the real numbers R, where in that case p(x) = x^2 + 1. To represent these 2^m field elements, the important concept of a basis is now introduced.

2.2.2 The polynomial basis and primitive elements

Definition 2.2. A set of m linearly independent elements {γ0, γ1, ..., γm-1} of GF(2^m) is called a basis for GF(2^m).

A basis for GF(2^m) is important because any element a ∈ GF(2^m) can be represented uniquely as a weighted sum of these basis elements over GF(2). That is

a = a0γ0 + a1γ1 + ... + am-1γm-1,  ai ∈ GF(2).   (2.9)

Hence the field element a can be denoted by the vector (a0, a1, ..., am-1). This is why the restriction p = 2 has been made, since the above representation maps immediately into the binary field. There are a large number of possible bases for any GF(2^m) [30]. One of the more important bases is now introduced.

Definition 2.3. Let p(x) be the defining irreducible polynomial for GF(2^m). Take α as a root of p(x); then A = {1, α, ..., α^(m-1)} is the polynomial basis for GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1. Take α as a root of p(x); then A = {1, α, α^2, α^3} forms the polynomial basis for this field and all 16 elements can be represented as

a = a0 + a1α + a2α^2 + a3α^3   (2.10)

where the ai ∈ GF(2). These basis coefficients can be stored in a basis table of the kind shown in Appendix B.
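As a brief worked illustration of arithmetic modulo p(x) (an example added here), consider multiplying the elements x^3 + 1 and x + 1 of GF(2^4) with p(x) = x^4 + x + 1. Ordinary polynomial multiplication over GF(2) gives (x^3 + 1)(x + 1) = x^4 + x^3 + x + 1. Since p(x) = 0 in the field, x^4 may be replaced by x + 1, giving (x + 1) + x^3 + x + 1 = x^3. The product is therefore the field element x^3, i.e. the coefficient vector (0, 0, 0, 1).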
Definition 2.4. An irreducible polynomial p(x) of degree m is a primitive polynomial if the smallest positive integer n for which p(x) divides x^n + 1 is n = 2^m - 1.

If α is a root of p(x), where this polynomial is not only irreducible but also primitive, then GF(2^m) can be represented alternatively by the set of elements GF(2^m) = {0, 1, α, α^2, ..., α^(n-1)}, (n = 2^m - 1). In this case α is called a primitive element and α^n = 1. The relationship between powers of primitive elements and the polynomial basis representation of GF(2^4) is also shown in Appendix B. The choice as to whether to represent field elements over a basis or as powers of a primitive element usually depends on whether a hardware or a software implementation is being adopted. This is because α^i * α^j = α^(i+j), where the addition of indices is modulo 2^m - 1 and so can easily be carried out on a general purpose computer. Multiplication of field elements using the primitive element representation is therefore simple to implement in software, but addition is much more difficult. For implementation in hardware however, a basis representation of field elements makes addition relatively straightforward to implement. This is because

a = b + c = (b0 + b1α + ... + bm-1α^(m-1)) + (c0 + c1α + ... + cm-1α^(m-1))
  = (b0 + c0) + (b1 + c1)α + ... + (bm-1 + cm-1)α^(m-1)   (2.11)

and so addition is performed component-wise modulo 2. Hence a GF(2^m) adder circuit comprises 1 or m XOR gates, depending on whether the basis coefficients are represented in series or in parallel. This is an important feature of GF(2^m) and one of the main reasons why finite fields of this form are so extensively used.

2.2.3 The Dual Basis

The dual basis is an important concept in finite field theory and was originally exploited to allow for the design of hardware efficient RS encoders [3]. However subsequent research has allowed the use of dual basis multipliers to be adopted throughout the encoding and decoding processes.

Definition 2.5. [15] Let {λi} and {μi} be bases for GF(2^m), let f be a linear function from GF(2^m) to GF(2), and let β ∈ GF(2^m), β ≠ 0. Then {λi} and {μi} are dual to each other with respect to f and β if

f(βλiμj) = 1 if i = j, and f(βλiμj) = 0 if i ≠ j.   (2.12)

In this case, {λi} is the standard basis and {μi} is the dual basis.

Theorem 2.1. [15] Every basis has a dual basis with respect to any non-zero linear function f: GF(2^m) → GF(2), and any non-zero β ∈ GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1 and take α as a root of p(x). Then {1, α, α^2, α^3} is the polynomial basis for the field. Now taking β = 1 and f to be the least significant polynomial basis coefficient, {1, α^3, α^2, α} forms the dual basis to the polynomial basis. In fact, by varying β there are 2^m - 1 dual bases to any given basis, and the dual basis with the most attractive characteristics can be taken. This is usually taken to mean the dual basis that can be obtained from the polynomial basis with the simplest linear transformation [38].

2.2.4 Normal basis

A normal basis for GF(2^m) is a basis of the form B = {β, β^2, β^4, ..., β^(2^(m-1))} where β ∈ GF(2^m). For every finite field there always exists at least one normal basis [30]. Normal basis representations of field elements are especially attractive in situations where squaring is required, since if (a0, a1, ..., am-1) is the normal basis representation of a ∈ GF(2^m), then (am-1, a0, a1, ..., am-2) is the normal basis representation of a^2 [31]. This property is important in its own right, but also because it allows for hardware efficient Massey-Omura multipliers to be designed. The normal basis representation of GF(2^4) is given in Appendix B.
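The hardware simplicity of both operations can be seen in the following combinational VERILOG sketch (an added illustration; the parameterisation and signal names are assumptions, and the squaring output is valid only for a normal basis representation):

    module gf_add_sqr #(parameter M = 4) (
      input  wire [M-1:0] a, b,   // bit i holds basis coefficient ai (bi)
      output wire [M-1:0] sum,    // a + b: component-wise addition mod 2, equ(2.11)
      output wire [M-1:0] asqr    // a^2 over a normal basis: a pure rewiring
    );
      assign sum  = a ^ b;                 // m parallel XOR gates
      assign asqr = {a[M-2:0], a[M-1]};    // (a0,...,am-1) -> (am-1,a0,...,am-2)
    endmodule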
2.3 Multiplication by a constant α^j

It is frequently required to carry out multiplication by a constant value in encoders and decoders. This can be accomplished using two-variable input multipliers of the type described later. Alternatively, it is often beneficial to employ a multiplier designed specifically for this task ([29], p. 162) ([30], p. 89). Let a = a0 + a1α + ... + am-1α^(m-1) be an element in GF(2^m), where α is a root of the primitive polynomial p(x) = x^m + Σ_{j=0}^{m-1} pj x^j. Thus

a * α = a0α + a1α^2 + ... + am-1α^m   (2.13)

but since p(α) = 0,

a * α = a0α + a1α^2 + ... + am-2α^(m-1) + am-1(p0 + p1α + p2α^2 + ... + pm-1α^(m-1))   (2.14)

which is equivalent to a * α mod p(α). For example, consider multiplication by α in GF(2^4), where p(x) = x^4 + x + 1. Then

a * α = a3 + (a3 + a0)α + a1α^2 + a2α^3   (2.15)

and this multiplication can be carried out with the circuit of Fig. 2.1.

Figure 2.1. Circuit for computing a ← a * α in GF(2^4).

If the above register is initialised by Ai = ai (i = 0,1,2,3), then by clocking the register once the value of a * α is generated. This algorithm may be readily extended for multiplication by α^j, where j is any integer, and for any GF(2^m).
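A behavioural VERILOG model of this circuit (a sketch added for illustration; port names are assumptions) makes the hardwired feedback of Fig. 2.1 explicit:

    module mul_const_alpha (
      input  wire       clk, load,
      input  wire [3:0] a_in,   // polynomial basis coefficients a0..a3
      output reg  [3:0] a       // one clock after loading: a * alpha
    );
      always @(posedge clk)
        if (load) a <= a_in;
        else      a <= {a[2], a[1], a[0] ^ a[3], a[3]};  // equ(2.15)
    endmodule

Clocking the register j times multiplies by α^j; for a fixed constant α^j the same permute-and-XOR network can simply be composed j times and hardwired.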
2.4 Bit-serial multiplication

The most commonly implemented finite field operations are multiplication and addition. Multiplication is considered to be an order of magnitude more complicated than addition, and a large body of research has been carried out attempting to reduce the hardware and time complexities of multiplication. Finite field adders and multipliers can be classified according to whether they are bit-serial or bit-parallel, that is, whether the m bits representing field elements are processed in series or in parallel. Whereas bit-serial multipliers generally require less hardware than bit-parallel multipliers, they also usually require m clock cycles to generate a product rather than one. Hence in time critical applications bit-parallel multipliers must be implemented, in spite of the increased hardware overheads.

2.4.1 Berlekamp multipliers

The Berlekamp multiplier [3] uses two basis representations: the polynomial basis for the multiplier and the dual basis for the multiplicand and the product. Because it is normal practice to input all data in the same basis, this means some basis transformation circuits will be required. Fortunately, for m = 3, 4, 5, 6, 7, 9, 10 the basis conversion from the dual to the polynomial basis - and vice versa - is merely a reordering of the basis coefficients [38]. For the important case m = 8 - for example the error-correcting systems used in CDs, DAT and many other applications operate over GF(2^8) - this basis conversion requires a reordering and two additions of the basis coefficients (Appendix C). In practice therefore, two additional XOR gates are required. Even including the extra hardware for basis conversions, the Berlekamp multiplier is known to have the lowest hardware requirements of all available bit-serial multipliers [28]. Now let a, b, c ∈ GF(2^m) such that c = a * b, and represent b over the polynomial basis as b = Σ_{k=0}^{m-1} bk α^k. Further, and following Definition 2.5, let {μ0, μ1, ..., μm-1} be the dual basis to the polynomial basis for some f and β. Hence a = Σ_{i=0}^{m-1} ai μi and c = Σ_{i=0}^{m-1} ci μi, where these values of ai and ci are given by the following.

Lemma 2.1 [15]. Let {μ0, μ1, ..., μm-1} be the dual basis to the polynomial basis for GF(2^m) for some f and β, and let a = Σ_{i=0}^{m-1} ai μi be the dual basis representation of a ∈ GF(2^m). Then ai = f(βaα^i) for (i = 0, 1, ..., m-1).

The multiplication c = a * b can therefore be represented in the matrix form [15]

[ a0    a1    ...  am-1  ] [ b0   ]   [ c0   ]
[ a1    a2    ...  am    ] [ b1   ] = [ c1   ]
[ ...   ...   ...  ...   ] [ ...  ]   [ ...  ]
[ am-1  am    ...  a2m-2 ] [ bm-1 ]   [ cm-1 ]   (2.16)

where ai = f(βaα^i) and ci = f(βcα^i) (i = 0, 1, ..., m-1) are the dual basis coefficients of a and c respectively, and ai = f(βaα^i) for (i = m, m+1, ..., 2m-2). It can be shown [15] that

am+k = Σ_{j=0}^{m-1} pj * aj+k   (k = 0, 1, ...)   (2.17)

where the pj are the coefficients of p(x). These values of am+k can therefore be obtained from an m-stage linear feedback shift register (LFSR) where the feedback terms correspond to the pj coefficients and the LFSR is initialised with the dual basis coefficients of a. On clocking the LFSR am is generated, then on the next clock cycle am+1 is produced, and so on. The m vector multiplications listed in equ(2.16) are then carried out by a structure comprising m 2-input AND gates and (m-1) 2-input XOR gates. As an example, a Berlekamp multiplier for GF(2^4) is shown in Fig. 2.2, where p(x) = x^4 + x + 1.

Figure 2.2. Bit-serial Berlekamp multiplier for GF(2^4).

The registers in Fig. 2.2 are initialised by Ai = ai and Bi = bi for (i = 0,1,2,3). At this point the first product bit c0 is available on the output line. The remaining values of c1, c2 and c3 are obtained by clocking the register a further three times. With the above scheme at least one basis conversion is required if both inputs and the output are to be represented over the same basis. This basis transformation is a linear transformation of the basis coefficients and can be implemented within the multiplier structure itself. However, with GF(2^4) the dual basis is a permutation of the polynomial basis coefficients, and so this conversion can be implemented by a simple reordering of the inputs.

2.4.2 Massey-Omura Multiplier

The Massey-Omura multiplier [31,54] operates entirely over the normal basis, and so no basis converters are required. The idea behind the Massey-Omura multiplier is that if the Boolean function generating the first product bit has its inputs cyclically shifted, then this same function will also generate the second product bit. Furthermore, with each subsequent cyclic shift a further product bit is generated. Hence instead of m Boolean functions, one Boolean function is required to generate all m product bits, but with the inputs to this function shifted each clock cycle. As an example, consider a Massey-Omura bit-serial multiplier for GF(2^4). Let α be a root of p(x) = x^4 + x + 1 and take as a normal basis for the field {α^3, α^6, α^12, α^9}. Further, let a, b, c ∈ GF(2^4) be such that c = a * b, and represent these elements over the normal basis. Then

c = c0α^3 + c1α^6 + c2α^12 + c3α^9
  = (a0α^3 + a1α^6 + a2α^12 + a3α^9) * (b0α^3 + b1α^6 + b2α^12 + b3α^9)

where

c0 = a0b2 + a1b2 + a1b3 + a2b0 + a2b1 + a3b1 + a3b3
c1 = a1b3 + a2b3 + a2b0 + a3b1 + a3b2 + a0b2 + a0b0
c2 = a2b0 + a3b0 + a3b1 + a0b2 + a0b3 + a1b3 + a1b1
c3 = a3b1 + a0b1 + a0b2 + a1b3 + a1b0 + a2b0 + a2b2.   (2.18)

From equ(2.18) only one Boolean function is required to generate c0; the remaining values of c1, c2 and c3 are obtained by adding one (modulo 4) to all of the indices. This amounts to a cyclic shift of the inputs to this Boolean function. A circuit diagram for this multiplier is given in Fig. 2.3. The registers in Fig. 2.3 are initialised by Ai = ai and Bi = bi for (i = 0,1,2,3). At this point the first product bit c0 will be available on the output line. The remaining values of c1, c2 and c3 are obtained by cyclically shifting the registers a further three times.

Figure 2.3. Bit-serial Massey-Omura multiplier for GF(2^4).
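Both bit-serial multipliers can be captured in a few lines of behavioural VERILOG. The sketches below are added illustrations for GF(2^4) with p(x) = x^4 + x + 1 and the normal basis of equ(2.18); module and signal names are assumptions. In each case the operands are parallel loaded and one product bit emerges per clock cycle, as described for Figs. 2.2 and 2.3.

    module berlekamp_mult (
      input  wire       clk, load,
      input  wire [3:0] a_dual,  // multiplicand, dual basis
      input  wire [3:0] b_poly,  // multiplier, polynomial basis (held constant)
      output wire       c_bit    // product bits c0..c3 (dual basis), one per cycle
    );
      reg [3:0] A;
      always @(posedge clk)
        if (load) A <= a_dual;
        else      A <= {A[0] ^ A[1], A[3:1]};  // LFSR: shift in a(m+k) per equ(2.17)
      assign c_bit = ^(A & b_poly);            // m AND gates, (m-1) XOR gates
    endmodule

    module massey_omura_mult (
      input  wire       clk, load,
      input  wire [3:0] a_nb, b_nb,  // normal basis {alpha^3, alpha^6, alpha^12, alpha^9}
      output wire       c_bit        // product bits c0..c3, one per cycle
    );
      reg [3:0] A, B;
      always @(posedge clk)
        if (load) begin A <= a_nb; B <= b_nb; end
        else begin A <= {A[0], A[3:1]}; B <= {B[0], B[3:1]}; end  // cyclic shifts
      // The single defining Boolean function, generating c0 from equ(2.18).
      assign c_bit = (A[0]&B[2]) ^ (A[1]&B[2]) ^ (A[1]&B[3]) ^ (A[2]&B[0])
                   ^ (A[2]&B[1]) ^ (A[3]&B[1]) ^ (A[3]&B[3]);
    endmodule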
In the case of a Massey-Omura multiplier for GF(2^4), from equ(2.18) seven 2-input AND gates and six 2-input XOR gates are required to implement the defining Boolean equation. In general there is a result that states that the defining Massey-Omura function for a GF(2^m) multiplier requires at least (2m - 1) 2-input AND gates and at least (2m - 2) 2-input XOR gates [39]. In the case of the above example, it can be seen that the GF(2^4) Massey-Omura multiplier has achieved this lower bound.

2.4.3 Polynomial basis multipliers

Polynomial basis multipliers operate entirely over the polynomial basis and require no basis converters. These multipliers are easily implemented, reasonably hardware efficient, and the time taken to produce the result is the same as for Berlekamp or Massey-Omura multipliers. In truth however, bit-serial polynomial basis multipliers are serial-in parallel-out multipliers. In some applications this results in an additional register being required and adds an extra m clock cycles to the computation time. This is the main reason why polynomial basis multipliers are frequently overlooked for use in codec design. However, as will be shown in Sections 2.4.5 and 2.4.6, this feature can actually be beneficial. There are two different methods of operation for polynomial basis multipliers: least significant bit (LSB) first or most significant bit (MSB) first. Either of these approaches may be chosen and both modes are described below.

2.4.3.1 Option L - LSB first

In this option the LSB appears first on the multiplier input; therefore denote this multiplier a Bit-Serial Polynomial Basis Multiplier option L (SPBML). This multiplier is described in detail in the literature [4], ([29], pp. 163-164), ([30], pp. 90-91) and summarised here. Let a, b, c ∈ GF(2^m) and represent these elements over the polynomial basis as

a = a0 + a1α + ... + am-1α^(m-1)
b = b0 + b1α + ... + bm-1α^(m-1)
c = c0 + c1α + ... + cm-1α^(m-1).   (2.19)

The multiplication c = a * b can be expressed as

c = a * b = (a0 + a1α + ... + am-1α^(m-1)) * b
c = (...(((a0b) + a1bα) + a2bα^2) + ...) + am-1bα^(m-1).   (2.20)

A circuit carrying out multiplication by implementing equ(2.20) therefore requires an LFSR to carry out multiplication by α. This LFSR is initialised with b, and on clocking the register the value of bα is generated. The values a0, a1, ..., am-1 are fed in series into the multiplier to generate each of the values ai * bα^i (i = 0, 1, ..., m-1), which are accumulated in a register to form the product bits c0, c1, ..., cm-1. As an example, a circuit diagram for such a multiplier for GF(2^3), using the primitive polynomial p(x) = x^3 + x + 1, is given in Fig. 2.4. The operation of this circuit is as follows. The registers are initialised by Bi = bi and Ci = 0 (i = 0,1,2). The values a0, a1, a2 are fed in series into the multiplier and after 3 clock cycles the result c is available in the Ci register.

Figure 2.4. Circuit for multiplying two elements in GF(2^3).

2.4.3.2 Option M - MSB first

In this option the MSB appears first on the multiplier input. The Bit-Serial Polynomial Basis Multiplier option M (SPBMM) has been known for many years [28], ([29], p. 163) and was more recently modified by Scott et al. [45]. The multiplication c = a * b (where a, b and c are as given in equ(2.19)) can be expressed as

c = (...((am-1b)α + am-2b)α + ...)α + a0b.   (2.21)

A circuit implementing equ(2.21) for GF(2^3) is shown in Fig. 2.5.
Initially the Ci register is set to zero and the Bi register is initialised by Bi = bi (i = 0,1,2). a2 is then fed into the circuit and a2b is loaded into the top register. Then a1 enters the circuit and the top register is clocked so that it contains (a2b)α + a1b. Finally, the top register is clocked again to generate ((a2b)α + a1b)α, and this value is added to a0b to form the required product. In general therefore the result is obtained in the Ci register after m clock cycles.

Figure 2.5. Circuit for multiplying two elements in GF(2^3).

2.4.4 Comparison of bit-serial multipliers

The Massey-Omura multiplier operates entirely over a normal basis and so no additional basis conversions are required. The normal basis representation is especially effective in performing operations such as squaring. Unfortunately, the multiplier circuit is relatively hardware inefficient (compared to the Berlekamp multiplier, for example [28, 33]) and cannot be hardwired to carry out reduced complexity constant multiplication. Furthermore, the Massey-Omura multiplier cannot easily be extended to other values of m from a design for a particular choice of m. The Berlekamp multiplier is known to have very low hardware requirements [28]. The Berlekamp multiplier can also be hardwired to allow for particularly efficient constant multiplication [3]. The disadvantage of this multiplier is that it operates over both the dual and the polynomial basis, and so basis converters may be required. In most cases the basis conversion is only a permutation of the basis coefficients, and hence no additional hardware is required (see Appendix C). For these reasons, the Berlekamp multiplier is widely used in codec design. The bit-serial polynomial basis multipliers do not require basis converters, and are almost as hardware efficient as the Berlekamp multiplier. They do however have a different interface to the Berlekamp multiplier, being serial-in parallel-out. Hence the choice between a Berlekamp and a polynomial basis multiplier often depends on the circuit in which the multiplier is to be implemented. For example, if the result is required to be represented in parallel then an SPBMM may be used; otherwise a Berlekamp multiplier could be adopted instead. In comparing all four multipliers directly, it is noted that they each take m clock cycles to generate a solution. Similarly, they each require 2m flip-flops. In order to compare the hardware requirements of these four multipliers some notation is introduced. Let Na equal the number of 2-input AND gates required by a multiplier and let Nx equal the number of 2-input XOR gates required by a multiplier. Furthermore, let Da and Dx be the delays through a 2-input AND gate and XOR gate respectively. Let H(pp) be the Hamming weight of the primitive polynomial chosen for GF(2^m). (These choices of p(x) are listed in Appendix A.) The hardware requirements and delays of three of these multipliers are given below.

Berlekamp multiplier:
Na = m; Nx = m + H(pp) - 3
Delay = Da + log2(m-1) * Dx.   (2.22)

Standard basis multiplier option L:
Na = m; Nx = m + H(pp) - 2
Delay = Da + Dx.   (2.23)

Standard basis multiplier option M:
Na = m; Nx = m + H(pp) - 2
Delay = Da + 2Dx.   (2.24)

For Massey-Omura multipliers the number of gates cannot be explicitly specified.
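Before comparing gate counts, the two polynomial basis options can likewise be sketched in behavioural VERILOG for GF(2^3) with p(x) = x^3 + x + 1, matching Figs. 2.4 and 2.5 (added illustrations; module and signal names are assumptions):

    module spbml (
      input  wire       clk, load,
      input  wire       a_bit,   // serial input, a0 first (LSB first)
      input  wire [2:0] b,
      output reg  [2:0] c        // parallel product after m cycles
    );
      reg [2:0] B;
      always @(posedge clk)
        if (load) begin B <= b; c <= 3'b000; end
        else begin
          c <= c ^ (B & {3{a_bit}});        // accumulate ai * (b * alpha^i)
          B <= {B[1], B[0] ^ B[2], B[2]};   // B <= B * alpha
        end
    endmodule

    module spbmm (
      input  wire       clk, load,
      input  wire       a_bit,   // serial input, a(m-1) first (MSB first)
      input  wire [2:0] b,
      output reg  [2:0] c        // parallel product after m cycles
    );
      reg [2:0] B;
      wire [2:0] c_alpha = {c[1], c[0] ^ c[2], c[2]};  // c * alpha
      always @(posedge clk)
        if (load) begin B <= b; c <= 3'b000; end
        else      c <= c_alpha ^ (B & {3{a_bit}});     // equ(2.21), Horner form
    endmodule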
As a comparison, values of Na and Nx for all three types of multiplier are given in Table 2.1.

 m   Massey-Omura [33]   Berlekamp     SPBML/SPBMM
      Na     Nx           Na    Nx      Na    Nx
 3     5      4            3     3       3     4
 4     7      6            4     4       4     5
 5     9      8            5     5       5     6
 6    11     10            6     6       6     7
 7    19     18            7     7       7     8
 8    21     20            8    10       8    11
 9    17     16            9     9       9    10
10    19     18           10    10      10    11

Table 2.1. The usage of gates for bit-serial Massey-Omura, Berlekamp and standard basis multipliers.

It should be noticed that in some applications the most important feature of a multiplier is the input/output interface. Beth et al. [4] presented different interfaces for polynomial, dual and normal basis multipliers. In conclusion, polynomial basis multipliers can only be serial-in parallel-out, whereas dual and normal basis multipliers can be either parallel-in serial-out or serial-in parallel-out.

2.4.5 Generating the sum of products

Often in BCH and RS decoders one product is not required to be generated in isolation, but instead a sum of products must be calculated. For example, an equation of the form

c = Σ_{j=1}^{t} aj bj   (2.25)

is required to be evaluated in the Berlekamp-Massey algorithm circuits described in the next chapter. If bit-serial Berlekamp or Massey-Omura multipliers are being used, the sum of t products is obtained by the modulo 2 addition of the outputs of the t independent multipliers, and so (t-1) additional XOR gates are required. With polynomial basis multipliers, where the outputs are represented in parallel, m*(t-1) XOR gates are required. However, if SPBMMs are used to generate these products, large hardware savings can be made, as follows. An SPBMM implements equ(2.21) (rewritten below)

c = (...((am-1b)α + am-2b)α + ...)α + a0b

by generating the values Pn = Pn-1α + am-nb (n = 1, 2, ..., m), where P0 = 0 and c = Pm. If now aj = aj,0 + aj,1α + ... + aj,m-1α^(m-1) and bj = bj,0 + bj,1α + ... + bj,m-1α^(m-1) (j = 1, 2, ..., t), then instead generate

Pn = Pn-1α + Σ_{i=1}^{t} ai,m-n bi   (2.26)

where P0 = 0, and so Pm = Σ_{j=1}^{t} aj bj. Equ(2.26) can be implemented by a circuit comprising two parts. Part A generates Pn-1α in the same manner as in the top register in Figure 2.5. Part B comprises t registers, each with m 2-input AND gates, generating the values ai,m-n bi (n = 1, 2, ..., m) for (i = 1, 2, ..., t). The additions required in equ(2.26) can be carried out by m*(t-1) XOR gates included in the design of the Part A circuit. A circuit for GF(2^3) with t = 2 is shown in Figure 2.6. Using this approach to evaluating equ(2.25), (t+1)*m flip-flops are required. If t distinct SPBMMs are used however, 2t*m flip-flops are needed, and so the above method allows for a saving of (t-1)*m flip-flops to be made. Given that Berlekamp multipliers are the most hardware efficient bit-serial multipliers, it would seem appropriate to use these multipliers when implementing equ(2.25). In this case however the presented approach would again save (t-1)*m flip-flops, since t distinct multipliers would be required. In addition, (H(pp) - 2)*(t-1) - 1 XOR gates are saved, where H(pp) is the Hamming weight of the irreducible polynomial for the field. Hence the presented approach is the most hardware efficient method of implementing equ(2.25) currently available.

Figure 2.6. Circuit generating c = a1b1 + a2b2 in GF(2^3).
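A behavioural VERILOG reading of Fig. 2.6 (an added sketch of equ(2.26) for GF(2^3) and t = 2; the Part B registers holding the bi are modelled as parallel inputs held constant, and all names are assumptions):

    module sum_of_products (
      input  wire       clk, load,
      input  wire       a1_bit, a2_bit,  // MSB-first serial coefficients of a1, a2
      input  wire [2:0] b1, b2,          // parallel, polynomial basis
      output reg  [2:0] c                // c = a1*b1 + a2*b2 after m cycles
    );
      wire [2:0] c_alpha = {c[1], c[0] ^ c[2], c[2]};   // Part A: Pn-1 * alpha
      always @(posedge clk)
        if (load) c <= 3'b000;
        else      c <= c_alpha                          // equ(2.26)
                     ^ (b1 & {3{a1_bit}})               // Part B, i = 1
                     ^ (b2 & {3{a2_bit}});              // Part B, i = 2
    endmodule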
Initially the presented approach appears to have an unattractive input/output format, since the aj values enter in series, the bj values enter in parallel, and the output is also generated in parallel. However, when utilised in a Berlekamp-Massey algorithm circuit this input/output format can be very convenient. This is because the incoming syndromes are frequently represented in series (and so can take the role of the aj values) and the error location values generated by the circuit are represented in parallel (and so can take the role of the bj values). The other multipliers required in the circuit must then also be bit-parallel multipliers, thereby increasing the throughput of the overall circuit. Furthermore, in the next section a new Dual-Polynomial Basis Multiplier is presented, and a combination of these two architectures offers a new hardware and time efficient architecture for the BMA (presented in Section 3.4.2). It should be noticed here that the sum of products architecture may also be extended to dual and normal basis multipliers if their architecture is serial-in parallel-out. Therefore it is possible to construct a sum of products multiplier for MSB-first dual basis multipliers (Fig. 7 [4]) and for MSB-first and LSB-first normal basis multipliers (Figs. 10, 11 [4]).

2.4.6 Dual-Polynomial Basis Multipliers

In real time applications, the time taken by a multiplier to generate a solution is one of its most important characteristics. Therefore a designer has to choose between hardware efficient but slow bit-serial multipliers, and quick but rather complex bit-parallel multipliers. In some applications it is required to calculate

y = a * b * c.   (2.27)

In the standard approach to generating equ(2.27), two multiplications are carried out independently, i.e. first the multiplication z = a * b is implemented and the result stored in an auxiliary register Z, and then the multiplication y = z * c is carried out. The total calculation time is the sum of two independent multiplication times. In some applications this time is unacceptably long, and a parallel multiplier must be employed, requiring a more complex architecture. To overcome this problem, a new approach has been developed. Using the two proposed Dual-Polynomial Basis Multipliers (DPBMs), the time required to implement equ(2.27) is almost the same as the time required to carry out a single multiplication. Furthermore, a DPBM is almost as hardware efficient as the standard bit-serial approach. The DPBM can also be modified to carry out more complex operations such as y = (a * b + c) * d. (This operation is required to be carried out in the Berlekamp-Massey algorithm.)

2.4.7 Option A dual polynomial basis multipliers

The Berlekamp multiplier can be described as a parallel-in serial-out multiplier. On the other hand, bit-serial polynomial basis multipliers are serial-in parallel-out. Therefore, there is the option of connecting these two types of multiplier together to form one multiplier generating y = a * b * c. In this arrangement, the Berlekamp multiplier's output is connected directly to the polynomial basis multiplier's serial input. Thus the multiplication y = a * b * c is carried out in the same time span that a single bit-serial Berlekamp multiplier takes to yield one product. A problem occurs however, because the polynomial basis multiplier operates on the polynomial basis whilst the Berlekamp multiplier produces a result in the dual basis, and so an additional basis conversion is required. The complexity of this basis conversion depends on the irreducible polynomial selected, and so two cases have been considered:
those in which the irreducible polynomial for the field is a trinomial of the form p(x) = x^m + x^p + 1, and those in which it is a pentanomial of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1.

2.4.7.1 Irreducible trinomials

When the irreducible polynomial defining GF(2^m) is a trinomial, the dual basis is a permutation of the polynomial basis (see Appendix C). Therefore it is possible to rearrange the order of the output from the Berlekamp multiplier so that it is compatible with the polynomial basis multiplier. As an example, consider GF(2^4) with p(x) = x^4 + x + 1. An element z ∈ GF(2^4) is represented in the polynomial basis as

z = z0 + z1α + z2α^2 + z3α^3,  zi ∈ GF(2)   (2.28)

so a Berlekamp multiplier would generate this value in the dual basis order z0, z3, z2, z1. The SPBMM requires the serial input in the order z3, z2, z1, z0. A circuit that rearranges the dual basis coefficients into this order can be easily developed, thus allowing the DPBM to be designed. The general scheme for a multiplier generating y = a * b * c is shown in Figure 2.7.

Figure 2.7. Dual-Polynomial Basis Multiplier option A.

Assume for instance that the multiplier shown in Figure 2.7 is a Dual-Polynomial Basis Multiplier option A (DPBMA) for GF(2^4). The hardware required in addition to the SPBMM and the Berlekamp multiplier is a 2:1 multiplexer and a flip-flop. On the first clock cycle the values of a and b are parallel loaded into the Berlekamp multiplier. Once these values have been stored, the first product bit z0 is available on the output. This result is then clocked into the Z flip-flop. On clocking the Berlekamp multiplier a further three times, the values of z3, z2, z1 are produced. These coefficients pass through the multiplexer and feed the serial input of the SPBMM. On the 5th clock cycle the multiplexer feeds the SPBMM input with z0, so that the SPBMM has been fed the input sequence z3, z2, z1, z0, as required. In total therefore this circuit has a computation time of (m+1) clock cycles. Note also that no extra m-bit register Z is required to store the value of z, as is required in the standard approach to generating equ(2.27). This approach may be easily extended to GF(2^m) where the irreducible polynomial for GF(2^m) is of the form p(x) = x^m + x^p + 1. In this case, if (z0, z1, ..., zm-1) is the polynomial basis representation of z ∈ GF(2^m), the output in the dual basis from a Berlekamp multiplier is (see Appendix C)

zp-1, zp-2, ..., z0, zm-1, zm-2, ..., zp.   (2.29)

In this case, a multiplier structure similar to that shown in Figure 2.7 is derived. In addition, p extra flip-flops and one (p+1):1 multiplexer are required, and the total calculation time is now m + p clock cycles.

2.4.7.2 Irreducible pentanomials

When the irreducible polynomial for GF(2^m) is a pentanomial of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, the dual to polynomial basis conversion involves a reordering and two GF(2) additions, and so two extra XOR gates are required to implement this conversion. In this case the DPBMA is more difficult to implement, but it is still worthy of consideration. As an example, and because GF(2^8) is the most useful example of a field for which an appropriate pentanomial exists, consider GF(2^8) with p(x) = x^8 + x^4 + x^3 + x^2 + 1. Let z ∈ GF(2^8) be represented in the dual basis as

z = z0μ0 + z1μ1 + z2μ2 + z3μ3 + z4μ4 + z5μ5 + z6μ6 + z7μ7,  zi ∈ GF(2)

and in the polynomial basis as

z = s0 + s1α + s2α^2 + s3α^3 + s4α^4 + s5α^5 + s6α^6 + s7α^7,  si ∈ GF(2).
Then the dual to polynomial basis conversion is given by

s7 = z3, s6 = z4, s5 = z5, s4 = z6,
s3 = z3 + z7, s2 = z0 + z2, s1 = z1, s0 = z2.

The DPBMA for GF(2^8) is shown in Figure 2.8.

Figure 2.8. DPBMA generating y = a * b * c in GF(2^8).

The operation of the DPBMA shown in Figure 2.8 is as follows. On the first clock cycle a and b are parallel loaded into the Berlekamp multiplier, and at this point the first product bit is available on the output. The remaining 7 product bits are obtained by clocking the Berlekamp multiplier a further 7 times. The first 4 values generated by the Berlekamp multiplier are clocked into the Zi register, so that after 4 clock cycles Zi = zi (i = 0,1,2,3). This fourth value, z3, is also the first input to the SPBMM (i.e. s7). The next three outputs from the Berlekamp multiplier are fed into the SPBMM, and then the multiplexer selects inputs 1, 4, 3, 2 on the next four clock cycles to generate the required input for the SPBMM. The overall DPBMA will generate a solution on the 11th clock cycle. In general, a DPBMA takes m + p + 1 clock cycles to generate a product when the irreducible polynomial is of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1. In addition to the required Berlekamp multiplier and the SPBMM, an additional p+2 flip-flops, two 2-input XOR gates and one (3+p):1 multiplexer are required.

2.4.8 Dual polynomial basis multipliers option B

The DPBM may also be developed in a different form. With this option, a multiplier implementing equ(2.27) has the same calculation time as a single bit-serial multiplier. Instead of rearranging the order of the output from a Berlekamp multiplier, it is possible to add an additional circuit to the input of a 'Berlekamp-like' multiplier, denoted the bit-serial dual basis multiplier (SDBM). With this scheme, the SDBM produces a product in the polynomial basis, and so no extra circuit between the SDBM and the SPBMM is required. In order to develop the DPBM option B (DPBMB), the function Rd(x) is introduced.

Definition 2.6. Let the irreducible polynomial for GF(2^m) be p(x) = p0 + p1x + p2x^2 + ... + x^m and let a, b ∈ GF(2^m) be represented in the dual basis as

b = b0μ0 + b1μ1 + ... + bm-1μm-1
a = a0μ0 + a1μ1 + ... + am-1μm-1.

Then define the function Rd : GF(2^m) → GF(2^m) such that b = Rd(a), where b satisfies

bj = aj+1 for 0 ≤ j ≤ m-2,  and  bm-1 = Σ_{i=0}^{m-1} pi ai.   (2.30)

The value b = Rd(a) = aα, where α is a root of p(x), and so the function Rd(x) has the same effect on the coefficients as one clock of an LFSR which is initialised with the dual basis representation of its argument. Let Rd^2(a) be defined as Rd^2(a) = Rd(Rd(a)) - the state of the LFSR after 2 clock cycles - and so on.

2.4.8.1 Irreducible trinomials

To introduce the DPBMB, assume first that the defining irreducible polynomial p(x) is a trinomial of the form p(x) = x^m + x^p + 1. Now consider a Berlekamp multiplier without the LFSR, but with instead a set of m input lines Ai; denote this multiplier the SDBM. Let a, b, z ∈ GF(2^m) such that z = a * b. Further, let b and z be represented in the polynomial basis and a be represented in the dual basis as a = a0μ0 + a1μ1 + ... + am-1μm-1. If the SDBM is fed with the inputs Ai = ai (i = 0, 1, ..., m-1), the first coefficient of the dual basis representation of z is obtained, or equivalently, from equ(2.29), the p-th polynomial basis coefficient of z, namely zp-1.
So if instead the multiplier is fed with the dual basis representation of Rd^p(a), the (p+1)-th coefficient of the dual basis representation of z is obtained, or equivalently, the last polynomial basis coefficient zm-1. Similarly, if on the next clock cycle the multiplier is fed with the dual basis representation of Rd^(p+1)(a), the (p+2)-th coefficient of the dual basis representation of z is obtained, or equivalently, the next to last polynomial basis coefficient zm-2. This analysis may continue, and so overall, if the proposed multiplier is fed with the input sequence

Rd^p(a), Rd^(p+1)(a), Rd^(p+2)(a), ..., Rd^(m-1)(a), a, Rd(a), ..., Rd^(p-1)(a)   (2.31)

the multiplier will generate the values zm-1, zm-2, ..., z0, which is the correct format for the SPBMM. As previously mentioned, the proposed technique is flexible in that it can be modified to carry out operations of the form y = (a * b + c) * d. For example, consider Figure 2.9, where a circuit for GF(2^4) is presented implementing the operation y = (a * b + c) * d. Consider first the lower half of the circuit, which implements z = a * b using an SDBM. Taking p(x) = x^4 + x^p + 1 with p = 1, in order that the SDBM produces the result in the required sequence, from equ(2.31) the 'a' inputs to the multiplier must be in the order Rd(a), Rd^2(a), Rd^3(a), a. To achieve this, four flip-flops, four 3:1 multiplexers, an Rd^p(a) circuit and an Rd(a) circuit are additionally required. (An Rd^p(a) circuit is a combinational circuit that, given the dual basis representation of a ∈ GF(2^m), generates Rd^p(a). This circuit therefore implements a linear transformation over GF(2) and comprises p XOR gates. In this case it can be seen that only one additional XOR gate is required.) On the first clock cycle, the multiplexers select input 0, thereby loading Rd^p(a) into the ari register. On the 2nd and 3rd clock cycles the multiplexers select input 2, thereby loading Rd^2(a) and Rd^3(a) respectively into the ari register. Finally, on the 4th clock cycle, the multiplexers select input 1 to load the dual basis representation of a into the ari register. In doing this the output sequence z3, z2, z1, z0 is generated, as required by the SPBMM. If the Ci register was previously initialised with the polynomial basis coefficients of c and is now clocked, the polynomial basis representation of (a*b + c) is generated. This value is then fed into an SPBMM as normal to generate the required result y = (a*b + c)*d; this equation is required in the BMA. The DPBMB can be easily extended to different GF(2^m) if the irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1. In general, the multiplexers should select the following signals:

Clock cycle     Origin of signal     Actual values on these signals
1               Rd^p(a) circuit      Rd^p(a)
2 to m-p        Rd(a) circuit        Rd^(p+1)(a) to Rd^(m-1)(a)
m-p+1           Ai register          a
m-p+2 to m      Rd(a) circuit        Rd(a) to Rd^(p-1)(a)

Figure 2.9. Circuit generating y = (a * b + c) * d in GF(2^4).

In comparison with the standard approach, a DPBMB circuit requires an additional m 3:1 multiplexers, m flip-flops, one XOR gate to form the Rd(x) circuit and p XOR gates to form the Rd^p(x) circuit. In order to reduce the complexity of this Rd^p(x) circuit, a value of p as low as possible should be chosen. Hence the optimal irreducible polynomial to choose in this instance is p(x) = x^m + x + 1. Such polynomials exist for m = 2, 3, 4, 6, 7, etc.
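For this optimal trinomial with m = 4, the Rd(x) circuit is a single combinational LFSR step; a VERILOG sketch (added illustration; names are assumptions):

    module rd_step (
      input  wire [3:0] a,   // dual basis coefficients of a
      output wire [3:0] b    // b = Rd(a) = a * alpha in the dual basis
    );
      // equ(2.30) with p(x) = x^4 + x + 1: bj = a(j+1) for j < 3,
      // b3 = a0 + a1 - the single additional XOR gate noted above.
      assign b = {a[0] ^ a[1], a[3:1]};
    endmodule

Cascading p copies of this circuit yields the Rd^p(a) block.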
2.4.8.2 Primitive pentanomials

When the irreducible polynomial p(x) is of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, a DPBMB can be designed similarly to the trinomial case. Because the basis conversion is not just a permutation of basis coefficients but also involves two GF(2) additions, the circuit rearranging the input to an SDBM is rather more complicated however. Using the same analysis as in the trinomial case, it can be shown that when p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, the required input sequence for an SDBM multiplier is

Rd^(p+1)(a), Rd^(p+2)(a), ..., Rd^(m-2)(a), Rd^(p+1)(a) + Rd^(m-1)(a), a + Rd^p(a), Rd(a), ..., Rd^p(a).

So for example, consider GF(2^8) and p(x) = x^8 + x^4 + x^3 + x^2 + 1. The required input sequence is therefore

Rd^3(a), Rd^4(a), Rd^5(a), Rd^6(a), Rd^3(a) + Rd^7(a), a + Rd^2(a), Rd(a), Rd^2(a).

This sequence is generated by a circuit of the form shown in Figure 2.10. The key section of this circuit is the multiplexer determining the ordering of the above input sequence. The input selections are as follows:

clock 1        line 4    Rd^3(a)
clocks 2-4     line 3    Rd(ar), i.e. Rd^4(a), Rd^5(a), Rd^6(a)
clock 5        line 2    Rd(ar) + Rd^3(a), i.e. Rd^7(a) + Rd^3(a)
clock 6        line 1    Rd^2(a) + a
clock 7        line 0    Rd(a)
clock 8        line 3    Rd(ar), i.e. Rd^2(a)

In general, the DPBMB requires an additional (p+2) Rd(x) circuits (3 XOR gates in each), 2m XOR gates for the summation circuits, m 5:1 multiplexers and m flip-flops.

Figure 2.10. DPBMB generating y = a * b * c in GF(2^8).

2.4.8.3 Summary of DPBM

The DPBM is particularly useful if the time taken to generate a product is critical. The DPBM offers a half-way solution between a bit-serial and a bit-parallel multiplier. Furthermore, both DPBMs are hardware efficient, and in some situations the DPBM offers a reduction in hardware, since the intermediate value z does not have to be stored. The structure of the DPBM depends on the irreducible polynomial for GF(2^m). The optimal irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1 with p = 1; for p > 1 more hardware is required in the multiplier. For some values of m (e.g. m = 8) there do not exist irreducible trinomials, and so an irreducible pentanomial must be used, resulting in the addition of extra hardware. Although the structure of the DPBM depends on the selected irreducible polynomial for GF(2^m), it has been shown that the architecture can be easily specified for two important classes of irreducible polynomials. The DPBMs require only one input to be represented in the dual basis; the other input and the output are represented in the polynomial basis. Two different options have been presented. With the DPBMA, the dual basis output is converted into the polynomial basis. This multiplier is particularly suited to generating products of the form y = a * b * c or y = (a * b + c) * d if it is acceptable to take (m+p) clock cycles to generate this product. With the DPBMB, the basis rearranging takes place on the input. This approach takes more hardware than the DPBMA circuit, but generating the product takes only m clock cycles. The DPBMB is of particular use when evaluating expressions of the form

y = Σ_{i=1}^{t} (a*bi + ci)*di   (2.32)

where a, bi, ci, di ∈ GF(2^m). This is because only one relatively expensive basis rearranging circuit is required. Expressions of the type equ(2.32) are generated in the implementation of the Berlekamp-Massey algorithm.
Note that SPBMLs have not been used in conjunction with DPBMs because the basis reordering circuits are more complicated than those needed when using SPBMMs. Beth et al. [4] presented normal basis multipliers with an LSB-first serial-in parallel-out interface. It is therefore also possible to construct a multiplier that carries out the multiplication d = a*b*c over the normal basis in only m clock cycles. This multiplier consists of a parallel-in serial-out Massey-Omura multiplier of the form presented in Section 1.4.2 and the above multiplier ([4] Fig. 11). This double multiplier does not require basis rearranging as the DPBM does, but normal basis multiplication is relatively hardware inefficient (see Section 2.4.4), and constructing a normal basis multiplier for different choices of m is quite complex. In addition, normal basis multiplication requires the arguments of the multiplication to be rotated, so an additional control system is required. Summing up, in this thesis the DPBM is adopted, although in some instances it is not obviously the most appropriate architecture. A similar architecture using only dual basis multipliers cannot be constructed, because parallel-in serial-out multipliers produce the result in the dual basis, whereas serial-in parallel-out dual basis multipliers can be constructed only for serial input in the polynomial basis [4].

2.5 Bit-Parallel Multiplication
In some applications it is necessary to adopt bit-parallel architectures rather than bit-serial ones to achieve the required performance. So far, only bit-serial multipliers have been considered because of their hardware advantages over bit-parallel multipliers. Unfortunately, in the time-critical places in BCH codecs, bit-serial architectures are too slow and the more complex bit-parallel architectures must be adopted.

2.5.1 Dual Basis Multipliers
The bit-parallel dual basis multiplier (PDBM) was presented in [15]. Let a, c ∈ GF(2^m) be represented in the dual basis by a = a_0β_0 + a_1β_1 + ... + a_{m-1}β_{m-1} and c = c_0β_0 + c_1β_1 + ... + c_{m-1}β_{m-1}, and let b ∈ GF(2^m) be represented in the polynomial basis as b = b_0 + b_1α + ... + b_{m-1}α^{m-1}. The multiplication c = a*b is then described by equations (2.16) and (2.17). Using these equations and the properties of the bit-serial Berlekamp multiplier, the PDBM can easily be derived [15] as a circuit implementing the equations

c_j = a_j b_0 + a_{j+1} b_1 + a_{j+2} b_2 + ... + a_{j+m-1} b_{m-1}    (j = 0, 1, ..., m-1)
a_{m+k} = Σ_{j=0}^{m-1} p_j * a_{j+k}    (k = 0, 1, ..., m-2)    (2.33)

where the p_j are the coefficients of the primitive polynomial for the field, p(x) = p_0 + p_1x + ... + x^m. In general, therefore, a PDBM for GF(2^m) comprises one type A module that generates the extended coefficients a_{m+k} from equ(2.33) and m type B modules, each generating the inner product of two m-length vectors over GF(2). As an example, the PDBM for GF(2^3) using p(x) = x^3 + x + 1 is given below.

Figure 2.11. Type A module for a bit-parallel dual basis multiplier for GF(2^3) (generating a_3 and a_4 from a_0 to a_2).
Figure 2.12. Type B module for a bit-parallel dual basis multiplier for GF(2^3) (inner product of (a_j, a_{j+1}, a_{j+2}) and (b_0, b_1, b_2) giving c_j).
Figure 2.13. Bit-parallel dual basis multiplier for GF(2^3) (one type A module and three type B modules).
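A short behavioural sketch of the PDBM equations (2.33) is given below. It is a software rendering of the dataflow, not of the gate-level modules; field elements are bit-lists and the primitive polynomial is passed as its low-order coefficients.

```python
# Minimal sketch of equ(2.33): type A extends the dual-basis coefficients
# of a; type B modules form GF(2) inner products. Assumes p(x) = x^3 + x + 1,
# passed as its low coefficients p = [p0, p1, p2] = [1, 1, 0].

def pdbm(a, b, p):
    """c = a*b with a, c in the dual basis and b in the polynomial basis."""
    m = len(b)
    ext = list(a)
    # Type A module: a_{m+k} = sum_j p_j * a_{j+k} over GF(2).
    for k in range(m - 1):
        ext.append(0)
        for j in range(m):
            ext[m + k] ^= p[j] & ext[j + k]
    # Type B modules: each c_j is an inner product of two m-bit vectors.
    return [int(sum(ext[j + i] & b[i] for i in range(m)) % 2)
            for j in range(m)]

# Example in GF(2^3):
print(pdbm([1, 0, 1], [0, 1, 1], [1, 1, 0]))
```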
2.5.2 Normal Basis Multipliers
A bit-parallel normal basis multiplier was also presented by Massey and Omura [31]. This multiplier comprises m identical Boolean functions, where the inputs to these functions are effectively cyclically shifted by one position each time. A bit-parallel Massey-Omura multiplier (PMOM) requires at least m(2m-1) 2-input AND gates and at least m(2m-2) 2-input XOR gates [31, 54]. The complexity of this multiplier is therefore dependent upon the complexity of the defining multiplication function. Accordingly, this multiplier is more hardware intensive than the PDBM and is not used in this thesis.

2.5.3 Polynomial Basis Multipliers
The bit-parallel polynomial basis multiplier (PPBM) was presented by Laws et al. [28]. The multiplier performs the same sequence of computations as the bit-serial polynomial basis multiplier option M (SPBMM), and so this multiplier is denoted the parallel polynomial basis multiplier option M (PPBMM). Let a, b, c ∈ GF(2^m) with

a = a_0 + a_1α + ... + a_{m-1}α^{m-1}
b = b_0 + b_1α + ... + b_{m-1}α^{m-1}
c = c_0 + c_1α + ... + c_{m-1}α^{m-1}.    (2.34)

To generate c = a*b, the representation

c = (...(((a_{m-1}b)α + a_{m-2}b)α + a_{m-3}b)α + ...)α + a_0b    (2.35)

is again used. The PPBMM therefore consists of (m-1) blocks that carry out the operations y_{m-1} = a_{m-1}b and y_j = a_jb + αy_{j+1} mod p(α) for m-1 > j ≥ 0, where the result is c = y_0 and p(x) is the irreducible polynomial for GF(2^m). Mastrovito has presented a different type of polynomial basis multiplier [33]. This multiplier generates c = a*b mod p(x), c = Σ_{j=0}^{m-1} c_jα^j, by employing the product matrix M:

[c_{m-1}, c_{m-2}, ..., c_0]^T = M * [a_{m-1}, a_{m-2}, ..., a_0]^T    (2.36)

where each entry f_j^i of the m x m matrix M is a GF(2) sum of some of the coefficients b_k. The most burdensome part of the Mastrovito algorithm is finding the product matrix M. The algorithm for finding M has been omitted here, as it is rather complicated; it can be found in [33]. In conclusion, the Mastrovito bit-parallel polynomial basis multiplier (MPPBM) is rather difficult to represent algorithmically. The advantage of the MPPBM, however, is that it has a smaller time delay than the PPBMM. Laws et al. [28] presented a parallel multiplier using the same calculation sequence as the SPBMM. The question arises whether it is possible to construct a modular and regular parallel multiplier employing the same calculation sequence as in the case of the SPBML. The research carried out concludes that it is, as shown below. Express the multiplication c = a*b as in equ(2.20):

c = a_0b + a_1(bα) + a_2(bα^2) + ... + a_{m-1}(bα^{m-1}).

Now represent b*α^j as

b*α^j = b_{j,0} + b_{j,1}α + b_{j,2}α^2 + ... + b_{j,m-1}α^{m-1}.    (2.37)

Therefore, using (2.20) and (2.37),

c_j = a_0b_{0,j} + a_1b_{1,j} + a_2b_{2,j} + ... + a_{m-1}b_{m-1,j}.    (2.38)

Equation (2.38) may also be derived by considering the SPBML: in the SPBML, the value c_j is obtained by sequentially summing the binary products of b_{j,i} (the state of register b_i after j clock cycles) and a_i. Using equations (2.37) and (2.38) it is possible to construct a modular and regular bit-parallel polynomial basis multiplier, option L (PPBML). A PPBML for GF(2^4) is presented below.

Figure 2.14. PPBML for GF(2^4) (multiply-by-α blocks generating b*α^j and four type B modules producing c_0 to c_3).
Figure 2.15. Module B of the PPBML - an inner product generator identical to that required in the Berlekamp multiplier.
Figure 2.16. Circuit for multiplying by α in GF(2^4).

In general, a PPBML for GF(2^m) comprises m type B inner product modules and (m-1) type C modules that generate α^j * b, where α is a root of p(x) and b ∈ GF(2^m). A type C module essentially carries out a linear transformation of basis coefficients over GF(2) and therefore consists of a number of XOR gates.
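The PPBML dataflow of equations (2.37)-(2.38) can be sketched in a few lines of Python; the sketch below mirrors the type C (multiply-by-α) and type B (inner product) modules, with field elements as bit-lists, LSB first.

```python
# Sketch of equ(2.37)-(2.38). p lists the low coefficients of the
# irreducible polynomial, e.g. x^4 + x + 1 -> p = [1, 1, 0, 0].

def times_alpha(b, p):
    """Type C module: multiply b by alpha and reduce modulo p(x)."""
    msb = b[-1]
    shifted = [0] + b[:-1]                               # multiply by x
    return [shifted[i] ^ (msb & p[i]) for i in range(len(b))]

def ppbml(a, b, p):
    """c = a*b over the polynomial basis, c_j = sum_i a_i * b_{i,j}."""
    m = len(a)
    rows, cur = [], list(b)
    for _ in range(m):                                   # rows[i] = b * alpha^i
        rows.append(cur)
        cur = times_alpha(cur, p)
    # Type B modules: inner products over GF(2) (equ 2.38).
    return [int(sum(a[i] & rows[i][j] for i in range(m)) % 2)
            for j in range(m)]

# Example in GF(2^4) with p(x) = x^4 + x + 1:
print(ppbml([1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 0, 0]))
```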
2.5.4 Comparison of parallel multipliers
In this section the PDBM [15], the PMOM [31, 54] and three polynomial basis multipliers, the MPPBM [33], the PPBMM [28] and the PPBML, have been considered. A comparison of the number of XOR and AND gates required by these multipliers and the maximum delay times for a range of values of m is presented below. (In fact, the delay through each of these multipliers is D_A plus the values cited below, since a single row of AND gates is also required by each of the multipliers.)

m  | PMOM: NA, NX, DX | MPPBM: NA, NX, DX | PDBM: NA, NX, DX | PPBMM: DX | PPBML: DX
3  | 15, 12, 2        | 9, 8, 3           | 9, 8, 3          | 3         | 3
4  | 28, 24, 3        | 16, 15, 3         | 16, 15, 3        | 4         | 3
5  | 45, 40, 3        | 25, 25, 5         | 25, 24, 5        | 6         | 5
6  | 66, 60, 4        | 36, 33, 4         | 36, 35, 4        | 6         | 4
7  | 133, 126, 5      | 49, 48, 4         | 49, 48, 4        | 7         | 4
8  | 168, 160, 5      | 64, 90, 5         | 64, 77, 7        | 11        | 7
9  | 153, 144, 4      | 81, 80, 5         | 81, 80, 6        | 11        | 6
10 | 190, 180, 5      | 100, 101, 6       | 100, 99, 6       | 12        | 6

Table 2.2. Comparison of bit-parallel finite field multipliers (NA = number of AND gates, NX = number of XOR gates, DX = delay in XOR gate delays). The PPBMM and PPBML have the same gate counts as the PDBM (see the formulas below). The gate counts for the PMOM are taken from [33]; those for the MPPBM and the PDBM are taken from [15]. The primitive polynomials used to design these multipliers (excluding the PMOM) are listed in Appendix A.

As a general rule, the number of gates and the multiplier delay can be obtained from the following, where H(pp) denotes the number of non-zero terms of the primitive polynomial:

PDBM:  NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = log2(m) + t*log2(H(pp) - 1)
PPBMM: NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = m - 1 + t (if H(pp) = 3)
PPBML: NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = log2(m) + t (if H(pp) = 3)

where t = (m-1)/(m-p) (rounded up) and p(x) = x^m + x^p + Σ_{i=0}^{p-1} p_i x^i.

From Table 2.2 it can be concluded that the PPBML has the same parameters as the PDBM for the considered choices of m. The PPBML needs no basis conversions, and so the design of a PPBML is simpler and more hardware efficient than the PDBM, especially if a primitive trinomial for GF(2^m) does not exist. On the other hand, ignoring the basis conversions, the PDBM is slightly easier to design, and some additional design optimisation can be carried out; e.g. for m = 8 the number of XOR gates can be reduced to 72 [15]. In conclusion, the choice between the PDBM and the PPBML depends on the individual design specification, as the differences in design complexity and hardware requirements are small. The PPBMM has the same hardware requirements as the PPBML but a longer delay time; accordingly, the PPBMM is not used in this thesis. Similarly, the PMOM is not used, given its high hardware requirements and long delay path. The PPBML is much easier to design than the MPPBM, and in most cases the final circuits are similar, e.g. for m = 4. Therefore in this thesis only the PDBM and the PPBML have been considered. It should be mentioned that a number of bit-parallel multipliers have been proposed for circumstances in which p(x) is of the form p(x) = x^m + x^{m-1} + ... + x + 1, that is, when p(x) is an all-one polynomial [22]. However, all-one polynomials are relatively rare and so do not help in finding general solutions of the kind required here.

2.6 Finite field exponentiation
2.6.1 Squaring
In some applications squaring in a finite field is required. Squaring can be performed using a standard multiplier, but this approach is rather hardware inefficient. Instead, a different algorithm is employed, as described for example in [16]. Let a ∈ GF(2^m) be represented in the polynomial basis as a = a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1}, and let b ∈ GF(2^m) be such that b = a^2. From equ(2.8), f(x)^2 = f(x^2), and so

b = a^2 = a_0 + a_1α^2 + a_2α^4 + a_3α^6 + ... + a_{m-1}α^{2m-2} mod p(α).    (2.39)

In other words, the coefficients of b can be obtained from a linear transformation of the coefficients of a over GF(2). This linear transformation requires a number of XOR gates to implement, and these numbers for a range of m are listed in Table 2.3.
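A small sketch of equ(2.39) is given below; it makes the linearity explicit by building b column-by-column from the reductions of α^{2i}, which in hardware collapse to a fixed XOR network.

```python
# Sketch of squaring as a GF(2)-linear map, following equ(2.39).
# p lists the low coefficients of p(x), e.g. x^4 + x + 1 -> [1, 1, 0, 0].

def alpha_power(k, p):
    """Coefficient vector of alpha^k reduced modulo p(x)."""
    m = len(p)
    v = [1] + [0] * (m - 1)
    for _ in range(k):                       # repeatedly multiply by alpha
        msb = v[-1]
        v = [0] + v[:-1]
        v = [v[i] ^ (msb & p[i]) for i in range(m)]
    return v

def square(a, p):
    """b = sum_i a_i * alpha^(2i) mod p(alpha): a fixed XOR pattern."""
    m = len(a)
    b = [0] * m
    for i in range(m):
        if a[i]:
            col = alpha_power(2 * i, p)      # column i of the squaring matrix
            b = [b[j] ^ col[j] for j in range(m)]
    return b

# Example in GF(2^4): (alpha + alpha^2)^2 = alpha^2 + alpha^4 = 1 + alpha + alpha^2.
print(square([0, 1, 1, 0], [1, 1, 0, 0]))
```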
2.6.2 Raising field elements to the third power
The standard approach to exponentiation to the power three is to use a standard multiplier together with squaring, calculating a^3 = a^2 * a [46]. If a PPBML is used together with the squaring approach described above, the hardware requirements for this circuit are as given in Table 2.3. An alternative method of raising elements to the power three is now described. Let a, b ∈ GF(2^m) be such that b = a^3, and represent both elements over the polynomial basis in the usual way. From the expansion (x + y)^3 = x^3 + 3x^2y + 3xy^2 + y^3, the expressions

b = a^3 = (a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1})^3
b = Σ_{i=0}^{m-1} a_i α^{3i} + Σ_{i=0}^{m-2} Σ_{j=i+1}^{m-1} a_i a_j (α^{2i+j} + α^{i+2j}) mod p(α)    (2.40)

are derived. A circuit implementing equ(2.40) can be designed directly and consists of (m-1) + (m-2) + ... + 1 = m(m-1)/2 AND gates and at most m(m^2-1)/2 XOR gates. In practice, however, these requirements are much lower. The number of gates for this cubic circuit is given in Table 2.3. In comparison with the standard approach this method offers hardware savings, especially if design optimisation is employed; for example, for m = 8 the number of XOR gates is almost halved with optimisation.

m  | squaring: NXOR | a^3 = a^2 * a: NXOR, NAND | cubic circuit: NXOR, NAND
4  | 2              | 17, 16                    | 16 (13), 6
5  | 3              | 27, 25                    | 29 (21), 10
6  | 3              | 38, 36                    | 47 (33), 15
7  | 3              | 51, 49                    | 66 (46), 21
8  | 12 (10)        | 87, 64                    | 135 (70), 28
9  | 6              | 86, 81                    | 133 (83), 36
10 | 6              | 105, 100                  | 159 (105), 45

Table 2.3. Hardware requirements for exponentiation in GF(2^m); values in parentheses are with design optimisation.

2.7 Finite field inversion
BCH decoders are required to implement the finite field division c = a/b. This division can be implemented using a division algorithm, e.g. [15, 17, 21]. Unfortunately, BCH decoders require the result of a division to be available faster than these algorithms allow. Often, however, b is available earlier than a, and so it can be beneficial first to employ inversion to generate b^-1 and then to use a fast bit-parallel multiplier. Throughout this thesis the Fermat inverter is used. Fermat inverters operating over the normal and dual bases have been presented [16, 54]. The dual basis inverter is hardware efficient and, conveniently, the result of the division is represented in the dual basis and so can be utilised in dual basis multipliers, for example. Hence the dual basis inverter has been employed in this project. A Fermat inverter implements the equation

a^-1 = a^(2^m - 2) = a^2 * a^4 * a^8 * ... * a^(2^(m-1))    (2.41)

and so is based on repeated multiplications and squarings. The dual basis inverter uses a PDBM as presented in Section 2.5.1 and carries out squaring in the polynomial basis as described in Section 2.6.1. The overall inversion circuit requires (m-1) clock cycles to generate a result. To then calculate c = a/b = a*b^-1, one extra clock cycle is required to carry out the multiplication.
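A software sketch of the Fermat inversion of equ(2.41) follows. It reuses square() from the squaring sketch above and adds a generic polynomial-basis multiplier; the (m-1)-cycle hardware scheduling is not modelled, only the arithmetic.

```python
# Sketch of equ(2.41): a^-1 = a^2 * a^4 * ... * a^(2^(m-1)), i.e. repeated
# squaring and multiplying. p as in the earlier sketches (low coefficients).

def gf_mul(a, b, p):
    """Polynomial-basis multiplication modulo p(x), Horner style."""
    m = len(p)
    acc = [0] * m
    for i in reversed(range(m)):             # acc = acc*alpha + a_i*b
        msb = acc[-1]
        acc = [0] + acc[:-1]
        acc = [acc[j] ^ (msb & p[j]) for j in range(m)]
        if a[i]:
            acc = [acc[j] ^ b[j] for j in range(m)]
    return acc

def fermat_inverse(a, p):
    """a^(2^m - 2) by (m-1) squarings and (m-2) multiplications."""
    m = len(p)
    sq = square(a, p)                        # a^2
    result = sq
    for _ in range(m - 2):
        sq = square(sq, p)                   # a^4, a^8, ..., a^(2^(m-1))
        result = gf_mul(result, sq, p)
    return result

# Check in GF(2^4): a * a^-1 should be 1, i.e. [1, 0, 0, 0].
a, p = [0, 1, 1, 0], [1, 1, 0, 0]
print(gf_mul(a, fermat_inverse(a, p), p))
```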
2.8 Conclusions
In this chapter the main definitions and results underpinning finite field theory have been introduced. It has been shown how to generate GF(2^m) from the base field GF(2), and the most important basis representations have been described. The most useful bit-serial and bit-parallel finite field multipliers have been reviewed for adoption in BCH codecs. Circuits for carrying out inversion, division and exponentiation in GF(2^m) have also been described. Finally, some important new circuits have been presented. A hardware efficient method of generating the sum of products using a previously overlooked multiplier has been described. This circuit operates entirely over the polynomial basis and has an attractive input/output format for use in circuits implementing the Berlekamp-Massey algorithm. Two multiplier circuits generating products of the form y = a*b*c have also been presented. These circuits are based around Berlekamp multipliers and SPBMMs, and hardware/time trade-offs determine which of the two options to adopt. Both multipliers employ novel methods of implementing the required basis conversions, so allowing Berlekamp multipliers and SPBMMs to be used in tandem. Finally, a new bit-parallel multiplier, the PPBML, has been presented. This multiplier is a hardware efficient equivalent of a previously presented bit-serial multiplier. In addition, a new algorithm for exponentiation to the power three has been presented; the algorithm is especially hardware efficient if design optimisation is employed.

3. BCH codes
In this chapter Bose-Chaudhuri-Hocquenghem (BCH) codes are introduced and various BCH encoding and decoding algorithms are presented. Three different decoding strategies are presented, according to the error correcting capability of the code. Generally, decoding is broken down into three processes: syndrome calculation, the Berlekamp-Massey algorithm (BMA) and the Chien search. In addition, the BMA can be developed with or without inversion, and both methods are described here. At the end of this chapter comparisons between BCH codes and RS codes are presented.

3.1 Background
The first class of linear codes derived for error correction were the Hamming codes [20]. These codes are capable of correcting only a single error, but because of their simplicity Hamming codes and their variations have been widely used in error control systems, e.g. the 16/32-bit parallel error detection and correction circuits SN54ALS616/SN54ALS632 [50]. Later, the generalised binary class of Hamming codes for multiple errors was discovered by Hocquenghem in 1959 [23], and independently by Bose and Chaudhuri in 1960 [6]. Subsequently, non-binary error-correcting codes were derived by Gorenstein and Zierler [19]. At almost the same time, independently of the work of Bose, Chaudhuri and Hocquenghem, the important subclass of non-binary BCH codes - the RS codes - was introduced by Reed and Solomon [44]. This project is concerned with BCH codes, however, and these codes are described in more detail below.

3.2 BCH codes
The class of BCH codes is a large class of error correction codes that occupies a prominent place in the theory and practice of error correction. This prominence is a result of their relatively simple encoding and decoding techniques. Furthermore, provided the block length is not excessive, there are good codes in this class ([30] Chapter 9). In this thesis only the subclass of binary BCH codes is considered, as these codes can be simply and efficiently implemented in digital hardware. Before considering BCH codes, some additional theory needs to be introduced.

Theorem 3.1. ([30], p.10) The minimum distance of a linear code is the minimum Hamming weight of any non-zero codeword.

Theorem 3.2. ([30], p.10) A code with minimum distance d can correct (d-1)/2 errors.

Definition 3.2.
A linear code C is cyclic if whenever (c_0, c_1, ..., c_{n-1}) ∈ C then so is (c_{n-1}, c_0, c_1, ..., c_{n-2}). A codeword (c_0, c_1, ..., c_{n-1}) of a cyclic code can be represented as the polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}. This correspondence is very helpful, as the mathematical background of polynomials is well developed, and so this representation is used here. It is frequently convenient to define error-correcting codes in terms of the generator polynomial g(x) of the code [29]. The generator polynomial of a t-error-correcting BCH code is defined to be the least common multiple (LCM) of f_1, f_3, ..., f_{2t-1}, that is,

g(x) = LCM{f_1, f_3, f_5, ..., f_{2t-1}}    (3.1)

where f_j is the minimal polynomial of α^j (0 < j < 2t + 1), considered below. Let f_j (0 < j < 2t + 1) be the minimal polynomial of α^j; then f_j is obtained by (Theorem 2.14, [29]):

f_j(x) = Π_{i=0}^{e-1} (x + β^(2^i))    (3.2)

where β = α^j and e is the smallest integer such that β^(2^e) = β; by the same theorem, e ≤ m. To generate a codeword of an (n, k) t-error-correcting BCH code, the k information symbols are formed into the information polynomial i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, where i_j ∈ GF(2). The codeword polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1} is then formed as

c(x) = i(x)*g(x).    (3.3)

Since the degree of f_j(x) is less than or equal to m (e ≤ m, equ(3.2); [29] p.38), from equ(3.1) the degree of g(x) (and consequently the number of parity bits n-k) is at most m*t. For small values of t the number of parity check bits is usually equal to m*t ([29] p.142). For any positive integer m ≥ 3 there exist binary BCH codes (n, k) with the following parameters:

n = 2^m - 1, the length of the codeword in bits
t, the maximum number of error bits that can be corrected
k ≥ n - m*t, the number of information bits in a codeword
d_min ≥ 2t + 1, the minimum distance.

A list of BCH code parameters for m ≤ 10 is given in Appendix D. Note that for t = 1 this construction of BCH codes generates Hamming codes. The number of parity bits equals m, and so (2^m - 1, 2^m - m - 1) codes are obtained. In this case the generator polynomial satisfies g(x) = f_1(x) = p(x), where p(x) is the irreducible polynomial for GF(2^m) as given, for example, in Appendix A. In this thesis only primitive BCH codes are considered. Binary non-primitive BCH codes can be constructed in a similar manner to primitive codes ([29] p.151). Non-primitive BCH codes have a generator polynomial g(x) with β^l, β^{l+1}, β^{l+2}, ..., β^{l+d-2} as roots, where β is an element of GF(2^m) and l is a non-negative integer. Non-primitive BCH codes obtained in this way have a minimum distance of at least d. When l = 1, d = 2t + 1 and β = α, where α is a primitive element of GF(2^m), primitive BCH codes are obtained.

3.3 Encoding BCH codes
If BCH codewords are encoded as in equ(3.3), the data bits do not appear explicitly in the codeword. To overcome this, let

c(x) = x^{n-k} * i(x) + b(x)    (3.4)

where c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}, i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, b(x) = b_0 + b_1x + ... + b_{n-k-1}x^{n-k-1} and c_j, i_j, b_j ∈ GF(2). Then, if b(x) is taken to be the polynomial such that

x^{n-k} * i(x) = q(x)*g(x) - b(x)    (3.5)

the k data bits will be present in the codeword. (By implementing equ(3.4) instead of equ(3.3), systematic ([29] p.54) codewords are generated.) BCH codes are implemented as cyclic codes [42]; that is, the digital logic implementing the encoding and decoding algorithms is organised into shift-register circuits that mimic the cyclic shifts and polynomial arithmetic required in the description of cyclic codes.
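Before turning to the hardware, a minimal software sketch of equ(3.4)-(3.5) is given below: the parity polynomial b(x) is simply the remainder of x^(n-k)*i(x) divided by g(x) over GF(2). The (15, 5) example code of the next page is assumed.

```python
# Sketch of systematic encoding: polynomials are bit-lists, index = power of x.

def poly_mod(num, den):
    """Remainder of num(x) / den(x) over GF(2)."""
    num = list(num)
    for shift in range(len(num) - len(den), -1, -1):
        if num[shift + len(den) - 1]:
            for i, d in enumerate(den):
                num[shift + i] ^= d
    return num[:len(den) - 1]

def encode(info, g, n):
    """Systematic codeword: parity bits b(x), then the k information bits."""
    k = len(info)
    shifted = [0] * (n - k) + info          # x^(n-k) * i(x)
    return poly_mod(shifted, g) + info      # c(x) = b(x) + x^(n-k)*i(x)

# (15, 5) BCH code with g(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^10:
g = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
print(encode([1, 0, 1, 1, 0], g, n=15))
```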
Using the properties of cyclic codes [29, 30], the remainder b(x) can be obtained in a linear (n-k)-stage shift register with feedback connections corresponding to the coefficients of the generator polynomial g(x) = 1 + g_1x + g_2x^2 + ... + g_{n-k-1}x^{n-k-1} + x^{n-k}. Such a circuit is shown in Figure 3.1.

Figure 3.1. Encoding circuit for an (n, k) BCH code (an LFSR with taps g_1 to g_{n-k-1}, feedback switch S1 and output switch S2).

The encoder shown in Figure 3.1 operates as follows:
- For clock cycles 1 to k, the information bits are transmitted in unchanged form (switch S2 in position 2) and the parity bits are calculated in the Linear Feedback Shift Register (LFSR) (switch S1 on).
- For clock cycles k+1 to n, the parity bits in the LFSR are transmitted (switch S2 in position 1) and the feedback in the LFSR is switched off (S1 off).

As an example, the (15, 5) 3-error-correcting BCH code is considered. The generator polynomial with α, α^2, α^3, ..., α^6 as roots is obtained by multiplying the following minimal polynomials:

roots α, α^2, α^4, α^8:    f_1(x) = (x+α)(x+α^2)(x+α^4)(x+α^8) = 1 + x + x^4
roots α^3, α^6, α^12, α^9: f_3(x) = (x+α^3)(x+α^6)(x+α^12)(x+α^9) = 1 + x + x^2 + x^3 + x^4
roots α^5, α^10:           f_5(x) = (x+α^5)(x+α^10) = 1 + x + x^2

Thus the generator polynomial is g(x) = f_1(x)*f_3(x)*f_5(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^10.

3.4 Decoding BCH codes
The decoding process is far more complicated than the encoding process. As a general rule, decoding can be broken down into three separate steps:
1. Calculating the syndromes.
2. Solving the key equation.
3. Finding the error locations.

Fortunately, for some BCH codes step 2 can be omitted. To decode BCH codes in this thesis, three different strategies have been employed: for single error correcting (SEC), double error correcting (DEC), and triple and more error correcting (TMEC) BCH codes. Regarding step 1, the calculation of the syndromes is identical for all BCH codes. For SEC codes step 2 - solving the key equation - can be omitted, as a syndrome directly gives the error location polynomial coefficient. For DEC codes step 2 can also be omitted, but the error location algorithm is rather more complicated. Finally, when implementing the TMEC decoding algorithm all three steps must be carried out, and step 2 - the solution of the key equation - is the most complicated.

3.4.1 Calculation of the syndromes
Let

c(x) = c_0 + c_1x + c_2x^2 + ... + c_{n-1}x^{n-1}
r(x) = r_0 + r_1x + r_2x^2 + ... + r_{n-1}x^{n-1}
e(x) = e_0 + e_1x + e_2x^2 + ... + e_{n-1}x^{n-1}    (3.6)

be the transmitted polynomial, the received polynomial and the error polynomial respectively, so that

r(x) = c(x) + e(x).    (3.7)

The first step of the decoding process is to store the received polynomial r(x) in a buffer register and to calculate the syndromes S_j (1 ≤ j ≤ 2t-1). The most important feature of the syndromes is that they do not depend on the transmitted information but only on the error locations, as shown below. Define the syndromes S_j as

S_j = Σ_{i=0}^{n-1} r_i α^{ij}    (1 ≤ j ≤ 2t).    (3.8)

Since r_j = c_j + e_j (j = 0, 1, ..., n-1),

S_j = Σ_{i=0}^{n-1} (c_i + e_i) α^{ij} = Σ_{i=0}^{n-1} c_i α^{ij} + Σ_{i=0}^{n-1} e_i α^{ij}.    (3.9)

By the definition of BCH codes,

Σ_{i=0}^{n-1} c_i α^{ij} = 0    (1 ≤ j ≤ 2t)    (3.10)

and thus

S_j = Σ_{i=0}^{n-1} e_i α^{ij}.    (3.11)

It is therefore observed that the syndromes S_j depend only on the error polynomial e(x), and so if no errors occur the syndromes will all be zero. To generate the syndromes, express equ(3.8) as

S_j = (...((r_{n-1}*α^j + r_{n-2})*α^j + r_{n-3})*α^j + ...)*α^j + r_0.    (3.12)

Thus a circuit calculating the syndrome S_j carries out (n-1) multiplications by the constant value α^j and (n-1) single-bit summations.
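The Horner evaluation of equ(3.12) is easy to check in software. The sketch below builds GF(2^4) with p(x) = x^4 + x + 1 (the field of the S_3 example that follows) as a power table; the helper names are illustrative, not from the thesis.

```python
# Sketch of syndrome evaluation (3.12) over GF(2^4).

def build_gf(m, p_taps):
    """Power table alpha^0 .. alpha^(2^m - 2) as integers (bit i = alpha^i)."""
    powers, x = [], 1
    for _ in range(2 ** m - 1):
        powers.append(x)
        x <<= 1
        if x >> m:                           # reduce with x^m = p_taps
            x = (x & (2 ** m - 1)) ^ p_taps
    return powers

def gf_mul_int(u, v, powers):
    """Multiply two field elements given as integers, via the power table."""
    if u == 0 or v == 0:
        return 0
    logs = {e: i for i, e in enumerate(powers)}
    return powers[(logs[u] + logs[v]) % len(powers)]

def syndrome(r_bits, j, powers):
    """S_j by Horner's rule: constant multiplication by alpha^j each step."""
    s = 0
    for bit in reversed(r_bits):             # r_{n-1} processed first
        s = gf_mul_int(s, powers[j], powers) ^ bit
    return s

powers = build_gf(4, 0b0011)                 # x^4 = x + 1
r = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
print(syndrome(r, 3, powers))                # S_3 for the received word r(x)
```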
Note that because r_j ∈ GF(2), the equation S_{2i} = S_i^2 holds [29]. For example, a circuit calculating S_3 for m = 4 and p(x) = x^4 + x + 1 is presented in Figure 3.2. Initially the registers s_i (0 ≤ i ≤ 3) are set to zero. The register s_0-s_3 is then shifted 15 times as the received bits r_i (0 ≤ i ≤ 14) are clocked into the syndrome calculation circuit, after which S_3 is obtained in the s_0-s_3 register.

Figure 3.2. Circuit computing S_3 for m = 4.

Syndromes can also be calculated in a second way ([29] p.152, 165; [30] p.271). Employing this approach, S_j is obtained from the remainder in the division of the received polynomial by the minimal polynomial f_j(x), that is,

r(x) = a_j * f_j(x) + b_j(x)    (3.13)

where

S_j = b_j(α^j).    (3.14)

It should be mentioned that the minimal polynomials for α, α^2, α^4, ... are the same, and so only one register is required to calculate the syndromes S_1, S_2, S_4, ...; the rule extends to S_3, S_6, ..., and so on. For example, the circuit calculating S_3 for m = 4 is shown in Figure 3.3. The minimal polynomial of α^3 is f_3(x) = 1 + x + x^2 + x^3 + x^4. Let b(x) = b_0 + b_1x + b_2x^2 + b_3x^3 be the remainder on dividing r(x) by f_3(x); then

S_3 = b(α^3) = b_0 + b_1α^3 + b_2α^6 + b_3α^9 = b_0 + b_3α + b_2α^2 + (b_1 + b_2 + b_3)α^3.

The circuit in Figure 3.3 therefore operates by first dividing r(x) by f_3(x) to generate b(x) and then calculating b(α^3). The result is obtained after the registers b_0-b_3 have been clocked 15 times.

Figure 3.3. Second method of computing S_3 for m = 4.

3.4.2 Solving the key equation
The second stage of the decoding process is finding the coefficients of the error location polynomial σ(x) = σ_0 + σ_1x + ... + σ_tx^t using the syndromes S_j (1 ≤ j < 2t). The relationship between the syndromes and the coefficients σ_j is given by ([5], p.168)

Σ_{j=0}^{t} σ_j S_{t+i-j} = 0    (i = 1, ..., t)    (3.15)

and the roots of σ(x) give the error positions. The coefficients of σ(x) can be calculated by methods such as the Peterson-Gorenstein-Zierler algorithm [5, 43] or Euclid's algorithm [49]. In this thesis the Berlekamp-Massey Algorithm (BMA) [2, 32] has been used, as it has the reputation of being the most efficient method in practice [5]. In the BMA, the error location polynomial σ(x) is found by t-1 recursive iterations. During each iteration r the degree of σ(x) is usually incremented by one. Through this method the degree of σ(x) is exactly the number of corrupted bits, as the roots of σ(x) are associated with the transmission errors. The BMA is based on the property that for a number of iterations r greater than or equal to the number of errors t_a that have actually occurred (r ≥ t_a), the discrepancy d_r in equ(3.16) below is zero, where

d_r = Σ_{j=0}^{t} σ_j S_{2r+1-j}.    (3.16)

On the other hand, if r < t_a, the discrepancy d_r calculated in equ(3.16) is usually non-zero and is used to modify the degree and coefficients of σ(x). What the BMA essentially does, therefore, is compute the lowest-degree σ(x) such that equ(3.15) holds. The BMA with inversion is given below, where B(x) denotes the correction polynomial held in the B registers of Figure 3.4. The initial values are

d_p = 1 if S_1 = 0; d_p = S_1 if S_1 ≠ 0
σ^(0)(x) = 1
σ^(1)(x) = 1 + S_1x
B^(1)(x) = x^3 if S_1 = 0; B^(1)(x) = x^2 if S_1 ≠ 0
l_1 = 0 if S_1 = 0; l_1 = 1 if S_1 ≠ 0
r = 1.    (3.17)
The error location polynomial σ(x) is then generated by the following set of recursive equations:

d_r = Σ_{i=0}^{t} σ_i^(r) S_{2r+1-i}
bsel = 0 if d_r = 0 or r < l_r; bsel = 1 if d_r ≠ 0 and r ≥ l_r
σ^(r+1)(x) = σ^(r)(x) + d_p^-1 d_r B^(r)(x)
B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0; B^(r+1)(x) = x^2 σ^(r)(x) if bsel = 1
l_{r+1} = l_r if bsel = 0; l_{r+1} = 2r + 1 - l_r if bsel = 1
d_p = d_p if bsel = 0; d_p = d_r if bsel = 1
r = r + 1    (3.18)

These calculations are carried out for r = 1, ..., t-1. Note that the above algorithm is slightly modified in comparison with the previously presented BMA [2, 32]: due to the more complicated initial state, the number of iterations is decreased by one. In practice this causes only a slight increase in the hardware requirements, but the BMA calculation time is significantly reduced. A circuit implementing the BMA is given in Figure 3.4. The error location polynomial σ(x) is obtained in the C registers after t-1 iterations.

Figure 3.4. Berlekamp-Massey Algorithm with inversion (B and C register chains, inverter, discrepancy register d_r and syndrome inputs).

In some applications it may be beneficial to implement the BMA without inversion. A version of the BMA achieving this was presented in [8, 56]. For the inversionless BMA the initial conditions are the same as for the BMA with inversion, given in equ(3.17). The error location polynomial is then generated by the following recursive equations:

d_r = Σ_{i=0}^{t} σ_i^(r) S_{2r+1-i}
bsel = 0 if d_r = 0 or r < l_r; bsel = 1 if d_r ≠ 0 and r ≥ l_r
σ^(r+1)(x) = d_p σ^(r)(x) + d_r B^(r)(x)
B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0; B^(r+1)(x) = x^2 σ^(r)(x) if bsel = 1
l_{r+1} = l_r if bsel = 0; l_{r+1} = 2r + 1 - l_r if bsel = 1
d_p = d_p if bsel = 0; d_p = d_r if bsel = 1
r = r + 1    (3.19)

In conclusion, the inversionless BMA is more complicated and requires a greater number of multiplications than the BMA with inversion. On the other hand, inversion can take (m-1) clock cycles (see Section 2.7), and therefore even if parallel multiplication is used this constraint will slow down the algorithm. The inversionless algorithm therefore has to be implemented for some BCH codes. For SEC and DEC BCH codes the coefficients of σ(x) can be obtained directly without using the BMA. This is because for SEC BCH codes σ(x) = 1 + S_1x, and for DEC BCH codes σ(x) = 1 + σ_1x + σ_2x^2 = 1 + S_1x + (S_1^2 + S_3*S_1^-1)x^2 [2], ([30] p.321). This approach for generating σ(x) directly in terms of the syndromes can theoretically be extended to TMEC codes, but it quickly becomes too complex to implement in hardware, and so the BMA must be used.
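For reference, the sketch below gives the classical software form of the Berlekamp-Massey algorithm (processing S_1 to S_2t one syndrome at a time), which is useful for checking the modified hardware recursion above; it does not reproduce the register-level timing of Figure 3.4. It reuses build_gf and gf_mul_int from the syndrome sketch.

```python
# Classical Berlekamp-Massey over GF(2^m), as a reference model.

def gf_div(u, v, powers):
    logs = {e: i for i, e in enumerate(powers)}
    return 0 if u == 0 else powers[(logs[u] - logs[v]) % len(powers)]

def berlekamp_massey(S, powers):
    """Return sigma(x) as a coefficient list (sigma_0 first)."""
    C, B = [1], [1]
    L, m, b = 0, 1, 1
    for n in range(len(S)):
        d = S[n]                                   # discrepancy
        for i in range(1, L + 1):
            d ^= gf_mul_int(C[i], S[n - i], powers)
        if d == 0:
            m += 1
            continue
        T = list(C)
        coef = gf_div(d, b, powers)
        C += [0] * (m + len(B) - len(C))           # room for x^m * B(x)
        for i, Bi in enumerate(B):
            C[i + m] ^= gf_mul_int(coef, Bi, powers)
        if 2 * L <= n:                             # length change step
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

# Single error at position 5: S_j = alpha^(5j), sigma(x) = 1 + alpha^5 x.
powers = build_gf(4, 0b0011)
S = [powers[(5 * j) % 15] for j in range(1, 5)]    # S_1..S_4 for t = 2
print(berlekamp_massey(S, powers))
```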
3.4.3 Finding the error locations
3.4.3.1 General case
The last step in decoding BCH codes is to find the error location numbers. These values are the reciprocals of the roots of σ(x) and may be found simply by substituting 1, α, α^2, ..., α^{n-1} into σ(x). A method of achieving this by sequential substitution has been presented by Chien [10]. In the Chien search the sum

σ_0 + σ_1α^j + σ_2α^{2j} + ... + σ_tα^{tj}    (j = 0, 1, ..., k-1)    (3.20)

is evaluated on every clock cycle. If σ(α^j) = 0, the received bit r_{n-1-j} is corrupted; therefore, if on clock cycle j the sum equals zero, the received bit r_{n-1-j} should be corrected. A circuit implementing the Chien search is shown in Figure 3.5. The operation of this circuit is as follows. The registers c_0, c_1, ..., c_t are initialised with the coefficients of the error location polynomial σ_0, σ_1, ..., σ_t. The sum Σ_{j=0}^{t} c_j is calculated and, if it equals zero, an error has been found and, after being delayed in a buffer, the faulty received bit is corrected using an XOR gate. On the next clock cycle each value in the c_i register is multiplied by α^i (using a constant multiplier), and the sum Σ_{j=0}^{t} c_j is calculated again. The above operations are carried out for every transmitted information bit (that is, k times).

Figure 3.5 Chien's search circuit (registers c_0 to c_t, constant multipliers α to α^t, summation and output correction).

3.4.3.2 Finding the error locations for t = 2
In the case of DEC BCH codes, two different algorithms may be adopted. Firstly, one may adopt the general procedure, namely finding the syndromes, implementing the relatively burdensome BMA and then applying the Chien search. In this thesis, however, another approach has been adopted [53]. This algorithm does not require the error location polynomial σ(x) to be generated; instead a more sophisticated error location procedure, summarised below, is adopted. Suppose the received vector has at most two errors; then the error location polynomial σ(x) is given by [2] ([30] p.321):

σ(x) = 1 + σ_1x + σ_2x^2 = 1 + S_1x + (S_1^2 + S_3*S_1^-1)x^2.    (3.21)

Therefore, if there is no error, σ_1 = 0 and σ_2 = 0, and thus

S_1 = 0, S_3 = 0.    (3.22)

If only one error has occurred, σ_1 ≠ 0 and σ_2 = 0, and thus

S_1 ≠ 0, S_3 = S_1^3.    (3.23)

If there are two errors, σ_1 ≠ 0 and σ_2 ≠ 0, and thus

S_1 ≠ 0, S_3 ≠ S_1^3.    (3.24)

If S_1 = 0 and S_3 ≠ 0, more than two errors have occurred and the error pattern cannot be corrected. This step-by-step decoding algorithm is based around the assumption that an error has occurred at the present location, and the values in the s_1 and s_3 registers are changed accordingly. These changes are easily implemented because, if the received bit r_{n-1} is corrupted, only the first bits s_{i,0} of the registers s_1 and s_3 need be negated, where s_i = s_{i,0} + s_{i,1}x + ... + s_{i,m-1}x^{m-1} (i = 1, 3) and the s_1 and s_3 registers hold the values of S_1 and S_3 respectively. Similarly, assuming that the received bit r_{n-1-j} is corrupted, the syndrome registers are clocked j times implementing the function s_i <- s_i*α^i (with a circuit similar to the syndrome calculation circuit), and again only the first bits s_{1,0} and s_{3,0} are negated [53]. A circuit employing this algorithm is given in Figure 3.6. At first the registers s_1 and s_3 are initialised with S_1 and S_3 respectively and, using equ(3.22)-(3.24), the number of errors present is stored by clocking the values h_1, h_3 into the flip-flops p_1, p_3. It is then assumed that an error has occurred in the first position. The registers s_1 and s_3 are updated and, again using equ(3.22)-(3.24), the new number of errors present is determined. If the new number of errors has decreased, the assumption has proven correct and an error has been found. That is, the received bit r_{n-1} is corrected and the error assumption changes are introduced permanently into the s_1 and s_3 registers; in addition, the p flip-flops are clocked with the new h signals. Alternatively, if the number of errors has not decreased, the assumption was wrong, the correct bit has been received, and the changes are cancelled. The above operations are repeated for every received information bit r_{n-1-j} (0 ≤ j < k), after the s registers have been shifted (s_i <- s_i*α^i, i = 1, 3) j times.

Figure 3.6. Error location circuit for t = 2 (s_1 and s_3 registers, cubing circuit, comparators generating h_1 and h_3, flip-flops p_1 and p_3, buffer and error decision circuit).
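A behavioural sketch of the Chien search of Section 3.4.3.1 is given below. It follows the text's convention that a zero sum on clock j flags bit r_{n-1-j}; the exact correspondence between j and the bit index in a real decoder depends on the transmission order. It reuses build_gf and gf_mul_int from the earlier sketches.

```python
# Sketch of equ(3.20): registers c_i are repeatedly scaled by alpha^i,
# and a zero sum flags the bit r_{n-1-j}.

def chien_search(sigma, n, k, powers):
    """Return the list of flagged bit positions n-1-j."""
    c = list(sigma)                          # c_i initialised with sigma_i
    flagged = []
    for j in range(k):
        total = 0
        for ci in c:
            total ^= ci                      # sum over GF(2^m) is XOR
        if total == 0:                       # sigma(alpha^j) = 0
            flagged.append(n - 1 - j)
        for i in range(1, len(c)):           # constant multipliers alpha^i
            c[i] = gf_mul_int(c[i], powers[i], powers)
    return flagged

# sigma(x) = 1 + alpha^5 x has its root at alpha^10, so the sum vanishes on
# clock j = 10 and bit r_4 is flagged under the r_{n-1-j} convention
# (scanning all n positions here for illustration).
powers = build_gf(4, 0b0011)
print(chien_search([1, powers[5]], n=15, k=15, powers=powers))
```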
3.5 Reed-Solomon codes
In this thesis only binary BCH codes have been considered and only their hardware representation developed. A comparison between binary BCH codes and non-binary BCH codes - the subclass of RS codes [44] - is therefore given here. RS codes are the most efficient error correcting codes theoretically possible, and a wide body of knowledge concerning them exists [5, 30]. In addition, RS codes are especially attractive as they can correct not only random but also burst errors. In many situations the information channel has memory, and so random-error-correcting binary BCH codes are not appropriate. Binary BCH codes can correct burst errors when an interleaved code with large t is adopted, but as will be shown below this architecture is not recommended; RS codes should be used instead. RS codes operate on symbols consisting of m bits, which are elements of GF(2^m). Each codeword consists of (n = 2^m - 1, k = 2^m - 1 - 2t) such symbols, where t is the maximum number of symbols that may be corrected. The decoding of RS codes will now briefly be presented in comparison with BCH codes. The encoding process is omitted here, as it is relatively simple and therefore only slightly influences a codec's complexity. There are two different ways of decoding RS codes [5, 16]: in the time domain or in the frequency domain. Here the frequency domain decoding process is considered. Decoding may be separated into four main areas:

1. Calculation of the syndromes using the equation

S_i = Σ_{j=0}^{n-1} r_j α^{ij}    (0 ≤ i < 2t)    (3.25)

where the r_j ∈ GF(2^m) are the received symbols (see also equ(3.8)). Note that the calculation of the syndromes for BCH codes is simpler than for RS codes: for BCH codes r_i ∈ GF(2), whereas for RS codes r_i ∈ GF(2^m), and so the equation S_{2i} = S_i^2 does not hold for RS codes.

2. The Berlekamp-Massey Algorithm. The BMA is similar to the BCH case but requires twice as many iterations (for the same value of t).

3. Recursive extension, computing the equation

E_i = Σ_{l=1}^{t} σ_l E_{i-l}    (2t ≤ i ≤ n-1), where E_i = S_i for 0 ≤ i ≤ 2t-1.    (3.26)

Youzhi [56] has shown that this step can be implemented with a BMA circuit by adding only additional control signals.

4. Obtaining the error magnitudes e_i by computing the inverse transform

e_i = (1/n) Σ_{j=0}^{n-1} E_j α^{-ij}.    (3.27)

This step is not required for BCH codes and is rather more complicated than the Chien search.

If binary BCH codes and non-binary RS codes are compared, it may at first seem that BCH codes are much simpler to implement: RS codes operate on symbols and require additional steps to be computed, since not only the error locations (as with BCH codes) but also the error magnitudes have to be calculated. On closer consideration, however, it may be seen that, for example, a (15, 11) RS code can correct up to two corrupted 4-bit symbols, e.g. at least one 5-bit burst error. This code consists of 4*15 = 60 codeword bits and 4*11 = 44 information bits. Conversely, consider a comparable (63, 36) BCH code correcting 5 random bit errors. (It should be noted that this comparison of BCH and RS codes is not based on any practical experiments, and in practice different codes would perhaps be compared.) This BCH code has not only a lower information rate (k/n), but more hardware is needed in the decoder. Admittedly, the calculation of the syndromes is simpler than for the RS code, but a greater number of syndromes must be computed. Furthermore, the BMA is much more complex for the BCH code, as the number of errors is greater. RS codes do require the inverse transform calculation, which is more complicated than the equivalent Chien search circuit; taken overall, however, the hardware requirements of the RS codec are much lower.
In addition, the BCH code requires all operations to be carried out over GF(2^6), whereas the RS code operates over GF(2^4), and so more complex arithmetic circuits are required for the BCH code. In conclusion, RS codecs generally have more attractive properties and should be preferred where burst errors have to be corrected.

3.6 Conclusions
In this chapter BCH codes have been introduced. Encoding and decoding algorithms for BCH codes with different error-correcting abilities have been considered. Decoders have a more complex structure than encoders, and so the decoding process has been broken down into three separate steps. The first step is the syndrome calculation process, which is identical whatever the error-correcting ability of the code. The next step is to find the error location polynomial σ(x). This stage is the most complicated of the three, and for DEC BCH codes an alternative decoding algorithm is used which bypasses the need to generate this polynomial entirely. For TMEC BCH codes σ(x) is calculated using the relatively complex BMA, whereas for SEC BCH codes σ(x) can be expressed immediately in terms of the syndromes. The last stage of decoding is to find and correct any errors present. Two different approaches have been employed to achieve this: one for the general case and one for DEC BCH codes.

CHAPTER 4

4.1 High speed architectures for general Linear Feedback Shift Registers
A CRC check generation circuit can be implemented with a linear feedback circuit. The figure below shows the LFSR representation of a CRC with generator polynomial 1 + y + y^3 + y^5.

4.2 Architecture for the polynomial G(y) = 1 + y + y^3 + y^5

Fig. 4.1 Serial structure (a five-stage register with XOR taps corresponding to 1 + y + y^3 + y^5).

CRC codes have been used for years to detect data errors on interfaces, and their operation and capabilities are well understood.

4.3 Motivation for the parallel implementation
Cyclic redundancy check (CRC) is widely used to detect errors in data communication and storage devices. When high-speed data transmission is required, the general serial implementation cannot meet the speed requirement. Since parallel processing is a very efficient way to increase the throughput rate, parallel CRC implementations have been discussed extensively in the past decade. Although parallel processing increases the number of message bits that can be processed in one clock cycle, it can also lead to a long critical path (CP); thus the increase of throughput rate achieved by parallel processing is reduced by the decrease of circuit speed. Another issue is the increase of hardware cost caused by parallel processing, which needs to be controlled. This brief addresses these two issues of parallel CRC implementations.

4.4 Literature survey and existing systems
In the past, recursive formulas have been developed for parallel CRC hardware computation based on mathematical deduction; they have identical CPs. The parallel CRC algorithm in [2] processes an m-bit message in (m+k)/L clock cycles, where k is the order of the generator polynomial and L is the level of parallelism. However, in [1], m message bits can be processed in m/L clock cycles. The high-speed architectures for parallel long Bose-Chaudhuri-Hocquenghem (BCH) encoders in [3] and [4], which are based on multiplication and division computations on the generator polynomial, are efficient in terms of speeding up parallel linear feedback shift register (LFSR) structures. They can also be used for the LFSR of any generator polynomial.
However, their hardware cost is high.

4.5 LFSR (Linear Feedback Shift Register)
A Linear Feedback Shift Register (LFSR) is a shift register whose input bit is a linear function of its previous state. The only linear functions of single bits are xor and inverse-xor; thus it is a shift register whose input bit is driven by the exclusive-or (xor) of some bits of the overall shift register value. The initial value of the LFSR is called the seed, and because the operation of the register is deterministic, the sequence of values produced by the register is completely determined by its current (or previous) state. Likewise, because the register has a finite number of possible states, it must eventually enter a repeating cycle. However, an LFSR with a well-chosen feedback function can produce a sequence of bits which appears random and which has a very long cycle.

4.6 Serial input hardware realization

Fig 4.2. Basic LFSR architecture
Fig 4.3. Linear Feedback Shift Register implementation of CRC-32

4.7 DESIGN OF ARCHITECTURES USING DSP TECHNIQUES
4.7.1 Unfolding
Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program. Unfolding a DSP program by an unfolding factor J creates a new program that describes J consecutive iterations of the original program. It increases the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time.

4.7.2 Pipelining
Pipelining reduces the effective critical path by introducing pipelining latches along the critical data path, either to increase the clock frequency or sample speed, or to reduce power consumption at the same speed. Here it is done using a look-ahead pipelining algorithm to reduce the iteration bound of the CRC architecture.

4.7.3 Retiming
Retiming is a technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit. It moves existing delays around:
• it does not alter the latency of the system
• it reduces the critical path of the system
Retiming has many applications in synchronous circuit design, including reducing the clock period of the circuit, reducing the number of registers in the circuit, reducing the power consumption of the circuit, and logic synthesis. It can be used to increase the clock rate of a circuit by reducing the computation time of the critical path.

4.7.4 Critical path
The critical path is the path with the longest computation time among all paths that contain zero delays; its computation time is the lower bound on the clock period of the circuit.

4.7.5 Iteration bound
The iteration bound is defined as the maximum of all the loop bounds. A loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop.
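The loop-bound and iteration-bound definitions above reduce to a one-line computation; the sketch below uses the three feedback loops of the serial LFSR for g(y) = 1 + y + y^3 + y^5 analysed in Section 4.7.6, with times expressed in multiples of T_XOR.

```python
# Sketch of Section 4.7.5: iteration bound = max over loops of t/w.

def iteration_bound(loops):
    """loops: iterable of (t, w) pairs; returns max(t/w) in T_XOR units."""
    return max(t / w for t, w in loops)

# The serial LFSR of Fig. 4.4 has loop bounds T_XOR, (3/4)T_XOR and
# (3/5)T_XOR (see Section 4.7.6):
print(iteration_bound([(1, 1), (3, 4), (3, 5)]))   # -> 1.0, i.e. T_XOR
```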
In [ ], a high-speed parallel CRC implementation based on unfolding, pipelining and retiming is proposed. These parallel LFSR structures are not always efficient for general LFSR structures. Furthermore, the large fan-out problem explained in [ ] for long LFSR structures is not addressed in these papers. A pipelining technique is needed to reduce the achievable minimum clock period before parallel implementation is applied. A three-step LFSR structure is presented in [4]: the message input m(x) is first multiplied by a factor polynomial p(x); the generator polynomial g(x) is then modified as g'(x) = p(x)g(x); finally, the remainder of m(x)p(x)/g'(x) is divided by p(x), and the quotient is the expected output. The second step of this algorithm inserts as many delay elements as possible into the right-most feedback loop, which otherwise causes large fan-out and long latency when g(x) is long [4]. This three-step scheme can eliminate the effect of large fan-out. However, the feedback structure of p(x) in the third step can still limit the achievable clock frequency of the final parallel LFSR structure. Three approaches are proposed in [5] to eliminate the feedback loops in the third step. Since the speed bottleneck of the three-step algorithm in [4] is usually located in the third step, the new approaches in [5] can efficiently speed up the final parallel LFSR structures; however, their hardware cost is high. Furthermore, since the goal of the second step in [4] is merely to insert delay elements into the right-most feedback loop, and the achievable clock frequency is not necessarily determined by this feedback loop, the feedback structure obtained from the second step of [4] is not optimal for the achievable clock frequency. This paper uses different structures to solve the bottleneck in the third step of [4] with lower hardware cost. When we construct p(x), we guarantee that p(x) can be decomposed into several short-length polynomials. Since the quotient of dividing the output of the second step by p(x) can alternatively be obtained by dividing the output of the second step by a chain of the factor polynomials of p(x), and since the feedback loop of a small-length polynomial can easily be handled by the look-ahead pipelining techniques [3], the iteration bound bottleneck can be solved with a smaller number of XOR gates than needed in [5]. A search algorithm for reducing the achievable clock period of the second step, and thus of the overall parallel LFSR, is then presented after the large fan-out problem is solved.

4.7.6 IMPROVED ALGORITHM FOR ELIMINATING THE FANOUT BOTTLENECK FOR LFSR STRUCTURES
For a generator polynomial g(x) of degree (n-k) and a message sequence m(x) of degree (k-1), systematic encoding provides the codeword of degree (n-1) as c(x) = m(x)x^(n-k) + Rem(m(x)x^(n-k))_g(x), where Rem(m(x)x^(n-k))_g(x) is the remainder of dividing m(x)x^(n-k) by g(x). For example, if g(y) = 1 + y + y^3 + y^5, its corresponding LFSR structure for computing Rem(m(x)x^(n-k))_g(x) is shown in Fig. 4.4, where D denotes a delay element. The message sequence m(x) is injected from the right side with the most significant bit first; after k clock cycles, Rem(m(x)x^(n-k))_g(x) is available in the delay elements. From Fig. 4.4 we can see that the structure has 3 feedback loops, with loop bounds of T_XOR, (3/4)T_XOR and (3/5)T_XOR for loops 1, 2 and 3 respectively. The iteration bound is thus T_XOR, corresponding to the right-most feedback loop, where T_XOR is the computation time of an XOR gate. Note that the iteration bound is defined as the maximum of all the loop bounds, and a loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop [6].
The iteration bound is the minimum achievable clock period of a digital system.

Fig 4.4 LFSR structure for g(y) = 1 + y + y^3 + y^5: (a) serial structure
Fig 4.5 LFSR structure for g(y) = 1 + y + y^3 + y^5: (b) 3-parallel structure

The 3-parallel implementation of the LFSR structure in Fig. 4.4 is shown in Fig. 4.5. Observing Fig. 4.5, we can identify two issues. One is that the iteration bound has increased to 3T_XOR, which means that although the throughput rate has been increased by a factor of 3, the achievable clock frequency has decreased by a factor of 3; the achievable processing speed is thus the same as for the serial LFSR structure in Fig. 4.4. Therefore, reducing the iteration bound of the original serial LFSR structure is important before we apply parallel implementation [3]. The other issue indicated in Fig. 4.5 is that each of the three right-most XOR gates drives many other XOR gates - a large number when the generator polynomial is long - and thus causes a large fan-out delay [4]. Inserting delay elements between the right-most XOR gates and their subsequent XOR gates can solve this issue. The two right-most XOR gates in Fig. 4.5 can be separated from their subsequent XOR gates by first pipelining the inputs y(3k) and y(3k+1) and then applying retiming. However, this scheme cannot be applied to the lowest right-most XOR gate. This is caused by the fact that the number of delay elements in the right-most feedback loop in Fig. 4.4 is only 2, less than the desired parallelism level of 3. Therefore, inserting enough delay elements into the right-most feedback loop of the original serial LFSR structure is the key to solving the large fan-out issue, and this is the contribution of the three-step LFSR architecture in [4]. From m(x)x^(n-k)p(x) = q(x)g'(x) + r'(x), we can see that if we multiply both m(x)x^(n-k) and g(x) by p(x), the remainder of dividing m(x)x^(n-k)p(x) by g'(x) is r'(x) = r(x)p(x), and r(x) is the quotient of dividing r'(x) by p(x). This is the basic idea of the three-step LFSR architecture. Since g'(x) is constructed as g(x)p(x), it is very important to choose a proper p(x) for addressing the two issues discussed above. In [4] a clustered look-ahead computation is applied to find p(x). This scheme is efficient and can insert delay elements into the right-most feedback loop with the minimum increase of polynomial degree of g(x), and thus can control the increase in XOR gates. However, the iteration bound bottleneck is transferred to the LFSR in the third step. Although three approaches have been proposed in [5] to address this iteration bound issue in the third step, the hardware cost is high and the iteration bound bottleneck is again transferred to the second step.

4.7.7 Improved algorithm for eliminating the fanout bottleneck
We start with the same example used in [4].

Example 1: Given a BCH(255, 223) code using generator polynomial
g(x) = 1 + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^9 + x^14 + x^16 + x^17 + x^19 + x^20 + x^22 + x^25 + x^26 + x^27 + x^29 + x^30 + x^31 + x^32,
assume our targeted parallelism level is 8. Then we obtain:

p(x) = (1+x)(1+x^4)(1+x^5)
g'(x) = g(x)p(x) = 1 + x + x^2 + x^4 + x^7 + x^8 + x^11 + x^12 + x^14 + x^16 + x^17 + x^18 + x^25 + x^27 + x^28 + x^31 + x^33 + x^42.

This is illustrated in Fig. 4.6.

Fig 4.6 Three-step implementation of BCH(255, 223) encoding: (a) first step, (b) second step, (c) third step and (d)-(e) improved third step
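The construction of g'(x) = g(x)p(x) is plain GF(2) polynomial multiplication and can be checked in a few lines of Python; representing polynomials as sets of exponents keeps the XOR cancellations explicit.

```python
# Sketch: constructing g'(x) = g(x)p(x) over GF(2) for Example 1.

def gf2_poly_mul(a_exps, b_exps):
    """Multiply two GF(2) polynomials given as sets of exponents."""
    out = set()
    for i in a_exps:
        for j in b_exps:
            out ^= {i + j}          # XOR: coefficients cancel in pairs
    return out

g = {0, 2, 3, 4, 5, 6, 7, 9, 14, 16, 17, 19, 20, 22,
     25, 26, 27, 29, 30, 31, 32}
p = gf2_poly_mul(gf2_poly_mul({0, 1}, {0, 4}), {0, 5})
g_prime = gf2_poly_mul(g, p)
print(sorted(g_prime))              # matches the g'(x) listed above
```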
The improved third step of Fig. 4.6(d) is based on the look-ahead pipelining scheme discussed as follows. The operation of 1/(1+x^k) is shown in Fig. 4.7(a), where we implement

b(n+k) = a(n) + b(n).    (4.1)

If we apply k steps of look-ahead to (4.1), we obtain

b(n+2k) = a(n+k) + b(n+k) = a(n+k) + a(n) + b(n).    (4.2)

The corresponding hardware implementation is shown in Fig. 4.7(b). A further 2k steps of look-ahead applied to (4.2) gives

b(n+4k) = a(n+3k) + a(n+2k) + b(n+2k) = a(n+3k) + a(n+2k) + a(n+k) + a(n) + b(n)    (4.3)

which can be implemented with the structure shown in Fig. 4.7(c).

Fig 4.7 Look-ahead pipelining for 1/(1+x^k): (a) original hardware implementation for 1/(1+x^k), (b) 2k-level look-ahead pipelining and (c) 4k-level look-ahead pipelining

Replacing the 1/(1+x) in Fig. 4.6(c) with the structure of Fig. 4.7(c) for k = 1, Fig. 4.6(c) can be pipelined as shown in Fig. 4.6(d). Note that in Fig. 4.6(d) there are two 1/(1+x^4) blocks in a row; over GF(2) they can be simplified to 1/(1+x^8). This process is shown in Fig. 4.6(e). From Example 1 we can see that the difference between the proposed scheme and the previous three-step LFSR structures [4][5] lies in the construction of p(x). In this paper p(x) is constructed from small-length polynomials, which are easy to handle with the proposed look-ahead pipelining algorithm; in [4][5], p(x) is obtained and handled as a single long polynomial, which usually leads to a large iteration bound and is difficult to handle. When p(x) is decomposed into small-length polynomials, the operational characteristics of 1/p(x) are slightly different from when it is implemented as a single long polynomial. We show this difference with the example in Fig. 4.8; a strict proof is given in Appendix A. From Fig. 4.8 we can see that the structure for p(x) = 1 + x + x^2 + x^3 yields both the correct remainder and the correct quotient of m(x)/p(x), while the structure for p(x) = (1+x)(1+x^2) provides only the correct quotient. However, using p(x) = (1+x)(1+x^2) is sufficient, because only the quotient is needed in the third step. Furthermore, in the structure for p(x) = (1+x)(1+x^2) shown in Fig. 4.8(b), the quotient is obtained without any latency. This is an important advantage, because the structure for p(x) = 1 + x + x^2 + x^3 in Fig. 4.8(a) has a latency of 3 clock cycles, equal to the degree of p(x). The iteration bound of the third step can thus easily be reduced to any small value, and the iteration bound of the overall LFSR structure is determined by the second step, which is not handled in [4][5]. We discuss this issue in Section 4.8.

Fig 4.8 Operation of m(x)/p(x) for m(x) = 1 + x + x^2 + x^3 + x^4 + x^5 + x^6: (a) p(x) = 1 + x + x^2 + x^3 and (b) p(x) = (1+x)(1+x^2)
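The Fig. 4.8 claim - that dividing by the factor chain reproduces the quotient but not the remainder - is easy to confirm in software:

```python
# Sketch checking Fig. 4.8: quotient of m(x)/p(x) via a chain of factors.
# Polynomials are coefficient lists, x^0 first.

def gf2_divmod(num, den):
    """Quotient and remainder of num(x)/den(x) over GF(2)."""
    num, q = list(num), [0] * max(len(num) - len(den) + 1, 1)
    for shift in range(len(num) - len(den), -1, -1):
        if num[shift + len(den) - 1]:
            q[shift] = 1
            for i, d in enumerate(den):
                num[shift + i] ^= d
    return q, num[:len(den) - 1]

m = [1] * 7                                        # m(x) = 1 + x + ... + x^6
q_direct, r_direct = gf2_divmod(m, [1, 1, 1, 1])   # p(x) = 1+x+x^2+x^3
q1, _ = gf2_divmod(m, [1, 1])                      # divide by (1 + x) ...
q_chain, _ = gf2_divmod(q1, [1, 0, 1])             # ... then by (1 + x^2)
print(q_direct, q_chain)                           # same quotient: x^3
print(r_direct)                                    # remainder only from (a)
```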
4.8 PROPOSED ALGORITHM FOR REDUCING THE ITERATION BOUND OF THE LFSR STRUCTURE
As can be seen from TABLE I, we can keep multiplying g(x) by short-length polynomials such as 1+x^k to insert as many delay elements into the right-most feedback loop of g(x) as we need in each iteration, eliminating the fan-out bottleneck of an LFSR structure. After this fan-out bottleneck is eliminated, our goal should be to reduce the iteration bound of the entire LFSR structure, which is now located in the second step. We have seen the advantage of multiplying g(x) by a short-length polynomial such as 1+x^k: the third step then neither causes the iteration bound bottleneck nor adds latency. We now show its advantage for reducing the iteration bound of the second step, and thus of the whole LFSR structure. We start with the third iteration shown in TABLE I. After the third iteration, g'(x) no longer has a fan-out bottleneck for an unfolding factor of J = 8. Its iteration bound is

T_XOR * max{2/9, 3/11, 4/14, 5/15, 6/17, 8/25, 9/26, 10/28, 11/30, 12/31, 13/34, 14/35, 15/38, 16/40, 17/41, 17/42} = 0.4146 T_XOR

which is located in the 16th feedback loop. If we keep multiplying g'(x) by 1+x^k to reduce its iteration bound, there are 17 possibilities for k, corresponding to the 17 feedback loops in g'(x). After trying all 17 possibilities, we conclude that 1+x^17 reduces the iteration bound of the second step the most, from 0.4146 T_XOR to 0.3621 T_XOR. g''(x) = (1+x^17)g'(x) is given in TABLE II. Based on the discussion of the third step in Section 4.7.7, 1/(1+x^17) is far from causing the iteration bound bottleneck of the entire LFSR structure, which has been reduced to 0.3621 T_XOR. We can keep multiplying g''(x) by 1+x^k to lower the iteration bound even further; for example, after the 6th iteration shown in TABLE II, the iteration bound of the LFSR structure has been reduced to 0.3369 T_XOR. Note that these optimised iteration bounds are not achieved without extra cost: the number of required XOR gates has increased from 2 to 76. Although multiplying g'(x) changes the LFSR structure for a lower iteration bound, the elimination of the fan-out bottleneck is maintained. This is because the elimination of the fan-out bottleneck is provided by the right-most feedback loop of the second step of the LFSR structure, and multiplying g'(x) by 1+x^k maintains the structure on the right side of the feedback loop corresponding to 1+x^k. This property is illustrated in TABLE II. From TABLE II we have p(x) = (1+x)(1+x^4)(1+x^5) and

p'(x) = p(x)(1+x^17)(1+x^43)(1+x^86) = (1+x^8)(1+x^5)(1+x^17)(1+x^43)(1+x^86) / [(1+x)(1+x^2)].

The improved three-step implementation of BCH(255, 223) encoding is then shown in Fig. 4.9. From the discussion so far, we can summarise the proposed high-speed VLSI algorithm for general LFSR structures as follows:
1) Iteratively multiply g(x) by short-length polynomials to insert as many delay elements into the right-most feedback loop of g(x) as needed in each iteration, eliminating the fan-out bottleneck of the LFSR structure. The iteration exits when the number of delay elements in the right-most feedback loop of g'(x) is not less than the targeted unfolding factor J. The simplest short-length polynomials we can use are 1+x^k, where k is the degree difference of the two highest-degree terms in g'(x). Another way to find short-length polynomials is to partially borrow Algorithm A in [4, Section III]: instead of obtaining one long polynomial p(x) by using Algorithm A once, we can apply it multiple times and limit the length of each obtained p(x) until the number of delay elements in the right-most feedback loop of g'(x) is not less than the targeted unfolding factor J. Although the first method is guaranteed to find a p(x), the second method may be more hardware efficient, because it can lead to a g'(x) of lower degree. Note that eliminating the fan-out bottleneck is not needed when g(x) is short.
2) Iteratively multiply g'(x) by the 1+x^k which leads to the smallest iteration bound for the current g''(x). For each iteration, the number of possible values of k is the same as the number of feedback loops of the current g''(x). The iteration exits when the desired iteration bound, or the best iteration bound for a certain hardware cost requirement, is reached.
Fig. 4.9 Improved three-step implementation of BCH(255,223) encoding: (a) first step, (b) second step, (c) third step

4.9 BCH ENCODER ARCHITECTURE

An (n, k) binary BCH code encodes a k-bit message into an n-bit code word. A k-bit message (m_{k-1}, m_{k-2}, ..., m_0) can be regarded as the coefficients of a degree k−1 polynomial m(x) = m_{k-1}x^{k-1} + m_{k-2}x^{k-2} + ... + m_0, where m_{k-1}, m_{k-2}, ..., m_0 ∈ GF(2). Likewise, the corresponding n-bit code word (c_{n-1}, c_{n-2}, ..., c_0) can be regarded as the coefficients of a degree n−1 polynomial c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + ... + c_0, where c_{n-1}, c_{n-2}, ..., c_0 ∈ GF(2). The encoding of BCH codes can be expressed simply as c(x) = m(x)g(x), where the degree n−k polynomial g(x) = g_{n-k}x^{n-k} + g_{n-k-1}x^{n-k-1} + ... + g_0 (g_{n-k}, g_{n-k-1}, ..., g_0 ∈ GF(2)) is the generator polynomial of the BCH code. Usually g_{n-k} = g_0 = 1. However, systematic encoding is generally desired, since the message bits then appear unchanged in the code word. Systematic encoding can be implemented as

    c(x) = m(x) · x^{n-k} + Rem_{g(x)}(m(x) · x^{n-k}),    (1)

where Rem_{g(x)}(f(x)) denotes the remainder polynomial of dividing f(x) by g(x). The architecture of a systematic BCH encoder is shown in Fig. 4.10. During the first k clock cycles, the two switches are connected to the 'a' port, and the k-bit message is input to the LFSR serially, most significant bit (MSB) first. Meanwhile, the message bits are also sent to the output to form the systematic part of the code word. After k clock cycles, the switches are moved to the 'b' port. At this point, the n−k registers contain the coefficients of Rem_{g(x)}(m(x) · x^{n-k}). The remainder bits are then shifted out of the registers to the code word output, bit by bit, to form the remaining systematic code word bits. For binary BCH codes, the multipliers in Fig. 4.10 can be replaced by a connection or no connection according to whether g_i (0 ≤ i < n−k) is '1' or '0', respectively. The critical path of this architecture consists of two XOR gates, and the output of the right-most XOR gate is input to all the other XOR gates. For long BCH codes, this architecture may therefore suffer from the long delay of the right-most XOR gate caused by its large fanout. Although the serial BCH encoder architecture is quite straightforward, when it cannot run as fast as the application requires, parallel architectures must be employed; the fanout bottleneck exists in parallel architectures as well.
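The serial divider just described fits in a few lines of RTL. The following is a minimal sketch of ours (module and signal names hypothetical), with the two switches of Fig. 4.10 reduced to a single enable: while msg_en is high the message is fed in and simultaneously forwarded to the output; afterwards the remainder is shifted out. The single feedback signal fans out to every tap, which is exactly the fanout problem analyzed below.

    // Serial systematic encoder LFSR in the style of Fig. 4.10 (illustrative).
    // GEN[i] holds g_i; for binary BCH the constant multipliers degenerate to
    // connect/no-connect, modeled here as an AND with a constant bit.
    module serial_lfsr_encoder #(
        parameter NK = 32,                   // n - k = deg g(x)
        parameter [NK-1:0] GEN = {NK{1'b0}}  // g_i in GEN[i]; g_{n-k} = 1 implicit
    ) (
        input  wire clk, rst,
        input  wire msg_en,                  // high during the first k cycles
        input  wire m_in,                    // message bit, MSB first
        output wire out_bit                  // message, then remainder bits
    );
        reg  [NK-1:0] r;
        // right-most XOR; its output fans out to every tap below
        wire fb = msg_en ? (m_in ^ r[NK-1]) : 1'b0;
        assign out_bit = msg_en ? m_in : r[NK-1];
        integer i;
        always @(posedge clk)
            if (rst) r <= {NK{1'b0}};
            else begin
                r[0] <= fb;                           // g_0 = 1
                for (i = 1; i < NK; i = i + 1)
                    r[i] <= r[i-1] ^ (GEN[i] & fb);   // tap wherever g_i = 1
            end
    endmodule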
4.10 PARALLEL BCH ENCODER WITH ELIMINATED FANOUT BOTTLENECK

In the serial BCH encoder of Fig. 4.10, the effect of large fanout can always be eliminated by retiming [7]. To keep the notation simple, we refer to the input of the right-most XOR gate that comes from the delayed output of the second XOR gate from the right as the horizontal input (Hinput). In Fig. 4.10 there is at least one register at the Hinput of the right-most XOR gate, and registers can be added to the message input. Therefore, retiming can always be performed along the dotted cutset by removing one register from each input of the right-most XOR gate and adding one to its output.

FIG 4.10

For clarity, the switches are omitted from the LFSR in Fig. 4.10 and from the other figures in the remainder of this work. However, if unfolding is applied directly to Fig. 4.10, retiming cannot be applied in an obvious way to eliminate the large fanout. The original architecture can be expressed as a data flow graph (DFG): nodes connected by paths with delays, where each XOR gate in the LFSR is a node of the DFG. In the J-unfolded architecture there are J copies of each node, each with the same function as in the original architecture (see Chapter 5 of [6]); the total number of delay elements, however, does not change.

4.11 Retimed LFSR

Assume there is a path from node U to node V in the original architecture with W delay elements. In the J-unfolded architecture, node U_i is connected to node V_{(i+W) mod J} with ⌊(i+W)/J⌋ delay elements, where U_i, V_j (0 ≤ i, j < J) are the copies of nodes U and V, respectively. Therefore, if the unfolding factor J is greater than W, the unfolded architecture contains W paths with one delay element and J−W paths without any delay element. For example, Fig. 4.11(a) shows an LFSR with generator polynomial g(x) = x^3 + x + 1. In this example there are two registers on the path connecting the output of the left XOR gate to the input of the right XOR gate. In the 3-unfolded architecture illustrated in Fig. 4.11(b), there are W = 2 paths with one delay from the outputs of the copies of the left XOR gate to the inputs of the copies of the right XOR gate, and 3 − 2 = 1 path without any delay. The unfolded LFSR in Fig. 4.11(b) cannot be retimed to eliminate the fanout problem for every copy of the right XOR gate.

Fig. 4.11 (a) An LFSR example; (b) 3-unfolded version of the LFSR in (a)
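The example of Fig. 4.11 can be made concrete in RTL. Below is a sketch of ours: the serial LFSR for g(x) = x^3 + x + 1, and a 3-unfolded version that applies the same next-state function three times combinationally per clock. In the unfolded module, each copy of the right-most XOR drives a purely combinational path into the next copy, which is why retiming alone cannot remove its fanout, as just discussed.

    // Serial LFSR for g(x) = x^3 + x + 1 (illustrative).
    module lfsr_g3_serial (
        input  wire clk, rst, d,
        output reg [2:0] r
    );
        wire fb = d ^ r[2];                        // right-most XOR
        always @(posedge clk)
            if (rst) r <= 3'b000;
            else     r <= {r[1], r[0] ^ fb, fb};   // taps: g2 = 0, g1 = 1, g0 = 1
    endmodule

    // 3-unfolded version: processes d[2] (earliest bit) down to d[0] each cycle.
    module lfsr_g3_par3 (
        input  wire clk, rst,
        input  wire [2:0] d,                       // three input bits per clock
        output reg [2:0] r
    );
        function [2:0] step(input [2:0] s, input din);
            reg fb;
            begin
                fb   = din ^ s[2];
                step = {s[1], s[0] ^ fb, fb};
            end
        endfunction
        wire [2:0] s1 = step(r,  d[2]);            // copy 0 of the XOR pair
        wire [2:0] s2 = step(s1, d[1]);            // copy 1, fed combinationally
        wire [2:0] s3 = step(s2, d[0]);            // copy 2, fed combinationally
        always @(posedge clk)
            if (rst) r <= 3'b000;
            else     r <= s3;
    endmodule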
If the generator polynomial can be expressed as

    g(x) = x^{t0} + x^{t1} + ... + x^{t_{l-2}} + 1,    (2)

where t0, t1, ..., t_{l-2} are positive integers with t0 > t1 > ... > t_{l-2} and l is the total number of non-zero terms of g(x), then there are t0 − t1 consecutive registers at the Hinput of the right-most XOR gate in Fig. 4.10. If a J-unfolded BCH encoder is desired, t0 − t1 ≥ J must be satisfied to ensure that there is at least one register at the Hinput of each of the J copies of the right-most XOR gate, so that retiming can move one register to the output. Meanwhile, J registers need to be added to the message input to enable retiming.

In the case of t0 − t1 < J, the generator polynomial must be modified to enable retiming of the right-most XOR gate in the J-unfolded architecture. Assuming the original (n, k) BCH code uses a generator polynomial g(x) of degree n−k, the message input m(x) multiplied by x^{n-k} can be written as

    m(x)x^{n-k} = q(x)g(x) + r(x),    (3)

where q(x) and r(x) are the quotient and remainder polynomials of dividing m(x)x^{n-k} by g(x), respectively. Multiplying both sides of (3) by p(x), we get

    m(x)p(x)x^{n-k} = q(x)(g(x)p(x)) + r(x)p(x).    (4)

Let g'(x) = p(x)g(x), and denote its two highest-degree terms by x^{t0'} and x^{t1'}.

CHAPTER 5

BCH CODES

Example I. Given a BCH(255,223) code using the generator polynomial g(x) = x^32 + x^31 + x^30 + x^29 + x^27 + x^26 + x^25 + x^22 + x^20 + x^19 + x^17 + x^16 + x^14 + x^9 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1, we want to find p(x) such that t0' − t1' ≥ 8 in g'(x). In this example E should be set to 8, and a = 32, b = 31 at the beginning of Algorithm A. The intermediate values after each iteration of Algorithm A are given below.

After iteration I: p̃(x) = 1 + x^-1; g̃(x) = x^32 + x^28 + x^27 + x^24 + x^22 + x^21 + x^20 + x^18 + x^17 + x^15 + x^14 + x^13 + x^9 + x^8 + x^7 + x + 1 + x^-1; num = 1; b = 28; a − b = 4 < 8; continue.

After iteration II: p̃(x) = 1 + x^-1 + x^-4; g̃(x) = x^32 + x^26 + x^25 + x^24 + x^23 + x^20 + x^17 + x^16 + x^14 + x^12 + x^10 + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x^-2 + x^-4; num = 4; b = 26; a − b = 6 < 8; continue.

After iteration III: p̃(x) = 1 + x^-1 + x^-4 + x^-6; g̃(x) = x^32 + x^21 + x^19 + x^17 + x^13 + x^12 + x^11 + x^9 + x^7 + x^5 + x^2 + x + 1 + x^-1 + x^-3 + x^-6; num = 6; b = 21; a − b = 11 > 8; stop.

Final step: p(x) = p̃(x)x^6 = x^6 + x^5 + x^2 + 1, and g'(x) = g̃(x)x^6 = x^38 + x^27 + x^25 + x^23 + x^19 + x^18 + x^17 + x^15 + x^13 + x^11 + x^8 + x^7 + x^6 + x^5 + x^3 + 1.

According to (4), the modified method of finding Rem_{g(x)}(m(x)x^{n-k}) in the BCH encoding can be implemented by the steps illustrated in Fig. 5.1. Each step is explained below using the g(x), p(x) and g'(x) derived in Example I.

FIG 5.1 Block diagram of the modified BCH encoding

The first step in Fig. 5.1 is to multiply the message input polynomial by p(x). This can be implemented by adding delayed message inputs according to the coefficients of p(x). For example, using the p(x) derived in Example I, this step can be implemented by the diagram in Fig. 5.2, whose four taps correspond to 1, x^2, x^5 and x^6, respectively.

FIG 5.2 Step 1 of the modified BCH encoding

After m(x)p(x) is computed, it is fed into the second block to compute Rem_{g'(x)}(m(x)p(x)x^{n-k}) using an LFSR architecture similar to that of Fig. 4.10. However, since deg(g'(x)) > n−k, the product of p(x) and m(x) should be added to the output of the (n−k)-th register from the left, instead of to the output of the right-most register. The addition of m(x)p(x) can break the run of a − b consecutive registers at the Hinput of the right-most XOR gate in the LFSR. The implementation of the second step using the BCH code of Example I is illustrated in Fig. 5.3. As can be observed in Fig. 5.3, there are 38 − 27 = 11 consecutive registers at the Hinput of the right-most XOR gate according to g'(x). However, after adding m(x)p(x) to the output of the 32nd register, only 6 consecutive registers remain; therefore, at most 6-unfolding can be applied to Fig. 5.3 without suffering from the large fanout problem. At the end of Algorithm A, deg(g'(x)) = num + deg(g(x)) = n − k + num. Hence only num consecutive registers remain after adding m(x)p(x), where num ≤ E − 1 at the end of Algorithm A. Therefore E is usually set larger than the desired unfolding factor J, to ensure num ≥ J at the end of Algorithm A.

FIG 5.3 Step 2 of the modified BCH encoding

FIG 5.4 Step 3 of the modified BCH encoding

Alternatively, at the expense of a slight increase in critical path and latency, the delays at the input of the last XOR gate can be retimed and moved to its output. For example, the 5 delays at the Hinput of the last XOR gate in Fig. 5.3 can be retimed and moved to the output of this XOR gate. This requires first adding 5 delays to the m(x)p(x) input. The penalty is an increase of the critical path of the serial encoder to 2 XOR gates.

In the third step, Rem_{g'(x)}(m(x)p(x)x^{n-k}) needs to be divided by p(x) to obtain the final result. An architecture similar to that of Fig. 4.10 can again be used, except that the input data is added at the input of the left-most register, since the input polynomial does not need to be multiplied by any power of x. The third step of the modified BCH encoding, using the p(x) derived in Example I, is illustrated in Fig. 5.4. Unfolding the modified BCH encoder of Fig. 5.1 by a factor J yields a parallel architecture capable of processing J message bits at a time. In the J-unfolded block that computes p(x)m(x) there is no feedback loop, so it can be pipelined to achieve the desired clock frequency.
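As an illustration, step 1 for the p(x) of Example I is a feedback-free FIR structure over GF(2). A sketch of ours (hypothetical names):

    // Step 1 of Fig. 5.2: multiply the serial message stream by
    // p(x) = 1 + x^2 + x^5 + x^6 using a 6-stage delay line; the taps at
    // delays 0, 2, 5 and 6 are XORed together. No feedback loop exists,
    // so in the J-unfolded architecture this XOR tree can be pipelined freely.
    module mul_by_px (
        input  wire clk, rst,
        input  wire m_in,              // m(x), MSB first
        output wire mp_out             // m(x) * p(x)
    );
        reg [6:1] dly;                 // dly[i] = m_in delayed by i cycles
        always @(posedge clk)
            if (rst) dly <= 6'b000000;
            else     dly <= {dly[5:1], m_in};
        assign mp_out = m_in ^ dly[2] ^ dly[5] ^ dly[6];
    endmodule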
In the second block, since the LFSR of the modified generator polynomial g'(x) has at least J registers at the Hinput of the right-most XOR gate, retiming can be applied to the J-unfolded architecture to eliminate the effect of the large fanout, after adding J registers to the output of m(x)p(x). Although the fanout problem does not exist in the third block of Fig. 5.1, it can appear in its unfolded architecture. Since the polynomial p(x) enables g'(x) = p(x)g(x) to have consecutive zero coefficients after the highest-power term, the difference of the two highest powers of p(x) equals t0 − t1 < J. After J-unfolding is applied, some copies of the right-most XOR gate are connected to l_p − 2 XOR gates, where l_p is the number of non-zero terms of p(x). In the worst case l_p is at most E + 1 − (t0 − t1), and E is set as small as possible, only slightly larger than the unfolding factor J. Usually the desired unfolding factor is far smaller than the length of g(x) for long BCH codes. Hence the delay caused by the fanout of the division by p(x) is far smaller than that of the division by g(x) in the original BCH encoder.

5.1 BCH DECODER ARCHITECTURE

In this section a parallel BCH decoder is presented. Syndrome-based BCH decoding consists of three major steps [3]: syndrome generation, key equation solving, and Chien search. Here R denotes the hard decision of the information received from the noisy channel and D the decoded codeword; S and Λ represent the syndromes of the received polynomial and the error locator polynomial, respectively.

5.2 Syndrome Generator

For t-error-correcting BCH codes, the 2t syndromes of the received polynomial are evaluated as

    S_j = R(α^j) = Σ_{i=0}^{n-1} R_i (α^j)^i,    (1)

for 1 ≤ j ≤ 2t. If 2t conventional syndrome generator units, shown in part (a) of the figure below, are used at the same time independently, n clock cycles are necessary to compute all 2t syndromes. However, if each syndrome generator unit is replaced by a parallel unit with parallel factor p, depicted in part (b), which processes p bits per clock cycle, only ⌈n/p⌉ clock cycles are needed. It is worth noting that for binary BCH codes the even-indexed syndromes are the squares of lower-indexed ones, i.e., S_{2j} = S_j^2. Based on this property, only t parallel syndrome generator units are actually required, computing the odd-indexed syndromes, followed by a much simpler field squaring circuit that generates the even-indexed syndromes.

5.3 Key Equation Solver

Either Peterson's algorithm or the Berlekamp-Massey (BM) algorithm [3] can be employed to solve the key equation for Λ(x). The inversion-free BM algorithm and its efficient implementations are readily found in the literature [2][4] and are not considered further here.

5.4 Chien Search

Once Λ(x) is found, the decoder searches for the error locations by checking whether Λ(α^i) = 0 for 0 ≤ i ≤ n − 1.

Fig: Syndrome generator unit: (a) conventional architecture; (b) parallel architecture with parallel factor p
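For concreteness, one serial syndrome unit can be built around Horner's rule: S_j = R(α^j) is accumulated as s ← s·α^j + R_i, one received bit per clock. The sketch below (ours) shows the unit for S_1 over GF(2^4) with primitive polynomial x^4 + x + 1, chosen only to keep the constant multiplier small; a code of length n = 255 such as the BCH(255,223) code above would use GF(2^8) instead, and a p-parallel unit simply applies p Horner steps per cycle.

    // Serial syndrome unit for S_1 = R(alpha) over GF(2^4),
    // primitive polynomial x^4 + x + 1 (illustrative assumption).
    module syndrome_unit_s1 (
        input  wire clk, rst,
        input  wire r_in,          // received bit R_i, entered i = n-1 down to 0
        output reg [3:0] s         // holds R(alpha) after n cycles
    );
        // Constant multiplication by alpha modulo x^4 + x + 1:
        // (s3 a^3 + s2 a^2 + s1 a + s0) * a = s2 a^3 + s1 a^2 + (s0^s3) a + s3
        wire [3:0] s_times_alpha = {s[2], s[1], s[0] ^ s[3], s[3]};
        always @(posedge clk)
            if (rst) s <= 4'b0000;
            else     s <= s_times_alpha ^ {3'b000, r_in};   // Horner step
    endmodule

For S_j with j > 1, the wire pattern of s_times_alpha is replaced by the constant multiplier for α^j, which remains a fixed XOR network.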
CHAPTER 6

APPENDIX A: VERILOG HDL

The implementation of the high-speed LFSR is done using Verilog HDL. In the semiconductor and electronic design industry, Verilog is a hardware description language (HDL) used to model electronic systems. Verilog HDL, not to be confused with VHDL, is most commonly used in the design, verification and implementation of digital logic chips at the register transfer level (RTL) of abstraction. It is also used in the verification of analog and mixed-signal circuits.

6.1 Overview

Hardware description languages such as Verilog differ from software programming languages in that they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators: a blocking assignment (=) and a non-blocking assignment (<=). The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables. Because these concepts are part of Verilog's language semantics, designers can quickly write descriptions of large circuits in a relatively compact and concise form. At the time of Verilog's introduction (1984), it represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture and specially written software programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with a syntax similar to the C programming language, which was already widely used in engineering software development. Verilog is case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), equivalent control-flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires bit widths on net/reg types), the demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy and communicate with other modules through a set of declared input, output and bidirectional ports. Internally, a module can contain any combination of net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block, but the blocks themselves are executed concurrently, making Verilog a dataflow language.

Verilog's concept of a 'wire' consists of both signal values (4-state: 1, 0, floating, undefined) and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved as a function of the source drivers and their strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a netlist, a logically equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific VLSI technology. Further manipulations of the netlist ultimately lead to a circuit fabrication blueprint (such as a photomask set for an ASIC, or a bitstream file for an FPGA).
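A classic illustration of the non-blocking assignment mentioned above: two registers exchanging values with no temporary variable, because both right-hand sides are sampled before either register updates (example of ours).

    // Both right-hand sides are read at the clock edge before any update,
    // so a and b swap cleanly every cycle; with blocking '=' the second
    // statement would see the already-updated value of a.
    module swap_demo (
        input  wire clk,
        input  wire load, ain, bin,
        output reg  a, b
    );
        always @(posedge clk)
            if (load) begin
                a <= ain;
                b <= bin;
            end else begin
                a <= b;
                b <= a;
            end
    endmodule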
6.2 History

Verilog was invented by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated Integrated Design Systems (renamed Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation was purchased by Cadence Design Systems in 1990, and Cadence now has full proprietary rights to Gateway's Verilog and the Verilog-XL logic simulator.

6.3 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language available for open standardization. Cadence transferred Verilog into the public domain under the Open Verilog International (OVI) organization (now known as Accellera). Verilog was later submitted to the IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95. In the same time frame, Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS, which encompasses Verilog-95.

6.4 Verilog 2001

Extensions to Verilog-95 were submitted back to the IEEE to cover deficiencies that users had found in the original standard. These extensions became IEEE Standard 1364-2001, known as Verilog-2001, a significant upgrade from Verilog-95. First, it adds explicit support for (two's complement) signed nets and variables. Previously, code authors had to perform signed operations using awkward bit-level manipulations; for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value. The same function can be described far more succinctly in Verilog-2001 using the built-in operators +, -, /, *, >>>. A generate/endgenerate construct (similar to VHDL's generate) allows Verilog-2001 to control instance and statement instantiation through normal decision operators (case/if/else); using generate, Verilog-2001 can instantiate an array of instances with control over the connectivity of the individual instances. File I/O was improved by several new system tasks. Finally, a few syntax additions were introduced to improve code readability (e.g., always @*, named parameter override, C-style function/task/module header declarations). Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

6.5 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005) consists of minor corrections, spec clarifications and a few new language features (such as the uwire keyword). A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and mixed-signal modeling with traditional Verilog.
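The two Verilog-2001 features singled out above, signed arithmetic and generate, look like this in practice (sketch of ours):

    module v2001_features #(parameter N = 4) (
        input  wire clk,
        input  wire signed [7:0] x, y,
        output wire signed [8:0] sum,   // sign and carry handled by the language
        input  wire [N-1:0] d,
        output reg  [N-1:0] q
    );
        assign sum = x + y;             // no hand-built Boolean carry logic needed

        // generate: an array of N flip-flops, one copy of the always block
        // per loop index, with per-instance connectivity.
        genvar i;
        generate
            for (i = 0; i < N; i = i + 1) begin : ff
                always @(posedge clk)
                    q[i] <= d[i];
            end
        endgenerate
    endmodule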
6.6 Design Styles

Verilog, like any other hardware description language, permits a design to follow either a bottom-up or a top-down methodology.

Bottom-Up Design. The traditional method of electronic design is bottom-up: each design is performed at the gate level using standard gates. With the increasing complexity of new designs this approach is nearly impossible to maintain, as new systems consist of ASICs or microprocessors with a complexity of thousands of transistors. Such traditional bottom-up designs have had to give way to structural, hierarchical design methods; without these practices it would be impossible to handle the complexity.

Top-Down Design. The desired design style of most designers is top-down. A true top-down design allows early testing, easy migration between technologies and a structured system design, and offers many other advantages. However, it is very difficult to follow a pure top-down design, so most designs are a mix of both methods, implementing key elements of each. The figure shows a top-down design approach.

6.7 Verilog Abstraction Levels

Verilog supports designing at many different levels of abstraction. Three of them are especially important: the behavioral level, the register-transfer level and the gate level.

Behavioral level. This level describes a system in terms of concurrent algorithms. Each algorithm is itself sequential, meaning it consists of a set of instructions executed one after the other; functions, tasks and always blocks are the main elements. There is no regard for the structural realization of the design.

Register-Transfer Level. Designs at the register-transfer level specify the characteristics of a circuit by operations and by the transfer of data between registers. An explicit clock is used, and RTL design contains exact timing bounds: operations are scheduled to occur at specific times. The modern definition of RTL code is simply "any code that is synthesizable".

Gate Level. At the gate level, the characteristics of a system are described by logical links and their timing properties. All signals are discrete and can only take definite logical values ('0', '1', 'X', 'Z'). The usable operations are predefined logic primitives (AND, OR, NOT, etc.). Writing gate-level code by hand is rarely a good idea at any stage of logic design; gate-level netlists are instead generated by tools such as synthesis tools, and are used for gate-level simulation and for the backend flow.
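The three levels can be contrasted on one tiny example, a 2-to-1 multiplexer (sketches of ours):

    // Behavioral: an algorithmic description.
    module mux_beh (input wire a, b, sel, output reg y);
        always @(*)
            if (sel) y = b;
            else     y = a;
    endmodule

    // Register-transfer / dataflow: operations on signals.
    module mux_rtl (input wire a, b, sel, output wire y);
        assign y = sel ? b : a;
    endmodule

    // Gate level: predefined primitives and explicit nets.
    module mux_gate (input wire a, b, sel, output wire y);
        wire nsel, w0, w1;
        not g0 (nsel, sel);
        and g1 (w0, a, nsel);
        and g2 (w1, b, sel);
        or  g3 (y, w0, w1);
    endmodule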
6.8 Introduction

ModelSim is a verification and simulation tool for VHDL, Verilog, SystemVerilog and mixed-language designs.

6.8.1 Basic Simulation Flow

The following diagram shows the basic steps for simulating a design in ModelSim.

Figure 6.1. Basic Simulation Flow

In ModelSim, all designs are compiled into a library. You typically start a new simulation in ModelSim by creating a working library called "work", which is the library name used by the compiler as the default destination for compiled design units. After creating the working library, you compile your design units into it. The ModelSim library format is compatible across all supported platforms, so you can simulate your design on any platform without having to recompile it.

Loading the Design and Running the Simulation. With the design compiled, you load the simulator with your design by invoking the simulator on a top-level module (Verilog) or on a configuration or entity/architecture pair (VHDL). Assuming the design loads successfully, the simulation time is set to zero, and you enter a run command to begin simulation. If you do not get the results you expect, you can use ModelSim's robust debugging environment to track down the cause of the problem.

6.8.2 Project Flow

A project is a collection mechanism for an HDL design under specification or test. Even though you do not have to use projects in ModelSim, they can ease interaction with the tool and are useful for organizing files and specifying simulation settings. The following diagram shows the basic steps for simulating a design within a ModelSim project.

FIG 6.2 Project Flow

As you can see, the flow is similar to the basic simulation flow. However, there are two important differences: you do not have to create a working library in the project flow, as it is done for you automatically; and projects are persistent, remaining open from one session to the next unless you specifically close them.

6.8.3 Multiple Library Flow

ModelSim uses libraries in two ways: (1) as a local working library that contains the compiled version of your design; (2) as a resource library. The contents of your working library change as you update your design and recompile, whereas a resource library is typically static and serves as a parts source for your design. You can create your own resource libraries, or they may be supplied by another design team or a third party (e.g., a silicon vendor). You specify which resource libraries are used when the design is compiled, and there are rules that specify the order in which they are searched. A common example of using both a working library and a resource library is one where your gate-level design and testbench are compiled into the working library, while the design references gate-level models in a separate resource library. The diagram below shows the basic steps for simulating with multiple libraries.

Figure 6.3. Multiple Library Flow

6.9 Debugging Tools

ModelSim offers numerous tools for debugging and analyzing your design. Several of these are covered in subsequent lessons, including: using projects; working with multiple libraries; setting breakpoints and stepping through the source code; viewing waveforms and measuring time; viewing and initializing memories; creating stimulus with the Waveform Editor; and automating simulation.

6.10 Basic Simulation

Design Files for this Lesson. The sample design for this lesson is a simple 8-bit binary up-counter with an associated testbench. The pathnames are as follows:

Verilog – <install_dir>/examples/tutorials/verilog/basicSimulation/counter.v and tcounter.v
VHDL – <install_dir>/examples/tutorials/vhdl/basicSimulation/counter.vhd and tcounter.vhd

This lesson uses the Verilog files counter.v and tcounter.v. If you have a VHDL license, use counter.vhd and tcounter.vhd instead. Or, if you have a mixed license, feel free to use the Verilog testbench with the VHDL counter, or vice versa.

6.10.1 Create the Working Design Library

Before you can simulate a design, you must first create a library and compile the source code into that library.

1. Create a new directory and copy the design files for this lesson into it. Start by creating a new directory for this exercise (in case other users will be working with these lessons). Verilog: copy counter.v and tcounter.v from <install_dir>/examples/tutorials/verilog/basicSimulation to the new directory. VHDL: copy counter.vhd and tcounter.vhd from <install_dir>/examples/tutorials/vhdl/basicSimulation to the new directory.

2. Start ModelSim if necessary. a. Type vsim at a UNIX shell prompt, or use the ModelSim icon in Windows. Upon opening ModelSim for the first time, you will see the Welcome to ModelSim dialog; click Close. b. Select File > Change Directory and change to the directory you created in step 1.

3. Create the working library. a. Select File > New > Library. This opens a dialog where you specify physical and logical names for the library (Figure 6.4). You can create a new library or map to an existing library; we will be doing the former. b. Type work in the Library Name field (if it is not already entered automatically). c. Click OK.

Figure 6.4. The Create a New Library Dialog

ModelSim creates a directory called work and writes a specially formatted file named _info into that directory. The _info file must remain in the directory to distinguish it as a ModelSim library; do not edit the folder contents from your operating system, as all changes should be made from within ModelSim. ModelSim also adds the library to the list in the Workspace (Figure 6.5) and records the library mapping for future reference in the ModelSim initialization file (modelsim.ini).

Figure 6.5. work Library in the Workspace
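For reference, a minimal sketch of what an 8-bit binary up-counter like counter.v might contain (hypothetical; the actual file shipped with ModelSim is more elaborate):

    module counter (
        input  wire       clk,
        input  wire       reset,
        output reg  [7:0] count
    );
        always @(posedge clk or posedge reset)
            if (reset) count <= 8'h00;        // asynchronous reset
            else       count <= count + 8'h01;
    endmodule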
When you pressed OK in step 3c above, the following was printed to the Transcript:

    vlib work
    vmap work work

These two lines are the command-line equivalents of the menu selections you made. Many command-line equivalents echo their menu-driven functions in this fashion.

6.11 Compile the Design

With the working library created, you are ready to compile your source files. You can compile by using the menus and dialogs of the graphic interface, as in the Verilog example below, or by entering a command at the ModelSim> prompt.

1. Compile counter.v and tcounter.v. a. Select Compile > Compile. This opens the Compile Source Files dialog (Figure 6.5). If the Compile menu option is not available, you probably have a project open; if so, close the project by making the Workspace pane active and selecting File > Close from the menus. b. Select both counter.v and tcounter.v from the Compile Source Files dialog and click Compile. The files are compiled into the work library. c. When the compile is finished, click Done.

Figure 6.5. Compile Source Files Dialog

2. View the compiled design units. a. On the Library tab, click the '+' icon next to the work library and you will see two design units (Figure 6.7). You can also see their types (modules, entities, etc.) and the paths to the underlying source files (scroll to the right if necessary). b. Double-click test_counter to load the design. You can also load the design by selecting Simulate > Start Simulation from the menu bar. This opens the Start Simulation dialog. With the Design tab selected, click the '+' sign next to the work library to see the counter and test_counter modules, then select the test_counter module and click OK (Figure 6.6).

Figure 6.6. Loading the Design with the Start Simulation Dialog

When the design is loaded, you will see a new tab in the Workspace named sim that displays the hierarchical structure of the design (Figure 6.8). You can navigate within the hierarchy by clicking on any line with a '+' (expand) or '-' (contract) icon. You will also see a tab named Files that displays all files included in the design.

Figure 6.7. Modules Compiled into the work Library

6.12 Load the Design

1. Load the test_counter module into the simulator. a. In the Workspace, click the '+' sign next to the work library to show the files contained there.

Figure 6.8. Workspace sim Tab Displays the Design Hierarchy

2. View design objects in the Objects pane. a. Open the View menu and select Objects. The command-line equivalent is: view objects. The Objects pane (Figure 6.9) shows the names and current values of data objects in the current region (selected in the Workspace). Data objects include signals, nets, registers, constants, variables not declared in a process, generics and parameters.

Figure 6.9. Objects Pane Displays Design Objects

You may open other windows and panes with the View menu or with the view command; see Navigating the Interface.
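Before running anything, it helps to know what the testbench drives. A minimal sketch in the spirit of tcounter.v (hypothetical; the file shipped with ModelSim differs) clocks the counter sketched earlier and eventually executes $stop, which is what halts the Run -All command used below:

    `timescale 1ns/1ns
    module test_counter;
        reg clk = 1'b0;
        reg reset = 1'b1;
        wire [7:0] count;

        counter dut (.clk(clk), .reset(reset), .count(count));

        always #10 clk = ~clk;      // 20 ns clock period

        initial begin
            #25  reset = 1'b0;      // release reset
            #600 $stop;             // halts a Run -All in ModelSim
        end
    endmodule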
6.13 Run the Simulation

Now you will open the Wave window, add signals to it, and run the simulation.

1. Open the Wave debugging window. a. Enter view wave at the command line. You can also use the View > Wave menu selection to open a Wave window. The Wave window is one of several windows available for debugging; to see a list of the other debugging windows, select the View menu. You may need to move or resize the windows to your liking. Window panes within the Main window can be zoomed to occupy the entire Main window or undocked to stand alone. For details, see Navigating the Interface.

2. Add signals to the Wave window. a. In the Workspace pane, select the sim tab. b. Right-click test_counter to open a popup context menu. c. Select Add > To Wave > All items in region (Figure 6.10). All signals in the design are added to the Wave window.

Figure 6.10. Using the Popup Menu to Add Signals to the Wave Window

3. Run the simulation. a. Click the Run icon in the Main or Wave window toolbar. The simulation runs for 100 ns (the default simulation length) and waves are drawn in the Wave window. b. Enter run 500 at the VSIM> prompt in the Main window. The simulation advances another 500 ns, for a total of 600 ns (Figure 6.11). c. Click the Run -All icon on the Main or Wave window toolbar. The simulation continues running until you execute a break command or until it hits a statement in your code (e.g., a Verilog $stop statement) that halts the simulation. d. Click the Break icon. The simulation stops running.

Figure 6.11. Waves Drawn in the Wave Window

6.14 Xilinx Design Flow

The first step in implementing a design on an FPGA is the system specification. The specification defines the kinds of inputs and outputs and the ranges of values the board can accept, and the rest of the flow is driven by it. The next step is the architecture, which describes the interconnections between all the blocks involved in the design. Every block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog, whichever is more convenient. All these blocks are then simulated and their outputs verified for correct functioning.

Figure 6.12. Xilinx Implementation Design Flow Chart

After simulation, the next step is synthesis. This is a crucial step in determining whether the design can be implemented on an FPGA device. Synthesis converts the HDL code into functional components that are vendor specific. After synthesis, the RTL schematic and the technology schematic are generated, together with timing estimates; these reflect the delays that will be present in the FPGA once the design is implemented on it.

Place & route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. The interconnections made on the device can also be inspected in this part of the implementation flow. In the post-place-and-route simulation step, the delays that will be present on the FPGA device (electrical loading effects, wiring delays, stray capacitances) are taken into account by the tool, and simulation is performed with these delays included.

After place and route comes generation of the bitmap file, which converts the design into a bitstream used to configure the FPGA. Finally, the bitmap file is downloaded to the FPGA board by connecting the computer to the board with a JTAG (Joint Test Action Group, an IEEE standard) cable. The bitmap file contains the whole design as placed on the FPGA die, and the outputs can then be observed, for example on the FPGA board's LEDs. This step completes the whole process of implementing a design on an FPGA.
6.15 Xilinx ISE 10.1 Software

6.15.1 Introduction

Xilinx ISE (Integrated Software Environment) 10.1 is Xilinx's software suite for designing digital circuits and implementing them on FPGA devices such as the Spartan-3E. It is used here to design the application, verify its functionality and finally download the design onto a Spartan-3E FPGA device.

6.15.2 Xilinx ISE 10.1 software tools

SIMULATION: ISE (Integrated Software Environment) Simulator. SYNTHESIS, PLACE & ROUTE: XST (Xilinx Synthesis Technology) Synthesizer.

6.15.3 Design steps using Xilinx ISE 10.1

1. Create an ISE project for the particular embedded system application.
2. Write the assembly code in Notepad or WordPad and generate the Verilog or VHDL module using the assembler.
3. Check the syntax of the design.
4. Create a Verilog test fixture for the design.
5. Simulate the test bench waveform (behavioral simulation) for functional verification of the design using the ISE simulator.
6. Synthesize and implement the top-level module using the XST synthesizer.

CHAPTER 7

SIMULATION RESULTS

Serial implementation of 1 + y + y^3 + y^5

HDL Synthesis Report

Macro Statistics:
  # Registers: 5 (1-bit register: 5)
  # Xors: 3 (1-bit xor2: 3)

Design Statistics:
  # IOs: 8

Cell Usage:
  # BELS: 3 (# LUT2: 1, # LUT3: 2)
  # FlipFlops/Latches: 5 (# FDC: 5)
  # Clock Buffers: 1 (# BUFGP: 1)
  # IO Buffers: 7 (# IBUF: 2, # OBUF: 5)

Device utilization summary (selected device: XC3S500E):
  Number of Slices: 3 out of 960 (0%)
  Number of Slice Flip Flops: 5 out of 1920 (0%)
  Number of 4-input LUTs: 3 out of 1920 (0%)
  Number of IOs: 8
  Number of bonded IOBs: 8 out of 108 (7%)
  Number of GCLKs: 1 out of 24 (4%)

Timing Summary:
  Minimum period: 2.269 ns (maximum frequency: 440.723 MHz)
  Minimum input arrival time before clock: 2.936 ns
  Maximum output required time after clock: 4.450 ns

Parallel implementation of 1 + y + y^3 + y^5

HDL Synthesis Report

Macro Statistics:
  # Registers: 5 (1-bit register: 5)
  # Xors: 9 (1-bit xor2: 9)

Design Statistics:
  # IOs: 11

Cell Usage:
  # BELS: 6 (# LUT3: 3, # LUT4: 3)
  # FlipFlops/Latches: 5 (# FDC: 5)
  # Clock Buffers: 1 (# BUFGP: 1)
  # IO Buffers: 9 (# IBUF: 4, # OBUF: 5)

Device utilization summary (selected device: XC3S500E):
  Number of Slices: 3 out of 960 (0%)
  Number of Slice Flip Flops: 5 out of 1920 (0%)
  Number of 4-input LUTs: 6 out of 1920 (0%)
  Number of IOs: 11
  Number of bonded IOBs: 10 out of 108 (9%)
  Number of GCLKs: 1 out of 24 (4%)

Timing Summary:
  Minimum period: 2.269 ns (maximum frequency: 440.723 MHz)
  Minimum input arrival time before clock: 4.235 ns
  Maximum output required time after clock: 4.450 ns

ADVANTAGES:
  Reduced power dissipation
  Higher throughput rate
  Higher processing speed
  Fast computation
  An LFSR can rapidly transmit a sequence that indicates high-precision relative time offsets

APPLICATIONS:
  Pattern generators
  Built-in self-test (BIST)
  Encryption
  Generation of pseudo-random numbers, pseudo-noise sequences, fast digital counters and whitening sequences
  Pseudo-random bit sequences
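The RTL behind these reports is not listed in the thesis; the sketch below (ours) is consistent with the reported macro statistics. The serial divider for g(y) = 1 + y + y^3 + y^5 uses 5 registers and 3 two-input XORs, as in the serial report; a parallel version that applies the same update three times per clock uses 9 XORs, matching the parallel report (a parallel factor of 3 is our inference from that count).

    // Serial LFSR divider for g(y) = 1 + y + y^3 + y^5: 5 FDC registers,
    // 3 two-input XORs (the feedback XOR plus taps at y^1 and y^3).
    module lfsr5_serial (
        input  wire clk, clr, d,
        output reg [4:0] q
    );
        wire fb = d ^ q[4];                                        // XOR 1
        always @(posedge clk or posedge clr)
            if (clr) q <= 5'b00000;
            else     q <= {q[3], q[2] ^ fb, q[1], q[0] ^ fb, fb};  // XORs 2, 3
    endmodule

    // Parallel version, 3 bits per clock: three unrolled copies of the
    // update, hence 3 x 3 = 9 two-input XORs and the same 5 registers.
    module lfsr5_par3 (
        input  wire clk, clr,
        input  wire [2:0] d,            // d[2] is the earliest bit
        output reg [4:0] q
    );
        function [4:0] step(input [4:0] s, input din);
            reg fb;
            begin
                fb   = din ^ s[4];
                step = {s[3], s[2] ^ fb, s[1], s[0] ^ fb, fb};
            end
        endfunction
        wire [4:0] s1 = step(q,  d[2]);
        wire [4:0] s2 = step(s1, d[1]);
        wire [4:0] s3 = step(s2, d[0]);
        always @(posedge clk or posedge clr)
            if (clr) q <= 5'b00000;
            else     q <= s3;
    endmodule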
CONCLUSION

Efficient high-speed parallel LFSR structures must address two important issues: the large fanout bottleneck and the iteration bound bottleneck. Three-step LFSR architectures provide the flexibility to handle both. The key point is the construction of p(x) and p'(x), which address the fanout bottleneck and the iteration bound bottleneck, respectively, as shown in this work. These two issues can be solved more effectively by choosing p(x) and p'(x) that can be decomposed into short polynomials, because short polynomials are easily handled by the proposed look-ahead pipelining algorithms. Higher processing speed and hardware efficiency can be achieved with this approach.

REFERENCES

[1] T.-B. Pei and C. Zukowski, "High-speed parallel CRC circuits in VLSI," IEEE Transactions on Communications, vol. 40, no. 4, pp. 653-657, Apr. 1992.
[2] G. Campobello, G. Patané and M. Russo, "Parallel CRC realization," IEEE Transactions on Computers, vol. 52, no. 10, Oct. 2003.
[3] C. Cheng and K. K. Parhi, "High-speed parallel CRC implementation based on unfolding, pipelining, and retiming," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 10, pp. 1017-1021, Oct. 2006.
[4] K. K. Parhi, "Eliminating the fanout bottleneck in parallel long BCH encoders," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 3, pp. 512-516, Mar. 2004.
[5] X. Zhang and K. K. Parhi, "High-speed architectures for parallel long BCH encoders," in Proc. ACM Great Lakes Symposium on VLSI, Boston, MA, Apr. 2004, pp. 1-6.
[6] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.
[7] T. V. Ramabadran and S. S. Gaitonde, "A tutorial on CRC computations," IEEE Micro, vol. 8, no. 4, pp. 62-75, Aug. 1988.