High Speed VLSI Architecture for General Linear Feedback Shift Register (LFSR) Structures

ABSTRACT

A linear feedback shift register (LFSR) is a shift register whose input bit is a linear function of its previous state. The only linear function of single bits is exclusive-or (XOR), thus it is a shift register whose input bit is driven by the XOR of some bits of the overall shift register value. The initial value of the LFSR is called the seed, and because the operation of the register is deterministic, the stream of values produced by the register is completely determined by its current (or previous) state. Likewise, because the register has a finite number of possible states, it must eventually enter a repeating cycle. LFSR structures are widely used in digital signal processing and communication systems, for example in BCH and CRC circuits. Many common functions such as scrambling, convolutional coding and CRC, and even CORDIC or the Fast Fourier Transform, can be expressed as LFSR structures. In high-rate digital systems such as optical communication systems, a throughput of 1 Gbps is usually desired. The serial input/output operation of the LFSR structure is a bottleneck in such systems, so a parallel LFSR architecture is required. This work presents an improved three-step high-speed VLSI architecture for LFSR structures, offering both higher hardware efficiency and higher speed. The architecture can be applied to any LFSR structure for high-speed parallel implementation.

1. Introduction

1.1 Error Coding

In recent years there has been an increasing demand for digital transmission and storage systems. This demand has been accelerated by the rapid development and availability of VLSI technology and digital processing. It is frequently the case that a digital system must be fully reliable, as a single error may shut down the whole system or cause unacceptable corruption of data, e.g. in a bank account. In situations such as this, error control must be employed so that an error may be detected and afterwards corrected. The simplest way of detecting a single error is a parity checksum, which can be implemented using only exclusive-or gates. But in some applications this method is insufficient and a more sophisticated error control strategy must be implemented. If a transmission system can transfer data in both directions, an error control strategy may be built around detecting an error and then, if an error has occurred, retransmitting the corrupted data. These systems are called automatic repeat request (ARQ) systems. If transmission takes place in only one direction, e.g. information recorded on a compact disc, the only way to accomplish error control is with forward error correction (FEC). In FEC systems some redundant data is concatenated with the information data in order to allow for the detection and correction of the corrupted data without having to retransmit it. One of the most important classes of FEC codes is linear block codes. In block codes, data is transmitted and corrected within one block (codeword). That is, the data preceding or following a transmitted codeword does not influence the current codeword. Linear block codes are described by the integer n, the total number of symbols in the associated codeword. Block codes are also described by the number k of information symbols within a codeword, and the number of redundant (check) symbols n - k. In error control, it is crucial to understand the sources of errors.
Each transmitted bit has probability p > 0 of being received incorrectly. On memoryless channels every transmitted symbol may be considered independently, so only random errors occur. Unfortunately, most channels have memory, and usually several successive symbols are corrupted. These kinds of errors are called burst errors [29]. Burst errors can be most efficiently corrected through use of burst error correcting codes, e.g. Reed-Solomon (RS) codes [44]. Because the structure of burst error correcting codes is usually complicated, multiple random error correcting codes are often employed. In order to improve burst error correction, the transmitted codewords are also rearranged by interleaving. The resulting code is called an interleaved code. In this way the burst errors scatter into several codewords and look like random errors. Other operations on block codes are also available to improve the error correcting ability or to adapt a code to a specified requirement. For example codes may be shortened, extended, concatenated or interleaved [2,5]. The simplest block codes are Hamming codes. They are capable of correcting only one random error and therefore are not practically useful, unless a simple error control circuit is required. More sophisticated error correcting codes are the Bose, Chaudhuri and Hocquenghem (BCH) codes, which are a generalisation of the Hamming codes for multiple-error correction. In this thesis the subclass of binary, random error correcting BCH codes is considered, hereafter called BCH codes. BCH codes operate over finite fields or Galois fields. The mathematical background concerning finite fields is well specified, and in recent years the hardware implementation of finite field arithmetic has been extensively studied. Furthermore, any BCH code can be defined by only two fundamental parameters, and these parameters can be selected by the designer. These parameters are crucial to the design, and the question arises whether it is possible to develop a tool that will automatically generate any BCH codec description, just by providing the code size n and the number of errors to be corrected t. This design automation would considerably reduce BCH codec design cost and time, and increase the ease with which BCH codecs with different design parameters are generated. This is an important motivation, since the architectures of BCH codecs with different parameters can vary remarkably.

1.2 Hardware solutions

BCH codes employ sophisticated algorithms and their implementation is rather burdensome. The safe solution, both in terms of cost and time, is a software solution. But as BCH codes operate over finite fields, a standard microprocessor's arithmetic is not suitable, and a software solution is therefore rather slow. Another kind of solution is to employ a specialist digital signal processing (DSP) unit, but this option requires rather expensive and sophisticated hardware and can be adopted only when a small number of devices is to be produced. Overall, software solutions are therefore slower, consume more power and are less reliable than hardware implementations. In recent years the Programmable Logic Device (PLD) has been developed and the PLD subclass of Field Programmable Gate Arrays (FPGAs) has been introduced. This has revolutionised hardware design and its implementation. The advantages of an FPGA solution are as follows. The FPGA is fully reprogrammable, and a design can be automatically converted from the gate level into the layout structure by the place and route software.
Therefore design changes can be made almost as easily as software ones. Simulation at the layout level, where the design is tied to the internal FPGA structure, is also possible (back annotation). This enables not only the logical functionality but also the timing characteristics of the design to be simulated. Xilinx Inc. offers a wide range of components [55]. For example the XC3000 family offers 1,300 to 9,000 gate complexity and 256 to 1,320 flip-flops, so even a relatively complex design can be implemented. (A range of other manufacturers also market FPGA devices, including Actel and Altera.) In conclusion, a hardware solution can be easily implemented, and the differences between hardware and software solutions have become blurred. Unfortunately, although FPGA solutions are easy to introduce and verify, they are rather expensive and therefore not economical for mass production. In this case, a full or semi-custom Application Specific Integrated Circuit (ASIC) might be more appropriate. An ASIC solution is more complex and its implementation takes much longer than an FPGA. On the other hand, although an ASIC is characterised by high starting costs, it allows for a lower cost per chip in mass production. However an ASIC solution cannot be modified easily or cheaply, due to the high cost of layout masks and the long time required for their development.

1.3 Verilog HDL and synthesis

The development of VLSI and PLDs has stimulated a demand for a hardware description language (HDL) with a well-defined syntax and semantics. This requirement led to the development of languages such as the Verilog Hardware Description Language [1,25,26]. VERILOG describes a digital circuit as a set of design modules. A module contains an input/output port interface and a description of the module's behaviour or structure. The language supports different data classes, namely parameters (constants), variables and nets, and there are also different data formats available, for instance bits, integers and real numbers. VERILOG also supports numerous operators on these data types, such as addition, multiplication, exponentiation and modulo reduction [26]. VERILOG offers the opportunity for design at different levels. This is a crucial feature of the language, as it enables design partitioning and simulation at different levels; thus the design can be hierarchical. In addition, VERILOG allows a design to be described in different domains [25,34]. There are three different domains for describing digital systems. The behavioural domain describes the system without stating how the specified functionality is to be implemented. The structural domain describes a network of components. The physical domain describes how a system is actually to be built. VERILOG models of digital systems can be written in each of these domains. These models can then be simulated using Electronic Computer Aided Design (ECAD) tools. VERILOG has subsequently become a standard [26] and has been widely adopted throughout the electronics industry. ECAD tools have long been available which convert gate level descriptions of circuits into descriptions which can be accepted by ASIC manufacturers. One of the key recent developments has been the design of automatic synthesis tools which convert higher level textual descriptions of digital circuits into lower level or gate level descriptions. These synthesis tools therefore allow high level descriptions of circuits to be translated into hardware much more quickly and cheaply than was previously the case.
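As a minimal illustration of the module construct (an added sketch; the module and signal names are illustrative and not taken from the codec generator described later), the following behavioural VERILOG description specifies only the port interface and the required function, leaving the gate-level implementation to a synthesis tool:

    module parity_gen (
      input  wire [7:0] data,    // input port: one data byte
      output wire       parity   // output port: even parity over the byte
    );
      // Behavioural description: the XOR reduction states what is
      // computed, not how the XOR tree is to be built.
      assign parity = ^data;
    endmodule

A structural description of the same circuit would instead instantiate and interconnect individual XOR gate components.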
By virtue of being a standard, there are numerous proprietary VERILOG synthesis tools available. Synthesis may be considered as high level, logic level or layout level synthesis, depending on the level of abstraction involved. The highest level of design abstraction is the system level, where the design specification and performance are defined and a system is described, for example, as a set of processors, memories, controllers and buses. Below this is the algorithmic level, where the focus is on data structures and the computations performed by individual processors. Next comes the register transfer level (RTL), where the system is viewed as a set of interconnected storage elements and functional blocks. Below this is the logic level, where the system is described as a network of gates and flip-flops. The lowest level of abstraction is the circuit level, which views the system in terms of the individual transistors or other elements of which it is composed. High level synthesis [36] takes place at the algorithmic level and the RTL. Usually there are different structures that can be used to realise a given behaviour, and one of the tasks of high level synthesis is to find the structure that best realises that behaviour. There are a number of reasons why high level synthesis should be considered. For example, high level synthesis reduces design times and allows for the possibility of searching the design space for different trade-offs between cost, speed and power. Unfortunately, in practice high level synthesis tools are rather difficult to develop. Furthermore, a hand-crafted design is often more hardware efficient. As a result, the design is usually synthesised at a lower level of abstraction. Logic level synthesis is much simpler because the digital blocks have already been determined; therefore one of the most important aspects of this process is optimisation. Logic synthesis is often associated with a target technology, because the final logic form for different technologies is different. The intention at this level may also be to minimise the delay through the circuit [9] and/or to minimise the hardware requirements [11]. This task may be even more complicated, as only a few signals may be optimised with respect to time delay whilst others may be required to have reduced hardware levels. Layout level synthesis has been carried out for many years now [18] and is well understood. For example the place and route software associated with Xilinx FPGA devices can be considered to carry out layout level synthesis. One of the most significant problems for a synthesis tool is that the number of possible solutions increases rapidly with an increase in logic complexity. Usually synthesis problems are NP-complete, that is, the synthesis execution time grows exponentially with the size of the problem. Therefore the time required to find the best solution is usually considerable. Consequently, algorithms producing inexact but close to optimum solutions are employed - so-called heuristics [13]. Design synthesis is a very powerful tool, in theory saving a considerable amount of design time, as the design need not be developed at the gate level but instead at a higher level. In addition, the synthesis tool optimises the final design according to the specified technology and predefined criteria such as minimum area and speed. Unfortunately, synthesis tools are very complex and difficult to develop.
Various commercial synthesis tools are available, usually operating at the RTL, but seldom higher. The problem for a BCH codec designer therefore is that he has to have a high level of understanding of BCH codes before he can write these RTL descriptions in the first place. It is therefore the aim of this project to develop a high level synthesis tool for the design of BCH codecs. This tool will accept the parameters n and t of a BCH code and then generate the VERILOG description of the resulting BCH encoder and decoder. These VERILOG descriptions will be written at the RTL/logic level to facilitate their synthesis to gate level using a standard synthesis tool.

1.4 Overview of thesis

The structure of this thesis is as follows. Chapter 2 presents finite fields and their arithmetic. It considers how to construct finite fields, and bit-serial and bit-parallel multipliers for the dual, normal and polynomial bases. In addition, finite field inversion and exponentiation are considered and a new approach for raising field elements to the third power is presented. This chapter further presents a new hardware-efficient architecture generating the sum of products, and a new dual-polynomial basis multiplier. Chapter 3 introduces BCH codes, and algorithms for encoding and decoding BCH codes are presented. Chapter 4 describes the BCH codec synthesis system.

2. Finite Fields and Field Operators

2.1 Introduction

In this chapter finite fields and finite field arithmetic operators are introduced. The definitions and main results underlying finite field theory are presented and it is shown how to derive extension fields. The various finite field arithmetic operators are reviewed. In addition, new circuits are presented carrying out frequently used arithmetic operations in decoders. These operators are shown to have faster operating speeds and lower hardware requirements than their equivalents, and consequently have been used extensively throughout this project.

Finite fields

Error control codes rely to a large extent on powerful and elegant algebraic structures called finite fields. A field is essentially a set of elements in which it is possible to add, subtract, multiply and divide field elements and always obtain another element within the set. A finite field is a field containing a finite number of elements. A well known example of a field is the infinite field of real numbers.

2.2 Field definitions and basic features

The concept of a field is now more formally introduced. A field F is a non-empty set of elements with two operators usually called addition and multiplication, denoted '+' and '*' respectively. For F to be a field a number of conditions must hold:

1. Closure: For every a, b in F
   c = a + b;  d = a * b;   (2.1)
   where c, d ∈ F.

2. Associative: For every a, b, c in F
   a + (b + c) = (a + b) + c  and  a * (b * c) = (a * b) * c.   (2.2)

3. Identity: There exists an identity element '0' for addition and '1' for multiplication that satisfy
   0 + a = a + 0 = a  and  a * 1 = 1 * a = a   (2.3)
   for every a in F.

4. Inverse: If a is in F, there exist elements b and c in F such that
   a + b = 0,  a * c = 1.   (2.4)
   Element b is called the additive inverse, b = (-a); element c is called the multiplicative inverse, c = a^-1 (a ≠ 0).

5. Commutative: For every a, b in F
   a + b = b + a  and  a * b = b * a.   (2.5)

6. Distributive: For every a, b, c in F
   (a + b) * c = a * c + b * c.   (2.6)

The existence of a multiplicative inverse a^-1 enables the use of division. This is because for a, b, c ∈ F, c = b/a is defined as c = b * a^-1. Similarly, the existence of an additive inverse (-a) enables the use of subtraction. In this case for a, b, c ∈ F, c = b - a is defined as c = b + (-a).
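As a brief illustration of conditions 4-6 (a worked example added here, anticipating the prime fields introduced below), consider arithmetic modulo 5 on the set {0, 1, 2, 3, 4}. The additive inverse of 2 is 3, since 2 + 3 = 5 ≡ 0 (mod 5), and the multiplicative inverse of 2 is also 3, since 2 * 3 = 6 ≡ 1 (mod 5). Division by 2 is therefore multiplication by 3; for example 4/2 = 4 * 3 = 12 ≡ 2 (mod 5), as expected.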
It can be shown that the set of integers {0, 1, 2, ..., p-1}, where p is a prime, together with modulo p addition and multiplication forms a field [30]. Such a field is called the finite field of order p, or GF(p), in honour of Evariste Galois [48]. In this thesis only binary arithmetic is considered, where p is constrained to equal 2. This is because, as shall be seen, by starting with GF(2) the representation of finite field elements maps conveniently into the digital domain. Arithmetic in GF(2) is therefore defined modulo 2. It is from the base field GF(2) that the extension field GF(2^m) is generated.

2.2.1 The extension field GF(2^m)

Before introducing GF(2^m), some definitions are required. A polynomial p(x) of degree m over GF(2) is a polynomial of the form

p(x) = p0 + p1x + p2x^2 + ... + pmx^m   (2.7)

where the coefficients pi are elements of GF(2) = {0,1}. Polynomials over GF(2) can be added, subtracted, multiplied and divided in the usual way [29]. A useful property of polynomials over GF(2) is that ([29], pp. 29)

p^2(x) = (p0 + p1x + ... + pnx^n)^2 = p0 + p1x^2 + ... + pnx^2n = p(x^2).   (2.8)

The notion of an irreducible polynomial is now introduced.

Definition 2.1. A polynomial p(x) over GF(2) of degree m is irreducible if p(x) is not divisible by any polynomial over GF(2) of degree less than m and greater than zero.

To generate the extension field GF(2^m), an irreducible, monic polynomial of degree m over GF(2) is chosen, p(x) say. Then the set of 2^m polynomials of degree less than m over GF(2) is formed and denoted F. It can then be proven that when addition and multiplication of these polynomials is taken modulo p(x), the set F forms a field of 2^m elements, denoted GF(2^m) [30]. Note that GF(2^m) is extended from GF(2) in an analogous way to that in which the complex numbers C are formed from the real numbers R, where in that case p(x) = x^2 + 1. To represent these 2^m field elements, the important concept of a basis is now introduced.

2.2.2 The polynomial basis and primitive elements

Definition 2.2. A set of m linearly independent elements {γ0, γ1, ..., γm-1} of GF(2^m) is called a basis for GF(2^m).

A basis for GF(2^m) is important because any element a ∈ GF(2^m) can be represented uniquely as a weighted sum of these basis elements over GF(2). That is

a = a0γ0 + a1γ1 + ... + am-1γm-1,  ai ∈ GF(2).   (2.9)

Hence the field element a can be denoted by the vector (a0, a1, ..., am-1). This is why the restriction p = 2 has been made, since the above representation maps immediately into the binary field. There are a large number of possible bases for any GF(2^m) [30]. One of the more important bases is now introduced.

Definition 2.3. Let p(x) be the defining irreducible polynomial for GF(2^m). Take α as a root of p(x); then A = {1, α, ..., α^(m-1)} is the polynomial basis for GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1. Take α as a root of p(x); then A = {1, α, α^2, α^3} forms the polynomial basis for this field and all 16 elements can be represented as

a = a0 + a1α + a2α^2 + a3α^3   (2.10)

where the ai ∈ GF(2). These basis coefficients can be stored in a basis table of the kind shown in Appendix B.
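As a brief worked illustration of arithmetic modulo p(x) (an example added here), consider multiplying the elements x^3 + 1 and x + 1 of GF(2^4) with p(x) = x^4 + x + 1. Ordinary polynomial multiplication over GF(2) gives (x^3 + 1)(x + 1) = x^4 + x^3 + x + 1. Since p(x) = 0 in the field, x^4 may be replaced by x + 1, giving (x + 1) + x^3 + x + 1 = x^3. The product is therefore the field element x^3, i.e. the coefficient vector (0, 0, 0, 1).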
Definition 2.4. An irreducible polynomial p(x) of degree m is a primitive polynomial if the smallest positive integer n for which p(x) divides x^n + 1 is n = 2^m - 1.

If α is a root of p(x), where this polynomial is not only irreducible but also primitive, then GF(2^m) can be represented alternatively by the set of elements GF(2^m) = {0, 1, α, α^2, ..., α^(n-1)}, (n = 2^m - 1). In this case α is called a primitive element and α^n = 1. The relationship between powers of primitive elements and the polynomial basis representation of GF(2^4) is also shown in Appendix B. The choice as to whether to represent field elements over a basis or as powers of a primitive element usually depends on whether a hardware or a software implementation is being adopted. This is because α^i * α^j = α^(i+j), where the addition of indices is modulo 2^m - 1 and so can easily be carried out on a general purpose computer. Multiplication of field elements using the primitive element representation is therefore simple to implement in software, but addition is much more difficult. For implementation in hardware however, a basis representation of field elements makes addition relatively straightforward to implement. This is because

a = b + c = (b0 + b1α + ... + bm-1α^(m-1)) + (c0 + c1α + ... + cm-1α^(m-1))
  = (b0 + c0) + (b1 + c1)α + ... + (bm-1 + cm-1)α^(m-1)   (2.11)

and so addition is performed component-wise modulo 2. Hence a GF(2^m) adder circuit comprises 1 or m XOR gates, depending on whether the basis coefficients are represented in series or in parallel. This is an important feature of GF(2^m) and one of the main reasons why finite fields of this form are so extensively used.

2.2.3 The Dual Basis

The dual basis is an important concept in finite field theory and was originally exploited to allow for the design of hardware efficient RS encoders [3]. However subsequent research has allowed the use of dual basis multipliers to be adopted throughout the encoding and decoding processes.

Definition 2.5. [15] Let {λi} and {μi} be bases for GF(2^m), let f be a linear function from GF(2^m) to GF(2), and let β ∈ GF(2^m), β ≠ 0. Then {λi} and {μi} are dual to each other with respect to f and β if

f(βλiμj) = 1 if i = j, and f(βλiμj) = 0 if i ≠ j.   (2.12)

In this case, {λi} is the standard basis and {μi} is the dual basis.

Theorem 2.1. [15] Every basis has a dual basis with respect to any non-zero linear function f: GF(2^m) → GF(2), and any non-zero β ∈ GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1 and take α as a root of p(x). Then {1, α, α^2, α^3} is the polynomial basis for the field. Now taking β = 1 and f to be the least significant polynomial basis coefficient, {1, α^3, α^2, α} forms the dual basis to the polynomial basis. In fact, by varying β there are 2^m - 1 dual bases to any given basis, and the dual basis with the most attractive characteristics can be taken. This is usually taken to mean the dual basis that can be obtained from the polynomial basis with the simplest linear transformation [38].

2.2.4 Normal basis

A normal basis for GF(2^m) is a basis of the form B = {β, β^2, β^4, ..., β^(2^(m-1))} where β ∈ GF(2^m). For every finite field there always exists at least one normal basis [30]. Normal basis representations of field elements are especially attractive in situations where squaring is required, since if (a0, a1, ..., am-1) is the normal basis representation of a ∈ GF(2^m), then (am-1, a0, a1, ..., am-2) is the normal basis representation of a^2 [31]. This property is important in its own right, but also because it allows for hardware efficient Massey-Omura multipliers to be designed. The normal basis representation of GF(2^4) is given in Appendix B.
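The hardware simplicity of both operations can be seen in the following combinational VERILOG sketch (an added illustration; the parameterisation and signal names are assumptions, and the squaring output is valid only for a normal basis representation):

    module gf_add_sqr #(parameter M = 4) (
      input  wire [M-1:0] a, b,   // bit i holds basis coefficient ai (bi)
      output wire [M-1:0] sum,    // a + b: component-wise addition mod 2, equ(2.11)
      output wire [M-1:0] asqr    // a^2 over a normal basis: a pure rewiring
    );
      assign sum  = a ^ b;                 // m parallel XOR gates
      assign asqr = {a[M-2:0], a[M-1]};    // (a0,...,am-1) -> (am-1,a0,...,am-2)
    endmodule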
2.3 Multiplication by a constant α^j

It is frequently required to carry out multiplication by a constant value in encoders and decoders. This can be accomplished using two-variable input multipliers of the type described later. Alternatively, it is often beneficial to employ a multiplier designed specifically for this task ([29], p. 162) ([30], p. 89). Let a = a0 + a1α + ... + am-1α^(m-1) be an element in GF(2^m), where α is a root of the primitive polynomial p(x) = x^m + Σ_{j=0}^{m-1} pj x^j. Thus

a * α = a0α + a1α^2 + ... + am-1α^m   (2.13)

but since p(α) = 0,

a * α = a0α + a1α^2 + ... + am-2α^(m-1) + am-1(p0 + p1α + p2α^2 + ... + pm-1α^(m-1))   (2.14)

which is equivalent to a * α mod p(α). For example, consider multiplication by α in GF(2^4), where p(x) = x^4 + x + 1. Then

a * α = a3 + (a3 + a0)α + a1α^2 + a2α^3   (2.15)

and this multiplication can be carried out with the circuit of Fig. 2.1.

Figure 2.1. Circuit for computing a ← a * α in GF(2^4).

If the above register is initialised by Ai = ai (i = 0,1,2,3), then by clocking the register once the value of a * α is generated. This algorithm may be readily extended for multiplication by α^j, where j is any integer, and for any GF(2^m).
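A behavioural VERILOG model of this circuit (a sketch added for illustration; port names are assumptions) makes the hardwired feedback of Fig. 2.1 explicit:

    module mul_const_alpha (
      input  wire       clk, load,
      input  wire [3:0] a_in,   // polynomial basis coefficients a0..a3
      output reg  [3:0] a       // one clock after loading: a * alpha
    );
      always @(posedge clk)
        if (load) a <= a_in;
        else      a <= {a[2], a[1], a[0] ^ a[3], a[3]};  // equ(2.15)
    endmodule

Clocking the register j times multiplies by α^j; for a fixed constant α^j the same permute-and-XOR network can simply be composed j times and hardwired.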
2.4 Bit-serial multiplication

The most commonly implemented finite field operations are multiplication and addition. Multiplication is considered to be an order of magnitude more complicated than addition, and a large body of research has been carried out attempting to reduce the hardware and time complexities of multiplication. Finite field adders and multipliers can be classified according to whether they are bit-serial or bit-parallel, that is, whether the m bits representing field elements are processed in series or in parallel. Whereas bit-serial multipliers generally require less hardware than bit-parallel multipliers, they also usually require m clock cycles to generate a product rather than one. Hence in time critical applications bit-parallel multipliers must be implemented, in spite of the increased hardware overheads.

2.4.1 Berlekamp multipliers

The Berlekamp multiplier [3] uses two basis representations: the polynomial basis for the multiplier and the dual basis for the multiplicand and the product. Because it is normal practice to input all data in the same basis, this means some basis transformation circuits will be required. Fortunately, for m = 3, 4, 5, 6, 7, 9, 10 the basis conversion from the dual to the polynomial basis - and vice versa - is merely a reordering of the basis coefficients [38]. For the important case m = 8 - for example the error-correcting systems used in CDs, DAT and many other applications operate over GF(2^8) - this basis conversion requires a reordering and two additions of the basis coefficients (Appendix C). In practice therefore, two additional XOR gates are required. Even including the extra hardware for basis conversions, the Berlekamp multiplier is known to have the lowest hardware requirements of all available bit-serial multipliers [28]. Now let a, b, c ∈ GF(2^m) such that c = a * b, and represent b over the polynomial basis as b = Σ_{k=0}^{m-1} bk α^k. Further, and following Definition 2.5, let {μ0, μ1, ..., μm-1} be the dual basis to the polynomial basis for some f and β. Hence a = Σ_{i=0}^{m-1} ai μi and c = Σ_{i=0}^{m-1} ci μi, where these values of ai and ci are given by the following.

Lemma 2.1 [15]. Let {μ0, μ1, ..., μm-1} be the dual basis to the polynomial basis for GF(2^m) for some f and β, and let a = Σ_{i=0}^{m-1} ai μi be the dual basis representation of a ∈ GF(2^m). Then ai = f(βaα^i) for (i = 0, 1, ..., m-1).

The multiplication c = a * b can therefore be represented in the matrix form [15]

[ a0    a1    ...  am-1  ] [ b0   ]   [ c0   ]
[ a1    a2    ...  am    ] [ b1   ] = [ c1   ]
[ ...   ...   ...  ...   ] [ ...  ]   [ ...  ]
[ am-1  am    ...  a2m-2 ] [ bm-1 ]   [ cm-1 ]   (2.16)

where ai = f(βaα^i) and ci = f(βcα^i) (i = 0, 1, ..., m-1) are the dual basis coefficients of a and c respectively, and ai = f(βaα^i) for (i = m, m+1, ..., 2m-2). It can be shown [15] that

am+k = Σ_{j=0}^{m-1} pj * aj+k   (k = 0, 1, ...)   (2.17)

where the pj are the coefficients of p(x). These values of am+k can therefore be obtained from an m-stage linear feedback shift register (LFSR) where the feedback terms correspond to the pj coefficients and the LFSR is initialised with the dual basis coefficients of a. On clocking the LFSR am is generated, then on the next clock cycle am+1 is produced, and so on. The m vector multiplications listed in equ(2.16) are then carried out by a structure comprising m 2-input AND gates and (m-1) 2-input XOR gates. As an example, a Berlekamp multiplier for GF(2^4) is shown in Fig. 2.2, where p(x) = x^4 + x + 1.

Figure 2.2. Bit-serial Berlekamp multiplier for GF(2^4).

The registers in Fig. 2.2 are initialised by Ai = ai and Bi = bi for (i = 0,1,2,3). At this point the first product bit c0 is available on the output line. The remaining values of c1, c2 and c3 are obtained by clocking the register a further three times. With the above scheme at least one basis conversion is required if both inputs and the output are to be represented over the same basis. This basis transformation is a linear transformation of the basis coefficients and can be implemented within the multiplier structure itself. However, with GF(2^4) the dual basis is a permutation of the polynomial basis coefficients, and so this conversion can be implemented by a simple reordering of the inputs.

2.4.2 Massey-Omura Multiplier

The Massey-Omura multiplier [31,54] operates entirely over the normal basis, and so no basis converters are required. The idea behind the Massey-Omura multiplier is that if the Boolean function generating the first product bit has its inputs cyclically shifted, then this same function will also generate the second product bit. Furthermore, with each subsequent cyclic shift a further product bit is generated. Hence instead of m Boolean functions, one Boolean function is required to generate all m product bits, but with the inputs to this function shifted each clock cycle. As an example, consider a Massey-Omura bit-serial multiplier for GF(2^4). Let α be a root of p(x) = x^4 + x + 1 and take as a normal basis for the field {α^3, α^6, α^12, α^9}. Further, let a, b, c ∈ GF(2^4) be such that c = a * b, and represent these elements over the normal basis. Then

c = c0α^3 + c1α^6 + c2α^12 + c3α^9
  = (a0α^3 + a1α^6 + a2α^12 + a3α^9) * (b0α^3 + b1α^6 + b2α^12 + b3α^9)

where

c0 = a0b2 + a1b2 + a1b3 + a2b0 + a2b1 + a3b1 + a3b3
c1 = a1b3 + a2b3 + a2b0 + a3b1 + a3b2 + a0b2 + a0b0
c2 = a2b0 + a3b0 + a3b1 + a0b2 + a0b3 + a1b3 + a1b1
c3 = a3b1 + a0b1 + a0b2 + a1b3 + a1b0 + a2b0 + a2b2.   (2.18)

From equ(2.18) only one Boolean function is required to generate c0; the remaining values of c1, c2 and c3 are obtained by adding one (modulo 4) to all of the indices. This amounts to a cyclic shift of the inputs to this Boolean function. A circuit diagram for this multiplier is given in Fig. 2.3. The registers in Fig. 2.3 are initialised by Ai = ai and Bi = bi for (i = 0,1,2,3). At this point the first product bit c0 will be available on the output line. The remaining values of c1, c2 and c3 are obtained by cyclically shifting the registers a further three times.

Figure 2.3. Bit-serial Massey-Omura multiplier for GF(2^4).
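Both bit-serial multipliers can be captured in a few lines of behavioural VERILOG. The sketches below are added illustrations for GF(2^4) with p(x) = x^4 + x + 1 and the normal basis of equ(2.18); module and signal names are assumptions. In each case the operands are parallel loaded and one product bit emerges per clock cycle, as described for Figs. 2.2 and 2.3.

    module berlekamp_mult (
      input  wire       clk, load,
      input  wire [3:0] a_dual,  // multiplicand, dual basis
      input  wire [3:0] b_poly,  // multiplier, polynomial basis (held constant)
      output wire       c_bit    // product bits c0..c3 (dual basis), one per cycle
    );
      reg [3:0] A;
      always @(posedge clk)
        if (load) A <= a_dual;
        else      A <= {A[0] ^ A[1], A[3:1]};  // LFSR: shift in a(m+k) per equ(2.17)
      assign c_bit = ^(A & b_poly);            // m AND gates, (m-1) XOR gates
    endmodule

    module massey_omura_mult (
      input  wire       clk, load,
      input  wire [3:0] a_nb, b_nb,  // normal basis {alpha^3, alpha^6, alpha^12, alpha^9}
      output wire       c_bit        // product bits c0..c3, one per cycle
    );
      reg [3:0] A, B;
      always @(posedge clk)
        if (load) begin A <= a_nb; B <= b_nb; end
        else begin A <= {A[0], A[3:1]}; B <= {B[0], B[3:1]}; end  // cyclic shifts
      // The single defining Boolean function, generating c0 from equ(2.18).
      assign c_bit = (A[0]&B[2]) ^ (A[1]&B[2]) ^ (A[1]&B[3]) ^ (A[2]&B[0])
                   ^ (A[2]&B[1]) ^ (A[3]&B[1]) ^ (A[3]&B[3]);
    endmodule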
In the case of a Massey-Omura multiplier for GF(2^4), from equ(2.18) seven 2-input AND gates and six 2-input XOR gates are required to implement the defining Boolean equation. In general there is a result that states that the defining Massey-Omura function for a GF(2^m) multiplier requires at least (2m - 1) 2-input AND gates and at least (2m - 2) 2-input XOR gates [39]. In the case of the above example, it can be seen that the GF(2^4) Massey-Omura multiplier has achieved this lower bound.

2.4.3 Polynomial basis multipliers

Polynomial basis multipliers operate entirely over the polynomial basis and require no basis converters. These multipliers are easily implemented, reasonably hardware efficient, and the time taken to produce the result is the same as for Berlekamp or Massey-Omura multipliers. In truth however, bit-serial polynomial basis multipliers are serial-in parallel-out multipliers. In some applications this results in an additional register being required and adds an extra m clock cycles to the computation time. This is the main reason why polynomial basis multipliers are frequently overlooked for use in codec design. However, as will be shown in Sections 2.4.5 and 2.4.6, this feature can actually be beneficial. There are two different methods of operation for polynomial basis multipliers: least significant bit (LSB) first or most significant bit (MSB) first. Either of these approaches may be chosen and both modes are described below.

2.4.3.1 Option L - LSB first

In this option the LSB appears first on the multiplier input; therefore denote this multiplier a Bit-Serial Polynomial Basis Multiplier option L (SPBML). This multiplier is described in detail in the literature [4], ([29], pp. 163-164), ([30], pp. 90-91) and summarised here. Let a, b, c ∈ GF(2^m) and represent these elements over the polynomial basis as

a = a0 + a1α + ... + am-1α^(m-1)
b = b0 + b1α + ... + bm-1α^(m-1)
c = c0 + c1α + ... + cm-1α^(m-1).   (2.19)

The multiplication c = a * b can be expressed as

c = a * b = (a0 + a1α + ... + am-1α^(m-1)) * b
c = (...(((a0b) + a1bα) + a2bα^2) + ...) + am-1bα^(m-1).   (2.20)

A circuit carrying out multiplication by implementing equ(2.20) therefore requires an LFSR to carry out multiplication by α. This LFSR is initialised with b, and on clocking the register the value of bα is generated. The values a0, a1, ..., am-1 are fed in series into the multiplier to generate each of the values ai * bα^i (i = 0, 1, ..., m-1), which are accumulated in a register to form the product bits c0, c1, ..., cm-1. As an example, a circuit diagram for such a multiplier for GF(2^3), using the primitive polynomial p(x) = x^3 + x + 1, is given in Fig. 2.4. The operation of this circuit is as follows. The registers are initialised by Bi = bi and Ci = 0 (i = 0,1,2). The values a0, a1, a2 are fed in series into the multiplier and after 3 clock cycles the result c is available in the Ci register.

Figure 2.4. Circuit for multiplying two elements in GF(2^3).

2.4.3.2 Option M - MSB first

In this option the MSB appears first on the multiplier input. The Bit-Serial Polynomial Basis Multiplier option M (SPBMM) has been known for many years [28], ([29], p. 163) and was more recently modified by Scott et al. [45]. The multiplication c = a * b (where a, b and c are as given in equ(2.19)) can be expressed as

c = (...((am-1b)α + am-2b)α + ...)α + a0b.   (2.21)

A circuit implementing equ(2.21) for GF(2^3) is shown in Fig. 2.5.
Initially the Ci register is set to zero and the Bi register is initialised by Bi = bi (i = 0,1,2). a2 is then fed into the circuit and a2b is loaded into the top register. Then a1 enters the circuit and the top register is clocked so that it contains (a2b)α + a1b. Finally, the top register is clocked again to generate ((a2b)α + a1b)α, and this value is added to a0b to form the required product. In general therefore the result is obtained in the Ci register after m clock cycles.

Figure 2.5. Circuit for multiplying two elements in GF(2^3).

2.4.4 Comparison of bit-serial multipliers

The Massey-Omura multiplier operates entirely over a normal basis and so no additional basis conversions are required. The normal basis representation is especially effective in performing operations such as squaring. Unfortunately, the multiplier circuit is relatively hardware inefficient (compared to the Berlekamp multiplier, for example [28, 33]) and cannot be hardwired to carry out reduced complexity constant multiplication. Furthermore, the Massey-Omura multiplier cannot easily be extended to other values of m from a design for a particular choice of m. The Berlekamp multiplier is known to have very low hardware requirements [28]. The Berlekamp multiplier can also be hardwired to allow for particularly efficient constant multiplication [3]. The disadvantage of this multiplier is that it operates over both the dual and the polynomial basis, and so basis converters may be required. In most cases the basis conversion is only a permutation of the basis coefficients, and hence no additional hardware is required (see Appendix C). For these reasons, the Berlekamp multiplier is widely used in codec design. The bit-serial polynomial basis multipliers do not require basis converters, and are almost as hardware efficient as the Berlekamp multiplier. They do however have a different interface to the Berlekamp multiplier, being serial-in parallel-out. Hence the choice between a Berlekamp and a polynomial basis multiplier often depends on the circuit in which the multiplier is to be implemented. For example, if the result is required to be represented in parallel then an SPBMM may be used; otherwise a Berlekamp multiplier could be adopted instead. In comparing all four multipliers directly, it is noted that they each take m clock cycles to generate a solution. Similarly, they each require 2m flip-flops. In order to compare the hardware requirements of these four multipliers some notation is introduced. Let Na equal the number of 2-input AND gates required by a multiplier and let Nx equal the number of 2-input XOR gates required by a multiplier. Furthermore, let Da and Dx be the delays through a 2-input AND gate and XOR gate respectively. Let H(pp) be the Hamming weight of the primitive polynomial chosen for GF(2^m). (These choices of p(x) are listed in Appendix A.) The hardware requirements and delays of three of these multipliers are given below.

Berlekamp multiplier:
Na = m; Nx = m + H(pp) - 3
Delay = Da + log2(m-1) * Dx.   (2.22)

Standard basis multiplier option L:
Na = m; Nx = m + H(pp) - 2
Delay = Da + Dx.   (2.23)

Standard basis multiplier option M:
Na = m; Nx = m + H(pp) - 2
Delay = Da + 2Dx.   (2.24)

For Massey-Omura multipliers the number of gates cannot be explicitly specified.
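Before comparing gate counts, the two polynomial basis options can likewise be sketched in behavioural VERILOG for GF(2^3) with p(x) = x^3 + x + 1, matching Figs. 2.4 and 2.5 (added illustrations; module and signal names are assumptions):

    module spbml (
      input  wire       clk, load,
      input  wire       a_bit,   // serial input, a0 first (LSB first)
      input  wire [2:0] b,
      output reg  [2:0] c        // parallel product after m cycles
    );
      reg [2:0] B;
      always @(posedge clk)
        if (load) begin B <= b; c <= 3'b000; end
        else begin
          c <= c ^ (B & {3{a_bit}});        // accumulate ai * (b * alpha^i)
          B <= {B[1], B[0] ^ B[2], B[2]};   // B <= B * alpha
        end
    endmodule

    module spbmm (
      input  wire       clk, load,
      input  wire       a_bit,   // serial input, a(m-1) first (MSB first)
      input  wire [2:0] b,
      output reg  [2:0] c        // parallel product after m cycles
    );
      reg [2:0] B;
      wire [2:0] c_alpha = {c[1], c[0] ^ c[2], c[2]};  // c * alpha
      always @(posedge clk)
        if (load) begin B <= b; c <= 3'b000; end
        else      c <= c_alpha ^ (B & {3{a_bit}});     // equ(2.21), Horner form
    endmodule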
As a comparison, values of Na and Nx for all three types of multiplier are given in Table 2.1.

 m   Massey-Omura [33]   Berlekamp     SPBML/SPBMM
      Na     Nx           Na    Nx      Na    Nx
 3     5      4            3     3       3     4
 4     7      6            4     4       4     5
 5     9      8            5     5       5     6
 6    11     10            6     6       6     7
 7    19     18            7     7       7     8
 8    21     20            8    10       8    11
 9    17     16            9     9       9    10
10    19     18           10    10      10    11

Table 2.1. The usage of gates for bit-serial Massey-Omura, Berlekamp and standard basis multipliers.

It should be noticed that in some applications the most important feature of a multiplier is the input/output interface. Beth et al. [4] presented different interfaces for polynomial, dual and normal basis multipliers. In conclusion, polynomial basis multipliers can only be serial-in parallel-out, whereas dual and normal basis multipliers can be either parallel-in serial-out or serial-in parallel-out.

2.4.5 Generating the sum of products

Often in BCH and RS decoders one product is not required to be generated in isolation, but instead a sum of products must be calculated. For example, an equation of the form

c = Σ_{j=1}^{t} aj bj   (2.25)

is required to be evaluated in the Berlekamp-Massey algorithm circuits described in the next chapter. If bit-serial Berlekamp or Massey-Omura multipliers are being used, the sum of t products is obtained by the modulo 2 addition of the outputs of the t independent multipliers, and so (t-1) additional XOR gates are required. With polynomial basis multipliers, where the outputs are represented in parallel, m*(t-1) XOR gates are required. However, if SPBMMs are used to generate these products, large hardware savings can be made, as follows. An SPBMM implements equ(2.21) (rewritten below)

c = (...((am-1b)α + am-2b)α + ...)α + a0b

by generating the values Pn = Pn-1α + am-nb (n = 1, 2, ..., m), where P0 = 0 and c = Pm. If now aj = aj,0 + aj,1α + ... + aj,m-1α^(m-1) and bj = bj,0 + bj,1α + ... + bj,m-1α^(m-1) (j = 1, 2, ..., t), then instead generate

Pn = Pn-1α + Σ_{i=1}^{t} ai,m-n bi   (2.26)

where P0 = 0, and so Pm = Σ_{j=1}^{t} aj bj. Equ(2.26) can be implemented by a circuit comprising two parts. Part A generates Pn-1α in the same manner as in the top register in Figure 2.5. Part B comprises t registers, each with m 2-input AND gates, generating the values ai,m-n bi (n = 1, 2, ..., m) for (i = 1, 2, ..., t). The additions required in equ(2.26) can be carried out by m*(t-1) XOR gates included in the design of the Part A circuit. A circuit for GF(2^3) with t = 2 is shown in Figure 2.6. Using this approach to evaluating equ(2.25), (t+1)*m flip-flops are required. If t distinct SPBMMs are used however, 2t*m flip-flops are needed, and so the above method allows for a saving of (t-1)*m flip-flops to be made. Given that Berlekamp multipliers are the most hardware efficient bit-serial multipliers, it would seem appropriate to use these multipliers when implementing equ(2.25). In this case however the presented approach would again save (t-1)*m flip-flops, since t distinct multipliers would be required. In addition, (H(pp) - 2)*(t-1) - 1 XOR gates are saved, where H(pp) is the Hamming weight of the irreducible polynomial for the field. Hence the presented approach is the most hardware efficient method of implementing equ(2.25) currently available.

Figure 2.6. Circuit generating c = a1b1 + a2b2 in GF(2^3).
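A behavioural VERILOG reading of Fig. 2.6 (an added sketch of equ(2.26) for GF(2^3) and t = 2; the Part B registers holding the bi are modelled as parallel inputs held constant, and all names are assumptions):

    module sum_of_products (
      input  wire       clk, load,
      input  wire       a1_bit, a2_bit,  // MSB-first serial coefficients of a1, a2
      input  wire [2:0] b1, b2,          // parallel, polynomial basis
      output reg  [2:0] c                // c = a1*b1 + a2*b2 after m cycles
    );
      wire [2:0] c_alpha = {c[1], c[0] ^ c[2], c[2]};   // Part A: Pn-1 * alpha
      always @(posedge clk)
        if (load) c <= 3'b000;
        else      c <= c_alpha                          // equ(2.26)
                     ^ (b1 & {3{a1_bit}})               // Part B, i = 1
                     ^ (b2 & {3{a2_bit}});              // Part B, i = 2
    endmodule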
Initially the presented approach appears to have an unattractive input/output format, since the aj values enter in series, the bj values enter in parallel, and the output is also generated in parallel. However, when utilised in a Berlekamp-Massey algorithm circuit this input/output format can be very convenient. This is because the incoming syndromes are frequently represented in series (and so can take the role of the aj values) and the error location values generated by the circuit are represented in parallel (and so can take the role of the bj values). The other multipliers required in the circuit must then also be bit-parallel multipliers, thereby increasing the throughput of the overall circuit. Furthermore, in the next section a new Dual-Polynomial Basis Multiplier is presented, and a combination of these two architectures offers a new hardware and time efficient architecture for the BMA (presented in Section 3.4.2). It should be noticed here that the sum of products architecture may also be extended to dual and normal basis multipliers if their architecture is serial-in parallel-out. Therefore it is possible to construct a sum of products multiplier for MSB-first dual basis multipliers (Fig. 7 [4]) and for MSB-first and LSB-first normal basis multipliers (Figs. 10, 11 [4]).

2.4.6 Dual-Polynomial Basis Multipliers

In real time applications, the time taken by a multiplier to generate a solution is one of its most important characteristics. Therefore a designer has to choose between hardware efficient but slow bit-serial multipliers, and quick but rather complex bit-parallel multipliers. In some applications it is required to calculate

y = a * b * c.   (2.27)

In the standard approach to generating equ(2.27), two multiplications are carried out independently, i.e. first the multiplication z = a * b is implemented and the result stored in an auxiliary register Z, and then the multiplication y = z * c is carried out. The total calculation time is the sum of two independent multiplication times. In some applications this time is unacceptably long, and a parallel multiplier must be employed, requiring a more complex architecture. To overcome this problem, a new approach has been developed. Using the two proposed Dual-Polynomial Basis Multipliers (DPBMs), the time required to implement equ(2.27) is almost the same as the time required to carry out a single multiplication. Furthermore, a DPBM is almost as hardware efficient as the standard bit-serial approach. The DPBM can also be modified to carry out more complex operations such as y = (a * b + c) * d. (This operation is required to be carried out in the Berlekamp-Massey algorithm.)

2.4.7 Option A dual polynomial basis multipliers

The Berlekamp multiplier can be described as a parallel-in serial-out multiplier. On the other hand, bit-serial polynomial basis multipliers are serial-in parallel-out. Therefore, there is the option of connecting these two types of multiplier together to form one multiplier generating y = a * b * c. In this arrangement, the Berlekamp multiplier's output is connected directly to the polynomial basis multiplier's serial input. Thus the multiplication y = a * b * c is carried out in the same time span that a single bit-serial Berlekamp multiplier takes to yield one product. A problem occurs however, because the polynomial basis multiplier operates on the polynomial basis whilst the Berlekamp multiplier produces a result in the dual basis, and so an additional basis conversion is required. The complexity of this basis conversion depends on the irreducible polynomial selected, and so two cases have been considered:
those in which the irreducible polynomial for the field is a trinomial of the form p(x) = x^m + x^p + 1, and those in which it is a pentanomial of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1.

2.4.7.1 Irreducible trinomials

When the irreducible polynomial defining GF(2^m) is a trinomial, the dual basis is a permutation of the polynomial basis (see Appendix C). Therefore it is possible to rearrange the order of the output from the Berlekamp multiplier so that it is compatible with the polynomial basis multiplier. As an example, consider GF(2^4) with p(x) = x^4 + x + 1. An element z ∈ GF(2^4) is represented in the polynomial basis as

z = z0 + z1α + z2α^2 + z3α^3,  zi ∈ GF(2)   (2.28)

so a Berlekamp multiplier would generate this value in the dual basis order z0, z3, z2, z1. The SPBMM requires the serial input in the order z3, z2, z1, z0. A circuit that rearranges the dual basis coefficients into this order can be easily developed, thus allowing the DPBM to be designed. The general scheme for a multiplier generating y = a * b * c is shown in Figure 2.7.

Figure 2.7. Dual-Polynomial Basis Multiplier option A.

Assume for instance that the multiplier shown in Figure 2.7 is a Dual-Polynomial Basis Multiplier option A (DPBMA) for GF(2^4). The hardware required in addition to the SPBMM and the Berlekamp multiplier is a 2:1 multiplexer and a flip-flop. On the first clock cycle the values of a and b are parallel loaded into the Berlekamp multiplier. Once these values have been stored, the first product bit z0 is available on the output. This result is then clocked into the Z flip-flop. On clocking the Berlekamp multiplier a further three times, the values of z3, z2, z1 are produced. These coefficients pass through the multiplexer and feed the serial input of the SPBMM. On the 5th clock cycle the multiplexer feeds the SPBMM input with z0, so that the SPBMM has been fed the input sequence z3, z2, z1, z0, as required. In total therefore this circuit has a computation time of (m+1) clock cycles. Note also that no extra m-bit register Z is required to store the value of z, as is required in the standard approach to generating equ(2.27). This approach may be easily extended to GF(2^m) where the irreducible polynomial for GF(2^m) is of the form p(x) = x^m + x^p + 1. In this case, if (z0, z1, ..., zm-1) is the polynomial basis representation of z ∈ GF(2^m), the output in the dual basis from a Berlekamp multiplier is (see Appendix C)

zp-1, zp-2, ..., z0, zm-1, zm-2, ..., zp.   (2.29)

In this case, a multiplier structure similar to that shown in Figure 2.7 is derived. In addition, p extra flip-flops and one (p+1):1 multiplexer are required, and the total calculation time is now m + p clock cycles.

2.4.7.2 Irreducible pentanomials

When the irreducible polynomial for GF(2^m) is a pentanomial of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, the dual to polynomial basis conversion involves a reordering and two GF(2) additions, and so two extra XOR gates are required to implement this conversion. In this case the DPBMA is more difficult to implement, but it is still worthy of consideration. As an example, and because GF(2^8) is the most useful example of a field for which an appropriate pentanomial exists, consider GF(2^8) with p(x) = x^8 + x^4 + x^3 + x^2 + 1. Let z ∈ GF(2^8) be represented in the dual basis as

z = z0μ0 + z1μ1 + z2μ2 + z3μ3 + z4μ4 + z5μ5 + z6μ6 + z7μ7,  zi ∈ GF(2)

and in the polynomial basis as

z = s0 + s1α + s2α^2 + s3α^3 + s4α^4 + s5α^5 + s6α^6 + s7α^7,  si ∈ GF(2).
Then the dual to polynomial basis conversion is given by

s7 = z3, s6 = z4, s5 = z5, s4 = z6,
s3 = z3 + z7, s2 = z0 + z2, s1 = z1, s0 = z2.

The DPBMA for GF(2^8) is shown in Figure 2.8.

Figure 2.8. DPBMA generating y = a * b * c in GF(2^8).

The operation of the DPBMA shown in Figure 2.8 is as follows. On the first clock cycle a and b are parallel loaded into the Berlekamp multiplier, and at this point the first product bit is available on the output. The remaining 7 product bits are obtained by clocking the Berlekamp multiplier a further 7 times. The first 4 values generated by the Berlekamp multiplier are clocked into the Zi register, so that after 4 clock cycles Zi = zi (i = 0,1,2,3). This fourth value, z3, is also the first input to the SPBMM (i.e. s7). The next three outputs from the Berlekamp multiplier are fed into the SPBMM, and then the multiplexer selects inputs 1, 4, 3, 2 on the next four clock cycles to generate the required input for the SPBMM. The overall DPBMA will generate a solution on the 11th clock cycle. In general, a DPBMA takes m + p + 1 clock cycles to generate a product when the irreducible polynomial is of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1. In addition to the required Berlekamp multiplier and the SPBMM, an additional p+2 flip-flops, two 2-input XOR gates and one (3+p):1 multiplexer are required.

2.4.8 Dual polynomial basis multipliers option B

The DPBM may also be developed in a different form. With this option, a multiplier implementing equ(2.27) has the same calculation time as a single bit-serial multiplier. Instead of rearranging the order of the output from a Berlekamp multiplier, it is possible to add an additional circuit to the input of a 'Berlekamp-like' multiplier, denoted the bit-serial dual basis multiplier (SDBM). With this scheme, the SDBM produces a product in the polynomial basis, and so no extra circuit between the SDBM and the SPBMM is required. In order to develop the DPBM option B (DPBMB), the function Rd(x) is introduced.

Definition 2.6. Let the irreducible polynomial for GF(2^m) be p(x) = p0 + p1x + p2x^2 + ... + x^m and let a, b ∈ GF(2^m) be represented in the dual basis as

b = b0μ0 + b1μ1 + ... + bm-1μm-1
a = a0μ0 + a1μ1 + ... + am-1μm-1.

Then define the function Rd : GF(2^m) → GF(2^m) such that b = Rd(a), where b satisfies

bj = aj+1 for 0 ≤ j ≤ m-2,  and  bm-1 = Σ_{i=0}^{m-1} pi ai.   (2.30)

The value b = Rd(a) = aα, where α is a root of p(x), and so the function Rd(x) has the same effect on the coefficients as one clock of an LFSR which is initialised with the dual basis representation of its argument. Let Rd^2(a) be defined as Rd^2(a) = Rd(Rd(a)) - the state of the LFSR after 2 clock cycles - and so on.

2.4.8.1 Irreducible trinomials

To introduce the DPBMB, assume first that the defining irreducible polynomial p(x) is a trinomial of the form p(x) = x^m + x^p + 1. Now consider a Berlekamp multiplier without the LFSR, but with instead a set of m input lines Ai; denote this multiplier the SDBM. Let a, b, z ∈ GF(2^m) such that z = a * b. Further, let b and z be represented in the polynomial basis and a be represented in the dual basis as a = a0μ0 + a1μ1 + ... + am-1μm-1. If the SDBM is fed with the inputs Ai = ai (i = 0, 1, ..., m-1), the first coefficient of the dual basis representation of z is obtained, or equivalently, from equ(2.29), the p-th polynomial basis coefficient of z, namely zp-1.
So if instead the multiplier is fed with the dual basis representation of Rd^p(a), the (p+1)-th coefficient of the dual basis representation of z is obtained, or equivalently, the last polynomial basis coefficient zm-1. Similarly, if on the next clock cycle the multiplier is fed with the dual basis representation of Rd^(p+1)(a), the (p+2)-th coefficient of the dual basis representation of z is obtained, or equivalently, the next to last polynomial basis coefficient zm-2. This analysis may continue, and so overall, if the proposed multiplier is fed with the input sequence

Rd^p(a), Rd^(p+1)(a), Rd^(p+2)(a), ..., Rd^(m-1)(a), a, Rd(a), ..., Rd^(p-1)(a)   (2.31)

the multiplier will generate the values zm-1, zm-2, ..., z0, which is the correct format for the SPBMM. As previously mentioned, the proposed technique is flexible in that it can be modified to carry out operations of the form y = (a * b + c) * d. For example, consider Figure 2.9, where a circuit for GF(2^4) is presented implementing the operation y = (a * b + c) * d. Consider first the lower half of the circuit, which implements z = a * b using an SDBM. Taking p(x) = x^4 + x^p + 1 with p = 1, in order that the SDBM produces the result in the required sequence, from equ(2.31) the 'a' inputs to the multiplier must be in the order Rd(a), Rd^2(a), Rd^3(a), a. To achieve this, four flip-flops, four 3:1 multiplexers, an Rd^p(a) circuit and an Rd(a) circuit are additionally required. (An Rd^p(a) circuit is a combinational circuit that, given the dual basis representation of a ∈ GF(2^m), generates Rd^p(a). This circuit therefore implements a linear transformation over GF(2) and comprises p XOR gates. In this case it can be seen that only one additional XOR gate is required.) On the first clock cycle, the multiplexers select input 0, thereby loading Rd^p(a) into the ari register. On the 2nd and 3rd clock cycles the multiplexers select input 2, thereby loading Rd^2(a) and Rd^3(a) respectively into the ari register. Finally, on the 4th clock cycle, the multiplexers select input 1 to load the dual basis representation of a into the ari register. In doing this the output sequence z3, z2, z1, z0 is generated, as required by the SPBMM. If the Ci register was previously initialised with the polynomial basis coefficients of c and is now clocked, the polynomial basis representation of (a*b + c) is generated. This value is then fed into an SPBMM as normal to generate the required result y = (a*b + c)*d; this equation is required in the BMA. The DPBMB can be easily extended to different GF(2^m) if the irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1. In general, the multiplexers should select the following signals:

Clock cycle     Origin of signal     Actual values on these signals
1               Rd^p(a) circuit      Rd^p(a)
2 to m-p        Rd(a) circuit        Rd^(p+1)(a) to Rd^(m-1)(a)
m-p+1           Ai register          a
m-p+2 to m      Rd(a) circuit        Rd(a) to Rd^(p-1)(a)

Figure 2.9. Circuit generating y = (a * b + c) * d in GF(2^4).

In comparison with the standard approach, a DPBMB circuit requires an additional m 3:1 multiplexers, m flip-flops, one XOR gate to form the Rd(x) circuit and p XOR gates to form the Rd^p(x) circuit. In order to reduce the complexity of this Rd^p(x) circuit, a value of p as low as possible should be chosen. Hence the optimal irreducible polynomial to choose in this instance is p(x) = x^m + x + 1. Such polynomials exist for m = 2, 3, 4, 6, 7, etc.
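For this optimal trinomial with m = 4, the Rd(x) circuit is a single combinational LFSR step; a VERILOG sketch (added illustration; names are assumptions):

    module rd_step (
      input  wire [3:0] a,   // dual basis coefficients of a
      output wire [3:0] b    // b = Rd(a) = a * alpha in the dual basis
    );
      // equ(2.30) with p(x) = x^4 + x + 1: bj = a(j+1) for j < 3,
      // b3 = a0 + a1 - the single additional XOR gate noted above.
      assign b = {a[0] ^ a[1], a[3:1]};
    endmodule

Cascading p copies of this circuit yields the Rd^p(a) block.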
2.4.8.2 Primitive pentanomials

When the irreducible polynomial p(x) is of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, a DPBMB can be designed similarly to the trinomial case. Because the basis conversion is not just a permutation of basis coefficients but also involves two GF(2) additions, the circuit rearranging the input to an SDBM is rather more complicated however. Using the same analysis as in the trinomial case, it can be shown that when p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1, the required input sequence for an SDBM multiplier is

Rd^(p+1)(a), Rd^(p+2)(a), ..., Rd^(m-2)(a), Rd^(p+1)(a) + Rd^(m-1)(a), a + Rd^p(a), Rd(a), ..., Rd^p(a).

So for example, consider GF(2^8) and p(x) = x^8 + x^4 + x^3 + x^2 + 1. The required input sequence is therefore

Rd^3(a), Rd^4(a), Rd^5(a), Rd^6(a), Rd^3(a) + Rd^7(a), a + Rd^2(a), Rd(a), Rd^2(a).

This sequence is generated by a circuit of the form shown in Figure 2.10. The key section of this circuit is the multiplexer determining the ordering of the above input sequence. The input selections are as follows:

clock 1        line 4    Rd^3(a)
clocks 2-4     line 3    Rd(ar), i.e. Rd^4(a), Rd^5(a), Rd^6(a)
clock 5        line 2    Rd(ar) + Rd^3(a), i.e. Rd^7(a) + Rd^3(a)
clock 6        line 1    Rd^2(a) + a
clock 7        line 0    Rd(a)
clock 8        line 3    Rd(ar), i.e. Rd^2(a)

In general, the DPBMB requires an additional (p+2) Rd(x) circuits (3 XOR gates in each), 2m XOR gates for the summation circuits, m 5:1 multiplexers and m flip-flops.

Figure 2.10. DPBMB generating y = a * b * c in GF(2^8).

2.4.8.3 Summary of DPBM

The DPBM is particularly useful if the time taken to generate a product is critical. The DPBM offers a half-way solution between a bit-serial and a bit-parallel multiplier. Furthermore, both DPBMs are hardware efficient, and in some situations the DPBM offers a reduction in hardware, since the intermediate value z does not have to be stored. The structure of the DPBM depends on the irreducible polynomial for GF(2^m). The optimal irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1 with p = 1; for p > 1 more hardware is required in the multiplier. For some values of m (e.g. m = 8) there do not exist irreducible trinomials, and so an irreducible pentanomial must be used, resulting in the addition of extra hardware. Although the structure of the DPBM depends on the selected irreducible polynomial for GF(2^m), it has been shown that the architecture can be easily specified for two important classes of irreducible polynomials. The DPBMs require only one input to be represented in the dual basis; the other input and the output are represented in the polynomial basis. Two different options have been presented. With the DPBMA, the dual basis output is converted into the polynomial basis. This multiplier is particularly suited to generating products of the form y = a * b * c or y = (a * b + c) * d if it is acceptable to take (m+p) clock cycles to generate this product. With the DPBMB, the basis rearranging takes place on the input. This approach takes more hardware than the DPBMA circuit, but generating the product takes only m clock cycles. The DPBMB is of particular use when evaluating expressions of the form

y = Σ_{i=1}^{t} (a*bi + ci)*di   (2.32)

where a, bi, ci, di ∈ GF(2^m). This is because only one relatively expensive basis rearranging circuit is required. Expressions of the type equ(2.32) are generated in the implementation of the Berlekamp-Massey algorithm.
Note that SPBMLs have not been used in conjunction with DPBMs because the basis reordering circuits are more complicated than those needed when using SPBMMs. Beth et al. [4] presented normal basis multipliers with an LSB-first serial-in parallel-out interface. It is therefore also possible to construct a multiplier that carries out the multiplication d = a*b*c over the normal basis in only m clock cycles. This multiplier consists of a parallel-in serial-out Massey-Omura multiplier of the form presented in Section 1.4.2 and the above multiplier ([4] Fig. 11). This double multiplier does not require basis rearranging as the DPBM does, but normal basis multiplication is relatively hardware inefficient (see Section 2.4.4), and constructing a normal basis multiplier for different choices of m is quite complex. In addition, normal basis multiplication requires the arguments of the multiplication to be rotated, so an additional control system is required. Summing up, in this thesis the DPBM is adopted, although in some instances it is not obviously the most appropriate architecture. A similar architecture using only dual basis multipliers cannot be constructed, because parallel-in serial-out multipliers produce the result in the dual basis, whereas serial-in parallel-out dual basis multipliers can be constructed only for serial input in the polynomial basis [4].

2.5 Bit-Parallel Multiplication
In some applications it is necessary to adopt bit-parallel architectures rather than bit-serial ones to achieve the required performance. So far, only bit-serial multipliers have been considered because of their hardware advantages over bit-parallel multipliers. Unfortunately, in the time-critical places in BCH codecs, bit-serial architectures are too slow and the more complex bit-parallel architectures must be adopted.

2.5.1 Dual Basis Multipliers
The bit-parallel dual basis multiplier (PDBM) was presented in [15]. Let a, c ∈ GF(2^m) be represented in the dual basis by a = a_0β_0 + a_1β_1 + ... + a_{m-1}β_{m-1} and c = c_0β_0 + c_1β_1 + ... + c_{m-1}β_{m-1}, and let b ∈ GF(2^m) be represented in the polynomial basis as b = b_0 + b_1α + ... + b_{m-1}α^{m-1}. The multiplication c = a*b is then described by equations (2.16) and (2.17). Using these equations and the properties of the bit-serial Berlekamp multiplier, the PDBM can easily be derived [15] as a circuit implementing the equations

c_j = a_j b_0 + a_{j+1} b_1 + a_{j+2} b_2 + ... + a_{j+m-1} b_{m-1}    (j = 0, 1, ..., m-1)
a_{m+k} = Σ_{j=0}^{m-1} p_j * a_{j+k}    (k = 0, 1, ..., m-2)    (2.33)

where the p_j are the coefficients of the primitive polynomial for the field, p(x) = p_0 + p_1x + ... + x^m. In general, therefore, a PDBM for GF(2^m) comprises one type A module that generates the extended coefficients a_{m+k} from equ(2.33) and m type B modules, each generating the inner product of two m-length vectors over GF(2). As an example, the PDBM for GF(2^3) using p(x) = x^3 + x + 1 is given below.

Figure 2.11. Type A module for a bit-parallel dual basis multiplier for GF(2^3) (generating a_3 and a_4 from a_0 to a_2).
Figure 2.12. Type B module for a bit-parallel dual basis multiplier for GF(2^3) (inner product of (a_j, a_{j+1}, a_{j+2}) and (b_0, b_1, b_2) giving c_j).
Figure 2.13. Bit-parallel dual basis multiplier for GF(2^3) (one type A module and three type B modules).
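A short behavioural sketch of the PDBM equations (2.33) is given below. It is a software rendering of the dataflow, not of the gate-level modules; field elements are bit-lists and the primitive polynomial is passed as its low-order coefficients.

```python
# Minimal sketch of equ(2.33): type A extends the dual-basis coefficients
# of a; type B modules form GF(2) inner products. Assumes p(x) = x^3 + x + 1,
# passed as its low coefficients p = [p0, p1, p2] = [1, 1, 0].

def pdbm(a, b, p):
    """c = a*b with a, c in the dual basis and b in the polynomial basis."""
    m = len(b)
    ext = list(a)
    # Type A module: a_{m+k} = sum_j p_j * a_{j+k} over GF(2).
    for k in range(m - 1):
        ext.append(0)
        for j in range(m):
            ext[m + k] ^= p[j] & ext[j + k]
    # Type B modules: each c_j is an inner product of two m-bit vectors.
    return [int(sum(ext[j + i] & b[i] for i in range(m)) % 2)
            for j in range(m)]

# Example in GF(2^3):
print(pdbm([1, 0, 1], [0, 1, 1], [1, 1, 0]))
```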
2.5.2 Normal Basis Multipliers
A bit-parallel normal basis multiplier was also presented by Massey and Omura [31]. This multiplier comprises m identical Boolean functions, where the inputs to these functions are effectively cyclically shifted by one position each time. A bit-parallel Massey-Omura multiplier (PMOM) requires at least m(2m-1) 2-input AND gates and at least m(2m-2) 2-input XOR gates [31, 54]. The complexity of this multiplier is therefore dependent upon the complexity of the defining multiplication function. Accordingly, this multiplier is more hardware intensive than the PDBM and is not used in this thesis.

2.5.3 Polynomial Basis Multipliers
The bit-parallel polynomial basis multiplier (PPBM) was presented by Laws et al. [28]. The multiplier performs the same sequence of computations as the bit-serial polynomial basis multiplier option M (SPBMM), and so this multiplier is denoted the parallel polynomial basis multiplier option M (PPBMM). Let a, b, c ∈ GF(2^m) with

a = a_0 + a_1α + ... + a_{m-1}α^{m-1}
b = b_0 + b_1α + ... + b_{m-1}α^{m-1}
c = c_0 + c_1α + ... + c_{m-1}α^{m-1}.    (2.34)

To generate c = a*b, the representation

c = (...(((a_{m-1}b)α + a_{m-2}b)α + a_{m-3}b)α + ...)α + a_0b    (2.35)

is again used. The PPBMM therefore consists of (m-1) blocks that carry out the operations y_{m-1} = a_{m-1}b and y_j = a_jb + αy_{j+1} mod p(α) for m-1 > j ≥ 0, where the result is c = y_0 and p(x) is the irreducible polynomial for GF(2^m). Mastrovito has presented a different type of polynomial basis multiplier [33]. This multiplier generates c = a*b mod p(x), c = Σ_{j=0}^{m-1} c_jα^j, by employing the product matrix M:

[c_{m-1}, c_{m-2}, ..., c_0]^T = M * [a_{m-1}, a_{m-2}, ..., a_0]^T    (2.36)

where each entry f_j^i of the m x m matrix M is a GF(2) sum of some of the coefficients b_k. The most burdensome part of the Mastrovito algorithm is finding the product matrix M. The algorithm for finding M has been omitted here, as it is rather complicated; it can be found in [33]. In conclusion, the Mastrovito bit-parallel polynomial basis multiplier (MPPBM) is rather difficult to represent algorithmically. The advantage of the MPPBM, however, is that it has a smaller time delay than the PPBMM. Laws et al. [28] presented a parallel multiplier using the same calculation sequence as the SPBMM. The question arises whether it is possible to construct a modular and regular parallel multiplier employing the same calculation sequence as in the case of the SPBML. The research carried out concludes that it is, as shown below. Express the multiplication c = a*b as in equ(2.20):

c = a_0b + a_1(bα) + a_2(bα^2) + ... + a_{m-1}(bα^{m-1}).

Now represent b*α^j as

b*α^j = b_{j,0} + b_{j,1}α + b_{j,2}α^2 + ... + b_{j,m-1}α^{m-1}.    (2.37)

Therefore, using (2.20) and (2.37),

c_j = a_0b_{0,j} + a_1b_{1,j} + a_2b_{2,j} + ... + a_{m-1}b_{m-1,j}.    (2.38)

Equation (2.38) may also be derived by considering the SPBML: in the SPBML, the value c_j is obtained by sequentially summing the binary products of b_{j,i} (the state of register b_i after j clock cycles) and a_i. Using equations (2.37) and (2.38) it is possible to construct a modular and regular bit-parallel polynomial basis multiplier, option L (PPBML). A PPBML for GF(2^4) is presented below.

Figure 2.14. PPBML for GF(2^4) (multiply-by-α blocks generating b*α^j and four type B modules producing c_0 to c_3).
Figure 2.15. Module B of the PPBML - an inner product generator identical to that required in the Berlekamp multiplier.
Figure 2.16. Circuit for multiplying by α in GF(2^4).

In general, a PPBML for GF(2^m) comprises m type B inner product modules and (m-1) type C modules that generate α^j * b, where α is a root of p(x) and b ∈ GF(2^m). A type C module essentially carries out a linear transformation of basis coefficients over GF(2) and therefore consists of a number of XOR gates.
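The PPBML dataflow of equations (2.37)-(2.38) can be sketched in a few lines of Python; the sketch below mirrors the type C (multiply-by-α) and type B (inner product) modules, with field elements as bit-lists, LSB first.

```python
# Sketch of equ(2.37)-(2.38). p lists the low coefficients of the
# irreducible polynomial, e.g. x^4 + x + 1 -> p = [1, 1, 0, 0].

def times_alpha(b, p):
    """Type C module: multiply b by alpha and reduce modulo p(x)."""
    msb = b[-1]
    shifted = [0] + b[:-1]                               # multiply by x
    return [shifted[i] ^ (msb & p[i]) for i in range(len(b))]

def ppbml(a, b, p):
    """c = a*b over the polynomial basis, c_j = sum_i a_i * b_{i,j}."""
    m = len(a)
    rows, cur = [], list(b)
    for _ in range(m):                                   # rows[i] = b * alpha^i
        rows.append(cur)
        cur = times_alpha(cur, p)
    # Type B modules: inner products over GF(2) (equ 2.38).
    return [int(sum(a[i] & rows[i][j] for i in range(m)) % 2)
            for j in range(m)]

# Example in GF(2^4) with p(x) = x^4 + x + 1:
print(ppbml([1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 0, 0]))
```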
2.5.4 Comparison of parallel multipliers
In this section the PDBM [15], the PMOM [31, 54] and three polynomial basis multipliers, the MPPBM [33], the PPBMM [28] and the PPBML, have been considered. A comparison of the number of XOR and AND gates required by these multipliers and the maximum delay times for a range of values of m is presented below. (In fact, the delay through each of these multipliers is D_A plus the values cited below, since a single row of AND gates is also required by each of the multipliers.)

m  | PMOM: NA, NX, DX | MPPBM: NA, NX, DX | PDBM: NA, NX, DX | PPBMM: DX | PPBML: DX
3  | 15, 12, 2        | 9, 8, 3           | 9, 8, 3          | 3         | 3
4  | 28, 24, 3        | 16, 15, 3         | 16, 15, 3        | 4         | 3
5  | 45, 40, 3        | 25, 25, 5         | 25, 24, 5        | 6         | 5
6  | 66, 60, 4        | 36, 33, 4         | 36, 35, 4        | 6         | 4
7  | 133, 126, 5      | 49, 48, 4         | 49, 48, 4        | 7         | 4
8  | 168, 160, 5      | 64, 90, 5         | 64, 77, 7        | 11        | 7
9  | 153, 144, 4      | 81, 80, 5         | 81, 80, 6        | 11        | 6
10 | 190, 180, 5      | 100, 101, 6       | 100, 99, 6       | 12        | 6

Table 2.2. Comparison of bit-parallel finite field multipliers (NA = number of AND gates, NX = number of XOR gates, DX = delay in XOR gate delays). The PPBMM and PPBML have the same gate counts as the PDBM (see the formulas below). The gate counts for the PMOM are taken from [33]; those for the MPPBM and the PDBM are taken from [15]. The primitive polynomials used to design these multipliers (excluding the PMOM) are listed in Appendix A.

As a general rule, the number of gates and the multiplier delay can be obtained from the following, where H(pp) denotes the number of non-zero terms of the primitive polynomial:

PDBM:  NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = log2(m) + t*log2(H(pp) - 1)
PPBMM: NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = m - 1 + t (if H(pp) = 3)
PPBML: NA = m^2, NX = (m-1)(m + H(pp) - 2), DX = log2(m) + t (if H(pp) = 3)

where t = (m-1)/(m-p) (rounded up) and p(x) = x^m + x^p + Σ_{i=0}^{p-1} p_i x^i.

From Table 2.2 it can be concluded that the PPBML has the same parameters as the PDBM for the considered choices of m. The PPBML needs no basis conversions, and so the design of a PPBML is simpler and more hardware efficient than the PDBM, especially if a primitive trinomial for GF(2^m) does not exist. On the other hand, ignoring the basis conversions, the PDBM is slightly easier to design, and some additional design optimisation can be carried out; e.g. for m = 8 the number of XOR gates can be reduced to 72 [15]. In conclusion, the choice between the PDBM and the PPBML depends on the individual design specification, as the differences in design complexity and hardware requirements are small. The PPBMM has the same hardware requirements as the PPBML but a longer delay time; accordingly, the PPBMM is not used in this thesis. Similarly, the PMOM is not used, given its high hardware requirements and long delay path. The PPBML is much easier to design than the MPPBM, and in most cases the final circuits are similar, e.g. for m = 4. Therefore in this thesis only the PDBM and the PPBML have been considered. It should be mentioned that a number of bit-parallel multipliers have been proposed for circumstances in which p(x) is of the form p(x) = x^m + x^{m-1} + ... + x + 1, that is, when p(x) is an all-one polynomial [22]. However, all-one polynomials are relatively rare and so do not help in finding general solutions of the kind required here.

2.6 Finite field exponentiation
2.6.1 Squaring
In some applications squaring in a finite field is required. Squaring can be performed using a standard multiplier, but this approach is rather hardware inefficient. Instead, a different algorithm is employed, as described for example in [16]. Let a ∈ GF(2^m) be represented in the polynomial basis as a = a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1}, and let b ∈ GF(2^m) be such that b = a^2. From equ(2.8), f(x)^2 = f(x^2), and so

b = a^2 = a_0 + a_1α^2 + a_2α^4 + a_3α^6 + ... + a_{m-1}α^{2m-2} mod p(α).    (2.39)

In other words, the coefficients of b can be obtained from a linear transformation of the coefficients of a over GF(2). This linear transformation requires a number of XOR gates to implement, and these numbers for a range of m are listed in Table 2.3.
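A small sketch of equ(2.39) is given below; it makes the linearity explicit by building b column-by-column from the reductions of α^{2i}, which in hardware collapse to a fixed XOR network.

```python
# Sketch of squaring as a GF(2)-linear map, following equ(2.39).
# p lists the low coefficients of p(x), e.g. x^4 + x + 1 -> [1, 1, 0, 0].

def alpha_power(k, p):
    """Coefficient vector of alpha^k reduced modulo p(x)."""
    m = len(p)
    v = [1] + [0] * (m - 1)
    for _ in range(k):                       # repeatedly multiply by alpha
        msb = v[-1]
        v = [0] + v[:-1]
        v = [v[i] ^ (msb & p[i]) for i in range(m)]
    return v

def square(a, p):
    """b = sum_i a_i * alpha^(2i) mod p(alpha): a fixed XOR pattern."""
    m = len(a)
    b = [0] * m
    for i in range(m):
        if a[i]:
            col = alpha_power(2 * i, p)      # column i of the squaring matrix
            b = [b[j] ^ col[j] for j in range(m)]
    return b

# Example in GF(2^4): (alpha + alpha^2)^2 = alpha^2 + alpha^4 = 1 + alpha + alpha^2.
print(square([0, 1, 1, 0], [1, 1, 0, 0]))
```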
2.6.2 Raising field elements to the third power
The standard approach to exponentiation to the power three is to use a standard multiplier together with squaring, calculating a^3 = a^2 * a [46]. If a PPBML is used together with the squaring approach described above, the hardware requirements for this circuit are as given in Table 2.3. An alternative method of raising elements to the power three is now described. Let a, b ∈ GF(2^m) be such that b = a^3, and represent both elements over the polynomial basis in the usual way. From the expansion (x + y)^3 = x^3 + 3x^2y + 3xy^2 + y^3, the expressions

b = a^3 = (a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1})^3
b = Σ_{i=0}^{m-1} a_i α^{3i} + Σ_{i=0}^{m-2} Σ_{j=i+1}^{m-1} a_i a_j (α^{2i+j} + α^{i+2j}) mod p(α)    (2.40)

are derived. A circuit implementing equ(2.40) can be designed directly and consists of (m-1) + (m-2) + ... + 1 = m(m-1)/2 AND gates and at most m(m^2-1)/2 XOR gates. In practice, however, these requirements are much lower. The number of gates for this cubic circuit is given in Table 2.3. In comparison with the standard approach this method offers hardware savings, especially if design optimisation is employed; for example, for m = 8 the number of XOR gates is almost halved with optimisation.

m  | squaring: NXOR | a^3 = a^2 * a: NXOR, NAND | cubic circuit: NXOR, NAND
4  | 2              | 17, 16                    | 16 (13), 6
5  | 3              | 27, 25                    | 29 (21), 10
6  | 3              | 38, 36                    | 47 (33), 15
7  | 3              | 51, 49                    | 66 (46), 21
8  | 12 (10)        | 87, 64                    | 135 (70), 28
9  | 6              | 86, 81                    | 133 (83), 36
10 | 6              | 105, 100                  | 159 (105), 45

Table 2.3. Hardware requirements for exponentiation in GF(2^m); values in parentheses are with design optimisation.

2.7 Finite field inversion
BCH decoders are required to implement the finite field division c = a/b. This division can be implemented using a division algorithm, e.g. [15, 17, 21]. Unfortunately, BCH decoders require the result of a division to be available faster than these algorithms allow. Often, however, b is available earlier than a, and so it can be beneficial first to employ inversion to generate b^-1 and then to use a fast bit-parallel multiplier. Throughout this thesis the Fermat inverter is used. Fermat inverters operating over the normal and dual bases have been presented [16, 54]. The dual basis inverter is hardware efficient and, conveniently, the result of the division is represented in the dual basis and so can be utilised in dual basis multipliers, for example. Hence the dual basis inverter has been employed in this project. A Fermat inverter implements the equation

a^-1 = a^(2^m - 2) = a^2 * a^4 * a^8 * ... * a^(2^(m-1))    (2.41)

and so is based on repeated multiplications and squarings. The dual basis inverter uses a PDBM as presented in Section 2.5.1 and carries out squaring in the polynomial basis as described in Section 2.6.1. The overall inversion circuit requires (m-1) clock cycles to generate a result. To then calculate c = a/b = a*b^-1, one extra clock cycle is required to carry out the multiplication.
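A software sketch of the Fermat inversion of equ(2.41) follows. It reuses square() from the squaring sketch above and adds a generic polynomial-basis multiplier; the (m-1)-cycle hardware scheduling is not modelled, only the arithmetic.

```python
# Sketch of equ(2.41): a^-1 = a^2 * a^4 * ... * a^(2^(m-1)), i.e. repeated
# squaring and multiplying. p as in the earlier sketches (low coefficients).

def gf_mul(a, b, p):
    """Polynomial-basis multiplication modulo p(x), Horner style."""
    m = len(p)
    acc = [0] * m
    for i in reversed(range(m)):             # acc = acc*alpha + a_i*b
        msb = acc[-1]
        acc = [0] + acc[:-1]
        acc = [acc[j] ^ (msb & p[j]) for j in range(m)]
        if a[i]:
            acc = [acc[j] ^ b[j] for j in range(m)]
    return acc

def fermat_inverse(a, p):
    """a^(2^m - 2) by (m-1) squarings and (m-2) multiplications."""
    m = len(p)
    sq = square(a, p)                        # a^2
    result = sq
    for _ in range(m - 2):
        sq = square(sq, p)                   # a^4, a^8, ..., a^(2^(m-1))
        result = gf_mul(result, sq, p)
    return result

# Check in GF(2^4): a * a^-1 should be 1, i.e. [1, 0, 0, 0].
a, p = [0, 1, 1, 0], [1, 1, 0, 0]
print(gf_mul(a, fermat_inverse(a, p), p))
```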
2.8 Conclusions
In this chapter the main definitions and results underpinning finite field theory have been introduced. It has been shown how to generate GF(2^m) from the base field GF(2), and the most important basis representations have been described. The most useful bit-serial and bit-parallel finite field multipliers have been reviewed for adoption in BCH codecs. Circuits for carrying out inversion, division and exponentiation in GF(2^m) have also been described. Finally, some important new circuits have been presented. A hardware efficient method of generating the sum of products using a previously overlooked multiplier has been described. This circuit operates entirely over the polynomial basis and has an attractive input/output format for use in circuits implementing the Berlekamp-Massey algorithm. Two multiplier circuits generating products of the form y = a*b*c have also been presented. These circuits are based around Berlekamp multipliers and SPBMMs, and hardware/time trade-offs determine which of the two options to adopt. Both multipliers employ novel methods of implementing the required basis conversions, so allowing Berlekamp multipliers and SPBMMs to be used in tandem. Finally, a new bit-parallel multiplier, the PPBML, has been presented. This multiplier is a hardware efficient equivalent of a previously presented bit-serial multiplier. In addition, a new algorithm for exponentiation to the power three has been presented; the algorithm is especially hardware efficient if design optimisation is employed.

3. BCH codes
In this chapter Bose-Chaudhuri-Hocquenghem (BCH) codes are introduced and various BCH encoding and decoding algorithms are presented. Three different decoding strategies are presented, according to the error correcting capability of the code. Generally, decoding is broken down into three processes: syndrome calculation, the Berlekamp-Massey algorithm (BMA) and the Chien search. In addition, the BMA can be developed with or without inversion, and both methods are described here. At the end of this chapter comparisons between BCH codes and RS codes are presented.

3.1 Background
The first class of linear codes derived for error correction were the Hamming codes [20]. These codes are capable of correcting only a single error, but because of their simplicity Hamming codes and their variations have been widely used in error control systems, e.g. the 16/32-bit parallel error detection and correction circuits SN54ALS616/SN54ALS632 [50]. Later, the generalised binary class of Hamming codes for multiple errors was discovered by Hocquenghem in 1959 [23], and independently by Bose and Chaudhuri in 1960 [6]. Subsequently, non-binary error-correcting codes were derived by Gorenstein and Zierler [19]. At almost the same time, independently of the work of Bose, Chaudhuri and Hocquenghem, the important subclass of non-binary BCH codes - the RS codes - was introduced by Reed and Solomon [44]. This project is concerned with BCH codes, however, and these codes are described in more detail below.

3.2 BCH codes
The class of BCH codes is a large class of error correction codes that occupies a prominent place in the theory and practice of error correction. This prominence is a result of their relatively simple encoding and decoding techniques. Furthermore, provided the block length is not excessive, there are good codes in this class ([30] Chapter 9). In this thesis only the subclass of binary BCH codes is considered, as these codes can be simply and efficiently implemented in digital hardware. Before considering BCH codes, some additional theory needs to be introduced.

Theorem 3.1. ([30], p.10) The minimum distance of a linear code is the minimum Hamming weight of any non-zero codeword.

Theorem 3.2. ([30], p.10) A code with minimum distance d can correct (d-1)/2 errors.

Definition 3.2.
A linear code C is cyclic if whenever (c_0, c_1, ..., c_{n-1}) ∈ C then so is (c_{n-1}, c_0, c_1, ..., c_{n-2}). A codeword (c_0, c_1, ..., c_{n-1}) of a cyclic code can be represented as the polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}. This correspondence is very helpful, as the mathematical background of polynomials is well developed, and so this representation is used here. It is frequently convenient to define error-correcting codes in terms of the generator polynomial g(x) of the code [29]. The generator polynomial of a t-error-correcting BCH code is defined to be the least common multiple (LCM) of f_1, f_3, ..., f_{2t-1}, that is,

g(x) = LCM{f_1, f_3, f_5, ..., f_{2t-1}}    (3.1)

where f_j is the minimal polynomial of α^j (0 < j < 2t + 1), considered below. Let f_j (0 < j < 2t + 1) be the minimal polynomial of α^j; then f_j is obtained by (Theorem 2.14, [29]):

f_j(x) = Π_{i=0}^{e-1} (x + β^(2^i))    (3.2)

where β = α^j and e is the smallest integer such that β^(2^e) = β; by the same theorem, e ≤ m. To generate a codeword of an (n, k) t-error-correcting BCH code, the k information symbols are formed into the information polynomial i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, where i_j ∈ GF(2). The codeword polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1} is then formed as

c(x) = i(x)*g(x).    (3.3)

Since the degree of f_j(x) is less than or equal to m (e ≤ m, equ(3.2); [29] p.38), from equ(3.1) the degree of g(x) (and consequently the number of parity bits n-k) is at most m*t. For small values of t the number of parity check bits is usually equal to m*t ([29] p.142). For any positive integer m ≥ 3 there exist binary BCH codes (n, k) with the following parameters:

n = 2^m - 1, the length of the codeword in bits
t, the maximum number of error bits that can be corrected
k ≥ n - m*t, the number of information bits in a codeword
d_min ≥ 2t + 1, the minimum distance.

A list of BCH code parameters for m ≤ 10 is given in Appendix D. Note that for t = 1 this construction of BCH codes generates Hamming codes. The number of parity bits equals m, and so (2^m - 1, 2^m - m - 1) codes are obtained. In this case the generator polynomial satisfies g(x) = f_1(x) = p(x), where p(x) is the irreducible polynomial for GF(2^m) as given, for example, in Appendix A. In this thesis only primitive BCH codes are considered. Binary non-primitive BCH codes can be constructed in a similar manner to primitive codes ([29] p.151). Non-primitive BCH codes have a generator polynomial g(x) with β^l, β^{l+1}, β^{l+2}, ..., β^{l+d-2} as roots, where β is an element of GF(2^m) and l is a non-negative integer. Non-primitive BCH codes obtained in this way have a minimum distance of at least d. When l = 1, d = 2t + 1 and β = α, where α is a primitive element of GF(2^m), primitive BCH codes are obtained.

3.3 Encoding BCH codes
If BCH codewords are encoded as in equ(3.3), the data bits do not appear explicitly in the codeword. To overcome this, let

c(x) = x^{n-k} * i(x) + b(x)    (3.4)

where c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}, i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, b(x) = b_0 + b_1x + ... + b_{n-k-1}x^{n-k-1} and c_j, i_j, b_j ∈ GF(2). Then, if b(x) is taken to be the polynomial such that

x^{n-k} * i(x) = q(x)*g(x) - b(x)    (3.5)

the k data bits will be present in the codeword. (By implementing equ(3.4) instead of equ(3.3), systematic ([29] p.54) codewords are generated.) BCH codes are implemented as cyclic codes [42]; that is, the digital logic implementing the encoding and decoding algorithms is organised into shift-register circuits that mimic the cyclic shifts and polynomial arithmetic required in the description of cyclic codes.
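Before turning to the hardware, a minimal software sketch of equ(3.4)-(3.5) is given below: the parity polynomial b(x) is simply the remainder of x^(n-k)*i(x) divided by g(x) over GF(2). The (15, 5) example code of the next page is assumed.

```python
# Sketch of systematic encoding: polynomials are bit-lists, index = power of x.

def poly_mod(num, den):
    """Remainder of num(x) / den(x) over GF(2)."""
    num = list(num)
    for shift in range(len(num) - len(den), -1, -1):
        if num[shift + len(den) - 1]:
            for i, d in enumerate(den):
                num[shift + i] ^= d
    return num[:len(den) - 1]

def encode(info, g, n):
    """Systematic codeword: parity bits b(x), then the k information bits."""
    k = len(info)
    shifted = [0] * (n - k) + info          # x^(n-k) * i(x)
    return poly_mod(shifted, g) + info      # c(x) = b(x) + x^(n-k)*i(x)

# (15, 5) BCH code with g(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^10:
g = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
print(encode([1, 0, 1, 1, 0], g, n=15))
```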
Using the properties of cyclic codes [29, 30], the remainder b(x) can be obtained in a linear (n-k)-stage shift register with feedback connections corresponding to the coefficients of the generator polynomial g(x) = 1 + g_1x + g_2x^2 + ... + g_{n-k-1}x^{n-k-1} + x^{n-k}. Such a circuit is shown in Figure 3.1.

Figure 3.1. Encoding circuit for an (n, k) BCH code (an LFSR with taps g_1 to g_{n-k-1}, feedback switch S1 and output switch S2).

The encoder shown in Figure 3.1 operates as follows:
- For clock cycles 1 to k, the information bits are transmitted in unchanged form (switch S2 in position 2) and the parity bits are calculated in the Linear Feedback Shift Register (LFSR) (switch S1 on).
- For clock cycles k+1 to n, the parity bits in the LFSR are transmitted (switch S2 in position 1) and the feedback in the LFSR is switched off (S1 off).

As an example, the (15, 5) 3-error-correcting BCH code is considered. The generator polynomial with α, α^2, α^3, ..., α^6 as roots is obtained by multiplying the following minimal polynomials:

roots α, α^2, α^4, α^8:    f_1(x) = (x+α)(x+α^2)(x+α^4)(x+α^8) = 1 + x + x^4
roots α^3, α^6, α^12, α^9: f_3(x) = (x+α^3)(x+α^6)(x+α^12)(x+α^9) = 1 + x + x^2 + x^3 + x^4
roots α^5, α^10:           f_5(x) = (x+α^5)(x+α^10) = 1 + x + x^2

Thus the generator polynomial is g(x) = f_1(x)*f_3(x)*f_5(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^10.

3.4 Decoding BCH codes
The decoding process is far more complicated than the encoding process. As a general rule, decoding can be broken down into three separate steps:
1. Calculating the syndromes.
2. Solving the key equation.
3. Finding the error locations.

Fortunately, for some BCH codes step 2 can be omitted. To decode BCH codes in this thesis, three different strategies have been employed: for single error correcting (SEC), double error correcting (DEC), and triple and more error correcting (TMEC) BCH codes. Regarding step 1, the calculation of the syndromes is identical for all BCH codes. For SEC codes step 2 - solving the key equation - can be omitted, as a syndrome directly gives the error location polynomial coefficient. For DEC codes step 2 can also be omitted, but the error location algorithm is rather more complicated. Finally, when implementing the TMEC decoding algorithm all three steps must be carried out, and step 2 - the solution of the key equation - is the most complicated.

3.4.1 Calculation of the syndromes
Let

c(x) = c_0 + c_1x + c_2x^2 + ... + c_{n-1}x^{n-1}
r(x) = r_0 + r_1x + r_2x^2 + ... + r_{n-1}x^{n-1}
e(x) = e_0 + e_1x + e_2x^2 + ... + e_{n-1}x^{n-1}    (3.6)

be the transmitted polynomial, the received polynomial and the error polynomial respectively, so that

r(x) = c(x) + e(x).    (3.7)

The first step of the decoding process is to store the received polynomial r(x) in a buffer register and to calculate the syndromes S_j (1 ≤ j ≤ 2t-1). The most important feature of the syndromes is that they do not depend on the transmitted information but only on the error locations, as shown below. Define the syndromes S_j as

S_j = Σ_{i=0}^{n-1} r_i α^{ij}    (1 ≤ j ≤ 2t).    (3.8)

Since r_j = c_j + e_j (j = 0, 1, ..., n-1),

S_j = Σ_{i=0}^{n-1} (c_i + e_i) α^{ij} = Σ_{i=0}^{n-1} c_i α^{ij} + Σ_{i=0}^{n-1} e_i α^{ij}.    (3.9)

By the definition of BCH codes,

Σ_{i=0}^{n-1} c_i α^{ij} = 0    (1 ≤ j ≤ 2t)    (3.10)

and thus

S_j = Σ_{i=0}^{n-1} e_i α^{ij}.    (3.11)

It is therefore observed that the syndromes S_j depend only on the error polynomial e(x), and so if no errors occur the syndromes will all be zero. To generate the syndromes, express equ(3.8) as

S_j = (...((r_{n-1}*α^j + r_{n-2})*α^j + r_{n-3})*α^j + ...)*α^j + r_0.    (3.12)

Thus a circuit calculating the syndrome S_j carries out (n-1) multiplications by the constant value α^j and (n-1) single-bit summations.
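The Horner evaluation of equ(3.12) is easy to check in software. The sketch below builds GF(2^4) with p(x) = x^4 + x + 1 (the field of the S_3 example that follows) as a power table; the helper names are illustrative, not from the thesis.

```python
# Sketch of syndrome evaluation (3.12) over GF(2^4).

def build_gf(m, p_taps):
    """Power table alpha^0 .. alpha^(2^m - 2) as integers (bit i = alpha^i)."""
    powers, x = [], 1
    for _ in range(2 ** m - 1):
        powers.append(x)
        x <<= 1
        if x >> m:                           # reduce with x^m = p_taps
            x = (x & (2 ** m - 1)) ^ p_taps
    return powers

def gf_mul_int(u, v, powers):
    """Multiply two field elements given as integers, via the power table."""
    if u == 0 or v == 0:
        return 0
    logs = {e: i for i, e in enumerate(powers)}
    return powers[(logs[u] + logs[v]) % len(powers)]

def syndrome(r_bits, j, powers):
    """S_j by Horner's rule: constant multiplication by alpha^j each step."""
    s = 0
    for bit in reversed(r_bits):             # r_{n-1} processed first
        s = gf_mul_int(s, powers[j], powers) ^ bit
    return s

powers = build_gf(4, 0b0011)                 # x^4 = x + 1
r = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
print(syndrome(r, 3, powers))                # S_3 for the received word r(x)
```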
Note that because r_j ∈ GF(2), the equation S_{2i} = S_i^2 holds [29]. For example, a circuit calculating S_3 for m = 4 and p(x) = x^4 + x + 1 is presented in Figure 3.2. Initially the registers s_i (0 ≤ i ≤ 3) are set to zero. The register s_0-s_3 is then shifted 15 times as the received bits r_i (0 ≤ i ≤ 14) are clocked into the syndrome calculation circuit, after which S_3 is obtained in the s_0-s_3 register.

Figure 3.2. Circuit computing S_3 for m = 4.

Syndromes can also be calculated in a second way ([29] p.152, 165; [30] p.271). Employing this approach, S_j is obtained from the remainder in the division of the received polynomial by the minimal polynomial f_j(x), that is,

r(x) = a_j * f_j(x) + b_j(x)    (3.13)

where

S_j = b_j(α^j).    (3.14)

It should be mentioned that the minimal polynomials for α, α^2, α^4, ... are the same, and so only one register is required to calculate the syndromes S_1, S_2, S_4, ...; the rule extends to S_3, S_6, ..., and so on. For example, the circuit calculating S_3 for m = 4 is shown in Figure 3.3. The minimal polynomial of α^3 is f_3(x) = 1 + x + x^2 + x^3 + x^4. Let b(x) = b_0 + b_1x + b_2x^2 + b_3x^3 be the remainder on dividing r(x) by f_3(x); then

S_3 = b(α^3) = b_0 + b_1α^3 + b_2α^6 + b_3α^9 = b_0 + b_3α + b_2α^2 + (b_1 + b_2 + b_3)α^3.

The circuit in Figure 3.3 therefore operates by first dividing r(x) by f_3(x) to generate b(x) and then calculating b(α^3). The result is obtained after the registers b_0-b_3 have been clocked 15 times.

Figure 3.3. Second method of computing S_3 for m = 4.

3.4.2 Solving the key equation
The second stage of the decoding process is finding the coefficients of the error location polynomial σ(x) = σ_0 + σ_1x + ... + σ_tx^t using the syndromes S_j (1 ≤ j < 2t). The relationship between the syndromes and the coefficients σ_j is given by ([5], p.168)

Σ_{j=0}^{t} σ_j S_{t+i-j} = 0    (i = 1, ..., t)    (3.15)

and the roots of σ(x) give the error positions. The coefficients of σ(x) can be calculated by methods such as the Peterson-Gorenstein-Zierler algorithm [5, 43] or Euclid's algorithm [49]. In this thesis the Berlekamp-Massey Algorithm (BMA) [2, 32] has been used, as it has the reputation of being the most efficient method in practice [5]. In the BMA, the error location polynomial σ(x) is found by t-1 recursive iterations. During each iteration r the degree of σ(x) is usually incremented by one. Through this method the degree of σ(x) is exactly the number of corrupted bits, as the roots of σ(x) are associated with the transmission errors. The BMA is based on the property that for a number of iterations r greater than or equal to the number of errors t_a that have actually occurred (r ≥ t_a), the discrepancy d_r in equ(3.16) below is zero, where

d_r = Σ_{j=0}^{t} σ_j S_{2r+1-j}.    (3.16)

On the other hand, if r < t_a, the discrepancy d_r calculated in equ(3.16) is usually non-zero and is used to modify the degree and coefficients of σ(x). What the BMA essentially does, therefore, is compute the lowest-degree σ(x) such that equ(3.15) holds. The BMA with inversion is given below, where B(x) denotes the correction polynomial held in the B registers of Figure 3.4. The initial values are

d_p = 1 if S_1 = 0; d_p = S_1 if S_1 ≠ 0
σ^(0)(x) = 1
σ^(1)(x) = 1 + S_1x
B^(1)(x) = x^3 if S_1 = 0; B^(1)(x) = x^2 if S_1 ≠ 0
l_1 = 0 if S_1 = 0; l_1 = 1 if S_1 ≠ 0
r = 1.    (3.17)
The error location polynomial σ(x) is then generated by the following set of recursive equations:

d_r = Σ_{i=0}^{t} σ_i^(r) S_{2r+1-i}
bsel = 0 if d_r = 0 or r < l_r; bsel = 1 if d_r ≠ 0 and r ≥ l_r
σ^(r+1)(x) = σ^(r)(x) + d_p^-1 d_r B^(r)(x)
B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0; B^(r+1)(x) = x^2 σ^(r)(x) if bsel = 1
l_{r+1} = l_r if bsel = 0; l_{r+1} = 2r + 1 - l_r if bsel = 1
d_p = d_p if bsel = 0; d_p = d_r if bsel = 1
r = r + 1    (3.18)

These calculations are carried out for r = 1, ..., t-1. Note that the above algorithm is slightly modified in comparison with the previously presented BMA [2, 32]: due to the more complicated initial state, the number of iterations is decreased by one. In practice this causes only a slight increase in the hardware requirements, but the BMA calculation time is significantly reduced. A circuit implementing the BMA is given in Figure 3.4. The error location polynomial σ(x) is obtained in the C registers after t-1 iterations.

Figure 3.4. Berlekamp-Massey Algorithm with inversion (B and C register chains, inverter, discrepancy register d_r and syndrome inputs).

In some applications it may be beneficial to implement the BMA without inversion. A version of the BMA achieving this was presented in [8, 56]. For the inversionless BMA the initial conditions are the same as for the BMA with inversion, given in equ(3.17). The error location polynomial is then generated by the following recursive equations:

d_r = Σ_{i=0}^{t} σ_i^(r) S_{2r+1-i}
bsel = 0 if d_r = 0 or r < l_r; bsel = 1 if d_r ≠ 0 and r ≥ l_r
σ^(r+1)(x) = d_p σ^(r)(x) + d_r B^(r)(x)
B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0; B^(r+1)(x) = x^2 σ^(r)(x) if bsel = 1
l_{r+1} = l_r if bsel = 0; l_{r+1} = 2r + 1 - l_r if bsel = 1
d_p = d_p if bsel = 0; d_p = d_r if bsel = 1
r = r + 1    (3.19)

In conclusion, the inversionless BMA is more complicated and requires a greater number of multiplications than the BMA with inversion. On the other hand, inversion can take (m-1) clock cycles (see Section 2.7), and therefore even if parallel multiplication is used this constraint will slow down the algorithm. The inversionless algorithm therefore has to be implemented for some BCH codes. For SEC and DEC BCH codes the coefficients of σ(x) can be obtained directly without using the BMA. This is because for SEC BCH codes σ(x) = 1 + S_1x, and for DEC BCH codes σ(x) = 1 + σ_1x + σ_2x^2 = 1 + S_1x + (S_1^2 + S_3*S_1^-1)x^2 [2], ([30] p.321). This approach for generating σ(x) directly in terms of the syndromes can theoretically be extended to TMEC codes, but it quickly becomes too complex to implement in hardware, and so the BMA must be used.
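For reference, the sketch below gives the classical software form of the Berlekamp-Massey algorithm (processing S_1 to S_2t one syndrome at a time), which is useful for checking the modified hardware recursion above; it does not reproduce the register-level timing of Figure 3.4. It reuses build_gf and gf_mul_int from the syndrome sketch.

```python
# Classical Berlekamp-Massey over GF(2^m), as a reference model.

def gf_div(u, v, powers):
    logs = {e: i for i, e in enumerate(powers)}
    return 0 if u == 0 else powers[(logs[u] - logs[v]) % len(powers)]

def berlekamp_massey(S, powers):
    """Return sigma(x) as a coefficient list (sigma_0 first)."""
    C, B = [1], [1]
    L, m, b = 0, 1, 1
    for n in range(len(S)):
        d = S[n]                                   # discrepancy
        for i in range(1, L + 1):
            d ^= gf_mul_int(C[i], S[n - i], powers)
        if d == 0:
            m += 1
            continue
        T = list(C)
        coef = gf_div(d, b, powers)
        C += [0] * (m + len(B) - len(C))           # room for x^m * B(x)
        for i, Bi in enumerate(B):
            C[i + m] ^= gf_mul_int(coef, Bi, powers)
        if 2 * L <= n:                             # length change step
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

# Single error at position 5: S_j = alpha^(5j), sigma(x) = 1 + alpha^5 x.
powers = build_gf(4, 0b0011)
S = [powers[(5 * j) % 15] for j in range(1, 5)]    # S_1..S_4 for t = 2
print(berlekamp_massey(S, powers))
```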
3.4.3 Finding the error locations
3.4.3.1 General case
The last step in decoding BCH codes is to find the error location numbers. These values are the reciprocals of the roots of σ(x) and may be found simply by substituting 1, α, α^2, ..., α^{n-1} into σ(x). A method of achieving this by sequential substitution has been presented by Chien [10]. In the Chien search the sum

σ_0 + σ_1α^j + σ_2α^{2j} + ... + σ_tα^{tj}    (j = 0, 1, ..., k-1)    (3.20)

is evaluated on every clock cycle. If σ(α^j) = 0, the received bit r_{n-1-j} is corrupted; therefore, if on clock cycle j the sum equals zero, the received bit r_{n-1-j} should be corrected. A circuit implementing the Chien search is shown in Figure 3.5. The operation of this circuit is as follows. The registers c_0, c_1, ..., c_t are initialised with the coefficients of the error location polynomial σ_0, σ_1, ..., σ_t. The sum Σ_{j=0}^{t} c_j is calculated and, if it equals zero, an error has been found and, after being delayed in a buffer, the faulty received bit is corrected using an XOR gate. On the next clock cycle each value in the c_i register is multiplied by α^i (using a constant multiplier), and the sum Σ_{j=0}^{t} c_j is calculated again. The above operations are carried out for every transmitted information bit (that is, k times).

Figure 3.5 Chien's search circuit (registers c_0 to c_t, constant multipliers α to α^t, summation and output correction).

3.4.3.2 Finding the error locations for t = 2
In the case of DEC BCH codes, two different algorithms may be adopted. Firstly, one may adopt the general procedure, namely finding the syndromes, implementing the relatively burdensome BMA and then applying the Chien search. In this thesis, however, another approach has been adopted [53]. This algorithm does not require the error location polynomial σ(x) to be generated; instead a more sophisticated error location procedure, summarised below, is adopted. Suppose the received vector has at most two errors; then the error location polynomial σ(x) is given by [2] ([30] p.321):

σ(x) = 1 + σ_1x + σ_2x^2 = 1 + S_1x + (S_1^2 + S_3*S_1^-1)x^2.    (3.21)

Therefore, if there is no error, σ_1 = 0 and σ_2 = 0, and thus

S_1 = 0, S_3 = 0.    (3.22)

If only one error has occurred, σ_1 ≠ 0 and σ_2 = 0, and thus

S_1 ≠ 0, S_3 = S_1^3.    (3.23)

If there are two errors, σ_1 ≠ 0 and σ_2 ≠ 0, and thus

S_1 ≠ 0, S_3 ≠ S_1^3.    (3.24)

If S_1 = 0 and S_3 ≠ 0, more than two errors have occurred and the error pattern cannot be corrected. This step-by-step decoding algorithm is based around the assumption that an error has occurred at the present location, and the values in the s_1 and s_3 registers are changed accordingly. These changes are easily implemented because, if the received bit r_{n-1} is corrupted, only the first bits s_{i,0} of the registers s_1 and s_3 need be negated, where s_i = s_{i,0} + s_{i,1}x + ... + s_{i,m-1}x^{m-1} (i = 1, 3) and the s_1 and s_3 registers hold the values of S_1 and S_3 respectively. Similarly, assuming that the received bit r_{n-1-j} is corrupted, the syndrome registers are clocked j times implementing the function s_i <- s_i*α^i (with a circuit similar to the syndrome calculation circuit), and again only the first bits s_{1,0} and s_{3,0} are negated [53]. A circuit employing this algorithm is given in Figure 3.6. At first the registers s_1 and s_3 are initialised with S_1 and S_3 respectively and, using equ(3.22)-(3.24), the number of errors present is stored by clocking the values h_1, h_3 into the flip-flops p_1, p_3. It is then assumed that an error has occurred in the first position. The registers s_1 and s_3 are updated and, again using equ(3.22)-(3.24), the new number of errors present is determined. If the new number of errors has decreased, the assumption has proven correct and an error has been found. That is, the received bit r_{n-1} is corrected and the error assumption changes are introduced permanently into the s_1 and s_3 registers; in addition, the p flip-flops are clocked with the new h signals. Alternatively, if the number of errors has not decreased, the assumption was wrong, the correct bit has been received, and the changes are cancelled. The above operations are repeated for every received information bit r_{n-1-j} (0 ≤ j < k), after the s registers have been shifted (s_i <- s_i*α^i, i = 1, 3) j times.

Figure 3.6. Error location circuit for t = 2 (s_1 and s_3 registers, cubing circuit, comparators generating h_1 and h_3, flip-flops p_1 and p_3, buffer and error decision circuit).
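A behavioural sketch of the Chien search of Section 3.4.3.1 is given below. It follows the text's convention that a zero sum on clock j flags bit r_{n-1-j}; the exact correspondence between j and the bit index in a real decoder depends on the transmission order. It reuses build_gf and gf_mul_int from the earlier sketches.

```python
# Sketch of equ(3.20): registers c_i are repeatedly scaled by alpha^i,
# and a zero sum flags the bit r_{n-1-j}.

def chien_search(sigma, n, k, powers):
    """Return the list of flagged bit positions n-1-j."""
    c = list(sigma)                          # c_i initialised with sigma_i
    flagged = []
    for j in range(k):
        total = 0
        for ci in c:
            total ^= ci                      # sum over GF(2^m) is XOR
        if total == 0:                       # sigma(alpha^j) = 0
            flagged.append(n - 1 - j)
        for i in range(1, len(c)):           # constant multipliers alpha^i
            c[i] = gf_mul_int(c[i], powers[i], powers)
    return flagged

# sigma(x) = 1 + alpha^5 x has its root at alpha^10, so the sum vanishes on
# clock j = 10 and bit r_4 is flagged under the r_{n-1-j} convention
# (scanning all n positions here for illustration).
powers = build_gf(4, 0b0011)
print(chien_search([1, powers[5]], n=15, k=15, powers=powers))
```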
3.5 Reed-Solomon codes
In this thesis only binary BCH codes have been considered and only their hardware representation developed. A comparison between binary BCH codes and non-binary BCH codes - the subclass of RS codes [44] - is therefore given here. RS codes are the most efficient error correcting codes theoretically possible, and a wide body of knowledge concerning them exists [5, 30]. In addition, RS codes are especially attractive as they can correct not only random but also burst errors. In many situations the information channel has memory, and so random-error-correcting binary BCH codes are not appropriate. Binary BCH codes can correct burst errors when an interleaved code with large t is adopted, but as will be shown below this architecture is not recommended; RS codes should be used instead. RS codes operate on symbols consisting of m bits, which are elements of GF(2^m). Each codeword consists of (n = 2^m - 1, k = 2^m - 1 - 2t) such symbols, where t is the maximum number of symbols that may be corrected. The decoding of RS codes will now briefly be presented in comparison with BCH codes. The encoding process is omitted here, as it is relatively simple and therefore only slightly influences a codec's complexity. There are two different ways of decoding RS codes [5, 16]: in the time domain or in the frequency domain. Here the frequency domain decoding process is considered. Decoding may be separated into four main areas:

1. Calculation of the syndromes using the equation

S_i = Σ_{j=0}^{n-1} r_j α^{ij}    (0 ≤ i < 2t)    (3.25)

where the r_j ∈ GF(2^m) are the received symbols (see also equ(3.8)). Note that the calculation of the syndromes for BCH codes is simpler than for RS codes: for BCH codes r_i ∈ GF(2), whereas for RS codes r_i ∈ GF(2^m), and so the equation S_{2i} = S_i^2 does not hold for RS codes.

2. The Berlekamp-Massey Algorithm. The BMA is similar to the BCH case but requires twice as many iterations (for the same value of t).

3. Recursive extension, computing the equation

E_i = Σ_{l=1}^{t} σ_l E_{i-l}    (2t ≤ i ≤ n-1), where E_i = S_i for 0 ≤ i ≤ 2t-1.    (3.26)

Youzhi [56] has shown that this step can be implemented with a BMA circuit by adding only additional control signals.

4. Obtaining the error magnitudes e_i by computing the inverse transform

e_i = (1/n) Σ_{j=0}^{n-1} E_j α^{-ij}.    (3.27)

This step is not required for BCH codes and is rather more complicated than the Chien search.

If binary BCH codes and non-binary RS codes are compared, it may at first seem that BCH codes are much simpler to implement: RS codes operate on symbols and require additional steps to be computed, since not only the error locations (as with BCH codes) but also the error magnitudes have to be calculated. On closer consideration, however, it may be seen that, for example, a (15, 11) RS code can correct up to two corrupted 4-bit symbols, e.g. at least one 5-bit burst error. This code consists of 4*15 = 60 codeword bits and 4*11 = 44 information bits. Conversely, consider a comparable (63, 36) BCH code correcting 5 random bit errors. (It should be noted that this comparison of BCH and RS codes is not based on any practical experiments, and in practice different codes would perhaps be compared.) This BCH code has not only a lower information rate (k/n), but more hardware is needed in the decoder. Admittedly, the calculation of the syndromes is simpler than for the RS code, but a greater number of syndromes must be computed. Furthermore, the BMA is much more complex for the BCH code, as the number of errors is greater. RS codes do require the inverse transform calculation, which is more complicated than the equivalent Chien search circuit; taken overall, however, the hardware requirements of the RS codec are much lower.
In addition, the BCH code requires all operations to be carried out over GF(2^6), whereas the RS code operates over GF(2^4), and so more complex arithmetic circuits are required for the BCH code. In conclusion, RS codecs generally have more attractive properties and should be preferred where burst errors have to be corrected.

3.6 Conclusions
In this chapter BCH codes have been introduced. Encoding and decoding algorithms for BCH codes with different error-correcting abilities have been considered. Decoders have a more complex structure than encoders, and so the decoding process has been broken down into three separate steps. The first step is the syndrome calculation process, which is identical whatever the error-correcting ability of the code. The next step is to find the error location polynomial σ(x). This stage is the most complicated of the three, and for DEC BCH codes an alternative decoding algorithm is used which bypasses the need to generate this polynomial entirely. For TMEC BCH codes σ(x) is calculated using the relatively complex BMA, whereas for SEC BCH codes σ(x) can be expressed immediately in terms of the syndromes. The last stage of decoding is to find and correct any errors present. Two different approaches have been employed to achieve this: one for the general case and one for DEC BCH codes.

CHAPTER 4

4.1 High speed architectures for general Linear Feedback Shift Registers
A CRC check generation circuit can be implemented with a linear feedback circuit. The figure below shows the LFSR representation of a CRC with generator polynomial 1 + y + y^3 + y^5.

4.2 Architecture for the polynomial G(y) = 1 + y + y^3 + y^5

Fig. 4.1 Serial structure (a five-stage register with XOR taps corresponding to 1 + y + y^3 + y^5).

CRC codes have been used for years to detect data errors on interfaces, and their operation and capabilities are well understood.

4.3 Motivation for the parallel implementation
Cyclic redundancy check (CRC) is widely used to detect errors in data communication and storage devices. When high-speed data transmission is required, the general serial implementation cannot meet the speed requirement. Since parallel processing is a very efficient way to increase the throughput rate, parallel CRC implementations have been discussed extensively in the past decade. Although parallel processing increases the number of message bits that can be processed in one clock cycle, it can also lead to a long critical path (CP); thus the increase of throughput rate achieved by parallel processing is reduced by the decrease of circuit speed. Another issue is the increase of hardware cost caused by parallel processing, which needs to be controlled. This brief addresses these two issues of parallel CRC implementations.

4.4 Literature survey and existing systems
In the past, recursive formulas have been developed for parallel CRC hardware computation based on mathematical deduction; they have identical CPs. The parallel CRC algorithm in [2] processes an m-bit message in (m+k)/L clock cycles, where k is the order of the generator polynomial and L is the level of parallelism. However, in [1], m message bits can be processed in m/L clock cycles. The high-speed architectures for parallel long Bose-Chaudhuri-Hocquenghem (BCH) encoders in [3] and [4], which are based on multiplication and division computations on the generator polynomial, are efficient in terms of speeding up parallel linear feedback shift register (LFSR) structures. They can also be used for the LFSR of any generator polynomial.
However, their hardware cost is high.

4.5 LFSR (Linear Feedback Shift Register)
A Linear Feedback Shift Register (LFSR) is a shift register whose input bit is a linear function of its previous state. The only linear functions of single bits are xor and inverse-xor; thus it is a shift register whose input bit is driven by the exclusive-or (xor) of some bits of the overall shift register value. The initial value of the LFSR is called the seed, and because the operation of the register is deterministic, the sequence of values produced by the register is completely determined by its current (or previous) state. Likewise, because the register has a finite number of possible states, it must eventually enter a repeating cycle. However, an LFSR with a well-chosen feedback function can produce a sequence of bits which appears random and which has a very long cycle.

4.6 Serial input hardware realization

Fig 4.2. Basic LFSR architecture
Fig 4.3. Linear Feedback Shift Register implementation of CRC-32

4.7 DESIGN OF ARCHITECTURES USING DSP TECHNIQUES
4.7.1 Unfolding
Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program. Unfolding a DSP program by an unfolding factor J creates a new program that describes J consecutive iterations of the original program. It increases the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time.

4.7.2 Pipelining
Pipelining reduces the effective critical path by introducing pipelining latches along the critical data path, either to increase the clock frequency or sample speed, or to reduce power consumption at the same speed. Here it is done using a look-ahead pipelining algorithm to reduce the iteration bound of the CRC architecture.

4.7.3 Retiming
Retiming is a technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit. It moves existing delays around:
• it does not alter the latency of the system
• it reduces the critical path of the system
Retiming has many applications in synchronous circuit design, including reducing the clock period of the circuit, reducing the number of registers in the circuit, reducing the power consumption of the circuit, and logic synthesis. It can be used to increase the clock rate of a circuit by reducing the computation time of the critical path.

4.7.4 Critical path
The critical path is the path with the longest computation time among all paths that contain zero delays; its computation time is the lower bound on the clock period of the circuit.

4.7.5 Iteration bound
The iteration bound is defined as the maximum of all the loop bounds. A loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop.
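The loop-bound and iteration-bound definitions above reduce to a one-line computation; the sketch below uses the three feedback loops of the serial LFSR for g(y) = 1 + y + y^3 + y^5 analysed in Section 4.7.6, with times expressed in multiples of T_XOR.

```python
# Sketch of Section 4.7.5: iteration bound = max over loops of t/w.

def iteration_bound(loops):
    """loops: iterable of (t, w) pairs; returns max(t/w) in T_XOR units."""
    return max(t / w for t, w in loops)

# The serial LFSR of Fig. 4.4 has loop bounds T_XOR, (3/4)T_XOR and
# (3/5)T_XOR (see Section 4.7.6):
print(iteration_bound([(1, 1), (3, 4), (3, 5)]))   # -> 1.0, i.e. T_XOR
```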
In [ ], a high-speed parallel CRC implementation based on unfolding, pipelining and retiming is proposed. These parallel LFSR structures are not always efficient for general LFSR structures. Furthermore, the large fan-out problem explained in [ ] for long LFSR structures is not addressed in these papers. A pipelining technique is needed to reduce the achievable minimum clock period before parallel implementation is applied. A three-step LFSR structure is presented in [4]: the message input m(x) is first multiplied by a factor polynomial p(x); the generator polynomial g(x) is then modified as g'(x) = p(x)g(x); finally, the remainder of m(x)p(x)/g'(x) is divided by p(x), and the quotient is the expected output. The second step of this algorithm inserts as many delay elements as possible into the right-most feedback loop, which otherwise causes large fan-out and long latency when g(x) is long [4]. This three-step scheme can eliminate the effect of large fan-out. However, the feedback structure of p(x) in the third step can still limit the achievable clock frequency of the final parallel LFSR structure. Three approaches are proposed in [5] to eliminate the feedback loops in the third step. Since the speed bottleneck of the three-step algorithm in [4] is usually located in the third step, the new approaches in [5] can efficiently speed up the final parallel LFSR structures; however, their hardware cost is high. Furthermore, since the goal of the second step in [4] is merely to insert delay elements into the right-most feedback loop, and the achievable clock frequency is not necessarily determined by this feedback loop, the feedback structure obtained from the second step of [4] is not optimal for the achievable clock frequency. This paper uses different structures to solve the bottleneck in the third step of [4] with lower hardware cost. When we construct p(x), we guarantee that p(x) can be decomposed into several short-length polynomials. Since the quotient of dividing the output of the second step by p(x) can alternatively be obtained by dividing the output of the second step by a chain of the factor polynomials of p(x), and since the feedback loop of a small-length polynomial can easily be handled by the look-ahead pipelining techniques [3], the iteration bound bottleneck can be solved with a smaller number of XOR gates than needed in [5]. A search algorithm for reducing the achievable clock period of the second step, and thus of the overall parallel LFSR, is then presented after the large fan-out problem is solved.

4.7.6 IMPROVED ALGORITHM FOR ELIMINATING THE FANOUT BOTTLENECK FOR LFSR STRUCTURES
For a generator polynomial g(x) of degree (n-k) and a message sequence m(x) of degree (k-1), systematic encoding provides the codeword of degree (n-1) as c(x) = m(x)x^(n-k) + Rem(m(x)x^(n-k))_g(x), where Rem(m(x)x^(n-k))_g(x) is the remainder of dividing m(x)x^(n-k) by g(x). For example, if g(y) = 1 + y + y^3 + y^5, its corresponding LFSR structure for computing Rem(m(x)x^(n-k))_g(x) is shown in Fig. 4.4, where D denotes a delay element. The message sequence m(x) is injected from the right side with the most significant bit first; after k clock cycles, Rem(m(x)x^(n-k))_g(x) is available in the delay elements. From Fig. 4.4 we can see that the structure has 3 feedback loops, with loop bounds of T_XOR, (3/4)T_XOR and (3/5)T_XOR for loops 1, 2 and 3 respectively. The iteration bound is thus T_XOR, corresponding to the right-most feedback loop, where T_XOR is the computation time of an XOR gate. Note that the iteration bound is defined as the maximum of all the loop bounds, and a loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop [6].
The iteration bound is the minimum achievable clock period of a digital system.

Fig 4.4 LFSR structure for g(y) = 1 + y + y^3 + y^5: (a) serial structure
Fig 4.5 LFSR structure for g(y) = 1 + y + y^3 + y^5: (b) 3-parallel structure

The 3-parallel implementation of the LFSR structure in Fig. 4.4 is shown in Fig. 4.5. Observing Fig. 4.5, we can identify two issues. One is that the iteration bound has increased to 3T_XOR, which means that although the throughput rate has been increased by a factor of 3, the achievable clock frequency has decreased by a factor of 3; the achievable processing speed is thus the same as for the serial LFSR structure in Fig. 4.4. Therefore, reducing the iteration bound of the original serial LFSR structure is important before we apply parallel implementation [3]. The other issue indicated in Fig. 4.5 is that each of the three right-most XOR gates drives many other XOR gates - a large number when the generator polynomial is long - and thus causes a large fan-out delay [4]. Inserting delay elements between the right-most XOR gates and their subsequent XOR gates can solve this issue. The two right-most XOR gates in Fig. 4.5 can be separated from their subsequent XOR gates by first pipelining the inputs y(3k) and y(3k+1) and then applying retiming. However, this scheme cannot be applied to the lowest right-most XOR gate. This is caused by the fact that the number of delay elements in the right-most feedback loop in Fig. 4.4 is only 2, less than the desired parallelism level of 3. Therefore, inserting enough delay elements into the right-most feedback loop of the original serial LFSR structure is the key to solving the large fan-out issue, and this is the contribution of the three-step LFSR architecture in [4]. From m(x)x^(n-k)p(x) = q(x)g'(x) + r'(x), we can see that if we multiply both m(x)x^(n-k) and g(x) by p(x), the remainder of dividing m(x)x^(n-k)p(x) by g'(x) is r'(x) = r(x)p(x), and r(x) is the quotient of dividing r'(x) by p(x). This is the basic idea of the three-step LFSR architecture. Since g'(x) is constructed as g(x)p(x), it is very important to choose a proper p(x) for addressing the two issues discussed above. In [4] a clustered look-ahead computation is applied to find p(x). This scheme is efficient and can insert delay elements into the right-most feedback loop with the minimum increase of polynomial degree of g(x), and thus can control the increase in XOR gates. However, the iteration bound bottleneck is transferred to the LFSR in the third step. Although three approaches have been proposed in [5] to address this iteration bound issue in the third step, the hardware cost is high and the iteration bound bottleneck is again transferred to the second step.

4.7.7 Improved algorithm for eliminating the fanout bottleneck
We start with the same example used in [4].

Example 1: Given a BCH(255, 223) code using generator polynomial
g(x) = 1 + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^9 + x^14 + x^16 + x^17 + x^19 + x^20 + x^22 + x^25 + x^26 + x^27 + x^29 + x^30 + x^31 + x^32,
assume our targeted parallelism level is 8. Then we obtain:

p(x) = (1+x)(1+x^4)(1+x^5)
g'(x) = g(x)p(x) = 1 + x + x^2 + x^4 + x^7 + x^8 + x^11 + x^12 + x^14 + x^16 + x^17 + x^18 + x^25 + x^27 + x^28 + x^31 + x^33 + x^42.

This is illustrated in Fig. 4.6.

Fig 4.6 Three-step implementation of BCH(255, 223) encoding: (a) first step, (b) second step, (c) third step and (d)-(e) improved third step
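The construction of g'(x) = g(x)p(x) is plain GF(2) polynomial multiplication and can be checked in a few lines of Python; representing polynomials as sets of exponents keeps the XOR cancellations explicit.

```python
# Sketch: constructing g'(x) = g(x)p(x) over GF(2) for Example 1.

def gf2_poly_mul(a_exps, b_exps):
    """Multiply two GF(2) polynomials given as sets of exponents."""
    out = set()
    for i in a_exps:
        for j in b_exps:
            out ^= {i + j}          # XOR: coefficients cancel in pairs
    return out

g = {0, 2, 3, 4, 5, 6, 7, 9, 14, 16, 17, 19, 20, 22,
     25, 26, 27, 29, 30, 31, 32}
p = gf2_poly_mul(gf2_poly_mul({0, 1}, {0, 4}), {0, 5})
g_prime = gf2_poly_mul(g, p)
print(sorted(g_prime))              # matches the g'(x) listed above
```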
The improved third step of Fig. 4.6(d) is based on the look-ahead pipelining scheme discussed as follows. The operation of 1/(1+x^k) is shown in Fig. 4.7(a), where we implement

b(n+k) = a(n) + b(n).    (4.1)

If we apply k steps of look-ahead to (4.1), we obtain

b(n+2k) = a(n+k) + b(n+k) = a(n+k) + a(n) + b(n).    (4.2)

The corresponding hardware implementation is shown in Fig. 4.7(b). A further 2k steps of look-ahead applied to (4.2) gives

b(n+4k) = a(n+3k) + a(n+2k) + b(n+2k) = a(n+3k) + a(n+2k) + a(n+k) + a(n) + b(n)    (4.3)

which can be implemented with the structure shown in Fig. 4.7(c).

Fig 4.7 Look-ahead pipelining for 1/(1+x^k): (a) original hardware implementation for 1/(1+x^k), (b) 2k-level look-ahead pipelining and (c) 4k-level look-ahead pipelining

Replacing the 1/(1+x) in Fig. 4.6(c) with the structure of Fig. 4.7(c) for k = 1, Fig. 4.6(c) can be pipelined as shown in Fig. 4.6(d). Note that in Fig. 4.6(d) there are two 1/(1+x^4) blocks in a row; over GF(2) they can be simplified to 1/(1+x^8). This process is shown in Fig. 4.6(e). From Example 1 we can see that the difference between the proposed scheme and the previous three-step LFSR structures [4][5] lies in the construction of p(x). In this paper p(x) is constructed from small-length polynomials, which are easy to handle with the proposed look-ahead pipelining algorithm; in [4][5], p(x) is obtained and handled as a single long polynomial, which usually leads to a large iteration bound and is difficult to handle. When p(x) is decomposed into small-length polynomials, the operational characteristics of 1/p(x) are slightly different from when it is implemented as a single long polynomial. We show this difference with the example in Fig. 4.8; a strict proof is given in Appendix A. From Fig. 4.8 we can see that the structure for p(x) = 1 + x + x^2 + x^3 yields both the correct remainder and the correct quotient of m(x)/p(x), while the structure for p(x) = (1+x)(1+x^2) provides only the correct quotient. However, using p(x) = (1+x)(1+x^2) is sufficient, because only the quotient is needed in the third step. Furthermore, in the structure for p(x) = (1+x)(1+x^2) shown in Fig. 4.8(b), the quotient is obtained without any latency. This is an important advantage, because the structure for p(x) = 1 + x + x^2 + x^3 in Fig. 4.8(a) has a latency of 3 clock cycles, equal to the degree of p(x). The iteration bound of the third step can thus easily be reduced to any small value, and the iteration bound of the overall LFSR structure is determined by the second step, which is not handled in [4][5]. We discuss this issue in Section 4.8.

Fig 4.8 Operation of m(x)/p(x) for m(x) = 1 + x + x^2 + x^3 + x^4 + x^5 + x^6: (a) p(x) = 1 + x + x^2 + x^3 and (b) p(x) = (1+x)(1+x^2)
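The Fig. 4.8 claim - that dividing by the factor chain reproduces the quotient but not the remainder - is easy to confirm in software:

```python
# Sketch checking Fig. 4.8: quotient of m(x)/p(x) via a chain of factors.
# Polynomials are coefficient lists, x^0 first.

def gf2_divmod(num, den):
    """Quotient and remainder of num(x)/den(x) over GF(2)."""
    num, q = list(num), [0] * max(len(num) - len(den) + 1, 1)
    for shift in range(len(num) - len(den), -1, -1):
        if num[shift + len(den) - 1]:
            q[shift] = 1
            for i, d in enumerate(den):
                num[shift + i] ^= d
    return q, num[:len(den) - 1]

m = [1] * 7                                        # m(x) = 1 + x + ... + x^6
q_direct, r_direct = gf2_divmod(m, [1, 1, 1, 1])   # p(x) = 1+x+x^2+x^3
q1, _ = gf2_divmod(m, [1, 1])                      # divide by (1 + x) ...
q_chain, _ = gf2_divmod(q1, [1, 0, 1])             # ... then by (1 + x^2)
print(q_direct, q_chain)                           # same quotient: x^3
print(r_direct)                                    # remainder only from (a)
```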
4.8 PROPOSED ALGORITHM FOR REDUCING THE ITERATION BOUND OF THE LFSR STRUCTURE
As can be seen from TABLE I, we can keep multiplying g(x) by short-length polynomials such as 1+x^k to insert as many delay elements into the right-most feedback loop of g(x) as we need in each iteration, eliminating the fan-out bottleneck of an LFSR structure. After this fan-out bottleneck is eliminated, our goal should be to reduce the iteration bound of the entire LFSR structure, which is now located in the second step. We have seen the advantage of multiplying g(x) by a short-length polynomial such as 1+x^k: the third step then neither causes the iteration bound bottleneck nor adds latency. We now show its advantage for reducing the iteration bound of the second step, and thus of the whole LFSR structure. We start with the third iteration shown in TABLE I. After the third iteration, g'(x) no longer has a fan-out bottleneck for an unfolding factor of J = 8. Its iteration bound is

T_XOR * max{2/9, 3/11, 4/14, 5/15, 6/17, 8/25, 9/26, 10/28, 11/30, 12/31, 13/34, 14/35, 15/38, 16/40, 17/41, 17/42} = 0.4146 T_XOR

which is located in the 16th feedback loop. If we keep multiplying g'(x) by 1+x^k to reduce its iteration bound, there are 17 possibilities for k, corresponding to the 17 feedback loops in g'(x). After trying all 17 possibilities, we conclude that 1+x^17 reduces the iteration bound of the second step the most, from 0.4146 T_XOR to 0.3621 T_XOR. g''(x) = (1+x^17)g'(x) is given in TABLE II. Based on the discussion of the third step in Section 4.7.7, 1/(1+x^17) is far from causing the iteration bound bottleneck of the entire LFSR structure, which has been reduced to 0.3621 T_XOR. We can keep multiplying g''(x) by 1+x^k to lower the iteration bound even further; for example, after the 6th iteration shown in TABLE II, the iteration bound of the LFSR structure has been reduced to 0.3369 T_XOR. Note that these optimised iteration bounds are not achieved without extra cost: the number of required XOR gates has increased from 2 to 76. Although multiplying g'(x) changes the LFSR structure for a lower iteration bound, the elimination of the fan-out bottleneck is maintained. This is because the elimination of the fan-out bottleneck is provided by the right-most feedback loop of the second step of the LFSR structure, and multiplying g'(x) by 1+x^k maintains the structure on the right side of the feedback loop corresponding to 1+x^k. This property is illustrated in TABLE II. From TABLE II we have p(x) = (1+x)(1+x^4)(1+x^5) and

p'(x) = p(x)(1+x^17)(1+x^43)(1+x^86) = (1+x^8)(1+x^5)(1+x^17)(1+x^43)(1+x^86) / [(1+x)(1+x^2)].

The improved three-step implementation of BCH(255, 223) encoding is then shown in Fig. 4.9. From the discussion so far, we can summarise the proposed high-speed VLSI algorithm for general LFSR structures as follows:
1) Iteratively multiply g(x) by short-length polynomials to insert as many delay elements into the right-most feedback loop of g(x) as needed in each iteration, eliminating the fan-out bottleneck of the LFSR structure. The iteration exits when the number of delay elements in the right-most feedback loop of g'(x) is not less than the targeted unfolding factor J. The simplest short-length polynomials we can use are 1+x^k, where k is the degree difference of the two highest-degree terms in g'(x). Another way to find short-length polynomials is to partially borrow Algorithm A in [4, Section III]: instead of obtaining one long polynomial p(x) by using Algorithm A once, we can apply it multiple times and limit the length of each obtained p(x) until the number of delay elements in the right-most feedback loop of g'(x) is not less than the targeted unfolding factor J. Although the first method is guaranteed to find a p(x), the second method may be more hardware efficient, because it can lead to a g'(x) of lower degree. Note that eliminating the fan-out bottleneck is not needed when g(x) is short.
2) Iteratively multiply g'(x) by the 1+x^k which leads to the smallest iteration bound for the current g''(x). For each iteration, the number of possible values of k is the same as the number of feedback loops of the current g''(x). The iteration exits when the desired iteration bound, or the best iteration bound for a certain hardware cost requirement, is reached.
Fig. 4.9 Improved three-step implementation of BCH(255,223) encoding: (a) first step, (b) second step, (c) third step

4.9 BCH ENCODER ARCHITECTURE

An (n, k) binary BCH code encodes a k-bit message into an n-bit code word. A k-bit message (m_{k-1}, m_{k-2}, ..., m_0) can be regarded as the coefficients of a degree k−1 polynomial m(x) = m_{k-1}x^{k-1} + m_{k-2}x^{k-2} + ... + m_0, where m_{k-1}, m_{k-2}, ..., m_0 ∈ GF(2). Likewise, the corresponding n-bit code word (c_{n-1}, c_{n-2}, ..., c_0) can be regarded as the coefficients of a degree n−1 polynomial c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + ... + c_0, where c_{n-1}, c_{n-2}, ..., c_0 ∈ GF(2). The encoding of BCH codes can be expressed simply as c(x) = m(x)g(x), where the degree n−k polynomial g(x) = g_{n-k}x^{n-k} + g_{n-k-1}x^{n-k-1} + ... + g_0 (g_{n-k}, g_{n-k-1}, ..., g_0 ∈ GF(2)) is the generator polynomial of the BCH code. Usually g_{n-k} = g_0 = 1. However, systematic encoding is generally desired, since the message bits then appear unchanged in the code word. Systematic encoding can be implemented as

    c(x) = m(x) · x^{n-k} + Rem_{g(x)}(m(x) · x^{n-k}),    (1)

where Rem_{g(x)}(f(x)) denotes the remainder polynomial of dividing f(x) by g(x). The architecture of a systematic BCH encoder is shown in Fig. 4.10. During the first k clock cycles, the two switches are connected to the 'a' port, and the k-bit message is input to the LFSR serially, most significant bit (MSB) first. Meanwhile, the message bits are also sent to the output to form the systematic part of the code word. After k clock cycles, the switches are moved to the 'b' port. At this point, the n−k registers contain the coefficients of Rem_{g(x)}(m(x) · x^{n-k}). The remainder bits are then shifted out of the registers to the code word output, bit by bit, to form the remaining systematic code word bits. For binary BCH codes, the multipliers in Fig. 4.10 can be replaced by a connection or no connection according to whether g_i (0 ≤ i < n−k) is '1' or '0', respectively. The critical path of this architecture consists of two XOR gates, and the output of the right-most XOR gate is input to all the other XOR gates. For long BCH codes, this architecture may therefore suffer from the long delay of the right-most XOR gate caused by its large fanout. Although the serial BCH encoder architecture is quite straightforward, when it cannot run as fast as the application requires, parallel architectures must be employed; the fanout bottleneck exists in parallel architectures as well.
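The serial divider just described fits in a few lines of RTL. The following is a minimal sketch of ours (module and signal names hypothetical), with the two switches of Fig. 4.10 reduced to a single enable: while msg_en is high the message is fed in and simultaneously forwarded to the output; afterwards the remainder is shifted out. The single feedback signal fans out to every tap, which is exactly the fanout problem analyzed below.

    // Serial systematic encoder LFSR in the style of Fig. 4.10 (illustrative).
    // GEN[i] holds g_i; for binary BCH the constant multipliers degenerate to
    // connect/no-connect, modeled here as an AND with a constant bit.
    module serial_lfsr_encoder #(
        parameter NK = 32,                   // n - k = deg g(x)
        parameter [NK-1:0] GEN = {NK{1'b0}}  // g_i in GEN[i]; g_{n-k} = 1 implicit
    ) (
        input  wire clk, rst,
        input  wire msg_en,                  // high during the first k cycles
        input  wire m_in,                    // message bit, MSB first
        output wire out_bit                  // message, then remainder bits
    );
        reg  [NK-1:0] r;
        // right-most XOR; its output fans out to every tap below
        wire fb = msg_en ? (m_in ^ r[NK-1]) : 1'b0;
        assign out_bit = msg_en ? m_in : r[NK-1];
        integer i;
        always @(posedge clk)
            if (rst) r <= {NK{1'b0}};
            else begin
                r[0] <= fb;                           // g_0 = 1
                for (i = 1; i < NK; i = i + 1)
                    r[i] <= r[i-1] ^ (GEN[i] & fb);   // tap wherever g_i = 1
            end
    endmodule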
4.10 PARALLEL BCH ENCODER WITH ELIMINATED FANOUT BOTTLENECK

In the serial BCH encoder of Fig. 4.10, the effect of large fanout can always be eliminated by retiming [7]. To keep the notation simple, we refer to the input of the right-most XOR gate that comes from the delayed output of the second XOR gate from the right as the horizontal input (Hinput). In Fig. 4.10 there is at least one register at the Hinput of the right-most XOR gate, and registers can be added to the message input. Therefore, retiming can always be performed along the dotted cutset by removing one register from each input of the right-most XOR gate and adding one to its output.

FIG 4.10

For clarity, the switches are omitted from the LFSR in Fig. 4.10 and from the other figures in the remainder of this work. However, if unfolding is applied directly to Fig. 4.10, retiming cannot be applied in an obvious way to eliminate the large fanout. The original architecture can be expressed as a data flow graph (DFG): nodes connected by paths with delays, where each XOR gate in the LFSR is a node of the DFG. In the J-unfolded architecture there are J copies of each node, each with the same function as in the original architecture (see Chapter 5 of [6]); the total number of delay elements, however, does not change.

4.11 Retimed LFSR

Assume there is a path from node U to node V in the original architecture with W delay elements. In the J-unfolded architecture, node U_i is connected to node V_{(i+W) mod J} with ⌊(i+W)/J⌋ delay elements, where U_i, V_j (0 ≤ i, j < J) are the copies of nodes U and V, respectively. Therefore, if the unfolding factor J is greater than W, the unfolded architecture contains W paths with one delay element and J−W paths without any delay element. For example, Fig. 4.11(a) shows an LFSR with generator polynomial g(x) = x^3 + x + 1. In this example there are two registers on the path connecting the output of the left XOR gate to the input of the right XOR gate. In the 3-unfolded architecture illustrated in Fig. 4.11(b), there are W = 2 paths with one delay from the outputs of the copies of the left XOR gate to the inputs of the copies of the right XOR gate, and 3 − 2 = 1 path without any delay. The unfolded LFSR in Fig. 4.11(b) cannot be retimed to eliminate the fanout problem for every copy of the right XOR gate.

Fig. 4.11 (a) An LFSR example; (b) 3-unfolded version of the LFSR in (a)
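The example of Fig. 4.11 can be made concrete in RTL. Below is a sketch of ours: the serial LFSR for g(x) = x^3 + x + 1, and a 3-unfolded version that applies the same next-state function three times combinationally per clock. In the unfolded module, each copy of the right-most XOR drives a purely combinational path into the next copy, which is why retiming alone cannot remove its fanout, as just discussed.

    // Serial LFSR for g(x) = x^3 + x + 1 (illustrative).
    module lfsr_g3_serial (
        input  wire clk, rst, d,
        output reg [2:0] r
    );
        wire fb = d ^ r[2];                        // right-most XOR
        always @(posedge clk)
            if (rst) r <= 3'b000;
            else     r <= {r[1], r[0] ^ fb, fb};   // taps: g2 = 0, g1 = 1, g0 = 1
    endmodule

    // 3-unfolded version: processes d[2] (earliest bit) down to d[0] each cycle.
    module lfsr_g3_par3 (
        input  wire clk, rst,
        input  wire [2:0] d,                       // three input bits per clock
        output reg [2:0] r
    );
        function [2:0] step(input [2:0] s, input din);
            reg fb;
            begin
                fb   = din ^ s[2];
                step = {s[1], s[0] ^ fb, fb};
            end
        endfunction
        wire [2:0] s1 = step(r,  d[2]);            // copy 0 of the XOR pair
        wire [2:0] s2 = step(s1, d[1]);            // copy 1, fed combinationally
        wire [2:0] s3 = step(s2, d[0]);            // copy 2, fed combinationally
        always @(posedge clk)
            if (rst) r <= 3'b000;
            else     r <= s3;
    endmodule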
If the generator polynomial can be expressed as

    g(x) = x^{t0} + x^{t1} + ... + x^{t_{l-2}} + 1,    (2)

where t0, t1, ..., t_{l-2} are positive integers with t0 > t1 > ... > t_{l-2} and l is the total number of non-zero terms of g(x), then there are t0 − t1 consecutive registers at the Hinput of the right-most XOR gate in Fig. 4.10. If a J-unfolded BCH encoder is desired, t0 − t1 ≥ J must be satisfied to ensure that there is at least one register at the Hinput of each of the J copies of the right-most XOR gate, so that retiming can move one register to the output. Meanwhile, J registers need to be added to the message input to enable retiming.

In the case of t0 − t1 < J, the generator polynomial must be modified to enable retiming of the right-most XOR gate in the J-unfolded architecture. Assuming the original (n, k) BCH code uses a generator polynomial g(x) of degree n−k, the message input m(x) multiplied by x^{n-k} can be written as

    m(x)x^{n-k} = q(x)g(x) + r(x),    (3)

where q(x) and r(x) are the quotient and remainder polynomials of dividing m(x)x^{n-k} by g(x), respectively. Multiplying both sides of (3) by p(x), we get

    m(x)p(x)x^{n-k} = q(x)(g(x)p(x)) + r(x)p(x).    (4)

Let g'(x) = p(x)g(x), and denote its two highest-degree terms by x^{t0'} and x^{t1'}.

CHAPTER 5

BCH CODES

Example I. Given a BCH(255,223) code using the generator polynomial g(x) = x^32 + x^31 + x^30 + x^29 + x^27 + x^26 + x^25 + x^22 + x^20 + x^19 + x^17 + x^16 + x^14 + x^9 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1, we want to find p(x) such that t0' − t1' ≥ 8 in g'(x). In this example E should be set to 8, and a = 32, b = 31 at the beginning of Algorithm A. The intermediate values after each iteration of Algorithm A are given below.

After iteration I: p̃(x) = 1 + x^-1; g̃(x) = x^32 + x^28 + x^27 + x^24 + x^22 + x^21 + x^20 + x^18 + x^17 + x^15 + x^14 + x^13 + x^9 + x^8 + x^7 + x + 1 + x^-1; num = 1; b = 28; a − b = 4 < 8; continue.

After iteration II: p̃(x) = 1 + x^-1 + x^-4; g̃(x) = x^32 + x^26 + x^25 + x^24 + x^23 + x^20 + x^17 + x^16 + x^14 + x^12 + x^10 + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x^-2 + x^-4; num = 4; b = 26; a − b = 6 < 8; continue.

After iteration III: p̃(x) = 1 + x^-1 + x^-4 + x^-6; g̃(x) = x^32 + x^21 + x^19 + x^17 + x^13 + x^12 + x^11 + x^9 + x^7 + x^5 + x^2 + x + 1 + x^-1 + x^-3 + x^-6; num = 6; b = 21; a − b = 11 > 8; stop.

Final step: p(x) = p̃(x)x^6 = x^6 + x^5 + x^2 + 1, and g'(x) = g̃(x)x^6 = x^38 + x^27 + x^25 + x^23 + x^19 + x^18 + x^17 + x^15 + x^13 + x^11 + x^8 + x^7 + x^6 + x^5 + x^3 + 1.

According to (4), the modified method of finding Rem_{g(x)}(m(x)x^{n-k}) in the BCH encoding can be implemented by the steps illustrated in Fig. 5.1. Each step is explained below using the g(x), p(x) and g'(x) derived in Example I.

FIG 5.1 Block diagram of the modified BCH encoding

The first step in Fig. 5.1 is to multiply the message input polynomial by p(x). This can be implemented by adding delayed message inputs according to the coefficients of p(x). For example, using the p(x) derived in Example I, this step can be implemented by the diagram in Fig. 5.2, whose four taps correspond to 1, x^2, x^5 and x^6, respectively.

FIG 5.2 Step 1 of the modified BCH encoding

After m(x)p(x) is computed, it is fed into the second block to compute Rem_{g'(x)}(m(x)p(x)x^{n-k}) using an LFSR architecture similar to that of Fig. 4.10. However, since deg(g'(x)) > n−k, the product of p(x) and m(x) should be added to the output of the (n−k)-th register from the left, instead of to the output of the right-most register. The addition of m(x)p(x) can break the run of a − b consecutive registers at the Hinput of the right-most XOR gate in the LFSR. The implementation of the second step using the BCH code of Example I is illustrated in Fig. 5.3. As can be observed in Fig. 5.3, there are 38 − 27 = 11 consecutive registers at the Hinput of the right-most XOR gate according to g'(x). However, after adding m(x)p(x) to the output of the 32nd register, only 6 consecutive registers remain; therefore, at most 6-unfolding can be applied to Fig. 5.3 without suffering from the large fanout problem. At the end of Algorithm A, deg(g'(x)) = num + deg(g(x)) = n − k + num. Hence only num consecutive registers remain after adding m(x)p(x), where num ≤ E − 1 at the end of Algorithm A. Therefore E is usually set larger than the desired unfolding factor J, to ensure num ≥ J at the end of Algorithm A.

FIG 5.3 Step 2 of the modified BCH encoding

FIG 5.4 Step 3 of the modified BCH encoding

Alternatively, at the expense of a slight increase in critical path and latency, the delays at the input of the last XOR gate can be retimed and moved to its output. For example, the 5 delays at the Hinput of the last XOR gate in Fig. 5.3 can be retimed and moved to the output of this XOR gate. This requires first adding 5 delays to the m(x)p(x) input. The penalty is an increase of the critical path of the serial encoder to 2 XOR gates.

In the third step, Rem_{g'(x)}(m(x)p(x)x^{n-k}) needs to be divided by p(x) to obtain the final result. An architecture similar to that of Fig. 4.10 can again be used, except that the input data is added at the input of the left-most register, since the input polynomial does not need to be multiplied by any power of x. The third step of the modified BCH encoding, using the p(x) derived in Example I, is illustrated in Fig. 5.4. Unfolding the modified BCH encoder of Fig. 5.1 by a factor J yields a parallel architecture capable of processing J message bits at a time. In the J-unfolded block that computes p(x)m(x) there is no feedback loop, so it can be pipelined to achieve the desired clock frequency.
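As an illustration, step 1 for the p(x) of Example I is a feedback-free FIR structure over GF(2). A sketch of ours (hypothetical names):

    // Step 1 of Fig. 5.2: multiply the serial message stream by
    // p(x) = 1 + x^2 + x^5 + x^6 using a 6-stage delay line; the taps at
    // delays 0, 2, 5 and 6 are XORed together. No feedback loop exists,
    // so in the J-unfolded architecture this XOR tree can be pipelined freely.
    module mul_by_px (
        input  wire clk, rst,
        input  wire m_in,              // m(x), MSB first
        output wire mp_out             // m(x) * p(x)
    );
        reg [6:1] dly;                 // dly[i] = m_in delayed by i cycles
        always @(posedge clk)
            if (rst) dly <= 6'b000000;
            else     dly <= {dly[5:1], m_in};
        assign mp_out = m_in ^ dly[2] ^ dly[5] ^ dly[6];
    endmodule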
In the second block, since the LFSR of the modified generator polynomial g'(x) has at least J registers at the Hinput of the right-most XOR gate, retiming can be applied to the J-unfolded architecture to eliminate the effect of the large fanout, after adding J registers to the output of m(x)p(x). Although the fanout problem does not exist in the third block of Fig. 5.1, it can appear in its unfolded architecture. Since the polynomial p(x) enables g'(x) = p(x)g(x) to have consecutive zero coefficients after the highest-power term, the difference of the two highest powers of p(x) equals t0 − t1 < J. After J-unfolding is applied, some copies of the right-most XOR gate are connected to l_p − 2 XOR gates, where l_p is the number of non-zero terms of p(x). In the worst case l_p is at most E + 1 − (t0 − t1), and E is set as small as possible, only slightly larger than the unfolding factor J. Usually the desired unfolding factor is far smaller than the length of g(x) for long BCH codes. Hence the delay caused by the fanout of the division by p(x) is far smaller than that of the division by g(x) in the original BCH encoder.

5.1 BCH DECODER ARCHITECTURE

In this section a parallel BCH decoder is presented. Syndrome-based BCH decoding consists of three major steps [3]: syndrome generation, key equation solving, and Chien search. Here R denotes the hard decision of the information received from the noisy channel and D the decoded codeword; S and Λ represent the syndromes of the received polynomial and the error locator polynomial, respectively.

5.2 Syndrome Generator

For t-error-correcting BCH codes, the 2t syndromes of the received polynomial are evaluated as

    S_j = R(α^j) = Σ_{i=0}^{n-1} R_i (α^j)^i,    (1)

for 1 ≤ j ≤ 2t. If 2t conventional syndrome generator units, shown in part (a) of the figure below, are used at the same time independently, n clock cycles are necessary to compute all 2t syndromes. However, if each syndrome generator unit is replaced by a parallel unit with parallel factor p, depicted in part (b), which processes p bits per clock cycle, only ⌈n/p⌉ clock cycles are needed. It is worth noting that for binary BCH codes the even-indexed syndromes are the squares of lower-indexed ones, i.e., S_{2j} = S_j^2. Based on this property, only t parallel syndrome generator units are actually required, computing the odd-indexed syndromes, followed by a much simpler field squaring circuit that generates the even-indexed syndromes.

5.3 Key Equation Solver

Either Peterson's algorithm or the Berlekamp-Massey (BM) algorithm [3] can be employed to solve the key equation for Λ(x). The inversion-free BM algorithm and its efficient implementations are readily found in the literature [2][4] and are not considered further here.

5.4 Chien Search

Once Λ(x) is found, the decoder searches for the error locations by checking whether Λ(α^i) = 0 for 0 ≤ i ≤ n − 1.

Fig: Syndrome generator unit: (a) conventional architecture; (b) parallel architecture with parallel factor p
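For concreteness, one serial syndrome unit can be built around Horner's rule: S_j = R(α^j) is accumulated as s ← s·α^j + R_i, one received bit per clock. The sketch below (ours) shows the unit for S_1 over GF(2^4) with primitive polynomial x^4 + x + 1, chosen only to keep the constant multiplier small; a code of length n = 255 such as the BCH(255,223) code above would use GF(2^8) instead, and a p-parallel unit simply applies p Horner steps per cycle.

    // Serial syndrome unit for S_1 = R(alpha) over GF(2^4),
    // primitive polynomial x^4 + x + 1 (illustrative assumption).
    module syndrome_unit_s1 (
        input  wire clk, rst,
        input  wire r_in,          // received bit R_i, entered i = n-1 down to 0
        output reg [3:0] s         // holds R(alpha) after n cycles
    );
        // Constant multiplication by alpha modulo x^4 + x + 1:
        // (s3 a^3 + s2 a^2 + s1 a + s0) * a = s2 a^3 + s1 a^2 + (s0^s3) a + s3
        wire [3:0] s_times_alpha = {s[2], s[1], s[0] ^ s[3], s[3]};
        always @(posedge clk)
            if (rst) s <= 4'b0000;
            else     s <= s_times_alpha ^ {3'b000, r_in};   // Horner step
    endmodule

For S_j with j > 1, the wire pattern of s_times_alpha is replaced by the constant multiplier for α^j, which remains a fixed XOR network.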
CHAPTER 6

APPENDIX A: VERILOG HDL

The implementation of the high-speed LFSR is done using Verilog HDL. In the semiconductor and electronic design industry, Verilog is a hardware description language (HDL) used to model electronic systems. Verilog HDL, not to be confused with VHDL, is most commonly used in the design, verification and implementation of digital logic chips at the register transfer level (RTL) of abstraction. It is also used in the verification of analog and mixed-signal circuits.

6.1 Overview

Hardware description languages such as Verilog differ from software programming languages in that they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators: a blocking assignment (=) and a non-blocking assignment (<=). The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables. Because these concepts are part of Verilog's language semantics, designers can quickly write descriptions of large circuits in a relatively compact and concise form. At the time of Verilog's introduction (1984), it represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture and specially written software programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with a syntax similar to the C programming language, which was already widely used in engineering software development. Verilog is case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), equivalent control-flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires bit widths on net/reg types), the demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy and communicate with other modules through a set of declared input, output and bidirectional ports. Internally, a module can contain any combination of net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block, but the blocks themselves are executed concurrently, making Verilog a dataflow language.

Verilog's concept of a 'wire' consists of both signal values (4-state: 1, 0, floating, undefined) and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved as a function of the source drivers and their strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a netlist, a logically equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific VLSI technology. Further manipulations of the netlist ultimately lead to a circuit fabrication blueprint (such as a photomask set for an ASIC, or a bitstream file for an FPGA).
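A classic illustration of the non-blocking assignment mentioned above: two registers exchanging values with no temporary variable, because both right-hand sides are sampled before either register updates (example of ours).

    // Both right-hand sides are read at the clock edge before any update,
    // so a and b swap cleanly every cycle; with blocking '=' the second
    // statement would see the already-updated value of a.
    module swap_demo (
        input  wire clk,
        input  wire load, ain, bin,
        output reg  a, b
    );
        always @(posedge clk)
            if (load) begin
                a <= ain;
                b <= bin;
            end else begin
                a <= b;
                b <= a;
            end
    endmodule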
6.2 History

Verilog was invented by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated Integrated Design Systems (renamed Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation was purchased by Cadence Design Systems in 1990, and Cadence now has full proprietary rights to Gateway's Verilog and the Verilog-XL logic simulator.

6.3 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language available for open standardization. Cadence transferred Verilog into the public domain under the Open Verilog International (OVI) organization (now known as Accellera). Verilog was later submitted to the IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95. In the same time frame, Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS, which encompasses Verilog-95.

6.4 Verilog 2001

Extensions to Verilog-95 were submitted back to the IEEE to cover deficiencies that users had found in the original standard. These extensions became IEEE Standard 1364-2001, known as Verilog-2001, a significant upgrade from Verilog-95. First, it adds explicit support for (two's complement) signed nets and variables. Previously, code authors had to perform signed operations using awkward bit-level manipulations; for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value. The same function can be described far more succinctly in Verilog-2001 using the built-in operators +, -, /, *, >>>. A generate/endgenerate construct (similar to VHDL's generate) allows Verilog-2001 to control instance and statement instantiation through normal decision operators (case/if/else); using generate, Verilog-2001 can instantiate an array of instances with control over the connectivity of the individual instances. File I/O was improved by several new system tasks. Finally, a few syntax additions were introduced to improve code readability (e.g., always @*, named parameter override, C-style function/task/module header declarations). Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

6.5 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005) consists of minor corrections, spec clarifications and a few new language features (such as the uwire keyword). A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and mixed-signal modeling with traditional Verilog.
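The two Verilog-2001 features singled out above, signed arithmetic and generate, look like this in practice (sketch of ours):

    module v2001_features #(parameter N = 4) (
        input  wire clk,
        input  wire signed [7:0] x, y,
        output wire signed [8:0] sum,   // sign and carry handled by the language
        input  wire [N-1:0] d,
        output reg  [N-1:0] q
    );
        assign sum = x + y;             // no hand-built Boolean carry logic needed

        // generate: an array of N flip-flops, one copy of the always block
        // per loop index, with per-instance connectivity.
        genvar i;
        generate
            for (i = 0; i < N; i = i + 1) begin : ff
                always @(posedge clk)
                    q[i] <= d[i];
            end
        endgenerate
    endmodule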
6.6 Design Styles

Verilog, like any other hardware description language, permits a design to follow either a bottom-up or a top-down methodology.

Bottom-Up Design. The traditional method of electronic design is bottom-up: each design is performed at the gate level using standard gates. With the increasing complexity of new designs this approach is nearly impossible to maintain, as new systems consist of ASICs or microprocessors with a complexity of thousands of transistors. Such traditional bottom-up designs have had to give way to structural, hierarchical design methods; without these practices it would be impossible to handle the complexity.

Top-Down Design. The desired design style of most designers is top-down. A true top-down design allows early testing, easy migration between technologies and a structured system design, and offers many other advantages. However, it is very difficult to follow a pure top-down design, so most designs are a mix of both methods, implementing key elements of each. The figure shows a top-down design approach.

6.7 Verilog Abstraction Levels

Verilog supports designing at many different levels of abstraction. Three of them are especially important: the behavioral level, the register-transfer level and the gate level.

Behavioral level. This level describes a system in terms of concurrent algorithms. Each algorithm is itself sequential, meaning it consists of a set of instructions executed one after the other; functions, tasks and always blocks are the main elements. There is no regard for the structural realization of the design.

Register-Transfer Level. Designs at the register-transfer level specify the characteristics of a circuit by operations and by the transfer of data between registers. An explicit clock is used, and RTL design contains exact timing bounds: operations are scheduled to occur at specific times. The modern definition of RTL code is simply "any code that is synthesizable".

Gate Level. At the gate level, the characteristics of a system are described by logical links and their timing properties. All signals are discrete and can only take definite logical values ('0', '1', 'X', 'Z'). The usable operations are predefined logic primitives (AND, OR, NOT, etc.). Writing gate-level code by hand is rarely a good idea at any stage of logic design; gate-level netlists are instead generated by tools such as synthesis tools, and are used for gate-level simulation and for the backend flow.
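The three levels can be contrasted on one tiny example, a 2-to-1 multiplexer (sketches of ours):

    // Behavioral: an algorithmic description.
    module mux_beh (input wire a, b, sel, output reg y);
        always @(*)
            if (sel) y = b;
            else     y = a;
    endmodule

    // Register-transfer / dataflow: operations on signals.
    module mux_rtl (input wire a, b, sel, output wire y);
        assign y = sel ? b : a;
    endmodule

    // Gate level: predefined primitives and explicit nets.
    module mux_gate (input wire a, b, sel, output wire y);
        wire nsel, w0, w1;
        not g0 (nsel, sel);
        and g1 (w0, a, nsel);
        and g2 (w1, b, sel);
        or  g3 (y, w0, w1);
    endmodule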
6.8 Introduction

ModelSim is a verification and simulation tool for VHDL, Verilog, SystemVerilog and mixed-language designs.

6.8.1 Basic Simulation Flow

The following diagram shows the basic steps for simulating a design in ModelSim.

Figure 6.1. Basic Simulation Flow

In ModelSim, all designs are compiled into a library. You typically start a new simulation in ModelSim by creating a working library called "work", which is the library name used by the compiler as the default destination for compiled design units. After creating the working library, you compile your design units into it. The ModelSim library format is compatible across all supported platforms, so you can simulate your design on any platform without having to recompile it.

Loading the Design and Running the Simulation. With the design compiled, you load the simulator with your design by invoking the simulator on a top-level module (Verilog) or on a configuration or entity/architecture pair (VHDL). Assuming the design loads successfully, the simulation time is set to zero, and you enter a run command to begin simulation. If you do not get the results you expect, you can use ModelSim's robust debugging environment to track down the cause of the problem.

6.8.2 Project Flow

A project is a collection mechanism for an HDL design under specification or test. Even though you do not have to use projects in ModelSim, they can ease interaction with the tool and are useful for organizing files and specifying simulation settings. The following diagram shows the basic steps for simulating a design within a ModelSim project.

FIG 6.2 Project Flow

As you can see, the flow is similar to the basic simulation flow. However, there are two important differences: you do not have to create a working library in the project flow, as it is done for you automatically; and projects are persistent, remaining open from one session to the next unless you specifically close them.

6.8.3 Multiple Library Flow

ModelSim uses libraries in two ways: (1) as a local working library that contains the compiled version of your design; (2) as a resource library. The contents of your working library change as you update your design and recompile, whereas a resource library is typically static and serves as a parts source for your design. You can create your own resource libraries, or they may be supplied by another design team or a third party (e.g., a silicon vendor). You specify which resource libraries are used when the design is compiled, and there are rules that specify the order in which they are searched. A common example of using both a working library and a resource library is one where your gate-level design and testbench are compiled into the working library, while the design references gate-level models in a separate resource library. The diagram below shows the basic steps for simulating with multiple libraries.

Figure 6.3. Multiple Library Flow

6.9 Debugging Tools

ModelSim offers numerous tools for debugging and analyzing your design. Several of these are covered in subsequent lessons, including: using projects; working with multiple libraries; setting breakpoints and stepping through the source code; viewing waveforms and measuring time; viewing and initializing memories; creating stimulus with the Waveform Editor; and automating simulation.

6.10 Basic Simulation

Design Files for this Lesson. The sample design for this lesson is a simple 8-bit binary up-counter with an associated testbench. The pathnames are as follows:

Verilog – <install_dir>/examples/tutorials/verilog/basicSimulation/counter.v and tcounter.v
VHDL – <install_dir>/examples/tutorials/vhdl/basicSimulation/counter.vhd and tcounter.vhd

This lesson uses the Verilog files counter.v and tcounter.v. If you have a VHDL license, use counter.vhd and tcounter.vhd instead. Or, if you have a mixed license, feel free to use the Verilog testbench with the VHDL counter, or vice versa.

6.10.1 Create the Working Design Library

Before you can simulate a design, you must first create a library and compile the source code into that library.

1. Create a new directory and copy the design files for this lesson into it. Start by creating a new directory for this exercise (in case other users will be working with these lessons). Verilog: copy counter.v and tcounter.v from <install_dir>/examples/tutorials/verilog/basicSimulation to the new directory. VHDL: copy counter.vhd and tcounter.vhd from <install_dir>/examples/tutorials/vhdl/basicSimulation to the new directory.

2. Start ModelSim if necessary. a. Type vsim at a UNIX shell prompt, or use the ModelSim icon in Windows. Upon opening ModelSim for the first time, you will see the Welcome to ModelSim dialog; click Close. b. Select File > Change Directory and change to the directory you created in step 1.

3. Create the working library. a. Select File > New > Library. This opens a dialog where you specify physical and logical names for the library (Figure 6.4). You can create a new library or map to an existing library; we will be doing the former. b. Type work in the Library Name field (if it is not already entered automatically). c. Click OK.

Figure 6.4. The Create a New Library Dialog

ModelSim creates a directory called work and writes a specially formatted file named _info into that directory. The _info file must remain in the directory to distinguish it as a ModelSim library; do not edit the folder contents from your operating system, as all changes should be made from within ModelSim. ModelSim also adds the library to the list in the Workspace (Figure 6.5) and records the library mapping for future reference in the ModelSim initialization file (modelsim.ini).

Figure 6.5. work Library in the Workspace
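For reference, a minimal sketch of what an 8-bit binary up-counter like counter.v might contain (hypothetical; the actual file shipped with ModelSim is more elaborate):

    module counter (
        input  wire       clk,
        input  wire       reset,
        output reg  [7:0] count
    );
        always @(posedge clk or posedge reset)
            if (reset) count <= 8'h00;        // asynchronous reset
            else       count <= count + 8'h01;
    endmodule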
When you pressed OK in step 3c above, the following was printed to the Transcript:

    vlib work
    vmap work work

These two lines are the command-line equivalents of the menu selections you made. Many command-line equivalents echo their menu-driven functions in this fashion.

6.11 Compile the Design

With the working library created, you are ready to compile your source files. You can compile by using the menus and dialogs of the graphic interface, as in the Verilog example below, or by entering a command at the ModelSim> prompt.

1. Compile counter.v and tcounter.v. a. Select Compile > Compile. This opens the Compile Source Files dialog (Figure 6.5). If the Compile menu option is not available, you probably have a project open; if so, close the project by making the Workspace pane active and selecting File > Close from the menus. b. Select both counter.v and tcounter.v from the Compile Source Files dialog and click Compile. The files are compiled into the work library. c. When the compile is finished, click Done.

Figure 6.5. Compile Source Files Dialog

2. View the compiled design units. a. On the Library tab, click the '+' icon next to the work library and you will see two design units (Figure 6.7). You can also see their types (modules, entities, etc.) and the paths to the underlying source files (scroll to the right if necessary). b. Double-click test_counter to load the design. You can also load the design by selecting Simulate > Start Simulation from the menu bar. This opens the Start Simulation dialog. With the Design tab selected, click the '+' sign next to the work library to see the counter and test_counter modules, then select the test_counter module and click OK (Figure 6.6).

Figure 6.6. Loading the Design with the Start Simulation Dialog

When the design is loaded, you will see a new tab in the Workspace named sim that displays the hierarchical structure of the design (Figure 6.8). You can navigate within the hierarchy by clicking on any line with a '+' (expand) or '-' (contract) icon. You will also see a tab named Files that displays all files included in the design.

Figure 6.7. Modules Compiled into the work Library

6.12 Load the Design

1. Load the test_counter module into the simulator. a. In the Workspace, click the '+' sign next to the work library to show the files contained there.

Figure 6.8. Workspace sim Tab Displays the Design Hierarchy

2. View design objects in the Objects pane. a. Open the View menu and select Objects. The command-line equivalent is: view objects. The Objects pane (Figure 6.9) shows the names and current values of data objects in the current region (selected in the Workspace). Data objects include signals, nets, registers, constants, variables not declared in a process, generics and parameters.

Figure 6.9. Objects Pane Displays Design Objects

You may open other windows and panes with the View menu or with the view command; see Navigating the Interface.
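Before running anything, it helps to know what the testbench drives. A minimal sketch in the spirit of tcounter.v (hypothetical; the file shipped with ModelSim differs) clocks the counter sketched earlier and eventually executes $stop, which is what halts the Run -All command used below:

    `timescale 1ns/1ns
    module test_counter;
        reg clk = 1'b0;
        reg reset = 1'b1;
        wire [7:0] count;

        counter dut (.clk(clk), .reset(reset), .count(count));

        always #10 clk = ~clk;      // 20 ns clock period

        initial begin
            #25  reset = 1'b0;      // release reset
            #600 $stop;             // halts a Run -All in ModelSim
        end
    endmodule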
6.13 Run the Simulation

Now you will open the Wave window, add signals to it, and run the simulation.

1. Open the Wave debugging window. a. Enter view wave at the command line. You can also use the View > Wave menu selection to open a Wave window. The Wave window is one of several windows available for debugging; to see a list of the other debugging windows, select the View menu. You may need to move or resize the windows to your liking. Window panes within the Main window can be zoomed to occupy the entire Main window or undocked to stand alone. For details, see Navigating the Interface.

2. Add signals to the Wave window. a. In the Workspace pane, select the sim tab. b. Right-click test_counter to open a popup context menu. c. Select Add > To Wave > All items in region (Figure 6.10). All signals in the design are added to the Wave window.

Figure 6.10. Using the Popup Menu to Add Signals to the Wave Window

3. Run the simulation. a. Click the Run icon in the Main or Wave window toolbar. The simulation runs for 100 ns (the default simulation length) and waves are drawn in the Wave window. b. Enter run 500 at the VSIM> prompt in the Main window. The simulation advances another 500 ns, for a total of 600 ns (Figure 6.11). c. Click the Run -All icon on the Main or Wave window toolbar. The simulation continues running until you execute a break command or until it hits a statement in your code (e.g., a Verilog $stop statement) that halts the simulation. d. Click the Break icon. The simulation stops running.

Figure 6.11. Waves Drawn in the Wave Window

6.14 Xilinx Design Flow

The first step in implementing a design on an FPGA is the system specification. The specification defines the kinds of inputs and outputs and the ranges of values the board can accept, and the rest of the flow is driven by it. The next step is the architecture, which describes the interconnections between all the blocks involved in the design. Every block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog, whichever is more convenient. All these blocks are then simulated and their outputs verified for correct functioning.

Figure 6.12. Xilinx Implementation Design Flow Chart

After simulation, the next step is synthesis. This is a crucial step in determining whether the design can be implemented on an FPGA device. Synthesis converts the HDL code into functional components that are vendor specific. After synthesis, the RTL schematic and the technology schematic are generated, together with timing estimates; these reflect the delays that will be present in the FPGA once the design is implemented on it.

Place & route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. The interconnections made on the device can also be inspected in this part of the implementation flow. In the post-place-and-route simulation step, the delays that will be present on the FPGA device (electrical loading effects, wiring delays, stray capacitances) are taken into account by the tool, and simulation is performed with these delays included.

After place and route comes generation of the bitmap file, which converts the design into a bitstream used to configure the FPGA. Finally, the bitmap file is downloaded to the FPGA board by connecting the computer to the board with a JTAG (Joint Test Action Group, an IEEE standard) cable. The bitmap file contains the whole design as placed on the FPGA die, and the outputs can then be observed, for example on the FPGA board's LEDs. This step completes the whole process of implementing a design on an FPGA.
6.15 Xilinx ISE 10.1 Software

6.15.1 Introduction

Xilinx ISE (Integrated Software Environment) 10.1 is Xilinx's software suite for designing digital circuits and implementing them on FPGA devices such as the Spartan-3E. It is used here to design the application, verify its functionality and finally download the design onto a Spartan-3E FPGA device.

6.15.2 Xilinx ISE 10.1 software tools

SIMULATION: ISE (Integrated Software Environment) Simulator. SYNTHESIS, PLACE & ROUTE: XST (Xilinx Synthesis Technology) Synthesizer.

6.15.3 Design steps using Xilinx ISE 10.1

1. Create an ISE project for the particular embedded system application.
2. Write the assembly code in Notepad or WordPad and generate the Verilog or VHDL module using the assembler.
3. Check the syntax of the design.
4. Create a Verilog test fixture for the design.
5. Simulate the test bench waveform (behavioral simulation) for functional verification of the design using the ISE simulator.
6. Synthesize and implement the top-level module using the XST synthesizer.

CHAPTER 7

SIMULATION RESULTS

Serial implementation of 1 + y + y^3 + y^5

HDL Synthesis Report

Macro Statistics:
  # Registers: 5 (1-bit register: 5)
  # Xors: 3 (1-bit xor2: 3)

Design Statistics:
  # IOs: 8

Cell Usage:
  # BELS: 3 (# LUT2: 1, # LUT3: 2)
  # FlipFlops/Latches: 5 (# FDC: 5)
  # Clock Buffers: 1 (# BUFGP: 1)
  # IO Buffers: 7 (# IBUF: 2, # OBUF: 5)

Device utilization summary (selected device: XC3S500E):
  Number of Slices: 3 out of 960 (0%)
  Number of Slice Flip Flops: 5 out of 1920 (0%)
  Number of 4-input LUTs: 3 out of 1920 (0%)
  Number of IOs: 8
  Number of bonded IOBs: 8 out of 108 (7%)
  Number of GCLKs: 1 out of 24 (4%)

Timing Summary:
  Minimum period: 2.269 ns (maximum frequency: 440.723 MHz)
  Minimum input arrival time before clock: 2.936 ns
  Maximum output required time after clock: 4.450 ns

Parallel implementation of 1 + y + y^3 + y^5

HDL Synthesis Report

Macro Statistics:
  # Registers: 5 (1-bit register: 5)
  # Xors: 9 (1-bit xor2: 9)

Design Statistics:
  # IOs: 11

Cell Usage:
  # BELS: 6 (# LUT3: 3, # LUT4: 3)
  # FlipFlops/Latches: 5 (# FDC: 5)
  # Clock Buffers: 1 (# BUFGP: 1)
  # IO Buffers: 9 (# IBUF: 4, # OBUF: 5)

Device utilization summary (selected device: XC3S500E):
  Number of Slices: 3 out of 960 (0%)
  Number of Slice Flip Flops: 5 out of 1920 (0%)
  Number of 4-input LUTs: 6 out of 1920 (0%)
  Number of IOs: 11
  Number of bonded IOBs: 10 out of 108 (9%)
  Number of GCLKs: 1 out of 24 (4%)

Timing Summary:
  Minimum period: 2.269 ns (maximum frequency: 440.723 MHz)
  Minimum input arrival time before clock: 4.235 ns
  Maximum output required time after clock: 4.450 ns

ADVANTAGES:
  Reduced power dissipation
  Higher throughput rate
  Higher processing speed
  Fast computation
  An LFSR can rapidly transmit a sequence that indicates high-precision relative time offsets

APPLICATIONS:
  Pattern generators
  Built-in self-test (BIST)
  Encryption
  Generation of pseudo-random numbers, pseudo-noise sequences, fast digital counters and whitening sequences
  Pseudo-random bit sequences
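The RTL behind these reports is not listed in the thesis; the sketch below (ours) is consistent with the reported macro statistics. The serial divider for g(y) = 1 + y + y^3 + y^5 uses 5 registers and 3 two-input XORs, as in the serial report; a parallel version that applies the same update three times per clock uses 9 XORs, matching the parallel report (a parallel factor of 3 is our inference from that count).

    // Serial LFSR divider for g(y) = 1 + y + y^3 + y^5: 5 FDC registers,
    // 3 two-input XORs (the feedback XOR plus taps at y^1 and y^3).
    module lfsr5_serial (
        input  wire clk, clr, d,
        output reg [4:0] q
    );
        wire fb = d ^ q[4];                                        // XOR 1
        always @(posedge clk or posedge clr)
            if (clr) q <= 5'b00000;
            else     q <= {q[3], q[2] ^ fb, q[1], q[0] ^ fb, fb};  // XORs 2, 3
    endmodule

    // Parallel version, 3 bits per clock: three unrolled copies of the
    // update, hence 3 x 3 = 9 two-input XORs and the same 5 registers.
    module lfsr5_par3 (
        input  wire clk, clr,
        input  wire [2:0] d,            // d[2] is the earliest bit
        output reg [4:0] q
    );
        function [4:0] step(input [4:0] s, input din);
            reg fb;
            begin
                fb   = din ^ s[4];
                step = {s[3], s[2] ^ fb, s[1], s[0] ^ fb, fb};
            end
        endfunction
        wire [4:0] s1 = step(q,  d[2]);
        wire [4:0] s2 = step(s1, d[1]);
        wire [4:0] s3 = step(s2, d[0]);
        always @(posedge clk or posedge clr)
            if (clr) q <= 5'b00000;
            else     q <= s3;
    endmodule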
CONCLUSION

Efficient high-speed parallel LFSR structures must address two important issues: the large fanout bottleneck and the iteration bound bottleneck. Three-step LFSR architectures provide the flexibility to handle both. The key point is the construction of p(x) and p'(x), which address the fanout bottleneck and the iteration bound bottleneck, respectively, as shown in this work. These two issues can be solved more effectively by choosing p(x) and p'(x) that can be decomposed into short polynomials, because short polynomials are easily handled by the proposed look-ahead pipelining algorithms. Higher processing speed and hardware efficiency can be achieved with this approach.

REFERENCES

[1] T.-B. Pei and C. Zukowski, "High-speed parallel CRC circuits in VLSI," IEEE Transactions on Communications, vol. 40, no. 4, pp. 653-657, Apr. 1992.
[2] G. Campobello, G. Patané and M. Russo, "Parallel CRC realization," IEEE Transactions on Computers, vol. 52, no. 10, Oct. 2003.
[3] C. Cheng and K. K. Parhi, "High-speed parallel CRC implementation based on unfolding, pipelining, and retiming," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 10, pp. 1017-1021, Oct. 2006.
[4] K. K. Parhi, "Eliminating the fanout bottleneck in parallel long BCH encoders," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 3, pp. 512-516, Mar. 2004.
[5] X. Zhang and K. K. Parhi, "High-speed architectures for parallel long BCH encoders," in Proc. ACM Great Lakes Symposium on VLSI, Boston, MA, Apr. 2004, pp. 1-6.
[6] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.
[7] T. V. Ramabadran and S. S. Gaitonde, "A tutorial on CRC computations," IEEE Micro, vol. 8, no. 4, pp. 62-75, Aug. 1988.