
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
COMPUTER ARCHITECTURES FOR DIGITAL SIGNAL
PROCESSING
A thesis submitted in partial satisfaction of the requirements for the
degree of Master of Science in
Engineering
by
Robert N. Leibowitz
May 1988
The thesis of Robert N. Leibowitz is approved:
Prof. Ray H. Pettit
Date
Prof. David Salomon
Date
Prof. John W. Adams (Chair)
Date
California State University, Northridge
ACKNOWLEDGEMENTS
The author wishes to express his deep gratitude and thanks to
Professor John Adams for his many contributions which culminated
in the writing of this thesis. The logistical support Professor Adams
provided was also invaluable. The author also wishes to thank Cheryl
Wilson, whose assistance in the preparation of art work is greatly
appreciated.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1.0 INTRODUCTION
2.0 ALGORITHMS
2.1 Fundamental Algorithms
2.2 Advanced Digital Signal Processing Algorithms
2.2.1 Beamforming and Direction Finding
2.3 Functional Element Requirements
3.0 SINGLE CHIP DIGITAL SIGNAL PROCESSORS
3.1 Architectural Features of the DSP Processors
3.2 Cache Memory
3.3 Program Sequencer - Zero-Overhead Looping
3.4 Address Generation
3.5 Summary
4.0 VLSI ARRAY ARCHITECTURES
4.1 Introduction
4.2 Locality of Computation
4.3 Balancing Computation with I/O
4.4 Modular Architecture
4.5 Highly Parallel Processing Architectures
4.5.1 Single Instruction-Multiple Data System
4.5.2 Multiple Instruction-Multiple Data System
4.5.3 Systolic and Wavefront Arrays
4.5.4 Clock Skew in Synchronous Arrays
4.5.5 Dataflow Architecture
4.6 Architecture Comparison
4.7 Systolic Arrays
4.7.1 Semi-Systolic Convolution Array with Global Data Comm.
4.7.2 Pure-Systolic Convolution Array without Global Data Comm.
4.7.3 Summary of Systolic Arrays for Convolution
4.8 Programming Arrays
4.8.1 Wavefront Array Programming Example
4.9 The ESL 350 MFLOPS Systolic Adaptive Beamformer
4.9.1 System Components
4.9.2 Systolic Cell (PE) Chip
5.0 TECHNOLOGY
5.1 Technology Comparisons
5.2 Bipolar Processes
5.3 BiCMOS Technology
5.4 Gallium Arsenide
5.5 Advanced Integrated Circuits of the Future
5.6 Wafer Scale Integration
5.6.1 A Wafer Scale FFT Processor
5.7 Sub-Micron Lithography
5.8 Summary
6.0 COMPUTER AIDED DESIGN
6.1 LAGER
6.2 FIRST
6.3 BSSC Compiler
6.3.1 Bit-Serial Cell Architecture
6.3.2 Bit-Serial Language Cell Descriptions
6.3.3 The High Level Language Compiler
6.3.4 Layout Generator
6.3.5 Simulators
6.3.6 Compiled Silicon Examples
6.4 Summary
7.0 SUMMARY
REFERENCES
A.0 APPENDIX A
A.1 Multiplexing Outputs of Digital Logic
A.2 ECL Logic Families
A.3 TTL Logic Families
B.0 APPENDIX B
B.1 System Level Metastability Considerations
B.2 Synchronizers
C.0 APPENDIX C
C.1 Ground Lift Effects on High Speed Digital Logic
C.1.1 Empirical Results
C.2 Point of Interest
LIST OF TABLES

Table 4.1. Systolic Beamformer performance
Table 4.2. Output functions for systolic chip
Table 6.1. Design cycle to reach prototype silicon
LIST OF FIGURES

Figure 1.1. Computational complexity of selected signal processing algorithms
Figure 3.1. ADSP-2100 block diagram
Figure 3.2. Program sequencer block diagram
Figure 4.1. Basic principle of a systolic system
Figure 4.2. 16-point FFT flow graph
Figure 4.3. A SIMD array
Figure 4.4. Classification of highly parallel architectures
Figure 4.5. Design B1
Figure 4.6. Design B2
Figure 4.7. Design F
Figure 4.8. Design R1
Figure 4.9. Design R2
Figure 4.10. Design W1
Figure 4.11. Design W2
Figure 4.12. Array structure and data flow for a 2 x 2 matrix multiplication example
Figure 4.13. Block diagram of ESL's systolic system
Figure 5.1. Nonlinear behavior of MOS gm
Figure 5.2. Radix-4 FFT processor architecture
Figure 6.1. Compiler system architecture
Figure 6.2. Bit-serial cell template
Figure 6.3. Symbolic cell description of a FIR filter
Figure A.1. Change in output voltage vs. the number of outputs wire-ORed
Figure B.1. Metastable timing diagram for a D flip-flop
Figure B.2. Dual flip-flop synchronizer
Figure B.3. Metastable settling time for CMOS D flip-flop
Figure C.1. Model of the current path through a switching TTL output
Figure C.2. Schematic of a TTL NAND gate including inductor L which represents the inductance associated with the package and ground plane
Figure C.3. Waveforms for a SN74ALS374 quiet output
Figure C.4. Waveforms for a SN74F374 quiet output
Figure C.5. Waveforms for a SN74ACT374 quiet output
Figure C.6. Waveforms for a SN74ALS374 quiet output
Figure C.7. Waveforms for a SN74ACT374 quiet output
Figure C.8. Waveforms for a SN74ALS374 quiet output
Figure C.9. Waveforms for a SN74ACT374 quiet output
ABSTRACT

COMPUTER ARCHITECTURES FOR DIGITAL SIGNAL PROCESSING

by

Robert N. Leibowitz

Master of Science in Engineering

The field of Digital Signal Processing (DSP) has expanded rapidly in recent years as Very Large Scale Integrated (VLSI) circuits now permit a wide range of DSP algorithms to operate in realtime or near realtime. Digital signal processing is explored from a historical perspective and a computational perspective in Chapter One. This chapter describes the evolution of DSP from the pre World War II era up to today's advanced signal processing. Chapter Two characterizes the tasks to be performed in DSP. Particular attention is paid to the computational requirements, fundamental computational primitives, and the inherent levels of parallelism within these algorithms. Chapter Three examines the breakthrough which occurred when general purpose Von Neumann based microprocessors were transformed into general purpose digital signal processors. The fundamental principles of systolic and wavefront array processor architectures are explored in Chapter Four. Examples are given of actual array implementations and the application software required to tie system components together. Of all the factors that influence computer architectures for DSP, technology is without question the most important. Chapter Five describes today's processing technologies and the directions toward which this field is heading. Design complexity has become a dominant cost limit in the development of VLSI DSP systems. Without new design tools, the advances in algorithms, architectures, and I.C. fabrication technologies discussed in the previous chapters could not be fully utilized. Chapter Six explores the computer aided design tools required to reduce design cycles and costs and to increase the probability of first-pass success in these new DSP systems. Finally, appendices A, B, and C highlight common problems which system designers must face when applying today's new architectures and technologies.
1.0 INTRODUCTION

1.1 Historical Perspective
Many of the early advances in the field of digital signal processing were involved with the development of techniques for detecting or enhancing a signal buried in a noise background. Over the last fifty years, there has been an astonishing change in the sophistication level of these and other signal processing techniques. The classical signal processing era, which started before World War II, was characterized by static realizations of lowpass, bandpass, and highpass filters which used only a gross knowledge of the signal and noise spectra. These techniques were only effective when the signal and noise waveforms had significantly different spectral shapes. Moreover, most of the hardware implementations realized during this period used analog technology.
Following the Second World War, during the modern signal processing era (1940-1970), applications such as vocoders that had been implemented with analog technology became so complex that it was difficult to explore the effect of design variations on system performance. Digital signal processing was introduced at this time first as a simulation technique. The use of digital signal processing techniques in realtime applications had not evolved at this point since the integrated circuit technology required to support this usage was not available. During this post World War II period, the first significant advance in detection theory came as the statistics of noise
were studied. These statistically based detection techniques extended
the processing capabilities to applications where signals and noise
occupied the same spectral regions. This is almost always the case in
sonar and radar systems. In fact, it was the requirement for radar
and sonar systems at the beginning of World War II that drove the
development of techniques such as matched filtering. The matched
filter developed by North [1] is a time domain filter whose impulse
response is determined by the signal.
Further advancements came in this period as theories were developed for signals with unknown time of arrival, amplitude, or phase. Matched filters at this time were implemented as a transversal filter or correlator. The tap weights of the transversal filter are matched to the samples of the expected signal when used as a detector. These filter structures were all-zero filters and were particularly useful since they could be implemented by a variety of analog techniques such as surface acoustic wave (SAW) devices, charge coupled devices (CCDs), and electro-optical (EO) systems [15].
Finally, during the past twenty years, sophisticated manipulation of data spectra using detailed knowledge of signal and noise statistics has been introduced. Adaptive and Kalman filters are examples of these techniques. For systems such as the linear predictive class of speech algorithms, additional structure was imposed by assuming models for how data is generated, since the statistics can vary with time. As the technology improved, implementations were realized digitally for complex systems, taking advantage of the precision, repeatability, high signal to noise ratio, and flexibility available with digital systems. Time varying digital filters and matrix difference equations were introduced during this period. Operations were extended beyond the previous emphasis on convolution and Fourier transforms to matrix-matrix multiplication, linear system solution, least-squares solution, and eigenvalue decomposition.
1.2 Computational Complexity Perspective
A second perspective of signal processing developments can be obtained by looking at the historical growth in computational complexity of arithmetic operations. The figure of merit used to describe this computational complexity is indicated in Figure 1.1. The lowest level mathematical operations such as addition, subtraction, multiplication, and division have a complexity of O(1). This means "of order 1" since each involves one fundamental operation. The next higher level of complexity is composed of those operations of O(N). This class of operations is called scalar operations. These include the scalar (or inner) product of two vectors, which requires N multiplications and additions for vectors of length N. Note that this is the level of complexity required for the frequency filtering techniques used during the classical signal processing period. This category includes all infinite impulse response filters with rational transfer functions, which covers such commonly used operations as differentiation and integration as well as all-pole filters.
MATHEMATICAL OPERATIONS

Order   Name          Examples
1       Fundamental   +, -, x, /
N       Scalar        Inner product; IIR filters; integration; differentiation;
                      Chebyshev and Butterworth filters
N^2     Vector        Convolution; correlation; FIR filters; linear transforms;
                      DFT/FFT; matrix-vector product
N^3     Matrix        Matrix-matrix product; LU decomposition; matrix inversion;
                      eigensystems

Figure 1.1. Computational complexity of selected signal processing algorithms.
The next level includes convolution based operations such as correlation, finite impulse response filtering, matrix-vector products, linear transforms, and the discrete Fourier transform (DFT). These operations are of order O(N^2) in terms of the number of fundamental operations to be performed. This factor can be reduced to O(N log N) through the use of fast transforms such as the fast Fourier transform. This level of complexity is required for performing the transversal filters and convolvers of modern signal processing.
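As a concrete illustration of an O(N^2) vector operation, the following C fragment computes a finite impulse response filter output by direct convolution; the nested loops make the order N*M operation count explicit (order N^2 when the filter length grows with the block length). This is only an illustrative sketch added here; the function and variable names are hypothetical and not taken from the thesis.

    /* Direct-form FIR filtering by convolution: y[n] = sum_k h[k] * x[n-k].
       Illustrative sketch only; names and dimensions are hypothetical. */
    void fir_filter(const double *x, int n_samples,
                    const double *h, int n_taps,
                    double *y)
    {
        for (int n = 0; n < n_samples; n++) {
            double acc = 0.0;                    /* sum of products */
            for (int k = 0; k < n_taps; k++) {
                if (n - k >= 0)
                    acc += h[k] * x[n - k];      /* one multiply-accumulate */
            }
            y[n] = acc;
        }
    }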
Operations of O(N^3), such as matrix operations, include matrix-matrix products, matrix inversion, and eigenvalue-eigenvector decomposition. These operations include methods for solving sets of linear equations and involve such algorithms as Gaussian elimination, triangular decomposition, singular value decomposition, Gram-Schmidt decomposition, and least squares solution via the Householder transformation. These matrix operations are extensively used in advanced digital signal processing and put great demands on effective computational means for realization of these systems.
A survey paper by Dr. Mermoz [3] concludes that new information
about the medium through which the signals propagate is being
incorporated into the processing to improve system performance.
This improved system performance is only available at the expense
of vastly greater computational power.
1.3 New Levels of Performance
From this brief overview of the field of digital signal processing, it is clear that the complexity of tasks has risen markedly, following the advances in theory and integrated circuit technology. The range of applications has also expanded dramatically, and has required all the performance that current computing systems can deliver.
The increase in computational throughput required by advanced signal processing algorithms is coming from two areas. The first area currently being exercised is the field of very large scale integrated (VLSI) circuits. Additional throughput can also be achieved through the exploitation of the parallelism inherent in many digital signal processing algorithms. However, the software development and hardware realization costs required to take advantage of this parallelism have in the past been too expensive for most applications other than those in research environments. This situation is now beginning to change, since many of these algorithms do exhibit a large amount of parallelism and VLSI technology is now permitting reasonable hardware development costs, although today it is probably safe to say that most manufacturers would still prefer to achieve the desired performance level through the utilization of higher speed conventional single instruction, single data (SISD) computer architectures.
The following sections will explore the means by which parallelism can be mapped into computer architectures. Particular attention will be paid to the computational requirements, fundamental computational primitives, and the inherent levels of parallelism within these algorithms. The explosion in single chip digital signal processors will be reviewed with a special emphasis on the third generation devices now being introduced by manufacturers. The combination of aggressive VLSI technology and innovative highly parallel architectures will be discussed that can lead to processing rates in excess of 1 billion floating point operations per second (1000 MFLOPS) for single precision operands (32 bits). This is a revolutionary performance level and opens up the practical utilization of even the most theoretical signal processing algorithms.
Design complexity, which has become a dominant cost factor in the development of VLSI digital signal processing systems, will also be considered in the following sections. Finally, the appendices address common problems faced by system designers trying to apply state of the art technologies.
2. 0
ALGORITHMS
2. 1
Fundamental Algorithms
In this chapter, the algorithms that characterize the tasks to be performed in digital signal processing are examined. Particular attention will be paid to the computational requirements, fundamental computational primitives, and the inherent levels of parallelism within these algorithms.
The algorithmic structure discussed first will be the forms used for
infinite impulse response (IIR) and finite impulse response (FIR)
filters. The basic building block commonly called the second order
section is shown in Equation [2.1].
y(n) = A y(n-1) + B y(n-2) + C x(n-2) + D x(n-1) + x(n)        [2.1]
In the most general form of the second order section, the two previous values of both the input and output must be multiplied by a real coefficient. This structure permits as many as four multiplications in parallel, the results of which must then be summed with the current input. In the case of the finite impulse response filter, only the input sample sequence is weighted after a succession of delays; all these products are then summed to produce the output. In both of these cases, the algorithmic requirement is to compute a sum of products. A variety of techniques for determining this calculation will be discussed in later chapters.
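The sum-of-products structure of Equation [2.1] can be sketched directly in C. The fragment below is illustrative only (the type, function, and variable names are not from the thesis); it shows the four multiplications and the accumulation with the current input that one second order section performs per output sample.

    /* One second order section per Equation [2.1]:
       y(n) = A*y(n-1) + B*y(n-2) + C*x(n-2) + D*x(n-1) + x(n)
       Illustrative sketch; coefficient and state names are hypothetical. */
    typedef struct {
        double A, B, C, D;      /* real coefficients     */
        double y1, y2;          /* y(n-1), y(n-2)        */
        double x1, x2;          /* x(n-1), x(n-2)        */
    } sos_state;

    double sos_step(sos_state *s, double x)
    {
        /* four multiplications, summed with the current input */
        double y = s->A * s->y1 + s->B * s->y2
                 + s->C * s->x2 + s->D * s->x1 + x;

        /* shift the delay line */
        s->x2 = s->x1;  s->x1 = x;
        s->y2 = s->y1;  s->y1 = y;
        return y;
    }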
The discrete Fourier transform (DFT) is a widely used algorithm commonly implemented as the fast Fourier transform (FFT). The DFT calculation required is shown in Equation [2.2].

            N-1
    X(k) =   Σ   x(n) W^nk                                     [2.2]
            n=0

The values of X(k) are obtained by summing the products obtained by multiplying the input sample values, x(n), by the complex coefficient W^nk. Each complex multiplication requires four real multiplies and two real additions.
The FFT calculation can be decomposed into "butterfly operations." The radix-2 butterfly algorithm is shown in Equation [2.3].

    X(m+1) = X(m) + W^nk Y(m)
    Y(m+1) = X(m) - W^nk Y(m)                                  [2.3]

An N-point FFT calculation requires (N/2) log2(N) butterflies, and each butterfly requires four real multiplications and six real additions. The execution of the FFT also requires complex address generation (complex address generation refers to the fact that two addresses must be generated for each complex data operand required) to access the desired operands and to store the results. However, as in the FIR and IIR computations, the accumulation of products is the fundamental arithmetic operation that needs to be performed.
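The radix-2 butterfly of Equation [2.3] can be written out in C to make the operation count visible: the complex product W^nk * Y(m) costs four real multiplications and two real additions, and the two complex sums add four more real additions, giving the four multiplications and six additions cited above. This is only an illustrative sketch; the type and function names are hypothetical.

    /* Radix-2 decimation-in-time butterfly, Equation [2.3]:
       X(m+1) = X(m) + W*Y(m),  Y(m+1) = X(m) - W*Y(m).
       Illustrative sketch; names are hypothetical. */
    typedef struct { double re, im; } cplx;

    void butterfly(cplx *x, cplx *y, cplx w)
    {
        /* complex product W*Y: 4 real multiplies, 2 real additions */
        cplx t;
        t.re = w.re * y->re - w.im * y->im;
        t.im = w.re * y->im + w.im * y->re;

        /* two complex additions: 4 more real additions */
        y->re = x->re - t.re;  y->im = x->im - t.im;
        x->re = x->re + t.re;  x->im = x->im + t.im;
    }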
There is a great deal of inherent parallelism in the FFT algorithm. One technique is to perform the butterfly computations sequentially, while exploiting the parallelism within each butterfly. If log2 N butterfly processors are available, then the overall FFT algorithm can utilize one butterfly processor for each stage of the FFT. Each processor would then compute N/2 butterflies before advancing the results to the next stage. Similarly, if N/2 butterfly processors are available, one stage can be computed in one butterfly time. The FFT can then be processed in log2 N butterfly times plus I/O overhead. Finally, if (N/2) log2 N processors are available, an N-point FFT can be performed in the time required to compute one butterfly. This approach involves a great deal of parallelism; a 1024-point FFT requires 5120 processors. Such a system would require over 20,000 real multipliers, and as a result has not been built to date. This approach also assumes multiple input channels are available to be multiplexed through the FFT processor. A 16-point FFT processor using 32 butterfly processors has been built using wafer scale integration [5].
The need to utilize the parallelism in FFT algorithms can be demonstrated by examining the computational requirements for a radio astronomy spectrum analyzer [6]. In this system, astronomers use synthetic aperture telescopes to investigate fine spatial structures of molecular gas clouds. The kinematics of these clouds can be studied by analyzing spectral line shapes in detail. It turns out that many interesting molecular lines are in the millimeter and submillimeter wavelength domain, which requires a 1024-point spectrum analyzer with a bandwidth of 160 MHz. The input sample rate of the FFT processor in this system is 320 million complex samples per second. A 1024-point spectrum is calculated every 3.2 microseconds. This requires the equivalent computation of one radix-2 butterfly every 625 picoseconds! The actual implementation of the FFT processor consists of two pipelined radix-32 butterfly processors which run at 10 MHz. A radix-32 1024-point FFT requires two stages (1024 = 32^y, where y is the number of FFT stages). Each stage in this 1024-point FFT requires 32 radix-32 FFTs (number of FFTs per stage = 1024/32). Each of the two radix-32 processors performs one stage of the FFT at a 10 MHz rate. This is equivalent to a new 1024-point spectrum being generated every 32 x 100 x 10^-9 seconds, or 3.2 microseconds.
2.2 Advanced Digital Signal Processing Algorithms
Chapter 1 introduced the fact that many advanced digital signal processing tasks currently being investigated require order N^3 calculations based on a sequence of N sample points. These tasks include high resolution spectral analysis, beamforming, direction finding, and image processing.
Advanced digital signal processing tasks have generated a set of new algorithm requirements focused around linear systems of equations. Although numerical programs have been available for general purpose computers for these tasks for some time, the need to perform these substantial computations in realtime has given rise to intensive research into novel architectures with very high throughput. Some of the new systems being designed today have throughput in the range of several billion floating point operations per second. Fortunately, the amount of inherent parallelism in these calculations allows implementation of systems with contemporary technology. Some of these high performance architectures will be explored in later chapters.
2.2.1 Beamforming and Direction Finding
An adaptive beamforming system consists of a set of spatially separated sensor elements connected to a single-channel or to a multichannel adaptive signal processor [7]. Adding the spatial dimension to the signal processing environment leads to a wide range of applications and algorithms. Adaptive arrays can be used to eliminate directional interference by adaptive canceling or adaptive nulling, thereby improving the signal to noise ratio. Other adaptive arrays can "steer" themselves automatically to pick out a signal without knowing beforehand its direction of arrival. The system will then separate this signal from directional interferences as long as their directions of arrival are different from that of the signal. Using adaptive algorithms, it is possible to determine the direction of a signal in the presence of interference. Adaptive arrays can be made sensitive to signals originating nearby and highly sensitive to distantly arriving signals, or vice versa. They can be made sensitive to infrequent transient signals and insensitive to frequent or stationary signals, or vice versa.
Classical frequency-domain beamforming is the spatial equivalent of the matched filter [8]. To form a beam in one direction at one particular frequency, it is necessary to form the inner product of the vector of complex amplitudes at the sensor array elements with a steering vector for the specified look direction. If the processing bandwidth is small compared to the center frequency, then the number of complex multiply accumulations required per second to realize the beamforming by matrix-vector multiplication is shown in Equation [2.4].

    Number of Operations = B * W * E                           [2.4]

where B is the number of cells in the direction depth, W is the bandwidth in Hertz, and E is the number of elements in the array. For some special array geometries and the corresponding special choices of look directions, it is possible to reduce the number of operations by using spatial convolution or the FFT. However, general matrix multiplication will be required for randomly time-varying array geometries as well as for arrays which conform to special surface shapes.
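The inner product that forms one beam at one frequency can be sketched in C as shown below; repeated over B look directions and W frequency cells per second, it yields the B * W * E operation count of Equation [2.4]. The fragment is an illustrative sketch only, with hypothetical names; whether the steering vector is conjugated is a convention assumed here, not stated in the thesis.

    /* Form one beam: inner product of the sensor snapshot with a
       steering vector (one complex multiply-accumulate per element).
       Illustrative sketch; names are hypothetical. */
    typedef struct { double re, im; } cplx;

    cplx form_beam(const cplx *snapshot, const cplx *steer, int n_elements)
    {
        cplx beam = { 0.0, 0.0 };
        for (int e = 0; e < n_elements; e++) {
            /* accumulate snapshot[e] * conj(steer[e]) */
            beam.re += snapshot[e].re * steer[e].re + snapshot[e].im * steer[e].im;
            beam.im += snapshot[e].im * steer[e].re - snapshot[e].re * steer[e].im;
        }
        return beam;
    }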
There are a number of interference cancelling techniques which require the solution of linear least squares problems. An adaptive combiner using preformed beams requires the solution of an unconstrained linear least squares problem. Minimum variance distortionless response (MVDR) beamforming requires the solution of a least squares problem with one linear constraint. Most recent versions use multiple linear constraints to avoid the formation of deep nulls in directions which are too close to the look direction.
Least squares adaptive beamformers have frequently been implemented in the past using gradient descent and similar iterative methods. This method suffers from slow convergence when the interference is strong compared to the signal and noise, and the data covariance matrix therefore has a large condition number. It is therefore more desirable in these cases to avoid statistical iteration techniques and to determine the best least squares solution possible with the available data. Although the problem can be solved by direct inversion of the sample covariance matrix, it is numerically preferable to solve the least squares problem directly with the data matrix, using either the singular value decomposition (SVD) or orthogonal triangularization techniques. For MVDR implemented via orthogonal triangularization with a number of look directions greater than the number of elements in the array and with the adaptive weights updated for each data sample, the number of arithmetic operations required per second is shown in Equation [2.5].

    Number of Operations = B * W * E * E                       [2.5]

The computational requirement for this algorithm is then a factor of E greater than the requirement for classical beamforming, where E is the number of elements in the array.
While adaptive interference cancellation techniques incorporate a priori information in the form of the assumption of point sources of interference, eigenvector and eigenvalue-eigenvector based direction finding methods use two covariance matrices: prior knowledge of the noise spatial covariance structure, and measurement of the total signal plus noise spatial covariance matrix. A representative high resolution eigenvector based direction finding method is the MUSIC algorithm of R. Schmidt [9]. For MUSIC and most eigenvector based direction finding techniques, the most computationally burdensome step will be the computation of the "direction of arrival spectrum" (DOA). Although only matrix-vector multiplications are performed, this algorithm requires a very large number of them. One matrix-vector product must be determined for each resolvable point in the array manifold. For a two-dimensional array with elements sensing diverse polarizations, the required number of matrix-vector multiplications per frequency bin update is equal to the product of the number of resolvable two-dimensional angles times the number of resolved polarizations.
As discussed above, beamforming and direction finding consist of algorithms that heavily utilize matrix computations. High resolution spectral analysis and image processing also require a similar computational perspective. One class of high resolution spectral analysis techniques that receives special attention from this perspective is the set of linear predictive algorithms. These techniques are frequently implemented using fast Toeplitz equation solvers (a Toeplitz matrix has constant values along each diagonal, e.g., an autocorrelation matrix), such as the Levinson-Trench method or Durbin's method, to solve the prediction problem. Image restoration problems typically require a wider range of minimization techniques than unconstrained and linearly constrained least squares solution. Methods include least squares with nonnegativity or other nonlinear constraints.
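Durbin's method mentioned above exploits the Toeplitz structure of the autocorrelation matrix to solve the linear prediction normal equations in O(N^2) operations rather than the O(N^3) of general elimination. The following C sketch of the standard Levinson-Durbin recursion is added purely for illustration and is not taken from the thesis; the names are hypothetical.

    /* Levinson-Durbin recursion: solve for prediction coefficients a[1..p]
       from autocorrelation values r[0..p] (a[0] is unused).  Returns the
       final prediction error energy.  Illustrative sketch; names are
       hypothetical. */
    double levinson_durbin(const double *r, double *a, int p)
    {
        double e = r[0];                       /* prediction error energy  */
        for (int i = 1; i <= p; i++) {
            double k = r[i];                   /* reflection coefficient   */
            for (int j = 1; j < i; j++)
                k -= a[j] * r[i - j];
            k /= e;

            a[i] = k;
            for (int j = 1; j <= i / 2; j++) { /* in-place coefficient update */
                double tmp = a[j];
                a[j] -= k * a[i - j];
                if (j != i - j)
                    a[i - j] -= k * tmp;
            }
            e *= (1.0 - k * k);                /* new error energy */
        }
        return e;
    }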
The advanced digital signal processing algorithms discussed above lead to requirements for a fairly complete set of linear algebra operations for dense matrices. In principle, one needs matrix-vector multiplication, matrix multiplication, orthogonal triangular decomposition, the SVD, and the generalized SVD.
2.3 Functional Element Requirements
In the discussion of algorithms, the need for basic functional units such as multiplier-accumulators and matrix-matrix multipliers has been evident. However, the use of these canonical forms is not sufficient to implement the entirety of practical algorithms. There is always a substantial amount of processing that does not map neatly into these structures. This class of general purpose processing includes heuristic decision making and data-dependent conditions where data cannot stream through a datapath. This means that realizable digital signal processing systems must provide a component for general purpose computing. Moreover, this component should be easily programmable using high level computer languages.
Some designers have attempted to incorporate special purpose functions within an architecture that appears to the programmer as a traditional Von Neumann computer. This technique is useful for systems such as the single chip general purpose digital signal processors discussed in the next chapter. These devices process signals in the audio band, but cannot provide a total solution for very high bandwidth signals where the computational complexity includes linear algebra equations.
This chapter has explored the most important basic computations needed in digital signal processing. The trend over the years has been to progress from scalar arithmetic through vector arithmetic, and most recently, to matrix numerical calculations. Future chapters will explore the computational architectures and technologies that permit these realtime digital signal processing systems to be realizable.
3.0 SINGLE CHIP DIGITAL SIGNAL PROCESSORS
Over the past several decades, digital signal processing machines have taken several forms in response to application requirements and available technology. As integrated circuit technology has matured, many high volume, low to moderate bandwidth digital signal processing systems have evolved from array processors (defined as a vector processor system with one computational element) to board level bit slice implementations to single chip processors.
The most significant breakthrough occurred when general purpose Von Neumann based microprocessors were transformed into general purpose digital signal processors. Architecturally this meant transforming from a Von Neumann based processor to a pipelined Harvard based processor with separate data and address buses for instructions and data. See Figure 3.1. This dual bus structure enabled the concurrent fetching of instructions and data. The designers of these general purpose digital signal processors also made the architectural tradeoff to devote a relatively large area of silicon to such functions as a single cycle parallel multiplier-accumulator, a barrel shifter, and a large on-chip data RAM. In contrast, most general purpose microprocessors perform operations such as multiplication and shifting via multicycle, microcoded instructions that make use of the datapath's single cycle, parallel add and single bit shift capability. Since integer multiplication and shift operations are statistically unimportant for most programs that run on general
purpose microprocessors, designers choose to devote this silicon area
to more versatile instruction sets, memory management units, or
cache memories.
As an example, consider the state of the art 16 MHz 32-bit 80386 microprocessor. The 80386 performs an integer 16 x 16 multiplication in 1250 ns, approximately an order of magnitude slower than most single chip digital signal processors.
The earliest general purpose digital signal processors such as the TMS32010 (200 ns instruction cycle time) provided instruction sets that were very similar to 8-bit general purpose microprocessors such as the popular Intel 8051 (1 us instruction cycle time). All software development on these early machines was written in assembly language. The available addressing modes were limited to immediate, direct, indirect, or indirect with auto increment/decrement. Moreover, these first generation machines had primitive input/output interfaces to the outside world and limited amounts of on-chip data RAM. The combination of these two factors limited the performance of algorithms such as large FFTs or digital filters that required frequent and inefficient I/O transfers, which resulted in a high percentage of dead cycles in the processor's datapath. As we will see in this section, the newer general purpose digital signal processors provide a wider range of instructions and addressing modes, support for high level compilers (i.e. full implementations of the industry standard Kernighan and Ritchie C compiler), and more efficient I/O interfaces.
Modern single chip digital signal processing devices can now be categorized by function into general purpose chips and those that are application-specific. The application-specific devices are designed to perform one type of function or class of functions more accurately, faster, or more cost effectively than their general purpose counterparts. The most common application-specific devices perform functions such as FIR digital filtering, adaptive filtering, or fast Fourier transform processing. Some application-specific devices are programmable, but only within the confines of the chip's function. For example, the ZORAN ZR33481 digital filter chip can be programmed by the user to perform finite impulse response digital filtering, but the device could not be programmed as efficiently to perform infinite impulse response filtering.
Single chip digital signal processing devices can also be categorized by precision and data type. Previous generations of processors were implemented with 16, 24, and 32-bit fixed point data formats. The newer processors such as the NEC 77230 (150 ns instruction cycle time), the AT&T WEDSP32C (80 ns instruction cycle time), and the TI TMS320C30 (60 ns instruction cycle time) offer 32-bit floating point formats with cycle times as fast as or faster than their fixed point predecessors [10].
Most of the single chip digital signal processing devices introduced to
date have been fixed point devices. Fixed point data words
are
limited to the range -1 <= x < 1. This has the effect of requiring filter
coefficients to be prescaled. This prescaling operation frequently introduces a round-off error into the system. The accumulated products of truncated or rounded values can also introduce noise into the system. For these reasons, finite length fixed point calculations are often modeled with "white noise" error sources added to the algorithm's signal flow diagram. Furthermore, fixed point calculations are very susceptible to overflow and underflow because of their limited word length. Many fixed point systems require overflow checking routines that saturate overflowed values. Not only is a significant level of error introduced when a value is clamped, but additional programming overhead is required. Additional program development resources are also required as a result of the fact that many fixed point algorithms need to be carefully designed (operand alignment considerations) to obtain the necessary performance while at the same time "coding" around the hazards discussed above. Floating point capability in a device allows the user to ignore, to a larger extent, these overflow, quantization, operand alignment, and other burdensome tasks common to fixed point operation.
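The overflow checking and scaling hazards described above can be illustrated with a small C fragment modeling a 16-bit fixed point (Q15) multiply-accumulate with saturation. This sketch is added only for illustration and does not correspond to any particular device in the thesis; it also assumes the compiler performs arithmetic right shifts on signed values.

    /* Q15 fixed point: values in [-1, 1) stored in 16 bits, 15 fraction bits.
       Illustrative sketch of a saturating multiply-accumulate. */
    #include <stdint.h>

    static int16_t sat16(int32_t v)
    {
        if (v >  32767) return  32767;   /* clamp on positive overflow */
        if (v < -32768) return -32768;   /* clamp on negative overflow */
        return (int16_t)v;
    }

    /* acc = sat(acc + a*b), all operands interpreted as Q15 fractions */
    int16_t q15_mac(int16_t acc, int16_t a, int16_t b)
    {
        int32_t prod = ((int32_t)a * (int32_t)b) >> 15;  /* Q15 * Q15 -> Q15 */
        return sat16((int32_t)acc + prod);
    }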
Although floating point devices do face roundoff and quantization
errors, the floating point data format greatly reduces the effect of
many of the drawbacks discussed above. The frequently used 24-bit
mantissa is the same size or larger than most of the fixed point
devices,
with
the
added
benefit
of the
8-bit
exponent
which
increases the dynamic range of the signal. The floating point format
can also represent much smaller numbers, enhancing the overall
precision of the system in which it is used. In most cases, overflow
checking is unnecessary. Eight bits of exponent permits numbers
with absolute values as large as 1.7 X 10**38 and as small as 3.5 X
10**-46 [12].
Another significant advantage of the floating point format is that it permits the system designer to code an application from a software floating point simulation of the system directly to the target system. Floating point data formats also permit a wider range of algorithms to be ported to the devices, such as infinite impulse response filters and adaptive filter algorithms. Since outputs are fed back into the computation of the next output in these algorithms, roundoff or truncation error can build up and degrade the performance of the system.
3.1 Architectural Features of the DSP Processors
Beyond floating point datapaths and fast cycle times, the latest DSP processors have a host of features that solve many of the I/O and program control issues that faced earlier generation processors. These features include data address generators with modulo addressing capability, program sequencers with zero-overhead looping, on-chip instruction caches, and other methods such as direct memory access interfaces to reduce the bottlenecks associated with transferring data on and off chip.

There are a variety of ways to enhance a system's throughput by increasing the memory bandwidth of a processor. This is usually
accomplished through the use of multiple memories and corresponding busses [10]. The brute force option of simply speeding up the data transfer rate runs into the chip crossing constraints discussed in Chapter 4. Moreover, the cost and performance of memory devices have also limited the practicality of this latter approach. For example, consider the memory requirements of the TMS320C30, which can address up to 16 million 32-bit words of memory. To run at full speed, the TMS320C30 requires fast, expensive, 25 ns, 256K-bit static RAMs. This translates into the use of 2048 of these devices in applications which need the full addressing capability of the processor. The memory devices alone in this application would cost 1000 fold more than the processor itself. The price of the memory components could be reduced dramatically if a wait state could be inserted in the processor's cycle time, thereby permitting the use of slower, less expensive memories.

Instead of the brute force approach of using ever faster memory, several DSP devices, such as the AT&T WEDSP32C, have dual internal data memories, as well as access to an external data memory [14]. These internal memories typically contain 128 to 1K words (adequate for many DSP tasks). Other DSP circuits have the capability of fetching instructions from an internal program RAM, thereby
3.2 Cache Memory
Fortunately, most time critical computations are repetitive in nature.
Computations can be coded in the form of program loops where small
cache memories can be used efficiently. This instruction cache
memory
will
maintain
a
small
history
of previously
executed
instructions. When the program enters a loop, the cache stores the
instructions on the first pass, but on all subsequent passes it can feed
the instruction register and free the program memory bus for data fetches without incurring extra cycles. The Analog Devices ADSP-2100 achieves this capability with a 16 word program cache that permits two external data accesses in a single cycle using busses PMD and DMD (see Figure 3.1) [11].

Figure 3.1. ADSP-2100 block diagram.
The TMS320C25 from Texas Instruments has a repeat facility by
which a single instruction can be repeated a specified number of
times with only a single fetch from external program memory.
Similarly, when the instructions are executing from internal program
memory, an external memory bus is freed up for data transfers. The
successor to the TMS320C25, the TMS320C30 features a 64-word
program cache. It also includes a direct memory access interface
which transfers blocks of data from an external memory into the
processor's on-chip memory in parallel with the operation of the
chip's main datapath.
3.3 Program Sequencer - Zero-Overhead Looping
Maintaining throughput rates for DSP application software requires a program sequencer that does not get bogged down by branching, looping, or responding to interrupts. Figure 3.2 provides a detailed block diagram of the program sequencer of the ADSP-2100 [11].

Figure 3.2. Program sequencer block diagram.

Instruction addresses can be selected from four possible sources: a 14-bit program counter (PC), an internal 16-level PC stack, an interrupt controller, or a 14-bit field of the instruction register. The program counter, as in most processors, keeps track of the current
instruction address and feeds an incrementer, which provides the next contiguous address. The PC stack stores subroutine and interrupt return addresses and is selected when returning to main program execution. The interrupt controller monitors the external interrupt-request inputs and provides interrupt service routine vector addresses. The instruction register is selected when a direct jump is executed.
The loop stack and comparator facilitate zero-overhead looping, a feature found on other processors such as the TMS320C30 and the WEDSP32C, as well as the ADSP-2100. Zero-overhead looping permits program loops to run without the overhead associated with conditional branch instructions to determine when a loop has completed. In order to take advantage of this feature in the ADSP-2100, two setup instructions must be executed [11]. The first
instruction loads the down counter. The down counter is a 14-bit register with an automatic post-decrement capability that is intended for controlling the flow of program loops which execute a predetermined number of times [13]. The other setup instruction is the do-until command. When executed, this instruction pushes the end of loop address (14 bits) and termination condition (4 bits) onto the loop stack and the beginning of loop address onto the PC stack. Once a loop is entered, the loop comparator compares the next address output of the sequencer multiplexer with the end of loop address on the stack. When the two are equal, it indicates that the processor is fetching the last instruction in the loop. During the following cycle, while the last instruction is being executed, the condition logic tests the termination condition specified by the loop stack. If the termination condition is false, the sequencer jumps back to the beginning of the loop by choosing the PC stack as the next address. Otherwise, the sequencer exits the loop by choosing PC+1 as the next address.
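The loop stack and comparator mechanism described above can be summarized as a small behavioral model in C. This is only a conceptual sketch of the decision made each cycle, not a description of the actual ADSP-2100 hardware; all names are hypothetical, and the down-counter case is the only termination condition modeled.

    /* Behavioral sketch of zero-overhead loop sequencing: each cycle the
       comparator checks whether the current fetch address is the last
       instruction of the loop, and the condition logic decides whether to
       wrap back to the top of the loop.  Names are hypothetical. */
    typedef struct {
        unsigned loop_start;   /* top-of-loop address (from the PC stack)   */
        unsigned loop_end;     /* end-of-loop address (from the loop stack) */
        int      count;        /* down counter loaded by the setup code     */
    } loop_state;

    unsigned next_address(loop_state *s, unsigned pc)
    {
        if (pc == s->loop_end) {        /* fetching the last instruction */
            s->count--;                 /* condition logic: counter test */
            if (s->count > 0)
                return s->loop_start;   /* jump back: no explicit branch */
        }
        return pc + 1;                  /* fall through / exit the loop  */
    }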
As mentioned above, this automatic looping feature eliminates the need for an explicit jump instruction within the loop. Every instruction in a loop can therefore perform a useful operation. The use of stacks in this architecture permits the nesting of loops, which is especially useful for application routines such as the FFT. In the FFT, loops would be programmed for execution of stages, groups, and the actual butterfly operations.
3.4 Address Generation
General purpose single chip DSP processors today also provide separate arithmetic and logic units (ALUs) for data and address generation, since fast number crunching is of little benefit if it must frequently sit idle waiting for data. Moreover, when the address and data calculations are split between arithmetic and logic units, each unit's architecture can be tailored for the task at hand. For example, arithmetic and logic units in the datapath need to contain multiply-accumulate structures and rounding logic. In contrast, the architecture of the address arithmetic and logic unit can be optimized for such tasks as indexing and automatic address modification, as well as modulo arithmetic. The ability of an ALU to perform modulo arithmetic is useful to automatically recognize the beginning and end of a data block within memory. Address ALUs also frequently perform FFT radix-2 bit reversal operations.
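Two of the address ALU operations mentioned here, modulo (circular buffer) address updates and radix-2 bit-reversed addressing, can be sketched in C. The fragment is purely illustrative, the names are hypothetical, and it models in software what the dedicated address hardware performs in a single cycle.

    /* Illustrative address-generation helpers (hypothetical names). */

    /* Circular (modulo) buffer update: step an index through a block of
       'length' words starting at 'base' without an explicit wrap test
       in the inner filter loop. */
    unsigned modulo_next(unsigned addr, unsigned base, unsigned length)
    {
        unsigned next = addr + 1;
        return (next >= base + length) ? base : next;
    }

    /* Bit-reversed address for an N-point radix-2 FFT (N a power of two):
       reverse the log2(N) low-order bits of 'index'. */
    unsigned bit_reverse(unsigned index, unsigned log2n)
    {
        unsigned rev = 0;
        for (unsigned b = 0; b < log2n; b++) {
            rev = (rev << 1) | (index & 1u);
            index >>= 1;
        }
        return rev;
    }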
3.5 Summary
The architectural features of single chip DSP processors were discussed in this chapter. These devices have been optimized for realtime DSP and other computationally intensive applications. These devices take advantage of CMOS VLSI fabrication technology to provide features such as a pipelined Harvard architecture, fast single cycle instructions, on-chip program and data memories, single cycle branching and zero-overhead looping capability, floating point datapaths, and efficient direct memory access I/O interfaces. Moreover, devices introduced by Analog Devices, Texas Instruments, and Motorola now provide complete implementations of C compilers for their processors. It is interesting to note that many of these very same architectural features discussed above are shared by reduced instruction set computer (RISC) processors. This trend is expected to continue in the future, thereby blurring the differences between high speed single cycle instruction RISC processors and general purpose digital signal processors.
4.0 VLSI ARRAY ARCHITECTURES

4.1 Introduction
This chapter will review the fundamental principles of systolic and wavefront array processor architectures. These special purpose architectures are useful for a wide range of realtime digital signal processing applications. This chapter will also review the cost effectiveness of these special purpose architectures relative to existing "general purpose" machines.
The major computational requirements for many important realtime digital signal processing tasks can be reduced to a common set of basic matrix operations. As previously mentioned in Chapter One, these include matrix-vector multiplication, matrix-matrix multiplication and addition, matrix inversion, solution of linear systems, least squares approximate solution of linear systems, eigensystem solution, generalized eigensystem solution, and singular value decomposition of matrices [15]. Trying to use conventional architectures for the computation of these algorithms is inefficient. The supervisory overhead often incurred in general purpose supercomputers makes them too slow and expensive for realtime signal and image processing.
These realtime signal processing algorithms can often make significant computational demands [19]. For example, the proposed Canadian Radarsat satellite-borne synthetic aperture radar system for monitoring the Arctic will require a digital signal processor capable of handling an input data rate of 120 million complex words per second. Processing rates will require 8 billion floating point operations per second (GFLOPS).
To achieve an adequate throughput rate for these applications, the only feasible alternative appears to be highly parallel processing by special purpose array processors. Advances in VLSI and wafer scale integration (WSI) technology have lowered implementation costs for large arrays to an acceptable level. In fact, implementations of circuits using today's advanced 1 micron CMOS process technology can integrate up to half a million transistors at a reasonable cost. Recent developments in computer aided design (CAD) techniques have also facilitated speedy prototyping and implementation of these application specific architectures.

Another factor which makes these special purpose array architectures so attractive is that they can maximize the main strength of VLSI, intensive computation power, and yet circumvent its main weakness, restricted communication. Fortunately, the commonly used algorithms listed above fall into the class which reuses data in a regular manner. The importance of this property will be discussed in later sections. The arrays which perform these algorithms will have a variety of topologies. The topology of each array reflects the data flow graph for the algorithm. However, the processing elements in each of these arrays can be made from one type of device configuration.
4.2 Locality of Computation
The algorithms mentioned above for digital signal and image processing possess certain common properties such as regularity, recursiveness, and locality of computation. With VLSI technology, it becomes feasible to build an array processor which closely resembles the signal flow graph of a particular algorithm. VLSI technology also permits systems with hundreds of devices, each performing tens of millions of multiplications and additions, to synthesize a machine that can process over a billion operations per second. With this level of processing power now available, architectures must be chosen that reduce the cost of communication and control. For example, if the cost of the communication network between the processing elements grows as the square of the number of processing elements or faster, then a machine using hundreds of parallel processing elements will be impractical.
Architectures developed today should be able to take advantage of advances in VLSI technology. In particular, photolithography will continue to improve exponentially with time as it has in the recent past. Interconnection technologies depend on factors such as the ability of a device to transfer energy down a transmission line and on integrated circuit packaging techniques. These factors will improve at most linearly with time. The principle of locality of computation then states that VLSI technology can best be harnessed when architectures require little communication relative to the amount of computation.
In general, the ratio of input/output (I/O) to arithmetic is very algorithm dependent. Special purpose array architectures can be utilized in applications where data is reused in the course of the computation in a regular manner. The array can utilize as many processing elements as the parallelism in the problem to be solved allows. Communication networking is simplified since processing elements will only be interconnected with their nearest neighbors.
4.3 Balancing Computation with I/O
I/O considerations in special purpose machines influence the overall performance of the system since data needs to be transferred to and from the host processor (the host in this context can mean a computer, a memory, or a realtime I/O interface). The goal in system design is then to balance the computation rate with the I/O bandwidth from the host. In many cases, an accurate estimate of available I/O bandwidth is difficult to obtain. The design of the special purpose processor should be modular so that its structure can be easily adjusted to match a variety of I/O bandwidths.
Consider the system in Figure 4.1. The I/O bandwidth between the host and the single processing unit is 40 million bytes per second (a high bandwidth for present technology). Also assume that at least two bytes will be read from or written to the host processor for each operation. The maximum computation rate will only be 20 million operations per second, no matter how fast the special purpose processor can operate.
[Figure 4.1. Basic principle of a systolic system: instead of a memory feeding a single 25 ns PE (20 MOPS maximum), the same memory feeds a chain of six 25 ns PEs, the systolic array (120 MOPS possible).]
Several orders of magnitude more performance is possible if multiple computations can be made for each I/O transfer. However, the repetitive use of data for a particular algorithm requires a sufficient amount of memory storage within the processing elements. The key design consideration then is how to configure the computational requirements of an algorithm with an appropriate memory structure so that arithmetic processing time is balanced with I/O time.
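This balance can be expressed with a one-line calculation. The following C sketch is purely illustrative: the 40 Mbyte/s bandwidth and two bytes per operation come from the example above, while the reuse factor of six is an assumed value chosen only to match the six-PE array of Figure 4.1.

#include <stdio.h>

/* Maximum computation rate (operations per second) that the host I/O
 * bandwidth can sustain, given the bytes moved per operand and the
 * number of operations performed on each operand once it is inside
 * the array (the data reuse factor).                                 */
static double max_ops_per_sec(double bytes_per_sec,
                              double bytes_per_operand,
                              double ops_per_operand)
{
    return (bytes_per_sec / bytes_per_operand) * ops_per_operand;
}

int main(void)
{
    double bw  = 40.0e6;  /* 40 million bytes per second from the host */
    double bpo = 2.0;     /* two bytes read or written per operand     */

    /* No reuse: every operation needs its own transfer -> 20 MOPS.    */
    printf("reuse 1: %.0f MOPS\n", max_ops_per_sec(bw, bpo, 1.0) / 1e6);

    /* Reusing each operand six times inside the array -> 120 MOPS,
     * the figure quoted for the systolic array of Figure 4.1.         */
    printf("reuse 6: %.0f MOPS\n", max_ops_per_sec(bw, bpo, 6.0) / 1e6);
    return 0;
}

With no reuse the host bandwidth caps the machine at 20 MOPS; reusing each operand six times inside the array raises the ceiling to 120 MOPS without any additional I/O.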
[Figure 4.2. (a) 16-point FFT flow graph; (b) decomposing the FFT computation with N=16 and S=4.]
The I/O problem becomes especially severe when a large computation is performed on a relatively small special purpose system. In this case, the computation must be decomposed. Executing the resulting subcomputations may require that a substantial amount of I/O bandwidth be dedicated towards storage and retrieval of intermediate results. Consider the computation of an N-point FFT on an S-point processor. Figure 4.2 illustrates the case when N=16 and the processing element contains enough memory to perform a radix-4 butterfly (S=4), without having to store any intermediate results.
With the scheme shown in Figure 4.2b, the total number of I/O operations is O(N log N / log S) [21]. In fact, to perform an N-point FFT with a processing element memory size of O(S), at least O(N log N / log S) I/O operations are needed for any decomposition scheme [22]. Therefore, for the N-point FFT, an S-point device cannot provide more than an O(log S) speed-up ratio over the conventional O(N log N) implementation time. This upper bound is an I/O limitation and will hold regardless of the throughput rate of the processing element.
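The bound is easy to evaluate for the example above. The C fragment below is illustrative only; it computes the order-of-magnitude I/O count and speed-up limit with all constant factors omitted.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 16.0;   /* transform size           */
    double S = 4.0;    /* points held in PE memory */

    /* Order-of-magnitude I/O count for the decomposed FFT and the
     * corresponding bound on speed-up over an O(N log N) machine.  */
    double io_ops  = N * log2(N) / log2(S);   /* O(N log N / log S) */
    double speedup = log2(S);                 /* O(log S)           */

    printf("N = %.0f, S = %.0f\n", N, S);
    printf("I/O operations ~ %.0f\n", io_ops);    /* 16*4/2 = 32    */
    printf("speed-up bound ~ %.0f\n", speedup);   /* log2(4) = 2    */
    return 0;
}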
In most system level implementations, tasks contain more parallelism than special purpose processing elements. The system architect then must deal with questions such as how to decompose a computation to minimize I/O, how the I/O requirement is related to the size of the memory and number of processing elements, and how the I/O bandwidth limits the achievable speed-up ratio.
4.4 Modular Architecture
The cost of special purpose architectures must be low enough to justify their limited applicability. These costs can be classified as nonrecurring engineering and recurring production (parts) costs. Historically, the cost (recurring) of integrated circuits has dropped rapidly as the component's product life cycle matured. This advantage applies equally to special purpose and general purpose systems. However, special purpose systems are seldom produced in large quantities and the nonrecurring engineering costs tend to dominate the total cost of the project. Therefore, the nonrecurring engineering costs of a special purpose system must be relatively small for it to be more attractive than a general purpose approach.
The key factor towards reducing special purpose system design costs is then to choose an architecture which can be decomposed into a few types of building blocks. These building blocks can then be repetitively used throughout the array. The economies of scale become apparent when considering the nonrecurring engineering costs of the new 100,000 gate CMOS gate arrays offered by LSI Logic, Milpitas, CA. These gate arrays will cost several hundred thousand dollars for each unique design required.
The special purpose arrays considered in this chapter are simple, regular designs which are modular. These systems are adjustable to various performance goals. System costs can therefore be made proportional to the performance required. This suggests that meeting the architectural requirements for a modular design yields cost effective special purpose systems.
4.5 Highly Parallel Processing Architectures
The similarities and differences of today's popular multiprocessor architectures are explored in this section. Single instruction-multiple data (SIMD) systems, multiple instruction-multiple data (MIMD) systems, systolic arrays, and wavefront array processors will be discussed [17].
4.5.1 Single Instruction-Multiple Data System
A SIMD system is a synchronous array of processing elements under the supervision of a central control unit. All processing elements receive the same instruction, broadcast from the central controller, at the same time. Each processing unit operates on different data sets from distinct data streams. Broadcasting of data is usually permitted in an SIMD array (see Figure 4.3).
[Figure 4.3. A SIMD array: a control unit drives the processing units over a common control bus and data bus, with a local interconnection network linking the processing units.]
The Goodyear Aerospace Massively Parallel Processor (MPP) is an example of a SIMD computer designed for image processing applications [25]. This architecture consists of a two dimensional array of 128 x 128 processing elements. The processing elements are 10 MHz bit-slice processors which perform bit serial arithmetic. The processing array can be programmed to connect opposite array edges or leave them open to permit the array topology to change from a plane to a horizontal cylinder. This feature reduces routing time significantly in a number of image processing applications. Despite the bit-slice nature of each processing unit, the MPP can reach a peak rate of 216 MFLOPS and 430 MFLOPS for single precision array multiplication and addition operations, respectively.
4.5.2 Multiple Instruction-Multiple Data System
A MIMD system consists of a number of processing units, each with its own control unit, program and data. The main feature of a MIMD machine is that the overall processing task can be distributed among the processing elements to further take advantage of the parallelism inherent in some classes of algorithms. A MIMD machine can encounter communication bottlenecks when multiple processing elements attempt to access shared system resources concurrently. Nevertheless, the flexibility of the MIMD architecture often makes it essential for dealing with irregularly structured algorithms. Extracting the inherent parallelism in algorithms and mapping this parallelism onto the targeted hardware is an arduous process. There is little in the way of programming languages or methodology that supports mapping application code onto a machine in such a way that achieves high performance [20]. One solution to this problem is optimizing compilers which help automate this process.
A data flow machine is a class of MIMD computer which executes instructions as soon as their operands are available. This architecture offers a solution to the problem of efficiently exploiting concurrency on a large scale. Moreover, the programming techniques used on data flow machines are consistent with modern concepts of program structure. The techniques used to translate source language programs into data flow graphs are similar to the methods used in conventional optimizing compilers to analyze the paths of data dependency in source programs.
4.5.3 Systolic and Wavefront Arrays
Two popular special purpose VLSI array architectures are systolic and wavefront arrays. These architectures provide massive concurrency derived from pipeline processing, parallel processing, or both.
The systolic array is often algorithm oriented and is used as an
attached processor to a general purpose host processor. The systolic
architecture concept was developed at Carnegie-Mellon University
[21]. The term systolic array was coined as
the name of this
architecture, in part, to draw an analogy with the human heart which
rhythmically
sends
and
receives
blood
through
the
circulatory
system. In this analogy, the heart corresponds to a source and
destination of data (memory array), and the network of veins is
equivalent to the array of processors and communication links. The
systolic array features the important properties of modularity, regularity, local interconnection, a high degree of pipelining, and highly synchronized multiprocessing.
4.5.4 Clock Skew in Synchronous Arrays
The data movements in systolic arrays are synchronized to a global clock reference. Synchronizing a large systolic array becomes difficult as the added constraint of decreasing cycle times is considered. For example, a standard G-10 glass epoxy printed circuit board will delay (tpd = propagation delay through a circuit board trace) an unloaded microstrip transmission line approximately 1.77 ns/ft [25]. The loaded propagation delay can be calculated from Equation [4.1]:

tpd' = tpd * sqrt(1 + Cd/C0)          [4.1]

where Cd is the load capacitance and C0 is the characteristic capacitance of the transmission line. The delay with four loads on the transmission line will stretch to 2.2 ns/ft. This "clock skew" factor will add to the system cycle time and therefore act to reduce throughput. It is not uncommon for a clock to be distributed several feet across a large computer circuit board or even throughout an entire system. Clock skew can easily reach 1 to 2 ns in large systems even with the clock distribution system carefully balanced. Clock skew can therefore reduce the performance of a state of the art 50 MHz processor by 10% to 20%!
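Equation [4.1] can be applied directly. The short C routine below is an illustration: the 1.77 ns/ft unloaded delay comes from the text, while the per-load capacitance ratio is an assumed value chosen so that four loads stretch the delay to roughly the 2.2 ns/ft figure quoted above.

#include <math.h>
#include <stdio.h>

/* Loaded propagation delay from Equation [4.1]:
 *   tpd' = tpd * sqrt(1 + Cd/C0)
 * where Cd is the total load capacitance and C0 is the characteristic
 * capacitance of the transmission line.                               */
static double loaded_delay(double tpd_unloaded, double cd_over_c0)
{
    return tpd_unloaded * sqrt(1.0 + cd_over_c0);
}

int main(void)
{
    double tpd = 1.77;             /* ns/ft, unloaded microstrip (G-10)    */
    double ratio_per_load = 0.136; /* assumed Cd/C0 contribution per load  */

    /* Four loads on the line: the delay stretches to about 2.2 ns/ft.     */
    double tpd_loaded = loaded_delay(tpd, 4.0 * ratio_per_load);
    printf("loaded delay: %.2f ns/ft\n", tpd_loaded);

    /* Skew between two clock branches that differ by one foot in length.  */
    printf("skew for 1 ft mismatch: %.2f ns\n", 1.0 * tpd_loaded);
    return 0;
}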
4.5.5 Dataflow Architecture
A solution to the problem of global processor synchronization is to
take
advantage
of the
dataflow
computing
principle.
Dataflow
computing is natural to signal processing algorithms and leads the
designer to wavefront array processing. There are two approaches
toward deriving wavefront array algorithms:
one is to trace and
pipeline the computational wavefronts; the other is based on a
dataflow graph model. Conceptually, the requirement for correct
timing in the systolic array is replaced by a requirement for correct
sequencing in the wavefront array.
4.6 Architecture Comparison
A classification of the architectures introduced above is shown in
Figure 4.4 [17]. Figure 4.4 indicates systolic arrays have local
instruction codes with external data sent into the array concurrently
with pipelining. SIMD and wavefront arrays can be categorized as
somewhat more complex than systolic arrays. SIMD arrays use global
instruction (control) busses and data busses instead of the local
busses used in systolic arrays. A wavefront array, on the other hand,
utilizes data-driven processing. In fact, the wavefront array can be
viewed as a static dataflow array that supports the direct hardware
implementation of regular dataflow graphs.
Although wavefront arrays do not suffer from clock distribution delays, the overhead resulting from data transfer handshaking does degrade system throughput. However, wavefront arrays do yield higher speeds than synchronous arrays when computing times are data-dependent. For example, when an abundance of "zero" entries are encountered in sparse matrix multiplications, a "trivial" multiplication can be computed in much less time than a "nontrivial" multiplication.
[Figure 4.4. Classification of highly parallel architectures by data I/O (preloaded from a data bus versus pipelined through boundary PEs) and timing scheme (globally synchronous versus data driven): SIMD (control broadcast from a control unit; Illiac IV is an example), systolic (prestored local control; Warp is an example), MIMD (prestored local control; dataflow machines like the Manchester machine are examples), and wavefront (prestored local control; MWAP is an example).]
Another difference between the structures is that SIMD arrays usually load local memories before a computation begins, while systolic and wavefront arrays usually pipe data to and from an outside host. Dew and Manning compare SIMD arrays and systolic arrays for an image processing application [27]. The report indicates that both arrays could be used effectively for local windowing operations. However, the systolic array performed better for data dependent operations such as a binary search correlator. The efficiency of the systolic or wavefront array is due to the fact that the host handles image storage and can select the desired data and pipe them into the array. MIMD multiprocessors generally offer the features listed above, many with the additional feature of shared memories.
4.7 Systolic Arrays
The following discussion on systolic arrays will use the convolution
algorithm (3 tap) as a vehicle to discuss a variety of architectures.
The convolution algorithm is considered because it is an important
problem in its own right and because it is representative of a wide
class of computations suited to systolic designs. Later sections will
explore more complex systolic array architectures which implement
advanced digital signal processing algorithms.
The computation required in convolution is also common to a number
of computation algorithms
such
as
digital filtering,
correlation,
pattern matching, interpolation and polynomial evaluation (including
the
discrete
Fourier
transform).
For
example,
the
analogy
of
multiplication and addition in a pattern matching problem would be
a comparison and boolean AND operation.
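Before looking at the individual systolic designs, it is useful to state the computation itself in sequential form. The C routine below is a plain reference implementation of the k-tap problem, written here in the FIR form y(n) = w1 x(n) + w2 x(n-1) + ... + wk x(n-k+1) with zero initial conditions; this indexing convention is assumed for illustration and is consistent with the description of design B1 below, in which the rightmost PE produces y1 = w1 x1 on the first clock cycle.

#include <stdio.h>

/* Reference k-tap convolution (FIR form, zero initial conditions):
 *   y[n] = w[0]*x[n] + w[1]*x[n-1] + ... + w[k-1]*x[n-k+1]
 * This is the computation that the systolic designs B1, B2, F, R1,
 * R2, W1 and W2 carry out with different data movement patterns.    */
static void convolve(const double *x, int nx,
                     const double *w, int k, double *y)
{
    for (int n = 0; n < nx; n++) {
        double acc = 0.0;
        for (int j = 0; j < k; j++)
            if (n - j >= 0)                  /* zero initial conditions */
                acc += w[j] * x[n - j];
        y[n] = acc;
    }
}

int main(void)
{
    double w[3] = {1.0, 2.0, 3.0};           /* w1, w2, w3 (3-tap filter) */
    double x[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[5];

    convolve(x, 5, w, 3, y);
    for (int n = 0; n < 5; n++)
        printf("y%d = %g\n", n + 1, y[n]);
    return 0;
}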
4.7.1 Semi-Systolic Convolution Array with Global Data Comm
The systolic array (design B1 - broadcast inputs) for a class of architectures in which the inputs, xi, are broadcast to the processing elements (PEs) is illustrated in Figure 4.5. In this version, the weights, wk, are preloaded (one at each PE) and are stationary at each PE throughout the computation. The intermediate results, yi, move from left to right through the array. At the beginning of a cycle, one xi is broadcast to each PE and one output, yi, initialized as zero, enters the array from the left. During clock cycle one, the rightmost PE computes y1 = w1x1. During clock cycle two, w1x2 and w2x2 are accumulated to y2 and y1 at the rightmost and middle PE, respectively. The results y1, y2, ... are output from the rightmost PE at the rate of one yi per cycle. Figure 4.5a provides the block diagram and Figure 4.5b provides the data flow chart for the PEs for design B1. The data flow is shown for the first four clock cycles of the 3-tap filter. Note that in Reference [21], page 40, Figure 3, the coefficients are labeled incorrectly on the diagram. Figure 4.5a reflects the correct labels.
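A short cycle-by-cycle emulation makes the B1 data movement concrete. The C sketch below is written only for this discussion (it is not taken from Reference [21]): the weights are stationary with w1 at the rightmost PE, each broadcast xi is multiplied by every weight, the partial results shift one PE to the right each cycle, and a fresh result initialized to zero enters from the left. The indexing follows the FIR convention of the reference routine above.

#include <stdio.h>

#define K 3          /* number of taps / processing elements */

int main(void)
{
    double w[K]  = {1.0, 2.0, 3.0};     /* w1, w2, w3                     */
    double x[]   = {1.0, 2.0, 3.0, 4.0, 5.0};
    int    nx    = 5;
    double pe[K] = {0.0, 0.0, 0.0};     /* partial result held at each PE */

    /* PE j (0 = leftmost) holds the stationary weight w[K-1-j], so the
     * rightmost PE holds w1 and computes w1*x1 on the first cycle.      */
    for (int t = 0; t < nx; t++) {
        /* 1. x[t] is broadcast; every PE multiplies and accumulates.    */
        for (int j = 0; j < K; j++)
            pe[j] += w[K - 1 - j] * x[t];

        /* 2. The rightmost PE emits a finished result, one per cycle.   */
        printf("cycle %d: y%d = %g\n", t + 1, t + 1, pe[K - 1]);

        /* 3. Partial results move one PE to the right; a fresh result,
         *    initialized to zero, enters the array from the left.       */
        for (int j = K - 1; j > 0; j--)
            pe[j] = pe[j - 1];
        pe[0] = 0.0;
    }
    return 0;
}

Running this emulation reproduces the same y values as the sequential reference routine, one output per clock cycle after the first.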
The systolic array (design B2 - broadcast inputs) for a class of architectures in which each yi remains at a particular PE to accumulate its terms is illustrated in Figure 4.6. As in design B1, the inputs, xi, are broadcast to all PEs concurrently. The weights circulate around the array of PEs. The first weight, w1, is associated with a tag bit which signals a PE to output its accumulator contents and then clear the accumulator. This configuration will also result in a narrower data path, since the weights transferred through the array require fewer bits than the accumulated partial results.
[Figure 4.5. Design B1: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
Another advantage of this configuration over the B1 design, which needs to transfer partial accumulations, is that conventional off-the-shelf multiply-accumulators can be used to reduce costs. However, design B2 does require a separate bus to multiplex the outputs of the PEs together. The PE block diagram and data flow chart for four cycles of a three tap filter are shown in Figure 4.6a and Figure 4.6b, respectively.
Note that designs which require multiplexing the outputs of PEs will create a special set of design problems that can degrade noise margins in emitter coupled logic (ECL) and generate overshoot/undershoot and increase propagation delays in transistor-transistor logic (TTL). This problem and its solution are described in Appendix A.
The systolic array (design F - fan-in results) architecture in which the weight vector is fixed and the input data moves past the weights in the left to right direction is shown in Figure 4.7. The weights are preloaded into the processing elements and remain there throughout the computation. The datapath for each PE in this architecture consists solely of a multiplier. An external summation or vector reduction network is required to form the new yi. The PE block diagram and data flow chart are shown in Figure 4.7a and Figure 4.7b, respectively.
4.7.2 Pure-Systolic Convolution Array without Global Data Comm
Although global broadcasting or fan-in alleviates the I/O bottleneck problem, implementing this architecture in a real system generates another set of problems as the number of processing elements increases. Wires become long, and expanding these communication paths without slowing down the clock becomes difficult.
[Figure 4.6. Design B2: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
[Figure 4.7. Design F: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
Furthermore, this technique will increase the cost and reduce the
reliability at the chip, board and system levels. This section will
describe a set of systolic architectures which do not require global
data communication channels.
The systolic array (design R1 - results do not circulate) architecture in which the weight vector and input data move in opposite directions is shown in Figure 4.8. The intermediate results are stationary, each at a particular PE, to accumulate their terms. The inputs, xi, and weights, wi, move in opposite directions such that when a data value meets a weight at a PE, they are multiplied and the resulting product accumulated. Consecutive xi's and wi's in this architecture are separated by two cycle times to ensure that each xi meets every wi. Figure 4.8a and Figure 4.8b illustrate the block diagram of the PE and the resulting data flow chart, respectively.
As in the B2 design, R1 can use off-the-shelf multiplier-accumulator hardware. R1 can also use a tag bit technique, associated with the first weight w1, to trigger the output and clear the accumulator contents of a PE. Note that the datapaths in the PEs for R1 are only maintained at a 50% utilization factor. It is possible to fully utilize the PEs by interleaving two independent convolutions through the same systolic array. However, the PEs in Figure 4.8 would require a dual accumulator structure to hold temporary results for the second convolution computation.
[Figure 4.8. Design R1: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
The systolic array (design R2 - results do not circulate) for a class of architectures in which the weight vector and input data move in the same direction is shown in Figure 4.9. In this version, both xi and wi move from left to right, with xi moving twice as fast as wi. The delay of the wi path is accomplished by double buffering this path, as shown in the PE block diagram in Figure 4.9a. A standard multiplier-accumulator is also used in this configuration, with the tag bit technique to signal the output of the accumulator contents at each PE. However, an additional register is required in each PE to delay the weights. This design has the advantage over the similar design R1 in that the datapath within each PE is 100% utilized. The data flow chart for R2 is shown in Figure 4.9b.
Another version of R2 is possible in which the weights move twice as
fast as the data. An additional PE register would be required to
double buffer the data instead of the weights as in the R2 design.
This would provide a more efficient use of silicon real estate in
instances where the weights require more bits of precision than the
data word.
The systolic array (design W1 - weights do not circulate) for a class of architectures in which each weight, wi, remains at a particular PE is illustrated in Figure 4.10. The results, yi, and the data, xi, move systolically through the array in opposite directions. Similar to the R1 design, consecutive xi's and yi's are separated by two clock cycles. Figure 4.10a and Figure 4.10b show a block diagram of the processing elements and the data flow chart, respectively.
[Figure 4.9. Design R2: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
Note that there is no need for another path for moving the yi's as is required for the R1 and R2 designs. Furthermore, the result yi is output from the left-most PE during the same cycle its last input, x(i+k-1) (or x(i+2) for k=3), enters the PE. The design W1 therefore suffers from the same drawback as design R1, namely, the datapaths in the PEs are only 50% utilized. Another disadvantage of design W1 is that conventional multiplier-accumulator hardware cannot be used. The architecture of W1 is useful for recursive filtering because the weights remain at the PEs while the data, xi, and the results, yi, move along the array.
The systolic array (design W2 - weights do not circulate) for a class of architectures in which each weight, wi, remains at a particular PE is illustrated in Figure 4.11. The data inputs, xi, and the results, yi, move in the same direction but at different speeds. The W2 design takes advantage of the full bandwidth of the datapaths in the PEs. However, the latency through the array is longer than for design W1. The output yi in W2 takes place k cycles (k = number of filter taps) after the last of its inputs starts entering the left most PE of the array. Similar to design R2, design W2 has a dual version for which the xi's move twice as fast as the yi's. Figure 4.11a and Figure 4.11b illustrate the processing element block diagram and the data flow chart, respectively.
[Figure 4.10. Design W1: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
[Figure 4.11. Design W2: Systolic convolution array: (a) PE block diagram; (b) data flow chart.]
4.7.3 Summary of Systolic Arrays for Convolution
The designs listed above by no means exhaust all the possible
systolic configurations for the convolution problem. For example, it is
possible to design arrays where results, data and weights all move
during each cycle. Furthermore, additional memory inside each PE
would ease implementation of interpolation, polyphase, or adaptive
filtering tasks. A flexible communication network would permit the
same systolic array to implement different functions
as in the
proposed General Electric array [26]. The General Electric array is a
2-dimensional processor which can be dynamically reconfigured to
accommodate a wide range of computational structures.
4.8 Programming Arrays
Programming a fixed interconnect wavefront array refers to specifying the sequence of operations for each processing element in a system [17]. Each operation will describe the type of computation (i.e. addition, subtraction, etc.), the input data link direction (north, south, east, west, or internal register), and the output data link. When programming fixed interconnect systolic arrays, an additional specification is required to determine when an operation actually occurs. Wavefront arrays, being data-driven, do not require this level of programming. In this sense, programming a wavefront array is simpler than programming a systolic array. From the viewpoint of algorithm mapping, both wavefront arrays and systolic arrays require an assignment of the computational nodes of the dependence graph to the PEs.
A dependence graph is a directed graph which is embedded in an index space and specifies the data dependencies of an algorithm. In the graph, the computations are represented by nodes and the dependencies in these computations are represented by arcs. In deriving an array for a given algorithm, the first step is to derive an intermediate dependence graph from the algorithm, and then map the dependence graph to a systolic array or directly to a wavefront array. This two step process permits the designer to first determine the inherent parallelism in the algorithm itself and then to map this parallelism onto the specific hardware architecture.
As mentioned above, systolic arrays also require a timing schedule of
computations. Since a wavefront array has inherent self-timing, no
scheduling is needed. In fact, as a result of the asynchronous nature
of the wavefront architecture, it will adopt the optimal schedule.
However,
hardware
interfacing
problems
can
occur
with
asynchronous arrays when attempting to communicate with other
synchronous subsystems such as host computers, or mass memory
storage devices. Local communications in wavefront arrays can also
result in this same type of interfacing problem since in many cases the PEs are all synchronous processors without global synchronization. This interfacing problem is described in Appendix B along with possible interfacing solutions.
4.8.1 Wavefront Array Programming Example
A programming language for such arrays should be able to express parallel, data-driven computing. The Occam programming language for the Inmos Transputer is an example of such a language [29]. Another language that can be used for wavefront arrays is the Matrix Data Flow Language (MDFL), which uses the wavefront concept to reduce the complexity of parallel programming.
Many one-dimensional or two-dimensional digital filters can be initially given in the dependence graph form. The programmer just needs to assign one processing element to each dependence graph computational node, if possible, and program the node to perform the required computations. If the number of PEs is less than the number of dependence graph nodes, the programmer can group several of the nodes and assign them to a single PE.
A matrix multiplication example in the Occam programming language for a two-dimensional wavefront array is provided below. The array structure and data flow for a 2 x 2 multiplication are illustrated in Figure 4.12.
** Occam Main Program **

Line #1   CHAN vertical [n*(n+1)]:
Line #2   CHAN horizontal [n*(n+1)]:
Line #3   PAR i = [0 FOR N]
Line #4     PAR j = [0 FOR N]
Line #5       mult (vertical[(n*i)+j],
Line #6             vertical[(n*i)+j+1],
Line #7             horizontal[(n*i)+j],
Line #8             horizontal[(n*(i+1)+j)]):

** Process mult **

Line #1   PROC mult (CHAN up, down, left, right) =
Line #2     VAR acc, a, b:
Line #3     SEQ
Line #4       acc := 0
Line #5       SEQ i = [0 FOR N]
Line #6         SEQ
Line #7           PAR
Line #8             up ? a
Line #9             left ? b
Line #10          acc := acc + (a * b)
Line #11          PAR
Line #12            down ! a
Line #13            right ! b :
[Figure 4.12. Array structure and data flow for a 2 x 2 matrix multiplication example: memory modules along the top and left feed the two data matrices into the array of processors over numbered vertical and horizontal channels.]
The main program segment defines the array structure. The CHAN statements used in the first two lines of the program initialize the serial input ports of the leftmost and top processors to accept input data. One data matrix enters from the left column, while the other matrix enters from the top row. The PAR constructions used in statements three and four spawn N times N processes which execute concurrently. The PAR construction also defines the two-dimensional N x N structure of the array as mentioned above.
The process mult, which describes the PE operations, is called N*N times and is executed in parallel by each PE for each data operand transferred. The mult process passes as arguments the addresses of the PE I/O channels. The statement VAR used in the second line of the process declares acc, a and b as variables. The SEQ statement used in line three results in the instructions on lines four through thirteen executing sequentially. The fourth line simply initializes the variable acc to zero. The fifth line results in lines 6 through 13 executing once for each new operand transferred to the PE. The sixth line results in lines seven through thirteen executing sequentially. The statements up ? a and left ? b are executed in parallel as a result of the preceding PAR construction on line seven. These instructions will input a data operand from the upper and left serial communication ports on the PE. Similarly, the instructions down ! a and right ! b are I/O statements that output data in parallel to the lower (down) and right serial communication ports on the PE. The instruction on line ten performs the actual multiply-accumulate computation.
In summary, this process multiplies and accumulates data as it moves right and down through the array. After the entire matrix has passed through the array, each processor contains an element of the final matrix. All the PEs perform the same tasks, which include reading, multiplying, accumulating, and transmitting the data further right and down.
4.9 The ESL 350 MFLOPS Systolic Adaptive Beamformer
ESL, Inc. has built a systolic array processor that performs adaptive beamforming for acoustic signal processing applications [28]. The Systolic Array Beamformer project is intended to demonstrate the applicability of systolic processing techniques to acoustic signal processing. The Systolic Adaptive Beamformer implements the Minimum Variance Distortionless Response (MVDR) algorithm and can process 120 sensors and 280 frequencies in realtime.
The processor was built to perform narrow-band, passive, sonar signal processing. The basic signal processing requires that inputs from an array of hydrophones be linearly combined in the frequency domain to produce a directional acoustic receiver with a very high gain. The goal of an adaptive beamformer is to combine the sensor outputs in such a way that the array will hear sounds coming from the desired direction, but will minimize the total energy (from noise and interference sources) received from all other directions. The systolic beamforming processor has been designed to solve a set of three specific problems defined by the Naval Underwater Systems Center.
The first problem is Bearing Time Response, where spectral power as a function of bearing is determined. The number of bearings can be set to any value up to 250, with the directions of the bearings user selectable. The power versus bearing values for up to seven frequencies are computed. All beams are formed using a 32-element hydrophone subarray.
The second problem is the Fine Discrete Fourier Transform, where spectral power at selected bearings using fine frequency resolution (3/16 Hz) is determined. A display with 1200 spectral power lines at selected bearings using the MVDR algorithm is generated. Again, all beams are formed using a 32-element hydrophone subarray.
Lastly, the sensor values from the five subarrays are conventionally beamformed to synthesize a five element array with a very large aperture. The five element array is then used to form MVDR beams in 2500 operator selectable directions.
4.9.1 System Components
Implemented with custom VLSI devices, the Systolic Adaptive Beamformer consists of a Digital Equipment VAX 11/780 processor with standard peripherals attached to the Systolic Beamformer through a Digital Equipment DR11-W interface (see Figure 4.13). The beamformer is composed of four components: a hydrophone interface, an FFT processor, the MVDR processor, and I/O. The Systolic Beamformer can accept data in realtime from a sensor array, from high density magnetic tape, or from a signal simulator. All numerically intensive processing is performed in the Systolic Beamformer. The rest of the system provides display, I/O, and control functions. The actual beamforming is performed on the MVDR processor. Pipelining techniques are used within each processor and between processors.
[Figure 4.13. Block diagram of ESL's systolic system: a VAX host computer attached to systolic processors for preprocessing (FFT and normalization, 90 MFLOPS) and for factoring the matrix, solving the linear system, and forming vector inner products (260 MFLOPS).]
The Array Interface processor reads input data for each of the 120 input channels, 1024 time samples for each channel, with 12 bits per sample. This data is placed into a buffer. At the same time, the Coarse FFT (CFFT) processor performs FFT and normalization operations on the data that was previously deposited in the buffer. The output of these operations is, in turn, placed into another buffer. This buffer is read to produce the conventional subarray beams, and by the MVDR processor to produce its outputs. Table 4.1 describes the performance of the Systolic Beamformer for a representative task. In Table 4.1, d is the steering vector, which is formed analytically from data relating to sensor spacing and position and from direction information supplied by the operator. Typically, the direction information is a set of values equally spaced in the cosine of the bearing angle.
Processor   Function         Parameters                    MFLOPS performance
CFFT        Coarse FFT       1120 freq x 120 sensors        65
Norm        Normalization    280 freq x 120 sensors         10
MVDR        Beamformer       280 freq x beams              270
FDFT        Fine DFT         280 freq                        1
V           d vector gen.    9 freq x 5 sensors x 2500 d     4
Total                                                       350

Table 4.1. Systolic Beamformer performance.
From a computing point of view, the Systolic Beamformer must solve three linear algebra problems. The first problem requires a matrix to be factored. In this problem, a new matrix Unew is defined by Unew Unew* = Uold Uold* + z z*, where Uold is an n x n upper triangular matrix with real, positive values on the diagonal, and z is a vector in Cn. The second problem requires a linear system of the form U x = d to be solved, where x and d are vectors in Cn and U is as defined above. Lastly, an inner product of the form w*z must be computed, where w and z are in Cn.
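As an illustration of the second of these problems, the routine below solves an upper triangular system Ux = d by back-substitution. It is a real-valued sketch written only for this discussion; the beamformer operates on complex data and maps the recurrence onto its systolic cells rather than executing it sequentially.

#include <stdio.h>

/* Solve U x = d for x, where U is n x n upper triangular with nonzero
 * diagonal entries.  Real-valued sketch of the back-substitution
 * recurrence; the beamformer performs the complex equivalent.         */
static void backsolve(int n, double U[][3], const double *d, double *x)
{
    for (int i = n - 1; i >= 0; i--) {
        double s = d[i];
        for (int j = i + 1; j < n; j++)
            s -= U[i][j] * x[j];        /* subtract known terms        */
        x[i] = s / U[i][i];             /* divide by the diagonal      */
    }
}

int main(void)
{
    double U[3][3] = {{2.0, 1.0, 1.0},
                      {0.0, 3.0, 2.0},
                      {0.0, 0.0, 4.0}};
    double d[3] = {7.0, 12.0, 12.0};
    double x[3];

    backsolve(3, U, d, x);
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);   /* (1, 2, 3) */
    return 0;
}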
4.9.2 Systolic Cell (PE) Chip
The heart of the systolic array is a custom VLSI systolic cell chip. A processor board in this ESL system consists of 12 systolic cell chips. The systolic cell chip is used to solve all of the problem types discussed above. This chip is a high speed floating point multiplier/adder designed to support both complex and real arithmetic. The chip also supports fixed point to floating point conversions. The IEEE Std. 754 single precision floating point data format is used.
In the complex mode, successive operands are treated as the real and imaginary parts of complex numbers. In the real mode, each floating point operand is treated simply as a real number. The chip has 3 input data ports (A, B, and C) and 2 output data ports (BD and CD), all 8 bits wide. Each byte of data requires 100 ns to be transferred. Therefore, a complex operand transfer requires 8 cycles or 800 ns.
The basic operation performed by the systolic cell chip is indicated by Equation [4.2]. Input operands are transferred from ports A, B, and C and produce 2 output results through ports BD and CD.

BD <- B
CD <- A*B + C          [4.2]

The D suffix in the signal abbreviation indicates that the output results are delayed 7 major cycles through the chip. A major cycle is defined as the time required to transfer one complex operand or two real operands.
The CD port returns the complex output resulting from one of the operations described in Table 4.2. Three control pins, OC, FC, and BC, determine the operation which is performed on the input operands and the control signals which are provided to adjacent cells. These control inputs are sampled during the first data I/O cycle in each operand transfer. The state of the control inputs during successive data I/O cycles is ignored. The outputs CD, OCD, and FCD will appear 7 major cycles after the control inputs are presented.
Control Inputs       CD Output            Control Outputs
OC   FC   BC                              OCD   FCD
0    0    0          CD <- EC              0     0
0    0    1          CD <- EAxEB+EC        0     0
0    1    0          CD <- EC              0     1
0    1    1          CD <- EAxEB+EC        0     1
1    0    0          CD <- EAxEB+EC        1     0
1    0    1          CD <- EAxEB           1     0
1    1    0          CD <- EC              0     1
1    1    1          CD <- EAxEB           1     1

Table 4.2. Output functions for systolic chip.
OCD is a delayed output that is used to control adjacent systolic cells
in matrix calculations such as the triangular backsolve operation. The
OCD output is typically wired to an adjacent OC input to determine
whether that particular cell will perform a transfer or calculation
operation. FCD is used to control adjacent systolic cells in the same
manner as the OCD output control signal.
The systolic cell uses an extensive amount of internal pipelining to
achieve high efficiency within the arithmetic unit. The maximum
operation rate of the systolic cell is four single precision floating
point real multiplications and four single precision floating point real
additions every 16 clock cycles or 10 MFLOPS per chip. This rate is
achieved when the internal pipeline on the device is fully utilized
performing complex operations with a 50 ns clock. In the real mode, the maximum throughput is 5 MFLOPS with two real multiplications and two real additions occurring every 16 clock cycles.
ESL is currently building a second generation systolic array processor
that will perform at a substantially higher computation rate. The new
design will include more systolic cells in each processor and more
flexible microcode addressing modes.
5.0 TECHNOLOGY
Of all the factors that influence computer architectures for digital
signal processing, technology is without question the most important.
It is not difficult to show how architectural concepts, such as floating
point hardware multipliers and register files became significant only
when the appropriate enabling technology was available. Moreover,
multiple specialized processors
took
on reduced implementations
through techniques such as time multiplexing until the technology
made the use of completely distinct multiple processors economically
viable. Not only is technology a dominant factor in the determination
of architectures, it is a fast moving and volatile area. For example,
the conventional two to three year design cycle time for most high
performance
computer
systems
and very large scale integrated
(VLSI) components is inflexible enough to permit these very same
systems to become technologically obsolete even before the designs
are released to production. Integrated circuit memory technology is
another area where the field is rapidly advancing. The number of
bits per device is increasing at the rate of 70 percent per year. Logic
density available is increasing by 25 percent per year. The power
delay product of contemporary processes is dropping by a factor of two each year, permitting the area of individual integrated circuits to increase by 20 percent. The most significant problem which is an outgrowth of these rapid technology advances is that the cost of design has been rising at least 40 percent a year. Solutions to this problem will be addressed in Chapter 6.
There are a number of competing technologies available for use in high performance digital signal processing systems. Today, CMOS is by far the most widely used fabrication process as a result of the technology's low power, high density, and speed. CMOS gate arrays are now available that permit over 100,000 gates to be implemented with unloaded gate delays as low as 700 ps. Bipolar technology, once the most prevalent high speed commercial process, has lagged behind CMOS in terms of power dissipation, density and cost. However, the development of new VLSI processes has made bipolar technology more competitive with CMOS. This topic will be discussed later in this chapter.
5.1 Technology Comparisons
The factors that need to be addressed when comparing different technologies to CMOS include density, speed, and power. Device density for a particular process is a function of the actual transistor size, the metalization interconnect pitch (minimum metal to adjacent metal spacing), and whether the process can support polysilicon resistors. The ability of a process to support polysilicon resistors as opposed to diffused resistors is significant because polysilicon resistors can be placed between transistors without increasing the minimum transistor-transistor pitch. The speed of a transistor for a particular technology is a function of the intrinsic device transit time, the device capacitance, the interconnect capacitance, and the ability of the transistor to charge and discharge these capacitances. The power dissipated by a transistor is a function of the intrinsic device capacitance and the interconnect capacitance. Reducing these capacitances will therefore reduce charging currents, which in turn will reduce power dissipation.
CMOS technologies, however popular, do not drive capacitive loads well. Each additional fanout adds 300-400 picoseconds to the given delay, and about 4 millimeters of wire is equivalent to one fanout [31]. Since the longest wires in a chip are usually over five times the die perimeter in length, either actual performance is degraded as density increases (through the need for large, high current drive transistors), or a growing number of gates is needed to spread out the loading effects. Moreover, system bottlenecks often occur while trying to drive off-chip loads. A 5 to 10 nanosecond I/O delay is typical, and this figure degrades rapidly with increased loading, limiting system speed. Moreover, the large short circuit current generated when CMOS output transistors switch state introduces system level problems such as signal overshoot and undershoot. This switching noise problem as well as measures to reduce its effects are considered in Appendix C. CMOS technology, therefore, does not provide adequate performance for many system requirements, even though databook speeds may appear sufficient for many applications.
5.2 Bipolar Processes
Bipolar is not only faster than CMOS fabrication technologies, but also much less sensitive to capacitive loading. In fact, fan-out and wire length have a much smaller effect on speed. Typical on-chip delays are 200-500 picoseconds, with an additional 30 picoseconds for each additional load. As with CMOS, one load is equal to about 4 millimeters of wire [31].
A recently developed bipolar VLSI process (BIT1) by Bipolar Integrated Technology will be compared to CMOS technology using the criteria discussed above. This BIT1 process is a representative example of the wave of new bipolar processes introduced that borrow fabrication techniques from CMOS and use geometries in the 1 to 2 µm range [30]. In comparing the speed of CMOS to BIT1, the upper limit on the speed of a transistor will be defined by the device transition frequency (ft). The ft of BIT1 and other bipolar processes is in the range from 3 to 5 gigahertz. Similarly, in CMOS devices with gate lengths less than 2 µm, transit times are in the tens of picoseconds and represent a small part of the overall delay.
Low device capacitances, relative to interconnect capacitances, are achieved in both small-geometry, low-current CMOS devices and in the BIT1 process. The same was not true for conventional bipolar processes because of their large overall size.
Low interconnect capacitances are a function of both metal pitch and transistor size. As mentioned above, these factors greatly influence circuit density. Both CMOS and the BIT1 process have comparably low interconnect capacitances. As a result of their low device densities, conventional bipolar processes produce larger interconnect distances and therefore larger interconnect capacitances. In general, interconnect capacitance is determined by the length and width of the metalization. However, once widths are reduced to less than 2 µm, the capacitance is dominated by fringing fields, and further reduction will have little effect on capacitance.
A fast charging transistor is needed to charge both the device and the interconnect capacitances. The charging rate will depend on device transconductance. Transconductance is the ability of an active device to deliver output current in response to an input logic voltage swing. In a bipolar transistor, transconductance is a function of collector current and absolute temperature. However, for CMOS transistors, simple scaling to improve transistor transconductance faces diminishing returns as channel lengths decrease below 2 µm (see Figure 5.1). For example, a typical CMOS transistor with a 1.5 x 8 µm gate has a transconductance of approximately 0.60 mS at 25 degrees Celsius. By comparison, the BIT1 transistor of the same area has a transconductance of 3.8 mS. This difference represents a 6.3x improvement in transconductance, and therefore in charging time, over the CMOS device.
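The bipolar figure follows from the familiar relation gm = Ic/(kT/q), which is why transconductance depends on collector current and absolute temperature. The snippet below simply evaluates this expression; the 0.1 mA collector current is an assumed value chosen to show the order of magnitude, not a figure from the BIT1 data sheet.

#include <stdio.h>

int main(void)
{
    const double k = 1.380649e-23;   /* Boltzmann constant, J/K           */
    const double q = 1.602177e-19;   /* electron charge, C                */
    double T  = 298.15;              /* 25 degrees Celsius in kelvin      */
    double Ic = 1.0e-4;              /* assumed collector current, 0.1 mA */

    /* Bipolar transconductance: gm = Ic / (kT/q).                        */
    double Vt = k * T / q;           /* thermal voltage, about 25.7 mV    */
    double gm = Ic / Vt;

    printf("Vt = %.1f mV, gm = %.2f mS\n", Vt * 1e3, gm * 1e3);
    return 0;
}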
[Figure 5.1. Nonlinear behavior of MOS gm: transconductance (mS/mm) versus FET channel length (microns) at T = 300 K.]
The speed of both CMOS and bipolar processes is most readily increased by reducing all dimensions, thus reducing load capacitance. Therefore, photolithography improvements should provide the same percentage improvement in either process. This will result in the speed ratios between the processes remaining constant. However, CMOS will continue to maintain its density advantage over bipolar technology as a result of its low power dissipation. The density of VLSI bipolar devices will continue to be limited not by the size of the transistors used, but by the maximum power dissipation permitted on one monolithic device.
5.3 BiCMOS Technology
Since bipolar technology provides the ability to achieve unparalleled system level speed and CMOS technology provides the advantages of low power and high density, it would seem obvious to combine both technologies on one chip. Surprisingly, such a fabrication process (known as BiCMOS) has been slow to develop. There are, however, two commercially popular approaches that take advantage of BiCMOS technology: memory devices and devices with high performance I/O. BiCMOS static RAM memory takes advantage of CMOS transistors to create an extremely dense, low power memory matrix. Bipolar transistors are used to interface to the outside world (I/O interface) and in the decode circuitry, which is heavily loaded. The other popular use of BiCMOS technology is in products which require high speed I/O drivers, or in products which must drive large loads. In these applications, CMOS is used to implement internal logic functions, while bipolar transistors are used in the I/O ring. However, aside from these two applications, BiCMOS transistors will probably not be fully integrated throughout digital logic devices in the future. The reason is that at ECL bipolar speeds, CMOS transistors will dissipate more power than bipolar transistors. This is a result of the fact that the power consumption of a CMOS transistor is directly proportional to its speed of operation. Therefore, the designer cannot take advantage of the main strength of CMOS technology: low power, which permits high density.
5.4 Gallium Arsenide
Gallium Arsenide (GaAs) logic circuits are advertised as a technology that offers at least twice the speed of ECL at the same power dissipation levels. GaAs also seems to be more radiation-hard than silicon, although the differences have not been adequately quantified. Another advantage of GaAs is that it is tolerant to temperature variations. Its operating range is from approximately -200 degrees Celsius to +200 degrees Celsius. Gallium Arsenide is also better suited for the efficient integration of electronic and optical components. However, the promise of this level of performance by Gallium Arsenide technology has not been fully realized. Although GaAs can support very high speeds, obtaining high speeds while maintaining low power dissipation is not easy at VLSI densities. The power dissipated by both CMOS and GaAs is directly proportional to the frequency of operation. Furthermore, GaAs also does not drive capacitive loads well. In fact, 1 millimeter of wire can add as much as 150 picoseconds to the gate delay.
As higher performance is sought, elaborate processing techniques are necessary to reduce the metal capacitance. One technique used by manufacturers is to take advantage of the low dielectric constant of air by raising the second-level metal up on small metal posts in strategic locations [32]. These air bridges add one mask step and several processing steps, which act to reduce the manufacturing yield. Furthermore, GaAs ingots with low crystal defect densities are difficult to make; the largest ingots are only four inches in diameter. Since processing costs scale on a per wafer basis, the die cost is much higher with small wafers. The high defect density found on GaAs wafers also reduces the number of transistors economically feasible for a design. In addition, the semiconducting material itself contains arsenic, which makes safe manufacture very expensive.
With the prior comments in mind, many digital signal processing systems will employ digital GaAs integrated circuits and silicon circuits in the same system. Although silicon VLSI can handle the majority of signal processing tasks, GaAs can still be employed at the front end of the system. In these systems, the input to the processor may be a single stream of data at rates of 0.1-2 gigasamples per second, usually generated by very wideband image sensors or detectors. In these cases, a single high-speed data stream must be partitioned into a set of lower rate, parallel substreams, which in turn can be further processed by silicon devices operating at much slower clock rates.
As noted above, the high bandwidth signal processing environment does appear to be the best application area for digital GaAs in the near to intermediate future. This is because signal processor architectures can usually be deeply pipelined and can take advantage of parallel processing techniques to obtain the required system throughput. However, digital GaAs might be useful in signal processing control subsystems where decision branches are performed and the hardware is usually less deeply pipelined (i.e. instruction pipes or cache memories).
The majority of near term applications for digital GaAs integrated circuits will include high speed data acquisition and storage of very wide bandwidth pulsed, pseudo-random, or continuous data streams. Other application areas include realtime radar signal processing and signal analysis, electronic countermeasures, data encryption, spread spectrum communications, error detection and correction of wideband data sent through noisy channels, and communication interfaces between electronic signal processors and fiber-optic data channels. One important problem for which digital GaAs appears extremely well suited is the characterization of radar pulses. In this application, antennas receive radar pulses, increase their amplitude, and convert their frequency to lower values. The signal is then digitized and stored in a small buffer memory. This combination of high speed A/D converter and small buffer memory is commonly referred to as a Digital RF Memory (DRFM). The use of GaAs permits sample rates over 1 gigahertz in this DRFM.
5.5 Advanced Integrated Circuits of the Future
The advanced fabrication processes in production today take advantage of minimum geometry feature sizes in the range from 0.8 µm to 1.25 µm. For example, the S128K8 from Inova (Santa Clara, CA) is a 55 nanosecond, 128k-byte CMOS static RAM that packs more than 4 million transistors onto one die [34]. This RAM employs 1.2 µm geometries as well as redundancy to provide acceptable yields. The S128K8 consists of 40 4k-byte blocks on the chip. A fully operational RAM requires only 32 functional blocks. After the company has applied the first layer of metal, a wafer test is performed to identify faulty blocks, which are then disconnected with a laser. The second layer of metal joins the remaining blocks into a functional device.
Memories are not the only integrated circuits to benefit from advanced processing technology. LSI Logic (Milpitas, CA) has recently introduced a gate array built with 1 µm design rules that contains 236,880 logic gates. At four transistors per gate, the array holds 947,520 transistors, although on this gate array implementation only 100,000 gates can be routed.
5.6 Wafer Scale Integration
In conjunction with TRW, Motorola developed a 0.5 µm CMOS process for the VHSIC phase II program. The process uses direct-write electron-beam lithography for two mask steps. TRW plans to use this technology, along with die sizes as large as 2 x 3 inches, to integrate approximately 34.7 million transistors on one chip. This die size is roughly 20 times larger than the largest of today's microchips. This "wafer scale" development project is not without its own unique set of design problems, the most fundamental of which is the fact that defect densities on conventional silicon wafers make the fabrication of 34.7 million perfect transistors unlikely. Instead, the company incorporated system-level features, including extensive redundancy, in the design to circumvent defective regions on the chip. During power up, built-in self-test circuitry identifies the working modules and constructs a fully operational device from the properly functioning blocks. If a module fails during operation, the self-test circuitry on the device can repair the damage by switching in a spare block. TRW estimates that this self-healing capability provides the component with an expected life span of 50 years on Earth-orbital platforms. The architecture of the device is also user-configurable to build systems that reconfigure themselves for new applications.
5.6.1 A Wafer Scale FFT Processor
A 100 million sample per second wafer scale FFT processor is currently under development by TRW in Redondo Beach, CA [35]. This CMOS wafer scale technology avoids the VLSI and system level problems discussed in this thesis, such as interchip buffering and interconnect delays, permitting significantly higher data rates. The major challenges of this program were to minimize the number of cell types, employ a regular and short interconnection architecture, and to use redundancy to overcome the defects found in silicon wafers.
[Figure 5.2. Radix-4 FFT processor architecture: a four path pipeline of computational elements (CE) interconnected by delay commutators (DC), producing the transform output at the final stage.]
The algorithm used for the processor is a pipelined radix-4 FFT
implemented with a four path pipeline as shown in Figure 5.2 [36].
This architecture requires only two processing element types and
uses
nearest neighbor interconnections, thereby overcoming many
locality of computation constraints. The complete processor consists
of six computational elements interconnected by five delay commutators. The delay commutators reorder the data between computational stages. Each computational element requires three
complex multipliers and eight complex adders, which are realized
with 12 real multipliers and 22 real adders. The delay commutators
require 12 to 3072 words of delay and a few thousand gates of random logic. In total, the complete wafer must integrate a minimum of 72 multipliers, 132 adders, and 4096 words of delay before
considering redundancy requirements. The arithmetic is performed
with
16-bit to
22-bit fixed
point operands
with
programmable
interstage scaling to -provide adequate precision and dynamic range
for most applications.
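The arithmetic inside one computational element can be seen in a single radix-4 decimation-in-frequency butterfly. The C sketch below is illustrative only (the twiddle factors w1, w2 and w3 are assumed to be supplied by the stage control logic); it shows where the three complex multiplications and eight complex additions per element come from.

#include <complex.h>

/* One radix-4 decimation-in-frequency butterfly: 3 complex multiplies
 * and 8 complex additions, matching the per-element count quoted above.
 * x[0..3] are the four inputs; w1, w2, w3 are the stage twiddle factors. */
static void radix4_butterfly(double complex x[4],
                             double complex w1,
                             double complex w2,
                             double complex w3)
{
    double complex t0 = x[0] + x[2];          /* add 1                 */
    double complex t1 = x[1] + x[3];          /* add 2                 */
    double complex t2 = x[0] - x[2];          /* add 3                 */
    double complex t3 = -I * (x[1] - x[3]);   /* add 4; the -j is free */

    x[0] = t0 + t1;                           /* add 5                 */
    x[1] = (t2 + t3) * w1;                    /* add 6, multiply 1     */
    x[2] = (t0 - t1) * w2;                    /* add 7, multiply 2     */
    x[3] = (t2 - t3) * w3;                    /* add 8, multiply 3     */
}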
To successfully build wafer scale devices the fabrication process
requires the inclusion of sufficient redundant blocks or modules to
assure adequate yield. Furthermore, these redundant modules must
be organized properly to accommodate fault circumvention. This
fault circumvention process can be demonstrated by considering the
tradeoffs required in realizing the complex multiplier block. The yield obtained with redundancy is computed by multiplying the real multiplier yield by the overhead yield required to select the functioning redundant multipliers (i.e. the yield of the multiplexing logic component). As the number of real multipliers increases, the probability of finding at least one good element increases while the overhead yield decreases (more multiplexing logic is required as the number of multipliers increases). With the CMOS technology used by
TRW, the overall yield is maximized with four real multipliers, with a
slight penalty for triple redundancy, and a significant penalty for
dual redundancy. As a result, triple redundancy was selected to implement the real multipliers in the design, which resulted in an increase in yield of over four orders of magnitude as compared to designs without redundancy. A similar analysis resulted in the selection of dual redundancy for the real adders that comprise the complex multiplier.
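The tradeoff described above can be sketched numerically. The routine below models the yield of a redundant multiplier group as the probability that at least one of n copies is good, multiplied by an overhead yield that shrinks as more multiplexing logic is added. The per-multiplier and per-copy multiplexer yields are illustrative assumptions, not TRW data.

#include <stdio.h>
#include <math.h>

/* Yield of a block built from n redundant copies of an element, of which
 * at least one must be good, gated by the yield of the selection
 * (multiplexing) logic needed to steer around the bad copies.            */
static double redundant_yield(int n, double y_element, double y_mux_per_copy)
{
    double p_at_least_one = 1.0 - pow(1.0 - y_element, n);
    double y_overhead     = pow(y_mux_per_copy, n);   /* more copies, more mux */
    return p_at_least_one * y_overhead;
}

int main(void)
{
    const double y_mult = 0.60;   /* assumed yield of one real multiplier */
    const double y_mux  = 0.99;   /* assumed yield of mux logic per copy  */

    for (int n = 1; n <= 6; n++)
        printf("%d multipliers: block yield %.4f\n",
               n, redundant_yield(n, y_mult, y_mux));
    return 0;
}

With these assumed numbers the block yield climbs steeply for the first few copies and then flattens as the multiplexing overhead begins to dominate, which is the shape of the tradeoff described in the text.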
Reordering logic between the computational elements provides the
delay commutator function required by the pipelined FFT algorithm.
The delay elements are implemented as RAM circular buffers instead of shift registers, in which all locations change on each clock cycle. RAM circular buffers require only two memory locations to change during each clock cycle, which greatly reduces power consumption.
Memory fault tolerance is provided by the use of an error correction
code which is stored along with the data bits. Upon reading data from
the memory, the check bits are used to correct single event errors.
The error correction circuitry increases the area required to
implement the memory, but the increased size is justified since the
yield increases dramatically.
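The power argument for the RAM-based delay can be seen in a small software model of a circular-buffer delay line: only the location being read and the location being written are touched on each cycle, whereas a shift register would rewrite every location. The buffer depth and data type below are arbitrary choices for illustration.

#include <stddef.h>

#define DELAY_DEPTH 3072   /* longest delay required by the commutators */

typedef struct {
    int    ram[DELAY_DEPTH];  /* storage; assumed zero-initialised        */
    size_t wr;                /* next location to be read and then rewritten */
} delay_line;

/* Push one sample in and return the sample delayed by DELAY_DEPTH cycles.
 * Exactly one RAM location is read and one is written per call, which is
 * the property that saves power relative to a shift register in which
 * every location changes on every clock.                                  */
static int delay_step(delay_line *d, int sample)
{
    int delayed = d->ram[d->wr];        /* read the oldest sample          */
    d->ram[d->wr] = sample;             /* overwrite it with the new one   */
    d->wr = (d->wr + 1) % DELAY_DEPTH;  /* advance the pointer circularly  */
    return delayed;
}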
5.7 Sub-Micron Lithography
Although the demise of optical lithography has been forecast for some time, continual improvements have permitted the technology to carry the semiconductor industry through its first forty years. Today, optical lithography is used to fabricate devices with smaller geometries than many scientists predicted it would. However, the 0.7
to 0.8 μm processes used to build the current state of the art commercial integrated circuits have finally pushed conventional mercury-vapor lithography to its true limits. After all, the
wavelength of blue light is about 0.5 μm. To achieve geometries smaller than approximately 0.7 μm, manufacturers must turn to
exposure methods that employ shorter wavelengths.
Direct-write electron beam lithography is one technique that is currently widely used for prototype fabrication of 0.5 μm devices, and in general when a fast, low volume prototype capability is required. However, electron beam lithography will probably not be used in production facilities as a result of the enormous amount of time required to draw the intricate patterns on integrated circuits. Instead, manufacturers are considering the use of lithography
techniques that employ x-rays and excimer lasers as illumination sources to fabricate sub-0.5 μm devices. In fact, one of the goals of the Department of Defense's Monolithic Microwave Integrated Circuit (MMIC) program is to make devices with 0.25 μm geometries
manufacturable. The MMIC program is expected to start generating
results during the early 1990's.
As researchers continue to shrink device geometries, the models describing transistor parameters start to fall apart. Scientists at IBM's T. J. Watson Research Laboratory in Yorktown Heights, N.Y., have fabricated integrated circuits containing NMOS transistors built with 0.07 μm geometries in an attempt to discover whether such
small devices would operate as transistors. Approximately 75% of the
structures on the test chips were operational. The project used an electron-beam lithography process that can write patterns as small as 0.02 to 0.05 μm onto silicon wafers. A reduction in linear feature size from 0.5 μm to 0.07 μm (a factor of roughly 7 in each dimension) using the transistor technology developed at IBM could therefore produce an additional density increase of approximately 50. The same 2 x 3 inch die size used by TRW would then hold nearly two billion transistors on one piece of silicon!
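As a back-of-the-envelope check of that scaling (assuming transistor count grows with the inverse square of the feature size and nothing else about the die changes):

#include <stdio.h>

int main(void)
{
    const double transistors_at_0_5um = 34.7e6; /* TRW wafer-scale device   */
    const double old_feature = 0.5;             /* micrometres              */
    const double new_feature = 0.07;            /* IBM experimental devices */

    double linear_shrink = old_feature / new_feature;      /* about 7.1 */
    double density_gain  = linear_shrink * linear_shrink;  /* about 51  */
    double transistors   = transistors_at_0_5um * density_gain;

    printf("linear shrink %.1fx, density gain %.0fx, roughly %.2g transistors\n",
           linear_shrink, density_gain, transistors);
    return 0;
}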
5.8 Summary
In summary, by the end of the 1990's, the combination of sub-micron geometries, expanding silicon die size, and redundant IC design techniques promises to integrate billions and billions of
transistors on a chip. According to Dataquest's (San Jose, CA)
Semiconductor Industry Service, one billion transistors on a chip will
permit 128M bytes of RAM, 1000 DEC VAX CPU's, 20 Cray 2
supercomputer CPU's, or 10 complete VAX systems with memory.
6.0 COMPUTER AIDED DESIGN
Design complexity has become a dominant cost limit in the development of VLSI digital signal processing systems [37]. The advances in algorithms, architectures, and I.C. fabrication technologies discussed in this thesis will not be fully utilized without associated improvements in design methodology and tools. A
bottleneck now exists in the time and resources required to design
and verify integrated circuits. New computer aided design (CAD)
tools are needed to reduce the design time and
increase
the
probability of first pass success for the growing base of application
specific integrated circuits for digital signal processing.
Two general categories of digital signal processors were considered in
earlier chapters: the general purpose digital signal processor and the
custom design. The general purpose processor is the signal processing
equivalent of the microprocessor and uses pipelining and a parallel
multiplier to increase the data throughput. The advantage of this
approach is programmability, which avoids the expensive design
cycles of the custom approach. The generality of these processors is,
however, a major handicap in that the processor does not fit the
specifications and requirements of a specific algorithm. Examples of
these requirements are wordlength, datapath organization, memory
size, data throughput, and I/O configuration.
Full custom designs on the other hand can be optimally designed to
meet the requirements of a specific application. The enormous design
costs and the multiple design iterations required due to the
complexity of the circuits make the custom approach unattractive for
medium to low volume applications. Standard cell and gate array
methods can shorten design times, but only with a considerable loss
in the device density and speed. Table 6.1 describes the design cycle
to reach prototype silicon for gate arrays, standard cells, and full
custom designs once a schematic has been completed.
Implementation        Design Cycle
Gate Arrays           4 to 6 weeks
Standard Cells        10 to 16 weeks
Full Custom           6 months to 1 year

Table 6.1. Design cycle to reach prototype silicon.
Automation of the design process seems to be the appropriate way to
make the custom design methodology economically viable for a
broader range of applications. The ultimate goal is to provide an
automated strategy whereby the system designer can specify a high
level behavioral model of the algorithm, which is then refined and
transformed into a low level functional description in terms of a
number of predefined primitive building blocks [38]. The building blocks are composed of library cell functions
which have
predefined layouts. The parameters of these predefined cells are also
fully characterized for A.C. and D.C. performance. Finally, the floor
plan of the entire chip is assembled and a last verification step is
performed to ensure the logic and timing of the overall device is
correct. The masks can then be fabricated, sent to the silicon foundry,
and the actual silicon tested. In summary, silicon compilers encapsulate the experience of expert I.C. designers into procedural forms of knowledge representation that can generate high
performance application-specific integrated circuits.
In order to set up an automated CAD system which supports such a
methodology, it is necessary to fully define the architectural strategy.
An approach that is too general will create a silicon compilation
environment which cannot handle unlimited flexibility in an efficient
manner. Therefore, a set of constraints needs to be incorporated to
some extent in silicon compilers. These constraints will force a
tradeoff between design ease, and device performance and density.
The goal of future efforts is then to optimize the density and performance possible with silicon compilers by increasing the "level of expertise" embodied in these systems.
Many silicon compilers have been proposed and built; however, the following discussion will highlight three such tools aimed at digital signal processing: LAGER, FIRST, and BSSC. Of the three
compilers, BSSC will be explored to provide a detailed overview of
compiler architectures and cells.
6.1 LAGER
LAGER [37] was developed at the University of California at Berkeley to realize an automated design system for audio and telecommunication applications. The system uses a number of parallel operating, pipelined, and microprogrammed signal processors as the target architecture. This approach has some
advantages over the hardwired or compiled datapath approaches
since decision making is extremely expensive to implement in the
hardwired datapath approach. This limits the use of these systems to
applications such as filtering, while the use of microprogrammed
signal processors allows for the integration of complete signal processing systems.
LAGER is a datapath compiler with predefined placement of major
components. The library consists of several registers, a variable
width barrel shifter, a variable width ALU, RAM, ROM for microcode,
and I/0 pins. Each circuit generated by LAGER contains the same
basic elements; however, the wordlength of the library functions is
programmable. Up to four processors can be synthesized in the
LAGER architecture, with communication over serial lines.
The input to the compiler is an "assembly language like" program.
The microcode to control the datapath is also generated from this
assembly language input. The user must write out the detailed code
for each nonprimitive library function.
The inclusion of several
parallel processors does allow reasonable throughput rates to be
achieved; however, the serial nature of inter-processor communications does introduce both bottlenecks and synchronization
challenges.
6.2 FIRST
The FIRST (Fast Implementation of Realtime Transforms) compiler
was developed at the University of Edinburgh, Scotland and uses a
bit-serial architecture [39]. The input to the system is a functional
flowgraph of the circuit where each bit-serial primitive and its interconnections are given. The exact connections between ports and complete instantiation of all cells must be done by the user. FIRST also requires the user to define the synchronization and timing throughout the system and to insert delay elements where necessary.
This task can become quite tedious and error prone
especially when feedback loops are involved. The compiler will place
and route each cell defined by the user. The cells are not required to
have the same pitch. However, the floorplan of circuits generated by
FIRST is fixed. This layout scheme does not work well for large chips
since cells are laid out in two rows only. This gives rise to long
narrow devices. FIRST does perform the power and ground routing
and wires the I/O to the pads. A functional simulator is also available
that can be used to generate test patterns.
6.3 BSSC Compiler
The BSSC (Bit-Serial Silicon Compiler) was designed by the General
Electric Corporate Research and Development Center, Schenectady, NY
to be flexible enough to handle a wide range of DSP applications and
to allow circuit specification at the algorithmic
level
[40]. Fast
synthesis of designs was an important consideration in the design of
BSSC to facilitate the evaluation and comparison of alternative approaches and to permit system designers to perform integrated
circuit design. The BSSC system architecture is shown in Figure 6.1.
Figure 6.1. Compiler system architecture.
The input to the system is through a proprietary high level language
called BSL (Bit-Serial Language). A behavioral simulator is available
to verify algorithms and word length requirements. The description
is then run through the high-level language compiler. The compiler
performs the behavioral to structural translation and produces a list
of all cell instances and their interconnections. A gate level simulator
is also available at this stage to verify functionality. Once the
functionality has been verified, the layout generation system can be invoked to place all the instantiated cells on the chip and route their
interconnections. The output of the system will be a description
of
the final placement of all cells, pads, and routing. This description can
then be combined with the layout of each cell to generate the
complete mask layout database. Interfaces are provided to various
design rule checkers (DRC) and layout versus schematic checkers
(LVS).
6.3.1 Bit-Serial Cell Architecture
The structure of the cells used in the BSSC is briefly described in this
section. The cells are architected such that each input is 1-bit wide
and has the following characteristics:
1) an external control signal is provided which indicates the most significant bit of a word;
2) a two phase non-overlapping clock is used for synchronization;
3) all outputs are latched.
All the basic cells needed by the compiler have been designed using a 1.25 μm two level metal CMOS process. The cell template is illustrated in Figure 6.2. The pin locations are standardized for all
cells, so inter-cell connections can be handled through abutment of
adjacent cells.
Figure 6.2. Bit-serial cell template (control signal, phi-1 and phi-2 clocks and their complements, power and ground buses, input and output signals, and the internal gates, registers, and latches).
include "standard.lib"    /* include the standard library */
SYMBOLIC PROCESSOR 5_tap_filter( IN: inx[0], CONTROL[0]; OUT: outx );
WORDSIZE 16; FIXED 15;
for add use adder_cell;
for const_multiply use cmult_cell call "cmult_generator";
CONST
    a1 = 0.0217;
    a2 = 0.254;
    a3 = 0.444;
SPECIFICATION
    outx := a1 * (inx + inx[-4]) + a2 * (inx[-1] + inx[-3])
          + a3 * inx[-2];
END 5_tap_filter;

Figure 6.3. Symbolic cell description of a FIR filter.
6.3.2 Bit-Serial Language Cell Descriptions
The input language of the BSSC has a syntax similar to "C". Figure 6.3 is an example of the symbolic description of a FIR filter. The Bit-Serial Language begins with an interface description of the input and output ports. Following the interface description is a definition of the digital word length and the number of bits to the right of the decimal
point. One limitation of this compiler is that all signals within a
circuit must have the same word size and fixed point format. The
user provides a one-to-one mapping of the operators in the BSL to
the cells that implement each operator. The algorithm specification of
the circuit is then given in a form similar to other high level
languages. Operators supported in BSL include addition, subtraction,
shift right,
shift left,
multiplication, relational operators, logical
operators, and a conditional assignment operator.
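For comparison with a conventional sequential language, the following C routine computes the same symmetric 5-tap FIR output as the BSL description in Figure 6.3, one sample at a time. The coefficient values are taken from the figure; the delay-line bookkeeping is an illustrative choice and is not part of BSL.

/* 5-tap symmetric FIR corresponding to Figure 6.3:
 * out = a1*(x[n] + x[n-4]) + a2*(x[n-1] + x[n-3]) + a3*x[n-2]        */
static double fir5(double x_new, double z[4])
{
    const double a1 = 0.0217, a2 = 0.254, a3 = 0.444;

    double out = a1 * (x_new + z[3]) + a2 * (z[0] + z[2]) + a3 * z[1];

    /* shift the delay line: z[0] holds x[n-1], ..., z[3] holds x[n-4] */
    z[3] = z[2];
    z[2] = z[1];
    z[1] = z[0];
    z[0] = x_new;
    return out;
}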
6.3.3 The High Level Language Compiler
The function of the high-level language compiler is to translate a
symbolic cell description into the intermediate format by performing
a behavioral to structural conversion. The compiler starts by building
a data flow graph from the algorithmic specification. Next, the circuit
is synchronized to ensure all signals arrive at each cell at the correct
time. Any signals that arrive at an operator before the latest arriving
signal are delayed by inserting delay cells into the graph. In this
way, the most significant bit of each input signal arrives at the cell
during the same clock cycle. Once the circuit is synchronized, an
optimization step is applied to the data flow graph to reduce the
number of delay cells and therefore reduce the chip area. The
compiler then traverses the graph, assigns instance names to each
cell, and writes out the instance list and interconnection list in an
intermediate format. This intermediate format is then used by the
layout generator and the gate-level logic simulator.
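The synchronization step can be pictured as a pass over the data flow graph in which, for each operator, the compiler finds the latest-arriving input and pads every other input with delay cells. The C sketch below uses a deliberately simplified representation (plain arrival times in clock cycles) rather than the actual BSSC data structures.

#include <stdio.h>

#define MAX_INPUTS 4

/* For one operator: given the arrival time (in clock cycles) of each
 * input, report how many delay cells must be inserted on each input so
 * that every most significant bit arrives during the same clock cycle. */
static void balance_inputs(const int arrival[], int inserted[], int n)
{
    int latest = arrival[0];
    for (int i = 1; i < n; i++)
        if (arrival[i] > latest)
            latest = arrival[i];

    for (int i = 0; i < n; i++)
        inserted[i] = latest - arrival[i];   /* delay cells to add */
}

int main(void)
{
    int arrival[MAX_INPUTS]  = { 3, 7, 5, 7 };  /* example cycle counts */
    int inserted[MAX_INPUTS] = { 0 };

    balance_inputs(arrival, inserted, MAX_INPUTS);
    for (int i = 0; i < MAX_INPUTS; i++)
        printf("input %d: insert %d delay cell(s)\n", i, inserted[i]);
    return 0;
}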
6.3.4 Layout Generator
The layout generator uses a linearization and folding technique, similar to that used for standard cell designs, in which all the cells are first ordered into one long ribbon of cells. Linearization is performed using heuristic methods that cluster highly interconnected cells to minimize maximum wire length and reduce the number of signals that cannot be handled by abutment. Results have shown that on average only 10 to 15 percent of a device's area is devoted to wire routing that cannot be handled by abutment.
6.3.5 Simulators
Three simulators are provided with the BSSC: a high level behavioral
simulator, a gate-level logic simulator, and a fault simulator. The
behavioral simulator is used to verify the functionality
of the
algorithm specified in the symbolic cell. This simulator uses the
compiled executable code from the BSL description. The logic simulator performs a
bit by bit simulation of each signal. The logic simulation results will
match the results of the fabricated circuit, so this simulator is used to
generate expected test results from test vectors. This simulator is
also used to test newly developed library cells. The fault simulator
accepts a set of test vectors and a chip description in the intermediate format. Assuming a single stuck-at fault model (0/1),
the simulator determines the fault coverage achieved with the given
test vector set. The user can experiment with different sets of test
vectors until satisfactory fault coverage is attained. One tool missing
from this integrated CAD environment is an automatic test vector generator. Test vector generators use probabilistic, statistical, or
deterministic models of the circuit to automatically generate a set of
test vectors.
6.3.6 Compiled Silicon Examples
An interpolator which consisted of seven 12 by 12 multipliers, and an equivalent number of adders and subtracters, was designed using the BSSC. The circuit was described in about 15 lines of code and took about 4 hours to design. The resulting chip, using a 1.25 μm CMOS process, was roughly 121 mils by 128 mils and had approximately 20,000 transistors. With the current library of cells, the layout density achieved with the BSSC compiler is about 1500 transistors/mm². This compares with full custom layouts which can reach a density of 2400 transistors/mm². The clock speed of the device was 40 MHz at 25 degrees Celsius and it dissipated 150 mW of power.
6.4 Summary
Several DSP silicon compilers were discussed in this chapter. Even though the BSSC is the most advanced of the three, it still has several drawbacks.
The BSSC will accept a high level behavioral description language which supports all the constructs required to describe a DSP processor at the algorithmic level. This utility relieves system designers from the burdens of a detailed IC implementation.
Although the BSSC does include an integrated set of tools to support
the successful generation of VLSI devices, a number of software tools
are missing that would permit the integration of higher performance
devices. In particular, there are no tools that permit the user to interactively bypass the automatic place and route software so that designs can be hand-optimized. This capability runs against the philosophy of silicon compilation; however, the state of the art in CAD tools still places limits on the performance of automatic design tools.
Future CAD tools will aspire to truly capture the knowledge of IC design experts and permit the system designer to generate the equivalent of hand crafted designs in a fraction of the development time now required. Until such capabilities are available, tools that can be used in both an automatic and an interactive mode will produce far better results: the automatic mode would be used in the non-critical speed and density portions of the device, and the interactive mode would be used in critical path portions. These tools include a critical path timing analyzer and an interactive/automatic place and route package to improve critical timing, system clock rates, and device density. This does mean that, for the time being, system designers will have to enlist the help of I.C. design engineers to develop highly optimized devices.
7.0 SUMMARY
Over the last fifty years there has been an astonishing change in the
sophistication level of signal processing techniques. Along with this
change has been an associated change in the computations needed in
DSP. The trend over the years has been to progress from scalar
arithmetic through vector arithmetic, and most recently, to matrix numerical calculations.
Digital signal processing was first introduced after the second world
war as a simulation technique. However, during the past twenty
years, sophisticated manipulation of data spectra using detailed knowledge of signal and noise statistics has evolved DSP algorithms to the point where they are used in many realtime applications. Advanced DSP tasks have also generated a new set of algorithm requirements focused around linear systems of equations. Although numerical programs have been available for general purpose computers for these tasks for some time, the need to perform these substantial computations in realtime has given rise to intensive research into novel architectures with very high throughput. Fortunately, the amount of inherent parallelism in these calculations allows
implementation of systems with contemporary technology.
The need for basic functional units such as multiplier-accumulators
and matrix-matrix multipliers has been shown. However, there is
also a substantial amount of processing that does not map neatly into
these structures. This class of general purpose processing includes
heuristic decision making and data-dependent conditions where data
cannot stream through a datapath. Some designers have attempted to
incorporate special purpose functions
within an architecture that
appears to the programmer as a traditional von Neumann computer.
This technique is useful for systems such as single chip general
purpose digital signal processors. These devices process signals in the
audio band, but cannot provide a total solution for very high
bandwidth
signals
where
linear
algebra
equations
need
to
be
manipulated. In these high performance applications, special purpose
VLSI array architectures must be used.
Single chip general purpose digital signal processors have taken advantage of today's CMOS VLSI fabrication technology to provide features such as a Harvard architecture, fast single cycle instructions, on-chip program and data memories, single cycle branching and zero-overhead looping capability, floating point datapaths, and efficient direct memory access I/O interfaces. Moreover, devices introduced by manufacturers such as Motorola, Texas Instruments, and Analog Devices now provide complete implementations of C compilers for their processors. The architectural features discussed above are also being used by RISC processors. This trend will continue in the future, thereby blurring the differences between high speed single cycle RISC processors and general purpose digital signal processors.
Two popular special purpose VLSI array architectures are systolic
and wavefront arrays. These architectures provide massive concurrency derived from pipeline processing, parallel processing, or both. The differences between wavefront arrays and systolic arrays stem mostly from the fact that systolic arrays require global synchronization while wavefront arrays do not. Special purpose array architectures are attractive to DSP system designers because they can maximize the main strength of VLSI: intensive computation power, and yet circumvent its main weakness: restricted communication. Fortunately, the commonly used algorithms mentioned above fall into the class which can readily take advantage of this VLSI architecture.
Of all the factors that influence computer architectures for digital
signal processing, technology is without question the most important.
It is not difficult to show how architectural concepts, such as floating
point hardware multipliers and register files became significant only
when the appropriate enabling technology was available. The combination of expanding die size, sub-micron geometries, and redundant IC design techniques promises to integrate many billions of transistors on one monolithic device. In fact, one billion transistors on
a chip will permit 128M bytes of RAM, 1000 DEC VAX CPU's, 20 Cray
2 supercomputer CPU's, or 10 complete VAX systems with memory.
Design complexity has become a dominant cost limit in the
the
development of VLSI DSP systems. The advances in algorithms,
architectures, and I.C. fabrication technologies discussed in this thesis
will not be fully utilized without associated improvements in design
methodology and tools. A bottleneck now exists in the time and
resources required to design
and verify integrated circuits. The
ultimate goal is to provide
an automated strategy whereby
the
system designer can specify a high level behavioral model of the
algorithm, which is then refined and transformed into actual silicon.
To accomplish this task, future CAD tools must aspire to capture the
knowledge of IC design experts and permit the system designer to generate the equivalent of hand crafted designs in a
fraction of the development time now required.
REFERENCES
[1] D.O. North, "Analysis of the Factors Which Determine Signal/Noise Discrimination in Radar," RCA Tech. Rep. PTR-6-c, June 1943; reprinted in Proc. IRE, Vol. 51, July 1963, pp. 1016-1028.
[2] K. Bromley, H.J. Whitehouse, "Signal Processing Technology Overview," Proc. SPIE, Vol. 298, pp. 102-106.
[3] H.F. Mermoz, "Spatial Processing Beyond Adaptive Beamforming," J. Acoust. Soc. Am., 70(1), July 1981, pp. 74-79.
[4] J. Allen, "Computer Architecture for Digital Signal Processing," Proc. IEEE, May 1985, pp. 852-873.
[5] S.L. Garverick, E.A. Pierce, "A Single Wafer 16-Point 16 MHz FFT Processor," Proc. 1983 Custom Integrated Circuits Conf., pp. 244-248.
[6] Y. Chikada, et al., "A 6 x 320 MHz 1024-Channel FFT Cross-Spectrum Analyzer for Radio Astronomy," Proc. IEEE, Vol. 75, No. 9, September 1987.
[7] B. Widrow, S.D. Stearns, "Adaptive Signal Processing," Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1985.
[8] J.M. Speiser, "Signal Processing Computational Needs," Proc. SPIE, Vol. 696, pp. 2-4.
[9] R.O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. on Antennas and Propagation, Vol. AP-34, No. 3, March 1986, pp. 276-280.
[10] A. Aliphas, J.A. Feldman, "The Versatility of Digital Signal Processing Chips," IEEE Spectrum, June 1987, pp. 40-45.
[11] J.R. Roesgen, "The ADSP-2100 DSP Microprocessor," IEEE Micro, December 1986, pp. 49-59.
[12] B. Eichen, "NEC's uPD7230 Digital Signal Processor," IEEE Micro, December 1986, pp. 60-69.
[13] "ADSP-2100 DSP Microprocessor User's Guide," Analog Devices, Inc., Norwood, Mass., 1987.
[14] "WE DSP32C Digital Signal Processor," Advanced Information Data Sheet, AT&T, Allentown, PA, August 1987.
[15] J.M. Speiser, H.J. Whitehouse, "Parallel Processing Algorithms and Architectures for Real Time Signal Processing," Proc. SPIE, Vol. 298, pp. 2-9.
[16] J.A.B. Fortes, B.W. Wah, "Systolic Arrays--From Concept to Implementation," Computer, Vol. 20, No. 7, July 1987.
[17] S.Y. Kung et al., "Wavefront Array Processors: From Concept to Implementation," Computer, Vol. 20, No. 7, July 1987.
[18] P.J. Kuekes, M.S. Schlansker, "A One-third Gigaflops Systolic Linear Algebra Processor," Proc. SPIE, Vol. 495, pp. 137-139.
[19] G.R. Lang et al., "An Optimum Parallel Architecture for High-Speed Real-Time Digital Signal Processing," Computer, Vol. 21, No. 2, February 1988.
[20] J.B. Dennis, "Data Flow Supercomputers," Computer, November 1980, pp. 48-56.
[21] H.T. Kung, "Why Systolic Architectures," Computer, January 1982, pp. 37-46.
[22] J.W. Hong, H.T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. 13th Annual ACM Symp. Theory of Computing, ACM Sigact, May 1981, pp. 326-333.
[23] S.W. Song, "On a High Performance VLSI Solution to Database Problems," PhD dissertation, Carnegie-Mellon University, Computer Science Dept., July 1981.
[24] K. Hwang, F. Briggs, "Computer Architectures and Parallel Processing," New York: McGraw-Hill, 1984.
[25] W.R. Blood Jr., "MECL System Design Handbook," Motorola Inc., 1983.
[26] W.T. Lin, et al., "Integrating Systolic Arrays into a Supersystem," Computer, July 1987, pp. 100-101.
[27] P. Dew, L. Manning, "Comparison of Systolic and SIMD Architectures for Computer Vision Computations," Systolic Arrays, Boston, Mass.: Adam Hilger, 1987.
[28] D.A. Kandle, P. Kuekes, "A One Third Gigaflop Systolic Adaptive Beamformer," Proc. SPIE, Vol. 564, pp. 101-106.
[29] D. Pountain, "A Tutorial Introduction to OCCAM Programming," INMOS Ltd., 1986.
[30] G. Wilson, "Creating Low-Power Bipolar ECL at VLSI Densities," VLSI Design, May 1986.
[31] E.L. Meyer, "Champion ASIC Technologies," VLSI Design, July 1987, pp. 18-22.
[32] V. Milutinovic, "GaAs Microprocessor Technology," Computer, October 1986, pp. 10-13.
[33] B.K. Gilbert et al., "The Need for a Wholistic Design Approach," Computer, October 1986, pp. 29-43.
[34] S.H. Leibson, "Advanced ICs Portend Radical Changes in System Design," EDN, March 3, 1988, pp. 115-120.
[35] E.E. Swartzlander, Jr. et al., "A Wafer Scale FFT Processor," GOMAC 1987 Digest of Papers, pp. 45-48.
[36] L.R. Rabiner, B. Gold, "Theory and Application of Digital Signal Processing," Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1975, p. 611.
[37] J.M. Rabaey, S.P. Pope, R.W. Brodersen, "An Integrated Automated Layout Generation System for DSP Circuits," IEEE Trans. on Computer-Aided Design, Vol. CAD-4, No. 3, July 1985.
[38] F. Catthoor, et al., "Architectural Strategies for an Application-Specific Synchronous Multiprocessor Environment," IEEE Trans. on Acoust., Speech and Sig. Proc., Vol. 36, No. 2, February 1988.
[39] A.F. Murray, P.B. Denyer, "A CMOS Design Strategy for Bit-Serial Signal Processing," IEEE Journal of Solid State Circuits, Vol. SC-20, No. 3, June 1985.
[40] F.F. Yassa, et al., "A Silicon Compiler for Digital Signal Processing: Methodology, Implementation, and Applications," Proc. of the IEEE, Vol. 75, No. 9, September 1987.
[41] R.S. Tetrick, "System-Level Metastability Considerations," Electro/87, Mini-Tutorial 3, Session 3/2, pp. 1-9.
[42] L. Kleeman, A. Cantoni, "Metastable Behavior in Digital Systems," IEEE Design and Test of Computers, December 1987, pp. 4-19.
[43] "Advanced CMOS Logic Designer's Handbook," Texas Instruments, Dallas, TX, 1987.
[44] "FAST Applications Handbook," Fairchild, Portland, Maine, 1987.
A.0 APPENDIX A

A.1 Multiplexing Outputs of Digital Logic
One of the advantages of ECL components is the ability to "wire-or"
outputs. Wire-oring allows designers to save logic gates and can
simplify output bus enable and disable timing. In some TTL systems,
outputs must be completely tri-stated before an output on the same
bus can be turned on. This has the effect of extending system cycle
times. On the other hand, the enable and disable times of ECL devices
can be overlapped.
A.2 ECL Logic Families
There is a limit to which ECL outputs can be wire-ored. When two
wire-ored outputs are low, the drive current through each output transistor decreases by a factor of two. This results in a reduced base-emitter voltage which shifts the output voltages upward. The shift in Voh (output high voltage) tends to cause the next stage to move towards saturation; the shift in Vol (output low voltage) degrades noise margin and is the more serious of the two problems. One solution to this limitation is to use an external buffer to limit the number of outputs wire-ored together. Figure A.1 illustrates the tradeoffs between the change in output levels and the number of outputs wire-ored. No more than four outputs should be wire-ored together without the use of external buffers.
Figure A.1. Change in output voltage vs. the number of outputs wire-ored (curves at 25°C and 70°C).
A.3 TTL Logic Families
TTL logic is typically bussed together using devices with high
impedance output buffers. In these applications, the output buffer of
one device will turn on with the same control signal timing used to
turn off the other device. If the minimum enable to data delay of one
buffer is shorter than the maximum disable to data delay of the
other buffer (which is most often the case), bus conflict between the
two buffers will occur. Careful attention must be paid to the bus
conflict problem that arises in this situation.
Bus conflicts of this sort do not necessarily cause reliability problems.
The increase in device junction temperature caused by the typical 10
ns of bus conflict is negligible. The issue is one of functionality. Bus conflict will cause short circuit current to flow, which results in ground lift
problems (Appendix C). The proper design of ground planes and
supply decoupling will reduce the effects of bus conflict.
B.0 APPENDIX B

B.1 System Level Metastability Considerations
The effect of the programmable DSP processor and host processor
operating off temporally unrelated clocks will be considered in this
section. Under these circumstances, signals such as the asynchronous
buffer full and empty status flags described in earlier sections can
cause metastability problems in the synchronous arbitration decision
logic. In general, this metastability problem occurs when mutually
asynchronous signals are sampled by digital circuitry.
Figure B.1 shows an example of metastable behavior in a D-type flip-flop.

Figure B.1. Metastable timing diagram for a D-flip-flop (input, clock, and output waveforms; the output's metastable settling time extends well beyond the databook maximum propagation delay).

The D input of the flip-flop in this figure does not meet the
setup and hold time requirements around the clock. The output of
the device will then behave in an unpredictable manner. The time
for the output to become stable is much longer than the normal
propagation delay of the device.
Metastability is essentially a reliability concern, and this problem can be quantified by calculating the Mean Time Between Failures, or MTBF. The MTBF is shown in Equation [B.1] for a sampling device, and the formula is derived in Reference [41]. Two of the variables in the metastability equation are under the control of the designer: the clock rate of the flip-flop and the frequency of the input to be sampled. The average time between failures of the device is inversely proportional to both the frequency of the clock and the frequency of the input.
MTBF = 1 / (2 * tpd * fclk * fin * 10^(-(td - tsu)/tau))        [B.1]

where:

    tpd  = the propagation delay of the flip-flop;
    td   = the metastable settling time of the device;
    fclk = the frequency of the clock input;
    fin  = the frequency of the input to be sampled;
    tsu  = the setup time of the logic element;
    tau  = a device process dependent factor.
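A small calculator for Equation [B.1] makes the sensitivity to the settling-time budget visible. All of the parameter values below are illustrative assumptions; in particular, tau must come from characterization of the specific device, as discussed next.

#include <stdio.h>
#include <math.h>

/* Mean time between synchronizer failures per Equation [B.1]:
 * MTBF = 1 / (2 * tpd * fclk * fin * 10^(-(td - tsu)/tau))            */
static double mtbf_seconds(double tpd, double fclk, double fin,
                           double td, double tsu, double tau)
{
    return 1.0 / (2.0 * tpd * fclk * fin * pow(10.0, -(td - tsu) / tau));
}

int main(void)
{
    /* Illustrative numbers only, not vendor data. */
    double tpd  = 9.4e-9;   /* flip-flop propagation delay, seconds */
    double fclk = 10e6;     /* 10 MHz sampling clock                */
    double fin  = 1e6;      /* 1 MHz asynchronous input             */
    double tsu  = 3e-9;     /* setup time                           */
    double tau  = 1.5e-9;   /* process dependent factor             */

    for (int ns = 10; ns <= 40; ns += 10) {
        double td = ns * 1e-9;   /* allowed metastable settling time */
        printf("settling budget %2d ns -> MTBF %.3g years\n",
               ns, mtbf_seconds(tpd, fclk, fin, td, tsu, tau) /
                   (365.0 * 24.0 * 3600.0));
    }
    return 0;
}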
Equation [B.1] is also a function of tau, which is itself a property of the device and not entirely under the control of the I.C. designer. tau is a process dependent parameter and will vary from manufacturer to
devices used to sample asynchronous inputs will eventually fail. The
primary goal of the board designer is then to prevent the MTBF of
the entire system from being affected by the MTBF of the decision
element. Moreover, care must be taken in the rest of the design to
minimize the effect of metastability since the problem will occur.
B.2 Synchronizers
Figure B.2 illustrates one alternative to the single element synchronizer. In this example two D-flip-flops are used to synchronize the
asynchronous input signal. This design solution is based on the fact
that the clock cycle time must be equal to or greater than the
metastable settling time plus the setup time for the second flip-flop.
Figure B.2. Dual flip-flop synchronizer (the asynchronous input feeds the D input of the first flip-flop; its Q output feeds the D input of the second flip-flop, and both flip-flops share the same clock).
In theory, if this condition is met, the second flip-flop will isolate the
rest of the system from an unpredictable signal. It should be noted
that this assumption is not always valid. Unfortunately, no flip-flop
manufacturers guarantee the settling time of their devices. The designer must then characterize the settling time of each device independently. However, Equation [B.1] indicates that decreasing the device's propagation delay will also decrease the probability that the device will enter a metastable state.

An empirical study performed on three different types of CMOS D-flip-flops backs up this theory. Figure B.3 illustrates the results of this experiment for a 74HC74 (tprop = 44 ns), a 74ACT11074 (tprop = 9.4 ns) and a 74AC11074 (tprop = 8.2 ns) [43].
This data shows that the fastest of the three devices, the 74AC11074
is also the most reliable.
Figure B.3. Metastable settling time for CMOS D-flip-flops (MTBF plotted against the allowed settling time in nanoseconds; the curves span MTBFs from less than a minute to more than 100 years).
Synchronizers will also increase the latency time for a signal. With
truly asynchronous inputs, the single flip-flop synchronizer will cause the output to be delayed an average of 0.5 clock cycles. The dual
cause the output to be delayed an average of .5 clock cycles. The dual
flip-flop synchronizer will then have a 1.5 clock cycle delay. Since
this delay is known, speed critical applications can merely generate
the signal one clock cycle earlier.
C.0 APPENDIX C

C.1 Ground Lift Effects on High Speed Digital Logic
The effects of switching multiple outputs of a device simultaneously will be considered in this appendix. The effects of simultaneous switching
range from increased propagation delay to output glitching. In most
cases, the source can be traced back to poor grounds, poor decoupling
of power sources and package inductances. Undesirable effects are
produced by attempting to switch output current through systems
with these characteristics. Figure C.l illustrates the equivalent circuit
for an output transistor and load under high to
low
switching
conditions [44].
Figure C.1. Model of the current path through a switching TTL output (saturation resistance Rs, load capacitance C, and parasitic inductances L1 and L2).
The component Rs represents the collector saturation resistance of
the output transistor, C represents the load capacitance, L2 represents the parasitic inductive components from the package to the ground plane, while L1 represents the inductive components from the package to the output signal trace.
As the output voltage transitions from the high state to the low state,
a transient current i(t) flows as C is discharged to ground through Rs, L1 and L2. This current induces a voltage v(t) across L1 according to Equation [C.1]:

v(t) = L * di/dt        [C.1]
Figure C.2. Schematic of a TTL NAND gate including inductor L, which represents the inductance associated with the package and ground plane.
The result of the generation of voltage v(t) is an increase in the
propagation delay and the coupling of switching noise to "quiet"
outputs. These two effects will now be explored.
Figure C.2 is a diagram of a standard TTL NAND gate. Also shown in
the figure is the parasitic inductance L between the ground bonding
pad on the die and the ground plane. As the output makes a
transition from one logic level to another, switching and crowbar
current flows through inductor L. Inductor L will act as an open circuit to a sudden change in current. A voltage potential then
is impressed between the device substrate at point A and the system
ground at point B. This increase in potential at point A reduces the
current flowing through resistor R2. This in turn reduces the base
current to transistor Q5 which then results in a slower propagation
delay.
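The magnitude of the effect follows directly from Equation [C.1]. The numbers used below are typical textbook values chosen for illustration (they are not measurements from the boards described in Section C.1.1): discharging a 50 pF load through a 3 V swing in 3 ns, with seven outputs switching at once through 10 nH of common ground-path inductance, induces on the order of a volt of ground lift.

#include <stdio.h>

int main(void)
{
    /* Illustrative values only, not measured data. */
    double C_load   = 50e-12;  /* load capacitance per output, farads     */
    double dV       = 3.0;     /* output voltage swing, volts             */
    double t_edge   = 3e-9;    /* fall time, seconds                      */
    double L_ground = 10e-9;   /* package plus ground-path inductance, H  */
    int    n_switch = 7;       /* outputs switching at once (as in C.3)   */

    double i_out  = C_load * dV / t_edge;       /* current per output          */
    double di_dt  = n_switch * i_out / t_edge;  /* approx. total rate of change */
    double v_lift = L_ground * di_dt;           /* Equation [C.1]              */

    printf("per-output current %.0f mA, ground lift about %.2f V\n",
           i_out * 1e3, v_lift);
    return 0;
}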
C.1.1 Empirical Results
The rise in the voltage potential at point A can also couple through to
other outputs on the monolithic device. The result of this effect is shown in Figures C.3 through C.9 for a SN74ALS374, a SN74F374 and a SN74ACT374. This empirical data was taken with a Hewlett-Packard 54110D 1 gigahertz digitizing oscilloscope. Measurements were taken
by switching seven of the outputs of these 8-bit registers
and
maintaining one output at either logic one or logic zero. The circuit
board used was a wire wrap board with signal wires no longer than 2
inches. A full ground plane was added to the board. The 374's were
driven by two SN74ALS175 D type registers with both Q and !Q
available. Each logic device was decoupled with a .1 uf ceramic chip
capacitor.
Figures C.3 through C.5 illustrate the effect of 7 outputs switching
from a low state to a high state while one output is maintained at
logic zero. Figures C.3a, C.4a, and C.5a were taken with one TTL load
on the output of the 374. Notice the positive going noise glitch which
results from the ground rise problem discussed above. This positive
going noise pulse is followed by a negative pulse which results from
the collapsing magnetic field around the ground plane inductor L
supplying current back into the circuit as the dynamic switching
current tends toward zero. Figures C.3b, C.4b and C.5b illustrate that
the effect of adding additional loads to the quiet output is to
integrate the noise spike.
Similarly, noise is also generated by outputs making a high to low
transition while the other output is maintained at a low state as
shown in Figures C.6 and C.7. Notice the ALS374 register in Figure C.6
generates noise that is lower in magnitude than the rising transition
case in Figures C.3a and C.3b.
The large noise spikes on the CMOS ACT374 shown in Figure C.7
result from the fact that the slew rate of the edges can be as fast as
700 picoseconds per volt. In other words, a large current tries to flow
through inductor L in a very short amount of time, producing a large output noise glitch. This fast output slew rate accentuates the problems caused by the normal switching current and crowbar effects.
Performance will also be degraded when the demand for instantaneous current is impressed on the Vcc pin. Again, the parasitic inductance between the Vcc plane and the Vcc bonding pad can limit drive current to external loads, resulting in increased output delays and coupling noise to quiet outputs. Figures C.8 and C.9 illustrate this effect using an ALS374 and an ACT374, respectively.
Notice the voltage scale in Figure C.9 on the quiet output maintained
at a logic high level is 2.0 volts per division (offset = 2.0 volts), not
1.0 volt per division as in the other waveforms. This large glitch can
be reduced by using a Vcc plane in addition to a ground plane.
The design methodology in high speed logic systems is therefore to minimize the voltage induced from the device substrate to the power supply digital ground and from the power supply Vcc to the device Vcc supply. Minimizing these transient voltages requires reducing inductances in the power supply planes, sockets (if used), and packages. Package inductances can be reduced by switching to surface mount or tape automated bonding (TAB) technologies. Proper selection of decoupling capacitors will also help reduce these noise
problems.
C.2 Point of Interest
Logic levels in TTL circuits are referenced to the most negative rail, or ground. Variations in the ground reference potential can therefore result in incorrect output voltage waveforms. ECL logic levels, on the other hand, are referenced to the most positive potential. Designers of ECL logic families selected the use of a negative supply voltage to ensure that a design community familiar with TTL design practices would still provide a well designed ground reference plane.
Figure C.3. Waveforms for a SN74ALS374 quiet output while the 7 other outputs on the device transition from a low to a high state: (a) outputs have 1 TTL load; (b) outputs have 8 TTL loads. (Supply 5.5 V at 25°C; channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.4. Waveforms for a SN74F374 quiet output while the 7 other outputs on the device transition from a low to a high state: (a) outputs have 1 TTL load; (b) outputs have 8 TTL loads. (Supply 5.5 V at 25°C; channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.5. Waveforms for a SN74ACT374 quiet output while the 7 other outputs on the device transition from a low to a high state: (a) outputs have 1 TTL load; (b) outputs have 8 TTL loads. (Supply 5.5 V at 25°C; channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.6. Waveforms for a SN74ALS374 quiet output while the 7 other outputs on the device transition from a high to a low state. (Supply 5.5 V at 25°C; channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.7. Waveforms for a SN74ACT374 quiet output while the 7 other outputs on the device transition from a high to a low state. (Supply 5.5 V at 25°C; channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.8. Waveforms for a SN74ALS374 quiet output while the 7 other outputs on the device transition from: (a) a low state to a high state; (b) a high state to a low state. Both (a) and (b) were tested with a load of 1 TTL input. (Channels 1 and 2 at 1.000 volts/div; timebase 10.0 nsec/div.)

Figure C.9. Waveforms for a SN74ACT374 quiet output while the 7 other outputs on the device transition from: (a) a low state to a high state; (b) a high state to a low state. Both (a) and (b) were tested with a load of 1 TTL input. The offset for the output in (a) and (b) is 2.0 volts. (Channel 1 at 1.00 volts/div, channel 2 at 2.00 volts/div; timebase 10.0 nsec/div.)