Complex Square Root with Operand Prescaling

Journal of VLSI Signal Processing 49, 19–30, 2007
* 2007 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s11265-006-0029-2
Complex Square Root with Operand Prescaling
MILOŠ D. ERCEGOVAC
Computer Science Department, University of California at Los Angeles, 4732 Boelter Hall,
Los Angeles, CA 90095, USA
JEAN-MICHEL MULLER
CNRS-Laboratoire CNRS-ENSL-INRIA-UCBL LIP Ecole Normale Supe´rieure de Lyon,
46 Alle´e d_Italie, 69364 Lyon Cedex 07, France
Received: 1 March 2005; Revised: 25 January 2006; Accepted: 23 October 2006
Abstract. We propose a radix-r digit-recurrence algorithm for complex square-root. The operand is prescaled
to allow the selection of square-root digits by rounding of the residual. This leads to a simple hardware
implementation of digit selection. Moreover, the use of digit recurrence approach allows correct rounding of the
result if needed. The algorithm, compatible with the complex division presented in Ercegovac and Muller
(BComplex Division with Prescaling of the Operands,^ in Proc. Application-Specific Systems, Architectures, and
Processors (ASAP_03), The Hague, The Netherlands, June 24–26, 2003), and its design are described. We also
give rough estimates of its latency and cost with respect to implementation based on standard floating-point
instructions as used in software routines for complex square root.
Keywords:
1.
computer arithmetic, complex square-root, digit-recurrence algorithm, operand prescaling
Introduction
1.1. Complex Square-root
Complex square-root appears in numerical computations such as complex Givens rotation [3], complex
singular value decomposition [1, 13, 25], and in applications such as principal component analysis [7],
quantum defect theory [19] and wave propagation
[23].
Complex square-root operation is commonly implemented in software based on various algorithms
developed for reliable and accurate evaluation in
languages like FORTRAN 90 [14]. There are also
collections of Fortran routines for multiple-precision
complex arithmetic which include complex squareThis is an expanded version of a paper presented at the
ASAP_2004 conference.
root [24]. These implementations rely on standard
floating-point instructions and, consequently, their
execution time is significantly longer than that of a
single arithmetic instruction. In today_s processors it
is quite common to have hardware implementation
of all basic operations on real fixed/floating-point
operands. To our knowledge, there are no implementations of complex square-root at the hardware
level on conventional processors. The only hardware
implementation of complex arithmetic, including
square-root, we are aware of is an FPGA implementation due to McIlhenny [16], who used an adaptation of on-line arithmetic to Eq. (1). With a rapid
increase in capacity of integrated circuits, it is timely
to consider hardware-based implementation of an extended set of operations. In this research we focus on
hardware-oriented algorithms and implementations
of operations on operands in the complex domain. In
[11] we proposed and developed an algorithm and its
20
Ercegovac and Muller
implementation for complex division compatible
with a standard radix-r digit-recurrence division
scheme. In this paper we extend our approach to
complex square-root operation.
Since a nonzero complex
number has two square
pffiffi
roots, we will define z as the square root whose real
part is positive when z is not a negative real number.
The simplest algorithm for evaluating a complex
square root u þ iv of x þ iy, based on real square
roots, consists in successively computing
pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
‘ ¼ x 2 þ y2
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ð1Þ
u ¼ ð‘ þ xÞ=2
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
v ¼ ð‘ xÞ=2
with signðvÞ = signðyÞ [2]. This algorithm is optimal
in the algebraic sense, i.e., in the number of exact
pffi
operations þ; ; ; ; . However, it suffers from
several drawbacks:
2
2
& x þ y can overflow or underflow, even if the
exact square root is representable, leading to very
poor results;
3
real square roots, 2 squares and 3 additions/
&
subtractions are required. Even if this is
Balgebraically optimal,^ this is quite costly and
one could hope that a direct hardware implementation of complex square root could be better.
Another solution [21] is to first compute
8
0 sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
if x ¼ y ¼ 0
>
>
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
>
>
2
>
>
< pffiffiffiffiffi 1 þ 1 þ ðy=xÞ
jxj
if jxj jyj
w¼
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
q2ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
>
>
>
2
>
>
pffiffiffiffiffi jx=yj þ 1 þ ðx=yÞ
>
: jyj
if jxj < jyj:
2
and then obtain
u þ iv ¼
pffiffiffiffiffiffiffiffiffiffiffiffi
x þ iy
8
0
>
>
>
< wþi y
2w
¼
jyj
þ iw
>
2w
>
>
: jyj iw
2w
if w ¼ 0
if w ¼
6 0 and x 0
if w ¼
6 0 and x < 0 and y 0
if w ¼
6 0 and x < 0 and y < 0:
This avoids intermediate overflows at the cost of
more computation including several tests, divisions
and real square roots which make the complex square
root evaluation quite slow compared to a single
arithmetic instruction. Also, estimating the final accuracy is very difficult.
Kahan [15] gives a better solution, that also correctly handles all special cases (infinities, zeros,
NaNs, etc.), also at the cost of significantly more
computation than the naive method (1).
This paper presents an algorithm similar to the
usual digit-recurrence real square-root algorithm [9,
10, 20], suitable for hardware implementation. The
originalpapproach
was presented in [12]. For comffiffiffi
puting x, this algorithm uses the recurrence
w½j þ 1 ¼ rw½j 2sjþ1 S½j s2jþ1 r j1
ð2Þ
where w½0 ¼ x, and the square root digits sj `s are
chosen in a radix-r redundant digit-set, so that the
residual w½j is bounded.
The main problem of digit-recurrence algorithms
is to find a practical result-digit selection function for
higher radices. Several approaches have been suggested for higher radix square-root digit selection. In
[4] the use of digit selection tables is analyzed and
applied to the radix-4 case. The hardware complexity
of this approach grows rapidly with the radix. An
alternative, applied to higher radix digit-recurrence
algorithms for division [9, 10], uses prescaling of the
operands and rounding of the truncated residual to
achieve a feasible digit-selection function for higher
radices. This approach using prescaling and rounding
has been also developed for higher radix square
rooting [17, 18]. It consists of multiplying x by a
constant K so that Kx is close to 1, and pusing
ffiffiffiffiffiffi the
standard residual recurrence to compute Kx. The
prescaling allows the selection of sjþ1 by rounding
the residual w½j (or, merely, an approximation made
up with a few most significant digits of w½j) to the
nearest integer. For fast implementation, a truncated
residual is used. K is deduced from a few most significant bits of x. To simplify the multiplication K x,
it is desirable to choose a low-precision
value of K.
pffiffiffiffiffiffiffi
Throughout the paper, i is 1, and if z is a
complex number, then <ðzÞ and =ðzÞ denote the real
and imaginary parts of z. The norm jjzjj1 denotes
maxfj<ðzÞj; j=ðzÞjg, whereas jzj denotes the usual
complex absolute value
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ð<ðzÞÞ2 þ ð=ðzÞÞ2 :
Since j<ðzÞj and j=ðzÞj are
both less than
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi
porffiffiffi equal to
jjzjj1 , we deduce: jzj jjzjj21 þ jjzjj21 ¼ 2jjzjj1 .
Complex Square Root with Operand Prescaling
In the next section we describe the basic recurrence and prescaling of the argument. In Section 3
we discuss implementation of the method and give
some rough measures of its latency and cost. We also
compare with a conventional implementation of
complex square root operation.
2.
Complex Square-root Algorithm
2.1. Basic Iteration
pffiffiffi
Assume we wish to compute x, where x is a
complex number satisfying jjxjj1 < 2. We consider
a digit-recurrencepalgorithm
that produces a radix-r
ffiffiffi
representation of x in the form s0 :s1 s2 s3 s4 . . . with
I
R
I
sj ¼ sR
j þ isj , where sj and sj are in the redundant
digit set S ¼ fa; a þ 1; . . . ; ag , a r 1, and r
is the radix (in practice, r is a power of 2).
Assume that we have already computed S½j
represented by s0 :s1 s2 s3 s4 . . . sj . The jth residual is
defined as
W½j ¼ r j x ðS½jÞ2
21
However, we notice that:
2
R 2
I
The
terms
ðs
Þ
ðs
Þ
r j1 and 2sR
&
jþ1
jþ1
jþ1
I
j1
sjþ1 r
decrease rapidly as j increases, so that
their influence can be neglected when choosing sjþ1
after, possibly, a few first iterations.
R
& If S½j is close to 1 (i.e., if S ½j is close to one and
I
R
S ½j close to zero), then W ½j þ 1 would be close
I
to rW R ½j 2sR
jþ1 and W ½j þ 1 would be close to
I
I
rW ½j 2sjþ1 , so that a Bnatural^ choice for sR
jþ1
would be the integer closest to 12 rW R ½j; and a
Bnatural^ choice for sIjþ1 would be the integer
closest to 12 rW I ½j:
To make S½j close to 1, we perform prescaling of
the operand. This allows the iterations (4) to start at
step j0 > 1 and to use for the selection the Bnatural^
choices presented above. For a fast implementation
of the iteration, we will use low-precision estimates
of the shifted real and imaginary residuals. Of course,
we have to make sure that this suffices to ensure
convergence.
ð3Þ
2.2. Prescaling the Operand
Using Eq. (3), we obtain the residual recurrence
W½j þ 1 ¼ rW½j 2sjþ1 S½j s2jþ1 r j1
ð4Þ
which is the same recurrence as in the real case. It is
important to note that, from Eq. (3), any choice of
the sj ’s for which the W½j’s remain bounded will
ensure that S½j2 ! x. After separating the real and
imaginary parts of W½j þ 1 in Eq. (4), we get
8 R
R
2sIjþ1 SI ½j
W ½j þ 1 ¼ rW R ½j 2sR
>
jþ1 S ½j þ >
>
>
2
2
I
j1
>
<
ðsR
jþ1 Þ ðsjþ1 Þ r
>
>
>
W I ½j þ 1
>
>
:
¼
I
rW I ½j 2sIjþ1 SR ½j 2sR
jþ1 S ½j
R I
j1
:
2sjþ1 sjþ1 r
ð5Þ
I
R
Selection of digits sR
jþ1 and sjþ1 so that W ½j þ 1 and
I
W ½j þ 1 remain bounded is not obvious and we
now discuss our approach in obtaining a selection
function. Indeed, Fig. 1 shows that, at least in some
cases, choosing the Bcomplex digits^ sjþ1 cannot be
done at a reasonable cost.
2.2.1. How Prescaling the Operand Simplifies the
Selection. The prescaling part of the algorithm is
very similar to that of complex division [11]. Before
discussing different ways of performing the prescaling, let us show how having prescaled the input
operand simplifies our problem.
Using an input number x (for simplicity, we
assume 1=2 jjxjj1 < 1), we first obtain (from a
look-up table, either by direct look-up or using a
method presented in Section 2.4) a complex number
K such that jjKx 1jj1 < 2q ; where q is a parameter of the square-root algorithm. We then obtain
d ¼ Kx and use the digit-recurrence p
algorithm
with
ffiffiffi
selection by rounding to compute
d
.
The
table
pffiffiffiffi
stores also precomputed values 1= K , so that at the
end
pffiffiffi of the computation we can obtain the final result
x as
pffiffiffi pffiffiffi pffiffiffiffi
x ¼ d 1= K :
pffiffiffiffi
The multiplication by 1= K can be performed in
parallel with the recurrence:
as we compute p
a ffiffiffi
new
pffiffiffi
ffi
complex digit sj of d , we accumulate sj 1= K in
22
Ercegovac and Muller
Figure 1. The domains where digit sR
1 can be chosen equal to 1, 1 þ i or 0, assuming s0 ¼ 0. The complexity of these shapes shows that
prescaling is needed to simplify the digit selection.
two registers, one for the real and another for the
imaginary part.
After the prescaling step, we have obtained d ¼
q
Kx such that jjjj
1
pffiffiffi< 2 , where ¼ d 1. Let us
try to bound d 11 . We use the Taylor
expansion
pffiffiffi
2 3
54
þ d ¼1þ þ
2 8 16 128
ð6Þ
Define R and by ¼ Rei . Frompffiffiffik ¼ Rk eik ,
one gets jjk jj1 Rk ; and since R 2jjjj1 ; we
finally obtain jjk jj1 2k=2 jjjjk1 2k=2 2qk : Combining the last result with the power series (6), we
get
pffiffiffi
pffiffiffi
2q 2 22q 2 2 23q
þ
þ
jj d 1jj1 2
8
16
5 4 24q
þ þ
128
2q 2 22q
þ
þ 23q :
2
8
In particular, for q > 2:
pffiffiffi
9 q
2 :
d 1 <
1
16
Hence, there exists a radix-2 representation of
that starts with
ð7Þ
pffiffiffi
d
Complex Square Root with Operand Prescaling
& For the real part:
(8) is satisfied in all
values are
8
SR ½j >
>
< I 0
S ½j0 W R ½j0 >
>
: I
W ½j0 1: 000
00
|fflfflfflfflfflffl{zfflfflfflfflffl
ffl}
q zeros
& For the imaginary part:
0: 000
00
|fflfflfflfflfflffl{zfflfflfflfflffl
ffl}
q zeros
These digits can be used to initialize S½j0 for some
j0 . More precisely, assume that the digits sj are radixr ¼ 2k digits in the set fa; . . . ; þag. From Eq. (7),
onepcan
ffiffiffi easily show that there exists a representation
of d in this radix-r system that starts with
& For the imaginary part:
0: 000
00
|fflfflfflfflfflffl{zfflfflfflfflffl
ffl}
j0 ¼bq=kc zeros
9 q
r i 16
2 ; i.e.,
ar j0
9 q
2 :
r 1 16
ð10Þ
sjþ1 1 rW½j 1 þ 2 :
2
2
1
ð11Þ
1. Bound jjW½j þ 1jj1 , that is, bound jW R ½j þ 1j
and jW I ½j þ 1j;
2. Choose q and so that the selection function
returns digits in the set fa; . . . ; þag.
j0 ¼bq=kc zeros
i¼j0 þ1
¼ 1
¼ 0
¼ r j0 ðdR 1Þ
¼ r j0 ðdI Þ:
For convergence of the algorithm, we have to :
1: 000
00
|fflfflfflfflfflffl{zfflfflfflfflffl
ffl}
P1
practical cases. The initial
The selection function that returns the sj `s is as
follows. We will choose sR
jþ1 as the integer closest to
the number constituted by 1=2rW R ½j truncated after
the -th borrow-save position.1 We choose sIjþ1 as the
integer closest to the value 1=2rW I ½j truncated after
the -th borrow-save position. This implies that
& For the real part:
if a
23
ð8Þ
Let us first bound jjW½j þ 1jj1 . Denote S½j ¼
1 þ ¼ 1 þ R þ iI . From Eq. (10), we get jjjj1 ¼
maxfjR j; jI jg < rbq=kc : Since
R R
I
I
W R ½j þ 1 ¼ rW R ½j 2sR
jþ1 2sjþ1 þ 2sjþ1 2
2
I
j1
ðsR
jþ1 Þ ðsjþ1 Þ r
we find
Assume q ¼ kj0 þ ‘, with 0 ‘ k 1. Eq. (8) gives
a
9 ‘
2
r 1 16
ð9Þ
The ratio a=ðr 1Þ ¼ is the redundancy factor
which, for a redundant digit set fa; . . . ; þag,
satisfies 1=2 < 1. If ‘ 1 the condition (9)
can be satisfied for any r and a. For ‘ ¼ 0, the
condition requires that a > r=2, i.e., the minimally
redundant system cannot be used.
2.3. Making the Iterations Work
We assume that the radix of the iteration is r ¼ 2k .
We start the iterations from step j0 ¼ bq=kc since Eq.
jW R ½j þ 1j <1 þ 2þ1 þ 4ar bq=kc þ a2 r j1
<1 þ 2þ1 þ 4ar bq=kc þ a2 r bq=kc1
Similarly, from W I ½j þ 1 ¼ ðrW I ½j 2sIj þ 1 Þ I
R
I
j1
w e fi n d
2sIj þ 1 R 2sR
j þ 1 2sj þ 1 sj þ 1 r
jW I ½j þ 1j < 1 þ 2þ1 þ 4arbq=kc þ 2a2 r bq=kc1 2q :
Therefore, we get the following bound
jjW½j þ 1jj1 < 1 þ 2þ1
a2 bq=kc
r
þ 4a þ 2
¼W
r
ð12Þ
Let us now determine conditions to assure that the
real and imaginary parts of computed digits sjþ1
Ercegovac and Muller
24
belong to fa; . . . ; þag. These digits are obtained by
rounding to the nearest integer an estimate with
accuracy 2 of a number whose absolute value
can be as large as 12 rW. To satisfy jsjþ1 j a we must
have
1
1
rW þ 2 a þ
2
2
ð13Þ
2
where
pffiffia þ ib ¼ a ib is the complex conjugate and
2
¼ 2 ð1 iÞ.
Now, if we write x ¼ a þ ib, a and b can be
represented as binary fixed-point numbers
a
b
¼ 0:a1 a2 a3 a4 ¼ 0:b1 b2 b3 b4 ; :
where a1 ¼ 1. Define a^ and b^ as a and b rounded
to the nearest m-fractional-bit number. Our first
solution consists in looking-up
Combining Eqs. (12) and (13), we get:
Property 1 If
1
a2 bq=kc
1
r
þ 2 a þ
r
þ 2 þ 2a þ
r
2
2
K¼
1
a^ þ i ^b
ð14Þ
in a table with 2m 1 address bits.3 Now, by
^
we easily find
denoting x^ ¼ a^ þ i b,
then the recurrence (5), initialized at step j0 ¼
bq=kc with the values
pffiffiffi defined in Eq. (10) returns a
representation of d in radix r with digits in
fa; . . . ; þag which can be selected by rounding
the residual estimate of fractional bits.
x
1 mþ1
^
^
^ 1 2jjx xjj1 ^ 4jjx xjj
1 2
x
x 1
1
2.4. Two Ways of Prescaling the Argument
Our method requires that, from a given argument x, we
obtain a complex scaling factor K such that jjKx q
by a table-lookup. We must also get
1jjp
1 ffiffiffi<
ffi 2
1= K . Let us examine two approaches to obtaining K.
2.4.1. First Approach: Direct Table-lookup. We
assume that
1
jjxjj1 < 1:
2
This is obtained by a mere shift of the argument x.
We can also assume that <ðxÞ and =ðxÞ are both
nonnegative, and that <ðxÞ > =ðxÞ. This comes from
the elementary properties:
8
< 1
aþib
: 1
aþib
¼
1
i bþia
¼
ð1=ða þ ibÞÞ:
and
(
K 0 ¼ iK
pffiffiffiffiffiffiffiffiffiffiffiffi
a þ ib
)
¼
ffi ¼ pffiffiffi
i p1ffiffiffi
K0
pffiffiffiffiffiffiffiffiffiffiffiffi K
a þ ib:
Therefore, to assure that jjKx 1jj1 will be less than
2q , it suffices to choose m ¼ q þ 1. Hence the
lookup table will have 2q þ 1 address bits.
2.4.2. Second Approach: Two-step Table-lookup.
Another solution is to first use the previous method
with a much smaller value
offfi q, so that from a and b
pffiffiffiffiffi
we get a1 , b1 , K1 and 1= K1 that satisfy
j1 a1 j
jb1 j
a1
b1
< 2q1
< 2q1
¼ K1 a
¼ K1 b
To do that, we need a table with 2q1 þ 1 address
bits. Then, we use the same method again (with a
different table),
pffiffiffiffiffiffi to find from a1 and b1 , values a2 , b2 ,
K2 and 1= K2 that satisfy
j1 a2 j
jb2 j
a2
b2
< 2q
< 2q
¼ K 2 a1
¼ K 2 b1
Since the first q1 fractional bits of a1 are zeros or
ones, and since the first q1 fractional bits of b1 are
zeros, the second table lookup requires a tables with
Complex Square Root with Operand Prescaling
Table 1. Parameters q and and number of address bits as
function of r and a.
Case
r
a
q
P1
(# address bits)
P2
(# address bits)
1.
2
1
4
4
9
7
2.
2
1
6
3
13
9
3.
4
2
6
5
13
9
4.
4
3
6
3
13
9
5.
8
4
9
5
19
12
6.
8
5
9
3
19
12
7.
8
6
6
5
13
9
8.
8
7
6
4
13
9
9.
16
8
12
6
25
15
10.
16
9
8
9
17
11
11.
16
10
8
5
17
11
12.
16
11
8
4
17
11
13.
16
12
8
3
17
11
14.
16
15
8
2
17
11
2m address bits, where m ¼ ðq þ 1Þ ðq1 1Þ (the
B1’’ in Bq1 1’’ comes from the fact that we need
to know the sign of b1 and whether the first bits of a1
are all zeros p
orffiffiffiffiones).pAfter
we
ffiffiffiffiffiffi that,
pffiffiffiffiffi
ffi compute K ¼
K1 K2 and 1= K ¼ 1= K1 1= K2 .
Hence we need to lookup a table with 2q1 þ 1
address bits and a table with 2ððq þ 1Þ ðq1 1ÞÞ
address bits. The best solution is obtained by
equating the sizes of both tables, which gives
q1 ¼ 2qþ3
4 , i.e.,
q1 ¼ dq=2 þ 3=4e
Therefore, the tables have q þ 3 address bits each.
2.5. Relations among Algorithm Parameters
Table 1 gives parameters q and that satisfy Eq. (14),
depending on r and a. The table required by the
prescaling step has 2q þ 1 address bits if we perform
a direct table look-up. With the two-step prescaling
method, the tables use q þ 3 address bits. Table 1 also
indicates the number of address bits for the direct (P1)
and the two-step prescaling (P2) approaches. Note
that Case 4 (radix 4 max. redundancy) and Case
8 (radix 8 max. redundancy) have the same table size
requirements. Since the prescaling for the complex
square root is similar to the one for the complex
division algorithm [11], a combined scheme is
25
possible, which makes hardware implementation more
attractive.
3.
Implementation and Comparison
We are considering only the computation of the
significand, ignoring exponent handling and related
adjustments to the argument.pThe
ffiffiffi argument is x ¼
ðxR ; xI Þ and the result is z ¼ x ¼ ðzR ; zI Þ with the
real and imaginary components in a fixed-point
format. The overall scheme for computing complex
square root is outlined in Fig. 2.
There are four main parts in the scheme: prescaling,
recurrence evaluation, postscaling, and on-the-fly conversion with rounding. The prescaling part uses a table
lookup to determine the scaling
K ¼ ðK R ; K I Þ
pffiffiffifactor
ffi
and the postscaling factor 1= K ¼ ðCR ; CI Þ, and a
complex multiplier to obtain the scaled argument
d ¼ x K. The postscaling factor ispneeded
to obtain
ffiffiffi
pffiffiffi
z from the computed result as d as z ¼ d
ðCR ; CI Þ.
The real and imaginary residual recurrences can be
implemented using a modified conventional squareroot recurrence. As discussed in [10], it is convenient
to define the residual recurrence as
w½j þ 1 ¼ rw½j þ F½j
ð15Þ
F½j ¼ 2S½jsjþ1 s2jþ1 r ðjþ1Þ
ð16Þ
where
Since s½j digits are produced in signed-digit form,
the partial result S½j is also in signed-digit form.
Depending on the adder used, S½j is converted to
adapt to the adder. If a carry-save adder is used, F
has to be in two_s complement form. This conversion can be done on-the-fly. In this paper we assume
that the adder is of a carry-save type. We now apply
the form of Eq. (15) to the complex residual recurrence 5:
8
>
>
>
>
>
>
>
<
R
R 2 j1
¼ 2sR
FR ½j
jþ1 S ½j ðsjþ1 Þ r
GR ½j
¼ 2sIjþ1 SI ½j þ ðsIjþ1 Þ2 r j1
W R ½j þ 1 ¼
rW R ½j þ FR ½j þ GR ½j
I
R
>
¼ 2sR
F ½j
>
jþ1 S ½j þ 1
>
>
I
I
>
>
G ½j
¼ 2sjþ1 SI ½j
>
: I
W ½j þ 1 ¼
rW I ½j þ FI ½j þ GI ½j :
ð17Þ
26
Ercegovac and Muller
xR
C
xI
R
C
PRESCALING
d
R
d
I
I
R
w I[j]
w [j]
* S R[j]
*
Re RESIDUAL
Im RESIDUAL
#
I
sj+1
R
sj+1
Re LR MULTS;
RECODE;ADD
sr
Re OFC/RND
# S I[j]
R
I
I
sR
jþ1 C sjþ1 C
I
I
R
sR
jþ1 C sjþ1 C
Im LR MULTS;
RECODE;ADD
si
ð19Þ
These are implemented using four LRCF multipliers
and the corresponding recurrences are
Im OFC/RND
zR
Figure 2.
per step. Since these digits are in an over-redundant
digit set, they are recoded and added to produce the
real and imaginary digits in signed-digit form. These
are then used sequentially by on-the-fly converters to
produce the final real and imaginary parts of the
result in a conventional form. The implementation
combines postscaling with on-the-fly conversion.
Moreover, as discussed in [10], the rounding can be
integrated with the on-the-fly conversion. These
multipliers do not have final adders since the product
digits are used left-to-right in redundant form.
We now discuss the postscaling and conversion in
more detail. In step j the following increments are
added to the accumulated real and imaginary partial
products:
zI
Overall scheme for computing complex square root.
8 R
U ½j þ 1
>
>
>
> V R ½j þ 1
>
>
>
>
> pr1jþ1
>
<
pr2jþ1
I
U
½j þ 1
>
>
>
>
I
>
V
½j
þ 1
>
>
>
>
pi1
>
jþ1
:
pi2jþ1
¼
¼
¼
¼
¼
¼
¼
¼
rðfracðU R ½j þ CR sR
jþ1 ÞÞ ¼ rðfracðURÞÞ
rðfracðV R ½j CI sIjþ1 ÞÞ ¼ rðfracðVRÞÞ
intðURÞ
intðVRÞ
rðfracðU I ½j þ CI sR
jþ1 ÞÞ ¼ rðfracðUIÞÞ
rðfracðV I ½j CR sIjþ1 ÞÞ ¼ rðfracðVIÞÞ
intðUIÞ
intðVIÞ
ð20Þ
sR
jþ1
sIjþ1
The real
and imaginary
digits are selected
by rounding of the corresponding shifted residual
R ½j and rWd
R ½j,
½j respectively:
estimates rWd
sR
jþ1
¼
sIjþ1
¼
R ½jÞ
SELðrWd
½j
d
SELðrW I ½jÞ
½j
where fracðgÞ and intðgÞ are the fractional and
integer parts of g, respectively. Since
jURj r þ ðr 1Þ a
ð21Þ
ð18Þ
where SEL function consists of rounding the residual
estimate and recoding of the result digit into radix-4
signed digits.
The block-diagram of implementation of the
complex recurrence is shown in Fig.
pffiffiffi3.
The postscaling to obtain z ¼ d ðCR ; CI Þ is
performed using four sequential left-to-right carryfree (LRCF) multipliers without final adders [8].
Each LRCF multiplier produces one product digit
the range of pr1jþ1 exceeds one radix r digit.
Similarly for the other output digits. Dividing the
postscaling factors by r, we obtain the output digits
in the set fðr þ a 1Þ; . . . ; ðr þ a 1Þg which are
recoded to the set fðr 1Þ; . . . ; r 1g to simplify
the remaining modules. After recoding the digits are
added using on-line addition to produce the real
(imaginary) radix-r signed-digits of the result. These
digits are then converted using on-the-fly conversion
to obtain the final result in conventional radix-r
representation. The design details are omitted.
Complex Square Root with Operand Prescaling
The proposed scheme has an estimated latency
Table 2.
Tproposed ¼ tprescal þ titer þ tpostscale þ trecode
Module
Area
Register
Areg ¼ 0:6n
þ tOLadd þ tconvertrnd
ð22Þ
For r 16 we estimate the delays of the terms in the
expression for Tprop as follows. tprescal is the time to
perform the table lookup to get K and C and perform
d ¼ x K which we estimate to be 2tcycle . The
iteration time is titer ¼ ðn j0 Þtcycle . The postscaling
and conversion/rounding are overlapped with the
residual recurrence: trecode ¼ 2tcycle because of the
scaling of C as discussed above; the remaining stages
have single cycle delay. We estimate that the cycle
time tcycle of the recurrence loop (see Fig. 3),
measured in full-adder delays (tFA ), is
tcycle ¼ tSEL þ tF;G þ t½6:2 þ treg ¼ 6tFA
ð23Þ
where tSEL ¼ 1:5tFA is the delay of the selection
function, tF;G ¼ 1:5tFA is the delay of the F (G)
S R[j]
R
S I[j]
rw [j]
round
R
sj+1
F
sRj+1
R
I
sj+1
F R[j]
G
R
G R[j]
R
w [j]
[6:2] ADDER
registers not shown
R
w [j+1]
R
round
I
S [j]
rw I[j]
R
sj+1
F
I
I
I
sj+1
F [j]
S [j]
I
sj+1
G
I
I
G [j]
w I[j]
27
Area of primitive modules (in AFA units).
2-to-1 MUX
A21mux ¼ 0:4n
k-to-1 MUX, k=3, 4
Akmux ¼ 0:8n
[2:1] adder
A½2:1 ¼ 0:5nlogn
[3:2] adder.
A½3:2 ¼ n
[4:2] adder
A½4:2 ¼ 2:3n
[6:2] adder
A½6:2 ¼ 4:3n
SEL (round and recode
ASEL ¼ 6
On-the-fly converter
Aofc ¼ 2A21mux þ 2Areg ¼ 2n
OFC with rounding
Aofcrnd ¼ 2A21mux þ 3Areg ¼ 3n
LR multiplier
ALRmul ¼ Akmux þ A½3:2 þ 2Areg ¼ 3n
Complex rect.
multiplier
ACmul ¼ 4ð4Akmux þ A½3:2 þ A½4:2 Þþ
Divider r16
Adiv ¼ Asel þ 2Akmux þ A½3:2 þ 2Areg þ
2A½2:1 þ 2Areg ¼ ð27:2 þ 0:5lognÞn
Aofcrnd ¼ 120 þ 1:6n þ 2n þ 1:2nþ
3n ¼ 120 þ 7:8n
SQRT r16
Asqrt Adiv
network, t½6:2 ¼ 3tFA is the delay of the redundant
adder, and treg ¼ 0:5tFA . For example, for r ¼ 16 and
54-bit significand, we estimate that Tproposed ¼
ð2 þ 11 þ 2 þ 1 þ 1 þ 1Þtcycle ¼ 18tcycle ¼ 108tFA .
To get a rough comparison of the latency of the
proposed scheme with a conventional complex
square-root software implementation, we consider
the algorithm defined in [21] which uses 3 real
square-root and 2 real division operations in the
floating-point format. There are also several comparison and absolute value operations. The estimated
latency of this implementation is
Tconv ¼ tDIV þ 2tSQR þ tDIV 4tSQR
ð24Þ
To achieve this latency two square-root units in
parallel are required. We ignore exponent processing
and rounding delays. Moreover, assuming a radix-16
digit-recurrence implementation of significand computation, we estimate
tSQR tprescal þ ðn=4Þ tcycle þ tpostscale þ tround
[6:2] ADDER
registers not shown
I
w [j+1]
Figure 3. Block-diagram of implementation of the real and
imaginary residual recurrences.
ð25Þ
where tcycle tSEL þ tFnet þ t½4:2 þ treg ¼ 4tFA . We
assume that prescaling and postscaling is done for
radix 16 so that the selection function is simplified.
28
Ercegovac and Muller
For a 54-bit significand, we estimate that Tconv ¼
4 ð2 þ 14 þ 1 þ 1Þtcycle ¼ 288tFA without taking
into account comparison and absolute value operations. We conclude that the proposed scheme is at
least 2.5 times faster than a conventional one.
To implement the real part of the proposed scheme
we use the following main components: 2 multiple
generators in F and G networks, a [6:2] adder, two
registers, two sequential left-to-right multipliers
without final adder, one on-line adder, and three
registers for on-the-fly conversion with rounding.
Similarly for the imaginary part. In addition we need
a lookup table of 2q þ 1 inputs and a complex
rectangular multiplier. The cost is measured as area
occupied by modules using the area of a full-adder
(AFA Þ as the unit. The area of primitive modules is
given in Table 2.
We also assume r ¼ 16, a ¼ 10, which has prescaling tables of size 211 ð3 9Þ bits. equivalent to
Atbl ¼ 1:9KAFA . The area of the real (imaginary)
recurrence, including postscaling and on-the-fly
conversion/rounding, is
higher radices. The prescaling is more complicated
than in the real case leading to larger lookup tables.
Since the same prescaling is applicable to the digitrecurrence complex division proposed in [11], these
two algorithms can be combined. A rough comparison with a conventional implementation based on
floating-point instructions indicates a significant
speedup of the proposed scheme at a higher cost.
The proposed scheme allows correct rounding.
Areal ¼ ASEL þ 2ðAofc þ Akmux þ A½3:2 Þ þ A½6:2
1. Adaptation to carry-save notation is straightforward.
2. Multiplication by is straightforward: each time we get a
þ 2Areg
Similarly for the imaginary part Aimag . The total area
of the proposed scheme for r ¼ 16 and n ¼ 54 is
estimated as
Acknowledgements
This research was supported in part by the State of
California and STMicroelectronics MICRO Grant.
We thank reviewers for their helpful comments.
Notes
new digit of the square root, we accumulate times that
digit.
3. A straightforward implementation would require 2m address
bits, but we use the fact that a1 ¼ 1.
References
ACSQR ¼ 2Atbl þ ACmul þ Areal þ Aimag þ 4 ALRmult
þ 2 Aofcrnd
7:5KAFA
We estimate that the cost of a conventional
implementation using two FLPT square-units and a
FLPT divider would be
Aconv ¼ Adiv þ 2Asqr 2:5KAFA
which is much smaller than the implementation of
the proposed approach.
4.
Summary
We proposed a new algorithm for complex square
root. It uses two digit-recurrences and prescaling of
the operand to allow result-digit selection by rounding. This makes the proposed scheme suitable for
1. G. Adams, A. M. Finn, and M. F. Griffin, BA Fast
Implementation of the Complex Singular Value Decomposition on the Connection Machine,^ IEEE Trans. Acoust.
Speech Signal Process., 1991, pp. 1129–1132.
2. T. Ahrendt, BFast High-precision Computation of Complex
Square Roots,^ in Proceedings of ISSAC_96, Zurich, Switzerland, 1996.
3. D. Bindel, J. Demmel, W. Kahan, and O. Marques, BOn
Computing Givens Rotations Reliably and Efficiently,^ ACM
Trans. Math. Softw., vol. 28, no. 2, June 2002, pp. 206–238.
4. L. Ciminiera and P. Montuschi, BHigher Radix Square
Rooting,^ IEEE Trans. Comput., vol. 30, no. 10, October
1990, pp. 1220–1231.
5. D. Das Sarma and D. W. Matula, BFaithful Bipartite ROM
Reciprocal Tables,^ in Proceedings of the 12th IEEE
Symposium on Computer Arithmetic, Bath, UK, S. Knowles
and W. McAllister (Eds.), IEEE Computer Society Press, Los
Alamitos, CA, July 1995.
6. F. de Dinechin and A. Tisserand, BSome Improvements on
Multipartite Table Methods,^ in Proceedings of the 15th IEEE
Symposium on Computer Arithmetic, Vail, Colorado, L.
Ciminiera and N. Burgess (Eds.), IEEE Computer Society
Press, Los Alamitos, CA, June 2001.
Complex Square Root with Operand Prescaling
7. M. A. Elliot, G. A. Walker, A. Swift, and K. Vandenborne,
BSpectral Quantitation by Principal Component Analysis using
Complex Singular Value Decomposition,^ Magn. Reson.
Med., vol. 41, 1999, pp. 450–455.
8. M. D. Ercegovac and T. Lang, BFast Multiplication without
Carry-propagate Addition,^ IEEE Trans. Comput., vol. C-39,
no. 11, November 1990, pp. 1385–1390.
9. M. D. Ercegovac and T. Lang, Division and Square Root:
Digit-recurrence Algorithms and Implementations, Kluwer,
Boston, MA, 1994.
10. M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan
Kaufmann, San Mateo, CA, 2004.
11. M. D. Ercegovac and J.-M. Muller, BComplex Division with
Prescaling of the Operands,^ in Proc. Application-Specific
Systems, Architectures, and Processors (ASAP_03), The
Hague, The Netherlands, June 24–26, 2003.
12. M. D. Ercegovac and J.-M. Muller, BComplex Square Root
with Operand Prescaling,^ in Proc. Application-Specific
Systems, Architectures, and Processors (ASAP_04), Galveston,
TX, 27–29 September, 2004, pp. 52–62.
13. N. Hemkumar and J. Cavallaro, BA Systolic VLSI Architecture
for Complex SVD, Proc. of the IEEE International Symposium
on Circuits and Systems,^ 1992, pp. 1061–1064.
14. T. E. Hull, T. F. Fairgrieve, and P. T. P. Tang, BImplementing
Complex Elementary Functions Using Exception Handling,^
ACM Trans. Math. Softw., vol. 20, no. 2, 1994, pp. 215–244.
15. W. Kahan, BBranch Cuts for Complex Elementary Functions,
or Much Ado About Nothing_s Sign Bit,^ in The State of the
Art in Numerical Analysis, Clarendon Press, Oxford, 1987.
16. R. D. Mcilhenny, Complex Number On-line Arithmetic for
Reconfigurable Hardware: Algorithms, Implementations, and
Applications. PhD thesis, University of California at Los
Angeles, 2002.
17. T. Lang and P. Montuschi, BHigher Radix Square Root with
Prescaling,^ IEEE Trans. Comput., vol. 41, no. 8, 1992, pp.
996–1009.
18. T. Lang and P. Montuschi, BVery High Radix Square Root with
Prescaling and Rounding and A Combined Division/Square Root
Unit,^ IEEE Trans. Comput., vol. 48, no. 8, 1999, pp. 827–841.
19. J. Mitroy and I. A. Ivallov, BQuantum Defect Theory for the
Study of Hadronic Atoms,^ J. Phys., G, Nucl. Part. Phys., vol.
27, 2001, pp. 1–13.
20. P. Montuschi and M. Mezzalama, BSurvey of Square Rooting
Algorithms,^ IEE Proceedings E: Computers and Digital
Techniques, vol. 137, no. 1, pp. 31–40.
21. W. Press, S. A. Teukolski, W. T. Vetterling, and B. F.
Flannery, Numerical Recipes in C, 2nd Edition, Cambridge
University Press, 1992.
22. M. J. Schulte and J. E. Stine, BApproximating Elementary
Functions with Symmetric Bipartite Tables,^ IEEE Trans.
Comput., vol. 48, no. 8, Aug. 1999, pp. 842–847.
23. J. Salo, J. Fagerholm, A. T. Friberg, and M. M. Salomaa,
BUnified Description of Nondiffracting X and Y Waves,^
Phys. Rev., vol. E 62, 2000, pp. 4261–4275.
24. D. M. Smith, BAlgorithm 768: Multiple-Precision Complex
Arithmetic and Functions,^ ACM Trans. Math. Softw., vol. 24,
no. 4, December 1994, pp. 359–367.
25. R. D. Susanto, Q. Zheng, and X.-H. Yan, BComplex Singular
Value Decomposition,^ J. Atmos. Ocean. Technol., vol. 15,
no. 3, 1998, pp. 764–774.
29
Dr. Miloš D. Ercegovac is a Professor and a former Chair in
the Computer Science Department of the Henry Samueli
School of Engineering and Applied Science, University of
California at Los Angeles, where he has been on the faculty
since 1975. He earned his MS (1972) and PhD (1975) in
computer science from the University of Illinois, UrbanaChampaign, and BS in electrical engineering (1965) from the
University of Belgrade, Serbia. Dr. Ercegovac has specialized
for over 30 years in research and teaching in digital arithmetic,
digital and computer system design, and parallel architectures.
His life-long dedication to teaching and research has also
resulted in several co-authored books: two in the area of
digital design (Digital Systems and Hardware/Firmware
Algorithms, Wiley & Sons, 1985, and Introduction to Digital
Design, Wiley & Sons, 1999), and two in digital arithmetic
(Division and Square Root: Digit-Recurrence Algorithms and
Implementations, Kluwer Academic Publishers, 1994, and
Digital Arithmetic, Morgan Kaufmann Publishers—a Division
of Elsevier, 2004.) Dr. Ercegovac has been involved in
organizing the IEEE Symposia on Computer Arithmetic since
1978. He served as an associate editor of the IEEE Transactions on Computers 1988–1992 and as a subject area editor
for the Journal of Parallel and Distributed Computing 1986–
1993. Dr. Ercegovac_s work has been recognized by his
election in 2003 to IEEE Fellow and to Foreign Member of the
Serbian Academy of Sciences and Arts in Belgrade, Serbia.
He is also a member of the ACM and of the IEEE Computer
Society.
Dr Jean-Michel Muller was born in Grenoble, France, in
1961. He received his PhD degree in 1985 from the
Institut National Polytechnique de Grenoble. He is
30
Ercegovac and Muller
Directeur de Recherches (senior researcher) at CNRS,
France, and he is the former head of the LIP laboratory
(LIP is a joint laboratory of CNRS, Ecole Normale Supérieure de Lyon, INRIA and Université Claude Bernard Lyon
1). His research interests are in Computer Arithmetic. Dr
Muller was co-program chair of the 13th IEEE Symposium
on Computer Arithmetic (Asilomar, USA, June 1997),
general chair of the 14th IEEE Symposium on Computer
Arithmetic (Adelaide, Australia, April 1999). He is the
author of several books, including Elementary Functions,
Algorithms and Implementation (2nd edition, Birkhäuser
Boston, 2006). He served as associate editor of the IEEE
Transactions on Computers from 1996 to 2000. He is a
senior member of the IEEE.