Low-Latency SRT Division and Square Root Based on Remainder and Quotient Prediction∗

Chinese Journal of Electronics
Vol.26, No.1, Jan. 2017
PENG Yuanxi, CHEN Jiyang, LEI Yuanwu, HE Tingting and DENG Ziye
(College of Computer, National University of Defense and Technology, Changsha 410073, China)
Abstract — The Sweeney, Robertson and Tocher (SRT) algorithm is a common and efficient method for division and square root (div/sqrt). We propose overlapping two radix-4 iterations in one cycle by predicting the remainder and the quotient. To reduce latency, a redundant representation of the remainder is used together with a minimum redundancy factor. Division and square root are integrated into one unit, which reduces hardware cost. With a 40nm technology library, the area of our architecture after layout design is 37795µm², the power is 81.19mW and the delay is only 656ps. Double-precision division and square root take 17 and 16 cycles, respectively. Experiments show that our architecture achieves small latency and high frequency, together with modest area and power.
Key words — SRT, Predicting remainder and quotient, Minimum redundancy factor, Redundant representation, Division and square root.
I. Introduction
Division and square root are two basic operations in high performance computing, digital signal processing and communication. As Oberman and Flynn[1] pointed out, floating-point division instructions account for only 3% of all instructions, and square root for even less. However, improving the efficiency of these two operations by 1% can improve the performance of the whole processor by 20%[2]. Combined division and square root units based on the Sweeney, Robertson and Tocher (SRT) algorithm are widely used in processors. Intel Pentium CPUs[3], ARM processors[4] and IBM FPUs[5] use SRT4, and Intel Core2[6] uses SRT16, to implement the two operations.
However, the SRT algorithm faces many challenges in the tradeoff between performance and cost. Low radix SRT is easy to implement, but its computing speed is slow. Pipelining can achieve high throughput, but it consumes much extra hardware. Some researchers have proposed implementing high radix SRT by enlarging the lookup table and adding adders, which makes the table delay increase linearly and the area increase quadratically[6].
To balance performance and cost, we present a combined division and square root (div/sqrt) structure in which two SRT4 iterations are overlapped in one cycle to implement SRT16 by predicting the remainder and the quotient. The structure obtains the high computing speed of radix-16 while keeping the low hardware complexity of radix-4. In addition, a redundant representation of the remainder is used, so the remainder can be generated with Carry save adders (CSA) instead of Carry propagation adders (CPA), and the use of a minimum redundancy factor ensures that the multiplication of the divisor (or partial square root) needs only a shifter and an adder instead of a multiplier. Furthermore, integrating division and square root allows modules such as the lookup table and the adders to be reused, which reduces hardware overhead. Our method of predicting the remainder and the quotient achieves small latency and high frequency, together with modest area and power.
The remainder of this paper is organized as follows. Section II reviews related work and surveys the SRT algorithm. Our SRT16 div/sqrt structure, implemented by predicting the remainder and the quotient, is presented in Section III. Comparisons among different SRT structures are given in Section IV. Section V draws the conclusion.
II. Related Work and Background
1. Related work
Various approaches have been proposed to increase the performance of division and square root, including combining the two operations, using a redundant representation instead of a non-redundant one, choosing an appropriate redundancy factor, and using high radix SRT.
∗ Manuscript Received Nov. 24, 2014; Accepted May 17, 2015. This work is supported by Aerospace Science Fund of China (No.2013ZC88003), and National Natural Science Foundation of China (No.61402499).
© 2017 Chinese Institute of Electronics. DOI:10.1049/cje.2016.10.024
Because division and square root share a similar iterative process, they can be combined into one unit that saves hardware cost by reusing some modules. Such a combination has been done for radix-8 in Ref.[7] and for radix-4 in Refs.[8–10].
The partial remainder can be represented in two different forms, redundant or non-redundant. If the remainder is in non-redundant form, the subtraction that generates the new remainder requires CPAs, which is time consuming[11]. If the remainder is represented in redundant form, CSAs can be used to reduce latency.
The redundancy factor determines the digit set of the SRT algorithm. Allowing a larger set of quotient digits reduces the complexity of quotient digit selection, but it requires more digit multiples, which typically cost extra initial delay and area. Choosing a smaller digit set simplifies the multiplication of the divisor (or partial square root). As mentioned in Ref.[12], the radix-4 SRT with a minimum redundancy factor is the best tradeoff between area and speed.
The hardware complexity of low radix SRT is small, but many iterations are needed and the computing speed is slow[13−15]. High radix SRT has been proposed to reduce the number of iterations. Refs.[16,17] implement high radix by enlarging the lookup table and adding adders, but the overhead grows rapidly[6].
To trade off performance against cost, some researchers concentrate on overlapping one stage with another to implement high radix SRT. W.Liu[18], A.Nannarelli[19] and I.Rust[20] propose to overlap two stages by predicting only the quotient, while Intel Core2 Penryn[3] and IBM Eserver[21] predict only the remainder. Building on this previous work, our div/sqrt structure combines the two methods to obtain a better tradeoff.
2. Background
The iterative formula for division is:
$$p[j+1] = r \cdot p[j] - d \cdot q_{j+1} \qquad (1)$$
and for square root is:
$$p[j+1] = r \cdot p[j] - 2S[j]\,q_{j+1} - q_{j+1}^{2}\, r^{-(j+1)} \qquad (2)$$
where $j$ is the iteration number, $d$ is the divisor, $r$ is the radix, $p[j]$ is the partial remainder after the $j$th iteration, $q_{j+1}$ is the quotient digit selected in the $(j+1)$th iteration, and $S[j]$ is the partial square root after the $j$th iteration.
The two equations can be combined into one expression:
$$R[j+1] = r \cdot R[j] - F[j] \qquad (3)$$
in which $R[j]$ is the remainder in the $j$th iteration, and
$$F[j] = \begin{cases} d \cdot q_{j+1}, & \text{division} \\ 2S[j]\,q_{j+1} + q_{j+1}^{2}\, r^{-(j+1)}, & \text{square root} \end{cases}$$
Based on the combined expression, the quotient digit is chosen by a selection function:
$$q_{j+1} = SEL(r \cdot R[j],\ d/S[j])$$
where $d/S[j]$ denotes $d$ for division and $S[j]$ for square root. The main part of the conventional SRT4 is shown in Fig.1.
Fig. 1. SRT4 div/sqrt structure
In our SRT16 algorithm, two iterations are overlapped in each cycle and 2 bits of quotient are generated by each iteration. The quotient of the $(j+1)$th cycle can be expressed as[18]:
$$q_{j+1} = 4\,qH_{j+1} + qL_{j+1} \qquad (4)$$
The iterative formulas of the two iterations are:
$$RH[j+1] = 4\,RL[j] + FL[j], \qquad RL[j+1] = 4\,RH[j+1] + FH[j+1] \qquad (5)$$
where, in contrast to Eq.(3), the minus sign is absorbed into $F$:
$$FL[j] = \begin{cases} -d \cdot qL_{j}, & \text{division} \\ -2S[j]\,sL_{j} - sL_{j}^{2}\, 4^{-(j+1)}, & \text{square root} \end{cases}$$
$$FH[j+1] = \begin{cases} -d \cdot qH_{j+1}, & \text{division} \\ -2S[j]\,sH_{j+1} - sH_{j+1}^{2}\, 4^{-(j+1)}, & \text{square root} \end{cases}$$
The selection formulas are:
$$qH_{j+1} = SEL(RL[j],\ d/S[j])$$
$$qL_{j+1} = SEL(RH[j+1],\ d/S[j])$$
III. Div/Sqrt Structure Based on Remainder and Quotient Prediction
Recent studies have presented overlapping methods to implement high radix SRT. However, predicting only the remainder or only the quotient cannot achieve the best tradeoff between performance and cost. We therefore present a method that combines remainder and quotient prediction to obtain a smaller latency and a higher speed at a modest cost.
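The idea of building one radix-16 step from two overlapped radix-4 steps can be illustrated for division with a short software sketch. It is only a behavioral model under stated assumptions, not the proposed hardware: the selection function `sel` rounds the exact shifted remainder instead of reading a truncated lookup table, the operands are assumed to be normalized to [1, 2), and the names `sel` and `srt16_divide` are hypothetical.

```python
from fractions import Fraction

DIGIT_MAX = 2  # radix-4 digit set {-2, ..., 2}, redundancy factor 2/3


def sel(shifted_rem, d):
    """Idealized selection: nearest digit to shifted_rem / d, clamped.

    Assumption: an exact comparison replaces the truncated lookup table.
    """
    q = round(shifted_rem / d)
    return max(-DIGIT_MAX, min(DIGIT_MAX, q))


def srt16_divide(x, d, cycles):
    """Return (quotient estimate, radix-16 digits) for 1 <= x, d < 2."""
    x, d = Fraction(x), Fraction(d)
    rem = x / 4            # pre-scale so that |rem| <= (2/3) * d
    quotient = Fraction(0)
    digits = []
    for c in range(1, cycles + 1):
        q_hi = sel(4 * rem, d)          # first radix-4 step
        rem_hi = 4 * rem - d * q_hi
        q_lo = sel(4 * rem_hi, d)       # second radix-4 step
        rem = 4 * rem_hi - d * q_lo
        q16 = 4 * q_hi + q_lo           # one radix-16 digit per cycle
        digits.append(q16)
        quotient += Fraction(q16, 16 ** c)
    return 4 * quotient, digits         # undo the pre-scaling
```

In this model the error of the returned quotient is bounded by (8/3)·16^(−cycles), because the final remainder never exceeds (2/3)·d in magnitude.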
As shown in Fig.2, the overall structure of the div/sqrt unit consists of seven main modules, whose functions are as follows:
1) QTRAN: transforms the quotient digits from redundant to non-redundant form.
2) HFGEN: generates F for the first iteration.
3) LFGEN: generates F for the second iteration.
4) CSA: calculates the new partial remainder; its two inputs are the remainder of the last iteration and F.
5) QSL: calculates the quotient digit for each iteration.
6) MUXQ: selects the exact quotient digit; its inputs are all possible quotient digits.
7) MUXR: selects the exact remainder; its inputs are all possible remainders, and its data width differs from that of MUXQ.
Fig. 2. Overall structure of div/sqrt unit
Fig. 3. Basic unit CSA of remainder formation
In our design, the remainders of the two iterations and the quotient of the second iteration are predicted in parallel, as shown by the dotted boxes in Fig.2, so that the two iterations can be overlapped and the latency of each cycle is small. The delays of the timing paths are reported in Table 1, which lists the delay from each starting node (register) to each arriving node. The critical path (highlighted in Fig.2) is 656ps and lies between register qH/qL and RL. The frequency is the reciprocal of the critical delay, which is 1524MHz.
Table 1. Timing paths (ps)
From/To   qL    RL
qH/qL     641   656
RL        602   650
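The frequency stated for the critical path follows directly from its delay. As a quick arithmetic check (the helper name `max_frequency_mhz` is hypothetical and only serves this illustration):

```python
def max_frequency_mhz(critical_path_ps):
    """Maximum clock frequency in MHz for a critical path given in ps."""
    return 1e6 / critical_path_ps


# Critical path from register qH/qL to RL, taken from Table 1.
print(int(max_frequency_mhz(656)))  # prints 1524
```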
1. Predicted remainder formation
Remainder formation generates the new partial remainders, with the CSA as its basic unit. To ensure accuracy, we extend the dividend, the divisor and the radicand to 56 bits (1 integer bit and 55 fractional bits), so the CSAs used in our structure are CSA56. From the combined expression in Section II.2, the inputs of the CSA are the remainder of the last iteration and F, and the output is the new partial remainder. The implementation of CSA56 is shown in Fig.3, which consists of 56 CSA3-2 cells.
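The behaviour of a row of 3-2 carry-save adders can be sketched in software as follows. This is an illustrative bit-level model, assuming 56-bit words that wrap modulo 2^56; the function name `csa_3_2` is hypothetical and the sketch does not describe the actual netlist.

```python
WIDTH = 56
MASK = (1 << WIDTH) - 1


def csa_3_2(a, b, c):
    """Compress three WIDTH-bit words into a (sum, carry) pair.

    Each bit position is an independent full adder, so no carry
    propagates across the word.
    """
    s = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry
```

The pair satisfies (sum + carry) mod 2^56 = (a + b + c) mod 2^56, so the carry-propagate addition can be deferred until a non-redundant value is really needed.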
As shown in Fig.4(b), all possible remainders $P{-}R[j+1]$ are predicted in parallel with CSAs, and the correct one, $R[j+1]$, is then selected among these values according to the quotient $q_{j+1}$ from the Quotient digit selection module (QSL). Compared with the traditional method shown in Fig.4(a), remainder calculation and quotient selection are overlapped in our design. The area, power and delay of each module are shown in Table 2. Our delay is $t_{delay} = \max(t(CSA), t(QSL)) + t(MUXR) = t(QSL) + t(MUXR)$, whereas in the traditional method the new remainder is generated only after the correct F has been selected by the quotient $q_{j+1}$, and the delay is $t_{delay} = t(QSL) + t(MUXR) + t(CSA)$. Our remainder prediction therefore achieves a smaller latency.
Table 2. Layout results of all modules
Unit    Area(µm²)  Power(µW)  Crit.Path(ps)
QSL     204.9      10.1       104
CSA     511.8      31.6       52
MUXQ    49.8       2.8        75
MUXR    451.0      26.5       93
HFGEN   730.2      42.2       102
LFGEN   2398.1     142.2      123
QTRAN   677.3      42.2       105
Fig. 4. Remainder formation. (a) Traditional method; (b) Predicted remainder method
The number of possible F values and of CSAs used here is determined by the digit set, which depends on the redundancy factor. As mentioned in Ref.[22], the radix-4 SRT with a minimum redundancy factor is the best tradeoff between area and speed. In this case, our redundancy factor is ρ = 2/3 and the quotient digit set is {−2, −1, 0, 1, 2}, so 5 CSAs and a 5-to-1 selector MUXR are needed.
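The predicted remainder scheme can be sketched behaviorally for division: all five candidates are formed before the quotient digit is known, and a 5-to-1 selection then picks the right one. The sketch also evaluates the two delay expressions of this section with the QSL, CSA and MUXR delays of Table 2. It uses exact rational arithmetic instead of carry-save words, and the names `predict_remainders`, `next_remainder` and `DELAY_PS` are hypothetical.

```python
from fractions import Fraction

DIGITS = (-2, -1, 0, 1, 2)


def predict_remainders(rem, d):
    """Form every candidate 4*rem - q*d (one CSA per digit in hardware)."""
    return {q: 4 * rem - q * d for q in DIGITS}


def next_remainder(rem, d, q):
    """MUXR: pick the candidate that matches the selected digit q."""
    return predict_remainders(rem, d)[q]


# Module delays in ps, taken from Table 2.
DELAY_PS = {"QSL": 104, "CSA": 52, "MUXR": 93}

predicted_ps = max(DELAY_PS["CSA"], DELAY_PS["QSL"]) + DELAY_PS["MUXR"]
traditional_ps = DELAY_PS["QSL"] + DELAY_PS["MUXR"] + DELAY_PS["CSA"]
```

With these module delays the predicted path evaluates to 197ps and the traditional one to 249ps, which illustrates why hiding the CSA behind the quotient selection shortens each iteration.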
2. Predicted quotient digit selection
Based on our redundancy factor ρ = 2/3, the quotient selection lookup table is addressed by 5 bits of the partial remainder and 4 bits of $d/S[j]$ ($d$ is the divisor and $S[j]$ is the partial square root)[23]. Because the remainder is in redundant representation, a carry-propagate adder is first used to add the carry and sum words; the first 5 bits of the result, together with the first 4 bits of $d/S[j]$, are then fed into the lookup table, and 2 bits of quotient are obtained in each iteration. The process of QSL is shown in Fig.5 and the lookup table used in our design is given in Table 3.
Fig. 5. Process of quotient digit selection
Table 3. SRT4 lookup table
t          8   9  10  11  12  13  14  15
16·m2(i)  13  14  16  17  18  20  21  22
16·m1(i)   4   4   4   5   5   5   6   6
Since QSL is typically in the critical path[6], the implementation of this module has a great effect on the critical latency. In this paper, all possible quotients $P{-}qL_{j+1}$ of the second iteration are predicted in parallel, as shown in Fig.6(b), and the correct one, $qL_{j+1}$, is then selected according to the quotient $qH_{j+1}$ of the first iteration. Compared with the traditional quotient selection method in Fig.6(a), the quotient selection delays of the two iterations are overlapped in our design and the delay is $t_{delay} = t(CSA) + t(QSL) + t(MUXQ)$, whereas in the traditional method the second quotient selection must wait until the correct remainder $RH[j+1]$ has been selected by $qH_{j+1}$ from the first iteration, and the delay is $t_{delay} = 2t(QSL) + t(MUXQ)$. As shown in Table 2, the delay of CSA is smaller than that of QSL and the delay of MUXQ is also smaller than that of MUXR, so our quotient prediction reduces the latency.
Fig. 6. Quotient digit selection. (a) Traditional method; (b) Predicted quotient method
3. Quotient transformation
Because a redundant representation is used, quotient transformation is necessary in our design to convert the quotient digits from redundant to non-redundant form. The method used here is similar to the on-the-fly method mentioned in Ref.[24]. Two registers, Q and QM, are needed to store the quotient and the quotient minus one (in the last digit position), respectively. The formulas of the two registers are:
$$Q[j] = \sum_{i=1}^{j} q_i\, r^{-i} \qquad (6)$$
$$QM[j] = Q[j] - r^{-j} \qquad (7)$$
The updating formulas are:
$$Q[j+1] = \begin{cases} Q[j] + q_{j+1}\, r^{-(j+1)}, & q_{j+1} \ge 0 \\ QM[j] + (r - |q_{j+1}|)\, r^{-(j+1)}, & q_{j+1} < 0 \end{cases} \qquad (8)$$
$$QM[j+1] = \begin{cases} Q[j] + (q_{j+1} - 1)\, r^{-(j+1)}, & q_{j+1} > 0 \\ QM[j] + ((r-1) - |q_{j+1}|)\, r^{-(j+1)}, & q_{j+1} \le 0 \end{cases} \qquad (9)$$
Because of the minimum redundancy factor used in our design, there are five update cases in total, as shown in Table 4.
Table 4. Update of Q and QM
q_{j+1}   Q[j+1]       QM[j+1]
 2        {Q[j], 2}    {Q[j], 1}
 1        {Q[j], 1}    {Q[j], 0}
 0        {Q[j], 0}    {QM[j], 3}
–1        {QM[j], 3}   {QM[j], 2}
–2        {QM[j], 2}   {QM[j], 1}
In our SRT16 structure, which includes 2 radix-4 stages, Q and QM are updated by 4 bits each cycle according to the quotient digits qL and qH, as shown in Fig.7.
Fig. 7. Quotient transformation
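The update rules of Table 4 can be written in a few lines of code. The sketch below is a simplified model under our own assumptions: Q and QM are held as Python integers (the quotient scaled by 4^j), the starting values Q = 0 and QM = −1 are chosen for the model, and the function names `on_the_fly_update` and `convert` are hypothetical.

```python
def on_the_fly_update(Q, QM, q):
    """Append one radix-4 digit q in {-2, ..., 2} following Table 4."""
    if q > 0:
        return 4 * Q + q, 4 * Q + (q - 1)
    if q == 0:
        return 4 * Q, 4 * QM + 3
    return 4 * QM + (4 + q), 4 * QM + (3 + q)


def convert(digits):
    """Convert a signed-digit sequence into a conventional integer."""
    Q, QM = 0, -1
    for q in digits:
        Q, QM = on_the_fly_update(Q, QM, q)
    return Q, QM
```

Every update only appends a digit between 0 and 3 to either Q or QM, which corresponds to a concatenation in hardware, so no carry-propagate subtraction is needed when a negative digit arrives.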
4. F generation
The generation of F differs between division and square root. In division, F is a multiple of the divisor, and because of the digit set {−2, −1, 0, 1, 2} used in our design, all possible values of F can be obtained with a shifter only.
In square root, $F[j] = -(2S[j]\,q_{j+1} + q_{j+1}^{2}\, r^{-(j+1)})$. With the introduction of Q and QM, the partial square root $S[j]$ is stored in the two registers, that is:
$$S[j] = \begin{cases} Q[j], & q_{j+1} \ge 0 \\ QM[j] + r^{-j}, & q_{j+1} < 0 \end{cases}$$
so F in square root can be expressed in terms of Q and QM:
$$F[j] = \begin{cases} -2S[j]\,q_{j+1} - q_{j+1}^{2}\, r^{-(j+1)}, & q_{j+1} \ge 0 \\ 2S[j]\,|q_{j+1}| - q_{j+1}^{2}\, r^{-(j+1)}, & q_{j+1} < 0 \end{cases}$$
$$\phantom{F[j]} = \begin{cases} -(2Q[j] + q_{j+1}\, r^{-(j+1)})\, q_{j+1}, & q_{j+1} \ge 0 \\ (2QM[j] + (2r - |q_{j+1}|)\, r^{-(j+1)})\, |q_{j+1}|, & q_{j+1} < 0 \end{cases} \qquad (10)$$
Supposing Q is represented as the character string aa...a and QM as bb...b, there are also five cases of updating F, shown in Table 5.
Table 5. Update of F
s_{j+1}   In form of S[j]              In form of Q, QM               Character string
 0        0                            0                              0...00000
 1        −2S[j] − 4^{−(j+1)}          −2Q[j] − 4^{−(j+1)}            a...aa001
 2        −4S[j] − 4·4^{−(j+1)}        −4Q[j] − 4·4^{−(j+1)}          a...a0100
–1        2S[j] − 4^{−(j+1)}           2QM[j] + 7·4^{−(j+1)}          b...bb111
–2        4S[j] − 4·4^{−(j+1)}         4QM[j] + 12·4^{−(j+1)}         b...b1100
To generate FH and FL in each SRT16 cycle, the two variables are updated according to Q and QM as shown in Fig.8: FL can be obtained from QL and QML, while FH depends on qH in addition to the two registers.
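The agreement between the S[j] form and the Q/QM form of F in Table 5 can be checked numerically. In this sketch all quantities are integers in units of 4^{−(j+1)}, so that S[j] corresponds to 4·Q and QM = Q − 1, where Q is the integer held in the register. For positive digits the character string is taken as the magnitude of F and negated afterwards. These conventions and the function names are assumptions made for illustration.

```python
def f_from_s(Q, s):
    """F = -(2*S*s + s*s) with S = 4*Q, in units of 4**-(j+1)."""
    return -(2 * (4 * Q) * s + s * s)


def f_from_registers(Q, QM, s):
    """Build F by appending the bit patterns of Table 5 to Q or QM."""
    if s == 0:
        return 0
    if s == 1:
        return -((Q << 3) | 0b001)      # a...a001
    if s == 2:
        return -((Q << 4) | 0b0100)     # a...a0100
    if s == -1:
        return (QM << 3) | 0b111        # b...b111
    if s == -2:
        return (QM << 4) | 0b1100       # b...b1100
    raise ValueError("digit outside {-2, ..., 2}")
```

Both functions return the same value for every digit, which reflects that F can be formed by shifting a register and appending a short constant pattern instead of using a multiplier.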
Fig. 8. F generation. (a) F Generation of the 1st iteration; (b) F Generation of the 2nd iteration
IV. Results of Experiments
1. Functional verification
Golden modules of division and square root are written in the C++ programming language. The same test bench is applied to the golden modules and to our unified div/sqrt unit to validate our design. As shown in Tables 6 and 7, the test results for some special values and random numbers show that our div/sqrt unit based on predicting remainder and quotient meets the accuracy requirement.
Table 6. Test results of division
Dividend          Divisor           Our result        Golden module
7fffffffffffffff  7fffffffffffffff  7fffffffffffffff  7fffffffffffffff
000fffffffffffff  3ffffdff465affff  0000000000000000  0000000000000000
0000000000000000  ffffffffffffffff  0000000000000000  0000000000000000
3ff00d00465a00f0  40080c5100bba000  3fd55bb517c1f725  3fd55bb517c1f724
40050d00465a00f0  3ff80c5100139000  3ffc02f5ae378f15  3ffc02f5ae378f13
Table 7. Test results of square root
Radicand          Our result        Golden module
7fffffffffffffff  7fffffffffffffff  7fffffffffffffff
000fffffffffffff  0000000000000000  0000000000000000
fff0000000000000  fff0000000000000  fff0000000000000
3ff1e78de6e82d47  3ff0ececc814d728  3ff0ececc814d726
400c2c452aba898e  3ffe068a936c58ee  3ffe068a936c58ef
2. Performance analysis
We have implemented our SRT16 div/sqrt design in Verilog. The units are synthesized with Synopsys Design Compiler (DC) using a 40nm technology library under typical conditions (1V, 25°C), and the layout design is implemented with Cadence Encounter.
Table 8 shows the synthesis results of division with three methods: conventional SRT4, conventional SRT16 with redundancy factor ρ = 8/15, and our prediction method. In the “Area” and “Power” rows, “Before” denotes the results from DC and “After” those from Encounter. For all three methods the area before layout is about 30% smaller than the area after layout, and the power increases as well, because layout design includes the effect of wires, which cannot be ignored in the 40nm technology library and which cause extra hardware cost. Moreover, although the conventional SRT16 reduces the execution cycles to 17 compared with 30 for the conventional SRT4, its area and power increase sharply. Our proposed SRT16 achieves the same number of execution cycles as the conventional SRT16, while its area and power are much smaller than those of the conventional one (about 44% in area and about 62% in power after layout). The reason is that our lookup table, being based on SRT4, is not enlarged, and the number of CSAs is 8, whereas the conventional SRT16 needs 16.
Because the iteration processes of division and square root are similar, the two operations can be put into one unit that reuses some modules, so that hardware cost is saved. The synthesis and layout results of the single division unit, the single square root unit and the combined div/sqrt unit using our architecture are shown in Table 9. The improvement obtained by integrating the two operations after layout design is calculated with the formula:
$$\text{improvement} = 1 - \frac{\text{Div/Sqrt}}{\text{Div} + \text{Sqrt}}$$
where Div/Sqrt, Div and Sqrt denote the cost of the combined unit, the single division unit and the single square root unit, respectively. The results show that after layout design, the area of the integrated unit is 9.4% less than the sum of the two single units and the power is 14.3% less.
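The two percentages can be reproduced from the post-layout figures of Table 9. The snippet below is a plain arithmetic check; the function name `improvement` simply mirrors the formula and is not part of the design flow.

```python
def improvement(combined, div, sqrt):
    """Relative saving of the combined unit over two separate units."""
    return 1 - combined / (div + sqrt)


# Post-layout values from Table 9 (area in µm², power in mW).
area_gain = improvement(37795.72, 20070.09, 21639.72)
power_gain = improvement(81.19, 52.08, 42.64)
print(f"{area_gain:.1%} {power_gain:.1%}")  # prints 9.4% 14.3%
```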
Table 8. Division comparison of different radix
Division                 SRT4      SRT16      Our SRT16
Area(µm²)      Before    5483.54   32469.36   14048.96
               After     8348.05   46385.20   20070.09
Power(mW)      Before    6.07      12.45      9.15
               After     19.00     89.06      52.08
Crit.Path(ps)  Before    430       460        440
               After     446       589        569
Cycles                   30        17         17
Table 9. Comparison of single and integrated unit
                         Division   Square root   Div/Sqrt
Area(µm²)      Before    14048.97   15147.35      26456.71
               After     20070.09   21639.72      37795.72
Power(mW)      Before    9.15       6.41          14.15
               After     52.08      42.64         81.19
Crit.Path(ps)  Before    440        530           545
               After     569        629           656
Cycles                   17         16            17/16
As shown in Table 10, different structures of div/sqrt units are compared using a 40nm technology library. Ref.[20] presents an SRT8 architecture based on quotient prediction only, without remainder prediction. With overlapped SRT4 and SRT2 stages, its area is 25546µm² and its delay is 1040ps after layout design. In comparison, our SRT16 design has an area of 37795.72µm² and a delay of 656ps: the increase in area is reasonable given our higher radix, and our critical delay is much smaller. Our frequency reaches 1524MHz, while that of Ref.[20] is only 961MHz. The IBM Eserver[21], which predicts only the remainder and not the quotient, has an area of 25000µm², a delay of 1445ps and a frequency of 692MHz after scaling to radix-8[20]. Considering the area advantage of radix-8 over radix-16, our area increase of 51% over the IBM Eserver[21] is reasonable, and our delay decrease of 55% is advantageous.
Table 10. Comparison of different div/sqrt structures
                  Ref.[20]  Ref.[21]  Ours (Before)  Ours (After)
Area(µm²)         25546     25000     26456          37795
Crit.Path(ps)     1040      1445      545            656
Frequency(MHz)    961       692       1835           1524
Technology(nm)    40        40        40             40
V. Conclusions
In this paper, we design a unified SRT16 div/sqrt structure which predicts both the remainder and the quotient. Quotient selection and remainder generation are executed concurrently, so that the critical path delay is reduced and the operation efficiency is improved. In addition, the redundant representation of the remainder allows the remainder to be generated with CSAs instead of CPAs, which further reduces the path delay. The use of a minimum redundancy factor simplifies the quotient digit multiplication and shrinks the adder tree for remainder generation. Moreover, integrating division and square root allows some modules to be reused, which reduces hardware overhead.
In conclusion, the experimental results show the advantages of our prediction-based structure in terms of the area-delay tradeoff.
References
[1] S.F. Oberman and M.J. Flynn, “Division algorithms and implementations”, IEEE Transactions on Computers, Vol.46, No.8,
pp.833–854, 1997.
[2] D. Piso, J.A. Pineiro and J.D. Bruguera, “Analysis of the impact of different methods for division/square root computation
in the performance of a superscalar microprocessor”, Proc. of
Euromicro Symposium on Digital System Design, Washington,
USA, pp.218–225, 2002.
[3] Coke, James, Baliga, et al., “Improvements in the Intel Core2
Penryn processor family architecture and microarchitecture”,
Intel Technology Journal, Vol.12, No.3, pp.179–184, 2008.
[4] M.D. Ercegovac and T. Lang, Division and Square Root: Digit
Recurrence Algorithms and Implementations, Kluwer Academic
Publishers, Massachusetts, USA, 1994.
[5] P. Soderquist and M. Leeser, “Area and performance tradeoffs in
floating-point divide and square-root implementations”, ACM
Computing Surveys, Vol.28, No.3, pp.518–564, 1996.
[6] S.F. Oberman and M.J. Flynn, “Minimizing the complexity of
SRT tables”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.6, No.1, pp.141–149, 1998.
[7] J. Fandrianto, “Algorithm for high-speed shared radix-8 division and radix-8 square root”, Proc. of 9th Symposium on
Computer Arithmetic, Santa Monica, CA, pp.68–75, 1989.
[8] A. Nannarelli and T. Lang, “Lower power radix-4 combined division and square root”, the International Conference on Computer Design, Austin, USA, pp.236–242, 1999.
[9] N. Burgess and C.N. Hinds, “Design of the ARM VFP11 divide
and square root synthesisable macrocell”, Proc. of 18th IEEE
Symposium on Computer Arithmetic, Montpellier, France,
pp.87–96, 2007.
[10] N. Burgess, “Retiming the ARM VFP-11 division and square
root macrocell”, Proc. of 41st Asilomar Conference on Signals,
Systems and Computers, Monterey, USA, pp.363–366, 2007.
[11] A. Amaricai and O. Boncalo, “SRT radix-2 dividers with (5,4)
redundant representation of partial remainder”, IEEE Transaction on Very Large Scale Integration Systems, Vol.23, No.5,
pp.1016–1020, 2013.
[12] M. Issad, M. Anane and H. Bessalah, “Influence de la base sur
les performances de la division SRT”, Proc. of Journal Francophones La Algorithm Architecture, Dijon, France, pp.91–94,
2005. (in French)
[13] T.N. Pham and E.E. Swartzlander, “Design of radix-4 SRT dividers in 65 nanometer CMOS technology”, International Conference on Application-specific Systems, Architectures and Processors, Steamboat Springs, Colorado, pp.105–108, 2006.
[14] M.R. Patel, D. Tejas, V.Shah, et al., “Implementation and
analysis of interval SRT radix-2 division algorithm”, International Journal of Electronics and Computer Science Engineering, Vol.1, No.3, pp.971–976, 1971.
[15] N.R. Srivastava, “Radix 4 SRT division with quotient prediction and operand scaling”, Europe Conference and Exhibition
on Design, Automation and Test, Nice, France, pp.1–6, 2007.
[16] M. Issad, H. Bessalah and N. Anane, “Higher radix and redundancy factor for floating point SRT division”, IEEE Transactions on Very Large Scale Integration Systems, Vol.16, No.6,
pp.774–779, 2008.
[17] C.X. Wang, K.F. Zhang, K. Liu, et al., “Study on double precision floating point division”, Microprocessors, Vol.6, No.6,
pp.1–3, 2011. (in Chinese)
[18] L. Wei and N. Alberto, “Power efficient division and square
root unit”, IEEE Transactions on Computers, Vol.61, No.8,
pp.1059–1070, 2012.
[19] A. Nannarelli, “Radix-16 combined division and square root
unit”, 2011 20th IEEE Symposium on Computer Arithmetic,
Tuebingen, Germany, pp.169–176, 2011.
[20] R. Ingo and T.G. Noll, “A digit-set-interleaved radix-8 division/square root kernel for double-precision floating point”,
2010 International Symposium on System on Chip (SoC), Tampere, Finland, pp.150–153, 2010.
[21] H. Wetter, E.M. Schwarz and J. Haess, “The IBM eServer z990
floating-point unit”, IBM Journal of Research and Development, Vol.48, No.3, pp.311-322, 2004.
[22] M.D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, California, USA, 2003.
[23] P. Kornerup, “Digit selection for SRT division and square root”,
IEEE Trans on Computers, Vol.54, No.3, pp.294–303, 2005.
[24] D. Stevenson, “A national standard IEEE standard for binary
floating-point arithmetic”, ACM Sigplan Notices, Vol.22, No.2,
pp.9–25, 1987.
PENG Yuanxi is a professor of National University of Defense and Technology, China. His research interests include
high performance computing, multi- and
many-core architectures, on-chip networks
and architectural support for parallel programming. (Email: [email protected])
CHEN Jiyang (corresponding author) received the B.S. degree and M.S.
degree in computer science in 2012 and
2015, respectively, from National University of Defense and Technology, China.
Her research interests include high performance computing, multi-core architectures and on-chip networks. (Email:
[email protected])
LEI Yuanwu is an associate professor of National University of Defense and
Technology, China. His research interests
include high performance computer architecture and computing engineering.
HE Tingting is a student in microelectronics and solid-state electronics in
National University of Defense and Technology, China. Her research interests include high performance computer architecture and computing engineering.
DENG Ziye is a student in microelectronics and solid-state
electronics in National University of Defense and Technology, China.
His research interests include high performance computer architecture and computing engineering.