INSTRUCTION ISSUE LOGIC FOR PIPELINED SUPERCOMPUTERS
S h l o m o Weiss
Computer Sciences Department
U n i v e r s i t y of W i s c o n s i n - M a d i s o n
Madison, WI 53706
J a m e s E. S m i t h
D e p a r t m e n t of E l e c t r i c a l a n d C o m p u t e r E n g i n e e r i n g
U n i v e r s i t y of W i s c o n s i n - M a d i s o n
Madison, WI 53706
Abs~t
B a s i c p r i n c i p l e s a n d d e s i g n t r a d e o f f s for c o n t r o l ol
p i p e l i n e d p r o c e s s o r s a r e f i r s t d i s c u s s e d . We c o n c e n t r a t e
o n r e g i s t e r - r e g i s t e r a r c h i t e c t u r e s like t h e GRAY-1 w h e r e
p i p e l i n e c o n t r o l logic is l o c a l i z e d to o n e o r two p i p e l i n e
s t a g e s a n d is r e f e r r e d t o as " i n s t r u c t i o n i s s u e logic".
D e s i g n t r a d e o f f s a r e e x p l o r e d b y giving d e s i g n s for a
v a r i e t y of i n s t r u c t i o n i s s u e m e t h o d s t h a t r e p r e s e n t a
r a n g e of c o m p l e x i t y a n d s o p h i s t i c a t i o n . T h e s e v a r y f r o m
t h e o r i g i n a l CRAY-1 i s s u e logic to a v e r s i o n of T o m a s u l o ' s
a l g o r i t h m , f i r s t u s e d in t h e IBM 3 6 0 / 9 1 floating p o i n t
u n i t . Also s t u d i e d a r e T h o r n t o n ' s " s c o r e b o a r d " algor i t h m u s e d o n t h e CDC 8600 a n d a n a l g o r i t h m we h a v e
d e v i s e d . To p r o v i d e a s t a n d a r d for c o m p a r i s o n , all t h e
i s s u e m e t h o d s a r e u s e d t o i m p l e m e n t t h e GRAY-1 s c a l a r
a r c h i t e c t u r e . Then, u s i n g a s i m u l a t i o n m o d e l a n d t h e
L a w r e n c e L i v e r m o r e Loops c o m p i l e d w i t h t h e GRAY FORTRAN c o m p i l e r , p e r f o r m a n c e r e s u l t s f o r t h e v a r i o u s i s s u e
methods are given and discussed.
1.1. T r a d e o ~
We b e g i n w i t h a d i s c u s s i o n of d e s i g n t r a d e o f f s t h a t
centers on four principle issues:
(1) c l o c k p e r i o d ,
(2) i n s t r u c t i o n s c h e d u l i n g ,
(3) i s s u e logic c o m p l e x i t y , a n d
(4) h a r d w a r e cost, d e b u g g i n g , a n d m a i n t e n a n c e .
E a c h of t h e s e i s s u e s will b e d i s c u s s e d in t u r n .
C l o c k P e r i o d . In a p i p e l t n e d c o m p u t e r , t h e r e a r e a
n u m b e r of s e g m e n t s c o n t a i n i n g c o m b i n a t i o n a l logic w i t h
l a t c h e s s e p a r a t i n g s u c c e s s i v e s e g m e n t s . All t h e l a t c h e s
a r e s y n c h r o n i z e d b y t h e s a m e clock, a n d t h e p i p e l i n e is
c a p a b l e of i n i t i a t i n g a n e w i n s t r u c t i o n e v e r y c l o c k
p e r i o d . H e n c e , u n d e r i d e a l c o n d i t i o n s , i.e. n o d e p e n d e n cies o r r e s o u r c e conflicts, p i p e l i n e p e r f o r m a n c e is
d i r e c t l y r e l a t e d to t h e p e r i o d of t h e c l o c k u s e d t o sync h r o n i z e t h e pipe. E v e n w i t h d a t a d e p e n d e n c i e s a n d
r e s o u r c e conflicts, t h e r e is a h i g h correlation b e t w e e n
performance and clock period.
Historically, pipelined supercomputers
have had
s h o r t e r c l o c k p e r i o d s t h a n o t h e r c o m p u t e r s . This is in
p a r t d u e t o t h e u s e of t h e f a s t e s t a v a i l a b l e logic t e c h n o logies, b u t i t is also d u e to d e s i g n s t h a t m i n i m i z e logic
levels b e t w e e n s u c c e s s i v e l a t c h e s .
S e h e d n l i n g of I n s t r u c t i o n s . P e r f o r m a n c e of a p i p e l i n e d p r o c e s s o r d e p e n d s g r e a t l y o n t h e o r d e r of t h e
i n s t r u c t i o n s i n t h e i n s t r u c t i o n s t r e a m . If c o n s e c u t i v e
i n s t r u c t i o n s h a v e d a t a and c o n t r o l d e p e n d e n c i e s a n d
c o n t e n d for r e s o u r c e s , t h e n " h o l e s " i n t h e p i p e l i n e will
d e v e l o p a n d p e r f o r m a n c e will suffer. To i m p r o v e p e r f o r m a n c e , it is o f t e n p o s s i b l e to a r r a n g e t h e c o d e , o r
s c h e d u l e it, so t h a t d e p e n d e n c i e s a n d r e s o u r c e c o n f l i c t s
a r e m i n i m i z e d . R e g i s t e r s c a n also b e a l l o c a t e d so t h a t
register conflicts are reduced (register conflicts caused
b y d a t a d e p e n d e n c i e s c a n n o t b e e l i m i n a t e d i n t h i s way,
h o w e v e r ) . B e c a u s e of t h e i r close r e l a t i o n s h i p , i n t h e
r e m a i n d e r of t h e p a p e r we will g r o u p c o d e s c h e d u l i n g
a n d r e g i s t e r a l l o c a t i o n t o g e t h e r a n d r e f e r t o t h e m coll e c t i v e l y as " c o d e s c h e d u l i n g " .
T h e r e a r e two d i f f e r e n t ways t h a t c o d e s c h e d u l i n g
c a n b e d o n e . First, i t c a n b e d o n e a t c o m p i l e t i m e b y t h e
software. We r e f e r to t h i s as " s t a t i c " s c h e d u l i n g b e c a u s e
it d o e s n o t c h a n g e a s t h e p r o g r a m r u n s . S e c o n d , it c a n
b e d o n e b y t h e h a r d w a r e a t r u n t i m e . We r e f e r t o t h i s as
" d y n a m i c " s c h e d u l i n g . T h e s e two m e t h o d s a r e n o t m u t u ally exclusive.
Most c o m p i l e r s for p i p e l i n e d p r o c e s s o r s d o s o m e
f o r m of s t a t i c s c h e d u l i n g t o avoid d e p e n d e n c i e s . This
a d d s a new d i m e n s i o n to t h e o p t i m i z a t i o n p r o b l e m s
f a c e d b y a c o m p i l e r , a n d o c c a s i o n a l l y a p r o g r a m m e r will
h a n d c o d e i n n e r l o o p s in a s s e m b l y l a n g u a g e to a r r i v e a t
a better schedule than a compiler can provide.
1. I n t r o d u c t i o n
A l t h o u g h m o d e r n s u p e r e o m p u t e r s a r e c l o s e l y assoc i a t e d w i t h h i g h s p e e d v e c t o r o p e r a t i o n , it is widely
r e c o g n i z e d t h a t s c a l a r o p e r a t i o n is a t l e a s t of e q u a l
i m p o r t a n c e , a n d p i p e l i n i n g [KOGGS1] is t h e p r e d o m i n a n t
t e c h n i q u e f o r a c h i e v i n g h i g h s c a l a r p e r f o r m a n c e . In a
p i p e l i n e d c o m p u t e r , i n s t r u c t i o n p r o c e s s i n g is b r o k e n
i n t o s e g m e n t s a n d p r o c e s s i n g p r o c e e d s in a n a s s e m b l y
line f a s h i o n w i t h t h e e x e c u t i o n of s e v e r a l i n s t r u c t i o n s
b e i n g o v e r l a p p e d . B e c a u s e of d a t a a n d c o n t r o l d e p e n d e n cies in a s c a l a r i n s t r u c t i o n s t r e a m , i n t e r l o c k logic is
placed between critical pipeline segments to control
i n s t r u c t i o n flow t h r o u g h t h e pipe. In a n r e g i s t e r - r e g i s t e r
a r c h i t e c t u r e like t h e CDC 6600 [THOR70], t h e CDC 7000
[BONS69], a n d t h e GRAY-1 [GRAY77, CRAY79,RUSST8],
m o s t of t h e i n t e r l o c k logic is l o c a l i z e d t o o n e s e g m e n t
e a r l y i n t h e p i p e l i n e a n d is r e f e r r e d t o as " i n s t r u c t i o n
i s s u e " logic.
It is t h e p u r p o s e of t h i s p a p e r to h i g h l i g h t s o m e of
the tradeoffs that affect pipeline control, with particular
e m p h a s i s o n i n s t r u c t i o n i s s u e logic. The p r i m a r y v e h i c l e
for t h i s d i s c u s s i o n is a s i m u l a t i o n s t u d y of d i f f e r e n t
i n s t r u c t i o n i s s u e m e t h o d s with v a r y i n g d e g r e e s of c o m plexity. T h e s e r a n g e f r o m t h e s i m p l e a n d s t r a i g h t f o r w a r d as in t h e GRAY-1 to t h e c o m p l e x a n d s o p h i s t i c a t e d
as in t h e CDC 6600 a n d t h e IBM 3 6 0 / 9 1 floating p o i n t u n i t
[TOMA67]. E a c h is u s e d t o i m p l e m e n t t h e GRAY-1 s c a l a r
a r c h i t e c t u r e , a n d e a c h i m p l e m e n t a t i o n is s i m u l a t e d
u s i n g t h e 14 L a w r e n c e I A v e r m o r e Loops [MCMA72] as
c o m p i l e d b y t h e G r a y R e s e a r c h FORTRAN c o m p i l e r ( c g r ) .
II0
0 1 9 4 - 7 1 1 1 / 8 4 / 0 0 0 0 / 0 1 1 0 5 0 1 . 0 0 © 1984 I E E E
a l g o r i t h m we h a v e d e v i s e d to allow d y n a m i c s c h e d u l i n g
while doing away w i t h s o m e of t h e a s s o c i a t i v e c o m p a r e s
required by the other methods that perform dynamic
s c h e d u l i n g . E a c h of t h e f o u r CRAY-1 i m p l e m e n t a t i o n s is
s i m u l a t e d over t h e s a m e s e t of b e n c h m a r k s t o allow p e r f o r m a n c e c o m p a r i s o n s . S e c t i o n 6 c o n t a i n s a f u r t h e r discussion on the relationship between software and
h a r d w a r e c o d e s c h e d u l i n g in p i p e l i n e d p r o c e s s o r s , a n d
Section 7 contains conclusions.
Issue Logic Complexity. By u s i n g c o m p l e x i s s u e
logic, d y n a m i c s c h e d u l i n g of i n s t r u c t i o n s c a n b e
a c h i e v e d . This allows i n s t r u c t i o n s t o b e g i n e x e c u t i o n
" o u t - o f - o r d e r " w i t h r e s p e c t to t h e c o m p i l e d c o d e
s e q u e n c e . This h a s two a d v a n t a g e s . F i r s t , it r e l i e v e s
s o m e t h e b u r d e n o n t h e c o m p i l e r t o g e n e r a t e a good
s c h e d u l e . T h a t is, p e r f o r m a n c e is n o t as d e p e n d e n t o n
t h e q u a l i t y of t h e c o m p i l e d code. S e c o n d , d y n a m i c
s c h e d u l i n g a t i s s u e t i m e c a n t a k e a d v a n t a g e of d e p e n d e n c y i n f o r m a t i o n t h a t is n o t a v a i l a b l e to t h e c o m p i l e r
w h e n it d o e s s t a t i c s c h e d u l i n g . C o m p l e x i s s u e logic does
r e q u i r e l o n g e r c o n t r o l p a t h s , h o w e v e r , w h i c h c a n l e a d to
a longer clock period.
H a r d w a r e Cost, D e b u g g i n g , and Maintenance. Comp l e x i s s u e m e t h o d s l e a d to a d d i t i o n a l h a r d w a r e cost.
More logic is n e e d e d , a n d d e s i g n t i m e is i n c r e a s e d . Comp l e x c o n t r o l logic is also m o r e e x p e n s i v e to d e b u g a n d
maintain. These problems are aggravated by issue
m e t h o d s t h a t d y n a m i c a l l y s c h e d u l e c o d e b e c a u s e it m a y
b e difficult to r e p r o d u c e e x a c t i s s u e s e q u e n c e s .
2. The CRAY-1 Architecture a n d Instruction Issue Algorithm
2.1. Overview of t h e CRAY-1
The CRAY-I a r c h i t e c t u r e a n d o r g a n i z a t i o n a r e u s e d
t h r o u g h o u t t h i s p a p e r as a b a s i s for c o m p a r i s o n . The
CRAY-1 s c a l a r a r c h i t e c t u r e is s h o w n i n F i g u r e 1. It cons i s t s of two s e t s of r e g i s t e r s a n d f u n c t i o n a l u n i t s for (1)
a d d r e s s p r o c e s s i n g a n d (2) s c a l a r p r o c e s s i n g . The
a d d r e s s r e g i s t e r s a r e p a r t i t i o n e d i n t o two levels: e i g h t A
r e g i s t e r s a n d s i x t y four B r e g i s t e r s . The i n t e g e r a d d a n d
m u l t i p l y f u n c t i o n a l u n i t s a r e d e d i c a t e d t o a d d r e s s proc e s s i n g . Similarly, t h e s c a l a r r e g i s t e r s a r e p a r t i t i o n e d
i n t o two levels: e i g h t S r e g i s t e r s a n d s i x t y f o u r T r e g i s t e r s . The B a n d T r e g i s t e r files m a y b e u s e d as a
programmer-manipulated
data cache, although this
f e a t u r e is l a r g e l y u n u s e d b y t h e CFT c o m p i l e r . F o u r
f u n c t i o n a l u n i t s a r e u s e d e x c l u s i v e l y for s c a l a r p r o c e s s ing. In a d d i t i o n , t h r e e floating p o i n t f u n c t i o n a l u n i t s a r e
s h a r e d with t h e v e c t o r p r o c e s s i n g s e c t i o n ( v e c t o r r e g i s t e r s a r e n o t s h o w n in Fig. 1).
1.2. H i s t o r i c a l P e r s p e e U ~ e
It is i n t e r e s t i n g to r e v i e w t h e way t h e a b o v e
t r a d e o f f s h a v e b e e n d e a l t w i t h h i s t o r i c a l l y . In t h e l a t e
1950's a n d e a r l y 1960's t h e r e was r a p i d m o v e m e n t
toward increasingly complex issue methods. Important
m i l e s t o n e s w e r e a c h i e v e d b y STRETCH [BUCH62] in 1961
a n d t h e CDC 6600 in 1964. P r o b a b l y t h e m o s t s o p h i s t i c a t e d i s s u e logic u s e d t o d a t e is in t h e IBM 3 6 0 / 9 1
[ANDE67], s h i p p e d in 1967. A f t e r t h i s f i r s t r u s h t o w a r d
m o r e a n d m o r e c o m p l e x m e t h o d s , t h e r e was a r e t r e a t
t o w a r d s i m p l e r i n s t r u c t i o n i s s u e m e t h o d s t h a t a r e still in
u s e t o d a y . At CDC, t h e 7600 was d e s i g n e d to i s s u e
instructions in strict program sequence with no dynamic
s c h e d u l i n g . The c l o c k period, h o w e v e r , was v e r y fast,
e v e n b y t o d a y ' s s t a n d a r d s . The m o r e r e c e n t CRAY-1 a n d
CRAY-XMP [CRAYS2] differ v e r y l i t t l e f r o m t h e CDC7600 in
t h e way t h e y h a n d l e s c a l a r i n s t r u c t i o n s .
The CDC
CYBER205 [CDCS1] s c a l a r u n i t is also v e r y s i m i l a r . At
IBM, a n d l a t e r a t A m d a h l Corp., p i p e l i n e d i m p l e m e n t a t i o n s of t h e 3 6 0 / 3 7 0 a r c h i t e c t u r e following t h e 3 6 0 / 9 1
h a v e i s s u e d i n s t r u c t i o n s s t r i c t l y in o r d e r .
As for t h e f u t u r e , b o t h t h e d e b u g / m a i n t e n a n c e
problem and the hardware cost problem may be
s i g n i f i c a n t l y a l l e v i a t e d b y u s i n g VLSI w h e r e logic is m u c h
less e x p e n s i v e a n d w h e r e r e p l a c e a b l e p a r t s a r e s u c h t h a t
f a u l t i s o l a t i o n d o e s n o t n e e d to b e a s p r e c i s e as with SSI.
In a d d i t i o n , t h e r e is a t r e n d t o w a r d m o v i n g s o f t w a r e
problems into hardware, and code scheduling seems to
be a candidate. Consequently, tradeoffs are shifting and
i n s t r u c t i o n i s s u e logic t h a t d y n a m i c a l l y s c h e d u l e s c o d e
deserves renewed study.
~.~
T
REGISTER
FILE
REGIS~R
FILE
FlU
F'UNC1
'LY
1.3. P a p e r Overview
The t r a d e o f f s j u s t d i s c u s s e d l e a d t o a s p e c t r u m of
i n s t r u c t i o n i s s u e a l g o r i t h m s . T h r o u g h s i m u l a t i o n we c a n
look a t t h e p e r f o r m a n c e g a i n s t h a t a r e m a d e p o s s i b l e b y
d y n a m i c c o d e s c h e d u l i n g . O t h e r i s s u e s like c l o c k p e r i o d
a n d h a r d w a r e c o s t a n d m a i n t e n a n c e a r e m o r e difficult
a n d r e q u i r e d e t a i l e d d e s i g n a n d c o n s t r u c t i o n to m a k e
q u a n t i t a t i v e a s s e s s m e n t s . In t h i s p a p e r , we do d i s c u s s
t h e c o n t r o l f u n c t i o n s t h a t n e e d t o b e i m p l e m e n t e d to
facilitate qualitative judgements. Section 2 examines
o n e e n d p o i n t of t h e s p e c t r u m : t h e CRAY-1. The CRAY-1
u s e s s i m p l e i s s u e logic w i t h a f a s t c l o c k a n d s t a t i c c o d e
s c h e d u l i n g only. S e c t i o n 3 e x a m i n e s t h e o t h e r e n d p o i n t
of t h e s p e c t r u m : T o m a s u l o ' s a l g o r i t h m . T o m a s u l o ' s algor i t h m is c a p a b l e of c o n s i d e r a b l e d y n a m i c c o d e s c h e d u l ing via a c o m p l e x i s s u e m e c h a n i s m . S e c t i o n s 4 a n d 5
t h e n d i s c u s s two i n t e r m e d i a t e p o i n t s . The f i r s t is a v a r i a t i o n of T h o r n t o n ' s " s c o r e b o a r d " a l g o r i t h m u s e d in t h e
CDC 6600. T h o r n t o n ' s a l g o r i t h m is also u s e d t o i m p l e m e n t t h e CRAY-1 s c a l a r a r c h i t e c t u r e . The s e c o n d is a n
AL
{IFT
F i g u r e 1 - The CRAY-1 S c a l a r A r c h i t e c t u r e
The i n s t r u c t i o n s e t is d e s i g n e d f o r e f f i c i e n t p i p e l i n e
p r o c e s s i n g . Being a r e g i s t e r - r e g i s t e r a r c h i t e c t u r e , only
l o a d a n d s t o r e i n s t r u c t i o n s c a n a c c e s s m e m o r y . The r e s t
of i n s t r u c t i o n s u s e o p e r a n d s f r o m r e g i s t e r s . I n s t r u c t i o n s
for t h e B a n d T r e g i s t e r s a r e r e s t r i c t e d t o m e m o r y
a c c e s s a n d c o p i e s t o a n d f r o m r e g i s t e r files A a n d S,
respectively.
The i n f o r m a t i o n flow is f r o m m e m o r y t o r e g i s t e r s A
(S), or to t h e i n t e r m e d i a t e r e g i s t e r s B (T). F r o m file A (S)
d a t a is s e n d t o t h e f u n c t i o n a l u n i t s , f r o m w h i c h it
r e t u r n s t o file A (S). T h e n d a t a c a n b e f u r t h e r p r o c e s s e d
b y f u n c t i o n a l u n i t s , s t o r e d i n t o m e m o r y o r s a v e d i n file B
(T). Block t r a n s f e r s of o p e r a n d s b e t w e e n m e m o r y a n d
r e g i s t e r s B a n d T a r e also available, t h u s r e d u c i n g t h e
n u m b e r of m e m o r y a c c e s s i n s t r u c t i o n s .
111
2.2. CRAY-1 Issue Logic
Loop
I n s t r u c t i o n s a r e f e t c h e d f r o m i n s t r u c t i o n b u f f e r s at
t h e r a t e of one p a r c e l (16 bits) p e r c l o c k p e r i o d .
Individual i n s t r u c t i o n s a r e e i t h e r o n e or two p a r c e l s
long. After a c l o c k p e r i o d is s p e n t for i n s t r u c t i o n d e c o d ing, t h e i s s u e logic c h e c k s i n t e r l o c k s . If t h e r e is a n y
conflict, i s s u e is b l o c k e d u n t i l t h e c o n f l i c t c o n d i t i o n g o e s
away.
1
2
3
4
5
6
7
8
9
10
ii
12
13
14
F o r s c a l a r i n s t r u c t i o n s , t h e following a r e t h e prim a r y i n t e r l o c k c h e c k s m a d e a t t h e t i m e of i n s t r u c t i o n
issue:
(1) r e g i s t e r s ; b o t h t h e s o u r c e a n d d e s t i n a t i o n r e g i s t e r s
must not be reserved;
(2)
r e s u l t bus; t h e A a n d S r e g i s t e r files h a v e one b u s
e a c h o v e r w h i c h d a t a c a n b e w r i t t e n i n t o t h e files.
B a s e d on t h e c o m p l e t i o n t i m e of t h e p a r t i c u l a r
i n s t r u c t i o n , a c h e c k is m a d e t o d e t e r m i n e if t h e b u s
will b e available at t h e c l o c k p e r i o d w h e n t h e
instruction completes.
# instructions
executed
7217
8448
14015
9783
8347
9350
4573
4031
4918
4412
12002
11999
8846
9915
# clock
cycles
18046
18918
88039
22198
31707
23045
10381
7841
10146
10230
30011
29999
18858
22391
Table 1 -- CRAY-1 E x e c u t i o n t i m e s for t h e 14 L a w r e n c e
L i v e r m o r e Loops - 1 p a r c e l i n s t r u c t i o n s a r e i s s u e d in
1 cycle, 2 p a r c e l i n s t r u c t i o n s in 2 cycles.
(3)
f u n c t i o n a l unit; due t o v e c t o r i n s t r u c t i o n s , a f u n c tional unit m a y be busy when a scalar instruction
w i s h e s t o u s e it. S i n c e we a r e c o n s i d e r i n g s c a l a r
p e r f o r m a n c e only, t h i s t y p e of c o n f l i c t will n o t
o c c u r . The m e m o r y s y s t e m c a n also b e v i e w e d as a
f u n c t i o n a l unit; it c a n o c c a s i o n a l l y b e c o m e b u s y d u e
t o a m e m o r y b a n k conflict, b u t for s c a l a r c o d e t h i s
is a v e r y i n f r e q u e n t o c c u r r e n c e a n d d o e s n o t a f f e c t
p e r f o r m a n c e in a n y a p p r e c i a b l e way.
If all its i n t e r l o c k s p a s s , a n i n s t r u c t i o n i s s u e s a n d
c a u s e s t h e following to t a k e p l a c e .
(1) The d e s t i n a t i o n r e g i s t e r is r e s e r v e d ; t h i s r e s e r v a t i o n
is r e m o v e d only w h e n t h e i n s t r u c t i o n c o m p l e t e s .
(2) The r e s u l t b u s is r e s e r v e d f o r t h e c l o c k p e r i o d w h e n
the instruction completes.
M e m o r y a c c e s s e s h a v e one f u r t h e r i n t e r i o c k t o b e
c h e c k e d : m e m o r y b a n k b u s y . This m u s t b e d e l a y e d u n t i l
t h e i n d e x i n g r e g i s t e r is r e a d a n d t h e e f f e c t i v e a d d r e s s is
c o m p u t e d . H e n c e , t h e b a n k b u s y c h e c k is p e r f o r m e d two
c l o c k p e r i o d s a f t e r a l o a d o r s t o r e i n s t r u c t i o n i s s u e s . If
the bank h a p p e n s to be busy, the m e m o r y "functional
u n i t " is b u s i e d , a n d no f u r t h e r l o a d s or s t o r e s c a n b e
i s s u e d . B e c a u s e all l o a d s a n d s t o r e s a r e two p a r c e l s
long, t h e y c a n i s s u e a t a m a x i m u m r a t e of o n e e v e r y two
c l o c k p e r i o d s . This m e a n s t h a t a b a n k b u s y b l o c k a g e
c a t c h e s a s u b s e q u e n t load or s t o r e b e f o r e it is i s s u e d .
A f t e r a l o a d i n s t r u c t i o n p a s s e s t h e b a n k b u s y c h e c k , it
p l a c e s its r e s e r v a t i o n f o r t h e a p p r o p r i a t e r e s u l t bus. A
load c a n only c o n f l i c t for t h e b u s with a p r e v i o u s l y i s s u e d
r e c i p r o c a l a p p r o x i m a t i o n i n s t r u c t i o n , so t h e a d d i t i o n a l
i n t e r l o c k i n g d o n e a t t h a t p o i n t is m i n i m a l .
a m o u n t of v e c t o r c o d e , a n d half r e m a i n s c a l a r .
F o r t h e s i m u l a t i o n s r e p o r t e d h e r e , we have m a d e
t h e following s i m p l i f i c a t i o n s :
(1)
T h e r e a r e no m e m o r y b a n k conflicts.
(2)
All loops fit in t h e i n s t r u c t i o n buffers.
One r e a s o n f o r t h i s s i m p l i f i c a t i o n was t o s i m p l i f y t h e
s i m u l a t o r d e s i g n for t h e a l t e r n a t i v e CRAY-1 i s s u e
m e t h o d s to b e g i v e n later; as m e n t i o n e d e a r l i e r o u r original CRAY-1 s i m u l a t o r is c a p a b l e of m o d e l i n g b a n k
c o n f l i c t s a n d i n s t r u c t i o n buffers. Also, this allows us to
c o n c e n t r a t e on t h e p e r f o r m a n c e d i f f e r e n c e s c a u s e d b y
i s s u e logic a n d t o f i l t e r t h e "noise" i n t r o d u c e d by o t h e r
f a c t o r s (e.g. i n s t r u c t i o n b u f f e r c r o s s i n g s ) . Tabte 1 ~ilows
t h e s c a l a r p e r f o r m a n c e of t h e CRAY-1 for t h e f i r s t 14
L a w r e n c e L i v e r m o r e Loops.
3. T o m a s u l o ' s A l g o r i t h m
The CRAY-1 f o r c e s i n s t r u c t i o n s t o i s s u e s t r i c t l y in
p r o g r a m o r d e r . If a n i n s t r u c t i o n is b l o c k e d f r o m issuing
d u e t o a conflict, all i n s t r u c t i o n s following it in t h e
i n s t r u c t i o n s t r e a m a r e also b l o c k e d , e v e n if t h e y h a v e n o
conflicts. In c o n t r a s t , t h e s c h e m e in t h i s s e c t i o n allows
i n s t r u c t i o n s to b e g i n e x e c u t i o n o u t of p r o g r a m o r d e r . It
is a v a r i a t i o n of t h e i n s t r u c t i o n i s s u e a l g o r i t h m f i r s t
p r e s e n t e d in [TOMA67]. A l t h o u g h t h e o r i g i n a l a l g o r i t h m
was d e v i s e d for t h e floating p o i n t u n i t of t h e IBM 360/91,
we show how it c a n b e a d a p t e d a n d e x t e n d e d t o c o n t r o l
t h e e n t i r e p i p e l i n e s t r u c t u r e of a CRAY-1 i m p l e m e n t a tion.
F i g u r e 2 i l l u s t r a t e s t h e e s s e n t i a l e l e m e n t s of a t a g b a s e d m e c h a n i s m for i s s u e of i n s t r u c t i o n s o u t - o f - o r d e r .
Fig. 3 i l l u s t r a t e s t h e full CRAY-1 i m p l e m e n t a t i o n .
2.3. CRAY-1 P e r f o r m a n c e
In t h i s p a p e r p e r f o r m a n c e is m e a s u r e d b y s i m u l a t ing t h e first 14 L a w r e n c e L i v e r m o r e Loops. T h e s e a r e
e x c e r p t s f r o m l a r g e FORTRAN p r o g r a m s t h a t h a v e b e e n
j u d g e d t o p r o v i d e a g o o d m e a s u r e of l a r g e s c a l e c o m p u t e r p e r f o r m a n c e . The loops w e r e c o m p i l e d u s i n g t h e
CFT c o m p i l e r , a n d i n s t r u c t i o n t r a c e t a p e s w e r e g e n erated. These were t h e n simulated with a p e r f o r m a n c e
s i m u l a t o r w r i t t e n in C, r u n n i n g o n a V A X l l / 7 8 0 . With
bank busies and instruction buffer misses modeled, the
s i m u l a t o r a g r e e s e x a c t l y w i t h a c t u a l CRAY-1 t i m i n g s ,
e x c e p t w h e n t h e r e is a d i f f e r e n c e in t h e way a loop fits
i n t o t h e i n s t r u c t i o n buffers. This p a r t i c u l a r d i f f e r e n c e is
a f u n c t i o n of w h e r e t h e l o a d e r c h o o s e s to p l a c e a p r o g r a m in m e m o r y , a n d f o r p r a c t i c a l u s e h a s t o b e v i e w e d
as a n o n d e t e r m i n i s m .
S i n c e we a r e i n t e r e s t e d in s c a l a r p e r f o r m a n c e , t h e
CFT c o m p i l e r was r u n with t h e " v e c t o r i z e r " t u r n e d off so
t h a t n o v e c t o r i n s t r u c t i o n s w e r e p r o d u c e d . When t h e
v e c t o r i z e r is on, half of t h e 14 loops c o n t a i n a s u b s t a n t i a l
E a c h r e g i s t e r in t h e A a n d S r e g i s t e r files is augm e n t e d b y a ready bit (R) a n d a fag field. A s s o c i a t e d with
e a c h f u n c t i o n a l u n i t is a s m a l l n u m b e r of rese~rvat/on
stations. E a c h r e s e r v a t i o n s t a t i o n c a n s t o r e a p a i r of
o p e r a n d s ; e a c h o p e r a n d h a s its own t a g field a n d r e a d y
bit. A r e s e r v a t i o n s t a t i o n also h o l d s a destination tag
(DTG). When an i n s t r u c t i o n is i s s u e d , a n e w t a g is s t o r e d
i n t o DTG ( s e e Fig. 2).
New d e s t i n a t i o n t a g s a r e a s s i g n e d f r o m a " t a g pool"
t h a t c o n s i s t s of s o m e finite s e t of t a g s . T h e s e a r e a s s o c i a t e d with a n i n s t r u c t i o n f r o m t h e t i m e t h e i n s t r u c t i o n is
i s s u e d t o a r e s e r v a t i o n s t a t i o n u n t i l t h e t i m e it p r o d u c e s
a r e s u l t a n d c o m p l e t e s . The t a g is r e t u r n e d t o t h e pool
w h e n a n i n s t r u c t i o n finishes. In t h e o r i g i n a l T o m a s u l o ' s
a l g o r i t h m , t h e t a g s w e r e in 1-to-1 c o r r e s p o n d e n c e with
t h e r e s e r v a t i o n s t a t i o n s . This p a r t i c u l a r way of a s s i g n i n g
t a g s is n o t e s s e n t i a l , h o w e v e r . Any m e t h o d will w o r k as
long as t a g s a r e a s s i g n e d a n d r e l e a s e d t o t h e pool as
d e s c r i b e d above.
i12
(I) The requested functional unit m u s t have an available reservation station
(2) There m u s t be an available tag from the tag pool. ]n
Tomasulo's implementation, conditions I and 2 ace
equivalent.
(8) A source register being used by the instruction
m u s t not be loaded with a lust-completed result
during the s a m e clock period as the instruction
issues to a reservation station. This hazard
condition is often neglected w h e n
discussing
Tomasulo's algorithm. If it is not handled properly,
the source register contents and the instruction
that uses the source register will both be
transferred during the s a m e clock period. Because
the instruction is not in the reservation station at
the time of the register transfer, the register's contents will not be correctly sent to the reservation
station. W h e n this hazard condition is detected,
instruction issue is held for one clock period.
l
I IrAG....
I
........
..........
[nJNcTl~t U N . .
]n our simulations, each functional unit had 8 reservation stations. The reason w e had a relatively large
n u m b e r of reservation stations was to monitor their
usage; in fact m o s t of t h e m were not required. F e w functional units need m o r e than one or two reservation stations. Those that need more, usually because they wait
for instructions with long latency, such as load or floating point multiply and add, would do very well with 4
reservation stations. Although in our s c h e m e reservation
stations are statically allocated to each functional unit,
it is possible to reduce their n u m b e r and optimize their
usage by clustering t h e m in a c o m m o n
pool and then
allocating as needed.
W h e n an instruction is issued, the following actions
take place.
(I) The instruction's source register(s) contents are
copied into the requested functional unit's reservation station.
(2) The instruction's source register(s) ready bits are
copied into the reservation station.
(3) The instruction's source register(s) tag fields are
copied into the reservation station.
(4) A tag allocated from the tag pool is placed in the
result register's tag field (if there is a result), the
register's ready bit is cleared, and the tag is written
into the D T G field of the reservation station.
Figure 2 -- Tag b a s e d m e c h a n i s m t o i s s u e o u t - o f - o r d e r .
We t r e a t t h e r e g i s t e r files B a n d T as a unit, With one
b u s y bit p e r file, s i n c e it is n o t p r a c t i c a l to a s s i g n t a g s to
so m a n y r e g i s t e r s . When one of t h e s e r e g i s t e r s awaits a n
o p e r a n d , t h e whole file is s e t t o busy.
To f a c i l i t a t e t r a n s f e r of o p e r a n d s b e t w e e n t h e r e g i s t e r files, s p e c i a l c o p y u n i t s (AS, SA, AB, a n d ST) a r e
i n t r o d u o e d . T h e s e a r e t r e a t e d as f u n c t i o n a l units, with
r e s e r v a t i o n s t a t i o n s a n d e x e c u t i o n t i m e of o n e c l o c k
cycle. T h e s e r e s e r v a t i o n s t a t i o n s ( a n d s o m e o t h e r s , e.g.
r e c i p r o c a l a p p r o x i m a t i o n ) have o n l y one o p e r a n d .
The m e m o r y u n i t ap~)ears t o t h e i s s u e logic a s a
( s o m e w h a t m o r e c o m p l e x ) f u n c t i o n a l unit. I n s t e a d of
one s e t of r e s e r v a t i o n s t a t i o n s , t h e m e m o r y u n i t h a s
t h r e e : Load r e s e r v a t i o n s t a t i o n s , Sto~'e r e s e r v a t i o n s t a tions, a n d a Co,%fleet ~hze~e. When a new m e m o r y
i n s t r u c t i o n Ii is issued, its effective a d d r e s s (if available)
is c h e c k e d a g a i n s t a d d r e s s e s in t h e Load a n d S t o r e
r e s e r v a t i o n s t a t i o n s . If t h e r e is a c o n f l i c t w i t h i n s t r u c t i o n
]I, ]i is i s s u e d a n d q u e u e d in t h e Conflict Queue. When l] is
e v e n t u a l l y p r o c e s s e d a n d t h e conflict d i s a p p e a r s , Ii is
t r a n s f e r r e d f r o m t h e Conflict Queue to a Load o r S t o r e
r e s e r v a t i o n s t a t i o n . If Ii u s e s a n i n d e x r e g i s t e r t h a t is
n o t r e a d y , t h e effective a d d r e s s is u n k n o w n a n d t h e r e is
no way t o c h e c k for conflicts. In t h i s c a s e , li is s t o r e d in
t h e Conflict Queue, w h e r e it waits for its i n d e x r e g i s t e r to
become ready.
T h e r e f o r e , i n s t r u c t i o n s f r o m t h e Load a n d S t o r e
units can be p r o c e s s e d asynchronously, since t h e y never
conflict with each other (no two instructions in these
units have the s a m e effective address). On the other
hand, instructions from the Conflict Queue are processed
in the order of arrival. This guarantees that two instructions with the s a m e effective address are processed in
the right order. The above m e c h a n i s m takes care of any
read after write, write after read or write after write
hazards. The Conflict queue is the only unit in the syst e m in which instructions are strictly processed in the
order of their arrival. This is a simpler m e c h a n i s m that
the one e mp lo ye d by the IBM 360/91 Storage S y s t e m
[BOLA67]. The latter has a similar queue for resolving
m e m o r y conflicts, but instructions stored in this queue
can be processed out of order; only two or m o r e requests
for a particular address are kept in sequence.
The tag m e c h a n i s m described above allows decoded
instructions to issue to functional units with little regard
for dependencies. There are three conditions that m u s t
be checked before an instruction can be sent to a functional unit, however.
-7
A-S
S-A
r--~-~
i
[,
@
~
, ]
12
Figure 3 -- A modified CRAY-I scalar
architecture to issue instructions out-of-order.
113
HAL
F~LY
Thus, if a s o u r c e r e g i s t e r ' s r e a d y bit is set, t h e r e s e r v a t i o n s t a t i o n will h o l d a valid o p e r a n d . Otherwise, it will
h o l d a tag t h a t i d e n t i f i e s t h e e x p e c t e d o p e r a n d .
due to t h e m o r e c o m p l e x c o n t r o l d e c i s i o n s t h a t a r e
r e q u i r e d . The p r i m a r y f a c t o r t h a t m a y l e a d to l o n g e r
c o n t r o l p a t h s is t h e c o n t e n t i o n t h a t t a k e s p l a c e a m o n g
t h e r e s e r v a t i o n s t a t i o n s for f u n c t i o n a l u n i t s a n d b u s s e s
w h e n m o r e t h a n o n e a r e s i m u l t a n e o u s l y r e a d y to i n i t i a t e
an instruction.
When b r a n c h e s a r e t a k e n , i s s u e is h e l d for a t l e a s t 5
c l o c k c y c l e s . S i n c e b r a n c h e s t e s t t h e c o n t e n t s of r e g i s t e r A0 for t h e c o n d i t i o n s p e c i f i e d in t h e b r a n c h i n s t r u c tion, A0 s h o u l d n o t b e b u s y in t h e p r e v i o u s 2 c y c l e s .
T h e s e a s s u m p t i o n s a r e in a c c o r d with t h e CRAY-1 i m p l e m e n t a t i o n a n d t h e a s s u m p t i o n s m a d e t o p r o d u c e Table
1.
In o r d e r for a n i n s t r u c t i o n waiting in a r e s e r v a t i o n
s t a t i o n to b e g i n e x e c u t i o n , t h e following m u s t b e
satisfied.
(1) All its o p e r a n d s m u s t b e r e a d y .
(2)
It m u s t gain a c c e s s t o t h e r e q u i r e d f u n c t i o n a l unit;
t h i s m a y involve c o n t e n t i o n with o t h e r r e s e r v a t i o n
s t a t i o n s b e l o n g i n g to t h e s a m e f u n c t i o n a l u n i t t h a t
also have all o p e r a n d s r e a d y .
(3) It m u s t g a i n a c c e s s tic t h e r e s u l t b u s for t h e c l o c k
p e r i o d w h e n its r e s u l t will b e r e a d y ; again, t h i s m a y
involve c o n t e n t i o n w i t h o t h e r i n s t r u c t i o n s issuing to
t h e s a m e or a n y o t h e r f u n c t i o n a l u n i t t h a t will
c o m p l e t e at t h e s a m e t i m e .
When a n i n s t r u c t i o n b e g i n s e x e c u t i o n , it d o e s t h e following.
(1) It r e l e a s e s its r e s e r v a t i o n s t a t i o n .
(2) It r e s e r v e s t h e r e s u l t b u s for t h e c l o c k p e r i o d w h e n
it will c o m p l e t e . R e s e r v i n g t h e b u s in a d v a n c e
avoids t h e i m p l e m e n t a t i o n p r o b l e m s of s t o p p i n g t h e
p i p e l i n e if t h e b u s is busy. An a l t e r n a t i v e is to
r e q u e s t t h e b u s a s h o r t t i m e b e f o r e t h e e n d of e x e c u t i o n a n d to p r o v i d e buffering a t t h e o u t p u t of t h e
f u n c t i o n a l u n i t [TOMA67].
(3) It c o p i e s its d e s t i n a t i o n tag i n t o t h e f u n c t i o n a l u n i t
control pipeline b e c a u s e the destination tag m u s t
be a t t a c h e d to the result when the instructio n completes.
When a n e i t h e r a l o a d o r f u n c t i o n a l u n i t i n s t r u c t i o n
c o m p l e t e s , its r e s u l t a n d c o r r e s p o n d i n g d e s t i n a t i o n t a g
a p p e a r on t h e r e s u l t bus. (In a p r a c t i c a l i m p l e m e n t a tion, t h e t a g will p r o b a b l y p r e c e d e t h e d a t a b y one c l o c k
p e r i o d . ) The d a t a is s t o r e d in all t h e r e s e r v a t i o n s t a t i o n s
a n d r e g i s t e r s t h a t h a v e t h e r e a d y bit c l e a r a n d a t a g t h a t
m a t c h e s t h e t a g of t h e r e s u l t . T h e n t h e r e a d y bit is s e t
to signal a valid o p e r a n d .
3.1. P e r f o r m a n c e R e s u l t s
Table 2 s h o w s t h e r e s u l t s of s i m u l a t i n g t h e CRAY-1
i m p l e m e n t e d with Tomasulo's algorithm.
The t o t a l
s p e e d u p a c h i e v e d was 1.43. We r e c o g n i z e t h a t t h e s e a r e
in a s e n s e , " t h e o r e t i c a l m a x i m u m s p e e d u p s " ; a n y
l e n g t h e n i n g of t h e c l o c k p e r i o d d u e t o l o n g e r c o n t r o l
p a t h s will d i m i n i s h t h i s s p e e d u p .
Loop
1
2
3
4
5
6
7
8
9
i0
ii
12
13
14
total
B e c a u s e i n s t r u c t i o n i s s u e t a k e s p l a c e in two p h a s e s
(the first m o v es an i n s t r u c t i o n to a reservation station
a n d t h e s e c o n d m o v e s it o n to t h e f u n c t i o n a l u n i t for
e x e c u t i o n ) we a s s u m e t h a t e a c h p h a s e t a k e s a full c l o c k
p e r i o d . In t h e CRAY-1 t h e r e is only a o n e c l o c k p e r i o d
d e l a y in moving a n i n s t r u c t i o n t o a f u n c t i o n a l unit. We
r e c o g n i z e t h e one c l o c k p e r i o d d i f f e r e n c e in o u r s i m u l a t i o n m o d e l , so t h a t t h e m i n i m u m t i m e for a n i n s t r u c t i o n
to c o m p l e t e is one c l o c k p e r i o d g r e a t e r t h a n in t h e
CRAY-1. This t a k e s i n t o a c c o u n t s o m e of t h e l o s t t i m e
Loop
1
2
3
4
5
6
7
8
9
I0
11
12
13
# clock cycles
on CRAY-1
18046
16918
38039
22198
21707
23045
10361
7841
10146
10230
30011
29999
18858
14
22391
total
281790
# clock cycles
for T o m a s u l o ' s
algorithm
10838
14102
30017
16534
16925
15042
6513
6780
8238
7421
20006
20000
12314
12780
197510
# clock cycles
1 parcel/cp
18046
18918
38039
22198
21707
23045
10361
7841
10146
10230
30011
29999
18858
22391
261790
# clock cycles
l instr/cp
17244
17717
37037
21163
20709
22045
10241
6874
9744
9430
28013
28001
17957
22087
268262
speedup
1.05
1.07
1.03
1.05
1.05
1.05
1.01
1.14
1.04
1.08
1.07
1.07
1.05
1.01
1.05
Table 3 -- C o m p a r i s o n of t h e 14 L a w r e n c e L i v e r m o r e
Loops on CRAY-I: one i n s t r u c t i o n p e r c l o c k p e r i o d
vs one p a r c e l p e r c l o c k p e r i o d .
W e noticed while doing the simulations that limiting
instruction fetches to the m a x i m u m rate of one parcel
per clock period appeared to be restricting performance. H e n c e , we m o d i f i e d t h e i m p l e m e n t a t i o n so t h a t
a full i n s t r u c t i o n c o u l d b e f e t c h e d a n d i s s u e d t o a r e s e r v a t i o n s t a t i o n e a c h c l o c k p e r i o d . This would b e slightly
m o r e e x p e n s i v e t o i m p l e m e n t , b u t f o r T o m a s u l o ' s algor i t h m it gives a s i g n i f i c a n t p e r f o r m a n c e i m p r o v e m e n t
over one parcel p e r clock period.
To k e e p c o m p a r i s o n s fair, we w e n t b a c k a n d
m o d i f i e d t h e o r i g i n a l CRAY-1 s i m u l a t i o n m o d e l so t h a t it,
too, c o u l d i s s u e i n s t r u c t i o n s at t h e h i g h e r r a t e . Table 3
gives t h e r e s u l t s of t h e s e s i m u l a t i o n s , a n d c o m p a r e s
t h e m with t h e one p a r c e l p e r c l o c k p e r i o d r e s u l t s g i v e n
e a r l i e r . H e r e , t h e p e r f o r m a n c e i m p r o v e m e n t is small.
This is a n i n t e r e s t i n g r e s u l t in itself, a n d shows t h e wisd o m of o p t i n g for s i m p l e r i n s t r u c t i o n f e t c h logic in t h e
original CRAY-1.
Because the higher i n s t r u c t i o n fetch rate does
a p p e a r to alleviate a b o t t l e n e c k t h a t r e d u c e s t h e
e f f i c i e n c y of T o m a s u l o ' s a l g o r i t h m , we i n c o r p o r a t e d it
i n t o t h e m o d e l f o r t h e s t u d i e s t o follow, a n d u s e t h e
CRAY-1 r e s u l t s of Table 3 (1 i n s t r u c t i o n p e r c l o c k p e r i o d )
as a b a s i s for f u r t h e r c o m p a r i s o n s .
speedup
1.67
1.34
1.27
1.34
1.26
1.53
1.59
1.16
1.23
1.38
1.50
1.50
1.53
1.75
1.43
Table 2 -- P e r f o r m a n c e with T o m a s u l o ' s Algorithm;
One P a r c e l I s s u e d p e r Clock P e r i o d
114
T a b l e 4 shows s i m u l a t i o n r e s u l t s for T o m a s u l o ' s
a l g o r i t h m with o n e i n s t r u c t i o n i s s u e d p e r c l o c k p e r i o d .
The s p e e d u p a c h i e v e d is i n t h e r a n g e 1.23 - 2.02 ( t o t a l
1.58). In t h r e e o u t of t h e f o u r l o o p s w h o s e s p e e d u p was
Loop
# clock cycles
o n CRAY-1
1
2
3
4
5
6
7
6
9
10
ii
12
13
14
Lotal
17244
17717
37037
21163
20709
22045
10241
6874
9744
9430
28013
28001
17957
22087
268262
# clock cycles
for Tomasulo's
algorithm
8832
11062
28012
15418
16898
12956
5069
5195
8332
5318
17871
16001
9364
11713
170041
1: $5 <"-TOO
A1 <-5.5
$6 <-oII1,A1
Sl<-off2.A1
S4<- S6 - F S1
$3 <-S5 + S7
A2 < - B02
A0 ~ &?. + I
O3, AI<..-S4
TOO ~ - $3
[]02 ~ AO
JAM
I
speedup
1.95
1.60
1.32
1.37
1.23
1.70
2.02
1.32
1.54
1.77
1.57
1.75
1.92
1.69
1.58
.COPY TOO TO SS
,COPY 55 TO A1
.LOAD 56 (ADDRESS INDEXED BY A1)
.LOAD S1 (ADORESS INDEXED BY A1)
.FLOATINGDIFFERENCE OF $6 AND S1 TO S4
.INTEGERSUM OF S5 AND S7 TO $3
.COPY B02 TO A2
.INTECERSUM OF A2 AND 1 TO A0
,STORE S4 (ADORESS IN[Tc.XEDBY A1)
.COPY $3 TO TOO
.COPY AO TO B02
.BRANCH TO LOOP ENTRY
(a) LLL 12 in CRAY A s s e m b l y L a n g u a g e ;
a r r o w s i n s e r t e d for r e a d a b i l i t y .
1;
55 4 - TO0
AS ~-S,5
H
H
H
s6 ~offl,A~
Sl<- off2,A1
H
: ', : : ', : ', ', ', ', I ',
', : ', : ', ; ', : ', 11 I
$4~- 56 - F Sl
S3-,-S5 + S7
A2 <-- B02
A0<--A2 + 1
Q3. AI<'- $4
T00 < - 53
I]02 < - AO
J*M
', ', ', ', ', ', ', I ', I : :
::::',::
IIII
H
H
H
H
I
: : : ', ', ',
(h) The CRAY-1
I;
Table 4 -- P e r f o r m a n c e of T o m a s u l o ' s AlgoriLhm;
One I n s t r u c t i o n I s s u e d p e r Clock P e r i o d
$5 'q'- TO0
AI ,~,-S,5
S6 < - off 1,A1
Sl 4--- oII2,A1
54<- $6 - F Sl
H
H
I-H
I:',::~:::',111
I':::::::;llll
H.I.H,H-H.I.i-:
I.H
I:::~:::~::::
I'::::::::::::
: : :: ::
I.I.I.I.I,I.I.H.H.: : : : : : :
I.H-H
t.H-H
H
H
I.t-H
I-H-I
I't'I-H-H-t.I.H.I.t.H
I.H.H.H.I-t.I.H-I'H
I'H
I-H
I.H
I-H
S3.P-S5 + S7
~ . ~-B02
AO,~-.~ + 1
03, A 1 < - $ 4
100 ~'- S3
B02 ~ AO
less t h a n 1.5 (i.e. loops 3, 4 a n d 8) i s s u e was h a l t e d d u e
t o a b u s y T file. With a n a r c h i t e c t u r e t h a t d o e s n ' t h a v e
t h e l a r g e n u m b e r of r e g i s t e r s t h e CRAY-1 h a s , it would b e
p o s s i b l e t o u s e t a g s for all t h e r e g i s t e r s , t h u s i n c r e a s i n g
t h e s p e e d u p of t h e a b o v e loops.
3.2. ~ m p l e
F i g u r e 4 shows a t i m i n g d i a g r a m of T o m a s u t o ' s algor i t h m c o m p a r e d w i t h t h a t of CRAY-1, b o t h e x e c u t i n g loop
12. On t h e left a p p e a r t h e i n s t r u c t i o n s as g e n e r a t e d b y
t h e CFT c o m p i l e r . The t i m i n g for two c o n s e c u t i v e loop
i t e r a t i o n s a r e s h o w n n e x t t o e a c h o t h e r . E a c h "l-I"
r e p r e s e n t s o n e c l o c k p e r i o d . F r o m t h e s t a n d p o i n t of
i s s u e logic, w h e n s t o r e i n s t r u c t i o n s a r e i n i t i a t e d t h e y So
to memory and are no longer considered. Therefore,
s t o r e s a r e s h o w n t o e x e c u t e only f o r t h e c l o c k p e r i o d
t h e y a r e i n i t i a t e d . Solid l i n e s i n d i c a t e t h a t t h e r e s p e c tive i n s t r u c t i o n is i n e x e c u t i o n . F o r T o m a s u l o ' s algor i t h m , a d o t t e d line shows t h a t a n i n s t r u c t i o n h a s b e e n
i s s u e d t o a r e s e r v a t i o n s t a t i o n , a n d is waiting for
operand(s).
With t h e o r i g i n a l CRAY-1 issue a l g o r i t h m , all i n s t r u c tions in a loop must begin execution and a loopterminating conditional branch instruction must comp l e t e b e f o r e t h e n e x t loop i t e r a t i o n c a n begin. With
T o m a s u l o ' s a l g o r i t h m , loop i t e r a t i o n s c a n b e " t e l e s c o p e d " ; a s e c o n d loop i t e r a t i o n c a n b e g i n a f t e r all
i n s t r u c t i o n s h a v e b e e n s e n t to r e s e r v a t i o n s t a t i o n s a n d
t h e c o n d i t i o n a l b r a n c h is c o m p l e t e d . The i n s t r u c t i o n s
b e l o n g i n g t o t h e f i r s t loop i t e r a t i o n do n o t n e c e s s a r i l y
n e e d t o b e g i n e x e c u t i o n . One c o u l d also view t h i s as
d y n a m i c r e s c h e d u l i n g of t h e b r a n c h i n s t r u c t i o n . Obviously, t h e b r a n c h i n s t r u c t i o n is o n e i n s t r u c t i o n t h a t c a n
n o t b e m o v e d e a r l i e r in t h e loop as would h a v e t o b e t h e
c a s e w i t h s t a t i c r e s c h e d u l i n g . The s i g n i f i c a n t s p e e d u p of
T o m a s u l o ' s a l g o r i t h m for t h i s loop (1.75, s e e Table 4) is
d u e m a i n l y t o t h e o v e r l a p of t h e loading of r e g i s t e r s $6
a n d S1 w i t h t h e f l o a t i n g p o i n t d i f f e r e n c e (-F) w h i c h u s e s
$6 a n d S1 as o p e r a n d s . A l t h o u g h -F c a n n o t b e e x e c u t e d
until the operands return from memory, it can be issued
t o a r e s e r v a t i o n s t a t i o n ; t h i s allows following i n s t r u c t i o n s
to p r o c e e d . On t h e o t h e r h a n d , t h e CRAY-1 e x e c u t e s t h e
load of S1 a n d t h e f l o a t i n g p o i n t d i f f e r e n c e s t r i c t l y in
order.
(c)
Tomasulo's algorithm
F i g u r e 4 -- Ttmir~_ D i a g r a m s for L a w r e n c e L i v e r m o r e Loop 12.
One c a n s e e f r o m t h e t i m i n g d i a g r a m s t h a t w i t h
Tomasulo's algorithm only 2 reservation stations are
u s e d for m o r e t h a n 1 c l o c k p e r i o d . The s t o r e i n s t r u c t i o n
n e e d s a r e s e r v a t i o n s t a t i o n t h a t is r e l e a s e d a s h o r t
p e r i o d of t i m e b e f o r e t h e n e x t s t o r e is i s s u e d . A n o t h e r
r e s e r v a t i o n s t a t i o n is u s e d e x t e n s i v e l y b y t h e floating
point difference instruction.
4. T h o r u t o n ' s A l g o r i t h m
T o m a s u l o ' s a l g o r i t h m l e a d s t o c o m p l e x i s s u e logic
a n d m a y b e q u i t e e x p e n s i v e t o i m p l e m e n t . T h e r e f o r e , in
t h i s s e c t i o n a n d in t h e n e x t s e c t i o n , we c o n s i d e r ways to
r e d u c e t h e cost. One m a j o r c o s t is t h e a s s o c i a t i v e
h a r d w a r e n e e d e d t o m a t c h tags. When a n o p e r a n d a n d
its a t t a c h e d t a g a p p e a r o n a bus, r e g i s t e r files A a n d S
a n d all t h e r e s e r v a t i o n s t a t i o n s h a v e t o b e s e a r c h e d
s i m u l t a n e o u s l y . The o p e r a n d is s t o r e d in a n y r e g i s t e r or
r e s e r v a t i o n s t a t i o n w i t h a m a t c h i n g tag.
In t h i s s e c t i o n , we i m p l e m e n t t h e CRAY-1 s c a l a r
a r c h i t e c t u r e w i t h a n i s s u e m e t h o d t h a t is a d e r i v a t i v e of
T h o r n t o n ' s " s c o r e b o a r d " a l g o r i t h m u s e d in t h e CDC6600.
Here, c o n t r o l is m o r e d i s t r i b u t e d ( t h e r e is n o g l o b a l
scoreboard), and reservation stations have been added
t o f u n c t i o n a l u n i t s . The p r i m a r y d i f f e r e n c e b e t w e e n
T h o r n t o n ' s a l g o r i t h m a n d T o m a s u l o ' s is t h a t i n s t r u c t i o n
i s s u e is h a l t e d w h e n t h e d e s t i n a t i o n of a n i n s t r u c t i o n is a
r e g i s t e r t h a t is b u s y . This simplifies t h e i s s u e logic
h a r d w a r e in t h e following ways.
(1)
(2)
115
The a s s o c i a t i v e c o m p a r e w i t h t h e r e g i s t e r file t a g s is
eliminated.
Tag a l l o c a t i o n a n d d e - a l l o c a t i o n h a r d w a r e is eliminated because the result register designator acts
a s t h e tag.
rRI
we a s s u m e t h a t $2 a n d $4 a r e r e a d y , while $3 is n o t (e.g.
it awaits a n o p e r a n d f r o m m e m o r y ) . The s e c o n d i n s t r u c t i o n will b e c o m p l e t e d b e f o r e t h e f i r s t one, b u t on t h e
CDC 6600 t h e r e s u l t c a n n o t be s t o r e d i n t o $2 s i n c e it
s e r v e s as a s o u r c e r e g i s t e r for t h e first i n s t r u c t i o n .
T h e r e f o r e , t h e a d d f u n c t i o n a l u n i t will r e m a i n b u s y u n t i l
t h e f i r s t i n s t r u c t i o n c o m p l e t e s as well. On t h e o t h e r
h a n d , with t h e a l g o r i t h m we h a v e given, $2 h a s b e e n
c o p i e d i n t o t h e floating p o i n t m u l t i p l y r e s e r v a t i o n s t a t i o n w h e n t h e first i n s t r u c t i o n was i s s u e d , a n d t h e r e f o r e
t h e s e c o n d i n s t r u c t i o n c a n c o m p l e t e b e f o r e t h e f i r s t one.
[ ~OISTE~
FILE
F-
__u~__~_L. . . . ~ - - - . - ~ - ~ ; ~ _
4.1.
I
l F~.J~CTIOI~UNJT
AL 2
........
1- .......
Performance
Results
We originally p l a n n e d t o s i m u l a t e t h e s c o r e b o a r d as
d e s i g n e d b y T h o r n t o n , b u t d e c i d e d t h a t it would l e a d t o a
m o r e i n t e r e s t i n g c o m p a r i s o n if m u l t i p l e r e s e r v a t i o n
---
F i g u r e 5 -- T h o r n t o n ' s i s s u e logic.
F i g u r e 5 i l l u s t r a t e s t h e a l g o r i t h m . Most r e s e r v a t i o n
s t a t i o n s h o l d two o p e r a n d s ( a l t h o u g h s o m e n e e d only
one, d e p e n d i n g o n t h e f u n c t i o n a l unit} a n d t h e a d d r e s s of
t h e d e s t i n a t i o n r e g i s t e r (DR). A t t a c h e d to e a c h o p e r a n d
is t h e s o u r c e r e g i s t e r d e s i g n a t o r (SR) a n d a r e a d y flag
(R). Also, a t t a c h e d t o e a c h r e g i s t e r in t h e r e g i s t e r file is
a r e a d y bit (R). (These a r e t h e s a m e as t h e r e s e r v e d b i t s
u s e d in t h e original CRAY-1 c o n t r o l . )
The following o p e r a t i o n s a r e p e r f o r m e d w h e n a n
i n s t r u c t i o n is i s s u e d :
(1)
A r e s e r v a t i o n s t a t i o n of t h e r e q u e s t e d f u n c t i o n a l
unit, if available, is r e s e r v e d . Otherwise, i s s u e is
blocked.
(2) If a s o u r c e r e g i s t e r is r e a d y , it is c o p i e d i n t o t h e
r e s e r v a t i o n s t a t i o n a n d t h e r e a d y bit is s e t . O t h e r wise, t h e r e a d y bit is c l e a r e d a n d t h e s o u r c e
r e g i s t e r ' s d e s i g n a t o r is s t o r e d i n t o t h e SR field.
The c o n d i t i o n s f o r moving a n i n s t r u c t i o n f r o m a
r e s e r v a t i o n s t a t i o n t o b e g i n e x e c u t i o n a r e t h e s a m e as
g i v e n for T o m a s u l o ' s a l g o r i t h m in t h e p r e v i o u s s e c t i o n .
When a f u n c t i o n a l u n i t is f i n i s h e d .with a n i n s t r u c t i o n t h e
r e s u l t r e g i s t e r d e s i g n a t o r is m a t c h e d a g a i n s t all t h e SR
fields in t h e r e s e r v a t i o n s t a t i o n s , a n d t h e r e s u l t is writt e n i n t o t h e r e s e r v a t i o n s t a t i o n s w h e r e t h e r e is a m a t c h .
The c o r r e s p o n d i n g o p e r a n d r e a d y b i t s a r e t h e n s e t . The
r e s u l t is also s t o r e d i n t o t h e d e s t i n a t i o n r e g i s t e r , a n d it
is s e t r e a d y .
Loop
# clock cycles
on CRAY-1
1
2
3
4
5
6
7
8
9
i0
ii
12
13
14
total
17244
17717
37037
21163
20709
22045
10241
6874
9744
9430
28013
26001
17957
22087
268262
# clock cycles
for T h o r n t o n ' s
algorithm
12434
13500
28020
19578
17030
18370
8671
6381
8634
7820
iBO01
16003
14870
20273
209585
speedup
1.39
1.31
1.32
L06
1.22
1.20
1.16
1.08
1.13
1.21
1.56
1.75
1.21
1.09
1.28
Table 5 -- P e r f o r m a n c e of T h o r n t o n ' s Algorithm;
One I n s t r u c t i o n I s s u e d p e r Clock P e r i o d
s t a t i o n s w e r e allowed. The r e s u l t s of t h e s i m u l a t i o n a r e
s h o w n in Table 5. The t o t a l s p e e d u p is 1.28. We d i s c u s s
ways t h i s c a n b e i m p r o v e d b y s t a t i c c o d e s c h e d u l i n g in
S e c t i o n 6.
5. A n I s s u e M e t h o d U s i n g a D i r e c t Tag S e a r c h
In t h i s s e c t i o n we p r o p o s e a n a l t e r n a t i v e i s s u e algor i t h m t h a t is r e l a t e d t o T o m a s u l o ' s a l g o r i t h m , b u t w h i c h
e l i m i n a t e s t h e n e e d for a s s o c i a t i v e t a g c o m p a r i s o n
h a r d w a r e in t h e r e s e r v a t i o n s t a t i o n s . This a l g o r i t h m
i n s t e a d u s e s a d i r e c t tag s e a r c h (DTS), a n d will b e
r e f e r r e d t o a s t h e "DTS" a l g o r i t h m . The DTS a l g o r i t h m
imposes the r e s t r i c t i o n that a particular tag can be
s t o r e d only in o n e r e s e r v a t i o n s t a t i o n . This is easily
i m p l e m e n t e d b y a s s o c i a t i n g w i t h e a c h t a g in t h e t a g pool
a u s e d bit. W h e n e v e r a r e g i s t e r t h a t is n o t r e a d y is
a c c e s s e d for t h e f i r s t t i m e , its t a g is c o p i e d to t h e
r e s p e c t i v e r e s e r v a t i o n s t a t i o n a n d t h e u s e d bit is set. A
s e c o n d a t t e m p t t o u s e t h e s a m e t a g will b l o c k issue.
A l t h o u g h t h e above is d e r i v e d f r o m T h o r n t o n ' s original a l g o r i t h m , t h e r e a r e s o m e d i f f e r e n c e s . The f u n c t i o n a l u n i t s of t h e CDC 6600 do n o t h a v e r e s e r v a t i o n s t a t i o n s (one c o u l d s a y t h a t t h e y h a v e r e s e r v a t i o n s t a t i o n s
of d e p t h 1 t h a t a r e n o t able t o h o l d o p e r a n d s , only cont r o l i n f o r m a t i o n ) . This i m p o s e s s o m e a d d i t i o n a l r e s t r i c tions. An i n s t r u c t i o n t o b e e x e c u t e d on a p a r t i c u l a r
f u n c t i o n a l u n i t c a n b e i s s u e d e v e n if its s o u r c e r e g i s t e r s
are not ready, but a second instruction requiring the
s a m e f u n c t i o n a l u n i t will b l o c k i s s u e until t h e f i r s t one is
d o n e . On t h e o t h e r h a n d , w i t h r e s e r v a t i o n s t a t i o n s ,
s e v e r a l i n s t r u c t i o n s c a n walt for o p e r a n d s a t t h e i n p u t of
t h e f u n c t i o n a l unit. F o r e x a m p l e ,
S1 <- $ 2 " F S 3
$4 <- $ 5 " F S 8
with t h e CDC 8800 s c o r e b o a r d t h e s e c o n d i n s t r u c t i o n will
block, while with t h e a l g o r i t h m h e r e it will b e i s s u e d .
A n o t h e r r e s t r i c t i o n of t h e s c o r e b o a r d is t h e following. In t h i s e x a m p l e ,
S1 <- $ 2 " F S 3
$2 <- $4 + 1
STATION
F i g u r e 6 -- Tag S e a r c h Table for t h e DTS Algorithm.
116
Figure 7 illustrates an example, extracted from
L a w r e n c e L i v e r m o r e Loop 4. The i n s t r u c t i o n "T02 <$5" s a v e s r e g i s t e r $5 for u s e d u r i n g t h e n e x t p a s s
t h r o u g h t h e loop. F o r t h e DTS i s s u e logic, t h i s i n s t r u c t i o n
h a s t o b e b l o c k e d , s i n c e it a t t e m p t s to u s e r e g i s t e r $5 as
a s o u r c e for t h e s e c o n d t i m e ( r e g i s t e r $5 is n o t r e a d y
a n d was u s e d for t h e f i r s t t i m e b y t h e p r e v i o u s i n s t r u c tion). However, n e i t h e r $5 n o r T02 a r e u s e d b e f o r e t h e
b r a n c h i n s t r u c t i o n , so t h i s i n t e r l o c k c o u l d b e p o s t p o n e d
b y m o v i n g i n s t r u c t i o n "T02 <- $5" down, j u s t b e f o r e t h e
The DTS a l g o r i t h m allows i m p l e m e n t a t i o n of t h e t a g
s e a r c h m e c h a n i s m b y a t a b l e i n d e x e d b y t a g s (Fig. 6),
r a t h e r t h a n a s s o c i a t i v e h a r d w a r e . F o r e a c h t a g t h e r e is
o n e e n t r y i n t h e t a b l e t h a t s t o r e s t h e a d d r e s s of a r e s e r v a t i o n s t a t i o n . The t a b l e is s m a l l s i n c e t h e r e a r e few
t a g s (we u s e d 5 b i t s for e a c h tag; t h e r e a r e 32 t a g s ) .
P e r f o r m a n c e Results
A comparison of the results for the DTS algorithm
with T o m a s u l o ' s a l g o r i t h m r e v e a l s t h a t f o r 9 o u t of t h e 14
loops t h e DTS a l g o r i t h m a c h i e v e s s p e e d u p s i m i l a r to
T o m a s u l o ' s a l g o r i t h m . This shows t h a t t h e r e s t r i c t i o n
i m p o s e d b y t h e DTS a l g o r i t h m , n a m e l y t h a t a t a g c a n b e
in n o m o r e t h a n o n e r e s e r v a t i o n s t a t i o n a t a g i v e n t i m e ,
h a s only a l i m i t e d e f f e c t o n p e r f o r m a n c e . The r e a s o n is
t h a t t h e following p a t t e r n is q u i t e c o m m o n :
S1 <- offl,A1
S2 <- off2,A1
S3 <- S I * F S 2
$5 <- $ 3 + F S 4
T h a t is, two r e g i s t e r s a r e l o a d e d a n d a r e s e n t to a f u n c t i o n a l u n i t w h o s e r e s u l t is i n p u t t o a n o t h e r f u n c t i o n a l
u n i t , a n d so on. The DTS a l g o r i t h m is able to p r o c e s s
s u c h c o d e a t full s p e e d .
On t h e o t h e r h a n d , if a r e g i s t e r is u s e d as a n i n p u t
t o a f u n c t i o n a l u n i t a n d at t h e s a m e t i m e h a s to b e
s t o r e d (in t h e m e m o r y , o r t e m p o r a r i l y in t h e T o r B file),
t h e n t h e r e g i s t e r , a n d its tag, h a v e to b e u s e d twice, so
t h e DTS a l g o r i t h m b l o c k s issue. S u c h c a s e s a c c o u n t for
t h e lower s p e e d u p of 5 o u t of 14 loops.
5.1.
Loop
# clock cycles
o n CRAY-1
1
2
3
4
5
8
7
8
9
I0
II
12
13
14
t o t al
17244
17717
37037
21163
20709
22045
10241
6874
9744
9430
28013
28001
17957
22087
268262
# clock cycles
for t h e DTS
algorithm
8832
11083
28020
20534
17690
17027
5069
5361
6732
8518
17871
16001
14356
17421
194515
do 175 1=7,107,50
lw=l
do 4 ]=30,870,5
x(I-1) = x(1-1) - x(lw)*yO)
4 lw= lw+ 1
x(l-l) = y(5)*x(l-l)
175 continue
(a) F o r t r a n c o d e for L a w r e n c e L i v e r m o r e Loop 4
(banded linear equations).
L00P4:$6
S1
A3
A2
$3
$5
$4
$3
$5
$4
At
$3
off4,A7
T02
TOI
A0
TOO
B02
JAM
speedup
1.95
1.80
1.32
1.03
1.17
1.29
2.02
1.28
1.45
1.11
1.57
1.75
1.25
1.27
I. 38
<<<<<<<<<<<<<<<<<<-
TOO
T01
S1
$6
off2,A3
; x(lw)
off3,A2
;y(j)
$5"RS3
TO2
$3-FS4
SI+$2
B02
$6 + $7
$5
; x(l-1)
$5
$4
A1 + 1
$3
AO
LOOP4
(b) C o m p i l e d o b j e c t c o d e ( t h e i n n e r loop)
e x t r a c t e d f r o m L ~ w r e n c e L i v e r m o r e Loop 4.
F i g u r e 7 -- E x a m p l e of i n t e r l o c k for t h e
DTS I s s u e Logic.
Table 6 -- P e r f o r m a n c e of DTS I s s u e Logic.
6.
Farther Comments
on Code
b r a n c h . The 4 c l o c k c y c l e s t h u s saved, m u l t i p l i e d b y t h e
n u m b e r of i t e r a t i o n s t h r o u g h t h e loop, r e s u l t in a 4% p e r formanee improvement.
F i g u r e 8 shows a n o t h e r e x a m p l e , e x t r a c t e d f r o m
loop 1. R e g i s t e r s $2, $6, $1, $3 a n d $ 4 a r e d e s i g n a t e d as
d e s t i n a t i o n r e g i s t e r s for t h e f i r s t t i m e in t h e u p p e r half
of t h e loop, a n d t h e n for t h e s e c o n d t i m e , a l m o s t in t h e
s a m e o r d e r , in t h e lower h a l f of t h e loop. The s e c o n d
u s a g e of $2 ( $ 2 <- $ 4 *R $6) as a d e s t i n a t i o n r e g i s t e r
c a u s e s a n i m m e d i a t e b l o c k a g e for T h o r n t o n ' s a l g o r i t h m ,
s i n c e $2 is still b u s y f r o m t h e p r e v i o u s l o a d i n s t r u c t i o n
($2'<offl,A1). Any i n d e p e n d e n t i n s t r u c t i o n i n s e r t e d
j u s t b e f o r e i n s t r u c t i o n " $ 2 <- $4 *R $6" will e x e c u t e for
free. T h e r e a r e 3 s u c h i n s t r u c t i o n s b e f o r e t h e b r a n c h :
A2 <- B02
A0 < - A 2 + 1
B02 <- A0
This s i m p l e r e s e h e d u l i n g gives a p e r f o r m a n c e i m p r o v e m e n t of 10.7g.
Scheduling
F o r all t h e v a r i o u s i s s u e logic s i m u l a t i o n s , we u s e d
as i n p u t t h e o b j e c t c o d e g e n e r a t e d b y t h e CRAY-1 o p t i m i z i r ~ FORTRAN c o m p i l e r , w i t h o u t a n y c h a n g e s . T h e r e fore, t h e level of c o d e o p t i m i z a t i o n a n d s c h e d u l i n g is
r e a l i s t i c for t h e CRAY-1. T o m a s u l o ' s a l g o r i t h m is l e s s
s e n s i t i v e t o t h e o r d e r of t h e i n s t r u c t i o n s , s i n c e i t d o e s a
g r e a t d e a l of d y n a m i c s c h e d u l i n g . It is also c a p a b l e of
d y n a m i c r e g i s t e r r e - a l l o c a t i o n so it is n o t as s u s c e p t i b l e
t o t h e c o m p i l e r ' s r e g i s t e r a l l o c a t i o n m e t h o d . On t h e
other hand, static scheduling and register allocation has
a s i g n i f i c a n t i m p a c t o n t h e p e r f o r m a n c e of t h e DTS i s s u e
logic a n d T h o r n t o n ' s a l g o r i t h m . S i n c e we h a v e n o t r e o r g a n i z e d t h e c o d e for t h e l a t t e r two s c h e m e s , m a n y
d e p e n d e n c i e s a n d r e s o u r c e conflicts t h a t a p p e a r i n t h e
c o m p i l e d c o d e c o n t r i b u t e to lower p e r f o r m a n c e as c o m pare d with Tomasulo' s algorithm.
117
i s s u e logic c o m p l e x i t y . T o m a s u l o ' s a l g o r i t h m gives a total
s p e e d u p of 1.58. The d i r e c t t a g s e a r c h (DTS) a l g o r i t h m ,
i n t r o d u c e d in t h i s p a p e r , allows d y n a m i c s c h e d u l i n g while
e l i m i n a t i n g t h e n e e d for a s s o c i a t i v e t a g c o m p a r i s o n
h a r d w a r e . The DTS i s s u e logic a c h i e v e s a t o t a l s p e e d u p of
1.38, a n d t h u s r e t a i n s m u c h of t h e p e r f o r m a n c e g a i n of
T o m a s u l o ' s a l g o r i t h m . A d e r i v a t i v e of T h o r n t o n ' s a l g o r i t h m
gives a t o t a l g a i n of 1.28.
In o u r m o d e l , t h e l a r g e i n t e r m e d i a t e r e g i s t e r files of
t h e CRAY-1 (B a n d T) a r e t r e a t e d as a unit, s i n c e it is n o t
p r a c t i c a l t o a s s i g n t a g s t o so m a n y r e g i s t e r s . With
T o m a s u i o ' s a l g o r i t h m , t h i s was t h e c a u s e of a r e l a t i v e l y low
p e r f o r m a n c e i m p r o v e m e n t for t h r e e loops. With a n a r c h i t e c t u r e t h a t does n o t h a v e t h e l a r g e n u m b e r of r e g i s t e r s
t h e CRAY-1 has, it would b e p o s s i b l e t o u s e t a g s f o r all t h e
r e g i s t e r s , t h u s i n c r e a s i n g t h e s p e e d u p of t h e a b o v e loops.
Finally, we h a v e d i s c u s s e d t h e i m p a c t of c o d e s c h e d u l i r ~ on t h e s i m u l a t i o n r e s u l t s . T o m a s u l o ' s a l g o r i t h m is l e s s
s e n s i t i v e t o t h e o r d e r of t h e i n s t r u c t i o n s , s i n c e i t d o e s a
g r e a t d e a l of d y n a m i c s c h e d u l i n g a n d r e g i s t e r r e a l l o c a t i o n . On t h e o t h e r h a n d , s t a t i c s c h e d u l i n g h a s a
s i g n i f i c a n t i m p a c t o n t h e p e r f o r m a n c e of t h e DTS a n d
T h o r n t o n ' s a l g o r i t h m s . We g a v e s p e c i f i c e x a m p l e s h o w t h e
p e r f o r m a n c e of t h e l a t t e r two a l g o r i t h m s c a n b e i m p r o v e d
even by simple code scheduling.
q = 0.0
do l k = 1,400
x(k) = q + y(k)*(r*z(k+10) + t'z(k+ll))
(a) F o r t r a n c o d e for L a w r e n c e L i v e r m o r e Loop 1
(hydro excerpt).
L00PI:
$5
A1
$2
$6
S1
$3
S4
$2
$6
S1
$4
$3
A2
A0
off4,Al
TOO
B02
JAM
<-TOO
<- $5
<- o f f l , A l
<- ofl2,Al
<- TO2
<- S1 * R S 2
<- TO1
<- S 4 * R $6
<- off3,A1
<- $ 2 + F S 3
<- S 6 * R S I
<- $5 + $7
<- BO2
<- A2 + 1
<- $4
<- $ 3
<- A0
LOOP 1
; z(k+10)
; z(k+ 1 I)
8. Acknowledgements
This is material based upon work supported by the
National Science Foundation under Grant ECS-8207277.
The authors would like to thank Nick Pang for the
CRAY-I simulators used for generating the results in
T a b l e s 1 a n d 3.
(b) A s s e m b l y c o d e e x t r a c t e d f r o m
L a w r e n c e L i v e r m o r e Loop 1.
F i g u r e 8 -- E x a m p l e of i n t e r l o c k for T h o r n t o n ' s A l g o r i t h m .
S i n c e t h e r e is a l a r g e g a p b e t w e e n T o m a s u l o ' s a n d
T h o r n t o n ' s a l g o r i t h m s for loop 1 (1.95 v s 1.39), we t r i e d t o
s e e if t h i s g a p c o u l d b e c l o s e d b y r e a l l o c a t i n g r e g i s t e r s f o r
T h o r n t o n ' s a l g o r i t h m . The following c o d e d o e s t h e p r o c e s s ing of loop 1, e x c e p t for i n d e x a n d a d d r e s s c a l c u l a t i o n s :
S1 <- o f f l , A l
$ 2 <- off2.Al
$ 4 <- S1 *F $ 3
$6 <- $2 *F $5
$7 <- $ 4 + F $6
$8 <- olT3,AI
$9 <- $7 *F $8
off4,Al <- $9
Since registers are not re-used during the s ame pass
through the loop, Thornton's algorithm would run as fast
as Tomasulo's. But we need 9 scalar registers, m o r e than
available on the CRAY-I. The high speedup achieved by
Tomasulo's algorithm demonstrates the importance of its
abilityto reallocate registers dynamically.
9. References
[ANDE67] D.W. Anderson. F.J. Sparacio, F.M. Tomasulo, "The IBM
S y s t e m / 3 6 0 Model 91: Machine Philosophy a n d I n s t r u c t i o n Handling", IBM Journal, V 11, J a n 1967
[BOLA67] LJ. Boland, G.D. Granito, A.U. Marcotte, B.U. Messina,
J.W. Smith, "The IBM S y s t e m / 3 6 0 Model 91: Storage System", IBM Journal, V 11, J a n 1987.
[BONS69] P. Bonseigneur, "Description of t h e 7600 C o m p u t e r
System," C o m p u t e r Group News, May 1989.
[BUCH62] W. Bucholz,ed., Planning a Computer System,
McGraw-Hill, New York, 1962.
•
[CDC81] "CDC CYBER 200 Model 205 C o m p u t e r S y s t e m
Hardware R e f e r e n c e Manual," Control Data Corp., Arden
Hills, MN, 1981.
[CRAY77] "The CRAY-1 Computing System", Cray Research,
Inc., P u b l i c a t i o n n u m b e r 2240008 B, 1977.
7. ~ ] m m R r ~ a n d Conclusions
W e have discussed design tradeoffs for control of pipelined processors. Performance of a pipelined processor
depends greatly on its clock period and the order in which
instructions are executed. SLrnple control schemes allow a
short clock period and place the burden of code scheduling on the compiler. Complex control schemes generally
require a longer clock period, but are less susceptible to
the order of the instructions generated by the compiler.
The l a t t e r a r e also a b l e t o t a k e a d v a n t a g e of i n f o r m a t i o n
only available at run time and "dynamically" resehedule
the instructions. Additional factors to be considered are
h a r d w a r e cost, d e b u g g i n g a n d m a i n t e n a n c e .
We h a v e p r e s e n t e d a q u a n t i t a t i v e m e a s u r e of t h e
s p e e d u p a c h i e v a b l e b y s o p h i s t i c a t e d i s s u e logic s c h e m e s .
The CRAY-1 s c a l a r a r c h i t e c t u r e is u s e d as a b a s i s for c o m p a r i s o n . S i m u l a t i o n r e s u l t s of t h e 14 L a w r e n c e L i v e r m o r e
Loops e x e c u t e d o n 4 d i f f e r e n t i s s u e logic m e c h a n i s m s show
t h e p e r f o r m a n c e g a i n a c h i e v a b l e b y v a r i o u s d e g r e e s of
[CRAY79] "CRAY-I C o m p u t e r Systems, H a r d w a r e R e f e r e n c e
Manuar', Cray Research, inc., Chippewa Falls, WI, 1979.
[CRAYB2] "CRAY X-MP Computer Systems Mainframe Reference
Manual", Cray Research, Inc.,Chippewa Falls,WI,1982.
[KOGGSI] P. M. Kogge, The Architect~Te of PLpelined
e r s , McGraw-Hill, 1981.
Comput-
[MCMA72] F. H. McMshon, "FORTRAN CPU Performance
Analysis." Lawrence Livermore Laboratories, 1972.
[RUSS7B] P~M. Russell, "The CRAY-I Computer System", C o m m .
ACM, V 21. N 1, January 1978, pp. 83-72.
[THOR70] J.E. Thornton, /~zs-~ of ~ Cbmputer - The Control
/~*te 8800, Scott, Foresman and Co., Glenview, IL, 1970
[TOMA87] R.M. Tomasulo, "An EfficientAlgorithm for Exploiting
Multiple A r i t h m e t i c Units", IBM Journal, V 11, J a n 1967
118
© Copyright 2026 Paperzz