Hardware
Support
Dynamically
for Large
Scheduled
Atomic
Units
Machines
in
Stephen W. Melvin
Michael C. Shebanow
Yale N. Patt
Computer
Science Division
University
of California,
Berkeley
Berkeley,
CA 947211
ABSTRACT
Microarchitectures that implement conventional instruction
set architectures are usuaHy limited in that they are only
able to execute a small number of microoperations concurrently. This limitation is due in part to the fact that the
units of work that the hardware treats as indivisible are
small. While this limitation is not important for microarchitectures with a low level of functllonality, it can be significant’if the goal is to build hardware that can support a
large number of microoperations executing concurrently. In
this paper we address the tradeoffs associated with the sizes
of the various units of work that a processor considers indivisible, or atomic. We argue that by allowing larger units
of work to be atomic, restrictions on concurrent operation
are reduced and performance is increased. We outline the
implement.ation of a front end for a dynamically scheduled
processor with hardware support for large atomic units. We
discuss tradeoffs in the design and show that with a modest
investment in hardware, the run-time advantages of large
atomic units can be realized without the need to alter the
instruction set architecture.
architectural
atonuc units are exactly the same, but in principle the
architectural
atomic units could be larger. In that case, the compiler
would group together a set of individually
define them to be uninterruptible.
generally called the instruction
An important
atomic units is the notion of the degree of ‘coupling’ to the implementation.
The compiler atomic units may be freely formated, in which
case they are decoupled from the implementation,
unit.
words’ {which are the units of work that a machine can issue in one
cycle). We discuss this isstie further below.
The execution atomic units are the smallest groups of microoperations that the hardware issues at run-time as an indivisible
,that implement the atomic
is a basic unit cjf work like a register-tooperation,
units. An architectural
boundary.
‘instruction’.
An architectural
atomic unit
The compiler atomic units are the smallest indivisible
THO236-0/88/0000/0060$01.00
decreases the associated hardwarr
boundaries that are
the execution,atomic
units are no larger
atomic units. This is because allowing the ex-
ecution atomic units to increase beyond the size of the architectural
fruru the
atomic units is non-trivial
and requires the implementation
sort of backup hardware
[Ii.
of some
In the event of an exception in the
middle of an execution atomic unit (but at the boundary of an arunits of
chitectural
In most cases the compiler and
0 1988 IEEE
and/or
In most implementations
than the architectural
standpoint of interrupt latency and exception handling.
work that the compiler generates.
is added and the number of function
preserved.
and exceptions. The state
of these units is important
(since a potential microbranch or mi-
crotrap exists). As pipelining
cost by reducing the number of microoperation
atomic unit is what is referred to as an
The specification
of the previous microoperation
ailows higher utilization
atomic unit is a group of microoperations
of execution can only be preserved at an architectural
This is because
is dependent on the outcome
units is increased, increasing the size of the execution atomic units
a load or a store.
atomic units, compiler atomic units. and execution atomic
which is atomic with respect to interrupts
unit.
the execution atomic
microoperations.
the execution of each microoperation
We have identified three separate types of atomic units: architectural
microarchitectures,
units are generally the individual
When we refer to the ‘size’ of such a unit, we
A microoperation
or alternatively
they may be formated exactly into what we refer to as ‘multinode-
In this paper we discuss the notion of .an indivisible unit of work,
register move, a simple arithmetic
set architecture.
issue which relates to the format of the compiler
In sequential, non-pipelined
mean the the number of ‘microoperations’
The definitions ofthe architectural
atomic units and the compiler atomic units together form what is
1. Introduction
OI an ‘atomic unit.’
produced units of work and
atomic unit), the backup hardware restores the state of
a machine to the beginning of the execution atomic unit.
60
Eaecuu-
tion then proceeds one architectural
atomic unit at a time until the
time effort but allow more compile-time
optimization
by allowing a
exception is discovered. We define the operation of a machine implc-
finer level of control over the work to be done.
menting this backup hardware as having two modes: fast and slow
atomic units allow a higher degree of parallelism to be taken advan-
In fast mode, the machine executer the rxr<.ution atomic units (which
tage of more efficientIy.
are larger than the architectural
unit,
atumlr units).
machine executes the architectural
As an alternative,
In slow mode, the
atomic units.
the architectural
Larger execution
By issuing a larger unit of
work
as a single
more concurrency can be achieved at a lower cost.
In a RISC machine, smal1 compiler atomic units are implemented
atomic units can be made
in order to exploit their advantages. Since most processors have ar-
equal to the execution atomic units. In either case. if the execution
chitectural atomic units which are the same as the compiler atomic
atomic units are made larger than compiler atomic units, the mech-
units, and since increasing the execution atomic units beyond the
anism for packing compiler atomic units into execution atomic units
size of the architectural
becomes an important
processors with small execution atomic units. Thus, while these ma-
issue. This packing can either be done stati-
cally by the compiler or it can be done dynamically
by the hardware.
chines can benefit from the performance
If the compiler atomic units are closely coupled to the implementation (i.e. they are formated into multinodewords),
the case for a statically
implementation,
technology
as would be the case if a
must
be
dynamicallv.
done
Figure 1 illustrates
If.
to
and
Consider a hypothetical
two
implementations
of a hypotheticalCISC
the average number of microoperations
is being imple-
RISC
microoperations
machine in which
per instruction
is four.
In
the high performance CISC case, we assume that the architectural
ln this paper, we
discuss a hardware mechanism called a ‘fill unit‘ for dynamically
this point.
machine in which the compiler specifies individual
the
against changes in
buffer
is desired or if an existing architecture
mented, the packing
small execution atomic units.
scheduled machine or for some dynamically
however, the compiler atomic units are not closely coupled
advantages of small compiler
atomic units, they are limited by the performance disadvantages of
as would be
scheduled machines [2], the compiler does the packing statically.
atomic units is complex. the result has been
atomic units are the same as the compiler atomic units and this lim-
cre-
its the size of the execution
ating execution atomic units from smaller compiler atomic units.
atomic units. As one moves further up
and to the left on this graph (meaning larger execution atomic units
This paper contains four sections. Section two presents a general
and smaller compiler atomic units), performance increases. Further-
discussion of atomic unit sizes and the tradeoffs associated with them.
more, we believe that for high performance processors. the size of
Section three
execution atomic units is more critical than the size of the compiler
details on the implementation
provides
of a fill unit.
Finally section four provides some concluding remarks.
the
atomic units.
In summary, for higher performance it is desirable to have large
2. Atomic
Unit
Size Issues
execution atomic units and small compiler atomic units. In order to
Many tradeoffs exist for choosing the sizes of the atomic units
and it is important
for a particular
to identify
tradeoff.
make the large execution atomic units less expensive to implement,
which type of atomic unit is relevant
large architectural
Each of the three types of atomic units
atomic units are also desired. However, a designer
implementing a given instruction
introduced above have distinct performance implications.
Larger architectural
potentially
atomic units increase interrupt
decrease the size of the architectural
temporary
results
need
not
cannot choose the
Figure 1
Unit Size Diagram
state of the machine.
This is due to the fact that within an uninterruptible
crooperations,
Atomic
latency but
set architecture
block of mi-
be given explicit
Higher
Performance
names,
thus fewer general purpose registers are required. This makes context
switching less expensive.
In addition, larger architectural
atomic units simplify the design
Average
number
of
microoperations
of high performance machines since there are fewer microoperation
boundaries to preserve. This can be significant for microarchitectures
with a high degree of concurrency.
architectural
Per
Execution
The third advantage of large
Atomic
Unit
atomic units is that they make it easier to increase the
size of execution atomic units as mentioned
ahwe
(nu
fast
mode
slow mode scheme is required).
However, the architectural
atomic units are secondary to perfor-
t
mance. The sizes of the compiler atomic units and execution atomic
Average
units are more critical. Small compiler atomic units increase compile-
61
number
2
of microperations
3
4
per
Compiler
Atomic
Unit
composition
units.
of the compiler
atomic
units or the architectural
The goal then is to implement
execution
atomic
Figure 3
Fill Unit
atomic
units
which
are as large as possible.
In the next section
which
allows
we describe
large execution
dynamically
scheduled
decoupled
a mechanism
atomic
machine
from the machine
units
given
small
Instruction
Figure
Unit
atomic
in a
Address
Re-alias
Unit
units
:Details
Over&w
2 shows an overview
and decode unit.
compiler
implementation.
3. 1mpieme:ntation
3.1.
we call a ‘fill unit’
to be implemented
Instructions
of a hypothetical
instruction
are fetched from main memory
fetch
Fill Structure
01 from
c
I
an instruction
cache
and stored
on the format
of the compiler
in a pre-fetch
atomic
units,
buffer.
Depending
microoperations
H
start PC
PC #1
may
PC x2
be directly
accessible
necessary.
In either
or a microoperation
case, a sequence
generation
stage may
of microoperations
be
is passed on
APh
to the fill unit.
The till unit
which
packs microoperations
are then passed
frequently
from
occurring
case, execution
the decoded
unit, after having
The branch
atomic
instruction
register
predictor
instruction
atomic
cache.
units
t
A index
In the
atomic: units are fetched directly
cache and merged
aliases resolved
predicts
branches
into
the execution
by. the register
and sequences
alias table.
the execution
coded instruction
units.
A checkpoint
retirement
atomic
units as they become
within
a particular
to the previous
smaller
into execution
on to the decoded
than
execution
boundary.
the execution
(slow mode)
where
system
atomic
Then,
atomic
Figure
to retire
unit, the machine
units.
a bypass
bypass
appropriate
execution
When an exception
if the architectural
microoperations
Fetch
is present
fully executed.
without
arises
units are
3.2.
mode is entered
the fill unit
course
of action
Fill
accepts
microoperations
units.
in so doing,
execution
atomic
which
with
the
was detected
We will now con-
generator
(or the compiler)
to specify
one at a time
it allows
and forms
the microoperation
microoperations
without
hav-
2
and Decode
Unit
ing to know
their
an execution
atomic
A diagram
final locations
of a hypothetical
alias unit
changes
execution
atomic
cution
I
atomic
the format
ation.
unit
unit
is shown
or within
and three
register
copied
reference,
the [e-alias
into
ALU.
The fill structure
is under
construction.
in the decoded
reference.
is either
if a different
to the first
rethe
address
consists
should
microoperation,
microoperation
of an exe-
structure
has
cache.
a constant,
to another
If an operand
unit uses the block register
with
to support
no work is required.
the fill structure.
3. We
The address
This
instruction
or a reference
is a constant.
refers to another
in figure
a multinodeword
of microoperations
in each microoperation
If an operand
ified relative
62
format.
which
of an entry
to determine
fill unit
one memory
is simply
operand
a muitinodeword
a fill unit which supports
the addresses
Each operand
eral purpose
within
unit.
four microoperations.
I
for the exception
proceeds
Unit
show in this example
Checkpoint
Retirement
System
the machine
in more detail.
The fill unit
and the de-
Otherwise,
the need to reissue any microoperations.
sider the fill unit
is backed up
atomic
cache.
microoperthe constant
is a register
alias table
be inserted.
a gen-
(BRAT)
Finally,
it is assumed
in the instruction.
if an
to be specIn this
case the re-alias
to translate
cies within
satisfied
unit
will use the microoperation
the reference.
These
a multinodeword
ister.
There
write
valid bit
(RV).
The WV
and within
in the BRAT
(WV),
new execution
atomic
reset to signify
If an operand
that register
situation
to thr
register.
write
the
of that
If an operand
address
field
writes
that
the BRAT
with
advantage
is changed
mation
into a
This
fields
is detected.
(assuming
address
erating
transferred
slots are required
number
in order
situation
in which
to be generated.
PCs
of free slots, it will finalize
indicating
an unconditional
here that
an execution
even the largest
the unit,
branch
atomic
atomic
unit
unit
and start
is fi-
to the
scheme cannot
prediction
take
infor-
it locks the architecture
and
to
compiler
atomic
units
structure
and/or
to decouple
filling
instruction
cache,
is useful.
but with
execution
atomic
the architecture
The
a suit-
this can be minimized.
of large
the need to lock
a fill unit
is incurred
which
Thus.
units
into
can br
a specific
work
effort
in microarchitecture
at Berkeley,
the Aquarius
in the Aquarius
they provide.
We also acknowledge
Order
Advanced
Contract
Project.
architectural
We acknowledge
group for the stimulating
No. 4871, monitored
under
is part of a larger
that
Research
environment
part of this work
Projects
by Space and Naval
spon-
was
Agency
(DOD),
Warfare
Sys-
No. N00039-84.C’-0089.
that
Wen-mei W. Hwu and Yale N. Patt, “Checkpoint
Repair for
High Performance
Out-of-order
Execution
Machines,”
IEEE
‘Transactions
on Computers,
December,
1987.
PI Wen-mei
W. Hwu and Yale N. Patt, “HPSm, A High
Performance
Restricted
Data Flow Architecuture
Having
Minimal
Functionality,”
Proceedings,
13th .4nnuai
International
Symposium
on Computer
ilrchitecture,
Tokyo,
June, 1986.
this
131 Yale N. Patt, Michael C. Shebanow, Wen-mei Hwu and Stephen
W. Melvin, “A C Compiler for HPS I. .4 Highly Parallel
Execution
Engine,”
Proceedings,
19th Hawaii hternational
Conference on System Sciences, Honolulu.
HI, January 1986.
this
information
a new unit.
are formated
REFERENCES
generator
has less than
is large enough
and
format.
tems Command
how many free
store control
thr multLnodrrords
This
the implementation,
our colleagues
Arpa
slots for
types (note
If the fill structure
solution
instruction
The microoperation
for each of the microoperatibn
case estimate).
values)
of run-time
sored by the Defense
The contents
microoperation
of each instruction
Thr
can pack the microoperations
to support
advantages
without
use and merged into the
an execution
whenof low
arises is that
Acknowledgement
will be gen-
to the decoded
empty
performance
Our
[II
to the fill unit at the start
can be a worst
operand
overhead
multinodeword
which
the two alternate
them s,ut-of-order
(such as branch
from
research
microoperation
for repeated
is when there are not enough
the next instruction
signals
regarding
information
reads and
unit is finalized,
multiple
that
units
statically.
the architecture
exploited
allows
unit.
The second
nalized
atomic
units
tp the multinodeword
while preserving
are stored in the fill structure.
cache where they will be stored
main execution
to generate
atomic
are not formated
the
waantics.
and which
are then
the
is when a change in the flow of
Information
the test condition
of the fill structure
or dynamic
additional
fields into the
mechanism
an instruction.
The first
two way branches)
address
This
the compiler)
is how an execution
in two situations.
control
for each entry.
intra-execution-atomic-unit
One last detail
of run-time
into
A dynam-
this by merging
tnultinvdrword<.
the compiler
atomic
Alternatively,
bit is set
is sent,
within
units,
the implementation.
has written
the WV
A problem
complier
the
function
or dynamically.
structure,
into the execution
is
to do
pack the microoperations
multinodeword
the reference).
the fill structure
and write
in any order within
inter-instruction,
of microoperations
In cases where
RV bit for
reference
for an instruction
generator(or
to registers
utilization
this can be done statically
unit.
thr wad address field).
updated
is
reference
to a register,
to copy the WV
the microoperation
contain
place within
and scheduling
are ready.
pipelined
slots full as possible.
per cycle
operands
is to better
a
pipeline
attempts
ably large decoded
RV and read address
occurs
Whenever
filled microoperation
the last microoperation
is signaled
writes
(from
bit
field and
multiple
processor
ever their
WV and RV bits are
will tramlatr
the register
a previously
that
field.
scheduled
microoperations
field, the
address
If the bit is reset, the register
reference
reg-
microoperation.
When
BRAT
all BR.\T
(at merge time, the RAT
indicates
purpose
with
the goal is to keep as many
ically
field and the read valid
reads the value in’a register,
microoperation
In a microarchitecture
unit are
address
of the write
have taken
If the RV bit is set, however,
relative
the write
of the read address
no writes
is consulted.
left unchanged
validity
unit is created.
that
4. Conclusions
dependen-
atomic
to a general
the read address
validity
that
into the data path.
corresponds
bit indicates
ensure
an execution
are four fields in each entry:
the RV bit indicates
and
re-aliases
when the unit is merged
Each entry
alias table (MOAT)
We assume
[41 Yale N. Patt, Stephen W. Melvin, Wen-mei Hwu, Michael C.
Shebanow, Chien Chen and Jiajuin Wei. “Run-time
Generation
of HPS Microinstructions
From a VAX Instruction
Stream,”
Proceedings,
19th Annual Workshop on .Vflcroprogramming,
October 15-17, 1986, New York,
NY.
to completely
instruction.
63
© Copyright 2026 Paperzz