ARM single-cycle instruction 3-stage pipeline

ARM Organization and
Implementation
CONFIDENTIAL & PROPRIETARY
3-stage pipeline ARM organization
• The register bank, which stores the processor state.
• The barrel shifter, which can shift or rotate one operand
by any number of bits.
• The ALU, which performs the arithmetic and logic
functions required by the instruction set.
3-stage pipeline ARM organization
• The address register and incrementer, which select and
hold all memory addresses and generate sequential
addresses when required.
• The data registers, which hold data passing to and
from memory.
1. In a single-cycle data processing instruction, two
register operands are accessed, the value on the B bus
is shifted and combined with the value on the A bus in
the ALU, and the result is then written back into the
register bank.
2. The program counter value is held in the address
register, from where it is fed into the incrementer.
The incremented value is copied back into r15 in the
register bank and also into the address register, to be
used as the address for the next instruction fetch if
needed.
The 3-stage pipeline
ARM processors up to the ARM7 employ a simple 3-stage
pipeline with the following stages:
1. Fetch
2. Decode
3. Execute
ARM single-cycle instruction 3-stage pipeline operation
1. When the processor is executing simple data
processing instructions, the pipeline enables one
instruction to be completed every clock cycle.
2. An individual instruction takes three clock
cycles to complete, so it has three-cycle
latency, but the throughput is one instruction
per cycle.
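
The difference between latency and throughput can be checked with a small calculation. The following is a minimal sketch, assuming an ideal pipeline with no stalls; the function name `pipeline_cycles` is hypothetical and not part of the slides.

```python
def pipeline_cycles(n_instructions: int, n_stages: int = 3) -> int:
    """Clock cycles needed to complete n single-cycle instructions
    in an ideal pipeline (assumption: no stalls, one fetch per cycle)."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs n_stages cycles (its latency);
    # every later instruction completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


for n in (1, 10, 1000):
    cycles = pipeline_cycles(n)
    print(f"{n} instructions: {cycles} cycles, {cycles / n:.3f} cycles per instruction")
```

For a single instruction the cost is the full three-cycle latency, while for long instruction sequences the average cost approaches one cycle per instruction.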
ARM multi-cycle instruction 3-stage pipeline operation
3-stage pipeline operation
1. When a multi-cycle instruction is executed the flow is less
regular, as illustrated in the figure.
2. The figure shows a sequence of single-cycle ADD instructions
with a data store instruction, STR, occurring after the first
ADD.
3. The cycles colored yellow access main memory, so it can be
seen that memory is used in every cycle.
4. The datapath is likewise used in every cycle, being
involved in all the execute cycles, the address calculation
and the data transfer.
5. The decode logic is always generating the control signals
for the datapath to use in the next cycle, so in addition to
the explicit decode cycles it is also generating the control
for the data transfer during the address calculation cycle of
the STR.
5-stage pipeline ARM organization
The time, $T_{prog}$, required to execute a given program is
given by:
$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$
where,
$N_{inst}$ - number of ARM instructions executed in the course of the program
$CPI$ - average number of clock cycles per instruction
$f_{clk}$ - processor's clock frequency
Since $N_{inst}$ is constant for a given program (compiled
with a given compiler using a given set of optimizations,
and so on) there are only two ways to increase
performance.
1. Increase the clock rate, fclk.
• This requires the logic in each pipeline stage to be
simplified and, therefore, the number of pipeline
stages to be increased.
2. Reduce the average number of clock cycles per
instruction, CPI.
• This requires either that instructions which occupy
more than one pipeline slot in a 3-stage pipeline
ARM are re-implemented to occupy fewer slots, or
that pipeline stalls caused by dependencies
between instructions are reduced, or a combination
of both.
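
The effect of the two options can be explored numerically. The sketch below evaluates the execution time as instruction count times CPI divided by clock frequency; the instruction count, CPI values and clock frequencies are made-up illustrative numbers, not measurements of any ARM core.

```python
def program_time(n_inst: float, cpi: float, f_clk_hz: float) -> float:
    """Execution time in seconds: T_prog = N_inst * CPI / f_clk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical baseline: 1 million instructions, CPI of 2.0, 50 MHz clock.
baseline = program_time(1_000_000, 2.0, 50e6)

# Option 1: raise the clock rate (hypothetical 100 MHz), CPI unchanged.
faster_clock = program_time(1_000_000, 2.0, 100e6)

# Option 2: reduce the CPI (hypothetical 1.5), clock unchanged.
lower_cpi = program_time(1_000_000, 1.5, 50e6)

print(f"baseline:     {baseline * 1e3:.1f} ms")
print(f"faster clock: {faster_clock * 1e3:.1f} ms")
print(f"lower CPI:    {lower_cpi * 1e3:.1f} ms")
```

Because the two factors multiply, a design that improves both the clock rate and the CPI gains from each independently.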
Memory Bottleneck
1. A 3-stage ARM core accesses memory on (almost)
every clock cycle either to fetch an instruction or to
transfer data. Simply tightening up on the few cycles
where the memory is not used will yield only a small
performance gain.
2. To get a significantly better CPI the memory system
must deliver more than one value in each clock cycle
either by delivering more than 32 bits per cycle from a
single memory or by having separate memories for
instruction and data accesses.
1. As a result of these issues, higher performance ARM
cores employ a 5-stage pipeline and have separate
instruction and data memories.
2. Breaking instruction execution down into five
components rather than three reduces the maximum
work which must be completed in a clock cycle, and
hence allows a higher clock frequency to be used.
3. The separate instruction and data memories allow a
significant reduction in the core's CPI.
5-Stage Pipeline Organization (1/2)
[Figure: 5-stage pipeline datapath, showing the fetch, decode,
execute, buffer/data and write-back stages with the I-cache,
register read, shifter, ALU, forwarding paths, D-cache and
register write]
• Fetch
– The instruction is fetched from memory and placed in
the instruction pipeline
• Decode
– The instruction is decoded and register operands are
read from the register file. There are 3 operand read
ports in the register file, so most ARM instructions can
source all their operands in one cycle
• Execute
– An operand is shifted and the ALU result generated. If
the instruction is a load or store, the memory address is
computed in the ALU
5-Stage Pipeline Organization (2/2)
[Figure: 5-stage pipeline datapath (same diagram as on slide 1/2)]
• Buffer/Data
– Data memory is accessed if required. Otherwise the ALU
result is simply buffered for one cycle
• Write back
– The results generated by the instruction are written
back to the register file, including any data loaded from
memory
Pipeline Hazards
There are situations, called hazards, that prevent
the next instruction in the instruction stream from
being executed during its designated clock cycle.
Hazards reduce the performance from the ideal
speedup gained by pipelining.
Pipeline Hazards
• There are three classes of hazards:
– Structural Hazards: They arise from resource conflicts
when the hardware cannot support all possible
combinations of instructions in simultaneous overlapped
execution.
– Data Hazards: They arise when an instruction depends
on the result of a previous instruction in a way that is
exposed by the overlapping of instructions in the
pipeline.
– Control Hazards: They arise from the pipelining of
branches and other instructions that change the PC.
Structural Hazards
1. When a machine is pipelined, the overlapped
execution of instructions requires pipelining of
functional units and duplication of resources to
allow all possible combinations of instructions in
the pipeline.
2. If some combination of instructions cannot be
accommodated because of a resource conflict,
the machine is said to have a structural
hazard.
Example
• A machine has a single memory shared by data and
instructions. As a result, when an instruction contains
a data-memory reference (load), its memory access will
conflict with the instruction fetch of a later
instruction (Instr 3, in cycle 4):
Clock cycle number
instr      1    2    3    4    5    6    7    8
load       IF   ID   EX   MEM  WB
Instr 1         IF   ID   EX   MEM  WB
Instr 2              IF   ID   EX   MEM  WB
Instr 3                   IF   ID   EX   MEM  WB
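
The conflict in this example can be found mechanically by looking for cycles in which one instruction is in MEM while another is in IF. The following is a rough sketch, assuming one instruction is issued per cycle and a single memory port; `memory_conflicts` and the tuple layout are hypothetical names chosen for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(issue_cycle, cycle):
    """Stage occupied in `cycle` by an instruction fetched in `issue_cycle`."""
    index = cycle - issue_cycle
    return STAGES[index] if 0 <= index < len(STAGES) else None


def memory_conflicts(program):
    """program: list of (name, uses_data_memory).
    Instruction i is assumed to be fetched in cycle i + 1 (no stalls)."""
    conflicts = []
    for i, (name_i, uses_mem) in enumerate(program):
        if not uses_mem:
            continue
        mem_cycle = (i + 1) + STAGES.index("MEM")
        for j, (name_j, _) in enumerate(program):
            # A fetch in the same cycle needs the same single memory port.
            if stage_at(j + 1, mem_cycle) == "IF":
                conflicts.append((mem_cycle, name_i, name_j))
    return conflicts


program = [("load", True), ("Instr 1", False),
           ("Instr 2", False), ("Instr 3", False)]
print(memory_conflicts(program))  # [(4, 'load', 'Instr 3')]
```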
Solution (1/2)
• To resolve this, we stall the pipeline for one clock
cycle when a data-memory access occurs. The effect of
the stall is to occupy the resources for that
instruction slot. The following table shows how the
stall is implemented.
Clock cycle number
instr      1    2    3    4      5    6    7    8    9
load       IF   ID   EX   MEM    WB
Instr 1         IF   ID   EX     MEM  WB
Instr 2              IF   ID     EX   MEM  WB
Instr 3                   stall  IF   ID   EX   MEM  WB
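
The stalled schedule can be reproduced with a simple model in which a fetch is delayed whenever the single memory is busy with a data access. This is a minimal sketch under that assumption only; `schedule_fetches` is a hypothetical helper, and it assumes that once an instruction is fetched it flows through the remaining stages without further stalls.

```python
MEM_OFFSET = 3  # MEM is three cycles after IF in the IF/ID/EX/MEM/WB pipeline
WB_OFFSET = 4   # WB is four cycles after IF


def schedule_fetches(program):
    """program: list of (name, uses_data_memory).
    Returns {name: fetch_cycle} for a machine with one shared memory port."""
    data_access_cycles = set()
    fetch_cycle = {}
    cycle = 1
    for name, uses_mem in program:
        while cycle in data_access_cycles:
            cycle += 1  # stall: the memory is busy with a data access
        fetch_cycle[name] = cycle
        if uses_mem:
            data_access_cycles.add(cycle + MEM_OFFSET)
        cycle += 1
    return fetch_cycle


program = [("load", True), ("Instr 1", False),
           ("Instr 2", False), ("Instr 3", False)]
fetches = schedule_fetches(program)
for name, cycle in fetches.items():
    print(f"{name}: IF in cycle {cycle}, WB in cycle {cycle + WB_OFFSET}")
```

With the load's data access in cycle 4, the fetch of Instr 3 moves to cycle 5 and its write-back to cycle 9, matching the table.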
Solution (2/2)
• Another solution is to use separate instruction
and data memories.
• The 5-stage ARM cores use a Harvard-style organization
with separate instruction and data memories, so they do
not have this hazard.
Data Hazards
• Data hazards occur when the pipeline changes the
order of read/write accesses to operands so that the
order differs from the order seen by sequentially
executing instructions on the unpipelined machine.
Clock cycle number
                1    2      3      4      5      6    7    8    9
ADD R1,R2,R3    IF   ID     EX     MEM    WB
SUB R4,R5,R1         IF     IDsub  EX     MEM    WB
AND R6,R1,R7                IF     IDand  EX     MEM  WB
OR  R8,R1,R9                       IF     IDor   EX   MEM  WB
XOR R10,R1,R11                            IDxor  EX   MEM  WB
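
Which of the later instructions actually read a stale R1 can be worked out from the stage timing. The sketch below assumes one instruction fetched per cycle, register reads in ID, register writes in WB, and a register file that is written before it is read within a cycle (so a read in the same cycle as the write is safe); these timing assumptions and the name `raw_hazards` are illustrative, not taken from the slides.

```python
program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R5", "R1")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]


def raw_hazards(program):
    """Return (producer, consumer, register) triples where the consumer
    reads the register in ID before the producer has written it in WB."""
    hazards = []
    for i, (op_i, dest, _) in enumerate(program):
        wb_cycle = i + 5          # fetched in cycle i + 1, WB four cycles later
        for j in range(i + 1, len(program)):
            op_j, _, sources = program[j]
            id_cycle = j + 2      # fetched in cycle j + 1, ID one cycle later
            if dest in sources and id_cycle < wb_cycle:
                hazards.append((op_i, op_j, dest))
    return hazards


print(raw_hazards(program))  # [('ADD', 'SUB', 'R1'), ('ADD', 'AND', 'R1')]
```

Under these assumptions SUB and AND read R1 too early, OR reads it in the same cycle as the write, and XOR reads it afterwards. The sketch ignores the case where an intermediate instruction overwrites the same register.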
Forwarding
• The data hazards introduced by this sequence of
instructions can be solved with a simple hardware
technique called forwarding.
Clock cycle number
                1    2    3      4      5    6    7
ADD R1,R2,R3    IF   ID   EX     MEM    WB
SUB R4,R5,R1         IF   IDsub  EX     MEM  WB
AND R6,R1,R7              IF     IDand  EX   MEM  WB
Forwarding
• Forwarding involves feeding output data into a
previous stage of the pipeline.
• Forwarding is implemented by feeding back the
output of an instruction into the previous stage(s)
of the pipeline as soon as the output of that
instruction is available.
Forwarding Architecture
[Figure: 5-stage pipeline datapath with the forwarding paths marked]
• Forwarding works as follows:
– The ALU result from the EX/MEM register is always fed
back to the ALU input latches.
– If the forwarding hardware detects that the previous ALU
operation has written the register corresponding to the
source for the current ALU operation, control logic
selects the forwarded result as the ALU input rather than
the value read from the register file.
Forward Data
Clock cycle number
                1    2    3      4       5      6    7
ADD R1,R2,R3    IF   ID   EXadd  MEMadd  WB
SUB R4,R5,R1         IF   ID     EXsub   MEM    WB
AND R6,R1,R7              IF     ID      EXand  MEM  WB
• The first forwarding passes the value of R1 from EXadd to EXsub.
The second forwarding passes the value of R1 from MEMadd to EXand.
This code can now be executed without stalls.
• Forwarding can be generalized to include passing the result directly to
the functional unit that requires it.
• A result is forwarded from the output of one unit to the input of another,
rather than just from the result of a unit to the input of the same unit.
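
The operand selection made by the forwarding control logic can be sketched as a small multiplexer function. This is an illustrative model only: the pipeline register names `ex_mem` and `mem_wb`, the tuple layout and the function name are assumptions for the example, not a description of the actual ARM control logic.

```python
def select_operand(src_reg, regfile_value, ex_mem=None, mem_wb=None):
    """Choose the ALU input for source register `src_reg`.

    ex_mem / mem_wb: (dest_reg, value) held in the pipeline registers by
    the previous and the second-previous instruction, or None.
    Returns (value, where_it_came_from)."""
    # The most recent result takes priority, since it is the newest value.
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1], "EX/MEM"
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1], "MEM/WB"
    return regfile_value, "register file"


# SUB R4,R5,R1 executes right after ADD R1,R2,R3 (hypothetical result 7).
print(select_operand("R1", regfile_value=0, ex_mem=("R1", 7)))
# AND R6,R1,R7 executes two instructions after the ADD.
print(select_operand("R1", regfile_value=0, ex_mem=("R4", 3), mem_wb=("R1", 7)))
# R5 was not written recently, so the register file value is used.
print(select_operand("R5", regfile_value=9, ex_mem=("R1", 7)))
```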
Without Forward
Clock cycle number
                1    2    3      4      5      6      7    8    9
ADD R1,R2,R3    IF   ID   EX     MEM    WB
SUB R4,R5,R1         IF   stall  stall  IDsub  EX     MEM  WB
AND R6,R1,R7              stall  stall  IF     IDand  EX   MEM  WB
Data Forwarding
• A data dependency arises when an instruction needs to
use the result of one of its predecessors before that
result has returned to the register file; this causes a
pipeline hazard
• Forwarding paths allow results to be passed between
stages as soon as they are available
• The 5-stage pipeline requires each of the three source
operands to be forwarded from any of the intermediate
result registers
• Still one load stall
LDR rN, […]
ADD r2,r1,rN ;use rN immediately
– Costs one stall cycle
– Can be avoided by compiler rescheduling
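
The load-use case and the effect of rescheduling can be illustrated by counting the places where an instruction uses a loaded register immediately after the load. This is a rough sketch, assuming exactly one stall cycle per load-use pair; the register numbers and the independent SUB instruction are invented for the example.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest_reg, source_regs).
    Counts instructions that read a register loaded by the LDR
    immediately before them (one stall cycle each, by assumption)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LDR" and dest in current[2]:
            stalls += 1
    return stalls


# r3 stands in for rN; the SUB is a hypothetical independent instruction.
original = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),   # uses r3 immediately: one stall
    ("SUB", "r5", ("r6", "r7")),
]
# A compiler can move the independent SUB between the load and its use.
rescheduled = [original[0], original[2], original[1]]

print(load_use_stalls(original))     # 1
print(load_use_stalls(rescheduled))  # 0
```

The reordering is only legal because the SUB neither reads nor writes any register involved in the LDR or the ADD.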
Stalls are required
Clock cycle number
                1    2    3    4      5      6    7    8
LDR R1,@(R2)    IF   ID   EX   MEM    WB
SUB R4,R1,R5         IF   ID   EXsub  MEM    WB
AND R6,R1,R7              IF   ID     EXand  MEM  WB
OR  R8,R1,R9                   IF     ID     EX   MEM  WB
• The load instruction has a delay or latency that cannot be
eliminated by forwarding alone.
The Pipeline with one Stall
Clock cycle number
                1    2    3    4      5      6    7    8    9
LDR R1,@(R2)    IF   ID   EX   MEM    WB
SUB R4,R1,R5         IF   ID   stall  EXsub  MEM  WB
AND R6,R1,R7              IF   stall  ID     EX   MEM  WB
OR  R8,R1,R9                   stall  IF     ID   EX   MEM  WB
• The only necessary forwarding is done for R1 from MEM to
EXsub.
Control hazards
• Control hazards can cause a greater performance loss for
the ARM pipeline than data hazards.
• When a branch is executed, it may or may not change the
PC (program counter) to something other than its current
value plus 4.
• The simplest method of dealing with branches is to stall the
pipeline as soon as the branch is detected, until the branch
reaches the EX stage.
Branch              IF   ID          EXE    MEM  WB
Branch successor         IF (stall)  stall  IF   ID   EXE  MEM  WB
Branch successor+1                               IF   ID   EXE  MEM  WB
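
The cost of stalling on every branch can be estimated with the usual effective-CPI calculation. The sketch below is a back-of-the-envelope model: the two-cycle penalty is read off the table above (the successor's useful fetch is delayed by two cycles), while the base CPI and the branch frequency are made-up illustrative values.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  branch_penalty_cycles: int) -> float:
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * branch_penalty_cycles


# Hypothetical workload: ideal CPI of 1.0 and 20% of instructions are branches.
cpi = effective_cpi(base_cpi=1.0, branch_fraction=0.2, branch_penalty_cycles=2)
print(f"effective CPI: {cpi:.2f}")
print(f"slowdown versus ideal pipeline: {cpi / 1.0:.2f}x")
```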
Data Processing Instructions
[Figure: data processing instruction datapath activity:
(a) register - register operations,
(b) register - immediate operations]
Data Processing Instructions
1. A data processing instruction requires two operands,
one of which is always a register and the other is either
a second register or an immediate value.
2. The second operand is passed through the barrel
shifter where it is subject to a general shift operation,
then it is combined with the first operand in the ALU
using a general ALU operation. Finally, the result from
the ALU is written back into the destination register.
3. All these operations take place in a single clock cycle,
as shown in the figure.
Data Processing Instructions
1. The PC value in the address register is incremented and
copied back into both the address register and r15 in the
register bank, and the next instruction but one is loaded
into the bottom of the instruction pipeline (i. pipe).
2. The immediate value, when required, is extracted from
the current instruction at the top of the instruction
pipeline. For data processing instructions only the bottom
eight bits (bits [7:0]) of the instruction are used in the
immediate value.
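
The extraction of the immediate can be sketched in a few lines. Note that this goes slightly beyond the slide: in the classic 32-bit ARM encoding the 8-bit value is also rotated right by twice the 4-bit field in bits [11:8], which corresponds to the "general shift operation" applied by the barrel shifter. The function name is hypothetical.

```python
def decode_dp_immediate(instruction: int) -> int:
    """Decode the immediate operand of a 32-bit ARM data processing
    instruction: bits [7:0] rotated right by 2 * bits [11:8]."""
    imm8 = instruction & 0xFF
    rotate = ((instruction >> 8) & 0xF) * 2
    # 32-bit rotate right, done with shifts and a mask.
    return ((imm8 >> rotate) | (imm8 << (32 - rotate))) & 0xFFFFFFFF


# MOV r0, #1           is encoded as 0xE3A00001
# MOV r0, #0xFF000000  is encoded as 0xE3A004FF
print(hex(decode_dp_immediate(0xE3A00001)))  # 0x1
print(hex(decode_dp_immediate(0xE3A004FF)))  # 0xff000000
```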
Data Transfer Instructions
STR (store register) datapath activity
[Figure: STR datapath activity:
(a) 1st cycle - compute address,
(b) 2nd cycle - store data & auto-index]
STR (store register) datapath activity
1. A data transfer (load or store) instruction computes a
memory address in a manner very similar to the way a
data processing instruction computes its result.
2. A register is used as the base address, to which is added
(or from which is subtracted) an offset which again may
be another register or an immediate value.
3. The address is sent to the address register, and in a
second cycle the data transfer takes place.
4. Rather than leave the datapath largely idle during the
data transfer cycle, the ALU holds the address
components from the first cycle and is available to
compute an auto-indexing modification to the base
register if this is required.
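
The two cycles of a store can be modelled at a very coarse level. The following is a simplified sketch of a pre-indexed STR with a 12-bit immediate offset (bits [11:0]) and optional base write-back; the dictionaries standing in for the register bank and memory, and the function name, are assumptions for illustration. Post-indexed and register-offset forms are not modelled.

```python
def str_immediate(regs, memory, rd, rn, offset12,
                  subtract=False, auto_index=False):
    """Model STR rd, [rn, #+/-offset12] with optional base write-back."""
    offset = offset12 & 0xFFF
    # 1st cycle: compute the address (A + B or A - B) in the ALU.
    if subtract:
        address = (regs[rn] - offset) & 0xFFFFFFFF
    else:
        address = (regs[rn] + offset) & 0xFFFFFFFF
    # 2nd cycle: transfer the data, and auto-index the base if required.
    memory[address] = regs[rd]
    if auto_index:
        regs[rn] = address
    return address


regs = {"r0": 0x1234, "r1": 0x1000}
memory = {}
address = str_immediate(regs, memory, "r0", "r1", 4, auto_index=True)
print(hex(address), hex(memory[address]), hex(regs["r1"]))
```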
Branch Instructions
The first two (of three) cycles of a branch instruction
[Figure: branch instruction datapath activity:
(a) 1st cycle - compute branch target,
(b) 2nd cycle - save return address]
Branch Instructions
1. Branch instructions compute the target address in the first cycle, as
shown in the figure.
2. A 24-bit immediate field is extracted from the instruction and then
shifted left two bit positions to give a word-aligned offset which is
added to the PC.
3. The result is issued as an instruction fetch address, and while the
instruction pipeline refills the return address is copied into the link
register (r14) if this is required (that is, if the instruction is a 'branch
with link').
4. The third cycle, which is required to complete the pipeline refilling, is
also used to make a small correction to the value stored in the link
register in order that it points directly at the instruction which follows
the branch.
5. This is necessary because r15 contains pc + 8 whereas the address
of the next instruction is pc + 4.
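
The target and link calculations can be written out as a short sketch. It assumes the classic 32-bit ARM branch encoding, in which the 24-bit field is a signed value (sign extension is not mentioned on the slide but is part of the encoding); the function names and the example address 0x8000 are hypothetical.

```python
def branch_target(instr_addr: int, instruction: int) -> int:
    """Target of a B/BL instruction located at instr_addr."""
    imm24 = instruction & 0x00FFFFFF
    if imm24 & 0x800000:          # sign-extend the 24-bit field
        imm24 -= 1 << 24
    offset = imm24 << 2           # shift left two bits: word-aligned offset
    r15 = instr_addr + 8          # r15 reads as pc + 8 during execution
    return (r15 + offset) & 0xFFFFFFFF


def link_value(instr_addr: int) -> int:
    """Return address saved in r14 by BL, after the third-cycle correction."""
    r15 = instr_addr + 8
    return r15 - 4                # points at the instruction after the branch


# 0xEAFFFFFE is "B ." (branch to itself); 0xEB000000 is BL with offset 0.
print(hex(branch_target(0x8000, 0xEAFFFFFE)))  # 0x8000
print(hex(branch_target(0x8000, 0xEB000000)))  # 0x8008
print(hex(link_value(0x8000)))                 # 0x8004
```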