Modern Computer Architecture - Tianjin University Graduate Course

Modern Computer Architecture
Lecturer: Prof. Zhang Gang
School of Computer Science, Tianjin University
Contact: [email protected]
Homework submission: [email protected]
2013
Exploiting ILP Using Multiple
Issue and Static Scheduling
Multiple-issue Processors Come in
Three Major Flavors
• Statically Scheduled Superscalar Processors
– issue varying numbers of instructions per clock
– use in-order execution
• Dynamically Scheduled Superscalar Processors
– issue varying numbers of instructions per clock
– use out-of-order execution
• VLIW (very long instruction word) processors
– issue a fixed number of instructions formatted either
as one large instruction or as a fixed instruction
packet
The Basic VLIW Approach
• VLIWs use multiple, independent functional units
• A VLIW packages the multiple operations into one very long instruction
• Alternatively, a VLIW requires that the instructions in the issue packet satisfy the same constraints
• There is no fundamental difference between the two approaches
Case study: A VLIW processor
• A VLIW processor with instructions that
contain five operations
– One integer operation (or a branch)
– Two floating-point operations
– Two memory references
• An instruction length of between 80 and
120 bits
– 16 to 24 bits per field => 5*16 or 80 bits to
5*24 or 120 bits wide
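As a sketch of this encoding, the five operation fields can be packed into one long instruction word. The 16-bit field width (giving an 80-bit word) and the NOP encoding of 0 are illustrative assumptions, not a real ISA:

```python
# Hypothetical sketch: packing one 5-slot VLIW instruction, assuming
# 16-bit operation fields (5 * 16 = 80 bits), as on this slide.
FIELD_BITS = 16

def pack_vliw(ops):
    """Pack five 16-bit operation encodings into one 80-bit word."""
    assert len(ops) == 5
    word = 0
    for op in ops:
        assert 0 <= op < (1 << FIELD_BITS)
        word = (word << FIELD_BITS) | op  # first op lands in the high field
    return word

def unpack_vliw(word):
    """Recover the five operation fields, high field first."""
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (FIELD_BITS * i)) & mask for i in range(4, -1, -1)]

NOP = 0  # an empty slot still occupies encoding bits (see a later slide)
ops = [0x1234, NOP, 0x0F0F, 0xABCD, NOP]
word = pack_vliw(ops)
assert unpack_vliw(word) == ops
assert word.bit_length() <= 80
```

Note that the two NOP slots still consume 32 of the 80 bits, which previews the code-size problem discussed later.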
Recall: Unrolled Loop that
Minimizes Stalls for Scalar
 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16    ; 8-32 = -24

L.D to ADD.D: 1 cycle
ADD.D to S.D: 2 cycles
14 clock cycles, or 3.5 per iteration
Loop Unrolling in VLIW
Memory ref 1      Memory ref 2      FP op 1           FP op 2           Int. op/branch    Clock
L.D F0,0(R1)      L.D F6,-8(R1)                                                           1
L.D F10,-16(R1)   L.D F14,-24(R1)                                                         2
L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2    ADD.D F8,F6,F2                      3
L.D F26,-48(R1)                     ADD.D F12,F10,F2  ADD.D F16,F14,F2                    4
                                    ADD.D F20,F18,F2  ADD.D F24,F22,F2                    5
S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2                                      6
S.D -16(R1),F12   S.D -24(R1),F16                                                         7
S.D -32(R1),F20   S.D -40(R1),F24                                       DSUBUI R1,R1,#48  8
S.D -0(R1),F28                                                          BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)
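The summary numbers quoted above can be checked with a quick calculation: the schedule contains 23 operations (7 loads, 7 adds, 7 stores, plus DSUBUI and BNEZ) in 9 clocks of 5 slots each.

```python
# Quick check of the numbers quoted on this slide (7 unrolled
# iterations, 9 clocks, 5 operation slots per VLIW instruction).
iterations = 7
clocks = 9
slots_per_instr = 5
ops = 7 + 7 + 7 + 2  # 7 L.D + 7 ADD.D + 7 S.D + DSUBUI + BNEZ

clocks_per_iter = clocks / iterations          # ~1.29, quoted as 1.3
ops_per_clock = ops / clocks                   # ~2.56, quoted as 2.5
efficiency = ops / (clocks * slots_per_instr)  # ~0.51, quoted as 50%

print(round(clocks_per_iter, 2),
      round(ops_per_clock, 2),
      round(efficiency, 2))  # → 1.29 2.56 0.51
```

So 22 of the 45 available slots hold no operation, which is the "wasted bits" problem on the next slide.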
Problems with 1st Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code fragment requires ambitious loop unrolling
– whenever VLIW instructions are not full,
unused functional units translate to wasted
bits in instruction encoding
Problems with 1st Generation VLIW
• Operated in lock-step; no hazard detection
HW
– a stall in any functional unit pipeline caused
entire processor to stall, since all functional
units must be kept synchronized
– The compiler might predict functional unit latencies, but cache behavior is hard to predict
Problems with 1st Generation VLIW
• Binary code compatibility
– Pure VLIW => different numbers of functional
units and unit latencies require different
versions of the code
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
• IA-64: instruction set architecture
• 128 64-bit integer regs + 128 82-bit floating point
regs
– Not separate register files per functional unit as in old
VLIW
• Hardware checks dependencies
• Predicated execution (select 1 out of 64 1-bit
flags)
=> 40% fewer mispredictions?
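The idea behind predication can be sketched as if-conversion: the branch is removed, both paths issue, and each operation is guarded by a 1-bit predicate so only the selected results commit. The operations below are made up for illustration:

```python
# Hypothetical sketch of predicated execution (if-conversion):
# instead of branching, both paths issue, and a 1-bit predicate
# decides which result is allowed to write back.
def predicated_select(p, a, b):
    t_result = a + b  # executes under predicate p
    f_result = a - b  # executes under predicate !p
    # commit stage: only the operation whose predicate is 1 takes effect
    return t_result if p else f_result

assert predicated_select(True, 5, 3) == 8
assert predicated_select(False, 5, 3) == 2
```

Because no branch is fetched, there is nothing to mispredict on this path, which is where the claimed reduction in mispredictions comes from.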
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
 Itanium™ was the first implementation (2001)
– Highly parallel and deeply pipelined hardware
– 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
 Itanium 2™ is the name of the 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3
Increasing Instruction Fetch
Bandwidth
 Predicts the next instruction address and sends it out before decoding the instruction
 PC of branch sent to BTB
 When match is found, Predicted PC is
returned
 If branch predicted taken, instruction fetch
continues at Predicted PC
Example
• On a 2 issue processor
Loop: LW     R2,0(R1)     ; R2 = array element
      DADDIU R2,R2,#1     ; increment R2
      SW     0(R1),R2     ; store result
      DADDIU R1,R1,#4     ; increment pointer
      BNE    R2,R3,Loop   ; branch if not last element
• Assume
– separate integer functional units for effective
address calculation, for ALU operations, and for
branch condition evaluation.
– up to two instructions of any type can commit per
clock
• Without speculation, control dependence is the main performance limitation
• With speculation, execution of successive iterations can overlap
Branch Target Buffer (BTB)
• To reduce the branch penalty, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be
– If it is, we can have a branch penalty of zero
• Branch-target buffer/Branch-target cache
– A branch-prediction cache that stores the
predicted address for the next instruction after
a branch
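The lookup/update behavior described above can be sketched as follows; the direct-mapped organization, 16-entry size, and word-aligned indexing are illustrative assumptions:

```python
# A minimal branch-target-buffer sketch: look up the fetch PC before
# decode; on a hit, redirect fetch to the stored predicted target.
BTB_ENTRIES = 16

class BranchTargetBuffer:
    def __init__(self):
        self.tags = [None] * BTB_ENTRIES
        self.targets = [None] * BTB_ENTRIES

    def _index(self, pc):
        return (pc >> 2) % BTB_ENTRIES  # assumes word-aligned PCs

    def predict(self, pc):
        """Return the predicted next PC, or None if this PC misses."""
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc, taken_target):
        """After a taken branch resolves, record its target."""
        i = self._index(pc)
        self.tags[i], self.targets[i] = pc, taken_target

btb = BranchTargetBuffer()
assert btb.predict(0x400) is None   # first encounter: miss, fetch falls through
btb.update(0x400, 0x37C)            # branch at 0x400 resolved taken to 0x37C
assert btb.predict(0x400) == 0x37C  # next fetch of 0x400 redirects immediately
```

Storing the full PC as the tag is what lets the fetch unit act on a hit before the instruction is even decoded.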
Branch Target Buffer (BTB)
[Figure: branch-target buffer hardware organization]
Return Address Predictor
• Indirect jumps
– Destination address varies at run time
– For example
• Case statement
• Procedure return
• Procedure return can be predicted with a
branch-target buffer, but the accuracy can
be low. Why?
– The procedure may be called from multiple sites
– The calls from one site are not clustered in time
• E.g. nested recursion
Return Address Predictor
• How to overcome this problem?
• A small buffer of return addresses acts as a stack
– Caches the most recent return addresses
– Call: push the return address on the stack
– Return: pop an address off the stack & predict it as the new PC
[Figure: misprediction frequency vs. number of return-address buffer entries (0, 1, 2, 4, 8, 16) for the SPEC benchmarks go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex; misprediction frequency starts as high as 70% with no buffer and falls sharply as entries increase]
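The push-on-call, pop-on-return mechanism above can be sketched directly; the 8-entry size and discard-oldest-when-full policy are illustrative assumptions:

```python
# A sketch of the small return-address stack described above.
class ReturnAddressStack:
    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_pc):
        """Push the return address; the oldest entry falls off when full."""
        if len(self.stack) == self.entries:
            self.stack.pop(0)
        self.stack.append(return_pc)

    def on_return(self):
        """Pop an address and predict it as the new PC (None if empty)."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(entries=8)
ras.on_call(0x1000)               # outer call site
ras.on_call(0x2000)               # nested call site
assert ras.on_return() == 0x2000  # inner return predicted correctly
assert ras.on_return() == 0x1000  # outer return predicted correctly
```

Unlike a BTB entry, the stack naturally handles a procedure called from many sites, since each call pushes its own return address; only recursion deeper than the buffer overflows it.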
Integrated Instruction Fetch Units
• Multiple-issue processors demand multiple instructions per clock
• How can this demand be met?
• An integrated instruction fetch unit is one approach
– Integrated branch prediction
– Instruction prefetch
– Instruction memory access and buffering
Integrated Instruction Fetch Units
• Integrated branch prediction
– branch predictor is part of instruction fetch unit and is
constantly predicting branches
• Instruction prefetch
– The instruction fetch unit prefetches to deliver multiple instructions per clock, integrating prefetch with branch prediction
• Instruction memory access and buffering
– Fetching multiple instructions per cycle:
• May require accessing multiple cache blocks
(prefetch to hide cost of crossing cache blocks)
• Provides buffering, acting as on-demand unit to
provide instructions to issue stage as needed and in
quantity needed
Value Prediction
• Taxonomy of speculative execution
[Figure: taxonomy of speculative execution]
Why Can We Do Value Prediction?
• Several recent studies have shown that
there is significant result redundancy in
programs, i.e., many instructions perform
the same computation and, hence, produce
the same result over and over again.
• These studies have found that for several
benchmarks more than 75% of the dynamic
instructions produce the same result as
before.
Value Prediction
• Attempts to predict value produced by instruction
– E.g., Loads a value that changes infrequently
• Value prediction is useful only if it significantly
increases ILP
– Focus of research has been on loads; so-so results, no
processor uses value prediction
• Related topic is address aliasing prediction
– RAW for load and store or WAW for 2 stores
• Address alias prediction is both more stable and
simpler since need not actually predict the address
values, only whether such values conflict
– Has been used by a few processors
Pipeline with VP
• The predictions are obtained from a hardware table, called
Value Prediction Table (VPT).
• These predicted values are used as inputs by instructions,
which can then execute earlier than they could have if they
had to wait for their inputs to become available in the
traditional way.
• When the correct values become available (after executing an instruction), the speculated values are verified
– if a speculation is found to be wrong, the instructions
which executed with the wrong inputs are re-executed
– if the speculation is found to be correct then nothing
special needs to be done
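The predict/verify cycle described above can be sketched with a last-value predictor; indexing the VPT by instruction PC and predicting the most recent result are common choices in the value-prediction literature, assumed here for illustration:

```python
# A last-value-predictor sketch of the Value Prediction Table (VPT).
class ValuePredictionTable:
    def __init__(self):
        self.table = {}  # instruction PC -> last result produced

    def predict(self, pc):
        """Speculative input for dependent instructions; None on a miss."""
        return self.table.get(pc)

    def verify(self, pc, actual):
        """Called once the real result is computed. Returns True if the
        earlier prediction was correct (nothing special to do) and
        False if dependents must be re-executed with the real value."""
        predicted = self.table.get(pc)
        self.table[pc] = actual  # always learn the latest value
        return predicted == actual

vpt = ValuePredictionTable()
assert vpt.predict(0x40) is None     # cold miss: no speculation possible
vpt.verify(0x40, 7)                  # instruction at 0x40 produced 7
assert vpt.predict(0x40) == 7        # next time, dependents start with 7
assert vpt.verify(0x40, 7) is True   # redundant result: speculation correct
assert vpt.verify(0x40, 8) is False  # value changed: squash and re-execute
```

The final two calls show both outcomes listed above: a correct speculation commits normally, while a wrong one forces the dependent instructions to replay.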
Pipeline with VP
• A flow of a dependent chain of instructions (I, J, and K) through two different pipelines: (i) a base pipeline (without VP or IR); (ii) a pipeline with VP
• We assume the instructions I, J, and K are fetched, decoded, and renamed together
• In the base pipeline, the instructions execute sequentially,
since they are data dependent, requiring three cycles to
execute them;
– the chain is committed by cycle 6.
• In the pipeline with VP, the dependence between
instructions is broken by predicting the outputs of I and J
(alternately, the inputs of J and K). This enables the three
instructions to execute simultaneously;
– the chain is committed in cycle 4.
Assignment 4
• Exercise 3.2