
COSC6385 Advanced Computer Architecture
Lecture 14. VLIW and EPIC
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Itanium Pipelines
[Pipeline diagram: front-end stages; the WLD circuit was improved in Itanium 2; dependency scoreboard stalls are checked prior to EXE]
• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file required an extra WLD (Word Line Decode) stage in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
  – latency arises when scheduling fails to space dependent instructions far enough apart
  – or from cache misses
Itanium 2 Eight-stage Pipeline
[Pipeline diagram: core pipeline IPG-ROT-EXP-REN-REG-EXE-DET-WB; FP pipeline FP1-FP4 + WB; L2 pipeline L2N-L2I and L2A-L2M-L2D-L2C-L2W]
• IPG: IP generate, L1I cache (6 inst) and TLB access
• ROT: instruction rotate and buffer (6 inst)
• EXP: expand, port assignment and routing
• REN: INT and FP register rename
• REG: INT and FP register file read
• EXE: ALU execute, L1D cache and TLB access + L2 cache tag access
• DET: exception detect, branch correction
• WB: writeback, INT register update
• FP1-WB: FP FMAC pipeline (2) + register write
• L2N-L2I: L2 queue nominate/issue (4), speculatively issued with the L1 request
• L2A-L2W: L2 access, rotate, correct, write (4)
Itanium 2 Microarchitecture
[Block diagram: fetch/prefetch engine with L1 I-cache & I-TLB, branch prediction, IA-32 decode & control, and an 8-bundle instruction queue; 11 issue ports (3 B, 4 M, 2 I, 2 F); register stack engine / remapping; 128 INT registers and 128 FP registers; branch & predicate units (3 branch units), INT & MM units, floating-point units; scoreboard, predicate NaT, exceptions; ALAT; quad-port (INT) PIPT L1 data cache (write-through) with D-TLB; quad-port PIPT unified L2 cache (ECC); single-ported on-chip PIPT unified L3 cache (ECC); bus controller (ECC)]
Itanium Register Files
[Register file layout]
• General-purpose registers: 128 registers, 64 bits each (bits 0-63); r0-r31 static, r32-r127 stacked (rotating)
• FP registers: 128 registers, 82 bits each (bits 0-81); f0-f31 static, f32-f127 rotating
• Predicate registers: 64 one-bit registers; p0-p15 static, p16-p63 rotating
Register Stack Engine
[Register stack frame: the stacked registers r32-r127 are divided into locals (including inputs) and outputs, with the region above the current frame inaccessible; sof = size of frame, sol = size of locals (= i + l), sor = size of rotating portion. The Current Frame Marker (CFM), 38 bits, holds sof, sol, sor and the rotating register bases rrb.gr, rrb.fr, rrb.pr]
• Avoids spills/fills during function call/return
• The callee issues alloc r1 = ar.pfs, i, l, o, r upon entering a function (a small model of the frame bookkeeping is sketched below)
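A minimal C sketch of this bookkeeping (an assumed model, not the actual hardware): only the sof/sol fields of the CFM and the PFS save/restore are modeled, using the numbers that appear on the following slides (caller sol = 14, sof = 21; callee alloc 7, 9, 3).

#include <stdio.h>

/* Hypothetical model of the Current Frame Marker (CFM) updates done
   by br.call, alloc and br.ret (simplified: only sof/sol, no
   rotation, no backing-store traffic). */
typedef struct { int sof, sol; } Frame;

static Frame cfm;      /* current frame marker            */
static Frame pfs_pfm;  /* previous frame marker in ar.pfs */

void do_call(void) {                 /* br.call */
    pfs_pfm = cfm;                   /* save the caller's frame marker */
    cfm.sof = cfm.sof - cfm.sol;     /* callee sees only the outputs   */
    cfm.sol = 0;
}

void do_alloc(int in, int loc, int out) {  /* alloc rX = ar.pfs,in,loc,out,rot */
    cfm.sol = in + loc;              /* locals = inputs + locals */
    cfm.sof = in + loc + out;        /* whole frame              */
}

void do_return(void) {               /* br.ret */
    cfm = pfs_pfm;                   /* restore the caller's frame */
}

int main(void) {
    cfm.sol = 14; cfm.sof = 21;      /* caller: 14 locals, 7 outputs */
    do_call();                       /* callee frame: sol = 0, sof = 7   */
    do_alloc(7, 9, 3);               /* callee frame: sol = 16, sof = 19 */
    do_return();                     /* back to sol = 14, sof = 21       */
    printf("sol=%d sof=%d\n", cfm.sol, cfm.sof);
    return 0;
}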
Function Call Example

main() {
  a = foo(i*i, b[i]);
}

int foo(int ii, int bb)
{
}

main:  alloc r32 = ar.pfs, 0, 12, 2, 0
foo:   alloc r26 = ar.pfs, 2, 5, 0, 0

[Diagram: GPR stack of caller (main) vs. callee (foo). The caller's frame is r32-r45, with the outputs r44-r45 holding the arguments i*i and b[i]; after the call, the callee's frame is r32-r38 and its inputs r32-r33 are the caller's output registers, renamed]
RSE: A Function Call
[Diagram: before the call, the caller's frame is r32-r52 (locals r32-r45, outputs r46-r52), with CFM sol = 14, sof = 21. On the call, the caller's outputs are renamed to start at r32, so the callee's frame is r32-r38 with CFM sol = 0, sof = 7; the caller's frame marker is saved in PFS.pfm (sol = 14, sof = 21)]
pfm: previous frame marker
RSE: Alloc
alloc r32 = ar.pfs, 7, 9, 3, 0
[Diagram: after the call the callee's frame is r32-r38 (CFM sol = 0, sof = 7); alloc grows it to r32-r50 with 7 inputs + 9 locals (r32-r47) and 3 outputs (r48-r50), so CFM becomes sol = 16, sof = 19, while PFS.pfm still holds the caller's sol = 14, sof = 21]
alloc copies PFM into a GR (here r32)
RSE: Return
[Diagram: call, then alloc, then return. On the return, CFM is restored from PFS.pfm, so the caller's frame r32-r52 (locals r32-r45, outputs r46-r52; sol = 14, sof = 21) becomes the current frame again]
Predicated Execution
if (cond) {
  b = 0;
}
else {
  b = 1;
}

[Control-flow graph: block A branches to B (else) or C (then), both joining at D; with predication, A, B, C and D collapse into one straight-line block]

(normal branch code)
  p1 = (cond)
  branch p1, TARGET
  mov b, 1
  jmp JOIN
TARGET:
  mov b, 0
Converts a control-flow dependency into a data dependency

Pro: eliminates hard-to-predict branches
Cons: (1) blocks B and C are both fetched every time
      (2) must wait until p1 is resolved

(predicated code)
  p1 = (cond)
  (!p1) mov b, 1
  (p1)  mov b, 0
  add x, b, 1
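As a rough C-level analogy of the predicated code above (a sketch, not compiler output): both assignments become a data-dependent select, so no branch on cond needs to be predicted.

#include <stdio.h>

/* Branch-free analogue of the predicated code: compute the condition,
   then select the value of b with data flow instead of control flow. */
int compute(int cond) {
    int p1 = (cond != 0);      /* p1 = (cond)              */
    int b  = p1 ? 0 : 1;       /* (p1) b = 0; (!p1) b = 1  */
    int x  = b + 1;            /* add x, b, 1              */
    return x;
}

int main(void) {
    printf("%d %d\n", compute(1), compute(0));  /* prints 1 2 */
    return 0;
}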
Control Speculation (Speculative Load)

Conventional architectures:
  instr 1
  instr 2
  ...
  br        <- barrier
  load
  use

Itanium:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use
• Elevate loads above a branch
• Improves memory latency through control speculation at compile time
• Defers exceptions by setting NaT (the GR's 65th bit), which indicates:
  – whether or not an exception has occurred
  – that a branch to fixup code is required
• NaT is set during ld.s and checked by chk.s (a minimal C model follows below)
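A small C model of the deferral mechanism just described (an assumed illustration; spec_load and check_and_use only mimic the idea of ld.s/chk.s): the speculative load records a poison bit instead of faulting, and the check runs fixup code only if the bit is set.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of ld.s / chk.s: a value carries a NaT
   ("Not a Thing") bit instead of raising an exception eagerly. */
typedef struct { long val; bool nat; } SpecReg;

SpecReg spec_load(const long *addr) {          /* ld.s */
    SpecReg r;
    if (addr == NULL) {        /* would have faulted: defer it */
        r.val = 0; r.nat = true;
    } else {
        r.val = *addr; r.nat = false;
    }
    return r;
}

long check_and_use(SpecReg r, const long *addr) {  /* chk.s + use */
    if (r.nat)                     /* speculation failed: run fixup */
        return addr ? *addr : 0;   /* non-speculative reload        */
    return r.val;
}

int main(void) {
    long x = 42;
    SpecReg r = spec_load(&x);     /* hoisted above the branch */
    /* ... other work, branch resolves ... */
    printf("%ld\n", check_and_use(r, &x));
    return 0;
}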
Control Speculation (Hoist Uses)
[IA-64 code: both the ld.s and the use of its result are hoisted above the branch; a chk.s after the branch validates the speculation]
• The uses of speculative data can themselves be executed speculatively
  – this distinguishes speculation from a simple prefetch
• The NaT bit propagates down the dependent instruction chain
Control Speculation (Recovery)
• All computation instructions propagate NaTs to their consumers, reducing the number of checks
• cmp propagates "false" if NaT is set when writing predicates ("0" for both target predicates)
ld8.s r3 = (r9)
ld8.s r4 = (r10)
add   r6 = r3, r4
ld8.s r5 = (r6)
p1,p2 = cmp(...)
chk.s r5, recv
sub   r7 = r5, r2

Recovery code (recv):
  ld8
  ld8
  add
  ld8
  br home

NaT propagation allows a single chk on the final result
Data Speculation (Advanced Loads)
• The compiler can hoist a load above a preceding, possibly conflicting store
• The ALAT (Advanced Load Address Table) is used to check every store address in between
• Can also be done by a superscalar machine using store coloring (a toy ALAT model is sketched below)
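A hedged C sketch of the ALAT idea (an assumed single-entry model, not the real structure): the advanced load records its address, every later store looks the address up and invalidates a matching entry, and the check load re-executes only if the entry is gone.

#include <stdbool.h>
#include <stdio.h>

/* Toy single-entry ALAT: real hardware tracks many entries keyed
   by target register and physical address. */
static const long *alat_addr;      /* address of the advanced load */
static bool        alat_valid;

long adv_load(const long *addr) {  /* ld8.a */
    alat_addr  = addr;
    alat_valid = true;
    return *addr;
}

void store(long *addr, long v) {   /* every store checks the ALAT */
    if (alat_valid && addr == alat_addr)
        alat_valid = false;        /* conflict: entry invalidated */
    *addr = v;
}

long check_load(long speculative, const long *addr) {  /* ld.c */
    return alat_valid ? speculative : *addr;  /* reload on conflict */
}

int main(void) {
    long a = 1, b = 2;
    long r = adv_load(&a);         /* hoisted above the store       */
    store(&b, 9);                  /* no conflict: ALAT entry kept  */
    printf("%ld\n", check_load(r, &a));   /* prints 1, no reload    */
    return 0;
}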
Conventional architectures:
  instr 1
  instr 2
  ...
  st8       <- barrier
  ld8
  use

Itanium:
  ld8.a
  instr 1
  instr 2
  st8
  ld.c
  use
Data Speculation (load.a + chk.a)
• The compiler hoists a load and its subsequent consumers above a preceding, possibly conflicting store
• Recovery code must be patched in for the mis-speculation case
Load hoisted with ld.c:
  ld8.a r3=
  instr 1
  instr 2
  st8
  ld.c
  add =r3,

Load and its uses hoisted with chk.a:
  ld8.a r3=
  instr 1
  add =r3,
  instr 2
  st8
  chk.a
L1:

Recovery code:
  ld8 r3=
  add =r3,
  br L1
Predication
[Diagram: traditional architectures compare and then branch to separate then/else paths; the Itanium architecture computes predicates p1/p2 from the compare and executes both paths under p1 and p2]
• Converts branches to conditional execution
 Executes multiple paths simultaneously
• Exposes parallelism and reduces critical path
 Better utilizes wider machines
 Reduces mispredicted branches
Parallel Compare Types
[Diagram: a serial chain of dependent compares A -> B -> C -> D is flattened so that A, B, C and D are evaluated in parallel, reducing the critical path]
• Three new types of compares:
 and: both target predicates set FALSE if compare is false
 or: both target predicates set TRUE if compare is true
 DeMorgan: if true, sets one TRUE, sets other FALSE
And Predicate
cmp.eq.and p1,p2 = 80, r4
• Usage
  – p1 = p1 and (80 == r4)
  – p2 = p2 and (80 == r4)
• How to initialize p1 and p2
  – cmp.eq.unc p1,p2 = r0,r0   (sets p1 = 1, p2 = 0)
More Examples of Parallel Compare
• Parallel cmp.eq.and or cmp.eq.or writes the same value to both target predicates
• Use cmp.eq.and.orcm or cmp.eq.or.andcm to write complementary predicates
  – also called the DeMorgan type (complementary outputs)
cmp.ge.and.orcm p6,p7 = 80, r4
More Examples of Parallel Compare

if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
else   // !c1 || !c2 || !c3 || !c4
    r4 = r5 - r6;

[Diagram: c1-c4 feed four parallel and.orcm compares; p1 guards the then-side add, p2 the else-side sub]

Itanium code:
  cmp.eq          p1,p2 = r0,r0 ;;      (initialize p1 = 1, p2 = 0)
  cmp.eq.and.orcm p1,p2 = c1,r0
  cmp.eq.and.orcm p1,p2 = c2,r0
  cmp.eq.and.orcm p1,p2 = c3,r0
  cmp.eq.and.orcm p1,p2 = c4,r0 ;;
  (p1) add r1 = r2,r3
  (p2) sub r4 = r5,r6

(A C model of the and.orcm update follows below.)
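To make the predicate updates concrete, here is a small C model (my reading of the and/orcm semantics described on the previous slides, not generated Itanium code): p1 accumulates the AND of the conditions, p2 its complement, and exactly one of the two guarded statements executes.

#include <stdbool.h>
#include <stdio.h>

/* Model of cmp.*.and.orcm p1,p2 = ...: when the compare result is
   false, p1 is cleared and p2 is set; otherwise both are unchanged. */
static void cmp_and_orcm(bool result, bool *p1, bool *p2) {
    if (!result) { *p1 = false; *p2 = true; }
}

int main(void) {
    int r2 = 3, r3 = 4, r5 = 10, r6 = 7;
    int r1 = 0, r4 = 0;
    bool c1 = true, c2 = true, c3 = false, c4 = true;

    bool p1 = true, p2 = false;      /* init, like cmp.eq p1,p2 = r0,r0 */
    cmp_and_orcm(c1, &p1, &p2);      /* the four compares are           */
    cmp_and_orcm(c2, &p1, &p2);      /* independent and could all       */
    cmp_and_orcm(c3, &p1, &p2);      /* issue in the same cycle         */
    cmp_and_orcm(c4, &p1, &p2);

    if (p1) r1 = r2 + r3;            /* (p1) add r1 = r2,r3 */
    if (p2) r4 = r5 - r6;            /* (p2) sub r4 = r5,r6 */
    printf("p1=%d p2=%d r1=%d r4=%d\n", p1, p2, r1, r4);
    return 0;
}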
Eight Queen Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Unconditional compares:
  Cycle 1:  R1 = &b[j]        R3 = &a[i+j]        R5 = &c[i-j+7]
  Cycle 2:  ld R2 = [R1]      ld.s R4 = [R3]      ld.s R6 = [R5]
  Cycle 4:  p1,p2 = cmp.unc(R2 == true)
  Cycle 5:  (p1) chk.s R4     (p1) p3,p4 = cmp.unc(R4 == true)
  Cycle 6:  (p3) chk.s R6     (p3) p5,p6 = cmp.unc(R6 == true)
  Cycle 7:  (p5) br then
            else

[Control-flow diagram: p2, p4 and p6 each lead to the else path; p5 reaches the then path]
Source: Crawford & Huck
Eight Queen Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Parallel compares:
  Cycle 1:  R1 = &b[j]        R3 = &a[i+j]        R5 = &c[i-j+7]        p1 <- true
  Cycle 2:  ld R2 = [R1]      ld R4 = [R3]        ld R6 = [R5]
  Cycle 4:  p1,p2 <- cmp.and(R2 == true)   p1,p2 <- cmp.and(R4 == true)   p1,p2 <- cmp.and(R6 == true)
  Cycle 5:  (p1) br then
            else

[p1 = true selects the then path; p1 = false selects the else path]
Reduced from 7 cycles to 5
Source: Crawford & Huck
Multiway Branches

w/o speculation:
  ld8 r6 = (ra)
  (p1) br exit1
  ld8 r7 = (rb)
  (p3) br exit2
  ld8 r8 = (rc)
  (p5) br exit3

Hoisting loads:
  ld8   r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p1) br exit1
  chk r7, rec1
  (p3) br exit2
  chk r8, rec2
  (p5) br exit3
(3 branch cycles)

Multi-way branches:
  ld8   r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p2) chk r7, rec1
  (p4) chk r8, rec2
  (p1) br exit1
  (p3) br exit2
  (p5) br exit3
(1 branch cycle)

• Multiway branches: more than one branch in a single cycle
  – e.g., the BBB template enables 3 branches in one bundle
• Itanium allows multiple "consecutive" B instructions in the same instruction group
• Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
• Ordering matters if the branch predicates are not mutually exclusive
Modulo Scheduling Support
• Itanium features that support modulo scheduling (software pipelining):
  – full predication
  – special branch-handling features
    • br.ctop (for counted for-loops with a known trip count)
    • br.wtop (for while-loops)
  – register rotation: removes loop-copy overhead
    • no modulo variable expansion, tighter code
  – predicate rotation/generation
    • removes the prologue & epilogue
(A hand software-pipelined C loop is sketched below for comparison.)
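A C-level sketch of what modulo scheduling does to a simple loop (an assumed illustration; on Itanium, br.ctop plus register and predicate rotation would generate the prolog and epilog behavior automatically instead of the duplicated code shown here).

#include <stdio.h>

#define N 8

/* Hand software-pipelined version of:
 *     for (i = 0; i < N; i++) a[i] = b[i] * k;
 * split into three stages (load, multiply, store) so that in steady
 * state the kernel works on three different iterations at once. */
void pipelined(long *a, const long *b, long k) {
    long ld0, ld1, mul0;

    /* prolog: fill the pipeline */
    ld1  = b[0];                     /* load     iteration 0 */
    mul0 = ld1 * k;                  /* multiply iteration 0 */
    ld1  = b[1];                     /* load     iteration 1 */

    /* kernel (steady state): store i, multiply i+1, load i+2 */
    for (int i = 0; i < N - 2; i++) {
        a[i] = mul0;
        ld0  = ld1;
        mul0 = ld0 * k;
        ld1  = b[i + 2];
    }

    /* epilog: drain the pipeline */
    a[N - 2] = mul0;
    a[N - 1] = ld1 * k;
}

int main(void) {
    long b[N] = {1, 2, 3, 4, 5, 6, 7, 8}, a[N];
    pipelined(a, b, 10);
    for (int i = 0; i < N; i++) printf("%ld ", a[i]);
    printf("\n");
    return 0;
}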
List Scheduling
[Dependency graph with constants C1, C2, C3; node priorities: X1 (ld) 11, A1 (+) 9, M1 (x) 7, M2 (x) 5, A2 (+) 5, M3 (x) 3, A3 (+) 1, X2 (st) 0]

P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;

Latency: Mem 1 cycle, Adder 2 cycles, Multiplier 2 cycles
• Build the dependency graph
• Assign a priority of 0 to all operations having no successors
• Assign each remaining operation the sum of the priority and latency of its successor; if there is more than one successor, take the maximum
• Schedule instructions in priority order
Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
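The priority computation can be written down directly. The C sketch below (my own encoding of the example's DAG, using the slide's rule of "successor priority + successor latency") reproduces the numbers above: X2 = 0, A3 = 1, M3 = 3, A2 = M2 = 5, M1 = 7, A1 = 9, X1 = 11.

#include <stdio.h>

enum { X1, A1, M1, M2, A2, M3, A3, X2, NOPS };

/* The example's dependence graph: latency of each op's unit and the
 * list of successors (data consumers), -1 = no more successors.     */
static const int  latency[NOPS] = {1, 2, 2, 2, 2, 2, 2, 1};
static const char *opname[NOPS] = {"X1","A1","M1","M2","A2","M3","A3","X2"};
static const int  succ[NOPS][3] = {
    {A1, -1, -1},      /* X1 -> A1           */
    {M1, M2, A2},      /* A1 -> M1, M2, A2   */
    {A2, -1, -1},      /* M1 -> A2           */
    {M3, A3, -1},      /* M2 -> M3, A3       */
    {M3, -1, -1},      /* A2 -> M3           */
    {A3, -1, -1},      /* M3 -> A3           */
    {X2, -1, -1},      /* A3 -> X2           */
    {-1, -1, -1}       /* X2: store, no succ */
};

int main(void) {
    int prio[NOPS];
    /* Ops with no successor get 0; otherwise take the maximum of
     * (successor priority + successor latency).  Processing in
     * decreasing index order is reverse topological order here.   */
    for (int i = NOPS - 1; i >= 0; i--) {
        prio[i] = 0;
        for (int j = 0; j < 3 && succ[i][j] >= 0; j++) {
            int s = succ[i][j];
            int p = prio[s] + latency[s];
            if (p > prio[i]) prio[i] = p;
        }
    }
    for (int i = 0; i < NOPS; i++)
        printf("%s: %d\n", opname[i], prio[i]);
    return 0;
}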
List Scheduling
[The same dependency graph, and the reservation table produced by list scheduling (time 0-11): MEM holds X1 and X2, ADDER holds A1, A2, A3, MULT holds M1, M2, M3; one iteration occupies 12 cycles]
• List scheduling (a heuristic) provides a near-optimal schedule
• But there is no guarantee of optimality, especially in terms of throughput
Scheduling
• If I want to reuse the same schedule for every iteration, what is the minimum initiation interval?
• In the example, do I need to wait 12 cycles between iterations?
• If not, how do I avoid resource collisions?
[The reservation table of one iteration (time 0-11), repeated from the previous slide]
Modulo Scheduling [Rau & Glaeser '81]
• A.k.a. "polycyclic scheduling" or "software pipelining"
• Exploit ILP among loop iterations to maximize
  – machine utilization
  – throughput
• Use a common schedule for the majority of iterations
• Overlap execution of consecutive iterations
• Constant initiation rate -> Initiation Interval (II)
• The minimum II (MII) gives an optimal schedule with maximum throughput
• Originally developed for the polycyclic architecture (also called horizontal architecture, later known as VLIW)
Modulo Scheduling: Resource Constraint
• The optimal schedule is constrained by the number of available resources
• Determine ResII (the resource-constrained minimum initiation interval)
  – successive iterations will be scheduled ResII cycles apart
• N(i) is the number of uses of resource i in one loop iteration
• C(i) is the number of copies of resource i

      ResII = max( ⌈N(1)/C(1)⌉, ⌈N(2)/C(2)⌉, ⌈N(3)/C(3)⌉, ... )
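The formula is easy to evaluate mechanically; a small C sketch is shown below, using the resource counts of the example on the next slide.

#include <stdio.h>

/* ResII = max over resources i of ceil(N(i) / C(i)),
 * where N(i) = uses of resource i per iteration and
 *       C(i) = number of copies of resource i.       */
int res_ii(const int N[], const int C[], int nres) {
    int ii = 0;
    for (int i = 0; i < nres; i++) {
        int r = (N[i] + C[i] - 1) / C[i];   /* ceiling division */
        if (r > ii) ii = r;
    }
    return ii;
}

int main(void) {
    /* Example from the next slide: 2 memory ops, 3 adds, 3 multiplies
     * on 1 memory unit, 1 adder, 1 multiplier. */
    int N[] = {2, 3, 3};
    int C[] = {1, 1, 1};
    printf("ResII = %d\n", res_ii(N, C, 3));   /* prints 3 */
    return 0;
}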
Resource II
[The same dependency graph: X1, A1, A2, A3, X2, M1, M2, M3]
• Assume 3 FUs:
  – 1 adder with 2-cycle latency
  – 1 multiplier with 2-cycle latency
  – 1 memory unit with 1-cycle latency
• Determine MII = Resource II

      ResII = MII = max( ⌈2/1⌉, ⌈3/1⌉, ⌈3/1⌉ ) = 3
      (2 memory ops, 3 adds, 3 multiplies per iteration)
Modulo Reservation Table (MRT)
[Left: the one-iteration reservation table (time 0-11), with each cycle labeled by its modulo slot (time mod 3). Right: an empty schedule for one iteration (time 0-14) and an empty MRT with modulo slots 0-2 and resources MEM, ADDER, MULT, to be filled in on the next slide]
Modulo Reservation Table (MRT)
[Left: the original one-iteration schedule (time 0-11) with its modulo slots. Right: the new schedule for one iteration, stretched so that no resource is reused in the same modulo slot (II = 3)]

New schedule for one iteration (time: op):
  0: X1   1: A1   3: M1   4: M2   5: A2   8: M3   12: A3   14: X2

MRT (modulo slots 0 / 1 / 2):
  MEM:    X1    -     X2
  ADDER:  A3    A1    A2
  MULT:   M1    M2    M3
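A brief C sketch of the modulo constraint behind the MRT (an illustration with the example's times hard-coded): an operation scheduled at time t occupies its resource in slot t mod II, and no two operations on the same resource may share a slot.

#include <stdbool.h>
#include <stdio.h>

#define II   3
#define NOPS 8

enum { MEM, ADDER, MULT };

/* One-iteration schedule from the slide: (resource, issue time). */
static const int  res[NOPS]    = {MEM, ADDER, MULT, MULT, ADDER, MULT, ADDER, MEM};
static const int  when[NOPS]   = {0,   1,     3,    4,    5,     8,    12,    14};
static const char *opname[NOPS] = {"X1","A1","M1","M2","A2","M3","A3","X2"};

int main(void) {
    /* mrt[resource][slot] records which op occupies that modulo slot. */
    int mrt[3][II];
    for (int r = 0; r < 3; r++)
        for (int s = 0; s < II; s++) mrt[r][s] = -1;

    bool ok = true;
    for (int i = 0; i < NOPS; i++) {
        int slot = when[i] % II;
        if (mrt[res[i]][slot] != -1) {     /* resource conflict */
            printf("conflict: %s vs %s in slot %d\n",
                   opname[i], opname[mrt[res[i]][slot]], slot);
            ok = false;
        } else {
            mrt[res[i]][slot] = i;
        }
    }
    printf(ok ? "schedule fits the MRT (II = %d)\n"
              : "schedule violates the MRT (II = %d)\n", II);
    return 0;
}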
Modulo Scheduled Loop
[Table: iterations 1-11 overlapped with II = 3. A new iteration starts every 3 cycles (X1 of iteration k issues at t = 3(k-1)); after the prolog fills the pipeline, the kernel reaches a steady state in which every 3-cycle window is exactly the MRT schedule, with X1/X2 on MEM, A1/A2/A3 on ADDER and M1/M2/M3 on MULT, each drawn from a different iteration]
Prolog, then kernel (steady state, the MRT schedule)
Modulo Scheduled Loop
[Table: the tail of the loop. The last kernel windows still mix operations from several different iterations (e.g., X1(N-2) issues in the same window as X2(N-6)); once X1(N) has issued, the epilog drains the remaining A, M and X2 operations until X2(N) completes]
Last kernel, then epilog
Another Modulo Schedule Example
[Dataflow graph: inputs A, B, C, D, E feed adds A1, A2, A3 and multiplies M1, M2, producing Z]
Given 2 adders (1-cycle latency) & 1 multiplier (2-cycle latency):

      MII = max( ⌈3/2⌉, ⌈2/1⌉ ) = 2      (3 adds on 2 adders, 2 multiplies on 1 multiplier)

The schedule consists of a prolog, 5x the kernel, and an epilog.

Modulo Reservation Table (II = 2): columns ADDER1, ADDER2, MULT and rows for modulo slots 0 and 1, holding A1 (3), A2 (3), A3 (1) on the adders and M1 (3), M2 (2) on the multiplier.
The multiplier is fully utilized.