
UPC
UNIVERSITAT POLITÈCNICA
DE CATALUNYA
A Unified Modulo Scheduling and Register
Allocation Technique for Clustered Processors
Josep M. Codina, Jesús Sánchez and Antonio González
Dept. of Computer Architecture
Universitat Politècnica de Catalunya
Barcelona, SPAIN
E-mail: {jmcodina,fran,antonio}@ac.upc.es
INTRODUCTION

Why Clustered Architectures?
• Semiconductor technology is continuously improving
  • New technologies pack more logic in a single chip
  • Exploit more ILP → more functional units, registers, etc.
  • Faster clock cycles
• However, new problems may arise
  • Delay of signals or data movement from one part to another
  • Power consumption
• Solution: exploit communication locality
  • Divide the system into several “units”
  • They can work almost independently and at very high frequency
  • Some communication channels are used to exchange signals/data
→ CLUSTERING
Current Trends in Clustered Architectures
• Partition the register file & functional units
• For embedded/DSP processors: VLIW design
  • C6000 DSP of Texas Instruments
  • TigerSHARC of Analog Devices
  • Lx of HP/ST, etc.
• Code generation
  • Cluster assignment
  • Instruction scheduling
  • Register allocation
• For loops: modulo scheduling
Previous work on modulo scheduling
• Several works for non-clustered VLIW architectures
  • Iterative MS, Slack MS, Swing MS, IRIS MS, etc.
• Some works for clustered VLIW architectures
  • E. Nystrom and A. E. Eichenberger [MICRO ’98]
  • M. M. Fernandes et al. [HPCA ’99]
  • J. Sánchez and A. González [ICPP ’00]
  • J. Sánchez and A. González [MICRO ’00]
• None of them takes register constraints into account
[Annotations in the original slide: Shared memory, Distributed memory]
How to deal with register constraints?
• Add spill code and/or increase II
  • Eisenbeis et al. [MICRO ’94]
  • Ruttenberg et al. [PLDI ’96]
  • Zalamea et al. [PLDI ’00]
• In these previous works:
  • Non-clustered
  • Spill after scheduling
• List scheduling → K. Kailas, K. Ebcioglu and A. Agrawala [HPCA ’01]
• In this work:
  • Clustered
  • Spill during scheduling
  • Modulo scheduling
Talk Outline
• Clustered VLIW Architecture
• Our previous work
• URACAM
  • Basic Ideas
  • Algorithm
  • Example
• Evaluation
• Conclusions
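CLUSTERED VLIW ARCHITECTURE

Architecture Overview
[Figure: block diagram of the clustered VLIW architecture. Each cluster has a LOCAL REGISTER FILE and several functional units (FU); the clusters are connected by BUS/ES and share an L1 CACHE.]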
Our previous work
• Features of the basic scheduling algorithm (SA+GO, ICPP ’00)
  • Unified assign-and-schedule approach
  • Cluster assignment heuristics to reduce the number of communications
  • Loop unrolling to reduce the number of communications
• Main drawbacks
  • It does not deal with spill code
  • Unrolling may increase code size
URACAM

Basic Ideas
• Main factors in Modulo Scheduling for Clustered Architectures
  • Communications
  • Register requirements
  • Memory pressure
• A good scheme has to take all of them into account
Algorithm Overview
[Figure: flowchart. Compute MII → Sort DDG nodes → START. For each node: "Try to schedule in cluster 0" ... "Try to schedule in cluster N", each producing a New State; "Try to Improve" produces further New States; the Best State is selected and the flow continues with the Next node. If there is No Feasible State, II is increased (+ II).]
Algorithm Overview: Compute MII
• Like a monolithic architecture
  • Recurrences
  • Resources
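The MII bound mentioned above is commonly taken as the maximum of a resource-constrained bound and a recurrence-constrained bound. The following is a minimal sketch of that textbook computation, not the authors' implementation; the function names and the input encoding (operation counts per resource class, and one `(latency, distance)` pair per DDG cycle) are assumptions made for illustration. "Like a monolithic architecture" is read here as counting the units of all clusters together.

```python
from math import ceil


def res_mii(op_counts, units):
    """Resource bound: operations per resource class divided by available units."""
    return max((ceil(op_counts[r] / units[r]) for r in op_counts), default=1)


def rec_mii(recurrences):
    """Recurrence bound: for each DDG cycle, total latency / total iteration distance."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def mii(op_counts, units, recurrences):
    """Minimum initiation interval: the larger of the two lower bounds."""
    return max(res_mii(op_counts, units), rec_mii(recurrences))


# Hypothetical loop: 8 operations on 4 FUs (2 clusters x 2 FUs, counted together)
# and one recurrence with total latency 3 spanning 2 iterations.
print(mii({"fu": 8}, {"fu": 4}, [(3, 2)]))  # 2
```

Both bounds are only lower limits; the scheduler may still have to raise II when no feasible state is found.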
Algorithm Overview: Sort DDG nodes
• According to SMS (Llosa et al., PACT ’96)
• Priority to nodes in recurrences
• Avoids having both predecessors and successors of a node scheduled before it
Algorithm Overview: START + Next node
• All nodes are handled following the computed order
Algorithm Overview: Try to schedule in all clusters
• Generation of a possible partial schedule (new state)
• Schedule the operation as close as possible to the already scheduled ones
• Resource constraints
• Communications are scheduled
Algorithm Overview: Trying to improve
• Adding spill code to reduce register requirements
• Spill code to reduce communications → memory-based communications
• Communications to reduce memory pressure
• Undoing spill code to reduce memory pressure
Algorithm Overview: Best State
• Non-valid candidates are discarded
• If there is no feasible state → increase II
• The best of the valid candidates is chosen → Figure of Merit
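The flow described on these slides can be sketched as a small driver loop. This is only an illustration of the control structure (try every cluster, try to improve, discard non-valid candidates, keep the best one, raise II when nothing is feasible). The callbacks `try_schedule`, `improve`, `is_valid` and `merit` are hypothetical placeholders, and the toy resource model used in the demonstration (one FU slot per operation, no dependences, no registers) is an assumption, not the URACAM model.

```python
def schedule_loop(nodes, n_clusters, mii, try_schedule, improve, is_valid, merit,
                  max_ii=64):
    """Assign-and-schedule skeleton; all callbacks are hypothetical."""
    ii = mii
    while ii <= max_ii:
        state = {}                      # empty partial schedule
        feasible = True
        for node in nodes:              # nodes arrive in the precomputed order
            candidates = []
            for cluster in range(n_clusters):
                new_state = try_schedule(state, node, cluster, ii)
                if new_state is None:
                    continue
                candidates.append(new_state)
                candidates.extend(improve(new_state))   # transformed variants
            candidates = [c for c in candidates if is_valid(c)]
            if not candidates:          # no feasible state: increase II, restart
                feasible = False
                break
            state = min(candidates, key=merit)          # best state wins
        if feasible:
            return ii, state
        ii += 1
    raise RuntimeError("no schedule found up to max_ii")


# Toy model (assumption): a state maps node -> cluster, 1 FU per cluster.
def toy_try(state, node, cluster, ii, fus_per_cluster=1):
    load = sum(1 for c in state.values() if c == cluster)
    if load >= fus_per_cluster * ii:
        return None
    new_state = dict(state)
    new_state[node] = cluster
    return new_state


def toy_merit(state, n_clusters=2):
    loads = [sum(1 for c in state.values() if c == k) for k in range(n_clusters)]
    return sorted(loads, reverse=True)  # smaller worst-case load is better


ii, state = schedule_loop(
    nodes=["A", "B", "C", "D", "E"], n_clusters=2, mii=2,
    try_schedule=toy_try, improve=lambda s: [], is_valid=lambda s: True,
    merit=toy_merit,
)
print(ii, state)
```

With five operations and two single-FU clusters, II = 2 offers only four slots, so the loop restarts once with II = 3.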
Figure of Merit
• Used to choose the best alternative in every partial schedule
• A single criterion to evaluate a schedule
• Measures the utilization of the most critical resources
• Underlying concepts:
  • Scarce resources are more valuable than abundant ones
  • Maximize the available resources of the most used ones
• Set of percentages: 2N+1 percentages, N = num_clusters
  • Index 0: Com (%)
  • Indices 1 to N: Mem (%), one per cluster
  • Indices N+1 to N+N: Regs (%), one per cluster
Using Figure of Merit
• Comparing two new states
  • Compute the usage percentages of the remaining resources
  • Compare from the highest to the lowest percentage
• Figure of Merit in transformations gives
  • Best candidate
  • Benefit of the transformation
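The comparison rule above can be sketched as follows. This is one possible reading of the slides, not the authors' code: each candidate state gets a vector of percentages (how much of the still-free amount of each resource it consumes), the vectors are sorted from highest to lowest, and the candidate whose sorted vector is lexicographically smaller is preferred. The function names and the sample usage figures are assumptions for illustration.

```python
def figure_of_merit(used, free_before):
    """Percentage of each resource's remaining free amount that a candidate uses."""
    fom = []
    for u, f in zip(used, free_before):
        if f == 0:
            # Nothing left of this resource: any further use is infeasible.
            fom.append(0.0 if u == 0 else float("inf"))
        else:
            fom.append(100.0 * u / f)
    return fom


def is_better(fom_a, fom_b):
    """Compare from the highest to the lowest percentage (lower is better)."""
    return sorted(fom_a, reverse=True) < sorted(fom_b, reverse=True)


# Vector order assumed: [Com, Mem cluster 1, Mem cluster 2, Regs cluster 1, Regs cluster 2]
free = [4, 4, 4, 16, 16]
a = figure_of_merit([0, 0, 0, 4, 0], free)   # candidate A: 4 register slots in cluster 1
b = figure_of_merit([1, 0, 0, 1, 1], free)   # candidate B: 1 bus slot, 1 register slot per cluster
print(a, b, is_better(a, b))
```

Both candidates have the same worst percentage (25%), so the decision falls to the next entries, where A consumes less. Python's built-in list comparison is lexicographic, which matches the "highest to lowest" rule once the vectors are sorted.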
An Example
• 2 clusters
• 2 general-purpose FUs per cluster
• 2 memory ports per cluster, latency = 1
• Unified MII = 2 cycles
• 8 registers per cluster
• 2 buses, latency = 1
• A, B, C, D: non-memory operations with a latency of 1
[Figure: DDG with nodes A, B, C and D; a label 2 appears between A and B.]
[Table: nodes D, B, A, C against Cluster 1 and Cluster 2, still empty. Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 16, 16. Legend: Free resources / Used resources.]
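One reading of the resource values in this example is that every resource offers (number of units × II) slots per iteration. The sketch below only checks that arithmetic against the slide's parameters; the function name and the vector order (Bus, Mem per cluster, Regs per cluster) are assumptions.

```python
def slot_capacities(ii, n_clusters, buses, mem_ports_per_cluster, regs_per_cluster):
    """Slots available per iteration: units x II, in the order Bus, Mem..., Regs..."""
    bus = [buses * ii]
    mem = [mem_ports_per_cluster * ii] * n_clusters
    regs = [regs_per_cluster * ii] * n_clusters
    return bus + mem + regs


capacities = slot_capacities(ii=2, n_clusters=2, buses=2,
                             mem_ports_per_cluster=2, regs_per_cluster=8)
print(capacities)  # [4, 4, 4, 16, 16]

# Under the same reading, consuming 4 of the 16 register slots of one cluster
# corresponds to 25% of that resource.
print(100.0 * 4 / capacities[3])  # 25.0
```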
An Example
[Table: percentages entered for nodes D, B, A, C under Cluster 1 / Cluster 2, in slide order: 0%; 0%; 6.25% 25% - 6.25%; 20% 50% - 13.33%. Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 12, 16.]
An Example
[Table: percentages entered for nodes D, B, A, C under Cluster 1 / Cluster 2, in slide order: 0%; 0%; 6.25% 25% - 6.25%; 20% 50% - 13.33%; 83.33%. Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 12, 16.]
An Example
[Figure: Spill Code is added to the DDG.]
[Table: percentages entered for nodes D, B, A, C under Cluster 1 / Cluster 2, in slide order: 0%; 0%; 6.25% 25% - 6.25%; 20% 50% - 13.33%; 83.33%; 50% - 8.33%. Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 12, 16. Highlighted values: 50%, 8.33%.]
An Example
[Figure: the DDG is shown with a Communication and with a Communication through memory.]
[Table: percentages entered for nodes D, B, A, C under Cluster 1 / Cluster 2, in slide order: 0%; 0%; 6.25% 25% - 6.25%; 20% 50% - 13.33%; 25%-25%-... Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 12, 16. Highlighted values: 25%, 25%, 25%, 6.25%.]
An Example
[Figure: DDG split between Cluster 1 and Cluster 2, now including St, Ld and Com operations in addition to A, B, C and D.]
[Table: percentages entered for nodes D, B, A, C under Cluster 1 / Cluster 2, in slide order: 0%; 0%; 6.25% 25% - 6.25%; 20% 50% - 13.33%; 83.33% 25%-25%-...; 50% - 8.33%. Resource values for Bus, Mem Clust 1, Mem Clust 2, Regs Clust 1, Regs Clust 2: 4, 4, 4, 16, 16 and 3, 3, 3, 12, 15.]
Memory operations
• Additional memory operations
  • Spill code
  • Communications through memory
• Operations from the original DDG might then not be schedulable
• Solution: differentiate memory pressure in the figure of merit
  • Global → original memory operations
  • Local
• Set of percentages: 2N+2 percentages, N = num_clusters
  • Index 0: Com (%)
  • Index 1: Global Mem (%)
  • Up to index N+1: Local Mem (%), one per cluster
  • Indices N+2 to 2N+1: Regs (%), one per cluster
Evaluation
• Evaluated using SPECfp95
• Using graphs generated by the ICTINEO compiler
PERFORMANCE EVALUATION

Configuration

Resources        Unified   2-cluster   4-cluster
INT/cluster      4         2           1
FP/cluster       4         2           1
MEM/cluster      4         2           1
REGS/cluster     64/32     32/16       16/8

                 2-cluster   4-cluster
Comm Buses       1/4         1/4
Bus Latency      1           1

Latencies        INT   FP
MEM              2     2
ARITH/ABS        1     3
MUL              2     6
DIV/SQR/TRG      6     18
IPC - 64 registers
[Figure: bar chart of IPC (axis 0 to 6) for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and the harmonic mean (HMEAN). Series: URACAM 2-clusters, SA+GO 2-clusters, URACAM 4-clusters, SA+GO 4-clusters.]
IPC - 32 registers
[Figure: bar chart of IPC (axis 0 to 6) for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and the harmonic mean (HMEAN). Series: URACAM 2-clusters, SA+GO 2-clusters, URACAM 4-clusters, SA+GO 4-clusters.]
URACAM Performance – 1 bus
[Figure: bar chart of IPC (axis 0 to 9), 64 registers, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and the harmonic mean (HMEAN). Series: Unified, 2-Clusters, 4-Clusters.]
URACAM Performance – 4 bus
[Figure: bar chart of IPC (axis 0 to 9), 64 registers, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and the harmonic mean (HMEAN). Series: Unified, 2-Clusters, 4-Clusters.]
Conclusions
• URACAM handles communications, memory pressure and registers at the same time
  • Searches for the best overall solution
• Figure of Merit: a single criterion to compare partial schedules
• Transformations to improve partial schedules
  • Spill code to reduce register pressure
  • Communications through memory to reduce bus pressure
  • Communications through bus to reduce memory pressure
  • Undo spill code to reduce memory pressure
• Spill code for a Clustered VLIW Architecture
  • Done during scheduling
Conclusions
• URACAM achieves better schedules than previous work on Modulo Scheduling for a Clustered VLIW Architecture
  • Speedup of 18% for 2 clusters and 22% for 4 clusters
    [For 1 inter-register bus with 1-cycle latency and 32 registers]
• Degradation with respect to a non-clustered architecture
  • 3% for 2 clusters and 10% for 4 clusters
    [For 4 inter-register buses with 1-cycle latency and 32 registers]
• URACAM is an adaptive and powerful technique
  • Figure of Merit
  • Transformations