UPC
UNIVERSITAT POLITÈCNICA
DE CATALUNYA
A Unified Modulo Scheduling and Register
Allocation Technique for Clustered Processors
Josep M. Codina, Jesús Sánchez and Antonio González
Dept. of Computer Architecture
Universitat Politècnica de Catalunya
Barcelona, SPAIN
E-mail: {jmcodina,fran,antonio}@ac.upc.es
Why Clustered Architectures?
INTRODUCTION
Semiconductor technology is continuously improving
New technologies pack more logic in a single chip
Exploit more ILP: more functional units, registers, etc.
Faster clock cycles
However, new problems may arise
Delay of signals or data movement from one part to another
Power consumption
Solution: exploit communication locality
Divide the system into several “units”
They can work almost independently and at very high frequency
Some communication channels are used to exchange signals/data
CLUSTERING
Current Trends in Clustered Architectures
Partition the register file & functional units
For embedded/DSP processors: VLIW design
C6000 DSP of Texas Instruments
TigerSharc of Analog Devices
Lx of HP/ST, etc.
Code generation
Cluster assignment
Instruction Scheduling
Register Allocation
For loops: modulo scheduling
Previous work on modulo scheduling
Several works for non-clustered VLIW architectures
Iterative MS, Slack MS, Swing MS, IRIS MS, etc.
Some works for clustered VLIW architectures
E. Nystrom and A. E. Eichenberger [MICRO ´98]
M. M. Fernandes et al. [HPCA ´99]
J. Sánchez and A. González [ICPP ´00]
J. Sánchez and A. González [MICRO ´00]
None of them takes register constraints into account
Shared memory
Distributed memory
How to deal with register constraints?
Add spill code and/or increase II
Eisenbeis et al. [MICRO ´94]
Ruttenberg et al. [PLDI ´96]
Zalamea et al. [PLDI´00]
In these previous works:
Non-clustered
Spill after scheduling
List Scheduling: K. Kailas, K. Ebcioglu and A. Agrawala [HPCA ´01]
In this work:
Clustered
Spill during scheduling
Modulo Scheduling
Talk Outline
Clustered VLIW Architecture
Our previous work
URACAM
Basic Ideas
Algorithm
Example
Evaluation
Conclusions
Architecture Overview
CLUSTERED VLIW ARCHITECTURE
[Figure: clustered VLIW architecture. Several clusters, each with a local register file and its functional units (FU), connected by bus/es, plus an L1 cache.]
Our previous work
Features of the basic scheduling algorithm (SA+GO - ICPP ’00)
Unified assign-and-schedule approach
Cluster assignment heuristics to reduce the number of communications
Loop Unrolling to reduce the number of communications
Main drawbacks
It does not deal with spill code
Unrolling can increase code size
Basic Ideas
Main factors in Modulo Scheduling for Clustered Architectures
Communications
Register requirements
Memory pressure
A good scheme has to take into account all of them
Algorithm Overview
[Flowchart: Compute MII -> Sort DDG nodes -> START -> for the current node, try to schedule in cluster 0 ... cluster N, each attempt producing a new state -> try to improve each new state -> Best State -> Next node. If there is no feasible state, increase II (+ II) and start again.]
Compute MII
Like a monolithic architecture
Recurrences
Resources
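The two bounds named on this slide can be sketched as follows. This is a minimal illustration of the textbook MII computation, not the authors' implementation: the function names, the dictionary layout and the assumption of fully pipelined units (one issue slot per operation) are all hypothetical. Following the "like a monolithic architecture" remark, the unit counts are totals over all clusters.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: ceil(operations / units), maximised over resource kinds.

    Assumes every unit is fully pipelined, so an operation holds a unit
    for exactly one cycle (an assumption of this sketch).
    """
    return max((ceil(op_counts[k] / unit_counts[k]) for k in op_counts), default=0)

def rec_mii(recurrences):
    """Recurrence bound: ceil(total latency / total distance) per dependence cycle.

    Each recurrence is a (latency_sum, distance_sum) pair.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)

def compute_mii(op_counts, unit_counts, recurrences):
    """MII is the larger of the two bounds, and never below one cycle."""
    return max(1, res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop: 4 ALU ops and 2 memory ops on a machine with 4 ALUs and
# 4 memory ports in total, plus one recurrence of latency 2 and distance 1.
mii = compute_mii({"alu": 4, "mem": 2}, {"alu": 4, "mem": 4}, [(2, 1)])
```

In this made-up case the resources allow one cycle per iteration, but the recurrence forces two, so the recurrence bound decides the MII.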
Sort DDG nodes
According to SMS (Llosa et al., PACT ´96)
Priority to nodes in recurrences
Avoids scheduling both predecessors and successors of a node before the node itself
START + Next node
All nodes are handled following the computed order
Try to schedule in all clusters
Generation of a possible partial schedule (new state).
Schedule the operation as close as possible to scheduled ones
Resource constraints
Communications are scheduled
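The resource-constraint check in a modulo schedule is usually done with a modulo reservation table (MRT). The sketch below shows that idea for one cluster and one kind of functional unit. It is an illustrative simplification under stated assumptions: the names `find_slot` and `place` are hypothetical, the search only goes forward from an earliest cycle derived from already scheduled predecessors, and buses and registers are ignored.

```python
def find_slot(mrt, ii, earliest, units_per_cluster, cluster):
    """Return the first cycle >= earliest with a free unit in `cluster`, or None.

    mrt[cluster][row] counts the operations already placed in row = cycle % ii.
    Only ii consecutive cycles need to be tried, since rows repeat after that.
    """
    for cycle in range(earliest, earliest + ii):
        if mrt[cluster][cycle % ii] < units_per_cluster:
            return cycle
    return None

def place(mrt, ii, cycle, cluster):
    """Record that one unit of `cluster` is busy in row cycle % ii."""
    mrt[cluster][cycle % ii] += 1

# Hypothetical machine: 2 clusters, 1 unit per cluster, II = 2.
ii = 2
mrt = [[0] * ii for _ in range(2)]
first = find_slot(mrt, ii, earliest=3, units_per_cluster=1, cluster=0)
place(mrt, ii, first, 0)
second = find_slot(mrt, ii, earliest=3, units_per_cluster=1, cluster=0)
place(mrt, ii, second, 0)
third = find_slot(mrt, ii, earliest=3, units_per_cluster=1, cluster=0)
```

Starting the search at the earliest legal cycle is what keeps the operation close to the ones already scheduled. When no row is free, the attempt in that cluster produces no new state.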
Trying to improve
Adding spill code to reduce register requirements
Spill code to reduce bus communications (memory-based communications)
Communications through the bus to reduce memory pressure
Undoing spill code to reduce memory pressure
Best State
Invalid candidates are discarded
If there is no feasible state, increase II
The best of the valid candidates is chosen using the Figure of Merit
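The control flow walked through in the Algorithm Overview slides can be sketched as a small driver loop. Everything here is a hypothetical stand-in: `try_schedule`, `try_improve` and `merit` are placeholder callbacks for the steps named on the slides, a state is just a tuple of (node, cluster) pairs, and a lower merit value is assumed to be better.

```python
def schedule_loop(nodes, num_clusters, mii, max_ii, try_schedule, try_improve, merit):
    """Sketch of the flowchart: returns (ii, state), or None if max_ii is exceeded."""
    for ii in range(mii, max_ii + 1):
        state = ()                          # empty partial schedule
        for node in nodes:                  # nodes arrive already sorted
            candidates = []
            for cluster in range(num_clusters):
                new_state = try_schedule(state, node, cluster, ii)
                if new_state is not None:   # invalid candidates are discarded
                    candidates.append(try_improve(new_state, ii))
            if not candidates:              # no feasible state: retry with II + 1
                break
            state = min(candidates, key=merit)
        else:
            return ii, state                # every node was placed
    return None

# Toy callbacks (assumptions for illustration): a cluster holds at most
# ii nodes, and the merit of a state is the load of its fullest cluster.
def toy_try_schedule(state, node, cluster, ii):
    load = sum(1 for _, c in state if c == cluster)
    return state + ((node, cluster),) if load < ii else None

def toy_merit(state):
    loads = {}
    for _, c in state:
        loads[c] = loads.get(c, 0) + 1
    return max(loads.values())

result = schedule_loop("ABCD", 2, 1, 4, toy_try_schedule, lambda s, ii: s, toy_merit)
```

With these toy callbacks the loop fails at II = 1, because only two nodes fit, and succeeds at II = 2 with the nodes balanced across both clusters.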
Figure of Merit
Used to choose the best alternative in every partial schedule
A single criterion to evaluate a schedule
Measuring the utilization of the most critical resources
Underlying concepts:
Scarce resources are more valuable than abundant ones
Maximize the available resources of the most used ones
Set of percentages: 2N+1 percentages, N = num_clusters
Entry 0: Com %
Entries 1 to N: Mem % (one per cluster)
Entries N+1 to 2N: Regs % (one per cluster)
Using Figure of Merit
Comparing two new states
Compute the percentage of remaining resources usage
Compare from the highest to the lowest percentages
Figure of Merit in transformations gives
Best candidate
Benefit of the transformation
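One possible reading of this comparison is sketched below. It is an interpretation, not the authors' definition: each state is summarised by 2N+1 usage fractions (bus, memory per cluster, registers per cluster), the fractions are sorted from highest to lowest, and the state with the lower value at the first differing position is taken as better. The function names and the tuple layout are hypothetical.

```python
def figure_of_merit(bus_used, mem_used, regs_used):
    """Return the 2N+1 usage fractions sorted from highest to lowest.

    bus_used is a single fraction; mem_used and regs_used hold one
    fraction per cluster.
    """
    return sorted([bus_used, *mem_used, *regs_used], reverse=True)

def better(state_a, state_b):
    """True if state_a leaves more room in its most used resources.

    Python compares lists element by element, which matches comparing
    from the highest to the lowest percentage.
    """
    return figure_of_merit(*state_a) < figure_of_merit(*state_b)

# Hypothetical candidates for a 2-cluster machine: (bus, [mem], [regs]).
cand_a = (0.25, [0.50, 0.00], [0.20, 0.10])
cand_b = (0.00, [0.50, 0.25], [0.30, 0.10])
a_wins = better(cand_a, cand_b)
```

Both candidates peak at 50% memory usage, so the tie is broken by the next most used resource: 25% for the first candidate against 30% for the second.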
An Example
• 2 clusters
• 2 general-purpose FUs per cluster
• 2 memory ports per cluster, latency = 1
• Unified MII = 2 cycles
• 8 registers per cluster
• 2 buses, latency = 1
• A, B, C, D: non-memory operations with a latency of 1
[Figure: DDG with nodes A, B, C and D, and an empty table listing nodes D, B, A and C against Cluster 1 and Cluster 2, with free and used resources for Bus, Mem (clusters 1 and 2) and Regs (clusters 1 and 2).]
[Figure: same machine, DDG and table as above. Entries shown in the Cluster 1 / Cluster 2 columns: 0%; 0%; 6.25%, 25% - 6.25%; 20%, 50% - 13.33%.]
[Figure: same machine, DDG and table as above. Entries shown: 0%; 0%; 6.25%, 25% - 6.25%; 20%, 50% - 13.33%; 83.33%.]
[Figure: same machine and table as above, with Spill Code added to the DDG. Entries shown: 0%; 0%; 6.25%, 25% - 6.25%; 20%, 50% - 13.33%; 83.33%; 50% - 8.33%. Free resources: 50%, 8.33%.]
[Figure: same machine and table as above, with the communication done through memory in the DDG. Entries shown: 0%; 0%; 6.25%, 25% - 6.25%; 20%, 50% - 13.33%; 25%-25%-... Free resources: 25%, 25%, 25%, 6.25%.]
[Figure: resulting assignment of nodes A, B, C and D to Cluster 1 and Cluster 2, with store (St), load (Ld) and communication (Com) operations added. Entries shown in the table: 0%; 0%; 6.25%, 25% - 6.25%; 20%, 50% - 13.33%; 83.33%, 25%-25%-...; 50% - 8.33%.]
Memory operations
Additional memory operations
Spill Code
Communications through memory
Operations from the original DDG might then not be schedulable
Solution:
Differentiate memory pressure in the figure of merit
Global: original memory operations
Local
Set of percentages: 2N+2 percentages, N = num_clusters
Com %
Global Mem %
Local Mem % (one per cluster)
Regs % (one per cluster)
Evaluation
Evaluated using SPECfp95
Using graphs generated by the ICTINEO compiler
PERFORMANCE EVALUATION
Configuration

Resources        Unified   2-cluster   4-cluster
INT/cluster      4         2           1
FP/cluster       4         2           1
MEM/cluster      4         2           1
REGS/cluster     64/32     32/16       16/8

                 2-cluster   4-cluster
Comm Buses       1/4         1/4
Bus Latency      1           1

Latencies        INT   FP
MEM              2     2
ARITH/ABS        1     3
MUL              2     6
DIV/SQR/TRG      6     18
IPC - 64 registers
[Figure: bar chart of IPC (axis 0 to 6) for each SPECfp95 benchmark (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and HMEAN. Series: URACAM 2-clusters, SA+GO 2-clusters, URACAM 4-clusters, SA+GO 4-clusters.]
IPC - 32 registers
[Figure: bar chart of IPC (axis 0 to 6) for each SPECfp95 benchmark (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and HMEAN. Series: URACAM 2-clusters, SA+GO 2-clusters, URACAM 4-clusters, SA+GO 4-clusters.]
URACAM Performance – 1 bus
[Figure: bar chart of IPC (axis 0 to 9) for each SPECfp95 benchmark (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and HMEAN. Labels: Unified, 2-Clusters, 4-Clusters; 64 Registers.]
URACAM Performance – 4 bus
[Figure: bar chart of IPC (axis 0 to 9) for each SPECfp95 benchmark (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and HMEAN. Labels: Unified, 2-Clusters, 4-Clusters; 64 Registers.]
Conclusions
URACAM handles communications, memory pressure and registers at the same time
Search for the best overall solution
Figure of Merit: a single criterion to compare partial schedules
Transformations to improve partial schedules
Spill Code to reduce register pressure
Communications through memory to reduce bus pressure
Communications through bus to reduce memory pressure
Undo Spill Code to reduce memory pressure
Spill Code for Clustered VLIW Architecture
Done during the scheduling
Conclusions
URACAM achieves better schedules than previous work on Modulo Scheduling for a Clustered VLIW Architecture
Speedup of 18% for 2 clusters and 22% for 4 clusters
[ For 1 inter-register bus with 1-cycle latency and 32 registers ]
Degradation with respect to a non-clustered architecture
3% for 2 clusters and 10% for 4 clusters
[ For 4 inter-register buses with 1-cycle latency and 32 registers ]
URACAM is an adaptive and powerful technique
Figure of Merit
Transformations