PACT’02
presentation
A Framework for Parallelizing
Load/Stores on Embedded Processors
Xiaotong Zhuang
Santosh Pande
John S. Greenland Jr.
College of Computing, Georgia Tech
Background and Motivation
Speed gap between memory and CPU remains
Multi-bank memory architecture: Motorola DSP56000 series,
NEC 77016, SONY pDSP, Analog Devices ADSP-210x, Starcore
SC140 processor core
Parallel instructions allow parallel access to memory banks:
PLDXY r1, @a, r2, @b loads @a → r1 and @b → r2 at the
same time.
Objective:
Generate as many parallel Load/Store instructions (such as PLDXY)
as possible through compiler optimizations.
Controlled code & data segment growth
Reasonable speed of compilation
General approaches
Model as ILP problem--Rainer Leupers, Daniel Kotte, “Variable
partitioning for dual memory bank DSPs”, ICASSP, May’01
Variables Ni with value 0/1 for each LD/ST instr. to represent its memory
bank assignment (X or Y)
Variables Eij with value 0/1 to represent whether two instructions can be
merged
Enforce the other constraints and maximize the total weight of the selected edges
Model as Graph problem--A.Sudarsanam, S.Malik, “Simultaneous
Reference Allocation in Code Generation for Dual Data Memory
Bank ASIPs”, TODAES, Apr’00
Each Load/Store as a node
An edge between two nodes means they can be merged
Pick the maximum number of disjoint edges
Major contributions
Keep the model simple and easy to solve mathematically
Identify the movable boundary problem, which impedes the
problem modeling and simplification
Propose Motion Schedule Graph (MSG) and two approaches to
solve it heuristically
Merge with instruction duplication and variable duplication
Cross basic block merges
Other improvements like local conflict elimination through
rematerialization and some global optimization issues
An iterative approach, which systematically grows the code
segment and then the data segment minimally.
Basic concepts (1)
Post-pass approach: assuming a good register allocator has been
used--Appel & George’s register allocation algorithm
Alias analysis
Memory access instruction disambiguation
Most aliases can be determined uniquely in our benchmark programs
Memory access instructions
ST[addr],r is the definition of a memory address
LD[addr],r is the use of a memory address
For base-offset Load/Store instructions (normally array accesses), assume arrays
are inseparable; more register conflicts must then be considered.
Dependencies
Address conflicts
Register conflicts
Basic concepts (2)
Building Webs
Webs: maximal unions of du-chains. All variable defs/uses on a web MUST be
allocated to the same memory location
A variable that appears in separate webs can be placed in different memory
locations
Achieve value separation
Motion range determination
Defined as the interval between program points within which a Load/Store can be legally
moved, constrained by dependencies
Load/Store instructions with overlapping ranges MAY be merged
Beware of the movable boundary problem
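As a rough illustration of the two concepts on this slide, the Python sketch below groups du-chains into webs with a union-find structure and tests whether two motion ranges overlap. The function names, the representation of a du-chain as a tuple of reference labels, and the closed-interval model of a motion range are all assumptions made for the example, not details taken from the presentation.

```python
def build_webs(du_chains):
    """Group du-chains into webs: chains that share a def or use are unioned."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for chain in du_chains:
        first = chain[0]
        find(first)  # register single-element chains too
        for ref in chain[1:]:
            parent[find(ref)] = find(first)

    webs = {}
    for ref in list(parent):
        webs.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(web) for web in webs.values())


def ranges_overlap(a, b):
    """Closed intervals (lo, hi) of program points overlap
    when neither one ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical input: two definitions reach use u1, so they share a web.
chains = [("d1", "u1"), ("d2", "u1"), ("d3", "u2")]
print(build_webs(chains))
print(ranges_overlap((3, 8), (5, 9)))
```

In this toy input, d1, d2 and u1 land in one web and would have to share a memory location, while d3 and u2 form a separate web that could be placed in the other bank.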
Movable boundary problem
A:
1 MOV r1, 3
2 MOV r2, 2
3 ST [addr2], r1 (1)
4 ST [addr1], r2 (2)
5
6
7
8 LD [addr1], r1 (3)
9 MOV r1, 0
B:
1 MOV r1, 3
2 MOV r2, 2
3
4
5 LD [addr1], r1 (3)
6 ST [addr2], r1 (1)
7 ST [addr1], r2 (2)
8
9 MOV r1, 0
The motion boundary of one Load/Store instruction can itself be a
Load/Store instruction, so the boundary may move
Assuming fixed boundaries leads to incorrect merges
Motion schedule graph
Pseudo fixed-boundary
For Store: move as early as possible
assuming other instructions are fixed
For Load: move as late as possible
assuming other instructions are fixed
Motion Schedule Graph
Nodes represent individual Load/Store
instructions
Oval encloses Load/Store on the same
web
Edges link nodes that have overlapped
motion range (with respect to pseudo
fixed-boundaries)
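The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each node carries a hypothetical web identifier and a closed motion range already computed against pseudo fixed boundaries, and pairs on the same web are skipped because they must share a memory location and so cannot be accessed in different banks. The name build_msg and the input layout are invented for the example.

```python
from itertools import combinations


def build_msg(nodes):
    """nodes: {name: (web_id, (lo, hi))}.
    Returns the set of node pairs whose motion ranges overlap."""
    edges = set()
    for (a, (web_a, ra)), (b, (web_b, rb)) in combinations(sorted(nodes.items()), 2):
        overlap = ra[0] <= rb[1] and rb[0] <= ra[1]
        # Assumption: nodes on the same web share one location, so no edge.
        if web_a != web_b and overlap:
            edges.add((a, b))
    return edges


# Hypothetical Load/Store nodes with their webs and motion ranges.
nodes = {
    "n1": ("w1", (1, 4)),
    "n2": ("w2", (3, 6)),
    "n3": ("w1", (2, 5)),
    "n4": ("w3", (7, 9)),
}
print(sorted(build_msg(nodes)))
```

Here n1 and n3 overlap but belong to the same web, so the only candidate merges are n1 with n2 and n2 with n3.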
Conflict resolution
[Figure: conflict-resolution examples, each with panels A and B. The first diagram shows address conflicts among definitions (D) and uses (U) of Addr1, Addr2 and Addr3; the remaining diagrams show register conflicts among definitions and uses of r1, r3 and r4, with nodes labelled by memory bank X or Y.]
Example
[Figure: motion ranges of the Load/Store nodes (1), (2), (3), (7), (8), (9)]
Bank assignments:
a→X, b→X, c→X, d→Y, e→Y
A:
(1) ST a, r1
(2) ST c, r3
(3) ST d, r5
(4) Mov r1, 1
(5) Mov r2, 2
(6) Mov r5, 3
(7) LD d, r6
(8) LD b, r3
(9) LD e, r2
B:
(1) ST a, r1 --- (7) LD d, r6
(4) Mov r1, 1
(5) Mov r2, 2
(8) LD b, r3 --- (3) ST d, r5
(6) Mov r5, 3
(2) ST c, r3 --- (9) LD e, r2
C:
(1) ST a, r1 --- (3) ST d, r5
(4) Mov r1, 1
(5) Mov r2, 2
(8) LD b, r3 --- (7) LD d, r6
(6) Mov r5, 3
(2) ST c, r3 --- (9) LD e, r2
D:
(1) ST a, r1 --- (3) ST d, r5
(4) Mov r1, 1
(5) Mov r2, 2
(2) ST c, r3 --- (7) LD d, r6
(6) Mov r5, 3
(8) LD b, r3 --- (9) LD e, r2
Graph solving
The whole problem is provably NP-complete (refer to Appendix A)
Two separate problems: Bank Assignment and Edge Picking
For predetermined bank assignments, the Edge Picking problem
can be optimally solved in polynomial time
Heuristic algorithms
Brute-force search takes O(|V|^3 · 2^n) time, which is feasible only for small programs
Simulated annealing (SA) can approach the optimal solution but greatly increases compilation
time
Use a heuristic to solve bank assignment, then obtain the optimal solution for Edge
Picking
Edge Picking as max flow problem
[Figure: flow network with a source node, a sink node, and Symbol Groups 1, 2, 3, 4, ..., p, q, each containing nodes such as n11-n13, n21-n23, ..., nq1-nq3]
Max_flow = 0;
For each node nx in X-bank do
    If there is an edge from nx to some ny where neither ny nor the edge is marked then
        Pick that edge;
        Max_flow++;
        Mark nx, ny, edge(nx, ny);
    End if
End for
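For readers who want to experiment with Edge Picking once banks are fixed, the sketch below uses the textbook augmenting-path algorithm for maximum bipartite matching, which is equivalent to a unit-capacity max flow between X-bank and Y-bank nodes. It is a stand-in technique, not the authors' exact network: it ignores any extra constraints their flow graph may encode, and the names max_merges, x_nodes and edges are hypothetical.

```python
def max_merges(x_nodes, edges):
    """edges: {x_node: [y_node, ...]} lists the Y-bank nodes each X-bank
    node may merge with. Returns {y_node: x_node} for a maximum matching."""
    match = {}  # y -> x currently paired with it

    def try_assign(x, seen):
        for y in edges.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            # Take y if it is free, or if its current partner can move elsewhere.
            if y not in match or try_assign(match[y], seen):
                match[y] = x
                return True
        return False

    for x in x_nodes:
        try_assign(x, set())
    return match


# Hypothetical MSG edges after bank assignment.
picked = max_merges(["x1", "x2"], {"x1": ["y1", "y2"], "x2": ["y1"]})
print(len(picked), picked)
```

In this example a one-pass greedy choice that pairs x1 with y1 would leave x2 unmatched, whereas the augmenting step reassigns x1 to y2 so that both merges are kept.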
Bank assignment heuristic
Denote E as the set of all symbol groups
Let Set DETX={Sx} where Sx is the symbol group with all the nodes
that must be put in X bank memory.
Let Set DETY={Sy} where Sy is the symbol group with all the nodes
that must be put in Y bank memory.
Define $C_{ij}$ as the number of edges between symbol groups $S_i$ and $S_j$
While ($E \neq DETX \cup DETY$) do
    For each $S_k \in E - (DETX \cup DETY)$
        Calculate $CX_k = \sum_{S_i \in DETX} C_{ki}$ and $CY_k = \sum_{S_i \in DETY} C_{ki}$
    End for
    $CX_{max} = \max\{ CX_k \mid S_k \in E - (DETX \cup DETY) \}$,
    $CY_{max} = \max\{ CY_k \mid S_k \in E - (DETX \cup DETY) \}$,
    If ($CX_{max} > CY_{max}$)
        Add the symbol group with $CX_{max}$ to DETY
    Else
        Add the symbol group with $CY_{max}$ to DETX
    Endif
End while
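The loop above translates fairly directly into Python. The sketch below is one possible reading, assuming that CX counts a group's edges to groups already fixed in X and CY its edges to groups already fixed in Y; ties are broken by list order, which the slide does not specify. The function name assign_banks and the frozenset-keyed edge_count dictionary are hypothetical.

```python
def assign_banks(groups, det_x, det_y, edge_count):
    """Greedy bank assignment over symbol groups.
    edge_count maps frozenset({g1, g2}) -> number of MSG edges between them."""
    det_x, det_y = set(det_x), set(det_y)

    def weight(k, decided):
        return sum(edge_count.get(frozenset((k, s)), 0) for s in decided)

    undecided = [g for g in groups if g not in det_x | det_y]
    while undecided:
        best_x = max(undecided, key=lambda k: weight(k, det_x))
        best_y = max(undecided, key=lambda k: weight(k, det_y))
        if weight(best_x, det_x) > weight(best_y, det_y):
            det_y.add(best_x)  # many edges to X groups, so place it in Y
            undecided.remove(best_x)
        else:
            det_x.add(best_y)  # many edges to Y groups, so place it in X
            undecided.remove(best_y)
    return det_x, det_y


# Hypothetical symbol groups: "a" is pinned to X and "b" to Y.
counts = {
    frozenset(("a", "c")): 3,
    frozenset(("b", "d")): 2,
    frozenset(("c", "d")): 1,
}
x_bank, y_bank = assign_banks(["a", "b", "c", "d"], {"a"}, {"b"}, counts)
print(sorted(x_bank), sorted(y_bank))
```

With these counts, group c goes to Y because of its three edges to a, and d then goes to X, so every edge in the example crosses banks and remains a merge candidate.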
Post-pass phases
Determine the range of motion
Merge without replication → point A
LD/ST replication → conflict elimination → merge → point B
Variable duplication → conflict elimination → merge → point C
Cross BB merge (Instr. duplication)
Move a Load/Store to a predecessor/successor block to create new merge opportunities
To guarantee profitability:
Move only to where the reference is live
Move ST along the EBB (extended basic block)
Move LD along the reverse EBB
Make sure the instruction can be combined if pushed into at least one of the live
predecessors/successors
[Figure: three basic blocks (Basic Block 1, 2, 3). Instructions shown: ST main.a, r6; Mov r3, r4; Mult r2, r3, r5; Add r1, r2, r3; Mov r1, r3; Mult r4, r2, r3; LD [addr0], r7]
Variable duplication
(a) Before variable duplication
Basic Block 1
LD main.b, r3
ST main.a, r2
Basic Block 2
ST main.a, r3
Basic Block 3
LD main.a, r5
ST main.c, r1
Basic Block 4
LD main.a, r4
ST main.d, r2
(b) After variable duplication
Basic Block 1
LD main.b, r3 --- ST main.ay, r2
ST main.ax, r2
Basic Block 2
ST main.ax, r3
ST main.ay, r3
Basic Block 3
LD main.ay, r5 --- ST main.c, r1
Basic Block 4
LD main.ay, r4 --- ST main.d, r2
Local conflict elimination
Before:
Add r3, r1, r4
ST main.a, r2
Mov r2, r3
Mov r3, r5
LD main.b, r5
After:
Mov r3, r5
LD main.b, r5 --- ST main.a, r2
Add r2, r1, r4
Motivation
The register allocator may assign the same register to neighboring ranges, which
leads to register conflicts
ISA restrictions may require particular registers that are not available at that program
point
Use rematerialization: free a register so the merge can proceed, then reconstruct
its value after the merge.
Merge type and MSG properties
Classification of generated parallel load/stores
Heading labels: Merge type (Without dup, With dup, LDLD, LDST, STST); Failed merge (ISA, Redup solved)
Biquad_N_sections | 4 | 1 | 2 | 3 | 0 | 1 | 2
Complex_update | 6 | 2 | 3 | 5 | 3 | 3 | 2
n_complex_updates | 9 | 3 | 4 | 8 | 2 | 3 | 3
n_real_updates | 0 | 1 | 0 | 1 | 0 | 2 | 0
GSM Untoast | 73 | 14 | 32 | 55 | 4 | 10 | 9
g721 decoder | 17 | 4 | 6 | 15 | 4 | 4 | 4
Rawcaudio (adpcm) | 5 | 2 | 2 | 5 | 1 | 1 | 1
Rawdaudio (adpcm) | 7 | 2 | 4 | 5 | 3 | 5 | 3
SPEC2000-Bzip2 | 177 | 43 | 78 | 142 | 10 | 13 | 12
SPEC2000-VPR | 394 | 103 | 154 | 343 | 26 | 30 | 45
MSG properties
Benchmark | Grp # | Node # | Edge # | Avg. size of web
Biquad_N_sections | 8 | 26 | 12 | 3.25
Complex_update | 11 | 26 | 27 | 2.36
n_complex_updates | 21 | 67 | 49 | 3.19
N_real_updates | 7 | 18 | 7 | 2.57
GSM Untoast | 129 | 401 | 486 | 3.11
g721 decoder | 32 | 191 | 130 | 5.97
Rawcaudio(adpcm) | 10 | 21 | 25 | 2.10
Rawdaudio(adpcm) | 9 | 26 | 31 | 2.89
SPEC2000-Bzip2 | 395 | 1432 | 1109 | 3.63
SPEC2000-VPR | 937 | 2927 | 3104 | 3.12
Compilation time
Comparison of compilation time (seconds)
Benchmark | Original | Heuristic (at A) | SA | LD/ST replication (at B) | Var replication (at C) | %Speedup (after LD/ST replication/SA) | %Speedup (after var replication/SA)
Biquad_N_sections | 7.75 | 8.92 | 291.75 | 11.23 | 17.85 | 2497.47 | 1534.45
Complex_update | 4.99 | 6.1 | 1125.99 | 6.85 | 10.75 | 16354.67 | 10368.68
n_complex_updates | 10.1 | 12.3 | 511.09 | 14.73 | 21.60 | 3370.48 | 2265.21
N_real_updates | 6.34 | 6.91 | 295.34 | 7.61 | 12.49 | 3781.31 | 2264.92
GSM Untoast | 391 | 430 | 8717.78 | 556.03 | 902.50 | 1467.85 | 865.96
g721 decoder | 250 | 295 | 4061.06 | 412.85 | 614.94 | 883.66 | 560.40
Rawcaudio(adpcm) | 15.8 | 18.3 | 674.75 | 21.42 | 35.36 | 3049.80 | 1808.28
Rawdaudio(adpcm) | 11 | 14.6 | 1984.01 | 20.04 | 31.72 | 9800.27 | 6155.70
SPEC2000-Bzip2 | 1378.82 | 1745.4 | 32765.97 | 2234.98 | 3331.24 | 1366.05 | 883.60
SPEC2000-VPR | 2237.76 | 2690.81 | 57197.27 | 3157.40 | 4386.57 | 1711.53 | 1203.92
Average | | | | | | 4428.31 | 2791.11
Runtime performance
Comparison of execution time (10^4 cycles)
Benchmark | Original | Heuristic (at A) | SA | LD/ST replication (at B) | Var replication (at C) | %Speedup (after var replication/Original) | %Speedup (after var replication/SA)
Biquad_N_sections | 94 | 87.5 | 86.64 | 85.08 | 83.9 | 13.18 | 3.26
Complex_update | 1.6 | 1.56 | 1.56 | 1.53 | 1.53 | 5.64 | 2.24
n_complex_updates | 312 | 300.3 | 295.93 | 299.4 | 298.2 | 4.59 | -0.77
N_real_updates | 123.3 | 116.1 | 114.74 | 114.2 | 113.3 | 8.84 | 1.27
GSM Untoast | 7652 | 5478 | 5149.9 | 5393 | 5346 | 43.12 | -3.68
g721 decoder | 5201 | 4138 | 3718.8 | 4137 | 4084 | 27.33 | -8.96
Rawcaudio(adpcm) | 10490 | 9052 | 9071.8 | 8937 | 8802 | 19.18 | 3.06
Rawdaudio(adpcm) | 4677 | 4186 | 3961 | 4150 | 4114 | 13.69 | -3.72
SPEC2000-Bzip2 | 20945 | 17893 | 16875 | 17578 | 17423 | 20.21 | -3.15
Average | | | | | | 17.31 | -1.16
Code size comparison
Comparison of code size (# of instructions)
Benchmark | Original | Heuristic (at A) | SA | LD/ST replication (at B) | Var replication (at C) | %Code growth (after LDST replication/Original) | %Code growth (after var replication/Original)
Biquad_N_sections | 157 | 151 | 147 | 167 | 180 | 6.37 | 14.65
Complex_update | 118 | 106 | 102 | 135 | 148 | 14.41 | 25.42
n_complex_updates | 295 | 273 | 271 | 321 | 347 | 8.81 | 17.63
N_real_updates | 181 | 179 | 179 | 197 | 209 | 8.84 | 15.47
GSM Untoast | 3992 | 3843 | 3818 | 4053 | 4284 | 1.53 | 7.31
g721 decoder | 1847 | 1813 | 1805 | 1909 | 2074 | 3.36 | 12.29
Rawcaudio(adpcm) | 204 | 193 | 190 | 218 | 234 | 6.86 | 14.71
Rawdaudio(adpcm) | 190 | 174 | 172 | 209 | 221 | 10.00 | 16.32
SPEC2000-Bzip2 | 12256 | 12102 | 12073 | 12452 | 13150 | 1.60 | 7.29
SPEC2000-VPR | 46954 | 46280 | 46194 | 48338 | 52214 | 2.95 | 11.20
Average | | | | | | 6.47 | 14.23
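The code growth percentages in this table appear consistent with the usual relative-growth formula, (new size - original size) / original size × 100. The short Python check below reproduces the Biquad_N_sections row under that assumption; the helper name percent_growth is made up for the example.

```python
def percent_growth(original, new):
    """Relative growth in percent, measured against the original size."""
    return (new - original) / original * 100.0


# Biquad_N_sections: 157 instructions originally,
# 167 after LD/ST replication and 180 after variable replication.
print(round(percent_growth(157, 167), 2))  # 6.37
print(round(percent_growth(157, 180), 2))  # 14.65
```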
Conclusion
A framework to analyze and merge LD/STs.
Our heuristic approach comes close to exhaustive search
with less compilation time.
Variable and instruction replication extend the range of motion
of the instructions, so the generated code quality is superior
to that of the exhaustive methods previously proposed.