Instruction Scheduling with k-Successor Tree for Clustered VLIW Processors

Xuemeng Zhang 1,2, Hui Wu 1, and Jingling Xue 1

1 School of Computer Science and Engineering, The University of New South Wales, Australia
2 School of Computer, National University of Defense Technology, China

Email: {xuemengz, huiw, jingling}@cse.unsw.edu.au
Abstract. Clustering is a well-known technique for improving the scalability of classical VLIW (Very Long Instruction Word) processors. A clustered VLIW processor consists of multiple clusters, each with a local register file and a set of functional units. This paper proposes a novel phase-coupled, priority-based heuristic for scheduling a set of operations in a basic block on a clustered VLIW processor. Our heuristic converts the instruction scheduling problem into the problem of scheduling a set of operations with a common deadline. The priority of each operation vi is its lmax(vi)-successor-tree-consistent deadline, an upper bound on the latest completion time of vi in any feasible schedule for a relaxed problem in which only the precedence-latency constraints between vi and all its successors are considered. We have simulated our heuristic and the Integrated heuristic on 808 basic blocks taken from the MediaBench II benchmark suite using three processor models. On average, our heuristic improves over the Integrated heuristic by 13%, 18% and 16%, respectively, for the three processor models.
Keywords: Clustered VLIW Processor, Inter-operation Latency, Inter-cluster Communication Latency, Instruction Scheduling
1 Introduction
A VLIW (Very Long Instruction Word) processor consists of multiple functional
units. All the functional units share a single register file. At each processor cycle,
multiple operations are issued and executed in parallel on different functional
units. All the operations are scheduled at compile time. Hence, an optimising compiler plays a key role in exploiting ILP (Instruction Level Parallelism) [1]. The objective of an optimising compiler is to schedule the operations of a program such that the execution time of the program is minimised. As a result,
the compiler needs to schedule as many operations as possible at each cycle.
Nevertheless, a large amount of ILP puts immense pressure on the processor’s
registers and bypass network.
A major problem with the classical VLIW processors is that a single register
file hampers the scalability of the processor. Clustering is an efficient technique
for overcoming this problem. In a clustered VLIW processor, each cluster has its own functional units and a local register file with fewer registers and ports.
Clusters are connected by an inter-cluster communication network [2].
Instruction scheduling for clustered VLIW processors is more challenging, as there are additional inter-cluster communication constraints. If an operation vi executed on a cluster needs a result produced by another operation executed on a different cluster, the compiler needs to take the communication latency between these two clusters into account when scheduling vi.
In this paper, we propose a novel phase-coupled, priority-based heuristic for scheduling a set of operations in a basic block on a clustered VLIW processor. Our heuristic converts the instruction scheduling problem into the problem of scheduling a set of operations with a common deadline. The priority of each operation vi is the upper bound on the latest completion time of vi in any feasible schedule for a relaxed problem where only the precedence-latency constraints between vi and all its successors are considered. Given a basic block, the time complexity of our heuristic is O(n³), where n is the number of operations in the basic block. We simulated our heuristic and the Integrated heuristic [3] on the 808 basic blocks taken from the MediaBench II benchmark suite using three processor models. On average, our heuristic improves over the Integrated heuristic by 13%, 18% and 16%, respectively, for the three processor models.
This paper is organised as follows. Section 2 describes the related work. Section 3 gives our processor model and definitions. Section 4 proposes our AEAP (As Early As Possible) and ALAP (As Late As Possible) scheduling algorithms based on maximum bipartite matching. Section 5 describes our algorithm for computing the lmax(vi)-successor-tree-consistent deadline of each operation vi. Section 6 presents our phase-coupled scheduling heuristic. Section 7 shows our simulation results. Section 8 concludes this paper.
2 Related Work
The problem of scheduling a set of operations in a basic block on a clustered VLIW processor so as to minimise the execution time of the basic block is NP-complete, even if the processor has only one cluster [4]. A number of heuristics have been proposed. All the previous heuristics fall into two main categories: phase decoupled approaches and phase coupled approaches. The phase decoupled approaches [5-7] partition operations into clusters before scheduling them. They suffer from the phase ordering problem, as partitioning is performed without knowing the resource conflicts and the final schedule. As a result, they may introduce unnecessary inter-cluster communications and impose additional constraints on the subsequent scheduling. The phase coupled approaches [3, 8-12] combine cluster assignment and scheduling into a single phase, avoiding the phase ordering problem.
The Bulldog compiler [5] is the first work on instruction scheduling for clustered VLIW processors. It uses a two-phase approach that separates cluster assignment from scheduling. The first phase assigns each operation to a cluster using the BUG (Bottom-Up Greedy) algorithm, and the second phase uses list scheduling [13] to construct a schedule that respects the cluster assignments. BUG uses depth-first search and a latency-weighted depth priority scheme. It tries to place the operations on the critical path on the same cluster so that the inter-cluster data transfer latency is minimised. Unfortunately, this approach suffers from the phase ordering problem, as BUG does not know the utilisation of the functional units and the inter-cluster data paths. As a result, it may not fully utilise the machine resources to minimise the execution time of a basic block.
Lapinskii et al. [7] propose an effective binding algorithm for clustered VLIW processors. The algorithm uses a three-component ranking function to determine the order in which each operation is considered for binding:
1. The ALAP (As Late As Possible) scheduling time of the operation.
2. The mobility of the operation, which is its ALAP scheduling time minus its AEAP (As Early As Possible) scheduling time.
3. The number of successors of the operation.
The algorithm computes the cost of binding an operation to a cluster by a cost function that considers the data transfer penalty and the resource and bus serialisation penalties. The problem with this algorithm is that the assignment of operations to clusters prior to the scheduling phase cannot make use of exact information about the load on resources.
The UAS (Unified Assign and Schedule) heuristic [8] combines cluster assignment and scheduling into a single unified phase. The outer loop of the heuristic uses a list scheduler [13] to fill the schedule with operations cycle by cycle, and the inner loop considers each possible cluster assignment for each operation. Each cluster is checked in priority order to see if it has a free functional unit for the operation. Five different priority functions are investigated:
1. None: The cluster list is not ordered.
2. Random Ordering: The cluster list is ordered randomly.
3. MWP (Magnitude-Weighted Predecessor): An operation is placed in the cluster where the majority of its input operands reside.
4. CWP (Completion-Weighted Predecessor): Priority is given to the cluster that produces a source operand later than the other source operands.
5. CPSC (Critical Path in a Single Cluster) using the CWP heuristic: The CWP heuristic is used, but operations on the critical path are forced onto the same cluster.
Nagpal et al. [3] propose an effective heuristic that integrates temporal and spatial scheduling. The heuristic utilises exact knowledge of the available communication slots, functional units, and load on the different clusters, as well as future resource and communication requirements known only at schedule time. The heuristic extends the standard list scheduling algorithm and uses a three-component ordering function to order the ready operations:
1. The mobility of the operation.
2. The number of different functional units capable of executing the operation.
3. The number of successor operations.
Once an operation is selected for scheduling, a cluster assignment decision is made as follows:
1. The chosen cluster should have at least one free resource of the type needed to perform the operation.
2. A communication model is used to select a cluster based on the number of snoops in the current cycle, the number of future communications required, and the number of move operations in earlier cycles.
A major problem with all the previous priority-based approaches is that they
do not consider the processor resource constraints when computing the priority
of each operation. As a result, the priorities may be biased and may not reflect
the relative importance of each operation. Our heuristic is the first one that takes
into account the processor resource constraints when computing the priority of
each operation.
There are a number of papers that study the problem of effectively scheduling loops on clustered VLIW processors by using software pipelining and modulo
scheduling techniques [11,12,14–16]. The primary objective of software pipelining
and modulo scheduling techniques is to efficiently overlap the different iterations
of an innermost loop.
3 Processor Model and Definitions
A target clustered VLIW processor has m clusters, C1, C2, ..., Cm. Each cluster has a set of functional units, and each functional unit executes a subset of the operations. Each operation can be executed on one or more functional units. The mapping between the functional units and the operations can be represented by a bipartite graph in which each node denotes a functional unit or an operation, and each edge denotes that the operation can be executed on the functional unit. The target processor is fully pipelined, i.e., the execution of each operation takes only one processor cycle. We assume that all the clusters are fully connected, i.e., the inter-cluster communication latency is a constant c between any two clusters; our scheduling heuristic can be extended to handle an arbitrary communication network. If operations vi and vj are executed on different clusters and vj needs the result of vi, a move operation, executed on vi's cluster, is required to transfer the result from vi's cluster to vj's cluster. The move operation, denoted by MVD, is also pipelined.
A basic block is represented by its precedence-latency graph, a weighted DAG (Directed Acyclic Graph) G = (V, E, W), where V = {v1, v2, ..., vn : vi is an operation}, E = {(vi, vj) : vi, vj ∈ V and vj is directly dependent on vi}, and W = {lij : (vi, vj) ∈ E and lij is the inter-operation latency between vi and vj}. An inter-operation latency may exist between two operations with a precedence constraint due to the pipelined architecture and the off-chip memory latency. Fig. 1a shows the precedence-latency graph of a basic block with 12 operations, where the operations and inter-operation latencies are taken from the TMS320C64X processor [17], a clustered VLIW processor with two clusters. The LDW operations load data from memory into registers; the ADD and SSUB operations perform addition and subtraction.
Fig. 1: An example of the precedence-latency graph and the k-successor tree. (a) The precedence-latency graph; (b) the 4-successor tree of v3.
The instruction scheduling problem studied in this paper is described as follows. Given a basic block represented by a weighted DAG G = (V, E, W), find a valid schedule σ with the minimum length satisfying the following constraints:
1. Resource constraints: (a) Each operation can only be executed on one of the designated functional units. (b) At each time, only one operation can be executed on each functional unit.
2. Precedence-latency constraints: For each (vi, vj) ∈ E, if vi and vj are on different clusters, σ(vi) + 2 + c + lij ≤ σ(vj) holds; otherwise, σ(vi) + 1 + lij ≤ σ(vj) holds.
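The two constraint forms can be checked mechanically. The following Python sketch validates a candidate schedule against the precedence-latency constraints; the names (sigma, cluster_of, edges) are illustrative rather than taken from the paper, and the resource constraints are assumed to be checked separately.

# Sketch: verify the precedence-latency constraints of a candidate schedule.
# sigma[v] is the start cycle of operation v, cluster_of[v] its cluster,
# edges is a list of (vi, vj, l_ij) triples, and c is the inter-cluster
# communication latency. Resource constraints are not checked here.
def satisfies_precedence_latency(sigma, cluster_of, edges, c):
    for vi, vj, l_ij in edges:
        if cluster_of[vi] != cluster_of[vj]:
            # one cycle for vi, one for the MVD move, plus c transfer cycles
            if sigma[vi] + 2 + c + l_ij > sigma[vj]:
                return False
        elif sigma[vi] + 1 + l_ij > sigma[vj]:
            return False
    return True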
Given two operations vi and vj, if there is a directed path from vi to vj, vi is a predecessor of vj, and vj is a successor of vi. In particular, if (vi, vj) ∈ E, vi is an immediate predecessor of vj, and vj is an immediate successor of vi. If vi has no immediate predecessor, it is a source operation. If vi has no immediate successor, it is a sink operation. vi is a sibling of vj if vi and vj share an immediate successor. Throughout this paper, the set of all the successors of an operation vi is denoted by Succ(vi), and the number of elements in a set U is denoted by |U|.
Definition 1. Given a DAG G = (V, E) and an operation vi ∈ V, the successor tree of vi is a directed tree T(G, vi) = (V′, E′), where V′ = {vi} ∪ Succ(vi) and E′ = {(vi, vj) : vj ∈ Succ(vi)}.
In a weighted DAG G, if there is a directed path Pij from vi to vj, the weighted path length of Pij is the sum of its constituent edge weights plus the number of operations in Pij, excluding the two end operations vi and vj. The maximum weighted path length from vi to vj, denoted by l+ij, is the maximum weighted path length over all the paths from vi to vj.
Definition 2. Given a weighted DAG G = (V, E, W), an operation vi ∈ V and an integer k, the k-successor tree of vi is a weighted, directed tree ST(G, k, vi) = (V′, E′, W′), where V′ = {vi} ∪ Succ(vi), E′ = {(vi, vj) : vj ∈ Succ(vi)}, and W′ = {l′ij : l′ij = l+ij if l+ij < k; otherwise, l′ij = k}.
Fig. 1b shows the 4-successor tree of v3 in Fig. 1a.
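Definition 2 can be implemented directly with one longest-path pass over the DAG in topological order. The sketch below (in Python; the adjacency-list layout is an assumption, not the authors' data structure) computes l+(vi, vj) for every successor vj and caps each tree weight at k:

# Sketch: compute the edge weights of the k-successor tree ST(G, k, vi).
# succ[u] is a list of (w, l_uw) pairs and topo is a topological order of G.
def k_successor_tree_weights(succ, topo, vi, k):
    NEG = float('-inf')
    dist = {u: NEG for u in topo}   # dist[u] will hold l+(vi, u)
    dist[vi] = 0
    for u in topo:                  # relax edges in topological order
        if dist[u] == NEG:
            continue                # u is not reachable from vi
        for w, l_uw in succ[u]:
            # an intermediate operation (u itself, unless u is vi)
            # contributes one unit to the weighted path length
            dist[w] = max(dist[w], dist[u] + l_uw + (0 if u == vi else 1))
    # weight of tree edge (vi, vj): l+(vi, vj) if it is below k, else k
    return {w: min(d, k) for w, d in dist.items() if w != vi and d > NEG}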
Given a problem instance P, consisting of a set of operations in a basic block represented by a weighted DAG and a clustered VLIW processor M, we define a new problem instance P(d) as follows: the same set of operations as in P, with a common deadline d, on the same VLIW processor M. Conceptually, d can be any number. In order not to produce a negative start time for any operation, we select d to be n ∗ c ∗ lmax, where n is the number of operations, c is the inter-cluster communication latency, and lmax is the maximum inter-operation latency. A valid schedule for the problem instance P(d) is a feasible schedule if the completion time of each operation is not greater than the common deadline d.
Let lmax(vi) denote the maximum latency between vi and its immediate successors. Next, we define the lmax(vi)-successor-tree-consistent deadline of an operation vi.
Definition 3. Given a problem instance P(d), the lmax(vi)-successor-tree-consistent deadline of an operation vi, denoted by d′i, is recursively defined as follows. If vi is a sink operation, d′i is equal to the common deadline d; otherwise, d′i is the upper bound on the latest completion time of vi in any feasible schedule for the relaxed problem instance P(vi): a set V′ = {vi} ∪ Succ(vi) of operations with the precedence and latency constraints in the form of the lmax(vi)-successor tree ST(G, lmax(vi), vi), the deadline constraints D′ = {d″j : d″j, the deadline of vj in P(vi), is d′j if j ≠ i; otherwise (j = i), d″j is the common deadline d}, and the same clustered VLIW processor M. Formally, d′i = max{σ(vi) : σ is a feasible schedule for P(vi)} + 1.
In our scheduling heuristic, the priority of each operation vi is its lmax (vi )successor-tree-consistent deadline which considers the relaxed precedence-latency
constraints and the resource constraints. As a result, it represents the relative importance of vi more accurately than the deadline derived only from the
precedence-latency constraints.
Given a graph G = (V, E), a matching M of G is a set of pairwise non-adjacent edges.
A maximum matching is a matching that contains the largest possible number
of edges. The matching number of a graph is the size of a maximum matching.
4 AEAP Scheduling and ALAP Scheduling
Our instruction scheduling heuristic needs to schedule an operation as early as possible or as late as possible. Both the AEAP scheduling problem and the ALAP scheduling problem reduce to the following basic scheduling problem: given a cluster Ck, a particular cycle t, a set Sk of independent operations already scheduled on Ck at cycle t, and a new operation vi, find a schedule for all the operations in Sk ∪ {vi} at cycle t whenever such a schedule exists. This problem can be converted into the maximum bipartite matching problem as follows. Construct a bipartite graph Gk(Sk ∪ {vi}) = (Uk ∪ Vk, Ek), where Uk = {ui : ui is a functional unit of Ck}, Vk = {vj : vj is an operation in Sk ∪ {vi}}, and Ek = {(ui, vj) : the operation vj can be executed on the functional unit ui}. It is easy to prove that there is a valid schedule for Sk ∪ {vi} if and only if the matching number of Gk(Sk ∪ {vi}) is equal to |Sk ∪ {vi}|.
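A minimal sketch of this reduction in Python is given below. It uses Kuhn's augmenting-path algorithm for maximum bipartite matching, which suffices here because the graphs are tiny; the dict-based encoding of operations is illustrative, not the authors' implementation.

# Sketch: feasibility test for scheduling S_k together with v_i at one cycle.
# Each operation is a dict with an 'ok_units' set of admissible units.
def max_matching(ops, units):
    """Size of a maximum matching between operations and functional units."""
    match_of_unit = {}                       # unit -> index of matched op

    def augment(i, visited):
        # Kuhn's algorithm: try to match op i along an augmenting path.
        for u in ops[i]['ok_units'] & units:
            if u not in visited:
                visited.add(u)
                if u not in match_of_unit or augment(match_of_unit[u], visited):
                    match_of_unit[u] = i
                    return True
        return False

    return sum(augment(i, set()) for i in range(len(ops)))

def can_schedule_at(ops_at_t, new_op, cluster_units):
    """A valid schedule exists iff the matching number equals the op count."""
    ops = ops_at_t + [new_op]
    return max_matching(ops, cluster_units) == len(ops)

# ADD/SUB can run on L, S or D; MPY only on M; LDW only on D.
units = {'L', 'S', 'M', 'D'}
on_cluster = [{'name': 'ADD', 'ok_units': {'L', 'S', 'D'}},
              {'name': 'MPY', 'ok_units': {'M'}},
              {'name': 'SUB', 'ok_units': {'L', 'S', 'D'}}]
ldw = {'name': 'LDW', 'ok_units': {'D'}}
print(can_schedule_at(on_cluster, ldw, units))   # True: SUB can move to L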
The maximum matching problem for a general graph can be solved in O(√n ∗ e) time [18], where n and e are the number of nodes and the number of edges of the graph, respectively. In our maximum matching problem, the number of nodes in the bipartite graph Gk is at most 2mk, where mk is the number of functional units of Ck. Since mk is quite small, we can consider it as a constant. Therefore, the time complexity of finding a maximum matching for the bipartite graph Gk is O(1). As a result, the time complexity of testing if an operation can be scheduled on a cluster Ck at cycle t is O(1).
Consider the following example. Assume that the target clustered VLIW processor is the TMS320C64X processor [17]. The ADD operation and the SUB operation can be executed on the L unit, the S unit, or the D unit. The MPY operation can only be executed on the M unit, and the LDW operation can only be executed on the D unit. We need to schedule a ready LDW operation on the cluster C1 at cycle t, where the ADD, MPY and SUB operations have already been scheduled, as shown in Table 1. Since the LDW operation cannot be executed on the L unit, the only free unit, we resort to the maximum bipartite matching. We construct the bipartite graph G1(S1 ∪ {vi}) as shown in Fig. 2a, where S1 consists of the ADD, MPY and SUB operations, and vi is the ready LDW operation. A maximum matching of G1(S1 ∪ {vi}) is shown in Fig. 2b. A schedule of the operations at cycle t based on the maximum matching is shown in Table 2.
Table 1: A schedule for cycle t

  Time | L   | S   | M   | D
  t    |     | ADD | MPY | SUB

Table 2: A schedule with LDW for cycle t

  Time | L   | S   | M   | D
  t    | SUB | ADD | MPY | LDW
Next, we show how to use the maximum bipartite matching to solve the
AEAP scheduling problem and the ALAP scheduling problem.
The AEAP scheduling problem is formulated as follows. Given a schedule for a set of operations with individual deadlines on the clustered VLIW processor M, and a new operation vi that is ready at cycle ri, schedule vi on a cluster as early as possible. The algorithm for the AEAP scheduling problem is shown in Alg. 1.

Fig. 2: An example of the maximum bipartite matching. (a) A bipartite graph; (b) a maximum matching.
Algorithm 1 AEAP-scheduler(σ, vi)
Require:
  A schedule σ for a set of operations on the clustered VLIW processor M
  A new operation vi that is ready at cycle ri and has a deadline di
Ensure:
  The earliest start time Fvi of vi
1: for t = ri, ri + 1, ..., di − 1 do
2:   for each cluster Cj (j = 1, 2, ..., m) of M do
3:     Let S be the set of vi and all the operations scheduled on Cj at t
4:     Construct the bipartite graph Gj(S)
5:     Find a maximum matching for Gj(S)
6:     if the matching number of Gj(S) is equal to |S| then
7:       Schedule each operation in S at t based on the maximum matching
8:       Fvi = t
9:       return Fvi
10:    end if
11:  end for
12: end for
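For illustration, Alg. 1 can be rendered in Python on top of the can_schedule_at test sketched earlier in this section; the schedule layout (a dict keyed by cluster and cycle) is an assumption made for this sketch.

# Sketch of Alg. 1: scan cycles upward from the ready time r_i.
# schedule[(j, t)] lists the operations on cluster j at cycle t;
# clusters[j] is the set of functional units of cluster j.
def aeap_scheduler(schedule, clusters, vi, r_i, d_i):
    for t in range(r_i, d_i):
        for j, units in enumerate(clusters):
            if can_schedule_at(schedule.get((j, t), []), vi, units):
                schedule.setdefault((j, t), []).append(vi)
                return t                 # earliest start time F_vi
    return None                          # unreachable with d = n * c * lmax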
Next, we analyse the time complexities of the AEAP scheduling algorithm and the ALAP scheduling algorithm. The outer for loops of these two algorithms test a set of successive cycles to see if vi can be scheduled at one of them. Recall that the common deadline d of all the operations is set to n ∗ c ∗ lmax in Section 3. This large common deadline guarantees that each operation vi can be scheduled in the time window [ri, di), where ri and di are the ready time and the deadline of vi, respectively. Therefore, the two algorithms are guaranteed to schedule vi at a particular cycle before they return. Assume that the number of operations scheduled before vi is h. Clearly, both the AEAP scheduling algorithm and the ALAP scheduling algorithm need to test at most h cycles before they find the start time of vi. As discussed before, each test takes O(1) time. As a result, the time complexities of both the AEAP scheduling algorithm and the ALAP scheduling algorithm are O(h).
Algorithm 2 ALAP-scheduler(σ, vi)
Require:
  A schedule σ for a set of operations on the clustered VLIW processor M
  A new operation vi with a deadline di
Ensure:
  The latest start time Fvi of vi
1: for t = di − 1, di − 2, ..., 0 do
2:   for each cluster Cj (j = 1, 2, ..., m) of M do
3:     Let S be the set of vi and all the operations scheduled on Cj at t
4:     Construct the bipartite graph Gj(S)
5:     Find a maximum matching for Gj(S)
6:     if the matching number of Gj(S) is equal to |S| then
7:       Schedule each operation in S at t based on the maximum matching
8:       Fvi = t
9:       return Fvi
10:    end if
11:  end for
12: end for
The ALAP scheduling problem is described as follows. Given a schedule for a set of operations with individual deadlines on the clustered VLIW processor M, and a new operation vi with a deadline di, schedule vi on a cluster as late as possible. The algorithm for the ALAP scheduling problem is shown in Alg. 2.
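Alg. 2 is the mirror image of Alg. 1; in the same illustrative Python setting it only reverses the direction of the cycle scan:

# Sketch of Alg. 2: scan cycles downward from d_i - 1.
def alap_scheduler(schedule, clusters, vi, d_i):
    for t in range(d_i - 1, -1, -1):
        for j, units in enumerate(clusters):
            if can_schedule_at(schedule.get((j, t), []), vi, units):
                schedule.setdefault((j, t), []).append(vi)
                return t                 # latest start time F_vi
    return None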
5 Computing the lmax(vi)-Successor-Tree-Consistent Deadline
In our scheduling heuristic, the priority of each operation vi is its lmax(vi)-successor-tree-consistent deadline defined in Section 3. For a sink operation vi, the lmax(vi)-successor-tree-consistent deadline is equal to the common deadline d. Next, we show how to compute the lmax(vi)-successor-tree-consistent deadline of each operation vi.
The lmax(vi)-successor-tree-consistent deadlines are computed in the reverse topological order of all the operations. There are three major steps in computing the lmax(vi)-successor-tree-consistent deadline of a particular operation vi. Firstly, schedule all the successors of vi in the lmax(vi)-successor tree as late as possible by using the maximum bipartite matching described in Section 4; if two operations have the same deadline, schedule the operation with the larger latency first. Secondly, find a successor of vi with the largest impact on vi's latest start time, ignoring the inter-cluster communication latency, and tentatively schedule vi on the same cluster as this successor in order to make vi's start time as late as possible; then compute the tentative latest start time of vi by considering the communication latency between vi and all the successors that are not on the same cluster as vi. Thirdly, reschedule the successors of vi that are not on the same cluster as vi onto the same cluster as vi, in order to make vi's start time later than its tentative start time.

Given a successor w of vi, let L(w, vi) denote the latency between vi and w in the lmax(vi)-successor tree. Our algorithm for computing the lmax(vi)-successor-tree-consistent deadline of each operation vi is shown in Alg. 3; its overall structure is sketched below.
In order to maximise the start time of vi, our algorithm tests whether an operation in S′ can be moved to the cluster C′. At most the first c ∗ p operations in S′ need to be tested, for the following reason. Consider the tentative latest start time smax of vi computed in Step 2. It is easy to see that the latest start time of vi is at most smax + c, so only successors scheduled on C′ within this window of c cycles can matter; since each cluster has at most p functional units, at most c ∗ p operations can start in that window. Therefore, at most the first c ∗ p operations in S′ need to be tested.
Fig. 3: The mapping between the functional units and the operations
Example 1: Consider the basic block shown in Fig. 1a. Assume that the target clustered VLIW processor is the TMS320C64X processor [17]. The processor has two identical clusters. Cluster A has the register file A and four functional units L1, S1, D1 and M1. Cluster B has the register file B and four functional units L2, S2, D2 and M2. The ADD operation can be executed on the L unit, the S unit, or the D unit. The SSUB operation can only be executed on the L unit. The LDW operation can only be executed on the D unit. The MVD operation can only be executed on the M unit. The mapping between the functional units and the operations is shown in Fig. 3. For ease of description, we assume that only the MVD operation is used to transfer data between two clusters. The latency of the MVD operation, i.e., the inter-cluster communication latency c, is 3 cycles. As a result, the common deadline d of all the operations is 12 ∗ 3 ∗ 4 = 144. Assume that our algorithm has computed d′i (i = 6, 7, ..., 12) with d′6 = d′7 = d′8 = 142, d′9 = d′10 = 143, and d′11 = d′12 = 144. Next, we show how our algorithm computes the lmax(v3)-successor-tree-consistent deadline d′3. Recall that d′3 is the upper bound on the latest completion time of v3 in any feasible schedule for the relaxed problem instance P(v3), where the 4-successor tree of v3 is shown in Fig. 1b.
Algorithm 3 ALAP deadline(G, v, d, M)
Require:
  The precedence-latency graph G of a basic block
  An array v of all the operations in the basic block
  A common deadline d for all the operations
  A clustered VLIW processor M
Ensure:
  The lmax(vi)-successor-tree-consistent deadline d′(vi) of each operation vi
1: Sort v in topological order
2: for i = |v| − 1 to 0 do
3:   if vi is a sink operation then
4:     d′(vi) = d
5:     continue
6:   end if
7:   Construct the lmax(vi)-successor tree T of vi
8:   Let S be an array of Succ(vi) in T
9:   Sort S in non-increasing order of the successors' lmax(vj)-successor-tree-consistent deadlines
10:  Sort all the operations in S with the same deadline in non-increasing order of their latencies in T
11:  for j = 0 to |S| − 1 do
12:    Schedule Sj on a cluster by ALAP-scheduler(σ, Sj)
13:    F(Sj) = the start time of Sj
14:  end for
15:  Let l be an integer in [0, |S| − 1] satisfying F(Sl) − L(Sl, vi) = min{F(Sj) − L(Sj, vi) : j ∈ [0, |S| − 1]}
16:  Let C′ be the cluster on which Sl is scheduled
17:  Let S′ be the subset of Succ(vi) not scheduled on C′
18:  /* smax stores the latest start time of vi */
19:  smax = F(Sl) − L(Sl, vi) − 1
20:  for k = 0, ..., |S′| − 1 do
21:    I(S′k) = F(S′k) − L(S′k, vi)
22:    smax = min{I(S′k) − c − 2, smax}
23:  end for
24:  Sort S′ in non-decreasing order of I
25:  Let p be the maximum number of functional units of any cluster
26:  for k = 0, 1, ..., min{|S′| − 1, c ∗ p} do
27:    /* test whether S′k can be moved to the cluster C′ */
28:    Let A be an array containing S′k and all the operations scheduled on C′
29:    Sort A in non-increasing order of their lmax(vj)-successor-tree-consistent deadlines
30:    Sort all the operations in A with the same deadline in non-increasing order of their latencies in T
31:    for j = 0, 1, ..., |A| − 1 do
32:      Schedule Aj on the cluster C′ by ALAP-scheduler(σ, Aj)
33:      F(Aj) = the start time of Aj
34:    end for
35:    t1 = min{F(Aj) − L(Aj, vi) − 1 : j = 0, ..., |A| − 1 and Aj is scheduled on C′}
36:    t2 = min{F(Aj) − L(Aj, vi) − c − 2 : j = 0, ..., |A| − 1 and Aj is not scheduled on C′}
37:    if min{t1, t2} > smax then
38:      Move S′k to C′
39:      smax = min{t1, t2}
40:    end if
41:  end for
42:  d′(vi) = smax + 1
43: end for
Firstly, our algorithm schedules each successor vj of v3 as late as possible based on its lmax(vj)-successor-tree-consistent deadline and the latency between v3 and vj in the relaxed problem instance P(v3). The schedule of all the successors of v3 is shown in Table 3a. Secondly, our algorithm finds the tentative latest start time of v3 with respect to the previous schedule, the inter-operation latencies, and the inter-cluster communication latency. The tentative latest start time of v3 is 133, as shown in Table 3b. Lastly, our algorithm tests whether vi (i = 10, 12) can be moved to cluster A so that the resulting latest start time of v3 can be increased. As a result, v10 is moved to cluster A, increasing the latest start time of v3 by one cycle. The final schedule for computing the lmax(v3)-successor-tree-consistent deadline is shown in Table 3c. From the schedule shown in Table 3c, we get d′3 = 135.
The lmax(vi)-successor-tree-consistent deadline of each operation vi is shown in the lmax(vi)-STC Deadline row of Table 4. For comparison, the priorities used in the Integrated heuristic [3], namely the mobility of each operation, the number of different functional units capable of executing it (the DFU row), and its number of successors, are also shown in Table 4.
Next, we analyse the time complexity of our algorithm for computing all the lmax(vi)-successor-tree-consistent deadlines.
Theorem 1. Given a basic block, the time complexity of our algorithm for computing all the lmax(vi)-successor-tree-consistent deadlines is O(n³), where n is the number of operations in the basic block.
Proof. The time complexity of one iteration of the outer for loop is analysed as follows.
1. Step 1. It takes O(e) time to construct the lmax(vi)-successor tree T of vi, and O(n log n) time to sort S. Since it takes O(n) time to schedule an operation as late as possible by using the maximum bipartite matching, the first inner for loop takes O(n²) time. As a result, Step 1 takes O(n²) time.
2. Step 2. Clearly, this step takes O(n) time.
3. Step 3. The number of iterations of the outer for loop is at most c ∗ p + 1, where c is the inter-cluster communication latency and p is the number of functional units in each cluster. Since both c and p are very small, c ∗ p can be considered a constant. Furthermore, the time complexity of each iteration of the outer for loop is dominated by the inner for loop, which takes O(n²) time. As a result, Step 3 takes O(n²) time.
Therefore, the time complexity of our algorithm is O(n³).
6 Instruction Scheduling Heuristic
In our scheduling heuristic, the priority of each operation vi is its lmax(vi)-successor-tree-consistent deadline; a smaller deadline implies a higher priority. Our heuristic schedules operations in non-decreasing order of their deadlines and tries to schedule the immediate predecessors of each operation on the same cluster without delaying the operation, in order to reduce the inter-cluster communication latency. Our heuristic works as follows. At any time, it selects a ready operation vi with the smallest lmax(vi)-successor-tree-consistent deadline, where an operation is ready if both its inter-operation latencies and its inter-cluster communication latency have elapsed. The following two cases are distinguished.
1. vi has a sibling vk that is already scheduled. Let vj be the immediate successor of vi satisfying the following constraints: (a) one immediate predecessor of vj is already scheduled; (b) vj has the smallest lmax(vj)-successor-tree-consistent deadline among all the immediate successors of vi having an immediate predecessor that is already scheduled. Consider the following two cases:
  (a) There are two clusters Cs and Ct such that the start time of vj is minimised if vi is scheduled on either cluster Cs or cluster Ct, where the start time of vj is the time at which vj can be scheduled immediately after vi. In this case, schedule vi on a cluster as early as possible by using the maximum bipartite matching.
  (b) Such clusters Cs and Ct do not exist. Schedule vi on the cluster that minimises the start time of vj by using the maximum bipartite matching, where the start time of vj is the time at which vj can be scheduled immediately after vi.
2. vi is a sink operation, or vi does not have a sibling that is already scheduled. In this case, schedule vi on a cluster as early as possible by the maximum bipartite matching.
The reason that our scheduling heuristic distinguishes between Case (a) and Case (b) is that vj may not be scheduled immediately after vi, and other operations scheduled between vi and vj will affect the start time of vj. In this case, scheduling vi on a cluster earlier may make vj start earlier. A skeleton of this list-scheduling phase is sketched below.
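In the following Python sketch, the heap realises the deadline-priority selection; the sibling-based cluster choice of Cases 1 and 2 and the AEAP bipartite-matching placement are abstracted into the try_place callback (a hypothetical name), and readiness with respect to latencies is assumed to be enforced by that placement.

import heapq
from itertools import count

# Sketch: deadline-driven list scheduling of Section 6.
# preds[v]/succ[v]: immediate predecessors/successors of v; deadlines:
# the lmax(vi)-successor-tree-consistent deadlines; try_place(v): Cases 1-2.
def priority_schedule(ops, preds, succ, deadlines, try_place):
    indeg = {v: len(preds[v]) for v in ops}
    tie = count()                        # breaks ties between equal deadlines
    ready = [(deadlines[v], next(tie), v) for v in ops if indeg[v] == 0]
    heapq.heapify(ready)
    while ready:
        _, _, v = heapq.heappop(ready)   # smallest deadline = highest priority
        try_place(v)                     # cluster choice + AEAP placement
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                heapq.heappush(ready, (deadlines[w], next(tie), w))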
A schedule for Example 1 constructed by our heuristic is shown in Table 5a. For comparison, a schedule for Example 1 generated by the Integrated heuristic [3] is shown in Table 5b. Our heuristic reduces the execution time of the basic block by 2 cycles.
After the lmax(vi)-successor-tree-consistent deadline of each operation vi has been computed, our instruction scheduling heuristic takes O(n²) time to construct a schedule for all the operations in the basic block. Therefore, the time complexity of our entire instruction scheduling heuristic is dominated by the algorithm for computing all the lmax(vi)-successor-tree-consistent deadlines. As a result, the following theorem holds.
Theorem 2. Given a basic block, the time complexity of our entire instruction scheduling heuristic is O(n³), where n is the number of operations in the basic block.
7 Simulation Results
To evaluate the performance of our scheduling heuristic, we simulated our heuristic and the previous best-performing heuristic, the Integrated heuristic [3], on 808 basic blocks selected from the MediaBench II benchmark suite [19]. The number and size range of the basic blocks selected from each benchmark are shown in Table 6. Most of the basic blocks in the benchmark suite are relatively small; the sizes of the selected basic blocks range from 20 to 379 operations. In order to obtain larger basic blocks, we unrolled loops from the benchmark suite to produce basic blocks with 1200 operations on average.
The hardware platform for the simulation is an Intel(R) Core(TM)2 Quad CPU Q9400 with a clock frequency of 2.66 GHz, 3.5 GB of memory, and a 3 MB cache. We used three different clustered VLIW processor models, namely Processor (a), Processor (b), and Processor (c). All three processor models are based on the TMS320C64X processor [17]. Processors (a), (b), and (c) have 4 clusters, 3 clusters, and 2 clusters, respectively. All the clusters of each processor model are identical. Each cluster has an L unit, an S unit, a D unit, and an M unit, as in the TMS320C64X processor [17]. The instruction set is the same as that of the TMS320C64X processor. The inter-cluster communication latency is 3 cycles. The latencies of the load operation, the multiply operation and the branch operation are 4, 1 and 5 cycles, respectively. The latencies of all other operations are 0 cycles. In addition, we assume that the register file of each cluster has an unlimited number of registers.
Fig. 4 shows the simulation results for the real basic blocks taken from the MediaBench II benchmark suite, and Fig. 6 shows the simulation results for the artificial basic blocks obtained by unrolling the selected loops taken from the MediaBench II benchmark suite. In both figures, the results in subfigure (i) were obtained by using Processor (i); for example, Fig. 4(a) assumes Processor (a). In each figure, the vertical axis represents the average processor cycles of the basic blocks selected from a benchmark on a specific clustered VLIW processor.
The improvements of our heuristic over the Integrated heuristic are shown in Fig. 5 and Fig. 7. As we can see, our scheduling heuristic outperforms the Integrated heuristic on all the selected basic blocks. The improvement on the real basic blocks is higher than that on the artificial basic blocks, for the following major reason. After unrolling a loop, more parallelism is available, so the number of idle cycles across all the functional units decreases significantly. As a result, the improvement of our heuristic over the Integrated heuristic on the artificial basic blocks is not as significant as on the real basic blocks. As shown in Fig. 5, the improvements of our heuristic over the Integrated heuristic on the H.264 and JPEG2000 benchmarks are not as significant as on the other benchmarks. This occurs because many operations in both benchmarks have only one successor. As a result, the lmax(vi)-successor-tree-consistent deadlines of those operations degenerate to the priorities derived from a priority scheme that does not consider the resource constraints.
On average, our heuristic outperforms the Integrated heuristic by 13%, 18% and 16% for the three processor models, respectively. There are several reasons why our heuristic performs better than the Integrated heuristic. Firstly, our priority scheme considers the processor resource constraints, while the priority scheme of the Integrated heuristic does not. As a result, our priority scheme is more accurate, especially when an operation has many successors competing for limited resources. Secondly, the communication cost function of the Integrated heuristic aims to minimise the inter-cluster communication, while our heuristic aims to minimise the schedule length instead. Lastly, our heuristic uses the maximum bipartite matching to schedule each operation as early as possible, while the Integrated heuristic does not use the maximum bipartite matching and thus cannot utilise the functional units to the fullest extent.
Our heuristic runs slower than the Integrated heuristic. Fig. 8 shows the scheduling times (in milliseconds) of both heuristics for the real basic blocks. For each of the three processors, the average scheduling time of our heuristic is negligible. The scheduling times of both heuristics (in seconds) for the artificial basic blocks are shown in Fig. 9. For the artificial basic blocks, the average scheduling times of our heuristic for the three processors are 3.66 seconds, 4.82 seconds and 5.66 seconds, respectively, while those of the Integrated heuristic are 0.78 seconds, 0.52 seconds and 0.43 seconds, respectively.
Fig. 4: Simulation results Part A: Average processor cycles for real basic blocks
Fig. 5: Simulation results Part B: Improvements of our heuristic on real basic blocks
Fig. 6: Simulation results Part C: Average processor cycles for artificial basic blocks
Fig. 7: Simulation results Part D: Improvements of our heuristic on artificial basic blocks
Fig. 8: Simulation results Part E: Scheduling times for real basic blocks
Fig. 9: Simulation results Part F: Scheduling times for artificial basic blocks
(In each of Figs. 4-9, subfigures (a), (b) and (c) show the results for Processors (a), (b) and (c), respectively.)
8 Conclusion

This paper proposes an efficient phase-coupled, priority-based heuristic for scheduling the operations in a basic block on a clustered VLIW processor. Our heuristic computes the lmax(vi)-successor-tree-consistent deadline of each operation vi and uses these deadlines as priorities to construct a schedule. Compared to the previous priority schemes, our priority scheme is the first to consider both the precedence-latency constraints and the resource constraints, so the priorities computed by our heuristic reflect the relative importance of each operation more accurately. We have simulated our heuristic and the Integrated heuristic
on the 808 basic blocks taken from the MediaBench II benchmark suite using three processor models. On average, our heuristic improves over the Integrated heuristic by 13%, 18% and 16% for the three processor models, respectively.
In the existing approaches, register allocation and instruction scheduling are performed separately. For clustered VLIW processors, these two problems are closely related and have a significant impact on each other. An open problem is how to integrate register allocation and instruction scheduling into a single problem and solve it efficiently.
References
1. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, fourth edition, 2006, pages 114-120.
2. Andrei Terechko, Erwan Le Thenaff, Manish Garg, Jos van Eijndhoven, and Henk Corporaal. Inter-cluster communication models for clustered VLIW processors. In Proceedings of the Symposium on High Performance Computer Architectures, 2003.
3. Rahul Nagpal and Y. N. Srikant. Pragmatic integrated scheduling for clustered VLIW architectures. Software: Practice and Experience, 38:227-257, 2008.
4. Jeffrey D. Ullman. Complexity of Sequencing Problems. John Wiley and Sons, 1976.
5. John R. Ellis. Bulldog: A Compiler for VLIW Architectures. The MIT Press, 1986.
6. Saurabh Jang, Steve Carr, Philip Sweany, and Darla Kuras. A code generation framework for VLIW architectures with partitioned register banks. In Proceedings of the 3rd International Conference on Massively Parallel Computing Systems, 1998.
7. Victor S. Lapinskii and Margarida F. Jacome. Cluster assignment for high-performance embedded VLIW processors. ACM Transactions on Design Automation of Electronic Systems, 7(3):430-454, July 2002.
8. E. Ozer, S. Banerjia, and T. M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998.
9. Rainer Leupers. Instruction scheduling for clustered VLIW DSPs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2000.
10. K. Kailas, A. Agrawala, and K. Ebcioglu. CARS: A new code generation framework for clustered ILP processors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, 2001.
11. Jesús Sánchez and Antonio González. Instruction scheduling for clustered VLIW architectures. In Proceedings of the 13th International Symposium on System Synthesis, 2000.
12. Javier Zalamea, Josep Llosa, Eduard Ayguadé, and Mateo Valero. Modulo scheduling with integrated register spilling for clustered VLIW architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 160-169, 2001.
13. Phillip B. Gibbons and Steven S. Muchnick. Efficient instruction scheduling for a pipelined architecture. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1986.
14. Josep M. Codina, Jesús Sánchez, and Antonio González. A unified modulo scheduling and register allocation technique for clustered processors. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001.
15. Yi Qian, Steve Carr, and Philip Sweany. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, 2002.
16. Alex Aleta, Josep M. Codina, Jesús Sánchez, Antonio González, and David Kaeli. AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures. IEEE Transactions on Computers, 58(6):770-783, 2009.
17. TI TMS320C64x DSPs. http://www.ti.com.
18. Silvio Micali and Vijay Vazirani. An algorithm for finding maximum matching in general graphs. In Proceedings of the 21st IEEE Symposium on Foundations of Computer Science, 1980.
19. MediaBench II benchmark. http://euler.slu.edu/~fritts/mediabench/.
Table 3: Computing the lmax(v3)-successor-tree-consistent deadline. The number in parentheses after each operation is its start cycle.

(a) Step 1 (all the successors of v3 scheduled as late as possible):
  L1: v6 (141), v9 (142), v11 (143); S1: v7 (141); D1: v8 (141);
  L2: v10 (142), v12 (143); the remaining functional units are idle.

(b) Step 2 (v3 tentatively scheduled): the schedule of (a), together with v3 on D1 (133) and an MVD operation on M1 transferring v3's result to cluster B.

(c) Step 3 (v10 moved to cluster A):
  L1: v10 (140), v6 (141), v9 (142), v11 (143); S1: v7 (141); M1: MVD;
  D1: v3 (134), v8 (141); L2: v12 (143).
Table 4: Priority of each operation vi

Instruction            v1   v2   v3   v4   v5   v6   v7   v8   v9   v10  v11  v12
AEAP                   0    0    0    0    5    5    5    6    6    6    7    7
ALAP                   2    2    2    2    7    7    7    8    8    8    9    9
Mobility               2    2    2    2    2    2    2    2    2    2    2    2
DFU                    1    1    1    1    3    3    3    1    1    1    1    1
Number of Successors   2    2    3    2    1    1    1    2    1    1    0    0
lmax(vi)-STC Deadline  136  136  135  135  141  142  142  142  143  143  144  144
Table 5: Schedules by different heuristics. Within each functional unit, operations are listed in order of increasing start cycle.

(a) A schedule by our scheduling heuristic (cycles 0-14):
  L1: v6; S1: v7; M1: MVD, MVD, MVD; D1: v3, v4;
  L2: v5, v8, v9, v10, v11, v12; D2: v1, v2; S2 and M2 are idle.

(b) A schedule by the Integrated heuristic (cycles 0-16):
  L1: v6, v9, v10, v8, v11; S1: v7, MVD, MVD; D1: v3, v4;
  L2: v5, v12; M2: MVD, MVD, MVD; D2: v1, v2; M1 and S2 are idle.
Table 6: Information of benchmarks

Benchmark                  MPEG2  JPEG2000  H.264  H.263  MPEG4  JPEG
Basic Block Number         82     76        273    73     212    92
Smallest Basic Block Size  20     20        20     20     20     20
Biggest Basic Block Size   116    120       379    163    211    108
Average Basic Block Size   38     30        42     39     50     37