An Improved Systolic Architecture for LU Decomposition∗
DaeGon Kim and Sanjay V. Rajopadhye
Computer Science Department
Colorado State University
E-mail: {kim,svr}@cs.colostate.edu
Abstract
LU-Decomposition is a classic problem for which many
systolic array implementations have been proposed, the best
of which takes 3n − 3 time on n²/2 PEs, for a dense n × n
matrix. In this paper, we first give a proof that if only nearest-neighbor communication is allowed, this time is a lower
bound. We then generalize it to 2n + ⌈n/k⌉ − 3 time if
one allows k-bounded broadcasts (i.e., if it takes ⌈m/k⌉
time steps to broadcast a value to m destination nodes). We
also present a new architecture with this improved execution time, which uses n²/2 PEs, each consisting of two
multiplier-subtractor units, but active only on alternate cycles. This leads to a speedup and efficiency of kn²/(6k + 3)
and 2k/(6k + 3), respectively. For k = 1, our proposed architecture achieves the performance of the best previously
known systolic array implementation. Special cases of our
results include similar improvements to algorithms for solving (upper and lower) triangular linear systems by (forward
and backward) substitution.
1. Introduction
Systolic architectures were proposed by Kung and Leiserson [6, 5] to implement compute-intensive applications on
hardware structures. They are parallel architectures consisting of locally connected, simple processing elements (PEs)
where data flows in a synchronous, rhythmic and pipelined
fashion, and each PE works concurrently to exploit massive
parallelism. Elegant systolic arrays have been proposed for
many compute-bound algorithms in scientific computation,
signal and image processing and dense linear algebra.
LU decomposition is an important step in solving a system of linear equations, a fundamental problem in many
scientific areas. It factorizes an n × n square matrix into
two triangular matrices so that the resulting linear system
∗ This research was supported in part by the National Science Foundation, under grant EI-030614: HiPHiPECS: High Level Programming of High Performance Embedded Computing Systems.
can be more easily solved in O(n²) work. This decomposition is a computation-intensive algorithm requiring O(n³) operations (specifically n³/3 multiply-subtractions). Many
systolic architectures have been proposed for this problem [1, 2, 4, 6, 11]. It is widely accepted that it can be
solved in 3n − 3 time.
In this paper we consider the possibility of bounded
broadcasts as proposed by Yaacoby and Cappello [14] and
Risset and Robert [13]. Specifically, we say that there is a
k-bounded broadcast whenever the time taken to communicate a value from a single source to m destinations is at
least ⌈m/k⌉. We present a new systolic architecture for LU decomposition, whose execution time is 2n + ⌈n/k⌉ − 3
with k-bounded broadcasts. We also prove that this execution time is a lower bound for k-bounded broadcasts. For
the special case of k = 1, this shows that the best known
latency 3n − 3 is a lower bound.
The rest of the paper is organized in the following way.
In section 2, we illustrate our major observations using a
simple example, forward substitution. In section 3, after formulating the LU decomposition equations and presenting some of their properties, we present a lower-bound
proof when only nearest-neighbor communication is allowed. Section 4 is dedicated to an improved architecture
for LU decomposition. We describe the related work in section 5, and conclude this paper with some indications of
future work and open problems.
2. Illustrating Example
Consider the following triangular linear system:
Ly = b   (1)
where L is an n × n lower triangular matrix all of whose
diagonal elements are 1, and y and b are both n-vectors. An
equation for the unknown y can be derived from equation
(1):
$$
y_i = \begin{cases}
b_i & i = 1 \\
b_i - \sum_{j=1}^{i-1} L_{ij}\, y_j & 1 < i \le n
\end{cases}
\qquad (2)
$$
Figure 1. Iteration space of the serialized Ly = b computation of equation (3) for n = 7.

Figure 2. Iteration space of the uniform recurrence equations of a triangular linear system.
The computation can be defined by the following recurrence:

$$
Y[i,j] = \begin{cases}
b_i & j = 0 \\
Y[i,j-1] - L_{ij}\, Y[j,j-1] & j > 0
\end{cases}
\qquad (3)
$$

where, for all i = 1, . . . , n, y_i = Y[i, i − 1]. In this recurrence, we
serialize the accumulation in (2) along the increasing order of j. The dependence graph of this equation is shown in
Figure 1. Note that the chosen serialization is the only linear
serialization which admits a linear schedule. If we choose to
accumulate in the decreasing order of j, the very first computation needed to produce yi, namely Y[i, i − 1], requires
yi−1 . Hence, the accumulation of the values contributing to
yi cannot start until yi−1 is computed, and this, in turn, requires all the points in the triangle defined by [1, 0], [i−1, 0]
and [i − 1, i − 2]. This leads to a quadratic schedule. Therefore, we have to serialize the reduction in the increasing
order of j in order to obtain a systolic array.
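To make the serialized recurrence concrete, here is a minimal Python sketch (ours, not part of the original paper; names are illustrative) that evaluates equation (3) directly, accumulating in increasing order of j:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a unit lower-triangular L by evaluating
    recurrence (3): Y[i, j] = Y[i, j-1] - L[i, j] * y[j], with Y[i, 0] = b[i]."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):          # 0-based here; 1 <= i <= n in the paper
        acc = b[i]              # Y[i, 0] = b_i
        for j in range(i):      # increasing order of j, as required
            acc -= L[i, j] * y[j]
        y[i] = acc              # y_i = Y[i, i-1]
    return y

# quick check against NumPy's solver
L = np.tril(np.random.rand(5, 5), -1) + np.eye(5)
b = np.random.rand(5)
assert np.allclose(forward_substitution(L, b), np.linalg.solve(L, b))
```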
Let us compare this systolic (linear) schedule with
an ideal situation, namely a PRAM [3], which is the
most powerful abstract parallel machine. Let us assume
CREW (Concurrent Read, Exclusive Write), where many
processors can read the same memory location at the same
time but cannot write the same location concurrently. The
maximum latency of the Ly = b computation is n − 1 because the length of the critical path consisting of y1 . . . yn
is n − 1 and all the computations have at most two inputs.
(Even on a CRCW PRAM, the latency is n − 1, so the weaker model does not cost us anything for this problem.) Specifically, all the j-th accumulation steps can be done simultaneously.
The (PRAM) schedule is t(i, j) = j.
Consider all the points having the same j. The distances
in the dependence graph from each point to the end point
Y [n, n − 1] are exactly the same. So, among these points,
the computation that is executed last determines the total execution time. For instance, consider all the computations at
j = 1 in figure 1. The length of the critical path from (2, 1)
to (7, 6) is the same as that of the critical path from (7, 1)
to (7, 6). All these points, which can be done in parallel
in the PRAM model, depend on y1 (= Y [1, 0]). In the systolic array model in which broadcasts are not allowed, all
the points on j = 1 cannot receive y1 in a constant time. So
the point that receives y1 last becomes a bottleneck of this
computation. In fact, the time for a value to be propagated
from a single source to m destinations is a linear function
of m, O(m). Therefore, under the systolic model, the triangular system cannot be solved in n − 1, or even in n + c time for any constant c.
Before continuing our discussion, we first describe a
standard architecture for this problem. A triangular system can be solved in 2n − 3 time in systolic arrays using
pipelining. Now, this computation is described by the following recurrence:
$$
Y[i,j] = \begin{cases}
b_i & j = 0 \\
Y[i,j-1] - L_{ij}\, Z[i,j] & j > 0
\end{cases}
\qquad (4)
$$

where Z is declared over {(i, j) | 1 ≤ j ≤ i ≤ n} and defined by

$$
Z[i,j] = \begin{cases}
Y[i-1,j-1] & i - 1 = j \\
Z[i-1,j] & i - 1 > j
\end{cases}
\qquad (5)
$$
The corresponding dependence graph is shown in Figure
2. From these uniform recurrence equations, which can be
automatically mapped to a systolic architecture, we can get
a 2n − 3 time bound with the fastest schedule t(i, j) =
i + j − 2.
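As a sanity check (our illustration, not from the paper; 1-based indices as in the text), the following sketch verifies that t(i, j) = i + j − 2 respects the dependences of (4)–(5) and that the last computation, Y[n, n − 1], finishes at time 2n − 3:

```python
def check_schedule(n):
    """Validity check of t(i, j) = i + j - 2 for recurrences (4)-(5):
    Y[i, j] needs Y[i, j-1] one cycle earlier, and the value y_j = Y[j, j-1],
    which reaches row i through Z after i - j one-cycle hops."""
    t = lambda i, j: i + j - 2
    ok = True
    for i in range(2, n + 1):
        for j in range(1, i):
            if j >= 2:                                   # y_1 = b_1 is an input, so only j >= 2 is constrained
                ok &= t(i, j - 1) + 1 <= t(i, j)         # horizontal dependence
                ok &= t(j, j - 1) + (i - j) <= t(i, j)   # pipelined y_j arrives in time
    return ok, t(n, n - 1)                               # makespan = 2n - 3

print(check_schedule(7))    # (True, 11), i.e. 2*7 - 3
```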
Now, consider the effects of the order among points on
a horizontal line. Let T(u) be the set of all the nodes reachable from a point u in the iteration space. Since all the computations in T(u) depend on u, no computation in T(u) can begin before u is done. Since T(1, 0) = {(i, j) | 1 ≤ j <
i ≤ n}, the whole computation cannot start before Y [1, 0]
is available. Now, consider the set of points reachable from
a point [k, k − 1] on the diagonal boundary: T(k, k − 1) = {(i, j) | k < j < i ≤ n}. It is the "upper right" triangle "above" j = k, and no point in it can start before Y[k, k − 1] becomes available.

Suppose that the computation Y[4, 1] receives y1 at t = 3 and the other computations on the line j = 1 receive it before t = 3. Because of the vertical dependences, all the computations in {(i, j) | j ≤ 3} can finish at t = 5, and some computations on the line j = 3 might finish even before t = 5. However, no computation above j = 3 can even start until t = 5, when the value Y[4, 3] (i.e., y4) becomes available. So we cannot expedite the execution of the whole computation by rapidly propagating a value to a point without also propagating it to all points to its left. Note that we can apply the same argument recursively to smaller triangles containing the diagonal line.

To formalize our argument, let LAST(j) be the i coordinate of the point that last receives yj among the points on the line j. We define a propagation sequence s1 s2 . . . sl where s1 = LAST(1) and s_{m+1} = LAST(s_m) for all m = 1, . . . , l − 1. Informally, we start from y1, go to the last point on the next row that receives y1, follow that vertically to the diagonal, and repeat for the next row onwards. Figure 3 shows a propagation sequence s1 s2 s3 where s1 = 4, s2 = 6 and s3 = 7. Each edge is labelled by the difference between the computation times of its two iteration points. In the standard architecture, l = 1 and s1 = n because Y[n, 1] is the last computation that receives y1. The sequence s1 s2 . . . sl is strictly monotonically increasing; hence l is at most n − 1.

Figure 3. A propagation sequence in the forward substitution iteration space; the weight of every serialization edge is 1; the weights of the other edges represent delays between the production of yi and its arrival time; the edge weights that are not shown are irrelevant.

Figure 4. A pipeline using bounded broadcasts of size two in forward substitution.

Now let LASTt(j) be the additional time delay between the computation time C of yj and the time R when LAST(j) receives yj. Precisely, LASTt(j) = R − C − 1. In the standard architecture in Figure 2, LASTt(1) = n − 2. Since Y[1, 0], Y[n, 1] and Y[n, n − 1] form a dependence path, a lower bound for the standard architecture is n − 1 + LASTt(1) (= 2n − 3). In general, a lower bound for any parallelization is

$$
(n - 1) + \sum_{s \in s_1 \ldots s_l} \mathrm{LAST}_t(s)
\qquad (6)
$$
where s1 . . . sl is the propagation sequence in an architecture. The above equation represents the sum of the edge weights along a particular path. In fact, the equation holds for all the paths from Y[1, 0] to Y[n, n − 1]. A propagation sequence is a critical path. Therefore, there is no performance gain from propagating a value quickly to points farther to the right while leaving a hole, because a path containing the hole becomes a bottleneck later.
We have shown that (i) without broadcasting, a triangular
system cannot be solved in n + c time for any constant c; and (ii) a value should be propagated in increasing order of i in order to speed up the execution; therefore, the only way to speed up the execution is to simultaneously propagate a value to some adjacent right neighbors, with no hole.
We now show how to use bounded broadcasts to improve
this running time. Starting from the standard architecture,
we change the recurrence equations (4) and (5) into new
equations which have a longer dependence vector.
$$
Y[i,j] = \begin{cases}
b_i & j = 0 \\
Y[i,j-1] - L_{ij}\, Z[i,j] & j > 0
\end{cases}
\qquad (7)
$$
where Z is now defined by
$$
Z[i,j] = \begin{cases}
Y[i-1,j-1] & i - 1 = j \\
Y[i-2,j-1] & i - 2 = j \\
Z[i-2,j] & i - 2 > j
\end{cases}
\qquad (8)
$$
Figure 5. Forward substitution architecture where the processor mapping is π(i, j) = i and n = 8. (Legend: propagation at i = j − 1; propagation at i < j − 1; serialization; odd and even PEs.)

In this system of uniform recurrence equations, the length of the critical path is (n − 2) + (⌈n/2⌉ − 1). With a schedule λ(i, j) = ⌈i/2⌉ + j − 2 and an allocation π(i, j) = i,
the above recurrence equations describe an improved architecture which solves a triangular system in n + ⌈n/2⌉ − 3 time. The difference from the standard architecture is the propagation of the y values over a distance of 2, as shown in figure 4. Also, figure 5 shows the architecture for n = 8. In this architecture each processor has a multiplier and a subtractor. In general, we can achieve the n + ⌈n/k⌉ − 3 time bound without increasing processing power when k-bounded broadcasts are used.
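The following sketch (ours, not from the paper) checks that the schedule λ(i, j) = ⌈i/2⌉ + j − 2 respects the dependences of (7)–(8), where a y value now advances two rows per cycle, and that the makespan is n + ⌈n/2⌉ − 3:

```python
import math

def check_improved_schedule(n):
    """Validity check of lambda(i, j) = ceil(i/2) + j - 2 for recurrences (7)-(8):
    y_j = Y[j, j-1] reaches row i after ceil((i - j)/2) hops because Z jumps
    by 2 in the i direction (a 2-bounded broadcast)."""
    lam = lambda i, j: math.ceil(i / 2) + j - 2
    ok = True
    for i in range(2, n + 1):
        for j in range(1, i):
            if j >= 2:
                ok &= lam(i, j - 1) + 1 <= lam(i, j)        # horizontal dependence
                hops = math.ceil((i - j) / 2)               # Z moves 2 rows per cycle
                ok &= lam(j, j - 1) + hops <= lam(i, j)     # y_j arrives in time
    return ok, lam(n, n - 1)                                # makespan = n + ceil(n/2) - 3

print(check_improved_schedule(8))   # (True, 9), i.e. 8 + 4 - 3
```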
Now, we prove that this is a lower bound of this computation when k-bounded broadcasts are used and the projection
is not π(i, j) = i − j. Note that Y [i + k, i − 1] cannot be
computed before Y [i, i − 1] because the broadcast size is k
and both need Y [i−1, i−2]. To compute Y [i+k, i+k −1],
k steps are required after both Y [i, i − 1] and Y [i + k, i − 1]
which itself depends on Y [i, i − 1]. Thus there must be
a delay of at least k + 1 between Y [i + k, i + k − 1] and
Y [i, i−1]. Since Y [i+k, i+k−1] depends on Y [i+k, i−1], for every k, there is one additional delay. This amounts to ⌈n/k⌉ − 1. More precisely, there is one additional delay at the points k + 2, 2k + 2, . . . , l × k + 2, where l = (n − 2)/k.
Finally, note that we cannot choose the allocation function, π(i, j) = j, that is valid for the standard architecture.
In an architecture with the schedule λ(i, j) = ⌈i/2⌉ + j − 2, this allocation maps two computations to the same processor at the
same time. Although this allocation removes the long connection, it requires increasing the number of PEs by a factor
of k. Figure 6 shows an architecture where the allocation
function is π(i, j) = j. Although the internal structure of
the processing elements is not shown, each element has two
multipliers and two subtractors.
Later, we will use this reasoning about the choice of processor projection to prove that using a linear mapping it is
impossible to avoid the use of n² PEs when bounded broadcasts are used for the LU decomposition.
Figure 6. Forward substitution architecture
where the processor mapping is π(i, j) = j
and n = 8.
3. LU Decomposition
We now apply these observations to LU decomposition.
Our main results are: (i) a proof that 3n − 3 is a lower
bound in systolic arrays when only a linear serialization and
nearest-neighbor communications are allowed; (ii) a generalization of this proof to 2n + ⌈n/k⌉ − 3 when k-bounded broadcasts are used; and (iii) a new systolic architecture that solves LU decomposition in 2n + ⌈n/k⌉ − 3 time using
k-bounded broadcasts.
3.1. Problem Statement
LU decomposition factorizes a matrix A into a lower triangular matrix L and an upper triangular matrix U.
A = LU
We adopt the convention that L is unit diagonal. Since L
and U are lower and upper triangular matrices, respectively, the above equation can be written as follows:
$$
A_{ij} = \begin{cases}
\sum_{k=1}^{j} L_{ik}\, U_{kj} & i > j > 1 \\
L_{ij}\, U_{jj} & i > j = 1 \\
L_{ii}\, U_{ij} & 1 = i \le j \\
\sum_{k=1}^{i} L_{ik}\, U_{kj} & 1 < i \le j
\end{cases}
\qquad (9)
$$
Now, the equations for the unknown L and U matrices can
be derived from equation (9).
$$
L_{ij} = \begin{cases}
A_{ij}/U_{jj} & i > j = 1 \\
\left( A_{ij} - \sum_{k=1}^{j-1} L_{ik}\, U_{kj} \right) / U_{jj} & i > j > 1
\end{cases}
\qquad (10)
$$

$$
U_{ij} = \begin{cases}
A_{ij} & 1 = i \le j \\
A_{ij} - \sum_{k=1}^{i-1} L_{ik}\, U_{kj} & 1 < i \le j
\end{cases}
\qquad (11)
$$
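For reference, equations (10) and (11) can be evaluated directly; the Python sketch below (ours; a plain, unscheduled evaluation with no pivoting) does exactly that and checks the result:

```python
import numpy as np

def lu_decompose(A):
    """Direct evaluation of equations (10) and (11), no pivoting.
    L is unit lower-triangular; U is upper-triangular (0-based indices)."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(i, n):            # U[i, j], i <= j      (eq. 11)
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        for j in range(i + 1, n):        # column i of L, j > i (eq. 10)
            L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
    return L, U

A = np.random.rand(5, 5) + 5 * np.eye(5)   # diagonally dominant, so no pivoting needed
L, U = lu_decompose(A)
assert np.allclose(L @ U, A)
```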
This yields the classic algorithm for LU decomposition
(without pivoting). By serializing the accumulations in equations (10) and (11), equations of single assignment form
can be derived. The increasing direction of k is the only
linear serialization that allows a linear schedule even in
PRAM. We get a quadratic schedule if we serialize the other
order of k, the decreasing order. This follows from the argument discussed in section 2.

Figure 7. Data dependences in LU decomposition. Solid lines show the flow (i, j−1) → (j, j); dashed lines show (j, j) → (i, j). The computation of U4,7 requires L4,1...3 and U1...3,7; the computation of L6,5 requires L6,1...4 and U1...5,5.

Figure 8. LU decomposition iteration space; Ti is the iteration plane for the computation of the i-th row of L.

Figure 9. Dependences of equation (14).

After this serialization we have Uij = F[i, j, i − 1] and Lij = F[i, j, j], where F is defined as
$$
F[i,j,k] = \begin{cases}
A_{ij} & k = 0 \\
F[i,j,k-1]/F[j,j,j-1] & i > j \ge 1;\ k = j \\
F[i,j,k-1] & 1 = i \le j;\ k = 1 \\
F[i,j,k-1] - F[i,k,k]\, F[k,j,k-1] & 1 \le k < i;\ k < j
\end{cases}
\qquad (12)
$$
The computation for Lij uses the elements of L in the i-th row to the left of column j and the first j elements of the j-th column of U. Similarly, the computation of Uij needs the elements of U in the j-th column that are above row i and the first i − 1 elements of the i-th row of L. For example, as shown in figure
7, L6,5 depends on L6,1 , . . . , L6,4 and U1,5 , . . . , U5,5 , and
U4,7 depends on U1,7 , . . . , U3,7 and L4,1 , . . . , L4,3 . Clearly,
this computation can be done in 2n − 1 time under the CREW PRAM model, where the i-th row of U is computed at time 2i − 1 and the j-th column of L is computed at time 2j. Since the j-th column of L and the i-th row of U respectively accumulate j and i computations, the actual iteration space is a pyramid-shaped polyhedron, as shown in figure 8.
3.2. Lower-bound Proof
In this section we investigate some properties of the computation of the LU decomposition, and first show that LU decomposition cannot be done in 2n − 1 time without broadcasting. We then prove that 2n + ⌈n/k⌉ − 3 is a lower bound if k-bounded broadcasts are allowed. It follows that when k = 1, the best known execution time 3n − 3 is a lower bound.

Triangle computation

Consider the following problem: given an input X at the triangular set of points defined below

$$
\{(i, j) \mid 1 \le j \le i \le N\}
\qquad (13)
$$

compute Z defined by
$$
Z[i,j] = \begin{cases}
X[i,j] & i = 1;\ j = 1 \\
h(Z[j,j],\, X[i,j]) & j = 1;\ i > j \\
g(Z[i,j-1],\, X[i,j]) & i > 1;\ i = j \\
f(Z[i,j-1],\, Z[j,j],\, X[i,j]) & i > 1;\ j < i
\end{cases}
\qquad (14)
$$
The iteration space and the dependences of Z on itself are
shown in Figure 9 (dependences on X are not shown). We
can easily see that the lower bound of this computation is
2(N − 1), because of the critical path along the diagonal.
Also, this argument holds for any right triangle in the iteration space whose diagonal is on the i = j line. For example,
if Z[k, k] is computed at time t, the computation cannot finish before time t + 2(N − k). Another observation is that
any computation to the right and above it cannot start until
Z[k, k] is available.
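The 2(N − 1) bound can be checked mechanically; the sketch below (ours, assuming unit-time operations and unbounded parallelism) computes earliest-start times for equation (14) and reports the finish time of Z[N, N]:

```python
def earliest_start(N):
    """Earliest-start times for equation (14): each point starts one step
    after all of its Z-predecessors; inputs X impose no delay."""
    t = {(1, 1): 0}
    for i in range(1, N + 1):
        for j in range(1, i + 1):
            if (i, j) == (1, 1):
                continue
            if j == 1:                       # h(Z[j, j], X[i, j])
                preds = [(1, 1)]
            elif i == j:                     # g(Z[i, j-1], X[i, j])
                preds = [(i, j - 1)]
            else:                            # f(Z[i, j-1], Z[j, j], X[i, j])
                preds = [(i, j - 1), (j, j)]
            t[i, j] = 1 + max(t[p] for p in preds)
    return t[N, N]

print(earliest_start(7))   # 12, i.e. 2 * (7 - 1)
```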
Consider the iteration plane of the i-th row of the lower
triangular matrix L. From equation (12), we derive the
equation for this plane:
$$
T_i[j,k] = \begin{cases}
A_{ij} & i > j;\ k = 0 \\
F[i,j,k-1]/F[j,j,j-1] & i > j \ge 1;\ k = j \\
F[i,j,k-1] - F[i,k,k]\, F[k,j,k-1] & 1 \le k < j < i
\end{cases}
\qquad (15)
$$
Note that F [j, j, j − 1] and F [k, j, k − 1] are Ujj and Ukj ,
respectively. Each computation requires a different element
of U , and the computation of Lij requires all Lik for all
k = 1 . . . j − 1. This iteration plane, called Ti, has the same dependence properties as the triangle problem above when U is regarded as the input and the input Aij is ignored. Since the diagonal elements of L need not be computed, the height of Tn, the triangle for the last row, is n − 1. So the total execution time is 2n − 4. Together with the time for computing L21 and Unn, this is the same as the execution time of LU decomposition under the CREW PRAM model.
From the above discussion we see that the i-th row of L
requires 2i − 4 time. Moreover, uii can be computed only
after this last point Li,i−1. Let us assume that uii is computed at the next cycle after Ti is finished. All the Ti's need u11 to start their computation. If broadcast is not allowed, the delay between the production of u11 and the last consumption of u11 cannot be constant but must be a linear function of n. Suppose Ti′ is the last triangle computation to receive u11 and that this happens at time kn for some constant k. Then ui′i′ can be computed at best at kn + 2i′ − 3 and is needed to compute the small triangle in Tn starting with (i′, i′). This triangle cannot be done before 2n − 2i′ − 2. So the total execution time of the whole computation is 2n + kn − 2. So without broadcasting, LU decomposition cannot be done in 2n − 1 time steps. This disproves the claim of a systolic architecture with a 2n − 1 time bound proposed by Paul and Mickle [9]. Furthermore, the dependence among the triangles Ti is exactly the same as the dependence in forward substitution in section 2. Without computing the previous Ti′, it is impossible to speed up the whole computation by computing the next Ti slices.
Now, we will prove that 2n + ⌈n/k⌉ − 3 is a lower bound
of this computation with k-bounded broadcasts. The basic
idea is the same as that in the lower bound proof in Forward
Substitution. That is, for each k, there is one additional delay. Consider the computation time of the diagonal elements
of U . After the computation for ui−1,i−1 finishes, Ti and
Ti+k receive it. Since the length of a broadcast is k, Ti+k receives it one time step after Ti does. Let t be the time when Ti receives it. Because three time units are required to produce
uii , uii can be computed at time t + 3. Similarly, ui+k,i+k
can be at best computed at time (t + 1) + (2k + 2) + 1. The
difference between these two time stamps is 2k + 1. So the
total execution time is at least 2n − 2 + ⌈n/k⌉ − 1 = 2n + ⌈n/k⌉ − 3.
Symmetry between L and U
As seen in the equations for L and U in the introduction,
lij and uij have a similar computation pattern in which the
j-th column of L and the i-th row of U need the same column of U and the same row of L, respectively. This symmetry can be
visualized by imagining a mirror placed on the main diagonal in the iteration space. The computation of L and U are
the reflections of each other (except for the division). To
understand this formally, consider the transpose of U , say
W.
$$
W_{ij} = \begin{cases}
A_{ji} & 1 = j \le i \\
A_{ji} - \sum_{k=1}^{j-1} l_{jk}\, w_{ik} & 1 < j \le i
\end{cases}
\qquad (16)
$$
Expressing the computation in terms of U^T and its definition in 3-dimensional space, we get the following equations:
$$
F[i,j,k] = \begin{cases}
A_{ij} & k = 0 \\
F[i,j,k-1]/w_{jj} & 1 \le j < i;\ j = k \\
F[i,j,k-1] - l_{ik}\, w_{jk} & 1 \le k < j < i
\end{cases}
\qquad (17)
$$

$$
G[i,j,k] = \begin{cases}
A_{ji} & k = 0 \\
G[i,j,k-1] & 1 = j \le i;\ k = 1 \\
G[i,j,k-1] - w_{ik}\, l_{jk} & 1 \le k < j \le i
\end{cases}
\qquad (18)
$$

$$
F[i,j,k] = \begin{cases}
A_{ij} & k = 0 \\
F[i,j,k-1]/G[j,j,j-1] & 1 \le j < i;\ j = k \\
F[i,j,k-1] - F[i,k,k]\, G[j,k,k-1] & 1 \le k < j < i
\end{cases}
\qquad (19)
$$

$$
G[i,j,k] = \begin{cases}
A_{ji} & k = 0 \\
G[i,j,k-1] & 1 = j \le i;\ k = 1 \\
G[i,j,k-1] - F[j,k,k]\, G[i,k,k-1] & 1 \le k < j \le i
\end{cases}
\qquad (20)
$$
The i-th row of L and U cannot be computed without the
(i − 1)-th column of U and L, respectively. So we cannot
improve the whole computation by expediting only one of
L and U . Furthermore, the way to speed up should be the
same for L and U, because the properties of these two computations are the same except for the division step. The iteration
space of this computation is shown in Figure 10.
4. Improved Architecture
As shown in the previous section, the running time can
be reduced only when some consecutive rows receive values
at the same time, i.e., through bounded broadcasts. Another
observation from the end of section 2 shows that whenever
the processor projection is the same as the broadcast direction,
the computing resources (number and/or complexity of processors) increase by a factor of the broadcast distance. LU
decomposition requires two bounded broadcasts; the one along the i direction is for L and the other, along the j direction, is for U. Thus, projections along i or j require increasing processing power; however, any other processor projection will lead to n² processors. Therefore, any architecture for LU decomposition using bounded broadcasts for speedup will have at least the processing power equivalent to n² processors, each with one multiplier and one subtractor.
We now present an architecture based on equations (17)
- (20) (c.f. section 3.2.2). In this architecture each processor
performs two computations at the same time, one for L and
one for U . The complexity of each processor is doubled but
the broadcast direction is only i, not both i and j.
The architecture uses broadcasts of length 2 along the i
direction. The equations for the architecture are:
$$
F_L[i,j,k] = \begin{cases}
F[i,j,k] & k = j < i \\
F_L[i,j-1,k] & k < j \le i
\end{cases}
\qquad (21)
$$
Figure 10. Iteration space of the proposed LU decomposition architecture, showing the L and U regions of the volume spanned by (1, 1, 1) and (n, n, n).
$$
F_{LL}[i,j,k] = \begin{cases}
F_L[i-1,j-1,k] & j + 1 = i \\
F_L[i-2,j-1,k] & j + 2 = i \\
F_{LL}[i-2,j,k] & j + 2 < i
\end{cases}
\qquad (22)
$$

$$
G_D[i,j,k] = \begin{cases}
G[i-1,j,k] & k - 1 = j;\ j + 1 = i \\
G[i-2,j,k] & k - 1 = j;\ j + 2 = i \\
G_D[i-2,j,k] & k - 1 = j;\ j + 2 < i
\end{cases}
\qquad (23)
$$

$$
G_L[i,j,k] = \begin{cases}
G[i,j,k] & k = j - 1 \\
G_L[i,j-1,k] & k + 1 < j \le i
\end{cases}
\qquad (24)
$$

$$
G_{LL}[i,j,k] = \begin{cases}
G_L[i-1,j-1,k-1] & j + 1 = i \\
G_L[i-2,j-1,k-1] & j + 2 = i \\
G_{LL}[i-2,j,k] & j + 2 < i
\end{cases}
\qquad (25)
$$
$$
F[i,j,k] = \begin{cases}
A_{ij} & k = 0 \\
F[i,j,k-1]/G_D[i,j,k-1] & i = 1;\ k = 1 \\
F[i,j,k-1] - F_L[i,j-1,k] \times G_{LL}[i-2,j,k] & 1 \le k < j < i - 2 \\
F[i,j,k-1] - F_L[i,j-1,k] \times G_L[i-1,j-1,k] & 1 \le k < j = i - 1 \\
F[i,j,k-1] - F_L[i,j-1,k] \times G_L[i-2,j-1,k-1] & 1 \le k < j = i - 2
\end{cases}
\qquad (26)
$$

$$
G[i,j,k] = \begin{cases}
A_{ji} & k = 0 \\
G[i,j,k-1] & j = 1;\ j < i;\ k = 1 \\
G[i,j,k-1] - G_L[i,j,k-1] \times F_{LL}[i-2,j,k] & 1 < i;\ j + 2 < i;\ 1 \le k \\
G[i,j,k-1] - G_L[i,j,k-1] \times F_L[i-1,j-1,k] & 1 < i = j + 1;\ 1 \le k \\
G[i,j,k-1] - G_L[i,j,k-1] \times F_L[i-2,j-1,k] & 1 < i = j + 2;\ 1 \le k \\
G[i,j,k-1] - F_L[i,j-1,k] \times G_L[i,j,k-1] & 1 \le k;\ k < j = i
\end{cases}
\qquad (27)
$$
with a schedule t(i, j, k) = ⌈i/2⌉ + j + k − 2. F and G are variables for computing L and U respectively, FL and GL are variables for propagating along the j direction, FLL and GLL are variables for the bounded broadcasts along i, and GD is a variable for propagating the diagonal elements of U.
We choose a processor projection (0, 1, 1) that maps
(i, j, k) to (i, j −k). Since the j = k plane contains division
operations, this projection makes only boundary processors
have the special division operators. Another advantage is
that the number of processors can be reduced to half because only one out of every two consecutive processors is
active. Also, it is scalable with the size of broadcasts.

Figure 11. An architecture for LU decomposition with pipeline initialization and bounded broadcast. The processor array axes are p1 = i and p2 = j − k; each PE has inputs h_in, u_in, v_in, s_in and outputs h_out, v_out, s_out, and computes: h_out := h_in; v_out := v_in; s_out_l := s_in_l − h_in · v_in_u; s_out_u := s_in_u − u_in · v_in_l.

In
other words, PEs do not require more processing power as
the broadcast size increases. The disadvantage of the projection is extra hardware for loading data.
Figure 11 shows the architecture (for n = 6) with interconnection and structure of internal processors. All the data
are first propagated to consecutive processors along the j direction; then they are propagated using bounded broadcasts along the i direction. All the connections, except u_in, propagate two values, one for each of L and U.
We have two dependence vectors (0, 0, 0)T and
(1, 0, 0)T in equations (21), (23) and (24), which conflict
with this schedule. The right-hand sides of these equations can, however, be rewritten by substituting their definitions, since F and G respect this schedule. So the total execution time is 2n + ⌈n/2⌉ − 3. An architecture with broadcasts of a bigger size is similar to this architecture.
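To see where the 2n + ⌈n/2⌉ − 3 figure comes from, one can simply evaluate the schedule over the iteration space (a quick check of ours, not from the paper; U_ij finishes at point (i, j, i − 1) and L_ij at point (i, j, j)):

```python
import math

def lu_makespan(n):
    """Makespan of t(i, j, k) = ceil(i/2) + j + k - 2 over the LU iteration
    space; the latest point is (n, n, n-1), the last step of U[n, n]."""
    t = lambda i, j, k: math.ceil(i / 2) + j + k - 2
    last = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            k = i - 1 if i <= j else j          # final step for U[i, j] / L[i, j]
            last = max(last, t(i, j, k))
    return last

print(lu_makespan(6))    # 12, i.e. 2*6 + ceil(6/2) - 3
```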
5. Related Work
The idea of systolic arrays was introduced by Kung and
Leiserson [5, 6]. Due to their massive parallelism, systolic architectures have been widely used in high performance
computing, such as image processing and scientific applications. Many researchers have designed systolic architectures for various compute-bound algorithms [7]. Quinton [10] showed that such systolic arrays can be systematically derived from a system of uniform recurrence equa-
tions.
Bounded or local broadcasts were proposed for faster architectures [13, 12, 14]. Yaacoby and Cappello derived necessary conditions on schedules when such broadcasts are
used. Independently, Risset and Robert proposed the same
concept at the source equation level while focusing on automatic synthesis of such architecture. In their paper, they
point out that complexity of PEs has to increase by a factor
of the broadcast size if a broadcast direction is parallel to
the processor allocation vector.
A number of authors have addressed the LU decomposition problem [2, 6, 1, 11]. Kung and Leiserson [6] first
introduced a two-dimensional systolic array for LU decomposition, which has a hexagonal layout of processors and
4n total execution time when the input matrix is dense. The
subsequent work by Gentleman and Kung [4] reduced this
time to 3n on n²/2 PEs. Rajopadhye [11] presents a formal derivation of this array, and Benaini and Robert [1] further decrease the number of processors to n²/4 + O(n).
Paul and Mickle [9] claim a three-dimensional architecture which solves LU-decomposition in 2n − 2 cycles. However, as shown in section 3.2, this cannot be achieved without (unbounded) broadcasting. Hence their proposed architecture is not systolic. Megson and Gaston [8] proposed a
three-dimensional architecture consisting of two planar layers for matrix triangularization. Their architecture uses a
similar idea to the one in this paper. They claim that their architecture triangulates an n × n dense matrix in 5n/2 time units.
They use broadcasts of size 2 only along the i direction,
but we have shown this cannot expedite the computation
for U . Since they choose the j direction as a projecting
vector, each processor is active at every cycle after it starts.
It results in u2n being computed at time n + 1. So, 3n
time is unavoidable. To produce the values of U at the right
time, the architecture should compute two values of U at
the same time. Hence, the PEs must have two multipliers
and subtractors, but the architecture is claimed to have only
one multiplier and subtractor. To achieve 5n/2 total execution time, the entire input matrix A must be fed into the
architecture by time n/2, and their arrays do not do this either.
6. Conclusion and Future Work
An improved systolic architecture for LU decomposition
was presented. The improvement was attained by allowing
bounded broadcasts. It also turns out that broadcasts are the only way to reduce the execution time. One interesting feature of our work is the formal reasoning about the computation for LU decomposition. In addition, our approach can be applied to forward and backward substitution as well.
Finally, we mention some open problems. For scientific
computations that have the property described in the paper,
i.e., a value depends on all the "previous" values, which leads to a triangle-shaped iteration space, our technique can be beneficial. Another possible criterion that determines the applicability of our technique is that the total execution time of systolic implementations is larger than that in the CREW PRAM model. We would like to answer the question of how to automatically detect such properties and derive improved architectures with bounded broadcasts.
References
[1] A. Benaini and Y. Robert. Spacetime-minimal systolic arrays for Gaussian elimination and the algebraic path problem. Parallel Computing, 15(1-3):211–225, 1990.
[2] E. Casseau and D. Degrugillier. A linear systolic array for
LU decomposition. In VLSI Design, pages 353–358, 1994.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, chapter 30. Algorithms for Parallel Computers. The MIT Press and McGraw-Hill Book Company,
1989.
[4] W. M. Gentleman and H. T. Kung. Matrix triangularization
by systolic arrays. In Proceedings SPIE (Society of Photo-Optical Instrumentation Engineers), Real-Time Signal Processing IV, volume 298, pages 19–26, San Diego, CA, Jan.
1981.
[5] H. T. Kung. Why systolic architectures? IEEE Computer,
15(1):37–46, 1982.
[6] H. T. Kung and C. E. Leiserson. Systolic arrays for VLSI.
In Cal Tech Conference on VLSI, Jan. 1979.
[7] W. F. McColl. Special Purpose Parallel Computing. In
A. M. Gibbons and P. Spirakis, editors, Lectures on Parallel
Computation. Proc. 1991 ALCOM Spring School on Parallel
Computation, pages 261–336. Cambridge University Press,
1993.
[8] G. M. Megson and F. M. F. Gaston. Improved matrix triangularisation using a double pipeline systolic array. Information Processing Letters, 36(2):103–109, 1990.
[9] J. M. Paul and M. H. Mickle. Three-dimensional computational pipelining with minimal latency and maximum
throughput for L-U factorization. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 45(11):1465–1475, Nov. 1998.
[10] P. Quinton. Automatic synthesis of systolic arrays from uniform recurrent equations. In Annual Symposium on Computer Architecture, pages 208–214, 1984.
[11] S. V. Rajopadhye. Systolic arrays for LU-decomposition: An
application of formal techniques. International Journal of
Computer Aided VLSI Design, 3, Jan. 1991.
[12] T. Risset. A method to synthesize modular systolic arrays
with local broadcast facility. In International Conference
on Application Specific Array Processors, pages 415–428,
Oakland, CA, Aug. 1992.
[13] T. Risset and Y. Robert. Uniform but non-local DAGs: a tradeoff between pure systolic and SIMD solutions. In International Conference on Application Specific Array Processors,
pages 296–308, Barcelona, Sept. 1991.
[14] Y. Yaacoby and P. Cappello. Bounded broadcast in systolic
arrays. International Journal of High Speed Computing,
6(2):223–237, 1994.