Reducing the effect of global communication in GMRES(m) and
CG on Parallel Distributed Memory Computers
(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993)
E. de Sturler
Faculty of Technical Mathematics and Informatics
Delft University of Technology
Mekelweg 4
Delft, The Netherlands
and
H. A. van der Vorst
Mathematical Institute
Utrecht University
Budapestlaan 6
Utrecht, The Netherlands
Abstract
In this paper we study possibilities to reduce the communication overhead introduced by
inner products in the iterative solution methods CG and GMRES(m). The performance of
these methods on parallel distributed memory machines is often limited because of the global
communication required for the inner products. We investigate two ways of improvement.
One is to assemble the results of a number of local inner products of a processor and to accumulate them collectively. The other is to try to overlap communication with computation.
The matrix vector products may also introduce some communication overhead, but for many
relevant problems this involves communication with a few nearby processors only, and this
does not necessarily degrade the performance of the algorithm.
Key words. parallel computing, distributed memory computers, conjugate gradient methods, performance, GMRES, modified Gram-Schmidt.
AMS(MOS) subject classification. 65Y05, 65F10, 65Y20.
1 Introduction
The Conjugate Gradients (CG) method [9] and the GMRES(m) method [11] are widely used methods for the iterative solution of specific classes of linear systems. The time-consuming kernels in these methods are: inner products, vector updates, and matrix vector products (including preconditioning operations). In many situations, especially when the matrix operations are well-structured, these operations are suited for implementation on vector computers and shared memory parallel computers [8].

This author wishes to acknowledge Shell Research B.V. and STIPT for the financial support of his research.
For parallel distributed memory machines the picture is entirely different. In general the vectors are distributed over the processors, so that even when the matrix operations can be implemented efficiently by parallel operations, we cannot avoid the global communication required for inner product computations. These global communication costs become relatively more and more important when the number of parallel processors is increased, and thus they have the potential to affect the scalability of the algorithms in a very negative way [5]. This aspect has received much attention and several approaches have been suggested to improve the performance of these algorithms.
For CG the approaches come down to reformulating the orthogonalization part of the algorithm, so that the required inner products can be computed in the same phase of the iteration step (see, e.g., [4, 10]), or to combining the orthogonalization for several successive iteration steps, as in the s-step methods [2]. The numerical stability of these approaches is a major point of concern.
For GMRES(m) the approach comes down to some variant of the s-step methods [1, 3]. After having generated basis vectors for part of the Krylov subspace by some suitable recurrence relation, they have to be orthogonalized. One often resorts to cheap but potentially unstable methods like Gram-Schmidt orthogonalization.
In the present study we investigate other ways to reduce the global communication overhead due to the inner products. Our approach is to identify operations that may be executed while communication takes place, since our aim is to overlap communication with computation. For CG this is done by rescheduling the operations, without changing the numerical stability of the method [7]. For GMRES(m) it is achieved by reformulating the modified Gram-Schmidt orthogonalization step [5, 6]. For GMRES(m) we also exploit the possibility of packing the results of the local inner products of a processor in one message and accumulating them collectively.
We believe that our findings are relevant for other Krylov subspace methods as well, since methods like BiCG, and its variants CGS, BiCGSTAB, and QMR, have much in common with CG from the implementation point of view. Likewise, the communication problems with GMRES(m) are representative for the problems in methods like ORTHODIR, GENCG, FOM, and ORTHOMIN.
We have carried out our experiments on a 400-processor Parsytec Supercluster at the Koninklijke/Shell-Laboratorium in Amsterdam. The processors are connected in a fixed 20 × 20 mesh, of which arbitrary submeshes can be used. Each processor is a T800-20 transputer. The transputer supports only nearest neighbor synchronous communication; more complicated communication has to be programmed explicitly. The communication rate is fast compared to the flop rate, but by current standards the T800 is a slow processor. Another feature of the transputer is the support of time-shared execution of multiple `parallel' CPU-processes on a single processor, which facilitates the implementation of programs that switch between tasks when necessary (on an interrupt basis), e.g., between communication and computation. Finally, transputers have the possibility of concurrent communication and computation. As a result it is possible to overlap computation and communication on a single processor.
The program that runs on each processor consists of two processes that run time-shared: a computation process and a communication process. The computation process functions as the master. If at some point communication is necessary, the computation process sends the data to the communication process (on the same processor) or requests the data from the communication process, which then handles the actual communication. This organization permits the computation processes on different processors to work asynchronously even though the actual communication is synchronous. The communication process is given the higher priority, so that if there is something to communicate this is started as soon as possible.
2 The algorithms for GMRES(m) and CG
Preconditioned CG:

    start:
        x_0 = initial guess; r_0 = b - A x_0;
        p_{-1} = 0; β_{-1} = 0;
        Solve for w_0 in K w_0 = r_0;
        ρ_0 = (r_0, w_0)
    iterate:
        for i = 0, 1, 2, ... do
            p_i = w_i + β_{i-1} p_{i-1}
            q_i = A p_i
            α_i = ρ_i / (p_i, q_i)
            x_{i+1} = x_i + α_i p_i
            r_{i+1} = r_i - α_i q_i;  compute ||r||
            if accurate enough then quit
            Solve for w_{i+1} in K w_{i+1} = r_{i+1}
            ρ_{i+1} = (r_{i+1}, w_{i+1})
            β_i = ρ_{i+1} / ρ_i
        end

Figure 1: The preconditioned CG algorithm

GMRES(m):

    start:
        x_0 = initial guess; r_0 = b - A x_0; v_1 = r_0 / ||r_0||_2
    iterate:
        for j = 1, 2, ..., m do
            v̂_{j+1} = A v_j
            for i = 1, 2, ..., j do
                h_{ij} = (v̂_{j+1}, v_i)
                v̂_{j+1} = v̂_{j+1} - h_{ij} v_i
            end
            h_{j+1,j} = ||v̂_{j+1}||_2
            v_{j+1} = v̂_{j+1} / h_{j+1,j}
        end
        form the approximate solution:
            x_m = x_0 + V_m y_m, where y_m minimizes || ||r_0||_2 e_1 - H_m y ||_2, y ∈ R^m
    restart:
        compute r_m = b - A x_m; if satisfied then stop,
        else x_0 = x_m; v_1 = r_m / ||r_m||_2; goto iterate

Figure 2: The GMRES(m) algorithm
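To make Figure 1 concrete, the following is a minimal sequential numpy rendering of the preconditioned CG algorithm; it is a sketch for illustration only (the function names and the Jacobi-style preconditioner in the usage note are ours, not from the paper). On a distributed memory machine each of the inner products below is the ddot operation discussed in this section.

```python
import numpy as np

def pcg(A, b, apply_Kinv, tol=1e-8, maxit=500):
    """Preconditioned CG as in Figure 1; apply_Kinv(r) returns w with K w = r."""
    x = np.zeros_like(b)              # x_0 = initial guess
    r = b - A @ x                     # r_0 = b - A x_0
    p = np.zeros_like(b)              # p_{-1} = 0
    beta = 0.0                        # beta_{-1} = 0
    w = apply_Kinv(r)                 # solve K w_0 = r_0
    rho = r @ w                       # rho_0 = (r_0, w_0)
    for i in range(maxit):
        p = w + beta * p              # p_i = w_i + beta_{i-1} p_{i-1}
        q = A @ p                     # q_i = A p_i
        alpha = rho / (p @ q)         # alpha_i = rho_i / (p_i, q_i)
        x = x + alpha * p
        r = r - alpha * q
        if np.linalg.norm(r) < tol:   # 'compute ||r||; if accurate enough then quit'
            return x, i + 1
        w = apply_Kinv(r)             # solve K w_{i+1} = r_{i+1}
        rho_new = r @ w               # rho_{i+1} = (r_{i+1}, w_{i+1})
        beta = rho_new / rho          # beta_i = rho_{i+1} / rho_i
        rho = rho_new
    return x, maxit
```

For a quick test one can pass, e.g., apply_Kinv = lambda r: r / A.diagonal() (plain Jacobi); this is only a stand-in for the block preconditioners used later in the paper.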
In this section we will discuss the time-consuming kernels in CG and GMRES(m): the vector update (daxpy), the preconditioner, the matrix vector product, and the inner product (ddot); see Figures 1 and 2.
Because the results of the inner products are needed on all processors, the Hessenberg matrix H_m (see Figure 2) is available on each processor. Hence, the computation of y_m can be done on each processor. This is often efficient, because computing it on only a single processor would introduce a synchronization point, and the other processors would have to wait for the result. However, if the size of the reduced system is large compared to the local number of unknowns, the computation might be expensive enough to make its distribution and parallel solution worthwhile. We have not pursued this idea.
The parallel implementation of the vector update (daxpy) poses no problem since it involves
only local computation.
In this paper we restrict ourselves to problems for which the parallelism in the matrix vector product does not pose serious problems. That is, our model problems have a strong data locality, which is typical for many finite difference and finite element problems. A suitable domain decomposition approach preserves this locality more or less independently of the number of processors, so that the matrix vector product requires only neighbor-neighbor communication or communication with only a few nearby processors. This could be overlapped with computations for the interior of the domain, but that is relatively less important, since the number of boundary operations is in general an order of magnitude smaller than the number of interior operations (this is the surface-to-volume effect).

[Figure 3, a diagram of the step-wise accumulation over the processor grid, is not reproduced in this text version; it illustrates that no outgoing step can be taken before all incoming steps have taken place.]

Figure 3: accumulation over the processor grid
The communication overhead introduced by the preconditioner is obviously strongly dependent on the selected preconditioner. Popular preconditioners on sequential computers, like the (M)ILU variants, are highly sequential or introduce irregular communication patterns (as in the hyperplane approach, see [8]), and therefore these are not suitable. Obviously we prefer preconditioners which only require a limited amount of communication, for instance comparable to or less than that of the matrix vector product. On the other hand we would like to retain the iteration-reducing effect of the preconditioners, and these considerations are often in conflict. In our study we have avoided discussing the convergence accelerating effects of the preconditioner and we have used a simple Incomplete Block Jacobi preconditioner with blocks corresponding to the domains. In this case we have no communication at all for the preconditioner.
Since the vectors are distributed over the processor grid, the inner product (ddot) is computed in two steps. All processors start to compute the local inner product in parallel. After that, the local inner products are accumulated on one `central' processor and broadcast. We will describe the implementation in a little more detail for a 2-dimensional mesh of processors, see Figure 3. The processors on each processor line in the x-direction accumulate their results along this line on an `accumulation' processor at the same place on each processor line: each processor waits for the result from its neighbor further from the accumulation processor, then adds this result to its own partial result and sends the new result along. Then the `accumulation' processors do a similar accumulation in the y-direction. The broadcast consists of the reverse process. Each processor is active in only a limited number of steps and will be idle for the rest of the time. So, here are opportunities to make it available for other tasks.
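On a present-day message-passing system this accumulate-and-broadcast pair is exactly what an allreduce provides. A minimal mpi4py sketch of the distributed ddot, given here only for illustration (the function and variable names are ours):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def ddot(x_local, y_local):
    """Global inner product of two vectors distributed over the processors."""
    local = float(np.dot(x_local, y_local))   # local inner product, no communication
    return comm.allreduce(local, op=MPI.SUM)  # accumulation followed by broadcast
```

On the transputer grid of this paper the accumulation and broadcast were programmed explicitly along the rows and columns of the mesh, as described above.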
The communication time of an accumulation or a broadcast is of the order of the diameter of
the processor grid. This means that for an increasing number of processors the communication
time for the inner products increases as well, and hence this is a potential threat to the scalability
of the method. Indeed, if the global communication for the inner products is not overlapped, it
often becomes a bottleneck on large processor grids, as will be shown later.
In [5] a simple performance model based on these considerations is introduced, which clearly shows quantitatively the dramatic influence of the global communication for inner products over large processor grids on the performance of Krylov subspace methods. That model also shows that the degradation of performance depends on the relative costs of local computation and global communication. This means that results analogous to those presented in Sections 5 and 7 will be seen for larger problems on processor configurations with relatively faster computational speed (this is the current trend in parallel computers). Moreover, if the problem sizes increase proportionally to the number of processors, the local computation time remains the same but the global communication cost increases. This emphasizes the necessity to reduce the effect of the global communication costs.
3 Parallel performance of GMRES(m) and CG
We will now briefly describe a model for the computation time, the communication cost, and the communication time of the main kernels in Krylov subspace methods. We use the term communication cost to indicate the wall clock time spent in communication that is not overlapped with useful computation (so that it really contributes to the wall clock time). The term communication time is used to indicate the wall clock time of the whole communication. In the case of a nonoverlapped communication, the communication time and the communication cost are the same. Our quantitative formulas are not meant to give very accurate predictions of the exact execution times, but they will be used to identify the bottlenecks and to evaluate improvements. Several of the parameters that we introduce may vary over the processor grid. In that case the value to use is either a maximum or an average, whichever is the most appropriate.
Computation time.
We will only be concerned with the local computation time, since the cost of communication and synchronization is modeled explicitly. The computation time for the solution of the Hessenberg system is neglected in our model. For a vector update (daxpy) or an inner product (ddot) the computation time is given by 2 t_fl N/P, where N/P is the local number of unknowns of a processor and t_fl is the average time for a double precision floating point operation. The computation time for the (sparse) matrix vector product is given by (2 n_z - 1) t_fl N/P, where n_z is the average number of non-zero elements per row of the matrix. As preconditioner we chose Block-(M)ILU variants without fill-in, of the form L D^{-1} U for GMRES(m) and L L^T for CG. For CG we have scaled the system so that diag(L) = I. The computation time of the preconditioner for GMRES(m) is (2 n_z + 1) t_fl N/P, and for CG it is 2(n_z - 1) t_fl N/P.
A full GMRES(m) cycle has approximately (1/2)(m^2 + 3m) inner products, the same number of vector updates, and (m + 1) multiplications with the matrix and the preconditioner, if one computes the exact residual at the end of each cycle. The complete (local) computation time for the GMRES(m) algorithm is given by the equation:
\[ T^{gmr}_{\mathrm{cmp}_1} = \bigl( 2(m^2 + 3m) + 4 n_z (m + 1) \bigr) \, \frac{N}{P} \, t_{fl}. \tag{1} \]
A single iteration of CG has three inner products, the same number of vector updates, and one multiplication with the matrix and the preconditioner. The complete (local) computation time is given by the equation:
\[ T^{cg}_{\mathrm{cmp}} = (9 + 4 n_z) \, \frac{N}{P} \, t_{fl}. \tag{2} \]
Communication cost.
As we mentioned already, the solution of the Hessenberg system in GMRES(m) and the vector update are local and involve no communication cost.
The most important communication is for the global inner products. If we do not overlap this global communication then we are concerned with the wall clock time for the entire, global operation and not with the local part of a single processor. We note that we can view the time for the accumulation and broadcast either as the communication time for the entire operation, or as a small local communication time and a long delay because of global synchronization. In the first interpretation we would consider overlapping the global communication, whereas in the second one we would consider removing the delays by reducing the number of synchronization points. We will take the first point of view.
Consider a processor grid with P = p^2 processors. With p_d = 2⌈p/2⌉ (≈ √P), the maximum distance to the `most central' processor over the processor grid is p_d. Let the communication start-up time be given by t_s and the word (32 bits) transmission time by t_w. The time to communicate one double precision number between two neighboring processors is then (t_s + 3 t_w), since a double precision number takes two words and we need a one-word header to accompany each message. Hence, the global accumulation and broadcast of one double precision number takes 2 p_d (t_s + 3 t_w), and the global accumulation and broadcast of a vector of k double precision numbers takes 2 p_d (t_s + (2k + 1) t_w).
For GMRES(m) in the nonoverlapped case the communication time for the modified Gram-Schmidt algorithm (with (1/2)(m^2 + 3m) accumulations and broadcasts) is
\[ T^{gmr}_{a+b} = (m^2 + 3m) \, p_d \, (t_s + 3 t_w), \tag{3} \]
where `a + b' indicates the accumulation and broadcast.
For CG in the nonoverlapped case the communication time of the three inner products per iteration is
\[ T^{cg}_{a+b} = 6 \, p_d \, (t_s + 3 t_w). \tag{4} \]
The communication for the matrix vector product is necessary for the exchange of so-called boundary data: sending boundary data to other processors and receiving boundary data from other processors. Assume that each processor has to send and to receive n_m messages, which each take d steps of nearest neighbor communication from source to destination, and let the number of boundary data elements on a processor be given by n_b. The total number of words that have to be communicated (sent and received) is then 2(2 n_b + n_m) per processor. For GMRES(m) the communication time of the (m + 1) matrix vector products is
\[ T^{gmr}_{bde} = 2 d \, n_m (m+1) \, t_s + 2 d \, (m+1)(2 n_b + n_m) \, t_w, \tag{5} \]
where `bde' refers to the boundary exchange. For CG the communication time of one matrix vector product is
\[ T^{cg}_{bde} = 2 d \, n_m \, t_s + 2 d \, (2 n_b + n_m) \, t_w. \tag{6} \]
Note that we have assumed no overlap. For preconditioners that only need boundary exchanges, we could have used the same formulas with a different choice of the parameter values if necessary, but in our experiments we have used only local block preconditioners (without communication).
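For reference, equations (1)-(6) are easy to collect in a small routine. The sketch below does so; the numerical values in the example are the machine and problem parameters of Table 3 and Section 5, repeated here only for illustration.

```python
def t_cmp_gmres(m, n_z, N, P, t_fl):
    """Local computation time of one GMRES(m) cycle, equation (1)."""
    return (2 * (m**2 + 3 * m) + 4 * n_z * (m + 1)) * (N / P) * t_fl

def t_cmp_cg(n_z, N, P, t_fl):
    """Local computation time of one CG iteration, equation (2)."""
    return (9 + 4 * n_z) * (N / P) * t_fl

def t_ab_gmres(m, p_d, t_s, t_w):
    """Inner-product communication per GMRES(m) cycle, non-overlapped, equation (3)."""
    return (m**2 + 3 * m) * p_d * (t_s + 3 * t_w)

def t_ab_cg(p_d, t_s, t_w):
    """Inner-product communication per CG iteration, non-overlapped, equation (4)."""
    return 6 * p_d * (t_s + 3 * t_w)

def t_bde(n_matvecs, n_m, n_b, d, t_s, t_w):
    """Boundary exchange for n_matvecs matrix vector products, equations (5)/(6)."""
    return 2 * d * n_m * n_matvecs * t_s + 2 * d * n_matvecs * (2 * n_b + n_m) * t_w

if __name__ == "__main__":
    t_s, t_w, t_fl = 5.30e-6, 4.80e-6, 3.00e-6       # Table 3, in seconds
    N, n_z, d, n_m = 100 * 100, 5, 1, 4              # 100 x 100 model problem
    m, P, p_d, n_b = 30, 400, 20, 20                 # 20 x 20 processor grid
    print(t_cmp_gmres(m, n_z, N, P, t_fl))           # ~0.195 s
    print(t_ab_gmres(m, p_d, t_s, t_w))              # ~0.390 s
    print(t_bde(m + 1, n_m, n_b, d, t_s, t_w))       # ~0.014 s
```

The three printed numbers add up to about 0.60 s, which is the GMRES(m) estimate reported for the 20 × 20 grid in Table 4 below.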
4 Communication overhead reduction in GMRES(m)
From the expressions (1), (3) and (5) we conclude that the communication cost for GMRES(m) is of the order O(m²√P), and for large processor grids this will become a bottleneck. Moreover, in the standard implementation we cannot reduce these costs by accumulating multiple inner products together (saving on start-up times), nor can we overlap this expensive communication with computation (reducing the runtime lost in communication). The problem stems from the fact that the modified Gram-Schmidt orthogonalization of a single vector against some set of vectors and its subsequent normalization is an inherently sequential process. However, if the modified Gram-Schmidt orthogonalization of a set of vectors is considered there is no such problem, since the orthogonalizations of all intermediate vectors on the previously orthogonalized vectors are independent. Therefore, we can compute several or all of the local inner products first and then accumulate the subresults collectively.
Suppose the set of vectors v_1, v̂_2, v̂_3, ..., v̂_{m+1} has to be orthogonalized, where ||v_1||_2 = 1. The modified Gram-Schmidt process can be implemented as sketched in Figure 4.

    for i = 1, 2, ..., m do
        orthogonalize v̂_{i+1}, ..., v̂_{m+1} on v_i
        v_{i+1} = v̂_{i+1} / ||v̂_{i+1}||_2
    end

Figure 4: a block-wise modified Gram-Schmidt orthogonalization

This reduces the number of accumulations to only m, instead of (1/2)(m^2 + 3m) for the usual implementation of GMRES(m), but the length of the messages has increased. In this way, start-up time is saved by packing small messages, corresponding to one block of orthogonalizations, into one larger message. Moreover, we also reduce the amount of data transfer because we have fewer message headers.
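A sequential numpy sketch of this block-wise process is given below; V holds v_1, v̂_2, ..., v̂_{m+1} as columns, and the function name and return values are ours. The point is that at step i all inner products against v_i form one batch, which in a parallel implementation becomes a single accumulation.

```python
import numpy as np

def blockwise_mgs(V):
    """Block-wise modified Gram-Schmidt as in Figure 4 (sequential sketch)."""
    V = V.copy()
    k = V.shape[1]                          # k = m + 1 vectors, v_1 already normalized
    H = np.zeros((k, k))
    H[0, 0] = 1.0
    for i in range(k - 1):
        h = V[:, i] @ V[:, i + 1:]          # all inner products against v_i at once
        H[i, i + 1:] = h                    # -> one accumulation in the parallel case
        V[:, i + 1:] -= np.outer(V[:, i], h)
        nrm = np.linalg.norm(V[:, i + 1])   # v-hat_{i+1} is now fully orthogonalized
        H[i + 1, i + 1] = nrm
        V[:, i + 1] /= nrm
    return V, H                             # orthonormal basis and upper triangular H
```

The upper triangular matrix H returned here plays the role of H_{m+1} in equations (11)-(13) below.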
Instead of computing all local inner products in one block and accumulating these partial results only once for the whole block, it is preferable to split each step into two blocks of orthogonalizations, since this offers the possibility to overlap with communication. This overlap is achieved by performing the accumulation and broadcast of the local inner products of the first block concurrently with the computation of the local inner products of the second block, and performing the accumulation and broadcast of the local inner products of the second block concurrently with the vector updates of the first block, see Figure 5.
Note that the computation time for this approach is equal to that for the standard modified Gram-Schmidt algorithm.
    for i = 1, 2, ..., m do
        split v̂_{i+1}, ..., v̂_{m+1} into two blocks
        compute local inner products (LIPs) of block 1
        { accumulate LIPs block 1 }  ||  { compute LIPs block 2 }
        update v̂_{i+1}, compute the LIP for ||v̂_{i+1}||_2, place this LIP into block 2
        { accumulate LIPs block 2 }  ||  { update vectors block 1 }
        update vectors block 2
        normalize v̂_{i+1}
    end

Figure 5: the implementation of the modified Gram-Schmidt process (operations joined by `||' are executed concurrently)

For the parallel `overlapped' implementation of the modified Gram-Schmidt algorithm given in Figure 5, we will neglect potential effects of overlap of the communication with computation on a single processor. We will only consider the overlap with useful computational work of the time that a processor is not active in the global accumulation and broadcast. If we assume that sufficient computational work can be done to completely fill this time, the communication cost T^{gmr}_{a+b}, see (3), reduces to only the communication time spent locally by a processor. This `local' communication cost for the accumulation and broadcast of a vector of k double precision numbers is given by 4 t_s + 4(2k + 1) t_w, for a receive and a send in the accumulation phase and a receive and a send in the broadcast phase, if the processor only participates in the accumulation along the x-direction, and it is given by 8 t_s + 8(2k + 1) t_w if the processor also participates in the accumulation along the y-direction. The latter case is obviously the most important, since all processors finish the modified Gram-Schmidt algorithm more or less at the same time. The communication cost of the entire parallel modified Gram-Schmidt algorithm (mgs) now becomes
\[ T_{l,\mathrm{mgs}} = 16 m \, t_s + 8 (m^2 + 5m) \, t_w. \tag{7} \]
In general we may not have enough computational work to overlap all the communication time in a global communication process. For the wall clock time of (parallel) operations, it is the longest time that matters. Here it is the global communication time for the modified Gram-Schmidt algorithm (mgs):
\[ T_{g,\mathrm{mgs}} = 4 m \, p_d \, t_s + 2 (m^2 + 5m) \, p_d \, t_w. \tag{8} \]
Since the communication is partly overlapped, the communication cost is in general significantly lower than the communication time, and then it may still be better described by (7) instead of (8).
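On a machine with non-blocking collectives the two-block scheme of Figure 5 maps naturally onto `start reduction, compute, wait'. The fragment below is only a structural sketch of that mapping in mpi4py (the helper name and the commented pseudo-steps are ours; on the transputer grid of this paper the accumulation was programmed explicitly):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def start_accumulate(local_values):
    """Begin a non-blocking accumulate-and-broadcast of one block of LIPs."""
    send = np.ascontiguousarray(local_values, dtype='d')
    recv = np.empty_like(send)
    request = comm.Iallreduce(send, recv, op=MPI.SUM)
    return request, recv

# Inside step i of Figure 5 one would then do, schematically:
#   req1, glob1 = start_accumulate(lips_block_1)
#   ...compute the local inner products of block 2...          (overlaps req1)
#   req1.Wait()
#   ...update v-hat_{i+1} with glob1, append its local norm part to block 2...
#   req2, glob2 = start_accumulate(lips_block_2)
#   ...update the remaining block-1 vectors with glob1...       (overlaps req2)
#   req2.Wait()
#   ...update the block-2 vectors and normalize v-hat_{i+1} with glob2...
```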
Two important facts are highlighted by expressions (3), (7) and (8). First, assuming sufficient computational work, the contribution of start-up times to the communication cost is reduced from O(m²√P) in the standard GMRES(m) (3) to O(m) using the parallel modified Gram-Schmidt algorithm (7). Especially for machines with relatively high start-up times this is important. In fact, if the start-ups dominate the communication cost, then we can reduce this contribution by a factor of two by the algorithm given in Figure 4 (even if we neglect the overlap). Second, assuming sufficient computational work, the communication cost no longer depends on the size of the processor grid: instead of being of the order of the diameter of the processor grid p_d, it is now more or less constant. If we lack sufficient computational work the communication cost is described by (8) minus the time for the overlapped computation.
    v̂_1 = v_1 = r / ||r||_2
    for i = 1, 2, ..., m do
        v̂_{i+1} = v̂_i - d_i A v̂_i
    end

Figure 6: Generation of a polynomial basis for the Krylov subspace
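A direct numpy sketch of Figure 6 (the function name is ours; the shifts d_i would in practice come from the Leja-ordered Ritz values of a first standard GMRES(m) cycle, as discussed below):

```python
import numpy as np

def polynomial_basis(A, r, d):
    """Return the columns v_1, v-hat_2, ..., v-hat_{m+1} of Figure 6."""
    m = len(d)
    V = np.empty((len(r), m + 1))
    V[:, 0] = r / np.linalg.norm(r)                    # v-hat_1 = v_1 = r / ||r||_2
    for i in range(m):
        V[:, i + 1] = V[:, i] - d[i] * (A @ V[:, i])   # (I - d_i A) v-hat_i
    return V
```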
In order to be able to use this parallel modified Gram-Schmidt algorithm in GMRES(m), a basis for the Krylov subspace has to be generated first. The idea to generate a basis for the Krylov subspace first and then to orthogonalize this basis was already suggested for the CG algorithm, referred to as s-step CG, in [2] for shared (hierarchical) memory parallel vector processors. In [2] it is also reported that the s-step CG algorithm may converge slowly due to numerical instability for s > 5.
In the parGMRES(m) algorithm stability seems to be much less of a problem, since each vector is explicitly orthogonalized against all the other vectors, and we generate a polynomial basis for the Krylov subspace such as to minimize the condition number; see [1], where the Krylov subspace is generated first to exploit higher level BLAS in the orthogonalization, and [6]. The basis vectors for the Krylov subspace v̂_i are generated as indicated in Figure 6, where the parameters d_i are used to get the condition number of the matrix [v_1, v̂_2, ..., v̂_{m+1}] sufficiently small. Bai, Hu, and Reichel [1] discuss a strategy for this. Their idea is to use one cycle of standard GMRES(m). Then the eigenvalues of the resulting Hessenberg matrix, which approximate those of A, are used in the so-called Leja ordering as the parameters d_i^{-1} in the rest of the modified GMRES(m) cycles. Their examples indicate that the convergence of such a GMRES(m) is (virtually) the same as that of standard GMRES(m). This is also borne out by our experience. Therefore, in the next section we limit our experiments to the evaluation of a single GMRES(m) cycle.
Our parallel computation of the Krylov subspace basis requires m extra daxpys. It is obvious from (1) that this cost is negligible. However, for completeness we give the computation time with these extra daxpys,
\[ T^{gmr}_{\mathrm{cmp}_2} = \bigl( 2m(m + 4) + 4 n_z (m + 1) \bigr) \, \frac{N}{P} \, t_{fl}. \tag{9} \]
Because we generate the Krylov subspace basis first and then orthogonalize it, the Hessenberg matrix that we obtain from the inner products is not V_{m+1}^T A V_m, as in the standard GMRES(m) algorithm, and therefore we need to solve the least squares problem in a slightly different way.
Define v̂_1 = v_1 = ||r||_2^{-1} r, and generate the other basis vectors as v̂_{i+1} = (I - d_i A) v̂_i, for i = 1, ..., m. This gives the following relation:
\[ [\hat v_2, \hat v_3, \ldots, \hat v_{m+1}] = \hat V_m - A \hat V_m D_m, \tag{10} \]
where D_m = diag(d_i) and V̂_m is the matrix with the vectors v̂_i as its columns. This relation between vectors and matrices composed from these vectors will be used throughout this discussion.
The parallel modified Gram-Schmidt orthogonalization gives the orthogonal set of vectors {v_1, ..., v_{m+1}}, for which we have
\[ v_{j+1} = h_{j+1,j+1}^{-1} \Bigl( \hat v_{j+1} - \sum_{i=1}^{j} h_{i,j+1} v_i \Bigr) \quad \text{for } j = 1, \ldots, m, \tag{11} \]
where h_{ij} is defined by (but computed differently)
\[ h_{ij} = \begin{cases} (v_i, \hat v_j), & i \le j, \\ 0, & i > j. \end{cases} \tag{12} \]
Notice the subtle difference with the definition of H_m in the standard implementation of GMRES(m). Here the matrix H_{m+1} is upper triangular. Furthermore, as long as h_{ii} ≠ 0 the matrix H_i is nonsingular, whereas h_{ii} = 0 indicates a lucky breakdown.
We will further assume, without loss of generality, that h_{ii} ≠ 0, for i = 1, ..., m+1. Let h_i denote the i-th column of H_{m+1}. From equations (11) and (12) it follows that
\[ \hat V_i = V_i H_i \quad \text{for } i = 1, \ldots, m+1. \tag{13} \]
Equation (10) can be rewritten as
\[ \hat V_m - [\hat v_2, \ldots, \hat v_{m+1}] = A \hat V_m D_m = A V_m H_m D_m. \tag{14} \]
Define Ĥ_m = [h_1, h_2, ..., h_m] - [h_2, h_3, ..., h_{m+1}], so that Ĥ_m is an upper Hessenberg matrix of rank m, since h_{ii} ≠ 0, for i = 1, ..., m+1. Substituting this in (14) finally leads to
\[ V_{m+1} \hat H_m = A V_m H_m D_m. \tag{15} \]
Using this expression the least squares problem can be solved in the same way as for standard GMRES(m):
\[ \min_{y} \| r - A V_m y \|_2 = \min_{\hat y} \| r - A V_m H_m D_m \hat y \|_2, \quad \text{where } H_m D_m \hat y = y. \tag{16} \]
Because H_m and D_m are nonsingular, the latter by definition, H_m D_m ŷ = y is always well-defined. Combining (15) and (16) yields
\[ \min_{\hat y} \bigl\| r - V_{m+1} \hat H_m \hat y \bigr\|_2 = \min_{\hat y} \bigl\| \, \|r\|_2 e_1 - \hat H_m \hat y \, \bigr\|_2. \tag{17} \]
The additional computational work in this approach is only O(m²) and therefore negligible. We will refer to this adapted version of GMRES(m) as parGMRES(m).
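Assembled in code, the solution step of parGMRES(m) then looks roughly as follows (a sketch with names of our own choosing; V and H are the outputs of the orthogonalization, for example the blockwise_mgs sketch above, and d holds the shifts of Figure 6):

```python
import numpy as np

def pargmres_solution(x0, r, V, H, d):
    """Form x_m from (16)-(17): V is n x (m+1) orthonormal, H is (m+1) x (m+1)
    upper triangular, d are the m shifts d_i, and r = b - A x0."""
    m = len(d)
    H_hat = H[:, :m] - H[:, 1:m + 1]                    # H-hat_m, upper Hessenberg
    rhs = np.zeros(m + 1)
    rhs[0] = np.linalg.norm(r)                          # ||r||_2 e_1
    y_hat = np.linalg.lstsq(H_hat, rhs, rcond=None)[0]  # least squares problem (17)
    y = H[:m, :m] @ (np.asarray(d) * y_hat)             # y = H_m D_m y-hat, see (16)
    return x0 + V[:, :m] @ y                            # x_m = x_0 + V_m y
```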
5 Performance of GMRES(m) and parGMRES(m)
Before we discuss the experiments below, we present a short theoretical analysis. The communication time for the exchange of boundary data and the computation time for the m additional vector updates in the parGMRES(m) implementation will be neglected in this analysis, because they are relatively unimportant. The runtime of a GMRES(m) cycle on P ≥ 4 processors is then given by T_P = T^{gmr}_{cmp_1} + T^{gmr}_{a+b}, see (1) and (3):
\[ T_P = \bigl( 2(m^2 + 3m) + 4 n_z (m+1) \bigr) \, t_{fl} \, \frac{N}{P} + (m^2 + 3m)(t_s + 3 t_w) \sqrt{P}. \tag{18} \]
This equation shows that for sufficiently large P the communication will dominate. Following the analysis in [5] we introduce the value P_max as the number of processors that minimizes the runtime of GMRES(m). We have studied the performance of GMRES(m) and parGMRES(m) for numbers of processors less than or approximately equal to P_max. Note that for parGMRES(m) we can improve the performance further with more processors than P_max, because it has a lower communication cost.
The cost of communication is reduced in parGMRES(m) in two steps. First, we reduce the communication time by accumulating and broadcasting multiple inner products in groups. This reduces the communication time from T^{gmr}_{a+b} to T_{g,mgs}, see (3) and (8). Second, we overlap the non-local part of the remaining communication time with half the computation in the modified Gram-Schmidt algorithm, see Figure 5. The length of the overlap then determines the performance of parGMRES(m) and the improvement over GMRES(m). Therefore we introduce the value P_ovl, which is the number of processors for which the overlap is exact. The performance and the improvement are then related to whether P ≤ P_ovl or P > P_ovl and to how large P_ovl is relative to P_max, because the fraction of the runtime spent in communication increases for increasing P, see (18).
We will now give relations for P_max and P_ovl. The minimization of (18) gives
\[ P_{\max} = \left[ \frac{\bigl( 4(m^2 + 3m) + 8 n_z (m+1) \bigr) t_{fl} N}{(m^2 + 3m)(t_s + 3 t_w)} \right]^{2/3} \tag{19} \]
and the efficiency E_P = T_1/(P T_P) for P_max processors is given by E_{P_max} = 1/3, where T_1 = T^{gmr}_{cmp_1}. This means that (2/3) T_{P_max} is spent in communication, because in this model efficiency is lost only through communication. For P_ovl we have that the (total) communication time T_{g,mgs}, see (8), is equal to the sum of the overlapping computation time, (m^2 + 2m) t_fl N/P, and the local communication time T_{l,mgs}, see (7):
\[ \bigl( 4 m t_s + (2m^2 + 10m) t_w \bigr) \sqrt{P_{ovl}} = (m^2 + 2m) \, t_{fl} \, \frac{N}{P_{ovl}} + 16 m t_s + (8m^2 + 40m) t_w. \tag{20} \]
If P ≤ P_ovl then the communication cost is reduced to T_{l,mgs}, see (7). This means that the cost of start-ups is reduced by a factor of ((m+3)/16)√P and the cost of data transfer by a factor of (3/8)√P. Furthermore, as long as P < P_ovl an increase in the number of processors will not result in an increase of the communication cost, and hence the efficiency remains constant. If P > P_ovl then the overlap is no longer complete and the communication cost is given by the communication time minus the computation time of the overlapping computation: T_{g,mgs} - (m^2 + 2m) t_fl N/P. The runtime is then given by
\[ \tilde T_P = \bigl( (m^2 + 4m) + 4 n_z (m+1) \bigr) \, t_{fl} \, \frac{N}{P} + \bigl( 4 m t_s + (2m^2 + 10m) t_w \bigr) \sqrt{P}. \tag{21} \]
For P > P_ovl we see that the efficiency decreases again, because the communication time increases and the computation time of the overlap decreases.
Equation (20) gives
\[ P_{ovl} \approx \left[ \frac{(m^2 + 2m) \, t_{fl}}{4 m t_s + (2m^2 + 10m) t_w} \, N \right]^{2/3}. \tag{22} \]
Comparing (19) with (22), we see that if t_s dominates the communication, that is t_s ≫ t_w, then P_ovl > P_max and we always have P ≤ P_ovl, so that we can overlap all communication after the reduction of start-ups. This means that we can reduce the runtime by almost a factor of three. For transputers we have t_s ≈ t_w, and comparing (19) and (22) we see that P_ovl < P_max. One can prove that the improvement of parGMRES(m) compared to GMRES(m), T_P/T̃_P, as a function of P is either constant or a strictly increasing or decreasing function. The maximum improvement is therefore found for either P = P_ovl or P = P_max.
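These two quantities are easy to evaluate numerically for the parameter values of Table 3 and the 100 × 100 model problem used in the experiments below; the sketch solves (20) directly by bisection rather than using the approximation (22).

```python
def p_max_gmres(m, n_z, N, t_s, t_w, t_fl):
    """Equation (19): the processor count that minimizes the runtime (18)."""
    num = (4 * (m**2 + 3 * m) + 8 * n_z * (m + 1)) * t_fl * N
    den = (m**2 + 3 * m) * (t_s + 3 * t_w)
    return (num / den) ** (2.0 / 3.0)

def p_ovl_gmres(m, N, t_s, t_w, t_fl):
    """Solve equation (20) for P_ovl by bisection."""
    lhs_coef = 4 * m * t_s + (2 * m**2 + 10 * m) * t_w          # T_g,mgs per sqrt(P)
    def residual(P):
        overlap = (m**2 + 2 * m) * t_fl * N / P                 # overlapping computation
        local = 16 * m * t_s + (8 * m**2 + 40 * m) * t_w        # T_l,mgs, equation (7)
        return lhs_coef * P**0.5 - overlap - local
    lo, hi = 1.0, 1.0e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if residual(mid) < 0.0 else (lo, mid)
    return lo

t_s, t_w, t_fl, N, n_z = 5.30e-6, 4.80e-6, 3.00e-6, 10000, 5
for m in (30, 50):
    print(m, round(p_max_gmres(m, n_z, N, t_s, t_w, t_fl)),  # about 400 and 375
             round(p_ovl_gmres(m, N, t_s, t_w, t_fl)))       # about 236 and 244
```

The resulting values are the P_max and P_ovl quoted in the discussion of the experiments below.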
For P = P_ovl, the communication time is strongly reduced. Furthermore, (19) and (22) indicate that for m large enough P_ovl ≈ (1/2)^{2/3} P_max, which means that the efficiency at P_ovl is less than about 50%. Therefore we may expect an improvement by about a factor of two.
For P = P_max the runtime is given by (21). When t_s ≈ t_w we get T_{g,mgs} ≈ (1/2) T^{gmr}_{a+b}, and we may say that due to the overlap the cost of computation is reduced by (m^2 + 2m) t_fl N/P, that is approximately by a factor of
\[ \frac{2m^2 + 6m + 4 n_z (m+1)}{m^2 + 4m + 4 n_z (m+1)} \approx \frac{2m + 6 + 4 n_z}{m + 4 + 4 n_z}, \]
which is a little less than a factor of two. Hence we may expect an improvement by a factor of about two in this case also.
We now discuss our experimental observations on the parallel performance of GMRES(m) and the adapted algorithm parGMRES(m) on the 400-transputer machine. We will only consider the performance of one (par)GMRES(m) cycle, because both algorithms take about the same number of iterations, which generally leads to the same number of GMRES(m) cycles, with only a possible difference in the last cycle. The difference may be that GMRES(m) stops before it completes the full m iterations of the last cycle. This gives on average a difference of only half a GMRES(m) cycle, which is often more than compensated by the much better performance of parGMRES(m) for the other cycles.
In our experiments we used square processor grids (minimal diameter), and this is optimal for GMRES(m). For other processor grids the degradation of performance for GMRES(m) will be even worse. The parGMRES(m) algorithm is much less sensitive to the diameter of the processor grid.
We have solved a convection-diffusion problem discretized by finite volumes over a 100 × 100 grid, resulting in the familiar five-diagonal matrix with a tridiagonal block structure, corresponding to the 5-point star. This relatively small problem size was chosen because, for processor grids of increasing size, it very well shows the degradation of performance for GMRES(m) and the large improvements of parGMRES(m) over GMRES(m). As we will see, the parGMRES(m) variant has much better scaling properties than GMRES(m).
The measured runtimes for a single (par)GMRES(m) cycle are listed in Table 1 for m = 30 and m = 50. For m = 30 we have that P_max ≈ 400 and P_ovl ≈ 236. For m = 50 we have P_max ≈ 375 and P_ovl ≈ 244. We give speed-ups and efficiencies in Table 2. These are calculated from the measured runtimes of GMRES(m) and parGMRES(m) and an estimated sequential runtime for GMRES(m), because the problem was too large to run on a single processor. The estimated T_1 is the net computation time derived from (1). We mention that for CG (see Section 7) the measured T_1 is approximately 9% less than the estimated T_1, but this is not necessarily the case for GMRES(m) too. The difference between the estimated sequential runtime and the measured one for CG is probably due to a simpler implementation (e.g., less indirect addressing and copying of buffers) for the sequential program, which results in a higher (average) flop rate.

  processor grid | GMRES(m), m=30 (s) | parGMRES(m), m=30 (s) | GMRES(m), m=50 (s) | parGMRES(m), m=50 (s)
  10 × 10        | 1.01               | 0.813                 | 2.47               | 1.93
  14 × 14        | 0.738              | 0.448                 | 1.90               | 1.05
  17 × 17        | 0.667              | 0.389                 | 1.66               | 0.891
  20 × 20        | 0.682              | 0.365                 | 1.75               | 0.851

Table 1: measured runtimes for GMRES(m) and parGMRES(m)

  processor grid | GMRES(m), m=30: E (%), S | parGMRES(m), m=30: E (%), S | GMRES(m), m=50: E (%), S | parGMRES(m), m=50: E (%), S
  10 × 10        | 77.2,  77.2              | 95.9,  95.9                 | 76.8,  76.8              | 98.2,  98.2
  14 × 14        | 53.9,  106.              | 88.8,  174.                 | 50.9,  99.8              | 92.1,  181.
  17 × 17        | 40.5,  117.              | 69.4,  201.                 | 39.5,  114.              | 73.6,  213.
  20 × 20        | 28.6,  114.              | 53.4,  214.                 | 27.1,  108.              | 55.7,  223.

Table 2: Efficiencies and speed-ups for GMRES(m) and parGMRES(m) based on measured runtimes and an estimated sequential runtime for GMRES(m)
The runtime for GMRES(m) is reduced by approximately 25% when increasing the number of processors from 100 to 196. When increasing this from 100 to 289, the runtime reduces only by some 35%. When we further increase the number of processors to 400, the runtime is already larger than for 289 processors, which is in agreement with the previous discussion, because P ≈ P_max for m = 30 and P > P_max for m = 50. Hence the cost of communication spoils the performance of GMRES(m) completely for large P.
On the other hand, for parGMRES(m) the runtime reduction when increasing from 100 to 196 processors is approximately 45%, where the upper bound is 49%, so this is almost optimal. Such a speed-up shows that the efficiency remains almost constant for this increase in the number of processors, see also Table 2. This is to be expected because we have P < P_ovl, so that any increase in the communication time of the inner products is more than compensated by the overlapping computation. On 289 processors the runtime is about 53% of the runtime on 100 processors, which is still quite good. If we continue to increase the number of processors, we see that for 400 processors the runtime is not much better than for 289 processors, although it is still decreasing. At this point the speed-up for parGMRES(m) levels off, because there is insufficient computational work to overlap the communication (P > P_ovl).
A direct comparison between the runtimes of GMRES(m) and parGMRES(m) shows that, for 100 processors, GMRES(m) is about 25% slower than parGMRES(m). However, for 196 processors this has increased already to 65% and 81% for m = 30 and m = 50, respectively. From then on the relative difference increases more gradually to a maximum of about a factor of two for P_max processors. These results are very much in agreement with our theoretical expectations. Note that although the maximum is reached for P_max, the improvement is already substantial for 196 processors, which is near P_ovl.
In Table 4 we give the estimated runtimes from expressions (1), (3), and (5) for GMRES(m) and from (5), (7), and (9) for parGMRES(m). Table 3 gives a short overview of the relevant parameters and their meaning (see Section 3). If the value of a parameter is fixed, its value is given as well. The parameters d, n_z and n_m are derived from our model problem and implementation; the parameters t_s, t_w and t_fl have been determined experimentally.

  parameter       | meaning
  t_w  (4.80 μs)  | communication word rate (32-bit word)
  t_s  (5.30 μs)  | communication start-up time
  t_fl (3.00 μs)  | average time for a single floating point operation
  d    (1)        | (max) number of communication steps in boundary exchange
  n_m  (4)        | number of messages (to send and receive) in boundary exchange
  n_z  (5)        | average number of non-zero elements per row in the matrix
  p_d             | maximum distance to the `most central' processor
  n_b             | (max) number of boundary data elements on a processor
  m               | size of the Krylov space over which (par)GMRES(m) minimizes

Table 3: parameters and meaning

  processor grid | p_d | N_l | n_b | GMRES(m), m=30 (s) | parGMRES(m), m=30 (s) | GMRES(m), m=50 (s) | parGMRES(m), m=50 (s)
  10 × 10        | 10  | 100 | 40  | 1.00               | 0.867                 | 2.46               | 2.08
  14 × 14        | 14  |  50 | 30  | 0.683              | 0.462                 | 1.71               | 1.11
  17 × 17        | 18  |  35 | 24  | 0.641              | 0.339                 | 1.63               | 0.812
  20 × 20        | 20  |  25 | 20  | 0.600              | 0.257                 | 1.54               | 0.615

Table 4: Estimated runtimes for GMRES(m) and parGMRES(m)

A comparison of the estimates with the measured execution times indicates that the formulas are quite accurate, except for the 400-processor case. The first reason for this discrepancy is that for both algorithms the neglected costs become more important when the size of the local problem is small. These neglected costs are due to, e.g., copying of buffers for communication and indirect
addressing using exterior data, the organization of the communication, and the solution of the least squares problem. For the parGMRES(m) algorithm there is a second and more important reason, viz. due to the small size of the local problem we can no longer assume an almost complete overlap of the communication in the modified Gram-Schmidt algorithm (P > P_ovl). This is illustrated in Table 5, which gives estimates for the two overlapping parts given in (20). We refer to the sum of the local communication time and half of the computation time in the modified Gram-Schmidt algorithm as comp, and to the total communication time for the accumulation as comm. Already for the 17 × 17 grid we do not have a complete overlap, although the overlap will still be good. For the 20 × 20 processor grid an overlap of about 55% is already the maximum. Obviously, for a larger problem this would improve.
6 Communication overhead reduction in CG
For a reduction in the communication overhead for preconditioned CG we follow the approach suggested in [7]. In that approach the operations are rescheduled to create more opportunities for overlap. This leads to an algorithm (parCG) as the one given in Figure 7, where we have assumed that the preconditioner K can be written as K = L L^T. For a discussion of the ideas behind this scheme we refer to [7].
For our purposes it is relevant to point at the inner products at lines (1), (2) and (3). The communication for these inner products is overlapped by the computational work in the following line. We split the preconditioner to create an overlap for the inner products (1) and (3), and we have extra overlap possibilities since the inner product (2) is followed by the update for x corresponding to the previous iteration step.

  processor grid | p_d | N_l | comp, m=30 (s) | comm, m=30 (s) | comp, m=50 (s) | comm, m=50 (s)
  10 × 10        | 10  | 100 | 0.331          | 0.107          | 0.889          | 0.275
  14 × 14        | 14  |  50 | 0.187          | 0.150          | 0.500          | 0.384
  17 × 17        | 18  |  35 | 0.144          | 0.193          | 0.383          | 0.494
  20 × 20        | 20  |  25 | 0.115          | 0.214          | 0.305          | 0.549

Table 5: Comparison of estimated costs for overlapping computation and `global' communication of the modified Gram-Schmidt implementation in parGMRES(m)
Under the assumption of a complete overlap for the time that a processor is not active in the accumulation and broadcast of the inner products, and following the derivation of (7), the communication cost for the three inner products in a parCG iteration reduces from T^{cg}_{a+b}, see (4), to the communication time spent locally by a processor:
\[ T^{cg}_{l,a+b} = 24 (t_s + 3 t_w). \tag{23} \]
Therefore, the communication cost is reduced from O(√P) to O(1), which means that (in theory) the communication cost is independent of the processor grid size.
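To make the rescheduling of Figure 7 concrete, the following sequential numpy sketch spells out the loop, with comments marking where the three reductions would be started and overlapped on a distributed machine (the function name is ours, and the triangular factor L is any stand-in for the preconditioner, not the block factorization used in the experiments).

```python
import numpy as np

def parcg(A, b, L, tol=1e-8, maxit=500):
    """Rescheduled preconditioned CG as in Figure 7, with K = L L^T."""
    x_prev = np.zeros_like(b)                 # x_{-1} = x_0 = initial guess
    r = b - A @ x_prev                        # r_0
    p_prev = np.zeros_like(b)                 # p_{-1} = 0
    alpha_prev, rho_prev = 0.0, 1.0           # rho_{-1} = 1
    s = np.linalg.solve(L, r)                 # s = L^{-1} r_0
    for i in range(maxit):
        rho = s @ s                           # (1): start accumulating rho_i ...
        w = np.linalg.solve(L.T, s)           # ... overlapped with w_i = L^{-T} s
        beta = rho / rho_prev
        p = w + beta * p_prev
        q = A @ p
        gamma = p @ q                         # (2): start accumulating gamma ...
        x = x_prev + alpha_prev * p_prev      # ... overlapped with the postponed x update
        alpha = rho / gamma
        r = r - alpha * q
        rnorm = np.linalg.norm(r)             # (3): start accumulating ||r|| ...
        s = np.linalg.solve(L, r)             # ... overlapped with s = L^{-1} r_{i+1}
        if rnorm < tol:
            return x + alpha * p, i + 1       # x_{i+1} = x_i + alpha_i p_i
        x_prev, p_prev = x, p
        alpha_prev, rho_prev = alpha, rho
    return x + alpha * p, maxit
```

For a quick check one can take, e.g., L = np.linalg.cholesky(np.diag(np.diag(A))), a plain Jacobi preconditioner; since parCG is only a rescheduling, the iterates coincide with those of the standard preconditioned CG of Figure 1.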
7 Performance of CG variants
We will follow closely the lines set forth in the analysis for (par)GMRES(m) in Section 5. The communication time for the exchange of boundary data will be neglected in this analysis, because it is relatively unimportant for our kind of model problems. The problem-dependent parameters and the machine-dependent parameters have the same values as in the discussion for GMRES(m), see Table 3.
The runtime for a CG iteration with P ≥ 4 processors is given by T_P = T^{cg}_{cmp} + T^{cg}_{a+b}, see (2) and (4):
\[ T_P = (9 + 4 n_z) \, t_{fl} \, \frac{N}{P} + 6 (t_s + 3 t_w) \sqrt{P}. \tag{24} \]
This expression shows that for sufficiently large P the communication time will dominate. Here we can also define a P_max as the number of processors that gives the minimal runtime, and a P_ovl as the number of processors for which the (total) communication time in the inner products T^{cg}_{a+b} (see (4)) is equal to the sum of the computation time of the preconditioner and one vector update (2 n_z t_fl N/P), and the local communication time T^{cg}_{l,a+b} (see (23)).
\[ P_{\max} = \left[ \frac{(18 + 8 n_z) \, N \, t_{fl}}{6 (t_s + 3 t_w)} \right]^{2/3} \tag{25} \]
For P = P_max processors the efficiency E_P = T_1/(P T_P) is again E_{P_max} = 1/3, where T_1 = T^{cg}_{cmp}; therefore, the communication time is (2/3) T_{P_max}. The value for P_ovl is given by
\[ 6 (t_s + 3 t_w) \sqrt{P_{ovl}} = 2 n_z \, t_{fl} \, \frac{N}{P_{ovl}} + 24 (t_s + 3 t_w). \tag{26} \]
parCG:

    x_{-1} = x_0 = initial guess; r_0 = b - A x_0
    p_{-1} = 0; β_{-1} = 0
    s = L^{-1} r_0
    ρ_{-1} = 1
    for i = 0, 1, 2, ... do
        ρ_i = (s, s)                          (1)
        w_i = L^{-T} s
        β_{i-1} = ρ_i / ρ_{i-1}
        p_i = w_i + β_{i-1} p_{i-1}
        q_i = A p_i
        γ = (p_i, q_i)                        (2)
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        α_i = ρ_i / γ
        r_{i+1} = r_i - α_i q_i
        compute ||r||                         (3)
        s = L^{-1} r_{i+1}
        if accurate enough then
            x_{i+1} = x_i + α_i p_i;  quit
    end

Figure 7: The parCG algorithm
For P ≤ P_ovl the communication cost is reduced from T^{cg}_{a+b} to T^{cg}_{l,a+b}, which gives a reduction by a factor of (1/4)√P. For P > P_ovl the communication cost is given by T^{cg}_{a+b} - 2 n_z t_fl N/P.
A comparison of (25) and (26) shows that P_ovl < P_max. Even though the preconditioner is strongly problem- and implementation-dependent, this holds in general, because for P = P_ovl the communication time is equal to a part of the computation time, whereas for P = P_max the communication time is already twice the computation time. This leads to three phases in the performance of parCG. Let a be the computation time, for α ∈ [0, 1] let αa be the computation time for the `potential' overlap, and let c be the communication time. Then the runtime of CG is given by a + c, whereas for parCG it is given by (1 - α)a + max(αa, c). For increasing P, a decreases and c increases, as described above. For small P (c ≪ αa, P ≪ P_ovl), all communication can be overlapped but the communication time is relatively unimportant. For medium P (c ≈ αa, P ≈ P_ovl), the communication time is more or less in balance with the computation time for the overlap and the improvement is maximal, see below. For large P (c ≫ αa, P ≫ P_ovl), the communication time will be dominant, and then we will not have enough computational work to overlap it sufficiently. It is easy to prove that the fraction
\[ \frac{a + c}{(1 - \alpha) a + \max(\alpha a, c)} \]
is maximal if αa = c, that is for P = P_ovl, and then the improvement is
\[ \frac{a + c}{(1 - \alpha) a + \max(\alpha a, c)} = \frac{a + \alpha a}{a} = 1 + \alpha. \tag{27} \]
a
Hence, the maximum improvement of parCG over CG is determined by this fraction . The
larger this fraction is, the larger is the maximum improvement by parCG. If the computation
16
time of the preconditioner is dominant, e.g. when nz is large and when we use preconditioners
from a factorization with ll in, then < 1, and we can expect an improvement by a factor of
1
2nz
two. In our model we have = 9+4
nz 2 , so that for nz large enough we can expect a reduction
by a factor of 1:5. For our model problem we have nz = 5, so that the improvement is limited
to factor of 1:33.
We will now discuss the results for the parallel implementation of the standard CG algorithm and the adapted version parCG on the 400-transputer machine for a model problem. Since the algorithms are equivalent, they take the same number of iterations, and therefore we will only consider the runtime for one single iteration.
We have solved a diffusion problem discretized by finite volumes over a 100 × 100 grid, resulting in a symmetric positive definite five-diagonal matrix (corresponding to the 5-point star). We have solved this relatively small problem on processor grids of increasing size. This problem size was chosen because for processor grids of increasing size it shows the three different phases mentioned before.
  processor grid | CG: T_P (ms), S_P, E_P (%) | parCG: T_P (ms), S_P, E_P (%) | diff (%)
  10 × 10        | 10.7,   73.6,  73.6        | 10.2,   77.3,  77.3           |  4.90
  14 × 14        |  6.90,  114.,  58.3        |  5.84,  135.,  68.8           | 18.2
  17 × 17        |  6.09,  129.,  44.8        |  5.29,  149.,  51.5           | 15.1
  20 × 20        |  5.59,  141.,  35.2        |  5.04,  156.,  39.1           | 10.9

Table 6: Measured runtimes for CG and parCG, speed-up and efficiency compared to the sequential runtime of CG
Table 6 gives the measured runtimes for one iteration step, the speed-ups, and the efficiencies for both CG and parCG for several processor grids. The speed-ups and efficiencies are computed relative to the measured sequential runtime of the CG iteration, which is given by T_1 = 0.788 s. Although CG has far fewer inner products than GMRES(m) per iteration (i.e., per matrix vector product), we observe that the performance levels off fairly quickly. This is in agreement with the findings reported in [5], which show that such behavior is to be expected for any Krylov subspace method. For our test problem we have that P_max ≈ 600 and P_ovl ≈ 228.
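These values follow from (25) and (26) with the parameters of Table 3, analogously to the GMRES(m) sketch in Section 5; again a small numerical check:

```python
def p_max_cg(n_z, N, t_s, t_w, t_fl):
    """Equation (25)."""
    return (((18 + 8 * n_z) * N * t_fl) / (6 * (t_s + 3 * t_w))) ** (2.0 / 3.0)

def p_ovl_cg(n_z, N, t_s, t_w, t_fl):
    """Solve equation (26) for P_ovl by bisection."""
    def residual(P):
        return (6 * (t_s + 3 * t_w) * P**0.5
                - 2 * n_z * t_fl * N / P - 24 * (t_s + 3 * t_w))
    lo, hi = 1.0, 1.0e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if residual(mid) < 0.0 else (lo, mid)
    return lo

t_s, t_w, t_fl, N, n_z = 5.30e-6, 4.80e-6, 3.00e-6, 10000, 5
print(round(p_max_cg(n_z, N, t_s, t_w, t_fl)))   # about 600
print(round(p_ovl_cg(n_z, N, t_s, t_w, t_fl)))   # about 228
```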
For the processor grids that we used we have P < P_max, so that the runtime decreases for increasing numbers of processors, as predicted by our analysis. Note also the large relative difference between P_ovl and P_max, compared to the relatively small difference for GMRES(m). This indicates that for this test problem, with a small n_z and with a relatively cheap preconditioner, we have a small α. Hence, the improvement in the runtime will be limited, as is illustrated in Table 6.
We see that the parCG algorithm leads to better speed-ups than the standard CG algorithm, especially on the 14 × 14 and 17 × 17 processor grids, where the number of processors is closest to P_ovl. Moreover, for parCG we observe that if the number of processors is increased from 100 to 196, the efficiency remains almost constant, and the runtime is reduced by a factor of about 1.75 (against a maximum of 1.96). Just as for GMRES(m) this is predicted by our analysis, because P < P_ovl, so that the increase in the communication time is masked by the overlapping computation.
  processor grid | CG (ms) | parCG (ms) | non-overlapped communication (ms) | parCG and non-ovl. communication (ms)
  10 × 10        | 10.7    | 10.0       | 0.00                              | 10.0
  14 × 14        |  6.66   |  5.48      | 0.094                             |  5.57
  17 × 17        |  5.71   |  4.06      | 0.603                             |  4.66
  20 × 20        |  5.00   |  3.11      | 1.14                              |  4.25

Table 7: Estimated runtimes for CG and parCG, a correction of the estimate for parCG, and the corrected estimate for parCG
The initial decrease of efficiency when going from 1 to 100 processors is due to a substantial initial overhead. This parallel overhead is also illustrated by the fact that the estimated sequential runtime from T^{cg}_{cmp}, see (2), is 0.870 s, which is about 10% larger than the measured sequential runtime. The three phases in the performance of parCG are illustrated by the difference in runtime between CG and parCG. For small processor grids the communication time is not very important and we see only small differences. For processor grids with P near P_ovl, the communication and the overlapping computation are in balance and we see an increase in the runtime difference. For larger processor grids we can no longer overlap the communication, which dominates the runtime, to a sufficient degree, and we see the differences decrease again.
We cannot quite match the improvements for parGMRES(m), but on the other hand it is important to note that the improvement for parCG comes virtually for free. Besides, for GMRES(m) we have the possibility to combine messages as well as to overlap communication, whereas for CG we can only exploit overlap of communication, unless we combine multiple iterations. Expression (27) indicates that for our problem we cannot expect much more: α ≈ 1/3, so that the maximum improvement is approximately 33%. This estimate is rather optimistic in view of the large initial parallel overhead. When the computation time for the preconditioner is large or even dominant (α ≈ 1) then the improvement may also be large. This would be the case if n_z is large or when (M)ILU preconditioners with fill-in are used. For many problems this may be a realistic assumption.
Another important observation is that as long as P > P_ovl, we can increase the computation time of the preconditioner without increasing the runtime of the iteration, because the preconditioner is overlapped with the accumulation and distribution. That means that we can decrease the number of iterations without increasing the runtime of an iteration.
In Table 7 we show estimates for the execution times of the CG algorithm and the parCG algorithm. The total cost for CG is computed from (2), (6), and (4), and for parCG we have used (2), (6), and (23). Just as for GMRES(m), the estimates for CG are relatively accurate, except for the 20 × 20 case. Again, this is probably caused by neglected costs in the implementation that become more important when the local problem size becomes small. For parCG as well as for parGMRES(m) there is also a discrepancy between the measured execution time and the estimated time, due to an incomplete overlap.
When we cannot overlap all communication, we can correct the estimate for the runtime of parCG by adding an estimate for the non-overlapped communication time. These corrections can be computed from Table 8 and from the local communication time for one accumulation and broadcast (0.158 ms). Note that we need computation time for three inner products in one iteration (see (23)). For example, for the 14 × 14 processor grid the computation time of the vector update is not sufficient to overlap the non-local communication time for the accumulation and distribution, so we subtract the computation time for the vector update and the local communication time from the time for one accumulation and distribution: 0.552 ms - (0.300 + 0.158) ms = 0.094 ms. The corrections and the corrected estimates of the runtime of parCG are given in Table 7. The corrected estimates for the execution time for parCG appear again to be relatively accurate, except for the 20 × 20 processor grid.
  processor grid | p_d | N_l | L^{-1} or L^{-T} (ms) | vector update (μs) | 1 accumulation and broadcast (μs) | computation (ms)
  10 × 10        | 10  | 100 | 1.20                  | 600.               | 394.                              | 8.70
  14 × 14        | 14  |  50 | 0.600                 | 300.               | 552.                              | 4.35
  17 × 17        | 18  |  35 | 0.420                 | 210.               | 709.                              | 3.05
  20 × 20        | 20  |  25 | 0.300                 | 150.               | 788.                              | 2.18

Table 8: Estimated computation time of the parts of the CG algorithm that are used for the overlap, and the computation time
The three phases are nicely illustrated by the results in Table 8. On processor grids that are relatively small (10 × 10) the cost of communication is not dominant. On larger processor grids (17 × 17 and 20 × 20) the relative cost of the global communication increases and becomes dominant, but then there is insufficient computation for overlap. The table shows that the best improvements by the new algorithm are indeed obtained if the global communication time is balanced by the local computation time that is used for overlap (14 × 14).
Finally, Table 9 shows estimates for the costs of computation, for communication in the exchange of boundary data, for global communication in the inner products, and for the local part of the accumulation and broadcast if the non-local communication is completely overlapped. This reveals the relative importance of the different parts depending on the number of processors. Most important is the decrease in the communication time for the matrix vector product (T^{cg}_{bde}) when the number of processors increases, whereas the communication time for the accumulation and broadcast at the same time increases. This is similar to the situation for GMRES(m). On large processor grids the communication in the matrix vector products for CG is much less than the communication in the accumulation and distribution. For GMRES(m), where the number of inner products is larger than the number of matrix vector products by a factor of about m/2, this is even more the case.
  processor grid | p_d | N_l | n_b | T^{cg}_{cmp} (ms) | T^{cg}_{bde} (μs) | T^{cg}_{a+b} (ms) | T^{cg}_{l,a+b} (μs)
  10 × 10        | 10  | 100 | 40  | 8.70              | 849.              | 1.18              | 473.
  14 × 14        | 14  |  50 | 30  | 4.35              | 657.              | 1.66              | 473.
  17 × 17        | 18  |  35 | 24  | 3.05              | 542.              | 2.13              | 473.
  20 × 20        | 20  |  25 | 20  | 2.18              | 465.              | 2.36              | 473.

Table 9: Estimated runtimes for parts of CG and parCG
8 Conclusions
We have studied the implementation of GMRES(m) and CG for distributed memory parallel computers. These algorithms represent two different classes of Krylov subspace methods, and their parallel properties are quite representative. The experiments show how the global communication in the inner products degrades the performance on large processor grids, as is indicated by our model in Section 3 and the discussions in Sections 5 and 7. We have considered alternative algorithms for GMRES(m) and CG in which the actual cost for global communication is decreased by reducing synchronization, reducing start-up times, and overlapping communication with computation.
Our experiments clearly indicate this to be a successful approach. For GMRES(m) we have reduced the communication cost by reducing the contribution of start-ups from O(m²√P) to O(m) and by reducing the contribution of data transfer from O(m²√P) to O(m²). This results in a total reduction of the runtime by about a factor of two. For CG we can only overlap the communication. In theory this may reduce the communication cost by a factor of O(√P), but for our model problem we cannot achieve this performance improvement. The (maximum) improvement for parCG over CG depends mainly on the computation time for the preconditioner relative to the total computation time. For problems in which the cost of the preconditioner is more dominant, that is for a large average number of non-zero coefficients per row in the matrix and fill-in in the preconditioner, better results may be expected, and such a situation is not unrealistic. Moreover, in the case that we cannot overlap all communication in parCG, we can use a more expensive preconditioner that may reduce the number of iterations without increasing the runtime for a single iteration. If we want to further improve the performance of CG, then polynomial preconditioners or methods which optimize over a larger subspace per step might be considered.
For (full) GMRES or GMRES(m) with very large m, it is of course possible to generate the Krylov subspace block-wise, and we can then exploit similar ideas as presented here. This may help to reduce the disadvantage that parGMRES(m) in its last cycle may compute a few unnecessary basis vectors.
Although only illustrated and supported by a model problem of fixed size, our analysis (which seems to give relatively accurate predictions) indicates that also for other problem sizes and processor grids our conclusions remain valid.
References
[1] Z. Bai, D. Hu, and L. Reichel. A Newton basis GMRES implementation. Technical Report 91-03, University of Kentucky, 1991.
[2] A. T. Chronopoulos and C. W. Gear. s-Step iterative methods for symmetric linear systems. J. Comput. Appl. Math., 25:153-168, 1989.
[3] A. T. Chronopoulos and S. K. Kim. s-Step Orthomin and GMRES implemented on parallel computers. Technical Report 90/43R, UMSI, Minneapolis, 1990.
[4] E. F. D'Azevedo and C. H. Romine. Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors. Technical Report ORNL/TM-12192, Oak Ridge National Lab., 1992.
[5] E. de Sturler. A parallel restructured version of GMRES(m). Technical Report 91-85, Delft University of Technology, Delft, 1991.
[6] E. de Sturler. A parallel variant of GMRES(m). In R. Vichnevetsky and J. H. H. Miller, editors, Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, pages 682-683. IMACS, Criterion Press, Dublin, 1991.
[7] J. W. Demmel, M. T. Heath, and H. A. van der Vorst. Parallel numerical linear algebra. In Acta Numerica, Vol. 2, Cambridge University Press, New York, 1993.
[8] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
[9] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand., 49:409-436, 1952.
[10] G. Meurant. Numerical experiments for the preconditioned conjugate gradient method on the CRAY X-MP/2. Technical Report LBL-18023, University of California, Berkeley, CA, 1984.
[11] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856-869, 1986.