
Massively parallel solution of unsymmetric
linear systems using the polynomially
preconditioned GMRes method
Igor Kaporin
Dorodnitsyn Computing Center RAS, Nuclear Safety Inst. RAS (IBRAE)
Moscow, Russia [email protected]
Olga Milyukova
Keldysh Inst. Appl. Math. RAS, Moscow, Russia [email protected]
The Third Russian-Chinese Workshop
on Numerical Mathematics and Scientific Computing
11-13 September 2013, Moscow
Presentation outline
• Primary preconditioning: Approximate Block Jacobi (using ILU2 "by value")
• Secondary preconditioning: GMRes-based adaptive polynomial
• KS (Krylov subspace) implementation: GMRes vs. BiCGStab
• Comparison of the resulting solvers on a subset of the Univ. of Florida sparse matrix collection
Statement of the problem
• Linear system: $Ax = b$
• Matrix $A$: nonsingular, quasi-dissipative, so that $\mathrm{cond}(W_s^T A W_s) \le \mathrm{cond}(A)$ for any column selector $W_s = [e_{j_s(1)} | \ldots | e_{j_s(m_s)}]$ (i.e., every principal submatrix is no worse conditioned than $A$)
• Unstructured sparse: $nz(A) = O(n)$
• Approximate solution required to satisfy $\|b - Ax_k\| \le \varepsilon \|b\|$
Numerical solution method:
preconditioned KS iterations
$x_k \in x_0 + \mathrm{span}\bigl(Cr_0,\; CACr_0,\; \ldots,\; C(AC)^{k-1} r_0\bigr), \qquad r_0 = b - Ax_0$
where the preconditioner $C \approx A^{-1}$ is taken as
$C = B\,q_{d-1}(AB), \qquad B = (\mathrm{BlockDiag}(A))^{-1}$
ILU2-based Approximate Block Jacobi
$\mathrm{BlockDiag}(A) = LU + E = LU + LR_1 + R_2^T U + S,$
$R_1 = O(\tau), \quad R_2 = O(\tau), \quad S = O(\tau^2)$
$B = U^{-1} L^{-1}$
$B$ has block diagonal form, and therefore its parallel implementation is data-exchange-free (see the sketch below).
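Below is a minimal serial sketch of this construction. It is not the authors' Fortran/MPI code: SciPy's threshold ILU (spilu) stands in for the ILU2 "by value" factorization, and the hypothetical `blocks` index ranges emulate the per-process subdomains, so each block solve touches only local data.

```python
# Sketch of the approximate block-Jacobi preconditioner B = blockdiag(U_s^-1 L_s^-1).
# Assumption: scipy's spilu stands in for ILU2 "by value"; `blocks` emulates
# the per-process row ranges of the MPI code.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_jacobi_factor(A, blocks, drop_tol=1e-3):
    """Incomplete-LU factor each diagonal block A[s,s] independently."""
    A = A.tocsc()
    return [spla.spilu(A[i0:i1, i0:i1].tocsc(), drop_tol=drop_tol)
            for (i0, i1) in blocks]

def block_jacobi_apply(factors, blocks, v):
    """w = B v; the block solves are independent, hence data-exchange-free."""
    w = np.empty_like(v)
    for (i0, i1), f in zip(blocks, factors):
        w[i0:i1] = f.solve(v[i0:i1])
    return w

# Illustrative use on a small test matrix with two blocks:
A = sp.diags([-1.0, 2.2, -1.0], [-1, 0, 1], shape=(100, 100))
factors = block_jacobi_factor(A, blocks=[(0, 50), (50, 100)])
w = block_jacobi_apply(factors, [(0, 50), (50, 100)], np.ones(100))
```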
Polynomial preconditioning I
There have been many proposals for constructing the polynomial $q_{d-1}(\lambda)$ in the general unsymmetric case, typically relying on projected eigenproblems.
However, the construction presented below is considerably different.
The practical advantages it provides:
• very cheap computationally
• structurally simple and self-contained
• good efficiency in extensive testing
Earlier Research
• N.M. Nachtigal, L. Reichel, L.N. Trefethen: A hybrid GMRES algorithm for nonsymmetric linear systems. SIAM J. Matrix Anal. Appl. 13 (1992) 796-825
• W. Joubert: A robust GMRES-based adaptive polynomial preconditioning algorithm for nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 15 (1994) 427-439
• M.B. van Gijzen: A polynomial preconditioner for the GMRES algorithm. Journal of Computational and Applied Mathematics 59 (1995) 91-107
Polynomial preconditioning II
• Perform, say, $l = 2d+4$ steps of the GMRes method to obtain the Arnoldi basis, i.e. the columns of $V_l = [v_1 | \ldots | v_l]$
• Use the arising $l \times l$ Hessenberg matrix to find the Generalized Discrete Least-Squares (GDLS) polynomial
  $p_d(\lambda) = \arg\min \bigl\| p_d(AB)\,[v_1 | \ldots | v_{l-d}] \bigr\|_F^2, \qquad p_d(AB) = I - AB\,q_{d-1}(AB)$
• Use the GDLS(d)-polynomially preconditioned GMRes to finalize the solution process
Polynomial preconditioning III
Using a generalization of the standard relation
$AB\,V_k = V_{k+1}\bar{H}_k = V_k H_k + v_{k+1} h_{k+1,k}\, e_k^T$
in the form
$p_d(AB)\,V_{l-d} = V_l\, p_d(H_l)\, I_{l,\,l-d}, \qquad I_{k,\,k-d} = [e_1 | \ldots | e_{k-d}]$
we can essentially reduce the size of the above least-squares problem.
Polynomial preconditioning IV
The LS optimization problem takes the form
$p_d(\lambda) = \arg\min_{p_d(0)=1} \bigl\| p_d(H_l)\, I_{l,\,l-d} \bigr\|_F^2, \qquad p_d(t) = 1 - t\,q_{d-1}(t)$
For simplicity, we parametrize the polynomial as
$q_{d-1}(t) = \sum_{j=1}^{d} \alpha_j (1-t)^{j-1}$
(a sketch of the resulting fit follows)
Polynomial preconditioning V
The action of the preconditioned matrix on a vector,
$z = AB\,q_{d-1}(AB)\,v,$
is implemented via the properly initialized recurrence
$w := \alpha_{d-j} v + w - ABw, \qquad j = 1, \ldots, d-1$
(sketched below)
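A sketch of this recurrence, assuming the initialization $w = \alpha_d v$ and the final step $z = ABw$ that the slide leaves implicit; `matvec_AB` is an assumed callable computing $(AB)x$, i.e. one block-Jacobi solve followed by one sparse matrix-vector product.

```python
def apply_poly_prec(matvec_AB, alpha, v):
    """z = AB q_{d-1}(AB) v via a Horner scheme in powers of (I - AB)."""
    d = len(alpha)
    w = alpha[-1] * v                                 # w = alpha_d v
    for j in range(1, d):                             # j = 1, ..., d-1
        w = alpha[d - j - 1] * v + w - matvec_AB(w)   # alpha_{d-j} v + (I-AB) w
    return matvec_AB(w)                               # z = AB w
```

Each application therefore costs d products with AB and no inner products, which is what drives down the MPI_AllReduce counts on the next slides.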
Polynomial preconditioning VI
Assume that all three methods have performed the same number $k$ of MVM operations.
Then the number of MPI_AllReduce calls will be:
BiCGStab: ≈ 5k/2
GMRES(k): ≈ 2k + 1
GMR_GDLS(d): ≈ (2k/d) + 4d
Polynomial preconditioning VII
For instance, if all three methods have performed 300 MVM operations and d = 10, l = 2d+4 in GDLS, then the number of MPI_AllReduce calls will be:
BiCGStab: ≈ 750
GMRES(k): ≈ 600
GMR_GDLS(d): ≈ 100
(checked in the snippet below)
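A quick arithmetic check of these counts against the formulas on the previous slide (the function name is illustrative):

```python
def allreduce_calls(k, d):
    return {"BiCGStab":    5 * k // 2,
            "GMRES(k)":    2 * k + 1,
            "GMR_GDLS(d)": 2 * k // d + 4 * d}

print(allreduce_calls(300, 10))
# {'BiCGStab': 750, 'GMRES(k)': 601, 'GMR_GDLS(d)': 100}  (601 ~ the 600 above)
```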
Numerical testing using the public domain matrix collection
• Davis, T.A., Hu, Y.F.: The University of Florida sparse matrix collection. ACM Trans. on Math. Software 38 (2011) 1-25
• http://www.cise.ufl.edu/research/sparse/matrices
A subset of the Univ. of Florida Collection I

name         Origin of the problem
tmt_unsymm   Electromagnetic modelling
atmosmodd    Atmosphere modelling (discrete 3D elliptic boundary value problem)
memchip      Electronic circuit modelling (memory chip model)
CoupCons3D   Discretization (FD in time and FE in space) of the coupled nonstationary consolidation model
transport    Model of flow and transport in porous media
A subset of the Univ. of Florida Collection II

name         n           nz(A)        nzr_max   tau(ILU2)
tmt_unsymm     917 825    4 584 801       5       0.003
atmosmodd    1 270 432    8 814 880       7       0.03
memchip      2 707 524   13 343 948      27       0.1
CoupCons3D     416 800   17 277 420      76       0.1
transport    1 602 111   23 487 281      15       0.003
Testing on the Univ. of Florida Collection I
$x_0 = 0, \qquad b = Ax_*, \qquad x_* = \Bigl(\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}\Bigr)^T$
Stopping criterion: $\|b - Ax_k\| \le \varepsilon\,\|b\|, \qquad \varepsilon = 10^{-8}$
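A short sketch of this test setup (illustrative helper name):

```python
# x_0 = 0, exact solution x_* = (1/n, 2/n, ..., n/n)^T, b = A x_*,
# relative-residual stopping tolerance eps = 1e-8.
import numpy as np

def make_test_problem(A):
    n = A.shape[0]
    x_star = np.arange(1, n + 1) / n
    return A @ x_star, x_star            # right-hand side b and reference x_*

EPS = 1e-8                               # stop when ||b - A x_k|| <= EPS * ||b||
```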
Testing on a 4-core PC: Intel® Core(TM) i7-3770 CPU @ 3.40 GHz, 32 GB memory, Linux Ubuntu (note: l = 2d+7 in GDLS); times in seconds

name         Nproc   BiCGStab      GMRes(50)     GMRes(100)    GMRes(200)    GDLS(d=10)
                     #mvm  T_tot   #mvm  T_tot   #mvm  T_tot   #mvm  T_tot   #mvm  T_tot
tmt_unsymm      1     297  15.65   >18K   ---     822  114.30   354  75.23    278  14.91
tmt_unsymm    128    1327  45.40   >64K   ---     >7K   ---     >1K   ---     998  27.59
atmosmodd       1     113   6.17    124  14.55     91  17.20     91  17.19    128   7.16
atmosmodd     128     214   9.57    224  16.03    192  20.00    164  24.45    218   6.99
transport       1     127  42.05    160  59.26     96  55.65     96  55.61    138  43.65
transport     128     489  47.69    ---   ---     734  122.31   516  120.87   448  36.55

(">" entries: tolerance not reached within the indicated number of mvm operations)
Just to recall: typically, BiCGStab performs much better than plain GMRES(k) for all k, especially for weak (parallelizable) preconditioners.
Testing on the Univ. of Florida Collection II
For the test runs we used the MVS-100K multicore multiprocessor (Joint Supercomputing Center RAS) with 8 cores per node and 1 GB memory per core:
http://www.jscc.ru/hard/mvs100k.shtml
The code was written in Fortran/MPI; only double precision was used. Two MPI processes were assigned to each node.
Timing results for the "tmt_unsymm" problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab (times in seconds)

Nproc   T_prec   mvm(GDLS_GMR)   T(GDLS_GMR)   mvm(BiCGStab)   T(BiCGStab)
  16     0.41         655            6.27            818            7.68
  32     0.20         755            2.61           1018            4.35
  64     0.09         845            1.18           1079            1.39
 128     0.04         995            0.75           1328            1.02
 192     0.03        1125            0.56           1511            1.07
 256     0.02        1195            0.54           1673            1.29

[Figure: timing plot for "tmt_unsymm"]
Timing results for the "atmosmodd" problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab (times in seconds)

Nproc   T_prec   mvm(GDLS_GMR)   T(GDLS_GMR)   mvm(BiCGStab)   T(BiCGStab)
  16     0.28         190            1.54            188            1.66
  32     0.12         195            0.70            170            0.69
  64     0.06         205            0.29            195            0.33
 128     0.03         210            0.27            197            0.38
 192     0.02         220            0.28            194            0.44
 256     0.02         225            0.30            224            0.46

[Figure: timing plot for "atmosmodd"]
Timing results for the "memchip" problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab (times in seconds)

Nproc   T_prec   mvm(GDLS_GMR)   T(GDLS_GMR)   mvm(BiCGStab)   T(BiCGStab)
  16     0.21         170            2.27            211            3.30
  32     0.09         175            1.33            169            1.45
  64     0.05         205            0.72            219            0.92
 128     0.04         185            0.57            215            0.86
 192     0.03         175            0.58            191            1.20
 256     0.02         175            0.55            248            2.61

[Figure: timing plot for "memchip"]
Timing results for the "CoupCons3D" problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab (times in seconds)

Nproc   T_prec   mvm(GDLS_GMR)   T(GDLS_GMR)   mvm(BiCGStab)   T(BiCGStab)
  16     0.23         125            0.95            126            0.97
  32     0.10         145            0.52            149            0.52
  64     0.04         145            0.14            146            0.15
 128     0.02         135            0.08            133            0.09
 192     0.01         145            0.06            141            0.10
 256     0.01         135            0.08            155            0.15

[Figure: timing plot for "CoupCons3D"]
Timing results for the "transport" problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab (times in seconds)

Nproc   T_prec   mvm(GDLS_GMR)   T(GDLS_GMR)   mvm(BiCGStab)   T(BiCGStab)
  16     2.69         315            9.87            375           10.22
  32     1.37         355            4.48            353            4.63
  64     0.78         435            2.38            489            2.74
 128     0.45         475            0.87            551            1.15
 192     0.25         505            0.82            538            0.99
 256     0.24         525            0.55            569            0.92

[Figure: timing plot for "transport"]
Conclusion
• A new modification of the GMRes method has been developed, based on low-degree adaptive polynomial preconditioning
• The algorithm is simple and robust
• The resulting linear solver is competitive with BiCGStab on realistic problems, even with weak preconditioners (in both sequential and massively parallel computations)
References
• Kaporin, I.: High quality preconditionings of a general symmetric positive definite matrix based on its U^T U + U^T R + R^T U decomposition. Numer. Linear Algebra Appl. 5 (1998) 483-509
• Kaporin, I.: Scaling, reordering, and diagonal pivoting in ILU preconditionings. Russian Journal of Numerical Analysis and Mathematical Modelling 22 (2007) 341-375
• Kaporin, I.E., Milyukova, O.Yu.: Massively parallel implementation of the preconditioned CG method for the numerical solution of linear systems. In: Paper Collection of the Applied Optimization Problems Dept., CC RAS (V.G. Zhadan, ed.), Moscow, CC RAS Publ., 2011, pp. 127-152 (in Russian)
Thank you for your attention!
Any questions?