Massively parallel solution of unsymmetric linear systems using a polynomially preconditioned GMRes method

Igor Kaporin
Dorodnitsyn Computing Center RAS; Nuclear Safety Institute RAS (IBRAE), Moscow, Russia

Olga Milyukova
Keldysh Institute of Applied Mathematics RAS, Moscow, Russia

The Third Russian-Chinese Workshop on Numerical Mathematics and Scientific Computing, 11-13 September 2013, Moscow

Presentation outline
• Primary preconditioning: approximate Block Jacobi (using ILU2 “by value”)
• Secondary preconditioning: GMRes-based adaptive polynomial
• Krylov subspace (KS) implementation: GMRes vs. BiCGStab
• Comparison of the resulting solvers on a subset of the Univ. of Florida sparse matrix collection

Statement of the problem
• Linear system: Ax = b
• Matrix: nonsingular and quasi-dissipative, i.e. cond(W_s^T A W_s) <= cond(A) for any column selection W_s = [e_{j_s(1)} | ... | e_{j_s(m_s)}]
• Unstructured sparse: nz(A) = O(n)
• Approximate solution: ||b - A x_k|| <= ε ||b||

Numerical solution method: preconditioned KS iterations
• x_k ∈ x_0 + span(C r_0, CAC r_0, ..., C (AC)^{k-1} r_0), where r_0 = b - A x_0
• C ≈ A^{-1}: C = B q_{d-1}(AB), B = (BlockDiag(A))^{-1}

ILU2-based approximate Block Jacobi
• BlockDiag(A) = LU + E, E = L R_1 + R_2^T U + S, where R_1 = O(τ), R_2 = O(τ), S = O(τ^2)
[Figure: block-diagonal sparsity pattern of BlockDiag(A)]
• B = U^{-1} L^{-1}
• B has block-diagonal form, and therefore its parallel implementation is data-exchange-free

Polynomial preconditioning I
There have been many proposals for constructing the polynomial q_{d-1}(λ) in the general unsymmetric case, typically relying on projected eigenproblems. However, the construction presented below is considerably different. Its practical advantages:
• very cheap computationally
• structurally simple and self-contained
• good efficiency in extensive testing

Earlier research
• N.M. Nachtigal, L. Reichel, and L.N. Trefethen: A hybrid GMRES algorithm for nonsymmetric linear systems. SIAM J. Matrix Anal. Appl. 13 (1992) 796-825
• W. Joubert: A robust GMRES-based adaptive polynomial preconditioning algorithm for nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 15 (1994) 427-439
• M.B. van Gijzen: A polynomial preconditioner for the GMRES algorithm. J. Comput. Appl. Math. 59 (1995) 91-107

Polynomial preconditioning II
• Perform, say, l = 2d+4 steps of the GMRes method to obtain the Arnoldi basis, i.e. the columns of V_l = [v_1 | ... | v_l]
• Use the arising l x l Hessenberg matrix H_l to find the Generalized Discrete Least-Squares (GDLS) polynomial
    p_d(λ) = arg min ||p_d(AB) [v_1 | ... | v_{l-d}]||_F^2,   where p_d(AB) = I - AB q_{d-1}(AB)
• Use the GDLS(d)-polynomially preconditioned GMRes to finalize the solution process

Polynomial preconditioning III
Using a generalization of the standard Arnoldi relation
    AB V_k = V_{k+1} \bar{H}_k = V_k H_k + v_{k+1} h_{k+1,k} e_k^T
in the form
    p_d(AB) V_{l-d} = V_l p_d(H_l) I_{l,l-d},   where I_{k,k-d} = [e_1 | ... | e_{k-d}],
we can essentially reduce the size of the above least-squares problem.

Polynomial preconditioning IV
The LS optimization problem takes the form
    p_d(λ) = arg min_{p_d(0)=1} ||p_d(H_l) I_{l,l-d}||_F^2,   p_d(t) = 1 - t q_{d-1}(t).
For simplicity, we parametrize the polynomial as
    q_{d-1}(t) = Σ_{j=1}^{d} α_j (1-t)^{j-1}.

Polynomial preconditioning V
The action of the preconditioned matrix on a vector, z = AB q_{d-1}(AB) v, is implemented via the properly initialized recurrence
    w := α_d v;   w := α_{d-j} v + w - ABw,  j = 1, ..., d-1;   z := ABw.
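The authors' solver is written in Fortran/MPI (see the testing slides below); as a minimal serial illustration of slides II-V, here is a NumPy sketch of the GDLS coefficient fit and of the application recurrence. The helper names (arnoldi, gdls_coefficients, apply_poly_preconditioned) and the dense-matrix setting are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def arnoldi(A_mul, r0, l):
        # l Arnoldi steps with modified Gram-Schmidt; returns V (n x (l+1))
        # and the (l+1) x l Hessenberg matrix; assumes no breakdown (generic r0).
        n = r0.shape[0]
        V = np.zeros((n, l + 1))
        H = np.zeros((l + 1, l))
        V[:, 0] = r0 / np.linalg.norm(r0)
        for k in range(l):
            w = A_mul(V[:, k])
            for i in range(k + 1):
                H[i, k] = V[:, i] @ w
                w = w - H[i, k] * V[:, i]
            H[k + 1, k] = np.linalg.norm(w)
            V[:, k + 1] = w / H[k + 1, k]
        return V, H

    def gdls_coefficients(H_l, d):
        # Fit alpha_1..alpha_d minimizing ||p_d(H_l) I_{l,l-d}||_F, where
        # p_d(t) = 1 - t q_{d-1}(t), q_{d-1}(t) = sum_j alpha_j (1-t)^{j-1}, i.e.
        # minimize ||E - sum_j alpha_j H_l (I-H_l)^{j-1} E||_F, E = [e_1|...|e_{l-d}].
        l = H_l.shape[0]
        E = np.eye(l)[:, :l - d]
        M = np.eye(l)
        cols = []
        for _ in range(d):
            cols.append((H_l @ (M @ E)).ravel())
            M = M @ (np.eye(l) - H_l)
        alpha, *_ = np.linalg.lstsq(np.column_stack(cols), E.ravel(), rcond=None)
        return alpha

    def apply_poly_preconditioned(A_mul, alpha, v):
        # z = AB q_{d-1}(AB) v via the recurrence of slide V:
        # w := alpha_d v;  w := alpha_{d-j} v + w - ABw, j = 1..d-1;  z := ABw.
        d = len(alpha)
        w = alpha[d - 1] * v
        for j in range(1, d):
            w = alpha[d - 1 - j] * v + w - A_mul(w)
        return A_mul(w)

Here A_mul would apply x -> ABx, with B the Block Jacobi inverse factor (independent per-block triangular solves, hence exchange-free in parallel). With l = 2*d + 4, one would take V, Hbar = arnoldi(A_mul, r0, l), fit alpha = gdls_coefficients(Hbar[:l, :l], d), and let the outer GMRes iterate with apply_poly_preconditioned.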
Polynomial preconditioning VI
Assume that all three methods have performed the same number k of MVM (matrix-vector multiplication) operations. Then the number of MPI_AllReduce calls will be:
    BiCGStab ............... 5k/2
    GMRES(k) ............... 2k + 1
    GMR_GDLS(d) ............ 2k/d + 4d

Polynomial preconditioning VII
For instance, if all three methods have performed 300 MVM operations and d = 10, l = 2d+4 in GDLS, then the number of MPI_AllReduce calls will be:
    BiCGStab ............... 750
    GMRES(k) ............... 600
    GMR_GDLS(d) ............ 100
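As a quick arithmetic check of the counts on the two slides above (a trivial sketch; k and d as quoted):

    k, d = 300, 10
    print("BiCGStab   :", 5 * k // 2)            # 750
    print("GMRES(k)   :", 2 * k + 1)             # 601, quoted as ~600 above
    print("GMR_GDLS(d):", 2 * k // d + 4 * d)    # 60 + 40 = 100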
Numerical testing using the public-domain matrix collection
• Davis T.A., Hu Y.F.: The University of Florida sparse matrix collection. ACM Trans. Math. Software 38 (2011) 1-25
• http://www.cise.ufl.edu/research/sparse/matrices

A subset of the Univ. of Florida collection I

    name         Origin of the problem
    tmt_unsymm   Electromagnetic modelling
    atmosmodd    Atmosphere modelling (discrete 3D elliptic boundary value problem)
    memchip      Electronic circuit modelling (memory chip model)
    CoupCons3D   Discretization (FD in time and FE in space) of the coupled consolidation nonstationary model
    transport    Model of flow and transport in a porous medium

A subset of the Univ. of Florida collection II

    name           n          nz(A)       nzr_max   tau(ILU2)
    tmt_unsymm       917 825   4 584 801    5       0.003
    atmosmodd      1 270 432   8 814 880    7       0.03
    memchip        2 707 524  13 343 948   27       0.1
    CoupCons3D       416 800  17 277 420   76       0.1
    transport      1 602 111  23 487 281   15       0.003

Testing on the Univ. of Florida collection I
• Right-hand side and exact solution: x_0 = 0, b = A x_*, x_* = (1/n, 2/n, ..., n/n)^T
• Stopping criterion: ||b - A x_k|| <= 10^{-8} ||b||
• Testing on a 4-core PC: Intel® Core(TM) i7-3770 CPU @ 3.40 GHz, 32 GB memory, Linux Ubuntu (note: l = 2d+7 in GDLS)

    name        Nproc  BiCGStab       GMRes(50)     GMRes(100)    GMRes(200)    GDLS(d=10)
                       #mvm / T_tot   #mvm / T_tot  #mvm / T_tot  #mvm / T_tot  #mvm / T_tot
    tmt_unsymm    1     297 / 15.65   >18K / >1K    822 / 114.30  354 / 75.23   278 / 14.91
    tmt_unsymm  128    1327 / 45.40   >64K / >7K    --- / ---     --- / ---     998 / 27.59
    atmosmodd     1     113 / 6.17     124 / 14.55   91 / 17.20    91 / 17.19   128 / 7.16
    atmosmodd   128     214 / 9.57     224 / 16.03  192 / 20.00   164 / 24.45   218 / 6.99
    transport     1     127 / 42.05    160 / 59.26   96 / 55.65    96 / 55.61   138 / 43.65
    transport   128     489 / 47.69    --- / ---    734 / 122.31  516 / 120.87  448 / 36.55
    (T_tot: total solution time in seconds)

Just to recall: typically, BiCGStab performs much better than the plain GMRES(k) for all k, especially for weak (parallelizable) preconditioners.

Testing on the Univ. of Florida collection II
• For the test runs we used the MVS-100K multicore multiprocessor (Joint Supercomputing Center RAS), with 8 cores per node and 1 GB memory per core: http://www.jscc.ru/hard/mvs100k.shtml
• The code was written in Fortran/MPI; only double precision was used
• Each node was assigned two MPI processes

Timing results for the “tmt_unsymm” problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab

    Nproc  T_prec  mvm(GDLS_GMR)  T(GDLS_GMR)  mvm(BiCGStab)  T(BiCGStab)
     16    0.41     655           6.27           818          7.68
     32    0.20     755           2.61          1018          4.35
     64    0.09     845           1.18          1079          1.39
    128    0.04     995           0.75          1328          1.02
    192    0.03    1125           0.56          1511          1.07
    256    0.02    1195           0.54          1673          1.29

[Figure: Timing for “tmt_unsymm”]

Timing results for the “atmosmodd” problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab

    Nproc  T_prec  mvm(GDLS_GMR)  T(GDLS_GMR)  mvm(BiCGStab)  T(BiCGStab)
     16    0.28     190           1.54           188          1.66
     32    0.12     195           0.70           170          0.69
     64    0.06     205           0.29           195          0.33
    128    0.03     210           0.27           197          0.38
    192    0.02     220           0.28           194          0.44
    256    0.02     225           0.30           224          0.46

[Figure: Timing for “atmosmodd”]

Timing results for the “memchip” problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab

    Nproc  T_prec  mvm(GDLS_GMR)  T(GDLS_GMR)  mvm(BiCGStab)  T(BiCGStab)
     16    0.21     170           2.27           211          3.30
     32    0.09     175           1.33           169          1.45
     64    0.05     205           0.72           219          0.92
    128    0.04     185           0.57           215          0.86
    192    0.03     175           0.58           191          1.20
    256    0.02     175           0.55           248          2.61

[Figure: Timing for “memchip”]

Timing results for the “CoupCons3D” problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab

    Nproc  T_prec  mvm(GDLS_GMR)  T(GDLS_GMR)  mvm(BiCGStab)  T(BiCGStab)
     16    0.23     125           0.95           126          0.97
     32    0.10     145           0.52           149          0.52
     64    0.04     145           0.14           146          0.15
    128    0.02     135           0.08           133          0.09
    192    0.01     145           0.06           141          0.10
    256    0.01     135           0.08           155          0.15

[Figure: Timing for “CoupCons3D”]

Timing results for the “Transport” problem: BJ(64)-GDLS(10)-GMRes vs. BJ(64)-BiCGStab

    Nproc  T_prec  mvm(GDLS_GMR)  T(GDLS_GMR)  mvm(BiCGStab)  T(BiCGStab)
     16    2.69     315           9.87           375          10.22
     32    1.37     355           4.48           353          4.63
     64    0.78     435           2.38           489          2.74
    128    0.45     475           0.87           551          1.15
    192    0.25     505           0.82           538          0.99
    256    0.24     525           0.55           569          0.92

[Figure: Timing for “Transport”]

Conclusion
• A new modification of the GMRes method is developed, based on low-degree adaptive polynomial preconditioning
• The algorithm is simple and robust
• The resulting linear solver is competitive with BiCGStab on realistic problems, even with weak preconditioners (in both sequential and massively parallel computations)

References
• Kaporin I.: High quality preconditioning of a general symmetric positive definite matrix based on its U^T U + U^T R + R^T U decomposition. Numer. Linear Algebra Appl. 5 (1998) 483-509
• Kaporin I.: Scaling, reordering, and diagonal pivoting in ILU preconditionings. Russian J. Numer. Anal. Math. Modelling 22 (2007) 341-375
• Kaporin I.E., Milyukova O.Yu.: Massively parallel implementation of the preconditioned CG method for the numerical solution of linear systems. In: Paper collection of the Applied Optimization Problems Dept., CC RAS (V.G. Zhadan, ed.). Moscow, CC RAS Publ., 2011, pp. 127-152 (in Russian)

Thank you for your attention! Any questions?