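On the Auto-Tunable Parameters of FMM
Rio Yokota, Tokyo Institute of Technology
Seventh Symposium on Automatic Tuning Technology and its Application (ATTA 2015), December 25, 2015

‣ 1946: The Monte Carlo Method
‣ 1947: Simplex Method for Linear Programming
‣ 1950: Krylov Subspace Iteration Method
‣ 1951: The Decompositional Approach to Matrix Computations
‣ 1957: The Fortran Compiler
‣ 1959: QR Algorithm for Computing Eigenvalues
‣ 1962: Quicksort Algorithms for Sorting
‣ 1965: Fast Fourier Transform
‣ 1977: Integer Relation Detection
‣ 1987: Fast Multipole Method
Dongarra & Sullivan, IEEE Comput. Sci. Eng., Vol. 2(1):22-23 (2000)

The Diversity of FMMs and the History of Their Development
[Figure annotations: "Assumes all positive charges", "Hardcoded quadrupoles"]

Basic Principle of FMM

The potential at target $x_i$ due to sources $x_j$ with strengths $m_j$ is

$$\Phi_i = \sum_{j=1}^{N} \frac{m_j}{x_i - x_j}$$

Direct evaluation costs O(N^2):

    # Direct summation; targets x[i] and sources x[j] are assumed distinct
    for i=1:N
      for j=1:N
        Phi[i] += m[j]/(x[i]-x[j]);
      end
    end

Introducing an expansion center $x_*$,

$$\frac{1}{x_i - x_j} = \frac{1}{(x_i - x_*) - (x_j - x_*)} = \frac{1}{x_i - x_*}\,\frac{1}{1-\theta}, \qquad \theta = \frac{x_j - x_*}{x_i - x_*}, \qquad |\theta| < 1$$

and the pth order Taylor expansion

$$\frac{1}{1-\theta} = \sum_{k=0}^{p-1} \theta^k + O(\theta^p)$$

gives

$$\Phi_i \approx \sum_{j=1}^{N} \frac{m_j}{x_i - x_*} \sum_{k=0}^{p-1} \left(\frac{x_j - x_*}{x_i - x_*}\right)^k = \sum_{k=0}^{p-1} (x_i - x_*)^{-k-1} \left\{ \sum_{j=1}^{N} m_j (x_j - x_*)^k \right\}$$

Error bound: $O(\theta^p)$

Evaluation through the moments costs O(N) for fixed p:

    # Moments about the center xs; M[k+1] stores the order-k moment, k = 0..p-1
    for k=0:p-1
      for j=1:N
        M[k+1] += m[j]*(x[j]-xs)^k;
      end
    end
    # Evaluate the truncated expansion at every target
    for i=1:N
      for k=0:p-1
        Phi[i] += (x[i]-xs)^(-k-1)*M[k+1];
      end
    end

Basic Principle of FMM
[Figure: schematic of the FMM kernels P2M, M2M, M2L, L2L, L2P and P2P. Plot: cputime [s] (0 to 4000) versus N (2 to 10, x10^6), broken down into P2P comm, P2P, P2M, M2M, M2L comm, M2L, L2L, L2P and other.]

Difference Between FMM and Treecode
[Figure: roofline plot of double precision performance (Gflop/s, 8 to 2048) versus operational intensity (flop/byte, 1/16 to 256) for Intel Sandy Bridge, AMD Abu Dhabi, IBM BG/Q, Fujitsu FX10, NVIDIA Kepler and Intel Xeon Phi. Kernels marked: SpMV, Stencil, 3D FFT, DGEMM, FMM M2L (Spherical), FMM M2L (Cartesian), FMM P2P. Complexity labels: O(N^2), O(N log N), O(N).]

Expansion Order P and Opening Angle θ

Components of FMM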
FMM has:
1. Kernels (1444 lines)
2. Local tree (589 lines)
3. Lists (344 lines)
4. Partition (245 lines)
5. Global tree (447 lines)

Reuse & swap among FMM codes
‣ Kernels
‣ Common interface
‣ Separation of concerns
‣ Tree
‣ MPI part

Components of FMM and Their Tunable Parameters
Categories: kernel, tree structure, communication
‣ Expansion order: P
‣ Opening angle: θ
‣ Domain partitioning method: HOT, ORB
‣ Particles per leaf: Ncrit
‣ Type of basis functions
‣ Depth of the tree
‣ Dynamic load balancing
‣ Weighting
‣ Type of kernel

Compute-Storage Tradeoff
From compute to memory: Use of symmetry, Geometric, Sampling, Algebraic
[Figure: roofline plot of double precision performance (Gflop/s, 8 to 2048) versus operational intensity (flop/byte, 1/16 to 256) for Intel Sandy Bridge, AMD Abu Dhabi, IBM BG/Q, Fujitsu FX10, NVIDIA Kepler and Intel Xeon Phi. Kernels marked: SpMV, Stencil, 3D FFT, DGEMM, FMM M2L (Spherical), FMM M2L (Cartesian), FMM P2P.]

Diversity of High-Performance Reference Implementations
[Figure: the codes STRUMPACK, HACApK, PVFMM, ASKIT, ScalFMM, exaFMM and Bonsai placed between Memory (Bytes) and Compute (Flops) along a Byte/Flop axis, with the labels Algebraic, Sampling, Use of symmetry and Geometric.]

Hybridization of FMM and Treecode

Domain Partitioning
‣ Morton HOT
‣ Hilbert HOT
‣ ORB
‣ New ORB

Auto-Tuning of Domain Partitioning

Machine Dependence of FMM Communication
Machines: Shaheen2 (Cray XC40), Mira (BG/Q), Titan (Cray XK7)

α : latency
β : inverse bandwidth
γ : delay per hop
n : message size
h : number of hops
h_m : minimum of h
c : number of cores
B : bandwidth
B_max : maximum bandwidth

$$T = \alpha + \beta n + \gamma (h - h_m)$$
$$T_{\alpha\,\mathrm{penalty}} = c\,\alpha + \beta n + \gamma (h - h_m)$$
$$T_{\beta\,\mathrm{penalty}} = \alpha + \beta n\,\frac{B_{\max}}{B} + \gamma (h - h_m)$$
$$T_{\gamma\,\mathrm{penalty}} = \alpha + \beta n + c\,\gamma (h - h_m)$$

Scaling of FMM
$$O\left(\log P + (N/P)^{2/3}\right)$$
Machine: Shaheen2 (Cray XC40)
Kernel: Laplace (Cartesian expansion)
Distribution: Random in cube
Partition: Hashed Octree
MPI: isend, irecv (with overlap)
[Figure: two panels, Strong scaling (N=300,000,000) and Weak scaling (N=100,000,000 per node). Each shows time [s] versus number of cores, broken down into Grow tree, Traverse, Communication and Other. Core counts shown: 64, 512, 4096, 32768 and 256, 2048, 16384, 131072.]

Summary: Which Parameters Are Tunable?
Categories: kernel, tree structure, communication
‣ Expansion order: P
‣ Opening angle: θ
‣ Domain partitioning method: HOT, ORB
‣ Particles per leaf: Ncrit
‣ Type of basis functions
‣ Depth of the tree
‣ Dynamic load balancing
‣ Weighting
‣ Type of kernel
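
As a small check of how the expansion order P trades accuracy for work, the one-dimensional expansion from the basic-principle slide can be evaluated numerically. The following is a minimal Python sketch, not code from the talk: the function names `direct_sum` and `multipole_sum`, the point distributions, and the chosen values of p are all assumptions made for illustration. Sources are clustered around the center x* = 0 and targets are placed so that |θ| ≤ 0.25.

```python
import random


def direct_sum(targets, sources, m):
    """O(N^2) reference: Phi_i = sum_j m_j / (x_i - x_j)."""
    return [sum(mj / (xi - xj) for xj, mj in zip(sources, m)) for xi in targets]


def multipole_sum(targets, sources, m, xs, p):
    """O(pN) approximation with a p-term expansion about the center xs."""
    # Moments M_k = sum_j m_j (x_j - xs)^k for k = 0..p-1
    M = [sum(mj * (xj - xs) ** k for xj, mj in zip(sources, m)) for k in range(p)]
    # Phi_i ~ sum_k M_k (x_i - xs)^(-k-1)
    return [sum(M[k] / (xi - xs) ** (k + 1) for k in range(p)) for xi in targets]


# Hypothetical test problem (assumed for illustration only)
random.seed(0)
n = 200
sources = [random.uniform(-0.5, 0.5) for _ in range(n)]  # cluster around xs = 0
targets = [random.uniform(2.0, 3.0) for _ in range(n)]   # so |theta| <= 0.25
m = [random.uniform(0.0, 1.0) for _ in range(n)]         # all positive strengths

exact = direct_sum(targets, sources, m)
errors = {}
for p in (2, 4, 8, 16):
    approx = multipole_sum(targets, sources, m, 0.0, p)
    # Maximum relative error over all targets for this expansion order
    errors[p] = max(abs(a - e) / abs(e) for a, e in zip(approx, exact))
    print(f"p={p:2d}  max relative error={errors[p]:.3e}")
```

With this layout the maximum relative error is bounded by 0.25^p, so each doubling of p squares the bound while the cost of the expansion grows only linearly in p. This is the kind of accuracy/cost trade-off an auto-tuner would explore when choosing P together with θ.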