Ab Initio Quantum Chemistry on Graphics Processing Units
Rethinking Algorithms for Massively Parallel Architectures

Jörg Kussmann
Theoretical Chemistry, University of Munich (LMU)
23rd May 2014

J. Kussmann Quantum Chemistry@GPU

Outline
- Introduction
  - Challenges of Ab Initio Quantum Chemistry
- Optimizing SCF Algorithms @ GPUs
  - Data Arrangement
  - Coulomb, Exchange, and XC Potential
  - Exchange Potential: GPU-specific optimization
  - Exemplary Calculations: SCF & Properties
  - Hybrid MPI/CUDA Parallelization
- Outlook: Post-HF Algorithms @ GPUs
  - Challenge
  - SOS-MP2 @ GPUs

PART 1: Ab Initio Methods

Schrödinger equation: $\hat{H}\Psi = i\hbar\dot{\Psi}$; stationary case: $\hat{H}\Psi = E\Psi$

Molecular properties:
- Energetics/Geometries
- Vibrational frequencies
- Electric properties
- Magnetic properties
- Dynamic properties

Conventional methods: systems with up to a few hundred atoms at the HF or KS-DFT level (scaling at least $O(M^3)$)

Aim: Reduce scaling to $O(M)$!

Computational Effort: SCF Calculations

Roothaan-Hall: $\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\epsilon}$

$F_{\mu\nu} = h^{\mathrm{core}}_{\mu\nu} + J_{\mu\nu}[\mathbf{P}] - (1-a)\,K_{\mu\nu}[\mathbf{P}] + V^{\mathrm{XC}}_{\mu\nu}[a,\mathbf{P}]$

- a = 0: HF
- 0 < a < 1: hybrid DFT
- a = 1: KS-DFT

Rate-determining steps:
1) Fock build: $O(N^2) \to O(N)$
2) Diagonalization $\mathbf{F} \to \mathbf{C}$: $O(N^3) \to O(N)$

[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]

Example: 16 A-T base pairs, HF/SVP ($\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{conv}} = 10^{-7}$)
- 1052 atoms, 11230 basis functions
- 3 078 087 function pairs
- $9.5 \times 10^{12}$ primitive two-electron integrals
- O(N) Fock build (8 cores): 30 000 s (19 SCF iterations for tight convergence)

Moore's Law: 1965-2010
[Figure]
Embrace new technologies: GPUs

Implementation of GPU Algorithms
- Automatic code generation
- All double precision, support for higher angular-momentum quantum numbers (l-qn)

Coulomb:
- McMurchie-Davidson based J-engine
- Pre/post-processing on CPU
- Ignore bra/ket symmetry (2 x integrals)

Exchange:
- McMurchie-Davidson
- Evaluate complete integral on GPU
- Exploit only 1 permutational symmetry (4 x integrals)

1 thread / 1 primitive integral: fine-grained data arrangement [Ufimtsev/Martinez, JCTC 4, 222 (2008)]

Coulomb Potential
[Figure]

Exchange Potential
[Figure]

Coulomb is very fast, so try to improve on exchange first:
A) Reduce scaling to linear
B) Reduce local memory effort
C) Reduce shared memory effort
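
The factors "2 x integrals" and "4 x integrals" on the implementation slide follow from how much of the 8-fold permutational symmetry of the two-electron integrals $(\mu\nu|\lambda\sigma)$ is used. The sketch below counts index quartets for a hypothetical basis of n real functions. Which single symmetry the exchange kernels actually exploit is not stated on the slide; the bra/ket swap used here is an assumption, and any single two-fold symmetry gives the same asymptotic factor. All function names are illustrative.

```python
from itertools import product


def quartet_counts(n):
    """Closed-form numbers of (mu nu|lambda sigma) quartets to evaluate.

    Returns (full 8-fold symmetry, bra/ket swap ignored, one 2-fold symmetry).
    """
    pairs = n * (n + 1) // 2                  # ordered pairs mu >= nu
    full = pairs * (pairs + 1) // 2           # ordered pairs of ordered pairs
    bra_ket_ignored = pairs * pairs           # pair ordering only
    one_symmetry = n * n * (n * n + 1) // 2   # assumed: bra<->ket swap only
    return full, bra_ket_ignored, one_symmetry


def brute_force_counts(n):
    """Same counts by explicit enumeration (only sensible for small n)."""
    full = bra_ket_ignored = one_symmetry = 0
    for mu, nu, lam, sig in product(range(n), repeat=4):
        pairs_ordered = mu >= nu and lam >= sig
        if pairs_ordered and (mu, nu) >= (lam, sig):
            full += 1
        if pairs_ordered:
            bra_ket_ignored += 1
        if (mu, nu) >= (lam, sig):
            one_symmetry += 1
    return full, bra_ket_ignored, one_symmetry


if __name__ == "__main__":
    full, bra_ket_ignored, one_symmetry = quartet_counts(1000)
    print("Coulomb-like overhead :", bra_ket_ignored / full)
    print("Exchange-like overhead:", one_symmetry / full)
```

For large n the two ratios approach 2 and 4, which matches the overheads quoted on the slide. The redundant work is traded for simpler, branch-free kernels.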

A) PreLinK: O(N) Exact Exchange on GPUs

Problem: O(N) algorithms employ loads of book-keeping, branching, and communication.

Conventional scheme (screening within the inner loop):

Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs $\mu, \lambda$
      Determine significant $(\mu\lambda|\sigma\nu)$ quartets: $\max\, Q_{\mu\lambda}\,|P_{\lambda\sigma}|\,Q_{\sigma\nu} \geq \vartheta_{\mathrm{int}}$ (+ permutations)
      Loop: ket shell-pairs $\sigma, \nu$
        Evaluate: $K_{\mu\nu}$, $K_{\mu\sigma}$, $K_{\lambda\nu}$, $K_{\lambda\sigma}$
      End Loop
    End Loop
  End Loop
End Loop

Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK

Exchange matrix: $K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\,P_{\lambda\sigma}$

Schwarz: $|(\mu\lambda|\nu\sigma)| \leq Q_{\mu\lambda}\,Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$

PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda}\,|P_{\lambda\sigma}|\,Q_{\nu\sigma} \geq |K_{\mu\nu}|$, i.e. $\mathbf{Q}' = \mathbf{Q} \times |\mathbf{P}| \times \mathbf{Q}$

Determine significant elements of $\mathbf{K}$ from $\mathbf{Q}'$!

[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]

A) PreLinK: Pre-Selection Threshold
[Figure: $|\mathbf{P}|$ and overestimation of $\mathbf{K}$; 16 α-D-glucose units, HF/SVP]

A) PreLinK: Pre-Selection Threshold
Effect of pre-selection on final SCF energy. DNA fragment with 4 A-T base pairs, HF/SVP ($\vartheta_{\mathrm{conv}} = 10^{-7}$, $\vartheta_{\mathrm{int}} = 10^{-10}$). Errors in µHartree.
Error always below convergence criterion.

A) PreLinK: Timings
[Figure: linear alkanes, HF/SV, max.: C640H1282; 1-4 x NVidia M2090 (older generation; Kepler is approx. 3 x faster)]

B) Improving the Exchange: Reduced Local Memory
[Figure: 16 A-T base pairs, HF/SVP ($\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{pre}} = 10^{-3}$, 1 x GTX Titan)]
Resort to Rys quadrature for larger total l-qn.
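
The PreLinK pre-selection from the slides above, $\mathbf{Q}' = \mathbf{Q} \times |\mathbf{P}| \times \mathbf{Q}$, can be sketched in a few lines. This is a minimal CPU illustration with dense nested lists, not the FermiONs++ implementation. The matrices, the threshold value, and the function names are made up for the example, and a real code would work on shell pairs with sparse algebra.

```python
def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def prelink_preselect(q, p, threshold):
    """Return Q' = Q |P| Q and the (mu, nu) index pairs with Q' >= threshold.

    q: symmetric matrix of Schwarz factors Q_mu,lambda = sqrt((mu lambda|mu lambda))
    p: density matrix
    Since Q' bounds |K| element-wise, pairs below the threshold are never
    handed to the integral kernels.
    """
    abs_p = [[abs(x) for x in row] for row in p]
    q_prime = matmul(matmul(q, abs_p), q)
    significant = [(mu, nu)
                   for mu, row in enumerate(q_prime)
                   for nu, value in enumerate(row)
                   if value >= threshold]
    return q_prime, significant


if __name__ == "__main__":
    # Toy 3-function chain: functions 0 and 2 are far apart (illustrative numbers).
    q = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]]
    p = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.2], [0.0, 0.2, 1.0]]
    q_prime, significant = prelink_preselect(q, p, threshold=0.1)
    print(significant)
```

Because the selection is finished before any integral is evaluated, the GPU kernels receive a plain list of significant pairs and need no screening branches in the inner loop.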

C) Improving the Exchange: Reduced Shared Memory

Shared memory per thread block
- Most suitable size: 8 x 8 thread blocks, use shared memory for $K_{\mu\nu}$
- Ex.: d-shells (l-qn = 2), 48 kB shared memory
  - 36 Cartesian $K_{\mu\nu}$ elements
    Memory per thread block: 8 x 8 x 8 B (double) x 36 = 18.4 kB
    Max. 2 thread blocks per SMX, only 128 out of 192 cores
  - 25 pure $K_{\mu\nu}$ elements
    Memory per thread block: 8 x 8 x 8 B (double) x 25 = 12.8 kB
    Max. 3 thread blocks per SMX, 192 out of 192 cores
- Direct transformation to pure functions allows larger l-qn shells!
- Ex.: 2 A-T base pairs, HF/TZVP: 267 s (cart) vs 216 s (pure)
  Significant impact: 20% speedup
  Only ca. 7% of l-qn combinations affected

Exemplary Calculations: Water Clusters
SCF Fock build and nuclear gradient (4 x GTX Titan, PBE0/SVP, 75/302)
[Figure]
PreLinK for gradients [Kussmann/Ochsenfeld, in preparation]

NMR Shieldings @ GPU
Timings: water clusters (4 x GTX Titan, PBE0/SVP, 75/302)
[Figure]
Algorithm:
- dJ/dB: Reuse SCF kernels with l + 1, different post-processing
- dK/dB: Special GPU kernels
- K[dP/dB]: 6 x SCF kernels (skew symmetry)

CIS/RPA @ GPU
Timings: water clusters (4 x GTX Titan, PBE/SVP, 75/302)
[Figure]

Hybrid MPI/CUDA Parallelization: SCF Calculations
HF/SVP (single Fock build, $\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{pre}} = 10^{-3}$)
Systems: 16 A-T base pairs, (H2O)1123
[Figure]
Hardware/Parallelization:
- Per node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan
- Primitive load balancing, master-slave work distribution
- 1 Gb Ethernet

Hybrid MPI/CUDA Parallelization: MutM@H2O
[Figure]

Post-HF @ GPUs

Challenge
- Less favorable scaling, conventionally $O(N^5)$ at best (MP2)
- Not integral evaluation, but linear algebra is rate-determining
- Porting CPU algorithms shows small speedups only
- Problem: DGEMM speedup is rather small (ca. x 8)

Ansatz
- Reconsider algorithms with GPUs in mind
- First attempt: SOS-RI-MP2 [$O(N^4)$] [Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]

Post-HF @ GPUs: SOS-RI-MP2

$E^{\mathrm{OS}}_{\mathrm{RI-MP2}} = -\sum_{ijab}\sum_{RSR'S'} \frac{(ia|R)\,[\mathbf{J}^{-1}]_{RS}\,(S|jb)\,(ia|R')\,[\mathbf{J}^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$

$J_{RS}$: two-center two-electron integrals (aux. basis)

Laplace transform:

$E^{\mathrm{OS}}_{\mathrm{RI-AO-MP2}} = -\sum_{\alpha}\sum_{\substack{\mu\nu\lambda\sigma \\ \mu'\nu'\lambda'\sigma'}}\sum_{RSR'S'} P^{\mathrm{occ}}_{\mu\mu'}\,P^{\mathrm{virt}}_{\nu\nu'}\,P^{\mathrm{occ}}_{\lambda\lambda'}\,P^{\mathrm{virt}}_{\sigma\sigma'}\,(\mu\nu|R)\,[\mathbf{J}^{-1}]_{RS}\,(S|\lambda\sigma)\,(\mu'\nu'|R')\,[\mathbf{J}^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$

Evaluation via intermediates:

$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\,P^{\mathrm{occ}}_{\mu\mu'}\,P^{\mathrm{virt}}_{\nu\nu'}\,(\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S)$

where $(R|\underline{\mu}\overline{\nu})$ denotes the three-center integrals transformed with the occupied and virtual pseudo-densities.

OS correlation energy: $E^{\mathrm{OS}}_{\mathrm{RI-AO-MP2}} = -\sum_{\alpha}\sum_{RS} \tilde{Z}_{RS}\,\tilde{Z}_{SR}$ with $\tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}$

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]

Post-HF @ GPUs: SOS-RI-MP2 @ GPUs

Ansatz
- Use Cholesky factors of pseudo-densities & sparse algebra: $O(N^3)$
- Evaluate $Z_{RS}$ via J-engine on GPUs

Algorithm
Step  Operation                                                                        Scaling
(1)   Calculation of $(R|\mu\nu)$                                                      $O(N^2)$
(2)   Calculation of $J_{RS} = (R|S)$                                                  $O(N^2)$
(3)   Calculation of $\mathbf{J}^{-1}$                                                 $O(N^3)$
(4)   Calculation of pseudo-densities                                                  $O(N^3)$
(5)   Transformation of $(R|\mu\nu)$ to $(R|\underline{\mu}\overline{\nu})$            $O(N^2)$
(6)   Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (@ GPU)  $O(N^3)$
(7)   Multiplication $\mathbf{Z}\mathbf{J}^{-1}$                                       $O(N^3)$
(8)   Contraction $\sum_{RS} \tilde{Z}_{RS}\tilde{Z}_{SR}$                             $O(N^2)$

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]

SOS-RI-MP2: J-engine@GPU
[Figure]

SOS-RI-MP2 @ GPU: Linear Alkanes
[Figure]

SOS-RI-MP2 @ GPU: DNA
[Figure]
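
Steps (7) and (8) of the SOS-RI-MP2 algorithm above, $\tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}$ followed by $-\sum_{\alpha}\sum_{RS} \tilde{Z}_{RS}\tilde{Z}_{SR}$, can be sketched as follows. This is a minimal dense-matrix illustration with made-up numbers and hypothetical function names. It assumes that $\mathbf{J}^{-1}$ and one $\mathbf{Z}$ matrix per Laplace point $\alpha$ are already available, and that the Laplace weights and any opposite-spin scaling factor are folded into the inputs, since the slide formula shows neither explicitly.

```python
def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def laplace_point_energy(z, j_inv):
    """Contribution of one Laplace point: -sum_RS Zt_RS * Zt_SR, Zt = Z J^-1."""
    z_tilde = matmul(z, j_inv)                      # step (7)
    n_aux = len(z_tilde)
    return -sum(z_tilde[r][s] * z_tilde[s][r]       # step (8)
                for r in range(n_aux) for s in range(n_aux))


def sos_energy(z_per_point, j_inv):
    """Sum the contributions of all Laplace points alpha."""
    return sum(laplace_point_energy(z, j_inv) for z in z_per_point)


if __name__ == "__main__":
    # Two auxiliary functions, two Laplace points (illustrative numbers only).
    j_inv = [[2.0, 0.0], [0.0, 1.0]]
    z_points = [[[1.0, 2.0], [2.0, 3.0]],
                [[0.5, 0.0], [0.0, 0.5]]]
    print(sos_energy(z_points, j_inv))
```

The final contraction is the trace of $\tilde{\mathbf{Z}}^2$, so it only touches matrices in the auxiliary basis. The expensive part is forming $\mathbf{Z}$ itself, which is the step the slides move to the GPU.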

Conclusions
- Rethink algorithms, don't simply transfer CPU code
- Coulomb: $O(N^2)$ J-engine, but small pre-factor
- Efficient O(N) exchange evaluation on GPUs by PreLinK
- Performance/Cost (DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock):
    Q-Chem @ 8 CPU cores:   ~30000 s  (~2000 €)
    FermiONs++ @ 4 M2090:   ~2100 s   (~10000 €)
    FermiONs++ @ 4 Titan:   ~500 s    (~8000 €)
  ~60 x faster, 4 x more expensive
- Fine-grained data arrangement: strong-scaling parallelization
- FermiONs++: Release 2014

Acknowledgement
- Prof. Dr. C. Ochsenfeld
- Dr. Simon Maurer
- Group

Thank you for your attention...