Ab Initio Quantum Chemistry on Graphics Processing Units

Rethinking Algorithms for Massively Parallel Architectures
Jörg Kussmann
Theoretical Chemistry, University of Munich (LMU)
23rd May 2014
J. Kussmann
Quantum Chemistry@GPU
Outline
Introduction
Challenges of Ab Initio Quantum Chemistry
Optimizing SCF-Algorithms @ GPUs
Data-Arrangement
Coulomb-, Exchange-, XC-Potential
Exchange Potential: GPU-specific optimization
Exemplary Calculations: SCF & Properties
Hybrid MPI/CUDA Parallelization
Outlook: Post-HF Algorithms @ GPUs
Challenge
SOS-MP2 @ GPUs
PART 1: Ab Initio Methods
Schrödinger equation:
$\hat{H}\Psi = i\hbar\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$
Molecular properties:
Energetics/Geometries
Vibrational frequencies
Electric properties
Magnetic properties
Dynamic properties
Conventional methods:
Systems with up to a few 100 atoms on HF- or KS-DFT level (min. $O(M^3)$)
Aim: Reduce scaling to O(M)!
Computational Effort: SCF Calculations
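Roothaan-Hall:
$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\epsilon}$
$F_{\mu\nu} = h^{\mathrm{core}}_{\mu\nu} + J_{\mu\nu}[\mathbf{P}] - (1 - a)\,K_{\mu\nu}[\mathbf{P}] + V^{\mathrm{XC}}_{\mu\nu}[a, \mathbf{P}]$
a = 0: HF
0 < a < 1: hybrid-DFT
a = 1: KS-DFT
Rate-determining steps:
1) Fock-Build: $O(N^2) \longrightarrow O(N)$
2) Diagonalization $\mathbf{F} \longrightarrow \mathbf{C}$: $O(N^3) \longrightarrow O(N)$
[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]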
Example: 16 A-T base pairs
HF/SVP ($\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{conv}} = 10^{-7}$)
1052 atoms, 11230 basis functions
3 078 087 function pairs
$9.5 \times 10^{12}$ primitive two-electron integrals
O(N) Fock-Build (8 cores): 30 000 s
(19 SCF-iterations for tight convergence)
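The Fock-matrix expression above can be illustrated with a minimal sketch. The function name `build_fock` and the toy 2x2 matrices are assumptions made for illustration only; they are not taken from any production code and the numbers are not real integrals. The sketch only shows how the mixing parameter a switches between HF (a = 0) and pure KS-DFT (a = 1).

```python
def build_fock(h_core, J, K, V_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc element by element.

    All arguments are square matrices given as lists of lists;
    `a` is the mixing parameter from the slide (a = 0: HF, a = 1: KS-DFT).
    """
    n = len(h_core)
    return [
        [h_core[m][v] + J[m][v] - (1.0 - a) * K[m][v] + V_xc[m][v]
         for v in range(n)]
        for m in range(n)
    ]


# Toy matrices (arbitrary numbers, hypothetical, for illustration only)
h_core = [[-1.0, -0.2], [-0.2, -0.8]]
J = [[0.6, 0.1], [0.1, 0.5]]
K = [[0.3, 0.05], [0.05, 0.2]]
zero = [[0.0, 0.0], [0.0, 0.0]]  # V_xc set to zero to isolate the exchange term

F_hf = build_fock(h_core, J, K, zero, a=0.0)  # full exact exchange
F_ks = build_fock(h_core, J, K, zero, a=1.0)  # no exact exchange
```

In a real SCF code the expensive part is building J and K from the two-electron integrals, not this final summation.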
Moore’s Law: 1965-2010
Embrace new technologies: GPUs
Implementation of GPU-algorithms
Automatic code generation
All double-precision, higher l-qn support
Coulomb
McMurchie-Davidson based J-engine
Pre/Post-processing on CPU
Ignore bra/ket symmetry (2 x integrals)
Exchange
McMurchie-Davidson
Evaluate complete integral on GPU
Exploit only 1 permutational symmetry (4 x integrals)
1 thread / 1 prim. integral: fine-grained data arrangement
[Ufimtsev/Martinez, JCTC 4, 222 (2008)]
Coulomb Potential
Exchange Potential
Coulomb very fast, try to improve on exchange first...
A) Reduce scaling to linear
B) Reduce local memory effort
C) Reduce shared memory effort
A) PreLinK: O(N) Exact Exchange on GPUs
Problem: O(N) algorithms employ loads of book-keeping, branching, communication
Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine sig. (µλ|σν) quartets: $Q_{\mu\lambda}\, P^{\max}_{\lambda\sigma}\, Q_{\sigma\nu} \geq \vartheta_{\mathrm{int}}$ + permutations
      Loop: ket shell-pairs σ, ν
        Evaluate: Kµν, Kµσ, Kλν, Kλσ
      End Loop
    End Loop
  End Loop
End Loop
Screening within inner loop
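A minimal sketch of the in-loop screening shown above, to make the branching inside the innermost loop explicit. The function name `significant_quartets`, the index-based "shell-pairs" and the toy matrices are assumptions for illustration; a real code loops over pre-sorted shell-pair lists per angular momentum class. The test inside the ket loop is the Schwarz-type estimate from the slide, taken over the four density elements that couple the bra and ket indices.

```python
def significant_quartets(Q, P, threshold):
    """Return the (mu, lam, sig, nu) quartets that pass the screening
    Q[mu][lam] * max|P| * Q[sig][nu] >= threshold."""
    n = len(Q)
    pairs = [(m, l) for m in range(n) for l in range(m + 1)]
    selected = []
    for (m, l) in pairs:                      # bra shell-pairs
        for (s, v) in pairs:                  # ket shell-pairs
            # largest density element coupling a bra index to a ket index
            p_max = max(abs(P[l][s]), abs(P[l][v]),
                        abs(P[m][s]), abs(P[m][v]))
            # data-dependent branch inside the inner loop
            if Q[m][l] * p_max * Q[s][v] >= threshold:
                selected.append((m, l, s, v))
    return selected


# Toy data (hypothetical): two functions with a very small mixed Schwarz factor
Q = [[1.0, 1e-6], [1e-6, 1.0]]
P = [[0.5, 0.0], [0.0, 0.5]]
quartets = significant_quartets(Q, P, threshold=1e-10)
```

The data-dependent `if` in the inner loop is exactly the kind of branching that maps poorly onto GPU threads executing in lockstep.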
Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK
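$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$
Schwarz: $(\mu\lambda|\nu\sigma) \leq Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$
PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda}\, Q_{\nu\sigma}\, |P_{\lambda\sigma}| \geq K_{\mu\nu}$
$\longrightarrow \mathbf{Q}' = \mathbf{Q} \times |\mathbf{P}| \times \mathbf{Q}$
Determine significant elements of $\mathbf{K}$ from $\mathbf{Q}'$!
[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]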
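The pre-selection step can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: `matmul` and `prelink_preselect` are hypothetical helper names, the matrices are dense toy lists, and Q is assumed symmetric so that the double sum reduces to the matrix product Q × |P| × Q. The threshold value 1e-3 mirrors the ϑpre quoted elsewhere in the slides.

```python
def matmul(A, B):
    """Plain dense matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][x] * B[x][j] for x in range(inner))
             for j in range(cols)]
            for i in range(len(A))]


def prelink_preselect(Q, P, threshold):
    """Build Q' = Q |P| Q and list the (mu, nu) pairs whose estimate
    reaches the pre-selection threshold."""
    abs_P = [[abs(x) for x in row] for row in P]
    Q_prime = matmul(matmul(Q, abs_P), Q)
    n = len(Q_prime)
    significant = [(m, v) for m in range(n) for v in range(n)
                   if Q_prime[m][v] >= threshold]
    return Q_prime, significant


# Toy data (hypothetical): function 2 does not overlap with functions 0 and 1
Q = [[1.0, 0.1, 0.0],
     [0.1, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
P = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Q_prime, significant = prelink_preselect(Q, P, threshold=1e-3)
```

Because the selection is finished before any integral is evaluated, the integral kernels can run over a fixed list of significant pairs without screening branches.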
A) PreLinK: Pre-Selection Threshold
|P|
Overestimation of K
16 α-D-glucose units, HF/SVP
A) PreLinK: Pre-Selection Threshold
Effect of pre-selection on final SCF energy
DNA-fragment with 4 A-T base-pairs, HF/SVP
($\vartheta_{\mathrm{conv}} = 10^{-7}$, $\vartheta_{\mathrm{int}} = 10^{-10}$).
Errors in µHartree.
Error always below convergence criterion
A) PreLinK: Timings
Linear alkanes, HF/SV, max.: C640H1282
1-4 x NVIDIA M2090 (old generation; Kepler: approx. 3 x faster)
B) Improving the Exchange: Reduced Local Memory
16 A-T base pairs, HF/SVP ($\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{pre}} = 10^{-3}$, 1 x GTX Titan)
Resort to Rys-quadrature for larger total l-qn
C) Improving the Exchange: Reduced Shared Memory
Shared Memory per thread-block
Most suitable size: 8x8 thread-blocks, use shared memory for Kµν
Ex.: d-shells (l-qn = 2), 48 kB shared memory
36 cartesian Kµν elements
Memory per thread-block: 8 x 8 x 8 (double) x 36 = 18.4 kB
Max. 2 thread-blocks per SMX, only 128 out of 192 cores
25 pure Kµν elements
Memory per thread-block: 8 x 8 x 8 (double) x 25 = 12.8 kB
Max. 3 thread-blocks per SMX, 192 out of 192 cores
Direct transformation to pure allows larger l-qn shells!
Ex.: 2 A-T base pairs, HF/TZVP
267 s (cart) vs 216 s (pure)
Significant impact: 20% speedup
Only ca. 7% of l-qn combinations affected
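The shared-memory arithmetic on this slide can be reproduced with a short sketch. The helper names are hypothetical, "48 kB" is assumed to mean 48 KiB, and the model is deliberately simplified: it counts only shared memory and treats one thread as occupying one core, whereas on real hardware registers and other limits also constrain the number of resident thread-blocks.

```python
SHARED_MEM_BYTES = 48 * 1024   # assumption: "48 kB" read as 48 KiB per SMX
CORES_PER_SMX = 192
THREADS_PER_BLOCK = 8 * 8      # 8x8 thread-blocks
BYTES_PER_DOUBLE = 8


def n_functions(l, pure):
    """Number of basis functions in a shell of angular momentum l."""
    return 2 * l + 1 if pure else (l + 1) * (l + 2) // 2


def occupancy(l_bra, l_ket, pure):
    """Return (K elements, bytes per block, blocks per SMX, busy cores)."""
    n_elems = n_functions(l_bra, pure) * n_functions(l_ket, pure)
    bytes_per_block = THREADS_PER_BLOCK * BYTES_PER_DOUBLE * n_elems
    blocks = SHARED_MEM_BYTES // bytes_per_block
    busy = min(CORES_PER_SMX, blocks * THREADS_PER_BLOCK)
    return n_elems, bytes_per_block, blocks, busy


cartesian_dd = occupancy(2, 2, pure=False)
pure_dd = occupancy(2, 2, pure=True)
```

For two d-shells this gives 36 elements, 18432 bytes and 2 blocks (128 busy cores) in the cartesian case, against 25 elements, 12800 bytes and 3 blocks (192 busy cores) in the pure case, matching the numbers on the slide.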
Exemplary Calculations: Water-Cluster
SCF Fock-Build and Nuclear Gradient
(4 x GTX Titan, PBE0/SVP, 75/302)
PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]
NMR-Shieldings @ GPU
Timings: Water-Clusters
(4 x GTX Titan, PBE0/SVP, 75/302)
Algorithm
dJ/dB: Reuse SCF-kernels with l + 1, different post-processing
dK /dB: Special GPU-kernels
K [dP/dB]: 6 x SCF-kernels (skew symmetry)
CIS/RPA @ GPU
Timings: Water-Clusters
(4 x GTX Titan, PBE/SVP, 75/302)
Hybrid MPI/CUDA Parallelization: SCF Calculations
HF/SVP (Single Fock-build, $\vartheta_{\mathrm{int}} = 10^{-10}$, $\vartheta_{\mathrm{pre}} = 10^{-3}$)
16 A-T base pairs
(H2O)1123
Hardware/Parallelization
Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan
Primitive Load-balancing, Master-Slave work distribution
1 Gb Ethernet
Hybrid MPI/CUDA Parallelization: SCF Calculations
HF/SVP
16 A-T base pairs
(H2O)1123
Hybrid MPI/CUDA Parallelization: MutM@H2O
Post-HF @ GPUs
Challenge
Less favorable scaling, conv. $O(N^5)$ at best (MP2)
Not integral evaluation, but linear algebra rate-determining
Porting CPU-algorithms shows small speedups only
Problem: DGEMM-speedup is rather small (ca. x 8)
Ansatz
Re-considering algorithms with GPUs in mind
First attempt: SOS-RI-MP2 [$O(N^4)$]
[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]
Post-HF @ GPUs: SOS-RI-MP2
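$$E^{\mathrm{OS}}_{\mathrm{RI\text{-}MP2}} = -\sum_{ijab}\,\sum_{RSR'S'} \frac{(ia|R)\,J^{-1}_{RS}\,(S|jb)\,(ia|R')\,J^{-1}_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$$
$J_{RS}$: two-center/two-electron integrals (aux. basis)
Laplace-Transform:
$$E^{\mathrm{OS}}_{\mathrm{RI\text{-}AO\text{-}MP2}} = -\sum_{\alpha}\,\sum_{\substack{\mu\nu\lambda\sigma \\ \mu'\nu'\lambda'\sigma'}}\,\sum_{RSR'S'} P^{\mathrm{occ}}_{\mu\mu'}\,P^{\mathrm{virt}}_{\nu\nu'}\,P^{\mathrm{occ}}_{\lambda\lambda'}\,P^{\mathrm{virt}}_{\sigma\sigma'} \left[(\mu\nu|R)\,J^{-1}_{RS}\,(S|\lambda\sigma)\right] \left[(\mu'\nu'|R')\,J^{-1}_{R'S'}\,(S'|\lambda'\sigma')\right]$$
Evaluation via Intermediates:
$$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\,P^{\mathrm{occ}}_{\mu\mu'}\,P^{\mathrm{virt}}_{\nu\nu'}\,(\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S)$$
Correlation Energy:
$$E^{\mathrm{OS}}_{\mathrm{RI\text{-}AO\text{-}MP2}} = -\sum_{\alpha}\,\sum_{RS} \tilde{Z}_{RS}\,\tilde{Z}_{SR} \quad \text{with} \quad \tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}$$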
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
Post-HF @ GPUs: SOS-RI-MP2 @ GPUs
Ansatz
Use Cholesky-factors of pseudo-densities & sparse algebra −→ $O(N^3)$
Evaluate $Z_{RS}$ via J-engine on GPUs.
Algorithm
(1) Calculation of $(R|\mu\nu)$: $O(N^2)$
(2) Calculation of $J_{RS} = (R|S)$: $O(N^2)$
(3) Calculation of $\mathbf{J}^{-1}$: $O(N^3)$
(4) Calculation of pseudo-densities: $O(N^3)$
(5) Transformation of $(R|\mu\nu)$ to $(R|\underline{\mu}\overline{\nu})$: $O(N^2)$
(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (@ GPU): $O(N^3)$
(7) Multiplication $\mathbf{Z}\mathbf{J}^{-1}$: $O(N^3)$
(8) Contraction $\sum_{RS} \tilde{Z}_{RS}\,\tilde{Z}_{SR}$: $O(N^2)$
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
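Steps (7) and (8) of the algorithm above amount to a matrix product followed by a trace-like contraction, which a minimal sketch can make concrete. The names `matmul` and `sos_energy` are hypothetical, the matrices are small dense toy lists, and two things are assumed for illustration: the Laplace weights are already absorbed into each Z, and the inverse of J is supplied as an input. No opposite-spin scaling factor is applied, since the formula on the slide does not show one.

```python
def matmul(A, B):
    """Plain dense matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][x] * B[x][j] for x in range(inner))
             for j in range(cols)]
            for i in range(len(A))]


def sos_energy(z_per_point, j_inv):
    """E = - sum_alpha sum_RS Zt[R][S] * Zt[S][R], with Zt = Z * J^-1.

    `z_per_point` holds one Z matrix per Laplace point alpha.
    """
    energy = 0.0
    for Z in z_per_point:
        Zt = matmul(Z, j_inv)                       # step (7)
        n = len(Zt)
        energy -= sum(Zt[r][s] * Zt[s][r]           # step (8)
                      for r in range(n) for s in range(n))
    return energy


# Toy data (hypothetical numbers, one Laplace point)
j_inv = [[2.0, 0.0], [0.0, 1.0]]
Z = [[1.0, 2.0], [2.0, 3.0]]
e_os = sos_energy([Z], j_inv)
```

The contraction in step (8) touches each element of the transformed matrix once, which is why it scales quadratically while the preceding matrix product scales cubically.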
SOS-RI-MP2: J-engine@GPU
SOS-RI-MP2 @ GPU: Linear Alkanes
SOS-RI-MP2 @ GPU: DNA
Conclusions
Rethink algorithms, don’t simply transfer CPU-code
Coulomb: $O(N^2)$ J-engine, but small pre-factor
Efficient O(N) exchange evaluation on GPUs by PreLinK
Performance/Cost
(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)
Q-Chem @ 8 CPU-cores: ∼ 30000 s (∼ 2000 €)
FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)
FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)
∼ 60 x faster, 4 x more expensive
Fine-grained data-arrangement
strong-scaling parallelization
FermiONs++: Release 2014
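The performance/cost comparison quoted above can be checked with a few lines of arithmetic. The timings and hardware prices are the approximate values from the slide; the dictionary layout and variable names are assumptions made for this sketch.

```python
# (wall time in s for one Fock build, approximate hardware cost in EUR)
setups = {
    "Q-Chem @ 8 CPU-cores": (30000.0, 2000.0),
    "FermiONs++ @ 4 M2090": (2100.0, 10000.0),
    "FermiONs++ @ 4 Titan": (500.0, 8000.0),
}

ref_time, ref_cost = setups["Q-Chem @ 8 CPU-cores"]

# speedup and cost factor relative to the CPU reference
ratios = {
    name: (ref_time / time, cost / ref_cost)
    for name, (time, cost) in setups.items()
}

for name, (speedup, cost_factor) in ratios.items():
    print(f"{name}: {speedup:.1f} x faster, {cost_factor:.1f} x the cost")
```

For the 4 x Titan setup this reproduces the factor of 60 in speed at 4 times the hardware cost.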
Acknowledgement
Prof. Dr. C. Ochsenfeld
Dr. Simon Maurer
Group
Thank you for your attention...