How does hardware development influence quantum chemistry?
Mickaël Delcey, Per-Åke Malmqvist, Steven Vancoillie, Valera Veryazov
Theoretical Chemistry, Chemical Center, P.O.B. 124, Lund 22100, Sweden. E-mail: [email protected]

Processor capacities have increased exponentially, following the famous Moore’s law. The development of computational units has now taken a new direction: instead of increasing the number of transistors on a single chip, various parallelization techniques are used: communication between CPUs over a network, communication over the bus on the same mainboard, and finally the combination of several computational cores on the same CPU. These hardware solutions differ in the speed of data transfer between computational units, in the transfer speed between CPU and memory, and in the cache size per CPU, so different computational algorithms benefit differently from them.

Parallel computing

The MOLCAS code [2] uses three kinds of parallelization: independent calculations (Numerical Gradients), computation of parts of the integrals (SEWARD plus the wave-function calculation), and fine-grained parallelization (the multi-particle density matrices in CASSCF and CASPT2). Except for the last kind, the amount of data transferred between CPUs is fairly small. We selected the benzene dimer as a test case, with 528 basis functions, which corresponds to an average MOLCAS calculation.

Multicores

[Figure: speed-up for parallelization over n cores, for Num. Grad, SEWARD, CASSCF, CASPT2 and the Cholesky variants Cho-SEWARD, Cho-CASSCF, Cho-CASPT2.]

The scaling of the different modules for parallelization over cores is shown in the figure above. Clearly, parallelization over cores is not efficient, although the modules behave differently: Numerical Gradients and SEWARD see a small improvement from using more cores on the same CPU, while the CASSCF and CASPT2 codes are even slower in parallel. The situation is better with the Cholesky decomposition technique, where I/O is less important.

Multinodes

[Figure: timings (speed-up) for parallelization over n nodes, for SEWARD, CASSCF, CASPT2, RASPT2 and the Cholesky variants Cho-SEWARD, Cho-CASSCF, Cho-CASPT2.]

The figure above shows the results of parallelization over up to 24 nodes. The tests were made on a LUNARC cluster with a relatively slow Gigabit interconnect. Numerical Gradients show perfect scaling until the number of nodes exceeds the number of displacements. Computing and using the integrals requires little data transfer and thus also scales almost perfectly. The result of RASPT2 for a polyyne system is shown as well, as representative of small molecules with a big active space. The scaling is clearly better for this system, since so far only the generation of the many-particle density matrices is parallelized in the RASPT2/CASPT2 code, and this step dominates for a large active space.

Leading CPU manufacturers claim that in the near future CPUs with 96 or more cores will be available. Quantum chemical codes should be adapted in order to use these technologies efficiently.

SSD

Solid State Drives and cheap memory cards, used e.g. in cameras or as USB memory sticks, have quite different read and write timings compared to conventional disks. A test MOLCAS job running on an external flash drive (connected via the USB 2.0 interface) showed a surprising result: RASSCF ran slightly faster, and CASPT2 was only 17% slower, than with a conventional HDD.

Graphical Processing Units (GPUs)

A modern GPU is a highly parallel and multi-threaded processor with very high flop performance and internal memory bandwidth. Nvidia has made these features available for non-graphical tasks, creating an environment that includes code-development tools and the high-performance libraries CuBLAS and CuFFT. Quantum chemistry software is typically rich in BLAS calls. This suggests a simple approach to GPU acceleration of quantum chemistry code: can we simply use a BLAS library that implements the BLAS calls by performing the corresponding CuBLAS ones? The idea is simple enough to be tested at once: we ran MOLCAS with the only modification that the matrix-multiplication code was replaced by a wrapper to CuBLAS_DGEMM. Test calculations were done on a Tesla GPU provided by STFC Daresbury Laboratory. However, the transfer of data between the GPU card and the CPU goes through temporarily allocated memory and takes time; it is fastest if a portion of memory can be 'pinned' and reused for several calls. A minimal sketch of such a wrapper is given below.
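To make the idea concrete, the following is a minimal sketch, not the actual MOLCAS wrapper, of how the Fortran BLAS symbol dgemm_ can be intercepted and redirected to the CuBLAS v2 API: the operands are copied to the GPU, cublasDgemm performs the multiplication, and the result is copied back. The file name, error checking and the link-time symbol juggling are left out as assumptions of the sketch; for real matrices the transpose flags 'T' and 'C' are treated alike.

    /* cublas_wrapper.c -- illustrative sketch only, not the MOLCAS code:
     * redirect the Fortran BLAS call dgemm_ to CuBLAS.  Build against the
     * CUDA toolkit headers and link with -lcublas -lcudart (assumption). */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    static cublasHandle_t handle;
    static int have_handle = 0;

    static cublasOperation_t to_op(char t)
    {   /* map the BLAS transpose character to a CuBLAS enum
           ('T' and 'C' coincide for real matrices) */
        return (t == 'N' || t == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    }

    void dgemm_(const char *ta, const char *tb,
                const int *m, const int *n, const int *k,
                const double *alpha, const double *A, const int *lda,
                const double *B, const int *ldb,
                const double *beta, double *C, const int *ldc)
    {
        if (!have_handle) { cublasCreate(&handle); have_handle = 1; }

        /* number of stored columns of A and B (column-major storage) */
        size_t colsA = (*ta == 'N' || *ta == 'n') ? (size_t)*k : (size_t)*m;
        size_t colsB = (*tb == 'N' || *tb == 'n') ? (size_t)*n : (size_t)*k;
        size_t szA = sizeof(double) * (size_t)(*lda) * colsA;
        size_t szB = sizeof(double) * (size_t)(*ldb) * colsB;
        size_t szC = sizeof(double) * (size_t)(*ldc) * (size_t)(*n);

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, szA);   /* paged-memory version: every  */
        cudaMalloc((void **)&dB, szB);   /* call allocates, copies and   */
        cudaMalloc((void **)&dC, szC);   /* frees the device buffers     */
        cudaMemcpy(dA, A, szA, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, szB, cudaMemcpyHostToDevice);
        cudaMemcpy(dC, C, szC, cudaMemcpyHostToDevice);  /* needed when beta != 0 */

        cublasDgemm(handle, to_op(*ta), to_op(*tb), *m, *n, *k,
                    alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

        cudaMemcpy(C, dC, szC, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

With such a wrapper every matrix multiplication in MOLCAS goes through CuBLAS; whether this pays off depends on the matrix sizes, as the timings below show.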
CuBLAS

[Figure: timings for n x n matrix multiplication, conventional BLAS DGEMM vs. CuBLAS with paged and with pinned memory. Asymptotic speeds: conventional 2.9 Gflop/s; CUDA with paged memory 86.9 Gflop/s; CUDA with pinned memory 60.0 Gflop/s (assuming the pinned memory is reused for ten calls). Crossing points with the conventional curve: n = 798 (t = 0.40 s) and n = 1845 (t = 4.5 s).]

Using CuBLAS_DGEMM is clearly faster for very large matrices, and the RASSCF and CASPT2 codes spend up to 80% of their time in DGEMM. But do we have such large matrices in a typical calculation? A polyyne was used as a representative RASSCF calculation, with 234 basis functions and a 4/8/4 active space. The statistics of the DGEMM calls for this test case show that the percentage of large matrices is very small. In addition, these matrices are asymmetric, whereas CuBLAS_DGEMM is most efficient for square matrices. Efficiently batched algorithms such as the one in MOLCAS therefore gain less from GPU technology. An increase of the connection speed between CPU and GPU might change this conclusion completely.

[Figure: matrix sizes in the RASSCF DGEMM calls; number of calls (1 to 10^8) versus matrix size m.]

When is CUDA worthwhile?

[Figure: accumulated time, % of total, versus the time spent in each DGEMM call.]

100% = 2090 seconds = the total time for 109666421 DGEMM calls.
80% = the total time for the 109665333 calls that were shorter than 0.02 seconds, i.e. too short for CUDA.
20% = the total time for the 1088 calls that were large enough to allow for the overhead of the CUDA calls.

Only a small fraction of the DGEMM time is thus spent in calls that are large enough for the GPU. A dispatch that routes only those calls to CuBLAS, reusing a pinned buffer across calls, is sketched below.
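As a minimal sketch of such a size-based dispatch, assuming square matrices and a crossover dimension of our own choosing (N_GPU = 1024, between the two measured crossing points), the routine below keeps small products on the conventional host BLAS and sends large ones through CuBLAS via a pinned staging buffer that is allocated once and reused. The names gemm_dispatch, ensure_pinned and N_GPU are ours, not MOLCAS's; error handling and device-buffer caching are omitted.

    /* Illustrative sketch, not the MOLCAS code: route square matrix
     * products to the GPU only above a size threshold, and reuse one
     * pinned host buffer for the transfers.                            */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <string.h>

    /* the ordinary host BLAS routine (Fortran symbol); in this sketch it
       is NOT intercepted, unlike in the wrapper shown earlier           */
    extern void dgemm_(const char *, const char *, const int *, const int *,
                       const int *, const double *, const double *, const int *,
                       const double *, const int *, const double *, double *,
                       const int *);

    enum { N_GPU = 1024 };          /* assumed crossover dimension           */
    static cublasHandle_t handle;
    static int have_handle = 0;
    static double *pinned = 0;      /* pinned staging buffer, allocated once */
    static size_t pinned_len = 0;   /* its current size in doubles           */

    static void ensure_pinned(size_t n_doubles)
    {
        if (n_doubles > pinned_len) {
            if (pinned) cudaFreeHost(pinned);
            cudaMallocHost((void **)&pinned, n_doubles * sizeof(double));
            pinned_len = n_doubles;
        }
    }

    /* C = A*B for square n x n matrices, routed by size. */
    void gemm_dispatch(int n, const double *A, const double *B, double *C)
    {
        const double one = 1.0, zero = 0.0;
        if (n < N_GPU) {                    /* small product: host BLAS wins */
            dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
            return;
        }
        if (!have_handle) { cublasCreate(&handle); have_handle = 1; }

        size_t len = (size_t)n * (size_t)n;
        ensure_pinned(3 * len);             /* room for A, B and the result  */
        memcpy(pinned, A, len * sizeof(double));
        memcpy(pinned + len, B, len * sizeof(double));

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, len * sizeof(double));
        cudaMalloc((void **)&dB, len * sizeof(double));
        cudaMalloc((void **)&dC, len * sizeof(double));
        cudaMemcpy(dA, pinned, len * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, pinned + len, len * sizeof(double), cudaMemcpyHostToDevice);

        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dA, n, dB, n, &zero, dC, n);

        cudaMemcpy(pinned + 2 * len, dC, len * sizeof(double), cudaMemcpyDeviceToHost);
        memcpy(C, pinned + 2 * len, len * sizeof(double));
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

In a real code the matrices are generally rectangular and the device buffers would also be worth caching; the sketch only illustrates the size test and the reuse of pinned memory.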
Conclusions

Hardware development is not driven by the needs of computational chemistry. Quite often the new technologies cannot be used directly and require modifications of the application software. Substantial time can be needed to adjust computational codes so that they take advantage of new computer architectures.

References

[1] W. A. de Jong, E. Bylaska, N. Govind, C. L. Janssen, K. Kowalski, T. Müller, I. M. B. Nielsen, H. J. J. van Dam, V. Veryazov and R. Lindh, Utilizing High Performance Computing for Chemistry: Parallel Computational Chemistry, PCCP 12, 6896-6920 (2010).
[2] F. Aquilante, L. De Vico, N. Ferré, G. Ghigo, P.-Å. Malmqvist, P. Neogrády, T. B. Pedersen, M. Pitoňák, M. Reiher, B. O. Roos, L. Serrano-Andrés, M. Urban, V. Veryazov and R. Lindh, MOLCAS 7: The Next Generation, J. Comput. Chem. 31, 224-247 (2010).