Two-level parallelization of CPU-GPU hybrid large scale discrete element simulation
Ji Xu
Institute of Process Engineering, Chinese Academy of Sciences

Contents
1 • Introduction
2 • Algorithms
3 • Applications

Introduction

Discrete Particle Systems
‒ Natural phenomena
‒ Drug storage
‒ Grain storage
‒ Chemical industry

DEM: Discrete Element Method
‒ Discrete Element Method (DEM), introduced by P. A. Cundall & O. D. L. Strack
‒ DEM tracks every single particle in the system
‒ Very good for investigating discrete particle systems, especially phenomena occurring at the length scale of a particle diameter
‒ Huge computational cost for modeling larger-scale systems
  ‒ e.g. $V_{system} = 1$ L, $d = 100$ μm → ~$10^8$ particles

Models in DEM
‒ Control equations
  $m_i \frac{d\mathbf{V}_i}{dt} = m_i \mathbf{g} + \sum_j \mathbf{F}_{ij}$
  $I_i \frac{d\boldsymbol{\omega}_i}{dt} = \sum_j \mathbf{M}_{ij}$
‒ Contact model
  ‒ Contact starts at overlap
  ‒ Normal and tangential spring terms ($k_n$, $k_t$) plus damping terms:
    $\mathbf{F}_{ij} = k_n \delta_n \mathbf{n}_{ij} - \gamma_n \mathbf{v}_{n,ij} + k_t \delta_t \mathbf{t}_{ij} - \gamma_t \mathbf{v}_{t,ij}$
    (the overlap $\delta$, the damping coefficient $\gamma$ and the signs are assumed, following the usual spring-dashpot convention)
‒ Irregular shaped objects: multi-sphere approach with overlapping spheres
  ‒ Contact model is simple
  ‒ Computational cost is high

Why Use the GPU?
‒ The GPU has evolved into a very flexible and powerful processor
  ‒ It offers lots of GFLOPS
  ‒ SIMT is suitable for DEM simulation

Algorithms

Flowchart of DEM Simulation
Specify the initial conditions
‒ N elements
‒ positions, velocities, rotation
‒ boundary conditions
‒ contact models
‒ etc.
Main loop of simulation
‒ Compute all forces
‒ Integrate motion
Compute system properties
‒ analysis

Single GPU

Algorithms: Neighbor List
‒ Neighbor search based on cells
‒ Cell interaction region
  ‒ Primary cell
  ‒ Neighbor cells
  ‒ Cutoff radius
‒ Particles with different radii
[Figure: primary cell and neighbor cells]

Algorithms: Neighbor Searching
[Figure: fixed cutoff vs. varying cutoff]

Neighbor Searching
‒ Particles are handled concurrently
‒ The search is based on cells
‒ One cell per thread block

Force Computing
‒ Particles are handled concurrently
‒ One thread per particle
‒ Prerequisite variables
  ‒ index of this block: bid
  ‒ thread index within a block: tid
  ‒ number of threads in a block: M
  ‒ number of blocks run on the GPU: N/M
  ‒ neighbors of each particle: Nblist

Integration
‒ The explicit Verlet integration method is adopted [Verlet, L., 1967. Physical Review 159, 98-103]
‒ CPU implementation
  ‒ Particles are handled serially, one after another
‒ GPU implementation
  ‒ Naturally SIMD
  ‒ Particles are handled concurrently
  ‒ One GPU thread per particle

Performance
Case   Region/m^3   N/10^4   PSS/10^7
1      0.8^3        0.37     2.33
2      1.2^3        1.27     3.18
3      1.6^3        3.05     3.25
4      2.0^3        6.03     3.29
5      2.4^3        10.50    3.24
6      2.8^3        16.77    3.24
7      3.2^3        25.16    3.25
8      3.6^3        35.96    3.29
9      4.0^3        49.49    3.32
10     4.4^3        66.06    3.31
11     4.8^3        85.97    3.37
12     5.2^3        109.5    3.39
PSS: particles × steps / second
[Figure: time percentage of each algorithm]

Performance
GPU/CPU speedup of each algorithm
‒ bin is slower than on the CPU
‒ update has the highest speedup
‒ collide speedup increases with N
‒ nblist speedup is higher than that of collide
GPU/CPU overall speedup
‒ much faster than the CPU
‒ single precision is much faster than double precision

Multiple GPUs

Task Partitioning
Domain decomposition
‒ Multiple GPUs in multiple nodes
‒ Decomposition depends on the property of the space
Regular space
‒ The space is contiguous, with no blank region separating it
Irregular space
‒ Blank regions separate the space

Regular Space Decomposition
Whole space is partitioned into sub-domains
‒ 1/2/3 dimensional
‒ equal / unequal (considering static load balance)
[Figure: 1D, 2D and 3D partitions into Task0, Task1, Task2, Task3]

Regular Space Decomposition: Communication
Communication methods
‒ Real space: particles in this region are computed by the owning process
‒ Virtual/ghost space: particles communicated from neighbor processes
‒ The 'Shift' communication method: X -> Y -> Z

Two-level Irregular Space Decomposition
First level: the whole space is partitioned into sub-domains according to the space property
‒ The resulting sub-domains are of regular shapes
[Figure: first-level sub-domains numbered 1 to 11]

Two-level Irregular Space Decomposition
Second level: each sub-domain is partitioned further in the regular space decomposition fashion
Communication methods
‒ Second-level sub-domains: 'Shift' method
‒ First-level sub-domains: 'P2P'
Determining the crossing space between two sub-domains is non-trivial
‒ Staggered space relationship
[Figure: first-level sub-domains numbered 1 to 11]

Computing & Communication Overlap
Asynchronous GPU computing & memory copy
1D partition
Timeline
‒ Overlap: Comp. outmost, then Comp. inner real alongside Comm. outmost real
‒ No overlap: Comp. all particles, then Comm. outmost real

Applications

Particles Packing
Simulation parameters
Parameter                      Value
Particle number                19052
Particle radius                2.5 mm
Gravity                        9.81 m/s^2
Density                        2500 kg/m^3
Young's modulus                5.0 MPa
Poisson ratio                  0.45
Restitution coefficient        0.1
Friction coefficient           0.5
Rolling friction coefficient   0.2
Cohesion energy density        0.0
Characteristic velocity        2.0
Vibration direction            X direction
Amplitude                      0.5 mm
Frequency                      31.83 Hz

Flow of Nonspherical Particles
Deposition of pyramid-shaped particles in the bend
‒ particles drop quickly in the vertical tube
‒ they flow slowly at the bend
‒ creep flow can last a long time

Repose of Nonspherical Particles
Effect of particle shape on repose angle
Shape      Repose angle
Sphere     30.96°
Bar        36.87°
Diamond    33.66°
Pyramid    34.41°

Baosteel Blast Furnace Simulation
‒ Particle number: $10^8$
‒ Particle diameter ratio: 1-10
‒ Experiment: high risk & cost
‒ Commercial software: slow & inaccurate
‒ Our method

Screw Conveyor
‒ Model system
‒ Virtual process engineering (VPE) platform
‒ Control module: adjust the parameters or the operating conditions while the simulation is running
‒ Post-processing module & visualization module run on-line
‒ Significantly accelerates equipment design or optimization

Extension to Gas-solid Flow
‒ Virtual Process Engineering (VPE)
[Figure labels: lab scale; industrial equipment; measurement instruments; RTD; real time ≈ 1 h; traditional method; simulation results; experimental results; < 100 s; parallel visualization; system console (user interface); EMMS-DPM whole process; control system]
‒ 3D full-loop MTO simulation with EMMS-DPM, speed: 2 s/day

Acknowledgment
EMMS Group: www.emms.cn
Thanks for Your Attention!
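
Supplementary Sketch: Cell-Based Neighbor Search
The neighbor-list slides describe binning particles into cells and searching the primary cell plus its neighbor cells, with a cutoff that depends on the radii of the two particles. The sketch below illustrates that idea in serial Python for readability; it is not the GPU code behind the talk. The function name build_neighbor_list, the skin parameter and the choice of cell edge (largest particle diameter plus skin) are assumptions made for this illustration. On a GPU, the loop over cells would roughly correspond to the one-cell-per-thread-block mapping mentioned in the talk.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def build_neighbor_list(positions, radii, skin=0.0):
    """Return {i: sorted list of j} for pairs with |x_i - x_j| <= r_i + r_j + skin.

    Hypothetical helper for illustration; assumes at least one particle
    and a positive cell size (max radius or skin greater than zero).
    """
    # Assumption: the cell edge covers the largest possible pair cutoff, so
    # every neighbor of a particle lies in its primary cell or in one of
    # the 26 surrounding cells.
    cell_size = 2.0 * max(radii) + skin

    # Binning step: map each particle to an integer cell index.
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        cells[tuple(floor(c / cell_size) for c in p)].append(i)

    neighbors = {i: [] for i in range(len(positions))}
    offsets = list(product((-1, 0, 1), repeat=3))

    # Search step: for each primary cell, scan itself and its neighbor cells.
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            others = cells.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in others:
                    # Varying cutoff: it depends on the radii of both particles.
                    cutoff = radii[i] + radii[j] + skin
                    if i != j and dist(positions[i], positions[j]) <= cutoff:
                        neighbors[i].append(j)

    return {i: sorted(js) for i, js in neighbors.items()}
```

Calling build_neighbor_list with a list of (x, y, z) tuples and a matching list of radii returns, for each particle index, the indices of the particles within its pair cutoff. Sizing the cell from the largest radius keeps the search to 27 cells at the price of more candidate pairs when the radii differ widely.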