
Two-level parallelization of CPU-GPU hybrid
large scale discrete element simulation
Ji Xu
Institute of Process Engineering
Chinese Academy of Sciences
Contents
1. Introduction
2. Algorithms
3. Applications
Introduction
Discrete Particle Systems
[Figure: examples of discrete particle systems: natural phenomena, drug storage, grain storage, and the chemical industry]
DEM: Discrete Element Method
 Discrete Element Method (DEM) ─ P. A. Cundall & O. D. L. Strack
 DEM tracks every single particle in the system
 very good for investigating discrete particle systems, especially the phenomena occurring at the length scale of a particle diameter
 huge computational cost for modeling larger-scale systems
‒ e.g. V_system = 1 L, d = 100 μm → ~10⁸ particles (a rough estimate follows)
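As a rough check of that count (the packing fraction $\phi$ is my assumption; it is not stated on the slide), divide the solids volume by the volume of one sphere:

$$N \approx \frac{\phi\, V_{\mathrm{system}}}{\pi d^3/6} = \frac{\phi \times 10^{-3}\,\mathrm{m}^3}{\tfrac{\pi}{6}\,(10^{-4}\,\mathrm{m})^3} \approx \phi \times 1.9 \times 10^{9},$$

so a 1 L vessel holds on the order of $10^8$ particles even when only modestly filled ($\phi \approx 0.05$), and approaching $10^9$ at dense packing.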
Models in DEM
Governing equations (translational and rotational motion of particle $i$):

$$m_i \frac{d\mathbf{V}_i}{dt} = m_i \mathbf{g} + \sum_j \mathbf{F}_{ij}, \qquad I_i \frac{d\boldsymbol{\omega}_i}{dt} = \sum_j \mathbf{M}_{ij}$$

Contact model: a contact starts when two particles overlap, and the linear spring-dashpot force is

$$\mathbf{F} = \left(k_n \delta_n \mathbf{n}_{ij} - \eta_n \mathbf{v}_{n,ij}\right) + \left(k_t \delta_t \mathbf{t}_{ij} - \eta_t \mathbf{v}_{t,ij}\right)$$
Irregularly shaped objects: multi-sphere approach, representing an irregular particle as a rigid cluster of overlapping spheres
 the contact model stays simple (only sphere-sphere contacts)
 the computational cost is high
Why Use the GPU?
 The GPU has evolved into a very flexible and powerful processor
‒ it offers very high floating-point throughput (GFLOPS)
‒ its SIMT execution model is well suited to per-particle DEM computations
Algorithms
Flowchart of DEM Simulation
 Specify the initial conditions
‒ N elements
‒ positions, velocities, rotations
‒ boundary conditions
‒ contact models
‒ etc.
 Main loop of simulation (a sketch follows this list)
‒ compute all forces
‒ integrate the equations of motion
 Compute system properties
‒ analysis
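A minimal host-side sketch of this flow, with empty stub kernels standing in for the real neighbor-list, force, and integration kernels sketched later in the deck (all names, intervals, and launch parameters are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

// Stub kernels; the real ones are sketched on the following slides.
__global__ void buildNeighborList() {}
__global__ void computeForces() {}
__global__ void integrate() {}

int main()
{
    const int nSteps = 1000, rebuildEvery = 10, sampleEvery = 100;
    const int nBlocks = 64, threads = 256;

    for (int step = 0; step < nSteps; ++step) {
        if (step % rebuildEvery == 0)
            buildNeighborList<<<nBlocks, threads>>>();  // cell-based search
        computeForces<<<nBlocks, threads>>>();          // contact model
        integrate<<<nBlocks, threads>>>();              // explicit Verlet
        if (step % sampleEvery == 0) {
            cudaDeviceSynchronize();
            /* compute system properties on the host */
        }
    }
    cudaDeviceSynchronize();
    return 0;
}
```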
Single GPU
Algorithms: Neighbor List
 Neighbor search based on cells
‒ primary cell
‒ neighbor cells
‒ cutoff radius
 Particles with different radii
[Figure: the cell interaction region (a primary cell and its neighbor cells)]
Algorithms: Neighbor Searching
[Figure: neighbor searching with a fixed cutoff vs. a varying cutoff]
Neighbor Searching
 Particles are handled in parallel
 Searching is based on the cell
 One cell ↔ one thread block (see the sketch below)
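A sketch of that mapping: one thread block per cell, with the block's threads sharing the particles of the primary cell and scanning the 3×3×3 block of neighbor cells. It assumes particles are pre-sorted by cell index so that cellStart[c]..cellEnd[c] bounds cell c's particles; all names (MAX_NB, skin, ...) are illustrative, not from the talk.

```cuda
#include <cuda_runtime.h>

#define MAX_NB 64   // maximum neighbors stored per particle

__device__ int cellIndex(int x, int y, int z, int3 dims)
{
    return (z * dims.y + y) * dims.x + x;
}

__global__ void buildNeighborList(const float4* pos,   // xyz position, radius in w
                                  const int* cellStart, const int* cellEnd,
                                  int* nbList, int* nbCount,
                                  float skin, int3 dims)
{
    int c  = blockIdx.x;                 // one cell <-> one thread block
    int cz = c / (dims.x * dims.y);
    int cy = (c / dims.x) % dims.y;
    int cx = c % dims.x;

    // the block's threads share the particles of the primary cell
    for (int i = cellStart[c] + threadIdx.x; i < cellEnd[c]; i += blockDim.x) {
        float4 pi = pos[i];
        int n = 0;
        for (int dz = -1; dz <= 1; ++dz)              // scan neighbor cells
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy, nz = cz + dz;
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= dims.x || ny >= dims.y || nz >= dims.z)
                continue;                             // non-periodic boundaries
            int nc = cellIndex(nx, ny, nz, dims);
            for (int j = cellStart[nc]; j < cellEnd[nc]; ++j) {
                if (j == i) continue;
                float4 pj  = pos[j];
                float3 d   = make_float3(pi.x - pj.x, pi.y - pj.y, pi.z - pj.z);
                float  cut = pi.w + pj.w + skin;      // varying cutoff: sum of radii
                if (d.x*d.x + d.y*d.y + d.z*d.z < cut*cut && n < MAX_NB)
                    nbList[i * MAX_NB + n++] = j;
            }
        }
        nbCount[i] = n;
    }
}
```

The cell size must be at least the largest cutoff, so that all neighbors are guaranteed to lie in the 27 surrounding cells.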
Force Computing
 Particles are handled in parallel
 One particle is dealt with by one thread
 Prerequisite variables (used in the kernel sketch below)
‒ index of this block: bid
‒ thread index within a block: tid
‒ number of threads in a block: M
‒ number of blocks run on the GPU: N/M
‒ neighbor list of each particle: Nblist
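A sketch of the per-particle force kernel using the slide's indexing (bid, tid, M threads per block, N/M blocks, and the neighbor list). Only the normal spring-dashpot term of the contact model is shown; the data layout and parameter names are my assumptions.

```cuda
#include <cuda_runtime.h>

#define MAX_NB 64   // must match the neighbor-list kernel

__global__ void computeForces(const float4* pos,   // xyz position, radius in w
                              const float4* vel,   // xyz velocity
                              const int* nbList, const int* nbCount,
                              float3* force,
                              float kn, float etan, float mass, float3 g, int N)
{
    int bid = blockIdx.x, tid = threadIdx.x, M = blockDim.x;
    int i = bid * M + tid;               // one particle <-> one thread
    if (i >= N) return;

    float4 pi = pos[i];
    float4 vi = vel[i];
    float3 f  = make_float3(mass * g.x, mass * g.y, mass * g.z);  // gravity

    for (int k = 0; k < nbCount[i]; ++k) {          // loop over the neighbor list
        int j = nbList[i * MAX_NB + k];
        float4 pj = pos[j];
        float3 d  = make_float3(pi.x - pj.x, pi.y - pj.y, pi.z - pj.z);
        float dist  = sqrtf(d.x*d.x + d.y*d.y + d.z*d.z);
        float delta = pi.w + pj.w - dist;           // overlap: contact if > 0
        if (delta > 0.0f && dist > 0.0f) {
            float3 nrm = make_float3(d.x/dist, d.y/dist, d.z/dist);
            float  vn  = (vi.x - vel[j].x) * nrm.x +
                         (vi.y - vel[j].y) * nrm.y +
                         (vi.z - vel[j].z) * nrm.z; // normal relative velocity
            float  fn  = kn * delta - etan * vn;    // spring minus dashpot
            f.x += fn * nrm.x;  f.y += fn * nrm.y;  f.z += fn * nrm.z;
        }
    }
    force[i] = f;
}
```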
Integration
 The explicit Verlet integration method is adopted
‒ [Verlet, L., 1967. Physical Review 159, 98-103]
 CPU implementation
‒ particles are handled serially, one after another
 GPU implementation
‒ naturally SIMD
‒ particles are handled in parallel
‒ one GPU thread ↔ one particle (sketch below)
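A sketch of the one-thread-per-particle position-Verlet update, $x(t+\Delta t) = 2x(t) - x(t-\Delta t) + a(t)\,\Delta t^2$; the talk does not show its exact variant, so array names and layout are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void integrate(float3* pos, float3* posPrev,
                          const float3* force, float mass, float dt, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one GPU thread <-> one particle
    if (i >= N) return;

    float  c = dt * dt / mass;                      // a*dt^2 = (F/m)*dt^2
    float3 p = pos[i], q = posPrev[i], f = force[i];
    float3 pNew = make_float3(2.0f*p.x - q.x + f.x*c,
                              2.0f*p.y - q.y + f.y*c,
                              2.0f*p.z - q.z + f.z*c);
    posPrev[i] = p;       // the current position becomes the previous one
    pos[i]     = pNew;    // velocities, if needed, follow from (pos - posPrev)/dt
}
```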
Performance

Case   Region/m³   N/10⁴    PSS/10⁷
1      0.8³         0.37    2.33
2      1.2³         1.27    3.18
3      1.6³         3.05    3.25
4      2.0³         6.03    3.29
5      2.4³        10.50    3.24
6      2.8³        16.77    3.24
7      3.2³        25.16    3.25
8      3.6³        35.96    3.29
9      4.0³        49.49    3.32
10     4.4³        66.06    3.31
11     4.8³        85.97    3.37
12     5.2³       109.5     3.39

PSS: particle·steps per second
[Figure: time percentage of each algorithm]
Performance
 GPU/CPU speedup of each algorithm
‒ bin is slower than on the CPU
‒ update has the highest speedup
‒ collide's speedup increases with N
‒ nblist's speedup is higher than collide's
 Overall GPU/CPU speedup
‒ much faster than the CPU
‒ single precision is much faster than double precision
Multiple GPUs
Task Partitioning
 Domain decomposition
‒ multiple GPUs in multiple nodes
‒ the decomposition follows the spatial properties of the system
 Regular space
‒ the space is a single whole, with no blank region separating it
 Irregular space
‒ blank regions separate the space
Regular Space Decomposition
 The whole space is partitioned into sub-domains (an MPI sketch follows)
‒ 1-, 2-, or 3-dimensional
‒ equal / unequal (considering static load balance)
[Figure: 1D, 2D, and 3D partitions into Task0 … Task3]
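A minimal sketch of such a regular decomposition with MPI's Cartesian topology; MPI_Dims_create picks a balanced factorization of the process count, and an unequal, statically load-balanced partition would simply adjust the per-rank bounds. Variable names are illustrative.

```cuda
#include <mpi.h>

void decompose(int ndims /* 1, 2 or 3 */)
{
    int nprocs, rank, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
    MPI_Comm cart;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, ndims, dims);          // e.g. 8 ranks -> 2 x 2 x 2
    MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, /*reorder=*/1, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, ndims, coords);    // this rank's sub-domain index
    // equal partition: sub-domain bounds are lo[d] = coords[d] * L[d] / dims[d]
}
```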
Regular Space Decomposition: Communication
 Communication methods
‒ real space: particles in this region are computed locally
‒ virtual/ghost space: particles communicated from neighbor processes
‒ the 'Shift' communication method: exchange along X, then Y, then Z (sketch below)
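A sketch of the 'Shift' ghost exchange: one paired send/receive per dimension, in the order X → Y → Z. Because ghosts received in X are forwarded again in Y and Z, edge and corner neighbors are reached with only six exchanges instead of 26 point-to-point messages. Buffer packing is omitted and names are illustrative.

```cuda
#include <mpi.h>

void shiftExchange(MPI_Comm cart, MPI_Datatype ptype,
                   void* sendLo[3], void* sendHi[3], int nLo[3], int nHi[3],
                   void* recvLo[3], void* recvHi[3], int maxGhost)
{
    for (int dim = 0; dim < 3; ++dim) {            // X, then Y, then Z
        int lo, hi;
        MPI_Cart_shift(cart, dim, 1, &lo, &hi);
        // real particles near the +face go to 'hi'; ghosts arrive from 'lo'
        MPI_Sendrecv(sendHi[dim], nHi[dim], ptype, hi, 0,
                     recvLo[dim], maxGhost, ptype, lo, 0,
                     cart, MPI_STATUS_IGNORE);
        // and symmetrically toward the -face
        MPI_Sendrecv(sendLo[dim], nLo[dim], ptype, lo, 1,
                     recvHi[dim], maxGhost, ptype, hi, 1,
                     cart, MPI_STATUS_IGNORE);
    }
}
```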
Two-level Irregular Space Decomposition
 First level: the whole space is partitioned into sub-domains according to the spatial properties
‒ the resulting sub-domains have regular shapes
[Figure: an irregular domain split into 11 regular first-level sub-domains]
Two-level Irregular Space Decomposition
 Second level: each sub-domain is further partitioned in the regular space decomposition fashion
 Communication methods
‒ within a second-level sub-domain: the 'Shift' method
‒ between first-level sub-domains: point-to-point ('P2P')
 Determining the crossing space between two sub-domains is non-trivial (a sketch follows)
‒ staggered space relationship
[Figure: the 11 first-level sub-domains, each partitioned at the second level]
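A hypothetical sketch of one way to find the crossing space between two first-level sub-domains: intersect one domain's halo-extended bounding box with the other's real box. With axis-aligned boxes, the staggered case is simply a partial overlap in some dimensions; this is my illustration, not the talk's actual algorithm.

```cuda
#include <math.h>

typedef struct { double lo[3], hi[3]; } Box;

// Returns 1 and fills 'out' with the crossing space if boxes a and b
// overlap, 0 otherwise.
int crossingSpace(const Box* a, const Box* b, Box* out)
{
    for (int d = 0; d < 3; ++d) {
        out->lo[d] = fmax(a->lo[d], b->lo[d]);
        out->hi[d] = fmin(a->hi[d], b->hi[d]);
        if (out->lo[d] >= out->hi[d]) return 0;   // no crossing in this dimension
    }
    return 1;   // particles inside 'out' are exchanged via P2P
}
```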
Computing & Communication Overlap
 Asynchronous GPU computing & memory copy (see the streams sketch below)
 1D partition
 Timeline
‒ with overlap: compute the outermost layer first, then compute the inner real region while communicating the outermost real particles
‒ without overlap: compute all particles, then communicate the outermost real particles
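A sketch of that overlap with two CUDA streams, following the slide's timeline: the outermost (boundary) layer is computed first so its data can be copied off the GPU and exchanged while the inner real region is still being computed. Kernel and buffer names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void computeForcesRange(int first, int count) { /* stub */ }

void stepWithOverlap(float* dBoundary, float* hBoundary, size_t bytes,
                     int nOuter, int nInner, int threads)
{
    cudaStream_t sOuter, sInner;
    cudaStreamCreate(&sOuter);
    cudaStreamCreate(&sInner);

    // 1) compute the outermost real particles first
    computeForcesRange<<<(nOuter + threads - 1) / threads, threads, 0, sOuter>>>(0, nOuter);
    // 2) copy their data to the host asynchronously; hBoundary should be
    //    pinned (cudaMallocHost) for the copy to truly overlap ...
    cudaMemcpyAsync(hBoundary, dBoundary, bytes, cudaMemcpyDeviceToHost, sOuter);
    // 3) ... while the inner real particles are computed concurrently
    computeForcesRange<<<(nInner + threads - 1) / threads, threads, 0, sInner>>>(nOuter, nInner);

    cudaStreamSynchronize(sOuter);   // boundary data ready: MPI exchange goes here
    cudaStreamSynchronize(sInner);
    cudaStreamDestroy(sOuter);
    cudaStreamDestroy(sInner);
}
```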
Applications
Particles Packing

Simulation parameter              Value
Particle number                   19052
Particle radius                   2.5 mm
Gravity                           9.81 m/s²
Density                           2500 kg/m³
Young's modulus                   5.0 MPa
Poisson ratio                     0.45
Restitution coefficient           0.1
Friction coefficient              0.5
Rolling friction coefficient      0.2
Cohesion energy density           0.0
Characteristic velocity           2.0
Vibration direction               X
Amplitude                         0.5 mm
Frequency                         31.83 Hz
Flow of Nonspherical Particles
 Deposition of pyramid-shaped particles in a bend
‒ particles drop quickly in the vertical tube
‒ flow slowly at the bend
‒ the creep flow can last a long time
Repose of Nonspherical Particles
 Effect of particle shape on the angle of repose
 Angle of repose
‒ sphere: 30.96°
‒ bar: 36.87°
‒ diamond: 33.66°
‒ pyramid: 34.41°
Baosteel Blast Furnace Simulation
 Particle number: ~10⁸
 Particle diameter ratio: 1-10
 Experiment: high risk & cost
 Commercial software: slow & inaccurate
[Figure: comparison with our method]
Screw Conveyor
 Model system on the Virtual Process Engineering (VPE) platform
 Control module: adjust the parameters or the operating conditions while the simulation is running
 Post-processing module & visualization module run on-line
 Significantly accelerates equipment design and optimization
Extension to Gas-solid Flow
[Diagram: the Virtual Process Engineering (VPE) platform: lab-scale and industrial equipment with measurement instruments feed a system console (user interface) and control system; EMMS-DPM simulates the whole process with parallel visualization; the simulated RTD (< 100 s) is compared with the experimental RTD (real time ≈ 1 h with the traditional method)]
 3D full-loop MTO simulation with EMMS-DPM
‒ speed: 2 s/day
Acknowledgment
EMMS Group: www.emms.cn
Thanks for Your Attention!