High performance computing
for a family of smooth
trajectories using parallel
environments
Bologna,
March 23 - 26, 2004
Gianluca Argentini
Advanced Computing Laboratory
Introduction - 1
The company:
• products for heating and conditioning
• development and production of residential and
industrial burners
• presence of a Center of Excellence for the study of
combustion and flame processes
• R&D Department, extensive CAD (CATIA from IBM/Dassault Systèmes) and FLUENT computations
Introduction - 2
Industrial and power burners have
particular requirements:
• customized study of combustion
head
• study of accurate geometry of
combustion chamber (shape of the
flame, flow of gas or oil and oxygen)
• ventilation and air circulation fans
for correct oxygen supply, proper
pressurization and continuous cooling
• reduction of vibrations and noise
Introduction - 3
Rapid prototyping of the optimal shape of the combustion head
and combustion chamber involves Computational Fluid
Dynamics:
• tracing the streamlines of air or gas flow particles
• shape of the flow in a generic geometry
High graphic resolution requires a large number of
particle paths:
• heavy computational effort, both memory-intensive and CPU-bound
• distribution of the paths over a multiprocessor environment
The problem
Focus on the numerical simulation of flows (in the combustion
head, the chamber or the fan mechanisms)
The large numerical output of the simulation is generated
by Navier-Stokes models (FLUENT package) or
Cellular Automaton models (MATLAB package)
From the data, we want to obtain:
• path tracking of fluid particles, useful for the customized
design of combustion heads and chambers
• smooth 3D visualization of particle trajectories,
possibly with continuous slope and curvature
(analytically: class $C^2$)
About problem treatment
Data → Algebraic curves → Massive computing
Step 1. The data obtained from the simulation
model are processed by an algorithm that
computes the algebraic curves (cubic
polynomials) associated with the particle paths:
• block-data distribution for parallel computing
• need for continuous reallocation in RAM
Step 2. Evaluation of the polynomials on a large
set of values for fine resolution:
• very CPU-expensive computation
• distribution of sets of curves over the processors, no
communication
Fitting the trajectories
From the simulation, a single particle trajectory is a set of 3D
points:
• S is the number of points of a trajectory
• M is the number of trajectories
Interpolation of the points:
• Bezier-like curves are not realistic in case of
twisting or divergence of the velocity field
• Chebyshev or least-squares-like fits are
too rigid for a customized application
• polynomial fitting is simple but often
shows spurious effects such as the Runge-Gibbs
phenomenon
We think a spline-based technique is more useful
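The spurious oscillations of a single high-degree polynomial fit can be reproduced with a minimal sketch. It does not use the simulation data: it interpolates the classical Runge test function 1/(1 + 25x²) on equispaced nodes, and the function names are hypothetical, chosen only for this illustration.

```python
def runge(x):
    """Classical Runge test function on [-1, 1]."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_interpolate(nodes, values, x):
    """Evaluate at x the polynomial passing through (nodes[i], values[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(nodes, values)):
        weight = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def max_interpolation_error(n_nodes, n_samples=401):
    """Largest sampled gap between the Runge function and its interpolant."""
    nodes = [-1.0 + 2.0 * i / (n_nodes - 1) for i in range(n_nodes)]
    values = [runge(x) for x in nodes]
    samples = [-1.0 + 2.0 * i / (n_samples - 1) for i in range(n_samples)]
    return max(abs(lagrange_interpolate(nodes, values, x) - runge(x))
               for x in samples)


# Adding equispaced nodes makes the error near the interval ends grow.
for n in (11, 21):
    print(n, max_interpolation_error(n))
```

The error grows when nodes are added, which is why a piecewise low-degree curve is attractive for long trajectories.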
The splines-based algorithm
Let S = 4 x N: each path is divided into N groups of four points
For every group, the points are interpolated by three
cubic polynomials $u_k$, imposing four analytical conditions:
• passage through the point $P_k$, $1 \le k \le 3$
• passage through the point $P_{k+1}$
• continuous slope at the point $P_k$
• continuous curvature at the point $P_k$
For smooth rendering and to avoid excessive
twisting of the trajectories, the cubics $u_k$ are added to the
Bezier curve $b$ associated with the four points:
$$v = \alpha b + \beta u_k, \qquad 0 < \alpha, \beta < 1$$
Finding the splines
We have chosen $\alpha = \beta = 0.5$
Let $b = As^3 + Bs^2 + Cs + D$ ($0 \le s \le 1$) be the Bezier curve with
control points $P_1, \dots, P_4$; for every spline
$$u_k = at^3 + bt^2 + ct + d \qquad (0 \le t \le 1)$$
the coefficients are computed by (for $2 \le k \le 3$; for $k = 1$ the
formulas are slightly different but of the same algebraic
form; $a$, $b$, $c$, $d$ are 3-dimensional Cartesian vectors)
$$\begin{aligned}
a &= P_{k+1} - P_k - 3B - C - 6 \\
b &= B + 3 \\
c &= 2B + C + 3 \\
d &= P_k
\end{aligned} \tag{1}$$
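As a minimal sketch of this step, the code below works on one Cartesian coordinate at a time. It converts four control points to the power-basis coefficients A, B, C, D of the cubic Bezier curve (the standard Bernstein-to-power conversion) and then transcribes formulas (1) exactly as printed on the slide. The function names and the sample values are hypothetical.

```python
def bezier_power_basis(p1, p2, p3, p4):
    """Coefficients (A, B, C, D) of b(s) = A s^3 + B s^2 + C s + D
    for the cubic Bezier curve with control points p1..p4."""
    coef_a = p4 - 3 * p3 + 3 * p2 - p1
    coef_b = 3 * (p1 - 2 * p2 + p3)
    coef_c = 3 * (p2 - p1)
    coef_d = p1
    return coef_a, coef_b, coef_c, coef_d


def spline_coefficients(pk, pk1, coef_b, coef_c):
    """Formulas (1) as printed, for one coordinate and 2 <= k <= 3."""
    a = pk1 - pk - 3 * coef_b - coef_c - 6
    b = coef_b + 3
    c = 2 * coef_b + coef_c + 3
    d = pk
    return a, b, c, d


def cubic(coeffs, t):
    """Evaluate a t^3 + b t^2 + c t + d with Horner's scheme."""
    a, b, c, d = coeffs
    return ((a * t + b) * t + c) * t + d


# One coordinate of a hypothetical four-point group.
points = [0.0, 1.0, 3.0, 2.0]
A, B, C, D = bezier_power_basis(*points)
u2 = spline_coefficients(points[1], points[2], B, C)  # spline from P2 to P3
print(cubic(u2, 0.0), cubic(u2, 1.0))  # the spline starts at P2 and ends at P3
```

Whatever the values of B and C, the sum a + b + c + d equals $P_{k+1}$, so the passage conditions at both end points hold by construction.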
A matrix for splines
The system (1) can be represented as $\mathbf{c} = T\mathbf{b}$ (matrix-vector multiplication), where
$$\mathbf{c} = (a, b, c, d), \qquad \mathbf{b} = (P_{k+1}, P_k, B, C, 1)$$
$$T = \begin{pmatrix}
1 & -1 & -3 & -1 & -6 \\
0 & 0 & 1 & 0 & 3 \\
0 & 0 & 2 & 1 & 3 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}$$
For every spline, only the vector $\mathbf{b}$ varies;
for a single trajectory, it must be reassigned in
RAM for every group of two points, after the
computation of the corresponding Bezier curve.
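The matrix form can be sketched in a few lines of plain Python, again for a single coordinate. The matrix T is copied from the slide; the helper names `matvec` and `spline_from_matrix` and the input values are hypothetical.

```python
# Constant 4 x 5 matrix T of system (1).
T = [
    [1, -1, -3, -1, -6],
    [0, 0, 1, 0, 3],
    [0, 0, 2, 1, 3],
    [0, 1, 0, 0, 0],
]


def matvec(matrix, vector):
    """Dense matrix-vector product on lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def spline_from_matrix(pk1, pk, coef_b, coef_c):
    """Return (a, b, c, d) as T applied to (P_{k+1}, P_k, B, C, 1)."""
    return matvec(T, [pk1, pk, coef_b, coef_c, 1])


# Only the input vector changes from one two-point group to the next.
print(spline_from_matrix(3.0, 1.0, 3.0, 3.0))  # [-16.0, 6.0, 12.0, 1.0]
```

Because T is constant, it can be allocated once and reused for every group and every trajectory.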
A global matrix for splines
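If we define the global block-diagonal matrix
$$\mathrm{diag}(T, \dots, T) = \begin{pmatrix}
T & 0 & \cdots & 0 \\
0 & T & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & T
\end{pmatrix}$$
with 0 the 4 x 5 zero matrix, we
have a 4M x 5M sparse matrix
(optimization of memory storage
in MATLAB),
and with $\mathbf{B} = (P_{k+1}, P_k, B_1, C_1, 1, \dots, P_{k+1}, P_k, B_M, C_M, 1)$
we can compute, for every two-point group, the
coefficients of the cubic splines for all the M trajectories:
$$\mathbf{C} = \mathrm{diag}(T, \dots, T)\,\mathbf{B}$$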
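The benefit of sparse storage can be illustrated with a small sketch. It is not the MATLAB sparse format: the global matrix is stored here as a Python dictionary holding only the nonzero entries, and the names `block_diagonal_sparse` and `sparse_matvec` are hypothetical. Each copy of T contributes 11 nonzero entries, so the storage grows as 11M instead of 20M².

```python
T = [
    [1, -1, -3, -1, -6],
    [0, 0, 1, 0, 3],
    [0, 0, 2, 1, 3],
    [0, 1, 0, 0, 0],
]


def block_diagonal_sparse(block, copies):
    """Nonzero entries {(row, col): value} of diag(block, ..., block)."""
    n_rows, n_cols = len(block), len(block[0])
    entries = {}
    for m in range(copies):
        for i in range(n_rows):
            for j in range(n_cols):
                if block[i][j] != 0:
                    entries[(m * n_rows + i, m * n_cols + j)] = block[i][j]
    return entries


def sparse_matvec(entries, vector, n_rows):
    """Product of the sparse matrix with a dense vector."""
    result = [0.0] * n_rows
    for (i, j), value in entries.items():
        result[i] += value * vector[j]
    return result


# Two hypothetical trajectories (M = 2), one coordinate each.
M = 2
global_matrix = block_diagonal_sparse(T, M)
global_b = [3.0, 1.0, 3.0, 3.0, 1,    # trajectory 1: P_{k+1}, P_k, B, C, 1
            2.0, 0.0, 1.0, 0.5, 1]    # trajectory 2
coefficients = sparse_matvec(global_matrix, global_b, 4 * M)
print(len(global_matrix), coefficients)
```

The result stacks (a, b, c, d) for each trajectory, so one product yields the spline coefficients of all trajectories for the current two-point group.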
Computational complexity analysis
For every four-point group, and for the M trajectories, the number of flops
(floating point operations) needed to compute the
spline coefficients is:
• for the Bezier curves (customized Matlab script): 316M
• for the global matrix-vector multiplication (upper estimate): 324M
There are N four-point groups in every trajectory: the
total number of flops of Step 1 is about
640MN
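The estimate is simple arithmetic, shown here as a short sketch. The per-group costs 316M and 324M come from the slide; the sample values M = 5000 and N = 1600 are the ones quoted for the final example of the talk, and the function name is hypothetical.

```python
BEZIER_FLOPS_PER_TRAJECTORY = 316   # per four-point group, from the slide
MATVEC_FLOPS_PER_TRAJECTORY = 324   # upper estimate, from the slide


def step1_flops(n_trajectories, n_groups):
    """Total flops of Step 1: (316 + 324) * M * N = 640 * M * N."""
    per_group = (BEZIER_FLOPS_PER_TRAJECTORY + MATVEC_FLOPS_PER_TRAJECTORY)
    return per_group * n_trajectories * n_groups


print(step1_flops(5000, 1600))  # 5120000000, about 5.1e9 flops
```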
A parallel distribution for splines
With P, the number of processes, a divisor of M, the method
used is the distribution of M/P trajectories (rows of the global
matrix) to every process; no communication is involved.
The value of M is important for the RAM occupation
at every computational node.
[Figure: the M trajectories are split among the processes p1, p2, ..., pP; each process executes linearly along the N groups.]
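The block distribution can be sketched as an index computation. This is not the MPI code used on the cluster; it only shows which trajectories each process would own, under the stated assumption that P divides M. The function name is hypothetical.

```python
def block_partition(n_items, n_procs):
    """Split item indices 0..n_items-1 into n_procs contiguous equal blocks."""
    if n_items % n_procs != 0:
        raise ValueError("the number of processes must divide the item count")
    size = n_items // n_procs
    return [range(p * size, (p + 1) * size) for p in range(n_procs)]


# Hypothetical case: M = 12 trajectories on P = 4 processes.
for rank, block in enumerate(block_partition(12, 4)):
    print("process", rank, "owns trajectories", list(block))
```

Each process works only on its own block for all N groups, so no data has to be exchanged during the computation.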
Computing splines: hardware and software
Bezier curves and splines computation on
• Linux cluster of IBM x330 nodes, dual-processor Pentium III 1.133
GHz, at CINECA (2003); C routines and MPI (for parallel
startup and data distribution)
• 2 nodes Windows 2000 / Linux RedHat IBM x440,
dual-processor Xeon 2.4 GHz Hyper-Threading, 2 GB RAM,
at Riello (2003); MATLAB rel. 6.5 scripts (startup of
simultaneous multiple engines)
Computing splines: performance results
Beowulf CINECA:
The measured speedup is
quasi-linear; for high values of P
the cost of distributing the data
(M variable) among the processes
becomes more significant.
x440 cluster:
With Intel HT technology, Win2k
performs better (linear speedup)
than Linux.
Post-processing for splines
We now want a fast method for computing the spline
values on a finely sampled set of parameter ticks.
The CFD packages have some limits in the post-processing
phase:
• resolution based on the pre-processing mesh
• rigid (when at all possible) load distribution among the available
processors
For good graphic visualization,
the interval between two data-points should be divided into a
suitable number of ticks.
Evaluating the splines
Let V + 1 be the number of ticks for each cubic spline
evaluation; then the ticks are
$$(0,\ 1/V,\ 2/V,\ \dots,\ (V-1)/V,\ 1)$$
and the values of the spline parameter used in the computation
are their powers of degree 0, 1, 2 and 3. The value of a
cubic at $t_0$ can be viewed as a dot product:
$$at_0^3 + bt_0^2 + ct_0 + d = (a, b, c, d) \cdot (t_0^3, t_0^2, t_0, 1)$$
Let the preallocable constant 4 x (V+1) matrix of tick powers be:
$$\begin{pmatrix}
0 & (1/V)^3 & \cdots & ((V-1)/V)^3 & 1 \\
0 & (1/V)^2 & \cdots & ((V-1)/V)^2 & 1 \\
0 & (1/V)^1 & \cdots & ((V-1)/V)^1 & 1 \\
1 & 1 & \cdots & 1 & 1
\end{pmatrix}$$
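A minimal sketch of the tick matrix and of the dot-product evaluation follows. The function names are hypothetical and the coefficients are illustrative values for one coordinate, not simulation data.

```python
def tick_matrix(n_intervals):
    """4 x (V+1) matrix whose rows hold the powers 3, 2, 1, 0 of the ticks
    0, 1/V, ..., 1, with V = n_intervals."""
    ticks = [j / n_intervals for j in range(n_intervals + 1)]
    return [[t ** power for t in ticks] for power in (3, 2, 1, 0)]


def evaluate_at_tick(coeffs, matrix, tick_index):
    """Dot product of (a, b, c, d) with one column of the tick matrix."""
    return sum(c * row[tick_index] for c, row in zip(coeffs, matrix))


ticks_v4 = tick_matrix(4)                     # V = 4, ticks 0, 0.25, ..., 1
print(ticks_v4[2])                            # first powers: the ticks themselves
print(evaluate_at_tick((-16.0, 6.0, 12.0, 1.0), ticks_v4, 2))  # cubic at t = 0.5
```

The matrix depends only on V, so it can be built once before any spline is evaluated.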
An eulerian view

Let the M x 4 coefficient
matrix (each row a
spline for each
trajectory) be:
$$\begin{pmatrix}
a_1 & b_1 & c_1 & d_1 \\
a_2 & b_2 & c_2 & d_2 \\
\vdots & \vdots & \vdots & \vdots \\
a_M & b_M & c_M & d_M
\end{pmatrix}$$
Then the M x (V+1) product of this matrix with the constant 4 x (V+1) matrix of tick powers
contains in each row the values of a cubic between two
data-points, for all the M trajectories (Eulerian method).
For each product, the flops are 21M(V+1), the number of
coefficient matrices is 3N; the total number of flops is
63NM(V+1).
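The Eulerian product can be sketched with plain lists. In the talk the product is delegated to MATLAB's built-in matrix multiplication; the code below is only an illustration for one coordinate, with hypothetical names and coefficient values.

```python
def tick_matrix(n_intervals):
    """4 x (V+1) matrix of the powers 3, 2, 1, 0 of the ticks 0, 1/V, ..., 1."""
    ticks = [j / n_intervals for j in range(n_intervals + 1)]
    return [[t ** power for t in ticks] for power in (3, 2, 1, 0)]


def matmul(left, right):
    """Dense matrix product on lists of rows."""
    columns = list(zip(*right))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in left]


# Row m holds (a, b, c, d) of the current two-point group of trajectory m.
eulerian_coefficients = [
    [-16.0, 6.0, 12.0, 1.0],
    [-7.5, 4.0, 5.5, 0.0],
]
values = matmul(eulerian_coefficients, tick_matrix(2))  # V = 2: ticks 0, 0.5, 1
print(values)  # [[1.0, 6.5, 3.0], [0.0, 2.8125, 2.0]]
```

Each row of the result samples one cubic at all ticks, so a single product advances every trajectory by one two-point group.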
A lagrangian view

Let the 3N x 4 coefficient
matrix (each row a
spline along one single
trajectory) be:
$$\begin{pmatrix}
a_1 & b_1 & c_1 & d_1 \\
a_2 & b_2 & c_2 & d_2 \\
\vdots & \vdots & \vdots & \vdots \\
a_{3N} & b_{3N} & c_{3N} & d_{3N}
\end{pmatrix}$$
Then the 3N x (V+1) product of this matrix with the constant 4 x (V+1) matrix of tick powers
contains in each row the values of a cubic between two
data-points, for a single trajectory (Lagrangian method).
For each product, the flops are 63N(V+1), the number of
coefficient matrices is M; the total number of flops is
63NM(V+1).
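The two views give the same total, which can be checked with a short sketch. It assumes that the factor 21 arises as 7 flops per matrix entry (4 multiplications and 3 additions) times 3 Cartesian coordinates; this reading is an interpretation of the slide, and the function names are hypothetical.

```python
FLOPS_PER_ENTRY = 7   # assumed: 4 multiplications + 3 additions per entry
COORDINATES = 3       # assumed: one product per Cartesian coordinate


def product_flops(n_rows, n_intervals):
    """Flops of one (n_rows x 4) by (4 x (V+1)) product, all coordinates."""
    return COORDINATES * FLOPS_PER_ENTRY * n_rows * (n_intervals + 1)


def eulerian_total(n_traj, n_groups, n_intervals):
    """3N products, each with M rows."""
    return 3 * n_groups * product_flops(n_traj, n_intervals)


def lagrangian_total(n_traj, n_groups, n_intervals):
    """M products, each with 3N rows."""
    return n_traj * product_flops(3 * n_groups, n_intervals)


# Values quoted for the final example: M = 5000, N = 1600, V = 100.
print(eulerian_total(5000, 1600, 100), lagrangian_total(5000, 1600, 100))
```

Since the arithmetic cost is identical, the choice between the two views rests on how the matrices fit in RAM and how they are distributed among the processes.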
Data distribution: eulerian case
With P, the number of processes, a divisor of 3N (the number of
two-point groups), the method used is the distribution of
3N/P coefficient matrices to every process; no communication
is involved.
The value of N is important for the total computation
time, N and M for the RAM allocation of each process.
[Figure: the 3N matrices are spread over the CPUs; the dimension M resides in RAM.]
Data distribution: lagrangian case
With P, the number of processes, a divisor of M (the number of
trajectories), the method used is the distribution of M/P
coefficient matrices to every process; no communication is
involved.
The value of N is important for the total computation
time, N and M for the RAM allocation of each process.
[Figure: the M matrices are spread over the CPUs; the dimension 3N resides in RAM.]
Hardware and software
Hardware: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB },
at Riello
Software: Windows 2000 / Linux RH 8.1, MATLAB 6.5,
parallelism through simultaneous Matlab engines
• for matrix multiplication, Matlab 6.5 uses internal
LAPACK Level 3 BLAS routines (good performance)
• the matrix of tick powers is computed only once (in the case of a
uniform and constant sampling interval); its values are
probably always cached during the matrix multiplications
Performance results
Performance of a single
Matlab process for the
matrix product with V = 100; as predicted by
theory, the execution time is
linear in M.
Performance of multiprocess
products (case 3N = 4200P); for
P ≤ 8, the total computation time
depends on NM (Gustafson's law),
as expected.
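Gustafson's law applies when the problem size grows with the number of processes, as in the case 3N = 4200P, where every process keeps 4200 two-point groups. The sketch below uses the usual statement of the law, scaled speedup = s + P(1 - s), with s the serial fraction of the run. The value of s is a hypothetical parameter, not a measurement from the talk.

```python
def gustafson_speedup(n_procs, serial_fraction):
    """Scaled speedup s + P * (1 - s) for a workload that grows with P."""
    return serial_fraction + n_procs * (1.0 - serial_fraction)


def scaled_groups(n_procs, groups_per_process=4200):
    """Total number of two-point groups, 3N = 4200 * P."""
    return groups_per_process * n_procs


for procs in (1, 2, 4, 8):
    # serial_fraction = 0.05 is an illustrative assumption
    print(procs, scaled_groups(procs), gustafson_speedup(procs, 0.05))
```

With no communication between processes the serial fraction stays small, which is consistent with the nearly linear behaviour reported up to 8 processes.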
Performance results: considerations
• Linear speedup up to P = 8 (= number
of virtual Hyper-Threading processors);
for P > 8, reallocations of RAM and
caches have a negative effect
• For large data sets, the amount of
RAM in the cluster nodes is a
critical factor, while CPU
performance is good thanks to the use
of LAPACK routines
• First results with a technique using “global M-N” matrices,
an MPI-multithreaded version of MATLAB (Cornell Toolbox),
and parallel matrix multiplication algorithms show an
overhead, in the case of large data sets, due to communications
Performance results: Hyper Threading
Performance of the Intel Hyper-Threading
Technology of
Xeon processors; the unit of the vertical
axis is the execution time in the
case of 8 processes
(M = 5000, 3N = 4200P); up to
8 processes, the time appears to be
quadratic in the number of
processes.
Similar results have been obtained
• using Win2k or Linux
• using the High Performance Linpack benchmark
Examples
red = trajectory computed
with V = 100; black = least
squares method, 3rd-degree
polynomials; gray = data-points
from simulation
Forced injection of air in the combustion
head; the ribbons show some particle
trajectories; data-points from simulation,
path computation with V=100, M=5000,
N=1600, P=8; computation and rendering
by Matlab; total computation time 85 s
Thanks