Large Data Set Computations with CFX:
To 100 Million Nodes & Beyond!
Dr. Mark E. Braaten
Mechanical Engineer
GE Global Research, Niskayuna, NY
© 2006 ANSYS, Inc.
M. Braaten – GE
Acknowledgements
• Thanks to many people, in particular:
– GEGR Colleagues
• Stuart Connell, Semir Kapetanovic, Adam Stevenson,
Peter Schmid
– CFX Colleagues
• Mike Marchant, Andy Mortimer, David Main (UK)
• Dan Williams, Phil Zwart, Steve Elias, Paul Galpin,
Hrvoje Roglic (Canada)
Large Data Set Project
• Goal of Large Data Set Project is to look at hardware /
software issues that arise when running very large CFD
applications (>100M nodes)
– CFX and GE in-house turbomachinery code
• Examine scalability of current CFX system to very large
problems
– Look at mesh generation, pre-processing, partitioning, solver, post-processing
• Create series of cases of increasing size
– Roughly 1M, 2M, 10M, 25M, 50M, 100M nodes
• Run on 64-bit desktop and 32/64-bit Linux clusters
– Expect these to be the likely computer resources available in the
near future
Current Status of CFX-10.0
• CFX can handle both unstructured and structured
meshes
• CFX-Solver has demonstrated parallel scalability on
Linux clusters (32- and 64-bit)
• CFX-Pre, CFX-Post, and the CFX Partitioner are
available for 64-bit machines
• CFX-Mesh is currently limited to 32-bit, due to its
integration into ANSYS Workbench
– Impossible to make single mesh larger than about 5M nodes
• CFX Partitioner is serial only
– Serial METIS has a severe memory bottleneck
Selection of Large CFX Test Case
IDEA:
• Use an existing block-structured mesh for a
turbomachinery passage, and replicate it to
make larger problems
• Advantages:
– Easy to make sequence of larger problems
• Avoids current problem with 32-bit mesh generation
– Models a problem of interest to Large Data Set project
• Multiple identical passages, multiple dissimilar passages
(AVP, MPT), single large mesh
– Easy to assess solutions on multiple passages
• Should be same as a single passage replicated
– Allows use of simple circumferential partitioning
• Avoids METIS memory requirements
CFX Test Case
• Use typical turbine blade as the test case
• Convenient for generating series of test
cases
– Mesh for one passage is about 1 M nodes
– Full annulus contains 92 passages (88M nodes)
Block-structured mesh (95 blocks, 970,000 grid points)
Large Data Set CFX Test Case (cont’d)
• One passage base case
– Block-structured mesh (95 blocks) imported
directly into CFX-Pre
– 970,000 grid points per passage
• Replicate passages to generate larger and
larger test cases
– 2 passages ~ 2 M nodes
– 10 passages ~ 10 M nodes
– 23 passages (quarter annulus) ~ 22 M nodes
– 46 passages (half annulus) ~ 44 M nodes
– 92 passages (full annulus) ~ 88 M nodes
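The sizes of the replicated cases follow from the single-passage mesh. A minimal Python sketch of that arithmetic, assuming every copy contributes all 970,000 grid points (the function name is illustrative, not part of CFX). The estimate is an upper bound: the partitioner output reports 87,733,040 nodes for the full annulus, slightly fewer than this gives. One plausible reason, not stated in the slides, is that coincident nodes on block and passage interfaces are merged.

```python
# Approximate mesh size for each replicated test case.
# Assumption: every passage copy contributes all 970,000 grid points;
# nodes merged on interfaces are not subtracted, so totals are upper bounds.
NODES_PER_PASSAGE = 970_000


def replicated_nodes(n_passages: int) -> int:
    """Upper-bound node count for n_passages copies of the base mesh."""
    return n_passages * NODES_PER_PASSAGE


for n in (1, 2, 10, 23, 46, 92):
    print(f"{n:3d} passages: ~{replicated_nodes(n) / 1e6:.1f} M nodes")
```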
Computer Resources
• Initially CFX-Pre, CFX-Partition, and CFX-Post run on 64-bit Itanium machine
– Process has since been repeated using new
EM64T-based ATW machines that are dual-booted with 64-bit Linux O.S.
• CFX-Solver run on Linux clusters at GEGR
– 32-bit cluster with 2GB memory per processor
– 64-bit cluster with 3GB memory per processor
CFX-Pre
• Had existing problem setup for a single passage
– Exported CCL from this case
• Process:
– Import single passage grid
– Replicate grid
• Specify number of copies, angle/axis of rotation
– Import CCL from single passage case
– Point boundary conditions to replicated faces
– Takes less than 15 minutes (start to finish) for even full annulus
case
• No significant scalability issues for replicated passages
• Memory increases linearly for independent passages
– Can reach almost 200M node hex mesh w/ 16GB memory
• Not a barrier for now
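The linear memory claim can be turned into a rough estimator. The sketch below fits a straight line through the origin to the single data point on this slide (about 200M nodes in 16 GB). Both the zero fixed overhead and the reading of "GB" as 2**30 bytes are assumptions, and because the slide says "almost 200M", the per-node figure is if anything a slight underestimate.

```python
# Linear memory model for CFX-Pre, fitted to one data point from the slide.
GB = 1024 ** 3                      # assumption: "GB" means 2**30 bytes
BYTES_PER_NODE = 16 * GB / 200e6    # roughly 86 bytes per node


def cfx_pre_memory_gb(n_nodes: float) -> float:
    """Estimated CFX-Pre memory (GB) for a replicated hex mesh of n_nodes."""
    return n_nodes * BYTES_PER_NODE / GB


for nodes in (1e6, 10e6, 88e6, 200e6):
    print(f"{nodes / 1e6:6.0f} M nodes -> ~{cfx_pre_memory_gb(nodes):.2f} GB")
```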
Ten Passages
Twenty Three Passages
(Quarter Annulus)
Ninety Two Passages
(Full Annulus)
Full Annulus Simulation in CFX-10.0
[Figure: full annulus model, 92 passages, 88M grid points; labelled boundaries: inlet, exit, hub cavity inlet, casing cavity inlet]
CFX-Partitioner
• Used simple circumferential partitioning to divide
up the final mesh
– Avoids memory requirements of METIS
– Partitioning is reasonable for problem of this type
• Set number of processors ~equal to number of
passages
– One million nodes per processor
– 1.5GB memory per processor within 32-bit limit
– Cases range from 1 to 92 processors
• Required >9 GB memory for 92 passage case
(i.e. a 64-bit machine with lots of memory)
• Need a parallel METIS partitioner for a more
general problem
– To be discussed later
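The idea behind circumferential partitioning can be sketched in a few lines: each node is assigned to an angular sector about the machine axis. The version below is a simplified, unweighted illustration with equal-angle sectors about the z-axis. The CFX partitioner output refers to a "weighted" variant whose details are not given in the slides, so this is not CFX's algorithm, and the function name is hypothetical.

```python
import math


def circumferential_partition(points, n_parts):
    """Assign each (x, y, z) point to one of n_parts equal angular sectors
    about the z-axis (assumed rotation axis, from (0,0,0) to (0,0,1))."""
    sector = 2.0 * math.pi / n_parts
    parts = []
    for x, y, _z in points:
        theta = math.atan2(y, x) % (2.0 * math.pi)   # angle in [0, 2*pi)
        # Clamp guards against theta rounding up to exactly 2*pi.
        parts.append(min(int(theta / sector), n_parts - 1))
    return parts


# Eight points on a unit circle, two per quadrant.
angles = (22.5, 67.5, 112.5, 157.5, 202.5, 247.5, 292.5, 337.5)
pts = [(math.cos(math.radians(a)), math.sin(math.radians(a)), 0.0) for a in angles]
print(circumferential_partition(pts, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

For identical replicated passages, equal sectors give each partition the same number of nodes, which is why this simple scheme is reasonable for the test case here.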
Partitioner Run
Partitioner run on Itanium: 23 processors, 9.1 GB memory, 30 minutes cpu

+--------------------------------------------------------------------+
|                      Partitioning Information                      |
+--------------------------------------------------------------------+

Partitioning of domain: Domain 1

  Partitioning tool:       Cirumferential direction (weighted)
  Number of partitions:    23
  Number of nodes:         87733040
  Partitioning axis from:  ( 0.000E+00, 0.000E+00, 0.000E+00)
  Partitioning axis to:    ( 0.000E+00, 0.000E+00, 1.000E+00)

Partitioning information for domain: Domain 1

              +-----------+---------------------+-----------+--------+
              | Elements  | Vertices (Overlap)  |   Faces   | Weight |
+-------------+-----------+---------------------+-----------+--------+
| Full mesh   |  85016832 |  87733040           |   8434560 |        |
+-------------+-----------+---------------------+-----------+--------+
| Part.     1 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     2 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     3 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     4 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     5 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     6 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     7 |   3719704 |   3862307      1.2% |    368176 |  0.043 |
| Part.     8 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.     9 |   3719704 |   3862307      1.2% |    368176 |  0.043 |
| Part.    10 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    11 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    12 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    13 |   3719702 |   3862305      1.2% |    368176 |  0.043 |
| Part.    14 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    15 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    16 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    17 |   3719702 |   3862305      1.2% |    368174 |  0.043 |
| Part.    18 |   3719710 |   3862313      1.2% |    368178 |  0.043 |
| Part.    19 |   3719698 |   3862301      1.2% |    368174 |  0.043 |
| Part.    20 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    21 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    22 |   3719706 |   3862309      1.2% |    368176 |  0.043 |
| Part.    23 |   3719706 |   3862309      1.2% |    368192 |  0.043 |
+-------------+-----------+---------------------+-----------+--------+
| Sum of part.|  85553222 |  88833091      1.2% |   8468062 |  1.000 |
+-------------+-----------+---------------------+-----------+--------+

CPU-Time requirements:
  Preparations                                        3.182E-01 seconds
  Low-level mesh partitioning                         7.375E+02 seconds
  Global partitioning information                     1.675E+02 seconds
  Vertex, element and face partitioning information   5.516E+02 seconds
  Element and face set partitioning information       4.840E+01 seconds
  Summed CPU-time for mesh partitioning               1.754E+03 seconds
Partitioned Mesh
92 partitions
CFX Solutions
• The following slides show some typical
results
– Results with different number of passages agree
very well
– Parallel scalability is very good
• Converged solution for full annulus (88M
nodes) obtained in less than 8 hours clock
time
Ten Passages (10M nodes)
Forty-Six Passages (44M nodes)
Half annulus
Ninety-Two Passages (88M nodes)
Full annulus
• 7 hr, 47 min clock time w/ 90 processors, 100 iterations
• According to CFX, this is the largest “realistic” computation to date w/ CFX
Convergence Path
File Sizes, Run Times
#Pass.    Mesh Size   Definition   Results
1         1.0M        20 MB        300 MB
2         1.9M        46 MB        618 MB
10        9.7M        255 MB       2958 MB
23        22M         597 MB       6822 MB
46        44M         1200 MB      13544 MB
92 (2)    88M         2407 MB      9960 MB (3)
Clk Time (1): 7.22 hrs, 5.75 hrs, 7.75 hrs
Footnotes:
(1) Timings on 64-bit Linux cluster
(2) 90 processors
(3) Full results files (the default) contain many unnecessary variables. Choosing
“Selected Variables” makes results files only 1/3 as large
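One way to read the file-size figures is as bytes per mesh node. The sketch below does that arithmetic in Python, assuming 1 MB = 10**6 bytes so that MB per million nodes equals bytes per node. The tuple layout and function name are illustrative only. The last results entry is the one carrying the footnote about "Selected Variables", which is consistent with its much lower per-node size.

```python
# (passages, nodes in millions, definition file MB, results file MB), from the table.
CASES = [
    (1, 1.0, 20, 300),
    (2, 1.9, 46, 618),
    (10, 9.7, 255, 2958),
    (23, 22.0, 597, 6822),
    (46, 44.0, 1200, 13544),
    (92, 88.0, 2407, 9960),   # results entry carries the "Selected Variables" footnote
]


def bytes_per_node(size_mb: float, nodes_millions: float) -> float:
    """Assumes 1 MB = 10**6 bytes, so MB per million nodes is bytes per node."""
    return size_mb / nodes_millions


for passages, nodes_m, def_mb, res_mb in CASES:
    print(f"{passages:3d} passages: definition ~{bytes_per_node(def_mb, nodes_m):.0f} B/node, "
          f"results ~{bytes_per_node(res_mb, nodes_m):.0f} B/node")
```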
Memory and I/O
• Memory usage
– 1.2 GB memory allocated per processor
– 300 words per node allocated
• Actual usage closer to 250 words per node
• CFX-10.0 does all I/O on the master processor
– Serial I/O paradigm
• I/O times (writing to project share)
– 1 processor < 1 minute for final file write
– 90 processors ~ 45 minutes for final file write
• File compression runs serially on master processor
– Slows file output significantly
– File compression will be done in parallel in CFX-11.0
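The memory figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes 4-byte words (the slides do not state the word size) and about one million nodes per processor, as set up in the partitioning step. Under those assumptions, 300 words per node reproduces the 1.2 GB allocation.

```python
WORD_BYTES = 4   # assumption: 4-byte words; not stated in the slides


def solver_memory_gb(nodes_per_proc: int, words_per_node: int,
                     word_bytes: int = WORD_BYTES) -> float:
    """Per-processor solver memory in GB (10**9 bytes)."""
    return nodes_per_proc * words_per_node * word_bytes / 1e9


print(solver_memory_gb(1_000_000, 300))   # allocated: 300 words per node
print(solver_memory_gb(1_000_000, 250))   # actual usage: about 250 words per node
```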
Beyond 100M nodes …
• Very large block-structured meshes for single
blade passage recently obtained from GE
colleagues (Tolpadi, Sewall)
– Mesh 1: 3.1 million nodes
– Mesh 2: 5.1 million nodes
• Replicated 64 passages to create full annulus
in CFX-Pre
– Mesh 1 – 199.3 M nodes (5.5 GB definition file)
– Mesh 2 – 327.0 M nodes (9.9 GB definition file)
Beyond 100M nodes (cont’d)
• Ran into limitation in CFX-10.0 writing
definition files
– Reported problem to CFX Development
– CFX determined cause of problem, developed fix
within two weeks
– Received preview version of CFX-11.0 that
corrects problem
The Largest CFX Mesh Yet …
Linux!
327 Million Node Mesh
Rendered in CFX-Post (wireframe)
Memory requirements:
• PostEngine: 23 GB
• PostGui: 5 GB
327 Million Node Mesh (cont’d)
Mesh
Biggest CFX Needs for Large Data
Set Cases
• To date, large data set project has identified
two major needs for CFX
1. 64-bit versions of ANSYS Workbench and CFX-Mesh
– For unstructured meshing of large single grids
2. Parallel METIS partitioner
Biggest CFX Needs for Large Data
Set Cases (cont’d)
• Projects to develop these are underway
– CFX-Mesh has been compiled using Intel 8
compiler on 64-bit Itanium
• Included in Service Pack 1 for Workbench 10.0 on XP-64
• 64-bit Linux version under development for Workbench 11.0
– Parallel METIS has been demonstrated for CFX
unstructured meshes, block structured meshes as
part of this project
• See following charts
Parallel METIS Prototype –
Current Status
• A working prototype has been developed to use
parallel METIS to partition a CFX definition file
• Three components of prototype:
1. Convert CFX definition file to ASCII input file for ParMETIS
2. Run ParMETIS for mixed mesh of tetrahedra, pyramids,
prisms, hexes
3. Convert output from ParMETIS to CFX partition file
• Meshes up to 88M nodes have been partitioned
with excellent results
– Actual partitioning of 88M node mesh into 92 partitions
took only 40 cpu sec on 46 processors
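The first step of the prototype amounts to flattening a mixed-element mesh into a form ParMETIS can read. The slides do not describe the prototype's ASCII file format, so the sketch below only illustrates the array layout that ParMETIS's mesh routines (such as `ParMETIS_V3_PartMeshKway`) expect: a CSR-style pair `eptr`/`eind` for element connectivity and an `elmdist` array giving the range of elements owned by each rank. The helper names and the tiny mesh are hypothetical. In a real parallel run each rank would hold only its own slice, with `eptr` rebased to start at 0.

```python
# Nodes per element for the four element types named on the slide.
NODES_PER_ELEMENT = {"tet": 4, "pyramid": 5, "prism": 6, "hex": 8}


def build_eptr_eind(elements):
    """Flatten (kind, node_ids) pairs into CSR-style eptr/eind arrays.

    Element i owns the node ids eind[eptr[i]:eptr[i + 1]].
    """
    eptr, eind = [0], []
    for kind, node_ids in elements:
        if len(node_ids) != NODES_PER_ELEMENT[kind]:
            raise ValueError(f"{kind} needs {NODES_PER_ELEMENT[kind]} nodes")
        eind.extend(node_ids)
        eptr.append(len(eind))
    return eptr, eind


def build_elmdist(n_elements, n_ranks):
    """Contiguous, near-equal ranges: rank r owns elements elmdist[r] .. elmdist[r+1]-1."""
    base, extra = divmod(n_elements, n_ranks)
    elmdist = [0]
    for rank in range(n_ranks):
        elmdist.append(elmdist[-1] + base + (1 if rank < extra else 0))
    return elmdist


# Hypothetical two-element mesh: one hex and one tet sharing three nodes.
mesh = [("hex", [0, 1, 2, 3, 4, 5, 6, 7]), ("tet", [4, 5, 6, 8])]
eptr, eind = build_eptr_eind(mesh)
print(eptr)                    # [0, 8, 12]
print(build_elmdist(10, 3))    # [0, 4, 7, 10]
```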
Parallel Partitioning of a CFX
Mesh using ParMETIS
1.2 M nodes, 16 partitions, 16 processors

Partition   % Element    % Node
1           6.5134E+00   6.5950E+00
2           6.1985E+00   6.2898E+00
3           6.4263E+00   6.4665E+00
4           6.0389E+00   6.1230E+00
5           6.5297E+00   6.5979E+00
6           6.3401E+00   6.4064E+00
7           6.2589E+00   6.3009E+00
8           6.5466E+00   6.6255E+00
9           6.5493E+00   6.4806E+00
10          6.1257E+00   6.1159E+00
11          6.3303E+00   6.2306E+00
12          5.5954E+00   5.4724E+00
13          6.2288E+00   6.1053E+00
14          6.3850E+00   6.2206E+00
15          6.3096E+00   6.2862E+00
16          5.6235E+00   5.6834E+00
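The quality of this partitioning can be summarised by a load-imbalance factor, taken here as the largest partition share divided by the mean share (1.0 is a perfect balance). This metric is a common convention rather than something defined in the slides. Using the percentages from the table, it comes out at roughly 1.05 for elements and 1.06 for nodes.

```python
# Percentage of elements and nodes per partition, copied from the table.
ELEMENT_PCT = [6.5134, 6.1985, 6.4263, 6.0389, 6.5297, 6.3401, 6.2589, 6.5466,
               6.5493, 6.1257, 6.3303, 5.5954, 6.2288, 6.3850, 6.3096, 5.6235]
NODE_PCT = [6.5950, 6.2898, 6.4665, 6.1230, 6.5979, 6.4064, 6.3009, 6.6255,
            6.4806, 6.1159, 6.2306, 5.4724, 6.1053, 6.2206, 6.2862, 5.6834]


def imbalance(shares):
    """Largest share divided by the mean share; 1.0 means perfectly balanced."""
    return max(shares) / (sum(shares) / len(shares))


print(f"element imbalance: {imbalance(ELEMENT_PCT):.3f}")
print(f"node imbalance:    {imbalance(NODE_PCT):.3f}")
```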
Another Example of Parallel
Partitioning of a CFX Mesh
44 M nodes,
46 partitions,
46 processors
Partitioning itself takes
only 20 cpu seconds !
Concluding Remarks
• CFX-10.0 has demonstrated the ability to run
problems exceeding 100 million grid points on
present hardware
– Converged solution obtained for complete turbine
wheel in under 8 hours on 90 processors
– Solver can run well on both 32-bit and 64-bit Linux
clusters, with good parallel scalability
• A large memory (16GB) 64-bit OS machine is
needed for pre- and post-processing of large
cases
• Billion node CFX computations are not far off
– Come back next year !