Large Data Set Computations with CFX: To 100 Million Nodes and Beyond!
Dr. Mark E. Braaten, Mechanical Engineer, GE Global Research, Niskayuna, NY
© 2006 ANSYS, Inc. M. Braaten – GE

Acknowledgements
• Thanks to many people, in particular:
  – GEGR colleagues: Stuart Connell, Semir Kapetanovic, Adam Stevenson, Peter Schmid
  – CFX colleagues:
    • Mike Marchant, Andy Mortimer, David Main (UK)
    • Dan Williams, Phil Zwart, Steve Elias, Paul Galpin, Hrvoje Roglic (Canada)

Large Data Set Project
• The goal of the Large Data Set Project is to examine the hardware and software issues that arise when running very large CFD applications (>100M nodes)
  – CFX and a GE in-house turbomachinery code
• Examine the scalability of the current CFX system to very large problems
  – Look at mesh generation, pre-processing, partitioning, the solver, and the post-processor
• Create a series of cases of increasing size
  – Roughly 1M, 2M, 10M, 25M, 50M, and 100M nodes
• Run on 64-bit desktops and 32-/64-bit Linux clusters
  – These are expected to be the computing resources typically available in the near future

Current Status of CFX-10.0
• CFX can handle both unstructured and structured meshes
• The CFX-Solver has demonstrated parallel scalability on Linux clusters (32- and 64-bit)
• CFX-Pre, CFX-Post, and the CFX Partitioner are available for 64-bit machines
• CFX-Mesh is currently limited to 32-bit because of its integration into ANSYS Workbench
  – A single mesh larger than about 5M nodes cannot be generated
• The CFX Partitioner is serial only
  – Serial METIS has a severe memory bottleneck
Selection of a Large CFX Test Case
• Idea: use an existing block-structured mesh for a turbomachinery passage and replicate it to build larger problems
• Advantages:
  – Easy to generate a sequence of progressively larger problems
    • Avoids the current 32-bit mesh-generation limit
  – Models a problem of interest to the Large Data Set Project
    • Multiple identical passages, multiple dissimilar passages (AVP, MPT), single large mesh
  – Easy to assess solutions on multiple passages
    • Should match a single passage replicated
  – Allows use of simple circumferential partitioning
    • Avoids the METIS memory requirements

CFX Test Case
• A typical turbine blade is used as the test case
• Convenient for generating a series of test cases
  – The mesh for one passage is about 1M nodes
  – The full annulus contains 92 passages (88M nodes)
• Block-structured mesh (95 blocks, 970,000 grid points)

Large Data Set CFX Test Case (cont'd)
• One-passage base case
  – Block-structured mesh (95 blocks) imported directly into CFX-Pre
  – 970,000 grid points per passage
• Passages are replicated to generate larger and larger test cases:
  – 2 passages                       ~2M nodes
  – 10 passages                      ~10M nodes
  – 23 passages (quarter annulus)    ~22M nodes
  – 46 passages (half annulus)       ~44M nodes
  – 92 passages (full annulus)       ~88M nodes

Computer Resources
• Initially, CFX-Pre, CFX-Partition, and CFX-Post were run on a 64-bit Itanium machine
  – The process has since been repeated on new EM64T-based ATW machines dual-booted with a 64-bit Linux OS
• The CFX-Solver was run on Linux clusters at GEGR
  – 32-bit cluster with 2GB of memory per processor
  – 64-bit cluster with 3GB of memory per processor
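The passage-replication idea above can be sketched in a few lines. This is a minimal illustration, not CFX-Pre's actual implementation: each copy of the passage's node coordinates is rotated about the machine (z) axis by a multiple of the passage pitch (360/92 degrees for the full annulus). The function name and the tiny random stand-in mesh are hypothetical; the real tool also merges coincident nodes on periodic faces, which is why the full-annulus mesh has ~87.7M unique nodes rather than 92 × 970,000.

```python
import numpy as np

def replicate_passage(nodes, n_copies, pitch_deg):
    """Rotate copies of a passage's node coordinates about the z-axis
    (the machine axis) by multiples of the passage pitch angle.
    Returns the stacked coordinates of all copies (no node merging)."""
    copies = []
    for k in range(n_copies):
        theta = np.radians(k * pitch_deg)
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]])
        copies.append(nodes @ rot.T)
    return np.vstack(copies)

# Tiny stand-in passage (the real one has ~970,000 nodes);
# full-annulus pitch for 92 passages is 360/92 degrees
rng = np.random.default_rng(1)
passage = rng.random((1000, 3))
annulus = replicate_passage(passage, 92, 360.0 / 92)
print(annulus.shape)  # (92000, 3)
```

Since the rotation is about the z-axis, each copy preserves every node's radius from that axis, which is a cheap sanity check on the replicated mesh.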
CFX-Pre
• An existing problem setup for a single passage was available
  – CCL was exported from this case
• Process:
  – Import the single-passage grid
  – Replicate the grid, specifying the number of copies and the angle/axis of rotation
  – Import the CCL from the single-passage case
  – Point the boundary conditions to the replicated faces
  – Takes less than 15 minutes (start to finish) even for the full-annulus case
• No significant scalability issues for replicated passages
• Memory increases linearly with the number of independent passages
  – Almost a 200M-node hex mesh can be reached with 16GB of memory
  – Not a barrier for now

Ten Passages

Twenty-Three Passages (Quarter Annulus)

Ninety-Two Passages (Full Annulus)

Full Annulus Simulation in CFX-10.0
• 92 passages, 88M grid points
• (Figure labels: inlet, exit, hub cavity inlet, casing cavity inlet)

CFX-Partitioner
• Simple circumferential partitioning was used to divide up the final mesh
  – Avoids the memory requirements of METIS
  – The partitioning is reasonable for a problem of this type
• The number of processors was set roughly equal to the number of passages
  – One million nodes per processor
  – 1.5GB of memory per processor, within the 32-bit limit
  – Cases range from 1 to 92 processors
• The 92-passage case required more than 9 GB of memory (i.e., a 64-bit machine with plenty of memory)
• A parallel METIS partitioner is needed for more general problems
  – Discussed later
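Circumferential partitioning as described above amounts to binning elements into equal angular sectors about the rotation axis. A minimal sketch follows; the function and the random centroids are illustrative assumptions, not the CFX partitioner's code:

```python
import numpy as np

def circumferential_partition(centroids, n_parts):
    """Assign each element to a partition based on the circumferential
    angle of its centroid about the z-axis. Each partition covers an
    equal angular sector (a 'pie slice' of the annulus)."""
    theta = np.arctan2(centroids[:, 1], centroids[:, 0])  # in [-pi, pi)
    theta = np.mod(theta, 2.0 * np.pi)                    # map to [0, 2*pi)
    part = np.floor(theta / (2.0 * np.pi / n_parts)).astype(int)
    return np.minimum(part, n_parts - 1)  # guard against rounding at 2*pi

# Example: 100,000 random element centroids split into 23 sectors
rng = np.random.default_rng(0)
centroids = rng.normal(size=(100_000, 3))
parts = circumferential_partition(centroids, 23)
print(np.bincount(parts, minlength=23))
```

For a replicated turbomachinery mesh the node count per sector is nearly identical by construction, which is why this simple scheme gives the well-balanced partitions shown in the log that follows.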
Partitioner Run
• Partitioner run on the 64-bit Itanium machine: 23 partitions, 9.1 GB memory, 30 minutes of CPU time

+--------------------------------------------------------------------+
|                      Partitioning Information                      |
+--------------------------------------------------------------------+

Partitioning of domain: Domain 1

 - Partitioning tool:       Circumferential direction (weighted)
 - Number of partitions:    23
 - Number of nodes:         87733040
 - Partitioning axis from:  ( 0.000E+00, 0.000E+00, 0.000E+00)
 - Partitioning axis to:    ( 0.000E+00, 0.000E+00, 1.000E+00)

Partitioning information for domain: Domain 1

+-------------+-----------+---------------------+-----------+--------+
|             | Elements  | Vertices (Overlap)  | Faces     | Weight |
+-------------+-----------+---------------------+-----------+--------+
| Full mesh   |  85016832 |  87733040           |   8434560 |        |
+-------------+-----------+---------------------+-----------+--------+
| Part. 1     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 2     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 3     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 4     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 5     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 6     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 7     |   3719704 |   3862307  1.2%     |    368176 | 0.043  |
| Part. 8     |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 9     |   3719704 |   3862307  1.2%     |    368176 | 0.043  |
| Part. 10    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 11    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 12    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 13    |   3719702 |   3862305  1.2%     |    368176 | 0.043  |
| Part. 14    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 15    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 16    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 17    |   3719702 |   3862305  1.2%     |    368174 | 0.043  |
| Part. 18    |   3719710 |   3862313  1.2%     |    368178 | 0.043  |
| Part. 19    |   3719698 |   3862301  1.2%     |    368174 | 0.043  |
| Part. 20    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 21    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 22    |   3719706 |   3862309  1.2%     |    368176 | 0.043  |
| Part. 23    |   3719706 |   3862309  1.2%     |    368192 | 0.043  |
+-------------+-----------+---------------------+-----------+--------+
| Sum of part.|  85553222 |  88833091  1.2%     |   8468062 | 1.000  |
+-------------+-----------+---------------------+-----------+--------+

CPU-time requirements:
 - Preparations                                       3.182E-01 seconds
 - Low-level mesh partitioning                        7.375E+02 seconds
 - Global partitioning information                    1.675E+02 seconds
 - Vertex, element and face partitioning information  5.516E+02 seconds
 - Element and face set partitioning information      4.840E+01 seconds
 - Summed CPU-time for mesh partitioning              1.754E+03 seconds

Partitioned Mesh
• 92 partitions

CFX Solutions
• The following slides show some typical results
  – Results with different numbers of passages agree very well
  – Parallel scalability is very good
• A converged solution for the full annulus (88M nodes) was obtained in less than 8 hours of clock time

Ten Passages (10M nodes)

Forty-Six Passages (44M nodes) – Half Annulus

Ninety-Two Passages (88M nodes) – Full Annulus
• 7 hr, 47 min clock time with 90 processors, 100 iterations
• According to CFX, this is the largest "realistic" computation to date with CFX

Convergence Path

File Sizes, Run Times

+--------+-----------+------------+--------------+--------------+
| #Pass. | Mesh Size | Definition | Results      | Clk Time [1] |
+--------+-----------+------------+--------------+--------------+
| 1      | 1.0M      |   20 MB    |   300 MB     |              |
| 2      | 1.9M      |   46 MB    |   618 MB     |              |
| 10     | 9.7M      |  255 MB    |  2958 MB     |              |
| 23     | 22M       |  597 MB    |  6822 MB     | 7.22 hrs     |
| 46     | 44M       | 1200 MB    | 13544 MB     | 5.75 hrs     |
| 92 [2] | 88M       | 2407 MB    |  9960 MB [3] | 7.75 hrs     |
+--------+-----------+------------+--------------+--------------+

Footnotes:
[1] Timings on the 64-bit Linux cluster
[2] 90 processors
[3] Full results files (the default) contain many unnecessary variables; choosing "Selected Variables" makes results files only one-third as large
Memory and I/O
• Memory usage
  – 1.2 GB of memory allocated per processor
  – 300 words per node allocated
    • Actual usage is closer to 250 words per node
• CFX-10.0 does all I/O on the master processor
  – A serial I/O paradigm
• I/O times (writing to the project share)
  – 1 processor: less than 1 minute for the final file write
  – 90 processors: about 45 minutes for the final file write
• File compression runs serially on the master processor
  – It slows file output significantly
  – File compression will be done in parallel in CFX-11.0

Beyond 100M Nodes …
• Very large block-structured meshes for a single blade passage were recently obtained from GE colleagues (Tolpadi, Sewall)
  – Mesh 1: 3.1 million nodes
  – Mesh 2: 5.1 million nodes
• 64 passages were replicated in CFX-Pre to create a full annulus
  – Mesh 1: 199.3M nodes (5.5 GB definition file)
  – Mesh 2: 327.0M nodes (9.9 GB definition file)

Beyond 100M Nodes (cont'd)
• Ran into a limitation in CFX-10.0 when writing the definition files
  – Reported the problem to CFX Development
  – CFX determined the cause of the problem and developed a fix within two weeks
  – Received a preview version of CFX-11.0 that corrects the problem

The Largest CFX Mesh Yet … Linux!

327 Million Node Mesh Rendered in CFX-Post (wireframe)
• Memory requirements:
  – PostEngine: 23GB
  – PostGui: 5GB

327 Million Node Mesh (cont'd) – Mesh

Biggest CFX Needs for Large Data Set Cases
• To date, the Large Data Set Project has identified two major needs for CFX:
  1. 64-bit versions of ANSYS Workbench and CFX-Mesh
     – For unstructured meshing of large single grids
  2. A parallel METIS partitioner
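The per-processor allocation quoted above can be reproduced from the words-per-node figure if one assumes 4-byte (single-precision) words, which is an assumption here, not a documented CFX internal: 1M nodes × 300 words × 4 bytes ≈ 1.2 GB. A quick back-of-the-envelope check:

```python
def solver_memory_gb(nodes, words_per_node=300, bytes_per_word=4):
    """Rough per-processor solver memory estimate. The 4-byte
    (single-precision) word size is an assumption; it reproduces the
    1.2 GB/processor figure quoted for ~1M nodes per processor."""
    return nodes * words_per_node * bytes_per_word / 1e9

print(solver_memory_gb(1_000_000))   # 1.2 (GB per processor at 1M nodes)
print(solver_memory_gb(87_733_040))  # ~105 GB for the whole 88M-node mesh
```

The second figure illustrates why the full-annulus case has to be spread over ~90 processors: no single 32-bit node could hold it.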
Biggest CFX Needs for Large Data Set Cases (cont'd)
• Projects to develop both are underway
  – CFX-Mesh has been compiled with the Intel 8 compiler on 64-bit Itanium
    • Included in Service Pack 1 for Workbench 10.0 on XP-64
    • A 64-bit Linux version is under development for Workbench 11.0
  – Parallel METIS has been demonstrated for CFX unstructured and block-structured meshes as part of this project
    • See the following charts

Parallel METIS Prototype – Current Status
• A working prototype has been developed that uses parallel METIS to partition a CFX definition file
• Three components of the prototype:
  1. Convert the CFX definition file to an ASCII input file for ParMETIS
  2. Run ParMETIS on the mixed mesh of tetrahedra, pyramids, prisms, and hexes
  3. Convert the output from ParMETIS to a CFX partition file
• Meshes of up to 88M nodes have been partitioned with excellent results
  – The actual partitioning of the 88M-node mesh into 92 partitions took only 40 CPU seconds on 46 processors

Parallel Partitioning of a CFX Mesh Using ParMETIS
• 1.2M nodes, 16 partitions, 16 processors

+-----------+-------------+-------------+
| Partition | % Element   | % Node      |
+-----------+-------------+-------------+
| 1         | 6.5134E+00  | 6.5950E+00  |
| 2         | 6.1985E+00  | 6.2898E+00  |
| 3         | 6.4263E+00  | 6.4665E+00  |
| 4         | 6.0389E+00  | 6.1230E+00  |
| 5         | 6.5297E+00  | 6.5979E+00  |
| 6         | 6.3401E+00  | 6.4064E+00  |
| 7         | 6.2589E+00  | 6.3009E+00  |
| 8         | 6.5466E+00  | 6.6255E+00  |
| 9         | 6.5493E+00  | 6.4806E+00  |
| 10        | 6.1257E+00  | 6.1159E+00  |
| 11        | 6.3303E+00  | 6.2306E+00  |
| 12        | 5.5954E+00  | 5.4724E+00  |
| 13        | 6.2288E+00  | 6.1053E+00  |
| 14        | 6.3850E+00  | 6.2206E+00  |
| 15        | 6.3096E+00  | 6.2862E+00  |
| 16        | 5.6235E+00  | 5.6834E+00  |
+-----------+-------------+-------------+

Another Example of Parallel Partitioning of a CFX Mesh
• 44M nodes, 46 partitions, 46 processors
• The partitioning itself takes only 20 CPU seconds!
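Steps 1 and 3 of the three-part prototype pipeline are plain file conversions around the ParMETIS run. The sketch below shows the flavor of each, assuming a simple METIS-style ASCII layout (element count on the first line, then one line of 1-based node ids per element) and a one-partition-id-per-line output file; the actual formats used by the GE/CFX converters are not documented here, so the function names and file names are hypothetical.

```python
def write_metis_mesh(path, elements):
    """Write a mixed-element mesh (lists of 1-based node ids per
    element: tets, pyramids, prisms, hexes) as a simple ASCII
    element-node file, mirroring step 1 of the prototype (CFX
    definition file -> partitioner input)."""
    with open(path, "w") as f:
        f.write(f"{len(elements)}\n")
        for elem in elements:
            f.write(" ".join(str(n) for n in elem) + "\n")

def read_partition_vector(path):
    """Read the partitioner's output (one partition id per element),
    mirroring step 3, which maps these ids back into a CFX
    partition file."""
    with open(path) as f:
        return [int(line) for line in f]

# Two hexahedra sharing a face, written out for partitioning
hexes = [[1, 2, 3, 4, 5, 6, 7, 8],
         [5, 6, 7, 8, 9, 10, 11, 12]]
write_metis_mesh("mesh.txt", hexes)

# A stand-in partitioner output: one partition id per element
with open("mesh.part", "w") as f:
    f.write("0\n1\n")
print(read_partition_vector("mesh.part"))  # [0, 1]
```

Keeping the converters separate from the partitioner is what lets the same ParMETIS run handle all four element types: the graph partitioner only ever sees element-node connectivity, never element shapes.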
Concluding Remarks
• CFX-10.0 has demonstrated the ability to run problems exceeding 100 million grid points on present hardware
  – A converged solution was obtained for a complete turbine wheel in under 8 hours on 90 processors
  – The solver runs well on both 32-bit and 64-bit Linux clusters, with good parallel scalability
• A large-memory (16GB) 64-bit machine is needed for pre- and post-processing of large cases
• Billion-node CFX computations are not far off
  – Come back next year!