Presentation Outline
– A word or two about our program
– Our HPC system acquisition process
– Program benchmark suite
– Evolution of benchmark-based performance metrics
– Where do we go from here?

HPC Modernization Program

HPC Modernization Program Goals

DoD HPC Modernization Program

HPCMP Serves a Large, Diverse DoD User Community
– 519 projects and 4,086 users at approximately 130 sites
– Requirements categorized in 10 Computational Technology Areas (CTAs)
– FY08 non-real-time requirements of 1,108 Habu-equivalents
– Users by CTA:
  – Computational Fluid Dynamics – 1,572 users
  – Computational Structural Mechanics – 437 users
  – Computational Chemistry, Biology & Materials Science – 408 users
  – Signal/Image Processing – 353 users
  – Computational Electromagnetics & Acoustics – 337 users
  – Climate/Weather/Ocean Modeling & Simulation – 241 users
  – Forces Modeling & Simulation – 182 users
  – Environmental Quality Modeling & Simulation – 147 users
  – Integrated Modeling & Test Environments – 139 users
  – Electronics, Networking, and Systems/C4I – 114 users
  – Other (self-characterized) – 156 users

High Performance Computing Centers
– Strategic consolidation of resources
– 4 Major Shared Resource Centers (MSRCs)
– 4 Allocated Distributed Centers (ADCs)

HPCMP Center Resources, 1993 to 2007
[Map: locations of MSRCs and ADCs (DCs) in 1993 and 2007.]

Total HPCMP End-of-Year Computational Capabilities
[Bar chart: computational capability in Habus by fiscal year, FY01–FY07 (TI-XX), broken out by MSRCs and ADCs. Note: computational capability reflects available GFLOPS during the fiscal year.]

HPC Modernization Program – MSRC systems, FY03–FY07 (as of August 2007)
Army Research Laboratory (ARL)
  – Linux Networx Cluster – 256 PEs
  – Linux Networx Cluster – 2,100 PEs
  – IBM Opteron Cluster (C) – 2,372 PEs
  – SGI Altix Cluster (C) – 256 PEs
  – Linux Networx Cluster – 4,528 PEs
  – Linux Networx Cluster (C) – 3,464 PEs
Aeronautical Systems Center (ASC)
  – SGI Origin 3900 – 2,048 PEs
  – SGI Origin 3900 (C) – 128 PEs
  – IBM P4 (C) – 32 PEs
  – SGI Altix Cluster – 2,048 PEs
  – HP Opteron – 2,048 PEs
  – SGI Altix – 9,216 PEs
Engineer Research and Development Center (ERDC)
  – SGI Origin 3900 – 1,024 PEs
  – Cray XT3 (FY07 upgrade) – 8,192 PEs
  – Cray XT4 – 8,848 PEs
Naval Oceanographic Office (NAVO)
  – IBM P4+ – 3,456 PEs
  – IBM 1600 P5 Cluster – 3,072 PEs
  – IBM 1600 P5 Cluster (C) – 1,920 PEs

HPC Modernization Program – ADC systems, FY03–FY06 (as of August 2007)
Army High Performance Computing Research Center (AHPCRC)
  – Cray X1E – 1,024 PEs
  – Cray XT3 – 1,128 PEs
Arctic Region Supercomputing Center (ARSC)
  – IBM Regatta P4 – 800 PEs
  – Sun x4600 – 2,312 PEs
Maui High Performance Computing Center (MHPCC)
  – Dell PowerEdge 1955 – 5,120 PEs
Space & Missile Defense Command (SMDC)
  – SGI Origin 3000 – 736 PEs
  – SGI Altix – 128 PEs
  – West Scientific Cluster – 64 PEs
  – IBM e1300 Cluster – 256 PEs
  – IBM Regatta P4 – 32 PEs
  – Cray X1E – 128 PEs
  – Atipa Linux Cluster – 256 PEs
  – IBM Xeon Cluster – 128 PEs
  – Cray XD1 – 288 PEs

Overview of TI-XX Acquisition Process
– Determination of requirements, usage, and allocations
– Choose application benchmarks, test cases, and weights
– Measure benchmark times on the DoD standard system
– Vendors provide measured and projected times on offered systems; determine performance for each offered system on each application test case
– Measure benchmark times on existing DoD systems; determine performance for each existing system on each application test case
– Determine performance for each offered system
– Use an optimizer to determine price/performance for each offered system and each combination of systems (see the sketch following this slide)
– Collective acquisition decision, which also weighs center facility requirements, vendor pricing, life-cycle costs for offered systems, and usability/past-performance information on offered systems
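The acquisition step above relies on an optimizer that evaluates price/performance for individual offered systems and for combinations of them. The program's actual optimizer is not described in this outline; the following is a minimal sketch of how such a selection could be framed as a search over system combinations under a budget cap. The system names, prices, budget, and performance scores (in DoD-standard-system, i.e. Habu, equivalents) are hypothetical.

```python
from itertools import combinations

# Hypothetical offered systems: (name, price in $M, performance in
# DoD-standard-system (Habu) equivalents).  All values are illustrative.
OFFERED = [
    ("Vendor A cluster", 12.0, 9.5),
    ("Vendor B cluster", 18.0, 15.2),
    ("Vendor C SMP",     25.0, 17.8),
    ("Vendor D cluster",  9.0,  6.1),
]

BUDGET = 40.0  # total funds available, $M (assumed)


def best_combination(systems, budget):
    """Exhaustively search all subsets of the offered systems and return
    the subset with the highest total performance that fits the budget."""
    best_subset, best_price, best_perf = [], 0.0, 0.0
    for r in range(1, len(systems) + 1):
        for subset in combinations(systems, r):
            price = sum(s[1] for s in subset)
            perf = sum(s[2] for s in subset)
            if price <= budget and perf > best_perf:
                best_subset, best_price, best_perf = list(subset), price, perf
    return best_subset, best_price, best_perf


if __name__ == "__main__":
    chosen, price, perf = best_combination(OFFERED, BUDGET)
    print(f"Selected systems ({price:.1f} $M, {perf:.1f} Habu-equivalents):")
    for name, p, s in chosen:
        print(f"  {name}: {p:.1f} $M, {s:.1f} Habu-eq ({s / p:.2f} Habu-eq/$M)")
```

In the real process the decision also weighs life-cycle costs, facility requirements, usability, and past performance, and may consider multiple units of a system; the sketch only illustrates the core price/performance trade-off within a fixed budget.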
TI-08 Synthetic Test Suite
– CPUBench – floating-point execution rate
– ICBench – interconnect bandwidth and latency
– LANBench – external network interface and connection bandwidth
– MEMBench – memory bandwidth (MultiMAPS)
– OSBench – operating system noise (PSNAP from LANL)
– SPIOBench – streaming parallel I/O bandwidth

TI-08 Application Benchmark Codes
– ICEPIC – particle-in-cell magnetohydrodynamics code (C, MPI, 60,000 SLOC)
– LAMMPS – molecular dynamics code (C++, MPI, 45,400 SLOC)
– AMR – gas dynamics code (C++/Fortran, MPI, 40,000 SLOC)
– AVUS (Cobalt-60) – turbulent flow CFD code (Fortran, MPI, 19,000 SLOC)
– CTH – shock physics code (~43% Fortran/~57% C, MPI, 436,000 SLOC)
– GAMESS – quantum chemistry code (Fortran, MPI, 330,000 SLOC)
– HYCOM – ocean circulation modeling code (Fortran, MPI, 31,000 SLOC)
– OOCore – out-of-core solver mimicking an electromagnetics code (Fortran, MPI, 39,000 SLOC)
– Overflow2 – CFD code originally developed by NASA (Fortran, MPI, 83,600 SLOC)
– WRF – multi-agency mesoscale atmospheric modeling code (Fortran and C, MPI, 100,000 SLOC)

Application Benchmark History (by Computational Technology Area)
– Computational Structural Mechanics:
  FY03 CTH; FY04 CTH; FY05 RFCTH; FY06 RFCTH; FY07 CTH; FY08 CTH
– Computational Fluid Dynamics:
  FY03 Cobalt60, LESLIE3D, Aero; FY04 Cobalt60, Aero; FY05 AVUS, Overflow2, Aero; FY06 AVUS, Overflow2, Aero; FY07 AVUS, Overflow2; FY08 AVUS, Overflow2, AMR
– Computational Chemistry, Biology, and Materials Science:
  FY03 GAMESS, NAMD; FY04 GAMESS, NAMD; FY05 GAMESS; FY06 GAMESS, LAMMPS; FY07 GAMESS, LAMMPS; FY08 GAMESS, LAMMPS
– Computational Electromagnetics and Acoustics:
  FY03 OOCore; FY04 OOCore; FY05 OOCore; FY06 OOCore; FY07 OOCore, ICEPIC; FY08 OOCore, ICEPIC
– Climate/Weather/Ocean Modeling and Simulation:
  FY03 NLOM; FY04 HYCOM; FY05 HYCOM, WRF; FY06 HYCOM, WRF; FY07 HYCOM, WRF; FY08 HYCOM, WRF

Determination of Performance
– Establish a DoD standard benchmark time for each application benchmark case
  – ERDC Cray dual-core XT3 (Sapphire) chosen as the DoD standard system
  – Standard benchmark times measured on the standard system at 128 processors for standard test cases and 512 processors for large test cases
  – The split in weight between standard and large application test cases is made at 256 processors
– Benchmark timings (at least four on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)
– Benchmark timings may be extrapolated provided they are guaranteed, but at least two actual timings must be provided for each test case

Determination of Performance (cont.)
– Curve fit: Time = A/N + B + C*N
  – N = number of processing cores
  – A/N = time for the parallel portion of the code (parallel base)
  – B = time for the serial portion of the code
  – C*N = parallel penalty (parallel overhead)
– Constraints
  – A/N ≥ 0: parallel base time is non-negative
  – Tmin ≥ B ≥ 0: serial time is non-negative and not greater than the minimum observed time

Determination of Performance (cont.)
– Curve fit approach (see the sketch following this slide)
  – For each candidate value of B (Tmin ≥ B ≥ 0):
    – Determine A from Time − B = A/N
    – Determine C from Time − (A/N + B) = C*N
    – Calculate the fit quality, where (Ni, Ti) is the time Ti observed at Ni cores and M is the number of observed core counts:
      Fit Quality = 1.0 / [ (1/M) * Σ_{i=1..M} (Ti − (A/Ni + B + C*Ni))² ]
  – Select the value of B with the largest fit quality
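The curve-fit procedure above can be sketched in a few lines. The slides give the model Time = A/N + B + C*N, the constraints A ≥ 0 and 0 ≤ B ≤ Tmin, and the fit-quality metric; how A and C are determined for a fixed B is not spelled out, so this sketch assumes a simple least-squares estimate at each step, and the benchmark timings are made up for illustration.

```python
import numpy as np

# Hypothetical benchmark timings for one application test case:
# (cores, observed wall-clock time in seconds).
cores = np.array([128.0, 256.0, 384.0, 512.0, 768.0])
times = np.array([905.0, 498.0, 367.0, 301.0, 246.0])


def fit_for_b(b, n, t):
    """For a fixed serial time B, estimate A and C by least squares (an
    assumed reading of "Determine A: Time - B = A/N" and
    "Determine C: Time - (A/N + B) = C*N"), then return the fit quality
    1 / [(1/M) * sum_i (T_i - (A/N_i + B + C*N_i))^2] along with A and C."""
    a = max(0.0, np.sum((t - b) / n) / np.sum(1.0 / n ** 2))  # enforce A >= 0
    resid = t - (a / n + b)
    c = np.sum(resid * n) / np.sum(n ** 2)
    mse = np.mean((t - (a / n + b + c * n)) ** 2)
    return 1.0 / mse, a, c


# Scan candidate B values over [0, Tmin] and keep the one with the
# largest fit quality, per the selection rule on the slide.
candidates = [fit_for_b(b, cores, times) + (b,)
              for b in np.linspace(0.0, times.min(), 201)]
quality, A, C, B = max(candidates, key=lambda r: r[0])
print(f"A = {A:.1f}  B = {B:.2f}  C = {C:.5f}  (fit quality = {quality:.3g})")
```

Scanning B on a grid mirrors the selection rule on the slide; a finer grid or a one-dimensional optimizer could be substituted without changing the idea.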
Determination of Performance (cont.)
– Calculate the score (in DoD standard system equivalents); a sketch of this calculation appears at the end of this outline
  – C = number of compute cores in the target system
  – Cbase = number of compute cores in the standard system
  – Sbase = number of compute cores in the standard execution
  – STM = size-to-match = number of compute cores of the target system required to match the performance of Sbase cores of the standard system
  – Score = (Sbase * C) / (Cbase * STM)

Example curve fits (relative performance in Sapphire equivalents vs. cores; each plot shows the benchmark data, the fitted benchmark curve, and the STM range):
– AMR large test case on HP Opteron Cluster
– AMR large test case on SGI Altix
– AMR large test case on Dell Xeon Cluster
– Overflow-2 standard test case on Dell Xeon Cluster
– Overflow-2 large test case on IBM P5+
– ICEPIC standard test case on SGI Altix
– ICEPIC large test case on SGI Altix (also shows a pseudo score)

Comparison of HPCMP System Capabilities: FY 2003 – FY 2008
[Bar chart: Habu-equivalents per processor for each benchmark suite year, FY 2003 through FY 2008, across systems: IBM P3, IBM P4, IBM P4+, IBM P5+, HP SC40, HP SC45, HP Opteron Cluster, SGI O3800, SGI O3900, SGI Altix, LNXI Xeon Cluster (3.6), LNXI Xeon Cluster (3.0), Dell Xeon Cluster, Cray XT3.]

What's Next?
– Continue to evolve the application benchmarks to accurately represent the HPCMP computational workload
– Increase profiling and performance modeling to better understand application performance
– Use performance predictions to supplement application benchmark measurements and to guide vendors in designing more efficient systems
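As referenced in the score slide above, the following sketch shows how the size-to-match (STM) and the score in DoD-standard-system equivalents could be computed from a fitted benchmark curve. Only the formulas Time = A/N + B + C*N and Score = (Sbase * C) / (Cbase * STM) come from the slides; the fitted coefficients, the standard benchmark time, and the core counts are hypothetical.

```python
from scipy.optimize import brentq

# Fitted curve coefficients for the target system on one application test
# case (illustrative values only, not actual TI-08 results).
A, B, C = 115000.0, 20.0, 0.05


def predicted_time(n):
    """Predicted wall-clock time at n cores from the curve fit."""
    return A / n + B + C * n


# DoD standard system (Sapphire) reference values -- hypothetical numbers.
T_STANDARD = 450.0   # standard benchmark time at Sbase cores (seconds)
S_BASE = 512         # cores used for the standard execution (Sbase)
C_BASE = 8848        # compute cores in the standard system (Cbase)
C_TARGET = 4096      # compute cores in the offered (target) system (C)

# Size-to-match (STM): cores of the target system needed to match the
# standard system's time, i.e. solve predicted_time(n) = T_STANDARD.
# The bracket [2, C_TARGET] assumes the match point lies within the
# target system's size.
stm = brentq(lambda n: predicted_time(n) - T_STANDARD, 2.0, float(C_TARGET))

# Score in DoD-standard-system equivalents.
score = (S_BASE * C_TARGET) / (C_BASE * stm)
print(f"STM = {stm:.0f} cores, score = {score:.3f} standard-system equivalents")
```

With these assumed numbers the target system matches the standard execution with roughly 276 cores, giving a score of about 0.86 standard-system equivalents for this test case; the program combines such per-test-case scores with the chosen weights.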