Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Parallel Programming Laboratory, Department of Computer Science
http://charm.cs.uiuc.edu

Roadmap
• BlueGene Architecture
• Need for an Emulator
• Charm++ BlueGene
• Converse BlueGene
• Future Work

Blue Gene: Processor-in-Memory Case Study
Five steps to a PetaFLOPS, taken from http://www.research.ibm.com/bluegene/:
• Processor: 1 GFlop/s, 0.5 MB
• Node/chip: 25 GFlop/s, 12.5 MB
• Board
• Tower
• Blue Gene: 1 PFlop/s, 0.5 TB
Functional model: a 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.

SMP Node
• 25 processors, 200 processing elements
• Input/output buffer: 32 x 128 bytes
• Network: connected to six neighbors via duplex links
  – 16 bits @ 500 MHz = 1 Gigabyte/s
  – Latencies: 5 cycles per hop, 75 cycles per turn

Processor
• 500 MHz
• Memory-side cache eliminates coherency problems
• 10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
• 8 integer units sharing 2 floating point units
• 8 threads x 25 processors x ~40,000 nodes (34 x 34 x 36) = ~8 x 10^6 processing elements!

Need for an Emulator
• An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.

Emulator Objectives
• Emulate Blue Gene and other petaFLOPS machines.
• Memory and time limitations on a single processor require that the simulation be performed on a parallel architecture.
• Issues:
  – Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging.
  – Therefore we do not need a complex event queue or rollback.

Emulator Implementation
• What are the basic data structures and interface?
  – Machine configuration (topology), handler registration
  – Nodes with node-level shared data
  – Threads (associated with each node) representing processing elements
  – Communication between nodes
• How do we handle all of these objects on a parallel architecture? How do we handle object-to-object communication?
• The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.

Experiments on the Emulator
• Sample applications implemented:
  – Primes
  – Jacobi relaxation
  – MD prototype
    • 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
    • Ran the full Blue Gene configuration (8 x 10^6 threads) on ~100 ASCI-Red processors
[Figure: ApoA-I, 92k atoms]

Collective Operations
• Explore different algorithms for broadcasts and reductions: ring, line, and octree.
• Use a "primitive" 30 x 30 x 20 (10 threads per node) Blue Gene emulation on a 50-processor Linux cluster.

Converse BlueGene Emulator: Objectives
• Performance estimation (with proper time stamping)
• Provide an API for building Charm++ on top of the emulator

BlueGene Emulator: Node Structure
• Communication threads
• Worker threads
• inBuffer
• Non-affinity message queue
• Affinity message queues
(A sketch of these data structures appears after the Charm++ slide below.)

Performance
• Pingpong:
  – Converse BlueGene pingpong: close to Converse pingpong (81-103 us vs. 92 us RTT)
  – Charm++ pingpong: 116 us RTT
  – Charm++ BlueGene pingpong: 134-175 us RTT

Charm++ on Top of the Emulator
• A BlueGene thread represents a Charm++ node.
• Name conflicts:
  – Cpv, Ctv
  – MsgSend, etc.
  – CkMyPe(), CkNumPes(), etc.
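Because each Charm++ "node" now maps to a BlueGene worker thread rather than to a physical processor, per-processor constructs such as Cpv/Ctv variables and CkMyPe()/CkNumPes() have to be reinterpreted per emulated thread. The following is a minimal C++ sketch of one way that remapping could look; the Bg-prefixed names and the thread-local table are illustrative assumptions, not the emulator's actual interface.

/* Hedged sketch: remapping per-processor Charm++ notions onto emulated
 * BlueGene worker threads.  All Bg* names here are hypothetical, chosen
 * for illustration only. */
#include <vector>

struct BgThreadState {
  int globalRank = 0;              // rank among all emulated processing elements
  std::vector<void*> privateVars;  // storage backing "Cpv"-style per-PE variables
};

// One state object per emulated worker thread; a real emulator would
// initialize this when it spawns the thread.
static thread_local BgThreadState bgMyThread;
static int bgTotalThreads = 1;

// Per-"processor" variables become per-emulated-thread slots, so code that
// used Cpv variables keeps its semantics when many "nodes" share one process.
inline void*& BgCpvAccess(int slot) { return bgMyThread.privateVars[slot]; }

// CkMyPe()/CkNumPes() analogues answer in terms of emulated threads,
// not the physical processors of the host machine.
inline int BgMyPe()   { return bgMyThread.globalRank; }
inline int BgNumPes() { return bgTotalThreads; }

The point of such a translation layer is to let otherwise unmodified Charm++ code see each emulated thread as its own "processor," which is exactly the mapping the slide above describes.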
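Referring back to the node-structure slide above: each emulated node owns an inBuffer filled by communication threads, plus message queues drained by worker threads (per-worker affinity queues and a shared non-affinity queue). The sketch below shows one plausible arrangement of those pieces; the type names, fields, and the omission of locking are assumptions for illustration, not the emulator's actual layout.

#include <deque>
#include <utility>
#include <vector>

// One message as seen inside an emulated node.  'targetWorker' < 0 marks a
// non-affinity message that any worker thread may execute.
struct BgMessage {
  int handlerId;
  int targetWorker;
  std::vector<char> payload;
};

// Sketch of one emulated SMP node (synchronization between communication
// and worker threads is omitted for brevity).
struct BgNode {
  int x, y, z;                                        // position in the 3-D mesh
  std::vector<char> sharedData;                       // node-level shared data
  std::deque<BgMessage> inBuffer;                     // raw packets from the network
  std::deque<BgMessage> nonAffinityQueue;             // work any worker may take
  std::vector<std::deque<BgMessage>> affinityQueues;  // one queue per worker thread

  explicit BgNode(int numWorkers)
      : x(0), y(0), z(0), affinityQueues(numWorkers) {}

  // A communication thread sorts an arrived message into the queue that the
  // worker threads will drain.
  void deposit(BgMessage msg) {
    if (msg.targetWorker < 0)
      nonAffinityQueue.push_back(std::move(msg));
    else
      affinityQueues[msg.targetWorker].push_back(std::move(msg));
  }
};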
Future Work: Simulator
• LeanMD: a fully functional MD application with only cutoff interactions.
• How can we examine the performance of algorithms on variants of the processor-in-memory design in such a massive system?
• Several layers of detail to measure:
  – Basic: correctly model performance; timestamp messages, with correction for out-of-order execution.
  – More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques.
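The "basic" level of detail above boils down to giving every emulated thread a virtual clock and every message a timestamp, and correcting the receiver's clock when messages arrive out of (virtual) order. A minimal sketch of that rule follows, assuming a modeled network delay per message and an estimated execution cost per handler; the names are illustrative, not the simulator's actual interface.

// Hedged sketch of timestamp-based performance estimation.  A message is
// handled no earlier than its virtual arrival time (send time + modeled
// network delay); the receiver's clock jumps forward if it was behind,
// which is the "correction for out-of-order execution."
struct TimedMessage {
  double sendTime;      // sender's virtual clock at the send
  double networkDelay;  // modeled latency (e.g., from per-hop/per-turn cycles)
};

struct VirtualClock {
  double now = 0.0;     // this emulated thread's virtual time

  // Advance the clock across one handler execution and return its finish time.
  double execute(const TimedMessage& msg, double execCost) {
    double arrival = msg.sendTime + msg.networkDelay;
    if (arrival > now) now = arrival;  // wait (virtually) for the message
    now += execCost;                   // estimated time to run the handler
    return now;
  }
};

With per-thread clocks of this kind, the simulator can report estimated completion times for the target machine even though the host machine executes handlers in whatever order its own scheduler chooses.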