
Emulating Massively Parallel
(PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla
Joshua Mostkoff Unger, Gengbin Zheng,
Laxmikant V. Kalé
http://charm.cs.uiuc.edu
Department of Computer Science
Parallel Programming Laboratory
Roadmap
• BlueGene Architecture
• Need for an Emulator
• Charm++ BlueGene
• Converse BlueGene
• Future Work
Blue Gene: Processor-in-memory Case Study
• Five steps to a PetaFLOPS, taken from:
  – http://www.research.ibm.com/bluegene/
[Figure: the five steps: PROCESSOR (1 GFlop/s, 0.5 MB) → NODE/CHIP (25 GFlop/s, 12.5 MB) → BOARD → TOWER → BLUE GENE (1 PFlop/s, 0.5 TB)]
• FUNCTIONAL MODEL: 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node
• 25 processors, 200 processing elements
• Input/Output buffer: 32 x 128 bytes
• Network:
  – Connected to six neighbors via duplex links
  – 16 bits @ 500 MHz = 1 Gigabyte/s per link
  – Latencies: 5 cycles per hop, 75 cycles per turn (see the example below)
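As an illustrative calculation using only the numbers above (the message path is an assumption, not a figure from the slides): a message traveling 10 hops in x and 10 hops in y makes one turn, so at the 500 MHz clock

\[
t \approx 20 \times 5 + 1 \times 75 = 175 \text{ cycles} \approx 0.35\,\mu\text{s}.
\]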
Processor
STATS:
• 500 MHz clock
• Memory-side cache eliminates coherency problems
• 10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
• 8 integer units sharing 2 floating-point units
• 8 x 25 x ~40,000 = ~8 x 10^6 processing elements! (worked out below)
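The processing-element count follows directly from the functional model above (a 34 x 34 x 36 cube of nodes, 25 processors per node, 8 processing elements per processor):

\[
34 \times 34 \times 36 = 41{,}616 \approx 40{,}000 \text{ nodes}, \qquad
8 \times 25 \times 41{,}616 \approx 8.3 \times 10^{6} \text{ processing elements.}
\]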
Need for an Emulator
• An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
Emulator Objectives
• Emulate Blue Gene and other petaFLOPS machines.
• Memory and time limitations on a single processor require that the simulation be performed on a parallel architecture.
• Issues:
  – Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging.
  – Therefore no complex event queue/rollback mechanism is needed.
Emulator Implementation
• What are the basic data structures/interface? (see the sketch below)
  – Machine configuration (topology), handler registration
  – Nodes with node-level shared data
  – Threads (associated with each node) representing processing elements
  – Communication between nodes
• How do we handle all these objects on a parallel architecture? How do we handle object-to-object communication?
• The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
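A minimal sketch of what such an interface could look like, assuming hypothetical names (Config, Node, Emulator, registerHandler, sendPacket, step); this is not the emulator's actual API, and real worker/communication threads are replaced by a single sequential scheduling step:

// Minimal sketch only: Config, Node, Emulator, registerHandler, sendPacket,
// and step are hypothetical names, not the emulator's actual API.
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Config { int x, y, z; int threadsPerNode; };          // machine topology
struct Message { int handlerId; std::vector<char> data; };   // inter-node packet

struct Node;                                                 // forward declaration
using Handler = std::function<void(Node&, Message&)>;        // registered handler type

struct Node {
  int x, y, z;                   // coordinates in the emulated mesh
  std::vector<char> sharedData;  // node-level shared data
  std::queue<Message> inBuffer;  // packets waiting for this node's threads
};

struct Emulator {
  Config cfg;
  std::vector<Node> nodes;
  std::vector<Handler> handlers;

  explicit Emulator(Config c) : cfg(c), nodes(c.x * c.y * c.z) {
    for (int i = 0; i < c.x; ++i)
      for (int j = 0; j < c.y; ++j)
        for (int k = 0; k < c.z; ++k) { Node& n = at(i, j, k); n.x = i; n.y = j; n.z = k; }
  }

  int registerHandler(Handler h) { handlers.push_back(h); return (int)handlers.size() - 1; }

  Node& at(int x, int y, int z) { return nodes[(x * cfg.y + y) * cfg.z + z]; }

  void sendPacket(int x, int y, int z, Message m) { at(x, y, z).inBuffer.push(std::move(m)); }

  // One scheduling step: drain each node's inBuffer and invoke the handlers.
  // (The emulator would use dedicated communication/worker threads instead.)
  void step() {
    for (Node& n : nodes)
      while (!n.inBuffer.empty()) {
        Message m = std::move(n.inBuffer.front());
        n.inBuffer.pop();
        handlers[m.handlerId](n, m);
      }
  }
};

int main() {
  Emulator emu(Config{34, 34, 36, 200});   // functional-model dimensions
  int hello = emu.registerHandler([](Node& n, Message&) {
    std::printf("hello on node (%d,%d,%d)\n", n.x, n.y, n.z);
  });
  emu.sendPacket(0, 0, 0, Message{hello, {}});
  emu.step();
  return 0;
}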
Experiments on Emulator
• Sample applications implemented:
  – Primes
  – Jacobi relaxation
  – MD prototype (see the sketch below):
    • 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
    • Ran the full Blue Gene configuration (~8 x 10^6 threads) on ~100 ASCI Red processors
[Figure: ApoA-I, 92k atoms]
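As a rough illustration of the nearest-neighbor cutoff idea (not the actual MD prototype code; the pair loop, force form, and reduced units are assumptions for this sketch), only atom pairs within the cutoff radius contribute forces:

#include <cmath>
#include <cstddef>
#include <vector>

struct Atom { double x, y, z; double fx = 0, fy = 0, fz = 0; };

// O(N^2) pair loop for clarity; the prototype would only consider atoms on
// neighboring nodes/cells rather than scanning all pairs.
void computeCutoffForces(std::vector<Atom>& atoms, double cutoff) {
  const double cutoff2 = cutoff * cutoff;
  for (std::size_t i = 0; i < atoms.size(); ++i) {
    for (std::size_t j = i + 1; j < atoms.size(); ++j) {
      double dx = atoms[i].x - atoms[j].x;
      double dy = atoms[i].y - atoms[j].y;
      double dz = atoms[i].z - atoms[j].z;
      double r2 = dx * dx + dy * dy + dz * dz;
      if (r2 > cutoff2) continue;       // outside cutoff: skip this pair
      double inv_r2 = 1.0 / r2;         // Lennard-Jones-like force factor in reduced units
      double f = 24.0 * (2.0 * std::pow(inv_r2, 7) - std::pow(inv_r2, 4));
      atoms[i].fx += f * dx;  atoms[i].fy += f * dy;  atoms[i].fz += f * dz;
      atoms[j].fx -= f * dx;  atoms[j].fy -= f * dy;  atoms[j].fz -= f * dz;
    }
  }
}

int main() {
  std::vector<Atom> atoms = {{0, 0, 0}, {1, 0, 0}, {10, 0, 0}};  // third atom lies beyond the cutoff
  computeCutoffForces(atoms, 2.5);
  return 0;
}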
Collective Operations
• Explore different algorithms for broadcasts and reductions (an octree-style variant is sketched below)
[Figures: RING, LINE, and OCTREE spanning patterns over the x, y, z node mesh]
• Used a "primitive" 30 x 30 x 20 (10 threads per node) Blue Gene emulation on a 50-processor Linux cluster
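A minimal sketch of what an octree-style broadcast over the 3D node mesh might look like (Box, sendPacket, and octreeBroadcast are hypothetical names; this is not the emulator's actual collective code): the node responsible for a box of nodes delivers locally, splits the box into up to eight octants, recurses on its own octant, and forwards the others to their corner nodes.

#include <cstdio>

struct Box { int xlo, xhi, ylo, yhi, zlo, zhi; };   // inclusive node-coordinate ranges

void octreeBroadcast(int x, int y, int z, const Box& box);

// Stand-in for the emulator's network send: here the destination "receives"
// immediately and continues the broadcast for its octant.
void sendPacket(int x, int y, int z, const Box& box) {
  std::printf("forward to node (%d,%d,%d)\n", x, y, z);
  octreeBroadcast(x, y, z, box);
}

// Node (x,y,z) is responsible for 'box': deliver the payload locally (omitted),
// then split the box into octants and hand each one off.
void octreeBroadcast(int x, int y, int z, const Box& box) {
  if (box.xlo == box.xhi && box.ylo == box.yhi && box.zlo == box.zhi) return;  // single node
  int xm = (box.xlo + box.xhi) / 2, ym = (box.ylo + box.yhi) / 2, zm = (box.zlo + box.zhi) / 2;
  int xr[2][2] = {{box.xlo, xm}, {xm + 1, box.xhi}};
  int yr[2][2] = {{box.ylo, ym}, {ym + 1, box.yhi}};
  int zr[2][2] = {{box.zlo, zm}, {zm + 1, box.zhi}};
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k) {
        Box oct{xr[i][0], xr[i][1], yr[j][0], yr[j][1], zr[k][0], zr[k][1]};
        if (oct.xlo > oct.xhi || oct.ylo > oct.yhi || oct.zlo > oct.zhi) continue;  // empty octant
        bool mine = x >= oct.xlo && x <= oct.xhi && y >= oct.ylo && y <= oct.yhi &&
                    z >= oct.zlo && z <= oct.zhi;
        if (mine) octreeBroadcast(x, y, z, oct);              // keep subdividing our own octant
        else      sendPacket(oct.xlo, oct.ylo, oct.zlo, oct); // corner node takes over
      }
}

int main() {
  octreeBroadcast(0, 0, 0, Box{0, 3, 0, 3, 0, 3});  // small 4 x 4 x 4 example
  return 0;
}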
Converse BlueGene Emulator Objectives
• Performance estimation (with proper time stamping)
• Provide an API for building Charm++ on top of the emulator.
Bluegene Emulator: Node Structure
[Figure: each emulated node contains an inBuffer, communication threads, worker threads, a non-affinity message queue, and affinity message queues (sketched below)]
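A minimal sketch of that node structure, assuming hypothetical type and field names (BgMessage, BgNode, route); this is not the emulator's actual declaration, and the threads themselves are omitted:

#include <queue>
#include <utility>
#include <vector>

struct BgMessage {
  int handlerId;
  int affinity;                 // index of a specific worker thread, or -1 for "any worker"
  std::vector<char> payload;
};

struct BgNode {
  std::queue<BgMessage> inBuffer;                      // packets arriving from the network
  std::queue<BgMessage> nonAffinityQueue;              // work any worker thread may execute
  std::vector<std::queue<BgMessage>> affinityQueues;   // one queue per worker thread

  // Communication threads would drain inBuffer and route each message to the
  // appropriate queue; worker threads then pull from their own affinity queue
  // or from the shared non-affinity queue and run the registered handler.
  void route(BgMessage msg) {
    if (msg.affinity >= 0 && msg.affinity < (int)affinityQueues.size())
      affinityQueues[msg.affinity].push(std::move(msg));
    else
      nonAffinityQueue.push(std::move(msg));
  }
};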
Performance
• Pingpong round-trip time (RTT):
  – Close to Converse pingpong: 81-103 µs vs. 92 µs RTT
  – Charm++ pingpong: 116 µs RTT
  – Charm++ Bluegene pingpong: 134-175 µs RTT
Charm++ on top of Emulator
• A BlueGene thread represents a Charm++ node.
• Name conflicts (one possible resolution is sketched below):
  – Cpv, Ctv
  – MsgSend, etc.
  – CkMyPe(), CkNumPes(), etc.
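A minimal sketch of one way such conflicts could be handled (BgThreadContext, BgCkMyPe, and BgCkNumPes are hypothetical names; this is not the actual Charm++-on-BlueGene implementation): the emulated node/thread identity is looked up from the currently running BlueGene thread's context rather than from the real process, and conflicting names are redirected to the emulated versions.

// Illustrative only: names and mechanism are assumptions, not the real code.
struct BgThreadContext {
  int emulatedPe;       // rank of the emulated Charm++ processing element
  int numEmulatedPes;   // total emulated PEs across the emulated machine
};

// One context per emulated thread; the emulator's scheduler would install it
// before running that thread (thread_local stands in for that mechanism here,
// since the emulator's threads are user-level).
thread_local BgThreadContext* bgCurrentContext = nullptr;

// Emulated replacements for the process-level CkMyPe()/CkNumPes().
inline int BgCkMyPe()   { return bgCurrentContext->emulatedPe; }
inline int BgCkNumPes() { return bgCurrentContext->numEmulatedPes; }

// Application code compiled for the emulator gets the conflicting names
// redirected, e.g. by renaming at build time:
#define CkMyPe   BgCkMyPe
#define CkNumPes BgCkNumPes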
Future Work: Simulator
• LeanMD: a fully functional MD application with cutoff only
• How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
• Several layers of detail to measure:
  – Basic: correctly model performance; timestamp messages with correction for out-of-order execution (see the sketch below)
  – More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques
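A minimal sketch of timestamping with a simple correction rule, under assumptions made for this writeup (TimedMessage, WorkerTimeline, and the latency model are hypothetical, not the simulator's design): each message carries the virtual time at which it would arrive on the target machine, and an event's virtual start time is the later of that arrival time and the time its worker becomes free.

#include <algorithm>

struct TimedMessage {
  double sendTime;        // virtual time at which the sender issued the send
  double networkLatency;  // modeled latency (hops, turns, bandwidth) in virtual time
  double recvTime() const { return sendTime + networkLatency; }
};

struct WorkerTimeline {
  double freeAt = 0.0;    // virtual time at which this worker finishes its last event

  // Gives each event a virtual start time no earlier than its message's
  // arrival time and no earlier than the end of the previous event on this
  // worker, regardless of when the emulation actually ran the handler.
  double startEvent(const TimedMessage& msg, double workDuration) {
    double start = std::max(msg.recvTime(), freeAt);
    freeAt = start + workDuration;
    return start;
  }
};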