The BlueGene Supercomputer and LOFAR/LOIS
Bruce Elmegreen
IBM Watson Research Center
914 945 2448
[email protected]
LOFAR = Low Frequency Array
LOIS = LOFAR Outrigger in Sweden
[Images: BlueGene/L at LLNL; LOIS; LOFAR]
Current Radio Telescope Design
• parabolic receivers (e.g., Onsala Space Observatory)
• connected-element interferometers (e.g., Westerbork, Netherlands)
• Very Long Baseline Interferometers (e.g., the European VLBI Network, EVN)
LOFAR and LOIS: A Radical New Design
(Radio Telescopes as Sensor Networks)
Tens of thousands of simple, fixed antennae
[Images: sample design for LOFAR; sample antenna for LOIS]
Points to Sky Electronically
Current way to focus: mechanically point a parabolic dish (e.g., the 76 m Lovell telescope at Manchester, UK)
The LOFAR/LOIS way to focus: electronically add signals with time delays chosen to reinforce the desired direction (see the sketch below)
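A minimal delay-and-sum sketch of this electronic focusing, in Python/NumPy. The array geometry, sample rate, and whole-sample delays are illustrative assumptions, not LOFAR's actual design:

```python
import numpy as np

# Delay-and-sum beamforming sketch: time-shift each antenna's signal so a
# plane wave from the desired direction lines up, then add the signals.
c = 3.0e8                                      # speed of light, m/s
fs = 200e6                                     # sample rate, Hz (assumed)
positions = np.array([0.0, 10.0, 20.0, 30.0])  # antenna x-coordinates, m (toy array)
theta = np.deg2rad(30.0)                       # desired pointing angle from zenith

# Geometric delay of each antenna for a plane wave arriving from theta,
# rounded to whole samples (a real system would interpolate).
delays = positions * np.sin(theta) / c
shifts = np.round(delays * fs).astype(int)

signals = np.random.randn(len(positions), 4096)              # toy antenna voltages
beam = sum(np.roll(s, -k) for s, k in zip(signals, shifts))  # reinforced sum
```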
Advantages of the Sensor Network Approach
Antennae are cheap (no precision surfaces needed)
No moving parts (no expensive mounts, no pointing problems)
Low profile (no radome needed, no wind/weather problems)
Can change pointing directions quickly
Can observe multiple directions at the same time (i.e., with several sums)
Can extend for large distances by placing sensors anywhere
Provides a country-wide optical fiber grid for general use
Disadvantages of the Sensor Network Approach
Enormous data flows (parabolic focus replaced by digital sum)
Enormous central processors needed to handle data
Enormous Data Flows from Antenna Stations
LOFAR will have 46 remote stations and 64 stations in the central core
Each remote station transmits:
• 32,000 channels/ms in one beam, or 8 beams with 4,000 channels each
• 8+8 bit (or 16+16 bit) complex data
• 2 polarizations
→ 1-2 Gbps from each station (see the rate check below)
Each central core station transmits the same data rate in several independent sky directions (for the Epoch of Reionization experiment)
→ 110-320 Gbps input rates to the central processor
[Image: sample LOFAR station array and the antenna array for each station]
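A back-of-envelope check of the per-station rate quoted above, using only the numbers on this slide (Python):

```python
# One beam: 32,000 channels per millisecond, 2 polarizations,
# 8+8 bit complex samples (16+16 bit doubles the rate).
channels_per_ms = 32_000
polarizations = 2
bits_per_sample = 2 * 8          # use 2 * 16 for the 16+16 bit case

bits_per_second = channels_per_ms * 1_000 * polarizations * bits_per_sample
print(f"{bits_per_second / 1e9:.2f} Gbps")   # ~1.0 Gbps (8+8), ~2.0 Gbps (16+16)
```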
Enormous Central Processors Needed
To make a single, highly-sensitive sky beam, the central processor does a
weighted sum of all the station data (faint source detection, pulsars, ...)
To make a high-resolution map with a field of view equal to that of a single
station, the central processor cross-correlates all station data.
Cross-correlation of 110 LOFAR station inputs requires
• 110·109/2 station pairs and 4 polarization product pairs for each of the 32,000 frequency channels every millisecond
  = 0.8 Tera (10^12) complex products/second
For Epoch of Reionization maps using the LOFAR central core:
• 64·63/2 pairs × 4 pol. products × 32,000 ch./ms × 5 beams
  = 1.3 Tera complex products/second
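A toy NumPy sketch of the cross-correlation step (an illustration of the operation being counted, not the actual LOFAR correlator; the array sizes are deliberately small):

```python
import numpy as np

n_stations, n_channels = 8, 16   # toy sizes; LOFAR: 110 stations, 32,000 channels
x = (np.random.randn(n_stations, n_channels)
     + 1j * np.random.randn(n_stations, n_channels))   # complex station voltages

# Visibility per channel: V[i, j, ch] = x[i, ch] * conj(x[j, ch])
V = x[:, None, :] * np.conj(x[None, :, :])

# Only pairs with i < j are independent: n(n-1)/2 of them, which is where
# the 110*109/2 = 5995 station pairs per polarization product come from.
i, j = np.triu_indices(n_stations, k=1)
print(V[i, j].shape)             # (n_pairs, n_channels)
```

Repeating this every millisecond, for 4 polarization products over 32,000 channels, gives the 0.8 Tera complex products/second quoted above.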
LOFAR and LOIS: A Radical New Frequency
Will observe at 10-250 MHz, which is good because:
• neutral hydrogen is cosmologically redshifted (z~10)
• pulsars are bright
• galactic cosmic rays radiate
• radar echoes off the solar wind
• ...
but is bad because of:
• terrestrial emission from FM radio, TV, aircraft, satellites, ...
• ionospheric variations from solar irradiation and wind irregularities
→ Need streaming noise mitigation for each antenna, ionospheric phase corrections for each antenna station, high bit counts for data sampling, and a flexible programming environment (Linux).
[Images: black hole + jet of 3C236, the largest known radio galaxy (Schilizzi et al. 2001); quasar (z=3.2) lensed by a foreground galaxy (Koopmans et al. 2002)]
BlueGene/L for LOFAR
These requirements of LOFAR can be matched by BlueGene/L, a massively
parallel computer designed by IBM in collaboration with the Lawrence
Livermore National Laboratory.
BlueGene/L will also provide a top-ranked supercomputer facility in the
Netherlands, useful for a wide range of problems.
BlueGene/L: History
November 2001, partnership with LLNL
November 2002, acquisition of BG/L by LLNL announced.
November 2003, 1/2-rack achieves 1.4 TFlop/s on Linpack
• #73 on the TOP500
February 2004, IBM and ASTRON announce joint research into high data
volume supercomputing using BG/L
May 2004, Argonne National Laboratory announces plans to acquire BG/L
June 2004, 4-rack prototype achieves 11.68 TFlop/s on Linpack
July 2004, AIST (Japan) announces use of BG/L for drug design
September 2004, 8 racks achieve 36 TFlop/s on Linpack, surpassing NEC's Earth Simulator to become the world's fastest computer
NEC Earth Simulator vs 64-rack BG/L at LLNL

                 Earth Simulator    64-rack BG/L
Peak speed       40 TFlop/s         360 TFlop/s
Memory           10 TB              32 TB
Nodes            640                65,536
Clock            500 MHz            700 MHz
Power            5 MW               1.5 MW
Floor space      34,000 sq ft       2,500 sq ft
Cost             $350 M             $100 M

→ 64 racks of BG/L will have much more for much less
BlueGene/L: "System on a Chip"
• IBM CU-11, 0.13 µm process
• 11 x 11 mm die size
• 25 x 32 mm CBGA package
• 474 pins, 328 signal
• 1.5/2.5 Volt
• 2 CPUs (4 FPUs) + memory + connection controls in each chip
• 2 chips per card; 0.5 GB RAM per node (64 MB DRAM chips, with ECC); heatsinks
• 16 cards per tray, plus 4 IO nodes and four Gbps Ethernet connectors
[Image: compute card and tray, showing processor nodes and IO nodes]
512-Way BG/L Prototype
• 16 trays per half-rack = 512 nodes + 64 IO nodes
• midplane = half rack = full torus
• 2 half-racks per rack, plus cables, power, cooling, etc.
• cables for the torus run in 3 dimensions; 6 ASTRON racks have ~1 km of cables, each 1/2" thick
• 8 racks, 36 TFlop/s: #1 in the world (Rochester, MN, September 2004)
Hardware Summary
Two CPU cores per chip (used independently or as a communication/computation pair)
Each CPU can do 4 floating point ops (2 multiply/adds) per clock cycle
• "double hummer" with quad-word loads
• a double precision complex product takes 2 clock cycles per CPU, 1 cycle per chip
Frequency is 700 MHz
• peak performance is 5.6 GFlops per chip, or 700 M c.p./s (using both CPUs)
On-chip memory: L1 (32 kB/CPU), L2 (prefetch), L3 (4 MB/chip)
2 chips/card, 16 cards/tray, 16 trays/midplane, 2 midplanes/rack
• 1024 chips/rack
• = 720 Giga complex products/second per rack (peak, using both CPUs/chip)
  – tests with streaming data achieve 50%-90% efficiency for complex products
• peak rate for LOFAR (6 racks) = 4.3 Tera complex products/s (arithmetic below)
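The peak-rate arithmetic on this slide, spelled out in Python (numbers from the slide; a complex product is counted as 8 floating point ops, 4 multiplies + 4 adds):

```python
clock_hz = 700e6                 # 700 MHz
flops_per_cpu_per_cycle = 4      # "double hummer": 2 fused multiply/adds
cpus_per_chip = 2
flops_per_complex_product = 8    # 4 multiplies + 4 adds

chip_flops = clock_hz * flops_per_cpu_per_cycle * cpus_per_chip
print(chip_flops / 1e9)          # 5.6 GFlops per chip
print(chip_flops / flops_per_complex_product / 1e6)   # 700 M c.p./s per chip

chips_per_rack, racks = 1024, 6
rack_rate = chip_flops / flops_per_complex_product * chips_per_rack
print(rack_rate / 1e9)           # ~717 G c.p./s per rack (slide: 720 G)
print(rack_rate * racks / 1e12)  # ~4.3 T c.p./s for the 6 LOFAR racks
```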
Chips are connected by a 3-dimensional torus network
• a torus has independent connections in 3 dimensions
• the torus bandwidth is 1.4 Gbps each way in all 3 dimensions simultaneously
• wiring actually jumps over the physical neighbors to prevent large timing mismatches between distant edges (hop distances sketched below)
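A small sketch of why torus hop counts stay low, computing the worst-case and average per-axis distances used later for the all-to-all estimate (Python; the 8-node axis matches an 8x8x8 midplane):

```python
def torus_hops(n):
    """Hop distance from node 0 to every node on a ring of n nodes
    (a torus wraps around, so packets can travel either way)."""
    return [min(d, n - d) for d in range(n)]

hops = torus_hops(8)             # one axis of an 8x8x8 midplane
print(max(hops))                 # 4  -> longest hop per axis
print(sum(hops) / len(hops))     # 2.0 -> average hops per axis
```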
IO Network
4 IO connections/tray, 128/rack, each 1 Gbps bi-directional
• peak IO for LOFAR (6 racks) = 768 Gbps
Each IO feed goes to an IO node, which is connected to 8 compute nodes by a hierarchical tree that pipes data at 2.8 Gbps bi-directional
→ the tree also does global communication & combine/broadcast for all nodes
Two other internal networks
Barrier network to all nodes: allows programs to stay in synch
Control network to all nodes: boot, monitor, partition
• partitions are set up in software using "linkchips"
• the smallest torus partition is a midplane (512 nodes, half rack)
• may partition the system into tori as midplanes, racks, or multiple racks
• or may partition into smaller units (32-node trays) which connect nodes as a mesh
• each partition runs a different job
LOFAR EXAMPLE: Cross Correlation of Station Beams
64 virtual core stations and 46 remote stations transmit data at ~220 Gbps
32,000 frequency channels are divided into 8 parts, each a separate sky beam
Each beam of data goes to a separate midplane:
• 220/8 = 27.5 Gbps input per midplane; 64 Gbps of Ethernet IO available ✓
Midplane processors must do
• 110·109/2 station products × 4,000 ch. per ms × 4 pol. products
  = 96 Giga complex products/second,
  compared to a peak of 360 Giga c.p./s per midplane in virtual node mode ✓ (checked below)
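The per-midplane load, checked with the slide's numbers (Python):

```python
stations = 110
pairs = stations * (stations - 1) // 2   # 5995 station pairs
channels_per_beam_per_ms = 4_000         # 32,000 channels / 8 beams
pol_products = 4

cp_per_s = pairs * channels_per_beam_per_ms * 1_000 * pol_products
print(cp_per_s / 1e9)    # ~96 Giga complex products/second
# vs. midplane peak: 512 chips * 0.7 G c.p./s ~ 360 G c.p./s
```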
LOFAR EXAMPLE: Internal Data Flow
One beam of data per midplane means:
• 4,000 channels, 2 polarizations, 1000 ms time integration, 4-byte data (16+16)
• = 62.5 kB of data into each of the 512 nodes in a midplane each second
• (NOTE: 4 MB of storage is available in the L3 cache per node ✓)
MPI_Alltoall redistributes the data so each node has some of the channels in both polarizations from all telescopes (see the sketch below):
• the midplane has 8x8x8 nodes; the longest hop on the torus is 4 nodes on each axis, the average is 2 nodes per axis (x, y, z)
• hopping speed = 1.4 Gbps = 175 MB/s each direction, 350 MB/s each axis
• all-to-all time is 62.5 kB × 2 hops / 350 MB/s = 3.6E-4 seconds ✓
  – (NOTE: plenty of time for computations → virtual node mode OK)
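A minimal sketch of the redistribution step with mpi4py (an illustration under toy sizes; the real LOFAR data layout is not detailed on these slides). Run with, e.g., `mpirun -n 4 python alltoall.py`:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()

# Each rank starts with a block of data destined for every other rank
# (a toy stand-in for "some channels, both polarizations, from all stations").
block = 16                                       # complex samples per destination
sendbuf = np.full((size, block), rank, dtype=np.complex64)
recvbuf = np.empty_like(sendbuf)

comm.Alltoall(sendbuf, recvbuf)   # every node exchanges a block with every node
# recvbuf[r] now holds the block contributed by rank r
```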
LOIS Example: 15 antennae with IBM JS20
The prototype LOIS will have 10-15 antennae and 50,000 frequency channels
Calculations will be performed on an IBM JS20 BladeCenter:
– 28 blades, 56 CPUs @ 1.6 GHz, 2 FPUs/CPU, 2 L&S/CPU
– each FPU does a complex product (2L, 4M, 2A, 2S) in 4 clock cycles
– each CPU does a complex product in 2 clock cycles, each blade in 1 cycle
– 28 blades do complex products at a peak rate of 44.8 Giga c.p./s
– the required rate is 15·14/2 × 50,000/ms = 5.3 G c.p./s per EM component, × 9 for the 3x3 matrix products of the electric vectors
– the IO rate from the antennae is 15 Gbps; the total capacity of the JS20 is 16 Gbps
• with inefficiencies in computation and IO, the initial configuration is ~1/2 the full requirement (sizing arithmetic below)
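The JS20 sizing arithmetic from this slide, in Python:

```python
antennae = 15
pairs = antennae * (antennae - 1) // 2    # 105 antenna pairs
channels_per_ms = 50_000

required = pairs * channels_per_ms * 1_000   # per EM component
print(required / 1e9)        # ~5.3 G c.p./s
print(required * 9 / 1e9)    # ~47 G c.p./s with the 3x3 electric-vector products

cpus, clock_hz, cycles_per_cp = 56, 1.6e9, 2
print(cpus * clock_hz / cycles_per_cp / 1e9)   # 44.8 G c.p./s peak for 28 blades
```

Since ~47 G c.p./s exceeds the 44.8 G peak even before inefficiencies, the initial configuration covers roughly half the full requirement, as the slide states.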
Some External Applications on BG/L
June 2004 - Robert Walkup: [email protected]
Gyan Bhanot: [email protected]
Enzo (San Diego)
Flash (ANL)
CTH (Sandia)
MM5
Amber7
GAMESS
QMC (Caltech)
LJ (Caltech)
PolyCrystal (Caltech)
PMEMD (LBL)
Miranda (LLNL)
SEAM (NCAR)
QCD (IBM)
SAGE (LANL)
SPPM (LLNL)
UMT2K (LLNL)
Sweep3d (LANL)
MDCASK (LLNL)
GP (LLNL)
CPMD (IBM)
TLBE (LBL)
HPCMW (RIST)
DD3d (LLNL)
SPHOT (LLNL)
LSMS (ORNL)
NIWS (NISSEI)
Limitations to Applications
0.5 GB/node (on the card), 4 MB DRAM (on chip)/node, 32 kB/CPU L1 cache
• cannot pass all data to each node; no shared memory
• quad loads required for streaming float operations; data must be 16-byte aligned
• MPI is the most effective way to program
UNIX runs only on the IO nodes, not on the compute nodes:
– no UNIX IPC, mmap, forks, sockets, OpenMP, shell creation, thread creation, UNIX system calls, etc. on a compute node
Existing MPI codes port easily, but may need changes to use the double hummer
Off-the-shelf MPI code runs ~5x slower per processor on BG/L than on a 1.7 GHz Power4:
• 1.7 GHz × 4 Flops/cycle (2 FPUs/CPU PowerPC) = 6.8 GFlops peak/CPU
• vs. 0.7 GHz × 2 Flops/cycle (co-processor mode, without the double hummer) = 1.4 GFlops/CPU
But the power is low: 64 W/CPU for JS20 Power4 blades, 12 W/CPU for BG/L
The main BlueGene advantage is packaging: high CPU density and connectivity
Summary
BlueGene/L can handle LOFAR station data rates and complex-product rates
configuration can be changed with software commands
handles data with high bit counts
also usable as a general-purpose computer (~20 TFlop/s sustained)
• good for Dutch infrastructure, an attraction for industry partners, science, ...
BlueGene/L Innovations:
– system-on-a-chip design (2 processors, all networks, memory)
– 4 independent networks: tree, torus, barrier, control
– variable torus size (controlled by software using link chips)
– moderate clock speed (700 MHz)
  → good for RAM reads and for low power consumption (25 kW/rack)
– Linux kernel on the IO nodes