The BlueGene Supercomputer and LOFAR/LOIS

Bruce Elmegreen
IBM Watson Research Center
914 945 2448, [email protected]

LOFAR = Low Frequency Array
LOIS = LOFAR Outrigger in Sweden

[Figures: BlueGene/L at LLNL; LOIS; LOFAR]


Current Radio Telescope Design
- parabolic receivers (Onsala Space Observatory)
- connected element interferometers (Westerbork, Netherlands)
- very long baseline interferometers (EVN)


LOFAR and LOIS: A Radical New Design (Radio Telescopes as Sensor Networks)
- tens of thousands of simple, fixed antennae
- points to the sky electronically
[Figures: sample design for LOFAR; sample antenna for LOIS]


Current way to focus
[Figure: the 76 m Lovell telescope at Manchester, UK]

The LOFAR/LOIS way to focus
- electronically add the antenna signals, with time delays chosen to reinforce the desired direction (see the delay-and-sum sketch below)
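To make the electronic focusing concrete, here is a minimal delay-and-sum sketch in Python/NumPy. The array geometry, sample rate, and test tone are illustrative assumptions, not LOFAR or LOIS parameters.

    import numpy as np

    # Minimal delay-and-sum beamformer (illustrative parameters, not LOFAR's).
    # Antennae lie on a line along x; a plane wave arrives from angle theta,
    # measured from zenith. Steering toward theta means delaying each antenna's
    # stream by the extra travel time of the wavefront to that antenna.

    C = 3.0e8        # speed of light, m/s
    FS = 80e6        # sample rate, Hz (assumed)
    N_ANT = 16       # number of antennae (assumed)
    SPACING = 5.0    # antenna spacing, m (assumed)

    def steer_delays(theta_rad):
        """Geometric delay, in samples, for each antenna toward theta."""
        x = np.arange(N_ANT) * SPACING
        return x * np.sin(theta_rad) / C * FS

    def delay_and_sum(streams, theta_rad):
        """Sum antenna streams with integer-sample delays that reinforce theta."""
        delays = np.round(steer_delays(theta_rad)).astype(int)
        delays -= delays.min()                 # make all delays non-negative
        n = streams.shape[1] - delays.max()
        out = np.zeros(n)
        for ant, d in enumerate(delays):
            out += streams[ant, d:d + n]       # shift each stream, then add
        return out / N_ANT

    # Demo: a 10 MHz tone arriving from 30 degrees adds coherently when the
    # beam is steered to 30 degrees, and incoherently elsewhere.
    theta_src = np.radians(30.0)
    t = np.arange(4096) / FS
    true_delays = steer_delays(theta_src) / FS       # seconds
    streams = np.stack([np.sin(2 * np.pi * 10e6 * (t - d)) for d in true_delays])

    print(np.mean(delay_and_sum(streams, theta_src) ** 2))          # ~0.5, coherent
    print(np.mean(delay_and_sum(streams, np.radians(-30.0)) ** 2))  # << 0.5

A real system forms many such sums at once, which is how one array can observe several directions simultaneously.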
Advantages of the Sensor Network Approach
- antennae are cheap (no precision surfaces needed)
- no moving parts (no expensive mounts, no pointing problems)
- low profile (no radome needed, no wind/weather problems)
- can change pointing directions quickly
- can observe multiple directions at the same time (i.e., with several sums)
- can extend for large distances by placing sensors anywhere
- provides a country-wide optical fiber grid for general use


Disadvantages of the Sensor Network Approach
- enormous data flows (the parabolic focus is replaced by a digital sum)
- enormous central processors needed to handle the data


Enormous Data Flows from Antenna Stations
- LOFAR will have 46 remote stations and 64 stations in the central core
- each remote station transmits:
  - 32,000 channels/ms in one beam, or 8 beams of 4,000 channels each
  - 8+8 bit (or 16+16 bit) complex data
  - 2 polarizations
  → 1-2 Gbps from each station
- each central core station transmits the same data rate in several independent sky directions (for the epoch of recombination experiment)
→ 110-320 Gbps input rate to the central processor
[Figure: sample LOFAR station array, and the antenna array of each station]


The Enormous Central Processors Needed
- to make a single, highly sensitive sky beam, the central processor does a weighted sum of all the station data (faint-source detection, pulsars, ...)
- to make a high-resolution map with a field of view equal to that of a single station, the central processor cross-correlates all station data
- cross correlation of 110 LOFAR station inputs requires:
  - 110*109/2 station pairs, with 4 polarization products each, for each of the 32,000 frequency channels every millisecond
  - = 0.8 Tera (10^12) complex products/second
- for epoch of recombination maps using the LOFAR central core:
  - 64*63/2 pairs * 4 pol. products * 32,000 ch./ms * 5 beams = 1.3 Tera complex products/second
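These product rates follow directly from the pair counts; a few lines of Python check the arithmetic, using only numbers quoted above:

    # Back-of-envelope check of the correlator rates quoted above.
    N_POL_PRODUCTS = 4  # polarization products per station pair

    def correlator_rate(n_stations, channels_per_ms, n_beams=1):
        """Complex products per second for a full cross-correlation."""
        pairs = n_stations * (n_stations - 1) // 2
        return pairs * N_POL_PRODUCTS * channels_per_ms * n_beams * 1000

    # Full array: 110 stations, 32,000 channels every millisecond.
    print(correlator_rate(110, 32_000) / 1e12)              # ~0.77 T c.p./s
    # Central core: 64 stations, 32,000 channels, 5 simultaneous beams.
    print(correlator_rate(64, 32_000, n_beams=5) / 1e12)    # ~1.29 T c.p./s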
LOFAR and LOIS: A Radical New Frequency
- will observe at 10-250 MHz, which is good because:
  - neutral hydrogen is cosmologically redshifted into this band (z ~ 10)
  - pulsars are bright
  - galactic cosmic rays radiate
  - radar echoes off the solar wind
  - ...
- ... but is bad because of:
  - terrestrial emission from FM radio, TV, aircraft, satellites, ...
  - ionospheric variations from solar irradiation and wind irregularities
→ Need streaming noise mitigation for each antenna, ionospheric phase corrections for each antenna station, a high bit count for data sampling, and a flexible programming environment (LINUX).
[Figures: black hole + jet of 3C236, the largest known radio galaxy (Schilizzi et al. 2001); quasar at z=3.2 lensed by a foreground galaxy (Koopmans et al. 2002)]


BlueGene/L for LOFAR
These requirements of LOFAR can be met by BlueGene/L, a massively parallel computer designed by IBM in collaboration with the Lawrence Livermore National Laboratory. BlueGene/L will also provide a top-ranked supercomputer facility in the Netherlands, useful for a wide range of problems.


BlueGene/L: History
- November 2001: partnership with LLNL
- November 2002: acquisition of BG/L by LLNL announced
- November 2003: 1/2 rack achieves 1.4 TFlop/s on Linpack (#73 on the TOP500)
- February 2004: IBM and ASTRON announce joint research into high-data-volume supercomputing using BG/L
- May 2004: Argonne National Laboratory announces a plan to acquire a BG/L
- June 2004: 4-rack prototype achieves 11.68 TFlop/s on Linpack
- July 2004: AIST (Japan) announces use of BG/L for drug design
- September 2004: 8 racks achieve 36 TFlop/s on Linpack, surpassing NEC's Earth Simulator to become the world's fastest computer


NEC Earth Simulator vs. 64-rack BG/L at LLNL

                 Earth Simulator   64-rack BG/L
  Peak speed     40 Tf/s           360 Tf/s
  Memory         10 TBytes         32 TBytes
  Nodes          640               65,536
  Clock          500 MHz           700 MHz
  Power          5 MWatts          1.5 MWatts
  Floor space    34,000 sq ft      2,500 sq ft
  Cost           $350 M            $100 M

64 racks of BG/L will deliver much more for much less.


BlueGene/L: "System on a Chip"
- IBM CU-11 process, 0.13 µm
- 11 x 11 mm die, 25 x 32 mm CBGA package, 474 pins (328 signal), 1.5/2.5 Volt
- 2 CPUs (4 FPUs) + memory + connection controls on each chip
- 2 chips per card, with heatsinks and 64 MBy DRAM chips (0.5 GB RAM/node, ECC)
- 16 cards per tray, plus 4 IO nodes with four Gbps ethernet connectors
[Figure labels: processor nodes; IO nodes]


512-Way BG/L Prototype
- 16 trays per half-rack = 512 compute nodes + 64 IO nodes
- one midplane = half rack = a full torus
- cables carry the torus in 3 dimensions; the 6 ASTRON racks have ~1 km of cables, each 1/2" thick
- 2 half-racks per rack, plus cables, power, cooling, etc.
[Figure: the 8-rack, 36 Tf system, #1 in the world; Rochester, MN, September 2004]


Hardware Summary
- two CPU cores per chip (used independently, or as a communication/computation pair)
- each CPU can do 4 floating-point ops (2 multiply-adds) per clock cycle
  - "double hummer" FPUs with quad-word loads
  - a double-precision complex product takes 2 clock cycles per CPU, 1 cycle per chip
- the clock frequency is 700 MHz
  - peak performance is 5.6 GFlop/s per chip, or 700 M complex products/s (using both CPUs)
- on-chip memory: L1 (4 kB/FPU), L2 (prefetch), L3 (4 MB/chip)
- 2 chips/card, 16 cards/tray, 16 trays/midplane, 2 midplanes/rack
  - 1,024 chips/rack = 720 Giga complex products/second peak (using both CPUs per chip; see the arithmetic below)
  - tests with streaming data achieve 50%-90% efficiency for complex products
  - peak rate for LOFAR (6 racks) = 4.3 Tera complex products/s
- chips are connected by a 3-dimensional torus network
  - a torus has independent connections in 3 dimensions
  - torus bandwidth is 1.4 Gbps each way, in all 3 dimensions simultaneously
  - the wiring actually jumps over physical neighbors to prevent large timing mismatches between distant edges
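The peak complex-product rates quoted above follow from the clock speed and the packaging; a short Python check using this slide's numbers:

    # Peak complex-product (c.p.) rates for BG/L, from the slide's figures.
    CLOCK_HZ = 700e6                    # BG/L clock frequency
    CP_CYCLES_PER_CHIP = 1              # 2 cycles/CPU, 1 cycle/chip with 2 CPUs
    CHIPS_PER_RACK = 2 * 16 * 16 * 2    # chips/card * cards/tray * trays/midplane * midplanes/rack

    cp_per_chip = CLOCK_HZ / CP_CYCLES_PER_CHIP    # 700 M c.p./s per chip
    cp_per_rack = cp_per_chip * CHIPS_PER_RACK     # ~717 G c.p./s, quoted as 720 G
    cp_lofar = cp_per_rack * 6                     # 6-rack LOFAR machine: ~4.3 T c.p./s
    print(cp_per_chip / 1e6, cp_per_rack / 1e9, cp_lofar / 1e12)

    # Against the requirements from earlier: 0.8 T c.p./s for the full array and
    # 1.3 T c.p./s for the 5-beam core map each fit within the 4.3 T peak, even
    # at the 50% streaming efficiency measured in tests (~2.15 T sustained).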
IO Network
- 4 IO connections per tray, 128 per rack, each 1 Gbps bi-directional
  - peak IO for LOFAR (6 racks) = 768 Gbps
- each IO feed goes to an IO node, which is connected to 8 compute nodes by a hierarchical tree that pipes data at 2.8 Gbps bi-directional
- the tree also does global communication and combine/broadcast for all nodes


Two Other Internal Networks
- a barrier network to all nodes: allows programs to stay in synch
- a control network to all nodes: boot, monitor, partition
  - partitions are set up in software using "link chips"
  - the smallest torus partition is a midplane (512 nodes, half a rack)
  - the system may be partitioned into tori as midplanes, racks, or multiple racks
  - or into smaller units (32-node trays), which connect their nodes as a mesh
  - each partition runs a different job


LOFAR EXAMPLE: Cross Correlation of Station Beams
- 64 virtual core stations and 45 remote stations transmit data at ~220 Gbps
- the 32,000 frequency channels are divided into 8 parts, each a separate sky beam
- each beam of data goes to a separate midplane:
  - 220/8 = 27.5 Gbps input per midplane, with 64 Gbps of ethernet IO available ✓
- midplane processors must do
  - 110*109/2 station products * 4,000 ch./ms * 4 pol. products = 96 Giga complex products/second,
  - compared to a peak of 360 Giga complex products/s per midplane in virtual node mode ✓


LOFAR EXAMPLE: Internal Data Flow
- one beam of data per midplane means:
  - 4,000 channels, 2 polarizations, 1000 ms time integration, 4-Byte (16+16 bit) data
  - = 62.5 kBytes of data into each of the 512 nodes in a midplane each second
  - (NOTE: 4 MBy of storage is available in the L3 cache of each node ✓)
- MPI_Alltoall redistributes the data so that each node has some of the channels, in both polarizations, from all telescopes (see the timing sketch below)
  - a midplane has 8x8x8 nodes; the longest hop on the torus is 4 nodes on each axis, and the average is 2 nodes per axis (x, y, z)
  - hopping speed = 1.4 Gbps = 175 MBps in each direction, 350 MBps per axis
  - the all-to-all time is 62.5 kB * 2 hops / 350 MBps = 3.6E-4 seconds ✓
  - (NOTE: this leaves plenty of time for computations → virtual node mode is OK)
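The all-to-all estimate is again simple arithmetic; a short Python check using the slide's own figures:

    # Reproduce the MPI_Alltoall timing estimate for one midplane.
    BYTES_PER_NODE = 62.5e3     # data per node per second of integration
    AVG_HOPS = 2                # 8x8x8 torus: average hop distance per axis
    LINK_BPS = 175e6            # 1.4 Gbps torus link = 175 MBytes/s each way
    AXIS_BPS = 2 * LINK_BPS     # both directions of an axis usable: 350 MBytes/s

    t_alltoall = BYTES_PER_NODE * AVG_HOPS / AXIS_BPS
    print(t_alltoall)           # ~3.6e-4 s, against 1 s of data to exchange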
LOIS Example: 15 Antennae with an IBM JS20
- the prototype LOIS will have 10-15 antennae and 50,000 frequency channels
- calculations will be performed on an IBM JS20 Blade Center:
  - 28 blades, 56 CPUs @ 1.6 GHz, 2 FPUs/CPU, 2 load/store units per CPU
  - each FPU does a complex product (2L, 4M, 2A, 2S) in 4 clock cycles
  - each CPU does a complex product in 2 clock cycles, each blade in 1 cycle
  - 28 blades do complex products at a peak rate of 44.8 Giga c.p./s
  - the required rate is 15*14/2 * 50,000/ms = 5.3 Giga c.p./s per EM component, x 9 for the 3x3 matrix products of the electric vectors
  - the IO rate from the antennae is 15 Gbps; the total capacity of the JS20 is 16 Gbps
- with inefficiencies in computation and IO, the initial configuration meets ~1/2 of the full requirement


Some External Applications on BG/L (June 2004)
Contacts: Robert Walkup ([email protected]), Gyan Bhanot ([email protected])
Enzo (San Diego), Flash (ANL), CTH (Sandia), MM5, Amber7, GAMESS, QMC (Caltech), LJ (Caltech), PolyCrystal (Caltech), PMEMD (LBL), Miranda (LLNL), SEAM (NCAR), QCD (IBM), SAGE (LANL), SPPM (LLNL), UMT2K (LLNL), Sweep3d (LANL), MDCASK (LLNL), GP (LLNL), CPMD (IBM), TLBE (LBL), HPCMW (RIST), DD3d (LLNL), SPHOT (LLNL), LSMS (ORNL), NIWS (NISSEI)


Limitations to Applications
- 0.5 GB per node on the card, 4 MB DRAM per node on the chip, 32 kB/CPU L1 cache
  - cannot pass all data to each node; there is no shared memory
  - quad-word loads are required for streaming float operations, with 16-Byte-aligned data
  - MPI is the most effective way to program
- UNIX runs only on the IO nodes, not on the compute nodes:
  - no UNIX IPC, mmap, forks, sockets, OpenMP, shell creation, thread creation, UNIX system calls, etc. on a compute node
- existing MPI codes port easily, but may need changes to use the double hummer
- off-the-shelf MPI code runs ~5x slower per processor on BG/L than on a 1.7 GHz Power4:
  - 1.7 GHz * 4 Flops/cycle (2 FPUs/CPU PowerPC) = 6.8 GFlop/s peak per CPU
  - vs. 0.7 GHz * 2 Flops/cycle (coprocessor mode, without the double hummer) = 1.4 GFlop/s per CPU
- but the power is low: 64 W/CPU for JS20 Power4 blades vs. 12 W/CPU for BG/L
- the main BlueGene advantage is packaging: high CPU density and connectivity


Summary
- BlueGene/L can handle LOFAR's station data rates and complex-product rates
- the configuration can be changed with software commands
- it handles data with high bit counts
- it is also usable as a general-purpose computer (~20 Tf sustained)
  - good for Dutch infrastructure, and an attraction for industry partners, science, ...
- BlueGene/L innovations:
  - system-on-a-chip design (2 processors, all networks, memory)
  - 4 independent networks: tree, torus, barrier, control
  - variable torus size (controlled by software using link chips)
  - moderate clock speed (700 MHz): good for RAM reads and for low power consumption (25 kW/rack)
  - LINUX kernel on the IO nodes