Status of the CSCS Compute Infrastructure
R. Alexander / D. Ulmer, User Assembly, May 22nd 2006

- Offer CSCS users true capability computing resources that scale to large
  computational experiments
- The resources form a portfolio of systems with complementary characteristics,
  supporting different kinds of workloads and their requirements
- At the same time, the portfolio must be as small as possible, to minimize
  operational costs and maximize science output
- The different compute resources must be integrated: network, storage, security
Outline:
- Storage Farm
- Internal Network Upgrade
- Horizon Status and Plans
- Terrane Plans and Configuration
- Zenith
- Summary
Storage farm:
- Fibre Channel as the primary technology
- Ease of expanding and adjusting individual systems' storage requirements
- Leveraging Terrane hardware
- Three 32-port 4 Gbit FC switches
- The switches will be trunked together
Internal network upgrade:
- Two Cisco 6509 blade-based switches
- Leverage Terrane equipment
- Create internal 10 Gbit trunks
  - Between the 6509s
  - To the firewall and external network when needed
- 10 Gbit among all primary systems (supercomputers, frontends, HSM)
Horizon:
- Installed in summer 2005, in production since January 2006
- Consists of three Cray XT3 massively parallel systems:
  - Palu: production system with 1,100 compute nodes, each with a 2.6 GHz
    single-core AMD Opteron and 2 GB memory per CPU (rough capacity check below)
  - Gele: porting and testing system with 84 compute nodes with the same
    characteristics as Palu
  - Fred: internal CSCS HW and SW test system with 32 compute nodes as above
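
The slides do not quote aggregate figures for Palu, but a rough back-of-envelope sketch follows from the numbers above, assuming (my assumption, not stated on the slides) that each single-core Opteron retires at most 2 double-precision flops per clock:

    # Rough capacity sketch for Palu, from the figures above.
    nodes = 1100
    clock_ghz = 2.6
    flops_per_clock = 2          # assumed Opteron peak rate, not from the slides
    mem_per_node_gb = 2

    peak_tflops = nodes * clock_ghz * flops_per_clock / 1000
    total_mem_tb = nodes * mem_per_node_gb / 1024
    print(f"~{peak_tflops:.1f} Tflops peak, ~{total_mem_tb:.1f} TB memory")
    # -> ~5.7 Tflops peak, ~2.1 TB memory
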
Palu configuration:
•  4 login nodes with DNS rotary name: palu.cscs.ch
•  2 yod/mom nodes
•  Scratch for general users: approximately 9 TB
   •  4 Lustre servers (1 MDS / 15 OSTs)
•  Scratch for PSI users: approximately 6.6 TB
   •  3 Lustre servers (1 MDS / 11 OSTs)
•  Each OST = 600 GB (consistency check below)
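
A quick consistency check of the two scratch sizes, assuming each OST provides roughly 600 GB of usable space:

    # Lustre scratch sizes from the OST counts above, at ~600 GB per OST.
    ost_tb = 0.6                       # 600 GB per OST
    general_scratch_tb = 15 * ost_tb   # 15 OSTs
    psi_scratch_tb = 11 * ost_tb       # 11 OSTs
    print(f"{general_scratch_tb:.1f} TB and {psi_scratch_tb:.1f} TB")
    # -> 9.0 TB and 6.6 TB, matching the quoted scratch sizes
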
Palu status:
•  Current availability > 90%
•  Current usage:
   •  > 90% node utilization
   •  Jobs using 64-512 nodes are "typical"
   •  768 nodes is the maximum job size at present
   •  Oversubscribed
•  Will be upgraded with 600 more processors (6 racks) in the summer
Two groups have told us they have done science they could not do before.
Horizon issues:
•  Cray is working on each of these bugs; they are extremely difficult to diagnose
•  Job start failures: one of our highest-priority bugs
•  Currently, nodes stay down until the next machine reboot
   •  Single-node reboot becomes available with release 1.4
•  DataDirect Networks (DDN) disc controllers have been unreliable
   •  Cray and DDN are replacing all controllers with newly manufactured ones
   •  CSCS will receive one controller pair from Engenio with the Cray extension
      during the summer
Horizon plans, firm:
•  Single compute node reboot
•  Dual-core service nodes with the upgrade of the Palu system
Potentially:
•  Dual-core CPUs on Gele for performance testing
•  Test of Linux on the compute nodes on Fred
Terrane:
- Targeted at capacity/capability problems that mostly run within a single node
- To be installed in summer 2006
- Multiprocessor nodes with SMP capability, 4.5 Tflops aggregate
- Full-scale operating system on all node types
- Support for commercial HPC codes
Solution after a public call for tender: an IBM Power-5 (!) Infiniband cluster
- 48 p575 compute nodes, each with 16 CPUs
- p550+: 8 I/O, 2 login and 3 auxiliary nodes, each with 4 CPUs
- 4X Infiniband (96-port Cisco Topspin switch)
- Gigabit Ethernet interconnect (2x Cisco 6509 blade-based switch)
- 44 TB of external FC disc storage and 4 Gbit FC switches
- Water-cooled compute racks
•  48 P5 575 compute nodes
   •  16-way SMP with SMT = effectively 32-way
   •  47 with 32 GB main memory
   •  1 with 64 GB main memory
   •  1.5 GHz clock, 96 Gflops per node (peak-rate check below)
   •  4X dual-port Infiniband adapter = 10 Gbit/s per port
SMT means Simultaneous Multi-Threading
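
The quoted node and aggregate peaks follow from a standard peak-rate rule, assuming 4 double-precision flops per POWER5 core per cycle (two fused multiply-adds; this factor is my assumption, not stated on the slides):

    # Peak-rate check for the P5 575 compute nodes above.
    cores_per_node = 16
    clock_ghz = 1.5
    flops_per_cycle = 4      # assumed POWER5 rate: two FMAs per cycle
    nodes = 48

    node_gflops = cores_per_node * clock_ghz * flops_per_cycle
    aggregate_tflops = node_gflops * nodes / 1000
    print(f"{node_gflops:.0f} Gflops/node, {aggregate_tflops:.2f} Tflops aggregate")
    # -> 96 Gflops/node, 4.61 Tflops aggregate (quoted earlier as ~4.5 Tflops)
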
•  6 P5+ 550 I/O nodes
   •  4-way SMP with SMT = effectively 8-way
   •  16 GB main memory
   •  1 remote I/O drawer each
   •  1.9 GHz clock
   •  4 have 6 x 2 Gbit FC cards
   •  2 have 2 x 4 Gbit dual-channel FC cards (nominal bandwidth tally below)
   •  1 GX-bus dual-port 4X Infiniband
   •  1 10 Gbit Ethernet card
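
A rough tally of the nominal Fibre Channel link rate available through these six I/O nodes, assuming "dual-channel" means two 4 Gbit channels per card (these are signalling rates, not sustained throughput):

    # Nominal FC link-rate tally for the six I/O nodes above.
    group_a_gbit = 4 * 6 * 2        # 4 nodes x 6 cards x 2 Gbit        = 48 Gbit/s
    group_b_gbit = 2 * 2 * 2 * 4    # 2 nodes x 2 cards x 2 ch x 4 Gbit = 32 Gbit/s
    total_gbit = group_a_gbit + group_b_gbit
    print(f"{total_gbit} Gbit/s nominal, roughly {total_gbit / 8:.0f} GB/s")
    # -> 80 Gbit/s nominal, roughly 10 GB/s
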
•  2 P5+ 550 login nodes
   •  4-way SMP with SMT = effectively 8-way
   •  16 GB main memory
   •  1.9 GHz clock
   •  1 GX-bus dual-port 4X Infiniband
   •  1 10 Gbit Ethernet card
•  3 P5+ 550 auxiliary nodes
   •  4-way SMP with SMT = effectively 8-way
   •  8 GB main memory
   •  1.9 GHz clock
   •  1 GX-bus dual-port 4X Infiniband
Terrane software and support:
•  Linux on Power (SLES 9) or AIX 5.3
•  GPFS
•  Full IBM HPC SW stack (compilers, libraries, LoadLeveler batch system,
   cluster management SW)
•  3 years of HW and SW maintenance, with on-site same-business-day response
   if called before noon
Infiniband network:
•  Cisco Topspin 270 switch
•  96 4X ports
•  Great IP network
   •  GPFS transport within Terrane
   •  General IP traffic
•  To be tested for MPI
•  Linux and AIX as candidates
Terrane schedule:
- Installation from July to September (very early users, CSCS migration task force)
- Early access from September onwards
- Available for LUP call 1/07
Planned upgrades:
- 12X Infiniband by 4Q06 (dual-striped?, scaling to 128 CPUs); see the link-rate
  arithmetic below
- Extension with a Power-4 trade-in by 3Q06 (8 16-(32)-way p575+ nodes with
  32 GB RAM, 1 64-(128)-way p595+ node with 256 GB RAM)
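
For reference, the quoted link rates follow directly from the Infiniband lane counts, assuming single-data-rate (SDR) signalling at 2.5 Gbit/s per lane (with 8b/10b encoding, usable data rates are about 80% of these figures):

    # Infiniband signalling rates for the current 4X and planned 12X links.
    lane_gbit = 2.5               # assumed SDR signalling rate per lane
    link_4x = 4 * lane_gbit       # 10 Gbit/s, matching the current adapters
    link_12x = 12 * lane_gbit     # 30 Gbit/s for the planned 12X upgrade
    print(f"4X: {link_4x:.0f} Gbit/s, 12X: {link_12x:.0f} Gbit/s")
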
Zenith was characterised by (announcement on the CSCS website):
- Shared-memory capacity
- Strong single-processor and node performance
- Ease of use
With Terrane we got:
- Shared memory from 32 to 64 GB (base system), resp. 256 GB (trade-in of the SP4)
- Single-processor peak performance of up to 7.5 Gflops, and node peak
  performance of 96 to 121 Gflops (base) resp. 484 Gflops (trade-in); see the
  worked check below
- The well-known IBM user environment and reliability
The base characteristics are well covered.
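
These peak figures follow from the same 4-flops-per-cycle rule used above, assuming the trade-in p575+/p595+ nodes run at 1.9 GHz like the p5+ 550 nodes (both assumptions mine; small differences from the quoted numbers are presumably rounding on the slides):

    # Peak figures for the Terrane node types, at 4 flops per cycle per core.
    per_cpu_base = 1.5 * 4        # 6.0 Gflops  (1.5 GHz p575 base CPUs)
    per_cpu_tradein = 1.9 * 4     # 7.6 Gflops  (quoted as "up to 7.5")
    node_base = 16 * per_cpu_base         # 96 Gflops   (16-way p575, base)
    node_p575p = 16 * per_cpu_tradein     # ~122 Gflops (quoted as 121)
    node_p595p = 64 * per_cpu_tradein     # ~486 Gflops (quoted as 484)
    print(f"{node_base:.0f} / {node_p575p:.0f} / {node_p595p:.0f} Gflops per node")
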
Applications:
LM (weather forecast): shows very good scaling on the Cray XT3 and is used
operationally by the code author, the German weather service (DWD), on IBM Power-5
ECHAM-5 (climate): is being ported by the code author to the Cray XT3, with
extremely promising results (see HPCWire, December 05, and a later presentation by FB)
CP (Quantum Espresso suite, molecular dynamics): is being used very successfully
in production on the Cray XT3 and the IBM SP4
Transit (CFD): CSCS will port the code to a scalar architecture
Gaussian (chemistry): requires large shared memory, which is available on the
IBM Power-5, for which the code is supported
Summary:
CSCS can drop the Zenith project because of the (unexpected) solution for Terrane:
- All basic requirements are covered
- All CSCS applications are supported
- With Terrane we got a scalable, big 5 Tflops SMP system at the price of a
  loosely coupled cluster
An extremely economical solution for all needs below 64 processors per job.
Similar installations can be found at DWD, MPI Garching, ECMWF, EPCC/Daresbury,
Los Alamos, NERSC/Berkeley, Lawrence Livermore, ...
As of May 2006