SuperMUC @ Leibniz Supercomputing Centre

• Movie on YouTube
Peak Performance
• Peak performance: 3 PetaFlops = 3 * 10^15 Flops
• SI prefixes:
  • Mega  = 10^6   (million)
  • Giga  = 10^9   (billion)
  • Tera  = 10^12  (trillion)
  • Peta  = 10^15  (quadrillion)
  • Exa   = 10^18  (quintillion)
  • Zetta = 10^21  (sextillion)
• Flops: Floating Point Operations per Second
Distributed Memory Architecture
• 18 partitions called islands with 512 nodes each
• A node is a shared-memory system with 2 processors
  • Sandy Bridge-EP Intel Xeon E5-2680 8C
    – 2.7 GHz (Turbo 3.5 GHz)
  • 32 GByte memory
  • Infiniband network interface
• Processor has 8 cores
• 2-way hyperthreading
• 21.6 GFlops @ 2.7 GHz per core
• 172.8 GFlops per processor
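As a sanity check on these numbers (using the fact, not stated on the slide, that a Sandy Bridge core can execute 8 double-precision Flops per cycle with AVX):
  2.7 GHz * 8 Flops/cycle                      = 21.6 GFlops per core
  21.6 GFlops * 8 cores                        = 172.8 GFlops per processor
  172.8 GFlops * 2 processors * 18 * 512 nodes ≈ 3.2 PFlops system peak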
Sandy Bridge Processor
[Figure: core and cache hierarchy of the Sandy Bridge processor]
• 8 multithreaded cores
• L1 cache: 32 KB per core, latency 4 cycles, bandwidth 2*16 B/cycle
• L2 cache: 256 KB per core, latency 12 cycles, bandwidth 32 B/cycle
• Shared L3 cache: 2.5 MB slice per core, latency 31 cycles, bandwidth 32 B/cycle
  • Partitioned, with cache coherence based on core valid bits
  • Physical addresses distributed over the slices by a hash function
• On-chip network frequency equal to the core frequency
• Connections to memory, QPI, and PCIe
NUMA Node
[Figure: NUMA node with two Sandy Bridge sockets]
• 2 Sandy Bridge processors connected by 2 QPI links (each 2 GT/s)
• 8 x 4 GB memory (4 DIMMs per socket)
• 8x PCIe 3.0 (8 GB/s) connection to the Infiniband adapter
• 2 processors with 32 GB of memory
• Aggregate memory bandwidth per node 102.4 GB/s
• Latency
• local ~50ns (~135 cycles @2.7 GHz)
• remote ~90ns (~240 cycles)
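Because remote memory accesses are almost twice as slow as local ones, it can pay off to pin a process and its memory to one socket. A minimal sketch, assuming the standard Linux numactl tool is available on the node (not mentioned on the slides; appl stands for your application binary):
  numactl --hardware                           # show the NUMA nodes, their CPUs and memory
  numactl --cpunodebind=0 --membind=0 ./appl   # run appl on socket 0 using only local memory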
Interconnection Network
• Infiniband FDR-10
  • FDR means "fourteen data rate"
  • FDR-10 has an effective data rate of 41.25 Gb/s
  • Latency: 100 ns per switch, ~1 µs for MPI
  • Vendor: Mellanox
• Intra-Island Topology: non-blocking tree
• 256 communication pairs can talk in parallel.
• Inter-Island Topology: Pruned Tree 4:1
• 128 links per island to next level
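Both figures follow from the island size of 512 nodes:
  512 nodes / 2 = 256 node pairs communicating simultaneously (non-blocking within an island)
  512 nodes / 4 ≈ 128 uplinks per island to the next level (4:1 pruning)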
[Figure: Infiniband topology of islands and spine]
• 126 spine switches with 36 ports each; 19 links per spine switch lead down to the 18 compute islands and the IO island, the remaining ports serve the fat nodes and IO
• Each island has one 648-port switch: 516 links down to its 516 nodes, 126 links up to the spine switches
• 9288 compute nodes in total
[Photo: cold corridor with Infiniband (red) and Ethernet (green) cabling]
[Photo: Infiniband interconnect: 19 Orcas, 126 spine switches, 11900 Infiniband cables]
(Photos: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)
IO System
• Spine Infiniband switches connect the compute islands to the IO systems
• GPFS for $WORK and $SCRATCH: 10 PB @ 200 GB/s
• $HOME (served via the login nodes): 5 PB @ 80 Gb/s
• Archive: 30 PB @ 10 GbE
• Parallel file system GPFS: 10 PByte, 200 GByte/s I/O bandwidth
  • 9 DDN SFA12K controllers
  • 5040 x 3 TByte SATA disks
(Figure: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)
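A quick sanity check on the disk capacity (the step from raw to usable capacity is an assumption; the slides do not state the RAID/redundancy overhead):
  5040 disks * 3 TByte = 15,120 TByte ≈ 15 PByte raw, of which about 10 PByte are usable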
SuperMIC
• Intel Xeon Phi Cluster
• 32 nodes, each with
  – 2 Intel Xeon Ivy Bridge processors E5-2650
    – 8 cores each
    – 2.6 GHz clock frequency
  – 2 Intel Xeon Phi coprocessors 5110P
    – 60 cores @ 1.1 GHz
  – Memory
    – 64 GB host memory
    – 2 x 8 GB on the Xeon Phi cards
Intel Xeon Phi
• Connection to host: 6.2 GB/s
• Number of cores: 60
• Core frequency: 1.1 GHz
• GDDR5 memory size: 8 GB
• Hardware threads per core: 4
• SIMD vector registers: 32 (512-bit wide) per thread context
• Flops/cycle: 16 (DP), 32 (SP)
• Theoretical peak performance: 1 TFlop/s (DP), 2 TFlop/s (SP)
• L2 cache per core: 512 kB
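The peak numbers follow directly from the table:
  60 cores * 1.1 GHz * 16 Flops/cycle ≈ 1.06 TFlop/s (DP)
  60 cores * 1.1 GHz * 32 Flops/cycle ≈ 2.1 TFlop/s (SP)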
Nodes with Coprocessors
Access to SuperMIC
• Login to SuperMUC
• Login to SuperMIC
  • ssh supermic.smuc.lrz.de
  • LoadLeveler script with class phi
• Interactive access to nodes and coprocessors
  • Submit a batch script with a sleep command (see the sketch below)
  • Login to compute nodes
    • ssh i01r13???
  • Login to MIC coprocessors
    • ssh i01r13???-mic0
    • ssh i01r13???-mic1
  • PPK required
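A minimal sketch of such a "sleep" job, modelled on the job script shown later in this section (class phi and the sleep command are taken from the bullets above; the exact resource directives may need adjusting):
  #!/bin/bash
  #@ job_type = parallel
  #@ class = phi
  #@ node = 1
  #@ wall_clock_limit = 01:00:00
  #@ queue
  . /etc/profile
  sleep 3600    # keep the node allocated so you can ssh into it and its coprocessors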
The Compute Cube of LRZ
[Figure: building layout of the LRZ compute cube]
• Re-cooling plants (Rückkühlwerke)
• High-performance computer room (Höchstleistungsrechner), column-free (säulenfrei)
• Access bridge (Zugangsbrücke)
• Server/network (Server/Netz)
• Archive/backup (Archiv/Backup)
• Cooling (Klima)
• Electrical supply (Elektro)
Run jobs in batch
• Advantages
  • Reproducible performance
  • Run larger jobs
  • No need to interactively poll for resources
• Test queue
  • Max 1 island, 32 nodes, 2 h, 1 job in the queue
• General queue
  • Max 1 island, 512 nodes, 48 h
• Large
  • Max 4 islands, 2048 nodes, 48 h
• Special
  • Max 18 islands …
Job Script
#!/bin/bash
#@ wall_clock_limit = 00:04:00
#@ job_name = add
#@ job_type = parallel
#@ class = test
#@ network.MPI = sn_all,not_shared,us
#@ output = job$(jobid).out
#@ error = job$(jobid).out
#@ node = 2
#@ total_tasks=4
#@ node_usage = not_shared
#@ queue
. /etc/profile
cd ~/apptest/application
poe appl
• llsubmit job.scp
  • Submission to the batch system
• llq -u $USER
  • Check the status of your own jobs
• llcancel <jobid>
  • Kill a job that is no longer needed
Limited CPU Hours available
• Please
• Specify the job resources and run time as tightly as possible.
• Do not request more nodes than required. We have to "pay" for all allocated cores, not only the ones actually used.
• SHORT (<1 sec) sequential runs can be done on the login node.
• Even SHORT OMP runs can be done on the login node (see the example below).
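For instance, a short OpenMP test can be started directly on the login node (a sketch; appl stands for your own binary, and OMP_NUM_THREADS is the standard OpenMP environment variable):
  OMP_NUM_THREADS=4 ./appl    # quick OpenMP test with 4 threads on the login node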
Login to SuperMUC, Documentation
• First change the standard password
• https://idportal.lrz.de/r/entry.pl
• Login via
• lxhalle, due to restrictions on which machines may connect
• ssh <userid>@supermuc.lrz.de
• No outgoing connections allowed
• Documentation
• http://www.lrz.de/services/compute/supermuc/
• http://www.lrz.de/services/compute/supermuc/loadleveler/
• Intel compiler:
  http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/index.htm
Batch Script Parameters
• #@ energy_policy_tag = NONE
• Switches off the automatic adaptation of the core frequency (useful for performance measurements)
• #@ node = 2
• #@ total_tasks= 4
• #@ task_geometry = {(0,2) (1,3)}
• #@ tasks_per_node = 2
• Limitations on how these parameters may be combined are documented on the LRZ web page (see the sketch below)
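A sketch of two alternative ways to place 4 MPI tasks on 2 nodes, using the directive values from this slide (whether a given combination is accepted is subject to the LRZ restrictions mentioned above):
  # Alternative 1: let LoadLeveler put 2 tasks on each of the 2 nodes
  #@ node = 2
  #@ tasks_per_node = 2

  # Alternative 2: fix the mapping explicitly (tasks 0 and 2 on the first node, 1 and 3 on the second)
  #@ task_geometry = {(0,2) (1,3)}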
Compiler
• Intel C++
• icc Version 12.1
• Editors
  • vi
  • emacs
  • xedit
  • …
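A minimal compile sketch with the Intel compiler (the source file name appl.c is illustrative; -openmp is the OpenMP flag of icc 12.1):
  icc -O2 -openmp appl.c -o appl    # optimize and enable OpenMP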