SuperMUC @ Leibniz Supercomputer Centre
• Movie on YouTube

Peak Performance
• Peak performance: 3 PetaFlops = 3×10^15 Flops
• SI prefixes:
    Mega    10^6     million
    Giga    10^9     billion
    Tera    10^12    trillion
    Peta    10^15    quadrillion
    Exa     10^18    quintillion
    Zetta   10^21    sextillion
• Flops: Floating Point Operations per Second

Distributed Memory Architecture
• 18 partitions called islands, each with 512 nodes
• A node is a shared-memory system with 2 processors
  • Sandy Bridge-EP Intel Xeon E5-2680 8C, 2.7 GHz (Turbo 3.5 GHz)
  • 32 GByte memory
  • Infiniband network interface
• Each processor has 8 cores
  • 2-way hyperthreading
  • 21.6 GFlops per core @ 2.7 GHz
  • 172.8 GFlops per processor

Sandy Bridge Processor
• 8 multithreaded cores
• Cache hierarchy (per core):
    L1   32 KB          latency  4 cycles    bandwidth 2×16 bytes/cycle
    L2   256 KB         latency 12 cycles    bandwidth 32 bytes/cycle
    L3   2.5 MB slice   latency 31 cycles    bandwidth 32 bytes/cycle
• Cores, shared L3, memory controller, QPI and PCIe are connected by an on-die network running at the core frequency
• L3 cache
  • Partitioned, with cache coherence based on core valid bits
  • Physical addresses distributed over the slices by a hash function

NUMA Node
• [Figure: node diagram — two Sandy Bridge processors, each with 4 × 4 GB of local memory, connected by 2 QPI links (2 GT/s each); 8x PCIe 3.0 (8 GB/s) to the Infiniband adapter]
• 2 processors with 32 GB of memory
• Aggregate memory bandwidth per node: 102.4 GB/s
• Latency
  • local: ~50 ns (~135 cycles @ 2.7 GHz)
  • remote: ~90 ns (~240 cycles)

Interconnection Network
• Infiniband FDR-10
  • FDR means "fourteen data rate"
  • FDR-10 has an effective data rate of 41.25 Gb/s
  • Latency: 100 ns per switch, ~1 µs for MPI
  • Vendor: Mellanox
• Intra-island topology: non-blocking tree
  • 256 communication pairs can talk in parallel
• Inter-island topology: pruned tree 4:1
  • 128 links per island to the next level
• [Figure: fat-tree topology — each of the 18 islands (plus an IO island) connects 516 nodes via 516 links to a 648-port switch; 126 links per island lead up to 126 spine switches (36-port); 19 links remain for fat nodes and IO; 9288 compute nodes in total]

Cold Corridor
• [Photo: cold corridor, showing Infiniband (red) and Ethernet (green) cabling]
  (Photos and figures: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)

Infiniband Interconnect
• 19 Orcas
• 126 spine switches
• 11900 Infiniband cables

IO System
• [Figure: spine Infiniband switches connect the compute islands to GPFS for $WORK and $SCRATCH (10 PB @ 200 GB/s), the login nodes, $HOME (5 PB @ 80 Gb/s) and the archive (30 PB @ 10 GbE)]
• Parallel file system GPFS
  • 10 PByte, 200 GByte/s I/O bandwidth
  • 9 DDN SFA 12K controllers
  • 5040 SATA disks of 3 TByte each

SuperMIC
• Intel Xeon Phi cluster
• 32 nodes, each with
  • 2 Intel Xeon Ivy Bridge processors E5-2650
    • 8 cores each
    • 2.6 GHz clock frequency
  • 2 Intel Xeon Phi coprocessors 5110P
    • 60 cores @ 1.1 GHz
  • Memory
    • 64 GB host memory
    • 2 × 8 GB on the Xeon Phis

Intel Xeon Phi
    Connection to host              6.2 GB/s
    Number of cores                 60
    Frequency of cores              1.1 GHz
    GDDR5 memory size               8 GB
    Hardware threads per core       4
    SIMD vector registers           32 (512-bit wide) per thread context
    Flops/cycle                     16 (DP), 32 (SP)
    Theoretical peak performance    1 TFlop/s (DP), 2 TFlop/s (SP)
    L2 cache per core               512 kB
• (The sketch below this section reproduces the quoted peak numbers from cores × clock × flops/cycle.)

Nodes with Coprocessors
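The following small C program is only a back-of-the-envelope check, not part of the LRZ material: it reproduces the peak-performance figures quoted above from cores × clock frequency × flops per cycle. The 8 DP flops/cycle assumed for a Sandy Bridge core (AVX add + multiply) is an assumption inferred from the quoted 21.6 GFlops per core; the Xeon Phi value of 16 DP flops/cycle is taken from the table.

    /* peak.c -- sanity check of the peak-performance numbers quoted in the slides */
    #include <stdio.h>

    /* peak in GFlops = cores * clock [GHz] * flops per cycle */
    static double peak_gflops(int cores, double ghz, int flops_per_cycle) {
        return cores * ghz * flops_per_cycle;
    }

    int main(void) {
        /* Sandy Bridge-EP E5-2680: 2.7 GHz, assumed 8 DP flops/cycle (AVX) */
        double snb_core = peak_gflops(1, 2.7, 8);               /* 21.6 GFlops   */
        double snb_proc = peak_gflops(8, 2.7, 8);               /* 172.8 GFlops  */
        /* 18 islands * 512 nodes * 2 processors per node */
        double sys_pflops = snb_proc * 2 * 512 * 18 / 1e6;      /* ~3.2 PFlops   */
        /* Xeon Phi 5110P: 60 cores, 1.1 GHz, 16 DP flops/cycle */
        double phi_tflops = peak_gflops(60, 1.1, 16) / 1e3;     /* ~1.06 TFlop/s */

        printf("Sandy Bridge core:      %7.1f GFlops\n", snb_core);
        printf("Sandy Bridge processor: %7.1f GFlops\n", snb_proc);
        printf("SuperMUC (thin nodes):  %7.2f PFlops\n", sys_pflops);
        printf("Xeon Phi 5110P (DP):    %7.2f TFlop/s\n", phi_tflops);
        return 0;
    }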
Access to SuperMIC
• Login to SuperMUC first
• Login to SuperMIC
  • ssh supermic.smuc.lrz.de
• LoadLeveler script with class phi
• Interactive access to nodes and coprocessors
  • Submit a batch script with a sleep command
  • Login to the compute nodes
    • ssh i01r13???
  • Login to the MIC coprocessors
    • ssh i01r13???-mic0
    • ssh i01r13???-mic1
  • PPK required

The Compute Cube of LRZ
• [Figure: building cross-section — re-cooling units (Rückkühlwerke), column-free high-performance computer room (Höchstleistungsrechner, säulenfrei), access bridge (Zugangsbrücke), server/network, archive/backup, HVAC (Klima), electrical supply (Elektro)]

Run Jobs in Batch
• Advantages
  • Reproducible performance
  • Run larger jobs
  • No need to poll interactively for resources
• Test queue
  • Max 1 island, 32 nodes, 2 h, 1 job in the queue
• General queue
  • Max 1 island, 512 nodes, 48 h
• Large queue
  • Max 4 islands, 2048 nodes, 48 h
• Special queue
  • Max 18 islands …

Job Script
    #!/bin/bash
    #@ wall_clock_limit = 00:4:00
    #@ job_name = add
    #@ job_type = parallel
    #@ class = test
    #@ network.MPI = sn_all,not_shared,us
    #@ output = job$(jobid).out
    #@ error = job$(jobid).out
    #@ node = 2
    #@ total_tasks=4
    #@ node_usage = not_shared
    #@ queue
    . /etc/profile
    cd ~/apptest/application
    poe appl
• llsubmit job.scp — submit the job to the batch system
• llq -u $USER — check the status of your own jobs
• llcancel <jobid> — kill a job that is no longer needed
• (A minimal MPI test program that could stand in for the appl binary is sketched at the end of this section.)

Limited CPU Hours Available
• Please
  • Specify the job requirements as tightly as possible.
  • Do not request more nodes than required. We have to "pay" for all allocated cores, not only the used ones.
  • SHORT (<1 sec) sequential runs can be done on the login node.
  • Even SHORT OpenMP runs can be done on the login node.

Login to SuperMUC, Documentation
• First change the standard password
  • https://idportal.lrz.de/r/entry.pl
• Login via lxhalle, due to the restriction on connecting machines
  • ssh <userid>@supermuc.lrz.de
• No outgoing connections allowed
• Documentation
  • http://www.lrz.de/services/compute/supermuc/
  • http://www.lrz.de/services/compute/supermuc/loadleveler/
  • Intel compiler: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/index.htm

Batch Script Parameters
• #@ energy_policy_tag = NONE
  • Switches off the automatic adaptation of the core frequency, useful for performance measurements
• #@ node = 2
• #@ total_tasks = 4
• #@ task_geometry = {(0,2) (1,3)}
• #@ tasks_per_node = 2
• Limitations on combining these keywords are documented on the LRZ web page

Compiler
• Intel C++
  • icc version 12.1
• Editors
  • vi, emacs, xedit, …
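For completeness, here is a minimal MPI test program that could stand in for the appl binary started by poe in the job script above (4 tasks on 2 nodes). The file name hello_mpi.c and the compiler wrapper mentioned below are illustrative assumptions; the actual wrapper depends on the MPI environment loaded on SuperMUC.

    /* hello_mpi.c -- minimal MPI program: each task reports its rank and host */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* task id within the job     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks      */
        MPI_Get_processor_name(host, &len);     /* name of the executing node */

        printf("Task %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

Compiled, for example, with "mpicc -O2 hello_mpi.c -o appl" (wrapper name an assumption), it can be launched with llsubmit job.scp and should print one line per task to the job output file.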