The IBM Blue Gene/P Supercomputer

Introduction to HPC Programming
2. The IBM Blue Gene/P Supercomputer
Valentin Pavlov <[email protected]>
About these lectures
• This is the second of a series of six introductory lectures
discussing the field of High-Performance Computing;
• The intended audience of the lectures are high-school
students with some programming experience (preferably
using the C programming language) and an interest in
scientific studies, e.g. physics, chemistry, biology, etc.
• This lecture provides an overview of the IBM Blue Gene/P
supercomputer’s architecture, along with some practical
advice about its usage.
What does “super-” mean?
• When talking about computers, the prefix “super-” does not
have the same meaning as when talking about people (e.g.
Superman);
• The analogy is closer to that of a supermarket – a market
that sells a lot of different articles;
• Thus, a supercomputer is not to be thought of a priori as a
very powerful computer, but simply as a collection of a lot
of ordinary computers.
What does “super-” mean?
• Anyone with a few thousand euros to spare can build an
in-house supercomputer out of a lot of cheap components
(think Raspberry Pi) which would in principle be not much
different from a high-end supercomputer, only slower.
• Most of the information in these lectures is applicable to
such ad-hoc supercomputers, clusters, etc.
• In this lecture we’ll have a look at the architecture of a real
supercomputer, the IBM Blue Gene/P, and also discuss the
differences with the new version of this architecture, IBM
Blue Gene/Q.
IBM Blue Gene/P
• IBM Blue Gene/P is a modular hybrid parallel system.
• Its basic module is called a “rack” and a certain
configuration can have from 1 to 72 racks.
• In the full 72 racks configuration, the theoretical peak
performance of the system is around 1 PFLOPS;
• Detailed information about system administration and
application programming of this machine is available online
from the IBM RedBooks publication series, e.g.
http://www.redbooks.ibm.com/abstracts/sg247287.html
The IBM Blue Gene/P @ NCSA, Sofia
• The Bulgarian Supercomputing Center in Sofia operates
and provides access to an IBM Blue Gene/P configuration
that consists of 2,048 Compute Nodes, having a total of
8,192 PowerPC cores @ 850 MHz and 4 TB of RAM;
• The connection of the Compute Nodes with the rest of
the system is through 16 10 Gb/s channels;
• Its theoretical performance is 27.85 TFLOPS;
• Its energy efficiency is 371.67 MFLOPS/W;
• When it was put into operation in 2008, it was ranked
126th in the world on the http://top500.org list.
Why a supercomputer?
• This supercomputer is not much different from a network of
2,000 ordinary computers (a cluster), or, let’s say, 40 different
clusters of 50 machines;
• So why bother with a supercomputer? Because it offers
several distinctive advantages:
• Energy efficient – the maximum power consumption of the
system at full utilization is 75 kW; this might seem a lot,
but is several times less than that of 2,000 ordinary computers.
• Small footprint – it fits in a small room, while 2,000 PCs
would probably occupy a football stadium. 40 clusters of 50
machines would occupy 40 different rooms.
Why a supercomputer?
• Transparent high-speed and highly available network –
the mass of cables and devices that interconnect 2,000
PCs would be a nightmarish mess;
• Standard programming interfaces (MPI and OpenMP) –
the same would be used on clusters. So, software written for a
cluster would work on the supercomputer, too (at least in
principle);
• High scalability to thousands of cores – in the 40 different
clusters scenario each cluster is small and cannot run
extra large jobs;
Why a supercomputer?
• High availability at lower price – built as an integrated
unit from the start, it breaks a lot less than 2,000 ordinary
computers would. Moreover, it can be operated by a small
team of staff, as opposed to 40 different teams in the many
clusters scenario.
• Better utilization, compared to the 40 clusters scenario.
The centralized management allows different teams of
researchers to use the processing power in a shared
resource manner, which would be very hard to do if the
clusters were owned by different groups.
IBM Blue Gene/P Hardware Organization
Figure: IBM Blue Gene/P – from the CPU to the full system (Source: IBM)
Compute Nodes (CN)
• The processing power of the supercomputer stems from
the multitude of Compute Nodes (CNs). There are 1,024
CNs in a rack, for a total of 73,728 CNs in the full
configuration.
• Each CN contains a quad-core PowerPC @ 850 MHz with
dual FPU (called “double hummer”) and 2 GB RAM.
• Ideally, each core can perform 4 floating-point operations
per cycle, thus performing at 850 × 4 = 3400 MFLOPS.
Multiplied by the number of cores, this brings the
performance of a single CN to 4 × 3.4 = 13.6 GFLOPS.
Compute Nodes (CN)
• The theoretical performance of the whole system is thus
73728 × 13.6 = 1002700.8 GFLOPS
= 1.0027008 PFLOPS
• Each CN has 4 cores and behaves as a shared memory
machine with regard to the 2 GB of RAM on the node;
• The cores on one CN do not have access to the memory
of another CN, so the collection of CNs behaves as a
distributed memory machine;
• Thus, the machine has a hybrid organization – distributed
memory between nodes and shared memory within the
same node.
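To make the hybrid picture concrete, below is a minimal MPI sketch in
C (the file name hello_mpi.c and the summing of ranks are just an
illustration, not part of the Blue Gene software): each copy of the
program owns its private memory, and data can only move between
copies through explicit messages.

/* hello_mpi.c -- minimal illustration of the distributed-memory model.
 * Each MPI rank runs in its own address space; ranks exchange data
 * only through explicit messages such as MPI_Reduce below. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies run?  */

    int local = rank;                       /* private to this copy  */
    int total = 0;

    /* The only way another copy can "see" local is via a message.   */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d copies, sum of ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}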
Connectivity
• Each CN is directly connected to its immediate neighbours
in all 3 directions;
• Communication between non-neighbouring nodes involves
at least one node that, apart from computing, also has to
forward network traffic, which brings down its
performance.
• The whole system looks like a 3D MESH, but in order to
reduce the amount of forwarding it can also be configured
as a 3D TORUS – a topology in which the ends of the
mesh in each of the 3 directions are wrapped around and
connected to each other.
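As a small illustration of why wrapping the ends around helps, the
sketch below (plain C, not a Blue Gene API) counts the hops between
two node coordinates along one dimension of a mesh and of a torus;
on the torus the worst case drops to half the number of nodes.

/* Hop count along one dimension between node coordinates a and b,
 * with 0 <= a, b < n.  Illustration only. */
#include <stdio.h>
#include <stdlib.h>

static int mesh_hops(int a, int b)
{
    return abs(a - b);                      /* must walk the straight line */
}

static int torus_hops(int a, int b, int n)
{
    int d = abs(a - b);
    return d < n - d ? d : n - d;           /* may wrap around the ends    */
}

int main(void)
{
    int n = 8;                              /* 8 nodes along this axis */
    printf("mesh:  %d hops\n", mesh_hops(0, n - 1));      /* 7 hops */
    printf("torus: %d hops\n", torus_hops(0, n - 1, n));  /* 1 hop  */
    return 0;
}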
Connectivity
• The advantage of the torus is that it halves the amount of
forwarding necessary, since the longest distance is now
half the number of nodes in each direction.
• The connectivity with the rest of the system is achieved
through special Input/Output Nodes (IONs);
• Each Node Card (32 CNs) has 1 ION through which the
CNs access shared disk storage and the rest of the
components of the system via a 10 Gb/s network;
• There are other specialized networks, e.g. for collective
communications, etc.
Supporting Hardware
• Apart from the racks containing the CNs, the
supercomputer configuration includes several other
components, the most important of them being:
• Front-End Nodes (FENs) – a collection of servers to which
the users connect remotely using the secure shell protocol.
In the BGP configuration they are 64-bit PowerPC machines
running SuSE Linux Enterprise Server 10 (SLES 10);
• Service Node (SN) – a backend service node that manages
and orchestrates the work of the whole machine. It is off
limits to end users; only administrators have
access to it;
Supporting Hardware
• File Servers (FS) – several servers that run a distributed file
system which is exported to and seen by both the CNs and
the FENs. The home directories of the users are stored on
this distributed file system and this is where all input and
output goes.
• Shared Storage library – disk enclosures containing the
physical HDDs over which the distributed file system
spans.
Software features—cross-compilation
• In contrast to some other supercomputers and clusters,
Blue Gene has two distinct sets of computing devices:
CNs—the actual workhorses; and FENs—the machines to
which the users have direct access.
• CNs and FENs are not binary compatible—a program that
is compiled to run on the FEN cannot run on the CNs and
vice versa.
• This puts the users in a situation in which they have to
compile their programs on the FEN (since that is the only
machine they have access to), but the programs must be
able to run on the CNs. This is called cross-compilation.
Software features—batch execution
• Since cross-compiled programs cannot run on the FEN,
users cannot execute them directly—they need some way
to post a program for execution.
• This is called batch job execution. The users prepare a
so-called ’job control file’ (JCF) in which the specifics of the
job are stated and submit the job to a resource scheduler
queue. When the resource scheduler finds free resources
that can execute the job, it is sent to the corresponding
CNs;
• Blue Gene/P uses TWS LoadLeveler (LL) as its resource
scheduler;
Software features—batch execution
• An important consequence of batch execution is that
programs should not be interactive.
• While it is possible to come up with some sophisticated
mechanism to wait on the queue and perform redirection in
order to allow interactivity, it is not desirable, since one
cannot predict exactly when the program will be run.
• And if the program runs and waits for user input while the
user is not there, the CNs will idly waste time and power.
• Thus, all parameters of the programs must be passed via
configuration files, command line options or some other
way, but not via user interaction.
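As an illustration of the last point, a batch-friendly program takes
everything it needs from the command line (or from a file named
there) and never reads from the keyboard. The parameter names below
(a step count and an output file) are made up for the example.

/* Batch-friendly parameter handling: everything comes from argv,
 * nothing is read interactively.  Parameter names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        /* This ends up in the job's error file, not on a screen. */
        fprintf(stderr, "usage: %s <steps> <output-file>\n", argv[0]);
        return 1;
    }

    long steps = strtol(argv[1], NULL, 10);
    FILE *out = fopen(argv[2], "w");
    if (out == NULL) {
        perror(argv[2]);
        return 1;
    }

    fprintf(out, "would run %ld steps\n", steps);  /* placeholder work */
    fclose(out);
    return 0;
}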
Partitions
• The multitude of CNs is divided into “partitions” (or “blocks”)
and the smallest partition depends on the exact machine
configuration, but is usually 32 nodes (on the machine in
Sofia the smallest partition is 128 nodes1);
• A partition that encompasses half a rack (512 CNs) is called
a ’midplane’ and is the smallest partition for which the TORUS
network topology can be chosen;
• When LL starts a job, it dynamically creates a
correspondingly sized partition for it. After the job
terminates, the partition is destroyed.
1
Which means that there are at most 16 simultaneously running jobs on this
machine!
Resource Allocation
• LL does resource allocation. Its task is to execute as many
jobs as possible in as little time as possible, given
the limited hardware resources.
• This is an optimization problem and is solved by heuristic
means;
• In order to solve this problem LL needs to know the extents
of the jobs both in space (number of nodes) and in time
(maximum execution time, called “wall clock limit”);
Constraints and prioritization
• In order to ensure fair usage of the resources, the
administrators can put constraints on the queue—e.g. a
user can have no more than N running jobs and M jobs in
total in the queue;
• Apart from this, LL can dynamically assign a priority to
each job, based on things like job submission time, number
of jobs in the queue for the same user, last time a job was
run by the same user, etc.
• The policy for these things is configured by the system
administrators.
Execution modes
• Each job must specify the execution mode in which to run.
The execution mode specifies the shared/distributed
memory configuration for cores inside each of the CNs in
the job’s partition.
• There are 3 available modes: VN, DUAL and SMP;
• In VN mode each CN is viewed as a set of 4 different
CPUs, working in distributed memory fashion. Each
processor executes a separate copy of the parallel
program, and the program cannot use threads. The RAM
is divided into 4 blocks of 512 MB each and each core
“sees” only its own block of RAM.
Execution modes
• In DUAL mode each CN is viewed as 2 sets of 2 cores.
Each set of 2 cores runs one copy of the program, and this
copy can spawn one worker thread in addition to the
master thread that is initially running. The RAM is divided
into 2 blocks of 1 GB each and each set of 2 cores sees its
own block.
• This is a hybrid setting—the two threads running inside a
set of cores work in shared memory fashion, while the
different sets of cores work in distributed memory fashion.
Execution modes
• In SMP mode each CN is viewed as 1 set of 4 cores. The
node runs one copy of the program, and this copy can
spawn three worker threads in addition to the master
thread that is initially running. The RAM is not divided and
the 4 cores see the whole 2 GB of RAM in a purely shared
memory fashion.
• Again, this is a hybrid setting—the four threads running
inside a node work in shared memory fashion, while the
different nodes work in distributed memory fashion.
Execution modes—which one to use?
• In VN mode the partition looks like a purely distributed
memory machine, while in DUAL and SMP mode the
partition looks like a hybrid machine;
• It is much easier to program a distributed memory machine
than a hybrid one.
• Thus, VN mode is the easiest, but it has a giant
drawback—there are only 512 MB of RAM available to
each copy of the program.
Execution modes—which one to use?
• If you need more memory, you have to switch to DUAL or
SMP mode.
• But then you also have to take into consideration the hybrid
nature of the machine and properly utilize the 4 threads
available to each copy of the program.
• Running a single-threaded application in DUAL or SMP
mode is an enormous waste of resources!
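The skeleton below shows what such a hybrid program might look like:
MPI between the copies of the program and OpenMP threads inside each
copy (2 threads per copy in DUAL mode, 4 in SMP mode). The file name
hybrid.c is arbitrary; compiler wrappers, flags and the way the thread
count is set are installation-specific and are left to the system
documentation.

/* hybrid.c -- sketch of a hybrid MPI + OpenMP program.
 * In SMP mode one copy runs per CN and can use up to 4 threads;
 * in DUAL mode one copy runs per pair of cores and can use 2. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, provided;

    /* FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory parallelism inside the node (or half-node). */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Distributed-memory communication between the copies would go
     * here, e.g. exchanging boundary data with MPI_Send/MPI_Recv. */

    MPI_Finalize();
    return 0;
}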
Job submission
• Prepared jobs are run using the command llsubmit,
which accepts as an argument a “job control file” (JCF)
that describes the required resources, the executable file,
its arguments and environment.
• Calling llsubmit puts the job in the queue of waiting jobs.
This queue can be listed using the command llq.
• A job can be cancelled by using llcancel and supplying it
with the jobid as seen in the llq list.
JCF Contents
# @ job_name = hello
# @ comment = "This is a Hello World program"
# @ error = $(jobid).err
# @ output = $(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 01:00:00
# @ notification = never
# @ job_type = bluegene
# @ bg_size = 128
# @ class = n0128
# @ queue
/bgsys/drivers/ppcfloor/bin/mpirun -exe hello -verbose 1 -mode VN -np 512
Important directives in JCF
• error = $(jobid).err
• output = $(jobid).out—These two directives specify a
set of files to which the output of the job is redirected.
• Remember that jobs are not interactive and the user
cannot see what would normally be seen on the screen if
the program was run by itself.
• So the output that usually goes on the screen is stored in
the specified files: errors go in the first file and the regular
output—in the second.
• LL replaces the text $(jobid) with the real ID assigned to
the job, so as not to overwrite some previous output.
Important directives in JCF
• wall_clock_limit = 01:00:00
• bg_size = 128
• The first directive provides the maximum extent of the job
in time (HH:MM:SS). If the job is not finished at the end of
the specified period, it is killed by LL;
• The second directive provides the extent of the job in
space and gives the number of CNs required by the job.
• Remember that in order for LL to be able to solve the
optimization problem, it needs these two pieces of data.
Important directives in JCF
• class = n0128
• The class of the job determines several important
parameters, among which:
• The maximum number of nodes that can be requested;
• The maximal wall clock limit that can be specified;
• The job priority—larger jobs have precedence over smaller
jobs and faster jobs have priority over slower ones;
• Administrators can put in place constraints with regard to
the number of simultaneously executing jobs of each class.
• The classes are different for each installation and their
characteristics must be made available to the users in
some documentation file.
Other supercomputers – Top 10 as of November 2012
1. Titan (USA) – 17 PFLOPS, Cray XK7
2. Sequoia (USA) – 16 PFLOPS, IBM Blue Gene/Q
3. K Computer (Japan) – 10 PFLOPS, SPARC64-based
4. Mira (USA) – 8 PFLOPS, Blue Gene/Q
5. JUQUEEN (Germany) – 8 PFLOPS, Blue Gene/Q
6. SuperMUC (Germany) – 2.8 PFLOPS, Intel iDataPlex
7. Stampede (USA) – 2.6 PFLOPS, uses Intel Xeon Phi
8. Tianhe-1A (China) – 2.5 PFLOPS, uses NVIDIA Tesla
9. Fermi (Italy) – 1.7 PFLOPS, Blue Gene/Q
10. DARPA Trial Subset (USA) – 1.5 PFLOPS, POWER7-based
Blue Gene/Q
• In the Top 10 list as of November 2012, 4 of the machines
are IBM Blue Gene/Q
• Conceptually it is very similar to IBM Blue Gene/P, but its
Compute Nodes are a lot more powerful;
• Each compute node has 18 64-bit PowerPC cores @ 1.6
GHz (only 16 are used for computation) and 16 GB RAM;
the largest installation, Sequoia, peaks at about 20 PFLOPS;
• Important aspects such as cross-compilation, batch job
submission, the JCF file format, etc. are basically the same
as in Blue Gene/P. The obvious difference is the mode
specification, since VN, SMP and DUAL are now obsolete
and the specification on the BG/Q is more flexible.