White paper
UNLEASH YOUR HPC PERFORMANCE WITH BULL
Maximizing computing performance while reducing power consumption
Contents
EXECUTIVE SUMMARY
THE AUTHORS
MARKET DYNAMICS by Addison Snell, CEO, Intersect360 Research
PERFORMANCE CONSTRAINTS
BEST PRACTICES FOR APPLICATION OPTIMIZATION
BULL: AN INDISPUTABLE AND UNIQUE ACTOR
CONCLUSION by Addison Snell, CEO, Intersect360 Research
EXECUTIVE SUMMARY
The High Performance Computing (HPC) industry is marked by a relentless pursuit of performance, in a race
to meet the demand for ever faster, more complex and more precise simulations. As a consequence, we
are today in the early stages of a major transition in the prevailing HPC architecture, with the multiplication
of compute cores, both within CPUs and through the addition of accelerators or coprocessors. But application
designers will also have to rethink their programming and/or their algorithms to fit these new architectures
and to address the combined challenge of computing performance and power consumption.
This white paper identifies HPC performance inhibitors, and presents the best practices that can be
implemented to avoid them, while optimizing energy efficiency.
THE AUTHORS
Mathieu Dubois is a software engineer in Bull’s Applications & Performance Team, which
he joined in 2009 after a PhD and several years of research experience in
nanotechnologies. His activities focus on all aspects of parallelization and code
optimization, with a strong involvement in accelerator and co-processor development.
He has in-depth knowledge of HPC architectures, acquired through long experience in presales activities.
Xavier Vigouroux, after a PhD in distributed computing, worked for several major
companies in different positions, from investigator at Sun Labs to support engineer at
HP. He has now been working for Bull for six years. He led the HPC benchmarking team
for the first five years and is now in charge of the “Education and Research” market for
HPC at Bull.
MARKET DYNAMICS by Addison Snell, CEO, Intersect360 Research
The high performance computing (HPC) industry continues to fuel new discoveries and capabilities in
science, engineering, and business. At the high end, new problems are always on the horizon. There is no
final frontier of science, nor an end to engineering, nor a perfect business that cannot be evolved for
greater competitiveness. HPC technologies are the tools that can help organizations in their perpetual drive
toward innovation. For buyers in academic and government research, HPC can accelerate the path to
scientific discovery. For commercial use cases, it does the same, but with the added metrics of quantifiable
return on investment, as companies seek to speed time to market, improve product quality, reduce the
costs of failures, and streamline operational efficiencies.
Chris Willard, Chief Research Officer of Intersect360 Research, describes the striving attitude of HPC this
way: “Once you solve the problem, it’s no longer interesting. You don’t need to design the same bridge
twice. You move on to the next, harder bridge.” This is the nature of the demand for ever-greater HPC
performance, regardless of application.
In the relentless pursuit of performance, the HPC era has seen an evolution in the architectures deployed
by the majority of the market. Vector processors gave way to scalar, RISC was supplanted by x86, and UNIX
was replaced by Linux in the majority of HPC installations. Each of these transitions came with implied
changes in the software models that would best leverage the hardware, in order to deliver performance at
scale for real applications.
Today we are in the early stages of another such transition, as the industry adopts multi-core and many-core processing technologies. Strictly in the x86 paradigm, the frequency race has ended. Rather than chasing Moore’s Law with faster gigahertz ratings, processor manufacturers such as Intel are delivering greater performance by putting more cores on each socket. This delivers more floating point operations per dollar by introducing a new level of parallelism at the chip level. The transition to multi-core began several years ago, and it has now evolved such that average HPC systems currently in deployment have more than eight cores per processor, with further increases on the horizon.¹ (See Figure 1.)
Figure 1: Average Cores per Processor by Year of Acquisition for Distributed Memory Systems (Source: Intersect360 Research, 2014)
¹ Intersect360 Research HPC market advisory service, “HPC User Site Census: Processors,” October 2013
And this is not the only processing technology transition in the market. For applications that can benefit
from even greater parallelism, models of many-core processing are now available, in the form of Intel Xeon
Phi and NVIDIA GPU processing elements. In either case, a supplemental “many-core” processing element,
containing hundreds of individual cores, acts as a secondary computational accelerator that can boost
performance when called upon.
Accelerators are not new to HPC, but they were traditionally held back by three dominant constraints. Each
of these has been addressed by both NVIDIA and Intel, albeit in different ways.
1. Pace of development: The developers of low-volume, custom co-processors have been unable to
maintain a development schedule that keeps pace with high-volume microprocessor markets. Intel,
of course, is accustomed to its “tick-tock” drumbeat of new releases, while NVIDIA is driven by the
rate of change in the high-volume gaming market.
2. Programmability: Co-processing elements need to be explicitly called by the application; this puts a
burden on programmers to insert these calls into their codes. NVIDIA revolutionized the approach
with its CUDA development environment, allowing programmers to interact with GPU accelerators
in standardized ways. Intel uses extensions of its x86 tool sets for Intel Xeon Phi programming.
3. Latency: Individual functions can be accelerated by co-processors, but the acceleration gained must
be sufficient to overcome the latency hit endured by moving them off of the microprocessor.
NVIDIA has supplemented evolutions to faster PCI-E connections with its own features designed to
reduce latency, such as NVLink and direct-memory connections. Intel Xeon Phi combines processor
and co-processor onto a single chip.
With these technology advancements, more and more HPC users are adopting accelerators.² (See Figure 2.) Most of these deployments are on NVIDIA GPUs currently, but Intel Xeon Phi – later to the market – is now seeing significant testing among end users as well. In either case, end users must evaluate how best to leverage the many-core parallelism in this new performance scheme.
Figure 2: Systems with Accelerators by Year of Last Modification (Source: Intersect360 Research, 2014)
² Intersect360 Research HPC market advisory service, “HPC User Site Census: Processors,” October 2013
Of course, this adoption is not merely technology for the sake of technology, nor even performance for the sake of performance. The HPC community continues to seek out leaps in performance for the same reasons it always has: to drive new discoveries, to accelerate innovation, and to improve
competitiveness. The discontinuous gain offered by these new technologies does require software changes,
but it also can enable new simulations or techniques, making the previously impossible now possible.
With so much at stake, Bull is investing in initiatives to help the HPC user community evaluate, implement,
and optimize new processing technologies. Access to hardware is part of the solution, but more
importantly, so is access to expertise. These technology transitions are happening; end users need to figure
out how best to deploy them. Resources like Bull’s Centre for Excellence in Parallel Programming and the
Fast Start program are designed to help scientists and engineers with the technology transitions that will
fuel their next generations of innovation.
PERFORMANCE CONSTRAINTS
A new HPC paradigm for performance
For many years, the HPC paradigm was that the regular increase of processor frequency and the
improvement of processor micro-architecture brought regular performance gain without any pain. You
simply had to copy your code to the new machine, launch it, and directly get more performance. This
natural evolution came to an end a few years back, when processor manufacturers, faced with the power
consumption wall, started to increase intrinsic performance by increasing the number of compute cores on
a single socket. To get the most out of these new CPU architectures, programmers and developers had to
rethink their applications in terms of parallelism. Today, the architecture of HPC supercomputers is
designed to allow applications to take advantage of this parallelism. The building block is a compute node
including twenty or so CPU cores, as well as memory and disks. The individual compute nodes are
interconnected through a fast network, commonly based on InfiniBand technology. There is
potentially no limit to performance gains along that path, helped by ever-faster memory
DIMMs and interconnect networks. Today, the main limitation lies within the code itself and the
programmers’ abilities. Most applications cannot handle the new performance challenges imposed by a
hardware evolution that trades more performance for less power consumption.
Hence, programmers are struggling to extract more parallelism, handle more hybrid configurations
(accelerators and co-processors are today’s premium choice for extra compute power with limited impact
on energy cost), and support more heterogeneity. Applications do not make the most of the latest
hardware evolutions, and source codes need to be modified to benefit from the latest technological
progress.
Three inhibitors to performance
Furthermore, a given code will usually be run on different supercomputers with different key characteristics. Being
efficient on each of them requires adapting some parts of the code, so that each implementation
fits the architecture it runs on. More precisely, three factors or inhibitors govern the performance of
scientific parallel applications:
• the sensitivity to CPU frequency,
• the sensitivity to memory bandwidth,
• the time devoted to communications and I/Os.
Depending on the physical or scientific aspects of the code and of the implemented algorithms, a given
application will be strongly impacted by one or several of these inhibitors. It is the responsibility of HPC
vendors to direct customers to the right architecture based on a deep analysis and understanding of the
end-user’s applications. On the other hand, coding methods can sometimes have an impact and artificially
weigh on one of these inhibitors. Only code profiling and optimization can then remove these bottlenecks.
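As an illustration, a minimal sketch of how the split between computation and communication time can be measured with MPI_Wtime (the compute and exchange routines below are hypothetical placeholders, not code from any real application):

#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

/* Placeholder for the application's compute kernel (CPU/memory bound). */
static void compute_step(void)
{
    for (int i = 0; i < N; ++i)
        a[i] = a[i] * 0.5 + b[i];
}

/* Placeholder for the application's communication phase. */
static void exchange_data(void)
{
    /* A simple global reduction stands in for real halo exchanges. */
    double local = a[0], global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_compute = 0.0, t_comm = 0.0;
    for (int step = 0; step < 100; ++step) {
        double t0 = MPI_Wtime();
        compute_step();
        double t1 = MPI_Wtime();
        exchange_data();
        double t2 = MPI_Wtime();
        t_compute += t1 - t0;
        t_comm    += t2 - t1;
    }

    if (rank == 0)
        printf("compute: %.2f s, communication: %.2f s\n", t_compute, t_comm);

    MPI_Finalize();
    return 0;
}

The resulting breakdown gives a first indication of which of the three inhibitors dominates for a given run.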
The power consumption issue
Application performance is also constrained by energy costs. Beyond just optimizing the execution time of
an application and the ability to get results faster, optimizing the global electrical consumption has become
a critical issue for most datacenters. There are two ways to address the energy issue. On one hand,
optimizing the code will make it possible to use a smaller system to target a given performance. On the
other hand, for a given supercomputer size, code optimization will provide a higher application throughput
for the same energy capping. Either way, optimizing the execution time of an application makes it possible to get
more results for every Watt consumed by the system.
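As a minimal numerical sketch of this argument (the runtime and power figures are purely illustrative assumptions, not measurements from this paper):

#include <stdio.h>

/* Illustrative figures only: a job that ran in 3600 s before optimization
 * and 2400 s after, on a system drawing an average of 200 kW. */
int main(void)
{
    const double power_kw   = 200.0;   /* average system power        */
    const double t_before_s = 3600.0;  /* runtime before optimization */
    const double t_after_s  = 2400.0;  /* runtime after optimization  */

    double kwh_before = power_kw * t_before_s / 3600.0; /* energy per result */
    double kwh_after  = power_kw * t_after_s  / 3600.0;

    printf("energy per result: %.0f kWh before, %.0f kWh after\n",
           kwh_before, kwh_after);
    printf("results per MWh:   %.2f before, %.2f after\n",
           1000.0 / kwh_before, 1000.0 / kwh_after);
    return 0;
}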
Thus, computer simulation requires not just a supercomputer with hundreds of cores, petabytes of
memory, and zettabytes of storage, but a continuum of components from the hardware to the final result,
including the software, mathematical libraries, programming abilities and analysis. To be fast, accurate,
productive and reliable, all these components must be tightly integrated and smoothly run together. If one
component is weak, then the whole chain is weak, and the business is at risk. Integrating all these
components together requires a lot of different skills, the most important of which is the ability to port and
optimize applications to the most appropriate platform.
BEST PRACTICES FOR APPLICATION OPTIMIZATION
A new area of expertise
To take up these performance challenges, application expertise
has become mandatory. Understanding the purpose and
requirements of scientific applications is a key factor in
proposing the best-suited and most efficient solution to customers and
in offering a satisfying user experience. With the petascale, and now
exascale, quests, application engineers at HPC companies must
progress from benchmarkers to real experts who understand
the major performance hurdles, can propose adapted
hardware solutions, and can optimize codes. This can only be
achieved by perfectly mastering today’s and tomorrow’s HPC
architectures and programming environments.
Benchmarking remains the first step for application
optimization. It is crucial to start with the best possible
execution time of the code “as is” (with no source
modification), by testing different hardware platforms (for
example different processor types, different interconnect
topologies…), mathematical libraries, software environments,
compilers and compilation options. This will serve as a
reference time for future optimization.
InfiniBand topologies are one of the critical components in HPC architectures. As the number of cores increases, MPI communications become more and more time consuming. An optimized interconnect network is obtained by maintaining identical bandwidth at each level of the network. Introducing “pruning” in a network consists in sharing connections between two or more points in the network. This reduces bandwidth but, in cases where communications are not critical, it also, more importantly, reduces the number of switches and cables, optimizing both cost and power consumption.
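As a rough illustration of the hardware saving (the switch radix, node count and the simple two-level sizing below are illustrative assumptions, counting ports only and ignoring exact wiring constraints):

#include <math.h>
#include <stdio.h>

/* Rough two-level fat-tree sizing for a given pruning ratio.
 * radix: ports per switch; nodes: compute nodes to connect;
 * pruning: downlinks per uplink (1 = full bisection, 2 = 2:1 pruned). */
static void size_fabric(int radix, int nodes, int pruning)
{
    int down_per_leaf = (radix * pruning) / (pruning + 1); /* node ports per leaf */
    int up_per_leaf   = radix - down_per_leaf;             /* uplinks per leaf    */
    int leaves        = (int)ceil((double)nodes / down_per_leaf);
    int spines        = (int)ceil((double)(leaves * up_per_leaf) / radix);

    printf("pruning %d:1 -> %d leaf + %d spine switches, %d uplink cables\n",
           pruning, leaves, spines, leaves * up_per_leaf);
}

int main(void)
{
    /* Example: 512 nodes on 36-port switches (illustrative values). */
    size_fabric(36, 512, 1);  /* full bisection bandwidth */
    size_fabric(36, 512, 2);  /* 2:1 pruned network       */
    return 0;
}

With these assumptions, the pruned fabric needs noticeably fewer switches and cables, which is where the cost and power savings come from.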
First, start by profiling the application…
Once the reference is obtained, the application needs to be profiled. As previously detailed, the execution
time of a code is determined by three factors: the CPU frequency, the memory bandwidth, and
communications and inputs/outputs. Being able to measure and evaluate these three parts is the key to
proposing the most efficient machine for a given code, and it is the starting point for optimization. There is no
point, for instance, in optimizing floating point operations if the code spends most of the time in MPI
communications. In that case, it is more relevant to optimize communication patterns or algorithms. On the
other hand, if communications are marginal, the global infrastructure can be optimized by introducing
pruning in the interconnect network, but optimization efforts should address other identified bottlenecks.
Whether they are home-made or third-party tools, profilers and debuggers can help software
engineers detect performance bottlenecks and understand the algorithmic and software structure of
applications. Some of them provide the “critical path”. This path is built from a sequence of computations
and communications; it is very useful because its duration is the duration of the whole program. To
reduce the execution time of the application, the programmer must reduce the components of the critical
path; any other modification would be useless.
… Before trying to optimize
More generally, a detailed profiling of the code is the starting point for any optimization work.
Figure 1: Minimal output of Bull's bullxprof tools
First, we try to get a global overview of the application’s behavior. Performance can be broken down to analyze
the time-consuming components (CPU, communications, IO) of the application. In the above example, 80%
of the total execution time is spent in MPI communications, IO is negligible, and 20% of the time (USER) is
spent running the application code; memory accesses are counted as CPU time. This overview of the
application gives precious information for optimization. A high proportion of time spent running the
application code is usually what we want. If this proportion is only average, we can further investigate scalar and vector
numeric operations and memory accesses. A high volume of memory accesses means that the per-core
performance is memory-bound. We can then use a profiler to identify time-consuming loops and check
their cache performance. If little time is spent in vectorized instructions, one might want to check the
compiler’s vectorization advice to see why key loops could not be vectorized.
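As a simple illustration (not taken from any real application), pointer aliasing is a common reason a compiler declines to vectorize a loop; a minimal sketch of the problem and a typical fix:

#include <stddef.h>

/* The compiler cannot prove that 'out' does not overlap 'in', so it may
 * refuse to vectorize this loop (the vectorization report will say why). */
void scale_unknown_alias(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0 * in[i];
}

/* With 'restrict' the programmer asserts that the arrays do not overlap,
 * which typically lets the compiler generate vector instructions. */
void scale_restrict(double *restrict out, const double *restrict in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0 * in[i];
}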
High values for MPI communications are usually bad. This means that the application spends more time
communicating than performing actual computation. Getting a detailed MPI profiling makes it possible to
determine whether communications are point-to-point or collective ones and to obtain the transfer rate of
both types. Low transfer rates can be caused by inefficient message sizes, such as many small messages, or
by imbalanced workloads causing processes to wait. The MPI profiler can then be used to identify the
problematic calls and ranks.
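For instance, many small point-to-point messages can often be aggregated into one larger message, paying the per-message latency only once; a minimal sketch (the buffer sizes and two-rank pattern are illustrative assumptions):

#include <mpi.h>
#include <string.h>
#include <stdio.h>

#define NCHUNKS 1000
#define CHUNK   8                            /* doubles per chunk */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double chunks[NCHUNKS][CHUNK];    /* data to transfer   */
    static double packed[NCHUNKS * CHUNK];   /* aggregation buffer */

    if (size >= 2 && rank == 0) {
        /* Inefficient pattern: one small message per chunk,
         * paying the per-message latency NCHUNKS times. */
        for (int i = 0; i < NCHUNKS; ++i)
            MPI_Send(chunks[i], CHUNK, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        /* Better: pack once and send a single large message. */
        memcpy(packed, chunks, sizeof(packed));
        MPI_Send(packed, NCHUNKS * CHUNK, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (size >= 2 && rank == 1) {
        for (int i = 0; i < NCHUNKS; ++i)
            MPI_Recv(chunks[i], CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Recv(packed, NCHUNKS * CHUNK, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received both versions\n");
    }

    MPI_Finalize();
    return 0;
}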
We also generally want to avoid spending too much time in I/O, writing to or reading from the file system. Some
codes generate large amounts of data or restart by reading intermediate results saved on disk, and will
need a fast parallel file system (Lustre, for instance) to minimize the time spent on I/O.
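One common technique on parallel file systems is collective I/O, where all ranks write disjoint parts of a shared file in one coordinated operation; a minimal sketch (the file name and sizes are illustrative assumptions):

#include <mpi.h>

#define LOCAL_N 1048576    /* doubles written by each rank (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double local[LOCAL_N];          /* this rank's slice of the data */
    for (int i = 0; i < LOCAL_N; ++i)
        local[i] = rank;                   /* dummy payload */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its contiguous block; the collective call lets the
     * MPI-IO layer aggregate requests before they hit the file system. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}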
Finally, it is crucial to obtain information about memory usage, since it may affect the scaling of an
application. When the per-process memory usage is too high, MPI communications and the MPI memory
footprint can be reduced by running fewer MPI processes and mixing MPI with OpenMP. This may
lead to a significant gain in performance.
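A minimal sketch of such a hybrid MPI + OpenMP structure (the work loop is a placeholder, not code from any real application):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    /* Ask for thread support: only the master thread makes MPI calls here. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double data[N];
    double local_sum = 0.0;

    /* Fewer MPI ranks per node, each using several OpenMP threads,
     * reduces the number of MPI processes and their memory footprint. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < N; ++i) {
        data[i] = (double)i * 0.001;     /* placeholder work */
        local_sum += data[i];
    }

    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}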
Defining the best hardware architecture for today’s and tomorrow’s simulations
Of course, when optimizing an application or defining an HPC infrastructure for it, both current and future
technologies must be taken into consideration to maximize performance. Some of these future
technologies are the natural evolution of existing hardware, and their impact on programming models is
negligible. However, some newly emerged hardware components generate new programming paradigms.
In the quest for exascale computing, hardware accelerators based on either Graphic Processing Units (GPU)
or on the Many Integrated Core (MIC) architecture are one of the key elements to combine performance
and limited power consumption. There is no doubt that the future of supercomputers will rely on hybrid
architectures. Hence it is critical to start today to move user applications to these platforms, so as to get
the most out of heterogeneous or hybrid infrastructures (i.e. infrastructures combining standard CPU
resources and accelerators/coprocessors). This usually involves a complete reconsideration of the
application structure and algorithms. Expertise is also needed here.
Porting an application to hardware accelerators/coprocessors: a methodology
Independently of the application area, a generic methodology can be
used to port an application to either an accelerator or a coprocessor.
The methodology consists in following a series of steps starting with an
“as-is” execution of the application. As usual, this will provide the
programmer with a reference time and output data that will be
compared to the final accelerated version of the code. A profiling of
the application will be done to identify potential bottlenecks and
hotspots. In these hotspots one should then look for an adapted
parallelism for accelerators. It can be either parallelism over large
amounts of data, or independent work sharing (for instance: the same
calculation is done over different data sets).
Figure 3: Biotin (ligand in yellow/blue) docked in Streptavidin (protein in purple) - Source: URCA Reims, France
From this point, porting to accelerators/coprocessors can be started. It
generally consists in selecting the appropriate target platform (NVIDIA®
GPU or Intel® Xeon Phi™) and environment: CUDA, OpenCL, OpenACC for GPUs, native or offload modes for
Xeon Phi. Writing the compute kernels for the hotspots is the next step. It may rely directly on the original
algorithm and the use of existing optimized libraries (CUDA BLAS, CUDA FFT, MKL…) but sometimes a
complete redesign of the algorithm may be necessary. As already mentioned, the performance obtained
from accelerators is strongly related to how data are accessed in the device memory. To optimize these
memory accesses it is sometimes necessary to completely modify data structures.
Once the first version of the code is obtained, we validate the porting by comparing the numerical accuracy
of results. Then, we look for optimizations. In particular, it is critical to work on data reuse and data
persistency on the accelerator board. Since these devices are PCI-e cards, transfers over PCI-e connectors
can be an obvious bottleneck. Optimizations are obtained by taking care of data alignment, data size (do
we need everything on the card?) and, if possible, asynchronous transfers for overlap with computation.
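A minimal OpenACC-style sketch of data persistence and asynchronous execution on the device (OpenACC is one of the environments mentioned above; the array sizes and kernel are illustrative assumptions):

#include <stdio.h>

#define N 1000000
#define STEPS 100

int main(void)
{
    static float a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Keep the arrays resident on the accelerator for all time steps,
     * instead of copying them over PCI-e at every kernel launch. */
    #pragma acc data copy(a[0:N]) copyin(b[0:N])
    {
        for (int step = 0; step < STEPS; ++step) {
            /* Launch the kernel asynchronously so the host can overlap
             * other work (or transfers) with device computation. */
            #pragma acc parallel loop async(1)
            for (int i = 0; i < N; ++i)
                a[i] = 0.5f * a[i] + b[i];
        }
        #pragma acc wait(1)
    }   /* 'a' is copied back to the host only once, here. */

    printf("a[0] = %f\n", a[0]);
    return 0;
}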
The next step is to fine-tune the ported application for different architectures in the case of GPUs (the GK104 vs. GK110
architectures, for example), or to determine the best thread count and thread affinity (placement) for Xeon
Phi. For this tuning we use tools such as NVIDIA Visual Profiler or Intel VTune that help identify final
optimization possibilities.
Finally, multi-accelerator and/or hybrid CPU + accelerator implementations can also be
investigated, starting from the single-device code.
Portability and maintenance of these versions are usually good and, although new features and hardware
evolutions will bring more power and eliminate remaining bottlenecks, less work is needed once the
application has been ported. Many users today understand the necessity of moving to such technologies, but
the number of true experts remains limited. However, this shift must be made now to follow the natural
evolutions of hardware, dictated by the exascale quest and the need to increase performance while limiting
energy consumption.
Future challenges: optimizing power consumption at code level
The power consumption and power efficiency of a
supercomputer are becoming as important as
application performance itself. Optimizing energy
costs can be tackled at the hardware level, but the
idea is gradually gaining ground that power
consumption can be reduced at the code level itself.
Many customers wish to have tools to very precisely
measure power consumption during code execution
and to relate that, in time, to specific parts of the
application. The aim is to relate unusually high
power consumption directly to a few lines at the assembly
level. Projects (see the box about HDEEM) are emerging
to optimize energy consumption at this level of the
application code. However, due to the high frequency
of the CPU clock, hardware tools are necessary to
sample electrical consumption at such a frequency
and to address this critical issue.
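As an illustration of the principle (the power-reading function below is a hypothetical placeholder, not the HDEEM interface; a real sampler would integrate many samples per second), energy can be attributed to a code region by combining sampled power with the region's execution time:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder: in a real system this would query a hardware
 * power sensor (for example at several hundred samples per second). */
static double read_power_watts(void)
{
    return 200.0;   /* constant dummy value for illustration */
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_seconds();
    double p0 = read_power_watts();

    /* --- code region of interest (placeholder work) --- */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += 1e-9;
    /* --------------------------------------------------- */

    double t1 = now_seconds();
    double p1 = read_power_watts();

    double seconds = t1 - t0;
    double joules  = 0.5 * (p0 + p1) * seconds;  /* trapezoidal estimate */

    /* Dividing the region's floating-point operation count by 'joules'
     * would give its flops per Joule (equivalently, flops/W sustained). */
    printf("region: %.3f s, estimated energy %.1f J\n", seconds, joules);
    return 0;
}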
HDEEM is a project with two founding members: the University of Dresden and Bull. The goal is to expose very accurate power measurements (500 samples per second) to the end-user and to link them with the source code. To achieve this, Bull developed dedicated hardware. In this way, the probing system is not intrusive and does not change the performance of the monitored system.
Precise metrics are the prerequisite for optimization. With this information the programmer can improve the flops/W of their program, or implement different algorithms according to the criterion they want to optimize (time-to-result, watt-to-result).
One goal of this project is to build many other capabilities on top of it: policy management, an energy-aware batch scheduler… many extensions can be anticipated.
Whether it is imposed by end-users’ demands for ever more compute power and ever more precise simulations, or by natural hardware evolutions on the road to exascale, most applications need to be ported and optimized. Application developers cannot rely on the intrinsic power of supercomputers anymore. They need to face up to new programming environments, rethink the programming models, create new algorithms. They need new tools and they need experts.
BULL: AN INDISPUTABLE AND UNIQUE ACTOR
Bull’s Portfolio
Building on the success of its bullx supercomputers, mobull HPC containers, and extreme factory HPC cloud
services, Bull has become the trusted provider of HPC solutions in Europe. With products scaling from
midrange HPC to petaflop-scale supercomputers, Bull’s strategy is to provide systems that deliver efficient
productivity at scale. Bull’s solution portfolio targets flexibility:
• It is “open”: it is based on best-of-breed open standards and components. In this way, one component can be changed for another satisfying the same properties. The key benefit is that users and administrators can have a customized machine with the tools they are used to.
• It is “integrated”: even though some components can be changed, Bull R&D engineers have integrated everything into a consistent and efficient whole. Replication is minimized, useless parts are removed, configuration is fine-tuned.
• It is “modular”: components can be removed if useless for the customer. Obviously, the lighter the solution, the better.
This approach ensures that the resulting system is fast, accurate, productive and reliable. These properties
are especially true for the software environment, but Bull also applies them to the hardware infrastructure.
Indeed, Bull’s portfolio includes products to address all HPC needs: thin nodes, fat nodes, integration of
accelerators, storage… All can be mixed and fitted to any application requirements. Based on your business
workload and constraints, Bull experts can define the best supercomputer for your needs.
Centre for Excellence in Parallel Programming: the largest team of application experts in Europe
To help HPC users come to grips with parallelism, Bull launched its Center for Excellence in Parallel
Programming, the first European center of industrial and technical excellence for parallel programming.
Leveraging a team of experts unique in Europe - more than 300 engineers - the Centre for Excellence in
Parallel Programming delivers the highest level of expertise and skills to help organizations optimize their
applications to take advantage of the new many-core technologies. In partnership with Intel, the Center for
Excellence in Parallel Programming focuses especially on Intel Xeon and Intel Xeon Phi processors, and the
continued development and deployment of Intel compilers and tools.
Thanks to the Center for Excellence and its application experts, Bull has a unique capacity to address the
challenges of performance in HPC. Bull’s entire ecosystem is available to Bull engineers and customers (on
demand) for benchmarking and tests. Bull’s teams benefit from one of the largest test resources in the
world, including most of the components we propose to our customers. This includes a complete range of
server types, processors, storage solutions, hardware accelerators, remote visualization solutions, etc.
Bull has two main benchmarking facilities. One is based in our factory in Angers and relies on the latest Bull
blade solutions and Intel processor technology. It is intended for large-scale tests (several thousand CPU
cores) and presales activities. Another supercomputer is installed in our center in Grenoble, France. It is
more flexible and is dedicated to the exploration of future technologies.
Figure 4: Bull's large scale benchmarking facility (SID) architecture
Bull’s teams also gather the largest pool of engineers in Europe for application expertise and services. Our
application experts are specialized in optimizing codes on Bull solutions – this is their everyday job. They
have expert knowledge of BOTH applications AND Bull platforms. So when it comes to optimizing your
applications, they are much more efficient than end-users who need some time to get to know the
software environment, the supercomputer architecture, the file systems characteristics… The scientific
background of the Bull experts gives them a deep insight into the behavior and goal of the applications to
be ported and optimized, and largely facilitates the relationship with end-users.
The responsibilities of our application experts include the following:
• During deals, commit to performance figures on the proposed architecture. This operation requires very high skills, as the architecture is usually not available for real experiments.
• During acceptance tests, obtain the figures they committed to.
• Within the Centre for Excellence in Parallel Programming, be a source of expertise and perform Proofs of Concept.
• Provide training and high-level services.
• Present technical talks and papers at conferences, workshops…
• Explore future technology trends.
Bull’s unique expertise in the application field leverages the scientific background of its engineers in a very
large variety of areas (climate, life science, quantum chemistry, financial mathematics...). When customers
meet our experts, it is crucial that we both speak the same language, to understand the code, the science
behind it, and what is needed in terms of performance. We believe that improving the performance of an
application (not only optimizing the source code but also proposing the most suitable hardware architecture)
relies on both a good understanding of the scientific aspects and the way the code is implemented and
works.
However, although experience and expertise in the application field are mandatory for performance
optimization, evaluation tools are also needed. Another of Bull’s assets is that the software R&D
department is physically located on the same site as the application experts. Strong interactions,
motivated by our customers’ needs, have led to the development of Bull’s open monitoring and profiling
tools, which are used day to day by the Bull team and proposed to our customers so that they get full advantage of
their new Bull supercomputers. Developments around interconnects, MPI, I/O, accelerators, the batch
scheduler and power consumption monitoring are driven by actual requests from end users and
system administrators. Application experts are associated with these developments and rely on these tools in
their quest for more performance.
The future must also be explored today. Bull has a unique capacity, thanks to strong partnerships with
technological partners, to explore future technology trends in all the domains of HPC: processors,
accelerators and coprocessors, interconnect and storage. Partnerships with Intel and NVIDIA in particular
allow Bull’s teams to test and evaluate future generations of processors and GPUs early. For customers,
acquiring a new HPC facility is a process that lasts several months. Technologies that will be available by the
time the system is installed are usually not available when performance is evaluated on benchmarking
systems. The technology watch that Bull invests in allows us to properly evaluate performance on next-generation
hardware starting from today’s measurements.
Having access to future technologies also allows Bull to anticipate the redesign of algorithms that might be
necessary for applications to take full advantage of them. Bull is then strongly involved in algorithmic
research and has presented, over the last years, several significant results through articles or conference
presentations. More are in preparation.
• “Porting the BigDFT application to Xeon Phi” (SC’12 demo, jointly with CEA and Intel Parallel Labs) – technical paper ongoing.
• “Evaluation of DGEMM implementation on Intel Xeon Phi Coprocessor”, article presented at the International Conference on Computer Science and Information Technology SC13 conference, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).
• “Intel Xeon E5-2600v2 Family Based Cluster Performance Evaluation Using Scientific and Engineering Benchmarks”, article in preparation, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).
• “Evaluation of the 4th generation Intel Core Processor concentrating on HPC applications”, article in preparation, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).
Bull, through Proofs of Concept (POC) or joint collaborations, eases and accelerates the transition to these
new technologies for its customers. Bull’s expertise in this area is full of success stories in both academic and
research organizations (see for instance the article “Using GPUs for the Exact Alignment of Short-Read
Genetic Sequences by Means of the Burrows-Wheeler Transform” by Jose Salavert Torres, et al. in
Transactions on Computational Biology and Bioinformatics) and business companies.
A Start-Up’s Success Story
Bull’s application expertise is well recognized in academics
and research, but our experts also help start-ups and small
businesses. One of the most striking examples is the case of
EZ Biometrics.
EZ Biometrics is a French start-up that develops and provides
fingerprint biometrics solutions (AFIS) for integrators. They
propose hardware integration as well as expertise and service
in biometrics for both civilian and police usage. The strength
of this company lies in an accurate coder algorithm as well as
a fast matcher algorithm. The proposed solution is able to
perform high speed matching of a fingerprint against
potentially hundreds of millions of fingerprints. The current
software solution compares one specific fingerprint to all
fingerprints in the database and returns the records that
present the highest matching probability. To get the most
efficient algorithm (the performance of which is measured in
millions of matches per second) and to define the most
suitable hardware solution, EZ Biometrics relied on Bull’s
application engineers.
Since one fingerprint is matched against all fingerprints of the
database, one after the other, the fit with a
parallel computing process is obvious. Therefore, the most
efficient solution to reach the highest speed would be to use
as many computing cores as possible. With a hardware cluster
architecture, there is virtually no limit to the performance of
the matcher. However, a large, cumbersome cluster
architecture can be an issue, especially if portability is required
and if the hardware implementation cost must be kept
affordable.
TESTIMONY: “EZ Biometrics develops and sells biometric solutions. We are specialised in developing innovative software products for coding and matching fingerprints. We develop a high performance AFIS system for integrators that need biometric technologies.
The Bull CEPP has allowed us to access their ROBIN cluster remotely. The access was made through a secure internet connection. Our development team could appreciate this very easy way to get access to huge computation resources. Moreover, an application expert and a system administrator have supported us all along our project; this help has been a key element in our project’s success.
The ROBIN cluster is a great solution to speed up the development process with no need to provide integration effort and resources. It has allowed us to save a lot of money and time.
Thanks again to the Bull CEPP ROBIN cluster solution.”
Jean Henaff, CEO, EZ Biometrics
Bull and EZ Biometrics worked together on the porting of the
algorithm onto a hybrid architecture made of standard CPU servers and Intel® Xeon Phi™ co-processors.
Here, the word “co-processor” takes on its full meaning. The Xeon Phi co-processors are based on the
“Many Integrated Core” architecture, offering, in a PCI-e board form factor, around 60 x86 cores (over 240 hardware threads). Although
the working frequency of the co-processor cores is limited compared to standard Xeon processors, their high
number allows the number of matches within a single server to be dramatically increased.
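The structure of such a workload is illustrated by the sketch below (the match_score function, data layout and sizes are hypothetical placeholders, not EZ Biometrics’ code); the loop over the database is embarrassingly parallel and can be spread over CPU cores and co-processor cores alike:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DB_SIZE      100000
#define TEMPLATE_LEN 64

typedef struct { unsigned char features[TEMPLATE_LEN]; } template_t;

/* Hypothetical matcher: returns a similarity score for two templates. */
static double match_score(const template_t *a, const template_t *b)
{
    int diff = 0;
    for (int i = 0; i < TEMPLATE_LEN; ++i)
        diff += abs((int)a->features[i] - (int)b->features[i]);
    return 1.0 / (1.0 + diff);
}

int main(void)
{
    static template_t db[DB_SIZE];          /* fingerprint database (dummy) */
    template_t probe = { {0} };             /* fingerprint to identify      */

    double best_score = -1.0;
    long   best_index = -1;

    /* Each database entry is matched independently, so the loop can be
     * distributed over all available cores (host and co-processor). */
    #pragma omp parallel
    {
        double local_best = -1.0;
        long   local_idx  = -1;

        #pragma omp for nowait
        for (long i = 0; i < DB_SIZE; ++i) {
            double s = match_score(&probe, &db[i]);
            if (s > local_best) { local_best = s; local_idx = i; }
        }

        /* Combine per-thread results. */
        #pragma omp critical
        if (local_best > best_score) {
            best_score = local_best;
            best_index = local_idx;
        }
    }

    printf("best match: record %ld (score %.3f)\n", best_index, best_score);
    return 0;
}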
Bull brought its expertise in porting the algorithm onto the Xeon Phi architecture by taking advantage of
the available computing resources. This was done by combining and balancing the workload over standard
processors and Xeon Phi co-processors. This joint effort with Bull allowed EZ Biometrics to offer an
innovative solution in terms of integration and performance.
The Fast Start Program
Finally, Bull’s expertise in the application field is recognized and appreciated through the “Fast Start”
program. The Fast Start program is tailored to the needs of each customer, to successfully achieve their
specific project goals. It can be viewed as a “fully tailored effort.”
The Bull services begin with a systematic needs analysis that determines the project targets and their
priorities, so as to spread the effort efficiently. Then, at agreed time intervals,
progress meetings (or calls, or emails, as agreed) take place to share the status of the Fast Start
program: tasks already performed, remaining effort, status of each target, issues, mitigation plans…
Possible actions that can fall within the Fast Start program are:
• Compile the application or library from source code.
• Determine the best set of compilation options for an important code.
• Configure applications or libraries to allow them to use the supercomputer efficiently.
• Create scripts to launch an application efficiently.
• Install different versions of an application or a library.
• In case of issues, help in identifying the root cause.
• Rely on third-party experts (developers of applications we are in contact with) for help.
The Impact of Bull’s Application Development and Performance Tuning at Cardiff University
Established in 2007, Advanced Research Computing @Cardiff (ARCCA) provides, co-ordinates, supports and develops
computational services for Cardiff University researchers, enabling leading research that is far beyond the
capabilities of the average desktop or laptop computer. ARCCA’s choice of partner at the outset of these
developments was Bull, a choice driven by the promise of reliable, high-performing and cost-effective technology, and
the quality of their associated support. This promise has been amply demonstrated over the past six years as ARCCA
has seen a continual growth in the number of researchers registering to use its facilities; from ca. 30 in 2007 to over
430 registered users today.
There is no doubt that the heart of this partnership with Bull lies in the added value continually demonstrated by their team in the area of application development and performance tuning. They have far exceeded expectations throughout – from the start of service with their initial benchmarking of key Cardiff codes and the associated “fast-start” program, to trouble-shooting problem applications during the lifetime of the service. Bull has provided a level of proactive support not matched in our experience by any of their competitors, support that has included the secondment of key staff to the Cardiff site. This has been critical in terms of our being able to support the diverse community of users of the ARCCA services, from experienced practitioners running applications which scale over hundreds of cores to those just starting to consider the impact of computing on their research aspirations.
(Photo: The ARCCA Team)
The impact of this support on the ARCCA service was demonstrated in the competitive procurement for the new Cardiff supercomputer in 2012. Bull proved yet again to be our supplier of choice with the replacement bullx B500 Sandy Bridge based blade system in late 2012.
CONCLUSION by Addison Snell, CEO, Intersect360 Research
HPC is a tool for driving innovation, and as such, the technology must itself innovate in order to deliver
continuous improvements over time. But while the theoretical performance gains are continuous, the path to
achieve new levels of performance gains is discontinuous, as applications must be revisited to ensure they are
optimized for the benefits of new generations of technology.
This is the juncture the industry finds itself at today. Even without the springboard leap offered by many-core
accelerators, the transition to multi-core processors is enough to make application developers reconsider how
best to parallelize and optimize their codes. Intel Xeon Phi and NVIDIA GPU computing bring even more
options into consideration.
Which architecture to choose, and how to optimize for it, are questions that will be specific to each
application and circumstance. The resources and expertise offered by Bull and its partners can help end users
evaluate the best path for each application, and therefore, for their own innovations into the future.
W-HPCperformance--en1
For more information, contact us on www.bull.com/extreme-computing or at [email protected]
© Bull SAS - 2014 - RCS Versailles B 642 058 739 – All trademarks mentioned herein are the property of their respective
owners. Bull reserves the right to modify this document at any time without prior notice. Some offers or parts of offers described in this document may not be
available locally. Please contact your local Bull correspondent to know which offers are available in your country. This document has no contractual
significance.
UK: Bull Maxted Road, Hemel Hempstead, Hertfordshire HP2 7DZ / USA: Bull 300 Concord Road, Billerica, MA 01821
Bull - Rue Jean Jaurès - 78340 Les Clayes sous Bois – France
This flyer is printed on paper containing 40% eco-certified fibers originating from sustainably managed forests and 60% recycled fibers, in conformity with environmental regulations (ISO 14001).