white paper

UNLEASH YOUR HPC PERFORMANCE WITH BULL
Maximizing computing performance while reducing power consumption

Contents
EXECUTIVE SUMMARY
THE AUTHORS
MARKET DYNAMICS by Addison Snell, CEO, Intersect360 Research
PERFORMANCE CONSTRAINTS
BEST PRACTICES FOR APPLICATION OPTIMIZATION
BULL: AN INDISPUTABLE AND UNIQUE ACTOR
CONCLUSION by Addison Snell, CEO, Intersect360 Research

EXECUTIVE SUMMARY

The High Performance Computing (HPC) industry is marked by a relentless pursuit of performance, in a race to meet the demand for ever faster, more complex and more precise simulations. As a consequence, we are today in the early stages of a major transition in the prevailing HPC architecture, driven by the multiplication of compute cores, both within CPUs and through the addition of accelerators or coprocessors. Application designers will therefore have to rethink their programming models and/or their algorithms to fit the new architectures, and find solutions that reconcile computing performance with power consumption. This white paper identifies HPC performance inhibitors and presents the best practices that can be implemented to avoid them while optimizing energy efficiency.

THE AUTHORS

Mathieu Dubois is a software engineer in Bull's Applications & Performance Team, which he joined in 2009 after a PhD and several years' experience in nanotechnology research. His activities focus on all aspects of parallelization and code optimization, with a strong involvement in accelerator and co-processor development. He also has a strong knowledge of HPC architectures acquired through long experience in presales activities.

Xavier Vigouroux, after a PhD in distributed computing, worked for several major companies in different positions, from investigator at Sun Labs to support engineer at HP. He has now been working for Bull for six years. He led the HPC benchmarking team for the first five years and is now in charge of the "Education and Research" market for HPC at Bull.

MARKET DYNAMICS
by Addison Snell, CEO, Intersect360 Research

The high performance computing (HPC) industry continues to fuel new discoveries and capabilities in science, engineering, and business. At the high end, new problems are always on the horizon. There is no final frontier of science, nor an end to engineering, nor a perfect business that cannot be evolved for greater competitiveness. HPC technologies are the tools that can help organizations in their perpetual drive toward innovation. For buyers in academic and government research, HPC can accelerate the path to scientific discovery.
For commercial use cases, it does the same, but with the added metrics of quantifiable return on investment, as companies seek to speed time to market, improve product quality, reduce the costs of failures, and streamline operational efficiencies. Chris Willard, Chief Research Officer of Intersect360 Research, describes the striving attitude of HPC this way: "Once you solve the problem, it's no longer interesting. You don't need to design the same bridge twice. You move on to the next, harder bridge." This is the nature of the demand for ever-greater HPC performance, regardless of application.

In the relentless pursuit of performance, the HPC era has seen an evolution in the architectures deployed by the majority of the market. Vector processors gave way to scalar, RISC was supplanted by x86, and UNIX was replaced by Linux in the majority of HPC installations. Each of these transitions came with implied changes in the software models that would best leverage the hardware, in order to deliver performance at scale for real applications. Today we are in the early stages of another such transition, as the industry adopts multi-core and many-core processing technologies.

Figure 1: Average Cores per Processor by Year of Acquisition for Distributed Memory Systems. Source: Intersect360 Research, 2014.

Strictly in the x86 paradigm, the frequency race has ended. Rather than chasing Moore's Law with faster gigahertz ratings, processor manufacturers such as Intel are delivering greater performance by putting more cores on each socket. This delivers more floating point operations per dollar by introducing a new level of parallelism at the chip level. The transition to multi-core began several years ago, and it has now evolved such that average HPC systems currently in deployment have more than eight cores per processor, with further increases on the horizon [1]. (See Figure 1.)

[1] Intersect360 Research HPC market advisory service, "HPC User Site Census: Processors," October 2013

And this is not the only processing technology transition in the market. For applications that can benefit from even greater parallelism, models of many-core processing are now available, in the form of Intel Xeon Phi and NVIDIA GPU processing elements. In either case, a supplemental "many-core" processing element, containing hundreds of individual cores, acts as a secondary computational accelerator that can boost performance when called upon. Accelerators are not new to HPC, but they were traditionally held back by three dominant constraints. Each of these has been addressed by both NVIDIA and Intel, albeit in different ways.

1. Pace of development: The developers of low-volume, custom co-processors have been unable to maintain a development schedule that keeps pace with high-volume microprocessor markets. Intel, of course, is accustomed to its "tick-tock" drumbeat of new releases, while NVIDIA is driven by the rate of change in the high-volume gaming market.

2. Programmability: Co-processing elements need to be explicitly called by the application; this puts a burden on programmers to insert these calls into their codes. NVIDIA revolutionized the approach with its CUDA development environment, allowing programmers to interact with GPU accelerators in standardized ways. Intel uses extensions of its x86 tool sets for Intel Xeon Phi programming.
3. Latency: Individual functions can be accelerated by co-processors, but the acceleration gained must be sufficient to overcome the latency hit endured by moving them off of the microprocessor. NVIDIA has supplemented evolutions to faster PCI-E connections with its own features designed to reduce latency, such as NVLink and direct-memory connections. Intel Xeon Phi combines processor and co-processor onto a single chip.

With these technology advancements, more and more HPC users are adopting accelerators [2]. (See Figure 2.) Most of these deployments are on NVIDIA GPUs currently, but Intel Xeon Phi, later to the market, is now seeing significant testing among end users as well. In either case, end users must evaluate how best to leverage the many-core parallelism in this new performance scheme.

Figure 2: Systems with Accelerators by Year of Last Modification. Source: Intersect360 Research, 2014.

[2] Intersect360 Research HPC market advisory service, "HPC User Site Census: Processors," October 2013

Of course, this adoption is not merely technology for the sake of technology, nor even performance for the sake of performance. The HPC community continues to seek out leaps in performance for the same reasons it always has: to drive new discoveries, to accelerate innovation, and to improve competitiveness. The discontinuous gain offered by these new technologies does require software changes, but it also can enable new simulations or techniques, making the previously impossible now possible.

With so much at stake, Bull is investing in initiatives to help the HPC user community evaluate, implement, and optimize new processing technologies. Access to hardware is part of the solution, but more importantly, so is access to expertise. These technology transitions are happening; end users need to figure out how best to deploy them. Resources like Bull's Centre for Excellence in Parallel Programming and the Fast Start program are designed to help scientists and engineers with the technology transitions that will fuel their next generations of innovation.

PERFORMANCE CONSTRAINTS

A new HPC paradigm for performance

For many years, the HPC paradigm was that the regular increase of processor frequency and the improvement of processor micro-architecture brought regular performance gains without any pain. You simply had to copy your code to the new machine, launch it, and directly get more performance. This natural evolution came to an end a few years back, when processor manufacturers, faced with the power consumption wall, started to increase intrinsic performance by increasing the number of compute cores on a single socket. To get the most out of these new CPU architectures, programmers and developers had to rethink their applications in terms of parallelism.

Today, the architecture of HPC supercomputers is designed to allow applications to take advantage of this parallelism. The building block is a compute node including twenty or so CPU cores, as well as memory and disks. The individual compute nodes are interconnected through a fast network, commonly based on InfiniBand technology. There are potentially no limits to performance gains along that path, helped by ever-faster memory DIMMs and interconnect networks. Today, the main limitation lies within the code itself and in the programmers' abilities.
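As a purely illustrative sketch of the two levels of parallelism this architecture exposes (MPI ranks distributed across compute nodes, OpenMP threads spread over the twenty or so cores of each node), the fragment below computes a distributed sum. The array, problem size and computation are invented for the example and are not taken from any particular code.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL 1000000  /* illustrative per-rank problem size */

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);                  /* typically one rank per node or per socket */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *x = malloc(N_LOCAL * sizeof(double));
    double local_sum = 0.0, global_sum = 0.0;

    /* Second level of parallelism: OpenMP threads over the cores of the node */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < N_LOCAL; i++) {
        x[i] = (double)(rank + i) * 1.0e-6;  /* placeholder computation */
        local_sum += x[i] * x[i];
    }

    /* First level of parallelism: communication between nodes over the interconnect */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (%d ranks, %d threads per rank)\n",
               global_sum, nranks, omp_get_max_threads());

    free(x);
    MPI_Finalize();
    return 0;
}
```

Compiled, for instance, with an MPI compiler wrapper and OpenMP enabled (e.g. mpicc -fopenmp), and launched with one rank per node and the threads pinned to that node's cores, the same source exploits both the interconnect between nodes and the cores inside each node.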
Most applications cannot yet handle the new performance challenges imposed by a hardware evolution that trades more performance for less power consumption. Hence, programmers are struggling to extract more parallelism, handle more hybrid configurations (accelerators and co-processors are today's premium choice for extra compute power with limited impact on energy cost), and support more heterogeneity. Applications do not make the most of the latest hardware evolutions, and source codes need to be modified to benefit from the latest technological progress.

Three inhibitors to performance

Furthermore, a given code will usually be run on different supercomputers with different key characteristics. Being efficient on each of them requires adapting some parts of the code, so that each implementation fits its target architecture. More precisely, three factors or inhibitors govern scientific parallel application performance:

• the sensitivity to CPU frequency,
• the sensitivity to memory bandwidth,
• the time devoted to communications and I/O.

Depending on the physical or scientific aspects of the code and of the implemented algorithms, a given application will be strongly impacted by one or several of these inhibitors. It is the responsibility of HPC vendors to direct customers to the right architecture based on a deep analysis and understanding of the end-user's applications. On the other hand, coding methods can sometimes have an impact and artificially weigh on one of these inhibitors. Only code profiling and optimization can then remove these bottlenecks.

The power consumption issue

Application performance is also constrained by energy costs. Beyond optimizing the execution time of an application and the ability to get results faster, optimizing the overall electrical consumption has become a critical issue for most datacenters. There are two ways to address the energy issue. On one hand, optimizing the code makes it possible to use a smaller system to reach a given performance level. On the other hand, for a given supercomputer size, code optimization provides a higher application throughput under the same power cap. Either way, optimizing the execution time of an application delivers more results for every Watt consumed by the system.

Thus, computer simulation requires not just a supercomputer with hundreds of cores, petabytes of memory, and zettabytes of storage, but a continuum of components from the hardware to the final result, including the software, mathematical libraries, programming skills and analysis. To be fast, accurate, productive and reliable, all these components must be tightly integrated and run smoothly together. If one component is weak, then the whole chain is weak, and the business is at risk. Integrating all these components requires many different skills, the most important of which is the ability to port and optimize applications on the most appropriate platform.

BEST PRACTICES FOR APPLICATION OPTIMIZATION

A new area of expertise

To take up the performance challenges, application expertise has become mandatory. Understanding the purpose and requirements of scientific applications is a key factor in proposing the best-suited, most efficient solution to customers and in offering a satisfying user experience.
With the petascale, and now exascale, quests, application engineers at HPC companies must progress from benchmarkers to real experts who understand the major performance hurdles, can propose adapted hardware solutions, and can optimize codes. This can only be achieved by thoroughly mastering today's and tomorrow's HPC architectures and programming environments.

Benchmarking remains the first step for application optimization. It is crucial to start with the best possible execution time of the code "as is" (with no source modification), by testing different hardware platforms (for example different processor types, different interconnect topologies...), mathematical libraries, software environments, compilers and compilation options. This will serve as a reference time for future optimization.

InfiniBand topologies are one of the critical components in HPC architectures. As the number of cores increases, MPI communications become more and more time consuming. An optimized interconnect network is obtained by maintaining identical bandwidth at each level of the network. Introducing "pruning" in a network consists of sharing connections between two or more points in the network. This reduces bandwidth but, in cases where communications are not critical, it more importantly reduces the number of switches and cables, optimizing both cost and power consumption.

First, start by profiling the application...

Once the reference is obtained, the application needs to be profiled. As previously detailed, the execution time of a code is determined by three factors: the CPU frequency, the memory bandwidth, and communications and inputs/outputs. Being able to measure and evaluate those three parts is the key to proposing the most efficient machine for a given code, and it is the starting point for optimization. There is no point, for instance, in optimizing floating point operations if the code spends most of its time in MPI communications. In that case, it is more relevant to optimize communication patterns or algorithms. On the other hand, if communications are marginal, the global infrastructure can be optimized by introducing pruning in the interconnect network, while optimization efforts should address the other identified bottlenecks.

Whether they are home-made or third-party tools, profilers and debuggers can help software engineers detect performance bottlenecks and understand the algorithmic and software structure of applications. Some of them provide the "critical path", built from a sequence of computations and communications; this path is very useful because its duration is the duration of the whole program. To reduce the execution time of the application, the programmer must reduce the components of the critical path. Any other modification would be useless.

... Before trying to optimize

More generally, a detailed profiling of the code is the starting point for any optimization work.

Figure 1: Minimal output of Bull's bullxprof tools

First, we try to get a global overview of the application's behavior. Performance can be broken down to analyze the time-consuming components (CPU, communications, I/O) of the application. In the above example, 80% of the total execution time is spent in MPI communications, I/O is negligible, and 20% of the time (USER) is spent running the application code; memory accesses are counted as CPU time. This overview of the application gives precious information for optimization.
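Profilers such as bullxprof produce this breakdown automatically by intercepting MPI calls (typically through the standard PMPI profiling interface). Purely to illustrate the principle behind the CPU/MPI split shown above, the hedged sketch below hand-times a single communication call through an invented wrapper and reports the worst per-rank MPI fraction.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative only: a real profiler intercepts every MPI call automatically;
   this sketch hand-times one call to show how a CPU / MPI breakdown is built. */
static double t_mpi = 0.0;               /* accumulated time inside MPI */

static int timed_allreduce(void *sendbuf, void *recvbuf, int count,
                           MPI_Datatype type, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm);
    t_mpi += MPI_Wtime() - t0;
    return err;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_start = MPI_Wtime();

    /* ... application work would go here, calling timed_allreduce() in place
       of MPI_Allreduce(); one token call keeps the sketch runnable ... */
    double in = (double)rank, out = 0.0;
    timed_allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double t_total  = MPI_Wtime() - t_start;
    double mpi_frac = (t_total > 0.0) ? t_mpi / t_total : 0.0;
    double worst    = 0.0;
    MPI_Reduce(&mpi_frac, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("worst rank spends %.1f%% of its time in MPI\n", 100.0 * worst);

    MPI_Finalize();
    return 0;
}
```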
High values for the time spent running the application code are usually what we want. If this share is only average, we can further investigate scalar and vector numeric operations and memory accesses. A high volume of memory accesses means that the per-core performance is memory-bound. We can then use a profiler to identify time-consuming loops and check their cache performance. If little time is spent in vectorized instructions, one might want to check the compiler's vectorization advice to see why key loops could not be vectorized.

High values for MPI communications are usually bad. This means that the application spends more time communicating than performing actual computation. A detailed MPI profiling makes it possible to determine whether communications are point-to-point or collective and to obtain the transfer rate of both types. Low transfer rates can be caused by inefficient message sizes, such as many small messages, or by imbalanced workloads causing processes to wait. The MPI profiler can then be used to identify the problematic calls and ranks.

We also generally want to avoid spending too much time in I/O, writing to or reading from the file system. Some codes generate large amounts of data, or restart by reading intermediate results saved on disk, and will need a fast parallel file system (Lustre, for instance) to minimize the time spent on I/O.

Finally, it is crucial to obtain information about memory usage, since it may affect the scaling of an application. When the per-process memory usage is too high, MPI communications and the MPI memory footprint can be reduced by running fewer MPI processes and mixing in OpenMP threading. This may lead to a significant gain in performance.

Defining the best hardware architecture for today's and tomorrow's simulations

Of course, when optimizing an application or defining an HPC infrastructure for it, both current and future technologies must be taken into consideration to maximize performance. Some of these future technologies are the natural evolution of existing hardware, and their impact on programming models is negligible. However, some newly emerged hardware components generate new programming paradigms. In the quest for exascale computing, hardware accelerators based either on Graphics Processing Units (GPU) or on the Many Integrated Core (MIC) architecture are one of the key elements for combining performance with limited power consumption. There is no doubt that the future of supercomputers will rely on hybrid architectures. Hence it is critical to start moving user applications to these platforms today, so as to get the most out of heterogeneous or hybrid infrastructures (i.e. infrastructures combining standard CPU resources and accelerators/coprocessors). This usually involves a complete reconsideration of the application structure and algorithms. Expertise is also needed here.

Porting an application to hardware accelerators/coprocessors: a methodology

Independently of the application area, a generic methodology can be used to port an application to either an accelerator or a coprocessor. The methodology consists of a series of steps, starting with an "as-is" execution of the application. As usual, this provides the programmer with a reference time and output data that will be compared against the final accelerated version of the code. The application is then profiled to identify potential bottlenecks and hotspots.
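Before detailing those steps, a hedged illustration may help fix ideas: the loop below applies the same independent update to every element of large arrays, which is exactly the kind of data-parallel hotspot that profiling typically surfaces. Names and coefficients are invented, and the OpenACC directive (one of the environments mentioned in the next step) only sketches what a first offloaded form can look like.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 10000000  /* illustrative problem size */

/* Hypothetical hotspot: the same independent update applied to a large array.
   Every iteration is independent of the others, so the loop exposes
   parallelism over large amounts of data. */
void hotspot(const double *a, const double *b, double *c, long n)
{
    /* A first OpenACC port only requires annotating the loop; the data
       clauses control what travels over the PCI-e link. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (long i = 0; i < n; i++)
        c[i] = 0.5 * a[i] + 2.0 * b[i] * b[i];
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = (double)i; b[i] = 1.0 / (i + 1); }

    hotspot(a, b, c, N);          /* candidate for offload to a GPU or Xeon Phi */
    printf("c[42] = %f\n", c[42]);

    free(a); free(b); free(c);
    return 0;
}
```

Built without an OpenACC-capable compiler, the directive is simply ignored and the loop runs on the host CPU, which conveniently keeps a single source during porting experiments.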
In these hotspots one should then look for a form of parallelism adapted to accelerators. It can be either parallelism over large amounts of data, or independent work sharing (for instance, the same calculation applied to different data sets).

Figure 3: Biotin (ligand in yellow/blue) docked in Streptavidin (protein in purple). Source: URCA Reims, France.

From this point, porting to accelerators/coprocessors can be started. It generally consists of selecting the appropriate target platform (NVIDIA® GPU or Intel® Xeon Phi™) and environment: CUDA, OpenCL or OpenACC for GPUs, native or offload modes for Xeon Phi. Writing the compute kernels for the hotspots is the next step. It may rely directly on the original algorithm and the use of existing optimized libraries (CUDA BLAS, CUDA FFT, MKL...), but sometimes a complete redesign of the algorithm may be necessary. As already mentioned, the performance obtained from accelerators is strongly related to how data are accessed in the device memory. To optimize these memory accesses it is sometimes necessary to completely modify data structures.

Once the first version of the code is obtained, we validate the porting by comparing the numerical accuracy of the results. Then, we look for optimizations. In particular, it is critical to work on data reuse and data persistency on the accelerator board. Since these devices are PCI-e cards, transfers over the PCI-e connector can be an obvious bottleneck. Optimizations are obtained by taking care of data alignment, data size (do we need everything on the card?) and, if possible, asynchronous transfers that overlap with computation.

The next step is to fine-tune the ported application for different architectures in the case of GPUs (GK104 vs. GK110, for example), or to determine the best thread count and thread affinity (placement) for Xeon Phi. For this tuning we use tools such as NVIDIA Visual Profiler or Intel VTune that help identify the final optimization opportunities. Finally, multi-accelerator and/or hybrid CPU + accelerator implementations can also be investigated, starting from the single-device code. Portability and maintenance of these versions are usually good and, although new features and hardware evolutions will bring more power and eliminate remaining bottlenecks, less work is needed once the application has been ported.

Many users understand today the necessity of moving to such technologies, but the number of true experts remains limited. However, this shift must be made now to follow the natural evolution of hardware, dictated by the exascale quest and the need to increase performance while limiting energy consumption.

Future challenges: optimizing power consumption at code level

The power consumption and power efficiency of a supercomputer are becoming as important as application performance itself. Optimizing energy costs can be tackled at the hardware level, but the idea is gradually gaining ground that power consumption can also be reduced at the code level itself. Many customers wish to have tools to measure power consumption very precisely during code execution and to relate it, over time, to specific parts of the application. The aim is to directly relate a costly power consumption to a few lines at the assembly level. Projects (see the box about HDEEM) are emerging to optimize energy consumption at this level of the application code.
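At the source level, the principle such projects build on is simply to bracket a code region with energy and time readings and to relate Joules and average Watts to that region alone. The sketch below illustrates only that principle: it does not reproduce HDEEM's actual interface, and the fake 300 W counter exists purely so the example compiles and runs.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins: a real system reads a hardware energy counter;
   here the counter is faked (a constant 300 W draw) purely so the sketch
   compiles and runs and the bracketing principle can be shown. */
static double wallclock(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

static double energy_read(void)            /* fake counter, in Joules */
{
    return 300.0 * wallclock();
}

static void solver_step(void)              /* hypothetical application hotspot */
{
    volatile double s = 0.0;
    for (long i = 0; i < 50000000L; i++) s += 1.0 / (i + 1);
}

int main(void)
{
    /* Principle: bracket a code region with energy and time readings,
       then attribute Joules and average Watts to that region alone. */
    double e0 = energy_read(), t0 = wallclock();
    solver_step();
    double joules = energy_read() - e0;
    double secs   = wallclock()  - t0;

    printf("solver_step: %.1f J, %.1f W average, %.3f s\n",
           joules, joules / secs, secs);
    return 0;
}
```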
However, due to the high frequency of the CPU clock, hardware tools are necessary to sample electrical consumption at such a rate and to address this critical issue.

HDEEM is a project with two founding members: the University of Dresden and Bull. The goal is to expose very accurate power measurements (500 samples per second) to the end-user and to link them with the source code. To achieve this, Bull developed dedicated hardware, so that the probing system is not intrusive and does not change the performance of the monitored system. Precise metrics are the prerequisite for optimization. With this information the programmer can improve the flops/W of their program, or implement different algorithms according to the criterion they want to optimize (time-to-result, watt-to-result). One goal of this project is to build many other services on top of it: policy management, energy-aware batch scheduling... many extensions can be anticipated.

Whether it is imposed by end-users' demands for ever more compute power and ever more precise simulations, or by the natural hardware evolutions on the road to exascale, most applications need to be ported and optimized. Application developers cannot rely on the intrinsic power of supercomputers anymore. They need to face up to new programming environments, rethink their programming models, and create new algorithms. They need new tools and they need experts.

BULL: AN INDISPUTABLE AND UNIQUE ACTOR

Bull's Portfolio

Building on the success of its bullx supercomputers, mobull HPC containers, and extreme factory HPC cloud services, Bull has become the trusted provider of HPC solutions in Europe. With products scaling from midrange HPC to petaflop-scale supercomputers, Bull's strategy is to provide systems that deliver efficient productivity at scale.

Bull's solution portfolio targets flexibility. It is "open": it is based on best-of-breed open standards and components, so one component can be exchanged for another with the same properties. The key benefit is that users and administrators can have a customized machine with the tools they are used to. It is "integrated": even though some components can be changed, Bull R&D engineers have integrated everything into a consistent and efficient whole. Replication is minimized, useless parts are removed, and the configuration is fine-tuned. It is "modular": components can be removed if the customer does not need them. Obviously, the lighter the solution, the better. This approach ensures that the resulting system is fast, accurate, productive and reliable. These properties apply especially to the software environment, but Bull also applies them to the hardware infrastructure. Indeed, Bull's portfolio includes products to address all HPC needs: thin nodes, fat nodes, integration of accelerators, storage... All can be mixed and fitted to any application requirements. Based on your business workload and constraints, Bull experts can define the best supercomputer for your needs.

Centre for Excellence in Parallel Programming: the largest team of application experts in Europe

To help HPC users come to grips with parallelism, Bull launched its Centre for Excellence in Parallel Programming, the first European center of industrial and technical excellence for parallel programming.
Leveraging a team of experts unique in Europe (more than 300 engineers), the Centre for Excellence in Parallel Programming delivers the highest level of expertise and skills to help organizations optimize their applications to take advantage of the new many-core technologies. In partnership with Intel, the Centre for Excellence in Parallel Programming focuses especially on Intel Xeon and Intel Xeon Phi processors, and on the continued development and deployment of Intel compilers and tools. Thanks to the Centre for Excellence and its application experts, Bull has a unique capacity to address the challenges of performance in HPC.

Bull's entire ecosystem is available to Bull engineers and customers (on demand) for benchmarking and tests. Bull's teams benefit from one of the largest test resources in the world, including most of the components we propose to our customers: a complete range of server types, processors, storage solutions, hardware accelerators, remote visualization solutions, etc. Bull has two main benchmarking facilities. One is based in our factory in Angers and relies on the latest Bull blade solutions and Intel processor technology; it is intended for large-scale tests (several thousand CPU cores) and presales activities. Another supercomputer is installed in our center in Grenoble, France. It is more flexible and is dedicated to the exploration of future technologies.

Figure 4: Bull's large scale benchmarking facility (SID) architecture

Bull's teams also gather the largest pool of engineers in Europe for application expertise and services. Our application experts are specialized in optimizing codes on Bull solutions; this is their everyday job. They have expert knowledge of BOTH applications AND Bull platforms. So when it comes to optimizing your applications, they are much more efficient than end-users, who need some time to get to know the software environment, the supercomputer architecture, the file system characteristics... The scientific background of the Bull experts gives them a deep insight into the behavior and goals of the applications to be ported and optimized, and greatly facilitates the relationship with end-users.

The skills of our application experts include the following:

• During deals, commit to performance figures on the proposed architecture. This requires very high skills, as the architecture is usually not available for real experiments.
• During acceptance tests, obtain the figures they committed to.
• Within the Centre for Excellence in Parallel Programming, be a source of expertise and perform Proofs of Concept.
• Provide training and high-level services.
• Present technical talks and papers at conferences and workshops.
• Explore future technology trends.

Bull's unique expertise in the application field leverages the scientific background of its engineers in a very large variety of areas (climate, life science, quantum chemistry, financial mathematics...). When customers meet our experts, it is crucial that we both speak the same language, to understand the code, the science behind it, and what is needed in terms of performance. We believe that improving the performance of an application (not only optimizing the source code but also proposing the most suitable hardware architecture) relies on both a good understanding of the scientific aspects and of the way the code is implemented and works.
However, although experience and expertise in the application field are mandatory for performance optimization, evaluation tools are also needed. Another of Bull's assets is that the software R&D department is physically located on the same site as the application experts. Strong interactions, motivated by our customers' needs, have led to the development of Bull's open monitoring and profiling tools, which are used day to day by the Bull teams and proposed to our customers so they can take full advantage of their new Bull supercomputers. Tools around interconnects, MPI, I/O, accelerators, batch scheduling and power consumption monitoring are developed based on actual requests from end users and system administrators. Application experts are associated with their development and rely on these tools in their quest for more performance.

The future must also be explored today. Bull has a unique capacity, thanks to strong relationships with technological partners, to explore future technology trends in all the domains of HPC: processors, accelerators and coprocessors, interconnect and storage. Partnerships with Intel and NVIDIA in particular allow Bull's teams to test and evaluate future generations of processors and GPUs early. For customers, acquiring a new HPC facility is a process that takes several months. Technologies that will be available by the time the system is installed are usually not available when performance is evaluated on benchmarking systems. The technology watch that Bull invests in allows us to properly evaluate performance on next-generation hardware starting from today's measurements. Having access to future technologies also allows Bull to anticipate the redesign of algorithms that might be necessary for applications to take full advantage of them.

Bull is therefore strongly involved in algorithmic research and has presented, over recent years, several significant results through articles and conference presentations. More are in preparation.

• "Porting the BigDFT application to Xeon Phi" (SC'12 demo, jointly with CEA and Intel Parallel Labs); technical paper ongoing.
• "Evaluation of DGEMM implementation on Intel Xeon Phi Coprocessor", article presented at the International Conference on Computer Science and Information Technology, SC13 conference, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).
• "Intel Xeon E5-2600v2 Family Based Cluster Performance Evaluation Using Scientific and Engineering Benchmarks", article in preparation, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).
• "Evaluation of the 4th generation Intel Core Processor concentrating on HPC applications", article in preparation, jointly with Pawel Gepner and Victor Gamayunov (Intel Corp.).

Bull, through Proofs of Concept (POC) or joint collaborations, eases and accelerates the transition to these new technologies for its customers. Bull's expertise in this area is full of success stories in both academic and research organizations (see for instance the article "Using GPUs for the Exact Alignment of Short-Read Genetic Sequences by Means of the Burrows-Wheeler Transform" by Jose Salavert Torres, et al. in Transactions on Computational Biology and Bioinformatics) and business companies.

A Start-Up's Success Story

Bull's application expertise is well recognized in academia and research, but our experts also help start-ups and small businesses. One of the most striking examples is the case of EZ Biometrics.
EZ Biometrics is a French start-up that develops and provides fingerprint biometrics solutions (AFIS) for integrators. They offer hardware integration as well as expertise and services in biometrics for both civilian and police usage. The strength of this company lies in an accurate coder algorithm as well as a fast matcher algorithm. The proposed solution is able to perform high-speed matching of a fingerprint against potentially hundreds of millions of fingerprints. The current software solution compares one specific fingerprint to all fingerprints in the database and returns the records with the highest matching probability.

To get the most efficient algorithm (whose performance is measured in millions of matches per second) and to define the most suitable hardware solution, EZ Biometrics relied on Bull's application engineers. Since one fingerprint is matched, one after the other, against all fingerprints in the database, the fit with parallel computing is obvious. Therefore, the most efficient solution to reach the highest speed would be to use as many computing cores as possible. With a hardware cluster architecture, there is virtually no limit to the performance of the matcher. However, a large, cumbersome cluster architecture can be an issue, especially if portability is required and if the hardware implementation cost must be kept affordable.

TESTIMONY: "EZ Biometrics develops and sells biometric solutions. We are specialised in developing innovative software products for coding and matching fingerprints. We develop a high performance AFIS system for integrators that need biometric technologies. The Bull CEPP has allowed us to access their ROBIN cluster remotely, through a secure internet connection. Our development team appreciated this very easy way to get access to huge computation resources. Moreover, an application expert and a system administrator supported us all along our project; this help has been a key element in our project's success. The ROBIN cluster is a great solution to speed up the development process with no need for integration effort and resources. It has allowed us to save a lot of money and time. Thanks again to the Bull CEPP ROBIN cluster solution." Jean Henaff, CEO, EZ Biometrics

Bull and EZ Biometrics worked together on porting the algorithm onto a hybrid architecture made of standard CPU servers and Intel® Xeon Phi™ co-processors. Here, the word "co-processor" takes on its full meaning. The Xeon Phi co-processors are based on the "Many Integrated Core" architecture, offering over 60 x86 cores in a PCI-e board form factor. Although the working frequency of the co-processor cores is limited compared to standard Xeon processors, their sheer number dramatically increases the number of matches that can be performed within a single server. Bull brought its expertise in porting the algorithm onto the Xeon Phi architecture while taking advantage of all the available computing resources, by combining and balancing the workload over standard processors and Xeon Phi co-processors. This joint effort with Bull allowed EZ Biometrics to offer an innovative solution in terms of integration and performance.
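The coder and matcher algorithms themselves are EZ Biometrics' proprietary technology, so the hedged sketch below only illustrates the structure that makes the problem so parallel: one probe template compared independently against every record of a gallery, with the comparisons shared over all available cores (the same work-sharing idea extends to the co-processor's cores in offload or native builds). The types, names and toy scoring function are all invented.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Invented template type and scoring function; only the parallel search
   structure is illustrated here. */
typedef struct { float features[64]; } template_t;

static float match_score(const template_t *probe, const template_t *candidate)
{
    float s = 0.0f;
    for (int i = 0; i < 64; i++) {
        float d = probe->features[i] - candidate->features[i];
        s += d * d;
    }
    return -s;                               /* higher score = better match */
}

int main(void)
{
    long n = 200000;                         /* illustrative gallery size */
    template_t *gallery = malloc(n * sizeof(template_t));
    template_t probe = { { 0 } };
    for (long i = 0; i < n; i++)
        for (int j = 0; j < 64; j++) gallery[i].features[j] = (float)((i + j) % 7);

    float best = -1e30f; long best_idx = -1;

    /* Each comparison is independent, so the search scales with the number
       of cores that take a share of the gallery. */
    #pragma omp parallel
    {
        float my_best = -1e30f; long my_idx = -1;
        #pragma omp for nowait
        for (long i = 0; i < n; i++) {
            float s = match_score(&probe, &gallery[i]);
            if (s > my_best) { my_best = s; my_idx = i; }
        }
        #pragma omp critical
        if (my_best > best) { best = my_best; best_idx = my_idx; }
    }

    printf("best match: record %ld (score %.2f), %d threads\n",
           best_idx, best, omp_get_max_threads());
    free(gallery);
    return 0;
}
```

The number of participating cores is simply the number of OpenMP threads (e.g. set via OMP_NUM_THREADS); the balancing between host Xeon processors and Xeon Phi co-processors that Bull and EZ Biometrics implemented is not reproduced here.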
The Fast Start Program

Finally, Bull's expertise in the application field is recognized and appreciated through the "Fast Start" program. The Fast Start program is tailored to the needs of each customer, to successfully achieve their specific project goals. It can be viewed as a "fully tailored effort." The Bull services begin with a systematic needs analysis that determines the project targets and their priorities, so as to spread the effort efficiently. Then, at agreed time intervals, progress meetings (or calls, or emails, as agreed) take place to share the status of the Fast Start program: tasks already performed, remaining effort, status for each target, issues, mitigation plans...

Possible actions that can fall within the Fast Start program are:

• Compile the application or library from its source code.
• Determine the best set of compilation options for an important code.
• Configure applications or libraries to allow them to use the supercomputer efficiently.
• Create scripts to launch an application efficiently.
• Install different versions of an application or a library.
• In case of issues, help track down the root cause.
• Rely on third-party experts (developers of applications we are in contact with) for help.

The Impact of Bull's Application Development and Performance Tuning at Cardiff University

Established in 2007, Advanced Research Computing @Cardiff (ARCCA) provides, co-ordinates, supports and develops computational services for Cardiff University researchers, enabling leading research that is far beyond the capabilities of the average desktop or laptop computer. ARCCA's choice of partner at the outset of these developments was Bull, a choice driven by the promise of reliable, performing and cost-effective technology, and the quality of the associated support. This promise has been amply demonstrated over the past six years, as ARCCA has seen continual growth in the number of researchers registering to use its facilities: from ca. 30 in 2007 to over 430 registered users today.

There is no doubt that the heart of this partnership with Bull lies in the added value continually demonstrated by their team in the area of application development and performance tuning. They have far exceeded expectations throughout, from the start of service with their initial benchmarking of key Cardiff codes and the associated "fast-start" program, to trouble-shooting problem applications during the lifetime of the service. Bull has provided a level of proactive support not matched in our experience by any of their competitors, support that has included the secondment of key staff to the Cardiff site. This has been critical in terms of our being able to support the diverse community of users of the ARCCA services, from experienced practitioners running applications which scale over hundreds of cores to those just starting to consider the impact of computing on their research aspirations. The impact of this support on the ARCCA service was demonstrated in the competitive procurement for the new Cardiff supercomputer in 2012. Bull proved yet again to be our supplier of choice, with the replacement bullx B500 Sandy Bridge based blade system delivered in late 2012.

(Photo: The ARCCA Team)

CONCLUSION
by Addison Snell, CEO, Intersect360 Research

HPC is a tool for driving innovation, and as such, the technology must itself innovate in order to deliver continuous improvements over time.
But while the theoretical performance gains are continuous, the path to achieving new levels of performance is discontinuous, as applications must be revisited to ensure they are optimized for the benefits of each new generation of technology. This is the juncture at which the industry finds itself today. Even without the springboard leap offered by many-core accelerators, the transition to multi-core processors is enough to make application developers reconsider how best to parallelize and optimize their codes. Intel Xeon Phi and NVIDIA GPU computing bring even more options into consideration. Which architecture to choose, and how to optimize for it, are questions that will be specific to each application and circumstance. The resources and expertise offered by Bull and its partners can help end users evaluate the best path for each application, and therefore, for their own innovations into the future.

W-HPCperformance--en1
For more information, contact us at www.bull.com/extreme-computing or at [email protected]

© Bull SAS - 2014 - RCS Versailles B 642 058 739 - All trademarks mentioned herein are the property of their respective owners. Bull reserves the right to modify this document at any time without prior notice. Some offers or parts of offers described in this document may not be available locally. Please contact your local Bull correspondent to know which offers are available in your country. This document has no contractual significance.
UK: Bull, Maxted Road, Hemel Hempstead, Hertfordshire HP2 7DZ / USA: Bull, 300 Concord Road, Billerica, MA 01821
Bull - Rue Jean Jaurès - 78340 Les Clayes sous Bois - France
This flyer is printed on paper containing 40% eco-certified fibers originating from sustainably managed forests and 60% recycled fibers, in conformity with environmental regulations (ISO 14001).