VRIJE UNIVERSITEIT AMSTERDAM

Radio astronomy beam forming on GPUs

by Alessio Sclocco

Supervisors: Dr. Rob van Nieuwpoort, Dr. Ana Lucia Varbanescu

A thesis submitted in partial fulfillment for the degree of Master of Science in the Faculty of Sciences, Department of Computer Science

March 2011

"Computer Science is no more about computers than astronomy is about telescopes."
E. W. Dijkstra

VRIJE UNIVERSITEIT AMSTERDAM

Abstract

Faculty of Sciences, Department of Computer Science

Master of Science

by Alessio Sclocco

In order to build the radio telescopes needed for the experiments planned for the years to come, it will be necessary to design computers capable of performing thousands of times more floating point operations per second than the most powerful computers of today, and to do so in a very power efficient way. In this work we focus on the parallelization of a specific operation that is part of the pipeline of most modern radio telescopes: beam forming. We aim at discovering whether this operation can be accelerated using Graphics Processing Units (GPUs). To do so, we analyze a reference beam former, the one that ASTRON uses for the LOFAR radio telescope, discuss different parallelization strategies, and then implement and test the algorithm on an NVIDIA GTX 480 video card. Furthermore, we compare the performance of our algorithm using two different implementation frameworks: CUDA and OpenCL.

Contents

Abstract
List of Figures
List of Tables
Abbreviations

1 Introduction

2 Background
  2.1 Radio astronomy
  2.2 Electromagnetic radiation
  2.3 Beam forming

3 Related works
  3.1 Hardware beam formers
  3.2 Software beam formers

4 General Purpose computations on GPUs
  4.1 The GPU pipeline
  4.2 The reasons behind GPGPU
  4.3 NVIDIA architecture
  4.4 CUDA
  4.5 An example: SOR
      4.5.1 Performance

5 Application analysis
  5.1 Data structures
  5.2 The beam forming algorithm
      5.2.1 Delays computation
      5.2.2 Flags computation
      5.2.3 Beams computation
  5.3 Parallelization strategies

6 CUDA BeamFormer
  6.1 Experimental setup
  6.2 BeamFormer 1.0
  6.3 BeamFormer 1.1
  6.4 BeamFormer 1.2
  6.5 BeamFormer 1.3
  6.6 BeamFormer 1.4
  6.7 BeamFormer 1.5
  6.8 Conclusions
7 OpenCL BeamFormer
  7.1 The Open Computing Language
  7.2 Porting the BeamFormer 1.5 from CUDA to OpenCL
  7.3 OpenCL BeamFormer performance
  7.4 Conclusions

8 Finding the best station-beam block size
  8.1 Experimental setup
  8.2 OpenCL results
  8.3 CUDA results
  8.4 Conclusions

9 Conclusions

A CUDA BeamFormer execution time
B CUDA BeamFormer GFLOP/s
C CUDA BeamFormer GB/s
D OpenCL BeamFormer measurements
E Finding the best station-beam block size
F Data structures

Bibliography

List of Figures

2.1 Electromagnetic spectrum, courtesy of Wikipedia.
2.2 Hardware beam former, courtesy of Toby Haynes [1].
3.1 One of the THEA boards, courtesy of ASTRON.
3.2 EMBRACE radio frequency beam former chip, courtesy of P. Picard [2].
4.1 Hardware pipeline of a video card [3].
4.2 Comparison between Intel CPUs and NVIDIA GPUs in terms of GFLOP/s, courtesy of NVIDIA [4].
4.3 Comparison between Intel CPUs and NVIDIA GPUs in terms of GB/s, courtesy of NVIDIA [4].
4.4 The number of transistors devoted to different functions in CPUs and GPUs, courtesy of NVIDIA [4].
4.5 NVIDIA Tesla GPU architecture [5].
4.6 NVIDIA Fermi GPU architecture [6].
4.7 SOR execution time (lower is better).
4.8 SOR speed-up (higher is better).
6.1 Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
6.2 Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
6.3 GFLOP/s of various BeamFormer versions merging 64 stations (higher is better).
6.4 GB/s of various BeamFormer versions merging 64 stations (higher is better).
7.1 NDRange example, courtesy of Khronos Group [7].
7.2 GFLOP/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better).
7.3 GB/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better).
7.4 Execution time in seconds of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (lower is better).
8.1 GFLOP/s for the OpenCL BeamFormer: block sizes from 64x1 to 64x16 (higher is better).
8.2 GFLOP/s for the CUDA BeamFormer: block sizes from 64x1 to 64x16 (higher is better).
8.3 Comparison of CUDA and OpenCL BeamFormers: block sizes from 64x1 to 64x16 (higher is better).

List of Tables

4.1 Comparison between an Intel CPU and an NVIDIA GPU.
6.1 Operational intensity and registers used by each kernel.
6.2 Algorithms' optimization strategies and code differences.
9.1 Comparison of the beam former running on the ASTRON IBM Blue Gene/P and on an NVIDIA GTX 480.
A.1-A.17 Execution time in seconds for the BeamFormer, one table per version (1.0.2 up to 1.5 8x8).
B.1-B.16 GFLOP/s for the BeamFormer, one table per version (1.1 up to 1.5 8x8).
C.1-C.17 GB/s for the BeamFormer, one table per version (1.0.2 up to 1.5 8x8).
D.1-D.9 Execution time in seconds, GFLOP/s and GB/s for the BeamFormer 1.5-opencl (2x2, 4x4 and 8x8).
E.1-E.6 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x1 up to 256x16.
E.7-E.12 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x1 up to 256x16.
Abbreviations

CU      Compute Unit
CUDA    Compute Unified Device Architecture
EMR     ElectroMagnetic Radiation
FPGA    Field Programmable Gate Array
GDBF    Generic Digital Beam Former
GPGPU   General Purpose computations on GPU
GPU     Graphical Processing Unit
LOFAR   LOw Frequency ARray
OpenCL  Open Computing Language
PE      Processing Element
SIMT    Single Instruction Multiple Thread
SOR     Successive Over-Relaxation
SPA     Streaming Processor Array
TPC     Texture/Processor Cluster
VHDL    VHSIC Hardware Description Language

Chapter 1

Introduction

Radio astronomy has changed in recent years, and most of these changes concern the instruments of radio astronomy itself, i.e. radio telescopes. Classic radio telescopes were big directional dish antennas, and all the experiments were performed with hardware sensors connected to those antennas. Unfortunately, there are engineering limits to the dimensions that a single dish telescope can reach, and furthermore building increasingly complex special purpose hardware instruments is expensive and does not offer flexibility.

Radio interferometry gives engineers the possibility of building bigger radio telescopes: instead of using a large single dish antenna, it combines the signals received from many different antennas to form a virtual radio telescope that is the combination of all the small antennas. While at the beginning this technique was used to connect only antennas close to each other, it is nowadays possible to connect antennas that are thousands of kilometers apart, and so to obtain radio telescopes with apertures so big that they were not even imaginable ten years ago. The LOw Frequency ARray (LOFAR) radio telescope [8] is an example of this new generation of telescopes.

In this context the beam forming algorithm, the subject of this work, acquires an increased importance. This technique, in fact, makes it possible to combine different signals, received from many antennas, into a single coherent signal, called a beam. Moreover, beam forming gives directionality to an array of non-directional antennas.

The other revolution in the instruments of radio astronomy is the possibility to implement larger components of the operational pipeline of a radio telescope in software, thus reducing the costs of developing new instruments and increasing the flexibility of radio telescopes.

However, to perform new and much more complex experiments in the following years, more powerful instruments will be necessary. With software radio telescopes there will be a need for more powerful computers to run this software, computers able to perform a very large number of operations per second (a thousand times more than the computational power of the current best supercomputers in the world: exa-scale computers [9]). But, in order to build this next generation of supercomputers, we will not just need faster processing units and increased memory and network bandwidth, we will also need to achieve better power efficiency. In fact, with current technology, just operating these exa-scale computers would require dedicated power plants [10]. A possible solution to build more powerful and power efficient supercomputers is to accelerate the computations using Graphics Processing Units (GPUs).
The architecture of modern GPUs is inherently parallel and can easily be used to accelerate complex scientific operations, like radio astronomy beam forming in our case. Moreover, it is not just that the absolute performance of GPUs, in terms of both computational power and memory bandwidth, is already higher than that of CPUs: the gap between GPU and CPU performance is widening, and GPUs are more power efficient.

The goal of this thesis is to answer the question of whether the beam forming algorithm used for our reference radio telescope, LOFAR, can be efficiently parallelized to run on GPUs. Thinking about how to build an exa-scale supercomputer, we want to understand if a GPU-based beam former can match, or even outperform, the parallel beam former used in production at ASTRON.

The rest of this work is organized as follows. Chapters 2 and 3 provide, respectively, a background on radio astronomy and beam forming, and related work in digital beam forming for modern radio telescopes. Chapter 4 presents an introduction to general purpose computing on GPUs, introducing the main concepts, presenting the NVIDIA architecture used to parallelize the beam former, and showing an example of parallelizing a simple algorithm on the GPU. The application analysis, consisting of the description of the sequential algorithm and the parallelization strategies, is included in Chapter 5. Chapters 6, 7 and 8 describe the implemented parallel beam formers, the experiments and the results, providing also partial conclusions and comparisons. The overall conclusions of this work are presented in Chapter 9. In the Appendices we provide the detailed results of all the performed experiments and the source code of the relevant input and output data structures.

Chapter 2

Background

The focus of this master project is the parallelization of the beam forming algorithm on GPUs. The beam forming algorithm is a standard signal processing technique aimed at providing spatial selectivity in the reception or transmission of a signal. In this work we use beam forming in the field of radio astronomy, to receive data from a particular region of the sky using a large array of omnidirectional antennas.

Section 2.1 gives a brief introduction to radio astronomy and its instruments. In Section 2.2 we introduce the fundamental physical notions, on electromagnetic radiation, that are necessary to understand how the beam former works, and finally in Section 2.3 we provide a brief introduction to beam forming itself.

2.1 Radio astronomy

Radio astronomy is the field of astronomy that studies the universe at radio frequencies, whereas classical astronomy studies only the so called "visible" universe. Its origin can be dated to 1933, when Karl G. Jansky published [11] the discovery of an electromagnetic emission from our galaxy, the Milky Way. This discovery, made by Jansky during an investigation aimed at finding the causes of static disturbances on transatlantic voice transmissions for the Bell Telephone Laboratories, led scientists to design and develop more complex and precise instruments to receive radio sources originating from outer space. Today the possibility to look at the universe in other frequencies of the electromagnetic spectrum gives astronomers the means to penetrate some of the deepest
secrets of the universe, like analyzing the molecules composing planets, stars, galaxies and everything else in the known universe, and also the possibility to look at objects that, in the realm of visible light, are invisible, like pulsars.

The history and achievements of radio astronomy are deeply connected with the history of radio telescopes. Just as a standard optical telescope is, in its bare essence, a reflecting mirror, a radio telescope can be seen as just an antenna, tuned to receive a particular frequency. The Jansky prototype, for example, was an array of dipoles and reflectors receiving radio emissions at 20.5 MHz. The whole antenna was able to rotate, completing a full circle every 20 minutes. However, this prototype is really different from what is known today as a radio telescope. Nowadays, radio telescopes are in the majority of cases large parabolic antennas, in which the electromagnetic emission is reflected from the surface of a dish to an electronic receiver. The first radio telescope of this kind was realized by Grote Reber in 1937. But, as with optical telescopes, engineering difficulties arise when trying to increase the dimensions of the dish to obtain bigger telescopes, able to provide larger apertures, which are needed for more accurate observations. An increase in costs is also unavoidable when building bigger radio telescopes.

A solution to this problem is provided by radio interferometry. Radio interferometry is a technique that makes it possible to combine the signals received by two or more receivers, obtaining a virtual radio telescope with a resolution equivalent to that of a telescope with a single antenna whose diameter is equal to the distance between the farthest receivers.

The LOw Frequency ARray (LOFAR) is a radio interferometric array, developed by ASTRON, and is the radio telescope whose beam former we parallelize on GPUs in this work. Composed of more than 10,000 antennas, LOFAR is one of the biggest radio telescopes ever built. Its dimensions are not the only groundbreaking aspect: a really interesting one is that this radio telescope is mostly implemented in software. LOFAR's antennas are of two kinds: low-band antennas, for the frequency range of 10-80 MHz, and high-band antennas, for the range of 110-250 MHz. Antennas are organized in stations; stations will be important in this work because our beam former will not merge the signals coming directly from the antennas, but will merge the output signals of the stations. In fact, each station combines, using FPGAs, the signals of its antennas and sends a single signal to the central processing facility. The real-time processing pipeline of the LOFAR radio telescope runs, at the central processing facility, on a two and a half rack IBM Blue Gene/P. In order to reduce the amount of data sent by the stations to the central processing facility, only a reduced number of directions and frequencies are sent by the stations. These (reduced numbers of) directions and frequencies are called subbands. The subbands are split further into narrower frequency bands at the central processing facility. These narrow frequency bands are called channels. These concepts will return frequently in the remainder of this work.

2.2 Electromagnetic radiation

The existence of electromagnetic radiation was postulated by James Clerk Maxwell in his theory of electromagnetism, and later demonstrated experimentally by Heinrich Hertz.
It is, in fact, possible to derive from Maxwell's equations that waves are generated by the oscillation of an electric and a magnetic field (i.e. an oscillating electromagnetic field). While an ordinary wave is capable of propagating only through matter, an electromagnetic wave is also capable of propagating through the vacuum, and it does so with a constant speed equal to 299,792 km/s (the speed of light). This characteristic of electromagnetic waves permits us to receive signals from outer space, although there is only vacuum between the Earth and the emitting sources. Visible light is just one example of electromagnetic radiation; other examples are radio waves and X-rays. Another peculiarity of electromagnetic radiation, which derives from quantum theory, is that it can behave both as a wave and as a particle.

First we characterize the properties of electromagnetic radiation as a wave. Like any other wave, electromagnetic radiation has three characteristics: speed, frequency and wavelength. As previously said, the speed is constant in the vacuum and is exactly the speed of light, symbolized in physics with c. In media other than the vacuum, the speed is less than the speed of light and depends on the specific medium; in this case we use v to indicate the speed of the wave. The frequency is the rate, measured in Hertz (Hz), at which the radiating electromagnetic field oscillates; in physics, the symbol ν is used for the frequency. The wavelength of a wave is the distance between two successive crests, or troughs, and is simply measured in meters. The symbol that we use for the wavelength is λ. The frequency, speed and wavelength of a wave are related: in general we can write this relation as v = ν · λ, which, in the case of an electromagnetic wave propagating in the vacuum, becomes c = ν · λ. Electromagnetic radiation can be classified using its frequency, or its wavelength, producing what is called the electromagnetic spectrum, shown in Figure 2.1.

While propagating from the generating electromagnetic field, the radiation travels in straight lines in all directions, as if covering the surface of a sphere. As the area of this sphere grows with the square of its radius (i.e. the distance the radiation has travelled), following the well known equation A = 4πR^2, the radiation loses signal strength. Another property of the EMR is polarization. The polarization is the direction of the oscillation of the electromagnetic field's electric component.

Figure 2.1: Electromagnetic spectrum, courtesy of Wikipedia.

When two different waves have similar frequencies, the relative measure of their alignment is called phase and is measured in degrees, from 0 to 360. If the peaks and troughs of two waves match over time, then they are said to be in phase with each other.

When viewed as a particle, electromagnetic radiation is composed of a discrete stream of energy quanta, called photons. The energy that each photon transports is proportional to the wave frequency, and is expressed by the equation E = h · ν, where h = 6.625 × 10^-27 erg·s is the Planck constant.
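As a concrete illustration of these two relations (the 150 MHz value below is simply an arbitrary frequency inside LOFAR's high band of 110-250 MHz, not a parameter of the beam former), a signal at ν = 1.5 × 10^8 Hz has a wavelength of

λ = c / ν = (2.998 × 10^8 m/s) / (1.5 × 10^8 Hz) ≈ 2 m,

and each of its photons carries an energy of

E = h · ν = (6.625 × 10^-27 erg·s) · (1.5 × 10^8 Hz) ≈ 9.9 × 10^-19 erg.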
To conclude this introduction on the properties of electromagnetic radiation, it is interesting to understand where these oscillating electromagnetic fields originate, especially in the field of radio astronomy. The main mechanism for the production of electromagnetic radiation is thermal. Heat is produced by the movement of the inner molecules of a solid, gas or liquid. When molecules move, electromagnetic radiation is produced at all the frequencies of the electromagnetic spectrum, with the amount of radiation emitted at each frequency related to the temperature of the emitting body. Indeed, an emitting body with a higher temperature will emit more energy, and so more electromagnetic radiation, at all frequencies, with its emission peak concentrated at higher frequencies. This relationship is known as Wien's law and it can be written as

ν_max = α · (kT / h)

where T is the temperature (in Kelvin), k is the Boltzmann constant, h is the Planck constant and α ≈ 2.821439 is an empirical constant. Matter, in the state of a solid or plasma, is said to be a blackbody if it emits thermal radiation. We can summarize the characteristics of a blackbody as follows:

1. A blackbody that has a temperature higher than 0 Kelvin emits some energy at all frequencies;
2. A blackbody whose temperature is higher than another one's will emit more energy, at all frequencies, than the other one;
3. The higher the temperature of the blackbody, the higher the frequency at which the maximum energy is emitted.

An electromagnetic field can also be produced (in rare cases) as the consequence of a non-thermal phenomenon. An example of electromagnetic radiation of non-thermal origin is synchrotron radiation. This radiation is produced when a charged particle enters a magnetic field and, being forced to move around the magnetic lines of force, is accelerated to nearly the speed of light. However, unlike thermally produced radiation, in non-thermal radiation the intensity decreases with frequency, i.e. the lower the frequency of the radiation, the higher the emitted energy.

2.3 Beam forming

Beam forming is a standard signal processing technique used to control the directionality of an array of antennas. It can be used for both transmitters and receivers. In this work, we focus only on using beam forming for reception, i.e., to combine the signals received from an array of antennas and simulate a larger directional antenna.

The problem when combining signals received from different antennas is that the receivers are in different places in space, and so each of them receives the same signal emitted by a given source at a different time. Simply combining the signals received by the different antennas does not produce meaningful information, because the waves interfere. This interference, however, can be either constructive or destructive, and exploiting the behavior of constructively interfering waves is exactly what beam forming is based on. The simplest beam former can be built just by connecting nearby antennas to the same receiver with wires of different lengths, thus delaying the signals and producing a temporal shift and an increase of sensitivity in a specific direction. This solution is not very flexible, and beam formers are in practice implemented with special purpose hardware, or in software.

In general, to form a beam from different received signals, a different complex weight is multiplied with each signal and then all the signals are summed together. The complex weight depends on the location of the source of interest and the spatial position of the antennas. In Figure 2.2 we show how a hardware receiving beam former can be used to combine the signals received by four different antennas and provide a single coherent signal.
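To make this weighted sum concrete, the sketch below forms one beam sample from the samples of a set of stations; the names and data layout are ours, for illustration only, and the actual LOFAR algorithm and data structures are analyzed in Chapter 5.

typedef struct { float real; float imag; } Complex;

/* Form one beam sample: every station sample is multiplied by its complex
 * weight and the products are accumulated into a single coherent sample. */
void formBeam(const Complex samples[], const Complex weights[],
              int stations, Complex *beam) {
    beam->real = 0.0f;
    beam->imag = 0.0f;
    for ( int s = 0; s < stations; s++ ) {
        /* complex multiplication: (a + bi) * (c + di) */
        beam->real += samples[s].real * weights[s].real
                    - samples[s].imag * weights[s].imag;
        beam->imag += samples[s].real * weights[s].imag
                    + samples[s].imag * weights[s].real;
    }
}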
In general, the complex weight that is multiplied with a received sample is composed of two values: an amplitude and a phase shift. However, in narrow-band systems (like the LOFAR radio telescope), just a phase shift is sufficient to beam form the samples. The specific algorithm of LOFAR's beam former is described in detail in Section 5.2.

Figure 2.2: Hardware beam former, courtesy of Toby Haynes [1].

Chapter 3

Related works

The beam forming algorithm is straightforward, as can be seen in Section 2.3. In order to deal with high data rates and an increasing number of signals to merge, beam formers are usually built using special purpose hardware. In this chapter we provide an overview of the most interesting beam forming solutions for radio astronomy: hardware implementations are presented in Section 3.1, and software solutions in Section 3.2.

3.1 Hardware beam formers

Due to its simplicity, and to achieve real-time performance, the beam forming algorithm has a long history of hardware implementations. Although our work is focused on a particular software implementation, it is important to provide an introduction to some beam formers used in practice for radio astronomy.

The first beam former on our list was designed and realized by the Netherlands Foundation for Research in Astronomy (NFRA) as a technology demonstrator for the SKA radio telescope. The name of this demonstrator is Thousand Element Array (THEA), and it consists of 1,024 antennas divided into 16 tiles covering an area of approximately 16 square meters. Each tile contains 16 boards, each equipped with 4 antennas, with embedded radio frequency beam formers. A single tile is capable of forming two independent beams in hardware; the formed beams are then digitized and sent to a central digital beam former. THEA is capable of forming 32 beams simultaneously. A complete description of the hardware can be found in [12]. The demonstrator was successful enough to permit scientific experiments and is currently continued by ASTRON. A picture of the THEA beam former can be seen in Figure 3.1.

Figure 3.1: One of the THEA boards, courtesy of ASTRON.

In the process of designing and building the different technology demonstrators for the SKA project, different beam formers were proposed and realized at ASTRON. However, reinventing a beam former every time for the specific project was considered a suboptimal solution, and so the Generic Digital Beam Former (GDBF) platform was designed. The GDBF is a generic digital narrowband beam former, modeled using a high-level description language, VHDL, and subsequently implemented with both FPGAs and integrated circuits. What is interesting about the GDBF is that, even though it is a hardware project, it is based on non-functional requirements that are not so different from ours: it is designed to be flexible, to scale up to an increasing number of receiving antennas, and to form multiple beams at the same time. Moreover, the GDBF is designed to deliver high performance and have low power consumption. A mathematical description of the project is available in [13].

The Center for Astronomy Signal Processing and Electronics Research (CASPER) of the University of California at Berkeley proposes a similar solution.
In order to provide performance close to that of special purpose hardware, but without the time and costs involved in designing and building it, they designed a set of FPGA modules for digital signal processing algorithms. These modules can be interconnected to form radio astronomy instruments, as described in [14]. The beam former for the Allen Telescope Array (ATA), called the BEEmformer, has been built using this technology; a description of the beam former and a comparison between this implementation and another implementation made using DSP processors is available in [15].

Another recent comparison between two different beam formers, a radio frequency and a digital one, is provided in [2]. As these beam formers are also SKA demonstrators, they show that there is an increasing interest in the field in high performance, scalable beam formers. A photo of the radio frequency beam former chip is provided in Figure 3.2.

Figure 3.2: EMBRACE radio frequency beam former chip, courtesy of P. Picard [2].

The last beam former of this short survey was developed by Virginia Tech for the Eight-meter-wavelength Transient Array (ETA). The architecture of this beam former is a layered one: the signals are received by a cluster of 12 external FPGA nodes, each of them connected to two different receivers. Each cluster node outputs eight single polarization beams and sends four of them to one internal FPGA node and the other four to another. There are 4 internal FPGA nodes, each of them receiving four beams from six external nodes, for a total of 24 beams. These internal nodes are used to combine the different beams and send them to the storage nodes. A description of the hardware, and of the scientific goals of this beam former, can be found in [16]. It is interesting to note that this beam former is the only really parallel beam former among the different hardware implementations that we described.

Beam forming in hardware is not exactly something new, but the goals set by radio astronomy for the near future are changing the field. The hardware beam formers, although still used, can hardly keep pace: with future radio telescopes composed of millions of antennas, producing a huge amount of data that has to be processed in almost real time, hardware solutions, while certainly able to provide raw performance, do not scale. Further, hardware beam formers are inherently not flexible; adding new components (e.g. antennas) or modifying the beam forming algorithm (e.g. to form a different number of beams) requires the substitution of some components or the design and building of new beam forming chips. Producing new hardware is neither easy nor cheap, as it usually spans many years and involves complex and expensive prototyping. Nor will it become cheaper in the future, because special purpose hardware never benefits from economies of scale. Furthermore, hardware design requires a lot of expertise, which is not as widespread as software design expertise. A possible solution, one that our small survey shows is being widely investigated, is to create standard hardware components using FPGAs and then build more complex instruments with them. Using FPGAs can also improve the flexibility and scalability of these new hardware beam formers, and simplify the life of their designers.
3.2 Software beam formers

The history of software implementations of beam formers is shorter than the history of the hardware ones. There are, however, some interesting approaches that we need to examine in order to better understand the context of our own work.

The first software beam former that we encounter is the beam former currently used at ASTRON for the LOFAR radio telescope. A brief description can be found in [17], along with other information regarding the complete software pipeline and the correlator. This beam former is really important, not only because it implements the exact same algorithm we implement in this work, and with which we will eventually compare our results, but also because it is the first case of a real-time software beam former used in production. The telescope has already been described in Section 2.1 and the beam former in Section 2.3, so we do not add any more details here.

Another attempt to build a real-time software pipeline for a radio telescope has been made in India for the Giant Metrewave Radio Telescope (GMRT) and is described in [18]. The telescope is composed of 32 antennas placed over an area of 25 km in diameter. The software pipeline is implemented on commodity hardware with a cluster of 48 Intel machines running the Linux operating system. Parallelism is exploited at different levels, using MPI for inter-node parallelism and OpenMP for intra-node parallelism, and then using the vector instructions of the Intel Xeon processors at the single thread level. The beam former is implemented with three threads per node and the output is formed by 32 dual polarized beams. The performance measurements provided in the paper refer to the whole software pipeline and cannot be used for comparison with our implementation. It is interesting, however, that they also propose the use of GPUs to accelerate the computation.

A different approach is that of OSKAR. OSKAR, developed by the Oxford astrophysics and e-Research groups, is a research tool built to investigate the challenges of beam forming for the SKA radio telescope. It currently supports two different modes of execution: the simulation of the beam forming phase and the computation of different beam patterns. Its architecture is highly modular, with the two most important components being the front-end, used to manage the computations, and the back-end, where the simulations are run. The back-end is parallelized with MPI and runs on a cluster. More details, documentation, and the software are available on the project's website, [19]. This is, however, a different approach from ours, and the two cannot really be compared.

To conclude this brief introduction to recent software beam formers, we present two GPU solutions. Unlike everything else presented so far, these two solutions are not intended for radio astronomy, but they are the only attempts we have found at implementing a beam former using a GPU, and are interesting for comparison. In the first work, [20], two general digital beam formers, one in the time domain and one in the frequency domain, are implemented using CUDA on an NVIDIA GeForce 8800 and then compared with the same algorithms implemented on an Intel Xeon CPU. In the experiments the execution times achieved by the GPU implementations are always lower than the ones achieved by the CPU implementations, and the authors conclude that they see the use of GPUs as a viable solution for implementing digital beam formers.
Different results are reported in [21]. In this work an adaptive beam former for underwater acoustics is implemented with CUDA, using the same card as the previous work, the NVIDIA GeForce 8800. In this case the authors first tried to parallelize the whole algorithm on the GPU, but this was a performance failure. A subsequent hybrid attempt, using the GPU only to accelerate a part of the beam forming process, was successful, but the GPU implementation ran in twice the time of a sequential C implementation. With these results, the authors conclude that this is still a proof that a beam former can be parallelized on a GPU, and that the slowdown factors (mostly related to bad access patterns to the off-chip memory) have been identified, so further improvements are possible.

This list of current approaches demonstrates that implementing a beam former in software is no longer a naive solution. At least two real telescopes, LOFAR and GMRT, are currently using real-time software pipelines, and the flexibility provided by a software solution will probably be exploited also in ambitious projects like the SKA. However, it is still not clear whether a real beam former for radio astronomy can be efficiently implemented using a GPU, even though this solution is widely foreseen. In this work we answer this question using our own parallel GPU implementation of the LOFAR production code for beam forming.

Chapter 4

General Purpose computations on GPUs

A Graphical Processing Unit (GPU) is a specialized processor used by modern video cards to improve their performance, by taking over part of the intensive computation from the CPU. The main reason to introduce GPUs was to increase the performance in rendering, mostly for animation and video games. When we talk about General Purpose computations on GPUs (GPGPU), a term introduced in 2002 by Mark Harris, we refer to the use of GPUs for general purpose computations, i.e. the execution of general purpose algorithms on GPUs instead of the classical graphics related computations.

In this chapter we present how a modern GPU works and the reasons behind the adoption of GPUs for general purpose computing. Moreover, we introduce the NVIDIA architecture and CUDA. Finally, we present an example of GPGPU by implementing a simple Red-Black SOR algorithm, and measuring its performance, i.e. execution time and speed-up.

4.1 The GPU pipeline

Figure 4.1 shows the high-level organization of the hardware pipeline of a generic GPU, as presented in [3]. The hardware functionality [22] is straightforward: a set of geometries, i.e. vertices of geometrical shapes in a three dimensional space, is sent to the GPU, which eventually draws the corresponding image into the frame buffer; the image from the frame buffer memory is shown on the screen.

Figure 4.1: Hardware pipeline of a video card [3].

There are three main hardware components of the pipeline, corresponding to the main phases of the computation. First, the geometries are transformed by the vertex processor into two dimensional triangles. Next, the rasterizer generates a fragment for each pixel location covered by a triangle. Finally, the fragment processor computes the color of each fragment, resulting in an image, i.e. a set of pixels, in the frame buffer. Initially, this pipeline was completely implemented in hardware, until it was demonstrated by industry, i.e.
with Pixar's RenderMan [23], that a programmable pipeline could produce better results in terms of rendered images. To respond to this need, vendors transformed the classic pipeline into a flexible one, where the vertex and fragment processors execute user defined vertex and fragment programs.

The pipeline is intrinsically data parallel, i.e. each vertex or fragment can be computed in parallel with the others (and it actually is). The GPU with its pipeline can be seen as a stream computing processor [24]. In the stream computing paradigm we have streams, i.e. sequences, possibly infinite, of data elements, and kernels, i.e. functions to apply to each element of a given stream; the mapping of the described pipeline to the stream computing paradigm is straightforward, with the input geometries being the stream and the vertex and fragment programs, applied in parallel to each element of the stream, being the kernels. Exploiting the stream computing capabilities of modern programmable GPUs is what GPGPU has made possible.

4.2 The reasons behind GPGPU

Are there real advantages in the use of GPUs for general purpose computing? To answer this question, we look at performance figures. In Figures 4.2 and 4.3, we can see a comparison of computing capabilities and memory bandwidth between Intel CPUs and NVIDIA GPUs [4].

Figure 4.2: Comparison between Intel CPUs and NVIDIA GPUs in terms of GFLOP/s, courtesy of NVIDIA [4].

Figure 4.3: Comparison between Intel CPUs and NVIDIA GPUs in terms of GB/s, courtesy of NVIDIA [4].

Device                        GFLOP/s (a)    Price          Power
Intel Core i7-920 CPU         89.6 (b)       284.00 $ (c)   130 W
NVIDIA GeForce GTX 295 GPU    1788.48        529.99 $       289 W

(a) single precision
(b) at 2.80 GHz
(c) for Intel direct customers in bulk of 1000 units

Table 4.1: Comparison between an Intel CPU and an NVIDIA GPU.

For both computational performance, measured in GFLOP/s, and memory bandwidth, measured in GB/s, the performance achieved by the GPUs is higher in absolute value. Moreover, as the gap is widening quite fast, GPGPU seems suitable for a big slice of general purpose computations, mainly in the scientific field.

Performance is not the only advantage brought by GPUs. A GPU is also cheaper per FLOP than an ordinary CPU. If we compare two recent devices from Intel and NVIDIA in terms of GFLOP/s and price, as in Table 4.1, we see that the GFLOP/s per dollar ratios of the two devices are 0.31 and 3.37 respectively; thus, an NVIDIA GPU is about ten times cheaper per FLOP than an Intel CPU. Moreover, in the field of scientific programming, one of the biggest problems in terms of cost is the power required by modern supercomputers. We can compare the same devices in Table 4.1 to evaluate the GFLOP/s per watt ratio. The ratio is 0.68 for the Intel Core i7-920 CPU and 6.18 for the NVIDIA GeForce GTX 295 GPU. Also in this case the GPU is more efficient than the CPU.

However, the differences in performance between CPUs and GPUs come from the fact that the latter are highly specialized and therefore less flexible. Having to design and produce processors specialized in applying the same function in parallel to many different data items made GPU producers focus more on increasing the arithmetic capabilities of their architectures than on control capabilities. This can be seen in the different organizations of CPUs and GPUs, as presented in Figure 4.4.
To compensate for the lack of control features, the role that a GPU typically has in GPGPU applications is that of a powerful accelerator, computing the massively data-parallel parts of algorithms, while the CPU deals with the sequential parts of the same general purpose computations. Thus, the two architectures will continue to coexist and complement each other. Moreover, modern many-core GPUs are an ideal testbed for a future scenario in which general purpose CPUs will also become many-core architectures.

Figure 4.4: The number of transistors devoted to different functions in CPUs and GPUs, courtesy of NVIDIA [4].

4.3 NVIDIA architecture

An important aspect of GPGPU is to understand how a GPU works. In the early days, it was impossible to write a general purpose algorithm without a complete understanding of the GPU pipeline, as introduced in Section 4.1. A problem needed to be transformed from its own domain to the graphics domain before it could be implemented on a GPU. Today, with the introduction of high-level abstractions and support for generic programming languages, it is no longer necessary to translate a generic problem into the graphics domain. However, it is still difficult to obtain good performance without knowledge of the underlying architecture.

Figure 4.5: NVIDIA Tesla GPU architecture [5].

Figure 4.5 shows the architecture of a modern NVIDIA Tesla GPU with 112 cores. It is interesting to see that, from a hardware point of view, the border between vertex and fragment processor is not visible anymore; in fact, the different steps of the pipeline are executed by a unified computing processor. However, from a logical point of view, the pipeline introduced in Section 4.1 is still valid. As the hardware architecture is complex, we will only provide a high-level overview here. Readers interested in more details on the NVIDIA GPU architectures can refer to [25] for a more in-depth description.

In Figure 4.5 we can see that the GPU is composed of three layers: the command processing, made of the various input and distribution managers, the streaming processor array (SPA), and the memory management. We focus on the description of the organization of the SPA, as the other two layers are of little interest for GPGPU programming per se. An SPA is composed of a variable number of Texture/Processor Clusters (TPCs); the number of TPCs dictates the processing capabilities of the GPU itself. Each TPC contains two streaming multiprocessors and a texture unit. The streaming multiprocessor is the real computing core of the architecture. It contains a multithreaded instruction fetch and issue unit, eight streaming processors, two special-function units and a 16 KB read/write shared memory. Streaming multiprocessors are based on the Single Instruction Multiple Thread (SIMT) processor architecture, where the same instruction is applied to multiple threads in parallel. Threads are managed and executed by a streaming multiprocessor in groups of 32; a group of this type is called a warp.

The main change introduced by NVIDIA in its new GPU architecture, called Fermi, is the availability of a shared L2 cache memory, as can be seen in Figure 4.6.

Figure 4.6: NVIDIA Fermi GPU architecture [6].
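The architectural parameters just described (number of streaming multiprocessors, warp size, shared memory per block) can be inspected at run time through the CUDA runtime API. The short sketch below is our own illustration, not part of the beam former; it queries device 0 and prints these figures for the installed card.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    /* Query the properties of the first CUDA capable device in the system. */
    if ( cudaGetDeviceProperties(&prop, 0) != cudaSuccess ) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("device:                    %s\n", prop.name);
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size:                 %d threads\n", prop.warpSize);
    printf("shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("global memory:             %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}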
Here we briefly introduce the programming model of CUDA; more details are available in [4]. A CUDA kernel is a user defined function that is executed in parallel by different GPU threads. Each executing thread is identified by a three dimensional vector inside its block ; the vector associates a position inside the block to each thread. Threads in the same block share communication and synchronization facilities while threads from different blocks are, theoretically, completely independent. Blocks are also identified by a three dimensional vector that associates to each block a position inside the grid of thread blocks. This introduces a hierarchy for CUDA threads, in which we have a single grid containing multiple blocks, each of them containing multiple threads. The memory organization in CUDA is also hierarchically organized. Each thread in a block has access to its private read/write local memory. All the threads inside a block share a read/write shared memory of 16 KB. The shared memory has low-latency, comparable to the latency to access local registers, and it acts like a user-programmable cache inside a block. Next in the hierarchy there is the GPU global memory which is accessible to each thread in every block, and shared also between different grids, i.e. different kernel executions. Two additional read-only memories, related to the global memory, are the constant and texture memories. These were the only cached memories inside the GPU, before the introduction of the Fermi architecture. The GPU memory, called device memory, is a physically different memory from the CPU host memory. Most of the memory allocation and management has to be explicitly addressed by the programmer. 4.5 An example: SOR To introduce GPGPU programming techniques, optimizations and performance we developed a parallel version of the SOR algorithm using CUDA and then we compared it to both the sequential version and a parallel CPU-only version that uses POSIX threads. Chapter 4. General Purpose computations on GPUs 24 SOR is a method of solving Laplace equations on a grid. The core of the algorithm is presented in Listing 4.1. for ( i = 1; i < N -1; i ++ ) { for ( j = 1; j < N -1; j ++ ) { Gnew = ( G [i -1][ j ] + G [ i +1][ j ] + G [ i ][ j -1] + G [ i ][ j +1] ) / 4.0; G [ i ][ j ] = G [ i ][ j ] + omega * ( Gnew - G [ i ][ j ]); } } Listing 4.1: SOR algorithm in C The algorithm is simple: each element of the matrix G, with the exception of the elements at the border, is updated by adding to its current value the product of a given value, omega, with the difference between the average value of the four direct neighbors (North, East, South and West) and the value of the element itself. This process is iterated a certain number of times until a convergence criterion is met. To parallelize the algorithm in a shared memory model, using POSIX threads, the RedBlack strategy is used. For Red-Black SOR, the matrix is seen as a checkerboard and each iteration is split in two phases; each phase is associated with a color. Only the items with the same color of the phase are updated. The data distribution strategy used is row-wise, i.e. each thread receives a certain number of rows from the matrix and iterates the algorithm on them. Listing 4.2 shows how each different thread updates its part of the matrix. 
4.5 An example: SOR

To introduce GPGPU programming techniques, optimizations and performance, we developed a parallel version of the SOR algorithm using CUDA and then compared it to both the sequential version and a parallel CPU-only version that uses POSIX threads. SOR is a method for solving Laplace's equation on a grid. The core of the algorithm is presented in Listing 4.1.

for ( i = 1; i < N - 1; i++ ) {
    for ( j = 1; j < N - 1; j++ ) {
        Gnew = ( G[i-1][j] + G[i+1][j] + G[i][j-1] + G[i][j+1] ) / 4.0;
        G[i][j] = G[i][j] + omega * ( Gnew - G[i][j] );
    }
}

Listing 4.1: SOR algorithm in C

The algorithm is simple: each element of the matrix G, with the exception of the elements at the border, is updated by adding to its current value the product of a given value, omega, with the difference between the average value of the four direct neighbors (North, East, South and West) and the value of the element itself. This process is iterated a certain number of times until a convergence criterion is met.

To parallelize the algorithm in a shared memory model, using POSIX threads, the Red-Black strategy is used. For Red-Black SOR, the matrix is seen as a checkerboard and each iteration is split in two phases; each phase is associated with a color. Only the items with the same color as the phase are updated. The data distribution strategy used is row-wise, i.e. each thread receives a certain number of rows from the matrix and iterates the algorithm on them. Listing 4.2 shows how each thread updates its part of the matrix.

void * threadSolver() {
    for ( phase = 0; phase < 2; phase++ ) {
        for ( i = startRow; i < endRow; i++ ) {
            /* Only elements in the current phase are updated */
            for ( j = 1 + ( even(i) ^ phase ); j < N - 1; j += 2 ) {
                Gnew = ( G[i-1][j] + G[i+1][j] + G[i][j-1] + G[i][j+1] ) / 4.0;
                G[i][j] = G[i][j] + omega * ( Gnew - G[i][j] );
            }
        }
        pthread_barrier_wait( &barrier );
    }
}

Listing 4.2: Red-Black SOR algorithm in C

If we do not take the convergence criterion of the algorithm into account, the differences between the sequential and parallel versions of the code are negligible: instead of sequentially updating each element of the matrix, each thread updates a certain number of rows, in the interval given by the startRow and endRow variables, in two phases. A synchronization point is necessary after each phase to avoid the situation in which some threads start phase one while others are still in phase zero, which would invalidate the Red-Black strategy.

The CUDA version introduces, however, several visible changes. In fact, we implemented different versions of SOR with CUDA, demonstrating different optimizations. The data distribution strategy used in all the CUDA versions differs from the POSIX thread implementation: each CUDA thread has a single matrix cell to update (as opposed to the row-block distribution used by the POSIX version). This can be seen as a special form of block-wise data distribution, where the dimension of each block is 1 × 1. Both the CPU and the GPU participate in solving the problem: the CPU prepares the memory for the computation, manages the phases and eventually gets the results back from the device, while the GPU updates the matrix. Listing 4.3 shows the work of the CPU in version A of our GPGPU SOR.

/* Allocate the matrix on the device */
/* Copy the matrix to the device */
/* Set the thread block dimensions */
for ( i = 0; i < iterations; i++ ) {
    for ( phase = 0; phase < 2; phase++ ) {
        solver<<< blockSize, THREAD_N >>>( devG, pitch, omega, phase, N );
        cudaThreadSynchronize();
    }
}
/* Check if all CUDA kernel invocations returned without errors */
/* Copy the result matrix from device to host memory for printing or further processing */

Listing 4.3: CPU work in CUDA SOR A

The CPU is used to manage the work. It allocates the memory on the device, copies the data, and then iteratively calls the CUDA kernel. Finally, it copies the modified matrix back from the device to the host's memory.
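The memory management steps that Listing 4.3 only sketches as comments could be realized, for an N × N matrix of floats stored contiguously on the host, roughly as shown below. This is our own hedged illustration, not the code used in the experiments, but it shows where the devG pointer and the pitch passed to the kernel come from.

float *devG;       /* device copy of the matrix */
size_t pitch;      /* row stride, in bytes, chosen by the CUDA runtime */

/* Allocate a pitched 2D area: each row is padded so that rows start at
 * addresses satisfying the alignment requirements of the device. */
cudaMallocPitch((void **) &devG, &pitch, N * sizeof(float), N);

/* Copy the host matrix (host row stride N * sizeof(float)) to the device. */
cudaMemcpy2D(devG, pitch, G, N * sizeof(float),
             N * sizeof(float), N, cudaMemcpyHostToDevice);

/* ... kernel invocations as in Listing 4.3 ... */

/* Copy the result back to the host and release the device memory. */
cudaMemcpy2D(G, N * sizeof(float), devG, pitch,
             N * sizeof(float), N, cudaMemcpyDeviceToHost);
cudaFree(devG);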
__global__ void solver () { if ( phase == 0 ) { /* In phase 0 only cells with both even , or odd , c o o r d i n a t e s are updated */ if ( ( even ( i ) && even ( j )) || ( odd ( i ) && odd ( j )) ) { item = *(( float *) (( char *) G + i * pitch ) + j ); /* Load from device memory s t e n c i l V a l u e s [0..3] */ /* Threads " outside " of the matrix are not used */ if ( j < N - 1 && i < N - 1 ) { Gnew = ( stencilValues [0] + stencilValues [1] + stencilValues [2] + stencilValues [3]) / 4.0; itemPointer = ( float *) (( char *) G + i * pitch ) + j ; * itemPointer = item + omega * ( Gnew - item ); } } } else { /* In phase 1 only cells with mixed even and odd c o o r d i n a t e s are updated */ if ( ( even ( i ) && odd ( j )) || ( odd ( i ) && even ( j )) ) { /* The code is the same as in phase 0 */ } } Listing 4.4: Kernel in CUDA SOR A Each thread executes a copy of this kernel on a single element of the matrix. The position of the element to operate on inside the matrix is found using the position of the thread inside the block and of the block inside the grid. Other than that, the thread only checks if it’s in the correct phase for updating its element or not. Note that there is no synchronization between different blocks, nor between different invocations of the kernel. Version B of the CUDA implementation code makes use of the shared memory, as introduced in Section 4.4. The CPU code is almost the same, with the only difference that in the kernel invocation enough shared memory for each block is dynamically allocated. Listing 4.5 shows the differences in the kernel code between version A and B. __global__ void solver () { /* Load 3 column e l e m e n t s from the matrix to shared memory */ rowU [ rowId ] = *(( float *) (( char *) G + ( i - 1) * pitch ) + j ); Chapter 4. General Purpose computations on GPUs 27 rowM [ rowId ] = *(( float *) (( char *) G + i * pitch ) + j ); rowD [ rowId ] = *(( float *) (( char *) G + ( i + 1) * pitch ) + j ); /* Threads " outside " the matrix are not used */ if ( i < N - 1 && j < N - 1 ) { if ( threadIdx . x == 0 ) { /* Load from the matrix the first and last element of each row segment */ } /* A s s u r i n g that all the values are loaded into memory */ __syncthreads (); if ( phase == 0 ) { /* In phase 0 only cells with both even , or odd , c o o r d i n a t e s are updated */ if ( ( even ( i ) && even ( j )) || ( odd ( i ) && odd ( j )) ) { Gnew = ( rowU [ rowId ] + rowM [ rowId - 1] + rowM [ rowId + 1] + rowD [ rowId ]) / 4.0; rowM [ rowId ] = rowM [ rowId ] + omega * ( Gnew - rowM [ rowId ]); } } else { /* In phase 1 only cells with mixed even and odd c o o r d i n a t e s are updated */ if ( ( even ( i ) && odd ( j )) || ( odd ( i ) && even ( j )) ) { /* The code is the same as in phase 0 */ } } *(( float *) (( char *) G + i * pitch ) + j ) = rowM [ rowId ]; } } Listing 4.5: Kernel in CUDA SOR B Shared memory has been introduced to improve the performance of version A, because most of the elements in the row updated by a thread block are accessed by more than one thread. Thus, we decided to use the shared memory inside a block to store the three partial rows. However, the experiments showed that in version B the coalesced memory access [26] was almost completely lost. Memory access coalescing means that all the different access to memory by the threads of a warp are joined together in a single read or write if some alignment requirements are satisfied. 
On NVIDIA GPUs, the access to memory is significantly slower than computation, so coalesced access to global memory is very important for performance. Version C corrects this suboptimal access pattern to memory. The CPU code doesn’t change between versions B and C, while the only two differences between the kernels of Chapter 4. General Purpose computations on GPUs 28 version B and C are presented in Listings 4.6 and 4.7. rowU [ rowId -1] = *(( float *) (( char *) G + ( i - 1) * pitch ) + ( j - 1)); rowM [ rowId -1] = *(( float *) (( char *) G + i * pitch ) + ( j - 1)); rowD [ rowId -1] = *(( float *) (( char *) G + ( i + 1) * pitch ) + ( j - 1)); Listing 4.6: Differences between CUDA SOR B and C if ( threadIdx . x == 0 ) { /* Load from the matrix the last two e l e m e n t s of each row segment */ } Listing 4.7: Differences between CUDA SOR B and C The code in Listing 4.6 changes the access pattern to memory, thus reintroducing coalescing, aligning the reads to the memory addressing boundaries and closing the gap that thread 0 was creating in Listing 4.5. Moreover we see in Listing 4.7 that, instead of having to load the first and last elements of the block under update as in Listing 4.5, in version C the first thread of the block has to load the last two elements. Changing the memory access pattern, to fulfill the alignment requirements of the platform, permitted to increase the performance of version A, as we can see in Section 4.5.1, with the gain given by the use of shared memory. Shared memory is not a breakthrough for performance in this case because the level of reutilization of data in the SOR algorithm is low. It is, however, a good practice when using CUDA, so we wanted to implement and test it. The last SOR implementation is version D of the code. In this last version, the access to memory is again changed by using the texture memory, as introduced in Section 4.4. One of the advantages of using texture memory is that texture memory is cached. The CPU code has been modified to bind and unbind the texture area to the memory allocated on the device. Note that the number of thread blocks used has decreased slightly, but two new threads are added to each thread block. They are used to simplify the access to texture memory. The new kernel code is presented in Listing 4.8. __global__ void solver () { valueU = tex2D ( rowCache , j , i - 1); row [ threadIdx . x ] = tex2D ( rowCache , j , i ); valueD = tex2D ( rowCache , j , i + 1); __syncthreads (); /* Threads " outside " of the matrix and the first and last of each block are not used */ Chapter 4. General Purpose computations on GPUs 29 if ( ( i < N - 1 && j < N - 1) && ( threadIdx . x != 0 && threadIdx . x != blockDim . x - 1) ) { if ( phase == 0 ) { /* In phase 0 only cells with mixed even and odd c o o r d i n a t e s are updated */ if ( ( even ( i ) && even ( j )) || ( odd ( i ) && odd ( j )) ) { Gnew = ( valueU + row [ threadIdx . x - 1] + row [ threadIdx . x + 1] + valueD ) / 4.0; row [ threadIdx . x ] = row [ threadIdx . x ] + omega * ( Gnew - row [ threadIdx . x ]); } } else { /* In phase 1 only cells with both even , or odd , c o o r d i n a t e s are updated */ if ( ( even ( i ) && odd ( j )) || ( odd ( i ) && even ( j )) ) { Gnew = ( valueU + row [ threadIdx . x - 1] + row [ threadIdx . x + 1] + valueD ) / 4.0; row [ threadIdx . x ] = row [ threadIdx . x ] + omega * ( Gnew - row [ threadIdx . x ]); } } *(( float *) (( char *) G + i * pitch ) + j ) = row [ threadIdx . 
x ]; } } Listing 4.8: Kernel in CUDA SOR D Although version D of the kernel uses different types of memories, i.e. shared and texture memory, and preserves the coalesced access to the memory, its code is the shortest and more readable of all versions, due to optimizations and code reorganization. 4.5.1 Performance After introducing the algorithm and the code, we show the execution time and the achieved speed-up on a real case, to verify the performance that GPGPU can provide. The parameter that is varied in the experiment is the number of threads per block. The dimension of the matrix has been fixed as 8000 × 8000, small enough to be sure that it fits into the device memory, and large enough to exceed the CPU cache. The platform we used is equipped with two Intel Xeon E5320 CPUs, for a total of 8 computing cores, and 8 GB of RAM. The video card is an NVIDIA GeForce 8800 GTX with 16 streaming multiprocessors and a total of 128 cores and 767 MB of global memory. Figure 4.7 presents a comparison of the execution time of the various versions of the SOR algorithm developed for the GPU, in relation with the execution time of both the sequential and parallel CPU-only versions. Figure 4.8 shows the speed-up of the different Chapter 4. General Purpose computations on GPUs 30 versions of the CUDA implementations, compared with the speed-up achieved by the parallel CPU-only version. Figure 4.7: SOR execution time (lower is better) Figure 4.8: SOR speed-up (higher is better) From Figure 4.7 we can see that all CUDA implementations have a better execution time than the sequential version. Moreover the execution times of all CUDA implementations are also better than the one of the parallel CPU-only version. This behavior is clearly Chapter 4. General Purpose computations on GPUs 31 visible in Figure 4.8, when considering the achieved speed-up of all the parallel versions of the algorithm, relative to the sequential implementation. Looking at our best CUDA implementation (version D), we see that it is possible, also in a simple example like the SOR algorithm, to achieve with GPGPU significantly better performance compared to a CPU-only implementation: we obtained an execution time of 4 seconds compared to 44 and 22 seconds of the sequential and CPU-only parallel version respectively, thus achieving a speed-up of a factor equal to 11. Moreover, the high-level programming capabilities offered by a framework like CUDA permit to write code that is readable and similar to what we will expect from CPU-only code. However, as we found out in this example, a deep understanding of the device functionality, especially of the memory organization, is necessary for performance. Coalesced access to memory for NVIDIA devices is extremely important and can be a performance breakthrough. Furthermore, the use of shared and texture memory is an unavoidable optimization strategy, for this pre Fermi GPU, resulting in another performance improvement. Chapter 5 Application analysis In this chapter we describe the beam forming application by first introducing the data structures used to represent the input and the output, in Section 5.1, then describing a sequential version of the algorithm, in Section 5.2, and finally, in Section 5.3, providing the strategies followed to parallelize the beam former on a GPU. 5.1 Data structures The input data structures for the beam forming algorithm are two: the samples and the metadata. 
The samples are essentially the values measured from the stations, while the metadata are the delays that have to be applied to the samples to form a beam. The only output data structure contains, for each formed beam, the merged samples of all the stations. In our parallel implementation we use the same data structures that are used by the ASTRON code for the LOFAR beam former, without any change. The source code of these data structures is available in Appendix F. The class representing the input samples is called SampleData (Section F.1). Apart from its own logic, used to implement operations like read from and write to permanent storage, or memory allocation, it contains two internal structures. The first of them is a four dimensional array, named samples, containing the complex values representing the measured signals. The dimensions of the array are, in order, the channels, the stations, the time intervals in which a second is divided, and the measured polarizations. This multidimensional array is allocated as a contiguous memory segment than can be 32 Chapter 5. Application analysis 33 accessed as a big linear array; this property is important for the parallel implementation on the GPU because it permits a single, fast memory transfer between the host and the device. Otherwise, some intermediate step would have been necessary to transform the data structure into one that could be handled by the GPU. The second internal data structure is a vector of sparse sets, named flags, with a sparse set for each station showing the intervals of samples that are flagged. A measured sample is flagged if it contains some sort of error, in which case it is simply excluded from the computation and the corresponding output is set to zero. The class representing the metadata is called SubbandMetaData (Section F.2). Besides its own logic (in which we are not interested for this work), it contains an array of arrays for each station. Those second arrays contain, for each beam that has to be formed, the delays that should be applied to a station’s sample to form the given beam. Finally, the class representing the output values, i.e. the formed beams, is called BeamFormedData (Section F.3), and it is derived from the same parent of SampleData. The only changes in this output data structure (compared with the input one) are the dimensions of the multidimensional array, that in this case are, in order, the beams, the channels, the time intervals in which a second is divided and the measured polarizations. Everything else said about SampleData remains the same for the BeamFormedData. 5.2 The beam forming algorithm The reference sequential version of the beam forming algorithm has been derived from the C++ ASTRON code. The algorithm is divided in three phases: 1. Delays computation; 2. Flags computation; 3. Beams computation. We will now discuss these steps in more detail. Chapter 5. Application analysis 5.2.1 34 Delays computation The delays computation phase combines two delay values, the one at the beginning of the measurement and the one at the end, that are provided to the algorithm via a SubbandMetaData object, and stores the result as a single double precision floating point value for each station-beam combination, as can be seen in Listing 5.1. Computed delays are stored in a matrix, delays, that is stored into the BeamFormer object. for ( unsigned int station = 0; station < nrStations ; station ++ ) { double c o m p e n sa t e d D e l a y = ( metaData . beams ( station )[0]. 
delayAfterEnd + metaData . beams ( station )[0]. delayAtBegin ) * 0.5; delays [ station ][0] = 0.0; for ( unsigned int beam = 1; beam < nrBeams ; beam ++ ) { delays [ station ][ beam ] = (( metaData . beams ( station )[ beam ]. delayAfterEnd + metaData . beams ( station )[ beam ]. delayAtBegin ) * 0.5) - c o m p e n s a t e d D e l a y ; } } Listing 5.1: Phase 1: delay computation It is not necessary to compute the delay for the central beam, i.e. beam number 0, of each station, because is assumed that the input provided to the algorithm has already been compensated for it. It is important to note also that, when computing the beam forming for many different input samples, but with the same metadata, it is enough to perform this phase once, at the first iteration. This is the case for all the observations performed with the LOFAR. 5.2.2 Flags computation The goal of the flags computation phase is to discard the stations with too much flagged data, i.e. stations containing too many errors, to avoid the pollution of the formed beams with measurement that are not correct. The code of this phase can be seen in Listing 5.2. n rV al id S ta ti on s = 0; for ( unsigned int station = 0; station < nrStations ; station ++ ) { if ( isValid ( station ) ) { isVal idStati on [ station ] = true ; n rV al id S ta ti on s ++; Chapter 5. Application analysis 35 } else { isVal idStati on [ station ] = false ; } } for ( unsigned int beam = 0; beam < nrBeams ; beam ++ ) { outputData . flags [ beam ]. reset (); for ( unsigned int station = 0; station < nrStations ; station ++ ) { if ( isVa lidStati on [ station ] ) { outputData . flags [ beam ] |= inputData . flags [ station ]; } } } Listing 5.2: Phase 2: flags computation The flags computation has two loops. In the first one, each station is checked to see if it’s valid or not. A station is valid if the percentage of its samples that are flagged doesn’t exceed a certain upper bound (defined elsewhere in the code). The number of valid stations is saved and an array of boolean values is populated, to provide a faster check on the validity of a given station. The second loop sets the flags of the output data. The flagging policy is straightforward: if an input sample is flagged, even if just for one of the input stations, then the correspondent output sample is flagged too. Invalid stations are excluded because their values are not used further in the computation (i.e., they are not affecting the output). 5.2.3 Beams computation The beams computation is the core of the beam forming algorithm: it computes the different beams that are obtained by merging the samples from all the stations. The code is provided in Listing 5.3. The phaseShift function used in the code is shown in Listing 5.4. double a ve r ag in gF a ct or = 1.0 / nr Va l id St at i on s ; for ( unsigned int beam = 0; beam < nrBeams ; beam ++ ) { for ( unsigned int channel = 0; channel < nrChannels ; channel ++ ) { double frequency = baseFrequency + channel * c h a n n el B a n d w i d t h ; Chapter 5. Application analysis 36 for ( unsigned int time = 0; time < nrSamples ; time ++ ) { if ( ! outputData . flags [ beam ]. test ( time ) ) { // valid sample for ( unsigned int pol = 0; pol < nr P ol ar iz a ti on s ; pol ++ ) { outputData . samples [ beam ][ channel ][ time ][ pol ] = makefcomplex (0 , 0); for ( unsigned int station = 0; station < nrStations ; station ++ ) { if ( isVa lidStati on [ station ] ) { fcomplex shift = phaseShift ( frequency , delays [ station ][ beam ]); outputData . 
samples [ beam ][ channel ][ time ][ pol ] += inputData . samples [ channel ][ station ][ time ][ pol ] * shift ; } } outputData . samples [ beam ][ channel ][ time ][ pol ] *= av e ra gi ng F ac to r ; } } else { // flagged sample for ( unsigned int pol = 0; pol < nr P ol ar iz a ti on s ; pol ++ ) { outputData . samples [ beam ][ channel ][ time ][ pol ] = makefcomplex (0 , 0); } } } } } Listing 5.3: Phase 3: beams computation fcomplex phaseShift ( double frequency , double delay ) { double phaseShift = delay * frequency ; double phi = -2 * M_PI * phaseShift ; return makefcomplex ( cos ( phi ) , sin ( phi )); } Listing 5.4: phaseShift function The external loop is performed for each beam that has to be formed by the algorithm. For all the channels and the time samples, if they are not flagged, all the valid stations are merged, for all the measured polarizations. The phaseShift function, given the frequency and the delay, provides the complex shift that has to be multiplied to each sample from each valid station. The sum of all the shifted samples is then multiplied with an average factor. In case a time sample is flagged, the output value is simply set to zero. Chapter 5. Application analysis 5.3 37 Parallelization strategies The sequential algorithm described in the previous sections can be parallelized with a data-parallel strategy, i.e. it will be possible to perform the same operation on different items at the same time. The sequential algorithm is composed by three interdependent steps, so a task-parallel strategy does not seem suitable for this parallelization. We will now analyze how the three phases may be parallelized. The first phase of the beam former, the delays computation (Section 5.2.1), can be independently computed for each station-beam pair; the same seems to be true for the second phase, the flags computation (Section 5.2.2). However, what appears so simple at a first glance, is not after a deeper analysis. In the delays computation phase, several double precision floating point values are summed and multiplied for each station-beam pair; after, the computed values are stored for later reuse. The involved data structures are nothing more than simple arrays. So, to parallelize this phase on a GPU, we just need to copy the input arrays in the video card’s memory, perform the arithmetical operations in parallel, e.g. using a different thread for each different station-beam pair, and then copy the results back to the main memory. Moreover, because the computed values will be used in the third phase, it should be possible to compute them on the GPU and leave them there, reducing the number of necessary memory transfers. If we look, instead, at the flags computation phase, we see that it has no computation, but only checks for the quality of the data, and these checks are implemented with special functions and operators, defined on data structures more complex than the previously involved arrays. To parallelize this phase on a GPU we would need to modify the input and output data structures or, in order to avoid modifications on legacy code, write wrappers and new intermediate data structures compatible with the GPU architecture. But introducing this compatibility layer will certainly result in a downgrade in performance. Thus, we decided not to parallelize this phase on the GPU, and leave its execution on the CPU. What is more interesting is the parallelization of the third phase, the beams computation (Section 5.2.3). 
First of all, from the sequential algorithm it is possible to see that all the operations involved are arithmetical, but the check to determine if a given sample from a Chapter 5. Application analysis 38 station is correct or not. This check, however, can be skipped with minor modifications to the algorithm (i.e. replacing the incorrect samples with zeros in the previous phase). Therefore, this phase is a good candidate for being parallelized on a GPU. We think, moreover, that the parallelization of this phase can bring major improvement to the algorithm’s performance. From the sequential code we can see that each combination of beams, channels, time samples and polarizations is independent from the others, and can be computed in parallel without interdependencies. A possible parallelization strategy is, then, to assign a different combination of these parameters to each thread, with the thread multiplying and summing the values for all the stations. Or, to avoid synchronization issues, leave the final sum of all the shifted samples out of this phase, and parallelize it later, with another kernel. In fact, we can obtain different strategies just by organizing the matching between the threads and the data in a different way. What is for sure important is the memory’s access pattern, and the level of reuse that can be achieved with the different strategies. We know that data reuse is important because we know that to form a beam we need the samples, corresponding to a certain channel, time and polarization, from all the stations. But, we need the same samples to form all the beams, not just one of them; what changes between different beams is just the delay computed during the first phase. Like the organization of the matching between threads and data could lead to different parallelization strategy, a different scheme for samples’ reuse can do the same. We can also see in the sequential algorithm, that the complex shift, that needs to be multiplied to each sample, depends only on three of the parameters: channel, station and beam. It should be possible, to extract this step from the main algorithm, parallelize it, and simply store the results in the video card’s memory for further accesses. Besides, another strategy can be tried, in which this shift computation is moved back in the pipeline and merged with the delays computation phase: this will result in a new phase, that we can call shift computation, that needs to be performed on the GPU just once per computation. This reorganization of the algorithm also simplifies the operations of the third phase, and reduce the number of calls to costly functions like cos and sin. We can conclude here that the beam forming algorithm is a good candidate for being parallelized on GPUs. However, achieving good performance will not be trivial. We have to implement, test, benchmark and tune multiple parallelization strategies. As we Chapter 5. Application analysis 39 believe that the bottleneck will be the video card’s memory, we focus on those strategies able to maximize the data reuse between different threads, and to provide coalesced access to memory. Chapter 6 CUDA BeamFormer In this chapter we present six different versions of the beam forming algorithm, all developed with CUDA. These versions are presented in the context of an experiment, set up as described in Section 6.1. The experiments are designed to test performance and, to analyze which parallelization strategies best suite the beam forming process on modern NVIDIA GPUs. 
A comparison between all six versions is provided in Section 6.8, together with our conclusions. 6.1 Experimental setup We performed a series of experiments to understand how our six implementations of the beam former scale and which are the best parallelization strategie on a modern NVIDIA GPU. Each one of the six developed beam forming algorithm versions is described in detail in one of the following sections. The operational intensity and the number of utilized registers for all of them are presented in Table 6.1. The optimization strategy implemented by each version, as well as the code differences between them, are summarized in Table 6.2. The experiments are performed running a single execution of each developed version, and varying two of the input parameters: the number of stations to merge to form a single beam, and the number of beams to form. Both input parameters vary, independently, in the space of the powers of 2, with the number of stations varying between 21 and 28 and the number of beams varying between 21 and 29 . The other parameters, i.e. 40 Chapter 6. CUDA BeamFormer Kernel 1.0.2a 1.0.2b 1.1 / 1.1.1c 1.1 2x2 / 1.1.1 2x2d 1.2e 1.3f 1.4g 1.5h a For For c For d For e For f For g For h For b 41 Operational intensity 0, 65 3+(2∗log2 (#stations)) 16 0, 3 0, 41 9+(16∗#stations per block) 32+(24∗#stations per block) 9+(16∗#stations per block) 40+(16∗#stations per block) 1+#beams per block∗(8+2∗log2 #stations) 8+8∗#beams per block (#stations per block∗#beams per block∗16)+9∗#beams per block #stations per block∗(16+8∗#beams per block)+32∗#beams per block Registers 29 25 17 23 [25, 36] 26 18 [29, 63] the samples computation with mixed single and double precision operations the samples addition and for thread #0 thread #0 computing the last station thread #0 computing the last station the computation of the last station the computation of the last station thread #0 the computation of the last station Table 6.1: Operational intensity and registers used by each kernel Version Optimization strategy 1.0 No optimizations, it follows the ASTRON algorithm structure. Separation of single and double precision floating point operations, avoid temporary memory buffer. Computation of more beams per iteration. Coalesced access to memory. Coalesced access to memory. Avoid idle threads, coalesced access to memory, improved data reuse per thread blocks. 1.1 1.2 1.3 1.4 1.5 Code differences with previous version Single kernel, separate shift computation phase. Introduction of the station-beam block. Complete rewriting. Complete rewriting. Complete rewriting. Table 6.2: Algorithms’ optimization strategies and code differences. the number of channels, time samples and polarizations, are kept constant, with values of 256, 768 and 2 respectively. This is not an unrealistic assumption because in the production environment stations and beams are more likely to change. The values chosen are representative for the LOFAR scenarios. There is no flagged data in the input, so all the data is used in the computations. In each experiment, we measure the execution time and the time taken only by the kernels running on the GPU; the former is used to measure how the different algorithms Chapter 6. CUDA BeamFormer 42 scale, the latter to derive two other performance metrics; the number of single precision floating point operations per second, measured in GFLOP/s, and the achieved memory bandwidth, measured in GB/s. These two metrics are used to compare the different versions and measure the hardware utilization. 
The machine used for the experiments has one Intel Core i7-920 CPU, 6 GB of RAM and a NVIDIA GeForce GTX 480 video card. The GeForce GTX 480 uses the NVIDIA GF100 GPU, with 480 computational cores, that provides a theoretical peak performance of 1344,96 GFLOP/s and can sustain a memory bandwidth of 177,4 GB/s accessing its on-board 1536 MB of RAM. The machine’s operating system is Ubuntu Linux 9.10. The host code is compiled with g++ version 4.4.1 and the device code is compiled with nvcc version 0.2.1221; we use CUDA version 3.1. 6.2 BeamFormer 1.0 BeamFormer 1.0 follows strictly, in its structure, the ASTRON C++ implementation, as presented in Section 5.2; from the three computational phases composing the algorithm, only the third one (the beam forming phase) is parallelized taking advantage of the GPU acceleration. Two different kernels are used to implement this phase: the first one computes the weighted samples for each station and stores them into a temporary buffer on device memory, while the second sums all these samples to form the beam. The first kernel is executed one time for each beam to form and the second one twice, once for each polarization. The structure of the CUDA grid is the same for both kernels, with a thread block created for each channel-time pair; the block structure is different, and this is the reason for the different times the kernels need to be executed to form a beam: for the first kernel there is a thread for each station-polarization pair while for the second there is only a thread for each station. The use of a temporary buffer to store the beam samples before merging them is a limitation for the number of instances that can be computed: this version can compute at most 256 beams and merge 128 stations. Version 1.0 consists of one implementation. The execution times measured for this version are presented in Table A.1. The data shows that the algorithm scales linearly. From the other two performance metrics, only the achieved bandwidth is presented in Table C.1, because version 1.0 mixes together double and single precision floating Chapter 6. CUDA BeamFormer 43 point instructions (making it impossible to compute an accurate GFLOP/s value). The memory bandwidth follows an expected trend, being stable when varying the beams for a fixed number of stations to merge, and scaling linearly when varying the stations to merge for a fixed number of beams to form, because more threads are run when the number of stations to merge is increased. The values that are higher than the card’s maximum memory bandwidth are due to the caching system. Although the BeamFormer 1.0 scales linearly, the execution times are still far from what we expect to achieve parallelizing the beam forming algorithm on a GPU. So far, following the structure of the sequential code does not look like a good parallelization strategy. 6.3 BeamFormer 1.1 The idea behind the BeamFormer 1.1 is to provide a separation of double and single precision operations and avoid the use of a temporary buffer to store partial results. Separating single and double precision floating point operations allows us to compute the achieved GFLOP/s, and to have a better understanding of the performance of our beam formers. Moreover, the use of a temporary buffer to store partial results increases the number of accesses to the video card’s global memory, and this is an expensive operation, and also increases the amount of memory that needs to be allocated, thus reducing the number of computable instances. 
To achieve these goals, the structure of the algorithm has been modified. The code is still structured in three phases, but the first and the third phases are changed compared to the sequential code. In the first phase, instead of computing the delays, the complex weights needed for the beam forming are computed and stored permanently on the GPU. A value is computed for each channel, station and beam combination, and is used later in the third phase of the algorithm; these values are computed only once in the first execution of the BeamFormer, and can be used later on to compute more beams (which is the normal case in the production environment). Double precision floating point operations are only necessary in this phase, as the beam forming phase works only with single precision operations. Chapter 6. CUDA BeamFormer 44 For the BeamFormer 1.1 family, four different implementations are used in the experiments: 1.1, 1.1 2x2, 1.1.1 and 1.1.1 2x2. They differ only in the third phase of the algorithm, that is implemented by a single kernel. Implementation 1.1 allocates enough memory on the device to store all the formed beams, while implementation 1.1.1 only allocates memory to store the beams computed in a single kernel invocation, thus permitting to solve instances covering the whole input space of the experiments at the expense of more memory transfers. The CUDA grid is structured such that a thread block is used for each channel-time pair and, inside blocks, a thread is created for each beam to form in a single kernel invocation. That is one for implementations 1.1 and 1.1.1 and two for implementations 1.1 2x2 and 1.1.1 2x2. The execution times measured, for the four implementations, are presented in Tables A.2 to A.5 (Appendix A), and they show that the BeamFormer 1.1 scales linearly. The achieved memory bandwidth, presented in Tables C.2-C.5 (Appendix C), is almost the same between implementations 1.1 and 1.1.1 and between implementations 1.1 2x2 and 1.1.1 2x2, and scales linearly when increasing the number of beams to compute in a single iteration; the same behavior is shown with the achieved GFLOP/s in Tables B.1-B.4 (Appendix B). The performance of version 1.1 is low because the low number of threads per block (two in the best case) creates a situation in which each streaming multiprocessor gets less threads than available cores, and so the GPU is heavily underutilized. 6.4 BeamFormer 1.2 In order to increase the number of threads per block, version 1.1 is modified to compute more beams per kernel execution. This modification is implemented in BeamFormer 1.2. The modification is based on the concept of station-beam block: a block of NxM indicates that within a single kernel execution we are merging N stations into M beams. How many times a kernel needs to be executed to solve an input instance depends on the input and station-beam block dimensions. For the BeamFormer 1.2, we have seven different implementations. Implementations 1.2 2x2, 1.2 4x4 and 1.2 8x8 are used to demonstrate that a bigger station-beam block implies better performance; as can be seen in Tables A.6 to A.8 (Appendix A), not only each implementations scales linearly with the input, but doubling the size of the block Chapter 6. CUDA BeamFormer 45 halves the execution time on the same input instance. The other metrics, presented in Tables B.5 to B.7 (Appendix B) and Tables C.6 to C.8 (Appendix C), show the same behavior. 
However, the values are still far from the theoretical peaks of the platform: the best implementation just reaches a bit more than 1% of the video card’s capabilities. In implementation 1.2.1, the code is modified to permit a runtime definition of the stationbeam block; for the experiment, the station-beam block is set with the dimensions of the input instance, thus leading to a single kernel execution. Table A.9 (Appendix A) shows that the execution times for all the different input instances are low enough to be considered almost constant, with the trend becoming linear with big instances. The issue with this implementation is that the bigger the number of beams computed on a single execution is, the bigger the allocated memory is; so, it is not possible to solve all the instances of the experiment’s input space. Looking at both the achieved GFLOP/s and GB/s, in Tables B.8 and C.9, respectively, the linear trend is lost when the number of stations to merge is fixed, and the number of beams to form is varied: in this case the values are increasing, reaching a maximum and then decreasing. Implementation 1.2.1 achieves, when merging 256 stations to form 128 beams, 218,16 GFLOP/s, nearly 16% of the platform theoretical peak. Implementations 1.2.2, 1.2.1.1 and 1.2.2.1 are written and tested to better understand the memory behavior. In implementation 1.2.2, we changed the input and output data structures, described in Section 5.1, reordering the dimensions of the multidimensional arrays to match the CUDA grid structure and permit an improved coalesced access to memory. Tables A.11, B.10 and C.11 show good performance numbers, with an execution time that scales linearly and the achievement of 267,83 GFLOP/s (19% of the card capabilities). This value is higher than what the Roofline model [27] predicts for an operational intensity of just 0,66; the extra-performance is due to the improved memory bandwidth provided by the cache. The experiment also shows that improving the memory coalescing is important. But reordering the data structures, especially with big instances and in a production environment, may cost too much to be effective. Thus, a reordering of the computation should be taken into account. Implementations 1.2.1.1 and 1.2.2.1 differ from their parent implementations, 1.2.1 and 1.2.2 respectively, in how the data reuse is implemented inside each thread block: left to the CUDA cache hierarchy in the latter and manually addressed with the shared memory in the former. The execution time measured, presented in Tables A.10 and Chapter 6. CUDA BeamFormer 46 A.12 (Appendix A), show values that are comparable with the ones measured with the cache. The measured GFLOP/s, presented in Tables B.9 and B.11 (Appendix B), and GB/s, presented in Tables C.10 and C.12 (Appendix C), are lower than the ones of the “parents”. However, they show the same trend, as expected from this version. Our results prove that in the case of the beam forming algorithm on the NVIDIA Fermi architecture, the data reuse can be left to the cache and the manual use of shared memory appears redundant. 6.5 BeamFormer 1.3 Reordering the structure of the computation to improve the coalesced access to memory in order to improve performance, is the goal of BeamFormer 1.3. The grid organization reflects the ordering of input and output data structures: the grid has one thread block for each channel-beam pair and each block has a number of threads at least equal to the number of time samples. 
The kernel merges a block of stations in each execution; in the last execution, it sums them and computes the final beam value. In the performed experiments, all the stations are merged in a single execution. The implementation scales well, as can be seen in Table A.13, and its behavior is symmetric, that is, approximately the same time is necessary to compute an x × y and an y × x input instance. Values are stable, at least for big input instances, in terms of achieved GFLOP/s (as seen in Table B.12), but are low compared to the capabilities of the video card. The memory bandwidth, seen in Table C.13, is stable too. However, the memory occupancy is too high to permit the computation of 512 beams. It becomes clear that, in order to achieve stable performance, it is necessary to separate the number of threads from the input parameters that vary too much and are too low to keep the hardware busy all the time. 6.6 BeamFormer 1.4 BeamFormer 1.4 is another attempt to reorder the computation to improve performance. The grid is organized to have a thread block for each channel-time pair and inside each block, a thread for each station-polarization pair. Each kernel computes a partial result Chapter 6. CUDA BeamFormer 47 and then all the kernels inside a block collaborate to perform two parallel reductions, one for each polarization, to first sum all the computed values and then store the result in global memory. This BeamFormer 1.4 scales linearly, but the measured times, presented in Table A.14, are higher (i.e. worse) than what we are looking for at this point. Moreover, Tables B.13 and C.14 show that both the number of single precision floating point operations and the achieved memory bandwidth are extremely low. The issue with this BeamFormer 1.4 is that, trying to find a good mapping between the data and the computation to improve the access pattern to memory, we reduced the data reuse between the threads of a same block. Moreover, the parallel reductions idle too many threads, causing a critical underutilization of the hardware. 6.7 BeamFormer 1.5 Version 1.5 aims, at the same time, at avoiding idle threads, accessing the memory in a coalesced way by means of a computational structure that matches the input and output data structures, and being stable in terms of performance. In BeamFormer 1.5 we have, inside the CUDA grid, a number of thread blocks that is at least the number of channels, and each block has a number of threads that is at most the number of time samples. This version also relies on the concept of the station-beam block, and so the dimensions of the grid are set at runtime, because a station-beam block computing more beams at the same time needs more registers and forces us to reduce the number of threads per block. Three different implementations of version 1.5 are tested, each of them using a different size for the station-beam block. The execution times of implementations 1.5 2x2, 1.5 4x4 and 1.5 8x8 are presented in Tables from A.15 to A.17 (Appendix A). The measured values are low, but they scale linearly and are symmetric; moreover, it is possible to compute instances covering all the experiment’s input space. It is important to also note that execution times, for the same input instance, are nearly halved when the number of blocks is increased from an implementation to the successive one. 
The achieved GFLOP/s, as can be seen in Tables from B.14 to B.16 (Appendix B), are high considering how small the station-beam block dimensions are, but are decreasing with the Chapter 6. CUDA BeamFormer 48 increase of the number of beams to form. A small decrease in performance is indeed expected, because forming more beams without increasing the size of the station-beam block, requires more kernel executions. The measured decrease in performance is too big compared to the expected one. This behavior appears not to be due to the algorithm, but to the CUDA compiler, as we will show in Section 7.4. Overall, the BeamFormer version 1.5 reaches 84% of the GFLOP/s predicted applying the Roofline model, which is a good result. Tables from C.15 to C.17 (Appendix C), presenting the achieved memory bandwidth, show the same trends as the measured GFLOP/s. 6.8 Conclusions Finally , we present a comprehensive comparison of the performance results of the developed CUDA versions of the BeamFormer; for the comparison we restrict ourselves to the case of merging 64 stations, which is an important use case for ASTRON. Figures 6.1 and 6.2 show the execution time for all the different implementations of the BeamFormer. We can see that all the implementations scale linearly, a first important result that permits us to affirm that it is possible to efficiently implement a beam forming algorithm with CUDA on a NVIDIA GPU. The fastest implementation is the BeamFormer 1.2.2; however this implementation, along with BeamFormer 1.2.1, has an increase in the slope of the curve for more than 128 beams to form. Moreover, both these implementations, together with the BeamFormer 1.3, are incapable of computing 512 beams. The version capable of computing the highest number of beams (in our experiment) and still being among the best performing algorithms is the BeamFormer 1.5. If we also consider that the second ranked implementation (BeamFormer 1.5 8x8) performs many more kernel executions compared to the first ranked, due to its small station-beam block, version 1.5 appears the real winner for what concerns the execution times. Figures 6.3 and 6.4 provide the comparison between the best performing versions for achieved GFLOP/s and GB/s, respectively. The figures show that almost all implementations (with the exception of 1.2.1 and 1.2.2) are fairly stable in their trends. The two unstable implementations are the ones obtaining the highest values in both performance metrics. Chapter 6. CUDA BeamFormer 49 80 BeamFormer 1.0 BeamFormer 1.1 BeamFormer 1.1_2x2 BeamFormer 1.1.1 BeamFormer 1.1.1_2x2 BeamFormer 1.2_2x2 BeamFormer 1.2_4x4 BeamFormer 1.2_8x8 BeamFormer 1.4 70 Execution time (s) 60 50 40 30 20 10 0 0 100 200 300 400 500 Pencil beams Figure 6.1: Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better). From these results we choose version 1.5 as the best beam forming algorithm for the NVIDIA GPUs and use it for future investigations. Not only this version shows a stable behavior, permits to solve large input instances, and has good performance, but it is also open to further improvements, especially concerning the dimensions of the stationbeam block. We motivate the thought that improvements are still possible with the BeamFormer 1.5 because we measured a value of 411,86 GFLOP/s using a station-beam block of 256 × 8; that is 30% of the video card’s capabilities, and a value higher than everything else measured while performing the previously described experiments. 
To conclude, we want to summarize the crucial aspects needed to obtain good performance with the beam forming algorithm on NVIDIA GPUs: 1. use a high number of independent and not idle threads, 2. structure the computation to match the input and output data structures, thus permitting a coalesced access to device memory, Chapter 6. CUDA BeamFormer 50 3 BeamFormer 1.2.1 BeamFormer 1.2.2 BeamFormer 1.3 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 2.5 Execution time (s) 2 1.5 1 0.5 0 0 100 200 300 400 500 Pencil beams Figure 6.2: Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better). 3. keep the kernels as simple as possible, leaving them with the only job of performing arithmetic operations, while performing synchronization on the host by means of multiple kernel executions, 4. optimize for data reuse between the kernels of a same thread block; for this algorithm, when using the NVIDIA Fermi architecture, this optimization can be left to the cache system, without having to manually implement it with shared memory, 5. perform more memory transfers from host to device and vice versa, in order to be able to allocate less memory and then compute bigger instances; the performance of this algorithm is, in fact, so bound by the beam computation phase that the effect of the memory transfers is negligible. Chapter 6. CUDA BeamFormer 51 300 BeamFormer 1.2_4x4 BeamFormer 1.2_8x8 BeamFormer 1.2.1 BeamFormer 1.2.2 BeamFormer 1.3 BeamFormer 1.4 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 250 GFLOP/s 200 150 100 50 0 0 100 200 300 400 500 Pencil beams Figure 6.3: GFLOP/s of various BeamFormer versions merging 64 stations (higher is better). 400 BeamFormer 1.2_4x4 BeamFormer 1.2_8x8 BeamFormer 1.2.1 BeamFormer 1.2.2 BeamFormer 1.3 BeamFormer 1.4 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 350 300 GB/s 250 200 150 100 50 0 0 100 200 300 400 500 Pencil beams Figure 6.4: GB/s of various BeamFormer versions merging 64 stations (higher is better). Chapter 7 OpenCL BeamFormer In this chapter we present an implementation of the BeamFormer 1.5, as described in Section 6.7, using the Open Computing Language (OpenCL) [28]: the reason for this implementation is the analysis of the achievable performance of our beam forming algorithm when using a framework that is focused on the portability of the code, to eventually compare this implementation with the previously developed CUDA one. First, we introduce in Section 7.1 what OpenCL is and how it works. Then, we briefly discuss the changes that are necessary to port the code between the two different frameworks, in Section 7.2. The experiments, that resemble the ones we already performed with the CUDA implementation, are presented in Section 7.3, along with the results. Conclusions are provided in Section 7.4, together with a comparison of the CUDA and OpenCL implementations. 7.1 The Open Computing Language OpenCL, the Open Computing Language [28], is an open and royalty-free standard for general purpose parallel programming on heterogeneous architectures. Initially developed by Apple, it is now supported by the Khronos Group with many different participants from the industry. OpenCL’s goal is to allow developers to write portable code that is compiled at run-time and executed on different parallel architectures, like multicore CPUs and many-core GPUs. The OpenCL standard defines an API to manage the 52 Chapter 7. 
OpenCL BeamFormer 53 devices and the computation and a programming language, called OpenCL C, that provides parallel extensions to the C programming language. Both the Single Instruction Multiple Data (SIMD) and the Single Program Multiple Data (SPMD) paradigms are supported in OpenCL. Here we briefly introduce the computational model of OpenCL (more details are available in [7]). In the OpenCL platform model, there is a host connected to one or more computing devices. Each computing device contains one or more compute units (CUs), and each CU contains one or more processing elements (PEs). The OpenCL application runs on the host, managing the computing devices via the functions provided by the OpenCL API; the execution of the kernels and the memory are managed by the application submitting commands to queues that are associated with the computing devices. The OpenCL kernel instances, called work-items, are executed by the PEs, and each workitem is identified by an integer vector in a three dimensional index space that is called NDRange and that represents the computation as a whole. Work-items are grouped into work-groups; a work-group is executed by a compute unit. As with the work-items, each work-group is identified by a vector in the same dimensions of the NDRange that is used for the work-items. Figure 7.1 shows an example of this computational hierarchy using an NDRange with the third dimension set to zero. Figure 7.1: NDRange example, courtesy of Khronos Group [7]. Chapter 7. OpenCL BeamFormer 54 OpenCL also defines a memory hierarchy that is organized in three levels. At the lowest level, each work-item has access to a read/write private memory that is statically allocated by the kernel and cannot be accessed by the host. A level above, all the workitems inside the same work-group share another read/write memory called local memory. All the work-items of all work-groups, have also access to two global memories, at the highest level of the OpenCL memory hierarchy. Of these two memories, only one, called global memory, is writable, while the other one is a read-only memory called constant memory. The two global memories are allocated by the host application, that can also perform copy operations on them. There is no explicit knowledge, in OpenCL, about which of these memories are cached, because this depends on the actual capabilities of the device on which the code is executed. 7.2 Porting the BeamFormer 1.5 from CUDA to OpenCL Comparing the descriptions of CUDA and OpenCL, described from Sections 4.4 and 7.1, respectively, it is clear that they share many concepts. Thus, that porting the same algorithm between the two of them should be straightforward. We describe here the main technical differences between the two implementations, starting with the host code. The host code is modified only to address the different syntax of the two APIs. The major difference is the addition, in the OpenCL implementation, of the code to generate and compile the kernels at run-time. In addition, the number of files containing the source code is reduced because, using only one compiler (and not a combination of nvcc and g++), there is no need for separating the host and device code. For the kernels, we first added an additional one to perform the operation of setting all the elements of a certain memory area to a given value, because OpenCL does not include a function like CUDA’s memset. 
Then, we modified the code of the other two kernels, the one to compute the weights and the one to compute the beams, with two small changes: the kernel signatures are rewritten, because the OpenCL syntax differs here from the one of CUDA, and the access to global memory is performed using the C array syntax instead of the one based on pointer arithmetic. Chapter 7. OpenCL BeamFormer 7.3 55 OpenCL BeamFormer performance Porting of the BeamFormer 1.5 from CUDA to OpenCL is, as described, a simple process; what we want to discover is if this OpenCL implementation can achieve good performance. The setup of the experiments performed here is the same as the one described in Section 6.1, with the only difference that OpenCL 1.1 is used instead of CUDA 3.1 and nvcc. The three metrics we measure are the execution time, the number of single precision floating point operations per second and the achieved memory bandwidth, measured in seconds, GFLOP/s and GB/s respectively. We developed three different implementations (BeamFormer 1.5-opencl 2x2, 1.5-opencl 4x4 and 1.5-opencl 8x8) using the same station-beam block size as the CUDA ones. Their execution times are presented in Tables from D.1 to D.3 (Appendix D). The measured values are nearly constant: in fact, there is almost no increment in the execution time when the input is increased. Times start to grow only when computing big instances. However, this is clearly due to the just-in-time OpenCL compiler: the measured execution times for the OpenCL implementations include, other than the computation itself, the time needed to generate and compile all the kernels, and is not surprising that for a single execution the overhead of these operations is significant. As a result, more interesting for the OpenCL implementations are the other two performance metrics. The achieved single precision floating point operations per second are reported in Tables from D.4 to D.6 (Appendix D). We see in these tables values that are stable for both small and big instances, with a difference in percentage that is at most 12%. Moreover, the highest measurements are always in correspondence with the input that exactly matches the dimension of the implementation’s station-beam block. For what concerns the memory bandwidth, in Tables from D.7 to D.9, we see the same trends as with measured GFLOP/s. For both performance metrics, the values achieved by implementations 1.5-opencl 2x2 and 1.5-opencl 4x4 are in line with our expectations, while the achieved performance of implementation 1.5-opencl 8x8 is lower than expected, also if in terms of absolute values this implementation achieves the highest measurements of all the OpenCL ones. We will explain this in more detail in Section 7.4. Chapter 7. OpenCL BeamFormer 7.4 56 Conclusions Writing parallel code with OpenCL is not more difficult than writing the same code with CUDA, and porting the BeamFormer from one framework to the other was an easy task. Therefore, we can freely use OpenCL to obtain software that is portable between different architectures, being them multi-core CPUs, modern many-core GPUs, or even other kinds of parallel processors. What still needs an answer is the question if, given the same algorithm and the same hardware, the portable OpenCL code performs as well as code developed with a native framework, that in our terms means to discover if the OpenCL and CUDA implementations of the BeamFormer 1.5 algorithm achieve the same performance on the NVIDIA GTX 480 video card. 
200 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 BeamFormer 1.5-opencl_2x2 BeamFormer 1.5-opencl_4x4 BeamFormer 1.5-opencl_8x8 180 160 GFLOP/s 140 120 100 80 60 40 0 100 200 300 400 500 Pencil beams Figure 7.2: GFLOP/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better). Figures 7.2 and 7.3 provide a view of the achieved GFLOP/s and GB/s, respectively, of all the CUDA and OpenCL implementations. Looking at the results, we can say that good performance is possible: for both of the metrics, OpenCL implementations 1.5-opencl 2x2 and 1.5-opencl 4x4 achieve more stable and higher values than the ones Chapter 7. OpenCL BeamFormer 57 180 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 BeamFormer 1.5-opencl_2x2 BeamFormer 1.5-opencl_4x4 BeamFormer 1.5-opencl_8x8 170 160 GB/s 150 140 130 120 110 0 100 200 300 400 500 Pencil beams Figure 7.3: GB/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better). of their CUDA counterparts. Unfortunately, this is not true for implementation 1.5opencl 8x8, whose achieved values are lower than the ones of the BeamFormer 1.5 8x8. The fact that some OpenCL implementations perform better than the CUDA ones, and also that some of them don’t, is, however, independent of our algorithm and code, and is caused by the OpenCL compiler. As a matter of fact, we noticed many differences between the PTX code generated by the nvcc and by the OpenCL compilers, and we believe that these differences cause the performance discrepancies. In particular, we found that the lower performance achieved by the OpenCL implementation 1.5-opencl 8x8, when comparing it with the CUDA implementation 1.5 8x8, is due to a wrong management of the virtual registers in the code generated by the OpenCL compiler. The generated PTX code uses too many registers, causing a phenomenon called register spilling, and consequently inducing a dramatic increase in the accesses to global memory. If we want to see, as it should be, the curve of the BeamFormer 1.5-opencl 8x8 in Figure 7.2 lying all above the value of 180 GFLOP/s, an improvement of the OpenCL compiler is necessary. What we can conclude is that good performance is certainly possible with OpenCL. In fact there is no fundamental reason why OpenCL should be slower than a Chapter 7. OpenCL BeamFormer 58 native implementation. However, an improvement of the compiler is still necessary to close the performance gap on NVIDIA GPUs between CUDA and OpenCL. 3.5 BeamFormer 1.5_2x2 BeamFormer 1.5_4x4 BeamFormer 1.5_8x8 BeamFormer 1.5-opencl_2x2 BeamFormer 1.5-opencl_4x4 BeamFormer 1.5-opencl_8x8 3 Execution time (s) 2.5 2 1.5 1 0.5 0 0 100 200 300 400 500 Pencil beams Figure 7.4: Execution time in seconds of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (lower is better). Another small issue of OpenCL is the time needed for the code generation and compilation at run-time. Figure 7.4 shows that the execution times of the OpenCL implementations are shifted up of almost one second, and without combining this metric with the other two we would not even be able to understand if the OpenCL implementations scale well or not. In our context this is not a real problem because in the production environment the beam former is executed more than once, making the effect of these operations on the code negligible when compared to the total execution time. 
When using OpenCL, however, it is important to take this overhead into account, especially for algorithms whose average execution time is in the order of milliseconds or less.

Chapter 8

Finding the best station-beam block size

The performance of the BeamFormer 1.5 algorithm is influenced, among other factors, by the size of the station-beam block. So far, we have tried three different block sizes: 2 × 2, 4 × 4 and 8 × 8. We want to analyze further the way in which this parameter affects the performance of our beam forming algorithm on NVIDIA GPUs, and to find the station-beam block size that delivers the highest number of single precision floating point operations per second. In this chapter, Section 8.1 introduces the new experiment, while Sections 8.2 and 8.3 present the results with OpenCL and CUDA, respectively. Section 8.4 lists our conclusions and presents a comparison between the results obtained using the two frameworks.

8.1 Experimental setup

For this experiment we generated thirty-two different implementations of the BeamFormer 1.5, half of them using OpenCL and the other half using CUDA. So many implementations are necessary because, although changing the station component of the station-beam block does not imply changes in the source code, modifying the beam component does. The implementations are tested to find the station-beam block size that delivers the highest performance and, furthermore, to understand the way in which this parameter affects the algorithm's performance. The experiment is performed by running a single execution of each implementation, varying the input parameter that represents the number of stations to merge for every single beam. The parameter is first varied over the integers between 1 and 16, and then over the powers of two between 2 and 256. For each implementation, the input parameter associated with the number of beams to form is set to match the beam component of the implementation's station-beam block: in this way it is possible to test a different station-beam block size for each execution, eventually testing all the block sizes between 1 × 1 and 16 × 16, and then between 2 × 1 and 256 × 16. As with previous experiments, the other input parameters are kept constant. In each experiment we measure the time taken by the kernel running on the GPU. This value is used to derive the number of achieved single precision floating point operations per second, measured in GFLOP/s. Because each execution is associated with a station-beam block size, the measured values are used to compare the different sizes. The machine used for the experiments has one Intel Core i7-920 CPU, 6 GB of RAM and an NVIDIA GeForce GTX 480 video card. The GeForce GTX 480 uses the NVIDIA GF100 GPU, with 480 computational cores, which provides a theoretical peak performance of 1344,96 GFLOP/s and can sustain a memory bandwidth of 177,4 GB/s accessing its on-board 1536 MB of RAM. The machine's operating system is Ubuntu Linux 9.10. The host code is compiled with g++ version 4.4.1 and the CUDA device code is compiled with nvcc version 0.2.1221; we use CUDA version 3.1 and OpenCL version 1.1.

8.2 OpenCL results

The sixteen OpenCL implementations are generated exploiting the framework's capabilities for run-time code generation and compilation, without any code modification (a minimal sketch of this mechanism is shown below). The achieved GFLOP/s for station-beam block sizes from 1 × 1 to 16 × 16 are presented in Tables E.1 to E.4 (Appendix E).
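As an illustration of this run-time generation approach, the host code can splice the beam component of the block into the kernel source as a compile-time constant before handing the source to the OpenCL compiler. The sketch below uses hypothetical names and an empty kernel body; it is only meant to show the mechanism, not the generator actually used for these experiments.

// Sketch: generate an OpenCL kernel at run time for a given beam component size
// and compile it with the OpenCL JIT compiler. Error handling is omitted.
#include <CL/cl.h>
#include <sstream>
#include <string>

cl_kernel buildBeamFormerKernel(cl_context context, cl_device_id device, unsigned beamsPerBlock)
{
    // The beam component becomes a compile-time constant, so the compiler can
    // fully unroll the per-beam loop and keep the partial beams in registers.
    std::ostringstream source;
    source << "#define BEAMS_PER_BLOCK " << beamsPerBlock << "\n"
           << "__kernel void beamFormer(__global const float2 *samples,\n"
           << "                         __global const float2 *weights,\n"
           << "                         __global float2 *beams) {\n"
           << "  /* ... per-beam loop of length BEAMS_PER_BLOCK ... */\n"
           << "}\n";

    std::string src = source.str();
    const char *srcPtr = src.c_str();
    size_t srcSize = src.size();

    cl_program program = clCreateProgramWithSource(context, 1, &srcPtr, &srcSize, 0);
    clBuildProgram(program, 1, &device, "", 0, 0);          // run-time compilation
    return clCreateKernel(program, "beamFormer", 0);
}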
The highest measured value, 240,89 GFLOP/s, corresponds to the 16 × 7 station-beam block. If we keep the station component of the block fixed while increasing the beam component, we see a continuous increase in performance from a beam component of 1 up to 7; after that, performance decreases steadily. With the beam component of the block fixed and the station component varying, the measured GFLOP/s instead increase monotonically. Tables E.5 and E.6 (Appendix E) show the achieved GFLOP/s for station-beam block sizes from 2 × 1 to 256 × 16. When the station component is between 2 and 16, all the peaks correspond to a beam component of 7, while for larger station components the peaks shift back to a beam component of 6. As in the previous case, increasing the station component appears to always produce a performance gain, while the best beam component of the block is bounded between 6 and 7. The highest measured value, 392,16 GFLOP/s, is achieved with a block of 256 × 6 and represents 29% of the theoretical performance peak of the GTX 480 video card.

Figure 8.1: GFLOP/s for the OpenCL BeamFormer: block sizes from 64x1 to 64x16 (higher is better).

The results of this experiment indicate that, when configuring the station-beam block for the OpenCL BeamFormer 1.5:
• it is advantageous to set the station component equal to the number of stations to merge,
• and setting the beam component above 7, or above 6 when the station component is larger than 16, is counterproductive.
This trend can be seen in Figure 8.1, where the measured GFLOP/s are shown with the station component fixed to 64 stations.

8.3 CUDA results

For the sixteen CUDA implementations, we developed an external kernel generator for the BeamFormer, and then added a new kernel, one for each beam component size under test, to the code already described in Section 6.7. Tables E.7 to E.10 (Appendix E) present the GFLOP/s achieved with station-beam blocks of sizes between 1×1 and 16×16. The highest measured value, found for the block size 16×10, is 251,41 GFLOP/s. As expected, when varying the station component of the block we see a continuous increase in performance, with the peaks always corresponding to a station component of 16. The trend is more complex if we increase the beam component of the block while keeping the other component constant: performance initially increases, up to a certain peak, and then suddenly decreases. The peaks lie in the interval that includes three beam component sizes: 8, 9 and 10. The same trend is shown for station-beam block sizes from 2 × 1 to 256 × 16, as can be seen in Tables E.11 and E.12 (Appendix E). Moreover, an increase in the station component of the block corresponds to a shift of the optimal beam component towards 10. In this experiment the highest measured value is 427,07 GFLOP/s, more than 31% of the platform's theoretical performance peak. This value is measured with the station-beam block size of 256 × 10.
Figure 8.2 shows the performance trend with the station component fixed at 64 and the beam component varying. As a result of this experiment, we can state that the best station-beam block size for the CUDA BeamFormer is composed of the highest possible value for the station component and a beam component between 8 and 10.

Figure 8.2: GFLOP/s for the CUDA BeamFormer: block sizes from 64x1 to 64x16 (higher is better).

8.4 Conclusions

The comparison between the CUDA and OpenCL implementations of the BeamFormer 1.5, in terms of the station-beam block, can be summarized by Figure 8.3. The two plotted curves are indeed similar. This means, on the one hand, that the influence of the station-beam block size on the algorithm is independent of the framework used to implement it, and, on the other, that performance can be improved by a correct setup of this parameter. As in the comparison of the CUDA and OpenCL BeamFormers in Section 7.4, we see the OpenCL implementations perform better than their CUDA counterparts up to a certain point, which we can now quantify at beam components of 6 and 7, after which their performance rapidly falls. This discrepancy is, however, due to the two different compilers, as we already explained in Section 7.4, and it is not a property of the algorithm. In addition, the decrease in performance of the CUDA implementations can be explained by increased register spilling: the NVIDIA Fermi architecture poses a limit of 63 registers per thread, and since increasing the beam component of the station-beam block increases the register usage, we hit a hardware limit here.

Figure 8.3: Comparison of CUDA and OpenCL BeamFormers: block sizes from 64x1 to 64x16 (higher is better).

As a final remark, we can conclude that the station-beam block is an important parameter of the BeamFormer 1.5 version, and that for best performance the station component should be set to the total number of stations to merge, while the beam component should be set as high as possible, within the limits of 7 and 10 beams for the OpenCL and CUDA implementations respectively, and within the number of beams that have to be formed. As a side effect, the search for the best station-beam block size showed that the BeamFormer 1.5 can reach around 30% of the theoretical GFLOP/s of the GPU used, both with OpenCL and with CUDA.

Chapter 9

Conclusions

Our main research question, at the beginning of this work, was whether it is possible to efficiently parallelize the beam forming algorithm on a GPU. After parallelizing the algorithm following different strategies, and after testing all the implementations and collecting the results, we summarize here our answers to this question, starting with the results and contributions of this project. In Chapter 6 we showed that it is possible to implement well performing beam formers on an NVIDIA GTX 480 using CUDA. We found that, in order to achieve good performance, it is important to have a high number of independent, non-idle threads, each of them executing a kernel that is as simple as possible.
The best performing strategy was to let the kernel perform just arithmetic operations, while performing synchronization and other high-cost operations on the host. The memory access pattern is also of capital importance. We found that the best performing beam formers were the ones where the structure of the computation matched the input and output data structures, thus permitting coalesced accesses to device memory and reducing the number of read and write operations performed. To further reduce the number of memory accesses, data reuse between the threads of the same block also proved to be extremely important. We performed further experiments on our best performing beam former (the BeamFormer 1.5, described in Section 6.7), and we found (see Chapter 8) that the correct setup of the station-beam block parameter may improve the performance of the algorithm itself. Indeed, it was during the experiment aimed at finding how this station-beam block parameter affects the algorithm's behavior that we measured the highest GFLOP/s values, for both CUDA and OpenCL. We also found that the way in which the setup of this parameter affects the performance of our beam former is independent of the implementation framework: although a different implementation framework produces slightly different values, the performance trend remains the same. When using OpenCL to implement our beam former, as discussed in Chapter 7, we found that it is possible to achieve good performance, and in some cases even better performance than what we achieved with CUDA. Although the measured execution time was always higher using OpenCL, this was due to the overhead of the run-time environment (run-time kernel compilation and launching), which was measured together with the computation itself. For an algorithm whose average execution time is under a second, an added cost of 900 milliseconds is indeed a problem; however, the run-time overhead is considerably reduced when the kernels are executed multiple times. In fact, this overhead is not the biggest problem that we found with OpenCL. Instead, we found some compiler issues: the code produced by the OpenCL compiler uses too many registers (more than the equivalent CUDA code), causing dramatic register spilling to the slow global memory, and consequently a dramatic drop in performance. We hope that this problem will be fixed in a future version of the OpenCL framework. We also compared our best implementation with the current ASTRON implementation, which runs in production on an IBM Blue Gene/P. The comparison is provided in Table 9.1.

                               IBM Blue Gene/P   NVIDIA GTX 480
% of the theoretical GFLOP/s   80%               30%
Chip's GFLOP/s                 10,8              427,07
Power efficiency (GFLOPs/W)    0,456             1,708

Table 9.1: Comparison of the beam former running on the ASTRON IBM Blue Gene/P and on an NVIDIA GTX 480.

In terms of the achieved percentage of the theoretical GFLOP/s peak, the beam former running on the IBM Blue Gene/P is currently the winner, achieving 80% of the platform's theoretical GFLOP/s against our 30%. However, this efficiency in terms of hardware utilization is due to the narrow gap that the architecture of the IBM Blue Gene/P provides between the maximum number of achievable floating point operations per second and the maximum memory bandwidth.
In contrast, the GPU we used has a wide gap between its theoretically achievable GFLOP/s and GB/s: to reach an efficiency of 80% on the NVIDIA GTX 480, a kernel needs an operational intensity of more than 6 according to the Roofline model [27] (0,8 × 1344,96 GFLOP/s divided by 177,4 GB/s gives roughly 6 floating point operations per byte transferred), and in our analysis this level of operational intensity is not achievable for this algorithm. However, looking at the other parameters of the comparison in Table 9.1, we can see that the number of single precision floating point operations per second that we achieved with a single NVIDIA GTX 480 is more than forty times the GFLOP/s achievable by a single chip of the IBM Blue Gene/P, and, in terms of power consumption, our best implementation is more than three times more efficient. So, we conclude that the use of GPUs for radio astronomy beam forming is a viable solution, and we have proved that it is possible to efficiently parallelize this algorithm on a GPU. In the future we aim to extend this work by testing the beam former on different multi-core and many-core architectures, e.g. GPUs from other manufacturers or the Cell Broadband Engine Architecture, and by improving the kernel generator to automatically try new optimizations and tune the code for each specific architecture. Furthermore, we plan to parallelize our algorithm to run on a GPU-powered cluster, in order to be able to compute bigger instances and to discover how the algorithm scales beyond a single GPU.

Appendix A

CUDA BeamFormer execution time b s 2 4 8 16 32 64 128 2 4 8 16 32 64 128 256 0,106 0,0957 0,109 0,124 0,16 0,229 0,44 0,112 0,121 0,14 0,174 0,236 0,357 0,731 0,154 0,173 0,203 0,255 0,389 0,619 1,33 0,241 0,276 0,328 0,417 0,663 1,14 2,42 0,412 0,482 0,579 0,741 1,21 2,14 4,51 0,755 0,894 1,08 1,39 2,31 4,16 8,88 1,44 1,72 2,08 2,68 4,51 8,13 17,6 2,82 3,37 4,09 5,28 8,96 16,4 - Table A.1: Execution time in seconds for the BeamFormer 1.0.2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,1 0,0912 0,111 0,152 0,233 0,4 0,757 1,46 0,0921 0,111 0,15 0,227 0,379 0,694 1,34 2,63 0,114 0,152 0,226 0,374 0,669 1,28 2,52 4,99 0,159 0,232 0,377 0,669 1,25 2,45 4,87 9,7 0,249 0,393 0,683 1,26 2,41 4,79 9,57 19,2 0,428 0,715 1,29 2,44 4,74 9,46 19,0 38,0 0,786 1,36 2,51 4,8 9,39 18,8 37,9 76,0 1,5 2,65 4,94 9,54 18,7 37,7 75,9 - Table A.2: Execution time in seconds for the BeamFormer 1.1 68 Appendix A.
CUDA BeamFormer execution time b s 2 4 8 16 32 64 128 256 69 2 4 8 16 32 64 128 256 0,0937 0,0794 0,0885 0,106 0,141 0,211 0,378 0,69 0,0805 0,0876 0,103 0,133 0,193 0,318 0,583 1,1 0,0911 0,105 0,132 0,187 0,297 0,524 1,0 1,93 0,112 0,138 0,191 0,295 0,508 0,942 1,83 3,58 0,155 0,205 0,308 0,512 0,925 1,78 3,49 6,92 0,239 0,339 0,541 0,945 1,76 3,44 6,82 13,5 0,408 0,606 1,01 1,82 3,44 6,8 13,5 26,9 0,745 1,14 1,95 3,56 6,81 13,5 26,9 - Table A.3: Execution time in seconds for the BeamFormer 1.1 2x2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 0,095 0,0903 0,109 0,149 0,226 0,386 0,728 1,4 0,0908 0,109 0,146 0,22 0,365 0,665 1,29 2,52 0,113 0,149 0,219 0,361 0,642 1,22 2,41 4,78 0,157 0,226 0,365 0,643 1,2 2,34 4,65 9,29 0,243 0,381 0,656 1,21 2,3 4,56 9,13 18,3 0,417 0,69 1,24 2,33 4,52 9,02 18,1 36,3 0,764 1,31 2,41 4,59 8,97 18,0 36,2 72,8 1,46 2,55 4,74 9,12 17,8 35,9 72,5 146,0 2,87 5,07 9,44 18,2 35,6 72,2 145,0 291,0 Table A.4: Execution time in seconds for the BeamFormer 1.1.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 0,0906 0,0799 0,0882 0,106 0,14 0,212 0,375 0,683 0,0807 0,0891 0,104 0,134 0,194 0,316 0,584 1,09 0,0932 0,107 0,134 0,189 0,298 0,526 0,997 1,92 0,116 0,143 0,194 0,299 0,51 0,942 1,82 3,57 0,163 0,214 0,315 0,519 0,93 1,77 3,48 6,88 0,256 0,356 0,557 0,959 1,77 3,44 6,8 13,5 0,443 0,64 1,04 1,84 3,46 6,79 13,5 26,8 0,815 1,21 2,01 3,62 6,86 13,5 26,8 53,3 1,57 2,39 3,98 7,19 13,7 27,0 53,5 107,0 Table A.5: Execution time in seconds for the BeamFormer 1.1.1 2x2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 0,0952 0,0803 0,0884 0,106 0,142 0,214 0,38 0,714 0,0815 0,0882 0,105 0,135 0,196 0,32 0,591 1,13 0,0921 0,107 0,134 0,19 0,302 0,535 1,01 1,97 0,115 0,142 0,195 0,302 0,516 0,958 1,86 3,66 0,159 0,212 0,316 0,523 0,943 1,81 3,55 7,04 0,25 0,352 0,557 0,968 1,8 3,5 6,93 13,8 0,429 0,633 1,04 1,86 3,51 6,91 13,7 27,3 0,812 1,22 2,03 3,65 6,96 13,8 27,3 54,3 1,53 2,35 3,98 7,24 13,8 27,5 54,5 109,0 Table A.6: Execution time in seconds for the BeamFormer 1.2 2x2 Appendix A. 
CUDA BeamFormer execution time b s 4 8 16 32 64 128 256 70 4 8 16 32 64 128 256 512 0,0847 0,0886 0,103 0,134 0,197 0,344 0,642 0,0899 0,103 0,129 0,181 0,286 0,522 0,974 0,107 0,131 0,178 0,274 0,466 0,871 1,68 0,145 0,19 0,279 0,458 0,82 1,57 3,08 0,218 0,304 0,478 0,827 1,53 2,98 5,83 0,365 0,536 0,878 1,56 2,97 5,78 11,5 0,662 1,01 1,71 3,08 5,85 11,5 22,6 1,25 1,95 3,32 6,03 11,6 22,7 45,1 Table A.7: Execution time in seconds for the BeamFormer 1.2 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 0,112 0,108 0,14 0,204 0,354 0,654 0,11 0,137 0,191 0,3 0,54 1,01 0,146 0,194 0,294 0,492 0,909 1,74 0,218 0,311 0,498 0,871 1,65 3,19 0,36 0,544 0,908 1,64 3,13 6,13 0,677 1,03 1,75 3,19 6,15 11,9 1,26 1,96 3,39 6,28 12,0 23,6 Table A.8: Execution time in seconds for the BeamFormer 1.2 8x8 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,0919 0,0792 0,0856 0,101 0,126 0,179 0,311 0,563 0,0782 0,0809 0,0879 0,103 0,128 0,185 0,31 0,549 0,0821 0,0859 0,0925 0,106 0,133 0,187 0,318 0,554 0,0928 0,0959 0,103 0,118 0,146 0,197 0,326 0,565 0,113 0,117 0,124 0,14 0,164 0,221 0,353 0,603 0,161 0,161 0,165 0,178 0,204 0,26 0,396 0,673 0,305 0,311 0,314 0,323 0,327 0,365 0,524 0,849 0,641 0,644 0,649 0,662 0,697 0,724 0,924 - Table A.9: Execution time in seconds for the BeamFormer 1.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,0833 0,0789 0,0835 0,0972 0,124 0,174 0,303 2,12 0,078 0,0805 0,0866 0,0996 0,126 0,179 0,303 2,12 0,0823 0,0854 0,0911 0,104 0,131 0,182 0,31 2,14 0,092 0,0949 0,101 0,114 0,144 0,193 0,319 2,15 0,111 0,115 0,122 0,136 0,162 0,218 0,35 2,24 0,154 0,156 0,159 0,171 0,198 0,253 0,39 2,4 0,288 0,289 0,29 0,298 0,321 0,388 0,575 2,48 0,639 0,65 0,651 0,691 0,794 1,05 1,64 - Table A.10: Execution time in seconds for the BeamFormer 1.2.1.1 Appendix A. 
CUDA BeamFormer execution time b s 2 4 8 16 32 64 128 256 71 2 4 8 16 32 64 128 256 0,0874 0,0772 0,0825 0,0962 0,119 0,17 0,293 0,545 0,0761 0,0788 0,0842 0,096 0,12 0,171 0,296 0,545 0,079 0,0814 0,0874 0,098 0,124 0,176 0,297 0,542 0,0847 0,0875 0,093 0,107 0,129 0,18 0,306 0,544 0,0973 0,0993 0,106 0,12 0,147 0,197 0,321 0,577 0,12 0,123 0,13 0,146 0,17 0,227 0,355 0,608 0,185 0,189 0,2 0,216 0,245 0,31 0,459 0,759 0,313 0,334 0,345 0,372 0,431 0,538 0,776 - Table A.11: Execution time in seconds for the BeamFormer 1.2.2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,0921 0,0781 0,0833 0,0945 0,12 0,17 0,296 2,09 0,076 0,0788 0,0847 0,0984 0,121 0,174 0,295 2,07 0,0795 0,0826 0,0878 0,0988 0,124 0,174 0,299 2,07 0,0851 0,0878 0,0939 0,106 0,131 0,181 0,304 2,07 0,0977 0,101 0,107 0,121 0,147 0,199 0,327 2,15 0,12 0,124 0,131 0,147 0,172 0,224 0,357 2,19 0,203 0,2 0,207 0,226 0,263 0,337 0,524 2,37 0,361 0,384 0,409 0,469 0,6 0,864 1,45 - Table A.12: Execution time in seconds for the BeamFormer 1.2.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,0742 0,0734 0,0754 0,0814 0,0919 0,112 0,179 0,296 0,0739 0,0751 0,0779 0,0837 0,0951 0,118 0,184 0,313 0,078 0,0796 0,0828 0,0896 0,102 0,13 0,204 0,343 0,0859 0,0878 0,092 0,0999 0,118 0,149 0,236 0,419 0,102 0,105 0,11 0,122 0,145 0,19 0,308 0,534 0,134 0,138 0,147 0,165 0,201 0,275 0,447 0,796 0,198 0,206 0,22 0,253 0,314 0,44 0,725 1,3 0,327 0,359 0,387 0,454 0,566 0,799 1,31 - Table A.13: Execution time in seconds for the BeamFormer 1.3 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,0805 0,0814 0,0854 0,0933 0,105 0,134 0,226 0,443 0,0851 0,0894 0,0948 0,105 0,121 0,159 0,266 0,506 0,0989 0,106 0,115 0,13 0,15 0,202 0,355 0,72 0,126 0,138 0,154 0,177 0,212 0,294 0,53 1,11 0,18 0,204 0,233 0,273 0,329 0,484 0,893 1,94 0,289 0,336 0,39 0,465 0,567 0,865 1,63 3,6 0,505 0,6 0,711 0,85 1,04 1,62 3,08 6,78 0,959 1,15 1,36 1,63 2,03 3,08 6,04 - Table A.14: Execution time in seconds for the BeamFormer 1.4 Appendix A. 
CUDA BeamFormer execution time b s 2 4 8 16 32 64 128 256 72 2 4 8 16 32 64 128 256 512 0,0727 0,0741 0,0773 0,0828 0,0945 0,118 0,185 0,301 0,0745 0,0769 0,0798 0,087 0,1 0,128 0,203 0,343 0,0797 0,0814 0,0862 0,0957 0,113 0,147 0,238 0,407 0,0883 0,0925 0,0983 0,111 0,136 0,185 0,307 0,546 0,107 0,112 0,123 0,143 0,183 0,263 0,446 0,8 0,145 0,154 0,171 0,206 0,278 0,419 0,723 1,32 0,218 0,235 0,268 0,335 0,466 0,731 1,29 2,37 0,378 0,401 0,481 0,605 0,868 1,37 2,4 4,49 0,68 0,764 0,853 1,14 1,61 2,61 4,64 8,67 Table A.15: Execution time in seconds for the BeamFormer 1.5 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 0,0758 0,0789 0,0843 0,0967 0,12 0,188 0,307 0,0792 0,0836 0,0905 0,104 0,132 0,208 0,349 0,0882 0,0927 0,101 0,12 0,154 0,248 0,413 0,104 0,111 0,124 0,15 0,202 0,326 0,568 0,137 0,147 0,169 0,211 0,294 0,485 0,845 0,202 0,221 0,257 0,333 0,481 0,809 1,44 0,338 0,374 0,444 0,584 0,863 1,45 2,58 0,595 0,674 0,819 1,08 1,63 2,72 4,89 Table A.16: Execution time in seconds for the BeamFormer 1.5 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 0,0818 0,0885 0,1 0,124 0,192 0,316 0,0903 0,0974 0,111 0,139 0,217 0,355 0,105 0,114 0,133 0,169 0,264 0,434 0,136 0,15 0,176 0,23 0,359 0,601 0,196 0,219 0,264 0,351 0,556 0,935 0,329 0,366 0,452 0,611 0,942 1,59 0,594 0,654 0,801 1,1 1,69 2,9 Table A.17: Execution time in seconds for the BeamFormer 1.5 8x8 Appendix B CUDA BeamFormer GFLOP/s b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 1,2308 1,2699 1,2857 1,2992 1,3033 1,2892 1,2801 1,2713 1,2238 1,2616 1,2849 1,2943 1,3027 1,2889 1,2822 1,2712 1,2159 1,2608 1,2855 1,2982 1,3046 1,2932 1,2821 1,2739 1,2151 1,2596 1,2848 1,2978 1,3045 1,2931 1,2793 1,2739 1,2108 1,2590 1,2823 1,2932 1,3044 1,2903 1,2793 1,2766 1,2141 1,2586 1,2822 1,2987 1,3043 1,2903 1,2834 1,2732 1,2138 1,2606 1,2821 1,2987 1,3043 1,2903 1,2766 1,2698 1,2098 1,2605 1,2848 1,2959 1,3043 1,2834 1,2715 - Table B.1: GFLOP/s for the BeamFormer 1.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 3,2140 3,4749 3,6061 3,6764 3,7054 3,6443 3,6425 3,6328 3,1653 3,4482 3,5922 3,6764 3,6800 3,6603 3,6416 3,6324 3,1494 3,4348 3,5920 3,6800 3,6963 3,6505 3,6368 3,6365 3,1372 3,4344 3,5749 3,6782 3,6954 3,6500 3,6365 3,6364 3,1258 3,4437 3,5902 3,6683 3,6950 3,6587 3,6364 3,6254 3,1280 3,4263 3,5893 3,6724 3,7039 3,6475 3,6364 3,6363 3,1265 3,4254 3,5889 3,6587 3,6924 3,6419 3,6363 3,6226 3,1257 3,4289 3,5716 3,6475 3,6810 3,6363 3,6226 - Table B.2: GFLOP/s for the BeamFormer 1.1 2x2 73 Appendix B. 
CUDA BeamFormer GFLOP/s b s 2 4 8 16 32 64 128 256 74 2 4 8 16 32 64 128 256 512 1,2797 1,3270 1,3483 1,3613 1,3699 1,3544 1,3420 1,3275 1,2631 1,3181 1,3457 1,3600 1,3692 1,3541 1,3394 1,3334 1,2582 1,3137 1,3405 1,3593 1,3689 1,3515 1,3393 1,3333 1,2524 1,3124 1,3447 1,3590 1,3638 1,3514 1,3423 1,3333 1,2512 1,3118 1,3420 1,3638 1,3699 1,3544 1,3423 1,3333 1,2506 1,3092 1,3394 1,3575 1,3699 1,3544 1,3407 1,3333 1,2482 1,3045 1,3393 1,3575 1,3667 1,3559 1,3370 1,3259 1,2501 1,3101 1,3393 1,3559 1,3636 1,3483 1,3296 1,3241 1,2448 1,3072 1,3378 1,3559 1,3675 1,3389 1,3241 1,3241 Table B.3: GFLOP/s for the BeamFormer 1.1.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 3,2226 3,4902 3,6061 3,6908 3,7201 3,6800 3,6603 3,6594 2,9351 3,3035 3,5249 3,6478 3,6800 3,6603 3,6505 3,6500 2,8119 3,2342 3,4790 3,6092 3,6782 3,6594 3,6500 3,6587 2,7439 3,1895 3,4437 3,6075 3,6773 3,6545 3,6587 3,6475 2,7109 3,1810 3,4420 3,6066 3,6724 3,6587 3,6475 3,6530 2,7004 3,1661 3,4412 3,6018 3,6587 3,6475 3,6474 3,6363 2,6894 3,1520 3,4407 3,5930 3,6586 3,6419 3,6363 3,6363 2,6888 3,1549 3,4287 3,5821 3,6530 3,6363 3,6363 3,6363 2,6788 3,1415 3,4189 3,5714 3,6363 3,6226 3,6294 3,6226 Table B.4: GFLOP/s for the BeamFormer 1.1.1 2x2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 4,2534 4,2086 4,1973 4,1750 4,1475 4,0740 4,0458 4,0318 3,6716 3,8975 4,0187 4,0917 4,1136 4,0458 4,0318 4,0296 3,4103 3,7517 3,9401 4,0352 4,0849 4,0414 4,0248 4,0189 3,3015 3,6747 3,8872 4,0265 4,0805 4,0344 4,0189 4,0303 3,2384 3,6526 3,8788 4,0222 4,0684 4,0430 4,0184 4,0121 3,2098 3,6291 3,8746 4,0152 4,0673 4,0303 4,0181 4,0299 3,1908 3,6174 3,8725 3,9951 4,0668 4,0241 4,0299 4,0149 3,1874 3,6115 3,8581 4,0064 4,0543 4,0299 4,0149 4,0149 3,1856 3,6106 3,8576 3,9943 4,0602 4,0149 4,0074 4,0000 Table B.5: GFLOP/s for the BeamFormer 1.2 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 9,4799 9,4744 9,4804 9,4389 9,3739 9,2872 9,2766 8,6782 9,0571 9,2649 9,3302 9,3415 9,2497 9,2713 8,3147 8,8568 9,1489 9,2872 9,2766 9,2444 9,2417 8,1068 8,7490 9,0762 9,2497 9,2848 9,2417 9,2740 8,0214 8,6815 9,0658 9,2310 9,2417 9,2404 9,2733 7,9870 8,6716 9,0477 9,2417 9,2070 9,2565 9,1895 7,9382 8,6431 9,0451 9,1739 9,2397 9,1895 9,2309 7,9336 8,6759 9,0118 9,2230 9,1895 9,2309 9,1892 Table B.6: GFLOP/s for the BeamFormer 1.2 4x4 Appendix B. 
CUDA BeamFormer GFLOP/s b s 8 16 32 64 128 256 75 8 16 32 64 128 256 512 17,759 17,591 17,584 17,521 17,421 17,401 16,839 17,159 17,326 17,421 17,352 17,390 16,402 16,930 17,228 17,352 17,317 17,312 16,189 16,762 17,112 17,317 17,312 17,370 16,067 16,697 17,079 17,192 17,249 17,278 15,966 16,665 17,074 17,249 17,218 17,369 15,915 16,615 17,012 17,218 17,218 17,292 Table B.7: GFLOP/s for the BeamFormer 1.2 8x8 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 4,0039 4,9232 5,3544 5,3321 5,3897 5,4197 5,4586 5,4864 7,0326 8,8667 10,034 10,324 10,570 10,747 10,838 10,933 11,184 14,769 17,603 19,312 20,405 21,074 21,522 21,708 15,090 21,386 28,265 33,755 37,875 40,622 42,142 42,951 18,062 27,418 39,739 52,994 65,369 74,725 79,294 82,655 15,350 27,774 51,789 81,938 110,00 131,29 144,34 151,27 8,2839 14,257 26,536 50,495 103,48 178,02 206,91 218,16 5,7887 10,245 19,056 36,321 70,380 136,78 199,21 - Table B.8: GFLOP/s for the BeamFormer 1.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 3,4358 4,5310 5,6435 5,7085 5,6953 5,8109 5,7067 0,8259 6,1159 8,2765 10,474 10,904 11,125 11,480 11,413 1,6429 9,9434 14,014 18,243 20,113 21,317 22,384 22,486 3,2858 13,712 20,176 27,872 34,659 39,137 42,624 43,988 6,5007 16,567 25,460 38,044 52,105 64,676 74,541 79,294 12,527 15,701 28,048 50,328 81,082 110,60 134,50 145,21 23,466 8,7357 15,553 29,730 56,981 98,475 143,25 174,67 46,704 5,5545 9,8898 18,722 33,302 53,206 70,380 79,687 - Table B.9: GFLOP/s for the BeamFormer 1.2.1.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 4,3269 5,3789 6,2870 6,7864 6,4025 6,1914 6,0749 6,0862 8,3880 10,524 12,403 13,385 12,805 12,382 12,149 12,172 16,590 20,885 24,714 26,407 25,439 24,765 24,299 24,344 30,958 39,826 47,781 50,742 50,044 49,129 48,599 48,689 50,789 66,938 79,478 86,263 89,261 91,018 92,713 93,227 81,991 108,01 130,52 147,17 158,58 167,66 173,42 176,21 102,66 143,77 184,53 213,43 239,90 257,56 266,35 267,25 95,615 136,87 177,89 213,43 239,43 257,56 267,83 - Table B.10: GFLOP/s for the BeamFormer 1.2.2 Appendix B. 
CUDA BeamFormer GFLOP/s b s 2 4 8 16 32 64 128 256 76 2 4 8 16 32 64 128 256 3,6092 4,7148 5,9798 6,4268 6,2556 6,1712 6,0749 0,8445 7,0326 9,2025 11,818 12,769 12,470 12,342 12,149 1,6890 13,934 18,373 23,499 25,206 24,859 24,644 24,299 3,3781 25,287 34,055 43,674 48,522 48,765 48,891 48,599 6,7563 41,136 56,653 71,672 81,722 86,972 90,339 89,945 13,143 66,546 92,383 119,81 141,80 156,95 170,02 168,56 26,144 36,125 75,371 121,16 158,03 180,36 200,09 211,45 48,838 35,328 49,664 65,462 79,627 89,457 95,884 94,531 - Table B.11: GFLOP/s for the BeamFormer 1.2.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,2023 0,3642 0,6653 89,857 100,41 105,96 109,49 107,37 0,4063 0,7119 80,920 92,866 102,57 108,39 109,97 107,37 48,828 64,808 82,585 94,219 103,97 108,08 107,61 106,80 49,635 65,603 83,271 94,794 103,83 109,65 108,58 108,34 49,944 65,906 83,618 95,259 104,54 109,25 108,58 106,90 50,048 66,315 83,618 95,259 104,72 109,05 108,09 107,37 50,048 66,059 83,836 95,406 104,36 108,56 108,09 107,13 50,179 66,315 83,946 95,552 104,81 108,81 108,21 - Table B.12: GFLOP/s for the BeamFormer 1.3 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 1,0215 1,4947 2,2588 3,4401 5,3801 6,6117 6,2563 5,1439 0,7678 1,0395 1,4648 2,1014 3,1283 3,8047 3,6003 2,8918 0,6103 0,7629 0,9903 1,3137 1,8245 2,1282 1,9804 1,6206 0,5207 0,6081 0,7260 0,8816 1,1288 1,2192 1,0912 0,8847 0,4705 0,5241 0,5859 0,6548 0,7595 0,7366 0,6194 0,4774 0,4437 0,4814 0,5137 0,5373 0,5723 0,4972 0,3823 0,2723 0,4312 0,4588 0,4728 0,4788 0,4780 0,3796 0,2688 0,1766 0,4242 0,4464 0,4530 0,4503 0,4292 0,3323 0,2085 - Table B.13: GFLOP/s for the BeamFormer 1.4 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 59,346 55,948 55,284 55,595 55,264 55,458 55,266 55,387 25,177 34,103 41,913 47,705 51,172 53,181 54,319 54,818 19,145 28,088 37,262 44,506 49,449 52,301 53,685 54,450 17,004 25,821 35,308 43,141 48,543 51,712 53,585 54,442 16,118 24,778 34,234 42,339 48,034 51,540 53,366 54,329 15,634 24,298 33,786 42,000 47,885 51,337 53,258 54,273 15,458 24,133 33,768 41,677 47,709 51,236 52,942 54,272 15,357 23,949 33,424 41,801 47,622 51,430 53,202 54,135 15,279 23,942 33,420 41,669 47,789 51,185 53,202 54,203 Table B.14: GFLOP/s for the BeamFormer 1.5 2x2 Appendix B. 
CUDA BeamFormer GFLOP/s b s 4 8 16 32 64 128 256 77 4 8 16 32 64 128 256 512 99,565 99,299 98,966 99,039 99,199 99,218 99,104 60,041 74,588 84,831 91,594 95,084 96,995 98,133 49,847 66,134 79,261 88,254 93,308 96,064 97,356 45,849 62,541 76,798 86,716 92,444 95,892 97,342 43,985 60,969 75,441 85,850 91,752 95,519 97,335 43,173 60,097 74,866 85,595 91,739 95,333 97,146 42,719 59,613 74,582 85,296 91,733 95,330 97,144 42,524 59,596 74,572 85,148 91,895 95,328 97,143 Table B.15: GFLOP/s for the BeamFormer 1.5 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 180,79 181,58 181,99 182,63 182,41 182,73 132,22 153,09 166,75 174,21 178,52 180,50 116,01 141,29 159,84 170,65 176,38 179,40 109,45 136,61 156,43 168,69 175,58 179,37 105,83 134,08 154,77 167,27 174,93 178,71 104,11 132,71 153,77 167,25 174,61 178,38 103,18 131,68 153,27 166,67 173,69 178,38 Table B.16: GFLOP/s for the BeamFormer 1.5 8x8 Appendix C CUDA BeamFormer GB/s b s 2 4 8 16 32 64 128 2 4 8 16 32 64 128 256 2,8200 5,5803 10,557 24,898 52,989 112,81 191,76 2,7323 5,4505 11,143 23,514 50,781 109,56 181,77 2,6905 5,3879 11,432 25,367 49,744 105,86 174,56 2,6700 5,3432 11,621 26,457 52,139 104,41 181,61 2,6616 5,3432 11,718 27,080 53,181 105,97 187,5 2,6533 5,3294 11,755 27,380 53,917 106,5 187,06 2,6574 5,3225 11,780 27,490 54,041 107,57 187,28 2,6470 5,3225 11,780 27,6 54,166 105,83 - Table C.1: GB/s for the BeamFormer 1.0.2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 4,5444 4,5677 4,5626 4,5789 4,5776 4,5203 4,4845 4,4516 4,4021 4,4770 4,5287 4,5460 4,5674 4,5153 4,4896 4,4503 4,3150 4,4436 4,5149 4,5516 4,5703 4,5283 4,4884 4,4592 4,2827 4,4240 4,5048 4,5465 4,5677 4,5271 4,4782 4,4589 4,2527 4,4141 4,4921 4,5283 4,5664 4,5167 4,4779 4,4682 4,2570 4,4092 4,4896 4,5467 4,5658 4,5164 4,4921 4,4563 4,2523 4,4142 4,4884 4,5460 4,5655 4,5162 4,4681 4,4444 4,2362 4,4129 4,4974 4,5359 4,5653 4,4920 4,4503 - Table C.2: GB/s for the BeamFormer 1.1 78 Appendix C. 
CUDA BeamFormer GB/s b s 2 4 8 16 32 64 128 256 79 2 4 8 16 32 64 128 256 9,3500 9,4092 9,3928 9,3841 9,3611 9,1588 9,1303 9,0940 8,5708 8,9815 9,1692 9,2878 9,2486 9,1749 9,1161 9,0869 8,2031 8,7676 9,0747 9,2486 9,2653 9,1383 9,0980 9,0944 8,0078 8,6765 8,9843 9,2198 9,2508 9,1312 9,0944 9,0926 7,8969 8,6546 8,9993 9,1830 9,2436 9,1499 9,0926 9,0643 7,8613 8,5883 8,9853 9,1870 9,2628 9,1203 9,0917 9,0913 7,8369 8,5750 8,9783 9,1499 9,2325 9,1055 9,0913 9,0568 7,8247 8,5781 8,9320 9,1203 9,2033 9,0913 9,0568 - Table C.3: GB/s for the BeamFormer 1.1 2x2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 4,7253 4,7730 4,7848 4,7978 4,8115 4,7488 4,7011 4,6486 4,5433 4,6776 4,7429 4,7766 4,8008 4,7435 4,6901 4,6679 4,4650 4,6301 4,7084 4,7660 4,7954 4,7323 4,6888 4,6673 4,4140 4,6096 4,7148 4,7607 4,7753 4,7310 4,6986 4,6669 4,3945 4,5994 4,7011 4,7753 4,7958 4,7410 4,6983 4,6668 4,3847 4,5862 4,6901 4,7524 4,7951 4,7407 4,6929 4,6667 4,3725 4,5677 4,6888 4,7517 4,7839 4,7459 4,6797 4,6409 4,3774 4,5864 4,6881 4,7460 4,7728 4,7191 4,6537 4,6345 4,3580 4,5758 4,6826 4,7459 4,7864 4,6862 4,6345 4,6344 Table C.4: GB/s for the BeamFormer 1.1.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 9,375 9,4506 9,3928 9,4209 9,3982 9,2486 9,1749 9,1606 7,9473 8,6046 8,9975 9,2157 9,2486 9,1749 9,1383 9,1312 7,3242 8,2554 8,7890 9,0707 9,2198 9,1606 9,1312 9,1499 7,0039 8,0578 8,6546 9,0425 9,2055 9,1423 9,1499 9,1203 6,8486 7,9945 8,6277 9,0285 9,1870 9,1499 9,1203 9,1333 6,7867 7,9361 8,6143 9,0106 9,1499 9,1203 9,1194 9,0913 6,7414 7,8904 8,6076 8,9855 9,1481 9,1055 9,0913 9,0911 6,7309 7,8925 8,5747 8,9569 9,1333 9,0913 9,0911 9,0910 6,7016 7,8564 8,5486 8,9294 9,0913 9,0568 9,0738 9,0566 Table C.5: GB/s for the BeamFormer 1.1.1 2x2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 8,2993 8,7453 9,0144 9,1196 9,1374 9,0144 8,9712 8,9498 7,6393 8,3705 8,7780 9,0144 9,1019 8,9712 8,9498 8,9498 7,3242 8,1949 8,6966 8,9285 9,0579 8,9712 8,9392 8,9285 7,2115 8,1098 8,6805 8,9285 9,0579 8,9605 8,9285 8,9285 7,1347 8,0818 8,6009 8,9285 9,0470 8,9820 8,9285 8,9285 7,1022 8,0472 8,6206 8,9179 9,0361 8,9552 8,9418 8,9552 7,0754 8,0299 8,6009 8,9285 9,0361 8,9418 8,8888 8,9219 7,0754 8,0213 8,5714 8,9020 9,0225 8,9552 8,9219 8,9219 7,0754 8,0213 8,5714 8,8757 9,0225 8,9219 8,9053 8,8888 Table C.6: GB/s for the BeamFormer 1.2 2x2 Appendix C. 
CUDA BeamFormer GB/s b s 4 8 16 32 64 128 256 80 4 8 16 32 64 128 256 512 16,622 17,201 17,523 17,605 17,564 17,441 17,441 15,756 16,741 17,281 17,482 17,543 17,391 17,441 15,368 16,519 17,142 17,441 17,441 17,391 17,391 15,120 16,393 17,045 17,391 17,467 17,391 17,454 15,030 16,304 17,045 17,366 17,391 17,391 17,454 15,0 16,304 17,021 17,391 17,328 17,422 17,297 14,925 16,260 17,021 17,266 17,391 17,297 17,375 14,925 16,326 16,961 17,359 17,297 17,375 17,297 Table C.7: GB/s for the BeamFormer 1.2 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 29,037 29,296 29,560 29,594 29,494 29,494 28,044 28,846 29,264 29,494 29,411 29,494 27,573 28,594 29,166 29,411 29,370 29,370 27,343 28,378 29,005 29,370 29,370 29,473 27,202 28,301 28,965 29,166 29,268 29,319 27,061 28,263 28,965 29,268 29,217 29,473 26,992 28,187 28,865 29,217 29,217 29,344 Table C.8: GB/s for the BeamFormer 1.2 8x8 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 7,8125 8,6325 8,7546 8,3705 8,2759 8,2266 8,2370 8,2544 13,722 15,547 16,406 16,206 16,230 16,313 16,355 16,448 21,822 25,897 28,782 30,317 31,333 31,989 32,477 32,660 29,444 37,5 46,214 52,989 58,157 61,661 63,592 64,620 35,244 48,076 64,975 83,191 100,37 113,42 119,65 124,35 29,952 48,701 84,677 128,62 168,91 199,29 217,81 227,59 16,163 25,0 43,388 79,268 158,89 270,22 312,23 328,23 11,295 17,964 31,157 57,017 108,06 207,62 300,61 - Table C.9: GB/s for the BeamFormer 1.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 5,3632 5,9586 6,5909 6,2040 5,9468 5,9403 5,7705 0,8306 8,3534 9,0702 9,7860 9,2169 8,8830 8,8913 8,7002 1,2423 12,611 13,822 14,914 14,572 14,402 14,561 14,342 2,0747 16,721 18,794 21,158 23,018 24,038 25,088 25,319 3,6993 19,800 23,018 27,769 33,032 37,738 41,564 43,174 6,7382 18,573 24,974 36,001 50,179 62,839 72,916 76,807 12,256 10,280 13,742 21,050 34,833 55,191 76,553 91,032 24,029 6,5198 8,7043 13,187 20,232 29,616 37,336 41,219 - Table C.10: GB/s for the BeamFormer 1.2.1.1 Appendix C. 
CUDA BeamFormer GB/s b s 2 4 8 16 32 64 128 256 81 2 4 8 16 32 64 128 256 8,4429 9,4315 10,279 10,653 9,8311 9,3980 9,1670 9,1567 16,366 18,454 20,279 21,012 19,662 18,796 18,334 18,313 32,372 36,621 40,409 41,454 39,062 37,592 36,668 36,627 60,405 69,832 78,125 79,656 76,844 74,573 73,336 73,254 99,101 117,37 129,95 135,41 137,06 138,15 139,90 140,26 159,98 189,39 213,41 231,04 243,50 254,50 261,69 265,10 200,32 252,10 301,72 335,05 368,36 390,95 401,93 402,08 186,56 240,0 290,85 335,05 367,64 390,95 404,16 - Table C.11: GB/s for the BeamFormer 1.2.2 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 5,6340 6,2003 6,9837 6,9846 6,5317 6,3086 6,1428 0,8492 9,6055 10,084 11,042 10,793 9,9571 9,5585 9,2615 1,2772 17,673 18,121 19,211 18,262 16,795 16,032 15,498 2,1330 30,838 31,722 33,154 32,226 29,952 28,776 27,973 3,8448 49,162 51,221 52,315 51,809 50,747 50,373 48,973 7,0696 78,720 82,259 85,704 87,756 89,170 92,169 89,160 13,654 42,513 66,595 85,790 96,612 101,08 106,92 110,19 25,126 41,468 43,711 46,110 48,377 49,793 50,866 48,897 - Table C.12: GB/s for the BeamFormer 1.2.2.1 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 0,4738 0,7982 1,3987 184,46 203,53 213,36 219,72 215,12 0,9514 1,5604 170,11 190,63 207,91 218,25 220,68 215,12 114,32 142,04 173,61 193,41 210,74 217,63 215,95 213,97 116,21 143,78 175,05 194,59 210,45 220,78 217,90 217,06 116,94 144,45 175,78 195,55 211,90 219,99 217,90 214,16 117,18 145,34 175,78 195,55 212,26 219,59 216,92 215,12 117,18 144,78 176,24 195,85 211,53 218,60 216,92 214,64 117,49 145,34 176,47 196,15 212,44 219,10 217,17 - Table C.13: GB/s for the BeamFormer 1.3 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 2,4516 3,3216 4,9284 7,5461 11,935 14,819 14,128 11,673 2,3626 3,0241 4,3174 6,4280 9,9311 12,428 11,990 9,7487 2,3129 2,8483 3,9612 5,7805 8,7577 10,907 10,605 8,9206 2,2846 2,7498 3,7581 5,4253 8,1708 10,025 9,8047 8,4119 2,2588 2,6877 3,6458 5,2389 7,8125 9,375 9,2410 7,9282 2,2403 2,6584 3,5855 5,1312 7,6308 9,0169 8,9095 7,6261 2,2362 2,6373 3,5221 5,0845 7,5393 8,8468 8,8709 7,7119 2,2309 2,6207 3,4939 5,0722 7,4672 9,0681 8,7411 - Table C.14: GB/s for the BeamFormer 1.4 Appendix C. 
CUDA BeamFormer GB/s b s 2 4 8 16 32 64 128 256 82 2 4 8 16 32 64 128 256 512 115,79 116,25 118,73 121,43 121,75 122,70 122,54 122,95 52,315 73,242 91,552 105,10 113,22 117,92 120,57 121,75 41,118 61,354 82,092 98,476 109,64 116,09 119,23 120,96 37,143 56,887 78,125 95,663 107,75 114,85 119,04 120,96 35,511 54,824 75,910 93,984 106,68 114,50 118,57 120,72 34,594 53,879 75,0 93,283 106,38 114,06 118,34 120,60 34,277 53,571 75,0 92,592 106,00 113,85 117,64 120,60 34,090 53,191 74,257 92,879 105,82 114,28 118,22 120,30 33,936 53,191 74,257 92,592 106,19 113,74 118,22 120,45 Table C.15: GB/s for the BeamFormer 1.5 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 130,93 135,21 137,19 138,54 139,40 139,75 139,75 81,758 103,40 118,67 128,71 133,92 136,77 138,46 69,103 92,516 111,38 124,30 131,57 135,54 137,40 64,139 87,890 108,17 122,28 130,43 135,33 137,40 61,813 85,877 106,38 121,13 129,49 134,83 137,40 60,810 84,745 105,63 120,80 129,49 134,57 137,14 60,240 84,112 105,26 120,40 129,49 134,57 137,14 60,0 84,112 105,26 120,20 129,72 134,57 137,14 Table C.16: GB/s for the BeamFormer 1.5 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 168,91 172,81 174,82 176,26 176,47 176,99 125,83 147,05 160,94 168,53 172,91 174,92 111,44 136,36 154,63 165,28 170,94 173,91 105,63 132,15 151,51 163,48 170,21 173,91 102,38 129,87 150,0 162,16 169,61 173,28 100,84 128,61 149,06 162,16 169,31 172,97 100,0 127,65 148,60 161,61 168,42 172,97 Table C.17: GB/s for the BeamFormer 1.5 8x8 Appendix D OpenCL BeamFormer measurements b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 0,92 0,94 0,958 0,944 0,996 0,981 1,0 1,09 0,919 0,949 0,939 0,944 0,964 0,97 1,04 1,12 0,926 0,937 0,949 0,959 0,978 1,08 1,07 1,2 0,918 0,953 0,96 0,977 0,987 1,02 1,11 1,31 0,943 1,01 1,01 0,992 1,02 1,1 1,26 1,57 0,961 0,978 0,997 1,02 1,09 1,25 1,51 2,09 1,01 1,04 1,07 1,13 1,26 1,52 2,05 3,1 1,12 1,17 1,22 1,34 1,59 2,1 3,13 5,16 1,42 1,4 1,54 1,78 2,28 3,28 5,26 9,26 Table D.1: Execution time in seconds for the BeamFormer 1.5-opencl 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 0,959 0,946 0,952 0,965 0,987 1,03 1,14 0,954 0,952 0,966 0,986 1,01 1,06 1,18 0,963 0,968 0,967 0,989 1,02 1,09 1,25 0,974 0,983 0,987 1,02 1,08 1,17 1,38 0,997 1,02 1,02 1,06 1,15 1,32 1,69 1,06 1,07 1,1 1,17 1,32 1,62 2,22 1,19 1,19 1,26 1,4 1,68 2,23 3,34 1,35 1,42 1,55 1,81 2,36 3,43 5,59 Table D.2: Execution time in seconds for the BeamFormer 1.5-opencl 4x4 83 Appendix D. 
OpenCL BeamFormer measurements b s 8 16 32 64 128 256 84 8 16 32 64 128 256 512 1,05 1,05 1,06 1,09 1,13 1,25 1,05 1,05 1,06 1,09 1,15 1,29 1,06 1,06 1,08 1,14 1,21 1,38 1,09 1,09 1,13 1,19 1,31 1,59 1,14 1,16 1,21 1,31 1,54 1,96 1,31 1,29 1,38 1,59 1,96 2,73 1,44 1,53 1,72 2,09 2,82 4,3 Table D.3: Execution time in seconds for the BeamFormer 1.5-opencl 8x8 b s 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 512 63,086 59,364 57,802 56,772 56,437 56,120 55,998 55,753 59,489 57,558 57,074 56,287 56,194 55,998 55,753 55,723 57,256 56,772 56,139 56,194 55,998 55,938 55,815 55,800 56,177 56,139 55,971 55,998 55,938 55,815 55,800 55,563 55,844 55,750 55,630 55,753 55,631 55,342 55,563 55,559 55,458 55,630 55,570 55,723 55,800 55,563 55,559 55,443 55,266 55,570 55,631 55,800 55,792 55,559 55,500 55,385 55,387 55,448 55,342 55,792 55,673 55,557 55,385 55,527 55,267 55,342 55,563 55,673 55,729 55,385 55,527 55,384 Table D.4: GFLOP/s for the BeamFormer 1.5-opencl 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 105,04 102,24 100,93 100,27 99,817 99,838 99,413 101,24 100,43 99,776 99,817 99,838 99,413 99,356 99,941 99,529 99,569 99,838 99,413 99,201 99,637 98,312 98,588 98,605 98,797 98,893 98,865 98,851 97,865 98,605 98,797 98,740 98,865 98,851 98,844 97,403 98,492 98,588 98,865 98,851 98,653 99,032 97,589 98,133 98,865 98,851 98,653 99,032 98,552 97,384 98,105 98,469 98,653 99,032 98,552 98,790 Table D.5: GFLOP/s for the BeamFormer 1.5-opencl 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 146,48 144,86 143,78 143,90 142,90 143,06 143,81 143,52 142,58 142,90 143,06 142,65 142,47 142,58 142,90 142,73 142,81 143,10 142,58 142,25 142,73 142,81 143,10 142,67 141,60 142,40 142,65 142,28 142,67 142,86 141,43 142,16 142,28 142,67 142,66 142,45 141,51 142,28 142,67 142,66 142,45 142,96 Table D.6: GFLOP/s for the BeamFormer 1.5-opencl 8x8 Appendix D. 
OpenCL BeamFormer measurements b s 2 4 8 16 32 64 128 256 85 2 4 8 16 32 64 128 256 512 123,09 123,35 124,13 124,00 124,33 124,17 124,17 123,76 123,61 123,61 124,66 124,00 124,33 124,17 123,76 123,76 122,96 124,00 123,68 124,33 124,17 124,17 123,96 123,96 122,70 123,68 123,84 124,17 124,17 123,96 123,96 123,45 123,03 123,35 123,35 123,76 123,55 122,95 123,45 123,45 122,70 123,35 123,35 123,76 123,96 123,45 123,45 123,20 122,54 123,35 123,55 123,96 123,96 123,45 123,32 123,07 122,95 123,15 122,95 123,96 123,71 123,45 123,07 123,39 122,74 122,95 123,45 123,71 123,83 123,07 123,39 123,07 Table D.7: GB/s for the BeamFormer 1.5-opencl 2x2 b s 4 8 16 32 64 128 256 4 8 16 32 64 128 256 512 138,13 139,23 139,92 140,27 140,27 140,62 140,18 137,86 139,23 139,57 140,27 140,62 140,18 140,18 138,54 139,23 139,92 140,62 140,18 139,96 140,62 137,53 138,54 138,88 139,31 139,53 139,53 139,53 137,53 138,88 139,31 139,31 139,53 139,53 139,53 137,19 138,88 139,10 139,53 139,53 139,26 139,80 137,61 138,46 139,53 139,53 139,26 139,80 139,13 137,40 138,46 138,99 139,26 139,80 139,13 139,46 Table D.8: GB/s for the BeamFormer 1.5-opencl 4x4 b s 8 16 32 64 128 256 8 16 32 64 128 256 512 136,86 137,86 138,12 138,88 138,24 138,56 136,86 137,86 137,61 138,24 138,56 138,24 136,86 137,61 138,24 138,24 138,40 138,72 137,61 137,61 138,24 138,40 138,72 138,32 136,98 137,93 138,24 137,93 138,32 138,52 136,98 137,77 137,93 138,32 138,32 138,12 137,14 137,93 138,32 138,32 138,12 138,62 Table D.9: GB/s for the BeamFormer 1.5-opencl 8x8 Appendix E Finding the best station-beam block size b s 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 32,236 43,902 53,250 59,940 65,447 69,912 73,607 76,949 43,389 62,301 77,311 89,111 99,066 107,10 114,20 120,60 49,937 72,417 91,285 105,80 119,24 130,19 139,93 148,14 49,221 72,710 91,753 108,01 121,61 133,28 143,86 153,19 51,434 76,605 97,360 115,23 131,63 144,33 156,02 166,57 53,435 79,582 102,32 121,14 138,10 152,38 165,96 176,86 54,127 81,097 103,19 122,79 139,62 154,33 167,30 179,54 53,539 78,405 98,695 115,23 129,08 141,10 151,49 160,54 Table E.1: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x1 to 8x8 b s 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 50,181 71,272 86,975 99,421 110,27 117,71 124,62 130,50 47,883 65,853 79,068 89,111 97,583 103,92 109,68 114,02 46,196 61,170 72,662 79,909 86,182 92,352 94,830 99,978 40,993 52,992 61,394 66,008 69,346 72,551 74,892 76,208 37,664 45,182 50,252 53,797 56,494 58,125 59,264 60,390 34,271 41,707 46,683 49,902 52,088 53,833 55,094 56,012 32,853 38,565 41,416 43,492 44,770 45,922 46,286 47,331 31,434 36,733 39,665 41,527 42,955 43,882 44,702 45,454 Table E.2: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x9 to 8x16 86 Appendix E. 
Finding the best station-beam block size b s 9 10 11 12 13 14 15 16 87 1 2 3 4 5 6 7 8 79,815 82,081 83,640 85,392 86,945 88,697 89,750 90,866 126,19 129,74 134,68 138,10 141,40 144,37 147,07 149,53 155,63 161,17 166,86 173,06 176,85 180,77 185,84 189,29 161,70 168,17 175,97 182,19 187,20 193,04 197,80 203,02 176,19 184,63 192,90 200,45 206,73 213,31 219,19 224,64 188,02 197,10 205,50 214,39 220,74 226,53 231,83 240,60 190,39 198,72 208,00 214,69 222,50 229,72 236,40 240,89 167,25 174,33 181,87 186,35 191,48 196,15 200,40 203,23 Table E.3: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x1 to 16x8 b s 9 10 11 12 13 14 15 16 9 10 11 12 13 14 15 16 134,83 139,95 143,80 146,56 149,62 152,97 154,84 157,08 117,71 120,87 124,08 126,47 128,58 130,07 132,53 134,41 102,38 105,05 107,38 109,71 111,21 112,27 113,72 113,56 78,000 78,840 80,176 80,740 81,644 82,309 82,771 83,420 61,004 61,511 62,023 62,624 63,146 63,531 63,732 64,170 56,842 57,610 58,118 58,618 59,115 59,137 59,654 60,116 47,644 48,050 48,392 48,426 48,853 48,851 49,201 49,178 45,926 46,272 46,723 46,735 47,092 47,077 47,369 47,629 Table E.4: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x9 to 16x16 b s 2 4 8 16 32 64 128 256 1 2 3 4 5 6 7 8 44,160 59,940 77,424 90,866 98,960 103,35 105,50 106,01 62,301 88,229 120,02 149,53 170,35 183,63 190,22 192,23 72,185 106,36 147,85 189,54 221,85 244,58 256,22 263,12 72,534 107,36 153,19 201,54 246,18 275,12 294,25 305,54 76,762 115,23 167,45 224,64 275,71 316,30 341,78 357,92 80,007 120,78 177,49 240,60 299,67 344,95 374,77 392,16 80,972 122,46 179,36 242,61 298,11 336,90 361,16 375,82 78,303 115,35 160,54 203,23 237,01 259,55 271,45 280,98 Table E.5: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x1 to 256x8 b s 2 4 8 16 32 64 128 256 9 10 11 12 13 14 15 16 71,272 99,421 130,50 157,08 175,57 186,65 193,70 197,22 65,853 89,709 114,02 134,04 147,67 156,32 160,96 163,40 61,170 79,477 97,505 113,56 123,16 126,86 128,27 127,79 53,306 65,737 76,402 83,182 87,387 90,070 90,941 91,478 45,182 53,964 60,502 64,105 66,319 67,183 67,817 67,857 41,873 49,769 56,012 60,116 62,119 62,899 63,454 63,776 38,565 43,682 47,331 49,178 50,209 50,755 50,898 51,248 36,509 41,527 45,403 47,629 48,610 49,049 49,396 49,491 Table E.6: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x9 to 256x16 Appendix E. 
Finding the best station-beam block size b s 1 2 3 4 5 6 7 8 88 1 2 3 4 5 6 7 8 29,533 39,512 45,978 51,410 56,584 61,035 63,849 65,668 41,996 58,196 68,439 78,627 86,682 93,785 99,800 102,59 42,648 60,870 74,550 86,236 95,861 104,86 111,89 118,70 45,891 67,179 83,329 97,567 110,67 119,78 130,13 140,14 48,906 71,498 90,285 105,58 119,65 131,86 143,49 153,14 50,581 74,452 94,738 112,32 126,98 141,71 153,85 165,21 52,018 77,395 98,462 116,08 133,10 148,05 161,04 172,15 52,997 79,024 100,47 119,88 137,52 152,28 167,21 179,18 Table E.7: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x1 to 8x8 b s 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 52,216 78,565 100,78 120,54 138,36 153,12 167,56 179,18 51,318 77,156 98,462 118,28 136,94 151,38 166,58 179,18 50,404 75,761 97,294 114,87 130,84 144,85 157,23 167,23 45,776 66,731 83,496 97,212 109,24 118,31 127,21 134,98 42,205 60,243 74,550 86,023 95,429 103,28 110,35 115,64 39,559 57,120 69,580 79,631 88,088 94,443 100,38 105,14 38,146 53,880 64,960 73,713 80,942 86,086 91,051 94,543 35,212 49,027 58,388 65,603 71,632 76,331 80,020 83,271 Table E.8: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x9 to 8x16 b s 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 68,833 68,766 70,571 72,449 74,687 76,049 77,539 78,262 107,95 111,71 115,61 118,34 121,88 122,59 125,60 127,86 125,62 130,93 136,22 141,37 145,72 147,96 151,13 154,69 147,83 155,30 160,35 166,34 172,56 177,02 178,79 183,10 163,44 170,58 178,66 184,76 189,20 195,70 201,74 205,60 174,91 183,83 191,74 200,75 207,30 213,31 220,61 223,95 185,00 195,14 202,66 211,17 219,00 226,24 231,27 237,52 189,93 201,26 210,07 219,72 228,68 233,77 241,55 248,83 Table E.9: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x1 to 16x8 b s 9 10 11 12 13 14 15 16 9 10 11 12 13 14 15 16 191,01 200,36 210,25 219,36 226,33 234,12 242,80 248,12 190,57 200,94 210,40 219,07 228,35 235,71 242,51 251,41 178,13 187,02 195,08 203,44 210,13 216,26 221,91 226,16 141,84 147,35 152,24 157,17 162,17 165,68 168,34 172,27 120,99 124,93 129,14 132,53 135,57 138,31 141,12 143,03 109,25 113,11 116,23 118,72 121,45 123,40 125,89 127,45 98,413 101,12 104,12 106,16 108,16 109,76 111,38 112,84 85,542 88,099 89,733 91,439 93,217 94,545 95,733 96,803 Table E.10: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x9 to 16x16 Appendix E. 
Finding the best station-beam block size b s 2 4 8 16 32 64 128 256 89 1 2 3 4 5 6 7 8 39,721 50,823 65,668 78,643 86,725 92,719 95,353 96,861 57,307 77,713 102,18 127,02 144,54 156,32 162,69 166,29 61,035 86,051 119,45 155,35 183,45 203,38 215,22 221,07 67,029 97,567 140,14 183,10 223,15 250,52 267,12 278,38 71,092 105,74 153,89 205,60 251,04 285,72 305,72 318,49 74,699 111,85 165,58 223,95 277,85 319,68 345,02 360,79 76,941 115,51 173,85 237,52 296,79 342,12 369,78 386,87 78,507 119,34 179,18 247,25 314,06 362,87 392,85 411,86 Table E.11: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x1 to 256x8 b s 2 4 8 16 32 64 128 256 9 10 11 12 13 14 15 16 78,565 120,54 179,18 248,12 313,35 362,19 395,08 412,48 76,605 119,34 179,18 250,11 320,12 373,81 408,51 427,07 75,761 114,87 168,25 226,16 278,34 314,29 336,83 351,83 66,731 97,805 134,98 172,27 201,19 220,36 231,78 237,98 60,243 86,023 115,64 143,36 163,39 176,90 183,38 187,55 56,812 79,971 105,14 127,21 142,68 153,06 157,87 161,14 53,369 73,713 94,781 112,66 125,52 133,20 137,46 139,71 49,027 66,418 83,098 96,924 106,73 112,92 116,33 117,90 Table E.12: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x9 to 256x16 Appendix F Data structures namespace LOFAR { namespace RTCP { // Data which needs to be t r a n s p o r t e d between CN , ION and Storage . // Apart from read () and write () functionality , the data is a u g m e n t e d // with a s e q u e n c e number in order to detect missing data . Furthermore , // an i n t e g r a t i o n o p e r a t o r += can be defined to reduce the data . // E n d i a n n e s s : // * CN / ION are big endian ( ppc ) // * Storage is little endian ( intel ) // * S t a t i o n s are little endian // // E n d i a n n e s s is swapped by : // * Data r e c e i v e d by the CN from the s t a t i o n s ( t r a n s p o r t e d via the ION ) // * Data r e c e i v e d by Storage from the ION // // WARNING : We c o n s i d e r all data streams to be big endian , and will also write // them as such . s e q u e n c e N u m b e r is the only field c o n v e r t e d from and to big endian . class Str eamableD ata { public : // A stream is i n t e g r a t a b l e if it s u p p o r t s the += o p e r a t o r to combine // several objects into one . Strea mableDa ta ( bool isIntegr atable ): integratable ( is Integrat able ) {} // s u p p r e s s warning by d e f i n i n g a virtual d e s t r u c t o r virtual ~ S treamab leData () {} // return a copy of the object virtual St reamable Data * clone () const = 0; 90 Appendix F. Data structures 91 virtual size_t requiredSize () const = 0; virtual void allocate ( Allocator & allocator = heapAllocator ) = 0; virtual void read ( Stream * , bool w i t h S e q u e n c e N u m b e r ); virtual void write ( Stream * , bool withSequenceNumber , unsigned align = 0); bool isIntegr atable () const { return integratable ; } virtual St reamable Data & operator += ( const S treamabl eData &) { LOG_WARN ( " Integration not implemented . " ); return * this ; } uint32_t seque nceNumb er ; protected : const bool integratable ; // a s u b c l a s s should o v e r r i d e these to m a r s h a l l its data virtual void readData ( Stream *) = 0; virtual void writeData ( Stream *) = 0; }; // A typical data set c o n t a i n s a M u l t i D i m A r r a y of tuples and a set of flags . 
template < typename T = fcomplex , unsigned DIM = 4 > class SampleData : public Stre amableD ata { public : typedef typename MultiDimArray <T , DIM >:: ExtentList ExtentList ; SampleData ( bool isIntegratable , const ExtentList & extents , unsigned nrFlags ); virtual SampleData * clone () const { return new SampleData (* this ); } virtual size_t requiredSize () const ; virtual void allocate ( Allocator & allocator = heapAllocator ); MultiDimArray <T , DIM > samples ; std :: vector < SparseSet < unsigned > > flags ; protected : virtual void c he ck E nd ia nn e ss (); virtual void readData ( Stream *); virtual void writeData ( Stream *); private : // copy the E x t e n t L i s t instead of using a reference , // as boost by default uses a global one ( boost :: extents ) const ExtentList extents ; Appendix F. Data structures const unsigned nrFlags ; bool itsHaveWarnedLittleEndian ; 92 }; inline void St reamable Data :: read ( Stream * str , bool w i t h S e q u e n c e N u m b e r ) { if ( w i t h S e q u e n c e N u m b e r ) { str - > read (& sequenceNumber , sizeof se quenceN umber ); # if ! defined W OR DS _B I GE ND IA N dataConvert ( LittleEndian , & sequenceNumber , 1); # endif } readData ( str ); } inline void St reamable Data :: write ( Stream * str , bool withSequenceNumber , unsigned align ) { if ( w i t h S e q u e n c e N u m b e r ) { # if ! defined W OR DS _B I GE ND IA N if ( align > 1) { if ( align < sizeof ( uint32_t )) THROW ( AssertError , " Sizeof alignment < sizeof seque ncenumb er " ); void * sn_buf ; uint32_t sn = sequenc eNumber ; if ( p osix_me malign (& sn_buf , align , align ) != 0) { THROW ( InterfaceException , " could not allocate data " ); } try { dataConvert ( BigEndian , & sn , 1); memcpy ( sn_buf , & sn , sizeof sn ); str - > write ( sn_buf , align ); } catch (...) { free ( sn_buf ); throw ; } free ( sn_buf ); } else { uint32_t sn = sequenc eNumber ; dataConvert ( LittleEndian , & sn , 1); Appendix F. Data structures 93 str - > write (& sn , sizeof sn ); } # else if ( align > 1) { if ( align < sizeof ( uint32_t )) THROW ( AssertError , " Sizeof alignment < sizeof sequ encenumb er " ); void * sn_buf ; if ( p osix_me malign (& sn_buf , align , align ) != 0) { THROW ( InterfaceException , " could not allocate data " ); } try { memcpy ( sn_buf , & sequenceNumber , sizeof s equenceN umber ); str - > write ( sn_buf , align ); } catch (...) { free ( sn_buf ); throw ; } free ( sn_buf ); } else { str - > write (& sequenceNumber , sizeof s equence Number ); } # endif } writeData ( str ); } template < typename T , unsigned DIM > inline SampleData <T , DIM >:: SampleData ( bool isIntegratable , const ExtentList & extents , unsigned nrFlags ) : Strea mableDa ta ( isI ntegrata ble ) , flags (0) , extents ( extents ) , nrFlags ( nrFlags ) , i t s H a v e W a r n e d L i t t l e E n d i a n ( false ) { } template < typename T , unsigned DIM > inline size_t SampleData <T , DIM >:: requiredSize () const { return align ( MultiDimArray <T , DIM >:: nrElements ( extents ) * sizeof ( T ) ,32); } template < typename T , unsigned DIM > inline void SampleData <T , DIM >:: allocate ( Allocator & allocator ) Appendix F. Data structures 94 { samples . resize ( extents , 32 , allocator ); flags . resize ( nrFlags ); } template < typename T , unsigned DIM > inline void SampleData <T , DIM >:: ch ec kE n di an ne s s () { # if 0 && ! defined WO RD S_ B IG EN DI A N dataConvert ( LittleEndian , samples . origin () , samples . 
template <typename T, unsigned DIM>
inline void SampleData<T, DIM>::checkEndianness()
{
#if 0 && !defined WORDS_BIGENDIAN
  dataConvert(LittleEndian, samples.origin(), samples.num_elements());
#endif
}

template <typename T, unsigned DIM>
inline void SampleData<T, DIM>::readData(Stream *str)
{
  str->read(samples.origin(), samples.num_elements() * sizeof(T));
  checkEndianness();
}

template <typename T, unsigned DIM>
inline void SampleData<T, DIM>::writeData(Stream *str)
{
#if 0 && !defined WORDS_BIGENDIAN
  if (!itsHaveWarnedLittleEndian) {
    itsHaveWarnedLittleEndian = true;
    LOG_WARN("writing data in little endian.");
  }

  // THROW(AssertError, "not implemented: think about endianness");
#endif

  str->write(samples.origin(), samples.num_elements() * sizeof(T));
}

} // namespace RTCP
} // namespace LOFAR

Listing F.1: SampleData
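As a usage illustration, the following is a minimal sketch and not part of the LOFAR code base: the dimensions are invented, and it assumes the names from the listing above (Stream, fcomplex, heapAllocator) are visible, e.g. inside namespace LOFAR::RTCP, with `stream` standing for any concrete Stream implementation.

// Sketch only: allocate a 4-D sample buffer and exchange it over a stream.
void writeSamplesSketch(Stream &stream)
{
  // 2 stations, 16 channels, 768 time samples, 2 polarizations (all invented).
  SampleData<fcomplex, 4> data(false, boost::extents[2][16][768][2], 2);

  data.allocate();              // resizes `samples` (32-byte aligned) and `flags`
  data.sequenceNumber = 42;

  data.write(&stream, true);    // sequence number first, then writeData()
  data.read(&stream, true);     // symmetric: sequence number, then readData()
}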
namespace LOFAR {
namespace RTCP {

// Note: struct must remain copyable to avoid ugly constructions when passing it around
struct SubbandMetaData {
  public:
    SubbandMetaData(unsigned nrSubbands, unsigned nrBeams, size_t alignment = 16, Allocator &allocator = heapAllocator);
    virtual ~SubbandMetaData();

    struct beamInfo {
      float  delayAtBegin, delayAfterEnd;
      double beamDirectionAtBegin[3], beamDirectionAfterEnd[3];
    };

    struct marshallData {
      unsigned char flagsBuffer[132];
      unsigned      alignmentShift;

      // itsNrBeams elements will really be allocated, so this array needs to
      // be the last element. Also, ISO C++ forbids zero-sized arrays, so we use size 1.
      struct beamInfo beams[1];
    };

    SparseSet<unsigned>  getFlags(unsigned subband) const;
    void                 setFlags(unsigned subband, const SparseSet<unsigned> &);

    unsigned             alignmentShift(unsigned subband) const;
    unsigned             &alignmentShift(unsigned subband);

    struct beamInfo      *beams(unsigned subband) const;
    struct beamInfo      *beams(unsigned subband);

    struct marshallData  &subbandInfo(unsigned subband) const;
    struct marshallData  &subbandInfo(unsigned subband);

    void read(Stream *str);
    void write(Stream *str) const;

    // size of the information for one subband
    const unsigned itsSubbandInfoSize;

  private:
    const unsigned itsNrSubbands, itsNrBeams;

    // size of the information for all subbands
    const unsigned itsMarshallDataSize;

    // the pointer to all our data, which consists of struct marshallData[itsNrSubbands],
    // except for the fact that the elements are spaced further apart than
    // sizeof(struct marshallData) to make room for extra beams which are not
    // defined in the marshallData structure.
    //
    // Access elements through subbandInfo(subband).
    char *const itsMarshallData;

    // remember the allocator with which we allocated the memory for the marshallData
    Allocator &itsAllocator;
};

inline SubbandMetaData::SubbandMetaData(unsigned nrSubbands, unsigned nrBeams, size_t alignment, Allocator &allocator)
:
  // Size of the data we need to allocate. Note that marshallData already
  // contains the size of one beamInfo.
  itsSubbandInfoSize(sizeof(struct marshallData) + (nrBeams - 1) * sizeof(struct beamInfo)),

  itsNrSubbands(nrSubbands),
  itsNrBeams(nrBeams),
  itsMarshallDataSize(nrSubbands * itsSubbandInfoSize),
  itsMarshallData(static_cast<char *>(allocator.allocate(itsMarshallDataSize, alignment))),
  itsAllocator(allocator)
{
#if defined USE_VALGRIND
  memset(itsMarshallData, 0, itsMarshallDataSize);
#endif
}

inline SubbandMetaData::~SubbandMetaData()
{
  itsAllocator.deallocate(itsMarshallData);
}

inline SparseSet<unsigned> SubbandMetaData::getFlags(unsigned subband) const
{
  SparseSet<unsigned> flags;

  flags.unmarshall(subbandInfo(subband).flagsBuffer);
  return flags;
}

inline void SubbandMetaData::setFlags(unsigned subband, const SparseSet<unsigned> &flags)
{
  ssize_t size = flags.marshall(&subbandInfo(subband).flagsBuffer, sizeof subbandInfo(subband).flagsBuffer);

  assert(size >= 0);
}

inline unsigned SubbandMetaData::alignmentShift(unsigned subband) const
{
  return subbandInfo(subband).alignmentShift;
}

inline unsigned &SubbandMetaData::alignmentShift(unsigned subband)
{
  return subbandInfo(subband).alignmentShift;
}

inline struct SubbandMetaData::beamInfo *SubbandMetaData::beams(unsigned subband) const
{
  return &subbandInfo(subband).beams[0];
}

inline struct SubbandMetaData::beamInfo *SubbandMetaData::beams(unsigned subband)
{
  return &subbandInfo(subband).beams[0];
}

inline struct SubbandMetaData::marshallData &SubbandMetaData::subbandInfo(unsigned subband) const
{
  // calculate the array stride ourselves,
  // since C++ does not know the proper size of the marshallData elements
  return *reinterpret_cast<struct marshallData *>(itsMarshallData + (subband * itsSubbandInfoSize));
}

inline struct SubbandMetaData::marshallData &SubbandMetaData::subbandInfo(unsigned subband)
{
  // calculate the array stride ourselves,
  // since C++ does not know the proper size of the marshallData elements
  return *reinterpret_cast<struct marshallData *>(itsMarshallData + (subband * itsSubbandInfoSize));
}

inline void SubbandMetaData::read(Stream *str)
{
  // TODO: endianness
  str->read(itsMarshallData, itsMarshallDataSize);
}

inline void SubbandMetaData::write(Stream *str) const
{
  // TODO: endianness
  str->write(itsMarshallData, itsMarshallDataSize);
}

} // namespace RTCP
} // namespace LOFAR

Listing F.2: SubbandMetaData
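The variable-length record layout is easiest to see in use. The following sketch is hypothetical (invented sizes, `stream` again stands for a concrete Stream implementation, and the listing's names are assumed to be visible) and only exercises the accessors defined above.

// Sketch only: fill in metadata for 4 subbands with 2 beams each and write it out.
void writeMetaDataSketch(Stream &stream)
{
  SubbandMetaData meta(4, 2);

  // Each record occupies sizeof(marshallData) + 1 * sizeof(beamInfo) bytes;
  // subbandInfo(i) jumps i * itsSubbandInfoSize bytes into the buffer.
  meta.beams(0)[1].delayAtBegin = 1.5e-6f;       // second beam of subband 0
  meta.alignmentShift(0) = 0;

  SparseSet<unsigned> flags = meta.getFlags(3);  // unmarshall the flags of subband 3
  meta.setFlags(3, flags);                       // ...and marshall them back

  meta.write(&stream);                           // one contiguous write of all records
}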
namespace LOFAR {
namespace RTCP {

class BeamFormedData : public SampleData<fcomplex, 4> {
  public:
    typedef SampleData<fcomplex, 4> SuperType;

    BeamFormedData(unsigned nrBeams, unsigned nrChannels, unsigned nrSamplesPerIntegration);

    virtual BeamFormedData *clone() const { return new BeamFormedData(*this); }
};

inline BeamFormedData::BeamFormedData(unsigned nrBeams, unsigned nrChannels, unsigned nrSamplesPerIntegration)
:
  // The "| 2" significantly improves transpose speeds for particular
  // numbers of stations due to cache conflict effects. The extra memory
  // is not used.
  SuperType::SampleData(false, boost::extents[nrBeams][nrChannels][nrSamplesPerIntegration | 2][NR_POLARIZATIONS], nrBeams)
{
}

} // namespace RTCP
} // namespace LOFAR

Listing F.3: BeamFormedData
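For completeness, a minimal sketch of how this output buffer might be created (the dimensions are invented; fcomplex and NR_POLARIZATIONS come from the LOFAR headers, which are assumed to be available):

// Sketch only: allocate the beam-formed output for 3 beams, 16 channels,
// 768 samples per integration.
void allocateBeamsSketch()
{
  BeamFormedData beams(3, 16, 768);
  beams.allocate();

  // samples is indexed as [beam][channel][sample][polarization]; the sample
  // dimension is allocated as (768 | 2) = 770, the two extra elements being unused.
  fcomplex first = beams.samples[0][0][0][0];
  (void) first;
}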