VRIJE UNIVERSITEIT AMSTERDAM

Radio astronomy beam forming on GPUs

by
Alessio Sclocco

Supervisors:
Dr. Rob van Nieuwpoort
Dr. Ana Lucia Varbanescu

A thesis submitted in partial fulfillment for the degree of Master of Science
in the
Faculty of Sciences
Department of Computer Science

March 2011
“Computer Science is no more about computers than astronomy is about telescopes.”
E. W. Dijkstra
VRIJE UNIVERSITEIT AMSTERDAM
Abstract
Faculty of Sciences
Department of Computer Science
Master of Science
by Alessio Sclocco
In order to build the radio telescopes needed for the experiments planned for the years to come, it will be necessary to design computers capable of performing a thousand times more floating point operations per second than today's most powerful computers, and to do so in a very power efficient way. In this work we focus on the parallelization of a specific operation, part of the pipeline of most modern radio telescopes: beam forming. We aim at discovering if this operation can be accelerated using Graphics Processing Units (GPUs). To do so, we analyze a reference beam former, the one that ASTRON uses for the LOFAR radio telescope, discuss different parallelization strategies, and then implement and test the algorithm on an NVIDIA GTX 480 video card. Furthermore, we compare the performance of our algorithm using two different implementation frameworks: CUDA and OpenCL.
Contents

Abstract
List of Figures
List of Tables
Abbreviations

1 Introduction

2 Background
  2.1 Radio astronomy
  2.2 Electromagnetic radiation
  2.3 Beam forming

3 Related works
  3.1 Hardware beam formers
  3.2 Software beam formers

4 General Purpose computations on GPUs
  4.1 The GPU pipeline
  4.2 The reasons behind GPGPU
  4.3 NVIDIA architecture
  4.4 CUDA
  4.5 An example: SOR
      4.5.1 Performance

5 Application analysis
  5.1 Data structures
  5.2 The beam forming algorithm
      5.2.1 Delays computation
      5.2.2 Flags computation
      5.2.3 Beams computation
  5.3 Parallelization strategies

6 CUDA BeamFormer
  6.1 Experimental setup
  6.2 BeamFormer 1.0
  6.3 BeamFormer 1.1
  6.4 BeamFormer 1.2
  6.5 BeamFormer 1.3
  6.6 BeamFormer 1.4
  6.7 BeamFormer 1.5
  6.8 Conclusions

7 OpenCL BeamFormer
  7.1 The Open Computing Language
  7.2 Porting the BeamFormer 1.5 from CUDA to OpenCL
  7.3 OpenCL BeamFormer performance
  7.4 Conclusions

8 Finding the best station-beam block size
  8.1 Experimental setup
  8.2 OpenCL results
  8.3 CUDA results
  8.4 Conclusions

9 Conclusions

A CUDA BeamFormer execution time
B CUDA BeamFormer GFLOP/s
C CUDA BeamFormer GB/s
D OpenCL BeamFormer measurements
E Finding the best station-beam block size
F Data structures

Bibliography
List of Figures

2.1 Electromagnetic spectrum, courtesy of Wikipedia.
2.2 Hardware beam former, courtesy of Toby Haynes [1].

3.1 One of the THEA boards, courtesy of ASTRON.
3.2 EMBRACE radio frequency beam former chip, courtesy of P. Picard [2].

4.1 Hardware pipeline of a video card [3].
4.2 Comparison between Intel CPUs and NVIDIA GPUs in terms of GFLOP/s, courtesy of NVIDIA [4].
4.3 Comparison between Intel CPUs and NVIDIA GPUs in terms of GB/s, courtesy of NVIDIA [4].
4.4 The number of transistors devoted to different functions in CPUs and GPUs, courtesy of NVIDIA [4].
4.5 NVIDIA Tesla GPU architecture [5].
4.6 NVIDIA Fermi GPU architecture [6].
4.7 SOR execution time (lower is better).
4.8 SOR speed-up (higher is better).

6.1 Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
6.2 Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
6.3 GFLOP/s of various BeamFormer versions merging 64 stations (higher is better).
6.4 GB/s of various BeamFormer versions merging 64 stations (higher is better).

7.1 NDRange example, courtesy of Khronos Group [7].
7.2 GFLOP/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better).
7.3 GB/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better).
7.4 Execution time in seconds of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (lower is better).

8.1 GFLOP/s for the OpenCL BeamFormer: block sizes from 64x1 to 64x16 (higher is better).
8.2 GFLOP/s for the CUDA BeamFormer: block sizes from 64x1 to 64x16 (higher is better).
8.3 Comparison of CUDA and OpenCL BeamFormers: block sizes from 64x1 to 64x16 (higher is better).
List of Tables

4.1 Comparison between an Intel CPU and an NVIDIA GPU.

6.1 Operational intensity and registers used by each kernel.
6.2 Algorithms' optimization strategies and code differences.

9.1 Comparison of the beam former running on the ASTRON IBM Blue Gene/P and on an NVIDIA GTX 480.

A.1 Execution time in seconds for the BeamFormer 1.0.2
A.2 Execution time in seconds for the BeamFormer 1.1
A.3 Execution time in seconds for the BeamFormer 1.1 2x2
A.4 Execution time in seconds for the BeamFormer 1.1.1
A.5 Execution time in seconds for the BeamFormer 1.1.1 2x2
A.6 Execution time in seconds for the BeamFormer 1.2 2x2
A.7 Execution time in seconds for the BeamFormer 1.2 4x4
A.8 Execution time in seconds for the BeamFormer 1.2 8x8
A.9 Execution time in seconds for the BeamFormer 1.2.1
A.10 Execution time in seconds for the BeamFormer 1.2.1.1
A.11 Execution time in seconds for the BeamFormer 1.2.2
A.12 Execution time in seconds for the BeamFormer 1.2.2.1
A.13 Execution time in seconds for the BeamFormer 1.3
A.14 Execution time in seconds for the BeamFormer 1.4
A.15 Execution time in seconds for the BeamFormer 1.5 2x2
A.16 Execution time in seconds for the BeamFormer 1.5 4x4
A.17 Execution time in seconds for the BeamFormer 1.5 8x8

B.1 GFLOP/s for the BeamFormer 1.1
B.2 GFLOP/s for the BeamFormer 1.1 2x2
B.3 GFLOP/s for the BeamFormer 1.1.1
B.4 GFLOP/s for the BeamFormer 1.1.1 2x2
B.5 GFLOP/s for the BeamFormer 1.2 2x2
B.6 GFLOP/s for the BeamFormer 1.2 4x4
B.7 GFLOP/s for the BeamFormer 1.2 8x8
B.8 GFLOP/s for the BeamFormer 1.2.1
B.9 GFLOP/s for the BeamFormer 1.2.1.1
B.10 GFLOP/s for the BeamFormer 1.2.2
B.11 GFLOP/s for the BeamFormer 1.2.2.1
B.12 GFLOP/s for the BeamFormer 1.3
B.13 GFLOP/s for the BeamFormer 1.4
B.14 GFLOP/s for the BeamFormer 1.5 2x2
B.15 GFLOP/s for the BeamFormer 1.5 4x4
B.16 GFLOP/s for the BeamFormer 1.5 8x8

C.1 GB/s for the BeamFormer 1.0.2
C.2 GB/s for the BeamFormer 1.1
C.3 GB/s for the BeamFormer 1.1 2x2
C.4 GB/s for the BeamFormer 1.1.1
C.5 GB/s for the BeamFormer 1.1.1 2x2
C.6 GB/s for the BeamFormer 1.2 2x2
C.7 GB/s for the BeamFormer 1.2 4x4
C.8 GB/s for the BeamFormer 1.2 8x8
C.9 GB/s for the BeamFormer 1.2.1
C.10 GB/s for the BeamFormer 1.2.1.1
C.11 GB/s for the BeamFormer 1.2.2
C.12 GB/s for the BeamFormer 1.2.2.1
C.13 GB/s for the BeamFormer 1.3
C.14 GB/s for the BeamFormer 1.4
C.15 GB/s for the BeamFormer 1.5 2x2
C.16 GB/s for the BeamFormer 1.5 4x4
C.17 GB/s for the BeamFormer 1.5 8x8

D.1 Execution time in seconds for the BeamFormer 1.5-opencl 2x2
D.2 Execution time in seconds for the BeamFormer 1.5-opencl 4x4
D.3 Execution time in seconds for the BeamFormer 1.5-opencl 8x8
D.4 GFLOP/s for the BeamFormer 1.5-opencl 2x2
D.5 GFLOP/s for the BeamFormer 1.5-opencl 4x4
D.6 GFLOP/s for the BeamFormer 1.5-opencl 8x8
D.7 GB/s for the BeamFormer 1.5-opencl 2x2
D.8 GB/s for the BeamFormer 1.5-opencl 4x4
D.9 GB/s for the BeamFormer 1.5-opencl 8x8

E.1 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x1 to 8x8
E.2 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x9 to 8x16
E.3 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x1 to 16x8
E.4 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x9 to 16x16
E.5 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x1 to 256x8
E.6 GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x9 to 256x16
E.7 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x1 to 8x8
E.8 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x9 to 8x16
E.9 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x1 to 16x8
E.10 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x9 to 16x16
E.11 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x1 to 256x8
E.12 GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x9 to 256x16
Abbreviations

CU      Compute Unit
CUDA    Compute Unified Device Architecture
EMR     ElectroMagnetic Radiation
FPGA    Field Programmable Gate Array
GDBF    Generic Digital Beam Former
GPGPU   General Purpose computations on GPU
GPU     Graphical Processing Unit
LOFAR   LOw Frequency ARray
OpenCL  Open Computing Language
PE      Processing Element
SIMT    Single Instruction Multiple Thread
SOR     Successive Over-Relaxation
SPA     Streaming Processor Array
TPC     Texture/Processor Cluster
VHDL    VHSIC Hardware Description Language
Chapter 1
Introduction
Radio astronomy has changed in recent years, and most of these changes concern its instruments, i.e. radio telescopes. Classic radio telescopes were big directional dish antennas, and all the experiments were performed with hardware sensors connected to those antennas. Unfortunately, there are engineering limits on the size that a single-dish telescope can reach, and furthermore building increasingly complex special purpose hardware instruments is expensive and does not offer flexibility.
Radio interferometry gives engineers the possibility of building bigger radio telescopes: instead of using a single large dish antenna, it combines the signals received by many different antennas to form a virtual radio telescope that is the combination of all the small antennas. While at the beginning this technique was used to connect only antennas close to each other, it is nowadays possible to connect antennas that are thousands of kilometers apart, and so to obtain radio telescopes with apertures that were not even imaginable ten years ago. The LOw Frequency ARray (LOFAR) radio telescope [8] is an example of this new generation of telescopes.
In this context the beam forming algorithm, the subject of this work, acquires an increased importance. This technique makes it possible to combine different signals, received from many antennas, into a single coherent signal, called a beam. Moreover, beam forming gives directionality to an array of non-directional antennas.
The other revolution in the instruments of radio astronomy is the possibility of implementing larger components of the operational pipeline of a radio telescope in software, thus reducing the cost of developing new instruments and increasing the flexibility of radio telescopes. However, to perform new and much more complex experiments in the following years, more powerful instruments will be necessary. With software radio telescopes there will be a need for more powerful computers to run this software, computers able to perform a very large number of operations per second, a thousand times more than the computational power of the current best supercomputers in the world: exa-scale computers [9]. But, in order to build this next generation of supercomputers, we will not just need faster processing units and increased memory and network bandwidth, we will also need to achieve better power efficiency. In fact, with current technology, operating such an exa-scale computer would require its own dedicated power plant [10].
A possible solution for building more powerful and power efficient supercomputers is to accelerate the computations using Graphics Processing Units (GPUs). The architecture of modern GPUs is inherently parallel and can easily be used to accelerate complex scientific operations, like radio astronomy beam forming in our case. Moreover, not only is the absolute performance of GPUs, in terms of both computational power and memory bandwidth, higher than that achieved by CPUs: the gap between GPU and CPU performance is widening, and GPUs are more power efficient.
The goal of this thesis is to answer the question whether the beam forming algorithm used for the LOFAR radio telescope, our reference, can be efficiently parallelized to run on GPUs. Thinking about how to build an exa-scale supercomputer, we want to understand if a GPU-based beam former can match, or even outperform, the parallel beam former used in production at ASTRON.
The rest of this work is organized as follows. Chapters 2 and 3 provide, respectively, a background on radio astronomy and beam forming, and related work in digital beam forming for modern radio telescopes. Chapter 4 presents an introduction to general purpose computing on GPUs, introducing the main concepts, presenting the NVIDIA architecture used to parallelize the beam former, and showing an example of the parallelization of a simple algorithm on the GPU. The application analysis, consisting of the description of the sequential algorithm and the parallelization strategies, is included in Chapter 5. Chapters 6, 7 and 8 describe the implemented parallel beam formers, the experiments and the results, also providing partial conclusions and comparisons. The overall conclusions of this work are presented in Chapter 9. In the Appendices we provide the detailed results of all the performed experiments and the source code of the relevant input and output data structures.
Chapter 2
Background
The focus of this master project is the parallelization of the beam forming algorithm on GPUs. Beam forming is a standard signal processing technique aimed at providing spatial selectivity in the reception or transmission of a signal. In this work we use beam forming in the field of radio astronomy, to receive data from a particular region of the sky using a large array of omnidirectional antennas.
Section 2.1 gives a brief introduction to radio astronomy and its instruments. In Section 2.2 we introduce the fundamental physical notions on electromagnetic radiation that are necessary to understand how the beam former works, and finally in Section 2.3 we provide a brief introduction to beam forming itself.
2.1 Radio astronomy
Radio astronomy is the field of astronomy that studies the universe at radio frequencies, while classical astronomy studies only the so-called “visible” universe. Its origin can be dated to 1933, when Karl G. Jansky published [11] the discovery of an electromagnetic emission from our galaxy, the Milky Way. This discovery, made by Jansky during an investigation aimed at finding the causes of static disturbances on transatlantic voice transmissions for the Bell Telephone Laboratories, prompted scientists to design and develop more complex and precise instruments to receive radio sources originating from outer space. Today the possibility of looking at the universe in other frequencies of the electromagnetic spectrum gives astronomers the means to penetrate some of the deepest secrets of the universe, like analyzing the molecules composing planets, stars, galaxies and everything else in the known universe, and also the possibility of observing objects that, in the realm of visible light, are invisible, like pulsars.
The history and achievements of radio astronomy are deeply connected with the history of radio telescopes. Just as a standard optical telescope is, in its barest essence, a mirror, a radio telescope can be seen as just an antenna, tuned to receive a particular frequency. Jansky's prototype, for example, was an array of dipoles and reflectors receiving radio emissions at 20.5 MHz. The whole antenna was able to rotate, completing a full circle every 20 minutes. This prototype, however, is really different from what is known today as a radio telescope. Nowadays, radio telescopes are in the majority of cases large parabolic antennas, in which the electromagnetic emission is reflected by the surface of a dish onto an electronic receiver. The first radio telescope of this kind was built by Grote Reber in 1937.
But, as with optical telescopes, engineering difficulties arise when trying to increase the dimensions of the dish to obtain bigger telescopes, which provide larger apertures and are needed for more accurate observations. An increase in costs is also unavoidable when building bigger radio telescopes. A solution to this problem is provided by radio interferometry, a technique that combines the signals received by two or more receivers, obtaining a virtual radio telescope with a resolution equivalent to that of a telescope with a single antenna whose diameter is equal to the distance between the farthest receivers.
The LOw Frequency ARray (LOFAR) is a radio interferometric array, developed by ASTRON, and is the radio telescope whose beam former we parallelize on GPUs in this work. Composed of more than 10,000 antennas, LOFAR is one of the biggest radio telescopes ever built. But it is not only the dimensions of LOFAR that are groundbreaking: a really interesting aspect is that this radio telescope is mostly implemented in software.
LOFAR's antennas are of two kinds: low-band antennas, for the frequency range of 10-80 MHz, and high-band antennas, for the range of 110-250 MHz. Antennas are organized in stations; stations are important in this work because our beam former does not merge the signals coming directly from the antennas, but merges the output signals of the stations. In fact, each station combines, using FPGAs, the signals of its antennas and sends a single signal to the central processing facility. The real-time processing pipeline of the LOFAR radio telescope runs, at the central processing facility, on a two and a half rack IBM Blue Gene/P.
In order to reduce the amount of data sent by the stations to the central processing facility, only a limited number of directions and frequencies is sent by each station; these directions and frequencies are called subbands. The subbands are split further into narrower frequency bands at the central processing facility; these narrow frequency bands are called channels. These concepts will return frequently in the remainder of this work.
2.2 Electromagnetic radiation
The existence of electromagnetic radiation was postulated by James Clerk Maxwell in his theory of electromagnetism, and later demonstrated experimentally by Heinrich Hertz. It is, in fact, possible to derive from Maxwell's equations that waves are generated by the oscillation of an electric and a magnetic field (i.e. an oscillating electromagnetic field). While an ordinary wave is capable of propagating only through matter, an electromagnetic wave is also capable of propagating through vacuum, and it does so at a constant speed of 299,792 km/s (the speed of light). This characteristic of electromagnetic waves permits us to receive signals from outer space, even though there is only vacuum between the Earth and the emitting sources. Visible light is just one example of electromagnetic radiation; other examples are radio waves and X-rays. Another peculiarity of electromagnetic radiation, which derives from quantum theory, is that it can behave both as a wave and as a particle.
First we characterize the properties of electromagnetic radiation as a wave. Like any other wave, electromagnetic radiation has three characteristics: speed, frequency and wavelength. As previously said, the speed in vacuum is a constant and is exactly the speed of light, which in physics is denoted by c. In media other than vacuum the speed is less than the speed of light and depends on the specific medium; in this case we use v to indicate the speed of the wave. The frequency is the rate, measured in Hertz (Hz), at which the radiating electromagnetic field is oscillating; in physics, the symbol ν is used for the frequency. The wavelength of a wave is the distance between two successive crests, or troughs, and is simply measured in meters; the symbol that we use for the wavelength is λ. The frequency, speed and wavelength of a wave are related: in general we can write this relation as v = ν · λ, which in the case of an electromagnetic wave propagating in vacuum becomes c = ν · λ. Electromagnetic radiation can be classified using its frequency, or its wavelength, producing what is called the electromagnetic spectrum, shown in Figure 2.1.
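As a small worked example using values already mentioned in this chapter (the speed of light and the LOFAR observing bands of Section 2.1), the wavelengths at the edges of the LOFAR frequency range follow directly from λ = c / ν: at 10 MHz, λ = (2.998 × 10^8 m/s) / (10^7 Hz) ≈ 30 m, while at 250 MHz, λ = (2.998 × 10^8 m/s) / (2.5 × 10^8 Hz) ≈ 1.2 m.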
While propagating from the generating electromagnetic field, the radiation travels in straight lines in all directions, as if covering the surface of a sphere. As the area of this sphere increases proportionally to the square of its radius (i.e. the distance the radiation has travelled), following the well known equation A = 4πR², the radiation loses its signal strength.
Another property of the EMR is polarization. The polarization is the direction of oscillation of the electric component of the electromagnetic field.
Figure 2.1: Electromagnetic spectrum, courtesy of Wikipedia.
When two different waves have similar frequencies, the relative measure of their alignment is called phase and is measured in degrees, from 0 to 360. If the peaks and troughs
of two waves match over time, then they are said to be in phase with each other.
When viewed as a particle, electromagnetic radiation is composed of a discrete stream of energy quanta, called photons. The energy that each photon transports is proportional to the wave frequency, and is expressed by the equation E = h · ν, where h = 6.625 × 10^-27 erg · s is the Planck constant.
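For instance, a 100 MHz radio photon carries an energy of E = h · ν = (6.625 × 10^-27 erg · s) × (10^8 Hz) ≈ 6.6 × 10^-19 erg, several orders of magnitude less than a photon of visible light (ν ≈ 5 × 10^14 Hz).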
To conclude this introduction on the properties of electromagnetic radiation, it is interesting to understand where these oscillating electromagnetic fields originate, especially in the field of radio astronomy. The main mechanism for the production of electromagnetic radiation is thermal. Heat is produced by the movement of the inner molecules of a solid, gas or liquid. When molecules move, electromagnetic radiation is produced at all the frequencies of the electromagnetic spectrum, with the amount of radiation emitted at each frequency related to the temperature of the emitting body. Indeed, an emitting body with a higher temperature will emit more energy, and so more electromagnetic radiation, at all frequencies, with its peak of emission concentrated at higher frequencies. This relationship is known as Wien's law and can be written as νmax = (α / h) · k · T, where T is the temperature (in Kelvin), k is the Boltzmann constant, h is the Planck constant and α ≈ 2.821439 is an empirical constant.
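As a sanity check of the formula (a worked example with textbook constants, not a value taken from this thesis): for a body at T ≈ 5800 K, roughly the surface temperature of the Sun, and k ≈ 1.38 × 10^-16 erg/K, we get νmax ≈ 2.82 × (1.38 × 10^-16 erg/K × 5800 K) / (6.625 × 10^-27 erg · s) ≈ 3.4 × 10^14 Hz, a peak in the near infrared, just below the frequencies of visible light.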
Matter, in the state of a solid or plasma, is said to be a blackbody if it emits thermal radiation. We can summarize the characteristics of a blackbody as follows:
1. a blackbody with a temperature higher than 0 Kelvin emits some energy at all frequencies;
2. a blackbody whose temperature is higher than that of another one will emit more energy, at all frequencies, than the other one;
3. the higher the temperature of the blackbody, the higher the frequency at which the maximum energy is emitted.
An electromagnetic field can also be produced (in rare cases) as the consequence of a non-thermal phenomenon. An example of electromagnetic radiation of non-thermal origin is synchrotron radiation. This radiation is produced when a charged particle enters a magnetic field and, being forced to move around the magnetic lines of force, is accelerated to nearly the speed of light. However, unlike electromagnetic radiation produced by thermal emission, in non-thermal radiation the intensity decreases with the frequency, i.e. the lower the frequency of the radiation, the higher the energy emitted.
2.3 Beam forming
Beam forming is a standard signal processing technique used to control the directionality of an array of antennas. It can be used for both transmission and reception. In this work, we focus only on using beam forming for reception, i.e. to combine the signals received from an array of antennas and simulate a larger directional antenna.
The problem when combining signals received from different antennas is that the receivers are at different positions in space, so each of them receives the same signal emitted by a given source at a different time. Simply combining the signals received by the different antennas does not produce meaningful information, because the waves interfere. This interference, however, can be both constructive and destructive, and exploiting the behavior of constructively interfering waves is exactly what beam forming is based on. The simplest beam former can be built just by connecting nearby antennas to the same receiver with wires of different lengths, thus delaying the signals and producing a temporal shift and an increased sensitivity in a specific direction. This solution is not very flexible, so beam formers are nowadays implemented with special purpose hardware, or in software.
In general, to form a beam from different received signals, each signal is multiplied by a different complex weight and then all the weighted signals are summed together. The complex weight depends on the location of the source of interest and on the spatial positions of the antennas. In Figure 2.2 we show how a hardware receiving beam former can be used to combine the signals received by four different antennas and provide a single coherent signal.
In general, the complex weight that is multiplied with a received sample is composed of two values: an amplitude and a phase shift. However, in narrow-band systems (like the LOFAR radio telescope), a phase shift alone is sufficient to beam form the samples. The specific algorithm of the LOFAR beam former is described in detail in Section 5.2.
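To make the weight-and-sum operation concrete, the following minimal C sketch (our own illustration, not the LOFAR production code described in Section 5.2; the function and variable names are hypothetical) forms a single output beam sample from the samples of a number of inputs and their precomputed phase shifts:

#include <complex.h>
#include <stddef.h>

/* Minimal narrow-band receiving beam former: every input sample is multiplied
 * by a complex weight (here a pure phase shift, i.e. unit amplitude) and the
 * weighted samples are summed into one beam sample. The phase shifts are
 * assumed to be precomputed from the source direction and antenna positions. */
float complex formBeamSample(const float complex *samples,
                             const float *phaseShifts,
                             size_t nrInputs)
{
    float complex beam = 0.0f;

    for (size_t input = 0; input < nrInputs; input++) {
        /* cexpf(I * phi) builds the unit-amplitude complex weight */
        float complex weight = cexpf(I * phaseShifts[input]);
        beam += weight * samples[input];
    }

    return beam;
}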
Figure 2.2: Hardware beam former, courtesy of Toby Haynes [1].
Chapter 3
Related works
The beam forming algorithm is straightforward, as can be seen in Section 2.3. In order to deal with high data rates and an increasing number of signals to merge, beam formers are usually built using special purpose hardware. In this chapter we provide an overview of the most interesting beam forming solutions for radio astronomy: hardware implementations are presented in Section 3.1, and software solutions in Section 3.2.
3.1 Hardware beam formers
Due to its simplicity, and to achieve real-time performance, the beam forming algorithm
has a long history of hardware implementations. Although our work is focused on a
particular software implementation, it is important to provide an introduction to some
beam formers used in practice for radio astronomy.
The first beam former on our list was designed and built by the Netherlands Foundation for Research in Astronomy (NFRA) as a technology demonstrator for the SKA radio telescope. The name of this demonstrator is Thousand Element Array (THEA), and it consists of 1,024 antennas divided into 16 tiles covering an area of approximately 16 square meters. Each tile contains 16 boards, each equipped with 4 antennas and an embedded radio frequency beam former. A single tile is capable of forming two independent beams in hardware; the formed beams are then digitized and sent to a central digital beam former. THEA is capable of forming 32 beams simultaneously. A complete description of the hardware can be found in [12]. The demonstrator was successful enough to permit scientific experiments and its development is currently continued by ASTRON. A picture of the THEA beam former can be seen in Figure 3.1.
Figure 3.1: One of the THEA boards, courtesy of ASTRON.
In the process of designing and building the different technology demonstrators for the SKA project, different beam formers were proposed and realized at ASTRON. However, re-inventing a beam former for every specific project was considered suboptimal, and so the Generic Digital Beam Former (GDBF) platform was designed. The GDBF is a generic digital narrowband beam former, modeled using a high-level description language, VHDL, and subsequently implemented with both FPGAs and integrated circuits. What is interesting about the GDBF is that, even though it is a hardware project, it is based on non-functional requirements that are not so different from ours: it is designed to be flexible, to scale up to an increasing number of receiving antennas, and to be able to form multiple beams at the same time. Moreover, the GDBF is designed to deliver high performance and have low power consumption. A mathematical description of the project is available in [13].
The Center for Astronomy Signal Processing and Electronics Research (CASPER) of the University of California at Berkeley proposes a similar solution. In order to provide performance close to that of custom hardware, but without the time and costs involved in designing and building new special purpose hardware, they designed a set of FPGA modules for digital signal processing algorithms. These modules can be interconnected to form radio astronomy instruments, as can be read in [14]. The beam former for the Allen Telescope Array (ATA), called the BEEmformer, has been built using this technology; a description of the beam former and a comparison between this implementation and another one based on DSP processors is available in [15].
Another recent comparison between two different beam formers, a radio frequency and a
digital one, is provided in [2]. As these beam formers are also SKA demonstrators, they
show that there is an increasing interest in the field for high performance and scalable
beam formers. A photo of the radio frequency beam former chip is provided in Figure
3.2.
Figure 3.2: EMBRACE radio frequency beam former chip, courtesy of P. Picard [2].
The last beam former of this short survey was developed by Virginia Tech for the Eight-meter-wavelength Transient Array (ETA). The architecture of this beam former is layered: the signals are received by a cluster of 12 external FPGA nodes, each of them connected to two different receivers. Each cluster node outputs eight single-polarization beams and sends four of them to one internal FPGA node and the other four to another. There are 4 internal FPGA nodes, each of them receiving four beams from six external nodes, for a total of 24 beams. These internal nodes are used to combine the different beams and send them to the storage nodes. A description of the hardware, and of the scientific goals of this beam former, can be found in [16]. It is interesting to note that this is the only truly parallel beam former among the different hardware implementations that we described.
Beam forming in hardware is not exactly something new, but the goals set by radio astronomy for the near future are changing the field. Hardware beam formers, although still used, can hardly keep the pace: with future radio telescopes composed of millions of different antennas, producing a huge amount of data that has to be processed in almost real time, hardware solutions, although certainly able to provide raw performance, do not scale. Furthermore, hardware beam formers are inherently inflexible: adding new components (e.g. antennas) or modifying the beam forming algorithm (e.g. to form a different number of beams) requires the substitution of some components or the design and subsequent production of new beam forming chips. But producing new hardware is neither easy nor cheap, as it usually spans many years and involves complex and expensive prototyping. And it will not become cheaper in the future, because special purpose hardware never benefits from economies of scale. Moreover, hardware design requires expertise that is not as widespread as software design skills. A possible solution, and we can see from our small survey that it is a solution being widely investigated, is to create standard hardware components using FPGAs and then build more complex instruments with them. Using FPGAs can also improve the flexibility and scalability of these new hardware beam formers, and simplify the life of their designers.
3.2 Software beam formers
The history of software implementations of beam formers is shorter than the history of the hardware ones. There are, however, some interesting approaches that we need to examine in order to better understand the context of our own work.
The first software beam former that we encounter is the one currently used at ASTRON for the LOFAR radio telescope. A brief description can be found in [17], along with other information regarding the complete software pipeline and the correlator. This beam former is really important, not only because it implements the exact same algorithm we implement in this work, and with which we will eventually compare our results, but also because it is the first case of a real-time software beam former used in production. The telescope has already been described in Section 2.1 and the beam former in Section 2.3, so we do not add any more details here.
Another attempt to build a real-time software pipeline for a radio telescope has been made in India for the Giant Metrewave Radio Telescope (GMRT) and is described in [18]. The telescope is composed of 32 antennas placed over an area of 25 km in diameter. The software pipeline is implemented on commodity hardware, with a cluster of 48 Intel machines running the Linux operating system. Parallelism is exploited at different levels, using MPI for inter-node parallelism and OpenMP for intra-node parallelism, and then using the vector instructions of the Intel Xeon processors at the single thread level. The beam former is implemented with three threads per node and the output is formed by 32 dual polarized beams. The performance measurements provided in the paper refer to the whole software pipeline and cannot be used for comparison with our implementation. It is interesting, however, that they also propose the use of GPUs to accelerate the computation.
A different approach is the one of OSKAR. OSKAR, developed by the Oxford astrophysics and e-Research groups, is a research tool built to investigate the challenges of beam forming for the SKA radio telescope. It currently supports two different modes of execution: the simulation of the beam forming phase and the computation of different beam patterns. Its architecture is highly modular, with the two most important components being the front-end, used to manage the computations, and the back-end, where the simulations are run. The back-end is parallelized with MPI and runs on a cluster. More details, documentation, and the software itself are available on the project's website [19]. This is, however, an approach different from ours, and the two cannot really be compared.
To conclude this brief introduction to recent software beam formers, we present two GPU solutions. Differently from everything else presented so far, these two solutions are not intended for radio astronomy, but they are the only attempts we have found at implementing a beam former using a GPU, and are interesting for comparison. In the first work, [20], two general digital beam formers, one in the time domain and one in the frequency domain, are implemented using CUDA on an NVIDIA GeForce 8800 and then compared with the same algorithms implemented on an Intel Xeon CPU. In the experiments, the execution times achieved by the GPU implementations are always lower than the ones achieved by the CPU implementations, and the authors conclude that they see the use of GPUs as a viable solution for implementing digital beam formers.
The results collected in [21] are different. In that work an adaptive beam former for underwater acoustics is implemented with CUDA, using the same card as the previous work, the NVIDIA GeForce 8800. In this case the authors first tried to parallelize the whole algorithm on the GPU, but this was a performance failure. A subsequent hybrid attempt, using the GPU only to accelerate a part of the beam forming process, was successful, but the GPU implementation still took twice as long as a sequential C implementation. With these results, the author concludes that this is nevertheless a proof that a beam former can be parallelized on a GPU, and that the slowdown factors (mostly related to inefficient access patterns to the off-chip memory) have been identified, so that further improvements are possible.
This list of current approaches demonstrates that implementing a beam former in software is no longer a naive idea. At least two real telescopes, LOFAR and GMRT, are currently using real-time software pipelines, and the flexibility provided by a software solution will probably be exploited also in ambitious projects like the SKA. However, it is still not clear if a real beam former for radio astronomy can be efficiently implemented using a GPU, despite this solution being widely foreseen. In this work we answer this question using our own parallel GPU implementation of the LOFAR production code for beam forming.
Chapter 4
General Purpose computations on GPUs
A Graphical Processing Unit (GPU) is a specialized processor used by modern video cards to improve their performance by taking over part of the computationally intensive work from the CPU. The main reason to introduce GPUs was to increase rendering performance, mostly for animation and video games. When we talk about General Purpose computations on GPUs (GPGPU), a term introduced in 2002 by Mark Harris, we refer to the use of GPUs for general purpose computations, i.e. the execution of general purpose algorithms on GPUs instead of the classical graphics related computations.
In this chapter we present how a modern GPU works and the reasons behind the adoption of GPUs for general purpose computing. Moreover, we introduce the NVIDIA architecture and CUDA. Finally, we present an example of GPGPU by implementing a simple Red-Black SOR algorithm and measuring its performance, i.e. execution time and speed-up.
4.1 The GPU pipeline
Figure 4.1 shows the high-level organization of the hardware pipeline of a generic GPU, as presented in [3]. The hardware functionality [22] is straightforward: a set of geometries, i.e. vertices of geometrical shapes in a three dimensional space, is sent to the GPU, which eventually draws the corresponding image into the frame buffer; the image in the frame buffer memory is then shown on the screen. There are three main hardware components in the pipeline, corresponding to the main phases of the computation. First, the geometries are transformed by the vertex processor into two dimensional triangles. Next, the rasterizer generates a fragment for each pixel location covered by a triangle. Finally, the fragment processor computes the color of each fragment, producing an image, i.e. a set of pixels, in the frame buffer.
Figure 4.1: Hardware pipeline of a video card [3].
Initially, this pipeline was completely implemented in hardware, until industry proved, e.g. with Pixar's RenderMan [23], that a programmable pipeline could produce better results in terms of rendered images. To respond to this need, vendors transformed the classic pipeline into a flexible one, where the vertex and fragment processors execute user defined vertex and fragment programs.
The pipeline is intrinsically data parallel, i.e. each vertex or fragment can be computed in parallel with the others (and it actually is). The GPU with its pipeline can thus be seen as a stream computing processor [24]. In the stream computing paradigm we have streams, i.e. sequences, possibly infinite, of data elements, and kernels, i.e. functions applied to each element of a given stream. The mapping of the described pipeline onto the stream computing paradigm is straightforward: the input geometries are the stream, and the vertex and fragment programs applied in parallel to each element of the stream are the kernels. Exploiting the stream computing capabilities of modern programmable GPUs is what has made GPGPU possible.
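As a minimal illustration of this paradigm (our own sketch, not code from any GPU framework; the names are hypothetical), a kernel is just a function applied independently to every element of a stream; on a GPU the iterations of the loop below are executed in parallel by many processing elements instead of sequentially:

#include <stddef.h>

/* The kernel: a user-defined function applied to a single stream element. */
static float saturate(float element)
{
    return element > 1.0f ? 1.0f : element;
}

/* Applying the kernel to a (finite) stream: every iteration is independent,
 * which is what allows a GPU to execute them in parallel. */
static void runKernelOnStream(const float *in, float *out, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        out[i] = saturate(in[i]);
    }
}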
4.2 The reasons behind GPGPU
Are there real advantages in the use of GPUs for general purpose computing? To answer this question, we look at performance figures. In Figures 4.2 and 4.3, we can see a comparison of computing capabilities and memory bandwidth between Intel CPUs and NVIDIA GPUs [4].
Figure 4.2: Comparison between Intel CPUs and NVIDIA GPUs in terms of GFLOP/s, courtesy of NVIDIA [4].
Figure 4.3: Comparison between Intel CPUs and NVIDIA GPUs in terms of GB/s, courtesy of NVIDIA [4].
Device                        GFLOP/s (a)   Price         Power
Intel Core i7-920 CPU         89.6 (b)      $284.00 (c)   130 W
NVIDIA GeForce GTX 295 GPU    1788.48       $529.99       289 W

(a) Single precision.
(b) At 2.80 GHz.
(c) For Intel direct customers in bulk of 1000 units.

Table 4.1: Comparison between an Intel CPU and an NVIDIA GPU.
For both computational performance, measured in GFLOP/s, and memory bandwidth, measured in GB/s, the performance achieved by GPUs is higher in absolute terms. Moreover, as the gap is widening quite fast, GPGPU seems suitable for a large class of general purpose computations, mainly in the scientific field.
Performance is not the only advantage brought by GPUs. A GPU is also cheaper per FLOP than an ordinary CPU. If we compare two recent devices from Intel and NVIDIA in terms of GFLOP/s and price, as in Table 4.1, we see that the GFLOP/s per dollar ratios of the two devices are 0.31 and 3.37, respectively; thus, an NVIDIA GPU is about ten times cheaper per FLOP than an Intel CPU.
Moreover, in the field of scientific computing, one of the biggest problems in terms of cost is the power required by modern supercomputers. We can compare the same devices of Table 4.1 in terms of the GFLOP/s per Watt ratio: it is 0.68 for the Intel Core i7-920 CPU and 6.18 for the NVIDIA GeForce GTX 295 GPU. Also in this case the GPU is more efficient than the CPU.
However, the performance differences between CPUs and GPUs come from the fact that the latter are highly specialized and therefore less flexible. Having to design and produce processors specialized in applying the same function in parallel to many different data items made GPU producers focus more on increasing the arithmetic capabilities of their architectures than on control capabilities. This can be seen in the different organization of CPUs and GPUs, as presented in Figure 4.4. To compensate for the lack of control features, the role that a GPU typically has in GPGPU applications is that of a powerful accelerator, computing the massively data-parallel parts of algorithms, while the CPU deals with the sequential parts of the same general purpose computations. Thus, the two architectures will continue to coexist and complement each other. Moreover, modern many-core GPUs are an ideal testbed for a future scenario in which general purpose CPUs will also become many-core architectures.
Figure 4.4: The number of transistors devoted to different functions in CPUs and GPUs, courtesy of NVIDIA [4].
4.3 NVIDIA architecture
An important aspect of GPGPU is to understand how a GPU works. In the early days, it was impossible to write a general purpose algorithm without a complete understanding of the GPU pipeline, as introduced in Section 4.1: a problem needed to be transformed from its own domain to the graphics domain before it could be implemented on a GPU. Today, with the introduction of high-level abstractions and support for generic programming languages, it is no longer necessary to translate a problem into the graphics domain. However, it is still difficult to obtain good performance without knowledge of the underlying architecture.
Figure 4.5: NVIDIA Tesla GPU architecture [5].
Figure 4.5 shows the architecture of a modern NVIDIA Tesla GPU with 112 cores. It is interesting to see that, from a hardware point of view, the border between vertex and fragment processors is not visible anymore: the different steps of the pipeline are executed by a unified computing processor. However, from a logical point of view, the pipeline introduced in Section 4.1 is still valid.
As the hardware architecture is complex, we only provide a high-level overview here. Readers interested in more details on the NVIDIA GPU architectures can find a more in-depth description in [25]. In Figure 4.5 we can see that the GPU is composed of three layers: the command processing layer, made of the various input and distribution managers, the streaming processor array (SPA), and the memory management layer. We focus on the organization of the SPA, as the other two layers are of little interest for GPGPU programming per se.
The SPA is composed of a variable number of Texture/Processor Clusters (TPCs); the number of TPCs dictates the processing capabilities of the GPU itself. Each TPC contains two streaming multiprocessors and a texture unit. The streaming multiprocessor is the real computing core of the architecture: it contains a multithreaded instruction fetch and issue unit, eight streaming processors, two special-function units and 16 KB of read/write shared memory. Streaming multiprocessors are based on the Single Instruction Multiple Threads (SIMT) processor architecture, where the same instruction is applied to multiple threads in parallel. Threads are managed and executed by a streaming multiprocessor in groups of 32; a group of this type is called a warp. The main change introduced by NVIDIA in its new GPU architecture, called Fermi, is the availability of a shared L2 cache memory, as can be seen in Figure 4.6.
Figure 4.6: NVIDIA Fermi GPU architecture [6].
4.4 CUDA
CUDA is a general purpose parallel computing architecture developed by NVIDIA to help programmers use NVIDIA GPUs for GPGPU computing. Here we briefly introduce the CUDA programming model; more details are available in [4].
A CUDA kernel is a user defined function that is executed in parallel by different GPU threads. Each executing thread is identified by a three dimensional vector inside its block; the vector associates a position inside the block to each thread. Threads in the same block share communication and synchronization facilities, while threads from different blocks are, theoretically, completely independent. Blocks are also identified by a three dimensional vector that associates to each block a position inside the grid of thread blocks. This introduces a hierarchy for CUDA threads, in which we have a single grid containing multiple blocks, each of them containing multiple threads.
The memory in CUDA is also organized hierarchically. Each thread in a block has access to its private read/write local memory. All the threads inside a block share 16 KB of read/write shared memory. The shared memory has low latency, comparable to that of local registers, and acts like a user-programmable cache inside a block. Next in the hierarchy there is the GPU global memory, which is accessible to every thread in every block, and is also shared between different grids, i.e. different kernel executions. Two additional read-only memories, related to the global memory, are the constant and texture memories. These were the only cached memories inside the GPU before the introduction of the Fermi architecture. The GPU memory, called device memory, is physically separate from the CPU host memory, and most of the memory allocation and management has to be explicitly handled by the programmer.
4.5 An example: SOR
To introduce GPGPU programming techniques, optimizations and performance, we developed a parallel version of the SOR algorithm using CUDA and then compared it to both the sequential version and a parallel CPU-only version that uses POSIX threads.
SOR is a method for solving Laplace equations on a grid. The core of the algorithm is presented in Listing 4.1.
for (i = 1; i < N - 1; i++) {
    for (j = 1; j < N - 1; j++) {
        Gnew = (G[i-1][j] + G[i+1][j] + G[i][j-1] + G[i][j+1]) / 4.0;
        G[i][j] = G[i][j] + omega * (Gnew - G[i][j]);
    }
}

Listing 4.1: SOR algorithm in C
The algorithm is simple: each element of the matrix G, with the exception of the elements
at the border, is updated by adding to its current value the product of a given value,
omega, with the difference between the average value of the four direct neighbors (North,
East, South and West) and the value of the element itself. This process is iterated a
certain number of times until a convergence criterion is met.
To parallelize the algorithm in a shared memory model, using POSIX threads, the Red-Black strategy is used. For Red-Black SOR, the matrix is seen as a checkerboard and each iteration is split in two phases; each phase is associated with a color, and only the items with the same color as the phase are updated. The data distribution strategy used is row-wise, i.e. each thread receives a certain number of rows of the matrix and iterates the algorithm on them. Listing 4.2 shows how each thread updates its part of the matrix.
void *threadSolver() {
    for (phase = 0; phase < 2; phase++) {
        for (i = startRow; i < endRow; i++) {
            /* Only elements in the current phase are updated */
            for (j = 1 + (even(i) ^ phase); j < N - 1; j += 2) {
                Gnew = (G[i-1][j] + G[i+1][j] + G[i][j-1] + G[i][j+1]) / 4.0;
                G[i][j] = G[i][j] + omega * (Gnew - G[i][j]);
            }
        }
        pthread_barrier_wait(&barrier);
    }
}

Listing 4.2: Red-Black SOR algorithm in C
Assuming that we do not take into account the convergence criterion, the differences between the sequential and parallel versions of the code are negligible: instead of sequentially updating each element of the matrix, each thread updates a certain number of rows, in the interval delimited by the startRow and endRow variables, in two phases. A synchronization point is necessary after each phase to avoid the situation in which some threads start phase one while others are still in phase zero, which would invalidate the Red-Black strategy.
The CUDA version, however, introduces several visible changes. In fact, we implemented different versions of SOR with CUDA, demonstrating different optimizations. The data distribution strategy used in all the CUDA versions differs from the POSIX threads implementation: each CUDA thread has a single matrix cell to update (as opposed to the block distribution used by the POSIX version). This can be seen as a special form of block-wise data distribution, where the dimension of each block is 1 × 1. Both the CPU and the GPU participate in solving the problem: the CPU prepares the memory for the computation, manages the phases and eventually gets the results back from the device, while the role of the GPU is to update the matrix.
Listing 4.3 shows the work of the CPU in version A of our GPGPU SOR.
/* Allocate the matrix on the device */
/* Copy the matrix to the device */
/* Set the thread block dimensions */
for ( i = 0; i < iterations; i++ ) {
  for ( phase = 0; phase < 2; phase++ ) {
    solver<<< blockSize, THREAD_N >>>( devG, pitch, omega, phase, N );
    cudaThreadSynchronize();
  }
}
/* Check if all CUDA kernel invocations returned without errors */
/* Copy the result matrix from device to host memory for printing or further processing */

Listing 4.3: CPU work in CUDA SOR A
The CPU is used to manage the work. It allocates the memory on the device, copies the
data, and then iteratively calls the CUDA kernel. Finally, it copies the modified matrix
back from the device to the host’s memory.
Listing 4.4 presents the implementation of the CUDA kernel developed for version A of
our Red-Black SOR.
__global__ void solver () {
  if ( phase == 0 ) {
    /* In phase 0 only cells with both even, or both odd, coordinates are updated */
    if ( ( even( i ) && even( j ) ) || ( odd( i ) && odd( j ) ) ) {
      item = *((float *) ((char *) G + i * pitch) + j);
      /* Load stencilValues[0..3] from device memory */
      /* Threads "outside" of the matrix are not used */
      if ( j < N - 1 && i < N - 1 ) {
        Gnew = ( stencilValues[0] + stencilValues[1]
               + stencilValues[2] + stencilValues[3] ) / 4.0;
        itemPointer = (float *) ((char *) G + i * pitch) + j;
        *itemPointer = item + omega * ( Gnew - item );
      }
    }
  }
  else {
    /* In phase 1 only cells with mixed even and odd coordinates are updated */
    if ( ( even( i ) && odd( j ) ) || ( odd( i ) && even( j ) ) ) {
      /* The code is the same as in phase 0 */
    }
  }
}

Listing 4.4: Kernel in CUDA SOR A
Each thread executes a copy of this kernel on a single element of the matrix. The
position of the element to operate on inside the matrix is found using the position of
the thread inside the block and of the block inside the grid. Other than that, the thread
only checks if it’s in the correct phase for updating its element or not. Note that there
is no synchronization between different blocks, nor between different invocations of the
kernel.
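For illustration, the index computation the paragraph refers to could look roughly as follows (a sketch; the exact grid layout used by our kernels may differ):

/* Sketch: each thread derives the matrix element it owns from its position
   inside the block and the position of the block inside the grid; the +1
   skips the fixed border of the matrix. The exact grid layout used by our
   kernels may differ. */
unsigned int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
unsigned int j = blockIdx.x * blockDim.x + threadIdx.x + 1;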
Version B of the CUDA implementation makes use of the shared memory introduced in Section 4.4. The CPU code is almost the same; the only difference is that the kernel invocation dynamically allocates enough shared memory for each block.
Listing 4.5 shows the differences in the kernel code between version A and B.
__global__ void solver () {
  /* Load 3 column elements from the matrix to shared memory */
  rowU[rowId] = *((float *) ((char *) G + (i - 1) * pitch) + j);
  rowM[rowId] = *((float *) ((char *) G + i * pitch) + j);
  rowD[rowId] = *((float *) ((char *) G + (i + 1) * pitch) + j);
  /* Threads "outside" the matrix are not used */
  if ( i < N - 1 && j < N - 1 ) {
    if ( threadIdx.x == 0 ) {
      /* Load from the matrix the first and last element of each row segment */
    }
    /* Assuring that all the values are loaded into memory */
    __syncthreads();
    if ( phase == 0 ) {
      /* In phase 0 only cells with both even, or both odd, coordinates are updated */
      if ( ( even( i ) && even( j ) ) || ( odd( i ) && odd( j ) ) ) {
        Gnew = ( rowU[rowId] + rowM[rowId - 1]
               + rowM[rowId + 1] + rowD[rowId] ) / 4.0;
        rowM[rowId] = rowM[rowId] + omega * ( Gnew - rowM[rowId] );
      }
    }
    else {
      /* In phase 1 only cells with mixed even and odd coordinates are updated */
      if ( ( even( i ) && odd( j ) ) || ( odd( i ) && even( j ) ) ) {
        /* The code is the same as in phase 0 */
      }
    }
    *((float *) ((char *) G + i * pitch) + j) = rowM[rowId];
  }
}

Listing 4.5: Kernel in CUDA SOR B
Shared memory has been introduced to improve the performance of version A, because most of the elements in the rows updated by a thread block are accessed by more than one thread. Thus, we decided to use the shared memory of each block to store the three partial rows. However, the experiments showed that in version B the coalesced memory access [26] was almost completely lost. Memory access coalescing means that the memory accesses of the threads of a warp are combined into a single read or write, provided that some alignment requirements are satisfied. On NVIDIA GPUs, access to memory is significantly slower than computation, so coalesced access to global memory is very important for performance.
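The difference can be illustrated with a schematic example (not taken from our SOR code):

/* Schematic example, not taken from our SOR code: in the first kernel
   consecutive threads of a warp read consecutive floats, so the hardware can
   combine the warp's loads into few transactions; in the second, the stride
   of two breaks this pattern and more transactions are needed. */
__global__ void coalescedRead ( const float *in, float *out ) {
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[tid];        /* coalesced: thread k reads element k */
}

__global__ void stridedRead ( const float *in, float *out ) {
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[2 * tid];    /* strided: the warp's reads are spread out */
}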
Version C corrects this suboptimal access pattern to memory. The CPU code doesn’t
change between versions B and C, while the only two differences between the kernels of
version B and C are presented in Listings 4.6 and 4.7.
rowU[rowId - 1] = *((float *) ((char *) G + (i - 1) * pitch) + (j - 1));
rowM[rowId - 1] = *((float *) ((char *) G + i * pitch) + (j - 1));
rowD[rowId - 1] = *((float *) ((char *) G + (i + 1) * pitch) + (j - 1));

Listing 4.6: Differences between CUDA SOR B and C
if ( threadIdx.x == 0 ) {
  /* Load from the matrix the last two elements of each row segment */
}

Listing 4.7: Differences between CUDA SOR B and C
The code in Listing 4.6 changes the memory access pattern, thus reintroducing coalescing: the reads are aligned to the memory addressing boundaries and the gap that thread 0 was creating in Listing 4.5 is closed. Moreover, Listing 4.7 shows that, instead of having to load the first and last elements of the block under update as in Listing 4.5, in version C the first thread of the block has to load the last two elements. Changing the memory access pattern to fulfill the alignment requirements of the platform made it possible to improve on the performance of version A, as we show in Section 4.5.1, with the additional gain given by the use of shared memory. Shared memory is not a breakthrough for performance in this case, because the level of data reuse in the SOR algorithm is low. It is, however, a good practice when using CUDA, so we wanted to implement and test it.
The last SOR implementation is version D. In this version, the memory access is changed again by using the texture memory introduced in Section 4.4. One of the advantages of texture memory is that it is cached. The CPU code has been modified to bind and unbind the texture area to the memory allocated on the device. Note that the number of thread blocks has decreased slightly, but two new threads are added to each thread block; they are used to simplify the access to texture memory. The new kernel code is presented in Listing 4.8.
__global__ void solver () {
  valueU = tex2D( rowCache, j, i - 1 );
  row[threadIdx.x] = tex2D( rowCache, j, i );
  valueD = tex2D( rowCache, j, i + 1 );
  __syncthreads();
  /* Threads "outside" of the matrix and the first and last of each block are not used */
  if ( ( i < N - 1 && j < N - 1 ) && ( threadIdx.x != 0 && threadIdx.x != blockDim.x - 1 ) ) {
    if ( phase == 0 ) {
      /* In phase 0 only cells with both even, or both odd, coordinates are updated */
      if ( ( even( i ) && even( j ) ) || ( odd( i ) && odd( j ) ) ) {
        Gnew = ( valueU + row[threadIdx.x - 1] + row[threadIdx.x + 1] + valueD ) / 4.0;
        row[threadIdx.x] = row[threadIdx.x] + omega * ( Gnew - row[threadIdx.x] );
      }
    }
    else {
      /* In phase 1 only cells with mixed even and odd coordinates are updated */
      if ( ( even( i ) && odd( j ) ) || ( odd( i ) && even( j ) ) ) {
        Gnew = ( valueU + row[threadIdx.x - 1] + row[threadIdx.x + 1] + valueD ) / 4.0;
        row[threadIdx.x] = row[threadIdx.x] + omega * ( Gnew - row[threadIdx.x] );
      }
    }
    *((float *) ((char *) G + i * pitch) + j) = row[threadIdx.x];
  }
}

Listing 4.8: Kernel in CUDA SOR D
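The host-side texture handling mentioned above is not shown in the listing; with the texture reference API of the CUDA 3.x era it could look roughly like this (a sketch: rowCache and devG follow the names used in the listings, the exact calls in the original code may differ):

/* Sketch of the host-side texture management for version D, using the
   (legacy) texture reference API of the CUDA 3.x era. rowCache and devG
   follow the names used in the listings; the actual code may differ. */
texture<float, 2, cudaReadModeElementType> rowCache;   /* declared at file scope */

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTexture2D( NULL, rowCache, devG, desc, N, N, pitch );

for ( int i = 0; i < iterations; i++ ) {
  for ( int phase = 0; phase < 2; phase++ ) {
    solver<<< blockSize, THREAD_N >>>( devG, pitch, omega, phase, N );
    cudaThreadSynchronize();
  }
}

cudaUnbindTexture( rowCache );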
Although version D of the kernel uses different types of memory, i.e. shared and texture memory, and preserves coalesced memory access, its code is the shortest and most readable of all versions, thanks to the optimizations and code reorganization.
4.5.1 Performance
After introducing the algorithm and the code, we show the execution time and the
achieved speed-up on a real case, to verify the performance that GPGPU can provide.
The parameter that is varied in the experiment is the number of threads per block. The
dimension of the matrix has been fixed at 8000 × 8000: small enough to be sure that it fits into the device memory, and large enough to exceed the CPU cache. The platform we used is equipped with two Intel Xeon E5320 CPUs, for a total of 8 computing cores, and 8 GB of RAM. The video card is an NVIDIA GeForce 8800 GTX with 16 streaming multiprocessors, for a total of 128 cores, and 767 MB of global memory.
Figure 4.7 presents a comparison of the execution time of the various versions of the
SOR algorithm developed for the GPU, in relation with the execution time of both the
sequential and parallel CPU-only versions. Figure 4.8 shows the speed-up of the different
versions of the CUDA implementations, compared with the speed-up achieved by the
parallel CPU-only version.
Figure 4.7: SOR execution time (lower is better)
Figure 4.8: SOR speed-up (higher is better)
From Figure 4.7 we can see that all CUDA implementations have a better execution time than the sequential version. Moreover, the execution times of all CUDA implementations are also better than that of the parallel CPU-only version. This behavior is clearly visible in Figure 4.8, which shows the speed-up achieved by all the parallel versions of the algorithm relative to the sequential implementation.
Looking at our best CUDA implementation (version D), we see that it is possible, even in a simple example like the SOR algorithm, to achieve significantly better performance with GPGPU than with a CPU-only implementation: we obtained an execution time of 4 seconds, compared to 44 and 22 seconds for the sequential and CPU-only parallel versions respectively, thus achieving a speed-up factor of 11. Moreover, the high-level programming capabilities offered by a framework like CUDA make it possible to write code that is readable and similar to what we would expect from CPU-only code. However, as we found out in this example, a deep understanding of the device, especially of its memory organization, is necessary for performance. Coalesced access to memory is extremely important on NVIDIA devices and can be a performance breakthrough. Furthermore, the use of shared and texture memory is an unavoidable optimization strategy for this pre-Fermi GPU, resulting in another performance improvement.
Chapter 5
Application analysis
In this chapter we describe the beam forming application by first introducing the data
structures used to represent the input and the output, in Section 5.1, then describing a
sequential version of the algorithm, in Section 5.2, and finally, in Section 5.3, providing
the strategies followed to parallelize the beam former on a GPU.
5.1 Data structures
The beam forming algorithm has two input data structures: the samples and the metadata. The samples are essentially the values measured by the stations, while the metadata are the delays that have to be applied to the samples to form a beam. The only output data structure contains, for each formed beam, the merged samples of all the stations. In our parallel implementation we use the same data structures that are used by the ASTRON code for the LOFAR beam former, without any change. The source code of these data structures is available in Appendix F.
The class representing the input samples is called SampleData (Section F.1). Apart from its own logic, used to implement operations like reading from and writing to permanent storage, or memory allocation, it contains two internal structures. The first of them is a four dimensional array, named samples, containing the complex values representing the measured signals. The dimensions of the array are, in order, the channels, the stations, the time intervals in which a second is divided, and the measured polarizations. This multidimensional array is allocated as a contiguous memory segment that can be accessed as one big linear array; this property is important for the parallel implementation on the GPU because it permits a single, fast memory transfer between the host and the device. Otherwise, some intermediate step would have been necessary to transform the data structure into one that could be handled by the GPU. The second internal data structure is a vector of sparse sets, named flags, with a sparse set for each station indicating the intervals of samples that are flagged. A measured sample is flagged if it contains some sort of error, in which case it is simply excluded from the computation and the corresponding output is set to zero.
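Because the array is contiguous, an element can be addressed with a single linear index. Assuming the dimensions are laid out in the order listed above, a sketch of the mapping is:

/* Sketch: linear index of samples[channel][station][time][pol] when the four
   dimensions are stored contiguously in the order listed above. The variable
   names are illustrative. */
unsigned int index = (( channel * nrStations + station ) * nrSamples + time )
                     * nrPolarizations + pol;
fcomplex value = samples[index];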
The class representing the metadata is called SubbandMetaData (Section F.2). Besides its own logic (in which we are not interested for this work), it contains an array of arrays for each station. These second-level arrays contain, for each beam that has to be formed, the delays that should be applied to that station's samples to form the given beam. Finally, the class representing the output values, i.e. the formed beams, is called BeamFormedData (Section F.3), and it is derived from the same parent as SampleData. The only change in this output data structure (compared with the input one) is the dimensions of the multidimensional array, which in this case are, in order, the beams, the channels, the time intervals in which a second is divided and the measured polarizations. Everything else said about SampleData also holds for BeamFormedData.
5.2 The beam forming algorithm
The reference sequential version of the beam forming algorithm has been derived from the ASTRON C++ code. The algorithm is divided into three phases:
1. Delays computation;
2. Flags computation;
3. Beams computation.
We will now discuss these steps in more detail.
5.2.1 Delays computation
The delays computation phase combines two delay values, one at the beginning of the measurement and one at the end, that are provided to the algorithm via a SubbandMetaData object, and stores the result as a single double precision floating point value for each station-beam combination, as can be seen in Listing 5.1. The computed delays are stored in a matrix, delays, which is kept in the BeamFormer object.
for ( unsigned int station = 0; station < nrStations; station++ ) {
  double compensatedDelay = ( metaData.beams( station )[0].delayAfterEnd
                            + metaData.beams( station )[0].delayAtBegin ) * 0.5;
  delays[station][0] = 0.0;
  for ( unsigned int beam = 1; beam < nrBeams; beam++ ) {
    delays[station][beam] = (( metaData.beams( station )[beam].delayAfterEnd
                             + metaData.beams( station )[beam].delayAtBegin ) * 0.5) - compensatedDelay;
  }
}

Listing 5.1: Phase 1: delay computation
It is not necessary to compute the delay for the central beam, i.e. beam number 0, of each station, because it is assumed that the input provided to the algorithm has already been compensated for it. It is also important to note that, when computing the beam forming for many different input samples with the same metadata, it is enough to perform this phase once, at the first iteration. This is the case for all the observations performed with LOFAR.
5.2.2 Flags computation
The goal of the flags computation phase is to discard the stations with too much flagged data, i.e. stations containing too many errors, to avoid polluting the formed beams with incorrect measurements. The code of this phase can be seen in Listing 5.2.
nrValidStations = 0;
for ( unsigned int station = 0; station < nrStations; station++ ) {
  if ( isValid( station ) ) {
    isValidStation[station] = true;
    nrValidStations++;
  }
  else {
    isValidStation[station] = false;
  }
}
for ( unsigned int beam = 0; beam < nrBeams; beam++ ) {
  outputData.flags[beam].reset();
  for ( unsigned int station = 0; station < nrStations; station++ ) {
    if ( isValidStation[station] ) {
      outputData.flags[beam] |= inputData.flags[station];
    }
  }
}

Listing 5.2: Phase 2: flags computation
The flags computation has two loops. In the first one, each station is checked to see whether it is valid or not. A station is valid if the percentage of its samples that are flagged does not exceed a certain upper bound (defined elsewhere in the code). The number of valid stations is saved and an array of boolean values is populated, to provide a faster check on the validity of a given station. The second loop sets the flags of the output data. The flagging policy is straightforward: if an input sample is flagged, even for just one of the input stations, then the corresponding output sample is flagged too. Invalid stations are excluded because their values are not used further in the computation (i.e., they do not affect the output).
5.2.3 Beams computation
The beams computation is the core of the beam forming algorithm: it computes the
different beams that are obtained by merging the samples from all the stations. The
code is provided in Listing 5.3. The phaseShift function used in the code is shown in
Listing 5.4.
double averagingFactor = 1.0 / nrValidStations;
for ( unsigned int beam = 0; beam < nrBeams; beam++ ) {
  for ( unsigned int channel = 0; channel < nrChannels; channel++ ) {
    double frequency = baseFrequency + channel * channelBandwidth;
    for ( unsigned int time = 0; time < nrSamples; time++ ) {
      if ( ! outputData.flags[beam].test( time ) ) { // valid sample
        for ( unsigned int pol = 0; pol < nrPolarizations; pol++ ) {
          outputData.samples[beam][channel][time][pol] = makefcomplex( 0, 0 );
          for ( unsigned int station = 0; station < nrStations; station++ ) {
            if ( isValidStation[station] ) {
              fcomplex shift = phaseShift( frequency, delays[station][beam] );
              outputData.samples[beam][channel][time][pol] +=
                  inputData.samples[channel][station][time][pol] * shift;
            }
          }
          outputData.samples[beam][channel][time][pol] *= averagingFactor;
        }
      }
      else { // flagged sample
        for ( unsigned int pol = 0; pol < nrPolarizations; pol++ ) {
          outputData.samples[beam][channel][time][pol] = makefcomplex( 0, 0 );
        }
      }
    }
  }
}

Listing 5.3: Phase 3: beams computation
fcomplex phaseShift ( double frequency, double delay ) {
  double phaseShift = delay * frequency;
  double phi = -2 * M_PI * phaseShift;
  return makefcomplex( cos( phi ), sin( phi ) );
}

Listing 5.4: phaseShift function
The outer loop is performed for each beam that has to be formed by the algorithm. For all the channels and the time samples that are not flagged, the samples of all the valid stations are merged, for all the measured polarizations. The phaseShift function, given the frequency and the delay, provides the complex shift that each sample from each valid station has to be multiplied with. The sum of all the shifted samples is then multiplied with an averaging factor. In case a time sample is flagged, the output value is simply set to zero.
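In formula form, for an unflagged time sample the computation performed by Listings 5.3 and 5.4 can be written as

\[
\mathrm{out}[b][c][t][p] \;=\; \frac{1}{N_{\mathrm{valid}}} \sum_{s \,\in\, \mathrm{valid\ stations}} \mathrm{in}[c][s][t][p] \cdot e^{-2\pi i \, f_c \, \tau_{s,b}} ,
\]

where \(f_c = \mathrm{baseFrequency} + c \cdot \mathrm{channelBandwidth}\) is the frequency of channel \(c\), \(\tau_{s,b}\) is the delay computed in the first phase for station \(s\) and beam \(b\), and \(N_{\mathrm{valid}}\) is the number of valid stations; the phaseShift function returns exactly \(e^{-2\pi i f \tau} = \cos(-2\pi f\tau) + i\,\sin(-2\pi f\tau)\).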
5.3 Parallelization strategies
The sequential algorithm described in the previous sections can be parallelized with a data-parallel strategy, i.e. it is possible to perform the same operation on different items at the same time. The sequential algorithm is composed of three interdependent steps, so a task-parallel strategy does not seem suitable for this parallelization. We will now analyze how the three phases may be parallelized. The first phase of the beam former, the delays computation (Section 5.2.1), can be computed independently for each station-beam pair; the same seems to be true for the second phase, the flags computation (Section 5.2.2). However, what appears simple at first glance is not so after a deeper analysis.
In the delays computation phase, several double precision floating point values are summed and multiplied for each station-beam pair; afterwards, the computed values are stored for later reuse. The data structures involved are nothing more than simple arrays. So, to parallelize this phase on a GPU, we just need to copy the input arrays into the video card's memory, perform the arithmetic operations in parallel, e.g. using a different thread for each station-beam pair, and then copy the results back to the main memory. Moreover, because the computed values are used again in the third phase, it should be possible to compute them on the GPU and leave them there, reducing the number of necessary memory transfers.
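A minimal sketch of such a kernel, with one thread per station-beam pair and the metadata already flattened into device arrays, could be (all names are illustrative and do not come from our implementations):

/* Sketch: delays computation with one thread per station-beam pair.
   delayAtBegin, delayAfterEnd and delays are flat arrays of size
   nrStations * nrBeams, already resident in device memory; all names are
   illustrative. Launched, for example, with nrStations blocks of nrBeams
   threads. */
__global__ void computeDelays ( const double *delayAtBegin, const double *delayAfterEnd,
                                double *delays, unsigned int nrBeams ) {
  unsigned int station = blockIdx.x;
  unsigned int beam = threadIdx.x;
  unsigned int idx = station * nrBeams + beam;

  /* Delay of the central beam (beam 0), already compensated for in the input */
  double compensatedDelay = ( delayAtBegin[station * nrBeams]
                            + delayAfterEnd[station * nrBeams] ) * 0.5;

  if ( beam == 0 ) {
    delays[idx] = 0.0;
  }
  else {
    delays[idx] = (( delayAtBegin[idx] + delayAfterEnd[idx] ) * 0.5) - compensatedDelay;
  }
}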
If we look, instead, at the flags computation phase, we see that it performs no arithmetic, but only checks on the quality of the data, and these checks are implemented with special functions and operators, defined on data structures more complex than the previously mentioned arrays. To parallelize this phase on a GPU we would need to modify the input and output data structures or, in order to avoid modifications to legacy code, write wrappers and new intermediate data structures compatible with the GPU architecture. Introducing such a compatibility layer would certainly degrade performance. Thus, we decided not to parallelize this phase on the GPU, and to leave its execution on the CPU.
More interesting is the parallelization of the third phase, the beams computation (Section 5.2.3). First of all, from the sequential algorithm it is possible to see that all the operations involved are arithmetic, except the check that determines whether a given sample from a station is correct or not. This check, however, can be skipped with minor modifications to the algorithm (i.e. replacing the incorrect samples with zeros in the previous phase). Therefore, this phase is a good candidate for being parallelized on a GPU.
We think, moreover, that the parallelization of this phase can bring a major improvement to the algorithm's performance. From the sequential code we can see that each combination of beam, channel, time sample and polarization is independent from the others, and can be computed in parallel without interdependencies. A possible parallelization strategy is, then, to assign a different combination of these parameters to each thread, with the thread multiplying and summing the values for all the stations. Or, to avoid synchronization issues, the final sum of all the shifted samples can be left out of this phase and parallelized later, with another kernel. In fact, we can obtain different strategies just by organizing the matching between the threads and the data in a different way. What is certainly important is the memory access pattern, and the level of data reuse that can be achieved with the different strategies. We know that data reuse is important because, to form a beam, we need the samples corresponding to a certain channel, time and polarization from all the stations. But we need the same samples to form all the beams, not just one of them; what changes between different beams is only the delay computed during the first phase. Just as the organization of the matching between threads and data can lead to different parallelization strategies, so can different schemes for sample reuse.
We can also see in the sequential algorithm that the complex shift that each sample needs to be multiplied with depends only on three of the parameters: channel, station and beam. It should be possible to extract this step from the main algorithm, parallelize it, and simply store the results in the video card's memory for further accesses. Moreover, another strategy can be tried, in which this shift computation is moved back in the pipeline and merged with the delays computation phase: this results in a new phase, that we can call shift computation, that needs to be performed on the GPU just once per computation. This reorganization of the algorithm also simplifies the operations of the third phase, and reduces the number of calls to costly functions like cos and sin.
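As an illustration of the proposed shift computation phase, a kernel precomputing one complex weight per channel-station-beam combination could look roughly like this (a sketch with hypothetical names, not the code of the implementations in Chapter 6):

/* Sketch: precompute one complex weight per (channel, station, beam)
   combination, so that the beams computation phase only needs single
   precision multiply-adds. All names are illustrative. Launched, for
   example, with a (nrChannels, nrStations) grid and nrBeams threads
   per block. */
__global__ void computeWeights ( const double *delays, float2 *weights,
                                 double baseFrequency, double channelBandwidth,
                                 unsigned int nrBeams ) {
  const double PI = 3.141592653589793;
  unsigned int channel = blockIdx.x;
  unsigned int station = blockIdx.y;
  unsigned int beam = threadIdx.x;

  double frequency = baseFrequency + channel * channelBandwidth;
  double phi = -2.0 * PI * delays[station * nrBeams + beam] * frequency;

  unsigned int idx = (channel * gridDim.y + station) * nrBeams + beam;
  weights[idx] = make_float2( (float) cos( phi ), (float) sin( phi ) );
}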
We can conclude that the beam forming algorithm is a good candidate for being parallelized on GPUs. However, achieving good performance will not be trivial. We have to implement, test, benchmark and tune multiple parallelization strategies. As we believe that the bottleneck will be the video card's memory, we focus on those strategies that maximize the data reuse between different threads and provide coalesced access to memory.
Chapter 6
CUDA BeamFormer
In this chapter we present six different versions of the beam forming algorithm, all developed with CUDA. These versions are presented in the context of an experiment, set up as described in Section 6.1. The experiments are designed to test performance and to analyze which parallelization strategies best suit the beam forming process on modern NVIDIA GPUs. A comparison between all six versions is provided in Section 6.8, together with our conclusions.
6.1 Experimental setup
We performed a series of experiments to understand how our six implementations of the beam former scale and which parallelization strategies work best on a modern NVIDIA GPU. Each of the six developed beam forming algorithm versions is described in detail in one of the following sections. The operational intensity and the number of registers used by each of them are presented in Table 6.1. The optimization strategy implemented by each version, as well as the code differences between them, are summarized in Table 6.2.
The experiments are performed by running a single execution of each developed version and varying two of the input parameters: the number of stations to merge to form a single beam, and the number of beams to form. Both input parameters vary, independently, over powers of 2, with the number of stations ranging between 2¹ and 2⁸ and the number of beams between 2¹ and 2⁹.
Kernel                     Operational intensity                                                         Registers
1.0.2 (a)                  0,65                                                                          29
1.0.2 (b)                  (3 + 2*log2(#stations)) / 16                                                  25
1.1 / 1.1.1 (c)            0,3                                                                           17
1.1_2x2 / 1.1.1_2x2 (d)    0,41                                                                          23
1.2 (e)                    (9 + 16*#stations per block) / (32 + 24*#stations per block)                  [25, 36]
1.3 (f)                    (9 + 16*#stations per block) / (40 + 16*#stations per block)                  26
1.4 (g)                    (1 + #beams per block*(8 + 2*log2(#stations))) / (8 + 8*#beams per block)     18
1.5 (h)                    (16*#stations per block*#beams per block + 9*#beams per block) /
                           (#stations per block*(16 + 8*#beams per block) + 32*#beams per block)         [29, 63]

(a) For the samples computation with mixed single and double precision operations
(b) For the samples addition and for thread #0
(c) For thread #0 computing the last station
(d) For thread #0 computing the last station
(e) For the computation of the last station
(f) For the computation of the last station
(g) For thread #0
(h) For the computation of the last station

Table 6.1: Operational intensity and registers used by each kernel
Version   Optimization strategy                                                    Code differences with previous version
1.0       No optimizations, it follows the ASTRON algorithm structure.             -
1.1       Separation of single and double precision floating point operations,    Single kernel, separate shift
          avoid temporary memory buffer.                                           computation phase.
1.2       Computation of more beams per iteration.                                 Introduction of the station-beam block.
1.3       Coalesced access to memory.                                              Complete rewriting.
1.4       Coalesced access to memory.                                              Complete rewriting.
1.5       Avoid idle threads, coalesced access to memory, improved data reuse      Complete rewriting.
          per thread block.

Table 6.2: Algorithms' optimization strategies and code differences.
The other parameters, i.e. the number of channels, time samples and polarizations, are kept constant, with values of 256, 768 and 2 respectively. This is not an unrealistic restriction, because in the production environment the numbers of stations and beams are the parameters most likely to change, and the values chosen are representative of the LOFAR scenarios. There is no flagged data in the input, so all the data is used in the computations.
In each experiment we measure the total execution time and the time taken only by the kernels running on the GPU; the former is used to measure how the different algorithms scale, the latter to derive two other performance metrics: the number of single precision floating point operations per second, measured in GFLOP/s, and the achieved memory bandwidth, measured in GB/s. These two metrics are used to compare the different versions and to measure the hardware utilization.
The machine used for the experiments has one Intel Core i7-920 CPU, 6 GB of RAM
and a NVIDIA GeForce GTX 480 video card. The GeForce GTX 480 uses the NVIDIA
GF100 GPU, with 480 computational cores, that provides a theoretical peak performance
of 1344,96 GFLOP/s and can sustain a memory bandwidth of 177,4 GB/s accessing its
on-board 1536 MB of RAM. The machine’s operating system is Ubuntu Linux 9.10. The
host code is compiled with g++ version 4.4.1 and the device code is compiled with nvcc
version 0.2.1221; we use CUDA version 3.1.
6.2 BeamFormer 1.0
BeamFormer 1.0 strictly follows, in its structure, the ASTRON C++ implementation presented in Section 5.2; of the three computational phases composing the algorithm, only the third one (the beam forming phase) is parallelized taking advantage of the GPU. Two different kernels are used to implement this phase: the first one computes the weighted samples for each station and stores them in a temporary buffer in device memory, while the second one sums all these samples to form the beam. The first kernel is executed once for each beam to form and the second one twice, once for each polarization. The structure of the CUDA grid is the same for both kernels, with a thread block created for each channel-time pair; the block structure is different, and this is the reason for the different number of times the kernels need to be executed to form a beam: the first kernel has a thread for each station-polarization pair, while the second has only a thread for each station. The use of a temporary buffer to store the beam samples before merging them limits the number of instances that can be computed: this version can compute at most 256 beams and merge 128 stations.
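Schematically, the host-side driver of version 1.0 could be organized as follows (kernel and variable names are illustrative; the real launch parameters are derived from the input instance):

/* Sketch of the launch structure described above: one thread block per
   channel-time pair; the first kernel has a thread per station-polarization
   pair and is executed once per beam, the second has a thread per station
   and is executed once per polarization. Kernel and variable names are
   illustrative. */
dim3 grid( nrChannels, nrSamples );

for ( unsigned int beam = 0; beam < nrBeams; beam++ ) {
  dim3 block( nrStations, nrPolarizations );
  weighSamples<<< grid, block >>>( devSamples, devWeights, devTemp, beam );
}
for ( unsigned int pol = 0; pol < nrPolarizations; pol++ ) {
  dim3 block( nrStations );
  sumSamples<<< grid, block >>>( devTemp, devBeams, pol );
}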
Version 1.0 consists of one implementation. The execution times measured for this version are presented in Table A.1. The data shows that the algorithm scales linearly. Of the other two performance metrics, only the achieved bandwidth is presented in Table C.1, because version 1.0 mixes double and single precision floating point instructions (making it impossible to compute an accurate GFLOP/s value). The memory bandwidth follows an expected trend: it is stable when varying the beams for a fixed number of stations to merge, and scales linearly when varying the stations to merge for a fixed number of beams to form, because more threads are run when the number of stations to merge is increased. The values that are higher than the card's maximum memory bandwidth are due to the caching system.
Although the BeamFormer 1.0 scales linearly, the execution times are still far from
what we expect to achieve parallelizing the beam forming algorithm on a GPU. So far,
following the structure of the sequential code does not look like a good parallelization
strategy.
6.3 BeamFormer 1.1
The idea behind BeamFormer 1.1 is to separate double and single precision operations and to avoid the use of a temporary buffer for partial results. Separating single and double precision floating point operations allows us to compute the achieved GFLOP/s, and thus to have a better understanding of the performance of our beam formers. Moreover, using a temporary buffer for partial results increases the number of accesses to the video card's global memory, which is expensive, and also increases the amount of memory that needs to be allocated, thus reducing the number of computable instances.
To achieve these goals, the structure of the algorithm has been modified. The code is still structured in three phases, but the first and the third phases are changed compared to the sequential code. In the first phase, instead of computing the delays, the complex weights needed for the beam forming are computed and stored permanently on the GPU. A value is computed for each channel-station-beam combination, and is used later in the third phase of the algorithm; these values are computed only once, in the first execution of the BeamFormer, and can be reused later to compute more beams (which is the normal case in the production environment). Double precision floating point operations are only necessary in this phase, as the beam forming phase works only with single precision operations.
For the BeamFormer 1.1 family, four different implementations are used in the experiments: 1.1, 1.1 2x2, 1.1.1 and 1.1.1 2x2. They differ only in the third phase of the algorithm, which is implemented by a single kernel. Implementation 1.1 allocates enough memory on the device to store all the formed beams, while implementation 1.1.1 only allocates memory to store the beams computed in a single kernel invocation, thus making it possible to solve instances covering the whole input space of the experiments, at the expense of more memory transfers. The CUDA grid is structured such that a thread block is used for each channel-time pair and, inside each block, a thread is created for each beam to form in a single kernel invocation: that is, one beam for implementations 1.1 and 1.1.1 and two for implementations 1.1 2x2 and 1.1.1 2x2. The execution times measured for the four implementations are presented in Tables A.2 to A.5 (Appendix A), and they show that BeamFormer 1.1 scales linearly. The achieved memory bandwidth, presented in Tables C.2-C.5 (Appendix C), is almost the same between implementations 1.1 and 1.1.1 and between implementations 1.1 2x2 and 1.1.1 2x2, and scales linearly when increasing the number of beams computed in a single iteration; the same behavior is shown by the achieved GFLOP/s in Tables B.1-B.4 (Appendix B).
The performance of version 1.1 is low because the small number of threads per block (two in the best case) means that each streaming multiprocessor gets fewer threads than it has cores, so the GPU is heavily underutilized.
6.4 BeamFormer 1.2
In order to increase the number of threads per block, version 1.1 is modified to compute more beams per kernel execution. This modification is implemented in BeamFormer 1.2 and is based on the concept of the station-beam block: a block of N×M indicates that within a single kernel execution we are merging N stations into M beams. How many times a kernel needs to be executed to solve an input instance depends on the input and on the station-beam block dimensions.
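Schematically, with an N × M station-beam block the host performs one kernel execution per block of N stations and M beams (a sketch; names are illustrative):

/* Sketch: with an N x M station-beam block the kernel is executed
   ceil(nrStations / N) * ceil(nrBeams / M) times per input instance.
   All names are illustrative. */
for ( unsigned int firstBeam = 0; firstBeam < nrBeams; firstBeam += M ) {
  for ( unsigned int firstStation = 0; firstStation < nrStations; firstStation += N ) {
    beamFormer<<< grid, block >>>( devSamples, devWeights, devBeams,
                                   firstStation, firstBeam );
    cudaThreadSynchronize();
  }
}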
For the BeamFormer 1.2, we have seven different implementations. Implementations 1.2 2x2, 1.2 4x4 and 1.2 8x8 are used to demonstrate that a bigger station-beam block implies better performance; as can be seen in Tables A.6 to A.8 (Appendix A), not only does each implementation scale linearly with the input, but doubling the size of the block halves the execution time on the same input instance. The other metrics, presented in Tables B.5 to B.7 (Appendix B) and Tables C.6 to C.8 (Appendix C), show the same behavior. However, the values are still far from the theoretical peaks of the platform: the best implementation just reaches a bit more than 1% of the video card's capabilities. In implementation 1.2.1, the code is modified to permit a runtime definition of the station-beam block; for the experiment, the station-beam block is set to the dimensions of the input instance, thus leading to a single kernel execution. Table A.9 (Appendix A) shows that the execution times for all the different input instances are low enough to be considered almost constant, with the trend becoming linear for big instances. The issue with this implementation is that the more beams are computed in a single execution, the more memory has to be allocated; so, it is not possible to solve all the instances of the experiment's input space. Looking at both the achieved GFLOP/s and GB/s, in Tables B.8 and C.9 respectively, the linear trend is lost when the number of stations to merge is fixed and the number of beams to form is varied: in this case the values increase, reach a maximum and then decrease. Implementation 1.2.1 achieves, when merging 256 stations to form 128 beams, 218,16 GFLOP/s, nearly 16% of the platform's theoretical peak.
Implementations 1.2.2, 1.2.1.1 and 1.2.2.1 are written and tested to better understand the memory behavior. In implementation 1.2.2, we changed the input and output data structures, described in Section 5.1, reordering the dimensions of the multidimensional arrays to match the CUDA grid structure and permit an improved coalesced access to memory. Tables A.11, B.10 and C.11 show good performance numbers, with an execution time that scales linearly and a value of 267,83 GFLOP/s (19% of the card's capabilities). This value is higher than what the Roofline model [27] predicts for an operational intensity of just 0,66; the extra performance is due to the improved memory bandwidth provided by the cache. The experiment also shows that improving the memory coalescing is important. But reordering the data structures, especially with big instances and in a production environment, may cost too much to be effective. Thus, a reordering of the computation should be taken into account.
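The Roofline model mentioned above bounds the attainable performance of a kernel as

\[
\mathrm{GFLOP/s}_{\mathrm{attainable}} \;=\; \min\bigl(\mathrm{GFLOP/s}_{\mathrm{peak}},\; \mathrm{operational\ intensity} \times \mathrm{GB/s}_{\mathrm{peak}}\bigr),
\]

so an operational intensity of 0,66 on this card corresponds to roughly 0,66 × 177,4 ≈ 117 GFLOP/s; the measured 267,83 GFLOP/s can therefore only be explained by the higher effective bandwidth provided by the cache.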
Implementations 1.2.1.1 and 1.2.2.1 differ from their parent implementations, 1.2.1 and 1.2.2 respectively, in how data reuse is implemented inside each thread block: it is left to the CUDA cache hierarchy in the parents and managed manually with shared memory in 1.2.1.1 and 1.2.2.1. The execution times, presented in Tables A.10 and A.12 (Appendix A), show values that are comparable with the ones measured with the cache. The measured GFLOP/s, presented in Tables B.9 and B.11 (Appendix B), and GB/s, presented in Tables C.10 and C.12 (Appendix C), are lower than those of the parents, but they show the same trend, as expected from this version. Our results show that, for the beam forming algorithm on the NVIDIA Fermi architecture, data reuse can be left to the cache, and the manual use of shared memory appears redundant.
6.5 BeamFormer 1.3
The goal of BeamFormer 1.3 is to reorder the structure of the computation, so that the access to memory becomes coalesced and performance improves without reordering the data. The grid organization reflects the ordering of the input and output data structures: the grid has one thread block for each channel-beam pair and each block has a number of threads at least equal to the number of time samples. The kernel merges a block of stations in each execution; in the last execution, it sums them and computes the final beam value. In the performed experiments, all the stations are merged in a single execution. The implementation scales well, as can be seen in Table A.13, and its behavior is symmetric, that is, approximately the same time is necessary to compute an x × y and a y × x input instance. The achieved GFLOP/s are stable, at least for big input instances (as seen in Table B.12), but are low compared to the capabilities of the video card. The memory bandwidth, seen in Table C.13, is stable too. However, the memory occupancy is too high to permit the computation of 512 beams. It becomes clear that, in order to achieve stable performance, it is necessary to decouple the number of threads from input parameters that vary too much and may be too small to keep the hardware busy all the time.
6.6 BeamFormer 1.4
BeamFormer 1.4 is another attempt to reorder the computation to improve performance. The grid is organized to have a thread block for each channel-time pair and, inside each block, a thread for each station-polarization pair. Each thread computes a partial result and then all the threads inside a block collaborate to perform two parallel reductions, one for each polarization, to first sum all the computed values and then store the result in global memory.
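The reductions mentioned above can be sketched as a standard shared memory tree reduction (illustrative code, not the exact kernel of version 1.4; it assumes a power-of-two number of participating threads):

/* Illustrative fragment, inside the kernel: a shared memory tree reduction
   over the per-station partial sums of one polarization. partial[] holds one
   complex value (as a float2) per thread; MAX_STATIONS and myPartialSum are
   illustrative names. Assumes blockDim.x is a power of two. */
__shared__ float2 partial[MAX_STATIONS];

partial[threadIdx.x] = myPartialSum;   /* computed earlier by this thread */
__syncthreads();

for ( unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2 ) {
  if ( threadIdx.x < stride ) {
    partial[threadIdx.x].x += partial[threadIdx.x + stride].x;
    partial[threadIdx.x].y += partial[threadIdx.x + stride].y;
  }
  __syncthreads();
}
/* Thread 0 now holds the block-wide sum and can write it to global memory */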
BeamFormer 1.4 scales linearly, but the measured times, presented in Table A.14, are higher (i.e. worse) than what we are looking for at this point. Moreover, Tables B.13 and C.14 show that both the number of single precision floating point operations per second and the achieved memory bandwidth are extremely low. The issue with BeamFormer 1.4 is that, in trying to find a good mapping between the data and the computation to improve the memory access pattern, we reduced the data reuse between the threads of the same block. Moreover, the parallel reductions leave too many threads idle, causing a critical underutilization of the hardware.
6.7 BeamFormer 1.5
Version 1.5 aims, at the same time, at avoiding idle threads, at accessing the memory in a coalesced way by means of a computational structure that matches the input and output data structures, and at being stable in terms of performance. In BeamFormer 1.5, the CUDA grid has a number of thread blocks that is at least the number of channels, and each block has a number of threads that is at most the number of time samples. This version also relies on the concept of the station-beam block, and the dimensions of the grid are therefore set at runtime, because a station-beam block computing more beams at the same time needs more registers and forces us to reduce the number of threads per block.
Three different implementations of version 1.5 are tested, each of them using a different size for the station-beam block. The execution times of implementations 1.5 2x2, 1.5 4x4 and 1.5 8x8 are presented in Tables A.15 to A.17 (Appendix A). The measured times are low, they scale linearly and they are symmetric; moreover, it is possible to compute instances covering the whole of the experiment's input space. It is also important to note that the execution times, for the same input instance, are nearly halved when going from one implementation to the next bigger station-beam block. The achieved GFLOP/s, as can be seen in Tables B.14 to B.16 (Appendix B), are high considering how small the station-beam block dimensions are, but decrease as the number of beams to form increases. A small decrease in performance is indeed expected, because forming more beams without increasing the size of the station-beam block requires more kernel executions. The measured decrease, however, is too big compared to the expected one. This behavior appears not to be due to the algorithm, but to the CUDA compiler, as we will show in Section 7.4. Overall, BeamFormer 1.5 reaches 84% of the GFLOP/s predicted by the Roofline model, which is a good result. Tables C.15 to C.17 (Appendix C), presenting the achieved memory bandwidth, show the same trends as the measured GFLOP/s.
6.8 Conclusions
Finally, we present a comprehensive comparison of the performance results of the developed CUDA versions of the BeamFormer; for the comparison we restrict ourselves to the case of merging 64 stations, which is an important use case for ASTRON. Figures 6.1 and 6.2 show the execution times of all the different implementations of the BeamFormer. We can see that all the implementations scale linearly, a first important result that permits us to affirm that it is possible to efficiently implement a beam forming algorithm with CUDA on a NVIDIA GPU.
The fastest implementation is BeamFormer 1.2.2; however, this implementation, along with BeamFormer 1.2.1, shows an increase in the slope of its curve beyond 128 beams to form. Moreover, both these implementations, together with BeamFormer 1.3, are incapable of computing 512 beams. The version capable of computing the highest number of beams (in our experiment) while still being among the best performing algorithms is BeamFormer 1.5. If we also consider that the second ranked implementation (BeamFormer 1.5 8x8) performs many more kernel executions than the first ranked, due to its small station-beam block, version 1.5 appears to be the real winner in terms of execution time.
Figures 6.3 and 6.4 provide the comparison between the best performing versions for
achieved GFLOP/s and GB/s, respectively. The figures show that almost all implementations (with the exception of 1.2.1 and 1.2.2) are fairly stable in their trends. The two
unstable implementations are the ones obtaining the highest values in both performance
metrics.
[Figure: execution time (s) versus pencil beams for BeamFormer 1.0, 1.1, 1.1_2x2, 1.1.1, 1.1.1_2x2, 1.2_2x2, 1.2_4x4, 1.2_8x8 and 1.4.]
Figure 6.1: Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
From these results we choose version 1.5 as the best beam forming algorithm for NVIDIA GPUs and use it for further investigations. Not only does this version show stable behavior, permit solving large input instances, and deliver good performance, it is also open to further improvements, especially concerning the dimensions of the station-beam block. We believe that improvements are still possible with BeamFormer 1.5 because we measured a value of 411,86 GFLOP/s using a station-beam block of 256 × 8; that is 30% of the video card's capabilities, and a value higher than everything else measured in the previously described experiments.
To conclude, we want to summarize the crucial aspects needed to obtain good performance with the beam forming algorithm on NVIDIA GPUs:
1. use a high number of independent, non-idle threads,
2. structure the computation to match the input and output data structures, thus
permitting a coalesced access to device memory,
[Figure: execution time (s) versus pencil beams for BeamFormer 1.2.1, 1.2.2, 1.3, 1.5_2x2, 1.5_4x4 and 1.5_8x8.]
Figure 6.2: Execution time in seconds of various BeamFormer versions merging 64 stations (lower is better).
3. keep the kernels as simple as possible, leaving them with the sole job of performing arithmetic operations, while performing synchronization on the host by means of multiple kernel executions,
4. optimize for data reuse between the threads of the same thread block; for this algorithm, when using the NVIDIA Fermi architecture, this optimization can be left to the cache system, without having to implement it manually with shared memory,
5. perform more memory transfers from host to device and vice versa, in order to allocate less memory and thus compute bigger instances; the performance of this algorithm is, in fact, so bound by the beams computation phase that the effect of the memory transfers is negligible.
[Figure: GFLOP/s versus pencil beams for BeamFormer 1.2_4x4, 1.2_8x8, 1.2.1, 1.2.2, 1.3, 1.4, 1.5_2x2, 1.5_4x4 and 1.5_8x8.]
Figure 6.3: GFLOP/s of various BeamFormer versions merging 64 stations (higher is better).
[Figure: GB/s versus pencil beams for BeamFormer 1.2_4x4, 1.2_8x8, 1.2.1, 1.2.2, 1.3, 1.4, 1.5_2x2, 1.5_4x4 and 1.5_8x8.]
Figure 6.4: GB/s of various BeamFormer versions merging 64 stations (higher is better).
Chapter 7
OpenCL BeamFormer
In this chapter we present an implementation of the BeamFormer 1.5, as described in Section 6.7, using the Open Computing Language (OpenCL) [28]. The reason for this implementation is to analyze the performance achievable by our beam forming algorithm when using a framework that is focused on code portability, and to compare this implementation with the previously developed CUDA one. First, we introduce in Section 7.1 what OpenCL is and how it works. Then, in Section 7.2, we briefly discuss the changes that are necessary to port the code between the two frameworks. The experiments, which resemble the ones we already performed with the CUDA implementation, are presented in Section 7.3, along with the results. Conclusions are provided in Section 7.4, together with a comparison of the CUDA and OpenCL implementations.
7.1 The Open Computing Language
OpenCL, the Open Computing Language [28], is an open and royalty-free standard for general purpose parallel programming on heterogeneous architectures. Initially developed by Apple, it is now maintained by the Khronos Group with many different participants from industry. OpenCL's goal is to allow developers to write portable code that is compiled at run-time and executed on different parallel architectures, like multi-core CPUs and many-core GPUs. The OpenCL standard defines an API to manage the devices and the computation, and a programming language, called OpenCL C, that provides parallel extensions to the C programming language. Both the Single Instruction Multiple Data (SIMD) and the Single Program Multiple Data (SPMD) paradigms are supported in OpenCL. Here we briefly introduce the computational model of OpenCL (more details are available in [7]).
In the OpenCL platform model, there is a host connected to one or more computing devices. Each computing device contains one or more compute units (CUs), and each CU contains one or more processing elements (PEs). The OpenCL application runs on the host, managing the computing devices via the functions provided by the OpenCL API; the execution of the kernels and the memory are managed by the application by submitting commands to queues that are associated with the computing devices. The OpenCL kernel instances, called work-items, are executed by the PEs, and each work-item is identified by an integer vector in a three dimensional index space, called NDRange, that represents the computation as a whole. Work-items are grouped into work-groups; a work-group is executed by a compute unit. Like the work-items, each work-group is identified by a vector in the same index space. Figure 7.1 shows an example of this computational hierarchy using an NDRange with the third dimension set to zero.
Figure 7.1: NDRange example, courtesy of Khronos Group [7].
OpenCL also defines a memory hierarchy organized in three levels. At the lowest level, each work-item has access to a read/write private memory that is statically allocated by the kernel and cannot be accessed by the host. A level above, all the work-items inside the same work-group share another read/write memory called local memory. At the highest level of the OpenCL memory hierarchy, all the work-items of all work-groups also have access to two global memories. Of these two memories, only one, called global memory, is writable, while the other one is a read-only memory called constant memory. The two global memories are allocated by the host application, which can also perform copy operations on them. There is no explicit knowledge, in OpenCL, about which of these memories are cached, because this depends on the actual capabilities of the device on which the code is executed.
7.2 Porting the BeamFormer 1.5 from CUDA to OpenCL
Comparing the descriptions of CUDA and OpenCL, given in Sections 4.4 and 7.1 respectively, it is clear that they share many concepts. Thus, porting the same algorithm from one to the other should be straightforward. We describe here the main technical differences between the two implementations, starting with the host code.
The host code is modified only to address the different syntax of the two APIs. The
major difference is the addition, in the OpenCL implementation, of the code to generate
and compile the kernels at run-time. In addition, the number of files containing the
source code is reduced because, using only one compiler (and not a combination of nvcc
and g++), there is no need for separating the host and device code.
For the kernels, we first added an additional one that sets all the elements of a given memory area to a given value, because OpenCL does not include a function like CUDA's cudaMemset. Then, we modified the code of the other two kernels, the one computing the weights and the one computing the beams, with two small changes: the kernel signatures are rewritten, because the OpenCL syntax differs here from the CUDA one, and the access to global memory is performed using the C array syntax instead of pointer arithmetic.
7.3 OpenCL BeamFormer performance
Porting the BeamFormer 1.5 from CUDA to OpenCL is, as described, a simple process; what we want to discover is whether this OpenCL implementation can achieve good performance. The setup of the experiments performed here is the same as the one described in Section 6.1, with the only difference that OpenCL 1.1 is used instead of CUDA 3.1 and nvcc. The three metrics we measure are the execution time, the number of single precision floating point operations per second and the achieved memory bandwidth, measured in seconds, GFLOP/s and GB/s respectively. We developed three different implementations (BeamFormer 1.5-opencl 2x2, 1.5-opencl 4x4 and 1.5-opencl 8x8) using the same station-beam block sizes as the CUDA ones. Their execution times are presented in Tables D.1 to D.3 (Appendix D). The measured values are nearly constant: in fact, there is almost no increase in the execution time when the input is increased, and times start to grow only when computing big instances. However, this is clearly due to the just-in-time OpenCL compiler: the measured execution times for the OpenCL implementations include, in addition to the computation itself, the time needed to generate and compile all the kernels, and it is not surprising that for a single execution the overhead of these operations is significant.
As a result, the other two performance metrics are more interesting for the OpenCL implementations. The achieved single precision floating point operations per second are reported in Tables D.4 to D.6 (Appendix D). The values in these tables are stable for both small and big instances, with a difference of at most 12%. Moreover, the highest measurements always correspond to the input that exactly matches the dimensions of the implementation's station-beam block. Concerning the memory bandwidth, in Tables D.7 to D.9, we see the same trends as with the measured GFLOP/s. For both performance metrics, the values achieved by implementations 1.5-opencl 2x2 and 1.5-opencl 4x4 are in line with our expectations, while the achieved performance of implementation 1.5-opencl 8x8 is lower than expected, even if, in absolute terms, this implementation achieves the highest measurements of all the OpenCL ones. We will explain this in more detail in Section 7.4.
7.4 Conclusions
Writing parallel code with OpenCL is not more difficult than writing the same code with CUDA, and porting the BeamFormer from one framework to the other was an easy task. Therefore, we can freely use OpenCL to obtain software that is portable between different architectures, be they multi-core CPUs, modern many-core GPUs, or even other kinds of parallel processors. What still needs an answer is whether, given the same algorithm and the same hardware, the portable OpenCL code performs as well as code developed with a native framework; in our case this means discovering whether the OpenCL and CUDA implementations of the BeamFormer 1.5 algorithm achieve the same performance on the NVIDIA GTX 480 video card.
Figure 7.2: GFLOP/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better). [Plot: GFLOP/s versus the number of pencil beams, with one curve for each of BeamFormer 1.5 2x2, 4x4, 8x8 and 1.5-opencl 2x2, 4x4, 8x8.]
Figures 7.2 and 7.3 provide a view of the achieved GFLOP/s and GB/s, respectively, of all the CUDA and OpenCL implementations. Looking at the results, we can say that good performance is possible: for both metrics, the OpenCL implementations 1.5-opencl 2x2 and 1.5-opencl 4x4 achieve more stable and higher values than their CUDA counterparts. Unfortunately, this is not true for implementation 1.5-opencl 8x8, whose achieved values are lower than the ones of the BeamFormer 1.5 8x8.

Figure 7.3: GB/s of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (higher is better). [Plot: GB/s versus the number of pencil beams, with one curve per CUDA and OpenCL implementation.]
The fact that some OpenCL implementations perform better than the CUDA ones, while others do not, is, however, independent of our algorithm and code, and is caused by the OpenCL compiler. As a matter of fact, we noticed many differences between the PTX code generated by the nvcc and OpenCL compilers, and we believe that these differences cause the performance discrepancies. In particular, we found that the lower performance achieved by the OpenCL implementation 1.5-opencl 8x8, compared with the CUDA implementation 1.5 8x8, is due to poor management of the virtual registers in the code generated by the OpenCL compiler. The generated PTX code uses too many registers, causing a phenomenon called register spilling, and consequently inducing a dramatic increase in the accesses to global memory. To see the curve of the BeamFormer 1.5-opencl 8x8 in Figure 7.2 lie entirely above 180 GFLOP/s, as it should, an improvement of the OpenCL compiler is necessary. What we can conclude is that good performance is certainly possible with OpenCL: there is no fundamental reason why OpenCL should be slower than a native implementation. However, an improvement of the compiler is still necessary to close the performance gap on NVIDIA GPUs between CUDA and OpenCL.
Figure 7.4: Execution time in seconds of BeamFormer 1.5 implemented with CUDA and OpenCL merging 64 stations (lower is better). [Plot: execution time versus the number of pencil beams, with one curve per CUDA and OpenCL implementation.]
Another small issue of OpenCL is the time needed for code generation and compilation at run-time. Figure 7.4 shows that the execution times of the OpenCL implementations are shifted up by almost one second, and without combining this metric with the other two we would not even be able to tell whether the OpenCL implementations scale well or not. In our context this is not a real problem, because in the production environment the beam former is executed more than once, making the effect of these operations negligible when compared to the total execution time. However, when using OpenCL, it is important to take this overhead into account, especially with algorithms whose average execution time is in the order of milliseconds or less.
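A simple way to keep this overhead out of the measurements, sketched below under the assumption of an already created context, device and command queue (the latter with profiling enabled), is to build the program once and then time only the kernel executions, for example with OpenCL events; this mirrors what happens in production, where the compiled kernels are reused for many runs. The kernel name is a placeholder.

#include <CL/cl.h>

// Sketch: pay the just-in-time compilation cost once, then time only the kernel runs.
// The queue is assumed to be created with CL_QUEUE_PROFILING_ENABLE.
double timeKernelOnly(cl_context context, cl_device_id device, cl_command_queue queue,
                      const char *source, size_t globalSize, size_t localSize)
{
    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    clBuildProgram(program, 1, &device, "", NULL, NULL);             // compilation, paid once
    cl_kernel kernel = clCreateKernel(program, "beamFormer", &err);  // placeholder kernel name

    // ... set the kernel arguments here with clSetKernelArg() ...

    cl_event event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &event);
    clWaitForEvents(1, &event);

    cl_ulong start, end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
    return (end - start) * 1e-9;                                     // seconds, compilation excluded
}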
Chapter 8
Finding the best station-beam block size
The performance of the BeamFormer 1.5 algorithm is influenced, among other factors, by the size of the station-beam block. So far, we have tried three different block sizes: 2 × 2, 4 × 4 and 8 × 8. We want to analyze further how this parameter affects the performance of our beam forming algorithm on NVIDIA GPUs, and to find the station-beam block size that delivers the highest number of single precision floating point operations per second.
In this chapter, Section 8.1 introduces the new experiment, while Sections 8.2 and 8.3
present the results with OpenCL and CUDA, respectively. Section 8.4 lists our conclusions, and presents a comparison between the results obtained using the two frameworks.
8.1 Experimental setup
For this experiment we generated thirty-two different implementations of the BeamFormer 1.5, half of them using OpenCL and the other half using CUDA. So many implementations are necessary because, although changing the station component of the station-beam block does not require changes in the source code, modifying the beam component does. The implementations are tested to find the station-beam block size that delivers the highest performance and, furthermore, to understand the way in which this parameter affects the algorithm's performance.
The experiment is performed running a single execution of each implementation, varying the input parameter that represents the number of stations to merge for every single beam. The parameter is first varied over the integers between 1 and 16, and then over the powers of two between 2^1 and 2^8. For each implementation, the input parameter associated with the number of beams to form is set to match the beam component of the implementation's station-beam block: in this way it is possible to test a different station-beam block size in each execution, eventually testing all the block sizes between 1 × 1 and 16 × 16, and then between 2 × 1 and 256 × 16. As in the previous experiments, the other input parameters are kept constant. In each experiment we measure the time taken by the kernel running on the GPU. This value is used to derive the number of achieved single precision floating point operations per second, measured in GFLOP/s. Because each execution is associated with a station-beam block size, the measured values are used to compare the different sizes.
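The sweep can be thought of as the loop sketched below; runBeamFormer and record are hypothetical helpers, standing respectively for one execution of the generated implementation (whose beam component is fixed) with the given number of stations, and for storing the measured GFLOP/s.

// Sketch of the parameter sweep described above (placeholder helpers).
double runBeamFormer(unsigned stations, unsigned beams);          // hypothetical
void   record(unsigned stations, unsigned beams, double gflops);  // hypothetical

void sweepBlockSizes()
{
    for (unsigned beams = 1; beams <= 16; beams++) {                 // one implementation per beam size
        for (unsigned stations = 1; stations <= 16; stations++)      // integers 1..16
            record(stations, beams, runBeamFormer(stations, beams));
        for (unsigned stations = 2; stations <= 256; stations *= 2)  // powers of two 2^1..2^8
            record(stations, beams, runBeamFormer(stations, beams));
    }
}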
The machine used for the experiments has one Intel Core i7-920 CPU, 6 GB of RAM
and a NVIDIA GeForce GTX 480 video card. The GeForce GTX 480 uses the NVIDIA
GF100 GPU, with 480 computational cores, that provides a theoretical peak performance
of 1344,96 GFLOP/s and can sustain a memory bandwidth of 177,4 GB/s accessing its
on-board 1536 MB of RAM. The machine’s operating system is Ubuntu Linux 9.10. The
host code is compiled with g++ version 4.4.1 and the CUDA device code is compiled
with nvcc version 0.2.1221; we use CUDA version 3.1 and OpenCL version 1.1.
8.2 OpenCL results
The sixteen OpenCL implementations are generated by exploiting the framework's capabilities for run-time code generation and compilation, without any modification to the source code. The results for station-beam block sizes from 1 × 1 to 16 × 16 are presented in Tables E.1 to E.4 (Appendix E). The highest value, 240,89 GFLOP/s, corresponds to the 16 × 7 station-beam block. If we keep the station component of the block fixed while increasing the beam component, performance increases continuously from a beam component of 1 up to a value of 7; after that, it decreases steadily. Keeping the beam component of the block fixed and varying the station component, the measured GFLOP/s instead increase monotonically.
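A minimal sketch of how such run-time specialization can work is given below: the block size is injected into the kernel source through the build options of clBuildProgram. The macro names and the surrounding code are our own illustration, not the BeamFormer's actual generator.

#include <CL/cl.h>
#include <cstdio>

// Sketch: specialize the OpenCL kernel for a given station-beam block at run time.
// STATION_BLOCK and BEAM_BLOCK are illustrative macro names assumed to be used
// inside the kernel source.
cl_program buildForBlock(cl_context context, cl_device_id device,
                         const char *kernelSource, unsigned stationBlock, unsigned beamBlock)
{
    char options[128];
    snprintf(options, sizeof(options),
             "-D STATION_BLOCK=%u -D BEAM_BLOCK=%u", stationBlock, beamBlock);

    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &err);
    clBuildProgram(program, 1, &device, options, NULL, NULL);   // just-in-time compilation
    return program;
}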
Tables E.5 and E.6 (Appendix E) show the achieved GFLOP/s for station-beam block sizes from 2 × 1 to 256 × 16. When the station component is between 2 and 16, all the peaks correspond to a beam size of 7, while for larger values of the station component the peaks shift back to a beam size of 6. As in the previous case, increasing the station component appears to always produce a performance gain, while the optimal beam component of the block remains bounded between 6 and 7. The highest measured value, 392,16 GFLOP/s, is achieved with a block of 256 × 6 and represents 29% of the theoretical performance peak of the GTX 480 video card.
Figure 8.1: GFLOP/s for the OpenCL BeamFormer: block sizes from 64x1 to 64x16 (higher is better). [Plot: GFLOP/s versus the beam component size (1 to 16), station component fixed at 64.]
The results of this experiment indicate that, when configuring the station-beam block for the OpenCL BeamFormer 1.5:
• it is advantageous to set the station component equal to the number of stations to merge,
• and setting the beam component above 7, or above 6 when the station component is larger than 16, is counterproductive.
This trend can be seen in Figure 8.1, where the measured GFLOP/s are shown with the
station component size fixed to 64 stations.
8.3 CUDA results
For the sixteen CUDA implementations, we developed an external kernel generator for the BeamFormer, and added one new kernel for each beam component size under test to the code already described in Section 6.7. Tables E.7 to E.10 (Appendix E) present the GFLOP/s achieved with station-beam blocks of sizes between 1×1 and 16×16. The highest measured value, 251,41 GFLOP/s, is found for the block size 16×10. As expected, when varying the station component of the block we see a continuous increase in performance, with the peaks always corresponding to the value 16 of the station component. The trend is more complex if we increase the beam component of the block and keep the other component constant: performance initially increases, up to a certain peak, and then suddenly decreases. The peaks lie in the interval spanned by three beam component sizes: 8, 9 and 10.
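Since the CUDA kernels must exist at compile time, the generator emits one kernel per beam component and the host code selects among them at run time; conceptually this looks like the hedged sketch below, where the kernel names and parameters are placeholders and not the generator's actual output.

// Sketch: dispatch to the pre-generated CUDA kernel for a given beam component.
// beamFormer_1b ... beamFormer_16b and their parameters are placeholders.
__global__ void beamFormer_1b(const float4 *samples, const float2 *weights, float4 *beams);
__global__ void beamFormer_2b(const float4 *samples, const float2 *weights, float4 *beams);
/* ... one generated kernel per beam component size, up to beamFormer_16b ... */

void launchBeamFormer(unsigned beamComponent, dim3 grid, dim3 block,
                      const float4 *samples, const float2 *weights, float4 *beams)
{
    switch (beamComponent) {
        case 1:  beamFormer_1b<<<grid, block>>>(samples, weights, beams); break;
        case 2:  beamFormer_2b<<<grid, block>>>(samples, weights, beams); break;
        /* ... */
        default: break;   // unsupported beam component size
    }
}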
The same trend is visible for station-beam block sizes from 2 × 1 to 256 × 16, as can be seen in Tables E.11 and E.12 (Appendix E). Moreover, an increase in the station component of the block corresponds to a shift of the peak towards the value of 10 for the beam component. In this experiment the highest measured value is 427,07 GFLOP/s, more than 31% of the platform's theoretical performance peak; this value is measured with a station-beam block size of 256 × 10.
Figure 8.2 shows the performance trend with the station component fixed at 64 and the beam component varying. As a result of this experiment, we can affirm that the best station-beam block size for the CUDA BeamFormer is composed of the highest possible value for the station component and a beam component between 8 and 10.
Figure 8.2: GFLOP/s for the CUDA BeamFormer: block sizes from 64x1 to 64x16 (higher is better). [Plot: GFLOP/s versus the beam component size (1 to 16), station component fixed at 64.]
8.4 Conclusions
The comparison between the CUDA and OpenCL implementations of the BeamFormer 1.5, in terms of the station-beam block, can be summarized by Figure 8.3. The two plotted curves are indeed similar. This means, on the one hand, that the influence of the station-beam block size on the algorithm is independent of the framework used to implement it, and, on the other, that performance can be improved by a correct setup of this parameter. As in the comparison of the CUDA and OpenCL BeamFormers in Section 7.4, we see the OpenCL implementations perform better than their CUDA counterparts up to a certain point, which we can now quantify at values 6 and 7 of the block's beam component, after which their performance rapidly falls. This discrepancy is, however, due to the two different compilers, as already explained in Section 7.4, and it is not a property of the algorithm. In addition, the decrease in performance of the CUDA implementations can be explained by increased register spilling: the NVIDIA Fermi architecture poses a limit of 63 registers per thread, and given that increasing the beam component of the station-beam block increases the register usage, we hit a hardware limit here.

Figure 8.3: Comparison of CUDA and OpenCL BeamFormers: block sizes from 64x1 to 64x16 (higher is better). [Plot: GFLOP/s versus the beam component size (1 to 16), one curve for the CUDA and one for the OpenCL BeamFormer 1.5.]
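How close a kernel gets to the 63-register limit mentioned above can be checked directly: for CUDA, ptxas can report the per-thread register usage at compile time, and for OpenCL 1.1 the amount of private (spilled) memory used by the compiled kernel can be queried after the build, as in the hedged sketch below.

// Checking register pressure on both frameworks (sketch).
//
// CUDA: ask ptxas to print the per-thread register usage at compile time, e.g.
//   nvcc -arch=sm_20 --ptxas-options=-v beamformer.cu
//
// OpenCL: after building the program, query how much private memory the
// compiled kernel needs; a large value hints at register spilling.
#include <CL/cl.h>

cl_ulong privateMemPerWorkItem(cl_kernel kernel, cl_device_id device)
{
    cl_ulong privateMem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(privateMem), &privateMem, NULL);
    return privateMem;
}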
As a final remark, we can conclude that the station-beam block is an important parameter of the BeamFormer 1.5 version, and that for best performance the station component should be set to the total number of stations to merge, while the beam component should be set as high as possible, within the limits of 7 and 10 beams for the OpenCL and CUDA implementations respectively, and within the number of beams that have to be formed. As a side effect, the search for the best station-beam block size showed that the BeamFormer 1.5 can reach around 30% of the theoretical GFLOP/s of the GPU used, both with OpenCL and with CUDA.
Chapter 9
Conclusions
Our main research question, at the beginning of this work, was whether it is possible to efficiently parallelize the beam forming algorithm on a GPU. After parallelizing the algorithm following different strategies, and after testing all the implementations and collecting the results, we summarize here our answers to this question.
First, we summarize all the results and contributions of this project. In Chapter 6 we showed that it is possible to implement well performing beam formers on an NVIDIA GTX 480 using CUDA. We found that, in order to achieve good performance, it is important to have a high number of independent, non-idle threads, each executing a kernel that is as simple as possible. The best performing strategy was to let the kernel perform just arithmetic operations, while performing synchronization and other high-cost operations on the host.
The memory access pattern is of capital importance, too. We found that the best performing beam formers were the ones where the structure of the computation matched the input and output data structures, thus permitting coalesced accesses to device memory and reducing the number of read and write operations performed. To further reduce the number of memory accesses, data reuse between the threads of the same block also proved to be extremely important.
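As a minimal illustration of the difference (not the BeamFormer's actual indexing), in the first kernel below neighbouring threads of a block read neighbouring array elements, so their loads coalesce into few memory transactions, while in the second the accesses are scattered and each load may require its own transaction.

// Sketch: coalesced versus strided access to device memory (CUDA).
__global__ void coalescedRead(const float *in, float *out, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // consecutive threads touch consecutive addresses
}

__global__ void stridedRead(const float *in, float *out, unsigned n, unsigned stride)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // 'in' is assumed to hold at least n * stride elements
}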
We performed further experiments on our best performing beam former (the BeamFormer 1.5, described in Section 6.7), and we found (see Chapter 8) that the correct setup of the station-beam block parameter may improve the performance of the algorithm itself. Indeed, it was during the experiment aimed at finding how this station-beam block parameter affects the algorithm's behavior that we measured the highest GFLOP/s values, for both CUDA and OpenCL. We also found that the way in which the setup of this parameter affects the performance of our beam former is independent of the implementation framework, so, although a different implementation framework produces slightly different values, the performance trend remains the same.

                                  IBM Blue Gene/P    NVIDIA GTX 480
% of the theoretical GFLOP/s      80%                30%
Chip's GFLOP/s                    10,8               427,07
Power efficiency (GFLOP/s/W)      0,456              1,708

Table 9.1: Comparison of the beam former running on the ASTRON IBM Blue Gene/P and on an NVIDIA GTX 480.
When using OpenCL to implement our beam former, as discussed in Chapter 7, we found that it is possible to achieve good performance, and in some cases even better performance than what we achieved with CUDA. Although the measured execution time was always higher when using OpenCL, this was due to the overhead of the run-time environment (run-time kernel compilation and launching), which was measured together with the computation itself. For an algorithm whose average execution time is under a second, an added cost of 900 milliseconds is indeed a problem; however, this run-time overhead is greatly reduced when the kernels are executed multiple times.
In fact, this overhead is not the biggest problem that we found with OpenCL. Instead, we found some compiler issues: the code produced by the OpenCL compiler uses too many registers (more than the equivalent CUDA code), causing severe register spilling into the slow global memory and, consequently, a dramatic fall in performance. We hope that this problem will be fixed in a future version of the OpenCL framework.
We also compare our best implementation with the current ASTRON implementation, which runs in production on an IBM Blue Gene/P. The comparison is provided in Table 9.1.
In terms of the achieved percentage of the theoretical GFLOP/s peak, the beam former running on the IBM Blue Gene/P is currently the winner, achieving 80% of the platform's theoretical GFLOP/s against our 30%. However, this efficiency in terms of hardware utilization is due to the narrow gap that the IBM Blue Gene/P architecture has between its maximum achievable floating point operations per second and its maximum memory bandwidth. In contrast, the GPU we used has a wide gap between its theoretically achievable GFLOP/s and GB/s: to achieve an efficiency of 80% on the NVIDIA GTX 480, a kernel needs an operational intensity of more than 6 (according to the Roofline model [27]), and in our analysis this level of operational intensity is not achievable for this algorithm.
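A quick check of this threshold using the platform numbers from Section 8.1: reaching 80% of the GTX 480's theoretical peak while streaming data at its peak memory bandwidth requires an operational intensity of at least

  (0,8 × 1344,96 GFLOP/s) / 177,4 GB/s ≈ 6,1 FLOP/byte,

which is where the value of 6 quoted above comes from.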
However, looking at the other parameters of the comparison, also in Table 9.1, we can see that the number of single precision floating point operations per second that we achieved with a single NVIDIA GTX 480 is almost forty times (about 40x) the GFLOP/s achieved by a single chip of the IBM Blue Gene/P, and, in terms of power consumption, our best implementation is more than three times more efficient.
We therefore conclude that the use of GPUs for radio astronomy beam forming is a viable solution, and we have shown that it is possible to efficiently parallelize this algorithm on a GPU. In the future we aim to extend this work by testing the beam former on different multi-core and many-core architectures, e.g. GPUs from other manufacturers or the Cell Broadband Engine Architecture, and by improving the kernel generator to automatically try new optimizations and tune the code for each specific architecture. Furthermore, we plan to parallelize our algorithm to run on a GPU-powered cluster, in order to compute bigger instances and to discover how the algorithm scales beyond a single GPU.
Appendix A
CUDA BeamFormer execution time
b
s
2
4
8
16
32
64
128
2
4
8
16
32
64
128
256
0,106
0,0957
0,109
0,124
0,16
0,229
0,44
0,112
0,121
0,14
0,174
0,236
0,357
0,731
0,154
0,173
0,203
0,255
0,389
0,619
1,33
0,241
0,276
0,328
0,417
0,663
1,14
2,42
0,412
0,482
0,579
0,741
1,21
2,14
4,51
0,755
0,894
1,08
1,39
2,31
4,16
8,88
1,44
1,72
2,08
2,68
4,51
8,13
17,6
2,82
3,37
4,09
5,28
8,96
16,4
-
Table A.1: Execution time in seconds for the BeamFormer 1.0.2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,1
0,0912
0,111
0,152
0,233
0,4
0,757
1,46
0,0921
0,111
0,15
0,227
0,379
0,694
1,34
2,63
0,114
0,152
0,226
0,374
0,669
1,28
2,52
4,99
0,159
0,232
0,377
0,669
1,25
2,45
4,87
9,7
0,249
0,393
0,683
1,26
2,41
4,79
9,57
19,2
0,428
0,715
1,29
2,44
4,74
9,46
19,0
38,0
0,786
1,36
2,51
4,8
9,39
18,8
37,9
76,0
1,5
2,65
4,94
9,54
18,7
37,7
75,9
-
Table A.2: Execution time in seconds for the BeamFormer 1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0937
0,0794
0,0885
0,106
0,141
0,211
0,378
0,69
0,0805
0,0876
0,103
0,133
0,193
0,318
0,583
1,1
0,0911
0,105
0,132
0,187
0,297
0,524
1,0
1,93
0,112
0,138
0,191
0,295
0,508
0,942
1,83
3,58
0,155
0,205
0,308
0,512
0,925
1,78
3,49
6,92
0,239
0,339
0,541
0,945
1,76
3,44
6,82
13,5
0,408
0,606
1,01
1,82
3,44
6,8
13,5
26,9
0,745
1,14
1,95
3,56
6,81
13,5
26,9
-
Table A.3: Execution time in seconds for the BeamFormer 1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
0,095
0,0903
0,109
0,149
0,226
0,386
0,728
1,4
0,0908
0,109
0,146
0,22
0,365
0,665
1,29
2,52
0,113
0,149
0,219
0,361
0,642
1,22
2,41
4,78
0,157
0,226
0,365
0,643
1,2
2,34
4,65
9,29
0,243
0,381
0,656
1,21
2,3
4,56
9,13
18,3
0,417
0,69
1,24
2,33
4,52
9,02
18,1
36,3
0,764
1,31
2,41
4,59
8,97
18,0
36,2
72,8
1,46
2,55
4,74
9,12
17,8
35,9
72,5
146,0
2,87
5,07
9,44
18,2
35,6
72,2
145,0
291,0
Table A.4: Execution time in seconds for the BeamFormer 1.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
0,0906
0,0799
0,0882
0,106
0,14
0,212
0,375
0,683
0,0807
0,0891
0,104
0,134
0,194
0,316
0,584
1,09
0,0932
0,107
0,134
0,189
0,298
0,526
0,997
1,92
0,116
0,143
0,194
0,299
0,51
0,942
1,82
3,57
0,163
0,214
0,315
0,519
0,93
1,77
3,48
6,88
0,256
0,356
0,557
0,959
1,77
3,44
6,8
13,5
0,443
0,64
1,04
1,84
3,46
6,79
13,5
26,8
0,815
1,21
2,01
3,62
6,86
13,5
26,8
53,3
1,57
2,39
3,98
7,19
13,7
27,0
53,5
107,0
Table A.5: Execution time in seconds for the BeamFormer 1.1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
0,0952
0,0803
0,0884
0,106
0,142
0,214
0,38
0,714
0,0815
0,0882
0,105
0,135
0,196
0,32
0,591
1,13
0,0921
0,107
0,134
0,19
0,302
0,535
1,01
1,97
0,115
0,142
0,195
0,302
0,516
0,958
1,86
3,66
0,159
0,212
0,316
0,523
0,943
1,81
3,55
7,04
0,25
0,352
0,557
0,968
1,8
3,5
6,93
13,8
0,429
0,633
1,04
1,86
3,51
6,91
13,7
27,3
0,812
1,22
2,03
3,65
6,96
13,8
27,3
54,3
1,53
2,35
3,98
7,24
13,8
27,5
54,5
109,0
Table A.6: Execution time in seconds for the BeamFormer 1.2 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
0,0847
0,0886
0,103
0,134
0,197
0,344
0,642
0,0899
0,103
0,129
0,181
0,286
0,522
0,974
0,107
0,131
0,178
0,274
0,466
0,871
1,68
0,145
0,19
0,279
0,458
0,82
1,57
3,08
0,218
0,304
0,478
0,827
1,53
2,98
5,83
0,365
0,536
0,878
1,56
2,97
5,78
11,5
0,662
1,01
1,71
3,08
5,85
11,5
22,6
1,25
1,95
3,32
6,03
11,6
22,7
45,1
Table A.7: Execution time in seconds for the BeamFormer 1.2 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
0,112
0,108
0,14
0,204
0,354
0,654
0,11
0,137
0,191
0,3
0,54
1,01
0,146
0,194
0,294
0,492
0,909
1,74
0,218
0,311
0,498
0,871
1,65
3,19
0,36
0,544
0,908
1,64
3,13
6,13
0,677
1,03
1,75
3,19
6,15
11,9
1,26
1,96
3,39
6,28
12,0
23,6
Table A.8: Execution time in seconds for the BeamFormer 1.2 8x8
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0919
0,0792
0,0856
0,101
0,126
0,179
0,311
0,563
0,0782
0,0809
0,0879
0,103
0,128
0,185
0,31
0,549
0,0821
0,0859
0,0925
0,106
0,133
0,187
0,318
0,554
0,0928
0,0959
0,103
0,118
0,146
0,197
0,326
0,565
0,113
0,117
0,124
0,14
0,164
0,221
0,353
0,603
0,161
0,161
0,165
0,178
0,204
0,26
0,396
0,673
0,305
0,311
0,314
0,323
0,327
0,365
0,524
0,849
0,641
0,644
0,649
0,662
0,697
0,724
0,924
-
Table A.9: Execution time in seconds for the BeamFormer 1.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0833
0,0789
0,0835
0,0972
0,124
0,174
0,303
2,12
0,078
0,0805
0,0866
0,0996
0,126
0,179
0,303
2,12
0,0823
0,0854
0,0911
0,104
0,131
0,182
0,31
2,14
0,092
0,0949
0,101
0,114
0,144
0,193
0,319
2,15
0,111
0,115
0,122
0,136
0,162
0,218
0,35
2,24
0,154
0,156
0,159
0,171
0,198
0,253
0,39
2,4
0,288
0,289
0,29
0,298
0,321
0,388
0,575
2,48
0,639
0,65
0,651
0,691
0,794
1,05
1,64
-
Table A.10: Execution time in seconds for the BeamFormer 1.2.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0874
0,0772
0,0825
0,0962
0,119
0,17
0,293
0,545
0,0761
0,0788
0,0842
0,096
0,12
0,171
0,296
0,545
0,079
0,0814
0,0874
0,098
0,124
0,176
0,297
0,542
0,0847
0,0875
0,093
0,107
0,129
0,18
0,306
0,544
0,0973
0,0993
0,106
0,12
0,147
0,197
0,321
0,577
0,12
0,123
0,13
0,146
0,17
0,227
0,355
0,608
0,185
0,189
0,2
0,216
0,245
0,31
0,459
0,759
0,313
0,334
0,345
0,372
0,431
0,538
0,776
-
Table A.11: Execution time in seconds for the BeamFormer 1.2.2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0921
0,0781
0,0833
0,0945
0,12
0,17
0,296
2,09
0,076
0,0788
0,0847
0,0984
0,121
0,174
0,295
2,07
0,0795
0,0826
0,0878
0,0988
0,124
0,174
0,299
2,07
0,0851
0,0878
0,0939
0,106
0,131
0,181
0,304
2,07
0,0977
0,101
0,107
0,121
0,147
0,199
0,327
2,15
0,12
0,124
0,131
0,147
0,172
0,224
0,357
2,19
0,203
0,2
0,207
0,226
0,263
0,337
0,524
2,37
0,361
0,384
0,409
0,469
0,6
0,864
1,45
-
Table A.12: Execution time in seconds for the BeamFormer 1.2.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0742
0,0734
0,0754
0,0814
0,0919
0,112
0,179
0,296
0,0739
0,0751
0,0779
0,0837
0,0951
0,118
0,184
0,313
0,078
0,0796
0,0828
0,0896
0,102
0,13
0,204
0,343
0,0859
0,0878
0,092
0,0999
0,118
0,149
0,236
0,419
0,102
0,105
0,11
0,122
0,145
0,19
0,308
0,534
0,134
0,138
0,147
0,165
0,201
0,275
0,447
0,796
0,198
0,206
0,22
0,253
0,314
0,44
0,725
1,3
0,327
0,359
0,387
0,454
0,566
0,799
1,31
-
Table A.13: Execution time in seconds for the BeamFormer 1.3
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,0805
0,0814
0,0854
0,0933
0,105
0,134
0,226
0,443
0,0851
0,0894
0,0948
0,105
0,121
0,159
0,266
0,506
0,0989
0,106
0,115
0,13
0,15
0,202
0,355
0,72
0,126
0,138
0,154
0,177
0,212
0,294
0,53
1,11
0,18
0,204
0,233
0,273
0,329
0,484
0,893
1,94
0,289
0,336
0,39
0,465
0,567
0,865
1,63
3,6
0,505
0,6
0,711
0,85
1,04
1,62
3,08
6,78
0,959
1,15
1,36
1,63
2,03
3,08
6,04
-
Table A.14: Execution time in seconds for the BeamFormer 1.4
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
0,0727
0,0741
0,0773
0,0828
0,0945
0,118
0,185
0,301
0,0745
0,0769
0,0798
0,087
0,1
0,128
0,203
0,343
0,0797
0,0814
0,0862
0,0957
0,113
0,147
0,238
0,407
0,0883
0,0925
0,0983
0,111
0,136
0,185
0,307
0,546
0,107
0,112
0,123
0,143
0,183
0,263
0,446
0,8
0,145
0,154
0,171
0,206
0,278
0,419
0,723
1,32
0,218
0,235
0,268
0,335
0,466
0,731
1,29
2,37
0,378
0,401
0,481
0,605
0,868
1,37
2,4
4,49
0,68
0,764
0,853
1,14
1,61
2,61
4,64
8,67
Table A.15: Execution time in seconds for the BeamFormer 1.5 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
0,0758
0,0789
0,0843
0,0967
0,12
0,188
0,307
0,0792
0,0836
0,0905
0,104
0,132
0,208
0,349
0,0882
0,0927
0,101
0,12
0,154
0,248
0,413
0,104
0,111
0,124
0,15
0,202
0,326
0,568
0,137
0,147
0,169
0,211
0,294
0,485
0,845
0,202
0,221
0,257
0,333
0,481
0,809
1,44
0,338
0,374
0,444
0,584
0,863
1,45
2,58
0,595
0,674
0,819
1,08
1,63
2,72
4,89
Table A.16: Execution time in seconds for the BeamFormer 1.5 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
0,0818
0,0885
0,1
0,124
0,192
0,316
0,0903
0,0974
0,111
0,139
0,217
0,355
0,105
0,114
0,133
0,169
0,264
0,434
0,136
0,15
0,176
0,23
0,359
0,601
0,196
0,219
0,264
0,351
0,556
0,935
0,329
0,366
0,452
0,611
0,942
1,59
0,594
0,654
0,801
1,1
1,69
2,9
Table A.17: Execution time in seconds for the BeamFormer 1.5 8x8
Appendix B
CUDA BeamFormer GFLOP/s
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
1,2308
1,2699
1,2857
1,2992
1,3033
1,2892
1,2801
1,2713
1,2238
1,2616
1,2849
1,2943
1,3027
1,2889
1,2822
1,2712
1,2159
1,2608
1,2855
1,2982
1,3046
1,2932
1,2821
1,2739
1,2151
1,2596
1,2848
1,2978
1,3045
1,2931
1,2793
1,2739
1,2108
1,2590
1,2823
1,2932
1,3044
1,2903
1,2793
1,2766
1,2141
1,2586
1,2822
1,2987
1,3043
1,2903
1,2834
1,2732
1,2138
1,2606
1,2821
1,2987
1,3043
1,2903
1,2766
1,2698
1,2098
1,2605
1,2848
1,2959
1,3043
1,2834
1,2715
-
Table B.1: GFLOP/s for the BeamFormer 1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
3,2140
3,4749
3,6061
3,6764
3,7054
3,6443
3,6425
3,6328
3,1653
3,4482
3,5922
3,6764
3,6800
3,6603
3,6416
3,6324
3,1494
3,4348
3,5920
3,6800
3,6963
3,6505
3,6368
3,6365
3,1372
3,4344
3,5749
3,6782
3,6954
3,6500
3,6365
3,6364
3,1258
3,4437
3,5902
3,6683
3,6950
3,6587
3,6364
3,6254
3,1280
3,4263
3,5893
3,6724
3,7039
3,6475
3,6364
3,6363
3,1265
3,4254
3,5889
3,6587
3,6924
3,6419
3,6363
3,6226
3,1257
3,4289
3,5716
3,6475
3,6810
3,6363
3,6226
-
Table B.2: GFLOP/s for the BeamFormer 1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
1,2797
1,3270
1,3483
1,3613
1,3699
1,3544
1,3420
1,3275
1,2631
1,3181
1,3457
1,3600
1,3692
1,3541
1,3394
1,3334
1,2582
1,3137
1,3405
1,3593
1,3689
1,3515
1,3393
1,3333
1,2524
1,3124
1,3447
1,3590
1,3638
1,3514
1,3423
1,3333
1,2512
1,3118
1,3420
1,3638
1,3699
1,3544
1,3423
1,3333
1,2506
1,3092
1,3394
1,3575
1,3699
1,3544
1,3407
1,3333
1,2482
1,3045
1,3393
1,3575
1,3667
1,3559
1,3370
1,3259
1,2501
1,3101
1,3393
1,3559
1,3636
1,3483
1,3296
1,3241
1,2448
1,3072
1,3378
1,3559
1,3675
1,3389
1,3241
1,3241
Table B.3: GFLOP/s for the BeamFormer 1.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
3,2226
3,4902
3,6061
3,6908
3,7201
3,6800
3,6603
3,6594
2,9351
3,3035
3,5249
3,6478
3,6800
3,6603
3,6505
3,6500
2,8119
3,2342
3,4790
3,6092
3,6782
3,6594
3,6500
3,6587
2,7439
3,1895
3,4437
3,6075
3,6773
3,6545
3,6587
3,6475
2,7109
3,1810
3,4420
3,6066
3,6724
3,6587
3,6475
3,6530
2,7004
3,1661
3,4412
3,6018
3,6587
3,6475
3,6474
3,6363
2,6894
3,1520
3,4407
3,5930
3,6586
3,6419
3,6363
3,6363
2,6888
3,1549
3,4287
3,5821
3,6530
3,6363
3,6363
3,6363
2,6788
3,1415
3,4189
3,5714
3,6363
3,6226
3,6294
3,6226
Table B.4: GFLOP/s for the BeamFormer 1.1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
4,2534
4,2086
4,1973
4,1750
4,1475
4,0740
4,0458
4,0318
3,6716
3,8975
4,0187
4,0917
4,1136
4,0458
4,0318
4,0296
3,4103
3,7517
3,9401
4,0352
4,0849
4,0414
4,0248
4,0189
3,3015
3,6747
3,8872
4,0265
4,0805
4,0344
4,0189
4,0303
3,2384
3,6526
3,8788
4,0222
4,0684
4,0430
4,0184
4,0121
3,2098
3,6291
3,8746
4,0152
4,0673
4,0303
4,0181
4,0299
3,1908
3,6174
3,8725
3,9951
4,0668
4,0241
4,0299
4,0149
3,1874
3,6115
3,8581
4,0064
4,0543
4,0299
4,0149
4,0149
3,1856
3,6106
3,8576
3,9943
4,0602
4,0149
4,0074
4,0000
Table B.5: GFLOP/s for the BeamFormer 1.2 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
9,4799
9,4744
9,4804
9,4389
9,3739
9,2872
9,2766
8,6782
9,0571
9,2649
9,3302
9,3415
9,2497
9,2713
8,3147
8,8568
9,1489
9,2872
9,2766
9,2444
9,2417
8,1068
8,7490
9,0762
9,2497
9,2848
9,2417
9,2740
8,0214
8,6815
9,0658
9,2310
9,2417
9,2404
9,2733
7,9870
8,6716
9,0477
9,2417
9,2070
9,2565
9,1895
7,9382
8,6431
9,0451
9,1739
9,2397
9,1895
9,2309
7,9336
8,6759
9,0118
9,2230
9,1895
9,2309
9,1892
Table B.6: GFLOP/s for the BeamFormer 1.2 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
17,759
17,591
17,584
17,521
17,421
17,401
16,839
17,159
17,326
17,421
17,352
17,390
16,402
16,930
17,228
17,352
17,317
17,312
16,189
16,762
17,112
17,317
17,312
17,370
16,067
16,697
17,079
17,192
17,249
17,278
15,966
16,665
17,074
17,249
17,218
17,369
15,915
16,615
17,012
17,218
17,218
17,292
Table B.7: GFLOP/s for the BeamFormer 1.2 8x8
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
4,0039
4,9232
5,3544
5,3321
5,3897
5,4197
5,4586
5,4864
7,0326
8,8667
10,034
10,324
10,570
10,747
10,838
10,933
11,184
14,769
17,603
19,312
20,405
21,074
21,522
21,708
15,090
21,386
28,265
33,755
37,875
40,622
42,142
42,951
18,062
27,418
39,739
52,994
65,369
74,725
79,294
82,655
15,350
27,774
51,789
81,938
110,00
131,29
144,34
151,27
8,2839
14,257
26,536
50,495
103,48
178,02
206,91
218,16
5,7887
10,245
19,056
36,321
70,380
136,78
199,21
-
Table B.8: GFLOP/s for the BeamFormer 1.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
3,4358
4,5310
5,6435
5,7085
5,6953
5,8109
5,7067
0,8259
6,1159
8,2765
10,474
10,904
11,125
11,480
11,413
1,6429
9,9434
14,014
18,243
20,113
21,317
22,384
22,486
3,2858
13,712
20,176
27,872
34,659
39,137
42,624
43,988
6,5007
16,567
25,460
38,044
52,105
64,676
74,541
79,294
12,527
15,701
28,048
50,328
81,082
110,60
134,50
145,21
23,466
8,7357
15,553
29,730
56,981
98,475
143,25
174,67
46,704
5,5545
9,8898
18,722
33,302
53,206
70,380
79,687
-
Table B.9: GFLOP/s for the BeamFormer 1.2.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
4,3269
5,3789
6,2870
6,7864
6,4025
6,1914
6,0749
6,0862
8,3880
10,524
12,403
13,385
12,805
12,382
12,149
12,172
16,590
20,885
24,714
26,407
25,439
24,765
24,299
24,344
30,958
39,826
47,781
50,742
50,044
49,129
48,599
48,689
50,789
66,938
79,478
86,263
89,261
91,018
92,713
93,227
81,991
108,01
130,52
147,17
158,58
167,66
173,42
176,21
102,66
143,77
184,53
213,43
239,90
257,56
266,35
267,25
95,615
136,87
177,89
213,43
239,43
257,56
267,83
-
Table B.10: GFLOP/s for the BeamFormer 1.2.2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
3,6092
4,7148
5,9798
6,4268
6,2556
6,1712
6,0749
0,8445
7,0326
9,2025
11,818
12,769
12,470
12,342
12,149
1,6890
13,934
18,373
23,499
25,206
24,859
24,644
24,299
3,3781
25,287
34,055
43,674
48,522
48,765
48,891
48,599
6,7563
41,136
56,653
71,672
81,722
86,972
90,339
89,945
13,143
66,546
92,383
119,81
141,80
156,95
170,02
168,56
26,144
36,125
75,371
121,16
158,03
180,36
200,09
211,45
48,838
35,328
49,664
65,462
79,627
89,457
95,884
94,531
-
Table B.11: GFLOP/s for the BeamFormer 1.2.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,2023
0,3642
0,6653
89,857
100,41
105,96
109,49
107,37
0,4063
0,7119
80,920
92,866
102,57
108,39
109,97
107,37
48,828
64,808
82,585
94,219
103,97
108,08
107,61
106,80
49,635
65,603
83,271
94,794
103,83
109,65
108,58
108,34
49,944
65,906
83,618
95,259
104,54
109,25
108,58
106,90
50,048
66,315
83,618
95,259
104,72
109,05
108,09
107,37
50,048
66,059
83,836
95,406
104,36
108,56
108,09
107,13
50,179
66,315
83,946
95,552
104,81
108,81
108,21
-
Table B.12: GFLOP/s for the BeamFormer 1.3
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
1,0215
1,4947
2,2588
3,4401
5,3801
6,6117
6,2563
5,1439
0,7678
1,0395
1,4648
2,1014
3,1283
3,8047
3,6003
2,8918
0,6103
0,7629
0,9903
1,3137
1,8245
2,1282
1,9804
1,6206
0,5207
0,6081
0,7260
0,8816
1,1288
1,2192
1,0912
0,8847
0,4705
0,5241
0,5859
0,6548
0,7595
0,7366
0,6194
0,4774
0,4437
0,4814
0,5137
0,5373
0,5723
0,4972
0,3823
0,2723
0,4312
0,4588
0,4728
0,4788
0,4780
0,3796
0,2688
0,1766
0,4242
0,4464
0,4530
0,4503
0,4292
0,3323
0,2085
-
Table B.13: GFLOP/s for the BeamFormer 1.4
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
59,346
55,948
55,284
55,595
55,264
55,458
55,266
55,387
25,177
34,103
41,913
47,705
51,172
53,181
54,319
54,818
19,145
28,088
37,262
44,506
49,449
52,301
53,685
54,450
17,004
25,821
35,308
43,141
48,543
51,712
53,585
54,442
16,118
24,778
34,234
42,339
48,034
51,540
53,366
54,329
15,634
24,298
33,786
42,000
47,885
51,337
53,258
54,273
15,458
24,133
33,768
41,677
47,709
51,236
52,942
54,272
15,357
23,949
33,424
41,801
47,622
51,430
53,202
54,135
15,279
23,942
33,420
41,669
47,789
51,185
53,202
54,203
Table B.14: GFLOP/s for the BeamFormer 1.5 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
99,565
99,299
98,966
99,039
99,199
99,218
99,104
60,041
74,588
84,831
91,594
95,084
96,995
98,133
49,847
66,134
79,261
88,254
93,308
96,064
97,356
45,849
62,541
76,798
86,716
92,444
95,892
97,342
43,985
60,969
75,441
85,850
91,752
95,519
97,335
43,173
60,097
74,866
85,595
91,739
95,333
97,146
42,719
59,613
74,582
85,296
91,733
95,330
97,144
42,524
59,596
74,572
85,148
91,895
95,328
97,143
Table B.15: GFLOP/s for the BeamFormer 1.5 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
180,79
181,58
181,99
182,63
182,41
182,73
132,22
153,09
166,75
174,21
178,52
180,50
116,01
141,29
159,84
170,65
176,38
179,40
109,45
136,61
156,43
168,69
175,58
179,37
105,83
134,08
154,77
167,27
174,93
178,71
104,11
132,71
153,77
167,25
174,61
178,38
103,18
131,68
153,27
166,67
173,69
178,38
Table B.16: GFLOP/s for the BeamFormer 1.5 8x8
Appendix C
CUDA BeamFormer GB/s
b
s
2
4
8
16
32
64
128
2
4
8
16
32
64
128
256
2,8200
5,5803
10,557
24,898
52,989
112,81
191,76
2,7323
5,4505
11,143
23,514
50,781
109,56
181,77
2,6905
5,3879
11,432
25,367
49,744
105,86
174,56
2,6700
5,3432
11,621
26,457
52,139
104,41
181,61
2,6616
5,3432
11,718
27,080
53,181
105,97
187,5
2,6533
5,3294
11,755
27,380
53,917
106,5
187,06
2,6574
5,3225
11,780
27,490
54,041
107,57
187,28
2,6470
5,3225
11,780
27,6
54,166
105,83
-
Table C.1: GB/s for the BeamFormer 1.0.2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
4,5444
4,5677
4,5626
4,5789
4,5776
4,5203
4,4845
4,4516
4,4021
4,4770
4,5287
4,5460
4,5674
4,5153
4,4896
4,4503
4,3150
4,4436
4,5149
4,5516
4,5703
4,5283
4,4884
4,4592
4,2827
4,4240
4,5048
4,5465
4,5677
4,5271
4,4782
4,4589
4,2527
4,4141
4,4921
4,5283
4,5664
4,5167
4,4779
4,4682
4,2570
4,4092
4,4896
4,5467
4,5658
4,5164
4,4921
4,4563
4,2523
4,4142
4,4884
4,5460
4,5655
4,5162
4,4681
4,4444
4,2362
4,4129
4,4974
4,5359
4,5653
4,4920
4,4503
-
Table C.2: GB/s for the BeamFormer 1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
9,3500
9,4092
9,3928
9,3841
9,3611
9,1588
9,1303
9,0940
8,5708
8,9815
9,1692
9,2878
9,2486
9,1749
9,1161
9,0869
8,2031
8,7676
9,0747
9,2486
9,2653
9,1383
9,0980
9,0944
8,0078
8,6765
8,9843
9,2198
9,2508
9,1312
9,0944
9,0926
7,8969
8,6546
8,9993
9,1830
9,2436
9,1499
9,0926
9,0643
7,8613
8,5883
8,9853
9,1870
9,2628
9,1203
9,0917
9,0913
7,8369
8,5750
8,9783
9,1499
9,2325
9,1055
9,0913
9,0568
7,8247
8,5781
8,9320
9,1203
9,2033
9,0913
9,0568
-
Table C.3: GB/s for the BeamFormer 1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
4,7253
4,7730
4,7848
4,7978
4,8115
4,7488
4,7011
4,6486
4,5433
4,6776
4,7429
4,7766
4,8008
4,7435
4,6901
4,6679
4,4650
4,6301
4,7084
4,7660
4,7954
4,7323
4,6888
4,6673
4,4140
4,6096
4,7148
4,7607
4,7753
4,7310
4,6986
4,6669
4,3945
4,5994
4,7011
4,7753
4,7958
4,7410
4,6983
4,6668
4,3847
4,5862
4,6901
4,7524
4,7951
4,7407
4,6929
4,6667
4,3725
4,5677
4,6888
4,7517
4,7839
4,7459
4,6797
4,6409
4,3774
4,5864
4,6881
4,7460
4,7728
4,7191
4,6537
4,6345
4,3580
4,5758
4,6826
4,7459
4,7864
4,6862
4,6345
4,6344
Table C.4: GB/s for the BeamFormer 1.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
9,375
9,4506
9,3928
9,4209
9,3982
9,2486
9,1749
9,1606
7,9473
8,6046
8,9975
9,2157
9,2486
9,1749
9,1383
9,1312
7,3242
8,2554
8,7890
9,0707
9,2198
9,1606
9,1312
9,1499
7,0039
8,0578
8,6546
9,0425
9,2055
9,1423
9,1499
9,1203
6,8486
7,9945
8,6277
9,0285
9,1870
9,1499
9,1203
9,1333
6,7867
7,9361
8,6143
9,0106
9,1499
9,1203
9,1194
9,0913
6,7414
7,8904
8,6076
8,9855
9,1481
9,1055
9,0913
9,0911
6,7309
7,8925
8,5747
8,9569
9,1333
9,0913
9,0911
9,0910
6,7016
7,8564
8,5486
8,9294
9,0913
9,0568
9,0738
9,0566
Table C.5: GB/s for the BeamFormer 1.1.1 2x2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
8,2993
8,7453
9,0144
9,1196
9,1374
9,0144
8,9712
8,9498
7,6393
8,3705
8,7780
9,0144
9,1019
8,9712
8,9498
8,9498
7,3242
8,1949
8,6966
8,9285
9,0579
8,9712
8,9392
8,9285
7,2115
8,1098
8,6805
8,9285
9,0579
8,9605
8,9285
8,9285
7,1347
8,0818
8,6009
8,9285
9,0470
8,9820
8,9285
8,9285
7,1022
8,0472
8,6206
8,9179
9,0361
8,9552
8,9418
8,9552
7,0754
8,0299
8,6009
8,9285
9,0361
8,9418
8,8888
8,9219
7,0754
8,0213
8,5714
8,9020
9,0225
8,9552
8,9219
8,9219
7,0754
8,0213
8,5714
8,8757
9,0225
8,9219
8,9053
8,8888
Table C.6: GB/s for the BeamFormer 1.2 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
16,622
17,201
17,523
17,605
17,564
17,441
17,441
15,756
16,741
17,281
17,482
17,543
17,391
17,441
15,368
16,519
17,142
17,441
17,441
17,391
17,391
15,120
16,393
17,045
17,391
17,467
17,391
17,454
15,030
16,304
17,045
17,366
17,391
17,391
17,454
15,0
16,304
17,021
17,391
17,328
17,422
17,297
14,925
16,260
17,021
17,266
17,391
17,297
17,375
14,925
16,326
16,961
17,359
17,297
17,375
17,297
Table C.7: GB/s for the BeamFormer 1.2 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
29,037
29,296
29,560
29,594
29,494
29,494
28,044
28,846
29,264
29,494
29,411
29,494
27,573
28,594
29,166
29,411
29,370
29,370
27,343
28,378
29,005
29,370
29,370
29,473
27,202
28,301
28,965
29,166
29,268
29,319
27,061
28,263
28,965
29,268
29,217
29,473
26,992
28,187
28,865
29,217
29,217
29,344
Table C.8: GB/s for the BeamFormer 1.2 8x8
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
7,8125
8,6325
8,7546
8,3705
8,2759
8,2266
8,2370
8,2544
13,722
15,547
16,406
16,206
16,230
16,313
16,355
16,448
21,822
25,897
28,782
30,317
31,333
31,989
32,477
32,660
29,444
37,5
46,214
52,989
58,157
61,661
63,592
64,620
35,244
48,076
64,975
83,191
100,37
113,42
119,65
124,35
29,952
48,701
84,677
128,62
168,91
199,29
217,81
227,59
16,163
25,0
43,388
79,268
158,89
270,22
312,23
328,23
11,295
17,964
31,157
57,017
108,06
207,62
300,61
-
Table C.9: GB/s for the BeamFormer 1.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
5,3632
5,9586
6,5909
6,2040
5,9468
5,9403
5,7705
0,8306
8,3534
9,0702
9,7860
9,2169
8,8830
8,8913
8,7002
1,2423
12,611
13,822
14,914
14,572
14,402
14,561
14,342
2,0747
16,721
18,794
21,158
23,018
24,038
25,088
25,319
3,6993
19,800
23,018
27,769
33,032
37,738
41,564
43,174
6,7382
18,573
24,974
36,001
50,179
62,839
72,916
76,807
12,256
10,280
13,742
21,050
34,833
55,191
76,553
91,032
24,029
6,5198
8,7043
13,187
20,232
29,616
37,336
41,219
-
Table C.10: GB/s for the BeamFormer 1.2.1.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
8,4429
9,4315
10,279
10,653
9,8311
9,3980
9,1670
9,1567
16,366
18,454
20,279
21,012
19,662
18,796
18,334
18,313
32,372
36,621
40,409
41,454
39,062
37,592
36,668
36,627
60,405
69,832
78,125
79,656
76,844
74,573
73,336
73,254
99,101
117,37
129,95
135,41
137,06
138,15
139,90
140,26
159,98
189,39
213,41
231,04
243,50
254,50
261,69
265,10
200,32
252,10
301,72
335,05
368,36
390,95
401,93
402,08
186,56
240,0
290,85
335,05
367,64
390,95
404,16
-
Table C.11: GB/s for the BeamFormer 1.2.2
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
5,6340
6,2003
6,9837
6,9846
6,5317
6,3086
6,1428
0,8492
9,6055
10,084
11,042
10,793
9,9571
9,5585
9,2615
1,2772
17,673
18,121
19,211
18,262
16,795
16,032
15,498
2,1330
30,838
31,722
33,154
32,226
29,952
28,776
27,973
3,8448
49,162
51,221
52,315
51,809
50,747
50,373
48,973
7,0696
78,720
82,259
85,704
87,756
89,170
92,169
89,160
13,654
42,513
66,595
85,790
96,612
101,08
106,92
110,19
25,126
41,468
43,711
46,110
48,377
49,793
50,866
48,897
-
Table C.12: GB/s for the BeamFormer 1.2.2.1
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
0,4738
0,7982
1,3987
184,46
203,53
213,36
219,72
215,12
0,9514
1,5604
170,11
190,63
207,91
218,25
220,68
215,12
114,32
142,04
173,61
193,41
210,74
217,63
215,95
213,97
116,21
143,78
175,05
194,59
210,45
220,78
217,90
217,06
116,94
144,45
175,78
195,55
211,90
219,99
217,90
214,16
117,18
145,34
175,78
195,55
212,26
219,59
216,92
215,12
117,18
144,78
176,24
195,85
211,53
218,60
216,92
214,64
117,49
145,34
176,47
196,15
212,44
219,10
217,17
-
Table C.13: GB/s for the BeamFormer 1.3
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
2,4516
3,3216
4,9284
7,5461
11,935
14,819
14,128
11,673
2,3626
3,0241
4,3174
6,4280
9,9311
12,428
11,990
9,7487
2,3129
2,8483
3,9612
5,7805
8,7577
10,907
10,605
8,9206
2,2846
2,7498
3,7581
5,4253
8,1708
10,025
9,8047
8,4119
2,2588
2,6877
3,6458
5,2389
7,8125
9,375
9,2410
7,9282
2,2403
2,6584
3,5855
5,1312
7,6308
9,0169
8,9095
7,6261
2,2362
2,6373
3,5221
5,0845
7,5393
8,8468
8,8709
7,7119
2,2309
2,6207
3,4939
5,0722
7,4672
9,0681
8,7411
-
Table C.14: GB/s for the BeamFormer 1.4
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
115,79
116,25
118,73
121,43
121,75
122,70
122,54
122,95
52,315
73,242
91,552
105,10
113,22
117,92
120,57
121,75
41,118
61,354
82,092
98,476
109,64
116,09
119,23
120,96
37,143
56,887
78,125
95,663
107,75
114,85
119,04
120,96
35,511
54,824
75,910
93,984
106,68
114,50
118,57
120,72
34,594
53,879
75,0
93,283
106,38
114,06
118,34
120,60
34,277
53,571
75,0
92,592
106,00
113,85
117,64
120,60
34,090
53,191
74,257
92,879
105,82
114,28
118,22
120,30
33,936
53,191
74,257
92,592
106,19
113,74
118,22
120,45
Table C.15: GB/s for the BeamFormer 1.5 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
130,93
135,21
137,19
138,54
139,40
139,75
139,75
81,758
103,40
118,67
128,71
133,92
136,77
138,46
69,103
92,516
111,38
124,30
131,57
135,54
137,40
64,139
87,890
108,17
122,28
130,43
135,33
137,40
61,813
85,877
106,38
121,13
129,49
134,83
137,40
60,810
84,745
105,63
120,80
129,49
134,57
137,14
60,240
84,112
105,26
120,40
129,49
134,57
137,14
60,0
84,112
105,26
120,20
129,72
134,57
137,14
Table C.16: GB/s for the BeamFormer 1.5 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
168,91
172,81
174,82
176,26
176,47
176,99
125,83
147,05
160,94
168,53
172,91
174,92
111,44
136,36
154,63
165,28
170,94
173,91
105,63
132,15
151,51
163,48
170,21
173,91
102,38
129,87
150,0
162,16
169,61
173,28
100,84
128,61
149,06
162,16
169,31
172,97
100,0
127,65
148,60
161,61
168,42
172,97
Table C.17: GB/s for the BeamFormer 1.5 8x8
Appendix D
OpenCL BeamFormer measurements
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
0,92
0,94
0,958
0,944
0,996
0,981
1,0
1,09
0,919
0,949
0,939
0,944
0,964
0,97
1,04
1,12
0,926
0,937
0,949
0,959
0,978
1,08
1,07
1,2
0,918
0,953
0,96
0,977
0,987
1,02
1,11
1,31
0,943
1,01
1,01
0,992
1,02
1,1
1,26
1,57
0,961
0,978
0,997
1,02
1,09
1,25
1,51
2,09
1,01
1,04
1,07
1,13
1,26
1,52
2,05
3,1
1,12
1,17
1,22
1,34
1,59
2,1
3,13
5,16
1,42
1,4
1,54
1,78
2,28
3,28
5,26
9,26
Table D.1: Execution time in seconds for the BeamFormer 1.5-opencl 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
0,959
0,946
0,952
0,965
0,987
1,03
1,14
0,954
0,952
0,966
0,986
1,01
1,06
1,18
0,963
0,968
0,967
0,989
1,02
1,09
1,25
0,974
0,983
0,987
1,02
1,08
1,17
1,38
0,997
1,02
1,02
1,06
1,15
1,32
1,69
1,06
1,07
1,1
1,17
1,32
1,62
2,22
1,19
1,19
1,26
1,4
1,68
2,23
3,34
1,35
1,42
1,55
1,81
2,36
3,43
5,59
Table D.2: Execution time in seconds for the BeamFormer 1.5-opencl 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
1,05
1,05
1,06
1,09
1,13
1,25
1,05
1,05
1,06
1,09
1,15
1,29
1,06
1,06
1,08
1,14
1,21
1,38
1,09
1,09
1,13
1,19
1,31
1,59
1,14
1,16
1,21
1,31
1,54
1,96
1,31
1,29
1,38
1,59
1,96
2,73
1,44
1,53
1,72
2,09
2,82
4,3
Table D.3: Execution time in seconds for the BeamFormer 1.5-opencl 8x8
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
63,086
59,364
57,802
56,772
56,437
56,120
55,998
55,753
59,489
57,558
57,074
56,287
56,194
55,998
55,753
55,723
57,256
56,772
56,139
56,194
55,998
55,938
55,815
55,800
56,177
56,139
55,971
55,998
55,938
55,815
55,800
55,563
55,844
55,750
55,630
55,753
55,631
55,342
55,563
55,559
55,458
55,630
55,570
55,723
55,800
55,563
55,559
55,443
55,266
55,570
55,631
55,800
55,792
55,559
55,500
55,385
55,387
55,448
55,342
55,792
55,673
55,557
55,385
55,527
55,267
55,342
55,563
55,673
55,729
55,385
55,527
55,384
Table D.4: GFLOP/s for the BeamFormer 1.5-opencl 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
105,04
102,24
100,93
100,27
99,817
99,838
99,413
101,24
100,43
99,776
99,817
99,838
99,413
99,356
99,941
99,529
99,569
99,838
99,413
99,201
99,637
98,312
98,588
98,605
98,797
98,893
98,865
98,851
97,865
98,605
98,797
98,740
98,865
98,851
98,844
97,403
98,492
98,588
98,865
98,851
98,653
99,032
97,589
98,133
98,865
98,851
98,653
99,032
98,552
97,384
98,105
98,469
98,653
99,032
98,552
98,790
Table D.5: GFLOP/s for the BeamFormer 1.5-opencl 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
146,48
144,86
143,78
143,90
142,90
143,06
143,81
143,52
142,58
142,90
143,06
142,65
142,47
142,58
142,90
142,73
142,81
143,10
142,58
142,25
142,73
142,81
143,10
142,67
141,60
142,40
142,65
142,28
142,67
142,86
141,43
142,16
142,28
142,67
142,66
142,45
141,51
142,28
142,67
142,66
142,45
142,96
Table D.6: GFLOP/s for the BeamFormer 1.5-opencl 8x8
b
s
2
4
8
16
32
64
128
256
2
4
8
16
32
64
128
256
512
123,09
123,35
124,13
124,00
124,33
124,17
124,17
123,76
123,61
123,61
124,66
124,00
124,33
124,17
123,76
123,76
122,96
124,00
123,68
124,33
124,17
124,17
123,96
123,96
122,70
123,68
123,84
124,17
124,17
123,96
123,96
123,45
123,03
123,35
123,35
123,76
123,55
122,95
123,45
123,45
122,70
123,35
123,35
123,76
123,96
123,45
123,45
123,20
122,54
123,35
123,55
123,96
123,96
123,45
123,32
123,07
122,95
123,15
122,95
123,96
123,71
123,45
123,07
123,39
122,74
122,95
123,45
123,71
123,83
123,07
123,39
123,07
Table D.7: GB/s for the BeamFormer 1.5-opencl 2x2
b
s
4
8
16
32
64
128
256
4
8
16
32
64
128
256
512
138,13
139,23
139,92
140,27
140,27
140,62
140,18
137,86
139,23
139,57
140,27
140,62
140,18
140,18
138,54
139,23
139,92
140,62
140,18
139,96
140,62
137,53
138,54
138,88
139,31
139,53
139,53
139,53
137,53
138,88
139,31
139,31
139,53
139,53
139,53
137,19
138,88
139,10
139,53
139,53
139,26
139,80
137,61
138,46
139,53
139,53
139,26
139,80
139,13
137,40
138,46
138,99
139,26
139,80
139,13
139,46
Table D.8: GB/s for the BeamFormer 1.5-opencl 4x4
b
s
8
16
32
64
128
256
8
16
32
64
128
256
512
136,86
137,86
138,12
138,88
138,24
138,56
136,86
137,86
137,61
138,24
138,56
138,24
136,86
137,61
138,24
138,24
138,40
138,72
137,61
137,61
138,24
138,40
138,72
138,32
136,98
137,93
138,24
137,93
138,32
138,52
136,98
137,77
137,93
138,32
138,32
138,12
137,14
137,93
138,32
138,32
138,12
138,62
Table D.9: GB/s for the BeamFormer 1.5-opencl 8x8
Appendix E
Finding the best station-beam block size
b
s
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8
32,236
43,902
53,250
59,940
65,447
69,912
73,607
76,949
43,389
62,301
77,311
89,111
99,066
107,10
114,20
120,60
49,937
72,417
91,285
105,80
119,24
130,19
139,93
148,14
49,221
72,710
91,753
108,01
121,61
133,28
143,86
153,19
51,434
76,605
97,360
115,23
131,63
144,33
156,02
166,57
53,435
79,582
102,32
121,14
138,10
152,38
165,96
176,86
54,127
81,097
103,19
122,79
139,62
154,33
167,30
179,54
53,539
78,405
98,695
115,23
129,08
141,10
151,49
160,54
Table E.1: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x1 to 8x8
b
s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
50,181
71,272
86,975
99,421
110,27
117,71
124,62
130,50
47,883
65,853
79,068
89,111
97,583
103,92
109,68
114,02
46,196
61,170
72,662
79,909
86,182
92,352
94,830
99,978
40,993
52,992
61,394
66,008
69,346
72,551
74,892
76,208
37,664
45,182
50,252
53,797
56,494
58,125
59,264
60,390
34,271
41,707
46,683
49,902
52,088
53,833
55,094
56,012
32,853
38,565
41,416
43,492
44,770
45,922
46,286
47,331
31,434
36,733
39,665
41,527
42,955
43,882
44,702
45,454
Table E.2: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 1x9 to 8x16
b
s
9
10
11
12
13
14
15
16
1
2
3
4
5
6
7
8
79,815
82,081
83,640
85,392
86,945
88,697
89,750
90,866
126,19
129,74
134,68
138,10
141,40
144,37
147,07
149,53
155,63
161,17
166,86
173,06
176,85
180,77
185,84
189,29
161,70
168,17
175,97
182,19
187,20
193,04
197,80
203,02
176,19
184,63
192,90
200,45
206,73
213,31
219,19
224,64
188,02
197,10
205,50
214,39
220,74
226,53
231,83
240,60
190,39
198,72
208,00
214,69
222,50
229,72
236,40
240,89
167,25
174,33
181,87
186,35
191,48
196,15
200,40
203,23
Table E.3: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x1 to 16x8
b
s
9
10
11
12
13
14
15
16
9
10
11
12
13
14
15
16
134,83
139,95
143,80
146,56
149,62
152,97
154,84
157,08
117,71
120,87
124,08
126,47
128,58
130,07
132,53
134,41
102,38
105,05
107,38
109,71
111,21
112,27
113,72
113,56
78,000
78,840
80,176
80,740
81,644
82,309
82,771
83,420
61,004
61,511
62,023
62,624
63,146
63,531
63,732
64,170
56,842
57,610
58,118
58,618
59,115
59,137
59,654
60,116
47,644
48,050
48,392
48,426
48,853
48,851
49,201
49,178
45,926
46,272
46,723
46,735
47,092
47,077
47,369
47,629
Table E.4: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 9x9 to 16x16
b
s
2
4
8
16
32
64
128
256
1
2
3
4
5
6
7
8
44,160
59,940
77,424
90,866
98,960
103,35
105,50
106,01
62,301
88,229
120,02
149,53
170,35
183,63
190,22
192,23
72,185
106,36
147,85
189,54
221,85
244,58
256,22
263,12
72,534
107,36
153,19
201,54
246,18
275,12
294,25
305,54
76,762
115,23
167,45
224,64
275,71
316,30
341,78
357,92
80,007
120,78
177,49
240,60
299,67
344,95
374,77
392,16
80,972
122,46
179,36
242,61
298,11
336,90
361,16
375,82
78,303
115,35
160,54
203,23
237,01
259,55
271,45
280,98
Table E.5: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x1 to 256x8
b
s
2
4
8
16
32
64
128
256
9
10
11
12
13
14
15
16
71,272
99,421
130,50
157,08
175,57
186,65
193,70
197,22
65,853
89,709
114,02
134,04
147,67
156,32
160,96
163,40
61,170
79,477
97,505
113,56
123,16
126,86
128,27
127,79
53,306
65,737
76,402
83,182
87,387
90,070
90,941
91,478
45,182
53,964
60,502
64,105
66,319
67,183
67,817
67,857
41,873
49,769
56,012
60,116
62,119
62,899
63,454
63,776
38,565
43,682
47,331
49,178
50,209
50,755
50,898
51,248
36,509
41,527
45,403
47,629
48,610
49,049
49,396
49,491
Table E.6: GFLOP/s for the OpenCL BeamFormer 1.5: block sizes from 2x9 to
256x16
b
s
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8
29,533
39,512
45,978
51,410
56,584
61,035
63,849
65,668
41,996
58,196
68,439
78,627
86,682
93,785
99,800
102,59
42,648
60,870
74,550
86,236
95,861
104,86
111,89
118,70
45,891
67,179
83,329
97,567
110,67
119,78
130,13
140,14
48,906
71,498
90,285
105,58
119,65
131,86
143,49
153,14
50,581
74,452
94,738
112,32
126,98
141,71
153,85
165,21
52,018
77,395
98,462
116,08
133,10
148,05
161,04
172,15
52,997
79,024
100,47
119,88
137,52
152,28
167,21
179,18
Table E.7: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x1 to 8x8
b
s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
52,216
78,565
100,78
120,54
138,36
153,12
167,56
179,18
51,318
77,156
98,462
118,28
136,94
151,38
166,58
179,18
50,404
75,761
97,294
114,87
130,84
144,85
157,23
167,23
45,776
66,731
83,496
97,212
109,24
118,31
127,21
134,98
42,205
60,243
74,550
86,023
95,429
103,28
110,35
115,64
39,559
57,120
69,580
79,631
88,088
94,443
100,38
105,14
38,146
53,880
64,960
73,713
80,942
86,086
91,051
94,543
35,212
49,027
58,388
65,603
71,632
76,331
80,020
83,271
Table E.8: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 1x9 to 8x16
b
s
9
10
11
12
13
14
15
16
1
2
3
4
5
6
7
8
68,833
68,766
70,571
72,449
74,687
76,049
77,539
78,262
107,95
111,71
115,61
118,34
121,88
122,59
125,60
127,86
125,62
130,93
136,22
141,37
145,72
147,96
151,13
154,69
147,83
155,30
160,35
166,34
172,56
177,02
178,79
183,10
163,44
170,58
178,66
184,76
189,20
195,70
201,74
205,60
174,91
183,83
191,74
200,75
207,30
213,31
220,61
223,95
185,00
195,14
202,66
211,17
219,00
226,24
231,27
237,52
189,93
201,26
210,07
219,72
228,68
233,77
241,55
248,83
Table E.9: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x1 to 16x8
b
s
9
10
11
12
13
14
15
16
9
10
11
12
13
14
15
16
191,01
200,36
210,25
219,36
226,33
234,12
242,80
248,12
190,57
200,94
210,40
219,07
228,35
235,71
242,51
251,41
178,13
187,02
195,08
203,44
210,13
216,26
221,91
226,16
141,84
147,35
152,24
157,17
162,17
165,68
168,34
172,27
120,99
124,93
129,14
132,53
135,57
138,31
141,12
143,03
109,25
113,11
116,23
118,72
121,45
123,40
125,89
127,45
98,413
101,12
104,12
106,16
108,16
109,76
111,38
112,84
85,542
88,099
89,733
91,439
93,217
94,545
95,733
96,803
Table E.10: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 9x9 to 16x16
b
s
2
4
8
16
32
64
128
256
1
2
3
4
5
6
7
8
39,721
50,823
65,668
78,643
86,725
92,719
95,353
96,861
57,307
77,713
102,18
127,02
144,54
156,32
162,69
166,29
61,035
86,051
119,45
155,35
183,45
203,38
215,22
221,07
67,029
97,567
140,14
183,10
223,15
250,52
267,12
278,38
71,092
105,74
153,89
205,60
251,04
285,72
305,72
318,49
74,699
111,85
165,58
223,95
277,85
319,68
345,02
360,79
76,941
115,51
173,85
237,52
296,79
342,12
369,78
386,87
78,507
119,34
179,18
247,25
314,06
362,87
392,85
411,86
Table E.11: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x1 to 256x8
b
s
2
4
8
16
32
64
128
256
9
10
11
12
13
14
15
16
78,565
120,54
179,18
248,12
313,35
362,19
395,08
412,48
76,605
119,34
179,18
250,11
320,12
373,81
408,51
427,07
75,761
114,87
168,25
226,16
278,34
314,29
336,83
351,83
66,731
97,805
134,98
172,27
201,19
220,36
231,78
237,98
60,243
86,023
115,64
143,36
163,39
176,90
183,38
187,55
56,812
79,971
105,14
127,21
142,68
153,06
157,87
161,14
53,369
73,713
94,781
112,66
125,52
133,20
137,46
139,71
49,027
66,418
83,098
96,924
106,73
112,92
116,33
117,90
Table E.12: GFLOP/s for the CUDA BeamFormer 1.5: block sizes from 2x9 to
256x16
Appendix F
Data structures
namespace LOFAR {
namespace RTCP {

// Data which needs to be transported between CN, ION and Storage.
// Apart from read() and write() functionality, the data is augmented
// with a sequence number in order to detect missing data. Furthermore,
// an integration operator += can be defined to reduce the data.

// Endianness:
//  * CN/ION are big endian (ppc)
//  * Storage is little endian (intel)
//  * Stations are little endian
//
// Endianness is swapped by:
//  * Data received by the CN from the stations (transported via the ION)
//  * Data received by Storage from the ION
//
// WARNING: We consider all data streams to be big endian, and will also write
// them as such. sequenceNumber is the only field converted from and to big endian.

class StreamableData {
  public:
    // A stream is integratable if it supports the += operator to combine
    // several objects into one.
    StreamableData(bool isIntegratable): integratable(isIntegratable) {}

    // suppress warning by defining a virtual destructor
    virtual ~StreamableData() {}

    // return a copy of the object
    virtual StreamableData *clone() const = 0;

    virtual size_t requiredSize() const = 0;
    virtual void allocate(Allocator &allocator = heapAllocator) = 0;

    virtual void read(Stream *, bool withSequenceNumber);
    virtual void write(Stream *, bool withSequenceNumber, unsigned align = 0);

    bool isIntegratable() const
    { return integratable; }

    virtual StreamableData &operator += (const StreamableData &)
    { LOG_WARN("Integration not implemented."); return *this; }

    uint32_t sequenceNumber;

  protected:
    const bool integratable;

    // a subclass should override these to marshall its data
    virtual void readData(Stream *) = 0;
    virtual void writeData(Stream *) = 0;
};
// A typical data set contains a MultiDimArray of tuples and a set of flags.
template <typename T = fcomplex, unsigned DIM = 4> class SampleData : public StreamableData
{
  public:
    typedef typename MultiDimArray<T, DIM>::ExtentList ExtentList;

    SampleData(bool isIntegratable, const ExtentList &extents, unsigned nrFlags);

    virtual SampleData *clone() const { return new SampleData(*this); }
    virtual size_t requiredSize() const;
    virtual void allocate(Allocator &allocator = heapAllocator);

    MultiDimArray<T, DIM> samples;
    std::vector<SparseSet<unsigned> > flags;

  protected:
    virtual void checkEndianness();
    virtual void readData(Stream *);
    virtual void writeData(Stream *);

  private:
    // copy the ExtentList instead of using a reference,
    // as boost by default uses a global one (boost::extents)
    const ExtentList extents;
    const unsigned   nrFlags;
    bool             itsHaveWarnedLittleEndian;
};
inline void StreamableData::read(Stream *str, bool withSequenceNumber)
{
  if (withSequenceNumber) {
    str->read(&sequenceNumber, sizeof sequenceNumber);

#if !defined WORDS_BIGENDIAN
    dataConvert(LittleEndian, &sequenceNumber, 1);
#endif
  }

  readData(str);
}

inline void StreamableData::write(Stream *str, bool withSequenceNumber, unsigned align)
{
  if (withSequenceNumber) {
#if !defined WORDS_BIGENDIAN
    if (align > 1) {
      if (align < sizeof(uint32_t))
        THROW(AssertError, "Sizeof alignment < sizeof sequencenumber");

      void     *sn_buf;
      uint32_t sn = sequenceNumber;

      if (posix_memalign(&sn_buf, align, align) != 0) {
        THROW(InterfaceException, "could not allocate data");
      }

      try {
        dataConvert(BigEndian, &sn, 1);
        memcpy(sn_buf, &sn, sizeof sn);
        str->write(sn_buf, align);
      } catch (...) {
        free(sn_buf);
        throw;
      }

      free(sn_buf);
    } else {
      uint32_t sn = sequenceNumber;

      dataConvert(LittleEndian, &sn, 1);
      str->write(&sn, sizeof sn);
    }
#else
    if (align > 1) {
      if (align < sizeof(uint32_t))
        THROW(AssertError, "Sizeof alignment < sizeof sequencenumber");

      void *sn_buf;

      if (posix_memalign(&sn_buf, align, align) != 0) {
        THROW(InterfaceException, "could not allocate data");
      }

      try {
        memcpy(sn_buf, &sequenceNumber, sizeof sequenceNumber);
        str->write(sn_buf, align);
      } catch (...) {
        free(sn_buf);
        throw;
      }

      free(sn_buf);
    } else {
      str->write(&sequenceNumber, sizeof sequenceNumber);
    }
#endif
  }

  writeData(str);
}
template <typename T, unsigned DIM> inline SampleData<T, DIM>::SampleData(bool isIntegratable,
  const ExtentList &extents, unsigned nrFlags)
:
  StreamableData(isIntegratable),
  flags(0),
  extents(extents),
  nrFlags(nrFlags),
  itsHaveWarnedLittleEndian(false)
{
}

template <typename T, unsigned DIM> inline size_t SampleData<T, DIM>::requiredSize() const
{
  return align(MultiDimArray<T, DIM>::nrElements(extents) * sizeof(T), 32);
}

template <typename T, unsigned DIM> inline void
SampleData<T, DIM>::allocate(Allocator &allocator)
{
  samples.resize(extents, 32, allocator);
  flags.resize(nrFlags);
}

template <typename T, unsigned DIM> inline void SampleData<T, DIM>::checkEndianness()
{
#if 0 && !defined WORDS_BIGENDIAN
  dataConvert(LittleEndian, samples.origin(), samples.num_elements());
#endif
}

template <typename T, unsigned DIM> inline void SampleData<T, DIM>::readData(Stream *str)
{
  str->read(samples.origin(), samples.num_elements() * sizeof(T));
  checkEndianness();
}

template <typename T, unsigned DIM> inline void SampleData<T, DIM>::writeData(Stream *str)
{
#if 0 && !defined WORDS_BIGENDIAN
  if (!itsHaveWarnedLittleEndian) {
    itsHaveWarnedLittleEndian = true;
    LOG_WARN("writing data in little endian.");
  }
  // THROW(AssertError, "not implemented: think about endianness");
#endif

  str->write(samples.origin(), samples.num_elements() * sizeof(T));
}

} // namespace RTCP
} // namespace LOFAR
Listing F.1: SampleData
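The least obvious part of Listing F.1 is the aligned, big endian write of the sequence number in StreamableData::write. The following standalone sketch reproduces that pattern using plain C library and POSIX calls; the byteSwap32 helper and the FILE stream are stand-ins for LOFAR's dataConvert() and Stream abstractions, which are not shown in this appendix, so the code is an illustration rather than the original implementation.

// Minimal, standalone sketch of the aligned sequence-number write pattern
// of StreamableData::write. byteSwap32 and the FILE stream replace LOFAR's
// dataConvert() and Stream classes for the purpose of this illustration.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static uint32_t byteSwap32(uint32_t v)
{
  // reverse the byte order; on a little endian host writing the swapped value
  // yields big endian storage, mirroring the #if !defined WORDS_BIGENDIAN branch
  return (v >> 24) | ((v >> 8) & 0x0000FF00u)
       | ((v << 8) & 0x00FF0000u) | (v << 24);
}

static bool writeSequenceNumber(std::FILE *out, uint32_t sequenceNumber, size_t align)
{
  void *sn_buf = nullptr;

  // allocate one aligned block of 'align' bytes, as in the listing
  if (posix_memalign(&sn_buf, align, align) != 0)
    return false;

  uint32_t sn = byteSwap32(sequenceNumber);

  std::memset(sn_buf, 0, align);
  std::memcpy(sn_buf, &sn, sizeof sn);

  bool ok = std::fwrite(sn_buf, 1, align, out) == align;

  std::free(sn_buf);
  return ok;
}

int main()
{
  std::FILE *out = std::fopen("sequence.bin", "wb");

  if (out == nullptr)
    return 1;

  bool ok = writeSequenceNumber(out, 42, 16);

  std::fclose(out);
  return ok ? 0 : 1;
}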
namespace LOFAR {
namespace RTCP {

// Note: struct must remain copyable to avoid ugly constructions when passing it around
struct SubbandMetaData
{
  public:
    SubbandMetaData(unsigned nrSubbands, unsigned nrBeams, size_t alignment = 16,
                    Allocator &allocator = heapAllocator);
    virtual ~SubbandMetaData();

    struct beamInfo {
      float  delayAtBegin, delayAfterEnd;
      double beamDirectionAtBegin[3], beamDirectionAfterEnd[3];
    };

    struct marshallData {
      unsigned char flagsBuffer[132];
      unsigned      alignmentShift;

      // itsNrBeams elements will really be allocated, so this array needs to
      // be the last element. Also, ISO C++ forbids zero-sized arrays, so we use size 1.
      struct beamInfo beams[1];
    };

    SparseSet<unsigned> getFlags(unsigned subband) const;
    void                setFlags(unsigned subband, const SparseSet<unsigned> &);

    unsigned            alignmentShift(unsigned subband) const;
    unsigned            &alignmentShift(unsigned subband);

    struct beamInfo     *beams(unsigned subband) const;
    struct beamInfo     *beams(unsigned subband);

    struct marshallData &subbandInfo(unsigned subband) const;
    struct marshallData &subbandInfo(unsigned subband);

    void read(Stream *str);
    void write(Stream *str) const;

    // size of the information for one subband
    const unsigned itsSubbandInfoSize;

  private:
    const unsigned itsNrSubbands, itsNrBeams;

    // size of the information for all subbands
    const unsigned itsMarshallDataSize;

    // the pointer to all our data, which consists of struct marshallData[itsNrSubbands],
    // except for the fact that the elements are spaced apart more than
    // sizeof(struct marshallData) to make room for extra beams which are not
    // defined in the marshallData structure.
    //
    // Access elements through subbandInfo(subband).
    char *const itsMarshallData;

    // remember the pointer at which we allocated the memory for the marshallData
    Allocator &itsAllocator;
};
inline SubbandMetaData::SubbandMetaData(unsigned nrSubbands, unsigned nrBeams,
                                        size_t alignment, Allocator &allocator)
:
  // Size of the data we need to allocate. Note that marshallData already contains
  // the size of one beamInfo.
  itsSubbandInfoSize(sizeof(struct marshallData) + (nrBeams - 1) * sizeof(struct beamInfo)),
  itsNrSubbands(nrSubbands),
  itsNrBeams(nrBeams),
  itsMarshallDataSize(nrSubbands * itsSubbandInfoSize),
  itsMarshallData(static_cast<char *>(allocator.allocate(itsMarshallDataSize, alignment))),
  itsAllocator(allocator)
{
#if defined USE_VALGRIND
  memset(itsMarshallData, 0, itsMarshallDataSize);
#endif
}

inline SubbandMetaData::~SubbandMetaData()
{
  itsAllocator.deallocate(itsMarshallData);
}

inline SparseSet<unsigned> SubbandMetaData::getFlags(unsigned subband) const
{
  SparseSet<unsigned> flags;

  flags.unmarshall(subbandInfo(subband).flagsBuffer);
  return flags;
}

inline void SubbandMetaData::setFlags(unsigned subband, const SparseSet<unsigned> &flags)
{
  ssize_t size = flags.marshall(&subbandInfo(subband).flagsBuffer,
                                sizeof subbandInfo(subband).flagsBuffer);

  assert(size >= 0);
}

inline unsigned SubbandMetaData::alignmentShift(unsigned subband) const
{
  return subbandInfo(subband).alignmentShift;
}

inline unsigned &SubbandMetaData::alignmentShift(unsigned subband)
{
  return subbandInfo(subband).alignmentShift;
}

inline struct SubbandMetaData::beamInfo *SubbandMetaData::beams(unsigned subband) const
{
  return &subbandInfo(subband).beams[0];
}

inline struct SubbandMetaData::beamInfo *SubbandMetaData::beams(unsigned subband)
{
  return &subbandInfo(subband).beams[0];
}

inline struct SubbandMetaData::marshallData
&SubbandMetaData::subbandInfo(unsigned subband) const
{
  // calculate the array stride ourself,
  // since C++ does not know the proper size of the marshallData elements
  return *reinterpret_cast<struct marshallData *>(itsMarshallData +
                                                  (subband * itsSubbandInfoSize));
}

inline struct SubbandMetaData::marshallData &SubbandMetaData::subbandInfo(unsigned subband)
{
  // calculate the array stride ourself,
  // since C++ does not know the proper size of the marshallData elements
  return *reinterpret_cast<struct marshallData *>(itsMarshallData +
                                                  (subband * itsSubbandInfoSize));
}

inline void SubbandMetaData::read(Stream *str)
{
  // TODO: endianness
  str->read(itsMarshallData, itsMarshallDataSize);
}

inline void SubbandMetaData::write(Stream *str) const
{
  // TODO: endianness
  str->write(itsMarshallData, itsMarshallDataSize);
}

} // namespace RTCP
} // namespace LOFAR
Listing F.2: SubbandMetaData
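Listing F.2 relies on a manually computed stride: each subband record ends with a size-1 beamInfo array that is over-allocated to hold nrBeams entries, so consecutive records are spaced more than sizeof(struct marshallData) bytes apart and are addressed through pointer arithmetic in subbandInfo(). The following standalone sketch illustrates the same layout; the struct names mirror the listing, but this is only an illustration, not the LOFAR code itself.

// Standalone sketch of the over-allocated, manually strided layout used by
// SubbandMetaData: records of recordSize bytes, each ending with nrBeams
// beamInfo entries grown out of a declared size-1 array.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct beamInfo {
  float  delayAtBegin, delayAfterEnd;
  double beamDirectionAtBegin[3], beamDirectionAfterEnd[3];
};

struct marshallData {
  unsigned char flagsBuffer[132];
  unsigned      alignmentShift;
  beamInfo      beams[1];   // really nrBeams entries; ISO C++ forbids zero-sized arrays
};

int main()
{
  const unsigned nrSubbands = 4, nrBeams = 3;

  // each record holds (nrBeams - 1) beamInfo entries beyond the declared one,
  // exactly as itsSubbandInfoSize is computed in the listing
  const size_t recordSize = sizeof(marshallData) + (nrBeams - 1) * sizeof(beamInfo);

  char *storage = static_cast<char *>(std::calloc(nrSubbands, recordSize));

  if (storage == nullptr)
    return 1;

  // address record 'subband' by computing the stride ourselves,
  // as subbandInfo(subband) does; this relies on the same over-allocation idiom
  for (unsigned subband = 0; subband < nrSubbands; subband++) {
    marshallData *info = reinterpret_cast<marshallData *>(storage + subband * recordSize);

    info->beams[nrBeams - 1].delayAtBegin = 0.0f;   // last beam of this subband
  }

  std::printf("record size: %zu bytes\n", recordSize);

  std::free(storage);
  return 0;
}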
namespace LOFAR {
namespace RTCP {

class BeamFormedData : public SampleData<fcomplex, 4>
{
  public:
    typedef SampleData<fcomplex, 4> SuperType;

    BeamFormedData(unsigned nrBeams, unsigned nrChannels,
                   unsigned nrSamplesPerIntegration);

    virtual BeamFormedData *clone() const { return new BeamFormedData(*this); }
};

inline BeamFormedData::BeamFormedData(unsigned nrBeams,
  unsigned nrChannels, unsigned nrSamplesPerIntegration)
  // The "| 2" significantly improves transpose speeds for particular
  // numbers of stations due to cache conflict effects. The extra memory
  // is not used.
:
  SuperType::SampleData(false,
    boost::extents[nrBeams][nrChannels][nrSamplesPerIntegration | 2][NR_POLARIZATIONS],
    nrBeams)
{
}

} // namespace RTCP
} // namespace LOFAR
Listing F.3: BeamFormedData
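The "| 2" in Listing F.3 pads the time dimension by setting bit 1 of the sample count. A plausible reading of the comment is that, when the sample count is a multiple of a large power of two, the padding keeps consecutive rows from mapping to the same cache sets during the transpose; the small program below only shows the arithmetic effect of the padding on a few hypothetical sizes.

// Illustration of the "| 2" padding applied to the time dimension in
// BeamFormedData. The sizes below are arbitrary examples, not values
// taken from the LOFAR pipeline.
#include <cstdio>

int main()
{
  const unsigned sizes[] = { 768, 1024, 3056, 196608 };

  for (unsigned n : sizes)
    std::printf("%u samples -> padded to %u\n", n, n | 2);

  return 0;
}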