HPC Performance Analysis
Q2 2012

Memory Bandwidth Scaling in Multicore Multiprocessors

George Ruddy
Office of Information Technology / Department of Computer Science

Tom Crockett
Office of Information Technology
In 1965, Intel co-founder Gordon Moore observed that the number of transistors on an integrated circuit
(and hence its computing power) doubles roughly every two years. This growth rate, known as Moore’s Law, has
held remarkably constant through the intervening decades, and is projected to continue for at least a few
more years. In microprocessors, steadily increasing clock frequencies boosted performance even further
until heat dissipation considerations slowed this trend dramatically in the middle of the last decade. Since
then, microprocessor architects have enhanced processing power by increasing the number of individual
CPUs, or cores, which are placed on each chip.
To date, most computer systems are still based on some variant of the classic von Neumann architecture,
with processors and memory residing on different physical devices, connected electrically by a memory or
data bus. A combination of technical, cost, and power considerations limits the speed of memory accesses,
resulting in a serious performance bottleneck for many applications. The rise of multicore processors has
only made the problem worse.
Many high-end computing systems, including William and Mary’s SciClone cluster, are built from commodity servers which incorporate one or more multicore processors. As of 2012, SciClone includes three
generations of microprocessor technology, providing an opportunity to examine memory bandwidth trends
as a function of processing power over the past few years. We have employed the well-known STREAM
benchmark to characterize memory performance of the following systems:
System       Year  Processor     Proc. Speed  Mem. Speed  No. Procs.  Cores per Proc.  Total Cores
Sun V20z     2005  Opteron 250   2.4 GHz      333 MHz          2             1              2
HP SL390s    2011  Xeon X5672    3.2 GHz      1333 MHz         2             4              8
Dell SC1435  2007  Opteron 2218  2.6 GHz      667 MHz          2             2              4
STREAM is based on several simple operations involving very long vectors, thereby placing maximal
demands on memory relative to the operation count. It represents essentially worst-case scenarios for serial
memory access patterns, but nonetheless ones which are common in practice. We focus here on STREAM’s
Add benchmark, C = A + B, where A, B, and C are vectors of double-precision (8-byte) floating-point numbers. Each operation fetches two operands and stores a third, for 24 bytes of memory traffic per vector index. When running on multiple cores, STREAM uses the industry-standard OpenMP programming API to parallelize across loop iterations.
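To make the access pattern concrete, the sketch below shows the shape of the Add kernel in C with an OpenMP parallel loop. It is an illustration only, not the STREAM source (the runs reported here used the Fortran version of the benchmark); the array names follow the text above.

    /*
     * Minimal sketch of the Add access pattern, C = A + B, with an OpenMP
     * parallel loop.  Illustration only -- this is not the STREAM source.
     * Compile with OpenMP enabled, e.g. gcc -O3 -fopenmp add_sketch.c
     */
    #include <stdio.h>

    #define N 20000000L                /* 20,000,000 doubles = 160 MB per vector */

    static double A[N], B[N], C[N];    /* static arrays, as in STREAM itself */

    int main(void)
    {
        long i;

        /* Initialize the operands so every page is touched before the kernel. */
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
        }

        /* Add kernel: each iteration loads A[i] and B[i] and stores C[i],
         * i.e. 3 x 8 = 24 bytes of memory traffic per vector index. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];

        printf("C[0] = %.1f\n", C[0]);  /* keep the result from being optimized away */
        return 0;
    }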
In our experiments, we used vectors of length 20,000,000 (160 MB each), more than sufficient to eliminate most
cache effects. The results presented here were obtained with the PGI Fortran 95 (pgf95) compiler, which
yields the highest STREAM performance among several compilers available on SciClone. The values
reported are averages over three or more repetitions at each data point.
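The aggregate bandwidths reported below follow directly from this accounting: 24 bytes per index times the vector length, divided by the measured time of each repetition, then averaged (we average the per-repetition rates here; averaging the times instead changes little). The short sketch below makes the conversion explicit; the helper name and the timing values are placeholders of ours, not STREAM output or SciClone measurements.

    /*
     * Convert measured Add times into effective bandwidth and average over
     * repetitions.  The timings below are made-up placeholders used only to
     * illustrate the arithmetic.
     */
    #include <stdio.h>

    #define N               20000000L   /* vector length used in our runs         */
    #define BYTES_PER_INDEX 24.0        /* two 8-byte loads plus one 8-byte store */

    static double gbps(double seconds)
    {
        return BYTES_PER_INDEX * (double)N / seconds / 1.0e9;
    }

    int main(void)
    {
        double t[] = { 0.062, 0.060, 0.061 };   /* hypothetical per-repetition times (s) */
        int    reps = (int)(sizeof t / sizeof t[0]);
        double sum  = 0.0;

        for (int i = 0; i < reps; i++)
            sum += gbps(t[i]);

        printf("average effective bandwidth: %.2f GB/s\n", sum / reps);
        return 0;
    }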
Figure 1 shows the effective aggregate memory bandwidth across the range of available cores for each of
the systems listed above. The HP SL390s, our newest platform, is by far the fastest, but for all three systems, aggregate performance fails to keep pace as the number of active cores increases. This is more clearly
illustrated in Figure 2, which plots the speedup metric s_p = t_1 / t_p, where s_p is the performance improvement
with p processor cores, t_1 is the execution time on a single core, and t_p is the execution time on p cores.
We see that the SC1435 achieves near-linear scaling with two cores active, but otherwise multicore performance lags badly; the eight-core SL390s achieves a maximum speedup of just 2.3.

Figure 1: Aggregate bandwidth as a function of core count.

Figure 2: Performance scales poorly with increasing core count.
An alternate metric is parallel efficiency, e_p = t_1 / (p * t_p), which indicates how effectively processors are being
utilized as the core count scales up (Fig. 3). In this view the historical trend is readily apparent: succeeding
generations of processors achieve 70%, 63%, and an abysmal 27% efficiency with all cores active. Finally,
Figure 4 plots the effective bandwidth per core for the SL390s. This metric, which is just the aggregate
data rate (Fig. 1) divided by the number of active cores, gives us a quantitative measure of the bandwidth
shortfall relative to single-core performance.
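For concreteness, the sketch below computes all three derived metrics (speedup, parallel efficiency, and per-core bandwidth) from a single-core time t1, a p-core time tp, and the corresponding aggregate data rate. The timings are hypothetical, chosen only to show how a roughly 2.2x speedup on eight cores translates into efficiency below 30%; they are not SciClone measurements.

    /*
     * Derived metrics used in Figures 2-4: speedup s_p = t1/tp, parallel
     * efficiency e_p = t1/(p*tp), and effective bandwidth per active core.
     * The timings below are hypothetical, not measured SciClone results.
     */
    #include <stdio.h>

    #define N 20000000.0                /* Add vector length (elements)           */

    int main(void)
    {
        double t1 = 0.060;              /* single-core Add time (s), hypothetical */
        double tp = 0.027;              /* time with p cores active (s)           */
        int    p  = 8;                  /* number of active cores                 */

        double aggregate_gbps = 24.0 * N / tp / 1.0e9;  /* 24 bytes per index     */
        double speedup        = t1 / tp;                /* s_p = t1 / tp          */
        double efficiency     = t1 / (p * tp);          /* e_p = t1 / (p * tp)    */
        double per_core_gbps  = aggregate_gbps / p;

        printf("aggregate bandwidth = %.1f GB/s\n", aggregate_gbps);
        printf("speedup       s_p  = %.2f\n", speedup);
        printf("efficiency    e_p  = %.0f%%\n", 100.0 * efficiency);
        printf("per-core bandwidth = %.2f GB/s\n", per_core_gbps);
        return 0;
    }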
Figure 3: Processor utilization has declined as core count has increased.

Figure 4: A gaping bandwidth deficit inhibits scalable performance in multicore multiprocessors.
As these results show, even trivially parallelizable computations can exhibit dismal performance on contemporary multicore architectures. Moore’s Law has given us extraordinary processing power, but memory
subsystems have failed to keep pace, and the problem is getting worse. The challenge for application developers is to devise algorithms which minimize the amount of data transferred to and from main memory;
the challenge for computer architects is to find cost-effective designs which balance processor and memory
performance.