HPC Performance Analysis, Q2 2012

Memory Bandwidth Scaling in Multicore Multiprocessors

George Ruddy, Office of Information Technology / Department of Computer Science
Tom Crockett, Office of Information Technology

In 1965, Intel co-founder Gordon Moore observed that the number of transistors on an integrated circuit (and hence its computing power) doubles every two years. This growth rate, known as Moore's Law, has held remarkably constant through the intervening decades and is projected to continue for at least a few more years. In microprocessors, steadily increasing clock frequencies boosted performance even further, until heat-dissipation considerations slowed this trend dramatically in the middle of the last decade. Since then, microprocessor architects have enhanced processing power by increasing the number of individual CPUs, or cores, placed on each chip.

To date, most computer systems are still based on some variant of the classic von Neumann architecture, with processors and memory residing on different physical devices, connected electrically by a memory or data bus. A combination of technical, cost, and power considerations limits the speed of memory accesses, resulting in a serious performance bottleneck for many applications. The rise of multicore processors has only made the problem worse.

Many high-end computing systems, including William and Mary's SciClone cluster, are built from commodity servers which incorporate one or more multicore processors. As of 2012, SciClone includes three generations of microprocessor technology, providing an opportunity to examine memory bandwidth trends as a function of processing power over the past few years. We have employed the well-known STREAM benchmark to characterize memory performance of the following systems:

    System       Year  Processor     Proc. Speed  Mem. Speed  No. Procs.  Cores per Proc.  Total Cores
    Sun V20z     2005  Opteron 250   2.4 GHz       333 MHz    2           1                2
    Dell SC1435  2007  Opteron 2218  2.6 GHz       667 MHz    2           2                4
    HP SL390s    2011  Xeon X5672    3.2 GHz      1333 MHz    2           4                8

STREAM is based on several simple operations involving very long vectors, thereby placing maximal demands on memory relative to the operation count. It represents essentially worst-case scenarios for serial memory access patterns, but nonetheless ones which are common in practice. We focus here on STREAM's Add benchmark, C = A + B, where A, B, and C are vectors of double-precision (8-byte) floating-point numbers. Each vector index requires fetching two operands and storing a third, for a total of 24 bytes of memory traffic. When running on multiple cores, STREAM uses the industry-standard OpenMP programming API to parallelize across loop iterations (a minimal sketch of the kernel appears below). In our experiments, we used vectors of length 20,000,000 (160 MB each), more than sufficient to eliminate most cache effects. The results presented here were obtained with the PGI Fortran 95 (pgf95) compiler, which yields the highest STREAM performance among the several compilers available on SciClone. The values reported are averages over three or more repetitions at each data point.

Figure 1 shows the effective aggregate memory bandwidth across the range of available cores for each of the systems listed above. The HP SL390s, our newest platform, is by far the fastest, but for all three systems, aggregate performance fails to keep pace as the number of active cores increases.
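For readers unfamiliar with STREAM, the program below is a minimal, illustrative sketch of the Add kernel as used in these tests, not the official STREAM source: the real benchmark also runs the Copy, Scale, and Triad kernels, repeats each trial several times, and validates the results. The program name, initialization values, and single-trial timing here are our own simplifications.

    ! Minimal sketch of the STREAM "Add" kernel (illustrative; not the official STREAM source).
    ! Compile with OpenMP enabled, e.g.  pgf95 -mp stream_add.f90  or  gfortran -fopenmp stream_add.f90
    program stream_add_sketch
       use omp_lib, only: omp_get_wtime
       implicit none
       integer, parameter :: n = 20000000          ! 20 M elements: 160 MB per vector
       real(kind=8), allocatable :: a(:), b(:), c(:)
       real(kind=8) :: t0, t1, nbytes
       integer :: j

       allocate(a(n), b(n), c(n))

       ! Initialize in parallel so memory pages are "first-touched" by the
       ! threads (and hence the sockets) that will later access them.
    !$omp parallel do
       do j = 1, n
          a(j) = 1.0d0
          b(j) = 2.0d0
          c(j) = 0.0d0
       end do
    !$omp end parallel do

       t0 = omp_get_wtime()

       ! The Add kernel: two 8-byte loads and one 8-byte store per index,
       ! i.e. 24 bytes of memory traffic per iteration.
    !$omp parallel do
       do j = 1, n
          c(j) = a(j) + b(j)
       end do
    !$omp end parallel do

       t1 = omp_get_wtime()

       nbytes = 24.0d0 * dble(n)
       print '(a, f8.2, a)', 'Effective aggregate bandwidth: ', &
             nbytes / (t1 - t0) / 1.0d9, ' GB/s'
    end program stream_add_sketch

The number of active cores is controlled at run time, for example with OMP_NUM_THREADS=4 before launching the executable.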
The scaling shortfall is illustrated more clearly in Figure 2, which plots the speedup metric s_p = t_1 / t_p, where s_p is the performance improvement with p processor cores, t_1 is the execution time on a single core, and t_p is the execution time on p cores. We see that the SC1435 achieves near-linear scaling with two cores active, but otherwise multicore performance lags badly; the eight-core SL390s achieves a maximum speedup of just 2.3.

Figure 1: Aggregate bandwidth as a function of core count.

Figure 2: Performance scales poorly with increasing core count.

An alternate metric is parallel efficiency, e_p = t_1 / (p t_p), which indicates how effectively processors are being utilized as the core count scales up (Fig. 3). In this view the historical trend is readily apparent: succeeding generations of processors achieve 70%, 63%, and an abysmal 27% efficiency with all cores active. Finally, Figure 4 plots the effective bandwidth per core for the SL390s. This metric, which is just the aggregate data rate (Fig. 1) divided by the number of active cores, gives us a quantitative measure of the bandwidth shortfall relative to single-core performance. (A short note at the end of this article shows how these metrics relate algebraically.)

Figure 3: Processor utilization has declined as core count has increased.

Figure 4: A gaping bandwidth deficit inhibits scalable performance in multicore multiprocessors.

As these results show, even trivially parallelizable computations can exhibit dismal performance on contemporary multicore architectures. Moore's Law has given us extraordinary processing power, but memory subsystems have failed to keep pace, and the problem is getting worse. The challenge for application developers is to devise algorithms which minimize the amount of data transferred to and from main memory; the challenge for computer architects is to find cost-effective designs which balance processor and memory performance.
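A closing note on the metrics used above. Assuming the aggregate data rate in Figure 1 is computed, as STREAM does, as the bytes moved in one pass divided by the elapsed time, writing B_p for the aggregate bandwidth on p cores and n for the vector length gives

\[
  s_p = \frac{t_1}{t_p}, \qquad
  e_p = \frac{t_1}{p\,t_p} = \frac{s_p}{p}, \qquad
  \frac{B_p / p}{B_1} = \frac{(24n/t_p)/p}{24n/t_1} = \frac{t_1}{p\,t_p} = e_p .
\]

In other words, the per-core bandwidth of Figure 4, normalized by the single-core bandwidth, is exactly the parallel efficiency of Figure 3; the constant 24n cancels, so the identity does not depend on how the byte count is tallied.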