A Simple, Composite Metric for System Throughput

Optimum System Balance for
Systems of Finite Price
John D. McCalpin, Ph.D.
IBM Corporation
Austin, TX
SuperComputing 2004
Pittsburgh, PA
November 10, 2004
Overview
• The HPC Challenge Benchmark was announced
last year at SuperComputing’2003
• The HPC Challenge Benchmark consists of
– LINPACK (HPL)
– STREAM
– PTRANS (transposing the array used by HPL)
– RandomAccess (random read/modify/write)
– FFT
– and some low-level MPI latency & BW measurements
• No single figure of merit is defined
Overview (continued)
• Q: What is a “balanced” system?
• My answer:
“A balanced system is one for which the primary
applications are limited in performance by the
most expensive component(s) of the system.”
The Two Questions
• We need to understand both performance and cost
in the context of low-level component metrics
• Performance
– What performance model?
– Use a harmonically weighted, time-based model
• Cost
– What cost model?
– Simple linear additive cost model
Performance Model
• Composite Figures of Merit for Performance must
be based on “time” rather than “rate”
– i.e., weighted harmonic means of rates
• Why?
– Combining “rates” in any other way fails to have a
“Law of Diminishing Returns”
• Time = Work/Rate
• Repeat for each component: Ti = Wi/Ri
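A minimal sketch of the time-based composite described above, assuming each component contributes Ti = Wi/Ri and the composite rate is total work divided by total time (a work-weighted harmonic mean of the rates). The function name and the sample numbers are illustrative, not from the talk.

```python
def composite_rate(work, rates):
    """Composite rate = total work / total time, with T_i = W_i / R_i."""
    if len(work) != len(rates):
        raise ValueError("work and rates must have the same length")
    total_time = sum(w / r for w, r in zip(work, rates))
    return sum(work) / total_time


# Two components with equal work; the second is the slow one.
base = composite_rate([1.0, 1.0], [10.0, 1.0])
# Speeding up the already-fast component 10x barely helps.
faster = composite_rate([1.0, 1.0], [100.0, 1.0])
# Upper bound as the first component becomes infinitely fast.
limit = (1.0 + 1.0) / (1.0 / 1.0)
print(base, faster, limit)
```

The diminishing return is visible in how little `faster` gains over `base`: the slow component's time dominates, and the composite can never exceed `limit`.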
A Simple Composite Model
• Assume the time to solution is composed of a compute time
proportional to peak GFLOPS plus a memory transfer time
proportional to sustained memory bandwidth
• Assume “x Bytes/FLOP” to get:
\[
\text{``Balanced GFLOPS''} \equiv
\frac{1\ \text{``Effective FP op''}}
     {\dfrac{1\ \text{FP op}}{\text{Peak GFLOPS}} + \dfrac{x\ \text{Bytes}}{\text{Sustained GB/s}}}
\]
• Target SPECfp_rate2000 as the workload
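The composite model on this slide can be sketched as a small function, assuming the time per effective FP operation is the sum of a compute term (1 / peak GFLOPS) and a memory term (x bytes / sustained GB/s). The function name and the machine numbers in the example are hypothetical.

```python
def balanced_gflops(peak_gflops, sustained_gb_s, bytes_per_flop):
    """Effective GFLOPS when each FP op also moves `bytes_per_flop` bytes.

    Both terms are in ns per operation, so the reciprocal is in GFLOPS.
    """
    compute_time = 1.0 / peak_gflops
    memory_time = bytes_per_flop / sustained_gb_s
    return 1.0 / (compute_time + memory_time)


# Hypothetical machine: 4 GFLOPS peak, 2 GB/s sustained, 0.8 Bytes/FLOP.
example = balanced_gflops(4.0, 2.0, 0.8)
print(f"{example:.3f} balanced GFLOPS")
```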
Does Peak GFLOPS predict SPECfp_rate2000?
[Figure: SPECfp_rate2000 vs Peak MFLOPS. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: Peak MFLOPS (0 to 8000).]
Does Sustained Memory Bandwidth predict
SPECfp_rate2000?
[Figure: SPECfp_rate2000 vs Sustained BW. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: GB/s per CPU (0 to 7).]
Optimized Model Results
• Results rounded to nearby round values:
– Bytes/FLOP for large caches = 0.16
– Bytes/FLOP for small caches = 0.80
– Size of asymptotically large cache = ~12 MB
– Coefficient of best fit = ~6.4
– The units of the coefficient are SPECfp_rate2000 / Effective GFLOPS
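A rough sketch of how the fitted values above might be applied, assuming the estimate is the coefficient (~6.4) times the effective GFLOPS from the composite model. How Bytes/FLOP varies between small caches and the ~12 MB asymptote is not stated on this slide, so the sketch takes Bytes/FLOP as an explicit argument and only shows the two endpoints. The function names and the per-CPU machine numbers are hypothetical.

```python
COEFFICIENT = 6.4             # SPECfp_rate2000 per effective GFLOPS (rounded fit)
BYTES_PER_FLOP_LARGE = 0.16   # large-cache endpoint
BYTES_PER_FLOP_SMALL = 0.80   # small-cache endpoint


def effective_gflops(peak_gflops, sustained_gb_s, bytes_per_flop):
    """Composite model: 1 / (compute time + memory time) per FP op."""
    return 1.0 / (1.0 / peak_gflops + bytes_per_flop / sustained_gb_s)


def estimated_rate(peak_gflops, sustained_gb_s, bytes_per_flop):
    """Estimated SPECfp_rate2000 per CPU from per-CPU peak and bandwidth."""
    return COEFFICIENT * effective_gflops(peak_gflops, sustained_gb_s, bytes_per_flop)


# Hypothetical CPU: 4 GFLOPS peak, 2 GB/s sustained bandwidth.
small = estimated_rate(4.0, 2.0, BYTES_PER_FLOP_SMALL)
large = estimated_rate(4.0, 2.0, BYTES_PER_FLOP_LARGE)
print(f"small cache: {small:.1f}, large cache: {large:.1f}")
```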
Does this Revised Metric predict SPECfp_rate2000?
[Figure: Optimized SPECfp_rate2000 Estimates. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: Estimated Rate/cpu (0 to 30).]
Cost Model
• Assume simple linear additive model
– FLOPS cost some amount
– Sustained BW costs a different amount
– Define:
   b (beta) = Rmem / Rcpu: system characteristics
   g (gamma) = Wmem / Wcpu: application characteristics
   d (delta) = ($/BW) / ($/FLOPS): technology characteristics
How to Optimize?
• For a given application g (Wmem/Wcpu), what is the
optimum system balance b?
• Many people seem to believe that the system
should be “balanced” to the application:
– b_optimal = g, i.e., Rmem/Rcpu = Wmem/Wcpu
• This does not optimize price/performance
The Correct Optimization
• This is actually an easy optimization problem
• Minimize cost/performance
– Same as minimizing cost * time
• Optimum cost/performance occurs at
– b = sqrt(g/d)
• Definitely not intuitive!
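The result can be reconstructed in a few lines, assuming the additive time model (T = Wcpu/Rcpu + Wmem/Rmem) and the linear additive cost model from the earlier slides. The unit prices c_cpu and c_mem are symbols introduced here for illustration, with d = c_mem / c_cpu.

```latex
\begin{align*}
T &= \frac{W_{cpu}}{R_{cpu}} + \frac{W_{mem}}{R_{mem}}
   = \frac{W_{cpu}}{R_{cpu}}\left(1 + \frac{g}{b}\right) \\
C &= c_{cpu} R_{cpu} + c_{mem} R_{mem}
   = c_{cpu} R_{cpu}\,(1 + d\,b) \\
C \cdot T &= c_{cpu} W_{cpu}\left(1 + \frac{g}{b}\right)(1 + d\,b)
   = c_{cpu} W_{cpu}\left(1 + g\,d + d\,b + \frac{g}{b}\right) \\
\frac{\partial (C \cdot T)}{\partial b} &= c_{cpu} W_{cpu}\left(d - \frac{g}{b^{2}}\right) = 0
   \quad\Longrightarrow\quad b = \sqrt{g/d}
\end{align*}
```

The second derivative, proportional to 2g/b^3, is positive for b > 0, so this stationary point is a minimum. Under these assumptions the minimum value of cost * time is c_cpu Wcpu (1 + sqrt(g d))^2.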
Example: High BW, expensive BW
• g = 3: relatively high BW
• d = 3: relatively expensive BW
• Optimum price/performance is at b=1, not b=3
• Improvement in price/performance is ~30%
[Figure: "gamma = 3, delta = 3". x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 3).]
High BW, very expensive memory
• g = 3: relatively high BW
• d = 10: very expensive BW
• Optimum price/performance is at b=0.55 (sqrt(3/10)), not b=3
• Improvement in price/performance is ~50%
[Figure: "gamma = 3, delta = 10". x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 3.5).]
Low-BW, expensive BW
• g = 0.1: low BW application
• d = 3: moderately expensive BW
• Optimum price/performance is at b=0.18, not b=0.1
• Improvement in price/performance is ~5%
• More BW helps here even though it is expensive, because the application does not need much.
[Figure: "gamma = 0.1, delta = 3". x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 14).]
Medium BW, expensive BW
• g = 1: modest BW
• d = 3: moderately expensive BW
• Optimum price/performance is at b=0.58, not b=1
• Improvement in price/performance is ~10%
[Figure: "gamma = 1, delta = 3". x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 4.5).]
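The four examples above can be explored numerically with a short sketch. It assumes that relative cost * time is proportional to (1 + g/b)(1 + d*b), which follows from the additive time model and linear cost model, and it compares the "balanced to the application" choice b = g against b = sqrt(g/d). The function names are illustrative. The improvement percentages on the slides are rounded and their exact definition is not stated, so the ratios printed here are not expected to match them digit for digit.

```python
import math


def cost_time(b, g, d):
    """Relative cost * time for system balance b, workload g, price ratio d."""
    return (1.0 + g / b) * (1.0 + d * b)


def optimal_balance(g, d):
    """Balance that minimizes cost * time under the assumed model."""
    return math.sqrt(g / d)


# (g, d) pairs taken from the four example slides.
cases = [(3.0, 3.0), (3.0, 10.0), (0.1, 3.0), (1.0, 3.0)]
for g, d in cases:
    b_opt = optimal_balance(g, d)
    ratio = cost_time(g, g, d) / cost_time(b_opt, g, d)
    print(f"g={g:g} d={d:g}  b_opt={b_opt:.2f}  "
          f"cost*time at b=g is {ratio:.2f}x the optimum")
```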
Summary
• Balance is important to cost/performance
• You must understand performance
• You must understand cost
• Optimum cost-performance is not intuitive!