Performance

Performance Measurement
A Quantitative Basis for Design
Parallel programming is an optimization problem.
Must take into account several factors:
– execution time
– scalability
– efficiency
Also must take into account the costs:
– memory requirements
– implementation costs
– maintenance costs, etc.
Mathematical performance models are used to assess these costs and predict performance.
Defining Performance
How do you define parallel performance?
What do you define it in terms of?
Consider
– Distributed databases
– Image processing pipeline
– Nuclear weapons testbed
Metrics for Performance
Efficiency
Speedup
Scalability
Others …………..
Some Terms
s(n,p) = speedup for problem size n on p processors
o(n) = serial portion of computation
p(n) = parallel portion of computation
c(n,p) = time for communication
Speed1 = o(n) + p(n)
SpeedP = o(n) + p(n)/p + c(n,p)
Efficiency
The fraction of time a processor spends doing useful work
E = T1 / (p * Tp)

E = (o(n) + p(n)) / (p*o(n) + p(n) + p*c(n,p))
What about when p*Tp < T1 (superlinear speedup)?
– Does cache make a processor work at 110%?
Speedup
What is Speed?
S = Speed1 / SpeedP
What algorithm for Speed1?
What is the work performed?
How much work?
Speedup (More Detail)
s(n,p) = speedup for problem size n on p processors
o(n) = serial portion of computation
p(n) = parallel portion of computation
c(n,p) = time for communication
Speed1 = o(n) + p(n)
SpeedP = o(n) + p(n)/p + c(n,p)
Speedup = (o(n) + p(n)) / (o(n) + p(n)/p + c(n,p))
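As a sketch, the speedup and efficiency formulas above can be written out directly; the sample numbers are illustrative assumptions, not figures from the slides:

```python
# Speedup and efficiency using the slides' terms:
#   o     = o(n),   serial portion of the computation
#   p_n   = p(n),   parallel portion of the computation
#   c     = c(n,p), communication time
#   p     = number of processors
def speedup(o, p_n, c, p):
    return (o + p_n) / (o + p_n / p + c)

def efficiency(o, p_n, c, p):
    # E = T1 / (p * Tp) = speedup / p
    return speedup(o, p_n, c, p) / p

# Assumed toy numbers: 1 s serial, 99 s parallel work, 0.5 s communication,
# 10 processors.
print(speedup(1, 99, 0.5, 10))     # ≈ 8.77
print(efficiency(1, 99, 0.5, 10))  # ≈ 0.88
```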
More on Speedup
[Figure: execution time vs. number of processors, split into computation and communication components]
Computation time decreases as we add processors, but communication time increases.
Two kinds of Speedup
Relative
– Uses parallel algorithm on 1 processor
– Most common
– Useful for determining algorithm scalability
Absolute
– Uses best known serial algorithm
– Eliminates overheads in calculation.
– Useful to express absolute performance
Story: Prime Number Generation
Amdahl's Law
Every algorithm has a sequential component.
Sequential component limits speedup
Suppose ¾ of a program can be parallelized and ¼ is sequential, and each ¼ takes 1 unit of time.
With unlimited processors, the parallel ¾ shrinks toward zero, leaving only the sequential ¼:
Speedup = 1-processor time / n-processor time = 4/1 = 4
If the sequential component is 1/s of the execution time, the maximum speedup is s.
Amdahl’s Law
Speedup = (o(n) + p(n)) / (o(n) + p(n)/p + c(n,p))
        <= (o(n) + p(n)) / (o(n) + p(n)/p)

Let s = o(n) / (o(n) + p(n)), the inherently sequential fraction.
Then o(n) + p(n) = o(n)/s and p(n) = o(n)(1/s - 1), so

Speedup <= (o(n)/s) / (o(n) + o(n)(1/s - 1)/p)

Speedup <= 1 / (s + (1 - s)/p)
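A quick numeric check of this bound (a minimal sketch; the serial fraction and processor counts are just example values):

```python
# Amdahl's Law bound derived above: Speedup <= 1 / (s + (1 - s)/p).
def amdahl_bound(s, p):
    """Maximum speedup with serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# With a 25% serial fraction, speedup never exceeds 1/s = 4,
# no matter how many processors are used:
print(amdahl_bound(0.25, 4))      # ≈ 2.29
print(amdahl_bound(0.25, 10**6))  # just under 4.0
```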
Amdahl's Law
[Figure: speedup vs. number of processors, leveling off at the limit imposed by the serial fraction s]
Speedup
Algorithm A
– Serial execution time is 10 sec.
– Parallel execution time is 2 sec.
Algorithm B
– Serial execution time is 2 sec.
– Parallel execution time is 1 sec.
What if I told you A = B?
Speedup
Conventional speedup is defined as the
reduction in execution time.
Consider running a problem on a slow
parallel computer and on a faster one.
– Same serial component
– Speedup will be lower on the faster computer.
Logic
The art of thinking and reasoning in strict
accordance with the limitations and
incapacities of the human misunderstanding.
The basis of logic is the syllogism,
consisting of a major and minor premise and
a conclusion.
Example
Major Premise: Sixty men can do a piece of
work sixty times as quickly as one man.
Minor Premise: One man can dig a post-hole in sixty seconds.
Conclusion: Sixty men can dig a post-hole in one second.
Speedup and Amdahl's Law
Conventional speedup penalizes faster
absolute speed.
Assumption that task size is constant as the
computing power increases results in an
exaggeration of task overhead.
Scaling the problem size reduces these
distortion effects.
Solution
Gustafson introduced scaled speedup.
Scale the problem size as you increase the
number of processors.
Calculated in two ways
– Experimentally
– Analytical models
Traditional Speedup
(Strong Scaling)
Speedup = T1(N) / TP(N)

Tx(y) is the time taken to solve a problem of size y on x processors.
Scaled Speedup
(weak scaling)
Speedup = T1(PN) / TP(PN)
Traditional speedup reduces the work done by
each processor as we add processors
Scaled speedup keeps the work constant on each
processor as we add processors.
Scaled Speedup

Speedup <= (o(n) + p(n)) / (o(n) + p(n)/p)

The parallel time o(n) + p(n)/p can be divided into two pieces, serial and parallel:
s = o(n) / (o(n) + p(n)/p) and
(1 - s) = (p(n)/p) / (o(n) + p(n)/p)
Now solve for o(n) and p(n) respectively:
o(n) = (o(n) + p(n)/p) * s
p(n) = (o(n) + p(n)/p) * (1 - s) * p
Substituting these back into the speedup equation yields
Speedup <= s + (1 - s) * p, equivalently Speedup <= p + (1 - p) * s
where s is the fraction of time spent in serial code = o(n) / t(n,k), and t(n,k) is the time of the parallel program for size n on k processors.
Thus, the maximum speedup with p < k processors is
Speedup <= p + (1 - p) * s
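The derivation above reduces to a one-line bound; a small sketch with example fractions:

```python
# Gustafson's scaled-speedup bound from the derivation above:
# Speedup <= s + (1 - s) * p, where s is the serial fraction
# measured on the *parallel* run.
def scaled_speedup_bound(s, p):
    return s + (1.0 - s) * p

# Unlike the fixed-size Amdahl bound, this grows linearly with p:
for p in (2, 4, 8):
    print(p, scaled_speedup_bound(0.25, p))  # 1.75, 3.25, 6.25
```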
Traditional Speedup
[Figure: speedup vs. number of processors; the measured curve falls below the ideal line]
Scaled Speedup
[Figure: speedup vs. number of processors for small, medium, and large problems; larger problems stay closer to the ideal line]
Scaled Speedup vs Amdahl’s Law
Amdahl’s Law determines speedup by taking a serial
computation and predicting how quickly it could be done
in parallel
Scaled speedup begins with a parallel computation and
estimates how much faster the parallel computation is than
the same computation on a serial processor
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.
Determining Scaled Speedup
Time problem size n on 1 processor
Time problem size 2n on 2 processors
Time problem size 2n on 1 processor
Time problem size 4n on 4 processors
Time problem size 4n on 1 processor
etc.
Plot the curve
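The measurement schedule above can be sketched as a loop. Here `time_to_solve` is a hypothetical stand-in for an actual timed run; a toy analytic time model is substituted so the sketch is runnable:

```python
# Experimental scaled-speedup procedure from the slide above:
# for p = 1, 2, 4, ..., time size p*n on p processors and on 1 processor.
def scaled_speedup_points(base_n, max_p, time_to_solve):
    """Return (p, scaled speedup) pairs for p = 1, 2, 4, ... <= max_p."""
    points = []
    p = 1
    while p <= max_p:
        t1 = time_to_solve(p * base_n, 1)  # size p*n on 1 processor
        tp = time_to_solve(p * base_n, p)  # size p*n on p processors
        points.append((p, t1 / tp))
        p *= 2
    return points

# Toy analytic model (assumed, for illustration only):
model = lambda n, p: 2 + 12 * n / p + 5 * (p - 1)
print(scaled_speedup_points(128, 4, model))
```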
Performance Measurement
There is not a perfect way to measure and
report performance.
Wall clock time seems to be the best.
But how much work do you do?
Best Bet:
– Develop a model that fits experimental results.
A Parallel Programming Model
Goal: Define an equation that predicts execution time as a function of
– Problem size
– Number of processors
– Number of tasks
– Etc.

T = f(N, P, ...)
A Parallel Programming Model
Execution time can be broken up into
– Computing
– Communicating
– Idling
T = Tcomp + Tcomm + Tidle
Computation Time
Normally depends on problem size
Also depends on machine characteristics
– Processor speed
– Memory system
– Etc.
Often, experimentally obtained
Communication Time
The amount of time spent sending &
receiving messages
Most often is calculated as
– Cost of sending a single message * #messages
Single message cost
– T = startup_time + time_to_send_one_word * #words
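As a sketch, the message-cost model above in Python; the startup (latency) and per-word time constants are assumed placeholder values, not measurements:

```python
# Single-message cost: T = startup_time + time_to_send_one_word * #words.
def message_time(words, startup=1e-4, per_word=1e-7):
    return startup + per_word * words

# Total communication time: cost of a single message * number of messages.
def comm_time(n_messages, words_per_message):
    return n_messages * message_time(words_per_message)

print(comm_time(2, 2560))  # e.g. two messages of 2,560 words each
```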
Idle Time
Difficult to determine
This is often the time waiting for a message
to be sent to you.
Can be avoided by overlapping
communication and computation.
Finite Difference Example
Finite Difference Code
512 x 512 x 5 Elements
n x n x z
Nine-point stencil
Row-wise decomposition
– Each processor gets n/p*n*z elements
16 IBM RS6000 workstations
Connected via Ethernet
Finite Difference Model
Execution Time (per iteration)
– ExTime = (Tcomp + Tcomm)/P
Communication Time (per iteration)
– Tcomm = 2 (lat + 2*n*z*bw)
Computation Time
– Estimate using some sample code
Estimated Performance
Finite Difference Example
What was wrong?
Ethernet
– Shared bus
Change the computation of Tcomm
– Reduce the bandwidth
– Scale the message volume by the number of
processors sending concurrently.
– Tcomm = 2 (lat + 2*n*z*bw * P/2)
Finite Difference Example
Using analytical models
Examine the control flow of the algorithm
Find a general algebraic form for the
complexity (execution time).
Fit the curve with experimental data.
If the fit is poor, find the missing terms and
repeat.
Calculate the scaled speedup using the fitted model.
Example
Serial Time = 2 + 12 N seconds
Parallel Time = 4 + 12 N/P + 5P seconds
Let N/P = 128
Scaled Speedup for 4 processors is:
C1(PN) / CP(PN) = (2 + 12(4(128))) / (4 + 12(4(128)/4) + 5(4)) = 6146 / 1560 ≈ 3.94
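A few lines of Python reproduce the arithmetic above:

```python
# Scaled speedup for Serial(N) = 2 + 12N and Parallel(N, P) = 4 + 12N/P + 5P,
# evaluated at the scaled problem size N = 128 * P.
def example_scaled_speedup(p, n_per_proc=128):
    n = p * n_per_proc                    # scaled size PN
    serial = 2 + 12 * n                   # C1(PN)
    parallel = 4 + 12 * n / p + 5 * p     # CP(PN)
    return serial / parallel

print(example_scaled_speedup(4))  # 6146 / 1560 ≈ 3.94
```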
Performance Evaluation
Identify the data
– Execution time
– Be sure to examine a range of data points
Design the experiments to obtain the data
– Make sure the experiment measures what you intend to measure.
– Remember: Execution time is the max time taken.
– Repeat your experiments many times
– Validate data by designing a model
Report data
– Report all information that affects execution
– Results should be separate from conclusions
– Present the data in an easily understandable format.