CS/COE0447
Computer Organization & Assembly Language
CHAPTER 4
Assessing and Understanding Performance
1
Program Performance
• Program performance is measured in terms of time!
• Program execution time depends on
– The number of instructions executed to complete a job
– How many clock cycles are needed to execute a single instruction
– The length of the clock cycle (clock cycle time)
2
Clock, Clock Cycle Time
• Circuits in computers are “clocked”
• At each rising (or falling) clock edge, some specified actions are carried out, usually completing before the next rising (or falling) edge
• Instructions typically require more than one cycle to
execute
[Figure: function block (made of circuits) with its clock input; the clock cycle time is labeled]
3
Program Performance
• time = (# of clock cycles) × (clock cycle time)
• # of clock cycles = (# of instructions executed) × (average cycles per instruction)
• time = (# of instructions executed) × (average clock cycles per instruction) × (clock cycle time)
• In units: time (s) = cycles × (s / cycle)
• cycles = instructions × (cycles / instruction, average)
• So: time (s) = instructions × (cycles / instruction, average) × (s / cycle)
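The performance equation above can be sketched in a few lines of Python. The function name `cpu_time_seconds` and the sample program (2 billion instructions, average CPI of 1.5, 1 GHz clock) are assumptions made up for illustration; they do not come from the slides.

```python
def cpu_time_seconds(instruction_count, avg_cpi, clock_cycle_time_s):
    """time = instructions x (cycles / instruction) x (seconds / cycle)."""
    total_cycles = instruction_count * avg_cpi      # instructions x cycles/instruction
    return total_cycles * clock_cycle_time_s        # cycles x seconds/cycle

# Hypothetical program: 2 billion instructions, average CPI 1.5,
# 1 GHz clock (clock cycle time of 1 ns = 1e-9 s).
t = cpu_time_seconds(2_000_000_000, 1.5, 1e-9)
print(f"{t:.2f} s")   # 3e9 cycles at 1 ns each
```

Halving any one of the three factors while holding the other two fixed halves the execution time, which is why all three have to be considered together.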
4
Example 1
• You have a machine with a CPU running at
1GHz. The same company releases its 2GHz
CPU with 100% compatibility with the existing
1GHz CPU, and you are considering upgrading.
What is the expected performance improvement
from doing so? Assume that programs have
40% memory-access instructions, and each
memory access takes 10ns on average. All
other instructions take exactly one cycle for
execution. Answer: in class
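One possible way to set up this problem is to compute the average time per instruction at each clock rate. The sketch below assumes that the 10 ns memory access time does not change with the CPU clock and that the instruction mix stays the same; the function and constant names are made up for illustration, and this is not necessarily the solution presented in class.

```python
MEM_FRACTION = 0.40     # fraction of instructions that access memory (from the example)
MEM_ACCESS_NS = 10.0    # average time per memory access in ns (from the example)

def avg_time_per_instruction_ns(clock_ghz):
    """Average instruction time: memory instructions take 10 ns, all others one cycle."""
    cycle_ns = 1.0 / clock_ghz                  # 1 GHz -> 1 ns, 2 GHz -> 0.5 ns
    other_fraction = 1.0 - MEM_FRACTION
    return MEM_FRACTION * MEM_ACCESS_NS + other_fraction * cycle_ns

old = avg_time_per_instruction_ns(1.0)          # 0.4 * 10 + 0.6 * 1.0
new = avg_time_per_instruction_ns(2.0)          # 0.4 * 10 + 0.6 * 0.5
speedup = old / new
print(f"{old:.1f} ns -> {new:.1f} ns, speedup = {speedup:.2f}")
```

Under these assumptions the memory accesses dominate the execution time, so doubling the clock rate yields far less than a twofold improvement.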
5
From Wikipedia
• Amdahl's law, named after computer
architect Gene Amdahl, is used to find the
maximum expected improvement to an
overall system when only part of the
system is improved. It is often used in
parallel computing to predict the
theoretical maximum speedup using
multiple processors.
6
Amdahl’s Law (cont)
• The law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation (measured as a fraction of its execution time), where the improvement itself has a speedup of S.
• Amdahl's law states that the overall speedup of applying
the improvement will be:
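    overall speedup = 1 / ((1 - P) + P / S)
• Our example: P = 0.6 and S = 2
• 1 / ((1 - 0.6) + (0.6 / 2)) = 1 / 0.7 ≈ 1.43
• This is the best overall speedup possible for these values; even as S grows without bound, the speedup cannot exceed 1 / (1 - P) = 2.5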
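The formula is easy to check numerically. The following is a minimal sketch; the function name `amdahl_speedup` is an assumption, and the last part applies the formula to the Example 1 numbers under the assumption that the 10 ns memory access time is unaffected by the faster clock.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of execution time is sped up by a factor s."""
    if not 0.0 <= p <= 1.0 or s <= 0:
        raise ValueError("need 0 <= p <= 1 and s > 0")
    return 1.0 / ((1.0 - p) + p / s)

print(f"{amdahl_speedup(0.6, 2):.2f}")      # P = 0.6, S = 2: 1 / 0.7
print(f"{amdahl_speedup(0.6, 1e12):.2f}")   # very large S approaches 1 / (1 - P)

# Example 1 numbers: per instruction, 0.4 * 10 ns of memory time is unaffected
# and 0.6 * 1 ns of single-cycle time is halved by the 2 GHz clock.
p_time = 0.6 / (0.4 * 10 + 0.6 * 1)
print(f"P = {p_time:.2f}, speedup = {amdahl_speedup(p_time, 2):.2f}")
```

Note that P is a share of execution time, not of instruction count. In Example 1, 60% of the instructions benefit from the faster clock, but they account for only about 13% of the time, so P = 0.6 would overstate the gain for that machine.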
7
Example 2
• If a computer issues 30 network requests per second and each request is on average 64 KB, will a 100 Mbit/s Ethernet link be sufficient? (printer, accessing files, …)
• KB = 10^3 bytes
• Byte = 8 bits
• Mbit = 10^6 bits
• A 100 Mbit/s Ethernet: 10^8 bit/s “bitrate”
8
Answer
• Ethernet: 10^8 bit/s
• KB = Kilobyte; Kilo = 10^3; byte = 8 bits
• 30 requests/s * 64 KB/request * (10^3 * 8) bit/KB
(the units cancel to leave bit/s)
• 30 * 64 * 8 * 10^3 = 3 * 6.4 * 8 * 10^5 = 1.536 * 10^7 < 10^8
• So, yes, it is sufficient: the requests use about 15% of the link's bitrate
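The unit conversion above can be reproduced with integer arithmetic. This sketch uses the decimal definitions from the slide (KB = 10^3 bytes, Mbit = 10^6 bits) and, as an assumption, ignores protocol overhead; the constant names are made up for illustration.

```python
REQUESTS_PER_S = 30
REQUEST_SIZE_KB = 64
BITS_PER_KB = 10**3 * 8           # KB = 10^3 bytes, byte = 8 bits
LINK_BITRATE = 100 * 10**6        # 100 Mbit/s = 10^8 bit/s

# requests/s x KB/request x bit/KB leaves bit/s
required_bitrate = REQUESTS_PER_S * REQUEST_SIZE_KB * BITS_PER_KB
utilization = required_bitrate / LINK_BITRATE

print(required_bitrate)                    # bits per second needed
print(required_bitrate < LINK_BITRATE)     # is the link sufficient?
print(f"{utilization:.1%}")                # share of the link in use
```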
9
Why Performance Evaluation?
[Figure: diagram linking DESIGN and EVALUATION]
10
Defining Performance
• What do you mean when you say a
computer has better performance than
another?
• We need a “metric” for comparison
– One metric may not fully characterize a
system
• a number of metrics may be relevant
– Important metrics for computer systems
• Response time (a.k.a. execution time)
• Throughput
11
Response Time vs. Throughput
Plane              DC to Paris   Top Speed   Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph     470          286,700
BAC/Sud Concorde   3 hours       1350 mph    132          178,200
• Which has higher performance?
– Time to deliver 1 passenger
– Time to deliver 400 passengers
• Time for 1 job is called
– Response time or execution time
• Jobs per day is called
– Throughput or bandwidth
12
Some Definitions
• Throughput is in units of things per second
– Bigger is better
• If we are primarily concerned with
response time
– Performance = 1 / execution time
– Bigger is better, which means shorter execution time
• “Machine A is N times faster than B”
– N = performance (A) / performance (B)
    = execution time (B) / execution time (A)
13
Response Time vs. Throughput
• Time of Concorde vs. Boeing 747?
– Concorde is (6.5 hours / 3 hours) times faster
– About 2.2 times faster
• Throughput of Boeing 747 vs. Concorde
– 286,700 pmph / 178,200 pmph
– 1.6 times higher
• Boeing 747 is 1.6 times (or 60%) higher in terms of
throughput
• Concorde is 2.2 times (or 120%) faster in terms of flying
time (response time)
• We will focus primarily on execution time for a single job
for the remaining discussions
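The two comparisons above can be recomputed from the plane table. This is a small sketch; the helper name `times_faster` is an assumption, and throughput is taken as passengers × top speed, which matches the pmph figures in the table.

```python
def times_faster(time_a, time_b):
    """How many times faster A is than B: performance(A) / performance(B) = time(B) / time(A)."""
    return time_b / time_a

# Figures from the plane table.
concorde_hours, boeing_hours = 3.0, 6.5
concorde_pmph = 132 * 1350       # passengers x mph
boeing_pmph = 470 * 610

print(f"{times_faster(concorde_hours, boeing_hours):.1f}")   # Concorde vs. 747, flying time
print(f"{boeing_pmph / concorde_pmph:.1f}")                   # 747 vs. Concorde, throughput
```

Which plane "wins" depends entirely on whether response time or throughput is the metric that matters for the task at hand.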
14
Regarding Time
• Straightforward definition of time
– Total time to complete a task, including disk accesses,
memory accesses, other I/O activities, operating
system overheads, …
– Terms for this: “Real time”, “response time”, “elapsed
time”
• Alternative: time spent by CPU only on your
program (since multiple processes may run at
the same time)
– “CPU execution time” or “CPU time”
– Often divided into system CPU time (OS) and user
CPU time (user program)
15
Clock
16
Measuring Time
• In terms of seconds
• CPU time: computers are constructed using digital circuitry driven by a “clock”
– Ticks at a constant rate
– Determines when events take place
• Clock cycle time = length of one clock cycle (the clock period) = 1 / clock rate
– 1ns if 1GHz clock
– 0.5ns if 2GHz clock
– 0.25ns if 4GHz clock
17
Measuring Time w/ Clocks
• CPU execution time for program
– = (clock cycles for a program) × (clock cycle time)
– = (clock cycles for a program) / (clock rate)
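Both forms of the expression give the same result because the clock cycle time is the reciprocal of the clock rate. The sketch below checks this for the three clock rates mentioned earlier (1, 2, and 4 GHz); the function names and the 6-billion-cycle program are hypothetical.

```python
import math

def clock_cycle_time_s(clock_rate_hz):
    """Clock cycle time (seconds) = 1 / clock rate (Hz)."""
    return 1.0 / clock_rate_hz

def cpu_time_s(clock_cycles, clock_rate_hz):
    """CPU execution time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz

# Hypothetical program that needs 6 billion clock cycles.
cycles = 6_000_000_000
for rate_hz in (1e9, 2e9, 4e9):
    via_cycle_time = cycles * clock_cycle_time_s(rate_hz)
    via_clock_rate = cpu_time_s(cycles, rate_hz)
    # The two formulations agree up to floating-point rounding.
    assert math.isclose(via_cycle_time, via_clock_rate)
    print(f"{rate_hz / 1e9:.0f} GHz: {via_clock_rate:.1f} s")
```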
18
Measuring Time w/ Clocks, cont’d
• Total clock cycles for a program
– Instructions for a program (= instruction count) × average clock cycles per instruction (CPI)
• Time = (# of instr.) × CPI × (clock cycle time)
• Looking at the units:
– s = inst * cycle/inst * s/cycle
19
Workload
• A set of programs run on a computer is a
workload
– Actual collection of applications
– Synthetic programs (for experimentation)
• To evaluate two computer systems, a user
would simply compare the execution time
of the workload on the two computers
20
Benchmarks
• A set of applications relevant for performance evaluation
• SPEC (Standard Performance Evaluation Corporation)
– CPU benchmarks
– Server benchmarks
– Graphics benchmarks
– …
• EEMBC (Embedded Microprocessor Benchmark Consortium)
– Automotive
– Consumer
– Network
– Telecom
– Office
– …
21
Summarizing Performance
                   Computer A   Computer B
Program 1 (sec)    1            10
Program 2 (sec)    1000         100
Total time (sec)   1001         110
• A is 10 times faster than B for program 1
• B is 10 times faster than A for program 2
• Although the above statements are correct
individually, they present a confusing picture!
22
Summarizing Performance, cont’d
• Arithmetic mean (AM) = (Σ Time_i) / N
• Weighted AM = Σ (Time_i × W_i), where Σ W_i = 1
• AM is a special case of weighted AM where W_i = 1/N
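Applied to the Computer A / Computer B table, the two means can be computed as follows. The function names are assumptions, and the 90% / 10% weights are hypothetical, chosen only to show how the weighting changes the summary.

```python
def arithmetic_mean(times):
    """AM = (sum of Time_i) / N."""
    return sum(times) / len(times)

def weighted_arithmetic_mean(times, weights):
    """Weighted AM = sum of Time_i * W_i, with the weights summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(t * w for t, w in zip(times, weights))

computer_a = [1, 1000]     # Program 1, Program 2 (seconds), from the table
computer_b = [10, 100]

print(arithmetic_mean(computer_a), arithmetic_mean(computer_b))

# Hypothetical weights: Program 1 makes up 90% of the workload, Program 2 10%.
weights = [0.9, 0.1]
print(weighted_arithmetic_mean(computer_a, weights),
      weighted_arithmetic_mean(computer_b, weights))
```

With equal weights of 1/N the weighted mean reduces to the plain arithmetic mean, so the choice of weights is where knowledge of the real workload enters.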
23
SPEC Benchmark
• SPEC CPU2000 benchmark
– 12 integer benchmarks
– 14 floating-point benchmarks
• To get a SPECmark
– Run each program on the target machine
– Get the performance ratio by dividing the pre-provided reference execution time (measured on an old Sun workstation) by the execution time obtained
24
Amdahl’s Law
(in terms of time)
• An optimization is usually applicable to
only a limited portion of program execution
– E.g., A larger cache; improved CPU frequency;
improved FSB frequency; …
• Time_improved = Time_unaffected + Time_affected / (Improvement Factor)
• “Make the common case fast!”
25
Amdahl’s Law - example
• A program runs in 100 seconds on a computer, with
multiply operations responsible for 80 seconds of this
time
• How much do I have to improve the speed of
multiplication, if I want my program to run 5 times faster?
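• Time_improved = Time_unaffected + Time_affected / (Improvement Factor)
• Running 5 times faster means Time_improved = 100 s / 5 = 20 s, and Time_unaffected = 100 s - 80 s = 20 s
• 20 s = 20 s + 80 s / n
• 0 = 80 s / n, which no finite n can satisfy
• There is no amount by which we can improve multiply to achieve a fivefold increase in performance!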
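The same conclusion shows up numerically: however large the improvement factor n for multiplication, the total time stays above the 20 s that multiplication does not account for. This is a minimal sketch; the function name and the particular values of n are chosen for illustration.

```python
def improved_time(unaffected_s, affected_s, improvement_factor):
    """Time_improved = Time_unaffected + Time_affected / improvement factor."""
    return unaffected_s + affected_s / improvement_factor

TOTAL_S, MULTIPLY_S = 100.0, 80.0
UNAFFECTED_S = TOTAL_S - MULTIPLY_S       # 20 s that faster multiplication cannot touch

for n in (2, 4, 16, 1_000_000):
    t = improved_time(UNAFFECTED_S, MULTIPLY_S, n)
    print(f"n = {n:>9}: {t:9.4f} s, overall speedup {TOTAL_S / t:.5f}")
```

A fourfold overall speedup is reachable (n = 16 gives 25 s), but the overall speedup only approaches 5 as n grows and never reaches it.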
26
Fallacies and Pitfalls
• Pitfall: Expecting the improvement of one
aspect of a computer to increase
performance by an amount proportional to
the size of the improvement
• Pitfall: Using a subset of the performance
equation as a performance metric
27
To Summarize…
• Performance evaluation is an important
stage of an engineering process
• We are interested in measuring computer
performance
– Software improvement
– Hardware improvement
–…
• Defining performance
– Need relevant metric!
• Latency vs. throughput
28
To Summarize…, cont’d
• Response time = time to finish a given
single job
• Throughput = # of jobs done in a second
• Time = # of clock cycles × clock cycle time
• # of clock cycles = # of instructions × CPI
29
To Summarize…, cont’d
• Best workload is one that comes from real
applications
• Benchmarks are a set of applications to aid
performance evaluation
• Summarizing results
– Arithmetic mean (AM)
– Weighted mean
• Amdahl’s law
– Specifies overall performance improvement due to a
limited-scope optimization
30