improve performance

EE2011 Computer Organization
Lecture 5: Understanding Performance
Wen-Yen Lin, Ph.D.
Department of Electrical Engineering
Chang Gung University
Email: [email protected]
April 2017
Understanding The Performance
Performance
Computer Organization, 105/2, EE/CGU, W.Y. Lin
2
Performance
(Ch. 1.6)
Measure, Report, and Summarize the performance
Describe the major factors determining the
performance
Make intelligent choices
See through the marketing hype
Key to understanding underlying organizational
motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?
(e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
3
Which of these airplanes has the best
performance? (p. 29)
Airplane
Passengers
Boeing 777
375
Boeing 747
470
BAC/Sud Concorde 132
Douglas DC-8-50
146
Range (mi) Speed (mph)
4630
610
4150
610
4000
1350
8720
544
How much faster is the Concorde compared to the 747?
How much bigger is the 747 than the Douglas DC-8?
Any other way to measure the performance?
Airplane
Boeing 777
Boeing 747
BAC/Sud Concorde
Douglas DC-8-50
Passenger Throughput
(passenger x mph)
228,750
286,700
178,200
79,424
Computer Organization, 105/2, EE/CGU, W.Y. Lin
4
Defining Performance
Which airplane has the best performance?
Boeing 777
Boeing 777
Boeing 747
Boeing 747
BAC/Sud
Concorde
BAC/Sud
Concorde
Douglas
DC-8-50
Douglas DC8-50
0
100
200
300
400
0
500
Boeing 777
Boeing 777
Boeing 747
Boeing 747
BAC/Sud
Concorde
BAC/Sud
Concorde
Douglas
DC-8-50
Douglas DC8-50
500
1000
Cruising Speed (mph)
4000
6000
8000 10000
Cruising Range (miles)
Passenger Capacity
0
2000
1500
0
100000 200000 300000 400000
Passengers x mph
Computer Organization, 105/2, EE/CGU, W.Y. Lin
5
Computer Performance:
TIME, TIME, TIME
Response Time (latency)
Running a program on two different computers:


The faster one is the one that gets the job done first
As an individual user, you are interested in reducing Response Time
(Execution time or latency), including disk accesses, memory accesses,
I/O activities, OS overhead, CPU execution time, etc.

How long does it take for my job to run?

How long does it take to execute a job?

How long must I wait for the database query?
Throughput
Several jobs running on several servers in data center:


The fastest one is the one that completes the most jobs during a day
As a data center manager, you are interested in increasing throughput,
the total amount of work done in a given time.

How many jobs can the machine run at once?

What is the average execution rate?

How much work is getting done?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
6
Response Time and Throughput
(pp. 30)
Response Time
How long it takes to do a task
Throughput
Total work done per unit time

E.g., tasks/trnasactions/… per hour
If we upgrade a machine with a new faster processor what do we increase?
Tasks can be done faster
Both response time and throughput are increased
If we add a new machine/processor to the lab/machine what do we increase?
A single task execution time is not affected
Overall throughput is increased
However, if the task needs to compete processing resources with others, it
may need to wait in the queue. Hence, increasing throughput would
sometimes also reduce the response time
Changing either response time or throughput often affects the other
We will focus on response time for now …
Computer Organization, 105/2, EE/CGU, W.Y. Lin
7
Book's Definition of Performance
(p. 30)
For some program running on machine X,
PerformanceX = 1 / Execution timeX
PerformanceX is greater than PerformanceY
means Execution TimeY is longer than Execution
TimeX
"X is n times faster than Y"
PerformanceX / PerformanceY = n
Execution TimeY is n times longer than Execution
TimeX
Computer Organization, 105/2, EE/CGU, W.Y. Lin
8
Relative Performance
(p. 31)
Problem:
machine A runs a program in 10 seconds
machine B runs the same program in 15 seconds
PerformanceA Execution TimeB
=
= 1.5
PerformanceB Execution TimeA
A is 1.5 times faster than B
To avoid the potential confusion between the
terms, increasing and decreasing, we usually say
“improve performance” or “improve execution
time” instead of “increasing performance” or
“decreasing execution time”.
Computer Organization, 105/2, EE/CGU, W.Y. Lin
9
Measuring Performance
(p. 32)
Execution Time
Elapsed Time
 Total response time, including all aspects

counts everything: processing, I/O, OS overhead, idle time
 Determines system performance
 a useful number, but often not good for comparison purposes
CPU time
 Time spent processing a given job

doesn't count I/O or time spent running other programs
 can be broken up into system CPU time, and user CPU time
 Different programs are affected differently by CPU and system
performance
Our focus: User CPU time
 time spent executing the lines of code that are "in" our
program
Computer Organization, 105/2, EE/CGU, W.Y. Lin
10
Clock Cycles
(pp. 34)
Instead of reporting execution time in seconds, we often use
cycles
seconds
cycles
seconds
=
×
program program
cycle
Clock “ticks” indicate when to start activities (one abstraction):
time
Computer Organization, 105/2, EE/CGU, W.Y. Lin
11
CPU Clocking
Operation of digital hardware governed by a constant-rate
clock
Clock period
Clock (cycles)
Data transfer
and computation
Update state
Clock period (cycle time): duration of a clock cycle
E.g., 250ps = 0.25ns = 250x10 -12s
Clock frequency (rate): cycles per second
E.g., 4.0GHz = 4000MHz = 4.0x10 9Hz
A 4 Ghz. clock has a
1
4 ×109
×1012 = 250 picoseconds (ps) cycle time
Computer Organization, 105/2, EE/CGU, W.Y. Lin
12
How to Improve Performance
CPU Time = CPU Clock Cycles × Clock Cycle Time
=
CPU Clock Cycles
Clock Rate
So, to improve performance (everything else being equal)
you can either (increase or decrease?)
reducing
________
the # of required cycles for a program, or
reducing
________
the clock cycle time or, said another way,
increasing the clock rate.
________
Hardware designer must often trade off clock rate against
cycle count
Computer Organization, 105/2, EE/CGU, W.Y. Lin
13
CPU Time Example
(p. 34)
Computer A: 2GHz clock, 10s CPU time
Designing Computer B
Aim for 6s CPU time
Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?
Clock CyclesB 1.2 × Clock Cycles A
=
Clock RateB =
CPU Time B
6s
Clock Cycles A = CPU Time A × Clock Rate A
= 10s × 2GHz = 20 × 10 9
1.2 × 20 × 10 9 24 × 10 9
Clock RateB =
=
= 4GHz
6s
6s
Computer Organization, 105/2, EE/CGU, W.Y. Lin
14
How many cycles are required for a
program?
...
6th
5th
4th
3rd instruction
2nd instruction
1st instruction
Could assume that number of cycles
equals number of instructions
time
This assumption is incorrect, different instructions take
different amounts of time on different machines.
Why? hint: remember that these are machine instructions,
not lines of C code
Computer Organization, 105/2, EE/CGU, W.Y. Lin
15
Different numbers of cycles for different
instructions
time
Multiplication takes more time than addition
Floating point operations take longer than integer ones
Accessing memory takes more time than accessing
registers
Important point: changing the cycle time often changes the number of
cycles required for various instructions (more later)
Computer Organization, 105/2, EE/CGU, W.Y. Lin
16
Now that we understand cycles
A given program will require
some number of instructions (machine instructions)
some number of cycles
some number of seconds
We have a vocabulary that relates these quantities:
cycle time (seconds per cycle)
clock rate (cycles per second)
CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Computer Organization, 105/2, EE/CGU, W.Y. Lin
17
Performance
Performance is determined by execution
time
Do any of the other variables equal
performance?
# of cycles to execute program?
# of instructions in program?
# of cycles per second?
average # of cycles per instruction?
average # of instructions per second?
Common pitfall: thinking one of the
variables is indicative of performance when
it really isn’t.
Computer Organization, 105/2, EE/CGU, W.Y. Lin
18
Instruction Performance
(p. 35)
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
Instruction Count × CPI
=
Clock Rate
Instruction Count for a program
Determined by program, ISA and compiler
CPI: Average cycles per instruction
Determined by CPU hardware
If different instructions have different CPI
 Average CPI affected by instruction mix
Computer Organization, 105/2, EE/CGU, W.Y. Lin
19
CPI Example
(p.35)
Suppose we have two implementations of the same
instruction set architecture (ISA).
For some program,
Machine A has a clock cycle time of 250 ps and a CPI of 2.0
Machine B has a clock cycle time of 500 ps and a CPI of 1.2
What machine is faster for this program, and by how much?
If two machines have the same ISA which of our quantities (e.g., clock rate,
CPI, execution time, # of instructions, MIPS) will always be identical?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
20
CPI Example
(p.36)
A is faster…
…by this much
Computer Organization, 105/2, EE/CGU, W.Y. Lin
21
Instruction Classes
(p.37)
Classes of Instructions
Different types/classes of instructions would take various
numbers of cycles to execute



e.g. Add, Sub, and Logic operations may only need one cycle
Multiplication/Division and Floating point operations take much
longer to execute
Data transfer operations may take more than one cycle to execute
n
CPU clock cycles = Σ(CPIi x Ci)
i=1
Where
Ci is the count of the number of instruction of class i executed
CPIi is the average number of cycles per instruction for that
instruction class
n is the number of instruction classes.
Weighted average CPI
n
Clock Cycles
Instruction Count i 

CPI =
= ∑  CPIi ×

Instruction Count i=1 
Instruction Count 
Computer Organization, 105/2, EE/CGU, W.Y. Lin
22
# of Instructions Example
(p. 37)
A compiler designer is trying to decide between two
code sequences for a particular machine. Based on
the hardware implementation, there are three
different classes of instructions: Class A, Class B, and
Class C, and they require one, two, and three cycles
(respectively).
The first code sequence has 5 instructions: 2 of A, 1
of B, and 2 of C
The second sequence has 6 instructions: 4 of A, 1 of
B, and 1 of C.
Which code sequence executes the most instructions?
Which sequence will be faster?
What is the CPI for each sequence?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
23
# of Instructions Example
Computer Organization, 105/2, EE/CGU, W.Y. Lin
24
SPEC CPU Benchmark
(p. 46)
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
Elapsed time to execute a selection of programs
 Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
 CINT2006 (integer) and CFP2006 (floating-point)
n
n
∏ Execution time ratio
i
i=1
Computer Organization, 105/2, EE/CGU, W.Y. Lin
25
Pitfall: Amdahl’s Law
(p. 49)
Improving an aspect of a computer
and expecting a proportional
improvement in overall performance
Timproved =
Taffected
+ Tunaffected
improvemen t factor
But actually, the performance
enhancement possible with a given
improvement is limited by the
amount that the improved feature is
used.
Principle: Make the common case fast!!
Computer Organization, 105/2, EE/CGU, W.Y. Lin
26
Example
Example: multiply accounts for 80s/100s
"Suppose a program runs in 100 seconds on a machine, with
multiply responsible for 80 seconds of this time. How much do we
have to improve the speed of multiplication if we want the program
to run 4 times faster?"
How about making it 5 times faster?
100/4 = 25 = (80/n + 20)
n = 16
We have to make multiply operation 16 times faster in
order to have program runs 4 times faster.
100/5 = 20 = (80/n + 20)
n = NaN
No way to improve on multiply operation only to make the
program runs 5 times faster.
Computer Organization, 105/2, EE/CGU, W.Y. Lin
27
Million Instruction Per Second (MIPS)
(p. 51)
MIPS: Million Instruction Per Second
One alternative to time as the metric
Doesn’t account for
 Differences in ISAs between computers
 Differences in complexity between instructions
MIPS =
=
Instruction count
Execution time × 10 6
Instruction count
Clock rate
=
6
Instruction count × CPI
CPI
×
10
6
× 10
Clock rate
CPI varies between programs on a given
CPU
Computer Organization, 105/2, EE/CGU, W.Y. Lin
28
Pitfall: MIPS as a Performance Metric
Example:
The Computer’s clock rate is 4GHz
Which sequence will be faster according to
MIPS?
Which sequence will be faster according to
execution time?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
29
MIPS Example
1st Compiler’s code
Execution time = (5x109x1 + 1x109x2 + 1x109x3) x
250x10-12 = 2.5 seconds
MIPS = (5+1+1) x109/(2.5x106)= 2800
2nd Compiler’s code
Execution time = (10x109x1 + 1x109x2 + 1x109x3) x
250x10-12 = 3.75 seconds
MIPS = (10+1+1) x109 /(3.75x106)= 3200
2nd Compiler’s code has longer execution time
but also larger MIPS
Computer Organization, 105/2, EE/CGU, W.Y. Lin
30
MIPS Example#2 – Check Yourself
(p. 51)
Measurement
Computer A Computer B
Instruction count
10 billion
8 billion
Clock rate
4 GHz
4 GHz
CPI
1.0
1.1
Which computer has the higher MIPS
rating?
Which computer is faster?
Computer Organization, 105/2, EE/CGU, W.Y. Lin
31
Problems of MIPS
Is MIPS a good index for performance measurement?
MIPS specifies the instruction execution rate but
does not take into account the capabilities of the
instructions
MIPS varies between programs on the same
computer; thus a computer can not have a
single MIPS rating for all programs
MIPS can vary inversely with performance
MIPS is a good index for CPU performance but might
not be suitable for Compiler/Software efficiency.
Computer Organization, 105/2, EE/CGU, W.Y. Lin
32
Performance Summary
The BIG Picture
Instructions Clock cycles Seconds
CPU Time =
×
×
Program
Instruction Clock cycle
Performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC,
CPI, Tc
Computer Organization, 105/2, EE/CGU, W.Y. Lin
33
Processor Performance Equation
CPU time
= Seconds
=
Program
Instructions x
Program
Inst Count
Program
X
CPI
x
Seconds
Instruction
Cycle
Clock Rate
CPI
Compiler
X
(X)
Inst. Set.
X
X
Organization
Cycles
inst count
X
Technology
Computer Organization, 105/2, EE/CGU, W.Y. Lin
Cycle time
X
X
34
Concluding Remarks
Performance is specific to a particular program/s
Total execution time is a consistent summary of
performance => the best performance measure
For a given architecture performance increases come
from:
increases in clock rate (without adverse CPI affects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or
instruction count
Algorithm/Language choices that affect instruction count
Pitfalls:
expecting improvement in one aspect of a machine’s
performance to affect the total performance
Expecting higher MIPS ratio to have better performance.
Computer Organization, 105/2, EE/CGU, W.Y. Lin
35