EE2011 Computer Organization Lecture 5: Understanding Performance Wen-Yen Lin, Ph.D. Department of Electrical Engineering Chang Gung University Email: [email protected] April 2017 Understanding The Performance Performance Computer Organization, 105/2, EE/CGU, W.Y. Lin 2 Performance (Ch. 1.6) Measure, Report, and Summarize the performance Describe the major factors determining the performance Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) How does the machine's instruction set affect performance? Computer Organization, 105/2, EE/CGU, W.Y. Lin 3 Which of these airplanes has the best performance? (p. 29) Airplane Passengers Boeing 777 375 Boeing 747 470 BAC/Sud Concorde 132 Douglas DC-8-50 146 Range (mi) Speed (mph) 4630 610 4150 610 4000 1350 8720 544 How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8? Any other way to measure the performance? Airplane Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC-8-50 Passenger Throughput (passenger x mph) 228,750 286,700 178,200 79,424 Computer Organization, 105/2, EE/CGU, W.Y. Lin 4 Defining Performance Which airplane has the best performance? Boeing 777 Boeing 777 Boeing 747 Boeing 747 BAC/Sud Concorde BAC/Sud Concorde Douglas DC-8-50 Douglas DC8-50 0 100 200 300 400 0 500 Boeing 777 Boeing 777 Boeing 747 Boeing 747 BAC/Sud Concorde BAC/Sud Concorde Douglas DC-8-50 Douglas DC8-50 500 1000 Cruising Speed (mph) 4000 6000 8000 10000 Cruising Range (miles) Passenger Capacity 0 2000 1500 0 100000 200000 300000 400000 Passengers x mph Computer Organization, 105/2, EE/CGU, W.Y. Lin 5 Computer Performance: TIME, TIME, TIME Response Time (latency) Running a program on two different computers: The faster one is the one that gets the job done first As an individual user, you are interested in reducing Response Time (Execution time or latency), including disk accesses, memory accesses, I/O activities, OS overhead, CPU execution time, etc. How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query? Throughput Several jobs running on several servers in data center: The fastest one is the one that completes the most jobs during a day As a data center manager, you are interested in increasing throughput, the total amount of work done in a given time. How many jobs can the machine run at once? What is the average execution rate? How much work is getting done? Computer Organization, 105/2, EE/CGU, W.Y. Lin 6 Response Time and Throughput (pp. 30) Response Time How long it takes to do a task Throughput Total work done per unit time E.g., tasks/trnasactions/… per hour If we upgrade a machine with a new faster processor what do we increase? Tasks can be done faster Both response time and throughput are increased If we add a new machine/processor to the lab/machine what do we increase? A single task execution time is not affected Overall throughput is increased However, if the task needs to compete processing resources with others, it may need to wait in the queue. Hence, increasing throughput would sometimes also reduce the response time Changing either response time or throughput often affects the other We will focus on response time for now … Computer Organization, 105/2, EE/CGU, W.Y. Lin 7 Book's Definition of Performance (p. 30) For some program running on machine X, PerformanceX = 1 / Execution timeX PerformanceX is greater than PerformanceY means Execution TimeY is longer than Execution TimeX "X is n times faster than Y" PerformanceX / PerformanceY = n Execution TimeY is n times longer than Execution TimeX Computer Organization, 105/2, EE/CGU, W.Y. Lin 8 Relative Performance (p. 31) Problem: machine A runs a program in 10 seconds machine B runs the same program in 15 seconds PerformanceA Execution TimeB = = 1.5 PerformanceB Execution TimeA A is 1.5 times faster than B To avoid the potential confusion between the terms, increasing and decreasing, we usually say “improve performance” or “improve execution time” instead of “increasing performance” or “decreasing execution time”. Computer Organization, 105/2, EE/CGU, W.Y. Lin 9 Measuring Performance (p. 32) Execution Time Elapsed Time Total response time, including all aspects counts everything: processing, I/O, OS overhead, idle time Determines system performance a useful number, but often not good for comparison purposes CPU time Time spent processing a given job doesn't count I/O or time spent running other programs can be broken up into system CPU time, and user CPU time Different programs are affected differently by CPU and system performance Our focus: User CPU time time spent executing the lines of code that are "in" our program Computer Organization, 105/2, EE/CGU, W.Y. Lin 10 Clock Cycles (pp. 34) Instead of reporting execution time in seconds, we often use cycles seconds cycles seconds = × program program cycle Clock “ticks” indicate when to start activities (one abstraction): time Computer Organization, 105/2, EE/CGU, W.Y. Lin 11 CPU Clocking Operation of digital hardware governed by a constant-rate clock Clock period Clock (cycles) Data transfer and computation Update state Clock period (cycle time): duration of a clock cycle E.g., 250ps = 0.25ns = 250x10 -12s Clock frequency (rate): cycles per second E.g., 4.0GHz = 4000MHz = 4.0x10 9Hz A 4 Ghz. clock has a 1 4 ×109 ×1012 = 250 picoseconds (ps) cycle time Computer Organization, 105/2, EE/CGU, W.Y. Lin 12 How to Improve Performance CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles Clock Rate So, to improve performance (everything else being equal) you can either (increase or decrease?) reducing ________ the # of required cycles for a program, or reducing ________ the clock cycle time or, said another way, increasing the clock rate. ________ Hardware designer must often trade off clock rate against cycle count Computer Organization, 105/2, EE/CGU, W.Y. Lin 13 CPU Time Example (p. 34) Computer A: 2GHz clock, 10s CPU time Designing Computer B Aim for 6s CPU time Can do faster clock, but causes 1.2 × clock cycles How fast must Computer B clock be? Clock CyclesB 1.2 × Clock Cycles A = Clock RateB = CPU Time B 6s Clock Cycles A = CPU Time A × Clock Rate A = 10s × 2GHz = 20 × 10 9 1.2 × 20 × 10 9 24 × 10 9 Clock RateB = = = 4GHz 6s 6s Computer Organization, 105/2, EE/CGU, W.Y. Lin 14 How many cycles are required for a program? ... 6th 5th 4th 3rd instruction 2nd instruction 1st instruction Could assume that number of cycles equals number of instructions time This assumption is incorrect, different instructions take different amounts of time on different machines. Why? hint: remember that these are machine instructions, not lines of C code Computer Organization, 105/2, EE/CGU, W.Y. Lin 15 Different numbers of cycles for different instructions time Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Important point: changing the cycle time often changes the number of cycles required for various instructions (more later) Computer Organization, 105/2, EE/CGU, W.Y. Lin 16 Now that we understand cycles A given program will require some number of instructions (machine instructions) some number of cycles some number of seconds We have a vocabulary that relates these quantities: cycle time (seconds per cycle) clock rate (cycles per second) CPI (cycles per instruction) a floating point intensive application might have a higher CPI MIPS (millions of instructions per second) this would be higher for a program using simple instructions Computer Organization, 105/2, EE/CGU, W.Y. Lin 17 Performance Performance is determined by execution time Do any of the other variables equal performance? # of cycles to execute program? # of instructions in program? # of cycles per second? average # of cycles per instruction? average # of instructions per second? Common pitfall: thinking one of the variables is indicative of performance when it really isn’t. Computer Organization, 105/2, EE/CGU, W.Y. Lin 18 Instruction Performance (p. 35) Clock Cycles = Instruction Count × Cycles per Instruction CPU Time = Instruction Count × CPI × Clock Cycle Time Instruction Count × CPI = Clock Rate Instruction Count for a program Determined by program, ISA and compiler CPI: Average cycles per instruction Determined by CPU hardware If different instructions have different CPI Average CPI affected by instruction mix Computer Organization, 105/2, EE/CGU, W.Y. Lin 19 CPI Example (p.35) Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 What machine is faster for this program, and by how much? If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical? Computer Organization, 105/2, EE/CGU, W.Y. Lin 20 CPI Example (p.36) A is faster… …by this much Computer Organization, 105/2, EE/CGU, W.Y. Lin 21 Instruction Classes (p.37) Classes of Instructions Different types/classes of instructions would take various numbers of cycles to execute e.g. Add, Sub, and Logic operations may only need one cycle Multiplication/Division and Floating point operations take much longer to execute Data transfer operations may take more than one cycle to execute n CPU clock cycles = Σ(CPIi x Ci) i=1 Where Ci is the count of the number of instruction of class i executed CPIi is the average number of cycles per instruction for that instruction class n is the number of instruction classes. Weighted average CPI n Clock Cycles Instruction Count i CPI = = ∑ CPIi × Instruction Count i=1 Instruction Count Computer Organization, 105/2, EE/CGU, W.Y. Lin 22 # of Instructions Example (p. 37) A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which code sequence executes the most instructions? Which sequence will be faster? What is the CPI for each sequence? Computer Organization, 105/2, EE/CGU, W.Y. Lin 23 # of Instructions Example Computer Organization, 105/2, EE/CGU, W.Y. Lin 24 SPEC CPU Benchmark (p. 46) Programs used to measure performance Supposedly typical of actual workload Standard Performance Evaluation Corp (SPEC) Develops benchmarks for CPU, I/O, Web, … SPEC CPU2006 Elapsed time to execute a selection of programs Negligible I/O, so focuses on CPU performance Normalize relative to reference machine Summarize as geometric mean of performance ratios CINT2006 (integer) and CFP2006 (floating-point) n n ∏ Execution time ratio i i=1 Computer Organization, 105/2, EE/CGU, W.Y. Lin 25 Pitfall: Amdahl’s Law (p. 49) Improving an aspect of a computer and expecting a proportional improvement in overall performance Timproved = Taffected + Tunaffected improvemen t factor But actually, the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. Principle: Make the common case fast!! Computer Organization, 105/2, EE/CGU, W.Y. Lin 26 Example Example: multiply accounts for 80s/100s "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster? 100/4 = 25 = (80/n + 20) n = 16 We have to make multiply operation 16 times faster in order to have program runs 4 times faster. 100/5 = 20 = (80/n + 20) n = NaN No way to improve on multiply operation only to make the program runs 5 times faster. Computer Organization, 105/2, EE/CGU, W.Y. Lin 27 Million Instruction Per Second (MIPS) (p. 51) MIPS: Million Instruction Per Second One alternative to time as the metric Doesn’t account for Differences in ISAs between computers Differences in complexity between instructions MIPS = = Instruction count Execution time × 10 6 Instruction count Clock rate = 6 Instruction count × CPI CPI × 10 6 × 10 Clock rate CPI varies between programs on a given CPU Computer Organization, 105/2, EE/CGU, W.Y. Lin 28 Pitfall: MIPS as a Performance Metric Example: The Computer’s clock rate is 4GHz Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time? Computer Organization, 105/2, EE/CGU, W.Y. Lin 29 MIPS Example 1st Compiler’s code Execution time = (5x109x1 + 1x109x2 + 1x109x3) x 250x10-12 = 2.5 seconds MIPS = (5+1+1) x109/(2.5x106)= 2800 2nd Compiler’s code Execution time = (10x109x1 + 1x109x2 + 1x109x3) x 250x10-12 = 3.75 seconds MIPS = (10+1+1) x109 /(3.75x106)= 3200 2nd Compiler’s code has longer execution time but also larger MIPS Computer Organization, 105/2, EE/CGU, W.Y. Lin 30 MIPS Example#2 – Check Yourself (p. 51) Measurement Computer A Computer B Instruction count 10 billion 8 billion Clock rate 4 GHz 4 GHz CPI 1.0 1.1 Which computer has the higher MIPS rating? Which computer is faster? Computer Organization, 105/2, EE/CGU, W.Y. Lin 31 Problems of MIPS Is MIPS a good index for performance measurement? MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions MIPS varies between programs on the same computer; thus a computer can not have a single MIPS rating for all programs MIPS can vary inversely with performance MIPS is a good index for CPU performance but might not be suitable for Compiler/Software efficiency. Computer Organization, 105/2, EE/CGU, W.Y. Lin 32 Performance Summary The BIG Picture Instructions Clock cycles Seconds CPU Time = × × Program Instruction Clock cycle Performance depends on Algorithm: affects IC, possibly CPI Programming language: affects IC, CPI Compiler: affects IC, CPI Instruction set architecture: affects IC, CPI, Tc Computer Organization, 105/2, EE/CGU, W.Y. Lin 33 Processor Performance Equation CPU time = Seconds = Program Instructions x Program Inst Count Program X CPI x Seconds Instruction Cycle Clock Rate CPI Compiler X (X) Inst. Set. X X Organization Cycles inst count X Technology Computer Organization, 105/2, EE/CGU, W.Y. Lin Cycle time X X 34 Concluding Remarks Performance is specific to a particular program/s Total execution time is a consistent summary of performance => the best performance measure For a given architecture performance increases come from: increases in clock rate (without adverse CPI affects) improvements in processor organization that lower CPI compiler enhancements that lower CPI and/or instruction count Algorithm/Language choices that affect instruction count Pitfalls: expecting improvement in one aspect of a machine’s performance to affect the total performance Expecting higher MIPS ratio to have better performance. Computer Organization, 105/2, EE/CGU, W.Y. Lin 35
© Copyright 2026 Paperzz