Lecture 2: Basic Notions and Fundamentals
© Wen-mei Hwu and S. J. Patel, 2005
ECE 511, University of Illinois

Outline
• Basic computer organization
  – von Neumann model and execution cycle
  – Pipelines and caches
• Architectural drivers
  – Technology
  – Applications and compatibility
  – Compilers
• Measures
  – Methodology
  – Key measures

von Neumann’s Contribution
[Figure: block diagram with Memory (holding the program), Control, Datapath, and Input/Output]

Instruction Cycle
• Fetch
• Decode
• Evaluate addresses
• Fetch operands
• Execute
• Store results

Pipelining
[Figure: pipeline stages Fetch, Decode, Execute, Mem, Write]
• Instruction latency does not decrease
• Throughput increases
• Dependencies degrade performance
• General trends: deeper/wider until 2004

Multiple Issue
[Figure: multiple-issue pipeline; labels: Fetch, Decode/Swap, adder, Stall logic, ALU, ALU, BR, Mem, write]

Pipeline of the Pentium IV [Courtesy of Intel Corp]
Stages   Function
1-2      TC nxt IP
3-4      TC Fetch
5        Drv
6        Alloc
7-8      Rename
9        Que
10-12    Sch
13-14    Disp
15-16    RF
17       Ex
18       Flgs
19       BrCk
20       Drv
• Enables very high clock rate
• Enables scalability from one technology to the next
• But does it always lead to highest performance?

Speeding up memory
• Fundamental: Principles of locality of memory references
  – Reference to X → another reference to X later (temporal locality)
  – Reference to X → reference to Y, where X, Y are close (spatial locality)
• Fundamental: Memory tends to be small/fast/expensive or large/slow/cheap
• Result: Memory hierarchies with caching

Basic Cache Organization
[Figure: direct-mapped cache; the address is split into tag, index, and offset; each cache line holds a valid bit, a tag, and data; output is hit/miss]
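The tag/index/offset split shown on the Basic Cache Organization slide can be sketched in a few lines. This is only an illustration: the 64-byte line size, the 256-line capacity, and the names `split_address` and `DirectMappedCache` are assumptions made for the example, not values or names from the lecture.

```python
# Assumed geometry for illustration only (not from the lecture).
LINE_BYTES = 64    # bytes per cache line, must be a power of two
NUM_LINES = 256    # number of lines in the cache, must be a power of two

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2(LINE_BYTES)
INDEX_BITS = NUM_LINES.bit_length() - 1     # log2(NUM_LINES)


def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    """Tracks only valid bits and tags; data storage is omitted."""

    def __init__(self):
        self.valid = [False] * NUM_LINES
        self.tags = [0] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # Each index maps to exactly one line, so a miss evicts it.
            self.valid[index] = True
            self.tags[index] = tag
        return hit
```

Two addresses that share an index but differ in tag evict each other in this organization, which is the conflict behavior a direct-mapped cache trades for its simple lookup.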
CMOS Inverter
[Figure: schematic notation of a CMOS inverter (Input, p-MOS transistor to Vdd, n-MOS transistor to Vss, Output) and cross-section of the inverter in an n-well process (polysilicon gates, gate oxide, field oxide, metal, p+ regions in the n well, n+ regions in the p substrate)]

CMOS Trends
[Collated from the 2000 update of the International Technology Roadmap for Semiconductors]
Year               1995   1998   2001   2004   2007   2010   2013
Feature size (nm)  350    180    130    90     65     45     33
Transistor count   10M    50M    110M   350M   1300M  3500M  11000M
For high-performance CPUs, the feature size is (typically) the minimum width of a gate.

A Microarchitect’s Model of CMOS Power and Delay
[Figure: inverter schematic with Vdd, Input, p-MOS transistor, Output, n-MOS transistor, Vss]
• Delay is proportional to the number of gates
  – Typical measure: FO4 (fan-out-of-4 inverter delay)
• Power dissipation
  – Dynamic (switching)
  – Static (leakage)

CMOS Trends
[Collated/extrapolated from the 2000 update of the International Technology Roadmap for Semiconductors]
Year                                 1995   1998   2001   2004   2007   2010   2013
Feature size (nm)                    350    180    130    90     65     45     33
Transistor count                     10M    50M    110M   350M   1300M  3500M  11000M
Projected FO4 delays per pipe stage  27     13     11     8      7      6      6
This will likely be revised due to the transistor variability problem from the 65nm generation on.

CMOS Trends
• Transistors become more plentiful but variable
• Gates become faster
• Wires become relatively slower
• Memory becomes relatively slower
• Power becomes a critical issue
• Noise, error rates, design complexity
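The "more plentiful" trend can be made concrete with the transistor counts from the roadmap table. The sketch below is a rough back-of-the-envelope calculation, not a method from the lecture; the helper names `annual_growth` and `doubling_time_years` are hypothetical, and it assumes smooth exponential growth between the first and last table entries.

```python
import math

# Values copied from the CMOS Trends table (transistor count in millions).
years = [1995, 1998, 2001, 2004, 2007, 2010, 2013]
transistors_millions = [10, 50, 110, 350, 1300, 3500, 11000]


def annual_growth(first, last, span_years):
    """Compound annual growth rate, assuming smooth exponential growth."""
    return (last / first) ** (1.0 / span_years) - 1.0


def doubling_time_years(rate):
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2) / math.log(1.0 + rate)


rate = annual_growth(transistors_millions[0], transistors_millions[-1],
                     years[-1] - years[0])
print(f"growth per year: {rate:.1%}")
print(f"doubling time:   {doubling_time_years(rate):.2f} years")
```

On the table values this works out to a doubling time of a little under two years.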
Trends in hardware
• High variability
  – Increasing speed and power variability of transistors
  – Limited frequency increase
  – Reliability / verification challenges
[Figure: normalized frequency vs. normalized leakage (Isb) at 130nm; annotations: 30%, 5X]
• Large interconnect delay
  – Increasing interconnect delay and shrinking clock domains
  – Limited size of individual computing engines
[Figure: delay (ps) vs. technology generation (350 to 65); curves: clock period and interconnect RC delay (RC delay of 1mm copper interconnect)]
Source: Shekhar Borkar, Intel

Dynamic-Static Interface
• Moving functionality into compilers
  – Most architectures today, including x86, rely on compilers to achieve performance goals
  – A major issue is the number of bits required to deliver information from the compiler to the runtime hardware
  – Highly optimizing compilers also reduce the incentive for machine-language programming, which makes portable programs a reality
  – The more implementation details one exposes in the instruction set architecture, the more difficult it is to adopt new implementation techniques

Applications
• Applications drive architecture from “above”
• Designing next-generation computers involves understanding the behavior of important applications and exploiting their characteristics
• What are applications of importance?
• For us, we will often use benchmarks to characterize different architectural options
In the image of the simple hardware, humans created complex software
• Software creation based on a simple execution model
  – Instructions are considered to execute sequentially
  – Data objects are mapped into a flat, monolithic store reachable by all
  – Reality when laid out by von Neumann in the 40’s; abstraction now
• This execution abstraction has been used in the development of large, complex software
  – “Traditional software model”

Future apps reflect a concurrent world
• Exciting applications in the future mass computing market have traditionally been considered “supercomputing applications”
  – Physiological simulation – cellular pathways (GE Research)
  – Molecular dynamics simulation (NAMD at UIUC)
  – Video and audio coding and manipulation – MPEG-4 (NCTU)
  – Medical imaging – CT (UIUC)
  – Consumer game and virtual reality products
• These “super-applications” represent and model the physical world
• Various granularities of parallelism exist, but…
  – programming model must support required dimensions
  – data delivery needs careful management

Direction of computer architecture
• Current general purpose architectures cover traditional applications
• New parallel-privatized architectures cover some super-applications
• Attempts to grow current architectures “out” or domain-specific architectures “in” lack success
• By properly exploiting parallelism of super-applications, the coverage of domain-specific architectures can be extended
[Figure: traditional applications, current architecture coverage, new applications, domain-specific architecture coverage, obstacles]

Compatibility
• The case of workstations and servers
  – Relatively open to new architectures.
  – Performance is a major concern.
  – Linux/UNIX is a portable operating system.
  – Current economics model works against new architectures
• The case of personal computers
  – Very tough on new architectures.
  – Windows and Apple OS are not portable.
  – Most applications are distributed in binary code.
  – Price is of more concern than performance.

Experimental Methodology
• Selected real programs by characterizing workload
  – PERFECT Club, SPEC, MediaBench
  – What do these programs do?
  – What input was given to these programs?
  – How are they related to your own workload?
  – What do experimental results mean?
• Require high quality software support.
  – Tremendous variation in capability
• Trace driven simulation vs. re-compilation
• Nothing can replace real-machine measurements

Measures
Iron Law:
  performance = 1 / execution time   (basis of SPEC marks)
              = 1 / (CPI * insts * 1/freq)
              = (IPC * freq) / insts
CPI: cycles per instruction (how is this calculated?)
IPC: instructions per cycle
Other useful measures: average memory access time
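The Iron Law and the related measures can be tried out numerically. The sketch below is illustrative only: the function names, the instruction mix, and all numbers in the usage are made up for the example. Computing CPI as a weighted average over instruction classes is one common answer to the question on the slide, and the average memory access time formula (hit time plus miss rate times miss penalty) is the usual textbook definition, not something stated in the lecture.

```python
def execution_time(insts, cpi, freq_hz):
    """Iron Law: execution time = CPI * insts * (1 / freq), in seconds."""
    return cpi * insts / freq_hz


def average_cpi(mix):
    """Weighted-average CPI from (fraction of instructions, cycles) pairs."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Textbook AMAT: hit time + miss rate * miss penalty (same time unit)."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical workload: 50% 1-cycle, 30% 2-cycle, 20% 4-cycle instructions.
mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = average_cpi(mix)
ipc = 1.0 / cpi

insts = 1_000_000_000      # assumed dynamic instruction count
freq_hz = 2_000_000_000    # assumed 2 GHz clock

time_s = execution_time(insts, cpi, freq_hz)
performance = 1.0 / time_s

# Both forms of the Iron Law give the same performance figure.
assert abs(performance - (ipc * freq_hz) / insts) < 1e-9
```

Expressing the law in terms of CPI or IPC is a matter of convenience; the instruction count and clock frequency terms are the same in both forms.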