2. Basic Notions and Fundamentals

Lecture 2: Basic Notions and Fundamentals
© Wen-mei Hwu and S. J. Patel, 2005
ECE 511, University of Illinois
Outline
• Basic computer organization
– von Neumann model and execution cycle
– Pipelines and caches
• Architectural drivers
– Technology
– Applications and compatibility
– Compilers
• Measures
– Methodology
– Key measures
von Neumann’s Contribution
[Figure: the von Neumann model. Blocks: Memory (holding the program), Control, Datapath, Input/Output.]
Instruction Cycle
• Fetch
• Decode
• Evaluate addresses
• Fetch operands
• Execute
• Store results
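The six steps above can be sketched as a toy interpreter. This is only an illustration: the accumulator machine, its four opcodes (LOAD, ADD, STORE, HALT), and the direct-addressing-only memory layout are assumptions made for the example, not part of the lecture.

```python
# Toy von Neumann machine: program and data share one memory.
# The instruction set below is hypothetical and exists only for illustration.

def run(memory):
    """Execute instructions from `memory` until HALT; return (memory, trace)."""
    pc = 0          # program counter
    acc = 0         # single accumulator register
    trace = []
    while True:
        instr = memory[pc]                  # Fetch
        pc += 1
        opcode, operand = instr             # Decode
        if opcode == "HALT":
            trace.append(opcode)
            break
        address = operand                   # Evaluate address (direct addressing)
        if opcode == "LOAD":
            acc = memory[address]           # Fetch operand, then move it to acc
        elif opcode == "ADD":
            acc = acc + memory[address]     # Fetch operand, then Execute
        elif opcode == "STORE":
            memory[address] = acc           # Store result
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        trace.append(opcode)
    return memory, trace


# Addresses 0-3 hold instructions, addresses 4-6 hold data.
memory = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", None),
    2,
    3,
    0,
]
final_memory, trace = run(memory)
print(final_memory[6])   # sum of the two data words
```

Because instructions and data live in the same list, a STORE could overwrite an instruction, which is the stored-program property of the von Neumann model.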
Pipelining
[Figure: five-stage pipeline. Stages: Fetch, Decode, Execute, Mem, Write.]
• Instruction latency does not decrease
• Throughput increases
• Dependencies degrade performance
• General trends: deeper/wider until 2004
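The latency and throughput claims can be checked with a small cycle-count model. This is a sketch under simplifying assumptions that the slides do not state: every stage takes one cycle, the pipeline issues one instruction per cycle, and dependencies are modeled only as a lump sum of stall cycles. The function names are hypothetical.

```python
def unpipelined_cycles(n_insts, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_insts * n_stages


def pipelined_cycles(n_insts, n_stages, stall_cycles=0):
    """The first instruction needs n_stages cycles to drain through the pipe;
    every later instruction completes one cycle after its predecessor.
    Stalls caused by dependencies simply add cycles."""
    return n_stages + (n_insts - 1) + stall_cycles


n, k = 1000, 5
base = unpipelined_cycles(n, k)                        # 5000 cycles
ideal = pipelined_cycles(n, k)                         # 1004 cycles
stalled = pipelined_cycles(n, k, stall_cycles=200)     # 1204 cycles

print("latency per instruction (cycles):", k)          # unchanged by pipelining
print("ideal speedup:", base / ideal)                  # approaches k for large n
print("speedup with stalls:", base / stalled)
```

The per-instruction latency stays at k cycles in both cases; only the completion rate improves, and stalls eat directly into that gain.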
Multiple Issue
[Figure: multiple-issue pipeline. Labels: Fetch, Decode/Swap, adder, ALU (x2), BR, Mem, write (x3), Stall logic.]
Pipeline of the Pentium IV
[Courtesy of Intel Corp]
Stage    Name
1-2      TC nxt IP
3-4      TC Fetch
5        Drv
6        Alloc
7-8      Rename
9        Que
10-12    Sch
13-14    Disp
15-16    RF
17       Ex
18       Flgs
19       BrCk
20       Drv
• Enables very high clock rate
• Enables scalability from one technology to
the next
• But does it always lead to highest
performance?
Speeding up memory
• Fundamental: Principles of locality of memory
references
– Temporal: reference to X → another reference to X later
– Spatial: reference to X → reference to Y, where X and Y are close
• Fundamental: Memory tends to be
small/fast/expensive or large/slow/cheap
• Result: Memory hierarchies with caching
Basic Cache Organization
[Figure: Direct Mapped Cache. The address is split into tag, index, and offset fields. Each cache line holds a valid bit, a tag, and data; the tag comparison produces hit/miss.]
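A direct-mapped lookup can be sketched in a few lines. The sizes used here (8 lines of 16 bytes) and the class and method names are assumptions chosen for the example; the model tracks only valid bits and tags, not the data itself.

```python
class DirectMappedCache:
    """Tag/valid bookkeeping for a direct-mapped cache (no data storage)."""

    def __init__(self, num_lines=8, line_bytes=16):
        # Both sizes are assumed to be powers of two.
        self.offset_bits = line_bytes.bit_length() - 1
        self.index_bits = num_lines.bit_length() - 1
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def split(self, address):
        """Split an address into (tag, index, offset) fields."""
        offset = address & ((1 << self.offset_bits) - 1)
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = self.split(address)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(num_lines=8, line_bytes=16)
addresses = [0x00, 0x04, 0x08, 0x80, 0x00]
results = [cache.access(a) for a in addresses]
print(results)
```

In this trace, 0x04 and 0x08 hit because they fall in the line just fetched for 0x00 (spatial locality). Address 0x80 maps to the same index with a different tag, so it evicts that line, and the final access to 0x00 misses again: a conflict miss, the characteristic weakness of direct mapping.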
CMOS Inverter
[Figure: schematic notation of a CMOS inverter. Labels: Vdd, p-MOS trans, Input, Output, n-MOS trans, Vss.]
[Figure: cross-section of inverter in n-well process. Labels: p substrate, n well, p+ and n+ regions, polysilicon gates, gate oxide, field oxide, metal, Vdd, Vss.]
CMOS Trends
[Collated from the 2000 update of the
International Technology Roadmap for Semiconductors]
Year                 1995   1998   2001   2004   2007    2010    2013
Feature size (nm)    350    180    130    90     65      45      33
Transistor count     10M    50M    110M   350M   1300M   3500M   11000M
For high-performance CPUs, the feature size
is (typically) the minimum width of a gate
A Microarchitect’s Model of
CMOS Power and Delay
[Figure: CMOS inverter schematic. Labels: Vdd, p-MOS trans, Input, Output, n-MOS trans, Vss.]
• Delay is proportional
to the number of gates
– typical measure: FO4
• Power dissipation
– dynamic (switching)
– static (leakage)
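The two power components are commonly approximated with first-order formulas: dynamic power as activity × capacitance × Vdd² × frequency, and static power as Vdd × leakage current. These formulas are the standard textbook model rather than something stated on the slide, and every number below is a made-up illustrative value, not data for any real chip.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power: alpha * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def static_power_w(vdd_v, leakage_current_a):
    """First-order leakage power: Vdd * I_leak."""
    return vdd_v * leakage_current_a


# Hypothetical operating point (all values are assumptions).
activity = 0.1          # fraction of capacitance switched per cycle
capacitance = 100e-9    # total switchable capacitance, farads
vdd = 1.2               # supply voltage, volts
freq = 3e9              # clock frequency, hertz
leakage = 10.0          # total leakage current, amperes

p_dyn = dynamic_power_w(activity, capacitance, vdd, freq)
p_static = static_power_w(vdd, leakage)
print("dynamic power (W):", p_dyn)
print("static power (W):", p_static)

# Halving Vdd at a fixed frequency cuts dynamic power to one quarter.
p_dyn_half_vdd = dynamic_power_w(activity, capacitance, vdd / 2, freq)
print("dynamic power at Vdd/2 (W):", p_dyn_half_vdd)
```

The quadratic dependence on Vdd is why voltage scaling has been the main lever on switching power, while leakage grows in importance as transistors shrink.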
CMOS Trends
[Collated/extrapolated from the 2000 update of the
International Technology Roadmap for Semiconductors]
Year                   1995   1998   2001   2004   2007    2010    2013
Feature size (nm)      350    180    130    90     65      45      33
Transistor count       10M    50M    110M   350M   1300M   3500M   11000M
Projected FO4 delays
in each pipe stage     27     13     11     8      7       6       6
This will likely be revised due to the transistor variability problem from the 65nm generation on.
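To get a feel for what the FO4 budget per pipe stage implies, the sketch below converts it into a rough clock frequency. It rests on an assumption that is not in the slides: a rule of thumb that one FO4 delay is about 360 ps per micron of feature size. It also ignores latch overhead and clock skew, so the numbers are indicative at best. The pairing of FO4 counts with years is read off the table above.

```python
# ASSUMPTION (not from the slides): FO4 delay ~ 360 ps per micron of feature size.
FO4_PS_PER_MICRON = 360.0


def fo4_delay_ps(feature_nm):
    """Estimated delay of one fanout-of-4 inverter, in picoseconds."""
    return FO4_PS_PER_MICRON * feature_nm / 1000.0


def clock_ghz(feature_nm, fo4_per_stage):
    """Clock frequency if one pipe stage is exactly fo4_per_stage FO4 delays."""
    period_ps = fo4_per_stage * fo4_delay_ps(feature_nm)
    return 1000.0 / period_ps


# (year, feature size in nm, projected FO4 delays per pipe stage)
roadmap = [
    (1995, 350, 27),
    (1998, 180, 13),
    (2001, 130, 11),
    (2004, 90, 8),
    (2007, 65, 7),
    (2010, 45, 6),
    (2013, 33, 6),
]

for year, feature_nm, fo4 in roadmap:
    print(year, f"{clock_ghz(feature_nm, fo4):.2f} GHz")
```

Under these assumptions, frequency gains come from two sources at once: faster gates (smaller feature size) and fewer gates per stage (deeper pipelines). The second source flattens out once the FO4 count stops shrinking.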
CMOS Trends
• Transistors become more plentiful but variable
• Gates become faster
• Wires become relatively slower
• Memory becomes relatively slower
• Power becomes a critical issue
• Noise, error rates, design complexity
Trends in hardware
• High variability
– Increasing speed and power variability of transistors
– Limited frequency increase
– Reliability / verification challenges
• Large interconnect delay
– Increasing interconnect delay and shrinking clock domains
– Limited size of individual computing engines
[Figure: Normalized Frequency vs. Normalized Leakage (Isb) at 130nm; annotations: 30%, 5X.]
[Figure: Interconnect RC Delay. Delay (ps) vs. technology generation (350, 250, 180, 130, 90, 65); curves: Clock Period, RC delay of 1mm interconnect (Copper Interconnect).]
Source: Shekhar Borkar, Intel
Dynamic-Static Interface
• Moving functionality into compilers
– Most architectures today, including x86, rely on
compilers to achieve performance goals.
– A major issue is the number of bits required to
deliver information from the compiler to the runtime
hardware.
– Highly optimizing compilers have also reduced the
incentive for machine-language programming, which
makes portable programs a reality.
– The more implementation details one exposes in the
instruction set architecture, the more difficult it is to
adopt new implementation techniques.
Applications
• Applications drive architecture from “above”
• Designing next-generation computers involves
understanding the behavior of important applications
and exploiting their characteristics
• Which applications are important?
• We will often use benchmarks to characterize
different architectural options
In the image of the simple hardware,
humans created complex software
• Software creation based on a
simple execution model
– Instructions are considered to execute
sequentially
– Data objects are mapped into a flat,
monolithic store reachable by all
– This was reality when laid out by von
Neumann in the 1940s; it is an abstraction
now
• This execution abstraction has
been used in development of
large, complex software
– “Traditional software model”
Future apps reflect a concurrent world
• Exciting applications in the future mass computing market
have traditionally been considered “supercomputing
applications”
– Physiological simulation: cellular pathways (GE Research)
– Molecular dynamics simulation (NAMD at UIUC)
– Video and audio coding and manipulation: MPEG-4 (NCTU)
– Medical imaging: CT (UIUC)
– Consumer game and virtual reality products
• These “super-applications” represent and model
the physical world
• Various granularities of parallelism exist, but…
– programming model must support required dimensions
– data delivery needs careful management
Direction of computer architecture
• Current general purpose architectures cover traditional applications
• New parallel-privatized architectures cover some super-applications
• Attempts to grow current architectures “out” or domain-specific
architectures “in” lack success
• By properly exploiting parallelism of super-applications, the coverage of
domain-specific architectures can be extended
[Figure: coverage diagram. Labels: Traditional applications, Current architecture coverage, New applications, Domain-specific architecture coverage, Obstacles.]
Compatibility
• The case of workstations and servers
– Relatively open to new architectures.
– Performance is a major concern.
– Linux/UNIX is a portable operating system.
– The current economic model works against new
architectures.
• The case of personal computers
– Very tough on new architectures.
– Windows and Apple OS are not portable.
– Most applications are distributed in binary code.
– Price is of more concern than performance.
Experimental Methodology
• Select real programs by characterizing the
workload
– PERFECT Club, SPEC, MediaBench
– What do these programs do?
– What input was given to these programs?
– How are they related to your own workload?
– What do experimental results mean?
• Require high quality software support.
– Tremendous variation in capability
• Trace driven simulation vs. re-compilation
• Nothing can replace real-machine
measurements
Measures
Iron Law:
performance = 1 / execution time
            = 1 / (CPI * insts * 1/freq)
            = (IPC * freq) / insts
(Basis of SPEC Marks)
CPI: cycles per instruction (how is this calculated?)
IPC: instructions per cycle
Other useful measures: average memory access time
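The Iron Law can be worked through numerically. In this sketch CPI is computed the usual way, as total cycles divided by instructions executed, and average memory access time uses the standard formula hit time + miss rate × miss penalty. That formula does not appear on the slide, and all input numbers are hypothetical.

```python
def cpi(total_cycles, insts):
    """Cycles per instruction, measured over a whole run."""
    return total_cycles / insts


def execution_time_s(insts, cycles_per_inst, freq_hz):
    """Iron Law: time = insts * CPI * (1 / freq)."""
    return insts * cycles_per_inst / freq_hz


def performance(insts, cycles_per_inst, freq_hz):
    """performance = 1 / execution time = (IPC * freq) / insts."""
    return 1.0 / execution_time_s(insts, cycles_per_inst, freq_hz)


def amat_cycles(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical run: 2 billion instructions in 3 billion cycles at 2 GHz.
insts, cycles, freq = 2_000_000_000, 3_000_000_000, 2e9
run_cpi = cpi(cycles, insts)          # 1.5
run_ipc = 1.0 / run_cpi
run_time = execution_time_s(insts, run_cpi, freq)

print("CPI:", run_cpi)
print("IPC:", run_ipc)
print("execution time (s):", run_time)
print("performance (1/s):", performance(insts, run_cpi, freq))
print("AMAT (cycles):", amat_cycles(hit_time=1, miss_rate=0.05, miss_penalty=100))
```

The three factors are not independent: a deeper pipeline may raise frequency while also raising CPI, so only the product tells you whether performance improved.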