Introduction to GPU architecture

Parallel programming:
Introduction to GPU architecture
Sylvain Collange
Inria Rennes – Bretagne Atlantique
[email protected]
GPU internals
What makes a GPU tick?
NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering!
2
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
3
The free lunch era... was yesterday
1980s to 2002: Moore's law, Dennard scaling, micro-architecture improvements
Exponential performance increase
Software compatibility preserved
Do not rewrite software, buy a new machine!
Hennessy and Patterson. Computer Architecture: A Quantitative Approach, 4th ed., 2006.
4
Computer architecture crash course
How does a processor work?
Or was working in the 1980s to 1990s:
modern processors are much more complicated!
An attempt to sum up 30 years of research in 15 minutes
5
Machine language: instruction set
Registers: state for computations; keep variables and temporaries. Example: R0, R1, R2, R3... R31
Instructions: perform computations on registers, move data between registers and memory, branch…
Instruction word: binary representation of an instruction. Example: 01100111
Assembly language: readable form of machine language. Example: ADD R1, R3
Examples: which instruction set does your laptop/desktop run? Your cell phone?
6
The Von Neumann processor
[Diagram: the PC and fetch unit (+1) read the instruction word from memory; the decoder extracts the operation and operands; the register file, branch unit, arithmetic and logic unit and load/store unit are connected by the result bus; a state machine controls the whole datapath]
Let's look at it step by step
7
Step by step: Fetch
The processor maintains a Program Counter (PC)
Fetch: read the instruction word pointed by PC in memory
[Diagram: PC → fetch unit reads instruction word 01100111 from memory]
8
Decode
Split the instruction word to understand what it represents
Which operation? → ADD
Which operands? → R1, R3
[Diagram: the decoder splits instruction word 01100111 into operation ADD and operands R1, R3]
9
Read operands
Get the value of registers R1, R3 from the register file
[Diagram: the register file returns the values 42 and 17 for R1 and R3]
10
Execute operation
Compute the result: 42 + 17
[Diagram: the arithmetic and logic unit computes ADD on 42 and 17, producing 59]
11
Write back
Write the result back to the register file
[Diagram: the result 59 is written to register R1 over the result bus]
12
Increment PC
[Diagram: the fetch unit adds 1 to the PC so it points to the next instruction]
13
Load or store instruction
Can read and write memory from a computed address
[Diagram: the load/store unit reads or writes memory at the computed address]
14
Branch instruction
Instead of incrementing PC, set it to a computed value
[Diagram: the branch unit writes the computed target address into the PC]
15
What about the state machine?
The state machine controls every unit
Sequences the successive steps
Sends signals to the units depending on the current state
At every clock tick, switches to the next state
The clock is a periodic signal (as fast as possible)
[State diagram: Fetch–decode → Read operands → Execute → Writeback → Increment PC → back to Fetch–decode]
16
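To make the fetch–decode–execute sequence concrete, here is a minimal software sketch of one cycle of such a toy processor. The 8-bit instruction encoding, the ADD-only instruction set and the register count are invented for illustration and do not correspond to any real machine.

#include <cstdint>
#include <vector>

// Toy processor sketch: hypothetical 8-bit instruction word, ADD-only.
// Invented encoding: opcode in bits 7-6, destination register in bits 5-3,
// source register in bits 2-0.
enum Opcode { ADD = 0b01 };

struct Processor {
    uint32_t pc = 0;                     // Program Counter
    uint32_t regs[8] = {};               // register file R0..R7
    std::vector<uint8_t> memory;         // unified instruction/data memory

    void step() {                        // one pass of the state machine
        uint8_t insn = memory[pc];       // Fetch: read instruction word at PC
        unsigned op = insn >> 6;         // Decode: which operation?
        unsigned rd = (insn >> 3) & 7;   //         which destination register?
        unsigned rs = insn & 7;          //         which source register?
        if (op == ADD) {
            uint32_t a = regs[rd];       // Read operands from the register file
            uint32_t b = regs[rs];
            regs[rd] = a + b;            // Execute, then write back on the result bus
        }
        pc += 1;                         // Increment PC (no branch unit in this toy)
    }
};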
Recap
We can build a real processor
As it was in the early 1980s
(You are here)
How did processors become faster?
17
Reason 1: faster clock
Progress in semiconductor technology
allows higher frequencies
Frequency
scaling
But this is not enough!
18
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
19
Going faster using ILP: pipeline
Idea: we do not have to wait until instruction n has finished
to start instruction n+1
Like a factory assembly line
Or the queue at the university cafeteria (bandejão)
20
Pipelined processor
Program
1: add r1, r3
2: mul r2, r3
3: load r3, [r1]
[Pipeline: Fetch ← 1: add; Decode, Execute, Writeback still empty]
Independent instructions can follow each other
Exploits ILP to hide instruction latency
21
Pipelined processor
Program
1: add r1, r3
2: mul r2, r3
3: load r3, [r1]
[Pipeline: Fetch ← 2: mul; Decode ← 1: add]
Independent instructions can follow each other
Exploits ILP to hide instruction latency
22
Pipelined processor
Program
1: add r1, r3
2: mul r2, r3
3: load r3, [r1]
[Pipeline: Fetch ← 3: load; Decode ← 2: mul; Execute ← 1: add]
Independent instructions can follow each other
Exploits ILP to hide instruction latency
23
Superscalar execution
Multiple execution units in parallel
Independent instructions can execute at the same time
[Diagram: a single Fetch stage feeding several Decode → Execute → Writeback pipelines]
Exploits ILP to increase throughput
24
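As an illustration (not from the slides) of how independent instructions expose ILP, the following C++ sketch splits a sum into two independent accumulators; a pipelined or superscalar core can overlap the two dependence chains.

// Splitting the sum into two independent accumulators creates two dependence
// chains with no dependency between them, so a superscalar core can issue
// both adds in the same cycle and a pipeline does not stall between them.
float sum_two_chains(const float* x, int n) {
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        sum0 += x[i];        // chain 1
        sum1 += x[i + 1];    // chain 2, independent of chain 1
    }
    if (n % 2 != 0)
        sum0 += x[n - 1];    // leftover element if n is odd
    return sum0 + sum1;
}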
Locality
Time to access main memory: ~200 clock cycles
One memory access every few instructions
Are we doomed?
Fortunately: principle of locality
~90% of memory accesses on ~10% of data
Accessed locations are often the same
Temporal locality
Access the same location at different times
Spatial locality
Access locations close to each other
26
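A minimal sketch of spatial locality, assuming a row-major N×N matrix: the row-by-row traversal touches consecutive addresses, the column-by-column one does not, and typically becomes much slower once the matrix no longer fits in cache.

// N x N matrix stored in row-major order: element (i, j) is at a[i * N + j].
void scale_rows(float* a, int N, float s) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i * N + j] *= s;    // consecutive addresses: good spatial locality
}

void scale_columns(float* a, int N, float s) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            a[i * N + j] *= s;    // stride of N floats: poor spatial locality
}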
Caches
Large memories are slower than small memories
The computer theorists lied to you:
in the real world, access in an array of size n costs O(log n), not O(1)!
Think about looking up a book in a small or huge library
Idea: put frequently-accessed data in small, fast memory
Can be applied recursively: hierarchy with multiple levels of cache
L1 cache: capacity 64 KB, access time 2 ns
L2 cache: 1 MB, 10 ns
L3 cache: 8 MB, 30 ns
Memory: 8 GB, 60 ns
27
Branch prediction
What if we have a branch?
We do not know the next PC to fetch from until the branch executes
Solution 1: wait until the branch is resolved
Problem: programs have 1 branch every 5 instructions on average
We would spend most of our time waiting
Solution 2: predict (guess) the most likely direction
If correct, we have bought some time
If wrong, just go back and start over
Modern CPUs can correctly predict over 95% of branches
World record holder: 1.691 mispredictions / 1000 instructions
General concept: speculation
P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-poTAGE+SC predictor." JWAC-4: Championship Branch Prediction, 2014.
28
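As a toy illustration of prediction (a textbook 2-bit saturating-counter scheme, far simpler than the TAGE-class predictor cited above), each branch address indexes a small table of counters; the prediction is "taken" when the counter is in its upper half, and the counter is nudged toward the actual outcome.

#include <array>
#include <cstdint>

// Toy 2-bit saturating-counter predictor: counter values 0-1 predict not taken,
// 2-3 predict taken; the table is indexed by the low bits of the branch address.
struct BimodalPredictor {
    std::array<uint8_t, 4096> counters{};

    bool predict(uint64_t branch_pc) const {
        return counters[branch_pc % counters.size()] >= 2;
    }
    void update(uint64_t branch_pc, bool taken) {
        uint8_t& c = counters[branch_pc % counters.size()];
        if (taken && c < 3) ++c;      // reinforce "taken"
        if (!taken && c > 0) --c;     // reinforce "not taken"
    }
};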
Example CPU: Intel Core i7 Haswell
Up to 192 instructions
in flight
May be 48 predicted branches
ahead
Up to 8 instructions/cycle
executed out of order
About 25 pipeline stages at
~4 GHz
Quiz: how far does light travel during the 0.25 ns of a clock cycle?
Too complex to explain in 1
slide, or even 1 lecture
David Kanter, Intel's Haswell CPU architecture, RealWorldTech, 2012
http://www.realworldtech.com/haswell-cpu/
29
Recap
Many techniques to run sequential programs
as fast as possible
Discovers and exploits parallelism between instructions
Speculates to remove dependencies
Works on existing binary programs,
without rewriting or re-compiling
Upgrading hardware is cheaper than improving software
Extremely complex machine
30
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
31
Technology evolution
Memory wall
Memory speed does not increase as fast as computing speed
More and more difficult to hide memory latency
[Graph: compute vs. memory performance, gap widening over time]
Power wall
Power consumption of transistors does not decrease as fast as density increases
Performance is now limited by power consumption
[Graph: transistor density keeps rising while per-transistor power falls too slowly, so total power grows over time]
ILP wall
Law of diminishing returns on Instruction-Level Parallelism
Pollack's rule: cost ≃ performance²
[Graph: cost vs. serial performance]
32
Usage changes
New applications demand
parallel processing
Computer games: 3D graphics
Search engines, social networks…
“big data” processing
New computing devices are
power-constrained
Laptops, cell phones, tablets…
Small, light, battery-powered
Datacenters
High power supply
and cooling costs
33
Latency vs. throughput
Latency: time to solution
CPUs: minimize time, at the expense of power
Throughput: quantity of tasks processed per unit of time
GPUs: assume unlimited parallelism, minimize energy per operation
34
Amdahl's law
Bounds speedup attainable on a parallel machine
S = 1 / ((1 − P) + P / N)
where S is the speedup, P the ratio of parallel portions and N the number of processors;
(1 − P) is the time to run the sequential portions, P / N the time to run the parallel portions.
[Graph: S (speedup) as a function of N (available processors), saturating at 1 / (1 − P)]
G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities. AFIPS 1967.
35
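A small sketch of the formula, using P and N as defined above:

#include <cstdio>

// Amdahl's law: S = 1 / ((1 - P) + P / N)
double amdahl_speedup(double P, double N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main() {
    // Even with 1000 processors, a 90%-parallel program speeds up by less than 10x.
    std::printf("P=0.90, N=1000: S = %.1f\n", amdahl_speedup(0.90, 1000.0)); // ~9.9
    std::printf("P=0.99, N=1000: S = %.1f\n", amdahl_speedup(0.99, 1000.0)); // ~91
    return 0;
}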
Why heterogeneous architectures?
S = 1 / ((1 − P) + P / N), with (1 − P) the time to run sequential portions and P / N the time to run parallel portions
Latency-optimized multi-core (CPU)
Low efficiency on parallel portions:
spends too much resources
Throughput-optimized multi-core (GPU)
Low performance on sequential portions
Heterogeneous multi-core (CPU+GPU)
Use the right tool for the right job
Allows aggressive optimization
for latency or for throughput
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
36
Example: System on Chip for smartphone
Small cores
for background activity
GPU
Big cores
for applications
Lots of interfaces
Special-purpose
accelerators
37
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
38
The (simplest) graphics rendering pipeline
Vertices → Vertex shader → Primitives (triangles…) → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending → Pixels, written to the Framebuffer and Z-Buffer
The vertex and fragment shaders are programmable stages; the other stages are parametrizable (fixed-function).
39
How much performance do we need
… to run 3DMark 11 at 50 frames/second?
Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G
Intel Core i7 2700K: 56 Ginsn/s peak
We need to go 13x faster
Make a special-purpose accelerator
Source: Damien Triolet, Hardware.fr
40
Beginnings of GPGPU
[Timeline, 2000–2010.
Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0, 10.1, 11 — unified shaders arrive with DirectX 10.
NVIDIA: NV10, NV20, NV30, NV40, G70, G80–G90, GT200, GF100 — programmable shaders, dynamic control flow, CUDA and SIMT.
ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen — CTM, then CAL.
Shader arithmetic precision grows from FP16 and FP24 to FP32 and finally FP64.
GPGPU gains traction toward the end of the decade.]
42
Today: what do we need GPUs for?
1. 3D graphics rendering for games
Complex texture mapping, lighting
computations…
2. Computer Aided Design
workstations
Complex geometry
3. GPGPU
Complex synchronization, data
movements
One chip to rule them all
Find the common denominator
43
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
44
Little's law: data=throughput×latency
[Figure: data in flight = throughput (GB/s) × latency (ns) at each level of the memory hierarchy.
Intel Core i7 920: latencies of roughly 1.25, 3, 10 and 50 ns across the hierarchy, throughputs from tens of GB/s up to ~180 GB/s.
NVIDIA GeForce GTX 580: L1 ~1500 GB/s at ~30 ns, L2 ~320 GB/s at ~210 ns, DRAM ~190 GB/s at ~350 ns.]
J. Little. A proof for the queuing formula L = λW. Operations Research, 1961.
45
Hiding memory latency with pipelining
Memory throughput: 190 GB/s; memory latency: 350 ns
At 1 GHz: 190 bytes/cycle of throughput, 350 cycles of latency to hide
[Figure: memory requests pipelined over time — data in flight = 190 B/cycle × 350 cycles = 66,500 bytes ≈ 65 KB]
46
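A quick check of the arithmetic, with the numbers from the slide:

#include <cstdio>

int main() {
    // Little's law: data in flight = throughput x latency.
    // GB/s multiplied by ns gives bytes directly (the 1e9 and 1e-9 cancel out).
    double throughput_gb_per_s = 190.0;   // GTX 580 DRAM bandwidth
    double latency_ns = 350.0;            // DRAM access latency
    double bytes_in_flight = throughput_gb_per_s * latency_ns;
    std::printf("Bytes in flight: %.0f\n", bytes_in_flight);  // 66500 bytes, about 65 KB
    return 0;
}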
Consequence: more parallelism
GPU vs. CPU:
8× more parallelism to feed more units (throughput)
8× more parallelism to hide longer latency
→ 64× more total parallelism
How to find this parallelism?
[Figure: the grid of in-flight requests is 8× wider in space and 8× deeper in time]
47
Sources of parallelism
ILP: Instruction-Level Parallelism
Between independent instructions
in sequential program
TLP: Thread-Level Parallelism
Between independent execution
contexts: threads
DLP: Data-Level Parallelism
Between elements of a vector:
same operation on several elements
Examples:
ILP: add r3 ← r1, r2 and mul r0 ← r0, r1 are parallel; sub r1 ← r3, r0 depends on both
TLP: thread 1 runs add while thread 2 runs mul
DLP: vadd r ← a, b computes a1+b1, a2+b2, a3+b3 into r1, r2, r3 in parallel
48
Example: X ← a×X
In-place scalar-vector product: X ← a×X
Sequential (ILP)
For i = 0 to n-1 do:
X[i] ← a * X[i]
Threads (TLP)
Launch n threads:
X[tid] ← a * X[tid]
Vector (DLP)
X ← a * X
Or any combination of the above
49
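As a sketch of the TLP variant in CUDA (the kernel name and launch configuration are illustrative): each thread scales one element, and the hardware groups threads into warps that execute the multiply in lockstep (DLP).

// CUDA sketch of X <- a * X: one thread per element (TLP); the hardware groups
// threads into 32-wide warps that execute the multiply in lockstep (DLP).
__global__ void scale(float a, float* X, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        X[tid] = a * X[tid];
}

// Host side (illustrative launch): enough 256-thread blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(a, d_X, n);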
Uses of parallelism
“Horizontal” parallelism for throughput: more units working in parallel
“Vertical” parallelism for latency hiding: pipelining keeps units busy when waiting for dependencies or memory
[Figure: operations A, B, C, D spread across units (horizontal direction: throughput) and across cycles 1–4 (vertical direction: latency)]
50
How to extract parallelism?
        Horizontal        Vertical
ILP     Superscalar       Pipelined
TLP     Multi-core, SMT   Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT       Vector / temporal SIMT
We have seen the first row: ILP
We will now review techniques for the next rows: TLP, DLP
51
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
52
Sequential processor
for i = 0 to n-1
X[i] ← a * X[i]
Source code
move i ← 0
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<n? loop
Machine code
[Pipeline snapshot: Fetch ← add i ← 18; Decode ← store X[17]; Execute ← mul; connected to memory]
Sequential CPU
Focuses on instruction-level parallelism
Exploits ILP: vertically (pipelining) and horizontally (superscalar)
53
The incremental approach: multi-core
Several processors
on a single chip
sharing one memory space
Intel Sandy Bridge
Area: benefits from Moore's law
Power: extra cores consume little when not in use
e.g. Intel Turbo Boost
Source: Intel
54
Homogeneous multi-core
Horizontal use of thread-level parallelism
[Diagram: two cores, each with its own IF, ID, EX and LSU stages, sharing one memory.
Thread T0 on one core: add i ← 50 in fetch, store X[49] in decode, mul in execute.
Thread T1 on the other core: add i ← 18 in fetch, store X[17] in decode, mul in execute.]
Improves peak throughput
55
Example: Tilera Tile-GX
Grid of (up to) 72 tiles
Each tile: 3-way VLIW processor,
5 pipeline stages, 1.2 GHz
[Diagram: grid of tiles, from Tile (1,1) … Tile (1,8) down to Tile (9,1) … Tile (9,8)]
56
Interleaved multi-threading
Vertical use of thread-level parallelism
[Diagram: threads T0–T3 interleaved on a single pipeline; at a given instant the Fetch, Decode, Execute and load-store stages hold instructions from different threads (mul, add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49])]
Hides latency thanks to explicit parallelism
improves achieved throughput
57
Example: Oracle Sparc T5
16 cores / chip
Core: out-of-order superscalar, 8 threads
15 pipeline stages, 3.6 GHz
[Diagram: Core 1, Core 2, … Core 16, each running Thread 1 … Thread 8]
58
Clustered multi-core
For each individual unit, select between horizontal replication and vertical time-multiplexing
[Diagram: threads T0–T3 mapped onto two clusters; the clusters time-share the Fetch and Decode stages but each has its own EX and L/S units, with instructions such as br, mul, add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49] in flight]
Examples: Sun UltraSparc T2, T3; AMD Bulldozer; IBM Power 7
Area-efficient tradeoff
Blurs boundaries between cores
59
Implicit SIMD
Factorization of fetch/decode, load-store units
Fetch 1 instruction on behalf of several threads
Read 1 memory location and broadcast to several registers
[Diagram: threads T0–T3 share the fetch, decode and memory access stages — a single (0-3) load and a single (0-3) store are issued on behalf of all four threads — while the execute stage performs (0) mul, (1) mul, (2) mul, (3) mul]
In NVIDIA-speak
SIMT: Single Instruction, Multiple Threads
Convoy of synchronized threads: warp
Extracts DLP from multi-thread applications
60
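In CUDA terms, a thread's warp and its lane within the warp can be derived from its index; a minimal sketch, assuming a warp size of 32 and a 1D grid whose block size is a multiple of 32:

// Each thread computes which warp it belongs to and its lane within that warp.
// All 32 threads of a warp share one fetched/decoded instruction.
__global__ void warp_info(int* warp_of_thread, int* lane_of_thread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_of_thread[tid] = tid / 32;   // warp index (warp size of 32 assumed)
    lane_of_thread[tid] = tid % 32;   // position of the thread inside its warp
}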
Explicit SIMD
Single Instruction Multiple Data
Horizontal use of data level parallelism
[Pipeline snapshot: Fetch ← add i ← 20; Decode ← vstore X[16..19]; Execute ← vmul]
loop:
  vload T ← X[i]
  vmul T ← a×T
  vstore X[i] ← T
  add i ← i+4
  branch i<n? loop
Machine code
[Diagram: SIMD CPU]
Examples
Intel MIC (16-wide)
AMD GCN GPU (16-wide×4-deep)
Most general purpose CPUs (4-wide to 8-wide)
61
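For comparison, a sketch of the same X ← a×X written with explicit SIMD on a CPU, using 8-wide AVX intrinsics (assuming n is a multiple of 8):

#include <immintrin.h>

// Explicit SIMD: one AVX multiply processes 8 floats per instruction.
// Sketch: assumes n is a multiple of 8.
void scale_avx(float a, float* X, int n) {
    __m256 va = _mm256_set1_ps(a);              // broadcast a into all 8 lanes
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(X + i);     // vload
        vx = _mm256_mul_ps(va, vx);             // vmul
        _mm256_storeu_ps(X + i, vx);            // vstore
    }
}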
Quiz: link the words
Parallelism
Architectures
ILP
Superscalar processor
TLP
Homogeneous multi-core
DLP
Multi-threaded core
Use
Horizontal:
more throughput
Clustered multi-core
Implicit SIMD
Explicit SIMD
Vertical:
hide latency
62
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
65
Hierarchical combination
Both CPUs and GPUs combine these techniques
Multiple cores
Multiple threads/core
SIMD units
66
Example CPU: Intel Core i7
Is a wide superscalar, but also has:
Multi-core: 4 CPU cores
Multi-threading: Simultaneous Multi-Threading, 2 threads per core
SIMD units: 256-bit AVX
Up to 117 operations/cycle from 8 threads
67
Example GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
[Diagram: SM1 … SM16; inside each SM, cores 1–32 execute warps 1–48 interleaved over time]
Up to 512 operations per cycle from 24576 threads in flight
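These figures follow from the organization above: 16 SMs × (2×16) cores = 512 operations per cycle, and 16 SMs × 48 warps × 32 threads = 24,576 threads in flight.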
68
Taxonomy of parallel architectures
        Horizontal           Vertical
ILP     Superscalar / VLIW   Pipelined
TLP     Multi-core, SMT      Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT          Vector / temporal SIMT
69
Classification: multi-core
[Table: horizontal and vertical use of ILP, TLP and DLP in general-purpose multi-cores (Intel Haswell, Fujitsu SPARC64 X, IBM Power 8, Oracle Sparc T5).
Intel Haswell: 8-wide superscalar (horizontal ILP), 4 cores × 2 hyperthreads (TLP), 8-wide AVX SIMD (DLP).
Oracle Sparc T5: 16 cores × 8 threads.]
General-purpose multi-cores balance ILP, TLP and DLP; the Sparc T series focuses on TLP.
70
Classification: GPU and many small-core
[Table: horizontal and vertical use of ILP, TLP and DLP in GPUs and many small-core chips (Intel MIC, Nvidia Kepler, AMD GCN, Tilera Tile-GX, Kalray MPPA-256).
Intel MIC: 2-wide ILP, 60 cores × 4 threads (TLP), 16-wide SIMD (DLP).
Nvidia Kepler and AMD GCN: SIMT / 16-wide × 4-deep SIMD execution, tens of cores, heavy multithreading.
Tilera Tile-GX (72 cores) and Kalray MPPA-256: many small cores, little SIMD.]
GPUs focus on DLP and TLP, both horizontal and vertical.
Many small-core chips focus on horizontal TLP.
71
Takeaway
All processors use hardware mechanisms to turn parallelism into
performance
GPUs focus on Thread-level and Data-level parallelism
72
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
73
Computation cost vs. memory cost
Power measurements on NVIDIA GT200
                                 Energy/op (nJ)   Total power (W)
Instruction control              1.8              18
Multiply-add on a 32-wide warp   3.6              36
Load 128B from DRAM              80               90
With the same amount of energy
Load 1 word from external memory (DRAM)
Compute 44 flops
Must optimize memory accesses first!
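Worked out from the table above (counting a multiply-add as two flops): loading one 4-byte word costs 80 nJ / 32 ≈ 2.5 nJ, while the 32-wide multiply-add delivers 64 flops for 3.6 nJ, so 2.5 nJ buys roughly 2.5 / 3.6 × 64 ≈ 44 flops.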
74
External memory: discrete GPU
Classical CPU-GPU model
Split memory spaces
Highest bandwidth from
GPU memory
Transfers to main
memory are slower
[Diagram: CPU ↔ main memory (8 GB) at 26 GB/s; CPU ↔ GPU over PCI Express at 16 GB/s; GPU ↔ graphics memory (3 GB) at 290 GB/s]
Ex: Intel Core i7 4770, Nvidia GeForce GTX 780
75
External memory: embedded GPU
Most GPUs today
Same memory
May support memory
coherence
GPU can read directly from
CPU caches
More contention on external
memory
[Diagram: CPU and GPU on the same chip, sharing a cache and main memory (8 GB) at 26 GB/s]
76
GPU: on-chip memory
Conventional wisdom
Cache area in CPU vs. GPU
according to the NVIDIA
CUDA Programming Guide:
But... if we include registers:
Register files + caches: NVIDIA GM204 GPU: 8.3 MB; AMD Hawaii GPU: 15.8 MB; Intel Core i7 CPU: 9.3 MB
GPU/accelerator internal memory exceeds desktop CPUs
77
Registers: CPU vs. GPU
Registers keep the contents of local variables
Typical values
                     CPU      GPU
Registers/thread     32       32
Registers/core       256      65,536
Read / write ports   10R/5W   2R/1W
GPU: many more registers, but made of simpler memory
78
Internal memory: GPU
Cache hierarchy
Keep frequently-accessed data
Reduce throughput demand on main memory
Managed by hardware (L1, L2) or software (shared memory)
[Diagram: per-core L1 caches (~1 MB total, ~2 TB/s aggregate) connected through a crossbar to a shared L2 (~6 MB), then to external memory at 290 GB/s]
79
Caches: CPU vs. GPU
             CPU                    GPU
Latency      Caches, prefetching    Multi-threading
Throughput   Caches (side effect)   Caches
On CPU, caches are designed to avoid memory latency
Throughput reduction is a side effect
On GPU, multi-threading deals with memory latency
Caches are used to improve throughput (and energy)
80
GPU: thousands of cores?
Computational resources
NVIDIA GPUs        Exec. units   SMs
G80/G92 (2006)     128           16
GT200 (2008)       240           30
GF100 (2010)       512           16
GK104 (2012)       1536          8
GK110 (2012)       2688          14
GM204 (2014)       2048          16

AMD GPUs           Exec. units   SIMD-CUs
R600 (2007)        320           4
R700 (2008)        800           10
Evergreen (2009)   1600          20
NI (2010)          1536          24
SI (2012)          2048          32
VI (2013)          2560          40

The number of clients in the interconnection network (cores) stays limited
81
Takeaway
Result of many tradeoffs
Between locality and parallelism
Between core complexity and interconnect complexity
GPU optimized for throughput
Exploits primarily DLP, TLP
Energy-efficient on parallel applications with regular behavior
CPU optimized for latency
Exploits primarily ILP
Can use TLP and DLP when available
82
Next time
Next Tuesday, 1:00pm, room 2014
CUDA
Execution model
Programming model
API
Thursday 1:00pm, room 2011
Lab work: what is my GPU and when should I use it?
There may be available seats even if you are not enrolled
83