AMD's Direct Connect Architecture
Pat Conway
Principal Member of Technical Staff

[Figure: 4 Node Opteron MP Architecture. Each Opteron CMP node pairs two cores (Core 0, Core 1) with a System Request Interface (SRI), crossbar (XBAR), and memory controller (MCT) driving a 128-bit, 12.8 GB/s DRAM interface; coherent and non-coherent HyperTransport links (HT, ncHT) and a HyperTransport host bridge (HT-HB) connect the nodes and I/O at 4.0 GB/s per direction.]
This poster illustrates the variety of Opteron system topologies built to date using
HyperTransport and AMD's Direct Connect Architecture for glueless multiprocessing
and summarizes the latency/bandwidth characteristics of these systems.
Building and using Opteron systems has provided useful, real-world lessons that expose the challenges to be addressed when designing future system interconnects, memory hierarchies, and I/O to scale up both the number of cores and sockets in x86 CMP architectures.
Glueless Multiprocessing with Opteron CMP processors @ 2GT/s Data Rate

"Fully Connected 4 Node and 8 Node Topologies are becoming Feasible"

[Figure: Opteron CMP processor "NorthBridge": two cores sharing the SRI, XBAR, MCT, HyperTransport host bridge (HT-HB), and coherent HyperTransport (cHT) links to neighboring nodes and I/O.]
Additional HyperTransport™ Ports
• Enable Fully Connected 4 Node (four x16 HT) and 8 Node (eight x8 HT)
• Reduced network diameter
– Fewer hops to memory (see the diameter sketch below)
• Increased Coherent Bandwidth
– more links
– cHT packets visit fewer links
– HyperTransport3 up to 5.2GT/s
• Benefits
– Low latency because of lower diameter
– Evenly balanced utilization of HyperTransport links
– Low queuing delays
Four x16 HT OR Eight x8 HT
Low latency under load
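
To make the diameter claims above concrete, here is a small Python sketch (mine, not part of the original poster) that computes the hop-count diameter and average diameter of a topology from its link list, averaging over all source/destination pairs including a node's zero-hop access to itself. It uses a 4-cycle for the 4N square and complete graphs for the fully connected cases; the printed values match the Diam / Avg Diam rows in the performance tables below (2 / 1.00, 1 / 0.75, and 1 / 0.88).

    # Sketch: hop-count diameter and average diameter of a HyperTransport topology.
    # Assumes shortest-path routing; the average is over all ordered (src, dst)
    # pairs, including dst == src (a local memory access is 0 hops).
    from collections import deque

    def hop_table(links, n):
        adj = {i: [] for i in range(n)}
        for a, b in links:
            adj[a].append(b)
            adj[b].append(a)
        hops = {}
        for src in range(n):
            dist = {src: 0}
            queue = deque([src])
            while queue:                      # plain BFS from src
                u = queue.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            hops[src] = dist
        return hops

    def diameter_stats(links, n):
        hops = hop_table(links, n)
        all_hops = [d for dist in hops.values() for d in dist.values()]
        return max(all_hops), sum(all_hops) / (n * n)

    square_4n = [(0, 1), (1, 3), (3, 2), (2, 0)]                   # 4N SQ: ring of 4
    fc_4n = [(a, b) for a in range(4) for b in range(a + 1, 4)]    # 4N FC: 6 links
    fc_8n = [(a, b) for a in range(8) for b in range(a + 1, 8)]    # 8N FC: 28 links

    for name, links, n in [("4N SQ", square_4n, 4), ("4N FC", fc_4n, 4), ("8N FC", fc_8n, 8)]:
        diam, avg = diameter_stats(links, n)
        print(f"{name}: diameter {diam}, average diameter {avg:.2f}")

At the same node count, fully connected wiring lowers both the worst-case and the average hop count, which is the "fewer hops to memory" benefit listed above.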
Lessons Learned #1: Memory Latency is the Key to Application Performance!
[Chart: Performance vs Average Memory Latency (single 2.8GHz core, 400MHz DDR2 PC3200, 2GT/s HT with 1MB cache in MP system). Relative performance (0%-120%) of OLTP1, OLTP2, SW99, SSL, and JBB workloads plotted for 1N, 2N, 4N (SQ), and 8N (TL) configurations.]
Fully Connected 4 Node Performance

[Figure: 4 Node Square and 4 Node Fully Connected topologies, processors P0-P3 with attached I/O and x16 links. Each native dual-core node contains an SRQ, Crossbar, and Memory Controller, with 8 GB/s HyperTransport links, plus 2 extra links with HyperTransport3.]
                 4N SQ               4N FC               4N FC
                 (2GT/s              (2GT/s              (4.4GT/s
                 HyperTransport)     HyperTransport)     HyperTransport3)
Diameter         2                   1                   1
Avg Diameter     1.00                0.75                0.75
XFIRE BW         14.9 GB/s           29.9 GB/s (2X)      65.8 GB/s (4X)
XFIRE ("crossfire") BW is the link-limited all-to-all communication bandwidth (data only).
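
To make "link-limited" concrete, the following Python sketch (my own approximation, not the poster's methodology) routes one equal-rate flow between every ordered pair of nodes, splits each flow evenly across all shortest paths, and scales traffic until the busiest link direction saturates. With the 4.0 GB/s per direction of an x16 link at 2GT/s it reproduces the 2X step from 4N SQ to 4N FC; the absolute numbers it prints are higher than the data-only figures above because coherent HyperTransport command, probe, and response packets, which this sketch ignores, also consume link bandwidth.

    # Sketch: link-limited all-to-all ("XFIRE"-style) bandwidth of a topology.
    # Every ordered node pair exchanges an equal-rate flow, split evenly over
    # all shortest paths; aggregate bandwidth is limited by the busiest link
    # direction. Coherence-protocol overhead is not modeled.
    from collections import deque
    from itertools import permutations

    def build_adj(links, n):
        adj = {i: [] for i in range(n)}
        for a, b in links:
            adj[a].append(b)
            adj[b].append(a)
        return adj

    def shortest_paths(adj, src, dst):
        # BFS distances toward dst, then walk "downhill" to enumerate all shortest paths.
        dist = {dst: 0}
        queue = deque([dst])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        def walk(u):
            if u == dst:
                return [[dst]]
            return [[u] + rest for v in adj[u] if dist[v] == dist[u] - 1 for rest in walk(v)]
        return walk(src)

    def all_to_all_bw(links, n, gbps_per_direction):
        adj = build_adj(links, n)
        load = {}                                   # flows per link direction
        for src, dst in permutations(range(n), 2):
            paths = shortest_paths(adj, src, dst)
            share = 1.0 / len(paths)
            for path in paths:
                for u, v in zip(path, path[1:]):
                    load[(u, v)] = load.get((u, v), 0.0) + share
        per_flow = gbps_per_direction / max(load.values())
        return n * (n - 1) * per_flow

    square_4n = [(0, 1), (1, 3), (3, 2), (2, 0)]
    fc_4n = [(a, b) for a in range(4) for b in range(a + 1, 4)]
    for name, links in [("4N SQ", square_4n), ("4N FC", fc_4n)]:
        print(f"{name}: {all_to_all_bw(links, 4, 4.0):.0f} GB/s raw all-to-all")

Run as-is it prints 24 GB/s for the square and 48 GB/s for the fully connected topology, the same 2X ratio as the 14.9 GB/s vs 29.9 GB/s entries above; the further jump to 65.8 GB/s in the table comes from clocking each link at 4.4GT/s with HyperTransport3.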
[Figure: 1 Node, 2 Node, 4 Node Square, 8N Ladder, and 8N Twisted Ladder topologies, each processor with attached I/O.]

Average diameter (AvgD) and memory latency by topology:

Topology             AvgD         Latency
1 Node               0 hops       x + 0ns
2 Node               0.5 hops     x + 17ns (47 cpuclk)
4 Node Square        1 hop        x + 44ns (124 cpuclk)
8N Twisted Ladder    1.5 hops     x + 76ns (214 cpuclk)
8N Ladder            1.8 hops     x + 105ns (234 cpuclk)
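
The AvgD column is the average number of HyperTransport hops from a processor to memory when accesses are spread evenly across every node's memory controller, with a local access counting as 0 hops. The short sketch below, in the same spirit as the earlier diameter sketch, reproduces the 0.5, 1, and roughly 1.8 hop figures; the 8-node ladder wiring is my assumption of the standard ladder (two rails P0-P2-P4-P6 and P1-P3-P5-P7 joined by four rungs), and the twisted-ladder value depends on exactly where the twist is routed, so it is not recomputed here.

    # Sketch: average hops from a node to memory, assuming memory accesses are
    # spread evenly over all nodes' memory controllers (local access = 0 hops).
    from collections import deque

    def avg_hops_to_memory(links, n):
        adj = {i: [] for i in range(n)}
        for a, b in links:
            adj[a].append(b)
            adj[b].append(a)
        total = 0
        for src in range(n):
            dist = {src: 0}
            queue = deque([src])
            while queue:                       # BFS hop counts from src
                u = queue.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            total += sum(dist.values())
        return total / (n * n)

    two_node = [(0, 1)]
    square_4n = [(0, 1), (1, 3), (3, 2), (2, 0)]
    # Assumed plain 8N ladder: rails P0-P2-P4-P6 and P1-P3-P5-P7 plus four rungs.
    ladder_8n = [(0, 2), (2, 4), (4, 6), (1, 3), (3, 5), (5, 7),
                 (0, 1), (2, 3), (4, 5), (6, 7)]

    for name, links, n in [("2 Node", two_node, 2), ("4 Node Square", square_4n, 4),
                           ("8N Ladder", ladder_8n, 8)]:
        print(f"{name}: {avg_hops_to_memory(links, n):.2f} hops to memory on average")

The ladder prints 1.75 hops, which the table rounds to 1.8 and which corresponds to roughly 105ns of added memory latency in the measurements above.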
Torrenza - HyperTransport-based Accelerators
Imagine it, Build it

• Open platform for system builders
– 3rd Party Accelerators
– Media
– FLOPs
– XML
– SOA
• AMD Opteron™ Socket or HTX slot
• HyperTransport interface is an open standard (see www.hypertransport.org)
• Coherent HyperTransport interface available if the accelerator caches system memory (under license)
• Homogeneous/heterogeneous computing
• Architectural enhancements planned to support fine-grain parallel communication and synchronization
Fully Connected 8 Node Performance
[Figure: 8N Twisted Ladder, 8 Node 2x4 (6HT), and 8 Node Fully Connected topologies, processors P0-P7 with attached I/O and numbered HyperTransport links.]
                 8N TL               8N 2x4              8N FC
                 (2GT/s              (4.4GT/s            (4.4GT/s
                 HyperTransport)     HyperTransport3)    HyperTransport3)
Diameter         3                   2                   1
Avg Diameter     1.62                1.12                0.88
XFIRE BW         15.2 GB/s           72.2 GB/s (5X)      94.4 GB/s (6X)
2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems, 6-7 December 2006, Stanford, California