A Novel 3D Layer-Multiplexed On-Chip Network - UCSD

Rohit Sunkam Ramanujam
Bill Lin
Electrical and Computer Engineering
University of California, San Diego
Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
[Figure: chip layouts of the Tilera Tile64 and the Intel 80-core chip. Dimension labels: 12.64 mm, 21.72 mm, 1.5 mm, and 2.0 mm (single tile); I/O areas are marked.]
3D Integrated Circuits
• Reduced chip footprint
• Reduced wire delays
• High inter-layer bandwidth
• Heterogeneous system integration
[Figure: a stack of ≥ 2 active device layers (device layer 1, device layer 2) connected by through-silicon vias over short inter-layer distances.]
Natural Progression: 3D Mesh for 3D CMPs
[Figure: a 2D mesh next to a 3D mesh.]
What routing algorithms to use for 3D mesh networks?
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
Oblivious Routing Objectives
• Maximize throughput
– Distribute traffic evenly on network links
– Maximize worst-case throughput as traffic is
application dependent
• Minimize hop count
– Minimize routing delay between source and
destination
– Reduce power
Routing Algorithms for 3D Mesh Networks
[Figure: average hop count (normalized to minimal; ticks at 1 and 2) vs. worst-case throughput (fraction of network capacity; ticks at 0.25 and 0.5) for VAL, DOR, O1TURN, and IDEAL.]
• Valiant routing (VAL)
– Optimal worst-case throughput
– Poor latency
• Dimension-ordered routing (DOR) and O1TURN routing
– Minimal latency
– Poor worst-case throughput
• Ideal routing algorithm
– Minimal latency
– Maximum worst-case throughput
Randomized Partially-Minimal Routing (RPM)
[Figure: a 3D mesh with X, Y, and Z axes, showing a source, a destination, and a random intermediate layer.]
• Phase-1Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2Z: intermediate layer to the destination
Main Idea
• Load-balance uniformly across the vertical layers
– 2 phases of vertical routing
• Min XY/YX used on each layer
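The two vertical phases and the minimal XY/YX step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `minimal_2d` and `rpm_route`, the `(x, y, z)` tuple encoding, and the uniform choice between XY and YX are assumptions made for the example.

```python
import random


def minimal_2d(src, dst, order):
    """Nodes visited by a minimal XY ("xy") or YX ("yx") route on one layer."""
    x, y = src
    hops = []
    for axis in order:
        if axis == "x":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                hops.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                hops.append((x, y))
    return hops


def rpm_route(src, dst, num_layers, rng=random):
    """Nodes visited after src, ending at dst (assumed encoding: (x, y, z))."""
    sx, sy, sz = src
    dx, dy, dz = dst
    mid = rng.randrange(num_layers)      # random intermediate layer
    order = rng.choice(["xy", "yx"])     # assumption: XY and YX equally likely
    path = []
    z = sz
    step = 1 if mid > z else -1
    while z != mid:                      # Phase-1Z: source to intermediate layer
        z += step
        path.append((sx, sy, z))
    for x, y in minimal_2d((sx, sy), (dx, dy), order):
        path.append((x, y, mid))         # minimal XY/YX on the intermediate layer
    step = 1 if dz > z else -1
    while z != dz:                       # Phase-2Z: intermediate layer to destination
        z += step
        path.append((dx, dy, z))
    return path
```

The horizontal part of the route is always minimal; only the vertical part can exceed the minimal distance, which is why the routing is called partially minimal.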
Routing Algorithms for 3D Mesh Networks
[Figure: average hop count (normalized to minimal; ticks at 1, 1.1, and 2) vs. worst-case throughput (fraction of network capacity; ticks at 0.25 and 0.5) for VAL, DOR, O1TURN, RPM, and IDEAL.]
• Randomized Partially Minimal Routing (RPM)
– Near-optimal worst-case throughput
– Low latency
RPM has Near-optimal Worst-case Throughput
RPM is optimal for even radix, and within 1/k² of optimal for odd radix k.
Performance of RPM: Average-case Throughput
[Figure: normalized average-case throughput (axis from 0 to 0.8) of VAL, DOR, ROMM, O1TURN, and RPM on the 4x4x4 and 8x8x4 meshes.]
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed (LM) 3D architecture
• Evaluation
Unique Features of 3D ICs
• Inter-layer distances are very small (~50 μm)
– Order of magnitude lower than distances between adjacent tiles on a 2D plane (~1500 μm)
– Vertical interconnects implemented using Through-Silicon-Vias (TSVs) have very low delay
• Vertical bandwidth abundant as TSVs can be densely packed in 2D with small via pitch (~4 μm)
• Number of device layers likely to remain small (4-5 layers) due to thermal and manufacturing issues
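A quick back-of-the-envelope calculation puts the figures quoted above in proportion. The distances and the via pitch come from the slide; treating the via pitch as the only limit on TSV density (a square grid with no keep-out zones) is an assumption, so the density is an upper bound at best.

```python
INTER_LAYER_UM = 50      # inter-layer distance from the slide
TILE_SPACING_UM = 1500   # distance between adjacent tiles on a 2D plane
VIA_PITCH_UM = 4         # TSV pitch from the slide

# How much shorter a vertical hop is than a horizontal hop.
distance_ratio = TILE_SPACING_UM / INTER_LAYER_UM

# Assumed square grid at the quoted pitch: TSVs along 1 mm, then per square mm.
tsvs_per_mm = 1000 // VIA_PITCH_UM
tsvs_per_mm2 = tsvs_per_mm ** 2

print(distance_ratio, tsvs_per_mm2)
```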
RPM on a 3D Mesh
[Figure: a 3D mesh with X, Y, and Z axes, showing a source, a destination, and a random intermediate layer.]
• Phase-1Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2Z: intermediate layer to the destination
Proposed Layer-Multiplexed Architecture
[Figure: stacked layers with X, Y, and Z axes and labels P1-P4, showing a source, a destination, and a random intermediate layer.]
RPM routing adapted to the LM architecture: RPM-LM
• Phase-1Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2Z: intermediate layer to the destination
Power and Area Savings
• 5x5 crossbar in LM vs. 7x7 crossbar in 3D mesh
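As a rough first-order illustration of why the smaller crossbar matters, the crosspoint count of a full crossbar grows with the square of the port count. The sketch below compares the two sizes; using crosspoints as a proxy for cost is an assumption of this example, not the method behind the reported power and area numbers.

```python
def crosspoints(ports):
    """Crosspoints in a full ports-by-ports crossbar."""
    return ports * ports


mesh_3d = crosspoints(7)   # 3D mesh router: 7x7 crossbar
lm = crosspoints(5)        # LM router: 5x5 crossbar
reduction = 1 - lm / mesh_3d   # fraction of crosspoints removed

print(mesh_3d, lm, round(reduction, 3))
```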
[Figure: a conventional 3D mesh compared with the layer-multiplexed design; labels P1-P4, a packet injection demultiplexer, and a packet ejection multiplexer.]
Layer-Multiplexed Architecture
• Decouple vertical routing from horizontal routing
• Restrict vertical routing to packet injection and
packet ejection
Single Hop Vertical Communication
• Single hop vertical routing more power
efficient than one-layer-per-hop routing
– Leverages short inter-layer distances in 3D ICs
– Better utilizes available vertical bandwidth
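The difference between one-layer-per-hop and single-hop vertical routing can be made concrete by counting vertical transfers. The sketch below is an illustration under stated assumptions: layers are numbered from 0, the source, intermediate, and destination layers are each uniformly distributed, and a transfer in the LM case is counted only when the layer actually changes.

```python
from itertools import product


def vertical_hops_mesh(src, mid, dst):
    """One-layer-per-hop routing: every layer crossed is a router hop."""
    return abs(src - mid) + abs(mid - dst)


def vertical_hops_lm(src, mid, dst):
    """Single-hop routing: at most one transfer at injection, one at ejection."""
    return int(src != mid) + int(mid != dst)


LAYERS = 4
triples = list(product(range(LAYERS), repeat=3))
avg_mesh = sum(vertical_hops_mesh(*t) for t in triples) / len(triples)
avg_lm = sum(vertical_hops_lm(*t) for t in triples) / len(triples)

print(avg_mesh, avg_lm)
```

With 4 layers, the worst case is 6 vertical hops for one-layer-per-hop routing and 2 transfers for single-hop routing.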
Packet Injection Demultiplexer
[Figure: block diagram with inputs P1-P4; blocks for route selection/load balancing, VC allocation, flit counters, and switch arbitration; credits in from the injection ports of the routers on layers 1-4; outputs to the injection ports of the Layer 1 through Layer 4 routers.]
Packet Ejection Multiplexer
[Figure: block diagram shown for the router on Layer 1, with packets arriving from layers 2, 3, and 4. For each of P1 through P4, an arbiter selects among the inputs L1-Pn, L2-Pn, L3-Pn, and L4-Pn; a VCID label is shown; credits go out for L1-Pn, L2-Pn, L3-Pn, and L4-Pn.]
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
– Power and Area
– Performance
Power and Area Evaluation
• Used Orion 2.0 models for router power and
area estimation.
• 65 nm process at 1 V and 1 GHz
• Buffers
– 4 VCs/port, 5 flits/VC for routers
– 5 flits/port for packet injection demultiplexer
– 5 flits/port for each packet ejection multiplexer
Power Comparison
• 3D mesh
– One 7-port router per tile
• LM
– One 5-port router per tile
– One packet injection demultiplexer for every 4 tiles
– One packet ejection multiplexer per tile
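The per-tile accounting implied by this list can be written as a small formula: the demultiplexer is shared, so its cost is amortized over the tiles it serves. The sketch below is hypothetical throughout: the function names are invented for the example, and the numbers in the usage lines are made-up placeholders, not values from the evaluation.

```python
def lm_power_per_tile(router_5port, ejection_mux, injection_demux,
                      tiles_per_demux=4):
    """Per-tile LM power: own router, own multiplexer, a share of one demultiplexer."""
    return router_5port + ejection_mux + injection_demux / tiles_per_demux


def relative_reduction(mesh_router_7port, lm_total):
    """Fraction saved relative to the 3D mesh's single 7-port router per tile."""
    return 1 - lm_total / mesh_router_7port


# Placeholder inputs in arbitrary units, chosen only to exercise the formula.
lm_total = lm_power_per_tile(router_5port=100, ejection_mux=10, injection_demux=40)
saving = relative_reduction(mesh_router_7port=160, lm_total=lm_total)

print(lm_total, saving)
```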
Power Evaluation
[Figure: stacked bars of power (in mW, axis from 0 to 200) for the 3D mesh and LM, split into router power, amortized demultiplexer power, and multiplexer power.]
• 27% power reduction
Area Evaluation
[Figure: stacked bars of area (in sq. mm, axis from 0 to 1) for the 3D mesh and LM, split into router area, amortized demultiplexer area, multiplexer area, and wire area.]
• 26.5% area reduction
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
– Power and Area
– Performance
RPM on a 3D mesh vs. RPM-LM
• Worst-case throughput
– RPM-LM achieves same (near-optimal) worst-case throughput as RPM
• Average-case throughput
[Figure: normalized average-case throughput (axis from 0 to 0.8) of RPM and RPM-LM on the 4x4x4 and 8x8x4 meshes.]
Flit-Level Simulation
• Ideal throughput evaluation assumes
– Ideal single-cycle router
– Infinite buffers
– No contention in switches, no flow control
• Flit-level simulation
– PopNet network simulator
– 5-stage router pipeline
– Credit-based flow control
– 8 virtual channels, each 5 flits deep
– Multi-flit packets injected into the network (5 flits/packet)
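For readers unfamiliar with credit-based flow control, the sender-side bookkeeping for one virtual channel can be sketched as follows. This is a generic illustration of the technique, not code from PopNet; the class name `CreditCounter` and its methods are hypothetical, and only the buffer depth of 5 flits comes from the list above.

```python
class CreditCounter:
    """Sender-side credit count for one downstream virtual channel."""

    def __init__(self, buffer_depth=5):
        # One credit per free flit slot in the downstream buffer.
        self.credits = buffer_depth

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        if not self.can_send():
            raise RuntimeError("no credits: downstream buffer is full")
        self.credits -= 1

    def receive_credit(self):
        # The downstream router freed a slot and returned a credit.
        self.credits += 1


vc = CreditCounter(buffer_depth=5)
for _ in range(5):               # a 5-flit packet fills the 5-flit buffer
    vc.send_flit()
blocked = not vc.can_send()      # sender must stall until a credit returns
vc.receive_credit()
```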
Flit-Level Simulation (cont’d)
• Network configurations simulated
– 4 x 4 x 4 mesh
– 8 x 8 x 4 mesh
• Four different traffic traces used
– Uniform traffic
– Transpose traffic: (x,y,z) → (y,z,x)
– Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
– Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
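The three permutation patterns above translate directly into destination functions. The sketch below uses hypothetical function names. Two points are assumptions: `complement` takes a per-dimension radix so that it also works on non-cubic meshes, and `transpose` and `dor_worst_case` are written exactly as stated, which is only well defined when the dimensions they exchange have equal size (as in the 4x4x4 mesh). How the patterns were applied to the 8x8x4 mesh is not stated in the slides.

```python
def transpose(node):
    """(x, y, z) -> (y, z, x)"""
    x, y, z = node
    return (y, z, x)


def complement(node, dims):
    """(x, y, z) -> (kx-x-1, ky-y-1, kz-z-1); dims = (kx, ky, kz) is an assumption."""
    return tuple(k - c - 1 for c, k in zip(node, dims))


def dor_worst_case(node, k):
    """(x, y, z) -> (k-z-1, k-y-1, k-x-1) for a k x k x k mesh."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)


# Every pattern is a permutation of the 64 nodes of a 4x4x4 mesh.
nodes = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
images = {dor_worst_case(n, 4) for n in nodes}
is_permutation = len(images) == len(nodes)
```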
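Uniform Traffic: 8x8x4 Mesh
[Figure: flit-level simulation results; plot not recoverable.]
Transpose Traffic: 8x8x4 Mesh
[Figure: flit-level simulation results; plot not recoverable.]
Worst-case Traffic for DOR: 8x8x4 Mesh
[Figure: flit-level simulation results; plot not recoverable.]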
Summary of Contributions
• Proposed a 3D Layer-multiplexed architecture
which is an optimization of a 3D mesh
• Exploits the optimality of RPM together with
the high vertical bandwidth enabled in 3D
technology
• LM architecture consumes 27% less power,
occupies 26% less area than a 3D mesh
• RPM-LM has comparable (marginally better)
performance to RPM on a 3D mesh
Thank you!!