A Novel 3D Layer-Multiplexed On-Chip Network
Rohit Sunkam Ramanujam, Bill Lin
Electrical and Computer Engineering
University of California, San Diego

Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
[Figure: die photos of the Tilera Tile64 and the Intel 80-core chip; labels: 12.64mm, 21.72mm, single tile 1.5mm, 2.0mm, I/O area]

3D Integrated Circuits
≥ 2 active device layers
• Reduced chip footprint
• Reduced wire delays
• High inter-layer bandwidth
• Heterogeneous system integration
[Figure: Device layer 1 and Device layer 2 connected by a through-silicon via; label: short inter-layer distances]

Natural Progression: 3D Mesh for 3D CMPs
[Figure: a 2D mesh and a 3D mesh]
What routing algorithms to use for 3D mesh networks?

Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation

Oblivious Routing Objectives
• Maximize throughput
  – Distribute traffic evenly on network links
  – Maximize worst-case throughput as traffic is application dependent
• Minimize hop count
  – Minimize routing delay between source and destination
  – Reduce power

Routing Algorithms for 3D Mesh Networks
• Valiant routing (VAL)
  – Optimal worst-case throughput
  – Poor latency
• Dimension-ordered routing (DOR)
  – Minimal latency
  – Poor worst-case throughput
• O1TURN routing
  – Minimal latency
  – Poor worst-case throughput
• Ideal routing algorithm
  – Minimal latency
  – Maximum worst-case throughput
[Figure: average hop count (normalized to minimal; ticks 1 and 2) vs. worst-case throughput (fraction of network capacity; ticks 0.25 and 0.5); plot labels: VAL, DOR, O1TURN, IDEAL]

Randomized Partially-Minimal Routing (RPM)
[Figure: 3D mesh with X, Y, Z axes showing a route from Source to Destination through a random intermediate layer]
• Phase-1 Z: source to the intermediate layer
• Phase-2 Z: intermediate layer to the destination
• XY or YX routing on the intermediate layer
Main Idea
• Load-balance uniformly across the vertical layers
  – 2 phases of vertical routing
• Min XY/YX used on each layer

Routing Algorithms for 3D Mesh Networks
[Figure: y-axis: average hop count (normalized to minimal); plot label: VAL at tick 2]
Randomized Partially Minimal Routing
• Near-optimal worst-case throughput
• Low latency
[Figure: plot labels: RPM, IDEAL, DOR, O1TURN; hop-count ticks 1 and 1.1; x-axis: worst-case throughput (fraction of network capacity), ticks 0.25 and 0.5]

RPM has Near-optimal Worst-case Throughput
RPM is optimal for even radix, within 1/k^2 of optimal for odd radix.

Performance of RPM: Average-case Throughput
[Figure: bar chart of normalized average-case throughput (axis 0 to 0.8) for VAL, DOR, ROMM, O1TURN, and RPM on a 4x4x4 mesh and an 8x8x4 mesh]

Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed (LM) 3D architecture
• Evaluation

Unique Features of 3D ICs
• Inter-layer distances are very small (~50 μm)
  – Order of magnitude lower than distances between adjacent tiles on a 2D plane (~1500 μm)
  – Vertical interconnects implemented using Through-Silicon-Vias (TSVs) have very low delay
• Vertical bandwidth abundant as TSVs can be densely packed in 2D with small via pitch (~4 μm)
• Number of device layers likely to remain small (4-5 layers) due to thermal and manufacturing issues
[Figure: a 50 μm TSV between layers compared with 1500 μm between adjacent tiles; TSV array with 4 μm pitch in both directions]

RPM on a 3D Mesh
[Figure: 3D mesh with X, Y, Z axes showing a route from Source to Destination through a random intermediate layer]
• Phase-1 Z: source to the intermediate layer
• Phase-2 Z: intermediate layer to the destination
• XY or YX routing on the intermediate layer

Proposed Layer-Multiplexed Architecture
RPM routing adapted to the LM architecture: RPM-LM
[Figure: layered mesh with X, Y, Z axes; P1 to P4 attached to the layers; route from Source to Destination through a random intermediate layer]
• Phase-1 Z: source to the intermediate layer
• Phase-2 Z: intermediate layer to the destination
• XY or YX routing on the intermediate
layer

Power and Area Savings
• 5x5 crossbar in LM vs. 7x7 crossbar in 3D mesh
• Decouple vertical routing from horizontal routing
• Restrict vertical routing to packet injection and packet ejection
[Figure: two panels, "Conventional 3D Mesh" and "Layer-Multiplexed Architecture"; labels: P1 to P4, packet injection demultiplexer, packet ejection multiplexer]

Single Hop Vertical Communication
• Single hop vertical routing more power efficient than one-layer-per-hop routing
  – Leverages short inter-layer distances in 3D ICs
  – Better utilizes available vertical bandwidth

Packet Injection Demultiplexer
[Figure: block diagram; inputs P1 to P4; blocks: route selection/load balancing, VC allocation, flit counters, switch arbitration; credits in from the injection port of routers on layers 1-4; outputs to the injection ports of the Layer 1 through Layer 4 routers]

Packet Ejection Multiplexer
[Figure: block diagram; one arbiter for each of P1 to P4; inputs L1-Pn from the router on layer 1 plus packets from layers 2, 3, and 4 (L2-Pn, L3-Pn, L4-Pn); labels: VCID, credits out for L1-Pn, L2-Pn, L3-Pn, and L4-Pn]

Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
  – Power and Area
  – Performance

Power and Area Evaluation
• Used Orion 2.0 models for router power and area estimation.
• 65nm process at 1V and 1GHz
• Buffers
  – 4 VCs/port, 5 flits/VC for routers
  – 5 flits/port for packet injection demultiplexer
  – 5 flits/port for each packet ejection multiplexer

Power Comparison
• 3D mesh
  – One 7-port router per tile
• LM
  – One 5-port router per tile
  – One packet injection demultiplexer for every 4 tiles
  – One packet ejection multiplexer per tile

Power Evaluation
[Figure: stacked bar chart of power (in mW, axis 0 to 200) for 3D Mesh and LM; components: router power, amortized demultiplexer power, multiplexer power; annotation: 27% power reduction]

Area Evaluation
[Figure: stacked bar chart of area (in sq mm, axis 0 to 1) for 3D Mesh and LM; components: router area, amortized demultiplexer area, multiplexer area, wire area; annotation: 26.5% area reduction]

Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
  – Power and Area
  – Performance

RPM on a 3D mesh vs. RPM-LM
• Worst-case throughput
  – RPM-LM achieves same (near-optimal) worst-case throughput as RPM
• Average-case throughput
[Figure: bar chart of normalized average-case throughput (axis 0 to 0.8) for RPM and RPM-LM on 4x4x4 and 8x8x4 meshes]

Flit-Level Simulation
• Ideal throughput evaluation assumes
  – Ideal single-cycle router
  – Infinite buffers
  – No contention in switches, no flow control
• Flit-level simulation
  – PopNet network simulator
  – 5-stage router pipeline
  – Credit-based flow control
  – 8 virtual channels, each 5 flits deep
  – Multi-flit packets injected into the network (5 flits/packet)

Flit-Level Simulation (cont’d)
• Network configurations simulated
  – 4 x 4 x 4 mesh
  – 8 x 8 x 4 mesh
• Four different traffic traces used
  – Uniform traffic
  – Transpose traffic: (x,y,z) → (y,z,x)
  – Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
  – Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)

Uniform Traffic: 8x8x4 Mesh
[Figure: simulation results plot]

Transpose Traffic: 8x8x4 Mesh
[Figure: simulation results plot]

Worst-case Traffic for DOR: 8x8x4 Mesh
[Figure: simulation results plot]

Summary of Contributions
• Proposed a 3D Layer-multiplexed architecture which is an optimization of a 3D mesh
• Exploits the optimality of RPM together with the high vertical
bandwidth enabled in 3D technology
• LM architecture consumes 27% less power and occupies 26% less area than a 3D mesh
• RPM-LM has comparable (marginally better) performance to RPM on a 3D mesh

Thank you!!
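
The RPM slides describe a route in three parts: a vertical move to a random intermediate layer, minimal XY or YX routing on that layer, and a vertical move to the destination layer. The Python sketch below illustrates one such path on a 3D mesh. It is only an illustration under stated assumptions, not the authors' implementation: the names `rpm_route`, `step_toward`, and `hop_count`, the `num_layers` parameter, and the equal-probability choice between XY and YX are all hypothetical, since the slides do not specify them.

```python
import random


def step_toward(a, b):
    """Coordinates visited when moving from a to b in unit steps (a excluded)."""
    if a == b:
        return []
    step = 1 if b > a else -1
    return list(range(a + step, b + step, step))


def rpm_route(src, dst, num_layers, rng=random):
    """Sketch of one RPM path, returned as a list of (x, y, z) nodes.

    Assumptions (not stated in the slides): the intermediate layer is drawn
    uniformly from all layers, and XY vs. YX is chosen with probability 1/2.
    """
    (sx, sy, sz), (dx, dy, dz) = src, dst
    mid = rng.randrange(num_layers)  # random intermediate layer
    path = [(sx, sy, sz)]

    # Phase-1 Z: source layer to the intermediate layer.
    path += [(sx, sy, z) for z in step_toward(sz, mid)]

    # Minimal XY or YX routing on the intermediate layer.
    if rng.random() < 0.5:
        path += [(x, sy, mid) for x in step_toward(sx, dx)]
        path += [(dx, y, mid) for y in step_toward(sy, dy)]
    else:
        path += [(sx, y, mid) for y in step_toward(sy, dy)]
        path += [(x, dy, mid) for x in step_toward(sx, dx)]

    # Phase-2 Z: intermediate layer to the destination layer.
    path += [(dx, dy, z) for z in step_toward(mid, dz)]
    return path


def hop_count(path):
    """Number of links traversed by a path."""
    return len(path) - 1


if __name__ == "__main__":
    route = rpm_route((0, 0, 0), (3, 2, 1), num_layers=4)
    print(route, hop_count(route))
```

In this sketch the horizontal part of the path is always minimal, and only the two vertical phases can add hops, which is what "partially minimal" refers to. Under the LM architecture described in the slides, each vertical phase would instead be a single hop through the injection demultiplexer or the ejection multiplexer.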