slides

O1TURN: Near-Optimal Worst-Case Throughput Routing for 2D-Mesh Networks
DaeHo Seo, Akif Ali, WonTaek Lim
Nauman Rafique, Mithuna Thottethodi
School of Electrical and Computer Engineering
Purdue University
Motivation
• New routing algorithm for 2D mesh networks: O1TURN
• Why 2D mesh networks?
– Important class of interconnection networks
– Natural topology for on-chip networks
– Many applications
• “yet another routing algorithm”?
June 08 2005
Purdue University
Routing Algorithms: Objectives
• Maximize throughput and minimize latency
                            IDEAL   DOR   ROMM
Average case throughput       X
Worst case throughput         X
Minimal # of network hops     X      X     X
Low complexity router         X      X     X
VALIANT: X
MIN-ADAPTIVE: X, X, ?, X
[The row positions of the VALIANT and MIN-ADAPTIVE marks did not survive extraction]
• O1TURN satisfies all design goals
Challenges
• Intuition: path flexibility, load balancing, and throughput are correlated
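                            IDEAL   DOR   ROMM
Average case throughput       X
Worst case throughput         X
Minimal # of network hops     X      X     X
Low complexity router         X      X     X
# of paths                    ?      1     Θ(K’^2)
VALIANT: X; # of paths Θ(K^2)
MIN-ADAPTIVE: X, X, ?, X; # of paths Θ(2^K’)
[The row positions of the VALIANT and MIN-ADAPTIVE marks did not survive extraction]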
• Prior results
– Throughput: increasing path flexibility [SPAA 2002]
• May not improve worst-case throughput, and may even decrease it
• Is likely to improve average-case throughput
– Latency: increasing path flexibility may increase router complexity
Contributions
• Develop a new routing algorithm: O1TURN
• Throughput
– Better worst-case throughput than DOR / ROMM
• Near-optimal worst-case throughput for 2D mesh
– Captures most of the “opportunity” for average-case throughput with limited path flexibility
• O1TURN (with 2 paths) is as good as ROMM (with Θ(K’^2) paths)
• Latency
– Router implementation for O1TURN
• Complexity comparable to a simple DOR router
• Key point:
– Partition the delay-critical circuitry
• O1TURN is minimal: one goal trivially satisfied
Outline
• Background of interconnection network
• O1TURN routing algorithm
• O1TURN router implementation
• Simulation Results
• Conclusion and Q&A
Background
• Packet Switched, 2D mesh network
– Each packet independently routed
• Terminology
– Network Radix = k in kxk network (NOT Degree)
• Simplifying assumptions for this talk
– One packet crosses a link in one cycle
– Square mesh networks (K x K)
– K is even (K = 2p)
• Analytical method for throughput analysis
– TD Method [Towles and Dally, SPAA 2002]
– Worst-case throughput = (maximum channel load)^-1
– Given permutation and (oblivious) routing algorithm
• Find maximum channel load
– Given only (oblivious) routing algorithm
• Find permutation that causes maximum channel load
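A minimal Python sketch of the first use of the TD method listed above: the permutation and an oblivious routing algorithm are given, and the maximum channel load is found. It assumes XY dimension-order routing and the matrix-transpose permutation as inputs; the function names are illustrative and do not come from the talk.

```python
from collections import defaultdict

def xy_path(src, dst):
    """Directed channels (node -> node hops) used by XY routing on a mesh."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                      # travel along X first
        nx = x + (1 if dx > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                      # then along Y
        ny = y + (1 if dy > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops

def max_channel_load(k, permutation):
    """Maximum load over all channels when every node of a k x k mesh sends
    one unit of traffic to permutation(node) using XY routing."""
    load = defaultdict(float)
    for x in range(k):
        for y in range(k):
            for channel in xy_path((x, y), permutation((x, y))):
                load[channel] += 1.0
    return max(load.values(), default=0.0)

def transpose(node):
    """Matrix-transpose permutation: (x, y) sends to (y, x)."""
    x, y = node
    return (y, x)

k = 4
gamma_max = max_channel_load(k, transpose)
print("max channel load:", gamma_max)
print("throughput for this permutation:", 1.0 / gamma_max)
```

For K = 4 this yields a maximum channel load of 3, so the throughput for this one permutation is 1/3 packets / node / cycle.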
TD-Method Example
Unit of worst-case throughput = packets / node / cycle
Traffic (Src -> Dst): A -> D, D -> A on a four-node network (A, B, C, D)
• One route per source: A -> B -> D, D -> C -> A
– Max channel load = 1
– Worst-case throughput = (1 / 1) = 1
• Two routes per source, 0.5 of the traffic on each: A -> B -> D, A -> C -> D, D -> B -> A, D -> C -> A
– Max channel load = 0.5
– Worst-case throughput = (1 / 0.5) = 2
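The arithmetic of this example can be checked with a short Python sketch. It assumes each source injects one unit of traffic, split over its routes with the fractions shown on the slide; the helper name is illustrative.

```python
from collections import defaultdict

def max_channel_load(routes):
    """routes: list of (fraction, path) pairs; path is a string of node names.
    Adds each route's traffic fraction to every channel (hop) it crosses."""
    load = defaultdict(float)
    for fraction, path in routes:
        for hop in zip(path, path[1:]):
            load[hop] += fraction
    return max(load.values())

# Traffic A -> D and D -> A, one route per source.
single_path = [(1.0, "ABD"), (1.0, "DCA")]
# Same traffic, split evenly over the two minimal routes.
two_paths = [(0.5, "ABD"), (0.5, "ACD"), (0.5, "DBA"), (0.5, "DCA")]

for name, routes in (("single path", single_path), ("two paths", two_paths)):
    gamma = max_channel_load(routes)
    print(name, "max load", gamma, "throughput", 1.0 / gamma)
```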
O1TURN routing algorithm
• Orthogonal 1-TURN routing
– There is no U-turn => Orthogonal
– At most 1 turn => 1TURN
• Use 2 routes
– At most 2 minimal, 1-turn routes in a 2D mesh (XY, YX)
– Two routing algorithms (XY routing, YX routing)
– Each chosen with the same probability
[Figure: the two one-turn routes, labeled 1 and 2, from source S to destination D]
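A minimal sketch of the route choice described on this slide, assuming nodes are (x, y) coordinate pairs and the choice is made once per packet at the source. The function names are illustrative, not taken from the talk.

```python
import random

def dor_path(src, dst, order):
    """Nodes visited by dimension-order routing; order is "XY" or "YX"."""
    pos = list(src)
    path = [tuple(pos)]
    for axis in ((0, 1) if order == "XY" else (1, 0)):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            pos[axis] += step
            path.append(tuple(pos))
    return path

def o1turn_path(src, dst, rng=random):
    """Pick XY or YX routing with equal probability, then route minimally."""
    order = "XY" if rng.random() < 0.5 else "YX"
    return order, dor_path(src, dst, order)

def count_turns(path):
    """Number of direction changes along a path."""
    dirs = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for d1, d2 in zip(dirs, dirs[1:]) if d1 != d2)

order, path = o1turn_path((0, 0), (2, 1))
print(order, path, "turns:", count_turns(path))
```

Either choice gives a minimal path with at most one turn, which is the property the name O1TURN refers to.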
O1TURN routing algorithm
• Claim: Maximum channel load of O1TURN is K / 2
• Proof: Two sources of load contributions on a channel C
– # of nodes on the left side of the channel, by XY routing
– # of nodes on the right side of the channel, by YX routing
– Each node contributes at most 0.5, so with N nodes on one side and K - N on the other: N * 0.5 + (K - N) * 0.5 = K / 2
[Figure: channel C in the mesh under XY routing and under YX routing, with load contributions (K - N) * 0.5 and N * 0.5]
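A small numeric check of the K / 2 claim, as a sketch under assumptions: half of each source's traffic goes XY and half YX, and two example permutations are tried (row reversal and matrix transpose, both chosen here for illustration). It evaluates given permutations only and does not search for the worst case.

```python
from collections import defaultdict

def dor_channels(src, dst, order):
    """Directed channels crossed by dimension-order routing ("XY" or "YX")."""
    pos = list(src)
    hops = []
    for axis in ((0, 1) if order == "XY" else (1, 0)):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            prev = tuple(pos)
            pos[axis] += step
            hops.append((prev, tuple(pos)))
    return hops

def o1turn_max_load(k, permutation):
    """Max channel load when each node sends 0.5 via XY and 0.5 via YX."""
    load = defaultdict(float)
    for x in range(k):
        for y in range(k):
            src = (x, y)
            dst = permutation(src, k)
            for order in ("XY", "YX"):
                for channel in dor_channels(src, dst, order):
                    load[channel] += 0.5
    return max(load.values(), default=0.0)

def row_reversal(node, k):
    """(x, y) sends to (k - 1 - x, y): every packet stays in its row."""
    x, y = node
    return (k - 1 - x, y)

def transpose(node, k):
    x, y = node
    return (y, x)

for k in (4, 6, 8):
    print(k, o1turn_max_load(k, row_reversal), o1turn_max_load(k, transpose))
```

Row reversal reaches the K / 2 load (XY and YX paths coincide when source and destination share a row), while transpose stays below it.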
Optimal Worst Case Throughput
• Maximum channel load = K / 2
– Worst-case throughput = 2 / K by TD method
• Consider a permutation where 100% of packets cross the bisection of a K x K mesh
– Throughput (X) is bounded when the bisection links are saturated
– X * (K^2 / 2) = K
– X = 2 / K packets / node / cycle
• When K is odd, O1TURN is within (1 / K^2) of optimal worst-case throughput
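The bisection argument for even K can be written out as a short calculation. This is a sketch: it assumes K^2 / 2 nodes on each side of the cut and K links crossing it in each direction, as on the slide, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

def bisection_bound(k):
    """Upper bound on throughput X (packets / node / cycle) for even k when
    every packet crosses the bisection: X * (k^2 / 2) = k."""
    if k % 2:
        raise ValueError("this sketch covers even k only")
    senders_per_half = Fraction(k * k, 2)   # nodes on one side of the cut
    links_across = k                        # links crossing in one direction
    return links_across / senders_per_half

for k in (2, 4, 8, 16):
    print(k, bisection_bound(k))
```

The bound equals 2 / K, which matches the worst-case throughput obtained from the K / 2 maximum channel load.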
Worst-case Throughput Trends
• Worst-case throughput as network size changes
– Normalized to optimal worst-case throughput
– Worst-case throughput of DOR and ROMM degrades with K
• Recall
– Even radix: Opt * 1
– Odd radix: Opt * (1 - 1 / K^2)
[Figure: normalized throughput (0 to 1) versus network radix k (2 to 16) for OPTIMAL, DOR, ROMM, and O1TURN]
Average Case Analysis
• Extension of TD method [B. Towles et al., SPAA 2003]
– Examine randomly chosen permutations
– Harmonic mean of the worst-case throughput of the sampled permutations
– 1 M random permutations
Average case throughput, 4 x 4 and 8 x 8 2D MESH (DOR = 1 in both)
ROMM and O1TURN values as extracted: 1.113, 1.136, 1.180, 1.188
[The assignment of these four values to ROMM / O1TURN and to the two mesh sizes did not survive extraction]
• O1TURN shows the same or better average-case throughput
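A rough Python sketch of this sampling procedure for DOR and O1TURN only (ROMM is omitted). It assumes uniformly random permutations and a far smaller sample than the 1 M used in the talk, so its output is not expected to reproduce the slide's numbers; all names are illustrative.

```python
import random
from collections import defaultdict

def dor_channels(src, dst, order):
    """Directed channels crossed by dimension-order routing ("XY" or "YX")."""
    pos = list(src)
    hops = []
    for axis in ((0, 1) if order == "XY" else (1, 0)):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            prev = tuple(pos)
            pos[axis] += step
            hops.append((prev, tuple(pos)))
    return hops

def max_load(dests, orders):
    """dests maps src -> dst; traffic is split equally over the given orders."""
    load = defaultdict(float)
    share = 1.0 / len(orders)
    for src, dst in dests.items():
        for order in orders:
            for channel in dor_channels(src, dst, order):
                load[channel] += share
    return max(load.values(), default=0.0)

def harmonic_mean_throughput(k, orders, samples, seed):
    """Harmonic mean of per-permutation throughput (1 / max channel load)."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(k) for y in range(k)]
    total_load = 0.0
    for _ in range(samples):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        total_load += max_load(dict(zip(nodes, shuffled)), orders)
    # The harmonic mean of 1 / load_i is samples / sum(load_i).
    return samples / total_load

dor = harmonic_mean_throughput(4, ("XY",), samples=2000, seed=1)
o1turn = harmonic_mean_throughput(4, ("XY", "YX"), samples=2000, seed=1)
print("O1TURN normalized to DOR:", o1turn / dor)
```

Using the same seed for both runs means both algorithms are evaluated on the same sampled permutations.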
O1TURN Summary
• Near optimal worst-case Throughput
– By TD method
– Optimal for even K
– Approaches Optimal for large, odd K
• Average case throughput
– Better than DOR and comparable to ROMM
• Minimal # of network hops
– O1TURN is minimal routing
Base Router Implementation
• Base Router : Pipelined Virtual Channel Router
– 4 Stages : Routing, Virtual Channel allocation, Switch allocation,
Crossbar & Physical Channel transfer
– One control block controls all virtual channels
– Critical Stage : Virtual Channel allocation stage
[Figure: base router block diagram with Routing Algorithm, VC Allocation, and Switch Allocation blocks; CREDITS IN and CREDITS OUT (all PCs and VCs); VC ID; INJECT, EJECT, X+, X-, Y+, Y- ports; 5X5 CROSSBAR]
O1TURN Router Implementation
• O1TURN Router
– Separate Virtual Channels into two virtual networks (VN)
– One VN for XY routing, the other for YX routing
– Deadlock prevention in each independent VN due to DOR
[Figure: O1TURN router block diagram with separate Routing (XY) + VC Allocation and Routing (YX) + VC Allocation blocks, each with its own CREDITS IN and CREDITS OUT (all PCs and XY VCs; all PCs and YX VCs); shared Switch Allocation; VC ID; INJECT, EJECT, X+, X-, Y+, Y- ports; 5X5 CROSSBAR]
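The virtual-network split can be illustrated with a small sketch. It assumes an even split of the VC ids of each physical channel between the XY and YX networks, which the slide does not specify, and the function names are hypothetical.

```python
def split_virtual_channels(num_vcs):
    """Split VC ids 0..num_vcs-1 of a physical channel into two virtual
    networks. The even split is an assumption made for illustration."""
    if num_vcs < 2 or num_vcs % 2:
        raise ValueError("sketch assumes an even number of VCs, at least 2")
    half = num_vcs // 2
    return {"XY": list(range(half)), "YX": list(range(half, num_vcs))}

def allocate_vc(order, free_vcs, vn_map):
    """Return a free VC from the packet's own virtual network, or None.
    A packet never takes a VC that belongs to the other network."""
    for vc in vn_map[order]:
        if vc in free_vcs:
            return vc
    return None

vn_map = split_virtual_channels(8)
print(vn_map)
print(allocate_vc("YX", {0, 1, 5}, vn_map))
```

Because each allocator only considers the VCs of its own network, each VC allocation block arbitrates over half as many candidates as a single allocator for all VCs would.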
Delay Analysis
• Existing router delay models for pipelined routers
– Peh and Dally [HPCA 2001]
• Based on the logical effort method
– [I. Sutherland, B. Sproull, 1999]
– FO4 unit
                 DOR                             O1TURN
VCs / PC    VC allocation   SW allocation   VC allocation   SW allocation
4           17              14              14              14
8           20              16              17              16
– Complexity comparable to the DOR router
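A short sketch comparing the two allocation stages using the delays on this slide. It assumes the flattened values read row by row (4 VCs / PC: 17, 14, 14, 14; 8 VCs / PC: 20, 16, 17, 16, in FO4) and looks only at these two stages, since the other pipeline stages are not listed.

```python
# Stage delays in FO4, keyed by router and VCs per physical channel.
# The row-by-row reading of the slide's table is an assumption.
STAGE_DELAY_FO4 = {
    "DOR": {
        4: {"vc_alloc": 17, "sw_alloc": 14},
        8: {"vc_alloc": 20, "sw_alloc": 16},
    },
    "O1TURN": {
        4: {"vc_alloc": 14, "sw_alloc": 14},
        8: {"vc_alloc": 17, "sw_alloc": 16},
    },
}

def slowest_allocation_stage(router, vcs_per_pc):
    """Return (stage name, delay in FO4) of the slower allocation stage."""
    stages = STAGE_DELAY_FO4[router][vcs_per_pc]
    name = max(stages, key=stages.get)
    return name, stages[name]

for router in ("DOR", "O1TURN"):
    for vcs in (4, 8):
        print(router, vcs, slowest_allocation_stage(router, vcs))
```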
O1TURN Summary
• Near-optimal worst-case throughput
• Good average-case throughput
• Minimal network hops
• Low-complexity router implementation
– Complexity comparable to the DOR router
                            IDEAL   O1TURN
Average case throughput       X       X
Worst case throughput         X       X
Minimal # of network hops     X       X
Low complexity router         X       X
Evaluation Method
• Modified Popnet network simulator [L. Shang, 2003]
• 4x4 2D MESH (8x8 in paper)
• Full-duplex, bidirectional links
• 8 VCs per PC
• 5 flits per packet
• 500 K cycles
• Synthetic traffic: Uniform Random, Bit Complement (BC), Matrix Transpose (MT), HOT SPOT
• Compared with existing routing algorithms
– Oblivious routing algorithms (DOR, ROMM)
– Adaptive routing algorithm (DUATO)
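The synthetic traffic patterns can be sketched as destination functions. This is an illustration under stated assumptions, not the simulator's code: nodes are numbered id = y * K + x, the node count is a power of two for bit complement, and the hot-spot variant (a hypothetical reading of "2 nodes have 20% of traffic") sends 20% of packets to one of two hot nodes.

```python
import random

def bit_complement(node_id, num_nodes):
    """Destination id is the bitwise complement of the source id
    (num_nodes must be a power of two)."""
    return (num_nodes - 1) ^ node_id

def matrix_transpose(node_id, k):
    """Node (x, y) sends to node (y, x) on a k x k mesh."""
    x, y = node_id % k, node_id // k
    return x * k + y

def uniform_random(node_id, num_nodes, rng):
    """Destination drawn uniformly; may equal the source in this sketch."""
    return rng.randrange(num_nodes)

def hot_spot(node_id, num_nodes, rng, hot_nodes=(0, 1), hot_fraction=0.2):
    """Hypothetical hot-spot model: hot_fraction of packets go to a hot node,
    the rest are uniform random. The hot node ids are placeholders."""
    if rng.random() < hot_fraction:
        return rng.choice(hot_nodes)
    return rng.randrange(num_nodes)

rng = random.Random(0)
print(bit_complement(5, 16), matrix_transpose(6, 4))
print(uniform_random(3, 16, rng), hot_spot(3, 16, rng))
```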
Simulation Results
• 4 x 4 2D MESH – Uniform Random Traffic Pattern
[Figure: average latency (cycles, 0 to 200) versus throughput (flits / node / cycle, 0 to 1) for DOR, ROMM, O1TURN, and DUATO]
Simulation Results
• 4 x 4 2D MESH – Matrix Transpose Traffic Pattern
– One of the worst-case traffic patterns for DOR
[Figure: average latency (cycles, 0 to 200) versus throughput (flits / node / cycle, 0 to 1) for DOR, ROMM, O1TURN, and DUATO]
Simulation Results
• 4 x 4 2D MESH – Bit Complement Traffic Pattern
– Already balanced traffic pattern
[Figure: average latency (cycles, 0 to 200) versus throughput (flits / node / cycle, 0 to 1) for DOR, ROMM, O1TURN, and DUATO]
Simulation Results
• 4 x 4 2D MESH – HOT SPOT Traffic Pattern
– 2 nodes have 20% of traffic
[Figure: average latency (cycles, 0 to 200) versus throughput (flits / node / cycle, 0 to 1) for DOR, ROMM, O1TURN, and DUATO]
Simulation Results
• Delay penalty of adaptive routing
– How the complexity of the router implementation affects latency
– Hot Spot traffic pattern
[Figure: average latency (FO4, 0 to 2000) versus throughput (flits / node / cycle, 0 to 1) for DOR, ROMM, O1TURN, and DUATO]
Related Work
• Routing algorithms
– Valiant [L. G. Valiant et al., ACM 1981]
– ROMM [T. Nesson et al., ACM 1995]
– DUATO [J. Duato et al., 1993]
• Partitioned router implementation
– Mad Postman [Jesshope et al., ISCA 1989]
– PFNF [Upadhyay et al., 1997]
• Analysis methods
– Worst-case [B. Towles et al., 2002]
– Throughput centric [B. Towles et al., 2003]
– Delay model [L. S. Peh et al., HPCA 2001]
Conclusion
• Goals
– Good average-case throughput
– Good or optimal worst-case throughput
– Minimal # of network hops
– Low-complexity router implementation
• O1TURN
– Provides near-optimal worst-case throughput
– Provides the same or better average-case throughput compared with existing routing algorithms
– Minimal # of network hops
– Simple router implementation: comparable to the DOR router
– Satisfies all performance aspects
Q&A