Exploring Concentration and Channel Slicing in On-chip Network Router
Prabhat Kumar (1), Yan Pan (1), John Kim (2), Gokhan Memik (1), Alok Choudhary (1)
(1) Northwestern University
(2) KAIST, South Korea
Contributions of the Work

• Performance implication of concentration.
• Integrated vs. external concentration.
    • 47% reduction in area
    • 36% reduction in energy
    • 10% performance degradation
• Channel Slicing
• Virtual concentration for efficient resource utilization.
    • 69% reduction in area
    • 32% reduction in energy
Outline

Motivation

Concentration

Channel slicing

Virtual concentration

Results

Conclusion
Motivation

• Limited Budget
    • Area
    • Energy
• Efficiency
    • Performance
    • Cost optimization
• Design Options
    • Concentration
    • Channel Slicing
• Previous Work
    • Concentration: CMESH, Flattened Butterfly, Firefly, Multidrop Express Channels (on-chip)
    • Channel Slicing: Dragonfly (off-chip)
Motivation

Firefly [Pan ISCA’09]

Simplified Router Microarchitecture
Typical Topology
2D Mesh
# Processor Nodes = # Routers
Solution: Concentration

• Multiple cores share one router
• Benefits
    • Resource sharing
    • Network diameter decreases
    • Local communication cost decreases
• Drawbacks
    • Router complexity increases significantly
Example configuration: C = 4, Radix = 8, Width = 2x
Issue : Router Complexity

• Router components
    • Crossbar switch ~ (radix)^2
    • Arbitration logic
• How can we reduce the complexity of the crossbar switch?
[Figures: 5x5 crossbar, 2D MESH; 8x8 crossbar, 2D MESH, C = 4 (Integrated Concentration)]
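The radix numbers on this slide can be reproduced with a small calculation. The sketch below assumes a 2D-mesh router with four network ports plus one port per attached core, and assumes crossbar cost grows as radix squared, as the slide states. The function names are illustrative only and do not come from the slides.

```python
def router_radix(concentration: int, network_ports: int = 4) -> int:
    """Ports on a 2D-mesh router: one per neighbor plus one per attached core."""
    return network_ports + concentration


def relative_crossbar_cost(radix: int, baseline_radix: int) -> float:
    """Crossbar cost relative to a baseline, assuming cost ~ radix^2."""
    return radix ** 2 / baseline_radix ** 2


mesh_radix = router_radix(concentration=1)        # plain 2D mesh: 5x5 crossbar
integrated_radix = router_radix(concentration=4)  # C = 4 integrated: 8x8 crossbar

print(mesh_radix, integrated_radix)
print(relative_crossbar_cost(integrated_radix, mesh_radix))
```

This compares a single router only. It ignores the wider channels (Width = 2x) and the smaller number of routers in a concentrated network, both of which also affect total area.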
Design Option: External Concentration

• Multiplex injection ports
• De-multiplex ejection ports
• Benefits
    • Router radix decreases
    • Area decreases
• Cons
    • Reduced switching capacity
Issue: Arbitration

• External Concentration
    • Two levels of arbitration
• Parallel Arbitration
    • Use router switch information for concentration arbitration
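The two levels of arbitration can be illustrated with a minimal sketch. The slides do not say which arbitration policy is used, so round-robin is an assumption here, and the class and function names are hypothetical. The code shows only the sequential case: a concentrator arbiter first picks one local node, and that node then competes with the other router inputs at the switch.

```python
from typing import List, Optional, Sequence


class RoundRobinArbiter:
    """Grants one of n requesters, rotating priority after each grant."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # requester 0 has highest priority at start

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        # Scan requesters starting just after the previous winner.
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None


def two_level_grant(
    concentrator: RoundRobinArbiter,
    switch: RoundRobinArbiter,
    node_requests: List[bool],
    other_inputs: List[bool],
) -> Optional[int]:
    """Level 1 picks a local node; level 2 lets it compete at the router switch.

    Returns the winning local node, or None if no local node got through.
    """
    node = concentrator.grant(node_requests)
    # Input 0 of the switch arbiter stands for the shared injection port.
    port = switch.grant([node is not None] + list(other_inputs))
    return node if port == 0 else None


conc = RoundRobinArbiter(4)    # C = 4 local nodes share one injection port
switch = RoundRobinArbiter(5)  # injection port plus 4 network inputs
print(two_level_grant(conc, switch, [True, False, True, False], [False] * 4))
print(two_level_grant(conc, switch, [True, False, True, False], [False] * 4))
```

In this sequential form a node can win the first level and still lose at the switch, which wastes its grant. Using router switch information during concentration arbitration, as the slide proposes, is presumably aimed at that kind of inefficiency.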
Issue: Wide Channels

• Constant bandwidth density => wider channels
• Inefficient utilization
    • Cache lines ~ 512-1024 bits wide
    • Request, control, coherency packets much narrower
• Router Area
    • Switch area ~ (channel width)^2
Example configuration: C = 4, Radix = 8, Width = 2x
Design Option: Channel Slicing

• Slice wide channels
• Pros
    • Complexity reduces further
    • Better channel utilization
• Cons
    • Serialization latency increases (for long packets)
Example configuration: C = 4, Slicing Factor = 4
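The trade-off on this slide can be worked through numerically. The sketch below assumes switch area grows as channel width squared (as stated on the wide-channels slide), that each slice has its own independent switch, and that a packet travels on a single slice. The 512-bit channel width and the 64-bit request size are illustrative assumptions, not figures from the slides.

```python
import math


def crossbar_area(channel_bits: int, slices: int = 1) -> int:
    """Total switch area in arbitrary units, assuming area ~ width^2 per slice."""
    slice_bits = channel_bits // slices
    return slices * slice_bits ** 2


def serialization_cycles(packet_bits: int, channel_bits: int, slices: int = 1) -> int:
    """Cycles needed to push one packet through a single slice of the channel."""
    slice_bits = channel_bits // slices
    return math.ceil(packet_bits / slice_bits)


CHANNEL_BITS = 512  # assumed unsliced channel width
CACHE_LINE = 512    # long packet: one cache line (low end of 512-1024 bits)
REQUEST = 64        # assumed size of a short request packet

area_ratio = crossbar_area(CHANNEL_BITS, slices=4) / crossbar_area(CHANNEL_BITS)
print(area_ratio)  # 0.25

# Long packets pay for slicing, short packets do not.
print(serialization_cycles(CACHE_LINE, CHANNEL_BITS),
      serialization_cycles(CACHE_LINE, CHANNEL_BITS, slices=4))  # 1 4
print(serialization_cycles(REQUEST, CHANNEL_BITS),
      serialization_cycles(REQUEST, CHANNEL_BITS, slices=4))     # 1 1
```

Under these assumptions a slicing factor of 4 cuts the switch area to a quarter, while only packets longer than one slice see extra serialization latency.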
Combining Concentration and Slicing

• Slicing + Concentration
• Virtual Concentration
    • Nodes dedicated to a sliced layer
    • No sharing of input bandwidth
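One way to picture "nodes dedicated to a sliced layer" is a fixed mapping from each node to one slice. The sketch below is an illustration under assumptions: it supposes that the C nodes sharing a router location have consecutive ids and that each injects only into the slice matching its local index. The slides do not specify the actual assignment, and the function names are hypothetical.

```python
def injection_slice(node_id: int, concentration: int = 4, slices: int = 4) -> int:
    """Slice (layer) a node injects into; assumes consecutive ids per cluster."""
    local_index = node_id % concentration
    return local_index % slices


def nodes_per_slice_router(concentration: int = 4, slices: int = 4) -> int:
    """Injection ports each sliced router must provide under this mapping."""
    return concentration // slices


print([injection_slice(n) for n in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(nodes_per_slice_router())                # 1
```

With C = 4 and a slicing factor of 4, each sliced router serves a single injecting node, so no injection bandwidth is shared within a slice.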
Evaluation Setup

• Simulation Environment
    • Booksim simulator
    • Constant on-chip resources
        • Equal bisection bandwidth for all configurations
        • Equal amount of buffer storage
Terminology

Code Name    Slicing    Architecture Name
M1D1         No         External Concentration
M4D4         No         Integrated Concentration
S1R4M4D4     Yes        Integrated Concentration
S4R1M1D1     Yes        Virtual Concentration
S4R4M4D4     Yes        Fully connected Slice

Code Name     # Slices    Inj Node    Channel Width    Buffer Depth
MESH (C=1)    1           1           0.5b             0.6x
S1R4M4D4      1           4           1b               0.75x
S4R1M1D1      4           1           0.25b            1.2x
S4R4M4D4      4           4           0.25b            0.75x

Router Latency
M1D1    M1D4    M4D1    M4D4
2       3       3       3
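The router latencies in the table feed directly into zero-load latency. The sketch below uses the textbook zero-load model (per-hop head latency plus serialization latency). It is not taken from the slides, and the router count, link latency, packet size and channel width in the example are hypothetical values chosen only to show the effect of a 2-cycle versus a 3-cycle router.

```python
import math


def zero_load_latency(
    routers_traversed: int,
    router_latency: int,
    packet_bits: int,
    channel_bits: int,
    link_latency: int = 1,
) -> int:
    """Head latency over all routers plus serialization latency, in cycles."""
    head = routers_traversed * (router_latency + link_latency)
    serialization = math.ceil(packet_bits / channel_bits)
    return head + serialization


# Hypothetical path: 4 routers, 512-bit packet, 256-bit channel.
external = zero_load_latency(4, router_latency=2, packet_bits=512, channel_bits=256)
integrated = zero_load_latency(4, router_latency=3, packet_bits=512, channel_bits=256)
print(external, integrated)  # 14 18
```

The absolute numbers depend entirely on the assumed path and packet size; the point is only that a shorter router pipeline lowers the per-hop term.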
External Concentration
• Zero-load latency
    • 21% reduction for UR
    • 25% reduction for Bitcomp
• Throughput
    • 10% reduction for UR
    • No change for Bitcomp
• Area
    • 47% reduction compared to Integrated
• Energy
    • 36% reduction compared to Integrated
[Figure: Average Pkt Latency (# Cycles) vs. Injection Rate for M1D1 and M4D4; panels: Uniform Random, Bitcomp]
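The two traffic patterns named in these results, uniform random (UR) and bit complement (Bitcomp), can be sketched as destination functions. This is an illustration using the usual definitions of those patterns, not the simulator's own code. The function names are hypothetical, and excluding self-traffic in the uniform random case is an assumption.

```python
import random


def bit_complement_dest(src: int, num_nodes: int) -> int:
    """Destination whose address is the bitwise complement of the source."""
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two")
    return ~src & (num_nodes - 1)


def uniform_random_dest(src: int, num_nodes: int, rng: random.Random) -> int:
    """Destination drawn uniformly from all nodes except the source."""
    dest = rng.randrange(num_nodes - 1)
    return dest if dest < src else dest + 1


rng = random.Random(0)
print(bit_complement_dest(0, 64), bit_complement_dest(5, 64))  # 63 58
print(uniform_random_dest(3, 64, rng))
```

Bit complement sends every packet across the network to a fixed partner, while uniform random spreads load evenly, which is one reason the two patterns can respond differently to the same router change.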
Virtual Concentration

• Zero-load latency
    • No change compared to MESH
    • 16% increase for UR compared to Integrated
    • 12% increase for Bitcomp compared to Integrated
• Throughput
    • No significant difference for UR
    • 4.5% increase for Bitcomp
• Area
    • 69% reduction compared to MESH
• Energy
    • 32% reduction compared to MESH
[Figure: Average Pkt Latency (# Cycles) vs. Injection Rate for S1R4M4D4, S4R1M1D1, S4R4M4D4 and MESH; panels: Uniform Random, Bitcomp]
Area and Energy Consumption
• Area
    • 69% reduction compared to MESH
    • 88% reduction compared to Integrated concentration
• Energy
    • 32% reduction compared to MESH
    • 35% reduction compared to Integrated concentration
[Figure: Normalized Area]
Conclusion

• Combining concentration and channel slicing provides an efficient NoC design.
• External concentration reduces complexity with some performance degradation.
• Virtual Concentration saves 69% area and 32% energy compared to 2D MESH.
Thank you for your patience!!
Questions?