Abel IBM-Zurich

IEEE Hot Interconnects 20, Santa Clara, CA, Aug. 22-23, 2012
Rx Stack Accelerator for 10 GbE Integrated NIC
F. Abel, C. Hagleitner
IBM Research – Zurich
Switzerland
F. Verplanken
IBM Systems & Technology Group
La Gaude – France
© 2009 IBM Corporation
Discrete vs Integrated Network Interface Controller
■
Discrete NIC (dNIC)
–
–
■
Peripheral ASIC device
Marketed as:
●
Ethernet Controller
●
Converged Controller
Integrated NIC (iNIC)
–
–
–
Sun Niagara 2
Freescale → QorIQTM family
IBM → PowerENTM
410 mm2
(1.43 billion transistors)
A2
A2
A2
A2
PBIC
PBEC
PBUS
A2
A2
EI3 TX
A2
AT1
L2
MC
L2
AT3
A2
A2
A2
DDR3
A2
A2
MC
Crypto
DDR3
Comp
XML
Converged Network
Adapter (CNA)
PBIC
PBIC
RegX
Network Interface
Card (NIC)
EI3 RX
A2
AT0
L2
L2
AT2
A2
PBIC
PCI
HEA/
PP
iNIC
~3%
A2
A2
PCI/Ethernet
DDR3
DDR3
 Higher performance, lower latency
 Lower power consumption
 Significant cost reduction
LAN On Motherboard (LOM)
2
 Alter the general-purpose nature of
the computer complex
© 2012 IBM Corporation
Outline
■
Hardware context of this work
–
PowerENTM / Host Ethernet Adapter
■
Functional requirements
■
Architecture of the Rx Stack Accelerator
–
–
–
■
Results
–
–
■
3
Data path
Packet parser
Packet handler
Implementation
Performance
Summary and conclusions
© 2012 IBM Corporation
Hardware
Context
4
© 2012 IBM Corporation
AT0
AT1
AT2
AT3
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
A2
Context: PowerENTM / Host Ethernet Adapter (HEA)
Mem PHY
Mem PHY
P
LLP
LL
P
LLP
LL
Comp /
Decomp
2MB L2
2MB L2
2MB L2
2MB L2
P
LLP
LL
Crypto
P
LLP
LL
PL
LPL
L
Pattern
Engine
XML
MC
MC
PBIC
PBIC
PBus
PBIC
PBus External
Controller
PBIC
Root/EP
Engine
Root
Engine
PCI
Exp
Gen 2
PCI
Exp
Gen 2
Pbus
Access
Macro
Quad-port 10 GbE NIC
a.k.a
Host Ethernet Adapter
(HEA)
Pervasive
4x 10GE MAC or
4x 1GE MAC
EI3
5
EI3
EI3
x8 PHY
x8 PHY
x8 PHY
x4 PHY
Misc I/O
© 2012 IBM Corporation
Overview of the Host Ethernet Adapter (HEA)
■
■
■
6
410 GbE state-of-art Ethernet controller featuring:
– I/O virtualization support
●
Through 128 queue pairs
●
Internal layer-2 switch for partition-to-partition
data traffic
– Flexible queue selection and scheduling assist
– Rx and Tx protocol acceleration
– Low-latency through direct processor bus
attachment and cache injection
– Multi-core scaling,
– Interrupt and receive coalescing assist,
– 9 KB Jumbo frame support,
– Memory address protection
–
...
SoC Bus
Host Interface
128
128
TxTx
Tx
TxAcc.
Acc.
Acc.
Acc.
Rx
Rx
Rx
RxAcc.
Acc
Acc
Acc
XGMAC
XGMAC
XGMAC
XGMAC
XGXSPCS
XGPCS
XGPCS
XGPCS
XGMAC
XGMAC
XGMAC
XGMAC
XGXSPCS
XGPCS
XGPCS
XGPCS
Equivalent functionality and performance levels as
modern discrete NICs
Small footprint (13 mm2) and low power (2.6 W)
© 2012 IBM Corporation
RXACC
Functional
Requirements
7
© 2012 IBM Corporation
Target Domain Requirements
■
■
PowerENTM targets network-facing applications
– Must cope with a large spectrum of protocols (13+), traffic characteristics and
policies
● E.g. routers, firewalls, intrusion-prevention systems and network analytics
– Must be able to adapt to changes
● Handle other existing or emerging protocols
– Virtualized I/Os → Must sustain 16 Gb/s (internal layer-2 switch)
Ethernet: A trucking service for application data
– Protocol layered architecture (i.e., encapsulation) → 2000+ permutations
IPv4t
ISL
DIX
L1.5
L2
VLAN QinQ SAP SNAP PPPoE
L2
L2
L2.5
MPLS
MPLS
MPLS
MPLS
MPLS
MPLS
8
IPv4
TCP ICMPv6
L3
IPv6
IPv6t
L4
Payload
L2
UDP
Ipv6
Ext.
Hdr.
Ipv6
Ext.
Hdr.
Ipv6
Ext.
Ipv6 Ext.Hdr.
Hdr.
© 2012 IBM Corporation
Distribution of the Protocol Stacks
U, 558
M, 85
A, 25
L5, 424
DIX, 1153 1153
DIX,
SAP, 1079
L4, 2118
L4, 2118
SNAP, 1078
RH, 186
LF, 329
Q, 1183
VLAN, 1183
MF, 331
FF, 651
QQ, 362
IP6t, 186
IPv6,
IP6, 12631263
PPP, 972
IP4t, 197
M1, 1788
IP4, 940
M6, 394
M5, 665
M4, 941
9
M3, 1231
M2, 1510
MPLS2,
1510
© 2012 IBM Corporation
Offloaded Service Requirements
■
WHY
–
–
–
TCP Rx+Tx processing → ~3000 instructions (assuming zero-copy and checksum offload)
10 GbE → Occurrence of 64 B packets → 67.2 ns (14.8 Mfps)
A generally accepted rule of thumb → 1 GHz / 1 Gb/s
■
WHAT (business as usual)
– 'per-byte' operations
●
check-summing
– 'per-packet' operations
●
header processing: VLAN, MAC, flow identification, QoS determination, discard,
errors, traffic steering
■
HOW (different)
– Flexible way → Programmable parsing and processing (rule-based)
– Meta-data descriptors → Save 300-400 instructions per frame
●
E.g., Protocol stack signature (31b),
Protocol statck offsets (64b)
L4 L2.5
Prot. Pos.
D S S V
I A N L
X P A A
P N
10
Q
i
n
Q
P
P
P
o
E
M
P
L
S
1
M
P
L
S
2
M
P
L
S
3
M
P
L
S
4
M
P
L
S
5
M
P
L
S
6
I
P
v
4
I
P
v
4
t
I
P
v
6
I
P
v
6
t
F
F
r
g
M
F
r
g
L
F
r
g
L3
Position
L4
Position
R L L N N N N N N I
H 4 5 u u u u u u n
d
c
r
o
m
M
a
l
f
d
Payload
Position
A
b
o
r
t
© 2012 IBM Corporation
Rx Stack
Accelerator
11
© 2012 IBM Corporation
Rx Stack Processing
■
Can be decomposed in three major tasks:
– (t1) Parsing
●
Identify protocol stack + position of protocol fields
– (t2) Data extraction
●
Locate and retrieve data to be processed
– (t3) Processing
●
Execute instructions based on identified rules
– Filtering (MAC, VLAN), VLAN extraction,
– Rx queue assignment, checksum verification,
– flow determination, discard, counters increments, …
■
Main characteristics exhibited by protocol processing applications [Jantsch1998]
– (c1) Intensive use of pattern matching
●
especially on headers
– (c2) Complex and control dominated flow
●
many nested if-then-else and case structures
– (c3) Intensive use of irregular memory accesses
●
various sizes and patterns
12
© 2012 IBM Corporation
RXACC Architecture
A Transport Triggered Architecture (TTA) w/ 5 transport buses (1 byte/bus)
(180 bits)
meta-data
octaword-out
Cut-through architecture
→ low latency
→ efficient irregular
mem. access (c3)
Data Path
FU
1
Frame “signature”
→ saves 300-400
instructions (avg.)
Packet Handler
FU
N
Sockets
Transport
Src
Dst
Ctrl
Packet Parser
ILP architecture
→ instruction-level parallelism
→ parallel, independent, and
pipelined coprocessors (FUs)
→ performance scalability
→ hardware efficiency
→ hardware modularity
→ “ultimate” RISC architecture
- only 1 instruction
 “move data”
(programmable)
clk = 625 MHz
octaword-in
(16 B)
13
Prog. FSM + Balanced Routing Table search algorithm (B-FSM)
→ efficient pattern-matching (c1)
→ efficient one-cycle multiway branching (c2)
© 2012 IBM Corporation
Data Path
(DP)
14
© 2012 IBM Corporation
Multi-field Packet Inspection & Data Extraction
QQ/VLAN/SAP/SNAP/IPv6/UDP
48
16
4 | 12
20
16
48
16
8
8
8
8
16
24
8
16
4 | 12
4
8
128
128
16
FP
15
16
16
32
FP
Off.5
Off.5
Off.4
Off.3
Off.4
Off.3
Off.2
Off.2
Off.1
Off.1
Current
fields of interest
16
Next
fields of interest
© 2012 IBM Corporation
Data Path Architecture
Cut-through architecture
– Relaxed buffering (2  16B) + Low latency
– Flexible data extraction (any 5 bytes within the 32B window)
– Entire packet is exposed to the parser → Can inspect and match any data of the frame
Legend:
L2 (DIX)
MAC DA
L3 (IPv4)
Extracted fields:
FrmPtr = 16
Offset1 = 2
Offset2 = 3
Offset3 = 7
Offset4 = 8
Offset5 = 9
MAC SA
Ether
Type
Tot.
Len.
HOST I/F
Status
MM M
00 0
RH
R
Id.
Pr Hd.
ot. Csum
SA
DA
Status
B
C
N
T
B1-B5
FrmPtr
Off(1-5)
L
D00 D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15
16
Status
Tranport I/F
D00 D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15
SrcOp I/F
■
Status
XGMAC I/F
© 2012 IBM Corporation
Packet Parser
(PP)
17
© 2012 IBM Corporation
Packet Parser
Design space:
 micro-coded → limited branch capabilities → multi-GHz operation → power
 finite state machine (FSM) → efficient but inflexible
 programmable finite state machine (pFSM) → high performance and energy efficient
Rule
IPv4_SA
Current
State
S22
IPv4_DA_Prot
S24
Input symbols
Next
Priority
Input 1
Input 2
Input 3
State
...
...
...
...
0
...
….
…
R44
S22
XXXX_XXXXb
XXXX_XXXXb
XXXX_XXXXb
S24
R52
S24
XXXX_0101b
0001_0001b
X00X_XXXXb
S65
0
R53
S24
XXXX_XXXXb
0001_0001b
X00X_XXXXb
S82
1
R61
S24
XXXX_XXXXb
0010_1001b
X00X_XXXXb
S36
2
R112
S24
XXXX_0101b
0000_0110b
X00X_XXXXb
S44
3
State transition rules
IPv4_UDP
S65
IPv4_UL4
S36
IPv4_TCP
S44
('X' symbol represents a “don't care” bit)
HW Engine
We use the B-FSM architecture [van Lunteren2006]
'B' stands for Balanced Routing Table search algorithm (BaRT)
Originally designed for longest matching prefix searches
(i.e. routing table lookups)
State transition diagram
18
© 2012 IBM Corporation
Output
Packet
Parser
Architecture
Input
DP I/F
PH I/F
Data Path
Controller
B1N-B3N
FPInc, Src1-Src5
Range
Comparator
RCC
H
F
output
G
H'
LUT
Transition Rule Selector
next-state
Match Match Match Match
1
2
3
4
Rule Selector
State
Register
input
Address
Generator
SC
Transition
Rule
Memory
Reg.
File
A,B
B1C-B3C
Input
STATE RCR B1 B2 B3 R1
R2
R3
R4
SN
B-FSM engine
FWR
Stat
Reg.
Transition Rule Memory
(F, F', G, G', H')
64  640 bits
256 rules
(1 rule = 160 bits)
Rule
Structure
19
Test part
SC
Input symbols
Result part
SN
Output symbols
H part
H sym.
© 2012 IBM Corporation
Packet Handler
(PH)
20
© 2012 IBM Corporation
Packet Handler Architecture
■
Functional Units (FU)
– Independent coprocessors (IP-blocks) → Concurrent operation → ILP → Performance
– Configurable through MMIO
– Implement TTA socket registers:
FU example:
●
O = operand register(s)
the HASHER
●
T = trigger register
(note: result registers are not implemented)
32
Hasher
(XORs)
meta-data
Symmetry
MMIO config.
Bit Mask
FU
1
Packet Handler
FU
N
O1
O2
O3
ON
T
Sockets
Transport
Transport
Dst
21
Ctrl
Decode
Dst
© 2012 IBM Corporation
Rule Format → 160 bits
Compare
Value
CS
Src Src Src Src Src Dst Dst Dst Dst Dst
1 2 3
4 5 1
2 3 4 5
Compare
NS
Mask
Result Part (variable-length instruction)
(87 bits)
Test Part
(62 bits)
H Part
(9 bits)
Frame pointer increment (Long/Short – Fixed/Variable)
Source of the operands (i.e. offsets w.r.t the frame pointer)
Destination of the operands (i.e. functional unit registers)
■
Simplified Instruction Set
– Frame pointer instructions
●
●
–
e.g. ffpinc(4);
e.g. vfpinc(1);
// fixed increment
// variable increment
Move instruction
●
●
e.g.: move(DP(4), CSI.O1);
// unicast destination
e.g.: move(DP(12), HSH.O1 & CSI.O1 & CST.O9); // multicast destination
(socket write sharing)
22
© 2012 IBM Corporation
Performance and
Implementation
Results
23
© 2012 IBM Corporation
Implementation Results
■
Area (in 45 nm SOI technology)
–
–
■
Clock Frequency
–
■
RXACC  0.15 W (entire HEA = 2.6 W)
Transition Rule Memory Utilization
–
–
24
625 MHz (27% of core frequency)
Power (estimate)
–
■
1  RXACC = 0.7 mm2
4  RXACC = 21.5 % of HEA (entire HEA = 13 mm2)
256 rules → 64 x (4 x 160) bits = 40 kbits
● 68 states out of 128 (53 %)
● 221 rules out of 256 (86 %)
Transition rule memory + ECC logic = 15 % of RXACC
© 2012 IBM Corporation
RXACC Bandwidth
DIX/IPv4/TCP
DIX/IPv6/UDP
DIX/QQ/VLAN/PPPoE/MPLS/MPLS/MPLS/MPLS/IPv4/TCP
DIX/QQ/VLAN/IPv6-HBH-ROUT-FRAG/TCP
Theoretical limit of 10 GbE media (Inter-frame gap=12B - Preample=8B)
20
18
Bandwidth (Gb/s)
16
Protocol stacks do not fit
(missing bars)
14
12
10
8
6
4
2
0
64
80
128
256
512
1024
1518
4096
8192
Frame length (bytes)
25
© 2012 IBM Corporation
Summary & Conclusions
Processor compute complex + integrated network I/O complex
■
–
IFF high-computation performance (1) + high-chip density (2) + low-power (3)
+ flexibility (4)
 RXACC delivers (1) + (2) + (3) + (4)
(1) Performance
■
–
–
15 Mfps, 20 Gb/s (at “relaxed” clock frequency → 625 MHz)
Saves hundreds of CPU cycles per frame
(2-3) Area and power efficiency
■
–
0.7 mm2 (45 nm SOI), 0.15 W
(4) Flexibility
■
–
–

2000+ protocol permutations
Programmable parsing and processing → Rule-based
● Can parse new emerging standards (e.g. SDN)
● Can inspect header and payload
● Pragmatic approach w/ one code-set per application
RXACC = TTA + pFSM = novel Application Specific Processor


26
Key enabler for integrated NICs
The architecture has headroom to scale towards 40-100 GbE
© 2012 IBM Corporation
Rx Stack Accelerator for 10 GbE Integrated NIC
Thank you
IEEE Hot Interconnects 20, Santa Clara, CA, Aug. 22-23, 2012
© 2009 IBM Corporation