IEEE Hot Interconnects 20, Santa Clara, CA, Aug. 22-23, 2012 Rx Stack Accelerator for 10 GbE Integrated NIC F. Abel, C. Hagleitner IBM Research – Zurich Switzerland F. Verplanken IBM Systems & Technology Group La Gaude – France © 2009 IBM Corporation Discrete vs Integrated Network Interface Controller ■ Discrete NIC (dNIC) – – ■ Peripheral ASIC device Marketed as: ● Ethernet Controller ● Converged Controller Integrated NIC (iNIC) – – – Sun Niagara 2 Freescale → QorIQTM family IBM → PowerENTM 410 mm2 (1.43 billion transistors) A2 A2 A2 A2 PBIC PBEC PBUS A2 A2 EI3 TX A2 AT1 L2 MC L2 AT3 A2 A2 A2 DDR3 A2 A2 MC Crypto DDR3 Comp XML Converged Network Adapter (CNA) PBIC PBIC RegX Network Interface Card (NIC) EI3 RX A2 AT0 L2 L2 AT2 A2 PBIC PCI HEA/ PP iNIC ~3% A2 A2 PCI/Ethernet DDR3 DDR3 Higher performance, lower latency Lower power consumption Significant cost reduction LAN On Motherboard (LOM) 2 Alter the general-purpose nature of the computer complex © 2012 IBM Corporation Outline ■ Hardware context of this work – PowerENTM / Host Ethernet Adapter ■ Functional requirements ■ Architecture of the Rx Stack Accelerator – – – ■ Results – – ■ 3 Data path Packet parser Packet handler Implementation Performance Summary and conclusions © 2012 IBM Corporation Hardware Context 4 © 2012 IBM Corporation AT0 AT1 AT2 AT3 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 A2 Context: PowerENTM / Host Ethernet Adapter (HEA) Mem PHY Mem PHY P LLP LL P LLP LL Comp / Decomp 2MB L2 2MB L2 2MB L2 2MB L2 P LLP LL Crypto P LLP LL PL LPL L Pattern Engine XML MC MC PBIC PBIC PBus PBIC PBus External Controller PBIC Root/EP Engine Root Engine PCI Exp Gen 2 PCI Exp Gen 2 Pbus Access Macro Quad-port 10 GbE NIC a.k.a Host Ethernet Adapter (HEA) Pervasive 4x 10GE MAC or 4x 1GE MAC EI3 5 EI3 EI3 x8 PHY x8 PHY x8 PHY x4 PHY Misc I/O © 2012 IBM Corporation Overview of the Host Ethernet Adapter (HEA) ■ ■ ■ 6 410 GbE state-of-art Ethernet controller featuring: – I/O virtualization support ● Through 128 queue pairs ● Internal layer-2 switch for partition-to-partition data traffic – Flexible queue selection and scheduling assist – Rx and Tx protocol acceleration – Low-latency through direct processor bus attachment and cache injection – Multi-core scaling, – Interrupt and receive coalescing assist, – 9 KB Jumbo frame support, – Memory address protection – ... SoC Bus Host Interface 128 128 TxTx Tx TxAcc. Acc. Acc. Acc. Rx Rx Rx RxAcc. Acc Acc Acc XGMAC XGMAC XGMAC XGMAC XGXSPCS XGPCS XGPCS XGPCS XGMAC XGMAC XGMAC XGMAC XGXSPCS XGPCS XGPCS XGPCS Equivalent functionality and performance levels as modern discrete NICs Small footprint (13 mm2) and low power (2.6 W) © 2012 IBM Corporation RXACC Functional Requirements 7 © 2012 IBM Corporation Target Domain Requirements ■ ■ PowerENTM targets network-facing applications – Must cope with a large spectrum of protocols (13+), traffic characteristics and policies ● E.g. routers, firewalls, intrusion-prevention systems and network analytics – Must be able to adapt to changes ● Handle other existing or emerging protocols – Virtualized I/Os → Must sustain 16 Gb/s (internal layer-2 switch) Ethernet: A trucking service for application data – Protocol layered architecture (i.e., encapsulation) → 2000+ permutations IPv4t ISL DIX L1.5 L2 VLAN QinQ SAP SNAP PPPoE L2 L2 L2.5 MPLS MPLS MPLS MPLS MPLS MPLS 8 IPv4 TCP ICMPv6 L3 IPv6 IPv6t L4 Payload L2 UDP Ipv6 Ext. Hdr. Ipv6 Ext. Hdr. Ipv6 Ext. Ipv6 Ext.Hdr. Hdr. © 2012 IBM Corporation Distribution of the Protocol Stacks U, 558 M, 85 A, 25 L5, 424 DIX, 1153 1153 DIX, SAP, 1079 L4, 2118 L4, 2118 SNAP, 1078 RH, 186 LF, 329 Q, 1183 VLAN, 1183 MF, 331 FF, 651 QQ, 362 IP6t, 186 IPv6, IP6, 12631263 PPP, 972 IP4t, 197 M1, 1788 IP4, 940 M6, 394 M5, 665 M4, 941 9 M3, 1231 M2, 1510 MPLS2, 1510 © 2012 IBM Corporation Offloaded Service Requirements ■ WHY – – – TCP Rx+Tx processing → ~3000 instructions (assuming zero-copy and checksum offload) 10 GbE → Occurrence of 64 B packets → 67.2 ns (14.8 Mfps) A generally accepted rule of thumb → 1 GHz / 1 Gb/s ■ WHAT (business as usual) – 'per-byte' operations ● check-summing – 'per-packet' operations ● header processing: VLAN, MAC, flow identification, QoS determination, discard, errors, traffic steering ■ HOW (different) – Flexible way → Programmable parsing and processing (rule-based) – Meta-data descriptors → Save 300-400 instructions per frame ● E.g., Protocol stack signature (31b), Protocol statck offsets (64b) L4 L2.5 Prot. Pos. D S S V I A N L X P A A P N 10 Q i n Q P P P o E M P L S 1 M P L S 2 M P L S 3 M P L S 4 M P L S 5 M P L S 6 I P v 4 I P v 4 t I P v 6 I P v 6 t F F r g M F r g L F r g L3 Position L4 Position R L L N N N N N N I H 4 5 u u u u u u n d c r o m M a l f d Payload Position A b o r t © 2012 IBM Corporation Rx Stack Accelerator 11 © 2012 IBM Corporation Rx Stack Processing ■ Can be decomposed in three major tasks: – (t1) Parsing ● Identify protocol stack + position of protocol fields – (t2) Data extraction ● Locate and retrieve data to be processed – (t3) Processing ● Execute instructions based on identified rules – Filtering (MAC, VLAN), VLAN extraction, – Rx queue assignment, checksum verification, – flow determination, discard, counters increments, … ■ Main characteristics exhibited by protocol processing applications [Jantsch1998] – (c1) Intensive use of pattern matching ● especially on headers – (c2) Complex and control dominated flow ● many nested if-then-else and case structures – (c3) Intensive use of irregular memory accesses ● various sizes and patterns 12 © 2012 IBM Corporation RXACC Architecture A Transport Triggered Architecture (TTA) w/ 5 transport buses (1 byte/bus) (180 bits) meta-data octaword-out Cut-through architecture → low latency → efficient irregular mem. access (c3) Data Path FU 1 Frame “signature” → saves 300-400 instructions (avg.) Packet Handler FU N Sockets Transport Src Dst Ctrl Packet Parser ILP architecture → instruction-level parallelism → parallel, independent, and pipelined coprocessors (FUs) → performance scalability → hardware efficiency → hardware modularity → “ultimate” RISC architecture - only 1 instruction “move data” (programmable) clk = 625 MHz octaword-in (16 B) 13 Prog. FSM + Balanced Routing Table search algorithm (B-FSM) → efficient pattern-matching (c1) → efficient one-cycle multiway branching (c2) © 2012 IBM Corporation Data Path (DP) 14 © 2012 IBM Corporation Multi-field Packet Inspection & Data Extraction QQ/VLAN/SAP/SNAP/IPv6/UDP 48 16 4 | 12 20 16 48 16 8 8 8 8 16 24 8 16 4 | 12 4 8 128 128 16 FP 15 16 16 32 FP Off.5 Off.5 Off.4 Off.3 Off.4 Off.3 Off.2 Off.2 Off.1 Off.1 Current fields of interest 16 Next fields of interest © 2012 IBM Corporation Data Path Architecture Cut-through architecture – Relaxed buffering (2 16B) + Low latency – Flexible data extraction (any 5 bytes within the 32B window) – Entire packet is exposed to the parser → Can inspect and match any data of the frame Legend: L2 (DIX) MAC DA L3 (IPv4) Extracted fields: FrmPtr = 16 Offset1 = 2 Offset2 = 3 Offset3 = 7 Offset4 = 8 Offset5 = 9 MAC SA Ether Type Tot. Len. HOST I/F Status MM M 00 0 RH R Id. Pr Hd. ot. Csum SA DA Status B C N T B1-B5 FrmPtr Off(1-5) L D00 D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 16 Status Tranport I/F D00 D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 SrcOp I/F ■ Status XGMAC I/F © 2012 IBM Corporation Packet Parser (PP) 17 © 2012 IBM Corporation Packet Parser Design space: micro-coded → limited branch capabilities → multi-GHz operation → power finite state machine (FSM) → efficient but inflexible programmable finite state machine (pFSM) → high performance and energy efficient Rule IPv4_SA Current State S22 IPv4_DA_Prot S24 Input symbols Next Priority Input 1 Input 2 Input 3 State ... ... ... ... 0 ... …. … R44 S22 XXXX_XXXXb XXXX_XXXXb XXXX_XXXXb S24 R52 S24 XXXX_0101b 0001_0001b X00X_XXXXb S65 0 R53 S24 XXXX_XXXXb 0001_0001b X00X_XXXXb S82 1 R61 S24 XXXX_XXXXb 0010_1001b X00X_XXXXb S36 2 R112 S24 XXXX_0101b 0000_0110b X00X_XXXXb S44 3 State transition rules IPv4_UDP S65 IPv4_UL4 S36 IPv4_TCP S44 ('X' symbol represents a “don't care” bit) HW Engine We use the B-FSM architecture [van Lunteren2006] 'B' stands for Balanced Routing Table search algorithm (BaRT) Originally designed for longest matching prefix searches (i.e. routing table lookups) State transition diagram 18 © 2012 IBM Corporation Output Packet Parser Architecture Input DP I/F PH I/F Data Path Controller B1N-B3N FPInc, Src1-Src5 Range Comparator RCC H F output G H' LUT Transition Rule Selector next-state Match Match Match Match 1 2 3 4 Rule Selector State Register input Address Generator SC Transition Rule Memory Reg. File A,B B1C-B3C Input STATE RCR B1 B2 B3 R1 R2 R3 R4 SN B-FSM engine FWR Stat Reg. Transition Rule Memory (F, F', G, G', H') 64 640 bits 256 rules (1 rule = 160 bits) Rule Structure 19 Test part SC Input symbols Result part SN Output symbols H part H sym. © 2012 IBM Corporation Packet Handler (PH) 20 © 2012 IBM Corporation Packet Handler Architecture ■ Functional Units (FU) – Independent coprocessors (IP-blocks) → Concurrent operation → ILP → Performance – Configurable through MMIO – Implement TTA socket registers: FU example: ● O = operand register(s) the HASHER ● T = trigger register (note: result registers are not implemented) 32 Hasher (XORs) meta-data Symmetry MMIO config. Bit Mask FU 1 Packet Handler FU N O1 O2 O3 ON T Sockets Transport Transport Dst 21 Ctrl Decode Dst © 2012 IBM Corporation Rule Format → 160 bits Compare Value CS Src Src Src Src Src Dst Dst Dst Dst Dst 1 2 3 4 5 1 2 3 4 5 Compare NS Mask Result Part (variable-length instruction) (87 bits) Test Part (62 bits) H Part (9 bits) Frame pointer increment (Long/Short – Fixed/Variable) Source of the operands (i.e. offsets w.r.t the frame pointer) Destination of the operands (i.e. functional unit registers) ■ Simplified Instruction Set – Frame pointer instructions ● ● – e.g. ffpinc(4); e.g. vfpinc(1); // fixed increment // variable increment Move instruction ● ● e.g.: move(DP(4), CSI.O1); // unicast destination e.g.: move(DP(12), HSH.O1 & CSI.O1 & CST.O9); // multicast destination (socket write sharing) 22 © 2012 IBM Corporation Performance and Implementation Results 23 © 2012 IBM Corporation Implementation Results ■ Area (in 45 nm SOI technology) – – ■ Clock Frequency – ■ RXACC 0.15 W (entire HEA = 2.6 W) Transition Rule Memory Utilization – – 24 625 MHz (27% of core frequency) Power (estimate) – ■ 1 RXACC = 0.7 mm2 4 RXACC = 21.5 % of HEA (entire HEA = 13 mm2) 256 rules → 64 x (4 x 160) bits = 40 kbits ● 68 states out of 128 (53 %) ● 221 rules out of 256 (86 %) Transition rule memory + ECC logic = 15 % of RXACC © 2012 IBM Corporation RXACC Bandwidth DIX/IPv4/TCP DIX/IPv6/UDP DIX/QQ/VLAN/PPPoE/MPLS/MPLS/MPLS/MPLS/IPv4/TCP DIX/QQ/VLAN/IPv6-HBH-ROUT-FRAG/TCP Theoretical limit of 10 GbE media (Inter-frame gap=12B - Preample=8B) 20 18 Bandwidth (Gb/s) 16 Protocol stacks do not fit (missing bars) 14 12 10 8 6 4 2 0 64 80 128 256 512 1024 1518 4096 8192 Frame length (bytes) 25 © 2012 IBM Corporation Summary & Conclusions Processor compute complex + integrated network I/O complex ■ – IFF high-computation performance (1) + high-chip density (2) + low-power (3) + flexibility (4) RXACC delivers (1) + (2) + (3) + (4) (1) Performance ■ – – 15 Mfps, 20 Gb/s (at “relaxed” clock frequency → 625 MHz) Saves hundreds of CPU cycles per frame (2-3) Area and power efficiency ■ – 0.7 mm2 (45 nm SOI), 0.15 W (4) Flexibility ■ – – 2000+ protocol permutations Programmable parsing and processing → Rule-based ● Can parse new emerging standards (e.g. SDN) ● Can inspect header and payload ● Pragmatic approach w/ one code-set per application RXACC = TTA + pFSM = novel Application Specific Processor 26 Key enabler for integrated NICs The architecture has headroom to scale towards 40-100 GbE © 2012 IBM Corporation Rx Stack Accelerator for 10 GbE Integrated NIC Thank you IEEE Hot Interconnects 20, Santa Clara, CA, Aug. 22-23, 2012 © 2009 IBM Corporation
© Copyright 2024 Paperzz