Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://Algo-Logic.com • [email protected] • (408) 707-3740 • 2255-D Martin Ave., Santa Clara, CA 95050 Why the Move to Programmable Logic? “There are large challenges in scaling the performance of software now. The question is: ‘What’s next?’ We took a bet on programmable hardware.” - Doug Burger, Microsoft GPU GPU CPU CPU CPU … CPU CPU CPU CPU 1985 Optimized Sequential 2006 Multicore … • Driving Metrics in the Data Center – Latency: • Reduce delay • Avoid jitter – Throughput • Processing packets at line rate • Handle 10G, 25G, 40G, and 100G – Power: • Driving cost of OpEx © 2015 Algo-Logic Systems Inc., All rights reserved. Micro … FPGA INCREASING SILICON DIE AREA = SILICONE DIE AREA TIME 1971 Sequential Processing 2013 2016-2019 Add Vector Processing Add Logic Processing • Field Programmable Gate Array (FPGA) logic moves into the CPU • Microsoft accelerates BING search with FPGA • Intel acquires Altera 2 Servers Accelerated with FPGA Gateware • FPGA Augments Existing Servers – Can run on an expansion card (same size as a GPU) – Or may be integrated into the CPU socket • GDN Applications run on FPGA – Implements low-latency, low-power, high-throughput data processing Accelerated Server RAM RAM RAM RAM FPGA RAM 10G 40G 100 GE RAM HDD SSD CPU Cores CPU Rack Server © 2015 Algo-Logic Systems Inc., All rights reserved. 3 Example of Low Latency Service: Key/Value Store • Key/Value Store (KVS) Examples: Directory Forwarding Tables Data Deduplication Key Value Company Phone # Algo-Logic (408) 707-3740 IP Address Interface : MAC Address 204.2.34.5 Eth6 : 02:33:29:F2:AB:CC Content Hash Storage Block ID XYZ 948830038411 Order ID Stock Trading ATY11217911101 Virtex Graph Search v140 Symbol, Side, Price AAPL, B, 126.75 Edge List v201, v206, v225 – Simplifies implementation of large-scale distributed computation algorithms – Data Center Servers exchanges data over standard Ethernet • Challenges – Operating System delays packets and limits throughput – Per-core processing inefficient at high-speed packet processing • Solutions – Bypass kernel bypass with DPDK – Offload of packet processing with FPGA © 2015 Algo-Logic Systems Inc., All rights reserved. 4 Mobile Application Servers Need Fast Key/Value Stores • Scalable backend services to share data – Sensors { location, bio, movement, .. } – Social { status, dating, updates, multi-player games .. } – Media { video/security, audio/music, .. } – Communication { network status, handoff, short messages .. } – Database { users, providers, payments, travel, authentication, .. } • Must be able to scale – as the number of users grows 。 。 。 – Scale up to provide the best latency, throughput, and power – Scale out to increase storage capacity, throughput, and redundancy • Example: – Mobile location sharing © 2015 Algo-Logic Systems Inc., All rights reserved. 5 Case Study: Implementing Uber with KVS • Uber in 2014 – 162,037 drivers in the US completed 4 or more trips – New drivers doubled every 6 months for past 2 years – Number of Uber Users = 8M – Number of Cities = 290 – Total trips = 140M – Daily Trips = 1M • Analysis and Assumptions – Assuming 25% of the drivers are active – 25% of 160k drivers = 40k active cars (<48k) – Drivers update position once per second = 40k IOW • Implementation – Uber with on an Algo-Logic KVS card Washington Post, Dec. 2014 © 2015 Algo-Logic Systems Inc., All rights reserved. 6 Algo-Logic’s KVS solution for Mobile Applications 。 。 。 KVS on FPGA Application Server Scale UP for FASTER access to shared data © 2015 Algo-Logic Systems Inc., All rights reserved. 7 Algo-Logic’s KVS solution for Mobile Applications KVS on FPGA KVS on FPGA KVS on FPGA 2U System KVS on FPGA 2UKVS System on FPGA 2UKVS System on FPGA 。 。 。 。 。 。 2U System 2U System App Servers And SCALE-OUT quickly to increase storage capacity and throughput © 2015 Algo-Logic Systems Inc., All rights reserved. 8 Provisioning and Measurements with GDN-Switch © 2015 Algo-Logic Systems Inc., All rights reserved. 9 Linux Software Socket KVS OCSM Packet 10g Ethernet Intel 10G NIC Kernel Driver Message Process LEGEND Data Transfer = Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU © 2015 Algo-Logic Systems Inc., All rights reserved. 10 Algo-Logic KVS with DPDK: Bypass the Kernel Receive Queue Dequeue Enqueue Message Process OCSM Packet 10g Ethernet Intel 82598 DPDK Supported NIC Store Message Buffer Note: Message read once into CPU Cache Response Generation LEGEND Dequeue Control Handoff = Data Transfer = Transmit Queue Enqueue Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU © 2015 Algo-Logic Systems Inc., All rights reserved. 11 ALGO-LOGIC’SKVS FPGA SOLUTION FOR KEY/VALUE SEARCH Algo-Logic GDN-Search: in FPGA OPTIMISED FOR LARGE NUMBER OF KEY/VALUE SEARCH ENTRIES FPGA Packet Handler Key-Value Search OCSM Packet 10g Ethernet MAC0 RX 32B OCSM Header Identifier Packet Parser Key-Value Extractor RESPONSE GENERATOR MAC0 TX Packet Reconstruct 32B OCSM Header Reconstruct Key: 96b, Value: 96b EMSE2 Key-Value Response Decoder Packet Handler On-Chip Memory REQUEST GENERATOR Key-Value Search Key: 96b, Value: 352b MAC1 RX OCSM Packet Packet Parser 10g Ethernet 64B OCSM Header Identifier Key-Value Extractor RESPONSE GENERATOR MAC1 TX Packet Reconstruct 64B OCSM Header Reconstruct Key-Value Response Decoder EMSE2 Off-Chip Off-Chip Memory Off-ChipMemory Memory Off-Chip Memory REQUEST GENERATOR Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA © 2015 Algo-Logic Systems Inc., All rights reserved. 12 Trends for Adding Storage around FPGAs • Six banks of memory controllers – QDR SRAM – RLDRAM – DDR3, DDR4 • 64 lanes of SERDES – SATA disk and Flash – Serial memories • Hybrid Memory Cube (HMC) • Mosys Bandwidth Engine (BE2, BE3) • New Memories – 3D Xpoint with DDR4 Interface – Potential for Terabytes of Memory on each card © 2015 Algo-Logic Systems Inc., All rights reserved. 13 Implementation of KVS with Socket I/O, DPDK, and FPGA • Benchmark same application Socket I/O – Key/Value Store (KVS) • Running on the same PC OCSM Packet 10g Ethernet Intel 10G NIC Kernel Driver Message Process – Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA • With three different implementations LEGEND Data Transfer = – Socket I/O, DPDK, FPGA DPDK Receive Queue Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU Dequeue FPGA Enqueue Message Process OCSM Packet 10g Ethernet Intel 82598 DPDK Supported NIC REQUEST GENERATOR Packet Parser Store Message Buffer Note: Message read once into CPU Cache OCSM Packet OCSM Header Identifier Key/Value Extractor 10g Ethernet RESPONSE GENERATOR Response Generation LEGEND Dequeue Packet Modifier OCSM Header Reconstruct Exact Match Search Engine (EMSE) Key/Value Search Response Decoder Control Handoff = Transmit Queue Data Transfer = Enqueue Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU © 2015 Algo-Logic Systems Inc., All rights reserved. Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA 14 KVS Hardware in Data Center Rack FPGA GDN-Classify Associative Rule-Match CAM Key Extractor PHY MAC Packet Parser Flow or ACL Target Queues MAC s PHYs KVS in Software KVS in DPDK KVS in FPGA 10G … Rack of Search 40G Servers Additional KVS Servers Provision Controller UPS Power © 2015 Algo-Logic Systems Inc., All reserved. ©rights 2015 Algo-Logic Systems—Public 15 Load Testing off the KVS Implementations GEN: Traffic Generator Traffic Generator © 2015 Algo-Logic Systems Inc., All rights reserved. KVS Implementations Mellanox 10G NIC Intel 82598 10G NIC Kernel Driver Process Message Intel 82599 10G NIC Intel 82598 10G NIC Intel DPDK EAL Process Message Mellanox 10G NIC Nallatech P385 10G Process Message 16 Stratix V FPGA with EMSE-2 Drops no Packets Full-Rate Packet Processing (100% of TX Load) DPDK Drops Packets Software Drops Packets © 2015 Algo-Logic Systems Inc., All rights reserved. 17 Sockets Power Consumption Profile (10M Packets with 40 CSM Messages/Packet) © 2015 Algo-Logic Systems Inc., All rights reserved. 18 DPDK Power Consumption Profile (10M Packets with 40 CSM Messages/Packet) © 2015 Algo-Logic Systems Inc., All rights reserved. 19 Latency Measurement with GDN-Classify 10G Rx KVS Server Loopback T_outT_in 40G Tx GDN Switch Time Stamp 10G Tx Search Latency (Different for Software, DPDK, and GDN-Search) + 40G Rx Switch Latency (constant for GDN-Switch) • Round trip latency of GDN Switch is deterministic (constant) • Round trip latency of KVS Sever = Total Round trip Time – Round trip time with 10G loopback on GDN Switch © 2015 Algo-Logic Systems Inc., All rights reserved. 20 KVS Latency in FPGA, DPDK, and Sockets Latency Comparison 100k packets, 1 OCSM per packet, 1k pps 50.00% Altera Stratix V RTL Average: 0.467µs KVS in Software Worst Latency Worst Jitter 45.00% KVS in FPGA: Best Latency, No Jitter Socket Implementation Latency Distribution with One OCSM/Packet 0.70% Intel i7 Average: 41.54µs 35.00% 0.60% KVS in DPDK: Lowers Latency, Some Jitter 30.00% 25.00% 20.00% Percentage of Packets Observed [%] Peercentage of Observed Packets Tighter Spread = Less Jitter 40.00% 0.50% RTL 0.40% Sockets Sockets 0.30% DPDK 0.20% 0.10% DPDK Average: 6.29µs 0.00% 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0 Latency Distribution [µs] 15.00% 10.00% Sockets Average: 41.40µs 5.00% 0.00% 0 5.5 Lowest © 2015 Algo-Logic Systems Inc., All rights reserved. 11 16.5 22 27.5 Latency Distribution [µs] 33 38.5 44 Lower Latency = Faster Response 21 Measured Latency, Throughput, and Power Results FPGA GDN-Classify All Datapaths Summary Latency (µseconds) Tested Throughput (CSMs/sec) Power (µJoules/CSM) KVS in Software Sockets 41.54 4.0 11 KVS in DPDK DPDK 6.434 16 6.6 KVS in FPGA RTL 0.467 15 0.52 All Datapaths Summary Latency (µseconds) Maximum Throughput (CSMs/sec) Power (µJoules/CSM) GDN vs. Sockets 88x less 13x 21x less GDN vs. DPDK 14x less 3.2x 13x less Associative Rule-Match CAM Key Extractor PHY MAC Packet Parser Flow or ACL Target Queues MACs PHYs 10G … Rack of Search 40G Servers Additional KVS Servers Provision Controller UPS Power © 2015 Algo-Logic Systems Inc., All rights reserved. 22 Conclusions: Programmable Hardware in the Data Center • Lowers Latency ― 88x faster than Linux networking sockets ― 14x faster than optimized DPDK (kernel bypass) • Increases Throughput (IOPs) ― 3x to 13x improvement in throughput ― Lowers Capital Expenditures (CapEx) • Reduces Power ― 13x to 21x reduction in power ― Reduces Operating Expenditures (OpEx) © 2015 Algo-Logic Systems Inc., All rights reserved. 23
© Copyright 2026 Paperzz