lockwood

Implementing Ultra Low Latency Data Center
Services with Programmable Logic
John W. Lockwood, CEO: Algo-Logic Systems, Inc.
http://Algo-Logic.com • [email protected] • (408) 707-3740 • 2255-D Martin Ave., Santa Clara, CA 95050
Why the Move to Programmable Logic?
“There are large challenges in scaling the
performance of software now. The question
is: ‘What’s next?’ We took a bet on
programmable hardware.”
- Doug Burger, Microsoft
GPU
GPU
CPU
CPU
CPU
…
CPU
CPU
CPU
CPU
1985
Optimized
Sequential
2006
Multicore
…
• Driving Metrics in the Data Center
– Latency:
• Reduce delay
• Avoid jitter
– Throughput
• Processing packets at line rate
• Handle 10G, 25G, 40G, and 100G
– Power:
• Driving cost of OpEx
© 2015 Algo-Logic Systems Inc., All rights reserved.
Micro
…
FPGA
INCREASING SILICON DIE AREA
= SILICONE DIE AREA
TIME
1971
Sequential
Processing
2013
2016-2019
Add Vector
Processing
Add Logic
Processing
• Field Programmable Gate Array (FPGA) logic moves into the CPU
• Microsoft accelerates BING search with FPGA
• Intel acquires Altera
2
Servers Accelerated with FPGA Gateware
• FPGA Augments Existing Servers
– Can run on an expansion card (same size as a GPU)
– Or may be integrated into the CPU socket
• GDN Applications run on FPGA
– Implements low-latency, low-power, high-throughput data processing
Accelerated Server
RAM
RAM
RAM
RAM
FPGA
RAM
10G
40G
100
GE
RAM
HDD
SSD
CPU
Cores
CPU
Rack Server
© 2015 Algo-Logic Systems Inc., All rights reserved.
3
Example of Low Latency Service: Key/Value Store
• Key/Value Store (KVS)
Examples:
Directory
Forwarding
Tables
Data Deduplication
Key
Value
Company
Phone #
Algo-Logic
(408) 707-3740
IP Address
Interface : MAC Address
204.2.34.5
Eth6 : 02:33:29:F2:AB:CC
Content Hash
Storage Block ID
XYZ
948830038411
Order ID
Stock Trading ATY11217911101
Virtex
Graph Search v140
Symbol, Side, Price
AAPL, B, 126.75
Edge List
v201, v206, v225
– Simplifies implementation of large-scale
distributed computation algorithms
– Data Center Servers exchanges data
over standard Ethernet
• Challenges
– Operating System
delays packets and
limits throughput
– Per-core processing
inefficient at high-speed
packet processing
• Solutions
– Bypass kernel bypass with DPDK
– Offload of packet processing with FPGA
© 2015 Algo-Logic Systems Inc., All rights reserved.
4
Mobile Application Servers Need Fast Key/Value Stores
• Scalable backend services to share data
– Sensors { location, bio, movement, .. }
– Social { status, dating, updates, multi-player games .. }
– Media { video/security, audio/music, .. }
– Communication { network status, handoff, short messages .. }
– Database { users, providers, payments, travel, authentication, .. }
• Must be able to scale
– as the number of users grows
。
。
。
– Scale up to provide the best
latency, throughput, and power
– Scale out to increase
storage capacity, throughput, and redundancy
• Example:
– Mobile location sharing
© 2015 Algo-Logic Systems Inc., All rights reserved.
5
Case Study: Implementing Uber with KVS
• Uber in 2014
– 162,037 drivers in the US completed 4 or more trips
– New drivers doubled every 6 months for past 2 years
– Number of Uber Users = 8M
– Number of Cities = 290
– Total trips = 140M
– Daily Trips = 1M
• Analysis and Assumptions
– Assuming 25% of the drivers are active
– 25% of 160k drivers = 40k active cars (<48k)
– Drivers update position once per second = 40k IOW
• Implementation
– Uber with on an Algo-Logic KVS card
Washington Post, Dec. 2014
© 2015 Algo-Logic Systems Inc., All rights reserved.
6
Algo-Logic’s KVS solution for Mobile Applications
。
。
。
KVS on FPGA
Application Server
Scale UP for FASTER access to shared data
© 2015 Algo-Logic Systems Inc., All rights reserved.
7
Algo-Logic’s KVS solution for Mobile Applications
KVS on FPGA
KVS on FPGA
KVS
on FPGA
2U
System
KVS
on FPGA
2UKVS
System
on FPGA
2UKVS
System
on FPGA
。
。
。
。
。
。
2U System
2U System
App Servers
And SCALE-OUT quickly to increase storage capacity and throughput
© 2015 Algo-Logic Systems Inc., All rights reserved.
8
Provisioning and Measurements with GDN-Switch
© 2015 Algo-Logic Systems Inc., All rights reserved.
9
Linux Software Socket KVS
OCSM
Packet
10g Ethernet
Intel
10G NIC
Kernel
Driver
Message
Process
LEGEND
Data Transfer =
Algo-Logic software
on Intel 82598 10GE NIC
and Core i7-4770k CPU
© 2015 Algo-Logic Systems Inc., All rights reserved.
10
Algo-Logic KVS with DPDK: Bypass the Kernel
Receive
Queue
Dequeue
Enqueue
Message
Process
OCSM
Packet
10g Ethernet
Intel
82598
DPDK
Supported
NIC
Store
Message
Buffer
Note: Message read once into CPU Cache
Response
Generation
LEGEND
Dequeue
Control Handoff =
Data Transfer =
Transmit
Queue
Enqueue
Algo-Logic software
on Intel 82598 10GE NIC
and Core i7-4770k CPU
© 2015 Algo-Logic Systems Inc., All rights reserved.
11
ALGO-LOGIC’SKVS
FPGA SOLUTION
FOR KEY/VALUE SEARCH
Algo-Logic GDN-Search:
in
FPGA
OPTIMISED FOR LARGE NUMBER OF KEY/VALUE SEARCH ENTRIES
FPGA
Packet Handler
Key-Value Search
OCSM
Packet
10g Ethernet
MAC0
RX
32B OCSM
Header
Identifier
Packet
Parser
Key-Value
Extractor
RESPONSE GENERATOR
MAC0
TX
Packet
Reconstruct
32B OCSM
Header
Reconstruct
Key: 96b, Value: 96b
EMSE2
Key-Value
Response
Decoder
Packet Handler
On-Chip Memory
REQUEST GENERATOR
Key-Value Search
Key: 96b, Value: 352b
MAC1
RX
OCSM
Packet
Packet
Parser
10g Ethernet
64B OCSM
Header
Identifier
Key-Value
Extractor
RESPONSE GENERATOR
MAC1
TX
Packet
Reconstruct
64B OCSM
Header
Reconstruct
Key-Value
Response
Decoder
EMSE2
Off-Chip
Off-Chip
Memory
Off-ChipMemory
Memory
Off-Chip
Memory
REQUEST GENERATOR
Algo-Logic gateware
on Nallatech P385 with
Altera Stratix V A7 FPGA
© 2015 Algo-Logic Systems Inc., All rights reserved.
12
Trends for Adding Storage around FPGAs
• Six banks of memory controllers
– QDR SRAM
– RLDRAM
– DDR3, DDR4
• 64 lanes of SERDES
– SATA disk and Flash
– Serial memories
• Hybrid Memory Cube (HMC)
• Mosys Bandwidth Engine (BE2, BE3)
• New Memories
– 3D Xpoint with DDR4 Interface
– Potential for Terabytes of Memory on each card
© 2015 Algo-Logic Systems Inc., All rights reserved.
13
Implementation of KVS with Socket I/O, DPDK, and FPGA
• Benchmark same application
Socket I/O
– Key/Value Store (KVS)
• Running on the same PC
OCSM
Packet
10g Ethernet
Intel
10G NIC
Kernel
Driver
Message
Process
– Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA
• With three different implementations
LEGEND
Data Transfer =
– Socket I/O, DPDK, FPGA
DPDK
Receive
Queue
Algo-Logic software
on Intel 82598 10GE NIC
and Core i7-4770k CPU
Dequeue
FPGA
Enqueue
Message
Process
OCSM
Packet
10g Ethernet
Intel
82598
DPDK
Supported
NIC
REQUEST GENERATOR
Packet
Parser
Store
Message
Buffer
Note: Message read once into CPU Cache
OCSM
Packet
OCSM
Header
Identifier
Key/Value
Extractor
10g Ethernet
RESPONSE GENERATOR
Response
Generation
LEGEND
Dequeue
Packet
Modifier
OCSM
Header
Reconstruct
Exact
Match
Search
Engine
(EMSE)
Key/Value
Search
Response
Decoder
Control Handoff =
Transmit
Queue
Data Transfer =
Enqueue
Algo-Logic software
on Intel 82598 10GE NIC
and Core i7-4770k CPU
© 2015 Algo-Logic Systems Inc., All rights reserved.
Algo-Logic gateware
on Nallatech P385 with
Altera Stratix V A7 FPGA
14
KVS Hardware in Data Center Rack
FPGA
GDN-Classify
Associative Rule-Match CAM
Key
Extractor
PHY
MAC
Packet
Parser
Flow or
ACL
Target
Queues
MAC
s
PHYs
KVS in Software
KVS in DPDK
KVS in FPGA
10G
…
Rack of Search
40G
Servers
Additional KVS Servers
Provision Controller
UPS Power
© 2015 Algo-Logic Systems Inc., All
reserved.
©rights
2015
Algo-Logic
Systems—Public
15
Load Testing off the KVS Implementations
GEN: Traffic Generator
Traffic
Generator
© 2015 Algo-Logic Systems Inc., All rights reserved.
KVS Implementations
Mellanox
10G NIC
Intel
82598
10G NIC
Kernel
Driver
Process
Message
Intel
82599
10G NIC
Intel
82598
10G NIC
Intel
DPDK
EAL
Process
Message
Mellanox
10G NIC
Nallatech
P385 10G
Process
Message
16
Stratix V FPGA with
EMSE-2 Drops no
Packets
Full-Rate
Packet Processing
(100% of TX Load)
DPDK
Drops
Packets
Software
Drops
Packets
© 2015 Algo-Logic Systems Inc., All rights reserved.
17
Sockets Power Consumption Profile
(10M Packets with 40 CSM Messages/Packet)
© 2015 Algo-Logic Systems Inc., All rights reserved.
18
DPDK Power Consumption Profile
(10M Packets with 40 CSM Messages/Packet)
© 2015 Algo-Logic Systems Inc., All rights reserved.
19
Latency Measurement with GDN-Classify
10G Rx
KVS Server
Loopback
T_outT_in
40G Tx
GDN Switch
Time
Stamp
10G Tx
Search
Latency
(Different for Software,
DPDK, and GDN-Search)
+
40G Rx
Switch
Latency
(constant for
GDN-Switch)
• Round trip latency of GDN Switch is deterministic (constant)
• Round trip latency of KVS Sever = Total Round trip Time – Round
trip time with 10G loopback on GDN Switch
© 2015 Algo-Logic Systems Inc., All rights reserved.
20
KVS Latency in FPGA, DPDK, and Sockets
Latency Comparison 100k packets, 1 OCSM per packet, 1k pps
50.00%
Altera Stratix V RTL Average: 0.467µs
KVS in Software
Worst Latency
Worst Jitter
45.00%
KVS in FPGA:
Best Latency,
No Jitter
Socket Implementation Latency Distribution with One OCSM/Packet
0.70%
Intel i7 Average: 41.54µs
35.00%
0.60%
KVS in DPDK:
Lowers Latency,
Some Jitter
30.00%
25.00%
20.00%
Percentage of Packets Observed [%]
Peercentage of Observed Packets
Tighter Spread = Less Jitter
40.00%
0.50%
RTL
0.40%
Sockets
Sockets
0.30%
DPDK
0.20%
0.10%
DPDK Average: 6.29µs
0.00%
38.0
39.0
40.0
41.0
42.0
43.0
44.0
45.0
46.0
47.0
48.0
Latency Distribution [µs]
15.00%
10.00%
Sockets Average: 41.40µs
5.00%
0.00%
0
5.5
Lowest
© 2015 Algo-Logic Systems Inc., All rights reserved.
11
16.5
22
27.5
Latency Distribution [µs]
33
38.5
44
Lower Latency = Faster Response
21
Measured Latency, Throughput, and Power Results
FPGA
GDN-Classify
All Datapaths
Summary
Latency
(µseconds)
Tested
Throughput
(CSMs/sec)
Power
(µJoules/CSM)
KVS in Software
Sockets
41.54
4.0
11
KVS in DPDK
DPDK
6.434
16
6.6
KVS in FPGA
RTL
0.467
15
0.52
All Datapaths
Summary
Latency
(µseconds)
Maximum
Throughput
(CSMs/sec)
Power
(µJoules/CSM)
GDN vs. Sockets
88x less
13x
21x less
GDN vs. DPDK
14x less
3.2x
13x less
Associative Rule-Match CAM
Key
Extractor
PHY
MAC
Packet
Parser
Flow or
ACL
Target
Queues
MACs
PHYs
10G
…
Rack of Search
40G
Servers
Additional KVS Servers
Provision Controller
UPS Power
© 2015 Algo-Logic Systems Inc., All rights reserved.
22
Conclusions: Programmable Hardware in the Data Center
• Lowers Latency
― 88x faster than Linux networking sockets
― 14x faster than optimized DPDK (kernel bypass)
• Increases Throughput (IOPs)
― 3x to 13x improvement in throughput
― Lowers Capital Expenditures (CapEx)
• Reduces Power
― 13x to 21x reduction in power
― Reduces Operating Expenditures (OpEx)
© 2015 Algo-Logic Systems Inc., All rights reserved.
23