
IX: A Protected Dataplane
Operating System for High
Throughput and Low Latency
By: Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion
Presented by: Xianghan Pei
Outline
• Overview
• Motivation
• IX Design
• IX Implementation
• Evaluation
Overview
• OSDI 2014 Best Paper
• Extension work from Dune (an extension to Linux that provides
safe and efficient access to kernel-only CPU features)
• Some slides borrow from IX conference slides
Challenges for Datacenter Applications
• Microsecond Tail Latency
• High Packet Rates
• Protection
• Resource Efficiency
Motivation
• HW is fast, but SW is the bottleneck
64-byte TCP Echo
Motivation
• Why is SW slow?
• HW and workloads are changing
• Dense multicore + 10 GbE
• Scale-out workloads
• Berkeley sockets were designed for CPU time sharing
• Complex interface and code paths, convoluted by interrupts and scheduling
• Interrupt and system call latencies are many times higher than packet inter-arrival times
Motivation
• IX Closes the SW Performance Gap
• Protection and direct HW access through virtualization
• Execution model for low latency and high throughput
64-byte TCP Echo
IX Design
• Separation and protection of control and data plane
• Control plane is responsible for resource configuration, provisioning, etc.
• Data plane runs the networking stack and application logic
• Run to completion with adaptive batching
• Data and instruction cache locality
• Native, zero-copy API with explicit flow control
• Flow-consistent, synchronization-free processing
IX Implementation
• Separation of Control and Data Plane
IX Implementation
• Run to Completion with Adaptive Batching
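The loop below is a minimal sketch of the run-to-completion model with adaptive batching, assuming a polled NIC receive queue; the names (poll_rx_queue, process_to_completion, flush_tx_queue, MAX_BATCH) are illustrative and not taken from the IX sources.

```c
/* Illustrative sketch of a run-to-completion loop with adaptive batching.
 * All names (poll_rx_queue, process_to_completion, MAX_BATCH) are
 * hypothetical; they do not correspond to the actual IX code. */
#include <stddef.h>

#define MAX_BATCH 64          /* upper bound keeps latency bounded */

struct pkt;                   /* opaque packet descriptor */

/* Assumed helpers provided by the (hypothetical) dataplane runtime. */
size_t poll_rx_queue(struct pkt **pkts, size_t max);   /* non-blocking poll */
void   process_to_completion(struct pkt *p);           /* TCP + app logic */
void   flush_tx_queue(void);                           /* push queued replies */

void dataplane_loop(void)
{
    struct pkt *batch[MAX_BATCH];

    for (;;) {
        /* Adaptive batching: take whatever has accumulated in the RX
         * queue, up to MAX_BATCH. Under light load the batch is a single
         * packet (low latency); under heavy load it grows toward
         * MAX_BATCH (better cache locality and throughput). */
        size_t n = poll_rx_queue(batch, MAX_BATCH);

        /* Run each packet to completion (protocol processing plus
         * application logic) before touching the next one, so the
         * instruction and data caches stay warm. */
        for (size_t i = 0; i < n; i++)
            process_to_completion(batch[i]);

        /* Transmit all responses generated by this batch at once. */
        flush_tx_queue();
    }
}
```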
IX Implementation
• Dataplane Details
• Memory management differs from Linux
• Large blocks, internal memory fragmentation allowed, memory never swapped out
• Hierarchical timing wheels for timeouts such as TCP retransmission (sketched below)
• Redesigned API
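As a rough illustration of the timing-wheel idea, here is a minimal two-level hierarchical timing wheel in C; slot counts, field names, and the cascade policy are assumptions made for the sketch, not the actual IX timer code.

```c
/* Minimal two-level hierarchical timing wheel (e.g., for TCP
 * retransmission timeouts). Sizes and names are illustrative only. */
#include <stddef.h>

#define WHEEL_BITS  8
#define WHEEL_SIZE  (1 << WHEEL_BITS)          /* 256 slots per level */
#define WHEEL_MASK  (WHEEL_SIZE - 1)

struct timer {
    unsigned long expires;                     /* absolute tick at which it fires */
    void (*callback)(void *);
    void *arg;
    struct timer *next;
};

struct wheel {
    unsigned long now;                         /* current tick */
    struct timer *level0[WHEEL_SIZE];          /* fine-grained: next 256 ticks */
    struct timer *level1[WHEEL_SIZE];          /* coarse: 256-tick buckets */
};

static void wheel_add(struct wheel *w, struct timer *t)
{
    unsigned long delta = t->expires - w->now;
    struct timer **slot;

    if (delta < WHEEL_SIZE)                    /* near future: fine-grained level */
        slot = &w->level0[t->expires & WHEEL_MASK];
    else                                       /* far future: coarse level */
        slot = &w->level1[(t->expires >> WHEEL_BITS) & WHEEL_MASK];

    t->next = *slot;
    *slot = t;
}

/* Advance the wheel by one tick, firing due timers and cascading
 * coarse-level timers down each time level 0 completes a revolution. */
static void wheel_tick(struct wheel *w)
{
    w->now++;

    if ((w->now & WHEEL_MASK) == 0) {          /* level 0 wrapped: cascade */
        unsigned idx = (w->now >> WHEEL_BITS) & WHEEL_MASK;
        struct timer *t = w->level1[idx];
        w->level1[idx] = NULL;
        while (t) {
            struct timer *next = t->next;
            wheel_add(w, t);                   /* re-insert into level 0 */
            t = next;
        }
    }

    struct timer *t = w->level0[w->now & WHEEL_MASK];
    w->level0[w->now & WHEEL_MASK] = NULL;
    while (t) {
        struct timer *next = t->next;
        if (t->expires <= w->now)
            t->callback(t->arg);
        else
            wheel_add(w, t);                   /* defensive: not yet due */
        t = next;
    }
}
```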
IX Implementation
• Dataplane API and Operation
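To give a feel for the event-driven, zero-copy style of the dataplane API, here is a hedged usage sketch for an echo handler; the type and function names (ix_flow, sg_entry, ix_sendv, ix_recv_done, on_recv) are hypothetical stand-ins, not the real libix interface.

```c
/* Usage sketch of an IX-style zero-copy, event-driven API.
 * All names here are hypothetical stand-ins, not the real libix API. */
#include <stddef.h>

struct ix_flow;                                  /* opaque per-connection handle */
struct sg_entry { void *base; size_t len; };     /* scatter/gather element */

/* Assumed runtime entry points (batched and executed by the dataplane). */
void ix_sendv(struct ix_flow *f, const struct sg_entry *sg, int n); /* zero-copy tx */
void ix_recv_done(struct ix_flow *f, size_t bytes);  /* return rx buffer, open window */

/* Receive callback: data arrives as a pointer into a NIC buffer (no copy).
 * The application echoes it back and then explicitly tells the dataplane
 * the buffer has been consumed; that acknowledgment doubles as flow control. */
static void on_recv(struct ix_flow *f, void *buf, size_t len)
{
    struct sg_entry sg = { .base = buf, .len = len };
    ix_sendv(f, &sg, 1);        /* transmit directly from the receive buffer */
    ix_recv_done(f, len);       /* buffer consumed; advance the receive window */
}
```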
IX Implementation
• Multi-core scalability
• Elastic threads operate in a synchronization- and coherence-free manner
• Commutative API
• Flow-consistent hashing at the NICs (sketched below)
• Small number of shared structures
• Security model
• A malicious application cannot corrupt the networking stack or other applications
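The dispatch logic behind flow-consistent hashing can be sketched as follows; real NICs implement this in hardware with a Toeplitz hash (RSS), and the FNV-1a hash below is only a stand-in to show the idea.

```c
/* Sketch of flow-consistent dispatch: hash each packet's 4-tuple and use
 * the result to pick a receive queue, so every packet of a TCP flow lands
 * on the same core. The hash is a simplified stand-in for the NIC's
 * Toeplitz hash; field and function names are illustrative. */
#include <stdint.h>

struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* FNV-1a over the 4-tuple; placeholder for the hardware hash. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    const uint8_t *p = (const uint8_t *)t;
    uint32_t h = 2166136261u;
    for (unsigned i = 0; i < sizeof(*t); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* All packets of a flow map to the same queue/core, so per-flow TCP state
 * can stay in core-local memory with no locks or cache-line bouncing. */
static unsigned pick_rx_queue(const struct flow_tuple *t, unsigned nr_queues)
{
    return flow_hash(t) % nr_queues;
}
```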
Evaluation
• Comparison of IX to Linux and mTCP
• TCP microbenchmarks and Memcached
Evaluation
• NetPIPE performance
Evaluation
• Multicore Scalability for Short Connections
Evaluation
• n round-trips per connection
Evaluation
• Different message sizes s (n=1)
Evaluation
• Connection Scalability
Evaluation
• Memcached over TCP
Discussion
• Why is IX fast?
• Subtleties of adaptive batching
• Limitations
Conclusion
• A protected dataplane OS for datacenter applications with an
event-driven model and demanding connection scalability
requirements
• Efficient access to HW, without sacrificing security, through
virtualization
• High throughput and low latency enabled by a dataplane
execution model
Conventional Wisdom
• Bypass the kernel
• Move TCP to user-space (Onload, mTCP, Sandstorm)
• Move TCP to hardware (TOE)
• Avoid the connection scalability bottleneck
• Use datagrams instead of connections (DIY congestion management)
• Use proxies at the expense of latency
• Replace classic Ethernet
• Use a lossless fabric (InfiniBand)
• Offload memory access (RDMA)