
NUMA Aware I/O in Virtualized Systems
Rishi Mehta, Zach Shen, Amitabha Banerjee
© 2014 VMware Inc. All rights reserved.
Non-Uniform Memory Access
[Diagram: Socket + Memory = NUMA node. An access to a node's own memory is a local memory access; an access from NUMA 0 to NUMA 1's memory is a remote memory access.]
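The local/remote latency gap can be observed directly. A minimal sketch, assuming Linux with libnuma (link with -lnuma) and at least two NUMA nodes; the buffer size and node ids are illustrative:

#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)

/* Touch one byte per cache line and return the elapsed time in ms. */
static double touch_ms(char *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += 64)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);                            /* run on node 0's CPUs  */
    char *local  = numa_alloc_onnode(BUF_SIZE, 0);  /* node 0 memory: local  */
    char *remote = numa_alloc_onnode(BUF_SIZE, 1);  /* node 1 memory: remote */
    touch_ms(local);                                /* fault pages in first, */
    touch_ms(remote);                               /* then time real access */
    printf("local:  %.1f ms\n", touch_ms(local));
    printf("remote: %.1f ms\n", touch_ms(remote));
    numa_free(local, BUF_SIZE);
    numa_free(remote, BUF_SIZE);
    return 0;
}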
Non-Uniform Memory Access: Modern Server Architecture
[Diagram: a two-node server; each NUMA node hosts its own applications and memory.]
Non-Uniform I/O Access
[Diagram: an I/O device attaches to one NUMA node, so device access is local from that node and remote from the others. Shown for a 2-socket Dell PE R720 (NUMA 0-1) and a 4-socket Dell PE R840 (NUMA 0-3).]
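On Linux, the NUMA node a PCI device attaches to is visible in sysfs; a minimal sketch (the PCI address below is a placeholder for the NIC's actual one):

#include <stdio.h>

int main(void)
{
    /* Placeholder BDF; substitute the NIC's PCI address. */
    const char *path = "/sys/bus/pci/devices/0000:81:00.0/numa_node";
    FILE *f = fopen(path, "r");
    int node = -1;
    if (f != NULL) {
        if (fscanf(f, "%d", &node) == 1)
            printf("device NUMA node: %d\n", node);  /* -1 means unknown */
        fclose(f);
    }
    return 0;
}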
Intel DDIO
[Diagram: classic path vs. DDIO path. Classic: the device DMAs I/O data to memory, and the processor then pulls it into its cache. DDIO: the device places I/O data directly into the processor's cache.]
*Currently works only with a local PCI device.
Intel DDIO Benefits
[Diagram: ideal situation vs. real situation. Ideal: applications, memory, and I/O threads share the device's NUMA node, so DDIO and local memory access apply. Real: VMs and I/O threads are spread across nodes, so I/O access is non-uniform.]
DDIO Benefits vs. Local I/O: 1 VM
[Charts: L3 cache misses per packet and packet rate (mpps), comparing the I/O thread placed on the device's NUMA node vs. on the VM's NUMA node.]
DDIO Benefits vs. Local I/O: 4 VMs
[Charts: L3 cache misses per packet and packet rate (mpps), I/O thread placed with the device vs. with the VM, for four VMs.]
NUMA I/O Scheduler
•  Hybrid mode
–  Low load:
•  One I/O thread is sufficient for the networking traffic.
•  Pin the I/O thread to the device's NUMA node (sketched below).
•  Let the scheduler migrate I/O-intensive VMs to the device's NUMA node.
–  High load:
•  Enough load to justify multiple I/O threads.
•  Create one I/O thread per NUMA node.
•  Route each VM's I/O through the thread on the VM's home node.
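A minimal sketch of the low-load policy, assuming Linux with libnuma (link with -lnuma); the helper name and node id are illustrative, not the product implementation:

#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling I/O thread to one NUMA node's CPUs and prefer that
 * node for its allocations. */
static void pin_io_thread_to_node(int node)
{
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        perror("numa_node_to_cpus");
        exit(EXIT_FAILURE);
    }
    if (numa_sched_setaffinity(0 /* calling thread */, cpus) != 0) {
        perror("numa_sched_setaffinity");
        exit(EXIT_FAILURE);
    }
    numa_set_preferred(node);        /* allocate buffers from local memory */
    numa_free_cpumask(cpus);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return EXIT_FAILURE;
    }
    pin_io_thread_to_node(0);        /* e.g. the device's NUMA node */
    /* ... run the I/O loop here ... */
    return EXIT_SUCCESS;
}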
NUMA I/O: Light Load
[Diagram: a single I/O thread with affinity to the I/O device's NUMA node; VMs on both NUMA 0 and NUMA 1 are served by it.]
NUMA I/O: Heavy Load
[Diagram: one I/O thread per NUMA node, each with affinity to its own node's I/O device; each VM is served by the thread on its home node.]
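A minimal sketch of the high-load policy, assuming libnuma and pthreads; the routing helper and thread body are illustrative, not the product implementation:

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_NODES 8

static pthread_t io_thread[MAX_NODES];

/* Each I/O thread pins itself to its node's CPUs, then services I/O. */
static void *io_loop(void *arg)
{
    int node = (int)(long)arg;
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    numa_sched_setaffinity(0, cpus);
    numa_free_cpumask(cpus);
    /* ... poll the device queue assigned to this node ... */
    return NULL;
}

/* Route a VM's I/O to the thread on the VM's home node. */
static pthread_t io_thread_for_vm(int vm_home_node)
{
    return io_thread[vm_home_node];
}

int main(void)
{
    int nodes = numa_max_node() + 1;
    for (int n = 0; n < nodes && n < MAX_NODES; n++)
        pthread_create(&io_thread[n], NULL, io_loop, (void *)(long)n);
    printf("VM on node 0 -> thread %lu\n",
           (unsigned long)io_thread_for_vm(0));
    for (int n = 0; n < nodes && n < MAX_NODES; n++)
        pthread_join(io_thread[n], NULL);
    return 0;
}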
Hardware Support
•  Most NICs support multiple queues.
–  Multiple contexts to process I/O requests.
–  More scheduling choices.
•  A virtual machine provides an abstraction at the application layer.
–  Easier to isolate applications and I/O traffic by classifying on the MAC address instead of a 5-tuple hash (contrasted in the sketch below).
[Diagram: hardware queues selected by RSS hash vs. hardware queues selected by MAC address, with one queue per VM.]
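A minimal sketch contrasting the two classification schemes; the packet header struct, hash, and queue count are illustrative, not a NIC's actual implementation:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_QUEUES 8

struct pkt_hdr {
    uint8_t  dst_mac[6];
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a, standing in for the NIC's real RSS (Toeplitz) hash. */
static uint32_t fnv1a(const void *data, size_t len, uint32_t hash)
{
    const uint8_t *p = data;
    while (len--) { hash ^= *p++; hash *= 16777619u; }
    return hash;
}

/* RSS-style: hash the 5-tuple; one VM's flows spread across queues. */
static unsigned queue_by_rss(const struct pkt_hdr *h)
{
    uint32_t hash = 2166136261u;
    hash = fnv1a(&h->src_ip,   sizeof h->src_ip,   hash);
    hash = fnv1a(&h->dst_ip,   sizeof h->dst_ip,   hash);
    hash = fnv1a(&h->src_port, sizeof h->src_port, hash);
    hash = fnv1a(&h->dst_port, sizeof h->dst_port, hash);
    hash = fnv1a(&h->proto,    sizeof h->proto,    hash);
    return hash % NUM_QUEUES;
}

/* MAC-based: each VM's vNIC MAC acts as a filter, so all of the VM's
 * traffic lands on one queue that can be serviced from its home node. */
static int queue_by_mac(const struct pkt_hdr *h,
                        const uint8_t vm_macs[][6], unsigned num_vms)
{
    for (unsigned q = 0; q < num_vms && q < NUM_QUEUES; q++)
        if (memcmp(h->dst_mac, vm_macs[q], 6) == 0)
            return (int)q;       /* dedicated queue for this VM */
    return -1;                   /* no filter hit: use a default queue */
}

int main(void)
{
    const uint8_t vm_macs[2][6] = {
        { 0x00, 0x50, 0x56, 0x00, 0x00, 0x01 },
        { 0x00, 0x50, 0x56, 0x00, 0x00, 0x02 },
    };
    struct pkt_hdr h = { { 0x00, 0x50, 0x56, 0x00, 0x00, 0x02 },
                         0x0a000001, 0x0a000002, 1234, 80, 6 };
    printf("RSS queue: %u, MAC queue: %d\n",
           queue_by_rss(&h), queue_by_mac(&h, vm_macs, 2));
    return 0;
}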
Improvement: Light Load
[Charts: throughput (Gbps) and CPU utilization per Gbps for transmit and receive, Default vs. NUMA I/O; annotated improvements range from 10% to 24%.]
Test setup:
•  1 VM, 1 vCPU
•  NUMA: 2 sockets
•  Workload: TCP bulk, 4 sessions
•  I/O device: Mellanox 40GigE
Improvement: Light Load (DPDK L2 Forwarding)
[Chart: percentage improvement in receive pps, forward pps, and total CPU% per 1M forwarded packets, for 64-, 256-, 512-, and 1024-byte packets.]
Test setup:
•  1 VM, 1 vCPU
•  NUMA: 2 sockets
•  Workload: UDP packet forwarding
•  I/O device: Intel 10G
Improvement: Heavy Load
[Charts: HTTP request-response time (RRT, ms) and CPU utilization (%) vs. number of VMs (200-280), Default vs. NUMA I/O; annotated CPU savings of 4% and 6%.]
Test setup:
•  200-280 VMs, 1 vCPU per VM
•  NUMA: 4 sockets
•  Workload: fixed HTTP requests, 50 requests/VM/second
Recap: NUMA-Aware I/O
•  Hybrid design: low and high load
•  Improvements:
•  Higher throughput (increased by 25%)
•  Lower CPU utilization (reduced by 20%)
•  Better VM consolidation (up to 5% more VMs per host)
•  Future work: GPU, storage, RDMA
Questions?