NUMA Aware I/O in Virtualized Systems
Rishi Mehta, Zach Shen, Amitabha Banerjee
© 2014 VMware Inc. All rights reserved.

Non-Uniform Memory Access
• Socket + Memory = NUMA node
[Diagram: two nodes, NUMA 0 and NUMA 1; memory attached to the local socket is reached by fast local memory access, memory on the other socket by slower remote memory access]

Non-Uniform Memory Access: Modern Server Architecture
[Diagram: App 1 and App 2 running on separate NUMA nodes, each with its own app memory]

Non-Uniform I/O Access
• I/O devices also attach to a single socket, so device access is local or remote.
[Diagram: 2-socket Dell PE R720 (NUMA 0, NUMA 1) and 4-socket Dell PE R840 (NUMA 0 through NUMA 3), each with PCI devices attached to specific nodes]

Intel DDIO
[Diagram: classic path, the device DMAs into memory and the processor then fetches data into its cache; DDIO path, the device DMAs directly into the processor cache]
• * Currently works only with a local PCI device.

Intel DDIO Benefits
• Ideal situation: app, app memory, and I/O threads all on the device's node, so all accesses are local.
• Real situation: apps, memory, and I/O threads are spread across nodes, so I/O access is non-uniform.
[Diagram: ideal vs. real placement of apps, app memory, and I/O threads across NUMA nodes]

DDIO Benefits vs. Local I/O: 1 VM
[Chart: L3 cache misses per packet (left axis, 0-25) and packet rate (right axis, 0-1.8 Mpps), comparing the I/O thread co-located with the device against co-located with the VM; series: VM, I/O thread, packet rate]

DDIO Benefits vs. Local I/O: 4 VMs
[Chart: same metrics as the 1-VM case (cache misses per packet, 0-25; packet rate, 0.9-1.25 Mpps), for the two co-location choices]

NUMA I/O Scheduler
• Hybrid mode (minimal sketches of both modes follow the recap slide)
  – Low load:
    • One I/O thread is sufficient for the networking traffic.
    • Pin the I/O thread to the device's NUMA node.
    • Let the scheduler migrate I/O-intensive VMs to the device's NUMA node.
  – High load:
    • There is sufficient load for multiple I/O threads.
    • Create one I/O thread per NUMA node.
    • Pick the I/O thread based on the virtual machine's home node.

NUMA I/O: Light Load
[Diagram: four VMs across NUMA 0 and NUMA 1, before and after: the single I/O thread gains affinity to the NUMA node of the hardware I/O device, and VMs migrate toward it]

NUMA I/O: Heavy Load
[Diagram: six VMs across NUMA 0 and NUMA 1; one I/O thread per node, each with affinity to its own node, serving the VMs homed there]

Hardware Support
• Most NICs support multiple queues.
  – Multiple contexts to process I/O requests.
  – More scheduling choices.
• The virtual machine provides an abstraction at the application layer.
  – Easier to isolate apps and I/O traffic by classifying on MAC address instead of a 5-tuple hash.
[Diagram: hardware queues populated by RSS hash vs. hardware queues populated by MAC address, with each VM's traffic landing on its own queue]

Improvement: Light Load
[Chart: throughput (Gbps) and CPU utilization per Gbps, Default vs. NUMA I/O, for transmit and receive; annotated improvements of 16%, 19%, 24%, and 10%]
• Test setup: 1 VM, 1 vCPU
• NUMA: 2 sockets
• Workload: TCP bulk, 4 sessions
• I/O device: Mellanox 40GigE

Improvement: Light Load, DPDK L2 Forwarding
[Chart: percentage improvement (0-30%) in rx pps, forwarded pps, and total CPU% per 1M forwarded packets, for 64-, 256-, 512-, and 1024-byte packets]
• Test setup: 1 VM, 1 vCPU
• NUMA: 2 sockets
• Workload: UDP packet forwarding
• I/O device: Intel 10G

Improvement: Heavy Load
[Chart: CPU utilization (60-100%) and HTTP request-response time (0-300 ms) vs. number of VMs (200-280), Default vs. NUMA I/O; NUMA I/O lowers CPU utilization, with annotated gains of 4% and 6%]
• Test setup: 200-280 VMs, 1 vCPU per VM
• NUMA: 4 sockets
• Workload: fixed HTTP requests, 50 requests per VM per second

Recap: NUMA Aware I/O
• Hybrid design: low and high load
• Improvement:
  – Higher throughput (increased by 25%)
  – Lower CPU (reduced by 20%)
  – Better VM consolidation (up to 5% more VMs per host)
• Future work: GPU, storage, RDMA

Questions?
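The low-load mode above reduces to two steps: find the device's NUMA node, then pin the single I/O thread (and its memory allocations) there. Below is a minimal user-space sketch of those steps, assuming Linux with libnuma; the PCI address is a hypothetical example, and a hypervisor-internal scheduler would do the equivalent with its own primitives.

```c
/* Sketch: pin the current I/O thread to the NUMA node of a PCI NIC.
 * Assumes Linux with libnuma (compile with -lnuma); the BDF string is
 * a hypothetical example. The sysfs numa_node attribute is standard. */
#include <stdio.h>
#include <numa.h>

/* Read the NUMA node a PCI device is attached to (-1 if unknown). */
static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;
}

int main(void)
{
    const char *bdf = "0000:3b:00.0";   /* hypothetical NIC address */

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int node = pci_numa_node(bdf);
    if (node < 0)
        node = 0;                       /* fall back to node 0 */

    /* Restrict this thread's CPUs and its allocations to the device's
     * node, so DMA buffers and cache lines stay local to the NIC. */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(node);

    printf("I/O thread pinned to NUMA node %d for device %s\n", node, bdf);
    /* ... run the packet-processing loop here ... */
    return 0;
}
```

Pinning both the CPU placement and the memory policy matters here: DDIO only delivers data into the cache of the socket the device hangs off, so an I/O thread running or allocating on a remote node forfeits that benefit.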
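The high-load mode instead creates one I/O thread per NUMA node and routes each VM's I/O to the thread on that VM's home node. A sketch under the same assumptions (Linux, libnuma, pthreads); vm_home_node() and the per-node work loop are hypothetical stand-ins for hypervisor internals.

```c
/* Sketch: heavy-load mode -- one I/O thread per NUMA node, with work
 * routed by VM home node. Compile with -lnuma -lpthread. */
#include <stdio.h>
#include <pthread.h>
#include <numa.h>

#define MAX_NODES 8

static pthread_t io_thread[MAX_NODES];

/* Each I/O thread binds itself (CPUs and allocations) to its node. */
static void *io_loop(void *arg)
{
    int node = (int)(long)arg;
    numa_run_on_node(node);     /* run only on this node's CPUs */
    numa_set_preferred(node);   /* allocate buffers locally     */
    /* ... poll this node's queue and process I/O requests ...  */
    return NULL;
}

/* Hypothetical stand-in: the hypervisor tracks each VM's home node. */
static int vm_home_node(int vm_id, int nodes) { return vm_id % nodes; }

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int nodes = numa_num_configured_nodes();
    if (nodes > MAX_NODES)
        nodes = MAX_NODES;

    /* One I/O thread per NUMA node. */
    for (int n = 0; n < nodes; n++)
        pthread_create(&io_thread[n], NULL, io_loop, (void *)(long)n);

    /* Dispatch: a VM's I/O is handled by the thread on its home node,
     * keeping buffers and cache lines local, as in the heavy-load
     * diagram above. */
    for (int vm = 1; vm <= 6; vm++)
        printf("VM %d -> I/O thread on node %d\n",
               vm, vm_home_node(vm, nodes));

    for (int n = 0; n < nodes; n++)
        pthread_join(io_thread[n], NULL);
    return 0;
}
```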