On the Data Path Performance of Leaf-Spine Datacenter Fabrics
Mohammad Alizadeh
Joint with: Tom Edsall

Datacenter Networks
• Must complete transfers or "flows" quickly: need very high throughput, very low latency
• 1000s of server ports
[Figure: a single fabric serving diverse workloads: web, app, cache, db, map-reduce, HPC, monitoring]

Leaf-Spine DC Fabric
• Approximates an ideal output-queued (OQ) switch
[Figure: spine and leaf switches connecting hosts H1-H9, drawn as approximately equivalent (≈) to one big output-queued switch connecting the same hosts]
• How close is Leaf-Spine to the ideal OQ switch?
• What impacts its performance?
  – Link speeds, oversubscription, buffering

Methodology
• Widely deployed mechanisms
  – TCP-Reno+SACK
  – DropTail (10MB shared buffer per switch)
  – ECMP (hash-based) load balancing (a minimal hash-selection sketch appears at the end of this document)
• OMNET++ simulations
  – 100×10Gbps servers (2 tiers)
  – Actual Linux 2.6.26 TCP stack
• Realistic workloads
  – Bursty query traffic with Incast pattern
  – All-to-all background traffic: web search, data mining
• Metric: flow completion time (FCT)

Impact of Link Speed
Three non-oversubscribed topologies, each with 20×10Gbps downlinks:
• 20×10Gbps uplinks
• 5×40Gbps uplinks
• 2×100Gbps uplinks

Impact of Link Speed
[Plot: average FCT of large (10MB, ∞) background flows, normalized to optimal, vs. load (30-80%); curves: OQ-Switch, 20×10Gbps, 5×40Gbps, 2×100Gbps; lower is better]
• 40/100Gbps fabric: roughly the same FCT as the OQ switch
• 10Gbps fabric: FCT up to 40% worse than the OQ switch

Intuition
Higher-speed links improve ECMP efficiency.
With 11×10Gbps flows (55% load):
• 20×10Gbps uplinks: probability of 100% throughput = 3.27%
• 2×100Gbps uplinks: probability of 100% throughput = 99.95%
(A back-of-the-envelope check of these probabilities appears at the end of this document.)

Impact of Buffering
Where do queues build up? 2.5:1 oversubscribed fabric with large buffers.
[Plot: CDF of buffer occupancy (MBytes) at leaf front-panel ports, leaf fabric ports, and spine ports, at 24% and 60% load]
• Leaf port queues ≈ spine port queues
  – 1:1 port correspondence; same speed and load

Impact of Buffering
Where are large buffers more effective for Incast bursts?
[Plot: average query completion time (QCT), normalized to optimal, vs. load (30-80%); configurations: 100MB-Leaf/100MB-Spine, 2MB-Leaf/2MB-Spine, 2MB-Leaf/100MB-Spine, 100MB-Leaf/2MB-Spine; lower is better]
• Larger buffers are better at the Leaf than at the Spine

Summary
• 40/100Gbps fabric + ECMP ≈ OQ switch; some performance loss with a 10Gbps fabric
• Buffering should generally be consistent in the Leaf and Spine tiers
• Larger buffers are more useful in the Leaf than in the Spine for Incast

Impact of Link Speed
Three non-oversubscribed fabrics, all with 100×10Gbps edge links:
• 100×10Gbps
• 25×40Gbps
• 10×100Gbps

Methodology
• Widely deployed mechanisms
  – TCP-Reno+SACK
  – DropTail (10MB shared buffer per switch)
  – ECMP (hash-based) load balancing
• Various fabric link configurations
• OMNET++ simulations
  – 100×10Gbps servers (2 tiers)
  – Actual Linux 2.6.26 TCP stack
  – 100×10Gbps edge links
• Realistic workloads
  – Bursty query traffic
  – All-to-all background traffic: web search, data mining
• Metric: flow completion time

Workloads
• Realistic workloads based on empirical studies (from DCTCP [SIGCOMM 2010])
  – Query traffic with Incast pattern
  – All-to-all background traffic
[Plot: CDF of flow count and total bytes vs. flow size (10^3 to 10^8 bytes); flows >10MB are 5% of flows but 35% of bytes; flows <100KB are 55% of flows but only 3% of bytes]

Intuition
Incast events are most severe at the receiver.
• Many senders, one receiver
• Traffic is spread over many paths inside the fabric but converges at a single point: the receiver's edge link
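The ECMP intuition above can be sanity-checked with a short calculation. The sketch below is not from the talk; it assumes ECMP places each of the 11 long-lived 10 Gbps flows on a uniformly random uplink and computes the probability that no uplink ends up oversubscribed, so every flow runs at full rate. With 20×10 Gbps uplinks this requires all 11 flows to hash to distinct links and gives about 3.27%, matching the slide; with 2×100 Gbps uplinks it only requires that neither link receive all 11 flows and gives about 99.9%, in line with the slide's 99.95%.

```python
from functools import lru_cache
from math import comb

def prob_full_throughput(num_links, link_gbps, num_flows, flow_gbps):
    """Probability that hashing `num_flows` identical long-lived flows
    uniformly at random onto `num_links` uplinks leaves every uplink within
    capacity, so every flow can transmit at its full rate."""
    cap = link_gbps // flow_gbps  # max flows one uplink can carry at full rate

    @lru_cache(maxsize=None)
    def ways(links_left, flows_left):
        # Number of ways to place `flows_left` distinguishable flows onto
        # `links_left` links with at most `cap` flows per link.
        if links_left == 0:
            return 1 if flows_left == 0 else 0
        return sum(comb(flows_left, j) * ways(links_left - 1, flows_left - j)
                   for j in range(min(cap, flows_left) + 1))

    return ways(num_links, num_flows) / num_links ** num_flows

# 11 long-lived 10 Gbps flows = 55% load on 200 Gbps of uplink capacity.
print(prob_full_throughput(20, 10, 11, 10))   # ~0.0327 -> 3.27%, as on the slide
print(prob_full_throughput(2, 100, 11, 10))   # ~0.9990 -> ~99.9%, vs. ~99.95% quoted
```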
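For completeness, here is a minimal sketch of the per-flow, hash-based path selection that the Methodology slides call "ECMP (hash-based) load balancing". It is an illustration under simplifying assumptions, not the simulated switches' actual implementation: real hardware uses its own hash function over the packet 5-tuple, whereas this sketch uses Python's hashlib, and the IP addresses and ports in the example are made up. It shows why every packet of a TCP flow follows one uplink while distinct flows land on uplinks pseudo-randomly, which is where the collision probabilities computed above come from.

```python
import hashlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, num_uplinks):
    """Pick an uplink for a packet by hashing its flow 5-tuple.
    Every packet of a TCP flow carries the same 5-tuple, so the whole flow
    follows one uplink (preserving packet order); distinct flows are spread
    pseudo-randomly, which is the source of the collisions quantified above."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the switch's hardware hash
    return int.from_bytes(digest[:8], "big") % num_uplinks

# Two flows between the same pair of hosts may still take different uplinks
# (example addresses and ports are hypothetical):
print(ecmp_uplink("10.0.1.5", "10.0.2.9", 34512, 80, "tcp", 20))
print(ecmp_uplink("10.0.1.5", "10.0.2.9", 34513, 80, "tcp", 20))
```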