
On the Data Path Performance of
Leaf-Spine Datacenter Fabrics
Mohammad Alizadeh
Joint with: Tom Edsall
Datacenter Networks
•  Must complete transfers or “flows” quickly
–  need very high throughput, very low latency
•  1000s of server ports hosting diverse workloads: web, app, cache, db, map-reduce, HPC, monitoring
Leaf-Spine DC Fabric
Approximates an ideal output-queued (OQ) switch
[Figure: spine and leaf switches connecting hosts H1–H9 ≈ a single output-queued switch over the same hosts]
•  How close is Leaf-Spine to an ideal OQ switch?
•  What impacts its performance?
–  Link speeds, oversubscription, buffering
Methodology
•  Widely deployed mechanisms
–  TCP-Reno+Sack
–  DropTail (10MB shared buffer per switch)
–  ECMP (hash-based) load balancing (sketched below)
•  OMNET++ simulations
–  100×10Gbps servers (2 tiers)
–  Actual Linux 2.6.26 TCP stack
•  Realistic workloads
–  Bursty query traffic with Incast pattern
–  All-to-all background traffic: web search, data mining
Metric: Flow completion time
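As a concrete illustration of hash-based ECMP, here is a minimal sketch of how a switch might pick an uplink for a flow. The 5-tuple fields and the CRC32 hash are illustrative assumptions, not the simulator's or any real switch's actual hash function.

```python
# Minimal sketch of hash-based ECMP path selection. The 5-tuple fields and the
# CRC32 hash are illustrative assumptions, not the switch/simulator's real hash.
import zlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, num_uplinks):
    """Map a flow to an uplink by hashing its 5-tuple.

    Every packet of a flow lands on the same uplink, so a single large flow
    can never be split across paths -- the source of ECMP's inefficiency
    when there are many low-speed uplinks.
    """
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % num_uplinks

# Example: one flow pinned to one of 20 x 10Gbps uplinks.
print(ecmp_uplink("10.0.0.1", "10.0.1.2", 34567, 80, "tcp", 20))
```

Because the mapping is per-flow, how evenly load spreads depends on how many flows there are relative to uplinks, which is exactly what the link-speed results below probe.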
Impact of Link Speed
Three non-oversubscribed topologies:
–  20×10Gbps uplinks, 20×10Gbps downlinks
–  5×40Gbps uplinks, 20×10Gbps downlinks
–  2×100Gbps uplinks, 20×10Gbps downlinks
Impact of Link Speed
[Figure: Avg FCT of large (10MB, ∞) background flows, normalized to optimal, vs. load (30–80%), for an OQ-Switch and the 20×10Gbps, 5×40Gbps, and 2×100Gbps fabrics]
•  40/100Gbps fabric: ~same FCT as OQ
•  10Gbps fabric: FCT up to 40% worse than OQ
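The "normalized to optimal" metric divides each flow's measured completion time by an idealized baseline. A minimal sketch, assuming the baseline is the flow's transmission time at the 10Gbps edge rate plus one unloaded RTT (the slides do not spell out the exact definition of "optimal"):

```python
# Hedged sketch of FCT normalization. The "optimal" baseline is assumed here to
# be size / edge-link rate + one unloaded RTT; the study's exact definition may differ.
EDGE_RATE_BPS = 10e9    # 10Gbps server links
BASE_RTT_S = 50e-6      # assumed unloaded fabric RTT

def ideal_fct_s(flow_size_bytes):
    return flow_size_bytes * 8 / EDGE_RATE_BPS + BASE_RTT_S

def normalized_fct(measured_fct_s, flow_size_bytes):
    """>= 1.0, where 1.0 means the flow finished as fast as an empty network allows."""
    return measured_fct_s / ideal_fct_s(flow_size_bytes)

# Example: a 10MB flow that took 12ms against an ideal of ~8ms.
print(round(normalized_fct(12e-3, 10e6), 2))   # ~1.49
```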
Intuition
Higher speed links improve ECMP efficiency
[Figure: 11×10Gbps flows (55% load) hashed by ECMP across 20×10Gbps uplinks vs. 2×100Gbps uplinks]
•  20×10Gbps uplinks: Prob of 100% throughput = 3.27%
•  2×100Gbps uplinks: Prob of 100% throughput = 99.95%
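The quoted probabilities follow from a simple collision argument, assuming each flow is hashed independently and uniformly to an uplink. A sketch of that calculation (the slide's exact 2×100Gbps figure may come from a slightly different accounting):

```python
# ECMP collision math for 11 x 10Gbps flows, assuming independent uniform hashing.
from math import perm

def p_full_throughput_10g(n_flows, n_uplinks):
    # 10Gbps uplinks carry only one 10Gbps flow each, so 100% throughput
    # requires every flow to land on a distinct uplink (birthday-style product).
    return perm(n_uplinks, n_flows) / n_uplinks ** n_flows

def p_full_throughput_2x100g(n_flows):
    # Two 100Gbps uplinks each fit ten 10Gbps flows; the only bad outcome is
    # all n_flows hashing onto the same uplink.
    return 1 - 2 * 0.5 ** n_flows

print(f"{p_full_throughput_10g(11, 20):.4f}")   # ~0.0327 -> the 3.27% on the slide
print(f"{p_full_throughput_2x100g(11):.4f}")    # ~0.999  -> 100Gbps uplinks almost never collide
```

Fewer, faster uplinks make it far less likely that hash collisions leave any link oversubscribed, which is the intuition behind the 40/100Gbps results.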
Impact of Buffering
Where do queues build up?
2.5:1 oversubscribed fabric with large buffers
[Figure: CDF of buffer occupancy (MBytes) at Leaf front-panel ports, Leaf fabric ports, and Spine ports, at 24% and 60% load]
•  Leaf fabric port queues ≈ Spine port queues
–  1:1 port correspondence; same speed & load
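To see why equal speed and equal load translate into similar queues, here is a toy slotted droptail model (not the OMNET++ setup): two ports driven by the same bursty arrival process and drain rate end up with near-identical occupancy statistics, which is the slide's point about leaf fabric ports versus spine ports.

```python
# Toy slotted queue model (illustrative only, not the paper's simulation):
# two droptail ports at the same load and drain rate show similar occupancy.
import random

def occupancy_samples(load, batch=4, n_slots=200_000, seed=0):
    random.seed(seed)
    q, samples = 0, []
    for _ in range(n_slots):
        if random.random() < load / batch:   # bursty arrivals: a batch at a time
            q += batch
        q = max(q - 1, 0)                    # drain one packet per slot
        samples.append(q)
    return samples

leaf_fabric_port = occupancy_samples(load=0.6, seed=1)
spine_port       = occupancy_samples(load=0.6, seed=2)
print(sum(leaf_fabric_port) / len(leaf_fabric_port))   # similar average occupancy...
print(sum(spine_port) / len(spine_port))               # ...for both port types
```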
Impact of Buffering
Where are large buffers more effective for Incast bursts?
[Figure: Avg query completion time (QCT, normalized to optimal) vs. load (30–80%) for four buffer configurations: 2MB-Leaf/2MB-Spine, 2MB-Leaf/100MB-Spine, 100MB-Leaf/2MB-Spine, 100MB-Leaf/100MB-Spine]
•  Larger buffers at Leaf better than Spine
Summary
•  40/100Gbps fabric + ECMP ≈ OQ-switch; some performance loss with 10Gbps fabric
•  Buffering should generally be consistent in Leaf & Spine tiers
•  Larger buffers more useful in Leaf than Spine for Incast
Impact of Link Speed
Three non-oversubscribed fabrics (100×10Gbps edge links):
–  100×10Gbps
–  25×40Gbps
–  10×100Gbps
Methodology
•  Widely deployed mechanisms
–  TCP-Reno+Sack
–  DropTail (10MB shared buffer per switch)
–  ECMP (hash-based) load balancing
•  OMNET++ simulations
–  100×10Gbps servers (2 tiers); various fabric link configs
–  Actual Linux 2.6.26 TCP stack
•  Realistic workloads
–  Bursty query traffic
–  All-to-all background traffic: web search, data mining
Metric: Flow completion time
Workloads
•  Realistic workloads based on empirical studies
–  Query traffic with Incast pattern
–  All-to-all background traffic
[Figure: CDF of flow size (bytes, 10^3–10^8), by flow count and by total bytes*: flows >10MB are 5% of flows but 35% of bytes; flows <100KB are 55% of flows but only 3% of bytes]
* From DCTCP [SIGCOMM 2010]
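For experiments that only need the heavy-tailed shape above (most flows small, most bytes in a few large flows), a crude two-threshold sampler can stand in for the published CDFs. The class weights follow the flow-count fractions on the slide; the size ranges are assumptions and will not reproduce the byte percentages exactly:

```python
# Crude stand-in for the heavy-tailed flow-size distribution sketched above.
# Flow-count weights (55% / 40% / 5%) follow the slide; the size ranges are
# illustrative assumptions, so byte shares only roughly match the real CDFs.
import random

def sample_flow_size_bytes():
    r = random.random()
    if r < 0.55:
        return random.randint(1_000, 100_000)            # <100KB: most flows, few bytes
    elif r < 0.95:
        return random.randint(100_000, 10_000_000)        # mid-size flows
    else:
        return random.randint(10_000_000, 30_000_000)     # >10MB: few flows, many bytes

sizes = [sample_flow_size_bytes() for _ in range(100_000)]
big_bytes = sum(s for s in sizes if s > 10_000_000)
print(f">10MB flows: {100 * big_bytes / sum(sizes):.0f}% of bytes")
```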
Intuition
Incast events are most severe at receiver
[Figure: many senders spread over many fabric paths converge on one receiver; the receiver's leaf port is the single point of convergence]
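A back-of-envelope sketch of why the pressure lands at the receiver's leaf port, and hence why extra buffer helps more at the Leaf than at the Spine: the whole burst funnels into one downlink there, while each spine port only sees a fraction of it. The response size, link rate, and burst duration below are illustrative assumptions:

```python
# Back-of-envelope incast sizing (illustrative assumptions, not the paper's model).
# N responders answer a query at once; all responses converge on the receiver's
# single 10Gbps leaf downlink, so whatever that link cannot drain during the
# burst must sit in the leaf port's buffer (or be dropped).
def leaf_buffer_demand_bytes(n_senders, response_bytes, link_bps=10e9, burst_s=100e-6):
    arriving = n_senders * response_bytes        # burst converging at the leaf port
    drained = link_bps / 8 * burst_s             # bytes the downlink drains meanwhile
    return max(arriving - drained, 0.0)

# Example: 40 responders x 64KB responses vs. one 10Gbps downlink.
print(f"{leaf_buffer_demand_bytes(40, 64_000) / 1e6:.2f} MB queued at the receiver's leaf port")
```

Upstream, the same burst is spread across many spine ports by ECMP, so no single spine buffer sees anywhere near this demand.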