Slides

Planck: Millisecond-scale Monitoring
and Control for Commodity Networks
Jeff Rasley, Brent Stephens,!
Colin Dixon, Eric Rozner,!
Wes Felter, Kanak Agarwal,!
John Carter, Rodrigo Fonseca
1
Self-Tuning Networks
Control
Loop
Examples!
!
•
•
•
Measurement
Traffic Engineering!
Failure Avoidance!
Forwarding Behavior Verification
How fast can we do this?
Control
Decision
2
Self-Tuning Networks
100 ms — 1 sec+
Control
Loop
Examples!
!
•
•
•
Measurement
Traffic Engineering!
Failure Avoidance!
Forwarding Behavior Verification
How fast can we do this?
Control
Decision
2
Self-Tuning Networks
100 ms — 1 sec+
Control
Loop
Examples!
!
•
•
•
Measurement
Traffic Engineering!
Failure Avoidance!
Forwarding Behavior Verification
~100 µs
How fast can we do this?
Control
Decision
2
Self-Tuning Networks
100 ms — 1 sec+
Control
Loop
Examples!
!
•
•
•
Measurement
Traffic Engineering!
Failure Avoidance!
Forwarding Behavior Verification
How fast can we do this?
> 10 ms
~100 µs
Control
Decision
2
Self-Tuning Networks
100 ms — 1 sec+
Control
Loop
Examples!
!
•
•
•
Measurement
Traffic Engineering!
Failure Avoidance!
Forwarding Behavior Verification
How fast can we do this?
> 10 ms
~100 µs
Control
Decision
2
State-of-the-Art Measurement
System
Measurement Speed (ms)
Hedera (NSDI ’10)
5,000
DevoFlow Polling (Sigcomm ’11)
500–15,000
Mahout Polling (Infocom ’11)
190
sFlow/OpenSample (ICDCS ’14)
100
Helios (Sigcomm ’10)
77.4
Planck
< 4.2
3
State-of-the-Art Measurement
System
Measurement Speed (ms)
Hedera (NSDI ’10)
5,000
DevoFlow Polling (Sigcomm ’11)
500–15,000
Mahout Polling (Infocom ’11)
190
sFlow/OpenSample (ICDCS ’14)
100
Helios (Sigcomm ’10)
77.4
Planck
< 4.2
4
Outline
❖
Motivation!
❖
Planck Architecture!
❖
Is Planck Feasible?!
❖
Is Planck Useful?!
❖
Microbenchmarks!
❖
Traffic Engineering
5
Architecture Goals
❖
Obtain very fast samples across all switches in the network!
❖
Use those samples to infer global state of the network!
❖
Flow throughput!
❖
Flow paths!
❖
Port congestion state
6
Our Solution: Repurpose Port Mirroring
❖
Modern switches support port-mirroring!
❖
❖
Copies all packets e.g. going out a port to a
designated mirror port!
Production!
Traffic
Mirror all ports to a single mirror port!
❖
Intentional oversubscription!
❖
Drop behavior approximates sampling!
❖
Data-plane sampling much faster than
control-plane based approaches
Switch
7
Our Solution: Repurpose Port Mirroring
❖
Modern switches support port-mirroring!
❖
❖
Copies all packets e.g. going out a port to a
designated mirror port!
Production!
Traffic
Mirror all ports to a single mirror port!
❖
❖
❖
Intentional oversubscription!
Drop behavior approximates sampling!
Data-plane sampling much faster than
control-plane based approaches
Mirror Port
Switch
7
Our Solution: Repurpose Port Mirroring
❖
Modern switches support port-mirroring!
❖
❖
Copies all packets e.g. going out a port to a
designated mirror port!
Production!
Traffic
Mirror all ports to a single mirror port!
❖
❖
❖
Intentional oversubscription!
Drop behavior approximates sampling!
Data-plane sampling much faster than
control-plane based approaches
Mirror Port
Switch
7
Planck Architecture
❖
Oversubscribed port-mirroring as a
primitive !
❖
Collectors receive samples from mirror
ports!
❖
Netmap for fast processing!
❖
Reconstruct flow information across all
flows in the network!
❖
Collectors can interact with an SDN
controller to implement various applications
Planck Architecture
❖
Oversubscribed port-mirroring as a
primitive !
❖
Collectors receive samples from mirror
ports!
❖
❖
❖
Netmap for fast processing!
Reconstruct flow information across all
flows in the network!
Collectors can interact with an SDN
controller to implement various applications
Planck!
Collector!
Instance(s)
Planck Architecture
❖
Oversubscribed port-mirroring as a
primitive !
❖
Collectors receive samples from mirror
ports!
❖
❖
❖
SDN!
Controller
Netmap for fast processing!
Reconstruct flow information across all
flows in the network!
Collectors can interact with an SDN
controller to implement various applications
Planck!
Collector!
Instance(s)
Outline
❖
Motivation!
❖
Planck Architecture!
❖
Is Planck Feasible?!
❖
Is Planck Useful?!
❖
Microbenchmarks!
❖
Traffic Engineering
9
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck!
Collector!
Instance(s)
10
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck!
Collector!
Instance(s)
11
Experiment Setup
❖
Vary the number of congested ports!
N number of: 2 senders paired with 1
receiver!
❖
❖
Mirror Port
Collector
TCP will fill up the output buffer going to
the receiver!
15 trials for each config with/without
Planck-mirroring!
❖
Monitor latency, packet loss and
throughput
1
2
…
❖
Switch
N
12
Switches Share Buffers
❖
Modern switches use shared buffers!
❖
Independent queues per ports is not
completely accurate!
❖
Memory consumed by an oversubscribed
mirror port may use space that production
traffic could use
13
Switches Share Buffers
Switch
r
o
r
r
Mi
Modern switches use shared buffers!
❖
Independent queues per ports is not
completely accurate!
1
❖
Memory consumed by an oversubscribed
mirror port may use space that production
traffic could use
2
t
r
o
P
…
❖
Collector
N
13
Production Traffic Latency
Average Latency (ms)
3
Mirror
2.5
No Mirror
2
1.5
1
0.5
0
1
2
3
4
5
6
7
Congested Output Ports
8
9
10
14
Production Traffic Packet Loss
15
Median
Throughput
(Gbps)
0.1%
Throughput
(Gbps)
Production Throughput
6
4
Mirror
No Mirror
Mirror
No Mirror
2
0
6
4
2
0
1
2
3
4
5
6
7
Congested Output Ports
8
9
16
Production Throughput
Median
0.1%
Median
0.1%
Throughput
Throughput Throughput
Throughput
(Gbps)
(Gbps)
(Gbps)
(Gbps)
6
6
4
4
Mirror
Mirror
No Mirror
Mirror
No
Mirror
Mirror
No Mirror
Mirror
No
2
2
0
0
6
6
4
4
2
2
0
0
1
1
2
2
3
4
5
6
7
3
4
5
6
7
Congested Output
Output Ports
Ports
Congested
8
8
9
9
16
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck!
Collector!
Instance(s)
17
Throughput Estimation is Accurate
❖
Estimates are trivial if sampling
rate is known!
❖
Leverage TCP seq# in packets!
❖
Smoothed estimates in 200–700 µs!
❖
See paper for more details
(n*10 Gbps sent to mirror port)
18
Inter-arrival
Length
(MTUs)
Sample Inter-Arrival Length
14
12
10
8
6
4
2
0
2
4
6
8
10
12
❖
x13 10 Gbps flows!
❖
Grows roughly linearly!
❖
See paper for further results
14
Flows
19
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck!
Collector!
Instance(s)
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck provides flow
throughput, a flow just needs
to be mapped to a port
Planck!
Collector!
Instance(s)
Is Planck Feasible?
SDN!
Controller
❖
Does Planck hurt production traffic?!
❖
Can Planck infer throughput?!
❖
Can Planck infer congested ports?
Planck provides flow
throughput, a flow just needs
to be mapped to a port
Planck!
Collector!
Instance(s)
Outline
❖
Motivation!
❖
Planck Architecture!
❖
Is Planck Feasible?!
❖
Is Planck Useful?!
❖
Microbenchmarks!
❖
Traffic Engineering
21
Sample Latency
t_2
H
Switch
t_1
Collector
H
H
Sender
Receiver
Latency = t_2 - t_1
22
Sample Latency
t_2
H
Switch
t_1
Collector
H
H
Sender
Receiver
Latency = t_2 - t_1
Low Congestion !
Sample Latency: !
75–150 µs
22
Sample Latency
t_2
High Congestion Sample Latency
H
Switch
t_1
H
H
Sender
Receiver
Latency = t_2 - t_1
Low Congestion !
Sample Latency: !
75–150 µs
CDF
Collector
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
1
2
3
4
5
6
Measurement Latency (ms)
IBM G8264 (10Gb)
Pronto 3290 (1Gb)
22
Control Loop Times
❖
< 3.5 ms to obtain sample + < 700µs to get
tput estimate = 4.2 ms worst-case
measurement time for 10 Gb!
❖
Planck achieves measurement speeds 18x –
291x faster than recent approaches!
❖
Shadow MAC addresses [1] and some ARP
tricks allow re-routing at < 3ms!
❖
See paper for more details
[1] Shadow MACs: Scalable Label-switching !
for Commodity Ethernet (HotSDN ’14 )
100 ms — 1 sec+
Measurement
> 10 ms
~100 µs
Control
Decision
23
Control Loop Times
❖
Planck: < 4.2 ms
100 ms — 1 sec+
< 3.5 ms to obtain sample + < 700µs to get
tput estimate = 4.2 ms worst-case
measurement time for 10 Gb!
❖
Planck achieves measurement speeds 18x –
291x faster than recent approaches!
❖
Shadow MAC addresses [1] and some ARP
tricks allow re-routing at < 3ms!
❖
See paper for more details
[1] Shadow MACs: Scalable Label-switching !
for Commodity Ethernet (HotSDN ’14 )
Measurement
> 10 ms
~100 µs
Control
Decision
23
Control Loop Times
❖
❖
Planck achieves measurement speeds 18x –
291x faster than recent approaches!
❖
Shadow MAC addresses [1] and some ARP
tricks allow re-routing at < 3ms!
❖
Planck: < 4.2 ms
100 ms — 1 sec+
< 3.5 ms to obtain sample + < 700µs to get
tput estimate = 4.2 ms worst-case
measurement time for 10 Gb!
See paper for more details
[1] Shadow MACs: Scalable Label-switching !
for Commodity Ethernet (HotSDN ’14 )
Measurement
< 3 ms
> 10 ms
~100 µs
Control
Decision
23
Planck as a Platform
❖
Vantage point mirroring!
❖
❖
tcpdump for switches!
Global view of the network!
❖
❖
SDN!
Controller
flow data across all links!
Traffic engineering!
❖
congested port notifications
Planck!
Collector!
Instance(s)
24
Outline
❖
Motivation!
❖
Planck Architecture!
❖
Is Planck Feasible?!
❖
Is Planck Useful?!
❖
Microbenchmarks!
❖
Traffic Engineering
25
Testbed
16 Host Fat Tree
❖
Split x4 IBM G8264 (48-port) switches into 20 sub-switches!
❖
Routing via Floodlight plugin inspired by FlowVisor!
❖
x3 server machines with x8 10 GbE NICs each!
❖
x16 machines with x2 10 GbE NICs
26
Methodology
Traffic Engineering!
❖
❖
Floodlight-based
module using Planck!
Collectors notify a
controller when ports
become congested
Workloads!
Routing!
❖
Shuffle!
❖
Static [1]!
❖
Stride!
❖
Poll-100 ms!
❖
Random!
❖
PlanckTE!
❖
Random Bijection
❖
Optimal
[1] PAST: Scalable Ethernet for Data Centers (CoNEXT ’12)
Optimal Topology
…
1
2
3
4
…
16
27
Traffic Engineering
Stride(8) 100 MiB Workload!
CDF of Flow Throughput
Static [1]
Optimal
28
Traffic Engineering
Stride(8) 100 MiB Workload!
CDF of Flow Throughput
Static [1]
Optimal
Poll 100ms
28
Traffic Engineering
Stride(8) 100 MiB Workload!
CDF of Flow Throughput
Static [1]
Optimal
Poll 100ms
PlanckTE
28
Traffic Engineering
Stride(8) Workload
29
Future Work
❖
❖
Planck should be able to go much faster!!
❖
Limit mirror port buffer!
❖
Truncation of samples!
❖
Improve re-routing time (via ARP improvements)!
Control loop of 100s of µs is possible
30
Conclusion
❖
Planck provides 1–2 orders of magnitude faster throughput measurements
over recent approaches (< 4.2 ms today and 100s of µs possible)!
❖
Planck provides a platform for low-latency measurement!
❖
Planck traffic engineering yields near optimal results even for small flows!
❖
Measurements at these speeds prompt a re-thinking of how networks are
managed
31