
High-Fidelity Latency Measurements
in Low-Latency Networks
Ramana Rao Kompella
Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)
Low Latency Applications
- Many important data center applications require low end-to-end latencies (microseconds)
  - High Performance Computing – lose parallelism
  - Cluster Computing, Storage – lose performance
  - Automated Trading – lose arbitrage opportunities
- Cloud applications
  - Recommendation Systems, Social Collaboration
  - All-up SLAs of 200ms [AlizadehSigcomm10]
  - The SLA covers backend computation time as well, so network latencies have little budget
Latency Measurements are Needed
[Figure: packet path through a ToR switch, an edge router, and core routers (latency on the order of 1 ms). Which router causes the problem? Measurement within a router is necessary.]
- At every router, high-fidelity measurements are critical to localize root causes
- Once the root cause is localized, operators can fix the problem by rerouting traffic, upgrading links, or performing detailed diagnosis
Vision: Knowledge Plane
[Figure: Knowledge plane vision for the data center network. Latency measurements are pushed into, or pulled by, a knowledge plane from the core, aggregation, and top-of-rack (ToR) layers; applications such as SLA diagnosis, routing/traffic engineering, and scheduling/job placement use a query/response interface on top of the knowledge plane.]
Contributions Thus Far…
- Aggregate Latency Estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated Latency Estimation (per-flow latency measurements at every hop)
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Delay Sketching – (under review at Sigcomm 2011)
- Scalable Query Interface (per-packet latency measurements)
  - MAPLE – (under review at Sigcomm 2011)
1) PER-FLOW MEASUREMENTS
WITH REFERENCE LATENCY
INTERPOLATION
[SIGCOMM 2010]
Obtaining Fine-Grained Measurements
- Native router support: SNMP, NetFlow
  - No latency measurements
- Active probes and tomography
  - Too many probes (~10,000 Hz) required, wasting bandwidth
- Expensive high-fidelity measurement boxes
  - London Stock Exchange uses Corvil boxes
  - Cannot place them ubiquitously
- Recent work: LDA [Kompella09Sigcomm]
  - Computes average latency/variance accurately within a switch
  - Provides a good start, but may not be sufficient to diagnose flow-specific problems
From Aggregates to Per-Flow
[Figure: delay vs. time at a switch queue. Within a measurement interval, some flows see large delays and others small delays, while only the average latency is reported.]
- Observation: there are significant differences in average latencies across flows at a router
- Goal of this paper: how to obtain per-flow latency measurements in a scalable fashion?
Measurement Model
[Figure: packets enter the router at ingress interface I and leave at egress interface E.]
- Assumption: time synchronization between router interfaces
- Constraint: cannot modify regular packets to carry timestamps
  - Would require intrusive changes to the router forwarding path
Naïve Approach
[Figure: two flows cross the router. Ingress I records a per-packet timestamp (10, 13, 15, 18) and egress E records the corresponding departure timestamps (20, 23, 27, 30); subtracting and summing per flow gives delay sums of 22 and 32, i.e., average delays of 22/2 = 11 and 32/2 = 16.]
- For each flow key (see the sketch below):
  - Store timestamps for each packet at I and E
  - After a flow stops sending, I sends the packet timestamps to E
  - E computes individual packet delays
  - E aggregates average latency, variance, etc. for each flow
- Problem: high communication costs
  - At 10 Gbps, a few million packets per second
  - Sampling reduces communication, but also reduces accuracy
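To make the bookkeeping concrete, here is a minimal sketch of the naïve scheme above; packet ids, flow keys, and timestamps are illustrative, and the point is that every ingress timestamp must be shipped to the egress.

```python
from collections import defaultdict

# Naive scheme (sketch): ingress I and egress E each record one timestamp per
# packet; I later ships its timestamps to E, which computes per-flow averages.
ingress_ts = {}   # packet_id -> timestamp recorded at ingress I
egress_ts = {}    # packet_id -> timestamp recorded at egress E
flow_of = {}      # packet_id -> flow key

def record_at_ingress(pkt_id, flow_key, ts):
    ingress_ts[pkt_id] = ts
    flow_of[pkt_id] = flow_key

def record_at_egress(pkt_id, ts):
    egress_ts[pkt_id] = ts

def per_flow_average_delay():
    """Run at E once I's timestamps have been communicated over."""
    delay_sum = defaultdict(float)
    count = defaultdict(int)
    for pkt_id, t_out in egress_ts.items():
        key = flow_of[pkt_id]
        delay_sum[key] += t_out - ingress_ts[pkt_id]   # per-packet delay
        count[key] += 1
    return {key: delay_sum[key] / count[key] for key in delay_sum}
```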
A (Naïve) Extension of LDA
[Figure: ingress I and egress E each maintain one LDA per flow of interest (packet counts and sums of timestamps), and the two sides must coordinate to compute per-flow latency.]
- Maintaining LDAs with many counters for flows of interest
- Problem: (potentially) high communication costs
  - Proportional to the number of flows
Key Observation: Delay Locality
[Figure: delay vs. time. Packets with delays D1, D2, D3 each sit inside a small time window with mean window delays WD1, WD2, WD3.]
True mean delay = (D1 + D2 + D3) / 3
Localized mean delay = (WD1 + WD2 + WD3) / 3
How close is the localized mean delay to the true mean delay as the window size varies?
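As a hedged illustration of the quantities in the figure, the sketch below computes localized (windowed) per-packet delays and the RMSRE metric quoted on the next slide; the window definition (packets within half a window of an arrival) is an assumption for illustration.

```python
def windowed_mean_delays(arrivals, delays, window):
    """WD_i (sketch): mean delay of packets whose arrival time lies within a
    small window around packet i's arrival. O(n^2), illustration only."""
    local = []
    for t in arrivals:
        in_window = [d for a, d in zip(arrivals, delays) if abs(a - t) <= window / 2]
        local.append(sum(in_window) / len(in_window))
    return local

def rmsre(true_values, estimates):
    """Root-mean-square relative error between estimates and true values."""
    rel = [(e - t) / t for t, e in zip(true_values, estimates)]
    return (sum(r * r for r in rel) / len(rel)) ** 0.5
```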
Key Observation: Delay Locality
[Plot: local mean delay per key vs. true mean delay per key (ms), compared with the global mean. RMSRE = 1.72 for a 1 s window, 0.16 for 10 ms, and 0.054 for 0.1 ms. Data sets from a real router and synthetic queueing models.]
Exploiting Delay Locality
[Figure: delay vs. time. Reference packets carrying an ingress timestamp are injected into the packet stream at the ingress.]
- Reference packets are injected regularly at the ingress I
  - Special packets carrying an ingress timestamp
  - Provide reference delay values (a substitute for window averages)
  - Used to approximate the latencies of regular packets
RLI Architecture
[Figure: RLI architecture. (1) A reference packet generator injects reference packets (carrying the ingress timestamp) into the stream at ingress I; (2) a latency estimator at egress E uses the left (L) and right (R) reference packets around each regular packet to estimate its latency.]
- Component 1: Reference packet generator
  - Injects reference packets regularly
- Component 2: Latency estimator
  - Estimates packet latencies and updates per-flow statistics
  - Estimates directly at the egress, with no extra state maintained at the ingress (reduces storage and communication overheads)
Component 1: Reference Packet Generator
- Question: when to inject a reference packet?
- Idea 1 (1-in-n): inject one reference packet every n packets
  - Problem: low accuracy under low utilization
- Idea 2 (1-in-τ): inject one reference packet every τ seconds
  - Problem: bad when short-term delay variance is high
- Our approach: dynamic injection based on utilization (see the sketch below)
  - High utilization → low injection rate
  - Low utilization → high injection rate
  - The adaptive scheme works better than fixed-rate schemes
  - Details in the paper
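The sketch below only illustrates the utilization-driven idea; the injection probability here simply falls linearly as utilization rises, whereas the actual adaptation rule is in the paper.

```python
import random

def injection_probability(utilization, p_max=0.01, p_min=0.0001):
    # Illustrative rule: inject aggressively when the link is idle and back
    # off as utilization grows, so reference traffic does not add congestion.
    # The paper's adaptive scheme differs in detail.
    u = min(max(utilization, 0.0), 1.0)
    return p_max - u * (p_max - p_min)

def maybe_inject_reference(utilization):
    """Decide, per transmission opportunity, whether to inject a reference packet."""
    return random.random() < injection_probability(utilization)
```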
Component 2: Latency Estimator
[Figure: delay vs. time. For the reference packets L (left) and R (right), both the arrival time and the delay are known at the egress; for a regular packet in between, only the arrival time is known. Its delay is estimated from the linear interpolation line between L and R, and the gap between the interpolated delay and the true delay is the estimation error.]
- Question 1: how to estimate latencies using reference packets?
- Solution: different estimators are possible
  - Use only the delay of the left reference packet (RLI-L)
  - Use linear interpolation between the left and right reference packets (RLI), as sketched below
  - Other non-linear estimators are possible (e.g., shrinkage)
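A minimal sketch of the interpolation estimator referenced above; the (arrival time, delay) pairs of the bracketing reference packets are assumed to be known at the egress.

```python
def interpolated_delay(t_pkt, left_ref, right_ref):
    """RLI estimate (sketch): read the regular packet's delay off the line
    between the left and right reference packets. left_ref and right_ref are
    (egress arrival time, delay) pairs; RLI-L would just return left_ref[1]."""
    (t_l, d_l), (t_r, d_r) = left_ref, right_ref
    if t_r == t_l:                         # degenerate case: references coincide
        return d_l
    alpha = (t_pkt - t_l) / (t_r - t_l)    # position between L and R
    return d_l + alpha * (d_r - d_l)
```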
Component 2: Latency Estimator
[Figure: packets between the left (L) and right (R) reference packets wait in an interpolation buffer; when the right reference packet arrives, their delays are estimated. A flow selection stage (any strategy) picks which flow keys to track, and for each tracked key three counters are updated: C1 (packet count), C2 (sum of delays), and C3 (sum of squared delays). When a flow is exported, Avg. latency = C2 / C1.]
- Question 2: how to compute per-flow latency statistics?
- Solution: maintain 3 counters per flow at the egress side (see the sketch below)
  - C1: number of packets
  - C2: sum of packet delays
  - C3: sum of squares of packet delays (for estimating variance)
  - To minimize state, any flow selection strategy can be used to maintain counters for only a subset of flows
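A sketch of the egress-side per-flow state implied by the three counters above; the variance at export time follows directly from C1-C3.

```python
from collections import defaultdict

# Per selected flow key: [C1 packet count, C2 sum of delays, C3 sum of squares].
flow_counters = defaultdict(lambda: [0, 0.0, 0.0])

def update_flow(flow_key, estimated_delay):
    c = flow_counters[flow_key]
    c[0] += 1
    c[1] += estimated_delay
    c[2] += estimated_delay * estimated_delay

def export_flow(flow_key):
    """Return (average latency, variance) when the flow is exported."""
    c1, c2, c3 = flow_counters.pop(flow_key)
    mean = c2 / c1
    return mean, c3 / c1 - mean * mean     # E[d^2] - (E[d])^2
```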
Experimental Setup
- Data sets
  - No public data center traces with timestamps
  - Real router traces with synthetic workloads: WISC
  - Real backbone traces with synthetic queueing: CHIC and SANJ
- Simulation tool: open-source NetFlow software – YAF
  - Supports the reference packet injection mechanism
  - Simulates a queueing model with a RED active queue management policy
- Experiments with different link utilizations
Accuracy under High Link Utilization
[Plot: CDF of relative error; the median relative error is 10-12%.]
Comparison with Other Solutions
[Plot: average relative error vs. utilization, with a packet sampling rate of 0.1%; RLI's error is 1-2 orders of magnitude lower.]
Overhead of RLI
- Bandwidth overhead is low
  - Less than 0.2% of link capacity
- Impact on packet loss is small
  - The packet loss difference with and without RLI is at most 0.001% at around 80% utilization
Summary
- A scalable architecture to obtain high-fidelity per-flow latency measurements between router interfaces
- Achieves a median relative error of 10-12%
- Obtains 1-2 orders of magnitude lower relative error compared to existing solutions
- Measurements are obtained directly at the egress side
Contributions Thus Far…
- Aggregate Latency Estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated Latency Estimation
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Virtual LDA – (under review at Sigcomm 2011)
- Scalable Query Interface
  - MAPLE – (under review at Sigcomm 2011)
2) SCALABLE PER-PACKET
LATENCY MEASUREMENT
ARCHITECTURE (UNDER
REVIEW AT SIGCOMM 2011)
MAPLE Motivation
- LDA and RLI are ossified in their aggregation level
  - Not suitable for obtaining arbitrary subpopulation statistics
  - A single packet's delay may be important
- Key goal: how to enable a flexible and scalable architecture for packet latencies?
MAPLE Architecture
[Figure: MAPLE architecture. At router A, a timestamp unit stamps packet P1 with time T1; at router B, the packet's latency D1 is recorded in (1) a packet latency store. A central monitor sends a query Q(P1) to (2) a query engine and receives the answer A(P1).]
- Timestamping is not strictly required
  - Can work with RLI-estimated latencies
Packet Latency Store (PLS)
- Challenge: how to store packet latencies in the most efficient manner?
- Naïve idea: hash tables do not scale well
  - At a minimum, require a label (32 bits) + timestamp (32 bits) per packet
  - To avoid collisions, need a large number of hash table entries (~147 bits/pkt for a collision rate of 1%)
- Can we do better?
Our Approach
- Idea 1: cluster packets
  - Typically only a few dominant delay values
  - Cluster packets into equivalence classes
  - Associate one delay value with each cluster
  - Choose cluster centers such that the error is small
- Idea 2: provision storage
  - Naïvely, we can use one Bloom filter per cluster (Partitioned Bloom Filter)
  - We propose a new data structure called the Shared-Vector Bloom Filter (SVBF) that is more efficient
Selecting Representative Delays
- Approach 1: logarithmic delay selection (sketched after this list)
  - Divide the delay range into logarithmic intervals
  - E.g., 0.1-10,000 μs → 0.1-1 μs, 1-10 μs, …
  - Simple to implement, bounded relative error, but accuracy may not be optimal
- Approach 2: dynamic clustering
  - k-means (k-medians) clustering formulation
  - Minimizes the average absolute error of packet latencies (minimizes the total Euclidean distance)
- Approach 3: hybrid clustering
  - Split centers equally between static and dynamic
  - Best of both worlds
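For the static (logarithmic) scheme, a sketch of mapping a measured delay to a representative value; the bucket layout follows the slide (0.1-1 μs, 1-10 μs, ...), while the choice of the geometric mean as the representative is an assumption for illustration.

```python
import math

def log_bucket_center(delay_us, lo=0.1, hi=10000.0):
    """Map a delay in microseconds to its logarithmic bucket and return that
    bucket's representative value (geometric mean of the bucket endpoints)."""
    d = min(max(delay_us, lo), hi)
    n_buckets = int(round(math.log10(hi / lo)))              # 5 decades -> 5 buckets
    idx = min(int(math.log10(d / lo) + 1e-9), n_buckets - 1) # which decade
    left = lo * 10 ** idx
    return math.sqrt(left * (left * 10))                     # sqrt(left * right)
```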
K-means
- Goal: determine k centers every measurement cycle
  - Can be formulated as a k-means clustering problem
- Problem 1: running k-means is typically hard
  - The basic algorithm has O(n^(k+1) log n) run time
  - Heuristics (Lloyd's algorithm) are also complicated in practice
  - Solution: sampling and streaming algorithms
    - Use sampling to reduce n to pn
    - Use a streaming k-medians algorithm (approximate, but sufficient)
- Problem 2: can't find centers and record membership at the same time
  - Solution: pipelined implementation
    - Use the previous interval's centers as an approximation for this interval
Streaming k-Medians [CharikarSTOC03]
[Figure: pipeline. In software, the packet stream is sampled (np packets in the i-th epoch), an online clustering stage produces O(k log(np)) intermediate centers, and an offline clustering stage reduces them to k centers available at the (i+1)-th epoch. In hardware, packets in the (i+2)-th epoch are recorded against these centers in the storage data structure, which is flushed to DRAM/SSD after every epoch for archival.]
Naïve: Partitioned BF (PBF)
[Figure: insertion and lookup in a Partitioned Bloom Filter. Insertion: the packet's latency is matched in parallel against the closest center (c1-c4), and the packet contents are hashed to set bits in that center's Bloom filter. Lookup: the packet contents are hashed and all Bloom filters are queried; the filter in which all probed bits are 1 (here c2) identifies the cluster.]
Problems with PBF
- Provisioning is hard
  - Cluster sizes are not known a priori
  - Over- or under-estimation of BF sizes
- Lookup complexity is higher
  - The data structure needs to be partitioned every cycle
  - Need to look up multiple random locations in the bitmap (based on the number of hash functions)
Shared-Vector Bloom Filter
[Figure: insertion and lookup in the Shared-Vector Bloom Filter. Insertion: the packet's latency is matched in parallel against the closest center; the packet contents are hashed to locate a position in the shared bit vector, and the bit at an offset equal to the matched center's id is set to 1. Lookup: the packet contents are hashed with the same hash functions (H1, H2, ...), the (# of centers)-wide words at those positions are bulk-read and ANDed, and the position of the surviving 1 bit is the center id (here c2).]
Comparing PBF and SVBF
- PBF
  − Lookup is not easily parallelizable
  − Provisioning is hard since the number of packets per BF is not known a priori
- SVBF
  + One Bloom filter is used
  + Burst reads at word length
- COMB [Hao10Infocom]
  + Single BF with groups of hash functions
  − More memory usage than SVBF, and burst reads are not possible
Comparing Storage Needs
Data Structure | # of Hash functions | Capacity (bits/entry) | Insertion | Lookup           | Note
HashTable      | 1                   | 147                   | 1         | 1                | Stores only the latency value (no label)
PBF            | 9                   | 12.8                  | 9         | 450              | Provisioning is hard (12.8 only if cardinality is known beforehand)
COMB           | 7                   | 12.8                  | 14        | 77               | (alternate combinations exist)
SVBF           | 9                   | 12.8                  | 9         | 27 (burst reads) | Provisioning is easy
For the same classification failure rate of 1% and 50 centers (k = 50)
Tie-Breaking Heuristic
- Bloom filters have false positives
  - Lookups involve a search across all BFs
  - So multiple BFs may return a match
- The tie-breaking heuristic returns the group with the highest cardinality (sketched below)
  - Store a counter per center recording the number of packets that matched that center (cluster cardinality)
  - Works well in practice (especially when the distribution of cluster sizes is skewed)
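A sketch of the heuristic, reusing the per-center cardinality counters (e.g., the cardinality list kept in the SVBF sketch earlier); the candidates are the centers returned by a multi-match lookup.

```python
def break_tie(candidate_centers, center_cardinality):
    """Return the candidate whose cluster has absorbed the most packets, or
    None for a no-match; a single candidate is returned unchanged."""
    if not candidate_centers:
        return None
    return max(candidate_centers, key=lambda c: center_cardinality[c])
```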
Estimation Accuracy
[Plot: CDF of the absolute error (μs) of per-packet latency estimates.]
Accuracy of Aggregates
[Plot: CDF of the relative error of aggregate latency estimates.]
MAPLE Architecture
[Figure: MAPLE architecture (recap). The central monitor sends a query Q(P1) to the query engine at routers A and B and receives the answer A(P1).]
Query Interface
- Assumption: the path of a packet is known
  - Possible to determine using forwarding tables
  - In OpenFlow-enabled networks, the controller has the information
- Query answer:
  - Latency estimate
  - Type: (1) Match, (2) Multi-Match, (3) No-Match
Query Bandwidth
- Query method 1: query using the packet hash
  - Hashed using invariant fields in the packet header
  - High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)
- Query method 2: query using the flow key and IP identifier (sketched below)
  - Supports range search to reduce query bandwidth overhead
  - Inserts: use the flow key and IPID for hashing
  - Queries: send a flow key and ranges of contiguous IPIDs
[Figure: a query message for flow f1 contains contiguous IPID blocks, e.g., [1, 5] and [20, 35].]
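A sketch contrasting the two query methods, assuming packets were inserted keyed by a hash of invariant header fields (method 1) or by (flow key, IPID) (method 2); the key encoding and the store object (e.g., the SVBF sketch earlier) are illustrative.

```python
def query_by_packet_hashes(store, packet_hashes):
    """Query method 1 (sketch): one lookup per packet hash."""
    return {h: store.lookup(h) for h in packet_hashes}

def query_by_ipid_range(store, flow_key, first_ipid, last_ipid):
    """Query method 2 (sketch): the monitor ships one (flow key, IPID range)
    message, e.g. (f1, [20, 35]); the per-packet keys are re-derived at the
    store, which is what compresses the query bandwidth."""
    answers = {}
    for ipid in range(first_ipid, last_ipid + 1):
        key = f"{flow_key}:{ipid}".encode()    # illustrative key encoding
        answers[ipid] = store.lookup(key)
    return answers
```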
Query Bandwidth Compression
[Plot: CDF of the compression ratio; the median per-flow compression reduces query bandwidth by 90%.]
Storage
- OC192 interface
  - 5 million packets per second
  - ~60 Mbits per second of latency records (at roughly 12.8 bits per packet with the SVBF; see the arithmetic below)
  - Assuming 10% utilization, 6 Mbits per second
- DRAM – 16 GB
  - ~40 minutes of packets
- SSD – 256 GB
  - ~10 hours – enough time for diagnosis
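The numbers above can be checked with back-of-the-envelope arithmetic; the 12.8 bits/packet figure is taken from the SVBF row of the storage table and is the only assumption here.

```python
PACKETS_PER_SEC = 5_000_000            # OC-192 interface (from the slide)
BITS_PER_PACKET = 12.8                 # SVBF capacity (from the storage table)

record_rate_bps = PACKETS_PER_SEC * BITS_PER_PACKET   # ~64 Mbit/s (~60 on the slide)
at_10pct_util = 0.10 * record_rate_bps                 # ~6 Mbit/s

dram_seconds = 16 * 8 * 1e9 / record_rate_bps          # 16 GB  -> ~2,000 s (~35-40 min)
ssd_seconds = 256 * 8 * 1e9 / record_rate_bps          # 256 GB -> ~32,000 s (~9-10 h)
```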
Summary
- RLI and LDA are ossified in their aggregation level
- Proposed MAPLE as a mechanism to compute measurements across arbitrary subpopulations
  - Relies on clustering dominant delay values
  - Novel SVBF data structure to reduce storage and lookup complexity
Conclusion
- Many applications demand low latencies
- Network operators need high-fidelity tools for latency measurements
- Proposed RLI for fine-grained per-flow measurements
- Proposed MAPLE to:
  - Store per-packet latencies in a scalable way
  - Compose latency aggregates across arbitrary subpopulations
- Many other solutions (papers on my web page)
Sponsors
- CNS – 1054788: NSF CAREER: Towards a Knowledge Plane for Data Center Networks
- CNS – 0831647: NSF NECO: Architectural Support for Fault Management
- Cisco Systems: Designing Router Primitives for Monitoring Network Health