High-Fidelity Latency Measurements in Low-Latency Networks
Ramana Rao Kompella, Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)
Stanford

Low Latency Applications
Many important data center applications require low (microsecond) end-to-end latencies:
- High Performance Computing: loses parallelism
- Cluster Computing, Storage: lose performance
- Automated Trading: loses arbitrage opportunities
- Cloud applications (Recommendation Systems, Social Collaboration): all-up SLAs of 200 ms [AlizadehSigcomm10] include backend computation time, leaving little budget for network latencies

Latency Measurements Are Needed
Which router along the path (ToR switch, edge router, core router) causes the problem? Measurement within a router is necessary: at every router, high-fidelity measurements are critical to localize root causes. Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis.

Vision: Knowledge Plane
A knowledge plane sits above the data center network (core, aggregation, and top-of-rack layers). Routers push, and the knowledge plane pulls, latency measurements; a query interface answers queries from applications such as SLA diagnosis, routing/traffic engineering, and scheduling/job placement.

Contributions Thus Far
Per-flow latency measurements at every hop, and per-packet latency measurements:
- Aggregate latency estimation: Lossy Difference Aggregator – Sigcomm 2009; FineComb – Sigmetrics 2011; mPlane – ReArch 2009
- Differentiated latency estimation: Multiflow Estimator – Infocom 2010; Reference Latency Interpolation – Sigcomm 2010; RLI across Routers – Hot-ICE 2011; Delay Sketching – (under review at Sigcomm 2011)
- Scalable query interface: MAPLE – (under review at Sigcomm 2011)

1) PER-FLOW MEASUREMENTS WITH REFERENCE
LATENCY INTERPOLATION [SIGCOMM 2010]

Obtaining Fine-Grained Measurements
- Native router support (SNMP, NetFlow): no latency measurements
- Active probes and tomography: too many probes (~10,000 Hz) required, wasting bandwidth
- Expensive high-fidelity measurement boxes: the London Stock Exchange uses Corvil boxes, but they cannot be placed ubiquitously
- Recent work, LDA [Kompella09Sigcomm]: computes average latency and variance accurately within a switch; a good start, but may not be sufficient to diagnose flow-specific problems

From Aggregates to Per-Flow
Observation: there is a significant difference in average latencies across flows at a router; within the same time interval, some flows through a queue see large delays and others small ones. Goal of this paper: how can per-flow latency measurements be obtained in a scalable fashion?

Measurement Model
A router with ingress interface I and egress interface E.
- Assumption: time synchronization between router interfaces
- Constraint: regular packets cannot be modified to carry timestamps, as that would intrusively change the forwarding path

Naïve Approach
Worked example from the slide: packet timestamps recorded at I and E yield per-packet delays, and per-flow delay sums of 22 and 32 over two packets each, i.e., Avg.
delay = 22/2 = 11 and Avg. delay = 32/2 = 16.

For each flow key:
- Store timestamps for each packet at I and E
- After a flow stops sending, I sends the packet timestamps to E
- E computes individual packet delays and aggregates average latency, variance, etc. for each flow
Problem: high communication costs. At 10 Gbps, this means a few million packets per second; sampling reduces communication, but also reduces accuracy.

A (Naïve) Extension of LDA
Maintain an LDA (packet counts and sums of timestamps) with many counters for the flows of interest, coordinated between ingress and egress, to obtain per-flow latency. Problem: (potentially) high communication costs, proportional to the number of flows.

Key Observation: Delay Locality
Compare the true mean delay (D1 + D2 + D3) / 3 over individual packet delays with the localized mean delay (WD1 + WD2 + WD3) / 3 over windowed delay averages. How close is the localized mean delay to the true mean delay as the window size varies? On data sets from a real router and synthetic queueing models, plotting local mean delay per key against true mean delay per key gives RMSRE = 1.72 for a 1 s window, 0.16 for 10 ms, and 0.054 for 0.1 ms: delay exhibits strong locality at small window sizes.

Exploiting Delay Locality
Reference packets, special packets carrying an ingress timestamp, are injected regularly at the ingress I. They provide reference delay values (a substitute for window averages) that are used to approximate the latencies of regular packets.

RLI Architecture
- Component 1: Reference packet generator. Injects reference packets regularly at the ingress.
- Component 2: Latency estimator. Estimates packet latencies and updates per-flow statistics directly at the egress, with no extra state maintained at the ingress side (reducing storage and communication overheads).

Component 1: Reference Packet Generator
Question: when should a reference packet be injected?
- Idea 1, 1-in-n: inject one reference packet every n packets. Problem: low accuracy under low utilization.
- Idea 2, 1-in-τ: inject one reference packet every τ seconds. Problem: bad when short-term delay variance is high.
- Our approach: dynamic injection based on utilization (high utilization, low injection rate; low utilization, high injection rate). The adaptive scheme works better than fixed-rate schemes; details are in the paper.

Component 2: Latency Estimator
Question 1: how can latencies be estimated using reference packets? For a regular packet only the arrival time is known; for a reference packet both arrival time and delay are known. Several estimators are possible:
- Use only the delay of the left reference packet (RLI-L)
- Use linear interpolation between the left and right reference packets (RLI); the gap between the interpolation line and a packet's true delay is its estimation error
- Other non-linear estimators are possible (e.g., shrinkage)

Question 2: how are per-flow latency statistics computed? Regular packets wait in an interpolation buffer until the right reference packet arrives; their delays (and squared delays) are then estimated and the per-flow counters updated, with any flow selection strategy deciding which flow keys are tracked. When a flow is exported, Avg.
latency = C2 / C1.

Solution: maintain three counters per flow at the egress side:
- C1: number of packets
- C2: sum of packet delays
- C3: sum of squares of packet delays (for estimating variance)
To minimize state, any flow selection strategy can be used to maintain counters for only a subset of flows.

Experimental Setup
- Data sets: no public data center traces with timestamps exist; we use real router traces with synthetic workloads (WISC) and real backbone traces with synthetic queueing (CHIC and SANJ)
- Simulation tool: YAF, an open-source NetFlow software, extended to support the reference packet injection mechanism and to simulate a queueing model with the RED active queue management policy
- Experiments run at different link utilizations

Accuracy under High Link Utilization
The median relative error is 10-12% (CDF of relative error).

Comparison with Other Solutions
At a packet sampling rate of 0.1%, RLI's average relative error is 1-2 orders of magnitude lower than other solutions' across utilizations.

Overhead of RLI
- Bandwidth overhead is low: less than 0.2% of link capacity
- Impact on packet loss is small: the difference in packet loss with and without RLI is at most 0.001% at around 80% utilization

Summary
RLI is a scalable architecture for obtaining high-fidelity per-flow latency measurements between router interfaces. It achieves a median relative error of 10-12%, 1-2 orders of magnitude lower than existing solutions, and obtains measurements directly at the egress side.

Contributions Thus Far
- Aggregate latency estimation: Lossy Difference Aggregator – Sigcomm 2009; FineComb – Sigmetrics 2011; mPlane – ReArch 2009
- Differentiated latency estimation: Multiflow Estimator – Infocom 2010; Reference Latency Interpolation – Sigcomm 2010; RLI across Routers – Hot-ICE 2011; Virtual LDA – (under review at Sigcomm 2011)
- Scalable query interface: MAPLE – (under review at Sigcomm 2011)

2) SCALABLE PER-PACKET LATENCY MEASUREMENT ARCHITECTURE (UNDER REVIEW
AT SIGCOMM 2011)

MAPLE Motivation
LDA and RLI are ossified at their aggregation level and are not suitable for obtaining arbitrary subpopulation statistics; even a single packet's delay may be important. Key goal: how can a flexible and scalable architecture for packet latencies be enabled?

MAPLE Architecture
At each router, a timestamp unit records packet timestamps and a packet latency store holds them. A central monitor sends a query Q(P1) for packet P1 to a router's query engine and receives an answer A(P1). Timestamping is not strictly required: MAPLE can also work with RLI-estimated latencies.

Packet Latency Store (PLS)
Challenge: how can packet latencies be stored in the most efficient manner? The naïve idea, a hash table, does not scale well: at a minimum it requires a label (32 bits) plus a timestamp (32 bits) per packet, and avoiding collisions requires a large number of entries (~147 bits/packet for a collision rate of 1%). Can we do better?

Our Approach
- Idea 1: cluster packets. There are typically a few dominant delay values, so cluster packets into equivalence classes, associate one delay value with each cluster, and choose cluster centers so that the error is small.
- Idea 2: provision storage. Naïvely, one Bloom filter per cluster (a Partitioned Bloom Filter); we propose a new data structure, the Shared-Vector Bloom Filter (SVBF), that is more efficient.

Selecting Representative Delays
- Approach 1: logarithmic delay selection. Divide the delay range into logarithmic intervals, e.g., 0.1-10,000 μs into 0.1-1 μs, 1-10 μs, and so on. Simple to implement with bounded relative error, but accuracy may not be optimal.
- Approach 2: dynamic clustering. A k-means (k-medians) formulation that minimizes the average absolute error of packet latencies (minimizes total Euclidean distance).
- Approach 3: hybrid clustering. Split the centers equally between static and dynamic selection: the best of both worlds.

K-means
Goal: determine k centers every measurement cycle; this can be formulated as a k-means clustering problem.
Problem 1: running k-means is typically hard. The basic algorithm has O(n^(k+1) log n) run time, and heuristics (Lloyd's
algorithm) are also complicated in practice. Solution: use sampling to reduce n to pn, and a streaming k-medians algorithm (approximate, but sufficient).
Problem 2: centers cannot be found and cluster membership recorded at the same time. Solution: a pipelined implementation that uses the previous interval's centers as an approximation for the current interval.

Streaming k-Medians [CharikarSTOC03]
Sampled packets (np packets in the i-th epoch) feed an online clustering stage in hardware that produces O(k log(np)) intermediate centers; an offline clustering stage in software reduces these to k centers for the (i+1)-th epoch. Packets in the (i+2)-th epoch are recorded against those centers in the storage data structure, which is flushed after every epoch to DRAM/SSD for archival.

Naïve: Partitioned Bloom Filter (PBF)
Insertion: the packet's latency is matched in parallel against the centers (c1, ..., c4) and bits are set, by hashing the packet contents, in the Bloom filter of the closest center. Lookup: the packet contents are queried against all Bloom filters; the filter in which all probed bits are 1 identifies the center.

Problems with PBF
- Provisioning is hard: cluster sizes are not known a priori, so Bloom filter sizes are over- or under-estimated
- Lookup complexity is high: the data structure must be re-partitioned every cycle, and lookups probe multiple random locations in the bitmap (one per hash function)

Shared-Vector Bloom Filter
Insertion: a bit position is located by hashing the packet contents, and the bit that is set is offset by the index of the matched (closest) center. Lookup: for each hash function, the group of bits at the hashed position (one bit per center) is read in bulk; ANDing the groups leaves the offset of the matching center id.

Comparing PBF and SVBF
- PBF: lookup is not easily parallelizable, and provisioning is hard since the number of packets per Bloom filter is not known a priori
- SVBF: a single Bloom filter is used, and reads burst at the length of a word
- COMB [Hao10Infocom]: a single Bloom filter with groups of hash functions, but more memory usage than SVBF and no burst reads

Comparing Storage
Needs

Data structure | # of hash functions | Capacity (bits/entry) | Insertion (accesses) | Lookup (accesses) | Note
HashTable     | 1 | 147  | 1  | 1                | stores only the latency value (no label)
PBF           | 9 | 12.8 | 9  | 450              | provisioning is hard (12.8 bits/entry only if cardinality is known beforehand)
COMB          | 7 | 12.8 | 14 | 77               | alternate combinations exist
SVBF          | 9 | 12.8 | 9  | 27 (burst reads) | provisioning is easy

All figures are for the same classification failure rate of 1% and k = 50 centers.

Tie-Breaking Heuristic
Bloom filters have false positives, and lookups search across all centers, so multiple centers may match. The tie-breaking heuristic returns the group with the highest cardinality: a counter per center stores the number of packets matched to it (the cluster cardinality). This works well in practice, especially for skewed delay distributions.

Estimation Accuracy
(Figure: CDF of absolute error in μs.)

Accuracy of Aggregates
(Figure: CDF of relative error.)

Query Interface
In the MAPLE architecture, the central monitor issues Q(P1) to a router's query engine and receives A(P1). Assumption: the path of a packet is known; it can be determined from forwarding tables, and in OpenFlow-enabled networks the controller has this information. A query answer carries a latency estimate and a type: (1) Match, (2) Multi-Match, or (3) No-Match.

Query Bandwidth
- Query method 1: query by packet hash, computed over invariant fields of the packet header. This costs high query bandwidth for aggregate latency statistics (e.g., flow-level latencies).
- Query method 2: query by flow key and IP identifier (IPID), supporting range search to reduce query bandwidth overhead. Inserts hash the flow key and IPID; a query sends a flow key with ranges of contiguous IPIDs, e.g., f1 with IPID blocks [1, 5] and [20, 35].

Query Bandwidth Compression
The median per-flow compression reduces query bandwidth by 90% (CDF of compression ratio).

Storage
An OC-192 interface carries about 5 million packets per second, which needs about 60 Mbits per second of storage (6 Mbits per second assuming 10% utilization). 16 GB of DRAM holds about 40 minutes of packets; a 256 GB SSD holds about 10 hours, enough time for diagnosis.
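The SVBF insertion and lookup described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the class name, the parameter choices, and the use of SHA-256 to derive hash positions are assumptions made for the example.

```python
import hashlib

class SVBF:
    """Sketch of a Shared-Vector Bloom Filter.

    One bit array is shared by all k delay clusters: each hash of a
    packet selects a base position, and the bit actually set is offset
    by the id of the packet's closest center. Lookup bulk-reads the
    k-bit group at every base position and ANDs them; a surviving bit
    index is the recovered center id.
    """

    def __init__(self, num_slots, num_centers, num_hashes):
        self.k = num_centers
        self.h = num_hashes          # must satisfy 4 * num_hashes <= 32
        self.slots = num_slots       # number of k-bit groups
        self.bits = [0] * (num_slots * num_centers)

    def _positions(self, key):
        # Derive the base positions from one SHA-256 digest (sketch only).
        digest = hashlib.sha256(key).digest()
        for i in range(self.h):
            word = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield (word % self.slots) * self.k

    def insert(self, key, center_id):
        # Set one bit per hash, offset by the matched center's id.
        for base in self._positions(key):
            self.bits[base + center_id] = 1

    def lookup(self, key):
        # AND the k-bit groups across all hash positions ("bulk read").
        vector = (1 << self.k) - 1
        for base in self._positions(key):
            group = 0
            for c in range(self.k):
                group |= self.bits[base + c] << c
            vector &= group
        # False positives can make several centers match (tie-breaking
        # by cluster cardinality would then pick among them).
        return [c for c in range(self.k) if (vector >> c) & 1]
```

One shared array replaces the per-cluster filters of PBF, which is why provisioning is easy and a lookup can read each group as a single word.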
Summary
RLI and LDA are ossified in their aggregation level; MAPLE is a mechanism to compute measurements across arbitrary subpopulations. It relies on clustering of the dominant delay values and on the novel SVBF data structure to reduce storage and lookup complexity.

Conclusion
Many applications demand low latencies, and network operators need high-fidelity tools for latency measurements. This talk proposed RLI for fine-grained per-flow measurements, and MAPLE to store per-packet latencies in a scalable way and compose latency aggregates across arbitrary subpopulations. Many other solutions appear in papers on my web page.

Sponsors
- CNS-1054788, NSF CAREER: Towards a Knowledge Plane for Data Center Networks
- CNS-0831647, NSF NECO: Architectural Support for Fault Management
- Cisco Systems: Designing Router Primitives for Monitoring Network Health
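As a closing illustration of the estimator at the heart of RLI (part 1), the following Python sketch shows linear interpolation between the left and right reference packets and the three per-flow counters C1-C3 maintained at the egress. The function and class names are illustrative assumptions, not code from the paper.

```python
def interpolate_delay(arrival, left_ref, right_ref):
    """Estimate a regular packet's delay by linear interpolation
    between the two reference packets bracketing its arrival.

    Each reference is a pair (arrival_time_at_egress, measured_delay);
    the RLI-L variant would simply return left_ref's delay."""
    (t_l, d_l), (t_r, d_r) = left_ref, right_ref
    if t_r == t_l:
        return d_l
    frac = (arrival - t_l) / (t_r - t_l)
    return d_l + frac * (d_r - d_l)


class FlowStats:
    """Per-flow counters kept at the egress: packet count (C1),
    sum of delays (C2), and sum of squared delays (C3)."""

    def __init__(self):
        self.c1 = 0
        self.c2 = 0.0
        self.c3 = 0.0

    def update(self, delay):
        self.c1 += 1
        self.c2 += delay
        self.c3 += delay * delay

    def mean(self):
        # Avg. latency = C2 / C1, as on the slides.
        return self.c2 / self.c1

    def variance(self):
        m = self.mean()
        return self.c3 / self.c1 - m * m
```

Because only these three counters are exported per flow, the egress never ships per-packet state, which is the source of RLI's low communication overhead.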