Pushing the Limits of Web Caching: Eliminating Redundant Network Traffic
Sunghwan Ihm, Software Engineer, Google
KRnet 2013

Disclaimer / References
• Disclaimer
– Materials and content are based on work at Princeton University / KAIST, not related to work at Google
• References
– [Ihm et al., USENIX’10] “Wide-area Network Acceleration for the Developing World”
– [Ihm et al., IMC’11] “Towards Understanding Modern Web Traffic”
– [Woo et al., MobiSys’13] “Comparison of Caching Strategies in Modern Cellular Backhaul Networks”
[KRnet 2013] Sunghwan Ihm, Google

Web Caching
(Diagram: users → local proxy cache → origin Web servers; cache misses still account for ~70% of bytes.)
• Avoid fetching cached objects
– Cache key: URL (+ cache expiration time)
• Cache hit: saves network bandwidth / RTT
– Typical cache byte hit rate: 20~30%

When Web Caching Is Broken (1/3)
• Aliasing: different URLs but the same content
– E.g., copies of common images and JavaScript frameworks/libraries
– http://foo/img.gif vs. http://bar/logo.gif

When Web Caching Is Broken (2/3)
• Uncacheable content: same URLs and content, but marked uncacheable
– E.g., intentional for tracking, or admin mistakes
HTTP/1.1 200 OK
Date: Tue, 18 Jun 2013 04:25:05 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1

When Web Caching Is Broken (3/3)
• Content update: same URLs but different (similar) content
– E.g., news, blogs, portals

Proposal #1: Object-Based Caching
• Cache key
– Content hash (e.g., SHA-1) of an object
– Think of it as an “extended/better” ETag
• Pros
– Addresses both aliasing and uncacheable content
• Cons
– Cannot address content updates
– Requires server support
– Does not work well with aborted transfers

Aborted Transfers [IMC’11]
• Users cancel ongoing transfers by
– Clicking the browser’s stop button
– Moving to another Web page
• A small number (1.8~3.1%) of requests are aborted, but the total volume is significant
– Until aborted: 12.4~30.8%
– If not aborted: 69.9~88.8%
– Mostly video previewing: flash video 40~70%

Problems with Aborted Transfers
1. Discard: low cache hit rate
2. Continue downloading fully: wastes network bandwidth if never referenced again
3. Choose 1 or 2 depending on the # of bytes remaining: parameter tuning
4. Range requests: cacheable objects only

Proposal #2: Prefix-Based Caching
• Cache key
– Content hash of the first N bytes (e.g., 1KB~1MB) of an object + content-length
• Pros
– “Mostly” works for aliasing and uncacheable content
– Transparent: no server support is required
• Cons
– Cannot address content updates
– Does not work well with aborted transfers
– False positives: same prefix key but different content

False Positives [MobiSys’13]
(Chart omitted.)

Proposal #3: Chunk-Based Caching (1/2)
• Cache chunks, not objects
– An object consists of a sequence of chunks
– Cache key: content hash of a “chunk”
• Pros
– Addresses aliasing, uncacheable content, and content updates
– No false positives
– Works well with aborted transfers
• Bonus
– Protocol-independent, not just for the Web

Proposal #3: Chunk-Based Caching (2/2)
• Cons
– Engineering complexity: the topic of this tutorial
• A form of “Redundancy Elimination” (RE)
– Content-addressable caching
– Chunk-based caching
– De-duplication or “dedup”

Tutorial Outline
• Introduction
• Part 1: Redundancy Elimination 101
– Basic Architecture
– Building Blocks
– Real World Measurement Results [IMC’11, MobiSys’13]
• Part 2: Building RE Systems
– RE System Components and Performance
– RE for Fast Network [MobiSys’13]
– RE for Slow Network [USENIX’10]
• Recap

Part 1: Redundancy Elimination 101
• Basic
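No code appears on the slides; as a minimal illustration of the two hash-keyed proposals above (Object-Based and Prefix-Based Caching), the sketch below derives both kinds of cache key with Python's hashlib. The function names and the 1KB default prefix length are illustrative choices, not details from the talk.

```python
import hashlib

def object_cache_key(body: bytes) -> str:
    # Proposal #1: key on a hash of the full content, so aliased URLs
    # that serve identical bytes share a single cache entry.
    return hashlib.sha1(body).hexdigest()

def prefix_cache_key(body: bytes, content_length: int, n: int = 1024) -> str:
    # Proposal #2: key on the first n bytes plus the Content-Length, so the
    # key can be computed before the whole object has been transferred.
    return "%s:%d" % (hashlib.sha1(body[:n]).hexdigest(), content_length)

# The same bytes fetched under two different URLs produce the same keys,
# which is exactly the aliasing case that URL-keyed caching misses.
logo = b"GIF89a" + bytes(4096)
copy = bytes(logo)
assert object_cache_key(logo) == object_cache_key(copy)
assert prefix_cache_key(logo, len(logo)) == prefix_cache_key(copy, len(copy))
```

Note the prefix key's false-positive risk from the slides: two objects that agree on the first n bytes and on total length collide under Proposal #2; only a full-content key (Proposal #1) or per-chunk hashes (Proposal #3) rule that out.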
Architecture
• Building Blocks
• Real World Measurement Results [IMC’11, MobiSys’13]

Basic Architecture: Two Proxies with Shared Cache
(Diagram: a client-side proxy and a server-side proxy, each holding a shared-content cache, sit between users and the origin.)

How It Works
• Server-side proxy
1. Users send requests
2. Gets content from the origin Web server
3. Generates chunks
4. Sends cached chunk names + uncached raw content (compression)
• Client-side proxy
1. Fetches chunk contents from local disk (cache hits)
2. Requests any cache misses from the server-side node
3. Sends data to users as it is assembled (cut-through)

Building Blocks: How To Generate Chunks
(Diagram: an object is (1) split at detected chunk boundaries into chunks A–E, and (2) each chunk is named and recorded in the chunk index and chunk storage.)

Fixed-Sized Chunk Boundaries
• Split every N bytes (e.g., N=64B~1MB)
• Pros: simple, cheap to compute (just a byte offset)
• Cons: vulnerable to local changes — a small insertion shifts all later content, changing every subsequent chunk

Variable-Sized Chunk Boundaries (1/3)
• Split based on “content” (a.k.a. content-based chunking) [LBFS, SOSP’01]
– Compute a function f over a sliding window of content iteratively
– Declare a chunk boundary wherever f(window) = K

Variable-Sized Chunk Boundaries (2/3)
• Computing f can be inexpensive
– Rabin’s fingerprinting (Michael O. Rabin, Turing Award winner, 1976)
– f(i+1) can be computed iteratively from f(i) with a subtraction, a multiplication, an addition, and a mask
– “Dynamic programming”
• Controlling the expected average chunk size
– Match only the low-order n bits: 2^n bytes
– “Sampling”

Variable-Sized Chunk Boundaries (3/3)
• Pros: robust to local changes, “shift-resistant” — an edit only affects the chunk(s) containing it
• Cons: relatively expensive to compute

Naming Chunks (1/2)
• Chunk name = crypto-hash(chunk content)
– Most commonly SHA-1
• Pros
– Practically globally unique: collision rate < hardware error rate
– Can deal with an unsynchronized cache
– Self-verifiable
• Cons
– Relatively expensive to compute
– Reference size overhead (e.g., 20 bytes for SHA-1)

Naming Chunks (2/2)
• Optimization with a synchronized cache
– Global uniqueness is no longer necessary
– Can send a more compact reference
– E.g., an offset into chunk storage: ~4 bytes
• Cons
– Not self-verifiable
– The cache can easily go out of sync in practice

Other Variants: Packet-based RE (1/3)
• Cache “packets”, not chunks
– Packet storage, not chunk storage
• Index by “fingerprints”, not chunk names
– Fingerprints == chunk boundaries
• Expand left and right from the matched fingerprint
– Reference: fingerprint (offset) + byte ranges

Other Variants: Packet-based RE (2/3)
(Diagram: incoming packets are matched against cached packets via the fingerprint index.)

Other Variants: Packet-based RE (3/3)
• Pros
– Cheap: no chunks, no SHA-1 computation
– Works for any protocol (not just TCP)
• Cons
– Small cache size (memory only)
– Limited cache hit rate
– Reference size overhead
– Assumes a synchronized cache

Real World Measurement Results
1.
CoDeeN CDN Traffic [IMC’11]
– http://codeen.cs.princeton.edu
– A semi-open, globally distributed proxy on 500+ PlanetLab nodes
– Running since 2003 / 30+ million requests per day
– Full traffic dump over a one-month period
2. 3G Mobile Traffic [MobiSys’13]
– One of the largest ISPs in South Korea
– 12.5 million 3G subscribers
– Full traffic monitoring over a one-week period

CoDeeN Traffic: Bandwidth Savings
• HTTP object-based: 17~28%
• Chunk-based: 42~51% with 128-byte chunks
– 1.8~2.5x the savings of object-based caching

CoDeeN Traffic: Origins of Redundancy
(Chart: origins of redundancy (%) per content type — html, javascript, xml, css, images, audio, video, octet-stream, and all — broken into object-hit, intra-URL, inter-URL aliasing, content updates, and aborted transfers; US trace, 128-byte chunks.)

3G Traffic: Bandwidth Savings
(Chart: savings over one week — peaking Sunday night, dipping at rush hour; ~1.5x overall.)
• 4KB fixed-sized chunking with a 512GB cache
– Why not variable-sized chunking? See Part 2
• Many other results in the paper

3G Traffic: Small Amount of Encrypted Traffic (HTTPS)
(Chart omitted.)

Part 1: Summary
• Basic Architecture
– Two proxies with shared cache
• Building Blocks
– Fixed vs. variable-sized chunk boundaries
– Naming chunks with SHA-1
– Packet-based RE
• Real World Measurement Results
– 1.5~2x more bandwidth savings than Web caching

Part 2: Building RE Systems
• RE System Components and Performance
• RE for Fast Network [MobiSys’13]
• RE for Slow Network [USENIX’10]

RE Middlebox “System”: Input-Process-Output
(Diagram: data flows in, is processed into chunks, and flows out.)

RE System Components
1. Network: the RE target link (WAN/LAN)
2. CPU: chunk generation
– Chunk boundary detection (e.g., Rabin’s fingerprinting)
– Chunk naming (e.g., SHA-1 hash)
3. Disk: chunk storage
– Can be cached in memory to reduce disk accesses
4. Memory: chunk index
– Chunk name, offset, length, etc.
– Partially or completely kept in memory to minimize disk accesses

End-To-End Performance
• Bandwidth “usage” reduction alone is not enough
• “Effective” bandwidth (or throughput)
– Original data size / processing time
– Should be larger than or equal to the original link bandwidth
• Goal: minimize “processing time” for large effective bandwidth
– Original data size is fixed

Processing Time Breakdown
1. Chunk generation (content-fingerprinting) time
– CPU-bound
– Depends on the chunk generation method (e.g., chunk-based vs. packet-based / variable-sized vs. fixed-sized)
2. Network transfer time
– Network-bound
– Depends on the amount of bandwidth savings
3. Reconstruction time
– Memory- and disk-bound
– Depends on how chunks are located and read from cache storage

RE System Designs for Two Extreme Environments
• RE for fast network
– E.g., 10Gbps ISP core network
– Main bottlenecks: (1) chunk generation time and (3) reconstruction time
• RE for slow network
– E.g., 1Mbps satellite internet in the developing world
– Main bottleneck: (2) network transfer time

RE System Design for Fast Network [MobiSys’13]
• How “fast” is it?
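The effective-bandwidth criterion above is simple arithmetic; the sketch below works one hypothetical example (all numbers invented for illustration, not measurements from the talk).

```python
def effective_bandwidth_bps(original_bytes, chunking_s, transfer_s, reconstruction_s):
    """'Effective' bandwidth per the slides: original data size divided by
    total processing time (chunk generation + transfer + reconstruction)."""
    return original_bytes * 8 / (chunking_s + transfer_s + reconstruction_s)

# Illustrative numbers: 100 MB of traffic with 50% bandwidth savings on a
# 1 Gbps link, plus CPU time for chunk generation and memory/disk time for
# reconstruction.
link_bps = 1e9
original_bytes = 100e6
transfer_s = (original_bytes * 0.5 * 8) / link_bps  # only non-redundant bytes cross the link
chunking_s = 0.1
reconstruction_s = 0.2

eff = effective_bandwidth_bps(original_bytes, chunking_s, transfer_s, reconstruction_s)
# The RE box helps only if effective bandwidth >= the raw link bandwidth;
# here 800 Mbit / 0.7 s is about 1.14 Gbps, so it clears the bar.
assert eff >= link_bps
```

If either non-network term grows (say, reconstruction_s = 1.0), the effective bandwidth falls below the link rate and the RE box becomes the bottleneck despite saving bytes — which is why the processing time is broken into its three parts on the next slide.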
– 270K concurrent flows
– 1.3M new flows per minute
– 7.5Gbps peak per-minute bandwidth usage
• Goal #1: optimize “chunk generation time”
– Exploit multicore CPUs
– Favor a cheap chunk generation method
• Goal #2: optimize “reconstruction time”
– Exploit multiple disks in parallel
– Use RAM/SSD instead

Monbot: Flow Monitoring System
(Diagram: packet capture feeds chunk generation / indexing, which feeds chunk storage.)

Memory I/O and CPU Throughput
(Chart: fixed-sized chunking throughput on a single machine with 12 cores.)

Symmetric RSS (1/2)
• RSS (Receive-Side Scaling)
– Distributes packets into multiple RX queues
– Toeplitz hash T over the five-tuple (src/dest IPs, ports, and protocol) and a random seed S
• Problem
– T(src->dest, S) != T(dest->src, S)
– Packets of the same connection land in two different RX queues, requiring a lock for access
• Solution [details in the paper]
– Craft S so that T(src->dest, S) == T(dest->src, S)

Symmetric RSS (2/2)
(Chart omitted.)

Other RE Systems for Fast Network
• Packet-based RE systems for “enterprise” networks
– PacketCache [SIGCOMM’08]: RE on routers
– SmartRE [SIGCOMM’09]: coordinated RE
– EndRE [NSDI’10]: RE on end-hosts
• Disk bottleneck
– Packet storage: packets kept entirely in RAM
• CPU bottleneck
– No chunks, no SHA-1: fingerprints + byte ranges

RE System Design for Slow Network [USENIX’10]
• How “slow” is it?
– Zambia example [Johnson et al., NSDR’10]
– 300 people share a 1Mbps satellite link
– $1200 per month
• Goal: optimize “network transfer time”
– Maximize bandwidth savings by exploiting non-bottleneck CPU and disks

Developing World Challenges/Opportunities
• Enterprise
– Dedicated machine with ample RAM
– High-RPM SCSI disk
– Inter-office content only
– Star topology
• Developing World
– Shared machine with limited RAM
– Slow disk
– All content
– Mesh topology
• Result: enterprise-style designs perform poorly here!

Wanax: High-Performance WAN Accelerator
• Design Goals
– Maximize compression rate
– Minimize memory pressure
– Maximize disk performance
– Exploit local resources
• Contributions
– Multi-Resolution Chunking (MRC)
– Peering
– Intelligent Load Shedding (ILS)

Single Resolution Chunking (SRC) Tradeoffs
• Small chunks: high compression rate (93.75% saving in the example), but 15 disk reads and 15 index entries — high memory pressure, low disk performance
• Large chunks: low compression rate (75% saving), but only 3 disk reads and 3 index entries — low memory pressure, high disk performance

Multi-Resolution Chunking (MRC)
• Use multiple chunk sizes simultaneously
– 93.75% saving with only 6 disk reads and 6 index entries in the example
• Large chunks for low memory pressure and few disk seeks
• Small chunks for high compression rate

Generating MRC Chunks
• Detect the smallest chunk boundaries first
• Larger chunks are generated by matching more bits of the detected boundaries
• Clean chunk alignment + less CPU

Storing MRC Chunks
• Store every chunk regardless of content overlaps
– No association among chunks
– One index entry load + one disk seek
– Reduces memory pressure and disk seeks
– Disk space is cheap, disk seeks are limited
*Alternative storage options in the
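The "match more bits for larger chunks" rule for generating MRC boundaries can be sketched as below. The fingerprint values are random stand-ins for a content-fingerprinting scheme such as Rabin's, and the bit widths are illustrative, not Wanax's actual parameters.

```python
import random

def mrc_boundaries(fingerprints, base_bits=8, levels=3):
    """Per-resolution boundary lists: a level-k boundary requires base_bits+k
    low-order fingerprint bits to match, so each coarser level's boundaries
    are a subset of the finer level's -- chunks align cleanly across sizes."""
    per_level = []
    for k in range(levels):
        mask = (1 << (base_bits + k)) - 1
        per_level.append([i for i, fp in enumerate(fingerprints)
                          if fp & mask == mask])
    return per_level

# Stand-in fingerprints, one per byte offset of a hypothetical object.
rnd = random.Random(7)
fps = [rnd.getrandbits(32) for _ in range(100000)]
small, medium, large = mrc_boundaries(fps)

# Every coarse boundary is also a fine boundary, and coarse ones are rarer.
assert set(large) <= set(medium) <= set(small)
```

Because a coarse boundary is always also a fine boundary, the smallest chunks can be detected first and larger chunks formed by keeping only the boundaries that match more bits — the clean-alignment, low-CPU property the Generating MRC Chunks slide describes.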
paper.

Peering
• Cheap or free local networks (e.g., wireless mesh, WiLDNet)
– Multiple caches in the same region
– Extra memory and disk
• Use Highest Random Weight (HRW) hashing
– Robust to node churn
– Scalable: no broadcasts or digest exchanges
– Trades CPU cycles for a low memory footprint

Intelligent Load Shedding (ILS)
• Two sources for fetching chunks
– Cache hits from disk
– Cache misses from the network
• Fetching chunks from disk is not always desirable
– Disk heavily loaded (shared/slow)
– Network underutilized
• Solution: adjust network and disk usage dynamically to maximize throughput

Shedding Smallest Chunks
(Diagram: fetch timelines for the disk, network, and peer sources.)
• Total latency ≈ max(disk time, network time)
• Move the smallest chunks from disk to network, until the network becomes the bottleneck
– Disk: one seek regardless of chunk size
– Network: cost proportional to chunk size

Bandwidth Savings
• SRC: high metadata overhead
• MRC: close to ideal bandwidth savings

Disk Fetch Cost
• MRC greatly reduces the # of disk seeks
• 22.7x fewer at 32-byte chunks

Chunk Size Breakdown
• SRC: high metadata overhead
• MRC: far fewer disk seeks and index entries (40% of bytes handled by large chunks)

Microbenchmark
(Chart: SRC vs. MRC across BASE, PEER, and ILS configurations; no tuning required.)
• 90% similar 1 MB files
• 512kbps / 200ms RTT

Realistic Traffic: YouTube
(Chart: performance for the high-quality (490) and low-quality (320) clips; higher is better.)
• Classroom scenario: 100 clients download an 18MB clip
• 1Mbps / 1000ms RTT

Part 2: Summary
• RE System Components and Performance
– Interaction of CPU, network, disk, and memory
– Processing time matters
• RE System Design for Fast Network [MobiSys’13]
– Requires scalable flow management
– Uses RAM/SSD for chunk storage
– Favors a cheap chunk generation method
• RE System Design for Slow Network [USENIX’10]
– Exploits non-bottleneck resources: CPU and disks

Tutorial Recap
• Conventional Web caching is limited
– Aliasing
– Uncacheable content
– Content update
• RE is a powerful technique for reducing bandwidth consumption
– Not just for the Web
– E.g., storage de-duplication systems
• RE is not a panacea: different environments require different system designs

Thank You
• [email protected]
• http://www.cs.princeton.edu/~sihm
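As a concrete footnote to the Symmetric RSS slides in Part 2: one known way to craft a seed S with T(src->dest, S) == T(dest->src, S) — not necessarily the paper's exact construction — is to repeat a 16-bit pattern, since swapping the 32-bit IPs moves each field by 32 bits, swapping the 16-bit ports moves each by 16 bits, and key windows 16 bits apart are then identical. The sketch below demonstrates the property; the key pattern and addresses are arbitrary.

```python
def toeplitz(key: bytes, data: bytes) -> int:
    """Standard RSS Toeplitz hash: for every set bit of the input, XOR in
    the 32-bit window of the key beginning at that bit position."""
    keyval = int.from_bytes(key, "big")
    keybits = len(key) * 8
    result = 0
    for bitpos in range(len(data) * 8):
        if data[bitpos // 8] & (0x80 >> (bitpos % 8)):
            result ^= (keyval >> (keybits - 32 - bitpos)) & 0xFFFFFFFF
    return result

# A 40-byte RSS key built from a repeating 16-bit pattern (pattern is arbitrary).
KEY = bytes([0x6D, 0x5A]) * 20

src = bytes([10, 0, 0, 1])
dst = bytes([192, 168, 0, 2])
sport = (5000).to_bytes(2, "big")
dport = (80).to_bytes(2, "big")

fwd = toeplitz(KEY, src + dst + sport + dport)
rev = toeplitz(KEY, dst + src + dport + sport)
assert fwd == rev  # both directions of the flow pick the same RX queue
```

With an ordinary random seed the two directions generally hash to different RX queues, forcing locked access to per-flow state; the symmetric seed removes that lock, which is the point of the Symmetric RSS slides.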