Pushing the Limits of Web Caching:
Eliminating Redundant Network Traffic
Sunghwan Ihm
Software Engineer, Google
KRnet 2013
Disclaimer / References
• Disclaimer
– Materials and content based on work at Princeton
University / KAIST, not related to work at Google
• References
– [Ihm et al., USENIX’10] “Wide-area Network
Acceleration for the Developing World”
– [Ihm et al., IMC’11] “Towards Understanding Modern
Web Traffic”
– [Woo et al., MobiSys’13] “Comparison of Caching
Strategies in Modern Cellular Backhaul Networks”
[KRnet 2013] Sunghwan Ihm, Google
2
Web Caching
(Diagram: Users → Local Proxy Cache → Origin Web Servers; cache hits are served from the proxy, while cache misses, roughly 70% of bytes, go to the origin.)
• Avoid fetching cached objects
– Cache key: URL (+ cache expiration time)
• Cache hit: save network bandwidth / RTT
– Typical cache byte hit rate: 20~30%
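The caching logic on this slide fits in a few lines; here is a toy in-memory sketch (class and method names are mine, not from the talk):

```python
import time

class UrlCache:
    """Toy Web cache: key = URL, honoring each object's expiration time."""

    def __init__(self):
        self.store = {}  # url -> (expires_at, body)

    def put(self, url, body, max_age_s, now=None):
        now = time.time() if now is None else now
        self.store[url] = (now + max_age_s, body)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry and entry[0] > now:
            return entry[1]   # cache hit: no origin fetch, no RTT
        return None           # miss or expired: must fetch from the origin
```

A hit returns the body with no network traffic; a missing or expired entry forces an origin fetch, which is what limits the byte hit rate to the 20~30% above.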
When Web Caching Is Broken (1/3)
• Aliasing: different URLs but same content
– E.g., copy of common images and javascript
framework/libraries
http://foo/img.gif
http://bar/logo.gif
When Web Caching Is Broken (2/3)
• Uncacheable content: same URLs and
content, but marked uncacheable
– E.g., intentional for tracking, or admin mistakes
HTTP/1.1 200 OK
Date: Tue, 18 Jun 2013 04:25:05 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
When Web Caching Is Broken (3/3)
• Content update: same URLs but different
(similar) content
– E.g., news, blogs, portal
Proposal #1:
Object Based Caching
• Cache key
– Content hash (e.g., SHA-1) of an object
– Think of it as an “extended/better” ETag
• Pros
– Addresses both aliasing and uncacheable content
• Cons
– Cannot address content update
– Requires server support
– Does not work well with aborted transfers
Aborted Transfers [IMC’11]
• Users cancel ongoing transfers by
– Clicking the stop button of the browser
– Moving to another Web page
• A small number (1.8~3.1%) of requests are
aborted, but total volume is significant
– Until aborted: 12.4~30.8%
– If not aborted: 69.9~88.8%
– Mostly video previewing: flash video 40~70%
Problems with Aborted Transfers
1. Discard: low cache hit rate
2. Continue fully downloading: waste of
network bandwidth if not referenced
3. Choose 1 or 2 depending on # of bytes
remaining: parameter tuning
4. Range request: cacheable objects only
Proposal #2:
Prefix Based Caching
• Cache key
– Content hash of the first N bytes (e.g., 1KB~1MB) of
an object + content-length
• Pros
– “Mostly” works for aliasing and uncacheable content
– Transparent: no server support is required
• Cons
– Cannot address content update
– Does not work well with aborted transfers
– False-positives: same prefix key but different content
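A minimal sketch of such a prefix key, including the false-positive hazard the last bullet warns about (the function name and the 1KB prefix length here are illustrative):

```python
import hashlib

def prefix_key(obj: bytes, content_length: int, n: int = 1024):
    """Prefix-based cache key: content hash of the first n bytes of the
    object, combined with its declared Content-Length."""
    return (hashlib.sha1(obj[:n]).hexdigest(), content_length)

# False positive: two different objects that share their first n bytes
# and have the same length map to the same cache key.
a = b"A" * 1024 + b"tail-one"
b = b"A" * 1024 + b"tail-two"
```

Here `prefix_key(a, len(a)) == prefix_key(b, len(b))` even though the objects differ, which is exactly the false-positive case above.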
False-Positives [MobiSys’13]
Proposal #3:
Chunk Based Caching (1/2)
• Cache chunks, not objects
– An object consists of a sequence of chunks
– Cache key: content hash of a “chunk”
• Pros
– Addresses aliasing, uncacheable content, and content
update
– No false positives
– Works well with aborted transfers
• Bonus
– Protocol-independent, not just for Web
Proposal #3:
Chunk Based Caching (2/2)
• Cons
– Engineering complexity: topic of this tutorial
• A form of “Redundancy Elimination”
– Content-addressable caching
– Chunk-based caching
– De-duplication or “dedup”
Tutorial Outline
• Introduction
• Part 1: Redundancy Elimination 101
– Basic Architecture
– Building Blocks
– Real World Measurement Results [IMC’11,
MobiSys’13]
• Part 2: Building RE Systems
– RE System Components and Performance
– RE for Fast Network [MobiSys’13]
– RE for Slow Network [USENIX’10]
• Recap
Part 1: Redundancy Elimination 101
• Basic Architecture
• Building Blocks
• Real World Measurement Results [IMC’11,
MobiSys’13]
Basic Architecture:
Two Proxies with Shared Cache
(Diagram: client-side proxy ↔ server-side proxy, each acting as a local proxy with a shared cache.)
How It Works
Server-side proxy:
1. Users send requests
2. Get content from the Web server
3. Generate chunks
4. Send cached chunk names + uncached raw content (compression)

Client-side proxy:
1. Fetch chunk contents from local disk (cache hits)
2. Request any cache misses from the server-side node
3. As the object is assembled, send it to users (cut-through)

(Diagram: Users ↔ Client-Side proxy ↔ Server-Side proxy ↔ Origin Web Server)
Building Blocks:
How To Generate Chunks
(1) Detecting chunk boundaries: the object is split into chunks A, B, C, D, E
(2) Naming chunks: each chunk's name is inserted into the chunk index, and its content into the chunk storage
Fixed Sized Chunk Boundaries
• Split every N bytes (e.g., N=64B~1MB)
• Pros: simple, cheap to compute (just byte
offset)
• Cons: vulnerable to local changes
(Diagram: identical copies of an object share chunks c1…c6, but after a local change in c3, every boundary from the edit onward shifts, so the chunks become c3', c4', c5', c6', …)
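A toy sketch of fixed-size chunking and of the local-change weakness (names and sizes are mine):

```python
def fixed_chunks(data: bytes, n: int = 4096):
    """Fixed-size chunking: split every n bytes.  A boundary is just a
    byte offset, so it is trivially cheap to compute."""
    return [data[i:i + n] for i in range(0, len(data), n)]

# The weakness: inserting a single byte shifts every later boundary,
# so the chunk containing the edit and all chunks after it change.
orig = bytes(range(10)) * 100           # 1000 bytes of sample data
edited = orig[:5] + b"\xff" + orig[5:]  # one byte inserted near the front
```

Running `fixed_chunks` on `orig` and `edited` shows the cascade: from the edited chunk onward, no chunk matches its counterpart.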
Variable-Sized Chunk Boundaries (1/3)
(Diagram: a fixed-size window slides over the byte stream “0A 1B 2C 3D 4E 5F …”; at each position, test whether f(window) = K.)
• Split based on “content” (a.k.a. content-based
chunking) [LBFS, SOSP’01]
– Compute a function f over a sliding window of
content iteratively
– Declare a chunk boundary wherever f(window) = K
Variable-Sized Chunk Boundaries (2/3)
• Computing f can be inexpensive
– Rabin's fingerprinting (Michael O. Rabin, Turing Award winner, 1976)
– f(i+1) can be computed incrementally from f(i) with a subtraction, a multiplication, an addition, and a mask
– “Dynamic programming”
• Controlling the expected average chunk size
– Match the low-order n bits only: expected chunk size 2^n bytes
– “Sampling”
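The two slides above can be sketched as runnable code. This uses a simple Rabin-Karp-style rolling hash as a stand-in for true Rabin fingerprinting, and tests the low-order bits against zero (i.e., K = 0); window size and mask are illustrative:

```python
import random

def cdc_offsets(data: bytes, window: int = 48, mask: int = (1 << 11) - 1):
    """Content-defined chunk boundaries: maintain a rolling hash f of the
    last `window` bytes incrementally (one subtraction, one multiplication,
    one addition, one mask per byte) and declare a boundary wherever the
    low-order bits of f are all zero.  Matching n low-order bits yields an
    expected average chunk size of 2^n bytes."""
    MOD, PRIME = 1 << 32, 1000003
    pw = pow(PRIME, window - 1, MOD)   # weight of the byte leaving the window
    f, bounds = 0, []
    for i, b in enumerate(data):
        if i >= window:
            f = (f - data[i - window] * pw) % MOD  # drop the outgoing byte
        f = (f * PRIME + b) % MOD                  # add the incoming byte
        if i >= window - 1 and (f & mask) == 0:
            bounds.append(i + 1)
    return bounds + [len(data)]

def cdc_chunks(data: bytes, **kw):
    out, start = [], 0
    for end in cdc_offsets(data, **kw):
        out.append(data[start:end])
        start = end
    return out

# Demo: insert one byte mid-stream.  A boundary depends only on the bytes
# inside the sliding window, so boundaries away from the edit survive
# ("shift-resistant"), unlike fixed-size chunking.
random.seed(7)
data = bytes(random.randrange(256) for _ in range(30000))
edited = data[:15000] + b"\x00" + data[15000:]
```

Comparing the chunk sets of `data` and `edited` shows that only the chunks around the insertion point change.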
Variable-Sized Chunk Boundaries (3/3)
• Pros: robust to local changes, “shift-resistant”
• Cons: relatively expensive to compute
(Diagram: after a local change, only the affected chunk changes: c3 becomes c3', or splits into c3' and c3''; the surrounding chunks c1, c2, c4, c5, c6 remain identical.)
Naming Chunks (1/2)
• Chunk name = crypto-hash(chunk content)
– Most commonly SHA-1
• Pros
– Practically globally unique: collision rate < hardware
error rate
– Can deal with unsynchronized cache
– Self-verifiable
• Cons
– Relatively expensive to compute
– Reference size overhead (e.g., 20 bytes for SHA-1)
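A minimal sketch of SHA-1 chunk naming and of the self-verification property (helper names are mine):

```python
import hashlib

def chunk_name(chunk: bytes) -> bytes:
    """Chunk name = SHA-1 of the chunk content: 20 bytes, practically
    globally unique."""
    return hashlib.sha1(chunk).digest()

def verified_read(storage: dict, name: bytes) -> bytes:
    """Self-verification: a chunk fetched by name is checked by
    re-hashing, which detects corrupted or out-of-sync cache entries."""
    chunk = storage[name]
    if hashlib.sha1(chunk).digest() != name:
        raise ValueError("corrupted chunk")
    return chunk
```

The 20-byte digest is also exactly the per-chunk reference overhead the last bullet mentions.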
Naming Chunks (2/2)
• Optimization with synchronized cache
– Global uniqueness no longer necessary
– Can send more compact reference
– E.g., offset to chunk storage: ~4 bytes
• Cons
– Not self-verifiable
– Cache can easily go out of sync in practice
Other Variants: Packet-based RE (1/3)
• Cache “packets”, not chunks
– Packet storage, not chunk storage
• Index by “fingerprints”, not chunk names
– Fingerprints == chunk boundaries
• Expand left and right from the matched
fingerprint
– Reference: fingerprint (offset) + byte ranges
Other Variants: Packet-based RE (2/3)
(Diagram: incoming packets are matched against cached packets through the fingerprint index.)
Other Variants: Packet-based RE (3/3)
• Pros
– Cheap: no chunks, no SHA-1 computation
– Works for any protocol (not just TCP)
• Cons
– Small cache size (memory only)
– Limited cache hit rate
– Reference size overhead
– Assumes synchronized cache
Real World Measurement Results
1. CoDeeN CDN Traffic [IMC’11]
– http://codeen.cs.princeton.edu
– A semi-open, globally distributed proxy running on 500+ PlanetLab nodes
– Running since 2003 / 30+ million requests per day
– Full traffic dump for one month period
2. 3G Mobile Traffic [MobiSys’13]
– One of the largest ISPs in South Korea
– 12.5 million 3G subscribers
– Full traffic monitoring over one week period
CoDeeN Traffic: Bandwidth Savings
(Figure: chunk-based caching saves 1.8~2.5x more than object-based caching.)
• HTTP object-based: 17~28%
• Chunk-based: 42~51% with 128-byte chunks
CoDeeN Traffic: Origins of Redundancy
(Figure: origins of redundancy (%) for US traffic with 128-byte chunks, by content type: html, javascript, xml, css, audio, video, octet-stream, image, and all, grouped into text and binary. Each bar splits redundancy into object-hit, intra-URL, inter-URL aliasing, content updates, and aborted transfers.)
3G Traffic: Bandwidth Savings
(Figure: bandwidth savings over one week, roughly 1.5x; annotations mark Sunday night and rush hour.)
• 4KB fixed-sized chunking with 512GB of storage
– Why not use variable-sized chunking? → In Part 2
• Many other results in the paper
3G Traffic: Small Amount of Encrypted
Traffic (HTTPS)
Part 1: Summary
• Basic Architecture
– Two proxies with shared cache
• Building Blocks
– Fixed vs. variable-sized chunk boundaries
– Naming chunks with SHA-1
– Packet-based RE
• Real World Measurement Results
– 1.5~2x more bandwidth savings than Web caching
Part 2: Building RE Systems
• RE System Components and Performance
• RE for Fast Network [MobiSys’13]
• RE for Slow Network [USENIX’10]
RE Middlebox “System”:
Input-Process-Output
(Diagram: data flows in, is processed into chunks, and data flows out.)
RE System Components
1. Network: RE target link (WAN/LAN)
2. CPU: chunk generation
– Chunk boundary detection (e.g., Rabin’s
fingerprinting)
– Chunk naming (e.g., SHA-1 hash)
3. Disk: chunk storage
– Can be cached in memory to reduce disk accesses
4. Memory: chunk index
– Chunk name, offset, length, etc.
– Partially or completely kept in memory to minimize
disk accesses
End-To-End Performance
• Bandwidth “usage” reduction alone is not
enough
• “Effective” bandwidth (or throughput)
– Original data size / processing time
– Should be larger than or equal to the original link
bandwidth
• Goal: minimize “processing time” for large
effective bandwidth
– Original data size is fixed
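As a worked example of the definition above (the numbers are illustrative, not from the talk):

```python
def effective_bandwidth_mbps(original_mbits: float, processing_s: float) -> float:
    """Effective bandwidth = original data size / total processing time,
    where processing time covers chunking, transfer, and reconstruction."""
    return original_mbits / processing_s

# 800 Mbit of original data pushed end-to-end in 10 s gives 80 Mbps of
# effective bandwidth: a win over a 50 Mbps link, a loss on a 100 Mbps
# link, where sending the raw bytes would have been faster.
```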
Processing Time Breakdown
1. Chunk generation (content-fingerprinting) time
– CPU-bound
– Depends on the chunk generation method (e.g., chunk-based vs. packet-based, variable-sized vs. fixed-sized)
2. Network transfer time
– Network-bound
– Depends on amount of bandwidth savings
3. Reconstruction time
– Memory- and disk-bound
– Depends on how to locate and read chunks from
cache storage
RE System Designs for
Two Extreme Environments
• RE for fast network
– E.g., 10Gbps ISP core network
– Main bottleneck: (1) chunk generation time, and
(3) reconstruction time
• RE for slow network
– E.g., 1Mbps satellite internet in the developing
world
– Main bottleneck: (2) network transfer time
RE System Design for
Fast Network [MobiSys’13]
• How “fast” is it?
– 270K concurrent flows
– 1.3M new flows per minute
– 7.5Gbps peak per-minute bandwidth usage
• Goal #1: optimize “chunk generation time”
– Exploit multicore CPUs
– Favor cheap chunk generation method
• Goal #2: optimize “reconstruction time”
– Exploit multiple disks in parallel
– Use RAM/SSD instead
Monbot: Flow Monitoring Systems
(Diagram: Monbot pipeline, with chunk generation and indexing feeding chunk storage.)
Memory I/O and CPU Throughput
Fixed-sized chunking on a single machine with 12 cores
Symmetric RSS (1/2)
• RSS (Receive-Side Scaling)
– Distributes packets into multiple RX queues
– Toeplitz hash T over five tuples (src/dest IPs, ports,
and protocol) and a random seed S
• Problem
– T(src->dest, S) != T(dest->src, S)
– Packets of the same connection land in two different RX queues → access requires a lock
• Solution [details in the paper]
– Craft S that satisfies T(src->dest, S) == T(dest->src, S)
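A runnable sketch of the Toeplitz hash over the standard TCP/IPv4 4-tuple, plus one family of seeds with the required symmetry: any seed that repeats every 16 bits works, because the swapped fields sit at 16-bit-aligned offsets. The particular value 0x6d5a is an arbitrary illustration, not necessarily the seed crafted in the paper:

```python
import socket
import struct

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash as used by RSS: for every set bit of the input
    (MSB first), XOR in the 32-bit window of the key starting at that
    bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    h = 0
    for i in range(len(data) * 8):
        byte, bit = divmod(i, 8)
        if data[byte] & (0x80 >> bit):
            h ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return h

def tcp_tuple(src_ip, dst_ip, src_port, dst_port) -> bytes:
    """RSS hash input for TCP/IPv4: src IP, dst IP, src port, dst port."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", src_port, dst_port))

# IPs and ports sit at 16-bit-aligned offsets, so a seed that repeats
# every 16 bits presents the same key window before and after the
# src/dst swap, giving T(src->dest, S) == T(dest->src, S).
SYM_SEED = bytes([0x6D, 0x5A] * 20)   # 40-byte RSS key, 16-bit period
```

Both directions of a connection then hash to the same RX queue, so the flow state can be accessed without a lock.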
Symmetric RSS (2/2)
Other RE Systems for Fast Network
• Packet-based RE systems for “enterprise”
network
– PacketCache [SIGCOMM’08]: RE on routers
– SmartRE [SIGCOMM’09]: Coordinated RE
– EndRE [NSDI’10]: RE on end-hosts
• Disk bottleneck
– Packet storage: packets kept entirely in RAM
• CPU bottleneck
– No chunks, no SHA-1: fingerprints + byte ranges
RE System Design for
Slow Network [USENIX’10]
• How “slow” is it?
– Zambia example [Johnson et al. NSDR’10]
– 300 people share 1Mbps satellite link
– $1200 per month
• Goal: optimize “network transfer time”
– Maximize bandwidth savings by exploiting non-bottleneck CPU and disks
Developing World
Challenges/Opportunities
Enterprise:
• Dedicated machine with ample RAM
• High-RPM SCSI disk
• Inter-office content only
• Star topology

vs. Developing World:
• Shared machine with limited RAM
• Slow disk
• All content
• Mesh topology

→ Poor performance!
Wanax: High-Performance WAN
Accelerator
• Design Goals
– Maximize compression rate
– Minimize memory pressure
– Maximize disk performance
– Exploit local resources
• Contributions
– Multi-Resolution Chunking (MRC)
– Peering
– Intelligent Load Shedding (ILS)
Single Resolution Chunking (SRC)
Tradeoffs
• Small chunks: 93.75% saving, 15 disk reads, 15 index entries
– High compression rate
– High memory pressure
– Low disk performance
• Large chunks: 75% saving, 3 disk reads, 3 index entries
– Low compression rate
– Low memory pressure
– High disk performance
Multi-Resolution Chunking (MRC)
(Diagram: two similar objects chunked at multiple resolutions; MRC reaches the same 93.75% saving with only 6 disk reads and 6 index entries.)
• Use multiple chunk sizes simultaneously
• Large chunks for low memory pressure
and disk seeks
• Small chunks for high compression rate
Generating MRC Chunks
(Diagram: the smallest chunk boundaries are detected over the original data; positions whose fingerprints match more bits (“HIT”) also serve as boundaries for the larger chunk sizes.)
• Detect smallest chunk boundaries first
• Larger chunks are generated by matching
more bits of the detected boundaries
• Clean chunk alignment + less CPU
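A sketch of that single-pass MRC boundary generation, reusing a toy rolling hash in place of Rabin fingerprinting (parameters are illustrative): a position whose fingerprint has base_bits + k low-order zero bits is a boundary at resolution k, so higher resolutions are subsets of lower ones and chunk edges align cleanly.

```python
def mrc_boundaries(data: bytes, base_bits: int = 5, levels: int = 3,
                   window: int = 16):
    """One fingerprinting pass yields boundaries for every resolution.
    Level k declares a boundary where the low (base_bits + k) bits of
    the rolling fingerprint are zero: matching more bits means larger
    chunks (expected 2^(base_bits + k) bytes), and level k+1 boundaries
    are by construction a subset of level k boundaries."""
    MOD, PRIME = 1 << 32, 1000003
    pw = pow(PRIME, window - 1, MOD)
    f = 0
    bounds = [[] for _ in range(levels)]
    for i, b in enumerate(data):
        if i >= window:
            f = (f - data[i - window] * pw) % MOD
        f = (f * PRIME + b) % MOD
        if i >= window - 1:
            for k in range(levels):
                if (f & ((1 << (base_bits + k)) - 1)) == 0:
                    bounds[k].append(i + 1)
    return bounds
```

One pass over the data serves all levels, which is where the “less CPU” bullet comes from.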
Storing MRC Chunks
(Diagram: chunks A, B, and C are each stored in full, even where their contents overlap.)
• Store every chunk regardless of content
overlaps
– No association among chunks
– One index entry load + one disk seek
– Reduce memory pressure and disk seeks
– Disk space is cheap, disk seeks are limited
*Alternative storage options in the paper
Peering
• Cheap or free local networks
(ex: wireless mesh, WiLDNet)
– Multiple caches in the same region
– Extra memory and disk
• Use Highest Random Weight (HRW)
– Robust to node churn
– Scalable: no broadcast or digests exchange
– Trade CPU cycles for low memory footprint
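HRW itself is only a few lines; a sketch with hypothetical peer names:

```python
import hashlib

def hrw_owner(chunk_name: bytes, peers):
    """Highest Random Weight: every peer gets a score from
    hash(peer || chunk name); the chunk belongs to the highest scorer.
    No broadcasts or digest exchanges are needed, and removing a peer
    only remaps the chunks that peer owned (robust to churn), at the
    cost of hashing once per peer per lookup."""
    def score(peer: str) -> int:
        return int.from_bytes(
            hashlib.sha1(peer.encode() + chunk_name).digest()[:8], "big")
    return max(peers, key=score)
```

If a non-owner peer leaves, the surviving owner still has the maximum score, so the chunk's placement is unchanged.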
Intelligent Load Shedding (ILS)
• Two sources of fetching chunks
– Cache hits from disk
– Cache misses from network
• Fetching chunks from disks is not always desirable
– Disk heavily loaded (shared/slow)
– Network underutilized
• Solution: adjust network and disk usage
dynamically to maximize throughput
Shedding Smallest Chunk
(Diagram: fetch timelines for disk, network, and peer.)
• Total latency ≈ max(disk, network)
• Move the smallest chunks from disk to network, until the network becomes the bottleneck
– Disk → one seek regardless of chunk size
– Network → transfer time proportional to chunk size
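A toy version of that greedy shedding step (the units, names, and cost-model constants are mine):

```python
def shed_smallest(hit_sizes, seek_ms: float, link_kbps: float):
    """Intelligent Load Shedding sketch.  Disk time is ~one seek per
    cached chunk regardless of size; network time is proportional to the
    bytes transferred.  Greedily move the smallest cache-hit chunks onto
    the network while that still keeps network time below disk time."""
    disk = sorted(hit_sizes)          # sizes (bytes) of chunks found in cache
    shed_bytes = 0                    # bytes rerouted to the network instead
    ms_per_byte = 8.0 / link_kbps     # 8 bits per byte / (kbps = bits per ms)
    while disk and (shed_bytes + disk[0]) * ms_per_byte < len(disk) * seek_ms:
        shed_bytes += disk.pop(0)     # shed the smallest chunk
    return disk, shed_bytes
```

Shedding the smallest chunk removes a whole seek from the disk timeline while adding only a few bytes to the network timeline, so it is always the best candidate to move first.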
Bandwidth Savings
• SRC: high metadata overhead
• MRC: close to ideal bandwidth savings
Disk Fetch Cost
• MRC greatly reduces # of disk seeks
• 22.7x at 32 bytes
Chunk Size Breakdown
• SRC: high metadata overhead
• MRC: far fewer disk seeks and index entries (40% handled by large chunks)
Microbenchmark
(Figure: microbenchmark results for BASE, SRC, MRC, PEER, and ILS configurations; no tuning required.)
• 90% similar 1 MB files
• 512kbps / 200ms RTT
Realistic Traffic: YouTube
(Figure: download performance for the high-quality (490) and low-quality (320) versions of the clip; the arrow marks the better direction.)
• Classroom scenario
• 100 clients download 18MB clip
• 1Mbps / 1000ms RTT
Part 2: Summary
• RE System Components and Performance
– Interaction of CPU, network, disk, and memory
– Processing time matters
• RE System Design for Fast Network [MobiSys’13]
– Requires scalable flow management
– Uses RAM/SSD for chunk storage
– Favors cheap chunk generation method
• RE System Design for Slow Network [USENIX’10]
– Exploits non-bottleneck resources: CPU and disks
Tutorial Recap
• Conventional Web caching is limited
– Aliasing
– Uncacheable content
– Content update
• RE is a powerful technique for reducing
bandwidth consumption
– Not just for Web
– E.g., storage de-duplication systems
• RE is not a panacea: different environments
require different system designs
Thank You
• [email protected]
• http://www.cs.princeton.edu/~sihm