Semantics of Caching with SPOCA: A Stateless, Proportional, Optimally-Consistent Addressing Algorithm
Piotr Skowron
November 6, 2011

Outline: Context of the problem, Requirements for routing, Hashing algorithm, Geographic distribution, Experiments

Abstract

This presentation is about:
- how Yahoo! efficiently serves millions of videos from its video library,
- what architecture it uses to ensure efficient caching (and thereby a significant improvement in the quality of service),
- how its new algorithm reduced disk cache misses from 5% to less than 1% and increased memory cache hits from 45% to 80% (thus improving overall cache hits from 95% to 99.6%).

The way it works

- clients access videos using web browsers,
- clients connect to front-end servers, which serve the video content,
- the front-end servers cache content but are not the permanent repository,
- videos are stored in a storage farm that is accessible through the front-end servers.

The way it works: diagram

[Figure: The way it works (diagram)]

Local or remote serving

The videos are spread around the world, so when a user accesses a video one of the following happens:
- the content is retrieved from the storage farm (the slowest option),
- if the content is cached on the disks of a front-end server, it can be served more efficiently,
- in the best case the content is cached in the memory of a front-end server.

For videos, the difference between caching in memory and caching on disk is small.

Retrieving content from the storage farm not only takes significantly longer, but also puts more load on the back-end infrastructure. This leads to:
- a higher cost of upgrading networking components,
- a larger number of servers required in the storage farm to handle the extra load.

Why is caching difficult in the case of video?

- videos are large: a typical front-end server can hold 500 unique videos in memory and 100,000 on disk,
- the demand is high: users make over 30,000,000 requests per day for over 800,000 unique videos; there are over 20,000,000 unique videos in the library,
- the ratio of total requests to unique requests is low.

The traditional VIP (Virtual IP)

[Figure: The traditional VIP (Virtual IP)]

With a VIP, each piece of popular content ends up on multiple servers. This redundant caching is highly inefficient compared to caching where each piece of content is kept on a single server. On the other hand, remembering which front-end server hosts which content is expensive.

The question of the day is: how to increase caching efficiency through intelligent routing, without remembering content locations in a database.

Stateless

There should be no need to keep a data catalog associating each content file with a particular front-end server.

Consistent

All requests for a particular content file should be directed to the same server.
For stateless addressing, the inputs are a filename and a list of currently available servers; the output is a server from the list. Consistency means that the same input always produces the same output.

Proportional

The requests should be partitioned between the front-end servers proportionally to the servers' weights.

Example: A newer server might have twice the capacity of an older one, and should therefore serve twice as large a portion of the content library.

Remark: The proportionality requirement rules out the use of a distance-based consistent hashing algorithm, although such algorithms are consistent and stateless.

Proportionality must also hold when a server is added or removed. The requests must not be redistributed only among the nearest servers, so distance-based hashing algorithms do not meet this requirement.

Optimally-consistent

When a front-end server enters or leaves the pool, as few files as possible are redistributed to the other servers.

Example: Suppose that a pool has three front-end servers of weight 100 and two servers of weight 200. If a new server of weight 200 is added to the pool, it must be assigned 2/9 of the files in the content library (its weight of 200 out of the new total weight of 900). More specifically, from each of the other servers it must take over 2/9 of the files that server was handling.
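
The arithmetic behind the 2/9 in this example can be checked with a short Python sketch. The server names and the helper `shares` are hypothetical and exist only for illustration; the weights are the ones from the example.

```python
from fractions import Fraction

def shares(weights):
    """Return each server's fraction of the content library, proportional to its weight."""
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}

# Pool from the example: three servers of weight 100 and two of weight 200.
old_pool = {"a": 100, "b": 100, "c": 100, "d": 200, "e": 200}
before = shares(old_pool)

# The same pool after a server of weight 200 joins.
after = shares({**old_pool, "new": 200})

# Share of the whole library that the new server must receive.
new_share = after["new"]

# Fraction of its own files that each old server hands over to the new one.
given_up = {name: 1 - after[name] / before[name] for name in before}

print(new_share)   # 2/9
print(given_up)    # every old server gives up 2/9 of its files
```

Every old server keeps 7/9 of what it had, so no file needs to move between two old servers; that is the minimal amount of redistribution compatible with proportionality.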

[Figure: Optimally-consistent]

Proportionality once again

Note that a proportional distribution of the files does not necessarily mean a proportional distribution of the requests.

Hashing algorithm

The routing function uses a hash function to map file names to points in a hash space. Each front-end server is assigned a portion of the hash space proportional to its capacity. Not every point in the hash space maps to a front-end server, so when the hash of the name of a requested video falls into unassigned space, the result of the hash function is hashed again, and this is repeated until the result lands in an assigned portion of the hash space.
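
The rehash-until-assigned loop can be sketched in Python as follows. This is a minimal illustration, not the production router: the 64-bit hash space, the use of SHA-256, the contiguous segment layout, and all function and server names are assumptions made for the example.

```python
import hashlib

HASH_SPACE = 2**64  # assumed size of the hash space

def next_point(value: bytes) -> int:
    """Hash bytes to a point in the hash space (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(value).digest()[:8], "big")

def build_segments(weights, unit):
    """Give each server a contiguous segment of length weight * unit."""
    segments, start = [], 0
    for name, weight in weights.items():
        end = start + weight * unit
        segments.append((start, end, name))
        start = end
    if start > HASH_SPACE:
        raise ValueError("assigned segments exceed the hash space")
    return segments

def lookup(point, segments):
    """Return the server owning this point, or None if the point is unassigned."""
    for start, end, name in segments:
        if start <= point < end:
            return name
    return None

def route(filename, segments):
    """Hash the filename, then rehash the result until it lands on a server."""
    point = next_point(filename.encode())
    while True:
        server = lookup(point, segments)
        if server is not None:
            return server
        point = next_point(point.to_bytes(8, "big"))

# Hypothetical pool: "new" has twice the weight of "old"; 25% of the space is assigned.
segments = build_segments({"old": 1, "new": 2}, unit=HASH_SPACE // 12)
server = route("video-123.flv", segments)
```

Because routing depends only on the filename and the segment table, no per-file state is needed. If a server's segment is removed while the other segments stay where they are, a file that landed on a remaining server still lands there, since none of the points on its path fell into the removed segment.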

[Figure: Unassigned space hit]

[Figure: Leaving server]

[Figure: Entering server]

Hashing algorithm

The assigned portion of the hash space is between 1% and 25%, depending on the anticipated future growth.

If 1% of the space is assigned, SPOCA has to generate on average 100 pseudo-random numbers to route one request. This is still efficient, because linear congruential generators are very simple.

Starting from 1% of assigned space, the system may grow to almost 100 times its size without running out of room.

How to deal with popular files

The hash of each requested filename, the seed, is kept in memory for a configurable length of time, T. When T elapses, the seed is discarded.

If a request arrives for a file for which the request router has a saved seed, that file is deemed popular, and the generation of pseudo-random numbers starts from the saved seed rather than from the filename. As a result, a popular file can be routed to a different server than the one chosen from its filename.
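
One possible reading of the popularity window is sketched below in Python. The class `PopularityAwareRouter`, the SHA-256 hash, the 64-bit hash space, and the injectable clock are all hypothetical. In particular, the sketch assumes that the saved seed is the point where the previous request for the file landed, which the slides do not state explicitly.

```python
import hashlib
import time

HASH_SPACE = 2**64  # assumed size of the hash space

def next_point(value: bytes) -> int:
    """Hash bytes to a point in the hash space (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(value).digest()[:8], "big")

class PopularityAwareRouter:
    """Hypothetical router that remembers a seed per file for T seconds."""

    def __init__(self, segments, window_seconds, clock=time.monotonic):
        self.segments = segments        # list of (start, end, server_name)
        self.window = window_seconds    # the popularity window T
        self.clock = clock              # injectable so the window can be tested
        self.seeds = {}                 # filename -> (seed, expiry time)

    def _lookup(self, point):
        for start, end, name in self.segments:
            if start <= point < end:
                return name
        return None

    def route(self, filename):
        now = self.clock()
        saved = self.seeds.get(filename)
        if saved is not None and saved[1] > now:
            # Popular file: continue the pseudo-random sequence from the saved seed.
            point = next_point(saved[0].to_bytes(8, "big"))
        else:
            # No live seed: start from the filename itself.
            point = next_point(filename.encode())
        while self._lookup(point) is None:
            point = next_point(point.to_bytes(8, "big"))
        # Assumption: the landing point becomes the seed for the next request.
        # An expired entry is simply overwritten here; a long-running router
        # would also need to evict entries that are never requested again.
        self.seeds[filename] = (point, now + self.window)
        return self._lookup(point)

# Two hypothetical servers that together cover half of the hash space.
quarter = HASH_SPACE // 4
router = PopularityAwareRouter([(0, quarter, "s1"), (quarter, 2 * quarter, "s2")],
                               window_seconds=300)
first = router.route("video-123.flv")
```

A request that arrives after the window has expired starts from the filename again and therefore returns to the server chosen originally, so the extra copies of a file that is no longer popular stop being used.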

The SPOCA algorithm

[Figure: the SPOCA algorithm]

Geographic distribution

A separate problem is how to take the geographic distribution of the content into account.

The second question of the day is therefore: when is it profitable to serve data from the cluster that stores it (the home locale), and when is it profitable to cache it near the user (at the nearest locale)?

The solution: unpopular data should be served from the home locale, while popular data should be fetched to the nearest locale. This is done by a separate, independent protocol called Zebra.

Bloom filters

Zebra tracks the popularity of files using Bloom filters.

Bloom filter: A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible (though rare); false negatives are not.

Unfortunately, it is not possible to remove content from a Bloom filter, so Zebra uses 17 Bloom filters instead of a single one. Each Bloom filter represents the requests for a given interval, on the order of hours. When the oldest filter expires, it is removed and a new filter is added.
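
The rotation of filters can be illustrated with a small Python sketch. The filter size, the number of hash functions, the SHA-256-based hashing, and the class and method names are assumptions. How Zebra combines the filters into a popularity decision is not described here, so `seen_recently` is only a hypothetical query.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Minimal Bloom filter; the size and hash count are illustrative only."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

class RotatingBloomFilters:
    """Keeps one filter per interval; the oldest is dropped on rotation."""

    def __init__(self, count=17):
        self.filters = deque((BloomFilter() for _ in range(count)), maxlen=count)

    def record(self, item):
        # Requests are recorded in the newest filter only.
        self.filters[-1].add(item)

    def rotate(self):
        # Appending to a full deque with maxlen discards the oldest filter.
        self.filters.append(BloomFilter())

    def seen_recently(self, item):
        return any(item in f for f in self.filters)

tracker = RotatingBloomFilters(count=3)
tracker.record("video-123.flv")
print(tracker.seen_recently("video-123.flv"))  # True
for _ in range(3):
    tracker.rotate()
print(tracker.seen_recently("video-123.flv"))  # False
```

Dropping a whole filter is how old requests are forgotten without needing a delete operation on any individual filter.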

Experiments

[Figure: Data growth]

[Figure: Back-end server traffic without SPOCA]

[Figure: Back-end server traffic with SPOCA]

[Figure: Estimates of the savings in storage costs]

The impact of the size of the popularity window (T) on caching efficiency

Here the popularity window was decreased from 300 s to 240 s. Decreasing the window increases cache hits; however, decreasing it too much can overload the servers.

Related work

Thank you :)