Low Latency Geo-distributed Data Analytics

Background
• Many cloud organizations use datacenters and clusters to provide low-latency services
– E.g., Microsoft and Google
• Services are deployed across geo-distributed sites
– They produce big data (e.g., session logs, user activity info)

Current research
• Aggregate all datasets at a single datacenter
– Poor efficiency
• Use intra-DC analytics frameworks
– Limited by the low capacity of WAN bandwidth

Iridium
• Architecture
– Redistribute datasets among sites before queries arrive
– Place tasks to reduce network bottlenecks during query execution
– Budget WAN usage
• In general, Iridium executes queries in a geo-distributed fashion over the data stored locally.
• [Figure: example reduce-task placements r = (1/3, 1/3, 1/3) vs. r = (0.05, 0.475, 0.475); shifting reduce tasks away from a poorly connected site shortens the slowest transfer]

Redistribute datasets
• Assumptions
– Sites have relatively abundant compute and storage resources
– A single MapReduce job is considered
• Iteratively moves small chunks of datasets to "better" sites
• How to choose the data to be moved?
– Prefer high value-per-byte data, e.g., prefer moving datasets that many queries access

Placement of Reduce Tasks (moving intermediate data)
• Definitions
– r_i: fraction of reduce tasks to place on site i
– S_i: intermediate data on site i
– D_i: bandwidth for downloading data to site i
– U_i: bandwidth for uploading data from site i
– T_i^U: time to upload data from site i, T_i^U = S_i (1 - r_i) / U_i
– T_i^D: time to download data to site i, T_i^D = r_i Σ_{j≠i} S_j / D_i
• Goal: choose r to minimize the slowest transfer, min_r max_i max(T_i^U, T_i^D), subject to Σ_i r_i = 1 and r_i ≥ 0; this is a linear program (see the LP sketch at the end of these notes)

DAGs of Tasks
• E.g., if a query includes two MapReduce jobs and the first job's output is the input of the second, how does Iridium deal with it?
• The optimization does not directly extend to DAGs.
• Instead, Iridium adopts a greedy approach, applying its scheme independently in each stage.

Data Placement (moving input data)
• Heuristic
– Iteratively identify bottlenecked sites
– Move data out of them in 10 MB increments (see the greedy-loop sketch at the end of these notes)

Prioritizing Between Multiple Datasets
• Def: score of a dataset = value / cost (see the scoring sketch at the end of these notes)
– Value: the reduction in query response time obtained by moving the data from site A to site B
– Cost: the amount of data that needs to be moved

Evaluation
• EC2 deployment and trace-driven simulations
– EC2 deployment: across 8 EC2 regions
• Workloads from Conviva, Bing Edge, TPC-DS, and the AMPLab Big Data Benchmark
• These workloads consist of a mix of Spark and Hive queries
– Trace-driven simulation
• Mimics 150 sites using traces from Facebook's Hadoop clusters
• Traces include query arrival times, input/output sizes, and dataset properties: locations, generation times, and access patterns
• [Figures: results for the EC2 deployment and the trace-driven simulation]
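Code sketches (referenced above)

The reduce-task placement reduces to a small linear program: pick r to minimize the slowest of all sites' upload and download times, subject to Σ_i r_i = 1. Below is a minimal sketch in Python using scipy.optimize.linprog as a generic LP solver; it illustrates the formulation above rather than Iridium's actual implementation, and the numbers in the toy example are invented so that the optimizer lands near the slide's skewed split.

```python
# LP formulation of reduce-task placement (sketch).
import numpy as np
from scipy.optimize import linprog

def place_reduce_tasks(S, U, D):
    """Fractions r_i of reduce tasks per site minimizing the slowest
    intermediate-data transfer.

    Variables x = [r_1, ..., r_n, z], where z upper-bounds every site's
    upload time  T_i^U = S_i (1 - r_i) / U_i  and download time
    T_i^D = r_i * sum_{j != i} S_j / D_i.
    """
    n, total = len(S), sum(S)
    c = np.zeros(n + 1)
    c[-1] = 1.0                                # minimize z
    A_ub, b_ub = [], []
    for i in range(n):
        # Upload:  S_i (1 - r_i) / U_i <= z
        #   =>  -(S_i / U_i) r_i - z <= -(S_i / U_i)
        row = np.zeros(n + 1)
        row[i], row[-1] = -S[i] / U[i], -1.0
        A_ub.append(row)
        b_ub.append(-S[i] / U[i])
        # Download:  r_i (total - S_i) / D_i <= z
        row = np.zeros(n + 1)
        row[i], row[-1] = (total - S[i]) / D[i], -1.0
        A_ub.append(row)
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]        # sum_i r_i = 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n + [(0.0, None)])
    assert res.success
    return res.x[:n], res.x[-1]                # placement r, bottleneck z

# Toy example (invented numbers): three sites with equal intermediate
# data, but site 0 has a congested downlink. The LP shifts reduce tasks
# away from site 0, giving r close to the slide's (0.05, 0.475, 0.475)
# instead of the naive (1/3, 1/3, 1/3).
r, z = place_reduce_tasks(S=[100, 100, 100],   # intermediate data, MB
                          U=[10, 10, 10],      # uplink, MB/s
                          D=[1, 10, 10])       # downlink, MB/s
print(r, z)
```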
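The data-placement heuristic can then be sketched as a greedy loop on top of place_reduce_tasks(): find the site whose transfer time is the current bottleneck, tentatively move a 10 MB chunk of its data to each other site, and commit the best move only if it shortens the bottleneck. This is a simplification that rebalances the per-site data S directly; the real system moves input data, factors in predicted query arrivals, and enforces a WAN-usage budget, all omitted here.

```python
# Greedy data-placement loop (simplified sketch).
CHUNK_MB = 10

def bottleneck_site(S, U, D, r):
    """Index of the site with the slowest transfer under placement r."""
    total = sum(S)
    times = [max(S[i] * (1 - r[i]) / U[i],
                 r[i] * (total - S[i]) / D[i]) for i in range(len(S))]
    return max(range(len(S)), key=lambda i: times[i])

def rebalance(S, U, D, max_moves=100):
    S = list(S)
    r, z = place_reduce_tasks(S, U, D)
    for _ in range(max_moves):
        src = bottleneck_site(S, U, D, r)
        if S[src] < CHUNK_MB:
            break
        # Try every destination for one 10 MB chunk from the bottleneck
        # site; keep the move that most reduces the bottleneck time z.
        best_z, best_dst = z, None
        for dst in range(len(S)):
            if dst == src:
                continue
            trial = list(S)
            trial[src] -= CHUNK_MB
            trial[dst] += CHUNK_MB
            _, trial_z = place_reduce_tasks(trial, U, D)
            if trial_z < best_z:
                best_z, best_dst = trial_z, dst
        if best_dst is None:       # no improving move: stop
            break
        S[src] -= CHUNK_MB
        S[best_dst] += CHUNK_MB
        r, z = place_reduce_tasks(S, U, D)
    return S, z
```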
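When several datasets compete for scarce WAN bandwidth, candidate moves are ranked by the value/cost score defined above. A tiny sketch follows; the Move record and its fields are hypothetical, and the value (time saved) is assumed to have been estimated already, e.g., by re-solving the placement LP before and after the move for every query that accesses the dataset.

```python
# Value/cost prioritization across datasets (sketch; Move is a
# hypothetical record, not Iridium's actual data structure).
from dataclasses import dataclass

@dataclass
class Move:
    dataset: str
    bytes_moved: float   # cost: WAN bytes consumed by the move
    time_saved: float    # value: summed response-time reduction across
                         # the queries that access this dataset

def score(m: Move) -> float:
    """Score = value / cost; higher means more benefit per WAN byte."""
    return m.time_saved / m.bytes_moved

def next_move(candidates):
    """Schedule the highest value-per-byte move first."""
    return max(candidates, key=score)
```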