Low Latency Geo-distributed Data Analytics

Background
• Many cloud organizations use datacenters and clusters to provide low-latency services
– E.g. Microsoft & Google
• Services are deployed on geo-distributed sites
– They produce big data (e.g. session logs, user activity info)
Current research
• Aggregate all datasets to a single datacenter
– Poor efficiency: wastes WAN bandwidth and delays query responses
• Use intra-DC analytics frameworks across sites
– Limited by the low bandwidth capacity of WAN links
Iridium
• Architecture
– Redistribute datasets among sites before queries arrive
– Place tasks to reduce network bottlenecks during query execution
– Budget WAN usage
In general, Iridium executes queries in a geo-distributed manner over the data stored locally at each site.
Example (three sites): a uniform reduce-task placement r = (1/3, 1/3, 1/3) bottlenecks the site with the weakest WAN link, whereas the skewed placement r = (0.05, 0.475, 0.475) shifts work away from it and shortens the shuffle (see the worked example below).
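A minimal sketch of why the skewed placement can win, assuming the upload/download shuffle-time model defined later in these slides; the per-site intermediate data sizes and WAN bandwidths here are purely hypothetical:

```python
# Hypothetical example: 3 sites, intermediate data S (GB),
# WAN upload bandwidth U and download bandwidth D (GB/s).
S = [10.0, 40.0, 40.0]   # intermediate data on each site (assumed)
U = [1.0, 1.0, 1.0]      # upload bandwidth per site (assumed)
D = [0.1, 1.0, 1.0]      # download bandwidth per site (assumed; site 0 is slow)

def shuffle_time(r):
    """Bottleneck shuffle duration for reduce-task fractions r."""
    total = sum(S)
    times = []
    for i in range(len(S)):
        t_up = (1 - r[i]) * S[i] / U[i]        # data site i must send out
        t_down = r[i] * (total - S[i]) / D[i]  # data site i must fetch
        times.append(max(t_up, t_down))
    return max(times)

print(shuffle_time([1/3, 1/3, 1/3]))       # uniform placement: slow site dominates
print(shuffle_time([0.05, 0.475, 0.475]))  # skewed placement: much shorter shuffle
```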
Redistribute datasets
• Assumption: sites have relatively abundant compute and storage resources
• For a single MR job, Iridium iteratively moves small chunks of the datasets to "better" sites
• How to choose the data to be moved?
– Prefer high value-per-byte data
– E.g. prefer moving datasets that many queries access
Placement of Reduce Tasks (Move intermediate data)
• Def:
– r_i : fraction of reduce tasks to place on site i
– S_i : intermediate data on site i
– D_i : download bandwidth of site i
– U_i : upload bandwidth of site i
– T_i^U : time to upload (send out) data from site i
– T_i^D : time to download (fetch) data at site i
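With these definitions, the shuffle-time model described in the Iridium paper can be restated compactly; this is a sketch of the core objective only, not the paper's full formulation:

```latex
% Upload and download durations for site i during the shuffle
T_i^{U} = \frac{(1 - r_i)\, S_i}{U_i}
\qquad
T_i^{D} = \frac{r_i \left( \sum_j S_j - S_i \right)}{D_i}

% Choose the task fractions r to minimize the bottleneck transfer duration
\min_{r} \; \max_i \; \max\!\left( T_i^{U},\, T_i^{D} \right)
\quad \text{s.t.} \quad \sum_i r_i = 1, \; r_i \ge 0
```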
DAGs of Tasks
• E.g. If a query consists of two MR jobs and the first MR's output is the input of the second MR, how does Iridium deal with it?
• Iridium's formulation does not directly handle general DAGs of tasks.
• Instead, it adopts a greedy approach, applying its placement scheme independently in each stage (see the sketch below).
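The single-stage objective above is a linear program; below is a minimal, self-contained sketch (not Iridium's implementation) that solves it with scipy and is simply re-run for each stage in order. All data sizes and bandwidths are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def place_reduce_tasks(S, U, D):
    """Return task fractions r minimizing the bottleneck shuffle time.
    Variables x = [r_1..r_n, z]; minimize z subject to
      (1 - r_i) * S_i / U_i <= z   and   r_i * (sum(S) - S_i) / D_i <= z."""
    n, total = len(S), sum(S)
    c = np.zeros(n + 1); c[-1] = 1.0                   # minimize z
    A_ub, b_ub = [], []
    for i in range(n):
        up = np.zeros(n + 1); up[i] = -S[i] / U[i]; up[-1] = -1.0
        A_ub.append(up); b_ub.append(-S[i] / U[i])     # upload-time constraint
        down = np.zeros(n + 1); down[i] = (total - S[i]) / D[i]; down[-1] = -1.0
        A_ub.append(down); b_ub.append(0.0)            # download-time constraint
    A_eq = [np.append(np.ones(n), 0.0)]                # sum of r_i equals 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(0, None)])
    return res.x[:n]

# Greedy per-stage use: each stage is solved independently; here we just
# re-run the solver on hypothetical per-stage intermediate-data sizes (GB).
U, D = [1.0, 1.0, 1.0], [0.1, 1.0, 1.0]                # assumed bandwidths (GB/s)
for S in ([10.0, 40.0, 40.0], [5.0, 20.0, 20.0]):
    print(place_reduce_tasks(S, U, D))
```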
Data Placement (Move input data)
• Heuristic:
– Iteratively identify bottlenecked sites
– Move data out of them in 10 MB increments (see the sketch below)
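A minimal sketch of this iterative heuristic; the bottleneck metric, the destination choice, and the WAN budget handling here are simplified stand-ins, not Iridium's actual logic:

```python
CHUNK_MB = 10.0  # move data out of bottlenecked sites in 10 MB increments

def upload_pressure(data_mb, up_mbps):
    """Crude bottleneck metric: time to push a site's data over its uplink."""
    return [d / u for d, u in zip(data_mb, up_mbps)]

def rebalance(data_mb, up_mbps, budget_mb):
    """Repeatedly move CHUNK_MB from the most-pressured site to the
    least-pressured site until the WAN budget is spent or nothing improves."""
    moved = 0.0
    while moved + CHUNK_MB <= budget_mb:
        p = upload_pressure(data_mb, up_mbps)
        src = max(range(len(p)), key=p.__getitem__)  # bottlenecked site
        dst = min(range(len(p)), key=p.__getitem__)  # best destination
        if src == dst or data_mb[src] < CHUNK_MB:
            break
        data_mb[src] -= CHUNK_MB
        data_mb[dst] += CHUNK_MB
        moved += CHUNK_MB
    return data_mb, moved

# Hypothetical sites: input data (MB) and uplink bandwidth (arbitrary units)
print(rebalance([500.0, 100.0, 100.0], [1.0, 10.0, 10.0], budget_mb=200.0))
```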
Prioritizing Between Multiple Datasets
• Def:
– Score of a dataset = value / cost
• Value: the reduction in query response time obtained by moving the data from site A to site B
• Cost: the amount of data that needs to be moved (see the sketch below)
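A minimal sketch of how such a score could rank candidate moves; the dataset names, time savings, and sizes below are hypothetical placeholders for values that would come from the response-time model:

```python
def score(candidate):
    """Value-per-byte: estimated response-time reduction divided by bytes moved."""
    return candidate["time_saved_s"] / candidate["bytes_moved"]

# Hypothetical candidate moves for different datasets
candidates = [
    {"dataset": "session_logs",  "time_saved_s": 120.0, "bytes_moved": 4e9},
    {"dataset": "user_activity", "time_saved_s": 30.0,  "bytes_moved": 2e8},
]

# Move the highest value-per-byte dataset first
for c in sorted(candidates, key=score, reverse=True):
    print(c["dataset"], score(c))
```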
Evaluation
• Evaluated using an EC2 deployment and trace-driven simulations
– EC2 deployment: across 8 EC2 regions
• Workloads from Conviva, Bing Edge, TPC-DS and the AMPLab big-data benchmark
• These workloads consist of a mix of Spark and Hive queries
– Trace-driven simulation
• Simulates 150 sites based on traces from Facebook's Hadoop cluster
• The traces capture query arrival times, input/output sizes, and dataset properties (locations, generation times and access patterns)
EC2 deployment results (figures)
Trace-driven simulation results (figures)