Wander Join: Online Aggregation via Random Walks

Feifei Li (University of Utah)
Bin Wu, Ke Yi (Hong Kong University of Science and Technology)
Zhuoyue Zhao (Shanghai Jiao Tong University)


Database Workloads

- Transactional (OLTP)
  - Deduct x dollars from account A, credit x dollars to account B
  - Challenge: Efficiency and correctness (ACID)
- Analytical (OLAP)
  - Large fraction of data
  - Many tables
  - Complex conditions
  - Challenge: Efficiency
  - Correctness?


Complex Analytical Queries (TPC-H)

    SELECT SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'

This query finds the total revenue loss due to returned orders in a given region.


Online Aggregation [Haas, Hellerstein, Wang, SIGMOD'97]

    SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
    WITHTIME 60000 CONFIDENCE 95 REPORTINTERVAL 1000

Confidence interval: [Ỹ - ε, Ỹ + ε], where Ỹ is the running estimate and Y the true answer
Confidence level: Pr[Ỹ - ε < Y < Ỹ + ε] > 0.95


Ripple Join [Haas, Hellerstein, SIGMOD'99]

- Store tuples in each table in random order
- In each step
  - Read the next tuple from a table in a round-robin fashion
  - Join with sampled tuples from other tables
- Works well for full Cartesian product
  - But most joins are sparse ...


A Running Example

What's the total revenue of all orders from customers in China?

    Customers         Orders               Items
    Nation  CID       BuyerID  OrderID     OrderID  ItemID  Price
    US      1         4        1           4        301     $2100
    US      2         3        2           2        304     $100
    China   3         1        3           3        201     $300
    UK      4         5        4           4        306     $500
    China   5         5        5           3        401     $230
    US      6         5        6           1        101     $800
    China   7         3        7           2        201     $300
    UK      8         5        8           5        101     $200
    Japan   9         3        9           4        301     $100
    UK      10        7        10          2        201     $600

With ripple join:
- N: size of each table, e.g., 10^9
- n: # tuples taken from each table
- k: # estimators, e.g., 10^3
- n^3 · N^(-2) = k, so n = N^(2/3) · k^(1/3) = 10^7


Join as a Graph

[Figure: the tuples of R1, R2 and R3 drawn as a graph, with an edge between every pair of joining tuples]

- Conceptual only
- Never materialized

    SELECT SUM(Price)
    FROM Customers C, Orders O, Items I
    WHERE C.Nation = 'China'
      AND C.CID = O.BuyerID
      AND O.OrderID = I.OrderID

[Figure: the Customers, Orders and Items tables of the running example, with edges between joining tuples]


Sampling by Random Walks

[Figure: the same query and tables; a random walk starts from a Customers tuple with Nation = 'China', moves to a random joining Orders tuple, then to a random joining Items tuple]

- N: size of each table, e.g., 10^9
- n: # tuples taken from each table = # random walks
- k: # estimators, e.g., 10^3
- n = k = 10^3
- Unbiased estimator: (price of the sampled item) / (sampling probability of the walk), where the sampling probability is the product of the three per-step probabilities


Walk Plan Optimization

[Figure: join graph over R1, R2, R3]

- Structure of the data graph
- Selection predicates
  - Starting table: use index
  - Table in the middle: reject random walk
- Data distribution
  - Non-uniformity may not be a bad thing!

[Figure: two example data distributions between R1 and R2; in one, Var(R1 → R2) < Var(R2 → R1); in the other, Var(R1 → R2) > Var(R2 → R1)]


Walk Plan Optimizer

- Enumerate all plans
- Conduct ~100 trial random walks using each plan
- Measure the variance of each plan
- Select the best plan
- All trial runs are still useful


Convergence Comparison

[Figure: convergence comparison plot]


Wander Join in PostgreSQL

[Figure: running time plot]

Logarithmic growth due to B-tree lookup to find random neighbours


Running on Insufficient Memory (4GB)

- Insufficient memory incurs a heavy, one-time penalty
- Growth is still logarithmic
- Fundamentally: random sampling is at odds with hard disks
  - But does it matter? Spark, in-memory DB, RAM cloud ...
  - The algorithm is embarrassingly parallel
  - Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB'09]


Accuracy Achieved in 1/10 Time of Full Join

[Figure: accuracy plot]


Wander Join vs Ripple Join

                                             Wander Join                    Ripple Join
    Sampling methodology                     Independent but non-uniform    Uniform but non-independent
    Index needed?                            Yes                            Index or random storage
    Confidence interval computation          Easy, O(n) time                Complicated, O(n^k) time (k: # tables)
    Convergence time (20GB data, 3 tables)   ~3s                            ~50s
    Scalability                              Logarithmic                    Slightly less than linear
    System implementation                    PostgreSQL (finished),         Informix (internal project),
                                             Oracle (in progress),          DBO
                                             SparkSQL (in progress)


Online Aggregation vs Data Cube

                       Online Aggregation                           Data Cube
    Queries            Online, ad hoc                               Offline, fixed
    Latency            Seconds                                      Hours, then milliseconds
    Query mode         One at a time                                Batch
    Accuracy           Small error                                  No error
    Data schema        Any (relational, graph)                      Multidimensional cube
    Work with OLTP     Integrated                                   Separate
    Target scenario    Online, ad hoc, interactive data analytics   Monthly report


Thank you!


Dealing with Selection Predicates

- One predicate
  - Little impact: can start the walk from that table
- Multiple highly selective predicates
  - More random walks will fail
  - Running the full query becomes faster
  - Can simply switch to the full query when selectivity < 1% (say)


Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD'90]

    Customers         Orders               Items
    Nation  CID       BuyerID  OrderID     OrderID  ItemID  Price
    US      1         4        8           4        301     $2100
    US      2         3        5           2        304     $100
    China   3         1        3           3        201     $300
    UK      4         5        4           4        306     $500
    China   5         5        2           3        401     $230
    US      6         5        3           1        101     $800
    China   7         3        7           2        201     $300
    UK      8         5        1           5        101     $200
    Japan   9         3        9           4        301     $100
    UK      10        7        10          2        201     $600


Sampling from a B-tree [Olken, '93]

[Figure: B-tree whose leaf nodes hold 4, 2 and 3 entries]

- Sampling from an aggregate (ranked) B-tree is easy
- But
  - incurs heavy cost for transactions
  - need to modify existing B-tree implementations


Rejection Sampling [Olken, '93]

- Imagine each node has maximum fanout
- Reject as soon as it walks out of bound


Non-Uniform Sampling

[Figure: B-tree with root fanout 3; entries in the three leaf nodes are sampled with probability 1/(3·4), 1/(3·2) and 1/(3·3) respectively]

- As long as we can compute the sampling probability, wander join still works!


Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, '13]

                        Wander Join                BlinkDB
    Methodology         Query → Sampling           Sampling → Query
    Sampling method     Random walks               Stratified sampling
    Joins supported     Any                        Big table joining a small table (no sampling on small table)
    Error               Reduces over time          Fixed
    Data schema         Any (relational, graph)    Star / snowflake
    Work with OLTP      Integrated                 Separate
    Group-by support    Unbalanced                 Balanced
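

Sketch: random-walk estimator on the running example

The random-walk sampling of the running example (SUM(Price) over Customers, Orders, Items for customers in China) can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: in-memory dictionaries stand in for the B-tree indexes, and the names one_walk, wander_join_sum and exact_sum are hypothetical. Each walk returns the sampled price divided by its sampling probability, failed walks count as 0, and the confidence interval uses a normal approximation over the independent walks.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist

# Tables from the running example.
CUSTOMERS = [("US", 1), ("US", 2), ("China", 3), ("UK", 4), ("China", 5),
             ("US", 6), ("China", 7), ("UK", 8), ("Japan", 9), ("UK", 10)]
ORDERS = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5),
          (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)]       # (BuyerID, OrderID)
ITEMS = [(4, 301, 2100), (2, 304, 100), (3, 201, 300), (4, 306, 500),
         (3, 401, 230), (1, 101, 800), (2, 201, 300), (5, 101, 200),
         (4, 301, 100), (2, 201, 600)]                   # (OrderID, ItemID, Price)

# Assumption: hash indexes stand in for the B-tree indexes of a real system.
CHINA_CIDS = [cid for nation, cid in CUSTOMERS if nation == "China"]
ORDERS_BY_BUYER = defaultdict(list)
for buyer, order_id in ORDERS:
    ORDERS_BY_BUYER[buyer].append(order_id)
PRICES_BY_ORDER = defaultdict(list)
for order_id, _item_id, price in ITEMS:
    PRICES_BY_ORDER[order_id].append(price)


def one_walk(rng):
    """One random walk Customers -> Orders -> Items; returns price / probability."""
    cid = rng.choice(CHINA_CIDS)            # start from the selection predicate
    orders = ORDERS_BY_BUYER.get(cid)
    if not orders:
        return 0.0                          # failed walk contributes 0
    order_id = rng.choice(orders)
    prices = PRICES_BY_ORDER.get(order_id)
    if not prices:
        return 0.0
    price = rng.choice(prices)
    # Sampling probability is 1/|China| * 1/|orders| * 1/|items|,
    # so dividing by it multiplies the price by the three fanouts.
    return float(price * len(CHINA_CIDS) * len(orders) * len(prices))


def wander_join_sum(n_walks, confidence=0.95, seed=0):
    """Return (estimate, half_width) after n_walks independent walks."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n_walks):
        x = one_walk(rng)
        total += x
        total_sq += x * x
    mean = total / n_walks
    variance = max(total_sq / n_walks - mean * mean, 0.0)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * math.sqrt(variance / n_walks)


def exact_sum():
    """Full join, for comparison only."""
    return sum(price
               for cid in CHINA_CIDS
               for order_id in ORDERS_BY_BUYER.get(cid, [])
               for price in PRICES_BY_ORDER.get(order_id, []))
```

On these ten-row tables the full join gives 3900, and calling wander_join_sum with a growing number of walks yields estimates whose confidence interval shrinks around that value. Because the walks are independent, the interval needs only a running sum and sum of squares.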
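

Sketch: sampling an entry from a B-tree

The two B-tree sampling ideas from the backup slides (rejection sampling with an imagined maximum fanout, and non-uniform sampling with a known probability) can be sketched as follows. This is an illustrative toy, not Olken's or the authors' code: the tree is a nested Python list with all entries at the same depth, and TREE, sample_with_rejection and sample_nonuniform are hypothetical names. The example tree mirrors the figure with root fanout 3 and leaf nodes holding 4, 2 and 3 entries.

```python
import random
from fractions import Fraction

# Assumption: a toy B-tree as nested lists; inner lists are nodes,
# integers are leaf entries, and every entry sits at the same depth.
TREE = [[10, 11, 12, 13], [20, 21], [30, 31, 32]]
MAX_FANOUT = 4


def sample_with_rejection(root, max_fanout, rng):
    """Uniform sample of one entry: pretend every node has max_fanout
    children and restart as soon as the walk steps out of bound."""
    while True:
        node = root
        while isinstance(node, list):
            slot = rng.randrange(max_fanout)
            if slot >= len(node):
                node = None               # out of bound: reject this walk
                break
            node = node[slot]
        if node is not None:
            return node


def sample_nonuniform(root, rng):
    """Never rejects: pick a real child uniformly at each node and
    return the entry together with its exact sampling probability."""
    node, prob = root, Fraction(1)
    while isinstance(node, list):
        prob /= len(node)
        node = rng.choice(node)
    return node, prob
```

With rejection, every entry is reached with the same probability per attempt, so accepted samples are uniform at the cost of wasted walks. Without rejection, entries in small leaf nodes are sampled more often, but since the probability is returned with the entry, an estimator can divide by it and remain unbiased.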