Managing Data at Scale
Ke Yi and Dan Suciu
Dagstuhl 2016

Thomas' Questions
1. What are the key benefits from theory?
2. What are the key/tough problems?
3. What are the key math/FDM techniques?
4. How does this extend DB theory?
5. What should we teach our students?
6. What is a good foundational paper?

1. Key Benefits from Theory

DB Theory Mission: guide the development of new data management techniques, driven by changes in:
• What data processing tasks are needed
• How data is processed
• Where data is processed

What
• ML, data analytics (R/GraphLab/GraphX)
• Distributed state (replicated objects)
• Graphics, data visualization (Halide)
• Information extraction / text processing

How
• Cloud computing (Amazon, MS, Google)
• Distributed data processing (MapReduce, Spark, Pregel, GraphLab)
• Distributed transactions (Spanner)

Where
• Shared-nothing, distributed systems
• Shared memory
• Chip trends (SIMD, NVM, dark silicon)

2-3. Key Problems and Techniques

Some results/techniques from recent years (a subjective selection):
• Worst-case optimal algorithms
• Communication-optimal algorithms
• I/O-optimal algorithms + sampling (Ke)

AGM Bound [Atserias, Grohe, Marx 2011]

Worst-case output size:
• Fix a number N and a full conjunctive query Q
• Consider any database D such that |R1^D| ≤ N, |R2^D| ≤ N, …
• How large can |Q(D)| be?

Examples:
• Q(x,y,z) = R1(x,y), R2(y,z):  |Q| ≤ N^2
• Q(x,y,z,u) = R1(x,y), R2(y,z), R3(z,u):  |Q| ≤ N^2
• Q(x,y,z,u,v) = R1(x,y), …, R4(u,v):  |Q| ≤ N^3
• In general, for any edge cover of Q of size ρ:  |Q| ≤ N^ρ

Theorem [AGM]. If |Ri^D| ≤ N for i = 1, 2, …, then max_D |Q(D)| = N^{ρ*}, where ρ* is the value of the optimal fractional edge cover of Q's hypergraph.

Example: Q(x,y,z) = R(x,y), S(y,z), T(z,x)

Upper bound (fractional edge cover LP):
  minimize  w_R + w_S + w_T
  x:  w_R + w_T ≥ 1
  y:  w_R + w_S ≥ 1
  z:  w_S + w_T ≥ 1
The optimum is w_R = w_S = w_T = 1/2, so ρ* = 3/2 and |Q| ≤ N^{3/2}.

Thm. For any feasible w_R, w_S, w_T:  |Q| ≤ N^{w_R + w_S + w_T}.
Proof: Shearer's lemma (below).
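As a concrete aside (not from the slides), the cover LP can be solved mechanically. Below is a minimal sketch using scipy.optimize.linprog; the encoding of atoms as variable sets and the helper name fractional_edge_cover are our own:

```python
# Minimal sketch: compute the optimal fractional edge cover rho* of a
# query's hypergraph as a linear program. Each atom is encoded as the
# set of variables it contains (our own encoding, for illustration).
from scipy.optimize import linprog

def fractional_edge_cover(variables, atoms):
    """Minimize the sum of edge weights s.t. every variable is covered."""
    c = [1.0] * len(atoms)  # one weight w_e >= 0 per atom; minimize sum
    # Constraint per variable v: sum_{e containing v} w_e >= 1,
    # written as -sum w_e <= -1 for linprog's A_ub @ x <= b_ub form.
    A_ub = [[-1.0 if v in atom else 0.0 for atom in atoms] for v in variables]
    b_ub = [-1.0] * len(variables)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.fun, res.x  # rho*, optimal weights

# Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x)
rho_star, weights = fractional_edge_cover("xyz", [{"x", "y"}, {"y", "z"}, {"z", "x"}])
print(rho_star, weights)  # 1.5, [0.5, 0.5, 0.5]  =>  |Q| <= N^{3/2}
```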
Lower bound (fractional vertex packing LP, the dual):
  maximize  v_x + v_y + v_z
  R:  v_x + v_y ≤ 1
  S:  v_y + v_z ≤ 1
  T:  v_x + v_z ≤ 1

Thm. For any feasible v_x, v_y, v_z there is a database D with |Q| = N^{v_x + v_y + v_z}.
Proof: take the "free" instance
  R(x,y) = [N^{v_x}] × [N^{v_y}]
  S(y,z) = [N^{v_y}] × [N^{v_z}]
  T(z,x) = [N^{v_z}] × [N^{v_x}]
By LP duality the two optima coincide, so the bound N^{ρ*} = N^{3/2} is tight.

AGM Bound: the Upper Bound

Consider any instance D, and the uniform probability space whose outcomes are the tuples of Q(D).

Example: for Q(x,y,z) = R(x,y), S(y,z), T(z,x) with
  Q(D) = { (a,3,r), (a,3,q), (a,2,q), (b,2,q), (d,3,d) },
each output tuple has probability 1/5, and each relation inherits the marginal distribution of its projection, e.g. Pr[(x,y) = (a,3)] = 2/5 in R. (A relation may also contain tuples of probability 0, e.g. (4,q) ∈ S.)

Three things to know about entropy:
• H(X) ≤ log|Ω|, with equality for the uniform distribution
• H(X) ≤ H(X ∪ Y)
• H(X ∪ Y) + H(X ∩ Y) ≤ H(X) + H(Y)   (submodularity)

Shearer's lemma: if X1, X2, … k-cover X, then H(X1) + H(X2) + … ≥ k·H(X).

For the triangle, the sets {x,y}, {y,z}, {x,z} 2-cover {x,y,z}:
  3 log N ≥ log|R| + log|S| + log|T|
          ≥ H(xy) + H(yz) + H(xz)
          ≥ H(xyz) + H(y) + H(xz)
          ≥ H(xyz) + H(xyz) + H(∅)
          = 2·H(xyz) = 2 log|Q|
and therefore |Q| ≤ N^{3/2}.
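To make Shearer's inequality concrete, here is a small self-contained check (our own addition, not from the slides) of the entropy bound on the example instance above:

```python
# Verify Shearer's lemma on the uniform distribution over
# Q(D) = {(a,3,r), (a,3,q), (a,2,q), (b,2,q), (d,3,d)}.
from collections import Counter
from math import log2

Q = [("a", 3, "r"), ("a", 3, "q"), ("a", 2, "q"), ("b", 2, "q"), ("d", 3, "d")]

def H(coords):
    """Entropy of the marginal distribution on the given coordinates."""
    counts = Counter(tuple(t[i] for i in coords) for t in Q)
    n = len(Q)
    return -sum((c / n) * log2(c / n) for c in counts.values())

h_xy, h_yz, h_xz, h_xyz = H((0, 1)), H((1, 2)), H((0, 2)), H((0, 1, 2))
# {x,y}, {y,z}, {x,z} 2-cover {x,y,z}, so the sum of the three marginal
# entropies must be at least 2 * H(xyz) = 2 * log2(|Q|).
print(h_xy + h_yz + h_xz, ">=", 2 * h_xyz)  # ~5.77 >= 2*log2(5) ~ 4.64
```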
Worst-Case Optimal Algorithm [Ngo, Ré, Rudra]

GenericJoin(Q, D):
  if Q is a ground atom: return "true" iff Q is in D
  choose any variable x
  compute A = Π_x(R1) ∩ Π_x(R2) ∩ …
  for each a ∈ A: recurse on GenericJoin(Q[a/x], D)
  return the union of the results

Theorem. GenericJoin runs in time O(N^{ρ*}).
Note: every traditional query plan (one join at a time) is suboptimal!
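The following is a simplified, unindexed rendering of the pseudocode (our own sketch; the relation encoding and names are assumptions, and a real worst-case optimal join needs sorted indexes or tries to actually meet the O(N^{ρ*}) bound, which plain set scans do not):

```python
# Toy GenericJoin: bind variables one at a time; candidate values for a
# variable are the intersection of the projections of all atoms that
# mention it, restricted to tuples consistent with the current binding.
def generic_join(atoms, db, variables, binding=None):
    """atoms: list of (relation_name, vars); db: name -> set of tuples."""
    binding = binding or {}
    if len(binding) == len(variables):                 # all variables bound:
        return [tuple(binding[v] for v in variables)]  # emit one output tuple
    x = variables[len(binding)]                        # next variable to bind
    candidates = None
    for name, vs in atoms:
        if x not in vs:
            continue
        proj = {t[vs.index(x)] for t in db[name]
                if all(t[i] == binding[v] for i, v in enumerate(vs) if v in binding)}
        candidates = proj if candidates is None else candidates & proj
    out = []
    for a in sorted(candidates):
        out += generic_join(atoms, db, variables, {**binding, x: a})
    return out

# Q(x,y,z) = R(x,y), S(y,z), T(z,x) on a tiny instance
db = {"R": {(1, 2), (1, 3)}, "S": {(2, 4), (3, 4)}, "T": {(4, 1)}}
atoms = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "x"))]
print(generic_join(atoms, db, ("x", "y", "z")))  # [(1, 2, 4), (1, 3, 4)]
```

Note how this joins all relations at once rather than pairwise, which is exactly what lets it avoid intermediate results larger than the final output.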
Massively Parallel Communication (MPC) [Beame, Koutris, Suciu]

• Input data of size m (m plays the role of N above), uniformly partitioned: O(m/p) per server
• Number of servers: p
• One round = local computation, then communication
• An algorithm = several rounds
• Cost: the maximum communication load per round per server, L, and the number of rounds, r

Cost regimes:
  Ideal:      L = m/p                    r = 1
  Practical:  L = m/p^{1-ε}, ε ∈ (0,1)   r = O(1)
  Naïve 1:    L = m                      r = 1
  Naïve 2:    L = m/p                    r = p

Speedup:
• L = m/p gives linear speedup
• L = m/p^{1-ε} gives sub-linear speedup

Example: Cartesian product Q(x,y) = R(x) × S(y).
Arrange the p servers in a √p × √p grid, replicate each R-tuple along one row and each S-tuple along one column; then L = 2m/p^{1/2} = O(m/p^{1/2}).
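Here is a toy single-process simulation (our own sketch, not from the slides) of the grid shuffle just described, tracking the per-server load:

```python
# One-round grid ("HyperCube") shuffle for Q(x,y) = R(x) x S(y):
# servers form a sqrt(p) x sqrt(p) grid; each R-tuple is replicated
# along one row and each S-tuple along one column, so every (x, y)
# pair meets on exactly one server.
from collections import defaultdict
from math import isqrt

def hypercube_product(R, S, p):
    side = isqrt(p)                        # assumes p is a perfect square
    inbox = defaultdict(lambda: ([], []))  # (i, j) -> (R-part, S-part)
    for a in R:                            # replicate a along row hash(a) % side
        i = hash(a) % side
        for j in range(side):
            inbox[(i, j)][0].append(a)
    for b in S:                            # replicate b along column hash(b) % side
        j = hash(b) % side
        for i in range(side):
            inbox[(i, j)][1].append(b)
    load = max(len(r) + len(s) for r, s in inbox.values())
    output = [(a, b) for r, s in inbox.values() for a in r for b in s]
    return output, load                    # load ~ 2m / sqrt(p) when |R| = |S| = m

out, load = hypercube_product(range(100), range(100), 16)
print(len(out), load)  # 10000 pairs; per-server load 2*100/4 = 50
```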
Full Conjunctive Queries [Beame, Koutris, Suciu] (one-round case cf. [Afrati & Ullman])

One round, no skew (|S1| = |S2| = …); algorithms with load formulas based on fractional edge packings of the cardinalities m1, m2, …:
  Product:   L = m/p^{1/2}
  Join:      L = m/p
  Triangle:  L = m/p^{2/3}
  Multiway:  L = m/p^{1/τ*}
  Lower bound: matches the algorithms above.

One round, skewed data:
  Product:   L = m/p^{1/2}
  Join:      L = m/p^{1/2}
  Multiway:  L = m/p^{1/ψ*}
  Lower bound: matches the above.

Multiple rounds (r), skewed data:
  Product:   L = m/p^{1/2} with r = 1
  Triangle:  L = m/p^{2/3} with r = 2
  General:   L = ? (open)
  Lower bound: L ≥ m/p^{1/ρ*}; is the 1/r factor optimal?

Here τ* = fractional edge packing / vertex cover, and ρ* = fractional edge cover / vertex packing.

Notes:
• Speedup may improve with more rounds (e.g. from 1/p^{2/3} to 1/p).
• Skew increases the load and decreases the speedup.
• Multiple rounds mitigate skew for some queries.

Challenges, Open Problems
• Single server: deepen the connection to information theory (FDs, statistics)
• Shared memory: GenericJoin on a PRAM?
• Shared nothing: O(1) rounds, beyond conjunctive queries
• Explain: τ* governs skew-free data, ρ* governs skewed data

I/O-Optimal Algorithms
• Ke (covered in Ke Yi's part of the talk)

Key Techniques
• Convex optimization meets finite model theory meets information theory!
• Algorithms: a novel "all-joins-at-once" style
• Concentration bounds (Chernoff) for hash-function guarantees

Some Tough Problems…
…that no one is addressing in PODS/ICDT (why?):
• Transactions! Consistent, globally distributed data
  – Eventual consistency: efficient but wrong (NoSQL)
  – Strong consistency: correct but slow (Spanner, Postgres-XL)
  – Optimistic models: Parallel Snapshot Isolation (PSI)
• Design the "right" DSL for ML + data transformations
  – SystemML (IBM)? TensorFlow (Google)?
  – Expressive power / complexity / hierarchy?
• Design the "right" theoretical model for architectures to come:
  – SMP, NVM, dark silicon

What Alice is Missing
• Convex optimization
• Information theory
• Models of computation for query processing (beyond relational machines)
• Chernoff/Hoeffding bounds and beyond

Recipe for the Best PODS Paper
What do I know? If I knew, I would write it myself…
I only have thoughts about "types" of best papers:
• Best first paper
• Best technical paper
• Best last paper