QSX: Querying Social Graphs — Querying Big Graphs

Outline:
– Parallel scalability
– Making big graphs small
  • Bounded evaluability
  • Query-preserving graph compression

The impact of the sheer volume of big data

Using an SSD with a bandwidth of 6 GB/s, a linear scan of a data set D would take:
• 1.9 days when D is 1 PB (10^15 bytes)
• 5.28 years when D is 1 EB (10^18 bytes)

This is a departure from classical computational complexity theory. The traditional theory of almost 50 years classifies problems as:
• The good: polynomial-time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable, ...

Is it feasible to query real-life big graphs?

Parallel query answering

We can do better provided more resources. Using 10,000 SSDs of 6 GB/s each, a linear scan of D might take:
• 1.9 days / 10,000 ≈ 16 seconds when D is 1 PB
• 5.28 years / 10,000 ≈ 4.6 hours when D is 1 EB

Only ideally — why? Picture a shared-nothing architecture: 10,000 processors, each with its own memory and disk, connected by an interconnection network, with the database partitioned across them.

Do parallel algorithms always work? If not, is it still feasible to query big graphs? How to cope with the sheer volume of big graphs?
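The scan-time estimates above are simple bandwidth arithmetic; here is a minimal sketch that reproduces them (the 6 GB/s bandwidth figure is the one assumed on the slides):

```python
# Back-of-envelope scan-time estimates (assumed SSD bandwidth: 6 GB/s).
BANDWIDTH = 6e9  # bytes per second

def scan_days(size_bytes, processors=1):
    """Days for a linear scan of `size_bytes`, split evenly over `processors`."""
    return size_bytes / BANDWIDTH / processors / 86400

pb, eb = 1e15, 1e18
print(f"1 PB, 1 SSD:       {scan_days(pb):.1f} days")        # ~1.9 days
print(f"1 EB, 1 SSD:       {scan_days(eb) / 365:.2f} years")  # ~5.28 years
print(f"1 PB, 10000 SSDs:  {scan_days(pb, 10000) * 86400:.0f} seconds")
print(f"1 EB, 10000 SSDs:  {scan_days(eb, 10000) * 24:.1f} hours")
```

Note that dividing 5.28 years by 10,000 gives roughly 4.6 hours, not days — the speedup assumes a perfectly even partition, which the next slides question.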
Parallel scalability

Setting: a graph G is partitioned into fragments (G1, ..., Gn) and distributed to processors (S1, ..., Sn).
• Input: G = (G1, ..., Gn), distributed to (S1, ..., Sn), and a query Q
• Output: Q(G), the answer to Q in G

Complexity:
• t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
• T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors, including the cost of data shipment

Parallel scalable: T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k), where k is a constant.

When G is big, we can still query G by adding more processors, if we can afford them. A distributed algorithm is useful only if it is parallel scalable.

Degree of parallelism — speedup

Speedup for a given task: TS/TL, where
• TS: time taken by a traditional DBMS
• TL: time taken by a parallel system with more resources
More resources should mean proportionally less time for the task.

Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system; plotted as speed (throughput / response time) against resources, linear speedup is a straight line.

Question: can we do better than linear speedup?

Better than linear speedup?

No — it is even hard to achieve linear speedup/scaleup, for four reasons:
• Startup costs: initializing each process
• Interference: competing for shared resources (network, disk, memory, or even locks) — think of blocking in MapReduce
• Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part
• Data shipment cost: in a shared-nothing architecture

A closer look: Ullman's algorithm for subgraph isomorphism uses the adjacency matrix of the entire G. What if we break G into n fragments and leverage the data locality of subgraph isomorphism? Worst case: exponential in |G| and |Q| vs exponential in |G|/n and |Q| — a contradiction with the claim above? No: the worst-case complexity of a particular algorithm is not the same as the time really needed by a sequential algorithm.

Linear speedup is the best we can hope for — it is optimal!
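The skew argument can be made concrete with a tiny sketch (the fragment sizes below are made up for illustration): since all fragments are processed in parallel, the response time is governed by the largest fragment, not the average.

```python
# Why skew breaks linear speedup: response time = time of the largest part.
def response_time(fragment_sizes, time_per_unit=1.0):
    # All fragments run in parallel; the slowest one decides.
    return max(fragment_sizes) * time_per_unit

even   = [250, 250, 250, 250]  # perfectly balanced, 1000 units of total work
skewed = [700, 100, 100, 100]  # same total work, badly partitioned

sequential = sum(even)  # one processor does all 1000 units
print(sequential / response_time(even))    # speedup 4.0: linear in 4 processors
print(sequential / response_time(skewed))  # speedup ~1.43: bound by the big part
```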
Linear scalability

An algorithm T for answering a class Q of queries:
• Input: G = (G1, ..., Gn), distributed to (S1, ..., Sn), and a query Q
• Output: Q(G), the answer to Q in G
The more processors, the less response time.

Algorithm T is linearly scalable
• in computation, if its parallel complexity is a function of |Q| and |G|/n, and
• in data shipment, if the total amount of data shipped is a function of |Q| and n.
Both are independent of the size |G| of big G. Querying big data by adding more processors — but is this always possible?

Graph pattern matching via graph simulation
• Input: a graph pattern Q and a graph G
• Output: Q(G), a binary relation S on the nodes of Q and G such that
  – each node u in Q is mapped to a node v in G with (u, v) ∈ S, and
  – for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S
Computable in O((|V| + |VQ|)(|E| + |EQ|)) time. Is it parallel scalable?

Impossibility

There exists no algorithm for distributed graph simulation that is parallel scalable in either computation or data shipment. Why? Consider a pattern of 2 nodes and a graph of 2n nodes distributed to n processors.

Possibility: when G is a tree, graph simulation is parallel scalable in both response time and data shipment.

It is nontrivial to develop parallel scalable algorithms. What can we do if parallel scalability is beyond reach?

Making big graphs small

The cost of query answering
• Input: a query Q and a graph G
• Output: the answer Q(G) to Q in G — too costly when G is big

The cost of computing Q(G) is a function f(|G|, |Q|). What should we do?
• Find a lower function for f, i.e., develop a faster algorithm?
• Reduce the size of |Q|?
• Reduce the size of G: compute Q(GQ) instead of Q(G).
Reduce the cost of computing Q(G) by making G small!

Making big graphs small
• Input: a class Q of queries
• Question: can we effectively find, given any query Q ∈ Q and any (possibly big) graph G, a small GQ — much smaller than G — such that Q(G) = Q(GQ)?
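The simulation semantics above can be sketched as a fixpoint computation: start from all label-compatible pairs and repeatedly discard pairs that violate the edge condition. This is a naive sequential sketch, not the optimized algorithm behind the stated complexity bound.

```python
from collections import defaultdict

def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Maximum simulation relation of pattern Q in graph G (naive sketch).
    q_nodes/g_nodes: {node: label}; q_edges/g_edges: sets of (u, v) pairs."""
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    # Start from label-compatible candidates for each pattern node.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        # Remove (u, v) if some pattern edge (u, u2) has no match (v, v2).
        for u, u2 in q_edges:
            for v in list(sim[u]):
                if not (g_succ[v] & sim[u2]):
                    sim[u].remove(v)
                    changed = True
    return sim

pattern = ({'u1': 'A', 'u2': 'B'}, {('u1', 'u2')})
graph = ({'a1': 'A', 'a2': 'A', 'b1': 'B'}, {('a1', 'b1')})
print(graph_simulation(*pattern, *graph))  # a2 has no B-successor, so only a1 matches u1
```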
This is particularly useful for a single dataset G, e.g., the social graph of Facebook. A minimum GQ is the necessary amount of data for answering Q. How to make G small?

The essence of parallel query answering

Given a big graph G and n processors S1, ..., Sn:
• G is partitioned into fragments (G1, ..., Gn)
• G is distributed to the n processors: Gi is stored at Si

Parallel query answering:
• Input: G = (G1, ..., Gn), distributed to (S1, ..., Sn), and a query Q
• Output: Q(G), the answer to Q in G
Each processor Si processes its local fragment Gi in parallel, and |G|/n is much smaller than |G|. Dividing a big G into small fragments Gi of manageable size is itself a way of making big graphs small. But what can we do if parallel scalability is beyond reach for our queries?

How to make big graphs small

A number of methods to find, given queries Q ∈ Q and any (possibly big) graph G, a small GQ — much smaller than G — such that Q(G) = Q(GQ):
• Distributed query processing
• Boundedly evaluable graph queries
• Query-preserving graph compression
• Query answering using views
• Bounded incremental evaluation
We have seen one of the methods: parallel query answering. Other methods follow in the next two lectures. These are effective methods for making big graphs small.

Making big graphs small: what do we need?

To find, for a query Q ∈ Q and a graph G, a small GQ with Q(G) = Q(GQ), the time taken to find GQ should itself be independent of |G|. Why? Otherwise finding GQ can be as costly as querying G directly. This is not very likely in the absence of auxiliary information. How to characterize this?

Boundedly evaluable queries
• Input: a class Q of queries and an access schema A
• Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) graph G, a fraction GQ of G such that
  – |GQ| is independent of |G|,
  – Q(G) = Q(GQ), and
  – GQ can be identified in time determined by Q and A alone?
A closer look
• GQ does not get bigger when G grows, so Q(GQ) can be efficiently computed
• The time taken to find GQ does not increase when G grows
Is this possible in practice?

Example: subgraph isomorphism

A movie database represented as a graph, for movies from 1880 to 2014:
– Nodes: movies, casts (actors, actresses), awards, etc.
– Edges: relationships between the nodes
5.1 million nodes and 19.5 million edges.

Query: find pairs of leading actors and actresses from the same country who starred in an award-winning movie released in 2011-2014 — i.e., find all matches of a pattern with nodes award, year (in 2011-2014), movie, actor, actress, and country.

Example: access constraints

Real-life limits on the same data; build indices accordingly:
• C1: an award is presented to no more than 4 movies each year
• C2: each movie has at most 30 leading actors and actresses
• C3: each person has only one country of origin
• C4-C6: there are no more than 134 years (1880 to 2014), 24 major awards, and 196 countries in the graph
These hold on the entire graph, regardless of the queries posed on it.

Example: a query plan

By using the indices, a plan for the pattern proceeds in four steps:
1. Fetch a set V1 of 134 year nodes, 24 award nodes, and 196 country nodes.
2. Fetch a set V2 of at most 24 × 3 × 4 = 288 award-winning movies released in 2011-2014, with at most 288 × 2 associated edges, by using the award and year nodes in V1.
3. Fetch a set V3 of at most (30 + 30) × 288 = 17,280 actors and actresses, with 17,280 edges, using the nodes in V2.
4. Connect the actors and actresses in V3 to the country nodes in V1, with at most 17,280 edges — this yields GQ.

Visit at most 17,922 nodes and 35,136 edges, using indices — as opposed to 5.1 million nodes and 19.5 million edges.

Access constraints: S → (l, N)

An access constraint combines a cardinality constraint and an index:
• S: a set of node labels; l: another label; N: a natural number, the cardinality
• Semantics: G satisfies S → (l, N) if, for any set Vs of nodes in G with distinct labels in S, there exist at most N common neighbours of Vs with label l, each connected by an edge to each node in Vs
• There is an index on S for l: for each such set Vs, find all common neighbours labelled l in O(N) time

An access schema A is a set of access constraints.

Example: access constraints

The real-life limits above, as access constraints:
• C1: (year, award) → (movie, 4)
• C2: movie → (actor/actress, 30)
• C3: actor/actress → (country, 1)
• C4-C6: ∅ → (year, 134), ∅ → (award, 24), ∅ → (country, 196)
Useful special cases: ∅ → (l, N) and l → (l', N). Build indices accordingly.

Discovering an access schema S → (l, N)
• Functional dependencies X → Y, e.g., movie → (year, 1): shred graphs to relations and use, e.g., TANE
• Degree bounds: l → (l', N) if every node with label l has degree at most N, for any label l'
• ∅ → (l, N): very common, e.g., ∅ → (country, 196)
• Aggregate queries: grouping by (year, award), we find (year, award) → (movie, 4)
• Real-life bounds: 5000 friends per person (Facebook), ...
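The worst-case fetch sizes in the query plan follow directly from the access constraints; a minimal sketch of the arithmetic (constants as on the slides, which count 3 award years in the 2011-2014 window):

```python
# Worst-case bounds for the example query plan, derived from C1-C6.
V1 = 134 + 24 + 196        # year, award, and country nodes (C4-C6)
V2 = 24 * 3 * 4            # award-winning movies in range (C1): 288
V3 = V2 * (30 + 30)        # leading actors and actresses (C2): 17280

nodes = V1 + V2 + V3       # total nodes visited
edges = 2 * V2 + V3 + V3   # movie-award/year, movie-cast, cast-country edges
print(nodes, edges)        # 17922 35136, vs 5.1M nodes and 19.5M edges in G
```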
How to maintain access constraints in response to changes to graphs? Changes are local: only the common neighbours of the updated nodes are affected.

Generating query plans

A query plan P for a query Q is a sequence of fetch operations

  fetch(u, Vs, C, q(u))

Given a set Vs of nodes fetched earlier, fetch all common neighbours of Vs that are candidates for pattern node u, by using access constraint C and its index; the fetched nodes must satisfy the search condition q(u) of u, e.g., year in [2011, 2014]. The fetch operations construct GQ; then we compute Q(GQ). This is efficient by using the indices.

Boundedly evaluable: a query Q is boundedly evaluable if there exists a query plan under an access schema A such that, for all graphs G that satisfy A,
• its fetch operations find a GQ with Q(GQ) = Q(G), and
• the time for all fetch operations is determined by Q and A only, independent of |G|.
The four-step plan of the movie example is boundedly evaluable: its cost is independent of |G|, no matter how big G grows!

An approach to querying big graphs

Given a query Q and an access schema A:
1. Decide whether Q is boundedly evaluable under A
2. If so, generate a bounded query plan P for Q
3. Given any graph G, use the query plan P to
   a) fetch GQ, and
   b) compute Q(GQ)

Are we done yet? Questions: what is the complexity of
– deciding bounded evaluability?
– generating a boundedly evaluable query plan?
Are these independent of |G|?

Deciding bounded evaluability
• Input: a query Q and an access schema A
• Question: is Q boundedly evaluable under A?
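A fetch operation can be sketched as a lookup into a constraint index. This is a hypothetical, simplified shape — the index here is keyed by a single node rather than a node set Vs, and the function name and data are illustrative, not the paper's API:

```python
# Sketch of fetch(u, Vs, C, q(u)): union of indexed neighbour sets,
# filtered by the search condition q(u).
def fetch(fetched, idx, cond=lambda v: True):
    """fetched: nodes obtained by earlier fetches; idx: constraint index
    mapping a node to its <= N common neighbours with the target label."""
    out = set()
    for v in fetched:
        out.update(w for w in idx.get(v, ()) if cond(w))
    return out

# Hypothetical award -> movie index (a simplified stand-in for the
# (year, award) -> (movie, 4) index of the example).
movie_idx = {'oscar': ['m1', 'm2'], 'bafta': ['m2', 'm3']}
print(sorted(fetch({'oscar', 'bafta'}, movie_idx)))  # ['m1', 'm2', 'm3']
```

Each lookup costs O(N) by the access-constraint semantics, so a plan's total cost depends only on Q and A, never on |G|.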
For graph pattern matching via subgraph isomorphism, with Q = (VQ, EQ) small in real life:
• Positive: decidable in O(|A| |VQ| |EQ|) time, independent of any graph G
• Characterization: Q is boundedly evaluable under A iff VCov(Q, A) = VQ and ECov(Q, A) = EQ, where
  – the nodes covered by A are computed from the ∅ → (l, N) constraints first, and then inductively by the other constraints in A, and
  – an edge (u1, u2) is covered by A if one of its endpoints is in VCov and the other has a bounded number of candidates by A.
Deciding bounded evaluability is independent of |G|.

Generating a boundedly evaluable query plan
• Input: a boundedly evaluable query Q (graph pattern matching via subgraph isomorphism, Q = (VQ, EQ)) and an access schema A
• Output: a boundedly evaluable query plan P for Q under A
• Positive: in O(|A| |EQ| + |A| |VQ|^2) time, independent of any graph G
Inductively identify covered nodes and edges and, at each step, generate a corresponding fetch operation. Always possible? Yes, since Q has been decided boundedly evaluable under A. Query plan generation is independent of |G|.

Instance-boundedness in a graph G

Recall the approach: (1) decide whether Q is boundedly evaluable under A; (2) if so, generate a bounded query plan P for Q. Can we do anything if Q is not boundedly evaluable under A?

Extend A to AM by adding constraints of the form ∅ → (l, M) and l → (l', M), where M may depend on |G|, such that G satisfies AM. A query Q is M-bounded in G if there is a GQ of G such that Q(G) = Q(GQ), and GQ can be found in time determined by Q and AM.

For any finite set Q of pattern queries, access schema A, and graph G satisfying A, there exists an M, at most LQ(LQ + 1)/2 where LQ is the number of labels in G, such that all queries in Q are M-bounded in G under AM.

Instance-bounded: on an individual graph, e.g., Facebook.

Effectiveness of bounded evaluability

Graph pattern matching via subgraph isomorphism enjoys data locality. Does the same approach work for graph simulation, which lacks data locality? With revised node and edge covers, all the results remain intact for graph pattern matching via simulation. How effective is this approach?
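The coverage check can be illustrated by a heavily simplified sketch: it handles only the two special constraint forms ∅ → (l, N) and l' → (l, N), whereas the actual VCov/ECov characterization handles general S → (l, N) constraints. All names and the fixpoint formulation here are illustrative assumptions:

```python
# Simplified node-coverage check: a pattern node is covered if its label
# has a global bound (l, N), or it has a covered neighbour whose label
# bounds it via l' -> (l, N). A fully covered pattern is what bounded
# evaluability requires (simplified from the VCov/ECov characterization).
def vcov(pattern_labels, pattern_edges, global_bounds, degree_bounds):
    """pattern_labels: {node: label}; pattern_edges: set of (u, v);
    global_bounds: labels l with a constraint (l, N);
    degree_bounds: (l', l) pairs for constraints l' -> (l, N)."""
    covered = {u for u, l in pattern_labels.items() if l in global_bounds}
    both_dirs = pattern_edges | {(v, u) for u, v in pattern_edges}
    changed = True
    while changed:
        changed = False
        for u, v in both_dirs:
            if u in covered and v not in covered and \
               (pattern_labels[u], pattern_labels[v]) in degree_bounds:
                covered.add(v)
                changed = True
    return covered
```

On the movie pattern, year/award/country are covered by the global bounds, awards then cover movies, and movies cover the cast, so the whole pattern is covered.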
Experimental findings:
• 60% of subgraph queries and 33% of simulation queries are boundedly evaluable under a small access schema
• Improvement: 4 orders of magnitude for subgraph queries (up to 28,587 times faster) and 3 orders of magnitude for simulation queries
• A small M, of 0.016% of |G|, makes all tested queries M-bounded
Bounded evaluability is effective for graph pattern queries.

Query-preserving graph compression

Dynamic reduction vs. uniform reduction

Bounded evaluability is a dynamic reduction on a dataset D:
• Given a query Q, identify and fetch a minimum subset DQ of D that has sufficient information for answering Q in D
Uniform reduction on a dataset D:
• Identify and compute a minimum DC such that, for all queries Q posed on D, DC has sufficient information to find the answers to Q in D
What is the benefit of each? Questions:
• DQ is typically smaller than DC — why?
• DC is computed once offline, and then we don't have to worry about it — is this claim true?
• Is there any effective uniform reduction to query big data?

Graph compression

The cost of query processing is f(|G|, |Q|). It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?

A compression scheme is a pair <R, P>:
• For a graph G, GC = R(G) — compressing
• For any query Q, Q(G) = P(Q(GC)) — post-processing
Lossless compression must allow restoring G from GC, so GC is not much smaller than G. We want a query-friendly compression instead: one that requires no decompression of GC back to G. Compress big G into a smaller GC.

Query-preserving graph compression

A query-preserving compression <R, P> for a class Q of queries:
• For any graph G, GC = R(G) — compressing
• For any Q in Q, Q(G) = P(Q(GC)) — post-processing
Compress G with respect to a particular query class Q.

What is new about query-preserving compression?
A query-preserving compression <R, P> for a class L of queries — for any graph G, GC = R(G), and for any Q in L, Q(G) = P(Q(GC)) — is relative to a class L of queries of the users' choice:
• Better compression ratio: GC keeps only the information needed for answering queries in L
• No need to decompress: for any Q in L, Q(GC) can be computed directly; in contrast to lossless compression, there is no need to restore the original graph G
• Any algorithms and indexing structures for G can be used for GC
• GC is computed once for all queries Q in L, and can be incrementally maintained
Compress G relative to your queries.

Reachability queries
• Input: a directed graph G and a pair of nodes s and t in G
• Question: does there exist a path from s to t in G?
Answerable in O(|V| + |E|) time.

Equivalence relation:
• Reachability relation Re: a node pair (u, v) ∈ Re iff u and v have the same set of ancestors and descendants in G
• For any graph G, there is a unique maximum Re, i.e., the reachability equivalence relation of G
Compress G by leveraging this equivalence relation.

Algorithm:
1. Compute Re and its equivalence classes, in O(|V||E|) time
2. Construct a node of GC for each equivalence class
3. Construct GC by connecting the equivalence classes
(The slides illustrate this on a graph of managers MSA1-MSA2, sales agents BSA1-BSA2 and FA1-FA4, and customers C1-Ck.)

Reachability-preserving compression

A reachability-preserving compression R for G:
– R maps each node in G to its reachability equivalence class in GC (the nodes of GC are the equivalence classes), and each edge to an edge between two equivalence classes
Correctness:
– For any query QR(v, w) over G, v can reach w in G iff R(v) can reach R(w) in GC
– Compression R runs in quadratic time
– No post-processing function P is required
Reduction: 95% on average for reachability queries. How does it look in real life?
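The reachability-preserving compression can be sketched directly from its definition: merge nodes with identical ancestor and descendant sets. This naive version computes the transitive closure by repeated expansion (cubic time), whereas the slides' algorithm runs in O(|V||E|):

```python
def reachability_compress(nodes, edges):
    """Merge nodes with the same ancestors and descendants; reachability
    between classes in the compressed graph equals reachability in G."""
    succ = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
    # Naive transitive closure: expand descendant sets to a fixpoint.
    desc = {v: set(succ[v]) for v in nodes}
    changed = True
    while changed:
        changed = False
        for v in nodes:
            new = set().union(*(desc[w] for w in desc[v])) if desc[v] else set()
            if not new <= desc[v]:
                desc[v] |= new
                changed = True
    anc = {v: {u for u in nodes if v in desc[u]} for v in nodes}
    # Equivalence-class signature: (ancestor set, descendant set).
    sig = {v: (frozenset(anc[v]), frozenset(desc[v])) for v in nodes}
    classes = {}
    for v in nodes:
        classes.setdefault(sig[v], set()).add(v)
    cls_of = {v: s for s, members in classes.items() for v in members}
    c_edges = {(cls_of[u], cls_of[v]) for u, v in edges if cls_of[u] != cls_of[v]}
    return classes, c_edges
```

On a diamond a → {b, c} → d, the nodes b and c share ancestors {a} and descendants {d}, so they collapse into one class and 4 edges become 2 — no post-processing is needed to answer reachability on the result.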
18 times faster on average for reachability queries.

Graph pattern matching by graph simulation
• Input: a directed graph G and a graph pattern Q
• Output: the maximum simulation relation R

Bisimulation: a binary relation B over the nodes V of G such that, for each node pair (u, v) ∈ B,
• L(u) = L(v),
• for each edge (u, u') ∈ E there exists (v, v') ∈ E with (u', v') ∈ B, and
• for each edge (v, v') ∈ E there exists (u, u') ∈ E with (u', v') ∈ B
Equivalence relation Rb: the unique maximum bisimulation relation. Compress G by leveraging this equivalence relation.

Compression for simulation

A query-preserving compression <R, P> for graph pattern matching via simulation:
• Compression function R(G): computes the maximum bisimulation relation on the nodes of G — an equivalence relation — and constructs GC whose nodes denote the equivalence classes, in O(|E| log |V|) time
• Post-processing function P(Q, GC): makes use of the inverse of R; nodes in Q(GC) are expanded to the nodes in their equivalence classes, in time linear in the size of Q(G)
Results: a 57% reduction on average and 2.3 times faster for graph pattern matching via simulation. Does a similar scheme exist for subgraph isomorphism?

Summing up

Summary and review
• What is parallel scalability? Why do we care about it? Study some parallel algorithms; show that they are parallel scalable if they are, and disprove it otherwise.
• Why do we want to make big graphs small? How can we do it?
• What is bounded evaluability of queries? What auxiliary structures do we need to make queries boundedly evaluable?
• What is query-preserving graph compression? Is it lossless? Do we lose information when using such a compression scheme?
• How to develop query-preserving graph compression schemes?
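The compression function for simulation computes the maximum bisimulation as a partition refinement; a naive sketch follows (Paige-Tarjan-style refinement achieves the O(|E| log |V|) bound, this version does not):

```python
def bisimulation_classes(labels, edges):
    """Coarsest forward bisimulation partition of G by naive refinement.
    labels: {node: label}; edges: set of (u, v). Each resulting block is
    one node of the compressed graph GC."""
    succ = {v: set() for v in labels}
    for u, v in edges:
        succ[u].add(v)
    # Start from the label partition, then split any block whose members
    # reach different sets of blocks, until the partition stabilises.
    blocks = {}
    for v, l in labels.items():
        blocks.setdefault(l, set()).add(v)
    partition = list(blocks.values())
    while True:
        block_of = {v: i for i, b in enumerate(partition) for v in b}
        sig = {v: frozenset(block_of[w] for w in succ[v]) for v in labels}
        refined = []
        for b in partition:
            groups = {}
            for v in b:
                groups.setdefault(sig[v], set()).add(v)
            refined.extend(groups.values())
        if len(refined) == len(partition):  # fixpoint: no block was split
            return refined
        partition = refined
```

For example, two A-labelled nodes that each point to a B-labelled leaf fall into one class, so the compressed graph has 2 nodes; if only one of them has the outgoing edge, they are separated.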
Project (1)

Bounded evaluability. Recall keyword search via distinct-root trees (bounded by a fixed depth k; see Lectures 2 and 4).
• Develop an algorithm for keyword search based on access constraints; show that such queries can be boundedly evaluated
• Develop optimization strategies
• Develop a parallel version of your algorithm, in whatever model you like (MapReduce, BSP, GRAPE)
• Experimentally evaluate your algorithms, especially their scalability with the size of G
• Write a survey on various methods for keyword search with distinct-root trees, as part of the related work
A research and development project.

Project (2)

Recall graph pattern matching by subgraph isomorphism (Lecture 3).
• Develop a query-preserving compression scheme for subgraph isomorphism
• Implement your compression scheme and an algorithm for graph pattern matching via subgraph isomorphism, based on your query-preserving compression scheme
• Experimentally evaluate your compression scheme and evaluation algorithm, especially their scalability with the size of G
• Write a survey on graph compression schemes, as part of the related work
A research and development project.

Project (3)

Combine query-preserving compression and distributed algorithms for reachability queries (Lecture 5).
• Develop a framework for answering reachability queries, with
  – a query-preserving compression scheme to reduce graphs,
  – a distributed algorithm for answering reachability queries, and
  – an incremental algorithm to maintain compressed graphs in response to changes to the original graphs
• Implement the framework with all three algorithms
• Experimentally evaluate your method for answering reachability queries, especially its scalability with the size of G
• Write a survey on graph compression schemes and distributed algorithms for reachability queries, as part of the related work
A development project.

Papers for you to review
• M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh.
SCADS: Scale-independent storage for social computing applications. In CIDR, 2009. http://arxiv.org/ftp/arxiv/papers/0909/0909.1775.pdf
• M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, 2013. http://www.cs.albany.edu/~jhh/courses/readings/armbrust.sigmod13.incremental.pdf
• S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013. https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
• Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. http://pages.cs.wisc.edu/~jignesh/publ/summarization.pdf
• Y. Cao, W. Fan, and R. Huang. Making pattern queries bounded in big graphs. In ICDE, 2015. (bounded evaluability)
• W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression. In SIGMOD, 2012. (query-preserving compression)