
QSX: Querying Social Graphs
Querying Big Graphs
 Parallel scalability
 Making big graphs small
– Bounded evaluability
– Query-preserving graph compression
The impact of the sheer volume of big data
Using an SSD with a throughput of 6GB/s, a linear scan of a data set D would take
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
A departure from classical computational complexity theory

Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Is it feasible to query real-life big graphs?
Parallel query answering
We can do better provided more resources
Using 10000 SSDs of 6GB/s each, a linear scan of D might take:
• 1.9 days/10000 ≈ 16 seconds when D is of 1PB (10^15 B)
• 5.28 years/10000 ≈ 4.63 hours when D is of 1EB (10^18 B)
Only ideally, why?
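The scan times quoted above follow from dividing the data size by the aggregate bandwidth. A minimal sketch of that arithmetic, assuming 1PB = 10^15 bytes, 1EB = 10^18 bytes, 6GB/s = 6 × 10^9 bytes per second per SSD, and a perfectly parallel scan with no coordination cost (the helper name `scan_seconds` is ours, not from the lecture):

```python
def scan_seconds(size_bytes: float, n_ssd: int = 1, rate: float = 6e9) -> float:
    """Ideal linear-scan time in seconds: data size over aggregate bandwidth."""
    return size_bytes / (n_ssd * rate)

PB, EB = 10**15, 10**18
DAY, HOUR, YEAR = 86_400, 3_600, 365 * 86_400

print(f"1PB, 1 SSD:      {scan_seconds(PB) / DAY:.2f} days")
print(f"1EB, 1 SSD:      {scan_seconds(EB) / YEAR:.2f} years")
print(f"1PB, 10000 SSDs: {scan_seconds(PB, 10_000):.1f} seconds")
print(f"1EB, 10000 SSDs: {scan_seconds(EB, 10_000) / HOUR:.2f} hours")
```

The division by 10000 is the ideal case only; it ignores startup, interference, skew and data shipment.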
[Figure: 10,000 processors (P), each with memory (M) and a database (DB), connected by an interconnection network]
Do parallel algorithms always work?
If not, is it still feasible to query big graphs?
How to cope with the sheer volume of big graphs?
Parallel scalability
Input: G = (G1, …, Gn), a partition of G distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Complexity
• t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
• T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors
• Parallel scalable: if
T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k),
including the cost of data shipment, where k is a constant
When G is big, we can still query G by adding more processors if we
can afford them
A distributed algorithm is useful if it is parallel scalable
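To see how the two terms of the bound behave as processors are added, here is a minimal sketch that evaluates t/n + c·(n + |Q|)^k. It only illustrates the shape of the bound: the function name `parallel_bound`, the constant `c`, and the numbers for the sequential cost and |Q| are hypothetical.

```python
def parallel_bound(t_seq: float, n: int, q: int, k: int = 1, c: float = 1.0) -> float:
    """Shape of the parallel-scalability bound: t/n + c * (n + |Q|)^k.

    The constants hidden by the O-notation are assumptions (1 and c).
    """
    return t_seq / n + c * (n + q) ** k

t_seq, q = 1e9, 10  # hypothetical sequential cost and query size
for n in (1, 10, 100, 1_000, 10_000):
    print(n, parallel_bound(t_seq, n, q))
```

The first term shrinks with n while the second grows with n but never with |G|, which is why adding processors keeps paying off as G gets bigger.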
Degree of parallelism -- speedup
Speedup: for a given task, TS/TL,
• TS: time taken by a traditional DBMS
• TL: time taken by a parallel system with more resources
• TS/TL: more resources mean proportionally less time for a task
• Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system
[Figure: speed (throughput, response time) plotted against resources, showing linear speedup]
Question: can we do better than linear speedup?
Better than linear speedup?
NO, even hard to achieve linear speedup/scaleup!
Give 4 reasons
• Startup costs: initializing each process
• Interference: competing for shared resources (network, disk, memory or even locks)
Think of blocking in MapReduce
• Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part
• Data shipment cost: in a shared-nothing architecture
A closer look: Ullman's algorithm for subgraph isomorphism, which uses the adjacency matrix for the entire G.
• Worst-case: exponential in |G| and |Q|
• What if we break G into n fragments and leverage the data locality of subgraph isomorphism? Then exponential in |G|/n and |Q|, vs exponential in |G| and |Q|! Contradiction?
• No: the worst-case complexity of a particular algorithm vs the time really needed by a sequential algorithm
Linear speedup is the best we can hope for -- optimal!
Linear scalability
An algorithm T for answering a class Q of queries
• Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
• Output: Q(G), the answer to Q in G
Algorithm T is linearly scalable in
• computation if its parallel complexity is a function of |Q| and |G|/n (the more processors, the less response time), and in
• data shipment if the total amount of data shipped is a function of |Q| and n (independent of the size |G| of big G)
Is it always possible?
Querying big data by adding more processors
Graph pattern matching via graph simulation
• Input: a graph pattern Q and a graph G
• Output: Q(G) is a binary relation S on the nodes of Q and G such that
– each node u in Q is mapped to a node v in G, such that (u, v) ∈ S
– for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G, such that (u', v') ∈ S
Computable in O((|V| + |VQ|)(|E| + |EQ|)) time
Parallel scalable?
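The two conditions above can be turned into a fixpoint computation: start from all candidate pairs and remove pairs that violate the edge condition until nothing changes. The sketch below is a naive version for illustration only; it does not achieve the O((|V| + |VQ|)(|E| + |EQ|)) bound, and it assumes that nodes carry labels and that a pattern node can only match graph nodes with the same label (the slide does not spell this out). All names are ours.

```python
def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Maximum simulation relation of pattern Q in graph G (naive fixpoint).

    q_nodes / g_nodes: dict node -> label; q_edges / g_edges: iterable of (src, dst).
    Returns dict u -> set of matches of u, or {} if some pattern node has no match.
    """
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Assumption: candidates of u are the graph nodes carrying u's label.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v remains a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim if all(sim.values()) else {}

q_nodes = {"ua": "A", "ub": "B"}
q_edges = [("ua", "ub")]
g_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
g_edges = [("a1", "b1")]
sim = graph_simulation(q_nodes, q_edges, g_nodes, g_edges)
print(sim)
```

Here `a2` is dropped because it has no B-labelled successor, while `b2` stays because the pattern node `ub` has no outgoing edge to satisfy.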
Impossibility
There exists NO algorithm for distributed graph simulation that is parallel scalable in either
• computation, or
• data shipment
Why? Pattern: 2 nodes. Graph: 2n nodes, distributed to n processors
Possibility: when G is a tree, parallel scalable in both response time and data shipment
Nontrivial to develop parallel scalable algorithms
What can we do if parallel scalability is beyond reach?
Making big graphs small
The cost of query answering
Input: A query Q and a graph G
Output: The answer Q(G) to Q in G
The cost of computing Q(G): a function f(|G|, |Q|), too costly when G is big
What should we do?
• Find a lower function for f? Develop faster algorithms
• Reduce the size of |Q|?
• Reduce the size of G: compute Q(GQ) on a small GQ instead of Q(G) on G
Reduce the cost of computing Q(G) by making G small!
Making big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q in this class and any (possibly big) graph G, a small GQ (much smaller than G) such that
• Q(G) = Q(GQ)?
Particularly useful for
• A single dataset G, e.g., the social graph of Facebook
• Minimum GQ – the necessary amount of data for answering Q
How to make G small?
The essence of parallel query answering
Given a big graph G, and n processors S1, …, Sn
• G is partitioned into fragments (G1, …, Gn)
• G is distributed to n processors: Gi is stored at Si
Parallel query answering
• Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
• Output: Q(G), the answer to Q in G
Each processor Si processes its local fragment Gi in parallel; Gi is of size |G|/n, much smaller
[Figure: Q(G) is computed from Q(G1), Q(G2), …, Q(Gn)]
Dividing a big G into small fragments Gi of manageable size
What can we do if parallel scalability is beyond reach for our queries?
How to make big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q in this class and any (possibly big) graph G, a small GQ (much smaller than G) such that
• Q(G) = Q(GQ)?
A number of methods
• Distributed query processing
• Boundedly evaluable graph queries
• Query preserving compression
• Query answering using views
• Bounded incremental evaluation
• …
We have seen one of the methods: parallel query answering
Other methods – in the next two lectures
Effective methods for making big graphs small
Making big graphs small
What do we need
Input: A class Q of queries
Question: Can we effectively find, given any query Q in this class and any (possibly big) graph G, a small GQ (much smaller than G) such that
• Q(G) = Q(GQ)?
How to find GQ?
• The time taken to find GQ should be independent of |G|. Why?
• Not very likely in the absence of auxiliary information
How to characterize this?
Boundedly evaluable queries
Input: A class Q of queries, an access schema A
Question: Can we find by using A, for any query Q in this class and any (possibly big) graph G, a fraction GQ of G such that
• |GQ| is independent of |G|,
• Q(G) = Q(GQ), and moreover,
• GQ can be identified (effectively found) in time determined by Q and A?
A closer look
• GQ does not get bigger when G grows -- Q(GQ) can be efficiently computed
• The time taken on finding GQ does not increase when G grows
Is this possible in practice?
Example: subgraph isomorphism
A movie database represented as a graph, for movies from 1880 -- 2014
– Nodes: movies, casts (actors, actresses), awards, etc
– Edges: relationships between the nodes
5.1 million nodes and 19.5 million edges
Find pairs of leading actors and actresses who are from the same country and starred in an award-winning movie released in 2011-2014
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
Find all matches of the pattern in the graph
Example: access constraints
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
Real-life limits
• C1: an award is presented to no more than 4 movies each year
• C2: each movie has at most 30 leading actors and actresses
• C3: each person has only one country of origin
• C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph
Build indices accordingly
These hold on the entire graph, regardless of queries posed on it
Example: a query plan
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
By using the indices
1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries
2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1
3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2
4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges -- GQ
Visit at most 17922 nodes and 35136 edges, using indices, as opposed to 5.1 million nodes and 19.5 million edges
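The totals of 17922 nodes and 35136 edges can be rechecked from the constraint bounds alone. A minimal sketch of that arithmetic follows; the variable names are ours, the factor 3 is taken as-is from the plan's `24 * 3 * 4`, and the node total only adds up with the 196 countries stated in C4-6.

```python
# Bounds taken from constraints C1-C6 and the query plan above.
years_in_g, awards, countries = 134, 24, 196   # C4-6
movies_per_award_year = 4                      # C1
cast_per_movie = 30 + 30                       # actors + actresses, as in step 3
years_factor = 3                               # factor used in the plan's 24 * 3 * 4

v1 = years_in_g + awards + countries                 # step 1
v2 = awards * years_factor * movies_per_award_year   # step 2
v3 = cast_per_movie * v2                             # step 3

nodes = v1 + v2 + v3
# 2 edges per movie (to award and year), one edge per cast member to a movie,
# and one edge per cast member to a country (step 4).
edges = 2 * v2 + v3 + v3
print(nodes, edges)
```

None of these numbers depends on the 5.1 million nodes or 19.5 million edges of the movie graph.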
Access constraints: Example
S → (l, N)
• S: a set of node labels, and l is another label
• N: a natural number -- cardinality
Combining cardinality constraints and index
Semantics: G satisfies S → (l, N) if
• For any set Vs of nodes in G with label S (with distinct labels, in S), there exist at most N common neighbours of Vs with label l (connected by an edge to each node in Vs)
• There is an index on S for l: for each set Vs of nodes with label S, find all common neighbours labelled l in O(N) time
Access schema: A set of access constraints
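The semantics above can be made concrete with a brute-force check on a toy graph. The sketch below builds the index for one constraint S → (l, N) as a dictionary from each node tuple Vs to its common neighbours labelled l, and then tests the cardinality bound. It assumes undirected adjacency and enumerates all label-compatible tuples, which is for illustration only and is not how an index would be built on a big graph; the function names and the toy data are ours.

```python
from collections import defaultdict
from itertools import product

def undirected(edges):
    """Adjacency sets, treating every edge as undirected (an assumption)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def build_index(labels, adj, S, l):
    """Index for S -> (l, N): tuple Vs (one node per label of S, in S's order)
    mapped to the set of common neighbours of Vs labelled l."""
    by_label = defaultdict(list)
    for v, lab in labels.items():
        by_label[lab].append(v)
    index = {}
    for vs in product(*(by_label[s] for s in S)):
        # For S = () (the empty set), every node counts as a "common neighbour".
        common = set.intersection(*(adj[v] for v in vs)) if vs else set(labels)
        common = {w for w in common if labels[w] == l}
        if common:
            index[vs] = common
    return index

def satisfies(index, N):
    """Cardinality part of S -> (l, N): at most N common neighbours per Vs."""
    return all(len(c) <= N for c in index.values())

labels = {"y2011": "year", "oscar": "award", "m1": "movie", "m2": "movie",
          "p1": "actor", "uk": "country"}
adj = undirected([("m1", "y2011"), ("m1", "oscar"), ("m2", "y2011"),
                  ("m2", "oscar"), ("p1", "m1"), ("p1", "uk")])
idx_movie = build_index(labels, adj, ("year", "award"), "movie")
idx_country = build_index(labels, adj, (), "country")
print(idx_movie, satisfies(idx_movie, 4))
```

A lookup `idx_movie[("y2011", "oscar")]` returns at most N nodes, which is what makes the fetch cost independent of |G|.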
Example: access constraints
• C1: an award is presented to no more than 4 movies each year
• C2: each movie has at most 30 leading actors and actresses
• C3: each person has only one country of origin
• C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph
Access constraints
• (year, award) → (movie, 4)
• movie → (actor/actress, 30)
• actor/actress → (country, 1)
• ∅ → (year, 134), ∅ → (award, 24), ∅ → (country, 196)
Build indices accordingly
Useful special cases: ∅ → (l, N), l → (l', N)
Discovering access schema
S → (l, N)
• Functional dependencies X → Y, e.g., movie → (year, 1): shredding graphs to relations and using, e.g., TANE
• Degree bound: l → (l', N) if a node with label l has degree at most N, for any label l'
• ∅ → (l, N), very common, e.g., ∅ → (country, 196)
• Aggregate queries: group by (year, award), we find (year, award) → (movie, 4)
• Real-life bounds: 5000 friends per person (Facebook)
• …
How to maintain constraints in response to changes to graphs?
Local changes: only to common neighbours
Generating query plans
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
A query plan P for a query Q is a sequence of fetching operations fetch(u, Vs, C, q(u)):
• given a set Vs of nodes fetched earlier, fetch all common neighbours of Vs labelled l, by using access constraint C (efficient by using the indices),
• such that the nodes satisfy the condition q(u) of u, e.g., year in [2011, 2014]
Fetch operations: construct GQ; then we compute Q(GQ)
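A fetch operation can be sketched as a series of index lookups followed by a filter. The code below is a minimal illustration, assuming each access constraint's index is a dictionary from a node tuple Vs to its common neighbours; the signature `fetch(index, vs_candidates, condition)` is a simplification of fetch(u, Vs, C, q(u)), and the toy indices are hypothetical.

```python
from itertools import product

def fetch(index, vs_candidates, condition=lambda v: True):
    """Fetch candidates for one pattern node u.

    index: dict for one access constraint C, mapping a tuple Vs to the set of
           common neighbours of Vs with the required label.
    vs_candidates: tuples Vs built from nodes fetched earlier.
    condition: the predicate q(u) of the pattern node.
    """
    out = set()
    for vs in vs_candidates:
        out |= {w for w in index.get(vs, ()) if condition(w)}
    return out

# Hypothetical indices; year nodes are represented by integers here.
idx_year = {(): {2009, 2010, 2011, 2012}}                 # for ∅ → (year, 134)
idx_award = {(): {"oscar"}}                               # for ∅ → (award, 24)
idx_movie = {(2011, "oscar"): {"m1", "m2"},               # for (year, award) → (movie, 4)
             (2010, "oscar"): {"m0"}}

years = fetch(idx_year, [()], lambda y: 2011 <= y <= 2014)
awards = fetch(idx_award, [()])
movies = fetch(idx_movie, product(years, awards))
print(years, awards, movies)
```

Each step touches only index entries keyed by previously fetched nodes, so the work is bounded by the constraint cardinalities rather than by the size of the graph.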
Generating query plans
Boundedly evaluable: if there exists a query plan under an access schema A such that for all graphs G that satisfy A,
• its fetch operations find GQ, and Q(GQ) = Q(G)
• the time for all fetch operations is determined by Q and A only, independent of |G|
Example (boundedly evaluable):
1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries
2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1
3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2
4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges -- GQ
Independent of |G| no matter how big G grows!
An approach to querying big graphs
Given a query Q, and an access schema A
1. Decide whether Q is boundedly evaluable under A
2. If so, generate a bounded query plan P for Q
3. Given any graph G, use the query plan P
a) Fetch GQ
b) Compute Q(GQ)

Are we done yet?
Questions: the complexity of
– deciding bounded evaluability?
– generating a boundedly evaluable query plan?
Independent of the size of |G|?
Deciding bounded evaluability
Input: A query Q, and an access schema A
Question: Is Q boundedly evaluable under A?
Graph pattern matching via subgraph isomorphism: Q = (VQ, EQ), small in real life
Independent of any graph G
Positive: in O(|A| |VQ| |EQ|) time
Characterization: Q is boundedly evaluable under A iff
• VCov(Q, A) = VQ
• ECov(Q, A) = EQ
Nodes covered by A (VCov): computed by ∅ → (l, N) first and inductively by other constraints in A
Edges (u1, u2) covered by A (ECov): one of them is in VCov and the other has a bounded number of candidates by A
Deciding bounded evaluability: independent of |G|
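The characterization can be illustrated with a small fixpoint computation over the pattern only. The sketch below is a simplified reading of the node and edge covers described above, not the algorithm of the cited paper: it treats pattern edges as undirected, and it considers a node or edge covered by S → (l, N) as soon as every label of S appears among already-covered neighbours. All function names are ours.

```python
def covers(q_labels, q_edges, schema):
    """Simplified VCov / ECov. schema: list of (S, l, N), S a frozenset of labels."""
    nbrs = {u: set() for u in q_labels}
    for a, b in q_edges:
        nbrs[a].add(b)
        nbrs[b].add(a)

    def witnessed(u, S, vcov):
        # Every label of S occurs among the covered neighbours of u.
        return S <= {q_labels[w] for w in nbrs[u] if w in vcov}

    # Base case: constraints of the form ∅ → (l, N).
    vcov = {u for u, lab in q_labels.items()
            if any(not S and l == lab for S, l, _ in schema)}
    changed = True
    while changed:  # inductive case: other constraints in A
        changed = False
        for u, lab in q_labels.items():
            if u not in vcov and any(S and l == lab and witnessed(u, S, vcov)
                                     for S, l, _ in schema):
                vcov.add(u)
                changed = True

    ecov = set()
    for a, b in q_edges:
        for x, y in ((a, b), (b, a)):
            # x is covered, and y has boundedly many candidates given x.
            if x in vcov and any(S and l == q_labels[y] and q_labels[x] in S
                                 and witnessed(y, S, vcov) for S, l, _ in schema):
                ecov.add((a, b))
    return vcov, ecov

def boundedly_evaluable(q_labels, q_edges, schema):
    vcov, ecov = covers(q_labels, q_edges, schema)
    return vcov == set(q_labels) and ecov == set(q_edges)

fs = frozenset
q_labels = {n: n for n in ("award", "year", "movie", "actor", "actress", "country")}
q_edges = [("movie", "award"), ("movie", "year"), ("actor", "movie"),
           ("actress", "movie"), ("actor", "country"), ("actress", "country")]
schema = [(fs({"year", "award"}), "movie", 4), (fs({"movie"}), "actor", 30),
          (fs({"movie"}), "actress", 30), (fs({"actor"}), "country", 1),
          (fs({"actress"}), "country", 1), (fs(), "year", 134),
          (fs(), "award", 24), (fs(), "country", 196)]
print(boundedly_evaluable(q_labels, q_edges, schema))
```

The check reads only Q and A, never a graph G, which is the point of the slide: the decision is independent of |G|.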
Generating boundedly evaluable query plan
• Input: A boundedly evaluable query Q, and an access schema A
• Output: A boundedly evaluable query plan P for Q under A
Graph pattern matching via subgraph isomorphism: Q = (VQ, EQ)
Independent of any graph G
Positive: in O(|A| |EQ| + |A| |VQ|^2) time
Inductively identify covered nodes and edges, and in each step, generate a corresponding fetch operation
Always possible? Yes, since Q is decided boundedly evaluable under A
Query plan generation: independent of |G|
Instance-bounded in a graph G
1. Decide whether Q is effectively bounded under A
2. If so, generate a bounded query plan P for Q
Can we do anything if Q is not boundedly evaluable under A?
• Extending A to AM by adding constraints of the form ∅ → (l, M), l → (l', M), such that G satisfies AM; M may depend on |G|
• Query Q is M-bounded in G if there is GQ of G such that Q(G) = Q(GQ), and GQ can be found in time determined by Q and AM
• M vs. LQ (LQ + 1)/2, LQ: the number of labels in G
For any finite set Q of pattern queries, access schema A and a graph G satisfying A, there exists M such that all queries in Q are M-bounded in G under A
Instance-bounded: on an individual graph, e.g., Facebook
Effectiveness of bounded evaluability
Graph pattern matching via subgraph isomorphism: data locality
Does the same approach work on graph simulation, without data locality?
• Revised node and edge covers
• All the results remain intact on graph pattern matching via simulation
How effective is this approach?
• 60% of subgraph queries and 33% of simulation queries are boundedly evaluable under a small access schema
• Improvement: 4 orders of magnitude for subgraph queries, and 3 orders of magnitude for simulation queries
28587 times faster
• A small M of 0.016% of |G| makes all queries M-bounded
Bounded evaluability: effective for graph pattern queries
Query-preserving graph compression
Dynamic reduction vs. Uniform reduction
• Bounded evaluability: dynamic reduction on dataset D
– Given a query Q, identify and fetch a minimum subset DQ of D such that it has sufficient information for answering Q in D
• Uniform reduction on dataset D
– Identify and fetch a minimum DC such that for all queries Q posed on D, DC has sufficient information to find answers to Q in D
What is the benefit?
Questions:
• DQ is typically smaller than DC. Why?
• DC is computed once offline and then we don't have to worry about it; is this claim true?
Is there any effective uniform reduction to query big data?
Graph compression
The cost of query processing: f(|G|, |Q|)
It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?
Compression <R, P>
• Compressing: for a graph G, GC = R(G)
• Post-processing: for any Q, Q(G) = P(Q(GC))
[Figure: R compresses G into GC; Q is evaluated on GC to obtain Q(GC); P turns Q(GC) into Q(G)]
Lossless: restore G from GC. GC is not much smaller than G
Query friendly compression: decompression of GC back to G
Compress big G into a smaller GC
Query preserving graph compression
Query preserving compression <R, P> for a class Q of queries
• Compressing: for any graph G, GC = R(G)
• Post-processing: for any query Q in this class, Q(G) = P(Q(GC))
[Figure: R compresses G into GC; Q is evaluated on GC to obtain Q(GC); P turns Q(GC) into Q(G)]
Compress G w.r.t. a particular query class Q
What is new about query preserving compression?
Query preserving compression <R, P> for a class L of queries
• For any graph G, Gc = R(G)
• For any Q in L, Q(G) = P(Q(Gc))
• Relative to a class L of queries of users' choice
– Better compression ratio: only information about L queries
• For any Q in L, Q(Gc) can be directly computed: no need to decompress Gc
– In contrast to lossless compression, no need to restore the original graph G
– Any algorithms and indexing structures for G can be used for Gc
• Gc is computed once for all queries Q in L
– Incrementally maintained
Compress G relative to your queries
Reachability queries
Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G? (O(|V| + |E|) time)
Equivalence relation:
• reachability relation Re: a node pair (u, v) ∈ Re iff they have the same set of ancestors and descendants in G.
• for any graph G, there is a unique maximum Re, i.e., the reachability equivalence relation of G
Compress G by leveraging the equivalence relation
Algorithm and example
1. Compute Re and its equivalence classes, in O(|V||E|) time
2. Construct a node for each equivalence class
3. Construct GC
[Figure: a graph G with nodes MSA1, MSA2, BSA1, BSA2, FA1, FA2, FA3, FA4, C1, C2, C3, C4, …, Ck and a reachability query QR, next to its compressed graph GC with merged nodes such as MSA1 MSA2, BSA1 BSA2 and C1 C2]
Reachability preserving compression
• A reachability preserving compression R for G
– R maps each node in G to its reachability equivalence class in GC, and each edge to an edge between two equivalence classes (nodes in GC: equivalence classes)
• Correctness:
– For any query QR(v, w) over G, v can reach w iff R(v) can reach R(w) in GC
– Compression R is in quadratic time
– No post-processing function P is required.
Reduction: 95% on average for reachability queries
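The compression function R can be sketched directly from the definition of the reachability equivalence relation: group nodes by their (ancestors, descendants) pair and map edges to edges between classes. The code below is a minimal illustration that computes ancestors and descendants by one traversal per node; it takes "reach" to mean a path with at least one edge, which is our assumption, and all names are ours.

```python
from collections import defaultdict

def reach_sets(edges, nodes):
    """Map each node to the set of nodes it reaches by a non-empty path."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    out = {}
    for s in nodes:
        seen, stack = set(), list(adj[s])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(adj[x])
        out[s] = frozenset(seen)
    return out

def compress_reachability(nodes, edges):
    """R: node -> class id, plus the edge set of the compressed graph Gc."""
    desc = reach_sets(edges, nodes)
    anc = reach_sets([(v, u) for u, v in edges], nodes)
    ids, class_of = {}, {}
    for v in nodes:
        # Equivalent nodes share the same ancestors and the same descendants.
        class_of[v] = ids.setdefault((anc[v], desc[v]), len(ids))
    gc_edges = {(class_of[u], class_of[v]) for u, v in edges}
    return class_of, gc_edges

def reaches(edges, s, t):
    return t in reach_sets(edges, [s])[s]

nodes = ["m1", "m2", "b1", "b2", "c1", "c2", "c3"]
edges = [(m, b) for m in ("m1", "m2") for b in ("b1", "b2")] + \
        [(b, c) for b in ("b1", "b2") for c in ("c1", "c2", "c3")]
class_of, gc_edges = compress_reachability(nodes, edges)
print(len(set(class_of.values())), len(gc_edges))
```

On this toy graph, 7 nodes and 10 edges shrink to 3 class nodes and 2 edges, and a query QR(v, w) is answered on Gc by asking whether `class_of[v]` reaches `class_of[w]`, with no post-processing.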
What does it look like in real life?
18 times faster on average for reachability queries
Graph pattern matching by graph simulation
Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R
Bisimulation: a binary relation B over V of G, such that for each node pair (u, v) ∈ B,
• L(u) = L(v)
• for each edge (u, u') ∈ E, there exists (v, v') ∈ E, s.t. (u', v') ∈ B,
• for each edge (v, v') ∈ E, there exists (u, u') ∈ E, s.t. (u', v') ∈ B
Equivalence relation Rb: the unique maximum bisimulation relation
[Figure: two graphs G1 and G2 with nodes A1-A5, B1-B5, C1-C4 and D1-D2]
Compress G by leveraging the equivalence relation
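The maximum bisimulation can be computed by partition refinement: start with one block per label and keep splitting blocks whose nodes point to different sets of blocks. The sketch below is a naive refinement loop for illustration; it is not the O(|E| log |V|) algorithm, and the function name is ours.

```python
def bisimulation_classes(labels, edges):
    """Map each node to the id of its class under the maximum bisimulation.

    labels: dict node -> label; edges: iterable of (src, dst).
    """
    succ = {v: set() for v in labels}
    for u, v in edges:
        succ[u].add(v)
    block = dict(labels)  # initial partition: nodes with the same label
    while True:
        # Signature: own block plus the set of blocks of the successors.
        sig = {v: (block[v], frozenset(block[w] for w in succ[v])) for v in labels}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block

labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B",
          "c1": "C", "c2": "C"}
edges = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("b1", "c1"), ("b2", "c2")]
cls = bisimulation_classes(labels, edges)
gc_edges = {(cls[u], cls[v]) for u, v in edges}
print(len(set(cls.values())), len(gc_edges))
```

Here `b3` has no C-labelled child, so it is separated from `b1` and `b2`, which in turn separates `a3` from `a1` and `a2`; the compressed graph has one node per class and the edges in `gc_edges`.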
Compression for simulation
• R(G): computes equivalence classes
• R(G): constructs Gc with equivalence classes
• P(Q, Gc): expanded to the nodes in their equivalence classes
[Figure: graph G with nodes msa1, msa2, bsa1, bsa2, fa1, fa2, fa3, c1, c2, c3, …, ck and its compressed graph Gc with nodes MSAr, BSAr, FAr, FAr', Cr and Cr']
Compression for simulation
Query preserving compression <R, P> for graph pattern matching
• R(G) in O(|E| log(|V|)) time
• P(Q, Gc): linear time in the size of Q(G)
• compression function R( ):
– maximum bisimulation relation on the nodes of G
– equivalence relation; nodes in Gc denote equivalence classes
• post-processing function P( ):
– making use of the inverse of R( ): nodes in Q(Gc) are expanded to nodes in their equivalence classes, in the size of output
Subgraph isomorphism?
2.3 times faster (simulation)
Reduction: 57% on average for graph pattern matching
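The post-processing function P only has to undo R on the answer: every class node in Q(Gc) is replaced by the nodes of G in that class. A minimal sketch, assuming the answer on Gc is a dictionary from pattern nodes to sets of class ids and that `members` is the inverse of R (both representations are our assumptions):

```python
def post_process(match_gc, members):
    """P: expand each matched class node of Gc into the G-nodes of that class.

    match_gc: dict pattern node -> set of class ids, i.e. Q(Gc).
    members:  dict class id -> set of nodes of G in that class (inverse of R).
    """
    return {u: set().union(*(members[c] for c in classes))
            for u, classes in match_gc.items()}

members = {0: {"a1", "a2"}, 1: {"b1", "b2"}, 2: {"b3"}}
match_gc = {"ua": {0}, "ub": {1, 2}}
answer = post_process(match_gc, members)
print(answer)
```

The work done is proportional to the size of the expanded answer, matching the claim that P runs in time linear in the size of Q(G).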
Summing up
Summary and review
 What is parallel scalability? Why do we care about it?
 Study some parallel algorithms. Show that they are parallel
scalable if they are, and disprove it otherwise
 Why do we want to make big graphs small? How can we do it?
 What is bounded evaluability of queries? What auxiliary
structures do we need to make queries boundedly evaluable?
 What is query-preserving graph compression? Is it lossless? Do
we lose information when using such a compression scheme?
 How to develop query preserving graph compression schemes?
Project (1)
Bounded evaluability. Recall keyword search via distinct-root trees (bounded by a fixed depth k; see Lectures 2 and 4)
• Develop an algorithm for keyword search based on access constraints; show that such queries can be boundedly evaluated
• Develop optimization strategies
• Develop a parallel version of your algorithm, in whatever model you like (MapReduce, BSP, GRAPE)
• Experimentally evaluate your algorithms, especially their scalability with the size of G
• Write a survey on various methods for keyword search with distinct-root trees, as part of the related work.
A research and development project
Project (2)
Recall graph pattern matching by subgraph isomorphism (Lecture 3)
• Develop a query-preserving compression scheme for subgraph isomorphism
• Implement your compression scheme and an algorithm for graph pattern matching via subgraph isomorphism, based on your query-preserving compression scheme
• Experimentally evaluate your compression scheme and evaluation algorithm, especially its scalability with the size of G
• Write a survey on graph compression schemes, as part of the related work.
A research and development project
Project (3)
Combine query-preserving compression and distributed algorithm for reachability queries (Lecture 5)
• Develop a framework for answering reachability queries, with
– a query-preserving compression scheme to reduce graphs
– a distributed algorithm for answering reachability queries
– an incremental algorithm to maintain compressed graphs in response to changes to the original graphs
• Implement the framework with all three algorithms
• Experimentally evaluate your method for answering reachability queries, especially its scalability with the size of G
• Write a survey on graph compression schemes and distributed algorithms for reachability queries, as part of the related work.
A development project
Papers for you to review
• M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: Scale-independent storage for social computing applications. In CIDR, 2009. http://arxiv.org/ftp/arxiv/papers/0909/0909.1775.pdf
• M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, 2013. http://www.cs.albany.edu/~jhh/courses/readings/armbrust.sigmod13.incremental.pdf
• S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013. https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
• Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. http://pages.cs.wisc.edu/~jignesh/publ/summarization.pdf
• Y. Cao, W. Fan, and R. Huang. Making pattern queries bounded in big graphs. ICDE 2015. (bounded evaluability)
• W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012. (query-preserving compression)