Fast In-Memory SQL Analytics on Typed Graphs

Chunbin Lin
Benjamin Mandel
Yannis Papakonstantinou
Matthias Springer
Outline
• Motivation & Query Definition
• GQFast System
• Experimental Results
• Conclusion
Entity-Relationship
Star schema
Snowflake schema
Academic graph (Affiliations, Authors, Journals, Papers)
RDF (subject predicate object)
Social graph (Users, Events)
Biomedical graph (Terms, Citations, Authors)
Example Schema 1
Example instance (graph figure):
Paper: Year: 2016, Title: Parallel graph analytics
Author: Name: Mike, Email: [email protected]
Affiliation: ID: 126, Name: UCSD, Address: 9500 Gilman Dr, La Jolla, CA 92093
Edge label: paper
Other node IDs in the figure: 11, 8663

Tables:
Paper (ID, Year, Title)
Author (ID, Name, Email)
Affiliation (ID, Name, Address)
Paper-Author (PaID, AuID, …)
Author-Affiliation (AuID, AfID, …)
Example Schema 2
[Figure: a schema and its corresponding graph; labels: Entity table, Relationship table.]
Query definition
Relationship query
1. Context computation
2. Path navigation
3. Path aggregation
Relationship queries are a superset of fixed-length graph reachability queries and of tree pattern queries.
Q1: Find the authors who published papers containing term t1 and term t2, and count the number of papers per author.
Answer: {(a1, 1), (a3, 3), (a6, 1)}
[Figure: example graph over Author, Document, and Term nodes, with terms t1 and t2 and authors a1, a3, and a6.]
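The three steps of a relationship query can be sketched on a toy graph. This is a minimal illustration, not GQFast itself: the edge lists `doc_term` and `doc_author`, the document IDs d1 to d4, and the helper names are all hypothetical, chosen so that the result matches the answer given for Q1.

```python
from collections import Counter

# Hypothetical toy data: (document, term) and (document, author) edges.
doc_term = [
    ("d1", "t1"), ("d1", "t2"),
    ("d2", "t1"), ("d2", "t2"),
    ("d3", "t1"), ("d3", "t2"),
    ("d4", "t1"),                # d4 lacks t2, so it drops out of the context
]
doc_author = [
    ("d1", "a1"), ("d1", "a3"),
    ("d2", "a3"),
    ("d3", "a3"), ("d3", "a6"),
    ("d4", "a6"),
]


def docs_with_term(term):
    """Return the set of documents that contain the given term."""
    return {d for d, t in doc_term if t == term}


def q1(term_a, term_b):
    # 1. Context computation: documents containing both terms.
    context = docs_with_term(term_a) & docs_with_term(term_b)
    # 2. Path navigation: follow Document -> Author edges from the context.
    reached = [a for d, a in doc_author if d in context]
    # 3. Path aggregation: count papers per author.
    return Counter(reached)


result = q1("t1", "t2")
print(sorted(result.items()))
```

On this toy data the query returns one paper for a1, three for a3, and one for a6.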
More example queries
Query SD (Find similar documents)
Query FSD (Frequency-Time-aware Document Similarity)
Query AD (Count authors having papers with terms t1…tn)
Query FAD (Co-Occurring Terms Discovery)
Query AS (Author Similarity)
Relationship query: 1. Context computation, 2. Path navigation, 3. Path aggregation
GQFast Demo
GQFast architecture
[Figure: GQFast architecture. Query path: Algebra Translator, RQNA Normalizer, Physical-plan Producer, Code Generator, Results. Storage: Original database, GQFast Loader, GQFast Indices and GQFast Metadata in memory.]
GQFast index
[Figure: GQFast index. Labels: Original data, Indexed column C0, columns C1, C2, ……, Cn, Offset-array, Encoded fragment.]
h: the number of distinct values in column C0
Compression method for each column
Example index
Doc   Term   Fre
…     …      …
116   28     6
116   66     3
116   77     1
…     …      …
• Index is data, not index plus data.
• From data to data without row IDs.
• Efficient lookup structure.
• Proper compression methods.
• Different from a database clustered index.
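The idea of an offset-array that points into per-column fragments can be sketched as follows. This is an illustrative layout under stated assumptions, not the GQFast implementation: keys are assumed to be dense integers in [0, num_ids), the offset-array is assumed to have num_ids + 1 entries, and the fragments are left uncompressed. The names `build_index` and `fragment` are hypothetical.

```python
def build_index(rows, num_ids):
    """Build an offset-array plus one flat array per non-key column.

    rows: (doc, term, fre) tuples; doc is assumed to be a dense integer
    in [0, num_ids). Returns (offsets, terms, fres).
    """
    rows = sorted(rows)
    offsets = [0] * (num_ids + 1)
    for key, _, _ in rows:
        offsets[key + 1] += 1          # count rows per key
    for i in range(num_ids):
        offsets[i + 1] += offsets[i]   # prefix sums give fragment boundaries
    terms = [term for _, term, _ in rows]
    fres = [fre for _, _, fre in rows]
    return offsets, terms, fres


def fragment(offsets, column, key):
    """Return the slice of `column` that belongs to `key` (its fragment)."""
    return column[offsets[key]:offsets[key + 1]]


# Rows for doc 116 mirror the example index; doc 7 is made up.
rows = [(116, 28, 6), (116, 66, 3), (116, 77, 1), (7, 28, 2)]
offsets, terms, fres = build_index(rows, num_ids=117)

print(fragment(offsets, terms, 116))
print(fragment(offsets, fres, 116))
```

A lookup goes from a key straight to the values of another column, with no row IDs in between. In a compressed variant, each slice would be stored in encoded form and decoded on access.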
GQFast query processing
Pipeline: Algebra Translator, RQNA Normalizer, Physical-plan Producer, Code Generator
RQNA Normalizer: (1) pushes selections down and (2) transforms the plan into left-deep joins
Physical-plan operators:
• Fragment-based join
• Fragment-based semijoin
• Fragment-based aggregation
• ……
Code Generator: analyzes the physical operators and produces efficient C++ code for the query
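Query compilation in general can be illustrated with a very small generator that turns a plan into specialized source code. The sketch below is only an analogy: GQFast emits C++, whereas this example emits and executes Python source, and the plan format (a list of index names to traverse) and the function names `generate` and `run` are hypothetical.

```python
def generate(hops):
    """Emit Python source for a function that walks `hops` and counts endpoints.

    hops: names of adjacency maps to traverse in order (hypothetical plan format).
    """
    lines = ["def run(start, indexes):", "    counts = {}", "    v0 = start"]
    indent = "    "
    for i, hop in enumerate(hops):
        # One nested loop per hop, specialized to the index it reads.
        lines.append(f"{indent}for v{i + 1} in indexes[{hop!r}].get(v{i}, ()):")
        indent += "    "
    last = f"v{len(hops)}"
    lines.append(f"{indent}counts[{last}] = counts.get({last}, 0) + 1")
    lines.append("    return counts")
    return "\n".join(lines)


source = generate(["doc_to_term", "term_to_doc"])
namespace = {}
exec(source, namespace)      # compile the generated function
run = namespace["run"]

# Made-up adjacency data for the two hops.
indexes = {
    "doc_to_term": {116: [28, 66]},
    "term_to_doc": {28: [116, 5], 66: [116, 9, 5]},
}
counts = run(116, indexes)
print(source)
print(counts)
```

The generated function contains exactly the loops the plan needs, with no interpretation of operators at run time.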
GQFast Code Generator
Bottom-up pipeline execution (example for doc 116):
Step 1: Get all the terms associated with doc 116 (a term fragment).
Step 2: Get all the documents associated with each term (doc fragments).
Step 3: Aggregate the corresponding documents.
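The three pipeline steps for doc 116 can be sketched with nested loops that pass each tuple straight to the next step instead of materializing intermediate results. This is a hedged illustration: the rows for docs 5 and 9 are made up, the scoring (sum of frequency products over shared terms) and the exclusion of the query document itself are assumptions, and `similar_documents` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical (doc, term, fre) rows; doc 116 mirrors the example index.
rows = [
    (116, 28, 6), (116, 66, 3), (116, 77, 1),
    (5, 28, 2), (5, 66, 1),
    (9, 77, 4),
]

doc_index = defaultdict(list)    # doc  -> [(term, fre), ...]
term_index = defaultdict(list)   # term -> [(doc, fre), ...]
for doc, term, fre in rows:
    doc_index[doc].append((term, fre))
    term_index[term].append((doc, fre))


def similar_documents(doc):
    scores = defaultdict(int)
    # Step 1: the term fragment of the query document.
    for term, fre in doc_index[doc]:
        # Step 2: the doc fragment of each term.
        for other, other_fre in term_index[term]:
            if other != doc:     # assumption: skip the query document itself
                # Step 3: aggregate per reached document.
                scores[other] += fre * other_fre
    return dict(scores)


scores = similar_documents(116)
print(scores)
```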
Compared systems
GQFast vs.:
• Row-oriented database
• Column-oriented database
• Analytic database
• SQL on Hadoop
• Graph database
• Optimized column databases: PMC, OMC
Real-life datasets
Pubmed dataset*
SemmedDB dataset#
* http://www.ncbi.nlm.nih.gov/pubmed
# http://skr3.nlm.nih.gov/SemMedDB/dbinfo.html
End-to-end experiments (both time and space)
[Plots: Running time (sec); Space cost (GB)]
Effect of each optimization
1. Compilation: using a code generator to generate C++ code
2. Pipelining: adopting a bottom-up pipelined execution strategy
3. Array-l: using dense IDs to maintain an array look-up table instead of a hash table
4. Array-a: using dense IDs to maintain an array to store aggregation results instead of a hash table
5. Compression: applying aggressive data compression schemes
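The dense-ID idea behind Array-l and Array-a can be sketched briefly: once keys are consecutive integers, a plain array indexed by ID can stand in for a hash table. The example below is a minimal illustration with made-up data; `assign_dense_ids` and the variable names are hypothetical, and the dense IDs are assumed to be assigned once at load time.

```python
def assign_dense_ids(values):
    """Map arbitrary keys to consecutive integers 0..h-1 in first-seen order."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return ids


authors = ["a3", "a1", "a3", "a6", "a3"]

# Load time: assign dense IDs and store the data in encoded form.
dense = assign_dense_ids(authors)
encoded = [dense[a] for a in authors]
id_to_author = sorted(dense, key=dense.get)   # array look-up table: ID -> key

# Array-based aggregation: one slot per dense ID, indexed directly.
counts = [0] * len(dense)
for author_id in encoded:
    counts[author_id] += 1

# Hash-table aggregation on the original keys, for comparison.
hashed = {}
for a in authors:
    hashed[a] = hashed.get(a, 0) + 1

decoded = {id_to_author[i]: c for i, c in enumerate(counts)}
print(counts, decoded)
```

Both variants produce the same counts; the array version replaces hashing and collision handling at query time with a single index operation.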
* OMC uses RLE encoding and dictionary encoding.
# OMC-denseID uses RLE encoding and dictionary encoding.
Effect of each optimization
[Plots: Effect of Pipeline; Effect of Array-l; Effect of Array-a; Effect of Compression (decompressing time in sec, space cost in MB)]
Additional experiments
Effect of multithreading
Time of building indices (sec)
Time of (de)serializing indices (sec)
Conclusion
• Formally define relationship queries
• Propose the fragment-based data organization
• Propose the GQFast code generator
• Conduct comprehensive experiments