Fast In-Memory SQL Analytics on Typed Graphs
Chunbin Lin, Benjamin Mandel, Yannis Papakonstantinou, Matthias Springer

Outline
• Motivation & Query Definition
• GQFast System
• Experimental Results
• Conclusion

Motivation & Query
• Entity-Relationship
• Star schema
• Snowflake schema
• Academic graph (Affiliations, Authors, Journals, Papers)
• RDF (subject predicate object)
• Social graph (Users, Events)
• Biomedical graph (Terms, Citations, Authors)

Example Schema 1
Entity tables:
• Paper (ID, Year, Title)
• Author (ID, Name, Email)
• Affiliation (ID, Name, Address)
Relationship tables:
• Paper-Author (PaID, AuID, …)
• Author-Affiliation (AuID, AfID, …)
Example instances (from the figure):
• Paper: Year: 2016; Title: Parallel graph analytics
• Author: Name: Mike; Email: [email protected]
• Affiliation: ID: 126; Name: UCSD; Address: 9500 Gilman Dr, La Jolla, CA 92093
• The figure also shows ID: 11 and ID: 8663 on the author and paper nodes.

Example Schema 2
[Figure: schema graph made of entity tables and relationship tables]

Query definition
Relationship query:
1. Context computation
2. Path Navigation
3. Path Aggregation
Relationship queries are a superset of fixed-length graph reachability queries and of tree pattern queries.
Q1: Find the authors who published papers containing term t1 and term t2, and count the number of papers per author.
Answer: {(a1, 1), (a3, 3), (a6, 1)}
[Figure: Author, Document, and Term nodes; authors a1, a3, a6; terms ID:t1 and ID:t2]

More example queries
Each is a relationship query (context computation, path navigation, path aggregation):
• Query SD (Find similar documents)
• Query FSD (Frequency-Time-aware Document Similarity)
• Query AD (Count authors having papers with terms t1…tn)
• Query FAD (Co-Occurring Terms Discovery)
• Query AS (Author Similarity)

GQFast Demo

GQFast architecture
[Figure: Algebra Translator, RQNA Normalizer, Physical-plan Producer, Code Generator, Results; GQFast Metadata, GQFast Loader, Original database, GQFast Indices, Memory]

Offset-array GQFast index
[Figure: original data with columns C0, C1, C2, ……, Cn; C0 is the indexed column; each remaining column is stored as encoded fragments, with a compression method chosen per column]
h: the number of distinct values in column C0
Example index:
Doc   Term   Fre
…     …      …
116   28     6
116   66     3
116   77     1
…     …      …
• Index is data, not index plus data.
• From data to data without row IDs.
• Efficient lookup structure.
• Proper compression methods.
• Different from a database clustered index.

GQFast query processing
Algebra Translator, RQNA Normalizer, Physical-plan Producer, Code Generator
RQNA Normalizer: (1) push selections down, and (2) transform to left-deep joins.
Physical-plan operators:
• Fragment-based join
• Fragment-based semijoin
• Fragment-based aggregation
• ……
Code Generator: analyze the physical operators and produce efficient C++ code for the query.

GQFast Code Generator
Bottom-up pipeline execution (example: doc 116 in the Doc/Term/Fre index above)
Step 1: Get all the terms associated with doc 116 (a term fragment).
Step 2: Get all the documents associated with each term (doc fragments).
Step 3: Aggregate the corresponding documents.

Compared systems
GQFast vs.
• Column-oriented database
• Analytic database
• SQL on Hadoop
• Row-oriented database
• Graph database
• Optimized column databases (PMC, OMC)

Real-life datasets
• Pubmed dataset*
• SemmedDB dataset#
* http://www.ncbi.nlm.nih.gov/pubmed
# http://skr3.nlm.nih.gov/SemMedDB/dbinfo.html

End-to-end experiments (both time and space)
[Figure: Running time (sec); Space cost (GB)]

GQFast Demo

Effect of each optimization
1. Compilation: Using a code generator to generate C++ code
2. Pipelining: Adopting a bottom-up pipelined execution strategy
3. Array-l: Using dense IDs to maintain an array look-up table instead of a hash table
4. Array-a: Using dense IDs to maintain an array to store aggregation results instead of a hash table
5. Compression: Applying aggressive data compression schemes
* OMC uses RLE encoding and dictionary encoding.
# OMC-denseID uses RLE encoding and dictionary encoding.

Effect of each optimization
[Figures: Effect of Pipeline; Effect of Array-l; Effect of Array-a; Effect of Compression: Decompressing time (sec), Space cost (MB)]

Additional experiments
[Figures: Effect of multithreading; Time of building indices (sec); Time of (de)serializing indices (sec)]

Conclusion
• Formally define the relationship queries
• Propose the fragment-based data organization
• Propose the GQFast code generator
• Conduct comprehensive experiments
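
Illustrative sketch (not from the slides): the offset-array index and the three-step bottom-up pipeline from the GQFast System slides can be mimicked in a few lines of Python. Everything here is an assumption made for illustration: the function names, the tiny dataset, the uncompressed fragments, and the choice to score Query SD by counting shared terms. GQFast itself generates C++ and compresses each fragment.

```python
from array import array

def build_fragment_index(rows, num_keys):
    """Build a hypothetical offset-array index over (key, value, freq) rows.

    Keys are assumed to be dense integers in [0, num_keys). The fragment
    for key k is the slice [offsets[k], offsets[k + 1]) of the value and
    frequency arrays, so a lookup needs no hash table and no row IDs.
    """
    rows = sorted(rows)
    offsets = array("l", [0] * (num_keys + 1))
    for key, _, _ in rows:
        offsets[key + 1] += 1          # count rows per key
    for k in range(num_keys):
        offsets[k + 1] += offsets[k]   # prefix sum gives fragment boundaries
    values = array("l", (v for _, v, _ in rows))
    freqs = array("l", (f for _, _, f in rows))
    return offsets, values, freqs

def fragment(index, key):
    """Return the (values, freqs) fragment stored for one key."""
    offsets, values, freqs = index
    lo, hi = offsets[key], offsets[key + 1]
    return values[lo:hi], freqs[lo:hi]

def similar_documents(doc_to_term, term_to_doc, doc_id, num_docs):
    """Count, for every document, how many terms it shares with doc_id."""
    scores = [0] * num_docs                      # array aggregation on dense IDs
    terms, _ = fragment(doc_to_term, doc_id)     # Step 1: term fragment of the doc
    for term in terms:
        docs, _ = fragment(term_to_doc, term)    # Step 2: doc fragment of each term
        for d in docs:
            scores[d] += 1                       # Step 3: aggregate per document
    return scores

# Made-up (doc, term, fre) rows with dense IDs: 4 documents, 3 terms.
rows = [(0, 0, 6), (0, 1, 3), (0, 2, 1), (1, 0, 2), (1, 2, 4), (2, 1, 5), (3, 2, 1)]
doc_to_term = build_fragment_index(rows, num_keys=4)
term_to_doc = build_fragment_index([(t, d, f) for d, t, f in rows], num_keys=3)

print(similar_documents(doc_to_term, term_to_doc, doc_id=0, num_docs=4))
# -> [3, 2, 1, 1]
```

The score list is indexed directly by document ID, which is the idea behind the Array-a optimization; the offset array plays the role of the Array-l look-up table.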