Distance Queries on Large-Scale Graphs

Work conducted in collaboration with:
Dieter Pfoser (GMU), Alexandros Efentakis (Athena RC),
Dimitrios Skoutas (Athena RC), Yannis Vassiliou (NTUA), Timos Sellis (RMIT)
Location-Based Services
• Location-Based Services
−Geo-Social Search
−Spatio-Textual Search
−Find paths between friends on a social network
−Find paths on large transportation networks
−Social purposes, marketing, scientific, …
• Enhanced Location-Based Services
−Location +
• Text
• Connectivity
• Semantics
• …
Shortest Paths
• Graph:
−Structure that models relations between objects
−G(V,E), V: Vertices, E: Edges
• Shortest Path (Vertex to Vertex query)
−Find a path between 2 vertices in G such that the sum
of the weights of its constituent edges is minimized
Distance Queries on Graphs
•kNN: k-Nearest Neighbor Query
−Seek the k-Nearest neighbor nodes to an input
query node q
Example:
q = v4
Gas stations: v3, v9, v15
k=2
Query: Find 2-nearest gas stations to q
dist(v4, v3) = 2,
dist(v4, v9) = 15,
dist(v4, v15) = 21
→ kNN list = {v3, v9}
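The gas-station example can be reproduced with a plain Dijkstra search that stops once k objects have been settled. This is only a baseline sketch, not the method proposed in this talk. The edge list below is hypothetical, chosen so that the distances match the ones in the example, and the function name `knn` is an assumption.

```python
import heapq

def knn(graph, q, objects, k):
    """Return up to k (distance, vertex) pairs for the objects nearest to q."""
    dist = {q: 0}
    heap = [(0, q)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if u in objects:
            result.append((d, u))  # vertices are settled in distance order
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical undirected edges, picked to match the distances in the example.
edges = [("v4", "v3", 2), ("v4", "v9", 15), ("v9", "v15", 6)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, []).append((b, w))
    graph.setdefault(b, []).append((a, w))

gas_stations = {"v3", "v9", "v15"}
nearest = knn(graph, "v4", gas_stations, k=2)
print(nearest)
```

Because Dijkstra settles vertices in non-decreasing distance order, the first k objects settled are the k nearest ones.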
Distance Queries on Graphs
• RkNN: Reverse k-Nearest Neighbor
Query (or monochromatic RkNN
query)
−Given a query point q and a set of objects P,
retrieves all the objects that have q as one of their
k-nearest neighbors according to a distance
function dist()
• Example 1: q is interested in music
−p1, p2, p3 are also interested in music
−Given the edge costs, R1NN(q) = {p1, p2}
−q is most beneficial (1-NN) to p1 and p2
(closest NN AND shares same interests)
• Example 2: Find the set of customers affected by
the opening of a new store outlet location in
order to inform the relevant customers
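The RkNN definition can be checked by brute force: an object p belongs to RkNN(q) when fewer than k other objects are strictly closer to p than q is. The sketch below assumes a symmetric distance function; the distance table is hypothetical (the edge costs of Example 1 are not reproduced here) and was picked so that the result matches R1NN(q) = {p1, p2}.

```python
def rknn(q, objects, k, dist):
    """Brute-force monochromatic RkNN: objects that have q among their k nearest."""
    result = set()
    for p in objects:
        d_pq = dist(p, q)
        # Count the other objects that are strictly closer to p than q is.
        closer = sum(1 for o in objects if o != p and dist(p, o) < d_pq)
        if closer < k:
            result.add(p)
    return result

# Hypothetical symmetric shortest-path distances (not taken from the slide).
D = {
    frozenset(("q", "p1")): 1,
    frozenset(("q", "p2")): 2,
    frozenset(("q", "p3")): 5,
    frozenset(("p1", "p2")): 3,
    frozenset(("p1", "p3")): 6,
    frozenset(("p2", "p3")): 4,
}

def dist(a, b):
    return D[frozenset((a, b))]

r1nn = rknn("q", {"p1", "p2", "p3"}, k=1, dist=dist)
print(sorted(r1nn))
```

With these distances p3 is excluded because p2 (distance 4) is closer to it than q (distance 5).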
Distance Queries on Graphs
•One-To-Many query
−Compute the SP distances between the source
vertex s and all vertices of a set of targets T
Efficient Shortest Path Algorithms
•BFS
−Bi-directional BFS
•Dijkstra
−Very little overhead
−But takes seconds on continental networks
Efficient Shortest Path Algorithms
• Practical Algorithms on Large-Scale
Networks
• Two-stage Approach
−Pre-processing (minutes, hours)
• Produce auxiliary data
−Real-time querying
• ALT algorithm
• CH: Contraction Hierarchies
• Graph Separators
Efficient Shortest Path Algorithms
• Landmarks + A* + Triangle Inequality:
ALT algorithm
−Select N landmark nodes (e.g., 16, 32, 64, …)
−Pre-compute distances from each landmark to all nodes in the graph
−Calculate lower bounds on distances (triangle inequality)
−Combine A* search with the lower bounds from the landmarks
−Nodes closer to q are expanded first
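The steps above can be sketched for an undirected, connected graph as follows. For a landmark L, the triangle inequality gives |d(L, t) − d(L, v)| ≤ d(v, t), so the maximum over all landmarks is a valid lower bound for A*. The graph, the landmark choice ("a" and "f"), and the function names are assumptions made for illustration only.

```python
import heapq

def dijkstra(graph, source):
    """Distances from source to every reachable vertex (pre-processing step)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def alt_query(graph, landmark_dist, s, t):
    """A* from s to t using landmark lower bounds as the heuristic."""
    def lower_bound(v):
        # Triangle inequality: |d(L, t) - d(L, v)| <= d(v, t) for each landmark L.
        return max(abs(d[t] - d[v]) for d in landmark_dist)

    g = {s: 0}
    heap = [(lower_bound(s), s)]
    closed = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u == t:
            return g[u]
        if u in closed:
            continue
        closed.add(u)
        for v, w in graph[u]:
            ng = g[u] + w
            if ng < g.get(v, float("inf")):
                g[v] = ng
                heapq.heappush(heap, (ng + lower_bound(v), v))
    return float("inf")

# Hypothetical undirected graph.
edges = [("a", "b", 2), ("b", "c", 2), ("a", "d", 1),
         ("d", "e", 1), ("e", "c", 1), ("c", "f", 3)]
graph = {}
for x, y, w in edges:
    graph.setdefault(x, []).append((y, w))
    graph.setdefault(y, []).append((x, w))

# Pre-processing: one distance table per landmark.
landmark_dist = [dijkstra(graph, landmark) for landmark in ("a", "f")]
print(alt_query(graph, landmark_dist, "a", "f"))
```

The tighter the lower bounds, the fewer vertices A* has to expand compared with plain Dijkstra; the answer itself is unchanged.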
Dataset: Road network of Western Europe
Hannah Bast, Daniel Delling, Andrew Goldberg, Matthias Müller-Hannemann, Thomas Pajor, Peter
Sanders, Dorothea Wagner, and Renato Werneck, Route Planning in Transportation Networks, MSR-TR-2014-4, 2014
Motivation
• Many efficient Shortest Path (SP)
algorithms for vertex-to-vertex (v2v)
queries on road networks
• New research directions for more complex
queries
−One-to-all, Range, One-to-many, kNN
• Most of the existing solutions cannot be
adapted to large-scale networks
−Unweighted, undirected graphs of high-degree
−Social or collaboration networks
Challenge
• Adapt previous methods to secondary
storage &
• … large-scale networks
• Queries work entirely within a database
−Fast performance & scalability
• Previous solutions:
−HLDB: V2V, kNN queries on DB (SQL server) on road
networks
−HopDB: C++ solution. Only V2V queries
Contribution
• COLD pure-SQL framework
• Works entirely within an open-source RDBMS
−PostgreSQL
• May answer:
−V2V queries (Only tested before on road networks)
−kNN queries (Only tested before on road networks)
−One-to-many (No previous method)
−Reverse-kNN (RkNN) queries (No previous method)
−…on large-scale networks
• Outperforms previous solutions (HLDB)
• Outperforms graph databases (Neo4J)
• Smaller index and table size
HL basics
• Most promising approach on large-scale networks
based on Hub-Labeling
• Build a forward label Lf(u) and a backward label Lb(u) per
vertex u
− For undirected graphs: Lf(u)= Lb(u)
• At least one vertex on the shortest s-t path must
appear as a hub in labels of s and t
• SP queries between s and t may be answered:
− By using only Lf(s) and Lb(t)
− V2V queries take ~μs in main memory
• Also used for:
− kNN (road networks)
− One-to-many (road networks)
− RkNN (large-scale) queries
HL basics
Figure: hub labels of vertices s (diamonds) and t (squares)
HL basics (V2V queries)
L(5): (0,2), (1,1), (5,0)
L(7): (0,2), (1,1), (7,0)
• Scan both labels for common hubs and keep the minimum sum of distances
−Initially: d(5,7)=∞
−Common hub 0: 2 + 2 = 4, so d(5,7)=4
−Common hub 1: 1 + 1 = 2, so d(5,7)=2
−No further common hubs: d(5,7)=2
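A minimal in-memory sketch of this V2V query, assuming each label is a list of (hub, distance) pairs sorted by hub id, is a merge of the two lists. The function name `hl_distance` is hypothetical; the labels are the ones shown for vertices 5 and 7.

```python
def hl_distance(label_s, label_t):
    """Merge two hub labels sorted by hub id; return the best sum over common hubs."""
    i = j = 0
    best = float("inf")  # stays infinite if the labels share no hub
    while i < len(label_s) and j < len(label_t):
        hub_s, d_s = label_s[i]
        hub_t, d_t = label_t[j]
        if hub_s == hub_t:
            best = min(best, d_s + d_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best

L5 = [(0, 2), (1, 1), (5, 0)]
L7 = [(0, 2), (1, 1), (7, 0)]
d57 = hl_distance(L5, L7)
print(d57)
```

The merge touches each label entry at most once, so the query cost is linear in the two label sizes.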
HL basics
• For answering complex queries on HL
framework
−We need the forward labels & …
−…auxiliary data structures
• Backward labels-to-many (One-to-many)
• kNN-backward labels (kNN)
• RkNN-backward labels & kNN Results (RkNN)
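One way to picture the kNN auxiliary structure, as a rough in-memory sketch rather than the exact layout used in this work: for every hub, keep the k objects closest to that hub; a query then combines the forward label of q with those per-hub lists. The graph is assumed to be undirected (so forward and backward labels coincide), vertices are named v0, v1, v5, v7, and the labels for v0 and v1 as well as the object set are assumptions added to make the example complete.

```python
from collections import defaultdict

def build_knn_backward(labels, objects, k):
    """For every hub, keep the k objects closest to it, sorted by distance."""
    per_hub = defaultdict(list)
    for p in objects:
        for hub, d in labels[p]:
            per_hub[hub].append((d, p))
    return {hub: sorted(entries)[:k] for hub, entries in per_hub.items()}

def knn_query(labels, knn_backward, q, k):
    """Combine the label of q with the per-hub object lists; return k nearest objects."""
    best = {}
    for hub, d_qh in labels[q]:
        for d_hp, p in knn_backward.get(hub, []):
            cand = d_qh + d_hp
            if cand < best.get(p, float("inf")):
                best[p] = cand
    # Order by distance, breaking ties by object name.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))[:k]

# Labels for a star graph centred on v1 with unit edge weights (illustrative).
labels = {
    "v0": [("v0", 0)],
    "v1": [("v0", 1), ("v1", 0)],
    "v5": [("v0", 2), ("v1", 1), ("v5", 0)],
    "v7": [("v0", 2), ("v1", 1), ("v7", 0)],
}
objects = {"v1", "v7"}  # hypothetical object set P

knn_backward = build_knn_backward(labels, objects, k=2)
answer = knn_query(labels, knn_backward, "v5", k=2)
print(answer)
```

Truncating each per-hub list to k entries keeps the auxiliary structure small: any object cut from a hub's list has at least k objects that are no farther from that hub.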
Challenges
• How to efficiently store:
• Forward Labels & auxiliary data structures
−Backward labels-to-many (One-to-many)
−kNN-backward labels (kNN)
−RkNN-backward labels & kNN Results (RkNN)
• Translate main memory queries into SQL
commands
• Maximize performance
• Minimize Indexes and Table sizes
• Use an open-source RDBMS
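To make "translate main memory queries into SQL" concrete, here is a small sketch of a V2V query expressed as a self-join on the common hub. It deliberately uses the simplest normalized layout, one row per (v, hub) pair, which is not the grouped one-row-per-vertex layout described for COLD, and it uses SQLite through Python's `sqlite3` module as a stand-in for PostgreSQL. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forward_labels ("
    " v INTEGER, hub INTEGER, dist INTEGER,"
    " PRIMARY KEY (v, hub))"
)
# Labels of vertices 5 and 7 from the V2V example.
rows = [(5, 0, 2), (5, 1, 1), (5, 5, 0),
        (7, 0, 2), (7, 1, 1), (7, 7, 0)]
conn.executemany("INSERT INTO forward_labels VALUES (?, ?, ?)", rows)

V2V_SQL = """
SELECT MIN(s.dist + t.dist)
FROM forward_labels AS s
JOIN forward_labels AS t ON s.hub = t.hub
WHERE s.v = ? AND t.v = ?
"""

def v2v(conn, s, t):
    """Shortest-path distance via common hubs; None if the labels share no hub."""
    (d,) = conn.execute(V2V_SQL, (s, t)).fetchone()
    return d

print(v2v(conn, 5, 7))
```

In this layout the table has one row per label entry and needs a composite key, which is the kind of storage overhead that grouping rows per vertex is meant to reduce.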
V2V and kNN (compared with HLDB)
• V2V: Forward Labels
−Group rows per v
−|V| rows
−No composite PK (PK on v)
−Smaller table size
−Smaller index
• kNN: kNN-Backward Labels
−Group rows per (hub, dist)
−Fewer rows
−Composite PK (hub, dist)
−Smaller table size
−Smaller index
One-to-Many
• Backward labels-to-many stored as kNN-backward labels
• HLDB could not scale for such queries
RkNN
• RkNN-backward labels stored as kNN-backward labels
• kNN results table grouped by object
• Join between RkNN-backward labels and kNN results table
Experimental Settings
• Use Pruned Landmark Labelling (PLL) to
generate Hub Labels
• Implement on PostgreSQL 9.3.6, 64bit
• Compare with
−HLDB (v2v, kNN)
−Neo4J (v2v)
• HDD and SSD
Graphs used
V2V queries
• COLD V2V queries require <9ms
• COLD is 2 - 20.7× faster than HLDB
• COLD is 9 - 143× faster than Neo4j
• PK index in COLD is 3,600 - 4,444× smaller
• DB tables are 131 - 188× smaller for COLD
kNN queries
• For k = 1, COLD is 5 - 19x faster for the five largest datasets
• COLD is 2 - 10x faster even for k = 16
• COLD answers kNN queries in <26ms even for k = 16
• For varying D (= |P|/|V|), COLD is up to 23.4x faster than HLDB
• COLD answers kNN queries for k = 4 on all datasets and all D in <14ms
RkNN queries
• COLD RkNN query times < 20ms for k = 1
• COLD RkNN query times <82ms, for k = 16
• COLD RkNN query times <49ms for all datasets and values of D,
except Youtube for D = 0.1 (109.3ms)
One-to-Many queries
• COLD answers one-to-many queries in <1s for all datasets (and D)
− Except Citeseer2, DBLP (5601ms, 4170ms for D = 0.1).
• COLD answers one-to-many queries to 110,000 objects (Youtube) in 401ms
• One-to-many queries are only 2 - 30% faster on the SSD
Summary
• Extensive experimentation: compared to the state of the
art and a graph database, on both HDD and SSD
• Outperform rivals on all metrics: query performance,
storage utilization, scalability
• Provide comprehensive details for reproducibility
• Simple applications may use a DB for graph
distance queries
Enhanced Location-Based Services Revisited
TwitterViz: Visualizing and Exploring the Twittersphere
• Twittersphere
o 100 million active users
o 500 million tweets per day
• Publicly available
• Real time stream
• Need for Management & Analysis
o Spatial
o Temporal
o Social
References
• COLD: A. Efentakis, C. Efstathiades, D. Pfoser. “COLD. Revisiting Hub Labels on the database for large-scale graphs”, SSTD’15
• TwitterViz: C. Efstathiades, H. Antoniou, D. Skoutas, Y. Vassiliou. “TwitterViz: Visualizing and Exploring the Twittersphere”, SSTD’15
Thank You