Shortest paths on large graphs: Systems, Algorithms, Applications

Andrey Gubichev
TU München
January 2012
Outline
Introduction
Systems
Algorithms
Applications
Semantic Web
Social Search
Everything is a graph
[Figure panels: Internet graph (Richardson); Wikipedia (Tulip); Web graph; social network; proteins (Bordalier Inst)]
RDF: format for graph data
[Figure: RDF graph. Node labels: Marie Curie, Maria Sklodowska, Warsaw, Poland, 1867, 1934, U Paris, Henri Becquerel, Pierre Curie, Nobel Prize Chemistry, Nobel Prize Physics. Edge labels: bornAs, bornIn, in, bornOn, diedOn, alma mater, adviser, marriedTo, hasWon]
RDF:
(id1,Name,"Marie Curie")
(id1,bornOn,1867)
(id1,bornIn,id2)
(id2,Name,"Warsaw")
(id2,locatedIn,id3)
(id3,Name,"Poland")
(G.Weikum, WSDM'09)
• pay-as-you-go: schema-agnostic, schema-later
• RDF triples form ER graph
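A minimal sketch of how a list of RDF triples can be read as a labelled directed graph. The triples are the ones listed on the slide; the helper name `triples_to_graph` and the adjacency-list layout are assumptions made for illustration only.

```python
from collections import defaultdict

# Triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
    ("id2", "locatedIn", "id3"),
    ("id3", "Name", "Poland"),
]


def triples_to_graph(triples):
    """Hypothetical helper: map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = triples_to_graph(TRIPLES)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Each subject becomes a node, each predicate an edge label, and each object either another node (id2, id3) or a literal value.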
RDF: a lot of data out there
Linked Data Project, linkeddata.org
Linked Data: extract explicit knowledge (ER-oriented facts) from the world's best information sources (Wikipedia, Web, Web 2.0)
SPARQL: a query language
Select ?c
Where
{
?p isa scientist.
?p bornIn ?t.
?p hasWon ?a.
?t locatedIn ?c.
?a Name NobelPrize.
}
• SQL-like syntax
• triple patterns
• common variables form joins
SPARQL: a query language for RDF
Select ?c
Where
{
?p isa scientist.
?p bornIn ?t.
?p bornOn ?b.
?p hasWon ?a.
?t locatedIn ?c.
?a Name NobelPrize.
Filter (?b < 1900)
}
• SQL-like syntax
• triple patterns
• common variables form joins
• filter predicates
SPARQL: a query language
Select Distinct ?c
Where
{
?p ?r1 ?t.
?t ?r2 ?c.
?c isa Country.
?p bornOn ?b.
Filter (?b > 1945)
}
• SQL-like syntax
• triple patterns
• common variables form joins
• filter predicates
• wildcard joins
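To illustrate "common variables form joins", here is a small nested-loop sketch of triple-pattern matching. It is not how a real SPARQL engine evaluates queries; the function names and the toy data are assumptions made for this example.

```python
def match_pattern(pattern, triple, binding):
    """Extend a variable binding so that the pattern matches the triple.

    Terms starting with '?' are variables. Returns None on a mismatch.
    """
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new_binding.get(term, value) != value:
                return None  # variable already bound to a different value
            new_binding[term] = value
        elif term != value:
            return None  # constant does not match
    return new_binding


def evaluate(patterns, triples):
    """Join the patterns one by one; shared variables act as join keys."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Hypothetical toy data, not taken from the slides.
TRIPLES = [
    ("MarieCurie", "isa", "scientist"),
    ("MarieCurie", "bornIn", "Warsaw"),
    ("Warsaw", "locatedIn", "Poland"),
    ("Einstein", "isa", "scientist"),
    ("Einstein", "bornIn", "Ulm"),
]
QUERY = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?t", "locatedIn", "?c"),
]
countries = {b["?c"] for b in evaluate(QUERY, TRIPLES)}
print(countries)
```

The variable ?p joins the first two patterns and ?t joins the last two, so only bindings that agree on both survive.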
RDF & SPARQL Engines
giant triples table (Sesame/OpenRDF, YARS2 (DERI))
S    P       O
id1  Name    Marie Curie
id1  bornOn  1867
id1  bornIn  id2
id2  Name    Warsaw
...
clustered property tables (Jena (HP Labs), Oracle RDF MATCH)
Person
S    Name     bornOn  bornIn  ...
id1  Marie C  1867    id3     ...
id2  Henri B  1852    id9     ...
...
Town
S    Name    Country
id3  Warsaw  id11
...
property table (C-Store (MIT), MonetDB (CWI))
bornOn
S    O
id1  1867
id5  1852
...
Advisor
S    O
id1  id5
...
Why a new engine?
Three main things in database design:
1. Performance
2. Performance
3. Performance
Scalable Semantic Web: RDF-3X Engine
[T.Neumann et al: VLDB'08]
• tuning-free system architecture: giant triple table
• map literals into ids (dictionary):
S    P       O                  S  P  O
id1  Name    Marie Curie        1  3  4
id1  bornOn  1867         →     1  5  6
id1  bornIn  id2                1  7  2
id2  Name    Warsaw             2  3  8
...                             ...
• precompute exhaustive indexing for SPO triples:
SPO, SOP, OPS, OSP, PSO, POS,
SP*, SO*, OS*, PO*, OP*, S*, P*, O*
very high compression, index-only store
Example (POS order):
P  O  S
3  4  1
3  8  2
5  6  1
7  2  1
• directly store indexes into clustered B+ trees
• can choose any order for scan and join
• also store two mapping indexes: literal → id, id → literal
• efficient merge joins with order-preservation
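The dictionary encoding and the six full permutation indexes can be sketched as follows. This is only an illustration: sorted Python lists stand in for the compressed clustered B+ trees, ids are assigned in order of first appearance (so they differ from the ids on the slide), and all function names are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations


def build_dictionary(triples):
    """Assign an integer id to every distinct term; return both mapping directions."""
    to_id, to_literal = {}, {}
    for triple in triples:
        for term in triple:
            if term not in to_id:
                to_id[term] = len(to_id) + 1
                to_literal[to_id[term]] = term
    return to_id, to_literal


def build_indexes(encoded):
    """Materialize all six orderings (SPO, SOP, ...) as sorted lists of id triples."""
    indexes = {}
    for order in permutations("SPO"):
        columns = ["SPO".index(c) for c in order]
        indexes["".join(order)] = sorted(
            tuple(t[c] for c in columns) for t in encoded
        )
    return indexes


def prefix_scan(index, prefix):
    """Range scan: all entries of a sorted index that start with the given prefix."""
    start = bisect_left(index, prefix)
    result = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        result.append(entry)
    return result


TRIPLES = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
]
to_id, to_literal = build_dictionary(TRIPLES)
encoded = [tuple(to_id[term] for term in triple) for triple in TRIPLES]
indexes = build_indexes(encoded)

# All triples with predicate Name, read from the POS index.
names = prefix_scan(indexes["POS"], (to_id["Name"],))
print([(to_literal[s], to_literal[o]) for _, o, s in names])
```

Because every ordering is available, a pattern with any combination of bound positions can be answered by a prefix scan, and the scan results arrive sorted, which is what makes merge joins possible.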
RDF-3X Query Optimization
[T.Neumann et al: VLDB’08]
• bottom-up dynamic programming for plan enumeration
• exploit numerous indexes, order-preservation
• cost model based on selectivity estimation
Evaluation
[T.Neumann et al: SIGMOD'09]
• Queries like: find a Polish scientist with a French advisor, both got some awards
• YAGO knowledge base: 40 million triples
• Billion Triple dataset, Uniprot (845 million): similar results
Try it out! RDF-3X is freely available: http://code.google.com/p/rdf3x/
What is missing?
What kinds of queries CAN we answer?
• Find the latitude and longitude of the Eiffel Tower
• Find politicians who are also scientists
What kinds of queries can we NOT answer?
• Find common things between Angela Merkel and Arnold Schwarzenegger
• Find all European-born Nobel prize winners
Why?
They require path traversals over the RDF graph.
Why is SPARQL not enough?
Sometimes we need to form join chains with unknown length
(e.g., we need the transitive closure of the predicate).
Example Triples
Einstein bornIn Ulm.
Ulm locatedIn Baden-Württemberg.
Baden-Württemberg locatedIn Germany.
Humboldt bornIn Berlin.
Berlin locatedIn Germany.
Were they both born in Germany? Yes.
How to figure that out?
[Figure: two paths ending in Germany: Einstein, bornIn, Ulm, locatedIn, Baden-Württemberg, locatedIn, Germany; and Humboldt, bornIn, Berlin, locatedIn, Germany]
Why is SPARQL not enough?
Sometimes we need to form join chains with unknown length
(e.g., we need the transitive closure of the predicate).
Example Triples
Humboldt bornIn Berlin.
Berlin locatedIn Germany.
Einstein bornIn Ulm.
Ulm locatedIn Baden-Württemberg.
Baden-Württemberg locatedIn Germany.
How to find all scientists that were born in Germany?
SPARQL
?person bornIn ?place. ?place locatedIn Germany.
UNION
?person bornIn ?place. ?place locatedIn ?place1. ?place1 locatedIn Germany.
UNION
...
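The unbounded UNION above is really a transitive closure over locatedIn. A minimal sketch of computing it with a breadth-first traversal, using the example triples from the slide (function names are hypothetical):

```python
from collections import deque

TRIPLES = [
    ("Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Baden-Württemberg"),
    ("Baden-Württemberg", "locatedIn", "Germany"),
    ("Humboldt", "bornIn", "Berlin"),
    ("Berlin", "locatedIn", "Germany"),
]


def places_within(region, triples):
    """All places that reach the region through one or more locatedIn edges."""
    contained_in = {}
    for subject, predicate, obj in triples:
        if predicate == "locatedIn":
            contained_in.setdefault(obj, []).append(subject)
    found, queue = set(), deque([region])
    while queue:
        current = queue.popleft()
        for place in contained_in.get(current, ()):
            if place not in found:
                found.add(place)
                queue.append(place)
    return found


def born_in(region, triples):
    """People whose birthplace lies anywhere inside the region."""
    places = places_within(region, triples)
    return {s for s, p, o in triples if p == "bornIn" and o in places}


print(born_in("Germany", TRIPLES))
```

The traversal follows locatedIn chains of any length, which is exactly what a fixed number of joins cannot express.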
SPARQL with paths
?person bornIn ?place. ?place ??path Germany.
SPARQL with path variables
Introduced by K.Anyanwu et al. (WWW’07)
• Example: select ??p ?place where {?place ??p Germany} (path triple)
• ??p: there exists a path from ?place to Germany in the RDF graph
• we consider only shortest paths
• we can specify filter conditions on ??p
• we can join such path patterns with regular patterns
Example
select ?name where { ?m type Mountain.
?m hasName ?name.
?m ??location Europe.
filter(ContainsOnly(??location, locatedIn)) }
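A rough sketch of what evaluating a path triple with a ContainsOnly filter might involve: find one shortest path with BFS, record the predicates along it, and test them. The exact semantics in the cited work may differ (for example when several shortest paths exist); the function names and the toy triples are assumptions.

```python
from collections import deque


def shortest_label_path(triples, source, target):
    """BFS over the triple graph; returns the predicates along one shortest path, or None."""
    out_edges = {}
    for subject, predicate, obj in triples:
        out_edges.setdefault(subject, []).append((predicate, obj))
    labels = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return labels[node]
        for predicate, neighbour in out_edges.get(node, ()):
            if neighbour not in labels:
                labels[neighbour] = labels[node] + [predicate]
                queue.append(neighbour)
    return None


def contains_only(path_labels, predicate):
    """Hypothetical analogue of ContainsOnly: every edge on the path has this predicate."""
    return path_labels is not None and all(l == predicate for l in path_labels)


# Hypothetical toy data: mountain M1 lies in Europe, M2 is only linked to it indirectly.
TRIPLES = [
    ("M1", "locatedIn", "Range1"),
    ("Range1", "locatedIn", "Europe"),
    ("M2", "firstClimbedBy", "P1"),
    ("P1", "bornIn", "Europe"),
]
for mountain in ("M1", "M2"):
    path = shortest_label_path(TRIPLES, mountain, "Europe")
    print(mountain, path, contains_only(path, "locatedIn"))
```

Both mountains are connected to Europe, but only the first is connected purely through locatedIn edges, so only it passes the filter.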
How to execute SPARQL with path variables?
[A.Gubichev et al: WebDB’11]
We build upon RDF-3X. Two goals:
• Query Optimization: How to estimate cardinality of path triples?
• Physical Level: How to perform path scan efficiently?
Can we do better?
• Dijkstra's algorithm is fine, but let's consider approximate algorithms (trade quality for speed)
• Let's change the setting for now: shortest paths on a social network
Social network:
• a set of people
• a social relationship linking them
Problem Statement
Exact shortest path:
• V: users, E: "friend of" relationships
• Graph G(V, E): directed, unweighted, static
• Given u, v ∈ V, find the shortest path from u to v
Approximate shortest path:
• Graph is disk-resident
• Offline step: do some precomputation, store on disk
• Online step: for u, v ∈ V quickly find some path from u to v
• Approximation error: $\frac{|\text{approximate}| - |\text{exact}|}{|\text{exact}|}$
Different approaches
Exact SP
• Dijkstra: very slow
• A*: works well for road networks, slow for online social networks (OSN)
• Hierarchy-based decomposition: works well for road networks, slow for OSN
Approximate SP
• Different types of preprocessing: keep distances from all nodes to a small subset of nodes (random, with high degree or centrality)
• Poor results for OSN: average error is ≥ 10%
• Find just the distance, not the path itself
Precomputation
Step 1. Set $r = \lfloor \log |V| \rfloor$
Step 2. Sample $r + 1$ sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \ldots, 2^r$
Step 3. For every $u \in V$ and for every set $S$:
1. Find the closest nodes to $u$ in $S$ (landmarks):
landmark $h \in S$: $\mathrm{dist}(u, h) = \mathrm{dist}(u, S)$
landmark $h' \in S$: $\mathrm{dist}(h', u) = \mathrm{dist}(S, u)$
2. Find the distance from $u$ to $h$ and from $h'$ to $u$
Precomputation - WSDM’10 approach
[A.Das Sarma et al: WSDM’10]
[Figure: node u and its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr, with edges labelled by distances]
Sketch in RDF:
⟨u⟩ ⟨2⟩ ⟨h1⟩
⟨u⟩ ⟨3⟩ ⟨h2⟩
···
⟨u⟩ ⟨1⟩ ⟨hr⟩
Precomputation - our approach
[A.Gubichev et al: CIKM’10]
[Figure: node u and its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr, with intermediate nodes x and y on the paths]
Sketch in RDF:
⟨u⟩ ⟨x⟩ ⟨h1⟩
⟨u⟩ ⟨x y⟩ ⟨h2⟩
···
⟨u⟩ ⟨ ⟩ ⟨hr⟩
Precomputation
Step 1. Set $r = \lfloor \log |V| \rfloor$
Step 2. Sample $r + 1$ sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \ldots, 2^r$
Step 3. For every $u \in V$ and for every set $S$:
1. Find the closest nodes to $u$ in $S$ (landmarks):
landmark $h \in S$: $\mathrm{dist}(u, h) = \mathrm{dist}(u, S)$
landmark $h' \in S$: $\mathrm{dist}(h', u) = \mathrm{dist}(S, u)$
2. Find the path from $u$ to $h$ and from $h'$ to $u$
3. Store the paths (RDF): ⟨u⟩ ⟨path⟩ ⟨h⟩, ⟨h'⟩ ⟨path'⟩ ⟨u⟩
Step 4. Repeat Steps 2-3 k times (we use k = 2).
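The precomputation steps can be sketched in memory as follows. Several things here are assumptions and not taken from the slides: the logarithm is taken base 2 (so that the largest set size 2^r never exceeds |V|), a multi-source BFS is used to find the closest landmark (valid because the graph is unweighted), the graph is a plain adjacency dict instead of a disk-resident RDF store, and all function names are hypothetical.

```python
import random
from collections import deque


def reverse_graph(adj):
    """Adjacency lists with every edge flipped."""
    rev = {}
    for v, ws in adj.items():
        rev.setdefault(v, [])
        for w in ws:
            rev.setdefault(w, []).append(v)
    return rev


def sample_seed_sets(nodes, rng):
    """Steps 1-2: r = floor(log2 |V|), then random sets of sizes 1, 2, 4, ..., 2^r."""
    r = len(nodes).bit_length() - 1
    return [rng.sample(nodes, 2 ** i) for i in range(r + 1)]


def closest_seed_paths(adj, seeds):
    """Multi-source BFS: for every reachable node, a shortest path from its closest seed."""
    paths = {s: [s] for s in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in paths:
                paths[w] = paths[v] + [w]
                queue.append(w)
    return paths


def precompute_sketches(adj, k=2, seed=0):
    """Steps 3-4: per node, paths to landmarks (u -> h) and from landmarks (h' -> u)."""
    rng = random.Random(seed)
    nodes = sorted(set(adj) | {w for ws in adj.values() for w in ws})
    rev = reverse_graph(adj)
    to_landmark = {u: [] for u in nodes}
    from_landmark = {u: [] for u in nodes}
    for _ in range(k):
        for seeds in sample_seed_sets(nodes, rng):
            # BFS on the reversed graph yields h -> u there, i.e. u -> h in the original.
            for u, path in closest_seed_paths(rev, seeds).items():
                to_landmark[u].append(path[::-1])
            for u, path in closest_seed_paths(adj, seeds).items():
                from_landmark[u].append(path)
    return to_landmark, from_landmark


adj = {0: [1], 1: [2], 2: [3], 3: [0, 4], 4: [0]}
to_landmark, from_landmark = precompute_sketches(adj, k=2, seed=1)
print(to_landmark[0])
```

Running one BFS per sampled set, rather than one per node, finds the closest landmark for all nodes at once. Copying a full path for every node is wasteful and only acceptable in a toy example.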
Sketch
Sketch for a node u consists of
1. Landmarks $h_1, \ldots, h_{kr}$
2. Paths from u to landmarks
3. Paths from landmarks to u
Sketch for u consists of two trees (u is the root)
We keep sketches for every u ∈ V
SKETCH algorithm: online part
[A.Das Sarma et al: WSDM'10]
Input: nodes s, d ∈ V
1. Load all the distances from s
2. Load all the distances to d
3. Find common landmarks
4. Construct the paths
5. Select the shortest distance
Output: distance from s to d
[Figure: nodes s and d linked through their common landmarks, edges labelled with distances]
SKETCH algorithm with paths
[A.Gubichev et al: CIKM’10]
Input: nodes s, d ∈ V
1. Load all the paths from s
2. Load all the paths to d
3. Find common landmarks
4. Construct the paths
5. Select the shortest path
Output: path from s to d: ⟨s x y h z d⟩
[Figure: path from s through x and y to the common landmark h, then through z to d]
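The online steps can be sketched as a join on common landmarks. The function below assumes the sketches are already loaded as lists of node paths (paths from s end in a landmark, paths to d start with one); the name `sketch_query` and the sample data are illustrative.

```python
def sketch_query(paths_from_s, paths_to_d):
    """Concatenate paths that share a landmark and return the shortest result, or None."""
    # Keep the shortest stored path to d for every landmark.
    to_d_by_landmark = {}
    for path in paths_to_d:
        landmark = path[0]
        if landmark not in to_d_by_landmark or len(path) < len(to_d_by_landmark[landmark]):
            to_d_by_landmark[landmark] = path
    best = None
    for path in paths_from_s:
        landmark = path[-1]
        if landmark in to_d_by_landmark:
            # Drop the landmark from the second half so it is not repeated.
            candidate = path + to_d_by_landmark[landmark][1:]
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


paths_from_s = [["s", "x", "y", "h"], ["s", "a", "g"]]
paths_to_d = [["h", "z", "d"], ["q", "d"]]
print(sketch_query(paths_from_s, paths_to_d))
```

With these inputs the only common landmark is h, so the answer is the path s x y h z d from the slide.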
Datasets
• Slashdot: 77K nodes, undirected
• YouTube: 1.1M nodes
• Flickr: 1.7M nodes
• WikiTalk: 2.2M nodes
• Twitter: 2.4M nodes
• Orkut: 3M nodes, undirected
Sources: Stanford, MPI, Telefonica Research
Approximation error of the Sketch algorithm
Error = $\frac{|\text{approximate}| - |\text{exact}|}{|\text{exact}|}$
Dataset (#nodes)    Sketch error
Slashdot (77K)      46%
YouTube (1.1M)      30%
Flickr (1.7M)       28%
WikiTalk (2.2M)     55%
Twitter (2.4M)      51%
Orkut (3M)          71%
First modification
We find the path, not just the distance!
Are there cycles?
Construct a shorter path
[Figure: a path from s to d that visits node a twice; removing the cycle at a gives a shorter path]
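Cycle elimination on a concatenated path can be sketched in a single pass. The function name `remove_cycles` is hypothetical; the idea is simply to cut back to the first occurrence whenever a node repeats.

```python
def remove_cycles(path):
    """Return the path with every cycle cut out (each node appears at most once)."""
    position = {}  # node -> index in the output path
    result = []
    for node in path:
        if node in position:
            # Node seen before: drop everything after its first occurrence.
            for removed in result[position[node] + 1:]:
                del position[removed]
            del result[position[node] + 1:]
        else:
            position[node] = len(result)
            result.append(node)
    return result


print(remove_cycles(["s", "a", "h", "a", "d"]))
```

Here the detour a, h, a collapses to a single visit of a, giving s, a, d.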
Approximation error of the first modification
No time overhead!
Dataset (#nodes)    Sketch error    Sketch I error
Slashdot (77K)      46%             26%
YouTube (1.1M)      30%             12%
Flickr (1.7M)       28%             11%
WikiTalk (2.2M)     55%             31%
Twitter (2.4M)      51%             38%
Orkut (3M)          71%             48%
Second modification
Are there any "hidden" connections?
If yes, construct a shorter path
[Figure: a path from s to d with a possible direct edge between two non-adjacent nodes of the path]
How to check it?
1. For every node in the path load the list of friends from the original dataset
2. For every pair of nodes from the path check whether they are friends
Number of nodes in the path is usually small!
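One way to realize the pairwise check is a BFS restricted to the nodes of the path, using their friend lists. This is a sketch under assumptions: `friends` is an in-memory dict standing in for loading friend lists from the original dataset, and the function name is hypothetical.

```python
from collections import deque


def shortcut_path(path, friends):
    """Shortest path from path[0] to path[-1] using only nodes that occur on the path."""
    allowed = set(path)
    source, target = path[0], path[-1]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for friend in friends.get(node, ()):
            if friend in allowed and friend not in parent:
                parent[friend] = node
                queue.append(friend)
    if target not in parent:
        return list(path)  # friend lists do not cover the path; keep it unchanged
    result = []
    node = target
    while node is not None:
        result.append(node)
        node = parent[node]
    return result[::-1]


# Hypothetical friend lists: the edge x -> z is the "hidden" connection.
friends = {"s": {"x"}, "x": {"y", "z"}, "y": {"h"}, "h": {"z"}, "z": {"d"}}
print(shortcut_path(["s", "x", "y", "h", "z", "d"], friends))
```

Since the path is short, the restricted BFS touches only a handful of nodes, which matches the remark that the number of nodes in the path is usually small.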
Approximation error of the second modification
Dataset (#nodes)    Sketch error    Sketch I error    Sketch II error
Slashdot (77K)      46%             26%               0.6%
YouTube (1.1M)      30%             12%               0.6%
Flickr (1.7M)       28%             11%               0.3%
WikiTalk (2.2M)     55%             31%               0.2%
Twitter (2.4M)      51%             38%               0.8%
Orkut (3M)          71%             48%               0.6%
Tree algorithm
Paths from a node to landmarks form a tree
[Figure: tree rooted at s whose branches end in landmarks]
• Load paths from s and to d
• Start BFS from s and d
• For every visited node load a list of friends
• For every pair of visited nodes check:
1. are they equal? (s3, d1)
2. are they friends? (s1, d)
• Form a new path and put it to the queue Q
• Don't go too deep: terminate if $level_s + level_d > Q.top.length$ (in the example, $level_s + level_d = 4 > 2$)
[Figure: tree from s with nodes s1, ..., s5 and tree to d with nodes d1, ..., d4]
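A simplified sketch of the Tree idea. Instead of an interleaved BFS with a priority queue, it enumerates pairs of tree nodes in order of depth and prunes once the combined depth can no longer beat the best path found, which plays the role of the termination rule above. The function name, the in-memory `friends` dict and the sample data are assumptions.

```python
def tree_query(paths_from_s, paths_to_d, friends):
    """Best s -> d path obtained by joining the two sketch trees at equal or adjacent nodes."""
    # Tree from s: node -> shortest known prefix s .. node.
    from_s = {}
    for path in paths_from_s:
        for i, node in enumerate(path):
            if node not in from_s or i + 1 < len(from_s[node]):
                from_s[node] = path[:i + 1]
    # Tree to d: node -> shortest known suffix node .. d.
    to_d = {}
    for path in paths_to_d:
        for i, node in enumerate(path):
            if node not in to_d or len(path) - i < len(to_d[node]):
                to_d[node] = path[i:]
    s_side = sorted(from_s.values(), key=len)
    d_side = sorted(to_d.values(), key=len)
    best = None
    for prefix in s_side:
        for suffix in d_side:
            levels = (len(prefix) - 1) + (len(suffix) - 1)
            if best is not None and levels >= len(best) - 1:
                break  # deeper pairs cannot give a shorter path
            u, v = prefix[-1], suffix[0]
            if u == v:
                candidate = prefix + suffix[1:]
            elif v in friends.get(u, ()):
                candidate = prefix + suffix
            else:
                continue
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


paths_from_s = [["s", "s1", "s3"]]
paths_to_d = [["s3", "d2", "d"]]
friends = {"s": {"s1"}, "s1": {"s3", "d"}, "s3": {"d2"}, "d2": {"d"}}
print(tree_query(paths_from_s, paths_to_d, friends))
```

The two trees meet at s3 (a path of four edges), but the friendship s1 -> d gives the two-edge path s, s1, d, so deeper pairs are pruned.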
Approximation error of the Tree algorithm
Dataset     Sketch error    Sketch I error    Sketch II error    Tree error
Slashdot    46%             26%               0.6%               0
YouTube     30%             12%               0.6%               0.06%
Flickr      28%             11%               0.3%               0.04%
WikiTalk    55%             31%               0.2%               0
Twitter     51%             38%               0.8%               0.03%
Orkut       71%             48%               0.6%               0.1%
Experimental setup
• Pick 100 nodes (uniformly at random) from the OSN
• For each node compute the Shortest Path Tree (Dijkstra)
• The result is $\{(x, y, dist) \mid x, y \in V,\ dist = \mathrm{dist}(x, y)\}$
• Group triples by distance and randomly choose 50 triples from every group
• For every chosen triple (x, y, dist): find approximate shortest paths from x to y and compare their lengths with dist
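The evaluation procedure can be sketched as follows. BFS is substituted for Dijkstra here, which gives the same distances because the graph is unweighted; the function names and the toy graph are assumptions, and `approximate_length` stands for whichever approximate algorithm is being measured.

```python
import random
from collections import defaultdict, deque


def bfs_distances(adj, source):
    """Exact distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def sample_test_triples(adj, num_sources, per_group, rng):
    """Pick random sources, group (x, y, dist) by distance, sample from every group."""
    nodes = sorted(adj)
    by_distance = defaultdict(list)
    for x in rng.sample(nodes, min(num_sources, len(nodes))):
        for y, d in bfs_distances(adj, x).items():
            if d > 0:
                by_distance[d].append((x, y, d))
    chosen = []
    for d in sorted(by_distance):
        group = by_distance[d]
        chosen.extend(rng.sample(group, min(per_group, len(group))))
    return chosen


def average_error(test_triples, approximate_length):
    """Mean of (|approximate| - |exact|) / |exact| over the chosen triples."""
    errors = [(approximate_length(x, y) - d) / d for x, y, d in test_triples]
    return sum(errors) / len(errors)


# Toy line graph 0 -> 1 -> 2 -> 3; the slides use 100 sources and 50 triples per group.
adj = {0: [1], 1: [2], 2: [3], 3: []}
triples = sample_test_triples(adj, num_sources=4, per_group=50, rng=random.Random(1))
print(average_error(triples, lambda x, y: (y - x) + 1))
```

Sampling per distance group keeps long distances, which are rare in social networks, represented in the test set.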
Implementation details
• Datasets in RDF:
⟨user1⟩ ⟨friend-of⟩ ⟨user2⟩
• Precomputed paths in RDF:
⟨u⟩ ⟨path⟩ ⟨h⟩
⟨h'⟩ ⟨path'⟩ ⟨u⟩
• RDF-3X for datasets and precomputed data
• C++
• Laptop: 2.0 GHz Intel Core 2 Duo, 4 GB RAM, 3 MB L2 cache
Time
Dataset (#nodes)    Sketch (sec)    Sketch II (sec)    Tree (sec)    Dijkstra (sec)    Dijkstra (queue)
Flickr (1.7M)       1.2             2.1                1.9           73                696K
WikiTalk (2.2M)     0.7             1.4                1.7           101               2M
Twitter (2.4M)      1.9             3.9                4.0           119               1.1M
Orkut (3M)          1.1             2.6                2.7           503               2.5M
Disk space
Disk space for precomputed data, GB
Dataset     Dataset size    Sketch with distances    Sketch with paths
Flickr      0.57            2.3                      4.4
WikiTalk    0.22            1.9                      2.1
Twitter     1.3             3.4                      6.1
Orkut       5.6             6.0                      7.4
Number of shortest paths
We find several shortest paths:
Dataset (#nodes)    Sketch II    Tree
Flickr (1.7M)       33.3         55.6
WikiTalk (2.2M)     18.6         50.7
Twitter (2.4M)      45.5         92
Orkut (3M)          9.5          30
Application #1: Semantic Web
• SPARQL 1.1: SPARQL + path traversal
• Querying the DB of entire human knowledge (everything that Wikipedia knows)
Small World
Milgram 1967
• People are given letters, asked to forward to one friend
• Source: random Omahans; Target: stockbroker in Sharon, MA
• Of completed chains, averaged 6 hops to reach target
Shortest paths on Social Networks
Shortest paths are interesting...
• per se:
  • what is the distance between you and Angela Merkel?
  • for geeks: Erdős number
• as an important primitive for
  • social network analysis (diameter, centrality, etc.)
  • social search
• Of course, we can run a one-to-many shortest paths algorithm
Example: John searches for Mary
Ranking:
1. Mary A
2. Mary B
3. Mary C
M. Potamias et al. CIKM 2009
Acknowledgements
• Srikanta Bedathur
• Gerhard Weikum
• Josep M. Pujol
• Thomas Neumann
• Sihem Amer-Yahia
Thank you!
Questions?