The Link Prediction Problem for Social Networks

David Liben-Nowell
Jon Kleinberg
January 8, 2004
Introduction
Data and Experimental Setup
Methods for Link Prediction
Results and Discussion
Introduction
Link prediction problem:
Can we infer which new interactions are likely to occur in the future?
Some experiments suggest that information about the future can be
extracted from network topology alone
Motivation
• Understanding the mechanisms by which social networks evolve
• Modeling the evolution of a social network using features intrinsic to the network itself
How will we do that?
• We seek to predict the edges that will be added during the interval from time $t$ to a future time $t'$
Motivation
• Problem: some collaborations can be hard to predict
• Assumption: most new collaborations are hinted at by the topology of the network
• Our goal:
  • Make this notion precise
  • Find the measures of proximity that lead to the most accurate link predictions
Applications
• Large organizations
• Security
• Inferring missing links
If I had only one hour to save the world,
I would spend fifty-five minutes defining the problem,
and only five minutes finding the solution.
Definitions
• $G = \langle V, E \rangle$
• $e = \langle u, v \rangle \in E$: $e$ represents an interaction between $u$ and $v$ that took place at time $t(e)$
• $G[t, t']$ is the subgraph of $G$ consisting of all edges with a timestamp between $t$ and $t'$
Definitions
• We choose 4 times: $t_0 < t_0' < t_1 < t_1'$
  • $[t_0, t_0']$: training interval (in the paper: [1994, 1996])
  • $[t_1, t_1']$: test interval (in the paper: [1997, 1999])
• Output: a list of edges not present in $G[t_0, t_0']$ that are predicted to appear in $G[t_1, t_1']$
Definitions
Q: What is the problem?
A: Social networks grow through the addition of nodes as well as edges, and we cannot meaningfully predict edges for nodes that are absent from the training interval
Definitions
To solve this problem we use 2 parameters:
• $K_{training}$
• $K_{test}$
Each is set to 3.
We then define a set:
• Core: all nodes incident to at least $K_{training}$ edges in $G[t_0, t_0']$ and at least $K_{test}$ edges in $G[t_1, t_1']$
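The Core definition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `core_nodes`, the edge-list representation (one pair per co-authored paper), and the toy data are all assumptions made for the example.

```python
from collections import Counter

def core_nodes(train_edges, test_edges, k_training=3, k_test=3):
    """Nodes incident to >= k_training training edges and >= k_test test edges.

    Each edge is a (u, v) pair; a pair may repeat if two authors
    collaborated more than once (assumed representation).
    """
    train_deg, test_deg = Counter(), Counter()
    for u, v in train_edges:
        train_deg[u] += 1
        train_deg[v] += 1
    for u, v in test_edges:
        test_deg[u] += 1
        test_deg[v] += 1
    return {node for node in train_deg
            if train_deg[node] >= k_training and test_deg[node] >= k_test}

# Hypothetical toy data: only "a" has 3 edges in both intervals.
train = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
test = [("a", "b"), ("a", "e"), ("a", "c"), ("b", "e"), ("c", "e")]
core = core_nodes(train, test)
```

Here "b" meets the training threshold but has only 2 test edges, and "e" never appears in training, so neither is in Core.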
Definitions
• $E_{old}$: edges between Core authors that first appear during the training interval
• $E_{new}$: edges $\langle u, v \rangle$ such that $u$ and $v$ co-author a paper during the test interval but not during the training interval
Evaluating a link predictor
• $E^*_{new} := E_{new} \cap (\mathrm{Core} \times \mathrm{Core})$
• $n := |E^*_{new}|$
  Example: $E^*_{new} = \{(a, b), (a, c), (b, d), (c, e)\} \rightarrow n = 4$
• Each link predictor $p$ outputs a ranked list $L_p$ of pairs in decreasing order of predicted confidence
• From $L_p$ we take the first $n$ pairs in $\mathrm{Core} \times \mathrm{Core}$, and determine the size of the intersection of this set of pairs with the set $E^*_{new}$
  Example: $L_p = ((a, b), (a, c), (b, d), (c, e), (d, e), (e, f)) \rightarrow L_p(n) = ((a, b), (a, c), (b, d), (c, e))$
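The evaluation procedure above can be sketched as follows. The function name `evaluate`, the tuple representation of pairs, and the second ranking are assumptions for illustration. Pairs are treated as unordered.

```python
def evaluate(ranked_pairs, e_new, core):
    """Return (n, correct): n = |E*_new|, correct = hits among the top-n Core pairs."""
    def canon(pair):
        return tuple(sorted(pair))  # treat (u, v) and (v, u) as the same edge

    def in_core(pair):
        return pair[0] in core and pair[1] in core

    e_new_star = {canon(p) for p in e_new if in_core(p)}
    n = len(e_new_star)
    # Keep only Core x Core pairs, preserving rank order, then truncate to n.
    top_n = [canon(p) for p in ranked_pairs if in_core(p)][:n]
    return n, len(set(top_n) & e_new_star)

core = set("abcdef")
e_new = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]

# Ranking from the example on the slide.
l_p = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e"), ("d", "e"), ("e", "f")]
# A second, hypothetical ranking that gets only two of its top four right.
l_q = [("d", "e"), ("a", "b"), ("e", "f"), ("c", "e"), ("a", "c")]

result_p = evaluate(l_p, e_new, core)
result_q = evaluate(l_q, e_new, core)
```

Because exactly $n$ predictions are scored and exactly $n$ edges are correct, the fraction `correct / n` serves as both precision and recall at this cutoff.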
Database
The researchers constructed 5 co-authorship networks G from www.arXiv.org:

                                                      training period            Core
dataset   field                                       authors  papers  edges     authors  |E_old|  |E_new|
astro-ph  astrophysics                                   5343    5816  41852        1561     6178     5751
cond-mat  condensed matter                               5469    6700  19881        1253     1899     1150
gr-qc     general relativity and quantum cosmology       2122    3287   5724         486      519      400
hep-ph    high energy physics, phenomenology             5414   10254  17806        1790     6654     3294
hep-th    high energy physics, theory                    5241    9498  15842        1438     2311     1576
Methods
• All the methods assign a connection weight $\mathrm{score}(x, y)$ to pairs of nodes $\langle x, y \rangle$
• They then produce a ranked list in decreasing order of $\mathrm{score}(x, y)$
• Thus, they can be viewed as computing a measure of proximity or “similarity” between nodes $x$ and $y$, relative to the network topology
List of Methods
• Methods based on:
  • node neighborhoods
    • Common neighbors
    • Jaccard's coefficient
    • Adamic/Adar
    • Preferential attachment
  • ensemble of all paths
    • Katz
    • Hitting time
    • PageRank
    • SimRank
Methods based on node neighborhoods: Common neighbors
$\mathrm{score}(x, y) = |\Gamma(x) \cap \Gamma(y)|$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, and $x$ and $y$ share 3 neighbors
$\mathrm{score}(x, y) = |\Gamma(x) \cap \Gamma(y)| = 3$
Methods based on node neighborhoods: Jaccard coefficient
$\mathrm{score}(x, y) = \dfrac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, $|\Gamma(x) \cap \Gamma(y)| = 3$, $|\Gamma(x) \cup \Gamma(y)| = 7$
$\mathrm{score}(x, y) = \dfrac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|} = \dfrac{3}{7} \approx 0.43$
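Methods based on node neighborhoods: Adamic/Adar
$\mathrm{score}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \dfrac{1}{\log |\Gamma(z)|}$
Example: $x$ and $y$ have three common neighbors with $|\Gamma(z_1)| = 4$, $|\Gamma(z_2)| = 2$, $|\Gamma(z_3)| = 3$
$\mathrm{score}(x, y) = \dfrac{1}{\log 4} + \dfrac{1}{\log 2} + \dfrac{1}{\log 3} \approx 7.08$ (using base-10 logarithms)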
Methods based on node neighborhoods: Preferential attachment
$\mathrm{score}(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$
$\mathrm{score}(x, y) = |\Gamma(x)| \cdot |\Gamma(y)| = 24$
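The four neighborhood-based scores are short enough to sketch together. The graph below is hypothetical: it is built only so that the degrees match the worked examples on the slides ($|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, three common neighbors of degree 4, 2 and 3). The function names are illustrative, and the base-10 logarithm is an assumption chosen because it reproduces the 7.08 in the Adamic/Adar example. The log base rescales every score by the same factor, so it does not change the ranking.

```python
import math

def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, x, y):
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar(adj, x, y, log=math.log10):
    # A common neighbor z is adjacent to both x and y, so |Γ(z)| >= 2 and log > 0.
    return sum(1.0 / log(len(adj[z])) for z in adj[x] & adj[y])

def preferential_attachment(adj, x, y):
    return len(adj[x]) * len(adj[y])

# Hypothetical graph matching the degrees used in the slide examples.
edges = [("x", "z1"), ("x", "z2"), ("x", "z3"), ("x", "a"),
         ("y", "z1"), ("y", "z2"), ("y", "z3"), ("y", "b"), ("y", "c"), ("y", "d"),
         ("z1", "e"), ("z1", "f"), ("z3", "g")]
adj = build_adjacency(edges)

scores = {
    "common neighbors": common_neighbors(adj, "x", "y"),
    "Jaccard": round(jaccard(adj, "x", "y"), 2),
    "Adamic/Adar": round(adamic_adar(adj, "x", "y"), 2),
    "preferential attachment": preferential_attachment(adj, "x", "y"),
}
```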
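Methods based on ensemble of all paths: Katz
$\mathrm{score}(x, y) = \sum_{l=1}^{\infty} \beta^l \cdot |\mathrm{paths}^{\langle l \rangle}_{x,y}|$, with $0 < \beta < 1$
$\mathrm{paths}^{\langle l \rangle}_{x,y}$: the set of all length-$l$ paths from $x$ to $y$
Example (3 paths of length 2, 2 of length 3, 3 of length 4): $\mathrm{score}(x, y) = \beta^2 \cdot 3 + \beta^3 \cdot 2 + \beta^4 \cdot 3 + \cdots$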
Methods based on ensemble of all paths: Katz
$\mathrm{score}(x, y) = \sum_{l=1}^{\infty} \beta^l \cdot |\mathrm{paths}^{\langle l \rangle}_{x,y}|$
We consider 2 variants of the Katz measure:
Unweighted: $|\mathrm{paths}^{\langle 1 \rangle}_{x,y}| = 1$ if $x$ and $y$ collaborated, and $0$ otherwise
Weighted: $|\mathrm{paths}^{\langle 1 \rangle}_{x,y}|$ = the number of times that $x$ and $y$ have collaborated
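A truncated version of the Katz sum can be sketched with a simple dynamic program over path lengths. This is an illustration under stated assumptions: the function name `katz_score` and the cutoff `max_len` are not from the slides, and "paths" are counted as walks that may revisit nodes. The dictionary-of-dictionaries `graph[u][v]` holds 1 for the unweighted variant or the number of collaborations for the weighted one.

```python
def katz_score(graph, x, y, beta, max_len=10):
    """Sum over l = 1..max_len of beta^l * (number of length-l walks from x to y)."""
    walks = {x: 1}  # walks[v] = number of length-l walks from x that end at v
    score = 0.0
    for length in range(1, max_len + 1):
        extended = {}
        for u, count in walks.items():
            for v, multiplicity in graph.get(u, {}).items():
                extended[v] = extended.get(v, 0) + count * multiplicity
        walks = extended
        score += beta ** length * walks.get(y, 0)
    return score

# Hypothetical graph: x and y are joined by two length-2 paths (via a and via b).
unweighted = {
    "x": {"a": 1, "b": 1},
    "a": {"x": 1, "y": 1},
    "b": {"x": 1, "y": 1},
    "y": {"a": 1, "b": 1},
}
# Weighted variant: x and a have collaborated twice.
weighted = {
    "x": {"a": 2, "b": 1},
    "a": {"x": 2, "y": 1},
    "b": {"x": 1, "y": 1},
    "y": {"a": 1, "b": 1},
}

katz_unweighted = katz_score(unweighted, "x", "y", beta=0.1, max_len=2)
katz_weighted = katz_score(weighted, "x", "y", beta=0.1, max_len=2)
```

With a small $\beta$ the terms shrink quickly, so a modest cutoff already captures most of the score. Short paths dominate, which is why Katz with very small $\beta$ behaves much like common neighbors.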
Methods based on ensemble of all paths: hitting time
$\mathrm{score}(x, y) = -H_{x,y} \cdot \pi_y$
$H_{x,y}$: the expected time for a random walk starting at $x$ to reach $y$
$\pi_y$: the stationary probability of $y$ (this is the variant normed by the stationary distribution)
Methods based on ensemble of all paths: commute time
$\mathrm{score}(x, y) = -(H_{x,y} \cdot \pi_y + H_{y,x} \cdot \pi_x)$
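Hitting and commute times can be sketched by solving the recurrence $h(y) = 0$, $h(u) = 1 + \frac{1}{|\Gamma(u)|} \sum_{v \in \Gamma(u)} h(v)$ iteratively. The code below is a minimal illustration, assuming a connected, undirected, unweighted graph, for which the stationary probability of a node is its degree divided by the sum of all degrees. The function names and the 3-node path graph are assumptions for the example.

```python
def hitting_time(adj, x, y, tol=1e-12, max_iter=100000):
    """Expected number of steps for a random walk from x to first reach y."""
    h = {v: 0.0 for v in adj}  # h[y] stays 0 throughout
    for _ in range(max_iter):
        delta = 0.0
        for u in adj:
            if u == y:
                continue
            updated = 1.0 + sum(h[v] for v in adj[u]) / len(adj[u])
            delta = max(delta, abs(updated - h[u]))
            h[u] = updated
        if delta < tol:
            break
    return h[x]

def stationary(adj):
    """Stationary distribution of the walk on a connected undirected graph."""
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total_degree for v, nbrs in adj.items()}

def hitting_score(adj, x, y):
    return -hitting_time(adj, x, y) * stationary(adj)[y]

def commute_score(adj, x, y):
    pi = stationary(adj)
    return -(hitting_time(adj, x, y) * pi[y] + hitting_time(adj, y, x) * pi[x])

# Hypothetical path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
h_ac = hitting_time(path, "a", "c")
score_ac = hitting_score(path, "a", "c")
commute_ab = commute_score(path, "a", "b")
```

On this path, reaching c from a takes 4 steps in expectation, while the hitting time is asymmetric between a and b (1 step one way, 3 the other), which is what commute time symmetrizes.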
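Methods based on ensemble of all paths: rooted PageRank
$\mathrm{score}(x, y)$ is the stationary probability of $y$ in a random walk that, at each step:
• With probability $\alpha$, jumps to $x$
• With probability $1 - \alpha$, goes to a random neighbor of the current node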
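The walk described above can be simulated by power iteration on the probability distribution over nodes. This is a sketch under assumptions: the function name `rooted_pagerank`, the fixed iteration count, and the example graph are illustrative, and every node is assumed to have at least one neighbor.

```python
def rooted_pagerank(adj, x, alpha=0.15, iters=200):
    """Approximate stationary distribution of the walk that restarts at x."""
    p = {v: 0.0 for v in adj}
    p[x] = 1.0  # start the walk at the root
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[x] += alpha  # total probability mass that jumps back to x
        for u, mass in p.items():
            share = (1.0 - alpha) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share  # mass moving to a random neighbor
        p = nxt
    return p

# Hypothetical path graph a - b - c, rooted at a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
pr = rooted_pagerank(path, "a", alpha=0.5)
```

For this graph and $\alpha = 0.5$ the fixed point can be solved by hand: 7/12 for a, 1/3 for b, and 1/12 for c, so nodes closer to the root score higher.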
Methods based on ensemble of all paths: SimRank
$\mathrm{score}(x, y) = \begin{cases} 1 & \text{if } x = y \\ \gamma \cdot \dfrac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} \mathrm{score}(a, b)}{|\Gamma(x)| \cdot |\Gamma(y)|} & \text{otherwise} \end{cases}$
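Because SimRank is defined recursively, one common way to compute it is to start from the identity (a node is similar only to itself) and apply the definition repeatedly. The sketch below assumes that approach. The function name `simrank`, the iteration count, and the star graph are illustrative.

```python
def simrank(adj, gamma=0.8, iters=10):
    """Iterate the SimRank recurrence starting from score(u, v) = [u == v]."""
    nodes = list(adj)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif adj[u] and adj[v]:
                    total = sum(s[(a, b)] for a in adj[u] for b in adj[v])
                    nxt[(u, v)] = gamma * total / (len(adj[u]) * len(adj[v]))
                else:
                    nxt[(u, v)] = 0.0  # a node without neighbors is similar to nothing else
        s = nxt
    return s

# Hypothetical star graph: x and y are both attached only to c.
star = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}}
sim = simrank(star, gamma=0.8)
```

In the star, x and y share their only neighbor, so their score is exactly $\gamma$, while x and c never become similar because their neighborhoods never contain a matching pair.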
Higher-level Approaches
We now discuss three “meta-approaches” that can be used in conjunction with any of the methods discussed above:
• Low-rank approximation
• Unseen bigrams
• Clustering
Results and Discussion
• As discussed before, many collaborations form for reasons outside the scope of the network
• To put predictor quality in context, we report each predictor's performance relative to a random predictor
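The slides do not spell out how the random predictor's success probability is computed. One reading, offered here as an assumption, is that it picks uniformly among Core pairs that did not collaborate during training, so the probability of a correct guess is $|E_{new}|$ divided by the number of such pairs. With the cond-mat figures from the database slide (1253 Core authors, 1899 old edges, 1150 new edges) this reproduces the reported 0.15%. The function names below are illustrative.

```python
from math import comb

def random_baseline(core_size, n_old_edges, n_new_edges):
    """Probability that a uniformly random non-collaborating Core pair is a new edge."""
    candidate_pairs = comb(core_size, 2) - n_old_edges
    return n_new_edges / candidate_pairs

def factor_improvement(correct, n, baseline):
    """How many times better than random a predictor's hit rate is."""
    return (correct / n) / baseline

# cond-mat figures taken from the database slide.
cond_mat_baseline = random_baseline(1253, 1899, 1150)
cond_mat_percent = round(100 * cond_mat_baseline, 2)

# Hypothetical predictor: 8 correct out of 100 predictions against a 0.2% baseline.
example_factor = factor_improvement(8, 100, 0.002)
```

Reporting a factor over random rather than raw accuracy makes the five datasets comparable even though their baseline probabilities differ.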
Performance of various predictors
The first row is the probability that a random prediction is correct; all other entries are the factor improvement over random prediction.

predictor                                         astro-ph cond-mat    gr-qc   hep-ph   hep-th
probability that a random prediction is correct      0.48%    0.15%    0.34%    0.21%    0.15%
graph distance (all distance-two pairs)                9.6     25.3     21.4     12.2     29.2
common neighbors                                        18     41.1     27.2       27     47.2
preferential attachment                                4.7      6.1      7.6     15.2      7.5
Adamic/Adar                                           16.8     54.8     30.1     33.3     50.5
Jaccard                                               16.4     42.3     19.9     27.7     41.7
SimRank, γ = 0.8                                      14.6     39.3     22.8     26.1     41.7
hitting time                                           6.5     23.8       25      3.8     13.4
hitting time, normed by stationary distribution        5.3     23.8       11     11.3     21.3
commute time                                           5.2     15.5     33.1     17.1     23.4
commute time, normed by stationary distribution        5.3     16.1       11     11.3     16.3
rooted PageRank, α = 0.01                             10.8       28     33.1     18.7     29.2
rooted PageRank, α = 0.05                             13.8     39.9     35.3     24.6     41.3
rooted PageRank, α = 0.15                             16.6     41.1     27.2     27.6     42.6
rooted PageRank, α = 0.30                             17.1     42.3       25     29.9     46.8
rooted PageRank, α = 0.50                             16.8     41.1     24.3     30.7     46.8
Katz (weighted), β = 0.05                                3     21.4     19.9      2.4     12.9
Katz (weighted), β = 0.005                            13.4     54.8     30.1       24     52.2
Katz (weighted), β = 0.0005                           14.5     54.2     30.1     32.6     51.8
Katz (unweighted), β = 0.05                           10.9     41.7     37.5     18.7       48
Katz (unweighted), β = 0.005                          16.8     41.7     37.5     24.2     49.7
Katz (unweighted), β = 0.0005                         16.8     41.7     37.5     24.9     49.7
[Figure: bar chart, "Methods based on node neighborhoods". Vertical axis: factor improvement over random prediction (0 to 60). Series: distance-two pairs, common neighbors, preferential attachment, Adamic/Adar, Jaccard. Groups: astro-ph, cond-mat, gr-qc, hep-ph, hep-th. The plotted values are the corresponding rows of the "Performance of various predictors" table.]
[Figure: bar chart, "Methods based on ensemble of all paths". Vertical axis: factor improvement over random prediction (0 to 60). Series: SimRank (γ = 0.8), hitting time (normed), commute time (normed), rooted PageRank (α = 0.50), Katz weighted (β = 0.0005), Katz unweighted (β = 0.0005). Groups: astro-ph, cond-mat, gr-qc, hep-ph, hep-th. The plotted values are the corresponding rows of the "Performance of various predictors" table.]
Results and Discussion
• There is no single clear winner among the techniques
• However, a number of methods significantly outperform the random predictor, for example:
  • Katz
  • Common neighbors
  • Adamic/Adar
Similarities among the predictors
• There is overlap in the predictions made by the various methods
• We will look at two tables:
  • The number of common predictions
  • The number of correct common predictions
Number of common predictions made by each pair of predictors (each predictor makes 1150 predictions; columns follow the same order as the rows):

                                AA    KC    CN    HT    JC    WK    LR    PR    SR    UB
Adamic/Adar (AA)              1150   638   520   193   442  1011   905   528   372   486
Katz clustering (KC)                1150   411   182   285   630   623   347   245   389
common neighbors (CN)                     1150   135   506   494   467   305   332   489
hitting time (HT)                               1150    87   191   192   247   130   156
Jaccard's coefficient (JC)                            1150   414   382   247   845   458
weighted Katz (WK)                                          1150  1013   488   344   474
low-rank inner product (LR)                                       1150   453   320   448
rooted PageRank (PR)                                                    1150   678   461
SimRank (SR)                                                                  1150   423
unseen bigrams (UB)                                                                 1150
Number of correct common predictions made by each pair of predictors (the diagonal is the number of correct predictions by that predictor; columns follow the same order as the rows):

                                AA    KC    CN    HT    JC    WK    LR    PR    SR    UB
Adamic/Adar (AA)                92    65    53    22    43    87    72    44    36    49
Katz clustering (KC)                  78    41    20    29    66    60    31    22    37
common neighbors (CN)                       69    13    43    52    43    27    26    40
hitting time (HT)                                 40     8    22    19    17     9    15
Jaccard's coefficient (JC)                              71    41    32    39    51    43
weighted Katz (WK)                                            92    75    44    32    51
low-rank inner product (LR)                                         79    39    26    46
rooted PageRank (PR)                                                      69    48    39
SimRank (SR)                                                                    66    34
unseen bigrams (UB)                                                                   68
Small World Problem
• Accounting for the existence of short paths
• Problem: the shortest path between two scientists in unrelated disciplines is often very short
• Link predictors can be viewed as using measures of proximity that are robust to the few edges arising from rare collaborations between fields
The breadth of the data
• Considered 3 more datasets
• Examined the performance of the common neighbors predictor vs. the random predictor
  • Diverse collection of scientists: common neighbors predictor wins
  • Narrow set of scientists: random predictor wins
Future Directions
• Improve the performance
• Improve the efficiency of the proximity-based methods on very large networks
• Suggestions:
  • Bipartite collaboration graph
  • Treating more recent collaborations as more important
The End