The Link Prediction Problem for Social Networks

David Liben-Nowell
Jon Kleinberg
January 8, 2004
Introduction
Data and Experimental Setup
Methods for Link Prediction
Results and Discussion
Introduction
Link prediction problem:
Given a snapshot of a social network, can we infer which new interactions are likely to occur in the future?
Experiments suggest that information about future interactions can be extracted from network topology alone
Motivation
• Understand the mechanisms by which social networks evolve
• Model the evolution of a social network using features intrinsic to the network itself
How will we do that?
• We seek to predict the edges that will be added during the interval from time $t$ to a future time $t'$
Motivation
• Problem: some collaborations can be hard to predict
• Assumption: most new collaborations are hinted at by the topology of the network
• Our goal:
  ◦ Make this notion precise
  ◦ Find the measures of proximity that lead to the most accurate link prediction
Applications
• Large organizations
• Security
• Inferring missing links
If I had only one hour to save the world,
I would spend fifty-five minutes defining the problem,
and only five minutes finding the solution.
Definitions
• $G = \langle V, E \rangle$
• $e = \langle u, v \rangle \in E$: each edge $e$ represents an interaction between $u$ and $v$ that took place at a particular time $t(e)$
• $G[t, t']$ is the subgraph of $G$ consisting of all edges with a time stamp between $t$ and $t'$
Definitions
• We choose four times: $t_0 < t_0' < t_1 < t_1'$
  ◦ $[t_0, t_0']$: training interval (in our paper: [1994, 1996])
  ◦ $[t_1, t_1']$: test interval (in our paper: [1997, 1999])
• Output: a list of edges not present in $G[t_0, t_0']$ that are predicted to appear in $G[t_1, t_1']$
Definitions
Q: What is the problem?
A: Social networks grow through the addition of nodes as well as edges
Definitions
To solve this problem we use two parameters:
• $K_{training}$
• $K_{test}$
Each is set to 3.
We then define a set:
• Core: all nodes incident to at least $K_{training}$ edges in $G[t_0, t_0']$ and at least $K_{test}$ edges in $G[t_1, t_1']$
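The Core definition can be sketched in a few lines. This is only an illustration, assuming each interaction is stored as a hypothetical `(u, v, year)` tuple; the helper names `subgraph` and `core` and the toy data are not from the slides.

```python
from collections import Counter

def subgraph(edges, t_start, t_end):
    """Return G[t_start, t_end]: all edges time-stamped within the interval."""
    return [(u, v, t) for (u, v, t) in edges if t_start <= t <= t_end]

def core(edges, train, test, k_training=3, k_test=3):
    """Nodes with >= k_training training edges and >= k_test test edges."""
    def degrees(interval):
        deg = Counter()
        for u, v, _ in subgraph(edges, *interval):
            deg[u] += 1
            deg[v] += 1
        return deg

    d_train, d_test = degrees(train), degrees(test)
    # A Counter returns 0 for nodes that never appear in the test interval.
    return {node for node in d_train
            if d_train[node] >= k_training and d_test[node] >= k_test}

# Toy data (hypothetical): "a" is active in both periods, "b" only in training.
edges = [("a", "b", 1994), ("a", "c", 1995), ("a", "d", 1996),
         ("b", "c", 1994), ("b", "d", 1995),
         ("a", "e", 1997), ("a", "f", 1998), ("a", "g", 1999)]
core_nodes = core(edges, train=(1994, 1996), test=(1997, 1999))
```

In this toy data only "a" meets both thresholds, so newly arriving nodes such as "e" are excluded from evaluation.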
Definitions
• $E_{old}$: edges between Core authors that first appear during the training period
• $E_{new}$: edges $\langle u, v \rangle$ such that $u$ and $v$ co-author a paper during the test interval but not during the training interval
Evaluating a link predictor
• $E^*_{new} := E_{new} \cap (Core \times Core)$
• $n := |E^*_{new}|$
  Example: $E^*_{new} = \{(a,b), (a,c), (b,d), (c,e)\} \rightarrow n = 4$
• Each link predictor $p$ outputs a ranked list $L_p$ of pairs, in decreasing order of confidence
• From $L_p$ we take the first $n$ pairs in $Core \times Core$ and determine the size of the intersection of this set of pairs with the set $E^*_{new}$
  Example: $L_p = [(a,b), (a,c), (b,d), (c,e), (d,e), (e,f)] \rightarrow L_p(n) = [(a,b), (a,c), (b,d), (c,e)]$
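The evaluation procedure can be sketched as follows. The function name `evaluate`, the Core set, and the second ranking are assumptions made for illustration; the first ranking and the set of new edges are the ones from the slide's example.

```python
def evaluate(ranked_pairs, e_new, core):
    """Return (correct predictions among the top n, n), with n = |E*_new|."""
    def in_core(pair):
        return pair[0] in core and pair[1] in core

    # E*_new: new edges with both endpoints in Core, stored as unordered pairs.
    e_new_star = {frozenset(p) for p in e_new if in_core(p)}
    n = len(e_new_star)
    # The first n predicted pairs that lie in Core x Core.
    top_n = [frozenset(p) for p in ranked_pairs if in_core(p)][:n]
    return len(set(top_n) & e_new_star), n

core_set = set("abcdef")  # hypothetical Core
e_new = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]

# Ranked list from the slide: its top four pairs are exactly E*_new.
l_p = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e"), ("d", "e"), ("e", "f")]
# A weaker, made-up ranking; ("a", "z") is skipped because "z" is not in Core.
l_q = [("a", "z"), ("d", "e"), ("a", "b"), ("e", "f"), ("c", "e"), ("a", "c")]

perfect = evaluate(l_p, e_new, core_set)
weaker = evaluate(l_q, e_new, core_set)
```

Pairs are treated as unordered here, which matches an undirected co-authorship graph.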
Data Base
The researchers constructed 5 co-authorship networks G from www.arXiv.org:

                                                      training period                  Core
                                                   authors  papers   edges   authors  |E_old|  |E_new|
astrophysics (astro-ph)                               5343    5816   41852      1561     6178     5751
condensed matter (cond-mat)                           5469    6700   19881      1253     1899     1150
general relativity and quantum cosmology (gr-qc)      2122    3287    5724       486      519      400
high energy physics - phenomenology (hep-ph)          5414   10254   17806      1790     6654     3294
high energy physics - theory (hep-th)                 5241    9498   15842      1438     2311     1576
Methods
• Every method assigns a connection weight $score(x, y)$ to pairs of nodes $\langle x, y \rangle$
• It then produces a ranked list in decreasing order of $score(x, y)$
• Each method can thus be viewed as computing a measure of proximity or "similarity" between nodes $x$ and $y$, relative to the network topology
List of Methods
• Methods based on:
  ◦ node neighborhoods
    - Common neighbors
    - Jaccard's coefficient
    - Adamic/Adar
    - Preferential attachment
  ◦ the ensemble of all paths
    - Katz
    - Hitting time
    - PageRank
    - SimRank
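Methods based on node neighborhoods:
Common neighbors
$score(x, y) = |\Gamma(x) \cap \Gamma(y)|$, where $\Gamma(x)$ is the set of neighbors of $x$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$
$score(x, y) = |\Gamma(x) \cap \Gamma(y)| = 3$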
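A minimal sketch of the common-neighbors score. The slide's figure is not reproduced here, so the adjacency sets below are a hypothetical graph chosen only to match its counts (4 neighbors of x, 6 neighbors of y, 3 shared).

```python
# Hypothetical neighbor sets: |Γ(x)| = 4, |Γ(y)| = 6, three shared neighbors.
neighbors = {
    "x": {"a", "b", "c", "d"},
    "y": {"b", "c", "d", "e", "f", "g"},
}

def common_neighbors(neighbors, x, y):
    """score(x, y) = number of neighbors that x and y share."""
    return len(neighbors[x] & neighbors[y])

cn_score = common_neighbors(neighbors, "x", "y")
```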
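Methods based on node neighborhoods:
Jaccard coefficient
$score(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, $|\Gamma(x) \cap \Gamma(y)| = 3$, $|\Gamma(x) \cup \Gamma(y)| = 7$
$score(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|} = \frac{3}{7} \approx 0.43$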
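The Jaccard score for the example above can be checked with a short sketch. The neighbor sets are hypothetical, picked only so that the intersection has 3 nodes and the union has 7; returning 0.0 for two isolated nodes is also an assumption.

```python
# Hypothetical neighbor sets: intersection of size 3, union of size 7.
neighbors = {
    "x": {"a", "b", "c", "d"},
    "y": {"b", "c", "d", "e", "f", "g"},
}

def jaccard(neighbors, x, y):
    """score(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|."""
    union = neighbors[x] | neighbors[y]
    if not union:  # both nodes isolated: define the score as 0
        return 0.0
    return len(neighbors[x] & neighbors[y]) / len(union)

jaccard_score = jaccard(neighbors, "x", "y")
```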
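Methods based on node neighborhoods:
Adamic/Adar
$score(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$
Example: $x$ and $y$ share three neighbors, with $|\Gamma(z_1)| = 4$, $|\Gamma(z_2)| = 2$, $|\Gamma(z_3)| = 3$
$score(x, y) = \frac{1}{\log 4} + \frac{1}{\log 2} + \frac{1}{\log 3} \approx 7.08$ (using base-10 logarithms)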
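A small sketch of the Adamic/Adar score. The graph is a hypothetical one matching the example's degrees (4, 2, and 3 for the three shared neighbors), and the `log` parameter is an assumption added so that the base can be swapped; base 10 reproduces the value 7.08, and the base does not affect the resulting ranking.

```python
import math

# Hypothetical graph: x and y share z1, z2, z3 with degrees 4, 2, and 3.
neighbors = {
    "x": {"z1", "z2", "z3"},
    "y": {"z1", "z2", "z3"},
    "z1": {"x", "y", "p", "q"},
    "z2": {"x", "y"},
    "z3": {"x", "y", "r"},
}

def adamic_adar(neighbors, x, y, log=math.log10):
    """Sum 1 / log|Γ(z)| over common neighbors z, so rarer neighbors count more."""
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / log(len(neighbors[z])) for z in neighbors[x] & neighbors[y])

aa_score = adamic_adar(neighbors, "x", "y")
aa_natural = adamic_adar(neighbors, "x", "y", log=math.log)
```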
Methods based on node neighborhoods:
Preferential attachment
$score(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$
$score(x, y) = |\Gamma(x)| \cdot |\Gamma(y)| = 24$
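For completeness, a sketch of the preferential-attachment score on hypothetical neighbor sets with the example's degrees (4 and 6). Unlike the other neighborhood methods, it ignores which neighbors are shared.

```python
# Hypothetical neighbor sets with |Γ(x)| = 4 and |Γ(y)| = 6.
neighbors = {
    "x": {"a", "b", "c", "d"},
    "y": {"b", "c", "d", "e", "f", "g"},
}

def preferential_attachment(neighbors, x, y):
    """score(x, y) = |Γ(x)| * |Γ(y)|: depends on the two degrees only."""
    return len(neighbors[x]) * len(neighbors[y])

pa_score = preferential_attachment(neighbors, "x", "y")
```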
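Methods based on the ensemble of all paths:
Katz
$score(x, y) = \sum_{l=1}^{\infty} \beta^l \cdot |paths^{\langle l \rangle}_{x,y}|$, with $0 < \beta < 1$
$paths^{\langle l \rangle}_{x,y}$: the set of all length-$l$ paths from $x$ to $y$
Example: $score(x, y) = \beta^2 \cdot 3 + \beta^3 \cdot 2 + \beta^4 \cdot 3 + \cdots$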
Methods based on the ensemble of all paths:
Katz
$score(x, y) = \sum_{l=1}^{\infty} \beta^l \cdot |paths^{\langle l \rangle}_{x,y}|$
We consider two variants of the Katz measure:
• Unweighted: $|paths^{\langle 1 \rangle}_{x,y}| = 1$ if $x$ and $y$ have collaborated, and $0$ otherwise
• Weighted: $|paths^{\langle 1 \rangle}_{x,y}|$ = the number of times that $x$ and $y$ have collaborated
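Both Katz variants can be sketched with one routine that differs only in the edge multiplicities it is given. This is an illustration under two assumptions: paths may revisit nodes (as in the usual matrix form of the Katz measure), and the infinite sum is truncated at a hypothetical `max_len`. The function name and toy graphs are not from the slides.

```python
def katz_scores(weights, source, beta=0.005, max_len=6):
    """Truncated Katz scores from `source` to every reachable node.

    weights[u][v] is the number of length-1 paths between u and v:
    1 in the unweighted variant, the collaboration count in the weighted one.
    """
    walks = {source: 1}  # number of length-0 paths from source
    scores = {}
    for length in range(1, max_len + 1):
        extended = {}
        for u, count in walks.items():
            for v, multiplicity in weights[u].items():
                extended[v] = extended.get(v, 0) + count * multiplicity
        for v, count in extended.items():
            scores[v] = scores.get(v, 0.0) + beta ** length * count
        walks = extended
    return scores

# Toy graph (hypothetical): x and y are joined by two length-2 paths.
unweighted = {"x": {"a": 1, "b": 1}, "a": {"x": 1, "y": 1},
              "b": {"x": 1, "y": 1}, "y": {"a": 1, "b": 1}}
# Same graph, but x and a have collaborated three times.
weighted = {"x": {"a": 3, "b": 1}, "a": {"x": 3, "y": 1},
            "b": {"x": 1, "y": 1}, "y": {"a": 1, "b": 1}}

katz_unweighted = katz_scores(unweighted, "x", beta=0.1, max_len=2)["y"]
katz_weighted = katz_scores(weighted, "x", beta=0.1, max_len=2)["y"]
```

With a small $\beta$ the truncated terms contribute little, which is why a short cutoff is a reasonable approximation in this sketch.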
Methods based on the ensemble of all paths:
Hitting time
$score(x, y) = -H_{x,y} \cdot \pi_y$
$H_{x,y}$: the expected time for a random walk from $x$ to reach $y$
$\pi_y$: the stationary probability of $y$
Methods based on the ensemble of all paths:
Commute time
$score(x, y) = -(H_{x,y} \cdot \pi_y + H_{y,x} \cdot \pi_x)$
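The hitting-time and commute-time scores can be sketched with a simple fixed-point iteration. This assumes a connected, undirected, unweighted graph, for which the stationary probability of a node is proportional to its degree; the function names and the three-node path are illustrative only.

```python
def hitting_times(neighbors, target, tol=1e-12, max_iter=100_000):
    """Expected number of steps for a random walk from each node to reach target."""
    h = {v: 0.0 for v in neighbors}
    for _ in range(max_iter):
        delta = 0.0
        for v in neighbors:
            if v == target:
                continue  # H[target] stays 0
            # One step, then the average hitting time over v's neighbors.
            new = 1.0 + sum(h[u] for u in neighbors[v]) / len(neighbors[v])
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            break
    return h

def stationary(neighbors):
    """Stationary distribution of the walk: degree divided by total degree."""
    total = sum(len(nbrs) for nbrs in neighbors.values())
    return {v: len(nbrs) / total for v, nbrs in neighbors.items()}

def hitting_score(neighbors, x, y):
    return -hitting_times(neighbors, y)[x] * stationary(neighbors)[y]

def commute_score(neighbors, x, y):
    return hitting_score(neighbors, x, y) + hitting_score(neighbors, y, x)

# Toy path graph a - b - c (hypothetical): H[a, c] = 4 and pi_c = 1/4.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
hit = hitting_score(path, "a", "c")
commute = commute_score(path, "a", "c")
```

Scores are negated so that closer pairs (smaller hitting times) rank higher in the decreasing-order list.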
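Methods based on the ensemble of all paths:
Rooted PageRank
$score(x, y)$ is the stationary probability of $y$ in a random walk that, at each step:
• With probability $\alpha$, jumps to $x$
• With probability $1 - \alpha$, goes to a random neighbor of the current node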
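The random walk described above can be approximated by power iteration. This is a sketch under the assumptions that every node has at least one neighbor and that a fixed number of iterations is enough; the function name, the default $\alpha = 0.15$, and the toy path graph are illustrative.

```python
def rooted_pagerank(neighbors, root, alpha=0.15, iterations=200):
    """Approximate the stationary distribution of the walk rooted at `root`."""
    rank = {v: 0.0 for v in neighbors}
    rank[root] = 1.0  # start the walk at the root
    for _ in range(iterations):
        nxt = {v: 0.0 for v in neighbors}
        nxt[root] += alpha  # with probability alpha, jump back to the root
        for u, mass in rank.items():
            # With probability 1 - alpha, move to a uniformly random neighbor.
            share = (1.0 - alpha) * mass / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        rank = nxt
    return rank

# Toy path graph a - b - c - d (hypothetical), rooted at a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
rank = rooted_pagerank(path, "a")
```

Here `rank[y]` plays the role of $score(x, y)$ for the root $x$; nodes farther from the root receive less probability mass.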
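Methods based on the ensemble of all paths:
SimRank
$score(x, y) = \begin{cases} 1 & \text{if } x = y \\ \gamma \cdot \dfrac{\sum_{\alpha \in \Gamma(x)} \sum_{\beta \in \Gamma(y)} score(\alpha, \beta)}{|\Gamma(x)| \cdot |\Gamma(y)|} & \text{otherwise} \end{cases}$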
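The recursive SimRank definition is usually evaluated by iterating from the identity. The sketch below does that for a fixed number of rounds; the function name, the iteration count, and the convention that a node without neighbors has similarity 0 to every other node are assumptions.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Iterate the SimRank recurrence; `decay` is the gamma parameter."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    nxt[(x, y)] = 1.0
                elif not neighbors[x] or not neighbors[y]:
                    nxt[(x, y)] = 0.0
                else:
                    # Average similarity over all pairs of neighbors, then decay.
                    total = sum(sim[(a, b)]
                                for a in neighbors[x] for b in neighbors[y])
                    nxt[(x, y)] = decay * total / (
                        len(neighbors[x]) * len(neighbors[y]))
        sim = nxt
    return sim

# Toy path graph x - z - y (hypothetical): x and y share their only neighbor.
path = {"x": {"z"}, "z": {"x", "y"}, "y": {"z"}}
sim = simrank(path)
```

On this path, x and y have the same single neighbor, so their score equals the decay factor, while x and z never become similar.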
Higher-level Approaches
We now discuss three "meta-approaches" that can be used in conjunction with any of the methods discussed above:
• Low-rank approximation
• Unseen bigrams
• Clustering
Results and Discussion
• As discussed before, many collaborations form for reasons outside the scope of the network
• To represent predictor quality, we report each predictor's accuracy as a factor improvement over a random predictor
Performance of various predictors
(first row: probability; all other rows: factor improvement over random prediction)

predictor                                           astro-ph  cond-mat     gr-qc    hep-ph    hep-th
probability that a random prediction is correct        0.48%     0.15%     0.34%     0.21%     0.15%
graph distance (all distance-two pairs)                  9.6      25.3      21.4      12.2      29.2
common neighbors                                          18      41.1      27.2        27      47.2
preferential attachment                                  4.7       6.1       7.6      15.2       7.5
Adamic/Adar                                             16.8      54.8      30.1      33.3      50.5
Jaccard                                                 16.4      42.3      19.9      27.7      41.7
SimRank, γ = 0.8                                        14.6      39.3      22.8      26.1      41.7
hitting time                                             6.5      23.8        25       3.8      13.4
hitting time, normed by stationary distribution          5.3      23.8        11      11.3      21.3
commute time                                             5.2      15.5      33.1      17.1      23.4
commute time, normed by stationary distribution          5.3      16.1        11      11.3      16.3
rooted PageRank, α = 0.01                               10.8        28      33.1      18.7      29.2
rooted PageRank, α = 0.05                               13.8      39.9      35.3      24.6      41.3
rooted PageRank, α = 0.15                               16.6      41.1      27.2      27.6      42.6
rooted PageRank, α = 0.30                               17.1      42.3        25      29.9      46.8
rooted PageRank, α = 0.50                               16.8      41.1      24.3      30.7      46.8
Katz (weighted), β = 0.05                                  3      21.4      19.9       2.4      12.9
Katz (weighted), β = 0.005                              13.4      54.8      30.1        24      52.2
Katz (weighted), β = 0.0005                             14.5      54.2      30.1      32.6      51.8
Katz (unweighted), β = 0.05                             10.9      41.7      37.5      18.7        48
Katz (unweighted), β = 0.005                            16.8      41.7      37.5      24.2      49.7
Katz (unweighted), β = 0.0005                           16.8      41.7      37.5      24.9      49.7
[Figure: factor improvement over random prediction for methods based on node neighborhoods (distance-two pairs, common neighbors, preferential attachment, Adamic/Adar, Jaccard) on astro-ph, cond-mat, gr-qc, hep-ph, and hep-th]
[Figure: factor improvement over random prediction for methods based on the ensemble of all paths (SimRank γ = 0.8; hitting time, normed; commute time, normed; rooted PageRank α = 0.50; Katz (weighted) β = 0.0005; Katz (unweighted) β = 0.0005) on astro-ph, cond-mat, gr-qc, hep-ph, and hep-th]
Results and Discussion
• There is no single clear winner among the techniques
• But a number of methods outperform the random predictor by a wide margin, including:
  ◦ Katz
  ◦ Common neighbors
  ◦ Adamic/Adar
Similarities among the predictors
• There is overlap in the predictions made by the various methods
• We will look at two tables:
  ◦ The number of common predictions
  ◦ The number of correct common predictions
Number of common predictions for each pair of predictors (1150 predictions per predictor; columns are numbered like the rows):

                               (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)
(1)  Adamic/Adar              1150   638   520   193   442  1011   905   528   372   486
(2)  Katz clustering                1150   411   182   285   630   623   347   245   389
(3)  common neighbors                     1150   135   506   494   467   305   332   489
(4)  hitting time                               1150    87   191   192   247   130   156
(5)  Jaccard's coefficient                            1150   414   382   247   845   458
(6)  weighted Katz                                          1150  1013   488   344   474
(7)  low-rank inner product                                       1150   453   320   448
(8)  rooted PageRank                                                    1150   678   461
(9)  SimRank                                                                  1150   423
(10) unseen bigrams                                                                 1150
Number of correct common predictions for each pair of predictors (columns are numbered like the rows):

                               (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)
(1)  Adamic/Adar                92    65    53    22    43    87    72    44    36    49
(2)  Katz clustering                  78    41    20    29    66    60    31    22    37
(3)  common neighbors                       69    13    43    52    43    27    26    40
(4)  hitting time                                 40     8    22    19    17     9    15
(5)  Jaccard's coefficient                              71    41    32    39    51    43
(6)  weighted Katz                                            92    75    44    32    51
(7)  low-rank inner product                                         79    39    26    46
(8)  rooted PageRank                                                      69    48    39
(9)  SimRank                                                                    66    34
(10) unseen bigrams                                                                   68
Small World Problem
• Accounting for the existence of short paths
• Problem: the shortest path between two scientists in unrelated disciplines is often very short
• Link predictors can be viewed as using measures of proximity
The breadth of the data
• We considered 3 more datasets
• We examined the performance of the common neighbors predictor vs. the random predictor
• Diverse collection of scientists: the common neighbors predictor wins
• Narrow set of scientists: the random predictor wins
Future Directions
• Improve the performance
• Improve the efficiency of the proximity-based methods on very large networks
• Suggestions:
  ◦ Bipartite collaboration graph
  ◦ Treating more recent collaborations as more important
The End
