The Link Prediction Problem for Social Networks
David Liben-Nowell, Jon Kleinberg
January 8, 2004

Introduction | Data and Experimental Setup | Methods for Link Prediction | Results and Discussion

Introduction
Link prediction problem: can we infer which new interactions are likely to occur in the future?
Experiments suggest that information about future interactions can be extracted from the network topology alone.

Motivation
• Understand the mechanisms by which social networks evolve
• Model the evolution of a social network using features intrinsic to the network itself
How will we do that?
• We seek to predict the edges that will be added during the interval from time $t$ to a future time $t'$

Motivation
• Problem: some collaborations can be hard to predict
• Assumption: most new collaborations are hinted at by the topology of the network
• Our goal:
  • Make this notion precise
  • Find the measures of proximity that lead to the most accurate link prediction

Application
• Large organizations
• Security
• Inferring missing links

"If I had only one hour to save the world, I would spend fifty-five minutes defining the problem, and only five minutes finding the solution."
Definitions
• $G = \langle V, E \rangle$
• Each edge $e = \langle u, v \rangle \in E$ represents an interaction between $u$ and $v$ that took place at a time $t(e)$
• $G[t, t']$ is the subgraph of $G$ consisting of all edges with a time stamp between $t$ and $t'$

Definitions
• We choose 4 times: $t_0 < t_0' < t_1 < t_1'$
  $[t_0, t_0']$: training interval, in the paper [1994, 1996]
  $[t_1, t_1']$: test interval, in the paper [1997, 1999]
• Output: a list of edges not present in $G[t_0, t_0']$ that are predicted to appear in $G[t_1, t_1']$

Definitions
Q: What is the problem?
A: Social networks grow through the addition of nodes as well as edges, so there is no basis for predicting edges of nodes that do not appear in the training interval.

Definitions
To solve this problem we use 2 parameters, each set to 3:
• $\kappa_{training}$
• $\kappa_{test}$
And define a set:
• Core: all nodes incident to at least $\kappa_{training}$ edges in $G[t_0, t_0']$ and at least $\kappa_{test}$ edges in $G[t_1, t_1']$

Definitions
• $E_{old}$: edges between Core authors that appear during the training interval
• $E_{new}$: edges $\langle u, v \rangle$ such that $u$ and $v$ co-author a paper during the test interval but not during the training interval

Evaluating a link predictor
• $E^*_{new} := E_{new} \cap (Core \times Core)$
• $n := |E^*_{new}|$
• Each link predictor $p$ outputs a ranked list $L_p$ of pairs, in decreasing order of confidence
• From $L_p$ we take the first $n$ pairs in $Core \times Core$, written $L_p(n)$, and determine the size of the intersection of this set of pairs with $E^*_{new}$
Example: $E^*_{new}$ contains four pairs, so $n = 4$. If $L_p$ ranks six pairs, $L_p(n)$ keeps only its first four.

Data Base
The researchers constructed 5 co-authorship networks G from sections of www.arXiv.org:

                                                          training period             Core
section                                                   authors  papers  edges      authors  |E_old|  |E_new|
astro-ph (astrophysics)                                   5343     5816    41852      1561     6178     5751
cond-mat (condensed matter)                               5469     6700    19881      1253     1899     1150
gr-qc (general relativity and quantum cosmology)          2122     3287    5724       486      519      400
hep-ph (high energy physics, phenomenology)               5414     10254   17806      1790     6654     3294
hep-th (high energy physics, theory)                      5241     9498    15842      1438     2311     1576

Methods
• All the methods assign a connection weight $score(x, y)$ to pairs of nodes $\langle x, y \rangle$
• They then produce a ranked list in decreasing order of $score(x, y)$
• Thus, they can be viewed as computing a measure of proximity or "similarity" between nodes $x$ and $y$, relative to the network topology

List of Methods
• Methods based on node neighborhoods:
  • Common neighbors
  • Jaccard's coefficient
  • Adamic/Adar
  • Preferential attachment
• Methods based on the ensemble of all paths:
  • Katz
  • Hitting time
  • PageRank
  • SimRank

Methods based on node neighborhoods: common neighbors
$score(x, y) = |\Gamma(x) \cap \Gamma(y)|$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, and $score(x, y) = |\Gamma(x) \cap \Gamma(y)| = 3$

Methods based on node neighborhoods: Jaccard coefficient
$score(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, $|\Gamma(x) \cap \Gamma(y)| = 3$, $|\Gamma(x) \cup \Gamma(y)| = 7$, so $score(x, y) = 3/7 \approx 0.43$

Methods based on node neighborhoods: Adamic/Adar
$score(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$
Example (base-10 logarithms): common neighbors $z_1, z_2, z_3$ with $|\Gamma(z_1)| = 4$, $|\Gamma(z_2)| = 2$, $|\Gamma(z_3)| = 3$, so
$score(x, y) = \frac{1}{\log_{10} 4} + \frac{1}{\log_{10} 2} + \frac{1}{\log_{10} 3} \approx 7.08$

Methods based on node neighborhoods: preferential attachment
$score(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$
Example: $|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, so $score(x, y) = 4 \cdot 6 = 24$

Methods based on ensemble of all paths: Katz
$score(x, y) = \sum_{\ell=1}^{\infty} \beta^{\ell} \cdot |paths^{\langle \ell \rangle}_{x,y}|$, with $0 < \beta < 1$
$paths^{\langle \ell \rangle}_{x,y}$: the set of all length-$\ell$ paths from $x$ to $y$
Example: $score(x, y) = \beta^2 \cdot 3 + \beta^3 \cdot 2 + \beta^4 \cdot 3 + \cdots$

Methods based on ensemble of all paths: Katz
We consider 2 variants of the Katz measure:
• Unweighted: $|paths^{\langle 1 \rangle}_{x,y}| = 1$ if $x$ and $y$ have collaborated, and $0$ otherwise
• Weighted: $|paths^{\langle 1 \rangle}_{x,y}|$ = the number of times that $x$ and $y$ have collaborated

Methods based on ensemble of all paths: hitting time
$score(x, y) = -H_{x,y} \cdot \pi_y$
$H_{x,y}$: expected time for a random walk starting at $x$ to reach $y$
$\pi_y$: stationary probability of $y$. Dropping the factor $\pi_y$ gives the unnormalized variant, $-H_{x,y}$.

Methods based on ensemble of all paths: commute time
$score(x, y) = -(H_{x,y} \cdot \pi_y + H_{y,x} \cdot \pi_x)$

Methods based on ensemble of all paths: rooted PageRank
• With probability $\alpha$, jump to $x$
• With probability $1 - \alpha$, go to a random neighbor of the current node
$score(x, y)$ is the stationary probability of $y$ under this random walk.

Methods based on ensemble of all paths: SimRank
$score(x, y) = 1$ if $x = y$; otherwise
$score(x, y) = \gamma \cdot \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} score(a, b)}{|\Gamma(x)| \cdot |\Gamma(y)|}$
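The neighborhood-based scores and the Katz measure from the methods slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a plain dict of neighbor sets, the toy edge list `EDGES` is hypothetical and chosen to match the slide examples ($|\Gamma(x)| = 4$, $|\Gamma(y)| = 6$, three common neighbors of degree 4, 2 and 3), the Katz sum is truncated at `max_len` and counts walks that may revisit nodes, and Adamic/Adar defaults to base-10 logarithms because that reproduces the 7.08 on the slide (the base only rescales the scores, so the ranking is unaffected).

```python
import math
from itertools import combinations

def neighbors(edges):
    """Build Gamma: node -> set of neighbors, from undirected edges."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma

def common_neighbors(gamma, x, y):
    return len(gamma[x] & gamma[y])

def jaccard(gamma, x, y):
    union = gamma[x] | gamma[y]
    return len(gamma[x] & gamma[y]) / len(union) if union else 0.0

def adamic_adar(gamma, x, y, log=math.log10):
    # A common neighbor z of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / log(len(gamma[z])) for z in gamma[x] & gamma[y])

def preferential_attachment(gamma, x, y):
    return len(gamma[x]) * len(gamma[y])

def katz(gamma, x, y, beta=0.005, max_len=4):
    """Truncated, unweighted Katz: sum over l <= max_len of beta^l * #walks."""
    walks = {x: 1}  # walks[v] = number of length-l walks from x ending at v
    score = 0.0
    for length in range(1, max_len + 1):
        step = {}
        for v, count in walks.items():
            for w in gamma[v]:
                step[w] = step.get(w, 0) + count
        walks = step
        score += beta ** length * walks.get(y, 0)
    return score

def rank_pairs(gamma, score):
    """Rank all non-adjacent pairs in decreasing order of score."""
    pairs = [(x, y) for x, y in combinations(sorted(gamma), 2)
             if y not in gamma[x]]
    return sorted(pairs, key=lambda p: score(gamma, *p), reverse=True)

# Hypothetical toy graph matching the slide examples.
EDGES = [("x", "z1"), ("x", "z2"), ("x", "z3"), ("x", "a"),
         ("y", "z1"), ("y", "z2"), ("y", "z3"),
         ("y", "b"), ("y", "c"), ("y", "d"),
         ("z1", "z3"), ("z1", "a")]
gamma = neighbors(EDGES)

print(common_neighbors(gamma, "x", "y"))         # 3
print(preferential_attachment(gamma, "x", "y"))  # 24
print(round(adamic_adar(gamma, "x", "y"), 2))    # 7.08
```

Every predictor has the same `score(gamma, x, y)` shape, so `rank_pairs(gamma, jaccard)` or `rank_pairs(gamma, katz)` produces the ranked list $L_p$ for any of them.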
Higher-level Approaches
We now discuss three "meta-approaches" that can be used in conjunction with any of the methods discussed above:
• Low-rank approximation
• Unseen bigrams
• Clustering

Results and Discussion
• As discussed before, many collaborations form for reasons outside the scope of the network
• To represent predictor quality, we report each predictor's performance as a factor improvement over a random predictor

Performance of various predictors
First row: probability that a random prediction is correct. All other rows: factor improvement over random prediction.

predictor                                             astro-ph  cond-mat  gr-qc   hep-ph  hep-th
probability that a random prediction is correct       0.48%     0.15%     0.34%   0.21%   0.15%
graph distance (all distance-two pairs)               9.6       25.3      21.4    12.2    29.2
common neighbors                                      18        41.1      27.2    27      47.2
preferential attachment                               4.7       6.1       7.6     15.2    7.5
Adamic/Adar                                           16.8      54.8      30.1    33.3    50.5
Jaccard                                               16.4      42.3      19.9    27.7    41.7
SimRank, γ = 0.8                                      14.6      39.3      22.8    26.1    41.7
hitting time                                          6.5       23.8      25      3.8     13.4
hitting time, normed by stationary distribution       5.3       23.8      11      11.3    21.3
commute time                                          5.2       15.5      33.1    17.1    23.4
commute time, normed by stationary distribution       5.3       16.1      11      11.3    16.3
rooted PageRank, α = 0.01                             10.8      28        33.1    18.7    29.2
rooted PageRank, α = 0.05                             13.8      39.9      35.3    24.6    41.3
rooted PageRank, α = 0.15                             16.6      41.1      27.2    27.6    42.6
rooted PageRank, α = 0.30                             17.1      42.3      25      29.9    46.8
rooted PageRank, α = 0.50                             16.8      41.1      24.3    30.7    46.8
Katz (weighted), β = 0.05                             3         21.4      19.9    2.4     12.9
Katz (weighted), β = 0.005                            13.4      54.8      30.1    24      52.2
Katz (weighted), β = 0.0005                           14.5      54.2      30.1    32.6    51.8
Katz (unweighted), β = 0.05                           10.9      41.7      37.5    18.7    48
Katz (unweighted), β = 0.005                          16.8      41.7      37.5    24.2    49.7
Katz (unweighted), β = 0.0005                         16.8      41.7      37.5    24.9    49.7

[Figure: bar chart of factor improvement over random prediction, methods based on node neighborhoods (distance-two pairs, common neighbors, preferential attachment, Adamic/Adar, Jaccard), for astro-ph, cond-mat, gr-qc, hep-ph and hep-th.]

[Figure: bar chart of factor improvement over random prediction, methods based on ensemble of all paths (SimRank γ = 0.8; hitting time, normed; commute time, normed; rooted PageRank α = 0.50; Katz (weighted) β = 0.0005; Katz (unweighted) β = 0.0005), for the same five datasets.]

Results and Discussion
• There is no single clear winner among the techniques
• But a number of methods outperform the random predictor:
  • Katz
  • Common neighbors
  • Adamic/Adar

Similarities among the predictors
• There is overlap in the predictions made by the various methods
• We will observe two tables:
  • The number of common predictions
  • The number of correct common predictions

Number of common predictions (each predictor makes 1150 predictions, the |E_new| of cond-mat). Columns follow the row order: AA = Adamic/Adar, KC = Katz clustering, CN = common neighbors, HT = hitting time, JC = Jaccard's coefficient, WK = weighted Katz, LR = low-rank inner product, PR = rooted PageRank, SR = SimRank, UB = unseen bigrams.

                         AA    KC    CN    HT    JC    WK    LR    PR    SR    UB
Adamic/Adar              1150  638   520   193   442   1011  905   528   372   486
Katz clustering                1150  411   182   285   630   623   347   245   389
common neighbors                     1150  135   506   494   467   305   332   489
hitting time                               1150  87    191   192   247   130   156
Jaccard's coefficient                            1150  414   382   247   845   458
weighted Katz                                          1150  1013  488   344   474
low-rank inner product                                       1150  453   320   448
rooted PageRank                                                    1150  678   461
SimRank                                                                  1150  423
unseen bigrams                                                                 1150

Number of correct common predictions (same predictors and column order).

                         AA    KC    CN    HT    JC    WK    LR    PR    SR    UB
Adamic/Adar              92    65    53    22    43    87    72    44    36    49
Katz clustering                78    41    20    29    66    60    31    22    37
common neighbors                     69    13    43    52    43    27    26    40
hitting time                               40    8     22    19    17    9     15
Jaccard's coefficient                            71    41    32    39    51    43
weighted Katz                                          92    75    44    32    51
low-rank inner product                                       79    39    26    46
rooted PageRank                                                    69    48    39
SimRank                                                                  66    34
unseen bigrams                                                                 68

Small World Problem
• Accounting for the existence of short paths
• Problem: the shortest path between 2 scientists in unrelated disciplines is often very short
• Link predictors can be viewed as using measures of proximity

The breadth of the data
• Considered 3 more datasets
• Examined the performance of the common neighbors predictor vs. the random predictor
• Diverse collection of scientists: common neighbors predictor wins
• Narrow set of scientists: random predictor wins

Future Direction
• Improve the performance
• Improve the efficiency of the proximity-based methods on very large networks
• Suggestions:
  • Bipartite collaboration graph
  • Treating more recent collaborations as more important
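The evaluation procedure from the definitions slides (Core, $E^*_{new}$, the first $n$ Core pairs of $L_p$, and the factor improvement over random) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names and the toy edge lists are hypothetical, every listed edge counts once toward a node's degree, and the random baseline is taken to be $n$ divided by the number of Core pairs not already in $E_{old}$. With the astro-ph counts from the data table, that baseline comes to about 0.47%, in line with the 0.48% reported.

```python
from collections import Counter
from math import comb

def degree(edges):
    """Count how many listed edges touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def evaluate(ranked_pairs, train_edges, test_edges, k_train=3, k_test=3):
    """Score a ranked list L_p against E*_new, relative to random guessing."""
    d_train, d_test = degree(train_edges), degree(test_edges)
    core = {v for v in d_train
            if d_train[v] >= k_train and d_test[v] >= k_test}

    def in_core(pair):
        return all(v in core for v in pair)

    e_old = {frozenset(e) for e in train_edges if in_core(e)}
    e_new_star = {frozenset(e) for e in test_edges if in_core(e)} - e_old
    n = len(e_new_star)
    if n == 0:
        raise ValueError("no new Core edges to predict")

    # L_p(n): the first n pairs of the ranked list that lie in Core x Core.
    top_n = [frozenset(p) for p in ranked_pairs if in_core(p)][:n]
    correct = len(set(top_n) & e_new_star)

    # Assumed baseline: chance that a random non-E_old Core pair is in E*_new.
    random_rate = n / (comb(len(core), 2) - len(e_old))
    return {"n": n, "correct": correct,
            "factor_over_random": (correct / n) / random_rate}

# Hypothetical toy data: a 5-cycle plus one pendant node in training.
train = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a"), ("a", "f")]
test = [("a", "c"), ("b", "d"), ("c", "e"), ("a", "b"), ("d", "e")]
ranked = [("a", "c"), ("f", "c"), ("a", "d"), ("b", "d"), ("b", "e"), ("c", "e")]

result = evaluate(ranked, train, test, k_train=2, k_test=2)
print(result["n"], result["correct"])  # 3 2
```

The pair `("f", "c")` is skipped because `f` is outside Core, which mirrors the restriction of $L_p$ to $Core \times Core$ before the first $n$ pairs are taken.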