Graph similarity

P. Van Dooren, CESAME, Université catholique de Louvain
Based on joint work with Blondel, Gajardo, Heymans, Senellart

The web graph

- Nodes = web pages, edges = hyperlinks between pages.
- Several billion web pages, with an average of 7 outgoing links per page.

Structure of the web

- Experiments (two crawls over 200 million pages in 1999) found a giant strongly connected component, the "core":
  - it contains the most prominent sites
  - it contains about 30% of all pages
  - the average distance between its nodes is 16 ("small world")

In- and out-degree distributions

- Power-law distribution: the number of pages of in-degree n is proportional to 1/n^{2.1} (Zipf law).

A score for every page

- The score of a page is high if the page has many incoming links coming from pages with a high score.
- Model: one browses from page to page by following outgoing links with equal probability; the score of a page is the frequency with which it is visited.
- But some pages have no outgoing links, and many pages then end up with zero visit frequency.

PageRank: the teleporting random surfer

- The surfer follows an outgoing link of page i with probability p / d_out(i), or teleports to a uniformly random web page with probability 1-p (with 0 < 1-p < 1).
- Collect the transition probabilities from i to j in a matrix M (with b_ij = 1 if i → j):

  m_ij = p b_ij / d_out(i) + (1-p)/n

- The vector x of the probability distribution over the nodes is the steady-state vector of the iteration x_{k+1} = M^T x_k, i.e. the dominant (left) eigenvector of M, which is unique by Perron-Frobenius.
- The PageRank of node i is the (relative) size of element i of this vector.

Kleinberg's structure graph

- The score of a page is high if the page has many incoming links, and higher still if the incoming links come from pages that themselves have high scores.
- This inspired Kleinberg's "structure graph": hub → authority.
- [Figures: good authorities for "University Belgium"; a good hub for "University Belgium".]

Hub and authority scores

- The hub score h_i and authority score a_i ought to be mutually reinforcing:
  - pages with large h_i point to pages with high a_j
  - pages with large a_i are pointed to by pages with high h_j

  h_i ← Σ_{j : (i→j)} a_j,   a_i ← Σ_{j : (j→i)} h_j

- Using the adjacency matrix B of the graph (b_ij = 1 if i → j is an edge):

  [h; a]_{k+1} = [0 B; B^T 0] [h; a]_k,   [h; a]_0 = [1; 1]

- The limiting vector a (the dominant eigenvector of B^T B) is used to rank pages.

Alternative method: a three-node structure graph

- Give three scores to each web page: begin b, center c, end e (structure graph b → c → e).
- Mutual reinforcement again defines the iteration

  b_j ← Σ_{i : (j→i)} c_i
  c_j ← Σ_{i : (i→j)} b_i + Σ_{i : (j→i)} e_i
  e_j ← Σ_{i : (i→j)} c_i

- This defines a limiting vector for the iteration x_{k+1} = M x_k, x_0 = 1, where

  x = [b; c; e],   M = [0 B 0; B^T 0 B; 0 B^T 0]

Bow-tie example

- Graph A: the path 1 → 2. Graph B: a bow tie with n nodes pointing to a central node n+1, which in turn points to m nodes n+2, …, n+m+1.
- The limiting similarity matrix S depends discontinuously on m and n: if m > n only the center and its m children receive nonzero scores; if n > m only the n parents and the center do. Not satisfactory.
- With graph A the path 1 → 2 → 3 instead, the parents of the bow tie are matched to node 1, the center to node 2, and the children to node 3, for all m and n: the central score is good.

Towards arbitrary graphs

- For the graph • → • :  A = [0 1; 0 0]  and  M = [0 B; B^T 0].
- For the graph • → • → • :  A = [0 1 0; 0 0 1; 0 0 0]  and  M = [0 B 0; B^T 0 B; 0 B^T 0].
- Formula for M for two arbitrary graphs G_A and G_B:

  M = A ⊗ B + A^T ⊗ B^T

- With x_k = vec(X_k), the iteration x_{k+1} = M x_k is equivalent to X_{k+1} = B X_k A^T + B^T X_k A.

Similarity matrix of two arbitrary graphs

- Two nodes are similar if their parents and their children are similar. Such a recursive definition leads to an eigenvector equation: for A and B the adjacency matrices of the two graphs, the similarity matrix S solves

  ρ S = A S B^T + A^T S B

  with S_{ij} the similarity between vertex i of G_A and vertex j of G_B (the transpose of the X above).
- This matrix can be obtained as a fixed point of the power method (linear convergence).
- Ref: Blondel et al., SIAM Rev., '04.
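A minimal NumPy sketch of this fixed-point computation may help fix ideas. It is not from the talk: the function name `similarity_matrix` and the convergence defaults are my own choices. It normalizes once per double step, which gives the same even iterates because the map is linear; the normalization and the even/odd subtlety are spelled out on the Algorithm slide below.

```python
import numpy as np

def similarity_matrix(A, B, tol=1e-12, max_iter=10_000):
    """Similarity matrix of G_A and G_B (sketch after Blondel et al., SIAM Rev. '04).

    A, B: 0/1 adjacency matrices.  Returns S with S[i, j] the similarity
    between vertex i of G_A and vertex j of G_B, computed as the limit of
    the even iterates of X <- (A X B^T + A^T X B)/||.||_F, X_0 = all ones.
    """
    X = np.ones((A.shape[0], B.shape[0]))
    for _ in range(max_iter):
        Y = A @ X @ B.T + A.T @ X @ B   # one application of the linear map
        Y = A @ Y @ B.T + A.T @ Y @ B   # second application: even iterate
        Y /= np.linalg.norm(Y, "fro")   # normalizing once here is equivalent
        if np.linalg.norm(Y - X, "fro") < tol:
            return Y
        X = Y
    return X

# Hub/authority scores as a special case: similarity with the path 1 -> 2.
A = np.array([[0, 1], [0, 0]], dtype=float)                    # structure graph
B = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=float)   # a small DAG
S = similarity_matrix(A, B)
print(S[0], S[1])   # row 0 ~ hub scores, row 1 ~ authority scores of G_B
```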
Algorithm

- The normalized sequence

  X_{k+1} = (A X_k B^T + A^T X_k B) / ||A X_k B^T + A^T X_k B||_F

  has, for every X_0 > 0, two limit points X_even and X_odd (fixed points of the squared map).
- The similarity matrix is S = lim_{k→∞} X_{2k} with X_0 = 1 (the all-ones matrix); S_{ij} is the similarity score between V_i(G_A) and V_j(G_B).
- With x_k = vec(X_k), this is equivalent to the power method

  x_{k+1} = (B ⊗ A + B^T ⊗ A^T) x_k / ||(B ⊗ A + B^T ⊗ A^T) x_k||_2

  i.e. the power method on the matrix M = B ⊗ A + B^T ⊗ A^T.

Some properties

- S satisfies ρ S = A S B^T + A^T S B with ρ = ||A S B^T + A^T S B||_F.
- S is the nonnegative fixed point of largest 1-norm.
- S solves the optimization problem  max_S ⟨A S B^T + A^T S B, S⟩  subject to ||S||_F = 1.
- The method is an extension of Kleinberg's HITS method.
- Linear convergence (power method on the sparse matrix M).

Other properties

- The central score is a dominant eigenvector of B B^T + B^T B (cf. the hub score, a dominant eigenvector of B B^T, and the authority score, of B^T B).
- The similarity matrix of a graph with itself is square and positive semi-definite.
- Example: for the path graph • → • → • the self-similarity matrix is

  S = [.4 0 0; 0 .8 0; 0 0 .4]

  while for the 3-cycle every entry of S is equal (every node is equally similar to every node).

The dictionary graph

- OPTED, based on Webster's unabridged dictionary: http://msowww.anu.edu.au/~ralph/OPTED
- Nodes = words present in the dictionary: 112,169 nodes.
- Edge (u, v) if v appears in the definition of u: 1,398,424 edges, an average of 12 edges per node.
- The in- and out-degree distributions are very similar to those of the web (power law).
- Words with highest in-degree: of, a, the, or, to, in, …
- Words with zero out-degree: 14159, Fe3O4, Aaron, and some undefined or misspelled words.

Neighborhood graph

- The neighborhood graph of a word is the sub-graph used for finding synonyms: it contains all parents and children of the node. [Figure: neighborhood graph of "likely".]
- "Central" uses this sub-graph to rank synonyms automatically: rank each node in the graph by its similarity to the node c in the structure graph b → c → e (a code sketch follows the tables below).
- Ref: Blondel et al., SIAM Rev., '04.

Disappear

        Vectors      Central      ArcRank        WordNet      Microsoft Word
 1      vanish       vanish       epidemic       vanish       vanish
 2      wear         pass         disappearing   go away      cease to exist
 3      die          die          port           end          fade away
 4      sail         wear         dissipate      finish       die out
 5      faint        faint        cease          terminate    go
 6      light        fade         eat            cease        evaporate
 7      port         sail         gradually                   wane
 8      absorb       light        instrumental                expire
 9      appear       dissipate    darkness                    withdraw
10      cease        cease        efface                      pass away
Mark    3.6          6.3          1.2            7.5          8.6
Std Dev 1.8          1.7          1.2            1.4          1.3

(Vectors, Central and ArcRank are automatic; WordNet and Microsoft Word are manual.)

Science

        Vectors      Central      ArcRank        WordNet           Microsoft Word
 1      art          art          formulate      knowledge domain  discipline
 2      branch       branch       arithmetic     knowledge base    knowledge
 3      nature       law          systematize    discipline        skill
 4      law          study        scientific     subject           art
 5      knowledge    practice     knowledge      subject area
 6      principle    natural      geometry       subject field
 7      life         knowledge    philosophical  field
 8      natural      learning     learning       field of study
 9      electricity  theory       expertness     ability
10      biology      principle    mathematics    power
Mark    3.6          4.4          3.2            7.1               6.5
Std Dev 2.0          2.5          2.9            2.6               2.4

Parallelogram

        Vectors         Central        ArcRank         WordNet        Microsoft Word
 1      square          square         quadrilateral   quadrilateral  diamond
 2      parallel        rhomb          gnomon          quadrangle     lozenge
 3      rhomb           parallel       right-lined     tetragon       rhomb
 4      prism           figure         rectangle
 5      figure          prism          consequently
 6      equal           equal          parallelopiped
 7      quadrilateral   opposite       parallel
 8      opposite        angles         cylinder
 9      altitude        quadrilateral  popular
10      parallelopiped  rectangle      prism
Mark    4.6             4.8            3.3             6.3            5.3
Std Dev 2.7             2.5            2.2             2.5            2.6

Sugar

        Vectors      Central      ArcRank      WordNet           Microsoft Word
 1      juice        cane         granulation  sweetening        darling
 2      starch       starch       shrub        sweetener         baby
 3      cane         sucrose      sucrose      carbohydrate      honey
 4      milk         milk         preserve     saccharide        dear
 5      molasses     sweet        honeyed      organic compound  love
 6      sucrose      dextrose     property     saccharify        dearest
 7      wax          molasses     sorghum      sweeten           beloved
 8      root         juice        grocer       dulcify           precious
 9      crystalline  glucose      acetate      edulcorate        pet
10      confection   lactose      saccharine   dulcorate         babe
Mark    3.9          6.3          4.3          6.2               4.7
Std Dev 2.0          2.4          2.3          2.9               2.7

- Central compares well with Vectors and ArcRank (automatic) and with WordNet and Microsoft Word (manual).
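To make the "Central" recipe concrete, here is a small sketch, not from the talk: the function `central_synonyms` and its edge-set input format are hypothetical, and it reuses `similarity_matrix()` from the earlier sketch.

```python
import numpy as np

def central_synonyms(edges, word, top=10):
    """Rank candidate synonyms of `word` by their central similarity score.

    `edges` is a set of (u, v) pairs meaning v appears in the definition of u
    (the dictionary-graph convention above).  Hypothetical helper; assumes
    similarity_matrix() from the earlier sketch is in scope.
    """
    # Neighborhood graph: the word together with all its parents and children.
    nodes = ({word}
             | {u for (u, v) in edges if v == word}
             | {v for (u, v) in edges if u == word})
    idx = {w: i for i, w in enumerate(sorted(nodes))}
    B = np.zeros((len(nodes), len(nodes)))
    for (u, v) in edges:                 # keep only edges inside the neighborhood
        if u in idx and v in idx:
            B[idx[u], idx[v]] = 1.0
    # Structure graph b -> c -> e; the row of its middle node c gives
    # the central score of every neighborhood node.
    A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
    S = similarity_matrix(A, B)          # S is 3 x |nodes|
    central = S[1]                       # similarity to the central node c
    ranked = sorted(nodes - {word}, key=lambda w: -central[idx[w]])
    return ranked[:top]
```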
Real-world application: typed graphs

- Graphs with colored nodes (Fraikin, Van Dooren, ECC '07) and graphs with colored edges.

Typed nodes

- Partition the adjacency matrices by node type,

  A = [A11 A12; A21 A22],   B = [B11 B12; B21 B22],

  and compute (for the symmetric case) the Perron vector [vec(S11); vec(S22)] of

  [A11 ⊗ B11   A12 ⊗ B12;
   A21 ⊗ B21   A22 ⊗ B22]

- Only nodes of the same type are compared, so the similarity matrix is block diagonal: S = diag(S11, S22). (A code sketch follows the concluding remarks.)

Typed edges

- Partition the source and terminal matrices by edge type,

  A_S = [A_S1 A_S2],  A_T = [A_T1 A_T2]  and  B_S = [B_S1 B_S2],  B_T = [B_T1 B_T2],

  and compute the left and right singular Perron vectors, vec(N) and [vec(E1); vec(E2)], of

  G = [A_S1 ⊗ B_S1 + A_T1 ⊗ B_T1   A_S2 ⊗ B_S2 + A_T2 ⊗ B_T2]

Concluding remarks

- The iteration runs on large sparse graphs; the complexity of one iteration step is linear in the number of nodes of both graphs.
- We have methods with linear convergence (a power-like method and a gradient-like method).
- Extensions to colored nodes and colored edges.
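For completeness, here is a sketch of the colored-node iteration in the same NumPy style as before. This is my own matrix-form transcription of the block Kronecker formula on the "Typed nodes" slide, restricted as there to the symmetric case; the function name is mine, and rows of the blocks are indexed by the nodes of G_A (the Kronecker ordering convention may transpose the blocks).

```python
import numpy as np

def typed_similarity(A_blocks, B_blocks, tol=1e-12, max_iter=10_000):
    """Colored-node similarity, symmetric case (sketch after Fraikin-Van Dooren).

    A_blocks = ((A11, A12), (A21, A22)) partitions the symmetric adjacency
    matrix of G_A by node type; likewise B_blocks for G_B.  Only same-type
    nodes are compared, so S = diag(S11, S22): the two blocks are iterated
    jointly and normalized together.
    """
    (A11, A12), (A21, A22) = A_blocks
    (B11, B12), (B21, B22) = B_blocks
    S11 = np.ones((A11.shape[0], B11.shape[0]))
    S22 = np.ones((A22.shape[0], B22.shape[0]))
    for _ in range(max_iter):
        T11, T22 = S11, S22
        for _ in range(2):               # even iterates, as in the basic method
            U11 = A11 @ T11 @ B11.T + A12 @ T22 @ B12.T
            U22 = A21 @ T11 @ B21.T + A22 @ T22 @ B22.T
            T11, T22 = U11, U22
        # joint Frobenius normalization of the block-diagonal similarity matrix
        nrm = np.sqrt(np.linalg.norm(T11, "fro")**2 + np.linalg.norm(T22, "fro")**2)
        T11, T22 = T11 / nrm, T22 / nrm
        if np.linalg.norm(T11 - S11, "fro") + np.linalg.norm(T22 - S22, "fro") < tol:
            return T11, T22
        S11, S22 = T11, T22
    return S11, S22
```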