The web graph
• Nodes = web pages, edges = hyperlinks between pages
• 4 billion pages (Google searched 3,083,324,625 web pages in 2002)
• Average of 7 outgoing links per page
• Growth of a few % every month

Outline
1. Structure of the web
2. Methods for searching the web (Google PageRank and Kleinberg HITS)
3. Similarity in graphs
4. Application to synonym extraction
Ref: Web searching and graph similarity, V. Blondel, A. Gajardo, M. Heymans, P. Senellart and P. Van Dooren, SIAM Review, http://epubs.siam.org/sam-bin/dbq/article/41596

Structure of the web
In 1999 a giant strongly connected component (core) was discovered:
• It contains the most prominent sites
• It contains 30% of all pages
• The average distance between nodes is 16
• Small world
Ref: Broder et al., Graph structure in the web, WWW9, 2000, http://www.almaden.ibm.com/cs/k53/www9.final/

The web is a bowtie
Ref: The web is a bowtie, Nature, Vol. 405, May 11, 2000

In- and out-degree distributions
Power law distribution: the number of pages of in-degree $n$ is proportional to $1/n^{2.1}$ (Zipf law)

A score for every page
The score of a page is high if the page has many incoming links coming from pages with a high page score.
One browses from page to page by following outgoing links with equal probability. Score = frequency with which a page is visited.
… some pages may have no outgoing links
… many pages have zero frequency

PageRank: teleporting random score
The surfer follows a path by choosing each outgoing link of the current page $i$ with probability $p/d_{out}(i)$, or teleports to a random web page with probability $1-p$, where $0 < 1-p < 1$.
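The random surfer described above can be simulated directly, taking the score of a page as the fraction of steps spent on it. The following is a minimal Python sketch, not the method used by Google: the names `simulate_surfer` and `out_links`, the toy graph, the value p = 0.85, and the rule that a page without outgoing links always teleports are assumptions made for illustration.

```python
import random


def simulate_surfer(out_links, p=0.85, steps=200_000, seed=0):
    """Estimate visit frequencies of a teleporting random surfer.

    out_links[i] lists the pages that page i links to (hypothetical format).
    p is the probability of following a link; 0.85 is an assumed value.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(out_links)
    visits = [0] * n
    page = rng.randrange(n)            # start on a random page
    for _ in range(steps):
        visits[page] += 1
        links = out_links[page]
        if links and rng.random() < p:
            # follow one of the outgoing links, each with equal probability
            page = rng.choice(links)
        else:
            # teleport; assumed to be the only move from a page with no links
            page = rng.randrange(n)
    return [count / steps for count in visits]


# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2 (page 3 has no incoming link)
frequencies = simulate_surfer([[1, 2], [2], [0], [2]])
for page, frequency in enumerate(frequencies):
    print(page, round(frequency, 3))
```

In this toy graph page 3 is reached only by teleporting, so its frequency stays small but nonzero, which is the purpose of the teleport step.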
Put the transition probability from $i$ to $j$ in a matrix $M$ (with $b_{ij} = 1$ if $i \to j$):
$$m_{ij} = p\, b_{ij}/d_{out}(i) + (1-p)/n$$
Then the vector $x$ of the probability distribution on the nodes of the graph is the steady-state vector of the iteration $x_{k+1} = M^T x_k$ (the transpose appears because $m_{ij}$ is the probability of moving from $i$ to $j$), i.e. the dominant eigenvector of $M^T$ (unique because of Perron-Frobenius).
The PageRank of node $i$ is the (relative) size of element $i$ of this vector.
Ref: Matlab News and Notes, October 2002

And my own PageRank? Use the Google toolbar.
Some top pages:

      Page                        PageRank   In-degree
 1    http://www.yahoo.com        10         654,000
 2    http://www.adobe.com        10         646,000
 5    http://www.google.com       10         252,000
 8    http://www.microsoft.com    10         129,000
12    http://www.nasa.gov         10          93,900
20    http://mit.edu              10          47,600
23    http://www.nsf.gov          10          39,400
26    http://www.inria.fr         10          17,400
72    http://www.stanford.edu      9          36,300

Ref: S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://dbpubs.stanford.edu:8090/pub/1998-8

Kleinberg's structure graph
The score of a page is high if the page has many incoming links.
The score is high if the incoming links are from pages that have high scores.
This inspired Kleinberg's "structure graph": hub → authority
[Figures: good authorities for "University Belgium"; a good hub for "University Belgium"]

Hub and authority scores
Web pages have a hub score $h_j$ and an authority score $a_j$ which are mutually reinforcing:
• pages with large $h_j$ point to pages with high $a_j$
• pages with large $a_j$ are pointed to by pages with high $h_j$
$$h_j \leftarrow \sum_{i:(j \to i)} a_i, \qquad a_j \leftarrow \sum_{i:(i \to j)} h_i$$
or, using the adjacency matrix $B$ of the graph ($b_{ij} = 1$ if $i \to j$ is an edge),
$$\begin{pmatrix} h \\ a \end{pmatrix}_{k+1} = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \begin{pmatrix} h \\ a \end{pmatrix}_k, \qquad \begin{pmatrix} h \\ a \end{pmatrix}_0 = \begin{pmatrix} \mathbf{1} \\ \mathbf{1} \end{pmatrix}$$
Use the limiting vector $a$ (dominant eigenvector of $B^T B$) to rank pages.

Extension to another structure graph
Give three scores to each web page: begin $b$, center $c$, end $e$ (structure graph $b \to c \to e$).
Use mutual reinforcement again to define the iteration
$$b_j \leftarrow \sum_{i:(j \to i)} c_i, \qquad c_j \leftarrow \sum_{i:(i \to j)} b_i + \sum_{i:(j \to i)} e_i, \qquad e_j \leftarrow \sum_{i:(i \to j)} c_i$$
This defines a limiting vector for the iteration $x_{k+1} = M x_k$, $x_0 = \mathbf{1}$, where
$$x = \begin{pmatrix} b \\ c \\ e \end{pmatrix}, \qquad M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$$

Towards arbitrary graphs
For the graph • → •:
$$A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad M = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$$
For the graph • → • → •:
$$A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$$
Formula for $M$ for two arbitrary graphs $G_A$ and $G_B$:
$$M = A \otimes B + A^T \otimes B^T$$
With $x_k = \mathrm{vec}(X_k)$, the iteration $x_{k+1} = M x_k$ is equivalent to
$$X_{k+1} = B X_k A^T + B^T X_k A$$

Convergence?
The (normalized) sequence
$$Z_{k+1} = \frac{B Z_k A^T + B^T Z_k A}{\| B Z_k A^T + B^T Z_k A \|_2}$$
has two limit points, $Z_{even}$ and $Z_{odd}$ (the limits of the even and odd iterates), for every $Z_0 > 0$.
Similarity matrix: $S = \lim_{k \to \infty} Z_{2k}$, with $Z_0 = \mathbf{1}$.
$S_{ij}$ is the similarity score between vertex $V_j(A)$ and vertex $V_i(B)$.

Properties
• $\rho S = B S A^T + B^T S A$, $\rho = \| B S A^T + B^T S A \|_2$
• Fixed point of largest 1-norm
• Robust fixed point for $M + \varepsilon \mathbf{1}$
• Linear convergence (power method for sparse $M$)

Bow tie example
Graph A: 1 • → • 2
[Figure: bow tie graph B with nodes labeled 1, 2, …, n+1, …, n+m+1]
With rows ordered as nodes 1, 2, …, n+1, n+2, …, n+m+1 of graph B:
$$S = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \text{ if } m > n, \qquad S = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \end{pmatrix} \text{ if } n > m$$
Not satisfactory.

Bow tie example
Graph A: 1 • → • 2 → • 3
[Figure: bow tie graph B with nodes labeled 1, 2, …, n+1, …, n+m+1]
$$S = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \end{pmatrix}$$
The central score is good.

Other properties
• The central score is a dominant eigenvector of $B B^T + B^T B$ (cf. the hub score, from $B B^T$, and the authority score, from $B^T B$)
• The similarity matrix of a graph with itself is square and positive semi-definite.
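The iteration $X_{k+1} = B X_k A^T + B^T X_k A$ with even normalized iterates can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the fixed number of iterations, and the use of the Frobenius norm of $X$ (the 2-norm of $\mathrm{vec}(X)$) for normalization are assumptions, and both graphs are assumed to have at least one edge.

```python
import math


def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    return [list(column) for column in zip(*P)]


def similarity_matrix(A, B, iterations=100):
    """Approximate S = lim Z_2k for adjacency matrices A and B.

    Entry S[i][j] scores vertex i of graph B against vertex j of graph A.
    The iteration count must be even so that an even iterate is returned.
    """
    if iterations % 2 != 0:
        raise ValueError("iterations must be even")
    At, Bt = transpose(A), transpose(B)
    X = [[1.0] * len(A) for _ in range(len(B))]      # X_0 = all ones
    for _ in range(iterations):
        left = matmul(matmul(B, X), At)              # B X A^T
        right = matmul(matmul(Bt, X), A)             # B^T X A
        X = [[l + r for l, r in zip(lrow, rrow)]
             for lrow, rrow in zip(left, right)]
        norm = math.sqrt(sum(v * v for row in X for v in row))
        if norm == 0.0:
            raise ValueError("a graph without edges gives a zero iterate")
        X = [[v / norm for v in row] for row in X]   # normalize each step
    return X


# Path graph 1 -> 2 -> 3 compared with itself
path = [[0, 1, 0],
        [0, 0, 1],
        [0, 0, 0]]
for row in similarity_matrix(path, path):
    print([round(v, 3) for v in row])
```

For the path graph compared with itself, the iterates approach a multiple of the identity matrix. Choosing A as the two-node graph • → • reduces the same routine to the hub and authority iteration, with hub scores in the first column of the result and authority scores in the second.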
Self-similarity matrices of two small graphs:
Path graph • → • → •
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
Cycle graph
$$\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

The dictionary graph
OPTED, based on Webster's unabridged dictionary, http://msowww.anu.edu.au/~ralph/OPTED
• Nodes = words present in the dictionary: 112,169 nodes
• Edge (u,v) if v appears in the definition of u: 1,398,424 edges
• Average of 12 edges per node

In- and out-degree distribution
• Very similar to the web (power law)
• Words with highest in-degree: of, a, the, or, to, in …
• Words with null out-degree: 14159, Fe3O4, Aaron, and some undefined or misspelled words

Neighborhood graph
The neighborhood graph is the subset of vertices used for finding synonyms: it contains all parents and children of the node.
[Figure: neighborhood graph of "likely"]
"Central" uses this sub-graph to rank synonyms automatically.

Comparison with Vectors, ArcRank (automatic) and Wordnet, Microsoft Word (manual)

Disappear
          Vectors    Central     ArcRank        Wordnet     Microsoft
 1        vanish     vanish      epidemic       vanish      vanish
 2        wear       pass        disappearing   go away     cease to exist
 3        die        die         port           end         fade away
 4        sail       wear        dissipate      finish      die out
 5        faint      faint       cease          terminate   go
 6        light      fade        eat            cease       evaporate
 7        port       sail        gradually      -           wane
 8        absorb     light       instrumental   -           expire
 9        appear     dissipate   darkness       -           withdraw
10        cease      cease       efface         -           pass away
Mark      3.6        6.3         1.2            7.5         8.6
Std Dev   1.8        1.7         1.2            1.4         1.3

Parallelogram
          Vectors          Central         ArcRank          Wordnet         Microsoft
 1        square           square          quadrilateral    quadrilateral   diamond
 2        parallel         rhomb           gnomon           quadrangle      lozenge
 3        rhomb            parallel        right-lined      tetragon        rhomb
 4        prism            figure          rectangle        -               -
 5        figure           prism           consequently     -               -
 6        equal            equal           parallelopiped   -               -
 7        quadrilateral    opposite        parallel         -               -
 8        opposite         angles          cylinder         -               -
 9        altitude         quadrilateral   popular          -               -
10        parallelopiped   rectangle       prism            -               -
Mark      4.6              4.8             3.3              6.3             5.3
Std Dev   2.7              2.5             2.2              2.5             2.6

Science
          Vectors       Central     ArcRank         Wordnet            Microsoft
 1        art           art         formulate       knowledge domain   discipline
 2        branch        branch      arithmetic      knowledge base     knowledge
 3        nature        law         systematize     discipline         skill
 4        law           study       scientific      subject            art
 5        knowledge     practice    knowledge       subject area       -
 6        principle     natural     geometry        subject field      -
 7        life          knowledge   philosophical   field              -
 8        natural       learning    learning        field of study     -
 9        electricity   theory      expertness      ability            -
10        biology       principle   mathematics     power              -
Mark      3.6           4.4         3.2             7.1                6.5
Std Dev   2.0           2.5         2.9             2.6                2.4

Sugar
          Vectors       Central    ArcRank       Wordnet            Microsoft
 1        juice         cane       granulation   sweetening         darling
 2        starch        starch     shrub         sweetener          baby
 3        cane          sucrose    sucrose       carbohydrate       honey
 4        milk          milk       preserve      saccharide         dear
 5        molasses      sweet      honeyed       organic compound   love
 6        sucrose       dextrose   property      saccarify          dearest
 7        wax           molasses   sorghum       sweeten            beloved
 8        root          juice      grocer        dulcify            precious
 9        crystalline   glucose    acetate       edulcorate         pet
10        confection    lactose    saccharine    dulcorate          babe
Mark      3.9           6.3        4.3           6.2                4.7
Std Dev   2.0           2.4        2.3           2.9                2.7

Conclusion
• New notion of similarity between vertices of a graph
• Easy to compute: start from $X_0 = \mathbf{1}$ and take the even normalized iterates of $X_{k+1} = B X_k A^T + B^T X_k A$
• Potential use for data mining, classification, clustering
• Successful implementation for the French dictionary "Le petit Robert"
• Applications in texts, internet, reference lists, telephone networks, bipartite graphs… (Melnik, Widom, …)
• Different from sub-graph problems!

Distribution of calls received
[Figure: distribution of calls received; axes: number of customers, number of calls received. Example: 2000 people have received 100 calls]