Information Retrieval (11)
Prof. Dragomir R. Radev
[email protected]

IR Winter 2010 … 17. continued …

[Slide from Reka Albert]

The strength of weak ties
• Granovetter's study: finding jobs
• Weak ties: more people can be reached through weak ties than through strong ties (e.g., through your 7th and 8th best friends)
• More here: http://en.wikipedia.org/wiki/Weak_tie

Prestige and centrality
• Degree centrality: how many neighbors each node has.
• Closeness centrality: how close a node is to all of the other nodes.
• Betweenness centrality: based on the role that a node plays by virtue of being on the paths between other pairs of nodes.
• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
• Prestige = same as centrality but for directed graphs.

IR Winter 2010 … 18. Graph-based methods; Harmonic functions; Random walks; PageRank …

Random walks and harmonic functions
• Drunkard's walk: start at position 0 on a line: 0 1 2 3 4 5
• What is the probability of reaching 5 before reaching 0?
• Harmonic functions:
  – P(0) = 0
  – P(N) = 1
  – P(x) = ½·P(x−1) + ½·P(x+1), for 0 < x < N
  – (in general, replace ½ with the bias of the walk)

The original Dirichlet problem
• Distribution of temperature in a sheet of metal.
• One end of the sheet has temperature t = 0, the other end t = 1.
• Laplace's differential equation: ∇²u = u_xx + u_yy = 0
• This is a special (steady-state) case of the (transient) heat equation: u_t = k·∇²u
• In general, the solutions to this equation are called harmonic functions.

Learning harmonic functions
• The method of relaxations
  – Discrete approximation.
  – Assign fixed values to the boundary points.
  – Assign arbitrary values to all other points.
  – Adjust their values to be the average of their neighbors.
  – Repeat until convergence.
• Monte Carlo method
  – Perform a random walk on the discrete representation.
  – Compute f as the probability of a random walk ending in a particular fixed point.
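As an illustrative sketch (not from the slides), the method of relaxations can be applied directly to the drunkard's walk: clamp the boundary values P(0) = 0 and P(N) = 1, and repeatedly replace each interior value by the average of its two neighbors until nothing changes. For the unbiased walk the harmonic solution is the straight line P(x) = x/N.

```python
# Method of relaxations for the drunkard's walk on positions 0..N (sketch).
# Boundary: P(0) = 0, P(N) = 1; interior: P(x) = 0.5*P(x-1) + 0.5*P(x+1).

def relax_walk(N, tol=1e-12):
    p = [0.0] * (N + 1)
    p[N] = 1.0                        # fixed boundary values
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for x in range(1, N):         # adjust interior points only
            new = 0.5 * p[x - 1] + 0.5 * p[x + 1]
            delta = max(delta, abs(new - p[x]))
            p[x] = new
    return p

p = relax_walk(5)
# For the unbiased walk the solution is linear: P(x) = x/N.
```

Each sweep is one "adjust values to be the average of the neighbors" pass; the loop stops when the largest update falls below the tolerance.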
• Eigenvector methods
  – Look at the stationary distribution of a random walk.

Eigenvectors and eigenvalues
• An eigenvector is an implicit "direction" for a matrix: Av = λv, where v (the eigenvector) is non-zero, though λ (the eigenvalue) can be any complex number in principle.
• Computing eigenvalues: det(A − λI) = 0

Eigenvectors and eigenvalues
• Example:
      A = [ −1  3 ]        A − λI = [ −1−λ   3 ]
          [  2  0 ]                 [   2   −λ ]
• det(A − λI) = (−1−λ)·(−λ) − 3·2 = 0
• Then: λ² + λ − 6 = 0; λ1 = 2; λ2 = −3
• For λ1 = 2:
      [ −3   3 ] [x1]   [0]
      [  2  −2 ] [x2] = [0]
• Solutions: x1 = x2

Stochastic matrices
• Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0. Example:
      A = [ 3/8  5/8 ]
          [ 1/4  3/4 ]
• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p; the right eigenvector is 1 (the all-ones vector).
• In other words, E^T p = p.

Electrical networks and random walks
• An ergodic (connected) Markov chain with transition matrix P corresponds to an electrical network with conductances C_xy = 1/R_xy:
      P_xy = C_xy / C_x, where C_x = Σ_y C_xy
• Example circuit on nodes a, b, c, d (a–c, a–d, b–c: 1 Ω; b–d, c–d: 0.5 Ω):
          a    b    c    d
      a [ 0    0   1/2  1/2 ]
      b [ 0    0   1/3  2/3 ]
      c [1/4  1/4   0   1/2 ]
      d [1/5  2/5  2/5   0  ]
• The stationary distribution w (with w = P^T w) is proportional to the conductances C_x:
      w = (2/14, 3/14, 4/14, 5/14)
[From Doyle and Snell 2000]

Electrical networks and random walks
• Current: i_xy = (v_x − v_y)/R_xy = (v_x − v_y)·C_xy
• Fix the boundary voltages v_a = 1 V and v_b = 0. Kirchhoff's current law, Σ_y i_xy = 0, then gives for every interior node:
      v_x = Σ_y (C_xy / C_x)·v_y = Σ_y P_xy·v_y
• For the circuit above: v_a = 1, v_b = 0, v_c = 7/16, v_d = 3/8
• v_x is the probability that a random walk starting at x will reach a before reaching b.
• The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits.

Markov chains
• A homogeneous Markov chain is defined by an initial distribution x0 and a Markov kernel E.
• Path = sequence (x0, x1, …, xn), where x_i = x_{i−1}·E.
• The probability of a path can be computed as a product of probabilities for each step i.
• Random walk = find x_j given x0, E, and j.
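The interior voltages of such a network can be computed with exactly the relaxation scheme described earlier: clamp the boundary voltages and repeatedly replace each interior voltage by the conductance-weighted average of its neighbors. A minimal sketch, assuming a four-node circuit with the conductances used in the Doyle-and-Snell-style example (the labels and values here are an illustrative reconstruction, not taken verbatim from the figure):

```python
# Harmonic relaxation on a small electrical network (illustrative sketch).
# Conductance C_xy = 1/R_xy; assumed circuit:
#   a-c: 1 ohm, a-d: 1 ohm, b-c: 1 ohm, b-d: 0.5 ohm, c-d: 0.5 ohm
C = {('a', 'c'): 1.0, ('a', 'd'): 1.0, ('b', 'c'): 1.0,
     ('b', 'd'): 2.0, ('c', 'd'): 2.0}

def cond(x, y):
    """Symmetric conductance lookup; 0 if there is no resistor."""
    return C.get((x, y), C.get((y, x), 0.0))

nodes = ['a', 'b', 'c', 'd']
v = {'a': 1.0, 'b': 0.0, 'c': 0.0, 'd': 0.0}   # v_a, v_b are clamped
for _ in range(200):                            # relax interior nodes only
    for x in ('c', 'd'):
        cx = sum(cond(x, y) for y in nodes)
        v[x] = sum(cond(x, y) * v[y] for y in nodes) / cx
# v['c'] is also the probability that a walk started at c reaches a before b.
```

Each pass enforces v_x = Σ_y P_xy·v_y at the interior nodes, so the loop converges to the harmonic voltages.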
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
  – E is stochastic
  – E is irreducible
  – E is aperiodic
• To make these conditions true:
  – All rows of E add up to 1 (and no value is negative)
  – Make sure that E is strongly connected
  – Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use "teleportation"

Example
[Figure: an eight-node graph with bar charts of the PageRank of nodes 1–8 at t = 0 and t = 1.]
This graph E has a second graph E′ (not drawn) superimposed on it: E′ is the uniform transition graph.

Eigenvectors
• An eigenvector is an implicit "direction" for a matrix: Ev = λv, where v is non-zero, though λ can be any complex number in principle.
• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p; the right eigenvector is 1.
• In other words, E^T p = p.

Computing the stationary distribution
• Solution for the stationary distribution: p^T E = p^T, i.e., (I − E^T)p = 0
• Power method (convergence rate is O(m)):

      function PowerStatDist(E):
      begin
        p(0) = u  (or p(0) = [1, 0, …, 0])
        i = 1
        repeat
          p(i) = E^T p(i−1)
          L = ||p(i) − p(i−1)||1
          i = i + 1
        until L < ε
        return p(i)
      end

Example
[Figure: the same eight-node graph with bar charts of the PageRank of nodes 1–8 at t = 0, t = 1, and t = 10.]

PageRank
• Developed at Stanford and allegedly still being used at Google.
• Not query-specific, although query-specific varieties exist.
• In general, each page is indexed along with the anchor texts pointing to it.
• Among the pages that match the user's query, Google shows the ones with the largest PageRank.
• Google also uses vector-space matching, keyword proximity, anchor text, etc.

IR Winter 2010 … 19.
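The PowerStatDist procedure can be sketched in a few lines of Python; the example matrix and the stopping threshold below are illustrative assumptions, not values from the slides:

```python
# Power method for the stationary distribution p satisfying E^T p = p (sketch).

def power_stat_dist(E, eps=1e-10):
    n = len(E)
    p = [1.0 / n] * n                  # p(0) = u, the uniform distribution
    while True:
        # p(i) = E^T p(i-1):  new[j] = sum_k E[k][j] * p[k]
        new = [sum(E[k][j] * p[k] for k in range(n)) for j in range(n)]
        L = sum(abs(new[j] - p[j]) for j in range(n))   # L1 distance
        p = new
        if L < eps:
            return p

# Illustrative 2-state stochastic matrix (each row sums to 1).
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = power_stat_dist(E)                 # converges to (5/6, 1/6)
```

For this chain the fixed point can be checked by hand: p1 = 0.9·p1 + 0.5·p2 and p1 + p2 = 1 give p = (5/6, 1/6).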
Hubs and authorities; Bipartite graphs; HITS and SALSA; Models of the web …

HITS
• Hyperlink-Induced Topic Search.
• Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine.
• HITS is query-specific.
• Hubs and authorities, e.g., collections of bookmarks about cars vs. actual sites about cars.
[Figure: a hub page ("Car and Driver") pointing to authority pages (Honda, Ford, VW).]

HITS
• Each node in the graph is ranked for hubness (h) and authoritativeness (a).
• Some nodes may have high scores on both.
• Example authorities for the query "java":
  – www.gamelan.com
  – java.sun.com
  – digitalfocus.com/digitalfocus/… (The Java developer)
  – lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html
  – sunsite.unc.edu/javafaq/javafaq.html

HITS
• HITS algorithm:
  – obtain a root set (using a search engine) related to the input query
  – expand the root set by radius one on either side (typically to size 1000–5000)
  – run iterations on the hub and authority scores together
  – report top-ranking authorities and hubs
• Eigenvector interpretation: with adjacency matrix G, the updates are
      a′ = G^T h        h′ = G a
  so a converges to the principal eigenvector of G^T G and h to the principal eigenvector of G G^T.

Example
[Slide from Baldi et al.]

HITS
• HITS is now used by Ask.com and Teoma.com.
• It can also be used to identify communities (e.g., based on synonyms) as well as controversial topics.
• Example for "jaguar":
  – The principal eigenvector gives pages about the animal.
  – The positive end of the second nonprincipal eigenvector gives pages about the football team.
  – The positive end of the third nonprincipal eigenvector gives pages about the car.
• Example for "abortion":
  – The positive end of the second nonprincipal eigenvector gives pages on "planned parenthood" and "reproductive rights".
  – The negative end of the same eigenvector includes "pro-life" sites.
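The hub/authority iteration follows directly from the update rules a′ = G^T h and h′ = G a. A minimal sketch on a toy graph (the graph and the max-normalization are illustrative assumptions):

```python
# HITS iteration on a small directed graph (illustrative sketch).
# G[i][j] = 1 if page i links to page j.

def hits(G, iters=50):
    n = len(G)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # a' = G^T h: a page is a good authority if good hubs point to it
        a = [sum(G[i][j] * h[i] for i in range(n)) for j in range(n)]
        # h' = G a: a page is a good hub if it points to good authorities
        h = [sum(G[i][j] * a[j] for j in range(n)) for i in range(n)]
        # normalize so the scores do not grow without bound
        ma, mh = max(a), max(h)
        a = [x / ma for x in a]
        h = [x / mh for x in h]
    return h, a

# Pages 0 and 1 are pure hubs pointing at pages 2 and 3 (pure authorities).
G = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
h, a = hits(G)
```

On this bipartite toy graph the iteration separates the two roles cleanly: pages 2 and 3 get the top authority scores, pages 0 and 1 the top hub scores.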
• SALSA (Lempel and Moran 2001)

Models of the Web
• Evolving networks: a fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
• Erdös/Rényi 59, 60: random graphs with a Poisson degree distribution:
      P(k) = e^(−λ)·λ^k / k!, with λ = Np
• Barabási/Albert 99: preferential attachment, power-law degree distribution:
      P(k) ∝ k^(−γ)
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03

Evolving Word-based Web
• Observations:
  – Links are made based on topics
  – Topics are expressed with words
  – Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
• Model:
  – Pick n
  – Generate n lengths according to a power-law distribution
  – Generate n documents using a trigram model
• Model (cont'd):
  – Pick words in decreasing order of r.
  – Generate hyperlinks with random directionality
• Outcome:
  – Generates power-law degree distributions
  – Generates topical communities
  – Natural variation of PageRank: LexRank

Readings
• Paper by Church and Gale (http://citeseer.ist.psu.edu/church95poisson.html)
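As a quick sanity check of the Erdös/Rényi claim (an illustrative sketch, not from the slides): sample a G(n, p) graph and compare the empirical degree statistics with the Poisson form P(k) = e^(−λ)·λ^k / k!, λ ≈ Np. The parameters n and p below are arbitrary choices.

```python
import math
import random

# Sample an Erdos-Renyi graph G(n, p) and compare its degrees with the
# Poisson degree distribution predicted for random graphs (sketch).
random.seed(0)
n, p = 500, 0.02
deg = [0] * n
for i in range(n):
    for j in range(i + 1, n):        # each unordered pair is an edge w.p. p
        if random.random() < p:
            deg[i] += 1
            deg[j] += 1

lam = (n - 1) * p                    # expected degree, lambda ~ Np
mean_deg = sum(deg) / n              # empirical mean degree

def poisson(k, lam):
    """Poisson probability mass function P(k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

frac_k10 = deg.count(10) / n         # empirical fraction of degree-10 nodes
# For large n, frac_k10 should be close to poisson(10, lam).
```

The empirical mean degree should sit near λ, and the degree histogram near the Poisson curve; the Barabási/Albert model, by contrast, produces the heavy power-law tail P(k) ∝ k^(−γ).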