Link Analysis
Hongning Wang
CS@UVa, CS 4501: Information Retrieval

Recap: formula for Rocchio feedback
- Standard operation in the vector space model:
  $\vec{q}_m = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$
- $\vec{q}_m$: modified query; $\vec{q}$: original query; $D_r$: relevant docs; $D_n$: non-relevant docs; $\alpha$, $\beta$, $\gamma$: parameters

Recap: click models
- Decompose relevance-driven clicks from position-driven clicks
  - Examine: the user reads the displayed result
  - Click: the user clicks on the displayed result
  - Atomic unit: (query, doc)
  - Click probability = relevance quality x examine probability
- [Figure: examine probability decreasing with the positions of (q,d1) through (q,d4)]

Structured vs. unstructured data
- Our claim before: IR vs. DB = unstructured data vs. structured data
- As a result, we have assumed:
  - Document = a sequence of words
  - Query = a short document
  - Corpus = a set of documents
- However, this assumption is not accurate...

A typical web document
- [Figure: a web page with its title, anchor texts, and body marked]

How does a human perceive a document's structure?

Intra-document structures
- Title: a concise summary of the document
- Paragraph 1: likely to be an abstract of the document
- Paragraph 2, ...
- Images: visual descriptions of the document
- Anchor texts: references to other documents
- These parts might contribute differently to a document's relevance!

Exploring intra-document structures for retrieval
- Intuitively, we want to give different weights to the parts to reflect their importance
- In the vector space model? Weighted TF
- Think about the query-likelihood model: select a part $D_j$ and generate a query word using $D_j$:
  $p(Q|D,R) = \prod_{i=1}^{n} p(w_i|D,R) = \prod_{i=1}^{n}\sum_{j=1}^{k} s(D_j|D,R)\,p(w_i|D_j,R)$
- $s(D_j|D,R)$ is the "part selection" probability; it serves as the weight for $D_j$ and can be estimated by EM or set manually

Inter-document structure
- Documents are no longer independent
- Source: https://wiki.digitalmethods.net/Dmi/WikipediaAnalysis

What do the links tell us?
- Anchor: rendered form vs. original form
- Anchor text: how others describe the page
  - E.g., "big blue" is a nickname of IBM, but it is never found on IBM's official web site
  - A good source for query expansion, or it can be put directly into the index
- Linkage relation: endorsement from others, i.e., the utility of the page
- Image: "PageRank-hi-res", licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:PageRank-hi-res.png#mediaviewer/File:PageRank-hi-res.png

Analogy to citation networks
- Authors cite others' work because:
  - It confers authority: they appreciate the intellectual value of that paper
  - There is a certain relationship between the papers
- Bibliometrics:
  - A citation is a vote for the usefulness of that paper
  - Citation count (e.g., # of in-links) indicates the quality of the paper

The situation becomes more complicated in the web environment
- Adding a hyperlink costs almost nothing, which web spammers take advantage of:
  - Large volumes of machine-generated pages artificially increase the "in-links" of a target page
  - Fake or invisible links
- We should consider not only the count of in-links but also the quality of each in-link:
  - PageRank
  - HITS

Link structure analysis
- Describes the characteristics of the network structure
- Reflects the utility of web documents in a general sense
- An important factor when ranking documents
  - For learning-to-rank
  - For focused crawling

Recall how we browse the internet
1. Mike types a URL address into his Chrome URL bar;
2. He browses the content of the page and follows a link he is interested in;
3. When he feels the current page is not interesting, or there is no link to follow, he types another URL and starts browsing from there;
4. He repeats steps 2 and 3 until he is tired or satisfied with this browsing activity.

PageRank
- A random-surfing model of the internet:
  1. A surfer begins at a random page on the web and starts a random walk on the graph.
  2. On the current page, the surfer uniformly follows an out-link to the next page.
  3. When there is no out-link, the surfer jumps uniformly to a page from the whole web.
  4. Steps 2 and 3 repeat forever.

PageRank
- A measure of web page popularity
  - The probability that a random surfer arrives at this web page
  - Depends only on the linkage structure of web pages
- Transition matrix for a 4-page example (d1 -> d3, d4; d2 has no out-links; d3 -> d2; d4 -> d1, d2):

  M = | 0    0    0.5  0.5 |
      | 0    0    0    0   |
      | 0    1    0    0   |
      | 0.5  0.5  0    0   |

- Random walk with random jumps:
  $p_t(d) = (1-\alpha)\,M^\top p_{t-1}(d) + \frac{\alpha}{N}\mathbf{1}$
  ($\alpha$: probability of a random jump; $N$: # of pages)

Theoretic model of PageRank
- Markov chains: a discrete-time stochastic process
  - It occurs in a series of time-steps, in each of which a random choice is made
  - Can be described by a directed graph or a transition matrix
- A first-order Markov chain for emotion, e.g., P(So-so | Cheerful) = 0.2:

  P = | 0.6  0.2  0.2 |
      | 0.3  0.4  0.3 |
      | 0    0.3  0.7 |

Markov chains
- Markov property (the idea behind random surfing):
  $P(X_{t+1}|X_1,\dots,X_t) = P(X_{t+1}|X_t)$, i.e., memoryless (first-order)
- Transition matrix
  - A stochastic matrix: $\forall i, \sum_j M_{ij} = 1$
  - Key property: it has a principal left eigenvector corresponding to its largest eigenvalue, which is one
  - This eigenvector is the mathematical interpretation of the PageRank score
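The random-surfing behavior above can be simulated directly. A minimal sketch, assuming an illustrative 4-page toy graph, alpha = 0.5 as the random-jump probability, and a hypothetical helper name `simulate_surfer` (none of these come from the slides):

```python
# Monte Carlo sketch of the random-surfer model: with probability alpha
# (or whenever the page is a dead end) jump to a uniformly random page,
# otherwise follow a uniformly chosen out-link.
import random

def simulate_surfer(out_links, n_pages, alpha=0.5, steps=200_000, seed=42):
    """Returns the fraction of steps spent on each page."""
    rng = random.Random(seed)
    visits = [0] * n_pages
    page = rng.randrange(n_pages)               # start at a random page
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < alpha or not out_links[page]:
            page = rng.randrange(n_pages)       # random jump (also on dead ends)
        else:
            page = rng.choice(out_links[page])  # follow a random out-link
    return [v / steps for v in visits]

# Toy graph: d1 -> d3, d4; d2 has no out-links; d3 -> d2; d4 -> d1, d2
out_links = [[2, 3], [], [1], [0, 1]]
freq = simulate_surfer(out_links, n_pages=4)
```

The visit fractions approximate the PageRank scores of the pages: d2, which receives the most in-links, accumulates the highest frequency.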
Theoretic model of PageRank
- Transition matrix of a Markov chain for PageRank, on the 4-page example (d2 is a dead end):

  A = | 0  0  1  1 |
      | 0  0  0  0 |
      | 0  1  0  0 |
      | 1  1  0  0 |

- 1. Enable random jumps on dead ends, and 2. normalize each row:

  A' = | 0     0     0.5   0.5  |
       | 0.25  0.25  0.25  0.25 |
       | 0     1     0     0    |
       | 0.5   0.5   0     0    |

- 3. Enable random jumps on all nodes (here α = 0.5):

  M = | 0.125  0.125  0.375  0.375 |
      | 0.25   0.25   0.25   0.25  |
      | 0.125  0.625  0.125  0.125 |
      | 0.375  0.375  0.125  0.125 |

Steps to derive the transition matrix for PageRank
1. If a row of A has no 1's, replace each element by 1/N.
2. Divide each 1 in A by the number of 1's in its row.
3. Multiply the resulting matrix by (1 - α).
4. Add α/N to every entry of the resulting matrix to obtain M.
(A: adjacency matrix of the network structure; α: damping factor)

PageRank computation becomes
- $p_t(d) = M^\top p_{t-1}(d)$, assuming $p_0(d) = (\frac{1}{N}, \dots, \frac{1}{N})$
- Iterative computation (forever?):
  $p_t(d) = M^\top p_{t-1}(d) = \dots = (M^\top)^t p_0(d)$
- Intuition: after enough rounds of random walk, each dimension of $p_t(d)$ indicates the frequency with which a random surfer visits document $d$
- Question: will this frequency converge to a certain fixed, steady-state quantity?
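The four matrix-construction steps and the iterative computation can be sketched end to end. A minimal sketch, assuming the illustrative 4-page adjacency matrix, alpha = 0.5, and a hypothetical helper name `pagerank` (the slides do not prescribe this code):

```python
# PageRank via power iteration, following the four matrix-construction
# steps: dead-end rows -> uniform, normalize rows, damp by (1 - alpha),
# add alpha/N everywhere, then iterate p_t = M^T p_{t-1}.

def pagerank(adj, alpha=0.5, iters=100):
    n = len(adj)
    # Steps 1-2: dead-end rows become 1/N; other rows are normalized.
    m = []
    for row in adj:
        s = sum(row)
        m.append([1.0 / n] * n if s == 0 else [x / s for x in row])
    # Steps 3-4: damp by (1 - alpha) and add alpha/N to every entry.
    m = [[(1 - alpha) * x + alpha / n for x in row] for row in m]
    # Power iteration starting from the uniform distribution.
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
        total = sum(p)                  # re-normalize in each iteration
        p = [x / total for x in p]
    return p

# d1 -> d3, d4; d2 is a dead end; d3 -> d2; d4 -> d1, d2
adj = [[0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0]]
scores = pagerank(adj)
```

For this graph and alpha = 0.5 the iteration settles on the stationary vector (2/9, 1/3, 2/9, 2/9): d2, the most-linked-to page, scores highest.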
Stationary distribution of a Markov chain
- For a given Markov chain with transition matrix M, its stationary distribution $\pi$ satisfies:
  - $\forall i, \pi_i \ge 0$ and $\sum_i \pi_i = 1$ (a probability vector)
  - $\pi M = \pi$ (a random-walk step does not change the distribution)
- Necessary conditions:
  - Irreducible: every state is reachable from any other state
  - Aperiodic: the states cannot be partitioned such that transitions happen periodically among the partitions

Markov chain for PageRank
- The random-jump operation makes PageRank satisfy the necessary conditions:
  1. Random jumps make every node reachable from all other nodes
  2. Random jumps break potential loops in a sub-network
- What does the PageRank score really converge to?

Stationary distribution of PageRank
- For any irreducible and aperiodic Markov chain, there is a unique steady-state probability vector $\pi$ such that, if $v(i,t)$ is the number of visits to state $i$ after $t$ steps, then
  $\lim_{t\to\infty} \frac{v(i,t)}{t} = \pi_i$
- The PageRank score converges to the expected visit frequency of each node

Computation of PageRank
- Power iteration: $p_t(d) = M^\top p_{t-1}(d) = \dots = (M^\top)^t p_0(d)$
  - Normalize $p_t(d)$ in each iteration
  - The convergence rate is determined by the second eigenvalue of M
  - The random walk becomes a series of matrix products
- Alternative interpretation of the PageRank score:
  - The principal left eigenvector corresponding to the largest eigenvalue, which is one: $\pi M = 1 \cdot \pi$

Computation of PageRank
- An example from Manning's textbook:

  P = | 1/6   2/3  1/6  |
      | 5/12  1/6  5/12 |
      | 1/6   2/3  1/6  |

Variants of PageRank
- Topic-specific PageRank
  - Controls the random jump so that it lands on topic-specific nodes
    - E.g., a surfer interested in Sports will only jump to Sports-related websites when there is no out-link to follow
  - $p_t(d) = [(1-\alpha)\,M^\top + \alpha\,\vec{p}_T\mathbf{1}^\top]\,p_{t-1}(d)$
    - $\vec{p}_T(d) > 0$ iff $d$ belongs to the topic of interest
    - $\mathbf{1}$ is a column vector of ones
- A user's interest is a mixture of topics; compute each topic-specific score $\pi_{t_i}$ offline, then
  $p = \sum_i P(t_i \mid user)\,\pi_{t_i}$
  - Example: the user's interest is 60% Sports, 40% Politics; damping factor: 10%
  (Manning, "Introduction to Information Retrieval", Chapter 21, Figure 21.5)

Variants of PageRank
- LexRank
  - A sentence is important if it is similar to other important sentences
  - PageRank on a sentence-similarity graph
  - Centrality-based sentence-salience ranking for document summarization (Erkan & Radev, JAIR'04)
- SimRank
  - Two objects are similar if they are referenced by similar objects
  - PageRank on a bipartite graph of object relations
  - Measures the similarity between objects via their connecting relations (Jeh & Widom, KDD'02)

HITS algorithm
- Two types of web pages for a broad-topic query:
  - Authorities: trustful sources of information
    - E.g., "UVa" -> the University of Virginia official site
  - Hubs: hand-crafted lists of links to authority pages for a specific topic
    - E.g., "deep learning" -> a deep-learning reading list
- Intuition (HITS = Hyperlink-Induced Topic Search): use hub pages to discover authority pages
- Assumptions:
  - A good hub page points to many good authorities -> a hub score
  - A good authority page is pointed to by many good hub pages -> an authority score
- The recursive definition suggests an iterative algorithm

Computation of HITS scores
- Two scores for a web page for a given query ($v \to i$ means there is a link from $v$ to $i$):
  - Authority score: $a(i) \leftarrow \sum_{v \to i} h(v)$
  - Hub score: $h(i) \leftarrow \sum_{i \to v} a(v)$
- Important: HITS scores are query-dependent!
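The mutually recursive hub/authority updates can be sketched as an iteration with L2 normalization after each round. A minimal sketch, assuming an illustrative 4-page adjacency matrix standing in for a query's base set and a hypothetical helper name `hits`:

```python
# Iterative HITS: a(i) <- sum of h(v) over in-links (a proportional to A^T h),
# h(i) <- sum of a(v) over out-links (h proportional to A a), then L2-normalize.
import math

def hits(adj, iters=50):
    """adj[u][v] = 1 if page u links to page v; returns (authority, hub)."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a = [sum(h[v] for v in range(n) if adj[v][i]) for i in range(n)]
        h = [sum(a[v] for v in range(n) if adj[i][v]) for i in range(n)]
        na = math.sqrt(sum(x * x for x in a)) or 1.0   # guard all-zero vectors
        nh = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

# d1 and d4 each point at d2 and d3: d1, d4 act as hubs; d2, d3 as authorities
adj = [[0, 1, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 1, 1, 0]]
auth, hub = hits(adj)
```

On this toy graph the pure hubs (d1, d4) end with zero authority and the pure authorities (d2, d3) with zero hub score, illustrating how the two roles separate.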
- Both score vectors are rescaled with proper normalization (L2-norm) after each update

Computation of HITS scores
- In matrix form: $a \propto A^\top h$ and $h \propto A a$
- That is, $a \propto A^\top A\,a$ and $h \propto A A^\top h$: another eigen-system
  $a = \frac{1}{\lambda_a} A^\top A\,a, \qquad h = \frac{1}{\lambda_h} A A^\top h$
- Power iteration is applicable here as well

Constructing the adjacency matrix
- Only consider a subset of the Web:
  1. For a given query, retrieve all the documents containing the query (or the top K documents in a ranked list): the root set
  2. Expand the root set by adding pages that either link to a page in the root set or are linked to by a page in the root set: the base set
  3. Build the adjacency matrix of the pages in the base set

Constructing the adjacency matrix
- Reasons behind the construction steps:
  - Reduce the computation cost
  - A good authority page may not contain the query text
  - The expansion of the root set may introduce good hubs and authorities into the sub-network

Sample results
(Kleinberg, JACM'99; Manning, "Introduction to Information Retrieval", Chapter 21, Figure 21.6)

Today's reading
- Introduction to Information Retrieval, Chapter 21: Link Analysis

References
- Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999).
- Haveliwala, Taher H. "Topic-sensitive PageRank." In Proceedings of the 11th International Conference on World Wide Web, pp. 517-526. ACM, 2002.
- Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (JAIR) 22, no. 1 (2004): 457-479.
- Jeh, Glen, and Jennifer Widom. "SimRank: A measure of structural-context similarity." In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538-543. ACM, 2002.
- Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46, no. 5 (1999): 604-632.