
Link Analysis
Hongning Wang
CS@UVa
Recap: formula for Rocchio feedback
• Standard operation in vector space

$$\vec{q}_m = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\forall \vec{d}_i \in D_r}\vec{d}_i - \frac{\gamma}{|D_n|}\sum_{\forall \vec{d}_j \in D_n}\vec{d}_j$$

– $\vec{q}_m$: modified query; $\vec{q}$: original query; $D_r$: relevant docs; $D_n$: non-relevant docs; $\alpha, \beta, \gamma$: parameters

CS@UVa
CS4501: Information Retrieval
Recap: click models
• Decompose relevance-driven clicks from position-driven clicks
– Examine: user reads the displayed result
– Click: user clicks on the displayed result
– Atomic unit: (query, doc)

[Figure: click probability, examine probability, and relevance quality by position for (q,d1) through (q,d4)]
Structured vs. unstructured data
• Our claim before
– IR vs. DB = unstructured data vs. structured data
• As a result, we have assumed
– Document = a sequence of words
– Query = a short document
– Corpus = a set of documents

However, this assumption is not accurate…
A typical web document has
• Title
• Anchor
• Body
How does a human perceive a document's structure?
Intra-document structures
• Title – concise summary of the document
• Paragraph 1 – likely to be an abstract of the document
• Paragraph 2, …
• Images – visual description of the document
• Anchor texts – references to other documents

These parts might contribute differently to a document's relevance!
Exploring intra-document structures for retrieval
• Intuitively, we want to give different weights to the parts (title, paragraphs, anchor texts) to reflect their importance
– In vector space model? Weighted TF
– Think about the query-likelihood model… select a part $D_j$ and generate a query word using $D_j$:

$$p(Q|D,R) = \prod_{i=1}^{n} p(w_i|D,R) = \prod_{i=1}^{n}\sum_{j=1}^{k} s(D_j|D,R)\,p(w_i|D_j,R)$$

– The "part selection" probability $s(D_j|D,R)$ serves as the weight for part $D_j$; it can be estimated by EM or manually set
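The mixture over document parts can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's exact estimator: the function name, the part names, and the additive smoothing scheme are all assumptions made here.

```python
from collections import Counter

def part_weighted_query_likelihood(query, parts, weights, smoothing=0.1):
    """Score a query under a mixture of part language models:
    p(Q|D) = prod_i sum_j s(D_j|D) * p(w_i|D_j).
    `parts` maps part name -> token list; `weights` maps part name -> s(D_j|D)."""
    # One unigram language model per part, with simple additive smoothing
    # over the document's combined vocabulary plus the query terms.
    vocab = {w for tokens in parts.values() for w in tokens} | set(query)
    models = {}
    for name, tokens in parts.items():
        counts = Counter(tokens)
        total = len(tokens) + smoothing * len(vocab)
        models[name] = {w: (counts[w] + smoothing) / total for w in vocab}
    score = 1.0
    for w in query:
        # Mixture: each query word is generated by first selecting a part.
        score *= sum(weights[name] * models[name][w] for name in parts)
    return score

doc = {"title": ["link", "analysis"],
       "body": ["pagerank", "scores", "web", "pages", "by", "link", "structure"]}
weights = {"title": 0.7, "body": 0.3}  # manually set part weights
print(part_weighted_query_likelihood(["link", "analysis"], doc, weights))
```

Because the title matches both query words, shifting weight toward the title raises the query likelihood, which is exactly the intended effect of part weighting.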
Inter-document structure
• Documents are no longer independent

Source: https://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
What do the links tell us?
• Anchor
– Rendered form
– Original form
What do the links tell us?
• Anchor text
– How others describe the page
• E.g., "big blue" is a nickname of IBM, but never found on IBM's official web site
– A good source for query expansion, or can be directly put into the index
What do the links tell us?
• Linkage relation
– Endorsement from others – utility of the page

"PageRank-hi-res". Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:PageRank-hi-res.png#mediaviewer/File:PageRank-hi-res.png
Analogy to citation network
• Authors cite others' work because
– It is a conferral of authority
• They appreciate the intellectual value in that paper
– There is a certain relationship between the papers
• Bibliometrics
– A citation is a vote for the usefulness of that paper
– Citation count indicates the quality of the paper
• E.g., # of in-links
Situation becomes more complicated in the web environment
• Adding a hyperlink costs almost nothing
– Taken advantage of by web spammers
• Large volumes of machine-generated pages artificially increase the "in-links" of a target page
• Fake or invisible links
• We should consider not only the count of in-links, but also the quality of each in-link
– PageRank
– HITS
Link structure analysis
• Describes the characteristics of the network structure
• Reflects the utility of web documents in a general sense
• An important factor when ranking documents
– For learning-to-rank
– For focused crawling
Recall how we do internet browsing
1. Mike types a URL address in his Chrome's URL bar;
2. He browses the content of the page, and follows the link he is interested in;
3. When he feels the current page is not interesting or there is no link to follow, he types another URL and starts browsing from there;
4. He repeats 2 and 3 until he is tired or satisfied with this browsing activity.
PageRank
• A random surfing model of the internet
1. A surfer begins at a random page on the web and starts a random walk on the graph
2. On the current page, the surfer uniformly follows an out-link to the next page
3. When there is no out-link, the surfer uniformly jumps to a page from the whole web
4. Keep doing steps 2 and 3 forever
PageRank
• A measure of web page popularity
– The probability that a random surfer arrives at this web page
– Only depends on the linkage structure of web pages

Transition matrix for the example graph (d1→{d3, d4}, d3→d2, d4→{d1, d2}; d2 has no out-link):

$$M = \begin{pmatrix} 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \end{pmatrix}$$

Random walk with random jump (α: probability of random jump; N: # of pages):

$$p_t(d) = (1-\alpha)\,M^T p_{t-1}(d) + \frac{\alpha}{N}\mathbf{1}$$
Theoretic model of PageRank
• Markov chains
– A discrete-time stochastic process
• It occurs in a series of time-steps, in each of which a random choice is made
– Can be described by a directed graph or a transition matrix

A first-order Markov chain for emotion, e.g., P(So-so|Cheerful) = 0.2:

$$M = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$
Markov chains
• Markov property – the idea of random surfing
– $P(X_{n+1}|X_1,\dots,X_n) = P(X_{n+1}|X_n)$
• Memoryless (first-order)
• Transition matrix
– A stochastic matrix: $\forall i,\ \sum_j M_{ij} = 1$

$$M = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$

– Key property: it has a principal left eigenvector corresponding to its largest eigenvalue, which is one
• Mathematical interpretation of the PageRank score
Theoretic model of PageRank
• Transition matrix of a Markov chain for PageRank, for the example graph (d1→{d3, d4}, d3→d2, d4→{d1, d2}; d2 is a dead end)

Adjacency matrix:

$$A = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

1. Enable random jump on the dead end, and 2. normalize each row:

$$A' = \begin{pmatrix} 0 & 0 & 0.5 & 0.5 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0 & 1 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \end{pmatrix}$$

3. Enable random jump on all nodes (α = 0.5):

$$M = A'' = \begin{pmatrix} 0.125 & 0.125 & 0.375 & 0.375 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.125 & 0.625 & 0.125 & 0.125 \\ 0.375 & 0.375 & 0.125 & 0.125 \end{pmatrix}$$
Steps to derive the transition matrix for PageRank
1. If a row of A has no 1's, replace each element by 1/N.
2. Divide each 1 in A by the number of 1's in its row.
3. Multiply the resulting matrix by 1 − α.
4. Add α/N to every entry of the resulting matrix, to obtain M.

A: adjacency matrix of the network structure; α: damping factor
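The four steps can be sketched directly in NumPy. This is a minimal version under the assumptions above; `pagerank_transition_matrix` is a name chosen here, and the 4-node graph is the deck's running example.

```python
import numpy as np

def pagerank_transition_matrix(A, alpha):
    """Build the PageRank transition matrix M from a 0/1 adjacency
    matrix A, following the four steps on the slide."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    # Step 1: dead-end rows (no out-links) jump uniformly to every page.
    dead = A.sum(axis=1) == 0
    A[dead] = 1.0 / N
    # Step 2: divide each entry by its row sum (= number of out-links).
    A = A / A.sum(axis=1, keepdims=True)
    # Steps 3-4: mix link-following with a uniform random jump.
    return (1 - alpha) * A + alpha / N

# The 4-node example graph, with alpha = 0.5:
A = [[0, 0, 1, 1],
     [0, 0, 0, 0],   # d2 is a dead end
     [0, 1, 0, 0],
     [1, 1, 0, 0]]
M = pagerank_transition_matrix(A, alpha=0.5)
print(M)
```

Each row of the resulting M sums to one, so M is a valid stochastic matrix for the random surfer.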
PageRank computation becomes
• $p_t(d) = M^T p_{t-1}(d)$
– Assuming $p_0(d) = (\frac{1}{N}, \dots, \frac{1}{N})$
– Iterative computation (forever?)
• $p_t(d) = M^T p_{t-1}(d) = \dots = (M^T)^t\, p_0(d)$
– Intuition: after enough rounds of random walk, each dimension of $p_t(d)$ indicates the frequency of a random surfer visiting document $d$
– Question: will this frequency converge to a certain fixed, steady-state quantity?
Stationary distribution of a Markov chain
• For a given Markov chain with transition matrix M, its stationary distribution π satisfies
– $\forall i \in S,\ \pi_i \geq 0$ and $\sum_{i \in S} \pi_i = 1$ (a probability vector)
– $\pi = M^T \pi$ (random walk does not affect its distribution)
• Necessary conditions
– Irreducible: a state is reachable from any other state
– Aperiodic: states cannot be partitioned such that transitions happen periodically among the partitions
Markov chain for PageRank
• The random jump operation makes PageRank satisfy the necessary conditions
1. Random jump makes every node reachable from every other node
2. Random jump breaks potential periodic loops in sub-networks
• What does the PageRank score really converge to?
Stationary distribution of PageRank
• For any irreducible and aperiodic Markov chain, there is a unique steady-state probability vector π, such that if $c(i,t)$ is the number of visits to state $i$ after $t$ steps, then

$$\lim_{t\to\infty} \frac{c(i,t)}{t} = \pi_i$$

– The PageRank score converges to the expected visit frequency of each node
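A quick simulation (illustrative, not from the slides) shows the visit frequencies approaching the stationary distribution. The matrix below is the full transition matrix, with α = 0.5 random jump, of the 4-node example graph.

```python
import random
import numpy as np

# Full transition matrix (links + random jump) for the 4-node example.
M = np.array([[0.125, 0.125, 0.375, 0.375],
              [0.25,  0.25,  0.25,  0.25 ],
              [0.125, 0.625, 0.125, 0.125],
              [0.375, 0.375, 0.125, 0.125]])

def visit_frequencies(M, steps, seed=0):
    """Simulate a random surfer and return c(i, t) / t for each state."""
    rng = random.Random(seed)
    N = M.shape[0]
    counts = np.zeros(N)
    state = rng.randrange(N)
    for _ in range(steps):
        # Move to the next state according to the current state's row.
        state = rng.choices(range(N), weights=M[state])[0]
        counts[state] += 1
    return counts / steps

freq = visit_frequencies(M, steps=200_000)
# The empirical frequencies approach the stationary distribution pi = M^T pi,
# i.e., the eigenvector of M^T for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(M.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()
print(freq, pi)
```

With enough steps the two vectors agree to within sampling noise, which is exactly the limit statement above.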
Computation of PageRank
• Power iteration
– $p_t(d) = M^T p_{t-1}(d) = \dots = (M^T)^t\, p_0(d)$
• Normalize $p_t(d)$ in each iteration
• The convergence rate is determined by the second eigenvalue
– The random walk becomes a series of matrix multiplications
– Alternative interpretation of the PageRank score
• The principal left eigenvector of M, corresponding to its largest eigenvalue, which is one:

$$M^T \pi = 1 \times \pi$$
Computation of PageRank
• An example from Manning's textbook

$$M = \begin{pmatrix} 1/6 & 2/3 & 1/6 \\ 5/12 & 1/6 & 5/12 \\ 1/6 & 2/3 & 1/6 \end{pmatrix}$$
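Power iteration on this matrix can be checked numerically. A minimal sketch; the function name and tolerances are choices made here, not part of the textbook example.

```python
import numpy as np

def pagerank_power_iteration(M, tol=1e-10, max_iter=1000):
    """Power iteration: repeatedly apply M^T to an initially uniform
    distribution until it stops changing."""
    N = M.shape[0]
    p = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        p_next = M.T @ p
        p_next /= p_next.sum()   # normalize in each iteration
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next
    return p

# The example transition matrix above:
M = np.array([[1/6, 2/3, 1/6],
              [5/12, 1/6, 5/12],
              [1/6, 2/3, 1/6]])
pi = pagerank_power_iteration(M)
print(pi)  # converges to the stationary distribution (5/18, 4/9, 5/18)
```

Solving $\pi = M^T \pi$ by hand gives $\pi = (5/18,\ 4/9,\ 5/18)$, and the iteration reaches it quickly because the second eigenvalue of this M has magnitude 1/2.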
Variants of PageRank
• Topic-specific PageRank
– Direct the random jump to topic-specific nodes
• E.g., a surfer interested in Sports will only randomly jump to Sports-related websites when there is no out-link to follow
– $p_t(d) = [(1-\alpha) M^T + \alpha\, p_T e^T]\, p_{t-1}(d)$
• $p_T(d) > 0$ iff $d$ belongs to the topic of interest
• $e$ is a column vector of ones
Variants of PageRank
• Topic-specific PageRank
– A user's interest is a mixture of topics (e.g., 60% Sports, 40% Politics; damping factor: 10%)
– The topic-specific PageRank vectors $\pi_{T_k}$ can be computed off-line:

$$\pi = \sum_k p(T_k|user)\, \pi_{T_k}$$

Manning, "Introduction to Information Retrieval", Chapter 21, Figure 21.5
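The mixture idea can be sketched as follows. Everything here is illustrative: the toy graph, the node-topic assignments, the jump probability, and the function name are all assumptions, not the textbook's figure.

```python
import numpy as np

def topic_pagerank(M_link, topic_nodes, alpha=0.1, iters=200):
    """Topic-specific PageRank: random jumps land only on nodes of the
    topic of interest. M_link is a row-stochastic link-following matrix;
    alpha is the random-jump probability."""
    N = M_link.shape[0]
    p_T = np.zeros(N)
    p_T[list(topic_nodes)] = 1.0 / len(topic_nodes)  # uniform over the topic
    p = np.full(N, 1.0 / N)
    for _ in range(iters):
        # p_T e^T p = p_T since p sums to 1, so the update simplifies to:
        p = (1 - alpha) * (M_link.T @ p) + alpha * p_T
    return p

# A tiny link graph; nodes 0-1 are "Sports", nodes 2-3 are "Politics".
M_link = np.array([[0, 1, 0, 0],
                   [0.5, 0, 0.5, 0],
                   [0, 0, 0, 1],
                   [0.5, 0, 0.5, 0]], dtype=float)
pi_sports = topic_pagerank(M_link, topic_nodes=[0, 1])
pi_politics = topic_pagerank(M_link, topic_nodes=[2, 3])
# A 60% Sports / 40% Politics user mixes the off-line vectors:
pi_user = 0.6 * pi_sports + 0.4 * pi_politics
print(pi_user)
```

The per-topic vectors are computed off-line once; at query time only the cheap mixture is evaluated, which is the point of this variant.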
Variants of PageRank
• LexRank
– A sentence is important if it is similar to other important sentences
– PageRank on a sentence similarity graph
– Centrality-based sentence salience ranking for document summarization

Erkan & Radev, JAIR'04
Variants of PageRank
• SimRank
– Two objects are similar if they are referenced by similar objects
– PageRank on a bipartite graph of object relations
– Measures similarity between objects via their connecting relations

Jeh & Widom, KDD'02
HITS algorithm
• Two types of web pages for a broad-topic query
– Authorities – trustworthy sources of information
• UVa -> University of Virginia official site
– Hubs – hand-crafted lists of links to authority pages for a specific topic
• Deep learning -> deep learning reading list
HITS algorithm
• Intuition (HITS = Hyperlink-Induced Topic Search)
– Use hub pages to discover authority pages
• Assumption
– A good hub page is one that points to many good authorities -> a hub score
– A good authority page is one that is pointed to by many good hub pages -> an authority score
• The recursive definition indicates an iterative algorithm
Computation of HITS scores
• Two scores for a web page for a given query
– Authority score: $a(d)$
– Hub score: $h(d)$

$$a(d) \leftarrow \sum_{v \to d} h(v) \qquad h(d) \leftarrow \sum_{d \to v} a(v)$$

where $v \to d$ means there is a link from $v$ to $d$, with proper normalization (L2-norm) after each update.

Important: HITS scores are query-dependent!
Computation of HITS scores
• In matrix form
– $a \leftarrow A^T h$ and $h \leftarrow A a$
– That is, $a \leftarrow A^T A\, a$ and $h \leftarrow A A^T h$
– Another eigen-system:

$$a = \frac{1}{\lambda_a} A^T A\, a \qquad h = \frac{1}{\lambda_h} A A^T h$$

– Power iteration is applicable here as well
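These update rules give a very short power-iteration implementation. A minimal sketch; the toy adjacency matrix below is an assumed example, not from the slides.

```python
import numpy as np

def hits(A, iters=100):
    """HITS by power iteration: a <- A^T h, h <- A a, with L2
    normalization after each update. A is the adjacency matrix of the
    base set (A[i, j] = 1 iff page i links to page j)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    a = np.ones(N)
    h = np.ones(N)
    for _ in range(iters):
        a = A.T @ h          # authority: sum of hub scores of in-links
        a /= np.linalg.norm(a)
        h = A @ a            # hub: sum of authority scores of out-links
        h /= np.linalg.norm(h)
    return a, h

# Toy base set: node 0 links to 1, 2, 3 (a hub); node 1 also links to 3.
A = [[0, 1, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(A)
print("authority:", a)
print("hub:", h)
```

In this toy graph, node 0 ends up with the top hub score and node 3 (pointed to by both hubs) with the top authority score, matching the intuition on the previous slide.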
Constructing the adjacency matrix
• Only consider a subset of the Web
1. For a given query, retrieve all the documents containing the query (or the top K documents in a ranked list) – the root set
2. Expand the root set by adding pages either linking to a page in the root set, or being linked to by a page in the root set – the base set
3. Build the adjacency matrix of the pages in the base set
Constructing the adjacency matrix
• Reasons behind the construction steps
– Reduce the computation cost
– A good authority page may not contain the query text
– The expansion of the root set might introduce good hubs and authorities into the sub-network
Sample results

Kleinberg, JACM'99; Manning, "Introduction to Information Retrieval", Chapter 21, Figure 21.6
Today's reading
• Introduction to Information Retrieval
– Chapter 21: Link Analysis
References
• Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999)
• Haveliwala, Taher H. "Topic-sensitive PageRank." In Proceedings of the 11th International Conference on World Wide Web, pp. 517-526. ACM, 2002.
• Erkan, Güneş, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (JAIR) 22, no. 1 (2004): 457-479.
• Jeh, Glen, and Jennifer Widom. "SimRank: A measure of structural-context similarity." In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538-543. ACM, 2002.
• Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46, no. 5 (1999): 604-632.