Search Engine Technology

Information Retrieval
(11)
Prof. Dragomir R. Radev
[email protected]
IR Winter 2010
…
17. continued
…
[Slide from Reka Albert]
The strength of weak ties
• Granovetter’s study: finding jobs
• Weak ties: more people can be reached
through weak ties than strong ties (e.g.,
through your 7th and 8th best friends)
• More here:
http://en.wikipedia.org/wiki/Weak_tie
Prestige and centrality
• Degree centrality: how many neighbors each node
has.
• Closeness centrality: how close a node is to all of the
other nodes
• Betweenness centrality: based on the role that a
node plays by virtue of being on the path between
two other nodes
• Eigenvector centrality: a node’s centrality is the weighted sum of its
neighbors’ centralities, so random-walk paths through central nodes
count for more.
• Prestige = same as centrality but for directed graphs.
…
18. Graph-based methods
Harmonic functions
Random walks
PageRank
…
Random walks and harmonic functions
• Drunkard’s walk:
– Start at some position x on a line marked 0, 1, 2, 3, 4, 5
• What is the prob. of reaching 5 before reaching 0?
• Harmonic functions:
– P(0) = 0
– P(N) = 1
– P(x) = ½P(x−1) + ½P(x+1), for 0 < x < N
– (in general, replace ½ with the bias in the walk)
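The boundary conditions pin down the fair-walk answer in closed form: P is harmonic, hence linear in x, so P(x) = x/N. A minimal Python sketch (function names are mine) checks the closed form against simulation:

```python
import random

def reach_prob_exact(x, N):
    # Harmonic solution for the unbiased walk: P(0)=0, P(N)=1,
    # P(x) = (P(x-1) + P(x+1)) / 2  =>  P is linear, so P(x) = x/N.
    return x / N

def reach_prob_mc(x, N, trials=20000, seed=0):
    # Estimate P(reach N before 0) by simulating drunkard's walks.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        pos = x
        while 0 < pos < N:
            pos += rng.choice((-1, 1))
        wins += (pos == N)
    return wins / trials
```

For a biased walk, replace `rng.choice((-1, 1))` with a coin of the appropriate bias; the solution is then no longer linear.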
The original Dirichlet problem
• Distribution of temperature in a sheet of metal.
• One end of the sheet has temperature t=0, the other end: t=1.
• Laplace’s differential equation:
∇²u = u_xx + u_yy = 0
• This is a special (steady-state) case of the (transient) heat equation:
k∇²u = u_t
• In general, the solutions to this equation are called harmonic functions.
Learning harmonic functions
• The method of relaxations
– Discrete approximation.
– Assign fixed values to the boundary points.
– Assign arbitrary values to all other points.
– Adjust their values to be the average of their neighbors.
– Repeat until convergence.
• Monte Carlo method
– Perform a random walk on the discrete representation.
– Compute f as the probability of a random walk ending in a particular fixed point.
• Eigenvector methods
– Look at the stationary distribution of a random walk.
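The method of relaxations can be illustrated on the drunkard’s-walk line: fix the boundary values 0 and 1 and average interior points until convergence (a sketch; the function name is mine):

```python
def relax_line(N=5, tol=1e-12):
    # Method of relaxations on a path 0..N with boundary values
    # f(0) = 0 and f(N) = 1: repeatedly replace each interior value
    # by the average of its two neighbors until nothing changes.
    f = [0.0] * (N + 1)
    f[N] = 1.0
    delta = 1.0
    while delta > tol:
        delta = 0.0
        for x in range(1, N):
            new = 0.5 * (f[x - 1] + f[x + 1])
            delta = max(delta, abs(new - f[x]))
            f[x] = new
    return f
```

The result converges to the linear harmonic function f(x) = x/N, matching the closed-form answer to the drunkard’s walk.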
Eigenvectors and eigenvalues
• An eigenvector is an implicit “direction” for a matrix:
Av = λv
where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle.
• Computing eigenvalues:
det(A − λI) = 0
Eigenvectors and eigenvalues
• Example:
A = [ −1  3 ]
    [  2  0 ]
A − λI = [ −1−λ   3 ]
         [  2    −λ ]
• det(A − λI) = (−1−λ)·(−λ) − 3·2 = 0
• Then: λ² + λ − 6 = 0; λ₁ = 2; λ₂ = −3
• For λ₁ = 2:
[ −3   3 ] [x₁]   [0]
[  2  −2 ] [x₂] = [0]
• Solutions: x₁ = x₂
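The example is easy to check numerically; this sketch (helper name is mine) solves the characteristic polynomial of a 2×2 matrix with the quadratic formula:

```python
import math

def eig2x2(a, b, c, d):
    # Eigenvalues of [[a, b], [c, d]] from the characteristic
    # polynomial: lam**2 - (a + d)*lam + (a*d - b*c) = 0.
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)  # assumes real eigenvalues
    return (tr + disc) / 2, (tr - disc) / 2

# The slide's matrix A = [[-1, 3], [2, 0]]:
lam1, lam2 = eig2x2(-1, 3, 2, 0)  # -> 2.0, -3.0
```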
Stochastic matrices
• Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0. Example:
A = [ 3/8  5/8 ]
    [ 1/4  3/4 ]
• The largest eigenvalue of a stochastic matrix E is real: λ₁ = 1.
• For λ₁, the left (principal) eigenvector is p; the right eigenvector is 1, the all-ones vector.
• In other words, Eᵀp = p.
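For the 2×2 example, the left principal eigenvector can be found by power iteration (a minimal sketch; the function name is mine). The iteration converges quickly here because the second eigenvalue of A is 1/8:

```python
def left_eigvec(A, iters=200):
    # Power iteration for the principal left eigenvector of a
    # row-stochastic matrix: repeatedly apply p <- p A.
    n = len(A)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * A[i][j] for i in range(n)) for j in range(n)]
    return p

A = [[3/8, 5/8], [1/4, 3/4]]
p = left_eigvec(A)  # converges to (2/7, 5/7)
```

Because each step multiplies by a row-stochastic matrix, the entries of p keep summing to 1, so no renormalization is needed.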
Electrical networks and random walks
• Ergodic (connected) Markov chain with transition matrix P, where
P_xy = C_xy / C_x,   C_x = Σ_y C_xy,   C_xy = 1/R_xy
• [Figure: a four-node circuit on a, b, c, d with resistors a–c 1Ω, a–d 1Ω, b–c 1Ω, b–d 0.5Ω, c–d 0.5Ω]
• For this network (node order a, b, c, d):
P = [  0    0   1/2  1/2 ]
    [  0    0   1/3  2/3 ]
    [ 1/4  1/4   0   1/2 ]
    [ 1/5  2/5  2/5   0  ]
• Stationary distribution (wᵀ = wᵀP):
w = ( 2/14, 3/14, 4/14, 5/14 )ᵀ
From Doyle and Snell 2000
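The construction can be sketched in a few lines: build P from edge conductances (C = 1/R) and check that w ∝ C_x is stationary. The conductance values below are read off the example network (C_ac = C_ad = C_bc = 1, C_bd = C_cd = 2):

```python
# Transition probabilities from conductances: P[x][y] = C_xy / C_x,
# and the stationary distribution is w[x] = C_x / sum_x C_x.
C = {('a', 'c'): 1.0, ('a', 'd'): 1.0, ('b', 'c'): 1.0,
     ('b', 'd'): 2.0, ('c', 'd'): 2.0}
nodes = ['a', 'b', 'c', 'd']

def cond(x, y):
    # Conductance of the undirected edge {x, y} (0 if absent).
    return C.get((x, y), C.get((y, x), 0.0))

Cx = {x: sum(cond(x, y) for y in nodes) for x in nodes}
P = {x: {y: cond(x, y) / Cx[x] for y in nodes} for x in nodes}
w = {x: Cx[x] / sum(Cx.values()) for x in nodes}
# w is stationary: sum_x w[x] * P[x][y] == w[y] for every y.
```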
Electrical networks and random walks
• Attach a 1V battery across a and b, so v_a = 1 and v_b = 0. The current along each edge is
i_xy = (v_x − v_y) / R_xy = (v_x − v_y) C_xy
• Kirchhoff’s current law, Σ_y i_xy = 0 at each interior node x, gives
v_x = Σ_y (C_xy / C_x) v_y = Σ_y P_xy v_y
so the voltages are harmonic with boundary values v_a = 1, v_b = 0.
• For the example network:
v_c = 1/4 + (1/2)v_d    v_d = 1/5 + (2/5)v_c
⇒ v_c = 7/16, v_d = 3/8
• v_x is the probability that a random walk starting at x will reach a before reaching b.
• The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits.
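The voltages can be recovered by the method of relaxations: fix v_a = 1 and v_b = 0 and repeatedly replace each interior voltage by the conductance-weighted average of its neighbors. A sketch, assuming the example network’s conductances (C_ac = C_ad = C_bc = 1, C_bd = C_cd = 2):

```python
C = {('a', 'c'): 1.0, ('a', 'd'): 1.0, ('b', 'c'): 1.0,
     ('b', 'd'): 2.0, ('c', 'd'): 2.0}
nodes = ['a', 'b', 'c', 'd']

def cond(x, y):
    # Conductance of the undirected edge {x, y} (0 if absent).
    return C.get((x, y), C.get((y, x), 0.0))

v = {'a': 1.0, 'b': 0.0, 'c': 0.5, 'd': 0.5}  # arbitrary interior start
for _ in range(200):          # relax the interior nodes c and d
    for x in ('c', 'd'):
        Cx = sum(cond(x, y) for y in nodes)
        v[x] = sum(cond(x, y) / Cx * v[y] for y in nodes)
# v['c'] -> 7/16, v['d'] -> 3/8
```

By the probabilistic reading, a walker starting at c reaches a before b with probability 7/16.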
Markov chains
• A homogeneous Markov chain is defined by an initial distribution x₀ and a Markov kernel E.
• Path = sequence (x₀, x₁, …, xₙ), where
x_i = x_{i−1}E
• The probability of a path can be computed as a product of probabilities for each step i.
• Random walk = find x_j given x₀, E, and j.
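The path probability is the initial probability times one kernel entry per transition; a minimal sketch with a hypothetical two-state kernel (state names and values are mine):

```python
def path_prob(E, start_dist, path):
    # Probability of observing a state sequence under a homogeneous
    # Markov chain: P(x0) * prod_i E[x_{i-1}][x_i].
    p = start_dist[path[0]]
    for s, t in zip(path, path[1:]):
        p *= E[s][t]
    return p

# Hypothetical kernel: stay in 'x' with prob 0.9, leave with 0.1, etc.
E = {'x': {'x': 0.9, 'y': 0.1}, 'y': {'x': 0.5, 'y': 0.5}}
x0 = {'x': 1.0, 'y': 0.0}
```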
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and
Stirzaker 1989] says that the Markov chain with kernel E has a
stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic
• To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use “teleportation”
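The teleportation trick can be sketched as follows: mixing a uniform jump with weight (1 − d) into the link-following kernel makes the chain irreducible and aperiodic, so power iteration converges to a unique stationary distribution. The toy graph and damping d = 0.85 below are illustrative, and dangling nodes are not handled:

```python
def pagerank(links, d=0.85, iters=100):
    # Power iteration with "teleportation": with probability d follow
    # a uniformly chosen outlink; otherwise jump to a random node.
    nodes = sorted(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = links[u]
            for v in out:
                new[v] += d * pr[u] / len(out)
        pr = new
    return pr

# Hypothetical 3-page web: 1 -> 2, 2 -> 1 and 3, 3 -> 1.
links = {1: [2], 2: [1, 3], 3: [1]}
pr = pagerank(links)
```

Page 1 ends up with the highest score: it receives all of page 3’s mass and half of page 2’s, while page 3 receives only half of page 2’s.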
Example
[Figure: an 8-node graph E and bar charts of the PageRank of nodes 1–8 at t=0 and t=1]
This graph E has a second graph E’ (not drawn) superimposed on it:
E’ is the uniform transition graph.
Eigenvectors
• An eigenvector is an implicit “direction” for a
matrix.
Ev = λv, where v is non-zero, though λ can be any
complex number in principle.
• The largest eigenvalue of a stochastic matrix E
is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p, the
right eigenvector = 1
• In other words, ETp = p.
Computing the stationary distribution
• Solution for the stationary distribution:
pᵀE = pᵀ, i.e., (I − Eᵀ)p = 0
• Power method (convergence rate is O(m)):
function PowerStatDist (E):
begin
  p(0) = u; (or p(0) = [1,0,…,0])
  i = 1;
  repeat
    p(i) = Eᵀp(i−1);
    L = ||p(i) − p(i−1)||₁;
    i = i + 1;
  until L < ε
  return p(i)
end
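A direct Python translation of the PowerStatDist pseudocode, with the uniform start vector and ε as a parameter:

```python
def power_stat_dist(E, eps=1e-10):
    # Power method for the stationary distribution: iterate
    # p <- E^T p until the L1 change drops below eps.
    n = len(E)
    p = [1.0 / n] * n                      # p(0) = u
    while True:
        q = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(q, p)) < eps:
            return q
        p = q

# The row-stochastic example matrix from the earlier slide:
p = power_stat_dist([[3/8, 5/8], [1/4, 3/4]])
```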
Example
[Figure: bar charts of the PageRank of nodes 1–8 at t=0, t=1, and t=10]
PageRank
• Developed at Stanford and allegedly still being used at
Google.
• Not query-specific, although query-specific varieties
exist.
• In general, each page is indexed along with the anchor
texts pointing to it.
• Among the pages that match the user’s query, Google
shows the ones with the largest PageRank.
• Google also uses vector-space matching, keyword
proximity, anchor text, etc.
…
19. Hubs and authorities
Bipartite graphs
HITS and SALSA
Models of the web
…
HITS
• Hyperlink-induced topic search.
• Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine.
• HITS is query-specific.
• Hubs and authorities, e.g. collections of bookmarks about cars vs. actual sites about cars.
[Figure: a hub page such as Car and Driver linking to authority pages Honda, Ford, VW]
HITS
• Each node in the graph is ranked for hubness (h) and authoritativeness (a).
• Some nodes may have high scores on both.
• Example authorities for the query “java”:
– www.gamelan.com
– java.sun.com
– digitalfocus.com/digitalfocus/… (The Java developer)
– lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html
– sunsite.unc.edu/javafaq/javafaq.html
HITS
• HITS algorithm:
– obtain root set (using a search engine) related to the input query
– expand the root set by radius one on either side (typically to size 1000–5000)
– run iterations on the hub and authority scores together
– report top-ranking authorities and hubs
• Eigenvector interpretation (ε(·) = principal eigenvector):
a′ = Gᵀh    a = ε(GᵀG)
h′ = Ga     h = ε(GGᵀ)
p = ε(Gᵀ)
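The coupled iteration can be sketched as follows (toy graph and names are mine; the L2 normalization keeps the scores bounded while the directions converge to the principal eigenvectors of GᵀG and GGᵀ):

```python
def hits(G, iters=50):
    # Iterate a <- G^T h, h <- G a with normalization after each step.
    nodes = sorted(G)
    a = {u: 1.0 for u in nodes}
    h = {u: 1.0 for u in nodes}
    for _ in range(iters):
        a = {u: sum(h[v] for v in nodes if u in G[v]) for u in nodes}
        h = {u: sum(a[v] for v in G[u]) for u in nodes}
        na = sum(x * x for x in a.values()) ** 0.5 or 1.0
        nh = sum(x * x for x in h.values()) ** 0.5 or 1.0
        a = {u: x / na for u, x in a.items()}
        h = {u: x / nh for u, x in h.items()}
    return a, h

# Hypothetical toy graph: two hub pages pointing at content pages.
G = {'hub1': ['s1', 's2', 's3'], 'hub2': ['s1', 's2'],
     's1': [], 's2': [], 's3': []}
a, h = hits(G)
```

Here s1 and s2 score highest as authorities (two in-links each), and hub1 scores highest as a hub (it links to all three authorities).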
Example
[slide from Baldi et al.]
HITS
• HITS is now used by Ask.com and Teoma.com.
• It can also be used to identify communities (e.g., based on synonyms), as well as controversial topics.
• Example for “jaguar”
– Principal eigenvector gives pages about the animal
– The positive end of the second nonprincipal eigenvector gives pages
about the football team
– The positive end of the third nonprincipal eigenvector gives pages about
the car.
• Example for “abortion”
– The positive end of the second nonprincipal eigenvector gives pages on
“planned parenthood” and “reproductive rights”
– The negative end of the same eigenvector includes “pro-life” sites.
• SALSA (Lempel and Moran 2001)
Models of the Web
• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
• Erdös/Rényi 59, 60 (a): Poisson degree distribution,
P(k) = e^(−⟨k⟩)⟨k⟩^k / k!,  where ⟨k⟩ = Np
• Barabási/Albert 99 (b): power-law degree distribution,
P(k) ∝ k^(−γ)
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03
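The Erdös/Rényi degree distribution is easy to sanity-check numerically (the function name is mine):

```python
import math

def poisson_pk(k, mean_k):
    # Degree distribution of an Erdos-Renyi random graph in the
    # sparse limit: P(k) = e^{-<k>} <k>^k / k!, with <k> = N * p.
    return math.exp(-mean_k) * mean_k ** k / math.factorial(k)
```

Summing P(k) over k recovers 1, and the distribution is sharply peaked around ⟨k⟩; by contrast, the power law P(k) ∝ k^(−γ) has a heavy tail of very-high-degree nodes.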
Evolving Word-based Web
• Observations:
– Links are made based on
topics
– Topics are expressed with
words
– Words are distributed very
unevenly (Zipf, Benford,
self-triggerability laws)
• Model
– Pick n
– Generate n lengths
according to a power-law
distribution
– Generate n documents
using a trigram model
• Model (cont’d)
– Pick words in decreasing
order of r.
– Generate hyperlinks with
random directionality
• Outcome
– Generates power-law
degree distributions
– Generates topical
communities
– Natural variation of
PageRank: LexRank
Readings
• paper by Church and Gale
(http://citeseer.ist.psu.edu/church95poisson.html)