Graph Models
The PageRank Algorithm
Anna-Karin Tornberg
Mathematical Models, Analysis and Simulation
Fall semester, 2013
The PageRank Algorithm

- Invented by Larry Page and Sergey Brin around 1998 and used in the prototype of Google’s search engine.
- Estimates the popularity/importance of a webpage based on the interconnection of the web. Two basic assumptions:
  i) A page with more incoming links is more important than a page with fewer incoming links.
  ii) A page with a link from a page of high importance is also important.
- We have used incidence matrices to define the structure of a graph. In our earlier examples, two nodes were only connected by a link in one direction. Now, webpage 1 can link to webpage 2, while 2 also links to 1.
- From now on, “node” and “webpage” (or simply “page”) are used interchangeably.
The first model - the bored surfer

- Imagine a bored surfer who clicks links in a random manner.
- If a page has a number of links, the bored surfer is equally likely to click on any of them.
- If there are no links from the current webpage, the surfer goes to another webpage at random.
- If the vector x ∈ ℝ^N contains the probabilities that the surfer is at website 1, 2, ..., N at a certain instant, then we want to create a matrix A such that Ax gives the probabilities that the surfer is at website 1, 2, ..., N after one more step.
Defining the matrix

- Denote by L(j) the number of links from page j.
- First, define the matrix entries a_ij, i, j = 1, ..., N:

      a_ij = { 1/L(j)  if there is a link from j to i,
             { 0       otherwise.
- This is the assumption that all links from a page will be clicked with equal probability.
- If there are no links from a page, this will however render a column of zeros. Then set all values in that column to 1/N, using the assumption that the surfer will pick a new page at random (i.e. all pages have equal probability).
- This yields

      a_ij = { 1/L(j)  if there is a link from j to i,
             { 1/N     if there are no links from j,
             { 0       otherwise.
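As a concrete sketch, the matrix A can be assembled directly from this definition. The four-page link structure below is a made-up example (not from the lecture); page 3 is a dangling page with no outgoing links:

```python
import numpy as np

# Hypothetical 4-page web: links[j] lists the pages that page j links to.
# Page 3 has no outgoing links (a "dangling" page).
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
N = len(links)

A = np.zeros((N, N))
for j, targets in links.items():
    if targets:
        for i in targets:
            A[i, j] = 1.0 / len(targets)   # a_ij = 1/L(j) if j links to i
    else:
        A[:, j] = 1.0 / N                  # no links from j: whole column is 1/N

print(A.sum(axis=0))                       # each column sums to 1
```

Note that the dangling-page rule is exactly what makes every column sum to 1.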
The page rank

- If the vector x ∈ ℝ^N contains the probabilities that a surfer is at website 1, 2, ..., N at a certain instant, then Ax gives the probabilities that the surfer is at website 1, 2, ..., N after one more step.
- The page rank is given by the vector x such that a multiplication by A no longer changes the probabilities, i.e.

      x = Ax.

- This has a solution if the matrix A has an eigenvalue 1 with a corresponding eigenvector x.
- Our matrix A is a so-called column-stochastic matrix: all entries are non-negative, and the entries in each column sum to 1.
- The Perron-Frobenius theorem ensures that every stochastic matrix has an eigenvalue λ = 1, and that no other eigenvalue is larger in magnitude.
- Without more assumptions, it does however not guarantee that λ = 1 is a simple eigenvalue, and hence that x is unique.
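To illustrate the eigenvalue statement, here is a sketch using NumPy's eigensolver, with a random column-stochastic matrix as a stand-in for A (the data is made up; any matrix with non-negative entries and unit column sums works):

```python
import numpy as np

# A random column-stochastic matrix as a stand-in for A (hypothetical data).
rng = np.random.default_rng(1)
M = rng.random((4, 4))
A = M / M.sum(axis=0)            # normalize columns so each sums to 1

w, V = np.linalg.eig(A)
k = np.argmin(np.abs(w - 1.0))   # locate the eigenvalue closest to 1
x = np.real(V[:, k])
x = x / x.sum()                  # scale the eigenvector so its entries sum to 1
print(np.real(w[k]))             # eigenvalue 1, up to rounding
print(np.allclose(A @ x, x))     # x = Ax holds
```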
The Power Method

- The dominant eigenvalue of a matrix A is the eigenvalue with the largest magnitude, and a dominant eigenvector is an eigenvector corresponding to this eigenvalue.
- Introduce the power iteration (with x_0 a unit vector),

      x_0,  x_1 = Ax_0/‖Ax_0‖,  x_2 = Ax_1/‖Ax_1‖,  ...,  x_k = Ax_{k-1}/‖Ax_{k-1}‖,  ...

  Then this sequence converges to a unit dominant eigenvector, under the assumptions
  i) A has an eigenvalue that is strictly greater in magnitude than its other eigenvalues.
  ii) The starting vector x_0 has a non-zero component in the direction of an eigenvector associated to the dominant eigenvalue.
  and the sequence

      x_1^T A x_1,  x_2^T A x_2,  ...,  x_k^T A x_k,  ...

  converges to the dominant eigenvalue λ.
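The iteration above can be sketched in a few lines. The 2×2 matrix is a made-up test case whose eigenvalues are 3 and 1, so the dominant eigenvalue is well separated:

```python
import numpy as np

def power_iteration(A, x0, tol=1e-12, max_iter=1000):
    """x_k = A x_{k-1} / ||A x_{k-1}||; returns (eigenvalue, eigenvector)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        y = A @ x
        y = y / np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:   # iterates have stopped changing
            x = y
            break
        x = y
    return x @ (A @ x), x                 # x_k^T A x_k estimates lambda

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # eigenvalues 3 and 1
lam, x = power_iteration(A, np.array([1.0, 0.0]))
print(lam)                                # close to 3
```

The starting vector [1, 0] has a non-zero component along the dominant eigenvector [1, 1]/√2, so both assumptions above are satisfied.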
What can fail?

- The PageRank algorithm is simply the power iteration

      x_k = A^k x_0,

  where x_k converges to the PageRank vector x as k → ∞. All entries in x are non-negative; the node whose entry has the largest value is ranked the most important, etc. (x_0 must have non-negative entries.)
- In the homework, you are asked to show that for a column-stochastic matrix A,

      Σ_{i=1}^n (A^k x)_i = Σ_{i=1}^n (x)_i,

  i.e. if the entries of x_0 are scaled to sum to 1, so will the entries of x_k.
- Can the PageRank algorithm fail the way we have constructed A so far?
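The homework identity can be checked numerically; the reasoning is that each column of A sums to 1, so multiplying by A preserves the entry sum of x. A sketch with a random column-stochastic matrix (hypothetical data):

```python
import numpy as np

# Random column-stochastic matrix (hypothetical stand-in for A).
rng = np.random.default_rng(0)
M = rng.random((5, 5))
A = M / M.sum(axis=0)       # columns sum to 1

x = rng.random(5)
x = x / x.sum()             # entries of x_0 scaled to sum to 1

for _ in range(10):         # ten steps x_k = A x_{k-1}, no renormalization
    x = A @ x
print(x.sum())              # still 1: sum_i (A^k x)_i = sum_i (x)_i
```

This is why the PageRank iteration, unlike the general power method, needs no normalization step.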
Reducible graphs

- A graph is called irreducible if we can reach all nodes, independent of which node we start from.
- A graph that is not irreducible is called reducible.

Example of a reducible graph: Imagine two sets of nodes.

- Both set C and set D contain many nodes, and they all link to other nodes.
- Now assume that nodes from set C link to set D, but no node from set D links to nodes in set C.
- That means that once we are at a node in set D, there is no possibility to go to a node in set C following the link structure.
- In the algorithm, the random “restart”, with 1/N as each column entry in A, will never occur, since each node has outgoing links. This means that if we are at a node in set D, we have zero probability of returning to set C.
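A minimal numerical illustration of this failure: the four-node graph below is hypothetical, with C = {0, 1}, D = {2, 3}, a single link from C into D, and a self-link at node 3 so the iteration settles rather than oscillates:

```python
import numpy as np

# Hypothetical reducible graph: C = {0, 1}, D = {2, 3}.
# Node 1 links into D, but nothing in D links back to C.
links = {0: [1], 1: [0, 2], 2: [3], 3: [2, 3]}
N = len(links)
A = np.zeros((N, N))
for j, targets in links.items():
    for i in targets:
        A[i, j] = 1.0 / len(targets)

x = np.full(N, 1.0 / N)     # start with equal probability on every node
for _ in range(200):
    x = A @ x
print(np.round(x, 3))       # all probability has drained into D = {2, 3}
```

Nodes 0 and 1 end up with rank zero even though they are part of the link structure, which is clearly not the ranking we want.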
The “Google matrix”

- In order to calculate PageRanks for a reducible web graph, Page and Brin proposed to define the following matrix:

                        | 1 1 1 ··· 1 |
      G = αA + (1-α)/N  | 1 1 1 ··· 1 |
                        | :  :  ⋱   : |
                        | 1 1 1 ··· 1 |

  where A is the matrix we have already defined, and the “damping” factor α has a default value of 0.85. This gives the surfer a probability 1-α of jumping randomly to any page.
- The matrix G is still a column-stochastic matrix, but now the entries are not only non-negative, but strictly positive.
- For such a matrix, the Perron-Frobenius theorem tells us that the eigenvalue λ = 1 is a simple eigenvalue (multiplicity 1), and that all other eigenvalues are of smaller magnitude.
- Hence, x_k = G^k x_0 will converge to a non-negative eigenvector x as k → ∞, which is unique up to normalization (given that x_0 is non-negative).
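A sketch of the damped iteration on a small reducible example (the link structure is made up; α = 0.85 as in the slide):

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """G = alpha*A + ((1 - alpha)/N) * (all-ones matrix)."""
    N = A.shape[0]
    return alpha * A + (1.0 - alpha) / N * np.ones((N, N))

# Hypothetical reducible graph: C = {0, 1}, D = {2, 3}; D never links back to C.
links = {0: [1], 1: [0, 2], 2: [3], 3: [2]}
N = len(links)
A = np.zeros((N, N))
for j, targets in links.items():
    for i in targets:
        A[i, j] = 1.0 / len(targets)

G = google_matrix(A)
x = np.full(N, 1.0 / N)
for _ in range(200):
    x = G @ x               # column sums are 1, so x keeps summing to 1
print(np.round(x, 3))       # every page now has a strictly positive rank
```

Unlike the undamped iteration on the same reducible graph, the pages in C retain a positive (if small) rank, because the surfer can always jump back with probability 1-α.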