The PageRank Citation Ranking:
Bringing Order to the Web
Dr. Yingwu Zhu
Overview

 Motivation
 Related work
 PageRank & Random Surfer Model
 Implementation
 Conclusion
Motivation

 Web: heterogeneous and unstructured
 No quality control of content on the web
 Commercial interest in manipulating rankings
Related Work

 Academic citation analysis
 Link-based analysis
 Clustering methods of link structure
 Hubs & Authorities Model
Backlink

 Link structure of the web
 Backlinks approximate a page's importance / quality
PageRank

 Pages with lots of backlinks are important
 Backlinks coming from important pages convey more importance to a page
  R(u) = c · Σ_{v ∈ B_u} R(v) / N_v
– B_u is the set of pages pointing to u; N_v is the number of out-links of v
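A minimal sketch of one application of this update, assuming a hypothetical three-page graph (the page names and starting ranks are illustrative, not from the slides):

# One step of the simplified PageRank update
#   R(u) = c * sum_{v in B_u} R(v) / N_v
# B_u = pages linking to u, N_v = number of out-links of v.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical toy graph
R = {p: 1.0 / len(out_links) for p in out_links}        # start with uniform ranks

new_R = {u: 0.0 for u in out_links}
for v, targets in out_links.items():
    for u in targets:
        new_R[u] += R[v] / len(targets)    # v splits its rank equally among its out-links
c = 1.0 / sum(new_R.values())              # choose c so the new ranks sum to 1
new_R = {u: c * r for u, r in new_R.items()}
print(new_R)

Repeating this update is what the later slides formalize as an iteration over the whole web graph.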
Two Problems!

 Rank sink
– Fix: introduce an escape term (rank source)
 Dangling links
– Links that point to any page with no outgoing links
– They do not affect the rank of any other page directly
– Ignore them at first, add them back later
Rank Sink

 A cycle of pages that is pointed to by some incoming link but has no links out of the cycle
 Problem: the loop accumulates rank but never distributes any rank outside
Escape Term

 Solution: a rank source E

  R(u) = c · ( Σ_{v ∈ B_u} R(v) / N_v + E(u) )

 c is maximized and ||R||_1 = 1
 E(u) is some vector over the web pages
– uniform, a favorite page, etc.
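A small sketch of why the rank source matters, on a hypothetical graph where S links into the loop A ↔ B; the uniform E with ||E||_1 = 0.15 and the iteration count are illustrative choices, not from the slides:

# Rank-sink demo: page S links into the loop A <-> B.
# Without a rank source, repeated updates trap all rank in the loop;
# with a uniform rank source E, every page keeps some rank.
out_links = {"S": ["A"], "A": ["B"], "B": ["A"]}   # hypothetical graph
pages = list(out_links)

def step(R, E=None):
    # R(u) <- c * (sum_{v in B_u} R(v)/N_v + E(u)), with c = 1/total so ranks sum to 1
    new = {u: 0.0 for u in pages}
    for v, targets in out_links.items():
        for u in targets:
            new[u] += R[v] / len(targets)
    if E is not None:
        for u in pages:
            new[u] += E[u]
    total = sum(new.values())
    return {u: r / total for u, r in new.items()}

R_plain = {p: 1.0 / len(pages) for p in pages}
R_source = dict(R_plain)
E = {p: 0.15 / len(pages) for p in pages}          # uniform rank source, ||E||_1 = 0.15

for _ in range(50):
    R_plain = step(R_plain)
    R_source = step(R_source, E)

print("no rank source:  ", R_plain)    # S ends with rank 0; the A-B loop holds everything
print("with rank source:", R_source)   # S keeps a small share; the loop is damped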
Matrix Notation
  R = c (A^T + E·e^T) R

 R is the dominant eigenvector and c is the dominant eigenvalue of (A^T + E·e^T), because c is maximized
– A is the link matrix (A_{uv} = 1/N_u if u links to v, 0 otherwise); e is the all-ones vector
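A sketch of the matrix form, assuming a hypothetical four-page graph and a uniform rank source with ||E||_1 = 0.15; numpy's eigendecomposition is used here only to check that R is the dominant eigenvector of (A^T + E·e^T):

import numpy as np

# Hypothetical 4-page graph; A[u, v] = 1/N_u if page u links to page v.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)
A = np.zeros((n, n))
for u, targets in links.items():
    A[u, targets] = 1.0 / len(targets)

E = np.full(n, 0.15 / n)        # uniform rank source with ||E||_1 = 0.15
e = np.ones(n)
M = A.T + np.outer(E, e)        # the matrix (A^T + E.e^T)

vals, vecs = np.linalg.eig(M)
i = np.argmax(vals.real)        # dominant eigenvalue is 1/c
R = vecs[:, i].real
R = R / R.sum()                 # rescale so that ||R||_1 = 1
print("c =", 1.0 / vals[i].real)
print("R =", R)

With these illustrative numbers c comes out just below 1 (about 1/1.15), because the rank source carries 15% of the total weight.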
Computing PageRank
  R_0 ← S                           - initialize vector over web pages
  loop:
    R_{i+1} ← A^T R_i               - new ranks: sum of normalized backlink ranks
    d ← ||R_i||_1 − ||R_{i+1}||_1   - compute normalizing factor
    R_{i+1} ← R_{i+1} + d·E         - add escape term
    δ ← ||R_{i+1} − R_i||_1         - change between iterations
  while δ > ε                       - stop when converged (ε is the control parameter)
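A runnable sketch that mirrors this iteration; the small graph (with one dangling page so that d is nonzero), the uniform E normalized to sum 1, and the tolerance ε are illustrative assumptions, not a production implementation:

import numpy as np

# Hypothetical toy graph: A[u, v] = 1/N_u if u links to v.
# Page 3 is dangling (no out-links), so some rank leaks each step and d > 0.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
n = len(links)
A = np.zeros((n, n))
for u, targets in links.items():
    for v in targets:
        A[u, v] = 1.0 / len(targets)

E = np.full(n, 1.0 / n)     # uniform rank source; normalized so the d*E step restores ||R||_1
eps = 1e-9                  # illustrative convergence threshold

R = np.full(n, 1.0 / n)     # R_0 <- S: start from a uniform vector
while True:
    R_next = A.T @ R                              # sum of normalized backlink ranks
    d = np.abs(R).sum() - np.abs(R_next).sum()    # normalizing factor: rank lost this step
    R_next = R_next + d * E                       # add escape term
    delta = np.abs(R_next - R).sum()              # change between iterations
    R = R_next
    if delta < eps:                               # stop when converged
        break

print(R, R.sum())

Here the d·E step re-injects exactly the rank that leaked through the dangling page, so the total rank stays at 1 throughout.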
Random Surfer Model

 PageRank corresponds to the probability distribution of a random walk on the web graph
 E(u) can be re-phrased as: the random surfer periodically gets bored and jumps to a different page, instead of being stuck in a loop forever
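A sketch of the random-surfer reading: simulate a surfer who usually follows a random out-link and occasionally gets bored and jumps to a random page, then look at visit frequencies; the graph, the 0.15 jump probability, and the step count are illustrative:

import random
from collections import Counter

# Hypothetical graph; jump_prob plays the role of a uniform E.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
pages = list(links)
jump_prob = 0.15          # chance the surfer gets bored and jumps to a random page
steps = 200_000           # illustrative walk length

random.seed(0)
page = random.choice(pages)
visits = Counter()
for _ in range(steps):
    visits[page] += 1
    if random.random() < jump_prob or not links[page]:
        page = random.choice(pages)           # bored (or stuck): jump anywhere
    else:
        page = random.choice(links[page])     # follow a random out-link

freq = {p: visits[p] / steps for p in pages}
print(freq)   # visit frequencies approximate PageRank with a uniform rank source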
Implementation

 Computing resources
– 24 million pages
– 75 million URLs
 Memory and disk storage
– weight vector (4-byte floats)
– matrix A (linear access)
Implementation (Cont'd)

 Dealing with dangling links
– Assign a unique integer ID to each URL
– Sort the link structure and remove dangling links
– Make the initial rank assignment
– Iterate until convergence
– Add back dangling links and re-compute
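A sketch of this dangling-link procedure on a toy graph; the graph, the pagerank_iterations helper, and the parameter choices are all hypothetical, meant only to show the remove / converge / add-back order:

import numpy as np

def pagerank_iterations(links, iters, R0=None, E_weight=0.15):
    # Basic iteration R <- normalize(A^T R + E); a hypothetical helper, not the paper's code.
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = np.zeros((n, n))
    for u, out in links.items():
        for v in out:
            A[idx[u], idx[v]] = 1.0 / len(out)
    E = np.full(n, E_weight / n)
    R = np.full(n, 1.0 / n) if R0 is None else np.array([R0.get(p, 0.0) for p in pages])
    for _ in range(iters):
        R = A.T @ R + E
        R /= R.sum()
    return dict(zip(pages, R))

# Hypothetical graph; "D" is a dangling page (no outgoing links).
links = {"A": ["B", "D"], "B": ["C"], "C": ["A"], "D": []}

# 1. Remove dangling pages and the links pointing to them.
dangling = {p for p, out in links.items() if not out}
core = {p: [q for q in out if q not in dangling]
        for p, out in links.items() if p not in dangling}

# 2. Iterate on the core graph until it has (approximately) converged.
core_ranks = pagerank_iterations(core, iters=100)

# 3. Add the dangling pages back and run a few more iterations so they
#    pick up rank without having distorted the main computation.
final_ranks = pagerank_iterations(links, iters=5, R0=core_ranks)
print(final_ranks)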
Convergence Properties (Cont'd)

 PageRank computation takes O(log |V|) iterations because the web's link graph G is rapidly mixing
Personalized PageRank

 Rank source E can be initialized:
– uniformly over all pages
   e.g. copyright warnings, disclaimers, mailing list archives end up with overly high ranks
– with all weight on a single page, e.g. Netscape or McCarthy
   great variation of ranks when different single pages are used as the rank source
– anything in between, e.g. server root pages
   allows manipulation by commercial interests
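A sketch of how the choice of E changes the ranking, comparing a uniform rank source with one concentrated on a single page; the page names and ||E||_1 = 0.15 are illustrative:

import numpy as np

# Hypothetical pages; A[u, v] = 1/N_u if u links to v.
links = {"home": ["blog", "docs"], "blog": ["docs"],
         "docs": ["home"], "misc": ["home", "docs"]}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)
A = np.zeros((n, n))
for u, out in links.items():
    for v in out:
        A[idx[u], idx[v]] = 1.0 / len(out)

def pagerank(E, iters=200):
    # R <- normalize(A^T R + E); E is the rank source vector.
    R = np.full(n, 1.0 / n)
    for _ in range(iters):
        R = A.T @ R + E
        R /= R.sum()
    return {p: round(float(r), 3) for p, r in zip(pages, R)}

uniform_E = np.full(n, 0.15 / n)         # spread the rank source over all pages
single_E = np.zeros(n)
single_E[idx["blog"]] = 0.15             # put all of the rank source on one page

print("uniform E:   ", pagerank(uniform_E))
print("E on 'blog': ", pagerank(single_E))   # pages reachable from 'blog' gain rank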
Issues

 Users are not random walkers
– Content-based methods
 Starting point distribution
– Actual usage data as the starting vector
 Reinforcing effects / bias towards main pages
 How about traffic to ranked pages?
 No query-specific rank
 Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
Conclusion

 PageRank is a global ranking based on the web's graph structure
 PageRank uses backlink information to bring order to the web
 PageRank can separate out representative pages as cluster centers
 A great variety of applications