The PageRank Citation Ranking: Bringing Order to the Web
Dr. Yingwu Zhu

Overview
Motivation
Related work
PageRank & Random Surfer Model
Implementation
Conclusion

Motivation
Web: heterogeneous and unstructured
No quality control on the web
Commercial interest in manipulating rankings

Related Work
Academic citation analysis
Link-based analysis
Clustering methods based on link structure
Hubs & Authorities Model

Backlink
Link structure of the web
Approximation of importance / quality

PageRank
Pages with lots of backlinks are important
Backlinks coming from important pages convey more importance to a page
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
where $B_u$ is the set of pages that link to $u$, $N_v$ is the number of outgoing links on page $v$, and $c$ is a normalization factor

PageRank
Two Problems!
Rank sink
 – Introduce escape terms
Dangling links
 – Dangling links are simply links that point to any page with no outgoing links
 – They do not affect the rank of any other pages directly
 – Ignore first and add back later

Rank Sink
A cycle of pages pointed to by some incoming link
Problem: this loop accumulates rank but never distributes any rank outside

Escape Term
Solution: rank source
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)$
$c$ is maximized and $\|R\|_1 = 1$
$E(u)$ is some vector over the web pages – uniform, favorite page, etc.
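
To make the rank sink and the escape term concrete, here is a minimal sketch in Python. It assumes a hypothetical four-page graph (`LINKS`) in which pages 2 and 3 form a cycle with no links leading out. The names `step` and `iterate`, the graph, the fixed number of rounds, and the total escape weight of 0.15 are illustrative assumptions, not taken from the slides. Each round applies R(u) = c (sum over backlinks of R(v)/N_v + E(u)), with c chosen so that the ranks sum to 1.

```python
# Hypothetical toy graph: page -> pages it links to.
# Pages 2 and 3 link only to each other, so they form a rank sink.
LINKS = {0: [1], 1: [2], 2: [3], 3: [2]}


def step(links, ranks, escape):
    """Apply R(u) = c * (sum_{v in B_u} R(v)/N_v + E(u)) once.

    c is chosen so that the new ranks sum to 1 (the L1 norm constraint).
    """
    new = list(escape)  # start from the escape term E(u)
    for v, outs in links.items():
        for u in outs:
            new[u] += ranks[v] / len(outs)  # v passes R(v)/N_v to each target
    total = sum(new)
    return [x / total for x in new]


def iterate(links, escape, rounds=50):
    """Start from uniform ranks and apply `step` a fixed number of times."""
    n = len(escape)
    ranks = [1.0 / n] * n
    for _ in range(rounds):
        ranks = step(links, ranks, escape)
    return ranks


no_escape = iterate(LINKS, [0.0] * 4)         # E = 0: plain PageRank
with_escape = iterate(LINKS, [0.15 / 4] * 4)  # uniform E (assumed weight 0.15)

print("without escape term:", [round(r, 3) for r in no_escape])
print("with escape term:   ", [round(r, 3) for r in with_escape])
```

With E = 0 the cycle ends up holding all of the rank and pages 0 and 1 drop to zero. With a uniform E, every page keeps a nonzero share, because the escape term feeds rank back to pages outside the cycle on every round.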

Matrix Notation
$R = c\,(A^T + E e^T)\,R$
where $A_{v,u} = 1/N_v$ if page $v$ links to page $u$ and 0 otherwise, and $e$ is the all-ones vector (so $e^T R = \|R\|_1 = 1$)
$R$ is the dominant eigenvector of $(A^T + E e^T)$, and $c$ is the reciprocal of the corresponding dominant eigenvalue

Computing PageRank
$R_0 \leftarrow S$ - initialize vector over web pages
loop:
  $R_{i+1} \leftarrow A^T R_i$ - new ranks: sum of normalized backlink ranks
  $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ - compute normalizing factor
  $R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term
  $\delta \leftarrow \|R_{i+1} - R_i\|_1$ - control parameter
while $\delta > \epsilon$ - stop when converged

Random Surfer Model
PageRank corresponds to the probability distribution of a random walk on the web graph
$E(u)$ can be rephrased as: the random surfer periodically gets bored and jumps to a different page, so is not kept in a loop forever

Implementation
Computing resources
 – 24 million pages
 – 75 million URLs
Memory and disk storage
 – Weight vector (4-byte float per page)
 – Matrix A (linear access)

Implementation (cont'd)
Dealing with dangling links
 – Unique integer ID for each URL
 – Sort and remove dangling links
 – Initial rank assignment
 – Iterate until convergence
 – Add back dangling links and recompute

Convergence Properties (cont'd)
PageRank converges in roughly O(log |V|) iterations, because the web graph G is rapidly mixing

Personalized PageRank
Rank source E can be initialized:
 – uniformly over all pages: pages such as copyright warnings, disclaimers, and mailing list archives end up ranked too highly
 – with total weight on a single page, e.g. Netscape or McCarthy: ranks vary greatly depending on which single page is the rank source
 – anything in between, e.g. server root pages: allows manipulation by commercial interests

Issues
Users are not random walkers
 – Content-based methods
Starting point distribution
 – Actual usage data as starting vector
Reinforcing effects / bias towards main pages
How about traffic to ranking pages?
No query-specific rank
Linkage spam
 – PageRank favors pages that managed to get other pages to link to them
 – Linkage is not necessarily a sign of relevance, only of promotion (advertisement…)

Conclusion
PageRank is a global ranking based on the web's graph structure
PageRank uses backlink information to bring order to the web
PageRank can separate out representative pages as cluster centers
A great variety of applications
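
The loop on the Computing PageRank slide, together with the choice of rank source E from the Personalized PageRank slide, can be sketched roughly as follows in Python. Several things here are assumptions rather than part of the slides: the function name `pagerank`, the toy graph `LINKS`, the choice of R_0 = E, and the `damping` factor of 0.85, which scales A^T R_i so that some rank is lost on every step and d is nonzero even without dangling pages. Rank that reaches a page with no outgoing links (page 4 here) is also lost in the multiplication and is returned through the d E term, which is a simplification of the remove-then-add-back handling of dangling links described on the Implementation slides.

```python
def pagerank(links, n, source, damping=0.85, eps=1e-10, max_iter=1000):
    """Iterate R_{i+1} = damping * A^T R_i + d * E until the L1 change < eps.

    links:  dict mapping page -> list of pages it links to (pages are 0..n-1)
    source: rank source E, n non-negative weights summing to 1
    Returns (ranks, number_of_iterations).
    """
    ranks = list(source)  # R_0 <- S; taking S = E is an assumption
    for iteration in range(1, max_iter + 1):
        new = [0.0] * n
        for v, outs in links.items():
            if not outs:
                continue  # dangling page: its rank is redistributed via d * E
            share = damping * ranks[v] / len(outs)
            for u in outs:
                new[u] += share  # sum of normalized backlink ranks
        d = sum(ranks) - sum(new)  # rank lost in this step
        new = [r + d * e for r, e in zip(new, source)]  # add escape term
        delta = sum(abs(a - b) for a, b in zip(new, ranks))
        ranks = new
        if delta < eps:  # stop when converged
            return ranks, iteration
    return ranks, max_iter


# Hypothetical toy graph; page 4 has no outgoing links.
LINKS = {0: [1, 2], 1: [2], 2: [0, 4], 3: [0, 2], 4: []}
N = 5

uniform_E = [1.0 / N] * N              # rank source spread over all pages
single_E = [0.0, 0.0, 0.0, 1.0, 0.0]   # all weight on page 3

uniform_ranks, uniform_iters = pagerank(LINKS, N, uniform_E)
personal_ranks, personal_iters = pagerank(LINKS, N, single_E)

print("uniform E:  ", [round(r, 3) for r in uniform_ranks], uniform_iters)
print("E on page 3:", [round(r, 3) for r in personal_ranks], personal_iters)
```

Comparing the two runs shows the effect described on the Personalized PageRank slide: page 3 has no backlinks, so it ranks lowest under a uniform E, but it gains a large share once all of the rank source is placed on it.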