PageRank

A Search Engine: “obama”

PageRank Model: Final Version
• The Web: a directed graph
[Figure: directed graph with vertices (pages) a, b, c, d, e, f and edges (links)]

Input Structure
• 41.5 million edges
• 5.4 million nodes
• Line format: document-with-link document-linked

Step 1. Dictionary Encode Links
• Strings difficult to fit in memory
• Encode strings as OIDs (object ids = integers)
• Input line:
  http://es.dbpedia.org/resource/Ciencia_ficción http://es.dbpedia.org/resource/Robot
• Output line:
  12039 52673
• Dictionary:
  12039 http://es.dbpedia.org/resource/Ciencia_ficción
  …
  52673 http://es.dbpedia.org/resource/Robot
  …
• OIDCompress -i [folder]/page_links_es.tsv.gz -igz -o [folder]/page_links_es.oid.gz -ogz -d [folder]/page_links_es.dict.gz -dgz

Step 2. Write PageRank Algorithm
• PageRankGraph.rankGraph(int[][] graph)
• int[] out = graph[i];
  – out contains the nodes linked from node i
  – it might be empty or null if node i doesn’t link to anything!
• two rank vectors: rank[graph.length], nextRank[graph.length]
• initial rank values set as 1d / graph.length
• run ITERS number of iterations
  – compute edge-invariant rank once per iteration (red and blue)
    • need to keep track of the sum of ranks of nodes with no outlinks from the previous round
  – for each node (orange)
    • split its rank[] by the number of outlinks it has, and add the result to the nextRank[] of each node it links to
  – the sum of the ranks after each round should be very close to 1
• test on -i data/test-graph.txt -o data/test-data.txt

Step 3. Rank full data
• Run ranking
  -i [folder]/page_links_es.oid.gz -igz -o [folder]/page_ranks_es.oid.gz -ogz
• Sort by rank
  -i [folder]/page_ranks_es.oid.gz -igz -o [folder]/page_ranks_es_s.oid.gz -ogz
• Decompress the file
  -d [folder]/page_links_es.dict.gz -dgz -i [folder]/page_ranks_es_s.oid.gz -igz -n 0 -o [folder]/page_ranks_es_s.tsv.gz -ogz

Course Marking
• 45% for Weekly Labs (~3% a lab!)
• 35% for Final Exam
• 20% for Small Class Project

Class Project
• Done in pairs (Except Alejandro :P)
• Goal: Use what you’ve learned to do something cool (basically)
• Expected difficulty: More than a lab’s worth
  – But from scratch / without my help!
• Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
  – Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
• Process:
  – Pair up (default random) by Wednesday
  – Decide on a topic (by June 9th) or let me assign one
  – If you need data or get stuck, I will (try to) help out
• Deliverables: 10 minute presentation (June 23rd) & 4-page report
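
Appendix: a sketch for Step 2
• The bullets under Step 2 can be read as the loop below. This is only a minimal sketch, not the lab solution: the class name PageRankSketch, the damping factor D = 0.85 and ITERS = 20 are assumptions made for illustration, since the slides do not fix these values. The edge-invariant part is assumed to be the random-jump share plus an even split of the rank held by nodes with no outlinks.

```java
import java.util.Arrays;

// Hypothetical class name; the lab's own class is PageRankGraph.
final class PageRankSketch {
    static final double D = 0.85; // assumed damping factor (not given in the slides)
    static final int ITERS = 20;  // assumed number of iterations

    static double[] rankGraph(int[][] graph) {
        int n = graph.length;
        double[] rank = new double[n];
        double[] nextRank = new double[n];
        Arrays.fill(rank, 1d / n); // initial rank values

        for (int iter = 0; iter < ITERS; iter++) {
            // Sum of ranks of nodes with no outlinks in the previous round.
            double danglingSum = 0d;
            for (int i = 0; i < n; i++) {
                if (graph[i] == null || graph[i].length == 0) {
                    danglingSum += rank[i];
                }
            }

            // Edge-invariant rank: the same for every node, computed once per iteration.
            double base = (1d - D) / n + D * danglingSum / n;
            Arrays.fill(nextRank, base);

            // Split each node's rank evenly over the nodes it links to.
            for (int i = 0; i < n; i++) {
                int[] out = graph[i];
                if (out == null || out.length == 0) {
                    continue; // already accounted for in danglingSum
                }
                double share = D * rank[i] / out.length;
                for (int target : out) {
                    nextRank[target] += share;
                }
            }

            // Swap the two vectors for the next round.
            double[] tmp = rank;
            rank = nextRank;
            nextRank = tmp;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny made-up graph: node 3 has no outlinks (null entry).
        int[][] graph = { {1, 2}, {2}, {0}, null };
        double[] rank = rankGraph(graph);
        double sum = 0d;
        for (double r : rank) {
            sum += r;
        }
        System.out.println(Arrays.toString(rank));
        System.out.println("sum of ranks = " + sum);
    }
}
```

• Filling nextRank[] with the edge-invariant value first means the per-node loop only has to handle real edges, and the printed sum gives the "close to 1" sanity check from Step 2.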