PageRankGraph.rankGraph(int[][] graph)

PageRank
A Search Engine
• Example query: “obama”
PageRank Model: Final Version
• The Web: a directed graph
• Vertices are pages; edges are links
[Figure: example directed graph on vertices a, b, c, d, e, f]
Input Structure
• 41.5 million edges
• 5.4 million nodes
• One edge per line: document-with-link document-linked
Step 1. Dictionary Encode Links
• Strings difficult to fit in memory
• Encode strings as OIDs (object ids = integers)
• Input line:
http://es.dbpedia.org/resource/Ciencia_ficción http://es.dbpedia.org/resource/Robot
• Output line:
12039 52673
• Dictionary:
12039 http://es.dbpedia.org/resource/Ciencia_ficción
…
52673 http://es.dbpedia.org/resource/Robot
…
• OIDCompress
-i [folder]/page_links_es.tsv.gz -igz -o [folder]/page_links_es.oid.gz -ogz -d [folder]/page_links_es.dict.gz -dgz
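The idea behind dictionary encoding can be sketched in a few lines of Java. This is only an illustration under assumptions, not the OIDCompress tool itself: the class `DictionaryEncoderSketch` and its methods are hypothetical, OIDs are assigned sequentially from 0 in order of first appearance (so they will not match the 12039 / 52673 values shown above), and the whole dictionary is held in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dictionary encoding; not the lab's OIDCompress tool.
class DictionaryEncoderSketch {
    // string -> OID
    private final Map<String, Integer> oids = new HashMap<>();
    // OID -> string (the dictionary, needed later to decode the results)
    private final List<String> dictionary = new ArrayList<>();

    // Returns the OID of s, assigning the next free integer on first sight.
    int encode(String s) {
        Integer oid = oids.get(s);
        if (oid == null) {
            oid = dictionary.size();
            oids.put(s, oid);
            dictionary.add(s);
        }
        return oid;
    }

    // Returns the original string for an OID produced by encode.
    String decode(int oid) {
        return dictionary.get(oid);
    }

    // Encodes one whitespace-separated "source target" line as a pair of OIDs.
    int[] encodeLine(String line) {
        String[] parts = line.trim().split("\\s+");
        return new int[] { encode(parts[0]), encode(parts[1]) };
    }

    public static void main(String[] args) {
        DictionaryEncoderSketch enc = new DictionaryEncoderSketch();
        int[] edge = enc.encodeLine(
            "http://es.dbpedia.org/resource/Ciencia_ficci\u00f3n\thttp://es.dbpedia.org/resource/Robot");
        // Encoded edge: two integers instead of two long strings.
        System.out.println(edge[0] + "\t" + edge[1]);
        // Dictionary entries map each OID back to its string.
        System.out.println(edge[0] + "\t" + enc.decode(edge[0]));
        System.out.println(edge[1] + "\t" + enc.decode(edge[1]));
    }
}
```

After encoding, the ranking step only has to handle pairs of integers; the strings are needed again only when the final results are decoded.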
Step 2. Write PageRank Algorithm
• PageRankGraph.rankGraph(int[][] graph)
• int[] out = graph[i];
– out contains the nodes linked from node i
– it might be empty or null if node i doesn’t link to anything!
• two rank vectors: rank[graph.length] and nextRank[graph.length]
• initial rank values set to 1d / graph.length
• run ITERS iterations
– compute edge-invariant rank once per iteration (red and blue)
• this needs the sum of the ranks of nodes with no outlinks from the previous round
– for each node (orange)
• split its rank[] value by the number of outlinks it has, and add the result to the nextRank[] of each node it links to
– the sum of the ranks after each round should be very close to 1
• test on -i data/test-graph.txt -o data/test-data.txt
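One possible shape for the steps above is sketched below. It is not the reference solution. The damping factor `D = 0.85`, the value of `ITERS`, the class name `PageRankSketch`, and the `double[]` return type are all assumptions, and so is the reading of the "edge-invariant" rank as the share every node receives regardless of its inlinks: `(1 - D) / n` plus an equal split of the damped rank held by nodes with no outlinks.

```java
import java.util.Arrays;

// Hypothetical sketch of the steps above; D and ITERS are assumed values.
class PageRankSketch {
    static final double D = 0.85;  // damping factor (assumption)
    static final int ITERS = 50;   // number of iterations (assumption)

    static double[] rankGraph(int[][] graph) {
        int n = graph.length;
        double[] rank = new double[n];
        double[] nextRank = new double[n];
        Arrays.fill(rank, 1d / n);

        for (int iter = 0; iter < ITERS; iter++) {
            // Sum the rank held by nodes with no outlinks in the previous round.
            double dangling = 0d;
            for (int i = 0; i < n; i++) {
                if (graph[i] == null || graph[i].length == 0) {
                    dangling += rank[i];
                }
            }
            // Edge-invariant share: identical for every node in this iteration.
            double base = (1d - D) / n + D * dangling / n;
            Arrays.fill(nextRank, base);

            // Split each node's damped rank evenly among the nodes it links to.
            for (int i = 0; i < n; i++) {
                int[] out = graph[i];
                if (out == null || out.length == 0) continue;
                double share = D * rank[i] / out.length;
                for (int target : out) {
                    nextRank[target] += share;
                }
            }
            // Swap the vectors so nextRank becomes the current rank.
            double[] tmp = rank;
            rank = nextRank;
            nextRank = tmp;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Made-up graph: 0 -> 1, 2; 1 -> 2; 2 -> 0; node 3 has no outlinks.
        int[][] graph = { {1, 2}, {2}, {0}, null };
        double[] rank = rankGraph(graph);
        System.out.println(Arrays.toString(rank));
        System.out.println("sum = " + Arrays.stream(rank).sum());
    }
}
```

Under these assumptions the ranks sum to 1 after every round, up to floating-point error: the base share contributes `(1 - D) + D * dangling` in total, and the nodes with outlinks pass on `D * (1 - dangling)`. Printing the sum, as `main` does, is a cheap sanity check.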
Step 3. Rank full data
• Run ranking
-i [folder]/page_links_es.oid.gz -igz -o [folder]/page_ranks_es.oid.gz -ogz
• Sort by rank
-i [folder]/page_ranks_es.oid.gz -igz -o [folder]/page_ranks_es_s.oid.gz -ogz
• Decompress the file (decode the OIDs back to strings using the dictionary)
-d [folder]/page_links_es.dict.gz -dgz -i [folder]/page_ranks_es_s.oid.gz -igz -n 0 -o [folder]/page_ranks_es_s.tsv.gz -ogz
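The last two steps (sort by rank, then decode OIDs back to strings) amount to the following, shown as a small in-memory sketch. The class `RankSortSketch`, its methods, the "string, tab, rank" output layout, and the rank values in `main` are all made up for illustration. The lab's tools work on gzipped files, and their exact output format is not shown here.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical in-memory sketch of "sort by rank" followed by decoding OIDs.
class RankSortSketch {
    // Returns the node ids ordered from highest to lowest rank.
    static Integer[] sortByRank(double[] rank) {
        Integer[] ids = new Integer[rank.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, Comparator.comparingDouble((Integer id) -> rank[id]).reversed());
        return ids;
    }

    // Replaces each OID by its dictionary string, keeping the sorted order.
    static String[] decode(Integer[] sortedIds, double[] rank, String[] dictionary) {
        String[] lines = new String[sortedIds.length];
        for (int k = 0; k < sortedIds.length; k++) {
            int id = sortedIds[k];
            lines[k] = dictionary[id] + "\t" + rank[id];
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up dictionary (OID = array index) and made-up rank values.
        String[] dictionary = {
            "http://es.dbpedia.org/resource/Ciencia_ficci\u00f3n",
            "http://es.dbpedia.org/resource/Robot"
        };
        double[] rank = { 0.4, 0.6 };
        for (String line : decode(sortByRank(rank), rank, dictionary)) {
            System.out.println(line);
        }
    }
}
```

Sorting an array of ids rather than the rank vector itself keeps each rank attached to its OID, which is what makes the later dictionary lookup possible.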
Course Marking
• 45% for Weekly Labs (~3% a lab!)
• 35% for Final Exam
• 20% for Small Class Project
Class Project
• Done in pairs (Except Alejandro :P)
• Goal: Use what you’ve learned to do something cool (basically)
• Expected difficulty: More than a lab’s worth
– But from scratch / without my help!
• Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
– Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
• Process:
– Pair up (default random) by Wednesday
– Decide on a topic (by June 9th) or let me assign one 
– If you need data or get stuck, I will (try to) help out
• Deliverables: 10 minute presentation (June 23rd) & 4-page report