Effective Page Refresh Policies for Web Crawlers
Written by: Junghoo Cho & Hector Garcia-Molina
Presenter: Suchitra Manepalli
Jan 27, 2005
791 Digital Preservation Seminar

Searching on the Web
- Information overload
- Indexing: Google, Alta-Vista
- Integration: BizRate
- Focus on indexing

How Google Works?
[Figure: overview of how Google works. Copyright © 2003 Google Inc.]

Crawling
[Figure, taken from Cho's thesis: the crawler loop. Initialize with the initial URLs; get the next URL from the "to visit" URLs; get the page from the web; extract URLs; record visited URLs and store web pages.]

Challenges
- Page selection and scrape: What page to scrape?
- Page and index update: How to update pages?
- Page ranking: What page is "important" or "relevant"? Determine the "canonical" copy?
- Scalability: What is the maximum number of pages that we can afford to "index"?

Focusing
- Page and index update: How to update pages?

Presentation Outline
- Introduction
- Problems
- Framework: effective solutions
- Different policies
- Weighted freshness
- Experiments
- Conclusion

Introduction
- Between crawls, web sites change non-deterministically
- Main issue: How often do we crawl?

Questions
- How can we keep pages fresh?
- What are "fresh" pages?
- How often should the index be maintained?
- What constraints are imposed?
- What are the refresh policies?
- How effective are the refresh policies?

"Freshness"
- Assume each element is equally important
- Freshness of element e_i at time t:
  \[ F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases} \]
- Freshness of the database S at time t:
  \[ F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t) \]
[Figure: elements e_i in the local database and their counterparts on the web]

"Age"
- Assume equal importance of pages
- Age of element e_i at time t:
  \[ A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases} \]
- Age of the database S at time t:
  \[ A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t) \]
[Figure: elements e_i in the local database and their counterparts on the web]

"Freshness" and "Age"
[Figure: F(e_i), between 0 and 1, and A(e_i), starting from 0, plotted against time, with update and refresh events marked]

Poisson process
- Real-world elements are modified by a Poisson process
- Changes happen randomly and independently, at a fixed rate over time

"Expected" - Variables
- Events occur in a Poisson process with change rate λ
- Probability that e_i changes at least once in the time interval (0, t]:
  \[ \Pr\{e_i \text{ changes in } (0, t]\} = 1 - e^{-\lambda t} \]

"Expected" - Equations
For an element synchronized at time 0 and not synchronized again before time t:
- Expected freshness:
  \[ E[F(e_i; t)] = e^{-\lambda t} \]
- Expected age:
  \[ E[A(e_i; t)] = \int_0^t (t - s)\, \lambda e^{-\lambda s}\, ds = t - \frac{1 - e^{-\lambda t}}{\lambda} \]

"Expected" - Graphs
[Figure: expected freshness and expected age plotted against time]

Evolution Model of Database
- Uniform change frequency model
  - All real-world elements change at the same frequency λ
  - Individual elements change at different times, but all at the same average rate
- Non-uniform change frequency model
  - Elements change at different rates

Histogram of Change Frequencies
[Figure: histogram of change frequencies]

Synchronization Policies
- Synchronization frequency
- Resource allocation
- Synchronization order
- Synchronization points

Synchronization Policies
- Synchronization frequency
  - How frequently do we synchronize the database?
  - The more often we synchronize, the fresher the database
- Resource allocation
  - How frequently should we synchronize each individual element?
  - Uniform allocation policy
  - Non-uniform allocation policy

Synchronization Policies
- Synchronization order: In what order do we synchronize the elements?
  - Fixed order: the same order repeatedly
  - Random order: the synchronization order is different in each iteration
  - Purely random: at each synchronization point, we select a random element from the database and synchronize it

Synchronization Policies
- Synchronization points
[Figure: synchronization points]

Synchronization Order - Policies
- Fixed order policy
[Figure: fixed order policy]

Synchronization Order - Policies
- Random order policy
[Figure: random order policy]

Synchronization Order - Policies
- Purely random policy
[Figure: purely random policy]

Comparison
[Figure: comparison of the synchronization-order policies]

Resource Allocation Policies
- What can we do if the elements change at different rates and we know how often each element changes?
- Is it better to synchronize an element more often when it changes more often?
- Is it better to synchronize all elements equally often?

Trick Question
- Two-page database
  - e1 changes daily
  - e2 changes once a week
- We can visit one page per week
- How should we visit pages?
  - e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
  - e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]

Proportional is often not good
- Visiting the fast-changing e1 gains 1/2 day of freshness
- Visiting the slow-changing e2 gains 1/2 week of freshness
- Visiting e2 is a better deal!

Uniform versus Proportional
- Intuition suggests that the proportional allocation policy performs better than the uniform policy
- In the two-element database, the uniform policy is actually better
- To improve freshness, we should penalize the elements that change too often

Weighted Freshness
- What if elements have different importance?
- How do we synchronize the elements to maximize the freshness of the database as perceived by the users?
- Should we refresh one element more often than another?
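The two-page trick question above can be checked numerically. The sketch below is not from the slides: it assumes Poisson changes and evenly spaced visits to each page, in which case the time-averaged freshness of a page with change rate λ that is revisited every I time units is (1 - e^{-λI}) / (λI), the average of the expected freshness e^{-λt} over one revisit interval. The function names and the one-visit-per-week budget split are illustrative choices.

```python
import math

def avg_freshness(change_rate, sync_interval):
    """Time-averaged freshness of one page under Poisson changes.

    Assumes the page is revisited at evenly spaced intervals; this is the
    average of exp(-rate * t) over one interval of length sync_interval.
    """
    x = change_rate * sync_interval
    return (1.0 - math.exp(-x)) / x

def database_freshness(change_rates, sync_freqs):
    """Unweighted mean freshness over all pages (equal importance)."""
    values = [avg_freshness(lam, 1.0 / f)
              for lam, f in zip(change_rates, sync_freqs)]
    return sum(values) / len(values)

# Rates are per week: e1 changes daily (7 per week), e2 once a week.
change_rates = [7.0, 1.0]
budget = 1.0  # total page visits per week, as in the trick question

# Uniform: split the budget equally between the two pages.
uniform = [budget / len(change_rates)] * len(change_rates)

# Proportional: visit each page in proportion to its change rate.
total_rate = sum(change_rates)
proportional = [budget * lam / total_rate for lam in change_rates]

uniform_f = database_freshness(change_rates, uniform)
proportional_f = database_freshness(change_rates, proportional)

print(f"uniform:      {uniform_f:.3f}")
print(f"proportional: {proportional_f:.3f}")
```

Under these assumptions the uniform split yields the higher average freshness, which agrees with the slide's point that visiting the slow-changing page is the better deal.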

Weighted Freshness Metrics
- To capture the concept, each element e_i is given a weight w_i
- Freshness:
  \[ F(S; t) = \frac{\sum_{i=1}^{N} w_i\, F(e_i; t)}{\sum_{i=1}^{N} w_i} \]
- Age:
  \[ A(S; t) = \frac{\sum_{i=1}^{N} w_i\, A(e_i; t)}{\sum_{i=1}^{N} w_i} \]

Experimental Setup
- 270 sites visited
  - Identified the 400 sites with the highest "PageRank"
  - Contacted the site administrators
- February 17 to June 24, 1999
- 3,000 pages from each site daily
  - Start at the root, visit breadth first (get new and old pages)
  - Ran only from 9pm to 6am, with 10 seconds between requests to a site

Change interval of pages
[Figure: change intervals of pages that change about every 10 days]

Results
- The data indicate a Poisson curve, as predicted
- Constraint: web pages were crawled on a daily basis
- The model could not be verified for pages that change:
  - Very often
  - Less frequently
- At the typical crawling rate of search engines (for example, Google), the exact change behavior of these pages is of only relative importance

Experiment 2: Synchronization-order
- Selected pages with an average change frequency of two weeks
- Simulated multiple crawls:
  - Once a day
  - Once every week
  - Once every month
  - Once every two months
- Assumed each page changed in the middle of the day

Synchronization-order policy
[Figure: results for the synchronization-order policies]

Results
- Theoretical implications
  - How can we measure how fresh a local database is?
  - How can we guarantee a certain freshness of a local database?
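One way to approach the first question, measuring how fresh a local copy is, can be sketched with a small simulation. It borrows the setup of Experiment 2 (pages that change every two weeks on average, crawled once a day, week, month, or two months) but replaces the real page traces with simulated Poisson changes, which is an assumption of this sketch and not what the experiment itself did. It models only a page revisited at a fixed interval, not the random-order policies. All names here are hypothetical.

```python
import math
import random

def simulated_freshness(mean_change_interval, sync_interval,
                        cycles=100_000, seed=0):
    """Fraction of time a local copy is up-to-date, by simulation.

    Assumes Poisson changes: after each synchronization, the time until
    the next change is exponentially distributed (memoryless), and the
    copy stays fresh until that change or the next synchronization.
    """
    rng = random.Random(seed)
    rate = 1.0 / mean_change_interval
    fresh_time = 0.0
    for _ in range(cycles):
        first_change = rng.expovariate(rate)
        fresh_time += min(first_change, sync_interval)
    return fresh_time / (cycles * sync_interval)

def analytic_freshness(mean_change_interval, sync_interval):
    """Closed form under the same assumptions: (1 - e^-x) / x."""
    x = sync_interval / mean_change_interval
    return (1.0 - math.exp(-x)) / x

# Intervals are in days; 14.0 is the two-week average change interval.
crawl_intervals = [("daily", 1.0), ("weekly", 7.0),
                   ("monthly", 30.0), ("bimonthly", 60.0)]

results = {}
for label, interval in crawl_intervals:
    sim = simulated_freshness(14.0, interval)
    ana = analytic_freshness(14.0, interval)
    results[label] = (sim, ana)
    print(f"{label:10s} simulated={sim:.3f} analytic={ana:.3f}")
```

Fixing the seed keeps runs repeatable; comparing the simulated value with the closed form is a quick sanity check that the simulation models the process it claims to.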

Experiment 3: Frequency of Change
- Average change interval of a page
  - Estimated by dividing the monitoring period by the number of detected changes in the page
  - Example: a page changed 4 times in a 4-month period
  - Estimated average change interval of the page: 4 months / 4 = 1 month

Frequency of Change
[Figure: frequency of change]

Results
- Pages maintained at commercial sites are updated frequently
- The estimate gives a reasonable average change interval for most pages
- The estimate may not be accurate:
  - If a page changes more than once a day
  - If a page changes several times in one day, but then remains static for a week

Experiment 4: Resource-Allocation
- How frequently do we synchronize each group?
- From the previous experiment:
  - 23% of pages change every day
  - 15% change every week
  - Pages that did not change for 4 months are assumed to change every year
- Tests for:
  - Proportional
  - Uniform
  - Optimal

Resource Allocation Policy
[Figure: results for the resource allocation policies]

Results
- The proportional policy performs very poorly when pages change very often
- The optimal policy becomes relatively more effective than the uniform policy
- Lesson learned: the optimal policy performs better for monitoring frequently changing information

Conclusion
- Proportional synchronization policy
  - Intuitively appealing
  - Does not work well
- Optimal policies
  - Improve freshness and age significantly on real web data

Conclusion
- Two metrics
  - "Freshness"
  - "Age"
- Synchronization policies
  - Synchronization frequency
  - Resource allocation
  - Synchronization order
  - Synchronization points

References
- http://oak.cs.ucla.edu/~cho/ : Interesting information: talks, experiments, publications, course material
- http://www.googleguide.com/google_works.html : How Google works
- http://www.google.com/remove.html : Google indexing tips
- http://www.seorank.com/google-pagerank.htm : Google PageRank algorithm explained