Effective Page Refresh Policies for Web Crawlers

Written By:
Junghoo Cho & Hector Garcia-Molina
Presenter:
Suchitra Manepalli
Jan 27, 2005
791 Digital Preservation Seminar
Searching on the Web
Information Overload
Indexing
Google, Alta-Vista
Integration
BizRate
Focus on Indexing
How Google Works?
Copyright © 2003 Google Inc.
Crawling
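[Figure: crawler flow. init seeds the to-visit urls with the initial urls; the crawler repeatedly gets the next url, gets the page from the web, records the visited urls, and extracts urls from the stored web pages. Taken from Cho's thesis.]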
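The crawl loop in the figure can be sketched as follows. This is a minimal illustration, not the crawler from the thesis: `crawl`, `toy_get_page`, `toy_extract_urls`, and the in-memory `TOY_WEB` are hypothetical names, and the toy web stands in for real HTTP requests so the sketch runs offline.

```python
from collections import deque

def crawl(initial_urls, get_page, extract_urls, max_pages=100):
    """Breadth-first crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)       # "to visit urls", seeded by init
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages" collected so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)             # get page from the web
        if page is None:                 # unreachable or missing page
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls, queue the unseen ones
            if link not in visited:
                to_visit.append(link)
    return pages

# Toy in-memory "web" (an assumption for illustration only).
TOY_WEB = {
    "a": "links: b c",
    "b": "links: c",
    "c": "links: a d",
}

def toy_get_page(url):
    return TOY_WEB.get(url)

def toy_extract_urls(page):
    return page.split()[1:]

crawled = crawl(["a"], toy_get_page, toy_extract_urls)
print(sorted(crawled))
```

The url `d` is linked but has no page in the toy web, so it is visited without being stored.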
Challenges
• Page selection and scrape
What page to scrape?
• Page and index update
How to update pages?
• Page ranking
What page is “important” or “relevant”?
Determine “Canonical” copy?
• Scalability
What is the maximum number of pages that we can afford to “index”?
Focusing
• Page and index update
How to update pages?
Presentation Outline
Introduction
Problems
Framework – Effective Solutions
Different policies
Weighted Freshness
Experiments
Conclusion
Introduction
• Between crawls, web sites change nondeterministically
• Main issue
• How often do we crawl?
Questions
How can we keep pages fresh?
What are “fresh” pages?
How often should the index be maintained?
What constraints are imposed?
What are the refresh policies?
How effective are the refresh policies?
“Freshness”
Assuming each element is equally important
Freshness of element e_i at time t is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
Freshness of the database S at time t is
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
[Figure: the database holds local copies of the elements e_1 ... e_N found on the web]
“Age”
Assume equal importance of pages
Age of element e_i at time t is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
Age of the database S at time t is
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
[Figure: the database holds local copies of the elements e_1 ... e_N found on the web]
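The two metrics can be computed directly from their definitions. The sketch below assumes a hypothetical record per element, a dict whose `changed_at` field is the time of the first change since the last synchronization, or `None` if the local copy is still up-to-date. The field name and the sample data are illustrative only.

```python
def freshness(elements):
    """F(S; t): fraction of elements whose local copy is up-to-date."""
    fresh = sum(1 for e in elements if e["changed_at"] is None)
    return fresh / len(elements)

def age(elements, t):
    """A(S; t): average over elements of 0 (up-to-date) or t - modification time."""
    total = sum(0.0 if e["changed_at"] is None else t - e["changed_at"]
                for e in elements)
    return total / len(elements)

# Four elements observed at time t = 10 (arbitrary time units).
db = [
    {"changed_at": None},   # up-to-date
    {"changed_at": None},   # up-to-date
    {"changed_at": 3.0},    # stale for 7 time units
    {"changed_at": 8.0},    # stale for 2 time units
]
print(freshness(db))   # 2 of 4 elements are fresh
print(age(db, 10.0))   # (0 + 0 + 7 + 2) / 4
```

Freshness only counts whether a copy is stale, while age also penalizes copies that have been stale for a long time.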
“Freshness” and “Age”
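[Figure: F(e_i) and A(e_i) over time. At an update, F(e_i) drops from 1 to 0 and A(e_i) starts growing from 0; at a refresh, F(e_i) returns to 1 and A(e_i) resets to 0.]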
Poisson process
Real world
Elements are modified by a Poisson process
Changes happen randomly and independently, at a fixed rate over time
“Expected” - Variables
T: time at which the next event occurs in a Poisson process with change rate λ
Probability that e_i changes at least once in the time interval (0, t] is
$$\Pr\{T \le t\} = 1 - e^{-\lambda t}$$
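For a Poisson process with rate λ, the probability of at least one event in (0, t] is 1 − e^{−λt}. A small numeric sketch, where `prob_changed` and the example rate of one change per 10 days are illustrative assumptions:

```python
import math

def prob_changed(lam, t):
    """Probability that a Poisson process with rate lam fires at least once in (0, t]."""
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-lam * t)

lam = 0.1  # assumed rate: one change per 10 days on average
for days in (1, 10, 30):
    print(days, round(prob_changed(lam, days), 4))
```

Even after a full average change interval (10 days here), the page has changed with probability of only about 0.63, not 1.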
“Expected” - Equations
Expected Freshness
Expected Age
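The slide's equations did not survive extraction. Under the Poisson model, and assuming e_i was synchronized at time 0 and is not synchronized again before time t, the expected values can be worked out as below. This is a derivation under those stated assumptions, not a copy of the original slide.

```latex
% e_i is still fresh at time t only if no change occurred in (0, t]
\mathrm{E}[F(e_i; t)] = 0 \cdot (1 - e^{-\lambda t}) + 1 \cdot e^{-\lambda t}
                      = e^{-\lambda t}

% if the first change occurs at time s <= t (density \lambda e^{-\lambda s}),
% the age at time t is t - s
\mathrm{E}[A(e_i; t)] = \int_0^t (t - s)\, \lambda e^{-\lambda s}\, ds
                      = t - \frac{1 - e^{-\lambda t}}{\lambda}
```

Expected freshness decays exponentially after a synchronization, while expected age grows and approaches t − 1/λ for large t.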
“Expected” - Graphs
Evolution Model of Database
Uniform Change Frequency Model
All real-world elements change at the same
frequency λ
Individual element changes over time
All elements change at the same average rate
Non-Uniform Change Frequency Model
Elements change at different rates
Histogram of Change Frequencies
Synchronization Policies
Synchronization Frequency
Resource Allocation
Synchronization Order
Synchronization Points
Synchronization Policies
Synchronization Frequency
How frequently do we synchronize the database?
The more often we synchronize, the fresher the database
Resource Allocation
How frequently should we synchronize each individual element?
Uniform Allocation Policy
Non-Uniform Allocation Policy
Synchronization Policies
Synchronization Order
In what order should we synchronize the elements?
Fixed order
Same order repeatedly
Random order
Synchronization order is different in each iteration
Purely random
At each synchronization point, we select a random
element from the database and synchronize it
Synchronization Policies
Synchronization Points
Synchronization Order - Policies
• Fixed order policy
• Random order policy
• Purely random policy
Comparison
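The three synchronization orders can be compared with a small simulation. The sketch below is an illustration under stated assumptions, not the analysis from the paper: changes follow a Poisson process with rate `lam`, one sweep of the database takes `interval` time units, and the database is large enough that an element's position in a random sweep is uniform and its purely random sync gaps are roughly exponential. All function names are hypothetical.

```python
import math
import random

def avg_freshness(gaps, lam):
    """Time-averaged expected freshness of one element, given its sync gaps.

    After a sync the element is still fresh at elapsed time s with
    probability exp(-lam * s), so a gap g contributes
    (1 - exp(-lam * g)) / lam units of expected fresh time.
    """
    fresh_time = sum(-math.expm1(-lam * g) / lam for g in gaps)
    return fresh_time / sum(gaps)

def gaps_fixed_order(cycles, interval):
    # Same order in every sweep: the element is synced at a constant interval.
    return [interval] * cycles

def gaps_random_order(cycles, interval, rng):
    # New random position in every sweep (large-database approximation).
    times = [(k + rng.random()) * interval for k in range(cycles)]
    return [b - a for a, b in zip(times, times[1:])]

def gaps_purely_random(cycles, interval, rng):
    # A random element at every sync point: exponential gaps with mean `interval`.
    return [rng.expovariate(1.0 / interval) for _ in range(cycles)]

rng = random.Random(0)
lam, interval, cycles = 1.0, 1.0, 100_000
fixed = avg_freshness(gaps_fixed_order(cycles, interval), lam)
rand_order = avg_freshness(gaps_random_order(cycles, interval, rng), lam)
purely_random = avg_freshness(gaps_purely_random(cycles, interval, rng), lam)
print(fixed, rand_order, purely_random)
```

With the same number of syncs per element, the more regular the gaps, the higher the freshness in this sketch: fixed order comes first, then random order, then purely random.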
Resource Allocation Policies
What can we do if the elements change at
different rates and we know how often
each element changes?
Is it better to synchronize an element more
often when it changes more often?
Is it better to synchronize equally?
Trick Question
Two page database
e1 changes daily
e2 changes once a week
We can visit one page per week
How should we visit pages?
e1 e2 e1 e2 e1 e2 e1 e2... [uniform]
e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
Proportional is often not good
Visit fast-changing e1
• get 1/2 day of freshness
Visit slow-changing e2
• get 1/2 week of freshness
Visiting e2 is a better deal!
Uniform versus Proportional
Intuitively, we might assume that the proportional allocation policy performs better than the uniform policy
In the two-element database, the uniform policy is actually better
To improve freshness we should penalize the elements that change too often
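The two-page example can be checked numerically. The sketch below swaps in a Poisson change model with evenly spaced visits, which is an assumption and not the exact argument behind the "1/2 day" and "1/2 week" figures. Rates are in changes per week, the budget is one visit per week, and the helper names are hypothetical.

```python
import math

def freshness_fixed_interval(lam, f):
    """Time-averaged freshness of an element changing at Poisson rate lam
    when it is synchronized f times per unit time at regular intervals."""
    if f == 0:
        return 0.0          # never synchronized: stale in the long run
    r = lam / f             # expected number of changes between two visits
    return -math.expm1(-r) / r

def database_freshness(rates, freqs):
    return sum(freshness_fixed_interval(lam, f)
               for lam, f in zip(rates, freqs)) / len(rates)

rates = [7.0, 1.0]   # e1 changes daily, e2 once a week (changes per week)
budget = 1.0         # one page visit per week in total

uniform = [budget / len(rates)] * len(rates)
proportional = [budget * lam / sum(rates) for lam in rates]

uniform_f = database_freshness(rates, uniform)
proportional_f = database_freshness(rates, proportional)
print(uniform_f, proportional_f)
```

Under the proportional policy every element gets the same ratio of change rate to visit rate, so both pages end up equally, and poorly, fresh. The uniform policy spends more of the budget on e2, where each visit buys more freshness.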
Weighted Freshness
What if elements have different importance?
How do we synchronize the elements to maximize the freshness of the database as perceived by the users?
Should we refresh one element more often than another?
Weighted Freshness Metrics
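• To capture this concept, each element e_i is given a weight w_i
• Freshness:
$$F(S; t) = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i F(e_i; t)$$
• Age:
$$A(S; t) = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i A(e_i; t)$$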
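A minimal sketch of weighted metrics, assuming they are weighted averages of the per-element freshness and age with weights w_i. The function names and sample weights are illustrative.

```python
def weighted_freshness(fresh_flags, weights):
    """Weighted average of per-element freshness values (each 0 or 1)."""
    return sum(w * f for f, w in zip(fresh_flags, weights)) / sum(weights)

def weighted_age(ages, weights):
    """Weighted average of per-element ages."""
    return sum(w * a for a, w in zip(ages, weights)) / sum(weights)

fresh_flags = [1, 0, 1]        # second element is stale
ages = [0.0, 4.0, 0.0]         # it has been stale for 4 time units
weights = [5.0, 1.0, 1.0]      # first element matters most to users

print(weighted_freshness(fresh_flags, weights))   # 6/7
print(weighted_age(ages, weights))                # 4/7
```

With equal weights the same database would score 2/3 on freshness. The weighted score is higher because the stale element is an unimportant one.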
Experimental Setup
270 sites visited
identified 400 sites with highest “PageRank”
contacted administrators
February 17 to June 24, 1999
3,000 pages from each site daily
start at root, visit breadth first (get new & old
pages)
ran only 9pm - 6am, 10 seconds between site
requests
Change interval of pages
For pages that change every 10 days on average
Results
Results follow a Poisson curve, as predicted
Constraint:
Web pages were crawled on a daily basis
Not verified for pages that change:
Very often
Less frequently
Given the typical crawling rate of search engines, the exact change model for such pages is of only relative importance
For example: Google
Experiment 2: Synchronization-order
Selected pages with an average change interval of two weeks
Simulated multiple crawls:
Once a day
Once every week
Once every month
Once every two months
Assumed each page changed in the middle of the day
Synchronization-order policy
Results
Theoretical implications
How can we measure how fresh a local
database is?
How can we guarantee certain freshness of a
local database?
Experiment 3: Frequency of Change
Average change interval of a page
• Divide the monitoring period by the number of detected changes in the page
• Example: a page changed 4 times in a 4-month period
Estimated average change interval of the page: 4 months / 4 = 1 month
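The estimate above is a single division. The sketch below also shows how changes might be counted from daily snapshots, where the function names and the snapshot representation are assumptions for illustration. A daily crawl can detect at most one change per day, no matter how many times the page actually changed.

```python
def count_detected_changes(daily_snapshots):
    """Number of days on which the page differs from the previous day's copy."""
    return sum(1 for prev, cur in zip(daily_snapshots, daily_snapshots[1:])
               if prev != cur)

def estimate_change_interval(monitoring_period, detected_changes):
    """Monitoring period divided by the number of detected changes."""
    if detected_changes == 0:
        return None   # no change observed: the interval cannot be estimated
    return monitoring_period / detected_changes

# Slide example: 4 detected changes in a 4-month period.
print(estimate_change_interval(4, 4))            # in months

# Five daily snapshots with two detected changes (v1 -> v2 -> v3).
snapshots = ["v1", "v1", "v2", "v2", "v3"]
print(count_detected_changes(snapshots))
```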
Frequency of Change
Results
Pages maintained at commercial sites:
Updated frequently
The estimate gives a reasonable average change interval for most pages
Estimation may not be accurate
If a page changes more than once a day
If a page changes several times in one day, but then remains static for a week
Experiment 4: Resource-Allocation
• How frequently do we synchronize each group?
• Previous experiment:
23% of pages change every day
15% change every week
Pages that did not change for 4 months are assumed to change once a year
• Tests for:
Proportional
Uniform
Optimal
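One way to approximate an optimal allocation is sketched below. It is a numerical illustration under stated assumptions, not the procedure used in the experiments: changes are Poisson, each element is visited at regular intervals, and total freshness is maximized for a fixed visit budget by equalizing the marginal freshness gain per visit across elements (a Lagrange-multiplier condition). All names are hypothetical.

```python
import math

def marginal_gain(lam, f):
    """dF/df for F(lam, f) = (f / lam) * (1 - exp(-lam / f))."""
    r = lam / f
    return (1.0 - (1.0 + r) * math.exp(-r)) / lam

def freq_for_multiplier(lam, mu):
    """Visit frequency at which the marginal gain equals mu.

    The gain never exceeds 1 / lam, so an element whose 1 / lam is below mu
    is not worth visiting at all and gets frequency 0.
    """
    if mu >= 1.0 / lam:
        return 0.0
    lo, hi = 1e-9, 1e9            # the gain decreases as f grows
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on a log scale
        if marginal_gain(lam, mid) > mu:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def optimal_frequencies(rates, budget):
    """Find the multiplier mu at which the frequencies sum to the budget."""
    lo, hi = 0.0, max(1.0 / lam for lam in rates)
    for _ in range(200):
        mu = (lo + hi) / 2.0
        total = sum(freq_for_multiplier(lam, mu) for lam in rates)
        if total > budget:
            lo = mu               # over budget: demand a higher gain per visit
        else:
            hi = mu
    return [freq_for_multiplier(lam, hi) for lam in rates]

rates = [7.0, 1.0]                # changes per week: daily and weekly pages
freqs = optimal_frequencies(rates, budget=1.0)
print(freqs)
```

For the two-page example this sketch puts the whole budget on the slowly changing page and none on the daily one, which matches the point that elements changing too often should be penalized.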
Resource Allocation Policy
Results
The proportional policy performs very poorly when pages change very often
The optimal policy becomes relatively more effective than the uniform policy
Lesson learned:
The optimal policy performs better for monitoring frequently changing information
Conclusion
Proportional Synchronization Policy
Intuitively appealing
Does not work well
Optimal policies
Improve freshness and age significantly on real web data
Conclusion
Two Metrics
“Freshness”
“Age”
Synchronization Policies
Synchronization Frequency
Resource Allocation
Synchronization Order
Synchronization Points
References
• http://oak.cs.ucla.edu/~cho/
Interesting information: talks, experiments, publications, course material
• http://www.googleguide.com/google_works.html
How Google works
• http://www.google.com/remove.html
Google indexing tips
• http://www.seorank.com/google-pagerank.htm
Google PageRank algorithm explained