Suman Narsing's Slides

CS-791/891--Preservation of Digital Objects and Collections
Estimating Frequency of Change
Written By
Junghoo Cho, Hector Garcia-Molina
Presented By
Suman Kumar Narsing.
Topics covered in this presentation:
1. INTRODUCTION
2. TAXONOMY OF ISSUES
3. PRELIMINARIES
4. ESTIMATION OF FREQUENCY: EXISTENCE OF CHANGE
5. ESTIMATION OF FREQUENCY: LAST DATE OF CHANGE
6. EXPERIMENTS
7. CONCLUSION
1. INTRODUCTION:
• Many data sources are now available online.
• These sources are autonomous and are updated independently.
• Examples: CNN, the NY Times, online stores, etc.
• Because the sources are updated autonomously, clients do not
know exactly when or how often they change.
APPLICATIONS WHOSE EFFECTIVENESS AN ESTIMATED CHANGE FREQUENCY CAN IMPROVE:
Improving a Web crawler.
Improving the update policy of a data warehouse.
Improving Web caching.
Data mining.
CHALLENGES IN ESTIMATING THE FREQUENCY OF CHANGE:
Incomplete change history.
Irregular access interval.
Difference in available information.
EXAMPLE 1:
• A Web crawler accessed a page once a day for 10 days and
detected 6 changes. From this data, a naive estimate of the
change frequency is 6/10 = 0.6 times a day.
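The arithmetic in Example 1 can be sketched in Python. The function name `naive_change_frequency` and the 10-day log below are illustrative assumptions, not taken from the paper. Note that a daily visit can detect at most one change, so this figure can never exceed one change per day.

```python
def naive_change_frequency(change_detected):
    """Detected changes divided by the number of (daily) visits."""
    if not change_detected:
        raise ValueError("need at least one visit")
    return sum(change_detected) / len(change_detected)

# Hypothetical 10-day log: True means the crawler saw a change on that visit.
visits = [True, False, True, True, False, True, False, True, True, False]
print(naive_change_frequency(visits))  # 0.6
```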
EXAMPLE 2:
• In a Web cache, a user accessed a page 4 times, on day 1,
day 2, day 7 and day 10, and the cache detected changes on
day 2 and day 7. What does this imply? Does the page change
every 10/2 = 5 days on average?
2. TAXONOMY OF ISSUES:
• What do we mean by “change of an element”?
- What does “element” mean?
- What does “change” mean?
Here an element is a Web page, and a change is any modification
to the page.
Developing Taxonomy:
• How do we trace the history of an element?
- Passive monitoring
- Active monitoring
  - Regular interval
  - Random interval
• What information do we have?
- Complete history of changes
- Last date of change
- Existence of change
• How do we use the estimated frequency?
- Estimation of frequency
- Categorization of frequency
3. PRELIMINARIES:
Poisson process: the model for the changes of an element.
X(t): the number of changes in the interval (0, t].
λ: the rate (frequency) of the Poisson process.
For s ≥ 0 and t > 0, the random variable X(s+t) − X(s) has the Poisson probability
distribution
$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k}{k!} e^{-\lambda t}$ for k = 0, 1, ...
The number of events expected to occur in a unit interval:
$E[X(t+1) - X(t)] = \sum_{k=0}^{\infty} k \Pr\{X(t+1) - X(t) = k\} = \sum_{k=0}^{\infty} k \frac{\lambda^k e^{-\lambda}}{k!} = \lambda$
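As a numerical sanity check of the Poisson formulas, the following sketch evaluates the probability mass function and confirms that the expected number of events in a unit interval equals λ. The rate 0.6 and the truncation of the infinite sum at k = 50 are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam, t=1.0):
    """Pr{X(s+t) - X(s) = k} for a Poisson process with rate lam."""
    mu = lam * t
    return mu ** k * math.exp(-mu) / math.factorial(k)

lam = 0.6  # assumed rate: 0.6 changes per unit interval
# Truncate the infinite sums at k = 50; the tail is negligible for small lam.
total = sum(poisson_pmf(k, lam) for k in range(50))
mean = sum(k * poisson_pmf(k, lam) for k in range(50))
print(round(total, 6), round(mean, 6))  # 1.0 0.6
```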
Graphs explaining the importance of λ:
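Estimator: $\hat{\lambda} = X/T$, where X is the number of changes detected
and T is the monitoring period.
• The distribution of $\hat{\lambda}$ determines how effective the
estimator is, judged by: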
a) Bias.
b) Efficiency.
c) Consistency.
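For reference, the three criteria can be written in standard statistical terms. These are textbook definitions rather than notation taken from the slides, with $\hat{\lambda}$ denoting the estimator and n the number of observations:

```latex
\begin{align*}
\text{Bias:} \quad & E[\hat{\lambda}] - \lambda
    \quad (\text{the estimator is unbiased if this is } 0) \\
\text{Efficiency:} \quad & \text{a smaller } \mathrm{Var}[\hat{\lambda}]
    \text{ means a more efficient estimator} \\
\text{Consistency:} \quad & \lim_{n \to \infty}
    \Pr\{\,|\hat{\lambda} - \lambda| \le \epsilon\,\} = 1
    \quad \text{for every } \epsilon > 0
\end{align*}
```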
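4. ESTIMATION OF FREQUENCY: EXISTENCE OF CHANGE:
• The element is accessed n times at a regular interval I (access
frequency f = 1/I), so the total time elapsed is T = nI = n/f.
• From now on we estimate the frequency ratio r = λ/f, using the estimator
$\hat{r} = \hat{\lambda}/f = \frac{1}{f} \cdot \frac{X}{T} = \frac{X}{n}$.
X is measured by repeated accesses to the element: it is the number
of accesses at which a change was detected.
• Is the estimator $\hat{r}$ biased?
Theorem 4.1 The expected value of the estimator $\hat{r}$ is
$E[\hat{r}] = 1 - e^{-r}$
• Is the estimator $\hat{r}$ consistent?
• How efficient is the estimator?
Corollary 4.2 The standard deviation of the estimator $\hat{r} = X/n$
is $\sigma[\hat{r}] = \sqrt{e^{-r}(1 - e^{-r})/n}$.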
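A quick Monte Carlo check of the bias and spread of X/n is sketched below. It assumes Poisson changes and regular accesses, so each access independently detects a change with probability 1 − e^(−r) and X is binomial; the standard deviation formula used for comparison follows from that binomial model. The function name, seed, and parameter values are illustrative assumptions.

```python
import math
import random

def simulate_naive_ratio(r, n, trials, seed=0):
    """Simulate r_hat = X/n, where each of n accesses detects a change
    with probability 1 - exp(-r). Returns (mean, std) over all trials."""
    rng = random.Random(seed)
    p = 1.0 - math.exp(-r)
    estimates = []
    for _ in range(trials):
        x = sum(rng.random() < p for _ in range(n))
        estimates.append(x / n)
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    return mean, math.sqrt(var)

r, n = 1.0, 10
mean, std = simulate_naive_ratio(r, n, trials=20000)
expected_mean = 1.0 - math.exp(-r)  # about 0.63, well below the true r = 1
expected_std = math.sqrt(math.exp(-r) * (1.0 - math.exp(-r)) / n)
print(f"simulated mean {mean:.3f}, formula {expected_mean:.3f}")
print(f"simulated std  {std:.3f}, formula {expected_std:.3f}")
```

With r = 1 the simulated mean stays near 0.63 however many trials are run, which illustrates why X/n is a biased estimator of r.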
5. ESTIMATION OF FREQUENCY: LAST DATE OF CHANGE
• Let T be the time to the previous event in a Poisson process
with rate λ. Then the expected value of T is E[T] = 1/λ.
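A short derivation of this expected value, assuming the process has been running long enough that looking backward from the access time is not cut off by the start of observation: T exceeds t exactly when no event occurred in the last t time units, which has Poisson probability $e^{-\lambda t}$.

```latex
\begin{align*}
\Pr\{T > t\} &= \Pr\{\text{no event in the last } t \text{ time units}\}
              = e^{-\lambda t}, \\
E[T] &= \int_0^{\infty} \Pr\{T > t\}\, dt
      = \int_0^{\infty} e^{-\lambda t}\, dt
      = \frac{1}{\lambda}.
\end{align*}
```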
• The new estimator consists of three functions.
a) Init()
b) Update()
c) Estimate()
The estimator using last-modified dates:
Init()                    /* initialize variables */
    N = 0;                /* total number of accesses */
    X = 0;                /* number of detected changes */
    T = 0;                /* sum of the times from changes */
Update(Ti, Ii)            /* update variables */
    N = N + 1;
    /* Has the element changed? */
    If (Ti < Ii) then
        /* The element has changed. */
        X = X + 1;
        T = T + Ti;
    else
        /* The element has not changed. */
        T = T + Ii;
Estimate()                /* return the estimated lambda */
    return X/T;
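The pseudocode above can be turned into a runnable Python sketch. The class name, the guard against division by zero, and the sample observations are assumptions added for illustration; `ti` is read here as the time since the last modification seen at the i-th access and `ii` as the interval since the previous access.

```python
class LastModifiedEstimator:
    """Estimates the change rate from last-modified dates (Init/Update/Estimate)."""

    def __init__(self):
        self.n = 0    # total number of accesses
        self.x = 0    # number of detected changes
        self.t = 0.0  # sum of the times from changes

    def update(self, ti, ii):
        self.n += 1
        if ti < ii:
            # The element changed since the previous access.
            self.x += 1
            self.t += ti
        else:
            # No change: only the access interval is known to be change-free.
            self.t += ii

    def estimate(self):
        if self.t == 0:
            raise ValueError("no observation time accumulated yet")
        return self.x / self.t

est = LastModifiedEstimator()
# Hypothetical observations in days: (time since last change, access interval).
for ti, ii in [(0.5, 1.0), (3.0, 1.0), (0.25, 2.0), (5.0, 1.0)]:
    est.update(ti, ii)
print(est.estimate())  # 2 changes / (0.5 + 1.0 + 0.25 + 1.0) days
```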
6. EXPERIMENTS:
• Non-Poisson model.
• Improvement from last modification date.
• Effectiveness of estimators for real Web data.
COMPARISON OF NAÏVE ESTIMATOR AND OURS
• Application to a Web crawler:
- Uniform policy.
- Naïve policy.
- Our policy.
7. CONCLUSION:
Future work:
• Adaptive Scheme:
• Changing λ
REFERENCES:
• Junghoo Cho and Hector Garcia-Molina. "Estimating
frequency of change." ACM Transactions on Internet
Technology, 3(3), August 2003.
http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf
THANK YOU