Understanding Churn in Peer-to-Peer Networks
Daniel Stutzbach – University of Oregon
Reza Rejaie – University of Oregon
Internet Measurement Conference
Rio de Janeiro, Brazil
October 26th, 2006
Motivation
• P2P systems are very popular in practice.
    • Several million simultaneous users collectively
    • 60% of all Internet traffic [CacheLogic Research '05]
• Churn is an important property to model.
    • Outside the control of the designer
    • Needed for simulation or analysis
• Churn is hard to measure.
    • Requires continuous monitoring
    • Many potential pitfalls
    • Prior results are contradictory.
Daniel Stutzbach
The ION P2P Project
http://mirage.cs.uoregon.edu/P2P
Slide 2/21
Talk Outline
• Datasets
    • Gnutella
    • Kad
    • BitTorrent
• Pitfalls
    • Lots!
• Characterizations
    • Inter-arrival distribution
    • Session-length distribution
    • Others
Datasets
• Data necessities
    • Arrival time of peers
    • Departure time of peers
• Quality of data
    • The precision of timestamps
    • How representative the sessions are
• We use data from 3 different P2P systems:
    • Gnutella (unstructured file-sharing)
    • Kad (DHT file-sharing)
    • BitTorrent (unstructured content-delivery)
Datasets: Gnutella
• More than 1 million simultaneous users
• Complete snapshots gathered with Cruiser [Stutzbach 05 Global Internet]
• Using many back-to-back snapshots, we can determine arrival & departure times.
• Approximately 7-minute granularity
• Five datasets, each spanning 48 hours
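The idea of deriving arrival and departure times from back-to-back snapshots can be sketched as follows. This is only an illustration under stated assumptions, not the Cruiser tooling: the function name, the `(time, peer_ids)` snapshot format, and the choice to skip peers already present in the first snapshot (their arrival time is unknown) are all hypothetical.

```python
def sessions_from_snapshots(snapshots):
    """Derive sessions from a list of (capture_time, set_of_peer_ids) snapshots.

    Hypothetical helper. A peer "arrives" at the first snapshot that contains
    it and "departs" at the first later snapshot that does not, so timestamps
    are only as precise as the snapshot interval.
    Returns (complete_sessions, still_open), where complete_sessions is a list
    of (peer, arrival, departure) and still_open maps peer -> arrival.
    """
    complete = []
    arrival = {}  # peer -> time of the snapshot where it first appeared
    # Peers in the first snapshot arrived at an unknown earlier time.
    left_censored = set(snapshots[0][1]) if snapshots else set()
    for time, peers in snapshots:
        # Tracked peers missing from this snapshot have departed.
        for peer in [p for p in arrival if p not in peers]:
            complete.append((peer, arrival.pop(peer), time))
        # A left-censored peer that disappears may return later as a new session.
        left_censored &= peers
        for peer in peers:
            if peer not in arrival and peer not in left_censored:
                arrival[peer] = time
    return complete, dict(arrival)
```

Sessions still open at the last snapshot are returned separately, since their departure time is unknown.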

Datasets: Kad
• Kad is a DHT used by eMule.
• Approximately 1 million simultaneous users
• We monitor a zone of the DHT address space.
• Each peer selects its ID uniformly at random, so any zone is representative.
• Four datasets, each spanning 48 hours
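A small sketch of why a zone is a representative sample: if a zone is taken to be all IDs sharing a high-order bit prefix, and IDs are uniform, then a zone with a b-bit prefix holds roughly 1/2^b of the peers regardless of which prefix is chosen. The 128-bit ID width, the prefix-based zone definition, and the function names below are assumptions for illustration; the slides do not specify them.

```python
import random

ID_BITS = 128  # assumed identifier width; not stated in the slides


def in_zone(peer_id, zone_prefix, prefix_bits):
    """True if the top `prefix_bits` bits of peer_id equal zone_prefix."""
    return peer_id >> (ID_BITS - prefix_bits) == zone_prefix


def sample_zone_fraction(num_peers, zone_prefix, prefix_bits, seed=0):
    """Fraction of uniformly random IDs that fall in the given zone."""
    rng = random.Random(seed)
    ids = [rng.getrandbits(ID_BITS) for _ in range(num_peers)]
    return sum(in_zone(i, zone_prefix, prefix_bits) for i in ids) / num_peers
```

For example, `sample_zone_fraction(20000, 0b1010, 4)` should come out close to 1/16, and the same holds for any other 4-bit prefix.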

Datasets: BitTorrent
• BitTorrent is used for transferring files.
• A collection of P2P overlays, rather than one big network
• Each peer periodically contacts a centralized point (the tracker).
• Tracker logs reveal arrival and departure information with 1-second granularity.
• Three sets of data:
    • Red Hat ISO image
    • Debian ISO images
    • A demo of the game FlatOut, from 3dgamers.com
Pitfalls
• Problems can occur:
    • When gathering data
    • When cleaning data
    • When analyzing data
• Specific pitfalls:
    • Missing Data
    • False Negatives
    • NAT
    • Long Sessions
    • Biased Peer Selection
    • Handling Brief Events
Pitfalls: Missing Data
• No significant gaps in Gnutella or Kad data
• BitTorrent logs have significant gaps!
    • Nearly all events are followed by another within 4 minutes.
    • A gap of several hours is highly suspect.
    • We use the longest continuous segment.
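Selecting the longest continuous segment of a log might look like the sketch below. The function name and the `max_gap` threshold are hypothetical; the slides only say that nearly all events are followed by another within 4 minutes and that gaps of several hours are suspect, so the actual cutoff used by the authors is not known here.

```python
def longest_continuous_segment(event_times, max_gap):
    """Return the longest run of events with no gap larger than max_gap.

    "Longest" is measured by duration (last event minus first event).
    event_times and max_gap must use the same unit, e.g. seconds.
    """
    times = sorted(event_times)
    if not times:
        return []
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(times)):
        if times[i] - times[i - 1] > max_gap:
            # The segment [start, i-1] just ended; keep it if it is the longest.
            if times[i - 1] - times[start] > times[best_end] - times[best_start]:
                best_start, best_end = start, i - 1
            start = i
    # Account for the final segment, which runs to the end of the log.
    if times[-1] - times[start] > times[best_end] - times[best_start]:
        best_start, best_end = start, len(times) - 1
    return times[best_start:best_end + 1]
```

Sessions would then be extracted only from events inside the returned segment.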
Pitfalls: False Negatives
• What if a peer is missing from a crawl, but did not really depart?
    • Small chance per crawl (p)
    • Compounded after n crawls → high chance: $1 - (1 - p)^n \approx 1 - e^{-pn}$
• If p = 10%, at most we would observe 1 in 3.1 trillion sessions longer than 1 day.
• Since we observe many more than that, p must be lower.
• We can compute an upper bound of p = 1.8%.
    • But this would only occur if all sessions were longer than 1 day.
    • In practice, p is likely much lower.
[Figure: Worst-case impact of false negatives, based on the upper bound of p]
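The compounding argument can be made concrete with a short sketch. A session spanning n crawls is observed in full only if none of the n crawls misses the peer, which happens with probability (1 - p)^n. Inverting that relation turns an observed fraction of long sessions into an upper bound on p. The function names are hypothetical, and the number of crawls per day behind the slide's figures is not stated, so any n passed in is an assumption.

```python
import math


def survival_probability(p, n):
    """Probability that a peer is seen in all n crawls when each crawl
    independently misses it with probability p."""
    return (1.0 - p) ** n


def approx_survival(p, n):
    """The approximation e^(-pn), accurate when p is small."""
    return math.exp(-p * n)


def max_false_negative_rate(observed_fraction, n):
    """Largest p consistent with observing `observed_fraction` of sessions
    lasting at least n crawls. This is a worst case: it assumes every real
    session would have lasted that long."""
    return 1.0 - observed_fraction ** (1.0 / n)
```

Because the bound assumes all sessions are long, the true p is expected to be lower than whatever `max_false_negative_rate` returns.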
Pitfalls: NAT
• NAT presents two obstacles:
    • Some peers may not be visible at all
    • Multiple peers may look like one peer (large NATs only)
• BitTorrent
    • Peers contact the tracker and present a unique ID
• Kad
    • DHT peers must be able to receive unsolicited incoming packets
    • No NATed peers permitted in the DHT overlay
• Gnutella
    • NATed peers are discovered through their neighbors
    • No good way to resolve multiple peers behind one NAT
Pitfalls: Long Sessions
• Some sessions are longer than the measurement window.
• Truncating sessions leads to bias.
• Ignoring sessions also leads to bias.
• The “create-based method”:
    • Divide the measurement window into two halves.
    • Only consider sessions that begin in the first half.
    • Every session beginning in the first half counts.
    • Equal opportunity to observe sessions shorter than half a window
    • For longer sessions, we count them, but do not record a particular value.
[Figure: measurement window]
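The create-based method described above can be sketched as follows. The function name and the `(start, end)` session format, with `end=None` for sessions still active when measurement stops, are assumptions made for this illustration.

```python
def create_based_sample(sessions, window_start, window_end):
    """Apply the create-based method to (start, end) sessions.

    Only sessions that begin in the first half of the window are considered.
    Sessions shorter than half a window contribute their exact length; longer
    (or still open) sessions are counted without a recorded value.
    Returns (recorded_lengths, num_longer_than_half_window).
    """
    half = (window_end - window_start) / 2
    cutoff = window_start + half
    lengths = []
    longer = 0
    for start, end in sessions:
        if not (window_start <= start < cutoff):
            continue  # began outside the first half: not considered at all
        if end is not None and end - start < half:
            lengths.append(end - start)
        else:
            longer += 1  # counted, but no particular value is recorded
    return lengths, longer
```

Every session starting in the first half has at least half a window in which to end, so all lengths below that threshold have an equal chance of being observed; the count of longer sessions keeps the total correct when building a distribution.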
Pitfalls: Miscellaneous
• Biased Peer Selection
    • Monitoring a fixed set of peers causes bias because their sessions are correlated.
    • Selecting peers must be done carefully to avoid correlations between uptime and the selection process.
    • In Gnutella and BitTorrent, we use all peers in the overlay.
    • In Kad, we use all peers in a zone.
• Handling Brief Events
    • What if a peer departs briefly and returns?
    • Not a problem with BitTorrent.
    • Most peers do not depart and return within a day, so this is probably not a large problem.
Pitfalls: Summary
• Problems can occur:
    • When gathering data
    • When cleaning data
    • When analyzing data
• Specific pitfalls:
    • Missing Data
    • False Negatives
    • NAT
    • Long Sessions
    • Biased Peer Selection
    • Handling Brief Events
• See paper for more details
Characterizations
• Properties critical for simulations
    • Inter-arrival distribution
    • Session length distribution
• Properties providing design insight
    • Uptime distribution
    • Correlations of consecutive sessions
• Additional properties in the paper
Inter-arrival Time
• Inter-arrival time: the time from the arrival of one peer until the next arrival of any other peer
• Gnutella data is too dense to examine.
• Exponential is the simplest assumption (commonly used in simulation & analysis).
• Weibull provides a better fit.
• However, this may be due to a time-varying exponential (future work).
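One way to compare the two candidate models is to fit each by maximum likelihood and compare log-likelihoods. The sketch below uses only the standard library and solves the Weibull shape equation by bisection; it is an illustrative approach, not necessarily the fitting procedure used in the paper, and all function names are hypothetical. Samples must be strictly positive, so simultaneous arrivals (zero gaps) would need separate handling.

```python
import math


def inter_arrival_times(arrival_times):
    """Gaps between consecutive arrivals, regardless of which peer arrived."""
    ordered = sorted(arrival_times)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def fit_exponential(samples):
    """MLE for the exponential rate; returns (rate, log_likelihood)."""
    total = sum(samples)
    rate = len(samples) / total
    return rate, len(samples) * math.log(rate) - rate * total


def fit_weibull(samples, lo=0.05, hi=20.0, iterations=60):
    """MLE for Weibull (shape, scale); returns (shape, scale, log_likelihood).

    The shape k solves sum(x^k ln x)/sum(x^k) - 1/k - mean(ln x) = 0.
    The left side increases with k, so bisection on [lo, hi] finds the root.
    """
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / len(logs)

    def score(k):
        powers = [x ** k for x in samples]
        weighted = sum(pw * lg for pw, lg in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2
    scale = (sum(x ** shape for x in samples) / len(samples)) ** (1.0 / shape)
    loglik = sum(
        math.log(shape / scale) + (shape - 1) * math.log(x / scale) - (x / scale) ** shape
        for x in samples
    )
    return shape, scale, loglik
```

Since the exponential is the special case of a Weibull with shape 1, the Weibull log-likelihood can never be lower; a fitted shape far from 1 is what indicates that the exponential assumption is a poor description.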
Session-Length
• Some prior studies report peer session lengths are heavy-tailed or Pareto (linear on log-log plots).
• Our data exhibits downward curvature in log-log scale.
• For the BitTorrent data, the curvature is dramatic for sessions longer than 1 day.
• Weibull and log-normal distributions provide decent fits.
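The "linear on log-log plots" test can be sketched numerically: compute the empirical CCDF, then compare the log-log slope over a low range of session lengths with the slope over a high range. A Pareto distribution gives roughly the same slope in both ranges, while downward curvature shows up as a steeper (more negative) slope in the tail. The function names and the choice of ranges are hypothetical.

```python
import math


def empirical_ccdf(lengths):
    """Return [(x, P(X > x))] for each distinct positive sample value,
    omitting the largest value, where the empirical CCDF is zero."""
    ordered = sorted(lengths)
    n = len(ordered)
    ccdf = {}
    for i, x in enumerate(ordered):
        ccdf[x] = (n - i - 1) / n  # later duplicates overwrite earlier ones
    return [(x, p) for x, p in ccdf.items() if p > 0]


def loglog_slope(points, x_lo, x_hi):
    """Least-squares slope of log(CCDF) against log(x) for x in [x_lo, x_hi]."""
    selected = [(math.log(x), math.log(p)) for x, p in points if x_lo <= x <= x_hi]
    mean_x = sum(a for a, _ in selected) / len(selected)
    mean_y = sum(b for _, b in selected) / len(selected)
    num = sum((a - mean_x) * (b - mean_y) for a, b in selected)
    den = sum((a - mean_x) ** 2 for a, _ in selected)
    return num / den
```

Comparing `loglog_slope(points, low_a, low_b)` with `loglog_slope(points, high_a, high_b)` on the same data gives a quick curvature check before attempting a full distribution fit.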
Lingering after Download Completion (in BitTorrent)
• How long do peers remain after download completion?
• Many peers linger for a few hours.
• A few peers linger for days or weeks.
• This explains the high seed percentage observed in other studies.
Peer Uptime
• The uptime of active peers is related to, but different from, the session length distribution.
• 40 to 60% of peers have an uptime longer than 5 hours.
• 10 to 20% of peers have an uptime longer than one day.
• Conclusion: the typical session is short, but the typical peer has been up a long time.
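The gap between session length and uptime comes from sampling: a snapshot of currently active peers is more likely to catch long sessions than short ones. The simulation below illustrates this with synthetic data. The Weibull parameters, arrival model, and function name are arbitrary assumptions chosen for the illustration, not values from the measurements.

```python
import random
import statistics


def simulate_snapshot(num_sessions=50000, window=200.0, snapshot_time=150.0,
                      shape=0.5, scale=1.0, seed=1):
    """Generate synthetic sessions and observe the peers active at one instant.

    Arrivals are uniform over [0, window]; session lengths are Weibull.
    Returns (median session length, median uptime of peers active at
    snapshot_time), both in the same time unit as `scale`.
    """
    rng = random.Random(seed)
    session_lengths = []
    active_uptimes = []
    for _ in range(num_sessions):
        arrival = rng.uniform(0.0, window)
        length = rng.weibullvariate(scale, shape)  # (scale, shape) argument order
        session_lengths.append(length)
        if arrival <= snapshot_time < arrival + length:
            # This peer is online at the snapshot; record how long it has been up.
            active_uptimes.append(snapshot_time - arrival)
    return statistics.median(session_lengths), statistics.median(active_uptimes)
```

With these settings the median uptime of active peers comes out well above the median session length, matching the slide's conclusion in spirit.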
Correlations in Session Length
• Is uptime a good predictor of remaining uptime? Yes and no.
    • Uptime is a good predictor of the median remaining uptime.
    • But many predictions are wrong.
• Conclusion: predictions are useful if the cost of a bad prediction is low.
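A minimal sketch of the quantity being predicted: given a set of observed session lengths, the median remaining uptime for a peer that has already been up for some time is the median of (length - uptime) over sessions that lasted at least that long. The function name is hypothetical, and this ignores the censoring issues covered under the pitfalls.

```python
import statistics


def median_remaining_uptime(session_lengths, current_uptime):
    """Median remaining time among sessions that outlasted current_uptime.

    Returns None when no observed session lasted that long.
    """
    remaining = [length - current_uptime
                 for length in session_lengths
                 if length > current_uptime]
    return statistics.median(remaining) if remaining else None
```

Evaluating this at several uptimes shows how the median shifts, but the individual values behind each median can be widely spread, which is why many single predictions are wrong.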
Summary & Future Work
• Many pitfalls when studying churn
    • We enumerate and address them.
• Characterizations
    • Neither Exponential nor Pareto distributions are consistent with our session length data.
    • Session lengths can be modeled by Weibull or log-normal distributions.
    • The typical session is short, but the typical peer is up for a long time.
    • Past session length predicts the next session length, on average
        • But is often wrong
• Future Work: longer studies of Kad and Gnutella
    • Reduce the chance of False Negatives using heuristics
        • Look for fingerprints indicating whether a peer is new or not
    • Use uniform sampling [Stutzbach 06 IMC] to closely monitor a manageable set of sessions