
Replica Control for Peer-to-Peer Storage Systems
P2P
• Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet.
• The most widely exploited resource is storage, as typified by P2P music file sharing
– Napster
– Gnutella
• Following the great success of P2P file sharing, a natural next step is to develop wide-area P2P storage systems that aggregate storage across the Internet.
Replica Control Protocol
• Replication
  – Maintains multiple copies of some critical data to increase availability
• Replica Control Protocol
  – Guarantees a consistent view of the replicated data
Replica Control Methods
• Optimistic
  – Proceed optimistically with computation on the available subgroup and reconcile later for consistency
  – Approaches
    • Log, version vector, etc.
• Pessimistic
  – Restrict computations under worst-case assumptions
  – Approaches
    • Primary site, voting, etc.
Write-ahead Log
• Files are actually modified in place, but before
any file block is changed, a record is written to a
log telling which node is making the change,
which file block is being changed, and what the
old and new values are
• Only after the log has been written successfully
is the change made to the file
• The write-ahead log can be used for both undo (rollback) and redo
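A minimal sketch of the write-ahead rule above (illustrative, not from the slides; all class and method names are hypothetical). The log record carrying the node, block, and old/new values is flushed to disk before the block is modified in place, so it can drive both undo and redo.

```python
# Write-ahead logging sketch: log first, then modify in place.
import json
import os

class WriteAheadLog:
    def __init__(self, path):
        self.log = open(path, "a+")

    def append(self, node, block, old, new):
        record = {"node": node, "block": block, "old": old, "new": new}
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())   # the record hits disk before the data does

class File:
    def __init__(self, wal):
        self.blocks = {}
        self.wal = wal

    def write_block(self, node, block, new_value):
        old_value = self.blocks.get(block)
        self.wal.append(node, block, old_value, new_value)  # log first...
        self.blocks[block] = new_value                      # ...then change the block
```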
Version Vector
• Version vector for file f
  – An N-element vector, where N is the number of nodes on which f is stored
  – The i-th element represents the number of updates done by node i
• A vector V dominates V′ if
  – Every element of V is >= the corresponding element of V′
• V and V′ conflict if neither dominates the other
Version Vector
• Consistency resolution
  – If V dominates V′, the copies are inconsistent; this can be resolved by copying the data at V over the replica holding V′
  – If V and V′ conflict, an inconsistency is detected
• Version vectors can detect only update conflicts; they cannot detect read-write conflicts
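A small sketch of version-vector comparison (illustrative): dominance is an element-wise >= test, and a conflict is reported when neither vector dominates the other.

```python
# Version-vector dominance and conflict detection.
def dominates(v, w):
    """True if v dominates w: every element of v >= the matching element of w."""
    return all(vi >= wi for vi, wi in zip(v, w))

def compare(v, w):
    if dominates(v, w) and dominates(w, v):
        return "identical"
    if dominates(v, w):
        return "v newer: copy v's data over w's replica"
    if dominates(w, v):
        return "w newer: copy w's data over v's replica"
    return "conflict: neither dominates"

# Nodes 0 and 2 updated concurrently from the same base version.
print(compare([2, 1, 0], [1, 1, 1]))  # conflict: neither dominates
print(compare([2, 1, 1], [1, 1, 1]))  # v newer
```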
Primary Site Approach
• Data is replicated on at least k+1 nodes (for k-resilience)
• One node acts as the primary site (PS)
  – Any read request is served by the PS
  – Any write request is copied by the PS to all backup sites
  – Any write request sent to a backup site is forwarded to the PS
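A toy in-memory sketch of these rules (illustrative; failure handling omitted, and all names are hypothetical): reads are served by the PS, writes are applied at the PS and copied to every backup, and writes arriving at a backup are forwarded.

```python
# Primary-site replication sketch.
class Node:
    def __init__(self):
        self.store = {}
        self.primary = None   # set to the PS node; None means "I am the PS"
        self.backups = []     # populated only on the PS

    def read(self, key):
        # Reads are served by the primary site.
        return self.store[key] if self.primary is None else self.primary.read(key)

    def write(self, key, value):
        if self.primary is not None:
            self.primary.write(key, value)   # a backup forwards writes to the PS
            return
        self.store[key] = value
        for b in self.backups:               # the PS copies the write to all backups
            b.store[key] = value

ps, b1, b2 = Node(), Node(), Node()
ps.backups = [b1, b2]
b1.primary = b2.primary = ps
b1.write("x", 42)            # forwarded to ps, then copied to b1 and b2
assert b2.read("x") == 42
```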
PS Failure Handling
• If a backup fails, there is no interruption in service
• If the PS fails, there are two possibilities
  – If the network is not partitioned
    • Choose a backup node as the new primary
    • If checkpointing has been active, only a restart from the previous checkpoint is needed
  – If the network is partitioned
    • Only the partition containing the PS can make progress
    • The other partitions stop updating the data
    • It is necessary to distinguish between site failures and network partitions
Voting Approach
• V votes are distributed among n replicas such that
  – Vr + Vw > V
  – Vw + Vw > V
• A read must obtain Vr or more votes
• A write must obtain Vw or more votes
• Quorum systems are more general than voting
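A quick check of Gifford-style vote assignments (illustrative): the first constraint rules out disjoint read and write quorums, the second rules out two disjoint write quorums.

```python
# Validate a weighted-voting parameter choice.
def valid_vote_assignment(votes, Vr, Vw):
    V = sum(votes)                    # total votes over all replicas
    return Vr + Vw > V and Vw + Vw > V

votes = [1, 1, 1, 1, 1]               # one vote per replica, V = 5
print(valid_vote_assignment(votes, Vr=3, Vw=3))  # True: majority quorums
print(valid_vote_assignment(votes, Vr=1, Vw=5))  # True: read-one/write-all
print(valid_vote_assignment(votes, Vr=2, Vw=3))  # False: 2 + 3 = 5, not > 5
```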
Quorum Systems
• Trees
• Grid-based (array-based)
• Torus
• Hierarchical
• Multi-column
and so on…
Quorum-Based Schemes (1/2)
• n replicas with version numbers
• Read operation
  – Read-lock and access a read quorum
  – Obtain a replica with the largest version number
• Write operation
  – Write-lock and access a write quorum
  – Update all replicas in the quorum with the new version number (the largest + 1)
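A sketch of these read/write rules (illustrative; locking and failure handling are omitted, and quorum members are simply chosen at random):

```python
# Version-numbered quorum read/write.
import random

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

def quorum_read(replicas, r):
    quorum = random.sample(replicas, r)            # access a read quorum
    newest = max(quorum, key=lambda x: x.version)  # largest version wins
    return newest.value

def quorum_write(replicas, w, value):
    quorum = random.sample(replicas, w)            # access a write quorum
    new_version = max(x.version for x in quorum) + 1   # the largest + 1
    for x in quorum:
        x.version, x.value = new_version, value

# n = 5 replicas; r = w = 3 gives read/write and write/write intersection.
nodes = [Replica() for _ in range(5)]
quorum_write(nodes, 3, "hello")
print(quorum_read(nodes, 3))   # "hello"
```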
Quorum-Based Schemes (2/2)
• One-copy equivalence
  – The set of replicas must behave as if there were only a single copy; this is the strictest consistency criterion
• One-copy equivalence is guaranteed if we enforce
  – Write-write and write-read lock exclusion
  – Intersection property
    • A non-empty intersection in any pair of
      – A read quorum and a write quorum
      – Two write quorums
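A small sketch testing the intersection property for explicitly enumerated quorums (illustrative): every read quorum must meet every write quorum, and any two write quorums must meet.

```python
# Check the intersection property of a quorum system.
from itertools import product

def has_intersection_property(read_quorums, write_quorums):
    rw = all(r & w for r, w in product(read_quorums, write_quorums))
    ww = all(w1 & w2 for w1, w2 in product(write_quorums, write_quorums))
    return rw and ww

# Majority quorums over replicas {0,1,2}: every pair of majorities intersects.
majorities = [{0, 1}, {0, 2}, {1, 2}]
print(has_intersection_property(majorities, majorities))        # True

# Read-one/write-all: singleton read quorums, one full write quorum.
print(has_intersection_property([{0}, {1}, {2}], [{0, 1, 2}]))  # True
```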
Witnesses
• Witness
  – A small entity that maintains enough information to identify the replicas that contain the most recent version of the data
  – The information could be a timestamp recording the time of the latest update
  – The information could also be a version number, an integer incremented each time the data are updated
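A minimal sketch of a version-number witness (illustrative; names are hypothetical): it stores no data, only enough state to say which replicas are up to date.

```python
# A witness tracks only the latest version number, never the data itself.
class Witness:
    def __init__(self):
        self.version = 0                 # version of the latest update seen

    def record_update(self, new_version):
        self.version = max(self.version, new_version)

    def is_current(self, replica_version):
        # A replica is current iff it has seen the latest known update.
        return replica_version >= self.version
```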
Classification of P2P Storage Sys.
• Unstructured
  – “Replication Strategies for Highly Available Peer-to-peer Storage”
  – “Replication Strategies in Unstructured Peer-to-peer Networks”
• Structured
  – Read-only
    • CFS
    • PAST
  – Read/Write (mutable)
    • LAR
    • Ivy
    • Oasis
    • Om
    • Eliot
    • Sigma (for a mutual exclusion primitive)
Ivy
• Stores a set of logs with the aid of distributed
hash tables.
• Ivy keeps, for each participant, a log storing all of its updates, and maintains data consistency optimistically by performing conflict resolution among all logs (i.e., consistency is maintained in a best-effort manner).
• The logs must be kept indefinitely, and a participant must scan all the logs related to a file to look up the up-to-date file data. Thus, Ivy is only suitable for small groups of participants.
Ivy uses DHT
• (Figure) Layered architecture: the distributed application (Ivy) calls put(key, data) and get(key) on a distributed hash table (DHash), which in turn uses a lookup service (Chord) to map lookup(key) to a node IP address
• DHT provides
  – Simple API
    • put(key, value) and get(key) → value
  – Availability (replication)
  – Robustness (integrity checking)
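A tiny in-memory stand-in for the DHT layer (illustrative; replication and integrity checking are left out), using a content-hash key in the style of DHash:

```python
# Minimal put/get interface over a local table.
import hashlib

class DHT:
    def __init__(self):
        self.table = {}

    def put(self, key, value):
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

dht = DHT()
block = b"log record"
key = hashlib.sha1(block).hexdigest()   # content-hash key, as in DHash
dht.put(key, block)
assert dht.get(key) == block
```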
Solution: Log Based
• Update: Each participant maintains a log
of changes to the file system
• Lookup: Each participant scans all logs
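A toy sketch of the per-participant-log design (illustrative; real Ivy stores log records in the DHT and performs proper conflict resolution rather than simply taking the highest version, and all names here are hypothetical):

```python
# Log-based update/lookup: each participant appends to its own log;
# a lookup must scan every participant's log.
class LogBasedFS:
    def __init__(self):
        self.logs = {}                      # participant -> list of updates

    def update(self, participant, path, data, version):
        # Each participant appends only to its own log.
        self.logs.setdefault(participant, []).append((version, path, data))

    def lookup(self, path):
        # Scanning *all* logs is why Ivy only scales to small groups.
        records = [rec for log in self.logs.values()
                   for rec in log if rec[1] == path]
        return max(records)[2] if records else None   # newest version wins

fs = LogBasedFS()
fs.update("alice", "/a.txt", "v1", version=1)
fs.update("bob", "/a.txt", "v2", version=2)
print(fs.lookup("/a.txt"))   # "v2"
```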
Eliot
• Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) for storing mutable metadata.
• It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics.
• It also supports AFS open-close consistency semantics; however, these semantics may cause the problem of lost updates.
• The MS is provided by a conventional replicated database, which may not fit dynamic P2P environments.
Oasis
• Oasis is based on Gifford’s weighted voting quorum
concept and allows dynamic quorum membership.
• It spreads versioned metadata along with data replicas
over the P2P network.
• To complete an operation on a data object, a client must first find the metadata related to the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, to form a quorum accordingly.
• One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated.
Om
• Om is based on the concepts of automatic replica
regeneration and replica membership reconfiguration.
• The consistency is maintained by two quorum systems:
a read-one-write-all quorum system for accessing
replicas, and a witness-modeled quorum system for
reconfiguration.
• Om allows replica regeneration from a single replica. However, a write in Om is always first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate each write to all secondary replicas.
• The drawbacks of Om are
  (1) the primary replica may become a bottleneck
  (2) the overhead incurred by the two-phase procedure may be too high
  (3) the reconfiguration by the witness model has a probability of violating consistency
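A sketch of Om's normal-case write path as described above (illustrative: read-one/write-all with writes serialized at the primary and propagated in two phases; the prepare/commit names are hypothetical, not Om's API):

```python
# Read-one/write-all with a two-phase write through the primary.
class OmReplica:
    def __init__(self):
        self.value = None
        self.pending = None

    def prepare(self, value):        # phase 1: stage the write
        self.pending = value
        return True

    def commit(self):                # phase 2: make it visible
        self.value, self.pending = self.pending, None

def om_write(primary, secondaries, value):
    # All writes are forwarded to the primary, which serializes them.
    replicas = [primary] + secondaries
    if all(r.prepare(value) for r in replicas):   # phase 1
        for r in replicas:
            r.commit()                            # phase 2

def om_read(replicas):
    return replicas[0].value         # read-one: any single replica suffices

primary, s1, s2 = OmReplica(), OmReplica(), OmReplica()
om_write(primary, [s1, s2], "v1")
print(om_read([s2]))                 # "v1"
```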
Om and Pastry
• PAST example (figure sequence, object key = 100): a Pastry ring of nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, and 120
  – Replication: the object is replicated on the nodes whose IDs are numerically closest to the key
  – Replica crash: one of the replicas fails
  – Regeneration: a replacement replica is regenerated on another node
Om: Normal Case Operation
• Read-one / write-all approach
• Writes are serialized via the primary
• (Figure: on the same ring, a read is served by any single replica, while a write is forwarded to the primary, which propagates it to all replicas)
Om: System Architecture Overview (figure)
Witness Model
• The witness model utilizes the limited view divergence property
• Intuitively, the property says that two replicas are unlikely to have a completely different view regarding the reachability of a set of randomly placed witnesses
Witness Model
• To utilize the limited view divergence property, all replicas logically organize the witnesses into an m × t matrix
• The number of rows, m, determines the
probability of intersection
• The number of columns, t, protects against
the failure of individual witnesses, so that
each row has at least one functioning
witness with high probability
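A small sketch of the matrix idea (illustrative, not Om's actual reconfiguration algorithm): a row is "covered" for a replica if the replica can reach at least one witness in that row, and the model relies on two replicas' covered row sets intersecting with high probability.

```python
# Witness matrix: rows give intersection probability, columns give
# redundancy against individual witness failures.
import random

def covered_rows(matrix, reachable):
    # Rows in which this replica can reach at least one functioning witness.
    return {i for i, row in enumerate(matrix)
            if any(w in reachable for w in row)}

m, t = 4, 3
matrix = [[f"w{i}{j}" for j in range(t)] for i in range(m)]

# Two replicas with partially overlapping views of the witnesses.
all_witnesses = [w for row in matrix for w in row]
view_a = set(random.sample(all_witnesses, 9))
view_b = set(random.sample(all_witnesses, 9))

# Limited view divergence makes it unlikely that the two covered
# row sets are disjoint, which is what the quorum relies on.
print(covered_rows(matrix, view_a) & covered_rows(matrix, view_b))
```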
Sigma System Model
• Replicas are always available, but their internal states may be randomly reset (a failure-recovery model)
• The number of clients is unpredictable; clients are not malicious and are fail-stop
• Clients and replicas communicate via messages, which may be duplicated or lost, but never forged
Sigma
• The Sigma protocol collects state from all replicas to achieve mutual exclusion.
• The basic idea is as follows. A node u wishing to be the winner of the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a replica v puts u's request into a local queue ordered by timestamp, regards the node whose request is at the front of the queue as the winner, and replies with that winner's ID to u.
Sigma
• When the number of replies received by u reaches m (m = 2k+1), u acts according to the following conditions:
  (1) if at least m replies take u as the winner, then u is the winner
  (2) if at least m replies take some w (w ≠ u) as the winner, then w is the winner and u just keeps waiting
  (3) if no node is regarded as the winner by at least m replies, then u sends a YIELD message to cancel its request temporarily and re-inserts its request after a random backoff
• In this manner, one node can eventually be elected as the winner even when the communication delay variance is large.
• A drawback of the Sigma protocol is that a node needs to send requests to all replicas and obtain agreeing replies from a large portion (2/3) of them to become the winner of the mutual exclusion, which incurs large overhead. Moreover, the overhead becomes even larger under high contention.
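A single-process sketch of the Sigma election rule (illustrative; the networked protocol, YIELD handling, retries, and backoff are omitted):

```python
# Sigma-style winner election: replicas queue requests by timestamp and
# vote for the request at the front; a node with >= m agreeing replies wins.
import collections

K = 1
N = 3 * K + 1          # number of replicas
M = 2 * K + 1          # agreeing replies needed to decide a winner

class Replica:
    def __init__(self):
        self.queue = []                     # (timestamp, node) pairs

    def request(self, timestamp, node):
        # Queue the request in timestamp order and reply with the
        # node currently at the front of the queue (the local winner).
        self.queue.append((timestamp, node))
        self.queue.sort()
        return self.queue[0][1]

def try_acquire(replicas, timestamp, node):
    replies = [r.request(timestamp, node) for r in replicas]
    winner, votes = collections.Counter(replies).most_common(1)[0]
    if votes >= M:
        return winner                       # someone has at least m votes
    return None                             # no winner: caller would YIELD and retry

replicas = [Replica() for _ in range(N)]
print(try_acquire(replicas, timestamp=1, node="u"))  # "u": all replies agree
print(try_acquire(replicas, timestamp=2, node="v"))  # still "u": v keeps waiting
```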
References
• Ivy: A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, “Ivy: A Read/Write Peer-to-Peer File System,” in Proc. of the Symposium on Operating Systems Design and Implementation (OSDI), 2002.
• Eliot: C. Stein, M. Tucker, and M. Seltzer, “Building a Reliable Mutable File System on Peer-to-Peer Storage,” in Proc. of the 21st IEEE Symposium on Reliable Distributed Systems (WRP2PDS), 2002.
• Oasis: M. Rodrig and A. LaMarca, “Decentralized Weighted Voting for P2P Data Management,” in Proc. of the 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003.
• Om: H. Yu and A. Vahdat, “Consistent and Automatic Replica Regeneration,” in Proc. of the First Symposium on Networked Systems Design and Implementation (NSDI '04), 2004.
• Sigma: S. Lin, Q. Lian, M. Chen, and Z. Zhang, “A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-Peer Systems,” in Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), 2004.
Q&A