Reconfigurable Distributed Storage for Dynamic Networks

Gregory Chockler, Seth Gilbert,
Vincent Gramoli, Peter M Musial,
Alexander A Shvartsman
OPODIS 05
Goals
Reconfigurable Distributed Storage (RDS)
• Atomic consistency (read/write)
• Fault Tolerance
…in Dynamic and Asynchronous Systems.
Distributed Storage
Data is replicated at several network locations
Read and Write operations follow an operation policy
…in Dynamic Networks
Distributed Storage in Dynamic Networks
Nodes leave and join the network
…requires a reconfiguration process.
…by achieving agreement.
Model
• Distributed
– Connected set of processors
– Each processor has a unique id i ∈ I
– Multi-writer multi-reader (MWMR): any processor is a potential client
• Asynchronous
– Asynchronous processors
– Point-to-point, asynchronous, unreliable channels
• Dynamic
– Processors join and leave the system
– Processors may crash
What is a configuration?
• Configuration: <members, read-quorums, write-quorums>
– members is a set of processors
– read-quorums and write-quorums are two sets of quorums
– ∀ RQ ∈ read-quorums, ∀ WQ ∈ write-quorums:
• RQ ⊆ members
• WQ ⊆ members
• RQ ∩ WQ ≠ ∅ (only for a given configuration)
• Every client maintains a set of
configurations, initially containing the
default one.
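The configuration constraints above can be checked mechanically. The following is a minimal sketch, not taken from the talk: the names Configuration, is_valid, and majority_configuration are hypothetical, and processors are represented by plain strings. The majority helper mirrors the majority-based configurations mentioned in the experiments.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import FrozenSet

Quorum = FrozenSet[str]  # a quorum is a set of processor ids (assumed to be strings)


@dataclass(frozen=True)
class Configuration:
    members: FrozenSet[str]
    read_quorums: FrozenSet[Quorum]
    write_quorums: FrozenSet[Quorum]

    def is_valid(self) -> bool:
        # Every quorum must be a subset of members.
        quorums = self.read_quorums | self.write_quorums
        if not all(q <= self.members for q in quorums):
            return False
        # Every read quorum must intersect every write quorum
        # (within this configuration only).
        return all(rq & wq
                   for rq, wq in product(self.read_quorums, self.write_quorums))


def majority_configuration(members) -> Configuration:
    # Hypothetical helper: every majority of members is both a read and a write quorum.
    members = frozenset(members)
    size = len(members) // 2 + 1
    majorities = frozenset(frozenset(q)
                           for q in combinations(sorted(members), size))
    return Configuration(members, majorities, majorities)
```

Any two majorities of the same member set overlap, so a majority-based configuration always passes the intersection check.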
Single Object Operations Overview
After [ABD95]
• tag = <c,i> ∈ N × I; val is a possible value
1. (tag,val) query(NULL): gathers the (tag,val) pairs of all
processors of a RQ and returns the one with the largest tag.
2. NULL prop(tag,val): updates the (tag,val) pairs at all processors
of a WQ.
• val = Read()i
(<c,j>,val)=query();[prop(<c,j>,val);]
• Write(val)i
(<c’,j>,val’)=query();prop(<c’++,i>,val);
– In a write, query() reads the current tag and prop() writes the new tag
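The two-phase structure of these operations can be sketched in a single process. This is an illustration only, assuming in-memory replicas instead of message passing and fixed quorums passed in by the caller; Replica, query, prop, read, and write are hypothetical names, and tags are modeled as (counter, writer id) tuples compared lexicographically.

```python
from dataclasses import dataclass
from typing import Any, Tuple

Tag = Tuple[int, int]  # (counter c, writer id i), ordered lexicographically


@dataclass
class Replica:
    tag: Tag = (0, 0)
    val: Any = None


def query(read_quorum):
    # Gather (tag, val) from every replica of a read quorum; keep the largest tag.
    best = max(read_quorum, key=lambda r: r.tag)
    return best.tag, best.val


def prop(write_quorum, tag, val):
    # Update every replica of a write quorum that holds an older tag.
    for r in write_quorum:
        if tag > r.tag:
            r.tag, r.val = tag, val


def read(read_quorum, write_quorum):
    tag, val = query(read_quorum)
    prop(write_quorum, tag, val)  # the bracketed (sometimes skippable) phase
    return val


def write(read_quorum, write_quorum, i, val):
    (c, _), _ = query(read_quorum)        # read the current tag
    prop(write_quorum, (c + 1, i), val)   # write the new, larger tag
```

Because every read quorum intersects every write quorum, a query always meets at least one replica updated by the latest completed prop.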
Reconfiguration Design Goals
• Sound
– Totally ordered configurations
• Flexible
– No dependences between configurations
• Non-intrusive
– Allows concurrent read/write operations
• Fast
– Strengthening fault tolerance
Decoupling Reconfiguration
• Reconfiguration = Replacing Configurations
– {I} Installing a new configuration
– {R} Removing old configuration(s)
• If {R} ≺ {I} ⇒ Operations are delayed
• If {I} ≺ {R} ⇒ Stronger configuration viability assumption is required
Solution
Instead of ({R} ≺ {I}) or ({I} ≺ {R}):
{I} // {R}
Tighter coupling between
removal and installation
RDS Reconfiguration
• Reconfiguration is based on Paxos
(a three-phase, leader-based consensus algorithm)
• l is the leader
• c is the current configuration
• configs is the set of active configurations
• A ballot has a unique identifier b and a value v,
which is a configuration
• Paxos phases:
– Prepare: l creates a new ballot and chooses/gets the
value to propose.
– Propose: l proposes <b,v> and gathers votes from a
majority.
– Propagate: l propagates the decision.
Recon(c,c’): message flow between the leader l, a read quorum RQ, and a write quorum WQ
• Prepare phase
– l creates a new, larger ballot b and sends <1a, b>
– Replies: <1b, b, configs, <b’’, c’’>>
– l updates its ballot’s value v with the one received and updates its configs set
• Propose phase
– l sends <2a, b, c, v>
– Replies: <2b, b, c, v, tag, val>
– Nodes update their tag and val and add v to their configs set
• Propagation phase
– <3a, c, v, tag, val> is sent
– Nodes update their tag and val and remove configuration c from their configs set
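The Recon(c,c’) exchange can be sketched as a single-process simulation. This is an illustration under strong simplifying assumptions, not the authors' implementation: there is no network, no failure, one quorum stands in for both RQ and WQ, the leader is itself a member of that quorum, and every member must answer. Node, adopt, on_1a, on_2a, on_3a, and recon are hypothetical names; configurations are represented by strings.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Tuple

Tag = Tuple[int, str]  # (counter, writer id), ordered lexicographically


@dataclass
class Node:
    name: str
    configs: set = field(default_factory=set)  # active configurations
    promised: int = 0                           # largest ballot seen so far
    voted: Optional[Tuple[int, str]] = None     # (ballot, value) last voted for
    tag: Tag = (0, "")
    val: Any = None

    def adopt(self, tag, val):
        if tag > self.tag:
            self.tag, self.val = tag, val

    def on_1a(self, b):
        if b <= self.promised:
            return None                         # stale ballot: no reply
        self.promised = b
        return ("1b", b, set(self.configs), self.voted)

    def on_2a(self, b, c, v):
        if b < self.promised:
            return None
        self.promised, self.voted = b, (b, v)
        return ("2b", b, c, v, self.tag, self.val)

    def on_3a(self, c, v, tag, val):
        self.adopt(tag, val)
        self.configs.add(v)
        self.configs.discard(c)                 # remove the old configuration


def recon(leader, quorum, c, c_new, b):
    # Prepare: send <1a, b>, collect <1b, ...> replies.
    replies = [n.on_1a(b) for n in quorum]
    if any(r is None for r in replies):
        return False
    votes = [r[3] for r in replies if r[3] is not None]
    v = max(votes)[1] if votes else c_new       # adopt the highest-ballot vote, if any
    for r in replies:
        leader.configs |= r[2]
    # Propose: send <2a, b, c, v>, collect <2b, ...> replies.
    acks = [n.on_2a(b, c, v) for n in quorum]
    if any(a is None for a in acks):
        return False
    best = max(acks, key=lambda a: a[4])        # reply carrying the largest tag
    tag, val = best[4], best[5]
    for n in quorum:
        n.adopt(tag, val)
        n.configs.add(v)                        # v is now installed alongside c
    # Propagate: send <3a, c, v, tag, val>.
    for n in quorum:
        n.on_3a(c, v, tag, val)
    return True
```

In this sketch the new configuration is added before the old one is removed, and the latest tag and value travel with the reconfiguration messages, which matches the annotations on the message flow.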
Proving Atomicity
• Ordering configurations
Theorem 1: The set of installed configurations
in the system is totally ordered.
• Ordering operations
Theorem 2: If operation 1 precedes operation 2,
then operation 1’s tag is not larger than operation 2’s tag.
Additional Assumptions
• Eventual stabilization with
– Unique leader l
– Message delay bound d (unknown to the algorithm)
– Gossip with frequency d
– Restricted reconfiguration rate
– Some quorums remain alive in active configurations
• ts: system stabilization time
• tl: algorithm stabilization time
• tr: request time
(Timeline figure: ts and tl, with an interval labeled 2d.)
Reconfiguration Latency
Worst-case scenario: the last reconfiguration was done by a different leader.
• Starting at max(tl, tr): Prepare (2d), then Propose (2d), then Propagate (d)
• Total latency: 5d
• te: end time, when reconfiguration is complete
Other cases: the leader made the previous reconfiguration.
• Starting at max(tl, tr): Propose (2d), then Propagate (d)
• Total latency: 3d
• te: end time, when reconfiguration is complete
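The two latency figures are sums of per-phase delays. A small sketch of that arithmetic follows; the function name and the new_leader flag are hypothetical, and d is the message delay bound from the assumptions.

```python
def reconfiguration_latency(d, new_leader):
    """Latency from max(tl, tr) to te, as a multiple of the delay bound d."""
    prepare = 2 * d if new_leader else 0  # Prepare runs only after a leader change
    propose = 2 * d                       # one round trip
    propagate = d                         # one-way propagation
    return prepare + propose + propagate
```

With d = 1, the worst case gives 5 and the other cases give 3, matching 5d and 3d.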
Operation Latency
Phase latency:
• 2d is sufficient for one phase round trip.
• In some cases (pending reconfiguration), a new configuration is discovered during the
1st round trip and the phase needs a 2nd round trip (2d + 2d).
Operation latency:
• Operations are bounded by 8d.
• In some cases, the propagation phase of the read operation can be
skipped, leading to a possible bound of 2d.
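One way to read these bounds is as a product of phases, round trips per phase, and the 2d round-trip time. The sketch below makes that reading explicit. It is an interpretation, not a statement from the talk: it assumes an operation has two phases (query and propagation) and that each phase needs at most two round trips. The function and parameter names are hypothetical.

```python
def operation_latency(d, phases=2, round_trips_per_phase=2):
    """Upper bound on operation latency, as a multiple of the delay bound d."""
    round_trip = 2 * d
    return phases * round_trips_per_phase * round_trip
```

Under these assumptions, the default arguments give 8d, and a read that needs a single query round trip with no propagation phase gives 2d.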
Experimental Results
• IOA translated to Java code following a set of rules.
• Implementation of the Attiya, Bar-Noy, and Dolev algorithm "ABD"
(without reconfiguration) and of RDS, which shares parts of the ABD code.
• Using majority-based configurations.
• Measuring operation latency
1. While varying configuration size
2. While varying algorithm instances
• Operation latency of RDS is competitive with
ABD, confirming the theory.
• Reconfiguration messages contain operation
information, which might accelerate operations in
RDS.
Conclusion
• RDS, Reconfigurable Distributed Storage.
• With sound, flexible, non-intrusive, and
fast reconfiguration.
• It solves two problems in one:
configuration replacement and consensus.
• Reconfiguration is inexpensive in time.
• Fault tolerance is strengthened.
• RAMBO can become more aggressive: it is
exactly what we did here!