Timed Quorum Systems for large-scale and dynamic environments

Vincent Gramoli, Michel Raynal
OPODIS’07
Gramoli, Raynal
Context
Large-scale dynamic distributed systems
• Nodes communicate through message-passing
• Nodes join/leave the system at any time
Goal
To emulate a shared memory in this context,
providing atomic (i.e., linearizable) read/write operations
Roadmap
1. Model and preliminary definitions
2. Related work
3. Timed Quorum System (TQS)
4. An efficient implementation of TQS
5. Conclusion
Simple model
• System of n interconnected nodes with unique IDs
• Asynchronous communication with neighbors
(nodes whose ID is known)
• Dynamism intensity (i.e. churn) c
• We consider a single object (local atomicity)
Quorum System
• Quorums are sets (of nodes) that mutually
intersect.
• A Quorum System (QS) is a set of quorums.
Ex. 3 quorums of size q=2: Q1, Q2, Q3 with
Q1 ∩ Q2 ≠ Ø
Q1 ∩ Q3 ≠ Ø
Q2 ∩ Q3 ≠ Ø
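The mutual-intersection requirement can be checked mechanically. The sketch below is only an illustration: the node names and the helper `is_quorum_system` are made up for this example and do not come from the talk.

```python
from itertools import combinations

def is_quorum_system(quorums):
    """Return True if every pair of quorums shares at least one node."""
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))

# The slide's example: 3 quorums of size q = 2 (hypothetical node names).
Q1, Q2, Q3 = {"a", "b"}, {"b", "c"}, {"a", "c"}
print(is_quorum_system([Q1, Q2, Q3]))              # True
print(is_quorum_system([{"a", "b"}, {"c", "d"}]))  # False
```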
Operations
• Atomic quorum-based operations for static
settings:
[Attiya, Bar-Noy, Dolev, JACM 1995]
• Each node of the quorums maintains:
– A local value v of the object
– A unique tag t, the version number of this value
Operations
1) Reading a value
Phase 1: Consult the most up-to-date value v (ask a quorum "value? tag?" and keep the answer v1,t1)
Phase 2: Propagate the consulted value (v1,t1) to a quorum
Output: v1
Theorem of Attiya and Welch 1998:
« Read must write » to prevent new/old inversions for an unbounded number of readers.
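A minimal sketch of the two-phase read, assuming an in-memory dictionary stands in for the nodes and integer tags stand in for version numbers. The function name `read` and the way quorums are passed in are assumptions made for illustration; message passing and failures are not modeled.

```python
def read(replicas, consult_quorum, propagate_quorum):
    """Two-phase quorum read over an in-memory dict: node -> (value, tag)."""
    # Phase 1: ask every node of one quorum for its (value, tag) pair
    # and keep the pair with the largest tag.
    value, tag = max((replicas[n] for n in consult_quorum), key=lambda vt: vt[1])
    # Phase 2: write the pair back to a quorum so that later reads
    # cannot return an older value ("read must write").
    for n in propagate_quorum:
        if replicas[n][1] < tag:
            replicas[n] = (value, tag)
    return value

replicas = {"a": ("v0", 0), "b": ("v1", 1), "c": ("v0", 0)}
print(read(replicas, {"a", "b"}, {"b", "c"}))  # v1
print(replicas["c"])                            # ('v1', 1)
```

After this read, a later reader that consults the quorum {a, c} also sees v1, which is the new/old inversion that phase 2 is meant to rule out.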
Operations
2) Writing a value v2
Input: v2
Phase 1: Consult the value version and choose a new one strictly larger (ask a quorum "max tag?", obtain t1)
Phase 2: Propagate the new value associated with the new version: v2,t2 (with t2 > t1)
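A minimal sketch of the two-phase write under the same simplifications: an in-memory dictionary replaces the nodes, and no messages or failures are modeled. Representing a tag as a `(counter, writer_id)` pair is an assumption made here to keep tags unique across writers; the slides only state that tags are unique.

```python
def write(replicas, new_value, writer_id, consult_quorum, propagate_quorum):
    """Two-phase quorum write over an in-memory dict: node -> (value, tag)."""
    # Phase 1: consult the tags of one quorum and pick a strictly larger one.
    # A tag is a (counter, writer_id) pair, compared lexicographically.
    max_counter = max(replicas[n][1][0] for n in consult_quorum)
    new_tag = (max_counter + 1, writer_id)
    # Phase 2: propagate the new value with its new tag to a quorum.
    for n in propagate_quorum:
        if replicas[n][1] < new_tag:
            replicas[n] = (new_value, new_tag)
    return new_tag

replicas = {n: ("v1", (1, 0)) for n in "abc"}
t2 = write(replicas, "v2", writer_id=7,
           consult_quorum={"a", "b"}, propagate_quorum={"b", "c"})
print(t2)             # (2, 7)
print(replicas["a"])  # ('v1', (1, 0))
print(replicas["c"])  # ('v2', (2, 7))
```

Node a is outside the propagation quorum and keeps the old pair; any later quorum intersects {b, c} and therefore finds the new tag.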
Dynamic Solutions
• Reconfigurable storage: a failing QS is replaced by a new
one.
– RAMBO: Shvartsman, Lynch, DISC’02
– RDS: Chockler et al. OPODIS’05
• Structured dynamic quorums: failed servers are replaced
by new ones.
– AM05: Abraham, Malkhi, Dist. Comp. 2005
– NN05: Nadav, Naor, DISC’05
– SQUARE: Gramoli, Anceaume, Virgillito, SAC’07
All solutions require bounded churn during any finite period
Dynamic Solutions
Reconfiguration complexity vs.
operation latency tradeoff
[Figure: RAMBO, RDS, AM05, NN05 and SQUARE positioned along two axes, operation latency and reconfiguration complexity]
Prevents scalability!
Timed Quorum System
Dynamic quorum systems should be:
– Probabilistic: # of failures not necessarily bounded
– Timed: no property can hold forever
Timed Quorum System
• Timed access strategy ω:
A mapping from any time t to a probability distribution on
the possible quorums.
• Δ-Timed Quorum System (TQS):
For any Q(t1) and Q(t2) accessed resp. with ω(t1) and ω(t2):
if |t2 – t1| ≤ Δ, then Q(t1) ∩ Q(t2) ≠ Ø with high probability.
[Figure: quorums Q(t1) to Q(t5) placed along a time axis with a window of length Δ; marked intersections: Q(t1)∩Q(t2), Q(t2)∩Q(t3), Q(t3)∩Q(t4), Q(t3)∩Q(t5)]
Example of a TQS: {Q(t1),Q(t2),Q(t3),Q(t4),Q(t5)}
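To get a feel for "intersect with high probability", the sketch below estimates by simulation how often two randomly chosen quorums accessed dt time units apart still share a node. Everything in it is an assumption made for illustration: quorums are uniform random samples of q slots out of n, each node independently survives one time unit with probability 1 - c, and a slot whose node left is treated as holding a different node. It is not the analysis from the paper.

```python
import random

def intersection_probability(n, q, c, dt, trials=2000, seed=0):
    """Monte Carlo estimate of P[Q(t1) and Q(t2) intersect] for t2 - t1 = dt."""
    rng = random.Random(seed)
    survive = (1.0 - c) ** dt          # chance that a node is still present after dt
    hits = 0
    for _ in range(trials):
        q1 = rng.sample(range(n), q)   # quorum accessed at time t1
        alive = {s for s in q1 if rng.random() < survive}
        q2 = rng.sample(range(n), q)   # quorum accessed at time t2
        hits += bool(alive.intersection(q2))
    return hits / trials

for q in (10, 30, 60):
    print(q, intersection_probability(n=1000, q=q, c=0.01, dt=5))
```

Under these assumptions the estimate rises with the quorum size q and falls as churn c or the gap dt grows.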
Consistency
• Probabilistic Atomicity:
– In the real-time sequence of operations:
• Either an operation satisfies atomicity w.r.t. all preceding successful
operations, and it is said to be successful
• Or this operation is said to be unsuccessful
– Any operation is successful with high probability
Theorem 1: If at least one quorum is
accessed in every period of time of length Δ, then a Δ-TQS
implements probabilistic atomicity.
Some observations
• Replication is necessary for data
persistence
• In large-scale systems, operations are
frequent
• Theorem « read must write » of Attiya and
Welch indicates that some information must
be replicated in any operation
Efficient TQS Implementation
• Underlying gossip-based shuffle of neighborhood:
– Each node constantly obtains a new random set of
neighbors
• Classical quorum-based operations:
– Consulting v and t at some quorum
– Choosing v’ and t’ to propagate
– Propagating v’ and t’ to some quorum
Efficient TQS Implementation
Disseminate until q = O(√n) nodes are contacted
[Figure: dissemination tree rooted at the client, with branches labeled 1 to k at each level]
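The dissemination step can be sketched as a round-based simulation: the client sends to k neighbors, each newly contacted node forwards to k neighbors, and the process stops once q distinct nodes have been reached. The function `disseminate`, the use of node 0 as the client, and the uniform choice of neighbors among all other nodes are assumptions for illustration; the real algorithm relies on a gossip-based shuffle to provide random neighbors.

```python
import random

def disseminate(n, q, k, seed=0):
    """Return (rounds, messages) used to contact q distinct nodes with fan-out k."""
    if not (1 <= k < n and 1 <= q < n):
        raise ValueError("need 1 <= k < n and 1 <= q < n")
    rng = random.Random(seed)
    contacted = set()
    frontier = [0]                     # node 0 plays the client
    rounds = messages = 0
    while len(contacted) < q:
        rounds += 1
        next_frontier = []
        for _ in frontier:
            # Each sender picks k distinct neighbors uniformly at random.
            for target in rng.sample(range(1, n), k):
                messages += 1
                if target not in contacted and len(contacted) < q:
                    contacted.add(target)
                    next_frontier.append(target)
        # If every message hit an already contacted node, the same senders retry.
        frontier = next_frontier or frontier
    return rounds, messages

rounds, messages = disseminate(n=10_000, q=100, k=4)
print(rounds, messages)
```

Because the number of contacted nodes can grow by a factor of up to k per round, the number of rounds is logarithmic in q to base k when collisions are rare.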
Efficient TQS Implementation
Assumptions:
– neighbors are chosen uniformly at random
– at least one operation succeeds every Δ time
– c = rate of arrival = rate of departure ∈ [0,1)
Results:
– This algorithm implements a TQS
– Replication is piggybacked into operations
– The quorum size is O(√(nD)) where D = (1-c)^(-Δ)
– The operation latency is O(log_k √(nD)) message delays, where D = (1-c)^(-Δ)
– Smallest quorum size O(√n) for static systems when D=O(1)
cf. [Malkhi, Reiter, Wool, Wright, Inf. and Comp. Journal 2001]
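As a worked example of how churn inflates the quorum size, the sketch below evaluates D = (1-c)^(-Δ) and a quorum size of the form β·√(nD). The constant β and the function names are hypothetical: the result above is asymptotic and fixes no constant, so the numbers only show the trend.

```python
import math

def dynamic_parameter(c, delta):
    """D = (1 - c)^(-delta): growth factor caused by churn c over a period delta."""
    return (1.0 - c) ** (-delta)

def quorum_size(n, c, delta, beta=1.0):
    """Quorum size beta * sqrt(n * D), rounded up (beta is an assumed constant)."""
    return math.ceil(beta * math.sqrt(n * dynamic_parameter(c, delta)))

print(quorum_size(10_000, 0.0, 10))   # 100: no churn, D = 1
print(quorum_size(10_000, 0.75, 1))   # 200: D = 4
```

With c = 0 the expression reduces to the static √n case, and the size grows as c or Δ increases.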
Conclusion
We defined the Timed Quorum System, which:
• Is inherently dynamic:
– NO underlying structure
– Timely intersection requirement
• Ensures Probabilistic Atomicity
• Scales well:
– O(√(nD)) messages per operation
– O(log_k √(nD)) time per operation
• Is optimal:
– When D=O(1), translates into the best known static result: O(√n)
Open Issue
• TQS in Mobile Sensor Networks:
– Consultation phase:
• Gather motes to consult t and v
• Scatter motes to make t and v likely visible
– Propagation phase:
• Gather motes to propagate t’ and v’
• Scatter motes to make t’ and v’ likely visible