Providing High Availability Using Lazy Replication

Providing High Availability
Using Lazy Replication
Rivaka Ladin, Barbara Liskov,
Liuba Shrira, Sanjay Ghemawat
Presented by Huang-Ming Huang
Outline




Model
Algorithm
Performance Analysis
Discussion
Replication Model
client
FE
Front ends
client
FE
RM
Service
RM
RM
Replication
Manager
Excerpt from “Distributed Systems – Concept and Design” by Coulouris, Dollimore and Kindberg
System Guarantees


Each client obtains a consistent service
over time
Relaxed consistency between replicas

Updates are applied with ordering
guarantees that make the replicas
sufficiently similar.
Operation Classification
RM
RM
RM
gossip
Update, prev
Val, new
Query, prev
Update id
FE
query
Client
FE
val
update
Client
Excerpt from “Distributed Systems – Concept and Design” by Coulouris, Dollimore and Kindberg
Update operation classification



Causal update
Forced update : performed in the same
order (relative to one another) at all
replicas.
Immediate update : performed at all
replicas in the same order relative to all
other operations.
Vector timestamp

Given two timestamps





T = (t1,t2,,tn)
S = (s1,s2,,sn)
T ≤ S ≡ti≤si for all i
merge(T,S)= (max(t1,s1),…,max(tn,sn))
Each part of the vector timestamp
corresponds to each replica manager in
the system.
RM components
Other replicas
Replica Timestamp
Replica log
Gossip
Messages
Timestamp table
Replica timestamp
Update log
stable
Value Timestamp
updates
Value
Executed operation table
Updates
FE
FE
Operation
prev id
Excerpt from “Distributed Systems – Concept and Design” by Coulouris, Dollimore and Kindberg
Query

The replica manager blocks the query q
operation until the condition holds:



q.prev <= valueTS
The replica manger returns valueTS
back to FE.
FE updates its own timestamp

frontEndTS := merge(frontEndTS, new)
Causal Update
Replication Manager i
ValueTS
merge(ValueTS, r.ts)
+1,…,r
(r1,r2,…,ri,…,r
n) n)
Value
apply(value.r.u.op)
Update log
executed
 r.u.id table
Executed
operation
ts=(p
+1,…,pn)
r.u.prev
≤ ivalueTS
1,p2,…,p
logRecord =(i, ts, u.op, u.prev, u.id)
ts
FE
operation (p1,p2,…pn,) id
Gossip messages


Goal : bring the states of replication
managers up to date.
Consists of :



Replication timestamp
Update log
Upon receiving gossip



Merge the arriving log with its own
Apply any unexecuted stable updates
Eliminate redundant log and executed operation
table entries
Control the size of update log

Timestamp table



keeps recent timestamps
from messages sent by all other replicas.
A log record r can be removed from the
log when

r.tsr.i < timestamp_table[j] r.i , for all j
Control the size of executed
operation table


Each update carries an extra time field
FE returns an ACK


Contains FE’s clock time after receiving the
response for an update from RM.
RM inserts the received ACK to the log.
Control the size of executed
operation table (con’t)

A message m from FE is late if




m.time + δ< replica’s clock time
An update is discard if it is late
An ACK is kept at least until it is late
Remove an entry c in executed operation
table when


an ACK for c’s update is received
all records for c’s update have been discarded.
Forced Update


Use the primary to assign a global
unique identifier.
The primary carries out a two phase
protocol for updates.
Two phase protocol


Upon receiving an update, the primary
sends it to all other replicas.
Upon receiving responses from all most
half of the backups,


the primary commit the update by insert
the record to its log.
Backups know the commitment from
gossip messages.
Fail Recovery



New coordinator informs participants
about the failure.
Participants inform coordinator about
most recent forced updates
Coordinator assign UID with the largest
it knows after the sub-majority of
replicas has responded.
Immediate Update

Primary use 3 phase protocol.



Pre-prepare
Prepare
Commit
3 phase protocol
primary
FE
Update log
update
Update id
logRecord
Give me your log
and timestamp
backup
backup
backup
Number of Messages for
different operations




Query : 2
Casual : 2 + (N-1)/K
Forced : 2N/2+ (N-1)/K
Immediate : 2N +2(N/2-1)+(N-1)K


N : the number of replicas
K : the number of update/ack pairs in a
gossip.
Capacity of a 3-replica system
Excerpt from “Providing high Availability Using Lazy Replication” by Ladin, Liskov, Shrira and Ghemawat
Capacity of the Unreplicated
System
Excerpt from “Providing high Availability Using Lazy Replication” by Ladin, Liskov, Shrira and Ghemawat
Discussion

No time guarantee for gossip messages

Not generally suitable for real-time application
such as



realtime conference
updating shared document.
Scalability


Timestamp space grows as number of replicas
grow.
can be increased by making most of the replicas
read-only
Qustions?