Consistency and Replication (3)
Topics
Consistency protocols
Readings
Van Steen and Tanenbaum: 6.5
Coulouris: 11,14
Introduction
A consistency protocol describes an
implementation of a specific consistency
model.
We will look at different architectures that can
be used to support different consistency
models, but first we look at a basic
architectural model.
A Basic Architectural Model for
the Management of Replicated
Data
Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the replica managers (RM) that make up the service.
A Basic Architectural Model for
the Management of Replicated
Data
A collection of replica managers provides a
service to clients.
The clients see a service that gives them access
to objects (e.g., calendar or bank accounts)
which are replicated.
Each client’s requests are handled by a
component called a front end.
A Basic Architectural Model for
the Management of Replicated
Data
The purpose of the front end is to hide the
replication from the client process.
The client processes do not know how many
replicas there are.
A front end may be implemented in the client’s
address space or it may be a separate process.
Replicas coordinate in preparation to execute
the request consistently.
A Basic Architectural Model for
the Management of Replicated
Data
Replica managers execute requests
One or more replicas may respond to the
application (through the front end).
Primary-Based Protocols
In primary-based protocols, each data item x in
the data store has an associated primary, which
is responsible for coordinating write operations
on x.
Primary-Backup Protocols
Read operations are performed on a locally available copy.
Write operations are done at a fixed primary copy.
The primary performs the update on its local copy of x and
then forwards the update to all the other replicas (which are
considered to be backups).
Primary-Based Protocols
Primary-Backup Protocols (cont)
Each backup server performs the update as well and sends an
acknowledgement back to the primary.
When all backup servers have updated their local copy the primary
sends an acknowledgement back to the initial process.
This implements sequential consistency.
The primary RM is a performance bottleneck.
With F+1 RMs, up to F crash failures can be tolerated.
Sun NIS (Yellow Pages) uses passive replication:
clients can contact primary or backup servers for
reads, but only primary servers for updates.
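The blocking write path of the primary-backup protocol described above can be sketched as follows. This is a minimal single-process illustration, not a definitive implementation: the class names `Primary` and `Backup` and their methods are hypothetical, and network messages are modelled as direct method calls.

```python
class Backup:
    """A backup replica manager holding a local copy of the data store."""

    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # Perform the forwarded update, then acknowledge it to the primary.
        self.store[key] = value
        return True

    def read(self, key):
        # Reads are served from the locally available copy.
        return self.store.get(key)


class Primary(Backup):
    """The fixed primary copy: coordinates every write on its data items."""

    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        # 1. Update the primary's own copy.
        self.store[key] = value
        # 2. Forward the update to every backup and collect acknowledgements.
        acks = [b.apply(key, value) for b in self.backups]
        # 3. Acknowledge the client only after all backups have updated.
        return all(acks)


backups = [Backup(), Backup()]
primary = Primary(backups)
acked = primary.write("x", 42)
```

Because the client is acknowledged only after every backup has applied the update, a later read at any replica returns the new value; the cost is that every write waits on the slowest backup.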
The Primary-Backup Protocol
Figure: clients (C), front ends (FE), and replica managers (RM), with one RM acting as the primary and the others as backups.
Replicated-Write Protocols
In replicated-write protocols, write operations
can be carried out at multiple replicas instead
of only one (as seen in the case of primary-based protocols).
Operations need to be carried out in the same
order everywhere.
We discussed one approach for doing so that
uses Lamport’s timestamps.
Using Lamport timestamps does not scale well
in large distributed systems.
Replicated-Write Protocols
An alternative approach to achieving total order is to
use a central coordinator which is sometimes called a
sequencer.
Forward each operation to the sequencer.
Sequencer assigns a unique sequence number and
subsequently forwards the operation to all replicas.
Operations are carried out in the order of their sequence
number.
Hmm. This resembles primary-based consistency protocols.
Useful for sequential consistency.
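The sequencer approach above can be illustrated with a small sketch. It assumes a single in-process sequencer and replicas that buffer early arrivals; the names `Sequencer`, `Replica`, `stamp`, and `deliver` are hypothetical.

```python
import itertools


class Replica:
    """Applies operations strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1      # next sequence number this replica may apply
        self.pending = {}      # seq -> operation, for early arrivals
        self.applied = []      # operations in the order they were applied

    def deliver(self, seq, op):
        self.pending[seq] = op
        # Apply every operation that is now next in line.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Sequencer:
    """Central coordinator that stamps each operation with a unique number."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, op):
        return next(self._counter), op


seq = Sequencer()
stamped = [seq.stamp(op) for op in ("w1", "w2", "w3")]

r1, r2 = Replica(), Replica()
for s, op in stamped:              # r1 receives the messages in order
    r1.deliver(s, op)
for s, op in reversed(stamped):    # r2 receives them in reverse order
    r2.deliver(s, op)
```

Both replicas end up applying the operations in the same order even though the messages reached them in different orders, which is the property the sequence numbers are meant to provide.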
Replicated-Write Protocols
The use of a sequencer does not solve the scalability
problem.
A combination of Lamport timestamps and
sequencers may be necessary.
The approach is summarized as follows:
Each process has a unique identifier, pi, and keeps a sent
message counter ci. The process identifier and message
counter uniquely identify a message.
Active processes (or a sequencer) keep an extra counter: ti.
This is called the ticket number. A ticket is a triplet (pi, ti,
(pj, cj)).
Replicated-Write Protocols
Approach Summary (cont)
An active process issues tickets for its own messages and for messages
from its associated passive processes (these are processes that are not
sequencers).
Passive processes multicast their messages to all group processes which
then wait for a ticket stating the total order of each message.
The ticket is sent by each passive process’s sequencer.
Lamport’s totally ordered multicast algorithm is used among the
sequencers to determine the order of update operations.
When an operation is allowed, each sequencer sends the ticket to its
associated passive processes. It is assumed that the passive process
receives these tickets in the order sent.
Replicated-Write Protocols
Approach Summary (cont)
If a sequencer terminates abnormally, then one of the
passive processes associated with it can become the new
sequencer.
An election algorithm may be used to choose the new
sequencer.
Replicated-Write Protocols
Let’s say that we have 6 processes: p1,p2,p3,p4,p5,p6
Assume that p1,p2 are sequencers; p3,p4 are associated
with p1 and p5,p6 are associated with p2
Let’s say that p3 sends a message which is identified
by (p3 , 1).
p1 generates a ticket as follows: (p1, 1, (p3 , 1))
The ticket number is generated using the Lamport
clock algorithm.
Replicated-Write Protocols
Let’s say that p5 sends a message which is
identified by (p5 , 1).
p2 generates a ticket as follows: (p2, 1, (p5 , 1))
Which update gets done first? Basically, p1,p2
will apply Lamport’s algorithm for totally
ordered multicast.
When an update operation is allowed to
proceed, the sequencers send messages to their
associated processes.
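The ticket triplet (pi, ti, (pj, cj)) from this example can be modelled as a small record. The sketch below is illustrative only: the `Ticket` type and `total_order` helper are hypothetical, and it assumes that ties on the ticket number are broken by sequencer identifier, which is the usual tie-break for Lamport timestamps but is not stated on the slides.

```python
from typing import NamedTuple, Tuple


class Ticket(NamedTuple):
    sequencer: int              # pi: identifier of the issuing sequencer
    number: int                 # ti: ticket number (Lamport clock value)
    message: Tuple[int, int]    # (pj, cj): sender id and its message counter


def total_order(tickets):
    # Assumption: equal ticket numbers are ordered by sequencer id.
    return sorted(tickets, key=lambda t: (t.number, t.sequencer))


t_a = Ticket(sequencer=1, number=1, message=(3, 1))   # p1 tickets (p3, 1)
t_b = Ticket(sequencer=2, number=1, message=(5, 1))   # p2 tickets (p5, 1)
order = [t.message for t in total_order([t_b, t_a])]
```

Under that tie-break assumption, the message from p3 is ordered before the message from p5 at every process, regardless of the order in which the tickets arrive.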
Gossip Architecture
We just studied some architectures for sequential
consistency. What about causal consistency?
The Gossip Architecture supports causally-consistent
lazy replication which in essence refers to the
potential causality between read and write operations.
Clients are allowed to communicate with each other,
but will then have to exchange information on the
operations they performed on the data store. This
exchange of information is done through gossip
messages.
Gossip Architecture
Each RMi maintains for its local copy the
vector timestamp VAL(i)
VAL(i)[i]: the total number of completed write requests that
have been sent from a client to RMi
VAL(i)[j]: the total number of completed write requests that
have been sent from RMj to RMi
This is referred to as the value timestamp and it reflects the
updates that have been completed at the replica.
This timestamp is attached to the reply of a read operation.
Gossip Architecture
Each RMi maintains for its local copy the vector
timestamps WORK(i) which represents those write
operations that have been received (but not
necessarily processed) at RMi
WORK(i)[i]: the total number of write requests that have
been sent from a client to RMi including those that have
been completed by RMi.
WORK(i)[j]: the total number of write requests that have
been sent from RMj to RMi including those that have been
completed by RMi.
This is referred to as the replica timestamp.
This timestamp is attached to the reply of a write operation.
Gossip Architecture
Each client keeps track of the writes that it has seen
so far. The client C maintains a vector timestamp
LOCAL(C) with LOCAL(C)[i] set equal to the most
recent value of the number of writes seen at RMi
(from C’s view point).
This vector timestamp is attached to every request
sent to a replica.
Note that the client can contact a different replica
each time it wants to read or write data.
Two front ends may exchange messages directly;
these messages also carry the timestamp represented
by LOCAL(C).
Gossip Architecture
Write log (queue)
Every write operation, when received by a replica, is
recorded in the update log of the replica.
Two reasons for this:
The update cannot be applied yet; it is held back
It is uncertain if the update has been received by all
replicas.
The entries are sorted by timestamp.
A similar log is needed for read operations.
This is referred to as the read log (or queue).
Gossip Architecture
The Executed Operation table
The same write operation may arrive at a replica from a
front end and in a gossip message from another replica.
To prevent an update from being applied twice, the replica
keeps a list of identifiers of the write operations that have
been applied so far.
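A minimal sketch of the executed operation table is shown below. The `Replica` class, the `(client, counter)` form of the write identifier, and the additive update are assumptions made for illustration.

```python
class Replica:
    def __init__(self):
        self.value = 0
        self.executed = set()   # identifiers of writes already applied

    def apply(self, op_id, delta):
        # Skip a write that has already arrived by another route.
        if op_id in self.executed:
            return False
        self.value += delta
        self.executed.add(op_id)
        return True


rm = Replica()
first = rm.apply(("c0", 1), 5)    # arrives directly from a front end
second = rm.apply(("c0", 1), 5)   # same write, arriving again via gossip
```

The second delivery is ignored, so the update takes effect exactly once.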
Gossip Architecture
Processing read request R from C
Let DEP (R) be the timestamp associated with R. It is set to
LOCAL(C).
The request is sent to RMi (with DEP (R)) which stores the request in
its read queue.
The read request is processed if DEP(R)[j] <= VAL(i)[j] (for all j).
This indicates that RMi has seen the same writes as the client.
As soon as a read operation can be carried out, RMi returns the value
of the requested data item to the client, along with VAL(i).
LOCAL(C) is adjusted to the value max{LOCAL(C)[j],VAL(i)[j]} for
all j.
This makes sense since the value returned by read is potentially the
cumulative result of all previous writes.
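The read rule above can be sketched with plain lists as vector timestamps. The helper names `dominates`, `merge`, and `try_read` are hypothetical, and the read queue is reduced to a "served or not" result.

```python
def dominates(val, dep):
    """True if dep[j] <= val[j] for every j."""
    return all(d <= v for d, v in zip(dep, val))


def merge(a, b):
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


def try_read(val_i, data, local_c):
    """Return (value, new LOCAL(C)), or None if the read must stay queued."""
    dep_r = list(local_c)                 # DEP(R) is set to LOCAL(C)
    if not dominates(val_i, dep_r):
        return None                       # RMi has not yet seen some write
    return data, merge(local_c, val_i)    # the reply carries VAL(i)


blocked = try_read([0, 0, 1], "x=1", [1, 0, 0])
served = try_read([1, 0, 1], "x=2", [1, 0, 0])
```

In the first call the client has seen a write (entry 0) that the replica has not, so the read waits; in the second the replica has caught up and the client's LOCAL(C) advances to the merged timestamp.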
Gossip Architecture
Performing a read operation at a local copy.
Gossip Architecture
Processing a write operation, W, from C
Let DEP (W) be the timestamp associated with W. It is set to
LOCAL(C) .
When the request is received by RMi it increments WORK(i)[i] by 1
but leaves the other entries intact.
This is done so that WORK(i) reflects that RMi has received the
latest write request. At this point it isn’t known if it can be carried
out.
A timestamp ts(W) is derived from DEP(W) by setting ts(W)[i] to
WORK(i)[i]; the rest of entries are as found in DEP(W).
This timestamp is sent back as an acknowledgement to the client,
which subsequently adjusts LOCAL(C) by setting each kth entry to
max{LOCAL(C)[k],ts(W)[k]}.
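The bookkeeping performed when a write arrives can be sketched as follows. The function names `receive_write` and `on_ack` are hypothetical; vector timestamps are plain lists indexed by replica.

```python
def receive_write(work_i, i, local_c):
    """RMi accepts a write from client C; returns (DEP(W), ts(W))."""
    dep_w = list(local_c)      # DEP(W) is set to LOCAL(C)
    work_i[i] += 1             # count the write as received at RMi
    ts_w = list(dep_w)
    ts_w[i] = work_i[i]        # ts(W)[i] = WORK(i)[i]; the rest come from DEP(W)
    return dep_w, ts_w


def on_ack(local_c, ts_w):
    """The client merges the acknowledged timestamp into LOCAL(C)."""
    return [max(a, b) for a, b in zip(local_c, ts_w)]


work_2 = [0, 0, 1]             # replica 2 has already received one write
dep_w, ts_w = receive_write(work_2, 2, [1, 0, 0])
local_c = on_ack([1, 0, 0], ts_w)
```

Note that the acknowledgement is sent as soon as the write is logged; whether the write can actually be applied is decided separately.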
Gossip Architecture
Processing Write Operations (cont)
The write request W is processed if DEP(W)[j] <=
VAL(i)[j] (for all j).
This indicates that RMi has seen the same writes as
the client. This is referred to as the stability
condition.
The write operation takes place.
What if there exists a j such that DEP(W)[j] >
VAL(i)[j]?
This would indicate that there was a write seen
by the client that is not yet seen by RMi.
Gossip Architecture
Processing Write Operations(cont)
VAL(i) is adjusted by setting each jth entry to
max{VAL(i)[j],ts(W)[j]}.
Recall that ts(W)[j] is set to DEP(W)[j] for all j != i and
is set to WORK(i)[i] for j = i (which had been
incremented upon receiving the write request; the end
result is that VAL(i)[i] is incremented by 1).
The following two conditions are satisfied:
All operations sent directly to RMi from other clients but that preceded
W, have been processed.
ts(W)[i] = VAL(i)[i] + 1
All write operations that W depends on have been processed.
ts(W)[j] <= VAL(i)[j] for all j != i
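The stability check and the update of VAL(i) can be sketched in a few lines. The helper names `is_stable` and `apply_write` are hypothetical, and the actual modification of the data item is omitted.

```python
def is_stable(dep_w, val_i):
    """Stability condition: DEP(W)[j] <= VAL(i)[j] for all j."""
    return all(d <= v for d, v in zip(dep_w, val_i))


def apply_write(val_i, dep_w, ts_w):
    """Return the new VAL(i) if W can be applied now, else None."""
    if not is_stable(dep_w, val_i):
        return None            # W stays in the write log
    return [max(v, t) for v, t in zip(val_i, ts_w)]


held = apply_write([0, 0, 1], [1, 0, 0], [1, 0, 2])
done = apply_write([1, 0, 1], [1, 0, 0], [1, 0, 2])
```

In the first call the write depends on an update the replica has not yet applied, so it is held back; once VAL(i) covers DEP(W), the write is applied and VAL(i) absorbs ts(W).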
Gossip Architecture
Performing a write operation at a local copy.
Gossip Architecture
For every gossip message that RMi receives from RMj,
RMi does the following:
RMi adjusts WORK(i) by setting each kth entry equal to
max{WORK(i)[k],WORK(j)[k]}
RMi merges the write operations sent by RMj with its own
RMi applies those writes that have become stable, i.e., a write
request W is processed if DEP(W)[k] <= VAL(i)[k] (for all
k). A write from RMj that is processed should cause
VAL(i)[j] to be incremented by 1.
A gossip message need not contain the entire log, if it
is certain that some of the updates have been seen by
the receiving replica.
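The three gossip steps above can be sketched as one function. This is an illustration under stated assumptions: a replica manager is modelled as a dictionary with the hypothetical keys `val`, `work`, `log`, and `executed`, each logged write is a dictionary carrying its own `id`, `dep`, and `ts`, and the whole sender log is shipped.

```python
def merge(a, b):
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


def receive_gossip(rm, sender_work, sender_log):
    """Process one gossip message at replica manager `rm`."""
    # 1. Merge the sender's replica timestamp into our own WORK.
    rm["work"] = merge(rm["work"], sender_work)
    # 2. Merge the sender's log, skipping writes already logged or applied.
    known = {w["id"] for w in rm["log"]} | rm["executed"]
    rm["log"] += [w for w in sender_log if w["id"] not in known]
    # 3. Repeatedly apply writes whose dependencies VAL now covers.
    progress = True
    while progress:
        progress = False
        for w in list(rm["log"]):
            if all(d <= v for d, v in zip(w["dep"], rm["val"])):
                rm["val"] = merge(rm["val"], w["ts"])
                rm["executed"].add(w["id"])
                rm["log"].remove(w)
                progress = True


w0 = {"id": "W0", "dep": [0, 0, 0], "ts": [1, 0, 0]}
w2 = {"id": "W2", "dep": [1, 0, 0], "ts": [1, 0, 2]}
rm2 = {"val": [0, 0, 1], "work": [0, 0, 2], "log": [w2], "executed": {"W1"}}
receive_gossip(rm2, sender_work=[1, 0, 0], sender_log=[w0])
```

Here the held-back write W2 becomes stable only after W0 arrives by gossip and is applied, so the loop makes a second pass over the log before it stops.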
Gossip Architecture (Example)
Initial state
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Replica 0: VAL = (0,0,0), WORK = (0,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Gossip Architecture (Example)
Client 0 sends a write, W0, to replica 0 with DEP(W0) = (0,0,0)
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Replica 0: VAL = (0,0,0), WORK = (0,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Gossip Architecture (Example)
WORK is updated at replica 0
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Replica 0: VAL = (0,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Gossip Architecture (Example)
Client 0 receives an ack, ack(ts(W0)), from replica 0 for its write
LOCAL at client 0 changes from (0,0,0) to (1,0,0)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,0)
Replica 0: VAL = (0,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Gossip Architecture (Example)
W0 is applied at replica 0 since DEP(W0) <= VAL; VAL changes
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,0)
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Gossip Architecture (Example)
State after Client 1 sends a write, W1, to replica 2
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,1); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1)
Gossip Architecture (Example)
Client 0 sends a write message W2 to replica 2 with DEP(W2) = (1,0,0)
W2 cannot be applied yet since replica 2 has not seen the write done at replica 0 (DEP(W2)[0] = 1 > VAL[0] = 0)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
An ack, ack(ts(W2)), has been returned to client 0, which then updates LOCAL
from (1,0,0) to (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
Replicas 0 and 2 exchange update propagation messages (gossip)
WORK at both replicas is adjusted
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
Replica 0 has one write operation (W0). This is sent to replica 2 with
DEP(W0). Replica 2 has write operation W1. This is sent to
replica 0 with DEP(W1). Replica 2 also sends W2 with DEP(W2).
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
Replica 2 can carry out W0 since DEP(W0) <= VAL
Replica 0 can carry out W1 since DEP(W1) <= VAL
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
VAL in replica 0 and replica 2 are updated
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,1), WORK = (1,0,2); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (1,0,1), WORK = (1,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Gossip Architecture (Example)
W2 can now be executed at replica 2
since DEP(W2) <= VAL; W2 can also be
applied at replica 0
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Replica 0: VAL = (1,0,1), WORK = (1,0,2); log: DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (1,0,1), WORK = (1,0,2); log: DEP(W1) = (0,0,0), ts(W1) = (0,0,1); DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Summary
There are good reasons to introduce
replication.
However, replication introduces consistency
problems.
Keeping all replicas strictly consistent may severely degrade performance,
especially in large-scale systems.
Thus consistency is relaxed.
We have studied consistency models and
protocols.