Set 17: Fault-Tolerant Register Simulations
CSCE 668
DISTRIBUTED ALGORITHMS AND
SYSTEMS
CSCE 668
Fall 2011
Prof. Jennifer Welch
Fault-Tolerant Shared Memory Simulations
Previous algorithms implemented shared variable on
top of message passing, assuming no failures.
What if some processors might crash?
Can we still provide a shared read/write variable on
top of message passing?
Yes, even in an asynchronous system, if we have
enough nonfaulty processors.
First, we must specify a failure-prone shared memory.
Specification of f-Resilient Shared Memory
Inputs are invocations on the shared object.
Outputs are responses of the shared object.
A sequence of inputs and outputs is allowable iff:
there is a partitioning of proc. indices into "faulty" and
"nonfaulty"
Correct Interaction: each proc. alternates invocations and
matching responses
Nonfaulty Liveness: Every invocation by a nonfaulty proc. has
a matching response
Extended Linearizability: Linearizability holds for all the
completed operations and some subset of the pending
operations
Assumptions for Algorithm
Each read/write variable ("register") to be
simulated has
one reader and
one writer
(Next topic will be to build more powerful variables
out of these.)
There are n procs. which are cooperating to
simulate a collection of such variables
Underlying communication system is asynchronous
message passing
n > 2f (less than half the processors can crash)
Main Ideas of Algorithm
Each simulated register has a replica stored at each
of the n procs., not just at the designated reader and
writer of that register.
Use the redundant storage to provide fault-tolerance.
Describe algorithm just for one simulated register; use
a separate copy of the same algorithm in parallel for
each simulated register.
Writing the Simulated Register
generate the next sequence number
send a message with the value and the sequence number to all the procs.
each recipient updates its local copy of the register
wait to get back an ack from > n/2 procs. (safe since n - f > n/2)
do the ack for the write
Reading the Simulated Register
send a request to all the procs.
each recipient sends back current value of its replica
wait to get reply from > n/2 procs.
return value associated with largest sequence number
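The write and read steps can be sketched in Python as a single-process simulation. This is only an illustration under stated assumptions: the names `Replica`, `QuorumRegister`, and the `reach` parameter (the indices of the procs. whose replies arrive first) are hypothetical, message passing is replaced by direct method calls, and the check that a recipient only overwrites its copy with a newer sequence number is an added assumption. Where a real proc. would block waiting for a majority, the sketch raises an error.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One proc.'s local copy of the simulated register (hypothetical name)."""
    seq: int = 0
    value: object = None
    crashed: bool = False

    def on_write(self, seq, value):
        # Assumption: only adopt the value if its sequence number is newer.
        if seq > self.seq:
            self.seq, self.value = seq, value
        return "ack"

    def on_read(self):
        # Send back the current (sequence number, value) of this replica.
        return (self.seq, self.value)


class QuorumRegister:
    """1-reader, 1-writer register replicated at n procs. (simulation only)."""

    def __init__(self, n, initial=None):
        self.n = n
        self.replicas = [Replica(value=initial) for _ in range(n)]
        self.next_seq = 0  # maintained by the single writer

    def _contact(self, reach):
        # reach: indices of procs. that respond; None means every proc.
        ids = range(self.n) if reach is None else reach
        return [self.replicas[i] for i in ids if not self.replicas[i].crashed]

    def write(self, value, reach=None):
        self.next_seq += 1
        acks = [r.on_write(self.next_seq, value) for r in self._contact(reach)]
        if len(acks) <= self.n // 2:
            raise RuntimeError("no majority of acks; a real writer would wait")

    def read(self, reach=None):
        replies = [r.on_read() for r in self._contact(reach)]
        if len(replies) <= self.n // 2:
            raise RuntimeError("no majority of replies; a real reader would wait")
        # Return the value associated with the largest sequence number.
        return max(replies, key=lambda reply: reply[0])[1]
```

For example, with n = 5, a write that reaches procs. 0, 1, 2 followed by a read that hears from procs. 2, 3, 4 still returns the written value, because proc. 2 is in both majorities.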
Key Idea for Correctness
Each read should return the value of "the most
recent" write.
Each read or write communicates with > n/2 procs.,
so the set of procs. participating in operation O1 is
guaranteed to intersect with the set of procs.
participating in any other operation O2.
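The intersection claim can be checked by brute force for small n. The sketch below is illustrative only; the function names are hypothetical. It enumerates the smallest majorities (size floor(n/2) + 1), which suffices because any larger set contains one of them, and it also shows that sets of size exactly n/2 can be disjoint when n is even.

```python
from itertools import combinations


def majorities(n):
    """All sets of exactly floor(n/2) + 1 proc. indices (smallest majorities)."""
    size = n // 2 + 1
    return [set(q) for q in combinations(range(n), size)]


def all_majorities_intersect(n):
    """True if every pair of smallest majorities shares at least one proc."""
    return all(q1 & q2 for q1, q2 in combinations(majorities(n), 2))


def halves_can_be_disjoint(n):
    """For even n, True if two sets of size n/2 with no common proc. exist."""
    halves = [set(q) for q in combinations(range(n), n // 2)]
    return any(not (h1 & h2) for h1, h2 in combinations(halves, 2))
```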
But What About Asynchrony?
The underlying communication system is asynchronous:
message on behalf of one operation could be overtaken by
a message on behalf of a later operation.
Avoid such problems by adding additional mechanism
to the algorithm:
reader and writer keep track of "status" of each link
don't send a msg on a link until ack from previous msg has
been received
Outline of Correctness Proof
The interesting part is proving (extended) linearizability.
Let ts(W) = sequence number of W
Let ts(R) = sequence number of the write that R reads from
Let O1 → O2 denote that O1 finishes before O2 starts
Key lemmas:
If W1 → W2, then ts(W1) < ts(W2)
If W → R, then ts(W) ≤ ts(R)
If R → W, then ts(R) < ts(W)
If R1 → R2, then ts(R1) ≤ ts(R2)
Matching Lower Bound on Resiliency
Theorem (10.22): No simulation of a 1-reader, 1-writer read/write linearizable register using n procs. and asynchronous message passing can tolerate f ≥ n/2 crash failures.
Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.
Lower Bound Proof
Partition procs. into two sets, S0 and S1, each of size f.
Let α0 be an admissible exec. of A s.t.
initial value of simulated register is 0
all procs. in S1 crash initially
proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked.
the write completes at some time t0 without any proc. in S0 receiving a message from any proc. in S1: must happen since A is supposed to tolerate f failures.
[Figure: execution α0, with proc. sets S0 and S1 and proc. p0.]
Lower Bound Proof
Let α1 be an admissible exec. of A s.t.
initial value of simulated register is 0
all procs. in S0 crash initially
proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked.
the read completes at some time t1 without any proc. in S1 receiving a message from any proc. in S0: must happen since A is supposed to tolerate f failures
the read returns 0: must be since A guarantees linearizability
[Figure: execution α1, with proc. p1.]
Lower Bound Proof
Now create admissible execution β by "merging" the views of procs. in S0 from α0 and the views of procs. in S1 from α1:
messages that go between S0 and S1 are delayed so that they don't arrive until after time t1.
β is not linearizable, since read(0) follows write(1).
Contradiction.
[Figure: execution β, combining the view of p0 in S0 from α0 with the view of p1 in S1 from α1; messages between S0 and S1 are delayed until after t1.]
Lower Bound Diagram for n = 2
[Figure: timelines for p0 and p1 in α0, α1, and β, with times 0, t0, t0+1, t1 marked. In α0, p0 performs write(1). In α1, p1 performs read(0). In β, p0 performs write(1) and p1 then performs read(0).]
Simulating R/W Registers Using R/W Registers
The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing.
How can we get more powerful (flexible) registers, i.e., with
more readers
more writers
We'll start with a warm-up:
simulate a multi-valued register using binary-valued registers (1-reader and 1-writer)
Wait-Free Register Simulations
Asynchronous model
Linearizable shared registers
Wait-free
tolerate any number of crash failures
We want to simulate one kind of (n-1)-resilient
shared memory with another kind of (n-1)-resilient
memory
recall earlier definition of f-resilient shared memory
recall earlier definition of one kind of communication
system simulating another
Alternative Definition of Wait-Free Simulation
Alternative definition for the wait-free shared
memory case:
The failure-free version of one (SM) communication
system simulates the failure-free version of the
other, and
for any prefix of an admissible execution of the
simulation algorithm in which pi has a pending
operation, there is an extension in which the
operation completes and only pi takes steps.
Equivalent to previous definition, sometimes more
convenient.
Proving Linearizability
We've seen one approach:
explicitly construct a permutation and prove that it has the
desired properties
Alternative approach:
identify a time point for each operation, between invocation
and response: linearization points
Linearization points give the permutation
Obviously real-time order is preserved
Just need to show that legality holds
Overview of Register Simulations
[Figure: chain of simulations, each register type built from the one before it]
single-reader, single-writer, binary-valued
single-reader, single-writer, multi-valued
multi-reader, single-writer, multi-valued
multi-reader, multi-writer, multi-valued
Multi-Valued From Binary
Some ideas…
Use a different binary register to store each bit of
the multi-valued register being simulated
Read algorithm is to read all the binary registers
and return the resulting value
Write algorithm is to write the new bits in some
order
Difficulties arise if the reader overlaps a slow
write and sees some new bits and some old bits
A Unary Approach
Suppose the simulated register is to take on the
values {0,…,K-1}.
Use an array of K binary registers, B[0..K-1]
represent value v by having B[v] = 1 and the other
entries 0
Read algorithm: read B[0], B[1],…, until finding
the first 1; return the index
Write algorithm: zero out the old entry of B and
set the new entry
Problems with Unary Approach
OK if reads and writes don't overlap.
If they do, have to worry about
reader never finding a 1 in B
new-old inversion: writer writes 1, then 2, but reader reads
2, then 1.
Counter-example execution on next slide
since binary registers are linearizable, we just mark the
linearization points of the reads and writes on the binary
registers
Counter-Example
Initially B[0] = B[1] = B[2] = 0 and B[3] = 1
[Figure: timeline of a new-old inversion.
Writer: write 1 (write 1 to B[1], write 0 to B[3]), then write 2 (write 1 to B[2], write 0 to B[1]).
Reader: read 2 (read 0 from B[0], read 0 from B[1], read 1 from B[2]), then read 1 (read 0 from B[0], read 1 from B[1]).]
Corrected Multi-Valued Algorithm
To prevent "falling off the edge" of the end of B
without finding a 1, write algorithm only clears (sets
to 0) entries that are smaller than the entry that is set (to
1)
To prevent new-old inversions, read algorithm scans
up to find first 1, and then scans down to make sure
those entries are still 0.
returns smallest value associated with a 1 entry in B that is
observed during the downward scan
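The corrected write and read algorithms can be sketched as follows. This is a minimal sequential sketch, assuming each access to the Python list `B` stands in for one linearizable operation on a binary register; the class name `MultiValuedRegister` and the choice to initialize `B[initial] = 1` are assumptions made for the illustration.

```python
class MultiValuedRegister:
    """K-valued register built from K binary registers B[0..K-1] (sketch)."""

    def __init__(self, K, initial=0):
        self.B = [0] * K
        self.B[initial] = 1  # assumption: initial value is encoded in unary

    def write(self, v):
        # Set the new entry first, then clear only the smaller entries,
        # so a reader scanning upward always finds some 1.
        self.B[v] = 1
        for i in range(v - 1, -1, -1):
            self.B[i] = 0

    def read(self):
        # Upward scan: find the first entry equal to 1.
        i = 0
        while self.B[i] == 0:
            i += 1
        v = i
        # Downward scan: re-check the smaller entries and keep the
        # smallest index that is now observed to hold a 1.
        for j in range(i - 1, -1, -1):
            if self.B[j] == 1:
                v = j
        return v
```

The writer performs at most K low-level writes, and the reader performs at most K reads going up plus at most K-1 going down.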
Multi-Valued Construction
[Figure: the reader alg. and writer alg. sit between the reader/writer and the binary registers B[0], …, B[K-1]; high-level reads and writes are turned into low-level reads and writes of 0/1.]
Algorithm is Wait-Free
Algorithm for writer does not involve any waiting:
just do at most K (low-level) writes
Algorithm for reader does not involve any waiting:
just do at most 2K-1 (low-level) reads.
Algorithm Ensures Linearizability
Describe an ordering of the (high-level) operations
that is obviously legal (by the definition of the
ordering)
Then show that it respects real-time ordering of non-overlapping operations.
Fix any admissible execution of the algorithm.
Fix any linearization of the low-level operations (on
the binary registers)
exists since the execution is admissible, which implies the
underlying communication system (the binary registers)
behaves properly (is linearizable)
Reads-From Relations
Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations.
High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.
Reads-From Diagram
[Figure: a high-level read 1 (read 0 from B[0], read 1 from B[1], read 0 from B[0]) and a high-level write 1 (write 1 to B[1], write 0 to B[0]), with arrows marking the low-level reads-from relationships and the high-level reads-from relationship.]
Construct Permutation
Place all (high-level) writes in the order in which they occur
no concurrent writes
Consider each (high-level) read in the order in which they occur
no concurrent reads
Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.
Correctness of Permutation
Permutation is legal by construction
each read is placed after the write that it reads from
Why does it preserve order of non-overlapping operations?
two writes: by construction
a read that precedes a write in the execution: OK, since the read cannot read from a later write.
Correctness of Permutation
Lemma (10.1): Suppose
(high-level) read R returns v
R reads B[u], with u < v, during its upward scan
this read of B[u] reads from a (low-level) write
contained in high-level write W1
Then R reads from a write that follows W1.
Figure for Lemma 10.1
[Figure: high-level read v contains read 0 from B[u] (during upward scan, u < v) and read 1 from B[v] (top of upward scan or during downward scan). High-level write w contains write 1 to B[w] and write 0 to B[u]; high-level write v contains write 1 to B[v]. Arrows mark the low-level reads-from relationships and the high-level reads-from relationship.]
Correctness of Permutation
Two cases remain to show that real-time order of non-overlapping operations is preserved:
a write that precedes a read in the execution
two reads
Proofs of both cases are by contradiction, showing that there is a situation that violates Lemma 10.1.
Multi-Reader from Single-Reader
First consider a simple idea:
Use a different single-reader register for each reader (Val[1],…,Val[n]).
n is number of readers
Write algorithm: write the new value in each of the
single-reader registers
Read algorithm: read your own single-reader
register and return that value
Counter-Example
Suppose 0 is initial value of multi-reader register.
Suppose n = 2.
[Figure: pw performs write 1 (write 1 to Val[1], then write 1 to Val[2]). Between the two low-level writes, p1 performs read 1 (read 1 from Val[1]) and then p2 performs read 0 (read 0 from Val[2]).]
New Idea for Correct Algorithm
Have the multi-reader algorithm write some
information to the single-reader registers to prevent
new-old inversions on the simulated register.
This is provably necessary…
Readers Must Write
Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write.
Proof: Suppose in contradiction there is an algorithm
in which readers never write.
Readers Must Write
pw is the writer, p1 and p2 are the readers
initial value of simulated register is 0
S1 is the set of single-reader registers that are read
by p1
S2 is the set of single-reader registers that are read
by p2
Readers Must Write
Consider execution in which pw writes 1 to the simulated register.
The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers.
Each wj is a write to a register in either S1 or S2.
Let $v_j^i$ be the value that would be returned if pi were to do a read immediately after wj.
Readers Must Write
[Figure: pw's write 1 consists of the low-level writes w1, …, wj, wj+1, …, wk; a read by pi immediately after wj returns $v_j^i$.]
Readers Must Write
For each reader (p1 and p2), there is a point when the writes w1, …, wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new).
For p1:
$v_1^1 = v_2^1 = \cdots = v_{a-1}^1 = 0$
$v_a^1 = \cdots = v_k^1 = 1$
For p2:
$v_1^2 = v_2^2 = \cdots = v_{b-1}^2 = 0$
$v_b^2 = \cdots = v_k^2 = 1$
Readers Must Write
Why must a and b be different?
a marks the point when p1's view of the simulated
register's current value changes from old to new. So
wa must write to a register in S1.
Similarly, wb must write to a register in S2.
W.l.o.g., assume a < b.
Readers Must Write
[Figure: pw's write 1 consists of the low-level writes w1, …, wa, wa+1, …, wk. After wa, p1 performs a read that returns $v_a^1 = 1$, and then p2 performs a read that returns $v_a^2 = 0$.]
Readers Must Write
Where did we use the assumption in this proof that
readers don't write?
The writer doing the slow write of 1 is oblivious to
whether any readers are concurrently reading.
The readers are oblivious to each other.
Corrected Multi-Reader Algorithm
As part of the algorithm for the read on the simulated
register, announce the value to be returned.
Before deciding what value to return, check what
values have been returned by previous reads and
don't pick anything earlier.
Need timestamps to be able to determine relative age
of returned values.
Reader pi uses row i of a matrix to report its most
recently returned value to all the other readers
(remember, we only have single-reader variables at
our disposal)
Writer's Algorithm
get the next sequence number
use integers that are increased by one each time
write value and sequence number to Val[1],…,Val[n] (one copy for each reader)
Reader pi's Algorithm
read the value and timestamp written by the writer to
Val[i]
read the value and timestamp written by each reader
to Report[j,i]
choose the value-timestamp pair with the largest
timestamp
write that pair to row i of Report
return value associated with that pair
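The writer's and reader's steps can be sketched together as follows. This is a sequential illustration under stated assumptions: the class name `MultiReaderRegister` and the generator `write_steps` (which exposes the writer's individual low-level writes so that reads can be interleaved) are hypothetical, indices are 0-based rather than 1-based, and each list entry stands in for one single-reader register.

```python
class MultiReaderRegister:
    """Multi-reader register from single-reader registers (sketch)."""

    def __init__(self, n, initial=None):
        self.n = n
        self.seq = 0
        # Val[i]: written by the writer, read only by reader i.
        self.Val = [(0, initial)] * n
        # Report[i][j]: written by reader i, read only by reader j.
        self.Report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, value):
        """Yield after each low-level write so a caller can interleave reads."""
        self.seq += 1
        for i in range(self.n):
            self.Val[i] = (self.seq, value)
            yield

    def write(self, value):
        for _ in self.write_steps(value):
            pass

    def read(self, i):
        # One read of Val[i] plus n reads of column i of Report.
        candidates = [self.Val[i]]
        candidates += [self.Report[j][i] for j in range(self.n)]
        # Choose the pair with the largest timestamp.
        best = max(candidates, key=lambda pair: pair[0])
        # Announce it in row i of Report, then return its value.
        for j in range(self.n):
            self.Report[i][j] = best
        return best[1]
```

If the writer has written only the first reader's copy when that reader returns the new value, the second reader still sees the new value through `Report`, so the new-old inversion from the simple approach does not occur in this scenario.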
Multi-Reader Construction
[Figure: construction for 3 readers, with the shared arrays Report and Val; arrows mark the reads and writes performed by the three reader algs. and the writer alg.]
Correctness of Multi-Reader Algorithm
Wait-free
writer does n low-level writes
reader does n+1 low-level reads and n low-level writes
To prove linearizability, explicitly construct a
permutation of the high-level operations that is
clearly legal and then prove it preserves real-time
order of non-overlapping operations.
Constructing the Permutation
Put in all writes in the order in which they occur in the
execution
since single-writer, writes do not overlap
Consider the reads in the order of their responses in
the execution.
read R reads from write W if W generates the timestamp
associated with the value R returns
place R immediately before the write that follows W
By construction, the permutation is legal.
Preserving Real-Time Order
write-write: by construction of the permutation π
read-write: Suppose R precedes W in the execution α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.
read-read: Suppose Ri by pi precedes Rj by pj in α. Then pj reads Ri's timestamp or a larger one from Report[i,j]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.
Multi-Writer from Single-Writer
Idea:
each writer should announce each value it wants to write to
all the readers, by writing the value to its own (SW,MR)
register.
each reader reads all the values written by the writers and
returns the latest one
How to determine latest value?
use timestamps
new wrinkle is that multiple processes generate timestamps,
need to coordinate
Using Vector Timestamps
Data structure VT at each proc. consisting of a vector of m integers
m is the number of writers
To get a new timestamp, writer pi increments VT[i] by one
To compare timestamps, use lexicographic order
This is a total order that extends the partial order defined for vector timestamps
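A small sketch of these operations, assuming timestamps are represented as Python tuples (which Python already compares lexicographically). The helper names `new_timestamp` and `happened_before` are hypothetical; `happened_before` is the usual componentwise partial order on vector timestamps.

```python
def new_timestamp(vt, i):
    """Return a copy of vector timestamp vt with entry i incremented by one."""
    ts = list(vt)
    ts[i] += 1
    return tuple(ts)


def happened_before(a, b):
    """Partial order: a <= b in every entry, and a differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Lexicographic comparison is the built-in tuple comparison: a < b.
# Whenever happened_before(a, b) holds, a < b holds as well, and
# timestamps that are incomparable in the partial order still get ordered.
```

For instance, (2, 0) and (0, 3) are incomparable in the partial order, but lexicographic order places (0, 3) before (2, 0).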
Writer pw's Algorithm
get the next vector timestamp:
read the timestamp written by each writer to TS[0],…,TS[m-1]
extract the i-th entry of each TS[i]
increment my own entry by 1
write my new timestamp to TS[w]
write value and timestamp to Val[w]
Reader pr's Algorithm
read the value and timestamp written by each
writer to Val[0], …, Val[m-1]
choose the value-timestamp pair with the largest
timestamp
return value associated with that pair
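The writer's and reader's algorithms can be sketched as follows, again as a sequential illustration. The class name `MultiWriterRegister` is hypothetical, each entry of `TS` and `Val` stands in for one single-writer multi-reader register, and vector timestamps are assumed to be tuples compared lexicographically.

```python
class MultiWriterRegister:
    """Multi-writer register from single-writer registers (sketch)."""

    def __init__(self, m, initial=None):
        self.m = m
        zero = (0,) * m
        self.TS = [zero] * m              # TS[w]: written only by writer w
        self.Val = [(zero, initial)] * m  # Val[w]: written only by writer w

    def write(self, w, value):
        # Read every TS[i] and extract its i-th entry.
        ts = [self.TS[i][i] for i in range(self.m)]
        # Increment my own entry by 1.
        ts[w] += 1
        ts = tuple(ts)
        # Two low-level writes: the timestamp, then the value with it.
        self.TS[w] = ts
        self.Val[w] = (ts, value)

    def read(self):
        # Read all of Val and return the value with the largest timestamp.
        return max(self.Val, key=lambda pair: pair[0])[1]
```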
Multi-Writer Construction
[Figure: construction for 3 readers and 2 writers, with the shared arrays TS and Val; arrows mark the reads and writes performed by the writer algs. and reader algs.]
Correctness of Multi-Writer Algorithm
Wait-free
writer does m low-level reads and 2 low-level writes
reader does m low-level reads
To prove linearizability, explicitly construct a
permutation of the high-level operations that is
clearly legal and then prove it preserves real-time
order of non-overlapping operations.
Constructing the Permutation
Put in all writes in timestamp order
Lemma 10.6 shows this preserves order of non-overlapping
writes
Consider the reads in the order of their responses in
the execution.
read R reads from write W if W generates the timestamp
associated with the value R returns
place R immediately before the write that follows W
By construction, the permutation is legal.
Preserving Real-Time Order
write-write: by construction of the permutation π
read-write: Suppose R precedes W in the execution α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.
read-read: Suppose Ri by pi precedes Rj by pj in α. By Lemmas 10.6 and 10.7, pj reads Ri's timestamp or a larger one from Val[ ]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.
Atomic Snapshot Objects (ASO)
An array of elements:
each one can be updated by just one proc.
a proc. can scan the whole array "atomically"
Useful abstraction for designing shared memory
algorithms
Can be wait-free implemented from read/write
variables (registers)
ASO Sequential Specification
Operations are
invocation scan_i, response return_i(V) where V is an array of n values, 0 ≤ i ≤ n-1
invocation update_i(d) where d is a data value, response ack_i, 0 ≤ i ≤ n-1
Legal sequences: for each V returned by a scan, V[i] equals parameter of latest preceding update_i (or the initial value of entry i if there is none)
ASO Example
Suppose array = [a,b,c] initially.
This sequence is legal:
update1(x), update2(y), scan([a,x,y]), update0(z),
scan([z,x,y])
This sequence is not legal:
update1(x), update2(y), update0(z), scan([a,x,y])
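The two example sequences can be checked against the sequential specification with a short script. This is only a sketch: the function name `is_legal` and the tuple encoding of operations, `("update", i, d)` and `("scan", V)`, are assumptions made for the illustration.

```python
def is_legal(initial, ops):
    """Check a sequence of ASO operations against the sequential spec.

    ops is a list of ("update", i, d) and ("scan", V) tuples.
    """
    array = list(initial)
    for op in ops:
        if op[0] == "update":
            _, i, d = op
            array[i] = d  # latest update_i determines entry i
        else:
            _, view = op
            # Every entry of V must equal the latest preceding update
            # to that entry, or the initial value if there is none.
            if list(view) != array:
                return False
    return True


legal = [("update", 1, "x"), ("update", 2, "y"), ("scan", ["a", "x", "y"]),
         ("update", 0, "z"), ("scan", ["z", "x", "y"])]
not_legal = [("update", 1, "x"), ("update", 2, "y"), ("update", 0, "z"),
             ("scan", ["a", "x", "y"])]
```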
Sketch of Wait-Free Implementation Using Registers
Store each array entry ("segment") in a different
read/write variable
Update algorithm:
write to the variable holding that segment
Scan algorithm:
Collect (read) all the values in the segments twice
If no segment is updated during the "double collect", then
we got a valid snapshot -- return it
Issues:
how to tell if a segment is updated?
what to do if a segment is updated?
Detecting Updates
Simple idea is to tag each value stored in a segment with a counter (1,2,3,…)
requires unbounded space
More complex, bounded-space, solution is given in the textbook
uses a "handshaking" mechanism
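The double collect with counter tags can be sketched as follows. This covers only the simple unbounded-counter idea, not the bounded handshaking solution, and it is not wait-free on its own, since a scan keeps retrying while updates interfere. The class name `Segments` and the `between_collects` hook (used to inject an update between the two collects in a sequential run) are hypothetical.

```python
class Segments:
    """Array of segments, each tagged with an unbounded counter (sketch)."""

    def __init__(self, initial):
        self.seg = [(0, d) for d in initial]

    def update(self, i, d):
        # Bump the counter so even rewriting the same value is detectable.
        counter, _ = self.seg[i]
        self.seg[i] = (counter + 1, d)

    def collect(self):
        # Read all segments one by one.
        return list(self.seg)

    def scan(self, between_collects=lambda: None):
        """Repeat double collects until two consecutive collects agree.

        Returns the snapshot and the number of double collects performed.
        """
        attempts = 0
        while True:
            attempts += 1
            first = self.collect()
            between_collects()  # stands in for concurrent updaters
            second = self.collect()
            if first == second:
                return [d for _, d in second], attempts
```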
Reacting to Update During Scan
If a scanner observes enough changes to a particular
segment, then one of the overlapping updaters has
performed a complete update during this scan
Embed a scan at the beginning of each update:
the view obtained in this scan is written with the data to the
segment
Scanner returns view obtained in last collect
Complexity of ASO Algorithm
Number of building-block read/write variables is O(n) (although some are large)
Scan algorithm uses $O(n^2)$ low-level reads and writes.
Update algorithm uses $O(n^2)$ low-level reads and writes.