Consistent Global States of Distributed
Systems: Fundamental Concepts and
Mechanisms
Authors: Ozalp Babaoglu and Keith Marzullo
Distributed Systems: 526 U1580
Professor: Ching-Chi Hsu
Introduction
Many problems in distributed computing can be cast as executing
some notification or reaction when the state of the system
satisfies a particular condition
Global Predicate Evaluation (GPE): to establish the truth of a
Boolean expression whose variables may refer to the global
system state
A global state may not be consistent
Asynchronous system:
no bounds on the relative speeds of processes and message delays
Impossible to maintain synchronized local clocks
Communication remains the only possible mechanism for
synchronization
channels are reliable but may deliver messages out of order
Outline
Two classes of solutions to the GPE problem:
A reactive architecture: each process, when executing an event,
notifies p0 by sending it a message describing the event
A snapshot architecture: the monitor p0 sends each process a
'state enquiry' message.
Definitions (1)
distributed systems: a collection of sequential processes p1, p2, ...,
pn networked by unidirectional communication channels
events: the activity of each sequential process, which can be
internal events or communications: send(m) or receive(m) with
another process
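local history of process pi: $h_i = e_i^1 e_i^2 \ldots$
global history: $H = h_1 \cup h_2 \cup \ldots \cup h_n$
cause-effect relation $\rightarrow$:
If $e_i^k, e_i^l \in h_i$ and $k < l$, then $e_i^k \rightarrow e_i^l$
If $e_i = send(m)$ and $e_j = receive(m)$, then $e_i \rightarrow e_j$
If $e \rightarrow e'$ and $e' \rightarrow e''$, then $e \rightarrow e''$
Concurrent $e \parallel e'$: neither $e \rightarrow e'$ nor $e' \rightarrow e$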
Definitions (2)
distributed computation: a partially ordered set defined by the pair
$(H, \rightarrow)$
space-time diagram: graphical representation of a distributed computation
[Figure: space-time diagram of processes p1, p2, p3 with events $e_1^1 \ldots e_1^6$, $e_2^1 \ldots e_2^3$ and $e_3^1 \ldots e_3^6$]
Definitions (3)
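local state of pi immediately after executing event $e_i^k$ is denoted
by $\sigma_i^k$
global state: $\Sigma = (\sigma_1, \ldots, \sigma_n)$
a cut C, specified by $(c_1, \ldots, c_n)$, is a subset of the global history H that contains an
initial prefix of each of the local histories, i.e. $C = h_1^{c_1} \cup \ldots \cup h_n^{c_n}$,
where $h_i^{c_i}$ is the prefix of $h_i$ made of its first $c_i$ events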
a run R is a total ordering of all events in H and is consistent with
each local history
Example: pp6
Note that a single distributed computation may have many runs
Example
Inconsistent cut and phantom deadlock
[Figure: space-time diagram of p1, p2, p3 exchanging req and resp messages, with two cuts C and C']
Consistency
A consistent cut C is such that
$\forall e, e'$: $(e \in C) \wedge (e' \rightarrow e) \Rightarrow e' \in C$
A consistent global state is one corresponding to a consistent cut
A consistent run R is such that
$\forall e, e'$: $(e \rightarrow e') \Rightarrow$ e appears before e' in R
Example: pp6
If the run is consistent then all the global states in the sequence
will be consistent as well
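The two definitions above can be checked mechanically. The following is a minimal sketch, assuming events are plain strings and the relation is supplied as a list of direct cause-effect pairs; the function names and the small example computation are hypothetical and not part of the slides.

```python
def is_consistent_cut(cut, edges):
    """Return True if the cut is closed under the cause-effect relation.

    `edges` holds pairs (a, b) meaning a -> b. Listing only the direct
    edges (local order and send/receive pairs) is enough: closure under
    direct predecessors implies closure under all transitive ones.
    """
    return all(a in cut for (a, b) in edges if b in cut)


def is_consistent_run(run, edges):
    """Return True if every cause appears before its effect in the run."""
    position = {event: index for index, event in enumerate(run)}
    return all(position[a] < position[b] for (a, b) in edges)


# Hypothetical computation: p1 executes a1 then a2, p2 executes b1 then b2,
# and a1 is the send of the message whose receive is b2.
EDGES = [("a1", "a2"), ("b1", "b2"), ("a1", "b2")]
```

With these edges, the cut {b1, b2} is inconsistent because it contains the receive b2 without the matching send a1.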
Observing Distributed Computations
A monitor p0 will assume a passive role in that it will not send
any messages of its own
The application processes notify p0 by sending it a message
whenever they execute an event
The monitor p0 constructs an observation of the underlying
distributed computation as the notifications arrive
Due to the variability of message delays, an observation can
correspond to a consistent run, an inconsistent run or no run at all
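$O_1 = e_2^1\, e_1^1\, e_3^1\, e_3^2\, e_3^4\, e_1^2\, e_2^2\, e_3^3\, e_1^3\, e_1^4\, e_3^5 \ldots$ => not a run ($e_3^4$ is observed before $e_3^3$)
$O_2 = e_1^1\, e_3^1\, e_2^1\, e_3^2\, e_1^2\, e_3^3\, e_3^4\, e_1^3\, e_2^2\, e_3^5\, e_3^6 \ldots$ => inconsistent run
$O_3 = e_3^1\, e_2^1\, e_1^1\, e_1^2\, e_3^2\, e_3^3\, e_1^3\, e_3^4\, e_1^4\, e_2^2\, e_1^5 \ldots$ => consistent run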
Message order can be restored by defining a delivery rule that
decides when received messages are to be presented to the
application process
FIFO delivery
First-In-First-Out (FIFO) delivery
for all messages m and m' from pi to pj:
$send_i(m) \rightarrow send_i(m') \Rightarrow deliver_j(m) \rightarrow deliver_j(m')$
FIFO can be implemented by adding sequence numbers to
messages
While FIFO delivery is sufficient to guarantee that observations
correspond to runs, it is not sufficient to guarantee consistent
observations
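As a rough illustration of FIFO delivery with sequence numbers, the sketch below buffers out-of-order messages per sender and releases them in sequence. The class name, the zero-based numbering, and the method signature are assumptions made for this example.

```python
class FifoReceiver:
    """Deliver each sender's messages in the order of their sequence numbers."""

    def __init__(self):
        self.next_seq = {}  # sender -> next sequence number to deliver
        self.held = {}      # sender -> {sequence number: message} not yet delivered

    def receive(self, sender, seq, message):
        """Accept a message from the channel; return the messages now deliverable."""
        self.held.setdefault(sender, {})[seq] = message
        expected = self.next_seq.get(sender, 0)  # assumption: numbering starts at 0
        delivered = []
        # Release the longest gap-free run starting at the expected number.
        while expected in self.held[sender]:
            delivered.append(self.held[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered
```

A message that arrives early is held back until every lower-numbered message from the same sender has been delivered; messages from different senders are not ordered relative to each other, which is why FIFO alone does not give consistent observations.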
Observing Distributed Computations
with Real-Time Clocks
Environment:
message delays are bounded by $\delta$
channels are FIFO
existence of a global real-time clock
each message includes RC(e), the global real-time clock when event
e occurs, as its timestamp
DR1:
At time t, deliver all received messages with timestamps up to $t - \delta$ in
increasing timestamp order
Observations are consistent iff the following is satisfied
Clock condition: $e \rightarrow e' \Rightarrow RC(e) < RC(e')$
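Delivery rule DR1 can be sketched as a priority queue keyed by timestamp. This is only an illustration under the slide's assumptions (a global clock and a known delay bound); the class name and the use of integer time values are hypothetical.

```python
import heapq


class RealTimeMonitor:
    """Sketch of DR1: hold notifications until they are older than the delay bound."""

    def __init__(self, delta):
        self.delta = delta  # assumed upper bound on message delay
        self.pending = []   # min-heap of (timestamp, event)

    def receive(self, timestamp, event):
        """Buffer a notification carrying its real-time timestamp RC(e)."""
        heapq.heappush(self.pending, (timestamp, event))

    def deliver_at(self, t):
        """At time t, deliver everything with timestamp <= t - delta, oldest first."""
        delivered = []
        while self.pending and self.pending[0][0] <= t - self.delta:
            delivered.append(heapq.heappop(self.pending))
        return delivered
```

Waiting for the delay bound to elapse ensures that no notification with a smaller timestamp can still be in transit when a notification is delivered.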
Observing Distributed Computations
with Logical Clocks
Environment:
channels are FIFO
asynchronous communication
implementation of logical clocks
each message includes LC(e), the logical clock when event e occurs,
as its timestamp
DR2:
Deliver all messages that are stable at p0 in increasing timestamp
order
Note: a message m is stable at p if no future message with
timestamp < TS(m) can be received by p
Given FIFO channels, m is stable at p0 when p0 has received at least
one message with timestamp>TS(m) from all other processes
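A possible reading of DR2 in code: the monitor tracks the largest timestamp seen from each process and delivers a buffered message only once every other process has sent something with a larger timestamp. The class name, the zero-based process ids, and breaking timestamp ties by process id are assumptions of this sketch.

```python
import heapq


class StableDeliveryMonitor:
    """Sketch of DR2: deliver stable messages in increasing timestamp order."""

    def __init__(self, n):
        self.latest = [0] * n  # largest timestamp received so far from each process
        self.pending = []      # min-heap of (timestamp, pid, payload)

    def receive(self, pid, timestamp, payload):
        """Buffer a message from process pid; return the messages that became deliverable."""
        self.latest[pid] = max(self.latest[pid], timestamp)
        heapq.heappush(self.pending, (timestamp, pid, payload))
        return self._deliver_stable()

    def _deliver_stable(self):
        delivered = []
        while self.pending:
            timestamp, pid, _ = self.pending[0]
            # With FIFO channels, the head is stable once every other process
            # has already sent a message with a strictly larger timestamp.
            stable = all(self.latest[q] > timestamp
                         for q in range(len(self.latest)) if q != pid)
            if not stable:
                break
            delivered.append(heapq.heappop(self.pending))
        return delivered
```

Note that a silent process blocks delivery indefinitely under this rule, since the monitor never learns that no smaller timestamp will arrive from it.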
Logical Clocks
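each process pi maintains a local variable $LC_i$
when a new event $e_i$ occurs, pi updates $LC_i$ to
$$LC(e_i) = \begin{cases} LC_i + 1 & \text{if } e_i \text{ is an internal or send event} \\ \max\{LC_i, TS(m)\} + 1 & \text{if } e_i = receive(m) \end{cases}$$
[Figure: space-time diagram of p1, p2, p3 with each event annotated by its logical clock value]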
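The update rule for logical clocks is short enough to write out directly. The class below is a minimal sketch; its name and method names are chosen for this example only.

```python
class LogicalClock:
    """Logical clock LC_i kept by one process."""

    def __init__(self):
        self.value = 0

    def tick(self):
        """Internal or send event: advance the clock by one and return the new value."""
        self.value += 1
        return self.value

    def receive(self, timestamp):
        """Receive event: jump past both the local clock and the message timestamp."""
        self.value = max(self.value, timestamp) + 1
        return self.value
```

For a send event, the value returned by `tick()` is the timestamp TS(m) attached to the outgoing message.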
Observing Distributed Computations
with Causal Delivery
Causal Delivery (CD):
$send_i(m) \rightarrow send_j(m') \Rightarrow deliver_k(m) \rightarrow deliver_k(m')$
If p0 uses a delivery rule satisfying CD, then all of its
observations will be consistent
Efficient Delivering
For implementing causal delivery, what is really needed is an
effective procedure for deciding:
given events e, e' that are causally related and their clock values,
does there exist some other event e'' such that $e \rightarrow e'' \rightarrow e'$
Given RC(e) < RC(e') (or LC(e) < LC(e')), it may be that
$e \rightarrow e'$ or $e \parallel e'$, i.e. we can only conclude $\neg(e' \rightarrow e)$
The above observations suggest a timing mechanism TC whereby
causal precedence relations between events can be deduced from
their timestamps
Strong Clock Condition:
$e \rightarrow e' \equiv TC(e) < TC(e')$
Causal History (1)
Causal history of event e
$\theta(e) = \{ e' \in H \mid e' \rightarrow e \} \cup \{e\}$
That is, $\theta(e)$ is the smallest consistent cut that includes e
[Figure: space-time diagram of p1, p2, p3 highlighting the causal history of event $e_1^4$]
Causal Histories (2)
Maintaining Causal History
Each process pi initializes local variable $\theta_i$ to be $\emptyset$
Each message m contains a timestamp TS(m) which is the causal
history of its send event
Scheme
If $e_i$ is an internal or send event,
then $\theta_i = \{e_i\} \cup$ (the causal history of the previous local event)
If $e_i$ is the receive of message m by process pi from pj,
then $\theta_i = \{e_i\} \cup$ (the causal history of the previous local event of pi)
$\cup$ (the causal history of the corresponding send event at pj)
The strong clock condition is satisfied if clock comparison is
interpreted as set inclusion
$e \rightarrow e' \equiv \theta(e) \subset \theta(e')$ or
$e \rightarrow e' \equiv e \in \theta(e')$ if $e \ne e'$
Problem: the causal histories will grow rapidly
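To make the scheme concrete, causal histories can be represented as Python frozensets of event names. This is a sketch for illustration; the function names are hypothetical, and the event labels in the usage note are simply strings.

```python
def local_event(previous_history, event):
    """Causal history of an internal or send event."""
    return frozenset(previous_history) | {event}


def receive_event(previous_history, event, send_history):
    """Causal history of a receive event; send_history is the message timestamp TS(m)."""
    return frozenset(previous_history) | frozenset(send_history) | {event}


def happened_before(e, e_prime, history_of_e_prime):
    """Strong clock condition read as set membership: e -> e' iff e is in theta(e')."""
    return e != e_prime and e in history_of_e_prime
```

For example, if p1 sends a message at e11 and p3 receives it at e32 after a local event e31, the history of e32 is {e11, e31, e32}. Every event adds itself to the set, which is why these timestamps grow without bound.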
Vector Clocks
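The causal history of an event can be represented as a fixed-dimensional vector VC(e)[1..n] rather than a set, where
$VC(e)[i] = k$ iff $\theta_i(e) = h_i^k$, for $i = 1, 2, \ldots, n$
($\theta_i(e)$ is the projection of $\theta(e)$ on process pi)
[Figure: space-time diagram annotated with vector clocks; p1: (1,0,0), (2,1,0), (3,1,3), (4,1,3), (5,1,3), (6,1,3); p2: (0,1,0), (1,2,4), (4,3,4); p3: (0,0,1), (1,0,2), (1,0,3), (1,0,4), (1,0,5), (1,0,6)]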
Maintaining Vector Clocks
Each process pi maintains a local vector VCi[1..n]
Each message m contains a timestamp TS(m) which is the vector
clock value VC(e) of its send event e
Scheme
if $e_i$ is an internal or send event
$VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
if $e_i = receive(m)$
$VC_i = \max\{VC_i, TS(m)\}$ (component-wise maximum)
$VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
$VC(e_i)[j]$ = number of events of pj that causally precede event $e_i$ of pi, for $j \ne i$
$V < V' \equiv (V \ne V') \wedge (\forall k: 1 \le k \le n: V[k] \le V'[k])$
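The maintenance scheme above might look as follows in code. This is a minimal sketch: the class and method names are hypothetical, and processes are indexed from 0 rather than 1.

```python
class VectorClock:
    """Vector clock VC_i kept by process i in a system of n processes."""

    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def local_event(self):
        """Internal or send event: increment the local component."""
        self.v[self.i] += 1
        return tuple(self.v)  # for a send event, this is the timestamp TS(m)

    def receive(self, timestamp):
        """Receive event: component-wise maximum, then increment the local component."""
        self.v = [max(a, b) for a, b in zip(self.v, timestamp)]
        self.v[self.i] += 1
        return tuple(self.v)
```

With three processes, a first event at p1 yields (1,0,0); if p3 executes one local event and then receives a message carrying (1,0,0), its clock becomes (1,0,2).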
Properties of Vector Clocks
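Strong Clock Condition
$e \rightarrow e' \equiv VC(e) < VC(e')$
Simple Strong Clock Condition (for $e_i$ of pi and $e_j$ of pj, $i \ne j$)
$e_i \rightarrow e_j \equiv VC(e_i)[i] \le VC(e_j)[i]$
Concurrent
$e_i \parallel e_j \equiv (VC(e_i)[i] > VC(e_j)[i]) \wedge (VC(e_j)[j] > VC(e_i)[j])$
Pairwise Inconsistent
for $i \ne j$: $(VC(e_i)[i] < VC(e_j)[i]) \vee (VC(e_j)[j] < VC(e_i)[j])$
Consistent Cut $(c_1, c_2, \ldots, c_n)$ iff
$\forall i, j: 1 \le i, j \le n: VC(e_i^{c_i})[i] \ge VC(e_j^{c_j})[i]$
Counting: the number of events that causally precede e is given by #(e)
$\#(e) = \left( \sum_{j=1}^{n} VC(e)[j] \right) - 1$
Weak Gap-Detection: given $e_i$ and $e_j$,
if $VC(e_i)[k] < VC(e_j)[k]$ for some $k \ne j$,
then $\exists e_k$ such that $\neg(e_k \rightarrow e_i) \wedge (e_k \rightarrow e_j)$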
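Several of these properties translate into one-line checks on timestamp tuples. The helper names below are hypothetical, indices are zero-based, and a cut is represented by its frontier, that is, the vector clock of the last event of each process included in the cut.

```python
def vc_less(v, w):
    """V < V': the vectors differ and V is component-wise at most V'."""
    return v != w and all(a <= b for a, b in zip(v, w))


def concurrent(v, w):
    """Two distinct events are concurrent when neither timestamp is less than the other."""
    return v != w and not vc_less(v, w) and not vc_less(w, v)


def is_consistent_frontier(frontier):
    """Consistent-cut test: frontier[i] is the vector clock of the last event of
    process i in the cut (the zero vector if the cut has no event of process i)."""
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i] for i in range(n) for j in range(n))


def count_preceding(v):
    """Number of events that causally precede the event with timestamp v."""
    return sum(v) - 1
```

Using the timestamps from the vector clock figure, (1,0,0) precedes (1,0,2), while (2,1,0) and (1,0,2) are concurrent.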
Implementing Causal Delivery
with Vector Clocks
Babaoglu & Marzullo
monitor p0 maintains an array D[1..n] where D[i] contains
TS(mi)[i] where mi is the last message delivered from process pi
DR3:
Deliver message m from process pj when both of the following are
satisfied
$D[j] = TS(m)[j] - 1$ (guarantees FIFO order among messages from pj)
$D[k] \ge TS(m)[k], \forall k \ne j$ (guarantees the causal relation)
DR4:
Monitor p0 maintains a counter D
Deliver message m of event ei as soon as
D = #(ei) - 1
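Delivery rule DR3 can be sketched as a monitor that buffers notifications until both conditions hold. The class name, the zero-based process indices, and the simple rescan of the buffer are choices made for this illustration, not taken from the slides.

```python
class CausalMonitor:
    """Sketch of DR3: causal delivery at the monitor using vector timestamps."""

    def __init__(self, n):
        self.D = [0] * n  # D[i] = TS(m_i)[i] of the last message delivered from process i
        self.buffer = []  # received but not yet deliverable (sender, timestamp) pairs

    def _deliverable(self, j, ts):
        next_from_j = self.D[j] == ts[j] - 1
        causes_seen = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return next_from_j and causes_seen

    def receive(self, j, ts):
        """Buffer a notification from process j; return everything now deliverable, in order."""
        self.buffer.append((j, tuple(ts)))
        delivered = []
        progress = True
        while progress:  # one delivery may unblock other buffered messages
            progress = False
            for item in list(self.buffer):
                sender, stamp = item
                if self._deliverable(sender, stamp):
                    self.buffer.remove(item)
                    self.D[sender] = stamp[sender]
                    delivered.append(item)
                    progress = True
        return delivered
```

In a two-process run where (1,1) from the second process overtakes (1,0) from the first, the monitor holds (1,1) back and delivers both in causal order once (1,0) arrives.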
Causal Delivery with vector Clock
Examples
[Figure: monitor p0 delivering notifications from p1 and p2; the events carry vector timestamps (1,0), (1,1), (1,2), (2,2), (3,2), and D is initially [0,0]]
Distributed Snapshots
In this strategy, p0 will request the states of the other processes
and then combine them into a global state
Definition:
channel state: for each channel from pi to pj,
$\chi_{i,j}$ = set difference between the messages recorded as sent to pj in $\sigma_i$ and the messages recorded as received from pi in $\sigma_j$
incoming channels of process pi: $IN_i$
outgoing channels of process pi: $OUT_i$
Snapshot Protocols
Chandy and Lamport [1985]
Morgan[1985]
Snapshot Protocol 1
Assumption:
existence of a global real-time clock : RC
Each message is attached with timestamp
Message delays are bounded
global clock algorithm
p0 sends [take snapshot at $t_{ss}$] to all processes
When clock RC reads $t_{ss}$, each process pi does the following:
records its local state $\sigma_i$,
sends an empty message over all its outgoing channels
and starts recording all messages received over each of its incoming channels
The first time pi receives a message from pj with timestamp greater
than or equal to $t_{ss}$, pi stops recording messages for that channel
Snapshot Protocol 2
Assumption:
Channels are reliable; message delays are finite but need not be bounded
Channels are FIFO
Chandy & Lamport
p0 sends [take snapshot] to itself
For each process pi receiving [take snapshot]
If it is the first time
records its local state $\sigma_i$
sends [take snapshot] on each outgoing channel
starts recording messages from the other incoming channels
If it is not the first time
stops recording messages from that incoming channel
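The marker rules above can be exercised in a small single-threaded simulation. Everything in this sketch is an assumption made for illustration: the local state is a single integer balance, application messages are integer transfers, channels are in-memory FIFO queues, and the names `Process`, `Network` and `MARKER` are hypothetical.

```python
from collections import deque

MARKER = "take snapshot"


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance      # application state (hypothetical: an account balance)
        self.recorded_state = None  # local state saved by the snapshot
        self.recording = {}         # incoming peer -> messages recorded so far
        self.channel_state = {}     # incoming peer -> final recorded channel state


class Network:
    def __init__(self, balances):
        n = len(balances)
        self.procs = [Process(i, b) for i, b in enumerate(balances)]
        # One FIFO channel per ordered pair of distinct processes.
        self.chan = {(i, j): deque() for i in range(n) for j in range(n) if i != j}

    def send(self, src, dst, amount):
        """Application send: transfer `amount` from src to dst."""
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, pid):
        """The initiator behaves as if it had received the marker from itself."""
        self._record(self.procs[pid], marker_from=None)

    def _record(self, p, marker_from):
        p.recorded_state = p.balance
        for q in range(len(self.procs)):
            if q == p.pid:
                continue
            self.chan[(p.pid, q)].append(MARKER)  # marker on every outgoing channel
            if q == marker_from:
                p.channel_state[q] = []           # channel the marker arrived on is empty
            else:
                p.recording[q] = []               # start recording the other channels

    def deliver(self, src, dst):
        """Deliver the oldest message on the channel from src to dst."""
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self._record(p, marker_from=src)  # first marker: take the local snapshot
            else:
                p.channel_state[src] = p.recording.pop(src)  # later marker: stop recording
        else:
            p.balance += msg
            if src in p.recording:
                p.recording[src].append(msg)


# Example run with two processes and one transfer in flight in each direction.
net = Network([10, 20])
net.send(0, 1, 3)
net.start_snapshot(0)
net.send(1, 0, 5)
for _ in range(2):
    net.deliver(0, 1)
for _ in range(2):
    net.deliver(1, 0)
```

In this run the recorded local states are 7 and 18 and the recorded state of the channel from process 1 to process 0 is [5], so the snapshot accounts for the full total of 30 even though no single instant of the run had exactly those local states.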
Chandy & Lamport (1985)
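[Figure: space-time diagram of p0, p1, p2 running the protocol; p1 executes $e_1^1 \ldots e_1^6$, p2 executes $e_2^1 \ldots e_2^5$, and $e_1^*$, $e_2^*$ mark the first receipt of [take snapshot]]
Real computation $R = e_2^1\, e_1^1\, e_1^2\, e_1^3\, e_2^2\, e_1^4\, e_2^3\, e_2^4\, e_1^5\, e_2^5\, e_1^6$
in terms of global states: $\Sigma^{00}\, \Sigma^{01}\, \Sigma^{11}\, \Sigma^{21}\, \Sigma^{31}\, \Sigma^{32}\, \Sigma^{42}\, \Sigma^{43}\, \Sigma^{44}\, \Sigma^{54}\, \Sigma^{55}\, \Sigma^{65}$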
Properties of Snapshots
Definition
$\Sigma^a$: the global state in which the snapshot protocol is initiated,
$\Sigma^f$: the global state in which the protocol terminates and
$\Sigma^s$: the global state constructed
$e_i^*$ denotes the event when pi receives [take snapshot] for the first
time, causing pi to start recording its state
let the time be $t_i$ when $e_i^*$ occurs
$e_i$ is a prerecording event if $e_i \rightarrow e_i^*$,
otherwise it is a post-recording event
Properties
There exists a run R' such that $\Sigma^a \leadsto \Sigma^s \leadsto \Sigma^f$ in R'
That is to say, $\Sigma^s$ could have happened
Argumentation (1)
Chandy & Lamport(1985)
consider any adjacent (post-recording, prerecording) pair (e, e') in R
then $\neg(e \rightarrow e')$
swapping all such pairs will result in another consistent run R'
swap ($e_1^3$, $e_2^2$): $r_1 = e_2^1\, e_1^1\, e_1^2\, e_2^2\, e_1^3\, e_1^4\, e_2^3\, e_2^4\, e_1^5\, e_2^5\, e_1^6$
swap ($e_1^4$, $e_2^3$): $r_2 = e_2^1\, e_1^1\, e_1^2\, e_2^2\, e_1^3\, e_2^3\, e_1^4\, e_2^4\, e_1^5\, e_2^5\, e_1^6$
swap ($e_1^3$, $e_2^3$): $R' = e_2^1\, e_1^1\, e_1^2\, e_2^2\, e_2^3\, e_1^3\, e_1^4\, e_2^4\, e_1^5\, e_2^5\, e_1^6$
the global state after executing the last prerecording event ($e_2^3$) in
R' is $\Sigma^s$ (= $\Sigma^{23}$), the constructed global state
If the computation had followed this run, $\Sigma^s$ would have happened
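Repeating the adjacent swaps until no (post-recording, prerecording) pair remains amounts to a stable partition of the run. The sketch below reproduces the example, assuming events are encoded as (process, index) pairs and that p1 recorded its state after 2 events and p2 after 3, as implied by the constructed state above; the function name is hypothetical.

```python
def move_prerecording_first(run, recorded_upto):
    """Stable partition of a run: prerecording events first, post-recording events after.

    `run` is a list of (process, index) pairs and `recorded_upto[p]` is the
    index of the last prerecording event of process p. Relative order inside
    each group is preserved, which is what repeated adjacent swaps produce.
    """
    pre = [e for e in run if e[1] <= recorded_upto[e[0]]]
    post = [e for e in run if e[1] > recorded_upto[e[0]]]
    return pre + post


# The run R from the example, written as (process, event index) pairs.
R = [(2, 1), (1, 1), (1, 2), (1, 3), (2, 2), (1, 4),
     (2, 3), (2, 4), (1, 5), (2, 5), (1, 6)]
R_PRIME = move_prerecording_first(R, {1: 2, 2: 3})
```

The result keeps each process's own events in their original order; the reordering is a consistent run only because no post-recording event causally precedes a prerecording one.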
Argumentation (2)
Lai & Yang(1987)
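Let $GSN(t_i : p_i \in P)$ be a snapshot taken between $\tau_1$ and $\tau_2$, during the
computation R.
Let $d = \tau_2 - \tau_1$, construct R' as follows:
R' is the same as R except that every post-recording event in R is now
postponed for d units of time, that is
$$R'(t) = \begin{cases} R(t) & \text{if } R(t) \text{ is an event at } p_i \text{ and } t \le t_i \\ R(t - d) & \text{if } R(t - d) \text{ is an event at } p_i \text{ and } t - d > t_i \\ \emptyset & \text{otherwise} \end{cases}$$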
Example
Properties of Global Predicates
Stable Predicates
Many system properties one wishes to detect have the characteristic
that once they become true, they remain true
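If $\Phi$ is a stable predicate, since $\Sigma^a \leadsto \Sigma^s \leadsto \Sigma^f$
($\Phi$ is true in $\Sigma^s$) => ($\Phi$ is true in $\Sigma^f$)
($\Phi$ is false in $\Sigma^s$) => ($\Phi$ is false in $\Sigma^a$)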
Nonstable Predicates
Two problems
The condition encoded by the predicate may not persist long
enough for it to be true when the predicate is evaluated
If a predicate is found to be true by the monitor, we do not
know whether it ever held during the actual run
The predicate may have held even if it is not detected, and even if
it is detected it may have never held.
Extended nonstable global predicate: apply to the entire
distributed computation
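Possibly($\Phi$): there exists a consistent observation of the computation in which $\Phi$ holds in some global state
Definitely($\Phi$): for every consistent observation of the computation, $\Phi$ holds in some global state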
Detecting Possibly and Definitely
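$\min(\sigma_i^k)$: the global state with the smallest level in the lattice
containing $\sigma_i^k$
$\max(\sigma_i^k)$: the global state with the largest level in the lattice
containing $\sigma_i^k$
Examples: $\min(\sigma_1^3) = \Sigma^{31}$, $\max(\sigma_1^3) = \Sigma^{33}$
$\min(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$:
$\forall j: VC(\sigma_j^{c_j})[j] = VC(\sigma_i^k)[j]$
$\max(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$:
$\forall j: VC(\sigma_j^{c_j})[i] \le VC(\sigma_i^k)[i]$
and $((\sigma_j^{c_j} = \sigma_j^f) \vee (VC(\sigma_j^{c_j+1})[i] > VC(\sigma_i^k)[i]))$
The minimum level containing $\sigma_i^k$ is the sum of the components of
the vector timestamp $VC(\sigma_i^k)$
An algorithm for detecting Definitely($\Phi$): $O(k^n)$, where k is the
maximum number of events a monitored process has executed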
Example