Consistent Global States of Distributed Systems: Fundamental

Consistent Global States
of Distributed Systems:
Fundamental Concepts and
Mechanisms
CS 249 Project
Fall 2005
Wing Wong
Outline



Introduction
Asynchronous distributed systems, distributed
computations, consistency
Two different strategies to construct global states

Monitor passively observes the system (reactivearchitecture)
 Monitor actively interrogates the system (snapshot
protocol)


Properties of global predicates
Sample applications: deadlock detection and debugging
Consist Global States
2
Introduction


global state = union of local states of individual
processes
many problems in distributed computing require:



difficulties:



construction of a global state and
evaluation of whether the state satisfies some predicate Φ
uncertainties in message delays
relative speeds of computations
global state obtained can be obsolete, incomplete, or
inconsistent
Consist Global States
3
Distributed Systems





collection of sequential processes p1, p2, …, pn
unidirectional communication channels between
pairs of processes
reliable channels
messages may be delivered out of order
network strongly connected (not necessarily
completely)
Consist Global States
4
Asynchronous Distributed Systems
no bounds on relative process speeds
 no bounds on message delays
 no synchronized local clocks
 communication is the only possible
mechanism for synchronization

Consist Global States
5
Distributed Computations
distributed program executed by a
collection of processes
 each process executes a sequence of
events
 communication through events send(m)
and receive(m), m as message identifier

Consist Global States
6
Distributed Computations

h i = e i 1e i 2…
 local
history of process pi
 canonical enumeration
h
 total order imposed by sequential
execution

hik= ei1ei2… eik
 initial

prefix of hi containing first k events
H = h1 U … U hn
 global
history containing all events
 does not specify relative timing between events
Consist Global States
7
Distributed Computations

to order events, define binary relation “→” to capture
“cause-and-effect”:

e → e’ if and only if e “causally precedes” e’
concurrent events: neither e → e’ nor e’ → e, write e || e’
distributed computation = partially ordered set defined by
(H, →)


Consist Global States
8
Distributed Computations

e21 → e36 ; e22 || e36
Consist Global States
9
Global States, Cuts and Runs

σi k
 local

state of process pi after event eik
Σ = (σ1,… ,σn)
 global
state of distributed computation
 n-tuple of local states

cut C = h1c1 U … U hncn or (c1, …, cn)
 subset
of global history H
Consist Global States
10
Global States, Cuts and Runs

(σ1c1,… ,σncn)
 global

state correspond to cut C
(e1c1,… ,encn)
 frontier
of cut C
 set of last events

run
a
total ordering R including all events in global history
 consistent with each local history
Consist Global States
11
Global States, Cuts and Runs


cut C = (5,2,4); cut C’ = (3,2,6)
a consistent run R = e31e11e32e21e33e34e22e12e35e13e14e15e36e23e16
Consist Global States
12
Consistency

cut C is consistent if for all events e and e’
 closed


under the causal precedence relation
consistent global state corresponds to a
consistent cut
run R is consistent if for all events, e → e’ implies
e appears before e’ in R
Consist Global States
13
Consistency
run R = e1e2… results in a sequence of
global states Σ0Σ1Σ2
 Σi is obtained from Σi-1 by some process
executing event ei , or Σi-1 leads to Σi
 denote the transitive closure of the leadsto relation by ~>R
 Σ’ is reachable from Σ in run R iff Σ ~>R Σ’

Consist Global States
14
Lattice of Global States



lattice = set of all
consistent global
states, along with
leads-to relation
Σk1…kn = shorthand
for global state
(σ1k1,…,σnkn)
k 1 + … + kn =
level of lattice
Consist Global States
15
Lattice of Global States



path = sequence of
global states of
increasing level
(downwards)
each path
corresponds to a
consistent run
a possible path:
Σ00 Σ01 Σ11 Σ21 Σ31 Σ32
Σ42 Σ43 Σ44 Σ54 Σ64 Σ65
Consist Global States
16
Observing Distributed Computations
(reactive-architecture)
processes notify monitor process p0
whenever they execute an event
 monitor constructs observation as the
sequence of events corresponding to the
notification messages
 problem:


observation may be inconsistent due to
variability in notification message delays
Consist Global States
17
Observing Distributed
Computations
Consist Global States
18
Observing Distributed
Computations


any permutation of run R is a possible
observation
we need:
 delivery
rule at monitor process to restore message
order

we have First-In-First-Out (FIFO) delivery using
sequence number for all source-destination pair
pi, pj :

sendi(m) → sendi(m’) => deliverj(m) → deliverj(m’)
Consist Global States
19
Delivery Rule 1

assume:
 global
real-time clock
 message delays bound by δ

process includes timestamp (real-time clock
value) when notifying p0 of local event e
DR1: At time t, deliver all received messages
with timestamps up to t – δ in increasing
timestamp order
Consist Global States
20
Delivery Rule 1
let RC(e) denotes value of global clock
when e is executed
 real-time clock satisfies Clock Condition:



e → e’ => RC(e) < RC(e’)
but logical clocks also satisfies clock
condition…
Consist Global States
21
Logical Clocks




event orderings based on increasing clock
values
LC(ei) denotes value of logical clock when ei is
executed by pi
each sent message m contains timestamp TS(m)
update rules by pi at occurrence of ei:
Consist Global States
22
Logical Clocks
Consist Global States
23
Delivery Rule 2
replace real-time clock by logical clock
 need gap-detection property:

events e, e’ where LC(e) < LC(e’),
determine if some event e’’ exists such that
LC(e) < LC(e’’) < LC(e’)
 message is “stable” at p if no future
messages with timestamps smaller than
TS(m) can be received by p
 given
Consist Global States
24
Delivery Rule 2


with FIFO, when p0 receives m from pi with
timestamp TS(m), can be certain no other
message m’ from pi with TS(m’) ≤ TS(m)
message m at p0 guaranteed stable when p0 has
received at least one message from all other
processes with timestamps > TS(m)
DR2: Deliver all received messages that are
stable at p0 in increasing timestamp order
Consist Global States
25
Strong Clock Condition




DR1, DR2 assume RC(e) < RC(e’) (or LC(e) <
LC(e’)) => e → e’
recall RC and LC guarantee clock condition: e →
e’ => RC(e) < RC(e’)
DR1, DR2 can unnecessarily delay delivery
want timing mechanism TC that gives Strong
Clock Condition:

e → e’ ≡ TC(e) < TC(e’)
Consist Global States
26
Timing Mechanism 1 - Causal
Histories

causal history as “clock” value
 set
of all events that causally precede event e:
 smallest
consistent cut that includes e
 projection of θ(e) on process pi: θi(e) = θ(e) ∩ hi
Consist Global States
27
Timing Mechanism 1 - Causal
Histories

Consist Global States
28
Timing Mechanism 1 - Causal
Histories

To maintain causal histories:
θ
initially empty
 if ei is an internal or send event

 if
θ(ei) = {ei} U θ(previous local event of pi)
ei = receive of message m by pi from pj

θ(ei) = {ei} U θ(previous local event of pi) U
θ(corresponding send event at pj)
Consist Global States
29
Timing Mechanism 1 - Causal
Histories
new
send
event:
new event e15
new
receive
event:
new event e23
Consist Global States
30
Timing Mechanism 1 - Causal
Histories

can interpret clock comparison as set
inclusion:
→ e’ ≡ θ(e)  θ(e’)
(why not set membership, [e → e’ ≡ e  θ(e’)]?)
e

unfortunately, causal histories grow too
rapidly
Consist Global States
31
Timing Mechanism 2 - Vector
Clocks

note:
θi(e) = hik for some unique k
 eir  θi(e) for all r < k
 can use single number k to represent θi(e)
 θ(e) = θ1(e) U … U θn(e)
 projection

represent entire causal history by n-dimensional
vector clock VC(e), where for all 1 ≤ i ≤ n

VC(e)[i] = k, if and only if θi(e) = hik
Consist Global States
32
Timing Mechanism 2 - Vector
Clocks
Consist Global States
33
Timing Mechanism 2 - Vector
Clocks

To maintain vector clock:
 each
process pi initializes VC to contain all zeros
 update rules by pi at occurrence of ei:


VC(ei)[i] ≡ number of events pi has executed
up to and including ei
VC(ei)[j] ≡ number of events of pj that causally
precede event ei of pi
Consist Global States
34
Timing Mechanism 2 - Vector
Clocks
causal histories
vector clocks
new
send
event:
new
receive
event:
Consist Global States
35
Vector Clock Comparison

Define “less than” relation:
V
< V’ ≡ (V ≠ V’)  ( 1 ≤ k ≤ n: V[k] ≤ V’[k])
Consist Global States
36
Properties of Vector Clocks
1.
Strong Clock Condition:

2.
e → e’ ≡ VC(e) < VC(e’)
Simple Strong Clock Condition:
given event ei of pi and event ej of pj, i ≠ j

ei → ej ≡ VC(ei)[i] ≤ VC(ej)[i]
Consist Global States
37
Properties of Vector Clocks
Test for Concurrency: given event ei of pi and event ej of pj
3.

ei || ej ≡ (VC(ei)[i] > VC(ej)[i])  (VC(ej)[j] > VC(ei)[j])
Pairwise Inconsistent: given event ei of pi and ej of pj, i ≠ j
4.

if ei , ej cannot belong to the frontier of the same consistent cut
(VC(ei)[i] < VC(ej)[i])  (VC(ej)[j] < VC(ei)[j])
(concurrent)
Consist Global States
38
Properties of Vector Clocks
5.
Consistent Cut:

frontier contains no pairwise inconsistent events
VC(eici)[i]  VC(ejcj)[i] , 1 ≤ i, j ≤ n
6.
Counting # of events causally precede ei:

#(ei) = (Σj=1 .. n VC(ei)[j]) – 1
Consist Global States
# events = 4+1+3-1 = 7
39
Properties of Vector Clocks
7.
Weak Gap-Detection: given event ei of pi and ej of pj,

if VC(ei)[k] < VC(ej)[k] for some k ≠ j, there exists
event ek such that (ek → ei)  (ek → ej)
Consist Global States
40
Causal Delivery and Vector Clocks



assume processes increment local component
of VC only for events notified to monitor p0
p0 maintains set M for messages received but
not yet delivered
suppose we have:
 message
m from pj
 m’ = last message delivered from process pk, k ≠ j
Consist Global States
41
Causal Delivery and Vector Clocks

To deliver m, p0 must verify:
no earlier message from pj is undelivered
(i.e. TS(m)[j] – 1 messages have been
delivered from pj)
2. no undelivered message m’’ from pk s.t.
sendk(m’)→sendk(m’’)→sendj(m), k ≠ j
(i.e. whether TS(m’)[k]  TS(m)[k] for all k)
1.
Consist Global States
42
Causal Delivery and Vector Clocks


p0 maintains array
D[1…n] where
D[i] = TS(mi)[i], mi
being last message
delivered from pi
e.g. on right, delivery
of m is delayed until
m’’ is received and
delivered
0 0
1 0
2 0
2 1
P0
m'
m''
(0, 0)
(1, 0)
(2, 0)
m
(2, 1)
Pk
Pj
(0, 0)
(1, 0)
Consist Global States
(2, 0)
(2, 1)
43
Delivery Rule 3

Causal Delivery:
all messages m, m’, sending processes pi, pj and
destination process pk
 for
sendi(m) → sendj(m’) => deliverk(m) → deliverk(m’)

DR3 (Causal Delivery): Deliver message m from
process pj as soon as
D[j] = TS(m)[j] – 1, and
D[k]  TS(m)[k], k ≠ j

p0 set D[j] to TS(m)[j] after delivery of m
Consist Global States
44
Causal Delivery and Hidden
Channels


should apply to
closed systems
incorrect conclusion
with hidden channels
(communication
channel external to
the system)
Consist Global States
45
Active Monitoring
- Distributed Snapshots
monitor p0 requests states of other
processes and combine into global state
 assume channels implement FIFO delivery
 channel state χi,j for channel pi to pj:
messages sent by pi not yet received by pj

Consist Global States
46
Distributed Snapshots


notations:
INi = set of processes having direct channels to
pi
OUTi = set of processes to which pi has a
channel
for each execution of the snapshot protocol,
process pi record its local state σi and the states
of its incoming channels (χj,i for all pj  INi)
Consist Global States
47
Distributed Snapshots

Snapshot Protocol (Chandy-Lamport)
1.
2.
p0 starts the protocol by sending itself a “take
snapshot” message
when receiving the “take snapshot” message for the
first time from process pf



pi records local state σi and relays the “take snapshot”
message along all outgoing channels
channel state χf,i is set to empty
pi starts recording messages on other incoming channels
Consist Global States
48
Distributed Snapshots
Snapshot Protocol (Chandy-Lamport)
3.
when receiving the “take snapshot” message
beyond the first time from process ps:


pi stops recording messages along channel from ps
channel state χs,i are messages that have been recorded
Consist Global States
49
Distributed Snapshots
p1 done
p2 done


dash arrows indicate “take snapshot” messages
constructed global state: Σ23; χ1,2 empty; χ2,1 = {m}
Consist Global States
50
Properties of Snapshots
Let Σs = global state constructed
Σa = global state when protocol initiated
Σf = global state when protocol terminated
 Σs is guaranteed to be consistent
 actual run that the system followed may
not pass through Σs
 but  a run R such that Σa ~>R Σs ~>R Σf

Consist Global States
51
Properties of Snapshots




Σa = Σ21
Σf = Σ55
r does not pass
through Σs ( = Σ23)
Consist Global States
52
Properties of Snapshots

but Σ21 ~> Σ23 ~> Σ55
Consist Global States
53
Properties of Global Predicates

Now we have two methods for global
predicate evaluation:
 monitor
passively observing runs
 monitor actively constructing snapshots

utility of either approach depends (in part)
on properties of the predicate
Consist Global States
54
Stable Predicates




communication delays => Σs can only reflect
some past state of the system
stable predicate: once become true, remain true
e.g. deadlock, termination, loss of all tokens,
unreachable storage
if Φ is stable, then
(Φ is true in Σs) => (Φ is true in Σf) and
(Φ is false in Σs) => (Φ is false in Σa)
Consist Global States
55
Stable Predicates

deadlock detection through snapshots (p.29, 30)
Consist Global States
56
Stable Predicates

deadlock detection using reactive protocol (p.31, 32)
Consist Global States
57
Nonstable Predicates


e.g. debugging, checking if queue
lengths exceed some thresholds
Two problems:
condition may not persist long enough for it
to be true when the predicate is evaluated
2. if a predicate Φ is found true, do not know
whether Φ ever held during the actual run
1.
Consist Global States
58
Nonstable Predicates

e.g. monitoring condition
(x = y)



7 states where (x = y) holds
but no longer hold after
state Σ54
e.g. (y – x) = 2


condition hold only in Σ31
and Σ41
monitor might detect (y - x)
= 2 even if actual run never
goes through Σ31 or Σ41
Consist Global States
59
Nonstable Predicates

very little value to
detect nonstable
predicate
Consist Global States
60
Nonstable Predicates




With observations, can extend predicates:
Possibly(Φ): There exist a consistent
observation O of the computation such that Φ
holds in a global state of O
Definitely(Φ): For every consistent observation
O of the computation, there exists a global state
of O in which Φ holds
e.g. Possibly((y – x) = 2), Definitely(x = y)
Consist Global States
61
Nonstable Predicates
use of extended predicate in debugging:
if Φ = some erroneous state, then
Possibly(Φ) indicates a bug, even if it is
not observed during an actual run
 if predicate Φ is stable, then
Possibly(Φ) ≡ Definitely(Φ)

Consist Global States
62
Detecting Possibly and Definitely Φ
detection based on the lattice of consistent
global states
 If any global state in the lattice satisfies Φ,
then Possibly(Φ) holds
 Definitely(Φ) requires all possible runs to
pass through a global state that satisfies Φ

Consist Global States
63
Detecting Possibly and Definitely Φ


Possibly((y – x) = 2)
Definitely(y = x)
(why?)
Consist Global States
64
Detecting Possibly and Definitely Φ


set of global state
current with
progressively
increasing levels
any member of
current satisfies Φ
=> Possibly(Φ) true
Consist Global States
65
Detecting Possibly and Definitely Φ



iteratively construct set
of global states of level
l without passing
through a state that
satisfies Φ
set empty =>
Definitely(Φ) true
set contains the final
state => Definitely(Φ)
true
Consist Global States
66
Conclusions


many distributed system problems require recognizing
certain global conditions
two approaches to constructing global states:





reactive-architecture based
snapshot based
timing mechanism that captures causal precedence
relation
applying to distributed deadlock detection and debugging
solutions can be adapted to deal with nonstable
predicates, multiple observations and failures
Consist Global States
67