Week 2 Notes

Global Predicate Detection
and Event Ordering
Our Problem
To compute predicates
over the state of
a distributed application
Model
Message passing
No failures
Two possible timing assumptions:
Synchronous System
Asynchronous System
No upper bound on message delivery
time
No bound on relative process speeds
No centralized clock
Asynchronous systems
Weakest possible assumptions
Weak assumptions ´ less vulnerabilities
Asynchronous  slow
“Interesting” model w.r.t. failures
Client-Server
•Processes exchange messages using
•Remote Procedure Call (RPC)
A client requests a service by
sending the server a message.
The client blocks while waiting for
a response
c
s
Client-Server
• Processes exchange messages using
• Remote Procedure Call (RPC)
A client requests a service by
sending the server a message.
The client blocks while waiting
for a response
The server computes the
response (possibly asking other
servers) and returns it to the
client
#!?%!
c
s
Deadlock!
Goal
•Design a protocol by which a
processor can determine whether a
global predicate (say, deadlock) holds
Wait-For Graphs
Draw arrow from pi to pj if pj has received a
request but has not responded yet
Wait-For Graphs
Draw arrow from pi to pj if pj has received a
request but has not responded yet
Cycle in WFG ) deadlock
Deadlock ) ¦ cycle in WFG
The protocol
p0 sends a message to p1 p3
On receipt of p0 ‘s message, pi replies with its state
and wait-for info
An execution
An execution
An execution
Ghost Deadlock!
We have a problem...
Asynchronous system
no centralized clock, etc. etc.
Synchrony useful to
coordinate actions
order events
Events and Histories
Processes execute sequences of events
Events can be of 3 types: local, send, and
receive
epi is the i-th event of process p
The local history hp of process p is the sequence
of events executed by process
hpk : prefix that contains first k events
hp0 : initial, empty sequence
The history H is the set hp0 [ hp1 [ … hpn-1
NOTE: In H, local histories are interpreted as sets, rather than sequences, of events
Ordering events
Observation 1:
Events in a local history are totally ordered
•
time
Ordering events
Observation 1:
Events in a local history are totally ordered
time
Observation 2:
For every message m, send(m) precedes receive(m)
time
time
Happened-before
(Lamport[1978])
•A binary relation
defined over events
1. if eik,eil2hi and k<l , then eik !eil
2. if ei=send
(m) and ej=receive(m), then ei! ej
3. if e!e’ and e’! e‘’, then e! e‘’
Space-Time diagrams
•A graphic representation of a distributed
execution
time
Space-Time diagrams
•A graphic representation of a distributed
execution
time
Space-Time diagrams
•A graphic representation of a distributed
execution
time
Space-Time diagrams
•A graphic representation of a distributed
execution
time
Space-Time diagrams
•A graphic representation of a distributed
execution
H and
time
impose a partial order
Space-Time diagrams
•A graphic representation of a distributed
execution
H and
time
impose a partial order
Space-Time diagrams
•A graphic representation of a distributed
execution
H and
time
impose a partial order
Space-Time diagrams
•A graphic representation of a distributed
execution
H and
time
impose a partial order
Runs and
Consistent Runs
A run is a total ordering of the events in H that is
consistent with the local histories of the processors
Ex: h1,h2, … ,hn is a run
A run is consistent if the total order imposed in the
run is an extension of the partial order induced by!
A single distributed computation may correspond to
several consistent runs!
Cuts
•A cut C is a subset of the global history of H
•
Cuts
•A cut C is a subset of the global history of H
•The frontier of C is the set of events
•
Global states and cuts
The global state of a distributed computation is an
tuple of n local states
  = (1 ... n )
To each cut (1c1, ... ncn) corresponds a global state
Consistent cuts and
consistent global states
A cut is consistent if
A consistent global state is one corresponding to a
consistent cut
What
sees
What
sees
•Not a consistent global state: the cut contains the
event corresponding to the receipt of the last
message by p3 but not the corresponding send event
Our task
Develop a protocol by which a processor can build
a consistent global state
Informally, we want to be able to take a snapshot
of the computation
Not obvious in an asynchronous system...
Our approach
Develop a simple synchronous protocol
Refine protocol as we relax assumptions
Record:
processor states
channel states
Assumptions:
FIFO channels
Each m timestamped with with T(send(m))
Snapshot I
•
1. p0 selects tss
•
2. p0 sends “take a snapshot at tss” to all processes
•
3. when clock of pi reads tss then
a.
b.
c.
records its local state i
starts recording messages received on each of incoming
channels
stops recording a channel when it receives first message
with timestamp greater than or equal to tss
Snapshot I
•
1. p0 selects tss
•
2. p0 sends “take a snapshot at tss” to all processes
•
3. when clock of pi reads tss then
a.
b.
c.
d.
records its local state i
sends an empty message along its outgoing channels
starts recording messages received on each of incoming
channels
stops recording a channel when it receives first message
with timestamp greater than or equal to tss
Correctness
Theorem:
Proof:
Snapshot I produces a consistent cut
Need to prove
< Definition >
< Assumption >
< Assumption >
< 0 and 1>
< Property of real time>
< 2 and 4>
< 5 and 3>
< Definition >
Clock Condition
< Property of real time>
Can the Clock Condition be
implemented some other way?
Lamport Clocks
•Each process maintains a local variable LC
•LC(e) = value of LC for event e
Increment Rules
Timestamp m with
Space-Time Diagrams
and Logical Clocks
2
3
6
7
8
7
1
8
4 5
6
9
A subtle problem
•
when LC=t do S
•
doesn’t make sense for Lamport clocks!
•
there is no guarantee that LC will ever be t
S is anyway executed after
•
•
Fixes:
•
If e is internal/send and LC = t-2
•
execute e and then S
•
If e = receive(m) Æ (TS(m) ¸ t) Æ (LC · t-1)
•
put message back in channel
•
re-enable e; set LC=t-1; execute S
An obvious problem
No tss !
Choose  large enough that it cannot be reached by
applying the update rules of logical clocks
An obvious problem
No tss!
Choose  large enough that it cannot be reached by
applying the update rules of logical clocks
Doing so assumes
upper bound on message delivery time
upper bound relative process speeds
Better relax it
Snapshot II
p0 selects 
p0 sends “take a snapshot at tss” to all processes; it waits
for all of them to reply and then sets its logical clock to 
when clock of pi reads  then pi
records its local state i
sends an empty message along its outgoing channels
starts recording messages received on each incoming
channel
stops recording a channel when receives first message
with timestamp greater than or equal to 
Relaxing synchrony
empty message:
take a
snapshot at
Process does nothing
for the protocol during
this time!
records
local state
monitors
channels
sends empty message:
Use empty message to announce snapshot!
Snapshot III
Processor p0 sends itself “take a snapshot “
when pi receives “take a snapshot” for the first time from pj:
records its local state i
sends “take a snapshot” along its outgoing channels
sets channel from pj to empty
starts recording messages received over each of its other incoming
channels
when receives “take a snapshot” beyond the first time from pk:
pi stops recording channel from pk
when pi has received “take a snapshot” on all channels, it sends
collected state to p0 and stops.
Snapshots: a
perspective
The global state s saved by the snapshot
protocol is a consistent global state
Snapshots: a
perspective
The global state s saved by the snapshot
protocol is a consistent global state
But did it ever occur during the computation?
a distributed computation provides only a
partial order of events
many total orders (runs) are compatible
with that partial order
all we know is that s could have occurred
Snapshots: a
perspective
The global state s saved by the snapshot
protocol is a consistent global state
But did it ever occur during the computation?
a distributed computation provides only a
partial order of events
many total orders (runs) are compatible
with that partial order
all we know is that s could have occurred
We are evaluating predicates on states that
may have never occurred!
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
An Execution and its
Lattice
Reachability
• kl is reachable from ij
if there is a path from kl to
ij in the lattice
Reachability
• kl is reachable from ij
if there is a path from kl to
ij in the lattice
Reachability
• kl is reachable from ij
if there is a path from kl to
ij in the lattice
Reachability
• kl is reachable from ij
if there is a path from kl to
ij in the lattice
So, why do we care about
s
 again?
Deadlock is a stable property
Deadlock
Deadlock
If a run R of the snapshot protocol starts in i and
terminates in f, then
So, why do we care about
s
 again?
Deadlock is a stable property
Deadlock
Deadlock
If a run R of the snapshot protocol starts in i and
terminates in f, then
Deadlock in s implies deadlock in f
No deadlock in s implies no deadlock in i
Same problem, different
approach
Monitor process does not query explicitly
Instead, it passively collects information and
uses it to build an observation.
(reactive architectures, Harel and Pnueli [1985])
An observation is an ordering of event of the
distributed computation based on the order
in which the receiver is notified of the
events.
Observations:
a few observations
An observation puts no constraint on the
order in which the monitor receives
notifications
Observations:
a few observations
An observation puts no constraint on the
order in which the monitor receives
notifications
Observations:
a few observations
An observation puts no constraint on the
order in which the monitor receives
notifications
•
Causal delivery
•FIFO delivery guarantees:
•
•
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
•
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
•
send event
receive event
deliver event
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
•
send event
receive event
deliver event
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
•
•
send event
receive event
deliver event
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
send event
receive event
deliver event
•
1
Causal delivery
•FIFO delivery guarantees:
•Causal delivery generalizes FIFO:
send event
receive event
deliver event
1
2
Causal Delivery
in Synchronous Systems
•We use the upper bound  on message
delivery time
•
Causal Delivery
in Synchronous Systems
•We use the upper bound
delivery time
on message
•DR1: At time t , p0 delivers all messages it
received with timestamp up to t- in
increasing timestamp order
Causal Delivery
with Lamport Clocks
• DR1.1: Deliver all received messages in
increasing (logical clock) timestamp order.
Causal Delivery
with Lamport Clocks
• DR1.1: Deliver all received messages in
increasing (logical clock) timestamp order.
1
Causal Delivery
with Lamport Clocks
• DR1.1: Deliver all received messages in
increasing (logical clock) timestamp order.
•
•
•
1
4
Should p0 deliver?
Causal Delivery
with Lamport Clocks
• DR1.1: Deliver all received messages in
increasing (logical clock) timestamp order.
1
4
Should p0 deliver?
• Problem: Lamport Clocks don’t provide gap
detection
•
Given two events e and e’ and their clock
values LC(e) and LC(e’) — where LC(e) <
LC(e’), determine whether some e’’ event
exists s.t. LC(e)
< LC(e’’) < LC(e’)
Stability
•DR2: Deliver all received stable messages in increasing
(logical clock) timestamp order.
•A message m received by p is stable at p if p will
never receive a future message m s.t. TS(m’) < TS(m)
Implementing Stability
Real-time clocks
wait for  time units
Implementing Stability
Real-time clocks
wait for time units
Lamport clocks
wait on each channel for m s.t. TS(m) >
LC(e)
Design better clocks!
Clocks and STRONG
Clocks
Lamport clocks implement the clock
condition:
We want new clocks that implement the
strong clock condition:
Causal Histories
The causal history of an event e in (H,!) is the set
Causal Histories
The causal history of an event e in (H,!) is the set
Causal Histories
The causal history of an event e in (H,!) is the set
How to build
•Each process : pi
initializes  = 0
if eik is an internal or send event, then
if eik is a receive event for message m, then
Pruning causal histories
Prune segments of history that are known to all
processes (Peterson, Bucholz and Schlichting)
Use a more clever way to encode (e)
Vector Clocks
Consider i(e), the projection of (e) on pi
i(e) is a prefix of hi
encoded using ki
: i(e) = hiki – it can be
(e) = 1(e) [ 2(e) [ . . . [ n(e) can be encoded using
Represent  using an n-vector VC such that
Update rules
Message m is
timestamped with
Example
[1,0,0] [2,1,0]
[3,1,2]
[4,1,2]
[5,1,2]
[1,2,3]
[0,1,0]
[4,3,3]
[1,0,1]
[1,0,2]
[1,0,3]
[5,1,4]
Operational interpretation
[1,0,0] [2,1,0]
[3,1,2]
[4,1,2]
[5,1,2]
[1,2,3]
[0,1,0]
[4,3,3]
[1,0,1]
•=
•=
[1,0,2]
[1,0,3]
[5,1,4]
Operational interpretation
[1,0,0] [2,1,0]
[3,1,2]
[4,1,2]
[5,1,2]
[1,2,3]
[0,1,0]
[4,3,3]
[1,0,1]
[1,0,2]
[1,0,3]
[5,1,4]
• ´ no. of events executed pi by up to and including ei
•´
Operational interpretation
[1,0,0] [2,1,0]
[3,1,2]
[4,1,2]
[5,1,2]
[1,2,3]
[0,1,0]
[4,3,3]
[1,0,1]
[1,0,2]
[1,0,3]
[5,1,4]
•´ no. of events executed pi by up to and including ei
•´ no. of events executed by pj that happen before ei
of pi
VC properties:
event ordering
1. Given two vectors V and V+, less than is defined as:
•
V < V+ ´ (V  V+) Æ (8 k : 1· k· n : V[k] ·
V+[k])
Strong Clock Condition:
Simple Strong Clock Condition: Given ei of pi and ej of pj,
where i  j
Concurrency: Given ei of pi and ej of pj, where i  j
VC properties: consistency
Pairwise inconsistency
•
Events ei of pi and ej of pj (i  j) are pairwise
inconsistent (i.e. can’t be on the frontier of the
same consistent cut) if and only if
Consistent Cut
•
A cut defined by (c1, . . ., cn) is consistent if
and only if
VC properties:
weak gap detection
Weak gap detection
•
Given ei of pi and ej of pj, if VC(ei)[k] <
VC(ej)[k]
for some k j, then there
exists ek s.t
[2,0,1]
[2,2,2]
[0,0,2]
VC properties:
weak gap detection
Weak gap detection
•
Given ei of pi and ej of pj, if VC(ei)[k] <
VC(ej)[k]
for some k j, then there
exists ek s.t
[1,0,1]
[2,0,1]
[2,2,2]
[2,1,1]
[0,0,1]
[0,0,2]
VC properties:
strong gap detection
Weak gap detection
•
Given ei of pi and ej of pj, if VC(ei)[k] <
VC(ej)[k]
for some k j, then there exists
ek s.t
Strong gap detection
•
Given ei of pi and ej of pj, if VC(ei)[i] < VC(ej)[i]
for some k j, then there exists ei’ s.t
VCs for Causal Delivery
Each process increments the local component of its
VC only for events that are notified to the monitor
Each message notifying event e is timestamped with
VC(e)
The monitor keeps all notification messages in a set M
Stability
•Suppose p0 has received mj from pj.
•When is it safe for p0 to deliver mj ?
Stability
•Suppose p0 has received mj from pj
•When is it safe for p0 to deliver mj ?
•
There is no earlier message in M
Stability
•Suppose p0 has received mj from pj
•When is it safe for p0 to deliver mj ?
•
There is no earlier message in M
•
There is no earlier message from pj
no. of pj messages delivered by p0
Stability
•Suppose p0 has received mj from pj
•When is it safe for p0 to deliver mj ?
•
There is no earlier message in M
•
There is no earlier message from pj
no. of pj messages delivered by p0
•
There is no earlier message mk’’ from pk (k j) …
?
Checking for
.
Let mk’ be the last message p0 delivered from pk
By strong gap detection, mk’’ exists only if
Hence, deliver mj as soon as
The protocol
p0 maintains an array D[1, . . ., n] of counters
D[i] = TS(mi)[i] where mi is the last message
delivered from pi
•
DR3: Deliver m from pj as soon as both of the
following conditions are satisfied:
1.
2.