
Distributed Snapshots:
Determining Global States
of Distributed Systems
Joshua Eberhardt
Research Paper: Kanianthra Mani Chandy
and Leslie Lamport
Background

What is a distributed system?



Set of autonomous computers
Communication network
Software that integrates them into a single entity
Figure 1
Overview



Introduction
Model of a Distributed System
Global-state Detection Algorithm



Motivation
Termination
Stability Detection
Processes in Distributed Systems



A process is an instance of a computer
program being executed.
Processes in a distributed system
communicate by sending and receiving
messages.
A process can record its own state and the
messages it sends and receives.
Global States and Processes


To determine a global state, a process p
must enlist the cooperation of other processes, which
record their own states and send them to p.
The main problem is to devise an algorithm by which
processes can record global states.
Global State Detection Problems

Let y be a predicate function defined over
the global states of a distributed system
D.

(In other words, y(S) is true or false for each global
state S of D.)
The predicate y is a stable property of D if
y(S) implies y(S’) for all global states S’ of D
reachable from global state S of D.
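
As a rough illustration of this definition, the sketch below checks stability by brute force on a small, explicitly enumerated state graph. The graph `transitions`, its state names, and the helpers `reachable` and `is_stable` are hypothetical and chosen only for this example; a real distributed system cannot enumerate its global states this way.

```python
from collections import deque

def reachable(transitions, start):
    """Return every state reachable from `start` (including `start`)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_stable(transitions, y):
    """y is stable if y(S) implies y(S') for every S' reachable from S."""
    states = set(transitions) | {t for ts in transitions.values() for t in ts}
    return all(
        y(s2)
        for s in states if y(s)
        for s2 in reachable(transitions, s)
    )

# Hypothetical toy system: once "terminated" is entered it is never left.
transitions = {
    "working": ["working", "waiting"],
    "waiting": ["working", "terminated"],
    "terminated": ["terminated"],
}

def terminated(s):
    return s == "terminated"   # holds forever once true

def waiting(s):
    return s == "waiting"      # can become false again
```

Here `terminated` behaves like a stable property, while `waiting` does not, because the system can leave the waiting state.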
Going Further


Many distributed system problems can be
formulated as the general problem of
creating an algorithm by which a process in
a distributed system can determine whether
a stable property y holds.
Examples


Deadlock Detection
Termination Detection
Structure of Distributed Algorithms

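Many distributed algorithms are structured as a sequence of phases. Each phase consists of:

Transient Part (useful work is done)
Stable Part (the system cycles without making progress)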
Stability needs to be detected so that one
phase can be terminated and another
initiated.

Termination of a Computational Phase vs.
Termination of a Computation
Termination Phase


The overall problem can be partitioned into
the problems of detecting the termination of
one phase and initiating a new phase.
Example of a stable property


The kth computational phase has terminated
where k = 1, 2, 3, …
Thus we can determine the termination of the kth
phase for any given k.
Channels


A distributed system consists of a finite set
of processes and a finite set of channels.
Properties of channels.



Infinite buffers
Error-free
Deliver messages in order sent.
Linking the Terms

State of a channel

Sequence of messages sent along the channel, excluding the messages already received along it.

Process

Defined by a set of states, an initial state, and a set of events.

Event

An atomic action that may change the state of a
process and the state of at most one channel
that is incident on the process.
Figure 2: Distributed system with processes p, q, r and channels C1, C2, C3, C4.
Events

Can be defined by






Process p in which the event occurs
State s of p before the event
State s’ of p after the event
Channel c whose state is altered by the event
Message M sent along channel c
Based on these definitions, we can define
an event e as the 5-tuple <p, s, s’, M, c>.
Expanding to Global States

The global state of a distributed system is the set
of its component process and channel states.

In the initial global state, every process is in its initial state
and every channel holds the empty sequence.
Occurrences of events may change the
global state.
Events and Global States


Remember e = <p, s, s’, M, c>
We say e can occur in a global state S if and only if:

The state of p in S is s.
If c is directed towards p, then the state of c in S
is a sequence of messages with M at the head.
If c is directed away from p, no condition on c is required; after e
occurs, the state of c is its state in S with M added at the tail.
Going Further

Define a function next, where next(S, e) is the
global state immediately after the occurrence of
event e in global state S.
The value of next(S, e) is defined only if event e
can occur in global state S.

In next(S, e), the state of p is s’.
If c is directed towards p, the state of c is its state in S with M removed from the head.
If c is directed away from p, the state of c is its state in S with M added at the tail.
Computational Model



Let seq = (e_i: 0 ≤ i ≤ n) be a sequence of
events in component processes of a
distributed system.
S_{i+1} = next(S_i, e_i) for 0 ≤ i ≤ n, where S_0 is
the initial global state.
We say seq is a computation of the
system iff e_i can occur in S_i for all 0 ≤ i ≤ n.
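
The event model and the definition of a computation can be sketched in a few lines. This is a minimal illustration, not the paper's formalism verbatim: the names `Event`, `CHANNELS`, `can_occur`, `next_state` and `is_computation`, the state labels `s0`/`s1`, and the two-process topology are all assumptions made for the example.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    p: str              # process in which the event occurs
    s: str              # state of p before the event
    s2: str             # state of p after the event (s' in the text)
    M: Optional[str]    # message sent or received, if any
    c: Optional[str]    # channel whose state is altered, if any

# Hypothetical topology: channel name -> (sender, receiver).
CHANNELS = {"c": ("p", "q"), "c2": ("q", "p")}

def can_occur(S, e):
    """True if event e can occur in global state S = (process states, channel states)."""
    procs, chans = S
    if procs[e.p] != e.s:
        return False
    if e.c is not None and CHANNELS[e.c][1] == e.p:   # c is directed towards p
        return len(chans[e.c]) > 0 and chans[e.c][0] == e.M
    return True

def next_state(S, e):
    """Global state immediately after e occurs in S; defined only if e can occur."""
    if not can_occur(S, e):
        raise ValueError("event cannot occur in this global state")
    procs, chans = dict(S[0]), dict(S[1])
    procs[e.p] = e.s2
    if e.c is not None:
        if CHANNELS[e.c][1] == e.p:       # receive: remove M from the head
            chans[e.c] = chans[e.c][1:]
        else:                             # send: add M at the tail
            chans[e.c] = chans[e.c] + (e.M,)
    return procs, chans

def is_computation(S0, seq):
    """True if each event in seq can occur in the state produced by its predecessors."""
    S = S0
    for e in seq:
        if not can_occur(S, e):
            return False
        S = next_state(S, e)
    return True

# Single-token style example: s0 = "holds the token", s1 = "does not hold it".
S0 = ({"p": "s0", "q": "s1"}, {"c": (), "c2": ()})
send = Event("p", "s0", "s1", "token", "c")
recv = Event("q", "s1", "s0", "token", "c")
```

With these definitions, the sequence (send, recv) is a computation from S0, while (recv, send) is not, because the receive cannot occur while c is empty.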
Example: Single Token Conversation
(Deterministic)


Figures: a simple distributed system; state transition diagram of a process.
Example: Message Passing
(Nondeterministic)

Figure: new state transition diagrams.
Example: Message Passing
(Nondeterministic)

More than one event can occur in the initial
global state. Depending on which one occurs,
all subsequent global states can differ.
Motivation

How it works:



Each process records its own state, and the two
processes that a channel is incident on
cooperate in recording the channel state.
The algorithm is to be superimposed on the
underlying computation: it runs concurrently with the computation but must not alter it.
The next example assumes that we can record the
state of a channel instantaneously. Let c be a
channel from p to q.
Single Token Example



Assume the state of process p is recorded in global state “in p”. Now assume
that the system transitions to global state “in c”, and that the states of c, c’,
and q are recorded in global state “in c”.
The recorded global state shows two tokens: one in p and one in c!
The inconsistency arises because the state of p was recorded
before p sent the message along c and the state of c was recorded
after p sent the message.
Notation



Let n be the number of messages sent
along c before p’s state is recorded.
Let n’ be the number of messages sent
along c before c’s state is recorded.
In our example, the inconsistency arises
because n < n’ (0 < 1).
Another scenario





Suppose the state of c is recorded in global state “in p”.
The system then transitions to the global state “in c”, and the states of c’, p
and q are recorded in the global state “in c”.
The recorded state shows no tokens in the system!
The inconsistency arises because the state of c is recorded before p sends a
message along c and the state of p is recorded after p sends that message.
In other words, n > n’ (1 > 0).
To maintain consistency, we need n = n’.
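
The two scenarios can be reproduced with a tiny table of token counts. This is only a sketch: the dictionary `GLOBAL_STATES` and the function `recorded_tokens` are hypothetical names, and q and c’ are left out because they hold no token in either global state.

```python
# Token counts for p and c in the two global states used in the example.
GLOBAL_STATES = {
    "in p": {"p": 1, "c": 0},   # p holds the token, channel c is empty
    "in c": {"p": 0, "c": 1},   # p has sent the token along c
}

def recorded_tokens(p_recorded_in, c_recorded_in):
    """Tokens seen in a snapshot that records p and c in the given global states."""
    return GLOBAL_STATES[p_recorded_in]["p"] + GLOBAL_STATES[c_recorded_in]["c"]

two_tokens = recorded_tokens("in p", "in c")   # n < n': p recorded before the send
no_tokens = recorded_tokens("in c", "in p")    # n > n': c recorded before the send
one_token = recorded_tokens("in p", "in p")    # n = n': a consistent recording
```

Only the recording with n = n’ reports the single token that actually exists in the system.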
In Relation to Messages Received




Let m be the number of messages received along
c before q’s state is recorded.
Let m’ be the number of messages received along
c before c’s state is recorded.
To maintain consistency, m = m’.
In every state, the number of messages
received along a channel cannot exceed the
number of messages sent along that channel. In
other words, n ≥ m and n’ ≥ m’.
Bank Example
Important Details to Note



The state of channel c that is recorded must be
the sequence of messages sent along the
channel before the sender’s state is recorded,
excluding the messages received along the channel
before the receiver’s state is recorded.
If n’ = m’, the recorded state of c must be the
empty sequence.
If n’ > m’, the recorded state of c must be the
(m’ + 1)st, ..., n’th messages sent by p along c.
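
In code, this rule amounts to taking a slice of the sender's message history. The function below is a minimal sketch; the name `recorded_channel_state` and the list-based representation of sent messages are assumptions made for illustration.

```python
def recorded_channel_state(sent, n_prime, m_prime):
    """Return the (m'+1)st .. n'th messages sent along c.

    `sent` lists every message p has sent along c, in order.
    `n_prime` is the number sent before the sender's state was recorded;
    `m_prime` is the number received before the receiver's state was recorded.
    """
    if not 0 <= m_prime <= n_prime <= len(sent):
        raise ValueError("need 0 <= m' <= n' <= number of messages sent")
    # Python slices are 0-based, so sent[m':n'] covers messages m'+1 through n'.
    return sent[m_prime:n_prime]

in_flight = recorded_channel_state(["a", "b", "c", "d"], 3, 1)
empty = recorded_channel_state(["a", "b", "c", "d"], 2, 2)
```

When n’ = m’ the slice is empty, matching the first case above.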
Markers




From these conditions we can devise an algorithm
by which q can record the state of the channel c.
Process p sends a marker after the nth message
it sends along c and before sending any
further messages along c.
The state of c is the sequence of messages
received by q after q records its own state and
before q receives the marker along c.
To ensure n ≥ m, q must record its state (if it has
not already done so) after receiving a marker along
c and before q receives further messages along c.
Algorithm Outline

Marker Sending Rule for a Process p

For each channel c, incident on and directed
away from p:

p sends a marker along c after p records its state
and before p sends further messages along c.
Algorithm Outline

Marker Receiving Rule for a Process q

On receiving a marker along a channel c:

if (q has not recorded its state)
    q records its state
    q records the state of c as the empty sequence
else
    q records the state of c as the
    sequence of messages received
    along c after q’s state was recorded
    and before q received the marker along c
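
The two rules can be exercised in a small single-threaded simulation. The sketch below is illustrative only: the `Process` class, the `connect` and `snapshot_total` helpers, and the two-account transfer scenario are assumptions for this example, with FIFO channels modelled as `collections.deque` objects.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.state = balance            # application state: an account balance
        self.out = {}                   # destination name -> outgoing channel
        self.inc = {}                   # source name -> incoming channel
        self.recorded_state = None      # None until this process records its state
        self.recorded_channels = {}     # source name -> recorded channel state
        self.recording = set()          # incoming channels still being recorded

    def record_state(self):
        """Record own state, then apply the marker sending rule."""
        self.recorded_state = self.state
        self.recorded_channels = {src: [] for src in self.inc}
        self.recording = set(self.inc)
        for channel in self.out.values():
            channel.append(MARKER)      # sent before any further message

    def send(self, dst, amount):
        self.state -= amount
        self.out[dst].append(amount)

    def receive(self, src):
        """Take one message from the channel from `src` (marker receiving rule)."""
        msg = self.inc[src].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.record_state()     # channel from src stays recorded as empty
            self.recording.discard(src) # recording of this channel is complete
        else:
            self.state += msg
            if src in self.recording:   # received after recording, before marker
                self.recorded_channels[src].append(msg)

def connect(a, b):
    """Create a FIFO channel directed from a to b."""
    channel = deque()
    a.out[b.name] = channel
    b.inc[a.name] = channel

def snapshot_total(processes):
    """Sum of recorded balances plus money recorded as in transit."""
    return sum(
        proc.recorded_state + sum(sum(msgs) for msgs in proc.recorded_channels.values())
        for proc in processes
    )

# Hypothetical scenario: the two accounts hold 150 in total.
p, q = Process("p", 100), Process("q", 50)
connect(p, q)
connect(q, p)
q.send("p", 20)      # 20 is in transit on the channel from q to p
p.record_state()     # p initiates the snapshot and sends a marker to q
p.send("q", 10)      # sent after the marker, so not part of the snapshot
q.receive("p")       # q receives the marker and records its state
q.receive("p")       # q receives the 10
p.receive("q")       # p receives the 20 while recording the channel from q
p.receive("q")       # p receives q's marker; recording ends
```

In this run the snapshot records p = 100, q = 30, and the message 20 in transit from q to p, so the recorded total matches the 150 actually in the system.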
Termination of the Algorithm

The marker receiving and sending rules
guarantee that if a marker is received along
every channel, then each process will
record its state and the states of all
incoming channels.
Finite Time

To ensure that the global state recording
algorithm terminates in finite time, each
process must ensure that:


No marker remains forever in an incident input
channel.
It records its state within finite time of initiation of
the algorithm.
Finite Time


If process p records its state and there is a
channel from p to q, then q will record its
state in finite time.
Termination in finite time is ensured if, for
every process q, either q spontaneously records its state or there
is a path to q from a process p that spontaneously records its state.
Stability Detection

Motivation


It is a paradigm for many practical problems,
such as distributed deadlock detection.
Can be defined as follows


Input: A stable property y
Output: A Boolean value definite with the property

(y(S_i) → definite) and (definite → y(S_f)), where S_i
represents the global state when the algorithm is initiated and S_f
represents the global state when it terminates.
What this means



The input of the algorithm is the definition of the
function y.
During execution of the algorithm, the value y(S)
for any global state S may be determined by a
process in the system.
By saying that the output of the algorithm is the
Boolean value definite, we mean that

some designated process p enters and thereafter remains in one special
state to signal definite = true, and in another to signal definite = false.
Definite value

Definite = true


Implies the stable property holds when the
algorithm terminates.
Definite = false

Implies the stable property doesn’t hold when the
algorithm is initiated.
Solution

begin
record a global state S*;
definite := y(S*);
end.

Correctness of the stability detection algorithm



S* is reachable from S_i
S_f is reachable from S* (Theorem)
y(S) → y(S’) for all S’ reachable from S (definition of
stable property)
Hence y(S_i) → y(S*) and y(S*) → y(S_f), where definite = y(S*).
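
A minimal sketch of the solution follows. The callable `record_global_state` stands in for the snapshot algorithm and is a hypothetical parameter, as are the `terminated` predicate and the hand-written snapshots `snap_a` and `snap_b`.

```python
def detect_stability(record_global_state, y):
    """Return definite = y(S*), where S* is a recorded global state."""
    s_star = record_global_state()
    return y(s_star)

def terminated(snapshot):
    """Example stable property: every process is idle and every channel is empty."""
    processes, channels = snapshot
    return (all(state == "idle" for state in processes.values())
            and all(len(msgs) == 0 for msgs in channels.values()))

# Hand-written snapshots (process states, channel states) for illustration.
snap_a = ({"p": "idle", "q": "idle"}, {"c": [], "c2": []})
snap_b = ({"p": "idle", "q": "idle"}, {"c": ["work"], "c2": []})

definite_a = detect_stability(lambda: snap_a, terminated)
definite_b = detect_stability(lambda: snap_b, terminated)
```

For `snap_b`, a message is still in transit on c, so the predicate is false on the recorded state even though both processes are idle.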
Conclusion


Distributed systems underlie many
applications in use today, especially
database applications.
It’s important to know how the
processes interact with one another and to
know the global state of the system to
ensure it is consistent.
References


Chandy, K. M. and Lamport, L. Distributed
Snapshots: Determining Global States of
Distributed Systems.
http://www.eecs.ucf.edu/~dcm/Teaching/COT4810-Spring2011/Literature/ChandyAndLamport.pdf
Llewellyn, M. Intro to OS: (Distributed Process
Management).
http://www.cs.ucf.edu/courses/cop4600/sum2010/distributed%20process%20management%20%20part%202%20(12).pdf