Chapter 9: Global Snapshot

Scalable Algorithms for Global Snapshots in Distributed Systems

Rahul Garg (IBM India Research Lab)
Vijay K. Garg (Univ. of Texas at Austin)
Yogish Sabharwal (IBM India Research Lab)
Motivation for Global Snapshot

- Checkpointing to tolerate faults
  - Take a global snapshot periodically
  - On failure, restart from the last checkpoint
- Global property detection
  - Detecting deadlock, loss of a token, etc.
- Distributed debugging
  - Inspecting the global state
Consistent and inconsistent cuts

Figure: space-time diagram of processes P1, P2, P3 exchanging messages m1, m2, m3, with two cuts G1 and G2.

- G1 is not consistent
- G2 is consistent, but m3 must be recorded as an in-transit message
Model of the System

- No shared clock
- No shared memory
- Processes communicate using messages
- Messages are reliable
- No upper bound on message delivery time
Checkpoint

Classification of Messages

- w – white process (pre-recording its local state)
- r – red process (post-recording)
- Messages are classified by the colors of sender and receiver: ww, wr, rw, rr
  - e.g. rw – sent by a red process, received by a white process

Figure: messages of types rw, rr, ww, and wr exchanged between processes P and Q.

- A process must be red to receive a red message
  - A white process turns red on receiving a red message
- Any white message received by a red process must be recorded as an in-transit message
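A minimal sketch of these coloring rules in Python, assuming every message carries its sender's color; the class and field names below are illustrative, not taken from the paper.

# Sketch of the white/red coloring rules (hypothetical names; every message
# is assumed to carry its sender's color).

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.color = "white"        # white = pre-recording, red = post-recording
        self.in_transit = []        # white messages received while red

    def record_local_state(self):
        """Turn red: record the local state as part of the snapshot."""
        if self.color == "white":
            self.color = "red"
            # ... save the local state here ...

    def send(self, payload):
        # Outgoing messages are tagged with the sender's current color.
        return {"sender": self.pid, "color": self.color, "payload": payload}

    def receive(self, msg):
        # A process must be red to receive a red message: a white process
        # records its state (turns red) before processing a red message.
        if msg["color"] == "red" and self.color == "white":
            self.record_local_state()
        # Any white message received by a red process is in transit with
        # respect to the snapshot and must be recorded.
        if msg["color"] == "white" and self.color == "red":
            self.in_transit.append(msg)
        # ... normal processing of msg["payload"] goes here ...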
Previous Work

- Chandy and Lamport's algorithm
  - Assumes FIFO channels
  - Requires one message (marker) per channel
  - The marker indicates the end of white messages
- Mattern's algorithm; Schulz, Bronevetsky et al.
  - Work for non-FIFO channels
  - Require a message that indicates the total number of white messages sent on each channel
Results

  Algorithm      Message Complexity      Message Size   Space
  CLM            O(N^2)                  O(1)           O(N)
  Grid-based     O(N^{3/2})              O(√N)          O(N)
  Tree-based     O(N log N log(W/N))     O(1)           O(1)
  Centralized    O(N log(W/N))           O(1)           O(1)

(N = number of processes, W = number of in-transit white messages.)
Grid-based Algorithm

- Idea 1
  - Previously: send the number of white messages per channel
  - This algorithm: send the total number of white messages destined to each process
- Idea 2
  - Previously: each process sends N messages of size O(1)
  - Now: each process sends √N messages of size √N (the processes are arranged in a √N x √N grid)
Grid-based Algorithm

Algorithm for P(r,c):
- Step 1: send row i of the whiteSent matrix to P(r,i), for every i
- Step 2: compute the cumulative count for row c (the sum of the row vectors received from the processes in row r)
  - Send this count to P(c,c)
- Step 3: if (r = c)   // diagonal entry
  - Receive the counts from all processes in the column
  - Send the jth entry of their sum to P(c,j)

Example (figures): each process keeps whiteSent, a √N x √N matrix of white-message counts with one entry per destination process. One process in the example starts with

  whiteSent =  1 0 3
               2 1 0
               4 0 0

and in Step 1 sends the row vectors [1 0 3], [2 1 0], [4 0 0] to the processes in its own row. In Step 2, a process in the third row sums the row-2 vectors it received ([1 2 3] + [2 1 0] + [1 4 1] = [4 7 4]); this gives, for each processor of the second row, the count of messages sent to it from processors in the third row, and it is forwarded to P(2,2). In Step 3, the diagonal process P(2,2) receives one such partial count from every process in its column (shown as [2 1 2], [1 0 1], [4 7 4]) and adds them to obtain, for each processor of the second row, the count of messages sent to it from all processors (shown as [7 8 6]); it then sends the jth entry to P(2,j).
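To make the three steps concrete, here is a toy, single-process Python simulation of the grid reduction. No real message passing is involved, the names whiteSent, row_msgs and col_msgs are illustrative, and the counts are random rather than taken from the figures above.

# Toy simulation of the grid reduction (illustrative names, not the paper's code).
# Processes are identified as (r, c) in a k x k grid, so N = k * k.
# whiteSent[(r, c)][i][j] counts the white messages that process (r, c)
# has sent to process (i, j).

import random

k = 3                                      # grid side, N = 9 processes
procs = [(r, c) for r in range(k) for c in range(k)]
whiteSent = {p: [[random.randint(0, 4) for _ in range(k)] for _ in range(k)]
             for p in procs}

# Step 1: P(r, c) sends row i of its matrix to P(r, i).
row_msgs = {p: [] for p in procs}          # row vectors received by each process
for (r, c) in procs:
    for i in range(k):
        row_msgs[(r, i)].append(whiteSent[(r, c)][i])

# Step 2: P(r, c) adds the vectors it received (messages from row r destined
# to the processes of row c) and sends the sum to the diagonal process P(c, c).
col_msgs = {p: [] for p in procs}
for (r, c) in procs:
    partial = [sum(v[j] for v in row_msgs[(r, c)]) for j in range(k)]
    col_msgs[(c, c)].append(partial)

# Step 3: P(c, c) adds the partial counts from its column; the j-th entry is
# the total number of white messages destined to P(c, j).
destined = {}
for c in range(k):
    total = [sum(v[j] for v in col_msgs[(c, c)]) for j in range(k)]
    for j in range(k):
        destined[(c, j)] = total[j]

# Sanity check against a direct count over all senders.
for (i, j) in procs:
    assert destined[(i, j)] == sum(whiteSent[p][i][j] for p in procs)
print(destined)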
Tree/Centralized Algorithms

- Idea
  - Previously: maintain the number of white messages sent to every destination
  - These algorithms: each node maintains only a local deficit
    - Local deficit = white messages sent − white messages received
    - Total deficit = sum of all local deficits

Distributed Message Counting Problem

- W in-transit messages destined for N processors
- Detect when all messages have been received
- W tokens: a token is consumed when a message is received
Tree/Centralized Algorithms

- Distributed Message Counting Algorithm
  - Arrange the nodes in a suitable data structure
  - Distribute the tokens equally to all processors at the start: w = W/n tokens each
  - Each node has a color:
    - Green (Rich): has more than w/2 tokens
    - Yellow (Debt-free): has at most w/2 tokens
    - Orange (Poor): has no tokens and has received a white message
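A small Python sketch of the per-node bookkeeping (illustrative names; the protocol for transferring tokens between nodes is not shown, and the pending counter is an assumed way of remembering messages received while out of tokens).

# Sketch of per-node token bookkeeping (illustrative names).

class Node:
    def __init__(self, node_id, tokens, w):
        self.node_id = node_id
        self.tokens = tokens      # starts at w = W/n
        self.w = w                # per-round quota, fixed at the start of a round
        self.pending = 0          # messages received while holding no tokens

    def on_message(self):
        """A token is consumed whenever a message is received."""
        if self.tokens > 0:
            self.tokens -= 1
        else:
            self.pending += 1     # the node is now Poor and must obtain tokens

    def color(self):
        if self.tokens > self.w / 2:
            return "green"        # Rich
        if self.tokens == 0 and self.pending > 0:
            return "orange"       # Poor: no tokens and a message was received
        return "yellow"           # Debt-free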
Tree-based Algorithm: High-level Idea

- Arrange the nodes as a binary tree
- The algorithm progresses in rounds
  - In each round all the nodes start off rich
  - A token is consumed on receiving a message
- A debt-free node cannot have a rich child
  - Ensured by transfer of tokens
- Starting a new round
  - When the root is no longer rich, at least half of the tokens have been consumed
Tree-based Algorithm

Figure: a binary tree of nodes, with invariants I1 and I2 marked.

- Invariants
  - I1: a yellow process cannot have a green child
  - I2: the root is always green
  - I3: any orange node eventually becomes yellow
Tree-based Algorithm – Example

Figures: a sequence of tree states showing how violations of the invariants are repaired.
- A yellow node with a green child violates I1; the figures show the violation repaired by a Swap Request / Swap Accept exchange between the two nodes.
- An orange node violates I3; the figures show it repaired by a Split Request / Split Accept exchange with a green node, which splits its tokens with the requester.
- Eventually the root itself stops being green, violating I2, which triggers a reset round (next slide).
Reset Round

- Recalculate the remaining tokens W'  ( W' <= nw/2 = W/2 )
- Start a new round with W'
- Redistribute the tokens equally; all nodes turn green again
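A minimal sketch of the reset step, reusing the Node fields from the sketch above and assuming the remaining tokens can be gathered centrally (in the tree-based algorithm this aggregation happens along the tree):

# Sketch of a reset round (illustrative; not the paper's code).

def reset_round(nodes):
    """Recalculate the remaining tokens W' and redistribute them equally."""
    # Messages received without a token (pending) still owe one token each.
    remaining = sum(n.tokens for n in nodes) - sum(n.pending for n in nodes)
    share, extra = divmod(remaining, len(nodes))
    for i, node in enumerate(nodes):
        node.tokens = share + (1 if i < extra else 0)   # every node starts rich
        node.w = share                                   # new quota w' = W'/n
        node.pending = 0
    return remaining                                     # W' <= W/2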
Tree-based Algorithm – Analysis

- Number of rounds
  - The tokens reduce by at least half in every round
  - Number of rounds = O(log(W/n))
- Number of control messages per round
  - O(log n) control messages per color change
  - Whenever the color changes, some green node turns yellow, so there are O(n) color changes per round
  - Number of control messages per round = O(n log n)
- Total control messages = O(n log n log(W/n))
- If W < 2n, only O(n) messages are required
Centralized Algorithm

- Idea
  - In the tree-based algorithm, every color change requires a search for a green node to split/swap tokens with
    - This requires O(log n) control messages
  - Can we find a green node with O(1) control messages?
    - A master node (tail) maintains a list of all green nodes
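A Python sketch of the master's directory of green nodes (all names are illustrative and the message layer is elided); one request and one reply per color change is what makes the per-change cost O(1) control messages.

# Sketch of the centralized green-node directory (illustrative names).

class Master:
    def __init__(self, green_ids):
        self.green = list(green_ids)       # ids of the currently green nodes

    def find_green(self):
        """Return some green node to split/swap tokens with, or None."""
        return self.green[-1] if self.green else None    # taken from the tail

    def report_not_green(self, node_id):
        """A node reports that it has stopped being rich (green -> yellow)."""
        if node_id in self.green:
            self.green.remove(node_id)     # O(n) locally, but no extra messages
        return len(self.green) > 0         # False once no green node remains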
Centralized Algorithm – Example

Figures: Swap Request and Swap Accept messages exchanged with the help of the master node, which keeps track of the green nodes as colors change.
Centralized Algorithm – Analysis

- Number of rounds
  - The tokens reduce by at least half in every round
  - Number of rounds = O(log(W/n))
- Number of control messages per round
  - O(1) control messages per color change
  - Whenever the color changes, some green node turns yellow, so there are O(n) color changes per round
  - Number of control messages per round = O(n)
- Total control messages = O(n log(W/n))
- If W < 2n, only O(n) messages are required
Lower Bound

- Observation
  - Suppose there are W outstanding tokens
  - Some process must generate a control message on receiving at most W/n white messages

Figure: the W outstanding tokens shown as W/n per process.

- Adversary argument
  - Send W/n white messages to that processor
  - Remaining tokens = (n-1)W/n
  - Repeat the argument recursively
  - Tokens remaining after i control messages >= ((n-1)/n)^i · W
  - Number of control messages = Ω(n log(W/n))
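The step from the recursion to the stated bound is a short calculation that the slide leaves implicit (a sketch): the adversary can repeat the argument as long as at least n tokens remain, that is, as long as

  \left(\frac{n-1}{n}\right)^{i} W \;\ge\; n
  \quad\Longleftrightarrow\quad
  i \;\le\; \frac{\ln(W/n)}{\ln\frac{n}{n-1}} \;\approx\; n \ln\frac{W}{n},

since \ln\frac{n}{n-1} \approx \frac{1}{n} for large n. Hence any counting algorithm can be forced to generate \Omega(n \log(W/n)) control messages.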
Experimental Results

Figure: total latencies (in milliseconds) of the Grid, Tree, and Centralized algorithms for (N, W) = (32, 2880992), (64, 5764032), (128, 11536256), (256, 23105280), and (512, 46341632).
Figure: average message counts (W = 40,000) of the Grid, Tree, and Centralized algorithms for N = 32, 64, 128, 256, and 512.
Conclusions

- Global snapshots in distributed systems
  - The Distributed Message Counting problem
- Optimal algorithm
  - Message complexity O(n log(W/n))
  - Matching lower bound
  - Centralized algorithm
- Open problem
  - Decentralized algorithm?
Thank You
Questions?