EEC 688/788 Secure and Dependable Computing

Lecture 7
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
Outline

Checkpointing and logging
  Checkpoint-based protocols
    Uncoordinated checkpointing
    Coordinated checkpointing
  Logging-based protocols
    Pessimistic logging
    Optimistic logging
    Causal logging

7/31/2017
Uncoordinated Checkpointing

Uncoordinated checkpointing: each process has full autonomy, so it appears to be simple. However, we do not recommend it for two reasons:
  Checkpoints taken might not be useful for reconstructing a consistent global state
  Cascading rollback to the initial state (domino effect)
To enable the selection of a set of consistent checkpoints during recovery, the dependencies among checkpoints have to be determined and recorded together with each checkpoint
  Extra overhead and complexity => not simple after all
Cascading Rollback Problem

Last checkpoint: C1,1 by P1, before P1 crashed
Cannot use C0,1 at P0 because it is inconsistent with C1,1 => P0 rolls back to C0,0
Cannot use C2,1 at P2 because it fails to reflect the sending of m6 => P2 rolls back to C2,0
Cannot use C3,1 and C3,0 as a result => P3 rolls back to its initial state

[Figure: timelines of processes P0-P3 exchanging messages m0-m8, with checkpoints C0,0, C0,1, C1,0, C1,1, C2,0, C2,1, C2,2, C3,0, and C3,1; P1 is marked as crashed]
Cascading Rollback Problem

The rollback of P3 to its initial state would invalidate C2,0 => P2 rolls back to its initial state
P1 rolls back to C1,0 due to the rollback of P2 to its initial state
This would invalidate the use of C0,0 at P0 => P0 rolls back to its initial state
The rollback of P0 to its initial state would invalidate the use of C1,0 at P1 => P1 rolls back to its initial state

[Figure: timelines of processes P0-P3 exchanging messages m0-m8, with checkpoints C0,0, C0,1, C1,0, C1,1, C2,0, C2,1, C2,2, C3,0, and C3,1; P1 is marked as crashed]
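The domino effect can be illustrated with a minimal sketch. It does not reproduce the figure; it uses a hypothetical two-process history (P and Q, messages m1-m3) and assumes the usual orphan-message rule: a checkpoint cannot be used if it records the receipt of a message whose sending is not recorded in the sender's chosen checkpoint. The names Checkpoint, recovery_line, and sender_of are made up for this example, and every process is assumed to restart from its latest checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    sent: frozenset = frozenset()      # ids of messages sent before the checkpoint
    received: frozenset = frozenset()  # ids of messages received before the checkpoint


def recovery_line(checkpoints, sender_of):
    """Return {process: index of the checkpoint to restore}.

    checkpoints[p] lists the checkpoints of p, oldest first; index 0 is the
    initial state (nothing sent, nothing received).
    """
    # Assumption: start from the latest checkpoint of every process.
    line = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for p, cps in checkpoints.items():
            for m in cps[line[p]].received:
                s = sender_of[m]
                if m not in checkpoints[s][line[s]].sent:
                    # Orphan message: p remembers receiving m, but the
                    # sender's checkpoint does not remember sending it.
                    line[p] -= 1
                    changed = True
                    break
    return line


# Hypothetical history: P sends m1; Q receives m1 and checkpoints; Q sends m2;
# P receives m2 and checkpoints; P sends m3; Q receives m3 and checkpoints.
checkpoints = {
    "P": [Checkpoint(),
          Checkpoint(sent=frozenset({"m1"}), received=frozenset({"m2"}))],
    "Q": [Checkpoint(),
          Checkpoint(received=frozenset({"m1"})),
          Checkpoint(sent=frozenset({"m2"}), received=frozenset({"m1", "m3"}))],
}
sender_of = {"m1": "P", "m2": "Q", "m3": "P"}
line = recovery_line(checkpoints, sender_of)
```

In this history every rollback exposes a new orphan message, so line maps both P and Q to index 0, the initial state, even though three checkpoints were taken.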
Tamir and Sequin Global Checkpointing Protocol

One of the processes is designated as the coordinator; the others are participants
The coordinator uses a two-phase commit protocol to keep the checkpoints consistent
  Global checkpointing is carried out atomically: all or nothing
  First phase: create a quiescent point of the distributed system
  Second phase: ensure the atomic switchover from the old checkpoint to the new one
Tamir and Sequin Global Checkpointing Protocol

Control messages for coordination
  CHECKPOINT message: initiates a global checkpoint and creates a quiescent point
  SAVED message: informs the coordinator that a participant has completed its local checkpoint
  FAULT message: a timeout occurred; global checkpointing should abort
  RESUME message: informs participants that it is time to resume normal operation
Sending a control message other than SAVED: send it on all outgoing channels except the one on which it was received
CHECKPOINT certificate: keeps track of whether a CHECKPOINT message has been received on every incoming channel
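The certificate and the forwarding rule can be sketched as follows. This is only an illustration under assumptions: channels are identified by plain strings, and the names Certificate and forward_targets are hypothetical, not taken from the protocol description.

```python
class Certificate:
    """Tracks on which incoming channels a control message has arrived."""

    def __init__(self, incoming_channels):
        self.expected = set(incoming_channels)
        self.seen = set()

    def add(self, channel):
        if channel not in self.expected:
            raise ValueError(f"unknown incoming channel: {channel}")
        self.seen.add(channel)

    def complete(self):
        # Complete once the message has arrived on every incoming channel.
        return self.seen == self.expected


def forward_targets(outgoing_channels, received_from=None):
    """Channels on which to send a control message other than SAVED:
    all outgoing channels except the one the message arrived from."""
    return [c for c in outgoing_channels if c != received_from]


cert = Certificate(["P0", "P2"])
cert.add("P0")
incomplete = cert.complete()   # still waiting for the channel from P2
cert.add("P2")
targets = forward_targets(["P0", "P2"], received_from="P0")
```

A set is used so that a CHECKPOINT message repeated on the same channel does not change the certificate.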
Tamir and Sequin Global Checkpointing Protocol: Finite State Machine for Coordinator

Typos: p24, figure 2.4 and p25, figure 2.5: "Final state machine" => "Finite state machine"

States: Init, Wait for CHECKPOINT, Wait for SAVED, Normal, Abort
Transitions (event => actions):
  Init -> Wait for CHECKPOINT: Start Global Chkpt => Stop Normal Execution, Send CHECKPOINT
  Wait for CHECKPOINT (no state change): Receive Regular Message => Log the Message
  Wait for CHECKPOINT (no state change): Receive CHECKPOINT & CHECKPOINT Certificate Incomplete => Add CHECKPOINT to Certificate
  Wait for CHECKPOINT -> Wait for SAVED: Receive CHECKPOINT & CHECKPOINT Certificate Complete => Add CHECKPOINT to Certificate, Take Checkpoint, Send SAVED
  Wait for CHECKPOINT -> Abort: Time Out CHECKPOINT => Send FAULT
  Wait for SAVED (no state change): Receive SAVED & SAVED Certificate Incomplete => Add to SAVED Certificate
  Wait for SAVED -> Normal: Receive SAVED & SAVED Certificate Complete => Add to SAVED Certificate, Commit Checkpoint, Send RESUME
  Wait for SAVED -> Abort: Time Out SAVED => Send FAULT
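The coordinator's state machine can be sketched as a small class. This is a minimal sketch, not the protocol's reference implementation: the class and method names are hypothetical, "sending" a control message just appends its name to a list, taking and committing the checkpoint are not modeled, and only the transitions listed in the diagram are handled.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    WAIT_FOR_CHECKPOINT = auto()
    WAIT_FOR_SAVED = auto()
    NORMAL = auto()
    ABORT = auto()


class Coordinator:
    def __init__(self, incoming_channels, participants):
        self.state = State.INIT
        self.incoming = set(incoming_channels)
        self.participants = set(participants)
        self.checkpoint_cert = set()   # channels a CHECKPOINT arrived on
        self.saved_cert = set()        # participants that reported SAVED
        self.logged = []               # regular messages logged, not executed
        self.sent = []                 # control messages "sent" (recorded only)

    def start_global_checkpoint(self):
        if self.state is State.INIT:
            self.sent.append("CHECKPOINT")   # normal execution stops here
            self.state = State.WAIT_FOR_CHECKPOINT

    def on_regular_message(self, msg):
        if self.state is State.WAIT_FOR_CHECKPOINT:
            self.logged.append(msg)

    def on_checkpoint(self, channel):
        if self.state is not State.WAIT_FOR_CHECKPOINT:
            return
        self.checkpoint_cert.add(channel)
        if self.checkpoint_cert == self.incoming:
            # Certificate complete: take the local checkpoint (not modeled),
            # then Send SAVED as listed in the diagram.
            self.sent.append("SAVED")
            self.state = State.WAIT_FOR_SAVED

    def on_saved(self, participant):
        if self.state is not State.WAIT_FOR_SAVED:
            return
        self.saved_cert.add(participant)
        if self.saved_cert == self.participants:
            self.sent.append("RESUME")       # commit checkpoint, then resume
            self.state = State.NORMAL

    def on_timeout(self):
        if self.state in (State.WAIT_FOR_CHECKPOINT, State.WAIT_FOR_SAVED):
            self.sent.append("FAULT")
            self.state = State.ABORT
```

A round with two participants would call start_global_checkpoint(), then on_checkpoint() once per incoming channel, then on_saved() once per participant; a call to on_timeout() in either waiting state aborts the round.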
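Tamir and Sequin Global Checkpointing Protocol: Finite State Machine for Participant

States: Normal, Wait for CHECKPOINT, Wait for RESUME, Abort
Transitions (event => actions):
  Normal -> Wait for CHECKPOINT: Receive CHECKPOINT => Stop Normal Execution, Send CHECKPOINT
  Wait for CHECKPOINT (no state change): Receive Regular Message => Log the Message
  Wait for CHECKPOINT (no state change): Receive CHECKPOINT & CHECKPOINT Certificate Incomplete => Add CHECKPOINT to Certificate
  Wait for CHECKPOINT -> Wait for RESUME: Receive CHECKPOINT & CHECKPOINT Certificate Complete => Add CHECKPOINT to Certificate, Take Checkpoint, Send SAVED
  Wait for CHECKPOINT -> Abort: Time Out CHECKPOINT => Send FAULT
  Wait for RESUME (no state change): Receive SAVED => Relay SAVED to Uplink
  Wait for RESUME -> Normal: Receive RESUME => Commit Checkpoint, Send RESUME
Note: SAVED is sent to the upstream node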
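The participant's transitions can also be written down as a lookup table. The sketch below is only an illustration: the event strings, the PARTICIPANT_FSM table, and the run helper are hypothetical names, and the certificate check is assumed to be resolved before the event is looked up.

```python
# (current state, event) -> (next state, actions)
PARTICIPANT_FSM = {
    ("Normal", "CHECKPOINT"):
        ("Wait for CHECKPOINT", ["Stop Normal Execution", "Send CHECKPOINT"]),
    ("Wait for CHECKPOINT", "Regular Message"):
        ("Wait for CHECKPOINT", ["Log the Message"]),
    ("Wait for CHECKPOINT", "CHECKPOINT, certificate incomplete"):
        ("Wait for CHECKPOINT", ["Add CHECKPOINT to Certificate"]),
    ("Wait for CHECKPOINT", "CHECKPOINT, certificate complete"):
        ("Wait for RESUME",
         ["Add CHECKPOINT to Certificate", "Take Checkpoint", "Send SAVED"]),
    ("Wait for CHECKPOINT", "Time Out CHECKPOINT"):
        ("Abort", ["Send FAULT"]),
    ("Wait for RESUME", "SAVED"):
        ("Wait for RESUME", ["Relay SAVED to Uplink"]),
    ("Wait for RESUME", "RESUME"):
        ("Normal", ["Commit Checkpoint", "Send RESUME"]),
}


def run(events, state="Normal"):
    """Feed events to the table; return the final state and all actions.
    An event that has no transition in the current state raises KeyError."""
    actions = []
    for event in events:
        state, step_actions = PARTICIPANT_FSM[(state, event)]
        actions.extend(step_actions)
    return state, actions


final_state, actions = run([
    "CHECKPOINT",
    "Regular Message",
    "CHECKPOINT, certificate complete",
    "SAVED",
    "RESUME",
])
```

Keeping the transitions in one table makes it easy to compare the code against the diagram line by line.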
Tamir and Sequin Global Checkpointing Protocol: Example

[Figure: system topology of P0, P1, and P2, and a timeline of one global checkpointing round showing the CHECKPOINT, SAVED, and RESUME messages, the regular messages m0 and m1, the checkpoints C0,0, C1,0, and C2,0, and the point where normal execution resumes]

P0 channel state: m0
P1 channel state: m1
P2 channel state: empty
Tamir and Sequin Global Checkpointing Protocol: Proof of Correctness

The protocol produces a consistent global state
Proof: a consistent global state arises in only two scenarios:
  All msgs sent by one process prior to taking its local checkpoint have been received by the other process prior to taking its local checkpoint
    This is the case if no process sends any msg after the global checkpoint is initiated
  Some msgs sent by one process prior to taking its local checkpoint might arrive after the other process has checkpointed its state, but they are logged for replay
    Msgs received after the initiation of global checkpointing are logged, but not executed, ensuring this property
Note that if a process fails, the global checkpointing would abort
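The two scenarios can be folded into one check. This is a rough sketch under assumptions: message ids are unique strings, the function name is_consistent is hypothetical, and each argument maps a process to a set of message ids (sent before its checkpoint, executed before its checkpoint, or logged as channel state).

```python
def is_consistent(sent, executed, logged):
    """A global checkpoint is consistent if every message sent before a
    checkpoint is either executed before the receiver's checkpoint or logged
    as channel state, and no message is executed or logged without being sent."""
    all_sent = set().union(*sent.values())
    accounted = set().union(*executed.values(), *logged.values())
    return accounted == all_sent


# Channel states as in the example slide; the senders are assumed here.
sent = {"P0": set(), "P1": {"m0"}, "P2": {"m1"}}
executed = {"P0": set(), "P1": set(), "P2": set()}
logged = {"P0": {"m0"}, "P1": {"m1"}, "P2": set()}
with_log = is_consistent(sent, executed, logged)
without_log = is_consistent(sent, executed, {"P0": set(), "P1": set(), "P2": set()})
```

With the logged channel state the checkpoint passes the check; without it, m0 and m1 would be lost and the check fails.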
Chandy and Lamport Distributed Snapshot Protocol

The CL snapshot protocol is a nonblocking protocol
  The TS checkpointing protocol is blocking
  The CL protocol is more desirable for applications that do not wish to suspend normal operation
  However, the CL protocol is only concerned with how to obtain a consistent global checkpoint
CL protocol: no coordinator; any node may initiate a global checkpoint
Data structures
  Marker message: equivalent to the CHECKPOINT message
  Marker certificate: keeps track of whether a marker has been received on every incoming channel
CL Distributed Snapshot Protocol: Finite State Machine

States: Normal, Checkpointing
Transitions (event => actions):
  Normal (no state change): Receive Regular Message => Execute the Message
  Normal -> Checkpointing: Init Global Checkpointing => Take Local Checkpoint, Send Marker
  Normal -> Checkpointing: Receive Marker => Add Marker to Certificate, Take Local Checkpoint, Send Marker
  Checkpointing (no state change): Receive Marker & Marker Certificate Incomplete => Add Marker to Certificate
  Checkpointing (no state change): Receive Regular Message & Has Received Marker in Channel => Execute the Message
  Checkpointing (no state change): Receive Regular Message & Has Not Received Marker in Channel => Append Message to Channel State, Execute the Message
  Checkpointing -> Normal: Receive Marker & Marker Certificate Complete => Add Marker to Certificate, Report Completion
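The marker handling at a single node can be sketched as follows. It is a minimal illustration, not a full implementation: the class SnapshotNode and its fields are hypothetical, the application state is just the list of executed messages, and actually sending markers is left to the caller (on_marker returns True when markers should be sent on all outgoing channels).

```python
class SnapshotNode:
    def __init__(self, name, incoming_channels):
        self.name = name
        self.incoming = set(incoming_channels)
        self.executed = []          # application state: messages executed so far
        self.checkpoint = None      # local checkpoint (copy of the state)
        self.markers = set()        # marker certificate
        self.channel_state = {}     # channel -> messages recorded as channel state
        self.checkpointing = False

    def _take_local_checkpoint(self):
        self.checkpoint = list(self.executed)
        self.markers = set()
        self.channel_state = {c: [] for c in self.incoming}
        self.checkpointing = True

    def initiate(self):
        """Any node may initiate: checkpoint, then send markers (by caller)."""
        self._take_local_checkpoint()

    def on_marker(self, channel):
        """Return True if this was the first marker, i.e. the caller should
        now send a marker on every outgoing channel."""
        first_marker = not self.checkpointing
        if first_marker:
            self._take_local_checkpoint()
        self.markers.add(channel)
        if self.markers == self.incoming:
            self.checkpointing = False   # certificate complete: report completion
        return first_marker

    def on_message(self, channel, msg):
        # Nonblocking: the message is always executed. It is also recorded as
        # channel state if no marker has arrived on this channel yet.
        if self.checkpointing and channel not in self.markers:
            self.channel_state[channel].append(msg)
        self.executed.append(msg)


p0 = SnapshotNode("P0", incoming_channels=["P1", "P2"])
p0.initiate()                 # P0 initiates and takes its local checkpoint
p0.on_message("P1", "m0")     # arrives before P1's marker: channel state
p0.on_marker("P1")
p0.on_message("P1", "m2")     # arrives after P1's marker: executed only
p0.on_marker("P2")
```

The run mirrors the example's outcome for P0: m0 ends up in the channel state of the P1 to P0 channel, while the node keeps executing messages throughout.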
Example

[Figure: system topology of P0, P1, and P2, and a timeline of one snapshot round showing the Marker messages, the regular messages m0 and m1, and the checkpoints C0,0, C1,0, and C2,0]

P0 channel state: m0 (P1 to P0 channel)
P1 channel state: m1 (P2 to P1 channel)
P2 channel state: empty
Comparison of TS & CL Protocols

Similarity
  Both rely on control msgs to coordinate checkpointing
  Both capture channel state in virtually the same way
    Start logging channel state upon receiving the 1st checkpoint msg from another channel
    Stop logging channel state after receiving the checkpoint msg on the incoming channel
  Communication overhead is similar

[Figure: channel state captured for an incoming channel to process P. (a) CL protocol: the channel state covers the regular messages that arrive between the point where the local state is checkpointed (P initiated the global checkpoint or received the 1st Marker from another channel) and the point where the Marker is received in this channel. (b) TS protocol: the channel state covers the regular messages that arrive between the point where P initiated the global checkpoint or received the 1st CHECKPOINT from another channel and the point where the CHECKPOINT is received in this channel. Legend: Regular Message; Marker/CHECKPOINT Message]
Comparison of TS & CL Protocols

Differences: strategies for producing a global checkpoint
  The TS protocol suspends normal operation upon the 1st checkpoint msg, while the CL protocol does not
  The TS protocol captures channel state prior to taking a checkpoint, while the CL protocol captures channel state after taking a checkpoint
  The TS protocol is more complete and robust than the CL protocol
    It has a fault handling mechanism
Log Based Protocols

Work might be lost upon recovery using checkpoint-based protocols
By logging messages, we may be able to recover the system to where it was prior to the failure
System model: the execution of a process is modeled as a sequence of consecutive state intervals
  Each interval is initiated by a nondeterministic event or the initial state
  We assume the only type of nondeterministic event is the receiving of a message

[Figure: timeline of process Pi divided into the 1st, 2nd, and 3rd state intervals, with messages m0-m5]
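The state-interval model can be made concrete with a short sketch. The event history below is hypothetical (the figure's exact send/receive pattern is not reproduced), and state_intervals is a made-up helper name; the only rule taken from the slide is that a new interval starts at the initial state and at every message receipt.

```python
def state_intervals(events):
    """Split a process history into state intervals.

    events is a list of (kind, message) pairs with kind "send" or "recv".
    The first interval starts at the initial state; every "recv" (the only
    nondeterministic event assumed) starts a new interval.
    """
    intervals = [[]]
    for kind, msg in events:
        if kind == "recv":
            intervals.append([])
        intervals[-1].append((kind, msg))
    return intervals


# Hypothetical history of one process.
history = [
    ("send", "m0"),
    ("recv", "m1"), ("send", "m2"),
    ("recv", "m3"), ("send", "m4"), ("send", "m5"),
]
intervals = state_intervals(history)
```

Everything inside an interval is deterministic, so replaying the receipts in their original order is enough to regenerate the sends that follow them.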
Log Based Protocols

In practice, logging is always used together with checkpointing
  Limits the recovery time: start with the latest checkpoint instead of from the initial state
  Limits the size of the log: after taking a checkpoint, previously logged events can be purged
Logging protocol types:
  Pessimistic logging: msgs are logged prior to execution
  Optimistic logging: msgs are logged asynchronously
  Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each msg sent
For optimistic and causal logging, the dependencies among processes have to be tracked => more complexity, longer recovery time
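The piggybacking idea behind causal logging can be sketched in a few lines. This is a simplified illustration under assumptions: the class CausalProcess is hypothetical, a nondeterministic event is represented as a (sender, receiver, payload) tuple, stable storage is simulated by a list, and the receiver's handling of piggybacked events is not modeled.

```python
class CausalProcess:
    def __init__(self, name):
        self.name = name
        self.unlogged = []     # nondeterministic events not yet on stable storage
        self.stable_log = []   # stands in for stable storage in this sketch

    def receive(self, sender, payload):
        # Receiving a message is the nondeterministic event to be remembered.
        self.unlogged.append((sender, self.name, payload))

    def send(self, payload):
        # Piggyback every event that is not yet logged to stable storage.
        return {"payload": payload, "piggyback": list(self.unlogged)}

    def flush(self):
        # Asynchronous logging catches up: nothing left to piggyback.
        self.stable_log.extend(self.unlogged)
        self.unlogged.clear()


p1 = CausalProcess("P1")
p1.receive("P0", "m0")
before_flush = p1.send("m1")
p1.flush()
after_flush = p1.send("m2")
```

The message sent before the flush carries the receipt of m0 with it; the one sent afterwards carries nothing extra, which is where the tracking cost of causal logging comes from.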
Pessimistic Logging

Synchronously log every incoming message to stable storage prior to execution
Each process periodically checkpoints its state: no need for coordination
Recovery: a process restores its state from the last checkpoint and replays all logged incoming msgs
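A minimal sketch of these three steps follows. It rests on several assumptions: the class PessimisticProcess is hypothetical, the application state is a running sum of integer messages, the log is a local file standing in for stable storage, and the checkpoint is kept in a field that is assumed to survive the crash.

```python
import json
import os


class PessimisticProcess:
    def __init__(self, log_path):
        self.log_path = log_path
        self.state = 0
        self.checkpoint = 0    # assumed to live on stable storage

    def _execute(self, msg):
        self.state += msg      # deterministic given the message

    def deliver(self, msg):
        # Log synchronously (flush + fsync) BEFORE executing the message.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(msg) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._execute(msg)

    def take_checkpoint(self):
        # No coordination needed; messages logged earlier can be purged.
        self.checkpoint = self.state
        open(self.log_path, "w").close()

    def recover(self):
        # Restore the last checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self._execute(json.loads(line))
```

A crash can be simulated by discarding the in-memory state and calling recover(); messages delivered after the last checkpoint are replayed from the log, so no work is lost.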
Pessimistic Logging: Example

Pessimistic logging can cope with concurrent failures and the recovery of two or more processes

[Figure: timelines of processes P0-P3 exchanging messages m0-m9, with checkpoints C0,0, C1,0, C2,0, and C3,0; two processes are marked as crashed]
Benefits of Pessimistic Logging

Processes do not need to track their dependencies
  Logging mechanism is easy to implement and less error prone
Output commit is automatically ensured
No need to carry out coordinated global checkpointing
  By replaying the logged msgs, a process can always bring itself to be consistent with other processes
Recovery can be done completely locally
  Only impact on other processes: duplicate msgs (can be discarded)
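Discarding duplicates can be sketched with per-sender sequence numbers. This is one possible approach, not something prescribed by the slides: it assumes each sender numbers its messages 0, 1, 2, ... and that channels deliver in FIFO order; DuplicateFilter is a hypothetical name.

```python
class DuplicateFilter:
    """Drops messages that a recovering sender re-sends during replay."""

    def __init__(self):
        self.highest_seen = {}   # sender -> highest sequence number accepted

    def accept(self, sender, seq):
        if seq <= self.highest_seen.get(sender, -1):
            return False         # duplicate: already received, discard
        self.highest_seen[sender] = seq
        return True


f = DuplicateFilter()
first = [f.accept("P1", 0), f.accept("P1", 1)]
# P1 crashes, recovers, and replays its log, re-sending messages 0 and 1.
replayed = [f.accept("P1", 0), f.accept("P1", 1)]
fresh = f.accept("P1", 2)
```

The receiver needs no knowledge of the sender's failure; the re-sent messages are simply rejected and new ones are accepted as usual.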