EEC 688/788 Secure and Dependable Computing
Lecture 7
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
[email protected]
7/31/2017

Outline
- Checkpointing and logging
- Checkpoint-based protocols
  - Uncoordinated checkpointing
  - Coordinated checkpointing
- Logging-based protocols
  - Pessimistic logging
  - Optimistic logging
  - Causal logging

Uncoordinated Checkpointing
- Uncoordinated checkpointing: full autonomy, appears to be simple. However, we do not recommend it for two reasons
  - Checkpoints taken might not be useful to reconstruct a consistent global state
  - Cascading rollback to the initial state (domino effect)
- To enable the selection of a set of consistent checkpoints during a recovery, the dependency of checkpoints has to be determined and recorded together with each checkpoint
  - Extra overhead and complexity => not simple after all

Cascading Rollback Problem
- Last checkpoint: C1,1 by P1, taken before P1 crashed
- Cannot use C0,1 at P0 because it is inconsistent with C1,1 => P0 rolls back to C0,0
- Cannot use C2,1 at P2 because it fails to reflect the sending of m6 => P2 rolls back to C2,0
- Cannot use C3,1 and C3,0 as a result => P3 rolls back to its initial state
Figure: space-time diagram of processes P0, P1, P2, P3 exchanging messages m0 to m8, with checkpoints C0,0, C0,1, C1,0, C1,1, C2,0, C2,1, C2,2, C3,0, C3,1; P1 is marked as crashed.

Cascading Rollback Problem
- The rollback of P3 to its initial state would invalidate C2,0 => P2 rolls back to its initial state
- P1 rolls back to C1,0 due to the rollback of P2 to its initial state
- This would invalidate the use of C0,0 at P0 => P0 rolls back to its initial state
- The rollback of P0 to its initial state would invalidate the use of C1,0 at P1 => P1 rolls back to its initial state
Figure: the same space-time diagram (processes P0 to P3, messages m0 to m8, checkpoints C0,0 through C3,1; P1 crashed).

Tamir and Sequin Global Checkpointing Protocol
- One of the processes is designated as the coordinator
- Others are participants
- The coordinator uses a two-phase commit protocol
  for consistency on the checkpoints
  - Global checkpointing is carried out atomically: all or nothing
  - First phase: create a quiescent point of the distributed system
  - Second phase: ensure the atomic switchover from the old checkpoint to the new one

Tamir and Sequin Global Checkpointing Protocol
- Control messages for coordination
  - CHECKPOINT message: initiates a global checkpoint and creates the quiescent point
  - SAVED message: informs the coordinator that a participant has completed its local checkpoint
  - FAULT message: a timeout occurred, global checkpointing should abort
  - RESUME message: informs participants that it is time to resume normal operation
- Sending a control message other than SAVED: sent on all outgoing channels except the one on which it was received
- CHECKPOINT certificate: keeps track of whether a CHECKPOINT has been received from every incoming channel

Tamir and Sequin Global Checkpointing Protocol: Finite State Machine for Coordinator
- Typos: p24, figure 2.4; p25, figure 2.5: "Final state machine" => "Finite state machine"
Figure: state diagram for the coordinator.
States: Init, Wait for CHECKPOINT, Wait for SAVED, Normal, Abort
Transition labels as they appear in the diagram:
  - Start Global Chkpt; Stop Normal Execution; Send CHECKPOINT
  - Receive CHECKPOINT; Add CHECKPOINT to Certificate & CHECKPOINT Certificate Incomplete
  - Receive Regular Message; Log the Message
  - Receive CHECKPOINT; Add CHECKPOINT to Certificate & CHECKPOINT Certificate Complete; Take Checkpoint; Send SAVED
  - Time Out CHECKPOINT; Send FAULT
  - Time Out SAVED; Send FAULT
  - Receive SAVED; Add to SAVED Certificate & SAVED Certificate Incomplete
  - Receive SAVED; Add to SAVED Certificate & SAVED Certificate Complete; Commit Checkpoint; Send RESUME

Tamir and Sequin Global Checkpointing Protocol: Finite State Machine for Participant
- SAVED: sent to the upstream node
Figure: state diagram for a participant.
States: Normal, Wait for CHECKPOINT, Wait for RESUME, Abort
Transition labels as they appear in the diagram:
  - Receive CHECKPOINT; Stop Normal Execution; Send CHECKPOINT
  - Receive Regular Message; Log the Message
  - Receive CHECKPOINT; Add CHECKPOINT to Certificate & CHECKPOINT Certificate Incomplete
  - Receive RESUME; Commit Checkpoint; Send RESUME
  - Time Out CHECKPOINT; Send FAULT
  - Receive CHECKPOINT; Add CHECKPOINT to
    Certificate & CHECKPOINT Certificate Complete; Take Checkpoint; Send SAVED
  - Receive SAVED; Relay SAVED to Uplink

Tamir and Sequin Global Checkpointing Protocol: Example
Figure: system topology of P0, P1, P2 and a space-time diagram showing regular messages m0 and m1, the CHECKPOINT, SAVED, and RESUME control messages, and checkpoints C0,0, C1,0, C2,0.
- P0 channel state: m0
- P1 channel state: m1
- P2 channel state: empty
- After RESUME: resume normal execution

Tamir and Sequin Global Checkpointing Protocol: Proof of Correctness
- The protocol produces a consistent global state
- Proof: a consistent global state consists of only two scenarios:
  - All msgs sent by one process prior to its taking a local checkpoint have been received prior to the other process taking its local checkpoint
  - Some msgs sent by one process prior to its taking a local checkpoint might arrive after the other process has checkpointed its state, but they are logged for replay
- This is the case if no process sends any msg after the global checkpoint is initiated
- Msgs received after the initiation of global checkpointing are logged, but not executed, ensuring this property
- Note that if a process fails, the global checkpointing would abort

Chandy and Lamport Distributed Snapshot Protocol
- CL snapshot protocol is a nonblocking protocol
  - TS checkpointing protocol is blocking
- CL protocol is more desirable for applications that do not wish to suspend normal operation
- However, CL protocol is only concerned with how to obtain a consistent global checkpoint
- CL protocol: no coordinator, any node may initiate a global checkpoint
- Data structures
  - Marker message: equivalent to the CHECKPOINT message
  - Marker certificate: keeps track of whether a marker has been received from every incoming channel

CL Distributed Snapshot Protocol
Figure: state diagram for a process.
States: Normal, Checkpointing
Transition labels as they appear in the diagram:
  - Receive Regular Message; Execute the Message
  - Init Global Checkpointing; Take Local Checkpoint; Send Marker
  - Receive Marker; Add Marker to Certificate; Take Local Checkpoint; Send Marker
  - Receive Marker; Add Marker to Certificate &
    Marker Certificate Complete; Report Completion
  - Receive Marker; Add Marker to Certificate & Marker Certificate Incomplete
  - Receive Regular Message & Has Received Marker in Channel; Execute the Message
  - Receive Regular Message & Has Not Received Marker in Channel; Append Message to Channel State; Execute the Message

Example
Figure: system topology of P0, P1, P2 and a space-time diagram showing regular messages m0 and m1, the Marker messages, and checkpoints C0,0, C1,0, C2,0.
- P0 channel state: m0 (P1 to P0 channel)
- P1 channel state: m1 (P2 to P1 channel)
- P2 channel state: empty

Comparison of TS & CL Protocols
- Similarity
  - Both rely on control msgs to coordinate checkpointing
  - Both capture channel state in virtually the same way
    - Start logging channel state upon receiving the 1st checkpoint msg from another channel
    - Stop logging channel state after receiving the checkpoint msg on the incoming channel
  - Communication overhead is similar
Figure: channel state captured on an incoming channel to process P.
  (a) From the point where P's local state is checkpointed (P initiated GC or received the 1st Marker from another channel) until the Marker is received in this channel.
  (b) From the point where P initiated GC or received the 1st CHECKPOINT from another channel until the CHECKPOINT is received in this channel.
  Legend: regular message; Marker/CHECKPOINT message.

Comparison of TS & CL Protocols
- Differences: strategies in producing a global checkpoint
  - TS protocol suspends normal operation upon the 1st checkpoint msg, while CL does not
  - TS protocol captures channel state prior to taking a checkpoint, while CL captures channel state after taking a checkpoint
  - TS protocol is more complete and robust than CL
    - Has a fault handling mechanism

Log Based Protocols
- Work might be lost upon recovery using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
  - Each interval is initiated by a nondeterministic event or the initial state
  - We assume the only type of nondeterministic
    event is the receipt of a message
Figure: timeline of process Pi receiving messages m0 to m5, divided into the 1st, 2nd, and 3rd state intervals.

Log Based Protocols
- In practice, logging is always used together with checkpointing
  - Limits the recovery time: start with the latest checkpoint instead of from the initial state
  - Limits the size of the log: after taking a checkpoint, previously logged events can be purged
- Logging protocol types:
  - Pessimistic logging: msgs are logged prior to execution
  - Optimistic logging: msgs are logged asynchronously
  - Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked with each msg sent
- For optimistic and causal logging, dependency of processes has to be tracked => more complexity, longer recovery time

Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state using the last checkpoint and replays all logged incoming msgs

Pessimistic Logging: Example
- Pessimistic logging can cope with concurrent failures and the recovery of two or more processes
Figure: space-time diagram of processes P0, P1, P2, P3 exchanging messages m0 to m9, with checkpoints C0,0, C1,0, C2,0, C3,0; two processes are marked as crashed.

Benefits of Pessimistic Logging
- Processes do not need to track their dependencies
- Output commit is automatically ensured
- No need to carry out coordinated global checkpointing
- Logging mechanism is easy to implement and less error prone
- By replaying the logged msgs, a process can always bring itself to be consistent with other processes
- Recovery can be done completely locally
  - Only impact to other processes: duplicate msgs (can be discarded)
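
Pessimistic Logging: Sketch
The pessimistic logging rules above (log each incoming message before executing it, checkpoint independently, recover from the last checkpoint by replaying the log, discard duplicates) can be illustrated with a minimal Python sketch. It is not taken from the lecture: the class name PessimisticProcess, its methods, and the toy state (a running total) are hypothetical, and plain in-memory fields stand in for stable storage on the assumption that they survive a crash.

```python
class PessimisticProcess:
    """Toy process whose only nondeterministic event is receiving a message."""

    def __init__(self):
        # Volatile state: lost when the process crashes.
        self.total = 0
        self.delivered = set()
        # Stand-ins for stable storage (assumption: these survive a crash;
        # a real system would write them synchronously to disk).
        self.stable_checkpoint = (0, set())
        self.stable_log = []

    def _execute(self, msg_id, value):
        # Deterministic execution of one message.
        self.total += value
        self.delivered.add(msg_id)

    def receive(self, msg_id, value):
        # Duplicates (e.g. resent by a recovering peer) are discarded.
        if msg_id in self.delivered:
            return False
        # Pessimistic rule: log to stable storage BEFORE executing.
        self.stable_log.append((msg_id, value))
        self._execute(msg_id, value)
        return True

    def checkpoint(self):
        # Uncoordinated local checkpoint; earlier log entries can be purged.
        # (A real system would need to make these two steps atomic.)
        self.stable_checkpoint = (self.total, set(self.delivered))
        self.stable_log.clear()

    def crash(self):
        # Simulate losing all volatile state.
        self.total = None
        self.delivered = None

    def recover(self):
        # Restore the last checkpoint, then replay the logged messages in order.
        total, delivered = self.stable_checkpoint
        self.total = total
        self.delivered = set(delivered)
        for msg_id, value in self.stable_log:
            self._execute(msg_id, value)


# Usage: the state after recovery matches the state before the crash.
p = PessimisticProcess()
p.receive("m0", 5)
p.checkpoint()
p.receive("m1", 7)
p.receive("m2", 1)
before = p.total
p.crash()
p.recover()
after = p.total
duplicate_accepted = p.receive("m2", 1)
```

Recovery here touches only the process's own checkpoint and log, which is the "completely local" property listed above; the only visible side effect for peers is a duplicate message, which the delivered set filters out.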