Jerzy Brzeziński
Michał Szychowiak
An Extended Home-Based
Coherence Protocol for
Causally Consistent Replicated
Read-Write Objects
Checkpoint replication – a new recovery technique for Distributed Shared Memory (DSM) systems:
low overhead – checkpointing integrated with the coherence protocol managing replicated DSM objects
high availability of shared objects despite multiple node failures and network partitioning (majority partition)
fast recovery
PLAN of the talk:
1. DSM model: consistency models, coherence protocols
2. DSM recovery problem
3. Checkpoint replication
4. Example for causal consistency
5. Examples for other consistency models
Message Passing
[Figure: processors with local memories communicating by messages – no global clock, no common memory; easy to implement, hard to program]
Distributed Shared Memory
[Figure: processors accessing shared data through DSM – easy to program, hard to implement]
DSM data unit
physical memory page
single variable
object
object members ("object value")
object methods (encapsulation)
persistent objects
read/write objects
read access: r(x)v
write access: w(x)v
Replication
[Figure: replicas X, X′, X″ of the same data held by different processors – local access improves efficiency, but local access requires data replication]
Consistency
[Figure: replicas X, X′, X″ kept consistent by a coherence protocol]
Consistency models
How to efficiently control concurrent access to the
same object on different nodes to maintain the
consistency of DSM?
”A consistency model is essentially a contract
between the software and the memory. It says that if
the software agrees to obey certain rules, the
memory promises to work correctly”
[A.S. Tanenbaum]
Consistency models
Atomic consistency
Sequential consistency
Causal consistency
PRAM consistency
Cache consistency
Processor consistency
Relaxed consistency models (release, entry, ...)
Symbols
Ĥ = set of all operations issued by the system
Hi = set of access operations to shared objects issued by Pi
HW = set of all write operations to shared objects
H|x = set of all access operations issued to object x
local order relation: →i = total order of operations issued by Pi
real-time order: o1 →RT o2 = operation o1 finishes in real time before o2 starts
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order → of the operations in Ĥ:
o1 →RT o2 ⇒ o1 → o2,
satisfying two conditions:
(monotonic reads) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⇒ ¬( w(x)v → o(x)u ∧ o(x)u → r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) → w2(x) ∨ w2(x) → w1(x)
Atomic consistency
[Example: P2 issues w2(x)1, w2(x)2 and P1 issues r1(x)1, r1(x)2; the total order w2(x)1 → r1(x)1 → w2(x)2 → r1(x)2 satisfies both conditions]
Sequential consistency
Execution of access operations is sequentially consistent if for each process Pi ∈ Pcor there exists a legal serialization →i of the operations in Hi ∪ HW satisfying two conditions:
(local order) ∀ o1, o2 ∈ Hi ∪ HW :: ( ∃ Pj ∈ Pcor : o1 →j o2 ) ⇒ o1 →i o2
(exclusive writing) ∀ w1(x), w2(x) ∈ HW, ∀ Pi ∈ Pcor :: w1(x) →i w2(x) ∨ w2(x) →i w1(x)
Sequential consistency
[Example: P2 issues w2(x)1, w2(x)2; P1 reads r1(x)1 – sequentially consistent, since the legal serialization w2(x)1 → r1(x)1 → w2(x)2 exists]
Causal order
The causal-order relation → in Ĥ is the transitive closure of the local order relations →i and a write-before relation that holds between a write operation and a read operation returning the written value:
(i) ∀ o1, o2 ∈ Ĥ :: ( o1 →i o2 ⇒ o1 → o2 )
(ii) ∀ x ∈ O, ∀ Pi, Pj ∈ P :: w(x)v → r(x)v
(iii) ∀ o1, o2, o ∈ Ĥ :: ( o1 → o ∧ o → o2 ⇒ o1 → o2 )
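Rule (iii) can be illustrated with a small transitive-closure sketch (illustrative only, not from the talk; operations are plain string labels and `pairs` holds the (i) local-order and (ii) write-before edges):

```python
def closure(pairs):
    """Transitive closure of the given order edges (rule (iii))."""
    order = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(order):
            for c, d in list(order):
                if b == c and (a, d) not in order:
                    order.add((a, d))
                    changed = True
    return order
```

Feeding it, say, the edges w1(y)1 → r2(y)1 (write-before) and r2(y)1 → r2(x)2 (local order of P2) yields the derived dependency w1(y)1 → r2(x)2.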
Causal consistency
Execution of access operations is causally consistent if for each process Pi ∈ Pcor there exists a legal serialization →i satisfying the following condition:
∀ o1, o2 ∈ Hi ∪ HW :: ( o1 → o2 ⇒ o1 →i o2 )
Causal consistency
[Example: P1 issues w1(x)2, w1(y)1, r1(x)1; P2 issues w2(x)1, r2(y)1, r2(x)2 – causally consistent: the writes w1(x)2 and w2(x)1 are concurrent, so P1 and P2 may observe them in different orders]
Coherence protocols
[Figure: the owner node of x holds the master replica; other nodes hold ordinary replicas; owner(x) and manager(x) identify the responsible processes; CS(x) – copyset, the set of replica holders]
Coherence protocols
[Figure: write update – on W(x)5 at the owner, UPD(x,5) messages bring every ordinary replica from 2 to 5; requires broadcast or similar tools]
Coherence protocols
[Figure: write invalidate – on W(x)5 at the owner, INV(x) messages invalidate the ordinary replicas; subsequent accesses must fetch the new value, which suits objects with a low read/write ratio]
Failures
[Figure: Pi writes wi(x)2 (x: 1 → 2) and invalidates Pj's replica with INV(x)/ACK; Pi fails at tf, so Pj's subsequent rj(x) cannot be served – the only valid replica is lost]
Recovery problem
[Figure: after the failure, x is restored to its old value 1 at Pi, but Pj has already executed rj(x)2 and wj(y)4 with y=square(x) – the recovered state (x=1, y=4) is inconsistent]
Recovery problem
The goal of the DSM recovery protocol is to restore a global state of the shared memory that is consistent, i.e. one that allows the distributed computation to be restarted from this state and guarantees that any access operation issued in the restarted computation will be consistent according to the assumed consistency model.
Solutions
Three main approaches:
1. write-update with full replication
2. logging of coherence operations/messages
3. checkpointing algorithms developed formerly for message passing, adapted for DSM
Solutions
write-update
write-invalidate
logging
checkpointing: coordinated / independent
most solutions offer only single-failure resilience
Checkpoints
restored object value should reflect the history of
operations performed on the object
(this is the job of the DSM recovery protocol)
consistent state of the memory may be stored in the
form of backup copies of all existing objects
(checkpoints)
on recovery, consistent values of all objects can be
quickly fetched from their checkpoints
checkpoints need to be saved in a stable
(non-volatile) storage able to survive failures
Drawbacks
1. Requirements:
external stable storage
2. Cost of checkpointing:
coordination between processes
access time to the stable storage
3. Cost of recovery:
restoration of the consistent state
re-execution of the processing
New solution
checkpoint storage in the DSM memory
checkpoint replication – special object replicas – CCS(x) = checkpoint copyset
full integration with the coherence protocol to manage all kinds of replicas
fast recovery – the current checkpoint of each object is consistent with the current state of the memory – no re-execution
Home-based coherence protocol
Every object x is assigned a home-node, which collects all the modifications of x.
The process running on that node is called the owner of x.
Different objects can have different home-nodes (owners).
We assume that there is a home-node for each object x, and that the identity of the owner holding the master replica of x is known.
Home-node
Owner holds the master replica of x.
Home-node = static owner.
The master replica is distinguished by the WR state.
Each write access issued to a WR replica is performed instantaneously, and similarly each read access.
Multiple readers
Besides the master replica there can be several
ordinary replicas in RO state.
Each read access issued to a RO replica is
performed instantaneously.
Each write access issued to a RO requires
communication with the home-node (to update the
master replica).
Multiple writers
Several processes can issue write accesses to the
same object x at the same time.
The global order of modifications of x is determined
by the order of reception of the update messages
(UPD) at the home-node.
Invalidation
The INV state denotes an invalidated ordinary replica.
Any access issued to such a replica requires fetching the value from the master replica.
The UPD message is sent from the home-node.
On reception, the invalidated replica is updated and becomes RO.
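The read/write rules above can be rendered as a toy state machine (a sketch, not the authors' code; the `Node`/`Replica` classes and direct method calls standing in for the UPD messages are illustrative assumptions):

```python
WR, RO, INV = "WR", "RO", "INV"   # master, ordinary, invalidated replica states

class Replica:
    def __init__(self, state, value=None):
        self.state, self.value = state, value

class Node:
    def __init__(self, home=None):
        self.home = home                       # None => this node is the home-node
        self.replica = Replica(WR if home is None else INV)

    def read(self):
        if self.replica.state == INV:
            # invalidated replica: fetch the value from the master (UPD message)
            self.replica.value = self.home.replica.value
            self.replica.state = RO
        return self.replica.value              # WR/RO replicas read instantaneously

    def write(self, v):
        if self.replica.state == WR:
            self.replica.value = v             # the owner writes instantaneously
        else:
            # non-owners send the new value to the home-node; the reception
            # order there defines the global order of modifications of x
            self.home.replica.value = v
            self.replica.state = INV           # simplification: refetch on next read
```

A run such as `owner.write(5)` followed by `reader.read()` shows the fetch-on-INV path; the global write order at the home-node is a plain field update here, where the protocol uses UPD message reception order.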
Coherence protocol automaton
[Figure: replica-state automaton – states WR (owner = i), RO (owner ≠ i), INV (owner ≠ i); ri(x) and wi(x) on a WR replica are local; ri(x) on INV receives x from the owner k and triggers local_invalidatei(...); wi(x) on RO/INV receives an ack from the owner k and triggers local_invalidatei({...x...}); at the owner, a remote rj(x) sends x to Pj, and a remote wj(x) triggers local_invalidatei(...) and an ack to Pj]
Example
[Figure: timeline of Pi and Pj – writes wi(x)1, wj(x)2, wj(x)3, wi(x)4 and wj(y)4, wj(y)9; UPD(x,v,VT) messages propagate values between master and ordinary replicas; replica states annotated as x{state, value, VT}, e.g. x{WR,1,[1,0]}, x{RO,2,[1,1]}, x{WR,4,[2,3]}, y{WR,9,[1,3]}]
Checkpointing
A WR replica which has been modified and not accessed by any process other than the owner is dirty.
The first external access to a WR/dirty replica requires checkpointing the accessed value (in order to protect the causal dependency).
The checkpoint operation consists in replicating the accessed value, i.e. updating the checkpoint replicas.
Checkpoint replicas
A C replica is updated on checkpoint operations and never becomes invalidated.
A single checkpoint operation ckpt(x)v is performed by the owner of x, Pk, and consists in atomically updating all C replicas of x held by processes included in CCS(x) with the value v, carried in the ckpt(x,v,VTkx) message.
Checkpoint replicas
A C replica of x held by Pj is duplicated to a new ROC replica on the first local access to x.
In fact, a checkpoint replica is a prefetched current value of x and can therefore serve further accesses.
During the next single checkpoint operation ckpt(x)v′, all existing ROC replicas of x are destroyed.
Checkpointing
RO/dirty and ROC/dirty denote additional information about the write accesses invoked locally on an RO or ROC replica.
A replica is in the RO/dirty or ROC/dirty state when it has been modified by the holding process but possibly not yet checkpointed by its owner.
Burst checkpointing
The coordinated burst checkpoint operation consists in atomically performing two operations:
(1) taking a single checkpoint of all objects x in the WR state, and
(2) sending checkpoint requests CKPT_REQ(y) to the owners of all other objects y in the RO/dirty and ROC/dirty states.
After that, the RO/dirty and ROC/dirty objects are switched to the RO and ROC states, respectively.
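A rough sketch of this two-step burst (the names `ckpt` and `send_ckpt_req` are stand-ins for the real checkpoint operation and the CKPT_REQ(y) message, not from the paper):

```python
def burst_checkpoint(objects, ckpt, send_ckpt_req):
    """objects: name -> (state, value); states as on the slides."""
    for name, (state, value) in list(objects.items()):
        if state == "WR":
            ckpt(name, value)                  # (1) checkpoint every own WR object
        elif state in ("RO/dirty", "ROC/dirty"):
            send_ckpt_req(name)                # (2) CKPT_REQ(y) to y's owner
            # afterwards the dirty replica is switched back to RO / ROC
            objects[name] = (state.split("/")[0], value)
    return objects
```

The sketch loops sequentially; in the protocol the whole burst is performed atomically.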
Vector clock
The causal relationship of the memory accesses is reflected in the vector timestamps associated with each shared object. Each process Pi manages a vector clock VTi.
The value of the i-th component of VTi counts the writes performed by Pi. More precisely, only intervals of write operations not interlaced with communication with other processes are counted, as this is sufficient to track the causal dependency between operations issued by distinct processes.
Vector clock
There are three operations performed on VTi:
inc(VTi) – increments the i-th component of VTi
update(VTi,VTj) – returns the component-wise maximum of the two vectors
compare(VTi<VTj) – returns true iff ∀k: VTi[k] ≤ VTj[k] and ∃k: VTi[k] < VTj[k]
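A minimal rendering of the three operations (vectors as plain Python lists; a sketch, not the protocol's implementation):

```python
def inc(vt, i):
    """inc(VTi): increment the i-th component."""
    vt = vt[:]                    # copy, then bump component i
    vt[i] += 1
    return vt

def update(vt_i, vt_j):
    """update(VTi,VTj): component-wise maximum of the two vectors."""
    return [max(a, b) for a, b in zip(vt_i, vt_j)]

def compare(vt_i, vt_j):
    """compare(VTi<VTj): all components <=, at least one strictly <."""
    return (all(a <= b for a, b in zip(vt_i, vt_j))
            and any(a < b for a, b in zip(vt_i, vt_j)))
```

Note that `compare` is a strict partial order: it is false both for equal vectors and for incomparable (concurrent) ones.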
Vector timestamp
Each replica of a shared object x stored at Pi is assigned a timestamp VTix.
On any local modification wi(x), VTix becomes equal to VTi; on an update UPD(x,v,VTx) from the master replica, VTix becomes equal to VTx.
Causal order
The local_invalidatei(VT) operation ensures the correctness of the basic protocol by setting to INV the status of all locally held replicas x not owned by Pi for which compare(VTix<VT) is true.
The reason for this operation is to invalidate all replicas that could have potentially been overwritten between VTix and VT.
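A sketch of this invalidation rule (replica metadata as plain dictionaries is an assumption of this example; the inner `dominated` implements compare(VTix<VT)):

```python
def local_invalidate(replicas, me, vt):
    """replicas: object name -> {'state':..., 'owner':..., 'ts': [...]}."""
    def dominated(a, b):          # compare(a < b) on vector timestamps
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    for r in replicas.values():
        # invalidate replicas not owned by me whose timestamp VT dominates:
        # they could have been overwritten between their ts and VT
        if r["owner"] != me and dominated(r["ts"], vt):
            r["state"] = "INV"
    return replicas
```

Replicas owned by the invoking process and replicas with incomparable timestamps are left untouched, matching the rule above.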
Causal order
The local_invalidatej(VTjx) must be executed
when copying a checkpoint replica from C to
ROC.
Example
[Figure: timeline of Pi, Pj, Pl – writes to x and y with UPD, CKPT and CKPT_REQ(x) messages; replica states include x{WR/dirty,5,[2,4]}, x{RO/dirty,3,[1,3]} and x{RO,2,[1,1]}; CKPT(x,2,[1,1]) and CKPT(x,5,[2,4]) update the checkpoint replicas; after UPD(y,9,[1,3]), Pl reads rl(y)9]
Example
[Figure: a second timeline – CKPT(x,2,[1,1]) and CKPT(x,5,[3,3]) triggered by CKPT_REQ(x), plus CKPT(y,4,[1,3]) and CKPT(y,9,[1,3]); replica states include x{WR/dirty,4,[2,3]} and x{RO/dirty,3,[1,3]}; after UPD(y,9,[1,3]), Pl reads rl(y)9]
Recovery
At any time before any failure occurs, there are at least nrmin+1 replicas of x (the master replica plus nr ≥ nrmin C replicas). Thus, in case of a failure of f ≤ nrmin processes (at most f processes crash or become separated from the majority partition), there will be at least one non-faulty replica of x in the majority partition, which can serve further access requests to x.
Recovery
As long as the current owner is non-faulty and in the majority partition, the extended coherence protocol assures processing of all requests to x issued in the majority partition.
However, if the current owner becomes unavailable, the recovery procedure must elect a new owner from among all available processes in the partition – the one holding the most recent replica of x (timestamp comparison is necessary for RO/ROC replicas).
Recovery
If there are no RO/ROC replicas of x, any available C replica may be used to restore the value of x.
This is sufficient to continue the distributed processing in the majority partition.
Alternatively, all shared objects may be synchronously rolled back to their checkpoint values, and all RO/dirty and ROC/dirty replicas discarded.
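The owner-election rule might look like this sketch (illustrative only; the `available` list and its tuple layout are assumptions, and "most recent" is picked by timestamp comparison):

```python
def elect_owner(available):
    """available: list of (pid, state, ts) for surviving replicas of x."""
    # prefer RO/ROC (and dirty) replicas: they may be newer than checkpoints
    ro = [r for r in available if r[1].startswith("RO")]
    if ro:
        # most recent replica wins (timestamp comparison)
        return max(ro, key=lambda r: tuple(r[2]))[0]
    c = [r for r in available if r[1] == "C"]
    return c[0][0] if c else None   # any checkpoint replica will do
```

The `max` over tuples is a simplification: replicas of one object updated through the same master are totally ordered here, where the protocol compares vector timestamps.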
CS(x) vs. CCS(x)
boundary restriction on nr = |CCS(x)|: nrmin ≤ nr ≤ nrmax
f-resilience: nr ≥ f (e.g. nrmin = nrmax = 2)
prefetching:
if |CS(x)| ≥ nr then CCS(x) ⊆ CS(x) (reduction of update cost)
if |CS(x)| < nr then CCS(x) ⊄ CS(x)
Other protocols for ckpt
replication
1. Another protocol for causal consistency
2. Atomic consistency
3. Processor consistency
Coherence protocol
Basic protocol: John and Ahamad 1993
Coherence protocol
[Figure: automaton of the basic protocol – states WR (owner = i), RO (owner ≠ i), INV (owner ≠ i); ri(x) on INV receives x from the owner k and triggers local_invalidatei(VTk); wi(x) on RO/INV receives x and ownership from the owner k and triggers local_invalidatei(VTk); at the owner, rj(x) sends x to Pj and wj(x) sends x and ownership to Pj; local invalidation is performed by local_invalidatei(VT)]
Extended coherence protocol
Two new replica states: C (checkpoint) and ROC (read-only checkpoint).
A checkpoint replica C is updated on checkpoint operations. After that moment, it can be switched to ROC on the next local read access to x, triggering coherence operations (local_invalidate).
Until the next checkpoint, the ROC replica serves read accesses as RO replicas do.
Extended coherence protocol
[Figure: automaton of the extended protocol – states INV, RO, WR (owner = i), C (owner ≠ i), ROC (owner ≠ i); transitions as in the basic protocol, plus: a wj(x) or rj(x) from a process j ∉ CCS at the owner triggers Checkpoint(CCS) before x (and possibly ownership) is sent to Pj; a CHECKPOINT message atomically updates the C replicas; a local ri(x) on a C replica duplicates it to ROC and runs local_invalidatei(VTx); ownership transfer includes Change CCS and local_invalidatei(VTk)]
Extended coherence protocol
[Figure: Pi writes wi(x)1, wi(x)2; CKPT(x,2,VT) updates the C replicas x{C,2,VTx} at Pa and Pb (ACK); after further writes wi(x)3, wi(x)4 and wi(y)2, Pi fails at tf; Pj's rj(x) is then served from a checkpoint replica and returns rj(x)2]
Extended coherence protocol
[Figure: checkpoints CKPT(y,5,VT) and CKPT(x,2,VT) at Pa and Pb, with Pi writing wi(y)5, wi(y)6, wi(x)2; after Pi fails at tf, Pj reads rj(y)5 and rj(x)2 from checkpoint replicas; the invalidated replica y{INV,–,–} makes the read rj(y)5 problematic (rj(y)5 !)]
Atomic consistency
Execution of access operations is atomically consistent if there exists a total order → of the operations in Ĥ:
o1 →RT o2 ⇒ o1 → o2,
satisfying two conditions:
(monotonic reads) ∀ w(x)v, r(x)v ∈ Ĥ, ∀ o(x)u ∈ Ĥ :: u ≠ v ⇒ ¬( w(x)v → o(x)u ∧ o(x)u → r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ HW :: w1(x) → w2(x) ∨ w2(x) → w1(x)
Coherence protocol
Basic protocol: Li and Hudak 1989
Coherence protocol
[Figure: automaton of the basic write-invalidate protocol (Li and Hudak) – states WR (owner = i), RO (owner = i), RO (owner ≠ i), INV; a wi(x) by a non-owner receives x and ownership from the owner and runs Invalidate(CS); an INVALIDATE message is acknowledged with ACK to the owner; ri(x) on INV receives x from the owner; at the owner, rj(x) sends x to Pj and wj(x) sends x and ownership to Pj]
Delayed checkpointing
[Figure: Pi writes wi(x)1, wi(x)2 and the checkpoint ckpt(x)2 is taken only on Pj's first access to x; Pj's rj(x)2 and the dependent wj(y)4 are thus covered by the checkpoint when a failure occurs at tf]
Extensions for checkpointing
ROC (read-only checkpoint) – checkpoint replica
available for read access to object x. All checkpoint
replicas are in state ROC directly after the
checkpointing operation, until some process issues a
write request to x.
C (checkpoint) – checkpoint replica available only on
recovery to restore a consistent state of DSM
(invalidated ROC)
CCS(x) – checkpoint copyset
Extensions for checkpointing
[Figure: automaton of the extended write-invalidate protocol – on an external rj(x)/wj(x) to a WR replica the owner runs Checkpoint(CCS) before sending x (and possibly ownership) to Pj; invalidation covers CS ∪ CCS; CHECKPOINT messages atomically update the C/ROC replicas and are acknowledged to the owner (Send ACK to owner); ownership transfer includes Change CCS]
Checkpoint inconsistency problem
[Figure: y is checkpointed with value 5 (ckpt(y)5), then Pi writes wi(y)6 and wi(x)2 and checkpoints only x (ckpt(x)2); after Pi's failure at tf, Pj reads rj(x)2 – which causally depends on y=6 – yet the checkpoint still yields rj(y)5, a causal inconsistency (rj(y)5 !)]
Burst checkpointing
[Figure: checkpoint atomicity – the owner sends UPD(x,2) and UPD(y,5) followed by COMMIT(x,y), so both checkpoint values take effect atomically before Pj's rj(x)2]
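The UPD/COMMIT pattern in the figure can be sketched as follows (an illustrative toy: `CheckpointReplica`, `upd` and `commit` are invented names, not the protocol's API):

```python
class CheckpointReplica:
    """A C replica that applies a burst of new checkpoint values atomically."""
    def __init__(self):
        self.committed, self.pending = {}, {}

    def upd(self, name, value):
        self.pending[name] = value             # phase 1: stage UPD(name, value)

    def commit(self, names):
        for n in names:                        # phase 2: COMMIT(names) applies
            self.committed[n] = self.pending.pop(n)
```

Until the COMMIT arrives, recovery would use only the `committed` values, so the checkpoints of x and y can never be observed half-updated.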
Recovery
if the owner is not available (crashed or separated from the majority partition of system nodes), then:
if any process from CS(x) is available – choose it as the new owner, else
if any process from CCS(x) is available – choose it as the new owner
concurrent owners (in distinct partitions) – only the "majority" owner can serve write requests and make new checkpoints
requires failure detection and majority partition detection
Correctness of the extended protocol
1. safety – the protocol correctly maintains the coherence of shared data, according to the consistency model, despite failures of processes and communication links
2. liveness – each access operation issued to any shared data will eventually be performed (in a finite time), even in the presence of failures.
Correctness of the extended protocol
Operation dependency: For any two operations o1, o2 ∈ Ĥ(t), we say that o2 is dependent on o1, denoted o1 → o2, iff:
1) ∃ Pi ∈ Pcor(t) :: o1 →i o2
2) (write-before) ∃ x ∈ O, ∃ Pi, Pj ∈ P :: o1 = wi(x)v ∧ o2 = rj(x)v
3) (transitive closure) ∃ o ∈ Ĥ(t) :: o1 → o ∧ o → o2
Correctness of the extended protocol
Legal history: A set H(t) ⊆ Ĥ(t) is a legal history of access operations iff:
1) ∀ Pi ∈ Pcor(t) :: Hi(t) ⊆ H(t), and
2) ∀ o1 ∈ Ĥ(t), ∀ o2 ∈ H(t) :: ( o1 → o2 ⇒ o1 ∈ H(t) )
Correctness of the extended protocol
H = legal history of operations issued by the system
Hi = set of all access operations to shared objects issued by Pi
o1 →RT o2 = operation o1 finishes in real time before o2 starts
Execution of access operations is atomically consistent if there exists a total order → of the operations in H:
o1 →RT o2 ⇒ o1 → o2,
satisfying two conditions:
(legality) ∀ w(x)v, r(x)v ∈ H, ∀ o(x)u ∈ H :: u ≠ v ⇒ ¬( w(x)v → o(x)u ∧ o(x)u → r(x)v )
(exclusive writing) ∀ w1(x), w2(x) ∈ H :: w1(x) → w2(x) ∨ w2(x) → w1(x)
Correctness of the
extended protocol
Theorem 1 (safety): Every access to an object x
performed by a correct process is an atomically
consistent reliable operation.
Theorem 2 (liveness): The protocol eventually brings
a value of x to any correct process in the majority
partition requesting the access.
Brief summary of the
protocol
independent checkpointing of DSM objects
replicated checkpoints
resilience to multiple node failures and network
partitioning
independent recovery (only lost objects must be
recovered)
Processor consistency
Processor consistency: Execution of access operations is processor consistent if for each Pi there exists a serialization →i of the set Hi ∪ HW that satisfies:
(PRAM) ∀ o1, o2 ∈ Hi ∪ HW :: ( ∃ j ∈ 1..n : o1 →j o2 ) ⇒ o1 →i o2
(cache) ∀ x ∈ O, ∀ w1, w2 ∈ HW ∩ H|x :: ( ∃ i ∈ 1..n : w1 →i w2 ) ⇒ ( ∀ i ∈ 1..n : w1 →i w2 )
Coherence protocol
Brand new coherence protocol:
Brzeziński and Szychowiak 2003
home-based (home-node=static owner, no object
manager necessary) – ensures cache consistency
local invalidation – ensures PRAM consistency
proven correct
Coherence protocol
[Figure: example run – P2 = owner(x,y) with clock VT2; P1 requests with REQ(y.id,VT1[2]=0) and REQ(x.id,VT1[2]=2) and receives R_UPD(y,2,{x.id}) and R_UPD(x,3,∅); P3 writes w3(y)1, propagated by W_UPD(y,1,∅); modified_since2(0)={x.id,y.id}; replica metadata shown as x.value, x.owner, x.ts=VT2[2]; a stale replica at P1 is set to x.state=INV]
Coherence protocol
[Figure: replica-state automaton – states WR (owner = i), RO (owner ≠ i), INV (owner ≠ i); ri(x) and wi(x) on a WR replica are local; ri(x) on INV receives x from the owner k and triggers local_invalidatei(...); wi(x) on RO/INV receives an ACK from the owner k and triggers local_invalidatei({...x...}); at the owner, rj(x) sends x to Pj and wj(x) triggers local_invalidatei(...) and an ACK to Pj]
Conclusions
1.
introduction of a new concept: replication of ckpts
strict integration with coherence protocol (may lead to
overall cost reduction)
fast recovery (coherence protocol ensures consistency
of global ckpt)
network partitioning tolerated
Conclusions
2. design of coherence protocols extended with ckpt replication for high reliability: atomic, causal, processor
3. correctness proofs: consistency models redefinition for an unreliable environment; safety (consistency of ckpts); liveness (availability of ckpts)
Further work
benchmark evaluation (SPLASH-2)
reliability of the internal objects of the DSM coherence protocol: object managers
failure detection