An Information-Theoretic Perspective of Consistent Distributed Storage

Codes for Distributed Computing
ISIT 2017 Tutorial
Viveck R. Cadambe (Pennsylvania State University)
Pulkit Grover (Carnegie Mellon University)
Motivation
• Worldwide data storage: 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020
• Moore's law is saturating; improvements in energy/speed are not "easy"
• Consequence: massive parallelization and distributed computing
• Frameworks: Apache Hadoop, Apache Spark, GraphLab, MapReduce
Challenges
• Computing components can straggle, i.e., they can be slow
  – "MapReduce: Simplified data processing on large clusters", Dean-Ghemawat 08
  – "The tail at scale", Dean-Barroso 13
• Components can be erroneous or fail
  – "10 challenges towards exascale computing": DOE ASCAC report 2014; Fault Tolerance Techniques for High Performance Computing, Herault-Robert
  – "... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15." [Brewer et al., Google white paper, '16]
• Data transfer cost/time is an important component of computing performance
Redundancy in data storage and computation can significantly improve performance!
Theme of tutorial
Role of coding and information theory in contemporary distributed computing systems
• Models inspired by practical distributed computing systems
• Fundamental limits on the trade-off between redundancy and performance
• New coding abstractions and constructions
Complements a rich literature on codes for network function computing
• An admittedly incomplete list: [Korner-Marton 79, Orlitsky-Roche 95, Giridhar-Kumar 05, Doshi-Shah-Medard-Jaggi 07, Dimakis-Kar-Moura-Rabbat-Scaglione 10, Duchi-Agarwal-Wainwright 11, Ma-Ishwar 11, Nazer-Gastpar 11, Appuswamy-Franceschetti-Karamchandani-Zeger 13, Ramamoorthy-Langberg 13]
• Communication and information complexity literature
Outline
• Information theory and codes for shared memory emulation
• Codes for distributed linear data processing
Table of Contents
• Main theme and goal
– Distributed algorithms
– Shared memory emulation problem.
– Applications to key-value stores
• Shared Memory Emulation in Distributed Computing
– The concept of consistency
– Overview of replication-based shared memory emulation algorithms
– Challenges of erasure coding based shared memory emulation
• Information-Theoretic Framework
– Toy Model for distributed algorithms
– Multi-version Coding
– Closing the loop: connection to shared memory emulation
• Directions of Future Research
Distributed Algorithms
• Algorithms for distributed networks and multi-processors
• ~50 years of research
• Applications: cloud computing, networking, multiprocessor programming, Internet of Things¹
Theme of this part of the tutorial: a marriage of ideas between information theory and distributed algorithms
¹Typical publication venues: PODC, DISC, OPODIS, SODA, FOCS, USENIX FAST, USENIX ATC
Distributed Algorithms: Central assumptions and requirements
• Modeling assumptions
  – Unreliability
  – Asynchrony
  – Decentralized nature
• Requirements
  – Fault tolerance
  – Consistency: the service provided by the system "looks as if centralized" despite the unreliability/asynchrony/decentralized nature of the system
• Consequences
  – Simple-looking tasks can be difficult, or sometimes even impossible, to achieve
  – Careful algorithm design, non-trivial correctness proofs
Shared memory emulation
[Figure: a distributed system emulating a read-write memory]
• Classical problem in distributed computing
• Goal: implement a read-write memory over a distributed system
• Supports two operations (see the interface sketch below):
  – Read(variablename)            % also called a "get" operation
  – Write(variablename, value)    % also called a "put" operation
• For simplicity, we focus on a single variable and omit variablename.
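As a concrete illustration of this interface, here is a minimal sketch in Python; the class and method names are ours and purely illustrative, standing in for the get/put API that a key-value store or shared-memory emulation exposes. The "memory" is just a local field, so only the interface, not the distributed implementation, is shown.

class SharedRegister:
    """Illustrative single-variable read/write register interface.
    A real emulation implements these calls with client/server protocols
    over an unreliable, asynchronous distributed system."""

    def __init__(self, initial_value=None):
        self._value = initial_value

    def write(self, value):
        """Also called 'put': store a new value for the variable."""
        self._value = value

    def read(self):
        """Also called 'get': return the current value of the variable."""
        return self._value

# The application sees an ordinary read/write variable:
reg = SharedRegister()
reg.write(3.2)
assert reg.read() == 3.2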
Shared memory emulation: Application to cloud computing
• Theoretical underpinnings of commercial and open-source key-value stores: Amazon DynamoDB, CouchDB, Apache Cassandra, Voldemort
• Applications: transactions, reservation systems, multi-player gaming, social networks, news feeds, distributed computing tasks, etc.
• Design requires sophisticated distributed computing theory combined with great engineering
• Active research field in the systems and theory communities
Engineering challenges in key-value store implementation
• Failure tolerance, fast reads, fast writes
• Asynchrony: weak (or no) timing assumptions
  – How do you distinguish a failed node from a very slow node?
• Distributed nature: nodes are unaware of the state of other nodes
  – How do you ensure that all copies of the data have received a write/update?
• Requirements: present a consistent view of the data; allow concurrent access to clients (no locks)
• Solutions exist to the above challenges. However…
Goal of this part of the tutorial
• Analytical understanding of performance (memory overhead, latency) is limited
• Replication is used for fault tolerance and availability in practice today*
• Minimizing memory overhead is important for several reasons:
  – Data volume is increasing exponentially
  – Storing more data in high-speed memory reduces latency
• This tutorial discusses:
  – Considerations for the use of erasure codes in such systems
  – An information-theoretic framework
Shared Memory Emulation in Distributed Computing
• The concept of consistency
  – Atomic consistency (or simply, atomicity)
  – Other notions of consistency
• Atomic shared memory emulation
  – Problem formulation (informal)
  – Replication-based algorithm of [Attiya-Bar-Noy-Dolev 95]
• Erasure coding based algorithms – main challenges
Read-Write Memory
[Figure: Write(v) and Read() operations shown as instantaneous points on a time axis]
• Reality check: operations over distributed asynchronous systems cannot be modeled as instantaneous.
• Solution: the concept of atomicity!
Thought experiment: Motivation for atomicity
Two concurrent processes:
{
  x = 2.1
  tic;
  x = 3.2   % write to shared variable x
  toc;
}
{
  Read(x)
}
Time of the write operation: 120 ms
Question: When is the new value (3.2) of the shared variable x available to a possibly concurrent read operation? After 10 ms? 20 ms? 120 ms? 121 ms?
(A runnable version of this experiment is sketched below.)
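For concreteness, here is a small Python sketch of the same experiment with two threads sharing a variable; the 120 ms figure and the variable names come from the slide, everything else is illustrative. On a single machine the assignment is effectively instantaneous, so reads before it see 2.1 and later ones see 3.2; over a distributed emulation the write is in progress for ~120 ms, and atomicity is what pins down which return values are acceptable for a concurrent read.

import threading, time

x = 2.1  # shared variable

def writer():
    global x
    time.sleep(0.120)   # the write operation's effect becomes visible after ~120 ms
    x = 3.2

def reader(delay_s, results):
    time.sleep(delay_s)
    results[delay_s] = x   # concurrent read: which value does it see?

results = {}
threads = [threading.Thread(target=writer)]
threads += [threading.Thread(target=reader, args=(d, results)) for d in (0.010, 0.020, 0.121)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # typically: reads at 10/20 ms see 2.1, the read at 121 ms sees 3.2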
Atomic consistency for shared memory emulation
aka linearizability [Herlihy-Wing 90], [Lamport 86]
[Figure sequence: executions with Write(v) and Read() shown as intervals on a time axis; atomicity requires that each operation appears to take effect instantaneously at some point within its invocation-response interval.]
Examples of non-atomic executions
[Figures: example executions with Write(v) and Read() intervals on a time axis that violate atomicity.]
Importance of Consistency
• Modular algorithm design
  – Design an application (e.g., bank accounts, reservation systems) over an "instantaneous" memory
  – Then use an atomic distributed memory in its place
  – Program executions are indistinguishable
• Weaker consistency models are also useful
  – Social networks and news feeds use weaker consistency definitions, trading off errors for performance
• In this talk, we will focus on atomic consistency.
Distributed System Model
[Figure: write clients, read clients, and servers connected by point-to-point links]
• Client-server architecture; nodes can fail, i.e., stop responding (the number of server failures is limited)
• Point-to-point reliable links (arbitrary delay)
• Nodes are unaware of the current state of any other node
The shared memory emulation problem
[Figure: design the write protocol at the write clients, the server protocol at the servers, and the read protocol at the read clients]
Design write, read, and server protocols to ensure:
• Atomicity
• Liveness: concurrent operations complete without waiting (no blocking)
The ABD algorithm (sketch)
Idea: write to and read from a majority of the server nodes.
• Any pair of write and read operations intersects in at least one server.
• The algorithm works as long as only a minority of the server nodes fail.
Write protocol:
  Acquire the latest tag via a query; send the tagged value to every server; return after receiving acks from a majority.
Read protocol:
  Send a read query; wait for responses from a majority; send the latest (tag, value) back to the servers; return the latest value after receiving acks from a majority.
Server protocol:
  Respond to a query with the stored tag; store the latest value received; send an ack; respond to a read request with the stored (tag, value).
(A code sketch of the protocol follows.)
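The following is a minimal single-process sketch of the ABD logic in Python, assuming synchronous in-memory calls in place of asynchronous messages and a randomly sampled majority in place of "wait for the first majority to respond"; all names are illustrative. It captures the tag/majority logic and the read write-back, not failure handling or the message-passing machinery.

import random

class Server:
    def __init__(self):
        self.tag, self.value = (0, 0), None   # tag = (sequence number, writer id)

    def query(self):                          # respond to a query with the stored tag
        return self.tag

    def store(self, tag, value):              # store the latest value; send an ack
        if tag > self.tag:
            self.tag, self.value = tag, value
        return "ack"

    def read(self):                           # respond to a read request
        return self.tag, self.value

def majority(servers):
    # Stand-in for "wait for responses from a majority": sample one majority set.
    return random.sample(servers, len(servers) // 2 + 1)

def abd_write(servers, value, writer_id):
    # Phase 1: acquire the latest tag from a majority.
    seq = max(s.query()[0] for s in majority(servers))
    tag = (seq + 1, writer_id)
    # Phase 2: send the tagged value; return after acks from a majority.
    for s in majority(servers):
        s.store(tag, value)

def abd_read(servers):
    # Phase 1: query a majority and pick the latest (tag, value).
    tag, value = max((s.read() for s in majority(servers)), key=lambda tv: tv[0])
    # Phase 2 (write-back): propagate it to a majority before returning.
    for s in majority(servers):
        s.store(tag, value)
    return value

servers = [Server() for _ in range(5)]
abd_write(servers, "v1", writer_id=1)
print(abd_read(servers))   # -> "v1"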
The ABD algorithm is atomic – proof idea
The algorithm is atomic if, after any operation P terminates,
(i) every future operation acquires a tag at least as large as the tag of P,
(ii) every future write operation acquires a tag strictly larger than the tag of P, and
(iii) a read with tag t returns the value of the corresponding write with tag t.
[Paraphrasing of a lemma from Lynch 96]
[Figure: after P, a later write or read acquires a tag at least as large as the tag that P propagated]
Why should read operations write back the value? (See the appendix.)
The ABD algorithm – summary
• An atomic read-write memory can be implemented over a distributed asynchronous system
  – All operations terminate as long as only a minority of the servers fail
• Design principles of several modern key-value stores mirror shared memory emulation algorithms
  – See the description of Amazon's Dynamo key-value store [DeCandia et al. 07]
• Replication is used for fault tolerance
Erasure Coding
Example: a 2-parity Reed-Solomon code with 4 data symbols and 2 parity symbols (smaller packets, smaller overheads)
• The value is recoverable from any 4 codeword symbols
• The size of each codeword symbol is 1/4 the size of the value
• New constraint in our setting: we need 4 symbols with the same timestamp
(A toy coding sketch follows.)
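As a concrete (toy) stand-in for the Reed-Solomon example, the sketch below builds a (6, 4) MDS code by evaluating a degree-3 polynomial over the prime field GF(257): the value is split into 4 symbols, 6 evaluations are stored, and any 4 of them recover the value by Lagrange interpolation. A production system would use a proper RS library over GF(2^8); this is only to make the "any 4 of 6 symbols" property tangible.

P = 257  # prime field size; each data symbol is an integer in [0, 257)

def encode(data, n=6):
    """Encode k = len(data) symbols into n evaluations of the
    degree-(k-1) polynomial whose coefficients are `data`."""
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(data)) % P)
            for x in range(1, n + 1)]

def decode(points, k=4):
    """Recover the k data symbols from any k (x, y) evaluations
    via Lagrange interpolation over GF(P)."""
    xs, ys = zip(*points[:k])
    coeffs = [0] * k
    for j in range(k):
        basis, denom = [1], 1                 # numerator polynomial and denominator of L_j
        for m in range(k):
            if m == j:
                continue
            # multiply basis by (x - xs[m])
            basis = [((basis[i - 1] if i > 0 else 0)
                      - xs[m] * (basis[i] if i < len(basis) else 0)) % P
                     for i in range(len(basis) + 1)]
            denom = denom * (xs[j] - xs[m]) % P
        inv = pow(denom, P - 2, P)            # modular inverse via Fermat's little theorem
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + ys[j] * b * inv) % P
    return coeffs

value = [10, 20, 30, 40]              # the value, split into 4 symbols
codeword = encode(value)              # 6 codeword symbols, each 1/4 the size of the value
assert decode(codeword[2:]) == value  # any 4 symbols suffice (here, symbols 3..6)

GF(257) keeps the arithmetic simple for byte-sized symbols; a real deployment would use GF(2^8) so that symbols are exactly bytes.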
Set-up for a hypothetical erasure coding based algorithm
• Write to and read from any five nodes; any two such sets intersect in at least 4 nodes
• Operations complete if at most one node has failed
• More generally, in a system with N nodes and a code of dimension k, write to and read from any ⌈(N+k)/2⌉ nodes; any two such quorums intersect in at least k nodes
(A quorum-intersection check is sketched after this list.)
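A quick sanity check of the quorum arithmetic, sketched below: for given N and code dimension k, any two server sets of size ⌈(N+k)/2⌉ intersect in at least k servers, so a reader always finds k symbols of a write that completed at a quorum. The brute-force enumeration is only for illustration.

from itertools import combinations
from math import ceil

def min_quorum_intersection(N, k):
    """Smallest pairwise intersection over all pairs of quorums of size ceil((N+k)/2)."""
    q = ceil((N + k) / 2)
    quorums = [set(c) for c in combinations(range(N), q)]
    return min(len(a & b) for a in quorums for b in quorums)

# N = 6 servers, dimension k = 4 code: quorums of size 5 intersect in >= 4 servers.
print(min_quorum_intersection(6, 4))  # -> 4
assert all(min_quorum_intersection(N, k) >= k
           for N in range(3, 8) for k in range(1, N))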
Hypothetical erasure coding based algorithm – challenges
[Figure: read queries arrive while the servers store codeword symbols of multiple versions]
• Servers store multiple versions
• First challenge: reveal symbols to readers only when enough symbols have been propagated
• Second challenge: discard old versions safely
Crude, one-sentence summary
The challenges can be solved through careful algorithm design, with storage cost savings whenever the extent of concurrency is small. (A sample algorithm is given in the appendix.)
Different approaches to solving the challenges
[Ganger et al. 04] the HGR algorithm, [Dutta et al. 08] the ORCAS and ORCAS-B algorithms, [Dobre et al. 14] the M-PoWerStore algorithm, [Androulaki et al. 14] the AWE algorithm, [Cadambe et al. 14] the CASGC algorithm, [Konwar et al. 16] the SODA algorithm
• Storage cost grows with the number of concurrent write operations.
Noteworthy recent developments:
• Coding-based consistent store implementations [Zhang et al. FAST 16], [Chen et al. USENIX ATC 17]
• Erasure coding based algorithms for consistency issues in edge computing [Konwar et al. PODC 2017]
Questions addressed next:
• What is the right information-theoretic abstraction of this system?
• Does the storage cost necessarily grow with concurrency?
• Can clever coding-theoretic ideas improve the storage cost?
Break
Information-theoretic abstraction of consistent distributed storage
Shared memory emulation → toy model → multi-version (MVC) codes
[Wang-Cadambe, accepted to IEEE Trans. IT, 2017]
Toy model for packet arrivals and links
[Figure: a write client sends versions to N servers (f failures possible); read clients (decoders) connect to the servers]
• Arrival at the write client: a new version in every time slot, sent immediately to the servers
• Channel from the write client to a server: the delay is an integer in [0, T-1]
• Channel from a server to a read client: instantaneous (no delay)
• Goal: a decoder invoked at time t gets the latest common version among the c servers it connects to
Toy model – decoding requirement
• A version is complete at time t if it has arrived at N-f servers.
• Decoding requirement for a decoder invoked at time t: from every set of N-f servers, the latest complete version, or a later version, must be decodable.
  – Mirrors erasure coding based shared memory emulation protocols.
• We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c servers.
  – The two decoding requirements have the same worst-case storage cost if c = N - 2f.
[Figure: snapshot at time t; the decoder connects to any c servers and decodes one of the versions they hold in common, or a later one]
Goal: construct a storage method that minimizes the storage cost. (A small simulation of the requirement follows.)
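To make the requirement concrete, here is a small sketch (parameters and names are illustrative) that simulates the toy arrival model and computes, for each set of c servers, the latest version they all hold at time t; any valid storage scheme must let the decoder recover that version, or a later one, from those c servers' contents.

import random
from itertools import combinations

def simulate_state(N, T, t, seed=0):
    """Versions 1..t arrive at the write client; version j reaches server i
    at time j + delay, with delay uniform in [0, T-1]. Return, per server,
    the set of versions received by time t."""
    rng = random.Random(seed)
    state = [set() for _ in range(N)]
    for version in range(1, t + 1):
        for server in range(N):
            if version + rng.randrange(T) <= t:
                state[server].add(version)
    return state

def required_version(state, c):
    """For every c-subset of servers, the latest common version
    (the decoder must return it or a later version); 0 means none."""
    return {subset: max(set.intersection(*(state[i] for i in subset)) | {0})
            for subset in combinations(range(len(state)), c)}

N, T, c, t = 5, 3, 3, 6
state = simulate_state(N, T, t)
for subset, v in required_version(state, c).items():
    print(subset, "-> must decode version", v, "or later")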
The MVC problem
• n servers
• T versions
• c connectivity
• Goal: decode the latest common version, or a later version, from every set of c servers
• Minimize the storage cost
  – worst case across all "states"
  – and across all servers
Solution 1: Replication
[Figure: replication example with N = 4, T = 2, c = 2]
Storage size (per server) = size of one version
Solution 2: (N, c) Maximum Distance Separable (MDS) code
• Question: can we store only a codeword symbol corresponding to the latest version? (No: a decoder needs c symbols of the same version, and different servers may have different latest versions.)
• Instead: separate coding across versions; each server stores a symbol of every version it has received.
[Figure: N = 4, T = 2, c = 2; each stored symbol has size 1/2 of a version]
Storage size = (T/c) × size of one version
(A cost comparison is sketched below.)
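The sketch below just tabulates the two baseline worst-case per-server storage costs implied by the slides (replication stores one full version; naive MDS stores a 1/c-size symbol for each of up to T versions), normalized by the size of one version. It is bookkeeping, not a new scheme.

def replication_cost(T, c):
    # One full copy of the latest version per server.
    return 1.0

def naive_mds_cost(T, c):
    # Separate (N, c) MDS code per version: up to T symbols of size 1/c each.
    return T / c

c = 2
for T in range(1, 9):
    print(f"T={T}: replication={replication_cost(T, c):.2f}, "
          f"naive MDS={naive_mds_cost(T, c):.2f}")
# Naive MDS beats replication only while T < c; the MVC constructions and
# converse characterize what is achievable in between.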
Summary of results
[Figure: worst-case storage cost, normalized by the size of one version, versus the number of versions T, comparing naive MDS codes, replication, our achievable scheme, our converse, and the Singleton bound. A companion table lists the normalized costs of replication (= 1), naive MDS codes, our constructions*, and the lower bound.]
Storage cost inevitably increases as the degree of asynchrony grows.
*Achievability can be improved; see [Wang-Cadambe, accepted to Trans. IT, 17].
Main insights and techniques
• Redundancy is required to ensure consistency in an asynchronous environment
  – The amount of redundancy grows with the degree of asynchrony T
• Connected to pliable index coding [Brahma-Fragouli 12]
  – Exercises in network information theory can be converted to exercises in combinatorics
• Achievability: a separate linear code for each version
  – Carefully choose the "budget" for each version based on the set of received versions
• Converse: genie-based, discover "worst-case" arrival patterns
  – A more challenging combinatorial puzzle for T > 2
Achievability
[The construction was given via figures and equations on the slides: a separate linear code for each version, with per-version storage budgets chosen according to the set of received versions, so that in every state the symbols of at least one partition suffice for decoding.] Therefore, the version corresponding to at least one partition is decodable. (A toy decodability check follows.)
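As a hedged illustration of the "separate code per version with budgets" idea (our own sketch, not the exact construction from [Wang-Cadambe]), the code below checks which versions are decodable when each server stores, for each version it received, an MDS-coded chunk of a given size: with MDS codes, a version is decodable from c servers exactly when the sizes of its chunks at those servers add up to at least one full version.

def decodable_versions(budgets, servers, tol=1e-9):
    """budgets[i][v] = fraction of version v stored (MDS-coded) at server i.
    A version is decodable from `servers` iff its stored fractions there sum
    to at least 1 (the MDS property makes any such collection sufficient)."""
    versions = set().union(*(budgets[i].keys() for i in servers))
    return sorted(v for v in versions
                  if sum(budgets[i].get(v, 0.0) for i in servers) >= 1 - tol)

# Illustrative state with N = 4 servers, T = 2 versions, c = 2 connectivity:
# servers 0 and 1 received both versions, servers 2 and 3 only version 1.
budgets = [
    {1: 0.5, 2: 0.5},
    {1: 0.5, 2: 0.5},
    {1: 0.5},
    {1: 0.5},
]
print(decodable_versions(budgets, servers=(0, 1)))  # -> [1, 2]
print(decodable_versions(budgets, servers=(1, 2)))  # -> [1]  (the latest common version)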
Converse sketch (two versions)
Start with c servers; propagate version 2 to a minimal set of servers such that it becomes decodable.
• State vector s1 = (1, 1, …, 1): version 1 is decodable.
• State vector s2 = (2, 2, …, 2): version 2 is decodable.
• State vector s3 = (2, 2, …, 2, 1, 1, …, 1): a minimal state vector such that version 2 is decodable.
• State vector s4 = (2, 2, …, 1, 1, 1, …, 1): a maximal state vector such that version 1 is decodable.
Versions 1 and 2 are then both decodable from c+1 symbols: the c symbols in s3 plus the one changed symbol in s4. Since these c+1 stored symbols determine two independent versions, their total size is at least twice the size of a value, which lower-bounds the worst-case per-server storage cost by 2/(c+1).
Recall the shared memory emulation model
[Figure: write clients, read clients (decoders), N servers; f failures possible]
• Arrival at the clients: arbitrary
• Channel from the clients to the servers: arbitrary (unbounded) delay, reliable
• Clients and servers are modeled as I/O automata; the protocols can be designed.
Shared memory emulation model: storage lower bounds [C-Lynch-Wang, ACM PODC 2016]
• Arbitrary arrival times; arbitrary delays between the encoders, servers, and decoders
• Clients and servers are modeled as I/O automata; the protocols can be designed
• A per-server storage cost lower bound that generalizes the Singleton bound
  – Non-trivial due to the interactive nature of the protocols
• A per-server storage cost lower bound that generalizes the MVC converse for T = 2
• The generalization of the MVC converse for T > 2 works for non-interactive protocols, where T is the number of concurrent write operations
  – Open question: generalize the MVC bounds for T > 2 to the full-fledged distributed systems theoretic model
[The per-server storage cost expressions shown on the slides are not reproduced here.]
Storage cost bounds for shared memory emulation
[Figure: storage cost versus the number of concurrent writes, comparing the ABD algorithm, erasure coding based algorithms, the baseline lower bound, the first lower bound, and the second lower bound*.]
Directions of Future Research
Classical codes for distributed storage (an optimistic model):
• The system is synchronous
• Nodes have instantaneous system state information
• Nodes have global system state information
Multi-version codes (a pessimistic/conservative model):
• The system is totally asynchronous; every version arrival pattern is possible
• Nodes do not even have stale information of the system state
• Nodes do not even have partial information of the system state
Practice lies in between these two extremes.
Directions of future research: beyond the worst-case model
• Correlated versions
Codes with correlated versions [Ali-C, ITW 2016]
[Figures: N = 4, T = 2, c = 2; successive versions are correlated (a Markov-chain model with a uniform update). "Closeness" in the messages can be translated into "closeness" in the codewords via delta coding, and applying Slepian-Wolf ideas reduces the storage cost relative to treating the versions as independent; the exact cost expressions were given on the slides.]
• Open: information-theoretically optimal schemes for all regimes
• Open: practical code constructions
(A toy delta-coding sketch follows.)
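As a toy illustration of the delta-coding idea (our own sketch, not the scheme from [Ali-C, ITW 2016]): when version 2 differs from version 1 only in a small update, a server that already holds version 1 can store just the compact difference instead of a full-size copy of version 2.

def delta(old: bytes, new: bytes) -> list:
    """Sparse difference between two equal-length versions:
    a list of (index, new_byte) entries where they differ."""
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_delta(old: bytes, d: list) -> bytes:
    """Reconstruct the new version from the old version and the delta."""
    out = bytearray(old)
    for i, b in d:
        out[i] = b
    return bytes(out)

v1 = bytes(1000)                      # version 1: 1000 bytes
v2 = bytearray(v1); v2[10] = 7        # version 2: a single-byte update
v2 = bytes(v2)

d = delta(v1, v2)
assert apply_delta(v1, d) == v2
print(len(v2), "bytes for a full copy vs", len(d), "changed byte(s) in the delta")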
Directions of future research
Beyond the worst-case model
• Correlated versions
• (Limited) server co-operation: exchange possibly partial state information
• Average-case storage cost: good storage cost in typical states, possibly with a larger worst-case storage cost
Beyond the toy model
• Explore the relation between timing assumptions, system response, decoding requirement, and storage cost
• Open question: can interactive protocols help, or does the storage cost necessarily grow with the degree of concurrency?
Beyond read-write memory (next)
Beyond the toy model
• The toy model serves as a bridge between shared memory emulation and our network information theoretic formulation
• Future work: a more realistic model that exposes the connections between
  – channel delay uncertainty
  – staleness of reads
  – storage cost
• Open: can an interactive write protocol improve the storage cost?
Beyond read-write memory
• Several systems implement more complicated data structures over distributed asynchronous systems
  – Transactions: multiple read-write objects, more complicated consistency requirements
  – Graph-based data structures
• Question: how do you "erasure code" more complicated data structures and state machines?
  – Initial clues are provided in [Balasubramanian-Garg 14]
Thanks
References
[Lynch 96] N. A. Lynch, Distributed Algorithms. USA: Morgan Kaufmann Publishers Inc., 1996.
[Lamport 86] L. Lamport, "On interprocess communication. Part I: Basic formalism," Distributed Computing, 2(1), pp. 77–85, 1986.
[Vogels 09] W. Vogels, "Eventually consistent," Queue, vol. 6, no. 6, pp. 14–19, 2008.
[Attiya-Bar-Noy-Dolev 95] H. Attiya, A. Bar-Noy, and D. Dolev, "Sharing memory robustly in message-passing systems," J. ACM, vol. 42, no. 1, pp. 124–142, Jan. 1995.
[Decandia et al. 07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, vol. 7, 2007, pp. 205–220.
[Hewitt 10] E. Hewitt, Cassandra: The Definitive Guide. O'Reilly Media, Inc., 2010.
[Hendricks et al. 07] J. Hendricks, G. R. Ganger, and M. K. Reiter, "Low-overhead Byzantine fault-tolerant storage," ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 73–86, 2007.
[Dutta et al. 08] P. Dutta, R. Guerraoui, and R. R. Levy, "Optimistic erasure-coded distributed storage," in Distributed Computing. Springer, 2008, pp. 182–196.
[Cadambe et al. 14] V. R. Cadambe, N. Lynch, M. Medard, and P. Musial, "A coded shared atomic memory algorithm for message passing architectures," in 2014 IEEE 13th International Symposium on Network Computing and Applications (NCA). IEEE, 2014, pp. 253–260.
[Dobre et al. 13] D. Dobre, G. Karame, W. Li, M. Majuntke, N. Suri, and M. Vukolić, "PoWerStore: Proofs of writing for efficient and robust storage," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 285–298.
[Konwar et al. PODC 2017] K. M. Konwar, N. Prakash, N. A. Lynch, and M. Médard, "A layered architecture for erasure-coded consistent distributed storage," CoRR, vol. abs/1703.01286, 2017; accepted to the 2017 ACM Symposium on Principles of Distributed Computing, available at http://arxiv.org/abs/1703.01286.
[Zhang et al. FAST 16] H. Zhang, M. Dong, and H. Chen, "Efficient and available in-memory KV-store with hybrid erasure coding and replication," in FAST, 2016, pp. 167–180.
[Chen et al. 2017] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, "Giza: Erasure coding objects across global data centers," in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017.
[Herlihy Wing 90] M. P. Herlihy and J. M. Wing, "Linearizability: A correctness condition for concurrent objects," ACM Trans. Program. Lang. Syst., vol. 12, pp. 463–492, July 1990.
[Cadambe-Lynch-Wang ACM PODC 16] V. R. Cadambe, Z. Wang, and N. Lynch, "Information-theoretic lower bounds on the storage cost of shared memory emulation," in Proceedings of the ACM Symposium on Principles of Distributed Computing, ser. PODC '16. ACM, 2016, pp. 305–314.
[Wang Cadambe Trans. IT 17] Z. Wang and V. Cadambe, "Multi-version coding – an information-theoretic perspective of consistent distributed storage," to appear in IEEE Transactions on Information Theory.
[Brahma Fragouli 2012] S. Brahma and C. Fragouli, "Pliable index coding," in 2012 IEEE International Symposium on Information Theory Proceedings (ISIT). IEEE, 2012, pp. 2251–2255.
[Ali-Cadambe ITW 16] R. E. Ali and V. R. Cadambe, "Consistent distributed storage of correlated data updates via multi-version coding," in Information Theory Workshop (ITW), 2016 IEEE. IEEE, 2016, pp. 176–180.
[Balasubramanian-Garg 14] B. Balasubramanian and V. K. Garg, "Fault tolerance in distributed systems using fused state machines," Distributed Computing, vol. 27, no. 4, pp. 287–311, 2014.
Appendix: Binary consensus – a simple-looking task that is impossible to achieve in a decentralized asynchronous system
Fischer-Lynch-Paterson (FLP) impossibility result (informal)
• A famous impossibility result
• Two processors P1, P2; each processor begins with an initial value in {0, 1}
• They can communicate messages over a reliable link, but with arbitrary (unbounded) delay
• Goal: design a protocol such that
  (a) both processors agree on the same value, which is an initial value of some processor, and
  (b) each non-failed process eventually decides.
Appendix: Why does a read operation need to write back in the ABD algorithm?
What happens if a read does not write back? The following execution becomes possible.
[Figure: a timeline with writes W1 and W2 and reads R1, R2, R3]
An example of a wrong execution:
• Suppose we have 6 servers.
• Write W1's value reached all 6 servers before W2 started.
• The write with value v(2) sent the value only to server 1 by time t2, before R2 started.
• Read R2 got responses from servers 1, 2, 3, 4 and therefore returned v(2).
• Server 1 failed after R2 completed, but before R1 started.
• Read R3 then started and read v(1) (it cannot read v(2)!); a later read returning an older value than an earlier read violates atomicity.
• Finally, after R3 completes, v(2) reaches the remaining non-failed servers.
Appendix: An Erasure Coding Based Algorithm
Algorithm from [Cadambe-Lynch-Medard-Musial, IEEE NCA 2014]; extended version in Distributed Computing (Springer), 2016.
Coded Atomic Storage (CAS) – roadmap
• CAS: solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
• A failed attempt at garbage collection: attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
• CASGC (GC = garbage collection): solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost; good handling of client failures.
Coded Atomic Storage (CAS)
Solves the challenge of only revealing completed writes to readers.
• N servers, at most f failures.
• Use an (N, k) MDS code whose dimension k is no bigger than N − 2f (equivalently, f is no bigger than (N − k)/2).
• Every set of at least ⌈(N + k)/2⌉ server nodes is referred to as a "quorum set". Note that any two quorum sets intersect in at least k nodes (a small numeric check of this quorum arithmetic appears after the figure below).
• Additional "fin" label at servers: indicates that the corresponding version has been propagated to sufficiently many servers.
• Additional write phase: tells the servers that the coded elements have been propagated to a quorum. Servers store the entire history of versions.
[Figure: read clients, write clients, and servers; the "fin" label on a version stored at a server indicates that the version has been propagated to a quorum.]
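To make the quorum arithmetic concrete, here is a small, self-contained Python check (our own illustration, not part of the original algorithm): it picks N = 5 and f = 1, sets k = N − 2f and the quorum size to ⌈(N + k)/2⌉, and verifies by brute force that any two quorums overlap in at least k servers, so a reader can always recover k coded elements of a finalized version.

# Numeric check of the quorum intersection property assumed by CAS.
from itertools import combinations
from math import ceil

N, f = 5, 1                        # illustrative parameters
k = N - 2 * f                      # MDS code dimension, k = 3 here
q = ceil((N + k) / 2)              # quorum size, 4 here

assert q <= N - f                  # a quorum stays reachable despite f failures
worst_overlap = min(
    len(set(A) & set(B))
    for A in combinations(range(N), q)
    for B in combinations(range(N), q)
)
print(k, q, worst_overlap)         # overlap is always >= k (here: 3 4 3)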
CAS – Protocol overview
Write:
Acquire the latest tag by querying a quorum; send the (incremented) tag and a coded element to every server;
Send a finalize message after getting acks from a quorum;
Return after receiving acks from a quorum.
Read:
Send a read query; wait for tags from a quorum;
Send a read request with the latest tag to the servers;
Decode the value after receiving coded elements from a quorum.
Servers:
Store the coded element on a pre-write; send ack.
Set the fin flag for the tag on receiving a finalize message; send ack.
Respond to a query with the latest tag labeled fin.
Label the requested tag as fin; respond to a read request with the coded element if available.
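The protocol lends itself to a compact, failure-free simulation. The sketch below is illustrative only: it runs the server handlers and the two client procedures in a single thread and abstracts the erasure code away by taking k = 1 (each "coded element" is simply the full value), so quorums have size ⌈(N + 1)/2⌉. The names CASServer, cas_write, and cas_read are ours, not the paper's pseudocode; real CAS must additionally cope with asynchrony, concurrent clients, and up to f server crashes.

# Minimal single-threaded sketch of the CAS phases (k = 1, no failures).
import math


class CASServer:
    def __init__(self):
        # tag -> {"elem": coded element (or None), "fin": bool}
        # Tag (0, "init") plays the role of the initial finalized version.
        self.store = {(0, "init"): {"elem": None, "fin": True}}

    def handle_query(self):
        # Respond to a query with the latest tag labeled fin.
        return max(t for t, rec in self.store.items() if rec["fin"])

    def handle_prewrite(self, tag, elem):
        # Store the coded element; send ack.
        self.store.setdefault(tag, {"elem": None, "fin": False})["elem"] = elem
        return "ack"

    def handle_finalize(self, tag):
        # Set the fin flag for the tag; send ack.
        self.store.setdefault(tag, {"elem": None, "fin": False})["fin"] = True
        return "ack"

    def handle_read(self, tag):
        # Label the requested tag as fin; return the coded element if available.
        rec = self.store.setdefault(tag, {"elem": None, "fin": False})
        rec["fin"] = True
        return rec["elem"]


def cas_write(servers, value, writer_id):
    n = len(servers)
    quorum = math.ceil((n + 1) / 2)                 # quorum size ceil((N+k)/2), k = 1
    latest = max(s.handle_query() for s in servers[:quorum])
    tag = (latest[0] + 1, writer_id)                # incremented tag
    acks = [s.handle_prewrite(tag, value) for s in servers]
    assert acks.count("ack") >= quorum              # pre-write phase complete
    acks = [s.handle_finalize(tag) for s in servers]
    assert acks.count("ack") >= quorum              # finalize phase complete


def cas_read(servers):
    n = len(servers)
    quorum = math.ceil((n + 1) / 2)
    tag = max(s.handle_query() for s in servers[:quorum])     # query phase
    elems = [s.handle_read(tag) for s in servers[:quorum]]    # read phase
    return next(e for e in elems if e is not None)            # "decode" (trivial for k = 1)


servers = [CASServer() for _ in range(5)]
cas_write(servers, "v1", "writer-1")
print(cas_read(servers))                            # -> v1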
[Figures: one write in CAS. The write client sends coded elements to the servers, collects ACKs from a quorum, then sends "fin" (finalize) messages, and collects a second round of ACKs; read clients interact with the same servers.]
Coded Atomic Storage (CAS)
Solves the challenge of only revealing completed writes to readers.
• Additional "fin" label at servers: indicates that the corresponding version has been propagated to sufficiently many servers.
• Additional write phase: tells the servers that the coded elements have been propagated to a quorum.
• Servers store the entire history of versions – infinite storage cost (solved in CASGC).
CAS – Guarantees
Theorem:
1) CAS satisfies atomicity.
2) Liveness: every operation returns, provided the number of server failures is at most the pre-specified threshold f.
Modification of CAS (a failed attempt at garbage collection)
Possible solution: each server stores at most d+1 coded elements and deletes older ones.
The good:
- Finite storage cost.
- All operations terminate if the number of writes that overlap a read is smaller than d.
- Atomicity (via a simulation relation with CAS).
The bad:
- Failed write clients result in a weak liveness condition: a failed write never ends and is therefore concurrent with all future writes, so d failed writes can render all future reads incomplete.
The CASGC algorithm: the main novelties
• The client protocol is the same as in CAS; only the differences in the server protocol are summarized here.
• Keep the d+1 latest coded elements with a fin label, plus all intervening elements; delete older ones.
• Use server gossip to propagate fin labels and to "complete" failed operations.
• End-point of an operation: the point at which the operation is "completed" through gossip, or the point of failure if the operation cannot be completed through gossip. This definition of end-point suffices for defining concurrency and a satisfactory liveness condition.
[Figure: a failed write operation and the end-point assigned to it.]
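As a concrete illustration of the garbage-collection rule (our own sketch with illustrative names, not the paper's pseudocode), the Python function below keeps the d+1 most recent tags that carry a fin label together with every intervening tag, and discards everything older. Tags are plain integers here; CASGC uses (integer, client-id) tags, but the rule is the same.

# Sketch of the CASGC per-server garbage-collection rule.
def tags_to_keep(store, d):
    """store: dict tag -> {'fin': bool, ...}; returns the set of tags retained."""
    fin_tags = sorted(t for t, rec in store.items() if rec["fin"])
    if len(fin_tags) <= d + 1:
        return set(store)                 # nothing old enough to discard yet
    cutoff = fin_tags[-(d + 1)]           # oldest of the d+1 latest 'fin' tags
    return {t for t in store if t >= cutoff}


store = {
    1: {"fin": True}, 2: {"fin": True}, 3: {"fin": False},
    4: {"fin": True}, 5: {"fin": False}, 6: {"fin": True},
}
print(sorted(tags_to_keep(store, d=1)))   # -> [4, 5, 6]: the 2 latest 'fin' tags plus the intervening tag 5

Per the description above, gossiped fin labels let servers "complete" failed operations, so their elements also become eligible for garbage collection under this rule.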
Main Theorems
• All operations complete if the number of writes concurrent with a read is smaller than a tunable threshold (the parameter d used in the garbage-collection rule above).
• (In the paper) a bound on the storage cost.
Main Insights
• Significant savings in network traffic overheads are possible with clever design.
• Server gossip is a powerful tool for good liveness in storage systems.
• Storage overheads depend on many factors, including the extent of concurrent client activity.
Summary
CASGC: liveness – conditional, tunable; communication cost – small; storage cost – tunable, quantifiable.
Viveck R. Cadambe, Nancy Lynch, Muriel Médard, and Peter Musial. A Coded
Shared Atomic Memory Algorithm for Message Passing Architectures.
Distributed Computing (Springer), 2017
Appendix: Converse for multi-version codes,
T = 3, description of worst-case state.
Converse: T=3
[Sequence of figures describing the worst-case storage state: each server stores coded symbols of some subset of versions 1, 2, and 3, and the state is arranged so that all three versions are decodable from some set of c+2 servers, implying the storage cost bound.]