SEMINAR 236825
OPEN PROBLEMS IN
DISTRIBUTED COMPUTING
Winter 2013-14
Hagit Attiya & Faith Ellen
236825
Introduction
1
INTRODUCTION
Distributed Systems
• Distributed systems are everywhere:
  – share resources
  – communicate
  – increase performance (speed & fault tolerance)
• Characterized by
  – independent activities (concurrency)
  – loosely coupled parallelism (heterogeneity)
  – inherent uncertainty
• Examples:
  – operating systems
  – (distributed) database systems
  – software fault-tolerance
  – communication networks
  – multiprocessor architectures
Main Admin Issues
• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
  – at most one absence without explanation
• Tentative list of papers already published
  – first come, first served
• Lectures in English
Course Overview: Basic Models

               | message passing | shared memory
  synchronous  |                 | PRAM
  asynchronous |                 |
Message-Passing Model
• processors p0, p1, …, pn-1 are nodes of the graph.
Each is a state machine with a local state.
• bidirectional point-to-point channels are the undirected
edges of the graph.
• Channel from pi to pj is modeled in two pieces:
– outbuf variable of pi (physical channel)
– inbuf variable of pj (incoming message queue)
[Figure: an example graph on processors p0, p1, p2, p3; the number at each edge endpoint is that processor's local label for the channel]
Modeling Processors and Channels
• processors p0, p1, …, pn-1 are nodes of the graph.
Each is a state machine with a local state.
• bidirectional point-to-point channels are the undirected
edges of the graph.
• Channel from pi to pj is modeled in two pieces:
– outbuf variable of pi (physical channel)
– inbuf variable of pj (incoming message queue)
[Figure: processors p1 and p2 with their local variables; the channel between them is modeled by outbuf/inbuf pairs (labels outbuf[1], inbuf[1], outbuf[2], inbuf[2])]
Configuration
A snapshot of the entire system: accessible processor states (local variables & incoming message queues) as well as communication channels.

Formally: a vector of processor states (including outbufs, i.e., channels), one per processor.
Deliver Event
Moves a message from the sender's outbuf to the receiver's inbuf; the message will be available the next time the receiver takes a step.

[Figure: before the event, p1's outbuf holds m3, m2, m1; after it, m1 has moved to p2's inbuf, leaving m3, m2 at p1]
Computation Event
Occurs at one processor:
• Start with old accessible state (local variables + incoming messages)
• Apply the processor's state machine transition function; handle all incoming messages
• End with new accessible state with empty inbufs & new outgoing messages

[Figure: incoming messages a, b, c are consumed, the local state changes from old to new, and messages d, e are placed in the outbufs]
Execution
configuration, event, configuration, event, configuration, …
• In the first configuration: each processor is in its initial state and all inbufs are empty
• For each consecutive triple
      configuration, event, configuration
  the new configuration is the same as the old configuration except:
  – if delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
  – if computation event: the specified processor's state (including outbufs) changes according to the transition function
Asynchronous Executions
• An execution is admissible in the asynchronous model if
  – every message in an outbuf is eventually delivered
  – every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)
Example: Simple Flooding Algorithm
• Each processor's local state consists of a variable color, either red or green
• Initially:
  – p0: color = green, all outbufs contain M
  – others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
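The rule above can be made concrete with a small simulation of flooding in the synchronous model, where M crosses one edge per round. This is a sketch for illustration: the graph, its adjacency-dict encoding, and the helper name `flood` are ours, not from the slides.

```python
from collections import deque

def flood(adj, root):
    """Return the round in which each node turns green (first receives M)."""
    # BFS mirrors the synchronous rounds: M travels one hop per round.
    round_received = {root: 0}           # p0 is green from the start
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:                 # send M on all outgoing channels
            if v not in round_received:  # red node: turn green, forward M
                round_received[v] = round_received[u] + 1
                queue.append(v)
    return round_received

# Triangle plus a pendant node: diameter 2, so all nodes are green by round 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(flood(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```

The largest round number equals the distance from p0, matching the diameter + 1 time bound discussed a few slides later.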
Example: Flooding

[Figure: a three-processor system p0, p1, p2 shown through successive configurations, linked by a deliver event at p1 from p0, a computation event by p1, a deliver event at p2 from p1, and a computation event by p2; the message M spreads from p0 across the network]
Example: Flooding (cont'd)

[Figure: the execution continues with a deliver event at p1 from p2, a computation event by p1, a deliver event at p0 from p1, and so on, until the rest of the messages are delivered]
(Worst-Case) Complexity Measures
• Message complexity: maximum number of messages sent in any admissible execution
• Time complexity: maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
  – Produce a timed execution by assigning nondecreasing real times to events so that the time between sending and receiving any message is at most 1
  – Time complexity: maximum time until termination in any timed admissible execution
Complexities of Flooding Algorithm
A state is terminated if color = green.
• One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
• A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units
Synchronous Message Passing Systems
An execution is admissible for the synchronous model if it is an infinite sequence of rounds
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor
This captures the lockstep behavior of the model. It also implies
– every message sent is delivered
– every processor takes an infinite number of steps
Time is the number of rounds until termination.
Example: Flooding in the Synchronous Model

[Figure: in round 1, M is delivered from p0 to p1 and p2; in round 2, p1 and p2 exchange M]

Time complexity is diameter + 1
Message complexity is 2m
Broadcast Over a Rooted Spanning Tree
• Processors have information about a rooted spanning tree of the communication topology
  – parent and children local variables at each processor
• The root initially sends M to its children
• When a processor receives M from its parent
  – it sends M to its children
  – and terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models)
  – time is the depth of the spanning tree, which is at most n − 1
  – the number of messages is n − 1, since one message is sent over each spanning tree edge
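The two complexity claims can be checked on a concrete tree. A minimal sketch, assuming the tree is given as a children dictionary (the shape, names, and the helper `broadcast` are illustrative):

```python
def broadcast(children, root):
    """Return (messages_sent, completion_time) for broadcasting from root."""
    def send(node, depth):
        msgs, max_depth = 0, depth
        for child in children.get(node, []):
            msgs += 1                        # one message per tree edge
            m, d = send(child, depth + 1)    # child forwards M in turn
            msgs += m
            max_depth = max(max_depth, d)
        return msgs, max_depth
    return send(root, 0)

# A tree on 5 nodes with depth 2: expect n - 1 = 4 messages, time 2.
children = {0: [1, 2], 1: [3, 4]}
print(broadcast(children, 0))   # (4, 2)
```

The message count is always n − 1 (one per edge), while the time equals the tree's depth, which can be as bad as n − 1 for a path.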
Finding a Spanning Tree from a Root
• The root sends M to all its neighbors
• When a non-root first gets M, it
  – sets the sender as its parent
  – sends a "parent" msg to the sender
  – sends M to all other neighbors (if there are no other neighbors, it terminates)
• When it gets M otherwise, it
  – sends a "reject" msg to the sender
• Use the "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
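A sketch of this algorithm run in lockstep (the synchronous case, where it yields a BFS tree, as the next slides show). The graph, the helper name `spanning_tree`, and the dictionary encoding are our illustration; "reject" replies are noted but not modeled.

```python
from collections import deque

def spanning_tree(adj, root):
    """Simulate the synchronous run: first copy of M fixes the parent."""
    parent = {root: None}
    children = {u: [] for u in adj}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:          # v's first M: set sender as parent
                parent[v] = u            # v would send "parent" back to u
                children[u].append(v)
                queue.append(v)
            # any later copy of M would get a "reject" reply (not modeled)
    return parent, children

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
parent, children = spanning_tree(adj, 0)
print(parent)    # {0: None, 1: 0, 2: 0, 3: 1}
```

Because the simulation processes nodes in breadth-first order, every node's parent lies on a shortest path from the root; an asynchronous run need not satisfy this.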
Execution of Spanning Tree Algorithm

[Figure: two executions on a graph with a root and nodes a–h. Synchronous: always gives a breadth-first search (BFS) tree. Asynchronous: not necessarily a BFS tree. Both models: O(m) messages, O(diam) time]
Execution of Spanning Tree Algorithm

[Figure: an asynchronous execution that gave a depth-first search (DFS) tree]

Is the DFS property guaranteed? No!

[Figure: another asynchronous execution results in a tree that is neither BFS nor DFS]
Shared Memory Model
Processors (also called processes) communicate via a set of shared variables.
Each shared variable has a type, defining a set of primitive operations (performed atomically):
• read, write
• compare&swap (CAS)
• LL/SC, DCAS, kCAS, …
• read-modify-write (RMW), kRMW

[Figure: processes p0, p1, p2 applying read, write, and RMW primitives to shared variables X and Y]
Changes from the
Message-Passing Model
• no inbuf and outbuf state components
• a configuration includes values for the shared variables
• one event type: a computation step by a process
  – pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
  – the shared variable's value in the new configuration changes according to the primitive's semantics
  – pi's state in the new configuration changes according to its old state and the result of the primitive
An execution is admissible if every processor takes an infinite number of steps.
Abstract Data Types
• Abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes, a hierarchy of implementations: primitive operations implemented from lower-level ones

[Figure: layered implementations wrapping the data]
Executing Operations

[Figure: a timeline of processes P1, P2, P3 with overlapping operations enq(1) → ok, deq → 1, and enq(2); each operation is an invocation followed by a response]
Interleaving Operations, or Not
[Figure: a sequential behavior: enq(1) ok, deq 1, enq(2) …]

Sequential behavior: invocations & responses alternate and match (on process & object)

Sequential specification: all legal sequential behaviors
Correctness: Sequential consistency
[Lamport, 1979]
• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the sequential specification)
  – preserves the order of operations by the same process
Example 1: Multi-Writer Registers
Using (multi-reader) single-writer registers: add logical time to values; read only own value.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩

Read(X):
  read ⟨v, TSi⟩
  return v

Once in a while:
  read TS1, …, TSn
  and write to TSi
(needed to ensure writes are eventually visible)
Timestamps
1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩
Multi-Writer Registers: Proof
Create a sequential execution:
– place writes in timestamp order
– insert reads after the appropriate write

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩

Read(X):
  read ⟨v, TSi⟩
  return v

Once in a while:
  read TS1, …, TSn
  and write to TSi
Multi-Writer Registers: Proof
Create a sequential execution:
– place writes in timestamp order
– insert reads after the appropriate write
✔ Legality is immediate
✔ Per-process order is preserved, since a read returns a value (with timestamp) larger than the preceding write by the same process
Correctness: Linearizability
[Herlihy & Wing, 1990]
• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the specification of the ADTs)
  – preserves the real-time order of non-overlapping operations
• Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)
Example 2: Linearizable Multi-Writer Registers
[Vitanyi & Awerbuch, 1987]
Using (multi-reader) single-writer registers: add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩

Read(X):
  read TS1, …, TSn
  return the value with max TS
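A sequential, single-threaded sketch of this construction. The class and method names are ours; the array of (timestamp, writer id, value) triples stands in for the per-process single-writer registers, and real interleaving is not modeled.

```python
class MultiWriterRegister:
    def __init__(self, n):
        # reg[i] plays the role of process i's single-writer register,
        # holding a (timestamp, writer_id, value) triple.
        self.reg = [(0, i, None) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for ts, _, _ in self.reg)  # read all timestamps
        self.reg[i] = (max_ts + 1, i, v)           # write with a larger TS

    def read(self):
        # Return the value with the maximum (timestamp, writer_id) pair;
        # the writer id breaks ties between concurrent writers.
        return max(self.reg)[2]

r = MultiWriterRegister(3)
r.write(0, "a")
r.write(2, "b")
print(r.read())   # "b"
```

The lexicographic comparison on (timestamp, writer id) is what makes the timestamp order total, which the linearization on the next slides relies on.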
Multi-Writer Registers: Linearization Order
Create a linearization:
– place writes in timestamp order
– insert each read after the appropriate write

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩

Read(X):
  read TS1, …, TSn
  return the value with max TS
Multi-Writer Registers: Proof
Create a linearization:
– place writes in timestamp order
– insert each read after the appropriate write
✔ Legality is immediate
✔ Real-time order is preserved, since a read returns a value (with timestamp) larger than all preceding operations
Example 3: Atomic Snapshot
• n components
• Update a single component
• Scan all the components "at once" (atomically)

[Figure: update → ok; scan → v1, …, vn]

Provides an instantaneous view of the whole memory.
Atomic Snapshot Algorithm
[Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Update(v,k):
  A[k] = ⟨v, seqi, i⟩

Scan():                        // double collect
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal, return A[1…n]

Linearize:
• updates with their writes
• scans inside the double collects
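A single-threaded sketch of the double-collect scan. The class and field names are ours; with no concurrent updaters the first double collect always succeeds, so this only illustrates the structure, not the retry behavior under contention.

```python
class Snapshot:
    def __init__(self, n):
        # A[k] holds a (value, seq, pid) triple; seq makes every write distinct,
        # so equal collects really mean "no write in between".
        self.A = [(None, 0, i) for i in range(n)]
        self.seq = [0] * n

    def update(self, i, v):
        self.seq[i] += 1
        self.A[i] = (v, self.seq[i], i)   # write the value with a fresh seq#

    def scan(self):
        while True:
            first = list(self.A)          # first collect
            second = list(self.A)         # second collect
            if first == second:           # no write in between: safe zone
                return [v for v, _, _ in first]

s = Snapshot(3)
s.update(0, 7)
s.update(2, 9)
print(s.scan())   # [7, None, 9]
```

The sequence numbers are essential: without them an updater could write the old value back between the collects (an ABA situation) and the equality test would wrongly succeed.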
Atomic Snapshot: Linearizability
Double collect (read a set of values twice): if the two collects are equal, there is no write between them
– assuming each write has a new value (seq#)

[Figure: a write to A[j] falling strictly between the two collects would make them differ]

This creates a "safe zone", where the scan can be linearized.
Liveness Conditions
• Wait-free: every operation completes within a finite number of (its own) steps
  (no starvation for mutex)
• Nonblocking: some operation completes within a finite number of (some other process') steps
  (deadlock-freedom for mutex)
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
  – also called solo termination

wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free
Wait-free Atomic Snapshot
[Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
• Embed a scan within the Update.

Update(v,k):
  V = scan
  A[k] = ⟨v, seqi, i, V⟩

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal, return A[1…n]              // direct scan
    else record the difference
    if some pj changed twice, return Vj  // borrowed scan

Linearize:
• updates with their writes
• direct scans as before
• borrowed scans in place
Atomic Snapshot: Borrowed Scans

[Figure: the scanner repeatedly reads A[j] and sees interference by process pj, and then another one; between its two writes to A[j], pj completed an embedded scan, so pj did a scan in between. Linearizing with the borrowed scan is OK]
List of Topics (Indicative)
• Atomic snapshots
• Renaming
• Space complexity of consensus
• Maximal independent set
• Dynamic storage
• Routing
• Vector agreement
…and possibly others