
Event Based Systems
Time and synchronization (II), CAP theorem
and ZooKeeper
Dr. Emanuel Onica
Faculty of Computer Science, Alexandru Ioan Cuza University of Iaşi
Contents
1. Time and synchronization (part II) – vector clocks
2. CAP Theorem
3. BASE vs ACID
4. ZooKeeper
Vector Clocks
• Recap: the problem with Lamport timestamps - if A→B then timestamp(A) < timestamp(B), but if timestamp(A) < timestamp(B), it does not necessarily follow that A→B
Vector clocks
Using multiple time tags per process makes it possible to determine the exact happens-before relation from the tags:
- each process keeps a vector of N values, where N = number of processes, initialized with 0
- the i-th element of the vector clock is the clock of the i-th process
- the vector is sent along with the messages
- when a process sends a message, it increments just its own clock in the vector
- when a process receives a message, it increments its own clock in the local vector and sets each of the other clocks to max(clock value in local vector, clock value in received vector)
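A minimal sketch of these update rules, assuming processes are identified by indices 0..N-1 (the class and method names below are illustrative, not part of any library):

```java
// Minimal vector clock sketch: one instance per process.
public class VectorClock {
    private final int[] clock;   // clock[i] = logical time of process i
    private final int myId;      // index of the owning process

    public VectorClock(int numProcesses, int myId) {
        this.clock = new int[numProcesses];  // initialized with 0
        this.myId = myId;
    }

    // On send: increment only the sender's own entry and attach a copy to the message.
    public int[] onSend() {
        clock[myId]++;
        return clock.clone();
    }

    // On receive: increment own entry and take the element-wise max with the received vector.
    public void onReceive(int[] received) {
        clock[myId]++;
        for (int i = 0; i < clock.length; i++) {
            if (i != myId) {
                clock[i] = Math.max(clock[i], received[i]);
            }
        }
    }

    public int[] snapshot() {
        return clock.clone();
    }
}
```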
Vector Clocks example
(step-by-step example figures omitted)
Vector Clocks – Causality determination
• We can establish that there is a causality (happened-before) relation between two events E1 and E2 if V(E1) < V(E2) (where V is the associated vector clock).
• V(E1) < V(E2) if all clock values in V(E1) are <= the corresponding clock values in V(E2), and there exists at least one value in V(E1) that is < the corresponding clock value in V(E2).
• If !(V(E1) < V(E2)) and !(V(E2) < V(E1)), we can label the events as concurrent.
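A small comparison helper following this definition (continuing the illustrative sketch above; again, the names are assumptions, not an existing API):

```java
// Comparison helpers following the definition above.
public final class VectorClockOrder {

    // Returns true if v1 < v2: every entry of v1 <= the corresponding entry of v2,
    // and at least one entry is strictly smaller.
    public static boolean happenedBefore(int[] v1, int[] v2) {
        boolean strictlySmallerSomewhere = false;
        for (int i = 0; i < v1.length; i++) {
            if (v1[i] > v2[i]) {
                return false;
            }
            if (v1[i] < v2[i]) {
                strictlySmallerSomewhere = true;
            }
        }
        return strictlySmallerSomewhere;
    }

    // Concurrent: neither v1 < v2 nor v2 < v1.
    public static boolean concurrent(int[] v1, int[] v2) {
        return !happenedBefore(v1, v2) && !happenedBefore(v2, v1);
    }
}
```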
The CAP Theorem
First formulated by Eric Brewer as a conjecture (PODC
2000). Formal proof by Seth Gilbert and Nancy Lynch
(2002).
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (every request receives a success or failure response in a timely manner)
• Partition tolerance (the system continues to work despite arbitrary partitioning due to network failures)
The CAP Theorem
Figure source: Concurrent Programming for Scalable Web Architectures, Benjamin Erb
The CAP Theorem – the requirements
Consistency – why do we need it?
• Distributed services are used by many users (up to millions)
• Concurrent reads and writes take place in the distributed system
• We need a unitary view across all sessions and across all data replicas
Example: An online shop stores stock info on a distributed storage system. When a user buys a product, the stock info must be updated on every storage replica and be visible to all users across all sessions.
The CAP Theorem – the requirements
Availability – why do we need it?
• Reads and writes should be completed in a timely, reliable fashion, to offer the desired level of QoS
• In commercial environments, Service Level Agreements (SLAs) are often set to establish the parameters of functionality that must be provided
• A ½ second delay per page load results in a 20% drop in traffic and revenue (Google, Web 2.0 Conference, 2006)
• Amazon tests simulating artificial delays indicated an impact of 6M dollars per each ms of delay (2009)
The CAP Theorem – the requirements
Partition-tolerance – why do we need it?
• Multiple data centers used by the same distributed service can be partitioned due to various failures:
• Power outages
• DNS timeouts
• Cable failures
• and others ...
• Failures, or more generally faults, in distributed systems are rather the norm than the exception
• Let’s say a rack server in a data center has one downtime in 3 years
• Try to figure out how often the data center fails on average if it has 100 servers
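As a rough back-of-the-envelope estimate, assuming failures are independent and spread uniformly over time: 100 servers × 1 failure / 3 years ≈ 33 failures per year, i.e., on average roughly one server failure in the data center every 11 days.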
BASE vs ACID
ACID – the traditional guarantee set in RDBMS
• Atomicity – a transaction either succeeds, changing the database, or fails with a rollback
• Consistency – there is no invalid state in a sequence of
states caused by transactions
• Isolation – each transaction is isolated from the rest,
preventing conflicts between concurrent transactions
• Durability – each transaction commit is persistent over
failures
Doesn’t work for distributed storage systems – no coverage
for partition tolerance (and this is mandatory)
BASE vs ACID
BASE – Basically Available, Soft-state, Eventual consistency
• Basically Available – ensures the availability requirement
in the CAP theorem
• Soft-state – there is no strong consistency provided, and
the state of the system might be stale at certain times
• Eventual consistency – eventually the state of the
distributed system converges to a consistent view
BASE vs ACID
BASE model is mostly used in distributed key-value stores,
where availability is typically favored over consistency
(NoSQL):
• Apache Cassandra (originally developed at Facebook)
• DynamoDB (Amazon)
• Voldemort (LinkedIn)
There are exceptions:
• HBase (inspired by Google’s Bigtable) – favors consistency over availability
ZooKeeper – Distributed Coordination
Basic idea:
• The distributed systems world is like a Zoo, and the beasts should be kept on a leash
• Multiple instances of distributed applications (the same or different apps) often require synchronization for proper interaction
• ZooKeeper is a coordination service where:
• The apps coordinated are distributed
• The service itself is also distributed
Article: ZooKeeper: Wait-free coordination for Internet-scale systems (P. Hunt et al. – USENIX 2010)
ZooKeeper – Distributed Coordination
What do we mean by apps requiring synchronization?
Various stuff (it’s a Zoo ...):
• Detecting group membership validity
• Leader election protocols
• Mutual exclusive access on shared resources
• and others ...
What do we mean by ZooKeeper providing coordination?
• The ZooKeeper service does not offer complex server-side synchronization primitives such as the ones above
• The ZooKeeper service offers a coordination kernel, exposing an API that clients can use to implement whatever primitives they require
ZooKeeper – Guarantees
The ZooKeeper coordination kernel provides several guarantees:
1. It is wait-free
Let’s stop a bit ... What does this mean (in general)?
• lock-freedom – at least one system component makes progress (system-wide throughput is guaranteed, but some components can starve)
• wait-freedom – all system components make progress (no
individual starvation)
2. It guarantees FIFO ordering for all client operations
3. It guarantees linearizable writes
ZooKeeper – Guarantees (Linearizability)
Let’s stop a bit ... What does linearizability mean (in general)?
• An operation typically has an invocation and a response phase – viewed atomically, the invocation and response are indivisible, but in reality this is not exactly the case ...
• An execution of operations is linearizable if:
• Invocations of operations and their responses can be reordered, without changing the system behavior, into an equivalent sequence of atomically executed operations (a sequential history)
• The sequential history obtained is semantically correct
• If, in the original order, an operation’s response completes before another operation starts, it still completes before it in the reordering
ZooKeeper – Guarantees (Linearizability)
Example (threads):
T1.lock(); T2.lock(); T1.fail; T2.ok;
Let’s reorder ...
a)
T1.lock(); T1.fail; T2.lock(); T2.ok;
• it is sequential ...
• ... but not semantically correct
b) T2.lock(); T2.ok; T1.lock(); T1.fail;
• it is sequential ...
• ... and semantically correct
We have b) => the original history is linearizable.
Back to ZK, recap: the coordination kernel ensures that the history of application write operations is linearizable (not the reads).
ZooKeeper – Fundamentals
High level overview of the service:
Figure source: ZooKeeper Documentation
• The service interface is exposed to clients through a client
library API.
• Multiple servers offer the same coordination service in a distributed fashion (single system image); among them, a leader is defined as part of the internal ZK protocol that ensures consistency.
ZooKeeper – Fundamentals
The main abstraction offered by the ZooKeeper client library is a hierarchy of znodes, organized similarly to a file system:
Figure source: ZooKeeper: Wait-free coordination for Internet-scale systems
(P. Hunt et al. – USENIX 2010)
Applications can create and delete znodes and change their limited-size data content (1MB by default) to set configuration parameters that are used in the distributed environment.
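As an illustration, a minimal sketch using the ZooKeeper Java client; the connection string and znode paths are made-up values:

```java
import org.apache.zookeeper.*;

public class ZkBasicExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address and timeout are placeholders).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });

        // Create a parent znode if it does not exist yet.
        if (zk.exists("/app1", false) == null) {
            zk.create("/app1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Create a regular (persistent) child znode holding a small configuration value.
        zk.create("/app1/p_1", "some-config".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back, then overwrite it (-1 means: ignore the version check).
        byte[] data = zk.getData("/app1/p_1", false, null);
        System.out.println("read: " + new String(data));
        zk.setData("/app1/p_1", "new-config".getBytes(), -1);

        zk.close();
    }
}
```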
ZooKeeper – Fundamentals
Two types of nodes:
• Regular – created and deleted explicitly by apps
• Ephemeral – can be deleted automatically when the session during which they were created terminates
Nodes can be created with the same base name but with a sequential flag set, for which the ZK service appends a monotonically increasing number.
What’s this good for?
• the same client application (same code) that creates a node to store configuration (e.g., a pub/sub broker)
• run multiple times in a distributed fashion
• obviously we don’t want to overwrite an existing config node
• maybe we also need to organize a queue of nodes based on order
• other applications (various algorithms)
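For example, several instances of the same application could each register themselves without overwriting one another by using ephemeral sequential znodes; a sketch (paths and data are illustrative):

```java
import org.apache.zookeeper.*;

public class ZkSequentialExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });

        // Make sure the parent znode exists (regular node).
        if (zk.exists("/brokers", false) == null) {
            zk.create("/brokers", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each instance creates an ephemeral sequential child: ZooKeeper appends a
        // monotonically increasing counter (e.g. /brokers/broker-0000000003), and the
        // node is removed automatically when this instance's session ends.
        String myNode = zk.create("/brokers/broker-", "host:port".getBytes(),
                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                  CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + myNode);

        zk.close();
    }
}
```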
ZooKeeper – Fundamentals
Client sessions:
- applications connect to the ZK service and execute operations as part of a session
- sessions are ended explicitly by clients or when session timeouts occur (the ZK server does not receive anything from the client for a while)
Probably the most important ZooKeeper feature: watches
- permit application clients to receive notifications about change events on znodes without polling
- normally set through flags on read-type operations
- one-time triggers: once a watch is triggered by an event, the watch is removed; to be notified again, the watch must be set again
- are associated with a session (unregistered once the session ends)
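A sketch of the one-time-trigger behavior with the Java client: the watch set by exists() fires once, and the client re-registers it in the watch handler (the path is illustrative):

```java
import org.apache.zookeeper.*;

public class ZkWatchExample implements Watcher {
    private final ZooKeeper zk;

    public ZkWatchExample(ZooKeeper zk) { this.zk = zk; }

    // Register (or re-register) a watch on the znode; exists() works even if the node
    // does not exist yet, and the watch fires on creation, deletion or data change.
    public void watchConfig() throws KeeperException, InterruptedException {
        zk.exists("/config", this);
    }

    @Override
    public void process(WatchedEvent event) {
        System.out.println("event on " + event.getPath() + ": " + event.getType());
        try {
            // One-time trigger: set the watch again to keep receiving notifications.
            watchConfig();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```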
ZooKeeper – Client API (generic form)
• create (path, data, flags)
Creates a znode at the specified path, which is filled with the
specified data, and has the type (regular or ephemeral,
sequential or normal) specified in the flags. The method returns
the node name.
• delete (path, version)
Deletes the znode at the specified path if the node has the
specified version (optional, use -1 to ignore).
• exists (path, watch)
Checks if a node exists at the specified path, and optionally sets a watch that triggers a notification when the node is created, deleted, or has new data set on it.
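For instance, a conditional delete can combine exists() and delete() with the version check; a sketch using the Java client (error handling kept minimal):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZkConditionalDelete {
    // Delete the znode only if it still has the version we last observed;
    // the delete fails with a BadVersion error if someone changed it in the meantime.
    public static void deleteIfUnchanged(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists(path, false);   // no watch set here
        if (stat != null) {
            zk.delete(path, stat.getVersion());
        }
    }
}
```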
ZooKeeper – Client API (generic form)
• getData (path, watch)
Returns the data at the specified path if the znode exists, and optionally sets a watch that triggers when new data is set on the znode at the path or the znode is deleted.
• setData (path, data, version)
Sets the specified data at the specified path if the znode exists,
optionally just if the node has the specified version. Returns a
structure containing various information about the znode.
• getChildren (path, watch)
Returns the set of names of the children of the znode at the specified path, and optionally sets a watch that triggers when a child is created or deleted under the path, or when the znode at the path is deleted.
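A sketch of a version-checked read-modify-write using getData() and setData(), followed by listing children (paths are illustrative):

```java
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZkReadModifyWrite {
    public static void updateConfig(ZooKeeper zk, String path, byte[] newData)
            throws KeeperException, InterruptedException {
        // Read the current data; the Stat object is filled with znode metadata,
        // including the current version.
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);
        System.out.println("current: " + new String(current));

        // Write back only if nobody changed the znode since our read; otherwise
        // setData fails with a BadVersion error and the caller could retry.
        zk.setData(path, newData, stat.getVersion());

        // List the children of a parent znode (no watch set).
        List<String> children = zk.getChildren("/app1", false);
        System.out.println("children of /app1: " + children);
    }
}
```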
ZooKeeper – Client API (generic form)
• sync (path)
- ZooKeeper offers a single system image to all connected clients (all clients have the same view of znodes, no matter which server they are connected to);
- depending on the server a client is connected to, some updates might not yet be processed when the client executes a read operation;
- sync waits for all updates to propagate to the server where the client is connected;
- the path parameter is simply ignored.
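A sketch of the typical "sync, then read" pattern using the asynchronous sync() call of the Java client (path and printout are illustrative):

```java
import org.apache.zookeeper.*;

public class ZkSyncExample {
    // Ask the server we are connected to to catch up with all pending updates,
    // then perform the read once the sync completes.
    public static void syncThenRead(ZooKeeper zk, String path) {
        zk.sync(path, (rc, p, ctx) -> {
            try {
                byte[] data = zk.getData(p, false, null);
                System.out.println("fresh data: " + new String(data));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, null);
    }
}
```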
ZooKeeper – Client API
All read- and write-type operations (not sync) have two forms:
• synchronous
• executes a single operation and blocks until it completes
• does not permit other concurrent tasks
• asynchronous
• sets a callback for the invoked operation, which is triggered when the operation completes
• does permit concurrent tasks
• an ordering guarantee is preserved for asynchronous callbacks, based on the invocation order of the operations
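The two forms for the same read, sketched with the Java client (the path is illustrative):

```java
import org.apache.zookeeper.*;

public class ZkSyncVsAsync {
    public static void readBothWays(ZooKeeper zk) throws Exception {
        // Synchronous form: blocks until the operation completes.
        byte[] data = zk.getData("/app1/p_1", false, null);
        System.out.println("sync read: " + new String(data));

        // Asynchronous form: returns immediately; the callback fires on completion.
        // Callbacks are invoked in the order in which the operations were issued.
        zk.getData("/app1/p_1", false, (rc, path, ctx, d, stat) -> {
            if (rc == KeeperException.Code.OK.intValue()) {
                System.out.println("async read: " + new String(d));
            }
        }, null);
    }
}
```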
ZooKeeper – Example scenario
Context (use of ZK, not ZK itself):
• Distributed applications that have a leader among them, responsible for their coordination
• While the leader changes system configuration, none of
the apps should start using the configuration being
changed
• If the leader dies before finishing changing configuration,
none of the apps should use the unfinished configuration
ZooKeeper – Example scenario
How it’s done (using ZK):
• The leader designates a /ready path node as a flag for ready-to-use configuration, monitored by the other apps
• While changing the configuration, the leader deletes the /ready node and creates it back when finished
• FIFO ordering guarantees that apps are notified of the /ready node creation only after the configuration changes are finished (see the sketch below)
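A sketch of the leader side of this pattern, assuming the configuration itself lives in znodes under a /config path (both paths are illustrative):

```java
import org.apache.zookeeper.*;

public class ReadyFlagLeader {
    // Take the /ready flag down, rewrite the configuration znode(s), then recreate
    // /ready. Since a client's writes are FIFO-ordered, apps notified of the /ready
    // creation are guaranteed that the new configuration writes happened before it.
    public static void changeConfig(ZooKeeper zk, byte[] newConfig)
            throws KeeperException, InterruptedException {
        if (zk.exists("/ready", false) != null) {
            zk.delete("/ready", -1);                      // -1: ignore the version
        }
        zk.setData("/config/param1", newConfig, -1);      // one or more config updates
        zk.create("/ready", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
}
```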
Looks ok ...
... or not?
ZooKeeper – Example scenario
Q: What if an app sees /ready just before it is deleted and starts reading the configuration while it is being changed?
A: The app will be notified when /ready is deleted, so it knows a new configuration is being set up and the old one is invalid. It just needs to reset the (one-time triggered) watch on /ready to find out when it is created again.
Q: What if an app is notified of a configuration change (the /ready node deleted), but the app is slow, and by the time it sets a new watch the node has already been created and deleted again?
A: The goal of the app’s action is to read/use an actually valid configuration state (which can be, and is, the latest one). Missing previous valid versions should not be critical.