
Byzantine Techniques II
Justin W. Hart
CS 614
12/01/2005
Papers


BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer et al. (SOSP 2005)
Fault-Scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et al. (SOSP 2005)
BAR Fault Tolerance for
Distributed Services



BAR Model
General Three-Level Architecture
BAR-B
Motivation

“General approach to constructing
cooperative services that span multiple
administrative domains (MADs)”
Why is this difficult?


Nodes are under control of multiple
administrators
Broken – Byzantine behaviors.


Misconfigured, or configured with malicious
intent.
Selfish – Rational behaviors

Alter the protocol to increase local utility
Other models?


Byzantine Models – Account for
Byzantine behavior, but do not handle
rational behavior.
Rational Models – Account for rational
behavior, but may break with Byzantine
behavior.
BAR Model

Byzantine
  Behaving arbitrarily or maliciously
Altruistic
  Execute the proposed program, whether it benefits them or not
Rational
  Deviate from the proposed program for purposes of local benefit
BART – BAR Tolerant

It’s a cruel world


At most (n-2)/3 nodes in the system are Byzantine
The rest are rational
Two classes of protocols

Incentive-Compatible Byzantine Fault Tolerant (IC-BFT)
  Guarantees a set of safety and liveness properties
  It is in the best interest of rational nodes to follow the protocol exactly
Byzantine Altruistic Rational Tolerant (BART)
  Guarantees a set of safety and liveness properties despite the presence of rational nodes

IC-BFT is a subset of BART
An important concept

It isn’t enough for a protocol to survive
drills of a handful of attacks. It must
provably provide its guarantees.
A flavor of things to come


Protocol builds on Practical Byzantine
Fault Tolerance in order to combat
Byzantine behavior
Protocol uses game theoretical concepts
in order to combat rational behavior
A taste of Nash Equilibrium

The game of Chicken (row player's payoff listed first):

                   Swerve      Go Straight
    Swerve          0, 0        -1, +1
    Go Straight    +1, -1      -100, -100

Mutual "Go Straight" is a crash; the pure-strategy Nash equilibria are the two outcomes in which exactly one player swerves.
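
To make the equilibrium idea concrete, here is a minimal sketch (not from either paper) that finds the pure-strategy Nash equilibria of the Chicken matrix above by checking for profitable unilateral deviations:

    # Payoffs for Chicken: payoffs[row][col] = (row player, column player)
    payoffs = [
        [(0, 0), (-1, 1)],        # row player swerves
        [(1, -1), (-100, -100)],  # row player goes straight
    ]

    def is_nash(r, c):
        # A cell is a Nash equilibrium if neither player can improve
        # their own payoff by deviating unilaterally.
        row_ok = all(payoffs[r][c][0] >= payoffs[alt][c][0] for alt in (0, 1))
        col_ok = all(payoffs[r][c][1] >= payoffs[r][alt][1] for alt in (0, 1))
        return row_ok and col_ok

    print([(r, c) for r in (0, 1) for c in (0, 1) if is_nash(r, c)])
    # -> [(0, 1), (1, 0)]: in each equilibrium, exactly one player swerves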
…and the nodes are starving!


Nodes require access to a state
machine in order to complete their
objectives
Protocol contains methods for punishing
rational nodes, including denying them
access to the state machine
An expensive notion of identity

Identity is established through cryptographic keys assigned through a trusted authority




Prevents Sybil attacks
Bounds the number of Byzantine nodes
Gives rational nodes reason to consider long-term consequences of their actions
Gives real-world grounding to identity
Assumptions about rational
nodes




“Receive long-term benefit from staying in
the protocol”
“Conservative when computing the impact of
Byzantine nodes on their utility”
“If the protocol provides a Nash equilibrium,
then all rational nodes will follow it”
“Rational nodes do not collude…colluding
nodes are classified as Byzantine”
Byzantine nodes



Byzantine fault model
Strong adversary
Adversary can coordinate collusion
attacks
Important concepts



Promptness principle
Proof of Misbehavior (POM)
Cost balancing
Promptness principle

If a rational node gains no benefit from
delaying a message, it will send it as
soon as possible
Proof of Misbehavior (POM)


Self-contained, cryptographic proof of
wrongdoing
Provides accountability to nodes for
their actions
Example of POM





Node A requests that Node B store a chunk
Node B replies that it has stored the chunk
Later, Node A requests that chunk back
Node B sends back random garbage (it hadn't stored the chunk) and a signature
Because Node A stored a hash of the chunk, it can demonstrate misbehavior on the part of Node B
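
A minimal sketch of the hash check behind this POM (illustrative; the paper's actual message and signature formats differ):

    import hashlib

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # At store time, Node A records the hash of the chunk it handed to B.
    chunk = b"backup chunk contents"
    stored_digest = digest(chunk)

    # At retrieve time, B returns some (signed) data. Signatures are elided
    # here; assume they irrefutably bind B to the bytes it returned.
    returned = b"random garbage"

    if digest(returned) != stored_digest:
        # B's signed store receipt plus B's signed response form a
        # self-contained Proof of Misbehavior: any third party can
        # recompute the hash and see that B did not return the chunk.
        pom = (stored_digest, returned)
        print("POM against Node B")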
…but it’s a bit more
complicated than that!

This corresponds to a rather simple behavior to combat: "aggressively Byzantine" behavior.
Passive-aggressive behaviors

Harder cases than “aggressively
Byzantine”


A malicious Node A could merely lie about
misbehavior on the part of Node B
A node could exploit non-determinism in
order to shirk work
Cost Balancing

If two behaviors have the same cost,
there is no reason to choose the wrong
one
Three-Level Architecture
Level 1

Unilaterally deny service to nodes that fail to deliver messages
  "Tit-for-Tat"
Balance costs
  No incentive to make the wrong choice
Penance
  Unilaterally impose extra work on nodes with untimely responses
Level 2

Failure to respond to a request by a
state machine will generate a POM from
a quorum of nodes in the state machine
Level 3


Makes use of reliable work assignment
Needs only to provide sufficient
information to identify valid
request/response pairs
Nuts and Bolts


Level 1
Level 2
Level 1

Ensure long-term benefit to participants



The RSM rotates the leadership role among participants.
Participants want to stay in the system in order to
control the RSM and complete their protocols
Limit non-determinism


Self-interested nodes could hide behind non-determinism to shirk work
Use Terminating Reliable Broadcast, rather than
consensus.


In TRB, only the sender can propose a value
Other nodes can only adopt this value, or choose a
default value
Level 1

Mitigate the effects of residual non-determinism

Cost balancing
  The protocol's preferred choice is no more expensive than any other
Encouraging timeliness
  Nodes can inflict sanctions on untimely messages
Enforce predictable communication patterns

Nodes have to have participated at every step in
order to have the opportunity to issue a command
Terminating Reliable Broadcast
3f+2 nodes, rather than 3f+1





Suppose a sender “s” is slow
The same group of nodes now wants to determine that "s" is slow
A new leader is elected
Every node but “s” wants a timely conclusion
to this, in order to get their turn to propose a
value to the state machine
“s” is not allowed to participate in this
quorum
TRB provides a few
guarantees

They differ during periods of synchrony
and periods of asynchrony
In synchrony

Termination
  Every non-Byzantine process delivers exactly one message
Non-Triviality
  If the sender is non-Byzantine and sends m, then the sender eventually delivers m

In asynchrony (as well as synchrony)

Agreement
  If one non-Byzantine process delivers a message m, then all non-Byzantine processes eventually deliver m
Integrity
  If a non-Byzantine process delivers m, then the sender sent m
Message Queue


Enforces predictable communication patterns
Bubbles

A simple retaliation policy





Node A’s message queue is filled with messages that it
intends to send to Node B
This message queue is interleaved with bubbles.
Bubbles contain predicates indicating messages expected
from B
No message except one satisfying the expected predicate from B can fill the bubble
No messages in A’s queue will go to B until B fills the
bubble
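
A toy sketch of the bubble mechanism (names and structure are illustrative, not the paper's implementation):

    from collections import deque

    class Bubble:
        def __init__(self, predicate):
            self.predicate = predicate  # describes the message A expects from B
            self.filled = False

    class MessageQueue:
        """Node A's outgoing queue toward B, interleaved with bubbles."""
        def __init__(self):
            self.items = deque()

        def push(self, item):
            self.items.append(item)

        def on_receive_from_b(self, msg):
            # An incoming message may only fill the next unfilled bubble,
            # and only if it satisfies that bubble's predicate.
            for item in self.items:
                if isinstance(item, Bubble) and not item.filled:
                    if item.predicate(msg):
                        item.filled = True
                    return

        def sendable(self):
            # A sends messages up to, but never past, the first unfilled
            # bubble: B receives nothing more until it fills it.
            out = []
            for item in self.items:
                if isinstance(item, Bubble):
                    if not item.filled:
                        break
                    continue
                out.append(item)
            return out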
Balanced Messages



We've already discussed this quite a bit
We enforce this at this level of the protocol
This is where we get our gigantic timeout message: timeouts are padded so that timing out is no cheaper than sending the expected message
Penance

Untimely vector


Tracks a node's perception of the responsiveness of other nodes
When a node becomes a sender, it
includes its untimely vector with the
message
Penance

All nodes but the sender receive penance messages from each untimely node.
  Because of bubbles, each untimely node must send a penance message back in order to continue using the system
  This provides a penalty to those nodes
  The sender is excluded from this process, because it may be motivated to lie in its untimely vector in order to avoid the work of transmitting penance messages
Timeouts and Garbage
Collection

Set-turn timeout




Timeout to take leadership away from the sender
Initially 10 seconds in this implementation, in
order to overcome all expected network delays
Can only be changed by the sender
Max_response_time


Time at which a node is removed from the
system, its messages discarded and its resources
garbage collected
Set to 1 week or 1 month in the prototypes
Global Punishment

Badlists



Transform local suspicion into POMs
Suspicion is recorded in a node's local badlist
The sender includes its badlist with its message
  If, over time, recipients see a node in f + 1 different senders' badlists, then they, too, consider that node to be faulty
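
A sketch of the aggregation rule (illustrative): a node treats a peer as faulty once f + 1 distinct senders have named it, since at least one of those accusers must be non-Byzantine.

    from collections import defaultdict

    F = 2  # assumed bound on the number of Byzantine nodes

    class BadlistTracker:
        def __init__(self):
            # accused node -> set of distinct senders whose badlists named it
            self.accusers = defaultdict(set)

        def on_sender_badlist(self, sender_id, badlist):
            for accused in badlist:
                self.accusers[accused].add(sender_id)

        def considered_faulty(self, node_id):
            # With f+1 distinct accusers, the suspicion cannot be pure
            # slander by colluding Byzantine nodes.
            return len(self.accusers[node_id]) >= F + 1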
Proof

Real proofs do not appear in this paper; they appear in the technical report
…but here’s a bit

Theorem 1: The TRB protocol satisfies
Termination, Agreement, Integrity and
Non-Triviality
…and a bit more

Theorem 2: No node has a unilateral
incentive to deviate from the protocol

Lemma 1: No rational node r benefits from
delaying sending the “set-turn” message


Follows from penance
Lemma 2: No rational node r benefits from
sending the “set-turn” message early

Sending early could cause a senderTO message to be sent (this protocol uses synchronized clocks, and all messages are cryptographically signed)
…and the rest that’s
mentioned in the paper

Lemma 3: No rational node r benefits
from sending a malformed “set-turn”
message.

The "set-turn" message only contains the turn number. Because of this, sending a malformed one reduces to either sending late (dealt with in Lemma 1) or sending early (dealt with in Lemma 2)
Level 2

State machine replication is sufficient to
support a backup service, but the
overhead is unacceptable


100 participants… 100 MB backed up… 10
GB of drive space
Assign work to individual nodes, using erasure codes to provide low-overhead fault-tolerant storage
Guaranteed Response




Direct communication is insufficient
when nodes can behave rationally
We introduce a “witness” that overhears
the conversation
This eliminates ambiguity
Messages are routed through this
intermediary
Guaranteed Response
Guaranteed Response




Node A sends a request to Node B
through the witness
The witness stores the request, and
enters RequestReceived state
Node B sends a response to Node A
through the witness
The witness stores the response, and enters the ResponseReceived state
Guaranteed Response

Deviation from this protocol will cause the witness to notice either a timeout on the part of Node B or lying on the part of Node A
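
A minimal sketch of the witness's state machine (state names follow the slides; the timeout handling is illustrative):

    import time
    from enum import Enum, auto

    class WitnessState(Enum):
        IDLE = auto()
        REQUEST_RECEIVED = auto()
        RESPONSE_RECEIVED = auto()

    class Witness:
        """Relays A<->B traffic so neither side can lie about what was sent."""
        def __init__(self, response_deadline):
            self.state = WitnessState.IDLE
            self.deadline = response_deadline
            self.request = None
            self.request_time = None

        def on_request(self, request):
            self.request, self.request_time = request, time.monotonic()
            self.state = WitnessState.REQUEST_RECEIVED
            # ... forward the request to B ...

        def on_response(self, response):
            self.state = WitnessState.RESPONSE_RECEIVED
            # ... forward the response to A ...

        def check_timeout(self):
            # If B never answers a stored request, the witness itself can
            # attest to the NoResponse: A cannot fabricate it, B cannot deny it.
            if (self.state is WitnessState.REQUEST_RECEIVED
                    and time.monotonic() - self.request_time > self.deadline):
                return ("NoResponse", self.request)
            return None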
Implementation



The system must remain incentive-compatible
Communication with the witness node is not
in the form of actual message sending, it is in
the form of a command to the RSM
Theorem 3: If the witness node enters the RequestReceived state for some work w assigned to rational node b, then b will execute w
  Holds if sufficient sanctions exist to make b motivated to do this
State limiting

State is limited by limiting the number
of slots (nodes with which a node can
communicate) available to a node



Applies a limit to the memory overhead
Limits the rate at which requests are
inserted into the system
Forces nodes to acknowledge responses to
requests

Nodes want their slots back
Optimization through Credible
Threats
Optimization through Credible
Threats





Returns to game theory
Protocol is optimized so nodes can communicate directly: a fast path is added
Nodes register "vows" with the witness
If the recipient does not respond, nodes proceed to the unoptimized case
Analogous to a driver in "chicken" throwing their steering wheel out the window
Periodic Work Protocol



Witness checks that periodic tasks, such as system maintenance, are performed
It is expected that, with a certain
frequency, each node in the system will
perform such a task
Failure to perform one will generate a
POM from the witness
Authoritative Time Service



Maintains authoritative time
Binds messages sent to that time
Guaranteed response protocol relies on
this for generating NoResponses
Authoritative Time Service



Each submission to the state machine
contains the timestamp of the proposer
Timestamp is taken to be the maximum
of the median of timestamps of the
previous f+1 decisions
If “no decision” is decided, then the
timestamp is the previous authoritative
time
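
A sketch of how the authoritative timestamp could be computed under these rules (the exact formula is in the paper; this illustrates the median-of-recent-proposals idea with monotonicity enforced):

    from statistics import median

    F = 2  # assumed bound on the number of Byzantine nodes

    def authoritative_time(prev_auth_time, recent_decisions):
        """recent_decisions holds proposer timestamps from the previous
        f+1 decisions; None marks a turn where "no decision" was decided."""
        stamps = [t for t in recent_decisions[-(F + 1):] if t is not None]
        if not stamps:
            return prev_auth_time  # no usable decisions: time stays put
        # The median tolerates up to f outlier timestamps from Byzantine
        # proposers; max() keeps the authoritative clock monotonic.
        return max(prev_auth_time, median(stamps))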
Level 3 BAR-B


BAR-B is a cooperative backup system
Three operations



Store
Retrieve
Audit
Storage




Nodes break files up into chunks
Chunks are encrypted
Chunks are stored on remote nodes
Remote nodes send signed receipts and
store StoreInfos
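
A sketch of the store path from the originator's side (the chunking, cipher, and record layout here are placeholders, not BAR-B's wire format):

    import hashlib

    CHUNK_SIZE = 64 * 1024

    def xor_encrypt(data: bytes, key: bytes) -> bytes:
        # Placeholder cipher for the sketch; a real system would use a
        # proper symmetric cipher under the owner's key.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    def prepare_chunks(path: str, key: bytes):
        """Break a file into encrypted chunks and record each chunk's hash;
        the hashes are later used to verify retrievals (and to build POMs)."""
        chunks = []
        with open(path, "rb") as f:
            while True:
                plain = f.read(CHUNK_SIZE)
                if not plain:
                    break
                cipher = xor_encrypt(plain, key)
                chunks.append({"digest": hashlib.sha256(cipher).hexdigest(),
                               "data": cipher})
        return chunks

    # Each chunk is then sent to a remote node; the signed receipt the node
    # returns is kept alongside the digest as the future audit record.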
Retrieval

A node storing a chunk can respond to
a request for a chunk with



The chunk
A demonstration that the chunk’s lease has
expired
A more recent StoreInfo
Auditing


Receipts constitute audit records
Nodes will exchange receipts in order to
verify compliance with storage quotas
Erasure Coding

Erasure coding is used to keep storage size reasonable
Storing 1 GB of data requires roughly 1.3 GB of space
Keeping this ratio reasonable is crucial to motivate self-interested nodes to participate
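
For intuition: an (n, k) erasure code splits data into k fragments and encodes them into n, any k of which suffice to reconstruct, so the space blow-up is n/k. A quick calculation with illustrative parameters chosen to match the 1.3x ratio above:

    def erasure_footprint(k: int, n: int, data_gb: float) -> float:
        """Total space consumed when data is split into k fragments and
        encoded into n fragments (any k of them reconstruct the data)."""
        assert n >= k > 0
        return data_gb * n / k

    # e.g., k=10 data fragments encoded into n=13 stored fragments:
    print(erasure_footprint(10, 13, 1.0))  # -> 1.3 GB stored per 1 GB of data
    # and the loss of any n - k = 3 fragments can be tolerated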
Request-Response pattern



Store
Retrieve
Audit
Retrieve


Originator sends a Receipt for the StoreInfo to be retrieved
Storage node can send
  A RetrieveConfirm
    Containing the data and the receipt
  A RetrieveDeny
    Containing a receipt and a proof regarding why
  Anything else
    Generates a POM
Store


Originator sends a StoreInfo to be stored
Storage node can send
  A receipt
  A StoreReject
    Demonstrates that the node has reached its storage commitment
  Anything else
    Generates a POM
Audit

Three phases




Auditor requests both OwnList and
StoreList from auditee
Does this for random nodes in the system
Lists are checked for inconsistencies
Inconsistencies result in a POM
Time constraints



Data is stored for 30 days
After this, it is garbage collected
Nodes must renew their leases on stored chunks prior to this expiration in order to keep them in the system
Sanctions



Periodic work protocol forces generation
of POMs or special NoPOMs
POMs and NoPOMs are balanced
POMs evict nodes from the system
Recovery





Nodes must be able to recover after failures
Chained membership certificates are used in
order to allow them to retrieve their old
chunks
Use of a certificate later in the chain is regarded as a new node entering the system
The old node is regarded as dead
The new node is allowed to view the old node's chunks
Recovery


This forces nodes to redistribute the chunks that were stored on the old node
Length of chains is limited, in order to
prevent nodes from shirking work by
using a certificate later in the chain
Guarantees




Data on BAR-B can be retrieved within
the lease period
No POM can be gathered against a
node that does not deviate from the
protocol
No node can store more than its quota
A time window is available to nodes
with catastrophic failures for recovery
Evaluation

Performance is inferior to protocols that do not make these guarantees, but acceptable
Impact of additional nodes
Impact of rotating leadership
Impact of fast path
optimization
Fault-Scalable Byzantine Fault-Tolerant Services

Query/Update (Q/U) protocol



Optimistic quorum based protocol
Better throughput and fault-scalability
than Replicated State Machines
Introduces preferred quorum as an
optimization on quorum protocols
Motivation


Compelling need for services and
distributed data structures to be
efficient and fault-tolerant
In Byzantine fault-tolerant systems,
performance drops off sharply as more
faults are tolerated
Fault Scalability

A fault-scalable service is one in which
performance degrades gracefully as
more server faults are tolerated
Operations-based interface


Provides an interface similar to RSMs
Exports interfaces comprised of deterministic methods
  Queries
    Do not modify data
  Updates
    Modify data
Multi-object updates
  Allow a set of objects to be updated together
Properties




Operates correctly under an asynchronous model
Queries and updates are strictly serializable
In benign executions, they are obstruction-free
Cost is an increase in the number of required servers: 5b + 1, rather than 3b + 1
Optimism



Servers store a version history of
objects
Updates are non-destructive to the
objects
Use of logical timestamps based on
contents of update and object state
upon which the update is conditioned
Speedups

Preferred quorum, rather than random
quorum


Addressed later
Efficient cryptographic techniques

Addressed later
Efficiency and Scalability
Efficiency

Most failure-atomic protocols require at least a two-phase commit
  Prepare
  Commit
The optimistic approach does not need a prepare phase
  This introduces the need for clients to repair inconsistent objects
The optimistic approach also obviates the need for locking!
Versioning Servers



In order to allow for this, versioning servers
are employed
Each update creates a new version on the
server
Updates contain information about the
version to be updated.

If no update has been committed since that
version, the update goes through unimpeded.
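
A sketch of the version check a server performs (structure is illustrative; the real protocol conditions on the full object history set, and timestamps are derived from hashes rather than a simple counter):

    class VersionedObject:
        """Server-side object that keeps every version, never overwriting."""
        def __init__(self, initial):
            self.versions = [(0, initial)]  # (logical timestamp, state)

        def latest(self):
            return self.versions[-1]

        def conditioned_update(self, conditioned_on_ts, method):
            latest_ts, state = self.latest()
            if conditioned_on_ts != latest_ts:
                # Client conditioned on a stale version: reject and return
                # the current timestamp so the client can retry or repair.
                return ("fail", latest_ts)
            new_state = method(state)
            self.versions.append((latest_ts + 1, new_state))  # non-destructive
            return ("ok", latest_ts + 1)

    obj = VersionedObject(0)
    print(obj.conditioned_update(0, lambda s: s + 1))  # ('ok', 1)
    print(obj.conditioned_update(0, lambda s: s + 1))  # ('fail', 1): stale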
Throughput-scalability

Additional servers, beyond those
necessary to provide the desired fault
tolerance, can provide additional
throughput
Scaleup pitfall?


Encourage the use of fine-grained objects,
which reduce per-object contention
If the majority of accesses touch individual objects, or only a few objects, then the scaleup pitfall can be avoided
  In the example applications, this holds.
No need to partition


Other systems achieve throughput-scalability by partitioning services
This is unnecessary in this system
The Query/Update Protocol
System model




Asynchronous timing
Clients and servers may be Byzantine faulty
Clients and servers assumed to be
computationally bounded, assuring
effectiveness of cryptography
Failure model is a hybrid failure model



Benign
Malevolent
Faulty
System model

Extends the definition of "fail-prone system" given by Malkhi and Reiter
System model



Point-to-point authenticated channels
exist between all clients and servers
Infrastructure deploying symmetric keys
on all channels
Channels are assumed unreliable

…but, of course, they can be made reliable
Overview


Clients update objects by issuing requests
stamped with object versions to version
servers.
Version servers evaluate these requests.




If the request is conditioned on an out-of-date version, the client's version is corrected and the request reissued
If an out-of-date server is required to reach a quorum, it retrieves an object history from a group of other servers
If the version matches the server's version, the request is executed
Everything else is a variation upon this theme
Overview






Queries are read only methods
Updates modify an object
Methods exported take arguments and return
answers
Clients perform operations by issuing
requests to a quorum
A server receives a request. If it accepts it, it invokes a method
Each update creates a new object version
Overview



The object version is kept with its
logical timestamp in a version history
called the replica history
Servers return replica histories in
response to requests
Clients store replica histories in their object history set (OHS), an array of replica histories indexed by server
Overview


Timestamps in these histories are candidates for future operations
Candidates are classified in order to determine which object version a method should be executed upon
Overview

In non-optimistic operation, a client
may need to perform a repair


Addressed later
To perform an operation, a client first retrieves an object history set. The client's operation is conditioned on this set, which is transmitted with the operation.
Overview


The client sends this operation to a
quorum of servers.
To promote efficiency, the client sends
the request to a preferred quorum


Addressed later
Single phase operation hinges on the
availability of a preferred quorum, and
on concurrency-free access.
Overview




Before executing a request, servers first validate its integrity.
This is important: servers do not communicate object histories directly to each other, so the client's data must be validated.
Servers use authenticators to do this: lists of HMACs that prevent malevolent nodes from fabricating replica histories.
Servers cull replica histories that they cannot validate from the conditioned-on OHS.
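
A sketch of how such an authenticator could be built and checked (illustrative; Q/U's actual construction is specified in the paper). Assume each pair of servers shares a symmetric key; an authenticator is the list of HMACs of a replica history under those pairwise keys:

    import hashlib
    import hmac

    def make_authenticator(history: bytes, pairwise_keys: dict) -> dict:
        """Server i tags its replica history with one HMAC per peer server,
        keyed with the secret it shares with that peer."""
        return {peer: hmac.new(key, history, hashlib.sha256).hexdigest()
                for peer, key in pairwise_keys.items()}

    def validate(history: bytes, authenticator: dict, me: str, key: bytes) -> bool:
        """Server `me` verifies the entry addressed to it. A client cannot
        forge this entry because it never learns the servers' pairwise keys."""
        expected = hmac.new(key, history, hashlib.sha256).hexdigest()
        return hmac.compare_digest(authenticator.get(me, ""), expected)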
Overview – the last bit



Servers validate that they do not have a
higher timestamp in their local replica
histories
Failing this, the client repairs
Passing this, the method is executed,
and the new timestamp created

Timestamps are crafted such that they
always increase in value
Preferred Quorums

Traditional quorum systems use random quorums, but this means that servers frequently need to be synced
  Random selection distributes the load
Preferred quorums access the servers with the most up-to-date data, assuring that syncs happen less often
Preferred Quorums

If a preferred quorum cannot be met,
clients probe for additional servers to add
to the quorum



Authenticators make it impossible to forge
object histories for benign servers
The new host syncs with b+1 host servers, in
order to validate that the data is correct
In the prototype, probing selects servers
such that the load is distributed using a
method parameterized on object ID and
server ID
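
A sketch of deterministic, load-spreading quorum selection (the hashing scheme is illustrative, not the paper's exact method):

    import hashlib

    B = 1               # tolerated Byzantine servers
    N = 5 * B + 1       # total servers in Q/U
    QUORUM = 4 * B + 1  # quorum size in Q/U

    def preferred_quorum(object_id: str, servers: list) -> list:
        """Rank servers by a hash over (object ID, server ID) and take the
        first QUORUM. Every client computes the same preferred quorum for a
        given object, so those servers stay current for it, while different
        objects map to different quorums, spreading the load."""
        def rank(server_id):
            return hashlib.sha256(f"{object_id}:{server_id}".encode()).hexdigest()
        return sorted(servers, key=rank)[:QUORUM]

    servers = [f"s{i}" for i in range(N)]
    print(preferred_quorum("objA", servers))  # a deterministic 5-server quorum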
Concurrency and Repair


Concurrent access to an object may fail
Two operations

Barrier



Barrier candidates have no data associated with them,
and so are safe to select during periods of contention
Barrier advances the logical clock so as to prevent earlier
timestamps from completing
Copy

Copies the latest object data past the barrier, so it can be
acted upon
Concurrency and Repair

Clients may repeatedly barrier each other; to combat this, an exponential backoff strategy is enforced
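
A sketch of the backoff loop (parameters are illustrative):

    import random
    import time

    def with_backoff(attempt_op, base=0.01, cap=1.0):
        """Retry an optimistic operation; after each contention failure,
        sleep a randomized, exponentially growing interval so two clients
        stop barriering each other's operations indefinitely."""
        delay = base
        while True:
            result = attempt_op()  # returns None on contention failure
            if result is not None:
                return result
            time.sleep(random.uniform(0, delay))
            delay = min(cap, delay * 2)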
Classification and Constraints

Based on partial observations of the global system state, an operation may be

Complete
Repairable
  Can be repaired using the copy and barrier strategy
Incomplete
Multi-Object Updates


In this case, servers lock their local copies; if they approve the OHS, the update goes through
If not, a multi-object repair protocol is run
  Repair depends on the ability to establish all objects in the set
  Objects in the set are repairable only if all are repairable; otherwise, objects that would individually be repairable are reclassified as incomplete
An example of all of this
Implementation details
Cached object history set


Clients cache object history sets during execution, and execute updates without first querying.
If the request fails because of an out-of-date OHS, the server returns an up-to-date OHS with the failure
Optimistic query execution


If a client has not accessed an object
recently, it is still possible to complete
in a single phase.
Servers execute the update on the
latest object that they store. Clients
then evaluate the result normally.
Inline repair



Does not require a barrier and copy
Repairs the candidate “in-place,”
obviating the need for a round trip
Only possible in cases where there is no
contention
Handling repeated requests


Mechanisms may cause requests to be
repeated
In order to shortcut other checks, the
timestamp is checked first
Retry and backoff policies


Update-update contention requires retry and backoff to avoid livelock
Update-query contention does not: the query can be completed in place
Object syncing



Only one server needs to send the entire object version state
Others send hashes
The syncing server then calculates the hash and compares it against all the others
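
A sketch of this hash-based sync (illustrative), requiring b+1 matching hashes so that at least one benign server vouches for the full copy:

    import hashlib

    B = 1  # tolerated Byzantine servers

    def sync_object(full_copy: bytes, peer_hashes: list) -> bytes:
        """One server ships the full object version state; the others ship
        only hashes. Accept the copy only if enough peers vouch for it."""
        digest = hashlib.sha256(full_copy).hexdigest()
        if sum(h == digest for h in peer_hashes) >= B + 1:
            return full_copy
        raise ValueError("full copy not vouched for; fetch it from another server")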
Other speedups

Authenticators
  Authenticators use HMACs rather than digital signatures
Compact timestamps
  Timestamps carry a collision-resistant hash of the object history rather than the history itself
Compact replica histories
  Replica histories are pruned based on the conditioned-on timestamp after updates
Malevolent components

The astute among you may have noticed the possibility of a DoS attack by clients that refuse exponential backoff
  Servers could rate-limit clients
Clients could also issue updates to a subset of a quorum, forcing incomplete updates
  Lazy verification can be used to verify correctness of client operations in the background
  The amount of unverified work by a client can then be limited
Correctness

Operations are strictly serializable




To understand, consider the conditioned-on chain: all operations chain back to the initial candidate, and a total order is imposed on all established operations
Operations occur atomically, including those
spanning multiple objects
If no operations span multiple objects, then
correct operations that complete are also
linearizable
Tests


Tests performed on a rack of 76 Intel
Pentium 4 2.8 GHz machines
Implemented an “increment” method
and an NFSv3 metadata service
Fault Scalability
More fault-scalability
Isolated vs Contending
NFSv3 metadata
References

Text and images have been borrowed directly from both papers.