Practical Byzantine Fault Tolerance

Byzantine Techniques II
Presenter: Georgios Piliouras
Partly based on slides by
Justin W. Hart & Rodrigo Rodrigues
Papers
• Practical Byzantine Fault Tolerance.
Miguel Castro et al. (OSDI 1999)
• BAR Fault Tolerance for Cooperative
Services. Amitanand S. Aiyer et al.
(SOSP 2005)
Motivation
• Computer systems provide crucial services
Problem
• Computer systems provide crucial services
• Computer systems fail
– natural disasters
– hardware failures
– software errors
– malicious attacks
Need highly-available services
Replication
[figure: unreplicated service (client, single server) vs. replicated service (client, set of replicas)]
Replication algorithm:
• masks a fraction of faulty replicas
• high availability if replicas fail “independently”
Assumptions are a Problem
• Replication algorithms make assumptions:
– behavior of faulty processes
– synchrony
– bound on number of faults
• Service fails if assumptions are invalid
– attacker will work to invalidate assumptions
Most replication algorithms assume too much
Contributions
• Practical replication algorithm:
– weak assumptions ⇒ tolerates attacks
– good performance
• Implementation
– BFT: a generic replication toolkit
– BFS: a replicated file system
• Performance evaluation
BFS is only 3% slower than a standard file system
Talk Overview
• Problem
• Assumptions
• Algorithm
• Implementation
• Performance
• Conclusions
Bad Assumption: Benign Faults
• Traditional replication assumes:
– replicas fail by stopping or omitting steps
• Invalid with malicious attacks:
– compromised replica may behave arbitrarily
– single fault may compromise service
– decreased resiliency to malicious attacks
[figure: attacker replaces a replica's code]
BFT Tolerates Byzantine Faults
• Byzantine fault tolerance:
– no assumptions about faulty behavior
• Tolerates successful attacks
– service remains available even when an attacker controls some replicas
Byzantine-Faulty Clients
• Bad assumption: client faults are benign
– clients easier to compromise than replicas
• BFT tolerates Byzantine-faulty clients:
– access control
– narrow interfaces
– enforce invariants
[figure: attacker replaces a client's code]
Support for complex service operations is important
Bad Assumption: Synchrony
• Synchrony ⇒ known bounds on:
– delays between steps
– message delays
• Invalid with denial-of-service attacks:
– bad replies due to increased delays
• Assumed by most Byzantine fault tolerance algorithms
Asynchrony
• No bounds on delays
• Problem: replication is impossible
Solution in BFT:
• provide safety without synchrony
– guarantees no bad replies
• assume eventual time bounds for liveness
– may not reply during an active denial-of-service attack
– will reply when the denial-of-service attack ends
Talk Overview
• Problem
• Assumptions
• Algorithm
• Implementation
• Performance
• Conclusions
Algorithm Properties
• Arbitrary replicated service
– complex operations
– mutable shared state
• Properties (safety and liveness):
– system behaves as a correct centralized service
– clients eventually receive replies to requests
• Assumptions:
– 3f+1 replicas to tolerate f Byzantine faults (optimal)
– strong cryptography
– only for liveness: eventual time bounds
Algorithm
State machine replication:
– deterministic replicas start in the same state
– replicas execute the same requests in the same order
– correct replicas produce identical replies
Hard part: ensure requests execute in the same order
Ordering Requests
Primary-Backup:
• View designates the primary replica; the others are backups
• Primary picks the ordering
• Backups ensure the primary behaves correctly:
– certify correct ordering
– trigger view changes to replace a faulty primary
Rough Overview of Algorithm
• A client sends a request for a service operation to the primary
• The primary multicasts the request to the backups
• Replicas execute the request and send a reply to the client
• The client waits for f+1 replies from different replicas with the same result; this is the result of the operation
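The client's final step fits in a minimal Python sketch (names are illustrative, not the BFT toolkit's API): with at most f faulty replicas, any result reported by f+1 distinct replicas is vouched for by at least one correct replica.

```python
from collections import Counter

def accept_result(replies: dict, f: int):
    """`replies` maps replica id -> result. Return a result once f+1
    distinct replicas report the same one; otherwise keep waiting."""
    counts = Counter(replies.values())
    for result, votes in counts.items():
        if votes >= f + 1:   # at least one of these replicas is correct
            return result
    return None  # not enough matching replies yet: wait or retransmit

# Example with f = 1: two matching replies out of four replicas suffice.
assert accept_result({0: "ok", 1: "ok", 3: "err"}, f=1) == "ok"
```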
Quorums and Certificates
• Quorums have at least 2f+1 of the 3f+1 replicas
• Any two quorums intersect in at least one correct replica
[figure: two overlapping quorums A and B among the 3f+1 replicas]
• Certificate = a set of messages from a quorum
• Algorithm steps are justified by certificates
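The quorum arithmetic can be checked mechanically. A short Python sketch (ours, for illustration): with n = 3f+1 replicas and quorums of size 2f+1, the pigeonhole principle gives an intersection of at least f+1 replicas, of which at most f can be faulty.

```python
def check_quorum_intersection(f: int) -> None:
    n = 3 * f + 1            # total replicas
    q = 2 * f + 1            # quorum size
    min_overlap = 2 * q - n  # pigeonhole bound on any two quorums' intersection
    assert min_overlap == f + 1
    assert min_overlap - f >= 1  # at most f faulty, so one correct replica remains

for f in range(1, 20):
    check_quorum_intersection(f)
```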
Algorithm Components
• Normal case operation
• View changes
• Garbage collection
• Recovery
All have to be designed to work together
Normal Case Operation
• Three phase algorithm:
– pre-prepare picks order of requests
– prepare ensures order within views
– commit ensures order across views
• Replicas remember messages in log
• Messages are authenticated
– ⟨m⟩k denotes a message m signed by replica k
Pre-prepare Phase
• Primary assigns sequence number n to request m in view v
• Primary multicasts ⟨PRE-PREPARE,v,n,m⟩0
[figure: primary (replica 0) multicasts to replicas 1–3; replica 3 is faulty]
• Backups accept the pre-prepare if:
– in view v
– never accepted a pre-prepare for v,n with a different request
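A hedged sketch of the backup's acceptance test in Python (the state layout is our illustration, not the paper's data structures):

```python
def accept_preprepare(current_view: int, accepted: dict, v: int, n: int, m: str) -> bool:
    """Accept PRE-PREPARE<v,n,m> only in the current view, and only if no
    pre-prepare binding a different request to (v, n) was accepted before.
    `accepted` maps (view, sequence number) -> request."""
    if v != current_view:
        return False
    if accepted.get((v, n), m) != m:
        return False              # a different request already holds (v, n)
    accepted[(v, n)] = m          # remember the binding for future checks
    return True
```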
Prepare Phase
• A backup that accepted ⟨PRE-PREPARE,v,n,m⟩0 multicasts ⟨PREPARE,v,n,D(m)⟩1, where D(m) is the digest of m
[figure: pre-prepare and prepare rounds among replicas 0–3; replica 3 is faulty]
• All collect the pre-prepare and 2f matching prepares: P-certificate(m,v,n)
Order Within View
No P-certificates with the same view and sequence number and different requests
If it were false: the quorums for P-certificate(m,v,n) and P-certificate(m’,v,n) would have one correct replica in common ⇒ m = m’
[figure: two overlapping quorums, one per certificate]
Commit Phase
• A replica with P-certificate(m,v,n) multicasts ⟨COMMIT,v,n,D(m)⟩2
[figure: pre-prepare, prepare, commit, and reply rounds; replica 3 is faulty]
• All collect 2f+1 matching commits: C-certificate(m,v,n)
Request m is executed after:
• having C-certificate(m,v,n)
• executing all requests with sequence number less than n
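Putting the two certificate conditions together, here is a minimal Python sketch (the message tuples and log layout are illustrative, not the BFT library's structures): a request is prepared once the log holds a matching pre-prepare plus 2f prepares from distinct replicas, and committed once 2f+1 matching commits arrive as well.

```python
def prepared(log, v, n, d, f):
    """P-certificate: a matching PRE-PREPARE plus 2f PREPAREs from
    distinct replicas for the same (view, sequence number, digest)."""
    has_preprepare = any(m[0] == "PRE-PREPARE" and m[1:4] == (v, n, d) for m in log)
    prepare_senders = {m[4] for m in log
                       if m[0] == "PREPARE" and m[1:4] == (v, n, d)}
    return has_preprepare and len(prepare_senders) >= 2 * f

def committed(log, v, n, d, f):
    """C-certificate: prepared, plus 2f+1 matching COMMITs from distinct replicas."""
    commit_senders = {m[4] for m in log
                      if m[0] == "COMMIT" and m[1:4] == (v, n, d)}
    return prepared(log, v, n, d, f) and len(commit_senders) >= 2 * f + 1
```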
View Changes
• Provide liveness when the primary fails:
– timeouts trigger view changes
– select new primary (= view number mod 3f+1)
• But also need to:
– preserve safety
– ensure replicas are in the same view long enough
– prevent denial-of-service attacks
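The rotation rule above fits in one line of Python (a sketch; replica ids 0..3f are assumed consecutive):

```python
def primary_of(view: int, f: int) -> int:
    """The slide's rotation rule: the primary of a view is the replica
    whose id equals the view number mod 3f+1."""
    return view % (3 * f + 1)

assert primary_of(view=5, f=1) == 1   # with 4 replicas, view 5 -> replica 1
```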
View Change Protocol
• Backups send their P-certificates: ⟨VIEW-CHANGE,v+1,P⟩2
[figure: replica 0, the primary of view v, fails; replica 1 becomes the primary of view v+1]
• The new primary collects 2f VC-messages in X and multicasts ⟨NEW-VIEW,v+1,X,O⟩1
• O contains pre-prepare messages for view v+1 with the same sequence numbers
• Backups multicast prepare messages for the pre-prepares in O
View Change Safety
Goal: no C-certificates with the same sequence number and different requests
• Intuition: if a replica has C-certificate(m,v,n), then any quorum Q contains a correct replica with P-certificate(m,v,n)
[figure: the quorum for C-certificate(m,v,n) intersects any quorum Q]
Garbage Collection
Truncate the log with a certificate:
• periodically checkpoint state (every K requests)
• multicast ⟨CHECKPOINT,n,D(checkpoint)⟩i
• all collect 2f+1 matching checkpoint messages: S-certificate(h,checkpoint)
• discard messages and checkpoints with sequence numbers up to h
[figure: log window between low watermark h and high watermark H = h + 2K]
• send checkpoint in view changes
• reject messages with sequence numbers outside the window
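A sketch of the resulting acceptance window in Python (the checkpoint interval K = 128 is an illustrative value, not prescribed by the slide):

```python
K = 128  # checkpoint interval: checkpoint once every K requests

def within_watermarks(n: int, h: int) -> bool:
    """After a stable checkpoint at sequence number h, a replica only accepts
    messages whose sequence number lies between the watermarks h and H = h + 2K."""
    return h < n <= h + 2 * K
```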
Formal Correctness Proofs
• Complete safety proof with I/O automata
– invariants
– simulation relations
• Partial liveness proof with timed I/O automata
– invariants
Communication Optimizations
• Digest replies: send only one reply to the client with the result
• Optimistic execution: execute prepared requests
– client collects 2f+1 matching replies; read-write operations execute in two round-trips
• Read-only operations: executed in current state
– client collects 2f+1 matching replies; read-only operations execute in one round-trip
Talk Overview
• Problem
• Assumptions
• Algorithm
• Implementation
• Performance
• Conclusions
BFS: A Byzantine-Fault-Tolerant NFS
[figure: the client runs the andrew benchmark over a kernel NFS client, a relay, and the replication library; each of replicas 0..n runs snfsd over the replication library and the kernel VM]
No synchronous writes – stability through replication
Talk Overview
• Problem
• Assumptions
• Algorithm
• Implementation
• Performance
• Conclusions
Andrew Benchmark
[figure: elapsed time in seconds, 0–70, for BFS vs. BFS-nr]
Configuration
• 1 client, 4 replicas
• Alpha 21064, 133 MHz
• Ethernet 10 Mbit/s
• BFS-nr is exactly like BFS but without replication
• 30 times worse with digital signatures
BFS is Practical
[figure: elapsed time in seconds, 0–70, for BFS vs. NFS]
Configuration
• 1 client, 4 replicas
• Alpha 21064, 133 MHz
• Ethernet 10 Mbit/s
• Andrew benchmark
• NFS is the Digital Unix NFS V2 implementation
BFS is Practical 7 Years Later
[figure: elapsed time in seconds, 0–500, for BFS, BFS-nr, and NFS]
Configuration
• 1 client, 4 replicas
• Pentium III, 600 MHz
• Ethernet 100 Mbit/s
• 100x Andrew benchmark
• NFS is the Linux 2.2.12 NFS V2 implementation
Conclusions
Byzantine fault tolerance is practical:
– Good performance
– Weak assumptions ⇒ improved resiliency
What happens if we go MAD?
• Several useful cooperative services
span Multiple Administrative Domains.
– Internet routing
– File distribution
– Cooperative backup, etc.
• Dealing only with Byzantine behaviors is
not enough.
Why?
• Nodes are under the control of multiple administrators
• Broken – Byzantine behaviors
– misconfigured, or configured with malicious intent
• Selfish – Rational behaviors
– alter the protocol to increase local utility
Talk Overview
• Problem
• BAR Model
• 3 Level Architecture
• Performance
• Conclusions
It is time to raise the BAR
• Byzantine
– Behaving arbitrarily or maliciously
• Altruistic
– Execute the proposed program, whether it
benefits them or not
• Rational
– Deviate from the proposed program for
purposes of local benefit
Protocols
• Incentive-Compatible Byzantine Fault Tolerant (IC-BFT)
– it is in the best interest of rational nodes to follow the protocol exactly
• Byzantine Altruistic Rational Tolerant (BART)
– guarantees a set of safety and liveness properties despite the presence of rational nodes
• IC-BFT ⊆ BART (every IC-BFT protocol is also BART)
General idea
• Extend/Modify the Practical Byzantine
Fault Tolerance Model in a way that
combats the negative effects of rational
(greedy) behavior.
• We will achieve that by using game-theoretic tools.
A taste of Nash Equilibrium
The game of chicken (row player's payoff listed first):

              Swerve     Go Straight
Swerve         0, 0       -1, +1
Go Straight   +1, -1     -100, -100
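The two pure Nash equilibria, in which exactly one player swerves, can be found by brute force. A small Python sketch (payoffs copied from the table above):

```python
SWERVE, STRAIGHT = 0, 1
# (row move, column move) -> (row player's payoff, column player's payoff)
payoff = {
    (SWERVE, SWERVE): (0, 0),      (SWERVE, STRAIGHT): (-1, 1),
    (STRAIGHT, SWERVE): (1, -1),   (STRAIGHT, STRAIGHT): (-100, -100),
}

def is_nash(r, c):
    """Neither player can gain by unilaterally changing their own move."""
    u_r, u_c = payoff[(r, c)]
    row_ok = all(payoff[(r2, c)][0] <= u_r for r2 in (SWERVE, STRAIGHT))
    col_ok = all(payoff[(r, c2)][1] <= u_c for c2 in (SWERVE, STRAIGHT))
    return row_ok and col_ok

print([rc for rc in payoff if is_nash(*rc)])  # -> [(0, 1), (1, 0)]
```

Neither (Swerve, Swerve) nor (Go Straight, Go Straight) survives: in each case one player gains by deviating unilaterally.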
Naughty nodes are punished
• Nodes require access to a state
machine in order to complete their
objectives
• Protocol contains methods for punishing
rational nodes, including denying them
access to the state machine
Talk Overview
• Problem
• BAR Model
• 3 Level Architecture
• Performance
• Conclusions
Three-Level Architecture
• Layered design
– simplifies analysis/construction of systems
– isolates classes of misbehavior at appropriate levels of abstraction
Level 1 – Basic Primitives
• Goals:
– Provide IC-BFT versions of key abstractions
– Ensure long-term benefit to participants
– Limit non-determinism
– Mitigate the effects of residual non-determinism
– Enforce predictable communication patterns
Level 1 – Basic Primitives
• BART-RSM based on PBFT
• Differences: use TRB instead of consensus
• 3f+2 nodes required to tolerate f faulty nodes
Level 1 – Basic Primitives
• The RSM rotates the leadership role among participants
• Participants want to stay in the system
in order to control the RSM and
complete their protocols
• Ultimately, incentives stem from the
higher level service
Level 1 – Basic Primitives
• Self-interested nodes could hide behind non-determinism to shirk work.
– Tit-for-Tat policy
– Communicating proofs of misconduct (POMs) leads to global punishment
• Use Terminating Reliable Broadcast, rather
than consensus.
– In TRB, only the sender can propose a value
– Other nodes can only adopt this value, or choose
a default value
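A minimal sketch of the receiver's decision in TRB as Python (the default value's name is ours, not the paper's constant):

```python
DEFAULT = "SenderFaulty"  # illustrative default value

def trb_decide(sender_value, timed_out: bool):
    """In Terminating Reliable Broadcast only the sender proposes; a receiver
    adopts the sender's value, or falls back to the default value when the
    sender fails to deliver one in time."""
    return DEFAULT if timed_out or sender_value is None else sender_value
```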
Level 1 – Basic Primitives
• Balance costs
– No incentive to make the wrong choice
• Encourage timeliness
– by allowing nodes to judge unilaterally whether other nodes’ messages are late and inflict sanctions on them (penance)
Level 1 – Basic Primitives
• Nodes have to have participated at every step in order to have the opportunity to issue a command
• Message queues
[figure: a node announces “I am waiting for a message from x”; once x’s message arrives, it moves on to waiting for y, and so on]
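One way to read the message-queue mechanism is the following Python sketch (class and method names are our illustration, not the paper's API):

```python
from collections import deque

class Participant:
    """A node earns the right to issue a command only after the message it is
    waiting for has arrived at every step, tracked by a queue of awaited senders."""
    def __init__(self, awaited_senders):
        self.waiting_on = deque(awaited_senders)  # e.g. deque(["x", "x", "y"])

    def deliver(self, sender):
        # "I am waiting for a message from x": only the head of the queue counts.
        if self.waiting_on and sender == self.waiting_on[0]:
            self.waiting_on.popleft()

    def may_issue_command(self) -> bool:
        return not self.waiting_on  # participated at every step
```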
Theorems
• Theorem 1: The TRB protocol satisfies
Termination, Agreement, Integrity and
Non-Triviality
• Theorem 2: No node has a unilateral
incentive to deviate from the protocol
Level 2
• State machine replication is sufficient to support a backup service, but the overhead is unacceptable
– 100 participants… 100 MB backed up… 10 GB of drive space
• Assign work to individual nodes, using arithmetic codes to provide low-overhead fault-tolerant storage
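The slide's numbers are easy to check; a worked calculation in Python under the full-replication reading (every node stores every participant's backup):

```python
participants = 100
backup_mb = 100                          # per participant
per_node_mb = participants * backup_mb   # each node stores everyone's data
assert per_node_mb == 10_000             # 10,000 MB = 10 GB of drive space
```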
Guaranteed Response
• Direct communication is insufficient
when nodes can behave rationally
• We introduce a “witness” that overhears
the conversation
• This eliminates ambiguity
• Messages are routed through this
intermediary
Guaranteed Response
• Node A sends a request to Node B
through the witness
• The witness stores the request, and
enters RequestReceived state
• Node B sends a response to Node A
through the witness
• The witness stores the response, and enters the ResponseReceived state
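A sketch of the witness's state transitions in Python (an illustrative class, not the paper's code):

```python
class Witness:
    """Relays and records both halves of the exchange, so neither party
    can later deny what was sent."""
    def __init__(self):
        self.state = "Idle"
        self.log = []

    def relay_request(self, request):       # Node A -> witness -> Node B
        self.log.append(("request", request))
        self.state = "RequestReceived"

    def relay_response(self, response):     # Node B -> witness -> Node A
        self.log.append(("response", response))
        self.state = "ResponseReceived"
```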
Guaranteed Response
• Deviation from this protocol will cause the witness to notice either a timeout from Node B or lying on the part of Node A
Optimization through Credible Threats
• Return to game theory
• Protocol is optimized so nodes can communicate directly: add a fast path
• If the recipient does not respond, nodes proceed to the unoptimized case
• Analogous to a game of chicken
Periodic Work Protocol
• Witness checks that periodic tasks, such as system maintenance, are performed
• It is expected that, with a certain
frequency, each node in the system will
perform such a task
• Failure to perform one will generate a
POM from the witness
Authoritative Time Service
• Maintains authoritative time
• Binds messages sent to that time
• Guaranteed response protocol relies on
this for generating NoResponses
Authoritative Time Service
• Each submission to the state machine
contains the timestamp of the proposer
• Timestamp is taken to be the maximum
of the median of timestamps of the
previous f+1 decisions
• If “no decision” is decided, then the
timestamp is the previous authoritative
time
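One plausible reading of this rule as Python (the signature is ours; the slide's wording leaves some ambiguity about how the maximum and the median combine):

```python
from statistics import median

def authoritative_time(proposer_ts, previous_ts, no_decision, prev_auth_time):
    """previous_ts: timestamps of the previous f+1 decisions. The new
    authoritative time is the maximum of the proposer's timestamp and the
    median of those; a "no decision" round reuses the previous time."""
    if no_decision:
        return prev_auth_time
    return max(proposer_ts, median(previous_ts))
```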
Level 3 – BAR-B
• BAR-B is a cooperative backup system
• Three operations
– Store
– Retrieve
– Audit
Storage
• Nodes break files up into chunks
• Chunks are encrypted
• Chunks are stored on remote nodes
• Remote nodes send signed receipts and store StoreInfos
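A sketch of the chunking step in Python (encryption and the signed-receipt exchange are elided; naming chunks by digest is our illustrative choice, not BAR-B's format):

```python
import hashlib

def store_chunks(data: bytes, chunk_size: int = 4096):
    """Split a file into fixed-size chunks and name each by its digest,
    as a StoreInfo-style record might."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk

# Example: a 10 KB file yields three chunks.
assert len(list(store_chunks(b"x" * 10_240))) == 3
```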
Retrieval
• A node storing a chunk can respond to
a request for a chunk with
– The chunk
– A demonstration that the chunk’s lease has
expired
– A more recent StoreInfo
Auditing
• Receipts constitute audit records
• Nodes will exchange receipts in order to
verify compliance with storage quotas
Talk Overview
• Problem
• BAR Model
• 3 Level Architecture
• Performance
• Conclusions
Evaluation
• Performance is inferior to protocols that do not make these guarantees, but acceptable (?)
[figures: impact of additional nodes, of rotating leadership, and of the fast path optimization]
Talk Overview
• Problem
• BAR Model
• 3 Level Architecture
• Performance
• Conclusions
Conclusions
• More useful as a proof of concept, but it certainly explores a very interesting common ground between systems and game theory as a way of reasoning about the performance of real-life systems.
• CS 614 is over