Byzantine Techniques II
Justin W. Hart
CS 614
12/01/2005

Papers
- BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer et al. (SOSP 2005)
- Fault-scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et al. (SOSP 2005)

BAR Fault Tolerance for Cooperative Services
- BAR Model
- General Three-Level Architecture
- BAR-B

Motivation
- A "general approach to constructing cooperative services that span multiple administrative domains (MADs)"
- Why is this difficult? Nodes are under the control of multiple administrators, and may be:
  - Broken: Byzantine behaviors. Misconfigured, or configured with malicious intent.
  - Selfish: rational behaviors. Such nodes alter the protocol to increase their local utility.

Other models?
- Byzantine models account for Byzantine behavior, but do not handle rational behavior.
- Rational models account for rational behavior, but may break under Byzantine behavior.

BAR Model
- Byzantine: behave arbitrarily or maliciously
- Altruistic: execute the proposed program, whether it benefits them or not
- Rational: deviate from the proposed program for purposes of local benefit
- BART: BAR Tolerant

It's a cruel world
- At most (n-2)/3 nodes in the system are Byzantine
- The rest are rational

Two classes of protocols
- Incentive-Compatible Byzantine Fault Tolerant (IC-BFT): guarantees a set of safety and liveness properties, and makes it in the best interest of rational nodes to follow the protocol exactly
- Byzantine Altruistic Rational Tolerant (BART): guarantees a set of safety and liveness properties despite the presence of rational nodes
- IC-BFT is a subset of BART

An important concept
- It isn't enough for a protocol to survive drills against a handful of attacks; it must provably provide its guarantees.

A flavor of things to come
- The protocol builds on Practical Byzantine Fault Tolerance in order to combat Byzantine behavior
- The protocol uses game-theoretic concepts in order to combat rational behavior

A taste of Nash equilibrium: the game of "chicken"

                   Swerve       Go Straight
    Swerve         0, 0         -1, +1
    Go Straight    +1, -1       -100, -100  (X_X)
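To make the equilibrium idea concrete, here is a minimal Python sketch (mine, not the papers') that enumerates the pure-strategy Nash equilibria of the chicken matrix above. The payoff values come straight from the table; everything else is illustrative.

```python
# A minimal sketch: find the pure-strategy Nash equilibria of "chicken".
ACTIONS = ["Swerve", "Go Straight"]

# PAYOFF[(row, col)] = (row player's payoff, column player's payoff)
PAYOFF = {
    ("Swerve", "Swerve"): (0, 0),
    ("Swerve", "Go Straight"): (-1, +1),
    ("Go Straight", "Swerve"): (+1, -1),
    ("Go Straight", "Go Straight"): (-100, -100),
}

def is_nash(row, col):
    """A profile is a Nash equilibrium if neither player can improve
    their own payoff by unilaterally switching actions."""
    r_pay, c_pay = PAYOFF[(row, col)]
    row_ok = all(PAYOFF[(alt, col)][0] <= r_pay for alt in ACTIONS)
    col_ok = all(PAYOFF[(row, alt)][1] <= c_pay for alt in ACTIONS)
    return row_ok and col_ok

for row in ACTIONS:
    for col in ACTIONS:
        if is_nash(row, col):
            print("Nash equilibrium:", row, "/", col)
# Prints the two asymmetric equilibria: (Swerve, Go Straight) and
# (Go Straight, Swerve). An IC-BFT protocol aims to make "follow the
# protocol exactly" such an equilibrium for every rational node.
```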
…and the nodes are starving!
- Nodes require access to a state machine in order to complete their objectives
- The protocol contains methods for punishing rational nodes, including denying them access to the state machine

An expensive notion of identity
- Identity is established through cryptographic keys assigned by a trusted authority
- Prevents Sybil attacks
- Bounds the number of Byzantine nodes
- Gives rational nodes reason to consider the long-term consequences of their actions
- Gives real-world grounding to identity

Assumptions about rational nodes
- "Receive long-term benefit from staying in the protocol"
- "Conservative when computing the impact of Byzantine nodes on their utility"
- "If the protocol provides a Nash equilibrium, then all rational nodes will follow it"
- "Rational nodes do not collude…colluding nodes are classified as Byzantine"

Byzantine nodes
- Byzantine fault model
- Strong adversary: the adversary can coordinate collusion attacks

Important concepts
- Promptness principle
- Proof of Misbehavior (POM)
- Cost balancing

Promptness principle
- If a rational node gains no benefit from delaying a message, it will send it as soon as possible

Proof of Misbehavior (POM)
- A self-contained, cryptographic proof of wrongdoing
- Provides accountability: nodes answer for their actions

Example of a POM (sketched in code below)
- Node A requests that Node B store a chunk
- Node B replies that it has stored the chunk
- Later, Node A requests that chunk back
- Node B sends back random garbage (it hadn't stored the chunk) and a signature
- Because Node A stored a hash of the chunk, it can demonstrate misbehavior on the part of Node B

…but it's a bit more complicated than that!
- This corresponds to a rather simple behavior to combat: "aggressively Byzantine" behavior

Passive-aggressive behaviors
- Harder cases than "aggressively Byzantine"
- A malicious Node A could merely lie about misbehavior on the part of Node B
- A node could exploit non-determinism in order to shirk work

Cost balancing
- If two behaviors have the same cost, there is no reason to choose the wrong one

Three-Level Architecture
- Level 1
  - Unilaterally deny service to nodes that fail to deliver messages ("tit-for-tat")
  - Balance costs, so there is no incentive to make the wrong choice
  - Penance: unilaterally impose extra work on nodes with untimely responses
- Level 2
  - Failure to respond to a request by the state machine will generate a POM from a quorum of nodes in the state machine
- Level 3
  - Makes use of reliable work assignment
  - Needs only to provide sufficient information to identify valid request/response pairs

Nuts and Bolts: Level 1 and Level 2

Level 1
- Ensure long-term benefit to participants
  - The RSM rotates the leadership role among participants; participants want to stay in the system in order to control the RSM and complete their protocols
- Limit non-determinism
  - Self-interested nodes could hide behind non-determinism to shirk work
  - Use Terminating Reliable Broadcast (TRB) rather than consensus: in TRB, only the sender can propose a value, and other nodes can only adopt this value or choose a default value
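Returning to the POM example above: here is a minimal sketch of how a stored hash plus a signed reply yields a self-contained proof of misbehavior. For brevity, "signatures" are stood in for by HMACs under a key assumed known to the verifier; the real protocol uses public-key signatures from the trusted authority, and all names here are illustrative.

```python
# A minimal sketch (not the paper's implementation) of the storage POM.
import hashlib, hmac

KEY_B = b"node-B-signing-key"  # hypothetical key material

def sign(key, payload):
    return hmac.new(key, payload, hashlib.sha256).digest()

# 1. A asks B to store a chunk, remembering only its hash.
chunk = b"backup data"
stored_hash = hashlib.sha256(chunk).digest()

# 2. Later, B returns some data plus a signature over it.
returned = b"random garbage"          # B never stored the chunk
signature = sign(KEY_B, returned)

# 3. The pair (returned, signature) is a self-contained POM if the
#    signature verifies but the data does not match the stored hash:
#    B provably signed the wrong bytes.
def is_pom(returned, signature, stored_hash):
    signed_by_b = hmac.compare_digest(signature, sign(KEY_B, returned))
    matches = hashlib.sha256(returned).digest() == stored_hash
    return signed_by_b and not matches

print(is_pom(returned, signature, stored_hash))  # True: misbehavior proven
```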
Level 1 (continued)
- Mitigate the effects of residual non-determinism through cost balancing
- Encourage timeliness
  - The protocol's preferred choice is no more expensive than any other
  - Nodes can inflict sanctions on untimely messages
- Enforce predictable communication patterns
  - Nodes must have participated at every step in order to have the opportunity to issue a command

Terminating Reliable Broadcast
- 3f+2 nodes, rather than 3f+1
- Suppose a sender "s" is slow
  - The same group of nodes now wants to determine that "s" is slow, and a new leader is elected
  - Every node but "s" wants a timely conclusion to this, in order to get its own turn to propose a value to the state machine
  - "s" is not allowed to participate in this quorum

TRB provides a few guarantees
- They differ between periods of synchrony and periods of asynchrony
- In synchrony:
  - Termination: every non-Byzantine process delivers exactly one message
  - Agreement: if one non-Byzantine process delivers a message m, then all non-Byzantine processes eventually deliver m
- In asynchrony:
  - Integrity: if a non-Byzantine process delivers m, then the sender sent m
  - Non-Triviality: if the sender is non-Byzantine and sends m, then the sender eventually delivers m

Message queue
- Enforces predictable communication patterns

Bubbles
- A simple retaliation policy (sketched in code below)
- Node A's message queue is filled with messages that it intends to send to Node B
- This message queue is interleaved with bubbles
- Bubbles contain predicates indicating messages expected from B
- No message except one matching the expected predicate from B can fill the bubble
- No messages in A's queue will go to B until B fills the bubble
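A minimal sketch of the bubble mechanism, under assumptions (class and method names are mine, not the paper's): outgoing messages to a peer are held whenever the head of the queue is a bubble whose predicate the peer has not yet satisfied.

```python
from collections import deque

class Bubble:
    def __init__(self, predicate):
        self.predicate = predicate  # what we expect the peer to send

class BubbleQueue:
    def __init__(self):
        self.queue = deque()

    def enqueue_message(self, msg):
        self.queue.append(msg)

    def enqueue_bubble(self, predicate):
        self.queue.append(Bubble(predicate))

    def on_receive(self, msg):
        # Only a message matching the head bubble's predicate fills it.
        if self.queue and isinstance(self.queue[0], Bubble):
            if self.queue[0].predicate(msg):
                self.queue.popleft()

    def drain(self):
        # Send messages until we hit an unfilled bubble.
        while self.queue and not isinstance(self.queue[0], Bubble):
            yield self.queue.popleft()

# Usage: A queues work for B, then a bubble expecting B's acknowledgment.
q = BubbleQueue()
q.enqueue_message("msg-1")
q.enqueue_bubble(lambda m: m == "ack-1")
q.enqueue_message("msg-2")
print(list(q.drain()))   # ['msg-1'] -- msg-2 is stuck behind the bubble
q.on_receive("ack-1")    # B satisfies the predicate
print(list(q.drain()))   # ['msg-2']
```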
Balanced messages
- We've already discussed this quite a bit; it is assured at this level of the protocol
- This is where we get our gigantic timeout message

Penance
- The untimely vector tracks a node's perception of the responsiveness of other nodes
- When a node becomes the sender, it includes its untimely vector with its message
- All nodes but the sender receive penance messages from each node
- Because of bubbles, each untimely node must send a penance message back in order to continue using the system; this penalizes those nodes
- The sender is excluded from this process, because it might be motivated to lie in its untimely vector in order to avoid the work of transmitting penance messages

Timeouts and garbage collection
- set-turn timeout: the timeout for taking leadership away from the sender
  - Initially 10 seconds in this implementation, in order to overcome all expected network delays
  - Can only be changed by the sender
- max_response_time: the time at which a node is removed from the system, its messages discarded, and its resources garbage collected
  - Set to one week or one month in the prototypes

Global punishment
- Badlists transform local suspicion into POMs
- Suspicion is recorded in a node's local badlist
- The sender includes its badlist with its message
- If, over time, recipients see a node in f+1 different senders' badlists, then they too consider that node to be faulty

Proof
- The real proofs do not appear in this paper; they appear in the technical report. But here's a bit:
- Theorem 1: the TRB protocol satisfies Termination, Agreement, Integrity, and Non-Triviality
- Theorem 2: no node has a unilateral incentive to deviate from the protocol
  - Lemma 1: no rational node r benefits from delaying the "set-turn" message. Follows from penance.
  - Lemma 2: no rational node r benefits from sending the "set-turn" message early. Sending early could cause a senderTO message to be sent (the protocol uses synchronized clocks, and all messages are cryptographically signed).
  - Lemma 3: no rational node r benefits from sending a malformed "set-turn" message. The "set-turn" message contains only the turn number, so this reduces to either sending early (dealt with in Lemma 2) or sending late (dealt with in Lemma 1).

Level 2
- State machine replication is sufficient to support a backup service, but the overhead is unacceptable: with 100 participants each backing up 100 MB, full replication costs 10 GB of drive space per node
- Instead, assign work to individual nodes, using erasure codes to provide low-overhead fault-tolerant storage

Guaranteed response
- Direct communication is insufficient when nodes can behave rationally
- Introduce a "witness" that overhears the conversation; this eliminates ambiguity
- Messages are routed through this intermediary:
  - Node A sends a request to Node B through the witness; the witness stores the request and enters the RequestReceived state
  - Node B sends a response to Node A through the witness; the witness stores the response and enters the ResponseReceived state
- Deviation from this protocol will cause the witness to notice either the timeout from Node B or lying on the part of Node A
- (A sketch of the witness's state machine appears at the end of this part.)

Implementation
- The system must remain incentive-compatible
- Communication with the witness is not in the form of actual message sending; it is in the form of a command to the RSM
- Theorem 3: if the witness node enters the RequestReceived state for some work w assigned to rational node b, then b will execute w
  - Holds if sufficient sanctions exist to motivate b to do so

State limiting
- State is limited by limiting the number of slots (nodes with which a node can communicate) available to a node
- Applies a limit to the memory overhead
- Limits the rate at which requests are inserted into the system
- Forces nodes to acknowledge responses to requests: nodes want their slots back
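As promised above, a minimal sketch of the witness's request/response state machine for the guaranteed-response protocol. State names follow the slides; the timeout handling and all other names are simplifying assumptions, not the paper's code.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    REQUEST_RECEIVED = auto()
    RESPONSE_RECEIVED = auto()

class Witness:
    def __init__(self):
        self.state = State.IDLE
        self.request = None
        self.response = None

    def forward_request(self, request):
        # A's request is stored before being forwarded to B, so there is
        # an unambiguous record of what was asked.
        self.request = request
        self.state = State.REQUEST_RECEIVED

    def forward_response(self, response):
        assert self.state is State.REQUEST_RECEIVED
        self.response = response
        self.state = State.RESPONSE_RECEIVED

    def on_timeout(self):
        # If B never responds, the witness itself can attest to the
        # missing response; A cannot lie about it, and B cannot deny it.
        if self.state is State.REQUEST_RECEIVED:
            return ("NoResponse", self.request)
        return None

w = Witness()
w.forward_request("store chunk 42")
print(w.on_timeout())  # ('NoResponse', 'store chunk 42')
```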
Optimization through credible threats
- Returns to game theory
- The protocol is optimized so nodes can communicate directly: add a fast path
- Nodes register "vows" with the witness
- If the recipient does not respond, nodes proceed to the unoptimized case
- Analogous to a driver in "chicken" throwing their steering wheel out the window

Periodic work protocol
- The witness checks that periodic tasks, such as system maintenance, are performed
- It is expected that, with a certain frequency, each node in the system will perform such a task
- Failure to perform one generates a POM from the witness

Authoritative time service
- Maintains authoritative time and binds sent messages to that time
- The guaranteed-response protocol relies on this for generating NoResponses
- Each submission to the state machine contains the proposer's timestamp
- The authoritative timestamp is taken to be the maximum of the median of the timestamps of the previous f+1 decisions; if "no decision" is decided, the timestamp is the previous authoritative time (see the sketch at the end of this part)

Level 3: BAR-B
- BAR-B is a cooperative backup system
- Three operations: Store, Retrieve, Audit

Storage
- Nodes break files up into chunks
- Chunks are encrypted and stored on remote nodes
- Remote nodes send signed receipts and store StoreInfos

Retrieval
- A node storing a chunk can respond to a request for that chunk with:
  - the chunk itself
  - a demonstration that the chunk's lease has expired
  - a more recent StoreInfo

Auditing
- Receipts constitute audit records
- Nodes exchange receipts in order to verify compliance with storage quotas

Erasure coding
- Erasure coding is used to keep storage overhead reasonable: backing up 1 GB of data consumes about 1.3 GB of storage
- Keeping this ratio reasonable is crucial to motivating self-interested nodes to participate

Request-response patterns: Store, Retrieve, Audit

Retrieve
- The originator sends a receipt for the StoreInfo to be retrieved
- The storage node can send:
  - a RetrieveConfirm, containing the data and the receipt
  - a RetrieveDeny, containing a receipt and a proof of why the request was denied
  - anything else generates a POM

Store
- The originator sends a StoreInfo to be stored
- The storage node can send:
  - a receipt
  - a StoreReject, demonstrating that the node has reached its storage commitment
  - anything else generates a POM

Audit
- Three phases:
  - The auditor requests both the OwnList and the StoreList from the auditee, doing this for random nodes in the system
  - The lists are checked for inconsistencies
  - Inconsistencies result in a POM

Time constraints
- Data is stored for 30 days; after this, it is garbage collected
- Nodes must renew their leases on stored chunks before this expiration in order to keep them in the system

Sanctions
- The periodic work protocol forces the generation of POMs or special NoPOMs
- POMs and NoPOMs are balanced in cost
- POMs evict nodes from the system

Recovery
- Nodes must be able to recover after failures
- Chained membership certificates allow them to retrieve their old chunks
- Use of a certificate later in the chain is regarded as a new node entering the system
  - The old node is regarded as dead
  - The new node is allowed to view the old node's chunks
- This forces nodes to redistribute the chunks that were stored on the dead node
- The length of chains is limited, in order to prevent nodes from shirking work by repeatedly using certificates later in the chain

Guarantees
- Data stored in BAR-B can be retrieved within the lease period
- No POM can be gathered against a node that does not deviate from the protocol
- No node can store more than its quota
- A time window is available for nodes with catastrophic failures to recover
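A minimal sketch of the authoritative-time computation described above, under my reading of the slides: the new authoritative time is derived from the median of the proposers' timestamps over the previous f+1 decisions and is never allowed to run backwards, while a "no decision" turn keeps the previous authoritative time. The details here are assumptions, not the paper's code.

```python
from statistics import median

def authoritative_time(prev_auth, recent_decisions):
    """recent_decisions: proposer timestamps of the previous f+1
    decisions, with None standing in for a 'no decision' turn."""
    stamps = [t for t in recent_decisions if t is not None]
    if not stamps:                  # nothing decided: time stands still
        return prev_auth
    # max() keeps authoritative time monotonically non-decreasing even
    # if a proposer's clock lags behind.
    return max(prev_auth, median(stamps))

print(authoritative_time(100, [99, 103, 101]))  # 101
print(authoritative_time(100, [None, None]))    # 100
```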
Evaluation
- Performance is inferior to that of protocols that do not make these guarantees, but acceptable
- Measured: the impact of additional nodes, of rotating leadership, and of the fast-path optimization

Fault-Scalable Byzantine Fault-Tolerant Services
- The Query/Update (Q/U) protocol: an optimistic, quorum-based protocol
- Better throughput and fault-scalability than replicated state machines
- Introduces the preferred quorum as an optimization on quorum protocols

Motivation
- There is a compelling need for services and distributed data structures to be efficient and fault-tolerant
- In Byzantine fault-tolerant systems, performance drops off sharply as more faults are tolerated

Fault scalability
- A fault-scalable service is one in which performance degrades gracefully as more server faults are tolerated

Operations-based interface
- Provides an interface similar to RSMs: exports interfaces comprised of deterministic methods
  - Queries do not modify data
  - Updates modify data
- Multi-object updates allow a set of objects to be updated together

Properties
- Operates correctly under an asynchronous model
- Queries and updates are strictly serializable
- In benign executions, they are obstruction-free
- The cost is an increase in the number of required servers: 5b+1 servers, rather than 3b+1

Optimism
- Servers store a version history of objects; updates are non-destructive to the objects
- Logical timestamps are based on the contents of the update and the object state upon which the update is conditioned

Speedups
- Preferred quorums, rather than random quorums (addressed later)
- Efficient cryptographic techniques (addressed later)

Efficiency and scalability
- Most failure-atomic protocols require at least a two-phase (prepare, commit) commit; the optimistic approach does not need a prepare phase
- This introduces the need for clients to repair inconsistent objects
- The optimistic approach also obviates the need for locking!

Versioning servers
- To allow for this, versioning servers are employed
- Each update creates a new version on the server
- Updates contain information about the version to be updated; if no update has been committed since that version, the update goes through unimpeded

Throughput-scalability
- Additional servers, beyond those necessary to provide the desired fault tolerance, can provide additional throughput

Scale-up pitfall?
- Encourage the use of fine-grained objects, which reduce per-object contention
- If the majority of accesses touch individual objects, or only a few objects, the pitfall can be avoided; in the example applications, this holds

No need to partition
- Other systems achieve throughput-scalability by partitioning services; this is unnecessary in this system

The Query/Update protocol: system model
- Asynchronous timing
- Clients and servers may be Byzantine faulty
- Clients and servers are assumed to be computationally bounded, assuring the effectiveness of cryptography
- The failure model is a hybrid of benign and malevolent faults
- Extends the definition of a "fail-prone system" given by Malkhi and Reiter
- Point-to-point authenticated channels exist between all clients and servers
  - An infrastructure deploys symmetric keys on all channels
  - Channels are assumed unreliable…but, of course, they can be made reliable

Overview
- Clients update objects by issuing requests, stamped with object versions, to versioning servers, which evaluate these requests
- If a request is conditioned on an out-of-date version, the client's version is corrected and the request is reissued
- If an out-of-date server is required to reach a quorum, it retrieves an object history from a group of other servers
- If the version matches the server's version, of course, the request is executed
- Everything else is a variation upon this theme (a sketch of this accept/reject check follows)
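A minimal sketch of the accept/reject check at a versioning server, with assumed types and names (this is not the paper's code): an update conditioned on the server's latest version is applied non-destructively, while an update conditioned on a stale version is rejected along with the server's current history so the client can retry.

```python
class VersionServer:
    def __init__(self, initial_state):
        # Non-destructive version history: list of (version, state).
        self.history = [(0, initial_state)]

    def latest(self):
        return self.history[-1]

    def update(self, conditioned_on, method):
        version, state = self.latest()
        if conditioned_on != version:
            # Stale: return the current history so the client can
            # correct its version and reissue the request.
            return ("reject", self.history)
        new_state = method(state)
        self.history.append((version + 1, new_state))
        return ("ok", version + 1)

# Usage, echoing the "increment" method from the paper's evaluation:
s = VersionServer(0)
increment = lambda x: x + 1
print(s.update(0, increment))   # ('ok', 1)
print(s.update(0, increment))   # ('reject', [...]) -- stale version
```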
Overview (continued)
- Queries are read-only methods; updates modify an object
- The exported methods take arguments and return answers
- Clients perform operations by issuing requests to a quorum
- A server receives a request; if it accepts it, it invokes a method
- Each update creates a new object version
- The object version is kept with its logical timestamp in a version history called the replica history
- Servers return replica histories in response to requests
- Clients store replica histories in their object history set (OHS), an array of replica histories indexed by server
- Timestamps in these histories are candidates for future operations
- Candidates are classified in order to determine which object version a method should be executed upon
- In non-optimistic operation, a client may need to perform a repair (addressed later)
- To perform an operation, a client first retrieves an object history set; the client's operation is conditioned on this set, which is transmitted with the operation
- The client sends the operation to a quorum of servers; to promote efficiency, the client sends the request to a preferred quorum (addressed later)
- Single-phase operation hinges on the availability of a preferred quorum, and on concurrency-free access
- Before executing a request, servers first validate its integrity. This is important: servers do not communicate object histories directly to each other, so the client's data must be validated.
- Servers use authenticators to do this: lists of HMACs that prevent malevolent nodes from fabricating replica histories (sketched below)
- Servers cull from the conditioned-on OHS any replica histories that they cannot validate
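A minimal sketch of an authenticator, under assumptions (the toy keys and names are mine): each server shares a pairwise symmetric key with every other server, and an authenticator is a list of HMACs of a replica history, one per server. A receiving server recomputes only its own entry; without the key, a client cannot forge it.

```python
import hashlib, hmac

N_SERVERS = 6                                   # e.g. n = 5b+1 with b = 1
KEYS = {(i, j): bytes([min(i, j), max(i, j)])   # toy pairwise keys
        for i in range(N_SERVERS) for j in range(N_SERVERS)}

def make_authenticator(sender, replica_history):
    """The sending server attaches one HMAC per potential receiver."""
    return [hmac.new(KEYS[(sender, r)], replica_history,
                     hashlib.sha256).digest() for r in range(N_SERVERS)]

def validate(receiver, sender, replica_history, authenticator):
    """A receiver checks only its own slot in the authenticator."""
    expected = hmac.new(KEYS[(sender, receiver)], replica_history,
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, authenticator[receiver])

rh = b"object 7: version 3 @ ts 42"
auth = make_authenticator(sender=0, replica_history=rh)
print(validate(2, 0, rh, auth))        # True: history carried intact
print(validate(2, 0, b"forged", auth)) # False: fabricated history culled
```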
Overview: the last bit
- Servers validate that they do not have a higher timestamp in their local replica histories
  - Failing this, the client repairs
  - Passing this, the method is executed and the new timestamp is created
- Timestamps are crafted such that they always increase in value

Preferred quorums
- Traditional quorum systems use random quorums to distribute load, but this means that servers frequently need to be synced
- Preferred quorums instead access the servers with the most up-to-date data, assuring that syncs happen less often
- If a preferred quorum cannot be met, clients probe for additional servers to add to the quorum
  - Authenticators make it impossible to forge object histories for benign servers
  - The new host syncs with b+1 host servers, in order to validate that the data is correct
- In the prototype, probing selects servers such that load is distributed, using a method parameterized on object ID and server ID

Concurrency and repair
- Concurrent access to an object may fail; two operations address this:
  - Barrier: barrier candidates have no data associated with them, and so are safe to select during periods of contention. A barrier advances the logical clock so as to prevent earlier timestamps from completing.
  - Copy: copies the latest object data past the barrier, so it can be acted upon
- Clients may repeatedly barrier each other; to combat this, an exponential backoff strategy is enforced (see the retry sketch below)

Classification and constraints
- Based on partial observations of the global system state, an operation may be:
  - Complete
  - Repairable: can be repaired using the copy-and-barrier strategy
  - Incomplete

Multi-object updates
- Servers lock their local copies; if they approve the OHS, the update goes through
- If not, a multi-object repair protocol runs
- Here, repair depends on the ability to establish all objects in the set
- Objects in the set are repairable only if all are repairable; if any is not, objects that would otherwise be repairable are reclassified as incomplete
- (The paper walks through an example of all of this.)

Implementation details
- Cached object history set: clients cache object history sets during execution and issue updates without first querying. If the request fails because of an out-of-date OHS, the server returns an up-to-date OHS with the failure.
- Optimistic query execution: if a client has not accessed an object recently, it is still possible to complete in a single phase. Servers execute the query on the latest object version that they store, and clients then evaluate the result normally.
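A minimal sketch tying together the cached version information and the exponential backoff used to break barrier/copy livelock, reusing the toy VersionServer from the earlier sketch (so this is an illustration under my assumptions, not the paper's client): the client optimistically reuses its cached version, and on rejection adopts the fresh history from the reply and backs off randomly before retrying.

```python
import random, time

def perform_update(server, method, cached_version, max_tries=8):
    backoff = 0.01                       # seconds; tuning values assumed
    for _ in range(max_tries):
        status, payload = server.update(cached_version, method)
        if status == "ok":
            return payload               # the new version number
        # Rejected: the reply piggybacks the server's history; adopt the
        # latest version and back off randomly to break livelock.
        cached_version = payload[-1][0]
        time.sleep(random.uniform(0, backoff))
        backoff *= 2                     # exponential backoff
    raise RuntimeError("update did not complete (persistent contention?)")

# Usage with the toy VersionServer from the earlier sketch:
#   s = VersionServer(0)
#   perform_update(s, lambda x: x + 1, cached_version=0)
```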
Inline repair
- Does not require a barrier and copy: repairs the candidate "in place," obviating the need for a round trip
- Only possible in cases where there is no contention

Handling repeated requests
- Various mechanisms may cause requests to be repeated
- In order to shortcut the other checks, the timestamp is checked first

Retry and backoff policies
- Update-update contention requires retry, with backoff to avoid livelock
- Update-query contention does not: the query can be repaired in place

Object syncing (sketched below)
- Only one server needs to send the entire object version state; the others send hashes
- The syncing server then calculates the hash of the received state and compares it against all the others

Other speedups
- Authenticators use HMACs rather than digital signatures
- Compact timestamps: timestamps carry a collision-resistant hash rather than full object histories
- Compact replica histories: replica histories are pruned, based on the conditioned-on timestamp, after updates

Malevolent components
- The astute among you must have noticed the possibility of a denial-of-service attack: refusing to perform exponential backoff
  - Servers could rate-limit clients
- Clients could also issue updates to a subset of a quorum, forcing incomplete updates
  - Lazy verification can be used to verify the correctness of client operations in the background
  - The amount of unverified work by a client can then be limited

Correctness
- Operations are strictly serializable
- To understand this, consider the conditioned-on chain: all operations chain back to the initial candidate, and a total order is imposed on all established operations
- Operations occur atomically, including those spanning multiple objects
- If no operations span multiple objects, then correct operations that complete are also linearizable

Tests
- Performed on a rack of 76 Intel Pentium 4 2.8 GHz machines
- Implemented an "increment" method and an NFSv3 metadata service
- Results cover fault-scalability, isolated vs. contending workloads, and NFSv3 metadata performance
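As promised, a minimal sketch of the object-syncing optimization described above: one server ships the full object version state, the rest ship only hashes, and the syncing server accepts the state only if enough hashes match. The threshold parameter is an assumption (e.g. b+1 matches, so at least one vouching server is benign), not a value taken from the paper.

```python
import hashlib

def sync_object(full_state, peer_hashes, needed):
    """full_state: the one full copy received (bytes).
    peer_hashes: hashes sent by the other servers.
    needed: how many matching hashes make the copy trustworthy."""
    digest = hashlib.sha256(full_state).digest()
    matches = sum(1 for h in peer_hashes if h == digest)
    return full_state if matches >= needed else None

state = b"object 7: version 3"
good = hashlib.sha256(state).digest()
# Two honest hashes match the full copy; one bogus hash is outvoted.
print(sync_object(state, [good, good, b"bogus"], needed=2) is not None)  # True
```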
References
- Text and images have been borrowed directly from both papers.