Byzantine Replication Under Attack Yair Amir, Jonathan Kirsch, John Lane Johns Hopkins University Brian Coan Telcordia Technologies DSN 2008 1 Byzantine Replication Under Attack Yair Amir, Jonathan Kirsch, John Lane Johns Hopkins University Brian Coan Telcordia Technologies DSN 2008 2 Motivation • Society depends on large-scale, distributed computer systems for critical infrastructure. • Insider attacks are a real threat, even for systems designed with security in mind. • Byzantine replication provides fault tolerance by protecting against partial system compromises. – Attacker must compromise more than some threshold fraction of the system to cause inconsistency or prevent the system from functioning. – Systems perform well in fault-free or benign fault runs. – What about performance when under attack? DSN 2008 3 The Downside of Asynchrony • Existing correctness criteria: safety and liveness – Safety: servers remain consistent. – Liveness: each update is eventually executed. • Protocols are designed to be safe in all executions. – Do not rely on synchrony for safety! – Guarantee liveness only when the network is sufficiently stable. • Real systems are not completely asynchronous. – Systems can satisfy much stronger performance guarantees than liveness during stable periods. • Consequence: Performance attacks! – An attacker can exploit the gap between what is promised during stable periods (liveness) and what is possible. DSN 2008 4 Performance Attacks: A First-Hand Look • Red-team attack on Steward [DSN 06]. • Goal was to violate safety or liveness. • Steward survived all of the attacks! • Most did not affect performance. • The system was slowed down in one experiment. – Speed of update ordering was slowed down by a factor of 5. • Big problem: – A better attack could slow the system down by a factor of 100. – But the system is still considered live! • Liveness is a necessary but insufficient correctness criterion for practical systems on wide-area networks. DSN 2008 5 Byzantine Performance Failures Previously Considered Byzantine Failures Failure Type Failure Behavior Mitigated by Value Domain Sending incorrect, conflicting, or invalid messages Cryptography, agreement protocols Time Domain Messages arrive after timeouts or not at all Timeouts, view change • If the adversary cannot violate safety and liveness, the next best thing is to slow down the system beyond usefulness. • Performance failures: send correct messages slowly but without triggering timeouts. DSN 2008 6 A New Problem: Performance Under Attack • Existing systems are vulnerable to performance attacks. – A small number of faulty servers can cause the system to make progress at an extremely slow rate -- indefinitely! • Leader-based protocols are vulnerable to performance attacks by a malicious leader. – Problem is magnified in wide-area networks, where it is difficult to predict the performance that should be expected of the leader. • Main challenges: – Developing meaningful performance metrics for evaluating Byzantine replication protocols. – Designing protocols that perform well according to these metrics, even when the system is under attack. DSN 2008 7 Outline • Motivation • Byzantine Performance Failures • Relevant Prior Work • Case Study: BFT Under Attack • The Prime Replication System • Bounded-Delay • Protocol Overview • Experimental Results • Summary DSN 2008 8 Relevant Prior Work • Leader-based Byzantine replication – – – – BFT [Castro, Liskov 99] Separating agreement from execution [Yin et al. 03] Fast Byzantine Consensus [Martin, Alvisi 05] Zyzzyva [Kotla et al. 07] • Randomized Byzantine replication – SINTRA [Cachin, Portiz 02] – RITAS [Moniz et al. 06] • Quorum-based Byzantine replication – Q/U [Abd-El-Malek et al. 05] – HQ [Cowling et al. 06] DSN 2008 9 Case Study: BFT Under Attack [Castro and Liskov 99] Client request pre-prepare prepare commit reply (Leader) 0 1 2 3 • Attack 1: Pre-Prepare Delay – Malicious leader can add delay into the ordering path by withholding its Pre-Prepare. – Non-leaders maintain a FIFO queue of pending updates. • Use timeouts to monitor the leader. • Timeout placed on execution of first update in queue. – Malicious leader can stay in power by ordering one update per queue per timeout period! DSN 2008 10 Case Study: BFT Under Attack [Castro and Liskov 99] Client request pre-prepare prepare commit reply (Leader) 0 1 2 3 • Attack 2: Timeout Manipulation – Timeout doubles every time the leader is replaced. – Use a denial of service attack to increase the timeout, then stop on a malicious leader. • Each update is eventually executed, but performance is much worse than if there were only correct servers. DSN 2008 11 The Prime Replication System • Performance-Oriented Replication in Malicious Environments – Leader-based protocol providing Bounded-Delay, a stronger guarantee than liveness, when the network is stable. • System components: – Prime Ordering Protocol (Preordering phase, Global ordering phase) – Suspect-Leader Protocol for detecting malicious leaders. • Main Ideas: – Resources needed by the leader to do its job are bounded and independent of system throughput. • Leader has “no excuse” for not sending timely messages. – Non-leader servers compute a threshold level of acceptable performance that the leader should meet. • Upper-bounded by a function of the latency between correct servers after the network stabilizes. DSN 2008 12 Bounded Delay • Prime-Stability: There is a time after which the following condition holds for a set of at least 2f+1 correct servers (the stable servers): • For each pair of stable servers r and s, there exists a value Min_Lat(r,s), unknown to the servers, such that if r sends a message to s, it will arrive with delay , where • Bounded-Delay: There exists a time after which the update latency for any update initiated by a stable server is upperbounded. DSN 2008 13 Prime: Ordering Protocol PO REQUEST No Attack L PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator O = Aggregation Delay • Preordering (PO) Phase: – Each server, o, disseminates its updates to the other servers (PO-Request). – Agreement protocol binds update u to preorder identifier (o, i), where u is the ith update originated by server o (PO-ACK). – Each server cumulatively acknowledges the updates it preorders (PO-ARU). DSN 2008 14 Prime: Ordering Protocol No Attack L PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator = Aggregation Delay O Server 1 Server 2 Server 3 Server 4 ua, ub, uc (1, 1, ua), (1, 2, ub), (1, 3, uc) ud, ue (2, 1, ud), (2, 2, ue) uf ug, uh, ui Preordering Protocol (3, 1, uf) (4, 1, ug), (4, 2, uh), (4, 3, ui) 3 DSN 2008 2 1 PO-ARU 3 15 Prime: Ordering Protocol No Attack L PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator = Aggregation Delay O • Global Ordering Phase: – Similar to BFT (Pre-Prepare, Prepare, Commit) – Leader periodically sends a Pre-Prepare containing a proof matrix (vector of PO-ARU messages). – Each globally ordered Pre-Prepare maps to a batch of preordered updates based on contents of proof matrix. – Final total order is obtained by deterministically ordering the updates in each batch based on preorder identifier. DSN 2008 16 Prime: Ordering Protocol No Attack L PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator = Aggregation Delay O Pre-Prepare 1 Pre-Prepare 2 PO-ARU1 PO-ARU2 PO-ARU3 PO-ARU4 PO-ARU1’ PO-ARU2’ PO-ARU3’ PO-ARU4’ PP1 PP2 Global Ordering Protocol … Final Total Order DSN 2008 17 Attack Analysis No Attack L PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator = Aggregation Delay O • Key Points: – Preordering phase for updates sent by correct servers cannot be slowed down by faulty servers. – Once all correct servers receive a Pre-Prepare, global ordering cannot be slowed down by faulty servers. • Possible Attacks: – 1. Leader sends its Pre-Prepare to only some correct servers. – 2. Leader sends a Pre-Prepare with out-of-date PO-ARUs. – 3. Leader delays its Pre-Prepare. DSN 2008 18 Addition 1: Pre-Prepare Flooding No Attack L PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L = Leader O = Originator = Aggregation Delay O PO ACK PO ARU PRE PREPARE PREPARE COMMIT Attack L PO REQUEST O • Intuition: 1. The leader must withhold the Pre-Prepare from all correct servers to significantly impact latency. 2. If we can force the leader to send timely, up-to-date Pre-Prepares to at least one correct server, we can ensure timely ordering! DSN 2008 19 Addition 2: Proof Matrix Messages PO ACK PO ARU PROOF MATRIX PRE PREPARE PREPARE COMMIT Attack L PO REQUEST O • Each server periodically sends a Proof-Matrix message, containing the latest PO-ARU messages it has received, to the leader. – A correct server expects a leader to include, in its next PrePrepare, PO-ARU messages that are at least as up-to-date as those in the Proof-Matrix message. • Why is this expectation justified? – A correct leader can simply adopt any PO-ARU messages that are more up to date than what it currently has. DSN 2008 20 Key Idea: Turn-Around Time PO ACK PO ARU PROOF MATRIX PRE PREPARE PREPARE COMMIT Attack L PO REQUEST O • Turn-around time – Time between sending a Proof-Matrix message, PM, and receiving a PrePrepare “covering” all of the PO-ARU messages in PM. • Key Observation: – The resources required by the leader to send a Pre-Prepare (bandwidth, CPU) are bounded and independent of system throughput. – We can use turn-around time as a measure by which to judge the leader! • Intuition: Force the leader to be timely by ensuring that it provides a fast enough turn-around time to at least one correct server. DSN 2008 21 Suspect-Leader Protocol • Protocol Strategy: – Dynamically determine an acceptable turn-around time based on roundtrip measurements (TAT_acceptable). – Use turn-around times measured in the current view to compute a measure of the current leader’s performance (TAT_leader). – Suspect the leader if TAT_leader > TAT_acceptable. • Design Challenges: – Malicious servers can lie to try to lower expectation of acceptable performance. • Leader could remain in power while going slowly. – Malicious servers can lie to make a correct leader look bad. • Would lead to continuous view changes. DSN 2008 22 Suspect-Leader: Key Properties • Any server that retains a role as leader must provide a TAT to at least one correct server that is no more than = Maximum delay – Maximum update latency: between correct servers = Aggregation delay Bounded-Delay! • There exists a set of at least f+1 correct servers that will not be suspected by any correct server if elected leader. – Aggressive but not overly aggressive. DSN 2008 23 Experimental Results • Leader performs just well enough to stay in power. BFT: aggressive timeout (300ms) • BFT: Pre-Prepare delay • Prime: – Leader adds as much delay as possible. – Non-leader servers force as much reconciliation as possible. 900 800 Update Throughput (updates/sec) – 50ms diameter, 10 Mbps links • Update Throughput vs. Clients 50ms Diameter, 10Mbps Links 7 servers (f = 2) Symmetric network 700 BFT - No Attack 600 500 Prime - No Attack 400 Prime - Under Attack 300 200 BFT - Under Attack 100 0 0 100 200 300 400 500 Number of Clients Update Latency vs. Clients 50ms Diameter, 10Mbps Links 1.4 1.2 Update Latency (s) • • BFT - No Attack 1 0.8 Prime - No Attack 0.6 Prime - Under Attack 0.4 BFT - Under Attack 0.2 0 0 100 200 300 400 500 Number of Clients DSN 2008 24 Summary • Existing leader-based Byzantine replication protocols are vulnerable to performance attacks. – Liveness is not a meaningful performance metric for evaluating Byzantine replication protocols. • Bounded-Delay: a new performance metric. – Can we provide stronger guarantees? – Can we guarantee a minimum throughput? • Prime: a new Byzantine replication protocol. – Achieves Bounded-Delay when the network is sufficiently stable. DSN 2008 25 Questions? • DSN 2008 26
© Copyright 2026 Paperzz