SECONDSITE: DISASTER TOLERANCE AS A SERVICE
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor, Andrew Warfield

FAILURES IN A DATACENTER

TOLERATING FAILURES IN A DATACENTER – REMUS
The initial idea behind Remus was to tolerate datacenter-level failures.

CAN A WHOLE DATACENTER FAIL?
Yes! It's a "Disaster"!

DISASTERS
"Truck driver in Texas kills all the websites you really use"
"…Southlake FD found that he had low blood sugar" - valleywag.com
"Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap." - Om Malik, GigaOM
[Illustrative image courtesy of TangoPango, Flickr.]

DISASTERS..
"Water-main break cripples Dallas County computers, operations. The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal." - Dallas Morning News, Jun 2010

DISASTERS..
[Image-only slide.]

MORE FODDER BACK HOME
"An explosion … near our server bank … electrical box containing 580 fiber cables. electrical box … was covered in asbestos … mandated the wearing of hazmat suits … Worse yet, the dynamic rerouting, which is the hallmark of the internet, … did not function. In other words, the perfect storm. Oh well. S*it happens."
- Dan Empfield, Slowtwitch.com, a Gossamer Threads customer

DISASTER RECOVERY – THE OLD-FASHIONED WAY
• Storage replication between a primary and a backup site.
• Manually restore physical servers from backup images.
• Data loss and long outage periods.
• Expensive hardware: storage arrays, replicators, etc.

STATE OF THE ART DISASTER RECOVERY
[Diagram: VMware Site Recovery Manager. VirtualCenter and Site Recovery Manager run at both the protected site and the recovery site; datastore groups are copied by array replication. While VMs are online at the protected site, the recovery site's VMs stay offline; when the protected site's VMs become unavailable, the recovery site powers its VMs on. Source: VMware Site Recovery Manager – Technical Overview.]

PROBLEMS WITH EXISTING SOLUTIONS
• Data loss & service disruption (RPO ~15 min, RTO ~a few hours).
• Complicated recovery planning (e.g., service A needs to be up before service B, etc.).
• Application-level recovery.
Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

DISASTER TOLERANCE AS A SERVICE?
Our Vision

OVERVIEW
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences

PRIMARY & BACKUP SITES
[Diagram: primary and backup sites connected by a replication link with 5 ms RTT.]

FAILOVER & FAILBACK WITHOUT OUTAGE
[Diagram sequence: primary site in Vancouver, backup site in Kamloops; on failover, Kamloops becomes the primary site; after resynchronization, Kamloops remains primary with Vancouver as the new backup site.]
• Complete state recovery (CPU, disk, memory, network).
• No application-level recovery.

MAIN CONTRIBUTIONS
Remus (NSDI '08)
• Checkpoint-based state replication.
• Fully transparent HA.
• Recovery consistency.
• No application-level recovery.
RemusDB (VLDB '11)
• Optimizes server latency.
• Reduces replication bandwidth by up to 80% using page delta compression.
• Disk read tracking.
SecondSite (VEE '12)
• Failover arbitration in the wide area.
• Stateful network failover over the wide area.

CONTRIBUTIONS..
[Diagram-only slide.]

FAILURE DETECTION IN REMUS
[Diagram: primary and backup hosts on a LAN; NIC1 faces the external network, NIC2 sits on a dedicated link carrying checkpoints from primary to backup.]
• A pair of independent, dedicated NICs carries replication traffic.
• The backup declares the primary failed only if:
  • it cannot reach the primary via NIC1 or NIC2, and
  • it can reach the external network via NIC1.
• Failure of the replication link alone results in backup shutdown.
• Split brain occurs only when both NICs/links fail.
(A sketch of this decision rule follows.)
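To make the rule above concrete, here is a minimal Python sketch. It is not Remus code: the function and probe names are hypothetical, NIC2 is assumed to be the dedicated replication link as in the diagram, and the behavior of a fully isolated backup is an assumption the slide does not spell out (marked in a comment).

```python
# Minimal sketch of the backup's failure-detection rule (not Remus code).
# Probe results arrive as booleans; in the real system they would come
# from heartbeats on the two dedicated NICs.

from enum import Enum, auto

class Action(Enum):
    WAIT = auto()      # inconclusive: keep checkpointing and probing
    PROMOTE = auto()   # declare the primary failed and take over
    SHUTDOWN = auto()  # disable the backup to avoid a stale failover

def backup_decision(primary_via_nic1: bool,
                    primary_via_nic2: bool,
                    external_via_nic1: bool) -> Action:
    if not primary_via_nic1 and not primary_via_nic2:
        # Primary unreachable on both NICs. Promote only if the external
        # network is still reachable, i.e. the backup itself is not the
        # partitioned node.
        if external_via_nic1:
            return Action.PROMOTE
        # Assumption in this sketch: an isolated backup shuts down rather
        # than risk split brain (the slide does not cover this case).
        return Action.SHUTDOWN
    if not primary_via_nic2:
        # Replication link alone has failed: shut the backup down; the
        # primary keeps running, unprotected.
        return Action.SHUTDOWN
    return Action.WAIT

# Primary silent on both NICs while the external network is reachable:
assert backup_decision(False, False, True) is Action.PROMOTE
# Only the replication link is down:
assert backup_decision(True, False, True) is Action.SHUTDOWN
```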
FAILURE DETECTION IN WIDE AREA DEPLOYMENTS
[Diagram: primary and backup datacenters, each with NIC1 facing the Internet/external network and NIC2 on the replication channel; checkpoints flow over the WAN.]
• Cannot distinguish between link failure and node failure.
• Higher chance of split brain, as the network is no longer reliable.

FAILOVER ARBITRATION
• Local quorum of simple reachability detectors ("stewards").
• Stewards can be placed on third-party clouds.
• Google App Engine implementation with ~100 LoC.
• The provider/user could substitute other, more sophisticated implementations.

FAILOVER ARBITRATION..
[Diagram: five stewards, agreed upon a priori by both sites, are polled by the primary and the backup while the replication stream runs between the sites. Quorum logic - primary: "I need a majority to stay alive"; backup: "I need an exclusive majority to fail over". A sketch of this quorum logic appears at the end of this deck.]

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION
• Remus – LAN – gratuitous ARP from the backup host.
• SecondSite – WAN/Internet – BGP route update from the backup datacenter.
  • Needs support from upstream ISP(s) at both datacenters.
  • IP migration achieved through BGP multi-homing.

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..
[Diagram: both sites advertise the VM prefix 134.87.3.0/24 as stub AS-64678, multi-homed to the Internet through BCNet (AS-271) at Vancouver (134.87.2.173/.174) and Kamloops (207.23.255.237/.238). In steady state the backup site prepends its own AS to the path ("as-path prepend 64678 64678 64678"), so traffic is routed to the primary site; on failover, the route is updated to re-route traffic to the backup site.]

OVERVIEW
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences

EVALUATION
• Failover works!!
• "I want periodic failovers with no downtime!"
• "More than one failure? I will have to restart HA!"
• "Did you run regression tests?"

RESTARTING HA
• Need to resynchronize storage.
• Avoiding service downtime requires online resynchronization.
• Leverage DRBD, which resynchronizes only the blocks that have changed.
• Integrate DRBD with Remus:
  • add a checkpoint-based asynchronous disk replication protocol.

REGRESSION TESTS
• Synthetic workloads to stress-test the replication pipeline.
• Failovers every 90 minutes.
• Discovered some interesting corner cases:
  • page-table corruptions in memory checkpoints;
  • write-after-write I/O ordering in disk replication.

SECONDSITE – THE COMPLETE PICTURE
[Graph: service behavior across failovers; 4 VMs x 100 clients/VM.]
• Service downtime includes the timeout for failure detection (10 s).
• The failure-detection timeout is configurable.

REPLICATION BANDWIDTH CONSUMPTION
[Graph: replication bandwidth over time; 4 VMs x 100 clients/VM.]

DEMO
Expect a real disaster (conference demos are not a good idea!).

APPLICATION THROUGHPUT VS. REPLICATION LATENCY
[Graph: SPECweb with 100 clients; throughput vs. replication latency to Kamloops.]

RESOURCE UTILIZATION VS. APPLICATION LOAD
[Graphs: Domain-0 CPU utilization and bandwidth usage on the replication channel - the cost of HA as a function of application load (OLTP with 100 clients).]

RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD
[Graph: resynchronization delay vs. outage period; OLTP workload.]

SETUP WORKFLOW – RECOVERY SITE
The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMware Site Recovery Manager – Technical Overview.

RECOVERY PLAN
[Flowchart: VM shutdown → high-priority VM shutdown → prepare storage → high-priority VM recovery → normal-priority VM recovery → low-priority VM recovery.]
Source: VMware Site Recovery Manager – Technical Overview.
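To close, a minimal sketch of the steward-based arbitration from the FAILOVER ARBITRATION slides. All names here are hypothetical and steward reachability is stubbed out in the example round; the "exclusive majority" is modeled by having each steward grant its vote to at most one site per round, which matches the slides' intent but simplifies whatever protocol the stewards actually run.

```python
# Sketch of steward quorum arbitration (hypothetical names, not the
# SecondSite implementation). Stewards are simple reachability detectors;
# each grants its vote to at most one site per polling round.

class Steward:
    def __init__(self, name: str):
        self.name = name
        self.granted_to = None  # at most one site per round

    def grant(self, site: str) -> bool:
        # First site to reach this steward in the round wins its vote.
        if self.granted_to is None:
            self.granted_to = site
        return self.granted_to == site

def votes(site: str, stewards, reachable) -> int:
    # Count stewards this site can reach AND that grant it their vote.
    return sum(1 for s in stewards if reachable(site, s) and s.grant(site))

def has_majority(n_votes: int, n_stewards: int) -> bool:
    return 2 * n_votes > n_stewards

# Example round with five stewards: the primary datacenter is cut off.
stewards = [Steward(f"s{i}") for i in range(5)]
primary_reach = lambda site, s: False   # primary reaches no steward
backup_reach = lambda site, s: True     # backup reaches every steward

p_votes = votes("primary", stewards, primary_reach)  # 0 votes
b_votes = votes("backup", stewards, backup_reach)    # 5 votes, exclusive

# Primary rule: "I need a majority to stay alive."
assert not has_majority(p_votes, len(stewards))
# Backup rule: "I need an exclusive majority to fail over."
# Exclusivity holds by construction: each steward votes for one site only.
assert has_majority(b_votes, len(stewards))
```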