Redmond Protocols Plugfest 2016

Validating Hardware for Private Clouds: Tests & Techniques
Aniket Malatpure and Anagha Koshe
Cloud Solutions-Fundamentals: Reliability Team

Session Objectives And Takeaways
• Session Objective(s):
  o Private Cloud Market Opportunity
  o Private Cloud Hardware Issues
  o Private Cloud Validation Approach
  o Solution Validation: Private Cloud Simulator (PCS)
• Key Takeaways
  o Private Cloud Simulator: Overview & end-to-end workflow

Private Cloud: Hardware Issues

Hardware Issues: Storage
• There is little that cloud software can do if the underlying hardware fails to adhere to fundamental design assumptions
• Storage hardware defects trigger a ‘Cloud Down’ scenario in which some or all storage goes down, taking tenant VMs offline

Hardware Type: Issue Description
• HDD
  o Number of initiators supported for Persistent Reservation was smaller than required
  o Drives failed to power up on repeated enclosure power cycles
  o Drive locked up when queue depth during multi-initiator access exceeded the drive queue
  o Large sequential (1MB size) writes were slower during MPIO Round Robin access
• SSD
  o Persistent Reservation support showed multiple failures
  o Disk media issues
  o Inadequate support for the SBC-3 UNMAP command
• Enclosures
  o Power-on failures after repeated power cycles
  o SAS drawers failing partially (e.g. 1 of 5 drawers failed)

Hardware Issues: Network

Area / Category: Description
• Special packet handling
  o Deeply fragmented Network Buffers
  o Packets with data larger than HW max
• System reboot during stress traffic
  o Multiple layers of driver resource cleanup required to handle this condition
• Error handling when RDMA connections are created and destroyed in parallel
  o List corruption
  o Reference count mismatches
• NDK spec violation
  o Empty CQ notifications
  o Reference count of NDK objects during creation failure
• Handling hardware errors gracefully
  o Logging the error
  o Shutting down the hardware as much as possible
• Connection Scalability
  o Not being able to scale to a large number of RDMA connections
  o Connection tear down and reconnect paths troublesome
  o Near line throughput with a high number of connect/disconnect/reconnects

Private Cloud Simulator

Solution Validation: PCS
• PCS (aka Private Cloud Simulator)
  o Simulates a comprehensive private cloud-based datacenter
  o Provides success metrics based on cloud-level SLAs

[Diagram: Private Cloud Datacenter. Labels: End-Users (Tenant VMs); Private Cloud Datacenter Administrator (Planned Maintenance); Environment (Resiliency); Compute, Storage, Network; Planned Actions and Unexpected Faults.]

PCS: Simulation Approach
• Model Based
• Small Test Blocks – Big Scenarios
• Diversity Through Randomization
• Link to video

[Diagram: small test blocks such as Migrate VM, Add VHDx, Enable DM, Take Checkpoint, Add NIC and Remove NIC, grouped under Migration, Snapshots, Resource Pool, Config Changes and State Changes, and combined through Randomization.]

Private Cloud Validation Approach

Validation Approach: Overview
Stages: Service Level Agreements (SLAs), Private Cloud Scenarios, Private Cloud Simulation, Accelerated Testing, Signoff
• SLAs
  o Focus on metrics that determine the expected availability, throughput and scalability of the private cloud service
  o Availability of 99.95% or higher is commonly expected
  o Throughput, generally specified in terms of IOPS (transaction rate) or MB/s (streaming), is expected to be maintained
  o System is expected to scale seamlessly to accommodate higher load

Validation Approach: Overview
• Private Cloud Scenarios
• Excellent Uptime
  o VMs are expected to get created, stay online and get backed up/restored with minimal downtime (planned or unplanned)
• Load Balancing
  o Infrastructure redistributes VMs across multiple servers with no discernible end-user impact
• Non-disruptive Admin Activities
  o Activities which normally cause system downtime (patching, upgrade, hardware replacement) shouldn’t disrupt VM users
• Fault Resiliency
  o VMs should stay online and not lose any data during incidents like corrupt disks, tripped network cables, failed servers etc.

Validation Approach: Overview
• Private Cloud Simulation
• VM Creation & boot storm
  o Create expected number of VMs
  o Boot all expected VMs simultaneously
• Tenant Workload
  o Run simulated workload inside each VM
• Live Migration
  o Migrate VMs between server nodes
• Backup & Restore
  o Do periodic backup of tenant VMs
• Planned, unplanned downtimes
  o Activities like pulling network cables, replacing disks, rebooting nodes etc.
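The simulation activities above, together with the "Small Test Blocks – Big Scenarios" and "Diversity Through Randomization" ideas, could be sketched roughly as follows. This is an illustration only and is not PCS code: the block names, the weights, and the `build_scenario` function are all hypothetical, chosen to mirror the activities named on the slide.

```python
import random

# Hypothetical small test blocks, named after the activities on the slide.
# The weights are made-up values for illustration; they are not PCS settings.
TEST_BLOCKS = {
    "tenant_workload": 4,
    "live_migration": 3,
    "backup_restore": 2,
    "planned_downtime": 1,
    "unplanned_fault": 1,
}


def build_scenario(num_steps, seed):
    """Compose one big scenario from small blocks in randomized order.

    Every scenario starts with VM creation and a boot storm; the remaining
    steps are weighted random draws. A fixed seed makes a run reproducible,
    which helps when a failure has to be replayed.
    """
    rng = random.Random(seed)
    names = list(TEST_BLOCKS)
    weights = [TEST_BLOCKS[name] for name in names]
    steps = ["create_vms", "boot_storm"]
    steps += rng.choices(names, weights=weights, k=num_steps)
    return steps


if __name__ == "__main__":
    for step in build_scenario(num_steps=8, seed=2016):
        print(step)
```

Different seeds give different orderings of the same small blocks, while reusing a seed reproduces a scenario exactly.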
Profile - Device.Storage.Controller
• VmCloneAction
• VmLiveMigrationAction
• StorageNodeFileServerMove
• StorageNodePoolMove
• VmSnapshotAction
• StorageNodeRestart
• VmStateChangeAction
• StorageNodeBugcheck
• VmStorageMigrationAction
• VmGuestRestartAction
• StorageNodeBusResetAction
• StorageNodePortDisableAllAction
• VmStartWorkloadAction
• StorageNodePortDisableSingleAction
• VmGuestFullPowerCycleAction
• StorageRetireAndRepairAction
• ComputeNodeEvacuation
• DisableNetworkAdapters
• ComputeNodeEvict
• ComputeNodeJoin

Profile - Device.Storage.Enclosure
• VmCloneAction
• ClusterCSVMove
• VmLiveMigrationAction
• ComputeNodeEvacuation
• VmSnapshotAction
• ComputeNodeEvict
• VmStateChangeAction
• ComputeNodeJoin
• VmStorageMigrationAction
• StorageNodeRestart
• VmGuestRestartAction
• StorageNodeBugcheck
• VmStartWorkloadAction
• StorageNodeBusResetAction
• VmGuestFullPowerCycleAction
• StorageRetireAndRepairAction
• DisableNetworkAdapters
• StorageNodePoolMove

Profile - Device.Storage.HD
• VmCloneAction
• ClusterCSVMove
• StorageNodeFileServerMove
• StorageNodeDiskReadMediumErrorAction
• VmLiveMigrationAction
• ComputeNodeEvacuation
• StorageNodePoolMove
• StorageRetireAndRepairAction
• VmSnapshotAction
• ComputeNodeEvict
• StorageNodeRestart
• DisableNetworkAdapters
• VmStateChangeAction
• ComputeNodeJoin
• StorageNodeBugcheck
• VmStorageMigrationAction
• VmGuestRestartAction
• VmStartWorkloadAction
• VmGuestFullPowerCycleAction
• StorageNodeBusResetAction
• StorageNodeDiskIoTimeoutOnceAction
• StorageNodeDiskReadTimeoutAction
• StorageNodeDiskIoTimeoutSingleDiskAlwaysAction
• StorageNodeDiskWriteTimeoutAction
• StorageNodeDiskIoTimeoutSingleDiskRandomAction
• StorageNodePortDisableAllAction
• StorageNodePortDisableSingleAction
• StorageNodeDiskAllTimeoutSingleDiskAlwaysAction
• StorageNodeDiskAllTimeoutSingleDiskRandomAction
• StorageNodeUpdateStorageProviderCacheAction

Profile - Device.Network.LAN.10GbOrGreater
• VmCloneAction
• VmStartWorkloadAction
• VmLiveMigrationAction
• VmGuestFullPowerCycleAction
• VmSnapshotAction
• ClusterCSVMove
• VmStateChangeAction
• ComputeNodeEvacuation
• VmStorageMigrationAction
• ComputeNodeEvict
• VmGuestRestartAction
• ComputeNodeJoin

Profile - Device.Network.LAN
• NetCreateTenantVmAction
• VmGuestRestartAction
• ComputeNodeRestartAction
• NetDeleteTenantVmAction
• NetCreateVirtualNetworkAction
• NetDeleteVirtualNetworkAction
• NetCreateVirtualSubnetAction
• NetDeleteVirtualSubnetAction
• NetCreateVmNetworkAdapterAction
• VmGuestFullPowerCycleAction
• ComputeNodeEvacuationAction
• NetLoadBalancerEastWestILBTrafficAction
• ClusterCSVMoveAction
• NetLoadBalancerEastWestInterTierTrafficAction
• ComputeNodeBugcheckAction
• NetLoadBalancerEastWestIntraTierTrafficAction
• NetConnectVmNetworkAdapterAction
• NetLoadBalancerInboundTrafficAction
• NetDisconnectVmNetworkAdapterAction
• NetLoadBalancerNorthSouthTrafficAction
• NetRunEastWestCrossSubnetTrafficAction
• NetLoadBalancerOutboundTrafficAction
• VMSnapshotAction
• VmLiveMigrationAction
• VmStorageMigrationAction
• VmStateChangeAction
• NetBugCheckHostAgentAction
• NetRunEastWestSameSubnetTrafficAction
• NetPublicIpAddressTrafficAction

Validation Approach: Overview
• Accelerated Testing
  o Simulate 1 year of Data Center activity in a shorter time
  o Example (for Microsoft Cloud Platform Solution)

Accelerated Testing: From 1 year to 7 days
Sample Hoster Profile for CPS

Cloud Topology
• Tenant VMs: 2000
• VM Profile: 1 vCPU, 1.75GB
• Compute Nodes: 30
• Storage Nodes: 4 (SOFS+SPACES)

Tenant Workloads & SLA
• A variety of workloads always running
• SLA: Zero impact on workloads
• SLA: Zero IO errors or timeouts
• SLA: Failovers within a minute

Simulated activity
• VM Live Migrations: 10,152 VMs live migrated
• Storage Migrations: 5,734 VMs storage migrated
• Compute Node Failures: 376 node drain & failovers
• Storage Node Failures: 244 unplanned failovers
• JBOD Power Failures: 1 JBOD failure per day
• Shared Disk Failures: 2 drive failures per pool per day
• NIC, Cable, Switch Failures: 28 NIC/Cable & 2 Switch failures
• SAS HBA & Cable Failures: 8 SAS Cable Pulls & 2 HBA failures

PCS Overview: Success Criteria
• PCS success
  o Based on impact of actions on the target personae

Persona: PCS Success Criteria (aka expected Private Cloud SLA)
• User: Tenant VMs should be unaffected by admin actions or environment faults 99.95% of the time
• Admin: All planned admin actions should succeed at least 90% of the time (in the presence of faults)
• Environment: No unexpected crash (except if initiated by PCS) of any cluster node

Validation Approach: Overview
• Signoff
  o Indicates that the planned SLAs have been met in the simulated private cloud deployment (in the presence of faults, admin actions and with ongoing tenant workload).
  o Gives confidence that the planned private cloud will stand up to customer load in real-world deployments.

Questions or Comments?

Thank You!
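As a rough illustration of the PCS success criteria listed earlier, the three persona checks could be evaluated along these lines. The 99.95% and 90% thresholds come from the slides; everything else is an assumption made for this sketch, including the `RunSummary` fields, the `evaluate_signoff` function, and the choice to measure tenant impact in VM-seconds. It is not how PCS itself computes its result.

```python
from dataclasses import dataclass

USER_AVAILABILITY_TARGET = 0.9995  # from the slides: 99.95% of the time
ADMIN_SUCCESS_TARGET = 0.90        # from the slides: at least 90%


@dataclass
class RunSummary:
    """Hypothetical roll-up of one simulation run (field names are assumptions)."""
    vm_seconds_total: float        # tenant VM run time across all VMs
    vm_seconds_impacted: float     # portion of that time with tenant impact
    admin_actions_total: int
    admin_actions_succeeded: int
    unexpected_node_crashes: int   # crashes not initiated by the simulator


def evaluate_signoff(run):
    """Return a pass/fail flag per persona: user, admin, environment."""
    availability = 1.0 - run.vm_seconds_impacted / run.vm_seconds_total
    admin_rate = run.admin_actions_succeeded / run.admin_actions_total
    return {
        "user": availability >= USER_AVAILABILITY_TARGET,
        "admin": admin_rate >= ADMIN_SUCCESS_TARGET,
        "environment": run.unexpected_node_crashes == 0,
    }


if __name__ == "__main__":
    # Made-up numbers: 2000 VMs running for 7 days, expressed in VM-seconds.
    week = RunSummary(
        vm_seconds_total=2000 * 7 * 86400,
        vm_seconds_impacted=300_000,
        admin_actions_total=1000,
        admin_actions_succeeded=950,
        unexpected_node_crashes=0,
    )
    print(evaluate_signoff(week))
```

Under this reading, signoff would require all three flags to be true for the same run.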