Presentation

Redmond Protocols Plugfest 2016
Validating Hardware for Private Clouds:
Tests & Techniques
Aniket Malatpure and Anagha Koshe
Cloud Solutions-Fundamentals: Reliability Team
Session Objectives And Takeaways
• Session Objective(s):
• Private Cloud Market Opportunity
• Private Cloud Hardware Issues
• Private Cloud Validation Approach
• Solution Validation: Private Cloud Simulator (PCS)
• Key Takeaways
• Private Cloud Simulator: Overview & end-to-end workflow
Private Cloud: Hardware Issues
Hardware Issues: Storage
• There is little that Cloud software can do if the underlying hardware fails to adhere to fundamental
design assumptions
• Storage hardware defects trigger a ‘Cloud Down’ scenario where some or all storage goes down
taking tenant VMs offline
Hardware Type: Issue Description
HDD
• Number of initiators supported for Persistent Reservation was smaller than required
• Drives failed to power up on repeated enclosure power cycles
• Drive locked up when queue depth during multi-initiator access exceeded the drive queue
• Large sequential (1 MB) writes were slower during MPIO Round Robin access
SSD
• Persistent Reservation support showed multiple failures
• Disk media issues
• Inadequate support for the SBC-3 UNMAP command
Enclosures
• Power-on failures after repeated power cycles
• SAS drawers failing partially (e.g. 1 of 5 drawers failed)
Hardware Issues: Network
Area / Category: Description
Special packet handling
• Deeply fragmented network buffers
• Packets with data larger than HW max
System reboot during stress traffic
• Multiple layers of driver resource cleanup required to handle this condition
Error handling when RDMA connections are created and destroyed in parallel
• List corruption
• Reference count mismatches
NDK spec violation
• Empty CQ notifications
• Reference count of NDK objects during creation failure
Handling hardware errors gracefully
• Logging the error
• Shutting down the hardware as much as possible
Connection Scalability
• Not being able to scale to a large number of RDMA connections
• Connection teardown and reconnect paths troublesome
• Near line throughput with high number of connect/disconnect/reconnects
Private Cloud Simulator
Solution Validation: PCS
• PCS (aka Private Cloud Simulator)
• Simulates a comprehensive private cloud-based datacenter
• Provides success metrics based on cloud-level SLAs
[Figure: PCS simulation model. Labels: Private Cloud Datacenter End-Users (Tenant VMs); Private Cloud Datacenter Administrator (Planned Maintenance); Environment (Resiliency); Compute, Storage, Network; Planned Actions; Unexpected Faults]
PCS: Simulation Approach
• Model Based
• Small Test Blocks – Big Scenarios
• Diversity Through Randomization
• Link to video
[Figure: small test blocks (Migrate VM, Add VHDx, Enable DM, Take Checkpoint, Add NIC, Remove NIC) drawn from the categories Configs, Migration, Snapshots, Resource Pool, and State Changes, and combined through Randomization]
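The simulation approach above (a model, small test blocks, and randomization) can be illustrated with a minimal sketch. This is not PCS code: the `VmModel` fields, the block preconditions, `NUM_HOSTS`, and `build_scenario` are all hypothetical names chosen for illustration. Only the block names (Add NIC, Remove NIC, Take Checkpoint, Migrate VM) come from the slide.

```python
import random
from dataclasses import dataclass

NUM_HOSTS = 4  # hypothetical cluster size, for illustration only


@dataclass
class VmModel:
    """Hypothetical model state: just enough to decide which blocks are valid."""
    nics: int = 1
    checkpoints: int = 0
    host: int = 0


def add_nic(vm, rng):
    vm.nics += 1


def remove_nic(vm, rng):
    vm.nics -= 1


def take_checkpoint(vm, rng):
    vm.checkpoints += 1


def migrate_vm(vm, rng):
    vm.host = rng.randrange(NUM_HOSTS)


# Each small test block: (name, precondition on the model, effect on the model).
BLOCKS = [
    ("Add NIC", lambda vm: True, add_nic),
    ("Remove NIC", lambda vm: vm.nics > 0, remove_nic),
    ("Take Checkpoint", lambda vm: True, take_checkpoint),
    ("Migrate VM", lambda vm: True, migrate_vm),
]


def build_scenario(seed, steps):
    """Compose a scenario by randomly chaining blocks the model currently allows."""
    rng = random.Random(seed)  # seeded, so a failing scenario can be replayed
    vm = VmModel()
    scenario = []
    for _ in range(steps):
        valid = [block for block in BLOCKS if block[1](vm)]
        name, _precondition, effect = rng.choice(valid)
        effect(vm, rng)
        scenario.append(name)
    return scenario, vm


if __name__ == "__main__":
    steps, final_state = build_scenario(seed=2016, steps=10)
    print(steps)
    print(final_state)
```

Different seeds give different orderings of the same few blocks, and the model keeps every generated sequence valid (for example, a NIC is never removed from a VM that has none).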
Private Cloud Validation Approach
Validation Approach: Overview
Stages: Service Level Agreements (SLAs) → Private Cloud Scenarios → Private Cloud Simulation → Accelerated Testing → Signoff
• SLAs
• Focus on metrics that determine the expected availability, throughput and scalability of the private cloud service
• Availability of 99.95% or higher is commonly expected
• Throughput, generally specified in terms of IOPS (transaction rate) or MB/s (streaming), is expected to be maintained
• System is expected to scale seamlessly to accommodate higher load
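To make the 99.95% availability figure concrete, the following sketch converts an availability percentage into a downtime budget. The function name and parameters are illustrative assumptions, not part of any PCS tooling.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, period_days):
    """Downtime budget, in minutes, implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.95, 365))  # budget over one year
    print(allowed_downtime_minutes(99.95, 7))    # budget over one week
```

At 99.95%, the budget is about 262.8 minutes (roughly 4.4 hours) per year, or about 5 minutes over a 7-day period.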
Validation Approach: Overview
• Private Cloud Scenarios
• Excellent Uptime
• VMs expected to get created, stay online and get backed up/restored with
minimal downtime (planned or unplanned)
• Load Balancing
• Infrastructure redistributes VMs across multiple servers with no discernible
end-user impact
• Non-disruptive Admin Activities
• Activities which normally cause system downtime (patching, upgrade,
hardware replacement) shouldn’t disrupt VM users
• Fault Resiliency
• VMs should stay online and not lose any data during incidents like corrupt
disks, tripped network cables, failed servers etc.
Validation Approach: Overview
• Private Cloud Simulation
• VM Creation & boot storm
o Create expected number of VMs
o Boot all expected VMs simultaneously
• Tenant Workload
o Run simulated workload inside each VM
• Live Migration
o Migrate VMs between server nodes
• Backup & Restore
o Do periodic backup of tenant VMs
• Planned, unplanned downtimes
o Activities like pulling network cables, replacing disks, rebooting nodes, etc.
Profile - Device.Storage.Controller
VmCloneAction
VmLiveMigrationAction
StorageNodeFileServerMove
StorageNodePoolMove
VmSnapshotAction
StorageNodeRestart
VmStateChangeAction
StorageNodeBugcheck
VmStorageMigrationAction
VmGuestRestartAction
StorageNodeBusResetAction
StorageNodePortDisableAllAction
VmStartWorkloadAction
StorageNodePortDisableSingleAction
VmGuestFullPowerCycleAction
StorageRetireAndRepairAction
ComputeNodeEvacuation
DisableNetworkAdapters
ComputeNodeEvict
ComputeNodeJoin
Profile - Device.Storage.Enclosure
VmCloneAction
ClusterCSVMove
VmLiveMigrationAction
ComputeNodeEvacuation
VmSnapshotAction
ComputeNodeEvict
VmStateChangeAction
ComputeNodeJoin
VmStorageMigrationAction
StorageNodeRestart
VmGuestRestartAction
StorageNodeBugcheck
VmStartWorkloadAction
StorageNodeBusResetAction
VmGuestFullPowerCycleAction
StorageRetireAndRepairAction
DisableNetworkAdapters
StorageNodePoolMove
Profile - Device.Storage.HD
VmCloneAction
ClusterCSVMove
StorageNodeFileServerMove
StorageNodeDiskReadMediumErrorAction
VmLiveMigrationAction
ComputeNodeEvacuation
StorageNodePoolMove
StorageRetireAndRepairAction
VmSnapshotAction
ComputeNodeEvict
StorageNodeRestart
DisableNetworkAdapters
VmStateChangeAction
ComputeNodeJoin
StorageNodeBugcheck
VmStorageMigrationAction
VmGuestRestartAction
VmStartWorkloadAction
VmGuestFullPowerCycleAction
StorageNodeBusResetAction
StorageNodeDiskIoTimeoutOnceAction
StorageNodeDiskReadTimeoutAction
StorageNodeDiskIoTimeoutSingleDiskAlwaysAction
StorageNodeDiskWriteTimeoutAction
StorageNodeDiskIoTimeoutSingleDiskRandomAction
StorageNodePortDisableAllAction
StorageNodePortDisableSingleAction
StorageNodeDiskAllTimeoutSingleDiskAlwaysAction
StorageNodeDiskAllTimeoutSingleDiskRandomAction
StorageNodeUpdateStorageProviderCacheAction
Profile - Device.Network.LAN.10GbOrGreater
VmCloneAction
VmStartWorkloadAction
VmLiveMigrationAction
VmGuestFullPowerCycleAction
VmSnapshotAction
ClusterCSVMove
VmStateChangeAction
ComputeNodeEvacuation
VmStorageMigrationAction
ComputeNodeEvict
VmGuestRestartAction
ComputeNodeJoin
Profile - Device.Network.LAN
NetCreateTenantVmAction
NetDeleteTenantVmAction
NetCreateVirtualNetworkAction
NetDeleteVirtualNetworkAction
NetCreateVirtualSubnetAction
NetDeleteVirtualSubnetAction
NetCreateVmNetworkAdapterAction
NetConnectVmNetworkAdapterAction
NetDisconnectVmNetworkAdapterAction
NetRunEastWestCrossSubnetTrafficAction
NetRunEastWestSameSubnetTrafficAction
NetPublicIpAddressTrafficAction
NetBugCheckHostAgentAction
NetLoadBalancerEastWestILBTrafficAction
NetLoadBalancerEastWestInterTierTrafficAction
NetLoadBalancerEastWestIntraTierTrafficAction
NetLoadBalancerInboundTrafficAction
NetLoadBalancerNorthSouthTrafficAction
NetLoadBalancerOutboundTrafficAction
VmGuestRestartAction
VmGuestFullPowerCycleAction
VMSnapshotAction
VmLiveMigrationAction
VmStorageMigrationAction
VmStateChangeAction
ComputeNodeRestartAction
ComputeNodeEvacuationAction
ComputeNodeBugcheckAction
ClusterCSVMoveAction
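Each profile is, in effect, a set of action names, so profiles can be compared with ordinary set operations. The sketch below does this for the Device.Storage.Controller and Device.Storage.Enclosure profiles, using the action names exactly as listed on those slides. Representing profiles as Python sets, and the helper `compare_profiles`, are assumptions made for illustration and do not reflect how PCS stores profiles.

```python
# Action names copied from the two profile slides; the set representation
# is an illustrative assumption.
PROFILES = {
    "Device.Storage.Controller": {
        "VmCloneAction", "VmLiveMigrationAction", "StorageNodeFileServerMove",
        "StorageNodePoolMove", "VmSnapshotAction", "StorageNodeRestart",
        "VmStateChangeAction", "StorageNodeBugcheck", "VmStorageMigrationAction",
        "VmGuestRestartAction", "StorageNodeBusResetAction",
        "StorageNodePortDisableAllAction", "VmStartWorkloadAction",
        "StorageNodePortDisableSingleAction", "VmGuestFullPowerCycleAction",
        "StorageRetireAndRepairAction", "ComputeNodeEvacuation",
        "DisableNetworkAdapters", "ComputeNodeEvict", "ComputeNodeJoin",
    },
    "Device.Storage.Enclosure": {
        "VmCloneAction", "ClusterCSVMove", "VmLiveMigrationAction",
        "ComputeNodeEvacuation", "VmSnapshotAction", "ComputeNodeEvict",
        "VmStateChangeAction", "ComputeNodeJoin", "VmStorageMigrationAction",
        "StorageNodeRestart", "VmGuestRestartAction", "StorageNodeBugcheck",
        "VmStartWorkloadAction", "StorageNodeBusResetAction",
        "VmGuestFullPowerCycleAction", "StorageRetireAndRepairAction",
        "DisableNetworkAdapters", "StorageNodePoolMove",
    },
}


def compare_profiles(first, second):
    """Return (shared, only_in_first, only_in_second) action sets."""
    a, b = PROFILES[first], PROFILES[second]
    return a & b, a - b, b - a


if __name__ == "__main__":
    shared, controller_only, enclosure_only = compare_profiles(
        "Device.Storage.Controller", "Device.Storage.Enclosure"
    )
    print(len(shared), "shared actions")
    print("Controller only:", sorted(controller_only))
    print("Enclosure only:", sorted(enclosure_only))
```

As listed, the two profiles share most VM, compute, and storage-node actions. The Controller profile adds the file-server move and port-disable actions, and the Enclosure profile adds ClusterCSVMove.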
Validation Approach: Overview
• Accelerated Testing
• Simulate 1 year of Data Center activity in shorter time
• Example (for Microsoft Cloud Platform Solution)
Sample Hoster Profile for CPS
Cloud Topology
• Tenant VMs: 2000
• VM Profile: 1 vCPU, 1.75 GB
• Compute Nodes: 30
• Storage Nodes: 4 (SOFS+SPACES)
Tenant Workloads & SLA
• Variety of workloads always running
• SLA: Zero impact on workloads
• SLA: Zero IO errors or timeouts
• SLA: Failovers within a minute
Accelerated Testing: From 1 year to 7 days
• VM Live Migrations: 10,152 VMs live migrated
• Storage Migrations: 5,734 VMs storage migrated
• Compute Node Failures: 376 node drain & failovers
• Storage Node Failures: 244 unplanned failovers
• JBOD Power Failures: 1 JBOD failure per day
• Shared Disk Failures: 2 drive failures per pool per day
• NIC, Cable, Switch Failures: 28 NIC/Cable & 2 Switch failures
• SAS HBA & Cable Failures: 8 SAS Cable Pulls & 2 HBA failures
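The compression from 1 year to 7 days can be worked through with a short sketch. It assumes, as the slide title suggests but does not state outright, that the event totals are counts for the whole 7-day run and stand in for one year of datacenter activity. The names `RUN_TOTALS`, `per_test_day`, and `per_simulated_day` are hypothetical.

```python
SIMULATED_DAYS = 365  # one year of datacenter activity
TEST_DAYS = 7         # length of the accelerated run

# Totals taken from the slide; assumed here to cover the whole 7-day run.
RUN_TOTALS = {
    "VM live migrations": 10152,
    "VM storage migrations": 5734,
    "Node drains & failovers": 376,
    "Unplanned failovers": 244,
}


def acceleration_factor(simulated_days=SIMULATED_DAYS, test_days=TEST_DAYS):
    """How many simulated days are packed into each day of testing."""
    return simulated_days / test_days


def per_test_day(total, test_days=TEST_DAYS):
    """Average events per day while the accelerated run is executing."""
    return total / test_days


def per_simulated_day(total, simulated_days=SIMULATED_DAYS):
    """Average events per day in the year of activity being simulated."""
    return total / simulated_days


if __name__ == "__main__":
    print(f"Acceleration: about {acceleration_factor():.0f}x")
    for name, total in RUN_TOTALS.items():
        print(f"{name}: {per_test_day(total):.0f} per test day, "
              f"{per_simulated_day(total):.1f} per simulated day")
```

Under that assumption the run is roughly 52 times denser than normal operation. For example, the live-migration total works out to about 1,450 migrations per test day.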
PCS Overview: Success Criteria
• PCS success
• Based on impact of actions on the target personae
Persona: PCS Success Criteria (aka expected Private Cloud SLA)
• User: Tenant VMs should remain unaffected by admin actions or environment faults at least 99.95% of the time
• Admin: All planned admin actions should succeed at least 90% of the time (in the presence of faults)
• Environment: No unexpected crash of any cluster node (crashes initiated by PCS are excepted)
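The three success criteria can be expressed as a simple pass/fail check per persona. The sketch below is illustrative only: `RunSummary`, its fields, and `evaluate` are hypothetical names, and how PCS actually measures tenant VM impact is not described in the slides. Only the thresholds (99.95%, 90%, zero unexpected crashes) come from the source.

```python
from dataclasses import dataclass

USER_MIN_UNAFFECTED = 0.9995  # tenant VMs unaffected at least 99.95% of the time
ADMIN_MIN_SUCCESS = 0.90      # planned admin actions succeed at least 90% of the time


@dataclass
class RunSummary:
    """Hypothetical summary of one simulation run."""
    vm_unaffected_seconds: float   # tenant VM time with no observed impact
    vm_total_seconds: float        # total tenant VM time observed
    admin_actions_attempted: int
    admin_actions_succeeded: int
    unexpected_node_crashes: int   # crashes NOT initiated by the simulator


def evaluate(run):
    """Return a pass/fail verdict for each persona."""
    if run.vm_total_seconds <= 0 or run.admin_actions_attempted <= 0:
        raise ValueError("run must include VM time and at least one admin action")
    unaffected = run.vm_unaffected_seconds / run.vm_total_seconds
    admin_rate = run.admin_actions_succeeded / run.admin_actions_attempted
    return {
        "User": unaffected >= USER_MIN_UNAFFECTED,
        "Admin": admin_rate >= ADMIN_MIN_SUCCESS,
        "Environment": run.unexpected_node_crashes == 0,
    }


if __name__ == "__main__":
    run = RunSummary(
        vm_unaffected_seconds=999_600,
        vm_total_seconds=1_000_000,
        admin_actions_attempted=100,
        admin_actions_succeeded=92,
        unexpected_node_crashes=0,
    )
    verdict = evaluate(run)
    print(verdict, "-> signoff" if all(verdict.values()) else "-> no signoff")
```

Treating signoff as "all three personae pass" is one reasonable reading of the criteria, not a documented rule.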
Validation Approach: Overview
• Signoff
Indicates that the planned SLAs have been met in the simulated
private cloud deployment (in the presence of faults, admin
actions and with ongoing tenant workload).
Gives confidence that the planned private cloud will stand up to
customer load in real-world deployments.
Questions or Comments?
Thank You!