Reliability Testing in OPNFV
Hao Pang ([email protected])
San Francisco, 11/09/2015

Agenda
• Overview of reliability
• Reliability research in OPNFV
• Introduction to reliability testing
• Reliability testing framework
• Test case demo
• Work plan and related links

Overview of Reliability
Reliability is the ability of a system to perform and maintain its functions in routine circumstances as well as in hostile or unexpected circumstances. High reliability is often defined as a system availability of 99.999%, which meets the carrier-grade requirement and protects carriers from the business and reputation losses caused by system breakdowns. A highly reliable system is characterized by:
• No single point of failure
• Overload control
• Automatic recovery
• Fault isolation
• ……

Reliability Is Critically Important
• More and more business services rely on the Internet and the cloud; even a short outage can cause a huge economic loss.
• When natural disasters happen, telecom reliability underpins national security and life-saving emergency services.

Reliability in NFV and OPNFV
ETSI NFV reliability standards:
• NFV-REL 001: Resiliency Requirements
• NFV-REL 002: Report on Scalable Architectures for Reliability Management
• NFV-REL 003: E2E Reliability Models
• NFV-REL 004: Active Monitoring and Failure Detection
• NFV-REL 005: Quality Accountability Framework
Reliability-related projects on the OPNFV platform: Availability, Doctor, Pinpoint, Multisite, Escalator, Prediction.

The Goal of Reliability Testing
What the customer requires, and the corresponding testing goal:
• The device should ideally never fail → test the product's ability to avoid faults.
• If a fault occurs, it should not affect the main services → test the product's ability to recover from faults.
• If the main services are affected, the fault should be located and recovered as soon as possible → test the product's ability to detect, locate, and tolerate faults.

Availability targets:
• 99.9% (three 9s): about 500 minutes of downtime per year (computer / server)
• 99.99% (four 9s): about 50 minutes per year (enterprise-class device)
• 99.999% (five 9s): about 5 minutes per year (common carrier-class device)
• 99.9999% (six 9s): about 0.5 minutes per year (higher carrier-class device)

Types of Reliability Testing
• Known faults:
  - Fault injection: simulate the fault directly.
  - Scenario fault injection: trigger the fault through an external trigger.
• Unknown faults:
  - Reliability prediction: predict the fault by modeling.
  - Stability testing: trigger the fault through stress scenarios.

Metrics for Service Availability
【General metrics】
• Failure detection time: the interval between the occurrence of a failure and its detection.
• Service recovery time: the interval from the occurrence of an abnormal event (e.g., a failure or a manual interruption of service) until the service has finished recovering.
• Fault repair time: the interval between the detection of a failure and the repair of the faulty entity.
• Service failover time: the time from the moment the failure of the instance providing the service is detected until the service is provided again by a new instance.
【Carrier metrics】
• Network access success rate: the proportion of end users who can access the network when they request it.
• Call drop rate: the proportion of mobiles that, having successfully accessed the network, suffer an abnormal release caused by loss of the radio link.
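To make these definitions concrete, here is a minimal Python sketch (illustrative only, not part of Yardstick) that derives the general metrics from the timestamps of a single failure event and converts an availability target into the annual downtime budget shown above; all names are hypothetical.

# Illustrative sketch only, not Yardstick code: computes the general
# service-availability metrics defined above from event timestamps.
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

@dataclass
class FailureEvent:
    failed_at: float     # when the failure happens (seconds since epoch)
    detected_at: float   # when the failure is detected
    recovered_at: float  # when the service finishes its recovery
    repaired_at: float   # when the faulty entity is repaired

    @property
    def failure_detection_time(self) -> float:
        return self.detected_at - self.failed_at

    @property
    def service_recovery_time(self) -> float:
        return self.recovered_at - self.failed_at

    @property
    def fault_repair_time(self) -> float:
        return self.repaired_at - self.detected_at

def downtime_budget_minutes(availability: float) -> float:
    """Annual downtime allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1.0 - availability)

event = FailureEvent(failed_at=0.0, detected_at=1.5, recovered_at=6.0, repaired_at=120.0)
assert event.service_recovery_time == 6.0
print(downtime_budget_minutes(0.99999))  # -> about 5.26 minutes per year

Note that the slide's downtime figures are rounded: the exact three-nines budget, for example, is about 525.6 minutes per year rather than 500.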
Scope of Reliability Testing (1): Hardware
Hardware reliability testing covers the controller, compute, and storage nodes as well as the networking layer (priority in parentheses):
• Controller node:
  - Controller node server failure (H)
  - Memory failure / memory condition not OK (M)
  - CPU failure / CPU condition not OK (M)
  - Management network failure (M)
  - Storage network failure (M)
• Compute node:
  - Compute node server failure (H)
  - Memory failure / memory condition not OK (M)
  - CPU failure / CPU condition not OK (M)
  - Management network failure (M)
  - Storage network failure (M)
  - Service network failure (M)
• Storage node:
  - Storage node server failure (M)
  - Hard disk failure (M)
• Networking:
  - Hardware failure of a physical switch/router (M)
  - Network interface failure (M)

Scope of Reliability Testing (2): Software
Software reliability testing (robustness tests) covers the controller node, compute node, and VM layers, plus system-level tests:
• Controller node:
  - Host OS failure (H)
  - VIM important process failure (H)
  - VIM important service failure (H)
  - Management network failure, OVS (M)
  - Storage network failure, OVS (M)
• Compute node:
  - Host OS failure (H)
  - VIM important service failure (M)
  - VIM important process failure (M)
  - Management network failure, OVS (M)
  - Storage network failure, OVS (M)
  - Service network failure, OVS (M)
  - Hypervisor (KVM, libvirt, QEMU) failure (H)
• VM:
  - VM failure (H)
  - Guest OS failure (M)
  - VM service failure (L)
  - VM process failure (L)
• System reliability testing: random injection of various faults (M)
• Stability testing: long-duration testing under load (L)

Architecture of Reliability Testing
[Architecture diagram: a Scheduler coordinates an Attacker, a Monitor, an Analyst, and a Reporter around the VNFs and their traffic flow, all running on top of the shared Infrastructure.]

Reliability Testing Framework: Yardstick
• Goal: a test framework for verifying infrastructure compliance when running VNF applications.
• Scope: generic test cases that verify the NFVI from the perspective of a VNF; test stimuli that enable infrastructure testing such as upgrade and recovery; coverage of various aspects: performance, reliability, security, and so on.
• Design: the Yardstick testing framework is flexible enough to support various SUT contexts, and additional plugins for fault injection, system monitoring, and result evaluation are easy to develop. It is also convenient to integrate other existing test frameworks or test tools.

Reliability Testing Workflow
[Workflow diagram: inside Yardstick, a Runner drives the Scenarios within a Context (HeatContext or BMContext); Attacker and Monitor plugins act on the system under test (SUT): VMs, Nova, Neutron, Heat, hosts, and so on.]
1. Input: the task command feeds the test cases (cases.yaml) into Yardstick.
2. Deploy: the context deploys the SUT.
3. Fault injection: the attacker injects the fault into the SUT.
4. Collect data: the monitor collects data from the SUT.
5. Output: the processed results are written to output.json.
6. Undeploy: the context tears the deployment down.
Note: when a fault injection method breaks the SUT (e.g., kills one controller), an optional step that recovers the broken SUT may be executed before the scenario ends.
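The numbered steps map naturally onto a small control loop. Below is a minimal sketch of that loop, under the assumption that deploy/undeploy, fault injection, recovery, and the service probe are supplied as plain callables; none of these names are Yardstick's actual API.

# Minimal sketch of workflow steps 2-6; illustrative only, not Yardstick's API.
import json
import time

def run_case(deploy, undeploy, inject, recover, probe,
             duration=30.0, interval=1.0, sla_outage=5.0):
    deploy()                                        # 2. deploy the SUT context
    try:
        inject()                                    # 3. inject the fault
        samples = []
        end = time.time() + duration
        while time.time() < end:                    # 4. collect data via the monitor
            samples.append((time.time(), probe()))  # probe() -> True if service OK
            time.sleep(interval)
        failed = [t for t, ok in samples if not ok]
        # Approximate the outage as the span of failed probes.
        outage = (max(failed) - min(failed) + interval) if failed else 0.0
        result = {"outage_time": outage, "sla_pass": outage <= sla_outage}
        with open("output.json", "w") as f:         # 5. write the processed output
            json.dump(result, f)
        recover()                                   # optional: restore a broken SUT
        return result
    finally:
        undeploy()                                  # 6. undeploy

In the nova-api case on the following slides, probe would run an OpenStack CLI command against the service and inject would kill the nova-api processes on one controller.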
Case Sample: OpenStack Service HA
Three controller nodes (#1, #2, #3) run nova-api as an HA cluster. The test steps are:
① Yardstick checks the environment.
② Set up the monitor.
③ The monitor watches the nova service.
④ Set up the attacker.
⑤ The attacker injects the fault.
⑥ Get the SLA result.
⑦ Recover the broken nova-api.
⑧ Report the test result.

Test Case Configure and SLA
[The original slide showed the ServiceHA test case configuration and its SLA settings.]
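Since the configuration itself did not survive extraction, here is an illustrative sketch, expressed as a Python dict, of the shape such a ServiceHA case and its SLA might take; the field names and the node name are assumptions, not the exact Yardstick schema.

# Illustrative only: a plausible shape for a ServiceHA test case and its SLA,
# written as a Python dict. Field names are assumptions, not Yardstick's schema.
service_ha_case = {
    "type": "ServiceHA",
    "options": {
        "attacker": {                       # step 5: fault injection
            "fault_type": "kill-process",
            "process_name": "nova-api",
            "host": "node1",                # hypothetical controller node name
        },
        "monitor": {                        # steps 3 and 6: probe the service
            "monitor_type": "openstack-cmd",
            "command_name": "nova image-list",
            "monitor_time": 10,             # seconds to keep probing
        },
    },
    "sla": {
        "outage_time": 5,                   # max tolerated service outage, seconds
        "action": "monitor",
    },
}

In the real framework the case is written in YAML (the cases.yaml fed in at step 1 of the workflow), and the SLA expresses the maximum tolerated outage checked at the "get the SLA" step above.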
Test Case Demo: Before the Fault Injection
1. The nova-api processes are running on Controller Node #3.
2. The nova service is running normally.

Test Case Demo: After the Fault Injection
3. The nova-api processes have been killed.
4. The nova service is still working normally.
5. The ServiceHA test case was run with Yardstick, and its test result was recorded.

Work Plan
• OpenStack controller service failure: verify that the services running on the controller are highly available (priority H, status Done).
• Controller node abnormal shutdown: verify that the controller cluster deployment is highly available (M, Doing).
• Management network timeout: verify that the controller cluster deployment is highly available (M, TODO).
• VM abnormally down: verify that the VM running on the compute node is highly available (M, TODO).

Related Links
• Wiki of the Yardstick project: https://wiki.opnfv.org/yardstick
• Requirements of the HA project: https://etherpad.opnfv.org/p/High_Availabiltiy_Requirement_for_OPNFV
• HA test cases in Yardstick: https://etherpad.opnfv.org/p/yardstick_ha

Copyright © 2015 Huawei Technologies Co., Ltd. All rights reserved.
The information in this document may contain predictive statements, including, without limitation, statements regarding future financial and operating results, the future product portfolio, new technology, and so on. A number of factors could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.