How to verify the reliability of OPNFV

Reliability Testing In OPNFV
Hao Pang ([email protected])
San Francisco 11/09/2015
Agenda
• Overview of reliability
• Reliability research in OPNFV
• Introduction of reliability testing
• Reliability testing framework
• Test case demo
• Work plan and related links
Overview of Reliability
Reliability is the ability of a system to perform and maintain its functions in routine
circumstances, as well as hostile or unexpected circumstances.
High reliability is often defined as a system availability of 99.999%, which meets
the carrier-level requirement and protects carriers from business and reputation
loss due to system breakdown.
Key attributes of a highly reliable system include: no single point of failure, overload control, automatic recovery, fault isolation, and so on.
Reliability Is Critically Important
More and more business services rely on the Internet and the cloud; even a short outage can cause a huge economic loss.
When a natural disaster strikes, telecom reliability underpins national security and emergency response.
Reliability in NFV and OPNFV
ETSI NFV reliability standards:
• NFV-REL 001: Resiliency Requirements
• NFV-REL 002: Report on Scalable Architectures for Reliability Management
• NFV-REL 003: E2E reliability models
• NFV-REL 004: Active monitoring and failure detection
• NFV-REL 005: Quality Accountability Framework
Related OPNFV projects built on the OPNFV Platform: Availability, Doctor, Pinpoint, Multisite, Escalator, Prediction.
The Goal of Reliability Testing
Customer requirement → Reliability testing goal
• The device should ideally never fail → test the product's ability to avoid faults.
• If a fault occurs, it must not affect the main services → test the product's ability to recover from faults.
• If the main services are affected, locate and recover the fault as soon as possible → test the product's ability of fault detection, location and tolerance.
Availability   Number of 9s   Downtime per year (minutes)   Applicable product
99.9%          3              500                           Computer / server
99.99%         4              50                            Enterprise-class device
99.999%        5              5                             Common carrier-class device
99.9999%       6              0.5                           Higher carrier-class device
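As a quick sanity check on the table above, the downtime budget implied by a given availability figure can be computed directly (the table rounds to convenient values; five nines works out to roughly 5.3 minutes per year):

```python
# Yearly downtime budget implied by an availability figure.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Return the allowed downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
    print(f"{nines} nines: {downtime_minutes_per_year(availability):.1f} minutes/year")
```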
Type of Reliability Testing
Reliability tests can be classified along two axes: whether the fault is known or unknown, and whether the triggering scenario is known or unknown. Four types fall out of this classification:
• Fault injection: simulate the fault directly.
• Scenario fault injection: simulate the fault directly, triggered by an external trigger.
• Reliability prediction: predict the fault by modeling.
• Scenario stability testing: trigger the fault through stress.
Metrics for Service Availability
【General Metrics】
Failure detection time: the time interval between when the failure happens and when it is detected.
Service recovery time: the time interval from the occurrence of an abnormal event (e.g. failure, manual interruption of service, etc.) until the service finishes its recovery.
Fault repair time: the time interval between when the failure is detected and when the faulty entity is repaired.
Service failover time: the time from the moment the failure of the instance providing the service is detected until the service is provided again by a new instance.
【Carrier Metrics】
Network access success rate: the proportion of end-user requests that successfully access the network.
Call drop rate: the proportion of mobiles that, having successfully accessed the network, suffer an abnormal release caused by loss of the radio link.
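To make the interval metrics concrete, here is a minimal sketch with made-up timestamps (illustrative values, not data from any real test run):

```python
from datetime import datetime

# Hypothetical event timestamps for a single failure scenario (illustrative only).
failure_occurred  = datetime(2015, 11, 9, 10, 0, 0)
failure_detected  = datetime(2015, 11, 9, 10, 0, 5)
service_recovered = datetime(2015, 11, 9, 10, 0, 12)
fault_repaired    = datetime(2015, 11, 9, 10, 30, 0)

failure_detection_time = failure_detected - failure_occurred   # 5 s
service_recovery_time  = service_recovered - failure_occurred  # 12 s
fault_repair_time      = fault_repaired - failure_detected     # 29 min 55 s

print(failure_detection_time, service_recovery_time, fault_repair_time)
```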
Scope of Reliability Testing (1)
Item: hardware reliability testing (H = high priority, M = medium, L = low)

Controller node:
• Controller node server failure (H)
• Memory failure / memory condition not OK (M)
• CPU failure / CPU condition not OK (M)
• Management network failure (M)
• Storage network failure (M)

Compute node:
• Compute node server failure (H)
• Memory failure / memory condition not OK (M)
• CPU failure / CPU condition not OK (M)
• Management network failure (M)
• Storage network failure (M)
• Service network failure (M)

Storage node:
• Storage node server failure (M)
• Hard disk failure (M)

Networking:
• HW failure of physical switch/router (M)
• Network interface failure (M)
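Many of these hardware-level failure modes can be approximated in a lab by disabling the corresponding resource. For example, a management or storage network failure on one node might be roughly simulated as below; the host and interface names are assumptions about the test bed, not part of any OPNFV tool:

```python
import subprocess
import time

def simulate_network_failure(host: str, interface: str, down_seconds: int = 30) -> None:
    """Take a network interface down on a remote node, wait, then bring it back up."""
    ssh = ["ssh", f"root@{host}"]
    subprocess.check_call(ssh + ["ip", "link", "set", interface, "down"])
    try:
        time.sleep(down_seconds)  # keep the fault active while the monitor observes the SUT
    finally:
        # Restore the interface so the SUT is left usable (assumes SSH reaches the
        # node over a different network than the one that was disabled).
        subprocess.check_call(ssh + ["ip", "link", "set", interface, "up"])
```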
Scope of Reliability Testing (2)
Item: software reliability testing

Controller node:
• Host OS failure (H)
• VIM important process failure (H)
• VIM important service failure (H)
• Management network failure, OVS (M)
• Storage network failure, OVS (M)

Compute node:
• Host OS failure (H)
• VIM important service failure (M)
• VIM important process failure (M)
• Management network failure, OVS (M)
• Storage network failure, OVS (M)
• Service network failure, OVS (M)
• Hypervisor (KVM, libvirt, QEMU) failure (H)

VM:
• VM failure (H)
• Guest OS failure (M)
• VM service failure (L)
• VM process failure (L)

Item: system reliability testing

Robustness test:
• Random injection of various faults (M)

Stability testing:
• Long-duration testing with load (L)
Architecture of Reliability Testing
Main components: a Scheduler, an Attacker, a Monitor, an Analyst, and a Reporter. Traffic flows through the VNFs, which run on the infrastructure under test.
Reliability Testing Framework: Yardstick
Goal:
A test framework for verifying infrastructure compliance when running VNF applications.
Scope:
• Generic test cases to verify the NFVI from the perspective of a VNF.
• Test stimuli to enable infrastructure testing such as upgrade and recovery.
• Coverage of various aspects: performance, reliability, security, and so on.
Design:
The Yardstick testing framework is flexible enough to support various SUT contexts, and it is easy to develop additional plugins for fault injection, system monitoring, and result evaluation. It is also convenient to integrate other existing test frameworks or tools.
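As a hedged sketch of the plugin idea (the class names below are illustrative, not Yardstick's actual class hierarchy), a fault-injection plugin only needs a small interface:

```python
import abc
import subprocess

class AttackerPlugin(abc.ABC):
    """Conceptual fault-injection plugin: set up, inject the fault, then recover."""

    @abc.abstractmethod
    def setup(self) -> None: ...

    @abc.abstractmethod
    def inject_fault(self) -> None: ...

    @abc.abstractmethod
    def recover(self) -> None: ...

class KillProcessAttacker(AttackerPlugin):
    """Kill a named process on a host to simulate a service failure.

    The host name and process name are placeholders for whatever the concrete
    test case targets.
    """

    def __init__(self, host: str, process_name: str):
        self.host = host
        self.process_name = process_name

    def setup(self) -> None:
        # Verify the target process is running before attacking it.
        subprocess.check_call(["ssh", self.host, "pgrep", self.process_name])

    def inject_fault(self) -> None:
        subprocess.check_call(["ssh", self.host, "pkill", "-9", self.process_name])

    def recover(self) -> None:
        # Recovery is service specific; restarting via systemd is one common option
        # (the unit name may differ from the process name).
        subprocess.call(["ssh", self.host, "systemctl", "restart", self.process_name])
```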
Reliability Testing Workflow
The workflow proceeds in the numbered steps below:
1. Input: the test case description (cases.yaml) is turned into a task command for Yardstick, which resolves the Runner, the Context (e.g. HeatContext or BMContext), and the Scenarios.
2. Deploy: the Attacker and Monitor are deployed against the system under test (SUT), e.g. an OpenStack deployment (Nova, Neutron, ......) with its VMs, Heat, and hosts.
3. Fault injection: the Attacker injects the fault into the SUT.
4. Collect data: the Monitor collects data from the SUT.
5. Output: the processed results are written to output.json.
6. Undeploy: the Attacker and Monitor are removed.
Note: when a fault injection method breaks the SUT (e.g. kills one controller), an optional step that recovers the broken SUT may be executed before the scenario is over.
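A minimal sketch of that loop in Python (the object interfaces and field names are illustrative assumptions, not Yardstick's real internals):

```python
import json

def run_reliability_scenario(case: dict, attacker, monitor) -> dict:
    """Drive one fault-injection scenario: deploy, inject, collect, undeploy."""
    monitor.start()                   # 2. deploy: start monitoring the SUT
    attacker.setup()
    try:
        attacker.inject_fault()       # 3. fault injection
        result = monitor.collect()    # 4. collect data (e.g. measured outage time)
    finally:
        attacker.recover()            # optional recovery if the fault broke the SUT
        monitor.stop()                # 6. undeploy

    sla = case.get("sla", {})
    result["sla_pass"] = result.get("outage_time", 0.0) <= sla.get("outage_time", float("inf"))

    with open("output.json", "w") as f:   # 5. output
        json.dump(result, f, indent=2)
    return result
```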
Case Sample: OpenStack Service HA
The SUT is an HA cluster of three controller nodes (Controller Node #1, #2, #3), each running nova-api.
① Check: Yardstick checks the target service on the controller cluster.
② Setup monitor: Yardstick sets up the Monitor.
③ Monitor service: the Monitor watches the nova-api service.
④ Setup attacker: Yardstick sets up the Attacker.
⑤ Fault injection: the Attacker kills the nova-api processes on one controller node.
⑥ Get the SLA: the Monitor measures whether the service still meets the SLA.
⑦ Recover the broken nova-api.
⑧ Test result: Yardstick records the outcome.
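A hedged sketch of the monitoring side of such a case: poll the service while the fault is active and compare the measured outage against the SLA (the health-check command and threshold are examples, not the exact test case definition):

```python
import subprocess
import time

def measure_outage(check_cmd, duration=30.0, interval=0.5):
    """Poll a health-check command and return the longest observed outage, in seconds.

    `check_cmd` is any command whose exit status reflects service health, e.g.
    ["openstack", "server", "list"] to exercise nova-api through its API.
    """
    longest, outage_started = 0.0, None
    deadline = time.time() + duration
    while time.time() < deadline:
        healthy = subprocess.call(check_cmd, stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL) == 0
        now = time.time()
        if not healthy and outage_started is None:
            outage_started = now                          # outage begins
        elif healthy and outage_started is not None:
            longest = max(longest, now - outage_started)  # outage ends
            outage_started = None
        time.sleep(interval)
    if outage_started is not None:                        # still down at the deadline
        longest = max(longest, time.time() - outage_started)
    return longest

if __name__ == "__main__":
    outage = measure_outage(["openstack", "server", "list"])
    print(f"longest outage: {outage:.1f} s, SLA pass: {outage <= 5.0}")
```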
Test Case Configuration and SLA
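The original slide shows the test case configuration as a screenshot. As a hedged illustration, the essential fields of such a ServiceHA case might look roughly like the dictionary below; the field names follow the spirit of the slides rather than the exact Yardstick schema:

```python
# Illustrative ServiceHA-style test case definition (a sketch, not the real cases.yaml).
service_ha_case = {
    "scenario": "ServiceHA",
    "attacker": {
        "type": "kill-process",
        "process_name": "nova-api",
        "host": "controller-node-1",      # hypothetical node name
    },
    "monitor": {
        "type": "openstack-cmd",
        "command": "openstack server list",
        "interval_seconds": 0.5,
    },
    "sla": {
        "outage_time": 5,                 # maximum allowed service outage, in seconds
    },
}
```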
Test Case Demo Show
Before the fault injection:
1. The nova-api processes are running on Controller Node #3.
2. The nova service is running normally.
Test Case Demo Show
After the fault injection:
3. The nova-api processes have been killed.
4. The nova service is still working normally.
5. The ServiceHA test case was run with Yardstick, and its test result was recorded.
Work Plan
• OpenStack controller service failure: verify that the services running on the controller are highly available. (Priority: H, Status: Done)
• Controller node abnormal shutdown: verify that the controller cluster deployment is highly available. (Priority: M, Status: Doing)
• Management network timeout: verify that the controller cluster deployment is highly available. (Priority: M, Status: TODO)
• VM abnormally down: verify that the VM running on the compute node is highly available. (Priority: M, Status: TODO)
Related Links
Wiki of Yardstick project: https://wiki.opnfv.org/yardstick
Requirement of HA project:
https://etherpad.opnfv.org/p/High_Availabiltiy_Requirement_for_OPNFV
HA Test cases in Yardstick: https://etherpad.opnfv.org/p/yardstick_ha
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation,
statements regarding the future financial and operating results, future product portfolio, new technology,
etc. There are a number of factors that could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements. Therefore, such information is provided
for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the
information at any time without notice.