Candy: Component-based Availability Modeling Framework
for Cloud Service Management Using SysML
Fumio Machida1,2, Ermeson Andrade1,3, Dong Seong Kim1 and Kishor S. Trivedi1
1 Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, United States
[email protected], [email protected], [email protected], [email protected]
2 Service Platforms Research Laboratories, NEC Corporation, Kawasaki, Japan
[email protected]
3 Informatics Center, Federal University of Pernambuco (UFPE), Recife, PE, Brazil
[email protected]
Abstract— High-availability assurance of cloud services is a critical and challenging issue for cloud service providers. To quantify the availability of cloud services from both architectural and operational points of view, availability modeling and evaluation are essential. This paper presents a component-based availability modeling framework, named Candy, which constructs a comprehensive availability model semi-automatically from system specifications described in the Systems Modeling Language (SysML). SysML diagrams are translated into components of an availability model, and the components are assembled to form the entire availability model as Stochastic Reward Nets (SRNs). To incorporate the maintenance operations of cloud services in availability models, Candy defines translation rules from activity diagrams to SRNs and synchronizes the related SRNs according to SysML allocation notations. The feasibility of the proposed modeling and availability evaluation process is studied through an illustrative example of a web application service hosted on a cloud infrastructure with multiple failure-isolation zones and an automatic scale-up function.
Keywords: automatic scale-up, availability assessment, cloud service, stochastic reward nets (SRNs), systems modeling language (SysML)
I. INTRODUCTION
Cloud computing is an emerging style of computing
service to provide shared computing resources on the
Internet or private networks on an on-demand basis. Cloud
service providers own service infrastructures and take
responsibility for infrastructure management. The users of
cloud services do not need to own service infrastructures and
can save the costs related to infrastructure management. As
cloud services have become widely used, their availability
has become a major concern for users. Cloud
services occasionally become unavailable due to system
failure or scheduled maintenance. The users, however,
cannot control the downtime of the service because most of
the maintenance operations and recovery processes are
delegated to the cloud service provider. Cloud service
providers should assess the availability of their service
quantitatively and disclose the information to the users.
To analyze the availability of cloud services based on
their system configurations and maintenance operations,
model-based availability assessment provides a reasonable
solution. Analytic models such as Markov chains and
stochastic Petri nets are used to analyze the availability of
complex IT systems [1][2]. However, a comprehensive
availability model for a complex IT system cannot be easily
obtained without expertise in analytic modeling. Cloud
service infrastructures have complex configurations and the
states of the systems are dynamically changing. Although
such complex configurations and dynamic behavior are
understood well by system administrators who are
responsible for maintaining cloud service infrastructures, it is
not easy for them to compose a comprehensive availability
model from scratch. If an availability model can be generated
from the knowledge of system administrators through a well-defined
procedure, it becomes very useful in assessing the
availability of the cloud services.
This paper presents a component-based availability
modeling framework, named Candy, that composes an
availability model for a cloud service semi-automatically from
the system specification model expressed in Systems
Modeling Language (SysML) [3]. SysML is defined as an
extension of Unified Modeling Language (UML) [4] and is
used for modeling complex IT systems and various system
engineering applications. Candy translates the SysML
diagrams into the components of an availability model.
These components are then assembled and synchronized
together according to the dependencies among them in order
to form an entire availability model as stochastic reward nets
(SRNs) [5]. The model translation has been studied in
previous literature [6][7][8][9]. Most of the existing model
translation methods aimed to evaluate the performance of the
system [6][7]. A few studies addressed the availability
assessment [8][9], but they did not incorporate maintenance
operations, which affect system availability.
In contrast to the existing studies, the key contributions
of our work can be summarized as follows. i) SRNs are
composed for the purpose of availability analysis of complex
IT systems. ii) The dynamic behavior of system maintenance
operations is described with activity diagrams and
incorporated in the availability model. In Candy, activity
diagrams are translated into SRNs and synchronized with the
related SRNs to compose the entire availability model. The
proposed synchronization method has not been discussed in
previous literature. iii) Our case study is the first one to
construct analytic models from system specification models
to evaluate the availability of a web application system
hosted on a cloud service infrastructure considering
automatic scale-up function and failure-isolation zones [10].
The rest of this paper is organized as follows. Section II
introduces an illustrative example of a web application
system on a cloud service infrastructure. Section III presents
an overview of Candy. Section IV introduces the use of
SysML diagrams and Section V presents the translation of
SysML diagrams. Section VI shows the model assembly step
and Section VII presents the model synchronization step in
detail. An example case study of a web application system is
presented in Section VIII. Section IX describes related work
and Section X provides the conclusion and future work.
II. WEB APPLICATION SYSTEM HOSTED ON CLOUD
A web application system hosted on a cloud service infrastructure is introduced as an illustrative example. Although our framework is applicable to other types of complex IT systems, we focus on an Infrastructure as a Service (IaaS) cloud provided by a private cloud service provider. Web application service is one of the most popular applications hosted on IaaS clouds such as Amazon EC2 [11]. Compared to traditional web application hosting services, an IaaS cloud provides elastic and highly available computing resources based on server virtualization technology. The following characteristics are especially important in terms of high-availability assurance of the cloud service.
• Failure-isolation zone
The user of an IaaS cloud can easily specify the location for executing the application instances. Each location, called a zone, is isolated from the failures in other zones. The user can make a high-availability configuration by distributing the application instances among multiple zones. A failure-isolation zone is called an Availability Zone in Amazon EC2 [10].
• Automatic scale-up
A cloud infrastructure automatically starts or stops the instances of virtual machines according to the rules defined by users. This scale-up function is useful to maintain the high availability of a web application by keeping the number of running server processes constant in the face of failures of server processes.
• Configurable load balancer
A load balancing service distributes the requests for the web applications to different web server processes across multiple zones. The configuration of the load balancer can be changed dynamically in response to the behavior of the automatic scale-up function.
Based on these characteristics of an IaaS cloud, a configuration of a web application system on a cloud service infrastructure is depicted in Figure 1. The web application system is composed of web server processes and database (DB) server processes, and is deployed on two isolation zones. Each server process is assumed to run on its own virtual machine. A configurable load balancer (LB) distributes the user requests to the web server processes running on different zones. In each zone, the number of running web server processes is maintained by the automatic scale-up function. The function is configured to keep four web servers running across the two zones. The four web servers are equally divided between the zones (i.e., each zone hosts two web servers) as long as both zones are available. If a zone becomes unavailable, the scale-up function automatically increases the number of web servers up to four to meet the performance requirements. A DB server process is replicated by a DB hot standby deployed on the other zone. When the active DB server fails, the DB hot standby takes over the operation until the active DB server is recovered.
[Figure 1: a configurable load balancer (LB) in front of Isolation zone-1 and Isolation zone-2; the zones host four web servers under automatic scale-up, a DB server (active) and a DB hot standby.]
Figure 1. Configuration of a web application on an IaaS cloud
The web application system on a cloud infrastructure
may achieve higher availability than the system with
traditional configuration which does not have zones and
automatic scale-up function. To evaluate the availability of
the system quantitatively, an analytic model that captures
both system configuration and maintenance operations is
required. In the following sections, we present a framework
to compose the availability model from SysML models and
apply the framework to the web application system.
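The scale-up rule described in this section can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `web_servers_per_zone` and its parameters are hypothetical, and the reading that the surviving zone runs all four web servers is our interpretation of the configuration, not a published algorithm.

```python
def web_servers_per_zone(zone_up, target=4):
    """Return how many web servers to run in each zone.

    zone_up: list of booleans, True when the zone is available.
    target:  total number of web servers the scale-up function keeps running
             (four in the example configuration; an assumed parameter name).
    """
    available = [i for i, up in enumerate(zone_up) if up]
    counts = [0] * len(zone_up)
    if not available:
        return counts  # no zone can host a web server
    # Divide the target equally among the available zones.
    share, extra = divmod(target, len(available))
    for rank, zone in enumerate(available):
        counts[zone] = share + (1 if rank < extra else 0)
    return counts


both_up = web_servers_per_zone([True, True])
one_down = web_servers_per_zone([True, False])
```

With both zones available the sketch assigns two web servers to each zone; when one zone is down, the remaining zone is scaled up to four.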
III. OVERVIEW OF CANDY
[Figure 2: SysML diagrams (IBD, STM and AD) are converted by (1) model translation, using model templates, into model components; (2) model assembly produces the System SRN; (3) model synchronization links the Activity SRN with the System SRN, with system administrators assigning guard functions.]
Figure 2. Availability modeling steps using Candy
Candy is a component-based availability modeling
framework for assessing the availability of the systems from
system specifications described by SysML. An overview of
the availability modeling steps in Candy is shown in Figure 2.
Candy supports three types of SysML diagrams as input
models: Internal Block Diagram (IBD), State Machine
Diagram (STM) and Activity Diagram (AD). IBD represents
static system configurations such as the composition of
server processes with and without their redundancy structure.
STM describes state transitions of a specific system element
(e.g., failure and recovery behavior of a server process). AD
describes behavior of maintenance operations (e.g.,
scheduled server shutdown by a system administrator). The
dependencies between the elements of the diagrams are
specified by the notation called allocation.
Model specification is followed by three steps:
(1) model translation, (2) model assembly and (3)
model synchronization. In the first step, Candy translates the
SysML diagrams into components of the availability model
named model components. The model components represent
system elements in IBDs, state transitions in STM, or flows
of specific system maintenance operations described in AD.
The model component generated by AD is named Activity
SRN. Candy defines reusable templates of model
components and uses the templates in the model translation.
In the second step, Candy assembles the model
components generated from IBD and STM according to
allocation notations. We define some stereotypes for SysML
allocation to specify the type of the allocations. The resulting
assembled model is named System SRN that represents the
system configurations and state transitions of each system
element. The marking of a System SRN can be affected by system
maintenance operations described in ADs. As an example, a
state of a server process represented in System SRN can be
changed by a system maintenance operation represented in
Activity SRN such as a scheduled shutdown.
In the third step, Activity SRN is synchronized to System
SRN by identifying the relationships between actions in
Activity SRN and state transitions in System SRN. Candy
guides system administrators to clarify the relationships
between the SRNs and to define the guard functions for
synchronization. Once an availability model composed of
System SRNs and Activity SRNs is obtained, various
availability measures can be computed by using the software
packages supporting SRN such as SHARPE [12] and SPNP
[13]. The following sections describe each step in detail.
IV. SYSML BASED SYSTEM DESCRIPTION
Candy uses three types of SysML diagrams (IBD, STM
and AD) to describe system configurations and behavior of
cloud services. The details of system descriptions by SysML
diagrams are introduced in this section.
A. Internal Block Diagram (IBD)
IBD is used to describe the static system configuration
such as logical functions, process structure and hardware
configurations with and without their redundancies. The
diagram is depicted with internal blocks and connectors.
Figure 3 shows the IBD representation of the process
structure of the web application system introduced in
Section II. The web application consists of an LB process, two distinct web server process blocks, a DB server process and a DB hot standby process. The multiplicity denoted in a block represents the level of redundancy of the system element. For example, each of the two web server process blocks has multiplicity "2" in Figure 3, which means that there are four web server processes in total. A special redundant component such as a hot standby process is specified by a stereotype, as shown in the block of the DB hot standby process (i.e., <<hot standby>>). For a block representing a logical function, we define a "require" property, which indicates the number of processes required to maintain the function properly. The value of the "require" property is used in the generation of the reward function in Section VI-D.
[Figure 3: IBD "Process structure" with the blocks LB Process, Web server process [2] (two blocks), DB server process and DB hot standby (<<hot standby>>), and a <<standby>> allocation between DB hot standby and DB server process.]
Figure 3. IBD representing a process structure of web application system
B. State Machine Diagram (STM)
State transitions of a system element are described using
STM. In an STM, a state is depicted as a rounded rectangle,
and a transition from one state to another state is represented
by an arrow. In Candy, STM is used to describe the failure
and recovery behavior of a system element. An example of
an STM for a web server process is shown in Figure 4(a). A
web server process starts from Stop state. The web process
starts the operation when the start up command is invoked,
and then the process enters Running state. We define the
Boolean property “available” that specifies the available
states of the system element. The property value is set to true
only when the system element works properly in the state. In
this example, Running is the only available state according to
the “available” property. If the server process fails during the
operation, the process enters Failed. When the server process
is recovered, the process returns to Running. The web server
process is terminated by shutdown operation and enters Stop.
[Figure 4: (a) STM "Web process state machine" with the states Stop {available=false}, Running {available=true} and Failed {available=false}, and the transitions Start up, Shut down, Fail and Recover; (b) AD "Scheduled server shutdown" with a wait time action (every 24 hours), a Server state check action, a decision with the guards [state!=Running] and [state==Running], and a <<control>> Server shutdown action.]
Figure 4. (a) STM representing the state transition of web server process, and (b) AD representing the scheduled server shutdown operation
C. Activity Diagram (AD)
AD is used for describing behavior of system
maintenance operations performed by system administrators
or management middleware. An AD is represented by a set
of activity nodes and directed edges representing the flow
among the activity nodes. The activity nodes used in this
paper are summarized in Figure 7 in Section V-D. An
example of an AD for server maintenance activity is shown
in Figure 4(b). The activity starts from an initial node,
represented by a filled circle, and flows into the wait time
action depicted as an hour glass. The wait time action
invokes the next action for checking server state every 24
hours. If the checked server state is Running, the server is
shut down by the subsequent action. The guards for decision
condition are specified on the outgoing edges from the
decision node. According to the decision output, the activity
ends at either one of the final nodes. To specify the action
that affects a state transition of a system element, we
introduce a new stereotype <<control>> for action nodes. In
this example, the server shutdown action is annotated with
<<control>> because the server state is supposed to change
from Running to Stop by this action.
D. SysML Allocation
SysML allocation is used to represent a general cross-association of elements in SysML diagrams. According to the SysML specification, allocation can be used in various contexts in order to ensure the flexibility of model descriptions. To reuse SysML diagrams designed for system specification for availability modeling purposes, we introduce some stereotypes of allocations to specify the meaning of relationships between system elements. We define five stereotypes <<transition>>, <<hosted>>, <<standby>>, <<process>> and <<operation>> as summarized in Table I. Details of each stereotype are discussed in the following sections.
TABLE I. STEREOTYPES IN CANDY
Stereotype | Target | Description
<<hot standby>> | Block | A block for a hot standby element
<<control>> | Action | An action that may induce a state transition of a system element
<<transition>> | Allocation | An allocation from an STM to a corresponding block in IBD
<<standby>> | Allocation | A relationship between a standby element and an active element
<<hosted>> | Allocation | A dependency between two blocks having hosting relationship
<<process>> | Allocation | An association between blocks representing a logical function and processes implementing the function
<<operation>> | Allocation | An allocation from AD to blocks affected in IBDs
V. MODEL TRANSLATION
This section introduces model translation from SysML
diagrams to model components. Since all model components
are based on SRN, first SRN is introduced in brief.
A. SRN
SRN extends Generalized Stochastic Petri Net (GSPN)
by introducing reward functions, guard functions and general
marking dependencies. A reward function defines the reward
rate for each tangible marking of the Petri net. Various
quantitative measures such as steady-state availability can be
computed by defining the corresponding reward functions. A
guard function assigned to a transition specifies a condition that
enables or disables the transition, in addition to the constraints
imposed by priority, input arcs, and inhibitor arcs. More
details on SRN can be found in [5].
B. IBD translation
Each block in IBD represents a system element such as a
server process. In terms of the availability of each element, a
system element has at least two states, namely Up state and
Down state. The blocks in IBD can be translated into the
same elemental model component shown in Figure 5(a).
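[Figure 5: (a) elemental model component with the places Pup (one initial token) and Pdown and the transitions Tfail and Trecv; (b) cluster model component with n initial tokens in Pup and a marking-dependent Tfail, marked "#".]
Figure 5. The elemental model component and cluster model component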
In the elemental model component, a token deposited in Pup implies that the system element is available. The transition Tfail fires when the system element goes down; a token is then removed from Pup and deposited in Pdown, so the system element is no longer available. Trecv fires when the system element recovers, removing a token from Pdown and depositing a token in Pup. A redundant element is represented by the number of tokens in the model component. Assuming that all redundant elements are available in the initial state and the transition rate to the Down state depends on the number of available elements, the redundant elements can be translated into the cluster model component shown in Figure 5(b). The symbol “#” near Tfail indicates that the transition rate depends on the number of tokens in Pup [5].
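To make the behavior of the cluster model component concrete, the following sketch computes its steady-state distribution by hand. It is a minimal illustration, not part of Candy: it assumes exponentially distributed firing times, a Tfail rate proportional to the number of tokens in Pup, and a Trecv rate that does not depend on the marking. The function names and the `require` threshold used as a 0/1 reward are hypothetical.

```python
def cluster_steady_state(n, fail_rate, recv_rate):
    """Steady-state distribution of the number of tokens in Pup.

    State k means k tokens in Pup and n - k tokens in Pdown.
    Tfail fires at rate k * fail_rate (marking dependent, the "#" symbol);
    Trecv fires at rate recv_rate whenever Pdown is not empty (assumption).
    Returns a list p where p[k] is the probability of k tokens in Pup.
    """
    weights = [0.0] * (n + 1)
    weights[n] = 1.0
    # Birth-death balance: p[k-1] * recv_rate = p[k] * k * fail_rate
    for k in range(n, 0, -1):
        weights[k - 1] = weights[k] * k * fail_rate / recv_rate
    total = sum(weights)
    return [w / total for w in weights]


def availability(n, fail_rate, recv_rate, require=1):
    """Expected value of a reward that is 1 when #Pup >= require, else 0."""
    p = cluster_steady_state(n, fail_rate, recv_rate)
    return sum(p[k] for k in range(require, n + 1))
```

For n = 1 the sketch reduces to the elemental model component, whose steady-state availability is recv_rate / (fail_rate + recv_rate).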
If the block is annotated with a stereotype for special
system element, the information is used to generate a special
model component. We define the model template for
instantiating a model component from a stereotyped block.
For instance, the block denoted with <<hot standby>> is
translated to the specific model component representing hot
standby element. The right part of Figure 9 shows an
example of a model component for hot standby. Other types
of redundant elements also can be incorporated by defining
special model templates with the stereotype. The dependency
between active system element and standby element is
discussed in Section VI-B in detail.
C. STM translation
STM can be translated into a model component by
converting each state into a place and each transition into a
timed transition with input/output arcs. Figure 6 shows the
translated model component for the STM introduced in
Figure 4(a) of a web server process. From the values of
“available” property, Prunning can be identified as a place
representing an available state of web server process.
[Figure 6: model component with the places Pstop (one initial token), Prunning and Pfailed and the transitions Tstart, Tdown, Tfail and Trecover.]
Figure 6. Model component translated from STM for a web server process
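The STM translation rule (each state becomes a place, each transition becomes a timed transition with input and output arcs) can be sketched as follows. The data layout and the function name `translate_stm` are hypothetical choices for illustration and do not reflect Candy's actual implementation; firing rates are omitted.

```python
def translate_stm(states, transitions, initial, multiplicity=1):
    """Translate a state machine into a Petri-net-like model component.

    states:       dict mapping state name -> value of the "available" property
    transitions:  list of (name, source_state, target_state)
    initial:      name of the initial state
    multiplicity: number of tokens placed in the initial place
    """
    places = {s: "P" + s.lower() for s in states}
    net = {
        "places": sorted(places.values()),
        "marking": {p: 0 for p in places.values()},
        "transitions": {},
        # Places whose state has available=true represent the up states.
        "up_places": sorted(places[s] for s, ok in states.items() if ok),
    }
    net["marking"][places[initial]] = multiplicity
    for name, src, dst in transitions:
        net["transitions"]["T" + name.lower()] = {
            "input": places[src],
            "output": places[dst],
        }
    return net


# The web server process STM of Figure 4(a).
web = translate_stm(
    states={"Stop": False, "Running": True, "Failed": False},
    transitions=[
        ("start", "Stop", "Running"),
        ("down", "Running", "Stop"),
        ("fail", "Running", "Failed"),
        ("recover", "Failed", "Running"),
    ],
    initial="Stop",
)
```

The resulting component uses the same place and transition names as Figure 6, and Prunning is the only place identified as an available state.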
D. AD translation
AD is translated into SRN by converting each action into
related places, timed/immediate transitions and guard
functions. We define the translation rules for each node type
of AD as shown in Figure 7. The translation rules
automatically generate blank guard functions for specific
transitions such as gin-act and gout-act. The definitions of the
guard functions are specified in the synchronization step
discussed in Section VII.
[Figure 7: translation rules mapping each activity node type to SRN elements]
Initial node: Pini, Tini
Action node: Pact, Tact
Control action node (<<control>>): Pin-act, Tact [gin-act], Pout-act, Tout-act [gout-act]
Decision node: Pindet, Tdet, Poutdet, Toutdet1 [goutdet1], Toutdet2 [goutdet2]
Final node: Pfin, Tfin
Wait time action: Pwait, Twait, Pclock, Tclock, Treset, Ptrigger
Figure 7. Translation rules for activity nodes
[Figure 8: Activity SRN chaining the initial node (Pini, Tini), the wait time action (Pwait, Twait, Pclock, Tclock, Treset, Ptrigger), the action node (Pact1, Tact1), the decision node (Pindet, Tdet, Poutdet, Toutdet1 [goutdet1], Toutdet2 [goutdet2]), the control action node (Pin-act2, Tact2 [gin-act2], Pout-act2, Tout-act2 [gout-act2]) and the final nodes (Pfin1, Tfin1, Pfin2, Tfin2).]
Figure 8. Activity SRN for a server maintenance operation
Figure 8 shows an example of translation from AD for the server maintenance activity shown in Figure 4(b). As described in the AD, server state is checked every 24 hours by a timed transition Tclock with deterministic firing time denoted by a filled rectangle. The decision node is translated into Pindet with the timed transition Tdet and Poutdet with two outgoing arcs to the immediate transitions Toutdet1 and Toutdet2. Each immediate transition has a guard function for expressing the decision guard in the AD. The translations of the final nodes include the outgoing arcs to the initial place, which means the activity starts repeatedly on the timer event.
VI. MODEL ASSEMBLY
This section introduces the model assembly step to compose System SRNs by assembling model components translated from IBDs and STMs in accordance with the allocation notations in SysML diagrams.
A. Detailed state transitions
First, model components translated from STMs replace the elemental/cluster model components generated from IBDs. A <<transition>> allocation indicates the relationship between an STM and the corresponding block in the IBD. Since an STM provides more detailed state transitions of a system element, Candy replaces the elemental/cluster model component derived from the IBD with the model component derived from the STM according to the <<transition>> allocation. The number of tokens in the model component is set by the multiplicity of the block in the IBD.
B. Standby element
A <<standby>> allocation specifies a failover relationship from a standby element to an active element. When the active element fails, the standby element takes over the operation of the active element. The standby element switches back to standby state after the active element is recovered. The failover behavior can be represented by guard functions for the model component for the standby element. Let m1 be a model component for a standby element and let m2 be a model component for an active element. A <<standby>> allocation is directed from m1 to m2 as shown in Figure 9. The component m1 has the transitions Tfover for failover and Tfbac for switch back. In the model assembly step, Candy generates the guard functions g1fover and g1fbac to specify the conditions of failover and switch back. Let us define the set of places representing the up states (down states) for mi as Ui (Di). Ui and Di can be derived from the "available" property in the STM or the definition of model templates. The transition Tfover is enabled by g1fover when a token in m2 is deposited in D2 (P2down belongs to D2 in Figure 9). Similarly, the guard function g1fbac enables the transition Tfbac when a token in m2 is deposited in U2 (P2up belongs to U2). The definitions of g1fover and g1fbac are summarized in Table II. #Ui and #Di represent the number of tokens in Ui and Di, respectively.
[Figure 9: active system element m2 (P2up ∈ U2, P2down ∈ D2, T2fail, T2recv) and hot standby element m1 (Phot, Pup, Pdown, Precvd, Thot, Thotfail, Tfail, Trecv, Tfover [g1fover], Tfbac [g1fbac]), linked by a <<standby>> allocation from m1 to m2.]
Figure 9. Model assembly for <<standby>> allocation
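The guard functions for the <<standby>> allocation can be written out as a short sketch. It follows the definitions "if (#D2 > 0) 1 else 0" and "if (#U2 > 0) 1 else 0" given for g1fover and g1fbac; the Python function names, the marking dictionary and the helper `count_tokens` are hypothetical and only serve the illustration.

```python
def count_tokens(marking, places):
    """Number of tokens held in a set of places (the #U or #D notation)."""
    return sum(marking.get(p, 0) for p in places)


def g1_fover(marking, d2):
    """Enable Tfover when the active element has a token in a down place."""
    return 1 if count_tokens(marking, d2) > 0 else 0


def g1_fbac(marking, u2):
    """Enable Tfbac when the active element has a token in an up place."""
    return 1 if count_tokens(marking, u2) > 0 else 0


# Up and down place sets of the active element m2, as in Figure 9.
U2, D2 = {"P2up"}, {"P2down"}
active_up = {"P2up": 1, "P2down": 0}
active_down = {"P2up": 0, "P2down": 1}
```

With `active_down` the failover guard evaluates to 1 and the switch-back guard to 0; with `active_up` the values are reversed.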
C. Hosted dependency
A <<hosted>> allocation represents a hosted dependency between system elements such as a dependency between a virtual machine and a physical server. State transitions of a hosted element are enabled only when the corresponding hosting element is available. If the hosting element goes down, the hosted element becomes unavailable at the same time. This hosted dependency can be represented by the guard functions in the model components for the hosted element. Let m3 be a model component for a hosted element (e.g., a virtual machine) and let m4 be a model component for a hosting element (e.g., a physical server). A <<hosted>> allocation is directed from m3 to m4 as shown in Figure 10.
[Figure 10: hosted element m3 (P3up ∈ U3, P3down ∈ D3, T3fail, T3recv [g3up4], T3dw41 [g3dw4], with P3dw41 = P3down) and hosting element m4 (P4up ∈ U4, P4down ∈ D4, T4fail, T4recv), linked by a <<hosted>> allocation from m3 to m4.]
Figure 10. Model assembly for <<hosted>> allocation
For each place in U3, a new immediate transition T3dw4i is introduced to deposit a token into one of the places P3dw4i ∈ D3 when the hosting element goes down. The guard function g3dw4 enables all T3dw4i transitions when a token in m4 is deposited in D4. Figure 10 shows an example of an additional immediate transition T3dw41 connected from P3up ∈ U3 to P3down ∈ D3. P3down is selected as P3dw41 in this case. Since the state transition of the hosted element never happens during the downtime of the hosting element, all the transitions in m3 connected from P3dw4i are disabled by the guard function g3up4 while a token is deposited in D4. In the example in Figure 10, T3recv is disabled by g3up4. The definitions of the guard functions are summarized in Table II.
If D3 contains more than one place, Candy needs to select a place P3dw4i from D3. This selection can be supported by an additional property such as "failure" specifying the failure state in STM. If such a property is not used, Candy chooses one place according to a certain rule (e.g., choose the nearest place in D3 for each place in U3).
D. Process implementation
In Candy, both the logical functions and the processes are represented by blocks in IBDs. A logical function is implemented by the associated processes. The function is available as long as the associated processes work properly. This relationship is specified by a <<process>> allocation as shown in Figure 11. The <<process>> allocation from the LB function indicates the associated LB server process. The "require" property specifies the required number of processes for the function. From the <<process>> allocation and the "require" property, a reward function for the availability of the logical function can be generated. Let m5 be a model component for a logical function and m6 be a model component for the process implementing the function. A <<process>> allocation is directed from m5 to m6. In the model assembly step, m5 is replaced with m6. The reward function r5 is defined with n5, which represents the value of the "require" property of m5, as shown in Table II.
[Figure 11: IBD block LB function (Require=1) with a <<process>> allocation to the block LB process; the corresponding model components are m5 (P5up, P5down, T5fail, T5recv) and m6 (P6up ∈ U6, P6down, T6fail, T6recv).]
Figure 11. IBDs representing logical function and processes
TABLE II. GUARD AND REWARD FUNCTIONS FOR MODEL ASSEMBLY
Allocation type | Function name | Definition
Standby component | g1fover | if (#D2 > 0) 1 else 0
Standby component | g1fbac | if (#U2 > 0) 1 else 0
Hosted dependency | g3dw4 | if (#D4 > 0) 1 else 0
Hosted dependency | g3up4 | if (#D4 == 0) 1 else 0
Process implementation | r5 | if (#U6 >= n5) 1 else 0
VII. MODEL SYNCHRONIZATION
System SRNs and Activity SRNs are dependent on each other. An action in an Activity SRN may induce state changes in System SRNs. On the other hand, the flows in an Activity SRN may change depending on a marking in the System SRNs. In this section, these dependencies are incorporated by introducing additional guard functions.
A. Synchronization of action node
Since an action stereotyped as <<control>> in an Activity SRN affects state transitions in a System SRN, guard functions to associate the related transitions are required. For a transition Tact in an Activity SRN, a system administrator needs to find the corresponding transition Ttr in the System SRN in accordance with the <<operation>> allocation. The search for Ttr can be automated by introducing a naming rule for actions in AD and transitions in STM, or by using a specific allocation from the action to the corresponding transitions.
[Figure 12: Activity SRN side with Pin-act, Tact [gin-act], Pout-act and Tout-act [gout-act]; System SRN side in which the transition Ttr is expanded into Tin-tr [gin-tr], Pin-tr and Ttr [gout-tr].]
Four guard functions for synchronization:
1. gin-tr : if(#Pin-act == 1) 1 else 0 end
2. gin-act : if(#Pin-tr == 1) 1 else 0 end
3. gout-tr : if(#Pout-act == 1) 1 else 0 end
4. gout-act : if(#Pin-tr == 0) 1 else 0 end
Figure 12. Synchronization of Tact in Activity SRN and Ttr in System SRN
To synchronize the action to the transition, the transition
Ttr is expanded with an immediate transition Tin-tr and a place
Pin-tr with an inhibitor arc as shown in the right part of Figure
12. Tin-tr and Pin-tr represent the action invocation and action
execution state, respectively. An inhibitor arc from Pin-tr to
Tin-tr ensures that the action affects a single system element
(e.g., a server process) at a time. For all the related
transitions, Candy generates four guard functions gin-tr, gin-act,
gout-tr, and gout-act to make the transitions fire in a consistent order
as shown in Figure 12. The first guard function gin-tr
represents the trigger of the action. The second function gin-act
ensures the start of state transition. The third guard function
gout-tr represents the end of the action and the fourth guard
function gout-act ensures that the state transition completes.
A. SysML design
specified by SysML diagrams. From the SysML diagrams, SRNs are composed through Candy and the system availability is computed by SPNP [13].
[Figure 15: three IBDs. "Web application system" contains the blocks Web function, LB function and DB function with their Require properties; "Process structure" contains Web process 1 [2], Web process 2 [2], LB Process, DB process and DB hot standby (<<hot standby>>); "Zone configuration" contains Zone 1 and Zone 2. The diagrams are linked by <<process>>, <<standby>> and <<hosted>> allocations.]
Figure 15. IBDs for the web application system on an IaaS cloud
[Figure 13: Activity SRN fragment (Pin-act2, Tact2 [gin-act2], Pout-act2, Tout-act2 [gout-act2]) beside the expanded System SRN (Pstop, Prunning, Pfailed, Pin-down, Tstart, Tin-down [gin-down], Tdown [gout-down], Tfail, Trecover).]
gin-down : if(#Pin-act2 == 1) 1 else 0 end
gout-down : if(#Pout-act2 == 1) 1 else 0 end
gin-act2 : if(#Pin-down == 1) 1 else 0 end
gout-act2 : if(#Pin-down == 0) 1 else 0 end
Figure 13. An example of expanded System SRN and guard functions
Figure 13 shows an example of the synchronization
between Tact2 in the Activity SRN shown in Figure 8 and
corresponding transition Tdown in the System SRN in Figure 6.
The transition Tdown is expanded with an immediate
transition Tin-down and a place Pin-down. The four guard
functions are automatically generated for the related
transitions. As a result of the synchronization, the shutdown
action definitely changes the state of the server process.
B. Synchronization of decision node
If a condition of a decision node in an AD depends on the
state of a system element, guard functions in Activity SRN
can be defined by a marking of the corresponding System
SRN. Since guard conditions for decision node in AD are not
specified formally, a manual interpretation of the guard
condition by system administrator is required. An associated
part of the System SRN can be identified by <<operation>>
allocation. In our example, the guard functions, Toutdet1 and
Toutdet2 in Figure 8, are described with the number of tokens
in Prunning as shown in Figure 14.
goutdet1 : if(#Prunning == 0) 1 else 0 end goutdet2 : if(#Prunning > 0) 1 else 0 end
Figure 14. An example of guard functions for the decision outputs
Due to manual configurations by system administrators,
definitions of guard functions might contain some errors. To
avoid the human errors in the guard function definition, we
can introduce a constraints language for SysML diagrams
such as Object Constraint Language (OCL) [14] which
specifies guard condition in a formal way. However, such
constraint language restricts the flexibility of SysML
description. In our case study, we assume that the definitions
of guard functions are correctly determined by system
administrator from the original AD.
VIII. CASE STUDY
As a case study, we recall the web application system
example introduced in Section II. Configurations and
maintenance operations of the web application system are
[Figure 16 diagram: ad "automatic scale up", triggered every 5 mins. Actions: Check the state of the other zone; Check the number of available processes (na) and unused processes (nu); <<control>> Start a new process; <<control>> Stop a process. Decision guards on the zone state: [Up], [Down]. Guards in the Up branch: [na <2 and nu >0], [na ==2 or nu ==0], [na >=2]. Guards in the Down branch: [na <4 and nu >0], [na >=4 or nu ==0].]
Figure 16. AD representing automatic scale-up function for web processes
For the web application system illustrated in Figure 1,
system administrators (or designers) first use IBDs to
represent the configuration of the system. Figure 15 shows
the IBD representation of the system. There are three layers
of IBDs, which represent the logical functions of the
application system, the process structure, and the zone
configuration. Each block in the top-level IBD has a
<<process>> allocation to the blocks in the middle-level IBD,
and each block in the middle-level IBD has a <<hosted>>
allocation to the blocks representing the zones in the
bottom-level IBD. In an IaaS cloud, each process is assumed to run
on its own virtual machine. The maximum number of web
processes is five, and the required number of processes for
proper web function is four in total. The relationship between
the DB process and the DB hot standby process is represented by
a <<standby>> allocation. Next, the detailed state transitions
of system elements are designed using an STM. In this case
study, the STM shown in Figure 4 is used for all of the web
processes. The relationship between the STM in Figure 4 and
the blocks for the web processes in Figure 15 is specified
with a <<transition>> allocation (this allocation is not
depicted in the figures). Finally, an AD is used to represent the
behavior of the automatic scale-up function. Figure 16 shows the
AD for the automatic scale-up function applied to web processes
in a zone. The <<operation>> allocation connects the AD in
Figure 16 to the affected blocks for web processes in the IBD in
Figure 15 (this allocation is not depicted in the figures).
The activity starts the action to check the state of the
other zone every five minutes as long as its own zone
is available. Depending on the availability of the other zone,
the desired number of web processes changes. If the other
zone is in the Up state, the automatic scale-up function keeps the
number of web processes in its own zone at two: if the
number of available web processes is less than two, it starts a
new web process as long as there are unused processes
(nu>0) out of the five processes; conversely, if the number of
processes is more than two, it stops one of the running web
processes. If the other zone is in the Down state, the
automatic scale-up function tries to increase the number of
processes in its own zone up to four.
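The decision logic described above can be written down compactly. The following is a minimal sketch of one run of the activity, assuming the thresholds stated in the text; the function name scale_up_action and the returned strings are hypothetical and do not appear in Candy.

```python
def scale_up_action(other_zone_up: bool, na: int, nu: int) -> str:
    """Return the action chosen by one run of the automatic scale-up activity.

    na: number of available web processes in the own zone
    nu: number of unused web processes in the own zone
    The result is "start", "stop" or "none" (hypothetical labels).
    """
    if other_zone_up:
        # Other zone is Up: keep two web processes in the own zone.
        if na < 2 and nu > 0:
            return "start"
        if na > 2:
            return "stop"
        return "none"
    # Other zone is Down: try to reach four web processes in the own zone.
    if na < 4 and nu > 0:
        return "start"
    return "none"


print(scale_up_action(True, 1, 3))   # one process short while the other zone is Up
print(scale_up_action(True, 3, 2))   # one process too many while the other zone is Up
print(scale_up_action(False, 2, 3))  # other zone is Down, scale toward four
```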
B. Model translation
Candy first translates the blocks in the IBDs into model
components based on the model templates. The cluster
model component shown in Figure 5(b) is used to instantiate
the model components for the web processes. The hot
standby model component is generated for the DB hot
standby, as indicated by the stereotype <<hot standby>> (as
described in Section V-B). The other blocks in the IBDs are
translated into the elemental model components shown in
Figure 5(a). The STM for the web processes is subsequently
translated into the model components in Figure 6. Finally, the
ADs for the automatic scale-up function are translated by the
translation rules described in Figure 7. The obtained Activity
SRN for the automatic scale-up function of the web processes is
shown in Figure 17. For the web server processes in
zone-1 and zone-2, two Activity SRNs for the automatic
scale-up functions are generated.
Figure 17. Activity SRN for automatic scale-up activity
C. Model assembly
Candy assembles the model components generated from the IBDs
and the STM into System SRNs according to the stereotyped
allocations. First, the cluster model components for the web
processes are replaced by the model component generated from
the STM, following the <<transition>> allocation (see Section
VI-A). Next, the guard functions for failover behavior are
generated by the assembly method for <<standby>> (see Section
VI-B). The generated guard functions are summarized in Table III.
According to the <<hosted>> allocations, immediate
transitions and associated guard functions for the server
processes are generated (see Section VI-C). Finally, reward
functions for the three logical functions are generated, as shown
in Table III. The obtained System SRNs are summarized in Figure 18.
TABLE III. GUARD AND REWARD FUNCTIONS FOR SYSTEM SRN
Allocation    Function name         Definition
<<hosted>>    gDBupZ1, gWEB1upZ1    if (#Pz1down == 0) 1 else 0
<<hosted>>    gDHupZ2, gWEB2upZ2    if (#Pz2down == 0) 1 else 0
<<hosted>>    gDBdwZ1, gWEB1dwZ1    if (#Pz1down > 0) 1 else 0
<<hosted>>    gDHdwZ2, gWEB2dwZ2    if (#Pz2down > 0) 1 else 0
<<standby>>   gDHfover              if (#PDBdown > 0) 1 else 0
<<standby>>   gDHfbac               if (#PDBup > 0) 1 else 0
<<process>>   rLB                   if (#PLBup >= 1) 1 else 0
<<process>>   rWEB                  if (#PWEB1up+#PWEB2up >= 4) 1 else 0
<<process>>   rDB                   if (#PDBup+#PDHup >= 1) 1 else 0
Figure 18. System SRNs obtained by the model assembly process
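To illustrate how the entries of Table III follow from the stereotyped allocations, the sketch below generates the same guard and reward strings from short names. It only mirrors the naming pattern visible in Table III; the helper functions, their arguments, and the way names are concatenated are assumptions for illustration, not Candy's actual implementation.

```python
def hosted_guards(process: str, zone: str) -> dict:
    """Guards for a process hosted on a zone (<<hosted>> allocation).

    The zone place name is spelled as in Table III (e.g. Pz1down).
    """
    zone_down = f"P{zone.lower()}down"
    return {
        f"g{process}up{zone}": f"if (#{zone_down} == 0) 1 else 0",
        f"g{process}dw{zone}": f"if (#{zone_down} > 0) 1 else 0",
    }


def standby_guards(primary: str, standby: str) -> dict:
    """Failover and failback guards for a <<standby>> allocation."""
    return {
        f"g{standby}fover": f"if (#P{primary}down > 0) 1 else 0",
        f"g{standby}fbac": f"if (#P{primary}up > 0) 1 else 0",
    }


def reward_function(up_places: list, require: int) -> str:
    """Reward for a logical function that needs `require` working processes."""
    total = "+".join(f"#{place}" for place in up_places)
    return f"if ({total} >= {require}) 1 else 0"


print(hosted_guards("DB", "Z1"))
print(standby_guards("DB", "DH"))
print(reward_function(["PWEB1up", "PWEB2up"], 4))
```

The Require value of each logical function in the IBD becomes the threshold of its reward function, which is why rWEB uses 4 while rLB and rDB use 1.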
D. Model synchronization
In the Activity SRN, Tact4, Tact5 and Tact6 affect the state
transitions of web processes in the System SRNs. From the
names of the actions in the AD, system administrators can
easily find the corresponding transitions for those actions. The
start server action is associated with TWEB1start (in Figure 18),
and the stop server action is associated with TWEB1stop (in
Figure 18). TWEB1start and TWEB1stop are expanded with
immediate transitions as introduced in Section VII-A,
and the guard functions for synchronization are generated
according to the rules shown in Figure 12. The updated
System SRN for web server processes on zone-1 is shown in
Figure 19. Since the maximum number of web processes is
five, the additional three tokens are deposited in PWEB1down.
Figure 19. Updated System SRN for web server processes
The guard functions for decision guards need to be
defined in model synchronization, as described in Section
VII-B. Since the decision outputs of the place Poutdet1 (in
Figure 17) depend on the availability of the other zone, we
can define the guard functions goutdet11 and goutdet12 using the
marking of Pz2up (in Figure 18). The decision outputs of the
places Poutdet2 and Poutdet3 depend on the number of available
web processes and unused web processes in the zone itself. The
guard functions goutdet21, goutdet22, goutdet23, goutdet31 and goutdet32
are defined using the markings of PWEB1up and PWEB1stop (in
Figure 19). Table IV shows the guard functions obtained
through model synchronization for the web processes on
zone-1. The AD for the web processes on zone-2 is
synchronized with the System SRN in the same manner.
TABLE IV. GUARD FUNCTIONS GENERATED BY SYNCHRONIZATION
Action          Function              Definition
Process start   gin-WEB1start         if (#Pinact4 == 1 || #Pinact6 == 1) 1 else 0
Process start   ginact4, ginact6      if (#Pin-WEB1start == 1) 1 else 0
Process start   gout-WEB1start        if (#Poutact4 == 1 || #Poutact6 == 1) 1 else 0
Process start   goutact4, goutact6    if (#Pin-WEB1start == 0) 1 else 0
Process stop    gin-WEB1stop          if (#Pinact5 == 1) 1 else 0
Process stop    ginact5               if (#Pin-WEB1stop == 1) 1 else 0
Process stop    gout-WEB1stop         if (#Poutact5 == 1) 1 else 0
Process stop    goutact5              if (#Pin-WEB1stop == 0) 1 else 0
Decision        goutdet11             if (#PZ2up > 0) 1 else 0
Decision        goutdet12             if (#PZ2up == 0) 1 else 0
Decision        goutdet21             if (#PWEB1up == 2 || #PWEB1stop == 0) 1 else 0
Decision        goutdet22             if (#PWEB1up < 2 && #PWEB1stop > 0) 1 else 0
Decision        goutdet23             if (#PWEB1up > 2) 1 else 0
Decision        goutdet31             if (#PWEB1up < 4 && #PWEB1stop > 0) 1 else 0
Decision        goutdet32             if (#PWEB1up >= 4 || #PWEB1stop == 0) 1 else 0
E. Numerical results of availability evaluation
We use SPNP [13] to compute system availability from
the model created by Candy. Since the web application
system is the composition of the LB function, web function
and DB function, the reward function for the system
availability is expressed as:
$$A_{sys} = \Pr\big((\#P_{LBup} \ge 1) \land (\#P_{WEB1up} + \#P_{WEB2up} \ge 4) \land (\#P_{DBup} + \#P_{DHup} \ge 1)\big)$$
TABLE V. DEFAULT PARAMETERS USED IN THE EVALUATION
Parameter name                     Assigned transitions                            Value [1/h]
LB process failure rate            TLBfail                                         0.00011415
LB process recovery rate           TLBrecv                                         0.5
WEB server failure rate            TWEB1fail, TWEB2fail                            0.00069444
WEB server recovery rate           TWEB1recv, TWEB2recv                            1
WEB server startup/shutdown rate   TWEB1start, TWEB2start, TWEB1stop, TWEB2stop    60
DB server failure rate             TDBfail, TDHfail                                0.00023148
DB server recovery rate            TDBrecv, TDHrecv                                0.5
DB hot standby rate                TDHhot                                          12
DB failover/switch rate            TDHfover, TDHfbac                               60
DB hot standby failure rate        TDHfail                                         0.00013889
Zone failure rate                  TZ1fail, TZ2fail                                0.00011415
Zone recovery rate                 TZ1recv, TZ2recv                                0.25
Scale-up trigger rate              Twait                                           12
Action rate                        TactX, TdetX                                    3600
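As a rough plausibility check of the rates in Table V, the steady-state availability of a component that alternates between an up and a down state with exponential failure and recovery times is the recovery rate divided by the sum of both rates. The sketch below applies this textbook formula to the LB process and to a zone. Treating these components as independent two-state models is an assumption of this example, and the function name is hypothetical.

```python
def two_state_availability(failure_rate: float, recovery_rate: float) -> float:
    """Steady-state availability of an up/down component (rates in 1/h)."""
    return recovery_rate / (failure_rate + recovery_rate)


# Rates taken from Table V.
lb_availability = two_state_availability(0.00011415, 0.5)
zone_availability = two_state_availability(0.00011415, 0.25)

print(f"LB process: {lb_availability:.6f}")
print(f"Zone:       {zone_availability:.6f}")
```

Under this assumption the LB value rounds to 0.999772, which agrees with the rLB entry reported in Table VI.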
The default parameters used in the evaluation are
summarized in Table V. The failure and recovery rates of
a zone are set in accordance with the declared availability of
a zone in Amazon EC2 [15]. Most of the other parameters
are reasonable guesstimates. In our case study, all timed
transitions are assumed to be exponentially distributed
except the deterministic transition Twait, for which we use a
10-stage Erlang approximation [5]. The expected reward rate for
each function and the availability Asys are summarized in
Table VI. Note that Asys is not equal to the product of the
expected reward rates of the individual functions because the web
function and the DB function are not independent of each other.
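The Erlang approximation of the deterministic trigger can be made concrete with a few lines. A k-stage Erlang delay is a sequence of k exponential stages; if each stage has rate k times the overall rate, the mean delay is preserved and the coefficient of variation drops to 1/sqrt(k). The sketch below applies this standard construction to the Twait rate of 12 per hour from Table V. How the stages are actually encoded in the SRN is not stated in the text, so the helper below is an assumption for illustration.

```python
import math


def erlang_stages(rate_per_hour: float, k: int):
    """Return (stage rate, mean delay in hours, coefficient of variation)
    for a k-stage Erlang delay whose mean equals 1 / rate_per_hour."""
    stage_rate = k * rate_per_hour   # every stage is exponential with this rate
    mean_delay = k / stage_rate      # sum of k stage means
    cv = 1.0 / math.sqrt(k)          # relative spread shrinks as k grows
    return stage_rate, mean_delay, cv


stage_rate, mean_delay_h, cv = erlang_stages(12.0, 10)
print(f"stage rate: {stage_rate} per hour")
print(f"mean delay: {mean_delay_h * 60.0:.2f} minutes")
print(f"coefficient of variation: {cv:.3f}")
```

With ten stages the mean stays at five minutes while the spread is roughly a third of that of a single exponential transition, which is why the Erlang delay is closer to a fixed timer.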
TABLE VI. AVAILABILITY NUMERICAL RESULTS
             rLB        rWEB       rDB        Asys
Two zones    0.999772   0.999804   0.999993   0.999571
One zone     0.999772   0.999025   0.999420   0.998778
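The remark that Asys differs from the product of the per-function reward rates can be verified directly from the values in Table VI. The sketch below recomputes the product and also converts Asys into expected downtime per year. The 8760-hour year and the helper name summarize are assumptions of this example.

```python
HOURS_PER_YEAR = 8760  # 365 days, used only for the downtime conversion

# Values copied from Table VI.
TABLE_VI = {
    "two zones": {"rLB": 0.999772, "rWEB": 0.999804, "rDB": 0.999993, "Asys": 0.999571},
    "one zone":  {"rLB": 0.999772, "rWEB": 0.999025, "rDB": 0.999420, "Asys": 0.998778},
}


def summarize(row):
    """Return (product of the three reward rates, downtime in hours per year)."""
    product = row["rLB"] * row["rWEB"] * row["rDB"]
    downtime_hours = (1.0 - row["Asys"]) * HOURS_PER_YEAR
    return product, downtime_hours


for name, row in TABLE_VI.items():
    product, downtime = summarize(row)
    print(f"{name}: product={product:.6f}, Asys={row['Asys']:.6f}, "
          f"downtime={downtime:.2f} h/year")
```

For two zones the product (about 0.999569) is close to Asys, whereas for one zone it is noticeably lower (about 0.998218 against 0.998778), which is consistent with the web and DB functions failing together when they share a zone.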
To study the effect of multiple failure-isolation zones,
we built another model in which all of the web processes and
DB processes belong to the same zone. This modification is
easily performed on the SysML diagrams. We redirect the
<<hosted>> allocation of the DB hot standby from zone-2 to zone-1
and delete web process-2 and zone-2 from the IBD. The
AD for the automatic scale-up function is simplified by deleting
the action for checking the state of the other zone and
increasing the desired number of web processes to four. The
availability model for the modified SysML diagrams can be
generated by the same procedure in Candy. We also conduct a
sensitivity analysis with respect to the time interval of the
automatic scale-up function. The comparison results are
summarized in Table VI and Figure 20. The results quantify the
advantage of using two zones in the cloud infrastructure. The
effectiveness of the automatic scale-up function depends on the
interval at which the operation is triggered: a shorter trigger
interval achieves higher system availability. The result implies
that the trigger interval of the automatic scale-up function is
not a negligible factor for system availability.
[Figure 20 plot: system availability (vertical axis, 0.997 to 1) versus trigger interval of scale-up function in minutes (horizontal axis, 0 to 200), with one curve each for two zones and one zone.]
Figure 20. Sensitivity analysis of time interval of scale-up function
IX. RELATED WORK
Major public cloud service providers assess the
availability of their services based on empirical data. Web
sites called service dashboards summarize the current
availability and the status histories [16][17]. In
contrast to the availability information provided by the
service dashboards, model-based availability assessment
gives a reasonable prediction of the availability of cloud
services based on the architecture and system management
operations.
A few research studies have addressed the availability
management of cloud services. A dynamic regeneration
technique for software components in the cloud, which restores
redundancy after a failure, was presented in [18]. Dynamic
resource management of cloud service infrastructure under
availability constraints was studied in [19]. FTCloud was
presented as a framework for the optimal selection
of software fault-tolerance techniques for building cloud
applications [20]. In contrast to these works, our
research focuses on the availability assessment of cloud services
based on the architecture and maintenance operations. Candy
guides system administrators in quantifying the availability of
cloud services from system descriptions in SysML.
X. CONCLUSIONS AND FUTURE WORK
This paper has presented Candy, a component-based
availability modeling framework that composes a
comprehensive availability model for cloud services from the
system specifications described in SysML diagrams. The
framework semi-automatically translates the elements of
SysML diagrams into model components, and the
components are assembled and synchronized together to
form the whole availability model according to stereotyped
allocations. The modeling method based on the proposed
framework is demonstrated with an illustrative example of a
web application system on an IaaS cloud. The composed
availability model is used to evaluate the effectiveness of the
automatic scale-up function and failure-isolation zones.
Candy is under implementation as a part of system design
and/or management tools. NEC has an in-house SysML
modeling tool for system design called CASSI [21]. We
developed a prototype implementation of model translation
using an export function of CASSI. We will implement the
model assembly and synchronization steps in the near future.
The scalability of modeling is an important issue for
future work. We may introduce model decomposition
techniques [1] to cope with the scalability issue.
REFERENCES
[1] D. S. Kim, F. Machida, and K. S. Trivedi, Availability modeling and analysis of a virtualized system, In Proc. of 15th Pacific Rim Int. Symp. on Dependable Computing (PRDC), 2009.
[2] W. E. Smith, K. S. Trivedi, L. A. Tomek, and J. Ackaret, Availability analysis of blade server systems, IBM Systems J., vol. 47, no. 4, 2008.
[3] OMG Systems Modeling Language (OMG SysML) Version 1.2, http://www.omg.org/spec/SysML/1.2/
[4] OMG Unified Modeling Language (OMG UML), Superstructure Version 2.3, http://www.omg.org/spec/UML/2.3/
[5] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley, New York, 2001.
[6] J. P. López-Grao, J. Merseguer, and J. Campos, From UML Activity Diagrams To Stochastic Petri Nets, In Proc. of the 4th Int. Workshop on Software and Performance (WOSP), pp. 25-36, 2004.
[7] S. Distefano, M. Scarpa, and A. Puliafito, From UML to Petri Nets: the PCM-Based Methodology, IEEE Trans. on Soft. Eng., Jan. 2010.
[8] A. Bondavalli, I. Majzik, and I. Mura, Automated Dependability Analysis of UML Designs, In Proc. 2nd Int. Symp. on Object-oriented Real-time distributed Computing (ISORC), 1999.
[9] G. J. Pai and J. Dugan, Automatic Synthesis of Dynamic Fault Trees from UML System Models, In Proc. 13th Int. Symp. on Software Reliability Engineering (ISSRE), 2002.
[10] J. Barr, A. Narin, and J. Varia, Building Fault-Tolerant Applications on AWS, http://media.amazonwebservices.com/AWS_Building_Fault_Tolerant_Applications.pdf, 2010.
[11] M. Tavis, Web application hosting in the AWS Cloud - Best Practices, http://media.amazonwebservices.com/AWS_Web_Hosting_Best_Practices.pdf, 2010.
[12] K. S. Trivedi and R. Sahner, SHARPE at the age of twenty two, SIGMETRICS Perform. Eval. Rev., vol. 36, no. 4, pp. 52-57, 2009.
[13] G. Ciardo, A. Blakemore, P. F. Chimento, J. K. Muppala, and K. S. Trivedi, Automated generation and analysis of Markov reward models using stochastic reward nets, in: C. Meyer, R. Plemmons (Eds.), Linear Algebra, Markov Chains and Queuing Models, vol. 48, Springer, pp. 145-191, 1993.
[14] OMG Object Constraint Language (OCL), http://www.omg.org/spec/OCL/2.2
[15] Amazon EC2 SLA, http://aws.amazon.com/ec2-sla/
[16] AWS Service Health Dashboard, http://aws.amazon.com/ec2-sla/
[17] Google AppEngine Status, http://code.google.com/status/appengine
[18] G. Jung, K. R. Joshi, M. A. Hiltunen, R. D. Schlichting, and C. Pu, Performance and Availability Aware Regeneration For Cloud Based Multitier Applications, In Proc. of Int. Conf. on Dependable Systems and Networks (DSN), 2010.
[19] B. Addis, D. Ardagna, B. Panicucci, and L. Zhang, Automatic Management of Cloud Services Centers with Availability Guarantees, In Proc. of 3rd Int. Conf. on Cloud Computing (CLOUD), 2010.
[20] Z. Zheng, T. C. Zhou, M. R. Lyu, and I. King, FTCloud: A Component Ranking Framework for Fault-Tolerant Cloud Applications, In Proc. of 21st Int. Symp. on Software Reliability Engineering (ISSRE), pp. 398-407, 2010.
[21] S. Izukura, et al., Applying a Model-Based Approach to IT Systems Development using SysML Extension, To appear in Proc. of Int. Conf. on Model Driven Engineering Languages and Systems, 2011.
[22] G. Ciardo and K. S. Trivedi, A Decomposition Approach for Stochastic Petri Net Models, Performance Evaluation, vol. 18, 1993.