Candy: Component-based Availability Modeling Framework for Cloud Service Management Using SysML

Fumio Machida1,2, Ermeson Andrade1,3, Dong Seong Kim1 and Kishor S. Trivedi1
1 Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, United States
2 Service Platforms Research Laboratories, NEC Corporation, Kawasaki, Japan
3 Informatics Center, Federal University of Pernambuco (UFPE), Recife, PE, Brazil

Abstract: High-availability assurance of cloud services is a critical and challenging issue for cloud service providers. To quantify the availability of cloud services from both architectural and operational points of view, availability modeling and evaluation are essential. This paper presents a component-based availability modeling framework, named Candy, which constructs a comprehensive availability model semi-automatically from system specifications described in the Systems Modeling Language (SysML). SysML diagrams are translated into components of an availability model, and the components are assembled to form the entire availability model as Stochastic Reward Nets (SRNs). To incorporate the maintenance operations of cloud services in availability models, Candy defines translation rules from activity diagrams to SRNs and synchronizes the related SRNs according to SysML allocation notations. The feasibility of the proposed modeling and availability evaluation process is studied through an illustrative example of a web application service hosted on a cloud infrastructure with multiple failure-isolation zones and an automatic scale-up function.

Keywords: automatic scale-up, availability assessment, cloud service, stochastic reward nets (SRNs), systems modeling language (SysML)

I. INTRODUCTION

Cloud computing is an emerging style of computing service that provides shared computing resources over the Internet or private networks on an on-demand basis. Cloud service providers own the service infrastructures and take responsibility for infrastructure management. The users of cloud services do not need to own service infrastructures and can save the costs related to infrastructure management. As cloud services have come into wide use, their availability has become a major concern for users. Cloud services occasionally become unavailable due to system failure or scheduled maintenance. The users, however, cannot control the downtime of the service because most of the maintenance operations and recovery processes are delegated to the cloud service provider. Cloud service providers should therefore assess the availability of their services quantitatively and disclose the information to the users.

Model-based availability assessment is a reasonable way to analyze the availability of cloud services based on their system configurations and maintenance operations. Analytic models such as Markov chains and stochastic Petri nets are used to analyze the availability of complex IT systems [1][2]. However, a comprehensive availability model for a complex IT system cannot be obtained easily without expertise in analytic modeling. Cloud service infrastructures have complex configurations, and the states of the systems change dynamically. Although such complex configurations and dynamic behavior are well understood by the system administrators who are responsible for maintaining cloud service infrastructures, it is not easy for them to compose a comprehensive availability model from scratch. If an availability model could be generated from the knowledge of system administrators through a well-defined procedure, it would be very useful in assessing the availability of cloud services.
This paper presents a component-based availability modeling framework named Candy, which composes an availability model for a cloud service semi-automatically from the system specification model expressed in the Systems Modeling Language (SysML) [3]. SysML is defined as an extension of the Unified Modeling Language (UML) [4] and is used for modeling complex IT systems and various systems engineering applications. Candy translates the SysML diagrams into the components of an availability model. These components are then assembled and synchronized according to the dependencies among them to form the entire availability model as stochastic reward nets (SRNs) [5].

Model translation has been studied in previous literature [6][7][8][9]. Most of the existing model translation methods aim to evaluate the performance of the system [6][7]. A few studies address availability assessment [8][9], but they do not incorporate maintenance operations, which do affect system availability. In contrast to the existing studies, the key contributions of our work can be summarized as follows.
i) SRNs are composed for the purpose of availability analysis of complex IT systems.
ii) The dynamic behavior of system maintenance operations is designed with activity diagrams and incorporated in the availability model. In Candy, activity diagrams are translated into SRNs and synchronized with the related SRNs to compose the entire availability model. The proposed synchronization method has not been discussed in previous literature.
iii) Our case study is the first to construct analytic models from system specification models to evaluate the availability of a web application system hosted on a cloud service infrastructure, considering the automatic scale-up function and failure-isolation zones [10].

The rest of this paper is organized as follows. Section II introduces an illustrative example of a web application system on a cloud service infrastructure.
Section III presents an overview of Candy. Section IV introduces the use of SysML diagrams and Section V presents the translation of SysML diagrams. Section VI shows the model assembly step and Section VII presents the model synchronization step in detail. An example case study of a web application system is presented in Section VIII. Section IX describes related work and Section X provides the conclusion and future work.

II. WEB APPLICATION SYSTEM HOSTED ON CLOUD

A web application system hosted on a cloud service infrastructure is introduced as an illustrative example. Although our framework is applicable to other types of complex IT systems, we focus on an Infrastructure as a Service (IaaS) cloud provided by a private cloud service provider. Web application services are among the most popular applications hosted on IaaS clouds such as Amazon EC2 [11]. Compared to traditional web application hosting services, an IaaS cloud provides elastic and highly available computing resources based on server virtualization technology. The following characteristics are especially important in terms of high-availability assurance of the cloud service.

Failure-isolation zone: The user of an IaaS cloud can easily specify the location for executing the application instances. Each location, called a zone, is isolated from the failures in other zones. The user can build a high-availability configuration by distributing the application instances among multiple zones. A failure-isolation zone is called an Availability Zone in Amazon EC2 [10].

Automatic scale-up: A cloud infrastructure automatically starts or stops instances of virtual machines according to rules defined by users. This scale-up function is useful for maintaining the high availability of a web application by keeping the number of running server processes constant in the face of server process failures.

Configurable load balancer: A load balancing service distributes the requests for the web application to different web server processes across multiple zones. The configuration of the load balancer can be changed dynamically in response to the behavior of the automatic scale-up function.

Based on these characteristics of an IaaS cloud, a configuration of a web application system on a cloud service infrastructure is depicted in Figure 1. The web application system is composed of web server processes and database (DB) server processes, and is deployed on two isolation zones. Each server process is assumed to run on its own virtual machine. A configurable load balancer (LB) distributes the user requests to the web server processes running on different zones. In each zone, the number of running web server processes is maintained by the automatic scale-up function. The function is configured to keep four web servers running across the two zones. The four web servers are divided equally between the zones (i.e., each zone hosts two web servers) as long as both zones are available. If a zone becomes unavailable, the scale-up function automatically increases the number of web servers in the remaining zone up to four to meet the performance requirements. A DB server process is replicated by a DB hot standby deployed on the other zone. When the active DB server fails, the DB hot standby takes over the operation until the active DB server is recovered.

[Figure 1. Configuration of a web application on an IaaS cloud. Elements shown: configurable load balancer (LB); isolation zone-1 and isolation zone-2; four web servers under automatic scale-up; DB server (active) and DB hot standby.]

The web application system on a cloud infrastructure may achieve higher availability than a system with a traditional configuration, which has neither zones nor an automatic scale-up function. To evaluate the availability of the system quantitatively, an analytic model that captures both the system configuration and the maintenance operations is required.
In the following sections, we present a framework to compose the availability model from SysML models and apply the framework to the web application system.

III. OVERVIEW OF CANDY

Candy is a component-based availability modeling framework for assessing the availability of systems from system specifications described in SysML. An overview of the availability modeling steps in Candy is shown in Figure 2.

[Figure 2. Availability modeling steps using Candy. SysML diagrams (IBD, STM, AD) go through (1) model translation, which uses model templates to produce model components, including the Activity SRN; (2) model assembly, which produces the System SRN; and (3) model synchronization, in which system administrators assign guard functions, resulting in the SRNs.]

Candy supports three types of SysML diagrams as input models: Internal Block Diagram (IBD), State Machine Diagram (STM) and Activity Diagram (AD). IBD represents static system configurations such as the composition of server processes with and without their redundancy structure. STM describes the state transitions of a specific system element (e.g., the failure and recovery behavior of a server process). AD describes the behavior of maintenance operations (e.g., a scheduled server shutdown by a system administrator). The dependencies between the elements of the diagrams are specified by the notation called allocation.

Model specification is followed by three steps: (1) model translation, (2) model assembly and (3) model synchronization. In the first step, Candy translates the SysML diagrams into components of the availability model, named model components. The model components represent system elements in IBDs, state transitions in STMs, or flows of specific system maintenance operations described in ADs. The model component generated from an AD is named the Activity SRN. Candy defines reusable templates of model components and uses the templates in the model translation.
In the second step, Candy assembles the model components generated from IBD and STM according to the allocation notations. We define some stereotypes for SysML allocation to specify the type of the allocations. The resulting assembled model is named the System SRN; it represents the system configurations and the state transitions of each system element. The marking of a System SRN can be affected by system maintenance operations described in ADs. As an example, the state of a server process represented in the System SRN can be changed by a system maintenance operation represented in an Activity SRN, such as a scheduled shutdown. In the third step, the Activity SRN is synchronized with the System SRN by identifying the relationships between actions in the Activity SRN and state transitions in the System SRN. Candy guides system administrators in clarifying the relationships between the SRNs and in defining the guard functions for synchronization. Once an availability model composed of System SRNs and Activity SRNs is obtained, various availability measures can be computed by using software packages that support SRNs, such as SHARPE [12] and SPNP [13]. The following sections describe each step in detail.

IV. SYSML BASED SYSTEM DESCRIPTION

Candy uses three types of SysML diagrams (IBD, STM and AD) to describe the system configurations and behavior of cloud services. The details of system descriptions by SysML diagrams are introduced in this section.

A. Internal Block Diagram (IBD)

IBD is used to describe the static system configuration, such as logical functions, process structure and hardware configurations, with and without their redundancies. The diagram is depicted with internal blocks and connectors. Figure 3 shows the IBD representation of the process structure of the web application system introduced in Section II. The web application consists of an LB process, two distinct web server processes, a DB server process and a DB hot standby process.
The multiplicity denoted in a block represents the level of redundancy of the system element. For example, the multiplicity of each of the two web server process blocks is specified as "2" in Figure 3, which means that there are four web server processes in total. A special redundant component such as a hot standby process is specified by a stereotype, as shown in the block of the DB hot standby process (i.e., <<hot standby>>). For a block representing a logical function, we define the "require" property, which indicates the number of processes required to maintain the function properly. The value of the "require" property is used in the generation of the reward function in Section VI-D.

[Figure 3. IBD representing a process structure of web application system. Blocks: LB process; two web server process blocks, each with multiplicity [2]; DB server process; DB hot standby stereotyped <<hot standby>>, with a <<standby>> allocation to the DB server process.]

B. State Machine Diagram (STM)

State transitions of a system element are described using STM. In an STM, a state is depicted as a rounded rectangle, and a transition from one state to another is represented by an arrow. In Candy, STM is used to describe the failure and recovery behavior of a system element. An example of an STM for a web server process is shown in Figure 4(a). A web server process starts from the Stop state. The process starts operating when the start-up command is invoked and then enters the Running state. We define the Boolean property “available”, which specifies the available states of the system element. The property value is set to true only when the system element works properly in the state. In this example, Running is the only available state according to the “available” property. If the server process fails during operation, the process enters Failed. When the server process is recovered, the process returns to Running. The web server process is terminated by the shutdown operation and enters Stop.
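The STM description above can be captured as a small data structure. The following is a minimal Python sketch, not part of Candy itself: the class names `State` and `StateMachine` and the helper `web_process_stm` are hypothetical, chosen only to show how the “available” property singles out the available states of the web server process.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class State:
    name: str
    available: bool  # value of the "available" property of the state


@dataclass
class StateMachine:
    name: str
    initial: str
    states: dict = field(default_factory=dict)       # state name -> State
    transitions: list = field(default_factory=list)  # (source, label, target)

    def add_state(self, name, available):
        self.states[name] = State(name, available)

    def add_transition(self, source, label, target):
        # Reject transitions that refer to undeclared states.
        if source not in self.states or target not in self.states:
            raise ValueError(f"unknown state in transition {source} -> {target}")
        self.transitions.append((source, label, target))

    def available_states(self):
        # States in which the element is considered to work properly.
        return {s.name for s in self.states.values() if s.available}


def web_process_stm():
    """Build the web server process STM as described in the text (hypothetical helper)."""
    stm = StateMachine("Web process state machine", initial="Stop")
    stm.add_state("Stop", available=False)
    stm.add_state("Running", available=True)
    stm.add_state("Failed", available=False)
    stm.add_transition("Stop", "Start up", "Running")
    stm.add_transition("Running", "Shut down", "Stop")
    stm.add_transition("Running", "Fail", "Failed")
    stm.add_transition("Failed", "Recover", "Running")
    return stm


if __name__ == "__main__":
    print(web_process_stm().available_states())
```

Keeping the “available” flag on the state itself means that the set of up states of an element can be derived mechanically from the diagram, without a separate list maintained by hand.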
[Figure 4. (a) STM representing the state transition of web server process: states Stop {available=false}, Running {available=true} and Failed {available=false}, with transitions Start up, Shut down, Fail and Recover. (b) AD representing the scheduled server shutdown operation: every 24 hours a server state check is performed, followed by a decision with guards [state==Running] and [state!=Running]; the Running branch leads to the <<control>> action Server shutdown.]

C. Activity Diagram (AD)

AD is used for describing the behavior of system maintenance operations performed by system administrators or management middleware. An AD is represented by a set of activity nodes and directed edges representing the flow among the activity nodes. The activity nodes used in this paper are summarized in Figure 7 in Section V-D. An example of an AD for a server maintenance activity is shown in Figure 4(b). The activity starts from an initial node, represented by a filled circle, and flows into the wait time action, depicted as an hourglass. The wait time action invokes the next action, which checks the server state, every 24 hours. If the checked server state is Running, the server is shut down by the subsequent action. The guards for the decision condition are specified on the outgoing edges from the decision node. According to the decision output, the activity ends at one of the final nodes. To specify an action that affects a state transition of a system element, we introduce a new stereotype <<control>> for action nodes. In this example, the server shutdown action is annotated with <<control>> because the server state is supposed to change from Running to Stop by this action.

D. SysML Allocation

SysML allocation is used to represent a general cross-association of elements in SysML diagrams. According to the SysML specification, allocation can be used in various contexts in order to ensure the flexibility of model descriptions.
To reuse SysML diagrams designed for system specification for the purpose of availability modeling, we introduce some stereotypes of allocations to specify the meaning of the relationships between system elements. We define five allocation stereotypes, <<transition>>, <<hosted>>, <<standby>>, <<process>> and <<operation>>, as summarized in Table I together with the block and action stereotypes. Details of each stereotype are discussed in the following sections.

TABLE I. STEREOTYPES IN CANDY

Stereotype      | Target     | Description
<<hot standby>> | Block      | A block for a hot standby element
<<control>>     | Action     | An action that may induce a state transition of a system element
<<transition>>  | Allocation | An allocation from an STM to a corresponding block in IBD
<<standby>>     | Allocation | A relationship between a standby element and an active element
<<hosted>>      | Allocation | A dependency between two blocks having hosting relationship
<<process>>     | Allocation | An association between blocks representing a logical function and processes implementing the function
<<operation>>   | Allocation | An allocation from AD to blocks affected in IBDs

V. MODEL TRANSLATION

This section introduces the model translation from SysML diagrams to model components. Since all model components are based on SRN, SRN is first introduced in brief.

A. SRN

SRN extends the Generalized Stochastic Petri Net (GSPN) by introducing reward functions, guard functions and general marking dependencies. A reward function defines the reward rate for each tangible marking of the Petri net. Various quantitative measures such as steady-state availability can be computed by defining the corresponding reward functions. A guard function assigned to a transition specifies a condition that enables or disables the transition, in addition to the constraints imposed by priority, input arcs and inhibitor arcs. More details on SRN can be found in [5].

B. IBD translation

Each block in IBD represents a system element such as a server process.
In terms of availability, a system element has at least two states, namely the Up state and the Down state. The blocks in IBD can therefore be translated into the same elemental model component shown in Figure 5(a).

[Figure 5. The elemental model component and cluster model component. (a) Elemental model component: places Pup and Pdown, transitions Tfail and Trecv, one token. (b) Cluster model component: the same structure with n tokens and the symbol "#" near Tfail.]

In the elemental model component, a token deposited in Pup implies that the system element is available. The transition Tfail fires when the system element goes down; a token is then removed from Pup and deposited in Pdown, and hence the system element is not available. Trecv fires when the system element recovers, removing a token from Pdown and depositing a token in Pup. A redundant element is translated into a number of tokens in the model component. Assuming that all redundant elements are available in the initial state and that the transition rate to the Down state depends on the number of available elements, the redundant elements can be translated into the cluster model component shown in Figure 5(b). The symbol “#” near Tfail indicates that the transition rate depends on the number of tokens in Pup [5].

If the block is annotated with a stereotype for a special system element, the information is used to generate a special model component. We define a model template for instantiating a model component from a stereotyped block. For instance, a block denoted with <<hot standby>> is translated to the specific model component representing a hot standby element. The right part of Figure 9 shows an example of a model component for hot standby. Other types of redundant elements can also be incorporated by defining special model templates with the stereotype. The dependency between the active system element and the standby element is discussed in detail in Section VI-B.

C. STM translation

STM can be translated into a model component by converting each state into a place and each transition into a timed transition with input/output arcs. Figure 6 shows the translated model component for the STM of a web server process introduced in Figure 4(a). From the values of the “available” property, Prunning can be identified as the place representing the available state of the web server process.

[Figure 6. Model component translated from STM for a web server process. Places Pstop, Prunning and Pfailed; transitions Tstart, Tdown, Tfail and Trecover.]

D. AD translation

AD is translated into SRN by converting each action into related places, timed/immediate transitions and guard functions. We define the translation rules for each node type of AD as shown in Figure 7. The translation rules automatically generate blank guard functions for specific transitions, such as gin-act and gout-act. The definitions of the guard functions are specified in the synchronization step discussed in Section VII.

[Figure 7. Translation rules for activity nodes. Initial node: Pini, Tini. Action node: Pact, Tact. Control action node (<<control>>): Pin-act, Tact [gin-act], Pout-act, Tout-act [gout-act]. Decision node: Pindet, Tdet, Poutdet, Toutdet1 [goutdet1], Toutdet2 [goutdet2]. Final node: Pfin, Tfin. Wait time action: Pwait, Twait, Pclock, Tclock, Treset, Ptrigger.]

Figure 8 shows an example of the translation from the AD for the server maintenance activity shown in Figure 4(b). As described in the AD, the server state is checked every 24 hours by a timed transition Tclock with a deterministic firing time, denoted by a filled rectangle. The decision node is translated into Pindet with the timed transition Tdet and Poutdet with two outgoing arcs to the immediate transitions Toutdet1 and Toutdet2. Each immediate transition has a guard function for expressing the decision guard in the AD. The translations of the final nodes include outgoing arcs to the initial place, which means that the activity starts repeatedly on the timer event.

[Figure 8. Activity SRN for a server maintenance operation. Initial node (Pini, Tini); wait time action (Pwait, Twait, Pclock, Tclock, Treset, Ptrigger); action node (Pact1, Tact1); decision node (Pindet, Tdet, Poutdet, Toutdet1 [goutdet1], Toutdet2 [goutdet2]); control action node (Pin-act2, Tact2 [gin-act2], Pout-act2, Tout-act2 [gout-act2]); final nodes (Pfin1, Tfin1, Pfin2, Tfin2).]

VI. MODEL ASSEMBLY

This section introduces the model assembly step, which composes System SRNs by assembling the model components translated from IBDs and STMs in accordance with the allocation notations in the SysML diagrams.

A. Detailed state transitions

First, model components translated from STMs replace the elemental/cluster model components generated from IBDs. A <<transition>> allocation indicates the relationship between an STM and the corresponding block in the IBD. Since an STM provides more detailed state transitions of a system element, Candy replaces the elemental/cluster model component derived from the IBD with the model component derived from the STM according to the <<transition>> allocation. The number of tokens in the model component is set by the multiplicity of the block in the IBD.

B. Standby element

A <<standby>> allocation specifies a failover relationship from a standby element to an active element. When the active element fails, the standby element takes over the operation of the active element. The standby element switches back to the standby state after the active element is recovered. The failover behavior can be represented by guard functions in the model component for the standby element. Let m1 be a model component for a standby element and let m2 be a model component for an active element. A <<standby>> allocation is directed from m1 to m2 as shown in Figure 9. The component m1 has transitions for failover (Tfover) and for switch back (Tfbac). In the model assembly step, Candy generates the guard functions g1fover and g1fbac to specify the conditions of failover and switch back. Let us define the set of places representing the up states (down states) for mi as Ui (Di). Ui and Di can be derived from the "available" property in the STM or from the definition of the model templates. The transition Tfover is enabled by g1fover when a token in m2 is deposited in D2 (P2down belongs to D2 in Figure 9). Similarly, the guard function g1fbac enables the transition Tfbac when a token in m2 is deposited in U2 (P2up belongs to U2). The definitions of g1fover and g1fbac are summarized in Table II. #Ui and #Di represent the number of tokens in Ui and Di, respectively.

[Figure 9. Model assembly for <<standby>> allocation. Active system element m2: P2up ∈ U2, P2down ∈ D2, T2fail, T2recv. Hot standby element m1: Pup, Phot, Pdown, Precvd, Tfover [g1fover], Tfbac [g1fbac], Tfail, Trecv, Thot, Thotfail. A <<standby>> allocation is directed from m1 to m2.]

C. Hosted dependency

A <<hosted>> allocation represents a hosted dependency between system elements, such as the dependency between a virtual machine and a physical server. State transitions of a hosted element are enabled only when the corresponding hosting element is available. If the hosting element goes down, the hosted element becomes unavailable at the same time. This hosted dependency can be represented by guard functions in the model component for the hosted element. Let m3 be a model component for a hosted element (e.g., a virtual machine) and let m4 be a model component for a hosting element (e.g., a physical server). A <<hosted>> allocation is directed from m3 to m4 as shown in Figure 10. For each place in U3, a new immediate transition T3dw4i (one per place in U3, indexed by i) is introduced to deposit a token into a place P3dw4i ∈ D3 when the hosting element goes down. The guard function g3dw4 enables all T3dw4i transitions when a token in m4 is deposited in D4.
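The guard conditions introduced so far (g1fover and g1fbac for a standby element, g3dw4 for a hosted element) depend only on token counts in the up or down places of another model component. The following minimal Python sketch illustrates this; it assumes that a marking is represented as a dictionary from place names to token counts, and the helper `tokens_in` and the place names in the usage lines are hypothetical illustrations rather than output of Candy or syntax of an SRN tool.

```python
def tokens_in(marking, places):
    """Total number of tokens in a set of places (#Ui or #Di in the text)."""
    return sum(marking.get(p, 0) for p in places)


def g1fover(marking, D2):
    # Failover is enabled when a token of the active element m2 is in a down place.
    return 1 if tokens_in(marking, D2) > 0 else 0


def g1fbac(marking, U2):
    # Switch back is enabled when a token of the active element m2 is in an up place.
    return 1 if tokens_in(marking, U2) > 0 else 0


def g3dw4(marking, D4):
    # The forced-down transitions of the hosted element m3 are enabled
    # when a token of the hosting element m4 is in a down place.
    return 1 if tokens_in(marking, D4) > 0 else 0


if __name__ == "__main__":
    U2, D2 = {"P2up"}, {"P2down"}          # up/down places of the active element
    active_failed = {"P2up": 0, "P2down": 1}
    active_ok = {"P2up": 1, "P2down": 0}
    print(g1fover(active_failed, D2), g1fbac(active_failed, U2))
    print(g1fover(active_ok, D2), g1fbac(active_ok, U2))
```

Writing the guards against the sets Ui and Di, rather than against individual places, keeps them valid when an elemental component is later replaced by a more detailed component derived from an STM with several down places.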
Figure 10 shows an example of an additional immediate transition T3dw41 connected from P3up ∈ U3 to P3down ∈ D3. P3down is selected as P3dw41 in this case. Since a state transition of the hosted element never happens during the downtime of the hosting element, all the transitions in m3 connected from P3dw4i are disabled by the guard function g3up4 while a token is deposited in D4. In the example in Figure 10, T3recv is disabled by g3up4. The definitions of the guard functions are summarized in Table II. If D3 contains more than one place, Candy needs to select a place P3dw4i from D3. This selection can be supported by an additional property, such as "failure", that specifies the failure state in the STM. If such a property is not used, Candy chooses one place according to a certain rule (e.g., choose the nearest place in D3 for each place in U3).

[Figure 10. Model assembly for <<hosted>> allocation. Hosted element m3: P3up ∈ U3, P3down ∈ D3 (P3dw41 = P3down), T3fail, T3recv [g3up4], T3dw41 [g3dw4]. Hosting element m4: P4up ∈ U4, P4down ∈ D4, T4fail, T4recv. A <<hosted>> allocation is directed from m3 to m4.]

D. Process implementation

In Candy, both logical functions and processes are represented by blocks in IBDs. A logical function is implemented by the associated processes. The function is available as long as the associated processes work properly. This relationship is specified by a <<process>> allocation as shown in Figure 11. The <<process>> allocation from the LB function indicates the associated LB server process. The "require" property specifies the number of processes required for the function. From the <<process>> allocation and the "require" property, a reward function for the availability of the logical function can be generated. Let m5 be a model component for a logical function and m6 be a model component for the process implementing the function. A <<process>> allocation is directed from m5 to m6. In the model assembly step, m5 is replaced with m6. The reward function r5 is defined with n5, which represents the value of the "require" property of m5, as shown in Table II.

[Figure 11. IBDs representing logical function and processes. LB function (Require=1), model component m5: P5up, P5down, T5fail, T5recv. LB process, model component m6: P6up ∈ U6, P6down, T6fail, T6recv. A <<process>> allocation is directed from m5 to m6.]

TABLE II. GUARD AND REWARD FUNCTIONS FOR MODEL ASSEMBLY

Allocation type        | Function name | Definition
Standby component      | g1fover       | if (#D2 > 0) 1 else 0
Standby component      | g1fbac        | if (#U2 > 0) 1 else 0
Hosted dependency      | g3dw4         | if (#D4 > 0) 1 else 0
Hosted dependency      | g3up4         | if (#D4 == 0) 1 else 0
Process implementation | r5            | if (#U6 >= n5) 1 else 0

VII. MODEL SYNCHRONIZATION

System SRNs and Activity SRNs are dependent on each other. An action in an Activity SRN may induce state changes in System SRNs. On the other hand, the flows in an Activity SRN may change depending on the marking of the System SRNs. In this section, these dependencies are incorporated by introducing additional guard functions.

A. Synchronization of action node

Since an action stereotyped as <<control>> in an Activity SRN affects state transitions in a System SRN, guard functions that associate the related transitions are required. For a transition Tact in an Activity SRN, a system administrator needs to find the corresponding transition Ttr in the System SRN in accordance with the <<operation>> allocation. The search for Ttr can be automated by introducing a naming rule for actions in AD and transitions in STM, or by using a specific allocation from the action to the corresponding transitions.

[Figure 12. Synchronization of Tact in Activity SRN and Ttr in System SRN. Activity SRN: Pin-act, Tact [gin-act], Pout-act, Tout-act [gout-act]. System SRN: Ttr is expanded into Tin-tr [gin-tr], Pin-tr and Ttr [gout-tr].
Four guard functions for synchronization:
  1. gin-tr : if(#Pin-act == 1) 1 else 0 end
  2. gin-act : if(#Pin-tr == 1) 1 else 0 end
  3. gout-tr : if(#Pout-act == 1) 1 else 0 end
  4. gout-act : if(#Pin-tr == 0) 1 else 0 end]

To synchronize the action with the transition, the transition Ttr is expanded with an immediate transition Tin-tr and a place Pin-tr with an inhibitor arc, as shown in the right part of Figure 12.
Tin-tr and Pin-tr represent the action invocation and the action execution state, respectively. An inhibitor arc from Pin-tr to Tin-tr ensures that the action affects a single system element (e.g., a server process) at a time. For all the related transitions, Candy generates four guard functions, gin-tr, gin-act, gout-tr, and gout-act, so that the transitions fire in a consistent order, as shown in Figure 12. The first guard function gin-tr represents the trigger of the action. The second guard function gin-act ensures the start of the state transition. The third guard function gout-tr represents the end of the action, and the fourth guard function gout-act ensures that the state transition has completed.

gin-down : if (#Pin-act2 == 1) 1 else 0 end
gout-down : if (#Pout-act2 == 1) 1 else 0 end
gin-act2 : if (#Pin-down == 1) 1 else 0 end
gout-act2 : if (#Pin-down == 0) 1 else 0 end
Figure 13. An example of expanded System SRN and guard functions

Figure 13 shows an example of the synchronization between Tact2 in the Activity SRN shown in Figure 8 and the corresponding transition Tdown in the System SRN in Figure 6. The transition Tdown is expanded with an immediate transition Tin-down and a place Pin-down. The four guard functions are automatically generated for the related transitions. As a result of the synchronization, the shutdown action always changes the state of the server process.

B. Synchronization of decision node

If a condition of a decision node in an AD depends on the state of a system element, the guard functions in the Activity SRN can be defined by a marking of the corresponding System SRN. Since guard conditions for decision nodes in an AD are not specified formally, a manual interpretation of the guard condition by a system administrator is required. The associated part of the System SRN can be identified by the <<operation>> allocation. In our example, the guard functions of Toutdet1 and Toutdet2 in Figure 8 are described with the number of tokens in Prunning, as shown in Figure 14.

goutdet1 : if (#Prunning == 0) 1 else 0 end
goutdet2 : if (#Prunning > 0) 1 else 0 end
Figure 14. An example of guard functions for the decision outputs

Because the guard functions are configured manually by system administrators, their definitions might contain errors. To avoid such human errors, we could introduce a constraint language for SysML diagrams, such as the Object Constraint Language (OCL) [14], which specifies guard conditions in a formal way. However, such a constraint language restricts the flexibility of SysML descriptions. In our case study, we assume that the guard functions are correctly derived by the system administrator from the original AD.

VIII. CASE STUDY

As a case study, we recall the web application system example introduced in Section II. Configurations and maintenance operations of the web application system are specified by SysML diagrams. From the SysML diagrams, SRNs are composed through Candy and the system availability is computed by SPNP [13].

A. SysML design

For the web application system illustrated in Figure 1, system administrators (or designers) first use IBDs to represent the configuration of the system. Figure 15 shows the IBD representation of the system. There are three layers of IBDs, which represent the logical functions of the application system, the process structure, and the zone configuration. Each block in the top-level IBD has a <<process>> allocation to blocks in the middle-level IBD, and each block in the middle-level IBD has a <<hosted>> allocation to the blocks representing the zones in the bottom-level IBD. In an IaaS cloud, each process is assumed to run on its own virtual machine. The maximum number of web processes is five, and the number of processes required for the web function to work properly is four in total. The relationship between the DB process and the DB hot standby process is represented by a <<standby>> allocation.

Figure 15. IBDs for the web application system on an IaaS cloud

Figure 16. AD representing automatic scale-up function for web processes

Next, the detailed state transitions of the system elements are designed using STMs. In this case study, the STM shown in Figure 4 is used for all of the web processes. The relationship between the STM in Figure 4 and the blocks for the web processes in Figure 15 is specified with a <<transition>> allocation (this allocation is not depicted in the figures). Finally, an AD is used to represent the behavior of the automatic scale-up function. Figure 16 shows the AD for the automatic scale-up function applied to the web processes in a zone.
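The decision structure drawn in Figure 16 can be paraphrased as a small function. This is a sketch under assumptions rather than the authors' definition: the function name scale_up_decision and the Action type are hypothetical, na and nu follow the labels on the AD (available and unused web processes in the own zone), and the stop branch is taken to apply when na is greater than two, which is how the case-study text and the corresponding decision guard in Table IV read.

```python
from enum import Enum


class Action(Enum):
    START = "start a new process"
    STOP = "stop a process"
    NONE = "no action"


def scale_up_decision(other_zone_up: bool, na: int, nu: int) -> Action:
    """Pick the control action for one trigger of the scale-up activity.

    na: number of available web processes in the own zone
    nu: number of unused web processes in the own zone
    """
    if other_zone_up:
        # [Up] branch: keep two web processes in the own zone.
        if na < 2 and nu > 0:
            return Action.START
        if na > 2:  # assumption: stop only when strictly more than two
            return Action.STOP
        return Action.NONE
    # [Down] branch: grow the own zone towards four web processes.
    if na < 4 and nu > 0:
        return Action.START
    return Action.NONE


# One trigger with the other zone down, two processes up, three unused.
print(scale_up_decision(other_zone_up=False, na=2, nu=3).value)
```

Each call corresponds to one firing of the periodic trigger, so moving from two to four processes after the other zone fails takes two trigger intervals in this reading.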
The <<operation>> allocation connects the AD in Figure 16 to the affected blocks for the web processes in the IBD in Figure 15 (this allocation is not depicted in the figures). The activity starts the action that checks the state of the other zone every five minutes, as long as its own zone is available. Depending on the availability of the other zone, the desired number of web processes changes. If the other zone is in the Up state, the automatic scale-up function keeps the number of web processes in its own zone at two. If the number of available web processes is less than two, it starts a new web process as long as there are unused processes (nu > 0) out of the five processes. Conversely, if the number of processes is more than two, it stops one of the running web processes. If the other zone is in the Down state, the automatic scale-up function tries to increase the number of processes in its own zone up to four.

B. Model translation

Candy first translates the blocks in the IBDs into model components based on the model templates. The cluster model component shown in Figure 5(b) is used to instantiate the model components for the web processes. The hot standby model component is generated for the DB hot standby by the stereotype <<hot standby>> (as described in Section V-B). The other blocks in the IBDs are translated into the elemental model components shown in Figure 5(a). The STM for the web processes is subsequently translated into the model components in Figure 6. Finally, the ADs for the automatic scale-up function are translated by the translation rules described in Figure 7. The obtained Activity SRN for the automatic scale-up function of the web processes is shown in Figure 17. For the two groups of web server processes on zone-1 and zone-2, two Activity SRNs for the automatic scale-up functions are generated.

Figure 17. Activity SRN for automatic scale-up activity

C. Model assembly

Figure 18. System SRNs obtained by model assembly process

TABLE III. GUARD AND REWARD FUNCTIONS FOR SYSTEM SRN
Allocation     Function name          Definition
<<hosted>>     gDBupZ1, gWEB1upZ1     if (#PZ1down == 0) 1 else 0
               gDHupZ2, gWEB2upZ2     if (#PZ2down == 0) 1 else 0
               gDBdwZ1, gWEB1dwZ1     if (#PZ1down > 0) 1 else 0
               gDHdwZ2, gWEB2dwZ2     if (#PZ2down > 0) 1 else 0
<<standby>>    gDHfover               if (#PDBdown > 0) 1 else 0
               gDHfbac                if (#PDBup > 0) 1 else 0
<<process>>    rLB                    if (#PLBup >= 1) 1 else 0
               rWEB                   if (#PWEB1up + #PWEB2up >= 4) 1 else 0
               rDB                    if (#PDBup + #PDHup >= 1) 1 else 0

From the model components generated from the IBDs and the STM, Candy assembles them together to compose System SRNs according to the stereotyped allocations.
First, the cluster model components for the web processes are replaced by the model component generated from the STM, following the <<transition>> allocation (see Section VI-A). Next, the guard functions for failover behavior are generated by the assembly method for <<standby>> (see Section VI-B). The generated guard functions are summarized in Table III. According to the <<hosted>> allocations, immediate transitions and associated guard functions for the server processes are generated (see Section VI-C). Finally, reward functions for the three logical functions are generated as shown in Table III. The obtained System SRNs are summarized in Figure 18.

D. Model synchronization

In the Activity SRN, Tact4, Tact5 and Tact6 affect the state transitions of the web processes in the System SRNs. From the name of the action in the AD, system administrators can easily find the corresponding transition for each action. The start server action is associated with TWEB1start (in Figure 18), and the stop server action is associated with TWEB1stop (in Figure 18). TWEB1start and TWEB1stop are expanded with immediate transitions as introduced in Section VII-A, and the guard functions for synchronization are generated according to the rules shown in Figure 12. The updated System SRN for the web server processes on zone-1 is shown in Figure 19. Since the maximum number of web processes is five, three additional tokens are deposited in PWEB1stop.

Figure 19. Updated System SRN for web server processes

The guard functions for decision guards need to be defined in model synchronization as described in Section VII-B. Since the decision outputs of the place Poutdet1 (in Figure 17) depend on the availability of the other zone, we can define the guard functions goutdet11 and goutdet12 using the marking of PZ2up (in Figure 18). The decision outputs of the places Poutdet2 and Poutdet3 depend on the number of available web processes and unused web processes in the own zone. The guard functions goutdet21, goutdet22, goutdet23, goutdet31 and goutdet32 are defined using the markings of PWEB1up and PWEB1stop (in Figure 19). Table IV shows the guard functions obtained through model synchronization for the web processes on zone-1. The AD for the web processes on zone-2 is synchronized to the System SRN in the same manner.

TABLE IV. GUARD FUNCTIONS GENERATED BY SYNCHRONIZATION
Action          Function                 Definition
Process start   gin-WEB1start            if (#Pin-act4 == 1 || #Pin-act6 == 1) 1 else 0
                gin-act4, gin-act6       if (#Pin-WEB1start == 1) 1 else 0
                gout-WEB1start           if (#Pout-act4 == 1 || #Pout-act6 == 1) 1 else 0
                gout-act4, gout-act6     if (#Pin-WEB1start == 0) 1 else 0
Process stop    gin-WEB1stop             if (#Pin-act5 == 1) 1 else 0
                gin-act5                 if (#Pin-WEB1stop == 1) 1 else 0
                gout-WEB1stop            if (#Pout-act5 == 1) 1 else 0
                gout-act5                if (#Pin-WEB1stop == 0) 1 else 0
Decision        goutdet11                if (#PZ2up > 0) 1 else 0
                goutdet12                if (#PZ2up == 0) 1 else 0
                goutdet21                if (#PWEB1up == 2 || #PWEB1stop == 0) 1 else 0
                goutdet22                if (#PWEB1up < 2 && #PWEB1stop > 0) 1 else 0
                goutdet23                if (#PWEB1up > 2) 1 else 0
                goutdet31                if (#PWEB1up < 4 && #PWEB1stop > 0) 1 else 0
                goutdet32                if (#PWEB1up >= 4 || #PWEB1stop == 0) 1 else 0

E. Numerical results of availability evaluation

We use SPNP [13] to compute system availability from the model created by Candy. Since the web application system is the composition of the LB function, the web function and the DB function, the reward function for the system availability is expressed as:

$$A_{sys} = \Pr\big((\#P_{LBup} \geq 1) \wedge (\#P_{WEB1up} + \#P_{WEB2up} \geq 4) \wedge (\#P_{DBup} + \#P_{DHup} \geq 1)\big)$$

The default parameters used in the evaluation are summarized in Table V. The failure and recovery rates of a zone are set in accordance with the declared availability of a zone in Amazon EC2 [15]. Most of the other parameters are reasonable estimates. In our case study, all timed transitions are assumed to be exponentially distributed except the deterministic transition Twait, for which we use a 10-stage Erlang approximation [5]. The expected reward rate for each function and the availability Asys are summarized in Table VI. Note that Asys is not equal to the product of the expected reward rates of the individual functions, because the web function and the DB function are not independent of each other.

TABLE V. DEFAULT PARAMETERS USED IN THE EVALUATION
Parameter name                      Assigned transitions                            Value [1/h]
LB process failure rate             TLBfail                                         0.00011415
LB process recovery rate            TLBrecv                                         0.5
WEB server failure rate             TWEB1fail, TWEB2fail                            0.00069444
WEB server recovery rate            TWEB1recv, TWEB2recv                            1
WEB server startup/shutdown rate    TWEB1start, TWEB2start, TWEB1stop, TWEB2stop    60
DB server failure rate              TDBfail, TDHfail                                0.00023148
DB server recovery rate             TDBrecv, TDHrecv                                0.5
DB hot standby rate                 TDHhot                                          12
DB failover/switch rate             TDHfover, TDHfbac                               60
DB hot standby failure rate         TDHfail                                         0.00013889
Zone failure rate                   TZ1fail, TZ2fail                                0.00011415
Zone recovery rate                  TZ1recv, TZ2recv                                0.25
Scale-up trigger rate               Twait                                           12
Action rate                         TactX, TdetX                                    3600

TABLE VI. AVAILABILITY NUMERICAL RESULTS
             rLB         rWEB        rDB         Asys
Two zones    0.999772    0.999804    0.999993    0.999571
One zone     0.999772    0.999025    0.999420    0.998778

To study the effect of multiple failure-isolation zones, we made another model in which all of the web processes and DB processes belong to the same zone. This modification is easily performed on the SysML diagrams. We redirect the <<hosted>> allocation of the DB hot standby from zone-2 to zone-1 and delete web process-2 and zone-2 from the IBD.
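The remark that Asys differs from the product of the per-function reward rates can be checked directly against the values reported in Table VI. The sketch below only redoes that arithmetic; the helper names and the conversion to hours of downtime per year (assuming 8760 hours in a year) are illustrative additions and not part of the original evaluation.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year

# Values copied from Table VI.
table_vi = {
    "two zones": {"rLB": 0.999772, "rWEB": 0.999804, "rDB": 0.999993, "Asys": 0.999571},
    "one zone": {"rLB": 0.999772, "rWEB": 0.999025, "rDB": 0.999420, "Asys": 0.998778},
}


def product_of_rewards(row: dict) -> float:
    """Availability that would result if the three functions were independent."""
    return row["rLB"] * row["rWEB"] * row["rDB"]


def annual_downtime_hours(availability: float) -> float:
    """Expected unavailable hours per year for a steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for name, row in table_vi.items():
    prod = product_of_rewards(row)
    print(f"{name}: product={prod:.6f} Asys={row['Asys']:.6f} "
          f"gap={row['Asys'] - prod:+.6f} "
          f"downtime={annual_downtime_hours(row['Asys']):.2f} h/year")
```

For the one-zone configuration the gap between the product and Asys is much larger than for two zones, which is consistent with a single zone failure taking down the web and DB functions together. In downtime terms, the reported Asys values correspond to roughly 3.8 hours per year with two zones and roughly 10.7 hours per year with one zone.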
The AD for the automatic scale-up function is simplified by deleting the action that checks the state of the other zone and by increasing the desired number of web processes to four. The availability model for the modified SysML diagrams can be generated by the same procedure in Candy. We conduct a sensitivity analysis with respect to the time interval of the automatic scale-up function. The comparison results are summarized in Table VI and Figure 20. From the results, we observe quantitatively the advantage of using two zones in the cloud infrastructure. The effectiveness of the automatic scale-up function depends on the interval at which the operation is triggered: a shorter trigger interval achieves higher system availability. The result implies that the trigger interval of the automatic scale-up function is not a negligible factor for system availability.

Figure 20. Sensitivity analysis of time interval of scale-up function (system availability versus trigger interval of the scale-up function in minutes, for two zones and one zone)

IX. RELATED WORK

Major public cloud service providers assess the availability of their services based on empirical data. Web sites called service dashboards summarize the current availability and the status history [16][17]. In contrast to the availability information provided by a service dashboard, model-based availability assessment gives a reasonable prediction of the availability of cloud services based on the architecture and the system management operations.

A few research studies have addressed the availability management of cloud services. A dynamic regeneration technique that restores the redundancy of software components in a cloud after a failure was presented in [18]. Dynamic resource management of cloud service infrastructure under availability constraints was studied in [19]. FTCloud was presented as a framework for the optimal selection of software fault-tolerance techniques for building cloud applications [20]. In contrast to these existing works, our research focuses on the availability assessment of cloud services based on the architecture and maintenance operations. Candy guides system administrators in quantifying the availability of cloud services from system descriptions in SysML.

X. CONCLUSIONS AND FUTURE WORK

This paper has presented Candy, a component-based availability modeling framework that composes a comprehensive availability model for cloud services from system specifications described in SysML diagrams. The framework semi-automatically translates the elements of SysML diagrams into model components, and the components are assembled and synchronized to form the whole availability model according to stereotyped allocations. The modeling method based on the proposed framework is demonstrated with an illustrative example of a web application system on an IaaS cloud. The composed availability model is used to evaluate the effectiveness of the automatic scale-up function and the failure-isolation zones.

Candy is under implementation as a part of system design and/or management tools. NEC has an in-house SysML modeling tool for system design called CASSI [21]. We developed a prototype implementation of model translation using an export function of CASSI. We will implement the model assembly and synchronization steps in the near future. The scalability of modeling is an important issue for future work. We may introduce model decomposition techniques [1] to cope with the scalability issue.

REFERENCES

[1] D. S. Kim, F. Machida, and K. S. Trivedi, Availability modeling and analysis of a virtualized system, In Proc. of 15th Pacific Rim Int. Symp. on Dependable Computing (PRDC), 2009.
[2] W. E. Smith, K. S. Trivedi, L. A. Tomek, and J. Ackaret, Availability analysis of blade server systems, IBM System J., vol. 47, no. 4, 2008.
[3] OMG Systems Modeling Language (OMG SysML) Version 1.2, http://www.omg.org/spec/SysML/1.2/
[4] OMG Unified Modeling Language (OMG UML), Superstructure Version 2.3, http://www.omg.org/spec/UML/2.3/
[5] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley, New York, 2001.
[6] J. P. López-Grao, J. Merseguer, and J. Campos, From UML Activity Diagrams To Stochastic Petri Nets, In Proc. of the 4th Int. Workshop on Software and Performance (WOSP), pp. 25-36, 2004.
[7] S. Distefano, M. Scarpa, and A. Puliafito, From UML to Petri Nets: the PCM-Based Methodology, IEEE Trans. on Soft. Eng., Jan. 2010.
[8] A. Bondavalli, I. Maizik, and I. Mura, Automated Dependability Analysis of UML Designs, In Proc. 2nd Int. Symp. on Object-oriented Real-time distributed Computing (ISORC), 1999.
[9] G. J. Pai and J. Dugan, Automatic Synthesis of Dynamic Fault Trees from UML System Models, In Proc. 13th Int. Symp. on Software Reliability Engineering (ISSRE), 2002.
[10] J. Barr, A. Narin, and J. Varia, Building Fault-Tolerant Applications on AWS, http://media.amazonwebservices.com/AWS_Building_Fault_Tolerant_Applications.pdf, 2010.
[11] M. Tavis, Web application hosting in the AWS Cloud - Best Practices, http://media.amazonwebservices.com/AWS_Web_Hosting_Best_Practices.pdf, 2010.
[12] K. S. Trivedi and R. Sahner, SHARPE at the age of twenty two, SIGMETRICS Perform. Eval. Rev., vol. 36, no. 4, pp. 52-57, 2009.
[13] G. Ciardo, A. Blakemore, P. F. Chimento, J. K. Muppala, and K. S. Trivedi, Automated generation and analysis of Markov reward models using stochastic reward nets, in: C. Meyer, R. Plemmons (Eds.), Linear Algebra, Markov Chains and Queuing Models, vol. 48, Springer, pp. 145-191, 1993.
[14] OMG Object Constraint Language (OCL), http://www.omg.org/spec/OCL/2.2
[15] Amazon EC2 SLA, http://aws.amazon.com/ec2-sla/
[16] AWS Service Health Dashboard, http://aws.amazon.com/ec2-sla/
[17] Google AppEngine Status, http://code.google.com/status/appengine
[18] G. Jung, K. R. Joshi, M. A. Hiltunen, R. D. Schlichting, and C. Pu, Performance and Availability Aware Regeneration For Cloud Based Multitier Applications, In Proc. of Int. Conf. on Dependable Systems and Networks (DSN), 2010.
[19] B. Addis, D. Ardagna, B. Panicucci, and L. Zhang, Automatic Management of Cloud Services Centers with Availability Guarantees, In Proc. of 3rd Int. Conf. on Cloud Computing (CLOUD), 2010.
[20] Z. Zheng, T. C. Zhou, M. R. Lyu, and I. King, FTCloud: A Component Ranking Framework for Fault-Tolerant Cloud Applications, In Proc. of 21st Int. Symp. on Software Reliability Engineering (ISSRE), pp. 398-407, 2010.
[21] S. Izukura, et al., Applying a Model-Based Approach to IT Systems Development using SysML Extension, To appear in Proc. of Int. Conf. on Model Driven Engineering Languages and Systems, 2011.
[22] G. Ciardo and K. S. Trivedi, A Decomposition Approach for Stochastic Petri Net Models, Performance Evaluation, vol. 18, 1993.