Cognitive Behavior Analysis Framework for Fault Prediction in Cloud Computing

Reza Farrahi Moghaddam and Fereydoun Farrahi Moghaddam
Synchromedia Lab, ETS, University of Quebec, Montreal (QC), Canada H3C 1K3
Email: [email protected], [email protected]

Vahid Asghari
INRS-EMT, University of Quebec, Montreal (QC), Canada H5A 1K6
Email: [email protected]

Mohamed Cheriet
Synchromedia Lab, ETS, University of Quebec, Montreal (QC), Canada H3C 1K3
Email: [email protected]

Abstract—Complex computing systems, including clusters, grids, clouds, and skies, are becoming the fundamental tools of the green and sustainable ecosystems of the future. However, they can also pose critical bottlenecks and ignite disasters. Their complexity and number of variables can easily go beyond the capacity of any analyst or traditional operational-research paradigm. In this work, we introduce a multi-paradigm, multi-layer, and multi-level behavior analysis framework that can adapt to the behavior of a target complex system. It not only learns and detects normal and abnormal behaviors, it can also suggest cognitive responses in order to increase the system's resilience and its grade. The multi-paradigm nature of the framework provides a robust redundancy in order to cross-cover possible hidden aspects of each paradigm. After providing the high-level design of the framework, three different paradigms are discussed: Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. To be more precise, and because of space limitations, we focus in this paper on fault prediction as a specific event-based abnormal behavior. We consider both spontaneous and gradual failure events. The promising potential of the framework has been demonstrated using simple examples and topologies.
The framework can provide an intelligent approach to balancing the green aspects and the high-probability-of-completion (or high-probability-of-availability) aspects of computing systems.

I. INTRODUCTION

Computing systems, in various forms such as clusters, grids, clouds, and skies [9], [10], [17], are not only scaling in terms of the number of involved components and their physical distribution; they are also becoming highly heterogeneous, with new components such as sensors and mobile devices. Although this brings more computational and cognitive power, it also increases the degree of uncertainty and risk. There are many risk sources involved, such as operators (human), software bugs, software overload, hardware overload, and hardware failure, among others [24]. Therefore, a full understanding of the system behavior, which brings the ability to predict its normal or abnormal behaviors, is of great importance for scheduling, allocation, binding, and other actions. We call this Behavior Analysis (BA), and we propose a framework with three high-level units: the Behavior Analyzer Unit (BAU), the Behavior Stimulator Unit (BSU), and the Cognitive Responder Unit (CRU). A schematic example of the proposed framework is shown in Figure 1.

Fig. 1. A schematic example of the proposed framework in a sky system.

In a non-technical way, we can consider two high-level modes of transaction between a service provider and its clients:

1) Leasing: The provider dedicates an agreed set of resources to the client for an agreed, limited period of time. The lease can be renewed upon agreement and resource availability. This mode is especially interesting for handling resources that may evanesce or vanish without notice (such as resources powered by intermittent energy sources). The lease mode is a good practice for service providers with the Infrastructure as a Service (IaaS) model or other similar models.
2) Task completion: The service provider guarantees completion of a task (or a volume of tasks) within an agreed period of time. This implies that the provider is aware of the task's detailed steps.

In reality, there is a chance of failure to deliver the agreed Service-Level Objectives (SLOs). This brings us to two important concepts: the Probability of Completion (PoC) and the Probability of Availability (PoA). The ability of a provider to determine, estimate, or measure the PoC (or PoA, depending on its business model) not only enables it to negotiate instrumental Service-Level Agreements (SLAs) with its clients, it also provides a way to grade its various services. In particular, services with High Probability of Completion (HPoC) or High Probability of Availability (HPoA) grades would attract mission-critical clients, such as communication providers and emergency operators.

Fig. 2. Schematic diagram of the proposed behavior analysis framework in its systemic picture: (a) the overall picture; (b) the multi-layer nature of the framework.

Usually, the HPoC (or HPoA) is achieved by resource “over-allocation” and by task “replication.” This can be interpreted as a traditional correlation between the HPoC (or HPoA) requirement and a high level of energy/resource consumption (non-greenness) of a service. This is a critical issue because, with the push for the ICT enabling effect and also the move toward the Internet of Things, the HPoC (or HPoA) will be required by an enormous number of clients; the ICT enabling effect is one of the fundamental instruments for reducing the footprint of other industrial sectors by re-directing non-ICT service calls to the ICT sector [30]. The Internet of Things is also becoming a reality in the near future because of the exponential increase in the number of portable phones, distributed sensors, and Radio Frequency (RF) devices [20].
One way to break the correlation between the HPoC (or HPoA) and the non-greenness of services is to add intelligence to determine, predict, and react to possible changes in the PoC (or PoA) in real time. This could help a system provide a desirable HPoC (or HPoA) with a minimum amount of resources. This analyzer and its implementation are our ultimate research goal, and in this work, we present an overview and some preliminary results. Calculation and verification of the PoC (or PoA) of a service can be carried out by analyzing the system configuration. However, real-time variation in the PoC (or PoA) is very critical and can lead to violation of the SLA despite a satisfactory configuration-based predicted PoC (or PoA) value. Therefore, in our framework, we consider three paradigms that compensate for one another's weaknesses: Probabilistic (Statistical Inference) Behavior Analysis, Simulated Probabilistic (Statistical Inference by Means of Simulation) Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. The first two paradigms work based on the configuration of the system, using data collected from experiments and simulations to provide insight into the system behavior. The third paradigm uses machine learning techniques to learn the patterns and behaviors of the system from its time profiles, collected by a set of opportunistic agents across the system.

The organization of the paper is as follows. In section II, a brief introduction of the proposed framework is presented. Fault injection approaches are discussed in section III. The three behavior analysis paradigms and some experimental results are presented in section IV. The related work is described in section V. Finally, the conclusions and future prospects are provided in section VI.

II. PROPOSED BEHAVIOR ANALYSIS

A schematic diagram of the proposed framework is presented in Figure 2(a).
The computing system under study is represented by its constituent layers, on each of which opportunistic and cognitive agents of the framework reside in order to collect status information and time profiles. All the collected information is directed to the main unit, the Behavior Analyzer Unit (BAU). Using its multi-paradigm approach and based on the collected data, this unit not only infers the current status of the system and its components, it also produces predictions on changes in the system status or the possibility of occurrence of abnormal events. As the BAU works based on machine learning techniques, it requires enough samples of various behaviors under different conditions and operations to build its inference models. To convert the learning process from passive to active, and to reduce the learning time, another unit, the Behavior Stimulator Unit (BSU), is considered, which is responsible for “stimulating” the desired behaviors in a controlled manner. The last part of the framework is the Cognitive Responder Unit (CRU), which makes recommendations to the system in order to prevent or compensate for abnormal behaviors/events and their side-effects in an optimal way, using all available resources.

The framework considers the following three paradigms: Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. Each paradigm works independently and draws its own inferences. The CRU is supposed to combine (using voting, mixture of experts, stacking, cascading [1], or any other strategy) the conclusions of the three paradigms and make a cognitive decision. Therefore, the CRU cognition could be very different from one system to another, depending on the desired level of dependability, and it can float on a wide spectrum from extremely pessimistic to highly optimistic cognitions.

Fig. 3. Schematic diagram of the proposed behavior analysis framework in its ecosystemic picture.
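As a toy illustration of this combination step, assuming hypothetical verdict labels and simple voting strategies (none of which are prescribed by the framework), the CRU combination logic might be sketched as:

```python
# Hypothetical sketch of the CRU combining per-paradigm verdicts.
# Labels and the pessimistic/majority strategies are illustrative only.
from collections import Counter

def combine_verdicts(verdicts, pessimistic=False):
    """Combine per-paradigm verdicts ('normal' / 'fault-expected').

    With pessimistic cognition, a single 'fault-expected' verdict wins;
    otherwise a plain majority vote over the three paradigms is used.
    """
    if pessimistic and "fault-expected" in verdicts:
        return "fault-expected"
    counts = Counter(verdicts)
    return counts.most_common(1)[0][0]

# Verdicts from the probabilistic, simulated, and time-profile paradigms.
print(combine_verdicts(["normal", "fault-expected", "normal"]))        # normal
print(combine_verdicts(["normal", "fault-expected", "normal"], True))  # fault-expected
```

A more elaborate CRU could weight each paradigm's vote by its recent prediction accuracy, which is one way to realize the "spectrum of cognitions" mentioned above.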
Let us consider an example of a BA application in upgrading a service grade. Assume that, in a system, the Mean Time Between Failures (MTBF) of the dominant fault is 10 weeks and its Mean Time To Repair (MTTR) is 10 minutes. This yields an average downtime of 365/70 × 10 ≈ 52.1 minutes per year, which corresponds to the four-nines availability grade [29]. If the BA framework can achieve a success rate of 90% in predicting faults 15 minutes before their associated failure, the downtime is reduced to 5.2 minutes per year, which corresponds to the five-nines availability grade, achieved without any extra investment in the core hardware/software of the system, just by using the BA framework. This upgrade not only increases the profit of the service provider and reduces the fee for the service user, it also reduces the footprint on the environment; services that use hardware with a longer life span have a smaller lifecycle footprint on the environment because of the overall lowered manufacturing footprint. This shows the great value of the BA framework, especially its real-time BA paradigm.

Although the BA framework is not limited to a specific behavior, we consider only the analysis of “fault” events in this work. Analysis of other types of behavior-related events, such as “degradation,” and of the system behavior itself will be considered in the future. As can be seen from Figure 2(a), the framework works on several layers, from hardware to applications. In each layer, a multi-level approach is considered for representation. At the lowest level, all the system components (even the networking links) at that layer are considered as objects forming a graph based on their functional connectivity to each other. A schematic example of the lowest-level graphs of various layers is shown in Figure 2(b). Each graph hypothetically spreads over the physical and non-physical location coordinates, which can be used to incorporate location intelligence into the framework.
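The arithmetic of the availability-grade example above can be checked with a short script; the helper names are ours, and the "nines" grade is derived from yearly downtime in the usual way:

```python
import math

MIN_PER_YEAR = 365 * 24 * 60  # 525600 minutes in a year

def downtime_minutes_per_year(mtbf_days, mttr_min, prediction_rate=0.0):
    """Expected downtime per year; prediction_rate is the fraction of
    failures avoided by early fault prediction (0.9 in the text)."""
    failures_per_year = 365.0 / mtbf_days
    return failures_per_year * mttr_min * (1.0 - prediction_rate)

def nines(downtime_min):
    """Availability grade as the number of leading nines."""
    availability = 1.0 - downtime_min / MIN_PER_YEAR
    return math.floor(-math.log10(1.0 - availability))

base = downtime_minutes_per_year(70, 10)          # MTBF 10 weeks, MTTR 10 min
with_ba = downtime_minutes_per_year(70, 10, 0.9)  # 90% of faults predicted
print(round(base, 1), nines(base))        # 52.1 4
print(round(with_ba, 1), nines(with_ba))  # 5.2 5
```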
Frequently occurring cliques or sub-graphs at this level form the super-components, which constitute the next level of the representation. The same process leads to higher levels of representation. This brings a vertical scalability to the framework that helps in abstracting the behavior of a large number of components using a few super-components at high levels. At the same time, this multi-level representation facilitates horizontal scaling (increasing the number of lowest-level components) of the system, as the scaling can be easily absorbed within the higher levels. In addition to horizontal and vertical scalability, the framework is capable of scaling hierarchically or federatively along the platform dimension. In this dimension, the behavior of each lower-level platform (ranging from rack, cluster, and data center (node) to cloud and sky) can be recapitulated by its BA units, aggregated with the behavior of others, and then hierarchically passed to the BA units of the higher-level platform, or federatively shared among the BAs of the platforms at the same level. A schematic example of this concept of scalability, from the cloud level to the sky level, is shown in Figure 1, in which a skybus enables communications between cloud-level recapitulators and sky-level BA units.

Although the limited space of this paper does not allow a full discussion, we want to bring attention to another aspect of the proposed BA framework that arises when the complexity, number of actors, and diversity of a system drastically increase. In these cases, the traditional picture of a “system,” especially its implicit “controllability” concept, no longer fits. Instead, a “manageability” concept, in the form of an “ecosystem” picture, seems more appropriate. Obvious examples of these ecosystems are systems highly penetrated into societies, such as i) cellphone networks and ii) “Y-to-the-home” (Y-TTH) networks.
The Y-TTH concept can be seen as a transversal approach compared to the traditional “Fiber-to-the-X” (FTT-X) concept. While the FTT-X emphasizes the depth of fiber penetration into the premises, the Y-TTH focuses on the access technology that touches the home: metal, fiber, or wireless. An example of a Y-TTH (and, at the same time, an FTT-X) implementation is the Fiber-TTH (FTTH) deployment by the Reggefiber company in the Netherlands [31]. In both cases of systems highly penetrated into societies, the resulting populace of highly interactive actors, ranging from end users to service providers and computing providers (which include all types of computing resources, especially access networks), forms an ecosystem of diverse actors. Although no governance is expected in these ecosystems, collaborative living among the actors and also alien (out-of-the-ecosystem; for example, sourced from environmental regulations or sustainability reporting requirements) governance could be the basis of manageability within the ecosystem [8], [5]. Our general picture of these computing ecosystems is provided in Figure 3. The main difference between our picture and the traditional ecosystem-society picture is that we consider all actors, even the society (the end users), to be inside the ecosystem. This implies that society is a part of any ecosystem, and socioeconomic footprint indicators should also be considered along with the environmental indicators. This picture enables us to build closed loops within the ecosystem, and therefore to avoid any requirement or assumption on the ecosystem boundary conditions. In our ecosystem picture, there are three major classes of actors: end users, service providers, and computing providers. Although the computing providers class could actually be considered a subset of the service providers, it is treated as a separate class in order to emphasize the fact that most of the management and governance is actually implemented and executed by these actors.
In other words, service providers are considered “free,” and probably selfish, actors who play within the constraints of the governance imposed by the computing providers. The computing providers class is highly general, and also includes, for example, active and passive operators. The “transformation” actions between the various classes, shown in Figure 3, illustrate the fluid nature of the classes, and represent the transition of actors from one class to another based on their behavior. In other words, the real-time classification of an actor is performed by analyzing its behavior; there is no official or pre-assigned class membership. For example, a CEO can be transformed from the end-user class to the service-provider class when he uses his cellphone to participate in providing a service to another actor. The main BA requirement in the ecosystem is to profile end users and other actors based on their behavior, and to use these profiles in provisioning and also in grading the actors, among other governing actions.

In the proposed ecosystem view, as the concept of controllability no longer exists, the three units of the BA paradigm are redefined as follows. The core of the BA solution is still called the Behavior Analyzer Unit (BAU), but with a different mission: profiling the actors. Because of security and trust concerns, it is assumed that the main behavior profiles available to the BAU are sourced from the computing providers, which are presumably its clients. The second unit, the BSU, is replaced with the Actor Simulation Unit (ASU). The ASU provides required scenarios upon request from the BAU by generating imaginary actors in both the end-user and the service-provider classes. Furthermore, the CRU is replaced with the Cognitive Advisory Unit (CAU), which provides cognitive advice to the computing providers, and potentially to the service providers, without any guarantee of acceptance by those actors.
It is worth noting that each computing provider, or a collection of them, can still have an internal BA solution at its own system level. From here on, the BA framework in its system picture (shown in Figure 2) will be followed.

III. FAULT INJECTION AND PROPAGATION

In computing systems, having a certain degree of reliability and dependability is very important [3]. For instance, employing low-cost processor components or having software bugs can significantly affect their quality of service (QoS). Sophisticated fault testing techniques must be used to achieve a specific dependability requirement in a system. Fault injection is a technique that can validate the dependability of a system by carefully injecting faults into the system and then observing the resulting effects on the system performance [15]. Indeed, this technique accelerates fault occurrence and propagation in the system. At the same time, it can be used to study the behavior of the system in response to faults, and also to trace behaviors back to failures.

Fig. 4. The fault injection scenario.

A fault injection model for a typical system is illustrated in Fig. 4, which consists of the following components:

• Load Generator and Fault Generator: The load generator generates the workload of the target system and provides a foundation for injecting faults. The fault generator injects faults into the system as it executes commands from the workload generator. The injector not only supports different types of faults, it also controls their target component and timing based on requests from the BAU and its own fault library.

• Data Analyzer and Behavior Analyzer: The behavior analyzer requests fault and failure scenarios in order to complete its models during the fault analysis experiments. Specifically, it tracks the execution of the requests and incrementally improves its behavioral models.
The data analyzer is responsible for handling, as a preprocessing unit, the big data collected from the system. In addition, the opportunistic agents, which collect data from various components of the system, trim the data before uploading it to the BAU.

The two major categories of common failure causes are software faults [22] and hardware faults [16], [3]; almost 60% of system failures are caused by software faults. Some fault injection schemes are designed to emulate only software failures, such as the JAFL (Java fault loader) [21]. There are also fault injection schemes that can emulate both software and hardware failures, such as the CEDP (Computer Evaluator for Dependability and Performance) [32]. In particular, the JAFL is a fault injector designed for testing fault tolerance in grid computing systems. While most fault injectors of a similar class focus only on fault injection at a basic level, such as corrupting code or data [7], the JAFL is a more sophisticated software fault injector that considers a wide range of faults, such as CPU processing usage, memory usage, and I/O bus usage [21]. On the other hand, faults can also be injected into the hardware of the system. The CEDP is a fault injection scheme developed for the quantitative evaluation of system dependability by testing both the software and the hardware of the system. This scheme is also able to characterize the behavior of fault propagation in the system. In the CEDP, a hardware fault represents a transient fault in a CPU register or in a memory block. During the next execution of the system program, this fault/error propagates and causes faults in other system states. We will use both of these fault generators in our future work. With the fault injectors incorporated, the BSU can generate the data and profiles required by the BAU to create distributions and models.
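A minimal sketch of such a controllable, BAU-driven fault generator is given below; all names are hypothetical and do not correspond to the JAFL or CEDP interfaces:

```python
# Illustrative fault generator: schedules faults of a requested type on
# a target component, either at a BAU-specified time or at a random one
# drawn from its own (seeded) source. Names are hypothetical.
import random

class FaultGenerator:
    def __init__(self, fault_library, seed=None):
        self.fault_library = fault_library  # e.g. {"mem_leak": ..., "cpu_hog": ...}
        self.rng = random.Random(seed)
        self.schedule = []  # list of (time, component, fault_type)

    def request(self, component, fault_type, at_time=None):
        """BAU-driven request: schedule a fault, or pick a random time."""
        if fault_type not in self.fault_library:
            raise ValueError("unknown fault type: %s" % fault_type)
        t = at_time if at_time is not None else self.rng.uniform(0, 100)
        self.schedule.append((t, component, fault_type))

    def due(self, now):
        """Faults whose injection time has arrived."""
        return [(t, c, f) for (t, c, f) in self.schedule if t <= now]

gen = FaultGenerator({"mem_leak": None, "cpu_hog": None}, seed=1)
gen.request("server-3", "mem_leak", at_time=9.5)
gen.request("switch-1", "cpu_hog")  # random injection time
print(gen.due(now=10.0))
```

In a real deployment, the `due` list would drive actual injection actions (e.g., spawning a memory-hogging process on the target), while the schedule itself would come from BAU scenario requests.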
IV. BEHAVIOR ANALYSIS PARADIGMS

In this section, we present the three paradigms that process and model the behavior data.

A. Probabilistic Behavior Analysis

The probabilistic (statistical inference) analysis, in which the reliability of a system is estimated over time, is a well-known and popular approach [19], [26]. In this paradigm, each layer of the computing system (as shown in Figure 2) is considered as a graph composed of the system components of that layer, connected to each other based on their functional connectivity. This graph can vary over time. As many components are of the same type, the graph can be decomposed into repeated cliques or sub-graphs; having the behavior of the sub-graphs, the behavior of the graph can be easily calculated. The sub-graphs can be considered as super-components and can compose a higher level of representation. The super-components on a level can themselves be merged into sub-graphs (super-components) of the next level. This brings up a multi-level representation for each layer, and also converts the problem into a combinatorial problem of sub-graphs. At each level, a sub-graph consists of a set of directly connected components of that level (which could themselves be sub-graphs (super-components) at a lower level). Some basic examples are shown in Figure 5, in which servers, switches, and network connections are shown by blue squares, orange circles, and green ovals, respectively. To represent a graph/clique on the l-th level with n sub-components and a topology T, we use the notation G^T_{n,l}. When the details are not required, we use G_i to represent a sub-graph. For the sake of simplicity, we assume all components are fully maintained/repaired to their best status at t = 0. Let us define the PoA (or reliability) PoA^{t_0}_G = R(G, t_0) as the probability that the component G has not failed over the interval [0, t_0].
PoA^{t_0}_G is a decreasing function of t_0, and can be related to the Cumulative Distribution Function (CDF) of failure: PoA^{t_0}_G = 1 − CDF_G(t_0). Considering a certain scaling factor s, the CDF(t_0) can be related to a Differential Density Function (DDF), DDF^s(P_0), where P_0 is the CDF value at t_0. The DDF is defined as:

DDF^s_G(P_0 = CDF_G(t_0)) := (1/s) ∂CDF_G(t)/∂t |_{t = t_0},

and a CDF can also be inversely expressed based on its DDF by solving the following differential equation:

∂CDF_G(t)/∂t = s DDF^s_G(CDF_G(t)),   CDF_G(0) = 0.

Fig. 5. Various examples of sub-graphs: (a), (b), and (c).

Fig. 6. (a) The empirical CDF of the lanl05 database compared with its best Weibull and tanh fits. (b) The DDFs of 1- and 2-component systems.

In the rest of the paper, we assume s = 1. If a clique consists of two components G_1 and G_2, and if full availability is required, we have:

PoA^{t_0}_{G_1 ∩ G_2} = PoA^{t_0}_{G_1} PoA^{t_0}_{G_2} = (1 − CDF_{G_1}(t_0))(1 − CDF_{G_2}(t_0)).

From this, we can calculate the CDF of the combined system: 1 − CDF(G_1 ∩ G_2, t_0) = (1 − CDF_{G_1}(t_0))(1 − CDF_{G_2}(t_0)). Then,

CDF(G_1 ∩ G_2, t_0) = CDF_{G_1}(t_0) + CDF_{G_2}(t_0) − CDF_{G_1}(t_0) CDF_{G_2}(t_0).

Therefore, the DDF of the combined system is:

DDF(G_1 ∩ G_2, P_0) = DDF_{G_1}(P_{0,1}) + DDF_{G_2}(P_{0,2}) − P_{0,1} DDF_{G_2}(P_{0,2}) − P_{0,2} DDF_{G_1}(P_{0,1}),

where P_{0,1} = CDF_{G_1}(t_0), and so on. For identical components, we have:

DDF(G_1 ∩ G_2, P_0) = 2(1 − P_{0,1}) DDF_{G_1}(P_{0,1}) = 2 √(1 − P_0) DDF_{G_1}(1 − √(1 − P_0)),

where P_{0,1} = CDF_{G_1}(t_0) = 1 − √(1 − P_0).

Various CDF functions have been used in the literature. One interesting CDF function is the Weibull distribution, which has been effective for large-scale systems [19]. It has two parameters: the shape parameter β and the scale parameter δ. In contrast, in this work, we assume that the CDFs can be approximated by the tanh distribution. We define a tanh CDF distribution as:

CDF^{x_c,x_s}(x) = (1/Z_{x_c,x_s}) [ tanh((x − x_c)/(2x_s)) + tanh(x_c/(2x_s)) ],

Fig. 7. Monte Carlo validation of the CDFs of 1- and 2-component systems: (a) 1-component; (b) 2-component.
where Z_{x_c,x_s} = 1 + tanh(x_c/(2x_s)) is a normalization factor, x_c is the center parameter, and x_s is the shape parameter. The corresponding tanh DDF function is:

DDF^{x_c,x_s,s}(P_0) = [ s − s tanh²( (1/2) log( (exp(x_c/x_s) − P_0 exp(x_c/x_s)) / (P_0 exp(x_c/x_s) + 1) ) ) ] / [ 2x_s + 2x_s tanh(x_c/(2x_s)) ],

where P_0 = CDF^{x_c,x_s}(t_0). As an example, the empirical CDF of the (union-interpreted) lanl05 database [28], retrieved from the Failure Trace Archive (FTA) [18], is compared with its best fits using both the Weibull distribution [18] and the tanh distribution [fitted on the logarithm of time, with optimal parameters x_c = 5.564 (±0.0035) and x_s = 1.577 (±0.0030)], as shown in Figure 6(a). The empirical CDF is shown in black, while the tanh and Weibull distributions are shown as solid blue and dashed red lines, respectively. For the sake of clarity, the absolute difference between the empirical distribution and each fitted distribution is also shown, in percentage, in the figure. As can be seen, the tanh distribution fits the real data better. This is confirmed by its high p-values (compared to the traditional significance level of 0.05) with respect to the Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit (GOF) tests: 0.4999 and 0.5705, respectively. To calculate the p-values, we averaged over 1000 p-value estimations, each of which was calculated on a randomly selected set of 30 samples from the real data set. The profiles of the DDF functions of 1- and 2-component cliques using the tanh distribution are shown in Figure 6(b). For more complex sub-graphs, and when partial availability is required, the calculations become very tedious and vulnerable to errors. An example is the graph in Figure 5(c), which is composed of 11 components.
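As a quick numeric sanity check, the tanh CDF defined above can be implemented directly; the parameter values are the lanl05 fit quoted in the text, and the helper name is ours:

```python
import math

def tanh_cdf(x, xc, xs):
    # CDF^{xc,xs}(x) = [tanh((x - xc)/(2 xs)) + tanh(xc/(2 xs))] / Z,
    # with Z = 1 + tanh(xc/(2 xs)), so that CDF(0) = 0 and CDF(x) -> 1.
    z = 1.0 + math.tanh(xc / (2.0 * xs))
    return (math.tanh((x - xc) / (2.0 * xs)) + math.tanh(xc / (2.0 * xs))) / z

xc, xs = 5.564, 1.577  # fitted on the logarithm of time (lanl05)
print(tanh_cdf(0.0, xc, xs))   # 0.0 by construction
print(tanh_cdf(1e9, xc, xs))   # 1.0 in the large-x limit
print(tanh_cdf(2.0, xc, xs) < tanh_cdf(5.0, xc, xs))  # monotonically increasing
```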
In the full-availability case, i.e., requiring all four servers to be up and connected, the graph can be broken down into two cliques of Figure 5(b) and one clique of only one network connection. Therefore, the PoA of the whole graph will be:

PoA^{t_0}_{G_{11,1}} = PoA^{t_0}_{G_{3,2}} = (1 − CDF(G_{5,1}, t_0))² × (1 − CDF(G_{1,1}, t_0)),

which can be expanded in full. The formula for the corresponding DDF is not provided because of limited space. In the case of partial availability, for example having at least three servers up and connected, the problem can be expressed as a combinatorial problem in which one instance of each of the cliques shown in Figures 5(a) and 5(b) and a network connection compose the graph. The corresponding possible cases and their inter-relations make the calculations very complex. This brings us to the second paradigm, simulated probabilistic behavior analysis, presented in the next section.

Fig. 8. The simulated Monte Carlo estimation of the CDF and DDF of a 5-component system: (a) CDF; (b) DDF.

B. Simulated Probabilistic Behavior Analysis

The second paradigm is based on simulations, in which the target system is implemented in a suitable environment, such as grid simulators^1 or network simulators^2, and the system characteristics are statistically estimated based on the properties of its components. In order to build the statistics of the system, a series of simulated experiments is performed in a Monte Carlo approach, and then characteristics such as CDFs are calculated. In order to show the correctness of the paradigm, the results obtained by the Monte Carlo analysis for one- and two-component systems are estimated and shown in Figure 7. The theoretical results are also shown as dashed lines for comparison. These results have been obtained by averaging over 1000 simulations. The paradigm can easily estimate the CDF and DDF of any system.
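A minimal Monte Carlo sketch of this paradigm, under the simplifying assumption (ours, not the paper's component model) of independent exponential lifetimes, estimates the series-clique CDF and agrees with the product rule of Section IV-A:

```python
import math
import random

def mc_series_cdf(rates, t0, runs=20000, seed=42):
    """Estimate CDF(t0) of a full-availability (series) clique by sampling."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        # The clique fails as soon as its first component fails.
        first_fail = min(rng.expovariate(r) for r in rates)
        if first_fail <= t0:
            failures += 1
    return failures / runs

# Two identical components with rate 1/3; theory: 1 - exp(-2 t0 / 3),
# which also equals CDF1 + CDF2 - CDF1*CDF2 for these components.
est = mc_series_cdf([1 / 3, 1 / 3], t0=2.0)
theory = 1.0 - math.exp(-2 * 2.0 / 3.0)
print(round(est, 2), round(theory, 2))  # the two agree to sampling error
```

The same loop, with component models replaced by a grid or network simulator run, is essentially what the BSU-driven Monte Carlo experiments perform at scale.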
For example, the CDF and the DDF of the 5-component sub-graph of Figure 5(b) are estimated, shown in Figure 8, and compared with the two-component case. In this simulation, it is assumed that the servers, the switch, and the network connections have t_t = 3 and t_r = 1, t_t = 3.5 and t_r = 1, and t_t = 4 and t_r = 4, respectively. Some parametric models can be fitted to the simulation results to provide closed-form models. The simulated paradigm can also be used to validate the results of the theoretical models.

^1 http://www.cloudbus.org/gridsim/; http://simgrid.gforge.inria.fr/
^2 http://www.isi.edu/nsnam/ns/; http://www.nsnam.org/; http://www.omnetpp.org/

Fig. 9. Consolidation of components without lowering the SLO using the PoA estimation: (a) full utilization; (b) less-consuming partial utilization.

An application of probabilistic behavior analyzers is shown in Figure 9. The required availability is two servers and one switch. By estimating the PoA of the sub-graph consisting of just one switch (shown in Figure 9(b)), the system can shut down one of the switches until the PoA reaches a predefined threshold. This not only saves a considerable amount of energy, it can also increase the lifetime of the components.

C. Behavior-Time Profile Modeling and Analysis

The third paradigm works directly with the time profiles of the components. These time profiles, which are collected in an opportunistic way by agents, are learned and modeled using various machine learning and pattern recognition methods, such as Support Vector Machines (SVMs) [25], among others. The advantage of this paradigm is that it works directly with the patterns, not with their statistical moments. Therefore, it can discover behaviors that may be missed by the other paradigms. A typical scenario for time-profile behavior analysis is shown in Figure 10. In this example, the CPU and memory resource usage of a 2-component system is shown.
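As a miniature of this third paradigm (the paper mentions SVMs; here, for brevity, we sketch a simpler per-time-step z-score model, trained on synthetic "normal" profiles of our own making):

```python
# Toy time-profile analyzer: learn per-time-step mean/std from normal
# training profiles, then flag steps that deviate strongly, as a
# memory-usage ramp before a failure would. All numbers are synthetic.
def fit(profiles):
    """Per-time-step mean and std over normal training profiles."""
    n = len(profiles)
    means = [sum(p[i] for p in profiles) / n for i in range(len(profiles[0]))]
    stds = [max(1e-9, (sum((p[i] - means[i]) ** 2 for p in profiles) / n) ** 0.5)
            for i in range(len(profiles[0]))]
    return means, stds

def anomalies(profile, means, stds, z=3.0):
    """Indices of time steps whose z-score exceeds the threshold."""
    return [i for i, v in enumerate(profile)
            if abs(v - means[i]) / stds[i] > z]

normal = [[40, 41, 42, 41], [39, 40, 41, 42], [41, 42, 40, 41]]  # % memory
means, stds = fit(normal)
faulty = [40, 41, 78, 92]  # memory usage ramps up before the failure
print(anomalies(faulty, means, stds))  # → [2, 3]
```

A production analyzer would of course use richer features and models (e.g., an SVM over windowed profiles), but the detect-early-and-respond loop is the same one illustrated in Figure 10.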
The BAU discovers a fault at 9:45 AM, when the memory usage of the second component increases, based on the models learned in the failure-generating phase (section III); the response of the CRU, based on this detection, prevents a failure at 10:00 AM. This not only prevents an instance of SLA violation, it also reduces the downtime by 30 minutes.

V. RELATED WORK

There is a large literature on fault detection and fault avoidance [19], [26]. In [23], several probabilistic models, such as the Weibull and hyper-exponential distributions, were fitted to empirical data available in three datasets: CSIL, Condor, and Long. The authors showed that the Weibull and exponential distributions are more accurate for modeling the behavior of resources. In [27], a data analysis was carried out on the system logs and reports of a cluster system, using time-series analysis and rule-based classification to predict continuous and critical events. A similar analysis was carried out in [12], in which temporal and spatial correlations among failure events in coalition systems are used. In [14], an analysis of runtime data was performed to detect anomalous behaviors; for this purpose, an outlier analysis was used. In [11], Support Vector Machines (SVMs), random indexing, and runtime system logs were used for predicting failures. In [4], an online predictor for mission-critical distributed systems was presented; the predictor was based on the analysis of the network traffic. A modular fashion for integrating a cognitive operator into a system was presented in [6], and in [2], cognitive map modeling was used for the failure analysis of system interaction failure modes. Finally, treatment learning, which enables tracing failures in large-scale systems back to the causing component or factor, is of great value [13]. This not only reduces the expenses and the experts' time, it also reduces the chance of secondary failures caused by human mistakes.
The main highlight of our approach compared to others is its multi-granularity nature, which comes from the various dimensions of the framework. In one dimension, the multi-level analysis of the system (graphs) enables the framework to scale horizontally and vertically while avoiding the exponential computational and analysis costs associated with scaling. In another dimension, its multi-layer aspect provides a systematic and separable approach to analyzing the behavior of the non-hardware parts (software, virtualware, etc.). It can be argued that the main performance bottleneck of future systems is rooted in the errors and faults of their non-hardware parts. Finally, in a third dimension, the multi-paradigm approach of the framework paves the way for cognitive responding in a cross-cover manner.

VI. CONCLUSION AND FUTURE PROSPECTS

A multi-paradigm, multi-layer, and multi-level cognitive behavior analysis framework has been introduced. The framework uses probabilistic (statistical inference), simulated (statistical inference by means of simulation), and time-profile modeling and analysis in order to learn and model various behaviors of complex computing systems. Its multi-paradigm approach enables validation and cross-cover among the paradigms. The framework can operate at multiple granularities thanks to its multi-level and multi-layer approach. This facilitates i) systematic horizontal, vertical, and hierarchical scaling, ii) straightforward integration of non-physical parts (software, virtualware, etc.) in the analysis, and iii) improved system dependability, such as the Probability of Availability (PoA), achieved by smart, cross-covered cognitive responding. Also, a new distribution, the tanh distribution, has been introduced, with promising results on a real database. The application of the framework to failure analysis and detection has been discussed in this work.
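The exact parameterization of the tanh distribution is defined earlier in the paper and is not reproduced in this section; purely as an illustration, the sketch below assumes the natural symmetric form F(x) = (1 + tanh((x − μ)/s))/2, with hypothetical location and spread parameters μ and s. The CDF, the corresponding density, and inverse-CDF sampling all have closed forms under this assumption.

```python
import math
import random

def tanh_cdf(x, mu, s):
    """Assumed symmetric tanh-based CDF: F(x) = 0.5*(1 + tanh((x-mu)/s))."""
    return 0.5 * (1.0 + math.tanh((x - mu) / s))

def tanh_pdf(x, mu, s):
    """Density dF/dx = sech^2((x-mu)/s) / (2s)."""
    return (1.0 / math.cosh((x - mu) / s)) ** 2 / (2.0 * s)

def tanh_sample(mu, s, rng=random):
    """Inverse-CDF sampling: x = mu + s * atanh(2u - 1), u ~ U(0,1)."""
    u = rng.random()
    return mu + s * math.atanh(2.0 * u - 1.0)

# Hypothetical values, e.g. a mean time-between-failures (minutes)
# and its spread; not taken from the paper's dataset.
mu, s = 60.0, 15.0
print(tanh_cdf(mu, mu, s))  # 0.5 by symmetry
```

The closed-form inverse CDF makes both sampling and quantile-based checks (e.g., comparing fitted quantiles against empirical failure data) cheap, which is one practical appeal of such a distribution family.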
The framework is specifically designed for application in open-source architectures such as OpenStack (http://www.openstack.org/) and OpenGSN (http://www.greenstarnetwork.com/), which will be considered as real-system examples in future work. Furthermore, more sophisticated distributions, such as asymmetrical tanh and spline distributions, will be introduced.

Fig. 10. A typical example of time-profile behavior analysis and its impact on the overall grade improvement.

ACKNOWLEDGMENTS

The authors thank the NSERC of Canada for its financial support.

REFERENCES

[1] Ethem Alpaydin. Techniques for combining multiple learners. In Proceedings of Engineering of Intelligent Systems, pages 6–12. ICSC Press, 1998.
[2] Manu Augustine, Om Yadav, Rakesh Jain, and Ajay Rathore. Cognitive map-based system modeling for identifying interaction failure modes. Research in Engineering Design, 23(2):105–124, 2012.
[3] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.
[4] Roberto Baldoni et al. Online black-box failure prediction for mission critical distributed systems. Technical Report 3/12-2012, MIDLAB, 2012.
[5] Reinette Biggs et al. Toward principles for enhancing the resilience of ecosystem services. Annual Review of Environment and Resources, 37(1), 2012.
[6] Sven Burmester et al. Tool support for the design of self-optimizing mechatronic multi-agent systems. STTT, 10(3):207–222, 2008.
[7] E. Martins, C.-M.F. Rubira, and N.-G.M. Leme. Jaca: A reflective fault injection tool based on patterns. In DSN'02, pages 483–487, Maryland, USA, 23–26 June 2002.
[8] Malin Falkenmark. Good ecosystem governance: Balancing ecosystems and social needs. In Governance as a Trialogue: Government-Society-Science in Transition, Water Resources Development and Management, pages 59–76. Springer Berlin Heidelberg, 2007.
[9] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Carbon metering and effective tax cost modeling for virtual machines. In CLOUD'12, pages 758–763, Honolulu, Hawaii, USA, June 2012.
[10] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Multi-level grouping genetic algorithm for low carbon virtual private clouds. In CLOSER'12, pages 315–324, Porto, Portugal, April 18–21, 2012.
[11] Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. Failure prediction based on log files using random indexing and support vector machines. Journal of Systems and Software, in press, 2012.
[12] Song Fu and Cheng-Zhong Xu. Exploring event correlation for failure prediction in coalitions of clusters. In SC'07, pages 1–12, Reno, Nevada, 2007. ACM.
[13] Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet. Automatically finding the control variables for complex system behavior. Automated Software Engineering, 17(4):439–468, 2010.
[14] Qiang Guan and Song Fu. auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems. In IPCCC'10, pages 73–80, 2010.
[15] Mei-Chen Hsueh, T.K. Tsai, and R.K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, April 1997.
[16] Bing Huang, M. Rodriguez, Ming Li, J.B. Bernstein, and C.S. Smidts. Hardware error likelihood induced by the operation of software. IEEE Transactions on Reliability, 60(3):622–639, 2011.
[17] K. Keahey, M. Tsugawa, A. Matsunaga, and J. Fortes. Sky computing. IEEE Internet Computing, 13(5):43–51, 2009.
[18] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In CCGrid'10, pages 398–407, 2010.
[19] Antonios Litke, Dimitrios Skoutas, Konstantinos Tserpes, and Theodora Varvarigou. Efficient task replication and management for adaptive fault tolerance in mobile grid environments.
Future Generation Computer Systems, 23(2):163–178, February 2007.
[20] Daniele Miorandi, Sabrina Sicari, Francesco De Pellegrini, and Imrich Chlamtac. Internet of things: Vision, applications and research challenges. Ad Hoc Networks, 10(7):1497–1516, September 2012.
[21] N. Rodrigues, D. Sousa, and L.M. Silva. A fault-injector tool to evaluate failure detectors in grid-services. In CoreGRID'07, pages 261–271, Heraklion, Crete, Greece, 12–13 June 2007.
[22] R. Natella, D. Cotroneo, J. Duraes, and H. Madeira. On fault representativeness of software fault injection. IEEE Transactions on Software Engineering, accepted, 2012.
[23] Daniel Nurmi, John Brevik, and Rich Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In José Cunha and Pedro Medeiros, editors, Lecture Notes in Computer Science (Euro-Par 2005 Parallel Processing), volume 3648, pages 612–612. Springer, 2005.
[24] Fabio Oliveira et al. Barricade: defending systems against operator mistakes. In EuroSys'10, pages 83–96, Paris, France, 2010. ACM.
[25] Juan José Rodríguez, Carlos J. Alonso, and José A. Maestro. Support vector machines of interval-based features for time series classification. Knowledge-Based Systems, 18(4–5):171–178, August 2005.
[26] Brent Rood and Michael Lewis. Grid resource availability prediction-based scheduling and task replication. Journal of Grid Computing, 7(4):479–500, 2009.
[27] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In KDD'03, pages 426–435, Washington, D.C., 2003. ACM.
[28] B. Schroeder and G.A. Gibson. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and Secure Computing, 7(4):337–350, 2010.
[29] A.P. Snow and G.R. Weckman. What are the chances an availability SLA will be violated?
In ICN'07, pages 35–35, Martinique, 22–28 April 2007.
[30] The Climate Group. SMART 2020: Enabling the low carbon economy in the information age. Technical report, the Global eSustainability Initiative (GeSI), 2008.
[31] Annemijn Van Gorp and Catherine A. Middleton. Fiber to the home unbundling and retail competition: Developments in the Netherlands. Communications and Strategies, 78(2):87–106, June 2010.
[32] Keun Soo Yim, Z. Kalbarczyk, and R.K. Iyer. Measurement-based analysis of fault and error sensitivities of dynamic memory. In DSN'10, pages 431–436, Chicago, IL, USA, June 28 – July 1, 2010.