Cognitive Behavior Analysis framework for Fault
Prediction in Cloud Computing
Reza Farrahi Moghaddam and Fereydoun Farrahi Moghaddam
Synchromedia Lab, ETS, University of Quebec
Montreal (QC), Canada H3C 1K3
Email: [email protected], [email protected]

Vahid Asghari
INRS-EMT, University of Quebec
Montreal (QC), Canada H5A 1K6
Email: [email protected]

Mohamed Cheriet
Synchromedia Lab, ETS, University of Quebec
Montreal (QC), Canada H3C 1K3
Email: [email protected]
Abstract—Complex computing systems, including clusters, grids, clouds and skies, are becoming the fundamental tools of green and sustainable ecosystems of the future. However, they can also pose critical bottlenecks and ignite disasters. Their complexity and number of variables can easily exceed the capacity of any analyst or traditional operational-research paradigm. In this work, we introduce a multi-paradigm, multi-layer and multi-level behavior analysis framework that can adapt to the behavior of a target complex system. It not only learns and detects normal and abnormal behaviors, it can also suggest cognitive responses in order to increase the system's resilience and grade. The multi-paradigm nature of the framework provides robust redundancy in order to cross-cover the possible blind spots of each individual paradigm. After providing the high-level design of the framework, three paradigms are discussed: Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. To be more precise, and because of space limitations, we focus in this paper on fault prediction as a specific event-based abnormal behavior, considering both spontaneous and gradual failure events. The promising potential of the framework is demonstrated using simple examples and topologies. The framework can provide an intelligent approach to balancing the green and high-probability-of-completion (or high-probability-of-availability) aspects of computing systems.
I. INTRODUCTION
Computing systems, in various forms such as clusters, grids, clouds and skies [9], [10], [17], are scaling not only in the number of involved components and their physical distribution; they are also becoming very heterogeneous, with new components such as sensors and mobile devices. Although this brings more computational and conscious power, at the same time it increases the degree of uncertainty and risk. There are many risk sources involved, such as operators (human), software bugs, software overload, hardware overload, and hardware failure, among others [24]. Therefore, a full understanding of the system behavior, which brings the ability to predict its normal or abnormal behaviors, is of great importance for scheduling, allocation, binding, and other actions. We call this Behavior Analysis (BA), and we propose a framework with three high-level units: the Behavior Analyzer Unit (BAU), the Behavior Stimulator Unit (BSU), and the Cognitive Responder Unit (CRU). A schematic example of the proposed framework is shown in Figure 1.

Fig. 1. A schematic example of the proposed framework in a sky system.
In a non-technical way, we can consider two high-level
modes of transaction between a service provider and its clients:
1) Leasing: The provider dedicates an agreed set of resources to the client for an agreed, limited period of time. The lease can be renewed upon agreement and resource availability. This mode is especially interesting for handling resources that may evanesce or vanish without notice (such as resources powered by intermittent energy sources). The lease mode is a good practice for service providers with an Infrastructure as a Service (IaaS) model or similar models.
2) Task completion: The service provider guarantees completion of a task (or a volume of tasks) within an agreed period of time. This implicitly implies that the provider is aware of the task's detailed steps.
In reality, there is a chance of failure to deliver the agreed
Service-Level Objectives (SLOs). This brings us to two important concepts: the Probability of Completion (PoC) and
the Probability of Availability (PoA). The ability of a provider to determine, estimate, or measure the PoC (or PoA, depending on its business model) not only enables it to negotiate instrumental Service-Level Agreements (SLAs) with its clients, it also provides a way to grade its various services.

Fig. 2. Schematic diagram of the proposed behavior analysis framework in its systemic picture. (a) The overall picture. (b) The multi-layer nature of the framework.
Especially, services with High Probability of Completion
(HPoC) or High Probability of Availability (HPoA) grades
would attract mission-critical clients, such as communication
providers and emergency operators. Usually, HPoC (or HPoA) is achieved by resource "over-allocation" and task "replication." This can be interpreted as a traditional correlation between the HPoC (or HPoA) requirement and a high level of energy/resource consumption (non-greenness) of a service. This is a critical issue because, with the push for the ICT enabling effect and the move toward the Internet of Things, HPoC (or HPoA) will be required by an enormous number of clients; the ICT enabling effect is one of the fundamental instruments for reducing the footprint of other industrial sectors by re-directing non-ICT service calls to the ICT sector [30]. The Internet of Things is also becoming a reality in the near future because of the exponential increase in the number of portable phones, distributed sensors, and Radio Frequency (RF) devices [20].
One way to break the correlation between the HPoC (or
HPoA) and non-greenness of services is adding intelligence
to determine, predict and react to the possible changes in
the PoC (or PoA) in real-time. This could help a system
to provide a desirable HPoC (or HPoA) with a minimum
amount of resources. This analyzer and its implementation
is our ultimate research goal, and in this work, we present
an overview and some preliminary results. Calculation and
verification of the PoC (or PoA) of a service can be carried
out based on analyzing the system configuration. However,
real-time variation in PoC (or PoA) is very critical and can
lead to violation of the SLA despite having a satisfactory
configuration-based predicted PoC (or PoA) value. Therefore, in our framework, we consider three paradigms that compensate for each other's weaknesses: Probabilistic (Statistical Inference) Behavior Analysis, Simulated Probabilistic (Statistical Inference by Means of Simulation) Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. The first two paradigms work based on the configuration of the system, using the data collected from experiments and simulations to provide insight into the system behavior. The third paradigm uses machine learning techniques to learn the patterns and behaviors of the system from its time profiles, collected by a set of opportunistic agents across the system.
The organization of the paper is as follows. In section II,
a brief introduction of the proposed framework is presented.
Fault injection approaches are discussed in section III. The
three behavior analysis paradigms and some experimental results are presented in section IV. The related work is described
in section V. Finally, the conclusions and future prospects are
provided in section VI.
II. PROPOSED BEHAVIOR ANALYSIS
A schematic diagram of the proposed framework is presented in Figure 2(a). The computing system under study is represented by several layers, on each of which opportunistic and cognitive agents of the framework reside in order to collect statuses and time profiles. All the collected information is directed to the main unit, the Behavior Analyzer Unit (BAU). Using its multi-paradigm approach and based on the collected data, this unit not only infers the current status of the system and its components, it also produces predictions on changes in the system status or the possibility of occurrence of abnormal events. As the BAU works based on machine learning techniques, it requires enough samples of various behaviors under different conditions and operations to build its inference models. To convert the learning process from passive to active, and to reduce the learning time, another unit, the Behavior Stimulator Unit (BSU), is considered, which is responsible for "stimulating" the desired behaviors in a controlled manner. The last part of the framework is the Cognitive Responder Unit (CRU), which makes recommendations to the system in order to prevent or compensate for abnormal behaviors/events and their side-effects in an optimal way, using all available resources.
The framework considers the following three paradigms:
Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and
Analysis. Each paradigm works independently and draws its own inference. The CRU is supposed to combine the conclusions of the three paradigms (using voting, mixture of experts, stacking, cascading [1], or any other strategy) and make a cognitive decision. Therefore, the CRU cognition could be very different from one system to another, depending on the desired level of dependability, and it can float on a wide spectrum from extremely pessimistic to highly optimistic cognitions.

Fig. 3. Schematic diagram of the proposed behavior analysis framework in its ecosystemic picture.
Let us consider an example of BA application in upgrading a service grade. Assume that, in a system, the Mean Time Between Failures (MTBF) of the dominant fault is 10 weeks and its Mean Time To Repair (MTTR) is 10 minutes. This leads to an average downtime of 365/70 × 10 ≈ 52 minutes and 8.4 seconds per year, which corresponds to a 4-nines availability grade [29]. If the BA framework can achieve a success rate of 90% in predicting faults 15 minutes before their associated failure, the downtime will be reduced to 5.2 minutes per year, which corresponds to a 5-nines availability grade, achieved without any extra investment in the core hardware/software of the system and just by using the BA framework. This upgrade not only increases the profit of the service provider and reduces the fee for the service user, it also reduces the footprint on the environment; services that use hardware with a longer life span have a lower lifecycle footprint on the environment because of the overall lowered manufacturing footprint. This shows the great value of the BA framework, especially its real-time BA paradigm. Although the BA framework is not limited to a specific behavior, we consider only analysis of "fault" events in this work. Analysis of other types of behavior-related events, such as "degradation," and also the system behavior itself will be considered in the future.
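The availability arithmetic above can be reproduced in a few lines. This is a minimal sketch (the function names are ours), in which the grade simply counts the leading nines of the availability ratio:

```python
import math

def annual_downtime_minutes(mtbf_days: float, mttr_minutes: float) -> float:
    """Expected downtime per year: (failures per year) x (repair time per failure)."""
    failures_per_year = 365.0 / mtbf_days
    return failures_per_year * mttr_minutes

def nines_grade(downtime_min_per_year: float) -> int:
    """Number of 'nines' of availability implied by the annual downtime."""
    minutes_per_year = 365.0 * 24 * 60
    unavailability = downtime_min_per_year / minutes_per_year
    return int(math.floor(-math.log10(unavailability)))

# MTBF of 10 weeks (70 days), MTTR of 10 minutes:
base = annual_downtime_minutes(70, 10)   # ~52.1 minutes/year -> 4 nines
# With 90% of faults predicted 15 minutes ahead and avoided:
improved = base * (1.0 - 0.90)           # ~5.2 minutes/year -> 5 nines
print(base, nines_grade(base), improved, nines_grade(improved))
```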
As can be seen from Figure 2(a), the framework works on
several layers from hardware to applications. In each layer, a
multi-level approach is considered for representation. At the
lowest level, all the system components (even the networking links) at that layer are considered as objects forming a
graph based on their functional connectivity to each other. A
schematic example of lowest level graphs of various layers
is shown in Figure 2(b). Each graph hypothetically spreads
over the physical and non-physical location coordinates that
can be used to incorporate location intelligence into the
framework. Highly frequent cliques or sub-graphs at this level form the super-components that constitute the next level of the representation. The same process leads to higher levels of representation. This brings vertical scalability to the framework, which helps in abstracting the behavior of a large number of components using a few super-components at high levels. At the same time, this multi-level representation facilitates horizontal scaling of the system (increasing the number of lowest-level components), as the scaling can be easily absorbed within the higher levels. In addition to horizontal and vertical scalability, the framework is capable of scaling hierarchically or federatively along the platform dimension. In this dimension, the behavior observed by the BA units of each lower-level platform (ranging from rack, cluster, and data center (node) to cloud and sky) can be recapitulated, aggregated with the behavior of others, and then hierarchically passed to the BA units of the higher-level platform, or federatively shared among the BAs of the platforms at the same level. A schematic example of this concept of scalability, from the cloud level to the sky level, is shown in Figure 1, in which a skybus enables communications between cloud-level recapitulators and sky-level BA units.
Although the limited space of this paper does not allow full
discussion, we want to bring attention to another aspect of the
proposed BA framework that arises when the complexity, number of actors, and diversity of a system drastically increase.
In these cases, the traditional picture of a "system," especially its implicit "controllability" concept, no longer fits. Instead, a "manageability" concept, in the form of an "ecosystem" picture, seems more appropriate. Obvious examples of these ecosystems are systems highly penetrated into societies, such as: i) cellphone networks and ii) "Y-to-the-home" (Y-TTH) networks. The Y-TTH concept can be seen as a transversal approach compared to the traditional "Fiber-to-the-X" (FTT-X) concept. While FTT-X emphasizes the depth of fiber penetration into the premises, Y-TTH focuses on the touching access technology: metal, fiber, and wireless. An example of a Y-TTH (and, at the same time, an FTT-X) implementation is the Fiber-TTH (FTTH) Reggefiber company in the Netherlands [31].
In both cases of such highly-penetrated systems, the resulting populace of highly interactive actors, ranging from end users to service providers and computing providers (which include all types of computing resources, especially access networks), forms an ecosystem of diverse actors. Although no governance is expected in these ecosystems, collaborative living among the actors and also alien governance (from outside the ecosystem; for example, sourced from environmental regulations or sustainability reporting requirements) could be the basis of manageability within the ecosystem [8], [5]. Our general picture of these computing ecosystems is provided in Figure 3. The main difference between our picture and the traditional ecosystem-society picture is that we consider all actors, even the society (the end users), to be inside the ecosystem. This implicitly implies that society is a part of any ecosystem, and socioeconomic footprint indicators should be considered along with the environmental indicators. This picture enables us to build closed loops within the ecosystem, and therefore avoid any requirement or assumption on the ecosystem boundary conditions.
In our ecosystem picture, there are three major classes of
actors: end users, service providers, and computing providers.
Although the computing providers class could actually be considered a subset of the service providers class, it is treated as a separate class in order to emphasize the fact that most management and governance is actually implemented and executed by these actors. In other words, service providers are considered "free," and probably selfish, actors who play within the constraints of the governance imposed by the computing providers. The computing providers class is highly general, and also includes, for example, active and passive operators. The "transformation" actions between various classes, shown in Figure 3, illustrate the vague nature of the classes, and represent the transition of actors from one class to another based on their behavior. In other words, the real-time classification of an actor is performed by analyzing its behavior, and there is no official or assigned class membership. For example, a CEO can be transformed from the end user class to the service provider class when he uses his cellphone to participate in providing a service to another actor.
The main BA requirement in the ecosystem is to profile end users and other actors based on their behavior, and to use these profiles in provisioning and also grading the actors, among other governing actions. In the proposed ecosystem view, as the concept of controllability no longer exists, the three units of the BA paradigm are redefined as follows. The core of the BA solution is still called the Behavior Analyzer Unit (BAU), but with a different mission: profiling the actors. Because of security and trust concerns, it is assumed that the main sources of behavior profiles available to the BAU are the computing providers, which are presumably its clients. The second unit, the BSU, is replaced with the Actor Simulation Unit (ASU). The ASU provides required scenarios upon request from the BAU by generating imaginary actors in both the end user and service provider classes. Furthermore, the CRU is replaced with the Cognitive Advisory Unit (CAU), which provides cognitive advice to the computing providers, and potentially to the service providers, without any guarantee of acceptance by those actors. It is worth noting that each computing provider, or a collection of them, can still have an internal BA solution at their systemic level. From here on, the BA framework in its system picture (shown in Figure 2) will be followed.
III. FAULT INJECTION AND PROPAGATION
In computing systems, having a certain degree of reliability and dependability is very important [3]. For instance,
employing low-cost processor components or having software
bugs can significantly affect their quality of service (QoS).
Sophisticated fault testing techniques must be used to obtain a
specific dependability requirement in a system. Fault injection
is a technique that can validate the dependability of a system
by carefully considering injection of faults into the system and
then observing the resulting effects on the system performance [15]. Indeed, this technique accelerates fault occurrence and propagation in the system. At the same time, it can be used to study the behavior of the system in response to faults, and also to track behaviors back to failures.

Fig. 4. The fault injecting scenario.

A fault injection model for a typical system is illustrated in Fig. 4, which consists of the following components:
• Load Generator and Fault Generator: The load generator generates the workload of the target system and provides a foundation for injecting faults. The fault generator injects faults into the system as it executes commands from the workload generator. The injector not only supports different types of faults, it also controls their target component and timing based on the requests from the BAU and its own fault library.
• Data Analyzer and Behavior Analyzer: The behavior analyzer requests fault and failure scenarios in order to complete its models during the fault analysis experiments. Specifically, it tracks the execution of the requests, and incrementally improves its behavioral models. The data analyzer is responsible for handling, as a preprocessing unit, the big data collected from the system. In addition, the opportunistic agents, which collect data from various components of the system, trim the data before uploading it to the BAU.
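The generator pair described above can be sketched as a toy scheduler. The fault library, component names, and `request`/`run` API below are illustrative assumptions, not the interface of an actual injector such as JAFL or CEDP:

```python
import heapq
import random

class FaultGenerator:
    """Toy fault generator: schedules faults from a small fault library
    onto target components at requested times (a sketch of the injector
    role; the fault names here are illustrative only)."""

    FAULT_LIBRARY = ("cpu_overload", "memory_leak", "io_stall", "link_down")

    def __init__(self, seed=42):
        self._rng = random.Random(seed)
        self._schedule = []  # min-heap of (time, component, fault_type)

    def request(self, component, at_time, fault_type=None):
        """BAU-style request: inject a given (or random) fault at a time."""
        fault = fault_type or self._rng.choice(self.FAULT_LIBRARY)
        heapq.heappush(self._schedule, (at_time, component, fault))

    def run(self, horizon):
        """Replay the schedule up to `horizon`, returning events in time order."""
        events = []
        while self._schedule and self._schedule[0][0] <= horizon:
            events.append(heapq.heappop(self._schedule))
        return events

gen = FaultGenerator()
gen.request("server-1", at_time=5.0, fault_type="memory_leak")
gen.request("switch-0", at_time=2.5)
print(gen.run(horizon=10.0))  # the switch-0 fault is replayed first
```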
The two major categories of common causes of failure are software faults [22] and hardware faults [16], [3]; almost 60% of system failures are caused by software faults. Some fault injection schemes are designed to emulate only software failures, such as the JAFL (Java fault loader) [21]. There are also fault injection schemes that can emulate both software and hardware failures, such as the CEDP (Computer Evaluator for Dependability and Performance) [32]. In particular, the JAFL is a fault injector scheme designed for testing fault-tolerance in grid computing systems. While most schemes in this class focus only on fault injection at a basic level, such as corrupting code or data [7], the JAFL is a more sophisticated software fault injector that considers a wide range of faults, such as CPU usage, memory usage, and I/O bus usage [21]. On the other hand, faults can also be injected into the hardware of the system. The CEDP is a fault injection scheme developed for quantitative evaluation of system dependability by testing both the software and hardware of the system. This scheme is also able to characterize the behavior of fault propagation in the system. In the CEDP, a hardware fault represents a transient fault in a CPU register or in a memory block. Then, during the next execution of the system program, this fault/error propagates and causes faults in other system states. We will use both of these fault generators in our future work. Having the fault injectors incorporated, the BSU can generate the data and profiles required by the BAU to create distributions and models.
IV. BEHAVIOR ANALYSIS PARADIGMS
In this section, we present the three paradigms that process
and model the behavior data.
A. Probabilistic Behavior Analysis
The probabilistic (statistical inference) analysis, in which the reliability of a system is estimated over time, is a well-known and popular approach [19], [26]. In this paradigm, each layer of the computing system (as shown in Figure 2) is considered as a graph composed of the system components of that layer, connected to each other based on their functional connectivity. This graph can vary over time. As many components are of the same type, the graph can be decomposed into repeated cliques or sub-graphs; given the behavior of the sub-graphs, the behavior of the graph can be easily calculated. The sub-graphs can be considered as super-components and can compose a higher level of representation. The super-components on a level can themselves be merged into sub-graphs (super-components) of the next level. This brings up a multi-level representation for each layer, and also converts the problem into a combinatorial problem of sub-graphs. At each level, a sub-graph consists of a set of directly connected components of that level (which could themselves be sub-graphs (super-components) at a lower level). Some basic examples are shown in Figure 5. Servers, switches, and network connections are shown by blue squares, orange circles, and green ovals, respectively. To represent a graph/clique on the l-th level with n sub-components and a topology T, we use the notation G^T_{n,l}. When the details are not required, we use G_i to represent a sub-graph.
For the sake of simplicity, we assume all components are fully maintained/repaired to their best status at t = 0. We define the PoA (or reliability) PoA^{t_0}_G = R(G, t_0) as the probability that the component G has not failed over the interval [0, t_0]. PoA^{t_0}_G is a decreasing function of t_0, and can be related to the Cumulative Distribution Function (CDF) of failure:

    PoA^{t_0}_G = 1 − CDF_G(t_0).

Considering a certain scaling factor s, the CDF(t_0) can be related to a Differential Density Function (DDF), DDF^s(P_0), where P_0 is the CDF value at t_0. The DDF is defined as:

    DDF^s_G(P_0 = CDF_G(t_0)) := (1/s) [∂CDF_G(t)/∂t]|_{t_0},

and a CDF can also be inversely expressed based on its DDF by solving the following differential equation:

    ∂CDF_G(t)/∂t = s · DDF^s_G(CDF_G(t)),    CDF_G(0) = 0.
Fig. 5. Various examples of sub-graphs: panels (a), (b), and (c).
Fig. 6. (a) The empirical CDF of the lanl05 database compared with its best Weibull and tanh fits. (b) The DDFs of 1- and 2-component systems.
In the rest of the paper, we assume s = 1.
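The CDF↔DDF relation above can be checked numerically. For an exponential lifetime CDF(t) = 1 − exp(−λt) (an illustrative choice, not the paper's data), the DDF at s = 1 has the closed form λ(1 − P_0), and integrating the differential equation recovers the CDF; a minimal sketch:

```python
import math

# Exponential lifetime: CDF(t) = 1 - exp(-lam*t), so dCDF/dt = lam*(1 - CDF),
# i.e. DDF(P0) = lam*(1 - P0) when s = 1.
lam = 0.5
ddf = lambda p0: lam * (1.0 - p0)

# Solve dCDF/dt = DDF(CDF), CDF(0) = 0, by forward Euler integration.
dt, t, p = 1e-4, 0.0, 0.0
while t < 4.0:
    p += dt * ddf(p)
    t += dt

analytic = 1.0 - math.exp(-lam * 4.0)
print(p, analytic)  # the integrated and closed-form CDFs agree closely
```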
If a clique consists of two components, and if full availability is required, we have:

    PoA^{t_0}_{G_1 ∩ G_2} = PoA^{t_0}_{G_1} · PoA^{t_0}_{G_2} = (1 − CDF_{G_1}(t_0))(1 − CDF_{G_2}(t_0)).

From this, we can calculate the CDF of the combined system:

    1 − CDF(G_1 ∩ G_2, t_0) = (1 − CDF_{G_1}(t_0))(1 − CDF_{G_2}(t_0)).

Then,

    CDF(G_1 ∩ G_2, t_0) = CDF_{G_1}(t_0) + CDF_{G_2}(t_0) − CDF_{G_1}(t_0) · CDF_{G_2}(t_0).
Therefore, the DDF of the combined system is:

    DDF(G_1 ∩ G_2, P_0) = DDF_{G_1}(P_{0,1}) + DDF_{G_2}(P_{0,2}) − P_{0,1} · DDF_{G_2}(P_{0,2}) − P_{0,2} · DDF_{G_1}(P_{0,1}),

where P_{0,1} = CDF_{G_1}(t_0), and so on. For identical components, we have:

    DDF(G_1 ∩ G_2, P_0) = 2(1 − P_{0,1}) · DDF_{G_1}(P_{0,1}) = 2√(1 − P_0) · DDF_{G_1}(1 − √(1 − P_0)),

where P_{0,1} = CDF_{G_1}(t_0) = 1 − √(1 − P_0).
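A quick numerical check of the series-combination rules above, using assumed exponential component CDFs (for which the combined system is again exponential, with rate λ₁ + λ₂):

```python
import math

# Two independent components with exponential failure CDFs (assumed here).
lam1, lam2, t0 = 0.3, 0.7, 2.0
cdf1 = 1.0 - math.exp(-lam1 * t0)
cdf2 = 1.0 - math.exp(-lam2 * t0)

# PoA of the series pair = product of the individual PoAs:
poa_pair = (1.0 - cdf1) * (1.0 - cdf2)

# Combined CDF: CDF1 + CDF2 - CDF1*CDF2
cdf_pair = cdf1 + cdf2 - cdf1 * cdf2

print(poa_pair, 1.0 - cdf_pair)                        # identical by construction
print(cdf_pair, 1.0 - math.exp(-(lam1 + lam2) * t0))   # exponential with rate lam1+lam2
```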
Various CDF functions have been used in the literature. One of the interesting CDF functions is the Weibull distribution, which has been effective for large-scale systems [19]. It has two parameters: the shape parameter β and the scale parameter δ. In contrast, in this work, we assume that the CDFs can be approximated by the tanh distribution. We define a tanh CDF distribution as:

    CDF^{x_c, x_s}(x) = (1 / Z_{x_c, x_s}) · [ tanh((x − x_c) / (2 x_s)) + tanh(x_c / (2 x_s)) ],

where Z_{x_c, x_s} = 1 + tanh(x_c / (2 x_s)) is a normalization factor, x_c is the center parameter, and x_s is the shape parameter. The corresponding tanh DDF function is:

    DDF^{x_c, x_s, s}(P_0) = [ s − s · tanh²( (1/2) log( (exp(x_c / x_s) − P_0 exp(x_c / x_s)) / (P_0 exp(x_c / x_s) + 1) ) ) ] / [ 2 x_s + 2 x_s tanh(x_c / (2 x_s)) ],

where P_0 = CDF^{x_c, x_s}(t_0).

Fig. 7. Monte Carlo validation of the CDFs of (a) 1-component and (b) 2-component systems.

As an example, the empirical
CDF of the (union-interpreted) lanl05 database [28], retrieved
from the Failure Trace Archive (FTA) [18], is compared with
its best fits using both the Weibull distribution [18] and the
tanh distribution [fitted on the logarithm of time with optimal
parameters xc = 5.564(±0.0035) and xs = 1.577(±0.0030)],
and shown in Figure 6(a). The empirical CDF is shown
in black, while the tanh and the Weibull distributions are
shown in solid blue and dashed red lines respectively. For the
sake of clarity, the absolute differences between the empirical distribution and the fitted distributions are also shown, in percentage, in the figure. As can be seen, the tanh distribution
has a better fit to the real data. This is confirmed by its
high p-values (compared to the traditional significance level
of 0.05) with respect to the Kolmogorov-Smirnov and the
Anderson-Darling goodness of fit (GOF) tests: 0.4999 and
0.5705, respectively. The p-values were obtained by averaging 1000 p-value estimations, each calculated on a randomly-selected set of 30 samples from the real data set. The profiles of the DDF functions of 1- and 2-component cliques using the tanh distribution are shown in Figure 6(b).
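A minimal implementation of the tanh CDF defined above (the helper name `tanh_cdf` is ours), evaluated with parameters close to the reported lanl05 fit on log-time:

```python
import math

def tanh_cdf(x, xc, xs):
    """tanh CDF: zero at x = 0 and approaching 1 as x -> infinity."""
    z = 1.0 + math.tanh(xc / (2.0 * xs))  # normalization factor Z_{xc,xs}
    return (math.tanh((x - xc) / (2.0 * xs)) + math.tanh(xc / (2.0 * xs))) / z

xc, xs = 5.564, 1.577   # close to the lanl05 fit reported above (log-time)
print(tanh_cdf(0.0, xc, xs))    # 0 by construction
print(tanh_cdf(xc, xc, xs))     # CDF value at the center parameter
print(tanh_cdf(50.0, xc, xs))   # approaches 1
```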
For more complex sub-graphs, and when partial availability is required, the calculations become very tedious and vulnerable to errors. An example is the graph in Figure 5(c), which is composed of 11 components. In the full-availability case, i.e., requiring all four servers to be up and connected, the graph can be broken down into two cliques of Figure 5(b) and one clique of only one network connection. Therefore, the PoA of the whole graph will be:

    PoA^{t_0}_{G_{11,1}} = PoA^{t_0}_{G_{3,2}} = (1 − CDF(G_{5,1}, t_0))² × (1 − CDF(G_{1,1}, t_0)),

which can be expanded in full. The formula for the corresponding DDF is not provided because of limited space. In
the case of partial availability, for example having at least three servers up and connected, the problem can be expressed as a combinatorial problem in which one instance of each of the cliques shown in Figures 5(a) and 5(b) and a network connection compose the graph. The corresponding possible cases and their inter-relations make the calculations very complex. This brings us to the second paradigm, simulated behavior analysis, presented in the next section.

Fig. 8. The simulated Monte Carlo estimation of the (a) CDF and (b) DDF of a 5-component system.
B. Simulated Probabilistic Behavior Analysis
The second paradigm is based on simulations, in which
the target system is implemented in a suitable environment,
such as grid simulators1 or network simulators2 , and the
system characteristics are statistically estimated based on the
properties of its components. In order to build the statistics of
the system, a series of simulated experiments is performed in a
Monte Carlo approach, and then some characteristics, such as
CDFs, are calculated. In order to show the correctness of the
paradigm, the results obtained by the Monte Carlo analysis for
one- and two-component systems are estimated and shown in
Figure 7. The theoretical results are also shown as dashed lines
for the sake of comparison. These results have been obtained
by averaging over 1000 simulations. The paradigm can easily estimate the CDF and DDF of any system. For example, the CDF and the DDF of the 5-component sub-graph of Figure 5(b) are estimated, shown in Figure 8, and compared with the two-component case. In this simulation, it is assumed that the servers, the switch, and the network connections have t_t = 3 and t_r = 1, t_t = 3.5 and t_r = 1, and t_t = 4 and t_r = 4,
respectively. Some parametric models can be fitted to the simulation results to provide closed-form models. The simulated paradigm can also be used to validate the results of the theoretical models.

1 http://www.cloudbus.org/gridsim/; http://simgrid.gforge.inria.fr/
2 http://www.isi.edu/nsnam/ns/; http://www.nsnam.org/; http://www.omnetpp.org/

Fig. 9. Consolidation of components without lowering the SLO using the PoA estimation. (a) Full utilization. (b) Less-consuming partial utilization.
An application of the probabilistic behavior analyzers is shown in Figure 9. The required availability is two servers and one switch. By estimating the PoA of the sub-graph containing just one switch (shown in Figure 9(b)), the system can shut down one of the switches until the PoA reaches a predefined threshold. This not only saves a considerable amount of energy, it can also increase the lifetime of the components.
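The consolidation policy of Figure 9 can be sketched as a threshold test on the estimated PoA of the reduced configuration; the single-switch exponential lifetime model, failure rate, and threshold below are assumptions for illustration only:

```python
import math

LAM_SWITCH = 0.01      # assumed failure rate of one switch (per hour)
POA_THRESHOLD = 0.95   # assumed SLO-derived threshold

def poa_single_switch(t_hours):
    """PoA of the reduced sub-graph (one switch) over [0, t]."""
    return math.exp(-LAM_SWITCH * t_hours)

# Run on one switch while the estimated PoA stays above the threshold;
# power the standby switch back on as soon as it drops below.
t = 0.0
while poa_single_switch(t) >= POA_THRESHOLD:
    t += 1.0
print(f"re-enable the standby switch after ~{t:.0f} hours")
```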
C. Behavior-Time Profile Modeling and Analysis
The third paradigm works directly with the time profiles
of the components. These time profiles, which are collected
in an opportunistic way by some agents, are learned and
modeled using various machine learning and pattern recognition methods, such as Support Vector Machines (SVMs) [25],
among others. The advantage of this paradigm is that it works directly with the patterns, not with their statistical moments. Therefore, it can discover behaviors that may be missed by the other paradigms. A typical scenario for time-profile behavior analysis is shown in Figure 10. In this example, the CPU and memory resource usage of a 2-component system is shown. The BAU discovers a fault at 9:45 AM, when the memory usage of the second component increases, based on the models learned in the failure generating phase (section III), and the response of the CRU, based on this detection, prevents a failure at 10:00 AM. This not only prevents an instance of SLA violation, it also reduces the downtime by 30 minutes.
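A toy stand-in for this scenario, using a sliding-window z-score detector in place of a trained SVM (the sampled memory profile, window, and threshold are illustrative only):

```python
import statistics

def detect_anomalies(profile, window=6, z_threshold=3.0):
    """Flag indices whose sample deviates more than z_threshold standard
    deviations from the mean of the preceding `window` samples."""
    alarms = []
    for i in range(window, len(profile)):
        base = profile[i - window:i]
        mu, sigma = statistics.mean(base), statistics.pstdev(base)
        if sigma > 0 and abs(profile[i] - mu) / sigma > z_threshold:
            alarms.append(i)
    return alarms

# Memory usage (%) of the second component, sampled every 15 minutes;
# a leak-like jump starts at index 8 (the "9:45 AM" sample in this toy).
memory = [40, 41, 39, 40, 42, 41, 40, 41, 78, 85, 90, 95]
print(detect_anomalies(memory))  # flags the onset of the jump
```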
V. RELATED WORK
There is a huge literature on fault detection and fault avoidance [19], [26]. In [23], several probabilistic models, such as the Weibull and hyper-exponential distributions, were fitted to empirical data available in three datasets: CSIL, Condor and Long. They showed that the Weibull and exponential distributions are more accurate for modeling the behavior of resources. In [27], a data analysis was carried out on the system logs and reports of a cluster system. They used time-series analysis and also rule-based classification for predicting continuous and critical events. A similar analysis was carried out in [12], in which temporal and spatial correlations among failure events in coalition systems are used. In [14], an analysis of runtime data was performed to detect anomalous behaviors; for this purpose, an outlier analysis was used. In [11], Support Vector Machines (SVMs), random indexing and runtime system logs were used for predicting failures. In [4], an online predictor for mission-critical distributed systems was presented; the predictor was based on analysis of the network traffic. A modular fashion for integrating a cognitive operator into a system was presented in [6], and in [2], cognitive map modeling was used for failure analysis of system interaction failure modes. Finally, treatment learning, which enables tracking failures in large-scale systems back to the causing component or factor, is of great value [13]. This not only reduces the expenses and experts' time, it also reduces the chance of secondary failures related to human mistakes by the experts.
The main highlight of our approach compared to others is its multi-granularity nature, which comes from the various dimensions of the framework. In one dimension, multi-level analysis of the system (graphs) enables the framework to scale horizontally and vertically while avoiding the exponential computational and analysis costs associated with scaling. In another dimension, its multi-layer aspect provides a systematic and separable approach for analyzing the behavior of the non-hardware parts (software, virtualware, etc.). It can be argued that the main performance bottleneck of future systems lies in the errors and faults of their non-hardware parts. Finally, in a third dimension, the multi-paradigm approach of the framework paves the way for cognitive responding in a cross-cover manner.
VI. CONCLUSION AND FUTURE PROSPECTS
A multi-paradigm, multi-layer, and multi-level cognitive behavior analysis framework has been introduced. The framework uses probabilistic (statistical inference), simulated (statistical inference by means of simulation), and time-profile modeling and analysis in order to learn and model various behaviors of complex computing systems. Its multi-paradigm approach enables validation and cross-cover among the paradigms. The framework can perform at multiple granularities thanks to its multi-level and multi-layer approach. This facilitates i) systematic horizontal, vertical, and hierarchical scaling, ii) straightforward integration of non-physical parts (software, virtualware, etc.) in the analysis, and iii) increased system dependability, such as the Probability of Availability (PoA), achieved by smart, cross-covered cognitive responding.
Also, a new distribution, the tanh distribution, has been introduced, with promising results on a real database. The application of the framework to failure analysis and detection has been discussed in this work. The framework is specifically designed for application in open-source architectures, such as OpenStack3 and OpenGSN4, which will be considered as real-system examples in future work. Furthermore, more sophisticated distributions, such as asymmetrical tanh and spline distributions, will be introduced.
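The tanh distribution itself is defined earlier in the paper; purely as an illustration, assuming the common sigmoid form F(x) = (1 + tanh((x − μ)/s))/2 (an assumption here, not necessarily the paper's exact parameterization), sampling from it and recovering its parameters by maximum likelihood can be sketched as:

```python
import numpy as np

# Assumed form (illustrative): CDF F(x) = (1 + tanh((x - mu) / s)) / 2,
# whose density is f(x) = sech^2((x - mu) / s) / (2 s).
def tanh_pdf(x, mu, s):
    return 1.0 / (2.0 * s * np.cosh((x - mu) / s) ** 2)

def fit_tanh_mle(samples, grid_mu, grid_s):
    """Crude maximum-likelihood fit over a parameter grid; a real
    implementation would use a numerical optimizer instead."""
    best, best_ll = None, -np.inf
    for mu in grid_mu:
        for s in grid_s:
            ll = np.sum(np.log(tanh_pdf(samples, mu, s)))
            if ll > best_ll:
                best, best_ll = (mu, s), ll
    return best

rng = np.random.default_rng(1)
# Sample via the inverse CDF: x = mu + s * artanh(2u - 1), u ~ U(0, 1).
u = rng.uniform(size=5000)
data = 10.0 + 2.0 * np.arctanh(2.0 * u - 1.0)

mu_hat, s_hat = fit_tanh_mle(data,
                             np.linspace(8, 12, 41),
                             np.linspace(1, 3, 21))
print(mu_hat, s_hat)  # should recover mu close to 10, s close to 2
```

Fitting such a distribution to observed failure inter-arrival times, in the same way the Weibull and hyper-exponential models are fitted in [23], is the intended use within the probabilistic paradigm.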
3 http://www.openstack.org/
4 http://www.greenstarnetwork.com/
Fig. 10. A typical example of time-profile behavior analysis and its impact on the overall grade improvement.
ACKNOWLEDGMENTS
The authors thank the NSERC of Canada for their financial
support.
REFERENCES
[1] Ethem Alpaydin. Techniques for combining multiple learners. In
Proceedings of Engineering of Intelligent Systems, pages 6–12. ICSC
Press, 1998.
[2] Manu Augustine, Om Yadav, Rakesh Jain, and Ajay Rathore. Cognitive
map-based system modeling for identifying interaction failure modes.
Research in Engineering Design, 23(2):105–124, 2012.
[3] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts
and taxonomy of dependable and secure computing. IEEE Transactions
on Dependable and Secure Computing, 1(1):11–33, 2004.
[4] Roberto Baldoni et al. Online black-box failure prediction for
mission critical distributed systems. Technical Report 3/12 - 2012,
MIDLAB, 2012.
[5] Reinette Biggs et al. Toward principles for enhancing the resilience
of ecosystem services. Annual Review of Environment and Resources,
37(1):null, 2012.
[6] Sven Burmester et al. Tool support for the design of self-optimizing
mechatronic multi-agent systems. STTT, 10(3):207–222, 2008.
[7] E. Martins, C.-M.F. Rubira, and N.-G.M. Leme. Jaca: A reflective fault
injection tool based on patterns. In DSN’02, pages 483–487, Maryland,
USA, 23–26 June 2002.
[8] Malin Falkenmark. Governance as a Trialogue: Government-Society-Science in Transition, chapter Good Ecosystem Governance: Balancing Ecosystems and Social Needs, pages 59–76. Water Resources Development and Management. Springer Berlin Heidelberg, 2007.
[9] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Carbon metering and effective tax cost modeling for
virtual machines. In CLOUD’12, pages 758–763, Honolulu, Hawaii,
USA, June 2012.
[10] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Multi-level grouping genetic algorithm for low carbon
virtual private clouds. In CLOSER’12, pages 315–324, Porto, Portugal,
April 18–21 2012.
[11] Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena
Vlasenko. Failure prediction based on log files using random indexing
and support vector machines. Journal of Systems and Software, In
Press(0):–, 2012.
[12] Song Fu and Cheng-Zhong Xu. Exploring event correlation for failure
prediction in coalitions of clusters. In SC’07, pages 1–12, Reno, Nevada,
2007. ACM.
[13] Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet.
Automatically finding the control variables for complex system behavior.
Automated Software Engineering, 17(4):439–468, 2010.
[14] Qiang Guan and Song Fu. auto-AID: A data mining framework for
autonomic anomaly identification in networked computer systems. In
IPCCC’10, pages 73–80, 2010.
[15] Mei-Chen Hsueh, T.K. Tsai, and R.K. Iyer. Fault injection techniques
and tools. Computer, 30(4):75 –82, Apr 1997.
[16] Bing Huang, M. Rodriguez, Ming Li, J.B. Bernstein, and C.S. Smidts.
Hardware error likelihood induced by the operation of software. IEEE
Transactions on Reliability, 60(3):622–639, 2011.
[17] K. Keahey, M. Tsugawa, A. Matsunaga, and J. Fortes. Sky computing.
IEEE Internet Computing, 13(5):43–51, 2009.
[18] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The
failure trace archive: Enabling comparative analysis of failures in diverse
distributed systems. In CCGrid’10, pages 398–407, 2010.
[19] Antonios Litke, Dimitrios Skoutas, Konstantinos Tserpes, and Theodora
Varvarigou. Efficient task replication and management for adaptive fault
tolerance in mobile grid environments. Future Generation Computer
Systems, 23(2):163–178, February 2007.
[20] Daniele Miorandi, Sabrina Sicari, Francesco De Pellegrini, and Imrich
Chlamtac. Internet of things: Vision, applications and research challenges. Ad Hoc Networks, 10(7):1497–1516, September 2012.
[21] N. Rodrigues, D. Sousa, and L.M. Silva. A fault-injector tool to evaluate
failure detectors in grid-services. In CoreGRID’07, pages 261–271,
Heraklion, Crete, Greece, 12–13 June 2007.
[22] R. Natella, D. Cotroneo, J. Duraes, and H. Madeira. On fault representativeness of software fault injection. IEEE Transactions on Software
Engineering, Accepted(99):–, 2012.
[23] Daniel Nurmi, John Brevik, and Rich Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments.
In Jos Cunha and Pedro Medeiros, editors, Lecture Notes in Computer
Science (Euro-Par 2005 Parallel Processing), volume 3648, pages 612–
612. Springer, 2005.
[24] Fabio Oliveira et al. Barricade: defending systems against operator
mistakes. In EuroSys’10, pages 83–96, Paris, France, 2010. ACM.
[25] Juan José Rodríguez, Carlos J. Alonso, and José A. Maestro. Support vector machines of interval-based features for time series classification. Knowledge-Based Systems, 18(4–5):171–178, August 2005.
[26] Brent Rood and Michael Lewis. Grid resource availability prediction-based scheduling and task replication. Journal of Grid Computing,
7(4):479–500, 2009.
[27] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma,
R. Vilalta, and A. Sivasubramaniam. Critical event prediction for
proactive management in large-scale computer clusters. In KDD’03,
pages 426–435, Washington, D.C., 2003. ACM.
[28] B. Schroeder and G.A. Gibson. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and
Secure Computing, 7(4):337–350, 2010.
[29] A.P. Snow and G.R. Weckman. What are the chances an availability
SLA will be violated? In ICN’07, pages 35–35, Martinique, 22-28 April
2007.
[30] The Climate Group. SMART 2020: Enabling the low carbon economy
in the information age. Technical report, the Global eSustainability
Initiative (GeSI), 2008.
[31] Annemijn Van Gorp and Catherine A. Middleton. Fiber to the home
unbundling and retail competition: Developments in the netherlands.
Communications and Strategies, 78(2):87–106, June 2010.
[32] Keun Soo Yim, Z. Kalbarczyk, and R.K. Iyer. Measurement-based
analysis of fault and error sensitivities of dynamic memory. In DSN’10,
pages 431–436, Chicago, IL, USA, June 28–July 1, 2010.