Delft University of Technology
Parallel and Distributed Systems Report Series

Analysis and Modeling of Time-Correlated Failures
in Large-Scale Distributed Systems

Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo,
Alexandru Iosup, and Dick Epema

The Failure Trace Archive
Web: fta.inria.fr
Email: [email protected]

Report number PDS-2010-004
ISSN 1387-2109
Published and produced by:
Parallel and Distributed Systems Section
Faculty of Information Technology and Systems Department of Technical Mathematics and Informatics
Delft University of Technology
Zuidplantsoen 4
2628 BZ Delft
The Netherlands
Information about Parallel and Distributed Systems Report Series:
[email protected]
Information about Parallel and Distributed Systems Section:
http://pds.twi.tudelft.nl/
© 2010 Parallel and Distributed Systems Section, Faculty of Information Technology and Systems, Department of Technical Mathematics and Informatics, Delft University of Technology. All rights reserved. No part of this series may be reproduced in any form or by any means without prior written permission of the publisher.
Abstract
The analysis and modeling of the failures bound to occur in today’s large-scale production systems is
invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many
previous studies have modeled failures without taking into account the time-varying behavior of failures,
under the assumption that failures are independently and identically distributed. However, the presence of time correlations between failures (such as peak periods with an increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, a proactive fault-tolerance mechanism is more effective if failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids.
We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong
daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods
occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak
inter-arrival time, the inter-arrival time of failures during the peaks, and the duration of failures during
peaks; we determine for each the best-fitting probability distribution from a set of several candidate distributions, and present the parameters of the (best) fit. Last, we validate our model against the nineteen real
failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to
95% of the downtime of these systems.
http://www.st.ewi.tudelft.nl/∼nezih/
Contents

1 Introduction
2 Method
  2.1 Failure Datasets
  2.2 Analysis
  2.3 Modeling
3 Analysis of Autocorrelation
  3.1 Failure Autocorrelations in the Traces
  3.2 Discussion
4 Modeling the Peaks of Failures
  4.1 Peak Periods Model
  4.2 Results
5 Related Work
6 Conclusion
List of Figures

1 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions.
2 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
3 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
4 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
5 Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms.
6 Parameters of the peak periods model. The numbers in the figure match the (numbered) model parameters in the text.
7 Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter. None of the distributions provide a good fit due to a peak at 1h.
List of Tables

1 Summary of nineteen data sets in the Failure Trace Archive.
2 Average values for the model parameters.
3 Empirical distribution for the peak duration parameter.
4 Peak model: The parameter values for the best-fitting distributions for all studied systems.
5 The average duration and average IAT of failures for the entire traces and for the peaks.
6 Fraction of downtime and fraction of number of failures due to failures that originate during peaks (k = 1).
1 Introduction
Large-scale distributed systems have reached an unprecedented scale and complexity in recent years. At this
scale failures inevitably occur—networks fail, disks crash, packets get lost, bits get flipped, software misbehaves,
or systems crash due to misconfiguration and other human errors. Deadline-driven or mission-critical services
are part of the typical workload for these infrastructures, which thus need to be available and reliable despite
the presence of failures. Researchers and system designers have already built numerous fault-tolerance mechanisms that have been proven to work under various assumptions about the occurrence and duration of failures.
However, most previous work focuses on failure models that assume the failures to be non-correlated, but this
may not be realistic for the failures occurring in large-scale distributed systems. For example, such systems
may exhibit peak failure periods, during which the failure rate increases, affecting in turn the performance of
fault-tolerance solutions. To investigate such time correlations, we perform in this work a detailed investigation of the time-varying behavior of failures using nineteen traces obtained from several large-scale distributed
systems including grids, P2P systems, DNS servers, web servers, and desktop grids.
Recent studies report that in production systems, failure rates can exceed 1,000 failures per year and, depending on the root cause of the corresponding problems, the mean time to repair can range from hours
to days [16]. The increasing scale of deployed distributed systems causes the failure rates to increase, which
in turn can have a significant impact on the performance and cost, such as degraded response times [21] and
increased Total Cost of Operation (TCO) due to increased administration costs and human resource needs [1].
This situation also motivates the need for further research in failure characterization and modeling.
Previous studies [18,13,12,14,20,16] focused on characterizing failures in several different distributed systems.
However, most of these studies assume that failures occur independently or disregard the time correlation of
failures, despite the practical importance of these correlations [11, 19, 15]. First of all, understanding if failures
are time correlated has significant implications for proactive fault tolerance solutions. Second, understanding
the time-varying behavior of failures and peaks observed in failure patterns is required for evaluating design
decisions. For example, redundant submissions may all fail during a failure peak period, regardless of the quality
of the resubmission strategy. Third, understanding the temporal correlations and exploiting them for smart
checkpointing and scheduling decisions provides new opportunities for enhancing conventional fault-tolerance
mechanisms [21,7]. For example, a simple scheduling policy could be to stop scheduling large parallel jobs during
failure peaks. Finally, it is possible to devise adaptive fault-tolerance mechanisms that adjust the policies based
on the information related to peaks. For example, an adaptive fault-tolerance mechanism can migrate the
computation at the beginning of a predicted peak.
To understand the time-varying behavior of failures in large-scale distributed systems, we perform a detailed
investigation using data sets from diverse large-scale distributed systems including more than 100K hosts and
1.2M failure events spanning over 15 years of system operation. Our main contribution is threefold:
1. We make four new failure traces publicly available through the Failure Trace Archive (Section 2).
2. We present a detailed evaluation of the time correlation of failure events observed in traces taken from
nineteen (production) distributed systems (Section 3).
3. We propose a model for peaks observed in the failure rate process (Section 4).
2 Method

2.1 Failure Datasets
In this work we use and contribute to the data sets in the Failure Trace Archive (FTA) [9]. The FTA is an
online public repository of availability traces taken from diverse parallel and distributed systems.
The FTA makes failure traces available online in a unified format, which records the occurrence time and
duration of resource failures as an alternating time series of availability and unavailability intervals. Each
Table 1: Summary of nineteen data sets in the Failure Trace Archive.

System          Type          Nodes    Period     Year       Events
Grid'5000       Grid          1,288    1.5 years  2005-2006  588,463
COND-CAE (1)    Grid          686      35 days    2006       7,899
COND-CS (1)     Grid          725      35 days    2006       4,543
COND-GLOW (1)   Grid          715      33 days    2006       1,001
TeraGrid        Grid          1,001    8 months   2006-2007  1,999
LRI             Desktop Grid  237      10 days    2005       1,792
Deug            Desktop Grid  573      9 days     2005       33,060
Notre-Dame (2)  Desktop Grid  700      6 months   2007       300,241
Notre-Dame (3)  Desktop Grid  700      6 months   2007       268,202
Microsoft       Desktop Grid  51,663   35 days    1999       1,019,765
UCB             Desktop Grid  80       11 days    1994       21,505
PlanetLab       P2P           200-400  1.5 years  2004-2005  49,164
Overnet         P2P           3,000    2 weeks    2003       68,892
Skype           P2P           4,000    1 month    2005       56,353
Websites        Web servers   129      8 months   2001-2002  95,557
LDNS            DNS servers   62,201   2 weeks    2004       384,991
SDSC            HPC Clusters  207      12 days    2003       6,882
LANL            HPC Clusters  4,750    9 years    1996-2005  43,325
PNNL            HPC Cluster   1,005    4 years    2003-2007  4,650

(1) The COND-* data sets denote the Condor data sets.
(2) This is the host availability version, which is according to the multi-state availability model of Brent Rood.
(3) This is the CPU availability version.
availability or unavailability event in a trace records the start and the end of the event, and the resource that
was affected by the event. Depending on the trace, the resource affected by the event can be either a node of a
distributed system such as a node in a grid, or a component of a node in a system such as CPU or memory.
Prior to the work leading to this article, the FTA made fifteen failure traces available in its standard
format; as a result of our work, the FTA now makes available nineteen failure traces. Table 1 summarizes
the characteristics of these nineteen traces, which we use throughout this work. The traces originate from
systems of different types (multi-cluster grids, desktop grids, peer-to-peer systems, DNS and web servers) and
sizes (from hundreds to tens of thousands of resources), which makes these traces ideal for a study among
different distributed systems. Furthermore, many of the traces cover several months of system operation. A
more detailed description of each trace is available on the FTA web site (http://fta.inria.fr).
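The alternating availability/unavailability format described above maps naturally onto a small event record. The following sketch shows how failure durations and inter-arrival times can be derived from such events; the field names are illustrative and not the exact FTA schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One FTA-style record: an availability or unavailability interval
    of a resource (field names are illustrative, not the exact FTA schema)."""
    resource: str
    start: int       # seconds since trace start
    end: int
    available: bool

def failure_durations_and_iats(events):
    """Durations of unavailability intervals, and inter-arrival times
    between consecutive failure starts across the whole trace."""
    failures = sorted((e for e in events if not e.available),
                      key=lambda e: e.start)
    durations = [e.end - e.start for e in failures]
    iats = [b.start - a.start for a, b in zip(failures, failures[1:])]
    return durations, iats

# Hypothetical two-node trace: node1 fails at t=100 for 60s,
# node2 fails at t=250 for 60s.
events = [
    Event("node1", 0, 100, True), Event("node1", 100, 160, False),
    Event("node1", 160, 400, True), Event("node2", 250, 310, False),
]
durations, iats = failure_durations_and_iats(events)
# durations == [60, 60]; iats == [150]
```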
2.2 Analysis

In our analysis, we use the autocorrelation function (ACF) to measure the degree of correlation of the failure time series data with itself at different time lags. The ACF takes on values between -1 (high negative correlation) and 1 (high positive correlation). In addition, the ACF reveals whether the failures are random or periodic: for random data the correlation coefficients are close to zero, while a periodic component in the ACF reveals that the failure data is periodic, or at least has a periodic component.
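A minimal sketch of this analysis, assuming an hourly failure-rate series; the synthetic daily pattern is illustration data, not a trace from the archive:

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation of a failure-rate time series.

    Returns coefficients in [-1, 1] for lags 1..max_lag; values near
    zero suggest random failures, while large coefficients at multiples
    of 24 (for hourly data) indicate a daily periodic component.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)  # lag-0 autocovariance times n (normalizer)
    return [np.dot(x[:-lag], x[lag:]) / var for lag in range(1, max_lag + 1)]

# Hourly failure counts with a synthetic daily pattern (illustration only).
rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
rate = 10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
coeffs = acf(rate, max_lag=48)
# For this periodic series the lag-24 coefficient is close to 1.
```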
2.3 Modeling

In the modeling phase, we statistically model the peaks observed in the failure rate process, i.e., the number of failure events per time unit. Towards this end we use the Maximum Likelihood Estimation (MLE) method [2] for fitting the probability distributions to the empirical data, as it delivers good accuracy for the large data
samples specific to failure traces. After we determine the best fit for each candidate distribution for all data sets, we perform goodness-of-fit tests to assess the quality of the fit for each distribution, and to establish the best fit. As goodness-of-fit tests, we use both the Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) tests, which essentially assess how close the cumulative distribution function (CDF) of the probability distribution is to the CDF of the empirical data. For each candidate distribution with the parameters found during the fitting process, we formulate the hypothesis that the empirical data are derived from it (the null-hypothesis of the goodness-of-fit test). Neither the KS nor the AD test can confirm the null-hypothesis, but both are useful in understanding the goodness of fit. For example, the KS test provides a test statistic, D, which characterizes the maximal distance between the CDF of the empirical distribution of the input data and that of the fitted distribution; distributions with a lower D value across different failure traces are better. Similarly, the tests return p-values, which are used either to reject the null-hypothesis if the p-value is smaller than or equal to the significance level, or to confirm that the observation is consistent with the null-hypothesis if the p-value is greater than the significance level. Consistent with the standard method for computing p-values [12, 9], we average 1,000 p-values, each of which is computed by selecting 30 samples randomly from the data set, to calculate the final p-value for the goodness-of-fit tests.
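The fitting and p-value-averaging procedure above can be sketched with SciPy (an assumption; the report does not name its tooling). The key point is that a single KS test on a very large sample rejects almost any model, so small random subsamples are tested repeatedly and the p-values averaged:

```python
import numpy as np
from scipy import stats

def averaged_ks_pvalue(data, dist, params, n_rounds=1000, n_samples=30, seed=1):
    """Average KS p-values over many small random subsamples,
    following the subsampling procedure described in the text."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(n_rounds):
        sample = rng.choice(data, size=n_samples, replace=False)
        pvals.append(stats.kstest(sample, dist.name, args=params).pvalue)
    return float(np.mean(pvals))

# Illustration: MLE-fit a Weibull to synthetic inter-arrival times.
rng = np.random.default_rng(0)
data = stats.weibull_min.rvs(0.7, scale=3600, size=5000, random_state=rng)
params = stats.weibull_min.fit(data, floc=0)   # MLE fit, location fixed at 0
p = averaged_ks_pvalue(data, stats.weibull_min, params)
# A large averaged p-value (well above the significance level) is
# consistent with the null-hypothesis that the data come from the fit.
```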
3 Analysis of Autocorrelation

In this section we present the autocorrelations in failures using traces obtained from grids, desktop grids, P2P systems, web servers, DNS servers, and HPC clusters. We consider the failure rate process, that is, the number of failure events per time unit.
3.1 Failure Autocorrelations in the Traces
Our aim is to investigate whether the occurrence of failures is repetitive in our data sets. Toward this end,
we compute the autocorrelation of the failure rate for different time lags including hours, weeks, and months.
Figure 1 shows for several platforms the failure rate at different time granularities, and the corresponding
autocorrelation functions. Overall, we find that many systems exhibit strong correlation from small to moderate
time lags confirming that failures are indeed repetitive in many of the systems.
Many of the systems investigated in this work exhibit strong autocorrelation for hourly and weekly lags.
Figures 1(a)-1(f), 2(a)-2(f), and 3(a)-3(c) show the failure rates and autocorrelation functions for the Grid'5000, Condor (CAE), LRI, Deug, UCB, Notre-Dame (CPU), Microsoft, PlanetLab, Overnet, Skype, Websites, LDNS, SDSC, and LANL systems, respectively.
The Grid’5000 data set is a one and a half year long trace collected from an academic research grid. Since
this system is mostly used for experimental purposes and is large-scale (∼3K processors), the failure rate is quite high. In addition, since most of the jobs are submitted through the OAR resource manager, and hence without direct user interaction, the daily pattern is not clearly observable. However, during the summer the failure rate
decreases, which indicates a correlation between the system load and the failure rate. Finally, as the system
size increases over the years, the failure rate does not increase significantly, which indicates system stability.
The Condor (CAE) data set is a one month long trace collected from a desktop grid using the Condor cycle-stealing scheduler. As expected from a desktop grid, this trace exhibits daily peaks in the failure rate, and hence in the autocorrelation function. In contrast to other desktop grids, the failure rate is lower. The LRI data set is a ten day long trace collected from a desktop grid. The main trend of this trace is caused by users who power their machines off during weekends. The Deug and UCB data sets are nine and eleven day long traces collected from desktop grids. In contrast to LRI, machines in these systems are shut down every night.
This difference is captured well by autocorrelation functions: LRI has moderate peaks for lags corresponding
to one or two days, while a clear daily pattern is visible in Deug and UCB systems. The Notre-Dame (CPU)
[Figure 1: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions. Panels: (a) Grid'5000, (b) Condor (CAE), (c) LRI, (d) Deug, (e) UCB, (f) Notre-Dame (CPU); each panel plots the failure rate (per hour or per week) over time and the ACF versus lag.]
data set is a six month long trace collected from a desktop grid. This trace models the CPU availability of
the nodes in the system. Similar to other desktop grids, Notre-Dame (CPU) presents clear daily and weekly
patterns, with high autocorrelation for short time lags. The Microsoft data set is a thirty-five day long trace collected from 51,663 desktop machines of Microsoft employees during summer 1999. It shows similar behavior to other desktop grids, with high autocorrelation for short time lags, and a clear daily pattern with few failures during the weekends. We observe a correlation between the workload intensity and the failure rate in this data set, as the failure rate is higher on work days than on non-work days. The PlanetLab data set is a one and a half year long trace collected from a platform of 692 nodes distributed throughout several academic institutions around the world. Failures are highly correlated for small time lags, but this correlation is not stable over time; the average availability changes on large time scales. Moreover, the user presence, and hence the increased load, has an impact on the correlation: the failure rate strongly decreases during the summer. The Overnet data set is a one week long trace collected from a P2P system of 3,000 clients. In such a P2P system the clients may frequently join or leave the network, and all clients which are not connected on the initialization
[Figure 2: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) Microsoft, (b) PlanetLab, (c) Overnet, (d) Skype, (e) Websites (hourly), (f) Websites (weekly).]
[Figure 3: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) LDNS, (b) SDSC, (c) LANL failure rates (for 2002, 2004, and the whole trace), (d) LANL autocorrelation functions (for 2002, 2004, and the whole trace).]
[Figure 4: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) TeraGrid, (b) Notre-Dame, (c) PNNL (hourly).]
are considered as down, explaining the initial peak of failures. We observe strong autocorrelation for short time lags, similar to other P2P systems. The Skype data set is a one month long trace collected from a P2P system used by 4,000 clients. Similar to Overnet, clients of Skype may also join or leave the system, and clients that are not online are considered as unavailable in this trace. Similar to desktop grids, there is high autocorrelation at small time lags, and the daily and weekly peaks are more pronounced. The Websites data set is a seven month long trace collected from web servers. While the average availability of the web servers remains almost stable (around 80%), there are a few peaks in the failure rate, probably due to network problems. In addition, we do not observe clear hourly and weekly patterns. The LDNS data set is a two week long trace collected from DNS servers. Unlike P2P systems and desktop grids, DNS servers do not exhibit strong autocorrelation for short time lags with periodic behavior. In addition, as the workload intensity increases during the peak hours of the day, we observe that the failure rate also increases. The SDSC data set is a twelve day long trace collected from the San Diego Supercomputer Center (HPC clusters). Similar to other HPC clusters and grids, we observe strong correlation for small to moderate time lags. Finally, the LANL data set is a ten year long trace collected from production HPC clusters. The weekly failure rate is quite low compared to Grid'5000. We do not observe a clear yearly pattern, as the failure rate increases during summer 2002, whereas it decreases during summer 2004. Since around 3,000 nodes were added to the LANL system between 2002 and 2003, the failure rate also increases correspondingly.
Last, a few systems exhibit weak autocorrelation in failure occurrence. Figures 4(a), 4(b), and 4(c) show the failure rate and the corresponding autocorrelation function for the TeraGrid, Notre-Dame, and PNNL systems. The TeraGrid data set is an eight month long trace collected from an HPC cluster that is part of a grid. We observe weak autocorrelation at all time lags, which implies that the failure rates observed over time are independent. In addition, there are no clear hourly or daily patterns, which gives evidence of an erratic occurrence of failures in this system. The Notre-Dame data set is a six month long trace collected
from a desktop grid. The failure events in this data set consist of the availability/unavailability events of the hosts in this system. Similar to other desktop grids, we observe clear daily and weekly patterns. However, the autocorrelation is low when compared to other desktop grids. Finally, the PNNL data set is a four year long trace collected from HPC clusters. We observe a correlation between the intensity of the workload and the hourly failure rate, since the failure rate decreases during the summer. The weak autocorrelation indicates independent failures, as expected from HPC clusters.
3.2 Discussion
As we have shown in the previous section, many systems exhibit strong correlation from small to moderate time lags, which probably indicates a high degree of predictability. In contrast, a small number of systems (Notre-Dame, PNNL, and TeraGrid) exhibit weak autocorrelation; only for these systems are the failure rates observed over time independent.

We have found that similar systems have similar time-varying behavior, e.g., desktop grids and P2P systems have daily and weekly periodic failure rates, and these systems exhibit strong temporal correlation at hourly time lags. Some systems (Notre-Dame and Condor (CAE)) have direct user interaction, which produces clear daily and weekly patterns in both the system load and the occurrence of failures: the failure rate increases during work hours and days, and decreases during free days and holidays (the summer).

Finally, not all systems exhibit a correlation between work hours and days, and the failure rate. In the examples depicted in Figure 5, while Grid'5000 exhibits this correlation, PlanetLab exhibits irregular, erratic hourly and daily failure behavior.

Our results are consistent with previous studies [14, 4, 8, 17]: in many traces we observe strong autocorrelation at small time lags, as well as a correlation between the intensity of the workload and the failure rate.
4 Modeling the Peaks of Failures

In this section we present a model for the peaks observed in the failure rate process in diverse large-scale distributed systems.

4.1 Peak Periods Model
Our model of peak failure periods comprises four parameters as shown in Figure 6: the peak duration, the time
between peaks (inter-peak time), the inter-arrival time of failures during peaks, and the duration of failures
during peaks:
1. Peak Duration: The duration of peaks observed in a data set.
2. Time Between Peaks (inter-peak time): The time from the end of a previous peak to the start of
the next peak.
3. Inter-arrival Time of Failures During Peaks: The inter-arrival time of failure events that occur
during peaks.
4. Failure Duration During Peaks: The duration of failure events that start during peaks. These failure
events may last longer than a peak.
Our modeling process is based on analyzing the failure traces taken from real distributed systems in two
steps which we describe in turn.
[Figure 5: Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms. Panels: (a) Overnet, (b) Grid'5000, (c) Notre-Dame (CPU), (d) PlanetLab; each panel plots the number of failures by day of week and by hour of day.]
[Figure 6: Parameters of the peak periods model. The numbers in the figure match the (numbered) model parameters in the text.]
The first step is to identify, for each trace, the peaks of the hourly failure rate. Since there is no rigorous
mathematical definition of a peak in a time series, to identify the peaks we define a threshold value µ + kσ,
where µ is the average and σ is the standard deviation of the failure rate, and k is a positive constant; a period
with a failure rate above the threshold is a peak period. We adopt this threshold to achieve a good balance
between capturing extreme system behavior in the model and characterizing an important part of the system
failures (either the number of failures or the downtime they cause). A threshold excluding all but a few
periods, for example one defining peak periods as distributional outliers, may capture too few periods and
explain only a small fraction of the system failures. A more inclusive threshold would lead to the inclusion of
more failures, but the data may then come from periods with very different characteristics, which is contrary
to the goal of building a model for peak failure periods.
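As an illustration, the threshold-based peak detection described above can be sketched as follows. This is a minimal sketch with a synthetic toy trace; the function name, the trace, and the windowing details are ours, not code from the study.

```python
import numpy as np

def find_peak_periods(failure_times, window=3600.0, k=1.0):
    """Identify peak periods in a failure trace.

    failure_times: failure event timestamps in seconds since trace start.
    A fixed one-hour window is a peak period when its failure rate exceeds
    the threshold mu + k*sigma, where mu and sigma are the mean and
    standard deviation of the per-window failure rate.
    """
    failure_times = np.asarray(failure_times, dtype=float)
    n_windows = int(failure_times.max() // window) + 1
    # Hourly failure rate: number of failures per one-hour window.
    rate, _ = np.histogram(failure_times, bins=n_windows,
                           range=(0.0, n_windows * window))
    threshold = rate.mean() + k * rate.std()
    peak_windows = np.flatnonzero(rate > threshold)
    return rate, threshold, peak_windows

# Toy trace: a sparse background (2 failures/hour for 10 hours)
# plus a burst of 60 failures in hour 5.
trace = np.concatenate([np.arange(0, 10 * 3600, 1800.0),
                        5 * 3600 + np.arange(0, 3600, 60.0)])
rate, thr, peaks = find_peak_periods(trace, k=1.0)
print(peaks)  # hour 5 stands out as the only peak period
```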
In the second step we extract the model parameters from the data sets using the peaks identified in the
first step. We then try to find a good fit, that is, a well-known probability distribution and the parameters
that lead to the best match between that distribution and the empirical data. When selecting the probability
distributions, we consider their degrees of freedom (number of parameters). Although a distribution with more
degrees of freedom may fit the data better, it can result in a complex model that is difficult to analyze
mathematically. In this study we fit five probability distributions to the empirical data: the exponential,
Weibull, Pareto, lognormal, and gamma distributions. For the modeling process, we follow the methodology
described in Section 2.3.
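The fitting step can be sketched with SciPy as follows. This is a minimal illustration, not the study's actual pipeline: we use only the KS statistic D for ranking (the study also uses the AD test), pin the location parameter to zero since the model parameters are non-negative durations and intervals, and fit synthetic data rather than the real traces.

```python
import numpy as np
from scipy import stats

# The five candidate distributions used in the study, fitted by maximum
# likelihood with the location parameter pinned to zero.
CANDIDATES = {
    "exponential": stats.expon,
    "weibull": stats.weibull_min,
    "pareto": stats.pareto,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

def best_fit(samples):
    """Fit each candidate and return the one with the smallest KS statistic D."""
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(samples, floc=0)
        d_stat, p_value = stats.kstest(samples, dist.cdf, args=params)
        results[name] = (d_stat, p_value, params)
    return min(results.items(), key=lambda kv: kv[1][0])

rng = np.random.default_rng(0)
# Synthetic "failure durations" drawn from a lognormal distribution.
samples = rng.lognormal(mean=8.0, sigma=1.5, size=2000)
name, (d, p, params) = best_fit(samples)
print(name, round(d, 3))
```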
4.2 Results
After applying the modeling methodology described in the previous section and in Section 2.3, we now
present the peak model derived from our diverse set of large-scale distributed systems.
Table 2 shows the average values for all the model parameters for all platforms. The average peak duration
Table 2: Average values for the model parameters.

System              Avg. Peak      Avg. Failure IAT    Avg. Time            Avg. Failure Duration
                    Duration [s]   During Peaks [s]    Between Peaks [s]    During Peaks [s]
Grid’5000           5,047          13                  55,101               20,984
Condor (CAE)        5,287          23                  87,561               4,397
Condor (CS)         3,927          4                   241,920              20,740
Condor (GLOW)       4,200          14                  329,040              75,672
TeraGrid            3,680          35                  526,500              368,903
LRI                 4,080          78                  58,371               31,931
Deug                14,914         10                  103,800              1,091
Notre-Dame          3,942          21                  257,922              280,593
Notre-Dame (CPU)    7,520          33                  47,075               22,091
Microsoft           7,200          0                   75,315               90,116
UCB                 21,272         23                  77,040               332
PlanetLab           4,810          264                 47,124               166,913
Overnet             3,600          1                   14,400               382,225
Skype               4,254          11                  112,971              26,402
Websites            5,211          103                 104,400              3,476
LDNS                4,841          8                   42,042               30,212
SDSC                4,984          26                  84,900               6,114
LANL                4,122          653                 94,968               21,193
varies across different systems, and even across systems of the same type. For example, UCB, Microsoft,
and Deug are all desktop grids, but the average peak duration varies widely among these platforms. In
contrast, for the SDSC, LANL, and PNNL platforms, which are HPC clusters, the average peak duration
values are relatively close. The Deug and UCB platforms have a small number of long peak durations,
resulting in higher average peak durations than the other platforms. Finally, as there are two peaks of zero
length (a single data point) in the Overnet system, the average peak duration is zero.
The average inter-arrival time during peaks is rather low, as expected, since the failure rates are higher
during peaks than during off-peak periods. For the Microsoft platform, as all failures arrive in bursts during
peaks, the average inter-arrival time during peaks is zero.
Similar to the average peak duration, the average time between peaks is also highly variable across
systems. For some systems, such as TeraGrid, this parameter is on the order of days; for others, such as
Overnet, it is on the order of hours.
Similarly, the duration of failures during peaks varies highly even across similar platforms. For example,
the difference in the average failure duration during peaks between the UCB and Microsoft platforms,
which are both desktop grids, is huge, because the machines in the UCB platform leave the system less often
than the machines in the Microsoft platform. In addition, in some platforms, such as Overnet and TeraGrid,
the average failure duration during peaks is on the order of days, showing the impact of space-correlated
failures, that is, of multiple nodes failing nearly simultaneously.
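Extracting the four model parameters from a trace with already-identified peak windows can be sketched as follows. The data layout is our assumption: failures as (start time, duration) pairs and peaks as sorted one-hour window indices; the function is illustrative, not the study's code.

```python
import numpy as np

def peak_model_parameters(failures, peak_hours, window=3600.0):
    """Extract the four peak-model parameters from a trace.

    failures: list of (start_time, duration) pairs, in seconds.
    peak_hours: sorted indices of one-hour windows flagged as peaks.
    Returns peak durations, failure inter-arrival times during peaks,
    times between consecutive peaks, and failure durations during peaks.
    """
    # Merge runs of consecutive peak hours into peak periods [start, end).
    periods = []
    run_start = prev = peak_hours[0]
    for h in peak_hours[1:]:
        if h != prev + 1:
            periods.append((run_start * window, (prev + 1) * window))
            run_start = h
        prev = h
    periods.append((run_start * window, (prev + 1) * window))

    peak_durations = [end - start for start, end in periods]
    time_between_peaks = [periods[i + 1][0] - periods[i][1]
                          for i in range(len(periods) - 1)]

    iats, durations = [], []
    for start, end in periods:
        in_peak = sorted(t for t, _ in failures if start <= t < end)
        iats.extend(np.diff(in_peak))          # inter-arrival times in this peak
        durations.extend(d for t, d in failures if start <= t < end)
    return peak_durations, iats, time_between_peaks, durations
```

For example, peak hours [2, 3, 7] yield two peak periods, a 2-hour one and a 1-hour one, separated by 10,800 seconds.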
Using the AD and KS tests, we next determine the best fitting distributions for each model parameter and
each system. Since we determine the hourly failure rates using fixed time windows of one hour, the peak
duration and the inter-peak time are multiples of one hour. In addition, as the peak duration is mostly in
the range [1h, 5h], and for several systems it is mostly 1h, causing the empirical distribution to have a peak
at 1h, none of the distributions provides a good fit for the peak duration parameter (see Figure 7).
Therefore, for the peak duration parameter, we only present an empirical histogram in Table 3. We find
Table 3: Empirical distribution for the peak duration parameter. h denotes hours.

Platform            1h         2h        3h        4h       5h       6h       ≥ 7h
Grid’5000           80.56 %    13.53 %   3.38 %    1.33 %   0.24 %   0.12 %   0.84 %
Condor (CAE)        93.75 %    3.13 %    –         –        –        –        3.12 %
Condor (CS)         90.91 %    9.09 %    –         –        –        –        –
Condor (GLOW)       83.33 %    16.67 %   –         –        –        –        –
TeraGrid            97.78 %    2.22 %    –         –        –        –        –
LRI                 86.67 %    13.33 %   –         –        –        –        –
Deug                28.57 %    –         28.57 %   –        14.29 %  –        28.57 %
Notre-Dame          90.48 %    9.52 %    –         –        –        –        –
Notre-Dame (CPU)    56.83 %    17.78 %   9.84 %    3.49 %   5.40 %   3.49 %   3.17 %
Microsoft           35.90 %    33.33 %   25.64 %   5.13 %   –        –        –
UCB                 9.09 %     9.09 %    –         –        –        9.09 %   72.73 %
PlanetLab           80.17 %    13.36 %   3.71 %    1.27 %   0.53 %   0.53 %   0.43 %
Overnet             100.00 %   –         –         –        –        –        –
Skype               90.91 %    4.55 %    –         4.55 %   –        –        –
Websites            76.74 %    13.95 %   5.23 %    2.33 %   0.58 %   –        1.16 %
LDNS                75.86 %    13.79 %   10.34 %   –        –        –        –
SDSC                69.23 %    23.08 %   7.69 %    –        –        –        –
LANL                88.35 %    9.24 %    2.06 %    0.25 %   0.06 %   0.03 %   –
PNNL                85.99 %    10.35 %   1.75 %    0.96 %   0.64 %   0.16 %   0.16 %
Avg                 74.79 %    11.30 %   5.16 %    1.01 %   1.14 %   0.70 %   5.85 %
that the peak duration for almost all platforms is less than 3h.
Table 4 shows the best fitting distributions for the model parameters for all data sets investigated in this
study. To generate synthetic yet realistic traces without using a single system as a reference, we create an
average system model, which has the average characteristics of all the systems we investigate. We create the
average system model as follows. First, for each system we determine the candidate distributions for a model
parameter as the distributions with the smallest D values. Then, for each model parameter, we select among
the candidates the distribution with the lowest average D value over all data sets. After determining the
best fitting distribution for the average system model, we fit each data set independently to this distribution
to find a set of best-fit parameters. The parameters of the average system model, shown in the "Avg." row,
are the averages of this set of parameters.
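The two-stage selection of the average-system distribution can be sketched as follows; the D values below are hypothetical, chosen only to illustrate the procedure.

```python
def average_system_distribution(d_values):
    """Pick the average-system distribution from per-system KS statistics.

    d_values maps each system name to a dict {distribution_name: D}.
    First collect, per system, the candidate with the smallest D; then,
    among all candidates, return the one with the lowest mean D over
    every system for which it was evaluated.
    """
    # Candidates: the best (smallest-D) distribution of at least one system.
    candidates = {min(fits, key=fits.get) for fits in d_values.values()}

    def mean_d(dist):
        ds = [fits[dist] for fits in d_values.values() if dist in fits]
        return sum(ds) / len(ds)

    return min(candidates, key=mean_d)

# Hypothetical D values for three systems and three distributions.
d_values = {
    "A": {"weibull": 0.04, "lognormal": 0.05, "gamma": 0.09},
    "B": {"weibull": 0.07, "lognormal": 0.03, "gamma": 0.08},
    "C": {"weibull": 0.06, "lognormal": 0.04, "gamma": 0.05},
}
print(average_system_distribution(d_values))
```

Here Weibull is the best fit for system A, but lognormal has the lowest mean D over all three systems, so lognormal is selected for the average system model.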
For the IAT during peaks parameter, several platforms do not have a best fitting distribution, since for
these platforms most of the failures during peaks occur in bursts and hence have inter-arrival times of zero.
Similarly, for the time between peaks parameter, some platforms (all the Condor platforms, and the Deug,
Overnet, and UCB platforms) do not have best fitting distributions, since these platforms have too few
samples to generate a meaningful model. For the failure duration during peaks parameter, some platforms do
not have a best fitting distribution due to the nature of the data. For example, for all the Condor platforms
the failure duration is a multiple of a monitoring interval, creating peaks in the empirical distribution at that
interval. As a result, none of the distributions we investigate provides a good fit.
We find that the model parameters do not follow a heavy-tailed distribution, since the p-values for the
Pareto distribution are very low. For the IAT during peaks parameter, the Weibull distribution provides a
good fit for most of the platforms. For the time between peaks parameter, we find that the platforms can be
modeled by either the lognormal or the Weibull distribution. Similar to our previous model [9], which is
derived from both peak and off-peak periods, we find that for the failure duration during peaks parameter
the lognormal distribution provides a good fit for most of the platforms. To conclude, for all the model
parameters, we find that either the lognormal or the Weibull distribution provides a good fit for the average
system model.
Similar to the average system models built for other systems [10], we cannot claim that our average system
model represents the failure behavior of an actual system. However, the main strength of the average system
model is that it represents a common basis for the traces from which it has been extracted. To generate failure
Table 4: Peak model: the parameter values for the best fitting distributions for all studied systems. E, W,
LN, and G stand for the exponential, Weibull, lognormal, and gamma distributions, respectively.

System              IAT During Peaks    Time Between Peaks     Failure Duration During Peaks
Grid’5000           – (see text)        LN(10.30,1.09)         – (see text)
Condor (CAE)        – (see text)        – (see text)           – (see text)
Condor (CS)         – (see text)        – (see text)           – (see text)
Condor (GLOW)       – (see text)        – (see text)           – (see text)
TeraGrid            – (see text)        LN(12.40,1.42)         LN(10.27,1.90)
LRI                 LN(3.49,1.86)       LN(10.51,0.98)         – (see text)
Deug                W(9.83,0.95)        – (see text)           LN(5.46,1.29)
Notre-Dame          – (see text)        W(247065.52,0.92)      LN(9.06,2.73)
Notre-Dame (CPU)    – (see text)        W(44139.20,0.89)       LN(7.19,1.35)
Microsoft           – (see text)        G(1.50,50065.81)       W(55594.48,0.61)
UCB                 E(23.77)            – (see text)           LN(5.25,0.99)
PlanetLab           – (see text)        LN(10.13,1.03)         LN(8.47,2.50)
Overnet             – (see text)        – (see text)           – (see text)
Skype               – (see text)        W(123440.05,1.37)      – (see text)
Websites            W(66.61,0.60)       LN(10.77,1.25)         – (see text)
LDNS                W(8.97,0.98)        LN(10.38,0.79)         LN(9.09,1.63)
SDSC                W(16.27,0.46)       E(84900)               LN(7.59,1.68)
LANL                G(1.35,797.42)      LN(10.63,1.16)         LN(8.26,1.53)
PNNL                – (see text)        E(160237.32)           – (see text)
Avg                 W(193.91,0.83)      LN(10.89,1.08)         LN(8.09,1.59)
Table 5: The average duration and average IAT of failures for the entire traces and for the peaks.

System              Avg. Failure     Avg. Failure     Avg. Failure    Avg. Failure
                    Duration [h]     Duration [h]     IAT [s]         IAT [s]
                    (Entire)         (Peaks)          (Entire)        (Peaks)
Grid’5000           7.41             5.83             160             13
Notre-Dame (CPU)    4.25             6.14             119             33
Microsoft           16.49            25.03            6               0
PlanetLab           49.61            46.36            1880            264
Overnet             11.98            106.17           17              1
Skype               14.30            7.33             91              11
Websites            1.17             0.97             380             103
LDNS                8.61             8.39             12              8
LANL                5.88             5.89             13874           653
traces for a specific system, the individual best fitting distributions and their parameters shown in Table 4
may be used instead of the average system model.
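For example, a synthetic peak-period failure trace for the average system can be sampled from the Avg-row parameters of Tables 3 and 4. This sketch rests on our reading of the notation: we assume W(a, b) denotes a Weibull distribution with scale a and shape b, and LN(µ, σ) a lognormal with µ and σ on the log scale; the ≥ 7h histogram bucket is truncated to 7h as a simplification.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthetic_peak_trace(n_peaks=100):
    """Sample (timestamp, duration) failure events for peak periods.

    Uses the Avg-row model: time between peaks ~ LN(10.89, 1.08),
    IAT during peaks ~ W(193.91, 0.83), failure duration during
    peaks ~ LN(8.09, 1.59); peak durations are drawn from the
    empirical histogram (Table 3, Avg row, >=7h truncated to 7h).
    """
    hours = np.array([1, 2, 3, 4, 5, 6, 7])
    probs = np.array([74.79, 11.30, 5.16, 1.01, 1.14, 0.70, 5.85])
    probs = probs / probs.sum()  # normalize rounding residue

    events, t = [], 0.0
    for _ in range(n_peaks):
        t += rng.lognormal(mean=10.89, sigma=1.08)           # time to next peak
        peak_end = t + 3600.0 * rng.choice(hours, p=probs)   # peak duration
        while t < peak_end:
            t += 193.91 * rng.weibull(0.83)                  # IAT during peak
            duration = rng.lognormal(mean=8.09, sigma=1.59)  # failure duration
            events.append((t, duration))
        t = peak_end
    return events

trace = synthetic_peak_trace()
print(len(trace))
```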
Next, we compute the average failure duration and inter-arrival time over each entire data set and during
peaks only (Table 5). We compare only the data sets used both in this study and in our previous study [9],
where we modeled each data set individually without isolating peaks. We observe that the average failure
duration per data set can be twice as long as the average duration during peaks. In addition, the average
failure inter-arrival time per data set is on average nine times the average failure inter-arrival time during
peaks. This implies that the distribution per data set differs significantly from the distribution for peaks,
and that fault detection mechanisms must be significantly faster during peaks. Likewise, fault-tolerance
mechanisms active during peaks must have considerably lower overhead than during off-peak periods.
Finally, we investigate the fraction of downtime and the fraction of the number of failures caused by failures
that originate during peaks (Table 6). We find that, on average, over 50%, and up to 95%, of the downtime
of the systems we investigate is caused by failures that originate during peaks.
5 Related Work
Much work has been dedicated to characterizing and modeling system failures [18, 13, 12, 14, 20, 16]. While
the correlation among failure events has received attention since the early 1990s [18], previous studies focus
mostly on space-correlated failures, that is, on multiple nodes failing nearly simultaneously. Although the time
[Figure 7 (plot omitted): Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter (exponential, Weibull, generalized Pareto, lognormal, and gamma). None of the distributions provides a good fit due to a peak at 1h.]
Table 6: Fraction of downtime (Time %) and fraction of the number of failures (# Failures %) due to failures
that originate during peaks, for different values of the threshold parameter k.

                     k = 0.5        k = 0.9        k = 1.0        k = 1.1        k = 1.25       k = 1.5        k = 2.0
System              Time%  Fail%   Time%  Fail%   Time%  Fail%   Time%  Fail%   Time%  Fail%   Time%  Fail%   Time%  Fail%
Grid’5000           61.93  74.43   52.11  64.58   49.19  62.56   47.27  60.84   35.93  57.93   33.25  53.60   27.63  44.99
Condor (CAE)        63.03  90.91   62.93  90.52   62.93  90.52   62.73  89.88   62.73  89.88   62.57  89.13   62.10  87.08
Condor (CS)         80.09  89.56   80.04  88.63   80.04  88.63   80.04  88.63   80.04  88.63   80.04  88.63   80.04  88.63
Condor (GLOW)       39.53  70.51   39.37  69.70   39.37  69.70   39.37  69.70   39.37  69.70   39.37  69.70   37.28  67.47
TeraGrid            100    77.10   66.90  77.10   66.90  77.10   66.90  77.10   66.90  77.10   66.90  77.10   62.98  70.61
LRI                 87.42  71.87   84.73  62.26   84.73  62.26   77.92  59.47   75.66  57.94   75.66  57.94   73.91  55.71
Deug                47.31  83.61   26.07  66.46   25.07  63.03   25.07  63.03   22.28  57.76   20.94  53.31   16.83  41.11
Notre-Dame          62.62  73.06   58.53  69.72   53.40  69.09   45.19  67.70   45.12  67.19   43.69  64.47   41.79  62.61
Notre-Dame (CPU)    73.77  56.92   63.85  43.56   57.92  40.15   56.19  37.93   47.61  33.32   41.88  26.21   28.76  14.99
Microsoft           52.16  40.26   37.44  25.10   35.54  23.41   32.20  20.08   28.78  16.81   23.76  12.80   15.35  6.73
UCB                 100    100     97.70  97.78   95.31  96.10   95.31  96.10   93.19  94.18   86.75  87.62   54.48  57.88
PlanetLab           50.55  54.70   38.07  41.81   38.07  41.81   30.02  32.34   30.02  32.34   26.69  24.86   24.02  20.27
Overnet             68.69  12.90   65.97  7.44    65.97  7.44    65.97  7.44    65.97  7.44    65.97  7.44    65.97  7.44
Skype               32.87  36.01   12.06  18.93   7.65   14.93   6.14   13.81   4.32   12.05   3.29   10.74   3.29   10.74
Websites            24.41  31.45   15.33  18.87   13.60  16.59   12.85  14.97   12.33  13.85   9.27   11.92   8.53   10.06
LDNS                38.21  38.80   13.74  13.85   10.14  10.41   7.80   8.28    5.07   5.61    3.25   3.57    2.06   2.45
SDSC                87.57  86.13   67.86  65.25   67.46  64.02   67.46  64.02   65.24  61.01   64.91  59.23   52.20  48.78
LANL                100    44.68   44.69  44.68   44.69  44.68   44.69  44.68   44.69  44.68   44.69  44.68   22.96  21.41
Avg.                65.01  62.93   51.52  53.67   49.88  52.35   47.95  50.88   45.84  49.30   44.04  46.83   37.78  39.94
correlation of failure events deserves a detailed investigation due to its practical importance [11, 19, 15],
relatively little attention has been given to characterizing the time correlation of failures in distributed
systems. Our work is the first to investigate the time correlation between failure events across a broad
spectrum of large-scale distributed systems. We also propose a model for the peaks observed in the failure
rate process, derived from several distributed systems.
Previous failure studies [18, 13, 12, 14, 20] used few data sets or even data from a single system; their data
also span relatively short periods of time. In contrast, we perform a detailed investigation using data sets from
diverse large-scale distributed systems including more than 100K hosts and 1.2M failure events spanning over
15 years of system operation.
In our recent work [6], we proposed a model for space-correlated failures using fifteen FTA data sets, and we
showed that space-correlated failures are dominant in most of the systems that we investigated. In this work,
we extend our recent work with a detailed time-correlation analysis, and we propose a model for peak periods.
Closest to our work, Schroeder and Gibson [16] present an analysis using a large set of failure data obtained
from a high-performance computing site. However, their study lacks a time correlation analysis and focuses
on well-known failure characteristics such as the MTTF and MTTR. Sahoo et al. [14] analyze one year of
failure data obtained from a single cluster. Consistent with the results of our analysis, they report strong
correlation with significant periodic behavior. Bhagwan et al. [3] present a characterization of the availability
of the Overnet P2P system. Their and other studies [5] show that the availability of P2P systems has diurnal
patterns, but do not characterize the time correlations of failure events.
Traditional failure analysis studies [4, 8] report a strong correlation between the intensity of the workload
and failure rates. Our analysis brings further evidence supporting this correlation: we observe more failures
during peak hours of the day and during work days in most of the (interactive) traces.
6 Conclusion
In the era of cloud and peer-to-peer computing, large-scale distributed systems may consist of hundreds of
thousands of nodes. At this scale, providing highly available service is a difficult challenge, and overcoming
it depends on the development of efficient fault-tolerance mechanisms. To develop new and improve existing
fault-tolerance mechanisms, we need realistic models of the failures occurring in large-scale distributed systems.
Traditional models have investigated failures in distributed systems at much smaller scale, and often under the
assumption of independence between failures. However, more recent studies have shown evidence that there
exist time patterns and other time-varying behavior in the occurrence of failures. Thus, we have investigated in
this work the time-varying behavior of failures in large-scale distributed systems, and proposed and validated a
model for time-correlated failures in such systems.
First, we have assessed the presence of time-correlated failures, using traces from nineteen (production)
systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We found for most of
the studied systems that, while the failure rates are highly variable, the failures still exhibit strong periodic
behavior and time correlation.
Second, to characterize the periodic behavior of failures and the peaks in failures, we have proposed a peak
model with four parameters: the peak duration, the failure inter-arrival time during peaks, the time between
peaks, and the failure duration during peaks. We found that the peak failure periods explained by our model
are responsible, on average, for over 50%, and for up to 95%, of the system downtime. We also found that
the Weibull and the lognormal distributions provide good fits for the model parameters. We have provided
best-fitting parameters for these distributions, which will be useful to the community when designing and
evaluating fault-tolerance mechanisms in large-scale distributed systems.
Last but not least, we have made four new traces available in the Failure Trace Archive, which we hope will
encourage others to use the archive and also to contribute to it with new failure traces.
References
[1] The UC Berkeley/Stanford Recovery-Oriented Computing (ROC) project. http://roc.cs.berkeley.edu/.
[2] J. Aldrich. R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162–176, 1997.
[3] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In M. F. Kaashoek and I. Stoica, editors,
IPTPS, volume 2735 of Lecture Notes in Computer Science, pages 256–267. Springer, 2003.
[4] X. Castillo and D. Siewiorek. Workload, performance, and reliability of digital computing systems. pages 84–89,
June 1981.
[5] J. Chu, K. Labonte, and B. N. Levine. Availability and locality measurements of peer-to-peer file systems. In
Proceedings of ITCom: Scalability and Traffic Control in IP Networks, 2002.
[6] M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, and D. Epema. A model for space-correlated failures in
large-scale distributed systems. In Euro-Par, 2010. to appear.
[7] T. Z. Islam, S. Bagchi, and R. Eigenmann. Falcon: a system for reliable checkpoint recovery in shared grid
environments. In SC ’09, pages 1–12. ACM, 2009.
[8] R. K. Iyer, D. J. Rossetti, and M. C. Hsueh. Measurement and modeling of computer reliability as affected by
system activity. ACM Trans. Comput. Syst., 4(3):214–237, 1986.
[9] D. Kondo, B. Javadi, A. Iosup, and D. Epema. The Failure Trace Archive: Enabling comparative analysis of failures
in diverse distributed systems. In CCGRID, pages 1–10, 2010.
[10] U. Lublin and D. G. Feitelson. The workload on parallel supercomputers: modeling the characteristics of rigid jobs.
J. Parallel Distrib. Comput., 63(11):1105–1122, 2003.
[11] J. W. Mickens and B. D. Noble. Exploiting availability prediction in distributed systems. In NSDI’06, pages 6–6.
USENIX Association, 2006.
[12] D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing
environments. In Euro-Par, pages 432–441, 2005.
[13] D. Oppenheimer, A. Ganapathi, and D. A. Patterson. Why do Internet services fail, and what can be done about
it? In USITS’03, pages 1–1. USENIX Association, 2003.
[14] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In DSN ’04, page 772, 2004.
[15] F. Salfner, M. Lenk, and M. Malek. A survey of online failure prediction methods. ACM Comput. Surv., 42(3):1–42,
2010.
[16] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In DSN
’06, pages 249–258, 2006.
[17] B. Schroeder and G. A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
In FAST ’07, page 1. USENIX Association, 2007.
[18] D. Tang, R. Iyer, and S. Subramani. Failure analysis and modeling of a VAXcluster system. pages 244–251, June
1990.
[19] K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi. Analysis and implementation of software
rejuvenation in cluster systems. In SIGMETRICS ’01, pages 62–71. ACM, 2001.
[20] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In PRDC ’99, page
178, 1999.
[21] Y. Zhang, M. S. Squillante, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling.
In JSSPP, pages 233–252, 2004.