Delft University of Technology
Parallel and Distributed Systems Report Series

Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, and Dick Epema

The Failure Trace Archive. Web: fta.inria.fr. Email: [email protected]

Report number PDS-2010-004
ISSN 1387-2109

Published and produced by:
Parallel and Distributed Systems Section
Faculty of Information Technology and Systems
Department of Technical Mathematics and Informatics
Delft University of Technology
Zuidplantsoen 4, 2628 BZ Delft, The Netherlands

Information about the Parallel and Distributed Systems Report Series: [email protected]
Information about the Parallel and Distributed Systems Section: http://pds.twi.tudelft.nl/

(c) 2010 Parallel and Distributed Systems Section, Faculty of Information Technology and Systems, Department of Technical Mathematics and Informatics, Delft University of Technology. All rights reserved. No part of this series may be reproduced in any form or by any means without prior written permission of the publisher.

Abstract

The analysis and modeling of the failures bound to occur in today's large-scale production systems is invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking into account the time-varying behavior of failures, under the assumption that failures are identically and independently distributed. However, the presence of time correlations between failures (such as peak periods with an increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, a proactive fault-tolerance mechanism is more effective if failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during peaks, and the duration of failures during peaks; for each we determine the best-fitting probability distribution from a set of several candidate distributions, and present the parameters of the best fit. Last, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to 95% of the downtime of these systems.

Contents

1 Introduction
2 Method
  2.1 Failure Datasets
  2.2 Analysis
  2.3 Modeling
3 Analysis of Autocorrelation
  3.1 Failure Autocorrelations in the Traces
  3.2 Discussion
4 Modeling the Peaks of Failures
  4.1 Peak Periods Model
  4.2 Results
5 Related Work
6 Conclusion

List of Figures

1 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions.
2 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
3 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
4 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).
5 Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms.
6 Parameters of the peak periods model. The numbers in the figure match the (numbered) model parameters in the text.
7 Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter. None of the distributions provide a good fit due to a peak at 1h.

List of Tables

1 Summary of the nineteen data sets in the Failure Trace Archive.
2 Average values for the model parameters.
3 Empirical distribution for the peak duration parameter.
4 Peak model: the parameter values for the best-fitting distributions for all studied systems.
5 The average duration and average IAT of failures for the entire traces and for the peaks.
6 Fraction of downtime and fraction of number of failures due to failures that originate during peaks.

1 Introduction

Large-scale distributed systems have reached an unprecedented scale and complexity in recent years.
At this scale failures inevitably occur: networks fail, disks crash, packets get lost, bits get flipped, software misbehaves, or systems crash due to misconfiguration and other human errors. Deadline-driven and mission-critical services are part of the typical workload for these infrastructures, which thus need to be available and reliable despite the presence of failures. Researchers and system designers have already built numerous fault-tolerance mechanisms that have been proven to work under various assumptions about the occurrence and duration of failures. However, most previous work relies on failure models that assume failures to be uncorrelated, which may not be realistic for the failures occurring in large-scale distributed systems. For example, such systems may exhibit peak failure periods, during which the failure rate increases, affecting in turn the performance of fault-tolerance solutions. To investigate such time correlations, in this work we perform a detailed investigation of the time-varying behavior of failures using nineteen traces obtained from several large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids.

Recent studies report that in production systems, failure rates can exceed 1,000 failures per year and, depending on the root cause of the corresponding problems, the mean time to repair can range from hours to days [16]. The increasing scale of deployed distributed systems causes failure rates to increase, which in turn can have a significant impact on performance and cost, such as degraded response times [21] and an increased Total Cost of Operation (TCO) due to increased administration costs and human resource needs [1]. This situation also motivates the need for further research in failure characterization and modeling.

Previous studies [18, 13, 12, 14, 20, 16] focused on characterizing failures in several different distributed systems. However, most of these studies assume that failures occur independently or disregard the time correlation of failures, despite the practical importance of these correlations [11, 19, 15]. First of all, understanding whether failures are time-correlated has significant implications for proactive fault-tolerance solutions. Second, understanding the time-varying behavior of failures and the peaks observed in failure patterns is required for evaluating design decisions. For example, redundant submissions may all fail during a failure peak period, regardless of the quality of the resubmission strategy. Third, understanding temporal correlations and exploiting them for smart checkpointing and scheduling decisions provides new opportunities for enhancing conventional fault-tolerance mechanisms [21, 7]. For example, a simple scheduling policy could be to stop scheduling large parallel jobs during failure peaks. Finally, it is possible to devise adaptive fault-tolerance mechanisms that adjust their policies based on information about peaks. For example, an adaptive fault-tolerance mechanism can migrate the computation at the beginning of a predicted peak.

To understand the time-varying behavior of failures in large-scale distributed systems, we perform a detailed investigation using data sets from diverse large-scale distributed systems, comprising more than 100K hosts and 1.2M failure events and spanning over 15 years of system operation. Our main contribution is threefold:

1. We make four new failure traces publicly available through the Failure Trace Archive (Section 2).
2. We present a detailed evaluation of the time correlation of failure events observed in traces taken from nineteen (production) distributed systems (Section 3).

3. We propose a model for the peaks observed in the failure rate process (Section 4).

2 Method

2.1 Failure Datasets

In this work we use and contribute to the data sets in the Failure Trace Archive (FTA) [9]. The FTA is an online public repository of availability traces taken from diverse parallel and distributed systems. The FTA makes failure traces available online in a unified format, which records the occurrence time and duration of resource failures as an alternating time series of availability and unavailability intervals. Each availability or unavailability event in a trace records the start and the end of the event, and the resource that was affected by the event. Depending on the trace, the resource affected by the event can be either a node of a distributed system, such as a node in a grid, or a component of a node, such as the CPU or memory.

Table 1: Summary of the nineteen data sets in the Failure Trace Archive.

System            Type          Nodes    Period     Year       Events
Grid'5000         Grid          1,288    1.5 years  2005-2006  588,463
COND-CAE (1)      Grid          686      35 days    2006       7,899
COND-CS (1)       Grid          725      35 days    2006       4,543
COND-GLOW (1)     Grid          715      33 days    2006       1,001
TeraGrid          Grid          1,001    8 months   2006-2007  1,999
LRI               Desktop Grid  237      10 days    2005       1,792
Deug              Desktop Grid  573      9 days     2005       33,060
Notre-Dame (2)    Desktop Grid  700      6 months   2007       300,241
Notre-Dame (3)    Desktop Grid  700      6 months   2007       268,202
Microsoft         Desktop Grid  51,663   35 days    1999       1,019,765
UCB               Desktop Grid  80       11 days    1994       21,505
PlanetLab         P2P           200-400  1.5 years  2004-2005  49,164
Overnet           P2P           3,000    2 weeks    2003       68,892
Skype             P2P           4,000    1 month    2005       56,353
Websites          Web servers   129      8 months   2001-2002  95,557
LDNS              DNS servers   62,201   2 weeks    2004       384,991
SDSC              HPC clusters  207      12 days    2003       6,882
LANL              HPC clusters  4,750    9 years    1996-2005  43,325
PNNL              HPC cluster   1,005    4 years    2003-2007  4,650

(1) The COND-* data sets denote the Condor data sets.
(2) The host availability version, which follows the multi-state availability model of Brent Rood.
(3) The CPU availability version.

Prior to the work leading to this article, the FTA made fifteen failure traces available in its standard format; as a result of our work, the FTA now makes available nineteen failure traces. Table 1 summarizes the characteristics of these nineteen traces, which we use throughout this work. The traces originate from systems of different types (multi-cluster grids, desktop grids, peer-to-peer systems, DNS and web servers) and sizes (from hundreds to tens of thousands of resources), which makes them ideal for a study across different distributed systems. Furthermore, many of the traces cover several months of system operation. A more detailed description of each trace is available on the FTA web site (http://fta.inria.fr).

2.2 Analysis

In our analysis, we use the autocorrelation function (ACF) to measure the degree of correlation of the failure time series data with itself at different time lags. The ACF takes on values between -1 (high negative correlation) and 1 (high positive correlation). In addition, the ACF reveals whether the failures are random or periodic: for random data the correlation coefficients are close to zero, while a periodic component in the ACF indicates that the failure data is periodic or at least has a periodic component.
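To make the ACF analysis concrete, the following is a minimal sketch (not the authors' tooling) of how the failure rate process and its autocorrelation can be computed. The binning of events into hourly counts and the helper names `hourly_failure_rate` and `acf` are our own illustrative assumptions; FTA trace parsing is omitted.

```python
# Minimal sketch: bin failure events into an hourly failure-rate series
# and compute its sample ACF. Hypothetical helpers, not from the paper.
import numpy as np

def hourly_failure_rate(event_times_s, trace_length_s):
    """Count failure starts per hour (the 'failure rate process')."""
    n_hours = int(np.ceil(trace_length_s / 3600.0))
    counts, _ = np.histogram(event_times_s, bins=n_hours,
                             range=(0, n_hours * 3600))
    return counts

def acf(series, max_lag):
    """Sample autocorrelation; values lie in [-1, 1], lag 0 equals 1."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

# A daily periodic component appears as ACF peaks at lags 24, 48, ... h:
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)                       # 60 days of hourly bins
rate = 5 + 4 * np.sin(2 * np.pi * hours / 24)    # synthetic daily pattern
counts = rng.poisson(np.clip(rate, 0.1, None))
print(acf(counts, 48)[[24, 48]])                 # clearly positive values
```

In this reading, a strong daily pattern in a trace shows up as pronounced ACF values at multiples of 24 hours, which is the signature discussed for the desktop grid traces in Section 3.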
2.3 Modeling

In the modeling phase, we statistically model the peaks observed in the failure rate process, i.e., the number of failure events per time unit. Towards this end we use the Maximum Likelihood Estimation (MLE) method [2] for fitting probability distributions to the empirical data, as it delivers good accuracy for the large data samples specific to failure traces. After we determine the best fit for each candidate distribution for all data sets, we perform goodness-of-fit tests to assess the quality of the fit for each distribution, and to establish the best fit. As goodness-of-fit tests, we use both the Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) tests, which essentially assess how close the cumulative distribution function (CDF) of the probability distribution is to the CDF of the empirical data. For each candidate distribution with the parameters found during the fitting process, we formulate the hypothesis that the empirical data are derived from it (the null hypothesis of the goodness-of-fit test). Neither the KS nor the AD test can confirm the null hypothesis, but both are useful in understanding the goodness of fit. For example, the KS test provides a test statistic, D, which characterizes the maximal distance between the CDF of the empirical distribution of the input data and that of the fitted distribution; distributions with a lower D value across different failure traces are better. Similarly, the tests return p-values, which are used either to reject the null hypothesis if the p-value is smaller than or equal to the significance level, or to confirm that the observation is consistent with the null hypothesis if the p-value is greater than the significance level. Consistent with the standard method for computing p-values [12, 9], we average 1,000 p-values, each of which is computed by selecting 30 samples randomly from the data set, to calculate the final p-value for the goodness-of-fit tests.
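This fitting and testing procedure can be sketched as follows, assuming `data` holds the empirical values of one model parameter (e.g., inter-arrival times in seconds). This is our reading of the method described above, using SciPy's MLE-based `fit` and KS test rather than the authors' original implementation; the AD variant is analogous and omitted for brevity.

```python
# Sketch of the Section 2.3 methodology: MLE fits for five candidate
# distributions, the KS statistic D, and the averaged p-value computed
# from 1,000 random 30-sample subsets, as described in the text.
import numpy as np
from scipy import stats

CANDIDATES = {
    "exponential": stats.expon,
    "weibull": stats.weibull_min,
    "pareto": stats.pareto,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

def fit_and_test(data, n_rounds=1000, sample_size=30, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(data)                        # MLE fit
        D = stats.kstest(data, dist.cdf, args=params).statistic
        pvals = [stats.kstest(rng.choice(data, sample_size),
                              dist.cdf, args=params).pvalue
                 for _ in range(n_rounds)]
        results[name] = (D, float(np.mean(pvals)))     # lower D is better
    return results

# Usage sketch: pick the candidate with the smallest KS statistic D.
# best = min(fit_and_test(iats).items(), key=lambda kv: kv[1][0])
```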
3 Analysis of Autocorrelation

In this section we present the autocorrelations in failures using the traces obtained from grids, desktop grids, P2P systems, web servers, DNS servers, and HPC clusters. We consider the failure rate process, that is, the number of failure events per time unit.

3.1 Failure Autocorrelations in the Traces

Our aim is to investigate whether the occurrence of failures is repetitive in our data sets. Toward this end, we compute the autocorrelation of the failure rate for different time lags, including hours, weeks, and months. Figure 1 shows for several platforms the failure rate at different time granularities, and the corresponding autocorrelation functions. Overall, we find that many systems exhibit strong correlation from small to moderate time lags, confirming that failures are indeed repetitive in many of the systems.

Many of the systems investigated in this work exhibit strong autocorrelation for hourly and weekly lags. Figures 1(a)-1(f), 2(a)-2(f), and 3(a)-3(c) show the failure rates and autocorrelation functions for the Grid'5000, Condor (CAE), LRI, Deug, UCB, Notre-Dame (CPU), Microsoft, PlanetLab, Overnet, Skype, Websites, LDNS, SDSC, and LANL systems, respectively.

The Grid'5000 data set is a one-and-a-half-year-long trace collected from an academic research grid. Since this system is mostly used for experimental purposes and is large-scale (~3K processors), the failure rate is quite high. In addition, since most of the jobs are submitted through the OAR resource manager, hence without direct user interaction, the daily pattern is not clearly observable. However, during the summer the failure rate decreases, which indicates a correlation between the system load and the failure rate. Finally, as the system size increases over the years, the failure rate does not increase significantly, which indicates system stability.

The Condor (CAE) data set is a one-month-long trace collected from a desktop grid using the Condor cycle-stealing scheduler. As expected from a desktop grid, this trace exhibits daily peaks in the failure rate, and hence in the autocorrelation function. In contrast to other desktop grids, the failure rate is lower.

The LRI data set is a ten-day-long trace collected from a desktop grid. The main trend of this trace is enforced by users who power their machines off during weekends. The Deug and UCB data sets are nine- and eleven-day-long traces collected from desktop grids. In contrast to LRI, machines in these systems are shut down every night. This difference is captured well by the autocorrelation functions: LRI has moderate peaks for lags corresponding to one or two days, while a clear daily pattern is visible in the Deug and UCB systems.

[Figure 1: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions. Panels: (a) Grid'5000, (b) Condor (CAE), (c) LRI, (d) Deug, (e) UCB, (f) Notre-Dame (CPU).]

The Notre-Dame (CPU) data set is a six-month-long trace collected from a desktop grid. This trace models the CPU availability of the nodes in the system. Similar to other desktop grids, Notre-Dame (CPU) presents clear daily and weekly patterns, with high autocorrelation for short time lags.

The Microsoft data set is a thirty-five-day-long trace collected from 51,663 desktop machines of Microsoft employees during summer 1999.
It shows behavior similar to other desktop grids, with high autocorrelation for short time lags and a clear daily pattern with few failures during the weekends. We observe a correlation between the workload intensity and the failure rate in this data set, as the failure rate is higher on work days than on non-work days.

The PlanetLab data set is a one-and-a-half-year-long trace collected from a desktop grid of 692 nodes distributed throughout several academic institutions around the world. Failures are highly correlated for small time lags, but this correlation is not stable over time; the average availability changes on large time scales. Moreover, the user presence, and hence the increased load, has an impact on the correlation: the failure rate strongly decreases during summer.

The Overnet data set is a one-week-long trace collected from a P2P system of 3,000 clients. In such a P2P system the clients may frequently join or leave the network, and all clients that are not connected at initialization are considered down, which explains the initial peak of failures. We observe strong autocorrelation for short time lags, similar to other P2P systems.

[Figure 2: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) Microsoft, (b) PlanetLab, (c) Overnet, (d) Skype, (e) Websites (hourly), (f) Websites (weekly).]

[Figure 3: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) LDNS, (b) SDSC, (c) LANL failure rates (for 2002, 2004, and the whole trace), (d) LANL autocorrelation functions (for 2002, 2004, and the whole trace).]
The Skype data set is a one-month-long trace collected from a P2P system used by 4,000 clients. Similar to Overnet, clients of Skype may also join or leave the system, and clients that are not online are considered unavailable in this trace. As in desktop grids, there is high autocorrelation at small time lags, and the daily and weekly peaks are more pronounced.

The Websites data set is a seven-month-long trace collected from web servers. While the average availability of the web servers remains almost stable (around 80%), there are a few peaks in the failure rate, probably due to network problems. In addition, we do not observe clear hourly and weekly patterns.

The LDNS data set is a two-week-long trace collected from DNS servers. Unlike P2P systems and desktop grids, DNS servers do not exhibit strong autocorrelation with periodic behavior for short time lags. In addition, as the workload intensity increases during the peak hours of the day, we observe that the failure rate also increases.

The SDSC data set is a twelve-day-long trace collected from the San Diego Supercomputer Center (HPC clusters). Similar to other HPC clusters and grids, we observe strong correlation for small to moderate time lags.

Finally, the LANL data set is a ten-year-long trace collected from production HPC clusters. The weekly failure rate is quite low compared to Grid'5000. We do not observe a clear yearly pattern, as the failure rate increases during summer 2002 whereas it decreases during summer 2004. Since around 3,000 nodes were added to the LANL system between 2002 and 2003, the failure rate also increases correspondingly.

Last, a few systems exhibit weak autocorrelation in failure occurrence. Figures 4(a), 4(b), and 4(c) show the failure rate and the corresponding autocorrelation function for the TeraGrid, Notre-Dame, and PNNL systems.

[Figure 4: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.). Panels: (a) TeraGrid, (b) Notre-Dame, (c) PNNL (hourly).]

The TeraGrid data set is an eight-month-long trace collected from an HPC cluster that is part of a grid. We observe weak autocorrelation at all time lags, which implies that the failure rates observed over time are independent. In addition, there are no clear hourly or daily patterns, which gives evidence of an erratic occurrence of failures in this system. The Notre-Dame data set is a six-month-long trace collected from a desktop grid. The failure events in this data set consist of the availability/unavailability events of the hosts in this system. Similar to other desktop grids, we observe clear daily and weekly patterns. However, the autocorrelation is low when compared to other desktop grids.
Finally, the PNNL data set is a four-year-long trace collected from HPC clusters. We observe a correlation between the intensity of the workload and the hourly failure rate, since the failure rate decreases during the summer. The weak autocorrelation indicates independent failures, as expected from HPC clusters.

3.2 Discussion

As we have shown in the previous section, many systems exhibit strong correlation from small to moderate time lags, which probably indicates a high degree of predictability. In contrast, a small number of systems (Notre-Dame, PNNL, and TeraGrid) exhibit weak autocorrelation; only for these systems are the failure rates observed over time independent. We have found that similar systems have similar time-varying behavior; e.g., desktop grids and P2P systems have daily and weekly periodic failure rates, and these systems exhibit strong temporal correlation at hourly time lags. Some systems (Notre-Dame and Condor (CAE)) have direct user interaction, which produces clear daily and weekly patterns in both the system load and the occurrence of failures: the failure rate increases during work hours and days, and decreases during free days and holidays (the summer). Finally, not all systems exhibit a correlation between work hours and days and the failure rate. In the examples depicted in Figure 5, while Grid'5000 exhibits this correlation, PlanetLab exhibits irregular/erratic hourly and daily failure behavior.

[Figure 5: Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms.]

Our results are consistent with previous studies [14, 4, 8, 17]: in many traces we observe strong autocorrelation at small time lags, as well as a correlation between the intensity of the workload and the failure rate.

4 Modeling the Peaks of Failures

In this section we present a model for the peaks observed in the failure rate process in diverse large-scale distributed systems.

4.1 Peak Periods Model

Our model of peak failure periods comprises four parameters, as shown in Figure 6: the peak duration, the time between peaks (inter-peak time), the inter-arrival time of failures during peaks, and the duration of failures during peaks:

1. Peak Duration: The duration of the peaks observed in a data set.

2. Time Between Peaks (inter-peak time): The time from the end of a previous peak to the start of the next peak.

3. Inter-arrival Time of Failures During Peaks: The inter-arrival time of failure events that occur during peaks.

4. Failure Duration During Peaks: The duration of failure events that start during peaks. These failure events may last longer than the peak itself.

[Figure 6: Parameters of the peak periods model. The numbers in the figure match the (numbered) model parameters in the text.]

Our modeling process is based on analyzing the failure traces taken from real distributed systems in two steps, which we describe in turn.
The first step is to identify for each trace the peaks of the hourly failure rate. Since there is no rigorous mathematical definition of a peak in a time series, to identify the peaks we define a threshold value as μ + kσ, where μ is the average and σ is the standard deviation of the failure rate, and k is a positive constant; a period with a failure rate above the threshold is a peak period. We adopt this threshold to achieve a good balance between capturing extreme system behavior in the model, and characterizing with our model an important part of the system failures (either the number of failures or the downtime caused to the system). A threshold excluding all but a few periods, for example one defining peak periods as distributional outliers, may capture too few periods and explain only a small fraction of the system failures. A more inclusive threshold would lead to the inclusion of more failures, but the data may come from periods with very different characteristics, which is contrary to the goal of building a model for peak failure periods.

In the second step we extract the model parameters from the data sets using the peaks that we identified in the first step. Then we try to find a good fit, that is, a well-known probability distribution and the parameters that lead to the best fit between that distribution and the empirical data. When selecting the probability distributions, we consider the degrees of freedom (number of parameters) of each distribution. Although a distribution with more degrees of freedom may provide a better fit for the data, such a distribution can result in a complex model, and hence it may be difficult to analyze the model mathematically. In this study we use five probability distributions to fit to the empirical data: exponential, Weibull, Pareto, lognormal, and gamma. For the modeling process, we follow the methodology that we describe in Section 2.3.
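As a minimal sketch of this first step (with `find_peaks` as a hypothetical helper name, and the hourly failure-rate series as input, e.g., from the binning sketch in Section 2.2):

```python
# Identify peak periods: maximal runs of hours whose failure rate
# exceeds the threshold mu + k * sigma defined above.
import numpy as np

def find_peaks(hourly_rate, k=1.0):
    """Return (start_hour, end_hour) pairs of peak periods (end exclusive)."""
    x = np.asarray(hourly_rate, dtype=float)
    threshold = x.mean() + k * x.std()
    peaks, start = [], None
    for i, above in enumerate(x > threshold):
        if above and start is None:
            start = i                      # a peak period begins
        elif not above and start is not None:
            peaks.append((start, i))       # the peak period ends
            start = None
    if start is not None:                  # trace ends inside a peak
        peaks.append((start, len(x)))
    return peaks
```

The four model parameters then follow directly from the identified periods: the peak duration is (end - start) hours, the time between peaks is the gap between consecutive periods, and the failures whose start times fall inside a period yield the inter-arrival times and durations of failures during peaks.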
4.2 Results

In this section we present the peak model that we derived from the diverse large-scale distributed systems, after applying the modeling methodology described in the previous section and in Section 2.3.

Table 2 shows the average values of all the model parameters for all platforms.

Table 2: Average values for the model parameters.

System            Avg. Peak      Avg. Failure IAT   Avg. Time Between  Avg. Failure Duration
                  Duration [s]   During Peaks [s]   Peaks [s]          During Peaks [s]
Grid'5000         5,047          13                 55,101             20,984
Condor (CAE)      5,287          23                 87,561             4,397
Condor (CS)       3,927          4                  241,920            20,740
Condor (GLOW)     4,200          14                 329,040            75,672
TeraGrid          3,680          35                 526,500            368,903
LRI               4,080          78                 58,371             31,931
Deug              14,914         10                 103,800            1,091
Notre-Dame        3,942          21                 257,922            280,593
Notre-Dame (CPU)  7,520          33                 47,075             22,091
Microsoft         7,200          0                  75,315             90,116
UCB               21,272         23                 77,040             332
PlanetLab         4,810          264                47,124             166,913
Overnet           3,600          1                  14,400             382,225
Skype             4,254          11                 112,971            26,402
Websites          5,211          103                104,400            3,476
LDNS              4,841          8                  42,042             30,212
SDSC              4,984          26                 84,900             6,114
LANL              4,122          653                94,968             21,193

The average peak duration varies across different systems, and even for the same type of system. For example, UCB, Microsoft, and Deug are all desktop grids, but the average peak duration varies widely among these platforms. In contrast, for the SDSC, LANL, and PNNL platforms, which are HPC clusters, the average peak duration values are relatively close. The Deug and UCB platforms have a small number of long peaks, resulting in higher average peak durations than the other platforms. Finally, as there are two peaks of zero length (single data point) in the Overnet system, the average peak duration is zero.

The average inter-arrival time during peaks is rather low, as expected, since the failure rates are higher during peaks than during off-peak periods. For the Microsoft platform, as all failures arrive in bursts during peaks, the average inter-arrival time during peaks is zero. Similar to the average peak duration, the average time between peaks is also highly variable across different systems. For some systems, like TeraGrid, this parameter is on the order of days, and for some systems, like Overnet, it is on the order of hours. Similarly, the duration of failures during peaks varies highly, even across similar platforms. For example, the difference in the average failure duration during peaks between the UCB and Microsoft platforms, which are both desktop grids, is huge, because the machines in the UCB platform leave the system less often than the machines in the Microsoft platform. In addition, in some platforms, like Overnet and TeraGrid, the average failure duration during peaks is on the order of days, showing the impact of space-correlated failures, that is, multiple nodes failing nearly simultaneously.

Using the AD and KS tests, we next determine the best-fitting distributions for each model parameter and each system. Since we determine the hourly failure rates using fixed time windows of one hour, the peak duration and the inter-peak time are multiples of one hour. In addition, as the peak duration parameter is mostly in the range [1h-5h], and for several systems this parameter is mostly 1h, causing the empirical distribution to have a peak at 1h, none of the distributions provide a good fit for the peak duration parameter (see Figure 7).

[Figure 7: Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter. None of the distributions provide a good fit due to a peak at 1h.]
Therefore, for the peak duration model parameter, we only present an empirical distribution (Table 3). We find that the peak duration for almost all platforms is less than 3h.

Table 3: Empirical distribution for the peak duration parameter. h denotes hours.

Platform / Peak Duration   1h        2h       3h       4h      5h       6h      >= 7h
Grid'5000                  80.56 %   13.53 %  3.38 %   1.33 %  0.24 %   0.12 %  0.84 %
Condor (CAE)               93.75 %   3.13 %   –        –       –        –       3.12 %
Condor (CS)                90.91 %   9.09 %   –        –       –        –       –
Condor (GLOW)              83.33 %   16.67 %  –        –       –        –       –
TeraGrid                   97.78 %   2.22 %   –        –       –        –       –
LRI                        86.67 %   13.33 %  –        –       –        –       –
Deug                       28.57 %   –        28.57 %  –       14.29 %  –       28.57 %
Notre-Dame                 90.48 %   9.52 %   –        –       –        –       –
Notre-Dame (CPU)           56.83 %   17.78 %  9.84 %   3.49 %  5.40 %   3.49 %  3.17 %
Microsoft                  35.90 %   33.33 %  25.64 %  5.13 %  –        –       –
UCB                        9.09 %    9.09 %   –        –       –        9.09 %  72.73 %
PlanetLab                  80.17 %   13.36 %  3.71 %   1.27 %  0.53 %   0.53 %  0.43 %
Overnet                    100.00 %  –        –        –       –        –       –
Skype                      90.91 %   4.55 %   –        4.55 %  –        –       –
Websites                   76.74 %   13.95 %  5.23 %   2.33 %  0.58 %   –       1.16 %
LDNS                       75.86 %   13.79 %  10.34 %  –       –        –       –
SDSC                       69.23 %   23.08 %  7.69 %   –       –        –       –
LANL                       88.35 %   9.24 %   2.06 %   0.25 %  0.06 %   0.03 %  –
PNNL                       85.99 %   10.35 %  1.75 %   0.96 %  0.64 %   0.16 %  0.16 %
Avg                        74.79 %   11.30 %  5.16 %   1.01 %  1.14 %   0.70 %  5.85 %

Table 4 shows the best-fitting distributions for the model parameters for all data sets investigated in this study. To generate synthetic yet realistic traces without using a single system as a reference, we create the average system model, which has the average characteristics of all systems we investigate. We create the average system model as follows. First, we determine the candidate distributions for a model parameter as the distributions having the smallest D values for each system. Then, for each model parameter, we determine the best-fitting distribution as the candidate distribution with the lowest average D value over all data sets. After we determine the best-fitting distribution for the average system model, each data set is fit independently to this distribution to find the set of best-fit parameters. The parameters of the average system model, shown in the "Avg" row, represent the average of this set of parameters.

For the IAT during peaks parameter, several platforms do not have a best-fitting distribution, since for these platforms most of the failures during peaks occur in bursts, and hence have inter-arrival times of zero. Similarly, for the time between peaks parameter, some platforms (all the Condor platforms, and the Deug, Overnet, and UCB platforms) do not have best-fitting distributions, since these platforms have an inadequate number of samples to generate a meaningful model. For the failure duration during peaks parameter, some platforms do not have a best-fitting distribution due to the nature of the data. For example, for all the Condor platforms the failure duration is a multiple of a monitoring interval, creating peaks in the empirical distribution at that monitoring interval; as a result, none of the distributions we investigate provide a good fit. Overall, we find that the model parameters do not follow a heavy-tailed distribution, since the p-values for the Pareto distribution are very low. For the IAT during peaks parameter, the Weibull distribution provides a good fit for most of the platforms.
For the time between peaks parameter, we find that the platforms can be modeled by either the lognormal distribution or the Weibull distribution. Similar to our previous model [9], which is derived from both peak and off-peak periods, for the failure duration during peaks parameter we find that the lognormal distribution provides a good fit for most of the platforms. To conclude, for all the model parameters we find that either the lognormal or the Weibull distribution provides a good fit for the average system model.

Table 4: Peak model: the parameter values for the best-fitting distributions for all studied systems. E, W, LN, and G stand for the exponential, Weibull, lognormal, and gamma distributions, respectively.

System            IAT During Peaks  Time Between Peaks  Failure Duration During Peaks
Grid'5000         – (see text)      LN(10.30,1.09)      – (see text)
Condor (CAE)      – (see text)      – (see text)        – (see text)
Condor (CS)       – (see text)      – (see text)        – (see text)
Condor (GLOW)     – (see text)      – (see text)        – (see text)
TeraGrid          – (see text)      LN(12.40,1.42)      LN(10.27,1.90)
LRI               LN(3.49,1.86)     LN(10.51,0.98)      – (see text)
Deug              W(9.83,0.95)      – (see text)        LN(5.46,1.29)
Notre-Dame        – (see text)      W(247065.52,0.92)   LN(9.06,2.73)
Notre-Dame (CPU)  – (see text)      W(44139.20,0.89)    LN(7.19,1.35)
Microsoft         – (see text)      G(1.50,50065.81)    W(55594.48,0.61)
UCB               E(23.77)          – (see text)        LN(5.25,0.99)
PlanetLab         – (see text)      LN(10.13,1.03)      LN(8.47,2.50)
Overnet           – (see text)      – (see text)        – (see text)
Skype             – (see text)      W(123440.05,1.37)   – (see text)
Websites          W(66.61,0.60)     LN(10.77,1.25)      – (see text)
LDNS              W(8.97,0.98)      LN(10.38,0.79)      LN(9.09,1.63)
SDSC              W(16.27,0.46)     E(84900)            LN(7.59,1.68)
LANL              G(1.35,797.42)    LN(10.63,1.16)      LN(8.26,1.53)
PNNL              – (see text)      E(160237.32)        – (see text)
Avg               W(193.91,0.83)    LN(10.89,1.08)      LN(8.09,1.59)

Similar to the average system models built for other systems [10], we cannot claim that our average system model represents the failure behavior of an actual system. However, the main strength of the average system model is that it represents a common basis for the traces from which it has been extracted. To generate failure traces for a specific system, the individual best-fitting distributions and their parameters shown in Table 4 may be used instead of the average system model.
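As an illustration of such trace generation, the sketch below samples peak periods and their failures from the average system model, under two assumptions that are ours rather than statements from the table: that W(a, b) denotes a Weibull with scale a and shape b, and that LN(m, s) denotes a lognormal whose underlying normal has mean m and standard deviation s in log-seconds. Peak durations are drawn from the empirical histogram of Table 3 ("Avg" row), since no distribution fits that parameter.

```python
# Sketch: generate (failure_start_s, duration_s) events during peaks
# from the "Avg" rows of Tables 3 and 4. The parameter reading is an
# assumption (see the lead-in); renormalization absorbs rounding.
import numpy as np

rng = np.random.default_rng(42)

def sample_peak_failures(horizon_s=30 * 86400):
    events, t = [], 0.0
    dur_hours = np.arange(1, 8)                      # 1h..6h and >=7h bin
    dur_probs = np.array([74.79, 11.30, 5.16, 1.01, 1.14, 0.70, 5.85])
    dur_probs /= dur_probs.sum()                     # Table 3, "Avg" row
    while t < horizon_s:
        t += rng.lognormal(10.89, 1.08)              # time between peaks
        peak_len = rng.choice(dur_hours, p=dur_probs) * 3600.0
        f = t
        while f < t + peak_len:
            f += 193.91 * rng.weibull(0.83)          # IAT during the peak
            events.append((f, rng.lognormal(8.09, 1.59)))  # failure duration
        t += peak_len
    return events

print(len(sample_peak_failures()))   # peak failures in a 30-day horizon
```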
Next, we compute the average failure duration and the average inter-arrival time of failures over each entire data set and during peaks only (Table 5). We compare only the data sets used both in this study and in our previous study [9], where we modeled each data set individually without isolating peaks.

Table 5: The average duration and average IAT of failures for the entire traces and for the peaks.

System            Avg. Failure           Avg. Failure          Avg. Failure      Avg. Failure
                  Duration [h] (Entire)  Duration [h] (Peaks)  IAT [s] (Entire)  IAT [s] (Peaks)
Grid'5000         7.41                   5.83                  160               13
Notre-Dame (CPU)  4.25                   6.14                  119               33
Microsoft         16.49                  25.03                 6                 0
PlanetLab         49.61                  46.36                 1,880             264
Overnet           11.98                  106.17                17                1
Skype             14.30                  7.33                  91                11
Websites          1.17                   0.97                  380               103
LDNS              8.61                   8.39                  12                8
LANL              5.88                   5.89                  13,874            653

We observe that the average failure duration per data set can be twice as long as the average duration during peaks. In addition, the average failure inter-arrival time per data set is on average nine times the average failure inter-arrival time during peaks. This implies that the distribution per data set is significantly different from the distribution for peaks, and that fault detection mechanisms must be significantly faster during peaks. Likewise, fault-tolerance mechanisms during peaks must have considerably lower overhead than during non-peak periods.

Finally, we investigate the fraction of downtime caused by failures that originate during peaks, and the fraction of the number of failures that originate during peaks, for several values of the threshold parameter k (Table 6). We find that on average over 50% and up to 95% of the downtime of the systems we investigate is caused by failures that originate during peaks.

Table 6: Fraction of downtime (T) and fraction of the number of failures (F) due to failures that originate during peaks, in percent, for several values of the threshold parameter k.

                  k=0.5         k=0.9         k=1.0         k=1.1         k=1.25        k=1.5         k=2.0
System            T      F      T      F      T      F      T      F      T      F      T      F      T      F
Grid'5000         61.93  74.43  52.11  64.58  49.19  62.56  47.27  60.84  35.93  57.93  33.25  53.60  27.63  44.99
Condor (CAE)      63.03  90.91  62.93  90.52  62.93  90.52  62.73  89.88  62.73  89.88  62.57  89.13  62.10  87.08
Condor (CS)       80.09  89.56  80.04  88.63  80.04  88.63  80.04  88.63  80.04  88.63  80.04  88.63  80.04  88.63
Condor (GLOW)     39.53  70.51  39.37  69.70  39.37  69.70  39.37  69.70  39.37  69.70  39.37  69.70  37.28  67.47
TeraGrid          100    77.10  66.90  77.10  66.90  77.10  66.90  77.10  66.90  77.10  66.90  77.10  62.98  70.61
LRI               87.42  71.87  84.73  62.26  84.73  62.26  77.92  59.47  75.66  57.94  75.66  57.94  73.91  55.71
Deug              47.31  83.61  26.07  66.46  25.07  63.03  25.07  63.03  22.28  57.76  20.94  53.31  16.83  41.11
Notre-Dame        62.62  73.06  58.53  69.72  53.40  69.09  45.19  67.70  45.12  67.19  43.69  64.47  41.79  62.61
Notre-Dame (CPU)  73.77  56.92  63.85  43.56  57.92  40.15  56.19  37.93  47.61  33.32  41.88  26.21  28.76  14.99
Microsoft         52.16  40.26  37.44  25.10  35.54  23.41  32.20  20.08  28.78  16.81  23.76  12.80  15.35  6.73
UCB               100    100    97.70  97.78  95.31  96.10  95.31  96.10  93.19  94.18  86.75  87.62  54.48  57.88
PlanetLab         50.55  54.70  38.07  41.81  38.07  41.81  30.02  32.34  30.02  32.34  26.69  24.86  24.02  20.27
Overnet           68.69  12.90  65.97  7.44   65.97  7.44   65.97  7.44   65.97  7.44   65.97  7.44   65.97  7.44
Skype             32.87  36.01  12.06  18.93  7.65   14.93  6.14   13.81  4.32   12.05  3.29   10.74  3.29   10.74
Websites          24.41  31.45  15.33  18.87  13.60  16.59  12.85  14.97  12.33  13.85  9.27   11.92  8.53   10.06
LDNS              38.21  38.80  13.74  13.85  10.14  10.41  7.80   8.28   5.07   5.61   3.25   3.57   2.06   2.45
SDSC              87.57  86.13  67.86  65.25  67.46  64.02  67.46  64.02  65.24  61.01  64.91  59.23  52.20  48.78
LANL              100    44.68  44.69  44.68  44.69  44.68  44.69  44.68  44.69  44.68  44.69  44.68  22.96  21.41
Avg.              65.01  62.93  51.52  53.67  49.88  52.35  47.95  50.88  45.84  49.30  44.04  46.83  37.78  39.94
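This validation step reduces to a simple computation, sketched below under the same assumptions as the earlier snippets (`find_peaks` is the hypothetical helper from Section 4.1, and `failures` is a list of (start_s, duration_s) pairs):

```python
# Fraction of downtime (T) and of the number of failures (F) due to
# failures that start inside a peak period, for a given threshold k.
def peak_fractions(failures, hourly_rate, k=1.0):
    peaks = find_peaks(hourly_rate, k)          # (start_hour, end_hour)
    def starts_in_peak(start_s):
        h = start_s / 3600.0
        return any(s <= h < e for s, e in peaks)
    total_down = sum(d for _, d in failures)
    peak_down = sum(d for s, d in failures if starts_in_peak(s))
    n_peak = sum(1 for s, _ in failures if starts_in_peak(s))
    return peak_down / total_down, n_peak / len(failures)

# E.g., a result of (0.52, 0.50) would read as: failures originating
# during peaks account for 52% of the downtime and 50% of the events.
```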
5 Related Work

Much work has been dedicated to characterizing and modeling system failures [18, 13, 12, 14, 20, 16]. While the correlation among failure events has received attention since the early 1990s [18], previous studies focus mostly on space-correlated failures, that is, on multiple nodes failing nearly simultaneously. Although the time correlation of failure events deserves a detailed investigation due to its practical importance [11, 19, 15], relatively little attention has been given to characterizing the time correlation of failures in distributed systems. Our work is the first to investigate the time correlation between failure events across a broad spectrum of large-scale distributed systems. In addition, we also propose a model for the peaks observed in the failure rate process, derived from several distributed systems.

Previous failure studies [18, 13, 12, 14, 20] used few data sets or even data from a single system, and their data span relatively short periods of time. In contrast, we perform a detailed investigation using data sets from diverse large-scale distributed systems, comprising more than 100K hosts and 1.2M failure events and spanning over 15 years of system operation. In our recent work [6], we proposed a model for space-correlated failures using fifteen FTA data sets, and we showed that space-correlated failures are dominant in most of the systems that we investigated. In this work, we extend our recent work with a detailed time-correlation analysis, and we propose a model for peak periods.

Closest to our work, Schroeder and Gibson [16] present an analysis using a large set of failure data obtained from a high-performance computing site. However, this study lacks a time-correlation analysis and focuses on well-known failure characteristics like the MTTF and MTTR. Sahoo et al. [14] analyze one year of failure data obtained from a single cluster. Similar to the results of our analysis, they report that there is strong correlation with significant periodic behavior. Bhagwan et al. [3] present a characterization of the availability of the Overnet P2P system. Their study and other studies [5] show that the availability of P2P systems has diurnal patterns, but do not characterize the time correlations of failure events.
Traditional failure analysis studies [4, 8] report strong correlation between the intensity of the workload and failure rates. Our analysis brings further evidence supporting the existence of this correlation: we observe more failures during the peak hours of the day and during work days in most of the (interactive) traces.

6 Conclusion

In the era of cloud and peer-to-peer computing, large-scale distributed systems may consist of hundreds of thousands of nodes. At this scale, providing highly available service is a difficult challenge, and overcoming it depends on the development of efficient fault-tolerance mechanisms. To develop new and improve existing fault-tolerance mechanisms, we need realistic models of the failures occurring in large-scale distributed systems. Traditional models have investigated failures in distributed systems at much smaller scale, and often under the assumption of independence between failures. However, more recent studies have shown evidence that there exist time patterns and other time-varying behavior in the occurrence of failures. Thus, in this work we have investigated the time-varying behavior of failures in large-scale distributed systems, and proposed and validated a model for time-correlated failures in such systems.

First, we have assessed the presence of time-correlated failures, using traces from nineteen (production) systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We found for most of the studied systems that, while the failure rates are highly variable, the failures still exhibit strong periodic behavior and time correlation. Second, to characterize the periodic behavior of failures and the peaks in failures, we have proposed a peak model with four attributes: the peak duration, the failure inter-arrival time during peaks, the time between peaks, and the failure duration during peaks. We found that the peak failure periods explained by our model are responsible on average for over 50% and up to 95% of the system downtime. We also found that the Weibull and the lognormal distributions provide good fits for the model attributes. We have provided best-fitting parameters for these distributions, which will be useful to the community when designing and evaluating fault-tolerance mechanisms in large-scale distributed systems. Last but not least, we have made four new traces available in the Failure Trace Archive, which we hope will encourage others to use the archive and also to contribute to it with new failure traces.

References

[1] The UC Berkeley/Stanford Recovery-Oriented Computing (ROC) project. http://roc.cs.berkeley.edu/.

[2] J. Aldrich. R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162-176, 1997.

[3] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In M. F. Kaashoek and I. Stoica, editors, IPTPS, volume 2735 of Lecture Notes in Computer Science, pages 256-267. Springer, 2003.

[4] X. Castillo and D. Siewiorek. Workload, performance, and reliability of digital computing systems. Pages 84-89, June 1981.
[5] J. Chu, K. Labonte, and B. N. Levine. Availability and locality measurements of peer-to-peer file systems. In Proceedings of ITCom: Scalability and Traffic Control in IP Networks, 2002.

[6] M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, and D. Epema. A model for space-correlated failures in large-scale distributed systems. In Euro-Par, 2010. To appear.

[7] T. Z. Islam, S. Bagchi, and R. Eigenmann. Falcon: a system for reliable checkpoint recovery in shared grid environments. In SC '09, pages 1-12. ACM, 2009.

[8] R. K. Iyer, D. J. Rossetti, and M. C. Hsueh. Measurement and modeling of computer reliability as affected by system activity. ACM Trans. Comput. Syst., 4(3):214-237, 1986.

[9] D. Kondo, B. Javadi, A. Iosup, and D. Epema. The Failure Trace Archive: Enabling comparative analysis of failures in diverse distributed systems. In CCGRID, pages 1-10, 2010.

[10] U. Lublin and D. G. Feitelson. The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput., 63(11):1105-1122, 2003.

[11] J. W. Mickens and B. D. Noble. Exploiting availability prediction in distributed systems. In NSDI '06, pages 6-6. USENIX Association, 2006.

[12] D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par, pages 432-441, 2005.

[13] D. Oppenheimer, A. Ganapathi, and D. A. Patterson. Why do internet services fail, and what can be done about it? In USITS '03, pages 1-1. USENIX Association, 2003.

[14] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In DSN '04, page 772, 2004.

[15] F. Salfner, M. Lenk, and M. Malek. A survey of online failure prediction methods. ACM Comput. Surv., 42(3):1-42, 2010.

[16] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In DSN '06, pages 249-258, 2006.

[17] B. Schroeder and G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In FAST '07, page 1. USENIX Association, 2007.

[18] D. Tang, R. Iyer, and S. Subramani. Failure analysis and modeling of a VAXcluster system. Pages 244-251, June 1990.

[19] K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi. Analysis and implementation of software rejuvenation in cluster systems. In SIGMETRICS '01, pages 62-71. ACM, 2001.

[20] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In PRDC '99, page 178, 1999.

[21] Y. Zhang, M. S. Squillante, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In JSSPP, pages 233-252, 2004.