Problems in Backblaze`s Analysis Peng Liu and Leo Wright JMP

Exploring the Backblaze Hard Drive Data
Peng Liu and Leo Wright
JMP
Abstract
Backblaze, a cloud storage company, provided daily records of
approx. 50,000 hard drives, over
a period of nearly two years.
Data include demographics, a
failure indicator, and eighty
SMART indicators.
We studied the life distributions
of individual models, compared
different models, investigated
possible usage of SMART
sensors, and compared to the
failure rate on BackBlaze web
page.
Big, Missing, Problematic Data
Present Analysis
Big and Missing:
•2 years: 2013, 2014
•631 CSV files
•~12GB JMP table
•Missing ~50% by rows
•Missing ~65% SMART
•Truncated data
•Delete bad data
•Use reliable information
Problematic:
•Small errors
•Multiple failures
•Few/No failures
•Truncation
•Revisit failure rate
calculation in Backblaze’s
blog
•Assess Backblaze’s
conclusions
•Life Distribution
•LD comparison by models
•Parametric Survival Model
using SMART
Exploring the Backblaze Hard Drive Data – Data Processing
Peng Liu and Leo Wright
JMP
Flow Chart of Data Processing
Import CSV and
merge
Capacities by
Models
Data Error
Easy Fix
Sum of Failure by
Serial Numbers
Multiple
Failures
Delete or Assume Mistakes
Row Counts and
Max SMART 9 by
serial number
Missing
Rows
Cannot Fix, Truncated Data
Missing Data
Pattern
Missing
SMART
Cannot Fix, Missing Data
Row Counts, etc.
by S/N
Credibility
of Life
Span
Use SMART 9, Remove Bad
serial number
Exploring the Backblaze Hard Drive Data – Life Distributions
Peng Liu and Leo Wright
JMP
Zero or few failures, truncations, sample sizes. Numerous data quality issues to manage!
33 models have at least 2 failed drives. The graph
draws MTTFs and corresponding confidence
intervals. Sample sizes of those models are plotted.
There is no evidence that any manufacturer
produces hard drives more reliably than another.
The truncation issues alter our results. For example,
the MTTF (Weibull) lower bound for a Seagate drive
in the graph is 15790 days without considering
truncation; it is 23764 days otherwise. In some
cases, it led to convergence issues.
Exploring the Backblaze Hard Drive Data – SMART Predictability
Peng Liu and Leo Wright
JMP
SMART sensors may be useful. Using the final readings, fit a Parametric Survival model.
• Treat factor levels, such as temperature as constant.
• For trends, such as cycle counts and error counts, assume a cumulative damage model where damage is
proportionally cumulative to the logarithm of the life.
A Seagate disk
SMART 194 Temperature
SMART 190 Temperature
SMART 241 Total LBA Written
• The positive correlation between temperature and
failure time is most likely due to intensive
read/write operations.
• Total LBA Written is a cumulative measurement of
write operations, which is a reasonable indicator of
usage.
• Different hard drive models yield different sets of
significant indicators.
Exploring the Backblaze Hard Drive Data – Problems in Backblaze’s Analysis
Peng Liu and Leo Wright
JMP
Failure rate: the average number of failures you can expect running one drive for a year. A failure is when we have to
replace a drive in a pod. That is the reciprocal of a mean time to failure (MTTF)!
Backblaze’s calculation is unique. Given a period of time, for each day, count the total number of drives in service,
N[i]. Add up all counts, N=sum of N[i]’s. Count all failure incidences during the period, K. The failure rate is K/N per
day, and 365*K/N per year.
The Problem: Suppose we conduct two experiments on two different days. On the first day, we set up 1000 hard
drives of the same model, brand new, in service. Now consider these units have been in service many months, some
of those drives have failed, but 500 of them are still in service. So on a particular day, the number in service and
failed units are changing with the age. What do you expect the failure probabilities of individual drives on those two
days? Are they the same? Furthermore, due to truncation (missing data), estimated failure rates are inflated since
the denominator may be less than true value.
Final Conclusion
• We dispute Backblaze’s conclusion.
• The data, as it exists, cannot be used to conclude any hard drive or manufacturer is better than another.
• We found that some SMART indicators may be useful to predict failures.