An Efficient Binary Locally Repairable Code for
Hadoop Distributed File System
Abstract—In the Hadoop distributed file systems (HDFS), to
lower costly communications traffic for data recovery, the concept
of locally repairable codes (LRCs) has been recently proposed.
Given the immense size of modern, energy-hungry
HDFS deployments, reducing computational complexity is attractive. In
this work, to avoid finite field multiplications—which are the
major source of complexity—we put forward the idea of designing binary locally repairable codes (BLRCs). More specifically,
we design a length-15, rate-2/3 BLRC with minimum distance
4 that has the minimum possible locality among codes of its type. We
show that our code has a lower complexity than the most recent
non-binary LRC in the literature while meeting other desirable
requirements in HDFS such as storage overhead and reliability.
Index Terms—Erasure coding, Hadoop distributed file system,
locally repairable codes.
I. INTRODUCTION
Many companies such as Facebook, Google, and Microsoft
use Hadoop distributed file system (HDFS) as their storage
systems [1]. HDFS has a name node and several data nodes
(DNs). The name node, a master computer containing metadata (data about data), administers a file system. Typically,
some backups of the name node are provided to be used in
the case of failures. DNs are slave computers that store clients’
data.
DN failures are the norm and can make HDFS unreliable. The
simplest way to bring reliability to HDFS is to generate
multiple replicas of the data and store them in distinct DNs.
Replication is widely used in present storage systems such
as those of Facebook [2]. However, due to its large overhead¹,
replication is becoming less attractive, especially because the
amount of data that needs to be stored is increasing significantly².
Since a node failure can be viewed as an erasure, conventional (n, k) erasure codes have been suggested to create redundancy
for HDFS [2], [4]. In this case, k blocks of data are coded
by an (n, k) erasure code and then, the n coded blocks
are stored across n different DNs. This storage method can
provide the optimal trade-off between redundancy (i.e., the
storage overhead) and reliability if maximum distance separable (MDS) codes (e.g., Reed-Solomon (RS) codes [5]) are
used [6]. MDS codes, however, cannot properly address the
problem of repairing failed nodes, because a large amount of
data must be transferred between different nodes, which is a
costly task [2].
An (n, k) erasure code that is able to reconstruct every
coded block of data from at most r other coded blocks is
called an r-locally repairable code (LRC), and is represented
by (n, k, r). In the case of repairing permanent failures or
retiring defective nodes, LRCs with small r are desired as
¹For example, the 3-replication scheme has 200% storage overhead.
²The total amount of the world's data almost doubles every two years [3].
they require less costly network bandwidth and disk I/Os.
In the case of transient unavailability of nodes, reducing r
can significantly improve data availability, and expedite the
process of distributing data between various data centers [2].
Hence, reducing the locality r of LRC is important.
LRCs that are currently used in practice are designed based
on RS codes, which need additions (XORs) and multiplications
in the binary extension field F_{2^m}; see, e.g., [2], [4], and [7].
Multiplication is more expensive than addition because its
implementation requires more hardware and computation [8].
Considering the immense size of data centers, such computations are non-negligible. For
example, the LRC introduced in [2] is constructed based on a
(14, 10) RS code. Supposing that 30 PB of data needs to
be stored, 1.2 × 10^17 multiplications have to be performed to
store the total data in the warehouse cluster³.
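For concreteness, the figure follows directly from the per-byte cost in footnote 3:

$$30\ \text{PB} \times 4\ \tfrac{\text{mult.}}{\text{byte}} = (3\times 10^{16}\ \text{bytes}) \times 4 = 1.2\times 10^{17}\ \text{multiplications}.$$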
In this paper, we introduce a simple yet efficient systematic
code for HDFS with the following properties: (i) the code
is binary, thus multiplication complexity is removed; (ii) its
mean time to data loss (MTTDL) is no worse than that of the widely used 3-replication method; (iii) it has
the minimum possible locality for its size; and (iv) its storage
overhead is no worse than the recently proposed code in [2].
Notations: We denote matrices and vectors by capital boldface and lowercase boldface letters, respectively. w(a) and F_{2^m}
stand for the Hamming weight of the vector a and a finite field
of cardinality 2^m, respectively. Also, |S| and P(S) represent
the cardinality and power set of a set S, respectively. [l] and (·)^T
respectively denote the set of all integers from 1 to l and
the matrix transpose operation.
II. PROPOSED LRC
In this section, we show how our proposed binary LRC
(BLRC) is constructed.
A. Code Parameters
We use k = 10 as in [2] and [9], and set dmin = 4,
where dmin is the minimum distance of the code. As will
be discussed in Section III, this choice of dmin is needed to
achieve a larger MTTDL than that of the 3-replication scheme.
Minimizing n for a given k minimizes the storage overhead.
As Table I suggests [10], for k = 10 and dmin = 4, the
minimum possible value of n is 15. Now, a question arises:
what is the minimum locality that can be achieved among all
binary (15, 10) erasure codes with dmin = 4?
³Four multiplications per byte are required for encoding of the code in [2].

TABLE I
BEST ACHIEVABLE dmin FOR DIFFERENT VALUES OF (n, k) OVER THE BINARY FIELD.

         n = 13   n = 14   n = 15   n = 16   n = 17
k = 9       3        4        4        4        5
k = 10      2        3        4        4        4
k = 11      2        2        3        4        4

Proposition 1. Every binary (15, 10) code with dmin = 4 has locality at least six.
Proof. Please see the Appendix.
In the following section, we introduce a code that achieves
this minimum locality.
B. Code Design
In what follows, we introduce C0, a new (15, 10, 6) BLRC
with dmin = 4. The generator matrix of C0 has the form
G0 = [P0, I10] ∈ F_2^{10×15}, where P0 ∈ F_2^{10×5}. Note
that P0^T has a simple structure, as its columns span all
$\binom{5}{3} = 10$ possible arrangements of placing three ones and two zeros:
$$\mathbf{P}_0^T = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix} \in \mathbb{F}_2^{5\times 10}. \tag{1}$$
In the following, we verify that this code achieves dmin = 4
and r = 6 through Propositions 2 and 3.
Proposition 2. The minimum distance dmin of C0 is 4.
Proof. We show that the weight of every linear combination
of rows of G0 is at least four. In other words, we show that
w(s) ≥ 4 for s = Σ_{i∈A} gi and every A ∈ P([10]) \ ∅, where gi
is the i-th row of G0 and [10] is the set of integers between
1 and 10. If |A| = 1 or |A| ≥ 4, clearly w(s) ≥ 4. In the
case of |A| = 2, the identity part of G0 contributes two 1's,
so we need to show w(pi ⊕ pj) ≥ 2 for arbitrary distinct
rows pi and pj of P0. Each row pi has three 1's and two 0's.
Among the three positions where pi is 1, pj has at least one 0;
also, among the two positions where pi is 0, pj has at least
one 1 (otherwise pj = pi). Hence, w(pi ⊕ pj) ≥ 2. Similarly,
for |A| = 3, the identity part contributes three 1's, so we only
need to show that w(pi ⊕ pj ⊕ pl) ≥ 1, i.e., that the sum is
nonzero, for arbitrary distinct rows pi, pj, and pl of P0. This
follows by noticing that the aggregate number of 1's in any
three rows of P0 is 9, an odd number, so their XOR cannot
be the all-zero vector. Hence, w(s) ≥ 4 also for |A| = 3.
Proposition 3. The locality r of C0 is 6.
Proof. This is verified by inspection of all 15 possibilities.
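Propositions 2 and 3 are small enough to check exhaustively. The following sketch (our own illustrative script, not from the paper; plain Python 3) packs the 15 columns of G0 into 10-bit integers, enumerates all 1023 nonzero codewords to confirm dmin = 4, and searches for the smallest repair set of every coded block to confirm locality 6.

```python
from itertools import combinations

# Columns of G0 = [P0, I10] as 10-bit integers: the 5 parity columns
# are the rows of P0^T in (1), followed by the 10 unit columns of I10.
P0T = ["0000111111", "0111000111", "1011011100", "1101101010", "1110110001"]
cols = [int(r, 2) for r in P0T] + [1 << (9 - i) for i in range(10)]

def weight(msg):
    """Hamming weight of the codeword generated by the 10-bit message msg."""
    return sum(bin(msg & c).count("1") & 1 for c in cols)

print("dmin =", min(weight(m) for m in range(1, 1 << 10)))   # expect 4

def locality(j):
    """Smallest number of other coded blocks whose XOR equals block j."""
    others = cols[:j] + cols[j + 1:]
    for r in range(1, len(others) + 1):
        for combo in combinations(others, r):
            acc = 0
            for c in combo:
                acc ^= c
            if acc == cols[j]:
                return r

print("locality =", max(locality(j) for j in range(15)))     # expect 6
```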
C. Encoding and Decoding of C0
In this section, the details of the encoding and decoding
of C0 are outlined. These details are later used for complexity analysis. Consider a code with generator matrix
G ∈ F_{2^m}^{k×n}. Suppose a stripe of L_B bits is to be coded and
stored in HDFS. This stripe is broken into k data blocks x_j,
each of size l_B = L_B/k bits or, equivalently, l_B/m symbols in
F_{2^m}, where j ∈ [k]. Let x_{i,j} be the i-th symbol of the j-th
data block, where i ∈ [l_B/m]. Then, we obtain the coded
vector ỹ_i = (y_{i,1}, y_{i,2}, ..., y_{i,n}) = x̃_i G ∈ F_{2^m}^{1×n}, where
x̃_i = (x_{i,1}, x_{i,2}, ..., x_{i,k}) ∈ F_{2^m}^{1×k}. Now, we form the matrix
Y by stacking the l_B/m coded vectors ỹ_i. The columns of Y,
denoted y_l for l ∈ [n], are the n coded blocks, which are stored in n distinct
DNs.
For C0 with G0, m = 1, and (n, k) = (15, 10), the first five
and the last 10 columns of Y are the parity and systematic
blocks, respectively. The parity blocks are constructed by
XORs of the systematic blocks, e.g. y1 = x5 ⊕ x6 ⊕ x7 ⊕
x8 ⊕ x9 ⊕ x10 .
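As an illustration, a bytewise encoder for C0 needs nothing beyond XOR. The sketch below is a hypothetical helper (names are ours, not from the paper) that derives each parity block from the corresponding row of P0^T in (1).

```python
P0T = ["0000111111", "0111000111", "1011011100", "1101101010", "1110110001"]

def encode_c0(data_blocks):
    """data_blocks: 10 equal-length bytes objects x1..x10.
    Returns the five parity blocks y1..y5 (pure XOR, no multiplications)."""
    assert len(data_blocks) == 10
    parities = []
    for row in P0T:                       # one row of P0^T per parity block
        acc = bytearray(len(data_blocks[0]))
        for j, bit in enumerate(row):
            if bit == "1":                # x_{j+1} participates in this parity
                for t, byte in enumerate(data_blocks[j]):
                    acc[t] ^= byte
        parities.append(bytes(acc))
    return parities

# The first parity equals x5 xor x6 xor ... xor x10, matching y1 above.
```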
Now, let us consider the decoding procedure of C0 . Suppose
that there is a request to read some data blocks. The name node
firstly directs the client to those DNs containing the systematic
blocks. If they are unavailable, then the client is referred to
the nodes that hold the parity blocks. Based on the structure
of C0 , in this step, decoding is done by XORing the available
systematic blocks with the parity blocks to recover unavailable
blocks. For example, if the systematic data block x1 is not
available, then the client can recover it via any of the following
three relations: x1 = y3 ⊕ x3 ⊕ x4 ⊕ x6 ⊕ x7 ⊕ x8, x1 = y4 ⊕ x2 ⊕
x4 ⊕ x5 ⊕ x7 ⊕ x9, or x1 = y5 ⊕ x2 ⊕ x3 ⊕ x5 ⊕ x6 ⊕ x10.
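A single-block repair is equally simple; for example, the first relation above can be evaluated with a generic XOR helper (again a hypothetical sketch, for illustration only).

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equal-length blocks."""
    acc = bytearray(len(blocks[0]))
    for blk in blocks:
        for t, byte in enumerate(blk):
            acc[t] ^= byte
    return bytes(acc)

# x1 = y3 xor x3 xor x4 xor x6 xor x7 xor x8:
# x1 = xor_blocks(y3, x3, x4, x6, x7, x8)
```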
When a DN failure occurs, all its blocks have to be reconstructed. The missing blocks of a failed node are reconstructed
by downloading r = 6 relevant blocks from surviving nodes.
III. PROPOSED BLRC ASSESSMENT
The main objective of this work was to suggest a binary
LRC with minimum locality as an alternative coding solution
for HDFS, where the motivation was to reduce the coding
complexity. In order to evaluate our code, we compare it with
the non-binary (16, 10, 5) LRC in [2]. Our evaluation metrics
are: 1) computational complexity; 2) MTTDL; 3) the required
storage; and 4) locality. This comparison will allow potential
users to make a proper choice according to their needs and
system requirements. Note that this comparison is not a strict
one, because some of the parameters of these two codes (such
as storage overhead) are not the same.
A. Computational Complexity
In this section, similar to [9], we suppose that computations
are performed at the byte level. There are two separate operations
for which we consider computational complexity.
Encoding complexity: The required numbers of multiplications and additions to produce parity blocks associated
with an (n, k) RS code are (n − k) and (1 − 1/k)(n − k)
per byte, respectively. Therefore, since the (16, 10, 5) LRC
proposed in [2] is based on a (14, 10) RS code, it requires 4
multiplications and 3.6 additions per byte to produce the (14,
10) RS parity blocks. Moreover, it requires 0.8 extra additions
per byte to construct its own two parities. For C0 , the number
of multiplications to produce parity blocks is zero, and the
number of additions is 2.5 per byte. This immediately shows a
significant reduction, more so when we consider the fact that
field multiplication can be significantly more complex than
field addition [8].
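The count of 2.5 additions per byte follows from the structure of P0^T: each of the five parity bytes is the XOR of six data bytes, i.e., five additions each, so

$$\frac{5\ \text{parities} \times 5\ \text{XORs}}{10\ \text{data bytes}} = 2.5\ \tfrac{\text{additions}}{\text{byte}}.$$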
[Fig. 1: transient states 0-5 (number of failed blocks) connected by failure transitions λi and repair transitions ρi, with branchings γ1λ3 / (1 − γ1)λ3 and γ2λ4 / (1 − γ2)λ4 leading to the absorbing states O1, O2, and O3.]
Fig. 1. Markov chain model of the proposed BLRC.
Repair/decoding complexity: Now, let us consider the case
of unavailability or failure of one block, which is the dominant
case of block failures [9]. In this case, for decoding or repairing processes of the LRC in [2], on average, one multiplication
and 4.5 additions per byte are needed. For C0 , these numbers
are zero multiplications and six additions. Considering the
complexity of field multiplication, the repair complexity of
our code is arguably better than that of [2], particularly for
large field sizes.
B. MTTDL
The standard Markov chain model is widely used in the context
of distributed storage to determine the reliability of codes
proposed for HDFS; see, e.g., [2], [7] and references therein.
In this model, it is assumed that all blocks of a stripe of data
are stored in independent DNs. Therefore, a failure occurring
to one block of a stripe does not affect other blocks of that
stripe.
The Markov chain diagram of C0 is presented in Fig. 1.
State numbers 0 to 5 denote the number of failed blocks.
State O1 is the state where four block failures associated with
one stripe cannot be recovered, i.e. the state where omitting
four columns of G0 results in 10×11 sub-matrices of rank 9.
Similarly, state O2 is the state where five block failures cannot
be recovered, i.e. the state where omitting five columns of
G0 results in 10×10 sub-matrices of rank 9. Also O3 is the
state where six block failures associated with one stripe occur.
Assuming that disk failure times are exponentially distributed
with rate λ and noticing that at state i there are 15 − i
surviving blocks, the block failure rate λi is (15 − i)λ. Similarly,
ρi is the block repair rate at state i. As stated in [7], ρi depends
on disk size Cdisk , the repair traffic of each node Brep , the
number of data nodes N , code locality r, and the average
time Td needed to recognize failures and prompt the repair
process. Finally, γ1 and γ2 denote the fractions of repairable four-
and five-block failure patterns, respectively. It is numerically
verified from G0 that γ1 = 0.92 and γ2 = 0.56.
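The values of γ1 and γ2 can be reproduced with a short exhaustive search. The sketch below is our own script (it assumes γe is read as the unconditioned fraction of e-erasure patterns that leave G0 full rank over F2); it erases every set of four or five columns of G0 and tests the rank of the survivors.

```python
from itertools import combinations

P0T = ["0000111111", "0111000111", "1011011100", "1101101010", "1110110001"]
cols = [int(r, 2) for r in P0T] + [1 << (9 - i) for i in range(10)]

def rank_f2(vecs):
    """Rank over F2 of bit-packed vectors (Gaussian elimination)."""
    pivots = {}                        # leading bit -> reduced vector
    for v in vecs:
        while v:
            hb = v.bit_length() - 1
            if hb not in pivots:
                pivots[hb] = v
                break
            v ^= pivots[hb]
    return len(pivots)

def gamma(e):
    """Fraction of e-erasure patterns from which all data is recoverable."""
    good = total = 0
    for erased in combinations(range(15), e):
        keep = [c for j, c in enumerate(cols) if j not in erased]
        good += rank_f2(keep) == 10
        total += 1
    return good / total

print(round(gamma(4), 2), round(gamma(5), 2))   # expect about 0.92 and 0.56
```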
Suppose that t_{i,j} denotes the transition rate from state i to
state j. Based on the Markov chain presented in Fig. 1, the
canonical form of the transition matrix associated with C0 is

$$\mathbf{T} = [t_{i,j}]_{9\times 9} = \begin{pmatrix} \mathbf{Q}_{6\times 6} & \mathbf{R}_{6\times 3} \\ \mathbf{0}_{3\times 6} & \mathbf{I}_3 \end{pmatrix}, \tag{2}$$
where the first six and the last three states of T are related
to transient states 0 to 5 and absorbing states O1 to O3 ,
respectively. Also, 03×6 and I3 represent a 3×6 null matrix
and an identity matrix of size three, respectively.
The locality of our code is r = 6. Following [2], we set the disk
failure rate to an average of one failure per disk per four years (λ = 0.25 per year),
Cdisk = 15 TB, Brep = 0.1 Gbps, N = 3000, and Td = 30
minutes, which reflect the current settings of one of Facebook's data
centers. Then, the average repair rate associated with a single
block is ρ1 ≈ Brep·N/(Cdisk·r) = 1/2400 disks per second.
Also, since for more than one block failure Td is much larger
than the repair time, we have ρi = 1/Td for i ∈ {2, ..., 5} [7].
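Plugging in the above numbers shows where 1/2400 comes from:

$$\rho_1 \approx \frac{B_{rep}\,N}{C_{disk}\,r} = \frac{(0.1\times 10^{9}\ \text{b/s})\times 3000}{(15\times 10^{12}\times 8\ \text{b})\times 6} \approx 4.2\times 10^{-4}\ \text{s}^{-1} = \tfrac{1}{2400}\ \text{s}^{-1}.$$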
As shown in [11], the expected time to absorption, i.e., the
MTTDL, can be obtained by adding the elements of the first
row of (I6 − Q6×6)^{−1}, where (·)^{−1} stands for the matrix
inverse operation. By applying this procedure, the MTTDL
of C0 is 1.6 × 10^11 years. Similarly, MTTDLs of
1.1 × 10^10 and 8.4 × 10^17 years are obtained for the 3-replication scheme and the LRC in [2], respectively. A comparison
between the MTTDL of C0 and that of the 3-replication
method shows that the reliability of C0 is significantly higher
than that of the 3-replication scheme, which is considered
the reference point for reliability evaluation [2], [7]. It is
notable that the 3-replication method has been widely used
as a standard scheme. Hence, the reliability of the proposed
BLRC is more than sufficient from a practical point of view.
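As a sanity check, the absorption-time computation can be scripted directly. The sketch below is our own reconstruction (the one-second discretization step is an assumption, and such modeling choices can move the result within the same order of magnitude); it builds the discrete-time transition matrix of Fig. 1 and sums the first row of (I6 − Q6×6)^{−1}.

```python
import numpy as np

lam  = 0.25 / (365 * 24 * 3600)   # disk failure rate, per second
rho1 = 1.0 / 2400                 # single-block repair rate, per second
rho  = 1.0 / (30 * 60)            # rho_2..rho_5 = 1/Td, per second
g1, g2 = 0.92, 0.56               # repairable fractions computed from G0
dt   = 1.0                        # discretization step in seconds (assumed)

lams = [(15 - i) * lam for i in range(6)]   # lambda_i = (15 - i) * lambda
rhos = [0.0, rho1, rho, rho, rho, rho]

T = np.zeros((9, 9))              # states 0..5 transient; 6..8 = O1, O2, O3
for i in range(1, 6):
    T[i, i - 1] = rhos[i] * dt    # repairs move one state back
for i in range(3):
    T[i, i + 1] = lams[i] * dt    # failures in states 0..2 move forward
T[3, 4] = g1 * lams[3] * dt;  T[3, 6] = (1 - g1) * lams[3] * dt
T[4, 5] = g2 * lams[4] * dt;  T[4, 7] = (1 - g2) * lams[4] * dt
T[5, 8] = lams[5] * dt
for i in range(9):
    T[i, i] = 1.0 - T[i].sum()    # self-loops; absorbing rows stay identity

Q = T[:6, :6]
N = np.linalg.inv(np.eye(6) - Q)             # fundamental matrix
mttdl_years = N[0].sum() * dt / (365 * 24 * 3600)
print(f"MTTDL ~ {mttdl_years:.1e} years")    # on the order of 10^11
```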
C. Required Storage
For every 10 data blocks, C0 stores five extra coded blocks,
while the LRC of [2] stores six extra blocks. Thus, our
proposed BLRC uses 6.25% less storage than the code in
[2]. Note that this saving is non-trivial when one
considers the size of data centers.
D. Locality
A 6-month observation of Facebook data centers shows that
98.08% of repairs correspond to regenerating one block of a
stripe. Hence, this is the most important case to deal with [9].
This implies that locality of a code, as defined in Section I, is
a meaningful measure of the traffic imposed by the repairing
process. As shown by Proposition 3, the locality of our code
is six, while the locality of the proposed LRC in [2] is five.
However, our code provides other benefits such as significant
reduction in computational complexity and reduced storage
overhead. Also, as discussed earlier, among all binary (15, 10)
codes with dmin = 4, C0 has the lowest possible locality.
Remark: The above comparison shows that our code has
a few benefits over the code of [2] at the cost of increased
locality. For a more meaningful comparison, it is desirable to
have a BLRC with the same storage overhead and locality as
the code in [2]. It is indeed possible to design such a binary
code. In fact, a (16, 10, 5) BLRC, say C1, can be formed by
simply appending to C0 a new parity block constructed by XORing all
systematic blocks. Hence, the generator matrix of C1,
say G1, is obtained as G1 = [1, G0], where 1 represents
a 10×1 vector of all ones. Now, the locality of C1 is five
because, by XORing 1 with four carefully selected columns
of I10, any column of P0 can be constructed. Moreover, C1 has
one more parity than C0, meaning that its MTTDL is even better
than that of C0. The encoding complexity of the new code, C1, is 2.8
additions per byte (still significantly lower than that of [2]),
and the repair complexity is now 5 additions, which is lower
than that of [2].
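For instance, using (1) and the new parity y6 = x1 ⊕ x2 ⊕ · · · ⊕ x10, both an old parity and a systematic block can be rebuilt from five other blocks: y1 = y6 ⊕ x1 ⊕ x2 ⊕ x3 ⊕ x4, and x2 = y6 ⊕ y3 ⊕ x5 ⊕ x9 ⊕ x10.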
IV. CONCLUSION AND FUTURE WORK
A new binary (15, 10) locally repairable code (BLRC)
was proposed for the Hadoop distributed file system. The
proposed BLRC has minimum distance of four, thus providing
better reliability than the widely used 3-replication method.
Since our code is binary, encoding, decoding and repairing do
not require finite field multiplication, resulting in significant
computational savings. We proved that, among all (15, 10)
binary codes with minimum distance four, our code has
the minimum locality of six, thus imposing minimal repair
traffic. Generalizing our code to support more applications is
an interesting direction for future work.
REFERENCES
[1] http://wiki.apache.org/hadoop/PoweredBy.
[2] M. Sathiamoorthy, M. Asteris, D. S. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," PVLDB, vol. 6, no. 5, pp. 325-336, 2013.
[3] http://www.emc.com/leadership/programs/digital-universe.htm.
[4] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, 2010.
[5] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, 1960.
[6] A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, "A survey on network codes for distributed storage," Proceedings of the IEEE, vol. 99, no. 3, pp. 476-489, 2011.
[7] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure Storage," in Proceedings of the 2012 USENIX Annual Technical Conference, 2012.
[8] K. M. Greenan, E. L. Miller, and T. J. E. Schwarz, "Optimizing Galois field arithmetic for diverse processor architectures and applications," in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 257-266, 2008.
[9] K. Rashmi, N. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in 5th USENIX Workshop on Hot Topics in Storage and File Systems, 2013.
[10] R. Schürer and W. C. Schmid, "MinT-architecture and applications of the (t, m, s)-net and OOA database," Math. Comput. Simul., vol. 80, no. 6, pp. 1124-1132, Feb. 2010.
[11] C. M. Grinstead and J. L. Snell, Introduction to Probability, 2nd ed. American Mathematical Society, ch. 11, 2006.
APPENDIX
Here, we prove Proposition 1. Let hi, 1 ≤ i ≤ 5, denote the
i-th row of the parity check matrix H = [I5, P^T] ∈ F_2^{5×15} of
a binary code with dmin = 4. Let r denote the locality of the
code. Since the locality is r, there is a column of G that can be
written as the sum of at most r other columns of G. Since the
field is binary, this implies that there are r′, 1 ≤ r′ ≤ r + 1,
columns of the generator matrix G = [P, I10] whose sum is
zero. Let k be the number of those r′ columns that are from
P. Suppose r < 6; thus r′ ≤ 6. In this case, we have k ≥ 1,
because the sum of every set of r′ ≤ 6 columns of I10 is non-zero.
Note that the weight of the sum of those k columns of P must be
at most 6 − k, as, otherwise, the sum of the r′ columns cannot
be zero. Therefore, there are k rows of H (corresponding to
the k columns of P) whose sum has weight at most six.
Consequently, to prove Proposition 1, it suffices to show that the
sum of every set of rows of H has weight at least seven,
that is, w(Σ_{i∈A} hi) ≥ 7 for every A ∈ P([5]) \ ∅, where [5]
is the set of integers between 1 and 5.
Since dmin = 4, the weight of every linear combination of
rows of G is at least four. Therefore, the weight of a linear
combination of i ≥ 1 rows of P is at least 4 − i. Hence,
the sum of every set of at most three columns of H is non-zero; in
other words, every three columns of H are independent. Let
uj denote the j-th column of H, where 1 ≤ j ≤ 15. From
every set of five columns of H, we can select four linearly
independent columns. This is because (i) every three columns of
H are independent; and (ii) if a column is a linear combination
of at most three other columns of H, it has to be the sum of
exactly three of them, since the field is binary and, by (i), it cannot
be the sum of two. Without loss of generality, let us assume
that the columns U = {u1, . . . , u4} are independent. Next, we
show that, among the set of columns S = {u5, . . . , u15}, there
are at least seven columns that are not a linear combination
of columns in U.
Since every three columns of H are independent, there is
no column in S which is a linear combination of two columns
in U. We consider two cases. In the first case, we assume that
there is a column in S, say u5, such that u5 = Σ_{i=1}^{4} ui. In this
case, we show that there is no column in S which is a linear
combination of exactly three columns in U. By contradiction,
assume that such a column, say u6, exists. Without loss of
generality, assume u6 = u1 + u2 + u3. Then u6 + u4 + u5 =
0, contradicting dmin = 4. Consequently, in this case, there
is at most one column in S which is a linear combination
of columns in U; in other words, at least 10 columns in S
are independent from U. In the second case, a column in
S can only be a linear combination of exactly three columns
in U. Since there are at most four linear combinations of three
columns of U, there are at least 11 − 4 = 7 columns in S that
are independent from U. Therefore, in both cases, there are at
least seven columns in S that are independent from U.
By contradiction, suppose w(Σ_{i∈A0} hi) < 7 for some
A0 ∈ P([5]). Then Σ_{i∈A0} hi has at least nine zeros.
Consider five columns of H corresponding to five zeros of
Σ_{i∈A0} hi, and select four independent columns from them
(recall that from every set of five columns of H, we can select
four linearly independent columns). Let i1, i2, i3, and i4 be
the indices of those four columns. As shown earlier, among the
remaining 11 columns, there are at least seven columns that
are independent from the selected four. Let i5, . . . , i11 be the
indices of those seven columns. Note that the columns i1, . . . , i4
together with each column j ∈ {i5, . . . , i11} form a full-rank
5 × 5 matrix. Since this 5 × 5 matrix has full rank, every linear
combination of its rows has at least one non-zero
element. Therefore, the element of Σ_{i∈A0} hi at column j must
be 1, as, otherwise, there would be a linear combination of rows of the
5 × 5 matrix (defined by A0) with all-zero elements. This
implies that Σ_{i∈A0} hi must have 1's at columns i5, . . . , i11,
which contradicts the assumption w(Σ_{i∈A0} hi) < 7.