Archiving Cold Data in Warehouses with Clustered Network Coding

Fabien André (Technicolor), Anne-Marie Kermarrec (Inria), Erwan Le Merrer, Nicolas Le Scouarnec, Gilles Straub, Alexandre van Kempen (Technicolor)
Abstract
Modern storage systems now typically combine plain replication and erasure codes to reliably store large amounts of data in datacenters. Plain replication allows fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are now increasingly employed in real systems, they experience high overhead during maintenance, i.e., upon failures, typically requiring files to be decoded before being encoded again to repair the encoded blocks stored at the faulty node.
In this paper, we propose a novel erasure code system, tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce by 50% the amount of data transferred during maintenance, by repairing several cluster files simultaneously. We demonstrate both through an analysis and an extensive experimental study conducted on a public testbed that our approach significantly decreases both the bandwidth overhead during the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact the throughput, and comes only at the price of a higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness by determining whether the cluster's network bandwidth dedicated to repair or the CPU dedicated to decoding saturates first.
Keywords: Distributed Storage, Erasure Codes, Maintenance, Cold Data.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
EuroSys 2014, April 13-16, 2014, Amsterdam, Netherlands.
Copyright © 2014 ACM 978-1-4503-2704-6/14/04... $15.00. http://dx.doi.org/10.1145/2592798.2592816
1. Introduction
Redundancy is key to providing a reliable service in practical systems composed of unreliable components. Typically, distributed storage systems heavily rely on redundancy to mask ineluctable disk/node unavailabilities and failures. While three-way replication is the simplest means to achieve reliability with redundancy, it is now acknowledged that erasure codes can dramatically improve storage efficiency [46].
Major cloud systems such as those of Google [15], Microsoft [5] and Facebook [39] have recently adopted erasure codes, with the most popular choice being Reed-Solomon codes for their simplicity. Since replication has higher storage costs but remains more efficient than codes regarding reads and writes, storage systems now tend to differentiate cold data (i.e., data no longer frequently accessed) from hot data, typically the most popular and most accessed, and process them differently [39]. Plain replication ensures hot data reliability, while erasure codes are used for cold data archival. Indeed, reading erasure-coded data is more time- and resource-consuming than reading replicated data. This sets the scene for new offers like Amazon Glacier [19], providing a low-cost archival system, at the price of file accessibility in the order of hours.
Reed-Solomon codes are the de facto standard of code-based redundancy in practice. However, having been designed for communication systems, they lack an efficient repair procedure, which is important for networked storage systems. Indeed, in storage systems, the level of redundancy decreases over time with failures. An additional maintenance mechanism is thus key to sustain this redundancy and preserve the reliability of stored information. As Reed-Solomon codes are not associated with a tailored mechanism, they suffer from significant overhead in terms of bandwidth usage and decoding operations when maintenance has to be triggered. In order to address these two drawbacks, architectural solutions have been proposed [40], as well as new code designs [12, 24, 29], paving the way for better trade-offs between storage, reliability and maintenance efficiency. The optimal tradeoff has been provided by Dimakis et al. [9] with the use of network coding. This initial work has been
[Figure 1 diagram: with classical Reed-Solomon repair, 2 blocks of file X and 2 blocks of file Y (4 blocks in total) cross the network and must be decoded to restore Node 1's data; with CNC repair, only 3 blocks cross the network and no decoding is needed.]
Figure 1. With Reed-Solomon codes, upon the failure of Node 1, files X and Y are repaired independently, thus requiring the transfer of a total of 4 blocks (2 blocks for each of the files) to decode both files and generate a new block for each file. Instead, in CNC repair, blocks of X and Y are combined at the nodes used for repair, so that only 3 blocks need to be transferred over the network; new blocks for each file are then generated without decoding the original files X and Y.
followed by numerous theoretical studies on coding schemes achieving the tradeoff [11]. However, these code designs either exist only for high redundancy (i.e., rate lower than 1/2) or have high computing costs [13], thus limiting their applicability to practical systems where both low storage overhead and reasonable computing costs are desirable. Moreover, all these studies consider the repair of a single file, thus ignoring potential benefits of repairing several distinct files together.
Instead of relying on specifically structured codes, random codes are an appealing alternative to provide fault tolerance in a distributed setting [1, 9, 10, 18, 31, 33]. They provide a simple and efficient way to construct codes that are optimal w.h.p., such as Reed-Solomon ones, while offering attractive properties in terms of maintenance. However, the practical aspects of the maintenance of such codes have received little attention so far.
In this paper, we propose a novel approach to redundancy management, combining both random codes and network coding, to provide a practical maintenance protocol. The main intuition behind our system is to apply random codes and network coding at the granularity of groups of nodes (clusters), factorizing the repair cost across several files at the same time. This is illustrated in Figure 1.
More specifically, our contributions are the following:
1. We propose a novel maintenance system, combining a clustered placement strategy, random codes and network coding techniques at the node level (i.e., between different files hosted by a single machine). We call this approach CNC, for Clustered Network Coding. We show that CNC halves the data transferred during the maintenance process when compared to standard erasure codes.
                           Replication   Reed-Solomon      CNC
Fault tolerance w.r.t.
  storage overhead         no            yes (optimal)     yes (optimal)
Efficient file access      yes           yes (applied      no
                                         across files)
Low repair bandwidth       yes           no (whole file)   yes (only half)
Reintegration              yes           no                yes

Figure 2. Comparison of CNC with the most commonly implemented redundancy mechanisms: replication, and Reed-Solomon codes (systematic form). Note that Reed-Solomon codes applied across files allow direct reads on encoded data.
Moreover, CNC enables reintegration (i.e., the capability to reintegrate nodes which have been wrongfully declared as faulty). Typically, if Node 1 in Fig. 1 turns out not to have failed (e.g., the system timeout was set at too low a value), the new blocks created by CNC are useful and increase the level of availability. On the contrary, the blocks generated by the classical repair are identical copies of the blocks of Node 1, making reintegration useless. Finally, a simple random selection of nodes during the maintenance process ensures that the network load is evenly balanced between nodes. This enables the storage system to scale with the number of files to repair, as the available bandwidth is consumed as efficiently as it can be. We provide an analysis of CNC demonstrating its performance.
2. We deployed CNC on the public experimental testbed GRID5000 [21], to evaluate its benefits and compare it against Reed-Solomon codes. Experimental results show that encoding and decoding times are similar to the ones of Reed-Solomon codes, while the time to repair a failure is drastically reduced. Using CNC instead of Reed-Solomon codes has a negligible effect on the contention of the archival cluster. Finally, we show that the fact that CNC does not rely on a systematic code such as Reed-Solomon does not hamper the performance of the system, even in failure-free executions.
Figure 2 summarizes the properties of CNC and existing redundancy mechanisms (i.e., replication, and systematic Reed-Solomon codes1), conveying the benefits of CNC in the context of cold data storage, where low storage overhead and repair bandwidth are more important than efficient file access. Note that in the context of this work, we assume that data is erasure coded for enhanced reliability. We emphasize that the objective of this work is not to replace replication with erasure codes, but to provide an efficient maintenance mechanism for erasure coded data in an archival setup (i.e., cold data storage).

1 With systematic Reed-Solomon codes, the encoded data includes the original data in the clear, allowing access to a sub-part of the original data without decoding (see e.g., [39]).
The rest of the paper is organized as follows. We first review the background on maintenance techniques using erasure codes in Section 2. Our novel system is presented in Section 3 and analyzed in Section 4. We evaluate and compare CNC against state-of-the-art approaches in Section 5. Finally, we present related work in Section 6 and conclude in Section 7.
2. Motivation and Background

2.1 Maintenance in Storage Systems
Distributed storage systems are designed to provide a reliable storage service over unreliable components [8, 16, 17, 30, 41]. In order to deal with component failures [15, 45], fault tolerance usually relies on data redundancy; three-way replication is the storage policy adopted by Hadoop [43] or by the Google file system [17], for example. Data redundancy must be complemented with a maintenance mechanism able to recover from the loss of data when failures occur. This preserves the reliability guarantees of the system over time. Maintenance has already lain at the very heart of numerous storage system designs [4, 14, 20, 44]. Similarly, reintegration, which is the capability to reintegrate replicas stored on a node wrongfully declared as faulty, was shown in [7] to be one of the key techniques to reduce the maintenance cost. These studies focused on the maintenance of replicas. While plain replication is easy to implement and maintain, it suffers from a high storage overhead: typically, x instances of the same file are needed to tolerate x - 1 simultaneous failures. This high overhead is a growing concern, especially as the scale of storage systems keeps increasing. This motivates system designers to consider erasure codes as an alternative to replication, in particular for cold data [39]. Yet, using erasure codes significantly increases the complexity of the system and challenges designers for efficient maintenance algorithms.
2.2 Erasure Codes in Storage Systems
Erasure codes have been widely acknowledged as much more efficient than replication [46] with respect to storage overhead. More specifically, Maximum Distance Separable (MDS) codes are known to be optimal: for a given storage overhead (i.e., the ratio of the original quantity of data to store over the total quantity of data including redundancy), MDS codes provide the optimal efficiency in terms of data availability. With an MDS code (n, k), the file to store is split into k chunks, encoded into n blocks with the property that any subset of k out of n blocks suffices to reconstruct the file. Thus, to reconstruct a file of M bytes, one needs to download exactly M bytes, which corresponds to the same amount of data as if plain replication was used.
When using codes, the encoding can be applied per file as shown in Figure 3a, or across files as shown in Figure 3b.
[Figure 3 diagram: (a) encoding per file, where Nodes 1 to 4 store blocks (X1, Y1), (X2, Y2), (X1+X2, Y1+Y2) and (X1+2X2, Y1+2Y2); reading Y requires fetching k blocks and decoding if needed. (b) encoding across files, where Nodes 1 to 4 store X, Y, X+Y and X+2Y; Y can be read directly from Node 2.]

Figure 3. Access to data, coding per file (i.e., independently) or coding across files.
When encoding per file, each file is split into k blocks and encoded independently. As a consequence, the redundancy blocks contain information either about file X or about file Y, but not both. In this case, accessing a file requires downloading k blocks and decoding them if needed. When encoding across files, each file is considered as a block, and k files are encoded together to generate redundancy blocks, e.g., on Nodes 3 and 4 in Figure 3b. In this case, provided that the code used is systematic (i.e., the original data is available in the clear in the encoded data), it is possible to download one block and get the corresponding file efficiently without decoding (e.g., X and Y are stored as is, as shown in Figure 3b). This second design has the advantage of enabling data to be read directly by fetching one block from a single node (e.g., reading Y from Node 2 in Figure 3). This fully leverages the systematic property of codes, limiting both accesses to disks and decoding operations, contrary to the first design, which requires fetching blocks from k nodes and decoding them if needed (e.g., reading Y1 and Y2 from Nodes 1 and 2). However, when codes are non-systematic, only the first design can be used; hence, accesses to files incur k disk operations and a decoding, thus consuming additional I/Os and CPU.
As the codes we build our solution upon are non-systematic, we consider that encoding is applied per file in the rest of the paper. However, for the sake of completeness, in order to measure the impact of contacting several nodes (instead of one), we also compare our scheme to systematic Reed-Solomon codes applied across files in Section 5.
Reed-Solomon codes are a classical example of MDS codes, and are already deployed in cloud-based storage systems [5, 15, 39]. However, as pointed out in [40], one of the major concerns with erasure codes lies in their maintenance process, which incurs significant bandwidth overhead as well as high decoding costs, as explained below.
Maintenance of Erasure Codes. When a node is declared faulty, all blocks of the files it was hosting need to be re-created on a new node. The repair process works as follows: for one block of a file to repair, the new node first needs to download k blocks of this file (i.e., corresponding to the size of the file) to be able to decode it. Once decoded, the new node can re-encode the file and regenerate the lost block. This must be iterated for all the lost blocks. Three issues arise:
1. Repairing one block (possibly a small part of a file) requires the new node to download enough blocks (i.e., k) to reconstruct the entire file. This is required for all the blocks previously stored on the faulty node.
2. The new node must then decode the file, though it does not want to access it2. Decoding operations are known to be time consuming, in particular for large files.
3. Reintegrating a node which has been wrongfully declared as faulty is almost useless. This is due to the fact that the new blocks created during the repair operation have to be strictly identical to the lost ones, for this is necessary to sustain the coding strategy3. Therefore, reintegrating a node results in having two identical copies of the involved blocks (the reintegrated ones and the new ones). Such blocks can only be useful if either the reintegrated node or the new node fails, but not in the event of any other node failure.
In order to mitigate these drawbacks, various solutions have been suggested. Lazy repairs, for instance as described in [4], consist of deliberately delaying the repairs, waiting for several failures before repairing all of them together. This enables repairing multiple failures with the bandwidth (i.e., data transferred) and decoding overhead needed for repairing one failure. However, delaying repairs leaves the system more vulnerable in case of a burst of failures. Architectural solutions have also been proposed, as for example the Hybrid strategy [40]. This consists of maintaining one full replica stored on a single node in addition to multiple encoded blocks. This extra replica is used upon repair, avoiding the decoding operation. However, maintaining an extra replica on a single node significantly complicates the design, while incurring scalability issues. Finally, new classes of codes have been designed [12, 24, 25] which trade storage optimality for a better maintenance efficiency.
Random Codes. CNC relies on random linear codes (random codes for short), which represent an appealing alternative to classical erasure codes in terms of storage efficiency and reliability, while considerably simplifying the maintenance process. Random codes have been initially evaluated

2 Even with systematic codes, for 2/3 of possible failures, a decoding is required, as a block from the systematic part is missing.
3 This can be achieved either by a tracker maintaining the global information about all blocks, or by the new node inferring the exact structure of the lost blocks from all existing ones.
[Figure 4 diagram: file X is split into k = 2 chunks X1, X2, from which n = 5 random linear combinations are produced: 2X1+7X2, 8X1+3X2, 4X1+3X2, 9X1+2X2, 6X1+5X2.]

Figure 4. Creation process of encoded blocks using a random code. All the coefficients are chosen randomly. Any k = 2 blocks are enough to reconstruct the file X.
in the context of distributed storage systems in [1]. The authors showed that random codes can provide an efficient fault-tolerance mechanism with the property that no synchronization between nodes is required. Instead, blocks are generated on each node independently, in such a way that the result fits the coding strategy with high probability. Avoiding such synchronization is crucial in distributed settings, as also demonstrated in [18, 31].
Encoding a file using random codes is simple: each file is divided into k chunks, and the blocks stored for reliability are created as random linear combinations of these k chunks (see Figure 4). All blocks, along with their associated coefficients, are then stored on n different nodes. Note that the additional storage space required for the coefficients is typically negligible compared to the size of each block.
In order to reconstruct a file initially encoded with a given k, one needs to download k different blocks of this file. Random matrix theory over finite fields ensures that if one takes k random vectors of the same subspace, these k vectors are linearly independent with a probability which can be made arbitrarily close to one, depending on the field size [1]. In other words, an encoded file can be reconstructed as soon as any set of k encoded blocks is collected. This represents the optimal solution (MDS codes).
3. Clustered Network Coding
Our CNC system is designed to sustain a predefined level of reliability, i.e., of data redundancy, set by the archival system operator. This reliability level then directly translates into a redundancy factor applied to stored files, with parameters k (number of blocks sufficient to retrieve a file) and n (total number of redundant blocks for a file). A typical scenario for using CNC is a storage cluster like in the Google File System [17], where large files are split into smaller files of the same size, for example 1 GB as in Windows Azure Storage [5]. These files are then erasure coded in order to save storage space. We assume that failure detection is performed by a monitoring system, whose description is out of the scope of this paper. We also assume that this system triggers the repair process, assigning new nodes to replace the faulty ones.
3.1 A Cluster-based Approach
To provide an efficient maintenance, CNC relies on (i) hosting all blocks related to a set of files on a single cluster of nodes, and (ii) repairing multiple files simultaneously. To this end, the system is partitioned into disjoint (logical) clusters of n nodes, so that each node of the storage system belongs to only one cluster. Each file to be stored is encoded using random codes and is randomly associated to a single cluster, so as to balance the storage load on each cluster evenly. All blocks of a given file are then stored on the n nodes of the same cluster. In other words, the CNC placement strategy consists in storing the blocks of two different files belonging to the same cluster on the same set of nodes4. Note that these clusters are constructed at a logical level. In practice, nodes of a given cluster may span geo-dispersed sites to provide enhanced reliability. Obviously, there is a trade-off between minimizing inter-site traffic and high reliability; this is outside the scope of this paper. In such a setup, the archival system manager (e.g., the master node in the Google File System [17]) only needs to maintain two data structures: an index which maps each file to one cluster, and an index which contains the set of identifiers of the nodes in each cluster. This simple data placement scheme leads to significant data transfer gains and better load balancing, by clustering operations on encoded blocks, as explained in the remaining part of this section.
3.2 Maintenance of CNC
When a node failure is detected, the maintenance operation should ensure that all blocks hosted on the faulty node are repaired. This preserves the redundancy factor and hence the predefined reliability level of the archival system. While in most systems repair is usually performed at the granularity of a file, a node failure typically leads to the loss of several blocks, involving several files. CNC precisely leverages this characteristic; when a node fails, multiple repairs are triggered, one for each particular block of one file that the faulty node was storing. Traditional approaches using erasure codes actually consider a failed node as the failure of all of its blocks. By contrast, the novelty of CNC is to leverage network coding at the node level (i.e., between multiple blocks of different files on a particular cluster). This is possible since the CNC placement strategy clusters files so that all nodes of a cluster store the same files. Network coding has already been studied to reduce the bandwidth during maintenance [9, 22, 23], but only at the file level (i.e., between multiple blocks of a single file). CNC differs from these works as it repairs different files simultaneously by mixing them, thus enabling the reduction of the amount of data to be transferred during the maintenance process in practical archival systems.
4 An analytical evaluation of the mean time to data loss for such a clustering placement can be found in [6].
3.3 An Illustrative Example
Before generalizing in the next section, we first describe a simple example (see Figure 5). This provides the intuition behind CNC compared to a classical maintenance process. We consider two files X and Y of size M = 1024 MB, encoded with random codes (k = 2, n = 4), stored on the 4 nodes of the same cluster (i.e., Nodes 1 to 4). File X is chunked into k = 2 chunks X1, X2 and file Y into chunks Y1 and Y2. Each node stores one encoded block related to X and one encoded block related to Y, which are respectively random linear combinations of {X1, X2} and {Y1, Y2}. Each block is of size M/k = 512 MB, so that each node stores a total of 1024 MB.

Let us consider the failure of Node 4. In a classical repair process, the new node asks k = 2 nodes for their blocks corresponding to files X and Y and downloads 4 blocks, for a total of 2048 MB. This enables the new node to decode the two files independently, and then re-encode each file to regenerate the lost blocks of X and Y and store them.

Instead, CNC leverages the fact that the encoded blocks related to X and Y are stored on the same node, and restored on the same new node, to encode the files together rather than independently during the repair process. More precisely, if the nodes are able to compute a linear combination of their encoded blocks, we can prove that if k = 2, only 3 blocks are sufficient to perform the repair of the two files X and Y. Thus, the transfer of only 3 blocks incurs the download of 1536 MB, instead of the 2048 MB needed with the classical repair process. In addition, this repair can be processed without decoding any of the two files. In practice, the new node has to contact the three remaining nodes to perform the repair. Each of the three nodes sends the new node a random linear combination of its two blocks, with the associated coefficients. Note that the two files are now intermingled (i.e., encoded together). However, we want to be able to access each file independently after the repair. The challenge is thus to create two new random blocks, with the restriction that one is only a random linear combination of the X blocks, and the other of the Y blocks. In this example, finding the appropriate coefficients in order to cancel the Xi or Yi comes down to solving, for each file X and Y, a system of two equations with three unknowns5. The new node then makes two different linear combinations of the three received blocks according to the previously computed coefficients, (A=-6, B=-22, C=25) and (D=20, E=9, F=-17) in the example. Thereby it creates two new independent random blocks, related to files X and Y respectively. The repair is then performed, saving the bandwidth consumed by the transfer of one block (i.e., 512 MB in this example).
5 The system is the following: {4A + 8B + 8C = 0; 14A + 3B + 6C = 0} for (A, B, C), and {15D + 12E + 24F = 0; 9D + 14E + 18F = 0} for (D, E, F), in Figure 5.
[Figure 5 diagrams. (a) CNC repair: Nodes 1, 2 and 3 store the block pairs (2X1+7X2, 5Y1+3Y2), (8X1+3X2, 6Y1+7Y2) and (4X1+3X2, 8Y1+6Y2) respectively; each sends one repair block, a local random combination of its two blocks: Node 1 sends 2·(2X1+7X2) + 3·(5Y1+3Y2) = 4X1+14X2+15Y1+9Y2, Node 2 sends 1·(8X1+3X2) + 2·(6Y1+7Y2) = 8X1+3X2+12Y1+14Y2, and Node 3 sends 2·(4X1+3X2) + 3·(8Y1+6Y2) = 8X1+6X2+24Y1+18Y2. Only 3 repair blocks are transmitted; the new Node 4 locally combines them into 16X1+205X2+0Y1+0Y2 (a new block of X) and 0X1+0X2+246Y1+88Y2 (a new block of Y), without decoding. (b) Classical repair: the new Node 4 downloads 4 blocks (8X1+3X2, 4X1+3X2, 6Y1+7Y2, 8Y1+6Y2), decodes files X and Y, and re-encodes the lost blocks 9X1+2X2 and 3Y1+8Y2.]

Figure 5. Comparison between CNC and the classical maintenance process, for the repair of a failed node which was storing two blocks of two different files (X and Y) in a cluster of 4 nodes (with k = 2, n = 4). All stored blocks, as well as transferred blocks and repair blocks in the example, have exactly the same size.
Note that the example is given over the integers for simplicity, though arithmetic operations would be computed over a finite field in an implementation.
3.4 CNC: The General Case
We now generalize the previous example for any k. We first define a repair block object: a repair block is a random linear combination of two encoded blocks of two different files stored on a given node. Repair blocks are transient objects which only exist during the maintenance process (i.e., repair blocks only transit on the network and are never stored permanently). We are now able to formulate the core technical result of this paper; the following theorem applies in a context where different files are encoded using random codes with the same k, and the encoded blocks are placed according to the cluster placement described in the previous section.
Theorem 1. In order to repair two different files, downloading k + 1 repair blocks from k + 1 different nodes is a sufficient condition.
Repairing two files jointly actually comes down to creating one new random block for each of the two files. The formal proof, provided in the technical report [3], relies on showing that vectors resulting from CNC operations remain random, which ensures that blocks do not degenerate in the long run due to successive operations performed over them. This theorem thus implies that instead of having to download 2k blocks as with Reed-Solomon codes when repairing, CNC decreases that need to only k + 1. Other implications and analysis are detailed in the next section. Note that the encoded blocks of the two files do not need to have the same size. In case of different sizes, the smallest is simply zero-padded during the network coding operations, as is usually done in this context; the padding is then removed at the end of the repair process. In a real system, nodes usually store far more than two blocks, implying multiple iterations of the process previously described. More formally, to restore a failed node which was storing x blocks, the repair process must be iterated x/2 times. In fact, as two new blocks are repaired during each iteration, the number of iterations is halved compared to the classical repair process. Note that in case of an odd number of blocks stored, the repair process is iterated until only one block remains. The last block is repaired by downloading k blocks of the corresponding file, which are then randomly combined to conclude the repair. The overhead related to the repair of the last block in case of an odd block number becomes negligible with a growing number of blocks stored.
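The iteration scheme described above can be sketched as a scheduling loop; this is an illustrative outline only (names are ours), and it omits the block transfers and combinations themselves:

```python
import random

def plan_repairs(lost_files, cluster_nodes, k):
    """Return a list of (files, nodes) iterations to rebuild one failed
    node: files are taken two at a time (Theorem 1: k+1 repair blocks from
    k+1 nodes repair both), with an odd leftover repaired classically."""
    remaining = list(lost_files)
    schedule = []
    while len(remaining) >= 2:
        pair = (remaining.pop(), remaining.pop())
        schedule.append((pair, random.sample(cluster_nodes, k + 1)))
    if remaining:
        # odd leftover: single-file repair from k blocks of that file
        schedule.append(((remaining.pop(),), random.sample(cluster_nodes, k)))
    return schedule

plan = plan_repairs(["f1", "f2", "f3", "f4", "f5"], list(range(8)), k=4)
assert len(plan) == 3                                  # ceil(5/2) iterations
assert [len(nodes) for _, nodes in plan] == [5, 5, 4]  # k+1, k+1, then k
```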
The fact that the repair process must be iterated several times can also be leveraged to balance the bandwidth load over all the nodes in the cluster. Only k + 1 nodes, out of the n nodes of the cluster, are selected at each iteration of the repair process. As all nodes of the cluster have a symmetrical role, a different set of k + 1 nodes can be selected at each iteration. In order to leverage the whole available bandwidth of the cluster, CNC makes use of a random selection of these k + 1 nodes at each iteration. In other words, for each round of the repair process, the new node selects k + 1 nodes uniformly at random over the n cluster nodes. Doing so, we show that every node is evenly loaded, i.e., each node sends the same number of repair blocks in expectation.
More formally, let N be the number of repair blocks sent by a given node. In a cluster where n nodes participate in the maintenance operation, for T iterations of the repair process, the average number of repair blocks sent by each node is:

E(N) = T (k + 1) / n    (1)
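Equation (1) is straightforward to check by simulation; the following quick sketch (parameter values are arbitrary) verifies that uniform random selection of k + 1 senders per iteration loads every node with close to T(k + 1)/n repair blocks:

```python
import random
from collections import Counter

n, k, T = 16, 8, 100_000          # cluster size, code parameter, iterations
sent = Counter()
for _ in range(T):
    for node in random.sample(range(n), k + 1):  # uniform choice of k+1 nodes
        sent[node] += 1

expected = T * (k + 1) / n        # equation (1): 56250 per node here
assert all(abs(sent[i] - expected) < 0.02 * expected for i in range(n))
```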
An example illustrating this load balancing is provided in the next section.
4. CNC Analysis
The novel maintenance protocol proposed in the previous section enables (i) significantly reducing the amount of data transferred during the repair process; (ii) balancing the load between the nodes of a cluster; (iii) avoiding computationally intensive decoding operations; and finally, (iv) providing useful reintegration. These benefits are now detailed.
4.1 Transfer Savings
A direct implication of Theorem 1 is that for large enough values of k, the required data transfer to perform a repair is halved; this directly results in a better usage of the available bandwidth. To repair two files in a classical repair process, the new node needs to download at least 2k blocks to be able to decode each of the two files. The ratio (k + 1)/2k (CNC over Reed-Solomon) thus tends to 1/2 as larger values of k are used.
The exact amount of data needed to repair x blocks of size s, all encoded with the same k, is given as follows:

(x/2) · s · (k + 1)                 if x is even
(x/2) · s · (k + 1 + (k - 1)/x)     if x is odd
An example of the transfer savings is given in Figure 6,for k =
16 and a file size of 1 GB.
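The piecewise formula above can be checked with a short helper (a sketch; the function names and the per-block accounting are ours, not from the paper — pairs of blocks cost s(k+1), a leftover odd block costs sk on its own, which matches the closed form for odd x):

```python
def repair_transfer(x, k, s):
    """Data transferred to repair x lost blocks of size s under CNC.

    Blocks are repaired in pairs, each pair costing s*(k+1); a leftover
    odd block is repaired alone at cost s*k, which matches the closed
    form (x/2)*s*(k+1 + (k-1)/x) for odd x.
    """
    pairs, odd = divmod(x, 2)
    return pairs * s * (k + 1) + odd * s * k

def classical_transfer(x, k, s):
    """Classical repair: k blocks of size s downloaded per lost block."""
    return x * s * k

# For k = 16 and 1 GB files (the Figure 6 setting), CNC moves about half:
k, s = 16, 1  # sizes in GB
for x in (2, 100, 1000):
    print(x, repair_transfer(x, k, s) / classical_transfer(x, k, s))
    # each even-x ratio equals (k+1)/(2k) = 0.53125
```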
From Theorem 1, CNC repairs lost files in groups of two. One can wonder whether there is a benefit in grouping more than two files during the repair. In fact, a simple extension of Theorem 1 reveals that to group G files together, a sufficient condition is that the new node downloads (G-1)k + 1 repair blocks from (G-1)k + 1 distinct nodes over the n nodes of the cluster. Firstly, this implies that the new node must be able to contact many more nodes than k+1. Secondly, we can easily see that the gains made possible by CNC are maximal when two files are considered simultaneously: savings in data transfer when repairing are expressed by the ratio ((G-1)k+1)/(Gk). The minimal value of this ratio (1/2, which is equivalent to the maximal gain) is obtained for G = 2 and large values of k.
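A quick numeric check of this ratio (a sketch; the function name is ours) confirms that grouping beyond two files only loses ground:

```python
def group_ratio(G, k):
    """Transfer ratio ((G-1)*k + 1) / (G*k) when repairing G files together."""
    return ((G - 1) * k + 1) / (G * k)

k = 16
for G in (2, 3, 4):
    print(G, group_ratio(G, k))
# The ratio grows with G, so G = 2 gives the largest saving;
# for G = 2 it tends to 1/2 as k grows.
```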
A second natural question is whether or not downloading fewer than (G-1)k + 1 repair blocks to group G files together is possible. We can answer this question positively, as the value (G-1)k + 1 is only a sufficient condition. In fact, if nodes do not send random combinations, but carefully choose the coefficients of the combination, it is theoretically possible to download fewer repair blocks. However, as G grows, finding such coefficients becomes computationally intractable, especially for large values of k. This calls for the use of the simpler operation, i.e., G = 2, as presented in this paper.
Figure 6. Necessary amount of data to transfer to repair a failed node, according to the selected redundancy scheme (1 GB files). [Plot: repair bandwidth (GB) vs. number of files to repair, for Replication, RS, and CNC with k = 8, 16, 32.]
4.2 Load Balancing

As previously mentioned, when a node fails, the repair process is iterated as many times as needed to repair all lost blocks. CNC ensures that the load over the remaining nodes is balanced during maintenance, thanks to the random selection of the k+1 nodes at each round.
Consider a scenario involving a 5-node cluster, storing 10 different files encoded with random codes (k = 2). Node 5 has failed, involving the loss of 10 blocks of the 10 files stored on that cluster. Nodes 1 to 4 are available for the repair process. T = 5 iterations of the repair process are necessary to recreate the 10 new blocks, as each iteration repairs 2 blocks at the same time. The total number of repair blocks sent during the whole maintenance is T(k+1) = 15, whereas the classical repair process needs to download 20 encoded blocks. The random selection ensures in addition that the load is evenly balanced between the available nodes of the cluster. Here, nodes 1, 2 and 4 are selected during the first repair round, then nodes 2, 3 and 4 during the second round, and so forth. The total number of repair blocks is balanced between all available nodes, each sending T(k+1)/n = 15/4 = 3.75 repair blocks on average.
As a consequence of using the whole available bandwidth in parallel, as opposed to sequentially fetching blocks from only a subset of nodes, the Time To Repair (TTR) a failed node is also greatly reduced.
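The 5-node scenario above can be replayed with a toy simulation (a sketch; the parameters are those of the example, and uniform sampling stands in for the CNC selection rule):

```python
import random
from collections import Counter

random.seed(0)
n, k, T = 4, 2, 5    # 4 surviving nodes, k = 2, T = 5 repair rounds
trials = 20000       # average over many runs of the whole maintenance
sent = Counter()
for _ in range(trials):
    for _ in range(T):  # one round repairs two blocks at once
        for node in random.sample(range(n), k + 1):
            sent[node] += 1  # each selected node sends one repair block

avg = [sent[i] / trials for i in range(n)]
print(avg)  # each entry close to T*(k+1)/n = 15/4 = 3.75
```

Each run sends exactly T(k+1) = 15 blocks in total (vs. 20 for the classical repair), and the per-node load converges to 3.75 as predicted by Equation (1).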
4.3 No Decoding Operations

Decoding operations are known to be time consuming and should therefore be performed only upon file accesses. While the use of classical erasure codes requires such decoding to take place upon repair, CNC avoids those operations. In fact, no file ever needs to be decoded in CNC: repairing two blocks only requires computing two linear combinations instead of decoding the two files. This greatly simplifies the repair process over classical approaches. As a consequence, the time to perform a repair is reduced compared to the classical repair process, especially when dealing with large files, as confirmed by our experiments in Section 5.
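To make the "no decoding" point concrete, here is a minimal sketch of a CNC-style repair over GF(256) (our own toy parameters: k = 2, two files A and B, 8-byte blocks; the coefficient bookkeeping follows the paper's description, the rest is illustrative). Each of k+1 nodes sends one random combination of its two stored blocks; the new node cancels the B-coefficients to obtain a fresh random block of file A — without ever decoding either file. The new block for B is obtained symmetrically.

```python
import random

# GF(256) arithmetic via log/antilog tables, primitive polynomial 0x11D.
EXP, LOG = [0] * 512, [0] * 256
v = 1
for i in range(255):
    EXP[i], LOG[v] = v, i
    v = (v << 1) ^ (0x11D if v & 0x80 else 0)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[255 - LOG[a]]

def lincomb(scalars, vectors):
    """sum_i scalars[i]*vectors[i], elementwise over GF(256) (XOR = add)."""
    out = [0] * len(vectors[0])
    for s, vec in zip(scalars, vectors):
        for j, x in enumerate(vec):
            out[j] ^= gmul(s, x)
    return out

random.seed(7)
k, size = 2, 8
A = [[random.randrange(256) for _ in range(size)] for _ in range(k)]  # file A
B = [[random.randrange(256) for _ in range(size)] for _ in range(k)]  # file B

while True:
    repair = []  # one repair block per contacted node, k+1 nodes in total
    for _ in range(k + 1):
        ca = [random.randrange(1, 256) for _ in range(k)]  # stored A block
        cb = [random.randrange(1, 256) for _ in range(k)]  # stored B block
        a_blk, b_blk = lincomb(ca, A), lincomb(cb, B)
        g, d = random.randrange(1, 256), random.randrange(1, 256)
        data = lincomb([g, d], [a_blk, b_blk])  # random mix of the two blocks
        repair.append(([gmul(g, c) for c in ca],  # coefficients over A
                       [gmul(d, c) for c in cb],  # coefficients over B
                       data))
    # Pick l = (l1, l2, 1) cancelling the B-coefficients: solve the 2x2
    # system l1*cb1 + l2*cb2 = cb3 by Cramer's rule (XOR is subtraction).
    cb1, cb2, cb3 = (r[1] for r in repair)
    det = gmul(cb1[0], cb2[1]) ^ gmul(cb2[0], cb1[1])
    if det:
        break  # re-draw the (rare) singular case

l1 = gmul(gmul(cb3[0], cb2[1]) ^ gmul(cb3[1], cb2[0]), ginv(det))
l2 = gmul(gmul(cb1[0], cb3[1]) ^ gmul(cb1[1], cb3[0]), ginv(det))
l = [l1, l2, 1]

new_data = lincomb(l, [r[2] for r in repair])
new_coeffs = lincomb(l, [r[0] for r in repair])        # only the A-part remains
assert lincomb(l, [r[1] for r in repair]) == [0] * k   # B-part cancelled
assert new_data == lincomb(new_coeffs, A)              # a valid new block of A
```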
4.4 Reintegration

The decision to declare a node as faulty is usually made using timeouts; this is typically an error-prone decision [7]. In fact, nodes can be wrongfully timed out and can reconnect once the repair is done [27]. While the longer the timeouts, the fewer errors are made, adopting large timeouts may jeopardize the reliability guarantees, typically in the event of bursts of failures. The interest of reintegration is to be able to leverage the fact that nodes which have been wrongfully timed out are reintegrated in the system. The authors of [7] showed that reintegration is a key concept to save maintenance bandwidth. However, reintegration has not been addressed when using erasure codes.
As previously mentioned, when using classical erasure codes, the repaired blocks have to be strictly identical to the lost ones. Therefore, reintegrating a node which was suspected as faulty is almost useless, since this results in two identical copies of the lost and repaired blocks. Such blocks can only be useful in the event of the failure of two specific nodes: the incorrectly timed-out node and the new one. Instead, reintegration is always useful when deploying CNC. More precisely, every single new block can be leveraged to compensate for the loss of any other block, and is therefore useful in the event of the failure of any node. Indeed, newly created blocks are simply new random blocks, thus different from the lost ones while being functionally equivalent. Therefore, each new block contributes to the redundancy factor of the cluster.
5. Evaluation

In order to confirm the theoretical savings provided by the CNC repair protocol in terms of bandwidth utilization and decoding operations, we deployed CNC over an experimental platform. We now describe the implementation of the system and the CNC experimental results.
5.1 System Overview

We implemented a simple storage cluster with an architecture similar to Hadoop [43] or the Google File System [17]. This architecture is composed of one tracker node that manages the metadata of files, and several storage nodes that store the data. This set of storage nodes forms a cluster as defined in Section 3. The overview of the system architecture is depicted in Figure 7. Client nodes can PUT/GET the data directly to/from the storage nodes, after having obtained their IP addresses from the tracker. In case of a storage node failure, the tracker initiates the repair process and schedules the repair jobs. All files to be stored in the system are encoded using random codes with the same k. Let n be the number of storage nodes in the cluster; then n encoded blocks are created for each file, one for each storage node. Note that the
Figure 7. Experimental System Overview. [Diagram: a tracker node holding file metadata, a client node exchanging files with the cluster of storage nodes, and a new node exchanging ASK_REPAIRBLOCK / REPAIRBLOCK messages with the cluster.]
system can thus tolerate n - k storage node failures before files are lost for good.
Operations In the case of a PUT operation, the client first encodes blocks. The coefficients of the linear combination associated with each encoded block are appended at the beginning of the block. Those n encoded blocks are sent to the n storage nodes of the cluster using a PUT_BLOCK_MSG. A PUT_BLOCK_MSG contains the encoded information, as well as the hash of the corresponding file. Upon the receipt of a PUT_BLOCK_MSG, the storage node stores the encoded block using the hash as filename. To retrieve the file, the client sends a GET_BLOCK_MSG to at least k nodes, out of the n nodes of the cluster. A GET_BLOCK_MSG only contains the hash of the file to be retrieved. Upon the receipt of a GET_BLOCK_MSG, the storage node sends the block corresponding to the given hash. As soon as the client has received k blocks, the file can be recovered.
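The PUT/GET message handling reduces to a small content-addressed store. The sketch below mirrors the description above (the class and handler names are ours, and the encoded blocks are stand-in placeholders, since the coding itself is out of scope here):

```python
import hashlib
import random

class StorageNode:
    """A storage node keeps encoded blocks keyed by the file hash."""
    def __init__(self):
        self.blocks = {}

    def on_put_block(self, file_hash, block):  # handles PUT_BLOCK_MSG
        self.blocks[file_hash] = block         # the hash is the filename

    def on_get_block(self, file_hash):         # handles GET_BLOCK_MSG
        return self.blocks[file_hash]

n, k = 6, 4
cluster = [StorageNode() for _ in range(n)]

data = b"cold archive contents"
file_hash = hashlib.sha256(data).hexdigest()

# PUT: the client encodes n blocks (coefficients prepended to each one)
# and sends one per storage node; placeholders stand in for encoded data.
for i, node in enumerate(cluster):
    node.on_put_block(file_hash, ("coeffs-%d" % i, "encoded-%d" % i))

# GET: query any k of the n nodes; k encoded blocks suffice to decode.
answers = [node.on_get_block(file_hash) for node in random.sample(cluster, k)]
assert len(answers) == k
```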
In case of a storage node failure, a new node is selected by the tracker to replace the failed one. This new node sends an ASK_REPAIRBLOCK_MSG to k+1 storage nodes. An ASK_REPAIRBLOCK_MSG contains the two hashes of the two blocks which have to be combined following the repair protocol described in Section 3. Upon the receipt of an ASK_REPAIRBLOCK_MSG, the storage node combines the two encoded blocks corresponding to the two hashes, and sends the resulting block back to the new node. As soon as k+1 blocks are received, the new node can regenerate two lost blocks. This process is iterated until all lost blocks are repaired.
5.2 Deployment and Results

We deployed this system on the Grid5000 experimental testbed [21]. The experiment ran on 24 storage nodes, 1 tracker node, and 4 client nodes, all connected through a 1 Gbps network. Each node has 2 Intel Xeon E5520 CPUs at 2.26 GHz, 32 GB RAM and two 300 GB SAS hard drives used in RAID-0. The 24 storage nodes form a logical cluster, as defined in Section 3. All files were encoded with k = 16, and have a size of 1 GB, which is the size used for sealed extents in Windows Azure Storage [5].
Figure 8. Single-core in-memory encoding and decoding throughput for various file sizes with k = 16, and for various values of k with file size = 1024 MB, on Xeon E5520 (2.26 GHz) and Xeon E5-2630 (2.30 GHz) running Linux 3.2 64-bit with 32 GB RAM. [Eight panels: throughput (MB/s) for encoding and decoding on each CPU, vs. file size (16-1024 MB) and vs. k (4-24, n = int(k*1.5)), comparing CNC and RS.]
Implementation We implemented the coding logic of CNC as a library in C relying on GF-Complete [36] for finite-field operations. The networking and storage logic has been implemented in C++ using Boost.Asio. The client performs all encoding/decoding operations using one dedicated thread, possibly sending/receiving other blocks while computing. However, the storage nodes being repaired do not perform computation while receiving data; latency is less critical at this stage, since no user is directly impacted. Besides CNC, our system also supports systematic Reed-Solomon codes, applied either per file or across files as described in Figure 3. We compare CNC to these two systematic Reed-Solomon coding schemes, denoted Reed-Solomon or RS.
Encoding/Decoding performance We first look at the in-memory encoding and decoding rates of our CNC library, on two different machines (one considered slow, and the other fast). Those rates are measured when using random codes for various code lengths (k), depending on the size of the file to be encoded (16 MB to 1024 MB) and on the hardware of the two machines. Results are depicted in Figure 8. For a given (k, n), encoding and decoding rates are close to linear in the file size. For example, with (k = 16, n = 24) the encoding of a 1 GB file occurs at 125 MB/s on the Xeon E5520, while the faster Xeon E5-2630 encodes at around 200 MB/s. Decoding speeds are 200 MB/s and 300 MB/s respectively. This confirms that machine architectures are crucial for performance when dealing with coding libraries [36, 37].
Both the encoding and decoding rates (the latter applying when some data from the systematic part is missing) are also reported for Reed-Solomon codes, as provided by the Jerasure library [38]. We observe that rates for CNC and Reed-Solomon codes are fairly similar, CNC being a bit faster for decoding and a bit slower for encoding. The minor difference between the two schemes is due to the fact that, for the block sizes (i.e., 1 to 64 MB) and the k (i.e., 4 to 24) that we consider, applying the operations to the data completely dominates other costs (e.g., inverting matrices of coefficients, which is costlier for random matrices than for Reed-Solomon generator matrices).
Repair Time In this experiment, we measure the total repair time upon a node failure, depending on the amount of storage of the faulty node. The results, depicted in Figure 9, include the time to receive repair blocks at the new node, the time to compute (decoding for Reed-Solomon codes, and linear combinations for new block creation for CNC), as well as the wait time (the delay until the last repair block has been received, allowing the operation to complete). Hence, it represents the effective time between failure detection and complete repair.
Figure 9 shows that the repair time is dramatically reduced when using CNC compared to Reed-Solomon codes, especially with an increasing amount of data to be repaired. For instance, to repair a node hosting 128 GB of data, CNC and Reed-Solomon codes require respectively 824 and 2076 seconds (i.e., a 60% reduction when using CNC). These time savings are mainly due to the fact that decoding operations are avoided in CNC, and that less information is transferred.
Figure 9. Repair time for CNC and Reed-Solomon codes for various amounts of data (16, 32, 64 and 128 GB). The total time is split between waiting time (for response), reception time (over the network) and time dedicated to computing.
PUT and GET performance without failures Figure 10 shows the performance of PUT operations from a single client accessing the cluster. The system is able to perform PUT operations at a rate of 40 MB/s for CNC and 45 MB/s for Reed-Solomon codes with encoding per file, and at a rate of 55 MB/s for Reed-Solomon codes with encoding across files. CNC and Reed-Solomon codes exhibit similar performance when applied per file. This is consistent with the encoding speed we observed in Figure 8. Encoding across files is slightly faster due to the fact that files do not need to be split into chunks before being encoded.
Figure 11 shows the performance of GET operations from a single client. They are performed at a rate of 110 MB/s for both CNC and Reed-Solomon codes (encoding per file). For these, the network (1 Gbps) is clearly the limiting factor, which is again consistent with the high decoding speed (greater than 190 MB/s) that we observed in Figure 8. In fact, for Reed-Solomon coding applied across files, there is a slight performance drop (around 90 MB/s): in this case, the GET contacts only one storage node, thus opening a single TCP connection and reading from a single storage node. Hence, the client does not saturate the 1 Gbps link, as is the case for encoding per file, where k TCP connections are opened (parallel reads from k storage nodes).
Figure 12 shows the performance of multiple clients accessing the cluster concurrently. The clients perform GET operations continuously for 30 minutes and we compute the aggregate throughput of all clients. We observe that there is no strong degradation of performance due to concurrency. Note that Reed-Solomon codes with encoding across files fully leverage the systematic nature of such codes: reading a block incurs only one disk seek without requiring a decoding operation. Yet, our experiments show that this property does not hamper CNC, because access to disk is a negligible factor. The gap between encoding per file and encoding across files diminishes as the number of clients increases, as can be
Figure 10. Single-client throughput of the PUT operation for various amounts of data. [Plot: PUT throughput (MB/s) vs. total amount put (GB), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]
Figure 11. Single-client throughput of the GET operation for various amounts of data. [Plot: GET throughput (MB/s) vs. total amount get (GB), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]
expected. Up to 4 running clients, the performance increases linearly, as clients do not compete for resources. For 4 clients performing continuous GET operations in parallel, the average aggregate throughput is 360-365 MB/s for CNC and RS per file, and 380 MB/s for RS across files. Coding across files has a slight advantage (less than 5%) when 4 clients continuously query the storage cluster at around 90 MB/s each: indeed, coding across files implies bigger blocks and fewer TCP connections per client. This has a limited impact and was not visible for a single client.
The number of concurrent clients we consider (i.e., 4 in this experiment) is already much higher than the average number of clients that would access data on a cluster storing cold data as described in [39]. In [39], the logical cluster is composed of 36 TB storage nodes, and data is marked as cold if not accessed for at least 3 months. Let us assume that the system offers each user 100 GB of archival capacity, with 1 GB archive files, and that users access one of their backup archives once a month. In this case, we would observe a read rate as low as 3 x 10^-3 reads/s, which is much lower than the read rate of our experiment, and also much lower than the read rate reported for hot data on production systems (e.g., 30 reads/s [15]). As a consequence, the penalty
Figure 12. Multiple-client throughput of the GET operation for various amounts of data. [Plot: aggregate GET throughput (MB/s) vs. number of clients (1-4), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]
in I/O due to encoding per file rather than across files (see Figure 3) has a negligible impact for cold data. This is confirmed experimentally: according to these figures, there is no penalty in using CNC for 1 client and little impact for up to 4 clients (our cluster only allowed 4 client nodes). Also, the decoding is not a limiting factor. Hence, CNC compares favorably to Reed-Solomon codes in spite of the decoding needed. The only additional cost that must be paid for using CNC is a higher CPU utilization. In our experiments, a single dedicated thread (and thus one CPU core used at 100%) was sufficient for handling the encoding and decoding operations. The negligible impact in terms of I/O is consistent with the block sizes that we consider. For a file of 1 GB split into k = 16 blocks, we need approximately 8 seconds to receive all blocks of 64 MB over the network, which is 2000 times longer than a typical disk seek (4 ms).6
Performance under failures Figure 13 plots the performance of the system when it has suffered failures. We ran a set of experiments measuring the GET performance of several concurrent clients running on 4 client nodes. A fixed number of storage nodes were continuously failed so as to evaluate the performance of the various coding schemes under the possible failure configurations (e.g., 1 failure of a systematic node, 1 failure of a non-systematic node, 2 failures of systematic nodes). For a given failure probability, we then evaluate the average performance, which depends on the probability that the system is in each of the possible failure configurations. This allows us to measure the average performance of the system in spite of failures being rare and occurring rarely in the time span of a real experiment.

From Figure 13, we observe that the average throughput for CNC and Reed-Solomon (per file) does not degrade as the failure rate increases. This is due to the fact that the clients do not saturate the capacity of the remaining storage nodes. Indeed, in our setting, after two failures, 22
6 Even for smaller files of 16 MB split into k = 16 blocks of 1 MB, we still need approximately 0.125 seconds to receive all blocks of 1 MB, which remains 30 times longer than a typical disk seek.
Figure 13. Bandwidth depending on the failure rate. [Plot: average GET throughput (MB/s) vs. failure rate (0.001-0.1), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]
of the 24 storage nodes are still available. When encoding is applied across files, a slight degradation is observed: when a failure occurs, it has a probability of 16/24 of affecting one of the systematic nodes, so that accessing the file stored on that node requires performing a degraded read (i.e., transferring encoded data corresponding to 16 files and decoding this data to recover the single file we try to read). This degradation remains limited since, even if failures are frequent, failures do not necessarily affect the data read. Limiting the impact of such degraded reads is a current area of research in codes for storage, as discussed in the next section.
Impact of data coldness From the previous experiments, we observed that CNC does not degrade the throughput when compared to Reed-Solomon applied across files, but comes at the price of a higher CPU utilization when reading data (i.e., GET operations). However, CNC also halves the repair bandwidth. In this subsection, we study which of the network or the CPU is the most limiting factor for various failure rates (e.g., once a day to once a year).
Data is considered cold when not accessed for a given amount of time. We consider various thresholds for data coldness (from 1 day to 1 year). A threshold of 1 month means that data is accessed at most once a month. We use this as an upper bound on the read rate to infer the corresponding upper bound on CPU usage.
We consider a cluster of storage nodes connected through a 1 Gbps network, having Xeon E5520 processors and 72 TB of storage. The system also comprises 4 client nodes, which dedicate 1 CPU core to decoding operations. Storage nodes dedicate up to 1/4 of their bandwidth to repairs, reserving the other 3/4 for other operations such as reads. In Figure 14a, we plot the percentage of CPU usage (i.e., the average usage of the cores that client nodes dedicate to decoding operations) for CNC. CPUs are not saturated if data accessed less than once every 14 days is marked as cold. In Figure 14b, we plot the percentage of dedicated bandwidth usage for repair operations with Reed-Solomon. The network is saturated if repairs are needed at least once per month. Figure 14c summarizes the thresholds (i.e., the minimal settings where the resource they consume the most is not saturated) for using CNC, Reed-Solomon, or either of them.

Figure 14. CPU saturation for various coldness thresholds (a) and network saturation for various mean times between failures (b). Thresholds where CNC or RS are preferred (c). [Panels: (a) CPU usage vs. data-coldness threshold (days) for CNC decoding; (b) network usage vs. time between failures (days) for RS repair; (c) regions over the (coldness threshold, time between failures) plane: both saturated (none applicable), CNC saturated (RS preferred), none saturated (either), RS saturated (CNC preferred).]

Obviously, these conclusions hold in our experimental setting (1 Gbps Ethernet and one Xeon E5520) and need to be adapted when considering faster networks or CPUs. Yet, in our experience, they hold in current network and architecture configurations.
Impact of decoding As stated earlier, the main impact of using CNC is a higher CPU cost for decoding. In our setting (i.e., a cluster from Grid5000 interconnected using 1 Gbps Ethernet), the CPU was not the most limiting factor, as a single core could decode up to 200 MB/s (1600 Mbps) according to Figure 8.

Figure 8 also shows that on modern CPUs, 1 core can decode at more than 250 MB/s (2 Gbps). These results extrapolate to multiple cores, as the decoding can be performed in parallel by multiple cores (i.e., decoding is performed by stripes, and stripes can be processed independently). Table 1 gives the throughput of various processors and the corresponding network they allow to saturate. Notice that appropriate CPUs allow saturating 1 Gbps, 10 Gbps, or even 40 Gbps networks. A 100 Gbps Ethernet network would however not be saturated, leaving a slight advantage to systematic Reed-Solomon (i.e., 100 Gbps for RS vs. 80 Gbps for CNC with 4 Xeon E7-4870) in that specific configuration.
6. Related Work

The problem of efficiently maintaining erasure-coded content has triggered a novel research area in both the theoretical and practical communities. Designs of novel codes tailored for networked storage systems have emerged, with different purposes. For instance, in a context where partial recovery may be tolerated, priority random linear codes have been proposed in [32] to offer the property that critical data has a higher opportunity to survive node failures than data of less importance. Another point in the code design space is provided by self-repairing codes [34], which have been especially designed to minimize the number of nodes contacted during a repair, thus enabling faster and parallel replenishment of lost redundancy.
In a context where bandwidth is a scarce resource, network coding has been shown to be a promising technique which can serve the maintenance process. Network coding was initially proposed to improve the throughput utilization of a given network topology [2]. Introduced in distributed storage systems in [9], network coding techniques have been shown to dramatically reduce the maintenance bandwidth. The authors of [9] derived a class of codes, namely regenerating codes, which achieve the optimal tradeoffs between storage efficiency and repair bandwidth. In spite of their attractive properties, regenerating codes are mainly studied in an information theory context and lack practical insights. Indeed, this seminal paper provides theoretical bounds on the quantity of data to be transferred during a repair. The computational cost of a random linear implementation of these codes, which is rather high, is studied in [13]. Recent advances in this research area are surveyed in [11, 26, 28].
Recently, the authors of [29], [35] and [42] have designed new codes tailored for cloud systems. In [29], the authors proposed a new class of Reed-Solomon codes, namely rotated Reed-Solomon codes, with the purpose of minimizing I/O for recovery and degraded reads. While important for hot data (i.e., data frequently accessed), minimizing I/O is less crucial when storing cold data, as we observed in our experiments. Simple Regenerating Codes, introduced in [35], reduce the maintenance bandwidth while providing exact repairs and a simple XOR implementation.

Table 1. Decoding throughput of various processors for CNC (k = 16, 1 GB file size) and corresponding capacity in terms of network.

Cores  CPU(s)            Throughput  Net. saturated
4      Xeon E3-1220      8 Gbps      1 Gbps
6      Xeon E5-2630      12 Gbps     10 Gbps
24     2x Xeon E5-2695   48 Gbps     40 Gbps
40     4x Xeon E7-4870   80 Gbps

A novel family of codes called Locally Repairable Codes (LRCs) has been proposed in [42], also reducing the maintenance bandwidth by adding local parities while still providing exact repair. Yet this reduction comes at the price of losing optimal storage efficiency. Moreover, an exact repair does not provide the benefits of reintegration. Eventually, a new family of codes, called Piggybacked-RS codes, has been proposed in [39]. They are constructed by taking an existing Reed-Solomon code and adding carefully designed functions of one byte-level stripe onto the parities of other byte-level stripes. They reduce the maintenance bandwidth by a factor of 30% while still preserving the Reed-Solomon MDS storage efficiency.
Some other recent works [22, 23] aim to bring network coding into practical systems. The code design presented in [23] is not MDS, thus consuming more storage space. The codes in [22] handle a single failure; they target a maintenance framework that operates over a cloud of clouds. Additionally, CNC codes do not require splitting blocks further, contrary to the F-MSR codes in [22] (and the corresponding increase in coding/decoding costs). Finally, and despite the probabilistic nature of both types of codes, the repair process from [22] has a significant probability of data losses while operating. This is due to the impossibility of combining data blocks directly within cloud machines. Specific mechanisms have to be implemented in order to ensure data integrity in the long term, which adds design complexity to the overall proposal. One significant convergence point of the two approaches is that both codes are non-systematic, arguing for the possibility of bandwidth gains in archival systems.
7. Conclusion

While erasure codes, typically Reed-Solomon, have been widely acknowledged as a sound alternative to plain replication in the context of reliable distributed archival systems, they suffer from high costs, both bandwidth- and computation-wise, upon node repair. In this paper, we address this issue and provide a novel code-based system providing high reliability and efficient maintenance for practical archival systems. The originality of our approach, CNC, stems from a cluster-based placement strategy, assigning a set of files to a specific cluster of nodes, combined with the use of random codes and network coding at the granularity of several files. CNC leverages network coding and the co-location of blocks of several files to encode files together during the repair. This significantly decreases the bandwidth required during repair, avoids file decoding, and provides useful node reintegration. We provide a theoretical analysis of CNC. We also implemented CNC and deployed it on a testbed. Our evaluation shows a 50% improvement of CNC with respect to bandwidth consumption and repair time over Reed-Solomon-based approaches; the price to pay is a moderately higher CPU utilization, as a single core of a modern processor is sufficient for handling transfers on a 1 Gbps network. Also, the impact of chunking files due to the use of a non-systematic code remains limited for infrequently accessed data. We have shown that in our setting (1 Gbps network, Xeon E5520), for cold data, the impact of CNC is a limited CPU usage without throughput loss. As CNC reduces maintenance-related costs, it is particularly adapted to cold data storage such as archival systems.
8. Acknowledgments

We thank the anonymous reviewers and our shepherd Lidong Zhou for their useful comments. We are particularly grateful to Lidong Zhou for his great help in improving the experimental contribution of this paper. We thank Ahmed Oulabas for his contribution to the CNC coding library.

This study was partially funded by the ODISEA collaborative project from the System@tic and Images & Réseaux clusters. Experiments presented in this paper were carried out using the Grid5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies.
References

[1] S. Acedanski, S. Deb, M. Médard, and R. Koetter. How good is random linear coding based distributed networked storage. In NetCod, 2005.

[2] R. Ahlswede, N. Cai, S.-Y. Li, and R. Yeung. Network Information Flow. IEEE Transactions on Information Theory, 46:1204-1216, 2000.

[3] F. André, A.-M. Kermarrec, E. Le Merrer, N. Le Scouarnec, G. Straub, and A. van Kempen. Archiving Cold Data in Warehouses with Clustered Network Coding. arXiv:1206.4175.

[4] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker. Total recall: system support for automated availability management. In NSDI, 2004.

[5] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. ul Haq, M. I. ul Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure Storage: a highly available cloud storage service with strong consistency. In SOSP, 2011.

[6] S. Caron, F. Giroire, D. Mazauric, J. Monteiro, and S. Pérennès. Data life time for different placement policies in P2P storage systems. In Globe, 2010.

[7] B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, F. Kaashoek, J. Kubiatowicz, and R. Morris. Efficient Replica Maintenance for Distributed Storage Systems. In NSDI, 2006.

[8] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In SOSP, 2001.
[9] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. O. Wainwright, and K. Ramchandran. Network Coding for Distributed Storage Systems. In INFOCOM, 2007.
[10] A. G. Dimakis, V. Prabhakaran, and K. Ramchandran. Decentralized Erasure Codes for Distributed Networked Storage. In Joint special issue, IEEE/ACM Transactions on Networking and IEEE Transactions on Information Theory, 2006.
[11] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A Survey on Network Codes for Distributed Storage. Proceedings of the IEEE, 99:476–489, 2010.
[12] A. Duminuco and E. Biersack. Hierarchical Codes: How to Make Erasure Codes Attractive for Peer-to-Peer Storage Systems. In P2P, 2008.
[13] A. Duminuco and E. Biersack. A Practical Study of Regenerating Codes for Peer-to-Peer Backup Systems. In ICDCS, 2009.
[14] A. Duminuco, E. Biersack, and T. En-Najjary. Proactive replication in distributed storage systems using machine availability estimation. In CoNEXT, 2007.
[15] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In OSDI, 2010.
[16] A. Gharaibeh and M. Ripeanu. Exploring data reliability tradeoffs in replicated storage systems. In HPDC, 2009.
[17] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, 2003.
[18] C. Gkantsidis and P. Rodriguez. Network Coding for Large Scale Content Distribution. In INFOCOM, 2005.
[19] Glacier. http://aws.amazon.com/fr/glacier/.
[20] P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing Churn in Distributed Systems. In SIGCOMM, 2006.
[21] Grid'5000. https://www.grid5000.fr/.
[22] Y. Hu, H. C. H. Chen, P. P. C. Lee, and Y. Tang. NCCloud: Applying Network Coding for the Storage Repair in a Cloud-of-Clouds. In FAST, 2012.
[23] Y. Hu, C.-M. Yu, Y. K. Li, P. Lee, and J. Lui. NCFS: On the Practicality and Extensibility of a Network-Coding-Based Distributed File System. In NetCod, 2011.
[24] C. Huang, M. Chen, and J. Li. Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems. In NCA, 2007.
[25] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In USENIX ATC, 2012.
[26] S. Jiekak, A.-M. Kermarrec, N. Le Scouarnec, G. Straub, and A. Van Kempen. Regenerating Codes: A System Perspective. ACM SIGOPS Operating Systems Review, 47:23–32, 2013.
[27] A. Kermarrec, E. Le Merrer, G. Straub, and A. Van Kempen. Availability-Based Methods for Distributed Storage Systems. In SRDS, 2012.
[28] A. Kermarrec, N. Le Scouarnec, and G. Straub. Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes. arXiv:1102.0204 (updated September 2013).
[29] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. In FAST, 2012.
[30] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: an architecture for global-scale persistent storage. ACM SIGPLAN Not., 35(11):190–201, 2000.
[31] H.-Y. Lin and W.-G. Tzeng. A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding. IEEE Transactions on Parallel and Distributed Systems, 2012.
[32] Y. Lin, B. Liang, and B. Li. Priority Random Linear Codes in Distributed Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 20(11):1653–1667, 2009.
[33] M. Martalò, M. Picone, M. Amoretti, G. Ferrari, and R. Raheli. Randomized network coding in distributed storage systems with layered overlay. In ITA, 2011.
[34] F. E. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In INFOCOM, 2011.
[35] D. S. Papailiopoulos, J. Luo, A. G. Dimakis, C. Huang, and J. Li. Simple Regenerating Codes: Network Coding for Cloud Storage. In INFOCOM, 2012.
[36] J. S. Plank, K. Greenan, and E. L. Miller. Screaming Fast Galois Field Arithmetic Using Intel SIMD Extensions. In FAST, 2013.
[37] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O'Hearn. A performance evaluation and examination of open-source erasure coding libraries for storage. In FAST, 2009.
[38] J. S. Plank, S. Simmerman, and C. D. Schuman. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications - Version 1.2A. University of Tennessee, CS-08-627, 2008.
[39] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster. In HotStorage, 2013.
[40] R. Rodrigues and B. Liskov. High Availability in DHTs: Erasure Coding vs. Replication. In IPTPS, 2005.
[41] A. I. T. Rowstron and P. Druschel. Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-peer Storage Utility. In SOSP, 2001.
[42] M. Sathiamoorthy, M. Asteris, D. S. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In VLDB, 2013.
[43] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In MSST, 2010.
[44] K. Tati and G. M. Voelker. On Object Maintenance in Peer-to-Peer Systems. In IPTPS, 2006.
[45] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In SoCC, 2010.
[46] H. Weatherspoon and J. Kubiatowicz. Erasure Coding vs. Replication: A Quantitative Comparison. In IPTPS, 2002.