
Blizzard: Fast, Cloud-scale Block Storage for Cloud-oblivious Applications
James Mickens, Edmund B. Nightingale, Jeremy Elson, Krishna Nareddy, Darren Gehring
Microsoft Research
Bin Fan∗
Carnegie Mellon University
Asim Kadav†, Vijay Chidambaram‡
University of Wisconsin-Madison
Osama Khan§
Johns Hopkins University
Abstract
Blizzard is a high-performance block store that exposes cloud storage to cloud-oblivious POSIX and
Win32 applications. Blizzard connects clients and
servers using a network with full-bisection bandwidth,
allowing clients to access any remote disk as fast as if
it were local. Using a novel striping scheme, Blizzard
exposes high disk parallelism to both sequential and random workloads; also, by decoupling the durability and
ordering requirements expressed by flush requests, Blizzard can commit writes out-of-order, providing high performance and crash consistency to applications that issue
many small, random IOs. Blizzard’s virtual disk drive,
which clients mount like a normal physical one, provides
maximum throughputs of 1200 MB/s, and can improve
the performance of unmodified, cloud-oblivious applications by 2x–10x. Compared to EBS, a commercially
available, state-of-the-art virtual drive for cloud applications, Blizzard can improve SQL server IOp rates by
seven-fold while still providing crash consistency.
1 Introduction
As enterprises leverage cloud storage to process big-data
workloads, there is increasing pressure to migrate traditional desktop and server applications to the cloud as
well. However, migrating POSIX/Win32 applications to
the cloud has historically required users to select from
a variety of unattractive options. Databases like MySQL
and email servers like Exchange can be trivially migrated
to the cloud by running the server binaries inside of VMs
that reside on cloud servers. Unfortunately, the storage
∗ Work completed as a Microsoft intern; now at Google.
† Work completed as a Microsoft intern; now at NEC Labs.
‡ Work completed as a Microsoft intern.
§ Work completed as a Microsoft intern; now at Twitter.
abstractions exposed to those VMs lack the high performance and transparent scaling that are enjoyed by non-POSIX applications written for scale-out cloud stores
like HDFS [6] and FDS [30]. For example, Azure and
EC2 provide virtual disks that are backed by remote storage and that unmodified POSIX/Win32 applications can
mount. However, each virtual drive only provides 50–
250 MB/s of throughput [1, 28]; in contrast, a raw cloud
store can provide more than a thousand MB/s to clients.
Datacenter operators provide “cloud-optimized” versions of a few popular applications like SQL and Active Directory [3, 4, 15, 26, 27], implicitly acknowledging the difficulty of extracting cloud-scale performance
from unmodified POSIX1 applications. These cloud-optimized programs directly interface with the network
storage using raw cloud APIs. This strategy provides
higher performance than a naïve VM port, but such
cloud-optimized applications offer fewer customization
options than traditional POSIX/Win32 versions, making
it difficult for users to tweak performance for individual workloads. More importantly, for the long tail of
the application distribution, there are no pre-built, cloud-optimized versions. Large or technologically savvy companies may have the resources to write cloud-optimized
versions of their applications, but for safety reasons, datacenter operators do not provide external developers
with full access to the raw cloud APIs that are needed
to maximize performance. Even if customers had such
access, many customers would prefer to simply deploy
their standard binaries to the cloud and automatically receive fast, scalable IO.
Unfortunately, desktop and server applications have
significantly different IO patterns than traditional cloud-scale applications.
1 For conciseness, we use “POSIX” to mean “POSIX/Win32” in the rest of this paper.
A typical MapReduce-style workload issues large, sequential IOs, but POSIX applications
issue small, random IOs that are typically 32–128 KB
in size [25, 41]. POSIX applications also require finer-grained consistency than many big-data workloads. For
an application like an Internet-scale web crawl, append-at-least-once semantics [14] are often reasonable, and
losing MBs of append-only data may be an acceptable
risk for higher throughput; in contrast, for a POSIX
database, compiler, or email server that is randomly accessing small blocks, the loss or duplication of just a
few adversarially-chosen blocks can result in metadata
inconsistencies and catastrophic data loss. To enforce
fine-grained consistency, POSIX applications use disk
flushes to order writes [8, 33, 39], but these flushes introduce write barriers. In the context of cloud storage, these
barriers make it difficult for POSIX applications to issue
large numbers of parallel writes to remote cloud disks.
Lower disk parallelism leads to lower performance.
In this paper, we introduce Blizzard, a high-performance block store that exposes unmodified, cloud-oblivious applications to fast, scalable cloud storage.
From the perspective of a client application, Blizzard’s
virtual disk looks like a standard SATA drive. However, Blizzard translates block reads and writes to parallel IOs on remote cloud disks, transparently handling
the removal, addition, or failure of those remote disks.
Blizzard is not the first system to propose a virtual disk
backed by remote storage [2,11,24]. However, Blizzard’s
virtual drive has several unique characteristics:
Locality-oblivious, full-bisection bandwidth block
access: Blizzard is built atop a CLOS network with
no oversubscription [17], i.e., arbitrary client/server
pairs can exchange data at full NIC speeds without
inducing network congestion. Blizzard also pairs
each server disk with enough network bandwidth
to read and write that disk at full sequential speed.
These properties mean that clients can stripe their data
across arbitrary remote disks in a locality-oblivious
manner. This simplifies Blizzard’s striping algorithm
and permits very aggressive sharding (which results
in better performance, since spreading data over more
disks increases IO parallelism).
Nested striping: POSIX applications typically issue
small, random IOs, but even when their IOs are large,
client file systems often break such large operations
into smaller pieces. A naïve stripe of a virtual drive
across remote disks will cause IOp convoy dilation—
as the small requests belonging to a single large
operation travel from the client to a remote disk, the
inter-request spacing will increase due to network
jitter and scheduling vagaries on the client and the
server. The longer the dilation, the less likely that
the remote disk can use a single seek to handle all
of the adjacent disk requests in the convoy. Blizzard
uses a novel striping scheme called nested striping
which avoids these problems. Nested striping ensures
that convoy blocks are spread across multiple disks
in parallel. This amortizes the seek costs for the
individual blocks, and globally acts to prevent disk
hotspots.
Fast flushes with prefix write commits: POSIX applications use disk flushes to order writes and provide
crash consistency. However, such flushes restrict IO
parallelism, and massive IO parallelism is the primary
technique that clients must leverage to unlock cloud-scale IO performance. When Blizzard’s virtual disk
receives a flush request, it immediately acknowledges
the flush to the client application, even though Blizzard has not made writes from that flush epoch durable.
Asynchronously, the virtual drive issues writes in a
way that respects prefix epoch semantics. If the client
or the virtual disk crashes, the disk will always recover to a consistent state in which writes from different flush epochs will never be intermingled—all writes
up to some epoch N − 1 will be durable; some writes
from epoch N may be durable; and all writes from
subsequent epochs are lost. Blizzard’s asynchronous
writes lengthen the window for potential data loss,
but they permit much higher levels of write performance. This approach also reduces the penalty for N-way data replication, since acknowledging a write no
longer proceeds at the pace of the slowest replica for
that write. Prior work has shown how prefix semantics can be added to the ext4 file system [7], but Blizzard shows how such semantics can be added at the
disk level, in a file-system agnostic manner, and in a
way that also provides high performance to applications that bypass the file system entirely and issue raw
disk IOs.
Blizzard has several additional features, like support for
disconnected operation2, and tunable levels of disk parallelism (§2.4).
We have built a Blizzard prototype consisting of
1,200 disks and 150 servers. Using this prototype,
we demonstrate that Blizzard can improve the performance of unmodified IO-intensive applications by 2x–
10x. Importantly, our Blizzard prototype coexists alongside our FDS [30] deployment, using the same servers,
disks, and networking equipment. FDS is optimized for
large, business-scale computations. Thus, using Blizzard, cloud providers can leverage a single set of cluster hardware for both big-data computations and POSIX
2 Not discussed further due to space constraints.
applications, reducing hardware outlays while consolidating administrative costs and providing fast, scalable
IO to both types of workloads. From the perspective of
developers, Blizzard allows POSIX applications to receive cloud-scale performance and availability without
requiring datacenter operators to expose their raw, unsafe cloud APIs—instead, developers simply write applications under the assumption that (virtual) disk drives
are extremely fast.
2 Design
Blizzard has two high-level goals. First, it has to run
unmodified, cloud-oblivious POSIX applications on the
same cloud infrastructure used by traditional big-data applications. Second, it must provide those POSIX applications with the storage performance, scalability, and
availability that big-data programs receive. By “cloud-level performance,” we mean that a single client should
be able to issue hundreds of MBs of IO requests every
second. By “cloud-level scalability and availability,” we
mean that client storage should transparently improve
as the cloud operator adds more remote disks or better
networking capacity. Also, administrative efforts that
help big-data applications should also improve unmodified POSIX applications.
To satisfy these design goals, Blizzard must efficiently
handle two aspects of POSIX workloads:
• POSIX applications typically generate small,
random IOs between 32 KB and 128 KB in
size [25, 41]. To offer high performance to such
seek-bound workloads, Blizzard needs to expose applications to massive disk parallelism.
This allows applications to issue multiple operations simultaneously and overlap the seek
costs.
• POSIX applications use the fsync() system
call to control the ordering and durability of
writes, ensuring consistency after crashes [7,
8, 33, 39]. An fsync() call generates a disk
flush, and the disk flush acts as a write barrier, preventing writes issued after the flush
from completing before all previous writes are
durable. These write barriers limit disk parallelism, which results in poor performance.
Thus, Blizzard needs to handle fsync() calls
in a way that preserves notions of write ordering, but does not require clients to wait for synchronous disk events before issuing new writes.
In Section 2.3, we describe how Blizzard leverages full-bisection bandwidth networks to aggressively stripe each
client’s data across a large number of disks. In the absence of flush requests, this scheme would suffice to
provide clients with massive disk parallelism. However, POSIX applications commonly issue flush requests.
Thus, as described in Section 2.5, Blizzard uses eventual durability semantics [7] to remove synchronous disk
operations from the flush path. When a client issues a
flush, Blizzard records ordering information about pending writes; then, Blizzard immediately acknowledges the
flush request. Asynchronously, Blizzard writes data to
remote disks in a way that respects the client’s ordering constraints, and allows the client to recover a consistent view after a crash. In the extreme, Blizzard can issue writes from multiple flush periods completely out-oforder (§2.5.3), removing all synchronization constraints
involving writes, but still preserving consistency.
2.1 Blizzard’s Storage Abstraction
Frameworks like pNFS [38] and BlueSky [42] expose
cloud storage to cloud-oblivious applications by translating (say) NFS operations into operations on cloud
disks. However, these systems lock applications into a
particular set of file semantics which will not be appropriate for all applications. For example, some applications desire NFS’s close-to-open consistency semantics,
but other applications require POSIX-style consistency
in which a newly written block is immediately visible
to readers of the enclosing file [20]. Since all file systems eventually issue reads and writes to a block device
(and since some applications issue raw disk commands),
we decided to implement Blizzard as a virtual block device which stripes data across remote disks. This allowed
us to support heterogeneous POSIX and Win32 file systems like ext3 and NTFS; it also allowed us to expose
fast storage to applications like databases that issue raw
block-level IOs.
2.2 The Low-level Storage Substrate
Blizzard stripes each virtual drive across several remote
physical disks. As the striping factor increases, the
virtual drive benefits from greater spindle parallelism
(and thus higher IO performance). However, a traditional oversubscribed network constrains how aggressively Blizzard can stripe. In an oversubscribed network, the available cross-rack bandwidth is lower than
the available intra-rack bandwidth; thus, for a system
with medium-to-high utilization, clients access rack-local disks faster than they access disks in external racks.
If Blizzard restricted each virtual drive to use rack-local
disks, this would limit spindle parallelism, constrain the
total capacity of the virtual drive, and make job allocation more difficult, since a single job could not harness idle disks spread across multiple racks. However, if
Blizzard allowed a virtual disk to span racks, contention
in the oversubscribed cross-rack links would prevent the
client from fully utilizing rack-external disks.
To avoid this dilemma, Blizzard uses Flat Datacenter
Storage (FDS) as its low-level storage substrate [30].
Figure 1: An example of nested striping with a segment size of 2. Virtual disk blocks are striped across tracts, and tracts are scattered across disks.
FDS is a datacenter-scale blob store that connects all
clients and disks using a network with full-bisection
bandwidth, i.e., no oversubscription [17]. FDS also provisions each storage server with enough network bandwidth to match its aggregate disk bandwidth. For example, a single physical disk has roughly 128 MB/s of maximum sequential access speed. 128 MB/s is 1 Gbps, so if
a storage server has ten disks, FDS provisions that server
with a 10 Gbps NIC; if the server has 20 disks, it receives
two 10 Gbps NICs. The resulting storage substrate provides a locality-oblivious storage layer—any client can
access any remote disk at maximum disk speeds3 .
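As a rough worked example of this provisioning rule, the sketch below computes how many NICs a storage server needs, using the text's approximation that one disk's sequential bandwidth is about 1 Gbps; the helper and constant names are illustrative, not code from FDS.

```python
import math

# Illustrative sketch of the disk-to-network provisioning rule (not FDS code).
# One disk's ~128 MB/s of sequential bandwidth is treated as roughly 1 Gbps.
GBPS_PER_DISK = 1

def nics_needed(num_disks, nic_gbps=10):
    """10 Gbps NICs required so every disk can be driven at full sequential speed."""
    return math.ceil(num_disks * GBPS_PER_DISK / nic_gbps)

# nics_needed(10) == 1 and nics_needed(20) == 2, matching the examples in the text.
```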
A Blizzard virtual disk is backed by a single FDS blob
(although Blizzard does not use a linear mapping between a virtual disk address and the corresponding byte
in the FDS blob (§2.3 and §2.5)). FDS breaks each blob
into 8 MB segments called tracts, and uses these tracts
as the striping unit. Blizzard typically instructs FDS to
stripe a blob across 64 or 128 physical disks; optionally,
FDS performs N-way replication of each tract.
2.3 Data Placement
FDS provides Blizzard with some nice properties out-of-the-box, and allows Blizzard to satisfy the design goal
of running on the same infrastructure that supports traditional big-data applications. However, FDS is optimized
for large, sequential IOs. In particular, FDS divides each
FDS blob into a series of large tracts. A tract resides in a
contiguous location on a particular disk, but FDS scatters
a blob’s tracts across a large number of different disks.
Tracts are 8 MB in size, and FDS’s primary read/write interface works at the granularity of tracts. POSIX IOs are
typically much smaller than 8 MB. Thus, we had to devise a way for Blizzard’s virtual disk to map these small,
random IOs onto FDS tracts (which FDS would then map
to remote disks).
3 This assumes that RTTs between clients and storage servers are
much smaller than a seek time. We explore the impact of network
latency in Section 4.6.
FDS natively supports a raw, low-level interface that
lets applications read and write data in chunks that are
smaller than a tract. This interface allows applications
to process small pieces of tract metadata without having
to read or write entire 8 MB tracts. Our first prototype
of Blizzard’s virtual disk used this interface, and a very
simple block mapping, to translate virtual disk addresses
to tract-level offsets. In this initial prototype, the virtual
disk split its linear address space into contiguous, tract-sized chunks, and assigned each chunk to a separate FDS
tract. A virtual disk IO with offset byteOffset would go
to tract byteOffset/TRACT_SIZE_BYTES. This mapping
was straightforward, but it led to disappointing performance. A convoy of sequential IOs often hit the same
tract (and thus the same remote disk), eliminating opportunities for disk parallelism. Even worse, sequential convoys often experienced dilation. Even if the client used
a tight loop to issue writes to adjacent offsets, the temporal spacing between those operations often grew as the
operations traveled from client to remote disk. Clientside scheduling jitter increased the spacing, as did random network delays and scheduling jitter on the remote
server. Thus, a sequential convoy that initially had little inter-request spacing at the client often arrived at the
remote hard disk with larger inter-request gaps. In many
cases, this prevented the remote disk from efficiently servicing the entire convoy with a single seek. Instead, the
convoy operations were handled with multiple seeks and
rotational delays, increasing the total IO latency for all
operations on that disk.
Given these observations, we designed a new mechanism called nested striping that maps linear byte ranges
to FDS tracts. We define a segment as a logical group
of one or more tracts; a segment of N bytes contains
striped data for a linear byte range of N bytes. Figure 1 demonstrates nested striping when each segment
contains two tracts. Intuitively, increasing the segment
size allows Blizzard to provide greater disk parallelism
for sequential workloads. For example, a segment size of
one restricts sequential IOs to one disk. As shown in Figure 1, a segment size of two spreads sequential IOs across
two disks. Figure 1 also demonstrates why we call this
striping scheme “nested”: Blizzard stripes blocks across
FDS tracts, and FDS distributes the tracts in a blob across
many remote disks. By default, our Blizzard implementation uses a segment size of 128 tracts.
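To make the translation concrete, the sketch below shows one way a nested-striping address map can be written, assuming that consecutive virtual blocks are assigned round-robin to the tracts of a segment; the constants, names, and layout are illustrative assumptions rather than Blizzard's actual code.

```python
# Illustrative sketch of a nested-striping address translation (assumed layout:
# consecutive virtual blocks go round-robin across the tracts of a segment).
TRACT_SIZE = 8 * 1024 * 1024      # FDS tract size: 8 MB
BLOCK_SIZE = 128 * 1024           # virtual block size (assumed 128 KB)
SEGMENT_TRACTS = 128              # tracts per segment (the default segment size)

BLOCKS_PER_TRACT = TRACT_SIZE // BLOCK_SIZE
BLOCKS_PER_SEGMENT = SEGMENT_TRACTS * BLOCKS_PER_TRACT

def virtual_to_tract(byte_offset):
    """Map a virtual-disk byte offset to (tract_id, offset_in_tract)."""
    block = byte_offset // BLOCK_SIZE
    segment = block // BLOCKS_PER_SEGMENT
    block_in_segment = block % BLOCKS_PER_SEGMENT
    # Round-robin across the segment's tracts: adjacent blocks land on different
    # tracts (and thus, very likely, different disks), so a convoy of sequential
    # IOs is spread over many spindles instead of queuing behind one.
    tract_in_segment = block_in_segment % SEGMENT_TRACTS
    row = block_in_segment // SEGMENT_TRACTS
    tract_id = segment * SEGMENT_TRACTS + tract_in_segment
    offset_in_tract = row * BLOCK_SIZE + (byte_offset % BLOCK_SIZE)
    return tract_id, offset_in_tract
```

Note that with a segment size of one, this formula collapses to the naive one-tract-per-chunk mapping described earlier, which is exactly the case that suffers from convoy dilation.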
Experiments show that nested striping dramatically
decreases the impact of convoy dilation. Using a segment size of 128 provides over 1000 MB/s of sequential
read throughput (§4.1). In contrast, using a segment size
of one tract results in sequential read throughput of 222
MB/s.
2.4 Role-based Striping
Up to this point, we have assumed that Blizzard is deployed in a shared, public cloud that is utilized by multiple customers. However, Blizzard can also be used in
private, enterprise-local clouds. Indeed, one deployment
model for Blizzard is a building-scale deployment—all
of the rooms in the building are connected via a full-bisection bandwidth network, and all desktop and server
machines use Blizzard virtual drives to store and manipulate data. Datacenters already deploy full-bisection networks at the scale of thousands of machines [17], so this
deployment model is technically feasible, and it would
allow an enterprise to consolidate storage and IT effort
for a particular building.
In such a building-scale deployment, different users
will have different performance needs. For example, a programmer that runs large compilations requires better storage performance than a receptionist who
mainly sends emails and performs word processing. For
building-wide Blizzard deployments, segment sizes offer
a convenient knob that administrators can use to control
the amount of disk parallelism that Blizzard exposes to
different users with different roles.
2.5 Write Semantics
Blizzard’s virtual disk provides three types of data consistency: write-through commits, flush epoch commits
with fast acknowledgments, and out-of-order commits
with fast acknowledgments. We describe each approach
below. Note that, when we say that Blizzard “acknowledges” an operation to a “client,” the client is either a
file system or an application that issues raw disk IOs. A
write request or a flush request is “acknowledged” when
that operation returns to the client.
2.5.1 Write-through
In this mode, Blizzard does not acknowledge a virtual
disk write until the associated FDS operation has become
durable on the relevant remote disk. This approach provides the smallest window of potential data loss if a crash
occurs. However, write-through consistency provides the
lowest performance, since a thread which issues a blocking write must wait for that write to become durable before issuing additional IOs.
2.5.2 Flush epoch commits with fast acks
Let a flush epoch be a period of time between two flush
requests from the client. Each flush epoch contains one
or more writes. A flush epoch is issued if all of its writes
have been sent to remote disks. The epoch is retired if
all of those writes have been reported as durable by the
remote disks.
Setup: Blizzard maintains a counter called epochToIssue; this counter starts at 0 and represents which writes Blizzard can send to FDS. Blizzard
maintains another counter called currEpoch that also
starts at 0. This counter represents the total number
of flush requests that the client has issued. As explained below, currEpoch will often be greater than
epochToIssue.
Acknowledging writes: When Blizzard receives a write
request or a flush request, it immediately acknowledges
that operation, allowing the client to quickly issue more
IO requests. If the incoming operation was a flush,
Blizzard increments currEpoch by 1. If the operation
was a write, Blizzard tags the write with the currEpoch
value and places the write request in a queue that is
ordered by epoch tags.
Draining the write queue: Once a new write is enqueued and acknowledged to the client, Blizzard tries
to issue enqueued writes to remote disks. Blizzard iterates from the front of the write queue to the end, i.e.,
from the oldest unretired epoch to the newest. If the currently examined write is from epochToIssue, Blizzard
dequeues the write and issues it immediately; otherwise,
Blizzard terminates the queue traversal. Later, when a
write from epoch N completes, Blizzard checks whether
epoch N has now retired. If so, Blizzard increments
epochToIssue and tries to release new writes from the
head of the write queue.
When Blizzard issues a write, it removes it from the
write queue. However, Blizzard keeps the write in a
separate cache until the write is durable. Meanwhile, if a
read arrives for the write’s byte range, Blizzard services
the read using the cached, fresh data, instead of issuing a
read to the underlying remote disk and possibly getting
old data.
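The following is a minimal sketch of this bookkeeping; the class and method names are illustrative, and issue_write() stands in for an asynchronous FDS write whose completion callback invokes on_write_durable().

```python
from collections import deque

class FlushEpochCommitter:
    """Sketch of flush-epoch commits with fast acknowledgments (illustrative names)."""

    def __init__(self, issue_write):
        self.curr_epoch = 0       # bumped by every flush the client issues
        self.epoch_to_issue = 0   # only this epoch's writes may go to remote disks
        self.pending = deque()    # writes from not-yet-issuable epochs, oldest first
        self.in_flight = {}       # epoch -> count of issued-but-not-yet-durable writes
        self.issue_write = issue_write

    def on_write(self, write):
        """Acknowledge immediately; issue now, or queue until the epoch can issue."""
        write.epoch = self.curr_epoch
        if write.epoch == self.epoch_to_issue:
            self._issue(write)
        else:
            self.pending.append(write)

    def on_flush(self):
        """Acknowledge immediately; a flush only closes the current epoch."""
        self.curr_epoch += 1
        self._maybe_retire()

    def on_write_durable(self, write):
        self.in_flight[write.epoch] -= 1
        self._maybe_retire()

    def _issue(self, write):
        self.in_flight[write.epoch] = self.in_flight.get(write.epoch, 0) + 1
        self.issue_write(write)   # also kept in a side cache to serve reads until durable

    def _maybe_retire(self):
        # epochToIssue retires once a flush has closed it and all of its writes are
        # durable; the next epoch's queued writes are then released in order.
        while (self.epoch_to_issue < self.curr_epoch
               and self.in_flight.get(self.epoch_to_issue, 0) == 0):
            self.epoch_to_issue += 1
            while self.pending and self.pending[0].epoch == self.epoch_to_issue:
                self._issue(self.pending.popleft())
```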
Consistency semantics: In this consistency scheme,
Blizzard treats a flush as an ordering constraint, but not a
durability constraint; using the terminology of optimistic
crash consistency, Blizzard provides “eventual durability” [7]. This means that Blizzard issues writes in a way
that respects flush-order durability, but a flush epoch may
retire at an arbitrarily long time after the flush was acknowledged to the client. Indeed, the epoch may never
retire if the client crashes before it can issue the associated writes. However, the rebooted client is guaranteed
to see a consistent prefix of all writes that were acknowledged as flushed; this suffices for many applications [9].
2.5.3 Out-of-order commits with fast acks
To maximize the rate at which writes are issued, Blizzard
defines a scheme that allows writes to be acknowledged
immediately and issued immediately, regardless of their
flush epoch. This means that writes may become durable
out-of-order. However, Blizzard enforces prefix consistency using two mechanisms. First, Blizzard abandons
nested striping and uses a log structure to avoid updating
blocks in place; thus, if a particular write fails to become
durable, Blizzard can recover a consistent version of
the target virtual disk block. Second, even though
Blizzard issues each new write immediately, Blizzard
uses a deterministic permutation to determine which log
entry (i.e., which <tract,offset>) should receive the
write. To recover to a consistent state after a crash, the
client can start from the last checkpointed epoch and
permutation position, and roll the permutation forward,
examining log entries and determining the last epoch
which successfully retired.
Setup: Let there be V blocks in the virtual disk, where
each block is of equal size, a size that reflects the average IO size for the client (say, 64 KB or 128 KB). The V
virtual blocks are backed by P > V physical blocks in the
underlying FDS blob. Blizzard treats the physical blocks
as a log structure. Blizzard maintains a blockMap that
tracks the backing physical block for each virtual block.
Blizzard also maintains an allocationBitMap that indicates which physical blocks are currently in use. When
the client issues a read to a virtual block, Blizzard consults the blockMap to determine which physical block
contains the desired data. Handling writes is more complicated, as explained below.
Blizzard maintains a counter called currEpoch;
this counter is incremented for each flush request, and
all writes are tagged with currEpoch. Blizzard also
maintains a counter called lastDurableEpoch which
represents the last epoch for which all writes are retired.
The virtual-to-physical translation: When Blizzard
initializes the virtual disk, it creates a deterministic permutation of the physical blocks. This permutation represents the order in which Blizzard will update the log. For
example, if the permutation begins 18, 3, ..., then the
first write, regardless of the virtual block target, would
go to physical block 18, and the second write, regardless
of the virtual block target, would go to physical block
3. Importantly, Blizzard can represent a permutation of
length P in O(1) space, not O(P) space. Using a linear
congruential generator [44], Blizzard only needs to store
three integer parameters (a, c, and m), and another integer representing the current position in the permutation.
As we will describe later, the serialized permutation will
go into the checkpoints that Blizzard creates.
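The sketch below shows how such an O(1)-space permutation can be represented with a linear congruential generator; the parameter values are illustrative (chosen to satisfy the Hull-Dobell full-period conditions) and the class is not Blizzard's actual code.

```python
class LcgPermutation:
    """O(1)-space permutation of the P physical blocks via x' = (a*x + c) mod m.
    With Hull-Dobell parameters (c coprime to m; a-1 divisible by every prime
    factor of m, and by 4 if m is divisible by 4), the generator visits every
    value in [0, m) exactly once before repeating."""

    def __init__(self, num_physical_blocks, a, c, position=0):
        self.m = num_physical_blocks
        self.a, self.c = a, c
        self.position = position          # the fourth integer that gets checkpointed

    def next_block(self):
        """Advance to, and return, the next physical block in permutation order."""
        self.position = (self.a * self.position + self.c) % self.m
        return self.position

# Example: a full-period permutation of 2**20 physical blocks. Because m is a power
# of two, any odd c and any a with a % 4 == 1 satisfy the conditions (assumed values).
perm = LcgPermutation(num_physical_blocks=1 << 20, a=1664525, c=1013904223)
```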
Handling reads is simple: when the client wants
data from a particular virtual block, Blizzard uses the
blockMap to find which physical block contains that
data; Blizzard then fetches the data. Handling writes requires more bookkeeping. When a write arrives, Blizzard iteratively calls the deterministic permutation function, and immediately sends the write to the first physical
block that is not marked in the allocationBitMap as
used. However, once the write is issued, Blizzard does
not update the allocationBitMap or the blockMap—
those structures are reflected into checkpoints, so they
can only be updated in a way that respects prefix semantics. So, after Blizzard issues the write, it places the write
in a queue. Blizzard uses the write queue to satisfy reads
to byte ranges with in-flight (but possibly non-durable)
writes. When a write becomes durable, Blizzard checks
whether, according to the permutation order, the write
was the oldest unretired write in lastDurableEpoch+1.
If so, Blizzard removes the relevant write queue entry,
and updates blockMap and allocationBitMap. Otherwise, Blizzard waits for older writes to commit first.
Once all writes in the associated epoch are durable, Blizzard increments lastDurableEpoch.
When Blizzard issues a write to FDS, it actually writes
an expanded block. This expanded block contains the
raw data from the virtual block, as well as the virtual
block id, the write’s epoch number, and a CRC over the
entire expanded block. As we explain below, Blizzard
will use this information during crash recovery.
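The sketch below shows one plausible encoding of such an expanded block; the field order, integer widths, and the use of CRC-32 are assumptions made for illustration, not Blizzard's on-disk format.

```python
import struct
import zlib

def pack_expanded_block(data, virtual_block_id, epoch):
    """Append the virtual block id, the epoch number, and a CRC over everything."""
    body = data + struct.pack("<QQ", virtual_block_id, epoch)
    return body + struct.pack("<I", zlib.crc32(body))

def unpack_expanded_block(blob):
    """Return (data, virtual_block_id, epoch), or None if the CRC does not verify,
    e.g. because the write never fully landed on the remote disk."""
    body, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != crc:
        return None
    data, (vid, epoch) = body[:-16], struct.unpack("<QQ", body[-16:])
    return data, vid, epoch
```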
If the client issues a write that is smaller than the size
of a virtual block, Blizzard must read the remaining parts
of the virtual block before calculating the CRC and then
writing the new expanded block. This read-before-write
penalty is similar to the one suffered by RAID arrays that
use parity bits. This penalty is suffered for small writes,
or for the bookends of a large write that straddles multiple blocks. For optimal performance, Blizzard’s virtual
block size should match the expected IO size of the
client. For example, POSIX applications like databases
and email servers often have a configurable “page size”;
these applications try to issue reads and writes that are
integral multiples of the page size, so as to minimize disk
seeks. For these applications, Blizzard’s virtual block
size should be set to the application-level page size.
For other applications that 1) frequently generate writes
that are not an even multiple of Blizzard’s block size,
or 2) generate writes that are not aligned on Blizzard’s
block boundaries, Blizzard should be configured to use
write-through mode, or fast acknowledgment mode with
nested striping (§4.7).
Checkpointing: Periodically, the client checkpoints the
blockMap, the allocationBitMap, the four permutation parameters, lastDurableEpoch, and a CRC over
the preceding quantities. For a 500 GB virtual disk,
the checkpoint size is roughly 16 MB. Blizzard does
not update the checkpoint in place; instead, it reserves
enough space on the FDS blob for two checkpoints, and
alternates checkpoint writing between the two locations.
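A minimal sketch of this double-buffered checkpoint write is shown below; the serialization layout, the slots table, and the blob.write() call are assumptions for illustration rather than Blizzard's implementation.

```python
import struct
import zlib

def serialize_checkpoint(block_map_bytes, alloc_bitmap_bytes,
                         a, c, m, position, last_durable_epoch):
    """Serialize the recovery state with a trailing CRC (field layout is assumed)."""
    header = struct.pack("<QQQQQ", a, c, m, position, last_durable_epoch)
    body = header + block_map_bytes + alloc_bitmap_bytes
    return body + struct.pack("<I", zlib.crc32(body))

def write_checkpoint(blob, checkpoint_bytes, slots, which):
    """Write the checkpoint into one of two reserved regions of the FDS blob and
    flip 'which', so a crash mid-write can never destroy the last good checkpoint."""
    offset, length = slots[which]
    assert len(checkpoint_bytes) <= length
    blob.write(offset, checkpoint_bytes)   # hypothetical blob-write call
    return 1 - which                       # alternate between the two locations
```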
Figure 2: An example of Blizzard’s log-based, out-of-order commit scheme. Disks with thick borders contain durable writes. See the main paper text for an explanation of what happens at each time step.
Recovery: To recover from a crash, Blizzard loads the
two serialized checkpoints and initializes itself using
the checkpoint with the highest lastDurableEpoch
and a valid CRC. Blizzard then rolls forward from
the permutation position in the checkpoint, scanning
physical blocks in permutation order. Since Blizzard
issues log writes in epoch order, the recovery scan will
process writes in epoch order. If the allocationBitMap
says that the current physical block is in use, Blizzard
inspects the next physical block in the permutation. If
the allocationBitMap says that the current physical
block is not in use, Blizzard inspects the physical
block’s epoch number. If the number is less than
lastDurableEpoch, Blizzard terminates roll-forward.
If the CRCs from all of the physical block’s replicas
are inconsistent or do not match each other, Blizzard
terminates roll-forward. Otherwise, Blizzard updates
the allocationBitMap to mark the physical block as
used (and the old physical block for the virtual block as
unused); Blizzard also updates the relevant blockMap
entry to point to the physical block. Finally, Blizzard sets
lastDurableEpoch to the epoch value in the physical
block. The permutation position at which roll-forward
terminates will be the position to which Blizzard sends
the next write.
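This roll-forward can be summarized by the sketch below; the checkpoint fields and the helper functions are illustrative stand-ins that mirror the steps just described rather than reproducing Blizzard's implementation.

```python
def roll_forward(checkpoint, next_in_permutation, read_physical_block):
    """checkpoint carries blockMap, allocationBitMap, lastDurableEpoch, and the
    permutation position; next_in_permutation(pos) steps the deterministic
    permutation; read_physical_block(i) returns (data, virtual_id, epoch) or None
    when the entry's CRCs are invalid or its replicas disagree (assumed helpers)."""
    block_map = checkpoint.block_map            # virtual block -> physical block
    allocated = checkpoint.allocation_bitmap    # physical block -> in use?
    epoch_floor = checkpoint.last_durable_epoch
    position = checkpoint.perm_position

    while True:
        candidate = next_in_permutation(position)
        if allocated[candidate]:
            position = candidate                # already accounted for; keep scanning
            continue
        decoded = read_physical_block(candidate)
        if decoded is None:
            break                               # bad CRCs: the write never fully landed
        data, virtual_id, epoch = decoded
        if epoch < epoch_floor:
            break                               # stale entry from before the checkpoint
        # Adopt the write: repoint the virtual block and retire its old physical block.
        old = block_map.get(virtual_id)
        if old is not None:
            allocated[old] = False
        allocated[candidate] = True
        block_map[virtual_id] = candidate
        epoch_floor = epoch
        position = candidate

    checkpoint.last_durable_epoch = epoch_floor
    # The scan stopped just before 'candidate'; the next write targets that slot.
    return position
```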
Example: Figure 2 provides an example of Blizzard’s
log-based, out-of-order commits. For simplicity, this example uses the permutation generator (writeNumber %
logSize), such that physical log entries are scanned in
linear order, from left to right. At time (1), a write arrives for virtual block Y. Blizzard issues the write to remote storage immediately, and places the write in an in-memory cache, so that reads for Y’s data can access the
fresh data without having to wait for the remote write to
finish. A flush request arrives at (2), which Blizzard acknowledges immediately. The current flush epoch is now
1. At (3), writes to virtual blocks X and Y arrive. Blizzard
acknowledges those writes immediately, issuing them in
parallel to the next positions in the log, and updating the
write cache entries for blocks X and Y (in the latter case,
overwriting the old cache value for Y). At (4), another
flush arrives, and Blizzard increments the flush epoch to
2. At (5), another write arrives for Y, causing Blizzard
to issue a new write to the next position in the log, and
updating Y’s write cache value. At (6), the first write to
Y becomes durable on a remote disk, causing Blizzard to
update the blockMap entry for Y to point to that log entry.
At time (7), the write to X becomes durable, and Blizzard
updates the blockMap appropriately. However, at time
(7), the second write to Y has not committed. At time (7),
the client takes a checkpoint (note that the last permutation index that the client knows is durable is the second
log entry). At time (8), the client learns that the third
write to Y is durable. However, since the second write
to Y is not durable yet, the client does not change the
blockMap–thus, virtual block Y still points to the write
from epoch 0.
Suppose that the client crashes immediately after it
makes a checkpoint at (7). Further suppose that this
crash prevents the write to the third physical block from
becoming durable, e.g., because the client needed to
retry the write, but crashed before it could do so. After
the client reboots, it looks at the checkpoint and extracts
the last permutation index known to be durable (log
entry two). The client then rolls forward through the
log in permutation order. The client examines the third
physical log entry and sees that it is marked as unused by
the allocationBitMap. The client examines the entry’s
epoch number and CRC. Since the associated write
failed, one or both of those quantities will have invalid
values. At this point, the client stops the roll-forward.
Even though the write to the fourth log block completed,
that write is lost to the client. However, the client has
recovered to a prefix-consistent view of the virtual block
Y (and the rest of the disk).
IOp dilation: Even though log-based consistency does
not use nested striping, the linear congruential generator
produces striping patterns that “jump around” enough
to prevent convoy dilation. Adversarial write patterns
can still result in dilation, but such patterns are rare in
practice.
2.6 Application-perceived Consistency
In write-through mode, Blizzard minimizes application-perceived data loss. Since all writes are synchronous and
go directly to remote disks, writes can only be lost if the
application transfers write data to the virtual drive, but
the drive or the client machine crashes before Blizzard
can write the data to the remote disks. Once Blizzard has
acknowledged a write operation to the client, the client
knows that the write data is durable.
For the fast acknowledgment schemes, Blizzard trades
higher performance for an increased possibility of data
loss. In these schemes, if a crashed client issued writes
belonging to flush epochs F0, F1, ..., FN, the client will
recover to a virtual disk which contains all writes from
epochs F0...FR; some, all, or no writes from epoch FR+1;
and no writes from epochs FR+2...FN. Denote these
three write segments as the preserved epochs, the questionable epoch, and the disavowed epochs, respectively.
With traditional flush semantics for a local physical disk,
there are no disavowed epochs—clients cannot issue new
writes across a flush barrier if prior writes are outstanding, so, at worst, a client can lose some or all writes
from its last, questionable epoch. With Blizzard’s fast acknowledgment schemes, N − R may be greater than one,
i.e., there may be a questionable epoch and disavowed
epochs.
When operating in fast acknowledgment mode, Blizzard can minimize data loss (i.e., N − R) by issuing writes
as quickly as possible; the log-based write scheme does
precisely this. However, for all three of Blizzard’s write
schemes, unmodified POSIX file systems and applications will always recover to a prefix-consistent version of
the Blizzard drive. Using fast acknowledgments, applications are more likely to expose acknowledged (but currently non-durable) writes to external parties, meaning
that, if the application crashes, it may not be able to recover those externalized writes from the Blizzard drive.
In practice, we do not believe that this is a problem,
since many users are willing to receive high performance
and crash consistency in exchange for potentially externalizing non-durable data [9]. Users that wish to never
externalize non-durable data can run Blizzard in write-through mode.
2.7 Server-side Failure Recovery
Blizzard relies on FDS to recover failed tracts belonging
to a virtual drive. However, FDS clients are responsible
for retrying aborted writes caused by remote disk failures. Thus, if Blizzard detects that an FDS write was
unsuccessful, it must contact the FDS metadata server,
download the new mapping between tracts and remote
disks, and retry the write. If replication is enabled, Blizzard also ensures that each write to a virtual block results
in R successful writes to the R replica disks.
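The sketch below illustrates this client-side retry loop; the fds object, its methods, and the exception type are hypothetical stand-ins rather than the real FDS client API.

```python
class RemoteDiskFailure(Exception):
    """Raised (in this sketch) when a write to a replica's remote disk aborts."""

def write_with_retry(fds, tract_id, offset, data, replicas, max_attempts=10):
    """Keep retrying until the write is durable on all R replica disks."""
    durable = set()
    for _ in range(max_attempts):
        for r in range(replicas):
            if r in durable:
                continue
            try:
                fds.write_replica(tract_id, r, offset, data)
                durable.add(r)
            except RemoteDiskFailure:
                # The replica's disk failed or its tract moved: fetch the fresh
                # tract-to-disk mapping from the metadata server, then retry.
                fds.refresh_tract_locator(tract_id)
        if len(durable) == replicas:
            return True     # the virtual-block write counts as durable only now
    return False
```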
2.8 Coexisting Workloads
Figure 3: A Blizzard virtual disk on Windows.

Blizzard is built atop FDS, and both systems use a shared physical infrastructure of servers, disks, and network equipment. The network provides full-bisection
bandwidth, and FDS uses a request-to-send/clear-to-send
mechanism which ensures that senders cannot overrun
receivers [30]. Thus, neither Blizzard workloads nor
native FDS applications can induce network congestion
at the core or the edge. This lack of congestion does
not guarantee any notion of application-level “network
fairness,” so operators that desire such properties must
use client-side mechanisms like admission control or rate
limiting.
Both Blizzard and FDS use aggressive, randomized
striping across disks. As the aggregate client IO pressure increases, service times at each disk degrade gracefully, since the workload is spread evenly across each
disk (§4.5). In principle, a single disk can store data
for both Blizzard applications and native FDS applications. In practice, we typically allocate a single disk to
either Blizzard or native FDS applications, but not both.
POSIX applications often require low-latency IO in addition to high throughput, so our allocation scheme prevents small POSIX IOs from getting queued behind the
much larger IOs from big-data applications.
3 Implementation
Figure 3 shows the architecture for our Blizzard implementation on Windows. The virtual disk contains two
pieces: a kernel-mode SATA driver, and a user-mode
component which links to the FDS client library. File
systems (and applications which issue raw disk IO) send
IO request packets (IRPs) to the SATA driver. The driver
forwards these requests to the user-mode client, which
translates the requests into the appropriate FDS operations. For IRPs that correspond to reads and writes,
Blizzard issues reads or writes to the appropriate remote disks. Once the client receives a response from a
remote disk, it hands the response to the kernel mode
driver, which then informs the requesting application that
the IRP has completed. To minimize the overhead of
exchanging data across the user-kernel boundary, Blizzard uses Windows’ Advanced Local Procedure Calls
(ALPC); ALPC provides zero-copy IPC using shared
memory pages.
To maximize performance, Blizzard uses asynchronous interfaces to exchange data between the kernel
driver, the FDS client, and the FDS cluster. The usermode component of the virtual disk is multi-threaded,
and it maximizes throughput by handling multiple kernel requests and FDS operations in parallel. Such high
levels of concurrency and asynchrony make it tricky to
preserve the prefix semantics discussed in Section 2.5.
In terms of wall-clock time, reads and writes arrive at
the user-mode component in the order that the kernel issued them. However, the user-mode component handles
each read and write in a separate thread; lacking guidance from the kernel, writes might issue to FDS in a
nondeterministic fashion, based on whichever user-mode
write threads happened to grab more CPU time. To allow the user-mode component to implement prefix semantics, the kernel driver tags each write request with
its flush epoch, its sequence number within that epoch,
and the maximum sequence number for writes from the
previous epoch. This way, different user-mode threads
can use lightweight interlocked-increment operations on
integers to track the number of writes that have issued
for each epoch. Blizzard does eventually require heavyweight locking to add writes to the write queue (§2.5),
but this locking takes place in the latter part of the write
path, leaving the first part contention-free.
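The sketch below illustrates the per-write tags and the per-epoch counting; the names are illustrative assumptions, and Python's dictionary increment stands in for the interlocked (atomic) increments used in the real implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteTag:
    epoch: int              # flush epoch the write belongs to
    seq: int                # sequence number within that epoch (1-based)
    prev_epoch_writes: int  # total number of writes in the previous epoch

class EpochTagger:
    """Runs where the kernel driver would, in the order the kernel issued the IOs."""
    def __init__(self):
        self.epoch, self.seq, self.prev_max = 0, 0, 0

    def tag_write(self):
        self.seq += 1
        return WriteTag(self.epoch, self.seq, self.prev_max)

    def tag_flush(self):
        self.prev_max, self.seq, self.epoch = self.seq, 0, self.epoch + 1

issued = defaultdict(int)   # epoch -> writes handed to FDS so far (atomic in reality)

def note_issued(tag):
    """Record that a write has been handed to FDS; report whether every write from
    the previous epoch has already been issued (needed before that epoch can retire)."""
    issued[tag.epoch] += 1
    return issued[tag.epoch - 1] >= tag.prev_epoch_writes
```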
4 Evaluation
In this section, we use a variety of experiments to
demonstrate that Blizzard provides low-latency, high-throughput IO to unmodified, cloud-oblivious applications. Unless stated otherwise, when we refer to “Blizzard in fast acknowledgment mode,” we refer to the second consistency scheme in Section 2.5, not the log-based
approach.
4.1 Microbenchmarks
Figure 4 depicts the raw performance of a Blizzard virtual disk backed by 128 remote disks and using single
replication. To generate these results, we ran a custom client program that issued asynchronous, block-level
reads and writes to the virtual disk as quickly as possible.
Blizzard was configured in write-through mode, to identify the steady-state performance that Blizzard could provide to a completely IO-bound client. The results show
that, depending on the block size and the segment size,
Blizzard can provide throughputs of 700 MB/s for sequential writes, and over 1000 MB/s for sequential reads, random reads, and random writes.

Figure 4: Throughput microbenchmarks. (a) Segment size: 1. (b) Segment size: 64. (c) Segment size: 128.
From the perspective of a client, increasing the segment size is roughly analogous to adding disks to a private RAID-0 array—it increases the number of disks that
a client can access in parallel, improving both performance and load balancing. As shown in Figure 4(a),
small segment sizes lead to poor sequential IO performance due to convoy dilation effects (§2.3). However,
even for a segment size of one, Blizzard services random
IOs at 400 MB/s or faster. This is because, at any given
time, a random workload accesses more disks than a sequential workload, improving disk parallelism (and thus
aggregate throughput). In the rest of this section, unless
otherwise specified, all experiments use a segment size
of 128, and 128 backing disks.
Increasing the block size beyond 32 KB improves
performance, since disks can fetch more data per seek.
However, increasing the block size beyond 128 KB leads
to diminishing throughput returns.
Figure 5 compares Blizzard’s write latency under several consistency settings and replication levels. For this
experiment, we intentionally included some old, slow
disks that we typically exclude from production workloads. Figure 5 shows that fast acknowledgments dramatically reduce the cost of data redundancy—despite
the presence of known-slow disks, the write latency of
triple replication with fast acknowledgments is 2x–5x
lower than the latency of single replication with write-through semantics. Blizzard clients obviously cannot issue an infinite number of low-latency IOs, since, at some
point, some resource (e.g., the network or a remote disk)
will become saturated. However, our results show that,
until that point is reached, Blizzard’s fast acknowledgments provide very low-latency IOs.

Figure 5: Write latency: Blizzard in write-through mode (1 and 3 replicas) and Blizzard in fast acknowledgment mode (3 replicas).

Figure 6: IOp latency: EBS and Blizzard.
Figure 7: SQL read and write IOs per second: EBS, Blizzard in write-through mode, Blizzard in fast acknowledgment mode, and Blizzard in fast acknowledgment mode with forced epochs every 20 writes. Each pair of circled lines represents the read and write speeds for a particular configuration.

4.2 Blizzard vs. EBS
In this section, we compare Blizzard’s virtual drive to
Amazon’s Elastic Block Store (EBS) drive. Our EBS
deployment used one Amazon EC2 instance (an EBSoptimized m1.xlarge) which had 4 CPUs and 15 GB of
memory, and which ran 64-bit Windows 2008 R2 SP1
Datacenter Edition. The virtual EBS drive was backed
by 12 disks, each of which was provisioned for 100 IOps
and 10 GB of storage; we constructed a RAID-0 striped
volume as the backing storage for the EBS drive. The
disks and the EC2 instance were connected by a 1 Gbps
network connection. To provide a fair comparison to
Blizzard, we configured Blizzard’s virtual drive to use 12
backing disks, and we used FDS’s built-in rate-limiter to
restrict the Blizzard client to 1 Gbps of network bandwidth. The Blizzard client had 4 CPUs and 12 GB of
RAM, similar to the EBS client.
Using a multithreaded synthetic load generator, we
tested IO latency in EBS, Blizzard in write-through
mode, and Blizzard with fast acknowledgments enabled.
Since the load generator did not generate flush requests,
the latter Blizzard configuration provided good performance, but not prefix consistency; this configuration is
still a useful one to investigate, because many people
disable disk flushes for the sake of performance [7, 31].
Figure 6 shows that the read latencies for both Blizzard
schemes were 2x–4x lower than EBS’s latency. Both
Blizzard schemes had similar read latencies because delayed durability tricks do not directly affect the completion times of reads (although a client may generate more
reads per second if writes require less time to complete).
For write latencies, Blizzard with fast acknowledgments was 7x–14x faster than EBS. Blizzard in write-through mode was essentially equivalent to EBS for
small to medium operations, but much faster for 1 MB
writes. It is unclear to us why EBS was so much slower
for large writes. Throughput tests (which we elide due to
space constraints) showed that, even for large IO sizes,
EBS only utilized about 80% of the available network
bandwidth; thus, the EBS client may be exchanging control traffic with other nodes that we cannot see.
Figure 7 shows the performance of sqliosim, a popular SQL benchmark tool, on EBS and several different
configurations of Blizzard. We used sqliosim in a configuration that had several threads doing random IO to 4
databases in parallel. Each database was 4 GB in size,
with its own log file. In the background, read-ahead
and bulk updates occurred. Note that sqliosim uses
write-through IO, not flushes, to provide consistency, so
Blizzard with fast acknowledgment simply writes data
to remote disks as quickly as possible and provides no
crash consistency.

Figure 8: Macrobenchmarks: Blizzard’s virtual drive (write-through, single replication) versus a local physical drive.

We also ran a variant of fast acknowledgment mode which inserted a fake flush every
20 writes. That variant, which immediately acknowledges writes but bounds data loss to 20 writes, performs
slower than Blizzard with write-through or vanilla fast
acknowledgments; this is because the fake flushes reduced the amount of write parallelism. Nonetheless, all
three versions of Blizzard issued significantly more IOps
than EBS.
4.3 Macrobenchmarks
Figure 8 shows Blizzard’s performance for several IO-intensive Win32 workloads; Blizzard was configured in
write-through mode and used single replication. The
WIM-create workload represents the time needed to
generate a bootable WinPE [29] .iso file using the
Windows Automated Installation Kit (a WinPE image is a minimal Windows installation that is useful
for creating recovery CDs and other diagnostic tools).
JetStress [23] is a seek-intensive application with 16
threads that emulates the IO load of an Exchange email
server. Gzip is a file compression program that uses four
IO threads. Virus scan represents the throughput of a
full system analysis performed by System Center 2012
Endpoint. DirCopy is derived from the built-in Windows robocopy tool, and it recursively copies directories using eight threads. Grep is an eight-way threaded
program that evaluates regular expressions over file data.
SQL refers to the sqliosim tool that database administrators use to characterize disk performance. By default, the
tool launches eight threads that perform random database
queries, eight threads that perform sequential queries,
and eight threads that perform bulk database updates.
In Figure 8, throughput numbers refer to application-level performance, not disk-level performance. For example, Gzip throughput refers to how many MBs of file
data the program can compress per second. Similarly,
SQL throughput refers to how quickly the SQL engine can
read or write the database. Figure 8 shows that Blizzard
can improve unmodified applications’ performance by a
factor of 2x–10x. Blizzard provides the greatest boost to
programs like DirCopy and Grep which spend very little
time on CPU operations.

Figure 9: Exchange throughput. For the log-based commit results, Blizzard’s block size was set to 64 KB, to match the size of Exchange’s transactional IOs. Other Blizzard configurations used a block size of 128 KB.
4.4 Log-based Commit
To test the performance of Blizzard’s out-of-order, log-based commit scheme (§2.5), we ran several tests involving the JetStress tool, which emulates the workload for
an Exchange email server. Unlike sqliosim, which issues write-through operations to implement consistency,
JetStress uses disk flushes. As shown in Figure 9,
Blizzard in write-through mode with single replication
provides a 4x throughput improvement over a local physical disk, and Blizzard in single replicated, fast acknowledgment mode provides a 9x improvement. Using single
replication, Blizzard’s log-based, out-of-order commit
scheme was only 3% slower than single-replicated fast
acknowledgment, despite occasionally needing to perform read-before-writes (§2.5.3). The triple-replicated
log scheme was only 5% slower, since Blizzard could
hide much of the latency associated with slow replicas
(§4.1).
Log-based commits do not provide faster throughput
or lower IO latency than simple fast acknowledgments
because both schemes acknowledge writes and flushes
immediately. However, the log-based scheme issues all
writes immediately, whereas the simple scheme issues
writes in epoch order, waiting for writes from epoch N to
commit before issuing writes from epoch N+1. Thus, the
simple scheme is more prone to data loss in the case of a
client crash, since it buffers more writes in-memory than
the log-based scheme (§4.7). The increased buffering requirement will also cause client-submitted IO requests to
block more often, until Blizzard can deallocate memory
belonging to newly retired epochs.
4.5 Multiple Active Clients
Figure 10: CDF of IOp latency (VDI workloads). The legend format is XXX-YYY, where XXX is the number of clients and YYY describes the client IOp rates. There were 130 remote disks.

Blizzard clients stripe their data across a shared set of disks. As the number of active clients grows, the aggregate request pressure on the disks increases. We designed
nested striping (§2.3) and log-based permutation maps
(§2.5) so that clients would spread their requests across
multiple disks, preventing hotspots from emerging and
leading to graceful degradation of disk IO queues. To
test our design, we issued IO requests using a synthetic
load generator that simulated a variable number of clients
with varying levels of IOs per second (IOps). Each emulated client received 10 Gbps of network connectivity,
and each client’s virtual user was marked as “Light,”
“Normal,” “Power,” or “Heavy” based on its IOps rate (5,
10, 20, or 40 respectively). The size of each IOp ranged
from 512 bytes to 1 MB, with the statistical distribution
of IOp sizes and read/write ratios governed by empirical studies of VDI workloads [12, 13]. Note that a single
high-level IOp resulted in multiple FDS operations if the
IOp was larger than a Blizzard virtual block.
Figure 10 provides a CDF of IOp latencies for several different client deployments; to measure true IOp
latencies, Blizzard was operated in write-through mode.
The “Medium” deployment was 10% Light, 50% Normal, 25% Power, and 15% Heavy. The “High” client
split was 10%/30%/40%/20%, and the pessimistic “AllHeavy” split was 0%/0%/0%/100%. In all cases, there
were 130 remote disks. Figure 10 shows that, with
100 clients (i.e., more than one disk per client) and 130
clients (exactly one disk per client), Blizzard provides
IOp latencies that are competitive with those of a local
physical disk: at least 62% of IOps had 5 ms of latency
or lower, and 85% of IOps had 10 ms of latency or lower.
Interestingly, latencies in the 130 client test were
slightly lower than those in the 100 client test, e.g., in
the “High” IOps test, 65.7% of IOps in the 130 node
deployment had 5 ms or less of latency, but this was
true for only 61.9% of IOps in the 100 node test. The
reason is that nested striping distributes the aggregate
client load evenly across all disks. Thus, each disk sees
a random stream of seek offsets. When the number
of clients increased from 100 to 130, disk queues got
deeper, but this was better for disks that were serving
random workloads—as each disk arm swept across its
platters, there were more opportunities to service queued
IO requests. Of course, longer disk queues will eventually increase IOp latency, as demonstrated by the pessimistic “AllHeavy” deployment which had 175 clients
(i.e., 1.34 clients for each disk) and 40 IOps per client.
Even in this case, 56% of IOs had latencies no worse
than 5 ms, and 77% had latencies no worse than 10 ms.

Figure 11: Comparing Blizzard’s performance in our deployed network (far left) and our deployed network with synthetic latencies added.
4.6 Latency Sensitivity
Blizzard is designed for full-bisection networks in which
clients have fast, low-latency connections to remote
disks. For example, in our current deployment, clients
and storage nodes communicate via links with 500 microseconds of latency. This is an order of magnitude
smaller than the seek time for a disk, allowing Blizzard
to make network-attached disks as fast to access as local
ones. Figure 11 depicts Blizzard’s performance with synthetic network latencies added, and with write-through
semantics enabled (i.e., the virtual drive does not acknowledge writes to clients until those writes are durable
on remote disks). With five milliseconds of additional
latency, Blizzard’s throughput drops by a factor of 5x–
10x, and with twenty milliseconds of additional latency,
performance is essentially equivalent to that of a single
local disk.
These experiments highlight the importance of Blizzard’s congestion-free, full-bisection bandwidth network. In such a system, network delays are negligible,
and the client-perceived latency for a write-through IO
is governed by the time needed to perform a single seek
on a remote storage server (see Figure 10). Figure 11
shows that if the storage network lacks a fast interconnect, then millisecond-level network latencies effectively
double or triple the seek penalty for accessing a remote
disk. In these scenarios, Blizzard’s fast acknowledgment
schemes are crucial for eliminating remote access penalties from the critical path of writes.
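As a rough illustration of this effect: if a remote IO costs about 8 ms of disk positioning time and the injected delay is paid on both the request and the acknowledgment, then our deployed network's 0.5 ms links add little (roughly 9 ms per write-through IO), 5 ms of added latency roughly doubles the cost to about 19 ms, and 20 ms of added latency pushes it near 49 ms, at which point the network rather than the disk dominates every IO. (The 8 ms figure is an assumed, representative seek-plus-rotation time, not a measured one.)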
Write workload                       Write-through   FastACK   FastACK + log
Only new physical blocks targeted    50/50           50/50     50/50
50% new targets, 50% overwrites      50/50           50/50     50/50
JetStress                            50/50           50/50     50/50

Figure 12: Crash-consistent recoveries out of injected crashes, per write workload and commit mode. For all 450 injected crashes, Blizzard recovered to a prefix-consistent version of the virtual disk.
4.7 Reliability
In this section, we demonstrate two things. First,
Blizzard always recovers a crashed virtual drive to a
prefix-consistent state. Second, if a Blizzard client
primarily issues block-aligned writes for entire blocks of
data, the client should use log-based commit to reduce
data loss and decrease memory pressure. If clients
frequently issue writes that are misaligned, or not an
even multiple of Blizzard’s block size, clients should use
the simple fast acknowledgment scheme (if they wish to
maximize performance), or write-through mode (if they
wish to minimize data loss).
Recovering to consistent virtual drives: To test Blizzard’s reliability in the presence of crashes, we modified the user-mode portion of the virtual driver so that
it randomly injected emulated failures. At each emulated failure point, the driver logged the set of writes that
were buffered in memory, as well as the subset of those
writes that had been issued to remote disks, but not yet
acknowledged as being durable. Writes that are buffered
but unissued at crash time are lost. Writes that are issued
but unacknowledged at crash time may or may not be
durable—their durability depends on whether the client
OS had actually sent the writes over the network before
the crash, and whether remote disks crashed while handling the writes, and so on.
We used three synthetic workloads to explore Blizzard’s reliability. Our first workload issued 40 writes
per second, such that each write targeted a previously
unwritten portion of the virtual disk. Our second workload also issued 40 writes per second, but 50% of the
writes targeted new locations, and 50% targeted previously written blocks. Note that a write-only workload
of 40 IOps is intense for a POSIX application [12, 13].
We configured Blizzard to use a block size of 128 KB,
with half of the writes being 64 KB in size, and the
other half being 128 KB, ensuring that, when Blizzard
was run in log-based commit mode, the read-before-write
code paths would be stressed. Our final workload was
JetStress [23], a seek-intensive workload that simulates
an Exchange email server. The JetStress tests also used a Blizzard block size of 128 KB, and in all experiments, the virtual drive used 128 backing disks.

Figure 13: Blizzard loss rates when the size of each write is aligned with Blizzard's block size. "wpf" means "writes per flush." (Series: FastACK with wpf of 1 or 4 and block sizes of 4 KB or 64 KB, and FastACK+log with block sizes of 4 KB or 64 KB; x-axis: IOps; y-axis: average data loss per crash, in KB.)

Figure 14: Blizzard loss rates for a simulated write-only VM workload (writes vary between 4 KB and 1 MB in size). Note that the y-axis is log-scale. (Series: FastACK with wpf of 1 or 4, and FastACK+log with block sizes of 4 KB or 16 KB; x-axis: IOps; y-axis: average data loss per crash, in KB.)
Figure 12 shows the results of our experiments.
For all 450 injected crashes, Blizzard recovered a
prefix-consistent version of the virtual drive. To validate
whether the recovered drive was prefix-consistent,
we ran Blizzard’s recovery code, and then used the
write log to verify that the recovered disk contained a
prefix-consistent representation of the write stream.
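The check below is a simplified sketch of this validation under the assumption that each logged write targets a distinct region of the virtual disk; the real validation must also account for overlapping writes and flush boundaries.

```python
def is_prefix_consistent(recovered, write_log):
    """Return True if the writes present in the recovered image form a prefix
    of the acknowledged write stream.  `recovered` maps offset -> bytes and
    `write_log` is the ordered list of (offset, data) writes acknowledged
    before the crash.  Assumes each write targets a distinct offset."""
    missing_seen = False
    for offset, data in write_log:
        present = recovered.get(offset) == data
        if present and missing_seen:
            return False  # a later write survived while an earlier one did not
        if not present:
            missing_seen = True
    return True
```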
Bounding data loss: In fast acknowledgment mode
and log-based commit mode, Blizzard accepts a risk of data loss in exchange for higher performance. A recovered virtual
drive is always consistent, but the drive may not contain a trailing set of writes that Blizzard acknowledged
to the client but failed to make durable before the crash
occurred. To measure this data loss, we ran two additional experiments.
In the first experiment, we generated a synthetic
stream of write operations. Each write was the exact
size of Blizzard’s block size, and it was aligned with a
block boundary, ensuring that, when Blizzard used log-based commits, there was no read-before-write penalty
(§2.5.3). Using our instrumented Blizzard driver, we
picked random moments to simulate crashes, and logged
the amount of buffered, unacknowledged data that would
be lost in the simulated crash. Our load generator also injected a configurable number of flush requests. Figure 13
shows the results. Each data point represents 100 simulated crashes.
As expected, the loss rate increases as the IO rate increases, because Blizzard must buffer more data. With
one write per flush, i.e., with a total ordering over all
writes, the simple fast acknowledgment scheme can only
have one outstanding write to a remote disk. This
severely limits the rate at which writes can retire, and it
increases the memory pressure needed to buffer writes.
With a write size of 4 KB, the fast acknowledgment
scheme can only handle 100 IOps before too many writes
queue up, and the virtual drive throttles the client to
bound potential data loss; with a write size of 64 KB, the
scheme can only handle 50 IOps before it throttles the
client. The log-based commit scheme can issue writes
immediately, regardless of the flush rate, so this scheme
can gracefully scale up to 300 IOps.
For a flush rate of once-every-four writes, the fast acknowledgment scheme does much better, scaling all the
way to 300 IOps. With a wider flush epoch, the fast acknowledgment scheme can issue more writes in parallel,
and when a write from a new epoch arrives, old writes
from the prior epoch are likely to already be committed, or at least issued; thus, the new write is unlikely to
be delayed by a full seek time on a remote disk. With a
write size of 4 KB, the difference in average loss rates between the two schemes is large in relative terms, but
small in absolute value—at 300 IOps, the fast acknowledgment scheme loses 4.1 writes representing 16.4 KB
of data, and the log-based scheme loses 2.4 writes representing 11.5 KB of data. For a larger write size of 64 KB,
the data loss per dropped write increases, and the two
schemes show bigger differences in absolute amounts of
data loss. For IOp rates above 100, the log-based scheme
decreases loss rates by 12–30%. For example, at 300
IOps with 64 KB writes, the log-based scheme loses 191
KB per crash, whereas the simple fast acknowledgment
scheme loses 272 KB.
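The gating rule implied by this behavior can be sketched as follows: a write from flush epoch e is issued only once every write from earlier epochs has been acknowledged as durable, so one write per flush degenerates to a single outstanding write, while four writes per flush keeps an epoch's four writes in flight concurrently. The sketch below is our reading of that behavior, not Blizzard's actual implementation.

```python
from collections import deque

def issue_to_remote_disk(write):
    """Hypothetical stand-in for handing a write to a remote FDS disk."""
    pass

class FastAckIssuer:
    """Sketch of an epoch-gated issue rule consistent with the behavior above."""

    def __init__(self):
        self.pending = deque()  # (epoch, write), submitted in epoch order
        self.unacked = {}       # epoch -> count of issued, not-yet-durable writes

    def submit(self, epoch, write):
        self.pending.append((epoch, write))
        self._drain()

    def on_durable(self, epoch):
        self.unacked[epoch] -= 1
        if self.unacked[epoch] == 0:
            del self.unacked[epoch]
        self._drain()

    def _drain(self):
        while self.pending:
            epoch, write = self.pending[0]
            if any(e < epoch for e in self.unacked):
                break  # an older epoch still has un-durable writes
            self.pending.popleft()
            self.unacked[epoch] = self.unacked.get(epoch, 0) + 1
            issue_to_remote_disk(write)
```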
Blizzard's log-based commit scheme pays a read-before-write penalty for writes that are smaller than Blizzard's block size. If writes are frequently misaligned
(or not even multiples of the block size), the log-based
scheme will force many writes to wait for synchronous
reads. This will increase buffering requirements and the
issue latencies for writes, causing loss rates after a crash
to increase. Figure 14 shows this effect. In this example,
the writes are aligned with Blizzard’s block boundaries,
but they range in size from 4 KB to 1 MB, as determined
by empirical distributions of VM write sizes [12, 13]. In
these experiments, the log-based scheme with a block
size of 16 KB could only handle up to 50 IOps—beyond
that, the read-before-write penalty forced Blizzard to
throttle the client’s write rate. Decreasing the block size
to 4 KB resulted in fewer read-before-writes, allowing
the log-based scheme to scale better. At 75 IOps, the
log-based scheme beat the fast acknowledgment scheme
by 37%, with a data loss of 135 KB instead of 214
KB. However, at 100 IOps, the log-based scheme can no
longer hide the read-before-write penalties, and it has an
average data loss of 54 MB, an order of magnitude worse
than fast acknowledgments with a wpf of 1, and two orders of magnitude worse than fast acknowledgments with
a wpf of 4.
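The read-before-write path itself is simple to sketch (illustrative only: read_block and append_to_log are hypothetical helpers, the block size is assumed, and writes spanning multiple blocks are ignored):

```python
BLOCK = 64 * 1024  # assumed block size for this sketch (bytes)

def log_commit(offset, data, read_block, append_to_log):
    """Append a full block to the log.  Aligned, full-block writes go straight
    to the log; partial writes must first read the rest of the block."""
    start = offset - (offset % BLOCK)
    if offset == start and len(data) == BLOCK:
        append_to_log(start, data)  # no read-before-write penalty
        return
    old = read_block(start)                          # synchronous read
    lo = offset - start
    merged = old[:lo] + data + old[lo + len(data):]  # splice in the new bytes
    append_to_log(start, merged)
```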
5 Related Work
Block-level interfaces: A variety of protocols use
a block interface to expose a single disk to remote
clients. Examples of such protocols include ATA-over-Ethernet [22] and iSCSI [36]. Blizzard extends the simple block interface, mapping each virtual drive to multiple backing disks, and providing high-level software
abstractions like replication and failure recovery across
thousands of disks.
Like Blizzard, Petal [24] defines a distributed,
software-implemented virtual disk that is backed by remote storage. However, Blizzard can coexist with traditional big-data workloads, and Blizzard leverages a full-bisection bandwidth network to stripe data more aggressively than Petal; this more aggressive striping exposes Blizzard clients to higher levels of disk parallelism. Blizzard also leverages
delayed durability semantics to increase the rate at which
clients can issue writes while still achieving crash consistency.
Salus [43] is another example of a virtual block store.
Salus is built atop HDFS/HBase [5, 6], and it provides
ordered-commit semantics during normal operation, and
prefix semantics when failures occur. Salus achieves
these properties with pipelined commit, a protocol
that resembles two-phase commit. In contrast, Blizzard achieves consistency with only a single round of
communication between the client and the remote data
stores. This reduces both network traffic and software
complexity.
Mapping schemes: Using techniques like nested
striping (§2.3) and deterministic permutations (§2.5),
Blizzard translates virtual block accesses to FDS-level
block accesses. This is similar to how SSDs use a
Flash Translation Layer (FTL) to map virtual blocks to
physical ones. SSDs employ a variety of optimizations
to minimize the size of the mapping table that is kept
in the SSD’s small, on-board SRAM (e.g., [18, 45]).
While these optimizations could be applied to Blizzard’s
mapping structures, we have not found a need for such
approaches, since Blizzard’s tables are small (∼20 MB)
and they easily fit within main memory. FTLs must also
implement compaction and garbage collection, since
SSD writes and SSD erases have different data sizes.
Blizzard’s log-based commit scheme avoids compaction
and garbage collection by using equivalent sizes for
writes and erases in the log.
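To make the FTL analogy concrete, the sketch below shows a generic permutation-based translation from virtual blocks to (disk, block) pairs; it illustrates the general idea only and is not Blizzard's actual nested-striping or log-based permutation-map algorithm (§2.3, §2.5).

```python
import random

def build_permutation(num_disks, seed=0):
    """Illustrative random permutation of disk indices; Blizzard derives its
    permutations differently (see §2.5 and the linear congruence method [44])."""
    perm = list(range(num_disks))
    random.Random(seed).shuffle(perm)
    return perm

def virtual_to_physical(vblock, perm):
    """Stripe consecutive virtual blocks across the permuted disk order."""
    num_disks = len(perm)
    disk = perm[vblock % num_disks]
    block_on_disk = vblock // num_disks
    return disk, block_on_disk
```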
Cloud-scale storage systems: BlueSky [42] uses NFS
or CIFS proxies to expose commercial cloud storage like
Azure to enterprise clients. BlueSky allows an enterprise
to offload the administrative costs of the storage cluster
to a third party. However, compared to systems in which
clients and servers reside in the same cloud, BlueSky introduces several new sources of overhead. Pulling data
from commercial cloud servers injects wide-area latencies into the IO path. BlueSky proxies also use the
chatty, text-based HTTP protocol to communicate with
remote servers. The HTTP overhead is magnified if the
enterprise has asymmetric upload/download speeds to
the cloud—slower upload speeds mean that even small
HTTP headers can add significant transfer delays [16].
Unlike BlueSky’s focus on WAN access to cloud
storage, Parallel NFS (pNFS) exposes clients to local
(i.e., on-premise) cloud storage [38]. Storage servers
can export a block interface, an object interface, or a
file interface; pNFS clients transparently convert NFS
requests to the appropriate lower-level access format.
pNFS requires applications to adhere to NFS semantics,
and both pNFS and BlueSky lack key Blizzard features
like role-based striping and asynchronous epoch-based
commits for writes.
Desktop/server file systems: OptFS demonstrated how
to decouple durability from ordering in the context
of a journaling file system [7]. Blizzard shows that
these ideas can be applied at the disk level, providing
OptFS-style performance improvements in a file-system-agnostic way that does not depend on knowledge of the
file system’s consistency scheme (e.g., journaling [40]
or shadow paging [21, 35]). OptFS requires disks to
be modified to provide asynchronous durability notifications; in contrast, Blizzard’s virtual disk implements
prefix consistency using standard, asynchronous write-through operations on the backing remote disks.
When Blizzard uses log-based commit, it leverages
expanded blocks to enable crash recovery and disconnected operation. Transactional Flash [34] and
backpointer-based consistency [8] also embed extra
metadata in out-of-band areas.
BPFS [10] is a file system for use with byte-addressable, persistent memory hardware (e.g., Phase
Change Memory). BPFS introduces an abstraction,
called an epoch barrier, that allows ordering guarantees
to be expressed without requiring an immediate flush of
dirty data in the CPU cache. Epoch barriers provide data
consistency while preserving the ability of the memory
controller to reorder writes within an epoch. Epoch barriers require custom hardware, and BPFS expects that the
persistent memory resides directly on the memory bus.
Like BPFS, Blizzard also separates ordering from durability; however, the separation is implemented in the context of a distributed system, rather than a single machine
with access to persistent memory.
The Zebra file system [19] combines ideas from
RAID [32] and log-based file systems [35], striping a
per-client file log across a RAID array. Zebra does not
provide mechanisms for asynchronous flush handling,
and this constrains the level of disk parallelism that Zebra can provide to applications. Zebra uses compaction
and garbage collection to manage dead block data; when
such log cleaning occurs, it can introduce unpredictable
performance fluctuations [37]. In contrast, when Blizzard operates in log-based asynchronous commit mode,
it uses reads-before-writes to only commit full blocks of
data. This smooths out the background IO traffic that
is required for log maintenance. However, Section 4.7
demonstrates that if clients frequently issue misaligned
writes, Blizzard's read-before-write penalty can be large,
making Blizzard’s simple fast acknowledgment scheme
more attractive.
6 Conclusions
Blizzard exposes unmodified, cloud-oblivious POSIX
applications to a fast, cloud-scale block store. This block
store, which clients mount as a virtual disk, efficiently
supports small, random IOs, but it coexists alongside
big-data file systems, and deploys atop the same servers,
disks, and switches. Using a network with full-bisection
bandwidth, Blizzard provides clients with fast access to
any remote disk. Using a novel striping scheme, Blizzard
maximizes disk parallelism, avoids disk hotspots, and reduces IOp convoy dilation. By carefully ordering writes,
Blizzard can immediately acknowledge flush requests
while still providing crash consistency; with fewer write
barriers, clients can issue writes faster, and better leverage the spindle parallelism of remote storage. A Blizzard prototype improves the speed of unmodified POSIX
applications by up to an order of magnitude. In summary, Blizzard makes it much easier for cloud-agnostic
POSIX applications to receive cloud-scale performance
and availability.
References
[1] Amazon. Amazon EBS-Optimized Instances. AWS Documentation. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html. October 15, 2013.

[2] Amazon. Amazon Elastic Block Store (EBS). http://aws.amazon.com/ebs/. 2014.

[3] Amazon. Amazon Relational Database Service (Amazon RDS). http://aws.amazon.com/rds/. 2014.

[4] Amazon. Amazon Simple Email Service (Amazon SES). http://aws.amazon.com/ses/. 2013.

[5] Apache. Apache HBase. http://hbase.apache.org. 2014.

[6] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf. 2007.

[7] V. Chidambaram, T. Pillai, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Optimistic Crash Consistency. In Proceedings of SOSP, pages 228–243, Farmington, PA, November 2013.

[8] V. Chidambaram, T. Sharma, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Consistency Without Ordering. In Proceedings of FAST, pages 101–116, San Jose, CA, February 2012.

[9] J. Cipar, G. Ganger, K. Keeton, C. Morrey III, C. Soules, and A. Veitch. LazyBase: Trading Freshness for Performance in a Scalable Database. In Proceedings of EuroSys, pages 169–182, Bern, Switzerland, April 2012.

[10] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of SOSP, pages 133–146, Big Sky, MT, 2009.

[11] A. Edwards and B. Calder. Exploring Windows Azure Drives, Disks, and Images. Microsoft. http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/28/exploring-windowsazure-drives-disks-and-images.aspx. June 27, 2012.

[12] D. Feller. Virtual Desktop Resource Allocation. The Citrix Blog. http://blogs.citrix.com/2010/11/12/virtual-desktop-resourceallocation. November 12, 2010.

[13] R. Fellows. Storage Optimization for VDI. Tutorial: Storage Networking Industry Association. http://www.snia.org/sites/default/education/tutorials/2011/fall/StorageStorageMgmt/RussFellowsSNW_Fall_2011_VDI_best_practices_final.pdf. 2011.

[14] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. ACM SIGOPS Operating Systems Review, 37(5):29–43, December 2003.

[15] Google. Google Cloud SQL. https://cloud.google.com/products/cloud-sql. 2014.

[16] Google. Performance Best Practices: Minimize request overhead. https://developers.google.com/speed/docs/best-practices/request. March 28, 2012.

[17] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In Proceedings of SIGCOMM, pages 51–62, Barcelona, Spain, 2009.

[18] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: a Flash Translation Layer Employing Demand-Based Selective Caching of Page-Level Address Mappings. In Proceedings of ASPLOS, pages 229–240, Washington, DC, March 2009.

[19] J. Hartman and J. Ousterhout. The Zebra Striped Network File System. ACM Transactions on Computer Systems, 13(3):274–310, 1995.

[20] D. Hildebrand, A. Nisar, and R. Haskin. pNFS, POSIX, and MPI-IO: A Tale of Three Semantics. In Proceedings of the Workshop on Petascale Data Storage, pages 32–36, Portland, OR, November 2009.

[21] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the USENIX Winter Technical Conference, San Francisco, CA, January 1994.

[22] S. Hopkins and B. Coile. AoE (ATA over Ethernet). http://support.coraid.com/documents/AoEr11.txt. February 2009.

[23] N. Johnson. JetStress 2010: JetStress Field Guide. Microsoft. http://gallery.technet.microsoft.com/Jetstress-Field-Guide1602d64c. March 27, 2012.

[24] E. Lee and C. Thekkath. Petal: Distributed Virtual Disks. ACM SIGOPS Operating Systems Review, 30(5):84–92, December 1996.

[25] A. Leung, S. Pasupathy, G. Goodson, and E. Miller. Measurement and Analysis of Large-Scale Network File System Workloads. In Proceedings of USENIX ATC, pages 213–226, Boston, MA, June 2008.

[26] Microsoft. Introducing Windows Azure SQL Database. http://msdn.microsoft.com/en-us/library/windowsazure/ee336230.aspx. 2014.

[27] Microsoft. Windows Azure Active Directory. http://www.windowsazure.com/en-us/services/active-directory/. 2014.

[28] Microsoft. Windows Azure Storage Scalability and Performance Targets. Windows Azure Documentation. http://msdn.microsoft.com/en-us/library/windowsazure/dn249410.aspx. June 20, 2013.

[29] Microsoft. Windows PE Technical Reference. http://technet.microsoft.com/en-us/library/dd744322(WS.10).aspx. October 22, 2009.

[30] E. Nightingale, J. Elson, O. Hofmann, Y. Suzue, J. Fan, and J. Howell. Flat Datacenter Storage. In Proceedings of OSDI, pages 1–15, Hollywood, CA, October 2012.

[31] E. Nightingale, K. Veeraraghavan, P. Chen, and J. Flinn. Rethink the Sync. In Proceedings of OSDI, pages 1–14, November 2006.

[32] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). ACM SIGMOD Record, 17(3):109–116, 1988.

[33] V. Prabhakaran, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Analysis and Evolution of Journaling File Systems. In Proceedings of USENIX ATC, pages 105–120, Anaheim, CA, April 2005.

[34] V. Prabhakaran, T. Rodeheffer, and L. Zhou. Transactional Flash. In Proceedings of OSDI, pages 147–160, San Diego, CA, December 2008.

[35] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, 10(1):26–52, February 1992.

[36] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner. Internet Small Computer Systems Interface (iSCSI). Technical report, RFC 3720, April 2004.

[37] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan. File System Logging Versus Clustering: A Performance Comparison. In Proceedings of the USENIX Winter Technical Conference, pages 249–264, New Orleans, LA, January 1995.

[38] S. Shepler, M. Eisler, and D. Noveck. Network File System (NFS) Version 4 Minor Version 1 Protocol. Technical report, RFC 5661, January 2010.

[39] M. Steigerwald. Imposing Order. Linux Magazine, May 2007.

[40] S. Tweedie. Journaling the Linux ext2fs File System. In Proceedings of the Fourth Annual Linux Expo, Durham, North Carolina, May 1998.

[41] W. Vogels. File system usage in Windows NT 4.0. In Proceedings of SOSP, pages 93–109, Kiawah Island Resort, SC, December 1999.

[42] M. Vrable, S. Savage, and G. Voelker. BlueSky: A Cloud-Backed File System for the Enterprise. In Proceedings of FAST, pages 237–250, San Jose, CA, 2012.

[43] Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus Scalable Block Store. In Proceedings of NSDI, pages 357–370, Lombard, IL, April 2013.

[44] E. Weisstein. Linear Congruence Method. MathWorld: A Wolfram Web Resource. http://mathworld.wolfram.com/LinearCongruenceMethod.html. 2014.

[45] Y. Zhang, L. Arulraj, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. De-indirection for Flash-based SSDs with Nameless Writes. In Proceedings of FAST, pages 1–16, San Jose, CA, February 2012.