The Tao of Systems: Doing Nothing Well

Sara Alspaugh, Arka Bhattacharya, David Culler, Randy Katz
University of California, Berkeley
Abstract

We propose a new design principle for the systems community: "do nothing well," in addition to "make the common case fast." This means designing for power savings as well as performance gains. The key to saving power is harnessing idleness: understanding and managing idleness as a first-class characteristic of workloads. In this paper we study the idleness in a diverse range of workloads. We show how well-known techniques for improving performance can be recast as tools for managing idleness, and thereby saving power.

1 Introduction

One of the key tenets of system design is: in making a design trade-off, favor the frequent case over the infrequent case. As a systems community, we focus on optimizing for performance, making the common case fast [10]. But the common case in real systems is doing nothing, that is, being idle, as shown in Figure 1. Being fast does not address this: you can only do something faster, and doing something faster only increases the amount of nothing the system does.

Figure 1: A 45-day trace of a Facebook data analytics cluster running Hadoop MapReduce. The system is under-utilized whenever the load is below peak. Thus, if the darkspace represents load, then the whitespace above represents idleness.

This highlights what we as system designers have been doing wrong. Figure 2 shows how we spend the majority of our focus on performance at high utilization levels, when in reality the majority of system time is spent doing nothing [2].

Figure 2: Power consumption and response latency of an Atom processor-based node as utilization increases. Although the majority of actual time is spent in region (a), the majority of system designers' attention is spent in region (b).

Even though the common case is doing nothing, doing nothing still consumes power, often more than half of the peak. If we are to design for the common case from a power perspective, we need to "do nothing well." This requires understanding idleness as well as we understand performance, and optimizing for the structure that we find there.

In this paper we examine what it means to exploit the structure of idleness in a workload, just as we exploit temporal and spatial locality. We take a particular focus on storage systems, though the principle applies broadly.
We explore the idleness in a diverse range of workloads
and show how much is wasted by doing nothing poorly.
We then consider storage system techniques to reduce
this waste, such as workload aggregation, write-logging,
and caching. We see that these energy reduction techniques are really about manipulating idleness in order to
harness it for energy savings.
2 The usefulness of what is not
In all systems, capacity planning is a function of the peak load demanded, while performance is a function of the average load. The discrepancy between the two is idleness; this implies that idleness is driven by variation in the workload. The whitespace in Figure 1 is the idleness in one particular workload.

Table 1 describes the set of workloads whose idleness we studied. Table 2 gives the degree of idleness, or whitespace, in each: the peak less the average, divided by the peak. We translate this into waste in order to capture the overhead. In a perfectly provisioned world, waste would be equivalent to peak over average. In reality, systems are vastly overprovisioned, so the waste is provisioned over average; provisioned over peak describes the overprovisioning factor.
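Written out explicitly (using L_peak, L_avg, and L_prov as our own shorthand for a trace's peak, average, and provisioned load), the columns of Table 2 are:

\[
\text{whitespace} = \frac{L_{\mathrm{peak}} - L_{\mathrm{avg}}}{L_{\mathrm{peak}}}, \qquad
\text{waste}_{\mathrm{peak}} = \frac{L_{\mathrm{peak}}}{L_{\mathrm{avg}}}, \qquad
\text{waste}_{\mathrm{prov}} = \frac{L_{\mathrm{prov}}}{L_{\mathrm{avg}}}, \qquad
\text{overprovisioning factor} = \frac{L_{\mathrm{prov}}}{L_{\mathrm{peak}}}.
\]

For example, nfs-corp's peak-to-average ratio of 59.8 gives a whitespace of 1 - 1/59.8, or about 98.3%, as reported in Table 2.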
Label | Type | Origin | Environment | Hardware | Duration | Reference
nfs-corp | CIFS (NFS) | NetApp, 2007 | data for marketing, sales, and finance departments in a corporate DC | 3 TB NetApp filer, almost full | 1 day | [7]
nfs-uni | NFS | Harvard, 2001 | home directories for a subset of campus users | three NFS servers hosting a total of fourteen 53 GB disk arrays | 1 day | [4]
block-hm | block-level | Microsoft, 2007 | enterprise DC server for hardware monitoring | RAID-1 boot volume, RAID-5 data volumes as DAS, 13 disks on average | 1 week | [9]
block-prn | block-level | Microsoft, 2007 | enterprise DC print server | same as block-hm | 1 week | [9]
block-rsrch | block-level | Microsoft, 2007 | enterprise DC server hosting research projects | same as block-hm | 1 week | [9]
block-stg | block-level | Microsoft, 2007 | enterprise DC web staging server | same as block-hm | 1 week | [9]
block-wdev | block-level | Microsoft, 2007 | enterprise DC test web server | same as block-hm | 1 week | [9]
block-web | block-level | Microsoft, 2007 | enterprise DC web / SQL server | same as block-hm | 1 week | [9]
http-wiki | HTTP | Wikipedia | Wikipedia requests (subsampled at 10%) | unknown | 3 hours | [13]
mapreduce-fb | MapReduce | Facebook, 2009 | data analytics cluster | unknown | 45 days | —

Table 1: Summary of workloads examined and their characteristics.

Workload | Percent whitespace | Peak-to-average waste | Provision-to-average waste | Overprovisioning factor
nfs-corp | 98.3% | 59.8 | 249.6 | 4.2
nfs-uni | 92.4% | 13.2 | 20.4 | 1.6
block-hm | 98.6% | 69.0 | 166.5 | 2.4
block-prn | 99.2% | 119.1 | 205.8 | 1.7
block-rsrch | 99.5% | 183.1 | 1989.5 | 10.9
block-stg | 95.8% | 23.7 | 158.6 | 6.7
block-wdev | 84.3% | 6.4 | 11048.4 | 1726.3
block-web | 98.2% | 55.2 | 517.7 | 9.4
http-wiki | 11.1% | 1.1 | unknown | unknown
mapreduce-fb | 99% | 2129 | unknown | unknown

Table 2: Across a diverse range of system types, the amount of idleness and waste is large. Peak and provisioned capacity exceed the average required by one to five orders of magnitude. Provisioning was considered in terms of bytes per second (bandwidth) rather than bytes (capacity); high capacity provisioning results in even greater waste.
Figure 3: The inherent idleness in two of the workloads examined. MSR corresponds to the block-web trace, and Wikipedia
corresponds to the http-wiki trace. The top graph shows that
most idle periods are short, but by weighting them by their
length, as in the bottom graph, we see there is actually a great
deal of inherent idleness.
Overprovisioning is driven by a number of things. One is that provisioning is often done at an organizational or logical level, and is thus decoupled from peak analysis. The other, particular to storage systems, is that storage must be provisioned for capacity in terms of bytes as well as bandwidth. The explosive growth in demanded storage capacity, along with write-once, read-never data, has only exacerbated the situation [5].
Figure 3 highlights the inherent idleness in two of the workloads. The MSR trace shows more idleness because it was collected at the block device level, downstream of the memory cache; we examine the importance of caching in Section 3. For contrast, we also show the inherent idleness in a very different type of workload, web requests to Wikipedia.
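The difference between the two views in Figure 3 can be made concrete with a few lines of code. The sketch below is our own illustration (not the tooling used on the traces): it takes a sorted list of request timestamps, extracts the idle gaps between consecutive requests, and contrasts the unweighted distribution of gap lengths with the length-weighted one.

def idle_gaps(timestamps):
    """Idle period preceding each request, from a sorted list of request times (seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:]) if b > a]

def fraction_of_gaps_below(gaps, threshold):
    """Unweighted view (Figure 3, top): what fraction of idle periods are shorter than threshold?"""
    return sum(g < threshold for g in gaps) / len(gaps)

def fraction_of_idle_time_below(gaps, threshold):
    """Length-weighted view (Figure 3, bottom): what fraction of total idle time lies in short gaps?"""
    return sum(g for g in gaps if g < threshold) / sum(gaps)

# Example: many short gaps plus one long one.
gaps = idle_gaps([0.0, 0.1, 0.2, 0.3, 60.3])
print(fraction_of_gaps_below(gaps, 1.0))        # 0.75: most idle periods are short
print(fraction_of_idle_time_below(gaps, 1.0))   # ~0.005: almost all idle time is in the one long gap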
Idleness causes waste because systems are not designed from a power perspective. Waste could be eliminated if power consumption could be made proportional to load. Lacking power-proportional disks, creating a power-proportional storage system requires transitioning some disks into a low-power sleep state; this technique was demonstrated for web servers in [6].
Disks can only be put to sleep when they are idle. Idle time is created for some disks by aggregating the load onto others. This must be done carefully, so as to avoid hitting the performance knee in Figure 2. Yet it is necessary, because idle periods must be of a certain size before they can be exploited without harming performance, due to the high transition latency associated with sleep states.
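One way to read the previous paragraph is as a packing problem: concentrate the offered load on as few disks as possible while keeping each of them below the utilization knee. The sketch below is a minimal illustration under assumed parameters (a per-disk bandwidth budget and a knee at 70% utilization); it is not the policy of any particular system discussed here.

import math

def disks_needed(offered_load, per_disk_capacity, knee=0.7):
    """Number of disks to keep active so that no active disk is pushed past the utilization knee."""
    usable = per_disk_capacity * knee              # budget per active disk, kept below the knee
    return max(1, math.ceil(offered_load / usable))

# Example with assumed numbers: 120 MB/s of aggregate load, disks serving 100 MB/s each.
# Two active disks run at 60% utilization; the remaining disks stay idle and can be put to sleep.
print(disks_needed(120, 100))   # -> 2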
The act of coalescing work in order to coalesce idle time requires balancing performance and power consumption. Fortunately, this yin and yang of power and performance are not strictly at odds. They work in harmony, in that techniques developed for enhancing performance (the darkspace in Figure 1) are also techniques for shaping idleness (the whitespace in Figure 1). By casting these techniques in the light of managing idleness, rather than performance, we can exploit the idleness for power savings. These techniques are reviewed in the next section.
3 The way
Harnessing idleness in storage systems is challenging because such systems are inherently stateful. Thus we must be concerned with consistency as well as data location. The first issue comes to the forefront when data is replicated; the second when data is partitioned across components and potentially available in only one place. We examine these concerns in this section using two of the traces listed in Table 1: a block-level trace, block-web, and a web trace, http-wiki.
3.1 Being consistent
Consistency becomes a problem when data is replicated, regardless of whether disks are being put to sleep. Replication is done for fault tolerance as well as increased read bandwidth. It also provides an extra degree of freedom, in that availability is maintained as long as at least one replica is awake; the awake replicas receive the aggregated workload. This technique was demonstrated for the compute side in [6]. On the storage side it is a bit harder: the challenge is maintaining consistency on writes.

One option is to wake all replicas on every write. This breaks up the idle periods, sacrificing power savings. At the other end of the spectrum, one can write only to the active replicas, at the cost of having to play those writes back to the sleeping replicas before they can be read from; this technique was explored in [9]. Other work has also explored the power-savings opportunities afforded by replication [1, 6, 8, 11, 15], though without the benefit of an idleness-focused viewpoint. Figure 4 summarizes four general categories of techniques for dealing with writes.

Figure 4: There are four general approaches to ensuring consistency on writes, with minor variations possible. Each approach differs in the degree to which it trades off energy and performance.
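As an illustration of the two extremes described above (wake every replica on a write versus log writes for sleeping replicas and replay them before reads), here is a toy replica-set model. The class and method names are our own and do not correspond to any system cited here.

class ReplicaSet:
    """Toy model of write handling when some replicas are asleep."""

    def __init__(self, n_replicas, policy):
        self.awake = [True] + [False] * (n_replicas - 1)   # keep one replica awake
        self.data = [[] for _ in range(n_replicas)]        # applied writes, per replica
        self.pending = [[] for _ in range(n_replicas)]     # deferred writes, per replica
        self.policy = policy                               # "wake-all" or "log-and-replay"

    def write(self, item):
        for i in range(len(self.data)):
            if self.awake[i]:
                self.data[i].append(item)          # awake replicas apply writes immediately
            elif self.policy == "wake-all":
                self.awake[i] = True               # waking breaks up the idle period
                self.data[i].append(item)
            else:
                self.pending[i].append(item)       # defer; the replica keeps sleeping

    def read(self, i):
        if not self.awake[i]:
            self.awake[i] = True                   # pay the wake-up latency here instead
        self.data[i].extend(self.pending[i])       # replay deferred writes before serving
        self.pending[i].clear()
        return list(self.data[i])

# Under "log-and-replay", writes leave sleeping replicas asleep; a later read pays for the replay.
rs = ReplicaSet(3, policy="log-and-replay")
rs.write("a"); rs.write("b")
print(rs.awake)    # [True, False, False]
print(rs.read(2))  # ['a', 'b']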
3.2 Loitering
To harness the idleness for energy savings, disks must be put to sleep when idle. However, if a request arrives for data that resides only on a sleeping disk, that request will incur very high latency, because the disk must be woken up before the request can be served. This takes on the order of seconds (see our disk survey at https://spreadsheets.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0At9LlwTqZsD5dERFdExuQ29BQnRQR1FVeEQ4TlVQZUE&output=html). So care must be taken; we cannot put disks to sleep as soon as they become idle without harming performance.

Loiter time is the duration a disk will remain idle before going into a low-power sleep state. Shorter loiter times mean disks sleep more frequently, causing more requests to suffer performance penalties; longer loiter times keep disks spinning longer, decreasing energy savings. Figure 5 illustrates a sweet spot, where enormous amounts of energy can be saved while making reasonable performance tradeoffs.

Figure 5: Each point represents a different loiter time, with corresponding energy savings and latency penalty. Going to sleep after a reasonable idle period can result in vast savings for a reasonable performance tradeoff. We can still serve 95% of requests under 100 ms while obtaining on the order of 2x savings for Wikipedia, and 6x savings for a Microsoft enterprise data center web server. Note that energy savings are calculated as a percentage of baseline energy consumption.
3.3 Extending memory management

We can obtain a clearer picture of the nature of the challenges introduced in the previous section by considering the set of inactive disks as an extension to the memory management hierarchy, as shown in Figure 6. The nature of this hierarchy has long been exploited for robustness and high performance. This suggests we can successfully apply memory management and caching techniques that have been refined over the course of several decades. One caveat to the memory-management analogue, however, is that the latency of an access to an inactive disk is almost 5 seconds, orders of magnitude more than the 100 ms threshold desired for an interactive application. Figure 7 shows that an extra layer of caching not only improves performance, but also increases the average idleness seen by the disk layer, which could then be exploited for energy savings. Figure 5 shows the lower bound on the energy savings we can expect from any caching scheme. Literature that has explored caching for energy savings includes [3, 14, 16].

Figure 6: We can think of the set of inactive disks as a 4th layer in the well-studied memory hierarchy. A miss to the active-disk set results in a latency penalty of a similar order of magnitude to that between memory and disk.

Figure 7: An extension of Figure 3 corresponding to different cache sizes. The top graph corresponds to the http-wiki trace, and the bottom graph to the block-web trace. The addition of a further layer of cache increases idleness.
3.4 Other Techniques
Other memory management techniques could also lead to the coalescing of idle time for energy savings. In a workload with appreciable spatial data-access correlation, which conventional wisdom holds to be true much of the time, prefetching would enable further coalescing of idle time by shifting the point at which data is retrieved earlier in time. Intelligent layout strategies that attempt to segregate hot data, close in spirit to caching and prefetching, would achieve a similar effect.
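For instance, a simple sequential prefetcher folds what would have been several separate disk accesses into one, turning several short idle gaps into one longer, more exploitable gap. A minimal sketch follows (our own illustration; the prefetch depth of 4 is an arbitrary choice):

def accesses_with_prefetch(blocks, depth=4):
    """Blocks actually fetched from disk when every miss also prefetches
    the next `depth` sequential blocks."""
    fetched, disk_accesses = set(), []
    for b in blocks:
        if b not in fetched:
            disk_accesses.append(b)
            fetched.update(range(b, b + depth + 1))   # fetch b plus the next `depth` blocks
    return disk_accesses

# A sequential scan of 20 blocks turns into 4 disk accesses instead of 20,
# leaving the disk idle (and potentially asleep) in between.
print(accesses_with_prefetch(list(range(20))))        # -> [0, 5, 10, 15]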
4 Conclusion
We propose that a new central tenet of system design should be doing nothing well. From the Tao Te Ching [12]:
We turn clay to make a vessel;
But it is on the space where there is nothing
that the usefulness of the vessel depends.
Therefore just as we take advantage of what is,
we should recognize the usefulness of what is not.
Given this new perspective on the whitespace versus the darkspace, many questions arise, and idleness deserves greater focus in the systems literature. At what rate has idleness increased in systems over the years? What trends drive this? For instance, how does the move from dedicated storage to storage distributed over processing nodes change the observed patterns of idleness? Does performance need to be held paramount, at the cost of the power-saving opportunities afforded by idleness aggregation? We urge the systems community to focus on these new questions.

References

[1] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, New York, NY, USA, 2010. ACM.

[2] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. IEEE Computer, 40(12):33–37, 2007.

[3] D. Colarelli and D. Grunwald. Massive arrays of idle disks for storage archives. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, Supercomputing '02, pages 1–11, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[4] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive NFS tracing of email and research workloads. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 203–216, Berkeley, CA, USA, 2003. USENIX Association.

[5] International Data Corporation. Worldwide file-based storage 2010–2014 forecast update. Technical report, IDC, December 2010.

[6] A. Krioukov, P. Mohan, S. Alspaugh, L. Keys, D. Culler, and R. H. Katz. NapSAC: design and implementation of a power-proportional web cluster. In Green Networking '10: Proceedings of the First ACM SIGCOMM Workshop on Green Networking, pages 15–22, New York, NY, USA, 2010. ACM.

[7] A. W. Leung, S. Pasupathy, G. Goodson, and E. L. Miller. Measurement and analysis of large-scale network file system workloads. In USENIX 2008 Annual Technical Conference, pages 213–226, Berkeley, CA, USA, 2008. USENIX Association.

[8] D. Li and J. Wang. EERAID: energy efficient redundant and inexpensive disk array. In Proceedings of the 11th ACM SIGOPS European Workshop, EW 11, New York, NY, USA, 2004. ACM.

[9] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: practical power management for enterprise storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST '08, pages 17:1–17:15, Berkeley, CA, USA, 2008. USENIX Association.

[10] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.

[11] E. Pinheiro and R. Bianchini. Energy conservation techniques for disk array-based servers. In Proceedings of the 18th Annual International Conference on Supercomputing, ICS '04, pages 68–78, New York, NY, USA, 2004. ACM.

[12] L. Tzu. The Way and Its Power: Lao Tzu's Tao Te Ching and Its Place in Chinese Thought. 1958. Translation.

[13] G. Urdaneta, G. Pierre, and M. van Steen. Wikipedia workload analysis for decentralized hosting. Elsevier Computer Networks, 53(11):1830–1845, July 2009.

[14] C. Weddle, M. Oldham, J. Qian, A.-I. A. Wang, P. Reiher, and G. Kuenning. PARAID: a gear-shifting power-aware RAID. ACM Transactions on Storage, 3, October 2007.

[15] X. Yao and J. Wang. RIMAC: a novel redundancy-based hierarchical cache architecture for energy efficient, high performance storage systems. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys '06, New York, NY, USA, 2006. ACM.

[16] Q. Zhu and Y. Zhou. Power-aware storage cache management. IEEE Transactions on Computers, 54, May 2005.