The Tao of Systems: Doing Nothing Well

Sara Alspaugh, Arka Bhattacharya, David Culler, Randy Katz
University of California, Berkeley

Abstract

We propose a new design principle for the systems community: "do nothing well", in addition to "make the common case fast". This means designing for power savings as well as performance gains. The key to saving power is harnessing idleness: understanding and managing idleness as a first-class characteristic of workloads. In this paper we study the idleness in a diverse range of workloads. We show how well-known techniques for improving performance can be recast as tools for managing idleness, and thereby saving power.

1 Introduction

One of the key tenets of system design is: in making a design trade-off, favor the frequent case over the infrequent case. As a systems community, we focus on optimizing for performance — making the common case fast [10]. But the common case in real systems is doing nothing; that is, being idle, as shown in Figure 1. Being fast does not address this — you can only do something faster, and doing something faster only increases the amount of nothing the system does. This highlights what we as system designers have been doing wrong. Figure 2 shows how we spend the majority of our focus on performance at high utilization levels, when in reality the majority of system time is spent doing nothing [2]. And even though the common case is doing nothing, doing nothing still consumes power — often more than half of peak. If we are to design for the common case from a power perspective, we need to "do nothing well". This requires understanding idleness as well as we understand performance, and optimizing for the structure we find there.

In this paper we examine what it means to exploit the structure of idleness in the workload, just as we exploit temporal and spatial locality. We take a particular focus on storage systems, though the principle applies broadly. We explore the idleness in a diverse range of workloads and show how much is wasted by doing nothing poorly. We then consider storage system techniques to reduce this waste, such as workload aggregation, write-logging, and caching. We see that these energy reduction techniques are really about manipulating idleness in order to harness it for energy savings.

Figure 1: A 45 day trace of a Facebook data analytics cluster running Hadoop MapReduce. The system is under-utilized when the load is below peak. Thus, if the darkspace represents load, then the whitespace above represents idleness.

Figure 2: Power consumption and response latency of an Atom processor-based node as utilization increases. Although the majority of actual time is spent in region (a), the majority of system designers' attention is spent in region (b).

2 The usefulness of what is not

In all systems, capacity planning is a function of the peak load demanded, while performance is a function of the average load. The discrepancy between these two is idleness. This implies idleness is driven by variation in the workload. The whitespace in Figure 1 is the idleness in a particular workload.

Table 1 describes a set of workloads. We studied the idleness in these workloads, and Table 2 gives the degree of idleness, or whitespace. This idleness is exactly the peak less the average, divided by the peak. We translate this to waste in order to capture the overhead. In a perfectly provisioned world, waste would be equivalent to peak over average. In reality, systems are vastly overprovisioned, so the waste is provisioned over average. Provisioned over peak describes the overprovision factor.
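In symbols, with peak, average, and provisioned load measured in the same units, these are the quantities reported in Table 2:

\[
\text{whitespace} = \frac{\text{peak} - \text{average}}{\text{peak}}, \qquad
\text{peak-to-average waste} = \frac{\text{peak}}{\text{average}}, \qquad
\text{provision-to-average waste} = \frac{\text{provisioned}}{\text{average}}, \qquad
\text{overprovision factor} = \frac{\text{provisioned}}{\text{peak}}.
\]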
Overprovisioning is driven by a number of things. One is that provisioning is often done on an organizational or logical level, and thus is decoupled from peak analysis. The other, particular to storage systems, is that storage systems must be provisioned for capacity in terms of bytes as well as bandwidth. The explosive growth in demanded storage capacity, as well as write-once, read-never data, has only exacerbated the situation [5].

Workload      Percent whitespace  Peak-to-average waste  Provision-to-average waste  Overprovision factor
nfs-corp      98.3%               59.8                   249.6                       4.2
nfs-uni       92.4%               13.2                   20.4                        1.6
block-hm      98.6%               69.0                   166.5                       2.4
block-prn     99.2%               119.1                  205.8                       1.7
block-rsrch   99.5%               183.1                  1989.5                      10.9
block-stg     95.8%               23.7                   158.6                       6.7
block-wdev    84.3%               6.4                    11048.4                     1726.3
block-web     98.2%               55.2                   517.7                       9.4
http-wiki     11.1%               1.1                    unknown                     unknown
mapreduce-fb  99%                 2129                   unknown                     unknown

Table 2: Across a diverse range of system types the amount of idleness and waste is large. Peak and provisioned capacity exceed the average required by one to five orders of magnitude. Provisioning was considered in terms of bytes per second (bandwidth) rather than bytes (capacity); high capacity provisioning results in even greater waste.

Figure 3: The inherent idleness in two of the workloads examined. MSR corresponds to the block-web trace, and Wikipedia corresponds to the http-wiki trace. The top graph shows that most idle periods are short, but by weighting them by their length, as in the bottom graph, we see there is actually a great deal of inherent idleness.

Figure 3 highlights idleness in two of the workloads. In the figure, the MSR trace shows more idleness because it was collected at the block device level, after requests have been filtered by the memory cache. We will examine the importance of caching in Section 3. For contrast, we also show the inherent idleness in a very different type of workload, web requests from Wikipedia.

Idleness causes waste because systems are not designed from a power perspective. Waste could be eliminated if power consumption could be made proportional to load. Lacking power-proportional disks, creating a power-proportional storage system requires transitioning some disks into a low-power sleep state. This technique was demonstrated for web servers in [6]. Disks can only be put to sleep when they are idle. Idle time is created for some disks by aggregating the load onto others. This must be done carefully so as to avoid hitting the performance knee in Figure 2. Yet it is necessary, because idle periods must be of a certain size before they can be exploited without harming performance, due to the high transition latency associated with such states.
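To make the transition-cost constraint concrete, the sketch below computes a break-even idle period: the shortest idle gap for which spinning a disk down and back up saves any energy at all, before even considering the latency penalty paid by the request that ends the gap. All the disk parameters are assumed, illustrative values rather than measurements of any particular drive.

# Sketch: when is an idle gap long enough to be worth sleeping through?
# All numbers below are assumed, illustrative values, not measurements.

IDLE_POWER_W = 8.0        # power while spinning but idle (assumed)
SLEEP_POWER_W = 1.0       # power while spun down (assumed)
SPINDOWN_ENERGY_J = 10.0  # extra energy to spin down (assumed)
SPINUP_ENERGY_J = 20.0    # extra energy to spin up (assumed)
SPINUP_LATENCY_S = 5.0    # wake-up latency seen by the next request (assumed)

def break_even_seconds() -> float:
    """Shortest idle gap for which sleeping saves energy.

    Sleeping through a gap costs the transition energy plus sleep power,
    versus simply idling at IDLE_POWER_W for the whole gap.
    """
    transition = SPINDOWN_ENERGY_J + SPINUP_ENERGY_J
    return transition / (IDLE_POWER_W - SLEEP_POWER_W)

def worth_sleeping(gap_s: float) -> bool:
    """True if spinning down for this gap saves energy.

    This ignores the performance cost: the request that ends the gap still
    pays SPINUP_LATENCY_S, which is why the loiter times of Section 3.2
    matter even for gaps that pass this test.
    """
    return gap_s > break_even_seconds()

if __name__ == "__main__":
    print(f"break-even idle period: {break_even_seconds():.1f} s")
    for gap in (1.0, 5.0, 30.0, 300.0):
        print(f"gap {gap:6.1f} s -> sleep? {worth_sleeping(gap)}")

This is why short idle gaps are useless on their own: they must first be coalesced into gaps longer than the break-even period before a sleep state can pay off.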
The act of coalescing work in order to coalesce idle time requires balancing performance and power consumption. Fortunately, this yin and yang of power and performance are not strictly at odds. They work in harmony: techniques developed for enhancing performance (the darkspace in Figure 1) are also techniques for shaping idleness (the whitespace in Figure 1). By casting these techniques in the light of managing idleness, rather than performance, we can exploit the idleness for power savings. These techniques are reviewed in the next section.

Label         Type         Origin           Environment                                                         Hardware                                                           Duration  Reference
nfs-corp      CIFS (NFS)   NetApp, 2007     data for marketing, sales, and finance department in corporate DC  3 TB NetApp filer, almost full                                     1 day     [7]
nfs-uni       NFS          Harvard, 2001    home directories for a subset of campus users                       three NFS servers hosting a total of fourteen 53 GB disk arrays   1 day     [4]
block-hm      block-level  Microsoft, 2007  enterprise DC server for hardware monitoring                        RAID-1 boot volume, RAID-5 data volumes as DAS, 13 disks average  1 week    [9]
block-prn     block-level  Microsoft, 2007  enterprise DC print server                                          same as block-hm                                                   1 week    [9]
block-rsrch   block-level  Microsoft, 2007  enterprise DC server hosting research projects                      same as block-hm                                                   1 week    [9]
block-stg     block-level  Microsoft, 2007  enterprise DC web staging server                                    same as block-hm                                                   1 week    [9]
block-wdev    block-level  Microsoft, 2007  enterprise DC test web server                                       same as block-hm                                                   1 week    [9]
block-web     block-level  Microsoft, 2007  enterprise DC web / SQL server                                      same as block-hm                                                   1 week    [9]
http-wiki     HTTP         Wikipedia        Wikipedia requests (subsampled at 10%)                              unknown                                                            3 hours   [13]
mapreduce-fb  MapReduce    Facebook, 2009   data analytics cluster                                              unknown                                                            45 days   —

Table 1: Summary of workloads examined and their characteristics.

3 The way

Harnessing idleness in storage systems is challenging because such systems are inherently stateful. Thus you must be concerned with consistency as well as data location. The first issue comes to the forefront when data is replicated. The second comes to the forefront when data is partitioned across components and potentially only available in one place. We examine these techniques in this section, using two traces out of those listed in Table 1: a block-level trace, block-web, and a web trace, http-wiki.

3.1 Being consistent

Consistency becomes a problem when data is replicated, regardless of whether you are putting disks to sleep. Replication is done for fault tolerance as well as increased read bandwidth. It provides an extra degree of freedom in that availability is maintained as long as one replica is awake. The awake replicas receive the aggregated workload. This technique was demonstrated for the compute side in [6]. On the storage side, it is a bit harder. The challenge is maintaining consistency on writes. One option is to wake all replicas on every write. This breaks up the idle periods, sacrificing power savings. On the other end of the spectrum, you can write only to active replicas, at the cost of having to play back writes to the sleeping replicas before they can be read from. This technique was explored in [9]. Other work has also explored power-saving opportunities afforded by replication [1, 6, 8, 11, 15], though this work did not benefit from the perspective afforded by an idleness-focused viewpoint. Figure 4 summarizes four general categories of techniques for dealing with writes.

Figure 4: There are four general approaches to ensuring consistency on writes, with minor variations possible. Each approach differs in the degree to which it trades off energy and performance.
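To make the "write only to active replicas" end of the spectrum concrete, here is a minimal sketch in the spirit of the write off-loading idea explored in [9], not a reproduction of that system's design: writes destined for a sleeping replica are logged and played back before that replica serves reads. All class and function names are hypothetical.

# Minimal sketch of writing only to active replicas (in the spirit of [9]).
# Names and structure are illustrative, not the actual design of that work.

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.asleep = False
        self.store: dict[str, bytes] = {}            # key -> value
        self.pending: list[tuple[str, bytes]] = []   # writes to play back on wake

    def wake(self) -> None:
        """Spin up and apply any writes logged while asleep."""
        self.asleep = False
        for key, value in self.pending:
            self.store[key] = value
        self.pending.clear()

class ReplicaGroup:
    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas

    def write(self, key: str, value: bytes) -> None:
        # Apply the write to awake replicas immediately; log it for sleeping
        # ones so their idle periods are not broken up by the write.
        for r in self.replicas:
            if r.asleep:
                r.pending.append((key, value))
            else:
                r.store[key] = value

    def read(self, key: str) -> bytes:
        # Prefer an awake replica; otherwise pay the wake-up and playback cost.
        for r in self.replicas:
            if not r.asleep:
                return r.store[key]
        r = self.replicas[0]
        r.wake()                 # high-latency path: spin-up plus log playback
        return r.store[key]

if __name__ == "__main__":
    group = ReplicaGroup([Replica("r0"), Replica("r1"), Replica("r2")])
    group.replicas[1].asleep = True
    group.replicas[2].asleep = True
    group.write("block-42", b"new contents")   # only r0 is touched
    print(group.read("block-42"))              # served by r0, no wake-up needed

The trade-off named in the text is visible here: sleeping replicas stay asleep across writes, but a read that can only be served by a sleeping replica pays both the spin-up latency and the playback of the pending log.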
3.2 Loitering

To harness the idleness for energy savings, disks must be put to sleep when idle. However, if a request arrives for data that resides only on a sleeping disk, that request will incur very high latency because the disk must be woken up before that request can be served. This takes on the order of seconds.¹ So care must be taken; we cannot put disks to sleep as soon as they become idle without harming performance. Loiter time is the duration of time a disk will remain idle before going into a low-power sleep state. Shorter loiter times mean disks sleep more frequently, causing more requests to suffer performance penalties. Longer loiter times keep disks spinning longer, decreasing energy savings. Figure 5 illustrates a sweet spot, where enormous amounts of energy can be saved while making reasonable performance tradeoffs.

¹ See our disk survey at: https://spreadsheets.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0At9LlwTqZsD5dERFdExuQ29BQnRQR1FVeEQ4TlVQZUE&output=html

Figure 5: Each point represents a different loiter time, with corresponding energy savings and latency penalty. Going to sleep after a reasonable idle period can result in vast savings for a reasonable performance tradeoff. We can still serve 95% of requests under 100 ms while obtaining on the order of 2x savings for Wikipedia, and 6x savings for a Microsoft enterprise data center web server. Note that energy savings are calculated as a percentage of baseline energy consumption.

3.3 Extending memory management

We can obtain a clearer picture of the nature of the challenges introduced in the previous section by considering the set of inactive disks as an extension to the memory management hierarchy, as shown in Figure 6. The nature of this hierarchy has long been exploited for robustness and high performance. This suggests we can successfully apply memory management and caching techniques which have been refined over the course of several decades. One caveat to the memory-management analogue, however, is that the latency of an access to an inactive disk is almost 5 seconds, which is orders of magnitude more than the threshold of 100 ms desired for an interactive application.

Figure 6: We can think of the set of inactive disks as a fourth layer in the well-studied memory hierarchy. A miss to the active-disk set results in a latency penalty of a similar order of magnitude to that between memory and disk.

Figure 7: An extension of Figure 3 corresponding to different cache sizes. The top graph corresponds to the http-wiki trace, and the bottom graph corresponds to the block-web trace. The addition of a further layer of cache increases idleness.

Figure 7 shows that an extra layer of caching not only improves performance, but also increases the average idleness seen by the disk layer, which could then be exploited for energy savings. Figure 5 shows the lower bound on the energy savings we can expect from any caching scheme. Literature that has explored caching for energy savings includes [3, 14, 16].
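As a rough illustration of why an extra caching layer lengthens the idle periods seen by disks, the sketch below pushes a synthetic request stream through a small LRU cache and compares the idle gaps at the disk with and without it. The trace, cache size, and skew are all assumed values chosen for illustration; this is not the analysis behind Figure 7.

# Sketch: an LRU cache in front of the disk absorbs repeated accesses,
# so the disk sees fewer, more widely spaced requests (longer idle gaps).
# The workload, cache size, and timing here are synthetic and illustrative.

from collections import OrderedDict
import random

def disk_arrival_times(trace, cache_size):
    """Arrival times of requests that miss the cache and therefore reach the disk."""
    cache = OrderedDict()   # block id -> None, kept in LRU order
    arrivals = []
    for t, block in trace:
        if cache_size > 0 and block in cache:
            cache.move_to_end(block)      # hit: absorbed by the cache, disk stays idle
            continue
        arrivals.append(t)                # miss (or no cache): disk must serve it
        if cache_size > 0:
            cache[block] = None
            if len(cache) > cache_size:
                cache.popitem(last=False) # evict the least recently used block
    return arrivals

def mean_idle_gap(arrivals):
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

if __name__ == "__main__":
    random.seed(0)
    # Synthetic skewed trace: a few hot blocks, many cold ones, one request per second.
    trace = [(t, random.randrange(20) if random.random() < 0.8
              else random.randrange(20, 10_000)) for t in range(50_000)]
    no_cache = disk_arrival_times(trace, cache_size=0)
    with_cache = disk_arrival_times(trace, cache_size=100)
    print(f"mean idle gap at disk, no cache:   {mean_idle_gap(no_cache):.1f} s")
    print(f"mean idle gap at disk, with cache: {mean_idle_gap(with_cache):.1f} s")

With the assumed skew, the cache absorbs most of the hot-block traffic, so the same workload produces idle gaps at the disk that are several times longer, and therefore easier to exploit with sleep states.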
3.4 Other Techniques

Other memory management techniques could also lead to coalescing of idle time for energy savings. In a workload with appreciable spatial data-access correlation, which conventional wisdom holds to be true much of the time, prefetching would enable further coalescing of idle time by retrieving data ahead of when it is demanded. Intelligent layout strategies which attempt to segregate hot data, close in nature to caching and prefetching, would achieve a similar effect.
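As a toy illustration of how prefetching shifts retrievals earlier and coalesces disk activity, the sketch below fetches the next few sequential blocks whenever the disk is touched, so later sequential accesses hit a buffer instead of ending another idle period. The prefetch depths and access stream are assumed values for illustration only.

# Sketch: sequential prefetching coalesces disk activity into fewer, longer bursts.
# The access stream and prefetch depths are synthetic, illustrative values.

def disk_touches(accesses, prefetch_depth):
    """Count how many times the disk is touched for a given access stream."""
    buffered = set()   # blocks already staged in memory by a previous prefetch
    touches = 0
    for block in accesses:
        if block in buffered:
            continue                # served from the prefetch buffer, disk stays idle
        touches += 1                # disk activity: fetch this block...
        for b in range(block, block + prefetch_depth + 1):
            buffered.add(b)         # ...and stage the next few blocks while we are at it
    return touches

if __name__ == "__main__":
    accesses = list(range(1000))    # a purely sequential scan, one block at a time
    for depth in (0, 7, 63):
        print(f"prefetch depth {depth:3d}: {disk_touches(accesses, depth)} disk touches")

The deeper the prefetch, the fewer and more widely spaced the disk touches, which is exactly the idle-time coalescing the text describes.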
4 Conclusion

We propose that a new central tenet of system design should be doing nothing well. From the Tao Te Ching [12]:

    We turn clay to make a vessel;
    But it is on the space where there is nothing that the usefulness of the vessel depends.
    Therefore just as we take advantage of what is, we should recognize the usefulness of what is not.

Given this new perspective on the whitespace versus the darkspace, many questions arise. Idleness deserves greater focus in the systems literature. At what rate has idleness increased in systems over the years? What trends drive this? For instance, how does the move from dedicated storage to storage distributed over processing nodes change the observed patterns of idleness? Must performance be held paramount at the cost of the power-saving opportunities afforded by idleness aggregation? We urge the systems community to focus on these new questions.

References

[1] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, New York, NY, USA, 2010. ACM.

[2] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. IEEE Computer, 40(12):33–37, 2007.

[3] D. Colarelli and D. Grunwald. Massive arrays of idle disks for storage archives. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, Supercomputing '02, pages 1–11, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[4] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive NFS tracing of email and research workloads. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 203–216, Berkeley, CA, USA, 2003. USENIX Association.

[5] International Data Corporation. Worldwide file-based storage 2010–2014 forecast update. Technical report, IDC, December 2010.

[6] A. Krioukov, P. Mohan, S. Alspaugh, L. Keys, D. Culler, and R. H. Katz. NapSAC: design and implementation of a power-proportional web cluster. In Green Networking '10: Proceedings of the First ACM SIGCOMM Workshop on Green Networking, pages 15–22, New York, NY, USA, 2010. ACM.

[7] A. W. Leung, S. Pasupathy, G. Goodson, and E. L. Miller. Measurement and analysis of large-scale network file system workloads. In USENIX 2008 Annual Technical Conference, pages 213–226, Berkeley, CA, USA, 2008. USENIX Association.

[8] D. Li and J. Wang. EERAID: energy efficient redundant and inexpensive disk array. In Proceedings of the 11th ACM SIGOPS European Workshop, EW 11, New York, NY, USA, 2004. ACM.

[9] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: practical power management for enterprise storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST '08, pages 17:1–17:15, Berkeley, CA, USA, 2008. USENIX Association.

[10] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.

[11] E. Pinheiro and R. Bianchini. Energy conservation techniques for disk array-based servers. In Proceedings of the 18th Annual International Conference on Supercomputing, ICS '04, pages 68–78, New York, NY, USA, 2004. ACM.

[12] L. Tzu. The Way and Its Power: Lao Tzu's Tao Te Ching and Its Place in Chinese Thought. 1958. Translation.

[13] G. Urdaneta, G. Pierre, and M. van Steen. Wikipedia workload analysis for decentralized hosting. Elsevier Computer Networks, 53(11):1830–1845, July 2009.

[14] C. Weddle, M. Oldham, J. Qian, A.-I. A. Wang, P. Reiher, and G. Kuenning. PARAID: a gear-shifting power-aware RAID. ACM Transactions on Storage, 3, October 2007.

[15] X. Yao and J. Wang. RIMAC: a novel redundancy-based hierarchical cache architecture for energy efficient, high performance storage systems. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys '06, New York, NY, USA, 2006. ACM.

[16] Q. Zhu and Y. Zhou. Power-aware storage cache management. IEEE Transactions on Computers, 54, May 2005.