AUTONOMIC CLOUD RESOURCE MANAGEMENT
By
Cihan Tunc
Copyright © Cihan Tunc 2015
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2015
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the
dissertation prepared by Cihan Tunc, titled “Autonomic Cloud Resource Management”
and recommend that it be accepted as fulfilling the dissertation requirement for the
Degree of Doctor of Philosophy.
_____________________________________________________Date: January 23, 2015
Dr. Salim Hariri
_____________________________________________________Date: January 23, 2015
Dr. Ali Akoglu
_____________________________________________________Date: January 23, 2015
Dr. Janet Wang
Final approval and acceptance of this dissertation is contingent upon the candidate's
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
____________________________________________________ Date: January 23, 2015
Co-Dissertation Director: Salim Hariri
____________________________________________________ Date: January 23, 2015
Co-Dissertation Director: Ali Akoglu
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University Library
to be made available to borrowers under the rules of the library.
Brief quotations from this dissertation are allowable without special permission, provided
that accurate acknowledgement of the source is made. Requests for permission for
extended quotation from or reproduction of this manuscript in whole or in part may be
granted by the copyright holder.
SIGNED: Cihan Tunc
ACKNOWLEDGEMENTS
A PhD is a long journey that could not be accomplished without the help, patience, and
understanding of many people. Therefore, I would like to thank all the people who
supported me and contributed to this work.
First of all, I would like to express my gratitude to my advisors. I am very thankful to
Prof. Salim Hariri for his advice, guidance, comments, and support during this path.
Without him, this work would never have been accomplished. I have learnt a lot from
him during this journey and his teachings will always be appreciated throughout my
career.
I would like to express my sincere thanks to Prof. Ali Akoglu. With his help,
guidance, and support, this work was initiated, and I am very thankful for that.
I also would like to thank my committee members during my PhD journey, Dr. Janet
Wang, Prof. Susan Lysecky, and Prof. Roman Lysecky. I am honored that you all were on
my committee, contributed to my study, and supported me throughout. In addition,
I would like to thank all other professors at The University of Arizona whose valuable
ideas have enlightened me.
I am deeply grateful to Prof. Youssif Al-Nashif for his brother-like help, discussions,
brainstorming, and support during my PhD path. Without his encouragements, it would
be extremely hard for me to contribute to this research.
I would like to thank my friends from the ACL lab at UofA, especially Bilal, Jesus, and
Pratik. I would like to especially thank Dr. Farah Fargo for her support and contributions
during my research. I am also very grateful to my friends from the RCL lab at UofA,
especially Dr. Greg Striemer, Dr. Yoon Kaw Leow, and Burak Unal. Thanks to your
friendship, UofA was a warm environment for me. Thank you very much.
I would like to sincerely thank the ECE department staff for their cooperation and kindness
during my study.
I would like to express my appreciation to Prof. Yaman Yener, who passed away in
2013. None of this work would have been possible without his support. He was always
there for us, like an uncle. We learnt a lot from him, and his path has enlightened
many people.
I am very grateful to my undergraduate advisor, Prof. Fatih Ugurdag, for his friendship,
support, discussions, and understanding throughout both my professional career and
personal life.
I would like to deeply thank all my friends who always believed in me.
Last but not least, I would like to express my gratitude to my beloved family (my
father Sayim Tunc, my mother Leyla Tunc, and my sisters Cansu Tunc and Caglanur
Tunc), who have been very patient, helpful, and understanding throughout this journey.
Without their support, it would have been almost impossible to start this journey. I would like to
also thank my relatives for their best wishes for me.
Thank you very much all.
I dedicate this dissertation to my family...
TABLE OF CONTENTS
LIST OF FIGURES...........................................................................................................................10
LIST OF TABLES .............................................................................................................................13
ABSTRACT ......................................................................................................................................14
CHAPTER I INTRODUCTION ...........................................................................................................17
A. Motivation .................................................................................................................. 17
B. Problem Statement ...................................................................................................... 18
C. Research Objectives .................................................................................................... 21
D. Research Challenges ................................................................................................... 23
E. Overview of the Proposed Approach .......................................................................... 24
F. Dissertation Outline ..................................................................................................... 25
CHAPTER II RELATED WORK .......................................................................................................27
A. QoS with Power Reduction ......................................................................................... 27
B. Hardware Advances .................................................................................................... 27
C. Software Advances ...................................................................................................... 28
D. Turning off unused capacity ....................................................................................... 28
A. Dynamic Voltage and Frequency Scaling (DVFS) ..................................................... 29
B. Memory Management ................................................................................................. 30
C. Workload Scheduling .................................................................................................. 31
D. Workload Consolidation ............................................................................................. 33
E. Discussion ................................................................................................................... 34
CHAPTER III THE AUTONOMIC MANAGEMENT PARADIGM .......................................................37
A. Introduction ................................................................................................................. 37
B. Properties of Autonomic Computing .......................................................................... 38
C. Autonomic Computing System: The Conceptual Architecture .................................. 41
1) Computing Environment ......................................................................................... 43
2) Environment ............................................................................................................ 43
3) Controller ................................................................................................................ 44
D. Autonomic Computing System Framework ............................................................... 46
1) Application Management Editor ............................................................................. 47
2) Autonomic Runtime System (ARS) ........................................................................ 48
3) High Performance Computing Environment (HPCE) ............................................ 51
4) Autonomic Component Architecture (ACA) .......................................................... 51
E. Autonomia ................................................................................................................... 52
F. Related Work on Autonomic Computing .................................................................... 55
CHAPTER IV AUTONOMIC POWER AND PERFORMANCE MANAGEMENT ...................................59
A. Autonomic Management Framework ......................................................................... 60
B. Case-Based Reasoning ................................................................................................ 63
C. Hierarchical Autonomic Management Framework: Performance-Power Management ... 64
1) Autonomic Performance-Power Manager (APPM) ................................................ 65
2) Cluster-Level APPM (C-APPM) ............................................................................ 67
3) Server-Level APPM (S-APPM) .............................................................................. 67
4) Device-Level APPM (D-APPM) ............................................................................ 67
D. Hierarchical Autonomic Management Framework in Cloud Environment ............... 68
1) Server Autonomic Manager .................................................................................... 68
2) AppFlow: A Data Structure for Autonomic Management ...................................... 69
3) Autonomic Cloud Resource Management (ACRM) Framework ........................... 73
4) Monitoring .............................................................................................................. 75
5) AppFlow Analysis and Characterization ................................................................ 77
6) Plan Generation, Selection, and Execution ............................................................. 84
CHAPTER V EXPERIMENTAL RESULTS AND EVALUATION .........................................................97
A. Experimental Testbed ................................................................................................. 97
B. Evaluation Methods .................................................................................................... 98
C. Experimental Results and Evaluation of the ARCM Algorithm .............................. 100
D. Scalability Analysis .................................................................................................. 107
E. Total Operational Cost Analysis ............................................................................... 111
CHAPTER VI CONCLUSION AND FUTURE WORK DIRECTIONS .................................................113
A. Conclusion ................................................................................................................ 113
B. Future Research Directions ...................................................................................... 115
1) Adopt ARCM Methodology to Heterogeneous Cloud Systems ........................... 115
2) Modeling and Predicting the Behavior of Different Types of Workloads ............ 116
3) Multi-Objective Optimization (Performance, Power, Availability, Resilience) ... 116
4) Modeling of the VM Power Consumption ............................................................ 117
5) Online VM Migration and Consolidation ............................................................. 118
REFERENCES ................................................................................................................................119
LIST OF FIGURES
Figure 1. The monthly costs of the Amazon cloud systems [44]................................... 20
Figure 2. Power consumption of the system components in a server, by Intel [70] ...... 35
Figure 3. The conceptual view of autonomic computing system .................................. 43
Figure 4. Autonomic Computing System: The Implementation View .......................... 47
Figure 5. Self-Management Algorithm .......................................................................... 50
Figure 6. The control and management logic for the Autonomic Control Element (ACE)
........................................................................................................................................ 52
Figure 7. The autonomic logic of an ACE ..................................................................... 54
Figure 8. Autonomic Management System ................................................................... 61
Figure 9. Application Flow (AppFlow) ......................................................................... 62
Figure 10. Hierarchical Power Management ................................................................. 65
Figure 11. Server-level autonomic management architecture. ...................................... 69
Figure 12. Application Flow (AppFlow) attributes for a server. .................................. 71
Figure 13. Using AppFlow to model and predict the behavior of an application.......... 72
Figure 14. Autonomic cloud management framework. ................................................. 75
Figure 15. Monitoring tool created for the Autonomic Cloud Management ................. 76
Figure 16. The RUBiS browse-only operation memory utilization ............................... 79
Figure 17. The RUBiS default operation memory utilization ...................................... 79
Figure 18. The CPU utilization for different number of clients under RUBiS browsing
operation ........................................................................................................................ 80
Figure 19. The CPU utilization for different number of clients under RUBiS default
operation ........................................................................................................................ 80
Figure 20. An example of the rule set obtained from datamining algorithm. ................ 84
Figure 21. Algorithm 1 to find an acceptable sub-optimal configuration...................... 88
Figure 22. Algorithm 2 to search large configuration set .............................................. 89
Figure 23. Power consumption for different memory configurations ........................... 91
Figure 24. Intel i7 920 CPU p-state and full load power consumption ......................... 93
Figure 25. Plan execution approach used in this work .................................................. 96
Figure 26. Multi-tier RUBiS architecture ...................................................................... 98
Figure 27. Frequency comparison algorithm ............................................................... 100
Figure 28. The power consumption of browse-only RUBiS behavior with different
clients. .......................................................................................................................... 101
Figure 29. The power consumption of default RUBiS behavior with different clients.
...................................................................................................................................... 102
Figure 30. The power consumption of random RUBiS behavior with different clients
where no prior training has been applied. .................................................................... 103
Figure 31. Large scale testbed architecture for the scalability..................................... 109
Figure 32. The power RPR comparison with respect to static resource allocation. .... 110
Figure 33. The power RPR comparison with respect to frequency scaling for large
testbed .......................................................................................................................... 110
Figure 34. Memory and CPU usage vs cost ................................................................. 112
LIST OF TABLES
Table 1. Comparing the performance of our approach with other methods. ............... 104
Table 2. Comparing the execution time overhead of our approach with other methods.
...................................................................................................................................... 105
Table 3. The transition probability of the client numbers ............................................ 107
Table 4. The power comparison when the client numbers change dynamically for 1
hour. ............................................................................................................................. 107
ABSTRACT
The power consumption of data centers and cloud systems (operated by companies
such as Google, Amazon, eBay, and E-Trade) has almost tripled between 2007
and 2012. The past decade has seen data center IT managers adopting server
virtualization broadly; however, little progress has been achieved in server operation
efficiency. The traditional resource allocation methods are typically designed for high
performance as the primary objective so that peak resource requirements can be
supported. However, several research studies have shown that server utilization is
between 12% and 18% and has remained static from 2006 through 2012, while the power
consumption remains close to that at peak loads. Hence, there is a pressing need for devising
sophisticated resource management approaches to the problems of energy, performance,
and availability of datacenters and cloud services. State of the art dynamic resource
management schemes typically rely on only a single resource, such as the number of cores,
core speed, memory, disk, or network topology and hierarchy, to provide a reduction in power
consumption. There is a lack of fundamental research on designing methods that address
the challenges of dynamically managing multiple resources and properties (number of
cores, core frequency, memory allocation, storage, etc.) with the objective of allocating
just enough resources for each workload to meet its quality of service requirements while
optimizing for power consumption. The main focus of this dissertation is to consider
power and performance as the main properties to be simultaneously managed in large
cloud systems and datacenters. The objective of this research is to develop a framework
of performance and power management, and, in the context of this model, investigate a
general methodology for an integrated autonomic management of cloud systems and high
performance distributed datacenters.
In this dissertation, we developed an autonomic management framework based on a
novel data structure, referred to as AppFlow, which is used for modeling current and
near-term future behavior of a cloud application. We have developed the following
capabilities for the performance and power management of the cloud computing systems:
1) online modeling and characterizing the behavior of cloud applications and their
resource requirements; 2) predicting the application behavior so we can proactively
optimize its operations at runtime; 3) a holistic optimization methodology for
performance and power that takes into consideration the number of cores, CPU frequency,
and memory amount; and 4) an autonomic cloud resource management to support the
dynamic change in VM configurations at runtime such that we can simultaneously
optimize multiple objectives including performance, power, availability, etc.
We validated the effectiveness of our approach using the RUBiS benchmark, an
auction model emulating eBay transactions that generates a wide range of applications
(such as browsing and bidding with different numbers of clients), on an IBM HS22 blade
server. Our experimental results showed that our approach can lead to a significant
reduction in power consumption: up to 87% when compared to the static resource
allocation strategy, 72% when compared to an adaptive frequency scaling strategy, and 66%
when compared to a multi-resource management strategy.
CHAPTER I
INTRODUCTION
A. Motivation
The power consumption of data centers and cloud systems, such as those of Google, Amazon,
eBay, and E-Trade, has almost tripled between 2007 and 2012. According to
the International Energy Agency, data center electricity consumption is projected to
increase to roughly 140 billion kilowatt-hours annually by 2020 [1], [2], resulting in the
equivalent annual output of 50 power plants, costing American businesses $13 billion per
year in electricity bills and causing the emission of nearly 150 million metric tons of
carbon pollution annually. Much of the progress in data center efficiency over the past
five years has occurred in the area of facility and equipment efficiency. However, little
progress has been achieved in server operation efficiency, particularly in terms of server
utilization. The past decade has seen data center IT managers adopt server virtualization
[3] broadly (a technique to consolidate underutilized servers); however, average server
utilization is still between 12 and 18 percent and has remained static from 2006 through
2012 [4, 5, 6]. Although hyper-scale cloud providers can realize higher utilization rates
(ranging from 40 to 70 percent), even they are not consistently achieving those rates. For
example, recent research from Google indicates that typical server clusters average
anywhere from 10 to 50 percent utilization [7].
Data center operators typically plan the number and processing capacity of servers to
handle peak annual traffic, such as Black Friday sales; the rest of the time, most servers
remain largely unused. This underutilized equipment not only has a significant energy
draw but also is a constraint on data center capacity. It is estimated that the electricity
consumption in U.S. data centers could be cut by as much as 40 percent with careful
resource management [1]. As of 2014, this represents a savings of 39 billion kilowatt-hours
annually, equivalent to the annual electricity consumption of nearly all the households
in the state of Michigan [1]. Such improvement would save U.S. businesses $3.8 billion a
year. Hence, there is a pressing need for devising sophisticated resource management
approaches to the problems of energy, performance, and availability of datacenters and
cloud services.
The main goal of this research project is to consider power/energy and performance
as the main properties to be simultaneously managed in large cloud systems and
datacenters.
B. Problem Statement
The traditional resource allocation methods for large data centers use over-provisioned static resource allocation schemes for physical and virtual platforms. These
schemes are typically designed for high performance as the primary objective so that
peak resource requirements can be supported. However, resource allocation decisions
based on such pessimistic scenarios inevitably lead to higher power consumption [8].
Instead of this static and pessimistic resource allocation style, it is possible to reduce the
power consumption using dynamic resource management schemes [9-12]. More recently,
several methods have focused on energy-aware resource management [13-18]. These
approaches typically rely on only a single mechanism, such as frequency scaling [19-25],
memory [26-28], disk [29-30], or network topology and hierarchy [31-32], to provide a
reduction in power consumption without causing performance degradation. Some
methods rely on learning based resource allocation using neural networks [17, 33-35]. In
order to overcome the extensive training overhead of these methods, model-based
workload characterization has alternatively been used for resource management [36-43].
However, none of these methods address the challenge of dynamically managing multiple
resources and properties (number of cores, core frequency, memory allocation, storage,
etc.) with the objective of allocating just enough resources for each workload to meet its
quality of service requirements while optimizing power consumption.
According to Reference [44], Amazon.com estimates that, for its data centers, the cost
and operation of the servers accounts for around 53% of the total budget (based on a 3-year
amortization schedule), as shown in Figure 1, whereas energy-related costs account for
around 42% of the total.
Figure 1. The monthly costs of the Amazon cloud systems [44]
As a result, high power consumption leads to:
- Higher Total Cost of Ownership (TCO), resulting in revenue loss for cloud providers.
- Stricter requirements on the design of server rooms to accommodate the high power draw.
- Cooling challenges: appropriate cooling mechanisms are needed to cool down individual systems, which also results in higher cooling costs.
- Increased CO2 emissions: large compute farms contribute to global warming, as the majority of the energy used in today's society is generated from fossil fuels, which produce harmful CO2 emissions. Based on a Gartner report [43], the high power consumption of cloud services was estimated to account for 2% of global CO2 emissions in 2007.
- Performance loss under a fixed power budget, since the compute capabilities of the VMs, physical machines, and other datacenter components must be reduced so as not to exceed the power limit.
- Instability, hot spots, increased failure rates, reduced system reliability, and reduced device lifetime due to overheating.
C. Research Objectives
It has been found that data centers operate at around 10-50% of their maximum
capacity, resulting in a waste of energy as well as resources [46, 47], while idle servers
consume more than 60-70% of their maximum power capacity [48]. As an example, it
has been reported that the average utilization of the Google datacenters was around 30%
[50].
One of the ways to reduce power consumption by a data center is to apply
virtualization technology. Using virtualization, it is possible to improve the utilization
ratios of the system by consolidating Virtual Machines (VMs) on a physical server, reducing
the amount of required hardware.
For example, a study from Schneider Electric shows that by using virtualization, the
utilization can increase from 5% to more than 50%, whereas the power consumption
increases by only 20% [51].
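As a rough back-of-the-envelope check of what these figures imply, assume that the useful work delivered is proportional to utilization and let P denote the original server power. Then the work delivered per watt improves by nearly an order of magnitude:

\[
\frac{(\text{work per watt})_{\text{virtualized}}}{(\text{work per watt})_{\text{original}}}
= \frac{0.50/(1.2\,P)}{0.05/P}
= \frac{0.50}{0.05 \times 1.2} \approx 8.3
\]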
On the other hand, traditional resource allocation typically applies over-provisioning
schemes for physical and virtual platforms. These schemes are designed to support the
high performance requirements of the systems at peak resource needs. Nevertheless, the
pessimistic scenarios underlying over-provisioning schemes inevitably lead to higher
power consumption and/or low resource utilization [8].
It has been shown that average server utilization is low, between 12% and 18%, and has
remained static from 2006 through 2012 [4-6]. For example, the average utilization of
Google server clusters ranges anywhere from 10% to 50% [7].
Therefore, instead of static and pessimistic resource allocation techniques, dynamic
resource management schemes are needed to reduce the power consumption of cloud
systems while still maintaining Quality of Service (QoS) requirements. Several schemes
can be used to reduce power consumption, such as frequency scaling [19-25],
management of disks [29-30], memory [26-28], network devices [31-32], etc.
D. Research Challenges
Autonomic management, online modeling and analysis of multiple objectives and
constraints such as power consumption and performance of cloud systems and
datacenters is a challenging research problem due to the continuous change in workloads
and applications, continuous changes in topology, the variety of services and software
modules being used, and the extreme complexity and dynamism of their computational
workloads. In order to develop an effective autonomic control and management system
for a cloud's application performance and power or energy consumption, the system must
provide online monitoring, adaptive modeling and analysis, and proactive management
mechanisms tailored for real-time processing. This dissertation explains how to develop
innovative cloud management techniques to address the following research challenges:
1. How do we efficiently and accurately model power and energy consumption of
cloud resources at runtime, a modeling that involves complex interactions of different
classes of devices such as processor, memory, network and I/O? Existing works focus on
a single component, such as the processor, memory, or disk, or rely on very simplistic
assumptions. A system-level view of these components would present more opportunities
for power savings since we can exploit the non-mutually exclusive behaviors of these
components to set them at power states such that the global system power consumption is
minimal. Hence, accurately modeling such complex interactions among the various
device classes is a significant research and development challenge. Our approach is
based on using the AppFlow formalism [9, 10, 28, 52] to model all the system
components and their interactions.
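As a point of reference, a first-order server power model widely used in the literature (and not the AppFlow formalism itself, which captures richer cross-component interactions) treats power as linear in CPU utilization; a system-level view then sums one such term per device class:

\[
P_{\text{server}} \approx P_{\text{idle}} + (P_{\text{peak}} - P_{\text{idle}})\, u_{\text{cpu}},
\qquad
P_{\text{sys}} = \sum_{d \in \{\text{cpu},\,\text{mem},\,\text{disk},\,\text{net}\}} P_d(u_d, s_d)
\]

where u_d is the utilization of device class d and s_d its power state. The challenge stated above is precisely that the P_d terms are not independent, so the states s_d must be chosen jointly rather than in isolation.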
2. How can we predict in real-time the behavior of system resources and their
power/energy consumption as workloads change dynamically by several orders of
magnitude within a certain time window such as a day or a week?
3. How do we design efficient and self-adjusting optimization mechanisms that can
continuously learn/relearn, execute, monitor, and improve themselves to meet the
collective objectives of power management and performance improvement? Techniques
such as data mining can be exploited to address this research challenge.
4. Power management schemes for application servers, interconnects, and interfaces
are an intrinsic part of datacenters, yet they have not been explored much. For
example, it has been reported that a 32-port Gigabit Ethernet switch consumes around
700W when idle. Our approach to collectively model and manage the power and
performance of large scale datacenters will provide a theoretical framework that takes a
holistic, datacenter-wide view.
E. Overview of the Proposed Approach
While most of the literature focuses on only a single resource type, in our approach we
dynamically manage multiple resources (number of cores, core frequency, and memory
amount) to reduce power consumption by allocating just enough VM resources to
meet the QoS requirements of current applications, scaling the hardware
resources up/down at runtime. In this approach, we develop an autonomic power and
performance management system based on a novel data structure that models the
behavior of cloud applications, which we refer to as Application Flow (AppFlow). In our approach,
we classify cloud workloads into a set of AppFlow types. Next, for each AppFlow type,
we analyze and characterize the AppFlow type to determine the ideal software and
hardware configuration required to run each AppFlow and thus build an AppFlow based
knowledge repository. Once the current AppFlow type is determined, virtual machine
resources are configured based on the required resource amounts obtained from the
knowledge repository, such that power consumption is minimized with minimal
degradation in performance. Our experimental results showed that our approach can
reduce the power consumption by up to 84% when compared to static resource allocation
and by up to 30% when compared to other methods.
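As an illustration of this monitor-classify-configure cycle, the following minimal Python sketch shows an AppFlow-style knowledge-repository lookup. The AppFlow type names, threshold rules, and configuration values here are hypothetical placeholders, not the actual implementation described in Chapter IV.

from dataclasses import dataclass

@dataclass
class VMConfig:
    cores: int       # number of vCPUs to allocate
    freq_mhz: int    # CPU core frequency to set
    memory_mb: int   # memory to allocate

@dataclass
class VM:
    config: VMConfig
    def apply(self, cfg: VMConfig) -> None:
        self.config = cfg  # stand-in for the actual hypervisor reconfiguration call

# Knowledge repository: best-known configuration per AppFlow type,
# built offline by characterizing each workload class (values illustrative).
REPOSITORY = {
    "browse_light":  VMConfig(cores=1, freq_mhz=1600, memory_mb=1024),
    "browse_heavy":  VMConfig(cores=2, freq_mhz=2400, memory_mb=2048),
    "bidding_heavy": VMConfig(cores=4, freq_mhz=2800, memory_mb=4096),
}

def classify_appflow(cpu_util: float, mem_util: float, req_rate: float) -> str:
    # Placeholder rules mapping monitored metrics to an AppFlow type.
    if req_rate > 500 or cpu_util > 0.7:
        return "bidding_heavy"
    if mem_util > 0.5:
        return "browse_heavy"
    return "browse_light"

def manage(vm: VM, cpu_util: float, mem_util: float, req_rate: float) -> None:
    # One iteration of the loop: classify, look up, reconfigure if needed.
    target = REPOSITORY[classify_appflow(cpu_util, mem_util, req_rate)]
    if vm.config != target:
        vm.apply(target)  # scale cores/frequency/memory up or down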
F. Dissertation Outline
The remainder of the dissertation is organized as follows. In Chapter 2, we present
related work in the area of power and performance/energy management. We mainly
focus on software based management of the large scale systems instead of hardware
based solutions.
Chapter 3 introduces the concept of autonomic management and how it can be used to
perform cross-layer management of cloud resources and their applications.
In Chapter 4, we present our autonomic power and performance management
approach for cloud computing systems where we use AppFlow based reasoning to scale
up/down the virtual resources allocated for the VMs.
Next, in Chapter 5, we discuss our experimental results. First, we explain the testbed
system and then discuss in detail the RUBiS benchmark that is used in evaluating the
performance. Next we evaluate our approach and show how it compares to other methods
and discuss the overhead of our approach.
Chapter 6 summarizes the research and contributions, and then discusses future
research directions to optimize power and performance as well as other properties such as
availability and security for large scale cloud systems.
CHAPTER II
RELATED WORK
In this chapter we present the related work on power/energy and performance
management for cloud computing systems, datacenters, and large scale systems.
Power/energy-conscious resource management studies for large scale computing systems
can be grouped into the following strategies: QoS with power reduction, exploiting
hardware advances, applying software advances, and turning off unused capacity.
A. QoS with Power Reduction
With efficient management of resources, applications can be executed using less
power while keeping the response time and/or execution time within the limits
described by the QoS requirements. Hence, power consumption can be reduced through
techniques such as reducing processing speed, display intensity and/or resolution, etc.
Such methods are especially suitable for battery-powered systems where energy is a
severely constrained resource. This dissertation proposes reducing power consumption
without degrading the QoS of applications and workloads.
B. Hardware Advances
Computer and component manufacturers are producing components with better energy
efficiency, meaning they can perform more computation per unit of energy. By
exploiting such advances, it is possible to reduce power consumption further.
C. Software Advances
System power consumption can be reduced by streamlining software modules,
lowering the computational and memory requirements to run them.
D. Turning off unused capacity
Computing and communication infrastructures are usually provisioned for peak
workload requirements. However, average workloads are normally several orders of
magnitude less demanding. By turning off unused capacity or bringing it to states
that require less energy, a considerable amount of energy can be saved.
Our research focuses on reducing the power consumption of idle resources and
exploiting emerging hardware and software technologies to optimize power and
performance of cloud resources as well as their applications.
The related work on power/energy research can be summarized as follows:
(1) Dynamic Voltage and Frequency Scaling (DVFS) to reduce power by scaling the
frequency and voltage of the allocated cores [19-25, 53-55]; (2) memory management
techniques targeting DVFS or cache partitioning [26-28]; (3) network- and IO-centric
power management [31-32] to shut down unutilized devices; (4) workload-scheduling-based
power management [57-65], applying heuristics constrained by network traffic, thermal
hot-spots, fan speeds, etc.; and (5) workload consolidation [66-69], applying live
migration to turn off idle systems.
A. Dynamic Voltage and Frequency Scaling (DVFS)
Dynamic voltage and frequency scaling (DVFS) methods scale the frequency of each
core of the physical machine and the associated VM's cores to manage power and
performance. Current technologies in the CPU market include SpeedStep from Intel [53]
and PowerTune from AMD [54], which dynamically modify both frequency and CPU
voltage using ACPI P-states.
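For concreteness, on a Linux host these P-states are commonly exposed through the cpufreq sysfs interface; the short sketch below, which assumes the userspace governor is available and that the process runs with root privileges, shows how a manager could pin a per-core frequency.

def cpufreq_path(core: int, attr: str) -> str:
    # Standard Linux cpufreq sysfs layout.
    return f"/sys/devices/system/cpu/cpu{core}/cpufreq/{attr}"

def available_frequencies(core: int) -> list[int]:
    # Supported P-state frequencies in kHz, if the driver exports them.
    with open(cpufreq_path(core, "scaling_available_frequencies")) as f:
        return [int(x) for x in f.read().split()]

def set_core_frequency(core: int, khz: int) -> None:
    # Hand frequency control to userspace, then pin the requested speed.
    with open(cpufreq_path(core, "scaling_governor"), "w") as f:
        f.write("userspace")
    with open(cpufreq_path(core, "scaling_setspeed"), "w") as f:
        f.write(str(khz))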
Ge et al. propose an off-line DVFS-based management algorithm that improves
energy efficiency using a weighted, user-driven energy-delay product [55] to shift the
priority toward energy saving or toward performance. Their experiments show that they
can save up to 30% energy with 5% performance degradation. However, an off-line
DVFS-based algorithm is only effective if detailed information on the application is
known before execution.
Wang and Chen [19] investigated multiple control-theory-based approaches to keep
the power consumption within a given power budget by reducing the frequency levels at
runtime. Their methods reduce the frequency of the system if the power consumption is
higher than the defined limit. Reference [20] identified the periods during which the
workload performs an extensive number of load operations. Since the processors wait
during load operations until the I/O operations are completed, their approach takes
advantage of these idle periods by scaling down the frequency of the corresponding
cores in order to reduce power consumption. In a similar manner, Wirtz and Ge [21] used
DVFS for MapReduce operations. The proposed method applies the highest frequency to
the nodes running map and reduce functions and the minimum frequency otherwise. They
also bound the performance loss to a user-specified value. Their results showed a 19%
energy reduction at the expense of 17% performance degradation, and a 23% energy
reduction with a 5% performance loss.
Applying control theory to resource management is another technique, where the
current system behavior is monitored and used as feedback. Using control theory for
DVFS to maximize performance within a power budget has been discussed in [22] and
[23]. In [24], an adaptive MIMO control system based on fuzzy modeling technology
is proposed for combined power and performance constraints. Reference [25] proposes
dynamically controlling the CPU allocation for each VM workload and the number of
VMs for a specified quality of service and power budget.
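A minimal proportional feedback controller of the kind these works employ can be sketched as follows; the gain and frequency bounds are made-up values for illustration, not parameters from the cited papers.

def next_frequency(freq_mhz: float, power_w: float, budget_w: float,
                   kp: float = 10.0, f_min: float = 1200.0,
                   f_max: float = 2800.0) -> float:
    # Proportional control: positive error (headroom) raises frequency,
    # negative error (over budget) lowers it; clamp to the P-state range.
    error = budget_w - power_w
    return max(f_min, min(f_max, freq_mhz + kp * error))

# Example: 95 W measured against a 90 W budget at 2600 MHz
# -> next_frequency(2600, 95.0, 90.0) == 2550.0, i.e., the controller backs off.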
Even though the DVFS approach is useful for reducing the power consumption of
large scale systems, it focuses on only one resource type, while the overall resource
interactions are not well studied.
B. Memory Management
Reference [26] proposed a method for designing and developing multiple new power
states for DRAMs, similar to the CPU DVFS approach, based on the fact that memory can
consume on average 23% of the system power. The results show a 2.43% average
(5.15% maximum) energy-efficiency improvement. In [27], a power-aware cache partitioning
mechanism is presented, which reduces energy consumption by 20% while maintaining
the performance. Similarly, reference [28] showed that by using memory management in
addition to multicore processor management, it is possible to reduce power consumption
by up to 63.75% compared to other methods by activating/deactivating memory banks at
runtime. It should be noted that these methods assume that the amount of memory is
fixed at runtime and reduce power consumption by switching the memory states.
However, the allocated memory amount plays an important role in power consumption
as well.
Tolentino et al. [56] developed the Memory Management Infrastructure for Energy
Reduction (MemoryMISER) for dynamic power-scalable memory management, utilizing
a modified Linux kernel and a daemon implementation of a PID controller to scale
memory on-line and off-line while the system is operating. Their experiments show that
the proposed technique can achieve a 70% reduction in memory energy consumption and
a 30% reduction in total system energy consumption with a performance degradation of
less than 1%.
C. Workload Scheduling
Workload scheduling is another approach for reducing power consumption by
assigning workloads to appropriate machines based on cost, thermal issues, network
traffic, etc. Chang discussed dynamically scheduling virtual cores to physical cores
to optimize power and performance, demonstrating a 17% reduction in
power consumption [57]. In [58], power is reduced by considering the thermal states of
the systems and fan speeds as well as network operations while distributing the
workloads. They first analyzed each component separately to determine the
imbalance of CPU power consumption in large systems, and they reduce the power
consumption of the system by 12% by assigning workloads to cooler nodes. Laszewski
[59] presented a power-aware clustering algorithm that applies DVFS based on
utilization. Another workload scheduling algorithm is proposed in [60] for data
centers powered mainly by solar energy. They predict both the workload requirements
and the amount of available solar energy for a greener system, so that low-priority
workloads can run on solar energy, while workloads that must run on the
electrical grid are assigned to the periods when the cost of electricity is
lowest. This approach can reduce electrical grid consumption by up to
39%. In [61], a heuristic scheduling algorithm is proposed to reduce the energy consumption
of parallel tasks, so that the voltage of the processors running non-critical work can be
reduced. Gandhi [62] collected patterns of resource usage by analyzing the workload
for an entire day. These patterns are then used to predict the resource demand and
allocate servers accordingly. At runtime, they apply a fine-tuning technique to respond to
variations in the workload behavior, achieving up to a 35% reduction in power
consumption.
Rodrigo et al. propose a scheduling methodology in which the system updates the
number of required VM instances based on the required QoS and the current QoS [63].
Using an analytical performance model and workload information of an application, they
apply their methodology with the CloudSim simulation toolkit for both web-based and
scientific workloads. Srikantaiah et al. [64] studied request scheduling for
multi-tiered web applications to minimize energy consumption while meeting
performance requirements. They showed the effect of performance degradation due to
high resource utilization and tried to find the right utilization point, formulating
workload scheduling as a multidimensional bin-packing problem. Younge et al. [65]
proposed a greedy algorithm to reduce the power consumption of cloud systems by
scheduling as many VMs as possible onto each physical node, using the OpenNebula
scheduler with 4 servers; the number of VMs a physical node can handle is decided
based on the number of cores available in that node.
It should be noted that these workload scheduling algorithms mainly focus on
scheduling across servers, whereas reducing the power of individual servers based on the
allocated virtual resources is not addressed.
D. Workload Consolidation
Using virtualization with live migration is another common method, in which the
objective is to increase the number of unutilized hosts by running the workload on the
least number of nodes. The performance and the energy cost are studied in [66] to reduce
the migration cost and the energy consumption. An efficient resource management
policy for virtualized cloud data centers was presented in [67], continuously
consolidating VMs through live migration and switching off idle nodes to minimize
power consumption. In another technique for power-efficient VM allocation, Cardosa et
al. [68] propose using the hypervisors' CPU allocation to consolidate VMs, since they
share the same resource. Tsai et al. [69] apply a capacity planning technique and use
server consolidation to reduce the energy consumption of web-based applications by
estimating the CPU processing capacity needed to serve the incoming workloads.
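Consolidation of this kind is often cast as a bin-packing problem. The sketch below uses a first-fit-decreasing heuristic over CPU demand alone, a deliberate simplification of the cited works, which additionally weigh memory, migration cost, and SLA constraints.

def consolidate(vm_demands: list[float], host_capacity: float) -> list[list[float]]:
    # First-fit decreasing: place each VM (largest first) on the first host
    # with room, opening a new host only when none fits. Hosts left out of
    # the packing can then be switched off or put into a low-power state.
    hosts: list[list[float]] = []
    for demand in sorted(vm_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)
                break
        else:
            hosts.append([demand])
    return hosts

# Example: six VMs with normalized CPU demands packed onto capacity-1.0 hosts.
# consolidate([0.5, 0.4, 0.3, 0.3, 0.2, 0.2], 1.0)
# -> [[0.5, 0.4], [0.3, 0.3, 0.2, 0.2]]  (two active hosts instead of six)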
E. Discussion
As the literature review shows, most techniques identify the components for dynamic
power management and then turn them on/off or change their settings. However, they do
not exploit different classes of resources, such as processor and memory, at the same
time. In order to effectively reduce the power consumption of large scale systems,
multiple resource components should be taken into consideration, and methods that are
dynamic, non-overprovisioning, and autonomic should be developed.
According to data provided by Intel Labs [70], even though the power consumed by a
server is mainly drawn by the CPU, memory plays an important role as well, as shown in
Figure 2. Additionally, other components such as the disk, NIC, etc. should not be
discarded completely. This data underscores the need to manage multiple resources
concurrently if the goal is to minimize the power consumption of a server.
Figure 2. Power consumption of the system components in a server, by Intel [70]
Our proposed effort relies on learning the behavior of cloud applications and
predicting their characteristics at runtime in order to accurately determine their resource
requirements. The goal is to design an innovative autonomic management architecture
that can be integrated with cloud resources to transform them into intelligent
self-managing entities that optimize their performance and power consumption as well as
other properties like availability. The dynamic resource management problem requires
finding the optimal or near-optimal resource configuration that leads to power
reduction without compromising performance. During the life cycle of a cloud
workload execution, the configuration of cloud resources should be adapted by
mapping the current workload into a set of well-defined AppFlow types, where we know
for each AppFlow type the best VM configuration in terms of number of cores, CPU
frequency, and amount of memory/storage allocated to each VM.
The closest techniques to our approach are those reported in [10], [71], and [72]. In
[71], the proposed method scales the VM resources up/down at runtime based on
utilization ratios (i.e., if the utilization of a component is high, the allocated size of that
component is increased). By doing so, they increase the performance of the critical
services without degrading the performance of the non-critical services. However, they
do not consider power consumption in their work. In [72], the authors apply power and
performance management for VMs, targeting live migration to optimize performance,
costs, and power. Using a multi-tiered optimization method, they were able to reduce the
power consumption with a slight performance loss. However, in their work, they only
consider consolidating the number of CPUs for the VMs, excluding memory and
core frequencies. The work presented in [10] studied the issue of multiple resource
management for power and performance.
In what follows, we will discuss in further detail our approach to implement such an
autonomic management framework for cloud resources and services.
CHAPTER III
THE AUTONOMIC MANAGEMENT PARADIGM
In this chapter, we explain the autonomic management paradigm, its requirements,
and its components. We also briefly describe several autonomic computing projects that
manage performance, faults, and other system properties.
A. Introduction
Autonomic computing aims at designing computing systems and applications that can
self-manage their operations with little or no involvement of users or system
administrators. They are inspired by biological systems (e.g., autonomic nervous system)
that can unconsciously handle abnormal behaviors to bring the system into a stable and
survivable state [52, 73-75]. Hence, an autonomic management system is analogous to a
self-regulating human body, where no human interference or consciousness is required
[73, 52]. This paradigm provides a holistic approach to the development of
systems/applications that can adapt themselves to meet requirements of performance,
fault tolerance, reliability, security, Quality of Service (QoS), etc., without manual
intervention [73]. Therefore, autonomic management is critically needed for the
management of cloud computing systems, since ad-hoc manual resource management is
impractical given the increasing scale, dynamism, and complexity of cloud systems and
their applications or services.
An autonomic computing paradigm must have a mechanism whereby changes in its
essential variables (e.g., performance, fault, security, etc.) can trigger changes to the
behavior of the computing system such that the system is brought back into equilibrium
with respect to the environment. This state of stable equilibrium is a necessary condition
for the survivability of the organism. In the case of an autonomic computing system, we
can think of survivability as the system's ability to protect itself, recover from faults,
reconfigure as required by changes in the environment, and always maintain its
operations at near-optimal performance. Both the internal environment (e.g., excessive
CPU utilization) and the external environment (e.g., an external attack) impact its
equilibrium.
B. Properties of Autonomic Computing
IBM introduced the autonomic computing concept in 2001 to develop self-managing
systems that can cope with the rapidly growing complexity, growth, and dynamism of
computing systems [13]. Based on IBM's definition of autonomic management, complex
computing systems should be capable of performing automated maintenance and
optimization to reduce the system administrators' workload.
To offer such properties, an autonomic computing system can be a collection of
autonomic components, which manage their internal behaviors and relationships with
others in accordance with high-level policies. The principles that govern all such systems
have been summarized as eight defining characteristics [77, 78]:
1. Self-configuring: This feature indicates that the system should be able to adapt to
changing environments (e.g. adding/removing components, changes in system
characteristics) dynamically using policies provided by the IT professional.
2. Self-healing: An autonomic system should discover, diagnose, and react to
disruptions triggered by hardware and/or software failures. Self-healing components can
detect malfunctions in the system and initiate policy-based corrective actions without
disrupting the IT environment. Corrective action could involve a product altering its own
state or effecting changes in other components in the environment.
3. Self-optimizing: Self-optimizing components can tune themselves to meet end-user
or business needs by detecting sub-optimal behaviors and performing self-optimization
functions intelligently. The tuning actions could mean reallocating resources, such as in
response to dynamically changing workloads, to improve overall utilization, or ensuring
that particular business transactions can be completed in a timely fashion.
Self-optimization helps provide a high standard of service for both the system's end users
and a business's customers.
4. Self-protecting: Self-protecting components can detect attacks or malicious
activities and then take proactive corrective actions to make the managed system or
application less vulnerable. Hostile behaviors include unauthorized access and use, virus
infection and proliferation, and denial-of-service attacks, to name a few.
5. Self-Awareness: An autonomic system knows itself and is aware of its state and
its behaviors.
6. Contextually Aware: An autonomic system must be aware of its execution
environment and be able to react to changes in the environment.
7. Open: An autonomic system must be portable across multiple hardware and
software architectures; consequently, it must be built on standard and open protocols
and interfaces.
8. Anticipatory: An autonomic system must be able to anticipate, to the extent that it
can, its needs and behaviors and those of its context, and be able to manage itself
proactively.
As a result, an autonomic cloud management system's behavior can include adding or
removing VMs to maintain the QoS requirements (self-configuring), restarting or
re-instantiating a non-responsive VM (self-healing), adjusting the resources allocated to
the VMs for power and performance management (self-optimizing), and limiting the VM
resources if it detects that these resources are compromised or negatively affecting the
normal operations of the whole system (self-protecting). It is important to note that an
autonomic computing system addresses these issues in an integrated manner rather than
treating them in isolation, as is currently done. Consequently, in autonomic computing,
the system design paradigm takes a holistic approach to seamlessly manage all system
functional and non-functional properties in an integrated manner.
Typically, self-management in autonomic computing is addressed through the
following four primary aspects: self-configuration, self-optimization, self-protection,
and self-healing. Autonomic components need to collaborate to achieve coherent
autonomic behaviors at the application level. This requires a common set of underlying
capabilities, including representations and/or mechanisms for solution knowledge,
system administration, problem determination, monitoring and analysis, policy
definition and enforcement, and transaction measurement [79].
It should be noted that there is an important distinction between autonomic activity in
the human body and autonomic activities in IT systems [80]. Many of the decisions made
by autonomic capabilities in the body are involuntary. In contrast, self-managing
autonomic capabilities in computer systems perform tasks that IT professionals choose to
delegate to the technology according to its policies. Adaptable policy, rather than
hard-coded procedure, determines the types of decisions and actions that the autonomic
management capabilities must perform [80]. Self-managing capabilities in a system
accomplish their functions by taking an appropriate action based on one or more
situations that they sense in the environment. The function of any autonomic capability
can be modeled as a control loop that collects details from the managed system and acts
accordingly.
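This control loop is commonly structured as what IBM terms the MAPE-K loop (monitor, analyze, plan, and execute over shared knowledge). A schematic Python sketch follows, in which the managed-system interface (sense/act) and the policy contents are hypothetical stubs rather than any particular implementation.

import time

def analyze(symptoms: dict, knowledge: dict) -> bool:
    # Analyze: decide whether the sensed state violates a policy threshold.
    return symptoms.get("cpu_load", 0.0) > knowledge.get("load_threshold", 0.8)

def plan(symptoms: dict, knowledge: dict) -> str:
    # Plan: select a corrective action from the knowledge base.
    return knowledge.get("corrective_action", "scale_up")

def autonomic_loop(system, knowledge: dict, period_s: float = 5.0) -> None:
    # Monitor -> Analyze -> Plan -> Execute, repeated over time; 'system'
    # is assumed to expose sense() and act() over the managed element.
    while True:
        symptoms = system.sense()                      # Monitor
        if analyze(symptoms, knowledge):               # Analyze
            system.act(plan(symptoms, knowledge))      # Plan + Execute
        time.sleep(period_s)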
C. Autonomic Computing System: The Conceptual Architecture
In this section we describe how the concepts of Ashby's ultrastable system, which is
based on the human nervous system, can be applied to an autonomic computing system.
In Ashby's ultrastable system, adaptive behavior is explained in terms of the human
body's adaptation: internal body mechanisms continuously work together to maintain
essential variables within their limits, and in order for an organism to survive, its
essential variables must be kept within viable limits; otherwise, disintegration and/or
loss of identity (dissolution or death) may occur for the organism [81]. In Ashby's
ultrastable system, a system is called adaptive if it maintains its essential variables
within the physiological limits [82] that define the viability zone. Therefore, adaptive
behavior aims at the survivability of the system, and if the (internal or external)
environment pushes the system outside its balanced state, the system will always work
towards coming back to the original equilibrium state. Ashby observed that many
organisms undergo two forms of disturbance: (1) frequent small impulses to the main
variables and (2) occasional step changes to its parameters. Based on these observations,
the ultrastable system architecture consists of two closed loops: (1) one to control small
disturbances and (2) another control loop for larger, step disturbances.
Figure 3 presents the conceptual architecture of an autonomic computing system. This
architecture derives directly from Ashby's ultrastable system.
Figure 3. The conceptual view of autonomic computing system
As shown, the autonomic computing system consists of the following modules:
1) Computing Environment
The computing environment represents the high-performance computing applications
and their execution environment. The autonomic system is geared towards control and
management of this high-performance computing environment.
2) Environment
The environment represents all the factors that can impact the High Performance
Computing System. Any change in the environment can cause the whole system to go
from a stable state to an unstable state. Such a change must be countered by a set of
reactive changes in the Computing System that move the system back from the unstable
state to a (possibly different) stable state. Notice that the environment consists of two
parts: internal and external. The internal changes are the runtime application state
changes, whereas the external environment reflects the state of the execution
environment.
3) Controller
At runtime, a system can be affected by many changes such as a failure, an internal or
external attack, or performance issues affecting an application or the whole system. Such
issues need to be managed efficiently at runtime by the Controller. The Control module
contains the following engines to operate successfully and to autonomously keep the
system in a stable state:
a. Monitoring and Analysis Engine (M&A): Monitors the state of the environment
through its sensors and analyzes the collected information.
b. Planning Engine (PE): Plans alternate execution strategies (by selecting
appropriate actions) to optimize the behaviors (e.g., to self-heal, self-optimize,
self-protect, etc.).
c. Knowledge Engine (KE): Provides the decision support that allows the Control to
pick the appropriate rule from a set of rules to improve performance.
The Control module consists of both local and global control loops that operate on the
individual elements and on the overall system, respectively, as explained below.
a) Local Control Loop
The local control loop manages the behavior of individual and local system elements
where the applications and components execute. Such a control loop adds
self-management capabilities to the conventional components/elements by controlling
local algorithms, resource allocation strategies, distribution and load balancing strategies,
etc. Note that this loop only handles known environment states, and the mapping of
environment states to behaviors is encapsulated in its knowledge engine (KE). For
example, when the load on the system resources exceeds the acceptable threshold value,
the local control loop will balance the load by either controlling the local resources or by
reducing the size of the computational loads. This will work only if the local resources
can handle the computational requirements. However, the local control loop is blind to
the overall behavior and thus cannot achieve the desired overall objectives; by itself it
can lead to sub-optimal behavior.
b) Global Control Loop
While the local control loop manages attributes individually, at some point one of the
essential variables of the system may exceed its limits, which triggers the global control
loop subsystem. The global control loop then manages the behavior of the overall
application and defines the knowledge that drives the local adaptations. This control loop
can handle unknown environment states and uses four attributes or features for
monitoring and analysis of the environment: performance, fault-tolerance, configuration,
and security. It acts to change the existing behavior of the high-performance computing
environment so that it can adapt itself to changes in the environment. For example, in the
previous load-balancing example, the existing behavior of the high-performance
computing environment (as directed by the local loop) was to maintain its local load
within prescribed limits. However, when applied blindly, local control may degrade the
performance of the overall system. Such a change in the overall performance cardinal
triggers the global loop. The global loop then selects an alternate behavior pattern from
the pool of behavior patterns for this high-performance computing environment. The
Planning Engine (PE) determines the appropriate plan of actions using its Knowledge
Engine (KE). Finally, the Execution Engine (EE) executes the new plan on the
high-performance computing environment in order to adapt its behavior to the new
environment conditions.
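To make the interplay between the two loops concrete, the following minimal Python sketch illustrates the structure: a local loop that handles known disturbances through its knowledge mapping, and a global loop that is triggered when an essential variable leaves its viable limits. All names (VIABLE_LIMITS, LOCAL_KNOWLEDGE, BEHAVIOR_POOL) are illustrative stand-ins, not part of the framework described here.

# Minimal sketch of the two-loop (local/global) control structure; all names illustrative.

VIABLE_LIMITS = {"load": (0.0, 0.8)}          # essential variable and its viable zone
LOCAL_KNOWLEDGE = {"high_load": "rebalance"}  # known state -> local action (local KE)
BEHAVIOR_POOL = ["conservative", "scale_out"] # alternate behavior patterns (global KE)

def within_limits(state):
    return all(lo <= state[v] <= hi for v, (lo, hi) in VIABLE_LIMITS.items())

def control_step(state, label):
    if within_limits(state):
        return "no action"
    action = LOCAL_KNOWLEDGE.get(label)        # local loop: known disturbances only
    if action is not None:
        return f"local loop: {action}"
    # Essential variable out of limits with no known mapping: escalate to global loop.
    return f"global loop: adopt behavior '{BEHAVIOR_POOL[-1]}'"

print(control_step({"load": 0.95}, "high_load"))  # handled locally
print(control_step({"load": 0.95}, "unknown"))    # escalated to the global loop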
In the next section, we describe our approach to implementing a distributed autonomic
computing system based on the conceptual architecture discussed above.
D. Autonomic Computing System Framework
Figure 4 illustrates the main modules of our autonomic computing system framework,
which is inspired by Ashby's ultrastable system architecture proposed to model the
autonomic nervous system.
Figure 4. Autonomic Computing System: The Implementation View
The proposed framework consists of the following modules:
1) Application Management Editor
In this framework, the users and/or application developers can specify the
characteristics and requirements of their applications using an Application Management
Editor (AME). Each application can be expressed in terms of one or more autonomic
components. These components are expressed using the Autonomic Component
Architecture (ACA), as will be discussed later. Once the application is specified, the next
step is to autonomously control and manage the execution of the application at runtime
using the services and routines offered by the Autonomic Runtime System (ARS).
2) Autonomic Runtime System (ARS)
The Autonomic Runtime System (ARS) exploits the temporal and heterogeneous
characteristics of the scientific computing applications, and the architectural
characteristics of the computing and storage resources available at runtime to achieve
high-performance, scalable, and robust scientific simulations. ARS will provide
appropriate control and management services to deploy and configure the required
software and hardware resources to autonomously (e.g., self-optimize, self-heal) run
large-scale scientific applications in computing environments. The local control loop is
responsible for control and management of one autonomic component, while the global
control loop is responsible for the control and management of an entire autonomic
application.
The ARS can be viewed as an application-based operating system that provides
applications with all the services and tools required to achieve the desired autonomic
behaviors (self-configuring, self-healing, self-optimizing, and self-protecting). The
primary modules of ARS are the following:
a) Application Information and Knowledge (AIK) Repository
The AIK repository stores the runtime application status information, the application
requirements, and knowledge about optimal management strategies, for both applications
and system resources, that have proven to be successful and effective. In addition, the
AIK contains a Component Repository (CR) that stores the components currently
available for the users to compose their applications with.
b) Autonomic Middleware Services (AMS)
The AMS provides appropriate control and management services to deploy and
configure the required software and hardware resources to run autonomously (e.g.,
self-optimize, self-heal, etc.). These runtime services maintain the autonomic properties
of applications and system resources at runtime. To simplify the control and management
tasks, we dedicate one runtime service to each desired attribute or functionality, such as
self-healing, self-optimization, self-protection, etc. The event server notifies the
appropriate runtime service whenever its events become true. The overall algorithm
followed to achieve self-management for any service (self-healing, self-optimizing,
self-protecting, etc.) is shown in Figure 5.
1. while (Component ACAi is running) do
2.   State = CRMi Monitoring(ACAi)
3.   State_Deviation = State_Compare(State, DESIRED_STATE)
4.   if (State_Deviation == TRUE)
5.     CRMi Send_Event(State)
6.     EventServer Notify ARM
7.     Event_Type = CRM_Analysis(State)
8.     if (CRMi ABLE)
9.       Actions = CRM_Planning(State, Event_Type)
10.      Autonomic_Service ASj ∈ {ASconfig, ASheal, ASoptimization, ASsecurity}
11.      Execute ASj(Actions)
12.    else
13.      Actions = ARM_Analysis(State, Event_Type)
14.      Execute ASj(Actions)
15.    end if
16. end if
17. end while
Figure 5. Self-Management Algorithm
c) Application Runtime Manager (ARM) - The Global Control Loop
The ARM focuses on setting up the application execution environment and then
maintaining its requirements at runtime. The ARM performs online monitoring to collect
the status and state information using the component sensors. It analyzes component
behaviors and detects any anomalies or state changes reflected by out-of-bound values of
the cardinals (for example, degradation in performance or component failure). The ARM
then generates the appropriate control and management plans as specified by the
knowledge in its knowledge repository. The plan generation process consists of
triggering the appropriate rule from the list of rules (for the appropriate service) in the
knowledge repository. The ARM's main modules include 1) Online Monitoring and
Analysis, 2) Online Planning, and 3) Autonomic Execution.
d) Event Server
The Event Server receives events from the Component Managers that monitor
components and systems, and then notifies the corresponding engines subscribed to these
events.
e) Coordinator
The Coordinator is responsible for handling application execution by coordinating
between the different ACAs in different coupled models.
3) High Performance Computing Environment (HPCE)
The HPCE consists of two main modules: the high-performance computing application
and the execution environment on which it runs. In effect, the framework's ARS as
described above manages this part.
4) Autonomic Component Architecture (ACA)
An autonomic component is the fundamental building block for autonomic
applications in our Autonomic Computing Framework.
E. Autonomia
The researchers at the NSF Center for Cloud and Autonomic Computing at the
University of Arizona have developed and successfully implemented a general autonomic
computing architecture (Autonomia) that is adopted in this project to achieve the desired
autonomic power and performance management capabilities [75, 83, 85]. Autonomia
provides the software tools needed to implement the ACE's self-management
capabilities. By adopting the autonomic architecture shown in Figure 4, we implement
the ACE services by adding two software modules, the Observer and Controller modules,
as shown in Figure 6.
Figure 6. The control and management logic for the Autonomic Control Element (ACE)
The Observer monitors and analyzes the current resource (i.e., hardware) and software
activities of the ACE's sub-domain whenever required. The Controller is delegated to
manage the sub-domain's execution and enforces the operational policies for both the
resources and the software. In fact, the Observer and Controller pair provides a unified
management interface to support the ACE's self-management services by continuously
monitoring and analyzing current ACE activities in order to dynamically select the
appropriate plan to correct or remove anomalous system conditions once they are
detected and/or predicted. The Observer monitors and gathers system data (both
hardware and software) about the ACE's sub-domain and analyzes it to form the
knowledge required by the Controller to enforce the ideal resource allocation
management policies. The Controller and Observer mechanisms associated with each
ACE can also be extended to cover the whole hierarchical architecture. If multiple ACEs
are integrated into one team, the Observers and Controllers of the local ACEs are
connected through an upper-level ACE with its own upper-level Observer and Controller
modules. The upper-level Observer integrates and fuses data gathered from the local
ACEs, while the upper-level Controller coordinates their actions. This upper-level
Controller manages system resources among local ACEs using a pre-defined protocol.
The resource allocation plans specify all the information necessary to facilitate
information sharing and collaboration among ACEs. For example, for the purpose of
ACE team formation in a VM-based computing environment, the messages defined in
this protocol would include invitation-to-join-team, accept-team-invitation, and
reject-team-invitation. The invitation-to-join-team message is sent from an upper-level
ACE to a local ACE to ask it to join its team. The accept-team-invitation or
reject-team-invitation message is sent as a response from the local ACE to indicate
whether it accepts or rejects this invitation. Figure 7 illustrates the algorithm used by the
Observer and Controller modules to make ACEs operate in an autonomic manner.
1. while (true)
2.   if (any changes in management policies)
3.     read and update Controller_Policies section
4.   end if
5.   CurrentState ← Observer_Sensors
6.   event ← Other_Component_Observer
7.   State_Deviation = Observer_Analyzer(event, CurrentState)
8.   if (State_Deviation is true)
9.     Controller_Actions = Controller_Policies(Component_CurrentState, event)
10. end if
11. if (event != null)
12.   other Component_Observer ← event
13. end if
14. end while
Figure 7. The autonomic logic of an ACE
The Controller continuously checks for changes to its current resource utilization and
allocation policies; if there are changes, it automatically reads and updates the Controller
policies (Steps 2-4). The Observer module monitors the current traffic state of its ACE as
well as other ACEs using the indicators/monitors that are defined in the Observer sensor
section (Step 5). The Observer module can also receive events from other element
observers that might require immediate attention (Step 6). The monitored system
information is then analyzed by the analyzer as defined in the Observer analyzer section
(Step 7). If the analyzer routine determines that the current state of the element represents
an abnormal system utilization condition, it will use its knowledge base and anomaly
behavior analysis of the system to identify the best plan to fix the current problem. Once
that is done, the Controller is invoked to carry out the appropriate actions to bring the
element back to normal operation (Steps 8-9). If the Observer needs to send information
to other elements, it sends their observers the appropriate type of events using a
pre-defined communication protocol (Step 13). While the logic is straightforward, it
requires determining the list of traffic policies and how to implement them in each ACE.
In this project, we utilize Autonomia techniques to develop the autonomic cloud
management capabilities.
F. Related Work on Autonomic Computing
In this section we describe related work on autonomic computing and how it
contributes to a wide range of applications, especially for the management of systems.
Autonomic management can be used for many applications. In Reference [86], an
autonomic management framework for home applications is presented to provide a
seamless integration of the applications and of the autonomic elements sharing the same
environment.
A self-configuration approach is presented in [87] for the control and management of
large-scale network security by automatically updating the security mechanisms. The
approach includes a Component Runtime Manager which continuously monitors and
analyzes network resource states in order to dynamically control security tools (e.g.,
Intrusion Detection Systems (IDS), firewalls, etc.) in an integrated manner. Assuming the
system administrators have access via the internal network, Sibai et al. [88] propose an
autonomic framework that monitors all the traffic between the server and the users.
Based on the traffic, a decision is made on whether to drop the request and inform the
system owner.
A framework for cloud services is proposed in [89] which has the capability of
learning and establishing its own dynamic resource perception (which changes based on
the resources available) and modifying it if needed, building a fault-tolerant cloud system
by checking whether a resource should be included in the computations.
Large-scale computation and data systems serving large-scale parallel applications in
science, engineering, and commerce use multiple heterogeneous resources connected to
each other through an interconnection network. For high performance and
manageability, Reference [90] presents an autonomic scheduling mechanism using
bandwidth and tolerance, where scheduling decisions take the status of the network into
account.
In Reference [91], a self-managing adaptive cloud management approach using
autonomic computing paradigms with pulse monitoring is proposed to obtain flexible and
fault-tolerant systems with minimal human intervention. Using pulse-based monitoring
(PBM) periodically as a heartbeat mechanism, failures in the system are detected by the
autonomic elements. In their implementation, PBM pings the elements in a peer-to-peer
mode, so each manager is responsible for monitoring its neighbors.
Reference [92] introduces an autonomic architecture for a self-organizing system
which manages a dynamic number of heterogeneous server clusters. The introduced
self-organizing controller targets maintaining the same response time and/or CPU usage
across all the servers in the cluster. Reference [93] presents optimal VM allocation for a
cloud provider to maximize revenue subject to capacity, availability, and VM migration
constraints. Using a near-optimal heuristic, the authors show improved revenue and
availability compared to other algorithms.
The decision-making process of autonomic systems is a critical task. Hence, in [94], a
stochastic search-based genetic algorithm technique is applied that relieves developers of
the need to prescribe reconfigurations. The authors experiment with dynamic
reconfiguration of a network to maintain requirements and provide resiliency. Similarly,
[95] showed that traditional control-based techniques for automating datacenter
management can be replaced with autonomic management schemes based on statistical
machine learning methods.
Typically, many managers may exist for the autonomic control elements, which
creates synchronization and coordination challenges among the autonomic loops when
multiple constraints must be satisfied. The authors in [96] focused on both power and
performance with two different control loops and integrated them to obtain a stable and
scalable architecture. Similarly, the authors in [97] showed that multiple autonomic
management control loops can be synchronized with each other to provide the required
scalability.
CHAPTER IV
AUTONOMIC POWER AND PERFORMANCE MANAGEMENT
Cloud systems and data centers supporting today's e-commerce and web-hosting
services typically host hundreds and sometimes thousands of server clusters densely
packed in server racks in order to meet high performance demands within limited
floor space. Power and energy consumption have become key concerns in these data
centers. High peak power requirements in these data centers translate into large and
expensive uninterruptible power supplies and backup power generators, which are
necessary in case of a power outage. The tightly packed server racks create high power
density zones within the data center, making cooling both expensive and complicated.
Over a 10-year lifespan, the cost of power and cooling equipment for each rack of a
typical data center has been estimated at $52,800. This leads to a very high Total Cost of
Ownership (TCO) for datacenter owners. The TCO of these centers can reach $4 billion
per year when running 24/7 [98], and 63% of this cost can be attributed to power, cooling
equipment, and electricity [99]. High energy consumption also translates to excessive
heat dissipation, which in turn increases cooling costs and causes servers to become more
prone to failure [100].
The main problem studied in this research is how to minimize the power consumption
of large cloud systems and data centers while maintaining the quality of service
requirements of their services. By solving this important problem, the TCO of data
centers can be reduced significantly without compromising the performance and quality
of service of their applications and services. Our approach to this problem is to apply the
autonomic computing paradigm discussed above to develop an autonomic management
framework for the resources of any cloud system or data center, as well as their
applications.
A. Autonomic Management Framework
An Autonomic Management Environment refers to a system augmented with
intelligence to become self-governed and self-managed under changing internal and
external circumstances [101], as shown in Figure 8. Research and development of such
intelligence to optimize the performance/watt of systems is the focus of this dissertation
research. The autonomic manager controlling the proposed framework follows the typical
MAPE (Monitor, Analyze, Plan, and Execute) paradigm. The Autonomic Manager (AM)
senses the state of the system using its Sensor Port, analyzes the collected information,
plans a management strategy, and then executes it on the system using its Effector Port.
In each of the four phases, the AM may use previously stored knowledge to aid the
decision process. Such an autonomic management approach can be applied to manage
diverse features such as security or fault tolerance. In our approach, we use autonomic
management to optimize performance/watt in cloud systems. We leverage the tools and
services developed in Autonomia at the University of Arizona [83-85]. Autonomia has
been successfully used to proactively manage the performance of large-scale applications
in science and engineering [103]. It has also been used to achieve self-protection of
networks against a wide range of attacks [87, 102].
Figure 8. Autonomic Management System
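As a concrete illustration of the MAPE paradigm just described, the sketch below (in Python) wires the four phases around a shared knowledge store; the monitor and execute functions are stubs standing in for the Sensor and Effector Ports of a real system, and the threshold value is fabricated.

# Illustrative MAPE (Monitor, Analyze, Plan, Execute) loop; sensor/effector are stubs.
import time

knowledge = {"cpu_threshold": 0.8, "history": []}

def monitor():                      # Sensor Port: read the managed system's state
    return {"cpu": 0.9}             # stubbed measurement

def analyze(state):                 # compare against knowledge to detect deviations
    knowledge["history"].append(state)
    return state["cpu"] > knowledge["cpu_threshold"]

def plan(state):                    # choose a corrective strategy
    return {"action": "scale_up", "amount": 1}

def execute(p):                     # Effector Port: act on the managed system
    print(f"executing {p['action']} by {p['amount']}")

for _ in range(3):                  # the autonomic manager's control loop
    s = monitor()
    if analyze(s):
        execute(plan(s))
    time.sleep(1)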
To effectively manage the resources of a system, the AM must determine whether the
system's operational state meets all runtime system constraints and requirements. We
have developed an innovative methodology, referred to as Application Flow (AppFlow),
used by the AM to provide an instantaneous characterization of the behavior of the
environment and to serve as a predictor for self-management, as shown in Figure 9.
Figure 9. Application Flow (AppFlow)
AppFlow is a three-dimensional array of features where the x-dimension captures the
spatial variability of features; the y-dimension captures the temporal variability of
features for each Managed System (MS); and the z-dimension captures the variability in
the managed system state. This set of necessary and sufficient features is categorized into
two distinct classes: Capacity Spatial Features (CSF) and Operational Spatial Features
(OSF). As the names suggest, CSF indicates the maximum possible capacity of the MS,
and OSF indicates the current dynamic capacity (configuration) of the MS, which is
changed by the AM based on the requirements of the application at runtime. The main
goal of the AM is to configure the managed system at runtime such that its OSF makes
the managed system operate normally, i.e., maintain its service level agreement while
minimizing power consumption. It is to be noted that CSF represents a set of Absolute
Features, while OSF represents a set of Dependent Features that can be derived from
CSF. Along the y-axis of AppFlow, the "time" vector captures the behavior of the MS
over a period of time. In essence, it captures the temporal variability of the OSF features
during workload execution to help identify trends and predict the workload's dynamic
resource requirements.
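One simple way to realize AppFlow as a data structure is a three-dimensional NumPy array indexed by (feature, time, managed-system state); the feature names and dimensions below are illustrative choices, not the exact schema used in this work.

# Sketch of AppFlow as a 3-D array: x = spatial features, y = time, z = system state.
import numpy as np

features = ["cores", "cpu_freq", "memory", "cpu_util", "mem_util"]  # illustrative mix
window   = 60        # number of time samples kept per feature
states   = 4         # number of managed-system state snapshots tracked

appflow = np.zeros((len(features), window, states))

# Record one sample of the OSF for state 0 at time step t.
t = 0
appflow[:, t, 0] = [2, 2660.0, 1024.0, 0.73, 0.55]

# Temporal variability of a single feature (e.g., cpu_util) for trend prediction.
cpu_util_series = appflow[features.index("cpu_util"), :, 0]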
B. Case-Based Reasoning
Case-based reasoning is the process of solving a problem by finding similar past
cases and reusing the solution that worked best for those cases [104]. At the highest
level, a general case-based reasoning approach consists of the following four processes:
1. Retrieve: Retrieve the most similar case(s).
2. Reuse: Reuse the case(s) to solve the problem.
3. Revise: Revise the proposed solution if necessary.
4. Retain: Retain the new solution as a part of a new case.
Usually a case retrieval process selects the most similar case(s) to the current problem
using methods such as nearest-neighbor techniques (a weighted sum of the case features
is compared with the ones stored in the historical case database). An example of
retrieving a matching case based on a similarity metric [105] is the weighted,
range-normalized distance

$$d(x, y) = \sum_{i} W_i \, \frac{|x_i - y_i|}{|\max_i - \min_i|}$$

where x and y are two cases, x_i and y_i are their values for the i-th feature, max_i and
min_i bound the feature's value range, and W_i is the weight for each feature; the stored
case with the smallest distance to the current problem is retrieved as the most similar.
For autonomic power and performance management, we integrate the case-based
reasoning technique with the AppFlow approach, where we view each AppFlow as a
case. Hence, we can use case-based reasoning techniques to find the best AppFlow type
(case) that optimizes both power and performance.
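A minimal sketch of such retrieval, treating each AppFlow as a case and using a weighted nearest-neighbor comparison, might look as follows; the feature weights and case library contents are fabricated for illustration.

# Weighted nearest-neighbor case retrieval over AppFlow feature vectors (illustrative).

WEIGHTS = {"cpu_util": 0.5, "mem_util": 0.3, "net_tx": 0.2}   # assumed feature weights

CASE_LIBRARY = {                                              # case -> known best plan
    "browse_500":   ({"cpu_util": 0.4, "mem_util": 0.3, "net_tx": 0.2}, "plan_A"),
    "default_1500": ({"cpu_util": 0.9, "mem_util": 0.7, "net_tx": 0.6}, "plan_B"),
}

def distance(x, y):
    """Weighted sum of absolute feature differences (smaller = more similar)."""
    return sum(w * abs(x[f] - y[f]) for f, w in WEIGHTS.items())

def retrieve(current):
    """Retrieve the stored case closest to the current AppFlow and reuse its plan."""
    name, (case, plan) = min(CASE_LIBRARY.items(),
                             key=lambda kv: distance(current, kv[1][0]))
    return name, plan

print(retrieve({"cpu_util": 0.85, "mem_util": 0.65, "net_tx": 0.5}))  # -> default_1500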
C. Hierarchical Autonomic Management Framework: Performance-Power Management
The framework essentially consists of a Managed System (MS) and an Autonomic
Performance-Power Manager (APPM). The cloud system or datacenter is the MS, which
can logically be organized into three distinguishable hierarchies, as shown in Figure 10:
(i) the Cluster-wide Autonomic Manager (CAM), with visibility over the whole cloud
system; (ii) the Server-wide Autonomic Manager (SAM), with visibility at the platform
level; and (iii) the Device-Level Autonomic Manager (DeAM), with local device
visibility only. The MS, at any hierarchy level, is modeled in terms of a set of states and
transitions. Each state is characterized by a power consumption value and a performance
value derived from the spec ratings of the device. A transition steers the MS from one
state to another. The APPM is tasked with enabling the MS to always transition to a state
where power consumption is minimal without violating performance constraints. Our
proposed autonomic management approach achieves this goal for the MS at each level of
the hierarchy.
Figure 10. Hierarchical Power Management
1) Autonomic Performance-Power Manager (APPM)
As cloud systems and datacenters are usually provisioned for peak performance
requirements, resulting in low resource utilization and unnecessarily high power
consumption, there is considerable room to optimize performance/watt in cloud systems.
Any proposed technique needs to be adaptive and proactive, taking into consideration
workload changes, their resource requirements, and QoS. We propose to formulate this
problem as a multi-variable constrained optimization problem. Our approach is to
develop a mathematically rigorous theoretical framework to obtain real-time optimal or
near-optimal strategies to optimize the performance/watt of workloads in datacenters.
Policy Optimization for Performance-Power Management: At the heart of APPM is a
constrained optimization problem to minimize power while meeting all performance
goals for an MS. In addition to sensing and characterizing the current MS state, APPM
also characterizes workloads and predicts their behavior for the next time window. It
uses this knowledge to determine the optimal or near-optimal power state for the MS
such that power consumption is minimal while the MS can still service the expected
workload for the next time window without sacrificing any performance, as explained in
detail below.
Hierarchical Performance-Power Management: We propose to solve the optimization
problem at each hierarchy level of a datacenter. Solutions at the upper levels of the
hierarchy are more coarse-grained but take into consideration the global view of the
cloud system. As we go down the hierarchy, solutions become finer-grained and take
into consideration a narrower view of both the system and the time window. For
example, at the cloud level, the solution takes into consideration all the servers that are
part of the datacenter, whereas at the device level, the solution considers only a single
device within a single server in the datacenter. Such a hierarchical solution necessitates
introducing an APPM at each level of the hierarchy within the framework. The decisions
from high-level APPMs trickle down to lower-level APPMs in the hierarchy, which
further refine the decisions until they reach the lowest (device) level in the hierarchy.
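Schematically, the trickle-down refinement can be pictured as each level solving its own bounded problem and passing the tightened bound downwards, as in the following sketch (the class names, the 0.9 refinement factor, and the power budget are purely illustrative):

# Schematic of hierarchical APPM decision refinement (cluster -> server -> device).

class APPM:
    def __init__(self, name, children=()):
        self.name, self.children = name, children

    def decide(self, bound):
        # Each level tightens the power bound received from the level above
        # and passes the refined decision to the level below.
        local_bound = bound * 0.9          # stand-in for solving a local optimization
        print(f"{self.name}: power bound {local_bound:.1f} W")
        for child in self.children:
            child.decide(local_bound / max(len(self.children), 1))

cluster = APPM("C-APPM", [APPM("S-APPM-1", [APPM("D-APPM-cpu"), APPM("D-APPM-mem")]),
                          APPM("S-APPM-2")])
cluster.decide(1000.0)                     # cluster-wide power budget in watts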
2) Cluster-Level APPM (C-APPM)
This APPM monitors the workload and the state of the MS through its sensor ports. It
uses this information to analyze and plan an optimal state (optimal is used loosely to
refer to an appropriate or near-optimal state that optimizes performance/watt).
Essentially, the cluster-level APPM solves the optimization problem to determine the
appropriate state for the MS. It executes the decision by utilizing the effector ports of the
MS.
3) Server-Level APPM (S-APPM)
This APPM functions exactly like the cluster-level APPM, but with narrower,
platform-level visibility. Once the cluster-level APPM arrives at a decision for the MS,
the decision percolates to each server-level APPM responsible for the performance-power
management of a single server in the datacenter. Each server-level APPM independently
maps the global decision onto its specific server's conditions. Essentially, it solves a
server-level optimization problem to optimize its local performance-power within the
bounds of the global decision.
4) Device-Level APPM (D-APPM)
This APPM is responsible for performance-power management at the lowest level in
the hierarchy, comprising individual devices such as the processor, memory, network
interface card, or disk within a server. It functions exactly like the server-level APPM by
mapping server-level decisions to the device level.
D. Hierarchical Autonomic Management Framework in Cloud Environment
In what follows, we describe how to apply the hierarchy discussed in the previous
section to a cloud environment. We first describe the Server Autonomic Manager
architecture and then the Cloud Autonomic Management architecture. In addition, we
show how to apply the AppFlow concept to model the current state of the managed
system and predict its behavior.
1) Server Autonomic Manager
Figure 11 shows the server-level autonomic power and performance management
(S-APPM) for a virtualized cloud server. Virtual Machines (VM1-VM3) deployed on the
same physical machine share the physical machine's resources in an isolated manner
through the mediation of a hypervisor. The Managed System (MS) in this case consists of
the physical machine, the set of VMs, and the workloads running on the VMs. To
characterize the current operational spatial features and predict their values in the near
future, we use statistical modeling and machine learning techniques. The results of this
runtime analysis are used by the S-APPM to scale up/down the server resources allocated
to the VMs running the current workloads on that server.
Figure 11. Server-level autonomic management architecture.
Next, we describe the key elements of the AppFlow associated with the server that
will be used by the server autonomic manager to control the server resources in order to
optimize both performance and power.
2) AppFlow: A Data Structure for Autonomic Management
An autonomic manager derives its awareness of the environment from accurate and
efficient monitoring of the relevant features that clearly capture the impact of the
workload on the cloud resources (e.g., a server, a memory rank, or a core in a multicore
system). To effectively manage the resources, the S-APPM must continuously determine
whether the operational state of the server resources meets all the performance and
energy requirements. The S-APPM utilizes the AppFlow data structure as the model to
optimize the operation of the cloud server.
The AppFlow spatial attributes include a wide range of metrics, as shown in Figure 12,
that can be used to quantify the behavior of the hardware, software, and network
resources used to run cloud applications or services. We categorize the AppFlow
attributes into three main types: Software, Hardware, and Network. Hardware attributes
capture the hardware-related information of the system, such as the allocated number of
cores and their frequency, as well as the allocated memory amount. Additional
information such as processor and memory characteristics as well as power consumption
can be added to this type. Software attributes consider the characteristics of the
workloads running on the systems, such as CPU utilization, memory utilization, disk and
memory read/write operations, etc. The network attributes include bandwidth, received
and sent packets, the number of connections, etc.
Figure 12. Application Flow (AppFlow) attributes for a server.
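These three attribute categories map naturally onto a simple record type; the fields in the sketch below paraphrase the attributes named in the text and are not an exhaustive schema.

# Illustrative record of AppFlow spatial attributes grouped into the three categories.
from dataclasses import dataclass

@dataclass
class HardwareAttrs:
    cores: int            # allocated number of cores
    freq_mhz: float       # core frequency
    memory_mb: int        # allocated memory

@dataclass
class SoftwareAttrs:
    cpu_util: float       # CPU utilization of the workload
    mem_util: float       # memory utilization
    disk_rw_ops: int      # disk read/write operations

@dataclass
class NetworkAttrs:
    bandwidth_mbps: float
    pkts_rx: int          # received packets
    pkts_tx: int          # sent packets

@dataclass
class AppFlowSample:
    hw: HardwareAttrs
    sw: SoftwareAttrs
    net: NetworkAttrs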
One novel feature of AppFlow is the ability to model and predict the behavior of a
cloud application using the AppFlow attributes, as shown in Figure 13. The trajectory of
the application behavior is shown as a sequence of red points, each representing the
operating point of the application at an instant of time.
Figure 13. Using AppFlow to model and predict the behavior of an application.
In this example, we show the AppFlow in three dimensions: time, Operational Spatial
Features (OSF), and the objective function. Our objective function corresponds to a
multidimensional optimization problem over performance and power. Using multiple
attributes, such as core count, core frequency, and memory, we define the operational
spatial features. The blue sub-cube indicates the application in a normal operating zone,
in which the cloud resources allocated to the application's VMs can maintain the
performance requirements described in the Service Level Agreement (SLA) with low
power/energy consumption. If the current behavior is not in the blue sub-cube, the cloud
resources assigned to the application are operating in an inefficient zone with respect to
the power and performance attributes. In this case, the allocated resources (i.e., the OSF)
need to be adapted to bring the application behavior back into the blue sub-cube in order
to meet the application's SLA requirements and reduce energy consumption by scaling
up/down the system resources.
3) Autonomic Cloud Resource Management (ACRM) Framework
The Autonomic Cloud Resource Management (ACRM) architecture shown in Figure
14 continuously applies the following four functions: Workload Analysis, Workload
Characterization into AppFlow types, Plan Selection, and Plan Execution. The
applications running on a cloud system can include bidding applications, social
networking, scientific computations, streaming, video processing, etc. When they are
combined (in terms of the types of applications and the number of users), multiple
workloads are created. In the workload analysis, the monitoring agents collect relevant
information about the current status of the VMs (the allocated resources and their
utilizations) running the current applications. Next, this information is passed to the
AppFlow-based application characterization. This module approximates each workload
by one or more of the predetermined applications' AppFlow types, whose behavior and
required optimal configuration (i.e., the configuration that optimizes both performance
and power/energy consumption) are studied a priori. In this module, we map the
workloads to AppFlows because we need to know the currently executing workload in
order to configure the systems with the right amount of resources. Hence, we first need
to be able to determine the current system behavior. However, the existing workloads can
be nondeterministic since they are a mix of the possible applications. Therefore, AppFlow
mapping is a required step to identify the current system behavior. In the Plan Selection,
data mining techniques and a comprehensive analysis of VM configurations are carried
out offline to determine the ideal VM configuration (e.g., number of cores, CPU
frequency, VM memory, etc.) that will meet the SLA for the application while
simultaneously minimizing energy consumption. ACRM then reconfigures the cloud
resources and their VMs according to the selected plan.
It should be noted that the autonomic cloud resource management architecture used in
this work contains a closed loop which classifies the workloads and approximates them
to known AppFlow types that have been studied a priori. By comparing the currently
running AppFlow with the previously studied AppFlow types, using AppFlow-based
reasoning, the plans that optimize the performance and power constraints are selected and
executed to scale up/down the virtual resources allocated to each VM in the cloud
environment.
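The sketch below strings together stubs for the four stages of this closed loop; all function names, AppFlow labels, thresholds, and configurations are fabricated for illustration.

# Illustrative ACRM closed loop: monitor -> characterize -> select plan -> execute.
import time

def monitor_vms():                       # collect current VM status (stub)
    return {"cpu_util": 0.82, "mem_util": 0.6}

def characterize(sample):                # map workload to a known AppFlow type (stub)
    return "rubis_default_1500" if sample["cpu_util"] > 0.7 else "rubis_browse_100"

PLAN_LIBRARY = {                         # offline-derived ideal VM configurations
    "rubis_browse_100":   {"cores": 1, "freq_mhz": 1560, "mem_mb": 512},
    "rubis_default_1500": {"cores": 4, "freq_mhz": 2660, "mem_mb": 3072},
}

def execute(plan):                       # reconfigure VMs via the hypervisor (stub)
    print(f"reconfiguring VMs: {plan}")

while True:
    execute(PLAN_LIBRARY[characterize(monitor_vms())])
    time.sleep(180)                      # e.g., re-evaluate every 3 minutes
    break                                # single pass for this sketch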
In what follows, we describe the main functions of each module in the ACRM
architecture.
Figure 14. Autonomic cloud management framework.
4) Monitoring
We have developed a lightweight monitoring tool in Perl in order to monitor the
system behavior from different perspectives, as shown in Figure 15. In our systems, we
use the Xen hypervisor, a state-of-the-art type 1 hypervisor that provides dynamic scaling
of the VM resources at runtime.
Figure 15. Monitoring tool created for the Autonomic Cloud Management
Using our monitoring tool, we profile the VM resource usage as well as the resource
usage of the VM processes. For this purpose, we check the utilization values of the
virtual resources for each VM, such as the virtually assigned number of cores, the
amount of virtual memory, their utilization values, and the number of I/O and network
operations (read/write amounts and number of packets).
We also monitor the physical host system behavior and the VM behavior from the
hypervisor perspective (for this purpose we parse xentop [106]). From these layers, the
monitoring tool is capable of collecting data for similar attributes, such as CPU (number
of cores, CPU frequency, CPU utilization) and memory (allocated memory, memory
utilization), for the individual VMs and also the physical host system.
After the system resource information is observed, the data is stored in a PostgreSQL
database [107] with timestamps and types so that the monitored data can be retrieved for
both training and Plan Execution purposes.
In addition to the system resource utilization ratios, the power consumption of each
blade is monitored at runtime (using the Advanced Management Module, AMM, a
management module integrated into the IBM HS22 blade server [108] used for the
testbed), and the database entries are updated every minute during the experiment.
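The actual tool is a Perl script that parses xentop; purely as an illustration of the store-with-timestamp idea, the Python fragment below writes samples into PostgreSQL via psycopg2, with an assumed table layout and connection parameters.

# Sketch of storing monitored samples with timestamps (assumed table layout);
# the dissertation's actual tool is a Perl script parsing xentop output.
import datetime
import psycopg2

conn = psycopg2.connect(dbname="monitoring", user="monitor")  # assumed credentials
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS samples (
                   ts TIMESTAMP, vm TEXT, metric TEXT, value DOUBLE PRECISION)""")

def store(vm, metric, value):
    cur.execute("INSERT INTO samples VALUES (%s, %s, %s, %s)",
                (datetime.datetime.now(), vm, metric, value))
    conn.commit()

store("vm1", "cpu_util", 0.73)   # one monitored data point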
5) AppFlow Analysis and Characterization
Applying the right amount of resources to a VM during its runtime to meet
application requirements requires a good understanding of the application behavior and
requirements, especially when performance and power are the major constraints.
Therefore, one first needs to analyze and classify the different application behaviors and
their requirements. To better explain our approach and its feasibility, we will use the
RUBiS benchmark as a running example [109]. RUBiS is an auction site model based on
eBay.com that allows for evaluating application design patterns as well as the
performance and scalability of servers [27]. It can be configured to emulate different
types of auction behaviors, such as browsing, bidding, etc., with different numbers of
clients.
In order to characterize the application behavior into the proper AppFlow types, we
need to identify all the different classes of RUBiS applications and their resource
utilizations. Each AppFlow type can be used to model one of the following RUBiS
behavior types: RUBiS Behavior A: Browsing (where only browsing, viewing, or
reading-comments operations occur); RUBiS Behavior B: Default behavior (where both
browsing and buying operations are combined); etc. In addition to the different behavior
types, the number of clients involved in each RUBiS behavior type must also be taken
into consideration. In our preliminary analysis, we consider three cases for client
configurations: a low number of clients (100 clients), a medium number of clients (500
clients), and a large number of clients (1500 clients).
Figures 16, 17, 18, and 19 show the CPU and memory utilizations for different RUBiS
application behaviors and how they can be identified. The figures show the browsing and
default-transition RUBiS behaviors with different numbers of clients (100, 500, and
1500) for a database VM (DB VM) with 1 core at 2660 MHz and 1 GB of memory. As
the number of clients increases, the number of operations increases accordingly, and the
utilization rises for both CPU and memory. Utilizations for both browsing and default
operations exhibit similar trends, but default transactions (where both browsing and
buying-like operations occur) require more resources than browse-only transactions.
Figure 16. The RUBiS browse-only operation memory utilization
Figure 17. The RUBiS default operation memory utilization
Figure 18. The CPU utilization for different number of clients under RUBiS browsing operation
Figure 19. The CPU utilization for different number of clients under RUBiS default operation
In order to accurately identify the RUBiS behavior at runtime, we use data mining
techniques and characterize the RUBiS behavior types based on the information retrieved
by the monitoring tool. This helps us define each application type in terms of processor
metrics (number of cores, CPU frequency, CPU utilization, runtime power consumption),
memory metrics (allocated memory, memory read/write request rates, miss ratio, memory
utilization), network metrics (number of transmitted and received packets), and storage
metrics (latency, coding/storage overhead, read/write requests and amounts). For
example, an AppFlow type can be characterized by a feature tuple of the form

AppFlow = (alloc, usage, req, diff, avg)

where alloc denotes the allocated resources for CPU (i.e., cores) and memory, usage
denotes the resource utilization, and req denotes the amount of requests. In addition, diff
represents the differential between the current data point and the previous data point,
introduced as additional data, and avg is the average of the last 5 data points, as
described before.
Using the application types and different numbers of clients, different application
behaviors can be created to model the AppFlow types. Specifically, for the RUBiS
application, we have chosen the default and browse-only bidding operations with a low
number of clients (100 clients), a medium number of clients (500 clients), and a large
number of clients (1500 clients), where each client number can be distinguished based on
CPU and memory utilization and the other AppFlow feature parameters.
Prior to the characterization of the AppFlow types, a post-processing and analysis
step on the monitored data is required for the following reasons: (1) a large amount of
information can result in slow identification, hence identical data points retrieved during
the life cycle of the applications should be discarded; (2) the characterization of each
application metric can be enhanced by introducing additional measurements or metrics,
such as the mean and the increase/decrease in the metric value; (3) a normalization and
quantization method can be used to reduce noise and to improve the quality of the
characterization; and (4) similar application behaviors can result in misidentification or
under-allocation of resources, and hence performance degradation, so a decision method
for similar application behaviors is needed.
To address these problems, we have applied the following post-processing approach
to the observed data values: (1) normalize the raw monitored data using the maximum
and minimum value ranges and quantize the data in 0.1 intervals between 0 and 1; (2)
increase the amount of analyzed data; in this step we have added more metrics, such as
the difference with respect to the previous value, the average of the last five data points,
and the ratios of CPU utilization to memory utilization, disk write operations to disk read
operations, and transmitted packets to received packets; and (3) repeat the analysis of the
application behavior after adding the new data metrics.
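A condensed Python version of this post-processing (normalization to [0, 1], quantization in 0.1 steps, and the derived diff and five-point average metrics) is sketched below; the raw values and value range are fabricated.

# Post-processing sketch: normalize to [0,1], quantize in 0.1 steps, add diff/avg metrics.

def normalize(raw, lo, hi):
    return (raw - lo) / (hi - lo) if hi > lo else 0.0

def quantize(x, step=0.1):
    return round(round(x / step) * step, 1)

def enrich(series):
    """Add the differential to the previous point and the mean of the last 5 points."""
    out = []
    for i, v in enumerate(series):
        diff = v - series[i - 1] if i > 0 else 0.0
        window = series[max(0, i - 4):i + 1]
        out.append({"value": v, "diff": round(diff, 3),
                    "avg": round(sum(window) / len(window), 3)})
    return out

raw = [312.0, 355.0, 340.0, 498.0]                   # e.g., raw samples of one metric
norm = [quantize(normalize(v, 0.0, 1000.0)) for v in raw]
print(enrich(norm))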
We experimented with different data mining techniques to characterize the current
application behavior at runtime. Data mining achieves semi-automatic pattern discovery,
association, and prediction for a given input, and can be broadly categorized into
association (identifying relations between the datasets), clustering (creating clusters of
objects with similar characteristics), and classification (assigning each data point to a
predefined class set). Among these methods, we selected classifier algorithms because of
their applicability to large datasets: classification algorithms provide the high accuracy,
speed, and scalability that large datasets demand. We experimented with multiple
algorithms such as J48, JRip, etc., and chose JRip (a Weka implementation of the
RIPPER algorithm) [110] for the training phase. JRip is a rule-learner data mining
algorithm based on Repeated Incremental Pruning to Produce Error Reduction (RIPPER),
proposed by William W. Cohen [111]. In this method, the training data is first split into a
growing set and a pruning set. An initial rule set is created over the growing set and then
repeatedly simplified using the pruning set, until further pruning would increase the error
rate.
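JRip ships with Weka rather than with common Python stacks; purely as a stand-in illustration of learning threshold rules like those in Figure 20 from monitored features, the sketch below trains a shallow decision tree with scikit-learn on fabricated rows.

# Stand-in for JRip: a shallow decision tree learns threshold rules over features.
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated training rows: [net_tx, disk_read, cpu_usage] -> client-count class.
X = [[0.2, 0.1, 0.3], [0.3, 0.2, 0.4], [0.5, 0.6, 0.9], [0.6, 0.7, 1.0]]
y = ["100", "500", "1500", "1500"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=["net_tx", "disk_read", "cpu_usage"]))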
Our data mining approach consists of two layers. In the first layer we characterize the
application type: "What type of application behavior is running, browse-only or
default?" In the second layer we characterize the number of active clients, which we
group into three classes: 100, 500, or 1500 clients. We used the data created during
post-processing to classify the AppFlow types. For example, Figure 20 shows the
AppFlow rules that can be used to identify the RUBiS browse behavior with 500 clients:
elsif(($net_tx <= 0.4) && ($net_tx_rx <= 0.2) && ($disk_write_req_avg <= 0.2)
&& ($net_tx >= 0.4) && ($disk_read >= 0.2) && ($disk_write_read >= 0.3) &&
($disk_read_req >= 0.2))
{ $Clients = "500"; }
elsif(($net_tx <= 0.4) && ($disk_read_avg >= 0.5) && ($net_tx_rx >= 0.3) &&
($disk_read >= 0.6) && ($net_tx >= 0.4) && ($net_tx_avg <= 0.4))
{ $Clients = "500"; }
elsif(($net_tx <= 0.4) && ($net_tx_rx <= 0.2) && ($disk_read_avg <= 0.2) &&
($net_rx_avg >= 0.5) && ($disk_read >= 0.2) && ($disk_write <= 0.1) &&
($disk_write_read >= 0.3) && ($disk_read_avg <= 0.1))
{ $Clients = "500"; }
elsif(($net_tx > 0.4) && ($net_tx_rx > 0.2) && ($disk_read_req_avg >= 0.2)
&& ($net_rx_avg >= 0.5) && ($cpu_usage >= 1) && ($cpu_avg <= 0.9))
{ $Clients = "1500"; }
Figure 20. An example of the rule set obtained from the data mining algorithm.
6) Plan Generation, Selection, and Execution
During an application's life cycle, the required system resources change dynamically
due to changes in its behavior (e.g., browsing behavior at the start, later changing to a
bidding mode of operation) or changes in the utilization of cloud resources. This dynamic
change in the behavior of the application or in the resource utilization may push the
application operating region outside the normal operating region, which triggers the need
to change the current execution environment used by the application. For example, when
the number of clients increases, the system performance degrades significantly. In such a
case, the current plan needs to be adapted so more resources can be allocated to maintain
performance while minimizing the power consumption. In this section, we show how to
create, select, and execute such plans that meet both performance and power consumption
requirements.
a) Plan Generation for AppFlow types
In order to reduce power consumption, one can assign minimum resources to the VMs
running an application. However, such an approach can result in performance degradation
and hence in not meeting the application's SLA requirements. The following steps can be
followed to determine the optimal or near-optimal system resource allocation in terms of
the number of cores, core frequency, and amount of memory allocated to each VM:
• The behavior of each application during its life cycle can be mapped into a
sequence of tasks T = {t_1, t_2, ..., t_n}, and each task should be characterized
independently. The following attributes are used to characterize the behavior of each
application task t_i:
• the application response time, R(t_i);
• the resource configuration, C(t_i), that it needs to complete the task; and
• the deadline for the task, D(t_i).
First, the application has to meet the deadline as the first constraint, such that
R(t_i) ≤ D(t_i) when the resource configuration C(t_i) is applied to the system. The
problem is then solved by finding the ideal resource configuration C*(t_i). The ideal
resource configuration is identified by solving the multi-objective power-performance
optimization problem that maximizes the "Performance-per-Watt (ppw)" while satisfying
the performance requirements. Taking performance as the inverse of the response time,
this problem can be formulated as follows:

$$\max \; ppw(t_i) = \frac{1}{R(t_i) \times P(t_i)} \quad \text{subject to} \quad R(t_i) \le D(t_i)$$

where ppw(t_i) is the Performance-per-Watt for the application task and P(t_i) denotes
the power consumed by the application. In this formulation, we maximize the ppw value
by minimizing the response time and minimizing the application's power consumption.
It should be noted that the response time of the application has to be less than or equal to
the application deadline requirements specified as QoS in the SLA; otherwise, the QoS
will be violated. Therefore, by applying the right amount of resources, just enough to
meet the application response time, we can reduce the power consumption of the
applications.
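Under this formulation, a quick numeric check in Python shows how the objective ranks candidate configurations (the response-time and power pairs are fabricated):

# Numeric check of the ppw objective: higher is better, deadline permitting.
def ppw(response_s, power_w):
    return 1.0 / (response_s * power_w)

candidates = [(1.5, 210.0), (2.5, 220.0)]          # fabricated (response, power) pairs
deadline = 2.0
feasible = [c for c in candidates if c[0] <= deadline]
best = max(feasible, key=lambda c: ppw(*c))
print(best, ppw(*best))                             # -> (1.5, 210.0) 0.0031746...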
In order to understand the performance and power behavior of the applications on the
VMs without needing to model the complex interactions of the resources, we have
applied multiple planning algorithms integrating multiple resource types and their
configurations. We varied the virtual memory configuration from 256 MB to 4 GB and
the virtual CPU configuration among 1, 2, and 4 cores. During the experiments, one
server was set to the highest core count and the highest memory (4 cores and 4096 MB
RAM) in order to achieve the highest performance, while the resources allocated to the
other servers were changed.
Initially, we developed an algorithm that finds an acceptable sub-optimal
configuration for each AppFlow type in two stages. In the first stage, we focus on the
number of cores and the amount of memory; in the second stage, we further investigate
the possibilities by searching for the appropriate frequency for the chosen number of
cores and allocated memory. The algorithm is shown in Figure 21:
1: Input: VM V, Application A, CoreList, MemoryList, FrequencyList
2: Output: AcceptedConfiguration
3: # Part 1: Finding the core count and memory amount
4: HighestFreq ← Max(FrequencyList)
5: foreach corei in CoreList
6:   foreach memj in MemoryList
7:     SetConfiguration(V, corei, memj, HighestFreq)
8:     CoreMemListi,j(duration, power) ← RunApplication(A)
9:   end foreach
10: end foreach
11: MAX_POWER ← ∞
12: foreach configurationi in CoreMemList
13:   (duration, power) ← GetResults(configurationi)
14:   if (duration ≤ SLA_MAX_TIME and power < MAX_POWER)
15:     OptimalCoreMem ← configurationi
16:     MAX_POWER ← power
17: end foreach
18: # Part 2: Finding the right frequency
19: MAX_POWER ← ∞
20: foreach freqi in FrequencyList
21:   (core, mem) ← OptimalCoreMem
22:   SetConfiguration(V, core, mem, freqi)
23:   (duration, power) ← RunApplication(A)
24:   if (duration ≤ SLA_MAX_TIME and power < MAX_POWER)
25:     AcceptedConfiguration ← (core, mem, freqi)
26:     MAX_POWER ← power
27:   end if
28: end foreach
29: return AcceptedConfiguration
Figure 21. Algorithm 1 to find an acceptable sub-optimal configuration
However, this algorithm can be further improved by using an extended resource
configuration set to find the optimal resource configuration. In the improved algorithm
(shown in Figure 22), we search a larger configuration space, iterating over core counts,
memory amounts, and possible frequencies in a nested way. We use different core counts
(1, 2, and 4), different frequency settings for each VM (our processor provides 9 different
P-states, from 1560 MHz to 2660 MHz), and different memory configurations (from
256 MB to 4 GB).
1: Input: VM V, Application A, CoreList, MemoryList, FrequencyList
2: Output: Configuration with minimum power
3: foreach corei in CoreList C
4:   foreach memj in MemoryList M
5:     foreach freqk in FrequencyList F
6:       V.SetConfiguration(corei, memj, freqk)
7:       (duration, power) ← RunApplication(A on VMs)
8:       ResultList.add(<corei, memj, freqk, duration, power>)
9:     end foreach
10:   end foreach
11: end foreach
12: MAX_POWER ← ∞
13: conf ← Ø
14: foreach <corei, memj, freqk, duration, power> in ResultList
15:   if (duration ≤ SLA_MAX_TIME and power < MAX_POWER)
16:     conf ← <corei, memj, freqk>
17:     MAX_POWER ← power
18:   end if
19: end foreach
20: return conf
Figure 22. Algorithm 2 to search the large configuration set
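For reference, Algorithm 2's exhaustive search is only a few lines when expressed directly in Python; RunApplication is replaced here by a fabricated analytic stub, since actually deploying and timing the workload is outside the scope of a sketch.

# Direct Python rendering of Algorithm 2's exhaustive search (run_application is a stub).
import itertools

CORES, MEMS_MB = [1, 2, 4], [256, 512, 1024, 2048, 4096]
FREQS_MHZ = [1560, 1730, 1860, 2000, 2130, 2260, 2400, 2530, 2660]
SLA_MAX_TIME = 120.0                       # seconds, assumed SLA bound

def run_application(cores, mem, freq):
    """Stub: deploy the workload with this configuration, return (duration, power)."""
    duration = 300.0 / (cores * freq / 1000.0) + 50000.0 / mem  # fabricated model
    power = 150.0 + 10.0 * cores + 0.02 * freq + 0.005 * mem    # fabricated model
    return duration, power

best, best_power = None, float("inf")
for cores, mem, freq in itertools.product(CORES, MEMS_MB, FREQS_MHZ):
    duration, power = run_application(cores, mem, freq)
    if duration <= SLA_MAX_TIME and power < best_power:
        best, best_power = (cores, mem, freq), power
print(best, best_power)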
In summary, this section explained how to determine the VM configurations, in terms
of memory, number of cores, and core frequency, that will maximize performance or
meet the SLA requirements for all AppFlow types while minimizing power consumption.
In what follows, we show how to determine the VM configuration with respect to
memory and CPU frequency.
VM Memory Configuration Plan: At runtime, data is stored in DRAM in such a way
that large files are spread across multiple DRAM chips, so that the system accesses the
data concurrently using multiple threads and thus improves performance significantly.
However, while this layout increases access speed, it also brings another challenge when
trying to reduce the DRAM power consumption, because more DRAM chips are in use
and thus more power is consumed than necessary. In our previous work [28], we
presented an approach to reduce power consumption by shutting down idle memory
chips or bringing them to a low-power state. Nevertheless, because of the way data is
stored in DRAM, we cannot shut down specific memory chips as they are always active,
and the frequency of today's DRAM systems is usually kept static and cannot be changed
as easily as the CPU frequency. However, we can reduce the system's power
consumption by reducing the I/O transactions between the memory and the hard disk
drive. When there is a high transaction rate between the memory and the hard disk, the
system consumes much more power than normal, and the entire system's response time
also degrades because of the relatively slow hard disk response time. This typically
occurs when there is an insufficient amount of memory in the system. To determine the
optimal amount of memory for a given workload, we allocate the maximum of 4 GB of
DRAM to the VM running the workload and monitor the actual amount that is used. As
shown in Figure 23, when the VM is assigned insufficient memory, between 800-1800
MB, the I/O requests are high (30,000-40,000) and the power consumption is high
(220-224 Watts). However, when we allocate a sufficient amount of memory, around
3000 MB, the power consumption is reduced to 213 Watts and the I/O requests drop
almost to zero (as shown in Figure 23).
Figure 23. Power consumption for different memory configurations
VM CPU Configuration Plan: The CPU configuration includes the number of virtual
CPUs the VM should have as well as the frequency at which they should operate. Since
the Intel i7 920 CPU has 4 physical cores and can be recognized by the OS as 8 cores due
to Intel's Hyper-Threading technology, configuring a different frequency for each of the
8 cores would imply monitoring and reacting to the activity of every core allocated to the
VM, which would be very complex and would introduce significant overhead in the
resource scheduling algorithm. According to our initial experiments, it also did not give
an optimal power consumption result. Therefore, we use the same frequency for all of the
CPU cores allocated to the VM. After we determine the virtual CPU count for an
AppFlow type, we run the application for each of the 10 P-states of the Intel i7 CPU,
record the power consumption and response time, and compare them against the SLA.
The best frequency configuration is the one that meets the SLA requirement with
minimal power consumption. Figure 24 shows the 10 P-states of the Intel i7 CPU and its
power consumption while working at full load. The machine was plugged into a power
meter that collects the power consumption measurements.
Figure 24. Intel i7 920 CPU p-state and full load power consumption
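Stated compactly, the frequency selection described above chooses, among the available p-states, the lowest-power setting that still meets the SLA (our formalization of the procedure):

f* = argmin over f in {f1, ..., f10} of power(f), subject to responseTime(f) ≤ SLA_MAX_TIME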
b) Plan Selection
We create an appropriate resource allocation plan that optimizes both performance and power consumption for each AppFlow type, as discussed in [84]. At runtime, as the application behavior changes, the autonomic management triggers the control loop, which predicts the current execution environment. Hence, the question becomes: "What AppFlow type is running on the current system, and which configuration needs to be applied for the maximum ppw?" The best configuration plan for the closest matching AppFlow can then be obtained as

Plan = BestConfig(FindClosestAppFlow(workload))

using the same functions that appear in the plan execution algorithm (Figure 25).
In order to identify the currently running AppFlow, we use the rules obtained with JRip during the training phase to determine the closest known AppFlow type that matches the current application behavior. To improve the accuracy of our prediction of the current AppFlow type, we predict over a chosen time interval: for every data point in this interval, the closest AppFlow type is calculated, and at the end we select the AppFlow type matched by the majority of the data points evaluated during the interval.
After identifying the current AppFlow type, the next step is to find the best configuration. We used the Plan Generation step to create the plans that maximize ppw for each AppFlow and stored them as a library. When the AppFlow type is identified, the associated configuration is selected and the plan execution is triggered.
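A minimal sketch of the majority-vote identification, assuming a hypothetical classify_point() that applies the trained JRip rules to a single monitored data point:

from collections import Counter

def identify_appflow(data_points, classify_point):
    """Majority vote over an interval: classify each monitored data point
    with the trained rules, then return the most frequent AppFlow type."""
    votes = Counter(classify_point(p) for p in data_points)
    appflow_type, _ = votes.most_common(1)[0]
    return appflow_type

# Example with a stub classifier (hypothetical threshold rule):
points = [{"cpu": 80}, {"cpu": 78}, {"cpu": 20}]
flow = identify_appflow(points, lambda p: "browse" if p["cpu"] < 50 else "bid")
print(flow)  # -> "bid"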
c) Runtime Plan Execution
After determining the appropriate plan for each AppFlow type, the Server-APPM is used to execute the selected configuration plan at runtime.
Our plan execution approach is shown in Figure 25. First, the VMs to be managed are initialized with minimum resources (i.e., 1 core, 1596MHz, and 256MB memory) so that, as long as no application is running, the power consumption is kept at a minimum. This approach can be integrated with VM consolidation to reduce power consumption further (VMs without any assigned application can be consolidated onto the same host machines to reduce the number of active hosts), as long as the resource utilizations are less than the pre-defined resource thresholds. For example, if the CPU load of the VMs is higher than a pre-defined CPU threshold for a given time interval, it can be assumed that an application has started on the VMs, and the first plan is executed, re-configuring the VM resources using the smallest plan.
Then, the VMs are updated every T (e.g., a 3-minute interval is chosen for the RUBiS application) based on their current loads. First, the monitored application behavior is read from the database and the closest AppFlow type is determined by FindClosestAppFlow(workload); then the best configuration for that AppFlow, discovered during Plan Generation, is retrieved from the library by BestConfig(AppFlow_subcube). Next, the allocated virtual resources of the VMs are updated by executing the plan, ExecutePlan(VMs, Plan).
1: Input: VMs
2: Initialization(VMs)
3: while(true) // Wait if the VMs are in idle state
4:   start_time ← current_time
5:   wait 30 sec
6:   CPU_avg ← readDB(select CPU_usage between start_time and current_time)
7:   MEM_avg ← readDB(select MEM_usage between start_time and current_time)
8:   if(CPU_avg ≥ CPU_THRESHOLD && MEM_avg ≥ MEM_THRESHOLD)
9:     ExecutePlan(VMs, SmallestPlan)
10:    break
11:  end if
12: end while
13: while(true) // When there is a workload
14:  start_time ← current_time
15:  wait 3 min
16:  workload ← readDB(select * between start_time and current_time)
17:  AppFlow_subcube ← FindClosestAppFlow(workload)
18:  Plan ← BestConfig(AppFlow_subcube)
19:  ExecutePlan(VMs, Plan)
20: end while
Figure 25. Plan execution approach used in this work
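For concreteness, a Python rendering of Figure 25 under stated assumptions: read_db(), find_closest_appflow(), best_config(), and execute_plan() are hypothetical stand-ins for the database query, the JRip-based matching, the plan-library lookup, and the Xen reconfiguration calls, and the thresholds are illustrative.

import time

CPU_THRESHOLD, MEM_THRESHOLD = 20.0, 30.0  # percent; assumed values

def manage_vms(vms, read_db, find_closest_appflow, best_config, execute_plan):
    # Phase 1: stay in the minimal configuration until a workload appears.
    while True:
        start = time.time()
        time.sleep(30)
        cpu_avg = read_db("CPU_usage", start, time.time())
        mem_avg = read_db("MEM_usage", start, time.time())
        if cpu_avg >= CPU_THRESHOLD and mem_avg >= MEM_THRESHOLD:
            execute_plan(vms, "smallest_plan")  # an application has started
            break
    # Phase 2: every T (3 minutes for RUBiS), re-match the AppFlow and re-plan.
    while True:
        start = time.time()
        time.sleep(180)
        workload = read_db("*", start, time.time())
        execute_plan(vms, best_config(find_closest_appflow(workload)))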
CHAPTER V
EXPERIMENTAL RESULTS AND EVALUATION
A. Experimental Testbed
Our experimental testbed is built using an IBM Blade Server (HS22) with blades having 2 Intel Xeon X5650 processors (2.66GHz) and 24GB of DRAM. On each allocated blade, we installed Debian 6.0.6 ("squeeze") and used Xen 4.0 as the virtualization platform, since it allows dynamic resource allocation for VMs via the Xen tools (dynamic memory configuration, frequency, number of cores, core mapping, etc.).
For benchmarking we have chosen RUBiS, an auction site model based on eBay.com [109]. RUBiS allows evaluating bidding applications as well as the servers' performance and scalability. RUBiS can be configured to generate different types of behaviors, such as browsing or bidding, with different numbers of clients. Figure 26 shows the RUBiS architecture implemented in this study. The architecture runs a MySQL DB as the backend and an Apache Web server (running PHP) as the frontend, where the clients operate through the Apache Web server. To provide more realistic behavior, we have separated the RUBiS servers: on one of the physical machines, the RUBiS DB and Web servers are built as two different VMs (DB VM and Web Server VM), which are managed by the server-level autonomic manager. On the other two allocated blades, multiple client VMs that act as load generators have been configured such that each host machine has 10 client VMs (20 in total). Each client VM has one virtual CPU and 256MB of virtual memory.
Figure 26. Multi-tier RUBiS architecture
In order to control the experiments, another VM is used as the Management VM (MVM) on the second blade. The main duty of the MVM is to start and stop the experiments, control the monitoring tools, and manage the VMs.
B. Evaluation Methods
We evaluated the effectiveness of our approach by comparing it with three methods. First, we used a static method (a pessimistic approach) where the resources are not changed during the runtime of an application: the MS (managed VMs) resources are set to 4 cores and 4GB memory, with all the cores in the system set to the highest frequency. The second method is the adaptive frequency scaling method discussed in the Related Work (Chapter II); the algorithm is shown in Figure 27. Third, we compare with our previous method [10].
For the frequency scaling method, we set the adaptation interval to 1 minute, since the power information provided by our system is updated every minute, and the initial frequency level is set to the minimum (1596MHz). The overall algorithm of the frequency scaling method is shown in Figure 27. It should be noted that this method uses a power threshold to update the frequency levels of the cores assigned to the VMs (MSs); to define the threshold values, we used the average power consumption to calculate the power consumption reduction. For the initialization of our previous approach [10], we used 4 cores, 4GB memory, and the highest frequency (2660MHz) for the used cores, while setting the remaining cores to the minimum frequency. During the experiments, we observed that the cores unused by the VMs do not increase the power consumption, since they are set to idle by the Linux (Xen) scheduler. So, even though 4 cores are assigned, all idle cores are put into a state where they do not contribute to the overall power consumption.
1: Input: VMs, Frequency List, power threshold (Powerthreshold)
2: Initialization(VMs)
3: for every minute
4:   power ← readHostPowerUsage()
5:   if(power ≥ Powerthreshold and VMs.frequencyLevel > FreqMin)
6:     VMs.frequencyLevel ← VMs.frequencyLevel - 1
7:   elseif(power < Powerthreshold and VMs.frequencyLevel < FreqMax)
8:     VMs.frequencyLevel ← VMs.frequencyLevel + 1
9:   endif
10: end
Figure 27. Frequency scaling algorithm used for comparison
C. Experimental Results and Evaluation of the ARCM Algorithm
In this section, we present the experimental results comparing our method with the aforementioned methods in terms of execution time overhead and total MS power consumption reduction. First, we observed the base power consumption of our system (i.e., the power consumption when no application is running and all the cores are set to the minimum frequency) in order to calculate the total power consumption of the MS (VMs operating as virtual servers). The total VM power consumption in a physical machine is calculated as

P_VM = P_total − P_base

where P_VM represents the total VM power consumption and P_total represents the total power consumed by the system. The minimum base power, P_base, is measured as 122 Watts: the base power of the system without any application running on the VMs.
We used the RUBiS application in browse-only mode, which contains operations such as searching, browsing, and reading comments, and RUBiS with the default transition tables, which focuses on bidding, commenting, buying, etc.
The experimental results for the browse-only and default RUBiS behaviors are shown in Figure 28 and Figure 29, respectively, with different client counts (100, 500, and 1500).
Figure 28. The power consumption of browse-only RUBiS behavior with different clients.
Figure 29. The power consumption of default RUBiS behavior with different clients.
Our experimental results show that the power consumption of the static allocation strategy is the highest among all the methods. This is due to its pessimistic approach, where maximum resource configurations are used. The frequency scaling method reduces the power consumption compared to the static method; however, since it considers only frequency, the reduction in power consumption is limited. By applying our approach, which considers multi-resource management schemes, the power can be reduced further. When compared to the static resource allocation method, Algorithm 1 reduced power by 84%, and with the improved method (Algorithm 2) we were able to reduce power by 87%.
In addition, we also created a random RUBiS application behavior by randomly selecting different types of RUBiS operations. The experimental results for this case show similar behavior, where the power is further reduced without performance degradation, as shown in Figure 30.
Figure 30. The power consumption of random RUBiS behavior with different clients where no
prior training has been applied.
It should be noted that for the experiments shown in Figures 28, 29, and 30, when the client number is small (i.e., 100 clients), the power reductions of the different methods are close to each other. In this case, the resource requirements of the VMs are low and, hence, the Linux (Xen) scheduler sets the unused cores to idle states so that unnecessary power is not consumed. However, when the number of clients increases, the need for dynamic resource management increases, and that leads to significant reductions in power consumption.
Table 1 shows the relative power reduction of our approach when compared to the other methods. The relative power reduction (RPR) is computed as

RPR = (P_other − P_A2) / P_other × 100

where P_A2 is the average power consumption of Algorithm 2 and P_other is that of the method it is compared against. For example, if the static method consumed 200 Watts on average while Algorithm 2 consumed 25.7 Watts, the RPR would be (200 − 25.7)/200 × 100 = 87.15%. As shown in Table 1, the RPR with respect to static resource allocation reaches 87.15%, and 66.5% with respect to Algorithm 1.
Table 1. Comparing the performance of our approach with other methods.

                      RPR (%) of A2 with respect to
Type     Clients   Compared to Static   Compared to Freq.   Compared to Alg. 1
Browse   100       74.70                57.76               53.66
         500       87.15                72.17               66.50
         1500      81.67                54.83               52.96
Default  100       73.12                62.87               61.89
         500       86.03                71.44               62.37
         1500      67.28                30.53               20.39
Random   100       74.02                55.42               55.12
         500       69.30                39.40               14.70
         1500      67.26                34.02               14.23
Additionally, Table 2 shows the Relative Time Increase (RTI) in execution time when compared to the static resource allocation, which gives the best performance. The RTI is computed in a similar way to the RPR, as

RTI = (T_method − T_static) / T_static × 100

where T_static is the execution time under static allocation and T_method is the execution time of the compared method. The time overhead is calculated against the static approach since it uses the most resources and therefore yields the smallest execution time.
Table 2. Comparing the execution time overhead of our approach with other methods.

                      RTI (%) with respect to
Type     Clients   Freq     Alg. 1   Alg. 2
Browse   100       0.09     0.00     0.60
         500       0.08     1.01     3.27
         1500      0.16     1.80     2.95
Default  100       0.00     0.09     0.17
         500       -0.33    0.17     1.34
         1500      1.24     1.49     2.89
Random   100       0.00     0.26     0.68
         500       -0.17    0.59     0.67
         1500      0.33     1.89     1.81
Considering the execution time overhead, our approach has an overhead of up to 3.27% in the worst case while reducing the power consumption by 87.15%. The negative values for the frequency scaling method are due to small time variations in the benchmark (2 and 1 seconds out of 20 minutes, respectively). Hence, it is clear that our approach outperforms the other techniques.
Also, our experimental results show that the power reduction for the browsing operations is better than for the default workload behavior, since browsing does not create as high a CPU utilization as the default RUBiS transition table does. Therefore, by reducing the number of cores, the power consumption can be decreased further.
We also evaluated the performance of our algorithms when the number of clients changes during runtime according to a predefined Markov random process whose transition probabilities are given in Table 3. In this set of experiments, the RUBiS application starts with 100 clients; every 4 minutes, a new client number is chosen according to the transition probabilities shown in Table 3.
Table 4 shows the experimental results when the number of clients changes over time. In this set of experiments, different application behaviors are used for a one-hour duration. The results clearly show that our algorithms outperform all the other approaches.
Table 3. The transition probability of the client numbers

Current\Next   100   500   1500
100            0.1   0.3   0.6
500            0.1   0.3   0.6
1500           0.1   0.1   0.8
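To make the workload generation concrete, here is a small sketch that draws the next client count from the transition probabilities in Table 3; the 4-minute step and the one-hour duration come from the experiment description, while the variable names are illustrative.

import random

# Transition probabilities from Table 3: current count -> next count.
TRANSITIONS = {
    100:  [(100, 0.1), (500, 0.3), (1500, 0.6)],
    500:  [(100, 0.1), (500, 0.3), (1500, 0.6)],
    1500: [(100, 0.1), (500, 0.1), (1500, 0.8)],
}

def next_clients(current):
    """Sample the next client count according to the Markov chain."""
    counts, probs = zip(*TRANSITIONS[current])
    return random.choices(counts, weights=probs, k=1)[0]

# One-hour trace starting at 100 clients, with a new draw every 4 minutes:
clients, trace = 100, []
for _ in range(60 // 4):
    trace.append(clients)
    clients = next_clients(clients)
print(trace)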
Table 4. The power comparison when the client numbers change dynamically for 1 hour.

          RPR (%) of Algorithm 2 with respect to
Type      Compared to Static   Compared to Freq.   Compared to Alg. 1
Browse    85.23                77.12               57.01
Default   84.10                76.66               57.74
Random    83.17                63.80               51.17
D. Scalability Analysis
In this section we analyze the scalability of our algorithm when the number of resources and clients increases significantly. Since our approach is based on creating plans for every application/system, for large-scale systems it is possible to create a near-optimal plan for large-scale application scenarios. To evaluate the scalability of our algorithms, we built a larger testbed based on the IBM Blade Server (HS22), consisting of 7 blades instead of just 3: we scale the testbed from 3 to 7 blades and then use analytical techniques to extrapolate the performance of our algorithm when it is scaled up to a much larger number of systems. The architecture of the testbed is shown in Figure 31. On each allocated blade, we used Debian 6.0.6 ("squeeze") for the VMs on Xen 4, similar to the previous testbed. We allocated 10 Web Server VMs and 10 DB Server VMs.
In this large testbed, the clients are directly connected to the Web servers (for every Web server, 2 client VMs are allocated, creating the required number of emulated clients). For the DB servers, we applied master-slave MySQL replication and allocated one additional master DB. The RUBiS front end has been updated for the master-slave replication so that all write operations (e.g., insert, update) are sent to the DB master, whereas read operations (such as select) are sent to the DB slaves; this was verified using browsers and the netstat tool. On each physical machine we allocated two RUBiS systems (DB server and Web server), such that 5 physical blades are allocated for the VM servers. One additional VM is allocated on another physical blade for the master DB, and another physical blade is allocated for the 20 client VMs and the Cloud Autonomic Manager. In total, 7 blades of the IBM HS22 have been allocated for the large-scale testbed.
Figure 31. Large-scale testbed architecture for the scalability analysis
We applied the static resource allocation and frequency scaling methods to manage the VMs, in addition to our second algorithm. Figure 32 compares the performance of Algorithm 2, in terms of RPR, with static resource allocation, while Figure 33 compares our algorithm with the frequency scaling method. For example, with respect to static resource allocation, Algorithm 2 reduces power by 54% for the default behavior with 15000 clients.
Figure 32. The power RPR comparison with respect to static resource allocation.
Figure 33. The power RPR comparison with respect to frequency scaling for large testbed
E. Total Operational Cost Analysis
In this section, we present the total operational cost analysis of our approach. The total operational cost of a cloud system depends on the resources used, the total power, cooling, etc. Since in our approach the amount of resources allocated to user applications changes dynamically, the cost can be determined based on the average amount of resources allocated to user applications. This can be computed as shown below:

Cost = αc · Nc + αm · Nm

where Nc is the number of allocated cores, Nm is the allocated memory amount, αc is the unit cost for a core, and αm is the unit cost for 1B of memory. Figure 34 shows how the cost changes when the CPU and memory allocations change during the application execution. It is clear that by reducing the amount of resources consumed by an application, the operational cost for the cloud users is also reduced.
Figure 34. Memory and CPU usage vs cost
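As a small illustration of this cost model, the sketch below sums the per-interval cost over a resource trace; the unit costs (here per core-interval and per MB-interval) and the traces are hypothetical, not the values used in our testbed.

# Hypothetical unit costs: ALPHA_C per core-interval, ALPHA_M per MB-interval.
ALPHA_C, ALPHA_M = 0.05, 0.00001

def operational_cost(trace):
    """Sum alpha_c*N_c + alpha_m*N_m over a trace of (cores, mem_mb) samples,
    one sample per management interval."""
    return sum(ALPHA_C * cores + ALPHA_M * mem_mb for cores, mem_mb in trace)

# Static allocation (4 cores, 4096MB) vs. a dynamic trace over 4 intervals:
static  = [(4, 4096)] * 4
dynamic = [(1, 256), (2, 2048), (4, 4096), (1, 512)]
print(operational_cost(static), operational_cost(dynamic))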
CHAPTER VI
CONCLUSION AND FUTURE WORK DIRECTIONS
A. Conclusion
The power consumption of cloud systems and data centers has increased significantly with the rapid deployment of cloud applications and services, and it is projected to reach roughly 140 billion kilowatt-hours annually by 2020. This is equivalent to the annual output of 50 power plants, costing American businesses $13 billion per year in electricity bills and causing the emission of nearly 150 million metric tons of carbon pollution annually. Cloud systems and data centers are configured to handle peak annual traffic, so most of the time most of their servers are largely unused. In this dissertation, we focused on the development of innovative autonomic management algorithms for cloud system resources and applications, such that we can dynamically maintain their performance requirements at any time based on demand, while minimizing the power consumption.
We have developed an autonomic performance-per-watt resource management methodology based on AppFlow to analyze and characterize the current operational state of cloud applications and to predict their behavior in the near future, so that we can efficiently optimize power and performance simultaneously.
Our experimental results and evaluations were carried out on a private cloud system built on an IBM HS22 blade server with an eBay-like application (RUBiS). We compared our autonomic management algorithms with the frequency scaling approach discussed earlier and with the static allocation strategy. We have shown that the Relative Power Reduction (RPR) can reach 86% with less than 1% execution time overhead, and that in the worst case our approach performs as well as the frequency-based approach.
Our research contributions can be highlighted in the following points:
We have developed an innovative analytical approach to model and characterize the behavior and resource requirements of cloud applications based on their operation environment at any instant of time.
We have shown how the modeling and characterization approach can be used to predict application behavior so that we can proactively optimize its operation at runtime.
We have developed a hierarchical autonomic management framework for cloud systems and data centers that includes multiple levels of autonomic managers: the Cloud Level Autonomic Manager, the Server Level Autonomic Manager, and the Device Level Autonomic Manager.
We have developed multi-objective optimization algorithms (power and performance) based on data mining and statistical techniques to determine the near-optimal configuration of any cloud server such that performance and power consumption are both optimized at runtime.
We have provided a holistic optimization methodology for performance and power that takes into consideration the number of cores, the CPU frequency, and the memory amount.
We have developed an autonomic cloud resource management approach to support dynamic changes in VM configurations at runtime such that we can simultaneously optimize multiple objectives, including performance, power, availability, etc.
We have deployed a private cloud testbed based on an IBM blade server and the Xen virtualization environment to experiment with and evaluate the autonomic management algorithms and techniques developed in this dissertation to manage the cloud resources.
B. Future Research Directions
1) Adopt ARCM Methodology to Heterogeneous Cloud Systems
In this dissertation, we mainly focused on a multi-tier web application, namely RUBiS, in a homogeneous environment. We plan to further improve the approach so it can be applied in heterogeneous cloud environments. The challenge to be addressed is how to model and measure the power consumption of heterogeneous resources with little support from the environment. We will look into a combination of measurement and modeling techniques to model and predict the power consumption in heterogeneous environments.
2) Modeling and Predicting the Behavior of Different Types of Workloads
In this dissertation, we studied the performance of our algorithms with a complex eBay-like application, RUBiS. We plan to add more complex applications, such as financial, engineering, and modeling-and-simulation workloads, running simultaneously on the cloud system. The goal of our autonomic power and performance management is to allocate cloud resources to these heterogeneous applications such that each one of them meets its Service Level Objectives (SLOs), while minimizing the overall power consumption.
3) Multi-Objective Optimization (Performance, Power, Availability, Resilience)
In this dissertation, we focused on optimizing power and performance in cloud systems. We are currently considering extending the scope of our autonomic management framework to include, in addition to power and performance, availability and resilience properties. These objectives can contradict each other (e.g., increasing availability requires the use of redundant resources, which leads to an increase in power consumption), and we will develop trade-off techniques that can satisfy all the required objectives with minimal power consumption.
4) Modeling of the VM Power Consumption
Measuring the power consumption of servers can be accomplished using power meters such as 'Watts Up?' [76], connected between the power source and the server. However, this type of monitoring only allows users to obtain the total amount of power consumed by the server, which includes the base power of the server, the services running on top of it, and the VMs' power consumption. It does not give us the power consumption of individual VMs, which is important for achieving effective autonomic power and performance management of cloud systems. We are investigating techniques to model the power consumption of the VMs from the physical devices (a VM power meter). To model the VM power consumption, one should take into consideration multiple factors such as CPU usage, memory usage, disk operations, how the virtual cores are mapped to the physical cores, the scheduling algorithms, etc. One initial study that will be considered in our approach is the one developed in [49], which models the energy consumption of a system by adding the energy consumed by all its subsystems, as shown below:

E_total = E_cpu + E_mem + E_disk + E_static

where E_cpu, E_mem, E_disk, and E_static describe the energy consumption of the CPU, the memory, the disk operations, and the static power consumption of the idle system, respectively. The authors then simplify the model using normalized utilization values as

E = α_cpu · u_cpu + α_mem · u_mem + α_disk · u_disk + γ

where the α values are model-specific constants, the u values are the normalized resource utilizations, and γ is a constant required for the linear modeling. From this analysis, one can predict the energy consumption of a VM.
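A sketch of this linear model as we would apply it; the α coefficients, the γ term, and the utilization numbers below are placeholders rather than fitted values, since in practice the constants are trained per machine model.

# Linear power model in the spirit of [49]: E = a_cpu*u_cpu + a_mem*u_mem
# + a_disk*u_disk + gamma. Coefficients are placeholders; in practice they
# are fitted from training runs on the target hardware.
COEFFS = {"cpu": 45.0, "mem": 12.0, "disk": 8.0}
GAMMA = 122.0  # constant term; assumed close to the idle/base power here

def vm_power(u_cpu, u_mem, u_disk):
    """Estimate power (Watts) from normalized utilizations in [0, 1]."""
    return (COEFFS["cpu"] * u_cpu + COEFFS["mem"] * u_mem
            + COEFFS["disk"] * u_disk + GAMMA)

print(vm_power(0.6, 0.4, 0.1))  # -> 154.6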
5) Online VM Migration and Consolidation
Online VM migration and VM consolidation aim at reducing the number of active servers, and thus the power consumption, by migrating VMs away from underutilized servers so that those servers can be turned off. The key challenges, however, are maintaining the service response time and determining the optimal consolidation without over-utilizing the servers. Techniques such as bin-packing are highly suitable for VM consolidation. In our approach, the Cloud-Level Autonomic Manager can also address the consolidation and migration of VMs such that performance and power consumption are optimized.
REFERENCES
[1] Jonathan Koomey. Growth in Data Center Electricity Use 2005 to 2010. Analytics Press. Accessed January 2015 from http://www.analyticspress.com/datacenters.html
[2] The Natural Resources Defense Council (NRDC). "Data Center Efficiency Assessment, Scaling Up Energy Efficiency Across the Data Center Industry: Evaluating Key Drivers and Barriers." Issue Paper, August 2014.
[3] Bari, Md Faizul, Raouf Boutaba, Rafael Esteves, Lisandro Zambenedetti Granville, Maxim Podlesny, Md Golam Rabbani, Qi Zhang, and Mohamed Faten Zhani. "Data center network virtualization: A survey." Communications Surveys & Tutorials, IEEE 15, no. 2 (2013): 909-928.
[4] Bill Snyder. "Server Virtualization Has Stalled, Despite the Hype." InfoWorld online, December 31, 2010. www.infoworld.com/t/server-virtualization/what-you-missed-server-virtualization-has-stalled-despite-the-hype-901?source=footer
[5] Alex Benik. The Sorry State of Server Utilization and the Impending Post-Hypervisor Era. Gigaom, November 30, 2013. www.gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/
[6] J. Whitney and J. Kennedy. "The Carbon Emissions of Server Computing for Small-to Medium-sized Organizations." (2012).
[7] L.A. Barroso and Urs Hölzle. The Data Center as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd ed. Morgan & Claypool, 2013.
[8] Govindan, Sriram, Jeonghwan Choi, Bhuvan Urgaonkar, Anand Sivasubramaniam, and Andrea Baldini. "Statistical profiling-based techniques for effective power provisioning in data centers." In Proceedings of the 4th ACM European Conference on Computer Systems, pp. 317-330. ACM, 2009.
[9] Fargo, Farah, Cihan Tunc, Youssif Al-Nashif, Ali Akoglu, and Salim Hariri. "Autonomic Workload and Resources Management of Cloud Computing Services." In Cloud and Autonomic Computing (ICCAC), 2014 International Conference on, pp. 101-110. IEEE, 2014.
[10] Fargo, Farah, Cihan Tunc, Youssif Al-Nashif, and Salim Hariri. "Autonomic performance-per-watt management (APM) of cloud resources and services." In Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, p. 2. ACM, 2013.
[11] Yang, Chao-Tung, Kuan-Lung Huang, Jung-Chun Liu, Yi-Wei Su, and William Cheng-Chung Chu. "Implementation of a Power Saving Method for Virtual Machine Management in Cloud." In Cloud Computing and Big Data (CloudCom-Asia), 2013 International Conference on, pp. 283-290. IEEE, 2013.
[12] Zhang, Qi, and Raouf Boutaba. "Dynamic workload management in heterogeneous Cloud computing environments." In Network Operations and Management Symposium (NOMS), 2014 IEEE, pp. 1-7. IEEE, 2014.
[13] Satoh, Fumiko, Hiroki Yanagisawa, Hitomi Takahashi, and Takayuki Kushida. "Total Energy Management System for Cloud Computing." In Cloud Engineering (IC2E), 2013 IEEE International Conference on, pp. 233-240. IEEE, 2013.
[14] Royaee, Z., and M. Mohammadi. "Energy aware Virtual Machine Allocation Algorithm in Cloud network." In Smart Grid Conference (SGC), 2013, pp. 259-263, 17-18 Dec. 2013.
[15] Malviya, Pushtikant, Swapnamukta Agrawal, and Shailendra Singh. "An Effective Approach for Allocating VMs to Reduce the Power Consumption of Virtualized Cloud Environment." In Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, pp. 573-577. IEEE, 2014.
[16] Lim, Norman, Shikharesh Majumdar, and Peter Ashwood-Smith. "Resource management techniques for handling requests with service level agreements." In Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2014), International Symposium on, pp. 618-625. IEEE, 2014.
[17] Farahnakian, Fahimeh, Pasi Liljeberg, and Juha Plosila. "Energy-efficient virtual machines consolidation in cloud data centers using reinforcement learning." In Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pp. 500-507. IEEE, 2014.
[18] Sharma, Viney, and Gur Mauj Saran Srivastava. "Energy efficient architectural framework for virtual machine management in IaaS clouds." In Contemporary Computing (IC3), 2013 Sixth International Conference on, pp. 369-374. IEEE, 2013.
[19] Wang, Xiaorui, and Ming Chen. "Cluster-level feedback power control for performance optimization." In High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on, pp. 101-110. IEEE, 2008.
[20] Rountree, Barry, David K. Lowenthal, Martin Schulz, and Bronis R. de Supinski. "Practical performance prediction under dynamic voltage frequency scaling." In Green Computing Conference and Workshops (IGCC), 2011 International, pp. 1-8. IEEE, 2011.
[21] Wirtz, Thomas, and Rong Ge. "Improving mapreduce energy efficiency for computation intensive workloads." In Green Computing Conference and Workshops (IGCC), 2011 International, pp. 1-8. IEEE, 2011.
[22] Wang, Xiaorui, and Yefu Wang. "Coordinating power control and performance management for virtualized server clusters." Parallel and Distributed Systems, IEEE Transactions on 22, no. 2 (2011): 245-259.
[23] Cochran, Ryan, Can Hankendi, Ayse K. Coskun, and Sherief Reda. "Pack & Cap: adaptive DVFS and thread packing under power caps." In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 175-185. ACM, 2011.
[24] Lama, Palden, and Xiaobo Zhou. "PERFUME: Power and performance guarantee with fuzzy MIMO control in virtualized servers." In Quality of Service (IWQoS), 2011 IEEE 19th International Workshop on, pp. 1-9. IEEE, 2011.
[25] Kusic, Dara, Jeffrey O. Kephart, James E. Hanson, Nagarajan Kandasamy, and Guofei Jiang. "Power and performance management of virtualized computing environments via lookahead control." Cluster Computing 12, no. 1 (2009): 1-15.
[26] David, Howard, Chris Fallin, Eugene Gorbatov, Ulf R. Hanebutte, and Onur Mutlu. "Memory power management via dynamic voltage/frequency scaling." In Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 31-40. ACM, 2011.
[27] Kotera, Isao, Kenta Abe, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi. "Power-aware dynamic cache partitioning for CMPs." In Transactions on High-Performance Embedded Architectures and Compilers III, pp. 135-153. Springer Berlin Heidelberg, 2011.
[28] Khargharia, Bithika, Haoting Luo, Youssif Al-Nashif, and Salim Hariri. "AppFlow: Autonomic performance-per-watt management of large-scale data centers." In Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, pp. 103-111. IEEE Computer Society, 2010.
[29] Weddle, Charles, Mathew Oldham, Jin Qian, An-I Andy Wang, Peter Reiher, and Geoff Kuenning. "PARAID: A gear-shifting power-aware RAID." ACM Transactions on Storage (TOS) 3, no. 3 (2007): 13.
[30] Zhu, Qingbo, Zhifeng Chen, Lin Tan, Yuanyuan Zhou, Kimberly Keeton, and John Wilkes. "Hibernator: helping disk arrays sleep through the winter." In ACM SIGOPS Operating Systems Review, vol. 39, no. 5, pp. 177-190. ACM, 2005.
[31] Heller, Brandon, Srinivasan Seetharaman, Priya Mahadevan, Yiannis Yiakoumis, Puneet Sharma, Sujata Banerjee, and Nick McKeown. "ElasticTree: Saving Energy in Data Center Networks." In NSDI, vol. 10, pp. 249-264. 2010.
[32] Li, Zhichao, Kevin M. Greenan, Andrew W. Leung, and Erez Zadok. "Power consumption in enterprise-scale backup storage systems." Power 2 (2012): 2.
[33] Lin, Kai, Weiqin Tong, Xiaodong Liu, and Liping Zhang. "A self-adaptive mechanism for resource monitoring in cloud computing." (2013): 178-182.
[34] Ramezani, Fahimeh, Jie Lu, and Farookh Hussain. "An online fuzzy Decision Support System for Resource Management in cloud environments." In IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), 2013 Joint, pp. 754-759. IEEE, 2013.
[35] Dabbagh, Mehiar, Bechir Hamdaoui, Mohsen Guizani, and Ammar Rayes. "Energy-efficient cloud resource management." In Computer Communications Workshops (INFOCOM WKSHPS), 2014 IEEE Conference on, pp. 386-391. IEEE, 2014.
[36] Cao, Junwei, Keqin Li, and Ivan Stojmenovic. "Optimal power allocation and load distribution for multiple heterogeneous multicore server processors across clouds and data centers." Computers, IEEE Transactions on 63, no. 1 (2014): 45-58.
[37] Berral, Josep Lluis, Ricard Gavalda, and Jordi Torres. "Power-aware multi-data center management using machine learning." In Parallel Processing (ICPP), 2013 42nd International Conference on, pp. 858-867. IEEE, 2013.
[38] Williams, Dan, Hani Jamjoom, and Hakim Weatherspoon. "Plug into the Supercloud." Internet Computing, IEEE 17, no. 2 (2013): 28-34.
[39] Khazaei, Hamzeh, Jelena Misic, Vojislav B. Misic, and Saeed Rashwand. "Analysis of a pool management scheme for cloud computing centers." Parallel and Distributed Systems, IEEE Transactions on 24, no. 5 (2013): 849-861.
[40] Guzek, Mateusz, Dzmitry Kliazovich, and Pascal Bouvry. "A holistic model for resource representation in virtualized cloud computing data centers." In Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 1, pp. 590-598. IEEE, 2013.
[41] Solis Moreno, I., Peter Garraghan, Paul Townend, and Jie Xu. "Analysis, modeling and simulation of workload patterns in a large-scale utility cloud." (2014): 1-1.
[42] Shuo Liu, Shaolei Ren, Gang Quan, Ming Zhao, and Shangping Ren. "Profit Aware Load Balancing for Distributed Cloud Data Centers." In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 611-622, 20-24 May 2013.
[43] Ghamkhari, Mahdi, and Hamed Mohsenian-Rad. "Energy and performance management of green data centers: a profit maximization approach." Smart Grid, IEEE Transactions on 4, no. 2 (2013): 1017-1025.
[44] Hamilton, James. "Cooperative expendable micro-slice servers (CEMS): low cost, low power servers for internet-scale services." In Conference on Innovative Data Systems Research (CIDR'09) (January 2009). 2009.
[45] Gartner, Inc. Gartner estimates ICT industry accounts for 2 percent of global CO2 emissions. Gartner Press Release (April 2007).
[46] University of Michigan. "Computer Center PowerNap Plan Could Save 75 Percent Of Data Center Energy." ScienceDaily, 5 March 2009. <www.sciencedaily.com/releases/2009/03/090305164353.htm>
[47] E. Rubin, A. Rao, and C. Chen. "Comparative assessments of fossil fuel power plants with CO2 capture and storage." In Proceedings of the 7th International Conference on Greenhouse Gas Control Technologies, Vol. 1, 2005, pp. 285-294.
[48] Fan, Xiaobo, Wolf-Dietrich Weber, and Luiz Andre Barroso. "Power provisioning for a warehouse-sized computer." In ACM SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 13-23. ACM, 2007.
[49] Kansal, Aman, Feng Zhao, Jie Liu, Nupur Kothari, and Arka A. Bhattacharya. "Virtual machine power metering and provisioning." In Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 39-50. ACM, 2010.
[50] Meisner, David, Brian T. Gold, and Thomas F. Wenisch. "PowerNap: eliminating server idle power." ACM SIGARCH Computer Architecture News 37, no. 1 (2009): 205-216.
[51] Niles, Suzanne, and Patrick Donovan. "Virtualization and cloud computing: Optimized power, cooling, and management maximizes benefits." White paper 118 (2011).
[52] Luo, Haoting, Bithika Khargharia, Salim Hariri, and Youssif Al-Nashif. "Autonomic Green Computing in Large-Scale Data Centers." Energy-Efficient Distributed Computing Systems (2012): 271-299.
[53] Intel SpeedStep. Accessed January 2015 from http://www.intel.com/cd/channel/reseller/asmo-na/eng/203838.htm
[54] AMD PowerTune. Accessed January 2015 from http://www.amd.com/en-us/innovations/software-technologies/enduro
[55] Ge, Rong, Xizhou Feng, and Kirk W. Cameron. "Improvement of power-performance efficiency for high-end computing." In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp. IEEE, 2005.
[56] Tolentino, Matthew E., Joseph Turner, and Kirk W. Cameron. "Memory-miser: a performance-constrained runtime system for power-scalable clusters." In Proceedings of the 4th International Conference on Computing Frontiers, pp. 237-246. ACM, 2007.
[57] Bae, Chang S., Lei Xia, Peter Dinda, and John Lange. "Dynamic adaptive virtual core mapping to improve power, energy, and performance in multi-socket multicores." In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pp. 247-258. ACM, 2012.
[58] Kodama, Yuetsu, Satoshi Itoh, Toshiyuki Shimizu, Satoshi Sekiguchi, Hiroshi Nakamura, and Naohiko Mori. "Imbalance of CPU temperatures in a blade system and its impact for power consumption of fans." Cluster Computing 16, no. 1 (2013): 27-37.
[59] Von Laszewski, Gregor, Lizhe Wang, Andrew J. Younge, and Xi He. "Power-aware scheduling of virtual machines in DVFS-enabled clusters." In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pp. 1-10. IEEE, 2009.
[60] Goiri, Íñigo, Kien Le, Md E. Haque, Ryan Beauchea, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini. "GreenSlot: scheduling energy consumption in green datacenters." In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p. 20. ACM, 2011.
[61] Wang, Lizhe, Gregor Von Laszewski, Jai Dayal, and Fugang Wang. "Towards energy aware scheduling for precedence constrained parallel tasks in a cluster with DVFS." In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pp. 368-377. IEEE, 2010.
[62] Gandhi, Anshul, Yuan Chen, Daniel Gmach, Martin Arlitt, and Manish Marwah. "Minimizing data center SLA violations and power consumption via hybrid resource provisioning." In Green Computing Conference and Workshops (IGCC), 2011 International, pp. 1-8. IEEE, 2011.
[63] Calheiros, Rodrigo N., Rajiv Ranjan, and Rajkumar Buyya. "Virtual machine provisioning based on analytical performance and QoS in cloud computing environments." In Parallel Processing (ICPP), 2011 International Conference on, pp. 295-304. IEEE, 2011.
[64] Srikantaiah, Shekhar, Aman Kansal, and Feng Zhao. "Energy aware consolidation for cloud computing." In Proceedings of the 2008 Conference on Power Aware Computing and Systems, vol. 10. 2008.
[65] Younge, Andrew J., Gregor Von Laszewski, Lizhe Wang, Sonia Lopez-Alarcon, and Warren Carithers. "Efficient resource management for Cloud computing environments." In Green Computing Conference, pp. 357-364. 2010.
[66] Liu, Haikun, Hai Jin, Cheng-Zhong Xu, and Xiaofei Liao. "Performance and energy modeling for live migration of virtual machines." Cluster Computing 16, no. 2 (2013): 249-264.
[67] Beloglazov, Anton, and Rajkumar Buyya. "Energy efficient allocation of virtual machines in cloud data centers." In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pp. 577-578. IEEE, 2010.
[68] Cardosa, Michael, Madhukar R. Korupolu, and Aameek Singh. "Shares and utilities based power consolidation in virtualized server environments." In Integrated Network Management, 2009. IM'09. IFIP/IEEE International Symposium on, pp. 327-334. IEEE, 2009.
[69] C. Tsai, K. Shin, J. Reumann, and S. Singhal. "Online web cluster capacity estimation and its application to energy conservation." IEEE Trans. on Parallel and Dist. Sys., vol. 18, no. 7, pp. 932-945, Jul. 2007.
[70] L. Minas and B. Ellison. Energy Efficiency for Information Technology: How to Reduce Power Consumption in Servers and Data Centers. Intel Press, Aug. 2009.
[71] Song, Ying, Hui Wang, Yaqiong Li, Binquan Feng, and Yuzhong Sun. "Multi-tiered on-demand resource scheduling for VM-based data center." In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 148-155. IEEE Computer Society, 2009.
[72] Jung, Gueyoung, Matti A. Hiltunen, Kaustubh R. Joshi, Richard D. Schlichting, and Calton Pu. "Mistral: Dynamically managing power, performance, and adaptation cost in cloud infrastructures." In Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on, pp. 62-73. IEEE, 2010.
[73] Rahman, Mustafizur, Rajiv Ranjan, Rajkumar Buyya, and Boualem Benatallah. "A taxonomy and survey on autonomic management of applications in grid computing environments." Concurrency and Computation: Practice and Experience 23, no. 16 (2011): 1990-2019.
[74] Parashar, Manish, and Salim Hariri. "Autonomic computing: An overview." In Unconventional Programming Paradigms, pp. 257-269. Springer Berlin Heidelberg, 2005.
[75] S. Hariri, B. Khargharia, H. Chen, J. Yang, Y. Zhang, H. Liu, and M. Parashar. "The Autonomic Computing Paradigm." Journal of Cluster Computing: The Journal of Networks, Software Tools and Applications, Special Issue on Autonomic Computing, Vol. 9, No. 2, 2006, Springer-Verlag.
[76] Watts Up? Power meters. Accessed January 2015 from https://www.wattsupmeters.com/secure/products.php?pn=0
[77] P. Horn. Autonomic Computing: IBM's Perspective on the State of Information Technology. http://www.research.ibm.com/autonomic/, October 2001.
[78] IBM Corporation. Autonomic computing concepts. http://www-3.ibm.com/autonomic/library.shtml, 2001.
[79] IBM. "An architectural blueprint for autonomic computing." April 2003.
[80] Chen, Huoping. Self-configuration Framework for Networked Systems and Applications. ProQuest, 2008.
[81] Adaptive Systems. Accessed January 2015 from http://www.cogs.susx.ac.uk/users/ezequiel/AS/lectures
[82] Ashby, William Ross. Design for a Brain. Springer Science & Business Media, 1960.
[83] Dong, Xiangdong, Salim Hariri, Lizhi Xue, Huoping Chen, Ming Zhang, Sathija Pavuluri, and Soujana Rao. "Autonomia: an autonomic computing environment." In Performance, Computing, and Communications Conference, 2003. Conference Proceedings of the 2003 IEEE International, pp. 61-68. IEEE, 2003.
[84] Likhachev, Maxim, and Ronald C. Arkin. "Spatio-temporal case-based reasoning for behavioral selection." In Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, vol. 2, pp. 1627-1634. IEEE, 2001.
[85] S. Hariri, L. Xue, H. Chen, M. Zhang, S. Pavuluri, and S. Rao. "AUTONOMIA: an autonomic computing environment." In Conference Proceedings of the 2003 IEEE International Performance Computing and Communications Conference (IPCCC 2003).
[86] Bourcier, Johann, Ada Diaconescu, Philippe Lalanda, and Julie A. McCann. "AutoHome: An autonomic management framework for pervasive home applications." ACM Transactions on Autonomous and Adaptive Systems (TAAS) 6, no. 1 (2011): 8.
[87] Chen, Huoping, Youssif B. Al-Nashif, Guangzhi Qu, and Salim Hariri. "Self-configuration of network security." In Enterprise Distributed Object Computing Conference, 2007. EDOC 2007. 11th IEEE International, pp. 97-97. IEEE, 2007.
[88] Sibai, Faisal M., and D. Menasce. "Defeating the insider threat via autonomic network capabilities." In Communication Systems and Networks (COMSNETS), 2011 Third International Conference on, pp. 1-10. IEEE, 2011.
[89] Caton, Simon, and Omer Rana. "Towards autonomic management for cloud services based upon volunteered resources." Concurrency and Computation: Practice and Experience 24, no. 9 (2012): 992-1014.
[90] Tomás, Luis, Agustín C. Caminero, Omer Rana, Carmen Carrión, and Blanca Caminero. "A GridWay-based autonomic network-aware metascheduler." Future Generation Computer Systems 28, no. 7 (2012): 1058-1069.
[91] Lorimer, Trevor, and Roy Sterritt. "Autonomic management of cloud neighborhoods through pulse monitoring." In Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing, pp. 295-302. IEEE Computer Society, 2012.
[92] Solomon, Bogdan, Dan Ionescu, Marin Litoiu, and Gabriel Iszlai. "Self-organizing autonomic computing systems." In Logistics and Industrial Informatics (LINDI), 2011 3rd IEEE International Symposium on, pp. 99-104. IEEE, 2011.
[93] Casalicchio, Emiliano, Daniel A. Menascé, and Arwa Aldhalaan. "Autonomic resource provisioning in cloud systems with availability goals." In Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, p. 1. ACM, 2013.
[94] Ramirez, Andres J., David B. Knoester, Betty H.C. Cheng, and Philip K. McKinley. "Plato: a genetic algorithm approach to run-time reconfiguration in autonomic computing systems." Cluster Computing 14, no. 3 (2011): 229-244.
[95] Bodík, Peter, Rean Griffith, Charles Sutton, Armando Fox, Michael Jordan, and David Patterson. "Statistical machine learning makes automatic control practical for internet datacenters." In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, pp. 12-12. 2009.
[96] de Oliveira Jr., Frederico Alvares, Remi Sharrock, and Thomas Ledoux. "Synchronization of multiple autonomic control loops: Application to cloud computing." In Coordination Models and Languages, pp. 29-43. Springer Berlin Heidelberg, 2012.
[97] Anthony, Richard, Mariusz Pelc, and Haffiz Shuaib. "The interoperability challenge for autonomic computing." In EMERGING 2011, The Third International Conference on Emerging Network Intelligence, pp. 13-19. 2011.
[98] American Power Consortium Document Number CMRP5T9PQG. Accessed January 2015 from ftp://www.apcmedia.com/salestools/CMRP5T9PQG_R2_EN.pdf
[99] Operating Systems and Architectural Techniques for Power and Energy Conservation. Accessed January 2015 from http://www.cs.rutgers.edu/~ricardob/power.html
[100] Huang, Hai, Kang G. Shin, Charles Lefurgy, Karthick Rajamani, Tom Keller, Eric V. Hensbergen, and Freeman Rawson. "Cooperative software-hardware power management for main memory." In Workshop on Power-Aware Computer Systems. 2004.
[101] Hariri, S., and M. Parashar. "The foundations of autonomic computing." A. Zomaya, CHAPMN (2005).
[102] Qu, Guangzhi, Salim Hariri, Santosh Jangiti, Suhail Hussain, Seungchan Oh, Samer Fayssal, and Mazin Yousif. "Abnormality metrics to detect and protect against network attacks." In Pervasive Services, 2004. ICPS 2004. IEEE/ACS International Conference on, pp. 105-111. IEEE, 2004.
[103] Chen, Huoping, Byoung Uk Kim, Jingmei Yang, Salim Hariri, and Manish Parashar. "Autonomic Runtime System for Large Scale Parallel and Distributed Applications." Unconventional Programming Paradigms (2004): 164.
[104] Aamodt, Agnar, and Enric Plaza. "Case-based reasoning: Foundational issues, methodological variations, and system approaches." AI Communications 7, no. 1 (1994): 39-59.
[105] Xu, Jing. "Autonomic application and resource management in virtualized distributed computing systems." PhD diss., University of Florida, 2011.
[106] Xen Hypervisor. Accessed January 2015 from http://www.xen.org/
[107] PostgreSQL Database. Accessed January 2015 from http://www.postgresql.org/
[108] IBM HS22, BladeCenter. Accessed January 2015 from http://www-947.ibm.com/support/entry/portal/product/bladecenter/bladecenter_hs22?productContext=1810554879
[109] OW2, RUBiS Benchmark. Accessed January 2015 from http://rubis.ow2.org/
[110] Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. "The WEKA data mining software: an update." ACM SIGKDD Explorations Newsletter 11, no. 1 (2009): 10-18.
[111] Cohen, William W. "Fast effective rule induction." In Proceedings of the Twelfth International Conference on Machine Learning, pp. 115-123. 1995.