Informatica — Universiteit van Amsterdam
Bachelor Informatica
Storage to Energy
Dexter Drupsteen
11th June 2013
Supervisor(s): Paola Grosso (UvA), Arie Taal (UvA)
Signed:
Abstract
This document proposes a method for deciding to move data storage tasks. It provides models
to compare data centers in the context of greenhouse gas emission and data storage. Following
the methods proposed in this document, a decision can be made to keep storage at a local data
center or move towards a remote data center in order to reduce greenhouse gas emission. Two
different storage methods are reviewed in this document: cold storage and hot storage. For both,
it can be concluded that the transport network that connects a local and a remote data center
plays a significant role, as does the ratio between two key variables: the amount of data of the
storage task and the retention time of the data, i.e. the time the data resides on one of the servers.
Contents
1 Introduction
1.1 Storage to energy
2 Related works
3 Methods
3.1 Variables and terminology
3.2 Models and setups
3.2.1 General setup
3.2.2 Hot storage
3.2.3 Cold storage
4 Tools
4.1 Sweep
4.2 Storage to energy - Web application
5 Results
5.1 Hot storage results
5.1.1 Influence of PUE and local X-values
5.1.2 Transport network influence
5.1.3 Retention time influence
5.1.4 Data amount influence
5.1.5 Download rate influence
5.2 Cold storage results
5.2.1 Influence of PUE and local X-values
5.2.2 Transport network influence
5.2.3 Retention time influence
5.2.4 Data amount influence
5.2.5 Disk size influence
5.2.6 Data accumulation influence
6 Discussion
6.1 Hot storage
6.2 Cold storage
7 Conclusions and recommendations for future research
7.1 Recommendations for future research
7.2 Conclusions
References
Appendix A - Constants
CHAPTER 1
Introduction
In the past few years the popularity of cloud computing has been rising [1]. Businesses, public
institutions and educational institutions are moving more of their calculation and storage tasks
towards large data centers. With the increasing usage of data centers, the energy consumed
by these data centers becomes more significant in the total energy consumption of the world.
To create a sustainable and green sector, the need to reduce energy consumption of these data
centers is growing.
A lot of research is done in the field of making data centers more energy efficient. These research
projects can be divided into two broad areas. On one side of the spectrum there is
research on creating energy efficient hardware [2]. These projects focus mainly on hardware used for calculation and storage, but also on the infrastructure of the data centers themselves.
Other research projects focus on the software side of energy efficiency. For example, the research
project of Deng and Pung [3] proposes a software guided method for reducing the energy consumed
by data centers.
These projects show that the ICT sector is putting effort into becoming a more sustainable industry
and tries to reduce greenhouse gas emission by innovating. This is a good step. But for institutions and businesses that rely on cloud computing, and thus on large data centers, another option
is available. An institution that has its calculation and storage tasks done in a data center that
is, for example, on Dutch soil has the option to move these tasks towards a data center
anywhere on the planet to reduce greenhouse gas emission. In order for an institution to move its
ICT tasks towards a possibly greener data center (i.e. a data center with less greenhouse gas
emission and/or less power consumption¹) it needs to be shown that moving the tasks towards
the other data center is profitable, that is, that it results in less emission. To give institutions and businesses a
better way to decide on this matter the following questions must be answered:
• What information is needed to decide what is the greenest solution for our tasks?
• How can we compare the greenhouse gas emission of two data centers for the tasks that
need to be executed on them?
• Is greenhouse gas emission still reduced even when taking into account the energy used for
transporting the data of these tasks?
• What are important factors to take into account when moving the data of these tasks
towards another data center?
A research project that investigates this is the Bits to energy or energy to bits project[4].
This project forms the basis for the storage to energy project discussed here. The Bits to energy
or energy to bits project is discussed in the related works section of this document.
¹ A data center could be using more energy and still have less greenhouse gas emission due to a clean
energy source.
1.1 Storage to energy
The storage to energy project tries to answer sustainability questions concerning storage
tasks. Storage tasks consist of storing an amount of data for a period of time. Two scenarios
are investigated:
• the scenario of hot storage, where data availability is a major concern
• the scenario of cold storage, where large amounts of data are stored over a long period of
time
Both scenarios are explained in more detail in section 3.1.
This project relies on methods and models provided by the Bits to energy or energy to bits
project[4]. In this project models for storage tasks on data centers are discussed. By proposing
new or edited models a better understanding of storage tasks is provided. This project answers
the questions mentioned before but in the context of storage tasks:
• What information is needed to calculate greenhouse gas emission for storage tasks?
• When is moving storage tasks towards a remote data center profitable in terms of
greenhouse gas emission?
• What are decisive factors in the question of moving storage tasks towards a remote data
center?
Answering these questions and providing tools to answer them can give institutions and
businesses a better overview of the sustainability of their storage tasks. With a better overview
these institutions and businesses can make a decision for a more sustainable storage policy
and ultimately realize a smaller CO2 footprint, without being dependent on innovations in
hardware or software.
CHAPTER 2
Related works
As mentioned before, this project is highly influenced by the Bits to energy or energy to bits
project. The next section covers the Bits to energy or energy to bits project.
Bits to energy or energy to bits
The Bits to energy or energy to bits project is a collaboration between the Dutch research institution TNO, research infrastructure provider SurfSARA and the systems and network engineering
research group of the University of Amsterdam, also known as the SNE group.
The focus of the project is answering two fundamental questions:
• “What are the sustainability effects of data transport over the data network? How much
energy is required and what is the CO2 footprint?”
• “What are the sustainability effects of energy transport? When is it suitable to acquire
green energy from elsewhere?”
These questions make clear that the project has two main focuses. On the one hand, the
project investigates moving green and sustainable energy towards the data center in order to
reduce greenhouse gas emission. The authors calculate the energy required for a certain task in a data
center in order to calculate the costs (in greenhouse gas emission) of energy production, energy
transport and possible losses.
On the other hand, the project focuses on the sustainability of data transport. The authors provide
models to calculate the energy consumed by a set of different networks for a certain task. They
also propose models for calculating energy consumption and greenhouse gas emission for certain
tasks done by data centers. They recognize three different scenarios: CPU intensive tasks, which
require a lot of processing time but mostly limited data transport;
interactive tasks, which are also CPU intensive but require more data transmission; and
storage tasks, which consist of large amounts of data that are kept for a period of time in the
data center but on which no calculation is needed. Combining the two areas gives an overview of
what would be more sustainable: moving tasks towards a greener data center or moving greener
energy towards the data center.
Other works
Though the Bits to energy or energy to bits project is an influential work for this project, the models
provided by that project are themselves influenced by an article by J. Baliga et al. [5]. Their work is also
referenced and used in this project. Their models, which are simple and clear, in combination
with the models presented in [4] form the basis for the models presented here. Another work that
provides a view on models is [6]. In this article Robert Basmadjian et al. propose a detailed
model for calculating energy consumption in large data centers. Although the models may be
too detailed for the purpose of this project, ideas and inspiration for this project are drawn from
it.
Furthermore, there are many ways to handle certain storage tasks. Especially for cold storage, where
the main objective is to minimize energy consumption, many methods are available. In an
article by Dennis Colarelli et al. [7] a method for cold storage is proposed which is considered
to be the basis for cold storage throughout this project. In [3], Yuhui Deng and Brandon
Pung also describe this method briefly, but from a different angle and covering
a different part of the cold storage method described in [7]. In another work [8], Yuhui Deng,
Frank Wang and Na Helian describe in depth the workings of the disks that are used for
the cold storage method described in [7] and briefly discussed in [3].
For an overview of data center architecture and infrastructure the article of Krishna Kant et al.
[9] is referenced.
Furthermore, datasheets of hardware manufacturers are used to acquire information about the energy
consumption of infrastructure and hardware components used in data centers for data
storage.
CHAPTER 3
Methods
Before the models can be explained, a set of variables and the general terminology have to be
defined. The next section covers the general variables and terminology used throughout the next
sections.
3.1 Variables and terminology
Power usage effectiveness
A common variable for comparing the efficiency of data centers is the PUE, an abbreviation for power
usage effectiveness. It is the ratio between the total power consumed by the data center and the
power used by IT equipment like data storage devices, servers and infrastructure equipment [4, 5].
The following equation shows the PUE:
\[
\mathrm{PUE} = \frac{E_{\text{In}}}{E_{\text{IT}}} = \mathrm{CLF} + \mathrm{PLF} + 1, \qquad 1 < \mathrm{PUE} < \infty \tag{3.1}
\]
The ratio between the total energy consumption of the data center ($E_{\text{In}}$) and the energy
consumption of the IT hardware ($E_{\text{IT}}$) equals 1 plus the cooling load factor (CLF), which is
the energy used by cooling devices normalized to the IT load, plus the power load factor (PLF),
which is the power lost in hardware like uninterruptible power supplies, power distribution units
and other switchgear [4, 5]. With the PUE a more accurate energy consumption of hardware can
be calculated. If a storage device turns out to use $x$ joules of energy over a period of time, a more
accurate energy consumption would be $\mathrm{PUE} \cdot x$. This way, cooling devices etc. can be ignored in
the equations of the models, as they are accounted for in the PUE.
A survey by Uptime [10, 4] collected information about PUEs; their distribution
is shown in figure 3.1.
Utilization
In a perfect situation hardware is used to its full potential. For example, when data gets routed at
a router, the router would use its full capacity (in the order of bits per second) to route the data
the correct way. But most hardware does not work at its full processing capacity when
processing a task, while it keeps operating at 100% power consumption. This results in a higher
energy consumption per bit for these devices. Therefore another factor is added to the calculation
of the energy consumption of a data center or other component. It was stated before that
a more accurate energy consumption of a data center is calculated with the PUE: $\mathrm{PUE} \cdot E_{\text{total}}$, where
$E_{\text{total}}$ is the total energy used in kWh. In order to account for the utilization of the equipment
another term is added to this equation: $\frac{\mathrm{PUE}}{U} \cdot E_{\text{total}}$. In this equation the $U$ factor accounts for
the lower utilization. For example, if the utilization of a data center is 50%, the utilization factor
$U$ will be 0.5, which multiplies the energy by a factor of 2.
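As a minimal sketch of this correction, the Python function below scales a measured energy by PUE/U. The function name and the example values are illustrative assumptions and are not taken from the project.

```python
def corrected_energy_kwh(e_total_kwh, pue, utilization):
    """Scale an energy figure (kWh) by PUE / U, as described above."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return pue / utilization * e_total_kwh


# Hypothetical example: 10 kWh of IT energy, PUE 1.5, 50% utilization.
print(corrected_energy_kwh(10.0, pue=1.5, utilization=0.5))  # 30.0
```

With a utilization of 0.5 the energy attributed to the task doubles, and the PUE of 1.5 adds another 50% for cooling and power distribution.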
Figure 3.1: The distribution of PUEs of data centers all around the world.
Greenhouse gas emission
With a corrected energy consumption available for a data center, the next step is to determine
the greenhouse gas emission. In the case of energy consumption (and thus production) the primary,
and most important, greenhouse gas is carbon dioxide or CO2. To calculate the grams of CO2
that are produced during the production of the energy, the energy source of the data center must
be known. This differs between data centers. In the Netherlands data centers are mostly
supplied with energy produced by power plants that use natural gas, whereas in, for example, Norway
the energy used by data centers is far more likely to be hydroelectricity. If a total energy
consumption of the data center ($E_{\text{total}}$) is calculated in kWh, then for each energy source there is
an X-factor in grams of CO2 per kWh ($\frac{\text{g CO}_2}{\text{kWh}}$).
\[
K_{\text{data center}} = X_{\text{data center}} \cdot \mathrm{PUE}_{\text{data center}} \cdot E_{\text{consumed}} \tag{3.2}
\]
In equation (3.2), $K_{\text{data center}}$ is the grams of CO2 that are emitted due to the energy
consumed by the data center. In the rest of the document, $K$ is the variable
on which the decision to move a certain storage task towards another data center is based.
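Equation (3.2) can be illustrated with a short Python sketch. The X-factors and PUE values below are made-up placeholders chosen only to show the effect of a cleaner energy source; they are not the constants used in this project.

```python
def co2_grams(x_g_per_kwh, pue, e_consumed_kwh):
    """Equation (3.2): grams of CO2 emitted for a given energy consumption."""
    return x_g_per_kwh * pue * e_consumed_kwh


# Hypothetical values: a gas-powered and a hydro-powered data center
# that both consume 100 kWh of IT energy for the same task.
k_gas = co2_grams(x_g_per_kwh=400.0, pue=1.5, e_consumed_kwh=100.0)
k_hydro = co2_grams(x_g_per_kwh=20.0, pue=1.25, e_consumed_kwh=100.0)
print(k_gas, k_hydro)
```

Under these assumed numbers the difference in X-factor dominates the difference in PUE, which is why the energy source has to be known before two data centers can be compared.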
Hot storage
As mentioned in the introduction this project will consider two data storage scenarios. Hot
storage is the storage of content that requires a high availability. This content may be downloaded
a few times per hour or day. Examples of hot storage are popular photographs on social media,
videos that are viewed a lot or any other form of content that is accessed frequently. In hot
storage energy efficiency is traded for this high and fast availability. As startup times for storage
hardware would limit the speed at which the data is retrieved, systems are kept awake and in an
active state.
Cold storage
The other data storage scenario considered in this project is cold storage. Cold storage is the
storage of content that requires no or low availability. The content is stored and only accessed
after a long period of time. Backups are a general example of cold storage. But cold storage is
not limited to backups. For example, scientific data that is collected over a long period of
time before calculations are done on it is also a good cold storage candidate.
There are a few methods to store cold data. One can store cold data on optical disks or magnetic
tapes. Both of these methods use no or little energy during the time the data is stored. In data
centers a more modern approach to cold data storage is to store the data on MAIDs[7]. MAID
is an abbreviation for Massive Array of Idle Disks. With this method the data is stored on a
disk array that is only active when it is written to or read from. In the remaining time the data
is on the disk, the disk is operating in a low power mode. To retrieve the data from the array
again, the array has to come from the low power mode into a normal power mode again. This
takes time and energy, but is fine for the cold storage scenario as the data stored is not expected
to be used during the retention time.
3.2 Models and setups
3.2.1 General setup
In order to explain the models a general setup (of hardware and network components) must be
defined. The power consumption of components and hardware used in these sections can be
found in the appendices.
In this project there are always two data centers. The first data center will be referred to as
the local data center. This local data center is close to the source of the data. Therefore the costs of
transporting data from the source towards the local data center are negligible. The
other data center will be referred to as the remote data center. This data center can be anywhere.
The two data centers are connected by a transport network. There are a few types of transport
network, which are explained later in this section.
Local area network
The local area network (LAN) of the data center is the part of the data center that connects it
to the transport network. All data that needs to be stored to or read from the storage devices
of the data center will pass the LAN. For both the local and the remote data center the same
setup of the LAN is chosen.
A typical LAN of a data center consists of the following components [4]:
• a host (network interface)
• three switches
• two firewalls
• a router
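The energy consumption for the LAN of the data center can be calculated with the following
equation:
\[
\mathrm{LAN} = \frac{\mathrm{PUE}}{U} \cdot \left( \frac{P_{\text{host}}}{C_{\text{host}}} + \frac{3 P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{firewall}}}{C_{\text{firewall}}} + \frac{P_{\text{router}}}{C_{\text{router}}} \right) \tag{3.3}
\]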
where $P_{\text{host}}$, $P_{\text{switch}}$, $P_{\text{firewall}}$ and $P_{\text{router}}$ are the power consumptions in kW of the host,
switches, firewalls and the router respectively, and $C_{\text{host}}$, $C_{\text{switch}}$, $C_{\text{firewall}}$ and $C_{\text{router}}$ are
the capacities of the components in $\frac{\text{Gbit}}{\text{second}}$. This makes the whole equation an equation in $\frac{\text{kWs}}{\text{Gbit}}$,
as the PUE and $U$ terms have no unit of measurement.
When a data amount N is known, equation (3.3) can be used to calculate the energy used by
the LAN to transport this data (equation (3.4)).
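\[
E_{\text{LAN}}(N) = \frac{8N}{3600} \cdot \mathrm{LAN} = \frac{8N}{3600} \cdot \frac{\mathrm{PUE}}{U} \cdot \left( \frac{P_{\text{host}}}{C_{\text{host}}} + \frac{3 P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{firewall}}}{C_{\text{firewall}}} + \frac{P_{\text{router}}}{C_{\text{router}}} \right) \tag{3.4}
\]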
The multiplication with 8N , with N in GByte, makes ELAN an equation in kWs. The division
by 3600 converts kWs to kWh (from seconds to hours).
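Equations (3.3) and (3.4) can be sketched in Python as follows. The `Component` class and all power and capacity figures are hypothetical and only serve to show how the units combine; the constants actually used are listed in the appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A network component; the values used below are illustrative only."""
    power_kw: float         # P, power consumption in kW
    capacity_gbit_s: float  # C, capacity in Gbit/s


def lan_kws_per_gbit(pue, utilization, host, switch, firewall, router):
    """Equation (3.3): LAN energy per transported Gbit, in kWs/Gbit."""
    per_gbit = (
        host.power_kw / host.capacity_gbit_s
        + 3 * switch.power_kw / switch.capacity_gbit_s
        + 2 * firewall.power_kw / firewall.capacity_gbit_s
        + router.power_kw / router.capacity_gbit_s
    )
    return pue / utilization * per_gbit


def e_lan_kwh(n_gbyte, lan):
    """Equation (3.4): energy in kWh to pass n_gbyte GByte through the LAN."""
    # 8 * N converts GByte to Gbit, dividing by 3600 converts kWs to kWh.
    return 8 * n_gbyte / 3600 * lan


# Hypothetical component figures, not the constants from the appendix.
host = Component(power_kw=0.2, capacity_gbit_s=1.0)
switch = Component(power_kw=0.5, capacity_gbit_s=10.0)
firewall = Component(power_kw=0.3, capacity_gbit_s=5.0)
router = Component(power_kw=4.0, capacity_gbit_s=100.0)

lan = lan_kws_per_gbit(1.5, 0.5, host, switch, firewall, router)
print(f"E_LAN for 1000 GByte: {e_lan_kwh(1000, lan):.3f} kWh")
```

Because equation (3.4) is linear in N, the LAN term only has to be computed once per data center and can then be reused for any data amount.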
Transport network
The transport network is the network that connects the remote and the local data centers. This
network can be either a dedicated network (a lightpath) or the public internet. To create
equations for the transport network types, building blocks are used. For the internet transport
network type this building block consists of the components shown in figure 3.2. The lightpath
building block is depicted in figure 3.3.
Figure 3.2: The building block of the internet transport network type
Figure 3.3: The building block of the lightpath transport network type
With these building blocks equations for two distances can be constructed. The two distances
are short and long where the short distance is considered to be national transport and the long
distance is considered to be international transport. The building blocks represent one hop in
the network. The average short distance contains one hop and the average long distance contains
three hops [4, 5] as can be seen in figures 3.4, 3.5, 3.6 and 3.7.
For each of the two transport types, equations for both the long and the short distance can be
constructed like equation (3.3) and equation (3.4):
• $E_{\text{transport network},\,\text{internet, short distance}}(N)$
• $E_{\text{transport network},\,\text{internet, long distance}}(N)$
• $E_{\text{transport network},\,\text{lightpath, short distance}}(N)$
• $E_{\text{transport network},\,\text{lightpath, long distance}}(N)$
To generalize this, $E_{\text{transport network}}(N)$ will be used. This is one of the four equations mentioned
above and has the form of equation (3.4) (but specific to the type of network that is used).
Just as the energy of the LAN of the data center is multiplied by the PUE of the data center, the energy
used by the transport network should also be multiplied by a PUE to account for cooling and
power supply overhead. Therefore a PUE is assigned to the transport network.
Figure 3.4: Short distance internet transport network
Figure 3.5: Long distance internet transport network
Figure 3.6: Short distance lightpath transport network
Figure 3.7: Long distance lightpath transport network
3.2.2 Hot storage
Model
For the storage facilities of the data centers the infrastructure in figure 3.8 is used.
Figure 3.8: The setup of the storage devices in the data centers
The infrastructure shows a content server (1), which is connected to a storage
array (2) by switches (3). As the bits of the data pass through either one or the other switch, only one
switch needs to be used in the equation. The storage array is an array of disks. In this case
the storage array is a redundant array of independent disks (RAID). To account for the
energy consumption when the data is written to the disks, equation (3.5) is used.
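\[
E_{\text{write}}(N) = \frac{\mathrm{PUE}}{U} \cdot \frac{8N}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk},\text{active}}}{C_{\text{disk}}} \right) \tag{3.5}
\]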
where $P_{\text{content server}}$, $P_{\text{switch}}$ and $P_{\text{disk},\text{active}}$ are the power consumptions in kW of the content
server, the switch and the disk in an active state, respectively; $C_{\text{content server}}$, $C_{\text{switch}}$ and
$C_{\text{disk}}$ are the capacities of the same components in $\frac{\text{Gbit}}{\text{second}}$. $N$ is the amount of data in
GByte. This explains the multiplication by 8 (to go from GByte to Gbit). The division
by 3600 converts $\frac{\text{Gbit}}{\text{second}}$ to $\frac{\text{Gbit}}{\text{hour}}$. The factor of 2 on $P_{\text{disk},\text{active}}$ accounts for
redundancy in storage [5, 4]: for every bit of data that is written to the array, two bits are
actually stored. As the term $\frac{\mathrm{PUE}}{U}$ has no unit of measurement, the function $E_{\text{write}}(N)$ has the
unit kWh.
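The equation used for writing the data to the storage array can also be applied to calculate
the energy needed to read the data from the storage array.
\[
E_{\text{read}}(N) = E_{\text{write}}(N) = \frac{\mathrm{PUE}}{U} \cdot \frac{8N}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk},\text{active}}}{C_{\text{disk}}} \right) \tag{3.6}
\]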
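A minimal Python sketch of equations (3.5) and (3.6) is given below. The parameter names mirror the symbols in the equations; the example values are hypothetical and not the constants from the appendix.

```python
def e_write_kwh(n_gbyte, pue, utilization,
                p_server_kw, c_server_gbit_s,
                p_switch_kw, c_switch_gbit_s,
                p_disk_active_kw, c_disk_gbit_s):
    """Equation (3.5): energy in kWh to write n_gbyte GByte to the array."""
    per_gbit = (
        p_server_kw / c_server_gbit_s
        + p_switch_kw / c_switch_gbit_s
        + 2 * p_disk_active_kw / c_disk_gbit_s  # factor 2: redundant copy
    )
    return pue / utilization * 8 * n_gbyte / 3600 * per_gbit


# Equation (3.6): reading is modelled with the same expression as writing.
e_read_kwh = e_write_kwh

# Hypothetical figures for a content server, a switch and a disk.
energy = e_write_kwh(1000, pue=1.5, utilization=0.5,
                     p_server_kw=0.3, c_server_gbit_s=1.0,
                     p_switch_kw=0.5, c_switch_gbit_s=10.0,
                     p_disk_active_kw=0.01, c_disk_gbit_s=1.0)
print(f"E_write for 1000 GByte: {energy:.3f} kWh")
```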
However, this does not fully explain the whole model. In order to complete it there is a need
for an equation that calculates the energy used for storing the data over a period of time. To
calculate this energy, the period of time needs to be known. This introduces a new variable
called the retention time (RT). The equation used to calculate the energy of solely storage is a
function of the retention time and the amount of data:
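\[
E_{\text{hot storage}}(N, RT) = \frac{\mathrm{PUE}}{U} \cdot \left\lceil \frac{2N}{S_{\text{disk}}} \right\rceil \cdot P_{\text{disk},\text{active}} \cdot RT \tag{3.7}
\]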
where $S_{\text{disk}}$ is the storage capacity of one disk in the array in GByte, $P_{\text{disk},\text{active}}$ is again the
power consumption of a disk in an active state in kW, $N$ is again the amount of data (or data
amount) in GByte and $RT$ is the retention time in hours. Equation (3.7) is therefore also a function
with the unit kWh: the PUE and $U$ factors have no unit of measurement, and neither does
the term $\left\lceil \frac{2N}{S_{\text{disk}}} \right\rceil$, as $N$ and $S_{\text{disk}}$ are both in GByte. That leaves $P_{\text{disk},\text{active}}$, which is in kW, and the
retention time $RT$ in hours, which gives the equation a unit of kWh.
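The role of the ceiling in equation (3.7) is easiest to see in a small Python sketch. The disk size and power figures are hypothetical placeholders.

```python
import math


def e_hot_storage_kwh(n_gbyte, rt_hours, pue, utilization,
                      s_disk_gbyte, p_disk_active_kw):
    """Equation (3.7): energy in kWh to keep n_gbyte GByte on active disks."""
    # 2 * N accounts for redundancy; a partially filled disk still
    # consumes full power, hence the ceiling.
    disks = math.ceil(2 * n_gbyte / s_disk_gbyte)
    return pue / utilization * disks * p_disk_active_kw * rt_hours


# Hypothetical example: 1200 GByte on 1000 GByte disks needs
# ceil(2400 / 1000) = 3 disks, kept active for 24 hours.
print(e_hot_storage_kwh(1200, 24, pue=1.5, utilization=0.5,
                        s_disk_gbyte=1000, p_disk_active_kw=0.01))
```

Because of the ceiling, the storage energy grows in steps of one disk rather than linearly with N, while it is linear in the retention time.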
A property of hot storage data is that the data is consumed during the time that it resides
on the server. That means that the data can be downloaded by a user. To calculate the energy
used by the storage devices of the local or remote data center when downloading (i.e. reading
from the array), equation (3.6) is used. In order to use this equation another variable must be
introduced: the download rate. This variable holds the amount of data downloaded in
an hour and thus has the unit $\frac{\text{GByte}}{\text{hour}}$. Multiplying the download rate by the retention time (in
hours) gives the amount of data, in GByte, that is downloaded during the whole time the data resides on the server.
That amount of data needs to be read from the array, pass the switch
and go through the content server again, just as when writing the data.
Decision equation
With the equations used for calculating the energy used by the local area networks of the data
centers, the transport network and the reading, writing and storage of the data on the data
centers a decision equation can be made for the hot storage scenario. This decision equation can
be used to decide to move the storage task towards a remote data center or to keep it at the local
data center. As the focus of this project is to reduce the amount of CO2 emitted by the data
centers, the terms in the decision equation need to be in grams of CO2. For keeping the storage
task at the local data center the following equation is used¹:
\[
K_{\text{stay at local data center}}(N, RT, DR) = X_{\text{local data center}} \cdot \left( E_{\text{hot storage},\text{local}}(N, RT) + E_{\text{download from data center},\text{local}}(RT \cdot DR) \right) \tag{3.8}
\]
where
\[
E_{\text{download from data center}}(N^*) = E_{\text{read}}(N^*) + E_{\text{LAN}}(N^*) \tag{3.9}
\]
Here $E_{\text{hot storage}}$ is the function defined by equation (3.7), $E_{\text{read}}$ is the function defined by equation
(3.6) and $E_{\text{LAN}}$ is the function defined by equation (3.4). The $N^*$ in equation (3.9) is the data
amount that is downloaded during the time the data resides on the server; in equation (3.8) it is the product of $RT$, the retention time, and $DR$, the download rate.
For the remote data center the equation used in the final decision equation is different¹:
\[
K_{\text{move to remote data center}}(N, RT, DR) = X_{\text{local data center}} \cdot \left( E_{\text{LAN},\text{local}}(N) \right) + K_{\text{transport network}}(N, RT, DR) + K_{\text{remote data center}}(N, RT, DR) \tag{3.10}
\]
where
\[
K_{\text{transport network}}(N, RT, DR) = X_{\text{transport network}} \cdot \left( E_{\text{transport network}}(N) + E_{\text{transport network}}(RT \cdot DR) \right) \tag{3.11}
\]
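and
\[
K_{\text{remote data center}}(N, RT, DR) = X_{\text{remote data center}} \cdot \left( E_{\text{LAN},\text{remote}}(N) + E_{\text{write},\text{remote}}(N) + E_{\text{hot storage},\text{remote}}(N, RT) + E_{\text{download from data center},\text{remote}}(RT \cdot DR) \right) \tag{3.12}
\]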
$E_{\text{LAN},\text{local}}$ and $E_{\text{LAN},\text{remote}}$ are equation (3.4) for the local and the remote data center respectively. In $K_{\text{transport network}}$ the energy used for first transporting the total data set
towards the remote data center is calculated with $E_{\text{transport network}}(N)$ and the energy used
for the transport of data that is downloaded from the remote data center is calculated with
$E_{\text{transport network}}(RT \cdot DR)$. The data that is downloaded from the remote data center has to
pass the transport network again, as it is assumed that the destination of the downloaded data
is close to the local data center.
¹ Equations (3.8) to (3.12) rely on previously defined equations. To make clear whether the PUE of the local or
the remote data center is used, a subscript is added to the equations. For example, $E_{\text{LAN},\text{local}}$ refers to the $E_{\text{LAN}}$
function with the PUE of the local data center.
Note that the energy used to write the data to the local data center is ignored in
equation (3.8), but the write to the remote data center is not ignored in equation (3.12). It is
assumed that the data that needs to be stored resides in the local data center. The energy
used to write the data to the local data center, which could be added to the local part of the decision
equation, equals the energy used to read the data for transport towards
the remote data center, which could be added to the remote part of the decision equation.
As the two are equal, both can be omitted. Likewise, the energy used to get the data into the local data
center the first time (which is the energy used by the local area network of the local data center)
would show up in both equations, so it can also be omitted when deciding to move to a remote data
center or to stay at the local data center.
Now that the equations for the CO2 emission are known, the decision is as follows. When
\[
K_{\text{move to remote data center}}(N, RT, DR) < K_{\text{stay at local data center}}(N, RT, DR) \tag{3.13}
\]
is true, it is, in terms of CO2 emission, more profitable to move the storage task
towards the remote data center.
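The decision in equations (3.8) to (3.13) can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool developed in this project: because $E_{\text{LAN}}$, $E_{\text{read}}$, $E_{\text{write}}$ and $E_{\text{transport network}}$ are linear in the data amount, each is represented here by a hypothetical per-GByte coefficient (with PUE and U already applied), and all numbers are made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    x_g_per_kwh: float        # X-factor of the energy source (g CO2 per kWh)
    pue: float
    utilization: float
    lan_kwh_per_gbyte: float  # E_LAN(1), eq. (3.4), PUE/U already applied
    rw_kwh_per_gbyte: float   # E_write(1) = E_read(1), eqs. (3.5) and (3.6)
    s_disk_gbyte: float
    p_disk_active_kw: float

    def e_lan(self, n):
        return self.lan_kwh_per_gbyte * n

    def e_write(self, n):
        return self.rw_kwh_per_gbyte * n

    def e_hot_storage(self, n, rt):  # eq. (3.7)
        disks = math.ceil(2 * n / self.s_disk_gbyte)
        return self.pue / self.utilization * disks * self.p_disk_active_kw * rt

    def e_download(self, n_star):  # eq. (3.9), E_read + E_LAN
        return self.e_write(n_star) + self.e_lan(n_star)


def k_stay(local, n, rt, dr):  # eq. (3.8)
    return local.x_g_per_kwh * (local.e_hot_storage(n, rt)
                                + local.e_download(rt * dr))


def k_move(local, remote, x_transport, transport_kwh_per_gbyte, n, rt, dr):
    k_transport = x_transport * (transport_kwh_per_gbyte * n
                                 + transport_kwh_per_gbyte * rt * dr)  # (3.11)
    k_remote = remote.x_g_per_kwh * (remote.e_lan(n) + remote.e_write(n)
                                     + remote.e_hot_storage(n, rt)
                                     + remote.e_download(rt * dr))     # (3.12)
    return local.x_g_per_kwh * local.e_lan(n) + k_transport + k_remote  # (3.10)


def should_move(local, remote, x_transport, transport_kwh_per_gbyte, n, rt, dr):
    """Equation (3.13): True when moving emits less CO2 than staying."""
    return (k_move(local, remote, x_transport, transport_kwh_per_gbyte,
                   n, rt, dr) < k_stay(local, n, rt, dr))


# Hypothetical data centers that differ only in their energy source.
local = DataCenter(400.0, 1.5, 0.5, 0.001, 0.002, 1000.0, 0.01)
remote = DataCenter(20.0, 1.5, 0.5, 0.001, 0.002, 1000.0, 0.01)
print(should_move(local, remote, 400.0, 0.005, n=1000, rt=1000, dr=1))
print(should_move(local, remote, 400.0, 0.005, n=1000, rt=1, dr=0))
```

With these assumed numbers, a long retention time lets the cleaner remote storage outweigh the one-off transport cost, whereas for a short retention time the transport dominates and staying local emits less.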
3.2.3 Cold storage
Model
For the cold storage scenario, the same setup is used as for the hot storage scenario. This setup
is shown in figure 3.8. There is one difference between the setups. Whereas in the hot storage
scenario the storage array was said to be a RAID (Redundant Array of Independent Disks), in
the cold storage scenario the array is configured to be a MAID (Massive Array of Idle Disks). In
the MAID setup the disks can enter three states. These states (with the transitions from one
state to another) can be seen in figure 3.9 [3].
Figure 3.9: The states that a disk in a MAID array can enter with their transitions. No energy
is used to perform transitions 1 and 2. For transitions 3 and 4, a penalty must be paid: the spin
up and the spin down energy.
It is assumed that the disks in the MAID array are in a standby mode by default, even when
storing data. Therefore the storage equation for the hot storage scenario (3.7) can be taken
and edited for the cold storage scenario:
\[
E_{\text{cold storage}}(N, RT) = \frac{\mathrm{PUE}}{U} \cdot \left\lceil \frac{2N}{S_{\text{disk}}} \right\rceil \cdot P_{\text{disk},\text{standby}} \cdot RT \tag{3.14}
\]
When writing to the storage array, the disks written to (or read from) have to be active. This
means that the equation for writing (or reading) in the hot storage scenario, equation (3.5), could
be used for the cold storage scenario as well:
\[
E_{\text{write}}(N) = \frac{\mathrm{PUE}}{U} \cdot \frac{8N}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk},\text{active}}}{C_{\text{disk}}} \right) \tag{3.15}
\]
But this ignores the fact that the disks can be in three states, as only two states are used
in these equations. These equations do not represent the cold storage method that
is actually used; they simply assume that disks enter a low power state
immediately when they are not used. This is not entirely true.
There are different methods to store cold data. It can be stored on tapes or optical disks like DVD
or CD (methods not discussed here). For cold storage arrays, a method was proposed,
for example, by [7] in 2002. This method is based on spinning down hard disks when they are
inactive for a period of time. Keeping the disks in a spinning state (i.e. ready to read from or
write to at any moment) consumes a lot of energy, while the disks could just as well not be
spinning, as the data is not needed as often in cold storage as it is in hot storage. But there are some
things that have to be taken into account when using this method of cold storage.
In figure 3.9 three states are named: active, idle and standby. When the disk is not used (just
storing the data or nothing at all) it resides in the standby state, in which it is spun down:
the disk does not rotate at all or rotates very slowly. When data arrives at the disk for writing,
the disk has to be spun up (transition 4 in figure 3.9). When the data has been written to the
disk, or read from the disk, the disk enters the idle state (transition 1). When the disk is used
again while in the idle state (to read or to write) it returns to the active state (transition 2).
When the disk is not used for a certain period of time, it is spun down to the standby state
(transition 3). This seems straightforward, but there is a catch. Transitions 1 and 2 are
performed without a penalty. Although the disk has a higher power consumption in the active state,
no energy is needed to bring the disk from the idle state to the active state. But for transition
4 and transition 3, going from and to the standby state respectively, there is a penalty to be
paid. These penalties are the energy used to spin the disk down to its standby state, and to spin
it up again to an active state. That is not the only catch. It was said that the disk remains in
the idle state for a certain period of time after it was active and before it is spun down. This
time depends on the spin down policy of the disks. There are a few types of spin down policies.
The time that the disk is idle can be decided by a function of the amount of time it was in the
active state (for example, a longer active time could mean a longer idle period). The idle time
can also depend on statistical data about the disk usage. These kinds of policies are dynamic
policies. In this model another type of policy is used. Instead of calculating an idle time from
statistical data, the policy is static and always spins down the disks after a fixed threshold
time.
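The static policy described above can be illustrated with a small simulation. The following is only a sketch under simplifying assumptions that are not part of the model: every access is treated as instantaneous, the disk starts in standby, and the function name and its parameters are hypothetical. It counts how often the two penalised transitions (spin up and spin down) occur and how many minutes the disk spends idle.

```python
def simulate_static_policy(access_times, idle_threshold):
    """Count spin ups, spin downs and idle minutes under a static policy.

    access_times   -- sorted times (in minutes) at which the disk is used;
                      each access is assumed to take no time (simplification)
    idle_threshold -- minutes a disk stays idle before it is spun down
    """
    spin_ups = 0
    spin_downs = 0
    idle_minutes = 0.0
    last_access = None  # None means the disk is still in standby

    for t in access_times:
        if last_access is None:
            spin_ups += 1                     # standby -> active (penalty)
        elif t - last_access > idle_threshold:
            idle_minutes += idle_threshold    # idle until the threshold passes
            spin_downs += 1                   # idle -> standby (penalty)
            spin_ups += 1                     # standby -> active (penalty)
        else:
            idle_minutes += t - last_access   # idle -> active, no penalty
        last_access = t

    if last_access is not None:
        # After the last access the disk idles for the threshold time
        # and is then spun down.
        idle_minutes += idle_threshold
        spin_downs += 1

    return spin_ups, spin_downs, idle_minutes
```

For accesses at minutes 0, 5 and 60 with a threshold of 10 minutes, the second access arrives while the disk is still idle, whereas the third access requires a new spin up.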
With this information the write equation for cold data can be revised:
\[ E_{\text{write cold}}(N) = \frac{PUE}{U} \cdot \frac{8N}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk active}}}{C_{\text{disk}}} \right) + \left\lceil \frac{2N}{S_{\text{disk}}} \right\rceil \cdot \left( E_{\text{spin up}} + E_{\text{spin down}} + E_{\text{disk idle}}(T_{\text{idle threshold}}) \right) \tag{3.16} \]
Where $T_{\text{idle threshold}}$ is the time, in minutes, that a disk spends in the idle state
before it is spun down to the standby state. As it is a constant, it is not a parameter of the
function. $E_{\text{spin up}}$, $E_{\text{spin down}}$ and
$E_{\text{disk idle}}(T_{\text{idle threshold}})$ are all in kWh. Furthermore, $S_{\text{disk}}$
is the size of a single disk in GBytes. $E_{\text{spin up}}$ is the energy used for spinning up
the disk. This term is needed because the disks reside in the standby state by default.
$E_{\text{spin down}}$ is the energy used for spinning the disk down. This term is needed because
the disk is spun down to its standby state again after the idle period. The other terms in the
equation are the same as the terms in equation (3.5) on page 12. The idle energy for a time $t$
in minutes is:
\[ E_{\text{disk idle}}(t) = P_{\text{disk idle}} \cdot \frac{t}{60} \]
This poses another question. When comparing the cold storage write equation (3.16) to the
hot storage write equation (3.5) it can be noted that the number of times a certain amount of
data is written to the disks matters for the cold storage equation, but not for the hot storage
equation. Every time a write action occurs on the cold storage disks, they have to be spun up at
least once, kept idle for a certain time and spun down again, also at least once. That suggests
that if the data arrives in fragments, the energy used for writing in the cold storage scenario
gets higher. In the hot storage scenario this does not matter, as the disks are in an active state
during the whole retention time: writing the whole data amount at once or in several parts makes
no difference. This difference calls for an edited cold storage model in which the total data set
can be split into several batches, which are then written to disk with a certain (time) interval
between them.
Consider the variable $T_{\text{between batches}}$ as the time between the end of the write of a
batch and the start of the write of the next batch. This time is important because when the time
between batches is smaller than the idle threshold time (the time after which a disk is spun down
from the idle state to the standby state), the disks that are written to do not have to be spun
down and up again. Therefore there are two different equations for writing data in cold storage.
Namely, when $T_{\text{between batches}} > T_{\text{idle threshold}}$:
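\[ E_{\text{write cold}}(N_{\text{batches}}, S_{\text{batch}}) = \sum_{i=1}^{N_{\text{batches}}} \left[ \frac{PUE}{U} \cdot \frac{8 \cdot S_{\text{batch}}}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk active}}}{C_{\text{disk}}} \right) + \left\lceil \frac{2 S_{\text{batch}}}{S_{\text{disk}}} \right\rceil \cdot \left( E_{\text{spin up}} + E_{\text{spin down}} + E_{\text{disk idle}}(T_{\text{idle threshold}}) \right) \right] \tag{3.17} \]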
This corresponds with the loop made in figure 3.10.
Figure 3.10: The states that the disks will enter when the time between batches is larger than
the idle threshold time. For every batch the transitions are: 1 - spin up the disk to an active
state; 2 - after the write is done go into an idle state; 3 - spin down the disk to standby mode to
conserve energy.
When $T_{\text{between batches}} > T_{\text{idle threshold}}$ is false, i.e. the time between
batches is not longer than the idle threshold time, the equation is different:
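\[ E_{\text{write cold}}(N_{\text{batches}}, S_{\text{batch}}) = E_{\text{spin up}} + E_{\text{spin down}} + E_{\text{disk idle}}(T_{\text{idle threshold}}) + \sum_{i=1}^{N_{\text{batches}}} \left[ \frac{PUE}{U} \cdot \frac{8 \cdot S_{\text{batch}}}{3600} \cdot \left( \frac{P_{\text{content server}}}{C_{\text{content server}}} + \frac{P_{\text{switch}}}{C_{\text{switch}}} + \frac{2 P_{\text{disk active}}}{C_{\text{disk}}} \right) + \left\lceil \frac{2 S_{\text{batch}}}{S_{\text{disk}}} \right\rceil \cdot E_{\text{disk idle}}(T_{\text{between batches}}) \right] \tag{3.18} \]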
This corresponds with the loop made in figure 3.11.
Figure 3.11: The states that the disks will enter when the time between batches is shorter than
the idle threshold time. For every batch the transitions are: 1 - spin up the disk to an active
state; 2 - after the write is done go into an idle state; 3 - when the next batch arrives, turn
active to write it; 4 - when the idle threshold time passes and no more batches have arrived, spin
down to standby mode.
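The two batch equations (3.17) and (3.18) can be sketched in Python as follows. This is an illustration only and not the code of the Sweep tool: the function and parameter names are hypothetical, energies are assumed to be in kWh, times in minutes and sizes in GByte as in the text, and the spin up, spin down and idle terms of (3.17) are assumed to be paid once per batch, as the loop in figure 3.10 suggests.

```python
import math


def transfer_energy(gbyte, pue, u, p_server, c_server, p_switch, c_switch,
                    p_disk_active, c_disk):
    """Shared write term of equations (3.15) to (3.18) for `gbyte` GByte."""
    ratios = (p_server / c_server + p_switch / c_switch
              + 2 * p_disk_active / c_disk)
    return pue / u * 8 * gbyte / 3600 * ratios


def disk_idle_energy(p_disk_idle, minutes):
    """Idle energy P_diskidle * t / 60, with t in minutes."""
    return p_disk_idle * minutes / 60


def cold_write_energy(n_batches, s_batch, s_disk, t_between, t_threshold,
                      e_spin_up, e_spin_down, p_disk_idle, **hw):
    """Write energy for n_batches batches of s_batch GByte each.

    `hw` holds the hardware parameters expected by transfer_energy.
    """
    disks = math.ceil(2 * s_batch / s_disk)  # redundancy of 2, whole disks
    transfer = transfer_energy(s_batch, **hw)
    e_threshold = disk_idle_energy(p_disk_idle, t_threshold)

    if t_between > t_threshold:
        # Equation (3.17): every batch pays spin up, idle threshold, spin down.
        per_batch = transfer + disks * (e_spin_up + e_spin_down + e_threshold)
        return n_batches * per_batch

    # Equation (3.18): one spin up/down cycle; disks idle between batches.
    e_between = disk_idle_energy(p_disk_idle, t_between)
    return (e_spin_up + e_spin_down + e_threshold
            + n_batches * (transfer + disks * e_between))
```

With all other parameters fixed, calling `cold_write_energy` once with a `t_between` above the threshold and once below it shows how strongly the spin up and spin down penalties weigh on fragmented writes.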
Decision equation
With the cold storage models from the previous section a decision equation can be made like the
decision equation of the hot storage scenario. For the local data center the following equation
gives the total grams of CO2 emitted for a specific storage task:
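\[ K_{\text{stay at local data center}}(N_{\text{batches}}, S_{\text{batch}}, RT) = X_{\text{local}} \cdot \left( E_{\text{LAN local}}(N_{\text{batches}} \cdot S_{\text{batch}}) + E_{\text{write cold local}}(N_{\text{batches}}, S_{\text{batch}}) + E_{\text{cold storage local}}(N_{\text{batches}} \cdot S_{\text{batch}}, RT) \right) \tag{3.19} \]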
Note that the retention time is the time the whole data set is stored in the data center. The
time the incomplete data set is in the data center (while other batches are still arriving) is not
counted in the equation. As the retention time of cold storage is usually long, the time between
the arrival of the first batch and the arrival of the last batch is negligible.
For the remote data center a similar equation can be constructed:
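\[ K_{\text{move to remote data center}}(N_{\text{batches}}, S_{\text{batch}}, RT) = X_{\text{remote}} \cdot \left( E_{\text{LAN remote}}(N_{\text{batches}} \cdot S_{\text{batch}}) + E_{\text{write cold remote}}(N_{\text{batches}}, S_{\text{batch}}) + E_{\text{cold storage remote}}(N_{\text{batches}} \cdot S_{\text{batch}}, RT) \right) + X_{\text{transport network}} \cdot E_{\text{transport network}}(N_{\text{batches}} \cdot S_{\text{batch}}) \tag{3.20} \]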
Note that in this equation the local area network of the local data center is left out (unlike in
the model for hot storage). As it matters whether or not the data comes in fragmented, a source
outside (but close to) the local data center is assumed to be used. Directing large amounts of
data, as is usually the case with cold storage, through the local data center towards the remote
data center therefore seems inappropriate. Rather a direct connection from the source to the
remote data center is used.
Another thing that may be noted is that in both equations there is no term that accounts for
downloading the data. This is intentional. For the cold storage scenario it is assumed that the
data is not downloaded during the retention time (otherwise it would be hot data). The retrieval
of the data after the retention time has ended is not in the equations either (although such a
term can easily be constructed). For cold data it is not certain whether it will be used again
after the retention time (that is why the retention time can become quite long). Backups, for
instance, are only kept for a few months, after which they are replaced with newer versions. In
this model it is assumed that the data is not used anymore after the retention time.
Whether or not to move the cold data storage task towards the remote data center now
depends on the following inequality:
\[ K_{\text{move to remote data center}}(N_{\text{batches}}, S_{\text{batch}}, RT) < K_{\text{stay at local data center}}(N_{\text{batches}}, S_{\text{batch}}, RT) \tag{3.21} \]
For scenarios where this inequality holds, the remote data center has a smaller CO2 emission
than the local data center. If it does not hold it is the other way around, and keeping the
storage task at the local data center seems the better option to reduce CO2 emission.
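The decision of equations (3.19) to (3.21) can be sketched as follows. This is a minimal illustration with hypothetical function names; the energies are assumed to have been computed beforehand (in kWh) by the cold storage models, and the X-values are in gr. CO2 per kWh.

```python
def emission_stay_local(x_local, e_lan, e_write_cold, e_cold_storage):
    """Grams of CO2 for keeping the task local, as in equation (3.19)."""
    return x_local * (e_lan + e_write_cold + e_cold_storage)


def emission_move_remote(x_remote, e_lan, e_write_cold, e_cold_storage,
                         x_transport, e_transport):
    """Grams of CO2 for moving the task to the remote site, as in (3.20)."""
    return (x_remote * (e_lan + e_write_cold + e_cold_storage)
            + x_transport * e_transport)


def should_move(k_remote, k_local):
    """Decision of equation (3.21): move only if remote emits strictly less."""
    return k_remote < k_local
```

Because the transport term is added on top of the remote data center's own emission, a greener remote data center can still lose the comparison when the transport network is carbon intensive.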
CHAPTER 4
Tools
During the storage to energy project two tools were developed. Both were created to gain insight into the workings of the models and to answer the questions posed in the introduction. Both tools try to answer when it is more profitable to move storage tasks towards a
greener remote data center and when to stay at the local data center. The source code of the two applications can be found in the GitHub repository https://github.com/Daxez/StorageToEnergy.
4.1 Sweep
The Sweep program is a console application that needs a comma separated file as input. In the
file all the variables needed for the modelling should be defined. Up to two different variables
can be defined as ranges. When no ranges are defined the Sweep program gives the total
CO2 emission and the energy consumption in kWh of both the remote and the local data center.
This data is written to a file. If one range is specified, a two dimensional plot is made for the
range that was defined (if the -plot option is enabled). When two ranges are defined,
an interactive three dimensional plot is generated instead of a two dimensional plot. The plots
used in the results section of this document were generated with the help of the Sweep. For the
two dimensional and three dimensional cases the output data is also written to a file. These files
can be used to generate images or to perform analysis with other software.
The Sweep is written in Python. For plotting, the Matplotlib library is used. For putting
the data into the correct form (for either Matplotlib or the output files) the program relies
on the NumPy library. The program is divided into three Python modules. The constants
module contains the power consumption constants of, for example, switches and disk drives. In
the BitsToEnergy module, the hot storage and the cold storage models are implemented, as well
as some other models of the Bits to energy or energy to bits project [4].
To run the Sweep application the following options can be used (from the --help of the program):
usage: sweep.py [-h] [-s S] [-o O] [-os OS] [-nowrite] [-plot] [-plotenergy]
                file

positional arguments:
  file                 input file

optional arguments:
  -h, --help           show this help message and exit
  -s S                 Specify the seperator string in the file
  -o O, -outputpath O  specifies the outputpath
  -os OS               Specifies seperation string used in the output file
  -nowrite             Disables the write of output of the calculations to a
                       file: OutputPath/{InputFileLineNumber}.txt
  -plot                Plots the acquired data to a pyplot screen. Does this
                       for each line and continues when the plot screen is
                       closed.
  -plotenergy          Makes the plotter plot energy instead of gr.CO2. Only
                       works if -plot is enabled.
The first line of the output file is the line that was evaluated from the input file, so the
scenario is always available. The second line of the output file contains the column names of
the ranges and the output variables (so if there are no ranges, only the output variables are
on this line). After that the results are printed. For example:
#PUE local:1.5;PUE remote:1.5;PUE transport network:2.2;X local:380.0;X remote:300.0;
#X transport network:860.0;Transport network type:2.0;
#Energy source type of local datacenter:1.0;
#Cable type of local datacenter energy source:1.0;
#Distance to energy source from local datacenter:0.0;
#Energy source type of remote datacenter:1.0;
#Cable type of remote datacenter energy source:1.0;
#Distance to energy source from remote datacenter:0.0;Calculation data amount:0.0;
#CPU time:0.0;Interactive data:0.0;Retention time:[24.0:244.0|10.0];
#Download rate:0.0;Data amount:[1000.0:5000.0|1000.0];Batch size:0.0;
#Time between batches:0.0;Storage type:1.0;
Retention time;Data amount;K_keep_local;K_remote;E_keep_local;E_remote;
24.0;1000.0;746.896757216;932.738001311;1.96551778215;2.36445111548;
34.0;1000.0;782.478392099;960.828765692;2.05915366342;2.45808699675;
44.0;1000.0;818.060026981;988.919530073;2.15278954469;2.55172287802;
54.0;1000.0;853.641661863;1017.01029445;2.24642542596;2.64535875929;
64.0;1000.0;889.223296746;1045.10105883;2.34006130723;2.73899464056;
74.0;1000.0;924.804931628;1073.19182322;2.4336971885;2.83263052183;
84.0;1000.0;960.386566511;1101.2825876;2.52733306977;2.9262664031;
94.0;1000.0;995.968201393;1129.37335198;2.62096895104;3.01990228437;
...
The top line (it is broken over several lines here for readability, but in fact all the lines
starting with a ‘#’ are on one line) is a more readable version of the input file, in which
variables like ‘PUE local’ are shortened to ‘PUEloc’. It also contains variables that are needed
for the bits to energy or energy to bits project and are not explained here, for example ‘CPU time’
and ‘Interactive data’.
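For further analysis with other software, an output file in the format shown above could be read back as in the following sketch. It is not part of the Sweep: the function names are hypothetical, the default separator `;` is taken from the example, and the reading of `[24.0:244.0|10.0]` as start, end and step is an assumption inferred from the example rows (24.0, 34.0, 44.0, ...).

```python
def parse_range(value):
    """Parse '[start:end|step]' into a (start, end, step) tuple of floats.

    Returns None when `value` is a plain number rather than a range.
    """
    if not (value.startswith('[') and value.endswith(']')):
        return None
    bounds, _, step = value[1:-1].partition('|')
    start, _, end = bounds.partition(':')
    return float(start), float(end), float(step)


def parse_sweep_output(text, sep=';'):
    """Split Sweep output into (scenario, column names, result rows).

    Assumes the scenario is on a single first line starting with '#',
    the column names are on the second line and all later lines are
    numeric rows.
    """
    lines = [line for line in text.splitlines() if line.strip()]

    scenario = {}
    for item in lines[0].lstrip('#').split(sep):
        if item.strip():
            # Split on the first ':' only, because ranges contain ':' too.
            name, _, value = item.partition(':')
            scenario[name.strip()] = value.strip()

    columns = [name for name in lines[1].split(sep) if name]
    rows = [[float(v) for v in line.split(sep) if v] for line in lines[2:]]
    return scenario, columns, rows
```

The scenario values are kept as strings so that ranges remain recognisable; `parse_range` can then be applied to the entries of interest.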
4.2 Storage to energy - Web application
Another application created during the storage to energy project is the Storage to energy web
application. In short, this application is the web version of the Sweep. But unlike the Sweep
it does not accept an input file; instead it supports the user in filling out the variables
through an interface. Just like in the Sweep, up to two different ranges can be defined for
variables. But where the Sweep accepts a range for every variable, the web application supports
ranges for a more limited set of variables. For instance, the transport network cannot be given
as a range in the web application, whereas it can be a range in the Sweep.
The storage to energy web application is built on top of the Bootstrap CSS. For plotting 3D
graphs Pre3D for JavaScript is used. It is a 3D library that makes use of the HTML5 Canvas
element. This JavaScript 3D library even works on modern mobile phones. It is fully open source
and published under the BSD license. To provide feedback to the user and to retrieve all the
information from the user, a combination of jQuery, CSS (Bootstrap) and HTML is used.
The web application can be used to model scenarios. It requires minimal knowledge of the
subject, so it can be used by a wide audience.
Figure 4.1: A screenshot of the input form of the Storage to energy web application
As the web application is still in development at the time of writing, the visualization of the
results cannot be viewed yet. But once it is ready, the git repository can be “pulled” by anyone,
and as the web application only relies on JavaScript and HTML, it can be run on any computer with
a modern browser (Internet Explorer 7 and earlier versions are not supported). Figure 4.1 shows
the provisional interface of the web application.
CHAPTER 5
Results
The following sections present the results obtained by using the models explained in sections
3.2.2 and 3.2.3 and the usage of the Sweep application introduced in section 4.1.
5.1 Hot storage results
In this section the results of the hot storage models are discussed. The graphs that are
shown always contain two different functions. The blue one always represents the local data
center; the red one always represents the remote data center. Unless said otherwise, the graphs
show the outcome of the two sides of the decision equation (5.1). That is, if the red function
has a lower emission in gr. CO2 (this value is always on the z-axis) it is profitable in terms
of emission to move towards the remote data center, and if the blue function is lower, it
is more profitable to keep the task at the local data center.
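\[
\begin{aligned}
& X_{\text{local data center}} \cdot E_{\text{LAN local data center}}(N) \\
& + X_{\text{transport network}} \cdot \left( E_{\text{transport network}}(N) + E_{\text{transport network}}(RT \cdot DR) \right) \\
& + X_{\text{remote data center}} \cdot \left( E_{\text{LAN remote}}(N) + E_{\text{write remote}}(N) + E_{\text{hot storage remote}}(N, RT) + E_{\text{download from data center remote}}(RT \cdot DR) \right) \\
& < X_{\text{local data center}} \cdot \left( E_{\text{hot storage local}}(N, RT) + E_{\text{download from data center local}}(RT \cdot DR) \right)
\end{aligned}
\tag{5.1}
\]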
The results are presented in a specific structure to show the influence of several variables and
the influence they have on each other.
5.1.1 Influence of PUE and local X-values
The first comparison looked at is in figure 5.1. This 3D graph shows the local PUE plotted
against the remote PUE. The X-values of both data centers are set equal (to 400 gr. CO2/kWh).
Furthermore the data amount is 500 GByte and the retention time is 2.5 days. The X-value of
the network is set to 0. Therefore the costs of transporting the data through the transport
network towards the remote data center are not added to the total emission of the remote
data center. This is done to show the influence the PUE and X-values have on the data centers
and their emission. The costs for the local area network of the local data center, as well as the
costs of the local area network of the remote data center, are part of the function for the remote
data center. Thus figure 5.1 shows the emission of all the equipment residing in the data centers
that is needed for the storage task. What can be concluded from this figure is that the emission
of the remote data center depends not only on its own PUE (and X-value) but also on the PUE (and
X-value) of the local data center. It is worth mentioning that the remote data center, in this
scenario, has a larger emission, even at the point where the remote data center has the smallest
PUE and the local data center has the largest PUE (the functions do not cross).
Figure 5.1: Possible values for the local and remote PUEs plotted against each other. In this
figure both the remote data center and the local data center have an X-value of 400. The costs
for transporting data through the transport network are not counted (the X-value for the transport
network is 0). In terms of the decision equation, this means that the costs of transporting the
data through the LANs of the local data center and the remote data center are still counted, so
it cannot be said that no transport costs are counted at all. The data amount is 500
GByte and the retention time is 2.5 days. For generating this graph a disk size of 500 GByte is
used. The functions in the graph do not cross; the remote data center is always higher in this
example.
Figure 5.1 showed that the remote data center had a higher emission than the local data
center. As the two data centers were exactly equal this is not strange. When deciding to
transport a certain (hot storage) task it does not make sense to transport towards an equal data
center elsewhere. In figure 5.2 the X-value of the remote data center is lowered to 150 gr. CO2/kWh.
This results in a scenario where moving towards the remote data center is profitable for some
local and remote PUE values (either a high PUE for the local data center or a small PUE for
the remote data center). This means that a combination of a low or average PUE and
a lower X-value for the remote data center are prerequisites for moving the storage task to the
remote data center to reduce emission. Still only the emission caused by equipment in the local
and remote data centers is viewed. The transport network is still left out of equation (5.1).
Figure 5.2: Possible values for the local and remote PUEs plotted against each other. In this
graph the transport network is again not counted. The remote data center has a lower X-value
(150 gr. CO2/kWh). The data stored is 500 GByte and the retention time is again 2.5 days. For
generating this graph a disk size of 500 GByte is used.
The difference between figures 5.1 and 5.2 confirms the trivial assumption that the X-value can
make a difference. In figures 5.3 and 5.4 the influence of the local X-value and the remote
X-value is seen. They also reaffirm the conclusion that the emission of the remote data center
depends on the PUE and the X-value of the local data center. In both figures the PUEs remain the
same (1.8).
Figure 5.3: A plot of the possible X-values for the local data center. The PUEs of the data
centers are equal (1.8), the remote data center has an X-value of 150 gr. CO2/kWh. The transport
network is still ignored (but the LAN of the local data center is added to the total emission of
the remote data center when transporting the data).
Figure 5.4: The possible X-values of the remote data center plotted. The PUEs of the data
centers are equal (1.8), the local data center has an X-value of 400 gr. CO2/kWh. The transport
network is still ignored (but the LAN of the local data center is added to the total emission of
the remote data center when transporting the data).
5.1.2 Transport network influence
The PUEs and X-values of the data centers have been examined. But for the remote data center
the transport network is needed as well. Transporting data from the local towards the remote
data center requires energy, as is also shown in the decision equation. In figures 5.1 to 5.4 the
transport network was ignored in order to compare the influence of the X-values and PUEs of the
local and remote data center. In figure 5.5 the transport network is added to the equation to see
its influence. The local data center is not influenced by the increasing X-value of the transport
network, so only the emission of the remote data center rises, as was expected.
Figure 5.5: Plot of the transport network X-value where the PUEs of the data centers are equal
(1.8). The X-value of the remote data center (120 gr. CO2/kWh) is lower than the X-value of the
local data center (400 gr. CO2/kWh) to ensure data is moved towards a greener data center. The
PUE of the transport network is set to 2.2 and the network is of type “Internet short”.
The influence of the transport network’s X-value leads to the comparison of the network
types. The four network types (internet short distance, internet long distance, lightpath short
distance and lightpath long distance) are depicted in figure 5.6. Though the scales of the
subfigures differ, the emission of the local data center (which is constant because the
transport network does not influence the local data center) gives a relative perspective
on the change in emission of the remote data center due to the transport network’s X-value and,
especially in this figure, the network type. It can be seen that the order of the network types,
from most emitting to least emitting, is:
• Internet long distance
• Lightpath long distance
• Internet short distance
• Lightpath short distance
(a) Internet short network type
(b) Lightpath short network type
(c) Internet long network type
(d) Lightpath long network type
Figure 5.6: A range across possible X-values of the transport network types. The PUEs of the
data centers are equal (1.8). The local data center has an X-value of 600 gr. CO2/kWh and the
remote data center has an X-value of 60 gr. CO2/kWh. The retention time is 2.5 days and the total
data is 500 GByte. These figures were created with a disk size of 1 TByte (instead of 500 GByte
as in the previous figures).
5.1.3 Retention time influence
The retention time, the time the data resides on the server, has influence on the grams of CO2
emitted by the local and remote data centers. Its influence can be seen in figure 5.7. But
neither the X-value nor the type of the transport network has any influence on the CO2
emission caused by the retention time. This is obvious of course, but it can also be seen in
figure 5.7, as the X-value of the transport network does not affect the slope in the direction of
(the increase of) the retention time. In these figures the X-values of the local and remote data
center are respectively 380 and 120 gr. CO2/kWh. The transport network has a PUE of 2.2 and the
amount of data is 500 GByte.
(a) Internet short network type
(b) Lightpath short network type
(c) Internet long network type
(d) Lightpath long network type
Figure 5.7: The retention time (in hours) plotted against the X-value of the transport network to
show that the transport network has no influence on the emission caused by a longer retention
time. The PUEs of the data centers are equal (to 1.8). The remote data center with an X-value
of 120 gr. CO2/kWh is cleaner than the local data center with an X-value of 380 gr. CO2/kWh. The
transport network has a PUE of 2.2. The amount of data that needs to be stored is 500 GByte.
5.1.4 Data amount influence
In contrast to the retention time, the CO2 emitted due to the amount of data that is put into
hot storage is influenced by the transport network (when looking at the remote data center). In
figure 5.8 the X-value of the transport network is plotted against the data amount that needs
to be stored. In this figure the PUEs of the data centers are again 1.8; their X-values are 120
gr. CO2/kWh for the remote data center and 380 gr. CO2/kWh for the local data center. A retention
time of 4 days is used. For the data amounts shown in the figure the local data center stays
constant (just above 800 grams of CO2). In subfigure 5.8c the line in the function for the remote
data center from the point where the X-value is 860 gr. CO2/kWh and the data amount is 100 GByte
(lower right corner) to the point where the X-value is 60 gr. CO2/kWh and the data amount is 500
GByte (the far left corner) seems to be curved (instead of the straight lines seen in figure 5.7
for the retention time) due to the increasing slope of the lines in the data amount direction
(compared for each transport network X-value). In the other subfigures this is not as clear as in
subfigure 5.8c, as the emission due to transport is a lot higher with the long distance internet
network type than with the other network types.
(a) Internet short network type
(b) Lightpath short network type
(c) Internet long network type
(d) Lightpath long network type
Figure 5.8: The data amount (in GByte) plotted against the possible X-values of the transport
network. The PUEs of the data centers are equal (to 1.8). The remote data center with an
X-value of 120 gr. CO2/kWh is cleaner than the local data center with an X-value of 380
gr. CO2/kWh. The transport network has a PUE of 2.2. The retention time of the storage task is
4 days.
Another variable closely related to the data amount is the disk size used in the arrays of the
data centers. Figure 5.9a repeats subfigure 5.8c. Next to it is the same graph but for a smaller
disk size. In subfigure 5.9b the remote data center would be likely to surpass the local data
center in terms of CO2 emission (at the higher X-values for the transport network) if the graph
had stopped at 250 GBytes. But at exactly 250 GBytes both emissions jump to a higher level. This
behavior occurs because the disk size is 500 GByte and, as stated in section 3.2.2, the
redundancy for storing data is set to 2. So at 250 GByte of data, 500 GByte needs to be stored.
That means that if more than 250 GBytes need to be stored, another disk must be used. Due to the
large X-value of the local data center, the jump in the local data center’s function is larger
than the jump in the function of the remote data center. Although the transport network makes
the emission for moving towards the remote data center rise quickly, the jump of the local data
center when another disk is used is large enough for the remote data center to stay greener,
even at high X-values for the transport network.
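The jump at 250 GByte follows directly from the number of disks in use, which is the rounded-up term that also appears in equation (3.14). A minimal sketch (the function name is hypothetical; the redundancy of 2 and the disk sizes are taken from the text):

```python
import math


def disks_needed(data_gbyte, disk_size_gbyte, redundancy=2):
    """Number of whole disks needed to store the data with redundancy."""
    return math.ceil(redundancy * data_gbyte / disk_size_gbyte)


for data in (100, 250, 251, 500):
    print(data, disks_needed(data, 500), disks_needed(data, 1000))
```

With 500 GByte disks the disk count goes from one to two as soon as the data amount exceeds 250 GByte, while with 1000 GByte disks a single disk suffices for the whole range up to 500 GByte.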
In figure 5.10 the data amount is plotted against the retention time with the same PUEs
and X-values for the local and remote data center and a transport network with an X-value of
640 gr. CO2/kWh. This graph is interesting, as it suggests that moving towards the remote
data center is likely to be greener when the retention time grows. On the other hand, the
figure shows that staying at the local data center is greener when the data amount rises
but the retention time stays low.
(a) Internet long network type - 1000 GByte disks
(b) Internet long network type - 500 GByte disks
Figure 5.9: The data amount (in GByte) plotted against the possible X-values of the transport
network for two disk sizes. The PUEs of the data centers are equal (to 1.8). The remote data
center with an X-value of 120 gr. CO2/kWh is cleaner than the local data center with an X-value
of 380 gr. CO2/kWh. The transport network has a PUE of 2.2. The retention time of the storage
task is 4 days.
Figure 5.10: The data amount (in GByte) plotted against the retention time (in hours). The PUEs
of the data centers are equal (to 1.8). The remote data center with an X-value of 120 gr. CO2/kWh
is cleaner than the local data center with an X-value of 380 gr. CO2/kWh. The transport network
has a PUE of 2.2 and an X-value of 640 gr. CO2/kWh.
5.1.5 Download rate influence
The last variable that is shown in this section is the download rate. As said, the data that is
downloaded is transported through the transport network again (if it is downloaded from the
remote data center) because the download destination is assumed to be close to the local data
center. Therefore, for the calculation of the emission of the local data center, only the emission
caused by its local area network (and by reading the data) is added to the total amount. That
implies that the transport network (again) plays a role for the remote data center in the
calculation of the emission caused by downloading. In figure 5.11 the download rate is plotted
for each transport network type. To show the influence of the network on the emission caused by
downloading data from the remote data center, figure 5.12 shows the X-value of the transport
network (of type internet long distance) plotted against the download rate. Both figures show
that the download rate contributes to the total emission of both data centers. But because
downloading from the remote data center also involves the transport network, the download rate
has more influence on the remote data center than on the local data center, to a degree that
depends on the transport network.
(a) Internet short distance network type
(b) Lightpath short distance network type
(c) Internet long distance network type
(d) Lightpath long distance network type
Figure 5.11: Download rate (in GByte/hour) plotted for each transport network type. The PUEs of
the data centers are equal (to 1.8). The remote data center with an X-value of 120 gr. CO2/kWh is
cleaner than the local data center with an X-value of 380 gr. CO2/kWh. The amount of data is 250
GBytes with a retention time of 6 days. The transport network has a PUE of 2.2 and an X-value of
860 gr. CO2/kWh. Note that the function for the local data center has the same constant slope in
all subgraphs, as it is not influenced by the transport network type.
Figure 5.12: Download rate (in GByte/hour) against the transport network’s X-value for the
internet long transport network type. The PUEs of the data centers are equal (to 1.8). The remote
data center with an X-value of 120 gr. CO2/kWh is cleaner than the local data center with an
X-value of 380 gr. CO2/kWh. The amount of data is 250 GBytes with a retention time of 6 days. The
transport network has a PUE of 2.2. The remote data center has a bigger emission for the
maximum download rate shown in this graph if the X-value of the transport network is larger
than 350 gr. CO2/kWh. When the transport network’s X-value is 900 gr. CO2/kWh (the largest value
in the graph), the remote data center emits more than the local data center when the download
rate is bigger than 10 GByte/hour.
5.2 Cold storage results
In this section the results produced by the cold storage models are presented. As in the hot
storage results, the blue function in the graphs is always the local data center and the red
function in the graphs is always the remote data center. Both functions represent one side of the
decision equation for the cold storage scenario (repeated in equation (5.2)). On the z-axis the
emission in gr. CO2 is displayed.
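\[
\begin{aligned}
& X_{\mathrm{remote}} \cdot \big[ E_{\mathrm{LAN}_{\mathrm{remote}}}(N_{\mathrm{batches}} \cdot S_{\mathrm{batch}}) + E_{\mathrm{write\,cold}_{\mathrm{remote}}}(N_{\mathrm{batches}}, S_{\mathrm{batch}}) \\
& \qquad + E_{\mathrm{coldstorage}_{\mathrm{remote}}}(N_{\mathrm{batches}} \cdot S_{\mathrm{batch}}, RT) \big] + X_{\mathrm{transport\,network}} \cdot E_{\mathrm{transport\,network}}(N_{\mathrm{batches}} \cdot S_{\mathrm{batch}}) \\
& < \\
& X_{\mathrm{local}} \cdot \big[ E_{\mathrm{LAN}_{\mathrm{local}}}(N_{\mathrm{batches}} \cdot S_{\mathrm{batch}}) + E_{\mathrm{write\,cold}_{\mathrm{local}}}(N_{\mathrm{batches}}, S_{\mathrm{batch}}) \\
& \qquad + E_{\mathrm{coldstorage}_{\mathrm{local}}}(N_{\mathrm{batches}} \cdot S_{\mathrm{batch}}, RT) \big]
\end{aligned}
\tag{5.2}
\]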
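The comparison made by equation (5.2) can be sketched in a few lines of Python. This is only an illustration: the class `ColdSiteEnergy`, the function `remote_is_cleaner` and all energy values are hypothetical, and the energy terms are assumed to be given in kWh with the PUE already included.

```python
from dataclasses import dataclass


@dataclass
class ColdSiteEnergy:
    """Energy terms (kWh) of one data center; the PUE is assumed to be folded in."""
    lan_kwh: float      # E_LAN
    write_kwh: float    # E_write cold
    storage_kwh: float  # E_coldstorage

    def total_kwh(self) -> float:
        return self.lan_kwh + self.write_kwh + self.storage_kwh


def remote_is_cleaner(x_remote, remote, x_transport, transport_kwh, x_local, local):
    """Evaluate both sides of decision equation (5.2); emissions are in gr. CO2."""
    remote_emission = x_remote * remote.total_kwh() + x_transport * transport_kwh
    local_emission = x_local * local.total_kwh()
    return remote_emission < local_emission, remote_emission, local_emission


# Hypothetical energy values, chosen only to illustrate the comparison.
site = ColdSiteEnergy(lan_kwh=1.0, write_kwh=2.0, storage_kwh=7.0)
for transport_kwh in (2.0, 0.5):
    move, remote_gr, local_gr = remote_is_cleaner(300.0, site, 860.0, transport_kwh, 380.0, site)
    print(f"transport {transport_kwh} kWh: remote {remote_gr} gr, local {local_gr} gr, move: {move}")
```

With identical energy terms on both sides, the decision depends only on whether the difference in X-values outweighs the emission of the transport network.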
The results are presented in a specific structure to show the influence of several variables and
the influence they have on each other.
5.2.1 Influence of PUE and local X-values
As with the hot storage model, the PUEs of the data centers are first plotted against each other. Figure 5.13 shows this comparison. It can be noted that in the case of cold storage the total emission of the remote data center does not depend on the local PUE. This is because of the form of the decision equation explained in section 3.2.3: for cold storage the source of the data lies close to the local data center, but data transported to the remote data center does not necessarily pass through the local data center. Thus the graph shows the same function twice, in different directions, as one would expect.
Figure 5.13: Local PUE plotted against the remote PUE. Both have an X-value of 400 gr. CO2/kWh.
The transport network is not added to the total emission of the remote data center (its X-value
is set to 0). Furthermore this data comes in a whole batch of 10000 GByte and has a retention
time of 25 days.
Figure 5.13 also implies that the X-values of local and remote data centers behave as expected.
That is, when the local X-value rises, the emission of the local data center rises and the emission
of the remote data center stays constant. If the remote X-value rises, the emission of the remote
data center rises and the emission of the local data center stays constant. The figures to show
this behavior are left out for brevity.
5.2.2 Transport network influence
In the cold storage decision equation (eq. (5.2)) the emission of the transport network when moving the data is also added to the total emission of the remote data center. Figure 5.14 shows the influence of the transport network on the remote data center (the transport network does not influence the local data center). Figure 5.14 also depicts the four transport network types. The order of transport network types from most emitting to least emitting stated in the previous section also holds for cold storage (this is quite obvious, as the calculation of the emission of the transport network is no different for cold storage than it is for hot storage).
Figure 5.15 shows the same scenario as figure 5.14c, but here the remote data center has a
lower X-value. This shows that the assumption that a lower X-value would result in a lower
emission also holds for cold storage scenarios.
(a) Internet short network type
(b) Lightpath short network type
(c) Internet long network type
(d) Lightpath long network type
Figure 5.14: Transport network X-value for cold data storage. The PUEs of the local and remote
data centers are equal (to 1.5). The other values, X-values, data amount and retention time are
the same as in figure 5.13
Figure 5.15: Transport network X-value for the cold storage scenario. The PUEs of the local and remote data centers are equal (to 1.5). The X-values of the local and remote data center are respectively 400 and 300 gr. CO2/kWh. The transport network type is the long distance internet type.
5.2.3 Retention time influence
In the section about the results of the hot storage models, the retention time was plotted against the X-value of the transport network. In the cold storage scenario, the retention time is usually longer than it is for hot storage. In figure 5.16 the retention time is plotted against the transport network's X-value with the PUEs equal to 1.5 (transport network 2.2), local and remote X-values of respectively 380 and 300 gr. CO2/kWh and a data amount of 10000 GByte. Again, it can be seen that the transport network has no influence on the emission caused by the local data center. The emission of the remote data center, however, is affected: for a high transport network X-value, placing the task at a remote data center is more polluting when dealing with short retention times. But the slope of the local data center in the direction of increasing retention time is steeper than the slope of the remote data center, even at a high transport network X-value.
Figure 5.16: Transport network X-value plotted against the retention time (in hours). The PUEs of the local and remote data centers are equal (to 1.5). Their X-values are respectively 380 and 300 gr. CO2/kWh and the transport network type is set to internet short distance. The data amount is 10000 GByte.
5.2.4 Data amount influence
As was expected, the transport network does not seem to have influence on the emission of CO2
caused by a longer retention time. But for the data amount this does not hold. The transport
network is assumed to have influence on the CO2 emitted by increasing the data amount. Figure
5.17 shows this for the internet short distance transport network type (with the PUEs, X-values
the same as in the previous figures) with a retention time of 25 days. Just as with figure 5.8c
that was discussed in the previous section, it can be seen in the graph that for the high X-values
of the transport network, the slope of the data amount is steeper than it is for low X-values of
the transport network when looking at the remote data center.
Figure 5.17: Transport network X-value plotted against the data amount in GByte. The PUEs of the local and remote data centers are equal (to 1.5). Their X-values are respectively 380 and 300 gr. CO2/kWh and the transport network type is set to internet short distance with a PUE of 2.2. The retention time is set to 25 days.
In figure 5.18 the retention time is plotted against the data amount. The transport network, which is of type lightpath short distance, has an X-value of 860 gr. CO2/kWh and a PUE of 2.2. The PUEs of the local and remote data center and their X-values remain the same as in previous figures. Both the retention time and the data amount have a large domain in this graph. But it should not be strange that a lot of data (in this graph a maximum of 45 TByte) is stored for a long time (in this graph a maximum of more than a hundred days) in a cold storage scenario.
Figure 5.18: The retention time plotted against the data amount. Both PUEs are equal (to 1.5). The X-values of the local and the remote data center are respectively 380 and 300 gr. CO2/kWh. The transport network type is set to the lightpath short distance with a PUE of 2.2 and an X-value of 860.
5.2.5 Disk size influence
Not only the emission of the transport network and the local area networks are influenced by the
data amount but also the emission caused by the storage of data itself. This is more appropriate
to show in the cold storage section as cold storage usually involves a higher data amount, but
the hot storage scenario is also affected by the behavior depicted in figure 5.19. In this figure the
CO2 emission rises gradually due to the transport network and the local area networks of the
data centers. But with every step of 1000 GBytes both functions make a leap. This behavior
can be explained by the size of the disks. When the disk arrays of the data centers contain disks with a size of 2000 GByte, every extra 1000 GBytes of data to be stored will result in using a new disk (as 1000 GByte of data takes 2000 GByte of storage due to redundancy). For the graph the remote X-value has been set to 230 gr. CO2/kWh, the local X-value remains 380 gr. CO2/kWh and both PUEs are 1.5. The transport network of type internet short distance has an X-value of 860 gr. CO2/kWh and a PUE of 2.2. The retention time is 25 days.
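The stepwise behavior can be illustrated with a small Python sketch. It assumes a redundancy factor of 2 and a disk size of 2000 GByte, as described above. The function names are hypothetical, and the emission function is a simplification that keeps every used disk at one constant power level for the whole retention time.

```python
import math


def disks_needed(data_gbyte, disk_size_gbyte=2000.0, redundancy_factor=2.0):
    """Number of disks occupied by the data, including redundant copies."""
    return math.ceil(data_gbyte * redundancy_factor / disk_size_gbyte)


def storage_emission_gr(data_gbyte, hours, disk_power_w, pue, x_value,
                        disk_size_gbyte=2000.0):
    """Simplified storage emission (gr. CO2): disks * power * time * PUE * X."""
    disk_kwh = disks_needed(data_gbyte, disk_size_gbyte) * disk_power_w * hours / 1000.0
    return disk_kwh * pue * x_value


# Each additional 1000 GByte of data occupies one more 2000 GByte disk.
for data in (1000, 1001, 2000, 2001):
    print(data, "GByte ->", disks_needed(data), "disk(s)")
```

Because the emission is proportional to the number of disks, each leap in the graph has the same height for a given PUE and X-value.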
Figure 5.19: A zoomed-in view of the data amount. The PUEs of the local and remote data centers are equal (to 1.5). Their X-values are respectively 380 and 230 gr. CO2/kWh and the transport network type is set to internet short distance with an X-value of 860 gr. CO2/kWh and a PUE of 2.2. The retention time is 25 days.
5.2.6 Data accumulation influence
Where the cold storage scenario differs (greatly) from the hot storage scenario is the influence of
the accumulation of the data. The accumulation of cold data in batches of a predetermined size
with a waiting time between them should have effect on the cold storage scenario as a penalty
is paid for every startup and shutdown operation on a disk as well as the idle time caused by
the shutdown policy of the cold storage method. In figure 5.20 the influence of this batch size is
shown for each transport network type with a data amount of 5000 GByte. As can be seen, the
transport network type only affects the position of the remote data center function, just as the
local and remote X-values, PUEs and the retention time would do. The curve of the function
remains the same. This behavior, as said, is unique for the cold storage scenario and can lead
to a general rule about data accumulation for cold storage (either remote or local). This will be
discussed in the next section.
(a) Internet short network type
(b) Lightpath short network type
(c) Internet long network type
(d) Lightpath long network type
Figure 5.20: The influence of the batch size. This figure is to show the hyperbolic curve as
a result of a small batch size. The data amount is 5000 GB. The PUEs and other variables
determine the height at which the curve lies.
CHAPTER 6
Discussion
How can the results presented in the previous chapter be interpreted? Can decisive factors be
derived on the basis of these results? And are there general rules that can be defined for the question
of moving storage tasks towards a remote data center to reduce CO2 emission? This chapter
tries to answer these questions on the basis of the results in the previous chapter.
6.1 Hot storage
The first thing that can be stated is that both the PUEs and the X-values of the data centers are important factors. This is of course shown by the models themselves, as each term in the model eventually gets multiplied by the PUEs and the X-values. As the X-value is always a larger term than the PUE, it can definitely be stated that, in the case of CO2 emission, the X-value is more decisive. Graph 5.1 shows the PUEs plotted against each other for both the local and the remote data center and graph 5.3 shows the influence of the X-value of the local data center. If these graphs are viewed next to each other it can be stated that the grams of CO2 produced grow faster with the rise of the X-value than with the rise of the PUE. Figure 6.1 shows this even more clearly. But again, this behavior is trivial and can easily be deduced from the models. One can question whether it is better to have a significantly lower PUE, and thus a lower energy consumption, or to have a lower X-value, which results in a lower CO2 emission. In the case of sustainability and the reduction of CO2 emission a lower X-value would be sufficient.
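The trade-off between a lower PUE and a lower X-value can be made concrete with a minimal Python sketch. It assumes the simplified relation emission = IT energy × PUE × X-value, which follows from the statement that each term in the model is multiplied by both factors. The function name and the task size of 100 kWh are hypothetical. The PUEs and X-values are taken from the figures of the previous chapter.

```python
def emission_gr(it_energy_kwh: float, pue: float, x_value: float) -> float:
    """Grams of CO2 for a task: IT energy scaled by the PUE, then by the X-value."""
    return it_energy_kwh * pue * x_value


# Hypothetical task that needs 100 kWh of IT energy.
efficient_but_dirty = emission_gr(100.0, pue=1.5, x_value=380.0)
wasteful_but_clean = emission_gr(100.0, pue=1.8, x_value=120.0)

print(f"PUE 1.5, X 380: {efficient_but_dirty:.0f} gr. CO2")
print(f"PUE 1.8, X 120: {wasteful_but_clean:.0f} gr. CO2")
```

In this example the data center with the higher PUE consumes more energy but still emits less CO2, because its energy source is cleaner.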
Figure 6.1: The local PUE plotted against the local X-value for 4 days of storage of 500 GByte
with no download rate.
In the hot storage model, the remote data center is also influenced by the PUE of the local data center. This behavior can be explained by the fact that in the model the data that needs to be transferred to the remote data center passes the LAN of the local data center. This is also shown in figure 5.1. Figure 5.2 then shows that the remote data center has to have a significantly lower X-value to be profitable in terms of emission. In this case the X-value is more than two and a half times as low. This, as figure 5.1 shows, can only be done with a lower X-value for the remote data center. Note that the emission of the transport network is left out of these graphs and has to be compensated for by the remote data center as well.
Where the local data center is influenced by its PUE and X-value, the emission calculated
for the remote data center also includes the transportation of the data. Figures 5.5 and 5.6 show
the influence of both the X-value of the transport network and the transport network type. Even
with the low X-value of the remote data center these graphs show a quick rise in emission with
the rise of the transport network’s X-value. Figure 5.6 also shows that the network type has an
influence. Comparing figures 5.6b and 5.6c leads to the conclusion that
the latter, the internet long distance type network, almost doubles the emission in comparison to
the short distance lightpath type. But even for the cleanest network type, the X-value has to be
in the lower half of the spectrum for the remote data center to be profitable.
Up to this point it can be said that the transport network has a significant influence on the emission of moving tasks towards the remote data center. But the emission of the transport network does not only depend on the X-value of the transport network itself. Figure 5.8 shows that the data amount of the storage task plays a role as well: for smaller data amounts the influence of the network on the total emission is reduced. This is a trivial conclusion, as sending less data through a network will always result in a smaller amount of energy used. But does this mean that only small amounts of hot data are suitable for storage on a remote data center? Figure 5.9b showed that by using smaller disks, the network made a smaller difference. Because disks are always in an active state in the hot storage scenario, doubling the number of disks gave the emission caused by the storage itself a more important role. Furthermore, in figure 5.6b the transport network X-value had to be in the lower segment (smaller than 450 gr. CO2/kWh) for the remote data center to be profitable. In figure 5.8b the remote data center seems to be more profitable even with high X-values for the transport network, not only for low data amounts but also for the 500 GByte used in figure 5.6b. The difference between the two graphs lies in the retention time. In figure 5.6b a retention time of two and a half days is used, whereas in figure 5.8b the retention time is four days. This means that if the retention time is increased, the influence of the transportation costs of the storage task is decreased. Even more so as, on the basis of figure 5.7, it was concluded that the X-value of the transport network does not affect the increase of emission caused by the retention time. That is, the slope of the retention time is not affected by a larger X-value of the transport network, unlike the slope of the data amount. To show the link between the data amount and the retention time even more clearly, figure 5.10 shows that it is more profitable to keep the storage task at the local data center in case of a large data amount. But it also shows that if the retention time is long enough a cleaner remote data center will become more profitable.
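The interplay between data amount and retention time can be illustrated with a rough break-even estimate in Python. This is a deliberately simplified sketch, not the hot storage model itself: it ignores the local area networks and downloads, assumes equal PUEs and equal storage power at both data centers, and treats the transport energy per GByte and the storage power as hypothetical inputs.

```python
import math


def break_even_hours(data_gbyte, transport_kwh_per_gbyte, x_transport,
                     storage_kw, pue, x_local, x_remote):
    """Retention time (hours) after which a one-off transport emission is repaid
    by the lower emission per hour of storage at the remote data center."""
    saving_per_hour = storage_kw * pue * (x_local - x_remote)  # gr. CO2 per hour
    if saving_per_hour <= 0:
        return math.inf  # the remote data center is not cleaner; never repaid
    transport_emission = data_gbyte * transport_kwh_per_gbyte * x_transport  # gr. CO2
    return transport_emission / saving_per_hour


# Hypothetical values: 0.01 kWh per GByte transported and 0.5 kW of storage power.
hours = break_even_hours(500.0, 0.01, 860.0, 0.5, 1.8, 380.0, 120.0)
print(f"break-even after roughly {hours:.1f} hours")
```

Under these assumptions the break-even retention time grows linearly with the data amount and with the X-value of the transport network, and shrinks as the difference between the local and remote X-values grows.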
This does not entirely cover the hot storage scenario. Another variable that can influence the decision to move or not to move is the download rate. For the local data center the download rate increases the usage of the storage components and the local area network. But for the remote data center, the transport network forms a part of the emission caused by downloading as well, as can be seen in figure 5.11. In this figure the data amount is not as high as in previous examples (250 GByte) and the retention time is even longer than the 4 days in figure 5.8. The download rate can undo the profit gained by a longer retention time. Only with the short distance lightpath does the remote data center seem able to realize a smaller emission as the download rate rises. It should be noted, though, that the transport network is significantly less clean than in the previous examples, with an X-value of 860 gr. CO2/kWh. With the lower X-value of the previous examples (640 gr. CO2/kWh) the short distance internet type and the long distance lightpath type would remain profitable even if the whole dataset is downloaded each hour (figure 6.2).
Figure 6.2: A plot of the download rate for the long distance lightpath type. The variables are the same as in figure 5.12 but the transport network has an X-value of 640 gr. CO2/kWh.
6.2 Cold storage
As with the hot storage scenario, the PUE and the X-values of the data centers are of course
of importance for the decision to move a storage task towards a remote data center. But unlike
the hot storage scenario, the remote data center of the cold storage scenario is not dependent
on the values of the local data center. This is a direct result of the decision equation of the cold
storage scenario which is discussed in section 3.2.3. The effects of this decision can be seen by
comparing figures 5.1 and 5.13. In the latter the two data centers act as if they were the same
data centers. Of course, the transport network is left out in this figure. As the PUEs of the
data centers behave in such a predictable way, so will the X-values of the data centers. This is
not shown in any figure, but a quick glance at the models for cold storage shows that this is a
safe assumption. If the same reasoning as in the hot storage scenario is used, this should mean
that the X-value (or the PUE) of the remote data center can be higher than in the hot storage
scenario while still creating a profit. But do the same rules apply for the cold storage scenario?
Figure 5.14 shows that the transport network, both its X-value and its type, influences the emission calculated for the remote data center. The X-values of the data centers are the same, and thus the remote data center is always more polluting than the local data center. Figure 5.15 shows that a lower X-value for the remote data center results in a far more positive situation for moving the storage task toward the remote data center, especially since the long distance internet network type is used. This leads to the question of how the emission by the transport network is affected by the increase of the data amount. Figure 5.17 shows the data amount plotted against the transport network X-value for the short distance internet network type. With a retention time of 25 days it shows that, as in the hot storage model, the gradient of the remote data center in the direction of the data amount gets steeper when the X-value of the transport network rises. This is obvious, but is a retention time of 25 days normal for cold storage or should it be longer? One can imagine that there are cases where the retention time could be several months.
Figure 6.3: The retention time plotted against the data amount. Both PUEs are equal (to 1.5). The X-values of the local and the remote data center are respectively 380 and 300 gr. CO2/kWh. The transport network type is set to the internet long distance with a PUE of 2.2 and an X-value of 860.
The retention time in figure 5.16 behaves as expected. The longer the retention time gets (which again is obviously independent of the transport network), the more the total emission favors the cleanest data center. But the retention time has to be relatively longer for cold storage than for hot storage to give the desired effect. This is the result of two factors in the cold storage scenario. Firstly, the data transported from the data source to the data centers is usually set to be a lot larger than for hot storage. This is not a strange assumption as there is less information needed than there is available. For example, 60% of the data in Yahoo clusters is cold [11]. Because the cold datasets are a lot larger, the transport network has a bigger influence when moving the data towards a remote data center. A longer retention time on a cleaner remote data center could make up for this large transport network influence. Secondly, because the disks on which this cold data is stored are in standby mode, a low power mode, the total emission of the disks decreases. Thus for the cleaner remote data center to be cleaner in practice, the retention time should increase significantly (more than in a hot storage scenario) to make up for both “losses”. The function of the remote data center in figure 5.18 may suggest otherwise, as the retention time of 105 days is sufficient to make up for the transport network costs even for large datasets. But in this graph a relatively clean transport network is used. When looking at figure 6.3 the local data center turns out to be the less emitting option. The cold storage scenario seems to be more dependent on the transport network. In the first paragraph of this section the question was posed whether a remote data center with an X-value that is somewhat higher than in the hot storage scenario, but still cleaner than the local data center, could be profitable. This could be the case, as the transport network seems to be a far more important factor.
A factor that could influence the emission caused solely by the storage of the data (not the transport or accumulation) is the size of the disks. Figure 5.19 shows that as the data amount increases, more disks are used (as expected) for storing the data. The size of the leap made per disk depends on the X-value and the PUE of the data center: with a large X-value or PUE the leap per disk will be higher. Thus the usage of smaller disks can result in more emission caused by the data centers (and not the transport network). In figure 6.4 the effect of smaller disks is shown. Due to the smaller disks the remote data center will be more profitable if the data amount and the retention time are large. The usage of small disks does however result in a higher emission. So it can be said that size does indeed matter.
Figure 6.4: The retention time plotted against the data amount. Both PUEs are equal (to 1.5). The X-values of the local and the remote data center are respectively 380 and 300 gr. CO2/kWh. The transport network type is set to the internet long distance with a PUE of 2.2 and an X-value of 860. The disk size used in figure 6.3 is four times as large as in this graph (2000 GByte versus 500 GByte per disk).
In figure 5.20 the accumulation of data is examined. Due to the cold storage method (Massive Arrays of Idle Disks), the accumulation of data can result in higher emission for both the local and the remote data center. The function for both data centers is a hyperbolic one. It has two limits. The first limit is when the size of the batches approaches zero. In this case the number of batches approaches infinity, and for every batch the disks have to be spun up and down and have an idle time in which they wait for the next instruction (in case this instruction quickly follows the previous one). The other limit is the total emission if the whole data set arrived in a single batch. In general, sending the data in small batches towards the data centers should be avoided.
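The hyperbolic shape can be reproduced with a minimal Python sketch of the accumulation overhead. It assumes that every batch costs one spin-up and spin-down cycle plus one full idle period on a single disk. The idle power (6.5 W) and the idle time threshold (5 minutes) are taken from Appendix A. The function name and the spin cycle energy of 100 J are hypothetical.

```python
import math


def accumulation_overhead_joule(total_gbyte, batch_gbyte, spin_cycle_joule,
                                idle_power_w=6.5, idle_seconds=300.0):
    """Overhead energy (J) caused by receiving the data in batches.

    Each batch is assumed to cost one spin-up/spin-down cycle plus one idle
    period in which the disk waits before it is shut down again.
    """
    if batch_gbyte <= 0:
        raise ValueError("batch size must be positive")
    batches = math.ceil(total_gbyte / batch_gbyte)
    return batches * (spin_cycle_joule + idle_power_w * idle_seconds)


# 5000 GByte in total, as in figure 5.20; the spin cycle energy is hypothetical.
for batch in (1, 2, 10, 100, 5000):
    overhead = accumulation_overhead_joule(5000, batch, spin_cycle_joule=100.0)
    print(f"batch size {batch:>4} GByte: overhead {overhead:.0f} J")
```

Halving the batch size doubles the overhead, so the overhead grows without bound as the batch size approaches zero and reaches its minimum when all data arrives in a single batch.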
CHAPTER 7
Conclusions and recommendations for
future research
7.1 Recommendations for future research
Virtualization In both models, the storage arrays contain a number of disks. When data is
written to these disks, it is assumed that only the number of disks sufficient for the amount of data
is used. The other disks can stay either active (hot storage) or standby (cold
storage) in this case. But with the rise of virtualization [3] it could well be the case that the
data is written to a larger number of disks and resides fragmented on these disks.
Hot/cold storage scenario Two data storage scenarios have been discussed in this document.
Data that needs to be stored was either hot or cold in the models presented here. But in [11] it
is shown that data is not always either hot or cold. Data can also migrate from hot status to
cold status and the other way around. A scenario where data consists of hot and cold data can
be interesting to research (especially given a migration from hot to cold or the other way around
within the server).
Cold storage method The cold storage method proposed here has been called “cool storage” by
ICT infrastructure provider SURFSara. Their opinion was that the method proposed in this
document could be used as a buffer layer before (even) colder storage like tape libraries. What
influence would a more layered approach to cold storage have, especially in combination with the previous
recommendation (the mixed storage scenario)? Data can move from a hot and active storage
facility to a cooler facility (like the cold storage method described in this document) and
then to a tape library with use of a tape robot. Can this layered approach in itself have a
positive effect on the energy efficiency of data storage?
Timing In this project it is not researched whether or not there are better times of the day to
move data around. One can imagine that moving data through a transport network, or even a
local area network of a data center can make a difference in the amount of CO2 emitted due to
a higher utilization of the network components.
Download destination For hot storage, the modeling of the destination of the data that is downloaded could be
improved. How can a changing download destination be modeled, or can there be an
average download destination? And where should that be?
7.2 Conclusions
By using the models described in sections 3.2.2 and 3.2.3 an approximation can be made of the
greenhouse gas emission (or more specifically the CO2 emission). The several infrastructures (like
the local area network, the storage infrastructure and equipment) that are used for storage in data
centers are described by their individual hardware components. Because the infrastructures are
described as the sum of their loose components, new models with other infrastructure and other
setups can (easily) be made. By calculating the energy used by these infrastructures and the
energy source used by the data center, an amount of CO2 emission can be calculated. But what
information is needed to calculate this emission? To calculate the emission for a specific storage
task on a data center, the power usage effectiveness (PUE) of the data center and its energy
source are needed. Also specific information about the storage array the data is going to be
written to, like the disk size and the storage redundancy can be of importance. Some specific
variables of the storage tasks are the amount of data that is used and the time this data resides
on the servers of the data center (the retention time). For the hot storage scenario the way the
data is consumed (how many times the data is going to be downloaded during the retention
time) can be added and for the cold storage scenario, the way the data is accumulated on the
server is an interesting variable to consider. But in order to make a decision whether or not to
move a certain storage task towards a remote data center information about the network that
connects the two may be just as important. The type of the transport network and its energy
source are both important in the models.
But what are the decisive factors when deciding to move a storage task from a local data center
towards a remote data center, in order to reduce greenhouse gas emission? Unfortunately, this is
not always clear. Certainly in the hot storage scenario the variables influence each other greatly.
First of all, the remote data center actually needs to be cleaner. A cleaner energy source proves
to be more effective in reducing emission than the power usage effectiveness of the data center.
It is interesting to note that more energy consumption does not necessarily increase the CO2
emission. When the remote data center is cleaner, other factors start to play a role for the hot
storage scenario. For example, if the data amount is rather large compared to the retention time,
the transportation of the data from the local to the remote data center causes a relatively large
emission when compared to the emission caused by solely storing the data on the remote data
center. Especially when the transport network is not clean, the emission of the transport network
becomes a big part of the total emission when moving towards the remote data center. On the
other hand, even when the data amount is large, the emission of the data center itself reduces
the influence of the transport network, if the retention time is large as well. This may make the
remote data center a cleaner option, even with a dirtier transport network. But that does not
hold for every scenario. Even when the ratio between the data amount and the retention time
is in favor of the remote data center, the consumption of the data might throw a spanner in the
works for the remote data center. When the data is downloaded often, the transport network
plays a bigger role for moving towards the remote data center again. This can undo the relatively
lower emission of the remote data center caused by a long retention time.
For the cold storage scenario, the transport network plays an even bigger role in the decision
to move or not. This is a consequence of the low energy consumption by the storage devices
in the cold storage scenario. In the ratio between the data amount and the retention time,
the latter has to be larger to have effect (on the influence of the transport network) than it
has to be in the hot storage scenario. On the other hand, that also implies that small data
amounts with a medium retention time are quicker to be green on a less emitting remote data
center. Again, it is all about the ratio between the two, even more so for the cold storage method.
In the results section, data accumulation is also discussed. It can be concluded that, when using Massive Arrays of Idle Disks as discussed in section 3.2.3, a fragmented way of data accumulation is good for neither the local nor the remote data center. A general rule for cold storage, to save on CO2 emission, is to accumulate the data in larger batches rather than sending every small piece of cold data towards the server. Depending on spin down policies, sending small batches (in this model smaller than 2 GByte) can result in a steep (hyperbolic) growth of CO2 emission.
It is clear that the decision to move a storage task towards a greener remote data center is
not always straightforward and requires knowledge of the data centers as well as the data in
question. When to move a storage task towards a greener remote data center is not a trivial
question and can be hard to answer. The use of tools like the Sweep (although it is more inclined
to be used in research) and the Storage to energy web application both discussed in section 4,
are necessary to give insight into specific scenarios. With the help of tools like these, public and
private institutions can examine their own scenarios and decide: “to move or not to move”.
Bibliography
[1] N. Leavitt. Is cloud computing really ready for prime time? Computer, jan. 2009.
[2] Andreas Berl, Erol Gelenbe, Marco Di Girolamo, Giovanni Giuliani, Hermann De Meer,
Minh Quan Dang, and Kostas Pentikousis. Energy-efficient cloud computing. The Computer
Journal, 2010.
[3] Yuhui Deng and Brandon Pung. Conserving disk energy in virtual machine based environments by amplifying bursts. Computing, 91(1):3–21, 2011.
[4] P. Grosso, A. Taal, and F. Bomhof. Transporting bits or transporting energy: does it matter?
2013.
[5] J. Baliga, R.W.A. Ayre, K. Hinton, and R.S. Tucker. Green cloud computing: Balancing
energy in processing, storage, and transport. Proceedings of the IEEE, 2011.
[6] Robert Basmadjian, Nasir Ali, Florian Niedermeier, Hermann de Meer, and Giovanni Giuliani. A methodology to predict the power consumption of servers in data centres. In
Proceedings of the 2nd International Conference on Energy-Efficient Computing and Networking, pages 1–10. ACM, 2011.
[7] Dennis Colarelli and Dirk Grunwald. Massive arrays of idle disks for storage archives. In
Proceedings of the 2002 ACM/IEEE conference on Supercomputing, Supercomputing ’02,
2002.
[8] Yuhui Deng, Frank Wang, and Na Helian. EED: energy efficient disk drive architecture.
Information Sciences, 178(22):4403–4417, 2008.
[9] Krishna Kant. Data center evolution: A tutorial on state of the art, issues, and challenges.
Computer Networks, 53(17):2939–2965, 2009.
[10] M. Stansberry and J. Kudritzki. Uptime institute 2012 data center industry survey. 2012.
[11] Rini T Kaushik and Milind Bhandarkar. GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In Proceedings of the USENIX Annual Technical
Conference, 2010.
[12] Hewlett-Packard Development Company. QuickSpecs - HP 3PAR StoreServ 10000 Storage.
2013.
Appendix A - Constants
In the implementation of the models different constants are used. The following list gives an
overview of their values and their origin.
• Content server - 360 W per Gbit/second (power consumption / capacity), from [4]
• Router - 12 W per Gbit/second (power consumption / capacity), from [4]
• Firewall - 16 W per Gbit/second (power consumption / capacity), from [4]
• DWDM - 0.8 W per Gbit/second (power consumption / capacity), from [4]
• Switch - 230 W per Gbit/second (power consumption / capacity), from [4]
• Hard disk (active) - 11.75 W, see footnote 1
• Hard disk (idle) - 6.5 W, see footnote 1
• Hard disk (standby) - 1.5 W, see footnote 1
• Hard disk capacity - 10.0 Gb/second, see footnote 1
• Spin up energy - 2.4e−6, from [3]
• Spin down energy - 1.1e−6, from [3]
• Idle time threshold (cold storage) - 5 minutes
• Idle time threshold (cold storage) - 5 minutes
The utilization of the switches, firewalls, routers and DWDMs in the transport network is
25%. The utilization of the network components in the local area network of the data centers is
50%, and the utilization of the storage hardware (content server and disks) is 100%. These figures
are based on the Bits to energy or energy to bits project [4].
1 Based on information from the HP Store server specification sheet[12] and the work of Colarelli et al.[7] as
well as the Bits to energy or energy to bits project[4].
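As an illustration of how the network constants and utilization figures above could be combined, the following minimal Python sketch computes an energy cost per Gbit for a device. It rests on an assumption that is not taken from this appendix: that the power-per-capacity value is divided by the utilization, as is common in energy-per-bit models such as the one in [5]. The names POWER_PER_CAPACITY, UTILIZATION and energy_per_gbit are hypothetical and chosen for this example only; the models in section 3 remain the authoritative definition.

```python
# Power consumption per unit of capacity, in W per Gbit/second (Appendix A).
# 1 W per Gbit/second equals 1 joule per Gbit at full load.
POWER_PER_CAPACITY = {
    "content_server": 360.0,
    "router": 12.0,
    "firewall": 16.0,
    "dwdm": 0.8,
    "switch": 230.0,
}

# Utilization figures stated in Appendix A.
UTILIZATION = {
    "transport": 0.25,  # switches, firewalls, routers, DWDMs in the transport network
    "lan": 0.50,        # network components inside a data center
    "storage": 1.00,    # content server and disks
}


def energy_per_gbit(device: str, utilization: float) -> float:
    """Return joules per Gbit for `device` at the given utilization.

    ASSUMPTION (not stated in the appendix): a device that is only
    partly utilized still draws its full power, so the energy per
    transferred Gbit is the power-per-capacity value divided by the
    utilization.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the interval (0, 1]")
    return POWER_PER_CAPACITY[device] / utilization


if __name__ == "__main__":
    for name in ("router", "firewall", "dwdm"):
        joules = energy_per_gbit(name, UTILIZATION["transport"])
        print(f"{name}: {joules:.1f} J per Gbit in the transport network")
```

Under this assumption, lower utilization raises the energy attributed to each transferred Gbit, which is one way to see why the transport network can weigh heavily in the comparison between a local and a remote data center.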