
Managing Distributed Memory to Meet Multiclass Workload Response Time Goals
Markus Sinnwell
SAP AG, Business Information Warehouse
P.O. Box 1461, 69185 Walldorf, Germany
E-Mail: [email protected]
(This research was conducted while the author was at the University of the Saarland.)

Arnd Christian König
Department of Computer Science, University of the Saarland
P.O. Box 151150, 66041 Saarbrücken, Germany
E-Mail: [email protected]
Abstract
In this paper we present an online method for managing a goal-oriented buffer partitioning in the distributed memory of a network of workstations. Our algorithm implements a feedback mechanism which dynamically changes the sizes of dedicated buffer areas, and thereby the buffer hit rates of the different classes, in such a way that user-specified response time goals are satisfied. The aggregate size of the buffer memory across all network nodes remains constant; only the partitioning is changed. The algorithm is based on efficiently approximating the trajectory of the per-class response time curves as a function of the available buffer. Changes in the workload that would lead to violations of response time goals are counteracted by adjusting the buffer allocation accordingly. For local replacement decisions, we integrate a cost-based buffer replacement algorithm into our goal-oriented approach. We have implemented our algorithm in a detailed simulation prototype, and we present first results obtained from this prototype.
1. Introduction
For database requests, varying resource consumption combined with different user priorities leads to large variations in response times. For example, there is an increasing number of systems in which – besides the normal OLTP workload – complex decision-support queries are executed. Without effective load control, the high resource consumption of such decision-support queries will slow down short-running OLTP transactions excessively. Besides their complexity, the priority of queries should be considered as well when resources are allocated. A query with a very firm deadline should receive all the resources it needs, while a query running in background mode should only be allowed to use the remaining resources. Because of this, it is reasonable to divide transactions into different classes whose user requirements can be expressed by a Service Level Agreement [20]. A possible way to specify such agreements is to define response time constraints for each class.
When multiclass workloads are considered, it is not sufficient to minimize the mean response time over all classes. As the response time largely depends on the number of disk accesses, changing the buffer hit rate is an appropriate method of achieving response time goals. A method of controlling buffer hit rates is
the partitioning of the aggregate buffer area into separate regions which are dedicated to caching pages belonging to one specific class only. While such dedicated buffer areas speed up the operations of the corresponding classes, operations of other classes will be slowed down, since the overall buffer size is not changed. While commercial database systems (e.g., DB2 [19]) provide methods to partition the local cache buffer into separate pools, determining the sizes of the various buffer pools is a very hard optimization problem, even if the workload remains constant and the workload parameters are known. Suboptimal partitioning can result in poor system performance, as some partitions remain underutilized while the buffer pools of other classes are too small to satisfy the given goals. This problem becomes even worse if the workload evolves over time.
Further complications for this manual approach arise when we move from a single-server environment to a network of workstations, also known as a NOW [1]. Up to now, NOWs have mainly been used for their aggregated CPU power, but this will change as data-intensive applications start to exploit the performance capacity of distributed disks and the aggregate memory for caching benefits. In addition to the two-level storage hierarchy of the single server, the aggregate memory of the NOW introduces a new level of the storage hierarchy, namely the remote cache, which has to be considered when a buffer partitioning is determined.
All these problems make an automatic, dynamic adaptation of
the buffer partitioning based on the actual workload parameters the
only viable approach to meeting multiclass response time goals. In
this paper we present a novel method that has the following salient
properties:
- It is completely automated in that it determines the appropriate sizes of the dedicated buffers for each class on every node, by essentially approximating a solution for the underlying combinatorial optimization problem. With our approach it is therefore sufficient to specify the performance goal itself and not the system parameters needed to achieve this goal. This is a major improvement in administration and a further step towards a self-tuning database system.
- It is dynamic in that it copes with evolving workload characteristics and also allows dynamic adjustments of the class-specific response time goals. This is achieved by running the approximative optimization of the buffer partition sizes continuously and incrementally.
- It is light-weight in that it does not incur much overhead. The collection of input data for the optimization is distributed across all nodes, and messages have to be exchanged only infrequently. Furthermore, the memory and CPU consumption of the approximation algorithm are small.
- It is general in that it supports arbitrary workloads. In particular, the method does not require the data in the buffer partitions of different classes to be disjoint, but rather allows sharing across classes.
Although the buffer partitioning algorithm that we will present
in this paper can be used in combination with almost every replacement strategy, optimal usage of the aggregate memory will
generally only be possible when actual workload and system characteristics are taken into account. Therefore we will incorporate
the cost-based remote cache algorithm developed in [27, 26] into
our dynamic multiclass buffer partitioning method.
The remainder of this paper is organized as follows: In Section 2 we give a review of existing techniques for goal-oriented
workload management as well as techniques for remote caching.
After defining system and workload characteristics in Section 3
we will present the computation of the goal-oriented buffer partitioning in Section 4. In Section 5 we will use the goal-oriented
buffer partitioning method introduced in Section 4 to derive a distributed implementation. To take full advantage of the aggregate
memory we will consider the combination of the goal-oriented
buffering with a cost-based remote cache replacement algorithm
in Section 6. Section 7 presents the setup and first results of an experimental study which we have carried out in a detailed simulation prototype. Finally we summarize our conclusions and discuss
future work in Section 8.
2. Related Work
General goal-oriented methods for automatic database tuning are described in [23, 12, 22]. All these approaches dynamically check the satisfaction of given goals; if violations are observed, appropriate countermeasures are invoked. The methods differ in the countermeasures employed. While [23] only introduces a general framework for goal-oriented methods, the other methods use either dynamic routing decisions in a shared-nothing OLTP environment [12] or the variation of the multiprogramming level and the transaction priorities on a single server [22] to achieve the given goals. In contrast to our work, no goal-oriented buffering is considered.
Goal-oriented buffering has been considered in a number of papers [5, 7, 6, 8]. In [5] a method called fragment fencing was introduced, which was replaced by class fencing in [6]. The goal of both methods is to minimize the mean response time of a so-called No-Goal class, while satisfying the response time goals of the different Goal classes. To achieve this, the buffer can be dynamically partitioned and dedicated to a class. For example, if a class k cannot meet its goal, a certain amount of the global buffer is dedicated solely to pages of the class k transactions. Assuming that the transactions are disk-bound, this reduces the average response time of class k, as the buffer hit rate is increased. On the other hand, the dedicated buffer area is decreased if the mean response time of a class is below the goal, since the no-goal class will profit from the freed buffer space. Fragment and class fencing differ in the way they estimate the necessary amount of dedicated buffer to meet a given response time goal. While fragment fencing assumes a direct proportionality between the buffer space and the response time, class fencing only assumes a proportionality between the miss rate and the response time. The necessary dependency between the miss rate and the buffer space is derived by a linear extrapolation of previously measured values of the buffer hit rate as a function of the buffer space. This method ensures fast convergence as long as the curve describing the dependency between the miss rate and the buffer space is concave. In [7] this assumption has been verified for common buffer replacement algorithms by an empirical study.
In [8] the dynamic tuning algorithm is described. This method also uses dedicated buffer pools for classes to speed up their response time. The algorithm tries to find a state in which the maximum over all performance indices is minimal, where the performance index of a class k is defined as the ratio of the observed and the goal response time of class k operations. To achieve this, the algorithm computes the effects of small changes in the buffer partitioning on the performance index and only carries out those changes which lead to an improved system state.
The main disadvantage of all goal-oriented buffering methods described so far is that they are designed for a single server. In this paper we combine goal-oriented buffering with distributed caching. Distributed or remote caching tries to exploit the aggregated buffer in a network of workstations. The main assumption motivating distributed caching is that a memory access over a local network is faster than a local disk access. Many different heuristics, all of which try to maximize the global cache hit rate and thereby reduce the number of disk accesses, have been proposed in the past [13, 9, 11, 24, 18]. Although this high global cache hit rate often leads to improved system performance, disregarding the current resource utilization can decrease the performance under some circumstances. Therefore [27, 28, 26] have proposed online methods which try to balance between an egoistic (maximizing the local buffer hit rate) and an altruistic (maximizing the global buffer hit rate) behavior, depending on the current load and the system parameters.
3. System and Workload Characterization
The system that we consider consists of N nodes which are interconnected by a fast network, and every node i has a reserved area of $SIZE_i$ bytes of main memory which can be used as a page buffer. Furthermore, each node is connected to a local disk. We assume that each data page has a permanent, disk-resident copy at a specific node called its home. The homes themselves are distributed across the nodes using a hash function or some catalog-driven partitioning function.
The locally reserved buffer area is managed, depending on the number of dedicated buffers, either by one or by several local buffer managers. Independent of the actual implementation of the buffer managers, our goal-oriented partitioning algorithm is based only on the assumption that increasing the size of any local buffer of a class will increase the buffer hit rate and thereby decrease the mean response time of that class. Although a counterexample to this condition was given in [2], using a FIFO replacement algorithm on a single node, the assumption should be satisfied by virtually all replacement policies used in practice.
We assume that the external workload consists of several operations, which can arrive at any node of the local network. Each single operation can further be subdivided into several page accesses, each of which is executed by data-shipping, i.e., all requested pages are copied to the node on which the operation was initiated. All operations are assumed to be disk-bound, which means that the response time is mainly determined by the number of page accesses that cannot be satisfied by cache hits.
Although we only consider read requests in this paper, it is also possible to incorporate write requests into our model. In the presence of updates we have to ensure the transactional properties of the operations, such as atomicity, isolation, and durability. Isolation can be guaranteed by the (distributed) 2-phase locking protocol [10], atomicity in a distributed system can be achieved by the 2-phase commit protocol [15], and durability can be guaranteed by the WAL (Write-Ahead Logging) principle [4].
Depending on the user-defined response time goals, we can group the operations into separate classes. Although it is possible to group arbitrary operations with the same response time goal together, we believe that data affinity and resource consumption should be considered, too. Ideally, only operations which access the same pages should be grouped together, and furthermore there should be no two operations which access the same pages but belong to different classes. The effects which may result from violating this ideal grouping are shown by the next two examples.
Example 1: Let the class k consist of the two operations $op_1$ and $op_2$. We assume that the complexity (i.e., the number of page accesses per operation) of both operations is equal but the sets of objects accessed are disjoint. If the inter-arrival time of $op_1$ is much smaller than that of $op_2$, a dedicated buffer for class k will contain almost solely pages which are accessed by $op_1$ (assuming a buffer replacement strategy that considers the access frequency of objects, e.g., LRU). Therefore, changes of the buffer size will only affect the response time of the operation $op_1$. Although we can change the buffer size so that the goal for the mean response time over both operations is met, this probably does not reflect the intention of the users, as $op_1$ operations will be much faster and $op_2$ operations much slower than the class mean.
Example 2: In this second example we assume that the operation $op_1$ belonging to class $k_1$ and $op_2$ belonging to $k_2$ access the same objects. Furthermore, let the goal response time of class $k_1$ be much tighter than that of class $k_2$ (class $k_2$ might even be a class without a given goal). To achieve the response time goal of class $k_1$ we have to provide a sufficiently large dedicated buffer area. As accesses of the operation $op_2$ can also profit from the buffered pages, the mean response time of this class will probably be accelerated beyond the user-defined goal.
Summarizing these examples, we have seen that the situation of example 1 leads to undesirable behavior, while that of example 2 simply speeds up a class more than necessary. As this speedup does not incur any additional cost (the dedicated buffer of class $k_1$ in the example has to be chosen that large anyway), we propose the following simple assignment scheme: a class consists of all those operations which access the same objects and which have the same response time goals. Such a partitioning of the operations is always possible, and it guarantees that the problem of example 1 never occurs.
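As a toy illustration of this assignment scheme, the following Python fragment groups operations by the pair (accessed object set, response time goal); the operation names and goal values are invented for illustration:

```python
from collections import defaultdict

# Toy sketch of the proposed assignment scheme: one class per pair of
# (accessed object set, response time goal). All values are invented.
ops = [
    ("op1", frozenset({"A", "B"}), 5.0),
    ("op2", frozenset({"A", "B"}), 5.0),  # same objects, same goal -> same class
    ("op3", frozenset({"C", "D"}), 5.0),  # same goal, disjoint objects -> new class
]

classes = defaultdict(list)
for name, objects, goal in ops:
    classes[(objects, goal)].append(name)

print(list(classes.values()))  # [['op1', 'op2'], ['op3']]
```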
In the following we assume that all operations with given response time goals are grouped into K classes, numbered from 1 to K. These classes are called Goal classes. In addition, we introduce a special No-Goal class, numbered 0, which subsumes all operations without a given response time goal.
4. Approximating an Optimal Buffer Partitioning
In this section we derive a method for computing the buffer pool sizes of a single class k on the different nodes, so that the mean response time of all operations of this class satisfies the given goal. In contrast to the centralized problem, which was addressed in [6], we have an additional degree of freedom: besides the possibility to influence the response time by changing the size of the overall dedicated buffer, we further have to decide on which node the buffer pool size of the considered class has to be changed.
Similar to class fencing [6], our approach also aims to minimize the response time of the no-goal class under the constraint that every goal class satisfies its response time goal. To reduce the complexity of this optimization problem, we assume that – with the exception of the considered class k – all other classes actually meet their response time goals. (This assumption is only made for the theoretical derivation; in Section 5, when describing an implementation, we will allow the concurrent adaptation of several classes.) For this class k our algorithm computes a new allocation, which ideally should lead to a situation where class k satisfies its goal, too, or at least reduces the difference between its mean response time and its goal. We do this by generalizing class fencing to N dimensions (where N is the number of nodes), so that we can predict the new response time by extrapolating the measured response times of former allocations. We use this kind of approximation to derive the objective function for the no-goal class as well as the constraint which ensures the satisfaction of the response time goal for the goal class k.
As we use remote caching, the local response time $RT_{k,i}$ of a class k operation on node i depends on the local as well as on the remote cache hit rate, which in turn depends on the local buffer size ($LM_{k,i}$) and the remote buffer size ($RM_{k,i}$) of class k. As in the one-server approach of class fencing, the relation describing the response time as a function of the buffer size is a priori unknown, and only some tuples of this relation are given, based on previous measurements. These tuples can now be used to compute the coefficients $\alpha_{k,i}$, $\beta_{k,i}$ and $\gamma_{k,i}$ of a linear approximation of the local response time function:

$$RT_{k,i}(LM_{k,i}, RM_{k,i}) = \alpha_{k,i} \, LM_{k,i} + \beta_{k,i} \, RM_{k,i} + \gamma_{k,i} \qquad (1)$$
Since the size of the remote cache for class k at node i is determined by the sizes of the local caches of all other nodes, we can use the equation

$$RM_{k,i} = \sum_{j=1,\, j \neq i}^{N} LM_{k,j} \qquad (2)$$

to transform equation 1 to:

$$RT_{k,i}(LM_{k,1}, \ldots, LM_{k,N}) = \alpha_{k,i} \, LM_{k,i} + \beta_{k,i} \sum_{j=1,\, j \neq i}^{N} LM_{k,j} + \gamma_{k,i} \qquad (3)$$
As we do not specify local goals for every node, but only one goal for the mean response time of a class over all nodes, we consider the weighted sum of all local response times for that class. The weighting factors that we use are the arrival rates $\lambda_{k,i}$ of class k operations that arrive on node i. Hence, the mean response time $RT_k$ can be expressed as:

$$RT_k(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \lambda_{k,i} \, RT_{k,i}(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \underbrace{\Big( \lambda_{k,i} \alpha_{k,i} + \sum_{j=1,\, j \neq i}^{N} \lambda_{k,j} \beta_{k,j} \Big)}_{=: \, \delta_{k,i}} LM_{k,i} + \underbrace{\sum_{i=1}^{N} \lambda_{k,i} \gamma_{k,i}}_{=: \, \delta_k} \qquad (4)$$

Equation 4 can be seen as an N-dimensional hyperplane which approximates the dependency between the mean response time and the partitioning of the local buffers. To derive the constraint which ensures that the new partitioning will satisfy the given goal, we only have to set equation 4 equal to the response time goal $RT_k^{goal}$ of the class k:

$$RT_k^{goal} = \sum_{i=1}^{N} \delta_{k,i} \, LM_{k,i} + \delta_k \qquad (5)$$

Before we derive the objective function of our optimization problem, we want to introduce some bounds on the sizes of the dedicated buffers, which are imposed by the limited main memory on every node. Clearly, the local buffer pool size of class k on any node cannot be less than zero, and it is also impossible for the sum of all local buffers to be larger than the locally reserved cache memory size ($SIZE_i$). In our terminology we can express these bounds in the following way:

$$0 \leq LM_{k,i} \leq SIZE_i - \sum_{l=1,\, l \neq k}^{K} LM_{l,i} \qquad (6)$$

To meet the response time goal we can choose any allocation which satisfies equations 5 and 6, but as we aim to minimize the mean response time of the no-goal class, we now derive the objective function for this minimization problem. Analogously to the approximation of the response time as a function of the buffer size for the class k operations, we can also approximate the response time $RT_{0,i}$ of the no-goal class on a node i by:

$$RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = \alpha_{0,i} \Big( SIZE_i - \sum_{l=1}^{K} LM_{l,i} \Big) + \beta_{0,i} \sum_{j=1,\, j \neq i}^{N} \Big( SIZE_j - \sum_{l=1}^{K} LM_{l,j} \Big) + \gamma_{0,i} \qquad (7)$$

In this formula we assume that the buffer on node i that can be used by no-goal class operations equals the size of the complete reserved memory on this node minus the sizes of all dedicated buffers on this node. In addition we assume that we change the allocation of only one class (namely class k) at a time, and therefore the response time of the no-goal class on node i depends in this case only on the local buffer sizes of class k. Using this fact, we can rewrite formula 7 by collecting all constant factors in a new constant $\bar{\gamma}_{0,i}$ as:

$$RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = (-\alpha_{0,i}) \, LM_{k,i} + (-\beta_{0,i}) \sum_{j=1,\, j \neq i}^{N} LM_{k,j} + \bar{\gamma}_{0,i} \qquad (8)$$

Computing the weighted mean of the local response times and putting together the constants (analogously to equation 4), we get:

$$RT_0(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \lambda_{0,i} \, RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \delta_{0,i} \, LM_{k,i} + \delta_0 \qquad (9)$$

It should be noted that, in contrast to equation 4, all the gradients $\delta_{0,i}$ are now greater than zero, i.e., the response time of the no-goal class increases when the local buffer size of class k is increased at any node.

With these results, we can compute our final buffer partitioning by solving the following linear programming problem with the variables $LM_{k,i}$ $(1 \leq i \leq N)$:

Minimize: $\sum_{i=1}^{N} \delta_{0,i} \, LM_{k,i} + \delta_0$

under the constraint: $RT_k^{goal} = \sum_{i=1}^{N} \delta_{k,i} \, LM_{k,i} + \delta_k$

considering the bounds: $0 \leq LM_{k,i} \leq SIZE_i - \sum_{l=1,\, l \neq k}^{K} LM_{l,i}$ for all nodes i.

Although – when considering arbitrary response time curves and approximation planes – it cannot be proven that solving the linear program always results in an improved partitioning, these special cases are irrelevant for our purposes, since they correspond to states where the goals of the goal classes are not violated [16].
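For concreteness, the following is a minimal sketch of this linear program in Python using scipy.optimize.linprog; all numbers are invented for illustration, and the paper's prototype uses the lp_solve library [3] rather than scipy. In practice the deltas come from the hyperplane approximations of equations 5 and 9.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the buffer-partitioning LP of Section 4 (invented coefficients).
N = 3                                    # number of nodes
delta_0 = np.array([0.4, 0.3, 0.5])      # no-goal gradients (eq. 9), msec per MB
delta_k = np.array([-0.8, -0.6, -0.9])   # goal-class gradients (eq. 5), msec per MB
const_k = 7.0                            # constant term delta_k of eq. 5 (msec)
rt_goal = 4.0                            # response time goal of class k (msec)
upper = [2.0, 2.0, 2.0]                  # SIZE_i minus buffers of other classes (MB)

# Minimize sum_i delta_0[i] * LM[i] (the additive constant delta_0 is irrelevant)
# s.t. sum_i delta_k[i] * LM[i] = rt_goal - const_k and 0 <= LM[i] <= upper[i].
res = linprog(c=delta_0,
              A_eq=delta_k.reshape(1, -1), b_eq=[rt_goal - const_k],
              bounds=[(0.0, u) for u in upper])
if res.success:
    print("new local buffer pool sizes (MB):", res.x)
```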
5. Implementation
Having described the partitioning algorithm in Section 4, we now demonstrate how this computation can be embedded into a distributed system for online decisions. We assume that for every goal class there exists one agent process on every node and additionally a single coordinator process which can be located on any node. Furthermore, for the no-goal class, one agent is needed on every node. Because of load balancing issues we allow the coordinator to be placed separately for every class, and even a migration of a coordinator from one node to another is possible, as long as all corresponding agents are informed. In Figure 1 we have sketched an environment with 4 nodes and 4 classes (3 goal classes and the no-goal class).
Our algorithm itself consists of five phases that form a feedback-controlled loop in which the satisfaction of the goal is checked. If the goal is violated, a recomputation of the buffer partitioning is initiated. In the following we describe these phases for a class k in more detail.
[Figure 1. Environment with 4 nodes and 3 goal classes. Legend: local agent for a goal class k; local agent for the no-goal class; locally reserved buffer area for goal class k operations; buffer for the no-goal class; coordinator process for a goal class k.]
(a) Collect Phase of the Local Agent Every time a class k operation is initiated locally, the agent updates the inter-arrival time, and upon completion of the operation the local response time for this class is updated. To prevent heavy fluctuations caused by stochastic noise, we record the response times over a sufficiently long observation interval.
If a significant change in the observed response time is recorded, the appropriate coordinator is informed about the new value. Since the information collected by the no-goal agents is only used by the optimization processes of the goal classes, changes registered by the no-goal agents have to be propagated to all goal class coordinators. In addition to the inter-arrival and the mean response time, an increase or a decrease of any local buffer size will influence the buffer size of the no-goal class, and therefore a change of this value is propagated to the goal class coordinators, too.
It should be emphasized that the agents do not have to run synchronously on the different nodes, because the coordinator of a class k remembers the most recently received information from every class k agent and every no-goal agent.
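As an illustration, here is a minimal sketch of such an agent's collect phase, assuming a relative significance threshold for reporting; all class and method names are our own, not the prototype's:

```python
# Minimal sketch of a local agent's collect phase (a): response times are
# aggregated over an observation interval and reported to the coordinator
# only when the observed mean changes significantly. The 10% threshold and
# all names here are assumptions for illustration.
class GoalClassAgent:
    def __init__(self, coordinator, significance=0.10):
        self.coordinator = coordinator
        self.significance = significance
        self.samples = []
        self.last_reported = None

    def on_operation_finished(self, response_time):
        self.samples.append(response_time)   # one sample per class k operation

    def end_of_interval(self, arrival_rate, local_buffer_size):
        if not self.samples:
            return
        mean_rt = sum(self.samples) / len(self.samples)
        self.samples.clear()
        changed = (self.last_reported is None or
                   abs(mean_rt - self.last_reported) >
                   self.significance * self.last_reported)
        if changed:                          # significant change: inform coordinator
            self.coordinator.report(arrival_rate, local_buffer_size, mean_rt)
            self.last_reported = mean_rt
```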
(b) Collect Phase on the Coordinator In this phase the coordinator awaits the data that is sent by the agents in phase (a). After receiving a new message, the received information is used either for the creation of a new measure point (if the partitioning has changed since the last measure point) or for the update of the last measure point (if only the response time has changed). In the first case we have to ensure that a unique approximation of the N-dimensional hyperplane is still possible, as this is needed by the optimization process. Let $m_1$ be the most recent measure point; we can guarantee a unique approximation by keeping the $N+1$ most recent measure points such that the vectors $m_1 - m_2, \ldots, m_1 - m_{N+1}$ are linearly independent.
Although this method ensures a unique linear approximation once there are enough points, we still have to address the problem of what to do during "warm-up", when there are fewer than N+1 measure points. In this case we can use simple heuristics, like allocating a certain percentage of the undedicated main memory on every node. To quickly overcome this warm-up period we have to take care that every new partitioning leads to a new linearly independent measure point, so that the next iteration of the feedback-controlled loop can rely on one additional measure point.
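A compact, non-incremental sketch of this bookkeeping follows; the paper's implementation uses the incremental Gauss algorithm of [14] instead, and the helper names and the use of numpy are ours:

```python
import numpy as np

# Sketch of the coordinator's bookkeeping in phase (b): test whether the
# difference vectors of the N+1 measure points stay linearly independent,
# and fit the hyperplane of equation 4 from them. numpy's rank test and
# dense solver are simple stand-ins for the incremental O(N^2) method.
def differences_independent(points):
    """points: N+1 allocation vectors (numpy arrays of length N), newest first."""
    diffs = np.array([points[0] - p for p in points[1:]])   # N x N matrix
    return np.linalg.matrix_rank(diffs) == len(points) - 1

def fit_hyperplane(allocations, response_times):
    """Fit RT = sum_i delta_i * LM_i + delta from N+1 measure points (eq. 4)."""
    A = np.hstack([np.asarray(allocations), np.ones((len(allocations), 1))])
    coeffs = np.linalg.solve(A, np.asarray(response_times, dtype=float))
    return coeffs[:-1], coeffs[-1]   # gradients delta_{k,i} and constant delta_k
```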
(c) Check Phase In this phase the coordinator computes the weighted mean response time $RT_k$ of the class k according to equation 4, and afterwards this time is checked against the given response time goal. Due to statistical variance in the response time, we consider a goal to be violated only if it differs by more than a certain tolerance $\epsilon$ from the given goal. To allow a workload-dependent adaptation of $\epsilon$ we use the method of [5]. If a goal is violated we proceed to phase (d); otherwise the current iteration of the feedback-controlled loop is finished and we return to the collection phases.
(d) Optimization Phase During this phase the class k coordinator determines the new partitioning of the local buffers of class k according to the method described in Section 4. This involves the approximation of the hyperplane, based on the measure points registered in phase (b), followed by the minimization process. Having determined the new buffer partitioning, the new buffer pool sizes are sent to all agents that are subject to changes.
(e) Allocation Phase In this phase the local agents receive the output of the optimization phase from the coordinator and change their local allocation schemes accordingly. Although the computation in Section 4 assumes that there are no concurrent adaptations for different classes, we drop this restriction in our implementation to reduce the overhead for synchronization and to improve adaptivity. It is therefore possible that a class k agent cannot allocate the desired amount of memory, because a local agent of another class k′ has already reserved this area. In this case the local agent of class k allocates as much memory as possible and informs the coordinator about the difference, so that the coordinator can update its information. Further actions are not triggered: as the algorithm implements a feedback mechanism, if the goal is not reached with this partitioning, the algorithm will consider the new information in its next iteration.
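Putting the phases together, the coordinator's control flow can be sketched as follows; the collaborators are passed in as callables because the concrete messaging and solver code (see the sketches above) is system specific, and it is the structure, not the names, that follows the paper:

```python
# High-level sketch of one coordinator's feedback loop, tying phases (b)-(e)
# together. All parameters except rt_goal and tolerance are hypothetical
# callables standing in for the steps described in the text.
def coordinator_loop(rt_goal, tolerance, receive_points, mean_response_time,
                     fit_hyperplane, solve_partitioning_lp, send_allocations):
    while True:
        points = receive_points()                  # phase (b): measure points
        rt_k = mean_response_time(points)          # phase (c): equation 4
        if abs(rt_k - rt_goal) <= tolerance:
            continue                               # goal satisfied, keep observing
        gradients, const = fit_hyperplane(points)  # phase (d): approximate plane
        allocation = solve_partitioning_lp(gradients, const, rt_goal)
        send_allocations(allocation)               # phase (e): agents adapt
```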
Computational Complexity Having defined the phases, we now study the computational complexity of the different tasks. Here we restrict ourselves to phases (b) and (d), since the other phases, which are executed by the agents, involve only trivial computations.
In phase (b) we have to determine the N+1 newest "linearly independent" points, where N is the number of nodes in the system. Since this involves testing whether the corresponding system of linear equations is singular, we use an incremental Gauss algorithm [14]. This algorithm takes advantage of the only marginal changes between two computations (a new measure point replaces an old one) and thereby reduces the complexity of the standard Gauss algorithm to $O(N^2)$.
In phase (d) we first have to solve a system of linear equations to determine the parameters of the approximation of the hyperplane. Similar to phase (b), we can again use the incremental Gauss algorithm, so that we achieve a worst case complexity of $O(N^2)$. Finally, the coordinator has to compute the solution of the linear program introduced in Section 4. For this task we have chosen an implementation [3] of the simplex algorithm, which, although having an exponential worst case complexity, has been proven to be linear in the number of variables and constraints in the mean [25].
Besides these more theoretical considerations, we have measured the average time of a single execution of the different tasks on a SUN Sparc 4 workstation. The results for different numbers of nodes are shown in Table 1.

Number of Nodes     5     10    20    30    40    50
Lin. Independence   0.1   0.2   0.7   2.4   2.8   4.2
Approximation       0.24  0.6   2.7   5.5   11.1  14.8
Optimization        0.9   1.6   2.3   2.7   3.3   5.4
Overall             1.24  2.4   5.7   10.6  17.2  24.4

Table 1. CPU execution time in milliseconds.

Table 1 shows that the overhead incurred by the coordinator process is very low. In addition
we have to remember that these tasks are only executed on demand when a class violates its response time goal; otherwise no
actions are needed. Furthermore, we can benefit from the distribution of the coordinators among different nodes, as this allows the
application of load-balancing methods (distribution and migration
of coordinator processes) in the case of heavy CPU contention on
a specific node.
6. Using a Cost-Based Buffer Manager
Up to now we have not specified the replacement policy of the buffer managers. Although there are many different policies which satisfy the precondition that we stated in Section 3 (increasing the buffer size leads to a decrease in response time), [17, 27, 28, 26] have shown that in general an optimal usage of the remote cache can be achieved neither by maximizing the global hit rate (altruistic behavior) nor by maximizing the local hit rate (egoistic behavior). Therefore, in this section we describe the integration of the cost-based replacement policy of [27, 26] into our goal-oriented partitioning algorithm.
The central idea of [27, 26] is the notion of the benefit of a cached page. The benefit of a page is defined as the difference in access cost between keeping the page in the local cache and dropping it. Instead of using a simple stack, every buffer manager uses a priority queue to keep the pages sorted by their benefit, and in the case of a buffer replacement action, the page with the locally lowest benefit is replaced. To compute the benefit of a page, every buffer manager keeps track of whether its local copy is the last cached copy of that page in the system, as well as of the local and the global heat of every page, the heat being defined as the number of accesses (locally resp. globally) per time unit. In the implementation, the LRU-k algorithm [21] is used to approximate the heat. To reduce the overhead of information dissemination, threshold-based protocols are used which allow the propagation and the update of the nonlocal information. Besides this page-specific information, the access costs to the different levels of the storage hierarchy are needed, too. By tagging each page request with the storage level from which the page was accessed, this information can be gathered with low overhead by observing the response times of already finished requests.
As our goal-oriented buffering scheme allows several buffer managers per node (one for the no-goal class and at most one for every goal class), we have to adapt the original algorithm slightly. To ensure a correct ranking in the no-goal buffer as well as in the various goal buffers, we have to collect the different class heats as well as the accumulated heat over all accesses. For the goal classes we only have to consider those local heat values for which there exists a dedicated local buffer. Furthermore, we have to collect the global heat only for those classes for which at least one dedicated buffer area exists in the system. Finally, we can reduce the overhead of the bookkeeping by collecting the heat information of a class k on an object p only if at least one operation of class k accesses the object p. But as this information is unknown a priori, we use a method which dynamically creates and deletes the heat information on demand.
A single access to a page p by an operation op belonging to a class k is now executed in the following way. First of all, the accumulated heat of this page is updated. If a dedicated buffer for class k exists on the considered node and the page is not already cached locally in another dedicated buffer, the page is acquired (either from the local no-goal buffer, from which it is removed, or via the remote cache or disk), the class-specific heat is updated, and the page is inserted into the dedicated buffer of class k. If this causes other pages to be dropped from the dedicated buffer, these are removed from the cache of the local node completely. In case the requested page already resides within the dedicated buffer, only the class-specific heat is updated. In case there is no dedicated buffer for class k, the page (if not cached there already) is acquired and inserted into the no-goal buffer. Replacement victims are again dropped from the buffer of the local node.
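The following sketch illustrates the benefit-ordered replacement idea with a deliberately simplified benefit formula; the full algorithm of [27, 26] additionally weighs global heat, maintains last-copy information via the threshold-based protocols, and estimates heat with LRU-k [21]. All names are ours:

```python
import heapq

# Sketch of a benefit-ordered buffer pool in the spirit of [27, 26]: pages
# are kept in a priority queue sorted by benefit, and the page with the
# lowest local benefit is evicted.
def benefit(local_heat, is_last_copy, cost_remote, cost_disk):
    # Cost difference between caching the page and dropping it: a dropped
    # last copy must be re-read from disk, any other copy from a remote cache.
    refetch_cost = cost_disk if is_last_copy else cost_remote
    return local_heat * refetch_cost

class BenefitBuffer:
    def __init__(self, capacity):
        self.capacity, self.heap, self.resident = capacity, [], set()

    def insert(self, page_id, page_benefit):
        """Insert a page; return the pages evicted to make room."""
        evicted = []
        while len(self.resident) >= self.capacity:
            _, victim = heapq.heappop(self.heap)
            if victim in self.resident:      # skip stale heap entries
                self.resident.remove(victim)
                evicted.append(victim)
        heapq.heappush(self.heap, (page_benefit, page_id))
        self.resident.add(page_id)
        return evicted
```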
7. Simulation Experiments
7.1. Simulation Setup
In order to assess the validity of our theoretical model, we conducted an experimental study; the entire study, including all system and database parameters, can be found in [16]. We have integrated our approach into the detailed simulation prototype described in [26]. In all the experiments described in this section we use an environment consisting of 3 nodes (CPU speed 100 MIPS), which are connected via a fast local network (transfer rate of 100 Mbit/s). Each node employs a common SCSI disk and 2 MB of cache space which can be used for caching pages. We have chosen such a small buffer size to limit the execution time of the simulations.
The database is modeled as a set of M = 2000 data pages (4 KByte), which are distributed in a round-robin fashion over all nodes' disks. For each node and each operation of the different classes a stream of accesses to the corresponding pages is generated. A single operation can consist of one or more accesses. The identities of the accessed pages are distributed via a Zipfian distribution with a skew parameter $\theta$, i.e., the local access frequency of a page with index p is $C \cdot \frac{1}{p^{\theta}}$ with $C = 1 / \sum_{q=1}^{M} \frac{1}{q^{\theta}}$. Operations are generated at each node independently, with their inter-arrival times $1/\lambda_{k,i}$ assumed to be exponentially distributed.
The length of the observation interval is set to 5000 msec, which still allows fast adaptation to changing parameters and at the same time smoothes variations caused by stochastic noise.
In our experiments we want to show that our method is able to find a buffer partitioning which satisfies the user-given goals; furthermore we are interested in the speed of convergence, i.e., the number of iterations of the feedback-controlled loop necessary to find such a partitioning. In order to capture a variety of different partitionings, we count the number of intervals in which the system reaches a state satisfying the response time goal, changing the response time goal after four "satisfied" intervals. The new goal is chosen randomly so that it should be satisfiable under the current workload and also differs significantly from the current goal. All experiments were repeated sufficiently often to obtain an accuracy of less than one iteration for the speed of convergence with a statistical confidence of 99 percent.
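A sketch of such a workload generator, assuming the Zipf weights and exponential inter-arrival times just described (all names are ours, not the prototype's):

```python
import random

# Sketch of the simulated workload: page indices follow a Zipf-like
# distribution with skew theta, and operation inter-arrival times are
# exponentially distributed with rate lam.
def make_zipf_sampler(num_pages, theta, seed=42):
    rng = random.Random(seed)
    weights = [1.0 / (p ** theta) for p in range(1, num_pages + 1)]  # ~ C / p^theta
    pages = range(1, num_pages + 1)
    return lambda: rng.choices(pages, weights=weights, k=1)[0]

def operation_stream(num_pages, theta, lam, pages_per_op=4, seed=7):
    sample_page = make_zipf_sampler(num_pages, theta)
    rng = random.Random(seed)
    while True:                        # one (wait, page list) pair per operation
        wait = rng.expovariate(lam)    # exponential inter-arrival time
        yield wait, [sample_page() for _ in range(pages_per_op)]
```

For instance, operation_stream(2000, 0.5, 1.0 / 200) yields, per operation, an exponentially distributed wait followed by a list of four Zipf-distributed page indices, which could drive one node's access stream for one class.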
7.2. Base Experiment
In our first experiment we consider a two-class scenario (one goal class and the no-goal class). We assume that each operation of both classes accesses 4 pages and that there is no data sharing, i.e., there is no page which is accessed by both classes. Finally, we have chosen the access skew of both classes equal to 0.

[Figure 2. Variance in response time, response time goal and dedicated memory.]

Figure 2 shows the resulting changes in the observed overall response time, the response time goal, and the systemwide dedicated memory (the total size of cache buffers dedicated to the goal class).
As expected, the observed response time is closely related to the size of the dedicated buffer. Furthermore, the approximation results in a partitioning satisfying the response time goal after only a small number of observation intervals. This, in fact, has been true for all experiments conducted, including experiments with vastly more complex operations, dynamically changing workloads, or a larger number of nodes. Because we aim to illustrate the adaptivity of our approach, there is never a larger number of consecutive intervals with a constant goal in this experiment. Because of this, it is not possible to effectively calculate the tolerance $\epsilon$ [5], which would prevent the coordinator from becoming active on small deviations from the given goal. This explains the oscillation seen in some parts of the figure. Even so, the system does not exhibit significant changes in partitioning after reaching a partitioning that satisfies the response time goal. Furthermore, this problem disappears in more realistic settings, in which response time goals normally do not change in quick succession.
7.3. Variation of the Access Skew
In our study we have observed that it is not the complexity of the different operations but the skew of the accesses that is of major importance for the speed of convergence. In theory, this can be explained by the fact that our algorithm generates new partitionings by approximating the shape of the response time curve through linear programming. Therefore, with a skew $\theta = 0$, which corresponds to a uniform distribution, the response time as a function of the buffer sizes theoretically equals a hyperplane and so can be matched almost ideally by our approximation hyperplane.
With a higher skew, the difference between the response time function and the approximation plane increases. This theoretical consideration is confirmed by our simulation results, which are shown in Table 2. In order to be able to compare the results of different experiments, we choose the goals randomly from $[goal_{min}, goal_{max}]$, where $goal_{min}$ corresponds to the response time of the goal class when (under the chosen workload) $\frac{2}{3} \sum_{i=1}^{3} SIZE_i$ of the cache memory is dedicated to it; in turn, $goal_{max}$ corresponds to the response time achieved when $\frac{1}{3} \sum_{i=1}^{3} SIZE_i$ of the cache is dedicated.

Skew $\theta$   0     0.25   0.5    0.75   1
Iterations      1.84  2.41   3.55   3.88   3.95

Table 2. Convergence speed under varying $\theta$.

The results show that an increase in skew leads to a decrease in convergence speed. Nevertheless, it should be noted that even in the case $\theta = 1$, which corresponds to a very highly skewed distribution, on average fewer than 4 iterations of the feedback-controlled loop are sufficient to adapt to a change in the given goal.
7.4. Multiple Goal-Classes
We have repeated these experiments with two goal classes $k_1$ and $k_2$ ($RT_{k_1}^{goal} < RT_{k_2}^{goal}$) and twice the amount of cache buffer memory at each node. With multiple classes, the time of convergence also depends on whether the sets of pages accessed by each class are disjoint or not. In the case of disjoint sets, the amount of memory dedicated to one class does not influence the performance of the other, and therefore we would expect to get the same results as in the base experiment. This is confirmed by our experiments, as the measured speed of convergence for each value of $\theta$ was identical to those already shown in Table 2.
However, this independence is no longer valid if we consider data sharing between the different goal classes. Raising the percentage of sharing, we have observed that the size of the dedicated buffers of class $k_2$ decreases gradually. This is due to the fact that this class can profit from the dedicated buffer of class $k_1$. Further increases in sharing lead to a complete removal of the dedicated buffers of class $k_2$, and eventually – even without any dedicated buffers – class $k_2$ exceeds its goal solely by accessing pages from the buffers of class $k_1$. These observations match exactly the considerations made in Section 3, example 2. These findings were confirmed by experiments using more than two classes [16].
7.5. Overhead
Because of the length of the observation interval and the small size of the messages, the messages used by our method make up only a fraction of the total network traffic (less than 0.1% in our experiments). Coupled with the CPU overhead described in Section 5 and the fact that very little additional memory is needed, the overall overhead of our method is not significant in the setting of a distributed database system.
8. Conclusion and further work
In this paper we presented an online method for distributed, goal-oriented caching in a network of workstations. Our approach is built from two components: an algorithm which computes a buffer partitioning according to the user-specified response time goals, and a cost-based buffer replacement algorithm which makes optimal use of these partitions. To allow online computation, we have described a distributed, low-overhead implementation of the partitioning algorithm in a detailed simulation prototype and presented some results generated by this prototype.
In the future we plan to expand our simulation study. One aspect we want to focus on is the usage of other objective functions. In our current approach we only try to minimize the mean response time of the no-goal class, but some applications insist on more stringent conditions, like e.g. a given mean response time goal together with a maximal coefficient of variation among the different nodes. In this scenario, minimizing the mean response time of the no-goal class will in general not lead to the user-specified goal, and therefore a new objective function, like e.g. minimizing the variation, will be needed.
References
[1] T. Anderson, D. Culler, and D. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 15(1), February 1995.
[2] L. Belady, R. Nelson, and G. Shedler. An Anomaly in Space-Time Characteristics of Certain Programs Running on a Paging Machine. Communications of the ACM, 1969.
[3] M. Berkelaar, J. Dirks, and H. Schwab. lp-solve Library version 2.1
and Documentation. Winter Simulation Conference, 1997.
[4] P. Bernstein, N. Goodman, and V. Hadzilacos. Recovery Algorithms for Database Systems. In IFIP 9th World Computer Congress,
September 1983.
[5] K. Brown, M. Carey, D. DeWitt, and M. Mehta. Managing Memory to Meet Multiclass Workload Response Time Goals. In 19th International Conference on Very Large Data Bases, 1993.
[6] K. Brown, M. Carey, and M. Livny. Goal-Oriented Buffer Management Revisited. In ACM SIGMOD Conference, 1996.
[7] K. P. Brown. Goal Oriented Memory Allocation in Database Management Systems. PhD thesis, University of Wisconsin-Madison,
1995.
[8] J.-Y. Chung, D. Ferguson, G. Wang, C. Nikolaou, and J. Teng. Goal Oriented Dynamic Buffer Pool Management for Database Systems. In International Conference on Engineering of Complex Computer Systems, 1995.
[9] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. Patterson. Cooperative Caching: Using Remote Client Memory to Improve File
System Performance. In 1st Symposium on Operating Systems Design and Implementation, 1994.
[10] K. Eswaran, J. Gray, R. Lorie, and I. Traiger. The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM, 19(11), November 1976.
[11] M. Feeley, W. Morgan, F. Pighin, A. Karlin, H. Levy, and C. Thekkath. Implementing Global Memory Management in a Workstation Cluster. In 15th ACM Symposium on Operating Systems Principles, 1995.
[12] D. Ferguson, C. Nikolaou, and L. Georgiadis. Goal Oriented, Adaptive Transaction Routing for High Performance Transaction Processing Systems. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems, San Diego, CA, Jan. 1993.
[13] M. Franklin. Client Data Caching. Kluwer, 1996.
[14] P. Gill, W. Murray, and M. Wright. Numerical Linear Algebra and Optimization, volume 1. Addison-Wesley, 1991.
[15] J. Gray. Operating Systems: An Advanced Course, chapter Notes on
Database Operating Systems. Springer, 1979.
[16] A. König. Memory Management in Distributed Database Systems Using Class-Oriented Performance Goals (in German). Diploma thesis, University of the Saarland, 1998.
[17] A. Leff, J. Wolf, and P. Yu. Policies for efficient Memory Utilization
in a Remote Caching Architecture. In 1st International Conference
on Parallel and Distributed Information Systems, 1991.
[18] A. Leff, J. Wolf, and P. Yu. Efficient LRU-Based Buffering in a LAN Remote Caching Architecture. IEEE Transactions on Parallel and Distributed Systems, 7(2), 1996.
[19] C. Mullins. DB2 Developers Guide, DB2 Performance Techniques
for Application Programmers. Sams Publishing, 1993.
[20] J. Noonan. Automated Service Level Management and its supporting
Technologies. Mainframe Journal, 1989.
[21] E. O’Neil, P. O’Neil, and G. Weikum. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In ACM SIGMOD
Conference, 1993.
[22] E. Rahm. Goal-Oriented Performance Control for Transaction Processing. In Proceedings of the 9th ITG/GI MMB'97 Conference. VDE-Verlag, 1997.
[23] E. Rahm, D. Ferguson, L. Georgiadis, C. Nikolaou, G.-W. Su, M. Swanson, and G. Wang. Goal-oriented Workload Management in Locally Distributed Transaction Systems. IBM Research Report RC 14712, T.J. Watson Research Center, 1989.
[24] P. Sarkar and J. Hartman. Efficient Cooperative Caching Using Hints. In 2nd Symposium on Operating Systems Design and Implementation, 1996.
[25] A. Schrijver. Theory of Linear and Integer Programming. Wiley,
1986.
[26] M. Sinnwell. Adaptive Caching in Distributed Information Systems (in German). PhD thesis, University of the Saarland, 1998.
[27] M. Sinnwell and G. Weikum. A Cost-Model-Based Online Method
for Distributed Caching. In 13th International Conference on Data
Engineering, 1997.
[28] S. Venkataraman, M. Livny, and J. Naughton. Memory Management for Scalable Web Servers. In 13th International Conference on Data Engineering, 1997.