Managing Distributed Memory to Meet Multiclass Workload Response Time Goals

Markus Sinnwell
SAP AG, Business Information Warehouse
P.O. Box 1461, 69185 Walldorf, Germany
E-Mail: [email protected]
(This research was conducted while the author was at the University of the Saarland.)

Arnd Christian König
Department of Computer Science, University of the Saarland
P.O. Box 151150, 66041 Saarbrücken, Germany
E-Mail: [email protected]

Abstract

In this paper we present an online method for managing a goal-oriented buffer partitioning in the distributed memory of a network of workstations. Our algorithm implements a feedback mechanism which dynamically changes the sizes of dedicated buffer areas, and thereby the buffer hit rates of the different classes, in such a way that user-specified response time goals are satisfied. The aggregate size of the buffer memory across all network nodes remains constant; only the partitioning is changed. The algorithm is based on efficiently approximating the trajectory of the per-class response time curves as a function of the available buffer. Changes in the workload that would lead to a violation of response time goals are counteracted by adjusting the buffer allocation accordingly. For local replacement decisions, we integrate a cost-based buffer replacement algorithm to fit into our goal-oriented approach. We have implemented our algorithm in a detailed simulation prototype, and we present first results obtained from this prototype.

1. Introduction

For database requests, varying resource consumption combined with different user priorities leads to large variations in the response time. For example, there is an increasing number of systems in which – besides the normal OLTP workload – complex decision-support queries are executed. Without effective load control, the high resource consumption of such decision-support queries will slow down short-running OLTP transactions excessively. Besides their complexity, the priority of queries should be considered as well when resources are allocated. A query which has a very firm deadline should receive all the resources it needs, while a query which runs in background mode should only be allowed to use the remaining resources. Because of this, it is reasonable to divide transactions into different classes whose user requirements can be expressed by a Service Level Agreement [20]. A possible way to specify such agreements is to define response time constraints for each class. When multiclass workloads are considered, it is not sufficient to minimize the mean response time over all classes. As the response time largely depends on the number of disk accesses, changing the buffer hit rate is an appropriate means of achieving response time goals. A method of controlling buffer hit rates is the partitioning of the aggregate buffer area into separate regions which are dedicated to caching pages belonging to a specific class only. While such dedicated buffer areas speed up the operations of the corresponding classes, operations of other classes will be slowed down, since the overall buffer size is not changed. While in commercial database systems (like e.g. DB2 [19]) methods exist to partition the local cache buffer into separate pools, the determination of the sizes of the various buffer pools is a very hard optimization problem, even if the workload remains constant and the workload parameters are known.
Suboptimal partitioning can result in poor system performance, as some partitions remain underutilized while the buffer pools of other classes are too small to satisfy the given goals. This problem becomes even worse if the workload evolves over time. Further complications for this manual approach arise when we expand a single-server environment to a network of workstations, also known as a NOW [1]. Up to now, NOWs have mainly been used because of their aggregated CPU power, but this will change as data-intensive applications start to exploit the performance capacity of distributed disks and the aggregated memory for caching benefits. In addition to the two-level storage hierarchy of the single server, the aggregate memory of the NOW introduces a new level of the storage hierarchy, namely the remote cache, which has to be considered when a buffer partitioning is determined. All these problems make an automatic, dynamic adaptation of the buffer partitioning based on the actual workload parameters the only viable approach to meeting multiclass response time goals.

In this paper we present a novel method that has the following salient properties:

- It is completely automated in that it determines the appropriate sizes of the dedicated buffers for each class on every node, by essentially approximating a solution for the underlying combinatorial optimization problem. With our approach it is therefore sufficient to specify the performance goal itself and not the system parameters needed to achieve this goal. This is a major improvement in administration and a further step towards a self-tuning database system.

- It is dynamic in that it copes with evolving workload characteristics and also allows dynamic adjustments of the class-specific response time goals. This is achieved by running the approximative optimization of the buffer partition sizes continuously and incrementally.

- It is light-weight in that it does not incur much overhead. The collection of input data for the optimization is distributed across all nodes, and messages have to be exchanged only infrequently. Furthermore, the memory and CPU consumption of the approximation algorithm are small.

- It is general in that it supports arbitrary workloads. In particular, the method does not require the data in the buffer partitions of different classes to be disjoint, but rather allows sharing across classes.

Although the buffer partitioning algorithm that we present in this paper can be used in combination with almost every replacement strategy, optimal usage of the aggregate memory will generally only be possible when the actual workload and system characteristics are taken into account. Therefore we incorporate the cost-based remote cache algorithm developed in [27, 26] into our dynamic multiclass buffer partitioning method.

The remainder of this paper is organized as follows: In Section 2 we review existing techniques for goal-oriented workload management as well as techniques for remote caching. After defining the system and workload characteristics in Section 3, we present the computation of the goal-oriented buffer partitioning in Section 4. In Section 5 we use the goal-oriented buffer partitioning method introduced in Section 4 to derive a distributed implementation. To take full advantage of the aggregate memory, we consider the combination of goal-oriented buffering with a cost-based remote cache replacement algorithm in Section 6.
Section 7 presents the setup and first results of an experimental study which we have carried out in a detailed simulation prototype. Finally, we summarize our conclusions and discuss future work in Section 8.

2. Related Work

General goal-oriented methods for automatic database tuning are described in [23, 12, 22]. All these approaches dynamically check the satisfaction of given goals; if violations are observed, appropriate countermeasures are invoked. The methods differ in the countermeasures employed. While [23] only introduces a general framework for goal-oriented methods, the other methods use either dynamic routing decisions in a shared-nothing OLTP environment [12] or the variation of the multiprogramming level and the transaction priorities on a single server [22] to achieve the given goals. In contrast to our work, no goal-oriented buffering is considered.

Goal-oriented buffering has been considered in a number of papers [5, 7, 6, 8]. In [5] a method called fragment fencing was introduced, which was later replaced by class fencing in [6]. The goal of both methods is to minimize the mean response time of a so-called No-Goal class while satisfying the response time goals of the different Goal classes. To achieve this, the buffer can be dynamically partitioned and dedicated to a class. For example, if a class k cannot meet its goal, a certain amount of the global buffer is dedicated solely to pages of class k transactions. Assuming that the transactions are disk-bound, this reduces the average response time of class k, as the buffer hit rate is increased. On the other hand, the dedicated buffer area is decreased if the mean response time of a class is below the goal, since the no-goal class will profit from the freed buffer space. Fragment fencing and class fencing differ in the way they estimate the amount of dedicated buffer necessary to meet a given response time goal. While fragment fencing assumes a direct proportionality between the buffer space and the response time, class fencing only assumes a proportionality between the miss rate and the response time. The necessary dependency between the miss rate and the buffer space is derived by a linear extrapolation of previously measured values of the buffer hit rate as a function of the buffer space. This method ensures fast convergence as long as the curve describing the dependency between the miss rate and the buffer space is concave. In [7] this assumption has been validated for common buffer replacement algorithms by an empirical study.

In [8] the dynamic tuning algorithm is described. This method also uses dedicated buffer pools to speed up the response times of classes. The algorithm tries to find a state in which the maximum over all performance indices is minimal, where the performance index of a class k is defined as the ratio of the observed and the goal response time of class k operations. To achieve this, the algorithm computes the effects of small changes in the buffer partitioning on the performance index and only carries out those changes which lead to an improved system state.

The main disadvantage of all goal-oriented buffering methods described so far is that they are designed for a single server. In this paper we combine goal-oriented buffering with distributed caching. Distributed or remote caching tries to exploit the aggregated buffer in a network of workstations. The main assumption motivating distributed caching is that a memory access over a local network is faster than a local disk access.
Many different heuristics, which all try to maximize the global cache hit rate and thereby reduce the number of disk accesses, have been proposed in the past [13, 9, 11, 24, 18]. Although this high global cache hit rate often leads to improved system performance, disregarding the current resource utilization can decrease performance under some circumstances. Therefore [27, 28, 26] have proposed online methods which try to balance between an egoistic (maximizing the local buffer hit rate) and an altruistic (maximizing the global buffer hit rate) behavior, depending on the current load and the system parameters.

3. System and Workload Characterization

The system that we consider consists of N nodes which are interconnected by a fast network, and every node i has a reserved area of SIZE_i bytes of main memory which can be used as a page buffer. Furthermore, each node is connected to a local disk. We assume that each data page has a permanent, disk-resident copy at a specific node called its home. The homes themselves are distributed across the nodes using a hash function or some catalog-driven partitioning function. The locally reserved buffer area is managed, depending on the number of dedicated buffers, either by one or by several local buffer managers. Independent of the actual implementation of the buffer managers, our goal-oriented partitioning algorithm is only based on the assumption that increasing the size of any local buffer of a class will increase the buffer hit rate and thereby decrease the mean response time for that class. Although [2] gives a counterexample for this condition, using a FIFO replacement algorithm on a single node, this assumption should be satisfied by virtually all practically used replacement policies.

We assume that the external workload consists of several operations, which can arrive at any node of the local network. Each single operation can further be subdivided into several page accesses, each of which is executed by data-shipping, i.e. all the requested pages are copied to the node on which the operation has been initiated. All operations are assumed to be disk-bound, which means that the response time is mainly determined by the number of page accesses that cannot be satisfied by cache hits. Although we only consider read requests in this paper, it is also possible to incorporate write requests in our model. In the presence of updates we have to ensure the transactional properties of the operations, i.e. atomicity, isolation and durability. Isolation can be guaranteed by the (distributed) 2-phase-locking protocol [10], atomicity in a distributed system can be achieved by the 2-phase-commit protocol [15], and durability can be guaranteed by the WAL (Write-Ahead-Logging) principle [4].

Depending on the user-defined response time goals, we can group the operations into separate classes. Although it is possible to group arbitrary operations with the same response time goal together, we believe that data affinity and resource consumption should be considered, too. Ideally, only operations which access the same pages should be grouped together, and furthermore there should be no two operations which access the same pages but belong to different classes. The effects which may result from the violation of this ideal grouping are shown by the next two examples.
Example 1: Let the class k consist of the two operations op_1 and op_2. We assume that the complexity (i.e. the number of page accesses per operation) of both operations is equal, but the sets of objects accessed are disjoint. If the inter-arrival time of op_1 is much smaller than that of op_2, a dedicated buffer for class k will contain almost solely pages which are accessed by op_1 (assuming a buffer replacement strategy that considers the access frequency of objects, like e.g. LRU). Therefore changes of the buffer size will only affect the response time of the operation op_1. Although we can change the buffer size so that the goal for the mean response time over both operations is met, this probably does not reflect the intention of the users, as op_1 operations will be much faster and op_2 operations much slower than the class mean.

Example 2: In this second example we assume that the operation op_1 belonging to class k_1 and the operation op_2 belonging to class k_2 access the same objects. Furthermore, let the goal response time of class k_1 be much tighter than that of class k_2 (class k_2 might even be a class without a given goal). To achieve the response time goal of class k_1 we have to provide a sufficiently large dedicated buffer area. As accesses of the operation op_2 can also profit from the buffered pages, the mean response time of this class is probably accelerated beyond the user-defined goal.

Summarizing these examples, we have seen that the situation of example 1 leads to undesirable behavior, while that of example 2 simply speeds up a class more than necessary. As this speedup does not incur any additional costs (the dedicated buffer of class k_1 in the example has to be chosen that large anyway), we propose the following simple assignment schema: a class consists of all those operations which access the same objects and which have the same response time goals. Such a partitioning of the operations is always possible, and it guarantees that the problem of example 1 never occurs. In the following we assume that all operations with given response time goals are grouped into K classes, numbered from 1 to K. These classes are called Goal classes. In addition, we introduce a special No-Goal class, numbered class 0, which subsumes all operations without a given response time goal.

4. Approximating an Optimal Buffer Partitioning

In this section we will derive a method of computing the buffer pool sizes of a single class k on the different nodes, so that the mean response time of all operations of this class will satisfy the given goal. In contrast to the centralized problem, which was addressed in [6], we have an additional degree of freedom: besides the possibility to influence the response time by changing the size of the overall dedicated buffer, we further have to decide on which node the buffer pool size for the considered class has to be changed. Similar to class fencing [6], our approach also aims to minimize the response time of the no-goal class under the constraint that every goal class satisfies its response time goal. To reduce the complexity of this optimization problem, we assume that – with the exception of the considered class k – all other classes actually meet their response time goals. (This assumption is only made for the theoretical derivation; in Section 5, when describing an implementation, we will allow the concurrent adaptation of several classes.) For this class k our algorithm will compute a new allocation, which ideally should lead to a situation where class k satisfies its goal, too, or at least reduces the difference between its mean response time and its goal.
We will do this by generalizing class fencing to N dimensions (where N is the number of nodes), so that we can predict the new response time by extrapolation of measured response times of former allocations. We will use this kind of approximation to derive the objective function for the no-goal class as well as the constraint which ensures the satisfaction of the response time goal for the goal class k.

As we use remote caching, the local response time RT_{k,i} of a class k operation on node i depends on the local as well as on the remote cache hit rate, which in turn depends on the local (LM_{k,i}) and the remote buffer size (RM_{k,i}) of class k. Similar to the one-server approach of class fencing, the relation describing the response time as a function of the buffer size is a-priori unknown, and only some tuples of this relation are given, based on previous measurements. These tuples can now be used to compute the coefficients \alpha_{k,i}, \beta_{k,i} and \gamma_{k,i} of a linear approximation of the local response time function:

RT_{k,i}(LM_{k,i}, RM_{k,i}) = \alpha_{k,i} \cdot LM_{k,i} + \beta_{k,i} \cdot RM_{k,i} + \gamma_{k,i}   (1)

Since the size of the remote cache for class k at node i is determined by the sizes of the local caches of all other nodes, we can use the equation

RM_{k,i} = \sum_{j=1, j \neq i}^{N} LM_{k,j}   (2)

to transform equation (1) into:

RT_{k,i}(LM_{k,1}, \ldots, LM_{k,N}) = \alpha_{k,i} \cdot LM_{k,i} + \beta_{k,i} \cdot \sum_{j=1, j \neq i}^{N} LM_{k,j} + \gamma_{k,i}   (3)

As we do not specify local goals for every node, but only one goal for the mean response time of a class over all nodes, we consider the weighted sum of all local response times for that class. The weighting factors that we use are the arrival rates \lambda_{k,i} of class k operations on node i. Hence, the mean response time RT_k can be expressed as:

RT_k(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \lambda_{k,i} \cdot RT_{k,i}(LM_{k,1}, \ldots, LM_{k,N})
   = \sum_{i=1}^{N} \Big( \lambda_{k,i} \alpha_{k,i} + \sum_{j=1, j \neq i}^{N} \lambda_{k,j} \beta_{k,j} \Big) \cdot LM_{k,i} + \sum_{i=1}^{N} \lambda_{k,i} \gamma_{k,i}   (4)

where we abbreviate the gradient of LM_{k,i} by \delta_{k,i} and the constant term by \gamma_k. Equation (4) can be seen as an N-dimensional hyperplane which approximates the dependency between the mean response time and the partitioning of the local buffers. To derive the constraint which ensures that the new partition will satisfy the given goal, we only have to set equation (4) equal to the response time goal RT_k^{goal} for the class k:

RT_k^{goal} = \sum_{i=1}^{N} \delta_{k,i} \cdot LM_{k,i} + \gamma_k   (5)

Before we derive the objective function of our optimization problem, we want to introduce some bounds on the sizes of the dedicated buffers, which are imposed by the limited main memory on every node. Clearly, the local buffer pool size of class k on any node cannot be less than zero, and it is also impossible that the sum of all local buffers is larger than the locally reserved cache memory size (SIZE_i). In our terminology we can express these bounds in the following way:

0 \leq LM_{k,i} \leq SIZE_i - \sum_{l=1, l \neq k}^{K} LM_{l,i}   (6)

To meet the response time goal we can choose any allocation which satisfies equations (5) and (6).
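For illustration, the following sketch shows how the coefficients \delta_{k,i} and \gamma_k of equation (4) can be fitted once N+1 suitable measure points are available (cf. the collect phase in Section 5). This is a minimal sketch in Python/NumPy under our own naming conventions, not the incremental implementation described later in the paper:

    import numpy as np

    def fit_hyperplane(measure_points, response_times):
        # measure_points : (N+1) x N matrix; row m holds the buffer sizes
        #                  LM_{k,1}, ..., LM_{k,N} of measure point m.
        # response_times : the N+1 observed mean response times RT_k.
        # Returns the gradients delta_{k,i} and the constant gamma_k of
        # equation (4); the solution is unique iff the difference vectors
        # of the measure points are linearly independent.
        a = np.asarray(measure_points, dtype=float)
        m, n = a.shape
        assert m == n + 1, "exactly N+1 measure points are required"
        a = np.hstack([a, np.ones((m, 1))])  # absorb gamma_k as a column of ones
        coeffs = np.linalg.solve(a, np.asarray(response_times, dtype=float))
        return coeffs[:-1], coeffs[-1]  # (delta_{k,1..N}, gamma_k)

    # Example for N = 3 nodes: four measured (partitioning, response time) pairs.
    lm = [[1e6, 1e6, 1e6], [2e6, 1e6, 1e6], [1e6, 2e6, 1e6], [1e6, 1e6, 2e6]]
    rt = [4.0, 3.2, 3.5, 3.8]  # hypothetical mean response times in msec
    delta_k, gamma_k = fit_hyperplane(lm, rt)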
As we aim to minimize the mean response time of the no-goal class, we will now derive the objective function for this minimization problem. Analogously to the approximation of the response time as a function of the buffer size for the class k operations, we can also approximate the response time RT_{0,i} of the no-goal class on a node i by:

RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = \alpha_{0,i} \Big( SIZE_i - \sum_{l=1}^{K} LM_{l,i} \Big) + \beta_{0,i} \sum_{j=1, j \neq i}^{N} \Big( SIZE_j - \sum_{l=1}^{K} LM_{l,j} \Big) + \gamma_{0,i}   (7)

In this formula we assume that the buffer on node i that can be used by no-goal class operations equals the size of the complete reserved memory on this node minus the sizes of all dedicated buffers on this node. In addition we assume that we change the allocation of only one class (namely class k) at a time, and therefore the response time of the no-goal class on node i depends in this case only on the local buffer sizes of class k. Using this fact, we can rewrite formula (7) by collecting all constant terms in a new constant \tilde{\gamma}_{0,i}:

RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = (-\alpha_{0,i}) \cdot LM_{k,i} + (-\beta_{0,i}) \cdot \sum_{j=1, j \neq i}^{N} LM_{k,j} + \tilde{\gamma}_{0,i}   (8)

Computing the weighted mean of the local response times and putting together the constants (analogously to equation (4)) we get:

RT_0(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \lambda_{0,i} \cdot RT_{0,i}(LM_{k,1}, \ldots, LM_{k,N}) = \sum_{i=1}^{N} \delta_{0,i} \cdot LM_{k,i} + \gamma_0   (9)

It should be noted that, in contrast to equation (4), all the gradients \delta_{0,i} are now greater than zero, i.e. the response time of the no-goal class increases when the local buffer size for the class k is increased at any node. With these results, we can compute our final buffer partitioning by solving the following linear programming problem with the variables LM_{k,i} (1 \leq i \leq N):

Minimize:  \sum_{i=1}^{N} \delta_{0,i} \cdot LM_{k,i} + \gamma_0

under the constraint:  RT_k^{goal} = \sum_{i=1}^{N} \delta_{k,i} \cdot LM_{k,i} + \gamma_k

considering the bounds:  0 \leq LM_{k,i} \leq SIZE_i - \sum_{l=1, l \neq k}^{K} LM_{l,i}  for all nodes i

Although – when considering arbitrary response time curves and approximation planes – it cannot be proven that solving the linear program always results in an improved partitioning, these special cases are irrelevant for our purposes, since they correspond to states where the goals of the goal classes are not violated [16].
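To make this optimization step concrete, the following hedged sketch solves exactly this linear program with an off-the-shelf solver. We assume SciPy's linprog here purely for illustration; the authors use the lp-solve library [3], and all variable names are ours:

    import numpy as np
    from scipy.optimize import linprog

    def new_partitioning(delta_0, delta_k, gamma_k, rt_goal, size, lm_other):
        # delta_0  : gradients of the no-goal objective (equation 9)
        # delta_k  : gradients of the goal-class constraint (equation 5)
        # gamma_k  : constant term of equation (5)
        # rt_goal  : response time goal RT_k^goal of class k
        # size     : reserved buffer size SIZE_i of every node
        # lm_other : per node, the summed dedicated buffers of all other classes
        n = len(size)
        c = np.asarray(delta_0, dtype=float)   # gamma_0 is constant, drop it
        a_eq = np.asarray(delta_k, dtype=float).reshape(1, n)
        b_eq = [rt_goal - gamma_k]             # equation (5)
        bounds = [(0.0, s - o) for s, o in zip(size, lm_other)]  # bounds (6)
        res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=bounds, method="highs")
        return res.x if res.success else None  # new LM_{k,1..N}, or None

In this sketch an infeasible program (a goal that cannot be met within the bounds) simply yields None, leaving it to the surrounding feedback loop to retry with the next set of measurements.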
5. Implementation

Having described the partitioning algorithm in Section 4, we will now demonstrate how this computation can be embedded into a distributed system for online decisions. We assume that for every goal class there exists one agent process on every node and additionally a single coordinator process, which can be located on any node. Furthermore, for the no-goal class, one agent on every node is needed. Because of load balancing issues we allow the coordinator to be placed separately for every class, and even a migration of a coordinator from one node to another node is possible, as long as all corresponding agents are informed. In Figure 1 we have sketched an environment with 4 nodes and 4 classes (3 goal classes and the no-goal class). Our algorithm itself consists of five phases that form a feedback-controlled loop in which the satisfaction of the goal is checked. If the goal is violated, a recomputation of the buffer partitioning is initiated. In the following we will describe these phases for a class k in more detail.

[Figure 1. Environment with 4 nodes and 3 goal classes. For each node the figure shows the local agents for the goal classes and for the no-goal class, the locally reserved buffer areas for goal class operations, the buffer for the no-goal class, and the coordinator processes for the goal classes; the nodes are connected by the network.]

(a) Collect Phase of the Local Agent: Every time a class k operation is locally initiated, the agent computes the inter-arrival time, and upon completion of the operation the local response time for this class is updated. To prevent heavy fluctuations caused by stochastic noise, we record the response times over a sufficiently long observation interval. If a significant change in the observed response time is recorded, the appropriate coordinator is informed about the new value. While the information collected by the goal class agents is only used by the optimization process of the corresponding goal class, changes registered by the no-goal agents have to be propagated to all goal class coordinators. In addition to the inter-arrival and the mean response time, an increase or a decrease of any local buffer size will influence the buffer size of the no-goal class, and therefore a change of this value is propagated to the goal class coordinators, too. It should be emphasized that the agents do not have to run synchronously on the different nodes, because the coordinator of a class k remembers the most recently received information from every class k agent and every no-goal agent.

(b) Collect Phase on the Coordinator: In this phase the coordinator awaits the data that is sent by the agents in phase (a). After receiving a new message, the received information is used either for the creation of a new measure point (if the partitioning has changed since the last measure point) or for the update of the last measure point (if only the response time has changed). In the first case we have to ensure that a unique approximation of the N-dimensional hyperplane is still possible, as this is needed by the optimization process. Let m_1 be the most recent measure point; we can guarantee a unique approximation by keeping the N+1 most recent measure points such that the difference vectors m_1 - m_2, ..., m_1 - m_{N+1} are linearly independent. Although this method ensures the unique linear approximation once there are enough points, we still have to address the problem of what to do during "warm-up", when there are fewer than N+1 measure points. In this case we can use simple heuristics, like e.g. allocating a certain percentage of the undedicated main memory on every node. To quickly overcome this warm-up period, we have to take care that every new partitioning leads to a new linearly independent measure point, so that the next iteration of the feedback-controlled loop can rely on one additional measure point.

(c) Check Phase: In this phase the coordinator computes the weighted mean response time RT_k for the class k according to equation (4), and afterwards this time is checked against the given response time goal. Due to statistical variance in the response time, we consider a goal to be violated only if it differs by more than a certain tolerance ε from the given goal. To allow a workload-dependent adaptation of ε, we use the method of [5]. If a goal is violated we proceed to phase (d); otherwise the current iteration of the feedback-controlled loop is finished and we return to the collection phases.

(d) Optimization Phase: During this phase the class k coordinator determines the new partitioning for the local buffers of class k according to the method described in Section 4. This involves the approximation of the hyperplane based on the measure points registered in phase (b), followed by the minimization process. Having determined the new buffer partitioning, the new buffer pool sizes are sent to all agents that are subject to changes.

(e) Allocation Phase: In this phase the local agents receive the output of the optimization phase from the coordinator and change their local allocation schemas accordingly.
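As an illustration of the bookkeeping in phase (b), the sketch below keeps the N+1 most recent measure points and accepts a new point only if the difference vectors remain linearly independent. It is a simplified variant (a plain NumPy rank test instead of the incremental Gauss algorithm discussed below, and it simply rejects points that would make the fit ambiguous); class and method names are our own:

    import numpy as np

    class MeasurePoints:
        # Keeps the N+1 most recent measure points whose difference vectors
        # m1 - m2, ..., m1 - m_{N+1} are linearly independent (phase (b)).
        def __init__(self, n_nodes):
            self.n = n_nodes
            self.points = []  # (partitioning, response time), newest first

        def update_newest(self, response_time):
            # Only the response time changed: update the last measure point.
            partitioning, _ = self.points[0]
            self.points[0] = (partitioning, response_time)

        def add(self, partitioning, response_time):
            # The partitioning changed: try to create a new measure point.
            cand = [(np.asarray(partitioning, float), response_time)] + self.points
            cand = cand[: self.n + 1]
            if len(cand) == self.n + 1:
                m1 = cand[0][0]
                diffs = np.array([m1 - p for p, _ in cand[1:]])
                if np.linalg.matrix_rank(diffs) < self.n:
                    return False  # hyperplane fit would no longer be unique
            self.points = cand
            return True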
Although the computation in Section 4 assumes that there are no concurrent adaptations for different classes, we drop this restriction in our implementation to reduce the overhead for synchronization and to improve adaptivity. Therefore it is now possible that a class k agent cannot allocate the desired amount of memory, because a local agent of another class k' has already reserved this area. In this case the local agent of class k allocates as much memory as possible and informs the coordinator about the difference, so that the coordinator can update its information. Further actions are not triggered: as the algorithm implements a feedback mechanism, if the goal is not reached with this partitioning, the algorithm will consider the new information in its next iteration.

Computational Complexity

Having defined the phases, we will now study the computational complexity of the different tasks. Here we restrict ourselves to the phases (b) and (d), since the other phases, which are executed by the agents, involve only trivial computations. In phase (b) we have to determine the N+1 newest "linearly independent" points, where N is defined to be the number of nodes in the system. Since this involves testing whether the corresponding system of linear equations is singular, we use an incremental Gauss algorithm [14]. This algorithm takes advantage of the only marginal changes between two computations (a new measure point replaces an old one), and thereby reduces the complexity of the standard Gauss algorithm to O(N^2). In phase (d) we first have to solve a system of linear equations to determine the parameters for the approximation of the hyperplane. Similar to phase (b), we can again use the incremental Gauss algorithm, so that we achieve a worst case complexity of O(N^2). Finally, the coordinator has to compute the solution of the linear program introduced in Section 4. For this task we have chosen an implementation [3] of the simplex algorithm, which, although having an exponential worst case complexity, has been proven to be linear in the number of variables and constraints in the mean [25].

Besides these more theoretical considerations, we have measured the average time of a single execution of the different tasks on a SUN Sparc 4 workstation. The results for different numbers of nodes are shown in Table 1.

Number of Nodes   Lin. Independence   Approximation   Optimization   Overall
       5                0.1               0.24            0.9          1.24
      10                0.2               0.6             1.6          2.4
      20                0.7               2.7             2.3          5.7
      30                2.4               5.5             2.7         10.6
      40                2.8              11.1             3.3         17.2
      50                4.2              14.8             5.4         24.4

Table 1. CPU execution time in milliseconds.

Table 1 shows that the overhead incurred by the coordinator process is very low. In addition, we have to remember that these tasks are only executed on demand when a class violates its response time goal; otherwise no actions are needed. Furthermore, we can benefit from the distribution of the coordinators among different nodes, as this allows the application of load-balancing methods (distribution and migration of coordinator processes) in the case of heavy CPU contention on a specific node.

6. Using a Cost-Based Buffer Manager

Up to now we have not specified the replacement policy for the buffer managers.
Although there are many different policies which satisfy the precondition that we stated in Section 3 (increasing the buffer size leads to a decrease in response time), [17, 27, 28, 26] have shown that in general an optimal usage of the remote cache can be achieved neither by the maximization of the global hit rate (altruistic behavior) nor by the maximization of the local hit rate (egoistic behavior). Therefore, in this section we will describe the integration of the cost-based replacement policy of [27, 26] into our goal-oriented partitioning algorithm.

The central idea of [27, 26] is the notion of the benefit of a cached page. The benefit of a page is defined as the difference in access cost between keeping the page in the local cache versus dropping it. Instead of using a simple stack, every buffer manager uses a priority queue to keep the pages sorted by their benefit, and in the case of a buffer replacement action, the page with the locally lowest benefit is replaced. To compute the benefit of a page, every buffer manager keeps track of whether its local copy is the last cached copy of that page in the system, as well as of the local and the global heat of every page, the heat being defined as the number of accesses (locally resp. globally) per time unit. In the implementation, the LRU-k algorithm [21] is used to approximate the heat. To reduce the overhead of the information dissemination, threshold-based protocols are used which allow the propagation and the update of the non-local information. Besides this page-specific information, the access costs to the different levels of the storage hierarchy are needed, too. By tagging each page request with the storage level the page has been accessed from, this information can be gathered with low overhead by observing the response times of already finished requests.

As our goal-oriented buffering schema allows several buffer managers per node (one for the no-goal class and at most one for every goal class), we have to adapt the original algorithm slightly. To ensure a correct ranking in the no-goal buffer as well as in the various goal buffers, we have to collect the different class heats as well as the accumulated heat over all accesses. For the goal classes we only have to consider those local heat values for which there exists a dedicated local buffer. Furthermore, we have to collect the global heat only for those classes for which at least one dedicated buffer area exists in the system. Finally, we can reduce the overhead of the bookkeeping by collecting the heat information of a class k on an object p only if at least one operation from class k accesses the object p. But as this information is unknown a-priori, we use a method which dynamically creates and deletes the heat information on demand.

A single access to a page p by an operation op belonging to a class k is now executed in the following way. First the accumulated heat for this page is updated. If a dedicated buffer for class k exists on the considered node and the page is not already cached locally in another dedicated buffer, the page is acquired (either from the local no-goal buffer, from which it is removed, or via remote cache or disk), the class-specific heat is updated, and the page is inserted into the dedicated buffer of class k. If this causes other pages to be dropped from the dedicated buffer, these are removed from the cache of the local node completely. In case the requested page already resides within the dedicated buffer, only the class-specific heat is updated. In case there is no dedicated buffer for class k, the page (if not cached there already) is acquired and inserted into the no-goal buffer; replacement victims are again dropped from the buffer of the local node completely.
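The following sketch illustrates the benefit-ordered replacement in a single buffer pool. It is our own heavily simplified reduction of the policy of [27, 26], with assumed cost constants and heat bookkeeping; the original additionally approximates the heat via LRU-k and disseminates non-local information via threshold-based protocols:

    import heapq

    class BenefitBuffer:
        # Evicts the page with the locally lowest benefit, where the benefit
        # of a page is its heat times the cost of re-fetching it: a page
        # whose local copy is the last cached copy in the system would have
        # to be re-read from disk, any other page can still be served from
        # a remote cache.
        def __init__(self, capacity, cost_remote, cost_disk):
            self.capacity = capacity
            self.cost_remote = cost_remote
            self.cost_disk = cost_disk
            self.pages = {}  # page -> (heat, last_copy flag)
            self.heap = []   # (benefit, page), lazily invalidated min-heap

        def _benefit(self, heat, last_copy):
            return heat * (self.cost_disk if last_copy else self.cost_remote)

        def insert(self, page, heat, last_copy):
            self.pages[page] = (heat, last_copy)
            heapq.heappush(self.heap, (self._benefit(heat, last_copy), page))
            evicted = []
            while len(self.pages) > self.capacity:
                benefit, victim = heapq.heappop(self.heap)
                # Skip stale heap entries of pages whose benefit has changed.
                if victim in self.pages and \
                   self._benefit(*self.pages[victim]) == benefit:
                    del self.pages[victim]
                    evicted.append(victim)  # dropped from the local cache
            return evicted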
7. Simulation Experiments

7.1. Simulation Setup

In order to assess the validity of our theoretical model, we conducted an experimental study. (The entire study, including all system and database parameters, can be found in [16].) For this we have integrated our approach into the detailed simulation prototype described in [26]. In all the experiments described in this section we use an environment consisting of 3 nodes (CPU speed 100 MIPS), which are connected via a fast local network (transfer rate of 100 Mbit/s). Each node employs a common SCSI disk and 2 MB of cache space which can be used for caching pages. We have chosen such a small buffer size to limit the execution time of the simulations. The database is modeled as a set of M = 2000 data pages (4 KByte), which are distributed in a round-robin fashion over all nodes' disks. For each node and each operation of the different classes, a stream of accesses to the corresponding pages is generated. A single operation can consist of one or more accesses. The identities of the accessed pages are distributed via a Zipfian distribution with a skew parameter θ, i.e. the local access frequency of a page with index p is C \cdot 1/p^{\theta} with C = 1 / \sum_{q=1}^{M} 1/q^{\theta}. Operations are generated at each node independently, with their inter-arrival times 1/\lambda_{k,i} assumed to be exponentially distributed. The length of the observation interval is set to 5000 msec, which still allows fast adaptation to changing parameters and at the same time smoothes variations caused by stochastic noise.

In our experiments we want to show that our method is able to find a buffer partitioning which satisfies the user-given goals; furthermore, we are interested in the speed of convergence, i.e. the number of iterations of the feedback-controlled loop necessary to find such a partitioning. In order to capture a variety of different partitions, we count the number of intervals in which the system reaches a state satisfying the response time goal, changing the response time goal after four "satisfied" intervals. The new goal is randomly chosen so that it should be satisfiable under the current workload and also differs significantly from the current goal. All experiments were repeated sufficiently often to obtain an accuracy of less than one iteration for the speed of convergence with a statistical confidence of 99 percent.
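For illustration, page accesses following this pattern can be generated as in the sketch below (a minimal NumPy version; the function and its parameters are our own and not part of the simulation prototype):

    import numpy as np

    def zipf_page_stream(m_pages, theta, n_accesses, rng=None):
        # Draws page indices 1..M with P(p) = C / p**theta, C normalizing.
        rng = rng or np.random.default_rng()
        weights = 1.0 / np.arange(1, m_pages + 1, dtype=float) ** theta
        prob = weights / weights.sum()
        return rng.choice(np.arange(1, m_pages + 1), size=n_accesses, p=prob)

    # M = 2000 pages as in the experiments; theta = 0 yields a uniform
    # access pattern, theta = 1 a highly skewed one.
    accesses = zipf_page_stream(2000, theta=1.0, n_accesses=10_000)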
7.2. Base Experiment

In our first experiment we consider a two-class scenario (one goal class and the no-goal class). We assume that each operation of both classes accesses 4 pages and that there is no data sharing, i.e. there is no page which is accessed by both classes. Finally, we have chosen the access skew of both classes equal to θ = 0. Figure 2 shows the resulting changes in the observed overall response time, the response time goal and the systemwide dedicated memory (the total size of cache buffers dedicated to the goal class). As expected, the observed response time is closely related to the size of the dedicated buffer. Furthermore, the approximation does result in a partitioning satisfying the response time goal after only a small number of observation intervals. This, in fact, has been true for all experiments conducted, including experiments with vastly more complex operations, dynamically changing workloads or a larger number of nodes. Because we aim to illustrate the adaptivity of our approach, there is never a larger number of consecutive intervals with a constant goal in this experiment. Because of this, it is not possible to effectively calibrate the tolerance ε [5], which would prevent the coordinator from becoming active on small deviations from the given goal. This explains the oscillation seen in some parts of the figure. Even so, the system does not exhibit significant changes in the partitioning after reaching a partitioning that satisfies the response time goal. Furthermore, this problem disappears in more realistic systems, in which response time goals normally do not change in quick succession.

[Figure 2. Variation of the response time (observed response time and response time goal, in msec) and the total dedicated cache (in bytes) over 80 elapsed observation intervals.]

7.3. Variation of the Access Skew

In our study we have observed that it is not the complexity of the different operations, but the skew of the accesses that is of major importance for the speed of convergence. In theory, this can be explained by the fact that our algorithm generates new partitionings by approximating the shape of the response time curve through linear programming. Therefore, with a skew of θ = 0, which corresponds to a uniform distribution, the response time as a function of the buffer sizes theoretically equals a hyperplane and so can be matched almost ideally by our approximation hyperplane. With a higher skew, the difference between the response time function and the approximation plane increases. This theoretical consideration is confirmed by our simulation results, which are shown in Table 2. In order to be able to compare the results of different experiments, we choose the goals randomly from [goal_min, goal_max], where goal_min corresponds to the response time of the goal class when (under the chosen workload) 2/3 \cdot \sum_{i=1}^{3} SIZE_i of the cache memory is dedicated to it; in turn, goal_max corresponds to the response time achieved by 1/3 \cdot \sum_{i=1}^{3} SIZE_i of the cache being dedicated.

Skew θ        0      0.25    0.5     0.75    1
Iterations    1.84   2.41    3.55    3.88    3.95

Table 2. Convergence speed under varying skew θ.

The results show that an increase in skew leads to a decrease in the speed of convergence. Nevertheless, it should be noted that even in the case θ = 1, which corresponds to a very highly skewed distribution, on average fewer than 4 iterations of the feedback-controlled loop are sufficient to adapt to a change in the given goal.

7.4. Multiple Goal Classes

We have repeated these experiments with two goal classes k_1 and k_2 (RT_{k_1}^{goal} < RT_{k_2}^{goal}) and twice the amount of cache buffer memory at each node. With multiple classes, the time of convergence also depends on whether the sets of pages accessed by each class are disjoint or not. In the case of disjoint sets, the amount of memory dedicated to one class does not influence the performance of the other, and therefore we would expect to get the same results as in the base experiment. This is confirmed by our experiments, as the measured speed of convergence for each value of θ was identical to the results already shown in Table 2. However, this independence is no longer valid if we consider data sharing between the different goal classes.
Raising the percentage of sharing, we have observed that the size of the dedicated buffers of the class k_2 decreases gradually. This is due to the fact that this class can profit from the dedicated buffer of class k_1. Further increases in the sharing lead to a complete removal of the dedicated buffers of class k_2, and eventually – even without any dedicated buffers – class k_2 exceeds its goal solely by accessing pages from the buffers of class k_1. These observations match exactly the considerations made in Section 3, example 2. These findings were confirmed by experiments using more than two classes [16].

7.5. Overhead

Because of the length of the observation interval and the small size of the messages, the messages used by our method only make up a fraction of the total network traffic (less than 0.1% in our experiments). Coupled with the CPU overhead described in Section 5 and the fact that very little additional memory is needed, the overall overhead of our method is not significant in the setting of a distributed database system.

8. Conclusion and Further Work

In this paper we presented an online method for distributed, goal-oriented caching in a network of workstations. Our approach is built out of two components: an algorithm which computes a buffer partitioning according to the user-specified response time goals, and a cost-based buffer replacement algorithm which makes optimal use of these partitions. To allow online computation, we have described a distributed, low-overhead implementation of the partitioning algorithm in a detailed simulation prototype and presented some results generated by this prototype. In the future we plan to expand our simulation study. One aspect we want to focus on is the usage of other objective functions. In our current approach we only try to minimize the mean response time of the no-goal class, but some applications insist on more stringent conditions, like e.g. a given mean response time goal together with a maximal coefficient of variation among the different nodes. In this scenario, minimizing the mean response time of the no-goal class will in general not lead to the user-specified goal, and therefore a new objective function, like e.g. minimizing the variation, will be needed.

References

[1] T. Anderson, D. Culler, and D. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 15(1), February 1995.
[2] L. Belady, R. Nelson, and G. Shedler. An Anomaly in Space-Time Characteristics of Certain Programs Running in a Paging Machine. Communications of the ACM, 1969.
[3] M. Berkelaar, J. Dirks, and H. Schwab. lp-solve Library Version 2.1 and Documentation, 1997.
[4] P. Bernstein, N. Goodman, and V. Hadzilacos. Recovery Algorithms for Database Systems. In IFIP 9th World Computer Congress, September 1983.
[5] K. Brown, M. Carey, D. DeWitt, and M. Mehta. Managing Memory to Meet Multiclass Workload Response Time Goals. In 19th International Conference on Very Large Data Bases, 1993.
[6] K. Brown, M. Carey, and M. Livny. Goal-Oriented Buffer Management Revisited. In ACM SIGMOD Conference, 1996.
[7] K. P. Brown. Goal-Oriented Memory Allocation in Database Management Systems. PhD thesis, University of Wisconsin-Madison, 1995.
[8] J.-Y. Chung, D. Ferguson, G. Wang, C. Nikolaou, and J. Teng. Goal-Oriented Dynamic Buffer Pool Management for Database Systems. In International Conference on Engineering of Complex Computer Systems, 1995.
[9] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. Patterson.
Cooperative Caching: Using Remote Client Memory to Improve File System Performance. In 1st Symposium on Operating Systems Design and Implementation, 1994.
[10] K. Eswaran, J. Gray, R. Lorie, and I. Traiger. The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM, 19(11), November 1976.
[11] M. Feeley, W. Morgan, F. Pighin, A. Karlin, H. Levy, and C. Thekkath. Implementing Global Memory Management in a Workstation Cluster. In 15th ACM Symposium on Operating Systems Principles, 1995.
[12] D. Ferguson, C. Nikolaou, and L. Georgiadis. Goal Oriented, Adaptive Transaction Routing for High Performance Transaction Processing Systems. In 2nd International Conference on Parallel and Distributed Information Systems, San Diego, CA, January 1993.
[13] M. Franklin. Client Data Caching. Kluwer, 1996.
[14] P. Gill, W. Murray, and M. Wright. Numerical Linear Algebra and Optimization, volume 1. Addison-Wesley, 1991.
[15] J. Gray. Notes on Database Operating Systems. In Operating Systems: An Advanced Course. Springer, 1979.
[16] A. König. Memory Management in Distributed Database Systems Using Class-Oriented Performance Goals (in German). Diploma thesis, University of the Saarland, 1998.
[17] A. Leff, J. Wolf, and P. Yu. Policies for Efficient Memory Utilization in a Remote Caching Architecture. In 1st International Conference on Parallel and Distributed Information Systems, 1991.
[18] A. Leff, J. Wolf, and P. Yu. Efficient LRU-Based Buffering in a LAN Remote Caching Architecture. IEEE Transactions on Parallel and Distributed Systems, 7(2), 1996.
[19] C. Mullins. DB2 Developer's Guide: DB2 Performance Techniques for Application Programmers. Sams Publishing, 1993.
[20] J. Noonan. Automated Service Level Management and Its Supporting Technologies. Mainframe Journal, 1989.
[21] E. O'Neil, P. O'Neil, and G. Weikum. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In ACM SIGMOD Conference, 1993.
[22] E. Rahm. Goal-Oriented Performance Control for Transaction Processing. In 9th ITG/GI MMB Conference. VDE-Verlag, 1997.
[23] E. Rahm, D. Ferguson, L. Georgiadis, C. Nikolaou, G.-W. Su, M. Swanson, and G. Wang. Goal-Oriented Workload Management in Locally Distributed Transaction Systems. IBM Research Report RC 14712, T.J. Watson Research Center, 1989.
[24] P. Sarkar and J. Hartman. Efficient Cooperative Caching Using Hints. In 2nd Symposium on Operating Systems Design and Implementation, 1996.
[25] A. Schrijver. Theory of Linear and Integer Programming. Wiley, 1986.
[26] M. Sinnwell. Adaptive Caching in Distributed Information Systems (in German). PhD thesis, University of the Saarland, 1998.
[27] M. Sinnwell and G. Weikum. A Cost-Model-Based Online Method for Distributed Caching. In 13th International Conference on Data Engineering, 1997.
[28] S. Venkataraman, M. Livny, and J. Naughton. Memory Management for Scalable Web Data Servers. In 13th International Conference on Data Engineering, 1997.