Catalyst: GPU-assisted rapid memory deduplication in virtualization environments

Anshuj Garg, Debadatta Mishra and Purushottam Kulkarni
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
{anshujgarg, deba, puru}@cse.iitb.ac.in

Abstract

Content based page sharing techniques improve memory efficiency in virtualized systems by identifying and merging identical pages. Kernel Same-page Merging (KSM), a Linux kernel utility for page sharing, sequentially scans memory pages of virtual machines to deduplicate pages. Sequential scanning of pages has several undesirable side effects—wasted CPU cycles when no sharing opportunities exist, and a rate of discovery of sharing that depends on the scanning rate and corresponding CPU availability. In this work, we exploit the presence of GPUs on modern systems to enable rapid memory sharing through targeted scanning of pages. Our solution, Catalyst, works in two phases: a first phase in which pages of virtual machines are processed by the GPU to identify likely pages for sharing, and a second phase that performs page-level similarity checks on a targeted set of shareable pages. Opportunistic usage of the GPU to produce sharing hints enables rapid and low-overhead duplicate detection, and sharing of memory pages in virtualization environments. We evaluate Catalyst against various benchmarks and workloads to demonstrate that Catalyst can achieve higher memory sharing in less time compared to different scan rate configurations of KSM, at lower or comparable compute costs.

Keywords: Memory Deduplication, Graphics Processing Units, Virtualization, KSM

1. Introduction

Resource over-commitment is a sought-after strategy in virtualization-based infrastructure-as-a-service platforms [37, 38].
The basic premise of over-commitment is exploitation of non-overlapping peak resource demands and of policy-driven performance vs. resource allocation trade-offs. Along the memory management dimension, the main techniques to enable over-commitment are demand paging [37], ballooning [13, 37], deduplication [3, 4, 37] and hypervisor-based disk page caching [15, 18]. Each of these techniques has different cost-benefit tradeoffs. Hypervisor-based demand paging [37] is non-intrusive and requires no cooperation from guest operating systems, but is hampered by the lack of knowledge regarding the utility of pages for the guest OS. Memory ballooning [13, 37], while addressing this semantic gap, relies on guest cooperation. Hypervisor-based disk caching [15, 18] aims to explicitly manage disk caches centrally to increase memory efficiency by employing deduplication and compression across disk pages; this technique is also intrusive and is restricted to disk caches. System-wide memory deduplication—scanning and merging identical pages across VMs and the host [3, 17, 37]—is non-intrusive and can be effectively combined with the ballooning and hypervisor caching techniques.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected] or Publications Dept., ACM, Inc., fax +1 (212) 869-0481. VEE '17, April 08-09, 2017, Xi'an, China. © 2017 ACM 978-1-4503-4948-2/17/04...$15.00. DOI: http://dx.doi.org/10.1145/3050748.3050760
Content-based deduplication techniques can be broadly classified into two categories: in-band sharing and out-of-band sharing [25]. In-band techniques [17] intercept the I/O path and deduplicate identical pages in I/O buffers and page caches. In contrast, out-of-band techniques [3, 37] are instantiated as a separate execution entity that periodically scans memory allocated to virtual machines for deduplication opportunities. Out-of-band techniques are more effective than in-band techniques as they can consider all the pages of virtual machines for merging, not only the pages in the I/O path. As part of this work, we focus on improving the efficiency of the out-of-band deduplication process. The sharing efficiency of out-of-band techniques largely depends upon the time interval between successive scans and the number of pages scanned in each interval. Share-ometer [25] demonstrated the effect of changing the scan interval and pages-to-scan (in each scan) parameters of the Kernel Same-page Merging (KSM) feature of the Linux kernel. KSM scans pages in sequential order over the virtual address space associated with a virtual machine (virtual machines are processes in the Linux+KVM virtualization solution). KSM's procedure suffers from two potential problems. First, if there is no or very low sharing potential in the system, it takes a full scan of all virtual addresses across VMs to make this discovery. Clearly, the CPU cycles spent are not proportionate to the sharing benefits. Second, since KSM scans virtual addresses sequentially, the latency to capture the sharing potential can be high if identical pages are scattered or lie closer to the end of the virtual address space. For memory-oriented workloads this can adversely affect sharing: the longer it takes to deduplicate a shareable page, the more likely the sharing is short-lived or the page is missed altogether.
Towards designing an efficient out-of-band sharing process, the wish-list of requirements is as follows: (i) do not waste CPU cycles when no or low sharing opportunity exists in the system; (ii) detect sharing potential as quickly as possible and maintain a high ratio of shared pages to shareable pages; (iii) reduce the overhead of detecting opportunities for sharing and merging of identical pages. Current out-of-band sharing techniques [3, 16, 37] do not simultaneously meet all these requirements. Our central idea is to shift the compute requirements of the sharing process off the CPU. A possible approach to minimize CPU demand for the sharing related processes is to opportunistically use the CPU for aggressive scanning during low utilization periods. We push this idea to the extreme by offloading almost all the sharing related compute requirements, independent of the CPU utilization levels. This is especially relevant for systems executing CPU-intensive workloads, where interference due to the computation requirements of deduplication can be non-trivial [3]. Most modern systems have specialized co-processors—Graphics Processing Units (GPUs), Floating Point Units (FPUs) etc. These co-processors are designed to solve specific types of computing tasks in an efficient manner. Using GPUs for general purpose tasks that would typically be executed by the CPU has also been explored [33, 36, 39]. We aim to exploit the complementary computational capabilities of GPUs to offload the compute requirements of the sharing process and simultaneously meet the three goals described above. With a GPU-assisted setup, the CPU performs only the essential tasks of sharing (setting up the shared memory areas etc.), while the groundwork for deduplication (identifying redundant pages) is handled by the GPU. Further, GPU-based computations also benefit from the large set of cores for rapid identification of sharing potential.
As part of this work, we design and implement Catalyst, a GPU-assisted system for efficient memory deduplication. Catalyst enables rapid detection of sharing potential, and allows the CPU to perform targeted scan-and-merge of pages that are highly likely to be deduplicated. The main challenges in building such a system are to engineer a low-overhead mechanism for copying data from the kernel space to the GPU and to integrate with current deduplication processes. We demonstrate the efficacy of Catalyst by comparing it with different configurations of the vanilla KSM process.

[Figure 1. CPU cycles wasted in scanning non-sharable pages (x-axis: sharing potential in %; y-axis: wasted CPU cycles in %; plotted values: 72.16, 57.49, 44.39, 32.79).]

2. Background and Motivation

2.1 CPU efficiency and KSM

Kernel Same-page Merging (KSM) [3] is a kernel feature that scans through virtual address ranges of memory allocated to processes and deduplicates identical pages. In the context of the Linux+KVM virtualization solution, where each virtual machine's instantiation is essentially a process, KSM can be used without modifications to deduplicate pages of virtual machines. KSM uses a scan-and-merge approach—scan pages in a sequential manner and merge when identical. A hash for each page is computed and compared against other pages for similarity. If a match is found, KSM performs a byte-by-byte comparison of the pages to ascertain similarity. Similar pages are merged: both virtual addresses are mapped to the same physical page and the mappings to the physical page are marked copy-on-write (CoW). The CoW semantics are required to handle writes to shared pages. The sequential scanning approach of KSM has to scan all pages, even those that are not shareable, to exploit the sharing potential across all virtual machines. In cases with low sharing potential, the computation cost is disproportionate relative to the sharing benefits.
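The scan-and-merge procedure described above can be sketched in miniature as follows. This is an illustrative Python sketch, not the kernel implementation: the function name is ours, SHA-1 stands in for KSM's actual page hash, and a dictionary stands in for its internal trees.

```python
# Hedged sketch of a KSM-style scan-and-merge pass (illustrative only).
# Pages with a matching hash are confirmed byte-by-byte before "merging"
# (here, recording which canonical page each page maps to).
import hashlib

def scan_and_merge(pages):
    """pages: dict mapping virtual page number -> page bytes.
    Returns a dict mapping each vpn to the canonical vpn it is merged with."""
    seen = {}      # hash -> canonical vpn holding that content
    merged = {}    # vpn -> canonical vpn (itself if unique)
    for vpn in sorted(pages):                      # sequential scan, like vanilla KSM
        h = hashlib.sha1(pages[vpn]).hexdigest()
        if h in seen and pages[seen[h]] == pages[vpn]:   # byte-by-byte confirmation
            merged[vpn] = seen[h]                  # merge into shared copy (CoW in KSM)
        else:
            seen[h] = vpn
            merged[vpn] = vpn
    return merged
```

Note that the sequential loop visits every page, shareable or not, which is exactly the inefficiency Catalyst targets.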
Figure 1 shows the percentage of all CPU cycles used for scanning that resulted in no sharing benefits. For the experiment, the setup consisted of a single VM which read a single file of size 1 GB. The contents of the file were configured to exhibit deterministic sharing potential—20%, 40%, 60% and 80%. On reading the file, we measured the number of CPU cycles KSM spent on scanning pages that were shared and not shared. Referring to Figure 1, the x-axis represents the setups with different sharing levels and the y-axis represents the number of cycles wasted on scanning non-sharable pages. As can be seen, with less sharing potential most of the CPU cycles are wasted in scanning non-sharable pages—72% of CPU cycles yielded no sharing benefits. As the sharing potential goes up, the percentage of CPU cycles wasted decreases. In an ideal situation the CPU cycles should only be used for scanning shareable pages and never be wasted, or at least not wasted to a large degree. As the sharing potential is not known a priori, achieving this goal requires additional inputs or pre-processing before the KSM process takes over. Vanilla KSM not only wastes cycles in scanning non-sharable pages but also can take a considerable amount of time to realize the maximum sharing potential. The time taken by KSM to realize the maximum sharing potential depends on the number of pages scanned per second and the distribution of identical pages in the set of all pages. Figure 2 shows the result of an experiment that reports the cumulative distribution of the time taken to realize the sharing potential with different spatial distributions of identical pages. Distance 0 indicates all identical pages are contiguous, Distance 1 indicates a gap of one (non-identical) page between any two identical pages, and so on.

[Figure 2. Sharing distance vs. time taken to achieve maximum sharing (x-axis: time in seconds; y-axis: percent sharing achieved; one curve per Distance 0–3).]
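The effect of spacing on time-to-share can be captured by a simple back-of-envelope model. The formula below is our own approximation, not from the measurements: a sequential scanner must touch roughly distance+1 pages for every shareable page it reaches, so time-to-share grows linearly with the spread.

```python
# Our own rough model (illustrative, not from the paper): a sequential
# scanner reaching n shareable pages spaced 'distance' apart must scan
# about n * (distance + 1) pages in total.
def time_to_full_sharing(shareable_pages, distance, scan_rate_pps):
    pages_to_scan = shareable_pages * (distance + 1)
    return pages_to_scan / scan_rate_pps   # seconds
```

For example, 20% of a 1 GB file is roughly 52,429 4 KB pages; at 4000 pages per second the model predicts a few tens of seconds more for Distance 3 than Distance 0, in line with the trend (though not the exact values) observed in the experiment.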
The experiment setup was identical to the previous experiment which estimated the wastage of CPU cycles, except for the manner in which the degree of sharing was instantiated in the file to be read. Figure 2 reports results from a file with 20% sharing potential and a KSM scan rate of 4000 pages per second. As can be seen from the figure, when sharing is spatially close by, the time to achieve 100% of the sharing potential is much shorter than when the spread of identical content is greater. With Distance 3, KSM requires about 70 seconds to achieve the sharing potential, whereas it requires only 30 seconds with Distance 0. Similar to the first experiment, since there is no a priori knowledge regarding the spatial distribution of the shareable pages, the time required to reach the sharing potential can vary considerably for different distributions of shareable content. Further, the time to achieve the sharing potential is also inversely proportional to the scan rate—low rates result in large time intervals. In our work we address the above problems with KSM and explore the possibility of using co-processors like GPUs to solve them. In the absence of prior knowledge regarding the memory state, we aim to achieve a win-win situation: not using more compute cycles than required for sharing identical pages, while achieving the sharing potential rapidly.

2.2 Freeriding the GPU

The central idea of Catalyst is to exploit the presence of a GPU to enable rapid and efficient deduplication of memory pages across VMs. Here we present a very brief overview of GPU characteristics and their relevance for the problem at hand. A GPU is a highly parallel microprocessor specifically designed for fast and efficient processing of graphics and general purpose applications [33, 36, 39]. GPUs are connected via the Peripheral Component Interconnect (PCI) interface of a computer. Code and data have to be transferred to the GPU via the PCI bus for execution.
The asynchronous data transfer to GPU device memory is DMA-based. Application programmers have a restrictive view of the GPU. GPU APIs provide abstractions of a thread group hierarchy and a shared memory hierarchy to the programmer. Programmers write functions called kernels for execution on a GPU and additionally specify the number of threads for simultaneous execution. Programs of a SIMD (single instruction, multiple data) nature are well suited to exploiting the parallel execution model of GPUs. The page hashing operations of KSM are also SIMD operations—the same hashing operation performed on different pages. Hence the hashing and hash comparison operations can be executed in parallel and can be offloaded to a GPU. To summarize, there are two main advantages of offloading KSM's main compute operations of page content hashing and hash comparison to a GPU:

1. Hashing and hash comparison take CPU cycles and are performed periodically. The CPU is a shared resource in virtual environments and several VMs contend for it. Offloading these tasks to a GPU saves CPU cycles.

2. Page hashing and comparison is SIMD in nature, can be executed in parallel, and can potentially be performed orders of magnitude faster on a GPU.

3. Catalyst Design and Architecture

3.1 Strategy

Our approach is to offload the page hashing and hash comparisons of the KSM process to the GPU. The main tasks are to transfer the content of pages of virtual machines to the GPU, and to compute hashes of the page contents and compare them on the GPU itself. Pages with the same hash values are considered potential candidates for sharing. The output of GPU processing is a set of hints: the set of pages whose hash values are identical to those of one or more other pages in the system. The hints are then used as input by KSM for targeted scanning. Vanilla KSM wastes many cycles scanning non-sharable pages; hints enable KSM to perform targeted scanning of only those pages which have a high probability of being shared.
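The two-phase strategy can be sketched as follows. This is illustrative Python in which an ordinary loop stands in for the GPU's parallel hashing; both function names are ours, and SHA-1 stands in for whatever hash the GPU kernel computes.

```python
# Hedged sketch of Catalyst's two-phase strategy (illustrative only).
# Phase 1 (GPU stand-in): group pages by content hash, emit as hints only
# pages whose hash collides with at least one other page.
# Phase 2 (KSM stand-in): byte-compare and merge only the hinted pages.
import hashlib
from collections import defaultdict

def gpu_phase_hints(pages):
    """pages: dict vpn -> bytes. Returns vpns with at least one hash-identical peer."""
    by_hash = defaultdict(list)
    for vpn, data in pages.items():
        by_hash[hashlib.sha1(data).hexdigest()].append(vpn)
    return [vpn for group in by_hash.values() if len(group) > 1 for vpn in group]

def cpu_phase_merge(pages, hints):
    """Targeted scan: byte-level confirmation over hinted pages only."""
    merged, seen = {}, {}
    for vpn in hints:
        data = pages[vpn]
        if data in seen:
            merged[vpn] = seen[data]    # merge with the canonical copy
        else:
            seen[data] = vpn
            merged[vpn] = vpn
    return merged
```

Unique pages never reach the second phase, which is how the hints avoid the wasted cycles of a full sequential scan.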
3.2 Challenges

Towards designing such a system there exist two main challenges:

(i) The kernel does not have direct access to the GPU. GPU libraries like CUDA [22], OpenCL [14] etc. provide user-space APIs for programming GPUs. These user-level library calls interface directly with proprietary drivers to provide the desired functionality. Operating systems cannot manipulate the GPU access interface directly. To work around this constraint, an efficient mechanism is required to expose the GPU interface to the kernel.

(ii) Efficient transfer of data to the GPU. Physical memory (RAM) is not directly accessible to a GPU. In order to execute a program on a GPU, the data required by the program must first be transferred to the GPU's memory. GPU programs use API libraries to transfer data to the GPU, and operating systems do not have direct access to these libraries. A first-cut solution is to transfer data from kernel space to a user-space process, which then invokes the GPU APIs to transfer data to the GPU. This involves two data transfers—one from kernel space to user space and another from user space to the GPU. It is very likely that the data transfer overheads would surpass the benefits of offloading kernel tasks to the GPU; hence the data transfer component needs careful design.

3.3 Architecture

Figure 3 shows the architecture of Catalyst.

[Figure 3. Software architecture of the Catalyst solution.]

The two main components of our system are the Catalyst kernel module and a user-space Catalyst daemon. Together these components solve the challenges associated with availing GPU services for efficient and rapid memory sharing. The Catalyst kernel module is responsible for the transfer of kernel data and control information to the Catalyst daemon. The user-space daemon sets up the execution of page sharing related tasks on the GPU. Details of the two components are as follows:

• Catalyst kernel module: The Catalyst kernel module acts as a communication channel between KSM and the Catalyst daemon. The kernel module is responsible for the transfer of kernel data to the user-space daemon and vice versa. It provides the Catalyst daemon with the data contained in virtual machines' pages and transfers the output of the Catalyst daemon (hints) to KSM. It also avoids the double copy problem discussed in Section 3.2.

• Catalyst daemon: The Catalyst daemon is a user-space process that executes periodically. It is responsible for transferring kernel data to the GPU and setting up the execution of KSM's offloaded tasks on the GPU. The Catalyst daemon maps pages of virtual machines with coordination from the Catalyst kernel module. It then provides these pages to the GPU, and invokes appropriate GPU functions to compute and compare hash values of virtual machine pages, thus generating hints. The Catalyst daemon receives the output of computations on the GPU and forwards the hints to the Catalyst kernel module. The daemon executes periodically and can be configured using the num sleep seconds parameter, which specifies the interval after which the daemon requests the kernel module for the next round of pages to estimate the set of hints.

4. Implementation Details

Catalyst is implemented for the KVM hypervisor on Linux hosts enabled with the KSM page deduplication service. Figure 4 shows the high-level control flow among the different components of Catalyst.

[Figure 4. Catalyst flow diagram.]

The Catalyst daemon requests the Catalyst kernel module to acquire mappings of pages associated with VMs. The acquire mapping operation maps physical pages of a VM onto the Catalyst daemon virtual address space (CVA). Subsequently, the Catalyst daemon transfers the content of the VM pages (mapped onto the Catalyst daemon address space) to the GPU.
After transferring page content data to the GPU, the Catalyst daemon invokes the necessary GPU functions to compute hints—prospective sharable pages. After computing hints, the Catalyst daemon requests the Catalyst kernel module to release mappings, which unmaps pages of a virtual machine from the Catalyst daemon's virtual address space. The Catalyst daemon collects hints generated by the GPU and delivers them to KSM via the Catalyst kernel module. KSM then performs scan-and-merge operations on the provided hints. As soon as KSM is done with scanning the hints, it requests the Catalyst kernel module to notify the Catalyst daemon to be ready to repeat the same process. This entire procedure is termed one round of Catalyst operations. There are three main operations in our implementation, which are explained in the rest of this section.

4.1 GPU Catalyst memory exchange

To facilitate targeted scan and merge of memory pages by KSM, the GPU's parallel hash computation capabilities can be leveraged. This necessitates transfer of pages of virtual machines to the GPU and collection of processed deduplication hints to assist KSM. Designing efficient high volume data exchange from main memory to GPU memory is a non-trivial exercise. To meet the above requirements, the following are the major operations involved in the exchange of data between the GPU and Catalyst.

4.1.1 Getting a handle on VM memory

As mentioned earlier (Section 3.2), it takes multiple intermediate memory copies to transfer data from the operating system to the GPU. Physical memory pages used by virtual machines are accessible only from within the privileged kernel mode. However, the NVIDIA GPU programming model allows only user space communication with the GPU, which becomes a bottleneck to transmit the data in pages of virtual machines. Catalyst's acquire mapping operation avoids the overheads due to the copy of data from kernel to user space.

[Figure 5. Acquiring VM mapping into the Catalyst daemon process.]
The Catalyst daemon allocates a large virtual address region with no backing physical pages; this virtual address space allocation is termed the CVA (Catalyst daemon virtual address space). When deduplication hints for a VM are to be generated, the Catalyst kernel module is requested to update mappings of the CVA to physical pages of a virtual machine. To update mappings of the CVA, the Catalyst kernel module translates page frames of VMs (which are part of the host OS virtual address space (HVA)) to machine frame numbers (MFNs). An unmapped Catalyst daemon virtual page (in the Catalyst virtual address space) is made to point to a physical page (MFN) of the virtual machine by modifying the Catalyst daemon process page tables (refer to Figure 5). When all the physical pages are mapped to CVAs, accessing the CVAs from user space amounts to accessing physical pages belonging to a virtual machine. In the first round of Catalyst, we consider all pages of a virtual machine for hashing and hash comparison. In subsequent rounds, Catalyst only considers unshared pages, changes the CVA mapping to point to the unshared pages, and transfers them to the GPU for hashing and hint generation. This is achieved by checking the map count of each VM page; the map count is a count of mappings for each physical page. A map count greater than one indicates a shared page, and hence the corresponding page is not considered. However, we store the hash values of shared pages in GPU memory (referred to as active hints) to enable comparisons of the newly hashed pages with the hashes of already shared pages.

[Figure 6. Procedure for transfer of data and computations at the GPU with multiple virtual machines.]

[Figure 7. An example of how hints are generated in the case of a single VM. Notation: A—active hints, D—virtual machine page data, V—virtual page number, SH—shared hints, H—hash values of pages, B—bitmap.]
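The map-count filtering can be sketched as follows. This is illustrative Python; the helper name and the dictionary representation are ours, standing in for the kernel's per-page map count.

```python
# Hedged sketch of selecting candidate pages by map count (illustrative).
# A physical page mapped more than once is already shared, so only
# unshared pages are remapped into the CVA and sent to the GPU in
# rounds after the first.
def select_unshared(vm_pages):
    """vm_pages: dict vpn -> map_count. Returns vpns to hash this round."""
    return [vpn for vpn, map_count in vm_pages.items() if map_count <= 1]
```

The pages filtered out here are exactly those whose hashes are retained on the GPU as active hints.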
When the Catalyst daemon is done with the processing of virtual machine pages, it removes the mapping of CVAs to the virtual machine physical pages.

4.1.2 Data transfer to GPU

When the Catalyst kernel module completes the mapping of all the VM physical pages onto the Catalyst daemon user space, it signals the Catalyst daemon to start computation on the virtual machine data. The Catalyst daemon then uses GPU API calls to transfer the data stored in the mapped CVAs to the GPU and calls the appropriate GPU functions to compute hints for the transferred pages. Figure 6 shows the data transfer and GPU computation sequence across multiple virtual machines. Catalyst employs a sequence of rounds for the offloading procedure. Each Catalyst round consists of two phases. In the first phase, data present in VM pages is transferred to the GPU and hashed, one VM at a time. In the second phase, the hashes of all virtual machines are compared against each other and against the active hints, and new hints (for the current round) are generated. When a round begins, the Catalyst daemon requests the Catalyst kernel module to change the virtual address mappings of the Catalyst daemon for the first virtual machine. When the CVA-to-HVA mappings of the Catalyst daemon are updated, page data of the virtual machine is transferred to the GPU. Along with page data, the virtual addresses of pages which need to be hashed and the virtual addresses of pages (of the VM under consideration) which are already shared (shared hints) are also transferred to the GPU. When the data transfer for the first virtual machine is complete, the GPU computes the hash value of each page, sorts the calculated hashes and updates the set of active hints (details in Section 4.2). On completion of these GPU operations, the Catalyst daemon requests the Catalyst kernel module to reset its virtual address mappings. The Catalyst daemon then transfers the data of the next virtual machine, computes hashes, performs comparisons and builds per-VM active hints.
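The two phases of a round can be sketched as follows. This is illustrative Python in which sequential loops stand in for the per-VM transfers and GPU kernels; the function name and data layout are ours.

```python
# Hedged sketch of one Catalyst round for multiple VMs (illustrative).
# Phase 1: per VM, hash all candidate pages (stand-in for mapping,
# transfer, and GPU hashing). Phase 2: any hash occurring more than once
# across all VMs' new hashes and carried-over active hints becomes a hint.
import hashlib

def run_round(vms, active_hints):
    """vms: dict vm_id -> {vpn: page_bytes} (unshared pages this round).
    active_hints: dict vm_id -> {vpn: hash} carried over from earlier rounds.
    Returns dict vm_id -> list of hinted vpns."""
    hashes = {}
    for vm_id, pages in vms.items():               # phase 1, one VM at a time
        hashes[vm_id] = {vpn: hashlib.sha1(d).hexdigest()
                         for vpn, d in pages.items()}
    counts = {}                                    # phase 2: global hash counts
    for table in list(hashes.values()) + list(active_hints.values()):
        for h in table.values():
            counts[h] = counts.get(h, 0) + 1
    return {vm_id: [vpn for vpn, h in table.items() if counts[h] > 1]
            for vm_id, table in hashes.items()}
```

Because active hints participate in the counting, a new page matching an already-shared page is hinted even though the shared page itself is not re-transferred.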
4.2 Hints generation in GPU

The Catalyst daemon transfers the virtual addresses of unshared VM pages, their corresponding page data, and the virtual addresses of already shared pages to the GPU. When a virtual machine's data transfer to the GPU is complete, the Catalyst daemon executes the GPU functions for hash computation and hash comparison on the GPU. To compute the hashes of VM page data in parallel, we create as many GPU threads as there are virtual machine pages, and each thread is given one page. These threads then apply the same hash function to their page data in parallel. The GPU maintains a mapping from each VM virtual page address (HVA) to its calculated hash value. After the hashing is done, to compute self sharing, we sort the page hashes in increasing order of hash value on the GPU. Sorting groups together VM pages with equal hash values. In the case of a single VM, we can report self sharing by comparing hashes of the VM's pages among each other and with previously shared pages (active hints). We extract the virtual pages with equal hash values and report them as hints. Since in each round we transfer the data of only those virtual machine pages which are not shared, we have to store the hashes of hints generated in previous rounds. We refer to these hints as active hints and store them in GPU device memory. Figure 7 shows an example of how hints are generated for a single virtual machine. At the beginning of each round we have a set of active hints. Each active hint is composed of a hash value and the corresponding virtual page address. The active hints list the set of shared pages of a VM and the corresponding hash values from the previous round. The Catalyst daemon transfers the contents of pages, the virtual addresses of pages to be hashed, and the virtual addresses of already shared pages (shared hints). Shared hints represent the set of pages of a VM that are currently shared and whose sharing has not been broken. The set of pages in the shared hints is a subset of the pages in the active hints.
A page that is shareable at the end of round k (and hence part of the active hints) may subsequently be shared, then overwritten, and hence not be in the shared hints at the beginning of round k+1. The GPU first computes hash values over virtual machine pages and then sorts them. When the hashes are sorted, we also update the order of the corresponding virtual page addresses (referred to as virtual page numbers in Figure 7). Next, the active hints are updated using the shared hints data. There is a need to update active hints because it is quite possible that pages that were shared at the end of the previous round were unshared before the beginning of the current round. The shared hints are used to remove entries corresponding to the unshared pages from the active hints list. Referring to Figure 7, the virtual page address of active hint (15,3) is not present in the set of shared hints and hence is removed from the set of active hints. We maintain a per-VM bitmap corresponding to the virtual page addresses of each page of the VM. The size of the bitmap is equal to the number of pages of the VM. The newly computed hash values are compared with hash values of pages in the set of active hints. For a page whose hash value matches a value in the set of active hints, the corresponding bitmap location is set to 1. The set of pages whose bits are set to 1 is the set of new hints for sharing. Referring to Figure 7, virtual page 3 has a hash value of 11, which is also present in the active hints list, and as a result the bit corresponding to virtual page 3 in the bitmap is set to 1. After comparing the hashes with active hints, we look for sharing within the newly hashed pages. Since the hash values and corresponding virtual pages are sorted, virtual pages with equal hash values are neighbors. To identify self sharing we compare the hash value of each virtual page with its neighbors. In Figure 7, virtual pages 2 and 94 have the same hash value of 13, and so do virtual pages 1 and 8.
Bitmap locations of the corresponding virtual page addresses (bit locations 1, 2, 8 and 94) are set to 1. The bitmap eventually holds the set of pages which have at least one other page with an equal hash value. In the case of multiple VMs, because of GPU memory capacity limits, it may not be possible to hash pages of all VMs in one go. In such cases, we compute and store hashes and update active hints on a per-VM basis. After each VM's state is updated, the per-page hashes and active hints of all the VMs are compared against each other. To compare two VMs, say VM1 and VM2, we create GPU threads equal to the number of pages of VM1. Each thread picks one hash value corresponding to a VM1 page and performs a binary search for this value in the set of hash values of VM2, the active hints of VM1 and the active hints of VM2. If a match is found, the bitmap location of the virtual page address corresponding to the hash value is set (a hint of the current round). For example, as shown in Figure 8, virtual pages V2 and V3 are part of the hints list based on the bitmap state. In this manner we compare the hashes of all the VMs against each other. At the end of the comparison we have a hint bitmap for each VM. We extract the virtual page addresses for each VM whose bitmap locations are set and report them as hints. At the end of the GPU computation procedure, the set of hints represents the virtual page addresses per VM which are likely to be sharable.

[Figure 8. Generating hints list for a virtual machine.]

4.3 Targeted sharing by KSM

The Catalyst daemon gets the per-VM set of virtual page addresses (sharing hints) whose hash values are equal to the values of one or more other pages. These virtual page addresses are then transferred to KSM through the Catalyst kernel module. KSM is modified to incorporate non-sequential scan behavior, which is essential to realize speedy sharing in an efficient manner.
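The sort-and-compare steps of Section 4.2 can be sketched as follows. This is illustrative Python in which plain sorting and `bisect` stand in for the GPU's parallel sort and per-thread binary search; both helper names are ours.

```python
# Hedged sketch of hint generation on sorted hashes (illustrative).
# Self sharing: sort (hash, vpn) pairs so equal hashes become neighbors,
# then compare each pair with its neighbor. Cross-VM sharing: binary-search
# each hash of one VM in another VM's sorted hash list. Matching vpns are
# the bits that would be set in the per-VM bitmap.
import bisect

def self_sharing_bits(hash_vpn_pairs):
    pairs = sorted(hash_vpn_pairs)                 # sort by hash value
    hits = set()
    for (h1, v1), (h2, v2) in zip(pairs, pairs[1:]):
        if h1 == h2:                               # equal hashes are neighbors
            hits.update((v1, v2))
    return hits

def cross_vm_bits(pairs_a, sorted_hashes_b):
    hits = set()
    for h, vpn in pairs_a:                         # one "GPU thread" per page
        i = bisect.bisect_left(sorted_hashes_b, h)
        if i < len(sorted_hashes_b) and sorted_hashes_b[i] == h:
            hits.add(vpn)
    return hits
```

In the actual design the same binary search is also run against both VMs' active hints, which the sketch omits for brevity.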
Instead of scanning all pages of the virtual machines, KSM invokes scan and merge operations on the provided hints to share the pages quickly. KSM scans the pages corresponding to the hints of one VM at a time. When the hints of one VM are exhausted, KSM moves to the hints of the next VM, and so on. When KSM has consumed the hints belonging to all VMs, it sends an end-of-current-round notification to the Catalyst daemon via the Catalyst kernel module.

5. Experimental Evaluation

In this section, we present the evaluation of Catalyst against several benchmarks and workloads, and demonstrate its efficacy in different types of scenarios. The base case for comparison is vanilla KSM. The main metrics are the number of pages shared, the sharing rate and the CPU cycles required for sharing. We also present results showing the effect of different parameters like num_sleep_seconds and Copy-on-Write (CoW) breaks. A comparative study of an adaptive rate limiting policy designed for Catalyst against KSM, and a use-case that maintains a user-specified sharing level, demonstrate Catalyst's real-world applicability.

5.1 System setup and benchmarks

All experiments were conducted on a machine with a 3.4 GHz Intel i7-3770 processor with 8 CPU cores and 32 GB physical memory. We used NVIDIA's GTX 970 GPU with 1664 CUDA cores, each executing at 1.18 GHz, and with 4 GB device memory. The operating system on the host machine and the virtual machines was 64-bit Ubuntu 12.04.5 LTS. The host OS used a modified Linux 3.14.65 kernel and KVM was used as the virtualization platform. The following benchmarks were used for the experiments:

• Synthetic benchmark: This benchmark was used to create files with a specified extent of duplicate content. When read, these files would provide the equivalent sharing potential. The benchmark was also capable of inducing copy-on-write (CoW) page breaks at a specified rate by modifying file contents (via the in-memory page cache) according to the specified CoW break rate.

• Filebench Varmail [35]: The Varmail workload (mail server workload) represents the behavior experienced by the /var/mail directory of a UNIX system. The set of operations it performed includes reading, writing, rereading, and deleting emails. The workload allows configuration of the number of threads, file size, number of files, running time, etc. We ran this workload with 1000 files of size 16 KB each, 16 threads, and for 600 seconds.

• Filebench Fileserver [35]: The Fileserver workload generated the load experienced by a real file server. The operations that the Fileserver performed on files included reading, writing, appending and deleting. Each workload executed a combination of these operations using the configuration parameters provided. We ran this workload with 25000 files of size 64 KB each and 8 threads.

• OLTP Twitter benchmark [5]: This benchmark mimicked a web-based micro-blogging website. The main operations included get tweets (read operations) and insert tweets (write operations). The workload allowed configuration of the mix of read-write operations, rate of operations, number of users and total duration of each run. We configured our benchmark with 60% get tweets operations, 30% insert tweet operations, 10 clients, a rate of 600 operations per second, and executed the workload for 600 seconds.

• Redis benchmark [27]: Redis is an in-memory key-value store that can store various data structures like sets, lists, strings, hashes, etc. Redis comes with its own benchmark utility which can generate test load according to specified parameters like key size, type of operations, number of requests, etc. We ran the Redis benchmark for 500,000 get/set operations with 1 MB data size for the get/set requests.

5.2 Comparison with KSM

In this section we compare the performance of Catalyst and KSM. The metrics used for performance include the number of pages shared with time and the total CPU cycles taken for the sharing achieved. Figures 9, 10 and 11 show a comparison of the page sharing achieved by Catalyst and KSM over the duration of the experiment. Table 1 shows the CPU cycles overhead for the different workloads with Catalyst and with several scan rate settings of KSM.

Sharing technique      Synthetic  Varmail  Fileserver  Twitter  Redis
Catalyst (20 seconds)       2.36    35.80       81.70    84.41  50.79
KSM (1000 pps)             10.39    11.87        8.32     8.45  12.37
KSM (2000 pps)             10.28    22.33       27.13    18.10  24.11
KSM (3000 pps)              9.88    32.27       41.05    26.46  37.63
KSM (4000 pps)             10.37    49.80       47.72    35.99  46.56
KSM (5000 pps)              9.79    53.03       52.24    48.66  61.06
KSM (10000 pps)                -        -       81.48        -      -

Table 1. CPU cycles required by different workloads with Catalyst and different scan rates of KSM. All values are reported as Gigacycles.

Figure 9. Comparison of page sharing of Catalyst and KSM for the Synthetic benchmark.

Figure 10. Comparison of page sharing of Catalyst and KSM with the Filebench Varmail benchmark.

Figure 11. Comparison of page sharing of Catalyst and KSM for the OLTP Twitter benchmark.

Using the Synthetic benchmark, we created a file of 1 GB size with 60% sharing potential inside a virtual machine. We read this file from within the virtual machine.
Reading the file caused the physical memory of the virtual machine to be occupied by the file content. We executed Catalyst with num_sleep_seconds set to 20 seconds. We repeated the same experiment with different scan rates of KSM: 1000 pages per second (pps), 2000 pps and so on. Figure 9 shows the variation of pages shared with time. Catalyst was able to achieve the maximum sharing (approximately 160000 pages) in a very short duration, whereas KSM with a scan rate of 5000 pps required approximately 100 seconds to reach the maximum sharing potential. Catalyst only scanned pages which had a high probability of being identical, whereas KSM had to go through all the pages of the VM, which led to slow realization of the maximum sharing potential. With slower scan rates, the time required by KSM increases proportionately, up to 350 seconds with a scan rate of 2000 pps. A rate of 1000 pps is too slow and could not achieve any sizable sharing within the 400 seconds of the experiment. For the Synthetic workload, Table 1 reports the number of CPU cycles required by Catalyst and KSM to achieve maximum sharing. Since the sharing potential is constant, the total number of pages scanned to achieve this potential remains the same (even with different KSM scan rates). As a result, the CPU cycles required by KSM with different scan rates are similar (∼10 Gigacycles). Catalyst, on the other hand, outperformed KSM and consumed only about 25% of the CPU cycles (∼2.4 Gigacycles) required by KSM. Figure 10 shows the variation of pages shared with time for the Varmail workload. In this case as well, Catalyst was able to achieve higher sharing levels much earlier (within the first 10 seconds) compared to all the KSM scan rates. KSM with 1000 pps required 300 seconds to reach maximum potential, whereas KSM with 5000 pps required about 50 seconds for the same.
Referring to Table 1, KSM with scan rates of 4000 pps and 5000 pps requires 30% more CPU cycles than Catalyst and also does not perform as well as Catalyst. The CPU cycles required with KSM's scan rate of 3000 pps are comparable with Catalyst, but the time required to achieve maximum sharing in this case is close to 100 seconds, compared to less than 10 seconds required by Catalyst.

With the Twitter workload we observed a different behavior: short duration spikes in the number of pages shared, as shown in Figure 11. Also, the CPU cycles required by Catalyst with a sleep interval of 20 seconds were 2x those of KSM with a scan rate of 5000 pps (refer to Table 1). We observed similar behavior with Catalyst and a sleep interval of 30 seconds. Although the CPU cycles of Catalyst were then close to those required with KSM's scan rate of 5000 pps, the sharp short duration oscillations in the number of pages shared remained. The reason for these short lived sharing spikes in the case of the Twitter workload was the large number of updates to the shared pages. During Catalyst's sleep interval the sharing of several pages is broken and these pages have to be reshared in every round. Catalyst can share them quickly, but still needs to perform the GPU orchestration in each round to share the newly broken pages. With the other two workloads, Fileserver and Redis, we observed behavior similar to the Varmail and Twitter workloads, respectively.

Table 2 shows the average resident set size (RSS), average pages shared and CPU cycles required to share each page. It can be observed that, compared to KSM, Catalyst achieves higher levels of average sharing for all the workloads. Average pages shared is calculated as the average of the number of pages shared sampled every second. The CPU cycles spent per shared page in the case of Catalyst are also lower in almost all cases compared to KSM. The CPU cycles per shared page is obtained by dividing the total CPU cycles by the average sharing. With the Twitter workload, the high CoW break rate led to a higher CPU cycles per shared page value. However, the average sharing of Catalyst with the Twitter workload was higher than that of KSM. Other than the Twitter workload, Catalyst required 15% to 35% fewer CPU cycles per shared page than KSM.

To summarize, with a stable and non-changing sharing potential, the time required by Catalyst to achieve the potential is an order of magnitude less than that required by KSM. For applications that update memory pages at a moderate rate, Catalyst still performs better in terms of the number of pages shared, the time required to achieve the sharing potential, and the CPU cycles required. With a workload that performs aggressive memory updates, Catalyst performs better than KSM at comparable computation costs.

5.3 Impact of CoW breaks

In the case of the Twitter workload, we observed short lived spikes in the graph of pages shared with time (Figure 11 and Table 1). As reported in Figure 11, workloads characterized by frequent memory updates cause the sharing achieved by Catalyst to oscillate continuously. Due to the high memory update rate, as soon as pages are shared by Catalyst, they also start getting over-written, resulting in CoW breaks and a decrease in sharing. On the next hint generation cycle, a set of new pages is reshared quickly and the number of shared pages increases sharply. The extent of this oscillation (the number of spikes, the extent of CoW breaks and the quick resharing of pages just after hint generation) is a function of the memory update behavior and the duration between two consecutive hint generation events.

To study the impact of CoW breaks, we varied the CoW break rate of the Synthetic benchmark and observed the number of pages shared at 1 second intervals. Note that we are not measuring the page sharing rate, but report the value of page sharing achieved at 1 second intervals. The Synthetic benchmark created a 1 GB file with 60% sharing potential (around 160000 pages). Next, it read and wrote to the file in such a way that CoW breaks were caused at the specified rate and the sharing potential was then restored. We observed the number of pages shared with time for different CoW break rates: 100 breaks/sec, 1000 breaks/sec and so on. Figure 12 shows the complementary cumulative distribution of the number of pages shared in 1 second bins over the duration of the experiment. With a CoW break rate of 100 breaks/s, 90% of the reported page sharing is very close to the maximum sharing potential (around 160000 pages). For a CoW break rate of 2000 breaks/s, for 90% of the time, the lower extent of the number of pages shared was 140000 pages. As we increase the CoW break rate, this extent decreases further, to 80000 shared pages with a CoW break rate of 5000 breaks/s.

Figure 12. Complementary CDF (CCDF) of pages shared with different CoW rates.

Figure 13 shows the impact of CoW breaks on the total CPU cycles consumed by Catalyst. The Synthetic benchmark not only performs CoW breaks but also makes the content of the overwritten pages identical again, which restores the sharing potential. As a result, if the CoW break rate is high, the number of pages that will be scanned and shared after CoW breaks will also be high.
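The CCDF reported in Figure 12 can be reproduced from the per-second samples of pages shared. A minimal sketch follows; the sample values used below are illustrative, not measured data:

```python
def ccdf(samples):
    """Complementary CDF: for each observed value x, the fraction of
    per-second samples (pages shared) that are >= x."""
    n = len(samples)
    xs = sorted(set(samples))
    return [(x, sum(1 for s in samples if s >= x) / n) for x in xs]

# Illustrative per-second samples of "pages shared".
samples = [160000, 158000, 140000, 160000, 159000]
for pages, fraction in ccdf(samples):
    print(pages, fraction)
```

Reading a point (x, y) off this curve answers the question posed in the text: for what fraction of the experiment y was the sharing at least x pages.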
This leads to an increase in the time taken by Catalyst to scan hints, since the number of hints generated will be equal to the number of CoW breaks. As shown in Figure 13, as the CoW break rate increases, the number of CPU cycles required for scanning hints also increases. This periodic load in Catalyst, which is designed to always greedily target all sharing potential, results in increased CPU cycle consumption. This effect is also seen in the comparisons with KSM (Section 5.2). Note that in all cases, even when the CPU cycles consumed by Catalyst are higher, it always achieves a higher sharing potential.

Figure 13. Effect of CoW breaks on computation costs for sharing.

Workload    Average RSS        Sharing        Average pages  Total CPU cycles  CPU cycles per shared
            (number of pages)  technique      shared         (x10^9 cycles)    page (x10^3 cycles)
Varmail     168985             Catalyst 20    64762          35.8              552.79
                               KSM 5000 pps   62262          53.03             851.72
Fileserver  511056             Catalyst 20    382511         81.7              213.59
                               KSM 10000 pps  325624         81.48             250.23
Twitter     269183             Catalyst 20    31415          84.41             2686.93
                               KSM 5000 pps   27049          48.66             1798.23
Redis       129246             Catalyst 20    29035          50.76             1748.23
                               KSM 5000 pps   26592          61.06             2296.18

Table 2. RSS, average pages shared and CPU cycles per shared page for KSM and Catalyst with different workloads.

5.4 Effect of hint generation periodicity

Figure 14 shows the variation in the total number of CPU cycles consumed by Catalyst with different sleep intervals between consecutive hint generation events for the Varmail workload. As expected, the total CPU cycles required by Catalyst decrease as the sleep interval increases: an aggressive scanning schedule increases the number of rounds in which the Catalyst procedure is executed.
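The relationship between sleep interval and CPU cost can be sketched with a back-of-the-envelope model built from the per-page and per-hint costs reported later in Table 3. This is a hypothetical illustration: the page and hint counts below are invented inputs, and real rounds also include GPU execution and merge costs not modeled here.

```python
# Approximate per-page/per-hint CPU-cycle costs from Table 3.
CHANGE_MAP, COPY_TO_GPU, RESTORE_MAP = 2500, 2500, 600
READ_HINT, SEND_HINT = 17, 6

def cycles_per_round(num_pages, num_hints):
    """CPU cycles for one Catalyst hint-generation round."""
    per_page = num_pages * (CHANGE_MAP + COPY_TO_GPU + RESTORE_MAP)
    per_hint = num_hints * (READ_HINT + SEND_HINT)
    return per_page + per_hint

def total_cycles(duration_s, sleep_interval_s, num_pages, num_hints):
    # A longer sleep interval means fewer rounds over the experiment.
    rounds = duration_s // sleep_interval_s
    return rounds * cycles_per_round(num_pages, num_hints)

# Doubling the sleep interval halves the number of rounds, and hence
# roughly halves the per-round CPU cost attributable to Catalyst.
print(total_cycles(600, 10, 250_000, 50_000) /
      total_cycles(600, 20, 250_000, 50_000))  # → 2.0
```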
Each Catalyst procedure has to change and restore mappings, transfer data to the GPU, set up execution on the GPU and collate the results of the GPU computation. The sleep interval is therefore an important parameter for balancing CPU consumption against the extent of sharing to be maintained, and needs to be carefully selected as per resource availability and sharing-related policy.

Figure 14. Effect on the total CPU cycles of Catalyst when varying the parameter num_sleep_seconds, with the Varmail workload.

5.5 Sharing across multiple VMs

We tested our implementation in a multi-VM setup to demonstrate the correctness and applicability of Catalyst. We started three virtual machines: the first executed the Fileserver benchmark, the second the Varmail benchmark, and the third the Synthetic benchmark with a CoW break rate of 5000 breaks/s. We ran Catalyst with a 20 second sleep interval, and KSM with scan rates of 5000 pps, 10000 pps and 15000 pps. Figure 15 shows the number of pages shared during the experiments. As can be seen, Catalyst achieved much higher sharing levels than all the other cases. Catalyst achieved up to 1.5 times and 1.25 times more sharing than KSM with scan rates of 5000 pps and 15000 pps, respectively. The CPU cycles required by KSM with scan rates of 5000 pps, 10000 pps and 15000 pps, and by Catalyst, were 76 Gigacycles, 125 Gigacycles, 146 Gigacycles, and 120 Gigacycles, respectively. Catalyst outperformed the two aggressive KSM configurations both in terms of sharing performance and CPU cost.

5.6 Adaptive rate limiting

We implemented an adaptive rate limiting policy for both Catalyst and KSM. The policy takes a sharing threshold parameter as input from the user and adjusts the scan rate of KSM or the sleep interval of Catalyst.
Whenever the total memory shared decreases below the threshold, the scan rate is increased; conversely, the scan rate is decreased when the amount of memory shared is above the threshold. The setup for this experiment is the same as for the experiment in Section 5.5.

Policy for Catalyst: Catalyst starts execution with a default sleep interval value of n seconds. Every n seconds, Catalyst checks whether the memory shared is below the sharing threshold. If so, Catalyst invokes the GPU hint generation functionality and proceeds with targeted scanning. Once the sharing of that round is achieved, it sleeps for n seconds and the procedure repeats. On the other hand, if the shared memory at the start of a round is more than the sharing threshold, Catalyst does nothing; it ends the round and sleeps for n seconds.

Policy for KSM: KSM starts execution with a low scan rate of 100 pps. After every n seconds, KSM checks whether the memory shared is less than the sharing threshold. If it is, the scan rate of KSM is doubled. If the memory shared is more than the threshold, KSM's scan rate is reset to 100 pps. We set an upper limit of 4000 pps on the scan rate of KSM.

Figure 16 shows the number of pages shared over the duration of the experiment with adaptive rate limiting. The sharing threshold is set to 500 MB and Catalyst's sleep interval is set to 10 seconds. From the figure it can be observed that Catalyst is able to achieve higher sharing levels compared to KSM.
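The two adaptation rules above can be condensed into per-interval step functions. The sketch below is a simplified Python rendering of the policies, not the actual kernel-side implementation; the function names and the sample threshold values are assumptions for illustration.

```python
def ksm_next_rate(rate, shared, threshold):
    """One adaptation step for KSM's scan rate (pps): double while the
    sharing is below the threshold (capped at 4000 pps), otherwise
    reset to the base rate of 100 pps."""
    return min(rate * 2, 4000) if shared < threshold else 100

def catalyst_should_scan(shared, threshold):
    """Catalyst runs a GPU hint-generation round only when the shared
    memory has fallen below the threshold; otherwise it just sleeps."""
    return shared < threshold

# Driving the KSM policy once every n seconds while sharing stays low:
rate = 100
for _ in range(6):
    rate = ksm_next_rate(rate, shared=300, threshold=500)
print(rate)  # → 4000  (100→200→400→800→1600→3200→4000, capped)
```

The asymmetry is the point of the comparison in the text: Catalyst reacts in a single round because a targeted scan covers all current hints, while KSM must ramp its scan rate up over several intervals before it can recover lost sharing.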
Also, as the sharing falls below the sharing threshold, Catalyst is able to recover quickly and regain a higher level of sharing, whereas KSM takes relatively more time to recover. Figure 17 reports the total CPU cycles required by Catalyst and KSM (121.08 Gigacycles for KSM versus 104.75 Gigacycles for Catalyst): Catalyst achieves better performance with lower CPU cycles.

Figure 15. Comparison of page sharing achieved with a multi-VM setup.

Figure 16. Effect of adaptive rate limiting on pages shared with a sharing threshold of 500 MB.

Figure 17. Effect of adaptive rate limiting on CPU cycles.

5.7 Micro-benchmarking Catalyst operations

We performed micro-benchmarking experiments to determine the execution requirements of the different operations of Catalyst. Table 3 shows the cost profile of these operations. Despite avoiding the data copy from kernel space to user space, transferring data from user space to the GPU is not cheap: it requires about 2500 CPU cycles per page. The other major contributor is the process that changes the memory mappings of the memory region used by the Catalyst daemon. The average cost of the mapping update operations is also approximately 2500 CPU cycles per page. In a multi-VM setup with aggressive memory update behavior, this requirement increases the CPU cycle overhead of Catalyst.

Operation                 CPU Cycles
Change Mapping            ∼ 2500 per page
Copying Data to GPU       ∼ 2500 per page
Reading Hints from GPU    ∼ 17 per hint
Restore Mapping           ∼ 600 per page
Sending Hints to KSM      ∼ 6 per hint

Table 3. Per-page CPU cycles taken by different Catalyst operations.

6. Related Work

6.1 Memory deduplication approaches

Disco [4] was one of the first systems to propose and implement transparent page sharing. Its approach identifies identical read-only pages across multiple virtual machines and merges them to save memory. Instead of periodically scanning pages, pages originating from the same disk block were used to identify shareable pages. Disco required several guest OS modifications. VMware ESX server [37] introduced the method of content based page sharing using periodic scanning of memory pages. It hashes the content of memory pages and compares the hashes to identify duplicate pages. Similar to VMware ESX server, KSM [3] also periodically scans the memory pages of virtual machines and uses hashing for page deduplication. Difference Engine [9] extends content based page sharing to sub-page granularity by employing techniques like deduplication, compression and patching. XLH [16] monitors the I/O accesses of guest virtual machines and generates hints for KSM to scan I/O pages; these hints identify pages which have a high probability of being shared. The hints are then scanned by KSM preferentially, enabling quick identification of sharable pages. Catalyst takes this a step further by assisting KSM to scan pages in a targeted manner. Singleton [28], an extension of KSM, identifies sharing opportunities in host and guest page caches. It merges identical pages in the host and guest page caches and maintains a single copy of duplicate pages. Catalyst is an orthogonal optimization to Singleton.

6.2 GPU optimizations

GPUs have been used to accelerate application level tasks. A number of research works [33, 36, 39] have shown the utility of GPUs in accelerating applications like image processing, scientific computation etc. Apart from application level acceleration, GPUs have also been used to accelerate system software and networking related tasks. SNAP [31] offloads the packet forwarding task to a GPU and reports a speedup of 4x. SSLShader [12] uses GPUs to accelerate SSL cryptographic operations and obtains speedups of 22x to 31x over the fastest CPU implementation. Gnort [34] is a high performance Network Intrusion Detection System (NIDS) based on GPUs. GPUstore [32] accelerates file system encryption/decryption tasks using GPUs and reports significant improvements in disk throughput. To the best of our knowledge, there is not much work related to the use of GPUs for accelerating virtualized systems or virtualization management tasks. The only reported work is by Naoi and Yamada [20], which performs compression and decompression of VM pages before and after live migration. They report a speedup of 3 to 4 times in the live migration process when GPUs are used for compression and decompression. Catalyst is an attempt in the same direction: to enable the usage of GPUs to assist operating system level operations, and to offload optimization tasks onto the GPU.

6.3 GPU virtualization

GPU computation capabilities have made GPUs an integral part not just of desktop computers but also of high performance computing systems and cloud computing setups, and several applications are being developed that rely on GPUs for accelerated computing. A GPU can be accessed by virtual machines using the PCI passthrough technique. PCI passthrough gives a VM direct access to a GPU, but does not allow the GPU to be used by another VM. GPU virtualization resolves this multiplexing issue and facilitates sharing of a GPU among multiple virtual machines. rCUDA [6], vCUDA [29, 30], GViM [10], gVirtuS [7, 19] and Cu2rcu [26] virtualize the GPU by intercepting GPU API calls in the guest machines and redirecting those calls to the GPU on the host machine. Vendor supported virtualization solutions focus on manufacturing GPUs that allow multiple VMs to run their applications simultaneously on the GPU. NVIDIA is currently the only GPU manufacturer that provides such a solution. With the launch of GRID [11] technology, NVIDIA has paved the way for a new kind of GPU virtualization solution. NVIDIA GRID K1 and K2 [21] are specifically designed to support GPU sharing in virtualized environments. These variants multiplex the GPU in hardware, and hence the GPU can be used simultaneously by different execution entities.

6.4 Data movement with advanced GPU architectures

An important bottleneck in the usage of GPU devices is the latency of data transfer to/from the GPU. GPU devices are connected to PCI slots and interface with the CPU via the PCI bus. In order to execute tasks on GPUs, the associated data needs to be transferred from CPU memory to GPU memory through the PCI bus, and vice-versa for copying results back to the CPU. Heterogeneous system architectures (HSA) [1] aim to bridge this gap by fusing the CPU and the GPU on the same processor die. With HSA, GPUs have direct access to CPU physical memory, hence there is no need to explicitly transfer data over a device bus. Another alternative along the same lines is a specialized high speed interconnect between the CPU and the GPU, and between GPUs (NVIDIA's NVLink [23] technology). Such a specialized link can transfer data 5 to 12 times [23] faster than the traditional PCI bus. An interesting extension of Catalyst would be to study the cost-benefit tradeoffs with such advanced architectures.

7. Discussion

7.1 Catalyst extensions

Techniques exploiting different granularities for deduplication can be candidates for GPU assistance. For example, Difference Engine [9] relies on finding similarities at the sub-page level. Further, Guo et al. [8] propose breaking large pages to improve the deduplication potential. Catalyst can be extended to both these cases: (i) to perform sub-page level deduplication, and (ii) to rank large pages in order of potential benefits. Currently, Catalyst uses an adaptive policy to tune its scanning rate, determined by using a threshold on the minimum required sharing.
Additionally, other system parameters can be used to tune the behavior of Catalyst. For example, with the Twitter benchmark we observed a high rate of CoW breaks which lead to a performance degradation. An adaptive policy can be proposed in which the scan rate is inversely proportional to the rate of CoW breaks. Similarly, Catalyst can be sensitive to the CPU availability of the system, i.e., be more aggressive during low CPU utilization periods. ory areas. This leads to two data copies: one from pageable memory to pinned memory and then from the pinned memory to GPU memory. Current implementation of Catalyst relies on this mechanism as well. An advancement to this is a zero-copy mechanism [2] which allows programmers to allocate variables in pinned memory through CUDA API calls. Such variables can be directly copied to the GPU via DMAbased transfer. A possible design is to allocate a large pinned memory area and map to physical pages of the virtual machines. We tried to implement this design but were not able to successfully change the mappings of addresses in pinned regions. A successful implementation of such a design can further reduce the data transfer costs of Catalyst. 7.2 The challenge of memory deduplication techniques are in identifying and leveraging the sharing opportunities quickly and efficiently. An ideal memory consolidation goal should be to quickly achieve maximum sharing with minimum CPU utilization. Towards this objective, we presented a mechanism to enhance deduplication efficiency through the usage of GPUs. We proposed Catalyst, a novel and efficient technique, to offload the kernel level memory deduplication task of page hashing and hash comparison to GPU devices. Catalyst used GPU to rapidly identify the pages that have a high likelihood of getting shared. We evaluated our system against vanilla KSM with different realistic benchmarks. 
The empirical results demonstrated that Catalyst can achieve higher memory sharing in lesser time when compared to different scan rate configuration of KSM. Further, the CPU utilization costs of Catalyst was lesser or comparable with KSM with different scan rate settings. We also evaluated our system for multiple-VM scenario with different VMs running different workloads. In this case too Catalyst performed better than KSM in terms of sharing achieved and CPU cycles consumed. In the end, we proposed an illustrative adaptive rate limiting policy for both Catalyst and KSM, and compared the results. The results show that Catalyst could meet higher level sharing objectives in a cost effective manner. Architecture advancements and implications Heterogeneous system architectures have changed the GPU memory access paradigm and enabled unified memory addressing schemes for the CPU and the GPU. These architectures can potentially mitigate the data transfer overheads and improve the efficiency of Catalyst. It would be interesting to explore these alternative models of interfacing with the GPU for OS level services like deduplication, encryption, compression etc. The latest NVIDIA CUDA API provides a software-based unified memory access solution [24]. The API frees the programmer from the burden of explicitly transferring the data to/fro from the CPU and the GPU. This feature works for user-level memory allocations and is not directly applicable to transfer kernel-level data objects. Extensions to the CUDA APIs would be required for efficient transfer of kernel state to the GPUs to enable Catalyst-style solutions. 7.3 Implementation alternatives In order to avoid the copy of data from kernel to user space we have changed the mapping of user-space process variables to point to virtual machine physical pages. Another technique to access physical pages of a virtual machine is through use of the /dev/mem device interface provided by Linux. 
The device file can be used along with the mmap system call to map all physical pages into user-space. While this approach can be useful, the whole physical address range has to be mapped, whereas with Catalyst only subset of the physical pages correspond to the virtual machines. A naive technique is to identify physical pages of virtual machines and selectively mmap them to the Catalyst user-space daemon. A possible optimization that uses /dev/mem with mmap and relies on low-overhead and dynamic identification of physical pages for each VM can be an alternative design option for Catalyst. Data transfer from CPU memory to GPU memory requires the data to be first transferred from pageable memory to pinned memory. Data is transferred to GPU through DMA which requires data to be present in pinned mem- 8. Conclusions References [1] Heterogeneous system architecture (hsa) foundation. URL http://www.hsafoundation.com/. [2] Cuda toolkit documentation. URL http://docs.nvidia .com/cuda/cuda-c-best-practices-guide/#zero-copy. [3] A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using ksm. In Proceedings of the 11th Ottawa Linux Symposium (OLS), 2009. [4] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems (TOCS), 15 (4):412–447, 1997. [5] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux. Oltp-bench: An extensible testbed for benchmarking relational databases. VLDB Endowment, 7(4):277–288, 2013. [6] J. Duato, A. J. Pena, F. Silla, J. C. Fernandez, R. Mayo, and E. S. Quintana-Orti. Enabling cuda acceleration within virtual machines using rcuda. In Proceedings of the 18th Annual International Conference on High Performance Computing (HiPC), 2011. [7] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. A gpgpu transparent virtualization component for high performance computing clouds. 
In Proceedings of the 16th International European Conference on Parallel Processing (Euro-Par), 2010.
[8] F. Guo, S. Kim, Y. Baskakov, and I. Banerjee. Proactively breaking large pages to improve memory overcommitment performance in VMware ESXi. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2015.
[9] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference Engine: Harnessing memory redundancy in virtual machines. Communications of the ACM, 53(10):85–93, 2010.
[10] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. GViM: GPU-accelerated virtual machines. In Proceedings of the 3rd Workshop on System-level Virtualization for High Performance Computing (HPCVirt), 2009.
[11] A. Herrera. NVIDIA GRID: Graphics accelerated VDI with the visual performance of a workstation, 2014. URL http://www.nvidia.com/content/grid/vdi-whitepaper.pdf.
[12] K. Jang, S. Han, S. Han, S. B. Moon, and K. Park. SSLShader: Cheap SSL acceleration with commodity processors. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2011.
[13] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Geiger: Monitoring the buffer cache in a virtual machine environment. SIGARCH Computer Architecture News, 34(5):14–24, 2006.
[14] Khronos. The open standard for parallel programming of heterogeneous systems, 2015. URL https://www.khronos.org/opencl/.
[15] D. Magenheimer, C. Mason, D. McCracken, and K. Hackel. Transcendent memory and Linux. In Proceedings of the 11th Ottawa Linux Symposium (OLS), 2009.
[16] K. Miller, F. Franz, M. Rittinghaus, M. Hillenbrand, and F. Bellosa. XLH: More effective memory deduplication scanners through cross-layer hints. In Proceedings of the 24th USENIX Annual Technical Conference (ATC), 2013.
[17] G. Miłoś, D. G. Murray, S. Hand, and M. A. Fetterman. Satori: Enlightened page sharing. In Proceedings of the 20th USENIX Annual Technical Conference (ATC), 2009.
[18] D. Mishra and P. Kulkarni. Comparative analysis of page cache provisioning in virtualized environments. In Proceedings of the 22nd International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2014.
[19] R. Montella, G. Coviello, G. Giunta, G. Laccetti, F. Isaila, and J. G. Blas. A general-purpose virtualization service for HPC on cloud computing: An application to GPUs. In Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics (PPAM), 2011.
[20] Y. Naoi and H. Yamada. A GPU-accelerated VM live migration for big memory workloads. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC), 2014.
[21] NVIDIA. NVIDIA GRID K1 and K2 graphics-accelerated virtual desktops and applications, June 2013. URL http://www.nvidia.in/content/cloud-computing/pdf/nvidia-grid-datasheet-k1-k2.pdf.
[22] NVIDIA. CUDA parallel computing platform, 2015. URL http://www.nvidia.com/object/cuda_home_new.html.
[23] NVIDIA. NVIDIA NVLink high-speed interconnect, 2017. URL http://www.nvidia.com/object/nvlink.html.
[24] NVIDIA. Unified memory in CUDA 6, 2017. URL https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/.
[25] S. Rachamalla, D. Mishra, and P. Kulkarni. Share-o-meter: An empirical analysis of KSM based memory sharing in virtualized systems. In Proceedings of the 20th Annual IEEE International Conference on High Performance Computing (HiPC), 2013.
[26] C. Reano, A. Pena, F. Silla, J. Duato, R. Mayo, and E. Quintana-Orti. CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution. In Proceedings of the 19th Annual International Conference on High Performance Computing (HiPC), 2012.
[27] Redis Labs. Redis. URL https://redis.io/.
[28] P. Sharma and P. Kulkarni. Singleton: System-wide page deduplication in virtual environments. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2012.
[29] L. Shi, H. Chen, and J. Sun. vCUDA: GPU accelerated high performance computing in virtual machines. In Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2009.
[30] L. Shi, H. Chen, J. Sun, and K. Li. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Transactions on Computers, 61(6):804–816, 2012.
[31] W. Sun and R. Ricci. Fast and flexible: Parallel packet processing with GPUs and Click. In Proceedings of the 9th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2013.
[32] W. Sun, R. Ricci, and M. L. Curry. GPUstore: Harnessing GPU computing for storage systems in the OS kernel. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR), 2012.
[33] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. International Journal of Computational Fluid Dynamics, 22(7):443–456, 2008.
[34] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. P. Markatos, and S. Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection (RAID), 2008.
[35] V. Tarasov, E. Zadok, and S. Shepler. Filebench: A flexible framework for file system benchmarking. ;login: The USENIX Magazine, 41(1):6–12, 2016.
[36] F. Vazquez, E. Garzon, J. Martinez, and J. Fernandez. The sparse matrix vector product on GPUs. In Proceedings of the 9th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), 2009.
[37] C. A. Waldspurger. Memory resource management in VMware ESX Server. ACM SIGOPS Operating Systems Review, 36(SI):181–194, 2002.
[38] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Sandpiper: Black-box and gray-box resource management for virtual machines. Computer Networks, 53(17):2923–2938, 2009.
[39] Z. Yang, Y. Zhu, and Y. Pu. Parallel image processing based on CUDA. In Proceedings of the 2nd International Conference on Computer Science and Software Engineering (CSSE), 2009.