In Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 243-252, 1995.

U-cache: A Cost-effective Solution to the Synonym Problem

Jesung Kim, Sang Lyul Min, Deog-Kyoon Jeong‡, Sanghoon Jeon†, Byoungchul Ahn†, Chong Sang Kim
Dept. of Computer Engineering and ‡Dept. of Electronic Engineering, Seoul National University, Korea
†Dept. of Computer Engineering, Yeungnam University, Korea
Email: [email protected]

Abstract

This paper proposes a cost-effective solution to the synonym problem. In the proposed solution, a minimal hardware addition guarantees correctness while a software counterpart helps improve performance. The key to the proposed solution is the addition of a small physically-indexed cache called the U-cache. The U-cache maintains the reverse translation information of only those cache blocks that belong to unaligned virtual pages, where aligned means that the lower bits of the virtual page number match those of the corresponding physical page number. A U-cache, even with only one entry, ensures correct handling of synonyms. A simple software optimization, in the form of page alignment, helps improve the performance. Performance evaluation based on ATUM traces shows that a U-cache with only a few entries performs almost as well as (and in some cases outperforms) a fully-configured hardware-based solution when more than 95% of the pages are aligned.

1 Introduction

Recently, virtual caches have become increasingly important due to the emergence of high-speed processors[1, 2, 3]. In virtual caches, cache access and address translation are performed in parallel, thus reducing cache access time. Physical caches, in contrast, require that address translation be performed before the cache is accessed, which in many cases slows down the cache access. (This research was supported by Korea Research Foundation grant 01-E-0201.)

Although virtual caches have a speed advantage over physical caches, they suffer from an internal consistency problem. In general, virtual memory systems allow several virtual addresses to be mapped to the same physical address. This may lead to a situation in which the virtual cache holds more than one copy of the same physical memory block, giving rise to the synonym problem[4].

So far, proposed solutions to the synonym problem have been either hardware-based[5, 6, 7] or software-based[8, 9, 10, 11]. The hardware-based solutions have the advantage that they are transparent to the software. However, they require excessive hardware to maintain the reverse translation information needed to detect synonyms in the cache. Software-based solutions, on the other hand, do not require any additional hardware. They are, however, complicated to implement and have been known to degrade overall performance.

This paper proposes a solution to the synonym problem that combines the advantages of the hardware-based and software-based approaches. The key to the proposed solution is the addition of a small physically-indexed cache called the U-cache. The U-cache maintains the reverse translation information of only those cache blocks that belong to unaligned virtual pages (i.e., virtual pages whose lower virtual page number bits do not match those of the corresponding physical page number). This is in contrast to other hardware-based approaches, which require the reverse translation information of all the blocks in the virtual cache to be maintained. A U-cache, even with only one entry, ensures correct handling of synonyms.
Software optimization in the form of page alignment helps improve the performance of the virtual cache.

The overall structure of this paper is as follows: Section 2 reviews previous approaches to the synonym problem. Section 3 gives a complete description of our approach. Section 4 gives a quantitative evaluation of our approach. Section 5 provides some concluding remarks.

2 Previous approaches to the synonym problem

Caches are high-speed buffers that store parts of main memory[4]. Until recently, caches have generally been accessed by physical addresses. Therefore, in computer systems that support virtual memory, processor-generated virtual addresses have to be translated to physical addresses before the cache is accessed. This address translation is normally performed by a TLB and, in many cases, increases the cache access time.

In virtual caches, cache blocks are selected by virtual addresses rather than physical addresses. Therefore, they do not suffer from delays caused by TLB access and hence have a speed advantage over physical caches. However, virtual caches are subject to a consistency problem called the synonym problem. This problem arises when the virtual cache holds more than one copy of the same memory block that can be accessed through two or more virtual addresses. For example, when virtual addresses A and B are mapped to the same physical address P, two copies of P may reside in the virtual cache at the same time (cf. Figure 1). A subsequent write access to one of these copies will make the other copy stale. If this stale copy is allowed to be accessed, the correctness of the execution will be violated.

Figure 1: Synonym problem. Assume the virtual address space is 16 Mbytes and the physical address space is 1 Mbyte. Also assume that both of the virtual addresses 019000 and 02F000 are mapped to the same physical address 4A000. If the virtual cache size is 64 Kbytes, the block size is 16 bytes, and the set associativity is 1, then virtual addresses 019000 and 02F000 are indexed to cache blocks 900 and F00 respectively. The processor first reads the data at virtual address 019000, which loads cache block 900 with the contents at physical address 4A000. The processor then reads the same data again, this time using virtual address 02F000. Although the requested item is in the cache, the read reference will result in a miss since the cache is indexed by virtual addresses. If no special handling is done, two copies of the contents of physical address 4A000 will then exist within the virtual cache. In such a situation, if a write is performed to cache block 900 through virtual address 019000, then cache block F00 will continue to carry the old (i.e., incorrect) value. After this, reads from cache block F00 through virtual address 02F000 will not get the correct value and thus correct execution will be violated.
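To make the indexing in the Figure 1 scenario concrete, the following minimal sketch computes the cache block index for each virtual address under the stated parameters (64 Kbyte direct-mapped cache, 16-byte blocks). The helper name is ours, not from the paper.

```c
#include <stdio.h>

#define CACHE_SIZE (64 * 1024)          /* 64 Kbyte direct-mapped virtual cache */
#define BLOCK_SIZE 16                   /* 16-byte blocks                       */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)

/* The block index comes straight from the virtual address: drop the
 * 4 offset bits, keep the next log2(4096) = 12 bits.                 */
static unsigned block_index(unsigned addr)
{
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

int main(void)
{
    /* Both virtual addresses map to physical address 0x4A000, yet they
     * select different cache blocks: the two synonym copies.           */
    printf("VA 0x019000 -> block 0x%03X\n", block_index(0x019000)); /* 0x900 */
    printf("VA 0x02F000 -> block 0x%03X\n", block_index(0x02F000)); /* 0xF00 */
    return 0;
}
```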
There have been both hardware-based and software-based approaches to the synonym problem. Software-based techniques require no additional hardware to solve the synonym problem. The simplest of the software-based techniques is to disallow synonyms altogether[12]. This solution is simple, but it unnecessarily complicates operating system implementation. Another software-based technique with less restriction is to require that all the synonyms be aligned[9]. (When two synonyms are mapped to the same cache block, they are said to be aligned.) In this technique, when one of the aligned synonyms is loaded into the cache, its synonym is displaced from the cache if such a synonym previously existed there, and thus the synonym problem is prevented. The main problem with this approach is that the requirement for all synonyms to be aligned is still too restrictive. In real systems, synonyms that are not aligned, called unaligned synonyms, do occur[11], and disallowing them totally would put an undue restriction on operating system designers and, in some cases, on application programmers.

Some software-based solutions allow unaligned synonyms as well as aligned ones[8, 10]. They usually make use of a mechanism to trap on an event that may cause inconsistencies in the cache. In the solution given in [8], more than one virtual-to-physical mapping (i.e., synonyms) is allowed if the mappings are read-only. When the processor writes to one of the synonyms in the cache, the write is trapped by the virtual memory protection mechanism. As a result, the previous read-only mappings are broken and a unique write mapping is established. At the same time, the cache blocks associated with the broken mappings are purged from the cache. On a later read by a virtual address different from the one used by the last write, the unique write mapping is broken and multiple read-only mappings are allowed again. When the write mapping is broken, the cache blocks associated with it are written back to main memory so that later reads through other mappings see the up-to-date copy in main memory. The purges and write-backs required in this approach degrade the overall cache performance. To reduce this performance degradation, Wheeler and Bershad proposed and implemented various optimizations based on an elegant model for virtual cache management[11]. In general, however, software-based techniques require substantial operating system modifications, which make the porting of a new operating system time-consuming[8].
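The following sketch illustrates the trap-driven policy of [8] as we read it: a physical page is either readable through many mappings or writable through exactly one. The state names and helper functions are hypothetical stand-ins, not the cited system's actual interface.

```c
/* A physical page is either readable through many virtual mappings or
 * writable through exactly one (the policy of [8], as we read it).    */
enum page_state { MULTI_READ, SINGLE_WRITE };

struct phys_page {
    enum page_state state;
    unsigned long   write_va;  /* the sole mapping when SINGLE_WRITE */
};

/* Hypothetical stand-ins for OS and cache operations. */
extern void purge_cached_blocks(struct phys_page *pp);
extern void write_back_cached_blocks(struct phys_page *pp);
extern void revoke_all_mappings(struct phys_page *pp);
extern void map_readonly(struct phys_page *pp, unsigned long va);
extern void map_writable(struct phys_page *pp, unsigned long va);

/* A write to a read-only synonym traps here: break the read mappings,
 * purge their cache blocks, and establish a unique write mapping.     */
void on_write_fault(struct phys_page *pp, unsigned long va)
{
    purge_cached_blocks(pp);
    revoke_all_mappings(pp);
    map_writable(pp, va);
    pp->state    = SINGLE_WRITE;
    pp->write_va = va;
}

/* A read through a different virtual address while a write mapping is
 * active: write back its blocks, break the write mapping, and allow
 * read-only mappings again.                                           */
void on_read_fault(struct phys_page *pp, unsigned long va)
{
    if (pp->state == SINGLE_WRITE && va != pp->write_va) {
        write_back_cached_blocks(pp);
        revoke_all_mappings(pp);
        pp->state = MULTI_READ;
    }
    map_readonly(pp, va);
}
```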
The hardware-based solutions do not require any software intervention to solve the synonym problem. They generally make use of reverse translation information to detect synonyms in the cache and, when a synonym is detected, to remove it from the cache. (One of the exceptions is the IBM 3090 system[13]. This system has a four-way set-associative cache. Since the cache size (64 Kbytes) exceeds the page size (4 Kbytes) multiplied by the set associativity (4), virtual indexing is used to access the cache in parallel with the TLB access. To solve the synonym problem caused by the virtual indexing, all the blocks that may contain the synonym (in this case 16 blocks from 4 sets) are checked before a block is loaded. This can be performed in one cycle by accessing the 16 tags and comparing them with the physical address of the requested block. This technique cannot be applied to virtually tagged caches since the virtual tags of synonyms differ from each other.) For example, a physically-indexed cache, called the real tag (R-tag) cache, is used to perform the necessary reverse translation in the hardware-based solution proposed by Goodman[7]. Likewise, in the V(irtual)-R(eal) cache approach[6], which combines a first-level virtual cache with a second-level physical cache, the reverse translation information is recorded in the tag field of the second-level physical cache.

Since the two solutions are conceptually similar in their handling of the synonym problem, we describe only the R-tag cache approach in the following. As previously mentioned, the R-tag cache is accessed by physical addresses and maintains the reverse translation information of the blocks in the virtual cache. In this scheme, all the valid blocks in the virtual cache have corresponding entries in the R-tag cache. The entries in the R-tag cache are pointers that relate the blocks in the virtual cache to their physical addresses. In this approach, the synonym problem is handled as follows (cf. Figure 2; a code sketch follows below):

- If an access is a hit, the requested operation (read or write) is performed immediately.
- Otherwise (i.e., if the access is a miss), the physical address obtained through TLB access is used to access the R-tag cache.
  - If the access to the R-tag cache is a hit, a synonym exists in the cache. The cache block that the R-tag cache entry points to is moved to the cache block determined by the virtual address provided by the processor. The R-tag cache pointer is then updated so that it points to this new block.
  - If the access to the R-tag cache is a miss, no synonym exists in the cache. (Remember that all the valid blocks in the virtual cache have corresponding entries in the R-tag cache.) Thus the memory block retrieved from main memory is placed in the cache block determined by the virtual address of the current request. In addition, an entry is allocated in the R-tag cache and is made to point to the new block.
- The requested operation is now performed.

In this scheme, the reverse translation information maintained in the R-tag cache offers an indirect method for accessing the blocks in the virtual cache using physical addresses. That is, before a new block is placed in the cache, the virtual cache is indirectly accessed using the physical address, thus making sure that the block about to be loaded is unique in the virtual cache.

Figure 2: R-tag cache approach. As in Figure 1, the synonyms with virtual addresses 019000 and 02F000 are both mapped to the same physical address 4A000. Assume that the cache configuration and the scenario are the same as those given in Figure 1. The figure shows the steps taken when the processor tries to read the data at virtual address 02F000 and misses. To service this miss, the R-tag cache is accessed by physical address 4A000 obtained through TLB access. With the synonym information provided by the R-tag cache, cache block 900 is invalidated and its contents copied to cache block F00. Thus, the existence of two identical copies within the virtual cache is prevented.
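A minimal sketch of the miss-handling steps just described, assuming a direct-mapped virtual cache with a direct-mapped full-size R-tag cache. The data structures and helpers are ours; write-backs of dirty victims and cleanup of the victim's own R-tag entry are omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 16
#define NUM_BLOCKS 4096   /* 64 Kbyte cache / 16-byte blocks, as in Figure 1 */

struct vblock { bool valid; uint32_t vtag; uint8_t data[BLOCK_SIZE]; };
struct rentry { bool valid; uint32_t ptag; uint32_t vc_index; };

static struct vblock vcache[NUM_BLOCKS];
static struct rentry rtag[NUM_BLOCKS];  /* one entry per virtual cache block */

/* Hypothetical helpers standing in for the TLB and memory system. */
extern uint32_t tlb_translate(uint32_t vaddr);
extern void     fetch_from_memory(uint32_t paddr, uint8_t *buf);

static void handle_miss(uint32_t vaddr)
{
    uint32_t paddr = tlb_translate(vaddr);
    uint32_t vidx  = (vaddr / BLOCK_SIZE) % NUM_BLOCKS;   /* virtual index */
    uint32_t pidx  = (paddr / BLOCK_SIZE) % NUM_BLOCKS;   /* R-tag index   */
    uint32_t ptag  = paddr / (BLOCK_SIZE * NUM_BLOCKS);

    if (rtag[pidx].valid && rtag[pidx].ptag == ptag) {
        /* R-tag hit: a synonym is cached; move it to the block selected
         * by the current virtual address.                               */
        vcache[vidx] = vcache[rtag[pidx].vc_index];
        if (rtag[pidx].vc_index != vidx)
            vcache[rtag[pidx].vc_index].valid = false;
    } else {
        if (rtag[pidx].valid)
            /* Conflict in the R-tag cache: the displaced pointer's block
             * must go too; this is the paired eviction discussed below.  */
            vcache[rtag[pidx].vc_index].valid = false;
        fetch_from_memory(paddr, vcache[vidx].data);
        rtag[pidx].ptag = ptag;
    }
    rtag[pidx].valid    = true;
    rtag[pidx].vc_index = vidx;   /* reverse pointer: physical -> virtual */
    vcache[vidx].valid  = true;
    vcache[vidx].vtag   = vaddr / (BLOCK_SIZE * NUM_BLOCKS);
}
```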
Since each block in the virtual cache requires an R-tag cache entry in this approach, a fully-associative R-tag cache with the same number of entries as the virtual cache is required to keep the reverse translation information of all the blocks in the cache. However, in many cases a set-associative or direct-mapped organization is advocated for the R-tag cache for implementation reasons. If a set-associative or direct-mapped R-tag cache is used, a situation arises where an R-tag cache entry cannot be allocated due to conflicts in the R-tag cache. In such a case, an R-tag cache entry that points to some other virtual cache block has to be selected and displaced (cf. Figure 3). To prevent the synonym problem, the cache block that this displaced pointer previously pointed to must also be displaced. Thus, in this case, to load a single block two blocks have to be displaced: one to make room in the virtual cache, and the other being the block pointed to by the displaced R-tag cache entry. This phenomenon is called paired eviction[14]. Paired eviction degrades cache storage utilization and results in a higher cache miss ratio.

Figure 3: Example of paired eviction. To load the block for virtual address 019000, not only must the block tagged 03 be replaced, but the block tagged 0B must also be displaced in order for the pointer to be stored in the R-tag cache. If the block tagged 0B is not displaced, then when this block's synonym is loaded into the cache, the block's existence cannot be detected, resulting in two synonyms existing in the cache.

In [15], Cekleov et al. quantitatively assessed the performance impact of the associativity of the R-tag cache and proposed a scheme to avoid paired eviction. In the proposed scheme, the R-tag cache is virtually indexed although it contains physical tags. Since both the virtual cache and the R-tag cache are indexed by virtual addresses, paired eviction is avoided. However, this scheme requires that every synonym be aligned. Furthermore, the implementation is rather complicated because the virtual address as well as the physical address is needed to access the R-tag cache. We propose a simpler and more cost-effective solution to the paired eviction and synonym problems in the following section.

3 U-cache approach to the synonym problem

As mentioned in the previous section, paired eviction occurs due to limited R-tag cache set associativity. However, if the R-tag cache entry for the block being loaded is always the same as that for the block being replaced, then paired eviction will not occur regardless of the R-tag cache set associativity. This situation arises when the lower log2(c/(p*a_r)) bits of the physical page number of the block being loaded match those of the block being replaced (c: virtual cache size, p: page size, a_r: R-tag cache set associativity). For example, assume that the cache size is 64 Kbytes, the page size is 4 Kbytes, and the R-tag cache is direct-mapped (i.e., set associativity of 1). In this case, if the lower 4 bits of the physical page number of the block being loaded match those of the block being replaced, paired eviction will not occur.

Formally, we can define a relation R_V on the set V of all virtual addresses:

R_V = { <x, y> | x and y have the same lower log2(c/(p*a)) bits of their virtual page numbers }

(c: virtual cache size, p: page size, a: virtual cache set associativity). Since the relation R_V is an equivalence relation (i.e., it is reflexive, symmetric, and transitive), it generates a unique partition of the virtual address space into equivalence classes. We denote this set of equivalence classes induced by R_V as V/R_V. Assuming that the number of entries and the set associativity of the R-tag cache are the same as those of the virtual cache, we can define a similar relation R_P on the physical address space P:

R_P = { <x, y> | x and y have the same lower log2(c/(p*a)) bits of their physical page numbers }

Like R_V, R_P partitions the physical address space into equivalence classes. The set of these equivalence classes is denoted by P/R_P. With these definitions, the aforementioned condition for the avoidance of paired eviction can be restated as follows: if the virtual addresses of the block being loaded and the block being replaced are from the same equivalence class of V/R_V, and their physical addresses are also from the same equivalence class of P/R_P, then paired eviction cannot occur. (In general, paired eviction will not occur if the lower min(log2(c_v/(p*a_v)), log2(c_r/(p*a_r))) bits of the virtual page number correspond to those of the physical page number, where c_v is the virtual cache size, p the page size, a_v the virtual cache set associativity, c_r the number of entries in the R-tag cache multiplied by the virtual cache block size, and a_r the R-tag cache set associativity.)

Using this observation, paired eviction can be prevented by relating each equivalence class of V/R_V to a unique equivalence class of P/R_P. Such an arrangement can be made by defining a one-to-one and onto function from V/R_V to P/R_P. There are (c/(p*a))! such functions; one of them is given in Figure 4-a. Among the possible functions, the simplest is the identity function shown in Figure 4-b. This arrangement can easily be made by matching the lower log2(c/(p*a)) bits of the physical page number with those of the virtual page number when allocating a physical page for a virtual page, which is much simpler than aligning all the synonyms.

Figure 4: Examples of functions from V/R_V to P/R_P.

This page alignment technique is by no means new. The technique has been used in other contexts and is sometimes referred to as page coloring[16]. For example, Kessler and Hill used this technique to reduce the misses that occur when frequently used virtual pages are mapped to the same cache region[17]. As another example, Taylor et al. used the page alignment idea to reduce the miss ratio of a simplified version of a TLB called the TLB slice[16]. In their case, the technique was used to improve the accuracy of the hint provided by the TLB slice. Lastly, the page alignment technique has been applied to physical caches in order to allow cache access and address translation to occur in parallel[18, 19]. When the lower bits of the virtual page number are made to correspond to those of the physical page number, the cache set index can be derived directly from the virtual address, so the cache access and the TLB access can be performed in parallel. However, unlike the technique we explain later in this section, this use requires that all the pages be aligned for correct operation.
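As an illustration of the identity-function arrangement, the following sketch shows how a physical page allocator might prefer a frame whose lower page-number bits match those of the faulting virtual page (the "v-p aligned" condition defined next). The free-list interface is hypothetical; note that in our approach the fallback to an unaligned frame affects only performance, never correctness.

```c
#include <stddef.h>
#include <stdint.h>

/* log2(c/(p*a)) low page-number bits must match: e.g., a 64 Kbyte
 * direct-mapped cache with 4 Kbyte pages requires 4 bits.          */
#define ALIGN_BITS 4
#define ALIGN_MASK ((1u << ALIGN_BITS) - 1)

static inline int vp_aligned(uint32_t vpn, uint32_t ppn)
{
    return ((vpn ^ ppn) & ALIGN_MASK) == 0;
}

/* Hypothetical free-frame list interface. */
extern size_t   free_frame_count(void);
extern uint32_t free_frame_at(size_t i);    /* peek frame number at slot i */
extern uint32_t take_frame(size_t i);       /* remove and return that frame */

/* Allocate a frame for virtual page 'vpn', preferring an aligned one. */
uint32_t alloc_frame(uint32_t vpn)
{
    for (size_t i = 0; i < free_frame_count(); i++)
        if (vp_aligned(vpn, free_frame_at(i)))
            return take_frame(i);            /* aligned frame found        */
    return take_frame(0);                    /* fall back: any free frame  */
}
```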
To avoid possible confusion with aligned synonyms, we will use the term v-p aligned to denote the situation where the lower log2(c/(p*a)) bits of the virtual page number and those of the physical page number correspond. If all the pages are v-p aligned, then the pointers in the R-tag cache would, as shown in Figure 5, point to blocks in the virtual cache with the same cache set index. Thus, regardless of the set associativity of the R-tag cache, paired eviction will not occur. However, strict enforcement of v-p alignment reduces the set associativity of the physical memory. For example, in a 64 Kbyte direct-mapped virtual cache with a page size of 4 Kbytes, v-p alignment demands that the lower 4 bits (log2(64K/(4K*1))) of the virtual and physical page numbers match. The set associativity of the physical memory is therefore reduced by a factor of 16, and a 1 Mbyte physical memory becomes 16-way (1M/(4K*16)) set-associative. However, if the physical memory is sufficiently large, this has little effect on performance[19]. Furthermore, since v-p alignment is not a requirement for correctness in our approach but a hint for better performance, as we will see later, the operating system is free to choose a v-p unaligned physical page if a v-p aligned physical page is not available.

Figure 5: Prevention of paired eviction by page alignment.

A careful inspection of Figure 5 reveals that the R-tag cache pointers for the cache blocks from v-p aligned pages are just redundant information when the cache is direct-mapped. That is, the R-tag cache pointers for such cache blocks can be derived directly from their physical addresses. Hence, if all the virtual pages are v-p aligned, then the pointer field of the R-tag cache, and therefore the R-tag cache itself, can be eliminated in uniprocessors, where a duplicate tag memory such as an R-tag cache is not otherwise required. (Such duplicate tag memory is needed in bus-based multiprocessors for bus snooping purposes.)

To handle the case where a virtual page is not v-p aligned, our approach provides a small R-tag cache, called the U-cache, that maintains the pointers for the cache blocks from v-p unaligned pages only (cf. Figure 6). This is in contrast to other hardware-based approaches, which maintain the reverse translation information of all the blocks in the virtual cache.

Figure 6: U-cache organization.

With the U-cache, the overall caching algorithm is as follows (a code sketch of these steps appears later in this section):

- When a cache access is a hit, the requested operation (read or write) is performed immediately and no further processing is required, as in the R-tag cache approach.
- When the cache access is a miss, the U-cache is accessed while the missed block is being fetched from main memory. Depending on whether this U-cache access hits or misses, the following steps are carried out. Since these steps are performed only on a cache miss, and are carried out while the missed block is being fetched from main memory, their impact on cache performance is minimal.
  - A U-cache hit implies that a (v-p unaligned) synonym exists in the virtual cache. The block indicated by the pointer is moved to the cache block selected through the virtual address. If the current cache access is v-p aligned (which can easily be checked by comparing the lower bits of the virtual page number with those of the physical page number obtained through TLB access), there is no need for a U-cache pointer, so the U-cache entry that previously pointed to the synonym block is deallocated. Otherwise (i.e., if the current request is v-p unaligned), the U-cache pointer is updated to point to the new block.
  - A U-cache miss implies that there is no v-p unaligned synonym in the cache; either there is no synonym at all, or the synonym in the cache is v-p aligned. If the current access is v-p aligned, the synonym problem cannot arise, because the missed block will displace its synonym from the cache, provided that the cache is direct-mapped. Otherwise (i.e., if the current request is v-p unaligned), a v-p aligned synonym may reside in the cache and may cause the synonym problem. Thus, in our scheme, the cache block that may contain the v-p aligned synonym is checked and invalidated. The location of such a cache block is derived from the physical address of the current access. (Remember that in this case only a v-p aligned synonym is possible in the cache, and hence the synonym has the same cache index bits as those obtained from the physical address of the current request.) Since the current access is v-p unaligned, a new entry is allocated in the U-cache and made to point to the new block. In this case, a paired eviction may occur; however, paired evictions can be minimized by v-p aligning as many pages as possible.
- The requested operation (read or write) is now performed.

The U-cache approach has the advantage over the R-tag cache approach of requiring only a minimal hardware addition. A U-cache, even with only one entry, ensures correct handling of synonyms. Software optimization in the form of page alignment helps improve the cache performance by minimizing the paired evictions caused by a limited number of U-cache entries.

U-cache for set-associative virtual caches

In a set-associative virtual cache, both cache hits and cache misses with a U-cache hit can be processed in the same manner as in the direct-mapped case. However, the processing of a cache miss with a U-cache miss is more complicated, and an efficient solution is possible only when the virtual cache is physically tagged. In the following, we describe the steps taken to handle a cache miss with a U-cache miss, assuming a physically tagged virtual cache. As before, a U-cache miss implies that only a v-p aligned synonym can be in the cache. If the current access is v-p aligned, no synonym is possible in the cache, since otherwise the access would have been a hit. On the other hand, if the current request is v-p unaligned, a v-p aligned synonym may reside in the cache. In a physically tagged virtual cache, the existence of a v-p aligned synonym can be detected by accessing the cache using the physical address of the current request. If this access is a hit, there is a v-p aligned synonym, and its contents should be used to service the current cache miss. Otherwise, there is no synonym in the cache and the missed block is fetched from main memory.
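A minimal sketch of the miss handling described above for the direct-mapped case. The structures and helpers are ours (hypothetical); write-backs, victim cleanup of U-cache entries, and the U-cache replacement policy are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS  2048    /* e.g., 64 Kbyte cache / 32-byte blocks   */
#define UCACHE_SIZE 8       /* a few entries suffice for correctness   */
#define ALIGN_MASK  0xF     /* lower page-number bits that must match  */

struct vblock { bool valid; uint32_t vtag; uint8_t data[32]; };
struct uentry { bool valid; uint32_t ptag; uint32_t vc_index; };

static struct vblock vcache[NUM_BLOCKS];
static struct uentry ucache[UCACHE_SIZE];

extern uint32_t tlb_translate(uint32_t vaddr);             /* hypothetical */
extern void     fetch_from_memory(uint32_t paddr, uint8_t *buf);

static void handle_miss(uint32_t vaddr)
{
    uint32_t paddr   = tlb_translate(vaddr);
    uint32_t vidx    = (vaddr / 32) % NUM_BLOCKS;
    uint32_t pidx    = (paddr / 32) % NUM_BLOCKS;
    bool     aligned = (((vaddr ^ paddr) >> 12) & ALIGN_MASK) == 0;
    struct uentry *u = &ucache[(paddr / 32) % UCACHE_SIZE]; /* direct-mapped */

    if (u->valid && u->ptag == paddr / 32) {
        /* U-cache hit: a v-p unaligned synonym is cached; move it to
         * the block selected by the current virtual address.          */
        vcache[vidx] = vcache[u->vc_index];
        vcache[u->vc_index].valid = false;
        if (aligned)
            u->valid = false;            /* pointer no longer needed      */
        else
            u->vc_index = vidx;          /* keep pointing at the new block */
    } else {
        /* U-cache miss: at most a v-p aligned synonym can be cached.   */
        if (!aligned) {
            vcache[pidx].valid = false;  /* invalidate the only possible
                                            (v-p aligned) synonym        */
            if (u->valid)
                vcache[u->vc_index].valid = false;  /* paired eviction   */
            u->valid    = true;
            u->ptag     = paddr / 32;
            u->vc_index = vidx;
        }
        fetch_from_memory(paddr, vcache[vidx].data);
    }
    vcache[vidx].valid = true;
    vcache[vidx].vtag  = vaddr / (32 * NUM_BLOCKS);
}
```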
U-cache in multiprocessor systems

In many current high-performance computer systems, a two-level cache hierarchy is commonly used, in which the first-level cache is optimized for speed while the second-level cache is optimized for a high hit ratio. In multiprocessor systems, the two-level cache hierarchy has the added advantage of filtering unnecessary cache coherence transactions, provided that the multi-level inclusion (MLI) property[20] is maintained. Our U-cache scheme can be extended to multiprocessor systems by introducing a second-level cache preserving the MLI property, as in the V-R cache approach[6]. In this two-level cache hierarchy with a U-cache, the second-level cache resolves only the inter-cache coherence problem (i.e., the multiprocessor cache coherence problem), while the U-cache resolves the intra-cache coherence problem (i.e., the synonym problem). This substantially simplifies the interface between the first-level and second-level caches.

4 Performance evaluation

This section analyzes the performance of the proposed U-cache approach using trace-driven simulation. In the trace-driven simulation, we used ATUM[21] virtual address traces. The traces contain kernel-mode references as well as user-mode references. Thirteen ATUM traces were concatenated to form a single trace of 4,806,634 references, which was then fed into the simulator. During simulation, the cache was flushed every 100,000 references to simulate multiprogramming effects. Unless otherwise stated, the cache block size and page size are 32 bytes and 4 Kbytes respectively, and all the caches are direct-mapped.
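For concreteness, the following sketch shows the shape of such a trace-driven simulation loop, with periodic flushes to model multiprogramming. The trace record format and function names are hypothetical, not those of the actual simulator used in the paper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FLUSH_INTERVAL 100000   /* flush the cache every 100,000 references */

/* Hypothetical simulator interface. */
extern bool cache_access(uint32_t vaddr, bool is_write);  /* true on a hit */
extern void cache_flush(void);

void simulate(FILE *trace)
{
    uint64_t refs = 0, misses = 0;
    unsigned vaddr;
    int      is_write;

    /* Assumed trace record: "<hex address> <is_write>" per line. */
    while (fscanf(trace, "%x %d", &vaddr, &is_write) == 2) {
        if (!cache_access(vaddr, is_write != 0))
            misses++;
        if (++refs % FLUSH_INTERVAL == 0)
            cache_flush();               /* model multiprogramming effects */
    }
    printf("miss ratio: %.4f\n", (double)misses / (double)refs);
}
```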
Figure 7 shows the miss ratios of a 16 Kbyte virtual data cache for various page alignment ratios. Both direct-mapped and two-way set-associative R-tag caches were simulated. In the figure, the leftmost point is the result of the simulation where physical pages are randomly allocated without any page alignment. In this case, the page alignment ratio is about 25%, since 1/4 of the pages will be v-p aligned by chance when the cache size is four times the page size. The results show that the associativity of the R-tag cache has an immense effect on the miss ratio; however, its performance impact diminishes as the page alignment ratio increases. The rightmost point in the figure corresponds to the case where all the pages have been v-p aligned. As mentioned in Section 3, paired eviction will not occur in this case and, therefore, the miss ratio is the same regardless of the set associativity of the R-tag cache. From this result, it can be seen that the miss ratio can be improved by up to 22% through page alignment when the R-tag cache is direct-mapped. It can also be noted that when more than 95% of the pages have been v-p aligned, the direct-mapped R-tag cache yields a lower miss ratio than the two-way set-associative R-tag cache without page alignment.

Figure 7: Effects of page alignment on miss ratio (R-tag cache).

Figure 8 shows the miss ratios of virtual data caches whose sizes range from 16 Kbytes to 64 Kbytes. The figure shows that the miss ratios of caches with page alignment are about the same as those of caches twice the size without page alignment. For example, when more than 95% of the pages are v-p aligned, a 16 Kbyte cache yields performance comparable to that of a 32 Kbyte cache without page alignment. This means that only about half of the total cache blocks are utilized due to paired eviction when the pages are not v-p aligned. This result is consistent with the analysis by Goodman[7], whose analytical study states that only 1 - 1/(2*a_r) of the total blocks in the virtual cache are utilized in the R-tag cache approach, where a_r is the set associativity of the R-tag cache.

Figure 8: Page alignment effects on miss ratio over a range of cache sizes (R-tag cache).

Figure 9 depicts the miss ratio of a 16 Kbyte virtual cache when a U-cache is used. In the figure, the results for direct-mapped U-caches with 8, 32, and 128 entries are shown. The performance of the U-cache is compared against that of a full-entry R-tag cache (in this case, 512 entries). Since the U-cache maintains pointers to cache blocks from v-p unaligned pages only, its performance depends greatly on the page alignment ratio. When the page alignment technique is not used, the miss ratio of the U-cache approach is much worse than that of the R-tag cache approach. This is because the U-cache limits the number of cache blocks from v-p unaligned pages. For example, if the number of U-cache entries is 8, then at most 8 cache blocks can contain memory blocks from v-p unaligned pages. Therefore, if more than 8 frequently used blocks are from v-p unaligned pages, the performance degradation is significant. However, as the page alignment ratio increases, the miss ratio of the U-cache approach drops drastically. When the page alignment ratio is above 90%, the performance of the U-cache approach is comparable to that of the R-tag cache approach. One interesting point to note is that when the page alignment ratio reaches about 95% (70% in the case where the number of U-cache entries is 128), the U-cache slightly outperforms the R-tag cache. This is because the R-tag cache has to maintain pointers to all the blocks in the cache, whereas the U-cache maintains pointers only to blocks from v-p unaligned pages.

Figure 9: Effects of page alignment on miss ratio (U-cache).

Figure 10 depicts the miss ratios of virtual caches whose sizes range from 16 Kbytes to 64 Kbytes when 8, 32, and 128-entry U-caches are used. In the figure, the performance of the U-cache approach, which requires only a fixed number of entries regardless of the virtual cache size, is shown to be comparable to that of the R-tag cache approach, whose size has to increase in proportion to the virtual cache size. On the whole, the results indicate that our proposed U-cache approach not only costs significantly less than the R-tag cache approach but also performs as well as (and in some cases outperforms) it when the page alignment ratio is sufficiently high.

Figure 10: Miss ratios over a range of virtual cache sizes and U-cache sizes (when 95% of the pages are aligned).

5 Conclusion

With the recent emergence of high-speed processors, virtual caches have become more important due to their speed advantage over physical caches. However, in virtual caches, the synonym problem can occur when several virtual addresses are allowed to be mapped to the same physical address. Both hardware-based and software-based solutions to this problem have been proposed and/or implemented. The hardware-based solutions have the advantage of ensuring correct handling of synonyms without any software help; however, they require additional hardware, which can prove expensive. Software-based solutions require no additional hardware, but their implementation is complicated and they have been known to degrade cache performance.

This paper proposes a solution to the synonym problem that combines the advantages of the hardware-based and software-based solutions. In the proposed solution, a minimal hardware addition called the U-cache guarantees correctness, while a simple software optimization in the form of page alignment improves performance. In this paper, we also report the results of a performance evaluation of the proposed scheme, based on trace-driven simulations using ATUM traces. The results show that a U-cache with only a few entries performs almost as well as (and in some cases outperforms) a fully-configured R-tag cache when the alignment ratio is above 95%.

Acknowledgements

The authors would like to thank Minsuk Lee, Seong Baeg Kim, and Taejin Kim for their very helpful comments on an earlier version of this paper.

References

[1] T. Asprey et al. Performance features of the PA7100 microprocessor. IEEE Micro, 13(3):22-35, June 1993.
[2] MIPS Computer Systems. MIPS R4000 microprocessor user's manual. Integrated Device Technology, 1991.
[3] C. E. Wu, Y. Hsu, and Y.-H. Liu. A quantitative evaluation of cache types for high-performance computer systems. IEEE Transactions on Computers, 42(10), Oct. 1993.
[4] A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, Sept. 1982.
[5] V. Knapp and J.-L. Baer. Virtually addressed caches for multiprogramming and multiprocessing environments. In Proceedings of the 18th Annual Hawaii International Conference on System Sciences, pages 477-486, 1985.
[6] W.-H. Wang, J.-L. Baer, and H. M. Levy. Organization and performance of a two-level virtual-real cache hierarchy. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 140-148, 1989.
[7] J. R. Goodman. Coherency for multiprocessor virtual address caches. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 72-81, 1987.
[8] C. Chao, M. Mackey, and B. Sears. Mach on a virtually addressed cache architecture. In Proceedings of the First Mach USENIX Workshop, pages 31-51, 1990.
[9] R. Cheng. Virtual address cache in Unix. In Proceedings of the 1987 Summer USENIX Conference, pages 217-224, 1987.
[10] D. R. Cheriton, G. A. Slavenburg, and P. D. Boyle. Software-controlled caches in the VMP multiprocessor. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 366-383, June 1986.
[11] B. Wheeler and B. N. Bershad. Consistency management for virtually indexed caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 124-136, 1992.
[12] M. D. Hill et al. Design decisions in SPUR. Computer, 19(11):8-22, Nov. 1986.
[13] S. G. Tucker. The IBM 3090 system: An overview. IBM Systems Journal, 25(1):4-19, Jan. 1986.
[14] S. L. Min, J. Kim, C. S. Kim, H. Shin, and D.-K. Jeong. V-P cache: A storage efficient virtual cache organization. Microprocessors and Microsystems, 17(9), Nov. 1993.
[15] M. Cekleov, M. Dubois, J.-C. Wang, and F. A. Briggs. Virtual-address caches. Technical Report CENG 90-18, University of Southern California, 1990.
[16] G. Taylor, P. Davies, and M. Farmwald. The TLB slice: A low-cost high-speed address translation mechanism. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 335-363, 1990.
[17] R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4), Nov. 1992.
[18] B. K. Bray, W. L. Lynch, and M. J. Flynn. Page allocation to reduce access time of physical caches. Technical Report CSL-TR-90-454, Computer Systems Laboratory, Stanford University, 1990.
[19] T.-C. Chiueh and R. H. Katz. Eliminating the address translation bottleneck for physical address cache. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 137-148, 1992.
[20] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hierarchies. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 73-80, 1988.
[21] A. Agarwal, R. L. Sites, and M. Horowitz. ATUM: A new technique for capturing address traces using microcode. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 119-127, 1986.