U-cache: A Cost-effective Solution to the Synonym Problem

In Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 243-252, 1995.

Jesung Kim, Sang Lyul Min, Deog-Kyoon Jeong‡, Sanghoon Jeon†, Byoungchul Ahn†, Chong Sang Kim

Dept. of Computer Engineering, Seoul National University, Korea
‡ Dept. of Electronic Engineering, Seoul National University, Korea
† Dept. of Computer Engineering, Yeungnam University, Korea

Email: [email protected]
Abstract
This paper proposes a cost-effective solution to the synonym problem. In the proposed solution, a minimal hardware addition guarantees correctness while a software counterpart helps improve performance. The key to the proposed solution is the addition of a small physically-indexed cache called the U-cache. The U-cache maintains the reverse translation information of only those cache blocks that belong to unaligned virtual pages, where aligned means that the lower bits of the virtual page number match those of the corresponding physical page number. A U-cache, even with only one entry, ensures correct handling of synonyms. A simple software optimization, in the form of page alignment, helps improve the performance. Performance evaluation based on ATUM traces shows that a U-cache with only a few entries performs almost as well as (and in some cases outperforms) a fully-configured hardware-based solution when more than 95% of the pages are aligned.
1 Introduction
Recently, virtual caches have become increasingly important due to the emergence of high-speed processors[1, 2, 3]. In virtual caches, cache access and address translation are performed in parallel, thus reducing cache access time. Physical caches, in contrast, require that address translation be performed before the cache is accessed, which in many cases slows down cache access.
This research was supported by Korea Research Foundation grant 01-E-0201.
Although virtual caches have a speed advantage over physical caches, they suffer from an internal consistency problem. In general, virtual memory systems allow several virtual addresses to be mapped to the same physical address. This may lead to a situation in which the virtual cache holds more than one copy of the same physical memory block, giving rise to the synonym problem[4]. So far, proposed solutions to the synonym problem have been either hardware-based[5, 6, 7] or software-based[8, 9, 10, 11]. The hardware-based solutions have the advantage that they are transparent to the software. However, they require excessive hardware to maintain the reverse translation information needed to detect synonyms in the cache. Software-based solutions, on the other hand, do not require any additional hardware. They are, however, complicated to implement and have been known to degrade overall performance.
This paper proposes a solution to the synonym problem that combines the advantages of the hardware-based and software-based approaches. The key to the proposed solution is the addition of a small physically-indexed cache called the U-cache. The U-cache maintains the reverse translation information of only those cache blocks that belong to unaligned virtual pages (i.e., virtual pages whose lower virtual page number bits do not match those of the corresponding physical page number). This is in contrast to other hardware-based approaches, which require that the reverse translation information of all the blocks in the virtual cache be maintained. A U-cache, even with only one entry, ensures correct handling of synonyms. Software optimization in the form of page alignment helps improve the performance of the virtual cache.
The overall structure of this paper is as follows: Section 2 reviews previous approaches to the synonym problem. Section 3 gives a complete description of our approach. Section 4 presents a quantitative evaluation of our approach. Section 5 provides some concluding remarks.
2 Previous approaches to the synonym problem
Caches are high-speed buffers that store parts of main memory[4]. Until recently, caches have generally been accessed by physical addresses. Therefore, in computer systems that support virtual memory, processor-generated virtual addresses have to be translated to physical addresses before the cache is accessed. This address translation is normally performed using a TLB, and in many cases the TLB access increases the cache access time. In virtual caches, cache blocks are selected by virtual addresses rather than physical addresses. Therefore, they do not suffer from delays caused by TLB access and hence have a speed advantage over physical caches.
However, virtual caches suffer from a consistency problem called the synonym problem. This problem arises when the virtual cache holds more than one copy of the same memory block that can be accessed through two or more virtual addresses. For example, when virtual addresses A and B are mapped to the same physical address P, two copies of P may reside in the virtual cache at the same time (cf. Figure 1). A subsequent write access to one of these copies will make the other copy stale. If this stale copy is allowed to be accessed, the correctness of the execution will be violated.
There have been both hardware-based and software-based approaches to the synonym problem. In software-based techniques, no additional hardware is required to solve the synonym problem. The simplest of the software-based techniques is to disallow synonyms altogether[12]. This solution is simple but unnecessarily complicates operating system implementation. Another software-based technique with less restriction is to require that all synonyms be aligned[9]. (When two synonyms are mapped to the same cache block, they are said to be aligned.) In this technique, when one of the aligned synonyms is loaded into the cache, its synonym, if previously present, is displaced from the cache, and the synonym problem is thus prevented. The main problem with this approach is that the requirement that all synonyms be aligned is still too restrictive.
Figure 1: Synonym problem
Assume the virtual address space is 16 Mbytes and the physical address space is 1 Mbyte. Also assume that both of the virtual addresses 019000 and 02F000 are mapped to the same physical address 4A000. If we assume that the virtual cache size is 64 Kbytes, the block size is 16 bytes, and the set associativity is 1, then virtual addresses 019000 and 02F000 are indexed to cache blocks 900 and F00, respectively. The processor first reads the data at virtual address 019000, which loads cache block 900 with the contents at physical address 4A000. The processor then reads the same data again, this time using virtual address 02F000. Although the requested item is in the cache, the read reference results in a miss since the cache is indexed by virtual addresses. If no special handling is done, two copies of the contents of physical address 4A000 will then exist within the virtual cache. In such a situation, if a write is performed to cache block 900 through virtual address 019000, cache block F00 will continue to carry the old (i.e., incorrect) value. After this, reads from cache block F00 through virtual address 02F000 will not get the correct value, and thus correct execution will be violated.
In real systems, synonyms that are not aligned, called unaligned synonyms, do occur[11], and disallowing them entirely would place an undue restriction on operating system designers and, in some cases, on application programmers.
Some software-based solutions allow unaligned synonyms as well as aligned ones[8, 10]. They usually make use of a mechanism to trap on an event that may cause inconsistencies in the cache. In the solution given in [8], more than one virtual-to-physical mapping (i.e., synonyms) is allowed if the mappings are read-only. When the processor writes to one of the synonyms in the cache, the write is trapped by the virtual memory protection mechanism. As a result, the previous read-only mappings are broken and a unique write mapping is established. At the same time, the cache blocks associated with the broken mappings are purged from the cache. On a later read by a virtual address different from the one used by the last write, the unique write mapping is broken and multiple read-only mappings are allowed again. When the write mapping is broken, the cache blocks associated with it are written back to main memory so that later reads through other mappings see the up-to-date copy in main memory. The purges and write-backs required in this approach degrade overall cache performance. To reduce this performance degradation, Wheeler and Bershad proposed and implemented various optimizations based on an elegant model for virtual cache management[11]. In general, however, software-based techniques require substantial operating system modifications, which makes porting an operating system time-consuming[8].
The hardware-based solutions do not require any software intervention to solve the synonym problem. They generally make use of reverse translation information to detect synonyms in the cache and, when a synonym is detected, to remove it from the cache.¹ For example, a physically-indexed cache, called the real tag (R-tag) cache, is used to perform the necessary reverse translation in the hardware-based solution proposed by Goodman[7]. Likewise, in the V(irtual)-R(eal) cache approach[6], which combines a first-level virtual cache with a second-level physical cache, the reverse translation information is recorded in the tag field of the second-level physical cache. Since the two solutions are conceptually similar in their handling of the synonym problem, we describe only the R-tag cache approach in the following.

¹ One of the exceptions is the IBM 3090 system[13]. This system has a four-way set-associative cache. Since the cache size (64 Kbytes) exceeds the page size (4 Kbytes) multiplied by the set associativity (4), virtual indexing is used to access the cache in parallel with the TLB access. To solve the synonym problem caused by the virtual indexing, all the blocks that may contain the synonym (in this case 16 blocks from 4 sets) are checked before loading a block. This can be performed in one cycle by accessing and comparing the 16 tags with the physical address of the requested block. This technique cannot be applied to virtually tagged caches since the virtual tags of the synonyms differ from each other.
As previously mentioned, the R-tag cache is accessed by physical addresses and maintains the reverse translation information of the blocks in the virtual cache. In this scheme, all the valid blocks in the virtual cache have corresponding entries in the R-tag cache. The entries in the R-tag cache are pointers that associate the blocks in the virtual cache with their physical addresses. In this approach, the synonym problem is handled as follows (cf. Figure 2):

- If an access is a hit, the requested operation (read or write) is performed immediately.

- Otherwise (i.e., if the access is a miss), the physical address obtained through TLB access is used to access the R-tag cache.

  - If the access to the R-tag cache is a hit, a synonym exists in the cache. The cache block that the R-tag cache entry points to is moved to the cache block determined by the virtual address provided by the processor. The R-tag cache pointer is then updated to point to this new block.

  - If the access to the R-tag cache is a miss, no synonym exists in the cache. (Remember that all the valid blocks in the virtual cache have corresponding entries in the R-tag cache.) Thus, the memory block retrieved from main memory is placed in the cache block determined by the virtual address of the current request. In addition, an entry is allocated in the R-tag cache and made to point to the new block.

- The requested operation is then performed.
In this scheme, the reverse translation information maintained in the R-tag cache offers an indirect method for accessing the blocks in the virtual cache using physical addresses. That is, before a new block is placed in the cache, the virtual cache is indirectly accessed using the physical address, thus ensuring that the block about to be loaded is unique in the virtual cache. A sketch of this miss-handling algorithm is given below.
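The following Python sketch models this algorithm for a direct-mapped virtual cache backed by a fully-associative R-tag store. It is a minimal behavioral reading of the scheme, not the paper's hardware: it handles read accesses only (write-back bookkeeping is omitted), and the class and field names are our own.

    BLOCK = 16                      # block size in bytes (as in Figure 1)
    CACHE_SIZE = 64 * 1024          # virtual cache size
    NUM_BLOCKS = CACHE_SIZE // BLOCK

    class RTagCacheModel:
        def __init__(self):
            self.vtag = [None] * NUM_BLOCKS       # virtual tags
            self.data = [None] * NUM_BLOCKS       # cached block contents
            self.rtag = {}                        # physical block -> cache index
            self.pblock_at = [None] * NUM_BLOCKS  # cache index -> physical block

        def access(self, vaddr, paddr, memory):
            index = (vaddr // BLOCK) % NUM_BLOCKS
            tag = vaddr // CACHE_SIZE
            if self.vtag[index] == tag:
                return self.data[index]           # hit: performed immediately
            # miss: the occupant of 'index' is evicted, so drop its pointer
            if self.pblock_at[index] is not None:
                del self.rtag[self.pblock_at[index]]
                self.pblock_at[index] = None
            pblock = paddr // BLOCK
            if pblock in self.rtag:               # R-tag hit: synonym in cache
                old = self.rtag[pblock]
                self.data[index] = self.data[old] # move the synonym's block
                self.vtag[old] = None             # invalidate the old copy
                self.pblock_at[old] = None
            else:                                 # R-tag miss: no synonym
                self.data[index] = memory[pblock]
            self.vtag[index] = tag                # update forward and reverse
            self.rtag[pblock] = index             # translation information
            self.pblock_at[index] = pblock
            return self.data[index]

Because the rtag map here is a plain dictionary, it behaves like a fully-associative R-tag cache; the paired eviction discussed next appears only when the R-tag cache is given limited associativity.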
Since each block in the virtual cache requires an R-tag cache entry in this approach, a fully-associative R-tag cache with the same number of entries as the virtual cache is required to keep the reverse translation information of all the blocks in the cache.
Figure 2: R-tag cache approach
As in Figure 1, the synonyms with virtual addresses 019000 and 02F000 are both mapped to the same physical address 4A000. Assume that the cache configuration and the scenario are the same as those given in Figure 1. The figure shows the steps taken when the processor tries to read the data at virtual address 02F000 and misses. To service this miss, the R-tag cache is accessed with physical address 4A000, obtained through TLB access. With the synonym information provided by the R-tag cache, cache block 900 is invalidated and its contents are copied to cache block F00. Thus, the existence of two identical copies within the virtual cache is prevented.
However, in many cases, a set-associative or direct-mapped cache is advocated for the R-tag cache for implementation reasons. If a set-associative or direct-mapped R-tag cache is used, a situation arises in which an R-tag cache entry cannot be allocated due to conflicts in the R-tag cache. In such a case, an R-tag cache entry that points to some other virtual cache block has to be selected and displaced (cf. Figure 3). To prevent the synonym problem, the cache block that this displaced pointer previously pointed to must also be displaced. Thus, in this case, to load a single block, two blocks have to be displaced: one to make room in the virtual cache and the other the block pointed to by the displaced R-tag cache entry. This phenomenon is called paired eviction[14]. Paired eviction degrades cache storage utilization and results in a higher cache miss ratio.
In [15], Cekleov et al. quantitatively assessed the performance impact of the set associativity of the R-tag cache and proposed a scheme to avoid paired eviction. In the proposed scheme, the R-tag cache is virtually indexed although it contains physical tags.
Figure 3: Example of paired eviction
To load the block for virtual address 019000, not only must the block tagged 03 be replaced, but the block tagged 0B must also be displaced in order for the new pointer to be stored in the R-tag cache. If the block tagged 0B is not displaced, then when this block's synonym is later loaded into the cache, the block's existence cannot be detected, resulting in two synonyms existing in the cache.
Since both the virtual cache and the R-tag cache are indexed by virtual addresses, paired eviction is avoided. However, this scheme requires that every synonym be aligned. Furthermore, the implementation is somewhat complicated because both the virtual address and the physical address are needed to access the R-tag cache. We propose a simpler and more cost-effective solution to the paired eviction and synonym problems in the following section.
3 U-cache approach to the synonym problem
As mentioned in the previous section, paired eviction occurs due to limited R-tag cache set associativity. However, if the R-tag cache entry for the block being loaded is always the same as that for the block being replaced, then paired eviction will not occur regardless of the R-tag cache set associativity. This situation arises when the lower log2(c/(p·a_r)) bits of the physical page number of the block being loaded match those of the block being replaced (c: virtual cache size, p: page size, a_r: R-tag cache set associativity). For example, assume that the cache size is 64 Kbytes, the page size is 4 Kbytes, and the R-tag cache is direct-mapped (i.e., has a set associativity of 1). In this case, if the lower 4 bits of the physical page number of the block being loaded match those of the block being replaced, paired eviction will not occur.
Formally, we can define a relation R_V on the set V of all virtual addresses:

    R_V = { <x, y> | x and y have the same lower log2(c/(p·a)) bits of their virtual page numbers }

(c: virtual cache size, p: page size, a: virtual cache set associativity)

Since the relation R_V is an equivalence relation (i.e., it is reflexive, symmetric, and transitive), it induces a unique partition of the virtual address space into equivalence classes. We denote the set of equivalence classes induced by R_V by V/R_V.

Assuming that the number of entries and the set associativity of the R-tag cache are the same as those of the virtual cache, we can define a similar relation R_P on the physical address space P:

    R_P = { <x, y> | x and y have the same lower log2(c/(p·a)) bits of their physical page numbers }

Like R_V, R_P partitions the physical address space into equivalence classes. The set of these equivalence classes is denoted by P/R_P.
With these definitions, the aforementioned condition for the avoidance of paired eviction can be restated as follows:

    If the virtual addresses of the block being loaded and the block being replaced are from the same equivalence class of V and their physical addresses are also from the same equivalence class of P, then paired eviction cannot occur.²

Using this observation, paired eviction can be prevented by relating each equivalence class of V to a unique equivalence class of P. Such an arrangement can be made by defining a one-to-one and onto function from V/R_V to P/R_P. There are (c/(p·a))! such functions; one of them is given in Figure 4-a. Among the possible functions, the simplest is the identity function shown in Figure 4-b. This arrangement can easily be made by matching the lower log2(c/(p·a)) bits of the physical page number with those of the virtual page number when allocating a physical page for a virtual page, which is much simpler than aligning all the synonyms (a sketch of such an allocation policy is given after Figure 4).
² In general, paired eviction will not occur if the lower min(log2(c_v/(p·a_v)), log2(c_r/(p·a_r))) bits of the virtual page number correspond to those of the physical page number (c_v: virtual cache size, p: page size, a_v: virtual cache set associativity, c_r: the number of entries in the R-tag cache times the virtual cache block size, a_r: R-tag cache set associativity).

Figure 4: Examples of functions from V/R_V to P/R_P
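Under the identity mapping, page allocation reduces to a simple preference rule in the physical page allocator. The following Python sketch illustrates one way this could look; it is our illustration under stated assumptions (the free-frame list, the parameter names, and the helper alignment_bits are all hypothetical), not the paper's allocator.

    import math

    def alignment_bits(cache_size, page_size, assoc):
        # number of low page-number bits that must match for v-p alignment:
        # log2(c / (p * a))
        return int(math.log2(cache_size // (page_size * assoc)))

    def allocate_frame(vpn, free_frames, bits):
        # Prefer a frame whose lower 'bits' bits match the virtual page
        # number (a v-p aligned frame). Alignment is only a performance
        # hint, so fall back to any free frame when none matches.
        mask = (1 << bits) - 1
        for pfn in free_frames:
            if (pfn & mask) == (vpn & mask):
                free_frames.remove(pfn)
                return pfn
        return free_frames.pop()

For the running example (a 64 Kbyte direct-mapped cache with 4 Kbyte pages), alignment_bits(64*1024, 4096, 1) yields 4, so the allocator tries to match the lower 4 bits of the page numbers.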
This page alignment technique is by no means new. The technique has been used in other contexts and is sometimes referred to as page coloring[16]. For example, Kessler and Hill used this technique to reduce the misses that occur when frequently used virtual pages are mapped to the same cache region[17]. As another example, Taylor et al. used the page alignment idea to reduce the miss ratio of a simplified version of a TLB called the TLB slice[16]. In their case, the technique was used to improve the accuracy of the hint provided by the TLB slice. Lastly, the page alignment technique has been applied to physical caches in order to allow cache access and address translation to occur in parallel[18, 19]. When the lower bits of the virtual page number are made to correspond to those of the physical page number, the cache set index can be derived directly from the virtual address; thus, the cache access and the TLB access can be performed in parallel. However, unlike the technique we describe later in this section, this technique requires that all the pages be aligned for correct operation.
To avoid possible confusion with aligned synonyms, we use the term v-p aligned to denote the situation in which the lower log2(c/(p·a)) bits of the virtual page number and those of the physical page number correspond. If all the pages are v-p aligned, then the pointers in the R-tag cache would, as shown in Figure 5, point to blocks in the virtual cache with the same cache set index. Thus, regardless of the set associativity of the R-tag cache, paired eviction will not occur.
However, strict enforcement of v-p alignment reduces the set associativity of the physical memory.
Figure 5: Prevention of paired eviction by page alignment
For example, in a 64 Kbyte direct-mapped virtual cache with a page size of 4 Kbytes, v-p alignment demands that the lower 4 bits (log2(64K/(4K·1))) of the virtual and physical page numbers match; therefore, the set associativity of the physical memory is reduced by a factor of 16, and a 1 Mbyte physical memory becomes 16-way (1M/(4K·16)) set-associative. However, if the physical memory is sufficiently large, this has little effect on performance[19]. Furthermore, since in our approach v-p alignment is not a requirement for correctness but a hint for better performance, as we will see later, the operating system is free to choose a v-p unaligned physical page if a v-p aligned physical page is not available.
A careful inspection of Figure 5 reveals that the R-tag cache pointers for cache blocks from v-p aligned pages are just redundant information when the cache is direct-mapped. That is, the R-tag cache pointers for such cache blocks can be derived directly from their physical addresses. Hence, if all the virtual pages are v-p aligned, then the pointer field of the R-tag cache and, therefore, the R-tag cache itself can be eliminated in uniprocessors, where a duplicate tag memory such as the R-tag cache is not required. (Such duplicate tag memory is needed in bus-based multiprocessors for bus snooping purposes.) To handle the case where a virtual page is not v-p aligned, our approach provides a small R-tag cache, called the U-cache, that maintains the pointers for cache blocks from v-p unaligned pages only (cf. Figure 6). This is in contrast to other hardware-based approaches, which maintain the reverse translation information of all the blocks in the virtual cache.
Figure 6: U-cache organization

With the U-cache, the overall caching algorithm is as follows:
- When a cache access is a hit, the requested operation (read or write) is performed immediately and no further processing is required, as in the R-tag cache approach.

- When the cache access is a miss, the U-cache is accessed while the missed block is being fetched from main memory. Depending on the hit or miss of this U-cache access, the following steps are carried out. Since these steps are performed only on a cache miss and are carried out while the missed block is being fetched from main memory, their impact on cache performance is minimal.

  - A U-cache hit implies that a (v-p unaligned) synonym exists in the virtual cache. The block indicated by the pointer is moved to the cache block selected through the virtual address. If the current cache access is v-p aligned,³ there is no need for a U-cache pointer, so the U-cache entry that previously pointed to the synonym block is deallocated. Otherwise (i.e., if the current request is v-p unaligned), the U-cache pointer is updated to point to the new block.

  - A U-cache miss implies that there is no v-p unaligned synonym in the cache; in other words, either there is no synonym at all or the synonym in the cache is v-p aligned. If the current access is v-p aligned, the synonym problem cannot arise, because the missed block will displace its synonym from the cache provided that the cache is direct-mapped. Otherwise (i.e., if the current request is v-p unaligned), a v-p aligned synonym may reside in the cache, and this may cause the synonym problem. Thus, in our scheme, the cache block that may contain the v-p aligned synonym is checked and invalidated. The location of such a cache block is derived from the physical address of the current access. (Remember that in this case only a v-p aligned synonym is possible in the cache and, hence, the synonym has the same cache index bits as those obtained from the physical address of the current request.) Since the current access is v-p unaligned, a new entry is allocated in the U-cache and made to point to the new block. In this case, a paired eviction may occur; however, we can minimize paired evictions by v-p aligning as many pages as possible.

- The requested operation (read or write) is then performed.

³ The v-p alignment of a cache access can easily be checked by comparing the lower bits of the virtual page number with those of the physical page number obtained through TLB access.
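The following Python sketch puts these steps together for a direct-mapped, virtually tagged cache with a small fully-associative U-cache. It is a behavioral reading of the algorithm above under simplifying assumptions: read accesses only, conservative invalidation, an arbitrary U-cache victim on overflow, and illustrative names throughout.

    BLOCK = 32                            # block size in bytes
    CACHE_SIZE = 16 * 1024                # virtual cache size
    NUM_BLOCKS = CACHE_SIZE // BLOCK
    PAGE = 4096                           # page size
    ALIGN_MASK = CACHE_SIZE // PAGE - 1   # low page-number bits that must match

    class UCacheModel:
        def __init__(self, ucache_entries=8):
            self.vtag = [None] * NUM_BLOCKS
            self.data = [None] * NUM_BLOCKS
            self.ucache = {}              # physical block -> cache index
            self.capacity = ucache_entries

        def vp_aligned(self, vaddr, paddr):
            # footnote 3: compare the low bits of the two page numbers
            return ((vaddr // PAGE) & ALIGN_MASK) == ((paddr // PAGE) & ALIGN_MASK)

        def access(self, vaddr, paddr, memory):
            index = (vaddr // BLOCK) % NUM_BLOCKS
            tag = vaddr // CACHE_SIZE
            if self.vtag[index] == tag:
                return self.data[index]               # hit: nothing else to do
            # miss: the occupant of 'index' is evicted; drop any stale pointer
            self.ucache = {pb: ix for pb, ix in self.ucache.items() if ix != index}
            pblock = paddr // BLOCK
            aligned = self.vp_aligned(vaddr, paddr)
            if pblock in self.ucache:                 # U-cache hit
                old = self.ucache.pop(pblock)
                self.data[index] = self.data[old]     # move the unaligned synonym
                self.vtag[old] = None
            else:                                     # U-cache miss
                if not aligned:
                    # a v-p aligned synonym may sit at the index derived from
                    # the physical address; invalidate it (conservatively)
                    pindex = (paddr // BLOCK) % NUM_BLOCKS
                    self.vtag[pindex] = None
                    self.ucache = {pb: ix for pb, ix in self.ucache.items()
                                   if ix != pindex}
                self.data[index] = memory[pblock]
            if not aligned:                           # track unaligned blocks only
                if len(self.ucache) >= self.capacity:
                    pb, ix = self.ucache.popitem()    # paired eviction: displace
                    self.vtag[ix] = None              # the pointed-to block too
                self.ucache[pblock] = index
            self.vtag[index] = tag
            return self.data[index]

Note that paired eviction can now occur only on the v-p unaligned path, which is why v-p aligning as many pages as possible keeps even a tiny U-cache effective.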
The U-cache approach has the advantage over the R-tag cache approach of requiring only minimal hardware additions. A U-cache, even with only one entry, ensures correct handling of synonyms. Software optimization in the form of page alignment helps improve cache performance by minimizing the paired evictions caused by the limited number of U-cache entries.
U-cache for set-associative virtual caches
In a set-associative virtual cache, both cache hits and cache misses that hit in the U-cache can be processed in the same manner as in the direct-mapped case. However, the processing of a cache miss combined with a U-cache miss is more complicated, and an efficient solution is possible only when the virtual cache is physically tagged. In the following, we describe the steps taken to handle a cache miss with a U-cache miss, assuming a physically tagged virtual cache.
As before, a U-cache miss implies that only a v-p aligned synonym can be in the cache. If the current access is v-p aligned, no synonym is possible in the cache, since otherwise the access must have been a hit. On the other hand, if the current request is v-p unaligned, a v-p aligned synonym may reside in the cache. In a physically tagged virtual cache, the existence of a v-p aligned synonym can be detected by accessing the cache using the physical address of the current request. If this access is a hit, there is a v-p aligned synonym, and its contents should be used to service the current cache miss. Otherwise, there is no synonym in the cache and the missed block is fetched from main memory.
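For the v-p unaligned case, the check amounts to probing the set selected by the physical address and comparing physical tags. The following is a self-contained sketch of that probe; the array layout and names are our own assumptions.

    def find_vp_aligned_synonym(ptags, paddr, num_sets, assoc, block=32):
        # ptags is a num_sets x assoc table of physical tags kept alongside
        # the virtual cache; returns the way holding a v-p aligned synonym,
        # or None if no synonym is present.
        set_index = (paddr // block) % num_sets
        ptag = paddr // (block * num_sets)
        for way in range(assoc):
            if ptags[set_index][way] == ptag:
                return way
        return None

If a way is returned, its contents service the current miss; otherwise the missed block is fetched from main memory.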
U-cache in multiprocessor systems
In many current high-performance computer systems, a two-level cache hierarchy is commonly used in which the first-level cache is optimized for speed while the second-level cache is optimized for a high hit ratio. In multiprocessor systems, the two-level cache hierarchy has the added advantage of filtering unnecessary cache coherence transactions, provided that the multi-level inclusion (MLI) property[20] is maintained. Our U-cache scheme can be extended to multiprocessor systems by introducing a second-level cache that preserves the MLI property, as in the V-R cache approach[6]. In this two-level cache hierarchy with a U-cache, the second-level cache resolves only the inter-cache coherence problem (i.e., the multiprocessor cache coherence problem), while the U-cache resolves the intra-cache coherence problem (i.e., the synonym problem). This substantially simplifies the interface between the first-level and second-level caches.
4 Performance evaluation
This section analyzes the performance of the proposed U-cache approach using trace-driven simulation. In the trace-driven simulation, we used ATUM[21] virtual address traces. The traces contain kernel-mode references as well as user-mode references. Thirteen ATUM traces were concatenated to form a single trace of 4,806,634 references, which was then fed into the simulator. During simulation, the cache was flushed every 100,000 references to simulate multiprogramming effects. Unless otherwise stated, the cache block size and page size are 32 bytes and 4 Kbytes, respectively, and all the caches are direct-mapped.
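As a point of reference, the simulation driver can be reduced to the following sketch. The trace format and the model interface (an access() method returning True on a hit, plus a flush() method) are assumptions of this illustration, not a description of the actual simulator.

    FLUSH_INTERVAL = 100_000      # flush period used to model multiprogramming

    def run_trace(cache, trace):
        # trace: iterable of (virtual address, physical address) pairs
        misses = refs = 0
        for vaddr, paddr in trace:
            refs += 1
            if refs % FLUSH_INTERVAL == 0:
                cache.flush()              # simulate multiprogramming effects
            if not cache.access(vaddr, paddr):
                misses += 1
        return misses / refs               # overall miss ratio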
Figure 7 shows the miss ratios of a 16 Kbyte virtual data cache for various page alignment ratios. Both direct-mapped and two-way set-associative R-tag caches were simulated. In the figure, the leftmost point is the result of the simulation where physical pages are randomly allocated without any page alignment. In this case, the page alignment ratio is about 25%, since 1/4 of the pages will be v-p aligned by chance when the cache size is 4 times the page size.
Figure 7: Effects of page alignment on miss ratio (R-tag cache)

Figure 8: Page alignment effects on miss ratio over a range of cache sizes (R-tag cache)
The results show that the associativity of the R-tag cache has an immense effect on the miss ratio. However, its performance impact diminishes as the page alignment ratio increases. The rightmost point in the figure corresponds to the case where all the pages have been v-p aligned. As mentioned in Section 3, paired eviction does not occur in this case and, therefore, the miss ratio is the same regardless of the set associativity of the R-tag cache. From this result, it can be seen that the miss ratio can be improved by up to 22% through page alignment when the R-tag cache is direct-mapped. It can also be noted that when more than 95% of the pages have been v-p aligned, the direct-mapped R-tag cache yields a lower miss ratio than a two-way set-associative R-tag cache without page alignment.
Figure 8 shows the miss ratios of virtual data caches whose sizes range from 16 Kbytes to 64 Kbytes. The figure shows that the miss ratios of caches with page alignment are about the same as those of caches of twice the size without page alignment. For example, when more than 95% of the pages are v-p aligned, a 16 Kbyte cache yields performance comparable to that of a 32 Kbyte cache without page alignment. This means that only about half of the total cache blocks are utilized due to paired eviction when the pages are not v-p aligned. This result is consistent with the analysis by Goodman[7].⁴
Figure 9 depicts the miss ratio of a 16 Kbyte virtual cache when a U-cache is used. In the figure, the results for direct-mapped U-caches with 8, 32 and 128 entries are shown. The performance of the U-cache is compared against that of an R-tag cache with a full complement of entries (in this case, 512 entries). Since the U-cache maintains pointers only to cache blocks from v-p unaligned pages, its performance depends greatly on the page alignment ratio. In the case where the page alignment technique is not used, the miss ratio of the U-cache approach is much worse than that of the R-tag cache approach. This is because the U-cache limits the number of cache blocks from v-p unaligned pages. For example, if the number of U-cache entries is 8, then at most 8 cache blocks can contain memory blocks from v-p unaligned pages. Therefore, if more than 8 frequently used blocks are from v-p unaligned pages, the performance degradation will be significant. However, as the page alignment ratio increases, the miss ratio of the U-cache approach drops drastically. When the page alignment ratio is above 90%, the performance of the U-cache approach is comparable to that of the R-tag cache approach. One interesting point to note is that when the page alignment ratio reaches about 95% (70% in the case where the number of U-cache entries is 128), the U-cache slightly outperforms the R-tag cache. This is because the R-tag cache has to maintain pointers to all the blocks in the cache, whereas the U-cache maintains pointers only to blocks from v-p unaligned pages.
Figure 10 depicts the miss ratios of virtual caches whose sizes range from 16 Kbytes to 64 Kbytes when 8, 32 and 128-entry U-caches are used. In the figure, the performance of the U-cache approach, which requires only a fixed number of entries regardless of the virtual cache size, is shown to be comparable to that of the R-tag cache approach, whose size has to increase in proportion to the virtual cache size.

⁴ Goodman's analytical study states that only 1 − 1/(2a_r) of the total blocks in the virtual cache are utilized in the R-tag cache approach, where a_r is the set associativity of the R-tag cache.
Figure 9: Effects of page alignment on miss ratio (U-cache)

Figure 10: Miss ratios over a range of virtual cache sizes and U-cache sizes (when 95% of the pages are aligned)
On the whole, the results indicate that our proposed U-cache approach not only costs significantly less than the R-tag cache approach but also performs as well as it (and in some cases outperforms it) when the page alignment ratio is sufficiently high.
5 Conclusion

With the recent emergence of high-speed processors, virtual caches have become more important due to their speed advantage over physical caches. However, in virtual caches, the synonym problem can occur when several virtual addresses are allowed to be mapped to the same physical address. Both hardware-based and software-based solutions to this problem have been proposed and/or implemented. The hardware-based solutions have the advantage of ensuring correct handling of synonyms without any software help; however, they require additional hardware, which can prove expensive. Software-based solutions do not require any additional hardware, but their implementation is complicated and they have been known to degrade cache performance.

This paper proposes a solution to the synonym problem that combines the advantages of hardware-based and software-based solutions. In the proposed solution, a minimal hardware addition called the U-cache guarantees correctness, whereas a simple software optimization in the form of page alignment improves performance.

In this paper, we also report the results of a performance evaluation of the proposed scheme. The evaluation is based on trace-driven simulations using ATUM traces. The results show that a U-cache with only a few entries performs almost as well as (and in some cases outperforms) a fully-configured R-tag cache when the alignment ratio is above 95%.

Acknowledgements

The authors would like to thank Minsuk Lee, Seong Baeg Kim, and Taejin Kim for their very helpful comments on an earlier version of this paper.
References
[1] T. Asprey et al. Performance features of the PA7100 microprocessor. IEEE Micro, 13(3):22-35, June 1993.

[2] MIPS Computer Systems. MIPS R4000 Microprocessor User's Manual. Integrated Device Technology, 1991.

[3] C. E. Wu, Y. Hsu, and Y.-H. Liu. A quantitative evaluation of cache types for high-performance computer systems. IEEE Transactions on Computers, 42(10), Oct. 1993.

[4] A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, Sept. 1982.

[5] V. Knapp and J.-L. Baer. Virtually addressed caches for multiprogramming and multiprocessing environments. In Proceedings of the 18th Annual Hawaii International Conference on System Sciences, pages 477-486, 1985.

[6] W.-H. Wang, J.-L. Baer, and H. M. Levy. Organization and performance of a two-level virtual-real cache hierarchy. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 140-148, 1989.

[7] J. R. Goodman. Coherency for multiprocessor virtual address caches. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 72-81, 1987.

[8] C. Chao, M. Mackey, and B. Sears. Mach on a virtually addressed cache architecture. In Proceedings of the First Mach USENIX Workshop, pages 31-51, 1990.

[9] R. Cheng. Virtual address cache in Unix. In Proceedings of the 1987 Summer USENIX Conference, pages 217-224, 1987.

[10] D. R. Cheriton, G. A. Slavenburg, and P. D. Boyle. Software-controlled caches in the VMP multiprocessor. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 366-383, June 1986.

[11] B. Wheeler and B. N. Bershad. Consistency management for virtually indexed caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 124-136, 1992.

[12] M. D. Hill et al. Design decisions in SPUR. Computer, 19(11):8-22, Nov. 1986.

[13] S. G. Tucker. The IBM 3090 system: An overview. IBM Systems Journal, 25(1):4-19, Jan. 1986.

[14] S. L. Min, J. Kim, C. S. Kim, H. Shin, and D.-K. Jeong. V-P cache: A storage efficient virtual cache organization. Microprocessors and Microsystems, 17(9), Nov. 1993.

[15] M. Cekleov, M. Dubois, J.-C. Wang, and F. A. Briggs. Virtual-address caches. Technical Report CENG 90-18, University of Southern California, 1990.

[16] G. Taylor, P. Davies, and M. Farmwald. The TLB slice: A low-cost high-speed address translation mechanism. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 335-363, 1990.

[17] R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4), Nov. 1992.

[18] B. K. Bray, W. L. Lynch, and M. J. Flynn. Page allocation to reduce access time of physical caches. Technical Report CSL-TR-90-454, Computer Systems Laboratory, Stanford University, 1990.

[19] T.-C. Chiueh and R. H. Katz. Eliminating the address translation bottleneck for physical address caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 137-148, 1992.

[20] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hierarchies. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 73-80, 1988.

[21] A. Agarwal, R. L. Sites, and M. Horowitz. ATUM: A new technique for capturing address traces using microcode. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 119-127, 1986.