Flexible Virtual Segmentation for Large Memory Systems

Chang Hyun Park, Taekyung Heo, Seonyoung Lee, and Jaehyuk Huh
Computer Science Department, KAIST
{changhyunpark, tkheo, sylee, jhuh}@calab.kaist.ac.kr
Abstract—With ever growing physical memory capacity, traditional memory virtualization with page tables has been suffering from excessive TLB misses for large memory applications. To mitigate the increasing pressure on TLB capacity, superpaging has been adopted in commercial processors. However, superpaging still relies on the efficiency of the TLB, which is on the critical path of L1 cache accesses, and thus its scalability for larger memory is limited. As an alternative to superpaging, direct segment has been proposed to provide direct translation for a large chunk of contiguous memory with variable length. Exploiting the static memory allocation nature of certain server applications, direct segment can almost eliminate translation costs. However, direct segment restricts dynamic memory management by the operating system too severely to be used for more general applications, since the primary memory region of an application must be assigned to a single contiguous physical memory region, suffering from potential internal and external fragmentation.
This paper proposes flexible virtual segmentation, which provides the benefit of direct segment while supporting dynamic memory management with variable segment granularity. The primary memory region of each process is assigned to a contiguous unused intermediate physical region, and the intermediate physical region is mapped to multiple real system memory regions through second-level segmentation. Unlike prior approaches, the second-level segmentation is accessed only on LLC misses, and thus the number of segment entries can be increased to thousands with negligible performance degradation. With flexible virtual segmentation, the operating system can manage physical memory with many variable length chunks, which can reduce internal and external fragmentation. Our experimental results show that flexible virtual segmentation can provide performance close to pure segmentation, even with a large second-level segment table supporting many variable-length segments.
I. INTRODUCTION
With ever growing physical memory capacity and increasing
memory requirements for large memory applications, traditional memory virtualization has become a performance bottleneck for such large memory systems [10], [28]. Traditional
page-based memory management allows fine-grained mapping
between virtual and physical spaces to minimize internal and
external fragmentation, with fine-grained protection control.
However, such page-based memory virtualization poses a challenge in large memory systems, due to the limited efficiency of
translation lookaside buffers (TLBs). With increasing memory
requirements, the TLBs can no longer cover large working
sets, and generate excessive TLB misses. Furthermore, in the
traditional memory hierarchy, the address translation through
TLBs must complete before L1 cache tag matching. Since the
address translation is on the critical path of memory opera-
tions, it is difficult to increase the TLB capacity arbitrarily to
cover more memory pages.
To mitigate the performance bottleneck caused by address translation, a traditional technique is to increase the translation granularity with superpaging, and thus increase the TLB coverage under the TLB capacity limitation. However, current superpaging support is limited in order to maintain TLB access latencies. For example, the x86 architecture supports only 2MB and 1GB large pages due to its page table structure, and for 1GB pages, only a few TLB entries are provided [5]. Although other architectures support many different page sizes, their TLB organization is limited to fully-associative structures, restricting its scalability [17].
An alternative technique, direct segment, maps a primary
virtual memory region to a contiguous physical region with a
direct variable length segment for each process [10]. The primary region is a contiguous region of virtual memory eligible
for segmentation. Such direct segment almost eliminates the
cost of address translation for a primary region of process virtual memory space. The direct segment targets a large memory
server, where a long-running single process consumes most of
the memory capacity, and the primary memory of the process
is mostly allocated at the initial phase of the process execution.
However, despite this superior address translation capability,
the direct segment limits dynamic memory management by
the operating system severely, if the scope is expanded to
more general applications. Some applications request memory
dynamically throughout their execution. Furthermore, with
possible frequent dynamic memory allocation and deallocation
by many applications, the operating system may not be able
to provide a region of contiguous memory due to external
fragmentation without compacting allocated memory regions.
However, direct segment opens a new opportunity for efficient address translation with variable length segmentation. Such variable length mapping can potentially solve the scalability problem of address translation under ever-increasing memory capacity, since its size is not tied to a fixed-granularity mapping such as pages. However, a single direct segment is too limited to accommodate the dynamic memory usage
of general systems. Inspired by direct segment, this paper
proposes a new flexible virtual segmentation which provides
the benefit of direct segment, while still supporting dynamic
memory management by the operating system.
There are two challenges in memory virtualization to provide a flexible, efficient, and scalable memory management.
Fig. 1. Different address mapping schemes for paging, direct segment, and flexible virtual segmentation
First, variable length mapping can support scalability for memory virtualization as advocated by direct segment. However,
the system requires many such segments instead of a single
segment for a process. The address space of a process can be
decomposed into many chunks of variable length segments,
which can be allocated dynamically. Since the length of each
segment is variable, it can be arbitrarily increased or decreased
to scale to larger memory. Second, address translation must
be off the critical path. With complicated searches over many segments, address translation may take much longer than traditional fixed-size, page-based TLB lookups.
To address the two challenges, flexible virtual segmentation
exploits a large physical address space, which is much larger
than the actual system memory space. For example, x86
provides 48 bits for physical address, extendable to 52 bits,
but the system memory size is much smaller than the address
space. By exploiting the large unused physical address space,
the virtual segment uses a two-level segment-based translation.
The primary memory region of each process is assigned to a
contiguous chunk of intermediate physical address space by a
single direct segment. The per-process region of intermediate
physical space is mapped to many fragments of actual system
memory regions by flexible virtual segmentation. Since the
operating system does not need to provide a contiguous
system memory region, the internal and external fragmentation
problem of direct segment can be relieved.
However, adding so many segments to the current address translation step between the core and the L1 cache would severely increase latency on the critical path. To avoid the complex
address translation on the critical path of common cases,
the proposed architecture moves the second-level segment
translation from the intermediate physical space to the system
memory after the cache hierarchy and coherence layer across
caches. To support such delayed translation, all caches are
indexed and tagged by the intermediate physical address space,
and translation occurs only when the external memory needs
to be accessed. Such delayed translation is not new. Traditional
virtually-tagged caches postpone the address translation, and
the recently proposed Enigma architecture also moves the
critical address translation after cache accesses [35]. Figure 1
presents the different address mapping schemes for paging,
direct segment, and flexible virtual segmentation, showing the
two-level segment-based translation.
The new contributions of this paper are as follows. First, this
paper advocates memory mapping based on variable length
segments. However, unlike direct segment, this paper argues
that systems need many flexible segments to avoid the negative
impact of internal and external fragmentation. Second, to
support many variable-length segment translations, the paper
uses two-level segment-based translation, adopted from a
prior approach. The first-level segmentation uses a fast direct
segment, while the second-level segmentation supports flexible
virtual segmentation mapping. Finally, the paper investigates
architectural mechanisms to facilitate the segment searching
for many segment architectures.
The proposed flexible virtual segmentation still supports
direct segment, and in addition, it allows the flexibility of many
segments. The hardware cost is small: only about 56KB of extra space in the processor is required to support one thousand
variable length segments. Our experimental result shows that
the performance of the two-level segmentation is close to the
ideal pure direct segment.
The rest of this paper is organized as follows. Section 2
presents superpaging and direct segmentation to reduce translation costs, and other prior works for memory virtualization.
Section 3 presents the limitation of direct segment for general
applications. Section 4 describes the proposed flexible virtual
segmentation architecture, and Section 5 presents system issues for supporting the virtual segmentation. Section 6 presents
the performance improvements and Section 7 concludes the
paper.
II. BACKGROUND
Address translation has emerged as a performance bottleneck for large memory applications. Figure 2 shows the
portion of time spent executing, and the portion of time
walking the page table due to TLB misses. For some extreme
cases such as Graph500 and cg.D, the time spent walking
the page table is greater than the actual execution time of the
application.
There have been two different approaches to reduce the cost of address translation in memory virtualization. Superpaging increases the mapping granularity to improve the hit rates of TLBs with the same TLB capacity, while the recently proposed direct segment maps a variable length contiguous memory region directly between the virtual and physical spaces. This section discusses the two approaches and related work.

Fig. 2. Portion of execution time spent for TLB misses
A. Superpaging
Memory virtualization in most commercial systems relies on fixed page granularity mapping with page tables
between the virtual and physical address spaces. With such
a fixed granularity, the translation efficiency of TLBs can be
improved by increasing the mapping granularity to increase the
coverage of each TLB entry. Supporting multiple page sizes
has become common in commercial processors, and depending
on page table organizations, two ways of superpaging schemes
have been used.
Limited page types for multi-level page tables: For
systems with multi-level (tree-based) page tables, supported
page sizes are limited by the level of page tables. For example,
for the 64-bit x86 architecture with 4 levels of page tables, three page sizes are supported: 4KB, 2MB, and 1GB. Each page size corresponds to a page table level, allowing efficient page table lookups. However, when the number of supported page sizes is limited, the operating system may not be able to choose the right page size, since the difference between two adjacent page sizes is very large, causing internal fragmentation.
Supporting several different page sizes complicates the TLB
organization. TLBs are indexed and tagged by virtual page
numbers, but the virtual page number, which depends on the page size, is not known before translation. To support several page
sizes, the current commercial processors limit the TLB entries
for each type. The latest x86 processors partition the L1
TLB into three different granularities: 4KB, 2MB, and 1GB.
Each page size has 64, 32, and 4 entries, respectively. By
partitioning the L1 TLB into three structures, a virtual address
can search the three different granular pages in parallel. The
L2 TLB has 1024 entries, and supports both 2MB and 4KB
pages. TLB entries for 1GB pages are not stored in the L2
TLB and go straight to the L1 TLB of 4 entries.1
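As a rough illustration of this split-TLB design, the following Python sketch models per-size structures probed with the same virtual address. The dictionary-based lookup and helper names are ours for illustration; real hardware probes all partitions in parallel and bounds each one's entry count.

```python
# Simplified model of a split L1 TLB: one structure per page size,
# all probed with the same virtual address (in hardware, in parallel).
PAGE_SIZES = [4 << 10, 2 << 20, 1 << 30]       # 4KB, 2MB, 1GB
tlbs = {size: {} for size in PAGE_SIZES}       # per-size TLB structures

def tlb_insert(vaddr, frame_base, page_size):
    # The tag (virtual page number) depends on the page size.
    tlbs[page_size][vaddr // page_size] = frame_base

def tlb_lookup(vaddr):
    # Each structure computes its own tag from the same virtual address,
    # since the page size is not known before translation.
    for size in PAGE_SIZES:
        frame_base = tlbs[size].get(vaddr // size)
        if frame_base is not None:
            return frame_base + (vaddr % size)  # hit: frame base + offset
    return None                                 # miss: walk the page table
```

The key point the sketch captures is why the partitioning is needed: the virtual page number cannot be computed until the page size is known, so each granularity needs its own tag comparison.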
Many different page sizes with hashed page tables: An
alternative way to support superpaging is to provide many
page types. For example, Itanium supports powers of two
page sizes from 4KB to 4GB [6], and PowerPC supports page
sizes from 128KB to 256MB [19]. To support many different
1 The latest manuals from Intel indicate that in the next microarchitecture, 16 entries for 1GB pages will be added to the L2 TLB.
page sizes, the page table organization uses hashed page
tables, since many page sizes do not fit with fixed multi-level
paging directly. With such diverse page sizes, unlike limited
page sizes of the x86 architecture, the operating system can
choose the best fit for dynamic memory accesses to minimize
fragmentation.
A critical restriction of supporting many page sizes is the
organization of TLBs and their scalability. For processors with
many variable page sizes, TLBs are organized only in a fully-associative structure with comparison logic for every entry.
Each entry should be able to use a different number of bits
of page numbers for matching the entry against the requested
virtual address. In such designs, it is difficult to increase the
TLB capacity.
As discussed for both superpaging schemes, the TLB organization is constrained by the need to support many page sizes. As more page sizes must be supported, the scalability of TLBs decreases, because their latencies increase significantly. In current physically addressed cache hierarchies, TLB accesses are on the critical path of instruction fetching and memory instruction execution, and the complexity and capacity of TLBs are limited to control their latencies. Furthermore, as discussed by Basu et al. [10], fixed page sizes may not continue to scale as memory capacity increases: even a 1GB page may not be enough for several TBs of memory capacity.
B. Direct Segment
A recently proposed alternative to superpaging is direct
segment [10]. In the direct segment, each process has a set
of segment-supporting registers. With base, limit, and offset,
a segment maps a variable length virtual memory partition
(primary memory) to a contiguous physical memory region.
With such low complexity, segment support is added to existing page-based systems to allow static memory allocation for single-application servers, where a single long-running application commonly occupies the system. Basu et
al. showed that big memory servers often run such a single
application with static memory allocation behaviors, where
most of the memory is allocated initially [10]. In such use
scenarios, dynamic memory management by the operating
system is not critical, and direct segment can almost eliminate
the overhead of address translation. The segmented primary
memory region can be identified by the programmer or the
operating system.
Figure 3 presents the address translation with direct segment
backed by traditional paging. Given a virtual address, the
address is checked using the BASE and LIMIT registers that
describe the direct segment. If the address lies within the direct
segment, the OFFSET, which is the offset between the virtual
address and the physical address, is summed with the current
virtual address to generate a physical address. If the virtual
address lies outside the segment region, traditional paging is
used. Translating a memory address using a direct segment
can be done quickly with a couple of operations.
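This check amounts to a bounds comparison and an addition. Here is a minimal Python sketch of the idea, with a page_walk callback standing in for the conventional paging path:

```python
# Direct segment translation: if the virtual address falls between BASE
# and LIMIT, add OFFSET; otherwise fall back to conventional paging.
def translate(vaddr, base, limit, offset, page_walk):
    if base <= vaddr < limit:
        return vaddr + offset   # inside the primary region: compare + add
    return page_walk(vaddr)     # outside: normal page-table translation
```

For example, a segment with BASE=0x10000 and LIMIT=0x50000 whose OFFSET is 0x1F0000 maps virtual address 0x10000 to physical address 0x200000 without ever consulting the TLB or page table.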
Fig. 3. Translation mechanism of direct segment [10]
However, as direct segment relies on the static memory
allocation of a process, it has several limitations when adopted for more general applications. First, each program
must determine the primary memory region at the initial phase
of its execution, and the operating system must assign a
contiguous memory for the segment. Second, since a large
contiguous chunk of memory should be allocated, even if some
portion of the memory is not used, the operating system cannot
reclaim the memory region, causing internal fragmentation.
Third, when the operating system cannot allocate a large
contiguous memory, it must perform compaction which will
cause performance degradation for co-running applications.
Although direct segment can improve the address translation
efficiency tremendously, it restricts the flexibility of memory
management by the operating system severely. This paper
proposes a segment-oriented translation supporting variable
length segments, while eliminating such restrictions.
C. Prior Work
Addition of Address Spaces: Jacob et al. [20], [21] proposed postponing address translation until an LLC miss, in
which case the OS cache miss handler is invoked, translating
the address by software. This work shows that software
managed translation can be as good as a hardware translation mechanism, with further benefits of software flexibility.
Enigma [35] proposed the use of an intermediate address
space. This intermediate address space is 1:1 mapped to the
physical address space, preventing aliasing problems. The
cache is tagged using the intermediate space, and memory
translation is only required on an LLC miss. Swanson et al. [31] proposed adding another address space that is managed in hardware by the memory controller. This work allows non-contiguous and non-aligned physical pages to be mapped contiguously in the shadow address space, which in turn provides more opportunities for allocating superpages. Fang
et al. [14] studied the effect of page promotion using copying
and remapping (via shadow address).
Using a large segment: Basu et al. [10] proposed using a large segment which directly maps a region of virtual memory to physical memory, achieving the performance of an ideal TLB for memory accesses within the segment.
Using large pages: Current architectures support variously sized pages, and operating systems make limited use of such architectural support. Alpha, ARM, Intel, Itanium, MIPS, PowerPC, UltraSPARC, and many more support various large pages [19], [2].
Operating systems also support such large pages. For the x86 architecture, Linux provides two interfaces for large page usage. Libhugetlbfs provides applications with large pages when they explicitly request them. Transparent Huge Page, on the other hand, provides transparent support for large pages by allocating them whenever eligible, and gracefully promoting to or demoting from large pages whenever the need arises.
Talluri and Hill [33] proposed using a reservation based
page allocation scheme, which maps the requested pages, as
well as surrounding pages. The actually accessed pages are
allocated, and the reserved surrounding pages are maintained
in a reserved pages list. Once the reserved pages are accessed, they are formally mapped. With the aid of reservation,
a collection of aligned, contiguous pages can be promoted to superpages with no additional cost of migrating pages.
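The reservation mechanism can be sketched roughly as follows in Python. The block size and fault-handler shape are invented for illustration; the real logic lives in the OS page allocator.

```python
# Reservation-based allocation sketch: the first fault in an aligned block
# reserves the whole block; pages are mapped only as they are touched, and
# a fully touched block becomes eligible for superpage promotion for free
# (no page migration, since the pages are already aligned and contiguous).
PAGES_PER_BLOCK = 4  # pages covered by one reservation (toy value)

def handle_fault(page, reservations):
    block = page - page % PAGES_PER_BLOCK        # aligned block base
    touched = reservations.setdefault(block, set())
    touched.add(page)                            # map the touched page
    # promotion is possible once every page of the aligned block is mapped
    return len(touched) == PAGES_PER_BLOCK
```

In this toy model, faulting on pages 0 through 3 fills the aligned block, at which point the handler reports the block as promotable to a superpage.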
Ganapathy and Schimmel [15] presented mechanisms which
support operating systems in providing multiple page granularities; among others, page table reorganization, page allocation, migration, promotion, and demotion issues were addressed. Navarro et al. [25] made use of the reservation
scheme proposed by Talluri et al. [33], and implemented
operating system support for superpages, including dynamic
promotions and demotions.
Coalescing TLB entries: Talluri et al. [32] proposed a clustered hashed page table with subblocking. Blocking of contiguous pages is based on spatial locality and is managed by hash tables, which store the information about the consecutive pages of subblocks.
Pham et al. [28] proposed CoLT, a TLB management mechanism that increases the virtual address reach of a TLB entry. CoLT coalesces contiguous TLB entries whose pages are contiguous in both the virtual and physical address spaces, and it works with both set-associative and fully-associative TLBs.
Pham et al. [27] improved CoLT with clustering, coalescing adjacent TLB entries with spatial locality to improve TLB reach. CoLT requires a contiguous address range and does not tolerate non-contiguous gaps within a contiguous area; in contrast, this work allows non-contiguous pages to form a cluster.
Reducing TLB miss latencies: Bhattacharjee [12] proposed a large-reach memory management unit (MMU) cache to reduce the translation latency overhead of TLB misses. This work coalesces contiguous level-2 entries (level 1 being the leaf node) of the MMU cache. Sequential and parallel workloads
benefit from coalescing intermediate page table nodes, as
only the leaf node needs to be fetched through the memory
hierarchy.
Barr et al. [9] proposed storing partial address translation
information in the MMU cache. Instead of coalescing intermediate nodes [12], this work caches previous translations with an index, and when a translation for a consecutive address is requested, the cached translation allows a few levels of the page table walk to be skipped.
Fig. 4. Change of memory allocation and average utilization
Speculative Execution: Barr et al. [8] used the reservation scheme [33] to design a speculative address translation system. For TLB misses that fall in a reserved region, execution continues with a speculated address. Address translation is
executed in parallel with the speculative execution to confirm
correctness.
Exploiting Multicore: Srikantaiah and Kandemir [29] proposed passing TLB entries between the TLBs of different cores. Multi-threaded applications benefit by sharing TLB entries between cores instead of individually walking the page table, and multi-programmed applications benefit by borrowing TLB capacity from other underutilized cores. Bhattacharjee et al. proposed a second-level shared TLB [11], [29].
This work also aims to exploit sharing and efficient usage of
a combined TLB structure.
Virtualization Support: Gandhi et al. [16] extended their work on direct segments [10] to efficiently support virtualized systems. Current x86 virtualized systems need to walk two levels of page tables to translate an address, and the authors proposed three modes of virtualized address translation.
III. CHALLENGES FOR SINGLE DIRECT SEGMENTS
Although direct segment is a very promising technique to
eliminate the translation cost entirely for the primary memory
region of each process, general systems still need to manage
their physical memory dynamically. In this section, we first
discuss the dynamic nature of memory used in general applications, and the internal and external fragmentation problems
with direct segment.
Methodology: We conducted the preliminary experiments presented in Figure 2 on a real machine equipped with an Intel Haswell processor (i7-4770), 32GB of RAM, and Linux 3.17. Transparent Hugepage support, hyperthreading, and frequency scaling were turned off for consistent results. The TLB configuration is as follows: a 64-entry, 4-way associative L1 TLB and a 1024-entry, 8-way associative L2 TLB. We executed benchmarks from the PARSEC 2.1 multi-threaded benchmark suite [13], the SPEC CPU2006 benchmark suite [18], a bioinformatics benchmark suite [7], two graph analysis benchmarks, SSCA [4] (scale 20) and Graph500 [3] (scale 24), and the NAS Parallel Benchmark suite.
We used hardware performance counters to measure the execution cycles, and page table walk cycles of each benchmark.
We used the Linux tools perf trace and pmap to extract a memory trace of allocations/deallocations and initial accesses to pages.
Memory Allocation Behaviors: Direct segment targets applications that allocate most of their memory initially, so that the program can request the right amount of primary memory to be allocated in the contiguous segment memory. We analyzed the memory allocation behaviors of more general applications in Figure 4. The upper part of the figure shows how much virtually and physically allocated memory changes after the initial phase, defined as the first 5% of the total execution time. The dotted line shows the maximum memory change, in percentage, in the virtual address space of each application, which is affected by the application's mmap requests and heap growth through the brk or sbrk system calls. The solid line shows the maximum memory change in actual physical pages, which are allocated lazily after page faults.
The result shows that virtual memory allocations occur frequently during the execution of many applications, although some applications on the right side allocate most of their memory early, as shown by Basu et al. [10]. The virtual memory allocation can change by up to 214% from the initial allocation. With such a dynamic nature of memory allocation in some applications, it may be difficult to assign the primary memory early in the initial phase of application execution. The changes in physical memory are even more drastic. For IS in the NAS benchmark suite, most physical memory pages are touched and allocated late in the program's execution.
Internal Fragmentation: The second limitation of direct
segment is the potential low utilization of allocated physical
memory. In most of the current operating systems using
paging, the actual memory page is allocated lazily only after a
page fault. However, with direct segment, the entire segment
should be allocated by the operating system regardless of
whether some portion of the memory will be actually accessed
by the application.
To quantify the internal fragmentation, we analyzed the
memory utilization of the dynamically allocated memory. The
lower part of figure 4 presents the ratios of actual resident
physical pages to the requested virtual pages by mmap and
sbrk system calls. The ratios were measured every second
and averaged. Such actual physical page allocation occurs only
when a page is actually accessed. The results show a wide range of utilizations of the allocated virtual memory regions, from less than 20% for ferret to almost 100% for xalancbmk and omnetpp.
Fig. 5. Accessed regions in the primary region in virtual address space: (a) SPEC CPU2006 calculix, (b) SPEC CPU2006 gromacs

In real systems, the mmaped regions are not always contiguous. Figure 5 is a case study of two applications, showing the portions of the primary region (heap and mmaped regions) in the virtual address space that are actually used. As shown in the figure, the primary memory region may contain separate clusters of regions with a large unused hole between adjacent clusters.

External Fragmentation: The final limitation of direct segment is that the operating system must always provide a large contiguous memory region to maintain the efficiency of direct segment. However, even on many servers, workloads are consolidated to use physical resources more efficiently. For example, batch workloads can be co-located with latency-critical workloads; to avoid affecting the performance of the latency-critical workloads, the batch workloads are scheduled selectively [22], [23], [24], [34]. In more general systems, many applications are launched and completed, and such application life cycles can cause fragmentation of main memory. Strictly supporting direct segment requires the operating system to find a contiguous chunk of memory.

Considering the limitations of direct segment under this widened target application scope, a single direct segment may not be enough; the operating system must manage its physical memory in smaller chunks. Although the variable lengths supported by segmentation can potentially address the scalability problem of address translation with large memory, systems need many such variable length segments for dynamic memory management.

Supporting variable length segments is not new to operating system designs. For several existing systems with many page sizes, such as Itanium and PowerPC, the operating systems are already designed to manage physical memory at power-of-two sizes efficiently while minimizing external and internal fragmentation. Although segmentation provides finer granularities than power-of-two page sizes, current operating system designs show that supporting many different memory granularities is possible.

IV. FLEXIBLE VIRTUAL SEGMENTATION ARCHITECTURE

Fig. 6. Overall translation flow

Fig. 7. Index cache for second level translation

A. Design

The design goal of the flexible virtual segmentation architecture is two-fold. First, it must provide many variable length segments to minimize internal and external fragmentation, while supporting scalable address translation. We aim to expand the applicability of direct segment by adding support for such dynamic memory management. The second goal is to minimize the translation overhead of supporting the more complex address translation with flexible virtual segmentation: the translation cost must not hinder the common case of memory accesses.
To satisfy these conflicting goals, flexible virtual segmentation exploits the unused intermediate physical address space, which is an extension of the physical address space. Even though the current architecture supports
a very large physical address space, the system will use only
the lower part of the space where the actual main memory
exists. As shown in Figure 1c, the intermediate physical space
includes the lower system memory space and unused upper
physical space. We use the intermediate physical space as
an intermediate address space to provide an illusion of a
contiguous memory for each process, while the operating
system can dynamically change the mapping of the actual
physical memory. In the x86 architecture with 48 bit physical
addresses, the intermediate physical space is 256TB, and
with 52 bits, the space becomes 4 PB. With such a large
intermediate physical space, many processes can be assigned
to a contiguous intermediate physical region of hundreds of
GB.
The flexible virtual segmentation uses two-level segmentation for
the primary region of process memory. A virtual address in the
primary region is mapped to an intermediate physical address through
the first-level single segment translation, and the intermediate
physical address is translated to a system address by the second-level
flexible virtual segmentation translation. Figure 1 presents the
memory mapping differences among traditional paging, direct segment,
and flexible virtual segment. With traditional paging, the virtual
address space is mapped to the system address space at fixed page
granularities. With direct segment, a single contiguous memory
region is used for the segment of each process. With the flexible
virtual segment, the first-level segment is mapped to a contiguous
region in the intermediate physical space. However, the intermediate
physical space can be partitioned into many variable length segments
by the operating system.
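The two-level flow above can be sketched in software. In this minimal sketch (all class, method, and constant names are ours for illustration, not from the hardware), the first level is a three-register range check as in direct segment, while the second level looks up a sorted list of variable-length segments keyed by their intermediate physical start addresses:

```python
from bisect import bisect_right

class TwoLevelSegmentation:
    def __init__(self, seg_base_va, seg_limit, seg_base_ipa):
        # First-level registers: one primary segment per process.
        self.base_va = seg_base_va    # start of primary region in virtual space
        self.limit = seg_limit        # length of primary region
        self.base_ipa = seg_base_ipa  # start of contiguous intermediate region
        # Second-level variable-length segments: (ipa_start, length, sys_start),
        # kept sorted by ipa_start so lookup can binary-search.
        self.segments = []

    def add_segment(self, ipa_start, length, sys_start):
        self.segments.append((ipa_start, length, sys_start))
        self.segments.sort()

    def va_to_ipa(self, va):
        # First level: simple base/limit check, as in direct segment.
        if self.base_va <= va < self.base_va + self.limit:
            return self.base_ipa + (va - self.base_va)
        return None  # outside the primary region: falls back to paging/TLB

    def ipa_to_sys(self, ipa):
        # Second level: performed only on an LLC miss.
        i = bisect_right(self.segments, (ipa, float('inf'), float('inf'))) - 1
        if i >= 0:
            start, length, sys_start = self.segments[i]
            if ipa < start + length:
                return sys_start + (ipa - start)
        return None  # segment-table miss: an OS interrupt fills the entry
```

Note that `ipa_to_sys` is never invoked on the L1 critical path; it models only the translation that happens below the cache hierarchy.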
B. Overall Architecture
In the two-level segmentation, a single primary segment in the
virtual address space of each process is mapped to a contiguous
region of the intermediate physical address space. The translation
between the virtual address space and the intermediate physical
address space is similar to direct segment. With a set of three
registers, the translation has a negligible cost on the critical
path. Note that even with direct or our virtual segmentation, some
parts of process memory, such as code and stacks, still use paging,
which rarely incurs TLB misses due to their relatively small
footprint and high locality.
A key design decision is to use the intermediate physical address
space to index and tag all caches. As long as memory accesses stay
within the cache hierarchy, including coherence transactions, the
intermediate physical address does not need to be translated to the
actual system memory address. However, the intermediate physical
address must eventually be mapped to a real system address. Since
all caches are indexed and tagged by intermediate physical addresses,
the second-level translation occurs only when external memory
accesses are required. The second-level translation supports many
variable length segments, which the operating system can dynamically
assign.
One critical correctness restriction is in the mapping between the
intermediate physical address space and the system memory space. If
two different intermediate physical addresses were allowed to map to
the same system address, an aliasing problem would occur in the
cache hierarchy, and the coherence mechanism could not maintain the
coherence of the aliased data. To avoid such an aliasing problem, an
intermediate physical address must be mapped to only one system
memory address. Therefore, memory sharing among processes through
segmented memory is not allowed. Direct segment also does not
support memory sharing for the segment memory region of processes.
However, memory sharing can still be supported, since the
unsegmented address region uses page-based translation. Page tables
directly translate virtual addresses to system addresses without the
intermediate physical space.
Delaying address translation until after L1 accesses is not new.
Classic virtual caches allow postponing the address translation
until after cache misses, although they must address the aliasing
problem with separate mechanisms. The recently proposed Enigma
architecture also uses a two-level translation [35]. In the Enigma
architecture, the first-level translation uses fixed-size large
segments for fast translation without TLB misses, and the
second-level translation uses traditional paging. The output of the
first-level translation is an intermediate address, which is similar
to the intermediate physical address in our architecture. However,
Enigma uses first-level fixed-length segments backed by paging. The
goal of flexible virtual segmentation is to use variable length
segments for most of the memory allocation in systems, for
scalability with increasing memory sizes.
To support many flexible segments, our design provides an efficient
second-level translation mechanism. Since the complexity of the
flexible virtual segmentation translation is moved to the second
level, its impact on regular L1 cache accesses is minimized. The
latency incurred by the second-level translation can be overshadowed
by the longer memory access latencies, and the translation process
can be overlapped with the last-level cache accesses, as in Enigma.
The proposed flexible segmentation can also support direct segment
without the second-level translation, if a single application is
used on a server. In the first-level segmentation, the segment can
be mapped to the system memory region directly. On an LLC miss, if
the address falls in the system memory region, which is the lower
portion of the intermediate physical address space, the second-level
translation is bypassed for direct mapping.
C. Translation Architecture
Figure 6 presents the detailed architecture of flexible virtual
segmentation. The address translation between a core and its L1
cache uses a set of segment translation registers and a small TLB
for non-segmented memory regions. The first-level translation
architecture is similar to direct segment: a virtual address is
translated by segmentation and also by TLBs. If the virtual address
is within the segment range, the target address is calculated from
the segment base address and the offset. However, unlike direct
segment, the address generated by the first-level segmentation is an
intermediate physical address. Since caches are indexed and tagged
by the intermediate physical address, the core can access all levels
of caches with the generated address. The TLBs translate to system
memory addresses directly and are used for non-primary regions of
the virtual address space.
The lower part of Figure 6 shows the second-level translation. If an
LLC miss occurs and the external memory must be accessed, the
intermediate physical address must be translated to a system address
with a segment table containing the base, limit, and offset for
flexible virtual segmentation. The segment table is a cache of the
system-wide segment table maintained by the operating system. If an
intermediate physical address misses in the segment table, an
interrupt occurs and the operating system fills the table entry.
However, we expect such segment misses to be rare except for cold
misses, since the segment table contains thousands of segment
definitions. Figure 7 shows the organization of a segment table. A
segment entry has the starting intermediate physical address,
length, and offset for the segment.
One of the key design issues is to find the corresponding segment
for an intermediate physical address from the table efficiently.
Unlike TLBs with a fixed mapping granularity, a naive search for a
segment requires a linear scan of all table entries in the worst
case. Since the translation latency is still important, so as not to
increase LLC miss latencies significantly, we add a B-tree search
mechanism.
Figure 7 presents the B-tree search mechanism for the segment table.
The operating system maintains a B-tree indexed by intermediate
physical addresses for all segments in the intermediate physical
space. All segments used by the many processes are ordered by their
intermediate physical addresses, and a search tree is constructed as
an index structure. One node of the tree uses 64B of data to match
the cache block size, and it can contain 8 child nodes. For each
child of a node, the node must contain the starting intermediate
physical address of the memory region represented by the child and
the pointer to the child node in the system address space. For a
given LLC miss address, the search B-tree is accessed to find a
segment ID in the segment table. We picked a generous 1024 segment
table entries, which should be more than enough. We do not expect
the system to use all 1024 segments, but in case it does, the
operating system needs to swap segment entries.
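The fanout-8 search described above can be sketched as follows (a minimal model, with names of our own choosing): each node holds up to 7 sorted comparison keys and 8 child pointers, and a leaf's "children" are segment IDs into the segment table. The search at each node counts how many keys are less than or equal to the incoming intermediate physical address:

```python
class BTreeNode:
    def __init__(self, keys, children, is_leaf):
        self.keys = keys          # up to 7 sorted segment-start IPAs
        self.children = children  # up to 8 child nodes, or segment IDs at a leaf
        self.is_leaf = is_leaf

def find_segment_id(root, ipa):
    """Walk the fanout-8 index from root to leaf for a given LLC-miss IPA."""
    node = root
    while True:
        # Pick the child whose range covers ipa: count keys <= ipa.
        i = sum(1 for k in node.keys if k <= ipa)
        if node.is_leaf:
            return node.children[i]  # segment ID into the segment table
        node = node.children[i]
```

In hardware, each iteration of this loop corresponds to one index cache access, since one node occupies exactly one 64B line.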
Both 1024 and 2048 entries need a tree of depth 4 when spanning by a
factor of 8. We need two types of data in a B-tree node: the pointer
to the next node, and a key to compare against the incoming value
(the incoming intermediate physical address). We need 8 pointers and
7 comparison keys. Each pointer points to a 40-bit system address.
However, because each B-tree node fits in a cache line, we can
address a node with 34 bits. The intermediate physical address is 52
bits, and we impose a design constraint that segments are aligned to
2MB boundaries, so 31 bits are enough to compare an intermediate
physical address. We can fit the 8 pointers and 7 keys into a single
cache line.
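The bit budget above can be checked directly. Pointers lose 6 bits because nodes are aligned to 64B cache lines (40 - 6 = 34 bits), and keys lose 21 bits because segments are aligned to 2MB boundaries in the 52-bit intermediate physical space (52 - 21 = 31 bits):

```python
POINTER_BITS = 40 - 6   # 40-bit system address, 64B-aligned node pointer
KEY_BITS = 52 - 21      # 52-bit IPA, 2MB-aligned segment start
CACHE_LINE_BITS = 64 * 8

# A fanout-8 node: 8 child pointers and 7 comparison keys.
node_bits = 8 * POINTER_BITS + 7 * KEY_BITS
print(node_bits)  # 489
assert node_bits <= CACHE_LINE_BITS  # 489 bits fit in a 512-bit line
```

The 23 leftover bits per line leave a little room for flags such as a leaf marker.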
Accessing the search tree requires multiple memory accesses; to
avoid these memory accesses, we add an index cache. Figure 7 shows
the details of the search tree cache. We conducted a sensitivity
study on the index cache size. We need to choose an index cache size
large enough to ensure performance benefits. We distributed the
40-bit system address space equally over 1024 or 2048 segment
entries, inserted all entries into the search tree, and simulated 1
million random accesses to the entire system address space. This is
the worst-case performance, as real workloads will show spatial
locality.

Fig. 8. Index cache size sensitivity study

Figure 8 shows the cache hit rate as a function of the index cache
size. The cache is 8-way associative, and each curve corresponds to
one of the tested entry counts (1024 or 2048). We conclude that for
1024 entries a 32KB index cache is sufficient, and for 2048 entries
a 64KB index cache is sufficient.
Using CACTI 6.5 [1], we measured the index cache access latency. For
a 3.4GHz machine, both the 32KB and 64KB 8-way index caches have a
latency of 3 cycles. The access latency of a 1024-entry table is
also 3 cycles. Thus searching the four-level B-tree requires four
accesses to the index cache. To support 1024 segments, the segment
table size is about 24KB (base, offset, length), and a 32KB index
cache can contain the entire index tree covering 1024 segments.
Compared to the LLC size, the extra 56KB of structures does not add
a significant area overhead. Furthermore, a multi-core processor
needs only one index cache and segment table shared by multiple
cores.
Accessing the second-level translation can be either serialized or
parallelized with accesses to the LLC. A serial access to the
translation exposes the translation latency in the total LLC miss
latency. However, parallel accesses, at the cost of increased power
consumption, can hide the translation latency by overlapping it with
LLC accesses. In the previous example of 1024 segments with a 32KB
index cache, searching a segment takes four consecutive accesses to
the index cache, as the index tree has four levels. Assuming a
3-cycle access latency for the 32KB index cache and one extra cycle
for each step, an index search takes 16 cycles. From the segment ID,
accessing the 24KB segment table takes less than 3 cycles, and the
total segment access time is less than 20 cycles.
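The back-of-the-envelope latency arithmetic above can be written out explicitly, using the CACTI-derived numbers from the text (variable names are ours):

```python
TREE_DEPTH = 4            # four-level B-tree index
INDEX_CACHE_CYCLES = 3    # 32KB index cache access (CACTI, 3.4GHz)
STEP_OVERHEAD_CYCLES = 1  # extra cycle between consecutive tree steps
SEGMENT_TABLE_CYCLES = 3  # final base/limit/offset read from the 24KB table

index_search = TREE_DEPTH * (INDEX_CACHE_CYCLES + STEP_OVERHEAD_CYCLES)
total = index_search + SEGMENT_TABLE_CYCLES
print(index_search, total)  # 16 19
assert total < 20           # consistent with the "less than 20 cycles" bound
```

This 20-cycle bound is the serial-lookup cost assumed in the evaluation (Section VI); the parallel-lookup configuration hides it behind the LLC access instead.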
V. SYSTEM-LEVEL IMPLICATIONS
A. Required OS Changes
To support the flexible virtual segmentation, the operating system
requires several changes and policy adjustments. To provide the
second-level translation with many segments, the operating system
maintains the segment table and the index tree for segments in its
memory. On a segment table cache or index cache miss, the interrupt
handler must fill the segment table or the index tree cache.
Furthermore, if the segmentation organization changes, the segment
table and index tree entries must be flushed, similar to TLB
invalidation on page table changes.
For each new process, the operating system first determines the
amount of intermediate physical space to allocate to the process.
Since the intermediate physical space is very large, 4 petabytes
with 52-bit intermediate physical addresses, the OS can assign a
large contiguous intermediate physical region to a process.
Furthermore, the OS can leave empty space between processes, if
necessary, to allow the mapped intermediate physical space of each
process to grow dynamically. Note that even if a very large
intermediate physical region is assigned to a process, only a
portion of the space may actually be mapped to system memory by the
second-level segmentation.
Within a virtual memory space, however, some regions must still be
assigned to traditional page tables and translated by the backing
TLB and page tables. The memory allocations that should go to the
paging region are those that need sharing, fine-grained protection
control, or other possible corner cases.
B. Allocating Physical Memory
The operating system must effectively manage physical memory with
variable length segments. However, managing free memory of different
sizes is common even in current page-based memory management.
Traditional buddy allocators maintain lists of free memory chunks of
the same sizes.
An important decision for the memory allocator is how to determine
the second-level segment size. If the size is too large, internal
fragmentation will waste physical memory. If the size is too small,
an extra segment or segments must be allocated later. The actual
physical memory allocation can occur in two different ways. First,
as discussed by Basu et al. [10], on a high-level virtual memory
request such as an mmap system call, the requested virtual memory
size is allocated to the process. This aggressive early memory
allocation reduces the number of dynamic segments, at the risk of
internal fragmentation when only part of the requested virtual
memory is actually accessed.
An alternative way is to delay the memory allocation until page
faults, as used by demand paging in current operating systems. On a
page fault, the operating system must determine the segment size to
allocate for the second-level segment. This problem is similar to
choosing superpage sizes when many different page sizes are
available. In general, two approaches have been used to support
different page sizes [33], [25], [30], [15]. First, a large page can
be assigned aggressively [30], [15]: the system prefers larger pages
to reduce TLB misses, but this approach is prone to internal
fragmentation if large pages are not allocated carefully. Second,
reservation-based approaches, as proposed by Talluri and Hill [33],
assign a small page at first but reserve the following contiguous
memory region for future page promotion. Since the contiguous region
is reserved, the memory region can later be promoted to a large page
size.
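A reservation-based policy of this kind could be adapted to segment sizing roughly as follows. This is an illustrative sketch in the spirit of Talluri and Hill [33], not the paper's algorithm; the chunk granularity, `RESERVE`, and `PROMOTE_AT` thresholds are hypothetical:

```python
RESERVE = 16     # chunks reserved contiguously around the first fault
PROMOTE_AT = 8   # distinct touched chunks that trigger promotion

class Reservation:
    """Track one reserved region; promote it to a single large segment
    once the process has actually touched enough of it."""
    def __init__(self, fault_chunk):
        self.start = fault_chunk - (fault_chunk % RESERVE)  # align reservation
        self.touched = set()
        self.promoted = False

    def on_fault(self, chunk):
        # Map only the faulting chunk; count how much of the reservation
        # has been touched so far.
        self.touched.add(chunk)
        if not self.promoted and len(self.touched) >= PROMOTE_AT:
            # Coalesce the reserved region into one variable-length segment.
            self.promoted = True
        return self.promoted
```

A sparse access pattern never reaches `PROMOTE_AT`, so the unused remainder of the reservation stays reclaimable, avoiding the internal fragmentation of eager large allocation.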
Considering existing operating system support for such superpaging
with many different page sizes, we expect that variable-length
segmentation can also be supported efficiently by operating systems,
although the variable length mapping can open new opportunities and
challenges for the memory allocator of the operating systems.
C. Deallocating Physical Memory
To provide dynamic memory reclamation, a chunk of physical memory
mapped to a second-level segment can be deallocated by the operating
system, or by the application through munmap requests. If the
process no longer requires the memory space, the operating system
can simply remove the second-level segment and adjust the index
tree. When a segment is deallocated, some cachelines of the segment
may still exist in the caches, which are indexed and tagged by
intermediate physical addresses. Although the process can still
access the obsolete cachelines of the deallocated segment, those
cachelines will not affect physical memory, since the second-level
segment is removed and there is no translation from the obsolete
intermediate physical address to a system address. As long as the
same intermediate physical addresses are not re-used for another
process, the orphan cachelines do not cause any correctness problem
for the system. Considering the very large intermediate physical
space, recycling of intermediate physical addresses occurs rarely,
and when it happens, the entire cache can be flushed.
The flexible segmentation can support traditional swapping,
although we expect such swapping operations will rarely occur
for future big memory systems. The operating systems can
unmap a segment for a certain range of the intermediate
physical region of a process, and write it to the swap space.
The overall process is similar to the page-level swap operation.
D. Supports for Other Memory Management Issues
Supporting Segment-based Memory Protection: A restriction of direct
segment for memory management is that it can support only a single
permission for the entire primary memory region. In current
page-based memory management, protection, access, and dirty bits are
used to provide page-level protection and memory access statistics
for dynamic memory management. In the flexible virtual segmentation,
the second-level segment entry includes such bits to provide
segment-level protection and access status information. Exploiting
the many available segments, the proposed architecture can set
different read or write permissions for each segment. Unlike direct
segment, where the entire primary memory region must use the same
permission status, the primary region is decomposed into multiple
segments with different permission status. However, to support such
segment-level permission status, each cache tag must include the
permission bits, since the second-level segment tables are accessed
only on LLC misses, and the permission status bits must be carried
in each cache tag.
Supporting Sub-segment Memory Reclamation: Another limitation of
direct segment is that the access status of sub-regions of the
primary memory region cannot be identified. The flexible virtual
segmentation allows the partitioning of the primary region by the
second-level segments, and thus the access status can be tracked for
each second-level segment, although the access bit can be set only
on LLC misses. The access bit tells whether the segment has been
accessed at least once.
However, even a second-level segment can cover a very large physical
memory region, and once the segment is assigned to a process, the
operating system cannot know the memory access pattern within the
second-level segment. For example, even if a process uses only a
small portion of an allocated segment, the segment access bit will
always be set. To provide more fine-grained access status
information, the second-level segment table includes tens of extra
access bits to provide sub-segment access status. Each bit
represents a portion of the segment. The sub-segment access bits are
set on LLC misses, which go through the second-level translation
before the cache line is fetched from memory. With such sub-segment
access information, the operating system can dynamically readjust
the segment size to reclaim unused physical memory regions. One
advantage of the second-level segment table is that it is relatively
easy to increase its area without affecting the core architecture,
since it sits at the lowest level of the memory hierarchy.
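The sub-segment tracking described above can be sketched as a small bitmap in each second-level entry. The field names and the 32-bit width are our assumptions for illustration; the paper only says "tens of extra access bits":

```python
SUB_BITS = 32  # assumed bitmap width: one bit per 1/32 of the segment

class SegmentEntry:
    def __init__(self, base, length):
        self.base, self.length = base, length
        self.access_bits = 0  # set lazily, on LLC misses only

    def record_llc_miss(self, ipa):
        # Map the missing IPA to its fraction of the segment and set that bit.
        frac = (ipa - self.base) * SUB_BITS // self.length
        self.access_bits |= 1 << frac

    def untouched_fraction(self):
        # Sub-regions whose bit never got set are reclamation candidates.
        return SUB_BITS - bin(self.access_bits).count('1')
```

Because bits are set only on LLC misses, the bitmap understates accesses that always hit in the caches, but an unset bit still safely means the sub-region was never fetched from memory.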
VI. RESULTS
A. Methodology
To show the performance of our proposed system, we used a full
system simulator and created a custom Pin-based simulator to
experiment with our design.

We used MARSSx86 [26], a cycle-accurate full system simulator
running a Linux image, to conduct our experiments on the expected
performance improvements of our proposed system. Table I shows our
system configuration. We use the same TLB configuration as in [28],
which was used to simulate a stressed TLB situation, similar to
real-world execution.

Our Pin-based simulator models the memory mapping of each
application and decides whether a memory access goes to a segment
region or a paging region. If the memory access was to the paging
region, we inserted the access into our TLB model. Our TLB model was
the same as that of the Haswell processor.

We executed SPEC CPU2006, Graph500 (size 22), and the NPB benchmarks
(class C, and class B for NPB_DC), as in the real machine
experiments. Additionally, we conducted experiments with GUPS (size
28), a random access benchmark. We fed the benchmarks through
SimPoint and fast-forwarded based on the SimPoint results. We
simulated 1 billion instructions for our MARSSx86 simulations, and
100 billion instructions for our Pin-based simulations.
B. Performance
Study of memory accesses: In this section, we present the
experimental results of our proposed system. First, we studied how
many of the memory accesses land in the segmented region and how
many go through paging in the first level of translation: from the
virtual address space to the extended physical address space, or to
the physical address space for paging.
Parameter        | Value
-----------------|--------------------------------------------------
Processor        | Out-of-order x86 ISA, 3.4GHz;
                 | 128-entry ROB, 80-entry LSQ;
                 | 5-issue width, 4-commit width;
                 | 36-entry issue queue, 6 ALUs, 6 FPUs
Branch Predictor | 4K-entry BTB, 1K-entry RAS,
                 | two-level branch predictor
L1 I/D Cache     | 2/4 cycles, 32KB, 4-way, 64B block
L2 Cache         | 6 cycles, 256KB, 8-way, 64B block
L3 Cache         | 27 cycles, 2MB, 16-way, 64B block
TLB L1           | 1 cycle, 32-entry, 4-way
TLB L2           | 7 cycles, 128-entry, 4-way
Memory           | 4GB DDR3-1600, 800MHz, 1 memory controller

TABLE I
SIMULATED SYSTEM CONFIGURATIONS
Figure 9 shows the breakdown of memory accesses. For the majority of
workloads, the fraction of accesses translated through paging is
very small, denoted by the white portion of the stacked graph. Sjeng
and NPB_EP have the highest fraction of memory accesses to the paged
section, 8.93% and 9.03% respectively; however, they access very
small regions of less than 50 pages. Such a small area should cause
no noticeable performance degradation. The stack region is also
paged; however, stacks are generally very small, tens of pages at
most. It is also important to note that accesses to stacks are
concentrated on the topmost frame. Thus, the total TLB entries
required for paging will be negligible.

We conclude from this study that by aggressively allocating memory
to segments, and using paging only when absolutely necessary, the
pressure on the TLB structure can be nullified.
Expected performance gains: Figure 10 shows the normalized IPC of
our flexible virtual segment system compared to the baseline system
presented in Table I. We also plot the ideal TLB to show that our
flexible virtual segment system achieves near-ideal TLB performance.
We assumed that our second-level translation requires a 20-cycle
lookup, as calculated at the end of Section IV-C. We also show the
performance of flexible virtual segment with speculative
second-level translation, and finally we plot the performance of
direct segmentation. We also plot three geometric means. Geomean_low
is the geomean of workloads with less than 5% performance increase
with the 20-cycle lookup, Geomean_high is the geomean of workloads
with more than 5% performance increase with the 20-cycle lookup, and
Geomean is the geomean of all benchmarks. The figure shows that our
proposed architecture delivers performance increases.

Our flexible virtual segmentation structure is accessed only upon an
LLC miss, so the difference between our flexible virtual
segmentation architecture and direct segment depends on the number
of LLC misses. We used a modest 2MB LLC, which contributes some
performance drop compared to the ideal TLB. The larger LLCs of
systems in use today will reduce the gap between the ideal TLB and
the flexible virtual segmentation architecture. For benchmarks with
high LLC misses, the system can enable speculative translation to
improve performance.
[Fig. 9 data: per-benchmark breakdown of memory accesses (%) into Segment, Paging, and Stack regions]
[Fig. 10 data: normalized IPC (%) for flexible virtual segments with 20-cycle serial lookup, flexible virtual segments with parallel lookup, and direct segment]
Fig. 9. Breakdown of memory accesses
Fig. 10. Normalized performance of the flexible virtual segmentation system
Direct segment achieves an average (geometric mean over all
workloads) performance improvement of 6.44%, while our flexible
virtual segment architecture shows a 4.32% performance improvement
with serial lookup and 6.35% with parallel lookup.
VII. CONCLUSION

Memory has grown very rapidly in size throughout the past decade. On
the other hand, the fundamental design principle of virtual memory
has not changed. There have been additions of large page support
both in the architecture and the operating system; however, it is
our opinion that there is still room for more change.

We propose a flexible virtual segmentation system capable of
handling the large memory workloads of today. We exploit the
abundance of memory in the virtual memory address space, employ an
extended physical address space to be used by the microarchitecture,
and use aggressive segmentation to translate from the extended
physical address space to the system address space. As the rather
complex flexible virtual segmentation translation only occurs on
last-level cache misses, our system can improve overall performance.

REFERENCES

[1] "CACTI: An integrated cache and memory access time, cycle time, area, leakage, and dynamic power model," website. [Online]. Available: http://www.hpl.hp.com/research/cacti/
[2] "Cortex-A9 technical reference manual," website. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html
[3] "The Graph 500 list," website. [Online]. Available: http://www.graph500.org/
[4] "HPC graph analysis," website. [Online]. Available: http://www.graphanalysis.org/benchmark/
[5] Intel 64 and IA-32 Architectures Optimization Reference Manual, Jul. 2013.
[6] Intel Itanium Architecture Software Developer's Manual, Vol. 2, Rev. 2.3, Jul. 2013.
[7] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung, "BioBench: A benchmark suite of bioinformatics applications," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ser. ISPASS '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 2–9.
[8] T. W. Barr, A. L. Cox, and S. Rixner, "SpecTLB: A mechanism for speculative address translation," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 307–317.
[9] T. W. Barr, A. L. Cox, and S. Rixner, "Translation caching: Skip, don't walk (the page table)," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10, Jun. 2010, pp. 48–59.
[10] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient virtual memory for big memory servers," in Proceedings of the 40th Annual International Symposium on Computer Architecture, Jun. 2013, pp. 237–248.
[11] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared last-level TLBs for chip multiprocessors," in Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2011, pp. 62–63.
[12] A. Bhattacharjee, "Large-reach memory management unit caches," in MICRO '13. ACM, 2013, pp. 383–394.
[13] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, Jan. 2011.
[14] Z. Fang, L. Zhang, J. Carter, W. Hsieh, and S. McKee, "Reevaluating online superpage promotion with hardware support," in Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA), 2001, pp. 63–72.
[15] N. Ganapathy and C. Schimmel, "General purpose operating system support for multiple page sizes," in USENIX Annual Technical Conference, 1998.
[16] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, "Efficient memory virtualization," in Proceedings of the 47th Annual International Symposium on Microarchitecture, 2014.
[17] C. Gray, M. Chapman, P. Chubb, D. Mosberger-Tang, and G. Heiser, "Itanium: A system implementor's tale," in Proceedings of the Annual Conference on USENIX Annual Technical Conference, ser. ATEC '05. Berkeley, CA, USA: USENIX Association, 2005, pp. 31–31.
[18] J. L. Henning, "SPEC CPU2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006.
[19] B. Jacob and T. Mudge, "Virtual memory in contemporary microprocessors," IEEE Micro, vol. 18, no. 4, pp. 60–75, 1998.
[20] B. Jacob and T. Mudge, "Uniprocessor virtual memory without TLBs," IEEE Transactions on Computers, vol. 50, no. 5, pp. 482–499, 2001.
[21] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers, 2008.
[22] J. Leverich and C. Kozyrakis, "Reconciling high server utilization and sub-millisecond quality-of-service," in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys '14. New York, NY, USA: ACM, 2014, pp. 4:1–4:14.
[23] D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and C. Kozyrakis, "Towards energy proportionality for large-scale latency-critical workloads," in Proceedings of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 301–312.
[24] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa, "Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011, pp. 248–259.
[25] J. Navarro, S. Iyer, P. Druschel, and A. L. Cox, "Practical, transparent operating system support for superpages," in OSDI '02, 2002.
[26] A. Patel, F. Afram, S. Chen, and K. Ghose, "MARSSx86: A full system simulator for x86 CPUs," in Proceedings of the 48th Design Automation Conference, 2011.
[27] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, "Increasing TLB reach by exploiting clustering in page translations," in HPCA '14.
[28] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, "CoLT: Coalesced large-reach TLBs," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. IEEE Computer Society, Dec. 2012.
[29] S. Srikantaiah and M. Kandemir, "Synergistic TLBs for high performance address translation in chip multiprocessors," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2010, pp. 313–324.
[30] I. Subramanian, C. Mather, K. Peterson, and B. Raghunath, "Implementation of multiple pagesize support in HP-UX," in Proceedings of the Annual Conference on USENIX Annual Technical Conference, ser. ATEC '98. Berkeley, CA, USA: USENIX Association, 1998, pp. 9–9.
[31] M. R. Swanson, L. Stoller, and J. B. Carter, "Increasing TLB reach using superpages backed by shadow memory," in ISCA '98. IEEE Computer Society, 1998, pp. 204–213.
[32] M. Talluri, M. D. Hill, and Y. A. Khalidi, "A new page table for 64-bit address spaces," in SOSP '95. New York, NY, USA: ACM, Dec. 1995, pp. 184–200.
[33] M. Talluri and M. D. Hill, "Surpassing the TLB performance of superpages with less operating system support," in ASPLOS. New York, NY, USA: ACM Press, 1994, pp. 171–182.
[34] H. Yang, A. Breslow, J. Mars, and L. Tang, "Bubble-flux: Precise online QoS management for increased utilization in warehouse scale computers," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 607–618.
[35] L. Zhang, E. Speight, R. Rajamony, and J. Lin, "Enigma: Architectural and operating system support for reducing the impact of address translation," in Proceedings of the 24th ACM International Conference on Supercomputing, ser. ICS '10. New York, NY, USA: ACM, 2010, pp. 159–168.