The Processing Using Memory Paradigm:
In-DRAM Bulk Copy, Initialization, Bitwise AND and OR
Vivek Seshadri — Microsoft Research
Onur Mutlu — ETH Zürich
Abstract
In existing systems, the off-chip memory interface allows the memory controller to perform
only read or write operations. Therefore, to perform any operation, the processor must first
read the source data and then write the result back to memory after performing the operation.
This approach incurs high latency and consumes significant bandwidth and energy for operations that work on a
large amount of data. Several works have proposed techniques to process data near memory
by adding a small amount of compute logic closer to the main memory chips. In this article,
we describe two techniques proposed by recent works that take this approach of processing
in memory further by exploiting the underlying operation of the main memory technology
to perform more complex tasks. First, we describe RowClone, a mechanism that exploits
DRAM technology to perform bulk copy and initialization operations completely inside main
memory. We then describe a complementary work that uses DRAM to perform bulk bitwise
AND and OR operations inside main memory. These two techniques significantly improve
the performance and energy efficiency of the respective operations.
1 Introduction
In modern systems, the channel that connects the processor and off-chip main memory is a
critical bottleneck for both performance and energy-efficiency. First, the channel has limited
data bandwidth. Increasing the available bandwidth requires increasing the number of
channels, the width of each channel, or the channel frequency. All these approaches
significantly increase the cost of the system and are not scalable. Second, a significant
fraction of the energy consumed in performing an operation is spent on moving data over
the off-chip memory channel [30].
To address this problem, many prior and recent works [11, 12, 13, 15, 40, 42, 44, 45,
47, 48, 51, 52, 54, 56, 57, 79, 85, 95, 103, 105, 114, 115, 120, 125, 131, 143, 149] have proposed techniques to process data near memory, an approach widely referred to as Processing
in Memory or PiM. The idea behind PiM is to add a small amount of compute logic close
to the memory chips and use that logic to perform simple yet bandwidth-intensive and/or
latency-sensitive operations. The premise is that being close to the memory chips, the PiM
module will have much higher bandwidth and lower latency to memory than the regular
processor. Consequently, PiM can 1) perform bandwidth-intensive and latency-sensitive
operations faster and 2) reduce the off-chip memory bandwidth requirements of such operations. As a result, PiM significantly improves both overall system performance and energy
efficiency.
In this article, we focus our attention on two works that push the notion of processing in
memory deeper by exploiting the underlying operation of the main memory technology to
perform more complex tasks. We will refer to this approach as Processing using Memory or
PuM. Unlike PiM, which adds new logic structures near memory to perform computation,
the key idea behind PuM is to exploit some of the peripheral structures already existing
inside memory devices (with minimal changes) to perform other tasks.
The first work that we will discuss in this article is RowClone [115], a mechanism that
exploits DRAM technology to perform bulk data copy and initialization completely inside
DRAM. Such bulk copy and initialization operations are triggered by many applications (e.g.,
bulk zeroing) and system-level functions (e.g., page copy operations). Despite the fact that
these operations require no computation, existing system must necessarily read and write
the required data over the main memory channel. In fact, even with a high-speed memory
bus (DDR4-2133) a simple 4 KB copy operation can take close to half a micro second for
just the data transfers on the memory channel. By performing such operations completely
inside main memory, RowClone eliminates the need for any data transfer on the memory
channel, thereby significantly improving performance and energy-efficiency.
The second work that we will discuss in this article is a mechanism to perform bulk
bitwise AND and OR operations completely inside DRAM [114]. Bitwise operations are an
important component of modern day programming. Many applications (e.g., bitmap indices)
rely on bitwise operations on large bitvectors to achieve high performance. Similar to bulk
copy or initialization, the throughput of bulk bitwise operations in existing systems is also
limited by the available memory bandwidth. The In-DRAM AND-OR mechanism (IDAO)
avoids the need to transfer large amounts of data on the memory channel to perform these
operations. Similar to RowClone, IDAO enables an order of magnitude improvement in the
performance of bulk bitwise operations. We will describe these two works in detail in this
article.
In this article, we discuss the following:
• We motivate the need for reducing data movement and describe how processing near memory
helps in achieving that goal (Section 2). We briefly describe a set of recent works
that have pushed the idea of processing near memory deeper by using the underlying memory technologies (e.g., DRAM, STT-MRAM, PCM) to perform tasks more
complex than just storing data (Section 3).
• As the major focus of this article is on the PuM works that build on DRAM, we provide
a brief background on modern DRAM organization and operation that is sufficient to
understand the mechanisms (Section 4).
• We describe the two mechanisms, RowClone (in-DRAM bulk copy and initialization)
and In-DRAM-AND-OR (in-DRAM bulk bitwise AND and OR) in detail in Sections 5
and 6, respectively.
• We describe a number of applications for the two mechanisms and quantitative evaluations showing that they improve performance and energy-efficiency compared to
existing systems.
2 Processing in Memory
Data movement contributes a major fraction of the execution time and energy consumption
of many programs. The farther the data is from the processing engine (e.g., the CPU), the greater
the contribution of data movement to execution time and energy consumption. While
most programs aim to keep their active working set as close to the processing engine as
possible (say the L1 cache), for applications with working sets larger than the on-chip cache
size, the data typically resides in main memory.
Unfortunately, main memory latency is not scaling commensurately with the remaining
resources in the system, namely, the compute power and memory capacity. As a result,
the performance of most large-working-set applications is limited by main memory latency
and/or bandwidth. For instance, just transferring a single page (4 KB) of data from DRAM
can consume between a quarter and half a microsecond even with high-speed memory interfaces (DDR4-2133 [65]). During this time, the processor can potentially execute hundreds
to thousands of instructions. With respect to energy, while performing a 64-bit double-precision floating-point operation typically consumes a few tens of picojoules, accessing 64 bits
of data from off-chip DRAM consumes a few tens of nanojoules (three orders of magnitude more
energy) [30].
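As a rough sanity check on these numbers, the following sketch estimates the raw transfer time of a single 4 KB page over one DDR4-2133 channel, assuming an 8-byte-wide data bus and peak bus utilization (both assumptions are illustrative; real transfers also pay command and protocol overheads):

    #include <stdio.h>

    int main(void) {
        /* DDR4-2133: 2133 mega-transfers/s on a 64-bit (8-byte) data bus (assumed). */
        double peak_bw = 2133e6 * 8.0;                /* ~17 GB/s */
        double page_bytes = 4096.0;

        double t = page_bytes / peak_bw;              /* raw data-transfer time */
        printf("peak bandwidth: %.1f GB/s\n", peak_bw / 1e9);
        printf("4 KB transfer:  %.0f ns\n", t * 1e9); /* ~240 ns; a copy needs a read
                                                         and a write, roughly doubling it */
        return 0;
    }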
One of the solutions to address this problem is to add support to process data closer
to memory, especially for operations that access large amounts of data. This approach is
generally referred to as Processing in Memory (PiM) or Near Data Processing. The high-level idea behind PiM is to add a small piece of compute logic closer to memory that has
much higher bandwidth to memory than the main processor. Prior research has proposed
two broad ways of implementing PiM: 1) Integrating processing logic into the memory chips,
and 2) using 3D-stacked memory architectures.
2.1 Integrating Processing Logic in Memory
Many works (e.g., Logic-in-Memory Computer [125], NON-VON Database Machine [120],
EXECUBE [79], Terasys [47], Intelligent RAM [105], Active Pages [103], FlexRAM [43, 68],
Computational RAM [40], DIVA [35] ) have proposed mechanisms and models to add processing logic close to memory. The idea is to integrate memory and CPU on the same
chip by designing the CPU using the memory process technology. The reduced data movement allows these approaches to enable low-latency, high-bandwidth, and low-energy data
communication. However, they suffer from two key shortcomings.
First, this approach of integrating a processor on the same chip as memory greatly increases
the overall cost of the system. Second, DRAM vendors use a high-density process to minimize
cost-per-bit. Unfortunately, the high-density DRAM process is not suitable for building high-speed logic [105]. As a result, this approach is not suitable for building a general-purpose
processor near memory, at least with modern logic and high-density DRAM technologies.
2.2 3D-Stacked DRAM Architectures
Some recent DRAM architectures [3, 62, 85, 92] use 3D-stacking technology to stack multiple
DRAM chips on top of the processor chip or a separate logic layer. These architectures offer
much higher bandwidth to the logic layer compared to traditional off-chip interfaces. This
enables an opportunity to offload some computation to the logic layer, thereby improving
performance. In fact, many recent works have proposed mechanisms to improve and exploit
such architectures (e.g., [11, 12, 13, 15, 17, 42, 44, 45, 48, 54, 56, 57, 85, 95, 104, 131, 143,
149]). 3D-stacking enables much higher bandwidth between the logic layer and the memory
chips, compared to traditional architectures. However, 3D-stacked architectures still require
data to be transferred outside the DRAM chip, and hence can be bandwidth-limited. In
addition, thermal factors constrain the number of chips that can be stacked, thereby limiting
the memory capacity. As a result, multiple 3D-stacked DRAMs are required to scale to large
workloads. Despite these limitations, this approach seems to be the most viable way of
implementing processing in memory in modern systems.
3 Processing Using Memory
In this article, we introduce a new class of work that pushes the idea of PiM further by
exploiting the underlying memory operation to perform more complex operations than just
data storage. We refer to this class of works as Processing using Memory (PuM).
Reducing cost-per-bit is a first-order design constraint for most memory technologies. As
a result, the memory cells are small. Therefore, most memory devices use a significant amount
of sensing and peripheral logic to extract data from the memory cells. The key idea behind
PuM is to use these logic structures and their operation to perform some additional tasks.
It is clear that unlike PiM, which can potentially be designed to perform any task, PuM
can only enable some limited functionality. However, for tasks that can be performed by
PuM, PuM has two advantages over PiM. First, as PuM exploits the underlying operation
of memory, it incurs much lower cost than PiM. Second, unlike PiM, PuM does not have to
read any data out of the memory chips. As a result, the PuM approach is possibly the most
energy efficient way of performing the respective operations.
Building on top of DRAM, which is the technology ubiquitously used to build main
memory in modern systems, two recent works take the PuM approach to accelerate certain
important primitives: 1) RowClone [115], which performs bulk copy and initialization operations completely inside DRAM, and 2) IDAO [114], which performs bulk bitwise AND/OR
operations completely inside DRAM. Both these works exploit the operation of the DRAM
sense amplifier and the internal organization of DRAM to perform the respective operations.
We will discuss these two works in significant detail in this article.
Similar to these works, there are others that build on various other memory technologies.
Pinatubo [89] exploits phase change memory (PCM) [81, 82, 83, 107, 109, 138, 148] architecture to perform bitwise operations efficiently inside PCM. Pinatubo enhances the PCM
sense amplifiers to sense fine-grained differences in resistance and uses this capability to perform bitwise
operations on multiple cells connected to the same sense amplifier. As we will describe in
this article, bitwise operations are critical for many important data structures like bitmap
indices. Kang et al. [67] propose a mechanism to exploit SRAM architecture to accelerate the primitive “sum of absolute differences”. ISAAC [119] is a mechanism to accelerate
vector dot product operations using a memristor array. ISAAC uses the crossbar structure
of a memristor array and its analog operation to efficiently perform dot products. These
operations are heavily used in many important applications including deep neural networks.
In the subsequent sections, we will focus our attention on RowClone and IDAO. We
will first provide the necessary background on DRAM design and then describe how these
mechanisms work.
4 Background on DRAM
In this section, we describe the necessary background on modern DRAM architecture and
its implementation. While we focus our attention primarily on commodity DRAM design
(i.e., the DDRx interface), most DRAM architectures use very similar design approaches
and vary only in higher-level design choices. As a result, the mechanisms we describe in the
subsequent sections can be extended to any DRAM architecture. There has been significant
recent research in DRAM architectures and the interested reader can find details about
various aspects of DRAM in multiple recent publications [21, 24, 53, 70, 71, 77, 78, 86, 87,
90, 91, 106, 116].
4.1 High-level Organization of the Memory System
Figure 1 shows the organization of the memory subsystem in a modern system. At a high
level, each processor chip consists of one or more off-chip memory channels. Each memory
channel consists of its own set of command, address, and data buses. Depending on the design
of the processor, there can be either an independent memory controller for each memory
channel or a single memory controller for all memory channels. All modules connected to a
channel share the buses of the channel. Each module consists of many DRAM devices (or
chips). Most of this section is dedicated to describing the design of a modern DRAM chip.
In Section 4.3, we present more details of the module organization of commodity DRAM.
Figure 1: High-level organization of the memory subsystem
4.2 DRAM Chip
A modern DRAM chip consists of a hierarchy of structures: DRAM cells, tiles/MATs,
subarrays, and banks. In this section, we describe the design of a modern DRAM chip
in a bottom-up fashion, starting from a single DRAM cell and its operation.
4.2.1 DRAM Cell and Sense Amplifier
At the lowest level, DRAM technology uses capacitors to store information. Specifically, it
uses the two extreme states of a capacitor, namely, the empty and the fully charged states
to store a single bit of information. For instance, an empty capacitor can denote a logical
value of 0, and a fully charged capacitor can denote a logical value of 1. Figure 2 shows the
two extreme states of a capacitor.
Figure 2: Two states of a DRAM cell
Unfortunately, the capacitors used for DRAM chips are small, and will get smaller with
each new generation. As a result, the amount of charge that can be stored in the capacitor,
and hence the difference between the two states is also very small. In addition, the capacitor
can potentially lose its state after it is accessed. Therefore, to extract the state of the
capacitor, DRAM manufacturers use a component called a sense amplifier.
Figure 3 shows a sense amplifier. A sense amplifier contains two inverters which are
connected together such that the output of one inverter is connected to the input of the
other and vice versa. The sense amplifier also has an enable signal that determines if the
inverters are active. When enabled, the sense amplifier has two stable states, as shown in
Figure 4. In both these stable states, each inverter takes a logical value and feeds the other
inverter with the negated input.
Figure 3: Sense amplifier
Figure 4: Stable states of a sense amplifier

Figure 5 shows the operation of the sense amplifier from a disabled state. In the initial
disabled state, we assume that the voltage level of the top terminal (Va) is higher than that
of the bottom terminal (Vb). When the sense amplifier is enabled in this state, it senses the
difference between the two terminals and amplifies the difference until it reaches one of the
stable states (hence the name “sense amplifier”).
Figure 5: Operation of the sense amplifier
4.2.2 DRAM Cell Operation: The ACTIVATE-PRECHARGE cycle
DRAM technology uses a simple mechanism that converts the logical state of a capacitor
into a logical state of the sense amplifier. Data can then be accessed from the sense amplifier
(since it is in a stable state). Figure 6 shows the connection between a DRAM cell and the
sense amplifier and the sequence of states involved in converting the cell state to the sense
amplifier state.
As shown in the figure (state 1), the capacitor is connected to an access transistor that
acts as a switch between the capacitor and the sense amplifier. The transistor is controlled
by a wire called the wordline. The wire that connects the transistor to the top end of the sense
amplifier is called the bitline. In the initial state 1, the wordline is lowered, the sense amplifier is
disabled, and both ends of the sense amplifier are maintained at a voltage level of ½VDD. We
assume that the capacitor is initially fully charged (the operation is similar if the capacitor
was empty). This state is referred to as the precharged state. An access to the cell is
triggered by a command called ACTIVATE. Upon receiving an ACTIVATE, the corresponding
wordline is first raised (state 2). This connects the capacitor to the bitline. In the ensuing
phase called charge sharing (state 3), charge flows from the capacitor to the bitline, raising
the voltage level on the bitline (the top end of the sense amplifier) to ½VDD + δ. After charge
sharing, the sense amplifier is enabled (state 4). The sense amplifier detects the difference
in voltage levels between its two ends and amplifies the deviation until it reaches the stable
state where the top end is at VDD (state 5). Since the capacitor is still connected to the
bitline, the charge on the capacitor is also fully restored. We shortly describe how the data
can be accessed from the sense amplifier. However, once the access to the cell is complete,
the cell is taken back to the original precharged state using the command called PRECHARGE.
Upon receiving a PRECHARGE, the wordline is first lowered, thereby disconnecting the cell
from the sense amplifier. Then, the two ends of the sense amplifier are driven to ½VDD using
a precharge unit (not shown in the figure for brevity).

Figure 6: Operation of a DRAM cell and sense amplifier
4.2.3 DRAM MAT/Tile: The Open Bitline Architecture
A major goal of DRAM manufacturers is to maximize the density of the DRAM chips while
adhering to certain latency constraints (described in Section 4.2.6). There are two costly
components in the setup described in the previous section. The first component is the sense
amplifier itself. Each sense amplifier is around two orders of magnitude larger than a single
DRAM cell [87, 108]. Second, the state of the wordline is a function of the address that is
currently being accessed. The logic that is necessary to implement this function (for each
cell) is expensive.
In order to reduce the overall cost of these two components, they are shared by many
DRAM cells. Specifically, each sense amplifier is shared by a column of DRAM cells. In
other words, all the cells in a single column are connected to the same bitline. Similarly,
each wordline is shared by a row of DRAM cells. Together, this organization consists of a
2-D array of DRAM cells connected to a row of sense amplifiers and a column of wordline
drivers. Figure 7 shows this organization with a 4 × 4 2-D array.
Figure 7: A 2-D array of DRAM cells
To further reduce the overall cost of the sense amplifiers and the wordline driver, modern
DRAM chips use an architecture called the open bitline architecture. This architecture
exploits two observations. First, the sense amplifier is wider than the DRAM cells. This
difference in width results in a white space near each column of cells. Second, the sense
amplifier is symmetric. Therefore, cells can also be connected to the bottom part of the
sense amplifier. Putting together these two observations, we can pack twice as many cells in
the same area using the open bitline architecture, as shown in Figure 8.
Figure 8: A DRAM MAT/Tile: The open bitline architecture
As shown in the figure, a 2-D array of DRAM cells is connected to two rows of sense
amplifiers: one on the top and one on the bottom of the array. While all the cells in a given
row share a common wordline, half the cells in each row are connected to the top row of
sense amplifiers and the remaining half of the cells are connected to the bottom row of sense
amplifiers. This tightly packed structure is called a DRAM MAT/Tile [77, 132, 144]. In a
modern DRAM chip, each MAT typically is a 512×512 or 1024×1024 array. Multiple MATs
are grouped together to form a larger structure called a DRAM bank, which we describe next.
4.2.4 DRAM Bank
In most modern commodity DRAM interfaces [64, 65], a DRAM bank is the smallest structure visible to the memory controller. All commands related to data access are directed to a
specific bank. Logically, each DRAM bank is a large monolithic structure with a 2-D array
of DRAM cells connected to a single set of sense amplifiers (also referred to as a row buffer).
For example, in a 2Gb DRAM chip with 8 banks, each bank has 2^15 rows and each logical
row has 8192 DRAM cells. Figure 9 shows this logical view of a bank.
Figure 9: DRAM Bank: Logical view
In addition to the MAT, the array of sense amplifiers, and the wordline driver, each bank
also consists of some peripheral structures to decode DRAM commands and addresses, and
manage the input/output to the DRAM bank. Specifically, each bank has a row decoder to
decode the row address of row-level commands (e.g., ACTIVATE). Each data access command
(READ and WRITE) accesses only a part of a DRAM row. Such individual parts are referred
to as columns. With each data access command, the address of the column to be accessed is
provided. This address is decoded by the column selection logic. Depending on which column
is selected, the corresponding piece of data is communicated between the sense amplifiers
and the bank I/O logic. The bank I/O logic in turn acts as an interface between the DRAM
bank and the chip-level I/O logic.
Although the bank can logically be viewed as a single MAT, building a single MAT of
a very large dimension is practically not feasible as it will require very long bitlines and
wordlines. Therefore, each bank is physically implemented as a 2-D array of DRAM MATs.
Figure 10 shows a physical implementation of the DRAM bank with 4 MATs arranged in
a 2 × 2 array. As shown in the figure, the output of the global row decoder is sent to each row
of MATs. The bank I/O logic, also known as the global sense amplifiers, is connected to all
the MATs through a set of global bitlines. Each vertical collection of
MATs has its own column selection logic and global bitlines. In a real DRAM chip,
the global bitlines run on top of the MATs in a separate metal layer. One implication of this
division is that the data accessed by any command is split equally across all the MATs in a
single row of MATs.
Figure 10: DRAM Bank: Physical implementation. In a real chip, the global bitlines run on top of the MATs in a separate metal layer. (Components in the figure are not to scale.)
Figure 11 shows the zoomed-in version of a DRAM MAT with the surrounding peripheral
logic. Specifically, the figure shows how each column selection line selects specific sense
amplifiers from a MAT and connects them to the global bitlines. It should be noted that
the number of global bitlines per MAT (typically 8 or 16) is much smaller than the width of
the MAT (typically 512 or 1024). This is because the global bitlines span a much
longer distance and hence have to be thicker to ensure integrity.
Figure 11: Detailed view of a MAT
Figure 12: DRAM Chip

Each DRAM chip consists of multiple banks, as shown in Figure 12. All the banks share
the chip's internal command, address, and data buses. As mentioned before, each bank
operates mostly independently (except for operations that involve the shared buses). The
chip I/O manages the transfer of data between the chip's internal bus and the memory
channel. The width of the chip output (typically 8 bits) is much smaller than the output
width of each bank (typically 64 bits). Any piece of data accessed from a DRAM bank is
first buffered at the chip I/O and sent out on the memory bus 8 bits at a time. With the
DDR (double data rate) technology, 8 bits are sent out each half cycle. Therefore, it takes
4 cycles to transfer 64 bits of data from a DRAM chip I/O onto the memory channel.
4.2.5 DRAM Commands: Accessing Data from a DRAM Chip
To access a piece of data from a DRAM chip, the memory controller must first identify the
location of the data: the bank ID (B), the row address (R) within the bank, and the column
address (C) within the row. After identifying these pieces of information, accessing the data
involves three steps.
The first step is to issue a PRECHARGE to the bank B. This step prepares the bank for a
data access by ensuring that all the sense amplifiers are in the precharged state (Figure 6,
state 1). No wordline within the bank is raised in this state.
The second step is to activate the row R that contains the data. This step is triggered
by issuing an ACTIVATE to bank B with row address R. Upon receiving this command, the
corresponding bank feeds its global row decoder with the input R. The global row decoder
logic then raises the wordline of the DRAM row corresponding to the address R and enables
the sense amplifiers connected to that row. This triggers the DRAM cell operation described
in Section 4.2.2. At the end of the activate operation, the data from the entire row of DRAM
cells is copied to the corresponding array of sense amplifiers.
Finally, the third step is to access the data from the required column. This is done by
issuing a READ or WRITE command to the bank with the column address C. Upon receiving a
READ or WRITE command, the corresponding address is fed to the column selection logic. The
column selection logic then raises the column selection lines (Figure 11) corresponding to
address C, thereby connecting those sense amplifiers to the global sense amplifiers through
the global bitlines. For a read access, the global sense amplifiers sense the data from the
MAT’s local sense amplifiers and transfer that data to the chip’s internal bus. For a write
access, the global sense amplifiers read the data from the chip’s internal bus and force the
MAT’s local sense amplifiers to the appropriate state.
Not all data accesses require all three steps. Specifically, if the row to be accessed is
already activated in the corresponding bank, then the first two steps can be skipped and
the data can be directly accessed by issuing a READ or WRITE to the bank. For this reason,
the array of sense amplifiers are also referred to as a row buffer, and such an access that
skips the first two steps is called a row buffer hit. Similarly, if the bank is already in the
precharged state, then the first step can be skipped. Such an access is referred to as a row
buffer miss. Finally, if a different row is activated within the bank, then all three steps have
to be performed. Such an access is referred to as a row buffer conflict.
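The controller's decision can be summarized in code. The sketch below is only illustrative: the bank-state bookkeeping and the printed command names stand in for whatever a real memory controller and DRAM interface would use.

    #include <stdio.h>

    typedef enum { PRECHARGED, ROW_OPEN } bank_state_t;

    typedef struct {
        bank_state_t state;
        int open_row;               /* valid only when state == ROW_OPEN */
    } bank_t;

    /* Issue the command sequence needed to read column `col` of row `row`. */
    void access(bank_t *bank, int row, int col) {
        if (bank->state == ROW_OPEN && bank->open_row == row) {
            printf("READ col %d\n", col);                     /* row buffer hit */
        } else if (bank->state == PRECHARGED) {
            printf("ACTIVATE row %d\n", row);                 /* row buffer miss */
            printf("READ col %d\n", col);
        } else {
            printf("PRECHARGE\n");                            /* row buffer conflict */
            printf("ACTIVATE row %d\n", row);
            printf("READ col %d\n", col);
        }
        bank->state = ROW_OPEN;
        bank->open_row = row;
    }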
4.2.6 DRAM Timing Constraints
Different operations within DRAM consume different amounts of time. Therefore, after
issuing a command, the memory controller must wait for a sufficient amount of time before
it can issue the next command. Such wait times are managed by what are called the timing
constraints. Timing constraints essentially dictate the minimum amount of time between
two commands issued to the same bank/rank/channel. Table 1 describes some key timing
constraints along with their values for the DDR3-1600 interface.
4.3 DRAM Module
As mentioned before, each READ or WRITE command for a single DRAM chip typically involves
only 64 bits. In order to achieve high memory bandwidth, commodity DRAM modules group
several DRAM chips (typically 4 or 8) together to form a rank of DRAM chips. The idea is
to connect all chips of a single rank to the same command and address buses, while providing
each chip with an independent data bus. In effect, all the chips within a rank receive the
same commands with the same addresses, making the rank a logically wide DRAM chip.
Name   Constraint              Description                                                                                                    Value (ns)
tRAS   ACTIVATE → PRECHARGE    Time taken to complete a row activation operation in a bank                                                   35
tRCD   ACTIVATE → READ/WRITE   Time between an activate command and a column command to a bank                                               15
tRP    PRECHARGE → ACTIVATE    Time taken to complete a precharge operation in a bank                                                        15
tWR    WRITE → PRECHARGE       Time taken to ensure that data is safely written to the DRAM cells after a write operation (write recovery)   15

Table 1: Key DRAM timing constraints with their values for DDR3-1600
Figure 13 shows the logical organization of a DRAM rank. Most commodity DRAM
ranks consist of 8 chips. Therefore, each READ or WRITE command accesses 64 bytes of data,
the typical cache line size in most processors.
Figure 13: Organization of a DRAM rank
5 RowClone
In this section, we present RowClone [115], a mechanism that can perform bulk copy and initialization operations completely inside DRAM. This approach obviates the need to transfer
large quantities of data on the memory channel, thereby significantly improving the efficiency
of a bulk copy operation. As bulk data initialization (specifically bulk zeroing) can be viewed
as a special case of a bulk copy operation, RowClone can be easily extended to perform such
bulk initialization operations with high efficiency.
RowClone consists of two independent mechanisms that exploit several observations
about DRAM organization and operation. The first mechanism, called the Fast Parallel
Mode (FPM), efficiently copies data between two rows of DRAM cells that share the same
set of sense amplifiers (i.e., two rows within the same subarray). The second mechanism,
called the Pipelined Serial Mode (PSM), efficiently copies cache lines between two banks within a
module in a pipelined manner. Although not as fast as FPM, PSM has fewer constraints
and hence is more generally applicable. We now describe these two mechanisms in detail.
5.1 Fast Parallel Mode
The Fast Parallel Mode (FPM) is based on the following three observations about DRAM.
1. In a commodity DRAM module, each ACTIVATE command transfers data from a large
number of DRAM cells (multiple kilobytes) to the corresponding array of sense amplifiers (Section 4.3).
2. Several rows of DRAM cells share the same set of sense amplifiers (Section 4.2.3).
3. A DRAM cell is not strong enough to flip the state of the sense amplifier from one
stable state to another stable state. In other words, if a cell is connected to an already
activated sense amplifier (or bitline), then the data of the cell gets overwritten with
the data on the sense amplifier.
While the first two observations are direct implications from the design of commodity
DRAM, the third observation exploits the fact that DRAM cells can cause only a small
perturbation on the bitline voltage. Figure 14 pictorially shows how this observation can be
used to copy data between two cells that share a sense amplifier.
Figure 14: RowClone: Fast Parallel Mode
The figure shows two cells (src and dst) connected to a single sense amplifier. In the
initial state, we assume that src is fully charged and dst is fully empty, and the sense
amplifier is in the precharged state (state 1). In this state, FPM issues an ACTIVATE to src. At
the end of the activation operation, the sense amplifier moves to a stable state where the
bitline is at a voltage level of VDD and the charge in src is fully restored (state 2). FPM follows
this operation with an ACTIVATE to dst, without an intervening PRECHARGE. This operation
lowers the wordline corresponding to src and raises the wordline of dst, connecting dst to
the bitline. Since the bitline is already fully activated, even though dst is initially empty,
the perturbation caused by the cell is not sufficient to flip the state of the bitline. As a
result, the sense amplifier continues to drive the bitline to VDD, thereby pushing dst to a
fully charged state (state 3).
It can be shown that regardless of the initial state of src and dst, the above operation
copies the data from src to dst. Given that each ACTIVATE operates on an entire row
of DRAM cells, the above operation can copy multiple kilobytes of data with just two
back-to-back ACTIVATE operations.
Unfortunately, modern DRAM chips do not allow another ACTIVATE to an already activated bank – the expected result of such an action is undefined. This is because a modern
DRAM chip allows at most one row (subarray) within each bank to be activated. If a bank
that already has a row (subarray) activated receives an ACTIVATE to a different subarray,
the currently activated subarray must first be precharged [77]. Some DRAM manufacturers
design their chips to drop back-to-back ACTIVATEs to the same bank.
To support FPM, RowClone changes the way a DRAM chip handles back-to-back ACTIVATEs
to the same bank. When an already activated bank receives an ACTIVATE to a row, the chip
allows the command to proceed if and only if the command is to a row that belongs to the
currently activated subarray. If the row does not belong to the currently activated subarray,
then the chip takes the action it normally does with back-to-back ACTIVATEs—e.g., drop it.
Since the logic to determine the subarray corresponding to a row address is already present
in today’s chips, implementing FPM only requires a comparison to check if the row address
of an ACTIVATE belongs to the currently activated subarray, the cost of which is almost
negligible.
Summary. To copy data from src to dst within the same subarray, FPM first issues
an ACTIVATE to src. This copies the data from src to the subarray row buffer. FPM then
issues an ACTIVATE to dst. This modifies the input to the subarray row-decoder from src to
dst and connects the cells of dst row to the row buffer. This, in effect, copies the data from
the sense amplifiers to the destination row. With these two steps, FPM can copy a 4KB
page of data 12.0x faster and with 74.4x less energy than an existing system (we describe
the methodology in Section 8.1).
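Expressed as DRAM commands, an FPM copy is just two back-to-back activations followed by a precharge. The sketch below shows the sequence a memory controller would issue; the issue() helper and its command encoding are hypothetical placeholders rather than a real controller interface.

    #include <stdio.h>

    typedef enum { CMD_ACTIVATE, CMD_PRECHARGE } dram_cmd_t;

    /* Hypothetical hook that enqueues one command to a bank's command queue. */
    static void issue(int bank, dram_cmd_t cmd, int row) {
        printf("bank %d: %s %d\n", bank,
               cmd == CMD_ACTIVATE ? "ACTIVATE" : "PRECHARGE", row);
    }

    /* RowClone-FPM: copy an entire row to another row of the same subarray. */
    void rowclone_fpm_copy(int bank, int src_row, int dst_row) {
        issue(bank, CMD_ACTIVATE, src_row);   /* src row -> sense amplifiers (row buffer) */
        issue(bank, CMD_ACTIVATE, dst_row);   /* row buffer -> dst row, no PRECHARGE in between */
        issue(bank, CMD_PRECHARGE, 0);        /* return the bank to the precharged state */
    }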
Limitations. FPM has two constraints that limit its general applicability. First, it
requires the source and destination rows to be within the same subarray (i.e., share the same
set of sense amplifiers). Second, it cannot partially copy data from one row to another.
Despite these limitations, FPM can be immediately applied to today’s systems to accelerate
two commonly used primitives in modern systems – Copy-on-Write and Bulk Zeroing. In
the following section, we describe the second mode of RowClone – the Pipelined Serial Mode
(PSM). Although not as fast or energy-efficient as FPM, PSM addresses these two limitations
of FPM.
5.2 Pipelined Serial Mode
The Pipelined Serial Mode efficiently copies data from a source row in one bank to a destination row in a different bank. PSM exploits the fact that a single internal bus, shared
across all the banks, is used for both read and write operations. This enables the opportunity
to copy an arbitrary quantity of data one cache line at a time from one bank to another in
a pipelined manner.
To copy data from a source row in one bank to a destination row in a different bank, PSM
first activates the corresponding rows in both banks. It then puts the source bank into read
mode, the destination bank into write mode, and transfers data one cache line (corresponding
to a column of data—64 bytes) at a time. For this purpose, RowClone introduces a new
DRAM command called TRANSFER. The TRANSFER command takes four parameters: 1) source
bank index, 2) source column index, 3) destination bank index, and 4) destination column
index. It copies the cache line corresponding to the source column index in the activated
row of the source bank to the cache line corresponding to the destination column index in
the activated row of the destination bank.
Unlike READ/WRITE, which interact with the memory channel connecting the processor
and main memory, TRANSFER does not transfer data outside the chip. Figure 15 pictorially
compares the operation of the TRANSFER command with that of READ and WRITE. The dashed
lines indicate the data flow corresponding to the three commands. As shown in the figure,
in contrast to the READ or WRITE commands, TRANSFER does not transfer data from or to the
memory channel.
Figure 15: RowClone: Pipelined Serial Mode
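A full bank-to-bank copy with PSM is then a loop of TRANSFER commands, one per cache line of the row. The sketch below assumes a 64-byte cache line and an 8 KB row, and again uses a hypothetical command-issue hook; both rows are assumed to be already activated.

    #include <stdio.h>

    #define CACHE_LINE_BYTES 64
    #define ROW_BYTES        8192   /* assumed row size; real values vary by module */

    /* Hypothetical hook that enqueues one TRANSFER command. */
    static void issue_transfer(int src_bank, int src_col, int dst_bank, int dst_col) {
        printf("TRANSFER bank %d col %d -> bank %d col %d\n",
               src_bank, src_col, dst_bank, dst_col);
    }

    /* RowClone-PSM: copy the activated row of src_bank to the activated row of
     * dst_bank, one cache line at a time, without touching the memory channel. */
    void rowclone_psm_copy(int src_bank, int dst_bank) {
        for (int col = 0; col < ROW_BYTES / CACHE_LINE_BYTES; col++)
            issue_transfer(src_bank, col, dst_bank, col);
    }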
5.3 Mechanism for Bulk Data Copy
When the data from a source row (src) needs to be copied to a destination row (dst), there
are three possible cases depending on the location of src and dst: 1) src and dst are within
the same subarray, 2) src and dst are in different banks, 3) src and dst are in different
subarrays within the same bank. For case 1 and case 2, RowClone uses FPM and PSM,
respectively, to complete the operation (as described in Sections 5.1 and 5.2).
For the third case, when src and dst are in different subarrays within the same bank, one
can imagine a mechanism that uses the global bitlines (shared across all subarrays within a
bank – described in [77]) to copy data across the two rows in different subarrays. However,
RowClone does not employ such a mechanism for two reasons. First, it is not possible in
today’s DRAM chips to activate multiple subarrays within the same bank simultaneously.
Second, even if we enable simultaneous activation of multiple subarrays, as in [77], transferring data from one row buffer to another using the global bitlines requires the bank I/O
circuitry to switch between read and write modes for each cache line transfer. This switching
incurs significant latency overhead. To keep the design simple, for such an intra-bank copy
operation, RowClone uses PSM to first copy the data from src to a temporary row (tmp) in
a different bank. It then uses PSM again to copy the data from tmp to dst. The capacity
lost due to reserving one row within each bank is negligible (0.0015% for a bank with 64k
rows).
Despite its location constraints, FPM can be used to accelerate Copy-on-Write (CoW),
an important primitive in modern systems. CoW is used by most modern operating systems
(OS) to postpone an expensive copy operation until it is actually needed. When data of
one virtual page needs to be copied to another, instead of creating a copy, the OS points
both virtual pages to the same physical page (source) and marks the page as read-only. In
the future, when one of the sharers attempts to write to the page, the OS allocates a new
physical page (destination) for the writer and copies the contents of the source page to the
newly allocated page. Fortunately, prior to allocating the destination page, the OS already
knows the location of the source physical page. Therefore, it can ensure that the destination
is allocated in the same subarray as the source, thereby enabling the processor to use FPM
to perform the copy.
5.4 Mechanism for Bulk Data Initialization
Bulk data initialization sets a large block of memory to a specific value. To perform this
operation efficiently, RowClone first initializes a single DRAM row with the corresponding
value. It then uses the appropriate copy mechanism (from Section 5.3) to copy the data to
the other rows to be initialized.
Bulk Zeroing (or BuZ), a special case of bulk initialization, is a frequently occurring
operation in today’s systems [66, 140]. To accelerate BuZ, one can reserve one row in each
subarray that is always initialized to zero. By doing so, RowClone can use FPM to efficiently
BuZ any row in DRAM by copying data from the reserved zero row of the corresponding
subarray into the destination row. The capacity loss of reserving one row out of 512 rows in
each subarray is very modest (0.2%).
While the reserved rows can potentially lead to gaps in the physical address space, we can
use an appropriate memory interleaving technique that maps consecutive rows to different
subarrays. Such a technique ensures that the reserved zero rows are contiguously located
in the physical address space. Note that interleaving techniques commonly used in today’s
systems (e.g., row or cache line interleaving) have this property.
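With such a reserved zero row, bulk zeroing reduces to an FPM copy whose source is the zero row of the destination's subarray. A minimal sketch, reusing the rowclone_fpm_copy() helper from the earlier FPM sketch (the 512-row subarray size and the convention of reserving the first row are illustrative):

    #define ROWS_PER_SUBARRAY 512
    #define ZERO_ROW_OFFSET   0     /* assume the first row of every subarray is the reserved zero row */

    void rowclone_fpm_copy(int bank, int src_row, int dst_row);  /* from the earlier sketch */

    /* Zero an entire DRAM row by copying the reserved zero row of its subarray into it. */
    void rowclone_bulk_zero(int bank, int dst_row) {
        int subarray_base = (dst_row / ROWS_PER_SUBARRAY) * ROWS_PER_SUBARRAY;
        rowclone_fpm_copy(bank, subarray_base + ZERO_ROW_OFFSET, dst_row);
    }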
6 In-DRAM Bulk AND and OR
In this section, we describe In-DRAM AND/OR (IDAO), which is a mechanism to perform
bulk bitwise AND and OR operations completely inside DRAM. In addition to simple masking and initialization tasks, these operations are useful in important data structures like
bitmap indices. For example, bitmap indices [19, 102] can be more efficient than commonly used B-trees for performing range queries and joins in databases [2, 19, 139]. In fact, bitmap
indices are supported by many real-world database implementations (e.g., Redis [6], Fastbit [2]). Improving the throughput of bitwise AND and OR operations can boost the performance of such bitmap indices.
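To make the use case concrete: in a bitmap index, each attribute is a bitvector with one bit per record, and a query that combines two attributes is a bulk bitwise operation over the entire bitvectors. The loop below shows the software view of such a query; for large bitvectors its throughput is bounded by memory bandwidth, which is exactly the part IDAO moves into DRAM.

    #include <stdint.h>
    #include <stddef.h>

    /* result[i] gets a 1-bit for every record that satisfies both predicates. */
    void bitmap_and(const uint64_t *pred_a, const uint64_t *pred_b,
                    uint64_t *result, size_t n_words) {
        for (size_t i = 0; i < n_words; i++)
            result[i] = pred_a[i] & pred_b[i];
    }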
6.1 Mechanism
As described in Section 4.2.2, when a DRAM cell is connected to a bitline precharged to
½VDD, the cell induces a deviation on the bitline, and the deviation is amplified by the sense
amplifier. IDAO exploits the following fact about DRAM cell operation.
The final state of the bitline after amplification is determined solely by the deviation on the bitline after the charge sharing phase (after state 3 in Figure 6).
If the deviation is positive (i.e., towards VDD), the bitline is amplified to VDD.
Otherwise, if the deviation is negative (i.e., towards 0), the bitline is amplified to 0.
6.1.1 Triple-Row Activation
IDAO simultaneously connects three cells as opposed to a single cell to a sense amplifier.
When three cells are connected to the bitline, the deviation of the bitline after charge sharing
is determined by the majority value of the three cells. Specifically, if at least two cells are
initially in the charged state, the effective voltage level of the three cells is at least ⅔VDD.
This results in a positive deviation on the bitline. On the other hand, if at most one cell is
initially in the charged state, the effective voltage level of the three cells is at most ⅓VDD.
This results in a negative deviation on the bitline voltage. As a result, the final state of the
bitline is determined by the logical majority value of the three cells.
Figure 16 shows an example of activating three cells simultaneously. In the figure, we
assume that two of the three cells are initially in the charged state and the third cell is in
the empty state (state 1). When the wordlines of all the three cells are raised simultaneously (state 2),
charge sharing results in a positive deviation on the bitline. Hence, after sense amplification,
the sense amplifier drives the bitline to VDD and fully charges all three cells (state 3).
Figure 16: Triple-row activation

More generally, if the capacitance of each cell is Cc, the capacitance of the bitline is Cb, and k
of the three cells are initially in the charged state, then based on charge-sharing principles [69],
the deviation δ in the bitline voltage level is given by

    δ = (k·Cc·VDD + ½·Cb·VDD) / (3Cc + Cb) − ½VDD
      = ((2k − 3)·Cc / (6Cc + 2Cb)) · VDD                                (1)
From the above equation, it is clear that δ is positive for k = 2, 3, and δ is negative for
k = 0, 1. Therefore, after amplification, the final voltage level on the bitline is VDD for
k = 2, 3 and 0 for k = 0, 1.
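As a quick check of Equation 1, the snippet below evaluates δ for k = 0..3 using representative (purely illustrative) capacitance values; only the sign of δ matters for the final state of the bitline.

    #include <stdio.h>

    int main(void) {
        /* Illustrative values: cell capacitance 24 fF, bitline capacitance 144 fF, VDD = 1.2 V. */
        double Cc = 24e-15, Cb = 144e-15, Vdd = 1.2;

        for (int k = 0; k <= 3; k++) {
            double delta = (2.0 * k - 3.0) * Cc / (6.0 * Cc + 2.0 * Cb) * Vdd;  /* Equation 1 */
            printf("k = %d: delta = %+.3f V -> bitline settles at %s\n",
                   k, delta, delta > 0 ? "VDD" : "0");
        }
        return 0;
    }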
If A, B, and C represent the logical values of the three cells, then the final state of the
bitline is AB + BC + CA (i.e., at least two of the values should be 1 for the final state
to be 1). Importantly, using simple Boolean algebra, this expression can be rewritten as
C(A + B) + ~C(AB), where ~C denotes the negation of C. In other words, if the initial state
of C is 1, then the final state of the bitline is a bitwise OR of A and B. Otherwise, if the
initial state of C is 0, then the final state of the bitline is a bitwise AND of A and B.
Therefore, by controlling the value of the cell C, we can execute a bitwise AND or bitwise OR
operation of the remaining two cells using the sense amplifier. Due to the regular bulk operation
of cells in DRAM, this approach naturally extends to an entire row of DRAM cells and sense
amplifiers, enabling a multi-kilobyte-wide bitwise AND/OR operation.
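The same selection can be written down in ordinary software, which is also an easy way to convince oneself that the bitwise majority of A, B, and C equals AND when C is all zeros and OR when C is all ones. A short sketch over 64-bit words:

    #include <stdint.h>
    #include <assert.h>

    /* Bitwise majority of three words: each result bit is 1 iff at least two of
     * the corresponding input bits are 1 -- what triple-row activation computes
     * on every bitline in parallel. */
    static uint64_t majority(uint64_t a, uint64_t b, uint64_t c) {
        return (a & b) | (b & c) | (c & a);
    }

    int main(void) {
        uint64_t a = 0x00FF00FF00FF00FFull;
        uint64_t b = 0x0F0F0F0F0F0F0F0Full;

        assert(majority(a, b, 0x0000000000000000ull) == (a & b));  /* C = 0 -> AND */
        assert(majority(a, b, 0xFFFFFFFFFFFFFFFFull) == (a | b));  /* C = 1 -> OR  */
        return 0;
    }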
6.1.2 Challenges
There are two challenges in this approach. First, Equation 1 assumes that the cells involved
in the triple-row activation are either fully charged or fully empty. However, DRAM cells
leak charge over time. Therefore, the triple-row activation may not operate as expected.
This problem may be exacerbated by process variation in DRAM cells. Second, as shown
in Figure 16 (state 3), at the end of the triple-row activation, the data in all three
cells is overwritten with the final state of the bitline. In other words, this approach overwrites the source data with the final value. In the following sections, we describe a simple
implementation of IDAO that addresses these challenges.
6.1.3 Implementation of IDAO
To ensure that the source data does not get modified, IDAO first copies the data from the
two source rows to two reserved temporary rows (T1 and T2). Depending on the operation
to be performed (AND or OR), our mechanism initializes a third reserved temporary row
T3 to 0 or 1. It then simultaneously activates the three rows T1, T2, and T3. It finally
copies the result to the destination row. For example, to perform a bitwise AND of two rows
A and B and store the result in row R, IDAO performs the following steps.
1. Copy data of row A to row T1
2. Copy data of row B to row T2
3. Initialize row T3 to 0
4. Activate rows T1, T2, and T3 simultaneously
5. Copy data of row T1 to row R
While the above mechanism is simple, the copy operations, if performed naively, will
nullify the benefits of our mechanism. Fortunately, IDAO uses RowClone (described in
Section 5) to perform row-to-row copy operations quickly and efficiently within DRAM. To
recap, RowClone-FPM copies data within a subarray by issuing two back-to-back ACTIVATEs
to the source row and the destination row, without an intervening PRECHARGE. RowClone-PSM efficiently copies data between two banks by using the shared internal bus to overlap
the read to the source bank with the write to the destination bank.
With RowClone, all three copy operations (Steps 1, 2, and 5) and the initialization
operation (Step 3) can be performed efficiently within DRAM. To use RowClone for the initialization operation, IDAO reserves two additional rows, C0 and C1. C0 is pre-initialized
to 0 and C1 is pre-initialized to 1. Depending on the operation to be performed, our mechanism uses RowClone to copy either C0 or C1 to T3. Furthermore, to maximize the use of
RowClone-FPM, IDAO reserves five rows in each subarray to serve as the temporary rows
(T1, T2, and T3) and the control rows (C0 and C1).
In the best case, when all the three rows involved in the operation (A, B, and R) are in
the same subarray, IDAO can use RowClone-FPM for all copy and initialization operations.
However, if the three rows are in different banks/subarrays, some of the three copy operations
have to use RowClone-PSM. In the worst case, when all three copy operations have to use
RowClone-PSM, IDAO will consume higher latency than the baseline. However, when only
one or two RowClone-PSM operations are required, IDAO will be faster and more energy-efficient than existing systems. As our goal in this article is to demonstrate the power of our
approach, in the rest of the article, we will focus our attention on the case when all rows
involved in the bitwise operation are in the same subarray.
6.1.4 Reliability of Our Mechanism
While the above implementation trivially addresses the second challenge (modification of
the source data), it also addresses the first challenge (DRAM cell leakage). This is because,
in our approach, the source (and the control) data are copied to the rows T1, T2, and T3
just before the triple-row activation. Each copy operation takes much less than 1 µs, which
is five orders of magnitude less than the typical refresh interval (64 ms). Consequently, the
cells involved in the triple-row activation are very close to the fully refreshed state before the
operation, thereby ensuring reliable operation of the triple-row activation. Having said that,
an important aspect of the implementation is that a chip that fails the tests for triple-row
activation (e.g., due to process variation) can still be used as a regular DRAM chip. As a
result, this approach is likely to have little impact on the overall yield of DRAM chips, which
is a major concern for manufacturers.
6.1.5 Latency Optimization
To complete an intra-subarray copy, RowClone-FPM uses two ACTIVATEs (back-to-back)
followed by a PRECHARGE operation. Assuming typical DRAM timing parameters (tRAS =
35ns and tRP = 15ns), each copy operation consumes 85ns. As IDAO is essentially four
RowClone-FPM operations (as described in the previous section), the overall latency of a
bitwise AND/OR operation is 4 × 85ns = 340ns.
In a RowClone-FPM operation, although the second ACTIVATE does not involve any sense
amplification (the sense amplifiers are already activated), the RowClone paper [115] assumes
the ACTIVATE consumes the full tRAS latency. However, by controlling the temporary rows T1, T2, and
T3 using a separate row decoder, it is possible to overlap the ACTIVATE to the destination
fully with the ACTIVATE to the source row, by raising the wordline of the destination row
towards the end of the sense amplification of the source row. This mechanism is similar
to the inter-segment copy operation described in Tiered-Latency DRAM [87] (Section 4.4).
With this aggressive mechanism, the latency of a RowClone-FPM operation reduces to 50ns
(one ACTIVATE and one PRECHARGE). Therefore, the overall latency of a bitwise AND/OR
operation is 200ns. We will refer to this enhanced mechanism as aggressive, and the approach
that uses the simple back-to-back ACTIVATE operations as conservative.
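The latency figures above follow directly from the Table 1 parameters; the small calculation below reproduces them, assuming tRAS = 35 ns, tRP = 15 ns, and four FPM-style operations per bitwise AND/OR.

    #include <stdio.h>

    int main(void) {
        int tRAS = 35, tRP = 15;                  /* ns, DDR3-1600 values from Table 1 */

        int copy_conservative = 2 * tRAS + tRP;   /* both ACTIVATEs charged tRAS: 85 ns */
        int copy_aggressive   = tRAS + tRP;       /* second ACTIVATE fully overlapped: 50 ns */

        printf("conservative AND/OR: %d ns\n", 4 * copy_conservative);   /* 340 ns */
        printf("aggressive   AND/OR: %d ns\n", 4 * copy_aggressive);     /* 200 ns */
        return 0;
    }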
7 End-to-end System Support
Both RowClone and IDAO are substrates that exploit DRAM technology to perform bulk
copy, initialization, and bitwise AND/OR operations efficiently inside DRAM. However, to
exploit these substrates, we need support from the rest of the layers in the system stack,
namely, the instruction set architecture, the microarchitecture, and the system software. In
this section, we describe this support in detail.
7.1 ISA Support
To enable the software to communicate occurrences of the bulk operations to the hardware,
the mechanisms introduce four new instructions to the ISA: memcopy, meminit, memand, and
memor. Table 2 describes the semantics of these four new instructions. The mechanisms
deliberately keep the semantics of the instructions simple in order to relieve the software
from worrying about microarchitectural aspects of the DRAM substrate such as row size,
alignment, etc. (discussed in Section 7.2.1). Note that such instructions are already present
in some of the instruction sets of modern processors – e.g., rep movsd, rep stosb, and ermsb
in x86 [59] and mvcl in IBM S/390 [58].
Instruction   Operands                Semantics
memcopy       src, dst, size          Copy size bytes from src to dst
meminit       dst, size, val          Set size bytes to val at dst
memand        src1, src2, dst, size   Perform bitwise AND of size bytes of src1 with size bytes of src2 and store the result in dst
memor         src1, src2, dst, size   Perform bitwise OR of size bytes of src1 with size bytes of src2 and store the result in dst

Table 2: Semantics of the memcopy, meminit, memand, and memor instructions
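In practice these instructions would be exposed to programmers through compiler intrinsics or a thin library layer. The sketch below illustrates that usage; the __mem* names are hypothetical placeholders (operand order follows Table 2), not an existing toolchain interface.

    #include <stddef.h>

    /* Hypothetical intrinsics wrapping the proposed instructions. */
    void __memcopy(const void *src, void *dst, size_t size);
    void __meminit(void *dst, size_t size, unsigned char val);
    void __memand(const void *src1, const void *src2, void *dst, size_t size);
    void __memor(const void *src1, const void *src2, void *dst, size_t size);

    /* Example: zero a result buffer, then combine two bitmap-index bitvectors.
     * Row-aligned, row-sized portions get accelerated inside DRAM; the rest is
     * handled by the CPU (Section 7.2.1). */
    void combine(const void *pred_a, const void *pred_b, void *result, size_t bytes) {
        __meminit(result, bytes, 0);
        __memand(pred_a, pred_b, result, bytes);
    }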
There are three points to note regarding the execution semantics of these operations.
First, the processor does not guarantee atomicity for any of these instructions, but note
that existing systems also do not guarantee atomicity for such operations. Therefore, the
software must take care of atomicity requirements using explicit synchronization. However,
the microarchitectural implementation ensures that any data in the on-chip caches is kept
consistent during the execution of these operations (Section 7.2.2). Second, the processor
handles any page faults during the execution of these operations. Third, the processor can
take interrupts during the execution of these operations.
7.2 Processor Microarchitecture Support
The microarchitectural implementation of the new instructions has two parts. The first
part determines if a particular instance of the instructions can be fully/partially accelerated
by RowClone/IDAO. The second part involves the changes required to the cache coherence
protocol to ensure coherence of data in the on-chip caches. We discuss these parts in this
section.
7.2.1 Source/Destination Alignment and Size
For the processor to accelerate a copy/initialization operation using RowClone, the operation must satisfy certain alignment and size constraints. Specifically, for an operation to
be accelerated by FPM, 1) the source and destination regions should be within the same
subarray, 2) the source and destination regions should be row-aligned, and 3) the operation
should span an entire row. On the other hand, for an operation to be accelerated by PSM,
the source and destination regions should be cache line-aligned and the operation must span
a full cache line.
Upon encountering a memcopy/meminit instruction, the processor divides the region to
be copied/initialized into three portions: 1) row-aligned row-sized portions that can be
accelerated using FPM, 2) cache line-aligned cache line-sized portions that can be accelerated
using PSM, and 3) the remaining portions that can be performed by the processor. For
the first two regions, the processor sends appropriate requests to the memory controller,
which completes the operations and sends an acknowledgment back to the processor. Since
TRANSFER copies only a single cache line, a bulk copy using PSM can be interleaved with
other commands to memory. The processor completes the operation for the third region
similarly to how it is done in today’s systems. Note that the CPU can offload all these
operations to the memory controller. In such a design, the CPU need not be made aware of
the DRAM organization (e.g., row size and alignment, subarray mapping, etc.).
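The splitting step can be sketched as follows, assuming for illustration an 8 KB row, a 64-byte cache line, matching source/destination alignment, and physical addresses treated as plain integers; a real implementation would also consult the subarray mapping to choose between FPM and PSM for the row-sized pieces.

    #include <stdio.h>
    #include <stdint.h>

    #define ROW_BYTES  8192u    /* assumed row size */
    #define LINE_BYTES 64u      /* assumed cache-line size */

    /* Split an operation on [dst, dst+size) into row-sized pieces (in-DRAM candidates),
     * line-sized pieces (PSM candidates), and a byte-granularity remainder for the CPU. */
    void split_region(uint64_t dst, uint64_t size) {
        uint64_t end = dst + size;
        while (dst < end) {
            if (dst % ROW_BYTES == 0 && end - dst >= ROW_BYTES) {
                printf("in-DRAM row operation at %#llx\n", (unsigned long long)dst);
                dst += ROW_BYTES;
            } else if (dst % LINE_BYTES == 0 && end - dst >= LINE_BYTES) {
                printf("in-DRAM cache-line operation at %#llx\n", (unsigned long long)dst);
                dst += LINE_BYTES;
            } else {
                printf("CPU handles byte at %#llx\n", (unsigned long long)dst);
                dst += 1;
            }
        }
    }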
For each instance of a memand or memor instruction, the processor follows a similar procedure.
However, only the row-aligned row-sized portions are accelerated using IDAO. The remaining
portions are still performed by the CPU. For the row-aligned row-sized regions, some of the
copy operations may require RowClone-PSM. For each row of the operation, the processor
determines if the number of RowClone-PSM operations required is three. If so, the processor
completes the execution in the CPU. Otherwise, the operation is completed using IDAO.
7.2.2
Managing On-Chip Cache Coherence
Both RowClone and IDAO allow the memory controller to directly read/modify data in
memory without going through the on-chip caches. Therefore, to ensure cache coherence,
the controller appropriately handles cache lines from the source and destination regions that
may be present in the caches before issuing the in-DRAM operations to memory.
First, the memory controller writes back any dirty cache line from the source region as
the main memory version of such a cache line is likely stale. Using the data in memory
before flushing such cache lines will lead to stale data being copied to the destination region.
Second, the controller invalidates any cache line (clean or dirty) from the destination region
that is cached in the on-chip caches. This is because after performing the operation, the
cached version of these blocks may contain stale data. The controller already has the ability
to perform such flushes and invalidations to support Direct Memory Access (DMA) [60].
After performing the necessary flushes and invalidations, the memory controller performs
the in-DRAM operation. To ensure that cache lines of the destination region are not cached
again by the processor in the meantime, the memory controller blocks all requests (including
prefetches) to the destination region until the copy or initialization operation is complete.
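A minimal sketch of this sequence at the memory controller is shown below; all helper functions (for_each_cached_line, writeback_if_dirty, invalidate_line, block_region, unblock_region, do_in_dram_op) are assumed interfaces used only to illustrate the ordering of the steps.

/* Coherence steps before and after an in-DRAM copy/initialization. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base; size_t size; } region_t;

extern void for_each_cached_line(region_t r, void (*fn)(uintptr_t line_addr));
extern void writeback_if_dirty(uintptr_t line_addr);
extern void invalidate_line(uintptr_t line_addr);
extern void block_region(region_t r);    /* stall requests, including prefetches */
extern void unblock_region(region_t r);
extern void do_in_dram_op(region_t src, region_t dst);

static void in_dram_op_with_coherence(region_t src, region_t dst)
{
    /* 1. Write back dirty source lines: memory would otherwise hold stale data. */
    for_each_cached_line(src, writeback_if_dirty);

    /* 2. Invalidate any cached destination line: it becomes stale after the op. */
    for_each_cached_line(dst, invalidate_line);

    /* 3. Keep the destination out of the caches until the operation completes. */
    block_region(dst);
    do_in_dram_op(src, dst);
    unblock_region(dst);
}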
A recent work, LazyPIM [17], proposes an approach to perform the coherence operations
lazily by comparing the signatures of data that were accessed in memory and the data that
is cached on-chip. Our mechanisms can be combined with such works.
For RowClone, while performing the flushes and invalidates as mentioned above will
ensure coherence, we propose a modified solution to handle dirty cache lines of the source
region to reduce memory bandwidth consumption. When the memory controller identifies a
dirty cache line belonging to the source region while performing a copy, it creates an in-cache
copy of the source cache line with the tag corresponding to the destination cache line. This
has two benefits. First, it avoids the additional memory flush required for the dirty source
cache line. Second and more importantly, the controller does not have to wait for all the dirty
source cache lines to be flushed before it can perform the copy. In the evaluation section,
we will describe another optimization, called RowClone-Zero-Insert, which inserts clean zero
cache lines into the cache to further optimize Bulk Zeroing. This optimization does not
require further changes to the proposed modifications to the cache coherence protocol.
Although the two mechanisms require the controller to manage cache coherence, they do
not affect memory consistency — i.e., the ordering of accesses by concurrent readers and/or
writers to the source or destination regions. As mentioned before, such an operation is
not guaranteed to be atomic even in current systems, and software needs to perform the
operation within a critical section to ensure atomicity.
7.3
Software Support
The minimum support required from the system software is the use of the proposed instructions to indicate bulk data operations to the processor. Although one can have a working
system with just this support, the maximum latency and energy benefits can be obtained
if the hardware is able to accelerate most operations using FPM rather than PSM. Increasing the likelihood of the use of the FPM mode requires further support from the operating
system (OS) on two aspects: 1) page mapping, and 2) granularity of the operation.
7.3.1
Subarray-Aware Page Mapping
The use of FPM requires the source row and the destination row of a copy operation to be
within the same subarray. Therefore, to maximize the use of FPM, the OS page mapping
algorithm should be aware of subarrays so that it can allocate a destination page of a
copy operation in the same subarray as the source page. More specifically, the OS should
have knowledge of which pages map to the same subarray in DRAM. We propose that
DRAM expose this information to software using the small EEPROM that already exists
in today’s DRAM modules. This EEPROM, called the Serial Presence Detect (SPD) [63],
stores information about the DRAM chips that is read by the memory controller at system
bootup. Exposing the subarray mapping information will require only a few additional bytes
to communicate the bits of the physical address that map to the subarray index. To increase
DRAM yield, DRAM manufacturers design chips with spare rows that can be mapped to
faulty rows [55]. The mechanisms can work with this technique by either requiring that each
faulty row is remapped to a spare row within the same subarray, or exposing the location
of all faulty rows to the memory controller so that it can use PSM to copy data across such
rows.
Once the OS has the mapping information between physical pages and subarrays, it
maintains multiple pools of free pages, one pool for each subarray. When the OS allocates
the destination page for a copy operation (e.g., for a Copy-on-Write operation), it chooses
the page from the same pool (subarray) as the source page. Note that this approach does
not require contiguous pages to be placed within the same subarray. As mentioned before, commonly used memory interleaving techniques spread out contiguous pages across as
many banks/subarrays as possible to improve parallelism. Therefore, both the source and
destination of a bulk copy operation can be spread out across many subarrays.
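A minimal sketch of such an allocator is given below, assuming the OS has already learned the subarray-index bits (e.g., from the SPD) and keeps one free list per subarray; subarray_of(), free_pool[], and pop_free_page() are illustrative names, not an existing kernel API.

/* Subarray-aware destination-page allocation for a CoW copy. */
#include <stddef.h>
#include <stdint.h>

#define NUM_SUBARRAYS 128u               /* assumed */

struct page;                             /* opaque physical-page descriptor */
extern struct page *free_pool[NUM_SUBARRAYS];         /* per-subarray free lists */
extern struct page *pop_free_page(struct page **pool);
extern unsigned subarray_of(uintptr_t phys_addr);     /* from SPD-exposed mapping */
extern uintptr_t page_to_phys(struct page *pg);

/* Prefer a free page in the source page's subarray so the copy can use
 * FPM; fall back to any subarray (the copy then uses PSM). */
static struct page *alloc_cow_destination(struct page *src)
{
    unsigned sa = subarray_of(page_to_phys(src));
    struct page *pg = pop_free_page(&free_pool[sa]);
    if (pg != NULL)
        return pg;
    for (unsigned i = 0; i < NUM_SUBARRAYS; i++)
        if ((pg = pop_free_page(&free_pool[i])) != NULL)
            return pg;
    return NULL;                         /* no free pages */
}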
7.3.2
Granularity of the Operations
The second aspect that affects the use of FPM and IDAO is the granularity at which data is
copied or initialized. These mechanisms have a minimum granularity at which they operate.
There are two factors that affect this minimum granularity: 1) the size of each DRAM row,
and 2) the memory interleaving employed by the controller.
First, in each chip, these mechanisms operate on an entire row of data. Second, to extract
maximum bandwidth, some memory interleaving techniques map consecutive cache lines to
different memory channels in the system. Therefore, to operate on a contiguous region of
data with such interleaving strategies, the mechanisms must perform the operation in each
channel. The minimum amount of data in such a scenario is the product of the row size and
the number of channels.
To maximize the likelihood of using FPM and IDAO, the system or application software
must ensure that the region of data involved in the operation is at least as large as this
minimum granularity. For this purpose, we propose to expose this minimum granularity
to the software through a special register, which we call the Minimum DRAM Granularity
Register (MDGR). On system bootup, the memory controller initializes the MDGR based
on the row size and the memory interleaving strategy, which can later be used by the OS
for effectively exploiting RowClone/IDAO. Note that some previously proposed techniques
such as sub-wordline activation [132] or mini-rank [136, 147] can be combined with our
mechanisms to reduce the minimum granularity.
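As a small illustration of the granularity computation and the proposed MDGR, consider the sketch below; the constants and function names are assumed, and in a real system the controller would expose the value through the register rather than a function call.

/* Minimum in-DRAM operation granularity: row size x number of channels
 * when consecutive cache lines are interleaved across channels. */
#include <stddef.h>

static inline size_t compute_mdgr(size_t row_size_bytes, unsigned num_channels)
{
    return row_size_bytes * num_channels;
}

/* OS-side check: only regions at least MDGR bytes long (and suitably
 * aligned) are worth issuing as FPM/IDAO operations. */
static inline int large_enough_for_in_dram_op(size_t region_size, size_t mdgr)
{
    return region_size >= mdgr;
}

For example, with an 8KB row and two channels, the minimum granularity would be 16KB.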
8
Evaluation
To highlight the benefits of performing various operations completely inside DRAM, we first
compare the raw latency and energy required to perform these operations using different
mechanisms (Section 8.1). We then present the quantitative evaluation of applications for
RowClone and IDAO in Sections 8.2 and 8.3, respectively.
8.1
Latency and Energy Analysis
We estimate latency using DDR3-1600 timing parameters. We estimate energy using the
Rambus power model [108]. Our energy calculations only include the energy consumed by the
DRAM module and the DRAM channel. Table 3 shows the latency and energy consumption
due to the different mechanisms for bulk copy, zero, and bitwise AND/OR operations. The
table also shows the potential reduction in latency and energy by performing these operations
completely inside DRAM.
First, for bulk copy operations, RowClone-FPM reduces latency by 12x and energy consumption by 74.4x compared to existing interfaces. While PSM does not provide as much
reduction as FPM, PSM still reduces latency by 2x and energy consumption by 3.2x compared to the baseline. Second, for bulk zeroing operations, RowClone can always use the
FPM mode as it reserves a single zero row in each subarray of DRAM. As a result, it can
reduce the latency of bulk zeroing by 6x and energy consumption by 41.5x compared to
                                       Absolute                    Reduction
         Mechanism                 Latency   Memory            Latency   Memory
                                    (ns)     Energy (µJ)                 Energy
Copy     Baseline                   1020      3.6               1.00x     1.0x
         FPM                          85      0.04              12.0x    74.4x
         Inter-Bank - PSM            510      1.1                2.0x     3.2x
         Intra-Bank - PSM           1020      2.5                1.0x     1.5x
Zero     Baseline                    510      2.0               1.00x     1.0x
         FPM                          85      0.05               6.0x    41.5x
AND/OR   Baseline                   1530      5.0               1.00x     1.0x
         IDAO-Conservative           320      0.16              4.78x    31.6x
         IDAO-Aggressive             200      0.10              7.65x    50.5x

Table 3: DRAM latency and memory energy reductions
existing interfaces. Finally, for bitwise AND/OR operations, even with conservative timing
parameters, IDAO can reduce latency by 4.78x and energy consumption by 31.6x. With
more aggressive timing parameters, IDAO reduces latency by 7.65x and energy by 50.5x.
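As a sanity check, the latency reduction factors in Table 3 follow directly from the absolute latencies, e.g.:

\[
\frac{1020\ \text{ns}}{85\ \text{ns}} = 12\times \;\; \text{(copy, FPM)}, \qquad
\frac{1530\ \text{ns}}{320\ \text{ns}} \approx 4.78\times \;\; \text{(AND/OR, IDAO-Conservative)}.
\]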
The improvement in sustained throughput due to RowClone and IDAO for the respective
operations is similar to the improvements in latency. The main takeaway from these results
is that, for systems that use DRAM to store the majority of their data (which includes most
of today’s systems), these in-DRAM mechanisms are probably the best performing and the
most energy-efficient way of performing the respective operations. We will now provide
quantitative evaluation of these mechanisms on some real applications.
8.2
Applications for RowClone
Our evaluations use an in-house cycle-level multi-core simulator similar to memsim [9, 117,
118] along with a cycle-accurate command-level DDR3 DRAM simulator, similar to Ramulator [10, 78]. The multi-core simulator models out-of-order cores, each with a private last-level cache. We evaluate the benefits of RowClone using 1) a case study of the fork system
call, an important operation used by modern operating systems, 2) six copy and initialization intensive benchmarks: bootup, compile, forkbench, memcached [4], mysql [5], and shell
(Section 8.2.2 describes these benchmarks), and 3) a wide variety of multi-core workloads
comprising the copy/initialization intensive applications running alongside memory-intensive
applications from the SPEC CPU2006 benchmark suite [29]. Note that benchmarks such as
SPEC CPU2006, which predominantly stress the CPU, typically use a small number of page
copy and initialization operations and therefore would serve as poor individual evaluation
benchmarks for RowClone.
We collected instruction traces for our workloads using Bochs [1], a full-system x86-64
emulator, running a GNU/Linux system. We modify the kernel’s implementation of page
copy/initialization to use the memcopy and meminit instructions and mark these instructions
in our traces. For the fork benchmark, we used the Wind River Simics full system simulator [8] to collect the traces. We collect 1-billion instruction traces of the representative portions of these workloads. We use the instruction throughput (IPC) metric to measure single-core performance. We evaluate multi-core runs using the weighted speedup metric [41, 122].
This metric is used by many prior works (e.g., [23, 39, 94, 113, 126, 127, 141, 142]) to measure
system throughput for multi-programmed workloads. In addition to weighted speedup, we
use five other performance/fairness/bandwidth/energy metrics, as shown in Table 7.
8.2.1
The fork System Call
fork is one of the most expensive yet frequently-used system calls in modern systems [112].
Since fork triggers a large number of CoW operations (as a result of updates to shared pages
from the parent or child process), RowClone can significantly improve the performance of
fork.
The performance of fork depends on two parameters: 1) the size of the address space
used by the parent—which determines how much data may potentially have to be copied, and
2) the number of pages updated after the fork operation by either the parent or the child—
which determines how much data is actually copied. To exercise these two parameters, we
create a microbenchmark, forkbench, which first creates an array of size S and initializes
the array with random values. It then forks itself. The child process updates N random
pages (by updating a cache line within each page) and exits; the parent process waits for
the child process to complete before exiting itself.
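A minimal C sketch of forkbench as described above is shown below; S, N, and the constants are parameters of the experiment, and this is an illustration of the benchmark's structure rather than the exact source used in the evaluation.

/* forkbench: allocate and initialize S bytes, fork, have the child touch
 * one cache line in each of N randomly chosen pages, and wait. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PAGE_SIZE 4096

static void forkbench(size_t S, size_t N)
{
    char *buf = malloc(S);
    if (buf == NULL)
        return;
    for (size_t i = 0; i < S; i++)
        buf[i] = (char)rand();              /* initialize with random values */

    pid_t pid = fork();                     /* later child writes trigger CoW copies */
    if (pid == 0) {
        size_t npages = S / PAGE_SIZE;
        for (size_t i = 0; i < N; i++) {
            size_t page = (size_t)rand() % npages;
            buf[page * PAGE_SIZE] = 1;      /* update one cache line in the page */
        }
        _exit(0);
    }
    waitpid(pid, NULL, 0);                  /* parent waits for the child */
    free(buf);
}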
As such, we expect the number of copy operations to depend on N, the number of pages
copied. Therefore, one may expect RowClone’s performance benefits to be proportional to
N . However, an application’s performance typically depends on the overall memory access
rate [128, 129], and RowClone can only improve performance by reducing the memory access
rate due to copy operations. As a result, we expect the performance improvement due to
RowClone to primarily depend on the fraction of memory traffic (total bytes transferred over
the memory channel) generated by copy operations. We refer to this fraction as FMTC—
Fraction of Memory Traffic due to Copies.
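Stated as a formula (simply restating the definition above):

\[
\text{FMTC} \;=\; \frac{\text{bytes transferred over the memory channel due to copy operations}}{\text{total bytes transferred over the memory channel}}.
\]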
Figure 17 plots FMTC of forkbench for different values of S (64MB and 128MB) and N
(2 to 16k) in the baseline system. As the figure shows, for both values of S, FMTC increases
with increasing N . This is expected as a higher N (more pages updated by the child) leads
to more CoW operations. However, because of the presence of other read/write operations
(e.g., during the initialization phase of the parent), for a given value of N , FMTC is larger
for S = 64MB compared to S = 128MB. Depending on the value of S and N , anywhere
between 14% to 66% of the memory traffic arises from copy operations. This shows that
accelerating copy operations using RowClone has the potential to significantly improve the
performance of the fork operation.
Figure 18 plots the performance (IPC) of FPM and PSM for forkbench, normalized
to that of the baseline system. We draw two conclusions from the figure. First, FPM im28
Figure 17: FMTC of forkbench for varying S and N (fraction of memory traffic due to copy vs. N, the number of pages updated, log scale; curves for S = 64MB and S = 128MB)

Figure 18: Performance improvement due to RowClone for forkbench with different values of S and N (instructions per cycle normalized to baseline vs. N, log scale; curves for RowClone-FPM and RowClone-PSM)
First, FPM improves the performance of forkbench for both values of S and most values of N. The peak
performance improvement is 2.2x for N = 16k (30% on average across all data points). As
expected, the improvement of FPM increases as the number of pages updated increases. The
trend in performance improvement of FPM is similar to that of FMTC (Figure 17), confirming our hypothesis that FPM’s performance improvement primarily depends on FMTC.
Second, PSM does not provide considerable performance improvement over the baseline.
This is because the large on-chip cache in the baseline system buffers the writebacks generated by the copy operations. These writebacks are flushed to memory at a later point
without further delaying the copy operation. As a result, PSM, which just overlaps the
read and write operations involved in the copy, does not improve latency significantly in the
presence of a large on-chip cache. On the other hand, FPM, by copying all cache lines from
the source row to destination in parallel, significantly reduces the latency compared to the
baseline (which still needs to read the source blocks from main memory), resulting in high
performance improvement.
Figure 19 shows the reduction in DRAM energy consumption (considering both the
DRAM and the memory channel) of FPM and PSM modes of RowClone compared to that
of the baseline for forkbench with S = 64MB. Similar to performance, the overall DRAM
energy consumption also depends on the total memory access rate. As a result, RowClone’s
potential to reduce DRAM energy depends on the fraction of memory traffic generated by
copy operations. In fact, our results also show that the DRAM energy reduction due to
FPM and PSM correlate well with FMTC (Figure 17). By efficiently performing the copy
operations, FPM reduces DRAM energy consumption by up to 80% (average 50%, across all
data points). Similar to FPM, the energy reduction of PSM also increases with increasing
N, with a maximum reduction of 9% for N = 16k.
Figure 19: Comparison of DRAM energy consumption of different mechanisms for forkbench (S = 64MB); normalized DRAM energy consumption vs. N (log scale) for Baseline, RowClone-PSM, and RowClone-FPM
In a system that is agnostic to RowClone, we expect the performance improvement and
energy reduction of RowClone to be in between that of FPM and PSM. By making the system
software aware of RowClone (Section 7.3), i.e., designing the system software to be aware of
the topology (subarray and bank organization) of DRAM, as also advocated by various recent
works [74, 98, 101], we can approximate the maximum performance and energy benefits by
increasing the likelihood of the use of FPM.
8.2.2
Copy/Initialization Intensive Applications
In this section, we analyze the benefits of RowClone on six copy/initialization intensive
applications, including one instance of the forkbench described in the previous section.
Table 4 describes these applications.
Name        Description
bootup      A phase booting up the Debian operating system.
compile     The compilation phase from the GNU C compiler (while running cc1).
forkbench   An instance of the forkbench described in Section 8.2.1 with S = 64MB and N = 1k.
mcached     Memcached [4], a memory object caching system; a phase inserting many key-value pairs into the memcache.
mysql       MySQL [5], an on-disk database system; a phase loading the sample employeedb.
shell       A Unix shell script running ‘find’ on a directory tree with ‘ls’ on each sub-directory (involves filesystem accesses and spawning new processes).

Table 4: Copy/Initialization-intensive benchmarks
Figure 20 plots the fraction of memory traffic due to copy, initialization, and regular
read/write operations for the six applications. For these applications, between 10% and 80%
of the memory traffic is generated by copy and initialization operations.
Figure 20: Fraction of memory traffic due to read, write, copy and initialization for bootup, compile, forkbench, mcached, mysql, and shell
Figure 21 compares the IPC of the baseline with that of RowClone and a variant of
RowClone, RowClone-ZI (described shortly). The RowClone-based initialization mechanism
slightly degrades performance for the applications which have a negligible number of copy
operations (mcached, compile, and mysql).
Figure 21: Performance improvement of RowClone and RowClone-ZI (instructions per cycle, normalized to baseline) for bootup, compile, forkbench, mcached, mysql, and shell. Value on top indicates percentage improvement of RowClone-ZI over baseline.
Our further analysis indicated that, for these applications, although the operating system
zeroes out any newly allocated page, the application typically accesses almost all cache lines
of a page immediately after the page is zeroed out. There are two phases: 1) the phase when
the OS zeroes out the page, and 2) the phase when the application accesses the cache lines
of the page. While the baseline incurs cache misses during phase 1, RowClone, as a result
of performing the zeroing operation completely in memory, incurs cache misses in phase 2.
However, the baseline zeroing operation is heavily optimized for memory-level parallelism
(MLP) [46, 84, 97, 99, 100]. Memory-level parallelism indicates the number of concurrent
outstanding misses to main memory. Higher MLP results in higher overlap in the latency
of the requests. Consequently, higher MLP results in lower overall latency. In contrast, the
cache misses in phase 2 have low MLP. As a result, incurring the same misses in Phase 2 (as
with RowClone) causes higher overall stall time for the application (because the latencies
for the misses are serialized) than incurring them in Phase 1 (as in the baseline), resulting
in RowClone’s performance degradation compared to the baseline.
To address this problem, RowClone uses a variant called RowClone-Zero-Insert (RowClone-ZI). RowClone-ZI not only zeroes out a page in DRAM but also inserts a zero cache line
into the processor cache corresponding to each cache line in the page that is zeroed out. By
doing so, RowClone-ZI avoids the cache misses during both phase 1 (zeroing operation) and
phase 2 (when the application accesses the cache lines of the zeroed page). As a result, it
improves performance for all benchmarks, notably forkbench (by 66%) and shell (by 40%),
compared to the baseline.
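A minimal sketch of the RowClone-ZI zeroing path is shown below; in_dram_zero_page() and insert_clean_zero_line() are assumed hooks into the memory controller and the cache hierarchy, used only to illustrate the mechanism.

/* RowClone-Zero-Insert: zero the page in DRAM, then install a clean,
 * all-zero cache line for every line of that page so the application's
 * subsequent accesses (phase 2) hit in the cache. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define LINE_SIZE 64u

extern void in_dram_zero_page(uintptr_t phys_page);       /* RowClone FPM zeroing */
extern void insert_clean_zero_line(uintptr_t phys_line);  /* clean line, data = 0 */

static void rowclone_zi_zero_page(uintptr_t phys_page)
{
    in_dram_zero_page(phys_page);
    for (uintptr_t off = 0; off < PAGE_SIZE; off += LINE_SIZE)
        insert_clean_zero_line(phys_page + off);
}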
Table 5 shows the percentage reduction in DRAM energy and memory bandwidth consumption with RowClone and RowClone-ZI compared to the baseline. While RowClone
significantly reduces both energy and memory bandwidth consumption for bootup, forkbench
and shell, it has negligible impact on both metrics for the remaining three benchmarks. The
lack of energy and bandwidth benefits in these three applications is due to serial execution
caused by the the cache misses incurred when the processor accesses the zeroed out pages
(i.e., phase 2, as described above), which also leads to performance degradation in these
workloads (as also described above). RowClone-ZI, which eliminates the cache misses in
phase 2, significantly reduces energy consumption (between 15% and 69%) and memory bandwidth consumption (between 16% and 81%) for all benchmarks compared to the baseline.
We conclude that RowClone-ZI can effectively improve performance, memory energy, and
memory bandwidth efficiency in page copy and initialization intensive single-core workloads.
                 Energy Reduction        Bandwidth Reduction
Application      RowClone    +ZI         RowClone    +ZI
bootup           39%         40%         49%         52%
compile          -2%         32%         2%          47%
forkbench        69%         69%         60%         60%
mcached          0%          15%         0%          16%
mysql            -1%         17%         0%          21%
shell            68%         67%         81%         81%

Table 5: DRAM energy and bandwidth reduction due to RowClone and RowClone-ZI (indicated as +ZI)
8.2.3
Multi-core Evaluations
As RowClone performs bulk data operations completely within DRAM, it significantly reduces the memory bandwidth consumed by these operations. As a result, RowClone can
benefit other applications running concurrently on the same system. We evaluate this benefit of RowClone by running our copy/initialization-intensive applications alongside memory-intensive applications from the SPEC CPU2006 benchmark suite [29] (i.e., those applications
with last-level cache MPKI greater than 1). Table 6 lists the set of applications used for our
multi-programmed workloads.
Copy/Initialization-intensive benchmarks
bootup, compile, forkbench, mcached, mysql, shell
Memory-intensive benchmarks from SPEC CPU2006
bzip2, gcc, mcf, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, gobmk, dealII,
soplex, hmmer, sjeng, GemsFDTD, libquantum, h264ref, lbm, omnetpp, astar, wrf,
sphinx3, xalancbmk
Table 6: List of benchmarks used for multi-core evaluation
We generate multi-programmed workloads for 2-core, 4-core and 8-core systems. In each
workload, half of the cores run copy/initialization-intensive benchmarks and the remaining
cores run memory-intensive SPEC benchmarks. Benchmarks from each category are chosen
at random.
Figure 22 plots the performance improvement due to RowClone and RowClone-ZI for
the 50 4-core workloads we evaluated (sorted based on the performance improvement due
to RowClone-ZI). Two conclusions are in order. First, although RowClone degrades performance of certain 4-core workloads (with compile, mcached or mysql benchmarks), it significantly improves performance for all other workloads (by 10% across all workloads). Second,
like in our single-core evaluations (Section 8.2.2), RowClone-ZI eliminates the performance
degradation due to RowClone and consistently outperforms both the baseline and RowClone
for all workloads (20% on average).
Figure 22: System performance improvement of RowClone for 4-core workloads (weighted speedup normalized to baseline, for Baseline, RowClone, and RowClone-ZI across the 50 workloads)
Table 7 shows the number of workloads and six metrics that evaluate the performance,
fairness, memory bandwidth and energy efficiency improvement due to RowClone compared
to the baseline for systems with 2, 4, and 8 cores. We evaluate fairness using the maximum
slowdown metric, which has been used by many prior works [14, 31, 32, 33, 34, 37, 38, 72, 73,
75, 76, 96, 117, 126, 127, 128, 129, 133, 134, 145] as an indicator of unfairness in the system.
Maximum slowdown is defined as the maximum of the slowdowns of all applications that are
in the multi-core workload. For all three systems, RowClone significantly outperforms the
baseline on all metrics.
To provide more insight into the benefits of RowClone on multi-core systems, we classify
our copy/initialization-intensive benchmarks into two categories: 1) moderately copy/initialization-intensive (compile, mcached, and mysql), and 2) highly copy/initialization-intensive (bootup,
forkbench, and shell). Figure 23 shows the average improvement in weighted speedup for the
Number of Cores                                    2      4      8
Number of Workloads                                138    50     40
Weighted Speedup [41, 122] Improvement             15%    20%    27%
Instruction Throughput Improvement                 14%    15%    25%
Harmonic Speedup [93] Improvement                  13%    16%    29%
Maximum Slowdown [33, 75, 76] Reduction            6%     12%    23%
Memory Bandwidth/Instruction [123] Reduction       29%    27%    28%
Memory Energy/Instruction Reduction                19%    17%    17%

Table 7: Multi-core performance, fairness, bandwidth, and energy
different multi-core workloads, categorized based on the number of highly copy/initialization-intensive benchmarks. As the trends indicate, the performance improvement increases with
increasing number of such benchmarks for all three multi-core systems, indicating the effectiveness of RowClone in accelerating bulk copy/initialization operations.
Figure 23: Effect of increasing copy/initialization intensity (weighted speedup improvement over baseline, in %, for 2-core, 4-core, and 8-core systems, grouped by the number of highly copy/initialization-intensive benchmarks in each workload)
We conclude that RowClone is an effective mechanism to improve system performance,
energy efficiency and bandwidth efficiency of future, memory-bandwidth-constrained multi-core systems.
8.2.4
Memory-Controller-based DMA
One alternative way to perform a bulk data operation is to use the memory controller
to complete the operation using the regular DRAM interface (similar to some prior approaches [66, 146]). We refer to this approach as the memory-controller-based DMA (MC-DMA). MC-DMA can potentially avoid the cache pollution caused by inserting blocks (involved in the copy/initialization) unnecessarily into the caches. However, it still requires data
to be transferred over the memory bus. Hence, it suffers from the large latency, bandwidth,
and energy consumption associated with the data transfer. Because the applications used
in our evaluations do not suffer from cache pollution, we expect MC-DMA to perform comparably or worse than the baseline. In fact, our evaluations show that MC-DMA degrades
performance compared to our baseline by 2% on average for the six copy/initialization-intensive applications (16% compared to RowClone). In addition, MC-DMA does not conserve
any DRAM energy, unlike RowClone.
8.2.5
Other Applications
Secure Deallocation. Most operating systems (e.g., Linux [18], Windows [111], Mac OS
X [121]) zero out pages newly allocated to a process. This is done to prevent malicious
processes from gaining access to the data that previously belonged to other processes or the
kernel itself. Not doing so can potentially lead to security vulnerabilities, as shown by prior
works [25, 36, 49, 50].
Process Checkpointing. Checkpointing is an operation during which a consistent version
of a process state is backed-up, so that the process can be restored from that state in the
future. This checkpoint-restore primitive is useful in many cases including high-performance
computing servers [16], software debugging with reduced overhead [124], hardware-level fault
and bug tolerance mechanisms [26, 27, 28], mechanisms to provide consistent updates of
persistent memory state [110], and speculative OS optimizations to improve performance [20,
137]. However, to ensure that the checkpoint is consistent (i.e., the original process does not
update data while the checkpointing is in progress), the pages of the process are marked with
copy-on-write. As a result, checkpointing often results in a large number of CoW operations.
Virtual Machine Cloning/Deduplication. Virtual machine (VM) cloning [80] is a technique to significantly reduce the startup cost of VMs in a cloud computing server. Similarly,
deduplication is a technique employed by modern hypervisors [135] to reduce the overall
memory capacity requirements of VMs. With this technique, different VMs share physical
pages that contain the same data. Similar to forking, both these operations likely result in
a large number of CoW operations for pages shared across VMs.
Page Migration. Bank conflicts, i.e., concurrent requests to different rows within the
same bank, typically result in reduced row buffer hit rate and hence degrade both system
performance and energy efficiency. Prior work [130] proposed techniques to mitigate bank
conflicts using page migration. The PSM mode of RowClone can be used in conjunction with
such techniques to 1) significantly reduce the migration latency and 2) make the migrations
more energy-efficient.
CPU-GPU Communication. In many current and future processors, the GPU is or is
expected to be integrated on the same chip with the CPU. Even in such systems where
the CPU and GPU share the same off-chip memory, the off-chip memory is partitioned
between the two devices. As a consequence, whenever a CPU program wants to offload some
computation to the GPU, it has to copy all the necessary data from the CPU address space
to the GPU address space [61]. When the GPU computation is finished, all the data needs
to be copied back to the CPU address space. This copying involves a significant overhead.
In fact, a recent work, Decoupled DMA [88], motivates this problem and proposes a solution
to mitigate it. By spreading out the GPU address space over all subarrays and mapping the
application data appropriately, RowClone can significantly speed up these copy operations.
Note that communication between different processors and accelerators in a heterogeneous
System-on-a-chip (SoC) is done similarly to the CPU-GPU communication and can also be
accelerated by RowClone.
8.3
Applications for IDAO
We analyze our mechanism’s performance on a real-world bitmap index library, FastBit [2],
widely used in physics simulations and network analysis. FastBit can enable faster and more
efficient searching/retrieval compared to B-trees.
To construct an index, FastBit uses multiple bitmap bins, each corresponding to either
a distinct value or a range of values. FastBit relies on fast bitwise AND/OR operations on
these bitmaps to accelerate joins and range queries. For example, to execute a range query,
FastBit performs a bitwise OR of all bitmaps that correspond to the specified range.
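A minimal sketch of such a range query is shown below, folding the bitmaps of the selected bins together two at a time, which matches how the in-DRAM OR primitive is invoked; or_rows_in_dram() is an assumed interface standing in for the row-granularity bitwise-OR operation.

/* Range query: OR together the bitmaps of all bins in the range. */
#include <stddef.h>
#include <stdint.h>

/* result |= src, performed in DRAM-row-sized chunks (two operands at a time). */
extern void or_rows_in_dram(uint64_t *result, const uint64_t *src, size_t words);

static void range_query_or(uint64_t *result, const uint64_t *const *bins,
                           size_t nbins, size_t words)
{
    if (nbins == 0)
        return;
    for (size_t w = 0; w < words; w++)
        result[w] = bins[0][w];                    /* start from the first bin's bitmap */
    for (size_t b = 1; b < nbins; b++)
        or_rows_in_dram(result, bins[b], words);   /* fold in the remaining bins pairwise */
}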
We initialized FastBit on our baseline system using the sample STAR [7] data set. We
then ran a set of indexing-intensive range queries that touch various numbers of bitmap
bins. For each query, we measure the fraction of query execution time spent on bitwise OR
operations. Table 8 shows the corresponding results. For each query, the table shows the
number of bitmap bins involved in the query and the percentage of time spent in bitwise
OR operations. On average, 31% of the query execution is spent on bitwise OR operations
(with small variance across queries).
Number of bins                  3     9     20    45    98    118   128
Fraction of time spent in OR    29%   29%   31%   32%   34%   34%   34%

Table 8: Fraction of time spent in OR operations
To estimate the performance of our mechanism, we measure the number of bitwise OR
operations required to complete the query. We then compute the amount of time taken by our
mechanism to complete these operations and then use that to estimate the performance of
the overall query execution. To perform a bitwise OR of more than two rows, our mechanism
is invoked two rows at a time. Figure 24 shows the potential performance improvement using
our two mechanisms (conservative and aggressive), each with either 1 bank or 4 banks.
Figure 24: Range query performance improvement over baseline (normalized performance vs. number of OR bins, for the conservative and aggressive mechanisms with 1 bank and 4 banks each)
As our results indicate, our aggressive mechanism with 4 banks improves the performance
of range queries by 30% (on average) compared to the baseline, eliminating almost all the
overhead of bitwise operations. As expected, the aggressive mechanism performs better than
the conservative mechanism. Similarly, using more banks provides better performance. Even
if we assume a 2X higher latency for the triple-row activation, our conservative mechanism
with 1 bank improves performance by 18% (not shown in the figure).
8.4
Recent Works Building on RowClone and IDAO
Some recent works [22, 89] have built on top of RowClone and IDAO and have proposed
mechanisms to perform bulk copy and bitwise operations inside memory. As we described in
Section 5.3, to copy data across two subarrays in the same bank, RowClone uses two PSM
operations. While this approach reduces energy consumption compared to existing systems,
it still does not reduce latency. Low-cost Interlinked Sub-Arrays or LISA [22] addresses
this problem by connecting adjacent subarrays of a bank. LISA exploits the open bitline
architecture and connects the open end of each bitline to the adjacent sense amplifier using
a transistor. LISA uses these connections to transfer data more efficiently and quickly across
subarrays in the same bank. Pinatubo [89] takes an approach similar to IDAO and uses
Phase Change Memory (PCM) technology to perform bitwise operations inside a memory
chip built using PCM. Pinatubo enables the PCM sense amplifier to detect fine-grained
differences in cell resistance. With the enhanced sense amplifier, Pinatubo can perform
bitwise AND/OR operations by simultaneously sensing multiple PCM cells connected to the
same sense amplifier.
9
Conclusion
In this article, we focused our attention on the problem of data movement, especially for
operations that access a large amount of data. We first discussed the general notion of
Processing in Memory (PiM) as a potential solution to reducing data movement so as to
achieve better performance and efficiency. PiM adds new logic structures, sometimes as
large as simple processors, near memory, to perform computation. We then introduced the
idea of Processing using Memory (PuM), which exploits some of the peripheral structures
already existing inside memory devices (with minimal changes), to perform other tasks on
top of storing data. PuM is a cost-effective approach as it does not add significant logic
structures near or inside memory.
We developed two new ideas that take the PuM approach and build on top of DRAM
technology. The first idea is RowClone, which exploits the underlying operation of DRAM to
perform bulk copy and initialization operations completely inside DRAM. RowClone exploits
the fact that DRAM cells internally share several buses that can act as a fast path for copying
data across them. The second idea is In-DRAM AND/OR (IDAO), which exploits the
analog operation of DRAM to perform bulk bitwise AND/OR operations completely inside
DRAM. IDAO exploits the fact that many DRAM cells share the same sense amplifier and
uses simultaneous activation of three rows of DRAM cells to perform bitwise AND/OR
operations efficiently.
Our evaluations show that both mechanisms (RowClone and IDAO) improve the performance and energy-efficiency of the respective operations by more than an order of magnitude.
In fact, for systems that store data in DRAM, these mechanisms are probably as efficient as
any mechanism could be (since they minimize the amount of data movement). We described
many real-world applications that can exploit RowClone and IDAO, and have demonstrated
significant performance and energy efficiency improvements using these mechanisms. Due
to its low cost of implementation and large performance and energy benefits, we believe
Processing using Memory is a very promising and viable approach to minimize the memory
bottleneck in data-intensive applications. We hope and expect future research will build
upon this approach to demonstrate other techniques that can perform more operations in
memory.
Acknowledgments
We thank the members of the SAFARI and LBA research groups, and the various anonymous reviewers for their valuable feedback on the multiple works described in this document.
We acknowledge the support of AMD, Google, IBM, Intel, Microsoft, Nvidia, Oracle, Qualcomm, Samsung, Seagate, and VMWare. This research was partially supported by NSF
(CCF-0953246, CCF-1147397, CCF-1212962, CNS-1320531), Intel University Research Office Memory Hierarchy Program, Intel Science and Technology Center for Cloud Computing,
and Semiconductor Research Corporation.
References
[1] Bochs IA-32 emulator project. http://bochs.sourceforge.net/.
[2] FastBit: An Efficient Compressed Bitmap Index Technology. https://sdm.lbl.gov/
fastbit/.
[3] High Bandwidth Memory DRAM. http://www.jedec.org/standards-documents/
docs/jesd235.
[4] Memcached: A high performance, distributed memory object caching system. http:
//memcached.org.
[5] MySQL: An open source database. http://www.mysql.com.
[6] Redis - bitmaps. http://redis.io/topics/data-types-intro#bitmaps.
[7] The STAR experiment. http://www.star.bnl.gov/.
[8] Wind River Simics full system simulation. http://www.windriver.com/products/
simics/.
[9] Memsim. http://safari.ece.cmu.edu/tools.html, 2012.
[10] Ramulator Source Code. https://github.com/CMU-SAFARI/ramulator, 2015.
[11] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In Proceedings
of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15,
pages 105–117, New York, NY, USA, 2015. ACM.
[12] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA
’15, pages 336–348, New York, NY, USA, 2015. ACM.
[13] Berkin Akin, Franz Franchetti, and James C. Hoe. Data Reorganization in Memory
Using 3D-stacked DRAM. In Proceedings of the 42nd Annual International Symposium
on Computer Architecture, ISCA ’15, pages 131–143, New York, NY, USA, 2015. ACM.
[14] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H.
Loh, and Onur Mutlu. Staged Memory Scheduling: Achieving High Performance and
Scalability in Heterogeneous Systems. In Proceedings of the 39th Annual International
Symposium on Computer Architecture, ISCA ’12, pages 416–427, Washington, DC,
USA, 2012. IEEE Computer Society.
[15] Oreoluwa Babarinsa and Stratos Idreos. JAFAR: Near-Data Processing for Databases.
In Proceedings of the 2015 ACM SIGMOD International Conference on Management
of Data, SIGMOD ’15, pages 2069–2070, New York, NY, USA, 2015. ACM.
[16] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James
Nunez, Milo Polte, and Meghan Wingate. PLFS: A Checkpoint Filesystem for Parallel
Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 21:1–21:12, New York, NY, USA, 2009.
ACM.
[17] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and O. Mutlu.
LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. IEEE
Computer Architecture Letters, PP(99):1–1, 2016.
[18] D. P. Bovet and M. Cesati. Understanding the Linux Kernel, page 388. O’Reilly Media,
2005.
[19] Chee-Yong Chan and Yannis E. Ioannidis. Bitmap index design and evaluation. In
Proceedings of the 1998 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’98, pages 355–366, New York, NY, USA, 1998. ACM.
[20] Fay Chang and Garth A. Gibson. Automatic I/O Hint Generation Through Speculative Execution. In Proceedings of the Third Symposium on Operating Systems Design
and Implementation, OSDI ’99, pages 1–14, Berkeley, CA, USA, 1999. USENIX Association.
[21] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh,
Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu.
Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization. In Sigmetrics, 2016.
[22] Kevin K Chang, Prashant J Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K
Qureshi, and Onur Mutlu. Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast
Inter-Subarray Data Movement in DRAM. In HPCA, 2016.
[23] Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu. HAT:
Heterogeneous Adaptive Throttling for On-Chip Networks. In Proceedings of the 2012
IEEE 24th International Symposium on Computer Architecture and High Performance
Computing, SBAC-PAD ’12, pages 9–18, Washington, DC, USA, 2012. IEEE Computer
Society.
[24] Kevin Kai-Wei Chang, Donghyuk Lee, Zeshan Chishti, Alaa R Alameldeen, Chris
Wilkerson, Yoongu Kim, and Onur Mutlu. Improving DRAM performance by parallelizing refreshes with accesses. In 2014 IEEE 20th International Symposium on High
Performance Computer Architecture (HPCA), pages 356–367. IEEE, 2014.
[25] Jim Chow, Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Shredding Your Garbage:
Reducing Data Lifetime Through Secure Deallocation. In Proceedings of the 14th
Conference on USENIX Security Symposium - Volume 14, SSYM’05, pages 22–22,
Berkeley, CA, USA, 2005. USENIX Association.
[26] Kypros Constantinides, Onur Mutlu, and Todd Austin. Online Design Bug Detection:
RTL Analysis, Flexible Mechanisms, and Evaluation. In Proceedings of the 41st Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 282–
293, Washington, DC, USA, 2008. IEEE Computer Society.
[27] Kypros Constantinides, Onur Mutlu, Todd Austin, and Valeria Bertacco. Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and
Evaluation. In Proceedings of the 40th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 40, pages 97–108, Washington, DC, USA, 2007. IEEE
Computer Society.
[28] Kypros Constantinides, Onur Mutlu, Todd Austin, and Valeria Bertacco. A flexible
software-based framework for online detection of hardware defects. IEEE Transactions
on Computers, 58(8):1063–1079, 2009.
[29] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmark Suite.
www.spec.org/cpu2006, 2006.
[30] William Dally. GPU Computing to Exascale and Beyond. http://www.nvidia.com/
content/PDF/sc_2010/theater/Dally_SC10.pdf.
[31] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani
Azimi. Application-to-core Mapping Policies to Reduce Memory Interference in Multicore Systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT ’12, pages 455–456, New York, NY, USA,
2012. ACM.
[32] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani
Azimi. Application-to-core Mapping Policies to Reduce Memory Interference in Multicore Systems. In HPCA, 2012.
[33] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. Applicationaware Prioritization Mechanisms for On-chip Networks. In Proceedings of the 42Nd
Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages
280–291, New York, NY, USA, 2009. ACM.
[34] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. Aérgia: Exploiting Packet Latency Slack in On-chip Networks. In Proceedings of the 37th Annual
International Symposium on Computer Architecture, ISCA ’10, pages 106–116, New
York, NY, USA, 2010. ACM.
[35] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss,
John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, and Gokhan
Daglikoca. The Architecture of the DIVA Processing-in-memory Chip. In Proceedings
of the 16th International Conference on Supercomputing, ICS ’02, pages 14–25, New
York, NY, USA, 2002. ACM.
[36] Alan M. Dunn, Michael Z. Lee, Suman Jana, Sangman Kim, Mark Silberstein,
Yuanzhong Xu, Vitaly Shmatikov, and Emmett Witchel. Eternal sunshine of the
spotless machine: Protecting privacy with ephemeral channels. In Proceedings of the
10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12,
pages 61–75, Berkeley, CA, USA, 2012. USENIX Association.
[37] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. Fairness via Source
Throttling: A Configurable and High-performance Fairness Substrate for Multi-core
Memory Systems. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural
Support for Programming Languages and Operating Systems, ASPLOS XV, pages 335–
346, New York, NY, USA, 2010. ACM.
[38] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. Prefetch-aware
Shared Resource Management for Multi-core Systems. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 141–152,
New York, NY, USA, 2011. ACM.
[39] Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N. Patt. Coordinated Control
of Multiple Prefetchers in Multi-core Systems. In Proceedings of the 42Nd Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 316–
326, New York, NY, USA, 2009. ACM.
[40] Duncan Elliott, Michael Stumm, W. Martin Snelgrove, Christian Cojocaru, and Robert
McKenzie. Computational RAM: Implementing Processors in Memory. IEEE Des.
Test, 16(1):32–41, January 1999.
[41] Stijn Eyerman and Lieven Eeckhout. System-Level Performance Metrics for Multiprogram Workloads. IEEE Micro, 28(3):42–53, May 2008.
[42] A. Farmahini-Farahani, Jung Ho Ahn, K. Morrow, and Nam Sung Kim. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard
memory modules. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pages 283–295, Feb 2015.
[43] Basilio B. Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas.
Programming the FlexRAM Parallel Intelligent Memory System. In Proceedings of the
Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’03, pages 49–60, New York, NY, USA, 2003. ACM.
[44] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. Practical Near-Data Processing
for In-Memory Analytics Frameworks. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), PACT ’15, pages 113–124,
Washington, DC, USA, 2015. IEEE Computer Society.
[45] Mingyu Gao and Christos Kozyrakis. HRL: Efficient and Flexible Reconfigurable Logic
for Near-Data Processing. In HPCA, 2016.
[46] Andrew Glew. MLP yes! ILP no. ASPLOS Wild and Crazy Idea Session98, 1998.
[47] Maya Gokhale, Bill Holmes, and Ken Iobst. Processing in Memory: The Terasys
Massively Parallel PIM Array. Computer, 28(4):23–31, April 1995.
[48] Qi Guo, Nikolaos Alachiotis, Berkin Akin, Fazle Sadi, Guanglin Xu, Tze Meng Low,
Larry Pileggi, James C. Hoe, and Franz Franchetti. 3D-Stacked Memory-Side Acceleration: Accelerator and System Design. In WoNDP, 2013.
[49] J. Alex Halderman, Seth D. Schoen, Nadia Heninger, William Clarkson, William Paul,
Joseph A. Calandrino, Ariel J. Feldman, Jacob Appelbaum, and Edward W. Felten. Lest We Remember: Cold-boot Attacks on Encryption Keys. Commun. ACM,
52(5):91–98, May 2009.
[50] K. Harrison and Shouhuai Xu. Protecting Cryptographic Keys from Memory Disclosure
Attacks. In 37th Annual IEEE/IFIP International Conference on Dependable Systems
and Networks, pages 137–143, June 2007.
[51] Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. Accelerating Dependent Cache Misses with an Enhanced Memory Controller. In ISCA,
2016.
[52] Milad Hashemi, Onur Mutlu, and Yale N. Patt. Continuous Runahead: Transparent
Hardware Acceleration for Memory Intensive Workloads. In MICRO, 2016.
[53] Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk
Lee, Oguz Ergin, and Onur Mutlu. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In HPCA, 2016.
[54] Syed Minhaj Hassan, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Near Data
Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS ’15, pages 11–21, New York, NY, USA, 2015. ACM.
[55] M. Horiguchi and K. Itoh. Nanoscale Memory Repair. Springer, 2011.
[56] Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O’Conner,
Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler. Transparent Offloading
and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in
GPU Systems. In ISCA, 2016.
[57] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali
Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating Pointer Chasing in 3DStacked Memory: Challenges, Mechanisms, Evaluation. In ICCD, 2016.
[58] IBM Corporation. Enterprise Systems Architecture/390 Principles of Operation, 2001.
[59] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual. April 2012.
[60] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual, volume 3A,
chapter 11, page 12. April 2012.
[61] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R.
Beard, and David I. August. Automatic CPU-GPU Communication Management and
Optimization. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’11, pages 142–151, New York, NY, USA,
2011. ACM.
[62] J. Jeddeloh and B. Keeth. Hybrid Memory Cube: New DRAM architecture increases
density and performance. In VLSIT, pages 87–88, June 2012.
[63] JEDEC. Standard No. 21-C. Annex K: Serial Presence Detect (SPD) for DDR3
SDRAM Modules, 2011.
[64] JEDEC. DDR3 SDRAM, JESD79-3F, 2012.
[65] JEDEC. DDR4 SDRAM Standard. http://www.jedec.org/standards-documents/
docs/jesd79-4a, 2013.
[66] Xiaowei Jiang, Yan Solihin, Li Zhao, and Ravishankar Iyer. Architecture Support for
Improving Bulk Memory Copying and Initialization Performance. In PACT, pages
169–180, Washington, DC, USA, 2009. IEEE Computer Society.
[67] Mingu Kang, Min-Sun Keel, Naresh R Shanbhag, Sean Eilert, and Ken Curewitz.
An energy-efficient VLSI architecture for pattern recognition via deep embedding of
computation in SRAM. In 2014 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 8326–8330. IEEE, 2014.
[68] Yi Kang, Wei Huang, Seung-Moon Yoo, D. Keen, Zhenzhou Ge, V. Lam, P. Pattnaik,
and J. Torrellas. FlexRAM: Toward an Advanced Intelligent Memory System. In
Proceedings of the 1999 IEEE International Conference on Computer Design, ICCD
’99, pages 192–, Washington, DC, USA, 1999. IEEE Computer Society.
[69] Brent Keeth, R. Jacob Baker, Brian Johnson, and Feng Lin. DRAM Circuit Design:
Fundamental and High-Speed Topics. Wiley-IEEE Press, 2nd edition, 2007.
[70] Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R. Alameldeen, Chris Wilkerson,
and Onur Mutlu. The Efficacy of Error Mitigation Techniques for DRAM Retention
Failures: A Comparative Experimental Study. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pages
519–532, New York, NY, USA, 2014. ACM.
[71] Samira M. Khan, Donghyuk Lee, and Onur Mutlu. PARBOR: An Efficient SystemLevel Technique to Detect Data-Dependent Failures in DRAM. In DSN, 2016.
[72] Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark Klein, Onur Mutlu, and
Ragunathan Rajkumar. Bounding and reducing memory interference in COTS-based
multi-core systems. In RTAS, 2014.
[73] Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark Klein, Onur Mutlu, and
Ragunathan Rajkumar. Bounding and reducing memory interference in COTS-based
multi-core systems. Real-Time Systems, 52(3):356–395, 2016.
[74] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris
Wilkerson, Konrad Lai, and Onur Mutlu. Flipping Bits in Memory Without Accessing
Them: An Experimental Study of DRAM Disturbance Errors. In Proceeding of the
41st Annual International Symposium on Computer Architecuture, ISCA ’14, pages
361–372, Piscataway, NJ, USA, 2014. IEEE Press.
[75] Yoongu Kim, Dongsu Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A scalable and
high-performance scheduling algorithm for multiple memory controllers. In IEEE 16th
International Symposium on High Performance Computer Architecture, pages 1–12,
Jan 2010.
[76] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. Thread
Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In
Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 65–76, Washington, DC, USA, 2010. IEEE Computer Society.
[77] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A Case for
Exploiting Subarray-level Parallelism (SALP) in DRAM. In Proceedings of the 39th
Annual International Symposium on Computer Architecture, ISCA ’12, pages 368–379,
Washington, DC, USA, 2012. IEEE Computer Society.
[78] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and Extensible
DRAM Simulator. IEEE Comput. Archit. Lett., 15(1):45–49, January 2016.
[79] Peter M. Kogge. EXECUBE: A New Architecture for Scaleable MPPs. In ICPP, pages
77–84, Washington, DC, USA, 1994. IEEE Computer Society.
[80] Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell,
Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev
Satyanarayanan. SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing.
In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys
’09, pages 1–12, New York, NY, USA, 2009. ACM.
[81] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting Phase
Change Memory As a Scalable DRAM Alternative. In Proceedings of the 36th Annual
International Symposium on Computer Architecture, ISCA ’09, pages 2–13, New York,
NY, USA, 2009. ACM.
[82] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Phase Change Memory
Architecture and the Quest for Scalability. Commun. ACM, 53(7):99–106, July 2010.
[83] Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur
Mutlu, and Doug Burger. Phase-Change Technology and the Future of Main Memory.
IEEE Micro, 30(1):143–143, January 2010.
[84] Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N. Patt. Improving memory
bank-level parallelism in the presence of prefetching. In Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 327–
336, New York, NY, USA, 2009. ACM.
[85] Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, and Onur Mutlu.
Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low
Cost. ACM Trans. Archit. Code Optim., 12(4):63:1–63:29, January 2016.
[86] Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Manabi Khan, Vivek Seshadri, Kevin Kai-Wei Chang, and Onur Mutlu. Adaptive-latency DRAM: Optimizing
DRAM timing for the common-case. In HPCA, pages 489–501. IEEE, 2015.
[87] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and
Onur Mutlu. Tiered-latency DRAM: A Low Latency and Low Cost DRAM Architecture. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, pages 615–626, Washington, DC,
USA, 2013. IEEE Computer Society.
[88] Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, and
Onur Mutlu. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by
Leveraging a Dual-Data-Port DRAM. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), PACT ’15, pages 174–187,
Washington, DC, USA, 2015. IEEE Computer Society.
[89] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. Pinatubo:
A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories. In Proceedings of the 53rd Annual Design Automation Conference,
page 173. ACM, 2016.
[90] Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for
Retention Time Profiling Mechanisms. In Proceedings of the 40th Annual International
Symposium on Computer Architecture, ISCA ’13, pages 60–71, New York, NY, USA,
2013. ACM.
[91] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. RAIDR: Retention-Aware
Intelligent DRAM Refresh. In Proceedings of the 39th Annual International Symposium
on Computer Architecture, ISCA ’12, pages 1–12, Washington, DC, USA, 2012. IEEE
Computer Society.
[92] Gabriel H. Loh. 3D-Stacked Memory Architectures for Multi-core Processors. In
Proceedings of the 35th Annual International Symposium on Computer Architecture,
ISCA ’08, pages 453–464, Washington, DC, USA, 2008. IEEE Computer Society.
[93] Kun Luo, J. Gummaraju, and M. Franklin. Balancing throughput and fairness in SMT
processors. In Performance Analysis of Systems and Software, 2001. ISPASS. 2001
IEEE International Symposium on, pages 164–171, 2001.
[94] Justin Meza, Jing Li, and Onur Mutlu. A Case for Small Row Buffers in Non-volatile
Main Memories. In Proceedings of the 2012 IEEE 30th International Conference on
Computer Design (ICCD 2012), ICCD ’12, pages 484–485, Washington, DC, USA,
2012. IEEE Computer Society.
[95] Amir Morad, Leonid Yavits, and Ran Ginosar. GP-SIMD Processing-in-Memory. ACM
Trans. Archit. Code Optim., 11(4):53:1–53:26, January 2015.
[96] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir,
and Thomas Moscibroda. Reducing Memory Interference in Multicore Systems via
Application-aware Memory Channel Partitioning. In Proceedings of the 44th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 374–
385, New York, NY, USA, 2011. ACM.
[97] Onur Mutlu. Efficient Runahead Execution Processors. PhD thesis, Austin, TX, USA,
2006. AAI3263366.
[98] Onur Mutlu. Memory Scaling: A Systems Architecture Perspective. In IMW, 2014.
[99] Onur Mutlu and Thomas Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of the
35th Annual International Symposium on Computer Architecture, ISCA ’08, pages 63–
74, Washington, DC, USA, 2008. IEEE Computer Society.
[100] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors.
In Proceedings of the 9th International Symposium on High-Performance Computer
Architecture, HPCA ’03, pages 129–, Washington, DC, USA, 2003. IEEE Computer
Society.
[101] Onur Mutlu and Lavanya Subramanian. Research Problems and Opportunities in
Memory Systems. SuperFRI, 2014.
[102] Elizabeth O’Neil, Patrick O’Neil, and Kesheng Wu. Bitmap Index Design Choices and
Their Performance Implications. In Proceedings of the 11th International Database
Engineering and Applications Symposium, IDEAS ’07, pages 72–84, Washington, DC,
USA, 2007. IEEE Computer Society.
[103] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: A Computation Model for Intelligent Memory. In Proceedings of the 25th Annual International
Symposium on Computer Architecture, ISCA ’98, pages 192–203, Washington, DC,
USA, 1998. IEEE Computer Society.
[104] Ashutosh Patnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. Scheduling Techniques for GPU
Architectures with Processing-In-Memory Capabilities. In PACT, 2016.
[105] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34–44, March 1997.
[106] M. K. Qureshi, D. H. Kim, S. Khan, P. J. Nair, and O. Mutlu. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems. In 2015 45th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks, pages
427–437, June 2015.
[107] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable High
Performance Main Memory System Using Phase-change Memory Technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA
’09, pages 24–33, New York, NY, USA, 2009. ACM.
[108] Rambus. DRAM power model, 2010.
[109] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby,
M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam. Phase-change Random
Access Memory: A Scalable Technology. IBM J. Res. Dev., 52(4):465–479, July 2008.
[110] Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu.
ThyNVM: Enabling Software-transparent Crash Consistency in Persistent Memory
Systems. In Proceedings of the 48th International Symposium on Microarchitecture,
MICRO-48, pages 672–685, New York, NY, USA, 2015. ACM.
[111] M. E. Russinovich, D. A. Solomon, and A. Ionescu. Windows Internals, page 701.
Microsoft Press, 2009.
[112] R. F. Sauers, C. P. Ruemmler, and P. S. Weygant. HP-UX 11i Tuning and Performance, chapter 8. Memory Bottlenecks. Prentice Hall, 2004.
[113] Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A.
Kozuch, and Todd C. Mowry. The Dirty-Block Index. In Proceedings of the 41st
Annual International Symposium on Computer Architecture, ISCA ’14, pages 157–
168, Piscataway, NJ, USA, 2014. IEEE Press.
[114] Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch,
Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry. Fast Bulk Bitwise AND and
OR in DRAM. IEEE Comput. Archit. Lett., 14(2):127–131, July 2015.
[115] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun,
Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch,
and Todd C. Mowry. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy
and Initialization. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 185–197, New York, NY, USA, 2013.
ACM.
[116] Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. Gather-Scatter DRAM: In-DRAM
Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses. In
Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48,
pages 267–280, New York, NY, USA, 2015. ACM.
[117] Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, and Todd C. Mowry. The Evicted-address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing.
In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT ’12, pages 355–366, New York, NY, USA, 2012. ACM.
[118] Vivek Seshadri, Samihan Yedkar, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons,
Michael A. Kozuch, and Todd C. Mowry. Mitigating Prefetcher-Caused Pollution
Using Informed Caching Policies for Prefetched Blocks. ACM Trans. Archit. Code
Optim., 11(4):51:1–51:22, January 2015.
[119] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian,
John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. ISAAC: A
Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proc. ISCA, 2016.
[120] David Elliot Shaw, Salvatore Stolfo, Hussein Ibrahim, Bruce K. Hillyer, Jim Andrews,
and Gio Wiederhold. The NON-VON Database Machine: An Overview. http://hdl.
handle.net/10022/AC:P:11530., 1981.
[121] A. Singh. Mac OS X Internals: A Systems Approach. Addison-Wesley Professional,
2006.
[122] Allan Snavely and Dean M. Tullsen. Symbiotic Jobscheduling for a Simultaneous
Multithreaded Processor. In Proceedings of the Ninth International Conference on
Architectural Support for Programming Languages and Operating Systems, ASPLOS
IX, pages 234–244, New York, NY, USA, 2000. ACM.
[123] Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware
Prefetchers. In Proceedings of the 2007 IEEE 13th International Symposium on High
Performance Computer Architecture, HPCA ’07, pages 63–74, Washington, DC, USA,
2007. IEEE Computer Society.
[124] Sudarshan M. Srinivasan, Srikanth Kandula, Christopher R. Andrews, and Yuanyuan
Zhou. Flashback: A Lightweight Extension for Rollback and Deterministic Replay
for Software Debugging. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’04, pages 3–3, Berkeley, CA, USA, 2004. USENIX
Association.
[125] Harold S. Stone. A Logic-in-Memory Computer. IEEE Trans. Comput., 19(1):73–78,
January 1970.
[126] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu. The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost. In ICCD,
2014.
[127] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu. BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling. IEEE Transactions on Parallel and Distributed Systems, 2016.
[128] L. Subramanian, V. Seshadri, Yoongu Kim, B. Jaiyen, and O. Mutlu. MISE: Providing
performance predictability and improving fairness in shared main memory systems. In
IEEE 19th International Symposium on High Performance Computer Architecture,
pages 639–650, Feb 2013.
[129] Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu.
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-application Interference at Shared Caches and Main Memory. In Proceedings of the
48th International Symposium on Microarchitecture, MICRO-48, pages 62–75, New
York, NY, USA, 2015. ACM.
[130] Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, and Al Davis. Micro-pages: Increasing DRAM Efficiency with Locality-aware
Data Placement. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural
Support for Programming Languages and Operating Systems, ASPLOS XV, pages 219–
230, New York, NY, USA, 2010. ACM.
[131] Zehra Sura, Arpith Jacob, Tong Chen, Bryan Rosenburg, Olivier Sallenave, Carlo
Bertolli, Samuel Antao, Jose Brunheroto, Yoonho Park, Kevin O’Brien, and Ravi Nair.
Data access optimization in a processing-in-memory system. In Proceedings of the 12th
ACM International Conference on Computing Frontiers, CF ’15, pages 6:1–6:8, New
York, NY, USA, 2015. ACM.
[132] Aniruddha N. Udipi, Naveen Muralimanohar, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, and Norman P. Jouppi. Rethinking DRAM Design and Organization
for Energy-constrained Multi-cores. In Proceedings of the 37th Annual International
Symposium on Computer Architecture, ISCA ’10, pages 175–186, New York, NY, USA,
2010. ACM.
[133] Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and Onur Mutlu. DASH:
Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with
Hardware Accelerators. ACM Trans. Archit. Code Optim., 12(4):65:1–65:28, January
2016.
[134] Hans Vandierendonck and Andre Seznec. Fairness Metrics for Multi-Threaded Processors. IEEE Comput. Archit. Lett., 10(1):4–7, January 2011.
[135] Carl A. Waldspurger. Memory Resource Management in VMware ESX Server. SIGOPS
Oper. Syst. Rev., 36(SI):181–194, December 2002.
[136] F.A. Ware and C. Hampel. Improving Power and Data Efficiency with Threaded
Memory Modules. In ICCD, 2006.
[137] Benjamin Wester, Peter M. Chen, and Jason Flinn. Operating System Support for
Application-specific Speculation. In Proceedings of the Sixth Conference on Computer
Systems, EuroSys ’11, pages 229–242, New York, NY, USA, 2011. ACM.
[138] H. S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi,
and K. E. Goodson. Phase Change Memory. Proceedings of the IEEE, 98(12):2201–
2227, Dec 2010.
[139] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. Compressing Bitmap Indexes for
Faster Search Operations. In Proceedings of the 14th International Conference on Scientific and Statistical Database Management, SSDBM ’02, pages 99–108, Washington,
DC, USA, 2002. IEEE Computer Society.
[140] Xi Yang, Stephen M. Blackburn, Daniel Frampton, Jennifer B. Sartor, and Kathryn S.
McKinley. Why Nothing Matters: The Impact of Zeroing. In OOPSLA, pages 307–324,
New York, NY, USA, 2011. ACM.
[141] HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachel Harding, and Onur
Mutlu. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In Proceedings of the 2012 IEEE 30th International Conference on Computer Design (ICCD
2012), ICCD ’12, pages 337–344, Washington, DC, USA, 2012. IEEE Computer Society.
[142] Hanbin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur
Mutlu. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories. ACM Trans. Archit. Code Optim., 11(4):40:1–40:25, December
2014.
[143] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, pages 85–98, New York,
NY, USA, 2014. ACM.
[144] Tao Zhang, Ke Chen, Cong Xu, Guangyu Sun, Tao Wang, and Yuan Xie. Half-DRAM:
A High-bandwidth and Low-power DRAM Architecture from the Rethinking of Fine-grained Activation. In Proceedings of the 41st Annual International Symposium on
Computer Architecture, ISCA ’14, pages 349–360, Piscataway, NJ, USA, 2014. IEEE
Press.
[145] Jishen Zhao, Onur Mutlu, and Yuan Xie. FIRM: Fair and High-Performance Memory
Control for Persistent Memory Systems. In Proceedings of the 47th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-47, pages 153–165, Washington, DC, USA, 2014. IEEE Computer Society.
[146] Li Zhao, Laxmi N. Bhuyan, Ravi Iyer, Srihari Makineni, and Donald Newell. Hardware
Support for Accelerating Data Movement in Server Platform. IEEE Trans. Comput.,
56(6):740–753, June 2007.
[147] Hongzhong Zheng, Jiang Lin, Zhao Zhang, Eugene Gorbatov, Howard David, and
Zhichun Zhu. Mini-rank: Adaptive DRAM Architecture for Improving Memory Power
Efficiency. In MICRO, 2008.
[148] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A Durable and Energy Efficient
Main Memory Using Phase Change Memory Technology. In Proceedings of the 36th
Annual International Symposium on Computer Architecture, ISCA ’09, pages 14–23,
New York, NY, USA, 2009. ACM.
[149] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti. Accelerating sparse
matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. In High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pages 1–6, Sept 2013.