Performance Beyond PCI Express:
Moving Storage to The Memory Bus
A Technical Whitepaper
By Stephen Foskett
April 2014
Introduction
In the quest to eliminate bottlenecks and improve system performance, the state of
the art has continually moved storage ever closer to the server CPU. External
storage arrays, once the champions of performance, are now decidedly “second
tier” compared to solid state storage located inside the server. Until now, the fastest
storage attachment was PCI Express, with such flash storage devices commanding
attention. But what if there was something better? Moving storage to the memory
bus promises to make I/O faster, more scalable and more consistent.
The Limits of PCI Express
PCI Express (PCIe) would seem to be the ultimate location for storage. After all,
previous storage attachments (Ethernet, Fibre Channel, and InfiniBand) relied on
PCIe cards for server connectivity. This is why it was so impressive when
companies placed flash memory directly on PCIe cards, with no additional
intermediate bus. But PCIe is a limited resource: Servers have just a few dozen PCIe lanes, so issues of contention arise.
These are the fundamental limits of PCIe-based storage:
1. Because PCIe is not a native storage interface, PCIe storage solutions need
an onboard controller to allocate resources and mediate between the flash
chips on the card and the demands of server I/O coming from the bus.
2. Because PCIe lanes are limited, systems can only hold a few PCIe storage
devices, or they must make do with fewer lanes and less available
bandwidth.
3. Because PCIe slots are limited, it can be difficult to scale a PCIe storage
solution without a major investment, and slots are in demand for other
peripheral cards.
4. Because PCIe devices rely on “bus mastering”, they can interrupt system memory access when a DMA transfer is invoked. While this transfer is in progress, no other movement of data to or from the flash storage can take place.
These limits do not always make themselves felt, but performance-sensitive
applications are already reaching them. NAND flash chips are incredibly fast, but
placing them behind a PCIe controller keeps applications from exploiting their
performance potential. Another concern is the amount of time an application must
wait for a read or write operation to be completed, a concept known as I/O latency.
Layers of virtualization and translation, as well as many outstanding I/O
operations, can make this aspect of performance unpredictable. Finally, many
applications simply require more flexible capacity than a single PCIe card can
deliver: Some will use less than the capacity of the smallest card, while others will
require slightly more than the capacity of the cards available.
Flooding the PCIe SSD Controller
PCI Express is a scalable systems bus using a number of serial interconnects
known as “lanes” to transfer data between expansion cards and a computer. These
PCIe lanes typically terminate at the system chipset, with most modern systems
having 16 or 32 lanes. The original PCIe specification supported 250 MB/s per lane; PCIe Gen 2 doubles this to 500 MB/s per lane. This means that a 32-lane Gen 2 system has
a maximum aggregate throughput of 16 GB/s over PCIe – using all of the available
lanes.
Each PCIe card is limited to a fraction of the available lanes in the system. Low-end solid-state drives (SSDs) and host bus adapters (HBAs) might use 4 lanes,
while extreme-performance options could use as many as 16. This gives a
maximum aggregate bandwidth of between 2 GB/s and 8 GB/s for this type of
solution. And of course other peripheral cards must also use some of the PCIe
lanes.
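
The arithmetic behind these figures is simple enough to capture in a short sketch. The Python snippet below just multiplies lane count by per-lane throughput, using the 250 MB/s and 500 MB/s per-lane figures quoted above; the lane counts are illustrative examples, not measurements of any particular server or card.

# Rough PCIe bandwidth arithmetic using the per-lane figures quoted in the text:
# 250 MB/s per lane for the original specification, 500 MB/s per lane for Gen 2.
PER_LANE_MB_S = {"gen1": 250, "gen2": 500}

def aggregate_bandwidth_gb_s(lanes, generation="gen2"):
    """Theoretical aggregate throughput in GB/s for a given lane count."""
    return lanes * PER_LANE_MB_S[generation] / 1000.0

print(aggregate_bandwidth_gb_s(32))   # whole 32-lane Gen 2 bus: 16.0 GB/s
print(aggregate_bandwidth_gb_s(4))    # x4 low-end SSD or HBA: 2.0 GB/s
print(aggregate_bandwidth_gb_s(16))   # x16 extreme-performance card: 8.0 GB/s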
Most high-performance applications are limited more by the total number of I/O
operations per second (IOPS) they require than by the sheer number of bytes they can send. These applications can queue hundreds of thousands of read and
write operations to the I/O subsystem in seconds, overwhelming conventional
storage solutions and forcing further processing to wait until the I/O operations
complete.
Consider a typical storage device: For each read or write request, it must receive
and validate the operation, look up the proper location for the data, perform the
read or write, and return the result to the requesting system. This process is the
same whether the device is a hard disk drive, an enterprise storage array, or a PCIe
card. The only difference is that the PCIe card is located on a fast, low-latency bus
and can see many more operations than a remote device.
This relationship between latency, IOPS, and throughput has long been a focus of
storage system optimization. No solution can maintain maximum performance
when crushing levels of I/O requests are outstanding. The challenge in system
design is to provide sufficient performance at an acceptable level of latency.[1]
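
One standard way to reason about that relationship is Little's Law, which ties outstanding I/O operations to IOPS and latency (outstanding I/Os = IOPS × average latency). The sketch below is a generic illustration of that queueing relationship, not a model of any specific device, and the 70,000 IOPS figure is purely illustrative.

# Little's Law applied to a storage queue:
#   outstanding I/Os = IOPS * average latency
# Rearranged, average latency = queue depth / IOPS, so once a device hits its
# IOPS ceiling, adding more outstanding requests only adds latency.

def average_latency_ms(queue_depth, iops):
    """Average latency in milliseconds implied by a queue depth and IOPS rate."""
    return queue_depth / iops * 1000.0

# Hypothetical device sustaining 70,000 IOPS:
for depth in (8, 32, 128, 512):
    print(f"queue depth {depth:4d} -> ~{average_latency_ms(depth, 70_000):.2f} ms")

This is why the benchmark described below shows latency climbing once the controller saturates: the extra requests are simply waiting in line.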
This flood of data can cause puzzling performance issues for PCIe storage devices.
As shown below in a graph of real-world benchmark data produced by Diablo
Technologies™, a typical PCIe storage device can easily handle tens of thousands
of IOPS without any trouble. But once enough operations are pending, the wait
time (latency) for each operation increases. In this example, 72,000 IOPS sent the
latency shooting up past 1 millisecond.[2] The PCIe SSD controller was simply
overwhelmed with I/O requests.
Figure 1: IOPS vs. Latency
[1] For example, see the article, “The Fundamental Characteristics of Storage” at FlashDBA,
http://flashdba.com/2013/04/08/the-fundamental-characteristics-of-storage/
[2] For another example, see The SSD Review’s “Micron P320h HHHL 700GB PCIe Enterprise SSD Review”,
http://www.thessdreview.com/our-reviews/micron-p320h-hhhl-700gb-pcie-enterprise-ssd-review-vertical-integration-high-iops-and-absurd-endurance/5/

This “100% writes” example is extreme, to be sure, since writes are slower to
perform with NAND flash than reads. But it shows how difficult it is for even a
high-performance PCIe storage controller to keep up with tens of thousands of
writes every second. In this situation, the application would see each I/O operation
becoming slower, even as the storage system was able to keep accepting more
work.
Indeterminate Latency
This brings up the issue of predictable latency. If an application can have I/O
operations completed in well under 100 microseconds most of the time, yet latency
shoots up 10x or more once the workload gets heavy, it can interfere with
application performance. Even worse, this happens just when the application is the
busiest.
Imagine a system designed to handle high-volume transactions. A developer might
profile a new PCIe storage device by generating what he considers to be a
reasonable volume of requests and determine that it can process tens of thousands
of I/O operations per second with under 100 microseconds of latency. Happy with
this level of performance, he architects his new trading system to assume that all
reads and writes will be processed in less than this time. But once he puts it into
production, a surge of data could easily cause latency to increase by a factor of 10 or more.
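
One practical safeguard is to profile tail latency rather than the average: a device that looks comfortable at the 50th percentile can still blow through a 100-microsecond budget at the 99th. The sketch below demonstrates the idea with synthetic latency samples invented purely for illustration; nothing here comes from the benchmark data discussed in this paper.

# Tail-latency profiling: averages hide the spikes that hurt a busy application.
# The latency samples below are synthetic, generated only to show the technique.
import random

random.seed(0)
# Mostly ~80 us operations, with an occasional ~10x outlier standing in for a
# stalled controller (garbage collection, wear leveling, and so on).
samples_us = [random.gauss(80, 10) if random.random() > 0.02 else random.gauss(900, 100)
              for _ in range(100_000)]

def percentile(data, pct):
    """Return the pct-th percentile of a list of latency samples."""
    ordered = sorted(data)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100.0))
    return ordered[index]

for pct in (50, 95, 99, 99.9):
    print(f"p{pct}: {percentile(samples_us, pct):.0f} us")
# The median stays near 80 us, while the upper percentiles jump by an order of magnitude.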
This unpredictable latency is illustrated in the graph below.[3] Here, a PCIe
subsystem is able to perform just as described above. Yet the controller
periodically delays I/O operations, causing latency to spike. In this case, the
performance degradation happened even under moderate load, as the controller
performed tasks like wear leveling and garbage collection. An application relying on predictable performance would suffer under these conditions.
[3] This chart shows performance measurements using the storage subsystem from AMPS, a high-performance messaging database platform from 60East Technologies®.
Figure 2: AMPS 15% Read Mix
Contention for PCIe or flash controller resources can make performance
unpredictable in similar ways. If a system has multiple high-performance PCIe
cards, a workload spike on one can interfere with the others. This is true of I/O
cards like Ethernet and InfiniBand adapters as well as PCIe storage devices.
The sort of applications that demand extreme levels of performance also require
consistency. They must be written to tolerate worst-case latency because these I/O
subsystems simply cannot guarantee performance at all times. So storage must be
over-specified to ensure that worst-case latency does not exceed tolerable
thresholds.
Introducing Memory Channel Storage™ (MCS™)
In the graph above, the orange line immediately draws attention. What is this “MCS” with lower latency than PCIe and, more importantly, more predictable performance under load?
Although most server I/O passes over the PCIe bus, systems have another, separate
interface for memory. This memory bus can push tens of gigabytes per second at
latencies measured in nanoseconds. This interface has only been used for dynamic
RAM in the past, rather than any type of non-volatile storage.
As the industry moved storage to the PCIe bus, technologists considered what
might come next. Since CPUs already have prodigious memory capacity, and
solid-state storage already relies on memory chips (albeit NAND flash rather than
dynamic RAM), attention turned to adding persistent storage on these memory
channels. Dynamic RAM is very expensive, and many systems have memory slots sitting empty.
Memory Channel Storage is the creation of Diablo Technologies, and this is the
“MCS” shown on the graph above. It places NAND flash on cards which reside
alongside DRAM in system memory slots. Of course, flash is not the same as
DRAM, so systems need special drivers to make use of this type of storage. But
Diablo has developed the supporting technologies needed to make MCS feasible.
The benefits of moving storage to the memory bus are many:
1. Applications can access persistent storage more quickly. Moving flash to the
memory channel places storage even closer to the CPU than PCIe, reducing
latency and improving performance.
2. Applications can have both high IOPS and low latency
simultaneously. Memory subsystems are designed to be highly parallel, and
systems can spread the I/O load across multiple MCS cards more easily than
trying to use multiple PCIe cards.
3. Applications can create, read, update, and delete data with predictable
response times. Memory channels must be highly deterministic to support
system RAM, so there is a much lower chance of contention hampering
performance.
4. Memory access is integrated with the CPU. Moving data from MCS-based storage devices to system memory is no more than a “copy” from one address to another within the memory controller, so neither the processor nor the system bus is interrupted each time data is transferred (a generic sketch of this memory-style access follows this list).
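
To give a feel for what memory-style access to persistent storage looks like from software, the generic sketch below memory-maps an ordinary file and touches it with plain memory operations instead of block I/O calls. This is only an illustration of the access model using the standard POSIX mmap facility; it is not Diablo's MCS driver interface, and the file path is hypothetical.

# Generic illustration of memory-semantics access to persistent storage.
# This is NOT the MCS driver API: it simply memory-maps an ordinary file so
# that reads and writes become memory copies rather than block I/O requests.
import mmap
import os

PATH = "/tmp/hypothetical_persistent_region.bin"   # hypothetical backing file
SIZE = 4096

# Create a fixed-size backing file to stand in for a persistent region.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

fd = os.open(PATH, os.O_RDWR)
region = mmap.mmap(fd, SIZE)

# A "write" is just a slice assignment into mapped memory...
region[0:11] = b"hello world"
# ...and a "read" is just a slice copy back out.
print(region[0:11])

region.flush()    # ask the OS to push dirty pages back to the backing store
region.close()
os.close(fd)

The point of the sketch is the access model: once storage sits on the memory channel, moving data into working memory is just a copy rather than a trip through a peripheral bus.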
Thus, many of the limitations that can become apparent when performance-sensitive applications are moved to PCIe flash storage are reduced or eliminated
through the use of Memory Channel Storage. Predictable performance can be
critical for proper system sizing, and the storage architecture can match application
needs more exactly, both in terms of capacity and performance.
Moving Storage to Memory Channels
Storage architectures are being optimized per application, with capacity and
performance diverging dramatically. Data-heavy applications are moving to scale-out storage while high-performance applications are using server-side flash. The sort of applications that demand extreme performance are the same ones affected by unpredictable latency and limited scalability. Memory Channel Storage promises an alternative architecture that delivers both predictable latency and greater scalability.
This is a clever approach to utilize resources already present in today’s servers,
namely empty memory slots. In the short term, applications gain scalable persistent
storage with predictable performance. But the future could see servers shrink as
fewer PCIe slots and drive bays are needed. One can also envision a future where
applications are written to take advantage of non-volatile memory just as they
currently use DRAM.
About the Author
Stephen Foskett is an active participant in
the world of enterprise information
technology, focused on enterprise storage,
server virtualization, networking, and
cloud computing. A long-time voice in the
storage industry, Stephen has authored
numerous articles for industry
publications, and is a popular presenter at
industry events. His contributions to the
enterprise IT community have earned him
recognition as both a Microsoft MVP and
VMware vExpert.
Foskett organizes the popular Tech Field Day event series for Gestalt IT. His
company, Foskett Services, is focused on content and community, creating
technical events, writing, and video productions, and connecting the IT world
through social media. He can be found online at
FoskettServices.com, TechFieldDay.com, blog.Fosketts.net, and on Twitter as
@SFoskett.