Performance Beyond PCI Express: Moving Storage to the Memory Bus
A Technical Whitepaper
By Stephen Foskett
April 2014

Introduction

In the quest to eliminate bottlenecks and improve system performance, the state of the art has continually moved storage ever closer to the server CPU. External storage arrays, once the champions of performance, are now decidedly “second tier” compared to solid-state storage located inside the server. Until now, the fastest storage attachment has been PCI Express, and PCIe flash devices have commanded attention accordingly. But what if there were something better? Moving storage to the memory bus promises to make I/O faster, more scalable, and more consistent.

The Limits of PCI Express

PCI Express (PCIe) would seem to be the ultimate location for storage. After all, previous storage attachments (Ethernet, Fibre Channel, and InfiniBand) relied on PCIe cards for server connectivity. This is why it was so impressive when companies placed flash memory directly on PCIe cards, with no additional intermediate bus. But PCIe is a limited resource: servers have just a few dozen PCIe lanes, so issues of contention arise. These are the fundamental limits of PCIe-based storage:

1. Because PCIe is not a native storage interface, PCIe storage solutions need an onboard controller to allocate resources and mediate between the flash chips on the card and the demands of server I/O coming from the bus.

2. Because PCIe lanes are limited, systems can only hold a few PCIe storage devices, or they must make do with fewer lanes and less available bandwidth.

3. Because PCIe slots are limited, it can be difficult to scale a PCIe storage solution without a major investment, and slots are in demand for other peripheral cards.

4. Because PCIe devices are bus masters, a DMA transfer can interrupt system memory access. While that transfer is in progress, no other data can move to or from the flash storage.

These limits do not always make themselves felt, but performance-sensitive applications are already reaching them. NAND flash chips are incredibly fast, but placing them behind a PCIe controller keeps applications from exploiting their full performance potential. Another concern is the amount of time an application must wait for a read or write operation to complete, known as I/O latency. Layers of virtualization and translation, as well as many outstanding I/O operations, can make this aspect of performance unpredictable. Finally, many applications simply require more flexible capacity than a single PCIe card can deliver: some will use less than the capacity of the smallest card, while others will require slightly more than the capacity of the cards available.

Flooding the PCIe SSD Controller

PCI Express is a scalable system bus that uses a number of serial interconnects known as “lanes” to transfer data between expansion cards and a computer. These PCIe lanes typically terminate at the system chipset, with most modern systems having 16 or 32 lanes. PCIe originally supported 250 MB/s per lane; Gen 2 doubles that to 500 MB/s per lane. This means that a 32-lane Gen 2 system has a maximum aggregate throughput of 16 GB/s over PCIe, using all of the available lanes. Each PCIe card is limited to a fraction of the available lanes in the system.
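The lane arithmetic is worth making concrete. The short sketch below (Python, an illustration only) simply multiplies lane counts by the 500 MB/s Gen 2 per-lane figure cited above; the results are theoretical ceilings, not measured throughput.

    # Peak PCIe Gen 2 bandwidth for a given lane count, using the
    # 500 MB/s-per-lane figure cited in the text (Gen 1 was 250 MB/s per lane).
    PER_LANE_MB_S = 500

    def aggregate_gb_s(lanes, per_lane_mb_s=PER_LANE_MB_S):
        """Theoretical peak throughput in GB/s for a PCIe link of the given width."""
        return lanes * per_lane_mb_s / 1000

    for lanes in (4, 16, 32):
        print(f"{lanes:>2} lanes -> {aggregate_gb_s(lanes):.0f} GB/s peak")

    # 4 lanes:  2 GB/s  (a typical low-end SSD or HBA)
    # 16 lanes: 8 GB/s  (an extreme-performance card)
    # 32 lanes: 16 GB/s (every lane of a 32-lane Gen 2 system)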
Low-end solid-state drives (SSDs) and host bus adapters (HBAs) might use 4 lanes, while extreme-performance options could use as many as 16. This gives a maximum aggregate bandwidth of between 2 GB/s and 8 GB/s for this type of solution. And of course other peripheral cards must also use some of the PCIe lanes.

Most high-performance applications are limited more by the total number of I/O operations per second (IOPS) they require than by the total number of bytes they can move. These applications can queue hundreds of thousands of read and write operations to the I/O subsystem in seconds, overwhelming conventional storage solutions and forcing further processing to wait until the I/O operations complete.

Consider a typical storage device: for each read or write request, it must receive and validate the operation, look up the proper location for the data, perform the read or write, and return the result to the requesting system. This process is the same whether the device is a hard disk drive, an enterprise storage array, or a PCIe card. The only difference is that the PCIe card sits on a fast, low-latency bus and sees many more operations than a remote device would.

This relationship between latency, IOPS, and throughput has long been a focus of storage system optimization. No solution can maintain maximum performance when crushing levels of I/O requests are outstanding. The challenge in system design is to provide sufficient performance at an acceptable level of latency.[1]

This flood of data can cause puzzling performance issues for PCIe storage devices. As shown below in a graph of real-world benchmark data produced by Diablo Technologies™, a typical PCIe storage device can easily handle tens of thousands of IOPS without any trouble. But once enough operations are pending, the wait time (latency) for each operation increases. In this example, 72,000 IOPS sent latency shooting up past 1 millisecond.[2] The PCIe SSD controller was simply overwhelmed with I/O requests.

Figure 1: IOPS vs. Latency

This “100% writes” example is extreme, to be sure, since writes are slower to perform with NAND flash than reads. But it shows how difficult it is for even a high-performance PCIe storage controller to keep up with tens of thousands of writes every second. In this situation, the application would see each I/O operation becoming slower, even as the storage system kept accepting more work.

[1] For example, see the article “The Fundamental Characteristics of Storage” at FlashDBA: http://flashdba.com/2013/04/08/the-fundamental-characteristics-of-storage/
[2] For another example, see The SSD Review’s “Micron P320h HHHL 700GB PCIe Enterprise SSD Review”: http://www.thessdreview.com/our-reviews/micron-p320h-hhhl-700gb-pcieenterprise-ssd-review-vertical-integration-high-iops-and-absurd-endurance/5/

Indeterminate Latency

This brings up the issue of predictable latency. If an application can have I/O operations completed in well under 100 microseconds most of the time, yet latency shoots up 10x or more once the workload gets heavy, that unpredictability interferes with application performance. Even worse, it happens just when the application is busiest. Imagine a system designed to handle high-volume transactions.
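A naive profiling pass for such a system might look like the hypothetical sketch below (plain Python, an illustration only, not any vendor’s benchmark harness). It times synchronous 4 KiB writes one at a time against a placeholder file and reports mean versus 99th-percentile latency; the mean is exactly the flattering number that can mislead.

    # profile_sketch.py -- a hypothetical, minimal latency probe (illustration only).
    # It issues synchronous 4 KiB writes to a placeholder file and reports the mean
    # and 99th-percentile latency. A real test would target the device under test
    # and drive much deeper queue depths.
    import os
    import statistics
    import time

    TEST_FILE = "latency_probe.dat"   # placeholder target, not a real device path
    BLOCK = os.urandom(4096)          # 4 KiB payload, a common transactional I/O size
    OPERATIONS = 20_000

    def profile_writes(path, count):
        latencies = []
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            for i in range(count):
                start = time.perf_counter()
                os.pwrite(fd, BLOCK, i * len(BLOCK))
                os.fsync(fd)          # push the write to stable storage before stopping the clock
                latencies.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.unlink(path)
        return latencies

    samples = profile_writes(TEST_FILE, OPERATIONS)
    mean_us = statistics.mean(samples) * 1e6
    p99_us = statistics.quantiles(samples, n=100)[98] * 1e6
    print(f"mean: {mean_us:.0f} us   99th percentile: {p99_us:.0f} us")

Because every write here is issued at a queue depth of one, the averages will look comfortable; it is only under deep queues and sustained load, as in Figure 1 and the graph that follows, that the spikes appear.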
A developer might profile a new PCIe storage device by generating what he considers to be a reasonable volume of requests and determine that it can process tens of thousands of I/O operations per second with under 100 microseconds of latency. Happy with this level of performance, he architects his new trading system on the assumption that all reads and writes will complete within that time. But once he puts it into production, a surge of data can easily cause performance to drop by a factor of 10 or more.

This unpredictable latency is illustrated in the graph below.[3] Here, a PCIe subsystem is able to perform just as described above. Yet the controller periodically delays I/O operations, causing latency to spike. In this case, the performance degradation happened even under moderate load, as the controller performed tasks like wear leveling and garbage collection. An application relying on predictable performance would be undermined by these spikes.

[3] This chart shows performance measurements using the storage subsystem from AMPS, a high-performance messaging database platform from 60East Technologies®.

Figure 2: AMPS 15% Read Mix

Contention for PCIe or flash controller resources can make performance unpredictable in similar ways. If a system has multiple high-performance PCIe cards, a workload spike on one can interfere with the others. This is true of I/O cards like Ethernet and InfiniBand adapters as well as PCIe storage devices.

The sort of applications that demand extreme levels of performance also require consistency. They must be written to tolerate worst-case latency because these I/O subsystems simply cannot guarantee performance at all times. So storage must be over-specified to ensure that worst-case latency does not exceed tolerable thresholds.

Introducing Memory Channel Storage™ (MCS™)

In the graph above, the orange line immediately draws attention. What is this “MCS” with lower latency than PCIe and, more importantly, predictable performance under load?

Although most server I/O passes over the PCIe bus, systems have another, separate interface for memory. This memory bus can push tens of gigabytes per second at latencies measured in nanoseconds. Until now, this interface has been used only for dynamic RAM, never for any type of non-volatile storage.

As the industry moved storage to the PCIe bus, technologists considered what might come next. Since CPUs already support prodigious memory capacity, and solid-state storage already relies on memory chips (albeit NAND flash rather than dynamic RAM), attention turned to adding persistent storage on these memory channels. Dynamic RAM is very expensive, and many systems have memory slots left empty.

Memory Channel Storage is the creation of Diablo Technologies, and it is the “MCS” shown in the graph above. It places NAND flash on cards that reside alongside DRAM in system memory slots. Of course, flash is not the same as DRAM, so systems need special drivers to make use of this type of storage. But Diablo has developed the supporting technologies needed to make MCS feasible.

The benefits of moving storage to the memory bus are many:

1. Applications can access persistent storage more quickly. Moving flash to the memory channel places storage even closer to the CPU than PCIe, reducing latency and improving performance.

2. Applications can have both high IOPS and low latency simultaneously.
Memory subsystems are designed to be highly parallel, and systems can spread the I/O load across multiple MCS cards more easily than they can across multiple PCIe cards.

3. Applications can create, read, update, and delete data with predictable response times. Memory channels must be highly deterministic to support system RAM, so there is a much lower chance of contention hampering performance.

4. Memory access is integrated with the CPU. Moving data from an MCS-based storage device to system memory is no more than a “copy” from one address to another within the memory controller, so neither the processor nor the system bus is interrupted each time data is transferred.

Thus, many of the limitations that become apparent when performance-sensitive applications are moved to PCIe flash storage are reduced or eliminated through the use of Memory Channel Storage. Predictable performance can be critical for proper system sizing, and the storage architecture can match application needs more exactly, both in terms of capacity and performance.

Moving Storage to Memory Channels

Storage architectures are being optimized per application, with capacity and performance diverging dramatically. Data-heavy applications are moving to scale-out storage while high-performance applications are using server-side flash. The applications that demand extreme performance are the same ones affected by unpredictable latency and limited scalability. Memory Channel Storage promises an alternative architecture that addresses both, delivering consistent performance and greater scalability.

It is a clever approach that puts to use resources already present in today’s servers, namely empty memory slots. In the short term, applications gain scalable persistent storage with predictable performance. In the longer term, servers could shrink as fewer PCIe slots and drive bays are needed. One can also envision a future in which applications are written to take advantage of non-volatile memory just as they currently use DRAM.

About the Author

Stephen Foskett is an active participant in the world of enterprise information technology, focused on enterprise storage, server virtualization, networking, and cloud computing. A long-time voice in the storage industry, Stephen has authored numerous articles for industry publications and is a popular presenter at industry events. His contributions to the enterprise IT community have earned him recognition as both a Microsoft MVP and VMware vExpert.

Foskett organizes the popular Tech Field Day event series for Gestalt IT. His company, Foskett Services, is focused on content and community: creating technical events, writing, producing video, and connecting the IT world through social media. He can be found online at FoskettServices.com, TechFieldDay.com, blog.Fosketts.net, and on Twitter as @SFoskett.