EDN, January 4, 1996, Cover Story
Souped-up memories boost system performance
Markus Levy, Technical Editor

The fast-page-mode DRAM has given way to higher performance memories such as extended-data-out and synchronous DRAMs. The price barriers of memories are eroding, but system designers must work with new interfaces and architectures to accommodate these memory devices.

Welcome to the revolution in the multibillion-dollar memory industry. Last year's specialty memories are the basis for today's mainstream devices. The features of these memory technologies are enabling more affordable, higher performance systems. However, these memories are generating controversies pertaining to the fundamental architecture of the PC. One such controversy is the use of the unified memory architecture (UMA). Despite the debate surrounding it, UMA is gaining popularity among PC OEMs that want to reduce system costs. Another issue is the use or removal of level 2 (L2) cache.

Other memory-related topics are surfacing in embedded applications. Most of the µPs used in these applications tend to operate with bus speeds significantly lower than the Pentium-class devices in PCs. Embedded applications, unlike PCs, have long life cycles of between five and 10 years. When designing these embedded applications and processors, you must choose a memory device that provides adequate performance. Equally important, you should select a cost-effective device that will be available for several years.

Except for changes in process technology, the basic DRAM cell hasn't changed since its inception. The architectural differences affecting DRAM performance are the faster input and output interfaces and the slowly increasing bus width. In the PC's main-memory arena, the long life of the fast-page-mode DRAM (FPM) is ending, giving way to extended-data-out DRAMs (EDOs).
EDOs, with nearly 50% higher peak bandwidth than FPMs, are also rapidly approaching their peak. Meanwhile, synchronous DRAMs (SDRAMs), Rambus DRAMs (RDRAMs), and, possibly, burst-EDO DRAMs (BEDOs), all of which offer 50% or greater performance gains over EDO, are waiting to conquer the market. Other technically innovative DRAM technologies, such as Ramtron's Enhanced DRAM (EDRAM) and the IEEE's RamLink and SyncLink, have little chance of becoming mainstream in the short term because of limited sources for these devices.

Understanding the UMA Trade-offs

More so now than ever before, it seems every PC OEM is trying to save money. Faster DRAMs, combined with the computer OEM's desire to reduce costs, have brought about the feasibility of the unified memory architecture (UMA). A PC with a UMA eliminates the separate frame-buffer memory of the graphics subsystem and allows a potential cost savings of between $30 and $60. On the downside, UMA is also known as the shared frame buffer, because the graphics subsystem must "steal" part of the PC's main memory.

UMA is drumming up significant activity in the PC industry. The potential cost-saving architecture is also raising many questions. First, there is uncertainty as to whether there is a UMA standard. Also, many in the industry wonder if the architecture saves money. Finally, some wonder who benefits from the architecture, what its performance impacts are, and whether it only targets low-end systems.

Currently, there isn't a standard method for designing a UMA system. As a result, two divergent paths have emerged: tightly and loosely coupled. Weitek has developed a chip set for the 486 that integrates the core logic on the same die as the graphics controller. The company has enthusiastically termed this integrated approach "tightly coupled" to signify high performance.
The other approach, inaccurately referred to as "loosely coupled," uses specific handshaking signals between a separate core-logic chip and a graphics controller. A subcommittee of VESA, VESA UMA (VUMA), is supporting this approach. VUMA, led by Opti, consists of a consortium of graphics-controller and core-logic vendors. The consortium has published a conceptual specification detailing the hardware and BIOS extensions for its UMA approach (Reference 3).

UMA has advantages and disadvantages. A separate frame buffer typically wastes memory. Suppose you use frame-buffer EDO DRAMs in 256kx16-bit configurations. (Silicon Magic has a 256kx32-bit EDO DRAM.) It takes two of these devices to yield a 32-bit interface and a minimum memory granularity of 1 Mbyte. Whether running an 800x600-pixel or a 1024x768-pixel display with 8-bit color, you use only 470 or 768 kbytes, respectively, of a 1-Mbyte frame buffer. When using 2-Mbyte frame buffers for high-performance 64-bit operation, more than 1 Mbyte is still wasted unless the user runs 16- or 24-bit color modes or uses 1280x1024-pixel or 1600x1200-pixel displays. With UMA, you can allocate the exact amount of graphics memory needed. However, drivers such as Virtual Desktop require a full 1 or 2 Mbytes regardless of the monitor type or display mode.

Theoretically, UMA allows you to dynamically change the size of the graphics frame buffer, depending on the application you are running. Existing PC operating systems, however, don't allow you to change the frame-buffer size on the fly. Instead, you must set the new size and then perform a system reset to make the change. Although a system reset is simpler than having to open the PC and modify jumper settings, there are still complex issues to deal with. For example, Win95 permits any application to switch the system to its preferred display mode and then return the system to the default display mode when the application loses foreground focus.
VGA memory must consist of dedicated physical memory pages, completely off limits to the system's virtual-memory manager. To increase the allocated display memory, the system would have to move application code and data out of the physical pages needed for the increased VGA requirements.

If your system has a 1-Mbyte frame buffer, you may even see a performance improvement in graphics by using UMA. As discussed earlier, a 1-Mbyte frame buffer has a 32-bit interface. However, if the shared frame buffer is part of the system's 64-bit main-memory subsystem, you can potentially double the bandwidth. Tightly coupled UMA may also improve graphics performance by allowing the CPU to perform 3-D rendering directly to the frame buffer without having to cross a slower bus, such as the PCI bus, to get to memory.

Despite the potential graphics-performance improvement seen with UMA, this architecture negatively affects overall system performance. For example, if you start with a system running Windows 3.1 or Win95 that has 8 Mbytes of main memory and uses 768 kbytes for graphics, you'll see a 10 to 15% performance degradation. The negative effects are less noticeable when operating a system with 16 Mbytes, because 16 Mbytes is more than Win95 needs. However, the 8 to 10% reduction in system memory that allocating a 1- to 2-Mbyte frame buffer entails increases the amount of swapping between main memory and the hard disk. Ironically, UMA targets low-end systems but only provides adequate performance in a system with 16 Mbytes of DRAM and an L2 cache. You can also argue that a UMA system using EDO DRAMs gains roughly 20% in performance when it uses an L2 cache. Having to use an L2 cache to support UMA depletes the cost savings of UMA.

You can also determine the effects of UMA on system performance by analyzing the memory-bandwidth utilization. A 66-MHz system bus has a total available bandwidth of 528 Mbytes/sec.
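The arithmetic behind that 528-Mbytes/sec figure, and the screen-refresh traffic the utilization analysis relies on, reduces to a few multiplications. A minimal sketch in Python (the function names are illustrative, and the CPU-traffic number is the article's estimate, not a measurement):

```python
# Rough memory-bandwidth budget for a 66-MHz, 64-bit Pentium system,
# using the figures quoted in the article.

BUS_MHZ = 66
BUS_WIDTH_BYTES = 8  # 64-bit data bus

def bus_peak_mbytes_per_sec():
    """Peak bus bandwidth: one 8-byte transfer per bus clock."""
    return BUS_MHZ * BUS_WIDTH_BYTES  # 66 x 8 = 528 Mbytes/sec

def refresh_mbytes_per_sec(x=1024, y=768, refresh_hz=76, bits_per_pixel=8):
    """Average screen-refresh traffic, excluding redraw operations."""
    return x * y * refresh_hz * bits_per_pixel / 8 / 1e6

peak = bus_peak_mbytes_per_sec()     # 528
refresh = refresh_mbytes_per_sec()   # ~60 for 1024x768, 8-bit color, 76 Hz
cpu = 100                            # low end of the 100- to 150-Mbytes/sec cache-miss estimate
print(f"peak: {peak}, refresh: {refresh:.0f}, CPU + refresh: {cpu + refresh:.0f}")
```

Roughly 160 of the 528 Mbytes/sec is already spoken for before any graphics redraw or DMA traffic is counted.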
Each execution cycle, the Pentium (or any superscalar processor) makes zero to three references to main memory. The number of references is related to the number of hits or misses to the L1 or L2 caches. However, a general rule is that approximately 20 to 30% of the Pentium-90’s bus bandwidth (that is, 100 to 150 Mbytes/sec) is consumed as a result of cache misses. The graphics controller is responsible for performing a mixture of screen refresh and redraw operations. The mixture depends on the application. Screen refresh typically occurs 60 to 76 times per second. If you are using a word processor, redraws occur only four times per second (depending on how fast you type). Multimedia applications are more redraw-intensive. During screen refresh, the graphics frame buffer contents must be read and sent to the RAMDAC and then to the screen. A 1024x768-pixel screen with 8-bit color accesses memory at an average rate of 478 Mbps (1024x768x76x8), or 60 Mbytes/sec, excluding redraw operations. Combining the CPU’s bus utilization with screen refresh shows that 160 Mbytes/sec is consumed. DMA operations into main memory consume bandwidth even further. Systems with PCI bus masters that stream into main memory also consume bandwidth. Table 2 shows that the latency of any of the DRAMs (page hit or miss) severely limits the UMA’s performance. For example, a burst access that takes a page hit to EDO only delivers 213 Mbytes/sec; a page miss only delivers 177 Mbytes/sec. To achieve reasonable performance with a UMA system, you should use BEDO or SDRAM. Furthermore, you must also employ techniques to increase the burst lengths or at least maximize the number of DRAM page hits. The CPU accesses main memory when it needs to read or write a cache line (32 bytes in Pentium’s case). After each burst of four, system-control logic deasserts RAS. This starts the precharge for the next row access but, unfortunately, results in a page miss on the next CPU access. 
However, the page-miss latency is tRAC and not tRC. The graphics controller typically supports longer bursts of 64 to 128 bytes, which increases the memory's bandwidth. Graphics controllers with read-prefetch buffers and write FIFOs can also take advantage of an idle bus to transfer data.

VUMA has defined two priority levels for a graphics controller requesting the bus. A high-priority request has a predefined worst-case latency, which determines how quickly the core logic must give the bus to the graphics controller. For example, a high-priority request may arise when the graphics controller's buffer is empty. On the other hand, if a graphics controller has a 128-byte buffer, it may issue a low-priority request after transferring 40 bytes. This request notifies the core logic that the graphics controller needs the bus, although not immediately. Without this two-level priority scheme, overall system performance degrades, because the graphics controller must get the bus every time it "needs" it.

The core logic uses an arbitration scheme to determine when the CPU, graphics controller, or some other bus master can access main memory. Opti's core-logic chip set for UMA, called Viper-UMA, performs the arbitration during the DRAM-precharge period. Overlapping these operations helps to hide arbitration latencies. In systems with 16 Mbytes of main memory, Viper-UMA's patented hardware mechanism logically divides the memory into two banks. This division allows two masters to access the two banks concurrently, because the DRAM appears to be dual-ported. This access is similar to concurrency on the PCI bus, where deep buffers simulate dual access.

Weitek's chip set performs lossless compression on the graphics data to minimize the bandwidth impact. The graphics controller stores images by describing the length of solid-color blocks; typically, 70% of the data can be compressed.
When refreshing the screen, the controller sends only new data to its integrated RAMDAC at the start of each solid-color block. Otherwise, it lets the RAMDAC clock out the same data.

Although you can't predict the outcome of the UMA revolution, every core-logic and graphics-controller vendor is developing or analyzing potential UMA-supporting products. Some vendors are predicting that UMA may be acceptable in low-end desktop systems. However, if that system requires an L2 cache and 16 Mbytes to achieve adequate performance, it is no longer a low-end system. Additionally, the popularity of bandwidth-hungry 3-D graphics is increasing dramatically. Higher density memory devices may help drive UMA. For example, 64-Mbit devices configured as 4Mx16 bits yield a granularity of 32 Mbytes, so there should be memory to spare.

In addition to multiple sources, a DRAM's success depends on price and standardization. No matter how great the technology, the market is unwilling to pay a higher price for main-memory DRAMs. Furthermore, computer OEMs hesitate to use a device that requires major changes in a system's design.

Eliminating DRAM price premiums

Architecturally, BEDOs, EDOs, and FPMs are similar enough that vendors can use the same die to develop all three. The differences among these devices result from a bonding option in the final manufacturing stages. Theoretically, because the devices arise from the same die, the price of each should be the same. However, the variance in price results from test cost, speed grades, product marketing, and supply and demand.

In an effort to reduce the price of SDRAMs, some vendors are selling devices that they have only partially tested. Test costs account for a majority of the SDRAM's (or any memory's) price. Vendors, therefore, are testing these PC-SDRAMs (or SDRAM-lites) only for the functions required to operate the devices in a PC (see Table 1).
For example, vendors test PC-SDRAMs only for burst lengths of one and four (to accommodate a Pentium cache line) and with a column-access (CAS) latency of three.

Higher-speed-grade devices generally fetch a higher price. As Table 2 shows, in moving from FPM to Rambus, the improved device interface allows DRAM manufacturers to reduce the speed of the internal DRAM array and still achieve the same or better bandwidth. For example, an EDO DRAM with a 70-nsec row-access time yields a 33-MHz maximum page frequency. However, an SDRAM can hit a page frequency of 66 MHz using the same speed-grade DRAM.

Some vendors and designers claim that BEDO is a bridge between EDO and SDRAM. To be successful, this device must yield a 52-nsec row-access time (tRAC) to deliver page frequencies of 66 MHz. Most DRAM vendors, however, are unwilling to risk the manufacturing yields of 52-nsec devices. The Pentium's bus speed is a mixture of 60 and 66 MHz, depending on the processor's internal operating frequency. Theoretically, this mixture implies that 50% of the yield could go to 52 nsec and 50% to 60 nsec. The problem with this situation is that most computer OEMs don't want to handle two speed grades. Mixing and matching also makes it difficult for end users to figure out which device to add when they increase their system's memory.

DRAM Technology Tutorial

The standard DRAM architecture consists of a memory-cell array, row and column decoders, sense amplifiers, and data input and output buffers. After a row access (RAS), the selected row feeds into the sense amplifiers; the sense amplifiers serve as a row cache. A column access (CAS) reads data from the sense amplifiers.
The timing parameters used to analyze the performance aspects of a DRAM include the row-access time (tRAC, or row address to data), column-access time (tCAC, or column address to data), page-mode cycle time (tPC, or column address to column address), and the random read/write cycle time (tRC, or start of one row access to the start of the next).

Consultant Steven Przybylski has defined an additional means of analyzing a DRAM's performance, which he calls a DRAM's fill frequency (FF). The FF of a memory equals its peak bandwidth (Mbytes/sec) divided by its granularity (Mbytes), a measure of the ratio of bandwidth to granularity. By recognizing that system bandwidth and granularity requirements can also be expressed as an FF, Przybylski developed a metric for evaluating the applicability of individual DRAM architectures to specific application domains. If the system's FF requirement is greater than a DRAM's FF, that DRAM cannot meet the system's performance requirements, because its peak bandwidth is not high enough. FF keeps the system's bus width out of the equation and allows you to analyze the DRAM's performance. A PC with a bus speed of 66 MHz requires a peak bandwidth between 78 and 266 Mbytes/sec to achieve acceptable performance. A system with 8 Mbytes of DRAM (the minimum memory size) therefore has an FF requirement of 32. These numbers are important to keep in mind as we analyze each of the mainstream (and potentially mainstream) DRAMs.

Fast-page-mode DRAMs (FPMs) have been around for many years, and their functions are generally understood. Using the timing values in Table 2 and assuming the FPM DRAM has a 4-bit interface, the peak bandwidth of that device is approximately 12.5 Mbytes/sec (using tPC=40 nsec). For a 16-Mbit DRAM, this bandwidth yields an FF of 6.25, far below the PC's requirements. However, using a 16-Mbit DRAM with a by-16 interface, the FF increases to 25, which seems adequate for most of the PC market.
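The fill-frequency arithmetic above is easy to reproduce. A minimal sketch in Python (the function name is illustrative) that matches the FPM figures just quoted:

```python
def fill_frequency(t_pc_ns, width_bits, density_mbits):
    """Fill frequency: peak bandwidth (Mbytes/sec) divided by
    device granularity (Mbytes), per the article's definition."""
    page_mhz = 1000.0 / t_pc_ns                # maximum page-mode cycle rate
    peak_mbytes_sec = page_mhz * width_bits / 8
    granularity_mbytes = density_mbits / 8
    return peak_mbytes_sec / granularity_mbytes

print(fill_frequency(40, 4, 16))   # 16-Mbit by-4 FPM, tPC=40 nsec -> 6.25
print(fill_frequency(40, 16, 16))  # 16-Mbit by-16 FPM -> 25.0
```

The same function reproduces the EDO and BEDO figures discussed later (for example, a 16-Mbit by-16 EDO with tPC=25 nsec gives an FF of 40).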
However, the FF is an amalgamation of two unrelated constraints: bandwidth and granularity. Therefore, the FF can show that memory devices are unacceptable (as in the case of the by-4 device), but it can't conclusively show that the devices are acceptable. You must examine each of these constraints individually before you can verify acceptability.

DRAM manufacturers create extended-data-out DRAMs (EDOs) by adding a latch to the sense amps' output of an FPM. This latch allows CAS to go high while waiting for data out to become valid; output enable (OE) controls data out. In standard page-mode DRAMs, the data-output buffers turn off during a read with the rising edge of CAS, even if OE stays low. Although EDO does nothing to improve the DRAM's latency, EDO DRAMs can cycle at speeds as fast as the address-access time, typically 30 nsec. Therefore, burst transfers can cycle up to 30% faster than in fast-page DRAMs. A 16-bit-wide, 16-Mbit EDO DRAM with a tPC of 25 nsec yields an FF equal to 40. Toshiba's 512kx32-bit device doubles the FF to 80. The same device with a 4-bit-wide interface yields an FF of only 10. Of course, with a tPC of 25 nsec, neither device supports zero wait states on a bus with speeds greater than 40 MHz (and that's pushing it).

DRAM manufacturers create burst-EDO DRAMs (BEDOs) by replacing the EDO DRAM's output latch with a register (that is, an additional latch stage) and adding an address latch and a 2-bit counter. As a result of the output register, data does not reach the outputs on the first CAS cycle. However, the internal pipeline stage allows data to appear in a shorter time from the activating CAS edge in the second cycle (that is, a shorter tCAC). The first CAS cycle does not cause additional delay in receiving the first data element; the first data access is actually limited by tRAC, which, in effect, hides the first CAS cycle. The tPC for 52-nsec BEDO DRAMs is 15 nsec.
For the by-4 and by-16 devices, this level of performance yields FFs of 16.5 and 66, respectively. Again, the by-4 version is inadequate for use in a 16-Mbyte PC with a 66-MHz bus. BEDO DRAMs are burst-access DRAMs in which all read and write cycles occur in bursts of four. The CAS pin increments the on-chip burst counter. BEDO DRAMs perform interleaved or linear bursting. Although bursts can be terminated by the write-enable signal, BEDO DRAM's performance advantages are lost during single-cycle accesses. However, in PCs, most cycles that go through main memory are burst transfers, such as cache fills and DMA.

Synchronous DRAMs (SDRAMs) consist of two equal-sized banks, each with its own row decoder and 8 kbits of sense amps divided into two blocks of 512 bytes. This architecture allows data access from one bank while the other bank is in its precharge cycle, yielding gapless data output. Other key SDRAM ingredients include the input and output buffers, a burst counter, and a mode register to tailor the SDRAM interface. Although SDRAMs retain the multiplexed address bus and control signals of standard DRAMs, several new signals facilitate the high-speed interface. A clock synchronizes the flow of addresses, data, and control and the pipelining of operations. A clock-enable input signals the validity of the clock and can disable the clock to put the SDRAM into a low-power mode. Chip select enables command execution and allows full-page-burst suspension. The burst-oriented design of the SDRAM supports programmable CAS latencies (1, 2, or 3), burst lengths (2, 4, 8, or full page), and transfer orders (interleaved or linear).

The 16-Mbit Rambus DRAM (RDRAM) contains a standard DRAM array divided into two independent, noninterleaved logical banks. Each bank has an associated high-speed row cache that's approximately two to four times larger than the row cache on standard DRAMs. RDRAMs use a 250-MHz clock to transfer 1 byte of data every 2 nsec.
The RDRAM eliminates RAS and CAS at the system interface by handling all address translations internally. RDRAMs use an 8-bit external data bus to achieve a maximum performance of 500 Mbytes/sec. The device’s electrical interface, the RAMBUS channel, comprises 13 high-speed signals. The RAMBUS channel connects the RDRAM to a memory controller containing a RAMBUS ASIC cell (RAC). The RAC delivers data requests to the RDRAMs in a manner similar to a communications protocol: A master issues request packets that specify a starting address and a byte count to be read or written. A RAC is required for each RDRAM in a system. RDRAMs transfer data in blocks of 8 to 256 bytes. Before each transfer, the system must issue another request packet to the RDRAM, consuming 12 nsec. After a read request to an open row, the data does not start to flow for an additional 28 nsec. Writes to an open row can begin almost immediately after the request packet. When a row miss occurs, the RDRAM responds with a NACK signal and begins a 76-nsec precharge cycle. After the precharge cycle, the system must issue another 12-nsec repeat request. Data begins to flow 28 nsec later. The large peak FF of RDRAMs makes the 16-Mbit RDRAM more suitable for graphics and UMA than as a main-memory device. Micron, the biggest supporter of BEDO, is running its fabrication process at 0.35 µm, which allows the device to more easily yield the 52-nsec tRAC. The company claims 70% yields at 52 nsec. A Micron spokesperson reports that the company is about to release a 75-MHz specification. Although BEDO sounds good in theory, its success depends on how quickly SDRAM pricing comes down and if enough other vendors build the devices. If SDRAM’s price comes down quickly enough, it will be the device of choice for computer designers. From a business perspective, vendors can only charge what the market will bear for increased performance. 
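Returning to the RDRAM timings quoted above (12-nsec request packets, 28-nsec read latency, 1 byte every 2 nsec, and a 76-nsec precharge on a row miss), the effective bandwidth of a single read transfer can be roughly estimated. A sketch in Python (function names are illustrative, and the model ignores any controller overhead):

```python
def rdram_read_ns(nbytes, row_hit=True):
    """Approximate duration of one RDRAM read, per the article's figures."""
    t = 12.0            # request packet
    if not row_hit:
        t += 76 + 12    # NACK: precharge cycle, then the repeated request
    t += 28             # latency before read data starts to flow
    t += nbytes * 2     # 1 byte every 2 nsec (250-MHz clock)
    return t

def rdram_effective_mbytes_per_sec(nbytes, row_hit=True):
    return nbytes / rdram_read_ns(nbytes, row_hit) * 1000

# A maximum-length 256-byte transfer approaches the 500-Mbytes/sec peak;
# short transfers and row misses fall well below it.
print(round(rdram_effective_mbytes_per_sec(256)))                 # open row: 464
print(round(rdram_effective_mbytes_per_sec(256, row_hit=False)))  # row miss: 400
print(round(rdram_effective_mbytes_per_sec(32, row_hit=False)))   # short miss: 167
```

The gap between the 500-Mbytes/sec peak and these effective figures illustrates why the RDRAM's latencies, not its raw clock rate, dominate its system-level behavior.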
Availability also affects device price, because a tremendous demand for a product strains the vendor's capacity to deliver it. These constraints, in turn, drive up the price. The popularity of EDO DRAMs resulted in a price premium. Manufacturers are increasing EDO DRAM shipments at the expense of FPM DRAMs. To facilitate this shift, some manufacturers are considering charging a premium for FPM devices and eliminating the premium on EDO DRAMs. Fujitsu may be taking this position with its SDRAMs. A company representative said that, by the middle of the year, the company will not charge a premium on SDRAMs.

Table 1: Supported features of SDRAM-lite devices
CAS latency: 3 only (1 and 2 untested)
Burst lengths: 1 and 4 only (2, 8, and full page untested)
Wrap types: interleave and sequential
Also supported: single write, burst-stop command, auto precharge, DQM byte control (x16), burst interruption, all-bank precharge, random column access (every cycle), and dual-bank operation

Core-logic chip-set support

Because EDO is the next mainstream memory, almost every chip set supports it. (Surprisingly, Intel's first chip set for the Pentium Pro supports only FPM.) With minimal additional effort, many chip sets have the option to support FPM, EDO, or BEDO, because these three memories use a variation on row access (RAS) and CAS. For example, PicoPower's Vesuvius system controller takes an interesting approach to supporting all three memories. The PCI-based chip set can use FPM, EDO, and BEDO DRAM banks in a mixed mode. When the system starts up, Vesuvius automatically detects the memory type and sets up independent timing for each bank. The controller also provides you with a choice of using 32-bit- or 64-bit-wide banks. Mixing widths also makes it easier to add memory to a system in increments of 4 Mbytes. However, mixed widths also mean that application performance may differ depending on which bank the application loads into at runtime.
Table 2: Timing specifications of the internal DRAM (times in nsec; maximum page frequency in MHz)

Specification          Symbol  FPM         EDO         BEDO        SDRAM        Rambus
                               -5/-6/-7    -5/-6/-7    -5/-6/-7    -10/-12/-15  250 MHz
Row-access time        tRAC    50/60/70    50/60/70    50/60/70    50/60/70     128**
Row-cycle time         tRC     95/110/130  89/110/130  90/110/130  100/120/130  NA
Column-address access  tAA     25/30/35    25/30/35    25/30/35    29/35/44     40**
CAS access             tCAC    13/15/20    13/15/20    10/11.6/15  9/11/14*     NA
CAS cycle              tPC     30/35/40    20/25/30    15/16.6/20  10/12/15     2
Maximum page frequency MHz     33/28/25    50/40/33    66/60/50    100/80/66    500

Notes: *CAS latency=3. **Doesn't include any data-transfer time.

SDRAMs require a relatively different interface, because the devices are synchronous and use a clock input. However, system designers' move away from 32-bit- toward 64-bit-wide memory modules (dual in-line memory modules, or DIMMs) will speed the adoption of SDRAM. Vendors design DIMMs so that EDO and SDRAM can exist on the same module format. Starting this quarter, some core-logic DRAM controllers, such as those from VLSI, will begin supporting SDRAMs.

Activities in the graphics department

Graphics memory is experiencing significant advancements. The use of high-priced VRAM is nearing an end as memory vendors move toward devices such as RDRAMs, SDRAMs, and synchronous graphics DRAMs (SGRAMs). Designers of low-end graphics applications are even finding that standard EDO devices provide ample performance. For higher performance (at a higher price), Mosel Vitelic offers 256kx8- and 256kx16-bit EDOs with a guaranteed cycle time of 20 nsec. Mitsubishi's 3-D RAM chip, supporting high-performance 3-D graphics, integrates 10 Mbits of DRAM, 2 kbytes of SRAM, and an ALU onto a single chip. Samsung continues to sell its single-sourced window RAM (WRAM) into graphics applications that require VRAM-like performance at a reduced price.
Benefits of integrating L2 cache with core logic
Mathew Arcoleo, Cypress Semiconductor Corp

When building a high-speed L2-cache subsystem, you can significantly increase timing margins by putting the logic functions on the same chip. If an external SRAM performs the tag look-up, several delays must be accounted for:

Tlookup = Taa + Toffs + Tft1 + Toncs + Tcomp + Toffcs + Tft2,

where Taa=RAM access time, Toffs=off-chip delay from the SRAM (~1 nsec), Tft1=flight time to the chip set (~1 nsec), Toncs=on-chip delay into the chip set (~1 nsec), Tcomp=tag compare, Toffcs=off-chip delay of BRDY from the chip set, and Tft2=flight time of BRDY to the processor. Contrast this equation with the tag look-up for the integrated-tag implementation:

Tlookup = Taa + Tcomp + Toffcs + Tft2.

Three critical delays (Toffs, Tft1, and Toncs) have been eliminated, increasing the system timing margin by 2 to 3 nsec. This margin also eases the speed requirements placed on the integrated tag RAM.

Integrating the L2 cache also reduces the capacitive loading on the address bus. In addition, most of the discontinuities on the transmission lines are eliminated, resulting in a cleaner, more reliable system. Having the cache RAM inside the data-path unit also eliminates the 64 data-I/O lines from the CPU bus required to connect the L2 cache. Also, only a single set of address inputs, decoders, and clock signals is required.

Interestingly enough, the die size of a core-logic chip set is largely a function of the number of I/Os and is not drastically increased by the addition of the cache SRAM. The data-path unit of a chip set is I/O intensive and usually comes in a 208-pin PQFP package. Assuming that all of the pins are being used, the die starts to become "pad-limited" at about 100 kmils2 (assuming standard 4-mil pads with 2-mil spacing). The chip set's core logic requires only about 30 kmils2 (which includes approximately 12k logic gates at 1.3 mils2/logic gate, plus routing resources).
This is a small fraction of the 100 kmils2 required to accommodate the I/O pads.

Mathew Arcoleo is a staff engineer at Cypress Semiconductor, San Jose, CA.

DRAM interfaces on embedded processors

Many embedded-µP vendors do not integrate specific DRAM controllers into their devices. One reason for not doing so is to minimize the amount of silicon. Another reason relates to the uncertainty of which type of DRAM suits the variety of applications customers may implement. Motorola's MPC860 offers a flexible solution for accommodating a variety of memory types. The µP contains a user-programmable controller that is similar to a microcode machine (Figure 1). The controller provides you with two general-purpose lines that you can assert and deassert with one-quarter-clock-cycle granularity to control most memory types. To provide this granularity, the MPC860 runs two clocks: One is the system clock, and the other is the system clock shifted by 90 degrees. This clock shift essentially provides the same effect as doubling the clock. Motorola offers a software-analysis tool that lets you create waveform outputs and learn how to control your system's memory.

SRAMs for L2 caches

To be a high-performance SRAM for PCs, a device must include a clock input and a pipeline. As a result, asynchronous SRAMs are giving up market share to higher performing, lower cost synchronous counterparts. Benchmarks run on systems using EDO DRAMs actually show a decrease in performance when using asynchronous SRAMs. Therefore, the 32kx32-bit synchronous burst SRAM has become the device of choice for L2 caches in PCs. This SRAM incorporates a 2-bit burst counter. In a system with a 66-MHz bus, the device achieves bursts of 3-1-1-1 in pipeline mode.
Table 3: Clocks to access system memory (with L2 cache)

                        L2 cache       Main-memory miss*
System-memory accesses  80%            20%
Clocks                  6-2-2-2 (12)   FPM: 14-6-6-6 (32); EDO: 14-4-4-4 (26); SDRAM: 14-2-2-2 (20); BEDO: 12-2-2-2 (18)
Clocks of accesses      9.6 (0.8x12)   FPM: 6.4; EDO: 5.2; SDRAM: 4; BEDO: 3.6
  to this memory

Notes: *FPM is a -7 device. EDO is a -7 device. BEDO is a -5 device. SDRAM is a -15 device.

Table 4: Clocks to access system memory (without L2 cache)

                        Main-memory miss*
System-memory accesses  100%
Clocks                  FPM: 14-6-6-6 (32); EDO: 14-4-4-4 (26); SDRAM: 14-2-2-2 (20); BEDO: 12-2-2-2 (18)
Clocks of accesses      FPM: 32; EDO: 26; SDRAM: 20; BEDO: 18
  to this memory

Notes: *FPM is a -7 device. EDO is a -7 device. BEDO is a -5 device. SDRAM is a -15 device.

The synchronous-burst SRAM should not have problems running to 75 MHz. Beyond this speed, you must be creative with your system design to reduce noise. (IDT, a leading SRAM supplier, predicts that between 1997 and 1999 there will be a market for 100- to 120-MHz synchronous SRAMs. Designs will, therefore, get even tougher.) For example, the Pentium Pro uses a separate cache bus to decrease loading and to allow the clock rate to equal the CPU's internal clock rate. You can also improve cache performance by integrating the tag, controller, and SRAM on the same chip. Motorola's MPC2604GA is an integrated four-way, set-associative cache for PowerPC systems. This integrated cache performs 2-1-1-1 bursts at 66 MHz vs the 3-1-1-1 bursts typical of discrete devices. Sony's CXK78V5862GB is an integrated cache that contains a two-way, set-associative, 256-kbyte cache and a 64-bit processor interface. Sony and IDT also have processor-specific SRAMs that latch incoming address, data, and control signals into on-chip registers to decouple the processor's addressing cycles from memory-address cycles.
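The weighted-clock rows in Tables 3 and 4 follow directly from the assumed 80% cache-hit rate. A small Python sketch of that arithmetic (names are illustrative):

```python
# Burst clocks on an L2 miss, per Table 3 (lead-off clock plus three transfers).
MISS_CLOCKS = {"FPM": 32, "EDO": 26, "SDRAM": 20, "BEDO": 18}
L2_HIT_CLOCKS = 12   # 6-2-2-2 burst served from the L2 cache
HIT_RATE = 0.8       # assumed combined L1/L2 hit rate

def avg_clocks_with_l2(dram):
    """Average clocks per system-memory access, with an L2 cache present."""
    return HIT_RATE * L2_HIT_CLOCKS + (1 - HIT_RATE) * MISS_CLOCKS[dram]

for dram in ("FPM", "EDO", "SDRAM", "BEDO"):
    print(dram, round(avg_clocks_with_l2(dram), 1))
# -> FPM 16.0, EDO 14.8, SDRAM 13.6, BEDO 13.2
```

These are the same averages the analysis in the following section arrives at.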
Cypress Semiconductor has taken a different tack for increasing L2-cache performance by integrating the cache SRAM and control logic directly into its hyperCache PC core-logic chip set (see box, "Benefits of integrating L2 cache with core logic").

Main Memory Effects on System Performance

The effect of main memory on PC system performance depends on the presence of the processor’s L1 cache and the external L2 cache. The most significant use of PC main memory is performing cache-line fills to the L1 and L2 caches. This operation attempts to transfer data at the processor’s external bus frequency, which is 60 or 66 MHz. The processor incurs wait states when a memory device cannot transfer data at these rates. You should, therefore, study the relative performance of each memory device. For the following analysis, assume that a µP’s internal frequency is twice its external frequency, so each external clock cycle corresponds to two internal clocks. Therefore, when the external memory system runs at zero wait states (that is, 2-1-1-1), the processor core actually sees an operation of 4-2-2-2. Also assume that the L1 and L2 caches achieve a hit rate of 80%. Main memory has two access modes: page open and page closed. As with a cache, an access to an open page is significantly faster than an access to a closed page. Because the core logic deasserts RAS after each burst, all DRAM accesses in this analysis are to a closed page. Tables 3 and 4 show the relationship in processor clocks for accesses to each memory. For a system that contains an L2 cache, the average cycles are: L2+FPM DRAM=9.6+6.4=16 clocks, L2+EDO DRAM=9.6+5.2=14.8 clocks, L2+SDRAM=9.6+4=13.6 clocks, and L2+BEDO DRAM=9.6+3.6=13.2 clocks. For a system that doesn’t contain an L2 cache, the average cycles are: FPM DRAM=28 clocks, EDO DRAM=22 clocks, SDRAM=16 clocks, and BEDO DRAM=14 clocks. You can draw two conclusions from this analysis. First, up to 66 MHz, BEDO DRAMs even outperform SDRAMs.
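The with-L2 averages above follow directly from Table 3's weights: 80% of accesses hit the L2 at 12 clocks, and the remaining 20% go to main memory. A few lines of arithmetic reproduce them. This is only a sketch of the article's calculation, not benchmark data.

```python
# Weighted-average access clocks for a system with an L2 cache,
# using the hit rate and burst totals from Table 3.

L2_HIT_RATE = 0.8
L2_CLOCKS = 12  # 6-2-2-2 burst

main_memory_clocks = {"FPM": 32, "EDO": 26, "SDRAM": 20, "BEDO": 18}

averages = {
    dram: round(L2_HIT_RATE * L2_CLOCKS + (1 - L2_HIT_RATE) * clocks, 1)
    for dram, clocks in main_memory_clocks.items()
}
print(averages)  # {'FPM': 16.0, 'EDO': 14.8, 'SDRAM': 13.6, 'BEDO': 13.2}
```

Because the L2 term (9.6 clocks) is identical for every DRAM type, the 20% miss traffic is all that differentiates them, which is why the spread between FPM and BEDO narrows from 14 clocks without an L2 cache to under 3 clocks with one.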
Second, the benefits of EDO, SDRAM, and BEDO become more pronounced without an L2 cache. The analysis uses FPM, EDO, and SDRAM devices with a 70-nsec row-access time; the BEDO devices have a 52-nsec row-access time. These access times represent the typical devices in use. In some PCs, computer OEMs question the necessity of using an L2 cache (Figure 2). Typically, a PC with FPM DRAMs includes an L2 cache, whereas EDO-based systems can get by without one. Benchmarks from Intel, though, show that adding cache to an EDO-based system boosts performance by more than 20%. Multitasking operating systems such as Windows 95 and NT require higher performing systems and, in turn, larger L2 caches to accommodate more L1-cache misses. Various benchmarks show that applications that are more graphical and mathematical require larger caches. Sharing main memory with the graphics frame buffer also demands the use of an L2 cache (see box, "Understanding the UMA trade-offs").

Looking Ahead

Although EDO is the most popular DRAM architecture, the SDRAM is gaining appeal. Initially, the majority of SDRAMs will target bus frequencies of less than 100 MHz, but the SDRAM’s architecture allows it to run even faster. Later this year, NEC will begin sampling 143-MHz SDRAMs with a stub-series-terminated-logic (SSTL) interface. (Currently, SDRAMs use the low-voltage-TTL interface.) In 1997, RDRAMs will hit 64-Mbit densities, making these devices more cost-effective by amortizing the chip’s control-logic overhead. Although these higher density devices are 100% backward compatible with previous-generation RDRAMs, they have increased functionality internally. Instead of two DRAM banks, the 64-Mbit devices will have four. The four banks, in conjunction with the device’s ability to queue two operations, will allow accesses to overlap, mitigating the RDRAM’s long latencies.
Queuing will make the 64-Mbit RDRAMs more useful for Intel’s Pentium Pro, which also queues memory accesses. The 64-Mbit RDRAM’s queuing capability will also benefit systems with a unified memory architecture (UMA), because the queuing allows CPU and graphics operations to overlap. IBM, Siemens, and Toshiba have developed functional 256-Mbit DRAMs; these high-density devices will probably have a synchronous interface. If 256 Mbits is not enough, Motorola is now working with these three companies to develop a 1-Gbit device.

To learn more about the current and future state of DRAMs, you can attend consultant Steven Przybylski’s seminar, which was given at the 1995 Microprocessor Forum. The session takes place in Austin and Portland, OR, in January (dates to be determined) and on Feb 6 in Palo Alto, CA. The seminar costs $495. Contact Przybylski directly for more information at (408) 984-2719 or via e-mail at [email protected].

You can reach Technical Editor Markus Levy at (916) 939-1642; fax (916) 939-1650; e-mail [email protected].

References

1. Levy, Markus, "The dynamics of DRAM technology," EDN, Jan 5, 1995, pg 46.
2. Przybylski, Steven, New DRAM Technologies, MicroDesign Resources, Sebastopol, CA, 1995.
3. "VESA Unified Memory Architecture Hardware Specification Proposal," Video Electronics Standards Association, San Jose, CA.
4. Yao, Yong, "Unified Memory Architecture Cuts PC Cost," Microprocessor Report, June 19, 1995, MicroDesign Resources, Sebastopol, CA.

Manufacturers of DRAMs, Synchronous Burst SRAMs, and Chip Sets

When you contact any of the following manufacturers directly, please let them know you read about their products at the EDN Magazine WWW site.
Alliance Semiconductor Corp, San Jose, CA, (800) 642-7616
Cypress Semiconductor Corp, San Jose, CA, (408) 383-4900, ext 102
Fujitsu Microelectronics Corp, San Jose, CA, (408) 943-2600
Goldstar Technology Inc, San Jose, CA, (408) 432-1331
Hitachi America Ltd, Brisbane, CA, (415) 589-8300
Hyundai America Ltd, San Jose, CA, (408) 473-9200
IBM Microelectronics Inc, Fishkill, NY, (800) 426-0181
Integrated Device Technology (IDT), Santa Clara, CA, (800) 345-7015
Intel Literature Center, Mt Prospect, IL, (800) 468-8118
LG Semicon, San Jose, CA, (408) 432-5000
Matsushita, Milpitas, CA, (408) 946-4311
Micron Technology Inc, Boise, ID, (208) 368-3900
Mitsubishi Electronics America, Sunnyvale, CA, (408) 730-5900
Mosel Vitelic, San Jose, CA, (408) 433-6000
Motorola Inc, Austin, TX, (512) 933-7726
NEC Electronics Inc, Mountain View, CA, (800) 366-9782
Oki Semiconductor Inc, Sunnyvale, CA, (408) 720-1900
Opti Inc, Santa Clara, CA, (408) 980-8178
Paradigm Technology Inc, San Jose, CA, (408) 954-0500
PicoPower Technology, Fremont, CA, (510) 623-8300
Rambus Inc, Mountain View, CA, (415) 903-3800
Ramtron International Corp, Colorado Springs, CO, (719) 481-7000
Samsung Semiconductor Inc, San Jose, CA, (408) 954-6972
Sharp Microelectronics Corp, Mahwah, NJ, (201) 529-8200
Siemens, Cupertino, CA, (408) 777-4500
Silicon Magic, Cupertino, CA, (408) 366-8888
Sony Semiconductor Co, San Jose, CA, (408) 955-6572
Texas Instruments Inc, Literature Response Center, Denver, CO, (800) 477-8924, ext 4500
Toshiba America Electronic Co, Irvine, CA, (714) 455-2000
VLSI Technology Inc, San Jose, CA, (408) 434-3000
Weitek Corp, Sunnyvale, CA, (408) 738-8400

Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.