EDN -- 01.04.96
Cover Story: January 4, 1996
Souped-up memories boost system performance
Markus Levy, Technical Editor
The fast-page-mode DRAM has given way to higher performance memories such as
extended-data-out and synchronous DRAMs. The price barriers of memories are eroding,
but system designers must work with new interfaces and architectures to accommodate
these memory devices.
Welcome to the revolution in the multibillion-dollar memory industry. Last year’s specialty memories
are the basis for today’s mainstream devices. The features of these memory technologies are
enabling more affordable, higher performance systems. However, these memories are generating
controversies pertaining to the fundamental architecture of the PC. One such controversy is the use
of the unified memory architecture (UMA). Despite the debate surrounding it, UMA is gaining
popularity among PC OEMs that want to reduce system costs. Another issue is the use or removal of
level 2 (L2) cache.
Other memory-related topics are surfacing in embedded applications. Most of the µPs used in these applications operate at bus speeds significantly lower than those of the Pentium-class devices in PCs. Embedded applications, unlike PCs, have long life cycles of five to 10 years. When
designing these embedded applications and processors, you must choose a memory device that
provides adequate performance. Equally important, you should select a cost-effective device that will be available for several years.
Except for changes in process technology, the basic DRAM cell hasn’t changed since its inception.
The architectural differences affecting DRAM performance are the faster input and output interfaces
and the slowly increasing bus width. In the PC’s main-memory arena, the long life of the fast-page-mode DRAM (FPM) is ending, giving way to extended-data-out DRAMs (EDOs). EDOs, with nearly 50% higher peak bandwidth than FPMs, are also rapidly approaching their peak. Meanwhile,
synchronous DRAMs (SDRAMs), Rambus DRAMs (RDRAMs), and, possibly, burst-EDO DRAMs
(BEDO), all of which offer at least 50% greater performance than EDO, are waiting to conquer the
market. Other technically innovative DRAM technologies, such as Ramtron’s Enhanced DRAM
(EDRAM) and the IEEE’s RamLink and SyncLink, have little chance of becoming mainstream in the
short term because of limited sources for these devices.
Understanding the UMA Trade-offs
More so now than ever before, every PC OEM is trying to save money. Faster DRAMs, combined with computer OEMs’ desire to reduce costs, have made the unified memory architecture (UMA) feasible. A PC with a UMA eliminates the separate frame-buffer memory of the graphics subsystem, for a potential cost savings of $30 to $60. On the downside, UMA is also known as the shared frame buffer, because the graphics subsystem must "steal" part of the PC’s main memory.
UMA is generating significant activity in the PC industry, and the potential cost-saving architecture is raising many questions: whether a UMA standard exists, whether the architecture actually saves money, who benefits from it, what its performance impact is, and whether it targets only low-end systems.
Currently, there isn’t a standard method for designing a UMA system. As a result, two divergent paths have emerged: tightly and loosely coupled. Weitek has
developed a chip set for the 486 that integrates the core logic on the same die as the graphics controller. The company has enthusiastically termed this integrated approach "tightly coupled" to signify high performance. The other approach, inaccurately referred to as "loosely coupled," uses specific
handshaking signals between a separate core logic chip and a graphics controller. A subcommittee of VESA, VESA UMA (VUMA), is supporting this approach.
VUMA, led by Opti, is a consortium of graphics-controller and core-logic vendors. The group has published a conceptual specification detailing the hardware and BIOS extensions for its UMA approach (Reference 3).
UMA has advantages and disadvantages. A separate frame buffer typically wastes memory. Suppose you use frame-buffer EDO DRAMs in 256kx16-bit
configurations. (Silicon Magic has a 256kx32-bit EDO DRAM.) It takes two of these devices to yield a 32-bit interface and a minimum memory granularity of 1
Mbyte. Whether running an 800x600-pixel or a 1024x768-pixel display with 8-bit color, you use only 470 or 768 kbytes, respectively, of a 1-Mbyte frame buffer. When using 2-Mbyte frame buffers for high-performance 64-bit operation, more than 1 Mbyte is still wasted unless the user runs 16- or 24-bit color modes or 1280x1024-pixel or 1600x1200-pixel displays.
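As a rough check on these numbers, the following Python sketch (the function name is ours, purely for illustration) computes how much of a 1-Mbyte frame buffer each display mode actually uses:

    # Frame-buffer usage vs the 1-Mbyte minimum granularity of two
    # 256kx16-bit EDO DRAMs. Integer division rounds down slightly,
    # so 800x600x8 bits comes out as 468 kbytes (~470 in the text).
    def frame_buffer_usage(width, height, bits_per_pixel, buffer_kbytes=1024):
        used_kbytes = width * height * bits_per_pixel // 8 // 1024
        return used_kbytes, buffer_kbytes - used_kbytes

    for w, h in [(800, 600), (1024, 768)]:
        used, wasted = frame_buffer_usage(w, h, 8)
        print(f"{w}x{h}, 8-bit color: {used} kbytes used, {wasted} kbytes wasted")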
With UMA, you can allocate the exact amount of graphics memory needed. However, drivers such as Virtual Desktop require a full 1 or 2 Mbytes regardless of
the monitor type or display mode. Theoretically, UMA allows you to dynamically change the size of the graphics frame buffer, depending on the application
you are running. Existing PC operating systems don’t allow you to change the frame buffer size on the fly. Instead you must set the new size and then perform
a system reset to make changes. Although a system reset is simpler than having to open the PC and modify jumper settings, there are still complex issues to
deal with. For example, Win95 permits any application to switch the system to its preferred display mode, then return the system to default display mode when
the application loses foreground focus. VGA memory must be dedicated physical memory pages, completely off limits to the system’s virtual memory manager.
To increase the allocated display memory, the system would have to move application code and data out of the physical pages needed for the increased VGA
needs.
If your system has a 1-Mbyte frame buffer, you may even see a performance improvement in graphics by using UMA. As discussed earlier, a 1-Mbyte frame
buffer has a 32-bit interface. However, if the shared frame buffer is part of the system’s 64-bit main memory subsystem, you can potentially double the
bandwidth. Tightly coupled UMA may also improve graphics performance by allowing the CPU to perform 3-D rendering directly to the frame buffer without
having to go across a slower bus to get to memory (for example, through the PCI bus).
Despite the potential graphics-performance improvement seen with UMA, this architecture negatively affects overall system performance. For example, if you
start with a system running Windows 3.1 or Win95 that has 8 Mbytes of main memory and uses 768 kbytes for graphics, you’ll see a 10 to 15% performance
degradation. The negative effects are less noticeable when operating a system with 16 Mbytes because 16 Mbytes is more than Win95 needs. However, the 8
to 10% reduction of system memory, which allocates space for a 1- to 2-Mbyte frame buffer, increases the amount of swapping between main memory and the
hard disk. Ironically, UMA targets low-end systems but provides adequate performance only in a system with 16 Mbytes of DRAM and an L2 cache. You can also argue that adding an L2 cache to a UMA system using EDO DRAMs yields a 20% performance benefit. But having to use an L2 cache to support UMA erodes UMA’s cost savings.
You can also determine the effects of UMA on system performance by analyzing the memory-bandwidth utilization. A 66-MHz system bus has total available
bandwidth of 528 Mbytes/sec. Each execution cycle, the Pentium (or any superscalar processor) makes zero to three references to main memory. The number
of references is related to the number of hits or misses to the L1 or L2 caches. However, a general rule is that approximately 20 to 30% of the Pentium-90’s
bus bandwidth (that is, 100 to 150 Mbytes/sec) is consumed as a result of cache misses.
The graphics controller is responsible for performing a mixture of screen refresh and redraw operations. The mixture depends on the application. Screen
refresh typically occurs 60 to 76 times per second. If you are using a word processor, redraws occur only four times per second (depending on how fast you
type). Multimedia applications are more redraw-intensive. During screen refresh, the graphics frame buffer contents must be read and sent to the RAMDAC
and then to the screen. A 1024x768-pixel screen with 8-bit color accesses memory at an average rate of 478 Mbps (1024x768x76x8), or 60 Mbytes/sec,
excluding redraw operations.
Combining the CPU’s bus utilization with screen refresh shows that 160 Mbytes/sec is consumed. DMA operations into main memory consume still more bandwidth, as do PCI bus masters that stream data into main memory. Table 2 shows that the latency of any of the DRAMs (page hit
or miss) severely limits the UMA’s performance. For example, a burst access that takes a page hit to EDO only delivers 213 Mbytes/sec; a page miss only
delivers 177 Mbytes/sec.
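You can tally this bandwidth budget directly. The short Python sketch below uses only the figures quoted above (a 64-bit, 66-MHz bus; cache-miss traffic at roughly 25% of the bus; a 1024x768, 8-bit, 76-Hz display) and shows what remains before redraw, DMA, and PCI traffic take their share:

    # Tally the consumers of a 66-MHz, 64-bit system bus.
    BUS_MHZ = 66
    BUS_BYTES = 8                            # 64-bit data path
    total_bw = BUS_MHZ * BUS_BYTES           # 528 Mbytes/sec peak

    cpu_bw = 0.25 * total_bw                 # cache misses: ~20 to 30% of the bus
    refresh_bw = 1024 * 768 * 76 / 1e6       # 1 byte/pixel at 76 Hz: ~60 Mbytes/sec

    remaining = total_bw - cpu_bw - refresh_bw
    print(f"peak {total_bw} MB/s; CPU ~{cpu_bw:.0f}; refresh ~{refresh_bw:.0f}; "
          f"~{remaining:.0f} MB/s left for redraw, DMA, and PCI bus masters")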
To achieve reasonable performance with a UMA system, you should use BEDO or SDRAM. Furthermore, you must also employ techniques to increase the burst
lengths or at least maximize the number of DRAM page hits. The CPU accesses main memory when it needs to read or write a cache line (32 bytes in
Pentium’s case). After each burst of four, system-control logic deasserts RAS. This starts the precharge for the next row access but, unfortunately, results in a
page miss on the next CPU access. However, the page-miss latency is tRAC and not tRC.
The graphics controller typically supports longer bursts of 64 to 128 bytes, which increases the memory’s bandwidth. Graphics controllers with read prefetch
buffers and write FIFOs can also take advantage of an idle bus to transfer data. VUMA has defined two priority levels for a graphics controller requesting the
bus. A high-priority request has a predefined worst-case latency, which determines how quickly the core logic must give the bus to the graphics controller. For
example, a high-priority request may arise when the graphic controller’s buffer is empty. On the other hand, if a graphics controller has a 128-byte buffer, it
may issue a low-priority request after transferring 40 bytes. This notifies the core logic that the graphics controller needs the bus, although not immediately.
Without this two-level priority scheme, overall system performance degrades, because the graphics controller must get the bus every time it "needs" it.
The core logic uses an arbitration scheme to determine when the CPU, graphics controller, or some other bus master can access main memory. Opti’s core
logic chip set for UMA, called Viper-UMA, performs the arbitration during the DRAM-precharge period. Overlapping these operations helps to hide arbitration
latencies. In systems with 16 Mbytes of main memory, Viper-UMA’s patented hardware mechanism logically divides the memory into two banks. This division
allows two masters to access each bank concurrently, because the DRAM appears to be dual-ported. This access is similar to concurrency on the PCI bus
where deep buffers simulate dual access.
Weitek’s chip set performs lossless compression on the graphics data to minimize the bandwidth impact. The graphics controller stores images by describing
the length of solid-color blocks in which typically 70% of the data can be compressed. When refreshing the screen, the controller sends only new data to its
integrated RAMDAC at the start of each solid-color block. Otherwise, it lets the RAMDAC clock out the same data.
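Weitek hasn’t published its exact format, but describing solid-color runs is essentially run-length encoding. A minimal, hypothetical sketch of the idea:

    # Run-length encoding of a scan line: each run is stored as
    # [color, length], so long solid-color blocks compress well.
    # Illustrative only; this is not Weitek's actual format.
    def rle_encode(pixels):
        runs = []
        for p in pixels:
            if runs and runs[-1][0] == p:
                runs[-1][1] += 1
            else:
                runs.append([p, 1])
        return runs

    line = [0x1F] * 700 + [0xFF] * 24 + [0x1F] * 300   # mostly solid background
    runs = rle_encode(line)
    print(f"{len(line)} pixels -> {len(runs)} runs")    # 1024 pixels -> 3 runs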
Although you can’t predict the outcome of the UMA revolution, every core-logic and graphics-controller vendor is developing or analyzing potential UMA-supporting products. Some vendors predict that UMA may be acceptable in low-end desktop systems. However, if that system requires an L2 cache and 16 Mbytes to achieve adequate performance, it is no longer a low-end system. Additionally, the popularity of bandwidth-hungry 3-D graphics is increasing dramatically. Higher density memory devices may help drive UMA. For example, 64-Mbit devices configured as 4Mx16 bits yield a 32-Mbyte granularity, so there should be memory to spare.
In addition to multiple sources, a DRAM’s success depends on price and standardization. No matter
how great the technology, the market is unwilling to pay a higher price for main-memory DRAMs.
Furthermore, computer OEMs hesitate to use a device that requires major changes in a system’s
design.
Eliminating DRAM price premiums
Architecturally, BEDOs, EDOs, and FPMs are similar enough that vendors can use the same die to
develop all three. The differences among these devices result from a bonding option in the final
manufacturing stages. Theoretically, because the devices arise from the same die, the price of each
should be the same. However, the variance in price results from test cost, speed grades, product
marketing, and supply and demand.
In an effort to reduce the price of SDRAMs, some vendors are selling devices that they have only
partially tested. Test costs account for a majority of the SDRAM’s (or any memory’s) price. Vendors,
therefore, are only testing these PC-SDRAMs (or SDRAM-lites) for the functions required to operate
the devices in a PC (see Table 1). For example, vendors test PC-SDRAMs only for burst lengths of
one and four (to accommodate a Pentium cache line) and with a column-access (CAS) latency of
three.
Higher-speed-grade devices generally fetch a higher price. As Table 2 shows, in moving from FPM to
Rambus, the improved device interface allows DRAM manufacturers to reduce the speed of the
internal DRAM array to achieve the same or better bandwidth. For example, an EDO DRAM with a
70-nsec row-access time yields a 33-MHz maximum page frequency. However, an SDRAM can hit a
page frequency of 66 MHz using the same speed-grade DRAM.
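The page-frequency figures follow directly from the column-cycle time (f=1/tPC), as this quick Python check against Table 2 shows:

    # Maximum page frequency is the reciprocal of the CAS cycle time.
    for name, tpc_ns in [("FPM -7", 40), ("EDO -7", 30),
                         ("BEDO -7", 20), ("SDRAM -15", 15)]:
        print(f"{name}: {1000 / tpc_ns:.0f} MHz")   # 25, 33, 50, 66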
Some vendors and designers claim that BEDO is a bridge between EDO and SDRAM. To be successful, the device must achieve a 52-nsec row-access time (tRAC) to deliver a 66-MHz page frequency. Most DRAM vendors, however, are unwilling to risk their manufacturing yields on 52-nsec devices. The Pentium’s bus speed is a mixture of 60 and 66 MHz, depending on the processor’s internal operating frequency. Theoretically, this mix implies that 50% of production could be 52-nsec devices and 50% 60-nsec devices. The problem with this situation is that most computer OEMs don’t want to handle two speed grades. Mixing and matching also makes it difficult for end users to figure out which device to add when they increase their system’s memory.
DRAM Technology Tutorial
The standard DRAM architecture consists of a memory cell array, row and column decoders, sense amplifiers, and
data input and output buffers. After making a row access (RAS), the selected row feeds into the sense amplifiers; the
sense amplifiers serve as a row cache. A column access (CAS) reads data from the sense amplifiers. The timing
parameters used to analyze the performance aspects of a DRAM include the row-access time (tRAC, or row address to
data), column-access time (tCAC, or column address to data), page-mode cycle time (tPC, or column address to
column address), and the random read/write cycle time (tRC, or start of one row access to the start of the next).
Consultant Steven Przybylski has defined an additional means of analyzing a DRAM’s performance, which he calls a DRAM’s fill frequency (FF). The FF of a memory is its peak bandwidth divided by its granularity (peak Mbytes/sec per Mbyte, or, equivalently, peak Mbps per Mbit). By recognizing that system bandwidth and granularity requirements can also be expressed as an FF, Przybylski developed a metric for evaluating the applicability of individual DRAM architectures to specific application domains. If the system’s FF requirement is greater than a DRAM’s FF, that DRAM cannot meet the system’s performance requirements, because its peak bandwidth is not high enough. FF keeps the system’s bus width out of the equation and allows you to analyze the DRAM’s performance.
A PC with a bus speed of 66 MHz requires a peak bandwidth of 78 to 266 Mbytes/sec to achieve acceptable performance. That system, with 8 Mbytes of DRAM (the minimum memory size), has an FF requirement of 32. These numbers are important to keep in mind as we analyze each of the mainstream (and potentially mainstream) DRAMs.
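The FF metric is simple to compute. The Python sketch below (the function name is ours, for illustration) reproduces the figures used in the following paragraphs from the interface width, cycle time, and device capacity:

    # Fill frequency = peak bandwidth / granularity
    # (peak Mbytes/sec per Mbyte of device capacity).
    def fill_frequency(bus_bits, cycle_ns, capacity_mbits):
        peak_mbytes_sec = bus_bits / 8 / (cycle_ns * 1e-3)
        granularity_mbytes = capacity_mbits / 8
        return peak_mbytes_sec / granularity_mbytes

    # 16-Mbit FPM with tPC=40 nsec (Table 2, -7 grade):
    print(fill_frequency(4, 40, 16))    # by-4 interface:  6.25
    print(fill_frequency(16, 40, 16))   # by-16 interface: 25.0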
Fast-page-mode DRAMs (FPMs) have been around for many years. The functions of FPMs are generally understood.
Using the timing values in Table 2 and assuming the FPM DRAM has a 4-bit interface, the peak bandwidth of that
device is approximately 12.5 Mbytes/sec (using tPC=40 nsec). For a 16-Mbit DRAM, this bandwidth yields an FF of
6.25, far below the PC’s requirements. However, using a 16-Mbit DRAM with a by-16 interface, the FF increases to
25, which seems adequate for most of the PC market. However, the FF is an amalgamation of two unrelated
constraints: bandwidth and granularity. Therefore, the FF can show that memory devices are unacceptable (as in the
case of the by-4 device). But FF doesn’t conclusively show that the devices are acceptable. Each of these constraints
needs to be examined individually before you can verify acceptability.
DRAM manufacturers create extended-data-out DRAMs (EDOs) by adding a latch to the sense amps’ output of an
FPM. This latch allows CAS to go high while waiting for data out to become valid; output enable (OE) controls data
out. In standard page-mode DRAMs during a read, the data-output buffers turn off with the rising edge of CAS, even
if OE stays low.
Although EDO does nothing to improve the DRAM’s latency, EDO DRAMs can cycle at speeds as fast as the address-access time, typically 30 nsec. Therefore, burst transfers can cycle up to 30% faster than fast-page DRAMs. A 16-bit-wide, 16-Mbit EDO DRAM with a tPC of 25 nsec yields an FF equal to 40. Toshiba’s 512kx32-bit device doubles the FF to 80. The same device with a 4-bit-wide interface yields an FF of only 10. Of course, with a tPC of 25 nsec, neither device supports zero wait states on a bus with speeds greater than 40 MHz (and that’s pushing it).
DRAM manufacturers create burst-EDO DRAMs (BEDO) by replacing the EDO DRAM’s output latch with a register
(that is, an additional latch stage), and adding an address latch and a 2-bit counter. As a result of the output register,
data does not reach the outputs on the first CAS cycle. However, the internal pipeline stage allows data to appear in a
shorter time from the activating CAS edge in the second cycle (that is, a shorter tCAC). The first CAS cycle does not
cause additional delay in receiving the first data element. The first data access is actually limited by tRAC, which, in
effect, hides the first CAS cycle.
The tPC for 52-nsec BEDO DRAMs is 15 nsec. For the by-4 and by-16 devices, this level of performance yields FFs of
16.5 and 66, respectively. Again, this level shows that the by-4 version is inadequate for use in a 16-Mbyte PC with a
66-MHz bus.
BEDO DRAMs are burst-access DRAMs in which all read and write cycles occur in bursts of four. The CAS pin
increments the on-chip burst counter. BEDO DRAMs perform interleave or linear bursting. Although bursts can be
terminated by the write-enable signal, BEDO DRAM’s performance advantages are lost during single-cycle accesses.
However, in PCs, most cycles that go through main memory are for burst transfers such as cache fills and DMA.
Synchronous DRAMs (SDRAMs) consist of two equal-sized banks, each with its own row decoder and 8 kbits of sense
amps divided into two blocks of 512 bytes. This architecture allows data access from one bank while the other bank is
in the precharge cycle, yielding gapless data output. Other key SDRAM ingredients include the input and output
buffers, a burst counter, and a mode register to tailor the SDRAM interface.
Although SDRAMs retain the multiplexed address bus and control signals of standard DRAMs, several new signals
facilitate the high-speed interface. A clock synchronizes the flow of addresses, data and control, and pipelining of
operations. A clock-enable input signals the validity of the clock and can disable the clock and put the SDRAM into a
low-power mode. Chip select enables command execution and allows full-page-burst suspension. The burst-oriented
design of the SDRAM supports programmable CAS latencies (1, 2, or 3), burst length (2, 4, 8, or full page), and
transfer order (interleaved or linear).
The 16-Mbit Rambus DRAM (RDRAM) contains a standard DRAM array divided into two independent,
noninterleaved logical banks. Each bank has an associated high-speed row cache that’s approximately two to four
times larger than the row cache on standard DRAMs. RDRAMs use a 250-MHz clock to transfer 1 byte of data every 2
nsec. The RDRAM eliminates RAS and CAS at the system interface by handling all address translations internally.
RDRAMs use an 8-bit external data bus to achieve a maximum performance of 500 Mbytes/sec. The device’s electrical
interface, the RAMBUS channel, comprises 13 high-speed signals. The RAMBUS channel connects the RDRAM to a
memory controller containing a RAMBUS ASIC cell (RAC). The RAC delivers data requests to the RDRAMs in a
manner similar to a communications protocol: A master issues request packets that specify a starting address and a
byte count to be read or written. A RAC is required for each RDRAM in a system.
RDRAMs transfer data in blocks of 8 to 256 bytes. Before each transfer, the system must issue a request packet to the RDRAM, consuming 12 nsec. After a read request to an open row, data does not start to flow for an additional 28 nsec. Writes to an open row can begin almost immediately after the request packet. When a row miss occurs, the RDRAM responds with a NACK signal and begins a 76-nsec precharge cycle. After the precharge cycle, the system must issue another 12-nsec request packet, and data begins to flow 28 nsec later. The large peak FF of
RDRAMs makes the 16-Mbit RDRAM more suitable for graphics and UMA than as a main-memory device.
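The RDRAM latencies quoted above chain together into a simple read-timing model; note that the row-miss total matches the 128-nsec tRAC in Table 2. A Python sketch using only the numbers from this tutorial:

    # RDRAM read latency: 12-nsec request packet, 28 nsec to first
    # data from an open row, and a 76-nsec precharge plus a repeated
    # request on a row miss.
    def rdram_read_latency_ns(row_hit):
        request, read_delay, precharge = 12, 28, 76
        if row_hit:
            return request + read_delay                     # 40 nsec
        return request + precharge + request + read_delay   # 128 nsec

    print(rdram_read_latency_ns(True), rdram_read_latency_ns(False))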
Micron, the biggest supporter of BEDO, is running its fabrication process at 0.35 µm, which allows
the device to more easily yield the 52-nsec tRAC. The company claims 70% yields at 52 nsec. A
Micron spokesperson reports that the company is about to release a 75-MHz specification. Although
BEDO sounds good in theory, its success depends on how quickly SDRAM pricing comes down and on whether enough other vendors build the devices. If SDRAM’s price comes down quickly enough, it will be the
device of choice for computer designers.
From a business perspective, vendors can only charge what the market will bear for increased
performance. Availability also affects device price, because a tremendous demand for a product
produces constraints on the vendor’s capacity to deliver the product. These restrictions, in turn,
drive up the price. The popularity of EDO DRAMs resulted in a price premium. Manufacturers are
increasing EDO DRAM shipments at the expense of FPM DRAMs. To facilitate this shift, some
manufacturers are considering charging a premium for FPM devices and eliminating the premium
on EDO DRAMs. Fujitsu may be taking this position with its SDRAMs. A company representative said
that by the middle of the year, the company will not charge a premium on SDRAMs.
Table 1: Supported features of SDRAM-lite devices

Feature                          Supported
CAS latency (1, 2, 3)            3 only
Burst length (1, 2, 4, 8,        1 and 4
  full page)
Wrap type (interleave,           Yes
  sequential)
Single write                     Yes
Burst-stop command               Yes
Auto precharge                   Yes
DQM byte control (x16)           Yes
Burst interruption               Yes
All-bank precharge               Yes
Random column (every cycle)      Yes
Dual bank                        Yes
Core-logic chip-set support
Because EDO is the next mainstream memory, almost every chip set supports it. (Surprisingly,
Intel’s first chip set for the Pentium Pro only supports FPM.) With minimal additional effort, many
chip sets have the option to support FPM, EDO, or BEDO, because these three memories use a
variation on row access (RAS) and CAS. For example, PicoPower’s Vesuvius system controller takes
an interesting approach to support all three memories. The PCI-based chip set can use FPM, EDO,
and BEDO DRAM banks in a mixed mode. When the system starts up, Vesuvius automatically detects
the memory type and sets up independent timing for each bank. The controller also provides you
with a choice of using 32-bit- or 64-bit-wide banks. Mixing widths also makes it easier to add
memory to a system in increments of 4 Mbytes. However, mixed widths also mean that application
performance may differ depending on which bank you load the application into during runtime.
Table 2: Timing specifications of the internal DRAM

Specification             Symbol  FPM         EDO         BEDO        SDRAM        RAMBUS
                                  -5/-6/-7    -5/-6/-7    -5/-6/-7    -10/-12/-15  250 MHz
Row-access time (nsec)    tRAC    50/60/70    50/60/70    50/60/70    50/60/70     128**
Row-cycle time (nsec)     tRC     95/110/130  89/110/130  90/110/130  100/120/130  NA
Column-address access     tAA     25/30/35    25/30/35    25/30/35    29/35/44     40**
  (nsec)
CAS access (nsec)         tCAC    13/15/20    13/15/20    10/11.6/15  9/11/14*     NA
CAS cycle (nsec)          tPC     30/35/40    20/25/30    15/16.6/20  10/12/15     2
Maximum page              MHz     33/28/25    50/40/33    66/60/50    100/80/66    500
  frequency

Notes: *CAS latency=3. **Doesn’t include any data-transfer time.
SDRAMs require a relatively different interface because the devices are synchronous and use a clock
input. However, system designers’ move away from 32-bit- toward 64-bit-wide memory modules (or
dual in-line memory modules (DIMMs)) will speed the adoption of SDRAM. Vendors design DIMMs
so that EDO and SDRAM can exist on the same module format. Starting this quarter, some core-logic
DRAM controllers, such as those from VLSI, will begin supporting SDRAMs.
Activities in the graphics department
Graphics memory is experiencing significant advancements. The use of high-priced VRAM is nearing
an end as memory vendors move toward devices such as RDRAMs, SDRAMs, and synchronous
graphics DRAMs (SGRAMs). Designers of low-end graphics applications are even finding that
standard EDO devices provide ample performance. For higher performance (at a higher price),
Mosel Vitelic offers 256kx8- and 256kx16-bit EDOs with a guaranteed cycle time of 20 nsec.
Mitsubishi’s 3-D RAM chip, supporting high-performance 3-D graphics, integrates 10 Mbits of DRAM, 2 kbytes of SRAM, and an ALU onto a single chip. Samsung continues to sell its single-sourced window RAM (WRAM) into graphics applications that require VRAM-like performance at a reduced price.
Benefits of integrating L2 cache with core logic
Mathew Arcoleo, Cypress Semiconductor Corp
When building a high-speed L2 cache subsystem, you can significantly increase
timing margins by putting the logic functions on the same chip. If an external SRAM
is used for the tag look-up, several delays must be accounted for:
Tlookup=Taa+Toffs+Tft1+Toncs+Tcomp+Toffcs+Tft2,
where Taa=RAM access time, Toffs=off-chip delay from SRAM (~1 nsec),
Tft1=flight time to the chip set (~1 nsec), Toncs=on-chip delay into the chip set (~1
nsec), Tcomp=Tag compare, Toffcs=off-chip delay of BRDY from the chip set, and
Tft2=flight time of BRDY to the processor.
Contrast this equation to the tag look-up for the integrated tag implementation:
Tlookup=Taa+Tcomp+Toffcs+Tft2.
Three critical delays have been eliminated, thereby increasing the system timing
margin by 2 to 3 nsec. This also eases the speed requirements placed on the
integrated tag RAM.
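Plugging representative numbers into the two equations shows where the margin comes from. In the Python sketch below, the ~1-nsec delays are from the text; the Taa and Tcomp values are placeholders chosen only for illustration:

    # Tag look-up: external SRAM vs integrated tag RAM.
    Taa, Tcomp = 8.0, 2.0        # hypothetical access and compare times
    Toffs = Tft1 = Toncs = 1.0   # off-chip, flight, on-chip delays (~1 nsec each)
    Toffcs = Tft2 = 1.0          # BRDY off-chip delay and flight time

    external = Taa + Toffs + Tft1 + Toncs + Tcomp + Toffcs + Tft2
    integrated = Taa + Tcomp + Toffcs + Tft2
    print(external - integrated)  # 3.0 nsec of margin regained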
Integrating the L2 cache also reduces the capacitive loading on the address bus. In addition, most of the discontinuities on the transmission lines are eliminated, resulting in a cleaner, more reliable system. Having the cache RAM inside the data-path unit also eliminates the 64 data I/O lines from the CPU bus required to connect the L2 cache. Also, only a single set of address inputs, decoders, and clock signals is required.
Interestingly enough, the die size of a core-logic chip set is largely a function of the
number of I/Os, and is not drastically increased by the addition of the cache SRAM.
The data-path unit of a chip set is I/O intensive and usually comes in a 208-pin PQFP
package. Assuming that all of the pins are being used, the die starts to become "pad-limited" at about 100 kmils² (assuming standard 4-mil pads with 2-mil spacing). The chip set’s core logic only requires about 30 kmils² (which includes approximately 12k logic gates at 1.3 mils²/logic gate and routing resources). This is a small fraction of the 100 kmils² required to accommodate the I/O pads.
Mathew Arcoleo is a staff engineer at Cypress Semiconductor, San Jose, CA.
DRAM interfaces on embedded processors
Many embedded µP vendors do not integrate specific DRAM controllers into their devices. One
reason for not doing so is to minimize the amount of silicon. Another reason relates to the
uncertainty of which type of DRAM suits the variety of applications customers may implement.
Motorola’s MPC860 offers a flexible solution for accommodating a variety of memory types. The µP
contains a user-programmable controller that is similar to a microcode machine (Figure 1). The
controller provides you with two general-purpose lines that you can assert and deassert at one-quarter-clock-cycle granularity to control most memory types. To provide this granularity, the MPC860 runs two clocks: One is the system clock, and the other is the system clock shifted by 90°. This clock shift essentially provides the same effect as doubling the clock. Motorola offers a
software-analysis tool that lets you create waveform outputs and learn how to control your system’s
memory.
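The effect of quarter-clock granularity is easiest to see as a waveform table. The Python sketch below is purely illustrative (it does not model the MPC860’s actual microcode or signal names): each control line is simply a list of levels, one per quarter of a bus clock.

    # Illustration only: describe two control lines at quarter-clock
    # granularity, as a programmable memory controller might.
    BUS_CLK_NS = 40                # e.g., a 25-MHz bus clock
    QUARTER_NS = BUS_CLK_NS / 4    # 10-nsec resolution

    waveform = {                   # one level per quarter cycle, two bus clocks
        "GPL_A": [1, 0, 0, 0, 0, 0, 1, 1],   # assert early, hold, deassert
        "GPL_B": [1, 1, 1, 0, 0, 1, 1, 1],   # a shorter, offset pulse
    }
    for name, levels in waveform.items():
        trace = "".join("-" if lv else "_" for lv in levels)
        print(f"{name}: {trace}  (one column = {QUARTER_NS:.0f} nsec)")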
SRAMs for L2 caches
To be a high-performance SRAM for PCs, the device must include a clock input and a pipeline. As a
result, asynchronous SRAMs are giving up market share to higher performing, lower cost
synchronous counterparts. Benchmarks running in systems using EDO DRAMs actually show a
decrease in performance when using asynchronous SRAMs. Therefore, the 32kx32-bit synchronous
burst SRAM has become the device of choice for L2 caches in PCs. This SRAM incorporates a 2-bit
burst counter. In a system with a 66-MHz bus, the device achieves bursts of 3-1-1-1 in pipeline
mode.
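The burst notation converts directly into transfer time and bandwidth. A quick Python sketch for a 32-byte line on a 66-MHz, 64-bit bus (the function name is ours):

    # Convert an x-y-y-y burst into clocks and Mbytes/sec.
    def burst_bandwidth(pattern, bus_mhz=66, bytes_per_transfer=8):
        clocks = sum(pattern)
        time_us = clocks / bus_mhz                    # microseconds
        return clocks, bytes_per_transfer * len(pattern) / time_us

    for p in [(3, 1, 1, 1), (2, 1, 1, 1)]:
        clocks, bw = burst_bandwidth(p)
        print(f"{'-'.join(map(str, p))}: {clocks} clocks, {bw:.0f} Mbytes/sec")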
Table 3: Clocks to access system memory (with L2 cache)

Memory               System-memory  Clocks                Clocks of accesses
                     accesses                             to this memory
L2 cache             80%            6-2-2-2 (12)          9.6 (0.8x12)
Main memory (miss*)  20%            FPM: 14-6-6-6 (32)    FPM: 6.4
                                    EDO: 14-4-4-4 (26)    EDO: 5.2
                                    SDRAM: 14-2-2-2 (20)  SDRAM: 4
                                    BEDO: 12-2-2-2 (18)   BEDO: 3.6

Notes: *FPM is a -7 device. EDO is a -7 device. BEDO is a -5 device. SDRAM is a -15 device.
Table 4: Clocks to access system memory (without L2 cache)

Memory               System-memory  Clocks                Clocks of accesses
                     accesses                             to this memory
Main memory (miss*)  100%           FPM: 14-6-6-6 (32)    FPM: 32
                                    EDO: 14-4-4-4 (26)    EDO: 26
                                    SDRAM: 14-2-2-2 (20)  SDRAM: 20
                                    BEDO: 12-2-2-2 (18)   BEDO: 18

Notes: *FPM is a -7 device. EDO is a -7 device. BEDO is a -5 device. SDRAM is a -15 device.
The synchronous-burst SRAM should not have problems running to 75 MHz. Beyond this speed, you
must be creative with your system design to reduce noise. (IDT, a leading SRAM supplier, predicts
that between 1997 and 1999 there will be a market for 100- to 120-MHz synchronous SRAMs.
Designs will, therefore, get even tougher.) For example, the Pentium Pro uses a separate cache bus
to decrease loading and to allow the clock rate to equal the CPU’s internal clock rate. You can also
improve cache performance by integrating tag, controller, and SRAM on the same chip. Motorola’s
MPC2604GA is an integrated four-way, set-associative cache for PowerPC systems. This integrated
cache performs 2-1-1-1 bursts at 66 MHz vs 3-1-1-1 bursts typical of discrete devices. Sony’s
CXK78V5862GB is an integrated cache that contains a two-way, set-associative, 256-kbyte cache and
a 64-bit processor interface. Sony and IDT also have processor-specific SRAMs that latch incoming
address, data, and control signals into on-chip registers to decouple the processor’s addressing
cycles from memory address cycles. Cypress Semiconductor has taken a different tack for increasing
L2 cache performance by integrating the cache SRAM and control logic directly into its hyperCache
PC core-logic chip set. (See box, "Benefits of integrating L2 cache with core logic.")
Main Memory Effects on System Performance
The effect of main memory on PC system performance depends on the presence of
the processor’s L1 cache and the external L2 cache. The most significant use of PC
main memory is in performing cache-line fills to the L1 and L2 caches. This
operation attempts to transfer the data at the external bus frequency of the
processor, which is 60 or 66 MHz. The processor incurs wait states when a memory
device is unable to transfer data at these rates. You should, therefore, study the
relative performance of each memory device.
For the following analysis, assume that a µP’s internal frequency is twice its external frequency, so each external clock cycle corresponds to two internal clocks. Therefore, when an external memory system is running at zero wait states (that is, 2-1-1-1), the processor core actually sees an operation of 4-2-2-2. Also assume that the L1 and L2 caches achieve a hit rate of 80%.
The main memory has two access modes: page open and closed. Similar to cache, an
access to an open page is significantly faster than an access to a closed page.
Because core logic deasserts RAS after each burst, all DRAM accesses are made to
a closed page. Tables 3 and 4 show the relationship in processor clocks for access to
each memory. For a system that contains an L2 cache, the average cycles are:
L2+FPM DRAM=9.6+6.4=16 clocks,
L2+EDO DRAM=9.6+5.2=14.8 clocks,
L2+SDRAM=9.6+4=13.6 clocks, and
L2+BEDO DRAM=9.6+3.6=13.2 clocks.
For a system that doesn’t contain an L2 cache, the average cycles are:
FPM DRAM=28 clocks,
EDO DRAM=22 clocks,
SDRAM=16 clocks, and
BEDO DRAM=14 clocks.
You can draw two conclusions from the above analysis. First, up to 66 MHz, BEDO
DRAMs even outperform SDRAMs. Second, the benefits of EDO, SDRAM, and BEDO
become more pronounced without an L2 cache. The analysis used FPM, EDO, and SDRAM devices with a 70-nsec row-access time; the BEDO devices have a row-access time of 52 nsec. These access times represent the typical devices used.
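The with-cache averages above are simply a weighted sum of the L2 and main-memory clocks from Table 3; in Python:

    # Average clocks per system-memory access with an L2 cache:
    # 80% of accesses hit the L2, 20% go to main memory (Table 3).
    L2_CLOCKS = 12                 # 6-2-2-2 burst
    MAIN_CLOCKS = {"FPM": 32, "EDO": 26, "SDRAM": 20, "BEDO": 18}

    for name, clocks in MAIN_CLOCKS.items():
        avg = 0.8 * L2_CLOCKS + 0.2 * clocks
        print(f"L2 + {name}: {avg:.1f} clocks")   # 16.0, 14.8, 13.6, 13.2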
[Figure 2]
In some PCs, computer OEMs question the necessity of using an L2 cache. Typically, a
PC with FPM DRAMs includes an L2 cache, whereas EDO-based systems can get by without using
one. Benchmarks from Intel, though, show that adding cache to an EDO-based system boosts
performance by over 20%.
Multitasking operating systems such as Windows 95 and NT require higher performing systems and,
in turn, larger L2 caches to accommodate more L1 cache misses. Various benchmarks show that
applications that are more graphical and mathematical require larger caches. Sharing main memory
with the graphics frame buffer also demands the use of an L2 cache (see box, "Understanding the
UMA trade-offs").
Looking Ahead
Although EDO is the most popular DRAM architecture, the SDRAM is gaining
appeal. Initially, the majority of SDRAMs will target bus frequencies less than 100
MHz, but the SDRAM’s architecture allows it to run even faster. Later this year,
NEC will begin sampling 143-MHz SDRAMs with a stub series-terminated-logic (SSTL) interface. (Currently, SDRAMs use the low-voltage-TTL interface.)
In 1997, RDRAMs will hit 64-Mbit densities, making these devices more cost-effective by amortizing the chip’s control-logic overhead. Although these higher density devices are 100% backward compatible with previous-generation RDRAMs, internally they have increased functionality. Instead of two DRAM banks, the 64-Mbit devices will have four. The four banks, in conjunction with the device’s ability to queue two operations, will allow accesses to be overlapped, mitigating the RDRAM’s severe latencies. Queuing will make the 64-Mbit RDRAMs more useful for Intel’s Pentium Pro, which also queues memory accesses. The 64-Mbit RDRAM’s queuing capability will also benefit systems with a unified memory architecture (UMA), because the queuing will allow overlap of CPU and graphics operations.
IBM, Siemens, and Toshiba have developed functional 256-Mbit DRAMs; these high-density devices will probably have a synchronous interface. If 256 Mbits is not enough, Motorola is now working with these three companies to develop a 1-Gbit device.
To learn more about the current and future state of DRAMs, you can attend
consultant Steven Przybylski’s seminar. The session takes place in Austin and
Portland, OR, in January (dates to be determined) and Feb 6 in Palo Alto, CA. The
seminar costs $495. This seminar was given at the 1995 Microprocessor Forum.
Contact Przybylski directly for more information at (408) 984-2719 or via e-mail at
[email protected].
You can reach Technical Editor Markus Levy at (916) 939-1642; fax (916) 939-1650; email
[email protected]
References
1. Levy, Markus, "The dynamics of DRAM technology," EDN, Jan 5, 1995, pg 46.
2. Przybylski, Steven, New DRAM Technologies, MicroDesign Resources, Sebastopol, CA, 1995.
3. "VESA Unified Memory Architecture Hardware Specification Proposal," Video Electronics
Standards Association, San Jose, CA.
4. Yao, Yong, "Unified Memory Architecture Cuts PC Cost," Microprocessor Report, June 19, 1995,
MicroDesign Resources, Sebastopol, CA.
Manufacturers of DRAMs, Synchronous Burst SRAMs, and Chip Sets
When you contact any of the following manufacturers directly, please let them know you read about their products at the EDN Magazine WWW site.

Alliance Semiconductor, San Jose, CA, (800) 642-7616
Cypress Semiconductor Corp, San Jose, CA, (408) 383-4900, ext 102
Fujitsu Microelectronics Corp, San Jose, CA, (408) 943-2600
Goldstar Technology Inc, San Jose, CA, (408) 432-1331
Hitachi America Ltd, Brisbane, CA, (415) 589-8300
Hyundai America Ltd, San Jose, CA, (408) 473-9200
IBM Microelectronics Inc, Fishkill, NY, (800) 426-0181
Integrated Device Technology (IDT), Santa Clara, CA, (800) 345-7015
Intel Literature Center, Mt Prospect, IL, (800) 468-8118
LG Semicon, San Jose, CA, (408) 432-5000
Matsushita, Milpitas, CA, (408) 946-4311
Micron Technology Inc, Boise, ID, (208) 368-3900
Mitsubishi Electronics America, Sunnyvale, CA, (408) 730-5900
Mosel Vitelic, San Jose, CA, (408) 433-6000
Motorola Inc, Austin, TX, (512) 933-7726
NEC Electronics Inc, Mountain View, CA, (800) 366-9782
Oki Semiconductor Inc, Sunnyvale, CA, (408) 720-1900
Opti Inc, Santa Clara, CA, (408) 980-8178
Paradigm Technology Inc, San Jose, CA, (408) 954-0500
PicoPower Technology, Fremont, CA, (510) 623-8300
Rambus Inc, Mountain View, CA, (415) 903-3800
Ramtron International Corp, Colorado Springs, CO, (719) 481-7000
Samsung Semiconductor Inc, San Jose, CA, (408) 954-6972
Sharp Microelectronics Corp, Mahwah, NJ, (201) 529-8200
Siemens, Cupertino, CA, (408) 777-4500
Silicon Magic, Cupertino, CA, (408) 366-8888
Sony Semiconductor Co, San Jose, CA, (408) 955-6572
Texas Instruments Inc, Literature Response Center, Denver, CO, (800) 477-8924, ext 4500
Toshiba America Electronic Co, Irvine, CA, (714) 455-2000
VLSI Technology Inc, San Jose, CA, (408) 434-3000
Weitek Corp, Sunnyvale, CA, (408) 738-8400
Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.