
Computer Architecture
Peripherals
By Dan Tsafrir, 6/6/2011
Presentation based on slides by Lihu Rappoport
MEMORY: REMINDER
Not so long ago…
[Figure: processor vs. DRAM performance, 1980–2000 (log scale). CPU performance improved ~60% per year (2x every 1.5 years) while DRAM improved only ~9% per year (2x every 10 years); the gap grew ~50% per year.]
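To make the divergence concrete, here is a minimal back-of-the-envelope sketch in C, assuming the ~60%/yr and ~9%/yr improvement rates quoted in the figure (the exact rates vary by source and period):

    #include <stdio.h>

    /* Back-of-the-envelope: CPU speed grows ~60%/yr, DRAM ~9%/yr
     * (rates taken from the figure above; real rates vary by period). */
    int main(void)
    {
        double cpu = 1.0, dram = 1.0;
        for (int year = 1980; year <= 2000; year++) {
            if ((year - 1980) % 5 == 0)
                printf("%d: CPU x%-8.1f DRAM x%-5.1f gap x%.1f\n",
                       year, cpu, dram, cpu / dram);
            cpu  *= 1.60;
            dram *= 1.09;
        }
        /* The gap itself grows by 1.60 / 1.09 ~= 1.47, i.e. ~50% per year. */
        return 0;
    }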
Not so long ago…
• In 1994, in their paper "Hitting the Memory Wall: Implications of the Obvious", William Wulf & Sally McKee said:

  "We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one."
More recently (2008)…
[Figure: "The memory wall in the multicore era" (2008): performance in seconds (lower = slower on this axis) as a function of the number of processor cores, for a conventional architecture.]
Memory Trade-Offs
• Large (dense) memories are slow
• Fast memories are small, expensive, and consume a lot of power
• Goal: give the processor the feeling that it has a memory that is large (dense), fast, low-power, and cheap
• Solution: a hierarchy of memories
[Figure: memory hierarchy: CPU → L1 cache → L2 cache → L3 cache → memory (DRAM). Moving away from the CPU, speed goes from fastest to slowest, size from smallest to biggest, and cost and power from highest to lowest.]
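One way to quantify the "feeling" the hierarchy creates is the average memory access time (AMAT). A minimal sketch in C, with hit rates and latencies that are illustrative assumptions (not measurements of any particular machine):

    #include <stdio.h>

    /* Average Memory Access Time (AMAT) for a 3-level cache hierarchy.
     * Hit rates and latencies below are illustrative assumptions only. */
    int main(void)
    {
        double l1_lat = 1, l2_lat = 15, l3_lat = 40, dram_lat = 150; /* ns */
        double l1_hit = 0.95, l2_hit = 0.80, l3_hit = 0.50;          /* local hit rates */

        /* Work from the bottom up: the miss penalty of each level is the
         * AMAT of the level below it. */
        double l3_amat = l3_lat + (1 - l3_hit) * dram_lat;
        double l2_amat = l2_lat + (1 - l2_hit) * l3_amat;
        double amat    = l1_lat + (1 - l1_hit) * l2_amat;

        printf("AMAT = %.2f ns (vs. %.0f ns for DRAM alone)\n", amat, dram_lat);
        return 0;
    }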
Typical levels in mem hierarchy
Memory level         | Size         | Response time
---------------------+--------------+--------------
CPU registers        | ≈ 100 bytes  | ≈ 0.5 ns
L1 cache             | ≈ 64 KB      | ≈ 1 ns
L2 cache             | ≈ 1–4 MB     | ≈ 15 ns
Main memory (DRAM)   | ≈ 1–4 GB     | ≈ 150 ns
Hard disk (SATA)     | ≈ 1–2 TB     | ≈ 15 ms
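To put these response times in processor terms, the following sketch converts them into clock cycles, assuming a 3 GHz core clock (the clock rate is an assumption for illustration only):

    #include <stdio.h>

    /* Express the hierarchy's response times in CPU clock cycles,
     * assuming a 3 GHz clock (illustrative assumption). */
    int main(void)
    {
        const double cycles_per_ns = 3.0;   /* 3 GHz */
        const struct { const char *level; double ns; } lat[] = {
            { "CPU registers",      0.5   },
            { "L1 cache",           1.0   },
            { "L2 cache",           15.0  },
            { "Main memory (DRAM)", 150.0 },
            { "Hard disk (SATA)",   15e6  },   /* 15 ms */
        };
        for (int i = 0; i < 5; i++)
            printf("%-20s ~%.0f cycles\n", lat[i].level, lat[i].ns * cycles_per_ns);
        return 0;
    }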
DRAM & SRAM
DRAM basics
• DRAM
  – Dynamic random-access memory
  – Random access = the access cost is the same (well, not really)
• The CPU thinks of DRAM as 1-dimensional
  – Simpler
• But DRAM is actually arranged as a 2-D grid
  – Need row & column addresses to access it
  – Given a "1-D" address, the DRAM interface splits it into row & column
  – Some time must elapse between the row access and the column access (tens of ns)
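A minimal sketch of the 1-D to 2-D split, assuming a hypothetical array that is 2^14 columns wide (the field width, and using the low bits for the column, are illustrative assumptions; real parts differ):

    #include <stdio.h>
    #include <stdint.h>

    /* Split a flat ("1-D") DRAM address into a row and a column.
     * COL_BITS = 14 is a hypothetical array width, for illustration only. */
    #define COL_BITS 14
    #define COL_MASK ((1u << COL_BITS) - 1)

    int main(void)
    {
        uint32_t addr = 0x00ABCDEF;          /* the address as the CPU sees it */
        uint32_t col  = addr & COL_MASK;     /* low bits  -> column            */
        uint32_t row  = addr >> COL_BITS;    /* high bits -> row               */
        printf("addr=0x%08X -> row=%u, col=%u\n",
               (unsigned)addr, (unsigned)row, (unsigned)col);
        return 0;
    }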
DRAM basics
• Why 2-D? Why delayed row & column accesses?
  – Every address bit requires a physical pin
  – DRAMs are large (GBs nowadays)
    => would need many pins
    => more expensive
  – So the row and column addresses are multiplexed over the same pins, one after the other (see the sketch below)
• A DRAM array has
  – A row decoder
    • Extracts the row number from the memory address
  – A column decoder
    • Extracts the column number from the memory address
  – Sense amplifiers
    • Hold the row when it is (1) written to, (2) read from, or (3) refreshed (see next slide)
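A small arithmetic sketch of the pin saving from sharing the address pins between row and column, assuming a hypothetical 2^15 x 2^15 (1 Gib) array (the geometry is an illustrative assumption):

    #include <stdio.h>

    /* Pin count with dedicated vs. multiplexed row/column address pins.
     * The 2^15 x 2^15 (1 Gib) array geometry is an illustrative assumption. */
    int main(void)
    {
        int row_bits = 15, col_bits = 15;
        int dedicated   = row_bits + col_bits;                       /* 30 pins */
        int multiplexed = row_bits > col_bits ? row_bits : col_bits; /* 15 pins */
        printf("dedicated pins  : %d\n", dedicated);
        printf("multiplexed pins: %d (row first, then column)\n", multiplexed);
        return 0;
    }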
DRAM basics
• One transistor-capacitor pair per bit
• Capacitors leak
  – => need to be refreshed every few ms
  – DRAM spends ~1% of its time refreshing (see the estimate below)
• "Opening" a row
  – = fetching it into the sense amplifiers
  – = refreshing it
• Is it worth making the DRAM array a rectangle (rather than a square)?
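A rough sketch of where a ~1% refresh overhead can come from; the row count, refresh interval, and per-row refresh time below are illustrative assumptions:

    #include <stdio.h>

    /* Rough refresh-overhead estimate. All parameters are illustrative
     * assumptions: 8192 rows, every row refreshed once per 64 ms window,
     * each row refresh occupying the bank for ~100 ns. */
    int main(void)
    {
        double rows = 8192, window_ms = 64, per_row_ns = 100;
        double busy_ns  = rows * per_row_ns;       /* time spent refreshing */
        double total_ns = window_ms * 1e6;         /* length of the window  */
        printf("refresh overhead ~= %.2f%%\n", 100.0 * busy_ns / total_ns);
        return 0;
    }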
x1 DRAM
[Figure: x1 DRAM: a memory array of rows × columns; the row decoder selects a row, the sense amplifiers hold it, the column decoder selects a single bit, and the data in/out buffers drive the one-bit data pin.]
DRAM banks
• Each DRAM memory array outputs one bit
• DRAMs use multiple arrays to output multiple bits at a time
  – "xN" indicates a DRAM with N memory arrays
  – Typical today: x16, x32
• Each collection of N arrays forms a DRAM bank
  – Can read/write from/to each bank independently
x4 DRAM
[Figure: x4 DRAM: four memory arrays, each with its own row decoder, column decoder, sense amplifiers, and data in/out buffers, together outputting 4 bits per access.]
Ranks & DIMMs
• DIMM
  – (Dual in-line) memory module: the unit we connect to the motherboard
• Increase bandwidth by delivering data from multiple banks (see the bandwidth sketch below)
  – The bandwidth of a single bank is limited
    => put multiple banks on a DIMM
  – The bus has a higher clock frequency than any single DRAM
  – The bus controller switches between banks to achieve a high data rate
• Increase capacity by utilizing multiple ranks
  – Each rank is an independent set of banks that can be accessed for the full data bit-width
    • 64 bits for non-ECC; 72 bits for ECC (error-correcting code)
  – Ranks cannot be accessed simultaneously
    • They share the same data path
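A minimal peak-bandwidth sketch for a 64-bit (non-ECC) DIMM; the DDR3-1333-style data rate of 1333 MT/s is an illustrative assumption:

    #include <stdio.h>

    /* Peak DIMM bandwidth = transfers per second x bytes per transfer.
     * The 1333 MT/s (DDR3-1333-like) rate is an illustrative assumption. */
    int main(void)
    {
        double transfers_per_sec  = 1333e6;     /* two transfers per ~666 MHz clock (DDR) */
        double bytes_per_transfer = 64.0 / 8.0; /* 64-bit data bus (non-ECC)              */
        printf("peak ~= %.1f GB/s\n", transfers_per_sec * bytes_per_transfer / 1e9);
        return 0;
    }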
Ranks & DIMMs
[Figure: photo of a 1 GB 2Rx8 DIMM (2 ranks of x8 DRAM chips).]
Modern DRAM organization
• A system has multiple DIMMs
• Each DIMM has multiple DRAM banks
  – Arranged in one or more ranks
• Each bank has multiple DRAM arrays
• Concurrency in banks increases memory bandwidth
Memory controller
[Figure: memory controller driving an address/command bus and a data bus; chip-select signals (chip select 1, chip select 2) determine which rank/DIMM responds.]
Memory controller
• Functionality
  – Executes processor memory requests
• In earlier systems
  – A separate, off-processor chip
• In modern systems
  – Integrated on-chip with the processor
• Interconnect with the processor
  – A bus, but can be point-to-point, or through a crossbar
Lifetime of a memory access
1. The processor orders & queues memory requests
2. Request(s) are sent to the memory controller
3. The controller queues & orders the requests
4. For each request in the queue, when the time is right:
   1. The controller waits until the requested DRAM is ready
   2. The controller breaks the address bits into rank, bank, row, and column fields (see the sketch below)
   3. The controller sends a chip-select signal to select the rank
   4. The selected bank is pre-charged to activate the selected row
   5. The row within the selected DRAM bank is activated
      • using "RAS" (the row-address strobe signal)
   6. The (entire) row is sent to the sense amplifiers
   7. The desired column is selected
      • using "CAS" (the column-address strobe signal)
   8. The data is sent back
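A minimal sketch of step 4.2, using a hypothetical bit layout (the field widths and their order are illustrative assumptions; real controllers pick layouts that spread consecutive accesses across banks):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical physical-address layout (all widths are assumptions):
     * | row (16) | bank (3) | rank (1) | column (10) | byte-in-bus (3) |  */
    int main(void)
    {
        uint64_t addr = 0x3FAB1234;
        unsigned col  = (addr >> 3)  & 0x3FF;    /* 10 bits */
        unsigned rank = (addr >> 13) & 0x1;      /*  1 bit  */
        unsigned bank = (addr >> 14) & 0x7;      /*  3 bits */
        unsigned row  = (addr >> 17) & 0xFFFF;   /* 16 bits */
        printf("addr=0x%llX -> row=%u bank=%u rank=%u col=%u\n",
               (unsigned long long)addr, row, bank, rank, col);
        return 0;
    }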
Basic DRAM array
[Figure: basic DRAM array: the memory address bus feeds a row latch/row address decoder (captured on RAS#) and a column latch/column address decoder (captured on CAS#), which together select the data in the memory array.]

• Timing (2 phases)
  – Decode the row address + assert RAS#
  – Wait for the "RAS-to-CAS delay"
  – Decode the column address + assert CAS#
  – Transfer DATA
DRAM timing
• CAS latency (CL)
  – The number of clock cycles it takes to access a specific column of data
  – Measured from the moment the memory controller issues the column address (within the currently open row) until the data is read out of the memory
• RAS-to-CAS delay (tRCD)
  – The number of cycles between the row access and the column access
• Row pre-charge time (tRP)
  – The number of cycles needed to close the open row and open the next row
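A minimal sketch of how these parameters combine into access latency, with hypothetical DDR3-1600-like values (CL = tRCD = tRP = 11 cycles at an 800 MHz command clock; all numbers are illustrative assumptions):

    #include <stdio.h>

    /* How CL, tRCD and tRP add up. The timing values and the 800 MHz
     * command clock are illustrative assumptions. */
    int main(void)
    {
        double clk_ns = 1.25;                           /* 800 MHz command clock */
        int CL = 11, tRCD = 11, tRP = 11;

        double row_hit   = CL * clk_ns;                 /* row already open                */
        double row_empty = (tRCD + CL) * clk_ns;        /* bank idle: activate, then read  */
        double row_miss  = (tRP + tRCD + CL) * clk_ns;  /* wrong row open: precharge first */

        printf("row hit  : %.2f ns\n", row_hit);
        printf("row empty: %.2f ns\n", row_empty);
        printf("row miss : %.2f ns\n", row_miss);
        return 0;
    }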
Addressing sequence
[Figure: addressing-sequence timing diagram: RAS# is asserted with row i on A[0:7]; after the RAS-to-CAS delay, CAS# is asserted with column n; after the CAS latency, Data n appears on the data lines; the row is then pre-charged (precharge delay) before row j can be opened.]

• Access sequence
  – Put the row address on the address bus and assert RAS#
  – Wait for the RAS#-to-CAS# delay (tRCD)
  – Put the column address on the address bus and assert CAS#
  – DATA transfer
  – Pre-charge
Improved DRAM Schemes
• Paged Mode DRAM
  – Multiple accesses to different columns of the same row (spatial locality)
  – Saves the time it takes to open a new row (but might be unfair); see the row-buffer sketch below
[Figure: paged-mode timing: a single RAS#/row activation followed by several CAS# pulses (Col n, n+1, n+2), each returning its data (Data n, D n+1, D n+2) without re-opening the row.]

• Extended Data Output RAM (EDO RAM)
  – A data-output latch allows the next column address to be driven in parallel with the current column's data

[Figure: EDO timing: as in paged mode, but the address of column n+1 overlaps the transfer of Data n, shortening the time between data words.]
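To see why paged mode helps, here is a tiny open-row ("row buffer") model in C; the row size and the hit/miss latencies are illustrative assumptions. Sequential accesses mostly hit the open row, while large-stride accesses open a new row almost every time:

    #include <stdio.h>
    #include <stdint.h>

    /* Tiny open-row model: an access that falls in the currently open row
     * costs HIT_NS; anything else costs MISS_NS (precharge + activate + read).
     * The row size and both latencies are illustrative assumptions. */
    #define ROW_BYTES 2048
    #define HIT_NS    15.0
    #define MISS_NS   50.0

    static double run(uint64_t start, uint64_t stride, int n)
    {
        double  t = 0.0;
        int64_t open_row = -1;                     /* no row open initially */
        for (int i = 0; i < n; i++) {
            int64_t row = (int64_t)((start + (uint64_t)i * stride) / ROW_BYTES);
            t += (row == open_row) ? HIT_NS : MISS_NS;
            open_row = row;
        }
        return t;
    }

    int main(void)
    {
        printf("sequential (stride 8)   : %.0f ns\n", run(0, 8, 1000));
        printf("strided    (stride 4096): %.0f ns\n", run(0, 4096, 1000));
        return 0;
    }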
Improved DRAM Schemes (cont)
• Burst DRAM
  – Generates the consecutive column addresses by itself

[Figure: burst timing: after a single Col n on the address bus, the DRAM returns Data n, Data n+1, Data n+2 in a burst, incrementing the column address internally.]
Synchronous DRAM (SDRAM)
• Asynchrony in DRAM
  – Due to RAS & CAS arriving at any time
• Synchronous DRAM
  – Uses a clock to deliver requests at regular intervals
  – More predictable DRAM timing
    => less skew
    => faster turnaround
• SDRAMs support burst-mode access
  – Initial performance similar to BEDO (= burst + EDO)
  – Clock scaling later enabled higher transfer rates
    • => DDR SDRAM => DDR2 => DDR3
DRAM vs. SRAM
(Random access = access time the same for all locations)
                | DRAM (dynamic RAM)        | SRAM (static RAM)
----------------+---------------------------+------------------------
Refresh         | Yes (~1% of the time)     | No
Address         | Multiplexed: row + column | Not multiplexed
Random access   | Not really…               | Yes
Density         | High (1 transistor/bit)   | Low (6 transistors/bit)
Power           | Low                       | High
Speed           | Slow                      | Fast
Price/bit       | Low                       | High
Typical usage   | Main memory               | Cache