EEM 486: Computer Architecture
Lecture 6 - Memory Systems and Caches
EEM 486

The Big Picture: Where are We Now?
The five classic components of a computer: Processor (Control + Datapath), Memory, Input, Output
Lec 6.2

The Art of Memory System Design
A workload or benchmark program generates the processor reference stream:
<op,addr>, <op,addr>, <op,addr>, <op,addr>, ... where op is i-fetch, read, or write
The memory system serves this stream with a cache (SRAM) in front of main memory (DRAM).
Goal: optimize the memory system organization to minimize the average memory access time for typical workloads.
Lec 6.3

Technology Trends

Year | DRAM size | Cycle time
1980 |  64 Kb    | 250 ns
1983 | 256 Kb    | 220 ns
1986 |   1 Mb    | 190 ns
1989 |   4 Mb    | 165 ns
1992 |  16 Mb    | 145 ns
1995 |  64 Mb    | 120 ns

Over this period DRAM capacity grew 1000:1, but cycle time improved only about 2:1.
Lec 6.4

Processor-DRAM Memory Gap
• µProc performance ("Moore's Law"): 60%/yr (2x every 1.5 years)
• DRAM performance ("Less' Law?"): 9%/yr (2x every 10 years)
The processor-memory performance gap grows about 50% per year.
Lec 6.5

The Goal: the Illusion of Large, Fast, Cheap Memory
Facts:
• Large memories are slow but cheap (DRAM)
• Fast memories are small but expensive (SRAM)
How do we create a memory that is large, fast, and cheap?
• Memory hierarchy
• Parallelism
Lec 6.6

The Principle of Locality
The principle of locality: programs access a relatively small portion of their address space at any instant of time.
Temporal locality (locality in time):
=> If an item is referenced, it will tend to be referenced again soon
=> Keep the most recently accessed data items closer to the processor
Spatial locality (locality in space):
=> If an item is referenced, nearby items will tend to be referenced soon
=> Move blocks of contiguous words to the upper levels
Q: Why does code have locality? (Loops re-execute the same instructions: temporal. Instructions are fetched sequentially: spatial.)
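To make the two kinds of locality concrete, here is a toy sketch (mine, not from the lecture): a small pool of 8 four-word blocks managed with LRU replacement, fed three synthetic reference streams. The block size, capacity, and streams are illustrative assumptions.

```python
from collections import OrderedDict

def hit_rate(addresses, words_per_block=4, capacity_blocks=8):
    """Toy model: an LRU-managed pool of whole blocks (no indexing conflicts)."""
    cache = OrderedDict()                  # block number -> present, in LRU order
    hits = 0
    for addr in addresses:
        block = addr // words_per_block
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(addresses)

sequential = list(range(256))             # walk 256 consecutive words
looped     = list(range(16)) * 16         # sweep 16 words over and over
scattered  = [i * 64 for i in range(256)] # every access touches a new block

print(hit_rate(sequential), hit_rate(looped), hit_rate(scattered))
```

Sequential access hits on 3 of every 4 words (spatial locality: one miss brings in a whole block); the looped stream misses only on its first pass over 4 blocks (temporal locality); the scattered stream never hits.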
Lec 6.7

Memory Hierarchy
Based on the principle of locality, a hierarchy is a way of providing memory that appears large, cheap, and fast:
• Registers / on-chip cache (in the processor, with control and datapath): ~1 ns, 100s of bytes
• Second-level cache (SRAM): ~10 ns, KBs
• Main memory (DRAM): ~100 ns, MBs
• Secondary storage (disk): ~10,000,000 ns (10s of ms), GBs
• Tertiary storage (tape): ~10,000,000,000 ns (10s of sec), TBs
Cost ($ per MByte) increases toward the top of the hierarchy.
Lec 6.8

Cache Memory
The CPU exchanges single words with the cache; the cache exchanges blocks (lines of K words) with main memory. The cache holds C lines (0 to C-1), each with a tag identifying which of the 2^n memory blocks it currently holds.
Lec 6.9

Elements of Cache Design
Cache size
Mapping function:
• Direct
• Set associative
• Fully associative
Replacement algorithm:
• Least recently used (LRU)
• First in first out (FIFO)
• Random
Write policy:
• Write-through
• Write-back
Line size
Number of caches:
• Single or two level
• Unified or split
Lec 6.10

Terminology
Hit: the data appears in some block in the upper level
• Hit rate: the fraction of memory accesses found in the upper level
• Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss
Lec 6.11

Terminology (continued)
Miss: the data must be retrieved from a block in the lower level
• Miss rate = 1 - (hit rate)
• Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor
Hit time << miss penalty
Lec 6.12

Direct Mapped Cache
Each memory location is mapped to exactly one location in the cache:
Cache block # = (Block address) modulo (# of cache blocks)
             = the low-order log2(# of cache blocks) bits of the block address
Example: with 8 cache blocks (000-111), the 5-bit memory blocks 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache block 001 (their low-order 3 bits).
Lec 6.13

64 KByte Direct Mapped Cache
Address breakdown (bit positions): tag = bits 31-16 (16 bits), index = bits 15-2 (14 bits, selecting one of 16K entries), byte offset = bits 1-0.
Each entry holds a valid bit, a 16-bit tag, and a 32-bit data word; Hit is asserted when the entry is valid and its stored tag matches the address tag.
• Why do we need a Tag field?
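The tag/index/byte-offset split for this 64 KB, one-word-block cache (16-bit tag, 14-bit index, 2-bit byte offset) comes down to a few shifts and masks. A minimal sketch; the function name is mine, not the lecture's:

```python
def split_address(addr, index_bits=14, offset_bits=2):
    """Decompose a 32-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)                  # bits 1-0
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # bits 15-2
    tag = addr >> (offset_bits + index_bits)                  # bits 31-16
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), offset)   # 0x1234 0x159e 0
```

The index answers "which cache entry could this address be in?"; the tag answers "is the block stored there actually the one I want?" - which is exactly why the Tag field is needed.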
(Byte offset: address bits 1-0.)
• Why do we need a Valid bit field?
• What kind of locality are we taking care of?
Total number of bits in a cache = 2^n x (|valid| + |tag| + |block|), where
• 2^n = # of cache blocks
• |valid| = 1 bit
• |tag| = 32 - (n + 2), for a 32-bit byte address and 1-word blocks
• |block| = 32 bits
Lec 6.14

Reading from Cache
Address the cache with the PC (instruction fetch) or the ALU output (load).
If the cache signals hit, we have a read hit:
• The requested word is on the data lines.
Otherwise we have a read miss:
• Stall the CPU
• Fetch the block from memory and write it into the cache
• Restart the execution
Lec 6.15

Writing to Cache
Address the cache with the PC or the ALU output.
If the cache signals hit, we have a write hit, with two options:
- Write-through: write the data into both the cache and memory
- Write-back: write the data only into the cache, and write it to memory only when the block is replaced
Otherwise we have a write miss:
• Handle the write miss as if it were a write hit
Lec 6.16

64 KByte Direct Mapped Cache with Multiword Blocks
Taking advantage of spatial locality: 4-word (16-byte) blocks.
Address breakdown: tag = bits 31-16 (16 bits), index = 12 bits (4K entries), block offset = 2 bits, byte offset = 2 bits.
Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; a 4-to-1 multiplexor selects the requested 32-bit word using the block offset.
Lec 6.17

Writing to Cache (multiword blocks)
If the cache signals hit, we have a write hit:
• Write-through cache: write the data into both the cache and memory
Otherwise we have a write miss:
• Stall the CPU
• Fetch the block from memory and write it into the cache
• Restart the execution and rewrite the word
Lec 6.18

Associativity in Caches
With 8 blocks total:
• One-way set associative (direct mapped): 8 sets of 1 block each
• Two-way set associative: 4 sets of 2 blocks each
• Four-way set associative: 2 sets of 4 blocks each
• Eight-way set associative (fully associative): 1 set of 8 blocks
To find a block: compute the set number = (Block number) modulo (Number of sets), then check every block in that set.
Lec 6.19

Set Associative Cache
An N-way set associative cache:
• N direct mapped caches operating in parallel
• N entries for each cache index
• N comparators and an N-to-1 mux
• Data comes AFTER the hit/miss decision and set selection
Example: a four-way set associative cache with 256 sets - tag = bits 31-10 (22 bits), index = bits 9-2 (8 bits); four 22-bit comparators drive a 4-to-1 multiplexor.
Lec 6.20

Fully Associative Cache
A block can be anywhere in the cache => no cache index.
Compare the cache tags of all cache entries in parallel.
Practical only for a small number of cache blocks.
Example: 32-byte blocks - tag = bits 31-5 (27 bits), byte select = bits 4-0 (e.g., 0x01); one comparator per entry, gated by the valid bit.
Lec 6.21

Four Questions for Caches
Q1: Block placement - where can a block be placed in the upper level?
Q2: Block identification - how is a block found if it is in the upper level?
Q3: Block replacement - which block should be replaced on a miss?
Q4: Write strategy - what happens on a write?
Lec 6.22

Q1: Block Placement?
Block 12 to be placed in an 8-block cache:
• Fully associative: any block (0-7)
• Direct mapped: only block (12 mod 8) = 4
• Two-way set associative (4 sets): any block in set (12 mod 4) = 0
Direct mapped: one place - (Block address) mod (# of cache blocks)
Set associative: a few places - (Block address) mod (# of cache sets), where # of cache sets = # of cache blocks / degree of associativity
Fully associative: any place
Lec 6.23

Q2: Block Identification?
Block address = Tag | Index | Block offset; the index selects the set, and the tag selects the data within it.
• Direct mapped: indexing - index, 1 comparison
• N-way set associative: limited search - index the set, N comparisons
• Fully associative: full search - search all cache entries
Lec 6.24

Q3: Replacement Policy on a Miss?
Easy for direct mapped: there is only one candidate.
Set associative or fully associative:
• Random: randomly select one of the blocks in the set
• LRU (Least Recently Used): select the block in the set that has been unused for the longest time

Miss rates, LRU vs. Random:

Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%
64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%
256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%

Lec 6.25

Q4: Write Policy?
Write-through: the information is written to both the block in the cache and the block in the lower-level memory.
Write-back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
• Requires tracking whether each block is clean or dirty.
Pros and cons of each:
• WT: read misses cannot result in writes
• WB: repeated writes to a block cause no repeated writes to memory
WT is always combined with write buffers to avoid waiting for the lower-level memory.
Lec 6.26

Cache Performance
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Cycle time
Note: memory hit time is included in the execution cycles.
Stalls due to cache misses:
Memory-stall clock cycles = Read-stall clock cycles + Write-stall clock cycles
Read-stall clock cycles = Reads x Read miss rate x Read miss penalty
Write-stall clock cycles = Writes x Write miss rate x Write miss penalty
If read miss penalty = write miss penalty:
Memory-stall clock cycles = Memory accesses x Miss rate x Miss penalty
Lec 6.27

Cache Performance (continued)
CPU time = Instruction count x CPI x Cycle time
         = Inst count x Cycle time x (ideal CPI + Memory stalls/inst + Other stalls/inst)
Memory stalls/inst = Instruction miss rate x Instruction miss penalty
                   + Loads/inst x Load miss rate x Load miss penalty
                   + Stores/inst x Store miss rate x Store miss penalty
Average memory access time (AMAT) = Hit time + (Miss rate x Miss penalty)
                                  = (Hit rate x Hit time) + (Miss rate x Miss time)
Lec 6.28

Example
Suppose a processor executes with:
• Clock rate = 200 MHz (5 ns per cycle)
• Base CPI = 1.1
• Instruction mix: 50% arith/logic, 30% load/store, 20% control
Suppose 10% of data memory operations incur a 50-cycle miss penalty, and 1% of instruction fetches incur the same miss penalty.
CPI = Base CPI + average stalls per instruction
    = 1.1 (cycles/inst)
    + 0.30 (data mem ops/inst) x 0.10 (misses/data mem op) x 50 (cycles/miss)
    + 1 (inst fetch/inst) x 0.01 (misses/inst fetch) x 50 (cycles/miss)
    = 1.1 + 1.5 + 0.5 = 3.1 cycles/inst
AMAT = (1/1.3) x (1 + 0.01 x 50) + (0.3/1.3) x (1 + 0.1 x 50) = 2.54 cycles
(There are 1.3 memory accesses per instruction: 1 instruction fetch plus 0.3 data accesses.)
Lec 6.29

Improving Cache Performance
CPU time = IC x CT x (ideal CPI + memory stalls/inst)
AMAT = Hit time + (Miss rate x Miss penalty) = (Hit rate x Hit time) + (Miss rate x Miss time)
Options to reduce AMAT:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
Lec 6.30

Reduce Misses: Larger Block Size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K; larger blocks reduce the miss rate up to a point, after which miss rate rises again in small caches.]
Increasing block size also increases the miss penalty!
Lec 6.31

Reduce Misses: Higher Associativity
[Figure: miss rate vs. associativity (one-way through eight-way) for cache sizes 1 KB-128 KB; miss rate falls as associativity rises, with diminishing returns.]
Increasing associativity also increases both hit time and hardware cost!
Lec 6.32

Reducing Penalty: Second-Level Cache
L2 equations:
AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
AMAT = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
Processor -> L1 cache -> L2 cache -> main memory
Lec 6.33

Designing the Memory System to Support Caches
• Simple: CPU, cache, bus, and memory all the same width (32 bits)
• Wide: CPU/mux 1 word; mux/cache, bus, and memory N words
• Interleaved: CPU, cache, and bus 1 word; N memory modules
Lec 6.34

Main Memory Performance
DRAM (read/write) cycle time >> DRAM (read/write) access time
• Cycle time: how frequently can you initiate an access?
• Access time: how quickly will you get what you want once you initiate an access?
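A quick arithmetic check of the worked CPI/AMAT example above, as a minimal Python sketch (the function names are mine, not the lecture's):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles: hit time + miss rate x penalty."""
    return hit_time + miss_rate * miss_penalty

def cpi_with_stalls(base_cpi, data_ops_per_inst, data_miss_rate,
                    inst_miss_rate, miss_penalty):
    """Base CPI plus average data-miss and instruction-miss stall cycles."""
    return (base_cpi
            + data_ops_per_inst * data_miss_rate * miss_penalty
            + 1 * inst_miss_rate * miss_penalty)

cpi = cpi_with_stalls(1.1, 0.30, 0.10, 0.01, 50)
# 1.3 memory accesses per instruction: 1 instruction fetch + 0.3 data ops
avg = (1 * amat(1, 0.01, 50) + 0.3 * amat(1, 0.10, 50)) / 1.3
print(round(cpi, 2), round(avg, 2))   # 3.1 2.54
```

Note how the data-side stalls (1.5 cycles/inst) dominate: the 30% of instructions that touch data memory miss ten times as often as instruction fetches.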
The DRAM bandwidth limitation motivates interleaving.
Lec 6.35

Increasing Bandwidth: Interleaving
Access pattern without interleaving (a single memory bank): the CPU must wait a full memory cycle between starting the access for D1 and starting the access for D2; D1 becomes available only at the end of its access.
Access pattern with 4-way interleaving (banks 0-3): accesses to banks 0, 1, 2, and 3 are started one after another; by the time bank 3 has been started, bank 0 can be accessed again.
Lec 6.36

Summary #1/2
The principle of locality:
• A program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal locality: locality in time
- Spatial locality: locality in space
Three (+1) major categories of cache misses:
• Compulsory misses: sad facts of life (example: cold-start misses)
• Conflict misses: reduced by increasing cache size and/or associativity (nightmare scenario: the ping-pong effect!)
• Capacity misses: reduced by increasing cache size
Cache design space:
• Total size, block size, associativity
• Replacement policy
• Write-hit policy (write-through, write-back)
• Write-miss policy
Lec 6.37

Summary #2/2: The Cache Design Space
Several interacting dimensions:
• Cache size
• Block size
• Associativity
• Replacement policy
• Write-through vs. write-back
• Write allocation
The optimal choice is a compromise:
• It depends on access characteristics: the workload and the use (I-cache, D-cache, TLB)
• It depends on technology and cost
Simplicity often wins.
[Figure: qualitative trade-off curve - "good" vs. "bad" as a design factor goes from less to more.]
Lec 6.38
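As a closing sketch, the interleaving timelines of slide 6.36 can be made numeric with a tiny simulation. All timing parameters here (60 ns access time, 120 ns bank cycle time, one new address issued every 10 ns) are illustrative assumptions of mine, not figures from the lecture:

```python
def total_read_time(n_words, banks, access_ns=60, cycle_ns=120, issue_ns=10):
    """Time to read n_words consecutive words, word i living in bank i % banks.

    A bank stays busy for a full cycle_ns per access; the bus can issue one
    new address every issue_ns.
    """
    bank_free = [0] * banks
    done = 0
    for i in range(n_words):
        b = i % banks
        start = max(i * issue_ns, bank_free[b])  # wait for bus slot and bank
        bank_free[b] = start + cycle_ns          # bank busy for a full cycle
        done = start + access_ns                 # data for this word arrives
    return done

print(total_read_time(8, banks=1))   # 900 ns: every access waits a full cycle
print(total_read_time(8, banks=4))   # 210 ns: accesses overlap across banks
```

With one bank, consecutive reads serialize on the bank's cycle time; with four banks, the cycle times overlap and throughput approaches one word per bus slot, which is exactly the point of interleaving.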