ECE 486/586 Computer Architecture Chapter 3 UP Data Cache Herbert G. Mayer, PSU Status 1/15/2017 1 Data Cache in UP Microprocessor 2 Syllabus UP Caches Intro: Purpose, Design Parameters, Architecture Effective Time teff Single-Line Degenerate Cache Multi-Line, Single-Set Cache Single-Line, Multi-Set Cache, Blocked Mapping Single-Line, Multi-Set, Cyclic Mapping Multi-Line per Set (Associative), Multi-Set Cache Replacement Policies LRU Sample Compute Cache Size Trace Cache Characteristic Cache Curve Bibliography 3 Intro: Purpose of Cache Cache is logically part of Memory Subsystem, but physically part of microprocessor, i.e. on the same silicon die Purpose: render slow memory into a fast one With minimal cost despite high cost per bit, since the cache is just a few % of total physical main store Works well, only if locality is good; else performance is same as memory access, or worse, depending on architecture With poor locality, when there is a random distribution of memory accesses, then any cache can slow down if: teff = tcache+ (1-h) * tmem and not: teff = max( tcache, (1-h) * tmem ) 4 Intro: Purpose of Cache With good locality, cache delivers available data in close to unit cycle time In MP systems, caches must cooperate with other processors’ caches, memory, some peripherals Even on a UP system there are multiple agents that access memory and thus impact caches, e.g. the DMA and memory controller Cache must cooperate with VMM of memory subsystem to jointly render a physically small, slow memory into a virtually large, fast memory at small cost of added HW (silicon), and system SW L1 cache access time ideally should be within a single machine cycle; realized on many CPUs 5 Intro: Trend of Speeds 6 Intro: Growing Caches Intel Haswell-E Die, Center Showing 20 MB Shared L3 Cache 7 From Definitions in Appendix Line Storage area in cache able to hold a copy of a contiguous block of memory cells, AKA paragraph The portion of memory stored in that line is aligned on a memory address modulo line size For example, if a line holds 64 bytes on a byteaddressable architecture, the address of the first byte has 6 trailing zeros: i.e. it is evenly divisible by 64, or we say it is 64-byte aligned Such known zeros don’t need to be stored in the tag; they are implied, they are known a-priori! This shortens the tag, rendering cache simpler, and cheaper to manufacture: less HW bits! 8 From Definitions in Appendix Set A logically connected region of memory, mapped to a specific area of cache (line), is a set; memory is partitioned into N sets Elements of a set don’t need to be physically contiguous in memory; if contiguous, leftmost log2(N) bits are known a-priori and don’t need to be stored; if cyclic distribution, then the rightmost log2(N) are known a-priori The number of sets is conventionally labeled N A degenerate case maps all memory onto the whole cache, in which case only a single set exists: N = 1; i.e. one set; not a meaningful method! Notion of set is meaningful only if there are multiple sets. Again: A memory region belonging to one set can be a physically contiguous block, or a cyclically distributed part of memory Former case is called blocked, the latter cyclic. Cache area into which such a portion of memory is mapped to is also called set 9 Intro: Cache Design Parameters Number of lines in set: K Quick test: K is how large in a direct-mapped cache? 
Number of units (bytes) in a line is named L, AKA the length of line L Number of sets in memory, and hence in the cache: N Policy upon store miss: cache write policy Policy upon load miss: cache read policy What to do when an empty line is needed for the next paragraph to be streamed in, but none is available? That action is the replacement policy
10 Intro: Cache Design Parameters Size here is the total size of a cache; the unit being discussed can be bits or bytes, be careful! Size = K * ( L + bits for tag and control bits ) * N Ratio of cache size to physical memory is generally a very small percentage, e.g. < 1 % Cache access time, typically close to 1 cycle for an L1 cache Number of processors with cache: 1 in UP, M in MP architecture Levels of caches: L1, L2, L3 … the last one is referred to as LLC, for last level cache
11 Intro: Cache Architecture Cache-related definitions used throughout are common, though not all manufacturers apply the same nomenclature Initially we discuss cache designs for single-processor architectures In the MP cache lecture we may progress to more complex cache designs, covering the MESI protocol for a two-processor system with external L2 cache Focus here: L1 data cache
12 Effective Time teff Starting with teff = tcache + ( 1 - h ) * tmem we observe: No matter how many hits (H) we experience during repeated memory accesses, the effective cycle time is never less than tcache No matter how many misses (M) we experience, the effective cycle time to access a datum is never more than tcache + tmem Desirable to have teff = tmem in case of a cache miss Another way to compute the effective access time is to add up all memory-access times and divide by the total number of accesses, yielding the effective, or average, time teff
13 Effective Time teff Average time per access:
teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
teff = h * tcache + m * ( tcache + tmem ) = ( h + m ) * tcache + m * tmem = tcache + m * tmem
(the case where memory is accessed only after the miss is detected)
• Assume an access time of 1 cycle to reference data in the cache; best case, at times feasible • Assume an access time of 10 cycles for data in memory; that is unrealistically fast! • Assume that a memory access is initiated after a cache miss; then:
14 Effective Time teff
15 Effective Time teff
Symb.    Name            Explanation
H        Hits            Number of successful cache accesses
M        Misses          Number of failed cache accesses
A        All             All accesses, A = H + M
T        Total time      Time for A memory accesses
tcache   Cache time      Time to access data once via the data cache
tmem     Mem time        Time to access data once via memory
teff     Effective time  Average time over all memory accesses
h        Hit rate        H / A = h = 1 - m
m        Miss rate       M / A = m = 1 - h
h + m    Total rate = 1  Total rate, either hit or miss; the probability is 1
16 Effective Time teff
17 Effective Time teff Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99 and 97 percent hit rates: Given a 99% hit rate: 1 miss costs 100 cycles, 99 hits cost 99 cycles total, teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access Given a 97% hit rate: Students compute here!
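Before checking the answer on the next slide, the average-time formula can be tried out directly. A minimal sketch (Python); the function name t_eff is made up, and the cycle counts are the illustrative assumptions from the slides above (1-cycle hit, 10-cycle memory, 100-cycle miss), not measured values:

```python
# Average access time: teff = h * t_hit + (1 - h) * t_miss,
# where t_miss is the full cost of a miss (cache check plus memory access).
def t_eff(hit_rate, t_hit, t_miss):
    return hit_rate * t_hit + (1.0 - hit_rate) * t_miss

print(round(t_eff(0.99, 1, 11), 2))    # 1-cycle cache, 10-cycle memory: 1.1 cycles
print(round(t_eff(0.99, 1, 100), 2))   # 99% hit rate, 100-cycle miss:   1.99 cycles
# Students: evaluate t_eff(0.97, 1, 100) and compare with the next slide
```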
18 Effective Time teff Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99 and 97 percent hit rates:
Given a 99% hit rate: 1 miss costs 100 cycles, 99 hits cost 99 cycles total, teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
Given a 97% hit rate: 3 misses cost 300 cycles, 97 hits cost 97 cycles total, teff = ( 300 + 97 ) / 100 = 397 / 100 = 3.97 ≈ 4 cycles per average access
Or 100% additional cycles for the loss of 2% hit accuracy!
19 Actual Cache Data Intel Core i7 with 3 levels of cache: L1 access > 1 cycle, L3 access costing dozens of cycles, still far faster than a memory access!
20 Complexity of Opteron Data Cache (taken from Wikipedia)
21 Deriving Cache Performance Parameters
22 Single-Line Degenerate Cache
23 Single-Line Degenerate Cache Quick test: what is the minimum size (in number of bits) of the tag for this degenerate cache? (assume a 32-bit architecture and 64-byte lines) The single-line cache, shown here, stores multiple words Can improve memory access if extremely good locality exists within a very narrow address range Upon a miss the cache initiates a stream-in operation Is a direct-mapped cache: all memory locations know a priori where they will reside in the cache; there is but one line, one option for them Is a single-set cache
24 Single-Line Degenerate Cache As a data cache: exploits only locality of nearby addresses in the same paragraph As an instruction cache: exploits locality of tight loops that fit completely inside the address range of a single cache line However, there will be a cache miss as soon as an address references anything outside the line's range For example, a tight loop with a function call will cause a cache miss Stream-in time is the time to load a line of data from memory Total overhead: tag bits + valid bit + dirty bit (if write-back) Not advisable to build this cache subsystem
25 Dual-Line, Single-Set Cache
26 Dual-Line, Single-Set Cache The next cache has 1 set, multiple lines; here 2 lines are shown Quick test: minimum size of the tag on a byte-addressable, 32-bit architecture with 2 lines, 1 set, and a line size of 16 bytes? Each line holds multiple contiguous addressing units; 4 words, 16 bytes shown Thus 2 disparate areas of memory can be cached at the same time Is an associative cache; all lines (i.e. 2 lines) in the single set must be searched to determine whether a memory element is present in the cache Is a single-set associative cache, since all of memory (a singleton set) is mapped onto the same cache lines
27 Dual-Line, Single-Set Cache Some tight loops with a function call can be cached completely in an I-cache, assuming the loop body fits into one line and the callee fits into the other line Also would allow one larger loop to be cached, whose total body does not fit into a single line, but would fit into 2 lines Applies to more realistic programs But if the number of lines K >> 1, the time to search all tags (in a set) can grow beyond unit cycle time Not advisable to build this cache subsystem
28 Single-Line, Dual-Set Cache
29 Single-Line, Dual-Set Cache This cache architecture has multiple sets, 2 shown: 2 distinct areas of memory, each mapped onto a separate cache line: N = 2, K = 1 Quick test: minimum size of the tag on a 32-bit architecture with 4-byte words and 16-byte lines? (the sketch after this slide shows the address split)
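The quick tests on these slides all follow the same address arithmetic. Below is a small sketch (Python) assuming the geometry just given: 32-bit byte addresses, 16-byte lines, N = 2 sets, one line per set; the function names are invented for illustration. It shows both mappings discussed on the following slides: blocked (set chosen by the leftmost bits) and cyclic (set chosen by the bits just above the byte offset).

```python
# Address split for a single-line-per-set cache: N = 2 sets, L = 16-byte lines,
# 32-bit byte addresses (assumed example geometry, not a specific processor).
ADDR_BITS, N, L = 32, 2, 16
offset_bits = L.bit_length() - 1                    # log2(16) = 4
set_bits    = N.bit_length() - 1                    # log2(2)  = 1
tag_bits    = ADDR_BITS - offset_bits - set_bits    # bits that must be stored

def split_cyclic(addr):
    """Cyclic (round-robin) mapping: set index sits just above the offset."""
    offset = addr & (L - 1)
    set_ix = (addr >> offset_bits) & (N - 1)
    tag    = addr >> (offset_bits + set_bits)
    return tag, set_ix, offset

def split_blocked(addr):
    """Blocked mapping: the leftmost log2(N) address bits pick the set."""
    offset = addr & (L - 1)
    set_ix = addr >> (ADDR_BITS - set_bits)
    tag    = (addr >> offset_bits) & ((1 << tag_bits) - 1)
    return tag, set_ix, offset

print(tag_bits)                  # answer to the quick test above
print(split_cyclic(0x1234ABCD))
print(split_blocked(0x1234ABCD))
```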
Each set has a single line, in this case 4 memory words, AKA a paragraph in memory Thus 2 disparate areas of memory can be cached at the same time But these areas must reside in separate memory sets, each contiguous, each having only 1 option Is direct mapped; all memory locations know a priori where they will reside in the cache Is a multi-set cache, since parts of memory have their own portion of the cache
30 Single-Line, Dual-Set Cache Allows one larger loop to be cached, whose total body does not fit into a single line of an I-cache, but would fit into two lines But only if by some great coincidence both parts of that loop reside in different memory sets If used as an instruction cache, all programs consuming half of memory or less never use the line of the second set. Hence this cache architecture would be a bad idea! If used as a data cache, all data areas that fit into the first block will never utilize the second set of the cache The problem is specific to blocked mapping; try cyclic instead Not advisable to build this type of cache
31 Dual-Set, Single-Line, Cyclic
32 Dual-Set, Single-Line, Cyclic This cache architecture also has 2 sets, N = 2 Each set has a single line, each holding 4 contiguous memory units, 4 words, 16 bytes: K = 1 Thus 2 disparate areas of memory can be cached at the same time Quick test: tag size on a 32-bit architecture with 4-byte words? Disparate areas (of line size, equal to paragraph size) are scattered cyclically throughout memory Cyclically distributed memory areas are associated with each respective set Is direct mapped; all memory locations know a priori where they will reside in the cache, as each set has a single line Is a multi-set cache: different locations of memory are mapped onto different cache lines, the sets
33 Dual-Set, Single-Line, Cyclic Also allows one larger loop to be cached, whose total body does not fit into a single line, but would fit into two lines Even if parts of the loop belong to different sets If used as an instruction cache, a small code section can use the total cache If used as a data cache, small data areas can utilize the complete cache Cyclic mapping of memory areas to sets is generally superior to blocked mapping Still not advisable to build this cache subsystem
34 Multi-Line, Multi-Set Cache
35 Multi-Line, Multi-Set Cache Reminder: the tag is the minimal number of address bits to be stored with a cache line Quick test: minimum size (in bits) of the tag? Here: 32-bit architecture, byte-addressable, 2 sets cyclic, line length 16 bytes, 2-way set associative Two sets, memory will be mapped cyclically, AKA in a round-robin fashion Each set has two lines, each line holding 16 bytes; i.e. the paragraph length of memory is 16 bytes in this example! Note: direct-mapped caches, i.e. caches with one line per set, are also common; AKA non-associative
36 Multi-Line, Multi-Set Cache Associative cache: once the set is known, search all tags in all lines of that set for the memory address In the earlier example on slide 34, line 2 of set 2 is unused, AKA invalid in MESI terminology By now you know: sets, lines, associative, non-associative, direct mapped, and more terms! (a small lookup sketch follows below)
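To tie these terms together, here is a minimal lookup sketch (Python) for the 2-way, 2-set, 16-byte-line example above with cyclic mapping; the data structures and names are assumptions made for illustration, not a model of any particular processor. The associative part is the loop over the lines of the selected set; with K = 1 (direct mapped) it degenerates to a single tag comparison.

```python
# Minimal 2-way, 2-set lookup with cyclic mapping, 16-byte lines, 32-bit
# byte addresses (geometry of the example above; a sketch, not a full model).
N_SETS, N_WAYS = 2, 2
OFF_BITS, SET_BITS = 4, 1          # log2(16), log2(2)

# A set is a list of lines; each line carries a valid bit, a tag, and data.
cache = [[{"valid": False, "tag": None, "data": None} for _ in range(N_WAYS)]
         for _ in range(N_SETS)]

def lookup(addr):
    set_ix = (addr >> OFF_BITS) & (N_SETS - 1)
    tag = addr >> (OFF_BITS + SET_BITS)
    # Associative search: compare the tag against every line of the chosen set.
    for line in cache[set_ix]:
        if line["valid"] and line["tag"] == tag:
            return "hit"
    return "miss"    # a real cache would now stream the paragraph in

print(lookup(0x1000))    # cold cache, so this prints "miss"
```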
37 Replacement Policy The replacement policy is the rule that determines, when all lines are valid (i.e. already busy with other, good data) and a new line must be streamed in, which of the valid lines in the cache is to be replaced, AKA removed Removal can be low-cost if the modified bit (AKA dirty bit) is clear (= 0); this means the data in memory and in the cache line are identical! Otherwise removal may be costly: if the dirty bit is set (= 1), the data have to be written back into memory, costing a memory access! We call this copying to memory a stream-out
38 Replacement Policy
#  Name     Summary
1  LRU      Replaces the Least Recently Used cache line; requires keeping track of the relative “ages” of lines. Retire the line that has remained unused for the longest time of all candidate lines. Speculate that this line will also remain unused for the longest time in the future.
2  LFU      Replaces the Least Frequently Used cache line; requires keeping track of the number m of times this line was used over the last n >= m uses. Depending on how long we track the usage, this may require many bits.
3  FIFO     First In First Out: the first of the lines in the set that was streamed in is the first to be retired when it comes time to find a candidate. Has the advantage that no further update is needed while all lines are in use.
4  Random   Pick a random line from the candidate set for retirement; not as bad as this irrational algorithm might suggest. Reason: the other methods are not too good either.
5  Optimal  If a cache were omniscient, it could predict which line will remain unused for the longest time in the future. Of course, that is not computable. However, to create a perfect reference point, we can do this for past memory access patterns and use the optimal pattern for comparison: how well does our chosen policy rate vs. the optimal strategy?
39 LRU Sample 1 Assume the following cache architecture: • N = 16 sets, cyclic distribution • K = 4 lines per set • 32-bit architecture, byte-addressable • write-back (dirty bit) • valid line indicator (valid bit) • L = 64 bytes per line, AKA line length • LRU replacement; uses 2 bits (4 lines per set) to store relative ages • This results in a tag size of ???? bits
40 LRU Sample 1 Assume the following cache architecture: • This results in a tag size of 22 bits • What is the overhead size per line in bits? ––
41 LRU Sample 1 Assume the following cache architecture: • Tag size = 22 bits • 2 LRU bits (4 lines per set), to store the relative ages of the 4 lines in each set • Dirty bit needed, AKA Modified bit = 1 • Valid bit needed = 1 • Overhead per line: 22 + 2 + 1 + 1 = 26 bits
42 LRU Sample 2 Sample 2 focuses on one particular set: let the 4 lines be numbered 0..3 The set is accessed in the order: line 0 miss, line 1 miss, line 0 hit, line 2 miss, line 0 hit again, line 3 miss, line 0 hit again, and another miss Now the cache is full, and an available line must be found by eviction, to make room for that final miss! Assume initially a cold cache: all lines in the cache were free before these accesses Problem: once all lines are filled (the valid bit is 1 for all 4 lines) some line must be retired to make room for the new access that missed, but which? The answer is based on the LRU policy (Least Recently Used line), which here selects line 1; the sketch below reproduces this result
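Before tracing Sample 2 by hand on the next slides, here is a minimal sketch (Python) of the 2-bit relative ages, following the two aging rules stated a couple of slides further on; the helper names and the list representation are assumptions for illustration only.

```python
# LRU relative ages for one 4-line set, driven by Sample 2's access order.
# Assumption: age 0 = youngest (most recently used), larger = older,
# None = line still invalid (cold).
ages = [None] * 4

def fill(line):
    """Miss into an invalid line: it becomes youngest; all valid lines age."""
    for i, a in enumerate(ages):
        if a is not None:
            ages[i] = a + 1
    ages[line] = 0

def touch(line):
    """Hit on a valid line: it becomes youngest; only younger lines age."""
    x = ages[line]
    for i, a in enumerate(ages):
        if a is not None and a < x:
            ages[i] = a + 1
    ages[line] = 0

for event, line in [("miss", 0), ("miss", 1), ("hit", 0), ("miss", 2),
                    ("hit", 0), ("miss", 3), ("hit", 0)]:
    fill(line) if event == "miss" else touch(line)

print(ages)                                  # [0, 3, 2, 1]: line 1 is oldest
print("evict line", ages.index(max(ages)))   # line 1, as stated above
```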
43 LRU Sample 2 The access order, assuming all memory accesses are just reads (loads), no writes (no stores), i.e. the dirty bit is always clear: Read miss, all lines invalid, stream the paragraph into line 0 Read miss (implies a new address), stream the paragraph into line 1 Read hit on line 0 Read miss to a new address, stream the paragraph into line 2 Read hit, access line 0 Read miss, stream the paragraph into line 3 Read hit, access line 0 Now another read miss, all lines valid, find a line to retire, AKA to evict Note that LRU age 00₂ is the youngest, belonging to cache line 0, and 11₂ marks the oldest (AKA least recently used) line, cache line 1, of the 4 relative ages of the 4 total lines
44 LRU Sample 2
45 LRU Sample 2 1. Initially, in a partly cold cache, if we experience a miss there will be an empty line; the paragraph is streamed into the empty line, its relative age is set to 0, and all other ages are incremented by 1 2. In a warm cache (all lines are used), when a line of age X experiences a hit, its new age becomes 0. Ages of all other lines whose age is < X are incremented by 1; i.e. older ones remain “older”
46 Compute Cache Size Typical Cache Design Parameters: 1. Number of lines in every set: K 2. Number of bytes in a line, i.e. the length of a line: L 3. Number of sets in memory, and hence in the cache: N 4. Policy upon memory write (cache write policy) 5. Policy upon read miss (cache read policy) 6. Replacement policy (e.g. LRU, random, FIFO, etc.) 7. Size (bits) = K * ( 8 * L + tag + control bits ) * N
47 Compute Cache Size Compute the minimum number of bits for an 8-way, set-associative cache with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte-addressable, with 32-bit addresses:
Tag                      = 32 - 5 - 6 = 21 bits
LRU for 8 ways           = 3 bits
Dirty bit                = 1 bit
Valid bit                = 1 bit
Overhead per line        = 21 + 3 + 1 + 1 = 26 bits
# of lines               = K * N = 8 * 64 = 512 = 2^9 lines
Data bits per cache line = 32 * 8 = 256 = 2^8 bits
Total cache size         = 2^9 * ( 26 + 2^8 ) = 512 * 282 = 144,384 bits
Size in bytes approx.    = ~17.6 KBytes
48 Trace Cache A Trace Cache is a special-purpose cache that does not hold (raw) instruction bits, but instead stores predecoded operations, AKA micro-ops The old AMD K5 used a Trace Cache; see [1] Intel's Pentium® 4 used a 12 k micro-op Trace Cache Advantages: faster access to executable bits at every cached instruction Disadvantage: less dense storage, i.e. wasted cache bits, when compared to a regular I-cache Note that cache bits are far more costly than memory bits; several decimal orders of magnitude! Trace caches have been falling out of favor since the 2000s
49 Trace Cache
50 Characteristic Cache Curve In the graph below we use a relative number of cache misses [RM] to avoid an unbounded axis RM = 0 is the ideal case: no misses, all hits RM = 1 is the worst case: all memory accesses are cache misses If a program exhibits good locality, a relative cache size of 1 results in good performance; we use this as the reference point: Very coarsely, in some ranges, doubling the cache's size results in roughly 30% fewer cache misses In other ranges of the characteristic curve, doubling the cache results in just a few % fewer misses: beyond the sweet spot!
51 Characteristic Cache Curve
52 Cache vs.
Core on 22 nm Die 53 UP Cache Summary Cache is a special HW storage, allowing fast access to small areas of memory, copied into cache lines Built with expensive technology, hence the size of a cache relative to memory size is small; cache holds only a small subset of memory, typically < 1 % Frequently used data (or instructions in an I-cache) are copied to cache, with the hope that the data present in the cache are accessed frequently Miraculously that is generally true, so caches in general do speed up execution despite slow memories: Exploiting what is known as locality Caches are organized into sets, with each set having 1 or more lines; multiple lines require searching Defined portions of memory get mapped into any one of these sets 54 Bibliography UP Caches 1. Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design, Fundamentals of Superscalar Processors, McGraw Hill, ©2005 2. http://forums.amd.com/forum/messageview.cfm?cati d=11&threadid=29382&enterthread=y 3. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The Cache Performance and Optimizations of Blocked Algorithms," ACM 0-89791-380-9/91, p. 63-74. 4. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf 5. MESI: http://en.wikipedia.org/wiki/Cache_coherence 6. Kilburn, T., et al: “One-level storage systems, IRE Transactions, EC-11, 2, 1962, p. 223-235 55 Definitions For Caches 56 Definitions Aging A cache line’s age is tracked; only in associative cache, doesn’t apply for direct-mapped cache Aging tracks, when a cache line was accessed, relative to the other lines in this set This implies that ages are compared Generally, the relative ages are of interest, such as: am I older than you? Rather than the absolute age, e.g.: I was accessed at cycle such and such Think about the minimum number of bits needed to store the relative ages of, say, 8 cache lines! Memory access addresses only one line, hence all lines in a set have distinct (relative) ages 57 Definitions Alignment Alignment is a spacing requirement, i.e. the restriction that an address adhere to a specific placement condition For example, even-alignment means that an address is even, that it be divisible by 2 E.g. address 3 is not even-aligned, but address 1000 is; thus the rightmost address bit will be 0 In VMM, page addresses are aligned on pageboundaries. If a page-frame has size 4k, then page addresses that adhere to page-alignment are evenly divisible by 4k As a result, the low-order (rightmost) 12 bits are 0. Knowledge of alignment can be exploited to save storing address bits in VMM, caching, etc. 58 Definitions Allocate-on-Write If a store instruction experiences a cache miss, and as a result a cache line is filled, then the allocate-on-write cache policy is used If the write miss causes the paragraph from memory to be streamed into a data cache line, we say the cache uses allocate-on-write Pentium processors, for example, do not use allocate-on-write Antonym: write-by 59 Definitions Associativity If a cache has multiple lines per set, we call it k-way associative; k stands for number of lines in a set Having a cache with multiple lines (i.e. 
k > 1) does require searching, or address comparing; search checks, whether some referenced object is in fact present Another way of saying this is: In an associative cache any memory object has more cache lines than just one, where it might live Antonym: direct mapped; if only a single line (per set) exists, the search is reduced to a simple, single tag comparison 60 Definitions Back-Off If processor P1 issues a store to a data address shared with another processor P2, and P2 has cached and modified the same data, a chance for data inconsistency arises To avoid this, P2 with the modified cache line must snoop for other processors’ accesses, to guarantee delivery of the newest data Once the snoop detects the access request from P1, P1 must be prevented from getting ownership of the data; accomplished by temporarily preventing P1 bus access This bus denial for the sake of preserving data integrity is called back-off 61 Definitions Blocking Cache Let a cache miss result in streaming-in a line If during that stream-in no further accesses can be made to this cache until the data transfer is complete, this cache is called blocking Antonym: non-blocking Generally, a blocking cache yields lower performance than a non-blocking 62 Definitions Bus Master Only one of the devices connected to a system bus has the right to send signals across the bus; this ownership is called being the bus master Initially Memory & IO Controller (MIOC) is bus master; chipset may include special-purpose bus arbiter Over time, all processors –or their caches– may request to become bus master for some number of bus cycles The MIOC can grant this right; yet each of the processors pi (more specifically: its cache) can request a back-off for pj, even if otherwise pj would be bus master 63 Definitions Critical Chunk First The number of bytes in a line is generally larger than the number of bytes that can be brought to the cache across the bus in 1 step, requiring multiple bus transfers to fill a line completely Would be efficient, if the actually needed bytes resided in the first chunk brought across the bus Deliberate policy that accomplishes just that is the Critical Chunk First policy This allows the cache to be unblocked after the first transfer, though line is not completely loaded Other parts of the line may be used later, but the critical byte can thus be accessed right away 64 Definitions Direct Mapped If each memory address has just one possible location (i.e. one single line, of K = 1) in the cache where it could possibly reside, then that cache is called direct mapped Antonym: associative, or fully associative Synonym: non-associative 65 Definitions Directory The collection of all tags is referred to as the cache directory In addition to the directory and the actual data there may be further overhead bits in a data cache Dirty Bit Dirty bit is a data structure associated with a cache line. 
This bit expresses whether a write hit has occurred on a system applying write-back Synonym: Modified bit There may be further overhead bits in a data cache 66 Definitions Effective Cycle Time teff Let the cache hit rate h be the number of hits divided by the number of all memory accesses, with an ideal hit rate being 1; m being the miss rate = 1-h; thus: teff = tcache + (1-h) * tmem = tcache + m * tmem Alternatively, the effective cycle time might be teff = max( tcache, m * tmem ) The latter holds, if a memory access to retrieve the data is initiated simultaneously to the cache access tcache = time to access a datum in the cache, ideally 1 cycle, while tmem is the time to access a data item in memory; generally not a constant value The hit rate h varies from 0.0 to 1.0 67 Definitions Exclusive State in MESI protocol. The E state indicates that the current cache is not aware of any other cache sharing the same information, and that the line is unmodified E allows that in the future another line may contain a copy of the same information, in which case the E must transition to another state Possible that a higher-level cache (L1 for example viewed from an L2) may actually have a shared copy of the line in exclusive state; however that level of sharing is transparent to other potentially sharing agents outside the current processor 68 Definitions Fully Associative Cache Possible to not partition cache into sets In that case, all lines need to be searched for a cache hit or miss We call this a fully associative cache Generally works for small caches, since the search may become costly in time or HW if the cache were large 69 Definitions Hit Rate h The hit rate h is the number of memory accesses (read/writes, or load/stores) that hit the cache, over the total number of memory accesses By contrast H is the total number of hits A hit rate h = 1 means: all accesses are from the cache, while h = 0 means, all are from memory, i.e. none hit the cache Conventional notations are: hr and hw for read and write hits See also miss rate 70 Definitions Invalid State in the MESI protocol State I indicates that its cache line is invalid, and consequently holds no valid data; it is ready for use It is desirable to have I lines: Allows the stream-in of a paragraph without evicting another cache line Invalid (I) state is always set for any cache line after a system reset 71 Definitions Line Storage area in cache able to hold a copy of a contiguous block of memory cells, i.e. a paragraph The portion of memory stored in that line is aligned on an address modulo the line size For example, if a line holds 64 bytes on a byteaddressable architecture, the address of the first byte has 6 trailing zeros: evenly divisible by 64, it is 64-byte aligned Such known zeros don’t need to be stored in the tag, the address bits stored in the cache; they are implied This shortens the tag, rendering cache cheaper to manufacture: less HW bits! 72 Definitions LLC Acronym for Last Level Cache. 
This is the largest cache in the memory hierarchy, the one closest to physical memory, or furthest from the processor Typical on multi-core architectures Typical cache sizes: 4 MB to 32 MB Common to have one LLC shared among all cores of an MCP (Multi-Core Processor), but there is the option of separating it (by fusing) and creating dedicated LLC caches with identical total size 73 Definitions LRU Acronym for Least Recently Used Cache replacement policy (also a page replacement policy, discussed under VMM) that requires aging information for the lines in a set Each time a cache line is accessed, that line becomes the youngest one touched Other lines of the same set age by one unit, i.e. get older by 1 event; an event is a memory access Relative ages are sufficient for LRU tracking; no need to track exact ages! Antonym: last recently used! 74 Definitions Locality of Data A surprising, beneficial attribute of memory access patterns: when an address is referenced, there is a good chance that in the near future another access will happen at or near that same address I.e. memory accesses tend to cluster, also observable in hashing functions and memory page accesses Antonym: randomly distributed, or normally distributed 75 Definitions MESI Acronym for Modified, Exclusive, Shared and Invalid This is an ancient protocol to ensure cache coherence on the family of Pentium processors. A protocol is necessary if multiple processors have copies of common data with the right to modify them Through the MESI protocol, data coherence is ensured no matter which of the processors performs writes AKA the Illinois protocol, due to its origin at the University of Illinois at Urbana-Champaign 76 Definitions Miss Rate The miss rate is the number of memory (read/write) accesses that miss the cache over the total number of accesses, denoted m Clearly the miss rate, like the hit rate, varies between 0.0 ..
1.0 The miss rate m = 1 - h Antonym: hit rate h 77 Definitions Modified State in MESI protocol M state implies that the cache line found by a write hit was exclusive, and that the current processor has modified the data The modified state expresses: Currently not shared, exclusively owned data have been modified In a UP system, this is generally expressed by the dirty bit 78 Definitions Paragraph Conceptual, aligned, fixed-size area of the logical address space that can be streamed into the cache Holding area in the cache of paragraph-size is called a line In addition to the actual data, a line in cache has further information, including the dirty and valid bit (in UP systems), the tag, LRU information, and in MP systems the MESI bits The MESI M state corresponds to the dirty bit in a UP system 79 Definitions Replacement Policy A replacement policy is a defined convention that defines which line is to be retired in case a new line must be loaded, none is free in a set, so one has to be evicted Ideally, the line that would remain unused for the longest time in the future should be replaced and its contents overwritten with new data Generally we do not know which line will stay unreferenced for the longest time in the future In a direct-mapped cache, the replacement policy is trivial, it is moot, as there will be just 1 line 80 Definitions Set A logically connected region of memory, to be mapped onto a specific area of cache (line), is a set; there are N sets in memory Elements of a set don’t need to be physically contiguous in memory; if contiguous, leftmost log2(N) bits are 0; if cyclic distribution, then the rightmost log2(N) after alignment bits are 0 The number of sets is conventionally labeled N A degenerate case is to map all memory onto the whole cache, in which case only a single set exists: N = 1; i.e. one set Notion of set is meaningful only if there are multiple sets. A memory region belonging to one set can be physically contiguous or distributed cyclically In the former case the distribution is called blocked, the latter cyclic. Cache area into which a portion of memory is mapped to is also called set 81 Definitions Set-Associative A cached system in which each set has multiple cache lines is called set-associative For example, 4-way set associative means that there are multiple sets (could be 4 sets, 256 sets, 1024 sets, or any other number of sets) and each of those sets has 4 lines Integral powers of 2 are good to use That’s what the 4 refers to in a 4-way cache Antonym: non-associative, AKA direct-mapped 82 Definitions Shared State in the MESI protocol S state expresses that the hit line is present in more than one cache. Moreover, the current cache (with the shared state) has not modified the line after stream-in Another cache of the same processor may be such a sharing agent. 
For example, in a two level cache, the L2 cache will hold all data present in the L1 cache Similarly, another processor’s L2 cache may share data with the current processor’s L2 cache 83 Definitions Stale Memory A valid cache line may be overwritten with new data The write-back policy records such over writing At the moment of a cache write with write-back, cache and memory are out of synch; we say memory is stale Poses no danger, since the dirty bit (or modified bit) reflects that memory eventually must be updated But until this happens, memory is stale Note that if two processors’ caches share memory and one cache renders memory stale, the other processor should no longer have access to that portion of shared memory 84 Definitions Stream-Out Streaming out a line refers to the movement of one line of modified data, out of the cache and back into a memory paragraph Stream-In The movement of one paragraph of data from memory into a cache line. Since line length generally exceeds the bus width (i.e. exceeds the number of bytes that can be move in a single bus transaction), a stream-in process requires multiple bus transactions in a row Possible that the byte actually needed will arrive last in a cache line during a sequence of bus transactions; can be avoided with the critical chunk first policy 85 Definitions Snooping After a line write hit in a cache using write-back, the data in cache and memory are no longer identical. In accordance with the write-back policy, memory will be written eventually, but until then memory is stale The modifier (the cache that wrote) must pay attention to other bus masters trying to access the same line. If this is detected, action must be taken to ensure data integrity This paying attention is called snooping. The right action may be forcing a back-off, or snarfing, or yet something else that ensures data coherence Snooping starts with the lowest-order cache, here the L2 cache. If appropriate, L2 lets L1 snoop for the same address, because L1 may have further modified the line 86 Definitions Squashing Starting with a read-miss: In a non-blocking cache, a subsequent memory access may be issued after a read-miss, even if that previous miss results in a stream-in that is still under way That subsequent memory access will be a miss again, which is being queued. Whenever an access references an address for which a request is already outstanding, the duplicate request to stream-in can be skipped Not entering this in the queue is called squashing The second and any further outstanding memory access can be resolved, once the first stream-in results in the line being present in the cache 87 Definitions Strong Write Order A policy ensuring that memory writes occur in the same order as the store operations in the executing object code Antonym: Weak order The advantage of weak ordering can be speed gain, allowing a compiler or cache policy to schedule instructions out of order; this requires some other policy to ensure data integrity 88 Definitions Stream-In The movement of a paragraph from memory into a cache line Since line length generally exceeds the bus width (i.e. 
exceeds the number of bytes that can be move in a single bus transaction), a stream-in process requires multiple bus transactions It is possible that the byte actually needed arrives last (or first) in a cache line during a sequence of bus transactions Antonym: Stream-out 89 Definitions Stream-Out The movement of one line of modified data from cache into a memory paragraph Antonym: Stream-in Note that unmodified data don’t need to be streamed-out from cache to memory; they are already present in memory 90 Definitions Trace Cache Special-purpose cache that holds predecoded instructions, AKA micro-ops Advantage: Repeated decoding for instructions is not needed Trace caches have fallen out of favor in the 2000s 91 Definitions Valid Bit Single-bit data structure per cache line, indicating, whether or not the line is free; free means invalid If a line is not valid (i.e. if valid bit is 0), it can be filled with a new paragraph upon a cache miss Else, (valid bit 1), the line holds valid information After a system reset, all valid bits of the whole cache are set to 0 The I bit in the MESI protocol takes on that role on an MP cache subsystem To be discussed in MP-cache coherence topic 92 Definitions Weak Write Order A memory-write policy allowing (a compiler or cache) that memory writes may occur in a different order than their originating store operations Antonym: Strong Write Order The advantage of weak ordering is potential speed gain 93 Definitions Write-Back Cache write policy that keeps a line of data (a paragraph) in the cache even after a write, i.e. after a modification The changed state must be remembered via the dirty bit, AKA Modified state, or modified bit Memory is temporarily stale in such a case Upon retirement, any dirty line must be copied back into memory; called write-back Advantage: only one stream-out, no matter how many write hits did occur to that same line 94 Definitions Write-By Cache write policy, in which the cache is not accessed on a write miss, even if there are cache lines in I state A cache using write-by “hopes” that soon there may be a load, which will result in a miss and then stream-in the appropriate line; if not, it was not necessary to stream-in the line in the first place Antonym: allocate-on-write 95 Definitions Write-Once Cache write policy that starts out as write-through and changes to write-back after the first write hit to a line Typical policy imposed onto a higher level L1 cache by the L2 cache Advantage: The L1 cache places no unnecessary traffic onto the system bus upon a cache-write hit Lower level L2 cache can remember that a write has occurred by setting the MESI state to modified 96 Definitions Write-Through Cache write policy that writes data to memory upon a write hit. Thus, cache and main memory are in synch Disadvantage: repeated memory access traffic on the bus 97 Bibliography –Other Topics 1. Don Anderson and Shanley, T., MindShare [1995]. Pentium TM Processor System Architecture, Addison-Wesley Publishing Company, Reading MA, PC System Architecture Series. ISBN 0201-40992-5 2. Pentium Pro Developer’s Manual, Volume 1: Specifications, 1996, one of a set of 3 volumes 3. Pentium Pro Developer’s Manual, Volume 2: Programmer's Reference Manual, Intel document, 1996, one of a set of 3 volumes 4. Pentium Pro Developer’s Manual, Volume 3: Operating Systems Writer’s Manual, Intel document, 1996, one of a set of 3 volumes 5. Y. Sheffer: http://webee.technion.ac.il/courses/044800/lectures/MESI.pdf 6. 
MOESI protocol: http://en.wikipedia.org/wiki/MOESI_protocol 7. MESIF protocol: http://en.wikipedia.org/wiki/MESIF_protocol 98