432 HM3

Problem 1 (relevant section 4.2)

The simple bus-based multiprocessor illustrated in Figure 4.37 represents a commonly implemented symmetric shared-memory architecture. Each processor has a single private cache, with coherence maintained using the snooping coherence protocol of Figure 4.7. Each cache is direct-mapped, with four blocks each holding two words. To simplify the illustration, the cache address tag contains the full address, and each word shows only two hex characters, with the least significant word on the right. The coherence states are denoted M, S, and I for Modified, Shared, and Invalid.

For each part of this exercise below, assume the initial cache and memory state as illustrated in Figure 4.37. Each part specifies a sequence of one or more CPU operations of the form:

P#: <op> <address> [ <-- <value> ]

where P# designates the CPU (e.g., P0), <op> is the CPU operation (e.g., read or write), <address> denotes the memory address, and <value> indicates the new word to be assigned on a write operation.

Treat each action below as independently applied to the initial state given in Figure 4.37. What is the resulting state (i.e., coherence state, tag, and data) of the caches and memory after the given action? Show only the blocks that change; for example, P0.B0: (I, 120, 00 01) indicates that CPU P0's block B0 has the final state I, tag 120, and data words 00 and 01. Also, what value is returned by each read operation?

a. P15: read 118
b. P15: write 100 <-- 48
c. P15: write 118 <-- 80
d. P15: write 108 <-- 80
e. P15: read 110
f. P15: read 128
g. P15: write 110 <-- 40

Problem 2 (relevant section 4.3)

The performance of a snooping cache-coherent multiprocessor depends on many detailed implementation issues that determine how quickly a cache responds with data in an exclusive or M state block.
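For Problem 1, a quick way to see which cache block an address maps to: with four blocks of two words each, and the addresses in Figure 4.37 spaced 8 bytes apart (suggesting 4-byte words and byte addressing, which is an assumption here), the direct-mapped placement is simply the block number modulo four. A minimal sketch:

```python
# Problem 1's caches: direct-mapped, 4 blocks, 2 words (8 bytes) per block.
# Assumes byte addressing with 4-byte words, which matches the 8-byte
# spacing of the addresses in Figure 4.37 (100, 108, 110, ...).
BLOCK_BYTES = 8
NUM_BLOCKS = 4

def block_index(addr):
    """Direct-mapped placement: block number modulo number of blocks."""
    return (addr // BLOCK_BYTES) % NUM_BLOCKS
```

Under these assumptions, address 0x118 maps to B3, while 0x120 wraps around to B0, so a reference to 0x120 competes with whatever tag B0 currently holds.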
In some implementations, a CPU read miss to a cache block that is exclusive in another processor’s cache is faster than a miss to a block in memory. This is because caches are smaller, and thus faster, than main memory. Conversely, in some implementations, misses satisfied by memory are faster than those satisfied by caches. This is because caches are generally optimized for “front side” or CPU references, rather than “back side” or snooping accesses. For the multiprocessor illustrated in Figure 4.37, consider the execution of a sequence of operations on a single CPU where 
- CPU read and write hits generate no stall cycles.
- CPU read and write misses generate N_memory and N_cache stall cycles if satisfied by memory and by another cache, respectively.
- CPU write hits that generate an invalidate incur N_invalidate stall cycles.
- A write back of a block, either due to a cache replacement or due to a cache supplying a block in response to another processor's request, incurs an additional N_writeback stall cycles.

Consider two implementations with different performance characteristics, summarized in Figure 4.38. Consider the following sequence of operations, assuming the initial cache state in Figure 4.37. For simplicity, assume that the second operation begins after the first completes (even though they are on different processors):

P1: read 110
P15: read 110

For Implementation 1, the first read generates 80 stall cycles because the read is satisfied by P0's cache: P1 stalls for 70 cycles while it waits for the block, and P0 stalls for 10 cycles while it writes the block back to memory in response to P1's request. The second read by P15 then generates 100 stall cycles because its miss is satisfied by memory. Thus this sequence generates a total of 180 stall cycles.

For the following sequences of operations, how many stall cycles are generated by each implementation?

a. P15: read 120
   P15: read 128
   P15: read 130

b. P15: read 118
   P15: write 110 <-- 48
   P15: write 130 <-- 78

c. P1: read 110
   P1: read 108
   P1: read 130

d. P0: read 100
   P0: write 108 <-- 48
   P0: write 128 <-- 78

Problem 3

Consider a 2 MB 4-way set-associative writeback cache with a 16-byte line size and 32-bit byte-addressable addresses. Assume a random replacement policy and a single-core system.

a) Which bits of the address are used for the cache index?
b) Which bits of the address are used for the cache tag?
c) How many bits of total storage does this cache need besides the 2 MB for data? Remember to include any state bits needed.

Problem 4

Part A

Suppose a processor with virtual memory has pages of size 2 MiB (2^21 bytes), 64-bit virtual addresses, and 48-bit physical addresses. Answer the following questions.

a) Is the virtual address space larger or smaller than the physical address space?
b) How many bits long is the page offset in a virtual address?
c) How many bits long is a virtual page number?
d) Is the page number in the most significant bits of a virtual address, or the least significant bits?
e) How many bits long is a physical page number?
f) How many bits long is the page offset in a physical address?
g) How much virtual address space is covered by a 16-entry TLB?
h) What is the page number of the virtual address 0xFFFF900011224488?

Part B

Consider a tiny system with virtual memory. Physical addresses are 8 bits long, but only 2^7 = 128 bytes of physical memory is installed, at physical addresses 0 up to 127. Pages are 2^4 = 16 bytes long. Virtual addresses are 10 bits long. An exception is raised if a program accesses a virtual address whose virtual page has no mapping in the page table, or is mapped to a physical page outside of installed physical memory.

Here are the contents of main memory. To find the physical address of a byte, read the least significant digit from the column label and the most significant digit from the row label. For example, the shaded byte in the second row is at physical address 0x12. All entries are in hexadecimal.

Here is the page table. The virtual page number in the left column is mapped to the physical page number in the second column. Virtual page numbers are listed in binary.
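The translation mechanics Part B relies on can be sketched as follows. With 16-byte pages, the low 4 bits of a virtual address are the page offset and the upper 6 bits are the virtual page number. The page table below is a placeholder (the real mapping comes from the handout's page-table figure); the fault checks mirror the two exception rules stated above.

```python
PAGE_SIZE = 16          # 2^4-byte pages: the offset is the low 4 bits
INSTALLED_BYTES = 128   # only physical addresses 0..127 exist

# Placeholder page table: virtual page number -> physical page number.
# (The actual mapping is given in the handout's page-table figure.)
page_table = {
    0b000101: 0x3,
    0b000110: 0x9,   # deliberately maps past installed memory
}

def translate(va):
    """Translate a 10-bit virtual address, raising on the two fault cases."""
    vpn, offset = va >> 4, va & 0xF
    if vpn not in page_table:
        raise LookupError("exception: virtual page has no mapping")
    pa = page_table[vpn] * PAGE_SIZE + offset
    if pa >= INSTALLED_BYTES:
        raise LookupError("exception: physical page not installed")
    return pa
```

Note that because several virtual pages may map to the same physical page, a single physical byte can be reachable through more than one virtual address, which is what questions B-D below probe.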
A) List the four bytes in the word beginning at physical address 0x34.
B) How many virtual addresses refer to the first byte of the shaded word in row 0x2_? List them.
C) How many virtual addresses refer to the first byte of the shaded word in row 0x4_? List them.
D) How many virtual addresses refer to the first byte of the shaded word in row 0x6_? List them.
E) What data is returned if the program loads a word from virtual address 0x5C (01011100)?
F) What is the result if the program loads a word from virtual address 0x64 (01100100)?
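For Part A's style of question, the page-number/offset split falls directly out of the page size: 2 MiB pages mean a 21-bit offset, with the virtual page number in the bits above it. A minimal sketch (the 4 MiB address below is an illustrative value, not one from the problem):

```python
PAGE_OFFSET_BITS = 21   # 2 MiB = 2^21-byte pages

def split_va(va):
    """Return (virtual page number, page offset) for a 64-bit virtual address."""
    vpn = va >> PAGE_OFFSET_BITS
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, offset

# Example: address 4 MiB (0x400000) sits at the start of virtual page 2.
print(split_va(0x400000))
```

The physical address splits the same way against its 48-bit width, so a physical page number is 48 - 21 = 27 bits long.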