CHAPTER 5

THE MEMORY SYSTEM

PROBLEMS (Chapter 9, "Sistema di memoria", in the Italian edition)
5.1
Give a block diagram similar to the one in Figure 5.10 for an 8M × 32 memory using
512K × 8 memory chips.
5.2
Consider the dynamic memory cell of Figure 5.6. Assume that C = 50 femtofarads
(10^-15 F) and that the leakage current through the transistor is about 9 picoamperes
(10^-12 A). The voltage across the capacitor when it is fully charged is equal to 4.5 V.
The cell must be refreshed before this voltage drops below 3 V. Estimate the minimum
refresh rate.
5.3
In the bottom right corner of Figure 5.8 there are data input and data output registers.
Draw a circuit that can implement one bit of each of these registers, and show the
required connections to the block "Read/Write circuits & latches" on one side and the
data bus on the other side.

Introduzione all'architettura dei calcolatori 2/ed - Carl Hamacher, Zvonko Vranesic, Safwat Zaky
Copyright © 2006 - The McGraw-Hill Companies srl
5.4
Consider a main memory constructed with SDRAM chips that have timing requirements
depicted in Figure 5.9, except that the burst length is 8. Assume that 32 bits of data
are transferred in parallel. If a 133-MHz clock is used, how much time does it take to
transfer:
(a) 32 bytes of data
(b) 64 bytes of data
What is the latency in each case?
5.5
Criticize the following statement: “Using a faster processor chip results in a corresponding increase in performance of a computer even if the main memory speed remains the
same.”
5.6
A program consists of two nested loops — a small inner loop and a much larger outer
loop. The general structure of the program is given in Figure P5.1. The decimal memory
addresses shown delineate the location of the two loops and the beginning and end of the
total program. All memory locations in the various sections, 17–22, 23–164, 165–239,
and so on, contain instructions to be executed in straight-line sequencing. The program
is to be run on a computer that has an instruction cache organized in the direct-mapped
manner (see Figure 5.15) and that has the following parameters:
Main memory size    64K words
Cache size          1K words
Block size          128 words
Figure P5.1 A program structure for Problem 5.6. (The program occupies memory
locations 17 through 1500. The outer loop, executed 10 times, spans locations 23 to
1200; the inner loop, executed 20 times, spans locations 165 to 239.)
The cycle time of the main memory is 10τ, and the cycle time of the cache is 1τ.
(a) Specify the number of bits in the TAG, BLOCK, and WORD fields in main memory
addresses.
(b) Compute the total time needed for instruction fetching during execution of the
program in Figure P5.1.
5.7
A computer uses a small direct-mapped cache between the main memory and the
processor. The cache has four 16-bit words, and each word has an associated 13-bit tag,
as shown in Figure P5.2a. When a miss occurs during a read operation, the requested
word is read from the main memory and sent to the processor. At the same time, it is
copied into the cache, and its block number is stored in the associated tag. Consider the
following loop in a program where all instructions and operands are 16 bits long:
LOOP   Add        (R1)+,R0
       Decrement  R2
       BNE        LOOP

Figure P5.2 Cache and main memory contents in Problem 5.7. (Part (a) shows the
cache: four 16-bit word locations at addresses 0, 2, 4, and 6, each with an associated
13-bit tag. Part (b) shows the relevant main memory contents: the words A03C, 05D9,
and 10D7 in consecutive word locations starting at address 054E.)
Assume that, before this loop is entered, registers R0, R1, and R2 contain 0, 054E,
and 3, respectively. Also assume that the main memory contains the data shown in
Figure P5.2b, where all entries are given in hexadecimal notation. The loop starts at
location LOOP = 02EC.
(a) Show the contents of the cache at the end of each pass through the loop.
(b) Assume that the access time of the main memory is 10τ and that of the cache is 1τ .
Calculate the execution time for each pass. Ignore the time taken by the processor
between memory cycles.
5.8
Repeat Problem 5.7, assuming only instructions are stored in the cache. Data operands
are fetched directly from the main memory and not copied into the cache. Why does
this choice lead to faster execution than when both instructions and data are written
into the cache?
5.9
A block-set-associative cache consists of a total of 64 blocks divided into 4-block sets.
The main memory contains 4096 blocks, each consisting of 128 words.
(a) How many bits are there in a main memory address?
(b) How many bits are there in each of the TAG, SET, and WORD fields?
5.10
A computer system has a main memory consisting of 1M 16-bit words. It also has a
4K-word cache organized in the block-set-associative manner, with 4 blocks per set
and 64 words per block.
(a) Calculate the number of bits in each of the TAG, SET, and WORD fields of the
main memory address format.
(b) Assume that the cache is initially empty. Suppose that the processor fetches 4352
words from locations 0, 1, 2, . . . , 4351, in that order. It then repeats this fetch
sequence nine more times. If the cache is 10 times faster than the main memory,
estimate the improvement factor resulting from the use of the cache. Assume that
the LRU algorithm is used for block replacement.
5.11
Repeat Problem 5.10, assuming that whenever a block is to be brought from the main
memory and the corresponding set in the cache is full, the new block replaces the most
recently used block of this set.
5.12
Section 5.5.3 illustrates the effect of different cache-mapping techniques, using the
program in Figure 5.19. Suppose that this program is changed so that in the second
loop the elements are handled in the same order as in the first loop, that is, the control
for the second loop is specified as
for i := 0 to 9 do
Derive the equivalents of Figures 5.20 through 5.22 for this program. What conclusions
can be drawn from this exercise?
5.13
A byte-addressable computer has a small data cache capable of holding eight 32-bit
words. Each cache block consists of one 32-bit word. When a given program is executed,
the processor reads data from the following sequence of hex addresses:
200, 204, 208, 20C, 2F4, 2F0, 200, 204, 218, 21C, 24C, 2F4
This pattern is repeated four times.
(a) Show the contents of the cache at the end of each pass through this loop if a
direct-mapped cache is used. Compute the hit rate for this example. Assume that the
cache is initially empty.
(b) Repeat part (a) for an associative-mapped cache that uses the LRU replacement
algorithm.
(c) Repeat part (a) for a four-way set-associative cache.
5.14
Repeat Problem 5.13, assuming that each cache block consists of two 32-bit words. For
part (c), use a two-way set-associative cache.
5.15
How might the value of k in the interleaved memory system of Figure 5.25b influence
block size in the design of a cache memory to be used with the system?
5.16
In many computers the cache block size is in the range of 32 to 128 bytes. What would
be the main advantages and disadvantages of making the size of cache blocks larger or
smaller?
5.17
Consider the effectiveness of interleaving with respect to the size of cache blocks. Using
calculations similar to those in Section 5.6.2, estimate the performance improvement
for block sizes of 16, 8, and 4 words. Assume that all words loaded into the cache are
accessed by the processor at least once.
5.18
Assume a computer has L1 and L2 caches, as discussed in Section 5.6.3. The cache
blocks consist of 8 words. Assume that the hit rate is the same for both caches and that
it is equal to 0.95 for instructions and 0.90 for data. Assume also that the times needed
to access an 8-word block in these caches are C1 = 1 cycle and C2 = 10 cycles.
(a) What is the average access time experienced by the processor if the main memory
uses interleaving? Assume that the memory access parameters are as described in
Section 5.6.1.
(b) What is the average access time if the main memory is not interleaved?
(c) What is the improvement obtained with interleaving?
5.19
Repeat Problem 5.18, assuming that a cache block consists of 4 words. Estimate an
appropriate value for C2 , assuming that the L2 cache is implemented with SRAM chips.
5.20
Consider the following analogy for the concept of caching. A serviceman comes to
a house to repair the heating system. He carries a toolbox that contains a number of
tools that he has used recently in similar jobs. He uses these tools repeatedly, until he
reaches a point where other tools are needed. It is likely that he has the required tools
in his truck outside the house. But, if the needed tools are not in the truck, he must go
to his shop to get them.
Suppose we argue that the toolbox, the truck, and the shop correspond to the L1
cache, the L2 cache, and the main memory of a computer. How good is this analogy?
Discuss its correct and incorrect features.
5.21
A 1024 × 1024 array of 32-bit numbers is to be “normalized” as follows. For each
column, the largest element is found and all elements of the column are divided by
this maximum value. Assume that each page in the virtual memory consists of 4K
bytes, and that 1M bytes of the main memory are allocated for storing data during this
computation. Suppose that it takes 40 ms to load a page from the disk into the main
memory when a page fault occurs.
(a) How many page faults would occur if the elements of the array are stored in column
order in the virtual memory?
(b) How many page faults would occur if the elements are stored in row order?
(c) Estimate the total time needed to perform this normalization for both arrangements
(a) and (b).
5.22
Consider a computer system in which the available pages in the physical memory are
divided among several application programs. When all the pages allocated to a program
are full and a new page is needed, the new page must replace one of the resident pages.
The operating system monitors the page transfer activity and dynamically adjusts the
page allocation to various programs. Suggest a suitable strategy that the operating
system can use to minimize the overall rate of page transfers.
5.23
In a computer with a virtual-memory system, the execution of an instruction may be
interrupted by a page fault. What state information has to be saved so that this instruction
can be resumed later? Note that bringing a new page into the main memory involves a
DMA transfer, which requires execution of other instructions. Is it simpler to abandon
the interrupted instruction and completely reexecute it later? Can this be done?
5.24
When a program generates a reference to a page that does not reside in the physical
main memory, execution of the program is suspended until the requested page is loaded
into the main memory. What difficulties might arise when an instruction in one page
has an operand in a different page? What capabilities must the processor have to handle
this situation?
5.25
A disk unit has 24 recording surfaces. It has a total of 14,000 cylinders. There is an
average of 400 sectors per track. Each sector contains 512 bytes of data.
(a) What is the maximum number of bytes that can be stored in this unit?
(b) What is the data transfer rate in bytes per second at a rotational speed of 7200 rpm?
(c) Using a 32-bit word, suggest a suitable scheme for specifying the disk address,
assuming that there are 512 bytes per sector.
5.26
The seek time plus rotational delay in accessing a particular data block on a disk is
usually much longer than the data flow period for most disk transfers. Consider a long
sequence of accesses to the 3.5-inch disk given as an example in Section 5.9.1, for either
Read or Write operations in which the average block being accessed is 8K bytes long.
(a) Assuming that the blocks are randomly located on the disk, estimate the average
percentage of the total time occupied by seek operations and rotational delays.
(b) Repeat part (a) for the situation in which the disk accesses have been arranged so
that in 90 percent of the cases, the next access will be to a data block on the same
cylinder.
5.27
The average seek time and rotational delay in a disk system are 6 ms and 3 ms, respectively. The rate of data transfer to or from the disk is 30 Mbytes/sec and all disk
accesses are for 8 Kbytes of data. Disk DMA controllers, the processor, and the main
memory are all attached to a single bus. The bus data width is 32 bits, and a bus transfer
to or from the main memory takes 10 nanoseconds.
(a) What is the maximum number of disk units that can be simultaneously transferring
data to or from the main memory?
(b) What percentage of main memory cycles are stolen by a disk unit, on average, over
a long period of time during which a sequence of independent 8K-byte transfers
takes place?
5.28
Given that magnetic disks are used as the secondary storage for program and data files
in a virtual-memory system, which disk parameter(s) should influence the choice of
page size?
5.29
A tape drive has the following parameters:
Bit density                                2000 bits/cm
Tape speed                                 800 cm/s
Time to reverse direction of motion        225 ms
Minimum time spent at an interrecord gap   3 ms
Average record length                      4000 characters
Estimate the percentage gain in time resulting from the ability to read records in both
the forward and backward directions. Assume that records are accessed at random and
that on average, the distance between two records accessed in sequence is four records.
Chapter 5 – The Memory System: Solutions
5.1. The block diagram is essentially the same as in Figure 5.10, except that 16 rows
(each of four 512K × 8 chips) are needed. Address lines A18-0 are connected to all
chips. Address lines A22-19 are connected to a 4-bit decoder to select one of the
16 rows.
5.2. The minimum refresh interval is given by

t = C × ΔV / I = [50 × 10^-15 F × (4.5 − 3) V] / (9 × 10^-12 A) = 8.33 × 10^-3 s

Therefore, each row has to be refreshed every 8 ms.
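As a quick check of the arithmetic (this Python sketch is not part of the original solution):

```python
# Refresh-interval estimate for Problem 5.2: the cell capacitance
# discharges through a roughly constant leakage current, so the time
# for the voltage to droop from 4.5 V to 3 V is t = C * dV / I.
C = 50e-15          # cell capacitance, farads
I_leak = 9e-12      # leakage current, amperes
dV = 4.5 - 3.0      # allowed voltage droop, volts

t_refresh = C * dV / I_leak
print(f"{t_refresh * 1e3:.2f} ms")   # 8.33 ms
```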
5.3. Control signals Min and Mout are needed to control the storing of data into the
memory cells and to gate the data read from the memory onto the bus, respectively.
A possible circuit uses one clocked D flip-flop per bit in each register: the data input
register captures Din from the data bus on the clock edge when Min is asserted, and
its Q output feeds the "Read/Write circuits & latches" block; the data output register
captures the bit read from that block on the clock edge when Mout is asserted, and its
Q output drives Dout onto the data bus.
5.4. (a) It takes 5 + 8 = 13 clock cycles.

Total time = 13 / (133 × 10^6) = 0.098 × 10^-6 s = 98 ns

Latency = 5 / (133 × 10^6) = 0.038 × 10^-6 s = 38 ns

(b) It takes twice as long to transfer 64 bytes, because two independent 32-byte
transfers have to be made. The latency is the same, i.e., 38 ns.
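The same figures can be reproduced with a few lines of Python (an illustrative check, not part of the original solution):

```python
# SDRAM timing from the 5.4 solution: with a burst length of 8 and the
# Figure 5.9 timing, a 32-byte transfer costs 5 latency cycles plus
# 8 data cycles at 133 MHz; a 64-byte transfer is two such bursts.
clock_hz = 133e6
latency_cycles, burst_cycles = 5, 8

total_32B = (latency_cycles + burst_cycles) / clock_hz   # 13 cycles
latency = latency_cycles / clock_hz
total_64B = 2 * total_32B    # two independent 32-byte bursts

print(f"{total_32B * 1e9:.0f} ns, latency {latency * 1e9:.0f} ns")   # 98 ns, latency 38 ns
```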
5.5. A faster processor chip will result in increased performance, but the amount
of increase will not be directly proportional to the increase in processor speed,
because the cache miss penalty will remain the same if the main memory speed
is not improved.
5.6. (a) The main memory address length is 16 bits. The TAG field is 6 bits, the
BLOCK field is 3 bits (8 blocks), and the WORD field is 7 bits (128 words per block).
(b) The program words map onto the cache blocks as follows:

Cache block   Main memory words
Block 0       0-127 and 1024-1151
Block 1       128-255 and 1152-1279
Block 2       256-383 and 1280-1407
Block 3       384-511 and 1408-1535
Block 4       512-639
Block 5       640-767
Block 6       768-895
Block 7       896-1023

The program start (17) and the start of the outer loop (23) fall in block 0; the inner
loop (165-239) falls in block 1; the end of the outer loop (1200) falls in the second
range of block 1; and the end of the program (1500) falls in the second range of
block 3.

Hence, the sequence of reads from the main memory blocks into cache blocks is

Pass 1:        0, 1, 2, 3, 4, 5, 6, 7, 0, 1
Passes 2-10:   0, 1, 0, 1   (on each pass)
End section:   2, 3
As this sequence shows, both the beginning and the end of the outer loop use
blocks 0 and 1 in the cache. They overwrite each other on each pass through the
loop. Blocks 2 to 7 remain resident in the cache until the outer loop is completed.
The total time for reading the blocks from the main memory into the cache is
therefore
(10 + 4 × 9 + 2) × 128 × 10τ = 61,440τ

Executing the program out of the cache:

Outer loop minus inner loop = [(1200 − 22) − (239 − 164)] × 10 × 1τ = 11,030τ
Inner loop = (239 − 164) × 200 × 1τ = 15,000τ
End section of program = (1500 − 1200) × 1τ = 300τ

Total execution time = 87,770τ
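A quick arithmetic check of these totals (Python used here purely for verification; the block-read counts come from the solution above):

```python
# Problem 5.6(b): 10 block reads on pass 1, 4 on each of passes 2-10,
# and 2 for the end section; each block is 128 words, a main memory
# cycle costs 10*tau and a cache cycle costs 1*tau.
tau = 1
block_reads = 10 + 4 * 9 + 2
load_time = block_reads * 128 * 10 * tau                          # 61,440 tau

outer_minus_inner = ((1200 - 22) - (239 - 164)) * 10 * 1 * tau    # 11,030 tau
inner = (239 - 164) * 200 * 1 * tau                               # 15,000 tau
end_section = (1500 - 1200) * 1 * tau                             # 300 tau

total = load_time + outer_minus_inner + inner + end_section
print(total)   # 87770
```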
5.7. In the first pass through the loop, the Add instruction is stored at address 4 in
the cache, and its operand (A03C) at address 6. Then the operand is overwritten
by the Decrement instruction. The BNE instruction is stored at address 0. In the
second pass, the value 05D9 overwrites the BNE instruction, then BNE is read
from the main memory and again stored in location 0. The contents of the cache,
the number of words read from the main memory and from the cache, and the
execution time for each pass are as shown below.

After    Cache contents           MM         Cache
pass     (address: tag, data)     accesses   accesses   Time

1        0: 005E, BNE             4          0          40τ
         4: 005D, Add
         6: 005D, Dec

2        0: 005E, BNE             2          2          22τ
         4: 005D, Add
         6: 005D, Dec

3        0: 005E, BNE             1          3          13τ
         2: 00AA, 10D7
         4: 005D, Add
         6: 005D, Dec

Total                             7          5          75τ
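These counts can be reproduced by simulating the tiny cache directly (a sketch, not part of the original solution; it assumes the byte-addressed layout described above, with the cache word position given by address bits A2-1 and the tag by A15-3):

```python
# Problem 5.7: direct-mapped cache of four 16-bit words. Each pass
# fetches Add (02EC), its operand via (R1)+, Decrement (02EE), and
# BNE (02F0); R1 starts at 054E and R2 = 3, so the loop runs 3 times.
def run():
    cache = {}                       # word position -> tag
    mm = cache_hits = 0
    r1 = 0x054E
    for _ in range(3):               # R2 = 3 passes
        operand = r1
        r1 += 2                      # autoincrement (R1)+
        for addr in (0x02EC, operand, 0x02EE, 0x02F0):
            pos, tag = (addr >> 1) & 3, addr >> 3
            if cache.get(pos) == tag:
                cache_hits += 1      # word found in the cache: 1 tau
            else:
                cache[pos] = tag     # miss: read from main memory, 10 tau
                mm += 1
    return mm, cache_hits, mm * 10 + cache_hits * 1

print(run())   # (7, 5, 75)  -> 7 MM reads, 5 cache reads, 75 tau
```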
5.8. All three instructions are stored in the cache after the first pass, and they remain
in place during subsequent passes. In this case, there is a total of 6 read operations
from the main memory and 6 from the cache. Execution time is 66τ. Instructions
and data are best stored in separate caches, to avoid data overwriting instructions as
in Problem 5.7.
5.9. (a) 4096 blocks of 128 words each require 12 + 7 = 19 bits for the main memory
address.
(b) The TAG field is 8 bits, the SET field is 4 bits, and the WORD field is 7 bits.
5.10. (a) The TAG field is 10 bits, the SET field is 4 bits, and the WORD field is
6 bits.
(b) Words 0, 1, 2, ..., 4351 occupy blocks 0 to 67 in the main memory (MM).
After blocks 0, 1, 2, ..., 63 have been read from MM into the cache on the first
pass, the cache is full. Because the replacement algorithm is LRU, MM blocks that
occupy the first four of the 16 cache sets are always overwritten before they can
be used on a successive pass. In particular, MM blocks 0, 16, 32, 48, and 64
continually displace each other in competing for the 4 block positions in cache set 0.
The same thing occurs in cache set 1 (MM blocks 1, 17, 33, 49, 65), cache set 2
(MM blocks 2, 18, 34, 50, 66), and cache set 3 (MM blocks 3, 19, 35, 51, 67).
MM blocks that occupy the last 12 sets (sets 4 through 15) are fetched once on the
first pass and remain in the cache for the next 9 passes. On the first pass, all 68
blocks of the loop must be fetched from the MM. On each of the 9 successive passes,
blocks in the last 12 sets of the cache (4 × 12 = 48) are found in the cache, and the
remaining 20 (68 − 48) blocks must be fetched from the MM.

Improvement factor = Time without cache / Time with cache
                   = (10 × 68 × 10τ) / [1 × 68 × 11τ + 9 × (20 × 11τ + 48 × 1τ)]
                   = 2.15
5.11. This replacement algorithm is actually better for this particular "large" loop
example. After the cache has been filled by main memory blocks 0, 1, ..., 63 on
the first pass, block 64 replaces block 48 in set 0. On the second pass, block 48
replaces block 32 in set 0. On the third pass, block 32 replaces block 16, and on
the fourth pass, block 16 replaces block 0. On the fifth pass, there are two
replacements: 0 kicks out 64, and 64 kicks out 48. On the sixth, seventh, and eighth
passes, there is only one replacement in set 0. On the ninth pass there are two
replacements in set 0, and on the final pass there is one replacement. The situation
is similar in sets 1, 2, and 3. Again, there is no contention in sets 4 through 15.
In total, there are 11 replacements in set 0 in passes 2 through 10, and the same is
true in sets 1, 2, and 3. Therefore, the improvement factor is

(10 × 68 × 10τ) / [1 × 68 × 11τ + 4 × 11 × 11τ + (9 × 68 − 44) × 1τ] = 3.8
5.12. For the first loop, the contents of the cache are as indicated in Figures 5.20
through 5.22. For the second loop, they are as follows.

(a) Direct-mapped cache

Contents of data cache after pass:

Block position   j=9      i=1      i=3      i=5      i=7      i=9
0                A(0,8)   A(0,0)   A(0,2)   A(0,4)   A(0,6)   A(0,8)
4                A(0,9)   A(0,1)   A(0,3)   A(0,5)   A(0,7)   A(0,9)

(Block positions 1, 2, 3, 5, 6, and 7 remain empty.)

(b) Associative-mapped cache

Contents of data cache after pass:

Block position   j=9      i=0      i=5      i=9
0                A(0,8)   A(0,8)   A(0,8)   A(0,6)
1                A(0,9)   A(0,9)   A(0,9)   A(0,7)
2                A(0,2)   A(0,0)   A(0,0)   A(0,8)
3                A(0,3)   A(0,3)   A(0,1)   A(0,9)
4                A(0,4)   A(0,4)   A(0,2)   A(0,2)
5                A(0,5)   A(0,5)   A(0,3)   A(0,3)
6                A(0,6)   A(0,6)   A(0,4)   A(0,4)
7                A(0,7)   A(0,7)   A(0,5)   A(0,5)

(c) Set-associative-mapped cache

Contents of data cache after pass:

Set 0, block position   j=9      i=3      i=7      i=9
0                       A(0,8)   A(0,2)   A(0,6)   A(0,6)
1                       A(0,9)   A(0,3)   A(0,7)   A(0,7)
2                       A(0,6)   A(0,0)   A(0,4)   A(0,8)
3                       A(0,7)   A(0,1)   A(0,5)   A(0,9)

(Set 1 remains empty.)

In all 3 cases, all elements are overwritten before they are used in the second loop.
This suggests that the LRU algorithm may not lead to good performance if used with
arrays that do not fit into the cache. The performance can be improved by introducing
some randomness in the replacement algorithm.
5.13. The two least-significant bits of an address, A1-0, specify a byte within a
32-bit word. For a direct-mapped cache, bits A4-2 specify the block position. For a
set-associative-mapped cache, bit A2 specifies the set.

(a) Direct-mapped cache

Contents of data cache after:

Block position   Pass 1   Pass 2   Pass 3   Pass 4
0                [200]    [200]    [200]    [200]
1                [204]    [204]    [204]    [204]
2                [208]    [208]    [208]    [208]
3                [24C]    [24C]    [24C]    [24C]
4                [2F0]    [2F0]    [2F0]    [2F0]
5                [2F4]    [2F4]    [2F4]    [2F4]
6                [218]    [218]    [218]    [218]
7                [21C]    [21C]    [21C]    [21C]

Hit rate = 33/48 = 0.69

(b) Associative-mapped cache

Contents of data cache after:

Block position   Pass 1   Pass 2   Pass 3   Pass 4
0                [200]    [200]    [200]    [200]
1                [204]    [204]    [204]    [204]
2                [24C]    [21C]    [218]    [2F0]
3                [20C]    [24C]    [21C]    [218]
4                [2F4]    [2F4]    [2F4]    [2F4]
5                [2F0]    [20C]    [24C]    [21C]
6                [218]    [2F0]    [20C]    [24C]
7                [21C]    [218]    [2F0]    [20C]

Hit rate = 21/48 = 0.44

(c) Set-associative-mapped cache

Contents of data cache after:

        Block position   Pass 1   Pass 2   Pass 3   Pass 4
Set 0   0                [200]    [200]    [200]    [200]
        1                [208]    [208]    [208]    [208]
        2                [2F0]    [2F0]    [2F0]    [2F0]
        3                [218]    [218]    [218]    [218]
Set 1   0                [204]    [204]    [204]    [204]
        1                [24C]    [21C]    [24C]    [21C]
        2                [2F4]    [2F4]    [2F4]    [2F4]
        3                [21C]    [24C]    [21C]    [24C]

Hit rate = 30/48 = 0.63
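All three hit counts can be checked with one small simulator (a sketch, not part of the original solution; it assumes the trace and address breakdown stated above, with one 32-bit word per block and LRU replacement where replacement is needed):

```python
# Problem 5.13: replay the 12-address trace four times against
# (a) a direct-mapped, (b) a fully associative LRU, and (c) a
# four-way set-associative LRU cache of eight one-word blocks.
from collections import OrderedDict

TRACE = [0x200, 0x204, 0x208, 0x20C, 0x2F4, 0x2F0,
         0x200, 0x204, 0x218, 0x21C, 0x24C, 0x2F4] * 4

def count_hits(num_sets, ways):
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in TRACE:
        block = addr >> 2                  # one 32-bit word per block
        s = sets[block % num_sets]         # low block bits select the set
        if block in s:
            hits += 1
            s.move_to_end(block)           # refresh LRU position
        else:
            if len(s) == ways:
                s.popitem(last=False)      # evict least recently used
            s[block] = True
    return hits

print(count_hits(8, 1), "of", len(TRACE))   # (a) direct-mapped: 33 of 48
print(count_hits(1, 8), "of", len(TRACE))   # (b) associative:   21 of 48
print(count_hits(2, 4), "of", len(TRACE))   # (c) set-assoc.:    30 of 48
```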
5.14. The two least-significant bits of an address, A1-0, specify a byte within a
32-bit word. For a direct-mapped cache, bits A4-3 specify the block position. For a
set-associative-mapped cache, bit A3 specifies the set. Each block now holds two
words, listed together below.

(a) Direct-mapped cache

Contents of data cache after:

Block position   Pass 1        Pass 2        Pass 3        Pass 4
0                [200],[204]   [200],[204]   [200],[204]   [200],[204]
1                [248],[24C]   [248],[24C]   [248],[24C]   [248],[24C]
2                [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]
3                [218],[21C]   [218],[21C]   [218],[21C]   [218],[21C]

Hit rate = 37/48 = 0.77

(b) Associative-mapped cache

Contents of data cache after:

Block position   Pass 1        Pass 2        Pass 3        Pass 4
0                [200],[204]   [200],[204]   [200],[204]   [200],[204]
1                [248],[24C]   [218],[21C]   [248],[24C]   [218],[21C]
2                [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]
3                [218],[21C]   [248],[24C]   [218],[21C]   [248],[24C]

Hit rate = 34/48 = 0.71

(c) Set-associative-mapped cache

Contents of data cache after:

        Block position   Pass 1        Pass 2        Pass 3        Pass 4
Set 0   0                [200],[204]   [200],[204]   [200],[204]   [200],[204]
        1                [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]   [2F0],[2F4]
Set 1   0                [248],[24C]   [218],[21C]   [248],[24C]   [218],[21C]
        1                [218],[21C]   [248],[24C]   [218],[21C]   [248],[24C]

Hit rate = 34/48 = 0.71
5.15. The block size (number of words in a block) of the cache should be at least
as large as 2^k, in order to take full advantage of the multiple-module memory when
transferring a block between the cache and the main memory. Power-of-2 multiples
of 2^k work just as efficiently, and are natural because the block size is 2^k for k bits
in the "word" field.
5.16. Larger size
• fewer misses if most of the data in the block are actually used
• wasteful if much of the data are not used before the cache block is ejected
from the cache
Smaller size
• more misses
5.17. For 16-word blocks the value of M is 1 + 8 + 3 × 4 + 4 = 25 cycles. Then

Time without cache / Time with cache = 4.04

In order to compare the 8-word and 16-word blocks, we can assume that two 8-word
blocks must be brought into the cache for each 16-word block. Hence, the effective
value of M is 2 × 17 = 34. Then

Time without cache / Time with cache = 3.3

Similarly, for 4-word blocks the effective value of M is 4 × (1 + 8 + 4) = 52 cycles.
Then

Time without cache / Time with cache = 2.42

Clearly, interleaving is more effective if larger cache blocks are used.
5.18. The hit rates are

h1 = h2 = h = 0.95 for instructions
            = 0.90 for data

The average access time is computed as

t_ave = h C1 + (1 − h) h C2 + (1 − h)^2 M

(a) With interleaving, M = 17. Then

t_ave = (0.95 × 1 + 0.05 × 0.95 × 10 + 0.0025 × 17)
        + 0.3 × (0.9 × 1 + 0.1 × 0.9 × 10 + 0.01 × 17)
      = 2.0585 cycles

(b) Without interleaving, M = 38. Then t_ave = 2.174 cycles.

(c) Without interleaving, the average access takes 2.174/2.0585 = 1.056 times
longer.
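The formula is easy to check numerically (a Python sketch, not part of the original solution; the 0.3 factor is the solution's assumed number of data accesses per instruction):

```python
# Problem 5.18: two-level cache with t_ave = h*C1 + (1-h)*h*C2 + (1-h)^2 * M,
# applied separately to instruction accesses (h = 0.95) and data accesses
# (h = 0.90), with 0.3 data accesses per instruction.
def t_ave(h, C1, C2, M):
    return h * C1 + (1 - h) * h * C2 + (1 - h) ** 2 * M

def per_instruction(M, C1=1, C2=10):
    return t_ave(0.95, C1, C2, M) + 0.3 * t_ave(0.90, C1, C2, M)

interleaved = per_instruction(M=17)   # (a)
flat = per_instruction(M=38)          # (b)
print(round(interleaved, 4), round(flat, 4), round(flat / interleaved, 3))
# 2.0585 2.174 1.056
```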
5.19. Suppose that it takes one clock cycle to send the address to the L2 cache, one
cycle to access each word in the block, and one cycle to transfer a word from the
L2 cache to the L1 cache. This leads to C2 = 6 cycles.
(a) With interleaving M = 1 + 8 + 4 = 13. Then tave = 1.79 cycles.
(b) Without interleaving M = 1 + 8 + 3 × 4 + 1 = 22. Then tave = 1.86 cycles.
(c) Without interleaving the average access takes 1.86/1.79 = 1.039 times
longer.
5.20. The analogy is good with respect to:
• relative sizes of toolbox, truck and shop versus L1 cache, L2 cache and
main memory
• relative access times
• relative frequency of use of tools in the 3 storage places versus the data
accesses in caches and the main memory
The analogy fails with respect to the facts that:
• at the start of a working day the tools placed into the truck and the toolbox
are preselected based on the experience gained on previous jobs, while in
the case of a new program that is run on a computer there is no relevant
data loaded into the caches before execution begins
10
Introduzione all'architettura dei calcolatori 2/ed - Carl Hamacher, Zvonko Vranesic, Safwat Zaky
Copyright © 2006 - The McGraw-Hill Companies srl
• most of the tools in the toolbox and the truck are useful in successive jobs,
while the data left in a cache by one program are not useful for the subsequent programs
• tools displaced by the need to use other tools are never thrown away, while
data in the cache blocks are simply overwritten if the blocks are not flagged
as dirty
5.21. Each 32-bit number comprises 4 bytes. Hence, each page holds 1024 numbers.
There is space for 256 pages in the 1M-byte portion of the main memory that is
allocated for storing data during the computation.
(a) Each column is one page; there will be 1024 page faults.
(b) Processing of entire columns, one at a time, would be very inefficient and
slow. However, if only one quarter of each column (for all columns) is processed
before the next quarter is brought in from the disk, then each element of the array
must be loaded into the memory twice. In this case, the number of page faults
would be 2048.
(c) Assuming that the computation time needed to normalize the numbers is
negligible compared to the time needed to bring a page from the disk:
Total time for (a) is 1024 × 40 ms = 41 s
Total time for (b) is 2048 × 40 ms = 82 s
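A quick arithmetic check of the page and fault counts (a sketch, not part of the original solution; case (b) assumes the quarter-column strategy described above, which loads each element twice):

```python
# Problem 5.21: a 1024 x 1024 array of 32-bit numbers, 4K-byte pages,
# 1M bytes of main memory for data, and 40 ms per page fault.
fault_ms = 40
numbers_per_page = 4096 // 4                        # 1024 numbers per page
pages_in_array = (1024 * 1024) // numbers_per_page  # 1024 pages in the array
resident_pages = (1 << 20) // 4096                  # 256 pages fit in 1M bytes

faults_column_order = pages_in_array       # (a) each column is exactly one page
faults_row_order = 2 * pages_in_array      # (b) every element loaded twice

print(faults_column_order * fault_ms, "ms")   # 40960 ms, about 41 s
print(faults_row_order * fault_ms, "ms")      # 81920 ms, about 82 s
```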
5.22. The operating system may increase the number of main memory pages allocated
to a program that has a large number of page faults, using space previously allocated
to a program with few page faults.
5.23. Continuing the execution of an instruction interrupted by a page fault requires
saving the entire state of the processor, which includes saving all registers that
may have been affected by the instruction as well as the control information that
indicates how far the execution has progressed. The alternative of re-executing
the instruction from the beginning requires a capability to reverse any changes
that may have been caused by the partial execution of the instruction.
5.24. The problem is that a page fault may occur during intermediate steps in the execution of a single instruction. The page containing the referenced location must
be transferred from the disk into the main memory before execution can proceed.
Since the time needed for the page transfer (a disk operation) is very long, as
compared to instruction execution time, a context-switch will usually be made.
(A context-switch consists of preserving the state of the currently executing program, and ”switching” the processor to the execution of another program that is
resident in the main memory.) The page transfer, via DMA, takes place while
this other program executes. When the page transfer is complete, the original
program can be resumed.
Therefore, one of two features is needed in a system where the execution of an
individual instruction may be suspended by a page fault. The first possibility is to
save the state of instruction execution. This involves saving more information
(temporary programmer-transparent registers, etc.) than is needed when a program is
interrupted between instructions. The second possibility is to "unwind" the effects of
the portion of the instruction completed when the page fault occurred, and then
execute the instruction from the beginning when the program is resumed.
5.25. (a) The maximum number of bytes that can be stored on this disk is
24 × 14,000 × 400 × 512 = 68.8 × 10^9 bytes.
(b) The data transfer rate is (400 × 512 × 7200)/60 = 24.58 × 10^6 bytes/s.
(c) We need 9 bits to identify a sector, 14 bits for a track, and 5 bits for a surface.
Thus, a possible scheme is to use address bits A8-0 for the sector, A22-9 for the
track, and A27-23 for surface identification. Bits A31-28 are not used.
5.26. The average seek time and rotational delay are 6 and 3 ms, respectively. The
average data transfer rate from a track to the data buffer in the disk controller is
34 Mbytes/s. Hence, it takes 8K/34M = 0.23 ms to transfer a block of data.
(a) The total time needed to access each block is 9 + 0.23 = 9.23 ms. The
portion of time occupied by seek and rotational delay is 9/9.23 = 0.97 = 97%.
(b) Only rotational delays are involved in 90% of the cases. Therefore, the average
time to access a block is 0.9 × 3 + 0.1 × 9 + 0.23 = 3.83 ms. The portion of time
occupied by seek and rotational delay is 3.6/3.83 = 0.94 = 94%.
5.27. (a) The rate of transfer to or from any one disk is 30 Mbytes per second.
The maximum memory transfer rate is 4/(10 × 10^-9) = 400 × 10^6 bytes/s, which
is 400 Mbytes per second. Therefore, 13 disks can be simultaneously flowing data
to/from the main memory.
(b) 8K/30M = 0.27 ms is needed to transfer 8K bytes to/from the disk. Seek and
rotational delays are 6 ms and 3 ms, respectively. Therefore, 8K/4 = 2K words are
transferred in 9.27 ms. But in 9.27 ms there are (9.27 × 10^-3)/(0.01 × 10^-6) =
927 × 10^3 memory (word) cycles available. Therefore, over a long period of time,
any one disk steals only (2/927) × 100 = 0.2% of the available memory cycles.
5.28. The sector size should influence the choice of page size, because the sector is
the smallest directly addressable block of data on the disk that is read or written as a
unit. Therefore, pages should be some small integral number of sectors in size.
5.29. The next record, j, to be accessed after a forward read of record i has just been
completed might be in the forward direction, with probability 0.5 (4 record distances
to the beginning of j), or might be in the backward direction, with probability 0.5
(6 record distances to the beginning of j, plus 2 direction reversals). The time to
scan over one record and an interrecord gap is

(1 s / 800 cm) × (1 cm / 2000 bits) × 4000 bits × 1000 ms/s + 3 ms = 2.5 + 3 = 5.5 ms

(each character occupies one frame across the tape width, so a 4000-character record
spans 4000 bit positions along each track).
12
Introduzione all'architettura dei calcolatori 2/ed - Carl Hamacher, Zvonko Vranesic, Safwat Zaky
Copyright © 2006 - The McGraw-Hill Companies srl
Therefore, average access and read time is
0.5(4 × 5.5) + 0.5(6 × 5.5 + 2 × 225) + 5.5 = 258 ms
If records can be read while moving in both directions, average access and read
time is
0.5(4 × 5.5) + 0.5(5 × 5.5 + 225) + 5.5 = 142.75 ms
Therefore, the average percentage gain is (258 − 142.75)/258 × 100 = 44.7%.
The major gain is because the records being read are relatively close together,
and one less direction reversal is needed.
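The timing estimate can be recomputed step by step (a sketch, not part of the original solution; it uses the one-character-per-frame assumption noted above):

```python
# Problem 5.29: 4000-character records at 2000 frames/cm, 800 cm/s tape
# speed, 3 ms minimum gap time, and 225 ms per direction reversal.
scan_ms = 4000 / 2000 / 800 * 1000 + 3   # record (2.5 ms) + gap (3 ms) = 5.5 ms
reverse_ms = 225

# Forward-only reads: 4 record distances forward, or 6 backward plus
# two reversals, each with probability 0.5, plus one record read.
fwd_only = 0.5 * (4 * scan_ms) + 0.5 * (6 * scan_ms + 2 * reverse_ms) + scan_ms

# Reading in both directions saves one record distance and one reversal
# in the backward case.
both_dirs = 0.5 * (4 * scan_ms) + 0.5 * (5 * scan_ms + reverse_ms) + scan_ms

gain = (fwd_only - both_dirs) / fwd_only * 100
print(fwd_only, both_dirs, round(gain, 1))   # 258.0 142.75 44.7
```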