
CSE 160 Final Exam SAMPLE
Winter 2017 (Kesden)

Name:
PID:
Email:
Cache Performance (Questions from 15-213 @ CMU. Thanks!)
1. This problem requires you to analyze the cache behavior of a function that sums the elements
of an array A:
int A[2][4];

int sum()
{
    int i, j, sum = 0;
    for (j = 0; j < 4; j++) {
        for (i = 0; i < 2; i++) {
            sum += A[i][j];
        }
    }
    return sum;
}
Assume the following:
• The memory system consists of registers, a single L1 cache, and main memory.
• The cache is cold when the function is called and the array has been initialized elsewhere.
• Variables i, j, and sum are all stored in registers.
• The array A is aligned in memory such that the first two array elements map to the same
cache block.
• sizeof(int) == 4.
• The cache is direct mapped, with a block size of 8 bytes.
A. Suppose that the cache consists of 2 sets. Fill out the table to indicate if the corresponding memory access in A will be a hit (h) or a miss (m).

A        Col 0    Col 1    Col 2    Col 3
Row 0    m
Row 1
B. What is the pattern of hits and misses if the cache consists of 4 sets instead of 2 sets?

A        Col 0    Col 1    Col 2    Col 3
Row 0    m
Row 1
2. This problem tests your understanding of cache organization and performance. Assume the
following:
• sizeof(int) = 4
• Array x begins at memory address 0.
• The cache is initially empty.
• The only memory accesses are to the entries of the array x; the variables i, j, and sum are stored in registers.
Consider the following C code:
int x[128];
int i, j;
int sum = 0;

for (i = 0; i < 64; i++) {
    j = i + 64;
    sum += x[i] * x[j];
}
Part A:
i. For this part only, additionally assume your cache is a 256-byte direct-mapped data cache with 8-byte cache blocks. What is the cache miss rate?
miss rate = ________ %
ii. If the cache were twice as big, what would be the miss rate?
miss rate = ________ %
Part B:
i. For this part only, additionally assume your cache is a 256-byte, 2-way set-associative cache using an LRU replacement policy, with 8-byte cache blocks. What is the cache miss rate?
miss rate = ________ %
ii. Will a larger cache size help to reduce the miss rate? Yes or No, and why?
iii. Will a larger cache line help to reduce the miss rate? Yes or No, and why?
Synchronization
3. Consider barriers, mutexes, and C++ atomic variables. Please explain and distinguish the purpose of each. In other words, for example, explain the situations in which each is uniquely better suited and why.
4. Consider the following code. Is the output deterministic? That is to say, will it produce the same
output each time it runs? If so, what is that output? And what ensures that it is deterministic? If the
output is not deterministic, why not?
Global code
int w;
atomic<int> x;
atomic<bool> ready;
Thread 1:
w = 41;
x = 42;
ready = true;
Thread 2:
int r1;
while (!ready) {}
r1 = x;
cout << r1 << w << x;
5. Consider the following code. Is the output deterministic? That is to say, will it produce the same
output each time it runs? If so, what is that output? And what ensures that it is deterministic? If the
output is not deterministic, why not?
Global code
int x;
atomic<int> w;
volatile bool ready;
Thread 1:
w = 41;
x = 42;
ready = true;
Thread 2:
int r1, r2;
while (!ready) {}
r1 = x;
r2 = w;
cout << r1 << r2 << w << x;
SIMD, MIMD, SSE
6. For what type of operations is MIMD support most helpful? Give an example.
7. Why should MIMD models/instructions be avoided in cases other than those you described above?
8. Consider three arrays, A, B, and C, each consisting of N ints, where N is guaranteed to be divisible by 2, and code that uses a for-loop to compute C = B / A, where the division is performed upon corresponding elements of the three arrays.
Please write code using vector operations that implements the loop described above. Declare variables as necessary. You may simply comment and assume the initialization of the arrays: there is no need to provide initialization functions, or to present your loop within a parameterized function, etc.
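For reference, one possible shape of an answer is sketched below. This is an illustrative sketch only, not an official solution: the function name vdiv and the fixed N are assumptions. Because SSE2 provides no packed integer division, the sketch widens pairs of ints to doubles, divides with _mm_div_pd, and truncates back, handling two elements per iteration to match the guarantee that N is divisible by 2.

#include <emmintrin.h>                /* SSE2 intrinsics */

#define N 16                          /* assumed size; N is guaranteed divisible by 2 */
int A[N], B[N], C[N];                 /* assume the arrays are initialized elsewhere */

void vdiv(void)                       /* hypothetical wrapper for the loop in question 8 */
{
    int i;
    for (i = 0; i < N; i += 2) {
        /* Load two ints each from B and A and widen them to doubles,
           since SSE2 has no packed integer divide. */
        __m128d b = _mm_cvtepi32_pd(_mm_loadl_epi64((__m128i *)&B[i]));
        __m128d a = _mm_cvtepi32_pd(_mm_loadl_epi64((__m128i *)&A[i]));

        /* C[i..i+1] = B[i..i+1] / A[i..i+1], truncated back to int. */
        __m128d c = _mm_div_pd(b, a);
        _mm_storel_epi64((__m128i *)&C[i], _mm_cvttpd_epi32(c));
    }
}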
9. Will the following loop be vectorized? Why or why not?
for (i = 0; i < n; i++) {
    a[i] = b[i] / c[i];
    if (a[i] > maxval) maxval = a[i];
    if (maxval > 1000.0) break;
}
Floating Point Numbers (Thanks, again, 15-213 @ CMU!)
10. Floating point encoding. Consider the following 5-bit floating point representation based on the IEEE floating
point format. This format does not have a sign bit – it can only represent nonnegative numbers.
• There are k = 3 exponent bits. The exponent bias is 3.
• There are n = 2 fraction bits.
Recall that numeric values are encoded as a value of the form V = M × 2^E, where E is the exponent after biasing, and M is the significand value. The fraction bits encode the significand value M using either a denormalized (exponent field 0) or a normalized representation (exponent field nonzero). The exponent E is given by E = 1 − Bias for denormalized values and E = e − Bias for normalized values, where e is the value of the exponent field exp interpreted as an unsigned number.
Below, you are given some decimal values, and your task is to encode them in floating point format. In addition, you should give the rounded value of the encoded floating point number. To get credit, you must give these as whole numbers (e.g., 17) or as fractions in reduced form (e.g., 3/4). Any rounding of the significand is based on round-to-even, which rounds an unrepresentable value that lies halfway between two representable values to the nearest even representable value.
Value     Floating Point Bits     Rounded value
9/32      001 00                  1/4
1
12
11
1/8
7/32
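For reference, the completed first row can be derived as follows (a worked illustration consistent with the format above, not part of the original table): 9/32 in binary is 1.001 × 2^-2, so the exponent field is e = E + Bias = -2 + 3 = 1, i.e., bits 001. The significand 1.001 must be rounded to two fraction bits; it lies exactly halfway between 1.00 and 1.01, so round-to-even selects 1.00 (fraction bits 00). The encoded number is therefore 1.00 × 2^-2 = 1/4, the rounded value shown.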
Computational Complexity
11. What is computational complexity? Why is it a helpful metric?
12. What characteristics of code affect its computational complexity?
13. What is the computational complexity of the following code? Show your work or explain your
intuition.
for (int i = 0; i < n; i++)
    C[i] = 5 * A[i] / B[i];
NUMA Architectures
14. Are snooping protocols used for common NUMA architectures? Why or why not?
15. Consider the NUMA directory protocol we described in class. Why are requests forwarded from the owner to the home rather than taking the more direct approach of simply redirecting the requestor to the home?
16. Why are the directory entries associated with each partition of the address space kept only with the
associated processor, rather than replicated across processors for faster access?
17. Why is it necessary for the cache to be inclusive, i.e., L3 contains everything in L2, which in turn contains everything in L1?
18. Why is it necessary for the directory entry to keep track of every user (sharer) of a block?
19. The block size associated with the NUMA directory is often the same as the block size used in another part of the memory system. Which part is that? And in what ways is this helpful?
20. Consider a NUMA system with 4 processors, P0, P1, P2, and P3, wherein the array A is initially
allocated on P2 and is confined within one block within P2’s address space.
A. In the space provided, please show the sharer bit set, dirty bit, and home, as appropriate. You
may not need to use all provided boxes.
i. After the initial allocation
ii. After a read by P3
iii. After a write by P0
B. Draw and label arrows representing any communication between or among processors associated
with each of the three steps above.
P0        P1        P2        P3
GPUs, CUDA
21. Which of the following operations are more efficient on GPUs than on conventional processors?
Why is this consistent with the use case for these processors?
a. Conditionals and short loops
b. Huge vector operations
c. Small vector operations
22. In class we said that, when compiling CUDA code, the compiler rearranges loads to hide latency.
What does this mean? Give an example.
23. Which is more costly: moving data from the host to the GPU, or moving data from the GPU's shared memory to a specific core? Why does this make sense?
SSE2 Cheat sheet