CS252 QUIZ #2: 4/18/01
D. A. Patterson

Last Name _______________________    First Name _____________________

Question   Name                    Time (minutes)   Max Points   Your Points
1          David vs. Goliath       30               14
2          That’s Out of Order!    30               16
3          Who Needs Compilers?    50               18
TOTAL                              110              48

CS252 - Quiz #2, Spring 2001
Your last name: ______________________
Question #1: David vs. Goliath (14 points) [30 minutes]
The Intel Pentium III and the Transmeta Crusoe both translate 80x86 instructions into a different
instruction set for execution.
a) (4 points) List the following characteristics of each of the internal instruction sets:

                   Registers                      Instruction
                   (approximate number, size)     (approximate size, style)
Pentium III        ~80, 32 bit INT                72 bit RISC
                   ~80, 80 bit FP
                   where 80 = 40 ROB + 40 RS
Transmeta Crusoe   64, 32 bit INT                 64/128 bit VLIW
                   32, 80 bit FP

b) (2 points) What is the role of interpretation in each machine?
Pentium III:
Micro code interpreter for 80x86 instructions that are too complicated to be translated into 4 or
fewer micro operations.
Transmeta Crusoe:
Full 80x86 interpreter for all basic blocks that have not yet been translated into VLIW instructions.
c) (2 points) What are the methods of translation in each machine?
Pentium III:
Hardware fetches and translates up to 3 x86 instructions per clock cycle into RISC ops, as long
as they take no more than 4 RISC ops.
Transmeta Crusoe:
Profiling picks hot spots, and then basic blocks are compiled directly into VLIW code. So those
instructions are no longer interpreted.
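
The Crusoe policy described above (interpret first, translate once a block is hot, then reuse the cached translation) can be sketched in Python. This is a toy model only: the class name `ToyTranslator`, the threshold value, and the block labels are assumptions made for illustration and do not describe Transmeta's actual software.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: interpreted runs before a block counts as hot


class ToyTranslator:
    """Toy model of interpret-first, translate-when-hot execution."""

    def __init__(self):
        self.exec_counts = Counter()   # per-block execution profile
        self.translation_cache = {}    # block id -> stand-in for compiled VLIW code
        self.log = []                  # (block id, how it was executed)

    def run_block(self, block_id):
        # Already translated: run the cached code, no interpretation needed.
        if block_id in self.translation_cache:
            self.log.append((block_id, "native"))
            return
        # Otherwise interpret the block and update its profile.
        self.exec_counts[block_id] += 1
        self.log.append((block_id, "interpreted"))
        # Once the block is hot, translate it so later runs skip the interpreter.
        if self.exec_counts[block_id] >= HOT_THRESHOLD:
            self.translation_cache[block_id] = f"vliw<{block_id}>"


translator = ToyTranslator()
for _ in range(5):
    translator.run_block("loop_body")
translator.run_block("cold_exit")
modes = [mode for _, mode in translator.log]
print(modes)
```

In this sketch the loop body is interpreted three times and then runs from the translation cache, while the block that executes only once is never translated.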
d) (2 points) In addition to performance and cost, an increasingly important consideration is
power. What is the impact on power of each approach? Why?
Pentium III:
Instruction cache use is small, since it holds only 80x86 instructions. However, the hardware for
fetch/decode/issue/translation runs on every execution, which burns power.
Transmeta Crusoe:
Instruction cache use is less efficient, because VLIW instructions take more space. However, 80x86
instructions are translated once and the translations are cached, which saves the power that
translation transistors would otherwise use on every execution.
e) (2 points) Which is a better match to multithreading? Why?
Expected:
The PIII mechanisms of ROB and RS make it easier to mix the out-of-order execution of multiple threads.
Other interesting answers:
- Keep compiler as a separate thread for Transmeta Crusoe
- Multiple threads might avoid extra translation.
f) (2 points) Suppose another company wanted to change a chip to use it for another instruction
set (e.g., Alpha), thereby leveraging the considerable investment in the chip so far. What are the
pros and cons of doing this for each chip? What (if anything) is likely to have to be changed for
each chip?
Pentium III:
Fetch/Decode/Issue mechanism
Transmeta Crusoe:
Register size/ addressing/TLB
Question #2: That’s Out of Order! (16 points) [30 minutes]
Using the MIPS code shown below, show the state of the Reservation stations, Reorder buffers,
and floating point (FP) register status for a speculative processor implementing Tomasulo’s
algorithm. Assume the following:
• Only one instruction can issue per cycle.
• The reorder buffer has 8 slots.
• The reorder buffer implements the functionality of the load buffers and store buffers.
• All function units are fully pipelined.
• There are 2 floating point multiply reservation stations.
• There are 3 floating point add reservation stations.
• There are 3 integer reservation stations, which also execute load and store instructions.
• No exceptions occur during the execution of this code.
• All integer operations require 1 execution cycle. Memory requests occur and complete in
  this cycle.
• All FP multiply operations require 4 execution cycles.
• All FP addition operations require 2 execution cycles.
• On a common data bus write conflict, the instruction issued earlier gets priority.
• Execution for a dependent instruction can begin on the cycle after its operand is broadcast on
  the common data bus.
• If any item changes from “Busy” to “Not Busy”, you should update the “Busy” column to
  reflect this, but you should not erase any other information in the row (unless another
  instruction then overwrites that information).
• Assume that all reservation stations, reorder buffers, and functional units were empty and not
  busy when the code shown below began execution.
• The “Value” column gets updated when the value is broadcast on the common data bus.
• Integer registers are not shown, and you do not have to show their state.
• For parts a) and b), fill in the new entry only when the entry value changes; leaving the
  new column blank means it is unchanged. Use a dash to indicate that the new entry value is empty.
  For the instruction column in the reorder buffer, use the empty entry for any new instruction.
a) (7 points) Assume the tables below show the old state at the end of the cycle in which ADDI
from the code below is issued. Modify the tables to show the new state at the end of the next clock
cycle. Assume the floating point instructions MULT.D F3, F1, F11 and MULT.D F4, F1, F10 are in
the first and second cycle of the execution stage, respectively.
(In case you mess up this version, there is an extra copy on the next page.)

      L.D     F0, 0(R1)
      MULT.D  F2, F0, F12
      ADD.D   F0, F2, F1
      MULT.D  F3, F1, F11
      MULT.D  F4, F1, F10
      ADDI    R3, R3, 1
      SUBI    R1, R1, 8

In the tables below, "old → new" marks an entry whose value changes, "→ new" marks a
previously empty entry that is filled in, and a single value is unchanged.

Reservation stations
Name    Busy     Op        Vj      Vk      Qj    Qk    ROB dest
Add1    Y → N    ADD.D     F2      F1                  #3
Add2
Add3
Mult1   Y        MULT.D    F1      F10                 #5
Mult2   Y        MULT.D    F1      F11                 #4
Int1    N        L.D       F0      R1                  #1
Int2    Y        ADDI      R3      1                   #6
Int3    → Y      → SUBI    → R1    → 8                 → #7

Reorder buffer
Entry   Busy     Instruction             State              Destination   Value
1       N        L.D F0, 0(R1)           Commit             F0            Mem[0(R1)]
2       N        MULT.D F2, F0, F12      Commit             F2            F0*F12
3       Y → N    ADD.D F0, F2, F1        Write → Commit     F0            F2+F1
4       Y        MULT.D F3, F1, F11      Execute            F3
5       Y        MULT.D F4, F1, F10      Execute            F4
6       Y        ADDI R3, R3, 1          Issue → Execute    R3
7       N → Y    → SUBI R1, R1, 8        → Issue            → R1
8       N

FP register status
Field       F0       F1    F2    F3    F4    F5    F6    …
Reorder #   3              2     4     5
Busy        Y → N    N     N     Y     Y     N     N     N
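
The Reorder # row of the FP register status table follows from issue order alone. A minimal sketch in Python, assuming a simplified model in which every instruction takes the next reorder buffer entry and nothing has committed yet; the function name `issue_in_order` and the tuple layout are illustrative and not part of the quiz.

```python
def issue_in_order(program):
    """Assign ROB entries in program order and track which entry writes each register."""
    reg_status = {}  # register -> ROB entry of the latest instruction that writes it
    rob = []
    for entry, (op, dest, srcs) in enumerate(program, start=1):
        # A source waits on a ROB tag if an earlier instruction owns that register;
        # None means the value is read directly from the register file.
        waits_on = {s: reg_status.get(s) for s in srcs}
        rob.append({"entry": entry, "op": op, "dest": dest, "waits_on": waits_on})
        # Later readers of dest must now wait for this entry.
        reg_status[dest] = entry
    return rob, reg_status


# The part a) code as (opcode, destination, register sources).
program = [
    ("L.D",    "F0", ["R1"]),
    ("MULT.D", "F2", ["F0", "F12"]),
    ("ADD.D",  "F0", ["F2", "F1"]),
    ("MULT.D", "F3", ["F1", "F11"]),
    ("MULT.D", "F4", ["F1", "F10"]),
    ("ADDI",   "R3", ["R3"]),
    ("SUBI",   "R1", ["R1"]),
]

rob, reg_status = issue_in_order(program)
print(reg_status)
```

For the part a) code this gives F0 → 3, F2 → 2, F3 → 4 and F4 → 5, the same tags as the Reorder # row. The Busy bits additionally depend on which entries have committed, which this sketch does not model.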
This is only a redundant copy of part a) in case you need it. Tell us which one is your answer!
[Blank copy of the part a) code and tables, showing the old state only.]
b) (9 points) The tables below, for a different program, show the state at the end of the cycle in
which the S.D from the code below is issued. Modify the tables to show the state at the end of
the next three clock cycles. Assume the floating point instructions MULT.D F2, F1, F11 and
MULT.D F0, F0, F10 are at the end of the fourth and third cycle of the execution stage,
respectively. (There is an extra copy on the next page.)

      L.D     F0, 0(R1)
      MULT.D  F2, F1, F12
      ADD.D   F0, F2, F0
      MULT.D  F2, F1, F11
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)
      ADDI    R1, R1, 8
      SUBI    R2, R2, 1

In the tables below, "old → new" marks an entry whose value changes by the end of the third
clock cycle, "→ new" marks a previously empty entry that is filled in, and a single value is
unchanged.

Reservation stations
Name    Busy     Op            Vj         Vk       Qj        Qk        ROB dest
Add1    N        ADD.D         F2         F0                           #3
Add2    Y        ADD.D         → F0       → F2     #5 → -    #4 → -    #6
Add3
Mult1   Y → N    MULT.D        F0         F10                          #5
Mult2   Y → N    MULT.D        F1         F11                          #4
Int1    N → Y    L.D → SUBI    R1 → R2    0 → 1                        #1
Int2    Y        S.D           R1         0                  #6        #7
Int3    → Y      → ADDI        → R1       → 8                          → #8

Reorder buffer
Entry   Busy     Instruction                        State               Destination   Value
1       N → Y    L.D F0, 0(R1) → SUBI R2, R2, 1     Commit → Execute    F0 → R2       Mem[0(R1)]
2       N        MULT.D F2, F1, F12                 Commit              F2            F1*F12
3       N        ADD.D F0, F2, F0                   Commit              F0            F2+F0
4       Y → N    MULT.D F2, F1, F11                 Execute → Commit    F2            → F1*F11
5       Y → N    MULT.D F0, F0, F10                 Execute → Commit    F0            → F0*F10
6       Y        ADD.D F0, F0, F2                   Issue → Execute     F0
7       Y        S.D F0, 0(R1)                      Issue
8       N → Y    → ADDI R1, R1, 8                   → Write             → R1          → R1+8

FP register status
Field       F0    F1    F2       F3    F10   F11   F12   …
Reorder #   6           4
Busy        Y     N     Y → N    N     N     N     N     N
This is only a redundant copy of part b) in case you need it. Tell us which one is your
answer!
[Blank copy of the part b) code and tables, showing the old state only.]
Question #3: Who Needs Compilers? (18 points) [50 minutes]
In the following problem, use a simple pipelined RISC architecture with a single branch delay
cycle. The architecture has pipelined functional units with the following execution cycles:
1. Floating point op: 3 cycles (7 stages total)
2. Integer op: 1 cycle (5 stages total)
The following table shows the minimum number of intervening cycles between the producer and
consumer instructions to avoid stalls. Assume 0 intervening cycles for combinations not listed.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          2
FP ALU op                      Store and move double      2
Load double                    FP ALU op                  1
Load double                    Store double               0

The following code computes a 3-tap filter. R1 contains address of the next input to the filter,
and the output overwrites the input for the iteration. R2 contains the loop counter. The tap
values are contained in F10, F11, and F12.

LOOP: L.D     F0, 0(R1)     #load the filter input for the iteration
      MULT.D  F2, F1, F12   #multiply elements
      ADD.D   F0, F2, F0    #add elements
      MULT.D  F2, F1, F11
      MOV.D   F1, F0        #move value in F0 to F1
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)     #store the result
      ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
      BNEZ    R2, LOOP      #continue till all inputs are processed
      SUBI    R2, R2, 1     #decrement element count

a) (4 points) How many cycles does the current code take for each iteration?
____18______ cycles

Cycle
1      LOOP: L.D     F0, 0(R1)     #load the filter input for the iteration
2            MULT.D  F2, F1, F12   #multiply elements
3-4          Stall 2               #ADD.D consumes MULT.D
5            ADD.D   F0, F2, F0    #add elements
6            MULT.D  F2, F1, F11
7            Stall 1               #MOV.D consumes ADD.D
8            MOV.D   F1, F0        #move value in F0 to F1
9            MULT.D  F0, F0, F10
10-11        Stall 2               #ADD.D consumes MULT.D
12           ADD.D   F0, F0, F2
13-14        Stall 2               #S.D consumes ADD.D
15           S.D     F0, 0(R1)     #store the result
16           ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
17           BNEZ    R2, LOOP      #continue till all inputs are processed
18           SUBI    R2, R2, 1     #decrement element count
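
The cycle count for part a) can be cross-checked with a small Python sketch. It assumes a single-issue, in-order pipeline in which only read-after-write dependences within one iteration cause stalls; the helper names `latency` and `loop_cycles` and the tuple encoding of the instructions are illustrative, not part of the quiz.

```python
FP_ALU = {"MULT.D", "ADD.D"}


def latency(producer, consumer):
    """Intervening cycles required between producer and consumer (quiz latency table)."""
    if producer in FP_ALU and consumer in FP_ALU | {"S.D", "MOV.D"}:
        return 2
    if producer == "L.D" and consumer in FP_ALU:
        return 1
    return 0  # all combinations not listed in the table


def loop_cycles(code):
    """Return the issue cycle of the last instruction of one iteration."""
    last_writer = {}  # register -> (opcode, issue cycle) of its latest producer
    cycle = 0
    for op, dest, srcs in code:
        earliest = cycle + 1  # in-order issue: at best one cycle after the previous one
        for src in srcs:
            if src in last_writer:
                prod_op, prod_cycle = last_writer[src]
                earliest = max(earliest, prod_cycle + latency(prod_op, op) + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (op, cycle)
    return cycle


# Original loop body as (opcode, destination, register sources).
original_loop = [
    ("L.D",    "F0", ["R1"]),
    ("MULT.D", "F2", ["F1", "F12"]),
    ("ADD.D",  "F0", ["F2", "F0"]),
    ("MULT.D", "F2", ["F1", "F11"]),
    ("MOV.D",  "F1", ["F0"]),
    ("MULT.D", "F0", ["F0", "F10"]),
    ("ADD.D",  "F0", ["F0", "F2"]),
    ("S.D",    None, ["F0", "R1"]),
    ("ADDI",   "R1", ["R1"]),
    ("BNEZ",   None, ["R2"]),
    ("SUBI",   "R2", ["R2"]),
]

print(loop_cycles(original_loop))
```

Under these assumptions the sketch reproduces the 18 cycles of the hand schedule, and the same function can be applied to any reordering of the list.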
b) (4 points) Rearrange the code without unrolling to achieve 2 fewer cycles per iteration. You can
reorder and drop any line of code, but do not change any line of code. To save writing, just
draw arrows in the copy of the code below to show any code movement. Show the
execution clock cycle number next to each code line. Assume initialization can be adjusted.

Cycle
1      LOOP: MULT.D  F2, F1, F12   #multiply elements
2            L.D     F0, 0(R1)     #load the filter input for the iteration
3            ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
4            ADD.D   F0, F2, F0    #add elements
5            MULT.D  F2, F1, F11
6            Stall 1               #MOV.D consumes ADD.D
7            MOV.D   F1, F0        #move value in F0 to F1
8            MULT.D  F0, F0, F10
9-10         Stall 2               #ADD.D consumes MULT.D
11           ADD.D   F0, F0, F2
12-13        Stall 2               #S.D consumes ADD.D
14           S.D     F0, 0(R1)     #store the result
15           BNEZ    R2, LOOP      #continue till all inputs are processed
16           SUBI    R2, R2, 1     #decrement element count

____16_____ cycles
If initialization can’t be changed, then ADDI R1, R1, 8 can’t be moved. This gives 17 cycles.
c) (2 points) Can the original code be optimized with loop unrolling and software pipelining to
avoid stalls in the loop due to data dependencies? Why or why not?
Each iteration is dependent on the previous one through register F1. This makes loop
unrolling and software pipelining much trickier.
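
The loop-carried dependence through F1 is easier to see when the loop body is written as arithmetic. The Python sketch below mirrors the instruction sequence of the original loop; the function names, the argument names `c10`, `c11`, `c12` (standing for the tap values in F10, F11, F12), and the initial value of F1 are assumptions made for illustration.

```python
def filter_with_feedback(xs, c10, c11, c12, f1=0.0):
    """Original loop: MOV.D copies F0 into F1, so F1 carries state across iterations."""
    out = []
    for x in xs:
        f0 = x + f1 * c12          # L.D, MULT.D F2,F1,F12, ADD.D F0,F2,F0
        f2 = f1 * c11              # MULT.D F2,F1,F11
        f1 = f0                    # MOV.D F1,F0: read by the *next* iteration
        out.append(f0 * c10 + f2)  # MULT.D F0,F0,F10, ADD.D F0,F0,F2, S.D
    return out


def filter_without_feedback(xs, c10, c11, c12, f1=0.0):
    """Loop with MOV.D removed: F1 never changes, so iterations are independent."""
    return [(x + f1 * c12) * c10 + f1 * c11 for x in xs]


print(filter_with_feedback([1.0, 2.0, 3.0], 1.0, 0.0, 1.0))
print(filter_without_feedback([1.0, 2.0, 3.0], 1.0, 0.0, 1.0))
```

With these example tap values the first version accumulates a running sum, so each output depends on every earlier input, while the second version maps each input to its own output and can be unrolled freely.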
Suppose the original code is modified to the following; the MOV.D instruction was removed.

LOOP: L.D     F0, 0(R1)     #load the filter input for the iteration
      MULT.D  F2, F1, F12   #multiply elements
      ADD.D   F0, F2, F0    #add elements
      MULT.D  F2, F1, F11
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)     #store the result
      ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
      BNEZ    R2, LOOP      #continue till all inputs are processed
      SUBI    R2, R2, 1     #decrement element count

d) (2 points) Unroll the original loop twice (so that it contains 3 iterations) and schedule it to
avoid stalls. Assume the second iteration has F0 renamed to F3, F1 renamed to F4, and F2 renamed
to F5. Assume the third iteration has F0 renamed to F6, F1 renamed to F7, and F2 renamed to F8.
Write the code on the next page. Write the reference number when writing any instruction listed
below. If you need to use any instruction not listed below, write out the instruction explicitly.

Iteration 1
1   LOOP: L.D     F0, 0(R1)     #load the filter input for the iteration
2         MULT.D  F2, F1, F12   #multiply elements
3         ADD.D   F0, F2, F0    #add elements
4         MULT.D  F2, F1, F11
5         MULT.D  F0, F0, F10
6         ADD.D   F0, F0, F2
7         S.D     F0, 0(R1)     #store the result
8         ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
9         BNEZ    R2, LOOP      #continue till all inputs are processed
10        SUBI    R2, R2, 1     #decrement element count

Iteration 2
11        L.D     F3, 0(R1)     #load the filter input for the iteration
12        MULT.D  F5, F4, F12   #multiply elements
13        ADD.D   F3, F5, F3    #add elements
14        MULT.D  F5, F4, F11
15        MULT.D  F3, F3, F10
16        ADD.D   F3, F3, F5
17        S.D     F3, 0(R1)     #store the result
18        ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
19        BNEZ    R2, LOOP      #continue till all inputs are processed
20        SUBI    R2, R2, 1     #decrement element count

Iteration 3
21        L.D     F6, 0(R1)     #load the filter input for the iteration
22        MULT.D  F8, F7, F12   #multiply elements
23        ADD.D   F6, F8, F6    #add elements
24        MULT.D  F8, F7, F11
25        MULT.D  F6, F6, F10
26        ADD.D   F6, F6, F8
27        S.D     F6, 0(R1)     #store the result
28        ADDI    R1, R1, 8     #increment pointer, 8 bytes per DW
29        BNEZ    R2, LOOP      #continue till all inputs are processed
30        SUBI    R2, R2, 1     #decrement element count

d) (continued) To save writing, just write the instruction number in the table below if
instruction can be used as is from the prior page. If it’s not there, write out the new instruction. If
you need fewer instructions than you have space below in the table, just leave the rest blank.

Number (if instruction unchanged)   Instruction (if not on prior page)
1
                                    L.D F3, 8(R1)
                                    L.D F6, 16(R1)
2
12
22
3
13
23
4
14
24
5
15
25
6
16
26
7
                                    S.D F3, 8(R1)
                                    S.D F6, 16(R1)
                                    ADDI R1, R1, 24
9
                                    SUBI R2, R2, 3

e) (2 points) What is the effective number of cycles per iteration for the unrolled loop, where an
iteration refers to an iteration of the original code?
_______8_____ cycles
24 instructions with zero stalls, so 24/3 = 8 cycles per iteration.
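
The zero-stall claim can be checked by building the unrolled schedule programmatically. The Python sketch below is a simplified model: it renames F0/F1/F2 per iteration, interleaves the three bodies step by step, and counts stalls from read-after-write dependences only. Memory offsets such as 8(R1) are omitted because they do not affect timing, and the helper names `gap`, `body`, and `count_stalls` are illustrative.

```python
FP_ALU = {"MULT.D", "ADD.D"}


def gap(producer, consumer):
    """Intervening cycles required by the quiz's latency table."""
    if producer in FP_ALU and consumer in FP_ALU | {"S.D"}:
        return 2
    if producer == "L.D" and consumer in FP_ALU:
        return 1
    return 0


def body(k):
    """Iteration k of the MOV.D-free loop, with F0/F1/F2 renamed to F(3k)..F(3k+2)."""
    f0, f1, f2 = f"F{3 * k}", f"F{3 * k + 1}", f"F{3 * k + 2}"
    return [
        ("L.D",    f0,   ["R1"]),
        ("MULT.D", f2,   [f1, "F12"]),
        ("ADD.D",  f0,   [f2, f0]),
        ("MULT.D", f2,   [f1, "F11"]),
        ("MULT.D", f0,   [f0, "F10"]),
        ("ADD.D",  f0,   [f0, f2]),
        ("S.D",    None, [f0, "R1"]),
    ]


iterations = [body(k) for k in range(3)]
# Interleave: step 0 of all three iterations, then step 1, and so on.
schedule = [it[step] for step in range(7) for it in iterations]
# One pointer update, one branch, one counter update for all three iterations.
schedule += [("ADDI", "R1", ["R1"]), ("BNEZ", None, ["R2"]), ("SUBI", "R2", ["R2"])]


def count_stalls(code):
    """Return (total cycles, stall cycles) for single-issue in-order execution."""
    last = {}  # register -> (opcode, issue cycle) of its latest producer
    cycle = stalls = 0
    for op, dest, srcs in code:
        ready = max([cycle + 1] + [last[s][1] + gap(last[s][0], op) + 1
                                   for s in srcs if s in last])
        stalls += ready - (cycle + 1)
        cycle = ready
        if dest is not None:
            last[dest] = (op, cycle)
    return cycle, stalls


total, stalls = count_stalls(schedule)
print(total, stalls, total / 3)
```

Under these assumptions the interleaved schedule has 24 instructions and no stall cycles, which gives the 8 cycles per original iteration stated above.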
f) (4 points) For DSP processors, special instructions are provided to speed up DSP applications
such as an n-tap filter. Suppose the following instructions are provided in addition. How can
one use them to speed up the original 3-tap filter code at the beginning of the question? Write
out the new code below, starting with the version in part a). How many cycles do the DSP
instructions save? How does this compare to your answer to part b)?
LP RX, LABEL
Zero-overhead loop that repeats the segment the number of times specified in
register RX. This eliminates the branch delay.
LT FX, RY
Auto increment. Loads MEM(RY) into register FX, then increments the base register RY
to the next element.

LOOP: L.T     F0, 0(R1)     #load the filter input for the iteration
      MULT.D  F2, F1, F12   #multiply elements
      ADD.D   F0, F2, F0    #add elements
      MULT.D  F2, F1, F11
      MOV.D   F1, F0        #move value in F0 to F1
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)     #store the result
      LP      R2, LOOP      #continue till all inputs are processed

This gets rid of the pointer adjustment and counter adjustment instructions, which saves two
cycles, the same as in part b). If one assumes L.T F0, 0(R1) is executed only once at the entrance
of the loop, then this saves more. If LT is auto-decrement, a similar result can be achieved by
modifying the memory contents accordingly.