CS252 QUIZ #2: 4/18/01
D. A. Patterson

Last Name _______________________    First Name _____________________

Question   Name                     Time (minutes)   Max Points   Your Points
1          David vs. Goliath        30               14
2          That’s Out of Order!     30               16
3          Who Needs Compilers?     50               18
TOTAL                               110              48

CS252 - Quiz #2, Spring 2001    Your last name: ______________________

Question #1: David vs. Goliath (14 points) [30 minutes]

The Intel Pentium III and the Transmeta Crusoe both translate 80x86 instructions into a different instruction set for execution.

a) (4 points) List the following characteristics of each of the internal instruction sets:

                    Registers (approximate number, size)    Instruction (approximate size, style)
Pentium III         ~80, 32-bit INT; ~80, 80-bit FP         72-bit RISC
                    (where 80 = 40 ROB + 40 RS)
Transmeta Crusoe    64, 32-bit INT; 32, 80-bit FP           64/128-bit VLIW

b) (2 points) What is the role of interpretation in each machine?

Pentium III: Microcode interpreter for 80x86 instructions that are too complicated to be translated into 4 or fewer micro-operations.

Transmeta Crusoe: Full 80x86 interpreter for all basic blocks that have not yet been translated into VLIW instructions.

c) (2 points) What are the methods of translation in each machine?

Pentium III: Hardware fetches and translates up to 3 x86 instructions per clock cycle into RISC ops, as long as each takes no more than 4 RISC ops.

Transmeta Crusoe: Profiling picks hot spots, and then those basic blocks are compiled directly into VLIW code, so those instructions are no longer interpreted.

d) (2 points) In addition to performance and cost, an increasingly important consideration is power. What is the impact on power of each approach? Why?

Pentium III: The instruction cache is used efficiently because it holds only compact 80x86 instructions, but the hardware for fetch/decode/issue/translation runs on every execution and burns power.

Transmeta Crusoe: The instruction cache is used less efficiently because VLIW instructions take more space.
80x86 instructions are translated once and cached. This saves the power that translation transistors would otherwise burn on every execution.

e) (2 points) Which is a better match to multithreading? Why?

Expected: Pentium III. Its ROB and RS mechanisms make it faster to mix the out-of-order execution of multiple threads.

Other interesting answers:
- Keep the compiler as a separate thread on the Transmeta Crusoe.
- Multiple threads might avoid extra translation.

f) (2 points) Suppose another company wanted to change a chip to use it for another instruction set (e.g., Alpha), thereby leveraging the considerable investment in the chip so far. What are the pros and cons of doing this for each chip? What (if anything) is likely to have to be changed for each chip?

Pentium III: Fetch/decode/issue mechanism.

Transmeta Crusoe: Register size, addressing, TLB.

Question 2: That’s Out of Order! (16 points) [30 minutes]

Using the MIPS code shown below, show the state of the reservation stations, reorder buffer, and floating point (FP) register status for a speculative processor implementing Tomasulo’s algorithm. Assume the following:

• Only one instruction can issue per cycle.
• The reorder buffer has 8 slots.
• The reorder buffer implements the functionality of the load buffers and store buffers.
• All functional units are fully pipelined.
• There are 2 floating point multiply reservation stations.
• There are 3 floating point add reservation stations.
• There are 3 integer reservation stations, which also execute load and store instructions.
• No exceptions occur during the execution of this code.
• All integer operations require 1 execution cycle. Memory requests occur and complete in this cycle.
• All FP multiply operations require 4 execution cycles.
• All FP addition operations require 2 execution cycles.
• On a common data bus write conflict, the instruction issued earlier gets priority.
• Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the common data bus.
• If any item changes from “Busy” to “Not Busy”, you should update the “Busy” column to reflect this, but you should not erase any other information in the row (unless another instruction then overwrites that information).
• Assume that all reservation stations, reorder buffers, and functional units were empty and not busy when the code shown below began execution.
• The “Value” column gets updated when the value is broadcast on the common data bus.
• Integer registers are not shown, and you do not have to show their state.

For parts a) and b), fill in the new entry only when the entry value changes; leaving the new column blank means it is unchanged. Use a dash to indicate that the new entry value is empty. For the instruction column in the reorder buffer, use the empty entry for any new instruction.

a) (7 points) Assume the tables below show the old state at the end of the cycle in which ADDI from the code below is issued. Modify the tables to show the new state at the end of the next clock cycle. Assume the floating point instructions MULT.D F3, F1, F11 and MULT.D F4, F1, F10 are at the first and second cycle of the execution stage, respectively.

      L.D     F0, 0(R1)
      MULT.D  F2, F0, F12
      ADD.D   F0, F2, F1
      MULT.D  F3, F1, F11
      MULT.D  F4, F1, F10
      ADDI    R3, R3, 1
      SUBI    R1, R1, 8

[Tables for part a), each with "old" and "new" rows:
 Reservation stations: Name (Add1, Add2, Add3, Mult1, Mult2, Int1, Int2, Int3), Busy, Op, Vj, Vk, Qj, Qk, ROB dest
 Reorder buffer: Entry (1 to 8), Busy, Instruction, State, Destination, Value
 FP register status: Field (Reorder #, Busy) for F0, F1, F2, F3, F4, F5, F6, …]

b) (9 points) The tables below, for a different program, show the state at the end of the cycle in which the S.D from the code below is issued. Modify the tables to show the state at the end of the next three clock cycles. Assume the floating point instructions MULT.D F2, F1, F11 and MULT.D F0, F0, F10 are at the end of the fourth and third cycle of the execution stage, respectively.
      L.D     F0, 0(R1)
      MULT.D  F2, F1, F12
      ADD.D   F0, F2, F0
      MULT.D  F2, F1, F11
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)
      ADDI    R1, R1, 8
      SUBI    R2, R2, 1

[Tables for part b), each with "old" and "new" rows:
 Reservation stations: Name (Add1, Add2, Add3, Mult1, Mult2, Int1, Int2, Int3), Busy, Op, Vj, Vk, Qj, Qk, ROB dest
 Reorder buffer: Entry (1 to 8), Busy, Instruction, State, Destination, Value
 FP register status: Field (Reorder #, Busy) for F0, F1, F2, F3, F10, F11, F12, …]

Question 3: Who Needs Compilers? (18 points) [50 minutes]

In the following problem, use a simple pipelined RISC architecture with a single branch delay cycle. The architecture has pipelined functional units with the following execution cycles:

1. Floating point op: 3 cycles (7 stages total)
2. Integer op: 1 cycle (5 stages total)

The following table shows the minimum number of intervening cycles between the producer and consumer instructions to avoid stalls. Assume 0 intervening cycles for combinations not listed.

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           2
FP ALU op                       Store and move double       2
Load double                     FP ALU op                   1
Load double                     Store double                0

The following code computes a 3-tap filter. R1 contains the address of the next input to the filter, and the output overwrites the input for the iteration. R2 contains the loop counter. The tap values are contained in F10, F11, and F12.

LOOP: L.D     F0, 0(R1)      #load the filter input for the iteration
      MULT.D  F2, F1, F12    #multiply elements
      ADD.D   F0, F2, F0     #add elements
      MULT.D  F2, F1, F11
      MOV.D   F1, F0         #move value in F0 to F1
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)      #store the result
      ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
      BNEZ    R2, LOOP       #continue till all inputs are processed
      SUBI    R2, R2, 1      #decrement element count

a) (4 points) How many cycles does the current code take for each iteration?
____18______ cycles

Cycle
1      LOOP: L.D     F0, 0(R1)      #load the filter input for the iteration
2            MULT.D  F2, F1, F12    #multiply elements
3-4          Stall 2                #ADD.D consumes MULT.D
5            ADD.D   F0, F2, F0     #add elements
6            MULT.D  F2, F1, F11
7            Stall 1                #MOV.D consumes ADD.D
8            MOV.D   F1, F0         #move value in F0 to F1
9            MULT.D  F0, F0, F10
10-11        Stall 2                #ADD.D consumes MULT.D
12           ADD.D   F0, F0, F2
13-14        Stall 2                #S.D consumes ADD.D
15           S.D     F0, 0(R1)      #store the result
16           ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
17           BNEZ    R2, LOOP       #continue till all inputs are processed
18           SUBI    R2, R2, 1      #decrement element count

b) (4 points) Rearrange the code without unrolling to achieve 2 fewer cycles per iteration. You can reorder and drop any line of code, but do not change any line of code. To save writing, just draw arrows in the copy of the code below to show any code movement. Show the execution clock cycle number next to each code line. Assume initialization can be adjusted.

Cycle
1      LOOP: MULT.D  F2, F1, F12    #multiply elements
2            L.D     F0, 0(R1)      #load the filter input for the iteration
3            ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
4            ADD.D   F0, F2, F0     #add elements
5            MULT.D  F2, F1, F11
6            Stall 1                #MOV.D consumes ADD.D
7            MOV.D   F1, F0         #move value in F0 to F1
8            MULT.D  F0, F0, F10
9-10         Stall 2                #ADD.D consumes MULT.D
11           ADD.D   F0, F0, F2
12-13        Stall 2                #S.D consumes ADD.D
14           S.D     F0, 0(R1)      #store the result
15           BNEZ    R2, LOOP       #continue till all inputs are processed
16           SUBI    R2, R2, 1      #decrement element count

____16_____ cycles

If initialization can’t be changed, then ADDI R1, R1, 8 can’t be moved. This gives 17 cycles.

c) (2 points) Can the original code be optimized with loop unrolling and software pipelining to avoid stalls in the loop due to data dependencies? Why or why not?

Each iteration depends on the previous one through register F1: MOV.D F1, F0 saves a value that the next iteration's MULT.D instructions read. This makes loop unrolling and software pipelining much trickier.
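The cycle counts in parts a) and b) can be cross-checked with a small model. The following is a minimal sketch, assuming a single-issue, in-order pipeline in which an instruction waits only for the latency-table gap after the most recent writer of each source register. The names LATENCY, cycles_per_iteration, ORIGINAL, and REARRANGED are illustrative and not part of the quiz, and values carried in from a previous iteration are assumed to be ready.

```python
# Minimum intervening cycles between producer and consumer classes, taken
# from the latency table in Question 3. Unlisted pairs are treated as 0.
LATENCY = {
    ("fp", "fp"): 2,       # FP ALU op -> another FP ALU op
    ("fp", "store"): 2,    # FP ALU op -> store double
    ("fp", "move"): 2,     # FP ALU op -> move double
    ("load", "fp"): 1,     # load double -> FP ALU op
    ("load", "store"): 0,  # load double -> store double
}


def cycles_per_iteration(code):
    """Return the cycle in which the last instruction of `code` issues.

    `code` is a list of (kind, dest, sources) tuples in program order.
    Assumption: one instruction issues per cycle, in order, and stalls
    only on true (read-after-write) dependences within the iteration.
    """
    last_writer = {}  # register -> (issue cycle, kind) of latest producer
    cycle = 0
    for kind, dest, sources in code:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                issued, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + gap + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


ORIGINAL = [
    ("load", "F0", ["R1"]),         # L.D    F0, 0(R1)
    ("fp", "F2", ["F1", "F12"]),    # MULT.D F2, F1, F12
    ("fp", "F0", ["F2", "F0"]),     # ADD.D  F0, F2, F0
    ("fp", "F2", ["F1", "F11"]),    # MULT.D F2, F1, F11
    ("move", "F1", ["F0"]),         # MOV.D  F1, F0
    ("fp", "F0", ["F0", "F10"]),    # MULT.D F0, F0, F10
    ("fp", "F0", ["F0", "F2"]),     # ADD.D  F0, F0, F2
    ("store", None, ["F0", "R1"]),  # S.D    F0, 0(R1)
    ("int", "R1", ["R1"]),          # ADDI   R1, R1, 8
    ("branch", None, ["R2"]),       # BNEZ   R2, LOOP
    ("int", "R2", ["R2"]),          # SUBI   R2, R2, 1 (branch delay slot)
]

# Part b): MULT.D hoisted above L.D, and ADDI moved into the load delay.
REARRANGED = [ORIGINAL[1], ORIGINAL[0], ORIGINAL[8]] + ORIGINAL[2:8] + ORIGINAL[9:]

print(cycles_per_iteration(ORIGINAL))    # 18
print(cycles_per_iteration(REARRANGED))  # 16
```

Leaving ADDI in its original position while still hoisting MULT.D gives 17 in this model, which agrees with the note under part b).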
Suppose the original code is modified to the following, with the MOV.D instruction removed.

LOOP: L.D     F0, 0(R1)      #load the filter input for the iteration
      MULT.D  F2, F1, F12    #multiply elements
      ADD.D   F0, F2, F0     #add elements
      MULT.D  F2, F1, F11
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)      #store the result
      ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
      BNEZ    R2, LOOP       #continue till all inputs are processed
      SUBI    R2, R2, 1      #decrement element count

d) (2 points) Unroll this loop twice (so that it contains 3 iterations) and schedule it to avoid stalls. Assume the second iteration has F0 renamed to F3, F1 renamed to F4, and F2 renamed to F5. Assume the third iteration has F0 renamed to F6, F1 renamed to F7, and F2 renamed to F8. Write the code in the table that follows the numbered listing. Refer to any instruction listed below by its number. If you need to use any instruction not listed below, write out the instruction explicitly.
Iteration 1
1   LOOP: L.D     F0, 0(R1)      #load the filter input for the iteration
2         MULT.D  F2, F1, F12    #multiply elements
3         ADD.D   F0, F2, F0     #add elements
4         MULT.D  F2, F1, F11
5         MULT.D  F0, F0, F10
6         ADD.D   F0, F0, F2
7         S.D     F0, 0(R1)      #store the result
8         ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
9         BNEZ    R2, LOOP       #continue till all inputs are processed
10        SUBI    R2, R2, 1      #decrement element count

Iteration 2
11        L.D     F3, 0(R1)      #load the filter input for the iteration
12        MULT.D  F5, F4, F12    #multiply elements
13        ADD.D   F3, F5, F3     #add elements
14        MULT.D  F5, F4, F11
15        MULT.D  F3, F3, F10
16        ADD.D   F3, F3, F5
17        S.D     F3, 0(R1)      #store the result
18        ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
19        BNEZ    R2, LOOP       #continue till all inputs are processed
20        SUBI    R2, R2, 1      #decrement element count

Iteration 3
21        L.D     F6, 0(R1)      #load the filter input for the iteration
22        MULT.D  F8, F7, F12    #multiply elements
23        ADD.D   F6, F8, F6     #add elements
24        MULT.D  F8, F7, F11
25        MULT.D  F6, F6, F10
26        ADD.D   F6, F6, F8
27        S.D     F6, 0(R1)      #store the result
28        ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
29        BNEZ    R2, LOOP       #continue till all inputs are processed
30        SUBI    R2, R2, 1      #decrement element count

d) (continued) To save writing, just write the instruction number in the table below if the instruction can be used as is from the listing above. If it is not there, write out the new instruction. If you need fewer instructions than there is space for in the table, just leave the rest blank.

Number (if instruction unchanged)    Instruction (if not in the listing above)
1
                                     L.D   F3, 8(R1)
                                     L.D   F6, 16(R1)
2
12
22
3
13
23
4
14
24
5
15
25
6
16
26
7
                                     S.D   F3, 8(R1)
                                     S.D   F6, 16(R1)
                                     ADDI  R1, R1, 24
9
                                     SUBI  R2, R2, 3

e) (2 points) What is the effective number of cycles per iteration for the unrolled loop, where an iteration means one iteration of the original code?

_______8_____ cycles

The unrolled loop has 24 instructions with zero stalls, so 24/3 = 8 cycles per iteration.
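The schedule in part d) follows a regular pattern: each instruction of the loop body is emitted three times in a row, once per renamed copy, so that two independent instructions always separate a producer from its consumer. Below is a minimal sketch of that pattern, assuming the same single-issue, in-order stall model as the latency table. The helper names rename, unrolled_schedule, and total_cycles are illustrative and not from the quiz, and the adjusted displacements (8 and 16) are not modeled because they do not affect timing.

```python
# Latency-table entries needed for this loop body (unlisted pairs are 0).
LATENCY = {("fp", "fp"): 2, ("fp", "store"): 2, ("load", "fp"): 1}

# Loop body without MOV.D and without loop overhead: (kind, dest, sources).
BODY = [
    ("load", "F0", ["R1"]),         # L.D    F0, 0(R1)
    ("fp", "F2", ["F1", "F12"]),    # MULT.D F2, F1, F12
    ("fp", "F0", ["F2", "F0"]),     # ADD.D  F0, F2, F0
    ("fp", "F2", ["F1", "F11"]),    # MULT.D F2, F1, F11
    ("fp", "F0", ["F0", "F10"]),    # MULT.D F0, F0, F10
    ("fp", "F0", ["F0", "F2"]),     # ADD.D  F0, F0, F2
    ("store", None, ["F0", "R1"]),  # S.D    F0, 0(R1)
]

OVERHEAD = [
    ("int", "R1", ["R1"]),     # ADDI R1, R1, 24
    ("branch", None, ["R2"]),  # BNEZ R2, LOOP
    ("int", "R2", ["R2"]),     # SUBI R2, R2, 3 (branch delay slot)
]


def rename(reg, copy):
    """Rename F0/F1/F2 for unrolled copy 0, 1, 2 (F0 -> F3 -> F6, etc.)."""
    if reg in ("F0", "F1", "F2"):
        return "F%d" % (int(reg[1:]) + 3 * copy)
    return reg


def unrolled_schedule(copies=3):
    """Interleave the renamed copies instruction by instruction."""
    schedule = []
    for kind, dest, sources in BODY:
        for copy in range(copies):
            new_dest = rename(dest, copy) if dest else None
            schedule.append((kind, new_dest, [rename(s, copy) for s in sources]))
    return schedule + OVERHEAD


def total_cycles(code):
    """Issue cycle of the last instruction under the in-order stall model."""
    last_writer = {}  # register -> (issue cycle, kind) of latest producer
    cycle = 0
    for kind, dest, sources in code:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                issued, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + gap + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


schedule = unrolled_schedule(3)
print(len(schedule))               # 24 instructions
print(total_cycles(schedule))      # 24 cycles, so no stalls
print(total_cycles(schedule) / 3)  # 8.0 cycles per original iteration
```

Because the cycle count equals the instruction count, the model reports no stall cycles, which matches the 24/3 = 8 figure in part e).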
f) (4 points) For DSP processors, special instructions are provided to speed up DSP applications such as an n-tap filter. Suppose the following instructions are provided in addition. How can one use them to speed up the original 3-tap filter code at the beginning of the question? Write out the new code below, starting with the version in part a). How many cycles do the DSP instructions save? How does this compare to your answer to part b)?

LP RX, LABEL    Zero-overhead loop: repeats the code segment the number of times specified in register RX. This eliminates the branch delay.

LT FX, RY       Auto-increment load: loads MEM(RY) into register FX, then increments the base register RY to point to the next element.

LOOP: L.T     F0, 0(R1)      #load the filter input for the iteration
      MULT.D  F2, F1, F12    #multiply elements
      ADD.D   F0, F2, F0     #add elements
      MULT.D  F2, F1, F11
      MOV.D   F1, F0         #move value in F0 to F1
      MULT.D  F0, F0, F10
      ADD.D   F0, F0, F2
      S.D     F0, 0(R1)      #store the result
      LP      R2, LOOP       #continue till all inputs are processed

This gets rid of the pointer adjustment and counter adjustment instructions, which saves two cycles, the same saving as in part b). If one assumes L.T F0, 0(R1) is executed only once at the entrance of the loop, then this saves more. If LT is auto-decrement, a similar result can be achieved by modifying the memory contents accordingly.