
Modification History: None as of October 9
ECE462/562
Computer Architecture and Design
Homework 3
Release Date: October 9
Due Date: Section I: October 30, Section II: October 23
Section I
Objective:
With this assignment you will:
 Make tradeoffs related to branch predictors' implementation.
 Gain an understanding of the relative importance of various advanced pipeline
techniques such as branch prediction, variable pipeline width, and out-of-order execution.
Prerequisites:
 Go over bpred.h and sim-bpred.c in your simplesim-3.0 folder.
 Go over branchprediction.ppt posted on the "Links" page of the ece462/562 website.
 Reading and background material:
[1] Tse-Yu Yeh and Yale N. Patt, "Alternative Implementations of Two-Level Adaptive
Branch Prediction," ISCA 1992.
[2] Tse-Yu Yeh and Yale N. Patt, "A Comparison of Dynamic Branch Predictors that Use
Two Levels of Branch History," ISCA 1993.
Part-1 Branch prediction (No deliverable)
Deliverable 1:
Suppose you are choosing a new branch strategy for a processor. Your design choices are:
1. predict branches not-taken with a branch penalty of 2 cycles and a 1200 MHz clock-rate
2. predict branches not-taken with a branch penalty of 3 cycles and a 1300 MHz clock-rate
3. predict branches using bimod with a branch penalty of 4 cycles and a 900 MHz clock-rate
4. predict branches using bimod with a branch penalty of 4 cycles and a 1000 MHz clock-rate
and half the L1 cache size
Q1: Which design would you choose for your benchmarks (use the same benchmark set from
hw1), and why?
Q2: By how much would you have to increase the clock frequency in order to gain
performance when allowing a branch misprediction latency of 3 cycles instead of 2 when using
the not-taken predictor?
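One way to set up this comparison (a sketch; take the base CPI and the per-instruction
frequency of taken branches, b_t, from your own simulation results, since with predict
not-taken every taken branch is a misprediction):

time per instruction = CPI / f = (CPI_base + b_t * penalty) / f

Break-even therefore requires

f_new / f_old = (CPI_base + 3 * b_t) / (CPI_base + 2 * b_t)

and any clock increase beyond this ratio is a net performance gain.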
Q3: Why is bimodal branch prediction more expensive to implement than predict not-taken?
Q4: Why does bimodal perform better than predict not-taken?
Deliverable 2:
Compare the performance of two different state diagrams in two-level branch prediction.
According to the reference papers [1] and [2], these two state diagrams have nearly the same
performance. Show how close these state diagrams are in practice. You will need to change a
few lines of code to implement the first state diagram.
[State diagrams (a) and (b): figures from the original handout, not reproduced here.]
Use the following options as defaults:
-max:inst 50000000 -fastfwd 20000000 -redir:sim sim_output_file
-bpred 2lev -bpred:2lev 1 512 8 0 -bpred:ras 8 -bpred:btb 64 2
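For orientation, here is a minimal C sketch of the two-bit saturating-counter update used by
the default state diagram (b). This is not SimpleScalar's exact code (the real update lives in
bpred.c), but implementing state diagram (a) amounts to changing transitions like these:

/* ece462: hypothetical illustration of state diagram (b).
   States 0-1 predict not-taken, states 2-3 predict taken. */
unsigned char update_2bit(unsigned char state, int taken)
{
    if (taken) {
        if (state < 3) state++;   /* step toward strongly taken */
    } else {
        if (state > 0) state--;   /* step toward strongly not-taken */
    }
    return state;   /* diagram (a) would use different transitions here */
}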
Deliverable 3:
Compare the performance of the following branch predictors using the same total number of
predictor entries as in Deliverable-2. Use the second state diagram (b) for each predictor.
Remember, the total number of predictor entries in each case should be the same as the
number used in Deliverable-2.
(1) GAg: 1 global history register and 1 global prediction table
(2) GAp: 1 global history register and 8 per-address prediction tables
(3) PAg: 8 per-address history registers and 1 global prediction table
(4) PAp: 8 per-address history registers and 8 per-address prediction tables
Which case has the best performance, and why?
Refer to the evaluation strategy of [1] and [2] as a guideline when conducting your experiments.
Note: configuration examples for 2-level predictors
-bpred:2lev <l1size> <l2size> <hist_size> <xor>
Configuration: N, M, W, X
N: number of entries in the first level (number of shift registers)
M: number of entries in the second level (number of counters, or other FSMs)
W: width of the shift register(s) (number of bits in each shift register)
X: xor history and address for the second-level index (1 = yes, 0 = no; we use 0 for this
homework)
Sample predictors:
GAg: 1, M, W, 0 where M = 2^W
GAp: 1, M, W, 0 where M = C * 2^W, C is the number of per-address prediction tables
PAg: N, M, W, 0 where M = 2^W
PAp: N, M, W, 0 where M = N * 2^W
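As a worked example, if you take the budget to be the 512 second-level counters implied by
the Deliverable-2 configuration (-bpred:2lev 1 512 8 0) and use N = C = 8, one consistent set
of configurations would be (illustrative values only; check them against your own budget):
GAg: -bpred:2lev 1 512 9 0 (2^9 = 512)
GAp: -bpred:2lev 1 512 6 0 (8 * 2^6 = 512)
PAg: -bpred:2lev 8 512 9 0 (2^9 = 512)
PAp: -bpred:2lev 8 512 6 0 (8 * 2^6 = 512)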
Part-2 Performance
Deliverable 1:
Configuration Options:
Execution type: in-order, out-of-order
Pipeline stage width (for fetch, decode, issue): 1, 2, 4, and 8
Number of memory ports: 1, 2, and 4
for (each execution type)
    for (each pipeline stage width)
        for (each number of memory ports) {
            Collect CPI and total cycle-count information.
        }
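For example, these sweeps map onto the standard SimpleScalar 3.0 sim-outorder options
(confirm the exact names with sim-outorder -h):
-issue:inorder true|false (in-order vs. out-of-order execution)
-fetch:ifqsize, -decode:width, -issue:width, -commit:width (pipeline stage width)
-res:memport (number of memory-system ports)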
Consider the profiling information you collected in homework-1 during your evaluations.
Evaluate the impact of each configuration option on performance. Here are a few questions to
consider as a guideline when preparing your discussions.
 Is the wider pipeline more effective with in-order or out-of-order execution, and if so, why?
 What, if any, is the impact on CPI of allowing more instructions to be processed in one
cycle?
 What, if any, is the impact on CPI of allowing out-of-order execution?
 What is the benefit of increasing the number of memory ports?
Deliverable 2:
Given the baseline out-of-order CPU configuration below, identify an in-order CPU
configuration that performs better. Support your arguments with simulation results, and justify
why you are getting better performance with the proposed modification.
Baseline out-of-order CPU model:
 Multiple issue: 1 instruction
 Speed of the front-end relative to the execution core: 1
 Integer ALUs: 1
 Integer multipliers/dividers: 1
 Memory system ports: 1
 Floating point ALUs: 1
 Floating point multipliers/dividers: 1
Notes: Focus on a limited set of parameters instead of sweeping the whole search space. Choose
your parameters carefully. You can change the parameter values or modify the .def file. Do not
attempt to change the SimpleScalar code; that would be a semester-long project.
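One plausible mapping of this baseline onto sim-outorder options (verify the names against
your .def file or sim-outorder -h):
-fetch:speed 1 -fetch:ifqsize 1 -decode:width 1 -issue:width 1 -commit:width 1
-res:ialu 1 -res:imult 1 -res:memport 1 -res:fpalu 1 -res:fpmult 1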
What to turn in:
1. Package all your files, including modified source code, simulation results, and the report,
into one tar file (see the example command after this list). Your report must contain
simulation results (include your SimpleScalar output files) and your analysis of them. Any
result you consider important can be used. Only Microsoft DOC or PDF is acceptable for the
report.
2. Send the tar file to the instructor with the following in the email's subject line:
Assignment 3 (Full Names of Team Members), by 12:29 PM on the due date (right before
the class).
3. "IMPORTANT" Be sure to turn in a hard copy of your report, including simulation
results. It must be identical to the electronically submitted version.
4. Penalty for late submission: 5% deduction per day.
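A typical packaging command for item 1 (file and directory names here are placeholders):
tar cvf hw3_team.tar report.pdf modified_src/ sim_outputs/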
Notes:
 In the report I need your analysis/evaluation/interpretation of the results, supported by
trend lines, tables, or charts based on the output files. Please do not print out the output
files.
 For example: in a chart with block size on the x-axis, miss rate on the y-axis, and one line
per cache size, you need to do multiple evaluations: first fix the cache size and observe the
trends as the block size changes; then fix the block size and observe the trends as the cache
size changes.
 It is highly critical that you evaluate the results across the benchmarks you worked on.
Maybe a prediction technique or option worked well in one benchmark and not as well in
another; I need to see your discussion of why you observed such a difference in behavior.
 In the report you are expected to make a reference to your hw1 profiling analysis when
discussing the cache behavior. For example, the number of loads and stores (memory
transactions) will have a significant impact on cache performance! You will need to refer
to your hw1 data.
 When you modify the code, make sure to include that portion of the code in the report,
specify the line numbers, and include the source code in the zipped folder. Also put a
comment line before and after the modified region and insert "ece462" in the commented
lines (see the example below). It is your responsibility to make sure that your module runs
when I compile it in my environment.
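For instance, a modified region could be marked like this (hypothetical snippet; the body is
whatever lines you changed):
/* ece462: begin modified region */
    /* ...your changed lines, e.g. the counter-update code... */
/* ece462: end modified region */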
Section II
Problem 1. Hennessy & Patterson 3.1, 3.2, 3.3, 3.5. Note: one of these problems will be picked
randomly for grading.
Problem 2. Hennessy & Patterson 3.7
Problem 3. Hennessy & Patterson 3.17 (correction: local predictor should be (2,2))
Problem 4.
List all the dependencies (output, anti and true) in the following code fragment. Indicate whether
the true dependencies are loop-carried or not. Show why the loop is not parallel.
for (i = 2; i < 100; i++) {
    a[i]   = b[i] + a[i];    /* S1 */
    c[i-1] = a[i] + d[i];    /* S2 */
    a[i-1] = 2 * b[i];       /* S3 */
    b[i+1] = 2 * b[i];       /* S4 */
}
Problem 5.
Suppose we have a deeply pipelined processor for which we implement a branch-target buffer
for conditional branches only. Assume that the misprediction penalty is always 4 cycles and
the buffer miss penalty is always 3 cycles. Assume a 90% hit rate, 90% accuracy, and 15%
branch frequency. How much faster is the processor with the branch-target buffer than a
processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.
Problem 6.
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Assume the pipeline latencies given above and a one-cycle delayed branch. Unroll the following
loop a sufficient number of times to schedule it without any delays. Show the schedule after
eliminating any redundant overhead instructions. The loop is a dot product (assuming F2 is
initially 0) and contains a recurrence. Despite the fact that the loop is not parallel, it can be
scheduled with no delays.
loop:   LD      F0,0(R1)
        LD      F4,0(R2)
        MULTD   F0,F0,F4
        ADDD    F2,F0,F2
        SUBI    R1,R1,8
        SUBI    R2,R2,8
        BNEQZ   R1, loop
Problem 7.
In this exercise we will look into how a common vector loop runs. The loop is the so-called
SAXPY loop and the central operation in Gaussian elimination. The loop implements the vector
operation Y=a * X + Y for a vector of length 100. Here is the code for the loop:
foo:    LD      F2,0(R1)      ;load X[i]
        MULTD   F4,F2,F0      ;multiply a*X[i]
        LD      F6,0(R2)      ;load Y[i]
        ADDD    F6,F4,F6      ;add a*X[i] + Y[i]
        SD      0(R2),F6      ;store a*X[i]+Y[i]
        ADDI    R1,R1,8       ;increment X index
        ADDI    R2,R2,8       ;increment Y index
        SGTI    R3,R1,done    ;if R1 > done then R3=1, else R3=0
        BEQZ    R3, foo       ;branch if R3 == 0
Assume that the integer operations issue and complete in one clock cycle (including loads) and
that their results are fully bypassed. Ignore the branch delay. Use the FP latencies shown in
Problem 6. Assume the FP unit is fully pipelined.
Assume Tomasulo's algorithm for the hardware, with one integer unit taking one execution cycle
(a latency of 0 cycles to use the result) for all integer operations. Show the state of the
reservation stations and the register-status tables when the SGTI writes its result onto the
CDB. Do not include the branch.