ECE 4750 Computer Architecture, Fall 2016
Discussion Session Handout
Problem 1. Pipelined Processor with Integer Multiplier
In this problem we will consider a 5-stage pipelined TinyRV2 processor with hardware stalling to
resolve data hazards. We will then investigate augmenting the processor with a single-cycle integer
multiplier and a 4-cycle unpipelined iterative multiplier unit.
The baseline design is the processor which includes a single-cycle integer multiplier and the alternative design would be the processor which includes a 4-cycle unpipelined iterative integer multiplier. We will evaluate the performance of both the baseline and alternative design by considering a
simple scalar multiplication C kernel as shown below. We have modified the kernel to only handle
integer inputs to be able to compute using our integer multiplier units.
1
2
3
4
5
void scalarMul_int( int src[ ], int a, int size )
{
for ( unsigned int i = 0; i < size; i ++ )
src[i] = a*src[i];
}
The simplified assembly instructions which correspond to the loop above are as shown below.
Study the assembly instructions and make sure that you understand how it corresponds to the C
program above. Assume the following:
• x3 : holds the base address of the src[ ] array
• x4 : holds the scalar constant a
• x5 : holds the size or loop count of value 64
1
2
3
4
5
6
7
8
9
10
loop:
lw
mul
sw
addi
addi
bne
opA
opB
...
x1,
x6,
x6,
x3,
x5,
x5,
0(x3)
x4, x1
0(x3)
x3, 4
x5, -1
x0, loop
Part 1.A TinyRV2 Processor with Single-Cycle Integer Multiplier
Figure 1 shows the datapath diagram of the TinyRV2 processor integrated with a single-cycle
integer multiplier unit. The multiplier unit exists in the excute (X) stage of the datapath in parallel
to the ALU unit as shown. Spend some time studying this datapath to understand how the singlecycle integer multiplier affects the structural, data, and control hazards in the pipeline. Example
below shows the execution of a multiplication operation in this pipeline.
Dynamic
Transaction
1 addi
x1, x3, 1
2 mul
x2, x6, x4
0
1
Cycle
2 3 4
F
D
X M W
F
D
1
5
X M W
6
ECE 4750 Computer Architecture, Fall 2016
Discussion Session Handout
Draw a pipeline diagram illustrating the first iteration of the loop. Note, you should include the
first instruction of the second iteration as shown in the instruction sequence. Draw arrows in your
pipeline diagram to indicate any data or control hazards. Ignoring the startup overhead calculate
the total number of cycles required to execute the loop. Assume that this processor in a given
standard CMOS technology take 1 ns for a clock cycle and that the critcal path is through the
single-cycle multiplier in execute (X) stage. Calculate the execution time for the loop.
lw x1, 0(x3)
mul x6, x4, x1
sw x6, 0(x3)
addi x3, x3, 4
addi x5, x5, -1
bne x5, x0, loop
opA
opB
lw x1, 0(x3)
2
imemreq_msg_
addr
+4
3
Fetch (F)
imemresp_msg_
data
reg_en_F
pc_F
pc_reg_F
inst_D
inst_D_reg
pc_reg_D
[24:20]
[19:15]
inst_D
reg_en_D
pc_plus4_F
jal_target_D
br_target_X
jalr_target_X
pc_sel_F
pc_incr_F
pc_next_F
pc_sel_
mux_F
regfile
(read)
rf
Decode (D)
pc_plus_
imm_D
+
imul_req_
msg
imul_resp_rdy_X
imul_resp_val_X
imul_resp_msg
ex_result_
sel_X
Execute (X)
wb_result_
reg_W
wb_result_
sel_M
reg_en_W
wb_result_
sel_mux_M
Memory (M)
dmemresp_msg_data
reg_en_M
ex_result_ ex_result_
sel_mux_X reg_M
dmemreq_msg_ dmemreq_msg_
addr
data
reg_en_X
alu
alu_fn_X
dmem_write_data_reg_X
imul_req_rdy_D
+4
pc_incr_X
br_cond_ltu_X
br_cond_lt_X
br_cond_eq_X
imul
op2_reg_X
op1_reg_X
pc_reg_X
br_target_
reg_X
imul_req_val_D
op2_
sel_D
op2_sel_
mux_D
csrr_sel_D
csrr_sel_
mux_D
op1_
sel_D
op1_sel_
mux_D
mngr_to_proc_data
core_id
num_cores
rf_rdata1_D
rf_rdata0_D
imm_type_D
imm
gen
imm_gen_D
regfile
(write)
Writeback (W)
proc_to_mngr_data
stats_en_
wen_W
stats_en
wb_result_
reg_W
rf_wdata_W
rf
rf_wen_W
rf_waddr_W
ECE 4750 Computer Architecture, Fall 2016
Discussion Session Handout
Figure 1: TinyRV2 Processor with Single-Cycle Integer Multiplier Datapath
ECE 4750 Computer Architecture, Fall 2016
Discussion Session Handout
Part 1.B TinyRV2 Processor with 4-Cycle Unpipelined Iterative Integer Multiplier
For the alternative design, we investigate integrating the TinyRV2 processor with a 4-cycle unpipelined iterative Integer Multplier. Assume that the cycle time for the processor reduces to 0.75ns
when implemented in the same standard CMOS technology as that of the baseline design.
The multiplier unit exists in the excute (X) stage of the datapath in parallel to the ALU unit as
shown. Spend some time studying this datapath to understand how the 4-cycle unpipelined iterative integer multiplier affects the structural, data, and control hazards in the pipeline. Example
below shows the execution of a multiplication operation in this pipeline. A transaction which occurs in the multiplier is indicated by X.
Dynamic
Transaction
1 addi
x1, x3, 1
2 mul
x2, x6, x4
3
Cycle
4 5
0
1
2
F
D
X M W
F
D
X
X
X
6
7
8
9
X M W
Draw a pipeline diagram illustrating the first iteration of the loop. Note, you should include the
first instruction of the second iteration as shown in the instruction sequence. Draw arrows in your
pipeline diagram to indicate any data or control hazards. Ignoring the startup overhead calculate
the total number of cycles required to execute the loop. Calculate the execution time for the loop.
lw x1, 0(x3)
mul x6, x4, x1
sw x6, 0(x3)
addi x3, x3, 4
addi x5, x5, -1
bne x5, x0, loop
opA
opB
lw x1, 0(x3)
Part 1.C Comparison of Processor Microarchitectures
Which microarchitecture has the highest performance? Discuss some of the trade-offs in terms of
area and performance between the processor microarchitectures. How does the single-cycle integer
multiplier unit impact the processor pipeline? How does the 4-cycle unpipelined iterative integer
multiplier unit impact the processor pipeline?
4
© Copyright 2026 Paperzz