
Tomasulo Architecture:
Out-of-Order Execution
Lab 7 Report

Names (alphabetical order):
Jim Chen (cs152-etjim)
Ray Juang (cs152-juangr)
Ted Lee (cs152-leet)
Tim Lee (cs152-timclee)

Group name: Alt-0165 (¥)
1. Abstract
The goal of this project was to design a processor based on an architecture of our
choosing. We chose to create an out-of-order execution processor based on the
Tomasulo algorithm. This architecture is very modular, provides a good base for
optimizations, and allows execution to continue while instructions that take a long time
to finish, such as multiply/divide, are still in flight. This is very advantageous on systems
with functional units that take many cycles to complete or on systems that have poor
memory subsystems. For memory, we decided to build a write-through, direct-mapped 8KB cache
composed of 8-word cache lines with a 4-entry write buffer. We were successful and
ended up with a design that operates at 12.3 MHz with a CPI of 1.1 (roughly 1).
2. Division of Labor
The division of labor for this project was mainly a division in time. We decided that we
would set timelines for portions of the project to be completed and took shifts for
completing each module. Every member took a part in the design of every module.
However, each member took responsibility for a tested and working module design
during the coding stage. The responsibilities were divided as follows:
Component                Person Responsible
Cache controller         cs152-leet
CDB arbiter              cs152-etjim
Memory controller        cs152-juangr
Memory arbiter           cs152-leet
Load/Store arbiter       cs152-timclee
Functional units         cs152-juangr
Reservation station      cs152-juangr
Register file            cs152-etjim
Fetch/Decoder            cs152-timclee

Table 1: Main components and person responsible for ensuring correctness
Our development was divided into the following stages:
Stage Name                               Description
Top-Level Planning                       Paper implementation of top-level goals and what the overall design would look like.
Detailed-Level Planning                  Paper implementation of what detailed components of the design would look like.
Coded Implementation                     Replication of paper designs into Verilog coded modules.
Top-Level Structural Verification        Analysis of all modules thrown together to ensure no signals or components were missing.
Component Verification and Modification  Individual testing of components to verify that they were functioning.
Incremental Testing Phase                Incremental tests to verify correct behaviour in interaction between various modules.
Overall Testing Phase                    Top-level testing of overall design. Debug phase from MIPS code in simulation.
Timing Analysis/Synthesis                Substitution of time synthesis programmable modules to verify correctness of synthesized board.
Optimizations                            Optimizations to reduce critical path, reduce CPI, and improve overall performance.

Table 2: Development stages and description of the stage
In summary, this can be broken down into a Top-Level Planning stage, a Detailed-Level
Component Design stage, a Testing phase, and an Optimization stage.
3. Detailed Strategy
3.1 Top-Level Planning
In this stage, we planned out on paper what our top-level datapath would look like. This
would provide us with insight as to how individual components would function with the
rest of the datapath.
3.1.1 Datapath
Our Datapath is composed of three main parts: the Fetch/Issue Unit, the Functional Units,
and the Register File. A typical instruction's lifetime is as follows: it is fetched,
decoded, and issued by the Fetch/Issue unit. Issuing consists of broadcasting a tag, opcode, and operand values to all functional units. Issuing also puts the issued tag in the
destination register’s tag field in the register file. The register file is constantly monitoring
the Common Data Bus, and the specific destination register is waiting for this tag. When
it sees the tag on the CDB, it will take the result. The tag is a 9-bit field specifying the
functional unit and reservation station (6 functional units, 3 reservation stations). The
functional units then take the instruction, wait to resolve dependencies by looking on the
Common Data Bus (CDB), and then put the instruction result on the CDB. Putting a
result on the CDB is done by sending a request to the CDB Arbiter which then
broadcasts a single result to the rest of the functional units and register file. The register
file will see the tag that is on the CDB and take the result from the CDB, store it into its
data register and clear its tag.
3.1.2 Control
Our Control is handled entirely by the Fetch/Issue unit and by the architecture of the
Datapath. Once an instruction is issued, the Datapath will put the result of the instruction
in the destination register after resolving dependencies and hazards by itself. Cases that
cannot be handled by the Datapath, such as lw and sw address ambiguity, result in stalls
at issue.
Our initial top-level design consisted of three functional units, as specified in the
project specifications. However, we decided later on to split the Integer functional unit
into R-type, I-type, and Shift functional units.
Figure 1: Initial sketch of top-level design
3.2 Detailed Level Planning and Component Design
This stage consisted of detailed planning of each of the components in the top-level
design and coding them in Verilog. The following sections will describe in detail the
functionality and design of each component.
3.2.1 The Fetch/Issue Unit
The fetcher is the same as that of an ordinary pipelined processor. It contains a register
for the PC, which is updated through a mux that selects pc+4, the branch pc, a jump pc, or a jr target.
The logic for the selector of this mux is handled in a module called "branch magic,"
which waits (like a reservation station) for branch and jr results to be broadcast on the CDB.
Since we currently do not have branch prediction, branches stall at issue until
the dependency is resolved, so this unit only has to take the register file outputs as its inputs. The fetcher
fetches one instruction at a time from the instruction cache and stores it into an
instruction register. The instruction in this register is the current instruction that will be
decoded and issued by the decode/issue unit.
Below is a depiction of our planned fetch unit:
Figure 2: Fetch/decode block designs
In the decode/issue unit, the instruction is decoded for type, to see what functional unit
to send to, and then the status of the reservation stations is checked. If there is an open
reservation station in the appropriate functional unit, the instruction is issued to it by
specifying the correct tag on the Issue Bus. The tag is one-hot encoded with the top 3
bits for the reservation stations, and the bottom 6 bits for the functional units. So on any
instruction issue, two bits will be high in the tag, one for the reservation station, and one
for the functional unit. Any instructions that write to $0 are considered nops and have all-zero tags, so they do not take up any reservation station space. When a break is
decoded, a signal is sent to the Fetcher to tell it to stall. The Fetcher then stalls until a
break release signal is received from the top-level. We decided that the decode/issue
unit would be written with control logic using Verilog.
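To illustrate the tag encoding described above, the following Verilog sketch shows roughly how a 9-bit one-hot issue tag could be formed. The module and signal names (fu_select, rs_free, and so on) are illustrative only and do not necessarily match those in decode_issue_unit.v.

// Sketch of 9-bit one-hot issue tag formation: tag[8:6] selects one of the
// three reservation stations, tag[5:0] selects one of the six functional units.
module issue_tag_sketch (
    input  [5:0] fu_select,   // one-hot: functional unit the opcode decodes to
    input  [2:0] rs_free,     // per-station "free" bits reported by that unit
    input  [4:0] rd,          // destination register of the instruction
    output [8:0] issue_tag,   // tag driven onto the Issue Bus
    output       stall        // no free station, so issue must stall
);
    // Pick the first free reservation station (station 0 has priority).
    wire [2:0] rs_select = rs_free[0] ? 3'b001 :
                           rs_free[1] ? 3'b010 :
                           rs_free[2] ? 3'b100 : 3'b000;

    // Writes to $0 are treated as nops: the tag is forced to all zeros so
    // no reservation station is occupied.
    wire is_nop = (rd == 5'd0);

    assign issue_tag = is_nop ? 9'b0 : {rs_select, fu_select};
    assign stall     = ~is_nop & (rs_select == 3'b000);
endmodule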
3.2.2 Register File
Our register file contains 32 32-bit data registers and 32 9-bit tag registers. When write
enable is high, a tag register is written. Data registers can only be altered by values
arriving on the CDB input. Register 0 and tag 0 are tied to ground, since register 0 is
always 0 and therefore always up to date. Register 31 ($ra) has a special input, the PC,
which is written when the register file receives the jal signal from decode (asserted
whenever a jal instruction is decoded). The one
complication with the register file is that it must forward values from the CDB for reads.
In the case where an instruction is being issued at the same time that one of its
operand’s values is being broadcast on the CDB, the register file must forward the
CDB’s value onto the Issue Bus.
Figure 3: Register file design (note: does not include forwarding)
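The forwarding case can be sketched for one read port as follows; the port names are illustrative and not necessarily those of regfile.v.

// Sketch of CDB forwarding on one register-file read port: if the register
// being read is still waiting on a tag and that exact tag is on the CDB this
// cycle, the CDB value is forwarded onto the Issue Bus instead of the stale
// register contents.
module regfile_read_forward_sketch (
    input  [31:0] reg_data,    // stored data for the register being read
    input  [8:0]  reg_tag,     // stored tag (0 means the value is up to date)
    input  [31:0] cdb_data,    // value currently broadcast on the CDB
    input  [8:0]  cdb_tag,     // tag currently broadcast on the CDB
    output [31:0] issue_data,  // value placed on the Issue Bus
    output [8:0]  issue_tag    // tag placed on the Issue Bus
);
    wire forward = (reg_tag != 9'b0) && (reg_tag == cdb_tag);

    assign issue_data = forward ? cdb_data : reg_data;
    assign issue_tag  = forward ? 9'b0     : reg_tag;  // forwarded value is resolved
endmodule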
3.2.3 The Functional Units
The project specification required three functional units: a load unit, a store unit, and an
integer unit. We decided to split the integer unit into three different units (R-type, I-type, and shift-type). This made each unit very easy to create, as each had
specialized tasks. Shift was split from i-type or r-type because we didn’t need a shifter
for any other instructions. We decided to remove it from the critical path of the more
common r-type and i-type instructions. Also, we wanted to see the CDB take more traffic
to see if we could handle adding more functional units. We also anticipated adding an
extra functional unit (perhaps for a multiply/divide unit), but it is currently unused. The
load unit and store unit were more of a design challenge than the other units.
3.2.3.1 Reservation Station
A generalized reservation station was created that resolved dependencies by watching
the incoming CDB and setting a flag high when its operand values were resolved. It contains
two value/tag registers for the two possible operands for an instruction. These registers
are written into when the reservation station is issued and then updated by the CDB if a
tag is non-zero. Also, it has a busy register which is high when a reservation station is in
use; it is set high when an instruction is issued to this reservation station. Finally, the
reservation station has an op register which keeps the operation to be performed on the
operands. For example, the op register in the R-Type station selects whether to perform an add,
sub, xor, etc.
Figure 4: Design of a generic reservation station
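A minimal Verilog sketch of one operand slot of such a station is shown below (the signal names are illustrative, not necessarily those of reservation_station.v).

// Sketch of one operand slot of a generic reservation station: on issue the
// value/tag pair from the Issue Bus is latched; afterwards the slot snoops
// the CDB and captures the value when its tag matches.
module rs_operand_sketch (
    input             clk, reset,
    input             issue,        // this station is being issued to
    input      [31:0] issue_value,
    input      [8:0]  issue_tag,    // 0 means the operand value is already valid
    input      [31:0] cdb_value,
    input      [8:0]  cdb_tag,
    output reg [31:0] value,
    output reg [8:0]  tag,
    output            ready         // operand resolved when the tag is 0
);
    always @(posedge clk) begin
        if (reset) begin
            value <= 32'b0;
            tag   <= 9'b0;
        end else if (issue) begin
            value <= issue_value;
            tag   <= issue_tag;
        end else if (tag != 9'b0 && tag == cdb_tag) begin
            value <= cdb_value;     // dependency resolved by the CDB broadcast
            tag   <= 9'b0;
        end
    end

    assign ready = (tag == 9'b0);
endmodule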
From this generalized reservation station, we created particular reservation stations for
each of the functional units. Each of these specialized stations has an execution unit to
do computation on its operands. We decided to put the actual execution here rather than in
the unit because it is not in our critical path and we had lots of hardware to use. If we
had put it in the path of issuing to the CDB, an asynchronous CDB and a forwarding
case could have been our critical path. From these specialized reservation stations, we
build our functional units. Each unit contains three reservation stations plus logic that decides
which station gets to request the CDB and that passes parsed values from the
Issue Bus to the individual reservation stations.
3.2.3.2 Integer Unit (divided into R/I/Shift units later on)
The design of the integer unit was used as a template in the design of the other functional
units. The method of prioritizing the reservation station outputs was accomplished via tri-state muxes. Only one reservation station is allowed to output onto a common bus at
any given time. A reservation station will only request for output when it contains valid
data and all of its data dependencies are resolved (i.e. it contains tag values of 0). If
more than one reservation station requests for output, priority is given to the first
reservation station, then second, then third. If any reservation station makes a request
for output, a general request signal is sent out to an arbiter to ask for its results to be
placed onto the CDB. This priority design will be used in the other functional units.
Figure 5: Design of an integer unit (note that the “+” block can be substituted with an ALU)
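The priority scheme can be sketched as follows, using an ordinary priority mux in place of the tri-state buffers. The widths and names are illustrative assumptions (a 41-bit result is assumed here to hold a 9-bit tag plus a 32-bit value).

// Sketch of the output-priority logic inside a functional unit: each
// reservation station raises req[i] when its operands are resolved, and only
// the highest-priority requester drives the unit's result bus.
module fu_output_priority_sketch (
    input  [2:0]  req,          // per-station requests (station 0 has priority)
    input  [40:0] result0,      // tag+value produced by each station
    input  [40:0] result1,
    input  [40:0] result2,
    output [2:0]  grant,        // one-hot enable for the output drivers
    output [40:0] unit_result,  // result offered to the CDB arbiter
    output        unit_req      // "this unit wants the CDB" request
);
    assign grant = req[0] ? 3'b001 :
                   req[1] ? 3'b010 :
                   req[2] ? 3'b100 : 3'b000;

    assign unit_result = grant[0] ? result0 :
                         grant[1] ? result1 :
                         grant[2] ? result2 : 41'b0;

    assign unit_req = |req;     // general request sent to the CDB arbiter
endmodule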
3.2.3.3 Load/Store Unit
The load/store functional units were the most complicated in terms of interaction and
design. There were issues of data getting committed to the same memory address in the
wrong order and issues with possible incorrect reads from uncommitted stores. In order
to deal with such hazards, we established a set of paradigms to handle load and store
instructions:
1. When a load or store address is unresolved, we halt issue of any more instructions.
2. All loads must occur before stores go into memory.
3. All loads must check the store reservation stations before loading into their own
reservation stations.
4. All stores must check previous stores. If a common address is found, the newer store
overwrites the value.
The paradigms above were developed to handle potential memory Read After Write
(RAW), Write After Read (WAR), and Write After Write (WAW) hazards. For example:
1: lw $2, 0($1)
2: sw $3, 0($1)
3: lw $4, 0($1)
4: sw $5, 0($1)
5: sw $6, 0($1)
A WAR hazard could be introduced if the load (line 1) and store (line 2) were issued, but
the store gets executed before the load. This can be resolved by forcing loads to occur
first before storing.
If loads did not check previous store issues, a RAW hazard could be introduced. In the
same example, if line 2 and 3 are issued, but the load in line 3 occurs before the store in
line 2, the wrong value would be read. This can be resolved by having the load unit
check the addresses of previous stores, and grabbing the data/tag of the matching store.
A WAW hazard could be introduced if the store in line 4 occurs after the store in line 5.
This can be resolved by having stores check previous stores and overwrite the data of
the reservation station whose address matches.
If we did not stall when an address is unresolved, it would be impossible to perform
the address checks proposed above.
Figure 6 below provides a detailed view of the implementation of the load unit; it shows
just the inputs into the load unit's reservation station. Figure 7 below provides a detailed
view of the implementation of the store unit.
Figure 6: Design of a load unit (note: design does not include output to load/store arbiter)
Figure 7: Design of a store unit (note: design does not include output to load/store arbiter)
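Paradigm 3 (loads checking the pending stores) can be sketched as follows. The number of store entries, the signal names, and the priority among multiple matching stores are illustrative assumptions, not necessarily how unit_load.v and unit_store.v are wired.

// Sketch of the load-vs-pending-store address check: before a load settles
// into its reservation station, its address is compared against the pending
// stores; on a match, the load captures the store's data (or tag) instead of
// going to memory.
module load_store_check_sketch (
    input  [31:0] load_addr,
    input         st_busy0,  st_busy1,  st_busy2,   // pending-store valid bits
    input  [31:0] st_addr0,  st_addr1,  st_addr2,
    input  [31:0] st_data0,  st_data1,  st_data2,
    input  [8:0]  st_tag0,   st_tag1,   st_tag2,
    output        match,        // a pending store covers this address
    output [31:0] fwd_data,     // data the load should capture
    output [8:0]  fwd_tag       // or the tag, if the store's data is not ready
);
    wire m0 = st_busy0 && (st_addr0 == load_addr);
    wire m1 = st_busy1 && (st_addr1 == load_addr);
    wire m2 = st_busy2 && (st_addr2 == load_addr);

    assign match    = m0 | m1 | m2;
    assign fwd_data = m0 ? st_data0 : m1 ? st_data1 : st_data2;
    assign fwd_tag  = m0 ? st_tag0  : m1 ? st_tag1  : st_tag2;
endmodule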
3.2.4 Common Data Bus Arbiter (CDB Arbiter)
We created a synchronous CDB arbiter for simplicity. Requests are latched in on every
cycle, and the highest-priority request is broadcast on the CDB. The granted
request also enables a mux to pass a 3-bit done signal to the reservation stations. These
done bits are used as resets for the busy registers in the reservation stations.
Figure 8: Design of the CDB Arbiter
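A minimal sketch of such a synchronous arbiter is shown below. For brevity it returns one done bit per functional unit rather than the 3-bit per-station done signal described above, and the port names are illustrative rather than those of cdb_arbiter.v.

// Sketch of a synchronous CDB arbiter: requests are sampled on the clock
// edge, the highest-priority requester's result is broadcast on the CDB, and
// a done bit is returned so the winning unit can clear its station's busy bit.
module cdb_arbiter_sketch (
    input              clk, reset,
    input      [5:0]   req,                           // one request per functional unit
    input      [40:0]  result0, result1, result2,     // tag+value offered by each unit
    input      [40:0]  result3, result4, result5,
    output reg [40:0]  cdb,                           // broadcast to stations and regfile
    output reg [5:0]   done                           // tells the granted unit it committed
);
    always @(posedge clk) begin
        if (reset) begin
            cdb  <= 41'b0;
            done <= 6'b0;
        end else begin
            // fixed priority: unit 0 wins over unit 1, and so on
            done <= req[0] ? 6'b000001 :
                    req[1] ? 6'b000010 :
                    req[2] ? 6'b000100 :
                    req[3] ? 6'b001000 :
                    req[4] ? 6'b010000 :
                    req[5] ? 6'b100000 : 6'b0;
            cdb  <= req[0] ? result0 :
                    req[1] ? result1 :
                    req[2] ? result2 :
                    req[3] ? result3 :
                    req[4] ? result4 :
                    req[5] ? result5 : 41'b0;
        end
    end
endmodule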
3.2.5 Load/Store Arbiter
The load/store arbiter was designed as a synchronous arbiter that registers requests and
gives priority to loads. This decision was made to preserve correctness (refer to detailed
strategy of load/store unit). The design of the load/store arbiter is similar to the CDB
arbiter. It only differs from the CDB Arbiter in that it can only pass a request when the
cache controller is not asserting wait, since the cache controller will ignore all requests
when wait is high.
Figure 9: Top-level view of the Load/Store Arbiter
3.2.6 Cache and Cache Controller
We decided to keep things simple this time and concentrate on finishing the design of
the Tomasulo architecture before optimizing the cache. For this reason, we chose to
implement a direct-mapped cache with write-through policy. Using a write-through policy
made things a lot simpler since there was no need to consider cases where the cache
line was dirty.
The cache is composed of 2 SRAM blocks, a 256-bit register, 4 muxes, a comparator
and a write buffer. One of the SRAM blocks is used for storing the data for the cache,
while the other is used to store the tags of the addresses. The SRAM block used to store
the data takes a 12-bit address input and a 32-bit data input. The data into
the cache is first muxed among three sources: data from the DRAM, data from the
processor, or data from the write buffer. The Tag SRAM block was assumed to be an
asynchronous block because we wanted to be able to perform a cache hit in one cycle.
So, by having an asynchronous tag file we could compare the tags and return the data
from the cache all within one clock cycle. The 12-bit comparator is used to compare the
incoming tag with the tag that corresponds to that line in the cache. The 256-bit
register is used to keep track of the valid bits of the 256 cache lines in our cache. The
output to the processor was muxed between the SRAM block and the 4 outputs of the
write buffer. The last two muxes were used to choose the address and data being written
out to the DRAM.
Figure 10: Top-level view of the cache datapath
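As an illustration, the hit check can be sketched as below. The address slicing (8 index bits for 256 lines, 3 word-offset bits for 8-word lines, 2 byte-offset bits, a 12-bit tag) follows the numbers above, but the overall address width and bit positions are our assumptions.

// Sketch of the cache hit check for one access.
module cache_hit_sketch (
    input  [24:0]  addr,         // processor byte address (width assumed)
    input  [11:0]  stored_tag,   // tag SRAM output for the indexed line
    input  [255:0] valid_bits,   // one valid bit per cache line
    output [7:0]   index,        // selects one of the 256 lines
    output [11:0]  tag,
    output         hit
);
    assign index = addr[12:5];       // above the 2-bit byte and 3-bit word offsets
    assign tag   = addr[24:13];
    assign hit   = valid_bits[index] && (stored_tag == tag);
endmodule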
The controller for the cache was also much simpler than our controller from lab 6. Our
new controller was basically a state machine of 6 states: Start, Read, Read_wait, Write,
Write_wait, and Flush. The START state was the largest state because it had to handle
flushing the write buffer, processing a read or write request, and clearing out the write
buffer.
In the START state, a read request was processed as follows:
1. Check write buffer slots 1,2,3,4 for the data by comparing the
addresses.
2. If not in write buffer, check if it's in the cache by checking valid bit and
comparing the tags.
3. Else, go to Read_wait state to request the correct cache line from
DRAM.
The two wait states, READ_WAIT and WRITE_WAIT, are basically used to set up the
address and data to send to the DRAM, because that is the way we set up our memory
controller. They also make sure that the wait signal is low before making the request for
DRAM access.
In the READ state, we wait for the data ready signal from the DRAM before we start
counting and writing in 8 data entries into the cache. We also made a check to see if any
of the data in the write buffer could be written in at this time as well. The READ state
controls the mux which selects the data to be stored into the cache block. The counter
not only keeps track of how many entries we've read in, but it also serves as the bottom
3 bits of the address that is fed into the cache block to store the data.
In the WRITE state, we just write one word from one of the write buffer slots into DRAM
and then transition back to the START state.
In the FLUSH state, we basically perform the WRITE state 4 times. We only write out to
the DRAM if the slots are valid.
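The state machine can be sketched as follows; the condition signals (hit, wb_hit, wb_pending, and so on) are placeholders standing in for the real cache, write-buffer, and DRAM interface signals, so this is an outline of the controller rather than the actual code of cache_controller.v.

// Skeleton of the six-state cache controller FSM; only the state register
// and representative transitions are shown.
module cache_fsm_sketch (
    input            clk, reset,
    input            read_req,     // processor read request
    input            flush_req,    // flush the write buffer
    input            hit,          // valid bit set and tags match
    input            wb_hit,       // requested address found in the write buffer
    input            wb_pending,   // write buffer has a word to drain to DRAM
    input            dram_wait,    // DRAM controller not ready for a request
    input            line_done,    // all 8 words of the line have been transferred
    output reg [2:0] state
);
    localparam START      = 3'd0,
               READ_WAIT  = 3'd1,
               READ       = 3'd2,
               WRITE_WAIT = 3'd3,
               WRITE      = 3'd4,
               FLUSH      = 3'd5;

    always @(posedge clk) begin
        if (reset) state <= START;
        else case (state)
            START:      if (flush_req)                          state <= FLUSH;
                        else if (read_req && !(hit || wb_hit))  state <= READ_WAIT;
                        else if (wb_pending)                    state <= WRITE_WAIT;
            READ_WAIT:  if (!dram_wait) state <= READ;    // set up address, wait on DRAM
            READ:       if (line_done)  state <= START;   // stream in the 8-word line
            WRITE_WAIT: if (!dram_wait) state <= WRITE;
            WRITE:      state <= START;                   // one word written from the buffer
            FLUSH:      state <= START;                   // drain the valid write-buffer slots
            default:    state <= START;
        endcase
    end
endmodule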
3.2.7 Memory Controller
The design of the memory controller was taken directly from Lab 6. Since the DRAM
interface was correctly functioning, we simply replaced the write burst states with a
sequence of single write event states. Instead of completing the full 8 bursts of write, the
memory controller achieves a single write by sending a burst terminate command after
the first write completes.
3.3 Testing Methodology
We took an incremental approach to testing along with individual component testing. We
began by testing only the necessary components for a working Datapath – the Issue
Unit, the Register File, a Functional Unit (I-type) and the CDB Arbiter. These units were
initially tested for basic functionality and then were more thoroughly tested while being
put together. With these units, it was possible to issue, execute and commit instructions
by using an SRAM block to supply instructions to the Fetcher. This basic Datapath
supported normal execution as well as execution that would have required forwarding in
normal linear pipeline architectures. For example, code such as
addiu $1, $1, 1
addiu $1, $1, 1
addiu $1, $1, 1
addiu $1, $1, 1
could be executed and successfully committed.
After this was tested and debugged, we added the two other integer units, shift and R-type. This allowed us to test the Common Data Bus more thoroughly, as it now had to
deal with more than one incoming request. A problem we ran into when testing with this
Datapath was that each instruction executed in one cycle, so there was no stalling and
the CDB never had more than one incoming request. So to test the CDB more
completely at this stage, we forced the CDB Arbiter to reject all requests for some
number of cycles before beginning normal arbitration. So when the CDB Arbiter began
arbitrating, it already had several pending requests. After this was tested, we were
certain that execution without memory was done.
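The forced-rejection trick can be sketched as the following testbench fragment. The reject override and its name are illustrative; the actual hook we wired into the arbiter for this test may differ.

// Testbench fragment for the CDB back-pressure test: the arbiter is forced
// to reject all requests for a while so several results pile up, then normal
// arbitration is allowed to resume and drain them.
module tb_cdb_backpressure_sketch;
    reg clk    = 0;
    reg reject = 1;            // assumed override input on the arbiter for this test

    always #5 clk = ~clk;      // free-running clock for simulation only

    initial begin
        repeat (20) @(posedge clk);   // hold off all CDB grants for 20 cycles
        reject = 0;                   // resume normal arbitration; requests drain
        repeat (100) @(posedge clk);
        $finish;
    end
endmodule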
The next step was to add memory. In this phase of testing, we added the Load and
Store Functional Units and attached them to an SRAM block to be used as data
memory. This was a test solely for testing the Load and Store Units as they were the
most complicated Functional Units to design and implement. Unlike the integer
Functional Units, the Load and Store units also had to interact directly with each other,
and with the Load/Store Arbiter. Thus, these tests focused on behaviors such as a store to the
same address overwriting an older sw, and loads looking in the Store Unit's reservation
stations for values before going to memory. The Load/Store Arbiter at this point was also
tested with the Load/Store Units to make sure that it could arbitrate simultaneous
requests from the Store and Load Units and then correctly pass back done bits to the
stations. This working Datapath brought us to the same level of functionality as Lab 5. We
could execute any of the required MIPS instructions, but with only separate SRAM
blocks serving as memory.
The next phase of testing included a Data Cache attached to the DRAM controller and
the simulated DRAM modules. The instruction memory remained as an SRAM block.
This addition was made to test the interaction between the Data Cache and the rest of
the processor. The main new functionality tested at this level was the Load/Store Arbiter
waiting for the cache to lower its wait signal before passing done and (for Load) a value
back to the Units. This is the integration point where we had the most trouble in lab 6, so
we slowed the pace of testing and put in fewer pieces per testing stage. Next, we added
an Instruction Cache into the Datapath to test the Memory Arbiter (to arbitrate between
Instruction and Data Cache requests) and to test the Fetcher’s ability to cope with
instruction cache misses. Although it was not getting information from the Instruction
Cache (because we hadn’t yet inserted level 0 boot code), it was still feeding address
requests into the Instruction Cache and receiving a stall signal at the end of cache
blocks. Thus we were now ready to integrate the last pieces, Level 0 Boot and hooking
up the Instruction Cache fully to the Fetcher.
Finally, we put the instmem SRAM (instruction source) into the Data Cache for use as a
Level 0 Data Source and we put the Boot Rom that contains our boot instructions in the
Instruction Cache. This is our final processor with all components in place. At this point,
we test by putting our code in instmem.contents (the instruction source), running through
the bootloader, and then actually running the program code. Testing was initially done
with only one instruction block in the Level 0 source format with the initial address at 0.
Then we tried changing the initial address to a value other than 0 such as 0x80000000.
Finally, we moved on to multiple Level 0 blocks (i.e. quick sort and its data block).
3.4 Optimizations
After initial testing of our processor was thought to be finished, we began design and
implementation of the multi-way issue (2-way) enhancement while further thorough tests
were in progress. The following are the additions that had to be made:
1. Add a second CDB (and arbiter) so that we could commit two results
at once. Multi-way issue has little gain with single commit.
2. Modify the reservation stations to take in input from both CDB's to
resolve dependencies.
3. Modify the Functional Units to take input from two Issue Busses, to
be able to issue to two reservation stations in the same functional unit
in one cycle.
4. Modify the register file to:
   - read both CDB's to get committed results
   - read both Issue Busses, to get up to two tags written in
   - output 4 operand values/tags (so we can issue two instructions per cycle)
   - take two different jal signals, depending on which instruction is the jal
5. Redo the Fetch/Issue Unit so that it fetches two instructions at once
and issues two at once.
6. Stripe the cache (use multiple SRAM blocks) so that we could read
two instructions per cycle from it.
The first modification was a simple one: in order to make sure that two distinct requests
would get onto both CDB's, we decided to simply invert the priority of the first CDB arbiter
and use that as the second CDB arbiter. So all units would request to both arbiters, and
the two arbiters would broadcast results to all reservation stations. The reservation
stations would then update from either of the CDB's. There should be no conflict in this
update unless both arbiters output the same result/tag, in which case it doesn't matter
which CDB the reservation station updates from. Additions 1-4 were straightforward in
both design and implementation. Addition 5 took the most time to design.
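In terms of the cdb_arbiter_sketch module sketched in section 3.2.4, the inverted-priority arrangement looks roughly like the following (again a sketch with illustrative port names, not the actual wiring of the design):

// Sketch of the dual-CDB arrangement: the same arbiter is instantiated twice,
// with the request/result order reversed on the second instance, so the two
// arbiters only pick the same unit when it is the lone requester.
module dual_cdb_sketch (
    input          clk, reset,
    input  [5:0]   req,
    input  [40:0]  r0, r1, r2, r3, r4, r5,
    output [40:0]  cdb_a, cdb_b,
    output [5:0]   done_a, done_b
);
    // first arbiter: unit 0 has the highest priority
    cdb_arbiter_sketch arb_a (
        .clk(clk), .reset(reset), .req(req),
        .result0(r0), .result1(r1), .result2(r2),
        .result3(r3), .result4(r4), .result5(r5),
        .cdb(cdb_a), .done(done_a)
    );

    // second arbiter: identical logic, but requests and results are reversed,
    // so unit 5 has the highest priority on the second CDB
    wire [5:0] req_rev = {req[0], req[1], req[2], req[3], req[4], req[5]};
    wire [5:0] done_rev;

    cdb_arbiter_sketch arb_b (
        .clk(clk), .reset(reset), .req(req_rev),
        .result0(r5), .result1(r4), .result2(r3),
        .result3(r2), .result4(r1), .result5(r0),
        .cdb(cdb_b), .done(done_rev)
    );

    // un-reverse the done bits before returning them to the units
    assign done_b = {done_rev[0], done_rev[1], done_rev[2],
                     done_rev[3], done_rev[4], done_rev[5]};
endmodule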
We decided, for simplicity of the cache, to always request even addresses, which
ensures that our two-word requests never cross cache-line boundaries. This adds
complexity to the fetcher since we could branch to an odd instruction and we would have
to invalidate the fetched even instruction. We decided to make this tradeoff because of
our previous troubles with complicated cache designs. The complications occur with
branches and stalls. For branches, we send back two valid signals from the Decoder to
the Fetcher indicating which of the next two instructions are valid. This allows us to control
which of the fetched instructions are actually used. For example, if the current even
instruction was a successful branch, we would invalidate both fetched instructions
because they should be jumped over. If the current odd instruction was a successful
branch, we would invalidate only the odd fetched instruction, since the even one would
be in the branch delay slot. The Fetcher (in the branch-magic block) is also modified to
keep track of whether we branched to an odd word and, if we did, to invalidate the even fetched
word because it is before the target of the branch. The Fetcher also has to mux the
possible branch targets from beq/j/jr type instructions for both current instructions; a one-bit
signal is sent to the Fetcher from the Decoder to tell the Fetcher which instruction is
the branch. For breaks, the Decoder tells the Fetcher when breaks are decoded for each
instruction. The Fetcher then stalls until there is a break release. When it receives the
break release, it checks whether the current even instruction is a break; if it is, we
release it by resetting the instruction register (this sends out all 0's, a nop). If the even
instruction is not a break, the odd one is checked and released/reset if necessary.
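The branch part of this valid logic can be sketched as follows; breaks and the odd-branch-target case handled in branch magic are omitted, and the signal names are illustrative.

// Sketch of the two valid signals the Decoder returns to the Fetcher for the
// next sequentially fetched even/odd pair (taken branches only).
module decode_pair_valid_sketch (
    input  branch_taken,    // the current pair contains a taken branch/jump
    input  branch_is_odd,   // the taken branch sits in the odd slot of the pair
    output next_even_valid, // keep the even word of the next fetched pair?
    output next_odd_valid   // keep the odd word of the next fetched pair?
);
    // Taken branch in the even slot: its delay slot is the current odd
    // instruction, so both words of the next fetched pair are jumped over.
    // Taken branch in the odd slot: the next even word is its delay slot and
    // must be kept; only the next odd word is invalidated.
    assign next_even_valid = ~(branch_taken & ~branch_is_odd);
    assign next_odd_valid  = ~branch_taken;
endmodule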
Addition 6 was also a non-trivial addition. We had noticed that we were getting many
instruction cache stalls because our memory bandwidth was one word per cycle at best
(ignoring request latency). Thus, we decided to simultaneously increase memory
bandwidth to the cache and provide striping. Thus, instead of reading in one word per
cycle from the DRAM controller, we would read in multiple words in one cycle. Our 2-way issue design only required two SRAMs to supply our two instructions per cycle, but
we decided that it was actually easier to implement an 8-way striped design so we would
read in a whole cache line in one cycle. Thus we would increase our memory bandwidth 8
times. Incidentally, this addition would also make a stream buffer very attractive, as we
would now have enough memory bandwidth to make it worthwhile. With the original
memory bandwidth of one word per cycle, a stream buffer would not help in any program
where the data cache had any significant (1 request per 8 cycles) demands. The
memory would still (speculating) be sending data to the processor constantly, but a
larger fraction of this data would go to the instruction cache/stream buffer. With a
striped cache design, we were thus able to supply two instructions to the processor. All
of these additions were made, but are not fully functional/tested due to time constraints.
See the evaluation section for the potential performance impact that these additions
were projected to have made.
4 Results
4.1 Calculated Delays
The calculated minimum post-place and route cycle period for our entire
processor was 81.19 ns, as calculated by Xilinx Timing Analyzer. This yields a
maximum clock rate of 12.3 MHz. We tried to get a working synthesized version
of our processor on the board but failed to yield any results. From LED output
results, we believe that there are setup/hold time violations occurring in the cache
because of clocking interface issues between the DRAM controller and cache
controller.
Our critical path is as follows (highlighted in red):
Figure 11: Critical path is highlighted in red. It has a minimum delay of 81.19 ns.
The critical path case occurs when a load instruction is issued at the same time
as a CDB broadcast that updates the address register for the load instruction.
When this occurs, the register file forwards the updated CDB data to the
fetch/issue unit, which then passes it down to the load unit. The load unit is part
of the critical path because of the large fan-in delay from its various muxes.
4.2 CPI Analysis
Our final out-of-order execution processor has a CPI of approximately 1, if we
ignore instruction and data cache misses. This compares with our Lab 4
(single-cycle), and Lab 5 (5 Stage Pipeline) which each had a CPI of 1 (again if
we ignore memory effects). The effect of memory is difficult for us to evaluate
since we never had a working Lab 6 (5 Stage Pipeline with DRAM as memory),
so we can only speculate as to how our Lab 6 design would have performed. Lab
6, assuming the same cache architecture, would perform identically to our final
processor except in the case of a data cache miss. The Tomasulo algorithm
allows us to continue execution when there are data cache misses since the
issue stage is independent of the execution and writeback stages. However,
since our design did not include branch or jump prediction, jumps and branches
that are dependent on loaded values still get stalled. Thus the performance gain
over Lab 6 would be highly dependent on the program run. If the program had
many loads and stores with no branching dependencies on the loads and stores,
as in a sorting program which used registers for status information (how to
perform the next step of the sort), our final project would do very well compared
to the Lab 6 project which would stall frequently on data cache misses. Assuming
a cache miss penalty of 10 cycles, our final project would still have a CPI of 1 on
this program, whereas Lab 6 would have a CPI of about 1.15 (assume the loop
fits in the 8 word cache block; one memory access per loop iteration = 1 memory
access/8 instructions = one 10 cycle stall per 64 instructions; CPI = 74/64).
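Writing the parenthetical estimate out explicitly (a rough estimate, using the assumptions stated above):

\[
\mathrm{CPI}_{\text{Lab 6}} \approx \frac{64 + 10}{64} = \frac{74}{64} \approx 1.16
\]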
4.3 Modular Test Results
All modular test cases passed. Note: to avoid maintaining regression tests, we did not
update test cases whose assumptions turned out to be incorrect. The test cases probably
do not pass after the many fixes made to individual components, but the overall
functionality is what counts.
4.4 Top-Level Test Results
All top-level test cases passed. The demonstration programs provided by the
TAs passed as well, with the following statistics (refer to the appendix for test
results):
Test Name      Cycle Count        Results
dumb           1092176 cycles     0x199 (input 0x50)
dumber         650 cycles         0x25 / 0x00
quicksort2     6500 cycles        all numbers sorted

Table 3: Demonstration top-level test results
4.5 Board Usage Statistics
The following is a report on the resource usage on our board after synthesis:
Resource                         Usage                   Utilization
Number of External GCLKIOBs      1 out of 4              25%
Number of External IOBs          174 out of 512          33%
Number of LOCed External IOBs    154 out of 174          88%
Number of BLOCKRAMs              16 out of 160           10%
Number of SLICEs                 11206 out of 19200      58%
Number of GCLKs                  3 out of 4              75%
Number of TBUFs                  7690 out of 19520       39%
5 Conclusion
This lab was relatively successful. We designed an out-of-order execution architecture
processor and got it to work in simulation. We were unable to get it functioning on the
board because of lack of time. Compared to the previous labs, team effort was at its
best. The initial top-level planning on paper greatly increased efficiency and reduced any
issues that would have arisen because of incorrect expectations.
Finally, the design we built lacked many of the enhancements that would make the
Tomasulo architecture powerful. There were many things we felt we could have added if
given more time. For example, branch prediction, if added to our final project, would
eliminate all issue stalling, and the processor would only lose productive cycles on
branch mispredictions. Another major enhancement would have been multi-way issue,
which would create a superscalar, out-of-order execution processor. This enhancement
would be fairly easy to add since all of the data dependencies would be resolved in the
usual manner. The complications to this enhancement are handling instruction fetching
and issuing, and creating a wider bus from the cache. Adding this option would have halved
our base CPI to 0.5, and combined with branch prediction and an instruction prefetch
unit (stream buffer) would have yielded a final CPI (speculating) between 0.5 and 0.7.
Appendix I: Notebooks
Notebook entries are as follows (double click icon to view):
Name         Journal File       Total Hours Spent
Jim Chen     (embedded icon)    138.63 Hrs
Ray Juang    (embedded icon)    143.20 Hrs
Ted Lee      (embedded icon)    116.22 Hrs
Tim Lee      (embedded icon)    159.67 Hrs
Table 4: Table of journal files. (Double click icon to view)
*NOTE: The notebook entries were generated by an auto-logging journal program written by cs152-juangr. See journal
directory for executable.
Appendix II: Schematics
Schematic files used in design are as follows
Name                   Schematic File
cache                  (embedded icon)
cache_datapath         (embedded icon)
fetch_branch_unit      (embedded icon)
fetcher                (embedded icon)
top_level_driver       (embedded icon)
Table 5: Table of schematic files. (Double click icon to view)
Appendix III: Verilog Files
Verilog design files used in lab are as follows:
Module                  Description
alu.v                   ALU module
branch_magic.v          Branch magic unit
bts16.v                 Tristate buffer (16 bits wide)
bts32.v                 Tristate buffer (32 bits wide)
bts40.v                 Tristate buffer (40 bits wide)
bts41.v                 Tristate buffer (41 bits wide)
bts7.v                  Tristate buffer (7 bits wide)
bts9.v                  Tristate buffer (9 bits wide)
cache_controller.v      Cache controller
cdb_arbiter.v           CDB Arbiter
comp12.v                12-bit comparator
comp23.v                23-bit comparator
comp32.v                32-bit comparator
comp9.v                 9-bit comparator
constants.v             Constants used for standardization
counter.v               Parameterizable counters
dcache.v                Top-level data cache wrapper
decode_issue_unit.v     Decode-issue unit
extender.v              32-bit sign extender
fifo.v                  FIFO queue
global_counter.v        Global counter (for cycle count)
hugeblock.v             Fake memory block for simulation
icache.v                Top-level instruction cache wrapper
io_block.v              I/O block
issue_monitor.v         Issue monitor (disassembler)
jump_combine.v          Bus combiner (schematics hack)
level0.v                Level 0 boot ROM
load_store_arbiter.v    Load/store arbiter
memory.v                Top-level memory wrapper
memory_arbiter.v        Memory arbiter
memory_control.v        DRAM controller (DRAM interface)
mips_test.v             Top-level simulation tester
mt48lc8m16a2.v          DRAM simulation files
mux1x2.v                2-input mux (1 bit wide)
mux32x2.v               2-input mux (32 bits wide)
mux32x4.v               4-input mux (32 bits wide)
mux32x5.v               5-input mux (32 bits wide)
mux9x2.v                2-input mux (9 bits wide)
opcodes.v               Opcode constants for decoding
part1.v                 Boardram block for synthesis
reg_pc.v                PC register
reg256.v                256-bit register
reg3.v                  3-bit register
reg32.v                 32-bit register
reg7.v                  7-bit register
reg9.v                  9-bit register
regfile.v               Register file
reservation_station.v   Generic reservation station
rs_itype.v              I-type reservation station
rs_load.v               Load reservation station
rs_rtype.v              R-type reservation station
rs_shift.v              Shift reservation station
rs_store.v              Store reservation station
shifter.v               32-bit shifter
sramblock2048.v         SRAM block for synthesis/simulation
unit_i_type.v           I-type Functional Unit
unit_load.v             Load Functional Unit
unit_r_type.v           R-type Functional Unit
unit_shift.v            Shift Functional Unit
unit_store.v            Store Functional Unit
write_buffer.v          Write back buffer

Table 6: Table of Verilog files. (Double click icon to view source code)
Appendix IV: Test Files
Module Test Results
Here are the test files used to test individual modules. Note that logs are not provided
since changes made during top-level testing may have broken the individual module
tests. The top-level results are what really counts:
Module Tested          Test Fixture File
counter                (embedded icon)
fifo (read)            (embedded icon)
fifo (write)           (embedded icon)
fifo (random delay)    (embedded icon)
cache                  (embedded icon)
memory_arbiter         (embedded icon)
memory_control         (embedded icon)
regfile                (embedded icon)
reservation_station    (embedded icon)
cdb_arbiter            (embedded icon)
memory                 (embedded icon)
unit_load              (embedded icon)
unit_r_type            (embedded icon)
unit_store             (embedded icon)

Table 7: Table of modular test files. (Double click icon to view)
The cache test uses processor_lite.v and instmem.lite, modules designed to
fake a dumb processor in order to test the functionality of the cache.
Top-level Test Results
Here are the test files used and the associated transcript files of the test results for top-level testing. (Note: hardware-level tests passed with results similar to the simulation output.)
Test Name      Transcript Files    SPIM file / IO input
base           (embedded icon)     (from TA) / n/a
corner         (embedded icon)     (from TA) / n/a
diagnostic     (embedded icon)     (from TA) / n/a
linear         (embedded icon)     --
memio          (embedded icon)     (from TA) / n/a
quick_sort     (embedded icon)     (from TA) / n/a
verify         (embedded icon)     (from TA) / n/a
worm           (embedded icon)     mem contents file
Explanation of new SPIM files:
linear.s
Runs linear (bubble) sort using memory load/store as variables. Heavy
load/store usage. Tests cache fill/miss (sequential).
worm.s
This is a rigorous load/store dependency test. It creates a linked
list in memory and traverses the list until it reaches the end. The
address of the next element to load must itself be loaded from memory.
(We found some deadlocking bugs this way.)
Other test cases (logs got deleted):
Part1.s, Part2.s, Part3.s -
Exhaustive testing of basic correctness in executing
functions. (Original single file was too big to fit in the
boardramcreate blocks, so broken into 3 files).
rs_load.s
Tests load/store issues that can be a potential bug for
the Tomasulo architecture. This test was specifically designed
to check for bugs that we had not otherwise thought of.