Tomasulo Architecture: Out-of-Order Execution
Lab 7 Report

Names (alphabetical order): Jim Chen, Ray Juang, Ted Lee, Tim Lee (cs152-etjim, cs152-juangr, cs152-timclee, cs152-leet)
Group name: Alt-0165 (¥)

1. Abstract

The goal of this project was to design a processor based on an architecture of our choosing. We chose to build an out-of-order execution processor based on the Tomasulo algorithm. This architecture is very modular, provides a good base for optimizations, and allows execution to continue while some instructions, such as multiply/divide, take a long time to finish. This is very advantageous on systems with functional units that take many cycles to complete or on systems that have poor memory subsystems. In our case, we decided to build a write-through, direct-mapped 8KB cache composed of 8-word cache lines with a 4-entry write buffer. We were successful and ended up with a design that operates at 12.3 MHz with a CPI of 1.1 (roughly 1).

2. Division of Labor

The division of labor for this project was mainly a division in time. We set timelines for portions of the project to be completed and took shifts completing each module. Every member took part in the design of every module; however, each member took responsibility for delivering a tested and working module during the coding stage. The responsibilities were divided as follows:

Cache controller: cs152-leet
CDB arbiter: cs152-etjim
Memory controller: cs152-juangr
Memory arbiter: cs152-leet
Load/Store arbiter: cs152-timclee
Functional units: cs152-juangr
Reservation station: cs152-juangr
Register file: cs152-etjim
Fetch/Decoder: cs152-timclee
Table 1: Main components and the person responsible for ensuring their correctness

Our development was divided into the following stages:

Top-Level Planning: Paper design of the top-level goals and of what the overall design would look like.
Detailed-Level Planning: Paper design of the detailed components of the design.
Coded Implementation: Translation of the paper designs into Verilog modules.
Top-Level Structural Verification: Review of all modules connected together to ensure no signals or components were missing.
Component Verification and Modification: Individual testing of components to verify that they functioned correctly.
Incremental Testing Phase: Incremental tests to verify correct behaviour in the interaction between various modules.
Overall Testing Phase: Top-level testing of the overall design; debugging from MIPS code in simulation.
Timing Analysis/Synthesis: Substitution of synthesizable modules to verify the correctness of the synthesized board.
Optimizations: Optimizations to reduce the critical path, reduce CPI, and improve overall performance.
Table 2: Development stages and their descriptions

In summary, this breaks down into a Top-Level Planning stage, a Detailed-Level Component Design stage, a Testing phase, and an Optimization stage.

3. Detailed Strategy

3.1 Top-Level Planning

In this stage we planned out on paper what our top-level datapath would look like. This gave us insight into how individual components would work with the rest of the datapath.

3.1.1 Datapath

Our datapath is composed of three main parts: the Fetch/Issue Unit, the Functional Units, and the Register File. A typical instruction has the following lifetime: it is fetched, decoded, and issued by the Fetch/Issue Unit. Issuing consists of broadcasting a tag, opcode, and operand values to all functional units.
Issuing also puts the issued tag into the destination register's tag field in the register file. The register file constantly monitors the Common Data Bus (CDB), and the destination register waits for this tag; when it sees the tag on the CDB, it takes the result. The tag is a 9-bit field specifying the functional unit and reservation station (6 functional units, 3 reservation stations). The functional units take the issued instruction, resolve its dependencies by watching the CDB, and then put the instruction's result on the CDB. Putting a result on the CDB is done by sending a request to the CDB Arbiter, which broadcasts a single result to the rest of the functional units and the register file. The register file sees the tag on the CDB, takes the result from the CDB, stores it into the corresponding data register, and clears that register's tag.

3.1.2 Control

Our control is handled entirely by the Fetch/Issue Unit and by the structure of the datapath. Once an instruction is issued, the datapath resolves dependencies and hazards by itself and puts the result of the instruction into the destination register. Cases that cannot be handled by the datapath, such as lw/sw address ambiguity, result in stalls at issue. Our initial top-level design consisted of three functional units, as specified in the project specifications. However, we later decided to split the Integer functional unit into R-type, I-type, and Shift functional units.

Figure 1: Initial sketch of the top-level design

3.2 Detailed-Level Planning and Component Design

This stage consisted of planning each of the components in the top-level design in detail and coding them in Verilog. The following sections describe the functionality and design of each component.

3.2.1 The Fetch/Issue Unit

The fetcher is the same as that of an ordinary pipelined processor. It contains a register for the PC, which is updated from a mux that selects PC+4, the branch PC, a jump PC, or a jr target. The select logic for this mux is handled in a module called "branch magic," which waits (like a reservation station) for branch and jr results to be broadcast on the CDB. In fact, since we do not currently have branch prediction, a branch simply stays at issue until its dependencies are resolved, so this unit only needs the register file outputs as inputs. The fetcher fetches one instruction at a time from the instruction cache and stores it into an instruction register; the instruction in this register is the current instruction that will later be decoded and issued by the decode/issue unit. Below is a depiction of our planned fetch unit:

Figure 2: Fetch/decode block designs

In the decode/issue unit, the instruction is decoded for type to determine which functional unit to send it to, and the status of that unit's reservation stations is checked. If there is an open reservation station in the appropriate functional unit, the instruction is issued to it by specifying the correct tag on the Issue Bus. The tag is one-hot encoded, with the top 3 bits selecting the reservation station and the bottom 6 bits selecting the functional unit, so on any instruction issue exactly two bits of the tag are high: one for the reservation station and one for the functional unit. Any instruction that writes to $0 is considered a nop and gets an all-zero tag, so it does not take up any reservation station space. When a break is decoded, a signal is sent to the Fetcher telling it to stall; the Fetcher then stalls until a break-release signal is received from the top level. We decided to write the decode/issue unit's control logic directly in Verilog.
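To make the tag format concrete, the following is a minimal sketch of how an issue tag could be assembled in the decode/issue unit. The module and signal names are illustrative assumptions and do not correspond to our actual decode_issue_unit.v; only the encoding itself (3 one-hot reservation station bits above 6 one-hot functional unit bits, all zeros for writes to $0) is taken from the design above.

// Illustrative sketch only; names do not match decode_issue_unit.v.
// tag[8:6] selects the reservation station, tag[5:0] selects the functional unit.
module issue_tag_sketch (
    input  [2:0] rs_select,   // one-hot: which of the 3 reservation stations is free
    input  [5:0] fu_select,   // one-hot: which of the 6 functional units the instruction decodes to
    input  [4:0] dest_reg,    // destination register number from the instruction
    output [8:0] issue_tag    // tag driven onto the Issue Bus and written into the register file
);
    // Writes to $0 are treated as nops: an all-zero tag occupies no reservation station.
    wire writes_zero = (dest_reg == 5'd0);

    assign issue_tag = writes_zero ? 9'b0 : {rs_select, fu_select};
endmodule

Exactly two bits of a non-zero tag are high, which lets the reservation stations and the register file match CDB broadcasts with a simple equality comparison on the full 9-bit field.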
3.2.2 Register File

Our register file contains 32 32-bit data registers and 32 9-bit tag registers. When write enable is high, a tag register is written. Data registers can only be altered by values arriving on the CDB input. Register 0 and tag 0 are tied to ground, since that value is always supposed to be 0 and is therefore always up to date. Register 31 ($ra) has a special input, the PC, which is used when the register file receives the jal signal from decode; it gets this signal when decode sees a jal instruction. The one complication in the register file is that it must forward values from the CDB for reads: when an instruction is being issued at the same time that one of its operands' values is being broadcast on the CDB, the register file must forward the CDB's value onto the Issue Bus.

Figure 3: Register file design (note: does not include forwarding)

3.2.3 The Functional Units

The project specification required three functional units: a load unit, a store unit, and an integer unit. We decided to split the integer unit into three different units (R-type, I-type, and shift-type). This made each unit very easy to create, since each had a specialized task. Shift was split from the I-type and R-type units because we did not need a shifter for any other instructions and wanted to remove it from the critical path of the more common R-type and I-type instructions. We also wanted to put more traffic on the CDB to see whether we could handle adding more functional units, and we anticipated adding an extra functional unit (perhaps a multiply/divide unit), though that slot is currently unused. The load unit and store unit were more of a design challenge than the other units.

3.2.3.1 Reservation Station

A generalized reservation station was created that resolves dependencies by watching an incoming CDB and setting a flag high once its operand values have been disambiguated. It contains two value/tag registers for the two possible operands of an instruction. These registers are written when an instruction is issued to the station and then updated from the CDB whenever a tag is non-zero. The station also has a busy register, which is high while the reservation station is in use; it is set when an instruction is issued to that station. Finally, the reservation station has an op register which holds the operation to be performed on the operands; for example, the op register in the R-type station selects an add, a sub, an xor, and so on.

Figure 4: Design of a generic reservation station

From this generalized reservation station, we created specialized reservation stations for each of the functional units. Each specialized station has an execution unit that performs the computation on the operands. We decided to put the actual execution here rather than in the functional unit because it is not in our critical path and we had plenty of hardware to spare; if we had put it in the path from the station to the CDB, an asynchronous CDB plus a forwarding case could have become our critical path. From these specialized reservation stations we built our functional units. Each functional unit contains three reservation stations, logic to decide which station gets to request the CDB, and logic that passes parsed values from the Issue Bus to the individual reservation stations.
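The following is a simplified sketch of one operand slot of such a reservation station. The names and port list are assumptions and do not match our actual reservation_station.v; it only illustrates the mechanism described above: the value/tag pair is loaded at issue, the slot snoops the CDB so that a matching broadcast supplies the value and clears the tag, and the operand is ready once its tag is zero.

// Illustrative sketch of one operand slot; names do not match reservation_station.v.
module rs_operand_sketch (
    input         clk,
    input         issue,         // load value/tag from the Issue Bus this cycle
    input  [31:0] issue_value,
    input  [8:0]  issue_tag,     // an all-zero tag means the value is already valid
    input  [8:0]  cdb_tag,       // tag currently being broadcast on the CDB
    input  [31:0] cdb_value,     // result currently being broadcast on the CDB
    output [31:0] value,
    output        ready          // high once this operand's dependency is resolved
);
    reg [31:0] value_reg;
    reg [8:0]  tag_reg;

    always @(posedge clk) begin
        if (issue) begin
            // A new instruction was issued to this station: take the operand
            // value, or the tag of the instruction that will produce it.
            value_reg <= issue_value;
            tag_reg   <= issue_tag;
        end else if (tag_reg != 9'b0 && cdb_tag == tag_reg) begin
            // Snoop the CDB: a matching broadcast supplies the operand and clears the tag.
            value_reg <= cdb_value;
            tag_reg   <= 9'b0;
        end
    end

    assign value = value_reg;
    assign ready = (tag_reg == 9'b0);
endmodule

The case where an operand's value is broadcast in the same cycle the instruction is issued does not need to be handled here, because the register file already forwards the CDB value onto the Issue Bus, as described in section 3.2.2.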
3.2.3.2 Integer Unit (later divided into R-type, I-type, and Shift units)

The design of the integer unit was used as a template for the design of the other functional units. The prioritization of the reservation station outputs is accomplished with tri-state muxes: only one reservation station is allowed to drive the common bus at any given time. A reservation station requests output only when it contains valid data and all of its data dependencies are resolved (i.e., both of its tags are 0). If more than one reservation station requests output, priority is given to the first reservation station, then the second, then the third. If any reservation station requests output, a general request signal is sent to an arbiter asking for the unit's result to be placed onto the CDB. This priority scheme is reused in the other functional units.

Figure 5: Design of an integer unit (note that the "+" block can be substituted with an ALU)

3.2.3.3 Load/Store Unit

The load and store functional units were the most complicated in terms of interaction and design. There were issues of data being committed to the same memory address in the wrong order, and issues of possible incorrect reads from uncommitted stores. To deal with such hazards, we established a set of paradigms for handling load and store instructions:

1. When an address is unresolved, we halt issue of any further instructions.
2. All loads must be performed before stores go to memory.
3. All loads must check the store reservation stations before loading into their own reservation stations.
4. All stores must check previous stores; if a common address is found, the newer store overwrites the value.

These paradigms were developed to handle potential memory Read-After-Write (RAW), Write-After-Read (WAR), and Write-After-Write (WAW) hazards. For example:

1: lw $2, 0($1)
2: sw $3, 0($1)
3: lw $4, 0($1)
4: sw $5, 0($1)
5: sw $6, 0($1)

A WAR hazard could be introduced if the load (line 1) and the store (line 2) were both issued but the store executed before the load. This is resolved by forcing loads to occur before stores. If loads did not check previously issued stores, a RAW hazard could be introduced: in the same example, if lines 2 and 3 are issued but the load in line 3 executes before the store in line 2, the wrong value would be read. This is resolved by having the load unit check the addresses of previous stores and grab the data/tag of the matching store. A WAW hazard could be introduced if the store in line 4 executed after the store in line 5. This is resolved by having stores check previous stores and replace the data of the reservation station with the matching address. If we did not stall when an address is unresolved, it would be impossible to perform the address matching described above.

Figure 6 provides a detailed view of the implementation of the load unit; it shows the inputs into the load unit's reservation stations. Figure 7 provides a detailed view of the implementation of the store unit.

Figure 6: Design of a load unit (note: design does not include the output to the load/store arbiter)
Figure 7: Design of a store unit (note: design does not include the output to the load/store arbiter)

3.2.4 Common Data Bus Arbiter (CDB Arbiter)

We created a synchronous CDB arbiter for simplicity. Requests are latched in every cycle, and the request with the highest priority is broadcast to the CDB. The granted request also enables a mux that passes a 3-bit done signal back to the reservation stations; these done bits are used as resets for the busy registers in the reservation stations.

Figure 8: Design of the CDB Arbiter
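Both the selection among a unit's three stations and the CDB arbiter reduce to the same fixed-priority grant. The sketch below shows that grant in isolation, registered the way the synchronous arbiter latches requests; it is a simplification with assumed names (the real cdb_arbiter.v also muxes the winning result/tag onto the CDB and drives the done bits), not the actual code.

// Illustrative sketch of a synchronous fixed-priority grant; names do not
// match cdb_arbiter.v. request[0] has the highest priority, request[5] the lowest.
module priority_grant_sketch (
    input            clk,
    input            reset,
    input      [5:0] request,  // one request line per functional unit
    output reg [5:0] grant     // registered one-hot grant: the winner is broadcast next cycle
);
    wire [5:0] next_grant =
        request[0] ? 6'b000001 :
        request[1] ? 6'b000010 :
        request[2] ? 6'b000100 :
        request[3] ? 6'b001000 :
        request[4] ? 6'b010000 :
        request[5] ? 6'b100000 : 6'b000000;

    always @(posedge clk) begin
        if (reset) grant <= 6'b0;
        else       grant <= next_grant;   // requests are sampled every cycle
    end
endmodule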
3.2.5 Load/Store Arbiter

The load/store arbiter was designed as a synchronous arbiter that registers requests and gives priority to loads. This decision was made to preserve correctness (refer to the detailed strategy for the load/store unit). The design of the load/store arbiter is similar to the CDB arbiter; it differs only in that it can pass a request only when the cache controller is not asserting wait, since the cache controller ignores all requests while wait is high.

Figure 9: Top-level view of the Load/Store Arbiter

3.2.6 Cache and Cache Controller

We decided to keep things simple this time and concentrate on finishing the Tomasulo design before optimizing the cache. For this reason, we chose to implement a direct-mapped cache with a write-through policy. Using a write-through policy made things a lot simpler, since there was no need to consider cases where a cache line is dirty.

The cache is composed of 2 SRAM blocks, a 256-bit register, 4 muxes, a comparator, and a write buffer. One of the SRAM blocks stores the cache data, while the other stores the address tags. The data SRAM block takes a 12-bit address input and a 32-bit data input. The data going into the cache is first muxed among three sources: data from the DRAM, data from the processor, or data from the write buffer. The tag SRAM block was assumed to be asynchronous because we wanted to be able to complete a cache hit in one cycle: with an asynchronous tag file we can compare the tags and return the data from the cache all within one clock cycle. The 12-bit comparator compares the incoming tag with the tag stored for that line of the cache, and the 256-bit register keeps track of the valid bits for the 256 cache lines. The output to the processor is muxed between the data SRAM block and the 4 outputs of the write buffer. The last two muxes choose the address and data being written out to the DRAM.

Figure 10: Top-level view of the cache datapath

The controller for the cache was also much simpler than our controller from Lab 6. The new controller is basically a state machine with 6 states: START, READ, READ_WAIT, WRITE, WRITE_WAIT, and FLUSH. The START state is the largest because it has to handle flushing the write buffer, processing a read or write request, and clearing out the write buffer. In the START state, a read request is processed as follows (a sketch of this check appears at the end of this subsection):

1. Check write buffer slots 1, 2, 3, and 4 for the data by comparing addresses.
2. If the word is not in the write buffer, check whether it is in the cache by checking the valid bit and comparing the tags.
3. Otherwise, go to the READ_WAIT state to request the correct cache line from DRAM.

The two wait states, READ_WAIT and WRITE_WAIT, are used to set up the address and data to send to the DRAM, because that is how our memory controller expects requests; they also make sure that the wait signal is low before making the DRAM access request. In the READ state, we wait for the data-ready signal from the DRAM and then start counting and writing the 8 data entries into the cache. We also check whether any of the data in the write buffer can be written in at the same time. The READ state controls the mux that selects the data to be stored into the cache block. The counter not only keeps track of how many entries we have read in, but also serves as the bottom 3 bits of the address fed into the cache block to store the data. In the WRITE state, we write one word from one of the write buffer slots into DRAM and then transition back to the START state. In the FLUSH state, we essentially perform the WRITE state 4 times, writing a slot out to DRAM only if it is valid.
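As promised above, here is a minimal sketch of the read hit check performed in the START state. The names and the exact bit slicing (a 12-bit tag taken from the upper address bits, with an 8-bit index selecting one of the 256 lines) are assumptions for illustration and do not match our actual cache_controller.v; the priority of the checks (write buffer first, then the cache, otherwise a miss) is the part taken from the design above.

// Illustrative sketch of the START-state read check; names and bit slicing
// are assumptions, not the actual cache_controller.v code.
module cache_read_check_sketch (
    input  [31:0] addr,                              // word-aligned processor read address
    // The four write-buffer slots, flattened into separate ports.
    input  [31:0] wb_addr0, wb_addr1, wb_addr2, wb_addr3,
    input  [3:0]  wb_valid,
    // Outputs of the asynchronous tag SRAM and of the 256-bit valid register
    // for the line selected by the index bits of addr.
    input  [11:0] line_tag,
    input         line_valid,
    output [3:0]  wb_hit,                            // which write-buffer slot, if any, holds the word
    output        cache_hit,                         // valid line with a matching tag
    output        miss                               // neither: go to READ_WAIT and fetch the line from DRAM
);
    // Step 1: check the write buffer by comparing full word addresses.
    assign wb_hit[0] = wb_valid[0] && (wb_addr0 == addr);
    assign wb_hit[1] = wb_valid[1] && (wb_addr1 == addr);
    assign wb_hit[2] = wb_valid[2] && (wb_addr2 == addr);
    assign wb_hit[3] = wb_valid[3] && (wb_addr3 == addr);

    // Step 2: check the cache: valid bit set and stored tag equal to the
    // incoming tag bits (a 12-bit tag slice is assumed here).
    assign cache_hit = line_valid && (line_tag == addr[24:13]);

    // Step 3: otherwise it is a miss, and the controller requests the line from DRAM.
    assign miss = ~(|wb_hit) && ~cache_hit;
endmodule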
3.2.7 Memory Controller

The design of the memory controller was taken directly from Lab 6. Since the DRAM interface was already functioning correctly, we simply replaced the write-burst states with a sequence of single-write states. Instead of completing a full 8-word write burst, the memory controller performs a single write by sending a burst-terminate command after the first write completes.

3.3 Testing Methodology

We took an incremental approach to testing along with individual component testing. We began by testing only the components necessary for a working datapath: the Issue Unit, the Register File, one Functional Unit (I-type), and the CDB Arbiter. These units were initially tested for basic functionality and then tested more thoroughly as they were put together. With these units it was possible to issue, execute, and commit instructions by using an SRAM block to supply instructions to the Fetcher. This basic datapath supported normal execution as well as execution that would have required forwarding in a normal linear pipeline architecture. For example, code such as

addiu $1, $1, 1
addiu $1, $1, 1
addiu $1, $1, 1
addiu $1, $1, 1

could be executed and successfully committed.

After this was tested and debugged, we added the two other integer units, Shift and R-type. This allowed us to test the Common Data Bus more thoroughly, since it now had to deal with more than one incoming request. A problem we ran into when testing this datapath was that each instruction executed in one cycle, so there was no stalling and the CDB never had more than one incoming request. To test the CDB more completely at this stage, we forced the CDB Arbiter to reject all requests for some number of cycles before beginning normal arbitration, so that when the arbiter began arbitrating it already had several pending requests. After this was tested, we were confident that execution without memory was done.

The next step was to add memory. In this phase of testing, we added the Load and Store Functional Units and attached them to an SRAM block used as data memory. This phase was aimed solely at the Load and Store Units, as they were the most complicated functional units to design and implement. Unlike the integer functional units, the Load and Store Units also had to interact directly with each other and with the Load/Store Arbiter. These tests therefore focused on behaviors such as a store to the same address overwriting an older sw, and loads looking in the Store Unit's reservation stations for values before going to memory. The Load/Store Arbiter was also tested with the Load/Store Units at this point, to make sure it could arbitrate simultaneous requests from the Store and Load Units and then correctly pass done bits back to the stations. This working datapath brought us to the same level of functionality as Lab 5: we could execute any of the required MIPS instructions, but with only separate SRAM blocks serving as memory.

The next phase of testing added a Data Cache attached to the DRAM controller and the simulated DRAM modules. The instruction memory remained an SRAM block.
This addition was made to test the interaction between the Data Cache and the rest of the processor. The main new functionality tested at this level was the Load/Store Arbiter waiting for the cache to lower its wait signal before passing done (and, for loads, a value) back to the units. This is the integration point where we had the most trouble in Lab 6, so we slowed the pace of testing and put in fewer pieces per testing stage.

Next, we added an Instruction Cache to the datapath to test the Memory Arbiter (which arbitrates between Instruction and Data Cache requests) and to test the Fetcher's ability to cope with instruction cache misses. Although the Fetcher was not yet getting real instructions from the Instruction Cache (because we had not yet inserted Level 0 boot code), it was still feeding address requests into the Instruction Cache and receiving a stall signal at the end of cache blocks. We were then ready to integrate the last pieces: Level 0 boot and hooking the Instruction Cache fully up to the Fetcher. Finally, we put the instmem SRAM (the instruction source) into the Data Cache for use as a Level 0 data source, and we put the boot ROM that contains our boot instructions into the Instruction Cache. This is our final processor with all components in place. At this point, we test by putting our code in instmem.contents (the instruction source), running through the bootloader, and then actually running the program code. Testing was initially done with only one instruction block in the Level 0 source format with the initial address at 0. Then we tried changing the initial address to a value other than 0, such as 0x80000000. Finally, we moved on to multiple Level 0 blocks (e.g., quicksort and its data block).

3.4 Optimizations

After the initial testing of our processor was thought to be finished, we began the design and implementation of the multi-way issue (2-way) enhancement while further thorough tests were in progress. The following additions had to be made:

1. Add a second CDB (and arbiter) so that we can commit two results at once; multi-way issue has little gain with single commit.
2. Modify the reservation stations to take input from both CDBs to resolve dependencies.
3. Modify the Functional Units to take input from two Issue Buses, so that two reservation stations in the same functional unit can be issued to in one cycle.
4. Modify the register file to read both CDBs to get committed results, and also to:
   - read both Issue Buses, so that up to two tags can be written in;
   - output 4 operand values/tags (so we can issue two instructions per cycle);
   - take two different jal signals, depending on which instruction is the jal.
5. Redo the Fetch/Issue Unit so that it fetches two instructions at once and issues two at once.
6. Stripe the cache (use multiple SRAM blocks) so that we can read two instructions per cycle from it.

The first modification was a simple one. To make sure that two distinct requests would get onto the two CDBs, we decided to simply invert the priority of the first CDB arbiter and use that as the second CDB arbiter. All units request to both arbiters, and the two arbiters broadcast results to all reservation stations. The reservation stations then update from either of the CDBs. There is no conflict in this update unless both arbiters output the same result/tag, in which case it does not matter which CDB the reservation station updates from.
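The sketch below illustrates this inverted-priority idea using the priority_grant_sketch module from the CDB arbiter section: the second arbiter is the same grant logic fed a bit-reversed request vector, so the two arbiters prefer opposite ends of the requester list and pick two distinct requests whenever two or more are pending. As before, the names are assumptions and this is not our actual cdb_arbiter.v.

// Illustrative sketch of the second CDB arbiter as the first arbiter with
// inverted priority; names do not match cdb_arbiter.v.
module cdb_arbiter2_sketch (
    input        clk,
    input        reset,
    input  [5:0] request,
    output [5:0] grant2
);
    // Reverse the request bits so that request[5] becomes the highest priority.
    wire [5:0] request_rev = {request[0], request[1], request[2],
                              request[3], request[4], request[5]};
    wire [5:0] grant_rev;

    priority_grant_sketch second_arbiter (
        .clk(clk), .reset(reset),
        .request(request_rev),
        .grant(grant_rev)
    );

    // Undo the bit reversal so grant2 lines up with the original request order.
    assign grant2 = {grant_rev[0], grant_rev[1], grant_rev[2],
                     grant_rev[3], grant_rev[4], grant_rev[5]};
endmodule

For example, if units 0 and 1 both request in the same cycle, the first arbiter grants unit 0 and this one grants unit 1, so two distinct results can be committed at once.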
Additions 1-4 were straightforward in both design and implementation. Addition 5 took the most time to design. For simplicity of the cache, we decided to always request even (double-word-aligned) addresses, which ensures that our two-word requests never cross cache-line boundaries. This adds complexity to the fetcher, since we could branch to an odd instruction and would then have to invalidate the fetched even instruction (this alignment rule is sketched at the end of this section). We made this tradeoff because of our previous troubles with complicated cache designs.

The complications occur with branches and stalls. For branches, the Decoder sends back two valid signals to the Fetcher indicating which of the next two instructions are valid. This lets us control which of the fetched instructions are actually used. For example, if the current even instruction is a taken branch, we invalidate both fetched instructions, because they should be jumped over. If the current odd instruction is a taken branch, we invalidate only the odd fetched instruction, since the even one is in the branch delay slot. The Fetcher (in the branch-magic block) is also modified to keep track of whether we branched to an odd word; if we did, the even fetched word is invalidated because it lies before the target of the branch. The Fetcher also has to mux the possible branch targets from beq/j/jr-type instructions for both current instructions; a one-bit signal from the Decoder tells the Fetcher which instruction is the branch. For breaks, the Decoder tells the Fetcher when a break is decoded for each instruction. The Fetcher then stalls until there is a break release. When it receives the break release, it checks whether the current even instruction is a break; if it is, we release it by resetting the instruction register (which sends out all 0s, a nop). If the even instruction is not a break, the odd one is checked and released/reset if necessary.

Addition 6 was also non-trivial. We had noticed that we were getting many instruction cache stalls because our memory bandwidth was one word per cycle at best (ignoring request latency). We therefore decided to increase memory bandwidth to the cache and provide striping at the same time: instead of reading in one word per cycle from the DRAM controller, we would read in multiple words per cycle. Our 2-way issue design only required two SRAMs to supply two instructions per cycle, but we decided it was actually easier to implement an 8-way striped design so that we could read in a whole cache line in one cycle, increasing our memory bandwidth 8 times. Incidentally, this addition would also make a stream buffer very attractive, since we would then have enough memory bandwidth to make it worthwhile. With the original memory bandwidth of one word per cycle, a stream buffer would not help in any program where the data cache had any significant demand (1 request per 8 cycles); the memory would still (speculating) be sending data to the processor constantly, but a larger fraction of that data would go to the instruction cache/stream buffer. With a striped cache design, we were thus able to supply two instructions per cycle to the processor.

All of these additions were made, but they are not fully functional/tested due to time constraints. See the evaluation section for the potential performance impact these additions were projected to have.
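The even-alignment rule referred to above can be summarized in a few lines. This is a minimal sketch with assumed names (it is not our actual fetcher/branch-magic code): the fetch address is forced onto an even word boundary, and if the branch target itself is the odd word of the pair, the even word that comes back is before the target and is marked invalid.

// Illustrative sketch of the even-aligned two-word fetch; names are
// assumptions and do not match the actual fetcher/branch_magic modules.
module fetch_align_sketch (
    input  [31:0] next_pc,       // PC selected by the branch-magic mux
    output [31:0] fetch_addr,    // address sent to the instruction cache
    output        kill_even      // branched to an odd word: drop the even instruction
);
    // Always request the even (double-word-aligned) address, so a
    // two-instruction fetch never crosses a cache-line boundary.
    assign fetch_addr = {next_pc[31:3], 3'b000};

    // If the target is the odd word of the pair, the even word returned by
    // the cache precedes the branch target and must be invalidated.
    assign kill_even = next_pc[2];
endmodule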
4 Results

4.1 Calculated Delays

The calculated minimum post-place-and-route cycle period for our entire processor was 81.19 ns, as reported by the Xilinx Timing Analyzer. This yields a maximum clock rate of 12.3 MHz. We tried to get a synthesized version of our processor working on the board but failed to get any results. From the LED outputs, we believe there are setup/hold-time violations occurring in the cache because of clocking interface issues between the DRAM controller and the cache controller.

Our critical path is as follows (highlighted in red):

Figure 11: Critical path, highlighted in red, with a minimum delay of 81.19 ns

The critical-path case occurs when a load instruction is issued at the same time as a CDB broadcast that updates the address register for that load. When this occurs, the register file forwards the updated CDB data to the fetch/issue unit, which then passes it down to the load unit. The load unit is part of the critical path because of the large fan-in delay from its various muxes.

4.2 CPI Analysis

Our final out-of-order execution processor has a CPI of approximately 1 if we ignore instruction and data cache misses. This compares with our Lab 4 (single-cycle) and Lab 5 (5-stage pipeline) designs, which each had a CPI of 1 (again ignoring memory effects). The effect of memory is difficult for us to evaluate, since we never had a working Lab 6 (5-stage pipeline with DRAM as memory), so we can only speculate about how our Lab 6 design would have performed. Lab 6, assuming the same cache architecture, would perform identically to our final processor except in the case of a data cache miss. The Tomasulo algorithm allows us to continue execution across data cache misses, since the issue stage is independent of the execution and writeback stages. However, since our design does not include branch or jump prediction, jumps and branches that depend on loaded values still get stalled. The performance gain over Lab 6 would therefore be highly dependent on the program being run. If the program had many loads and stores with no branch dependencies on them, as in a sorting program that keeps its status information (how to perform the next step of the sort) in registers, our final project would do very well compared to the Lab 6 project, which would stall frequently on data cache misses. Assuming a cache miss penalty of 10 cycles, our final project would still have a CPI of 1 on such a program, whereas Lab 6 would have a CPI of about 1.15 (assuming the loop fits in an 8-word cache block, one memory access per loop iteration means 1 memory access per 8 instructions, which is one 10-cycle stall per 64 instructions, giving CPI = 74/64; the arithmetic is written out below).
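Writing that estimate out explicitly (the 10-cycle miss penalty and the one-access-per-iteration loop are the stated assumptions, not measured values):

\[
\text{misses per instruction} = \frac{1\ \text{access}}{8\ \text{instructions}} \times \frac{1\ \text{miss}}{8\ \text{accesses}} = \frac{1}{64},
\qquad
\text{CPI}_{\text{Lab 6}} \approx \frac{64 \cdot 1 + 1 \cdot 10}{64} = \frac{74}{64} \approx 1.156 .
\]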
4.3 Modular Test Results

All modular test cases pass. Note: to avoid regression testing, the test cases were not updated when incorrect assumptions in them were found, so they probably do not pass after the many fixes made to individual components, but the overall functionality is what counts.

4.4 Top-Level Test Results

All top-level test cases passed. The demonstration programs provided by the TAs passed as well, with the following statistics (refer to the appendix for test results):

dumb: 1092176 cycles, result 0x199 (input 0x50)
dumber: 650 cycles, result 0x25 / 0x00
quicksort2: 6500 cycles, all numbers sorted
Table 3: Demonstration top-level test results

4.5 Board Usage Statistics

The following is a report of the resource usage on our board after synthesis:

Number of External GCLKIOBs: 1 out of 4 (25%)
Number of External IOBs: 174 out of 512 (33%)
Number of LOCed External IOBs: 154 out of 174 (88%)
Number of BLOCKRAMs: 16 out of 160 (10%)
Number of SLICEs: 11206 out of 19200 (58%)
Number of GCLKs: 3 out of 4 (75%)
Number of TBUFs: 7690 out of 19520 (39%)

5 Conclusion

This lab was relatively successful. We designed an out-of-order execution processor and got it to work in simulation; we were unable to get it functioning on the board because of a lack of time. Compared to the previous labs, team effort was at its best. The initial top-level planning on paper greatly increased efficiency and reduced the issues that would otherwise have arisen from incorrect expectations.

Finally, the design we built lacked many of the enhancements that make the Tomasulo architecture powerful, and there were many things we felt we could have added given more time. For example, branch prediction, if added to our final project, would eliminate all issue stalling, so productive cycles would only be lost on branch mispredictions. Another major enhancement would have been multi-way issue, which would create a superscalar, out-of-order execution processor. This enhancement would be fairly easy to add, since all of the data dependencies would be resolved in the usual manner; the complications are handling instruction fetching and issuing, and creating a wider bus from the cache. Adding this option would have halved our base CPI to 0.5, and combined with branch prediction and an instruction prefetch unit (stream buffer) it would (speculating) have yielded a final CPI between 0.5 and 0.7.

Appendix I: Notebooks

Notebook entries are as follows (double-click an icon to view the journal file):

Jim Chen: 138.63 hrs
Ray Juang: 143.20 hrs
Ted Lee: 116.22 hrs
Tim Lee: 159.67 hrs
Table 4: Table of journal files and total hours spent (double-click icon to view)

*NOTE: The notebook entries were generated by an auto-logging journal program written by cs152-juangr. See the journal directory for the executable.

Appendix II: Schematics

Schematic files used in the design are as follows:

cache
cache_datapath
fetch_branch_unit
fetcher
top_level_driver
Table 5: Table of schematic files (double-click icon to view)
Appendix III: Verilog Files

Verilog design files used in the lab are as follows (double-click an icon to view the source code):

alu.v: ALU module
branch_magic.v: Branch magic unit
bts16.v: Tristate buffer (16 bits wide)
bts32.v: Tristate buffer (32 bits wide)
bts40.v: Tristate buffer (40 bits wide)
bts41.v: Tristate buffer (41 bits wide)
bts7.v: Tristate buffer (7 bits wide)
bts9.v: Tristate buffer (9 bits wide)
cache_controller.v: Cache controller
cdb_arbiter.v: CDB Arbiter
comp12.v: 12-bit comparator
comp23.v: 23-bit comparator
comp32.v: 32-bit comparator
comp9.v: 9-bit comparator
constants.v: Constants used for standardization
counter.v: Parameterizable counters
dcache.v: Top-level data cache wrapper
decode_issue_unit.v: Decode/issue unit
extender.v: 32-bit sign extender
fifo.v: FIFO queue
global_counter.v: Global counter (for cycle count)
hugeblock.v: Fake memory block for simulation
icache.v: Top-level instruction cache wrapper
io_block.v: I/O block
issue_monitor.v: Issue monitor (disassembler)
jump_combine.v: Bus combiner (schematics hack)
level0.v: Level 0 boot ROM
load_store_arbiter.v: Load/store arbiter
memory.v: Top-level memory wrapper
memory_arbiter.v: Memory arbiter
memory_control.v: DRAM controller (DRAM interface)
mips_test.v: Top-level simulation tester
mt48lc8m16a2.v: DRAM simulation files
mux1x2.v: 2-input mux (1 bit wide)
mux32x2.v: 2-input mux (32 bits wide)
mux32x4.v: 4-input mux (32 bits wide)
mux32x5.v: 5-input mux (32 bits wide)
mux9x2.v: 2-input mux (9 bits wide)
opcodes.v: Opcode constants for decoding
part1.v: Boardram block for synthesis
reg_pc.v: PC register
reg256.v: 256-bit register
reg3.v: 3-bit register
reg32.v: 32-bit register
reg7.v: 7-bit register
reg9.v: 9-bit register
regfile.v: Register file
reservation_station.v: Generic reservation station
rs_itype.v: I-type reservation station
rs_load.v: Load reservation station
rs_rtype.v: R-type reservation station
rs_shift.v: Shift reservation station
rs_store.v: Store reservation station
shifter.v: 32-bit shifter
sramblock2048.v: SRAM block for synthesis/simulation
unit_i_type.v: I-type functional unit
unit_load.v: Load functional unit
unit_r_type.v: R-type functional unit
unit_shift.v: Shift functional unit
unit_store.v: Store functional unit
write_buffer.v: Write back buffer
Table 6: Table of Verilog files (double-click icon to view source code)

Appendix IV: Test Files

Module Test Results

Here are the test fixture files used to test the individual modules (embedded; double-click an icon to view). Note that logs are not provided, since changes made during top-level testing may have broken the individual module tests; the top-level results are what really count. The modules tested are:

counter
fifo (read)
fifo (write)
fifo (random delay)
cache
memory_arbiter
memory_control
regfile
reservation_station
cdb_arbiter
memory
unit_load
unit_r_type
unit_store
Table 7: Table of modular test files (double-click icon to view)

The cache test (processor_lite.v and instmem.lite) uses modules designed to fake a dumb processor in order to test the functionality of the cache.
Top-Level Test Results

Here are the test files used and the associated transcript files of the test results for top-level testing. (Note: hardware-level tests passed with results similar to the simulation output. Transcript files are embedded; double-click an icon to view.)

Test name: SPIM file / IO input
base (from TA): n/a
corner (from TA): n/a
diagnostic (from TA): n/a
linear: --
memio (from TA): n/a
quick_sort (from TA): n/a
verify (from TA): n/a
worm: mem contents file

Explanation of the new SPIM files:

linear.s: Runs a linear (bubble) sort using memory loads/stores for its variables. Heavy load/store usage; tests sequential cache fills/misses.
worm.s: A rigorous load/store dependency checker. It creates a linked list in memory and traverses the list until it reaches the end, so the address of the next load must itself be loaded from memory. (We found some deadlocking bugs this way.)

Other test cases (logs were deleted):

Part1.s, Part2.s, Part3.s: Exhaustive testing of basic correctness in executing functions. (The original single file was too big to fit in the boardramcreate blocks, so it was broken into 3 files.)
rs_load.s: Tests load/store issues that are potential bugs for a Tomasulo architecture. This test was specifically designed to check for bugs we had not thought of.