The Microarchitecture Level (1)

CE 454
Computer Architecture
Lecture 8
Ahmed Ezzat
The Microarchitecture Level, Ch. 4.4, 4.5
Outline

• Design of the Microarchitecture Level
– Speed vs Cost
– Reducing Execution Path Length
– Instruction Prefetching Design: The Mic-2
– Pipelined Design: The Mic-3
– Seven-Stage Pipeline Design: The Mic-4
• Improving Performance
– Cache Memory
– Branch Prediction
– Out-of-Order Execution and Register Renaming
– Speculative Execution
• Reading Assignment: Examples of the Microarchitecture Level
Design of the Microarchitecture Level:
Speed vs Cost

• Simple machines are slow, and fast machines are complex
• Speed improvements come from better organization as well as from faster technology; neither can be ignored
• Ways to make faster machines:
– Reduce the number of clock cycles needed to execute an instruction (known as the path length)
– Make the clock cycle shorter
– Overlap the execution of instructions, e.g., instruction pipelining
• Ways to measure cost:
– Count of components, transistors, etc.
– The area (real estate) required on the IC is more important
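The three speed levers above can be related through the classic execution-time equation (time = instruction count × cycles per instruction × cycle time). A minimal sketch; all numbers are made up for illustration:

```python
# Sketch: execution time = instruction count x cycles per instruction x cycle time.
# The numbers below are illustrative, not from the slides.
def exec_time_ns(n_instr, cpi, cycle_ns):
    return n_instr * cpi * cycle_ns

base         = exec_time_ns(1_000_000, 4, 10)  # baseline machine: 40 ms
shorter_path = exec_time_ns(1_000_000, 3, 10)  # fewer cycles per instruction
faster_clock = exec_time_ns(1_000_000, 4, 8)   # shorter clock cycle
assert shorter_path < base and faster_clock < base
```

Pipelining attacks the same product a third way: it lowers the *effective* CPI by overlapping instructions rather than shortening any single one.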
Design of the Microarchitecture Level:
Reducing Execution Path Length

• Mic-1 is a simple CPU with minimal hardware (fewer than 5,000 transistors) + a control store (ROM) + main memory (RAM)
• IJVM was implemented in microcode with little hardware. Now let us look for a faster alternative
• The POP instruction above costs 4 clock cycles (3 microinstructions + 1 for the main loop)
Design of the Microarchitecture Level:
Reducing Execution Path Length
Merging the Interpreter Loop with the Microcode

• Merge the interpreter loop with the microcode
– The main-loop microinstruction can be overlapped with the previous instruction
– When the ALU is not used in POP2, use it there; as a result, POP's cost is reduced to 3 clock cycles
– Dead cycles, where the ALU is not used, are not uncommon, so merging Main1 into the end of each microinstruction sequence is worth doing
Design of the Microarchitecture Level:
Reducing Execution Path Length
Three-Bus Architecture

• Using the Mic-1 architecture, let us revisit the ILOAD instruction (push a local variable onto the stack)
• Have two input buses, A and B: any two registers can be added in one cycle
Design of the Microarchitecture Level:
Instruction Prefetching Design – The Mic-2
Instruction Fetch Unit (IFU)

• Execution loop:
– The PC is passed through the ALU and incremented
– The PC is used to fetch the next byte of the instruction
– Operands are read from memory
– Operands are written to memory
– The ALU computes and stores the result
• The ALU intervenes in instruction fetching (fetch one byte at a time, then assemble) – this ties up the ALU at 1 cycle/byte:
– Have a separate Instruction Fetch Unit (IFU) to:
· Increment the PC
· Fetch bytes
· Assemble 8- and 16-bit operands
Design of the Microarchitecture Level:
Instruction Prefetching Design – The Mic-2
Instruction Fetch Unit (IFU)

• Two ways for the IFU to work:
– The IFU interprets each opcode, fetches any additional fields, and assembles them in a register for use by the main execution unit (ALU)
– Always fetch the next 8- or 16-bit operand regardless of whether it is used – the design shown on the next page
• Use two MBRs (MBR1 holds the oldest byte of the shift register; MBR2 holds the two oldest bytes):
– The IFU automatically senses when MBR1 is read
– It then reads the next byte into MBR1
– When MBR1 is read, the shift register shifts 1 byte to the right
– When MBR2 is read, it is reloaded with the next 2 bytes
– The IFU has its own IMAR, used to address memory when a new word is needed
Design of the Microarchitecture Level:
Instruction Prefetching Design – The Mic-2
Instruction Fetch Unit (IFU)

(Figure: the IFU for the Mic-2)
Design of the Microarchitecture Level:
Instruction Prefetching Design – The Mic-2
The Whole Design

(Figure: the complete Mic-2 data path)
Design of the Microarchitecture Level:
Instruction Prefetching Design – The Mic-2
Summary

• Mic-2 is an enhanced version of Mic-1
– Eliminates the main loop entirely
– Avoids tying up the ALU incrementing the PC
– Reduces the path length whenever a 16-bit index or offset is calculated – no need to assemble it in H
• Mic-2 improves some instructions more than others. For example, it reduces:
– LDC_W from 9 to 3 microinstructions
– ILOAD from 6 to 3 microinstructions
– SWAP from 8 to 6 microinstructions
– IADD from 4 to 3 microinstructions
Design of the Microarchitecture Level:
Pipelined Design – The Mic-3

• Mic-2 is faster than Mic-1, with only the little extra real estate introduced by the IFU
• Reducing the cycle time is tied to the technology used. How about exploiting parallelism? Mic-2 is highly sequential, except for the IFU!
• Major components of the data path cycle:
– Driving the selected registers onto A and B
– The ALU and shifter doing their work
– Driving the results back to the registers and storing them
• Can introduce latches to partition the buses
– The parts then operate independently
– Why?
· Can speed up the clock, because the maximum delay is less
· Can use every part during every subcycle
Design of the Microarchitecture Level:
Pipelined Design – The Mic-3
Three-Bus Architecture with Three Latches

• A latch is inserted in the middle of each bus
• In effect, the latches partition the data path into 3 distinct parts that can operate independently (the Mic-3)
• Each subcycle is about 1/3 of the original cycle length, hence the clock speed can triple
• Previously, the ALU was idle during subcycles 1 and 3. Now the ALU can be used on every subcycle – better throughput
Design of the Microarchitecture Level:
Pipelined Design – The Mic-3
SWAP in Mic-2

• In Mic-3, 3 microsteps are needed to use the data path:
– Load A and B
– Perform the operation and load C
– Write the result back
Design of the Microarchitecture Level:
Pipelined Design – The Mic-3
SWAP in Mic-3

• Mic-3 instructions take more cycles than Mic-2 instructions
• However, a Mic-3 cycle is 1/3 of a Mic-2 cycle
• For SWAP, Mic-3 costs 11 microsteps, while Mic-2's 6 cycles correspond to 6 × 3 = 18 microsteps
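The SWAP counts above follow the usual pipeline arithmetic. A minimal sketch; the stall count of 3 is inferred from the slide's 8-vs-11 gap, not stated there:

```python
def pipelined_cycles(n_ops, n_stages, stalls=0):
    """Cycles to run n_ops through an n_stages-deep pipeline.
    The first op finishes after n_stages cycles; each later op adds
    one cycle, plus any stall cycles caused by data dependencies."""
    return n_stages + (n_ops - 1) + stalls

# SWAP on the Mic-3: 6 microinstructions through a 3-stage data path.
# With no dependencies it would take 3 + 5 = 8 microsteps; RAW stalls
# (assumed to total 3 here) push it to the 11 quoted on the slide.
assert pipelined_cycles(6, 3) == 8
assert pipelined_cycles(6, 3, stalls=3) == 11
# Unpipelined Mic-2 equivalent: 6 cycles x 3 subcycles each = 18.
assert 6 * 3 == 18
```

Even with stalls, 11 Mic-3 microsteps beat the 18 microstep-equivalents of Mic-2, which is why partitioning the data path pays off.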
Design of the Microarchitecture Level:
Pipelined Design – The Mic-3
Dependencies

• We would like to start SWAP3 in cycle 3, but MDR is available only in cycle 5. This is called a true dependence, or RAW (Read After Write) dependence. SWAP3 has to wait/stall until SWAP1 completes
• Pipelining is a key technique in all modern CPUs. An analogy is a car assembly line – it produces 1 car/hr independent of how long it actually takes to assemble one car
• Reading assignment: A Seven-Stage Pipeline (the Mic-4)
Improving Performance

• Ways to improve performance (primarily of the CPU and memory):
– Implementation improvements without architectural changes
· Old programs run without changes – a major selling point
· The 80386 through the Pentiums improved in this way
– Architectural changes
· New or additional instructions and/or registers
· New architectures such as RISC, IA-64, etc.
• Major techniques:
– Cache memory
– Branch prediction
– Out-of-order execution with register renaming
– Speculative execution
Improving Performance:
Cache Memory

• Memory latency and bandwidth are at odds (e.g., with pipelining) – hence caches
• Split cache: separate caches for instructions and data
– Two separate memory ports
– Doubles the bandwidth, with independent access to each cache
• Level-2+ caches: between the I/D caches and main memory
Improving Performance:
Cache Memory

• Caches are generally inclusive
– The L3 cache includes the L2 cache's content, and the L2 cache includes the L1 cache's content
• Caching depends on locality of reference
– Spatial locality: memory locations with addresses numerically close to recently accessed locations are likely to be accessed in the near future
– Temporal locality: recently accessed memory locations are likely to be accessed again
• Main memory is divided into fixed-size blocks called cache lines – 4 to 64 consecutive bytes
• If memory is referenced,
– the cache controller checks whether the referenced word is in the caches,
– else a cache line is evicted and the new line is brought in from main memory
• Cache model (figure)
Improving Performance:
Direct-Mapped Caches

• A given memory word is stored in exactly one place
– If it is not there, it is not in the cache
• Cache-entry format:
– VALID bit: on if the cache line holds valid data
– TAG: (16 bits) a unique value identifying the corresponding line of memory
– DATA: (32 bytes) a copy of the data from memory
Improving Performance:
Direct-Mapped Caches
Address Translation

• TAG: the tag bits in the address correspond to the TAG field in the cache entry
• LINE: which cache entry holds the data, if it is present
• WORD: which word within the line
• BYTE: which byte within the word (not normally used)
• When the CPU issues an address, the hardware extracts the LINE bits
– It indexes into the cache to find one of the 2048 entries; if the entry is valid, the TAG fields are compared; if they are the same, a cache hit!
– Else a cache miss! The whole cache line is fetched from memory and stored in the cache; the existing line is written back if necessary
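The field extraction above can be sketched in code, assuming the slide's parameters: 32-bit addresses, 2048 lines of 32 bytes each (8 words of 4 bytes), giving BYTE = 2 bits, WORD = 3 bits, LINE = 11 bits, TAG = 16 bits:

```python
# Sketch of direct-mapped address translation with the slide's parameters:
# 32-bit address = TAG(16) | LINE(11) | WORD(3) | BYTE(2).
def split_address(addr):
    byte = addr & 0x3             # bits 1..0: byte within the word
    word = (addr >> 2) & 0x7      # bits 4..2: word within the line
    line = (addr >> 5) & 0x7FF    # bits 15..5: index of the cache entry
    tag  = (addr >> 16) & 0xFFFF  # bits 31..16: compared with the TAG field
    return tag, line, word, byte

tag, line, word, byte = split_address(0x0001_2344)
assert (tag, line, word, byte) == (0x0001, 0x11A, 1, 0)
```

The LINE bits pick the entry directly (no search), which is what makes direct mapping cheap.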
Improving Performance:
Direct-Mapped Caches

• Consecutive memory lines go into consecutive cache entries
• If the access pattern alternates between addresses that map to the same entry (e.g., address X and address X + cache size), the line is overwritten on every access; if this pattern is frequent, it results in poor performance – frequent misses
• Direct-mapped caches are very common, and they are typically effective because collisions such as the one described above are rare
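The X vs X + cache-size collision can be demonstrated with a tiny simulation. Sizes follow the slide's 2048 × 32-byte example; the address X is arbitrary:

```python
# Sketch: two addresses one cache-size apart map to the same line of a
# direct-mapped cache, so alternating between them misses every time.
CACHE_LINES = 2048
LINE_BYTES = 32
CACHE_SIZE = CACHE_LINES * LINE_BYTES   # 64 KB, per the slide's parameters

cache = {}   # line index -> stored tag

def access(addr):
    line = (addr // LINE_BYTES) % CACHE_LINES
    tag = addr // CACHE_SIZE
    hit = cache.get(line) == tag
    cache[line] = tag                    # miss evicts whatever was there
    return hit

X = 0x1234                               # arbitrary illustrative address
misses = sum(not access(a) for a in [X, X + CACHE_SIZE] * 4)
assert misses == 8   # every access evicts the other line: 100% miss rate
```

A set-associative cache (next slide) absorbs exactly this kind of conflict.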
Improving Performance:
N-way Set-Associative Caches

• Allow n entries for each hashed address (address modulo cache size). These entries need to be ordered for LRU replacement
• Each of the n entries must be checked to see whether the needed line is present
• 2-way and 4-way caches have performed well
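A minimal sketch of a 2-way set-associative cache with LRU replacement; the set and way counts are illustrative, not from the slides:

```python
# Sketch of an n-way set-associative cache with LRU replacement.
class SetAssociativeCache:
    def __init__(self, n_sets=4, n_ways=2):
        self.n_sets = n_sets
        self.n_ways = n_ways
        # Each set is a list of tags, ordered least- to most-recently used.
        self.sets = [[] for _ in range(n_sets)]

    def access(self, line_addr):
        """Return True on a hit, False on a miss (filling the line)."""
        s = self.sets[line_addr % self.n_sets]
        tag = line_addr // self.n_sets
        if tag in s:                  # hit: move tag to the MRU position
            s.remove(tag)
            s.append(tag)
            return True
        if len(s) == self.n_ways:     # miss with full set: evict LRU entry
            s.pop(0)
        s.append(tag)
        return False

c = SetAssociativeCache()
# Two lines that would collide in a direct-mapped cache coexist here:
assert c.access(0) is False   # cold miss
assert c.access(4) is False   # cold miss, same set as line 0
assert c.access(0) is True    # still present: 2 ways absorb the conflict
assert c.access(4) is True
```

Note the lists double as the LRU order: position 0 is always the replacement victim.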
Improving Performance:
Issues in Cache Design

• Cache replacement policy: LRU
• Writing the cache back to memory
– Write through
– Write deferred, or write back
• Writing to an address that is not in the cache:
– Write allocation: bring the line into the cache – typically used with write-back caches
– Write to memory directly: typically used with write-through caches
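The two write policies above can be sketched side by side; a single-line cache is assumed purely for brevity:

```python
# Sketch of the two write policies, using one cache line for brevity.
class WriteThrough:
    def __init__(self, mem):
        self.mem = mem
        self.line = dict(mem)
    def write(self, addr, val):
        self.line[addr] = val
        self.mem[addr] = val          # memory is updated on every write

class WriteBack:
    def __init__(self, mem):
        self.mem = mem
        self.line = dict(mem)
        self.dirty = False
    def write(self, addr, val):
        self.line[addr] = val
        self.dirty = True             # memory update is deferred
    def evict(self):
        if self.dirty:                # written back only when evicted
            self.mem.update(self.line)
            self.dirty = False

mem = {0: 0}
wb = WriteBack(mem)
wb.write(0, 42)
assert mem[0] == 0        # write deferred: memory is still stale
wb.evict()
assert mem[0] == 42       # brought up to date at eviction time
```

Write back trades memory traffic (many writes become one) for the bookkeeping of a dirty bit per line.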
Improving Performance:
Branch Prediction

• Pipelining works best with linear code, but about 20% of instructions are branches or conditional branches, hence branch prediction is important
• Most pipelined machines execute the instruction following a branch, even though logically they should not (the opcode is known only after the next instruction fetch has already started)
– Try to find a useful instruction to execute after the branch!
– Compilers can stuff in No-Op instructions, but that slows execution and makes the code longer
• Example predictions
– Backward branches will be taken, e.g., at the end of a loop. Some forward branches occur on error conditions, which are rare, so predicting forward branches as not taken is reasonable
• Two ways of executing past a predicted branch
– Execute until the instruction would change state (i.e., write to a register), then hold the update in a scratch register temporarily until it is known whether the prediction was correct
– Record the overwritten value, to be able to roll back in case of need
Improving Performance:
Dynamic Branch Prediction

• The CPU maintains a history table of previous branches in hardware
– The history table is looked up for predictions
· (a) Organized just like a cache: <address of branch instruction, bit telling whether the branch was taken>
· (b) The end of a loop causes a wrong guess; with a 2-bit branch history, the prediction is changed only after two consecutive wrong guesses
· (c) A 2- or 4-way associative-entry approach, as with caches
• Can take a finite-state-machine approach
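The 2-bit history scheme in (b) is exactly a saturating counter; a minimal sketch of one table entry:

```python
# Sketch of a 2-bit saturating-counter branch predictor (one history-table
# entry). States 0,1 predict "not taken"; states 2,3 predict "taken".
class TwoBitPredictor:
    def __init__(self, state=3):         # start strongly "taken"
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
# A loop whose branch is taken 3 times, then falls through once:
outcomes = [True, True, True, False] * 2
correct = sum(p.predict() == t or p.update(t) for t in [])  # placeholder
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
# The single not-taken loop exit moves the counter only one step, so the
# next iteration is still predicted correctly: 6 of 8 predictions right.
assert correct == 6
```

With a single history bit, the loop exit would cost two mispredictions per pass (exit and re-entry) instead of one.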
Improving Performance:
Static Branch Prediction

• Dynamic branch prediction is carried out at run time and requires special, expensive hardware
• The compiler can pass hints (a new branch-instruction format)
– A bit is set to indicate which way the branch will mostly go
– Requires special hardware (enhanced instructions)
• Profiling
– The program is run through a profiler (simulator) to capture its branch behavior; the information is passed to the compiler, which in turn can encode it in the special branch instructions
– IA-64 supports profiling
Improving Performance:
Out-of-Order Execution and Register Renaming

• Pipelined superscalar machines fetch and issue instructions before they are needed
• In-order instruction issue and retirement is simpler but inefficient
• Some instructions depend on others; such dependencies constrain out-of-order execution
• Example machine:
– 8 registers; instructions name 2 operand registers and one result register
– An instruction decoded in cycle N starts executing in cycle N+1
– Addition and subtraction results are written back in cycle N+2
– Multiplication results are written back in cycle N+3
• A scoreboard is a table that tracks the use of registers for reading and writing at run time
Improving Performance:
Example: In-Order Execution

(Figure: in-order execution example)
Improving Performance:
Example: In-Order Execution

• In-order issue and in-order retirement
• Instruction dependencies:
– Read After Write (RAW): if any operand is still being written, do not issue
– Write After Read (WAR): if the result register is being read, do not issue
– Write After Write (WAW): if the result register is being written, do not issue
• Instruction I4 has a RAW dependency, so it stalls
– The decode unit stalls until R4 is available
– It stops pulling from the fetch unit
– When the buffer fills, the fetch unit stalls fetching from memory
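The three hazard rules above amount to a scoreboard check at issue time. A minimal sketch; the (dst, src1, src2) instruction encoding is an assumption for illustration:

```python
# Sketch of a scoreboard-style issue check applying the RAW/WAR/WAW rules.
# Instruction encoding (dst, src1, src2) is illustrative, not from the slides.
def can_issue(instr, reading, writing):
    """reading/writing: sets of registers in use by in-flight instructions."""
    dst, src1, src2 = instr
    if src1 in writing or src2 in writing:
        return False          # RAW: an operand is still being written
    if dst in reading:
        return False          # WAR: the result register is being read
    if dst in writing:
        return False          # WAW: the result register is being written
    return True

# R4 is still being written by an in-flight instruction, so anything
# reading R4 must stall (the I4 situation on the slide):
assert can_issue(("R5", "R4", "R1"), reading=set(), writing={"R4"}) is False
assert can_issue(("R5", "R2", "R1"), reading=set(), writing={"R4"}) is True
```

An in-order machine stops at the first failing instruction; an out-of-order one skips ahead to later instructions that pass the check.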
Improving Performance:
Out-of-Order Execution and Register Renaming

• Instructions are issued out of order and may retire out of order
• Instruction I5 is issued while instruction I4 is stalled
• Problem: I5 might use an operand that I4 has yet to compute
• New rule: do not issue an instruction that uses an operand stored by a previous, still-incomplete instruction
• Example: I7 uses R1, written by I6
– that value of R1 is never used again, because I8 writes R1
– hence I6 can use a different register to hold its value
• Register renaming: the decode unit changes R1 in I6 and I7 to a secret register S1, so I5 and I6 can be issued concurrently
• Renaming often eliminates WAW and WAR dependencies
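The renaming step can be sketched as a decode-time pass: every write gets a fresh "secret" register, and later reads are redirected to the newest name. The (op, dst, src1, src2) encoding is illustrative:

```python
# Sketch of decode-time register renaming. Each write to an architectural
# register is given a fresh secret register (S1, S2, ...), and subsequent
# reads are redirected to the latest name.
def rename(instrs):
    mapping = {}                              # arch reg -> current secret reg
    fresh = (f"S{i}" for i in range(1, 100))
    out = []
    for op, dst, s1, s2 in instrs:
        s1 = mapping.get(s1, s1)              # reads use the newest name
        s2 = mapping.get(s2, s2)
        mapping[dst] = next(fresh)            # each write gets a fresh register
        out.append((op, mapping[dst], s1, s2))
    return out

# A WAW/WAR pattern like the R1 example above: the second write to R1 gets
# a different register, so the two dependence chains can run concurrently.
prog = [("MUL", "R1", "R2", "R3"),   # R1 := R2 * R3
        ("ADD", "R4", "R1", "R5"),   # reads the first R1
        ("MUL", "R1", "R6", "R7"),   # WAW on R1 -> renamed away
        ("ADD", "R8", "R1", "R9")]   # reads the *second* R1
renamed = rename(prog)
assert renamed[0][1] != renamed[2][1]   # the two R1 writes now differ
assert renamed[1][2] == renamed[0][1]   # first ADD still reads first MUL
assert renamed[3][2] == renamed[2][1]   # second ADD reads second MUL
```

Only true (RAW) dependencies survive renaming, which is exactly why it removes WAR and WAW hazards.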
Improving Performance:
Out-of-Order Execution and Register Renaming

• Same eight instructions:
– Executed in 18 cycles using in-order issue and retirement
– Executed in 9 cycles using out-of-order issue and retirement
Improving Performance:
Speculative Execution

• Code consists of basic blocks: linear sequences of code with no control structures such as if-then-else or while statements, and no branches
• Within each block, reordering works well
• A program can be represented as a directed graph of basic blocks
• Problem: basic blocks are short, so they offer insufficient parallelism
• Slow instructions can be moved up across blocks (hoisting), so that if they turn out to be needed, the result is already there!
• Speculative execution: executing code before it is known whether it will be needed at all
Improving Performance:
Speculative Execution – Example

(Figure: speculative-execution example)
Improving Performance:
Speculative Execution – Problems

• In the example,
– say all variables except even-sum and odd-sum are kept in registers
– the LOADs of even-sum and odd-sum can be moved to the top of the loop
– only one of {even-sum, odd-sum} is needed in any iteration, so the other LOAD is wasted
• Reordered instructions must have no irrevocable results
• Can rename all destination registers in speculative code
• Problem: speculative code can cause exceptions (e.g., a cache miss or a page fault)
• Solution: use SPECULATIVE-LOAD instead of LOAD, so that a cache miss does not force a load from memory
• Poison bit: if a speculative instruction (e.g., a LOAD) would cause a trap, a special version of the instruction instead sets the poison bit on the result register. If that register is later touched by a regular instruction, the trap occurs then. If the result in that register is never used, the poison bit is eventually cleared and no harm is done
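The poison-bit idea can be sketched as deferring the trap from the speculative load to the first real use of its result; the register model and function names below are illustrative:

```python
# Sketch of the poison-bit mechanism: a speculative load that would trap
# does not raise immediately; it poisons the result register, and the trap
# fires only if a regular instruction later touches that register.
class Register:
    def __init__(self):
        self.value = None
        self.poison = False

def speculative_load(reg, trapped, value=None):
    if trapped:
        reg.poison = True     # defer the trap instead of raising it now
    else:
        reg.value = value

def regular_use(reg):
    if reg.poison:            # the deferred trap fires on real use
        raise RuntimeError("deferred trap: poisoned register used")
    return reg.value

r = Register()
speculative_load(r, trapped=True)   # e.g., a page fault on the wrong path
# If the branch goes the other way, r is never used and no harm is done.
r2 = Register()
speculative_load(r2, trapped=False, value=7)
assert regular_use(r2) == 7
try:
    regular_use(r)            # touching the poisoned register traps
    assert False
except RuntimeError:
    pass
```

The key property is that a wasted speculative load (the unused even-sum/odd-sum case above) costs cycles but never raises a spurious exception.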