
CHAPTER 2-1
PROCESSORS
Figure 2-1. The organization of a simple computer with one CPU and two I/O devices.

The CPU is composed of several distinct parts:
o The control unit (CU) is responsible for fetching instructions from main memory and determining their type.
o The arithmetic logic unit (ALU) performs operations such as addition and Boolean AND needed to carry out the instructions.
o A small, high-speed memory is used to store temporary results and certain control information.
  - This memory is made up of a number of registers, each of which has a certain size and function.
  - One register is the Program Counter (PC), which points to the next instruction to be fetched for execution.
  - Another register is the Instruction Register (IR), which holds the instruction currently being executed.
CPU organization

The internal organization of part of a simple von Neumann CPU is shown in Figure 2-2 in more detail.
Figure 2-2. The data path of a typical von Neumann computer.

The process of running 2 operands through the ALU and storing the result is called the data path cycle
and is the heart of most CPUs.
o The faster the data path cycle is (the cycle time), the faster the machine runs.
o Modern computers have multiple ALUs specialized for different functions operating in parallel.
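
To make the data path cycle concrete, here is a minimal Java sketch (not from the text; the 8-register scratchpad and the two ALU operations are assumptions for illustration). One cycle gates two registers onto the ALU inputs, runs the ALU, and writes the result back to a register.

// A minimal sketch of one data path cycle, assuming a small register
// scratchpad and an ALU that only knows how to add and AND.
public class DataPathCycle {
    static int[] registers = new int[8];   // hypothetical register scratchpad

    // One data path cycle: gate two registers onto the ALU inputs,
    // run the ALU, and store the result back into a register.
    static void dataPathCycle(int srcA, int srcB, int dst, char op) {
        int aluInputA = registers[srcA];   // load ALU input register A
        int aluInputB = registers[srcB];   // load ALU input register B
        int aluOutput = (op == '+') ? aluInputA + aluInputB
                                    : aluInputA & aluInputB;
        registers[dst] = aluOutput;        // write the result back
    }

    public static void main(String[] args) {
        registers[1] = 6;
        registers[2] = 7;
        dataPathCycle(1, 2, 3, '+');       // R3 = R1 + R2
        System.out.println(registers[3]);  // prints 13
    }
}

The cycle time in a real CPU is set by how fast this load-operate-store sequence can be completed in hardware.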
Instruction execution


The CPU executes each instruction in a series of small steps. Roughly speaking, the steps are as
follows (these steps are referred to as the fetch-decode-execute cycle):
a. Fetch the next instruction from memory into IR.
b. Change PC to point to the next instruction.
c. Determine the type of instruction just fetched.
d. If the instruction uses a word in memory, determine where it is.
e. Fetch the word, if needed, into a CPU register.
f. Execute the instruction (running the operands through the data path).
g. Go to step a to begin executing the next instruction.
We can write a program (an interpreter) that imitates the function of a CPU, i.e., that carries out the fetch-decode-execute cycle.
public class interp
{
    static int PC;                  // program counter
    static int AC;                  // accumulator
    static int instr;               // instruction register (IR)
    static int instr_type;          // opcode of the current instruction
    static int data_loc;            // address of the data, or -1 if none
    static int data;                // holds the current operand
    static boolean run_bit = true;  // a bit that can be turned off to halt the machine

    public static void interpret(int[] memory, int starting_address)
    {
        PC = starting_address;
        while (run_bit)
        {
            instr = memory[PC];                       // fetch the next instruction into IR
            PC = PC + 1;                              // advance the program counter
            instr_type = get_instr_type(instr);       // determine the instruction type
            data_loc = find_data(instr, instr_type);  // locate the operand, if any
            if (data_loc >= 0)
                data = memory[data_loc];              // fetch the operand into a register
            execute(instr_type, data);                // execute the instruction
        }
    }

    public static int get_instr_type(int instr) {…}
    public static int find_data(int instr, int type) {…}
    public static void execute(int type, int data) {…}
}
Figure 2-3. An interpreter for a simple computer in Java.


The very fact that it is possible to write a program that can imitate the function of a CPU shows that a
program need not be executed by a "hardware" CPU.
o Instead, a program can be carried out by having another program fetch, examine, and execute its
instructions.
A program, such as the one in Figure 2-3, that fetches, examines, and executes the instructions of
another program is called ______________.

An interpreter breaks the instructions of its target machine into smaller steps. As a consequence, the
machine on which the interpreter runs can be much simpler and less expensive than a hardware
processor for the target machine would be. This saving is especially significant if the target machine
has a large number of complicated instructions.
Complex Instruction Set Computer






Early computers had small, simple sets of instructions.
o But the quest for more powerful computers led to more powerful individual instructions.
Very early on, it was discovered that more complex instructions often led to faster program execution
even though individual instructions might take longer to execute.
o This is because parts of these complex operations could sometimes be carried out in parallel
using different hardware.
o For expensive, high-performance computers, the cost of this extra hardware could be readily
justified.
Thus expensive, high-performance computers came to have many more instructions than lower-cost ones.
o However, instruction compatibility requirements created the need to implement complex
instructions even on low-end computers where cost was more important than speed.
But how could one build a low-cost computer that could execute all the complicated instructions of high-performance, expensive machines?
o The answer lay in interpretation.
o The IBM 360 family of computers spanned 2 orders of magnitude in price and capability.
  - The expensive models did not have a microprogramming level; their level 2 (ISA-level)
instructions were executed directly by hardware.
  - The low-cost models, however, used the microprogramming level to interpret the ISA-level
instructions.
As the market for computers exploded in the 1970s, the demand for low-cost computers favored
designs of computers using interpreters.
o Nearly all new computers designed in the 1970s, from minicomputers to mainframes, were based
on interpretation.
o By the late 1970s, the use of simple processors running interpreters had become very widespread,
except among the most expensive, highest-performance models such as the Cray-1 and the CDC
series.
The use of an interpreter eliminated the inherent cost limitations of complex instructions, so designers
began to explore much more complex instructions, particularly the ways to specify the operands to be
used.
o This trend reached its zenith with the DEC VAX computer, which had several hundred instructions
and more than 200 different ways of specifying the operands to be used in each instruction.
Reduced Instruction Set Computer



During the late 1970s there was experimentation with very complex instructions, made possible by the
interpreter.
o Designers tried to close the gap between what machines could do and what high-level languages (HLLs) required.
o Hardly anyone thought about designing simpler machines.
In 1980, a group at Berkeley led by David Patterson began designing VLSI CPU chips that did not use
interpretation.
o They coined the term RISC for this concept and named their CPU chip the RISC I, followed
shortly by the RISC II.
These new processors were significantly different from the commercial processors of the day.
o Since they did not have to be backward compatible with existing products, their designers were
free to choose new instruction sets that would maximize total system performance.
o The characteristic of the RISC I, the RISC II, and the MIPS that caught everyone's attention was
the relatively small number of instructions available, typically around 50.



  - This number was far smaller than the 200 to 300 instructions on established computers such as the VAX
and large IBM mainframes.
  - In fact, the acronym RISC stands for ________________________________, which was
contrasted with CISC, which stands for _______________________________.
The RISC supporters claimed that the best way to design a computer was to have a small number of
simple instructions that execute in 1 cycle of the data path, namely, fetching 2 registers, combining
them somehow, and storing the result back in a register.
o Their argument was that even if a RISC machine takes 4 or 5 instructions to do what a CISC
machine does in 1 instruction, if the RISC instructions are 10 times faster (because they are not
interpreted), RISC wins.
Given the performance advantages of RISC technology, one might have expected RISC machines (such as the
Sun UltraSPARC) to mow down CISC machines (such as the Intel Pentium) in the marketplace. Nothing like this
has happened. Why not?
o First of all, there is the issue of backward compatibility and the billions of dollars companies have
invested in software for the Intel line.
o Second, Intel has been able to employ the same ideas even in a CISC architecture.
 Starting with the 486, the Intel CPUs contain a RISC core that executes the simplest (and
typically most common) instructions directly in a single data path cycle, while interpreting the
more complicated instructions in the usual CISC way.
 The net result is that common instructions are fast and less common instructions are slow.
 While this hybrid approach is not as fast as a pure RISC design, it gives competitive overall
performance while still allowing old software to run unmodified.
Design principles for modern computers


In the absence of backward compatibility requirements, the sensible way to build a new processor
would be to follow the RISC philosophy.
There is a set of design principles, called the RISC design principles, that architects of general-purpose CPUs do their best to follow.
a.
All instructions are directly executed by hardware

All common instructions are directly executed by hardware and not interpreted by microinstructions.
o Eliminating a level of interpretation (the microprogram) provides high speed for most instructions.
o For computers that implement CISC instruction sets, the more complex instructions may still be
interpreted by the microprogram. This extra step slows the machine down, but for less frequently
occurring instructions it may be acceptable.
b.
Maximize the rate at which instructions are issued

Modern computers resort to many tricks to maximize their performance, chief among which is trying
to start as many instructions per second as possible.
o After all, if you can issue 500 million instructions per second, you have built a 500-MIPS
processor.
o This principle implies that parallelism plays a major role in improving performance, since issuing
large numbers of instructions in a short interval is only possible if multiple instructions can
execute concurrently.
c.
Instructions should be easy to decode

A critical limit on the rate of issue of instructions is decoding individual instructions to determine what
resources they need.
o Instructions should be regular, fixed in length, and have a small number of fields.
o The fewer formats for instructions, the better.
d.
Only loads and stores should reference memory

One of the simplest ways to break operations into separate steps (to allow parallelism) is to require that
operands for most instructions come from, and return to, CPU registers.
o This is because access to memory can take a long time and the delay is unpredictable.
o This observation means that only LOAD and STORE instructions should reference memory. All
other instructions should operate only on registers so that they can be parallelized.
e.
Provide plenty of registers

Since accessing memory is relatively slow, many registers (at least 32) need to be provided so that
once a word is fetched, it can be kept in a register until it is no longer needed.
Hardware limits



Can computers keep getting faster and faster?
o The laws of physics say that nothing can travel faster than the speed of light, which is about 30
cm/nsec in vacuum and 20 cm/nsec in copper wire.
  - This means that to build a computer with a 1-nsec instruction fetch time, the total distance that
the electrical signals can travel from the CPU to memory and back cannot be more than 20 cm.
  - Therefore very fast computers have to be very small.
o Electrons travel along a path that can be as narrow as 3 atoms in width. If the path is narrower
than that, electrons can stray onto an adjacent path.
 Thus, parallel paths must not be too narrow and must not be located too close to one another.
o These intrinsic laws of nature will eventually prevent computers from becoming arbitrarily fast or
arbitrarily small.
Consequently, most architects look to parallelism (doing 2 or more things at once) as a way to build
faster machines for a given clock speed.
Parallelism comes in 2 general forms: instruction-level parallelism and processor-level parallelism.
o In the former, parallelism is exploited within individual instructions to get more instructions/sec
out of the machine.
o In the latter, multiple CPUs work together on the same problem.
Instruction-level parallelism
a.
Pipelining

It has been known for years that the actual fetching of instructions from memory is a major bottleneck
in instruction execution speed.
o To alleviate this problem, computers going back at least as far as the IBM Stretch (1959) have had
the ability to fetch instructions from memory in advance, so that they will be there in the CPU
when they are needed.
In effect, prefetching divides instruction execution up into 2 parts: fetching and actual execution.
o The concept of a pipeline carries this strategy much further.
o Instead of dividing instruction execution into only 2 parts, it is often divided into many parts, each
one handled by a dedicated piece of hardware, all of which can run in parallel.
Figure 2-4(a) illustrates a pipeline with 5 units, or stages.


Figure 2-4. (a) A 5-stage pipeline. (b) The state of each stage as a function of time. 9 clock cycles are illustrated.




In Figure 2-4(b) we see how the pipeline operates as a function of time.
o Of course, timing is everything. S2 cannot work on instruction 1 until S1 has finished it. So we
use clock cycles to control these events.
o In addition, to be able to run in parallel, the 2 instructions must not conflict over resource usage
(e.g. registers), and neither must depend on the result of the other.
Getting back to our pipeline of Figure 2-4, suppose that the cycle time of this machine is 2 nsec. Then
it takes 10 nsec for an instruction to progress all the way through the 5-stage pipeline.
o At first glance, with an instruction taking 10 nsec, it might appear that the machine can run at
____ MIPS, but in fact it does much better than this.
o At every clock cycle (2 nsec), one new instruction is completed, so the actual rate of processing is
______________.
Thus, pipelining allows a trade-off between latency (how long it takes to execute an instruction) and
processor bandwidth (how many MIPS the CPU offers).
o With a cycle time of T nsec, and n stages in the pipeline, the latency is ____ nsec.
o Since 1 instruction completes every clock cycle and there are ______ clock cycles/second, the
number of instructions executed per second is _____.
o To get the number of MIPS, we have to divide the instruction execution rate by 1 million to get
_________ MIPS.
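
For reference, here is a small sketch (not part of the original notes) that evaluates these relations for the example values above (a cycle time of T = 2 nsec and an n = 5 stage pipeline).

// A small sketch that evaluates the pipeline relations for a cycle time
// of T nsec and an n-stage pipeline.
public class PipelineMetrics {
    public static void main(String[] args) {
        double T = 2.0;                   // cycle time in nsec (example value)
        int n = 5;                        // number of pipeline stages

        double latencyNsec = n * T;       // time for one instruction to traverse the pipeline
        double instrPerSec = 1e9 / T;     // one instruction completes per clock cycle
        double mips = instrPerSec / 1e6;  // equivalently 1000 / T

        System.out.println("Latency: " + latencyNsec + " nsec");   // 10.0
        System.out.println("Throughput: " + mips + " MIPS");       // 500.0
    }
}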
If 1 pipeline is good, then surely 2 pipelines are better. One possible design for a dual-pipeline CPU is
shown in Figure 2-5.
o Here a single instruction fetch unit fetches 2 instructions together and puts each one in its own
pipeline, complete with its own ALU for parallel operation.
o To be able to run in parallel, the 2 instructions must not conflict over resource usage, and neither
must depend on the result of the other.
 As with a single pipeline, either the compiler must guarantee this situation to hold, or conflicts
are detected and eliminated during execution using extra hardware.
Figure 2-5. Dual 5-stage pipelines with a common instruction fetch unit.
b.
Superscalar architectures


Going to 4 pipelines is conceivable, but doing so duplicates too much hardware (computer scientists do
not believe in the number 3). Instead, a different approach is used on high-end CPUs.
The basic idea is to have just 1 pipeline but give it multiple functional units, as shown in Figure 2-6
(the Intel Core architecture has a structure similar to this).
o The term superscalar architecture was coined for this approach in 1987.
Figure 2-6. A superscalar processor with 5 functional units.

Implicit in the idea of a superscalar processor is that S3 can issue instructions considerably faster than
S4 is able to execute them.
o If S3 issued an instruction every 10 nsec and all the functional units could do their work in 10
nsec, no more than 1 would ever be busy at once, negating the whole idea.
o In reality, most of the functional units in S4 take longer than 1 clock cycle to execute, certainly the
ones that access memory or do floating point arithmetic.
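
The issue/execute imbalance can be illustrated with a rough Java sketch (not from the text; the thread pool and timings are hypothetical stand-ins): a single issue loop, playing the role of S3, dispatches instructions to several slower functional units, playing the role of S4, which execute concurrently.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// A rough analogy for a superscalar S3/S4: one issue loop (S3) hands work
// to a pool of functional units (S4) that each take several "cycles".
public class SuperscalarSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService functionalUnits = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 8; i++) {        // S3: issue 8 instructions back to back
            final int instr = i;
            functionalUnits.submit(() -> {
                try {
                    Thread.sleep(10);        // S4: each "instruction" takes a while to execute
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("finished instruction " + instr);
            });
        }
        functionalUnits.shutdown();
        functionalUnits.awaitTermination(1, TimeUnit.SECONDS);
    }
}

Because issuing is much faster than executing, several functional units end up busy at the same time, which is the whole point of the superscalar organization.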
Processor-level parallelism

Instruction-level parallelism can speed up a machine by a factor of at most about 10. To get gains of 50,
100, or more, the only way is to design computers with multiple CPUs. The common
organizations are described below.
a.
Data parallel computers

Many scientific problems involve loops and arrays, or otherwise have a highly regular structure.
o Often the same calculations are performed on many different sets of data.
o The regularity and structure of these programs makes them easy targets for speedup through
parallel execution.
Data parallel computers have found many successful applications as a consequence of their remarkable
efficiency.
o Because all of the processors are running the same instruction, the system needs only 1 brain
controlling the computer.
o Consequently, the processor needs only 1 fetch stage, 1 decode stage, and 1 set of control logic.
There are 2 types of data parallel computers that have been used to execute large scientific programs
quickly, namely, SIMD processors and vector processors.
A Single Instruction-stream Multiple Data-stream or SIMD processor consists of a large number of
identical processors that perform the same sequence of instructions on different sets of data.
The world's first SIMD processor was the University of Illinois ILLIAC IV computer.
o The original ILLIAC IV design consisted of 4 quadrants, each quadrant having an 8x8 square grid
of processor/memory elements.
o A single control unit per quadrant broadcast instructions to all of its processors, which were carried out
by all the processors in lockstep, each one using its own data from its own memory (loaded during
the initialization phase).
Modern graphics processing units (GPUs) heavily rely on SIMD processing to provide massive
computational power.
o Graphics processing lends itself to SIMD processors because most of the algorithms are highly
regular, with repeated operations on pixels, vertices, textures, and edges.
A vector processor appears to the programmer very much like a SIMD processor. Both SIMD
processors and vector processors work on arrays of data, and both execute single instructions that, for
example, add the elements of 2 arrays together pairwise.
o A SIMD processor does it by having as many adders as elements in the array.
o A vector processor has the concept of a vector register which consists of a set of conventional
registers that can be loaded from memory in a single instruction.
 Then a vector addition instruction performs the pairwise addition of the elements of 2 such
arrays by feeding them to a pipelined adder from the 2 vector registers.
 The result from the ALU is another array, which can either be stored into a vector register, or
used directly as an operand for another vector operation.
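
The pairwise addition described above can be sketched in Java as a simple loop (a sequential stand-in, not from the text; the array contents are made up). On a SIMD machine each element would go to its own adder, while on a vector processor the elements would stream through one pipelined adder.

// A minimal stand-in for a vector add: c[i] = a[i] + b[i] for every element.
public class VectorAdd {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};        // first "vector register"
        int[] b = {10, 20, 30, 40};    // second "vector register"
        int[] c = new int[a.length];   // result vector

        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];        // pairwise addition of corresponding elements
        }
        System.out.println(java.util.Arrays.toString(c));   // [11, 22, 33, 44]
    }
}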
b.
Multiprocessors

The processing elements in a data parallel processor are not independent CPUs, since there is only 1
CU shared among all of them.
Our first parallel system with multiple full-blown CPUs is the multiprocessor, a system with more than
1 CPU sharing a common memory.
o Since each CPU can read or write any part of memory, they must coordinate (in software) to avoid
getting in each other's way.
o When 2 or more CPUs have the ability to interact closely, as is the case with multiprocessors, they
are said to be tightly coupled.
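
The need for software coordination just mentioned can be sketched with Java threads standing in for CPUs (a toy example, not from the text; all names are hypothetical). Without the synchronized block, the two "CPUs" would interfere when updating the shared word.

// Two threads ("CPUs") incrementing one shared counter ("shared memory").
// The synchronized block is the software coordination that keeps them
// from getting in each other's way.
public class SharedMemorySketch {
    static long sharedWord = 0;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable cpu = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                synchronized (lock) {      // coordinate access to the shared word
                    sharedWord++;
                }
            }
        };
        Thread cpu0 = new Thread(cpu);
        Thread cpu1 = new Thread(cpu);
        cpu0.start(); cpu1.start();
        cpu0.join();  cpu1.join();
        System.out.println(sharedWord);    // always 2000000 with the lock in place
    }
}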
One simple design is to have a single bus with multiple CPUs and 1 memory all plugged into it. A
diagram of such a bus-based multiprocessor is shown in Figure 2-8(a).
o With a large number of fast processors constantly trying to access memory over the same bus,
conflicts will result.
o Multiprocessor designers have come up with various schemes to reduce this contention and
improve performance.
 One design, shown in Figure 2-8(b), gives each processor some local memory of its own, not
accessible to the others.
 This memory can be used for program code and those data items that need not be shared.
 Access to this private memory does not use the main bus, greatly reducing bus traffic.


Figure 2-8. (a) A single-bus multiprocessor. (b) A multiprocessor with local memories.
c.
Multicomputers

Although multiprocessors with a small number of processors (<=256) are relatively easy to build, large
ones are surprisingly difficult to construct.
o The difficulty is in connecting all the processors to the shared memory.




To get around these problems, many designers have simply abandoned the idea of a shared
memory and instead build systems consisting of large numbers of interconnected computers, each having
its own private memory but no common memory.
o These systems are called multicomputers.
o The CPUs in a multicomputer are said to be loosely coupled, to contrast them with tightly coupled
multiprocessor CPUs.
The CPUs in a multicomputer communicate by sending each other messages.
For large systems, having every computer connected to every other computer is impractical, so
topologies such as trees and rings are used.
o As a result, messages from one computer to another often must pass through one or more
intermediate computers or switches to get from the source to the destination.
o Nevertheless, message-passing times on the order of a few microseconds can be achieved without
much difficulty.
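
Message passing between loosely coupled nodes can be sketched with a queue between two Java threads (a toy stand-in for the interconnect, not from the text; all names are hypothetical). The two threads share no data; they communicate only by exchanging messages through the channel.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two threads standing in for two computers in a multicomputer.
public class MessagePassingSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

        Thread sender = new Thread(() -> {
            try {
                channel.put("partial result from node 0");    // send a message
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread receiver = new Thread(() -> {
            try {
                String msg = channel.take();                   // receive (blocks until it arrives)
                System.out.println("node 1 received: " + msg);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        sender.start();
        receiver.start();
        sender.join();
        receiver.join();
    }
}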
Multicomputers with over 250,000 CPUs, such as IBM's Blue Gene/P, have been built and put into
operation.
Problems
1.
Consider the operation of a machine with the data path of Figure 2-2. Suppose that loading the ALU
input registers takes 5 nsec, running the ALU takes 10 nsec, and storing the result back in the register
scratchpad takes 5 nsec. What is the maximum number of MIPS this machine is capable of in the
absence of pipelining?
2.
What is the purpose of step b (changing the PC to point to the next instruction) in the fetch-decode-execute
cycle? What would happen if this step were omitted?
3.
On computer 1, all instructions take 10 nsec to execute. On computer 2, they all take 5 nsec to
execute. Can you say for sure that computer 2 is faster? Discuss.
4.
Suppose that you were designing a single-chip computer for use in embedded systems. The chip is
going to have all its memory on chip, running at the same speed as the CPU, with no access penalty.
Examine each of the computer design principles discussed in this chapter (5 of them) and tell whether
each is still as important (assuming that high performance is still desired).
5.
To compete with the newly invented printing press, a certain medieval monastery decided to mass-produce
handwritten paperback books by assembling a vast number of scribes in a huge hall. The head
monk would then call out the first word of the book to be produced and all the scribes would copy it
down. Then the head monk would call out the second word and all the scribes would copy it down.
This process was repeated until the entire book had been read aloud and copied. Which of the parallel
processing systems discussed in this chapter does this arrangement most closely resemble?