
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
The Three Box Model
Ernie Hill
Room IT 118
[email protected]
1
The ‘classic’ model of a basic computer is known as the “three box model”; the
reason should be obvious! The three boxes are:
• The Central Processing Unit (CPU)
• The Memory
• Input and Output (I/O)
CPU
The CPU is the computer processor – the ‘brain’ in the system. It is responsible
for running programs and controls – or, at least, supervises – the functioning of
the other parts of the system.
The processor is perhaps the most complex element in the system; however, it is
simply a big Finite State Machine (FSM).
Memory
In principle the memory is the simplest part of the computer. It acts as a large
‘jotting pad’ for data that the processor cannot hold itself. In practice modern
memory architectures can be very complex!
I/O
It may be slightly misleading to collect all the disparate input and output systems
{keyboard, display, disc, network, speaker, …} together under a single heading,
but it makes a convenient grouping. Input and output can use an extremely
diverse range of mechanisms and devices, both from system to system and within
a particular computer. Nevertheless the point remains that a computer needs some
form of communication or it is not useful!
Three Box Model of a Computer
[Diagram: CPU, Memory and I/O connected by the SYSTEM BUS]
• There is more to a computer than a CPU!
• Clearly there must be some storage (memory).
• The system is no use unless it can communicate with the outside world, therefore some Input and Output (I/O) is needed.
• These need interconnecting; this is usually done via a bus.
• This leads to what is often known as the “Three Box Model”
2
System Bus
The three boxes are connected by a system bus (or, occasionally, several buses).
This is a different use of the term “bus” from that previously encountered.
Although a computer bus is a collection of signals with a broadly similar purpose
(communication between the ‘boxes’ described above) the signals do not all form
part of a single number or value. Typically there will be several sub-bundles such
as the data bus, address bus and a set of other signals often collected together as
the control bus.
‘Glue’
Not included in the three box view is the small amount of logic used to interface
these components together. This is often collectively known as “glue” logic. It
includes such necessities as address decoding, clock distribution and reset
circuits.
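As an illustration of what such ‘glue’ can look like, below is a minimal Verilog sketch of an address decoder. The address map used (RAM in the lower half of the map, I/O peripherals at the very top) and all names are invented purely for this example; they are not taken from any particular system.

// Hypothetical address decoder: the upper address bits select which
// 'box' responds to a bus cycle. The address split is illustrative only.
module addr_decode (
  input  [31:0] addr,
  output        ram_sel,   // RAM assumed to occupy the bottom half of the map
  output        io_sel     // I/O peripherals assumed to live at the top
);
  assign ram_sel = (addr[31] == 1'b0);        // 00000000-7FFFFFFF
  assign io_sel  = (addr[31:28] == 4'hF);     // F0000000-FFFFFFFF
endmodule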
Balanced System
Amdahl/Case Rule
A balanced computer system needs about 1
megabyte of main memory capacity and 1
megabit per second of I/O per MIPS of CPU
performance.
3
A computer must be “balanced” in that it should have approximately comparable
capabilities in each of its ‘boxes’; for example a high performance processor is
wasted if insufficient memory is provided.
CPU
[Diagram: the FETCH → DECODE → EXECUTE cycle]
The Central Processing Unit is the ‘brain’ of the computer. It is a finite state machine which
‘runs’ the programs placed in the memory by the user. It does this by repeatedly performing
three operations:
o Fetch
o Decode
o Execute
on a sequence of instructions or “program”.
4
The CPU repeats the following three actions indefinitely:
Fetch
The processor maintains a pointer to the address it has reached in the program. It
reads a word from that address which will be interpreted as an instruction.
Normally, having read an instruction, the pointer moves on to the next address.
Decode
The instruction which has been read is examined to see what it means. In practice
the contents of the memory are just numbers. However the processor can interpret
a number as a coded way of specifying some action. Thus, for example, 0 could
mean “add”, 1 could mean “subtract” etc.1
The decoding process takes the instruction or op. code (operation code) and sets
the appropriate set of control signals for the FSM.
Execute
In the execution phase data is moved through the datapath (see the notes on RTL
design) and the requested calculation is actually performed.
After completing this sequence the processor goes back to fetch the next
instruction and repeats the sequence.
1. This is not a new idea. Consider the way flag signals were used to control ships in Nelson’s
navy, or the way dots and dashes form messages in Morse code.
Program Execution
The function of the program is to change the state of the
computer system, according to the data supplied as input.
A typical CPU maintains the system state in the form of:
o on-board registers
o external memory
This state together with the predefined program (“logic”)
moves one clock (“instruction”) at a time to resolve the final
output.
i.e. the entire system is also a finite state machine!
5
A computer processor is not necessarily a complex component; this will be
illustrated in later lectures where a complete processor design is developed.
However, CPU design can be extremely complex!
Modern CPUs
Modern CPUs are made very complex in an effort to get the maximum speed
from the technology.
Considerations include:
o multiple issue (“superscalar”) – trying to do several instructions in
parallel
o “pipelining” – starting the next instruction before the current one is
finished
o “reordering” – executing instructions in a different order from which they
are fetched
o combining groups of instructions to just determine their net effect &
skipping intermediate stages
o speculation – making ‘guesses’ as to what is likely to happen in the future
before it can be calculated
These subjects will be described in future courses.
Address Space(s)
“Conventional” CPUs have at least one address space.
An address space is a number of (potential) locations which the
system can address.
Memory address space:
Each location has a unique address
o the address is just a number, interpreted as an address
o programmers sometimes call this a pointer
o it may not be the same length as an ‘integer’
6
Hexadecimal Numbers
From here on we are going to be using some large numbers; notably 16- and 32-bit numbers.
Long binary numbers are fine for computers but hard for humans. Decimal numbers are familiar
to humans but difficult to convert to/from binary. To make binary numbers more readable they
frequently are represented in base 16 or hexadecimal.
Hexadecimal is convenient because 16 is a power of 2, so each digit represents exactly 4 bits. The
representation is much easier to read than a long string of “0”s and “1”s. Hexadecimal digits are:
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}.
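A small illustration (not in the original handout), written as Verilog constants since Verilog is used later in these notes: the same 16-bit value in binary, hexadecimal and decimal.

// The same 16-bit constant in three bases; each hex digit covers exactly 4 bits.
wire [15:0] k_bin = 16'b0000_1100_0101_0000;  // binary, grouped in fours
wire [15:0] k_hex = 16'h0C50;                 // hexadecimal digits 0, C, 5, 0
wire [15:0] k_dec = 16'd3152;                 // the same value in decimal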
Addressing
The CPU is the system master; it controls the bus which is used to communicate with the other
‘boxes’. Memory and I/O can be regarded as passive components. The memory and the various
I/O devices need to be distinguished and this is done by addressing them. Each memory location
and each I/O location has a unique address, in the same way that each house in Manchester (or,
indeed, the world) has a unique address.
o An address is the location of some item.
To simplify addressing a computer will normally require that (unlike houses) all its data elements
are the same size. This characteristic size is known as the word length; it varies in different
architectures.
o Simple, cheap microcontrollers frequently have an 8-bit word length
o Workstation processors usually use 32-bit words
o Newer processors are moving to 64-bit architectures
o A few specialist architectures can address individual bits (1-bit word)
Note: what is referred to as a “word” can vary in size, depending on who is using the term!
A processor will have an address space in which all its words reside. The size of the address space
also varies according to the architecture of the processor. The smallest common address space
uses a 16-bit address – i.e. it can represent 2^16 different address values and therefore hold 2^16 = 65,536 different words. A 32-bit processor (i.e. one using 32-bit words) will typically have a 32-bit address space which has room for 2^32 = 4,294,967,296 memory locations.
Other Address Spaces
• A processor will often have several selectable
(addressable) registers usable in an instruction.
Sometimes these are obvious, e.g. {R0, R1, R2, …}
• Some processors have separate addresses for
memory and I/O. Example: Intel x86 architecture.
• Some specialist processors (such as DSPs) have
several, separate memories (with separate buses).
• On the WWW each page has an address:
http://www…
7
Byte addressing
From the foregoing it might be expected that a 32-bit processor such as a Pentium
or an ARM would be able to address 2^32 32-bit words. In practice, as this is rather
a lot of memory, it is usual for such a processor to be able to address individual
bytes (8-bit quantities) as well as whole words. This uses 2 of the address lines
(because there are 2^2 = 4 8-bit bytes in a 32-bit word), but there are still 30 left to
provide 1Gword (4Gbytes) of memory space.
Note that, despite this, the processor will normally use its full word size when
performing calculations; the least significant (LS) two address lines are therefore
largely unused.
A consequence of this is that the addresses of adjacent words differ by 4.
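Expressed as a hypothetical Verilog fragment (names invented), the word address is simply the byte address with the two least significant lines dropped, which is why consecutive word addresses differ by 4:

// Byte address to word address: discard the two LS bits (a 32-bit word holds 4 bytes).
wire [31:0] byte_addr;
wire [29:0] word_addr = byte_addr[31:2];   // 2^30 words = 2^32 bytes = 4 Gbytes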
Memory
Memory is usually regarded as a ‘flat’, randomly addressable
space. It is usually depicted using a memory map.
[Memory maps: a 32-bit address space marked at intervals of 20000000 from 00000000 up to FFFFFFFC, and the first few word addresses 00000000, 00000004, 00000008, 0000000C, …]
This memory map has been drawn with the lower addresses at the
top; they are sometimes drawn the other way round.
8
Memory
Perhaps the commonest form of memory is referred to as “RAM” which stands
for Random Access Memory. It can be implemented in many technologies
(“SRAM”, “SDRAM”, “Flash RAM”, etc.) but that is not our concern here. RAM
is simply ordinary memory in which the processor can store (write) data and load
(read) it back.
The term “random” does not imply anything non-deterministic; it means that the
processor can get at any location at any time without penalty. The term dates
back to the time when some memory technologies did not have this property;
magnetic tape is one example, where some ‘locations’ may require considerable
winding before they can be accessed.
It can be quite surprising to see the different sorts of technology which have been
used for memory in the past; a little research in this area may provide quite a lot
of amusement!
Memory sizes
It is common to describe memory sizes in terms of kilobytes, megabytes etc.
When used for memory these prefixes typically diverge slightly from their
normal meaning. One kilobyte is usually 1024 (= 2^10) locations; it is frequently
written 1Kbyte (upper case “K”) to distinguish this. Similarly one megabyte is
1048576 (= 2^20) bytes. This convention makes it relatively easy to see how many
bits are required to address a memory of a given size. For example, a 64Mbyte
memory requires 2^26 (= 2^(6+20) = 2^6 x 2^20 = 64 x 1M) different addresses and
therefore 26 address lines to distinguish these.
Exercise: How many locations are addressable using only 19 address lines?
Memory address space
• In practice memory is not generally uniform.
• The memory address space may not be filled
• Areas may be set aside for I/O (see below)
• There may be space for expansion
• Different areas of memory may cycle at different speeds
• Some areas may repeat due to incomplete address decoding
9
Caches
A term often heard in association with memory systems is “cache memory”, or
just cache1. The function and operation of caches will be described in future
courses. However the basic idea of a cache is that it provides a small set of local
data to avoid constant references to the real memory (which is big and slow).
If you think of the (main) memory as the University library a cache is analogous
to the pile of books on your desk – much smaller, but easier to get at!
1. Pronounced the same way as “cash”.
Input/Output
• External interfacing generally involves a wide range of
devices. In a desktop computer some typical devices are:
– Keyboard
– Mouse
– VDU
– Printer
– Sound generator
– CD ROM
– Magnetic Discs (Hard Disc, Floppy Disc, …)
– Flash card reader
– Network
– Modem
– …
• In an embedded system there will be many different, specific I/O devices.
e.g. think of a computer in a “fly-by-wire” aeroplane.
10
A few example I/O devices:
Keyboard
Clearly the major feature of a keyboard is that it has a large number of buttons; it is also likely
to have a few other functions such as LEDs1. The number of buttons is a potential problem in
terms of the hardware requirement – the keyboard must be as cheap as possible – so it is usual to
read and encode the key input separately from the main computer system. Typically a keyboard
will contain a microcontroller (a single chip computer) which monitors the key input, performs
functions such as debouncing, and communicates to the main computer via a serial line.
A keyboard is therefore quite a complex item in its own right.
Visual Display Unit (VDU)
The output for what most people think of as a “computer”2 will be a VDU. This is basically a
television screen. The computer views this as a large number (about a million) coloured dots or
“pixels”3. The pixels will often be memory mapped, i.e. they not only appear on the screen but
they can be read and written as memory elements. For example each pixel could be a byte.
Question: how many different colours could then be represented?
However, in addition to acting as a part of the memory this frame buffer has to be copied to the
screen 70 times4 a second, one pixel at a time. The logic also has to ensure that pixels are sent at a
constant rate and that every pixel is sent at exactly the right time. This calls for considerable, fast
logic!
1. LED = Light Emitting Diode
2. In fact more computers are now found in other embedded applications e.g. mobile phones.
3. “Pixel” is a contraction of “picture element”.
4. … or thereabouts.
I/O Interfacing
[Diagram: the System Bus connects to Interface 1 and Interface 2, which drive device-specific buses to I/O Device 1 and I/O Device 2]
Interfaces translate the system bus signals to those required by the device.
The diverse collection of mechanisms, collected under the heading “I/O”, communicate with the CPU via a range of specially tailored interface devices known as “peripherals”.
11
Magnetic Disc
Discs provide a cheap way of storing a lot of bits. They are also a permanent
store (unlike most modern RAMs) in that they remember data when the power is
off. They operate by magnetising tiny areas of the disc surface either as N-S or S-N; these can be interpreted as “0”s and “1”s. The disc spins in the drive, which
has a ‘head’ which moves radially to reach any part of the magnetic surface.
Because the memory provided is usually much larger than the processor’s address
space, much of a disc is (often) organised as file store, itself an addressable space
(using “filenames”).
Modem
“Modem” is a contraction of “MODulator/DEModulator”; it performs both
functions. In this case “modulation” is the transformation of a stream of bits into
audible tones which can be sent across a telephone line; demodulation reverses
the process. The modem will also be capable of producing the tones for dialling,
detecting ‘ringing’ etc.
All of these devices require different types of I/O signal. It is the job of the I/O
interface to translate the system bus signals to those required by the I/O device.
Ports & Peripherals
• A “port” is some form of I/O; it can take many
forms.
• For example:
– Parallel port
– Serial port
• A “peripheral” is a device which interfaces a
port to the computer.
• Usually the peripheral ‘maps’ the port into an
area of memory.
12
Ports & Peripherals
Parallel ports
The simplest form of interface is the parallel port. This port could be an output
port or an input port.
A simple output port will appear just like a memory location (to the CPU);
however, the contents of the memory location will also appear on some physically
accessible wires. These wires could then be connected to devices, such as LEDs,
the barriers on a car park, etc. The location in question will often be referred to as
an output register (because that’s what it is). A simple input port will also
appear as a memory location, but in this case there is no actual memory; reading
the address will return the values on a set of external signals (switches, buttons,
car detection sensors, …). Note that, in this case, the ‘memory’ is volatile, i.e.
reading the same location twice may not give the same answer in each case!
Ports of either sort will typically be 8-bits wide, even in 32-bit machines.
Often – and possibly surprisingly – the scarcest commodity in a computer system
is the number of wires available. For this reason many parallel peripherals allow
a port to be configured so that the same port wires can be used for input or output.
Indeed it is possible to make a bidirectional port by allowing it to be an output at
one time and an input at another. This clearly requires some additional
information and the peripheral device will contain other registers which do not
appear directly but are used for internal programming, such as setting the port
direction. The peripheral will therefore need a (small) range of addresses to
support a single I/O port.
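To make this concrete, here is a minimal Verilog sketch of a memory-mapped parallel port of the kind described above. The register layout (one data register, one direction register) and all names are invented for illustration; they are not the lab peripheral.

// Hypothetical bidirectional 8-bit parallel port with data and direction registers.
module par_port (
  input            clk,
  input            sel,        // address decoder says "this peripheral"
  input            addr,       // 0 = data register, 1 = direction register
  input            wen,        // write enable from the control bus
  input      [7:0] wdata,      // data bus (write)
  output reg [7:0] rdata,      // data bus (read)
  inout      [7:0] pins        // the physical port wires
);
  reg [7:0] data_reg;          // value driven onto output pins
  reg [7:0] dir_reg;           // 1 = drive the pin (output), 0 = high impedance (input)

  // Writes update the internal registers.
  always @(posedge clk)
    if (sel & wen) begin
      if (addr == 1'b0) data_reg <= wdata;
      else              dir_reg  <= wdata;
    end

  // Each pin is driven only when its direction bit says "output".
  genvar i;
  generate
    for (i = 0; i < 8; i = i + 1) begin : drive
      assign pins[i] = dir_reg[i] ? data_reg[i] : 1'bz;
    end
  endgenerate

  // Reads of the data register return the live pin values.
  always @(*)
    rdata = (addr == 1'b0) ? pins : dir_reg;
endmodule

Note that reading address 0 returns the current pin values rather than stored data, which is exactly the ‘volatile memory’ behaviour described above.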
Memory Mapped I/O
Much of this area will remain unused.
A third common form of I/O is the timer. This is not associated with
any input or output signals, but it provides input in the form of a
timing reference.
13
Serial ports
In order to save wires – especially when communicating at a distance – much I/O
is done serially. In fact when sending a 1Mbyte file it would not occur to anyone
to provide 8 million wires in order to transmit it all in parallel, but it could
comfortably be sent in 1 million operations each transmitting eight bits (a byte) in
parallel.
Normally the term serial is used to refer to operations where bits are sent one at a
time. This clearly takes more operations than a parallel transmission, but often
the operations can be done very quickly; because only a single interface is needed
extra money can be spent on speeding it up. Serial interfaces are also highly
suitable for transmission by radio, telephone, optic fibre, etc.
A CPU is optimised for handling data in (8-, 16-, 32-, 64-bit) parallel; it would be
a waste of resource to have it fiddle around shifting single bits around. It is
therefore usual to have a peripheral to do this.
More details on serial communications will appear later. However a serial
peripheral will often contain numerous programming registers to specify its
protocols and speed and some status registers indicating when it is ready to
transmit or if it has received communication, as well as the registers for
communicating the actual data. Eight (or more) registers to support a single serial
port are by no means uncommon.
Buses
A “bus” is a collection of signals which act together.
A processor communicates with the rest of the system using its bus. This is an
amalgamation of signals comprising:
o the address bus
o the data bus
o the control bus
The address bus is output by the processor and specifies the
memory (or I/O) location to be transferred. The address bus size dictates the
size of the memory map.
The data bus – commonly bidirectional – carries the information to/from that
location. This is usually the same size as the processor’s internal data paths.
The control bus specifies which way the data flows, and
when. It may also carry a host of other, specialist information not
discussed here.
14
Buses
Previously a “bus” has been described as a collection of signals with the same function.
A good example is that of the address used to specify a memory location. A 32-bit address
requires 32 binary signals to specify the desired location.
Usually these signals are lumped together and called the “address bus”.
o 16-bits wide in ‘small’ processors/microcontrollers
o 32-bits wide in PCs, workstations, etc.
o 64-bits wide in the future
o Other sizes in other processors (e.g. 20 bits on the 8086 ⇒ 1Mbyte limit in DOS)
Similarly data transfers between the processor and memory will transmit their information
across a “data bus”.
o Normally the same width as the processor’s registers/ALU {8-, 16-, 32-, 64- bits}
o Sometimes narrower to reduce cost (in pins, wiring, memory devices)
o Sometimes wider to increase bandwidth1 (e.g. fetch two instructions in one cycle)
These elements are so ubiquitous that an engineer will always recognise:
o A[31:0]
o D[31:0]
(although the widths of the buses may vary)
1. The rate at which data can be transferred across the bus. The bandwidth of a bus can be doubled by a) cycling the bus at twice the speed; b) keeping the same cycle time but doubling the number of bits.
Bus Hierarchy
[Diagram: bus hierarchy – the PROCESSOR BUS, comprising the ADDRESS BUS, DATA BUS and CONTROL BUS (RD, WR), connects the CPU to the Memory and I/O]
15
The address and data buses together are insufficient to transfer data to/from memory:
at the very least, extra signals to specify the direction and timing of the transfer are needed.
Timing is important to make sure setup and hold times for all the devices are
met.
These signals (with others) form a collection loosely known as the “control bus”.
Collectively all these signals are known as the processor bus, or just “the bus”.
Expansion bus
Many computers will not have their address space(s) filled with memory and I/O.
If there is spare space it is common to allow access to the signals as an
“expansion bus”; this allows the later addition of new devices to the computer.
Expansion is facilitated by the adherence to a bus standard, which specifies the
interface signals and timing. Many PCs have an expansion bus which is not,
precisely, the processor bus (for example it is often slowed down) to allow older
I/O cards2 to be used in newer, faster machines.
2. Populated Printed Circuit Boards (PCBs)
Processor Design
16
Manchester University has a long history of processor design; the world’s first computer (in the
modern sense) was designed and built here in 1948. For details of this, and other early Manchester
University computers see: www.computer50.org
At that time miniaturisation had not yet begun; the Small Scale Experimental Machine (SSEM)
was built using valves and would fill a reasonably sized room (the same machine these days
would fit on a pinhead!). However the number of logic gates available was small, so its
architecture had to be simple. The SSEM was also experimental, so it was used as the basis of an
evolving design which later became the Manchester Mark 1.
A replica of the SSEM now resides in Manchester Museum of Science and Industry.
Processor Design
The CPU is usually the most complex part of a computer system.
All other systems depend on the CPU for control.
The design of the whole computer is heavily influenced by the
architecture of its CPU.
The following lectures outline the detailed design of an
implementation of a computer CPU. The instruction set is already
defined.
o How do we perform the detailed logic design of a
processor, given an outline block diagram and a
specification of its architecture?
17
MU0
MU0 is an abstract machine based on the SSEM. It is a complete processor
specification and is quite capable of running useful programs; it is also simple
enough to describe a complete implementation down to gate level in a few
lectures.
MU0 Instruction Set Architecture
A 16-bit machine
o 16-bit data
o 16-bit instructions
o 12-bit address space
Instruction format
o 4-bit instruction
o 12-bit operand address
18
MU0
MU0 is a simple model computer. Its architecture is (simplified from, but)
similar to the very early Manchester machines, such as the Manchester Mark 1.
When beginning to design a new computer the architecture is one of the first
things to fix. It is necessary to define the programmer’s view of the system and
the instructions which it will execute. The word length (i.e. the ‘width’ or
number of bits in the datapath) and size of the address space are also fixed here.
When designing a new processor all these issues must be resolved. The word
length, addressing range etc. are influenced by cost and available technology.
When these have been set the processor’s instruction set and number of internal
registers are determined, usually using computer simulations to experiment with
the performance of different possible architectures. This sets what is known as
the Instruction Set Architecture (ISA). When the ISA is determined the
processor can be implemented. This involves the design of the hardware
architecture (often called the “microarchitecture”). Processors often go through
many different implementations with the same basic ISA (although this changes
and grows over time). The direct ancestors of many processors in use today
(Pentium, ARM, Coldfire, …) first evolved in the early ’80s; newer
implementations have yielded speed increases of >1000X.
MU0 Instruction Set
Only eight of the sixteen possible operations are implemented.
The others are “reserved for future expansion”.
19
In the case of MU0 the ISA is already fixed:
MU0 is a 16-bit machine
o Memory is 16 bits wide
o The internal data paths are 16 bits wide
o The instructions are 16 bits wide
o The address space is 12 bits long (i.e. 4 Kwords)
The instructions are fixed format
o 4 bits instruction
o 12 bits operand address
It has two user visible registers1
o Accumulator (Acc) – the only ‘user’ register
o Program Counter (PC)
It is a single address machine
o One operand is specified in the instruction
o Other operands (such as ACC) are implicit in the instruction
1. As shall be seen shortly there can be registers which are not directly accessible
via the instruction set.
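Because the instruction format is fixed, splitting a fetched instruction into its fields is just wiring. The small Verilog fragment below (my own sketch; the signal names are illustrative) makes this concrete:

// MU0 instruction word: 4-bit operation code and 12-bit operand address.
wire [15:0] ir;                      // the fetched instruction
wire [3:0]  opcode  = ir[15:12];     // selects one of up to 16 operations
wire [11:0] operand = ir[11:0];      // the S field: a 12-bit memory address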
Instruction Execution Sequence
Like any CPU, MU0 goes through the three phases of execution:
These are repeated indefinitely. In more detail …
a) Fetch Instruction from Memory [PC]
b) PC = PC + 1
c) Decode Instruction
d) Get Operand(s) from: Memory {LDA, ADD, SUB}
IR (S) {JMP, JGE, JNE}
Acc {STO, ADD, SUB}
e) Perform Operation
f) Write Result to:
Acc {LDA, ADD, SUB}
PC {JMP, JGE, JNE}
Memory {STO}
20
MU0 Programming
Programming example
MU0 can be used to write ‘real’ programs; however programming this type of processor can be
very tedious! Below is an example of a program to total the numbers in a data table:
Loop      LDA  Total      ; Accumulate total
Add_instr ADD  Table      ; Begin at head of table
          STO  Total      ;
          LDA  Add_instr  ; Change address
          ADD  One        ; by modifying instruction!
          STO  Add_instr  ;
          LDA  Count      ; Count iterations
          SUB  One        ; Count down to zero
          STO  Count      ;
          JGE  Loop       ; If >= 0 repeat
          STP             ; Halt execution
; Data definitions
Total     DEFW 0          ; Total - initially zero
One       DEFW 1          ; The number one
Count     DEFW 4          ; Loop counter (loop 5x)
Table     DEFW 39         ; The numbers to total ...
          DEFW 25         ;
          DEFW 4          ;
          DEFW 98         ;
          DEFW 17         ;
Note:
o Much shuttling of data to/from the accumulator (tedious & slow)
o Constants (e.g. “One”) need to be preloaded into memory
o Self-modifying code needed to index through the data table
In particular self-modifying code (where the program alters its own instructions) is normally
deprecated.
Exercise
Rewrite this program using the ARM instruction set used in CS1031. Use registers as appropriate.
(Your answer should be considerably shorter!)
A Practical MU0 Datapath
The next stage is to produce an RTL datapath picture:
[Diagram: the MU0 datapath – MEMORY (Data In, Data Out, Address), the registers ACC, PC and IR, the ALU, and the Timing and Control block]
Having produced a sketch it is
necessary to check to see that all
the required operations are
possible.
It is possible to determine all the
required ALU functions. The
control (such as the decisions
about whether to jump or not) is
still being neglected at this point.
21
Datapath Design
Instructions can be compressed into two cycles of execution. In many cases each
phase requires a memory cycle:
o Fetch Read instruction
o Decode/Execute Read operand/store accumulator
We (in some cases, you) can verify the validity of the datapath by testing the
different instructions and seeing which buses are used in each cycle. (In this case
all instructions are possible.) Try to fill in the data paths for the following
instructions.
ADD
[Diagrams: the MU0 datapath during the Fetch and the Decode/Execute cycles of an ADD instruction]
22
STO, JMP
[Blank datapath diagrams for the Fetch and Decode/Execute cycles, to fill in as an exercise]
Registers
In our MU0 there are three registers: ACC, PC, IR
Not all of these are visible to the programmer.
We will make these from sets of 16 D-type flip-flops.
[Diagram: sixteen D-type flip-flops with outputs Out15, Out14, … Out0 forming one register]
Note that all the control
signals are common for
the whole register.
23
Register Banks
Registers
The registers described here are the same as those previously described in the
notes on RTL.
All flip-flops within a register have a common clock which is the system clock.
All registers in the design will use this clock to ensure synchronous operation.
Each flip-flop has an individual input. However these can be shared across more
than one register.
The loading of the register is controlled by a Clock-Enable (CE) signal; if this is
active when the system is clocked the register will adopt the input value. By
activating the CE signal at the correct time the register can copy (“latch”) the
value.
Similarly the outputs may feed into a shared bus, providing only (at most) one
output is enabled at once for each bus. This is controlled by the Output-Enable
(OE) signal. By activating the OE signal at the right time the register can drive its
output bus.
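A sketch of one such register in Verilog, assuming the CE/OE scheme just described (the module and signal names are my own, not the lab materials):

// 16-bit register with clock enable (CE) and tristate output enable (OE).
module reg16 (
  input         clk,
  input         ce,        // latch d on the next clock edge when high
  input         oe,        // drive the stored value onto the shared bus when high
  input  [15:0] d,
  output [15:0] q_bus      // connects to a shared bus
);
  reg [15:0] q;

  always @(posedge clk)
    if (ce) q <= d;        // all 16 flip-flops share clk and ce

  assign q_bus = oe ? q : 16'bz;   // at most one OE per bus may be active at once
endmodule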
Register File
The register bank or register file is often a multiport
memory where any register can be connected to any
port at any time.
24
A modern processor will typically have more than one programmer-accessible
register; a typical RISC (Reduced Instruction Set Computer) will have 16 (ARM)
or 32 (MIPS) registers, any or all of which can be used to store temporary
operands. These registers are normally grouped together in a register bank – also
known as a “register file”.
A register bank is similar to a memory, although its address size is much smaller;
a register bank with 16 registers needs only a 4-bit register address (2^4 = 16). (In
MU0 there is only one register (ACC) and so it can be addressed using zero bits
(2^0 = 1)). However, unlike memory, it is common to be able to perform several
operations on the register bank simultaneously; for example an ARM instruction
might specify:
ADD R1, R2, R3
which requires two read operations and a write operation to be performed at the
same time.
Any register can be connected to any port at any time, including, for example: ADD R1, R1, R1
How might this be implemented?
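One possible answer, as a minimal Verilog sketch: give the bank one write port and two independent read ports, so the ARM-like case above of 16 registers can supply two operands and accept a result in the same cycle. All names here are invented for illustration.

// 16 x 32-bit register file: two read ports and one write port, so
// 'ADD R1, R2, R3' can read R2 and R3 and write R1 simultaneously.
module regfile (
  input         clk,
  input         we,            // write enable
  input  [3:0]  waddr,         // destination register number
  input  [31:0] wdata,
  input  [3:0]  raddr_a,       // first source register number
  input  [3:0]  raddr_b,       // second source register number
  output [31:0] rdata_a,
  output [31:0] rdata_b
);
  reg [31:0] regs [0:15];

  always @(posedge clk)
    if (we) regs[waddr] <= wdata;

  // The read ports are independent, so any register (including the one
  // being written) can appear on either port, as in ADD R1, R1, R1.
  assign rdata_a = regs[raddr_a];
  assign rdata_b = regs[raddr_b];
endmodule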
ALU
[Diagram: an ALU with data inputs X and Y, output Z, and function select Fn]
To call it an ALU in MU0 is rather an
exaggeration. The instruction set does not
provide facilities for performing logical
operations (e.g. NOT, AND, XOR etc.) and thus
only an arithmetic unit is required. An
enhanced version of the machine could include
logic operations which could easily be
supported.
25
The easiest example of a microprocessor ALU to present here is that of the ARM, as
used in COMP15111. The ARM is a 1980s architecture but is still in common use
today1.
An ALU is an RTL component; it is therefore irrelevant to us how many bits it
processes. It will usually have two input buses (let’s call them X and Y) and a single
output bus (Z). An adequate number of bits are supplied to specify the function
performed on the inputs. In general an ALU will perform both arithmetic and logical
functions. Arithmetic functions are typically addition/ subtraction/comparison treating
the input buses as numbers. Logical functions are the now-familiar Boolean
operations performed by pairing off the bits in the input buses.
A subset of the ARM ALU functions is given below:
1. If you own a mobile ’phone you probably carry at least one around with you!
MU0 ALU
MU0’s ALU must be capable of doing the following:–
Z=X+Y
(for the ADD instruction).
Z=X–Y
(for the SUB instruction).
Z=X+1
(to allow PC to be incremented after an
instruction fetch).
Z=Y
(for the LDA instruction & to allow the S-field of
IR to be sent to the PC for JMP etc.). Other
operations might prove useful in an enhanced
version of the machine.
Each of the operations can be expressed as an addition:
X + Y,  X + (-Y),  X + 1,  0 + Y
26
Note that these functions are directly user accessible. The ALU may also provide
other functions within the processor which are used for internal operations. An
example above could be a ‘move’ from the ‘A’ bus.
The MU0 ALU is much simpler; it does not provide logical functions at all!
However there are more functions than just the ADD/SUB visible in the
instruction set.
Later we will look at how the MU0 ALU can be extended to include some of
these and some other functions.
Adders
Clearly some form of adder is required inside the ALU!
The 16-bit architecture of MU0 requires a 16-bit ALU … and hence a 16-bit adder.
One way of providing this is to use a Ripple Carry Adder.
o Simple – just a string of full adders
o Slow – long critical path
27
Adders
There are many ways to build a single bit adder; two of these are shown below.
The first design (which should, by now, be familiar) comprises
two half adders joined into a full adder. The second design is
the result of minimising the function and is a ‘direct’ approach
to the logic. Although the first design is slower as a single bit
adder (count the gates in the worst case path) the designs are
comparable when used in larger adders because the critical
path is the carry propagation and the path from Cin to Cout is 2
gates in both designs. When several single bit adders are
wired together the time for an addition is always dominated by
the speed by which the carries can be generated. This is
because, under certain circumstances (which?) the carry into
the most significant bit depends on the data into the least
significant bit, which is many gates away.
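The ripple-carry structure itself can be sketched in Verilog as below (an illustration of the idea, not the lab code): sixteen full adders chained through their carries, so the worst-case path runs from the carry into bit 0 to the carry out of bit 15.

// One-bit full adder: sum and carry out from two data bits and a carry in.
module full_adder (input a, input b, input cin, output s, output cout);
  assign s    = a ^ b ^ cin;
  assign cout = (a & b) | (a & cin) | (b & cin);
endmodule

// 16-bit ripple-carry adder: the carry 'ripples' through all sixteen stages.
module rca16 (
  input  [15:0] a,
  input  [15:0] b,
  input         cin,
  output [15:0] sum,
  output        cout
);
  wire [16:0] c;
  assign c[0] = cin;

  genvar i;
  generate
    for (i = 0; i < 16; i = i + 1) begin : stage
      full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(sum[i]), .cout(c[i+1]));
    end
  endgenerate

  assign cout = c[16];
endmodule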
ALU Structure
Although the ALU is based on an adder this is not all that it does.
The input buses have some preconditioning function applied first.
o The X bus can be zeroed
o The Y bus can be zeroed or inverted
28
All the necessary functions can be supported by additions, providing the input
buses are conditioned as follows:
Note that, for example “X-Y” is now expressed as “X+(-Y)”.
Some of these functions are relatively easy; others are harder. For example
producing a value of one on the Y input is awkward if only because the bit values
are dissimilar (i.e. binary 0000000000000001).
However a general purpose adder also has a carry input to the least significant bit.
If we consider this, things become easier:
General ALU’s
In a more general ALU it is often useful to be able to provide:
o True data- the data as supplied
o Complement data- the data with all bits inverted (NOT)
o Zero all data bits zero
o One all data bits one
29
The carry in is a single bit value which adopts the appropriate value. Now the
input buses are transformed by bitwise operations (the -Y of the previous table
has gone too).
Note the transformation: X - Y = X + (-Y) = X + ((~Y) + 1) = X + ~Y + 1, where ~Y is the bitwise complement of Y.
Exercise
Prove this to yourself by working though the following examples.
In each case note that a “carry out” is generated (and ignored).
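Putting this together, the add-based ALU core can be sketched as a short Verilog fragment (my own illustration; signal names are assumed, and the operand-select logic appears on the next page):

// The whole MU0 ALU reduces to a single addition: Z = X' + Y' + Cin,
// where X' is X or 0 and Y' is Y, ~Y or 0001, chosen by the select logic.
wire [15:0] xprime, yprime;
wire        cin;                           // carry into the least significant bit
wire [15:0] z = xprime + yprime + cin;

For example, with xprime = X, yprime = ~Y and cin = 1 this computes X - Y, exactly as in the transformation above.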
Operand Select Logic
The inputs to the 16-bit adder are X’[15:0] and Y’[15:0]. The
outputs are Z15 - Z0.
module xprecon(sx, x, xprime);
output [15:0] xprime;
input [15:0] x;
input sx;
assign xprime = sx ? x : 0;
endmodule
module yprecon(sy, siy, y, yprime);
output [15:0] yprime;
input [15:0] y;
input sy, siy;
assign yprime = sy ? (siy ? y : ~y) : 16'h0001;
endmodule
[Diagrams: a multiplexer controlled by sx selects between X[15:0] and 0000h to give xprime; multiplexers controlled by sy and siy select between Y[15:0], its complement and 0001h to give yprime]
30
Alternative select logic could be as shown below:
What would the Verilog code look like to produce these on synthesis?
ALU Function Decoder
The three function control bits have been assigned in a way
that allows relatively easy decoding in the ALU.
o The choice of coding is arbitrary.
o There are 24 possible choices.
o Any choice can be decoded.
o However some choices simplify the logic.
o Finding a good solution takes intuition & practice.
31
Function Decoder Design
The ALU control bits could be generated directly from the decoder. However, providing that it is
inexpensive, it is sensible to compress the function code into the fewest possible bits. In this
design we require four ALU functions, so this requires a 2-bit function code. The mapping of an
ALU function code to the ALU function is quite arbitrary; however sensible choices can simplify
the logic design, as the two assignments below attempt to show.
There is no easy way to find the ‘best’ assignment in this form of logic optimization; practice is
the only way. In a design such as this simple inspection can reveal some of the optimisations. For
example SY and Cin are the inverse of each other and split 50/50 between “0”s and “1”s;
matching these to one of the input bits therefore gives half the decoder outputs for the price of one
inverter …
We will choose the right hand side code because it produces simpler logic.
Controlling the processor
o We have considered the design of most of the datapath of the very simple computer, MU0.
o We now consider the logic to control the sequence of actions necessary in the execution of an instruction.
[Diagram: the MU0 datapath – memory (Data In, Data Out, Address) with ADDR_MUX, the registers ACC, PC and IR, the X_MUX and Y_MUX multiplexers, the ALU and the Timing and Control block]
Note all registers have clock and enable signals and the multiplexers have select lines. Also the ALU has function select inputs. These all need to be provided by the timing and control circuit.
32
Processor Control and Sequencing
Using our generic registers, each register has two signals, CE and CLK. We need
not consider the clock (CLK) here because it is distributed to all registers in the
same way. We also have a two bit code to generate to specify the ALU function.
There are also some signals to control the memory which are not shown
explicitly. For control purposes the memory can be regarded as just another latch
which can be told to store or output a value. Which value it stores/outputs is
controlled by the address, so this is not a control problem. Note that the
Instruction Register (IR) is always enabled. We have added tristate buffers to
control its access to the “Y” bus. The lower 12 bits are ‘padded’ with four zero
bits whereas the top four bits (the F field) are fed into the Timing/Control unit.
Summary of Control Signals Required
o Address source control Asel
o Clock enables for registers AccEn, PCEn, IREn
o ALU operation selectors M[1:0]
o X-bus source Xsel
o Y-bus source Ysel
o Memory control signals Wen, Ren
The control signals have been given (reasonably appropriate) abbreviated names.
Status & Decisions
The ability to make decisions according to their
calculated state distinguishes computers from simple
automata. A computer is capable of the action:
IF <condition> THEN <something>
Although it may be hidden beneath layers of language syntax, in almost all processors this is implemented as a conditional branch.
This diverts the processor to code (program) in
another part of the memory IF its condition is fulfilled.
33
Flags
MU0 evaluates conditions purely on the state of its accumulator. Some other
processors work in this way, using a ‘condition’ evaluated and stored in a
register. Others, such as ARM, evaluate and store the results of comparisons in a
separate condition code register. These results are usually known as “flags” and
typically represent the result of the last ALU operation. They are independent of
the destination of the result and, indeed, it is usually possible to affect the flags
without any other destination. For example the “CMP” (CoMPare) operation will
perform a subtraction but throws the result away.
Common flags include:
o Sign (Negative)
o Zero
o Carry
o Overflow
o Parity
The ARM contains the first four of these.
Conditional Branching
MU0 has two conditional branches:
• JGE Jump if Acc is positive
• JNE Jump if Acc is not zero
which test properties of the Accumulator
• Positive if the most significant bit is zero
• Non-zero if any bit is a “1” (OR of all bits)
NB In Verilog this can be performed using the reduction form of the logic operator, e.g.
Z <= |Acc; // ORs together all the bits of Acc
If a specific test is required an operation such as SUB can be
performed first to ‘compare’ with a known value.
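As a sketch (the accumulator signal name is assumed), the two conditions could be derived like this in Verilog:

// Branch conditions derived from the Accumulator.
wire acc_positive = ~acc[15];   // JGE: the most significant bit is zero
wire acc_nonzero  =  |acc;      // JNE: reduction OR of all 16 bits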
34
Sign
A copy of the most significant bit of the result. Set for a (two’s complement) negative result.
Zero
Set if the result was zero, otherwise clear.
Carry
Used to store the carry out of the most significant bit of the adder; in a 32-bit processor this would
be the 33rd bit of the result of an addition. If the addition was two unsigned numbers the carry
will be set if the result was too big to represent in the word length. The carry can also be used as
an input to further additions, thus a 32-bit processor can perform a 64-bit addition by adding the
two lower (less significant) words and then adding the higher (more significant) words together
with the carry. Subtraction can also be done following similar rules.
Overflow
The overflow flag will be set if a two’s complement operation produced a result which was not
representable, for example if adding two numbers produced an answer so large that the sign bit
was set producing an (apparently) negative result. Note that this applies to signed numbers only;
the carry flag performs a similar function for unsigned numbers. The CPU is not aware of whether the programmer
thinks numbers are signed or not. It therefore will evaluate both carry and overflow and allow the
program to use one or the other.
Parity
Every word will have a number of “0” and a number of “1” bits. If a word has an even number of
“1”s it is said to have even parity; if not it has odd parity. One or other of these states is
sometimes indicated by the state of a flag bit.
Parity is primarily used for detecting errors in transmitted data where a bit may have been
corrupted (“dropped”); any single bit change in a word changes its parity.
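To tie these descriptions together, here is a hedged Verilog sketch of flag generation around a 32-bit addition. MU0 itself has no flag register, and all the names here are invented for illustration.

// Flag generation for a 32-bit addition: result = a + b + cin.
module flags32 (
  input  [31:0] a, b,
  input         cin,
  output [31:0] result,
  output        n_flag, z_flag, c_flag, v_flag, p_flag
);
  wire [32:0] full = {1'b0, a} + {1'b0, b} + cin;    // the 33rd bit is the carry out

  assign result = full[31:0];
  assign n_flag = result[31];                        // Sign: a copy of the most significant bit
  assign z_flag = (result == 32'b0);                 // Zero: set if the result was zero
  assign c_flag = full[32];                          // Carry: out of the most significant bit
  assign v_flag = (a[31] == b[31]) && (result[31] != a[31]); // Overflow: signed result not representable
  assign p_flag = ~^result;                          // Parity: 1 for an even number of "1"s (one common convention)
endmodule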
Description of Operations
o All instructions execute in two cycles
o The instruction fetch is common to all operations
35
Possible Control Sequences
All the possible instruction execution sequences are summarised in the slide, opposite. A key to the
meaning of the various functions is given here.
This picture can be regarded as a state diagram, although it contains more information.
All instructions (except STP) execute in two (clock) cycles: the first fetches the instruction and
increments the PC, the second decodes and executes the instruction itself. This leads to a very simple,
two state FSM. Let’s label these states “fetch” and “execute”.
If the processor is in the “fetch” state it performs an instruction fetch (a memory read from the address in
the PC with the data being placed in IR); it also increments the PC so the next instruction will be fetched
from a different address.
It does this irrespective of what might happen next. (If the instruction is a JMP the PC increment will be
wasted, but the processor doesn’t know that yet!). It then moves, inevitably, to the “execute” state.
When the processor is in the “execute” state its behaviour is influenced by the “F” field of the fetched
instruction; it follows one of eight possible paths. Unless it has encountered an “STP” it will then return
to fetching the next instruction.
When executing instructions other than STP the control signals are all derived from the “F” field with the
exception of the PC enable. Here this may be influenced by the contents of the Accumulator to allow
conditional branches. Notice that all branches behave in the same way except for the decision to latch the
new PC value or not; this will simplify the logic by reducing the number of cases which need to be
designed.
This picture can be translated into a state transition table which includes all the control signals.
State Transition Table
state  F[2:0]  Next state  IREn  PCEn  AccEn  M[1:0]  Xsel  Ysel  Asel  Ren  Wen
  0     xxx        1         1     1     0      10      1     x     0     1    0
  1     000        0         0     0     1      00      0     0     1     1    0
  1     001        0         0     0     0      xx      0     x     1     0    1
  1     010        0         0     0     1      01      0     0     1     1    0
  1     011        0         0     0     1      11      0     0     1     1    0
  1     100        0         0     1     0      00      0     1     x     0    0
  1     101        0         0     N     0      00      0     1     x     0    0
  1     110        0         0     Z     0      00      0     1     x     0    0
  1     111        1         0     0     0      xx      0     x     x     0    0
36
The state transition table describes the operation of the state machine that
controls the various control lines to the multiplexers and registers. The inputs are
the instruction codes and the state and the outputs are the next state and the
control lines (IREn, PCEn, AccEn, M[1:0], Xsel, Ysel, Asel, Ren and Wen). Thus
we can design the logic required to provide the correct transition to the next state
and the correct outputs.
Notes:
o N and Z are the Negative and Zero state of the Accumulator, respectively.
(used to reduce the size of the table, as drawn)
o If a value is not going to be latched it doesn’t matter what it is!
(e.g. ALU output for STO)
o STP operates by remaining in its evaluation state.
Observations:
o Many control bits are trivial to derive (e.g. IREn = State)
o “Don’t cares” give added freedom (e.g. Asel = State, Ysel = F[2])
o In conditional jumps {JGE, JNE} the jump target is always available (for simplicity)
FSM Implementation
Firstly define the combinatorial control logic.
[Diagram: a combinational logic block with inputs state and IR[15:12] and outputs Asel, AccEn, PCEn, IREn, M[1:0], Xsel, Ysel, Ren and Wen]
always @ (state, pc, ir)
if (state == 0)
begin
Asel = 0; // sel pc
.
.
Ren = 1;
Wen = 0;
end
else
begin
// state must be 1
Ren = 0;
Wen = 0;
// now control depends on instruction
case (ir[15:12])
0: begin
// LDA
Ren = 1;
.
etc.
37
We can implement the state machine using a Verilog always block as shown.
This block is only triggered if the state, pc or ir change.
State Transition
[Diagram: a D-type flip-flop holding the state bit, clocked by clk, with an asynchronous clear driven by reset and a halt input]
always @ (posedge clk or posedge reset)
if (reset)
begin
pc <= 12'h000; // set pc to zero
state <= 0;
// start with fetch
end
else
case (state)
0: begin
// fetch state
Ren <= 1;
.
.
end
1: begin
// decode/execute state
case (ir[15:12]) // action depends on instruction
0: begin
Ren <=1;
.
.
end ……
38
To manage the state transition on system clock we use another always block
triggered by a positive clock transition or a reset.
Timing
An important aspect of this design is that it is fully synchronous.
All state changes happen ‘at the same time’ in:
o Registers {PC, Acc, IR}
o The controlling FSM
o Memory – more of which in a later lecture
No state changes at any times other than an active clock edge.
For example the various control signals begin to be calculated
when the IR is latched and have a complete clock period to settle
before they are used. This allows the system to be analysed in a static manner.
Assumptions
o The clock distribution is good enough that the signal arrives ‘simultaneously’ at every flip-flop
o The clock is slow enough to accommodate the slowest possible set of logic changes
39
Timing
Our MU0 implementation fetches, decodes and executes instructions at a rate of two clock cycles
per instruction (2 CPI). The majority of these cycles include a memory operation.
The clock is a regular square wave; its period is set by the worst case critical path. In order to
find this the operation of each cycle should be examined. Some examples are given below,
although only the major operations {memory, ALU} are accounted for – bus switching times are
ignored for simplicity. The time taken to decode the instruction is also neglected here because the
instruction set is so small/simple; note that in a ‘real’ modern machine this is definitely not the
case!
Instruction fetch
A memory cycle is performed with the result routed to IR. An ALU cycle is also performed, but
this is in parallel with the memory operation. The critical path for this cycle will be whichever is
the longer time.
ADD execution
This clearly requires an ALU operation (the addition), but it first requires one of its operands to be
fetched from memory. The critical path is therefore the sum of the memory and the ALU cycle
times.
STO execution
Only a memory cycle is performed.
JMP execution
The S field of the IR is transferred (via the ALU) to the PC; this can be counted as an ALU
operation. The memory is not used here.
From this (incomplete) analysis it appears that the critical path is the sum of the memory and
ALU cycle times. This would be used to set the clock period. The unused time in other operations
would be wasted.
Fetch
[Diagram: the MU0 datapath during a Fetch cycle – the PC (0C0) is routed through ADDR_MUX to the memory Address, the instruction 00F1 appears on Data Out and is latched into IR, and the ALU increments the PC]
Memory contents:
0C0: 00F1 // LD 0F1H
0F1: 0C50 // data 0C50H
40
Decode/Execute
[Diagram: the MU0 datapath during the Decode/Execute cycle of the load – the operand address 0F1 from IR is routed through ADDR_MUX to the memory Address, and the data 0C50 appears on Data Out and passes through the ALU into ACC]
Memory contents:
0C0: 00F1 // LD 0F1H
0F1: 0C50 // data 0C50H
41
Optimisations
How do we make our computer go faster?
o Improve the technology
Make smaller transistors and put more on a chip
o Improve the implementation
Speed up the clock by shrinking the critical path
o Change the architecture
Restructure the design to do more in a given period
This leads to:
o Faster clock
o Fewer clocks per instruction (CPI)
The following slides introduce some examples of these
techniques.
43
Optimisations
Technology
Since the introduction of integrated circuits (circa 1970) the size of the manufactured features
(transistors, wires, etc.) has been shrinking steadily. Reduced feature size leads to a larger number
of components on a device and faster operation of those components.
The (empirically derived) Moore’s Law observes that the number of components available (and
the overall processing speed available) approximately doubles every 18 months. This is equivalent
to a 10x improvement every 5 years, or about a 1 000 000x improvement from 1970 to 2000.
Implementation
A computer engineer has little influence on where the technology leads. However it is important
both to exploit the available technology and to design efficient circuits with short critical paths.
The implementation will specify such things as the type of latches, flip-flops and registers used,
the internal design of the ALU, the type of multiplexers etc.
Note also that implementations of a given function (such as a processor instruction set) may be
optimised towards different goals. For example a high-speed implementation may be different
from a low-power one.
Architecture
The architecture of the computer is where the designer has, perhaps, the most impact in its
success. The architecture includes all aspects of the hardware from the instruction set design to
the RTL layout of the blocks.
At RTL – which is the aspect which concerns us most here – the objective is to achieve maximum
unit occupancy so that all parts of the system are kept as busy as possible. This is often achieved
through parallelism. An example of parallelism already introduced here is the MU0 instruction
fetch operation, where the PC is used as a memory address and is incremented at the same time.
These operations could be done in series, but then two cycles would be required for every
instruction fetch, considerably slowing the processor’s operation.
In general adding parallelism increases performance. Often, however, extra resources (buses,
multiplexers, registers, functional units, …) are required and the cost can outweigh the benefit.
Reducing CPI
One method of speeding up the processor is to reduce the
average number of clocks per instruction.
There are improvement opportunities in (for example) JMP …
[Diagram: the MU0 datapath, unchanged – memory (Data In, Data Out, Address), IR, ACC, PC, the ALU and the Timing and Control block]
Note:
o Uses existing datapaths
o Requires more complex control/sequencing
o Requires an additional ALU operation (Y + 1)
44
Reducing CPI
In a simple processor such as MU0 there are not many methods of speeding up
the design. However there are some …
For example the memory is not used when executing a JMP.
Old
[Figure: the original arrangement – a Fetch cycle followed by a separate Jump cycle,
each using the full datapath (Data In/Data Out, Address, IR, ACC, PC, ALU,
Timing and Control)]
44
New
[Figure: the combined Fetch/Jump cycle, using the same datapath in a single cycle]
This reduces the number of cycles for a JMP instruction to one, thus reducing the
average number of CPI. Similar optimisations can be applied to conditional
jumps, with different behaviours depending on whether the jump is/is not taken.
The disadvantage of this method is that it makes the control and sequencing more
complicated. Also note that another ALU operation is required: the value sent to
the PC is not the JMP destination (which is already being used) – it is the
following address – thus an extra increment is required using the other ALU
input. (The ‘move’ operation is still needed for LDA.)
Note that this particular modification can be made solely by changing the control
logic; the RTL picture is the same as before.
45
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Go Faster …
[Datapath diagram: memory Data In/Data Out and Address buses, IR, DIN, ACC, PC,
ALU, Timing and Control]
In our MU0 the critical path includes both the memory and the ALU.
Adding a register (Din) can break the critical path (roughly) in half.
o The cost is an extra clock cycle.
o The benefit is that the clock can be (nearly) twice as fast.
An ADD takes three cycles (instead of two) at ~twice the speed.
46
MU0 Timing Analysis
As a simplification let’s assume that the only blocks that impose significant delays are the
memory and the ALU. This is a reasonable approximation for this design. Furthermore let us
apply some values to these, say:
o The memory (read or write) takes 10ns
o The ALU (any function) takes 8ns
By examining all the possible cycles the critical path, and hence the clock speed, can be
determined.
o Instruction fetch uses memory and ALU in parallel, therefore requires
10ns
o LDA/ADD/SUB use memory and ALU in series, requiring a total of 18ns
o STO uses only memory and so requires 10ns
o JMP uses only ALU (original architecture) for a critical path of 8ns
The ‘improved’ architecture has a parallel memory cycle, increasing this to 10ns
o JGE/JNE are analogous to JMP, or faster if the jump is not taken
The worst case is therefore 18ns, which sets the clock period; (the frequency would therefore be
about 56 MHz). Executing an ADD instruction requires two clock cycles or 36ns (fetch, then
execute) as the clock is a constant frequency.
Note that a lot of time is wasted in some cycles!
46
A Faster Implementation
If an extra latch (Din) was added to the RTL design adjacent to the IR the
operand read from memory could be stored temporarily on its way to the ALU
(see slide). This would require an extra cycle to execute an ADD operation
{instruction fetch, operand fetch (to Din), ADD (from Din)} which doesn’t sound
sensible in terms of accelerating execution!
However the critical path is now reduced to the memory cycle time (10ns) so the
clock can run faster (100 MHz). The ADD operation as a whole can therefore
complete in 30ns – a 17% speed up. Furthermore not all operations require
operands from memory, so a STO (for example) could still be done in two cycles
or 20ns – nearly twice as fast as before!
Although not possible with the existing architecture it would be possible to
execute a LDA in two cycles too; what added buses/multiplexers would be
required?
The disadvantage with such modifications is, of course, added complexity (and,
hence, development cost).
Reducing Execution Time
Although it is beneficial to make instructions go faster what a user wants is for a
program to go faster; this is not quite the same thing. The program is made up of
instructions in some mixture; for instance LDAs might be more common than
JMPs (or vice versa).
In reducing execution time it is therefore more important to optimise the more
common operations.
In this context “more common” means those encountered most dynamically (i.e.
as the program runs) rather than statically (counting through a program listing).
This, of course, varies considerably, depending on the application program …
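For example (with figures invented purely for illustration): if, as a program runs, 70% of the
executed instructions take 2 cycles and 30% take 3 cycles, the average CPI is 0.7 × 2 + 0.3 × 3 = 2.3.
Saving one cycle from the common (2-cycle) instructions reduces this to 0.7 × 1 + 0.3 × 3 = 1.6,
whereas saving a cycle from the rarer (3-cycle) ones only reduces it to 0.7 × 2 + 0.3 × 2 = 2.0.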
47
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Implementation
Consider the full adder:
The carry output is either a copy of the carry input or it is
independent of the carry input (and therefore available at once).
48
Consider a two bit adder:
Note here that Cout can ‘ripple’ across two bit positions at a time.
48
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Look-Ahead Carry
(Almost) as soon as the A and B inputs arrive it can be
predicted that Cout will be:
o Zero (carry “killed”)
o Cin (carry “propagated”)
o One (Carry “generated”)
This can be extended across more than one bit (see notes).
This scheme is called carry look-ahead.
49
Furthermore this reasoning can be applied recursively to bigger and bigger blocks
…
Verify that, in this circuit, the maximum logic depth is six ‘blocks’.
What would the maximum depth be for a 16-bit adder?
As the carry has fewer logic blocks (therefore gates) to negotiate it can reach the
more significant bits more rapidly (the benefits improve as the width increases).
This reduces the critical path and makes the whole adder faster…
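As an illustrative sketch in Verilog (ours; the names g and p are the conventional
“generate” and “propagate” signals, not taken from the slide), one bit of the
look-ahead logic could be written as:

module cla_bit(a, b, cin, cout);
input  a, b, cin;
output cout;
wire g = a & b;               // carry "generated": Cout = 1 whatever Cin is
wire p = a ^ b;               // carry "propagated": Cout = Cin
// (the carry is "killed" when neither g nor p is true: Cout = 0)
assign cout = g | (p & cin);  // available one gate level after g and p
endmodule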
49
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Parallelism
Parallelism involves executing more than one instruction at
once.
It is not the purpose of this course to discuss parallel computer
architectures.
Suffice it to say that there are two possible means of
achieving parallel execution:
o Starting one instruction before the previous one is
complete (“pipelining”)
o Starting several instructions at the same time
(“superscalar” & multi-processor)
50
Parallelism
Parallelism is something exploited extensively and at all levels in hardware
design. For example adding carry look-ahead logic increases the number of gates
in the adder but decreases the overall addition time because more gates are
switching in parallel.
We have also applied parallelism at RTL by incrementing the PC in parallel with
the instruction fetch and, later, fetching an instruction while still executing a
JMP.
However usually the word “parallelism” is applied more explicitly at the
architectural level. A naive example of this would be using two complete
processors to go “twice as fast”. This may work if we have two independent tasks
but, in general, we haven’t. A system with twice the cost would therefore give
less than twice the performance.
Much of the art of designing parallel systems is finding a ‘sensible’ balance
between the hardware investment and the performance return. Two commonly
employed techniques are given below.
50
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Parallelism cont.
51
Pipelining
A common analogue of a processor pipeline is the process of washing clothes.
When washing several loads, the second load can go into the washing machine
as the first load goes into the drier. This means that two loads can be at different
stages of ‘processing’ at the same time – an example of parallelism.
This is a relatively ‘cheap’ solution if you were going to use both machines
anyway. However notice that our MU0 uses the same hardware for both fetching
and executing instructions (e.g. the same ALU increments the PC and ADDs the
data) and could not be pipelined without adding extra hardware; it is more like the
combined ‘washer/drier’ situation!
Multiple issue
By adding extra hardware it is possible to execute more than one instruction at
once. With two decoders and two ALUs two instructions may be fetched and
decoded together. This, potentially, doubles the processor speed for roughly twice
the hardware cost.
In practice things are not so simple because it is not always possible to issue two
instructions concurrently1; for instance if the ‘first’ instruction was a JMP then
the other instruction would be wasted anyway. There can also be dependencies
where the second instruction needs the result from the first and therefore has to
wait (and hardware has to be added to detect this). Trying to issue two
instructions at once therefore gives less than twice the speed at more than twice
the cost. Nevertheless attempting to issue two, four, or even more instructions
together is quite common in high-performance processors.
1. “Concurrently” – at the same time.
51
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
ALU Enhancements
o MU0 has simple instructions involving only ADD and
SUBTRACT arithmetic operations.
o The range of operations could easily be expanded to
include Boolean logic operations, shift operations and
extra arithmetic operations such as multiply. (Division is
too complicated for consideration on this course.)
o The ALU would require more control bits (e.g. M[3:0])
Such operations could use some of the spare instruction codes.
52
ALU Enhancements
Bitwise Logic Operations
These are Boolean operations applied to each of the bits of the two values presented to the ALU.
Operand bits are paired with others at the same position (significance) in the words, hence the
expression “bitwise” operation. Unlike addition, which propagates a carry, each set of bits is
independent.
e.g. the AND operation, for all values of i between 0 and 15, would yield: Z-bus[i] = X-bus[i] & Y-bus[i]
For example an AND operation:–
0011 0101 1001 1110
AND
0101 0110 1111 0010
yields 0001 0100 1001 0010
Relatively simple changes to the ALU are needed to implement AND, OR, XOR etc. These are
normally done by selecting different functions using a multiplexer.
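As a sketch of the idea (the module and the 2-bit function encoding are invented here
purely for illustration and are not MU0’s actual control bits), a multiplexed-function
ALU might look like:

module alu16(x, y, m, z);
input  [15:0] x, y;
input  [1:0]  m;        // hypothetical function-select code
output [15:0] z;
reg    [15:0] z;
always @ (x or y or m)
  case (m)
    2'b00: z = x + y;   // ADD
    2'b01: z = x - y;   // SUB
    2'b10: z = x & y;   // AND – each bit pair handled independently
    2'b11: z = x ^ y;   // XOR
  endcase
endmodule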
Signal Preconditioning
It is usual to locate the logic functions after the input preconditioning T/C 0/1s. This allows (for
example) operations such as the ARM’s Bit Clear (BIC) instruction.
Result = A and not(B)
This can also provide alternative codings for operations such as MOV, which may simplify the
decode logic. For example in the initial MU0 design a move operation was coded as 0+Y; it could
also be coded as 0 OR Y, -1 AND Y, etc. (remembering -1 = 1111111111111111).
52
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Bit-Wise Logic Operations
These are easy to add to an ALU
53
A possible optimisation
The bit-wise XOR function could be implemented by disabling the carry between
the 1-bit full adders. The “SUM” outputs will then be the XOR of the X and Y
bits presented to the adder. Review the full adder circuit to see how this could be
done.
53
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Shift Operations
In decimal it is very easy to multiply or divide numbers by ten.
e.g. 123 x 10 = 1230
The above operation has shifted the input left by one place.
In binary it is very easy to multiply or divide numbers by two.
e.g. 11011010₂ ÷ 10₂ = 01101101₂
The above operation has shifted the input right by one
place.
54
Shift Operations
What are shift operations?
Shift operations are movements of the bits within a word. For example:
Shift left, one place
A left shift, as shown above, moves all the bits to a more significant position; thus left shifting a
number by one place is equivalent to multiplication by two. Similarly shifting left two places is
multiplication by four (assuming no bits are ‘lost’ at the most significant end).
Contrariwise shifting right is equivalent to dividing by powers of two.
When shifting left it is normal to fill the ‘vacant’ position(s) in the least significant bit(s) with
zero(s). When shifting right this rule can also be obeyed with the most significant bits; this will
divide correctly (subject to remainders) for positive or unsigned numbers, but not for two’s
complement negative numbers where shifting a zero in makes the number positive. To avoid this
there are often two forms of right shift provided:
o Logical shift right (LSR) – shift in zero
o Arithmetic shift right (ASR) – shift in copies of the existing MSB
A couple of minutes running through a couple of examples of left and the various right shifts is
probably time well spent!
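As a small Verilog illustration (ours, not part of MU0) of the three shifts on a
16-bit word:

module shifts(a, lsl, lsr, asr);
input  [15:0] a;
output [15:0] lsl, lsr, asr;
assign lsl = {a[14:0], 1'b0};   // shift left  – zero into the LSB
assign lsr = {1'b0, a[15:1]};   // logical shift right – zero into the MSB
assign asr = {a[15], a[15:1]};  // arithmetic shift right – copy the sign bit
endmodule

For example a = 1111 0000 0000 0110 gives asr = 1111 1000 0000 0011 but
lsr = 0111 1000 0000 0011 – only the arithmetic shift keeps the number negative.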
54
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Practical Shift Operations
In practice there are some limitations due to operands being
finite:
o a left shift loses its Most Significant Bit (MSB)
the answer will be ‘wrong’ if this bit was “1”
o a right shift loses its Least Significant Bit (LSB)
the answer will be ‘wrong’ if this bit was “1”
i.e. if the number was odd.
As well as losing a bit, a bit must be shifted in.
o Left shifts shift in zero
o Right shifts either shift in zero or copy the MSB
the latter case preserves the two’s complement sign bit
o A rotation (either way) shifts the ‘lost bit’ back in
55
When are they useful?
Primarily in multiplication and division algorithms. They are also used in
graphics, cryptography, in fact lots of “bit fiddling” operations.
What else can I do?
Rotation is another common ‘shift’ operation. Rotating left and right is just like
shifting except the bits that ‘fall off’ one end of the operation ‘wrap back’ onto
the other end.
Barrel shifting
Many processors provide shift operations which move the bits only one place per
instruction.
A “barrel” shifter is a device which shifts bits an arbitrary number of places.
The ARM processor used in CS1031 has a barrel shifter.
55
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Implementing Shifts
It is usual to treat a shift as an ALU-type function.
This can be a single operand (one place) shift or shift a specified
number of places (two operands).
A shift is merely a rearrangement of the bits; it requires no logic!
A single place shift can be done purely with multiplexers.
o The input selection on the multiplexers is
common to all; these selects are decoded from the
chosen function
o In practice these may be part of a larger
multiplexer (see slide on bitwise operations)
o The inputs at each ‘end’ of the row are wired
appropriately for the selected shift
56
Implementing Shifts
The slide shows part of the
‘middle’ of a one place left or
right shifter. Clearly there are
some ‘dangling’ connections at each end.
The bit input shifted left into
the LSB is always “0”. The bit
shifted right into the MSB is a
“0” for logical shift right, but
for arithmetic shifts the MSB
or sign bit is copied here as
well as into bit 14.
Shifts of more than a single bit position are also possible. This is sometimes done by the control
logic repeating a one-place shift the correct number of times, a solution which requires little extra
hardware but takes several cycles to complete.
A true barrel shifter can shift any number of places in a single cycle. This requires considerable
extra hardware because all the multiplexers get much larger (the wiring is more complex too) and
so it exacts a significant cost1.
1. In practice there are methods of implementing such multiplexers more cheaply, but they are
still costly.
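A possible Verilog sketch of a single-place shifter (ours; the select encoding is
invented for illustration) – one multiplexer per output bit, with the ‘ends’ wired
as described above:

module shifter1(a, sel, y);
input  [15:0] a;
input  [1:0]  sel;       // 00 pass, 01 shift left, 10 LSR, 11 ASR (encoding invented here)
output [15:0] y;
// the 'ends' of the row are wired appropriately for each shift:
wire [15:0] lsl = {a[14:0], 1'b0};                          // zero into the LSB
wire [15:0] shr = {(sel == 2'b11) ? a[15] : 1'b0, a[15:1]}; // zero (LSR) or sign bit (ASR) into the MSB
genvar i;
generate
  for (i = 0; i < 16; i = i + 1) begin : bit_mux
    // one multiplexer per output bit; the select lines are common to all bits
    assign y[i] = (sel == 2'b00) ? a[i]   :
                  (sel == 2'b01) ? lsl[i] : shr[i];
  end
endgenerate
endmodule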
56
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Shift Registers
[Diagram: the most significant bits of registers A and Q built from D-type flip-flops
with input multiplexers; inputs shown include A15, A14, A13 and the carry C]
Note registers can be parallel loaded or implement a shift depending on the state of
the multiplexers.
Can implement a shift in Verilog using concatenation { }
A <= {C,A[15:1]};     //Shift right A
Q <= {A[0],Q[15:1]};  //Shift right Q
57
Shift registers
A shift register is similar to a ‘normal’ register, which always inputs and outputs
all its bits in parallel. A shift register has the added feature of being able to move
bits in or out one at a time, typically by ‘shuffling’ all the bits one way so that
one ‘falls off the end’ on each successive cycle. Of course an extra control bit is
also needed to enable this function.
Although it would be possible to use one, shift operations are not normally
implemented using shift registers; a shift register contains more circuitry than an
ordinary register and is therefore more expensive. While not significant in our
MU0 this would be a considerable overhead on all the registers in a processor
like an ARM.
Shift registers are primarily used in interfacing to I/O devices; some examples
will appear later in the course. They are occasionally useful in other tasks, such
as multiplication and division,
57
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Multiplication
Multiplication is a more complex arithmetic operation than
addition. Multiplication is repeated addition:
o To multiply two numbers, N and M, start with zero
and add N to it M times
This works, but is very slow for big numbers. A short cut –
long multiplication
58
An algorithm (worked example: M = 237, N = 863)
1. Start with zero in an accumulator
2. Make X the least significant digit of N
3. Multiply M by X – add the result to the accumulator
4. Multiply M by 10
5. Make X the next least significant digit of N
6. If unfinished then repeat from step 3.
7. Done

      M        x X    Accumulator
   000237             000000      (start)
   000237   x 3       000711      (add 237 x 3)
   002370             000711      (M x 10)
   002370   x 6       014931      (add 2370 x 6)
   023700             014931      (M x 10)
   023700   x 8       204531      (add 23700 x 8)
Points to note
o This loops three times (the number of digits in N) not 863 times (the value of N).
o Only multiplications by a single digit (or by ten) are required.
o Multiplying by 10 is trivial.
o Only an addition of two numbers (new partial product and accumulator)
is needed in any one step.
o The result has more digits than either of the operands
(in general as many digits as both the operands combined, or six in this case).
Binary multiplication
The same algorithm can be used for binary multiplication. The only differences are:
o The digits are single bits.
o Multiplication is only ever by 0 (easy) or 1 (also easy)
the multiplication outcome is therefore either 0 or M.
o M is multiplied by 2 – this is a one place left shift (and trivial).
Note also that X can remain the least significant bit of N if N is right-shifted at each step.
The algorithm now becomes
1. Start with zero in an accumulator
2. Make X the least significant bit of N
3. Multiply M by X – add the result to the accumulator
4. Multiply M by 2 (a one place left shift)
5. Make X the next least significant bit of N
6. If unfinished then repeat from step 3.
7. Done
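A behavioural Verilog sketch of these steps (ours; a hardware implementation appears
on a later slide):

module mult_sketch;
  // behavioural shift-and-add, following the steps in the notes
  function [15:0] multiply;
    input [7:0] m, n;              // multiplicand and multiplier
    reg [15:0] acc, mm;
    reg [7:0]  nn;
    begin
      acc = 0;
      mm  = m;                     // M, widened so it can be shifted left
      nn  = n;                     // N, shifted right so X is always bit 0
      while (nn != 0) begin        // 'early termination' when N reaches zero
        if (nn[0]) acc = acc + mm; // step 3: add M x X (X is 0 or 1)
        mm = mm << 1;              // step 4: multiply M by two
        nn = nn >> 1;              // step 5: move to the next bit of N
      end
      multiply = acc;
    end
  endfunction
  initial $display("237 x 101 = %d", multiply(8'd237, 8'd101));  // prints 23937
endmodule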
58
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Long Multiplication
This multiplication algorithm is (in general) much faster.
No doubt it is already familiar!
The same algorithm can be applied to binary numbers.
With binary digits the only multiplications needed are x0 and
x1. This is easy since x0 gives 0 and x1 leaves the original
value – so we need to add in 0 or the original multiplicand
depending on the value of the digit we are looking at.
59
More Aspects of Multiplication
Termination
The loop described above iterates for every digit (bit) in the N operand. Thus a counter can be used to control
and terminate the operation. However using the procedure described for the binary operation N is continually
being divided by two (and integer division discards any fractions) and so will eventually become zero. This
can be used to indicate completion because any subsequent cycles can only ever add zero to the total – and
we might as well not bother. This is sometimes known as early termination because, in many cases, fewer
cycles are performed than there are bits in the operands. This gives shorter multiplication times as well as
easier control (no counter needed).
Modulo arithmetic
A general case addition of two 32-bit operands can require up to 33 bits to hold its result, because of the
carry out. This is true whether the addition is unsigned or two’s complement.
A general case multiplication of two 32-bit operands can require up to 64 bits to hold its result.
Often such results are mapped back into a register (variable) of the same width as the operands. This results
in modulo arithmetic, where bits may be ‘lost’ off the top (most significant) end of the number.
In the examples above it is quite likely that the operations would be performed “modulo 2³²”.
This is the same as dividing the result by 2³² and keeping the remainder, hence the name.
Negative numbers
The multiplication algorithm described only works with positive or unsigned numbers. A simple extension to
cope with signed numbers would be to convert the operands to positive numbers and then make the result
negative if the operands had different signs. This is the normal method with decimal numbers.
However if modulo arithmetic truncates the result to the same length as the operands then the algorithm will
work anyway. Any potential errors occur in the high-order bits which are truncated.
(Try it!)
If the full answer is required this ‘trick’ can be exploited by first sign extending1 the operands to the full
word length and truncating at the end.
1. Extending the number to the left with copies of the existing MSB, so the sign bit is preserved.
59
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
A Sequential Multiplier
The implementation of this multiplication algorithm may be done
in software or hardware.
A hardware implementation is shown here.
A software implementation may be coded simply by following the steps outlined in the notes.
[Block diagram: multiplicand register B feeding an ADDER (carry in ‘0’, Carry out to C);
the product accumulates in registers A and Q; the multiplier is loaded into Q and its
least significant bit Q0 is tested; a P counter is loaded with ‘n’ and P==0 gives Z;
an FSM controller, started by S, sequences the datapath and raises Done]
60
Other multipliers
The algorithm described is not the only way to build a multiplier. A number of
other schemes employing the same basic ‘shift and add’ approach exist, but
different operands may be shifted in different directions.
You may meet a different implementation in a reference book; however the
principle will be the same.
60
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Multiplier
This is a serial multiplier:
o A number of steps (clocks) are performed
o Only one adder is required
Simple FSM used for control.
Almost a processor datapath in itself!
61
Verilog listing for 8x8 bit multiplier
//HDL COMP 10211 Multiply Example
//---------------------------------------
//RTL description of binary multiplier
//Block diagram in notes
//n = 8 to halt after all bits done
module mltp(S, CLK, Clr, Binput, Qinput, C, A, Q, P, Done);
input        S, CLK, Clr;
input  [7:0] Binput, Qinput;          //Data inputs
output       C, Done;
output [7:0] A, Q;
output [3:0] P;
//System registers
reg        C, Done;
reg  [7:0] A, Q, B;
reg  [3:0] P;
reg  [1:0] pstate, nstate;            //control register
parameter T0=2'b00, T1=2'b01, T2=2'b10, T3=2'b11;
//Combinational circuit
wire Z;
assign Z = ~|P;                       //Check for zero
//State transition for control
//See state diagram in notes
always @(negedge CLK or negedge Clr)
  if (~Clr) pstate <= T0;
  else      pstate <= nstate;
always @(S or Z or pstate)
  case (pstate)
    T0: if (S) nstate = T1; else nstate = T0;
    T1: nstate = T2;
    T2: nstate = T3;
    T3: if (Z) nstate = T0; else nstate = T2;
  endcase
//Register transfer operations
//See register operation Fig.8-15(b)
always @(negedge CLK)
  case (pstate)
    T0: B <= Binput;                  //Input multiplicand
    T1: begin
          A <= 8'b00000000;
          C <= 1'b0;
          P <= 4'b1000;               //Initialize counter to n=8
          Q <= Qinput;                //Input multiplier
        end
    T2: begin
          P <= P - 4'b0001;           //Decrement counter
          if (Q[0])
            {C,A} <= A + B;           //Add multiplicand
        end
    T3: begin
          C <= 1'b0;                  //Clear C
          A <= {C,A[7:1]};            //Shift right A
          Q <= {A[0],Q[7:1]};         //Shift right Q
        end
  endcase
//Done is driven only from here (Z = 1 when the count reaches zero)
always @(negedge CLK)
  Done <= Z;
endmodule
61
// Testbench for HDL Multiply COMP 10211
//---------------------------------------
//Testing binary multiplier
module test_mltp;
//Inputs for multiplier
reg S,CLK,Clr;
reg [7:0] Binput,Qinput;
//Data for display
wire C;
wire [7:0] A,Q;
wire [3:0] P;
wire Done;
//Instantiate multiplier
mltp mp(S,CLK,Clr,Binput,Qinput,C,A,Q,P,Done);
initial
begin
S=0; CLK=0; Clr=0;
#5 S=1; Clr=1;
Binput = 8'b00010111;
Qinput = 8'b00010011;
#15 S = 0;
end
initial
begin
repeat (46)
#5 CLK = ~CLK;
end
endmodule
62
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Processor Design – a Summary
• A processor can be quite simple to design
– an entire processor can be described down to gate level in a few
lectures
• A processor has a datapath which does the processing
– an RTL (Register Transfer Level) design
– many (16-, 32-, …) bits wide, but regular structures
– the datapath may account for 90%+ of the gates
– therefore it is designed and optimised first
• The datapath needs control logic
– an FSM (Finite State Machine)
– the control provides steering and timing for the datapath
– relatively few gates, but more complex structures
• All CPUs are built this way
– it’s just that the instruction set gets bigger and the number of
optimisations increases.
63
Multiplication Exercises
Have a go at these:
0010 x 0010
0110 x 0101
0011 x 1110 (unsigned)
0011 x 1110 (signed)
63
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory
“Thanks for the memories”
64
Memory
When a computer program is operating it needs to keep its data somewhere.
There may be some registers (such as “Acc” in MU0) but these are not usually
enough and a larger memory (or “store”) is required.
The computer program itself must also be kept somewhere. The earliest
programmable devices were weaving looms. In 1804 Joseph Marie Jacquard
invented the Jacquard Loom in Lyon. This used a string of punched cards as a
program; a binary hole/no hole system allowed complex patterns to be woven
with the cards being advanced as the loom ran. Many later devices have also used
this concept, notably the pianola or player piano (1895).
64
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory
• The CPU is the most complex unit in a computer.
• The memory is the largest.
Producing an adequately sized, adequately fast memory has
always been a serious challenge in computer design. The
Manchester Small Scale Experimental Machine (SSEM)(1948)
or ‘Baby’ was the world’s first stored program computer.
This was built specifically as a test for the memory devices.
A stored program computer is one where the program resides
within the memory and therefore can also be treated as data.
This means the memory has a shared function:
• It contains data
• It contains programs
and memory cycles are allocated to each function.
This is often known as the von Neumann architecture.
65
The stored program concept
The concept of a ‘stored program’ is attributed to John von Neumann. Put simply
it says: “Instructions can be represented by numbers and stored in the same
way as data.” Thus a bit pattern 01000101 might represent the number 45₁₆ or
the ASCII code for the letter “E” as data but it could also be used to tell a
processor to perform a multiplication.
This has led to the so called “von Neumann” architecture which is followed by
almost all modern computers where a single memory holds values which can be
interpreted as data or as instructions by the processor.
Whilst it is rare that the same memory locations are used as instructions and data
it does happen. The most notable case is when a program is loaded and executed:
the loader fetches words from an I/O device (e.g. disc) which it treats as data and
puts into memory; the same values are interpreted as instructions when execution
starts.
65
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
The von Neumann Architecture
One memory suits all requirements.
This is the model we already assumed for our MU0 processor.
66
Johann (“John”) von Neumann (1903-1957)
• 1903: born, Budapest, 28th December, son of a banking family.
• 1910: could divide 8-digit numbers in his head.
• 1921: entered University of Budapest to study Chemistry. Published
first paper.
• 1928: completed doctorate in Mathematics.
• 1930: Moved to Princeton.
• 1933: One of six original professors when Princeton Institute for
Advanced Studies (IAS) founded. (Alan Turing studied
mathematics here 1936-8.)
• Engaged in war work on several national committees.
• 1945: wrote “The First Draft of a Report on the EDVAC”, which
introduced the stored program concept.
• 1945+: worked with Los Alamos on H-bomb issues.
• 1950s: consultant to IBM.
• 1952: designed MANIAC I.
• 1954: appointed to the U.S. Atomic Energy Commission.
• 1956: won the Enrico Fermi Award for outstanding contributions to
the theory and design of electronic computers.
• 1957: died 8th February.
66
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
von Neumann Architecture
Note that here:
• the term “memory” is applied to the store which is directly
addressable by the processor
• other forms of store (such as discs) are not considered
This is (currently) the most common computer architecture.
PCs, workstations, etc. all work with this model.
(Detailed implementations vary though!).
Note that there are other architectures e.g. the “Harvard
Architecture” where data and instruction storage are
separated.
67
RAM
The address space of a computer such as our MU0 will normally contain Random Access
Memory or RAM. RAM is memory where any location can be used at any time. MU0 has a 12-bit address bus and so can address up to 4Kwords of memory, each word being 16 bits wide. As
this is a small memory by modern standards it is likely (now) that all the words would be
implemented (although a few locations must be reserved for I/O or there would be no way to
communicate with the computer).
Back in 1948 4Kwords (64 Kbits) would have seemed a very large memory which would require
many memory devices to fill. By the end of the 20th century the largest RAM devices reached
256 Mbits so one device could provide for 4000 MU0s!
To come more up to date we shall use the ARM address model instead. ARM produces byte
addresses and has a 32-bit address space, which allows the addressing of 2³² separate bytes.
However as instructions and most data are 32-bits (4 bytes) wide it is normal to read or write four
bytes in parallel. We will therefore regard the ARM as having a 30-bit address space (the last 2
bits can specify one of the four bytes).
Thirty address bits allow the addressing of 2³⁰ separate words or 1 Gword. This is larger than
contemporary devices (256 Mbits ⇒ 8 Mwords); it is therefore necessary to be able to map
several memory devices into the address space. Not all these devices may be fitted in every
system; whether the memory space is ‘fully populated’ or not depends on the needs and the
budget of the owner.
Definitions and usage
o RAM – Random Access Memory; by convention (& slightly incorrectly) used for memory
which is readable and writeable. Most modern RAM ‘forgets’ when the power is turned off.
o ROM – Read Only Memory; usually a variation of RAM which cannot be written to; used to
hold fixed programs. As it cannot be written to its contents must be permanent.
In addition there are forms of serial access memory (such as magnetic tape, disc etc).
67
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Devices – in Principle
Basically a memory is a large number of flip-flops.
Here four 4-bit memory
words are shown as a bank
of registers.
The address is used either:
o to enable the data into a
specific register, or
o to output enable a specific
register
With appropriate (external)
control the data bus could
be shared and bidirectional.
68
Addressing
Within the CPU it is common for several things to happen in parallel; the memory only performs
one operation at a time.
This operation requires the answers to the questions:
• Do what? – Control (read or write)
• With what? – Data
• Where? – Address
Because only one operation is happening at a time the control signals and the data bus can be shared over the
whole memory.
The address bus provides a code to specify which location is being used (“addressed”).
Some definitions:
• Byte – now standardised as eight bits.
• Word – the ‘natural’ size of operands, which varies from processor to processor
(16 bits in MU0, 32 bits in ARM). Usually the width of the data bus.
• Nibble – four bits or half a byte (sometimes “nybble”)
• Width – the number of bits in a bus, register or other RTL block.
• Address range– the number of elements which can be addressed.
• Type – what the data represents. This is really a software concept in that the hardware (usually)
does not care whether a word is to be interpreted as an instruction, an integer, a ‘float’, an address
(pointer) etc. This may, however, influence the size of the transfer (byte, word, etc.).
The figure shows part of a memory; four words of four bits each are depicted (although the decoders imply
that another four words are omitted). The bits in each word are stacked vertically; note that the write enables
and the read enables (to the tristate buffers) are common across each word. The words can be made as wide
as required in this way.
The width of the memory is normally the same as the width of the CPU’s datapath, but it may not always be
so; for example some high-performance processors use wider memory so that they can fetch two (four, …)
instructions simultaneously.
Questions:
What are the advantages of fetching several instructions in a single cycle?
What are the disadvantages?
68
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Tri-State Devices and Bidirectional Busses
Tri-state devices have 3 states – ‘0’, ‘1’ and ‘off’ – and can replace
multiplexers.
[Figures: a basic tri-state buffer (In, Out, En); two buffers with enables EnA and EnB
selecting either the A or the B input onto a shared bus wire (replaces a MUX); and a
bidirectional bus connection with separate EnRead and EnWrite buffers]
69
Tristate signals
Tristate signals and gates are introduced here. These will be referred to in the
following lectures. Tristate signals are used as a convenient method of controlling
and switching buses. At this point it is enough to know that they exist. As a good,
general rule tristate signals should not be used in control circuits.
The switching of a tristate output is digitally controlled – another input signal is
used as an enable. If the enable is true then the output is enabled (‘on’), if the
enable is false then the output is disabled (‘off’). The enable is usually drawn
entering the side of the gate so that it can easily be distinguished.
The enable may be active high or active low; an active low signal is usually
drawn with a ‘bubble’ on the connection.
A ‘normal’ buffer does nothing to the logic signal; the output is always the same
as the input.
Such buffers are used to match electrical properties of the circuits and are an
implementation issue which does not concern this course. Unlike ‘normal’
outputs tristate outputs may be connected together. In general a signal should be
driven, so tristate outputs are used for multiplexing two or more signals together.
No more than one tristate output should be enabled onto a net at any time.
The usual designation for the third state of an output is “high-impedance” or
simply “tristate”; it is usually abbreviated to “Z”.
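In Verilog a tri-state buffer can be sketched as below (module name ours). Two such
buffers, with enables EnA and EnB, can drive a shared bus wire provided no more than
one enable is true at a time.

module tristate_buf(in, en, out);
input  in, en;
output out;
assign out = en ? in : 1'bz;   // 'off' (high impedance, "Z") when not enabled
endmodule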
69
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Address Decoding
The CPU ‘addresses’ (talks to) one memory location at a
time.
It has a large number to choose from.
It specifies which location with a single number on the
address bus.
Eventually this number must be coded into a true/false
select for every possible location.
Either:
o all selects are false
or
o one select is true and all the others are false
70
Address Decoding
An address is coded as a binary number to minimise the number of bits/wires
required.
The memory requires a word select as a “1-of-N” code.
The conversion is performed by an address decoder.
70
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Decoders
o The address selects which output
may become active
o The enable(s) allow that output to be active
  – Multiple enables are usually ANDed together
71
A simple three to eight decoder described in Verilog:
module three_to_eight(addr_in, enable, sel_out);
input  [2:0] addr_in;
input        enable;
output [7:0] sel_out;
wire   [7:0] sel_out;
// nested conditional operator (?) used here
assign sel_out = enable ? (
    (addr_in == 0) ? 8'b00000001 :
    (addr_in == 1) ? 8'b00000010 :
    (addr_in == 2) ? 8'b00000100 :
    (addr_in == 3) ? 8'b00001000 :
    (addr_in == 4) ? 8'b00010000 :
    (addr_in == 5) ? 8'b00100000 :
    (addr_in == 6) ? 8'b01000000 :
                     8'b10000000
    ) : 0;
endmodule
71
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Decoder in Verilog
module three_to_eight(addr_in, enable, sel_out);
input  [2:0] addr_in;
input        enable;
output [7:0] sel_out;
reg    [7:0] sel_out;
always @ (addr_in or enable)
  if (enable)
    case (addr_in)
      0: sel_out = 8'b00000001;
      1: sel_out = 8'b00000010;
      2: sel_out = 8'b00000100;
      3: sel_out = 8'b00001000;
      4: sel_out = 8'b00010000;
      5: sel_out = 8'b00100000;
      6: sel_out = 8'b01000000;
      7: sel_out = 8'b10000000;
    endcase
  else
    sel_out = 0;
endmodule
72
Address Decoding
It is often both inconvenient and impractical to decode the entire address bus in a
single decoder. Instead a hierarchical approach is used:
Here the first decoder is used to enable one of the next set of decoders to give a
6-to-64 decoder (not all of which is shown). This can be extended further if
required.
In practice the decoders need to be very large, but the last stage of decoding
(which could be decoding around 20 address lines!) is built into the memory
device. The designer only needs to produce the equivalent of the first level of the
address decoder which selects which memory device is active. This is described
in more detail later.
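As a sketch (module name ours), the three_to_eight decoder above can be reused
hierarchically to build the 6-to-64 decoder just described:

module six_to_sixtyfour(addr_in, enable, sel_out);
input  [5:0]  addr_in;
input         enable;
output [63:0] sel_out;
wire   [7:0]  bank_sel;
// first level: the top three address bits enable one of the second-level decoders
three_to_eight first (addr_in[5:3], enable, bank_sel);
genvar i;
generate
  for (i = 0; i < 8; i = i + 1) begin : second_level
    three_to_eight second (addr_in[2:0], bank_sel[i], sel_out[8*i +: 8]);
  end
endgenerate
endmodule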
72
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Real Memory Devices
Using edge-triggered D-type flip-flops:
o is often fine for registers (e.g. 3 or 4 in MU0, <50 in ARM)
o is too expensive for ‘main’ memory (millions of locations)
Memories use special design techniques to squash as many
bits together as possible.
The ‘densest’ RAM currently in use is Dynamic RAM or DRAM
o This has a number of awkward characteristics
The ‘easiest’ RAM currently in use is Static RAM or SRAM
o Fewer bits/chip than DRAM
o Faster
o Simpler
We will therefore only examine SRAM in detail here!
73
Commodity Memories
All von Neumann computers need memory. Sometimes their needs are small – an embedded controller
operating a central heating system probably needs only a few bytes of RAM – but others need many
megabytes. Even a heating controller may need a kilobyte or so of program memory.
Small memories (a few Kbytes) are often constructed on the same chip as the processor, I/O etc. Large
memories will need one or more separate, dedicated devices.
The figure below illustrates why D-type flip-flops are not used for mass storage!
In practice both the SRAM and DRAM need other circuits (such as amplifiers) to interface them to
computational circuits. However the overhead is small because a few amplifiers can be shared by many
thousands of bits of store.
A bit of ROM will be roughly the same size as a bit of DRAM or SRAM, depending on the technology
employed.
When building a system the cost is related to the number of silicon chips and their size. Thus if D-type flip-flops were used the memory which could be implemented at a given price would be much smaller (i.e. fewer
bits). Cost is extremely important in system design!
The reason several types of memory exist is that the cost trade-offs vary according to the system requirement.
For example DRAM is the most area efficient but it is slower than SRAM, requires more support logic and
can be more expensive for memories below a certain size.
73
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
A Real Memory Device
The figure illustrates one of the packaging options of a typical commodity SRAM chip.
[Figure: 512K x 8 SRAM package – address and data pins plus ‘Power’, ‘Ground’,
Write Enable, Output Enable and Chip Select]
74
Using Memory Chips
The memory device shown is a 628512. This is a 4Mbit SRAM chip (memory sizes are normally quoted in
bits) organised as 512 Kwords of 8 bits each.
It therefore requires nineteen address lines and eight data lines; together with its power supplies {Vdd,
Vss} and three control signals these occupy all the pins on a 32 pin DIL (Dual In-Line) package.
The following table defines the memory chip’s behaviour.
Points to note:
o All the control signals are active low
o If the chip is not selected (CS = H), nothing happens
o Write enable overrides read operations
o The data bus is bidirectional (either read or write – saves pins)
The CS signal is used to indicate that this particular device is in use. If a larger memory is required an
external decoder can be fitted so that only one memory chip is enabled at once. In this way several
memory chips can be wired with all their pins in parallel except for CS.
Observation
Because it uses the same pins for read and write operations it does not actually matter what
order the address and data pins are wired in. The user does not care what the address of any
particular location is as long as its address does not vary.
74
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Timing
A simple memory device acts as an array of transparent latches.
Only the required latch(es)
must be enabled when
writing.
Address must be stable all
the time that CS and WE are
asserted.
75
Timing
Memory holds state. It can be written to.
When writing it is important that the correct data is written to the correct location; it is also important to ensure that no
other memory locations are corrupted.
In the MU0 model described earlier the memory was controlled by read and write control signals and it was assumed
that the processor clock would control state changes. A real memory device often has no clock input!
A simple SRAM has the same timing characteristics as a transparent latch. If the chip is selected (CS=L) and write
enabled (WE=L) then the data inputs will be copied into the addressed location. It is important that the address is stable
during the write operation; if it is not, other locations may also be affected.
There are set-up and hold time requirements for the address and data values around the write
cycle. (The set-up time is normally greater to allow for the address to be decoded.)
The actual write strobe is a logical AND of the write enable and chip select signals; both must
be active for data to be written. The timing diagram shown above is therefore only one possible
approach to strobing the memory. Another approach could use WE as the timing signal.
Different processors (& different implementations) encode timing differently. That’s okay, as
long as it’s included somewhere.
Note that this is not essential for read operations, because they do not change the state of the
memory; it does no harm though
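A behavioural sketch (ours, for simulation only) of the simple SRAM behaviour described
above, using the 512K x 8 device’s active-low Chip Select, Write Enable and Output Enable:

module sram_512kx8(a, d, nCS, nWE, nOE);
input  [18:0] a;
inout  [7:0]  d;
input         nCS, nWE, nOE;
reg    [7:0]  mem [0:524287];
// read: drive the data bus only when selected, output enabled and not writing
assign d = (!nCS && !nOE && nWE) ? mem[a] : 8'bz;
// write: level sensitive, like a transparent latch, while CS and WE are both low
always @ (a or d or nCS or nWE)
  if (!nCS && !nWE) mem[a] = d;
endmodule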
75
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Wider Memory
A typical 32-bit processor will be able to address different sized
data in memory.
o The memory will usually be built 32-bits wide so that words are
fetched efficiently (i.e. in one cycle).
o The addresses of the memory words are therefore spaced four
locations apart {0, 4, 8, C, …}.
Notes:
Addressing a 32-bit quantity at address 00000003 (say) may not work because
the bytes are located in different memory words.
Some processors (e.g. x86) do allow this, but they need to perform two
memory operations, one for each affected word.
76
The Reality of Memory Decoding
In the foregoing it is assumed that each address corresponds to one memory ‘location’.
In a ‘real’ memory system this is often not the case. For example an ARM processor can address memory in 32-bit words
or 8-bit bytes (or 16-bit “halfwords”) and the memory system must be able to support all access sizes.
Addresses are decoded to the minimum addressable size (in this case bytes). Addressing a word requires fewer address
bits.
Thus the least significant bit used by the address decoder is A[2]; A[1] and A[0] act as byte selects, which will be
ignored when performing word-wide operations. Of course the bus must also carry signals to specify the transfer size.
Byte accesses
Notice that when the processor reads word 00000000 it receives data on all its data lines (D[31:0]). When the processor
reads byte 00000000 it receives data only on one quarter of the data bus (D[7:0]); furthermore if the processor reads byte
00000001 it uses a different subset of the data bus (D[15:8]). The processor resolves this internally by shifting the byte to
the required place (an ARM always moves the byte to the eight least significant bits when loading a register).
The same is true when writing quantities of less than a full word – the data must be copied onto the appropriate data
lines.
When reading a byte it is possible to ‘cheat’ by reading an entire word from memory and ignoring the bits that are
unwanted. This works because reading memory does not affect its contents. However when writing it is essential that
only the byte(s) to be modified receive a WE signal or other bytes in the same word will be corrupted. This would be a
Bad Thing.
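A sketch (ours) of the byte-lane write-enable logic this implies, using active-high
enables for clarity (the real chip selects and write enables are active low):

module byte_we(write, word, a, we);
input        write;      // there is a write this cycle
input        word;       // 1 : 32-bit transfer, 0 : byte transfer
input  [1:0] a;          // A[1:0], the byte within the word
output [3:0] we;         // one (active high) write enable per byte lane
assign we[0] = write & (word | (a == 2'b00));
assign we[1] = write & (word | (a == 2'b01));
assign we[2] = write & (word | (a == 2'b10));
assign we[3] = write & (word | (a == 2'b11));
endmodule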
76
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Subsystems
Here is part of the memory subsystem for an ARM-based system.
77
The ARM processor
o has a 32-bit word length.
o produces a 32-bit byte address.
o can perform read and write operations with 32-, 16- and 8-bit data.
The normal design for the memory system would therefore be a space of 2³⁰
words (byte addressing, remember) of 32-bits each. Let’s see how this could be
populated, using the RAM chips described above.
The RAMs are 8 bits wide, therefore four devices are required to make a 32-bit
word. This then gives 512 Kwords of memory. We can then repeat this
arrangement another 2048 (=2¹¹) times to fill the address space, using the
appropriate decoder circuits.
Of course the 8192 RAM chips required will be expensive, will occupy a large
volume and use a lot of power (thus generating unwanted heat) and it is unlikely
that we really need 4 gigabytes of memory!
The usual alternative with a large address space is to make it sparsely populated;
this saves on memory chips and also simplifies the decoder circuits. Let’s say we
need only 1Mword of RAM, as in the figure above.
77
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Points to Note
o Most signals to the RAM chips are shared
o The total memory space is coarsely divided (top left)
o One address line (A[21]) is used to select between the banks of RAM
o There is some byte selection logic using A[1:0]
o Some address lines are ignored!
78
(In the figure the ability to perform 16-bit ‘halfword’ transfers has been omitted.)
o Here the memory is 32-bits wide which requires four, 8-bit wide chips per
‘bank’.
o Two banks of memory provide 1Mword.
o The two least significant address lines (A[1:0]) are used as a byte select.
These are ignored if a word transfer is performed.
o The next nineteen address lines (i.e. A[20:2]) are connected to all the RAM
chips. Note that the signal A[2] will be wired to the pin A0 on the RAMs
(and so forth) because the RAM address is the word address, not the byte
address. A[1:0] are used as a byte address within the word.
o A[21] is used here to select between the two banks of RAM. The last stage
of the decoder is shown as explicit gates which drive the individual chip
selects. (NAND gates provide for the fact that CS is active low.)
o The chip selects are the only signals which are distributed on a ‘one-per-chip’
basis; other signals can be broadcast across many/all devices. This
simplifies the wiring on the PCB1.
o A[29:22] are ignored!
o The RAM region select signal is produced from the most significant address
bits; here RAM is selected when A[31:30] = 01. This means RAM
occupies addresses 40000000-7FFFFFFF inclusive. The lowest region is
reserved for ROM because the ARM starts executing at address 00000000
when the power is switched on. (A sketch of this decode is given below.)
Exercise: Understand the decoder’s operation.
1. Printed Circuit Board
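A Verilog sketch (ours, not the exact gates in the figure) of the chip-select decode
just described:

module ram_select(a, nCS_bank0, nCS_bank1);
input  [31:0] a;
output        nCS_bank0, nCS_bank1;
wire ram_region = (a[31:30] == 2'b01);        // RAM occupies 40000000-7FFFFFFF
assign nCS_bank0 = ~(ram_region & ~a[21]);    // lower bank of RAM (active low)
assign nCS_bank1 = ~(ram_region &  a[21]);    // upper bank of RAM (active low)
endmodule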
78
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
The Memory Map
The memory is divided into areas
o Some areas may not contain anything
o Ignored address lines mean that memory is aliased into multiple locations
79
Memory Map Details
The memory map described (which is just one possible example) shows many of the basic
properties found in real systems.
o The memory is coarsely divided into areas with different functions.
  – Areas may contain different types of memory or different technologies that
run at different speeds. For example the I/O area may be designed to cycle more
slowly (i.e. more clock cycles) than the RAM.
  – Some integrated CPU devices may provide such decoders ‘on-board’.
o Some areas are left ‘blank’.
  – The previously described decoder does not use one of the area selects.
    - Writing to such areas has no effect.
    - Reading from such areas could return any value (i.e. it is undefined).
o Some physical devices can appear at several different addresses.
  – This is due to ignoring some address lines when decoding.
  – Fewer levels of decoding reduces cost and increases speed.
  – This is known as aliasing.
o The I/O space is unlikely to be full.
  – There will be both undecoded and aliased locations.
  – In most cases peripheral devices will be byte-wide so not all bits in the word
will be defined. When reading peripherals it is important to mask out the undefined bits.
  – Peripheral devices are sometimes unlike memory in that reading an address will
not return the same value that was written to that address.
  – Input ports, by definition, are volatile in that their value can change without
processor intervention.
79
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Separate I/O
[Figure: a memory address space (ROM and RAM), selected with the M/IO line true,
alongside a separate I/O address space, selected with the M/IO line false]
Use special instructions IN and OUT to refer to I/O.
80
A separate I/O address space
The memory map shown here includes space for ROM, RAM and I/O
peripherals. I/O access patterns are somewhat different from memory accesses in
that they are much rarer and often come individually (as opposed, for example, to
instruction fetches which run in long sequences).
Some processor families, a notable example being the x86 architecture, provide a
completely separate address space which is intended for I/O. If this is used it
leaves a ‘cleaner’ address space just for ‘true’ memory. The programmer can get
at this space by using different instructions (e.g. “IN” and “OUT” replace
“LOAD” and “STORE”) which are usually provide only limited addressing
modes and, possibly, a smaller address range.
The hardware view typically uses the same bus (with an added address line
M/IO). The hardware may also slow down bus cycles automatically in the
expectation that peripheral devices are slower than memory.
Note that the system designer is not compelled to use these spaces in this way!
80
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Little Endian
o The bytes in a word are numbered with the least
significant having the lowest number (i.e. 0)
o The bits in a byte are numbered with the
least significant having the lowest number (i.e. 0)
81
Endianness
Generically “endianness” refers to the way sub-elements are numbered within an element, for
example the way that bytes are numbered in a word. By convention the bytes-in-a-word definition
tends to dominate, thus a “big-endian” processor will typically still number its bits in a little-endian fashion (see slide).
This can get pretty confusing. If it’s any consolation the numbering schemes used to be worse!
Little endian addressing
Pick a word address, say 00001000, in a 32-bit byte-addressable address space. Let’s store a word
(say, 12345678) at this address.
o Address 1000 contains byte 78
o Address 1001 contains byte 56
o Address 1002 contains byte 34
o Address 1003 contains byte 12
i.e. the least significant byte is at the lowest address.
This has the effect that, if displayed as bytes, a memory dump would look like:
00001000 78 56 34 12
i.e. the bytes appear reversed (because higher addresses appear further to the right).
If a byte load was performed on the same address the result would be: 00000078
81
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Big Endian
o The bytes in a word are numbered with the most
significant having the lowest number (i.e. 0)
o The bits in a byte are still numbered with the least
significant having the lowest number (i.e. 0)
  – This is inconsistent, but frequently encountered
82
Big endian addressing
Using the same word address (00001000) for the same word (12345678).
o Address 1000 contains byte 12
o Address 1001 contains byte 34
o Address 1002 contains byte 56
o Address 1003 contains byte 78
i.e. the most significant byte is at the lowest address.
This has the effect that, if displayed as bytes, a memory dump would look like:
00001000 12 34 56 78
If a byte load was performed on the same address the result would be: 00000012
Choice of endianness
Some processors are designed to be little endian (x86, ARM, …), others to be big
endian (68k, MIPS, …). There is no particular rationale behind this. Most modern
workstation processors allow their endianness to be programmed at the memory
interface.
82
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Hierarchy
o Processors are fast
o Programmes are big
o Big memories are slow
A hierarchical memory alleviates some of the penalty:
[Diagram: the CPU and register file feeding memory Levels 1 to 4; speed and cost per
bit are HIGH nearest the CPU and LOW furthest away]
83
Memory Hierarchy
Bottom line:
for a given price
o big memory = slow memory
o small memory = fast memory
If a programme has to run from ‘main’ memory it will only run at the speed at which its
instructions can be read – maybe 10x slower than the processor can go. However in reality typical
programmes show a great deal of locality, i.e. they spend maybe 90% of their time using perhaps
only 10% of the code.
If the critical 10% of the code is placed in a small, fast memory then the performance of the
overall programme can be significantly increased without the expense of filling the address space
with fast memory.
This is exploited extensively in high performance systems. Depending on the implementation it
may be known as caching or virtual memory1; the principle is the same in each case.
A typical PC will have several levels in its memory hierarchy:
o The internal registers
o An on-chip cache, integrated onto the processor chip (SRAM)
o A much larger secondary cache on the motherboard (SRAM) (sometimes
erroneously referred to as “the cache”)
o The ‘main’ memory – usually many megabytes of some cost-effective DRAM
o Some ROM or EEPROM to store information required on power up.
o The virtual memory space which is kept on a hard disc (magnetic)
There may be more levels than this though!
1. … and was an invention from the Atlas machine built by this department.
83
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Hierarchy Cont.
Providing that the first level of the memory can keep up with
the CPU, full speed is achieved most of the time.
The first level of memory will usually be a cache
The last level of memory will often be (part of) a hard disc
The register file can be thought of as Level 0, i.e. the top of the hierarchy.
[Diagram: the CPU with its register file, then Cache (Level 1), RAM and ROM, and
Disc (Level 4)]
84
Although the principle of locality is used at each level of the hierarchy the
process of choosing the “working set” (the elements to store) is often
implemented differently: it is sometimes done by hardware and sometimes by
software and may be static or dynamic.
In the future it is likely that the technology will evolve but it is unlikely that
memory hierarchies will disappear.
The Register Bank
Unlike MU0 modern processors usually have a significant number (ARM
has 16, MIPS has 32, …) of registers forming a register bank (sometimes
called a register ‘file’). These registers are used for operands for and results
from the current set of calculations.
Although they are not addressed in the same way as memory, the registers
can be regarded as the topmost level of the memory hierarchy (level 0).
The management of what is stored in the registers is done ‘manually’ by
the compiler or the programmer directly and is – of course – specified
explicitly in the object code.
84
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Caches
o The ‘main’ memory is big.
o To fill this with fast memory (i.e. as fast as the processor)
would be really expensive!
o A cache is a small (but busy) memory.
85
Caches
Two observations:
o Large memories (at an economical price) tend to be slower than small ones.
o A program spends 90% of its time using 10% of the available address space1.
No one has said that the memory has to be homogeneous; it is quite possible to have memories of
different speeds at different addresses. If you can organise things so that the 10% of the address
space which is frequently used is in fast memory then you can get startling improvements at
relatively small cost.
In some circumstances this is possible. In embedded controllers the software is fixed and the
programmer can profile and arrange the code to exploit different memory speeds.
In general purpose machines (e.g. PCs) the code is dynamic (a posh way of saying you run lots
of different programs) and those programs are designed to run on different machine
configurations.
Profiling is not a great help here.
A cache memory adapts itself to prevailing conditions by allowing the addresses it occupies to
change as the program runs. It relies on:
• Spatial Locality – guessing that if an address is used, others nearby are likely to be wanted.
• Temporal Locality – guessing that if an address has been used, it is likely to be used again in the near future.
Two examples illustrate why this often works:
• Instructions are (usually) fetched from successive addresses and loops repeat sections of code many times.
• Many data are held on a stack, which uses a fairly small area of RAM repeatedly.
Sufficient to say this works very well. In ‘typical’ code a cache will probably satisfy ~99% of
memory transactions.
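A brief, back-of-envelope sketch (illustrative timings only, not figures from this course) shows why a ~99% hit rate is so valuable:

# Hedged sketch: average memory access time (AMAT) with a cache.
# Illustrative timings only: cache hit = 1 cycle, miss penalty = 20 cycles.

hit_time = 1       # cycles for a cache hit (assumed)
miss_penalty = 20  # extra cycles to fetch from main memory (assumed)
hit_rate = 0.99    # the ~99% figure quoted above

amat = hit_time + (1 - hit_rate) * miss_penalty
print(amat)        # 1.2 cycles on average, i.e. close to cache speed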
Detailed cache design is beyond the scope of this course; further information can be found in
books such as Clements.
85
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Cache Properties
• The cache intercepts many memory references and services them quickly.
• If we knew in advance which memory locations were going to be busy this would be easy.
• Caches adapt dynamically to the changing needs of the system.
86
Cache Hierarchies
Caches work so well that it is now common practice to have a cache of the cache.
This introduces several levels of cache or a cache hierarchy.
The first level (or “L1”) cache will be integrated with the processor silicon (“on-chip”).
There will be a second level of cache (“L2”); this may be on the PCB, on the
CPU chip or somewhere in between, such as the integrated processor module.
Further cache levels are also possible; “L3” is increasingly common in high-performance systems.
86
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
A Different Architecture
• The von Neumann architecture has a single memory shared between code and data.
• The Harvard architecture separates instruction and data memories.
87
Harvard Architecture
The term “Harvard architecture” is normally used for stored program computers
which separate instruction and data buses. This separation may apply to the entire
memory architecture (as shown on the slide) or may be limited to the cache
architecture (below).
87
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Harvard Architecture
This increases the overall memory bandwidth.
• The extra parallelism allows the next instruction to be fetched while the previous one does a load or store.
o Note – not every instruction requires a data transfer so, sometimes, the data memory bus may be idle.
• Many high-performance processors employ such architectures – at least to the first level of cache.
88
The Harvard architecture logically separates the fetching of instructions from
data reads and writes (e.g. ‘load’ and ‘store’). However its real purpose is to
increase memory bandwidth.
Bandwidth is the quantity of data (number of bits) which can be transferred in a
given time. In a von Neumann architecture instruction fetches and data references
share the same bus and so compete for resources. In a Harvard architecture there
is no competition so instruction fetches and data reads/writes can take place in
parallel; this means that the overall processing speed is increased.
The disadvantages of Harvard architecture are:
• the available memory is pre-divided into code and data areas; in a von Neumann machine the memory can be allocated differently according to the needs of a particular program
• it is hard/impossible for the code to modify itself (not often a problem, but can make loading programs difficult!)
• more wiring (pins, etc.)
Note: with a Harvard architecture the main memory may be completely divided
in two. The parts need not have the same width or address range. For example, a
processor could have 32-bit wide data memory and 24-bit wide instruction
memory.
Many DSPs (Digital Signal Processors) have more ‘unusual’ Harvard
architectures.
88
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Read-Only Memory (ROM)
• Interface similar to SRAM
• Extra pins for programming (write) support
89
Read-Only Memory (ROM)
ROMs are usually random-access memory devices. They use a similar IC
technology to RAMs, with lower cost/bit than RAM.
They are:
• read-only which means their contents cannot be corrupted by ‘accidents’ such as bugs or crashes.
• non-volatile (i.e. they retain their information when power is removed).
Uses
• ‘Bootstrap’ programs.
• ‘Fixed’ operating system and application code.
• Logic functions (e.g. microcode, finite state machines).
89
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Types of ROM
• Mask programmed ROMs are programmed during chip manufacture.
• PROMs are ‘Programmable’ after manufacture, using programming equipment.
• EPROMs are Erasable and Programmable (usually by exposure to strong ultraviolet light).
• EEPROMs are Electrically Erasable.
90
Types of ROM
• Mask programmed ROMs are programmed during chip manufacture.
o Cheap for large quantities.
o Used in ASIC1 applications.
• PROMs are ‘Programmable’ after manufacture, using programming equipment.
o Each individual IC is separately programmed (a manual operation).
o Contents cannot be changed after programming.
• EPROMs are Erasable and Programmable (usually by exposure to strong ultraviolet light).
o A technology in decline.
• EEPROMs are Electrically Erasable.
o Currently one of the most popular ROM technologies.
o Many can be altered ‘in-circuit’, i.e. without removal from the PCB.
o They differ from RAM in that they require considerable time to alter a location (writes take >100x the read time).
o Many devices also require ‘bulk’ erasure so that all or a large portion of the chip is ‘blanked’ before new values can be written.
o Widely used for non-volatile store in consumer applications such as telephones, TV remote controls, digital cameras et al.
o “Flash Memory” falls into this category.
90
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Technology
The challenge in building computer memory is to achieve:
• maximum density
• adequate speed
• minimum cost
Bonus points are awarded if the technology is:
• easy to use
• non-volatile
The history of computer memory has been a struggle to find
the ‘ideal’ storage at any given technology level.
What can be used to store data bits?
91
Other Memories
So far we have treated “memory” as simply the directly addressable memory space (which is the
usual interpretation of the term). However there are a number of other storage devices in use in a
modern stored program computer.
One other form of store is the processor registers. In RISC processors it is usually clear that
these form a separate, addressable ‘memory’ space: e.g. in an ARM “R7” means the “Register
store with address 7” (not to be confused with the memory location with address 7).
Perhaps more obvious are magnetic storage devices such as discs. The primary function of a
disc store is to act as a filing system. In a filing system each file is a separate, addressable entity
where the ‘address’ is the name of the file. File handling is beyond the scope of the processor
hardware and is performed by specialist software, usually as part of an operating system. Files
may be stored on local discs (i.e. on the machine which is using them) or elsewhere (e.g. on a
networked fileserver); this should be transparent to the user.
Memory used as file storage has the following characteristics:
o Addressed in variable size elements (“files”)
o Addresses (“filenames”) variable length
o Address decoding done by software (“filing system”)
It is possible – with some extra hardware support – to make disc store ‘stand in’ for areas of
the address space not populated with semiconductor memory. This is a virtual memory
scheme and will be described more fully in later courses.
Another type of addressable store uses addresses of the form: “http://www.cs.man.ac.uk/”
91
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Current Technologies
Current favoured memory technologies:
• SRAM – Static RAM (flip-flops)
• DRAM – Dynamic RAM (Needs refreshing)
• Flash Memory – Block erase; can be byte or block read
• Magnetic disc – Hard Disc (serial data block read)
• Optical disc – DVD or CDROM (serial data block read)
92
Current Memory Technology
SRAM
• fast
• truly random access
• relatively expensive per bit
DRAM
• significantly slower than a fast processor
• faster if addressed in ‘bursts’ of addresses
• medium cost per bit
Flash
• Slower than DRAM
• Block erasable
• Readable in byte or block form
• Very cheap per bit in block readable form (e.g. USB pendrive)
Magnetic storage
• very slow (compared to processor speeds)
• variable in their access times (think of the mechanics involved)
• read/writeable only in blocks
• very cheap per bit (e.g. Hard Disk)
Optical storage
• very slow (compared to processor speeds)
• variable in their access times (think of the mechanics involved)
• primarily (but not exclusively) read only
• extremely cheap per bit (e.g. CDROM or DVD)
92
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Magnetic Memory
[Figure: disc surface divided into tracks and sectors (each sector contains a block of data); the read/write head moves in and out across the disc surface; discs are stacked on a single spindle with separate heads on each surface]
93
Magnetic Discs
Discs in one form or another should be familiar and need little further
description. They come in several forms but can loosely be classified into hard
discs which use a metal substrate and flexible (“floppy”) discs which use plastic.
Hard disc drives often contain several platters on a single spindle, with surfaces on each side of
a platter. The heads are linked to the same mechanical structure (‘arm’). A set of tracks at the
same radius is referred to as a cylinder.
Hard discs can store data more densely than floppies because the heads can approach
more closely, more reliably; the discs can also rotate faster without
distortion. Hard disc heads literally fly over the surface on a thin layer of
entrained air. They are enclosed to prevent dust particles disrupting their
operation.
The price/bit of disc storage is declining rapidly as the density increases. Future
disc technologies may become more exotic. For example to store a bit in a very
small area the magnetic material needs to be quite ‘stiff’; it may then need
zapping with a laser to warm and ‘soften’ it each time it is written. Research is
ongoing …
Unlike semiconductor RAM, the access time of a disc memory depends on its
mechanical configuration and will vary depending on circumstances.
93
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Optical Memories
The most significant 1990s memory technology popularisation
was optical store as CD-ROM and, later, DVD (Digital Versatile
Disc). CDs offer:
• High bit density
• Cheap manufacture (read only discs)
• Interchangeable medium
• Limited write capability
Holographic
• Uses non-linear optical media
• Promises very high storage density
o Three dimensional storage
o Many bits in same volume (different viewing angles and different laser wavelengths)
This is an active and ongoing field of research.
94
Optical Memories
CD-ROM uses the presence/absence of pits in a foil disc to represent bits. The disc is read with a
laser and optical sensor, but the transport is otherwise largely similar to a magnetic disc. A CD-ROM holds up to 650 Mbytes of data.
DVD is simply an extension of CD technologies, with smaller (denser) bits. The only significant
development is that there are two planes of bits on (each side of) the disc which are separated by
‘focus pulling’. A DVD can contain 4.7 Gbytes of data.
Other, similar optical storage formats are possible. Particularly attractive are those which dispense
with the spinning disc and tracking head (and hence the large motors with their associated power
consumption). Instead of moving the medium, the laser can be scanned, using a smaller,
lighter mechanism. The medium can also be made smaller (e.g. credit card sized). Such optical
memory cards are under investigation and development. Being physically smaller than a CD-ROM, an optical memory card holds 2.8 Mbytes.
Holographic storage
In theory the storage of data in some sort of transparent ‘crystal’ could be very space efficient, not
least because it offers 3D storage. Such storage was hinted at in, for example, the film “2001: A
Space Odyssey” (1968) as the basis of the “HAL 9000” computer. Sadly this prediction proved
somewhat optimistic.
Another potential of holographic memory is the ease of construction of associative or ‘Content
Addressable’ Memory (CAM). This is used in (for example) parts of cache memories but optical
CAM may be useful for more elaborate tasks such as pattern recognition. This is beyond the scope
of this course.
This is an active and ongoing field of research. Searching the WWW will give more up-to-date
details than can be included here.
94
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Lane!!
Now to look at some of the history of storage
technology.
95
Punch Cards, etc.
Early references in weaving
• 1725: M. Bouchon used a pierced band of paper pressed against horizontal wires.
• 1728: M. Falcon suggested a chain of cards and a square prism instead of the paper.
• 1745: Jacques de Vaucanson automated the process using pierced paper over a moving, pierced cylinder.
• 1790: Joseph-Marie Jacquard developed the mechanism which still bears his name.
Uses in computing
Analytical Engine (1837)
Certainly the earliest ‘use’ of punched cards was in Charles Babbage’s (1791-1871) design of the
Analytical Engine (a mechanical digital computer). This was never built but was, in most respects,
a modern computer architecture with a processor (called the “mill”), memory and I/O.
• The analytical engine had, in fact, three types of card decks with “operation
cards” (instructions), “cards of the variables” (addresses), and “cards of numbers”
(immediates).
95
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Punch cards
and other mechanical storage
Jacquard loom (1790)
Read mechanically
96
Joseph-Marie Jacquard (1752-1834)
• 1752: born, July 7 in Lyon, France; parents were silk weavers.
• Tried book-binding, type-founding and cutlery.
• 1772: father died, leaving him inventing and accumulating debts.
• 1790: developed first loom – release delayed by French Revolution.
• 1792: joined the revolutionists; his son was killed beside him in the defence of Lyon.
• 1801: revealed ideas for loom.
• 1803: summoned to Paris to demonstrate machine; given a patent and a medal.
• 1804: returned to Lyon to run the workhouse (and perfect his machine).
• 1806: loom declared public property – Jacquard granted annuity & royalties.
• 1806-10: much opposition from machine breakers; fled Lyon in fear of his life.
• 1812: 11 000 Jacquard looms in use in France.
• 1834: died Aug. 7 in Oullins, near Lyon. At this time >30 000 Jacquard machines were operating in this city.
96
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Hollerith cards (1887)
• Punched physically (slow)
• Read by electrical contact (or not)
97
Hollerith machine (1884)
Not truly ‘computing’ so much as a counting (& accounting) machine, the Hollerith machine revolutionised
record keeping. With information on punched cards a machine could be ‘programmed’ to count all cards
with certain sets of punched holes. This was first used for applications such as the US census in 1890;
Hollerith cards were used extensively in early electronic computers and for other systems – they were
familiar, everyday objects from the 1950s to the
1970s – and in use for some applications (such as voting) into the late 20th century.
In 1928 the standard card size increased from 45 to 80 columns (960 bits). In computing this was adopted
as a line of text/program and was used as the width of a Visual Display Unit
(VDU). This survives as the ‘standard’ page width.
Herman Hollerith (1860-1929)
• 1860: born 29 Feb in Buffalo, New York, USA, child of German immigrants.
• Unable to spell as a schoolboy!
• 1875: entered the City College of New York.
• 1879: graduated from the Columbia School of Mines with distinction.
• 1880: worked on the US census.
• 1882: joined MIT (Mechanical Engineering); began experimenting with paper tape, then punched cards, read by a wire contacting with mercury and triggering mechanical counters.
• 1884: moved to U.S. Patent Office – to avoid teaching duties.
• 1884: applied for his own first (of over 30) patents.
• 1887: Hollerith Electric Tabulating System tested.
• 1890: US census saves $5 million and two years’ work (pop. 62,622,250).
• 1896: founded the Tabulating Machine Company, which later became the Computing-Tabulating-Recording Company (CTR).
• 1921: retired.
• 1924: CTR was renamed International Business Machines Corporation (IBM).
• 1929: died 17 Nov. in Washington D.C., USA, of a heart attack.
97
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Paper Tape
• Read electrically (or optically)
• Typically seven bits wide
98
Paper tape
Paper tape uses the same principle as punched cards to store data. It has an
advantage in density and is faster to feed through a reader. It is also not as easy to
get muddled (or ‘hacked’) as a deck of cards because it cannot be shuffled.
Conversely the ability to edit programs by adding, deleting and substituting
punched cards could be very useful. Editing paper tape is difficult. One of these
difficulties has left a trace in the ASCII character set where the character 7F
(DEL or ‘delete’) is separated from the other ‘control’ characters; this code is
used because it was represented by all the holes punched out (ASCII is a 7-bit
code) and so could be used to overwrite mistakes.
Similarly the character 00 (NUL) is used as a ‘no operation’ in order to allow an
indefinite length of unpunched “leader” on the reel of tape.
98
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
IBM Millipede (1999)
• Data stored as pits in plastic surface
• Fixed medium
• Bit densities 10x magnetic disc (~500 Gbit/in²)
99
‘Millipede’
A possible new technology, using a microscopic punched card; hunt out your
own references.
99
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Mask programmed ROM
• Bit values are indicated by the presence or absence of physical wire connections
• Fixed rather than interchangeable medium
• Bits programmed in during manufacture
• Often used on-chip to store secure manufacturer’s data
100
ROM/PROM
Some ROMs retain data as physical wire connections. In mask programmed
ROMs these wires are fixed at manufacture. In the (currently defunct) fuse
PROM technology the wires were fuses which – if not required – were
overloaded and ‘blown’ during programming.
Older technologies used matrices of diodes on PCBs for a similar effect.
100
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Delay lines
The principle …
101
Delay lines
A delay line is a device which exploits the ‘time of flight’ of bits in transit. It
would be possible, for instance, to do this optically but sound – which travels
more slowly – gets more bits into a short space.
Delay lines are dynamic store in that data must be read, regenerated and
rewritten continuously. Clearly random access is not possible as data can only be
read or written as the required ‘location’ circulates through the electronics.
Access to a given ‘memory’ is, of course, strictly bit-serial.
Many early electronic computers – e.g. ENIAC (1946), EDSAC (1949) – used
mercury delay lines (or “tanks”) as their main store. A typical 5 ft (~1.5 m) delay could
hold about 1 Kbit. It was folded up for convenience (rather like a bassoon).
Mercury delay lines were originally developed in the 1940s for radar
applications.
Mercury is a good acoustic conductor but is rather expensive (and heavy). A
more convenient system was sought. The solution was the magnetostrictive
delay line. Magnetostriction is a stress induced in a magnetostrictive material
(such as nickel) when it is subjected to a magnetic field. (A magnetic field is, of
course, generated by a flowing electric current.) This was translated into torsional
(twisting) waves on a long rod. The process is reversible, so the bit stream can be
detected again at the far end and neoprene buffers damp out any excess energy.
As the system runs at high frequency (~1Mbit/s) the ‘rod’ could really be quite a
light wire which could be loosely coiled onto a circuit board. Single lines of up to
100 ft (~30 m) were made which could store up to 10 Kbits.
101
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Mercury Delay Line (1940s)
• ~1.5 m tube
• Capacity ~1 Kbit
Magnetostrictive Delay Line (1960s)
• Up to 30 m wire
• Capacity 10 Kbits at ~1 MHz bit rate
NB. Delay lines do not give random access.
102
102
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Electrostatic Memories
Williams Tube (1948)
• Charge stored on a glass/phosphor screen
• Written by an electron beam
• Read by displacing charge onto sensor mesh
103
Electrostatic Memories
Williams Tube (1948)
The Williams Tube (more correctly the Williams-Kilburn Tube) was an early all-electrical storage
device developed in Manchester. Its basis is a Cathode Ray
Tube (CRT) similar to those used in televisions and computer monitors. Bits were
stored as charge patterns on the phosphor screen. In effect some electrical charge
was or was not planted at each point on the screen using an electron beam. The
bits were read back by displacing these charges with another electron beam
which caused a discharge into the screen; the discharge was picked up by a wire
mesh across the screen’s front.
The first Williams tubes could store 2Kbits – perhaps twice the contents of a
mercury ‘tank’. They offered the added advantage that the data could be viewed
by the operator (although a second, parallel tube was needed because the actual
store was enclosed).
Reading the data is destructive, therefore it was necessary to regenerate the
charge and refresh the display. In any case, as charge tended to leak away,
regular refreshing was necessary; the store was therefore ‘dynamic’.
This was the store technology employed in the Manchester ‘Baby’, a computer
which was really built as a memory test unit.
103
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Dynamic RAM (DRAM) (1970s-present)
• Stores bits as charged/uncharged capacitors
• Dense and therefore cheap
• Volatile and needs refreshing
104
Dynamic RAM (DRAM) (1970s-present)
Instead of a glass and phosphor screen it is possible to store charge in a large array of capacitors.
However making such an array was very expensive until it could be done on a single silicon chip.
This is the principle behind DRAM.
The capacitors are accessed via a matrix of wires and switches which allow individual capacitors
to be charged or discharged. Opening these switches (which are really transistors) isolates the
cells. Closing the switches again allows the charge to escape, which can be sensed and amplified
as a read operation.
Read operations are destructive and therefore any data which are read must be rewritten
afterwards. Also the capacitors are not perfect so charge gradually leaks away, therefore periodic
refreshing is required – hence the name dynamic RAM.
Each bit store comprises one capacitor and one switch (transistor) and these can be made very
small. It is therefore possible to fit many megabits on a single chip. This is why DRAM has
remained the cost-effective choice for large addressable memories for several decades.
DRAM is customised and marketed in a number of guises such as EDO-RAM, SDRAM, Rambus
etc.; all these use the same basic technology.
The Decatron
Not an electrostatic memory – indeed related to very little else – the decatron was a neon discharge
tube with 10 anodes; the discharge could be jumped from one to another where it would remain
following the ionisation path. This was a decimal memory cell which also acted as a display.
Now a historical curiosity.
104
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
More Electrostatic Memories
EPROM
• EPROM uses a special ‘floating gate’ process
o special ‘isolated’ charge store
• Requires special programming and erasure
EEPROM
• Electrically Erasable
o In situ programming/reprogramming
• Bulk erased
• Slow to write (~100x read time)
FLASH
• Block write/erase
• Block read with serial interface
105
EPROM (1970s-1990s)
If the charge leakage can be (effectively) eliminated and a DRAM read can be made
non-destructive then the store would be even more useful. This is the principle behind EPROM
(Erasable Programmable Read Only Memory) which is non-volatile – i.e. it retains its data
indefinitely (even when the power is off).
This is done by adding a ‘floating gate’ to the memory transistors; these are ‘islands’ where
charge can be stored which are insulated by (relatively) thick glass (SiO2). Charge was driven
through the glass by (relatively) high voltages after which it stayed in a stable state (discharge
times >10 years).
To erase the chip the charge was drained by a short (~10 minute) exposure to powerful ultraviolet
light which lent enough energy to the electrons so they could escape. EPROM devices therefore
required a quartz1 window in their package so they could be erased.
Programming required a special programmer and the device had to be removed from the circuit; it
was therefore important that an EPROM was socketed on the PCB. The socket, expensive
windowed package and programming procedure makes EPROMs relatively unattractive if there is
an alternative. (Sometimes a saving was made by using OTP or “One Time Programmable”
EPROMs – the same devices but without the windows.)
EEPROM (1990s-present)
An EEPROM (Electrically Erasable Programmable Read Only Memory) uses EPROM
technology but erasure may be done electrically. The devices may now be programmed ‘in situ’.
Sadly eliminating the charge leakage adds so much ‘insulation’ that the cell becomes difficult
(slow) to write to. In addition erasure is still a ‘bulk’ erasure rather than the ability to modify
single bits. Thus EEPROM is a complementary rather than a replacement technology for DRAM.
105
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Magnetic Memories
Magnetism has highly desirable properties for storing data.
• It has a distinct polarity (think of a compass)
• It is (relatively) permanent, if undisturbed
• It can be manipulated using electric currents
Two distinct classes of magnetic storage have been attempted:
those using fixed and travelling magnetic media.
106
Magnetic Memories
Core
The memory element in core store was a small torus1 (“core”) of ferrite. This could be
magnetised in either direction. This can be set (written) by passing a current through a wire
threading the core. To read the device the core was probed and – if it switched – a characteristic
pulse was returned. (The read was destructive, so the data has to be written back.)
Because it requires a current over a certain threshold to switch the polarisation of a core it was
possible to produce dense, 2D arrays. These use two ‘address’ wires running at right angles; the
current in each was kept below the switching threshold but where they crossed the sum of the
resultant magnetic fields was great enough to affect just this one bit.
The legacy of core memory still exists in some terminology: a computer’s main memory is still
sometimes called “the core”, and “core dump” for an output of a memory image is still in
common usage.
Core store was followed by “plated wire” as a miniaturisation step.
Magnetic core technology was in use in specialist applications (such as space shuttles) in the
1980s because it is both non-volatile and radiation resistant (“rad-hard”).
Bubble Memory (1970s+)
Now a historical curiosity ‘bubble memory’ was once thought to be the technology for light,
portable equipment. Functionally it is the precursor of EEPROM, but works in an entirely
different way.
Bits are stored in a thin film of a medium such as gadolinium gallium garnet which is
magnetisable only along a single axis (across the film). A magnetic field (from a flowing electric
current) can be used to generate or destroy magnetically polarised ‘bubbles’ in the film which
represent the two states of a bit. These bubbles are non-volatile.
Perhaps unfortunately – if only for the name – bubble memory devices proved more expensive
than other technologies.
106
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Ferrite Cores (1950s-70s)
• Ferrite ‘cores’ polarised to bit state
• The 1970s memory technology
• ~100 Kbits in two stacks of 2D arrays
• Speeds from 60 kHz to a few MHz
107
107
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Moving Magnetic Media
The main examples are:
• Magnetic Drum (1950s)
• Magnetic Tape (1940s-present)
• Magnetic Disc (1950s-present)
Two of these are still in widespread use today.
108
Moving Magnetic Media
Rather than providing wires to each memory element the memory density can be
increased – and the cost decreased – by providing a thin magnetic coating on a
substrate material and moving this to the read/write element.
Drums
Drums were the earliest magnetic stores and often acted as directly addressable
memory where each CPU-generated address corresponded to a particular place on
the drum’s surface.
(This is in contrast to the modern use of – for example – discs, which form
secondary storage and are managed by a layer of software such as a filing system.)
Drums were used as both primary and secondary store on many early machines;
however they proved bulkier and less convenient than discs and were gradually
superseded as secondary store. Core memory proved significantly better as main
store.
If you want to know more about drums – and the sort of programmers who used
them – look up
“The Story of Mel”.
108
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Magnetic drums
Magnetic tape
• Originally large reel-to-reel drives
• Compact cassettes once used for home computers
• Now large capacity tapes used for backups & archives
109
Magnetic Tape
Magnetic tape uses the same storage technology as disc but the magnetic medium
is carried on a flexible plastic tape rather than a plastic or metal disc. The tape is
dragged past the read/write head(s) by a capstan.
The heyday of tape storage was the 1950s & 1960s where science fiction films
always showed computers as banks of spinning tape drives. In fact the
engineering required for a tape transport to allow heavy reels of tape to start,
stop and reverse rapidly is quite complex.
However modern systems have relegated tape to archival storage (such as
backups) where large volumes of data are streamed onto tape in handy-sized
cartridges. Here the slow, serial access is not a significant problem and the thin
tape wound onto a spool packs a lot of bits into a small volume.
109
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Memory Summary
• The processor has a memory map which is filled with
RAM, ROM
• Addresses are decoded to select the appropriate
memory block.
• Memory has a hierarchy with fast but small memory
at the top and slow but large at the bottom
• Caches may be used to provide higher bandwidth.
• Many different types of memory systems have been
tried in the past, since memory performance is one
of the critical factors affecting the overall
performance of a computer.
110
110
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Input/Output and Communications
111
111
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Computers need to communicate
A computer which can process information at incredible speed is
still useless unless it can:
• get input operands to work on
• output its results
This requires some sort of Input/Output (I/O) system
Remember the Amdahl/Case Rule
A balanced computer system needs about 1 megabyte
of main memory capacity and 1 megabit per second
of I/O per MIPS of CPU performance.
Desktop machines today go at about 30,000 MIPS so would need 30,000
megabit/second comms (30 GHz on a serial line!!!)
112
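The slide’s arithmetic is easy to reproduce; a small, purely illustrative Python sketch of the rule:

# Hedged sketch of the Amdahl/Case rule of thumb: roughly 1 Mbyte of main
# memory and 1 Mbit/s of I/O per MIPS of CPU performance.

def balanced_system(mips):
    memory_mbytes = mips * 1    # ~1 Mbyte per MIPS
    io_mbits = mips * 1         # ~1 Mbit/s per MIPS
    return memory_mbytes, io_mbits

mem, io = balanced_system(30_000)        # the ~30,000 MIPS desktop from the slide
print(mem / 1000, "Gbytes of memory")    # 30.0
print(io / 1000, "Gbit/s of I/O")        # 30.0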
112
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Batch vs Interactive processing
Many early computers were used for batch processing
where the input(s) and output(s) were data files stored on
magnetic tape, punched cards etc.
Examples:
‰ Processing census data
‰ Calculating and generating a payroll
‰ Forecasting the weather
Most modern systems are interactive
Examples:
‰ Word processor
‰ Computer game
‰ Mobile ’phone
In both cases I/O is required.
113
113
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Communication Speed
First let’s set the scale
• Start with the speed of light (in vacuum)
o 3 × 10⁸ m s⁻¹
o 186 000 miles/s
o one foot per nanosecond
This is the best case! Information cannot travel faster than
this.
On a PCB signals propagate (at best) at ~60% of this
speed.
One nanosecond (1 ns) is 10⁻⁹ s or the period of a signal of
frequency one gigahertz (1 GHz).
114
Digital Communications
We are concerned primarily with digital communications. Digital transmissions
use a binary coding system and so have the same advantages that binary signals
have inside the computer.
• Two widely separated signal levels are easy to tell apart
• There is no ‘intermediate’ state which could be confused
• Discriminating a received signal digitally means that noise can be rejected.
A binary signal is representable by a voltage (e.g. ‘high’/‘low’), current, light
level (‘on’/’off’) etc. Clearly with a single wire the voltage can only represent
one bit at any given time. To send a large number of bits therefore requires some
sequencing, separating different elements of the message in time.
If only a single bit is conveyed at a given time then the transmission is said to be
‘serial’. If more than one bit is sent at the same time the transmission is said to be
‘parallel’, although it is likely that some time sequencing will be involved too.
114
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Examples of coding data
• Voltage (high/low) on a wire
o e.g. ‘RS232’ serial interface
• Current (on/off or forwards/backwards) in a loop of wire
o e.g. MIDI
• Audio frequency (high/low)
o e.g. tones for a modem or fax machine (FSK – ‘Frequency Shift Key’)
• Light (on/off)
o e.g. optic fibre, Aldis lamp (signal lamp often used at sea)
115
115
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Bandwidth and Latency
• Latency is journey time
o Measured in seconds
o Set by journey length and transmission speed
• Bandwidth is traffic capacity
o Measured in bits per second
o Set by channel ‘width’ and transmission speed
116
Bandwidth and Latency
Two important terms:
• The latency is the time taken from sending a signal until it is received.
• The bandwidth of a communications channel is the amount of information
which can be sent in a given time.
These two are only indirectly related. Think of latency as journey time, being
influenced by the length of the trip, the quality of the road and the speed limit.
Bandwidth is the number of cars which can pass a point over a given time.
Bandwidth is not affected by the length of the road. Furthermore bandwidth can
be increased by adding more lanes even though this will not (in principle) shorten
an individual journey.
116
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Example
An 8-bit wide memory is cycled at 1 MHz
• The latency is 1 µs
• The bandwidth is 1 Mbyte/s (or 8 Mbits/s)
If the bus was 32 bits wide …
• The latency would be 1 µs
• The bandwidth would be 4 Mbyte/s (or 32 Mbits/s)
117
Examples
Latency
Approximate journey times for a signal travelling at the speed of light:
These represent the single trip times; a return trip (such as in a ’phone conversation) will double
this. (There may also be added delays due to switching, signal translation etc.)
Bandwidth (Data rate)
• Deep-space or ELF1 submarine communications may use a few bits per second.
• For old-fashioned telex links, about 50 bps is used.
• For links between printers and computers, between (about) 100 and 20 000 bps.
• For Ethernet networks, about 10 Mbps are available.
• High speed optical fibre networks reach into gigabits per second.
1. Extremely Low Frequency – necessary to penetrate the overlying sea water
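The figures on the slide above can be reproduced with a short, illustrative Python sketch:

# Hedged sketch: latency and bandwidth for the memory example on the slide.
# A memory cycled at 1 MHz takes 1 microsecond per access.

cycle_rate_hz = 1_000_000          # 1 MHz
latency_us = 1e6 / cycle_rate_hz   # microseconds per access

for width_bits in (8, 32):
    bandwidth_mbits = cycle_rate_hz * width_bits / 1e6
    print(f"{width_bits}-bit bus: latency {latency_us:.0f} us, "
          f"bandwidth {bandwidth_mbits / 8:.0f} Mbyte/s ({bandwidth_mbits:.0f} Mbit/s)")

Note that widening the bus raises the bandwidth but leaves the latency of an individual access unchanged.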
117
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Serial Communications
“Serial” means that data elements are communicated in a
series. i.e. different elements are distinguished by being sent
at different times. “Serial” normally refers to communicating
one bit at a time.
Disadvantages:
• ‘Narrow’ channel means potentially low bandwidth
• Data elements are normally 8, 16, 32, 64, … bits wide
o Serial disassembly/reassembly required
Advantages:
• Suitable for telephone or radio transmission
• Single bit reduces cabling requirement
• Small number (one!) of interfacing circuits
o Allows extra investment in increasing data rate
118
Serial Communications
‘Communications’ involves sending a signal over a physical medium. As digital computers are primarily
electronic this medium is frequently metal wire. Wires are convenient to interface and cheap at the small
scale such as on-chip or on a PCB. As the distance increases the wires get longer and the cost goes up; it is
likely that the wires will also need connectors at intervals, again increasing the cost.
One way of reducing the cost is to reduce the number of wires. Instead of communicating a 32-bit word
along thirty-two wires it can be serialised and sent as thirty-two messages, each one bit long. This will take a
longer time but only requires a single wire and can use simpler (cheaper) connectors.
In fact at least one extra wire is required in both examples to act as a reference ground (or earth) so that the
communicating systems have some common agreement as to the logic ‘high’ and ‘low’ levels.
Serial transmission is also well suited to other communications mediums. The electric telegraph (think of the
old man with the moustache and Morse tapper in many old Westerns) is serial with symbols encoded as
‘mark’ and ‘space’ on a wire. The modern counterparts use radio, but they normally use only a single radio
frequency.
In computer communications optical fibres are now very common. The reasons for this are that the available
bandwidth is higher and (persuasively) the glass/plastic fibres are much cheaper than copper wires. Data is
transmitted digitally down fibres using an on/off binary code, usually with a single frequency (colour) for
each transmission. There is some added cost in converting the electronic signal into an optical one and back
again, but this is offset by other savings.
Lastly there are media which are less obviously used for ‘communications’ but are inherently serial. Good
examples are storage media such as CDs or magnetic tape. Data can only be written or read one bit at a time
to the medium and so serial conversion must be included in the interfacing process.
Notes
• When serialising a signal it is important to observe a known convention as to which bit is transmitted first.
This is the ‘big endian/little endian’ debate again. Little endian (i.e. least significant bit first) is more
common.
• When sending a series of data separated only in time it is essential that both parties agree on how time is
delineated so that the receiver knows when to sample the incoming signal.
118
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Parallel Communications
“Parallel” means that data elements are communicated at the same time.
i.e. different elements are distinguished by being sent in different places.
• example: the seven-segment display interface used in the accompanying laboratory.
Disadvantages:
• More wiring required
• More interface circuits
• Bigger connectors
• Unsuitable for many transmission media (e.g. radio)
Advantages:
• Potential for high bandwidth
• Data sizes can match ‘internal’ data types
In practice it is also common to use a series of several bits in parallel.
119
Parallel Communications
Parallel communication involves sending several bits of data at the same time. This requires the
use of more data channels (usually wires!) and is therefore likely to impose a higher cost. Parallel
communications is useful for two different reasons:
Interface simplicity
Parallel interfacing is very easy at the computer end. A parallel output is simply a latch that the
processor can write to; a parallel input is a buffer which allows an external set of bits to reach the
CPU. For simple parallel interfaces this is all that is required and software can provide any
required control.
The interface is also simple to other devices such as switches, lamps etc. If each lamp has a
dedicated parallel output bit then there needs to be no ‘intelligence’ external to the computer
which may represent a considerable cost saving.
High bandwidth
Clearly (in principle) increasing the number of wires increases the number of bits per second
which can cross the interface. When communicating at a distance this is often offset because
some of the cost saving in going to serial interconnection can be reinvested in fancy interfaces which
increase the bit (or ‘symbol’) rate.
However in local communication there is no competition and parallelism is a Good Thing.
Examples occur as the physical scale shrinks and we look at communications around a PCB or on
a single chip. Perhaps the most obvious example is the CPU’s own bus connecting it with
memory; the bandwidth demands on this are very high (possibly many gigabytes per second) and
it would simply not be feasible to approach this serially. In high performance processors a
common technique to improve data bus bandwidth is to double (or even quadruple) the bus width
from the processor’s ‘natural’ size and fetch two (or four) instructions in parallel. Of course all
the bits in the instruction are also in parallel …
Serial buses
Whilst (almost) all processor buses have separate parallel wires for every address and
data bit, it should be noted that it is possible to provide such a bus serially (albeit at very
low performance). For an example, interested parties should look up buses such as I2C.
119
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Synchronous Communications #1
A synchronous communication occurs if the transmitter and
receiver are using the same clock.
One method is to send the clock between the systems.
Data validity indicated/sampled with rising clock edge
• Two wires required
• Potential for clock skew
• Suitable for transmission under software control/timing
Note: with this system the clock period can be varied. (It is still
synchronous though.) Example: a PC keyboard serial line
120
Synchronous Communications #1
Synchronous communications are those which use the same clock for both the transmitter and the
receiver. This can be a very convenient method because the system can be designed as a
straightforward, synchronous FSM. It is a method which works well, for example, when a
processor communicates with its memory.
No two clocks run at the same speed. However accurately they are made there will always be
some discrepancy. Therefore the only way to maintain synchronous communications is to use the
same clock.
When a processor communicates with its memory the CPU is the master and can dictate the
system timing. However this is harder when considering communications over a distance, perhaps
between two separate computers. In this case to maintain synchronisation the clock information
must be sent across the communications link. Note that this could be set by either the transmitter
or the receiver (although the first choice is more intuitive).
The problem is then that the clock information is being sent in parallel with the data, thus
implying that an extra wire (or similar) is required. This is a small overhead for a parallel
interface but doubles the number of connections on a serial interface. Furthermore there can be a
problem of clock skew; this happens if the path lengths (delays) of the two wires are different.
Skew cannot change the frequency of transmission but the different latency can shift the relative
phase of the signals. Clearly any phase shift must be significantly less than a bit period.
120
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Synchronous Communications #2
Another method of synchronous communications is to encode
the clock with the data.
Example: Manchester encoding
There is always a transition at a predictable time
• Receiver uses signal and knowledge of (approximate) bit rate to regenerate the clock
• Only need one signal wire + return (Gnd)
Example: Ethernet
Similar approaches are used for other ‘communications’ media.
Examples:
• Magnetic disc
• CD-ROM
121
Synchronous Communications #2
One solution to both these problems is to encode the clock and data onto the same
signal. There are several ways to do this: one example is Manchester encoding
which is used for Ethernet transmissions. This encodes a “0” as a falling edge and
a “1” as a rising edge. Some other transitions may have to be inserted to make
this work (see figure opposite). The ‘clock’ information can be recovered because
there is a transition (rising or falling) for every bit and the receiver, knowing the
approximate bit period, can lock to the exact period using a Phase-Locked Loop
(PLL).
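To make the encoding concrete, here is a small, hypothetical Python sketch (using the 0 = falling edge, 1 = rising edge convention described above; polarity conventions vary between standards):

# Hedged sketch: Manchester encoding as described above.
# Each bit becomes two half-bit levels so that there is always a
# mid-bit transition: '0' -> high-then-low (falling edge),
#                     '1' -> low-then-high (rising edge).

def manchester_encode(bits):
    levels = []
    for b in bits:
        if b == 0:
            levels += [1, 0]   # falling edge in the middle of the bit cell
        else:
            levels += [0, 1]   # rising edge in the middle of the bit cell
    return levels

# The 10101010 preamble from the exercise below gives a regular square wave,
# which the receiver's PLL can lock on to.
print(manchester_encode([1, 0, 1, 0, 1, 0, 1, 0]))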
Phase-Locked Loops
A phase-locked loop is an oscillator which can adjust itself to match an external frequency. An
example would be a system of you and your watch; the watch is a good time reference for most
purposes but, every so often, it is necessary to ‘re-synchronise’ with a reference clock such as a
radio time signal.
Exercise
When using Manchester encoding a synchronising preamble is required; why is
the sequence 10101010 chosen? (Hint: try encoding this sequence.)
121
Some other self-clocking encodings:
Another common application for self-clocking encodings are magnetic
recordings. A magnetic disc’s data rate depends on its rotation speed (which may
not be quite constant). This is exacerbated with interchangeable discs, which may
have been written on a different drive. The data stream must therefore be self-clocking. Some codes which are/have been used are:
• Frequency Modulation (FM1) once used for floppy discs
• Modified Frequency Modulation (MFM) used for floppy discs & early hard discs
• Run Length Limited (RLL) used for hard discs
FM encoding uses a transition to indicate the start of a bit period; data is encoded
by the presence or absence of another transition within the bit period.
The other codes mentioned give denser recordings by omitting some transitions;
basically it is possible to survive without re-synchronising the clock for every bit,
just as a watch does not have to be reset every hour. CD recordings (etc.) make
use of similar data recovery techniques.
1. Nothing to do with the radio!
122
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Asynchronous Communications #1
The Asynchronous Serial Line
• Asynchronous – no clock transmitted
However
• Does rely on an agreement on the (approximate) transmission frequency
• All ‘symbols’ have the same length
• but there can be an arbitrary time between them
123
Asynchronous Communications #1
Asynchronous communication – i.e. communication without a common clock – is
often more convenient than shipping the clock across the interface. There are two
different techniques which are referred to as asynchronous communication; these
are exemplified below.
The Asynchronous Serial Line
In an asynchronous serial line no clock information is transmitted but the
transmitter and receiver have already agreed on (and fixed) the period of
transmission of a data element. Because no two clocks run at exactly the same
rate there is necessarily some mismatch, but this can be minimised by
resynchronising the receiver with the transmitted stream every so often. A typical
asynchronous serial line used (for example) as a modem1 will synchronise on
every byte of a message. As long as the transmitter and receiver frequencies are
‘close’ they will not drift too far apart before they are synchronised again.
The need for this resynchronisation imposes an overhead on the transmission
which inserts ‘extra’ bits that are not in the message. This is the reason that a
serial line at (say) 9600 baud will only transmit around 960 bytes per second
rather than the 1200 you might have expected.
1. MOdulator/DEModulator
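A quick, illustrative sketch of that overhead, assuming the common frame of one start bit, eight data bits and one stop bit (as on the following slides):

# Hedged sketch: effective throughput of an asynchronous serial line.
# Assumes a 10-bit frame per byte: 1 start bit + 8 data bits + 1 stop bit.

baud_rate = 9600             # signalling rate in bits per second
bits_per_frame = 1 + 8 + 1   # start + data + stop (no parity assumed)

bytes_per_second = baud_rate / bits_per_frame
print(bytes_per_second)      # 960.0 bytes/s, not the 1200 you might expect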
123
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Symbol transmission
• The ‘start bit’ is used for synchronisation
• The ‘stop bit’ guarantees the line goes ‘idle’
o ensures the next start bit will cause a transition
• The bit time at the receiver is similar to the transmitter
o no synchronisation is lost in ~10 bit times
In this example sending 8 bits takes (at least) 10 bit times.
124
124
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
RS232 serial interface
The most common asynchronous serial
interface (RS232) uses a protocol as follows:
[Figure: RS232 waveform for the ASCII character ‘A’ – the idle line, a start bit, then eight data bits sent lsb first (1 0 0 0 0 0 1 0 on the wire, i.e. 0100 0001 = ‘A’), then a stop bit]
125
Parity
Parity is used for consistency checking of data. It can usually be checked by the
UART, which will indicate a parity mismatch. It can be programmed to be odd,
even or none and may be ignored by the receiver in any case.
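As a small, hypothetical illustration (not tied to any particular UART), an even-parity bit can be computed like this:

# Hedged sketch: computing an even-parity bit for one data byte.
# With even parity the parity bit makes the total number of 1s even.

def even_parity_bit(byte):
    ones = bin(byte & 0xFF).count("1")
    return ones % 2      # 1 only if the data already holds an odd number of 1s

print(even_parity_bit(0x41))  # ASCII 'A' = 0100 0001 has two 1s, so prints 0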
125
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Typical set-up for RS232
There are many variations on the details, but most interfaces
can be programmed to send or receive all of them:
• A start bit
• 7 or 8 data bits
• A parity bit (optional)
• 1 or 2 stop bits
The standard data rates are 75, 110, 300, 600, 1200, 2400,
4800, 9600, 19k2 and 38k4 bits/s.
126
126
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Asynchronous Communications #2
Handshaking
Two handshake transfers are shown.
• The data is set up
• The transmitter asserts Request
• The receiver latches the data and asserts Acknowledge
• The transmitter removes Request
• The receiver removes Acknowledge
The process may then repeat.
Note that the duration of each transaction can be varied by
either participant and there can be an arbitrary time between
transactions. This is a truly asynchronous communication.
127
Asynchronous Communications #2
Handshaking
Handshaking is frequently used in communications buses but the classic example of handshaking
is the parallel printer interface usually known as the “Centronics” interface.
The principle of a handshake interface is that the data is set up and then a request signal asserted
by the sender. The sender can take no further action until the receiver has acknowledged receipt;
the receiver will not do this until it has secured the data and knows that it can continue.
Transactions performed by this handshaking process – where the initiative is passed backwards
and forwards – are truly asynchronous. The transmitter may wait indefinitely before transmitting
and then the receiver can take as much time as it needs before acknowledging. Transmission of
each individual symbol can take a different time.
Clearly this mechanism requires several parallel channels: one for request, one for acknowledge
and at least one for data – it would be usual to have, say, eight data bits in parallel for a total of
eleven wires (don’t forget a ground signal which sets the level the other signals are compared to).
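A minimal software sketch of the request/acknowledge sequence described above (the signal names and the dictionary standing in for the wires are purely illustrative, not real hardware):

# Hedged sketch: the four-phase handshake described above, modelled in software.
# 'req', 'ack' and 'data' stand in for the physical wires.

def handshake_send(channel, value):
    channel["data"] = value      # 1. the data is set up
    channel["req"] = True        # 2. transmitter asserts Request
    while not channel["ack"]:    # 3. wait for the receiver to acknowledge
        handshake_receive(channel)
    channel["req"] = False       # 4. transmitter removes Request
    while channel["ack"]:        # 5. wait for Acknowledge to be removed
        handshake_receive(channel)

def handshake_receive(channel):
    if channel["req"] and not channel["ack"]:
        print("received:", channel["data"])  # latch the data
        channel["ack"] = True
    elif not channel["req"] and channel["ack"]:
        channel["ack"] = False

wires = {"req": False, "ack": False, "data": None}
handshake_send(wires, 0x41)

Notice that neither side assumes anything about how long the other takes; the transfer only advances when the appropriate signal changes.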
Asynchronous Processors
Traditionally on-chip communication has been done synchronously; the small size of a silicon
chip relative to the clock ‘wavelength’ (a 100 MHz clock will be over a metre of wiring) meant
that the assumption of synchronicity was a good one. With increasing speeds it is increasingly
attractive to consider asynchronous communication between devices and even within devices.
Asynchronous processing is therefore making a comeback in some areas. This is a particular area
of interest in – among other places – Manchester.
127
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
TDM
If there’s spare bandwidth (capacity) on a wire it can be
shared amongst several channels.
This is known as Time Domain (or ‘Division’) Multiplexing (TDM).
128
TDM
Sometimes a communications channel will have more bandwidth available than
an application requires. A common example is a telephone wire.
Most telephone connections are used for voice conversations. Humans can hear
frequencies up to about 20 kHz (at best) but a typical voice is recognisable with a
much lower frequency range; in analogue terms around 3 kHz bandwidth is
sufficient. Because the signal is analogue this can represent rather more than
3000 bits per second (bps); in fact ~56 000 bps is a reasonable guide1.
When sending messages between cities or between countries rather more
sophisticated connection technologies are used and the bandwidths are
commensurately higher. It would clearly be a huge waste to use such a channel
for a single telephone chat!
Instead the available bandwidth can be partitioned amongst a number of
conversations which can share the same wire (or fibre). To do this each
conversation is stored, broken into small excerpts and squeezed onto the wire
much more rapidly.
1. Think of a typical computer modem.
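To give a feel for the numbers, a small hypothetical sketch (the 2 Mbit/s trunk rate is an assumption; the per-channel rate is the modem-style guide mentioned above):

# Hedged sketch: how many digitised voice channels could share one link using TDM.
# Illustrative figures: one voice channel ~56 kbit/s, over an assumed 2 Mbit/s trunk.

link_rate_bps = 2_000_000      # assumed capacity of the shared wire/fibre
channel_rate_bps = 56_000      # approximate rate of one voice conversation

channels = link_rate_bps // channel_rate_bps
print(channels, "conversations can be interleaved on the link")   # 35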
128
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Uses of TDM
• TDM is a common technique in communications.
• It can be used in other applications
o e.g. a seven segment decoder in the laboratory
If the channel is being used for real time communications
(such as a telephone line or television programme) there is
also a limit on the latency of each ‘packet’.
129
Optical fibres
An electrical signal can be converted into light using an LED and back to an
electrical signal using a photodiode or phototransistor. In between the light can be
carried along an optical fibre or “light guide” with very little loss.
An optical fibre conducts light by total internal reflection from its inner walls.
The light (and hence the energy) cannot escape because the incident angle is
greater than the critical angle of the fibre.
129
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Layered Communications Protocol
130
ISO 7498
International Standards Organisation model for Open Systems Interconnection
(OSI).
When you send a letter – the old fashioned kind with paper and a stamp – all you
want is for your message to get to the correct place; you don’t really know (or
care) how this happens.
To facilitate this you place the letter in an envelope with an address on the
outside and deposit it in a postbox. It will be emptied into a sack and the sack will
be put in a van with other sacks. The sack will be transported to a sorting office
where the van will be opened and the sack removed and emptied. The letter will
then be enclosed in a different sack for transport to another city. This may be put
in a train; the train driver knows where to take the train (usually!) but may be
unaware of the mail sack and certainly won’t have seen your letter.
The process continues until, in the end, the postman takes letters bundled per
street, splits these, delivers them to a house where the recipients check the
individual names and the recipient receives the missive.
Computer communications happens in much the same way. A number of ‘layers’
are defined. When an application (e.g. Netscape) on one machine wants to
communicate with an application on another it uses a virtual link.
130
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Layered Comms. Protocol Cont.
• Most layers correspond via a virtual link
• Only the physical layers have a real link
• Some layers are implemented in hardware, others in
software
This not being a communications course, we shan't delve into this,
just observe that it exists. The layered protocols allow a
convenient degree of abstraction at each level, so a word
processor does not need to know that its document file is on a
remote file server accessed by, for example, a fibre optic token
ring …
131
ISO 7498 defines the following layers:
• Application - the user’s view of the system
• Presentation - format conversion
• Session - session management, security et al.
• Transport - connections and channels (e.g. TCP – Transmission Control
Protocol)
• Network - packetising and routeing (e.g. IP – Internet Protocol)
• Data link - error checking and retransmission if necessary
• Physical - connection and switching (e.g. Ethernet)
All of these can communicate with their counterparts but only the physical layer
has a physical link (wire, optic fibre etc.); all the others use virtual links. The
upper levels will be implemented in software, the lower ones in hardware and
some mixture will be used in between.
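As a very rough sketch only – the layer names are real, but the 'headers' and the wrapping
function are invented – each layer can be pictured as adding its own envelope around the data
handed down to it, with the receiving side unwrapping them in reverse order:

#include <stdio.h>
#include <string.h>

/* Toy illustration of layering: each layer prepends its own (invented)
 * header, just as each postal stage adds an envelope, sack or van
 * around the letter. */
static void wrap(const char *header, const char *payload, char *out)
{
    sprintf(out, "%s|%s", header, payload);   /* prepend "header|"     */
}

int main(void)
{
    char transport[256], network[256], datalink[256];

    const char *message = "HELLO";            /* application data       */
    wrap("TCP", message,   transport);        /* transport layer header */
    wrap("IP",  transport, network);          /* network layer header   */
    wrap("ETH", network,   datalink);         /* data link layer header */

    /* only this final frame ever travels on the physical link */
    printf("on the wire: %s\n", datalink);    /* ETH|IP|TCP|HELLO       */
    return 0;
}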
131
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Interfacing
There is a plethora of possible I/O devices which may be used.
132
Interfacing
A user does not (or, at least, should not) care how an I/O device works. From
Unix a ‘file’ can be transferred to a disc, network or modem without knowing
what these actual devices are; similarly, a file can be sent to a printer without
regard for whether the printer’s interface is parallel or serial.
The upper layers of the communications protocol are held in the device driver.
This is a set of operating system routines which has a common virtual interface to
applications but is specific to a particular peripheral device.
We don’t do software in this course, so we will concentrate on the peripheral
device, usually just called a ‘peripheral’. The peripheral implements the lower
levels of the comms. protocol in hardware. These provide the signals on physical
wires (etc.) with the correct voltage levels and the correct timing.
A peripheral may be a very simple interface (leaving much of the function in
software) or it can be highly sophisticated. The most sophisticated devices are
complete peripheral processors which are computers in their own right, complete
with their own, embedded software.
The devices at the other side of the interface (outside the box) may also have
considerable intelligence. A typical printer, hard disc drive or even a keyboard
will often contain its own processor.
132
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Interfacing Cont.
• Can’t consider all possibilities when specifying a CPU.
Typically I/O devices are ‘mapped’ into memory space (see
the sketch below).
The specific requirements of a particular interface are
provided by:
• a peripheral device (hardware)
• a device driver (software)
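As a minimal sketch of what memory mapping means to the software (the base address, register
offsets and status bit below are invented purely for illustration):

#include <stdint.h>

/* Hypothetical memory-mapped peripheral. The addresses are made up; real
 * values come from the system's memory map. 'volatile' stops the compiler
 * optimising away accesses to registers whose contents change
 * independently of the program. */
#define PERIPHERAL_BASE  0xC0000000u

#define DATA_REG    (*(volatile uint32_t *)(PERIPHERAL_BASE + 0x0))
#define STATUS_REG  (*(volatile uint32_t *)(PERIPHERAL_BASE + 0x4))

void device_send(uint32_t value)
{
    while ((STATUS_REG & 1u) == 0)   /* wait until the device is ready      */
        ;                            /* (the bit meaning is an assumption)  */
    DATA_REG = value;                /* an ordinary store acts as I/O output */
}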
133
Interfacing examples
When typing on a PC keyboard the matrix of 100+ keys is scanned by a
microcontroller (single chip computer). This uses simple, digital parallel inputs to
detect if a key is pressed or released and will also note the time when a change
happens with its on-board timer. It then runs software which ensures that the key
is debounced. When it is sure that a key state has changed it identifies the key and
action (pressed or released) by a code which is sent to the main computer via a
synchronous serial line.
The serial transmission is received by another microcontroller which records the
key information and translates it into an internal key code such as an ASCII
character. Note that the translation may depend on the state of other keys, such as
‘control’ and ‘shift’. It then interrupts the CPU and allows this to read the key
state via the bus. If a key is pressed the CPU records this in a buffer in memory.
This can later be read by an application program asking for keyboard input.
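A hedged sketch of the sort of debounce routine the keyboard’s scanning microcontroller might
run; the helper functions, timing constant and reporting mechanism are all assumptions rather
than real firmware:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers assumed to exist in the microcontroller's firmware. */
bool     read_key_raw(int key);                 /* sample one key contact   */
uint32_t millis(void);                          /* on-board timer, ms       */
void     send_key_code(int key, bool pressed);  /* serial line to host      */

#define DEBOUNCE_MS 5   /* contact must be stable this long (assumed value) */

/* Report a key event only once the contact has stopped bouncing. */
void scan_key(int key, bool *stable_state)
{
    bool raw = read_key_raw(key);
    if (raw == *stable_state)
        return;                            /* nothing changed               */

    uint32_t start = millis();             /* note when the change began    */
    while (millis() - start < DEBOUNCE_MS)
        if (read_key_raw(key) != raw)      /* still bouncing: give up,      */
            return;                        /* try again on the next scan    */

    *stable_state = raw;                   /* change confirmed              */
    send_key_code(key, raw);               /* tell the main computer        */
}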
In reading data from a hard disc drive the magnetic transitions induce tiny
electric currents in the read head. These are amplified to digital levels which are
used by the data recovery circuit. This passes the serial data to a shift register
where it is assembled into parallel form and passed to the drive’s on-board processor. It is then
gathered into a buffer in the drive’s memory. This can then be read out across a
parallel interface by the CPU or – more likely – DMA into the main computer’s
memory space.
133
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
An Example Peripheral
A good example of a peripheral device is a serial interface.
The generic example is the UART.
A UART is basically a pair of shift registers (one for input, one for output)
controlled by a finite state machine.
The processor can follow the states of the FSM via a status register. To use
the transmitter:
• Wait until the transmitter is free
  o indicated in the status register
• Write byte to transmitter register
• Repeat
To use the receiver:
• Wait until the receiver is full
  o indicated in the status register
• Read byte from receiver register
• Repeat
The peripheral deals with the serialisation etc.
134
A UART
o Universal – programmable to match the user’s requirements
o Asynchronous – can use asynchronous serial communications
o Receiver – can do input …
o Transmitter – … and output
Using a UART
We will save the detailed construction of a UART for later. However let’s look at
the interfaces starting with the transmitter. N.B. This is a simplification of the
actual operation.
It would be possible for the CPU to break down words into individual bits for
serial transmission. However, this is a tedious process in software but is
relatively easy in hardware. The UART therefore provides a parallel interface for
the CPU (often only 8 bits wide though).
The CPU can write a byte to this interface and the peripheral does the rest,
serialising it by shifting the bits out one at a time. Shifting occurs at the rate
expected by the interface, which may well be different from the processor’s
operating speed.
134
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
A ‘Real’ UART
The table below shows the register definitions for a (fictitious) UART.
[Register definition table not reproduced here; it includes the status register described below.]
135
Because the serial interface and the processor are running at different speeds it is
necessary to have a means of indicating that all the bits have been shifted out; if
the processor tried to send the subsequent byte too early it would corrupt the
previous transmission. The whole process is controlled by an FSM which keeps
track of the operation of the device.
The UART idles until a byte is sent; it then becomes ‘busy’. The CPU must not
send another byte while the UART is busy. This is prevented by software, but the
software must be able to find the FSM’s status; the ‘busy’ bit is therefore made available in
a second register, the status register.
The status register must reside in a different place from the address used for the
data output, so our UART must occupy at least two addresses. In practice it will
occupy more, because it will be programmable (‘universal’) with characteristics
such as transmission speed.
The operation of the receiver is similar, with bits being shifted in serially and the
assembled byte presented in parallel. Another status bit (it will fit in the same
register at the same address) is used to indicate when a byte has arrived.
135
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Real UART Cont.
Here some of the features common in such peripheral
interface devices are shown.
• Some functions are programmable (e.g. baud rate)
• Several registers needed to support one serial
interface
• Not all the registers read back the same value that was
written to them
• Reading some registers can cause ‘side effects’
  o e.g. reading the data input clears the ‘data ready’
bit
• A timer has been included which shares the interface
136
A UART in Action!
A UART is a reasonably complex peripheral device. The one depicted here is a (gross)
simplification, but it illustrates most of the features of a typical peripheral device as viewed from
the software side.
The UART interface comprises several registers at different addresses. Typically these are 8 bits
wide. If the UART is used with a 32-bit processor (e.g. ARM) then all these registers will be
connected to the same data lines (see the section of notes on memory) and thus will not be at
contiguous addresses. The UART itself will also be mapped into memory space somewhere, so
the registers may be at (for example){C0000000, C0000004, C0000008, C000000C}.
When the UART is reset (e.g. at switch-on) it will disable its receiver and transmitter and
deactivate any possible interrupt signals. The user must program any set-up options (such as the
baud rate) before enabling the peripheral’s function.
A byte to be transmitted is written to register 0. Before transmitting a subsequent byte the
software needs to ensure that the first one has gone (serial transmission is usually a lot slower than
the software). This can be done by testing bit 4 in the status register. The receiver operates in a
similar fashion; here the software must wait for the receiver to be ready (status register bit 0 set)
before reading the byte. Typically the act of reading register 0 will reset this bit, so you can only
read the character once.
A flag has been included to indicate if an error has occurred (e.g. if reception has been corrupted
by noise). It is up to the software to check (or ignore!) this information.
If you want to see real UART data sheets there are plenty available on the WWW.
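Putting this together, polled transmit and receive routines for the fictitious UART might look
like the sketch below. The status bits (bit 4 = transmitter empty, bit 0 = receiver ready) and the
data register at the first address follow the description above; placing the status register at the
next word address is an assumption for the sake of the example.

#include <stdint.h>

/* Fictitious UART from the notes: data register at the base address,
 * status register assumed to sit at the next word address. */
#define UART_BASE   0xC0000000u
#define UART_DATA   (*(volatile uint8_t *)(UART_BASE + 0x0))
#define UART_STATUS (*(volatile uint8_t *)(UART_BASE + 0x4))

#define TX_EMPTY    (1u << 4)    /* bit 4: previous byte fully sent      */
#define RX_READY    (1u << 0)    /* bit 0: a received byte is waiting    */

void uart_send(uint8_t byte)
{
    while ((UART_STATUS & TX_EMPTY) == 0)
        ;                        /* previous byte still being shifted out */
    UART_DATA = byte;            /* the hardware serialises it from here  */
}

uint8_t uart_receive(void)
{
    while ((UART_STATUS & RX_READY) == 0)
        ;                        /* wait for a complete byte to arrive    */
    return UART_DATA;            /* reading also clears the 'ready' bit   */
}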
Timers
This UART also includes a timer. This is a device which measures the elapsed time as the system
runs. It is necessary to use hardware to get an accurate timing measurement because software
timing is difficult to calibrate and notoriously inaccurate¹. The timer is not part of the UART.
However timers are often included on other peripherals because they only require one extra signal
(their clock) over the many bus interface and address decoding signals already present.
1. Think of the effect of running the same program on a faster computer!
136
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
DMA
Our ‘three box’ model has considered the CPU, memory and I/O
So far the CPU has always been the bus master (i.e. in control). Most I/O is
data being moved into or out of memory, e.g.
• programs
• data files
• video display
The CPU has to perform the transfer. It is much more efficient if the transfer
can be performed without CPU intervention.
The CPU can then be used for something else.
137
Direct Memory Access (DMA)
DMA is Direct Memory Access (by a peripheral device). The concept is quite straightforward but
the implementation details can be troublesome; we shall therefore stick to the general idea here.
Most I/O – especially most high bandwidth I/O – involves moving the data to/from memory in
significant blocks. For example, loading a program involves moving a large block of words from
a disc or network interface into a contiguous address space.
This is a pretty mundane process (sketched in code after the list):
• wait for a peripheral to become ready
• fetch a byte (or word) from I/O
• store the byte (or word) into memory
• increment the memory address (ready for the next transfer)
• count off the transfer (to detect when to finish)
• repeat
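Written out in C (with invented register names and addresses), the whole copy loop is only a
few lines:

#include <stdint.h>
#include <stddef.h>

/* Programmed I/O copy loop that the CPU would otherwise have to run;
 * the peripheral registers are hypothetical, as in the earlier sketches. */
#define DEV_DATA   (*(volatile uint32_t *)0xC0000000u)
#define DEV_STATUS (*(volatile uint32_t *)0xC0000004u)
#define DEV_READY  1u

void copy_in(uint32_t *memory, size_t count)
{
    while (count-- > 0) {                    /* count off the transfer    */
        while ((DEV_STATUS & DEV_READY) == 0)
            ;                                /* wait for the peripheral   */
        *memory++ = DEV_DATA;                /* fetch, store and advance  */
    }
}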
Hardly a taxing program! The CPU’s time can be better spent doing more difficult work. DMA
allows the peripheral itself direct access to the memory. The peripheral, with the addition of some
simple hardware, can then move the data into (or out of) the memory without bothering the CPU.
This introduces some extra concurrency (parallelism) into the system.
The hardware cost of DMA is fairly small but it does add complexity to the overall system. In
order to get at the memory the DMA transfer needs the bus (see picture, opposite) and the CPU
may want it at the same time. There is therefore an issue of arbitration for bus mastery to
resolve any contention. (Even if conflicts occur the DMA process is more efficient at moving data
than the CPU – it needs no instruction fetches.)
DMA is primarily used for high bandwidth transfers. A good, but perhaps not obvious, example is
the display output on a typical workstation. A frame buffer contains about 1 Mpixel and each
pixel may use a 32-bit representation. As the display is refreshed at (say) 70 Hz the required
bandwidth will be ~300 Mbytes per second. Certainly hardware assistance is required for transfers
at this rate!
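As a rough check of that figure, using round numbers: 1 000 000 pixels × 4 bytes per pixel ×
70 refreshes per second ≈ 280 000 000 bytes per second, i.e. close to 300 Mbytes per second.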
137
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Hardware Interrupt
• This is a hardware signal which, when active,
causes the processor to change the program
context.
• It allows the processor to respond to an
external event in something closer to real time.
138
Interrupts
Interrupts fall into a category of occurrences usually classified as “exceptions”. Exceptions
are events which occur ‘unexpectedly’ during the execution of a program, rather than as a
direct result of the execution of instructions. An example of an exception could be an integer
DIVision instruction which tries to divide by zero; the answer “infinity” is not representable
by an integer and so an unexpected error has occurred. Some processors will generate an
exception (or “trap”) if this occurs. (ARM does not do this one because it has no specific
divide instruction.) This causes execution to branch to an exception handler (a.k.a. “trap
handler”) which can take remedial action.
What is an interrupt?
An interrupt is another class of exception. It is initiated by a hardware signal to the processor
which causes the processor to jump to an interrupt handler or interrupt service routine
(ISR). When the exception is resolved the processor needs to jump back to the original code
and proceed as if nothing had happened. The interrupt service routine can therefore be
regarded as a software procedure; the only difference is that it is called by a hardware signal
rather than a software instruction.
138
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Interrupts
[Diagram: timelines for a program using two I/O ports, comparing polled I/O with
interrupt-driven I/O. When an input arrives, the interrupt hardware requests service after a
short latency; the CPU time otherwise spent polling in software is saved.]
1. Interrupts are more efficient
2. Interrupts are less complex (honest!)
139
Why use interrupts?
Program execution in the processor is a serial operation. However the computer often wants to
do many tasks ‘at once’. Most computers give the illusion of multi-tasking by doing a bit of
one job, then a bit of another, etc. and cycling these different tasks fast enough to deceive the
eye! This is another form of time domain multiplexing.
Many tasks are very simple. For example the job of inputting characters from a keyboard
involves a lot of waiting (characters arrive at <10 per second, instructions may execute at
>1 000 000 000 per second) and it is a waste of processor time for the CPU to repeatedly poll
the keyboard and wait.
A simple, cheap hardware circuit can do nothing just as well as the expensive processor. This
can be tasked to wait for the keyboard input and do something with it. This action may, for
example, involve DMAing the character into memory; in such a system we have (cheap,)
independent, parallel processing between the CPU and the keyboard, and parallelism is
typically good for performance.
Alternatively we may wish to do something more complicated with the character. We could
spend more on hardware but this is often uneconomic. Instead the input can wait until the
character is ready and then ‘borrow’ a short burst of CPU time. This is done by requesting
service via an interrupt. The interrupt service routine may require a few hundred operations, but
these are relatively infrequent and only requested as needed so the overhead is small.
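As a rough, illustrative estimate (the per-interrupt instruction count is a guess): 10 characters
per second × ~300 instructions per interrupt is only ~3 000 instructions per second, i.e. about
0.0003% of a processor executing 1 000 000 000 instructions per second.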
139
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Servicing Interrupts
[Flow diagram: servicing an interrupt, with the work divided between hardware and software.
Hardware: interrupt occurs; save PC & status; disable the interrupt; context switch into the
handler; on return, restore PC & status.
Software: save working registers; service the interrupt; clear the interrupt; restore the
working registers.]
140
Servicing Interrupts
What happens on an interrupt?
When an interrupt is serviced the processor is seconded to run a different thread. This
interruption of the current program preempts ‘normal’ execution. The point to be emphasised
here is that the interrupt service routine (ISR) is called at ‘random’ positions in the user’s code.
At the time the interrupt is serviced any or all of the processor’s registers may be holding useful
data. Some of these registers may be needed to run the ISR but, if they are altered, it is essential
that all the state of the processor is restored before normal service is resumed. (The
consequence of this not happening has been likened to walking down a street when, suddenly in
mid stride, you find all your clothes have changed – or worse!)
Clearly there are some parts of the processor state which the ISR is unable to preserve because
they must be changed before the ISR is entered. The most obvious value is the program counter
(PC), which indicates the position at which the interrupt occurred and therefore defines the return
address. The hardware therefore has to cooperate to save some of the processor’s state.
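A hedged C sketch of an interrupt service routine for the fictitious UART is shown below. The
register addresses are the invented ones used earlier, and the saving and restoring of working
registers described above is normally done by compiler-generated (or hand-written) entry and
exit code rather than appearing explicitly in the C source.

#include <stdint.h>

/* Registers of the fictitious UART, using the invented addresses from the
 * earlier sketches. */
#define UART_DATA   (*(volatile uint8_t *)0xC0000000u)
#define UART_STATUS (*(volatile uint8_t *)0xC0000004u)
#define RX_READY    (1u << 0)

volatile uint8_t  rx_buffer[64];      /* shared with the main program      */
volatile unsigned rx_count;

/* Interrupt service routine. By the time it runs the hardware has already
 * saved the PC and status and disabled further interrupts; the routine's
 * entry/exit code saves and restores any working registers it uses. */
void uart_isr(void)
{
    if (UART_STATUS & RX_READY) {         /* check it really was our device  */
        uint8_t byte = UART_DATA;         /* reading also clears the request */
        if (rx_count < sizeof rx_buffer)
            rx_buffer[rx_count++] = byte; /* hand the byte to the program    */
    }
}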
140
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Interrupt Implementation
Many devices, one interrupt.
On-chip where there is a fixed number of interrupting devices …
[Diagram: three I/O devices with their interrupt outputs combined into the CPU’s single INT
input.]
On a PCB (etc.) where there is a variable number of interrupting devices …
[Diagram: three I/O devices sharing one wired INT line to the CPU.]
The latter uses open-drain outputs and forms an expandable (if relatively slow) ‘gate’
from the wire itself.
141
Sometimes a ‘peripheral’ device is used to collect the various interrupt signals and simplify the
interrupt response software.
[Diagram: an interrupt-collecting peripheral. Several interrupt inputs are latched, with
per-source enables; an address decode lets the processor read and control them over the data
bus, and a single interrupt output goes to the processor.]
It concentrates the interrupts for the processor and will usually allow the state of the individual
signals to be read in software. This avoids the need to read each potential interrupting device.
Another common facility (implemented here) is to allow the various interrupt sources to be enabled
selectively; this is in addition to any such facility provided by the peripherals themselves.
More sophisticated devices may contain a priority encoder which allows the highest priority active
interrupt to be identified by a single read operation. Such devices may also support the ability to
change the interrupt priority ‘on the fly’. For example it may be desirable for the last serviced
interrupt to be given the lowest priority; this is difficult to do in software (i.e. it adds
significantly to the interrupt latency).
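A sketch of the dispatch code such a peripheral makes possible, assuming (purely for
illustration) a readable status register with one interrupt source per bit and invented handler
names:

#include <stdint.h>

/* Hypothetical interrupt 'peripheral': one status bit per source,
 * readable in a single bus cycle. */
#define INT_STATUS (*(volatile uint32_t *)0xC0001000u)

typedef void (*handler_t)(void);

extern void keyboard_isr(void);   /* invented handler names */
extern void uart_isr(void);
extern void timer_isr(void);

/* one handler per status bit, in priority order (bit 0 highest here) */
static handler_t handlers[] = { keyboard_isr, uart_isr, timer_isr };

void dispatch(void)
{
    uint32_t pending = INT_STATUS;          /* one read covers all sources */
    for (unsigned bit = 0; bit < 3; bit++)
        if (pending & (1u << bit))
            handlers[bit]();                /* service each active source  */
}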
141
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Interrupt Priority
Some interrupts are more ‘important’ than others.
• Some interrupts require servicing more urgently than others.
• Some interrupts take longer to service than others.
• Sometimes it is desirable to service one interrupt whilst in the middle of
another’s ISR.
• Occasionally two or more interrupts may happen simultaneously; which is
chosen?
Prioritisation can be done by:
• Software, choosing the order in which potential interrupts are checked
• An interrupt controller with several separate interrupt inputs
  o Some CPUs include this, together with a number of operating priorities
• Daisy chaining
• A combination of all the above
142
Exercise
By making reasonable assumptions estimate how much of a processor’s time is
needed to deal with a mouse which interrupts every time it moves a ‘step’.
Number of steps/cm?
Number of steps/s?
What happens on each step?
Roughly how many instructions per interrupt?
How many dimensions?
What about button clicks?
142
COMP12111 Fundamentals of Computer Engineering
School of Computer Science
Daisy Chain
The daisy chain is a mechanism which can be used to prioritise any
number of interrupting devices.
(One possible method)
[Diagram: three peripherals in a daisy chain. Each peripheral’s Int output is wired to the CPU’s
single Int input. The CPU’s IAck output drives the first peripheral’s IAckIn; each IAckOut feeds
the next peripheral’s IAckIn; the final IAckOut is left unconnected (N/C).]
The processor cooperates by ‘acknowledging’ (accepting) an interrupt request.
* Used in collaboration with vectored interrupts
* Requires little extra hardware
* Usable with a single interrupt signal
* Cheap in pins
* ‘Infinitely’ expandable (a sequential process, though, so beware of the time required.)
143
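As a rough software model of the acknowledge chain (the real thing is a little combinatorial
logic in each peripheral): each device passes IAckIn straight through to IAckOut unless it has a
request pending, in which case it claims the acknowledge and the chain stops there, so the
device nearest the CPU wins when several request together.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the daisy-chained acknowledge: IAck ripples from the CPU
 * through each peripheral; the first one with a pending request claims it
 * and stops the chain, so position in the chain sets the priority. */
#define N 3

int main(void)
{
    bool requesting[N] = { false, true, true };  /* example: 1 and 2 ask */
    bool iack_in = true;                         /* CPU asserts IAck     */

    for (int i = 0; i < N; i++) {
        if (iack_in && requesting[i]) {
            printf("peripheral %d claims the acknowledge\n", i);
            iack_in = false;                     /* chain stops here     */
        }
        /* this device's IAckOut becomes the next device's IAckIn */
    }
    return 0;
}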
143