COMP12111 Fundamentals of Computer Engineering School of Computer Science The Three Box Model Ernie Hill Room IT 118 [email protected] 1 The ‘classic’ model of a basic computer is known as the “three box model”; the reason should be obvious! The three boxes are: • The Central Processing Unit (CPU) • The Memory • Input and Output (I/O) CPU The CPU is the computer processor – the ‘brain’ in the system. It is responsible for running programmes and controls – or, at least, supervises, the functioning of the other parts of the system. The processor is perhaps the most complex element in the system, however it is simply a big Finite State Machine (FSM). Memory In principle the memory is the simplest part of the computer. It acts as a large ‘jotting pad’ for data that the processor cannot hold itself. In practice modern memory architectures can be very complex! I/O It may be slightly misleading to collect all the disparate input and output systems {keyboard, display, disc, network, speaker, …} together under a single heading, but it makes a convenient grouping. Input and output can use an extremely diverse range of mechanisms and devices, both from system to system and within a particular computer. Nevertheless the point remains that a computer needs some form of communication or it is not useful! 1 COMP12111 Fundamentals of Computer Engineering School of Computer Science Three Box Model of a Computer SYSTEM BUS Memory CPU I/O • • • There is more to a computer than a CPU! Clearly there must be some storage (memory). The system is no use unless it can communicate with the outside world, therefore some Input and Output (I/O) is needed. • • These need interconnecting; this is usually done via a bus. This leads to what is often known as the “Three Box Model” 2 System Bus The three boxes are connected by a system bus (or, occasionally, several buses). This is a different use of the term “bus” from that previously encountered. Although a computer bus is a collection of signals with a broadly similar purpose (communication between the ‘boxes’ described above) the signals do not all form part of a single number or value. Typically there will be several sub-bundles such as the data bus, address bus and a set of other signals often collected together as the control bus. ‘Glue’ Not included in the three box view is the small amount of logic used to interface these components together. This is often collectively known as “glue” logic. It includes such necessities as address decoding, clock distribution and reset circuits. 2 COMP12111 Fundamentals of Computer Engineering School of Computer Science Balanced System Amdahl/Case Rule A balanced computer system needs about 1 megabyte of main memory capacity and 1 megabit per second of I/O per MIPS of CPU performance. 3 A computer must be “balanced” in that it should have approximately comparable capabilities in each of its ‘boxes’; for example a high performance processor is wasted if insufficient memory is provided. 3 COMP12111 Fundamentals of Computer Engineering School of Computer Science CPU FETCH DECODE EXECUTE The Central Processing Unit is the ‘brain’ of the computer. It is a finite state machine which ‘runs’ the programs placed in the memory by the user. It does this by repeatedly performing three operations: o Fetch o Decode o Execute on a sequence of instructions or “program”. 4 The CPU repeats the following three actions indefinitely: Fetch The processor maintains a pointer to the address it has reached in the program. 
It reads a word from that address which will be interpreted as an instruction. Normally, having read an instruction, the pointer moves on to the next address. Decode The instruction which has been read is examined to see what it means. In practice the contents of the memory are just numbers. However the processor can interpret a number as a coded way of specifying some action. Thus, for example, 0 could mean “add”, 1 could mean “subtract” etc.1 The decoding process takes the instruction or op. code (operation code) and sets the appropriate set of control signals for the FSM. Execute In the execution phase data is moved through the datapath (see the notes on RTL design) and the requested calculation is actually performed. After completing this sequence the processor goes back to fetch the next instruction and repeats the sequence. 1. This is not a new idea. Consider the way flag signals were used to control ships in Nelson’s navy, or the way dots and dashes form messages in Morse code. 4 COMP12111 Fundamentals of Computer Engineering School of Computer Science Program Execution The function of the program is to change the state of the computer system, according to the data supplied as input. A typical CPU maintains the system state in the form of: o on - board registers o external memory This state together with the predefined program (“logic”) moves one clock (“instruction”) at a time to resolve the final output. i.e. the entire system is also a finite state machine! 5 A computer processor is not, necessarily, a complex component; this will be illustrated in later lectures where a complete processor design is developed. However CPU design can be extremely complex! Modern CPUs Modern CPUs are made very complex in an effort to get the maximum speed from the technology. Considerations include: o multiple issue (“superscalar”) – trying to do several instructions in parallel o “pipelining” – starting the next instruction before the current one is finished o “reordering” – executing instructions in a different order from which they are fetched o combining groups of instructions to just determine their net effect & skipping intermediate stages o speculation – making ‘guesses’ as to what is likely to happen in the future before it can be calculated These subjects will be described in future courses. 5 COMP12111 Fundamentals of Computer Engineering School of Computer Science Address Space(s) “Conventional” CPUs have at least one address space. An address space is a number of (potential) locations which the system can address. Memory address space: Each location has a unique address o the address is just a number, interpreted as an address o programmers sometimes call this a pointer o it may not be the same length as an ‘integer’ 6 Hexadecimal Numbers From here on we are going to be using some large numbers; notably 16- and 32-bit numbers. Long binary numbers are fine for computers but hard for humans. Decimal numbers are familiar to humans but difficult to convert to/from binary. To make binary numbers more readable they frequently are represented in base 16 or hexadecimal. Hexadecimal is convenient because 16 is a power of 2, so each digit represents exactly 4 bits. The representation is much easier to read than a long string of “0”s and “1”s.Hexadecimal digits are: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}. Addressing The CPU is the system master; it controls the bus which is used to communicate with the other ‘boxes’. Memory and I/O can be regarded as passive components. 
The memory and the various I/O devices need to be distinguished and this is done by addressing them. Each memory location and each I/O location has a unique address, in the same way that each house in Manchester (or, indeed, the world) has a unique address. o An address is the location of some item. To simplify addressing a computer will normally require that (unlike houses) all its data elements are the same size. This characteristic size is known as the word length; it varies in different architectures. o Simple, cheap microcontrollers frequently have an 8-bit word length o Workstation processors usually use 32-bit words o Newer processors are moving to 64-bit architectures o A few specialist architectures can address individual bits (1-bit word) Note: what is referred to as a "word" can vary in size, depending on who is using the term! A processor will have an address space in which all its words reside. The size of the address space also varies according to the architecture of the processor. The smallest common address space uses a 16-bit address – i.e. it can represent 2^16 different address values and can therefore hold 2^16 = 65,536 different words. A 32-bit processor (i.e. one using 32-bit words) will typically have a 32-bit address space which has room for 2^32 = 4,294,967,296 memory locations. 6 COMP12111 Fundamentals of Computer Engineering School of Computer Science Other Address Spaces • A processor will often have several selectable (addressable) registers usable in an instruction. Sometimes these are obvious, e.g. {R0, R1, R2, …} • Some processors have separate addresses for memory and I/O. Example: Intel x86 architecture. • Some specialist processors (such as DSPs) have several, separate memories (with separate buses). • On the WWW each page has an address: http://www… 7 Byte addressing From the foregoing it might be expected that a 32-bit processor such as a Pentium or an ARM would be able to address 2^32 32-bit words. In practice, as this is rather a lot of memory, it is usual for such a processor to be able to address individual bytes (8-bit quantities) as well as whole words. This uses 2 of the address lines (because there are 2^2 = 4 8-bit bytes in a 32-bit word), but there are still 30 left to provide 1Gword (4Gbytes) of memory space. Note that, despite this, the processor will normally use its full word size when performing calculations; the least significant (LS) two address lines are therefore largely unused. A consequence of this is that the addresses of adjacent words differ by 4. 7 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory Memory is usually regarded as a 'flat', randomly addressable space. It is usually depicted using a memory map. [Memory map diagrams: one showing the full 32-bit space from address 00000000 up to FFFFFFFC, the other showing consecutive word addresses 00000000, 00000004, 00000008, … 0000001C] This memory map has been drawn with the lower addresses at the top; they are sometimes drawn the other way round. 8 Memory Perhaps the commonest form of memory is referred to as "RAM" which stands for Random Access Memory. It can be implemented in many technologies ("SRAM", "SDRAM", "Flash RAM", etc.) but that is not our concern here. RAM is simply ordinary memory in which the processor can store (write) data and load (read) it back. The term "random" does not imply anything non-deterministic; it means that the processor can get at any location at any time without penalty.
The term dates back to the time when some memory technologies did not have this property; magnetic tape is one example, where some 'locations' may require considerable winding before they can be accessed. It can be quite surprising to see the different sorts of technology which have been used for memory in the past; a little research in this area may provide quite a lot of amusement! Memory sizes It is common to describe memory sizes in terms of kilobytes, megabytes etc. When used for memory these prefixes typically diverge slightly from their normal meaning. One kilobyte is usually 1024 (= 2^10) locations; it is frequently written 1Kbyte (upper case "K") to distinguish this. Similarly one megabyte is 1,048,576 (= 2^20) bytes. This convention makes it relatively easy to see how many bits are required to address a memory of a given size. For example, a 64Mbyte memory requires 2^26 (= 2^(6+20) = 2^6 x 2^20 = 64 x 1M) different addresses and therefore 26 address lines to distinguish these. Exercise: How many locations are addressable using only 19 address lines? 8 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory address space • In practice memory is not generally uniform. • The memory address space may not be filled • Areas may be set aside for I/O (see below) • There may be space for expansion • Different areas of memory may cycle at different speeds • Some areas may repeat due to incomplete address decoding 9 Caches A term often heard in association with memory systems is "cache memory", or just cache1. The function and operation of caches will be described in future courses. However the basic idea of a cache is that it provides a small set of local data to avoid constant references to the real memory (which is big and slow). If you think of the (main) memory as the University library a cache is analogous to the pile of books on your desk – much smaller, but easier to get at! 1. Pronounced the same way as "cash". 9 COMP12111 Fundamentals of Computer Engineering School of Computer Science Input/Output • External interfacing generally involves a wide range of devices. In a desktop computer some typical devices are: – Keyboard – Mouse – VDU – Printer – Sound generator – CD ROM – Magnetic Discs (Hard Disc, Floppy Disc, …) – Flash card reader – Network – Modem – … • In an embedded system there will be many different, specific I/O devices, e.g. think of a computer in a "fly-by-wire" aeroplane. 10 A few example I/O devices: Keyboard Clearly the major feature of a keyboard is that it has a large number of buttons; it is also likely to have a few other functions such as LEDs1. The number of buttons is a potential problem in terms of the hardware requirement – the keyboard must be as cheap as possible – so it is usual to read and encode the key input separately from the main computer system. Typically a keyboard will contain a microcontroller (a single chip computer) which monitors the key input, performs functions such as debouncing, and communicates to the main computer via a serial line. A keyboard is therefore quite a complex item in its own right. Visual Display Unit (VDU) The output for what most people think of as a "computer"2 will be a VDU. This is basically a television screen. The computer views this as a large number (about a million) coloured dots or "pixels"3. The pixels will often be memory mapped, i.e. they not only appear on the screen but they can be read and written as memory elements. For example each pixel could be a byte.
Question: how many different colours could then be represented? However, in addition to acting as a part of the memory this frame buffer has to be copied to the screen 70 times4 a second, one pixel at a time. The logic also has to ensure that pixels are sent at a constant rate and that every pixel is sent at exactly the right time. This calls for considerable, fast logic! 1. LED = Light Emitting Diode 2. In fact more computers are now found in other embedded applications e.g. mobile phones. 3. “Pixel” is a contraction of “picture element”. 4. … or thereabouts. 10 COMP12111 Fundamentals of Computer Engineering School of Computer Science I/O Interfacing System Bus Interface 1 Interface 2 Interfaces translate the system bus signals to those required by the device Device specific Buses I/O Device 1 I/O Device 2 The diverse collection of mechanisms, collected under the heading “I/O”, communicate with the CPU via a range of specially tailored interface devices known as “peripherals”. 11 Magnetic Disc Discs provide a cheap way of storing a lot of bits. They are also a permanent store (unlike most modern RAMs) in that they remember data when the power is off. They operate by magnetising tiny areas of the disc surface either as N-S or SN; these can be interpreted as “0”s and“1”s. The disc spins in the drive, which has a ‘head’ which moves radially to reach any part of the magnetic surface. Because the memory provided is usually much larger than the processor’s address space, much of a disc is (often) organised as file store, itself an addressable space (using “filenames”). Modem “Modem” is a contraction of “MODulator/DEModulator”; it performs both functions. In this case “modulation” is the transformation of a stream of bits into audible tones which can be sent across a telephone line; demodulation reverses the process. The modem will also be capable of producing the tones for dialling, detecting ‘ringing’ etc. All of these devices require different types of I/O signal. It is the job of the I/O interface to translate the system bus signals to those required by the I/O device. 11 COMP12111 Fundamentals of Computer Engineering School of Computer Science Ports & Peripherals • A “port” is some form of I/O; it can take many forms. • For example: – Parallel port – Serial port • A “peripheral” is a device which interfaces a port to the computer. • Usually the peripheral ‘maps’ the port into an area of memory. 12 Ports & Peripherals Parallel ports The simplest form of interface is the parallel port. This port could be an output port or an input port. A simple output port will appear just like a memory location (to the CPU) however the contents of the memory location will also appear on some physically accessible wires. These wires could then be connected to devices, such as LEDs, the barriers on a car park, etc. The location in question will often be referred to as an output register (because that’s what it is). A simple input port will also appear as a memory location, but in this case there is no actual memory; reading the address will return the values on a set of external signals (switches, buttons, car detection sensors, …). Note that, in this case, the ‘memory’ is volatile, i.e. reading the same location twice may not give the same answer in each case! Ports of either sort will typically be 8-bits wide, even in 32-bit machines. Often – and possibly surprisingly – the scarcest commodity in a computer system is the number of wires available. 
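As a rough illustration of how simple such a port can be, here is a Verilog sketch of an 8-bit output port alongside an 8-bit input port. The module and signal names (sel, wen, port_out and so on) are invented for this example rather than taken from any particular peripheral, and the address decoding that produces sel is assumed to be done by external 'glue' logic.

// Minimal memory-mapped parallel port (illustrative only)
module parallel_port (
    input            clk,
    input            sel,       // this peripheral's address has been decoded
    input            wen,       // bus write strobe
    input      [7:0] wdata,     // data from the CPU
    output     [7:0] rdata,     // data back to the CPU
    output reg [7:0] port_out,  // wires leaving the system (e.g. to LEDs)
    input      [7:0] port_in    // wires entering the system (e.g. switches)
);

  // Output port: an ordinary register whose contents also appear on pins.
  always @ (posedge clk)
    if (sel & wen)
      port_out <= wdata;

  // Input port: no storage at all - a read simply samples the pins,
  // so reading the same 'location' twice may give different answers.
  assign rdata = port_in;

endmodule

Note that even this single, fixed-direction port ties up eight package pins in each direction.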
For this reason many parallel peripherals allow a port to be configured so that the same port wires can be used for input or output. Indeed it is possible to make a bidirectional port by allowing it to be an output at one time and an input at another. This clearly requires some additional information and the peripheral device will contain other registers which do not appear directly but are used for internal programming, such as setting the port direction. The peripheral will therefore need a (small) range of addresses to support a single I/O port. 12 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory Mapped I/O Much of this area will remain unused. A third common form of I/O is the timer. This is not associated with any input or output signals, but it provides input in the form of a timing reference. 13 Serial ports In order to save wires – especially when communicating at a distance – much I/O is done serially. In fact when sending a 1Mbye file it would not occur to anyone to provide 8 million wires in order to transmit it all in parallel, but it could comfortable be sent in 1 million operations each transmitting eight bits (a byte) in parallel. Normally the term serial is used to refer to operations where bits are sent one at a time. This clearly takes more operations than a parallel transmission, but often the operations can be done very quickly; because only a single interface is needed extra money can be spent on speeding it up. Serial interfaces are also highly suitable for transmission by radio, telephone, optic fibre, etc. A CPU is optimised for handling data in (8-, 16-, 32-, 64-bit) parallel; it would be a waste of resource to have it fiddle around shifting single bits around. It is therefore usual to have a peripheral to do this. More details on serial communications will appear later. However a serial peripheral will often contain numerous programming registers to specify its protocols and speed and some status registers indicating when it is ready to transmit or if it has received communication, as well as the registers for communicating the actual data. Eight (or more) registers to support a single serial port are by no means uncommon. 13 COMP12111 Fundamentals of Computer Engineering School of Computer Science Buses A “bus” is a collection of signals which act together. A processor communicates with the rest of the system using its bus. This is an amalgamation of signals comprising: o the address bus o the data bus o the control bus The address bus is output by the processor and specifies the memory (or I/O) location to be transferred.The address bus size dictates the size of the memory map. The data bus – commonly bidirectional – carries the information to/from that location. This is usually the same size as the processor’s internal data paths. The control bus specifies which way the data flows, and when. It may also carry a host of other, specialist information not discussed here. 14 Buses Previously a “bus” has been described as a collection of signals with the same function. A good example is that of the address used to specify a memory location. A 32-bit address requires 32 binary signals to specify the desired location. Usually these signals are lumped together and called the “address bus”. o 16-bits wide in ‘small’ processors/microcontrollers o 32-bits wide in PCs, workstations, etc. o 64-bits wide in the future o Other sizes in other processors (e.g. 
20 bits on the 8086Þ 1Mbyte limit in DOS) Similarly data transfers between the processor and memory will transmit their information across a “data bus”. o Normally the same width as the processors registers/ALU {8-, 16-, 32-, 64- bits} o Sometimes narrower to reduce cost (in pins, wiring, memory devices) o Sometimes wider to increase bandwidth1 (e.g. fetch two instructions in one cycle) These elements are so ubiquitous that an engineer will always recognise: o A[31:0] o D[31:0] (although the widths of the buses may vary) 1. The rate at which data can be transferred across the bus. The bandwidth of a bus can be doubled by a) cycling the bus at twice the speed; b) keeping the same cycle time but doubling the number of bits. 14 COMP12111 Fundamentals of Computer Engineering School of Computer Science Bus Hierarchy CONTROL BUS 32 RD WR Memory 16 ADDRESS BUS PROCESSOR BUS DATA BUS CPU I/O 15 Together address and data buses are insufficient to transfer data to/from memory: at least extra signals to specify the direction and timing of the transfer are needed. Timing is important to make sure setup and hold times for all the devices are met. These signals (with others) form a collection loosely known as the “control bus”. Collectively all these signals are known as the processor bus, or just “the bus”. Expansion bus Many computers will not have their address space(s) filled with memory and I/O If there is spare space it is common to allow access to the signals as an “expansion bus”; this allows the later addition of new devices to the computer. Expansion is facilitated by the adherence to a bus standard, which specifies the interface signals and timing. Many PCs have an expansion bus which is not, precisely, the processor bus (for example it is often slowed down) to allow older I/O cards2 to be used in newer, faster machines. 2. Populated Printed Circuit Boards (PCBs) 15 COMP12111 Fundamentals of Computer Engineering School of Computer Science Processor Design 16 Manchester University has a long history of processor design; the world’s first computer (in the modern sense) was designed and built here in 1948. For details of this, and other early Manchester University computers see: www.computer50.org At that time miniaturisation had not yet begun; the Small Scale Experimental Machine (SSEM) was built using valves and would fill a reasonably sized room (the same machine these days would fit on a pinhead!). However the number of logic gates available was small, so its architecture had to be simple. The SSEM was also experimental, so it was used as the basis of an evolving design which later became the Manchester Mark 1. A replica of the SSEM now resides in Manchester Museum of Science and Industry. 16 COMP12111 Fundamentals of Computer Engineering School of Computer Science Processor Design The CPU is usually the most complex part of a computer system. All other systems depend on the CPU for control. The design of the whole computer is heavily influenced by the architecture of its CPU. The following lectures outline the detailed design of an implementation of a computer CPU. The instruction set is already defined. o How do we perform the detailed logic design of a processor, given an outline block diagram and a specification of its architecture? 17 MU0 MU0 is an abstract machine based on the SSEM. It is a complete processor specification and is quite capable of running useful programs; it is also simple enough to describe a complete implementation down to gate level in a few lectures. 
17 COMP12111 Fundamentals of Computer Engineering School of Computer Science MU0 Instruction Set Architecture A 16 - bit machine o 16 - bit data o 16 - bit instructions o 12 - bit address space Instruction format o -4 bit instruction o 12 - bit operand address 18 MU0 MU0 is a simple model computer. Its architecture is (simplified from, but) similar to the very early Manchester machines, such as the Manchester Mark 1. When beginning to design a new computer the architecture is one of the first things to fix. It is necessary to define the programmer’s view of the system and the instructions which it will execute. The word length (i.e. the ‘width’ or number of bits in the datapath) and size of the address space are also fixed here. When designing a new processor all these issues must be resolved. The word length, addressing range etc. are influenced by cost and available technology. When these have been set the processor’s instruction set and number of internal registers are determined, usually using computer simulations to experiment with the performance of different possible architectures. This sets what is known as the Instruction Set Architecture (ISA). When the ISA is determined the processor can be implemented. This involves the design of the hardware architecture (often called the “microarchitecture”). Processors often go through many different implementations with the same basic ISA (although this changes and grows over time). The direct ancestors of many processors in use today (Pentium, ARM, Coldfire, …) first evolved in the early ’80s; newer implementations have yielded speed increases of >1000X. 18 COMP12111 Fundamentals of Computer Engineering School of Computer Science MU0 Instruction Set Only eight of the sixteen possible operations are implemented. The others are “reserved for future expansion”. 19 In the case of MU0 the ISA is already fixed: MU0 is a 16-bit machine o Memory is 16 bits wide o The internal data paths are 16 bits wide o The instructions are 16 bits wide o The address space is 12 bits long (i.e. 4 Kwords) The instructions are fixed format o 4 bits instruction o 12 bits operand address It has two user visible registers1 o Accumulator (Acc) – the only ‘user’ register o Program Counter (PC) It is a single address machine o One operand is specified in the instruction o Other operands (such as ACC) are implicit in the instruction 1. As shall be seen shortly there can be registers which are not directly accessible via the instruction set. 19 COMP12111 Fundamentals of Computer Engineering School of Computer Science Instruction Execution Sequence Like any CPU, MU0 goes through the three phases of execution: These are repeated indefinitely. In more detail … a) Fetch Instruction from Memory [PC] b) PC = PC + 1 c) Decode Instruction d) Get Operand(s) from: Memory {LDA, ADD, SUB} IR (S) {JMP, JGE, JNE} Acc {STO, ADD, SUB} e) Perform Operation f) Write Result to: Acc {LDA, ADD, SUB} PC {JMP, JGE, JNE} Memory {STO} 20 MU0 Programming Programming example MU0 can be used to write ‘real’ programs; however programming this type of processor can be very tedious! Below is an example of a program to total the numbers in a data table: Loop LDA Total ; Accumulate total Add_instr ADD Table ; Begin at head of table STO Total ; LDA Add_instr ; Change address ADD One ; by modifying instruction! 
STO Add_instr ; LDA Count ; Count iterations SUB One ; Count down to zero STO Count ; JGE Loop ; If >= 0 repeat STP ; Halt execution ; Data definitions Total DEFW 0 ; Total - initially zero One DEFW 1 ; The number one Count DEFW 4 ; Loop counter (loop 5x) Table DEFW 39 ; The numbers to total ... DEFW 25 ; DEFW 4 ; DEFW 98 ; DEFW 17 ; Note: o Much shuttling of data to/from the accumulator (tedious & slow) o Constants (e.g. “One”) need to be preloaded into memory o Self-modifying code needed to index through the data table In particular self-modifying code (where the program alters its own instructions) is normally deprecated. Exercise Rewrite this program using the ARM instruction set used in CS1031. Use registers as appropriate. (Your answer should be considerably shorter!) 20 COMP12111 Fundamentals of Computer Engineering School of Computer Science A Practical MU0 Datapath The next stage is to produce an RTL datapath picture: MEMORY Data Out Address Data In ACC PC IR ALU Having produced a sketch it is necessary to check to see that all the required operations are possible. It is possible to determine all the required ALU functions. The control (such as the decisions about whether to jump or not) is still being neglected at this point. Timing and Control 21 Datapath Design Instructions can be compressed into two cycles of execution. In many cases each phase requires a memory cycle: o Fetch Read instruction o Decode/Execute Read operand/store accumulator We (in some cases, you) can verify the validity of the datapath by testing the different instructions and seeing which buses are used in each cycle. (In this case all instructions are possible.) Try to fill in the data paths for the following instructions. 21 COMP12111 Fundamentals of Computer Engineering School of Computer Science ADD Fetch Data Out Address Decode/Execute Data In IR ACC PC Data Out Address ACC PC ALU Data In IR ALU Timing and Control Timing and Control 22 STO Fetch Decode/Execute JMP 22 COMP12111 Fundamentals of Computer Engineering School of Computer Science Registers In our MU0 there are three registers: ACC, PC, IR Not all of these are visible to the programmer. We will make these from sets of 16 D - type flip - flops. D Q Out15 D Q Out14 D Q Out0 Note that all the control signals are common for the whole register. 23 Register Banks Registers The registers described here are the same as those previously described in the notes on RTL. All flip-flops within a register have a common clock which is the system clock. All registers in the design will use this clock to ensure synchronous operation. Each flip-flop has an individual input. However these can be shared across more than one register. The loading of the register is controlled by a Clock-Enable (CE) signal; if this is active when the system is clocked the register will adopt the input value. By activating the CE signal at the correct time the register can copy (“latch”) the value. Similarly the outputs may feed into a shared bus, providing only (at most) one output is enabled at once for each bus. This is controlled by the Output-Enable (OE) signal. By activating the OE signal at the right time the register can drive its output bus. 23 COMP12111 Fundamentals of Computer Engineering School of Computer Science Register File The register bank or register file is often multiport memory where any register can be connected to any port at any time. 
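Before looking at register files in more detail, here is a minimal Verilog sketch of the kind of clock-enabled register just described; the module name and the names ce, oe and bus_out are chosen for this sketch (in the MU0 design the register enables appear later under names such as AccEn, PCEn and IREn).

// A 16-bit register: sixteen D-type flip-flops sharing one clock.
// The clock enable (ce) and output enable (oe) are common to all bits.
module reg16 (
    input             clk,
    input             ce,      // load the input on the next clock edge
    input             oe,      // drive the shared output bus
    input      [15:0] d,
    output reg [15:0] q,
    output     [15:0] bus_out
);

  always @ (posedge clk)
    if (ce)
      q <= d;                  // latch the input only when enabled

  // Tristate driver onto a shared bus: at most one register's oe
  // may be active at any time.
  assign bus_out = oe ? q : 16'bz;

endmodule

ACC, PC and IR can each be thought of as an instance of something like this.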
24 A modern processor will typically have more than one programmer-accessible register; a typical RISC (Reduced Instruction Set Computer) will have 16 (ARM) or 32 (MIPS) registers, any or all of which can be used to store temporary operands. These registers are normally grouped together in a register bank – also known as a “register file”. A register bank is similar to a memory, although its address size is much smaller; a register bank with 16 registers needs only a 4-bit register address (24 = 16). (In MU0 there is only one register (ACC) and so it can be addressed using zero bits (20 = 1)). However, unlike memory, it is common to be able to perform several operations on the register bank simultaneously; for example an ARM instruction might specify: ADD R1, R2, R3 which requires two read operations and a write operation to be performed at the same time. Any register can be connected to any port at any time, including, for example:ADD R1, R1, R1 How might this be implemented? 24 COMP12111 Fundamentals of Computer Engineering School of Computer Science ALU X ALU Z Y Fn To call it an ALU in MU0 is rather an exaggeration. The instruction set does not provide facilities for performing logical operations (e.g. NOT, AND, XOR etc.) and thus only an arithmetic unit is required. An enhanced version of the machine could include logic operations which could easily be supported. 25 The easiest example of a microprocessor ALU to present here is that of the ARM, as used in COMP15111. The ARM is a 1980s architecture but is still in common use today1. An ALU is an RTL component; it is therefore irrelevant to us how many bits it processes. It will usually have two input buses (let’s call them X and Y) and a single output bus (Z). An adequate number of bits are supplied to specify the function performed on the inputs. In general an ALU will perform both arithmetic and logical functions. Arithmetic functions are typically addition/ subtraction/comparison treating the input buses as numbers. Logical functions are the now-familiar Boolean operations performed by pairing off the bits in the input buses. A subset of the ARM ALU functions is given below: 1. If you own a mobile ’phone you probably carry at least one around with you! 25 COMP12111 Fundamentals of Computer Engineering School of Computer Science MU0 ALU MU0’s ALU must be capable of doing the following:– Z=X+Y (for the ADD instruction). Z=X–Y (for the SUB instruction). Z=X+1 (to allow PC to be incremented after an instruction fetch). Z=Y (for the LDA instruction & to allow the S-field of IR to be sent to the PC for JMP etc.). Other operations might prove useful in an enhanced version of the machine. Each of the operations can be expressed as an addition: X + Y X + -( Y) X + 1 0+Y 26 Note that these functions are directly user accessible. The ALU may also provide other functions within the processor which are used for internal operations. An example above could be a ‘move’ from the ‘A’ bus. The MU0 ALU is much simpler; it does not provide logical functions at all! However there are more functions than just the ADD/SUB visible in the instruction set. Later we will look at how the MU0 ALU can be extended to include some of these and some other functions. 26 COMP12111 Fundamentals of Computer Engineering School of Computer Science Adders Clearly some form of adder is required inside the ALU! The 16 - bit architecture of MU0 requires a 16 bit ALU …… and hence a 16 - bit adder. One way of providing this is to use a Ripple Carry Adder. 
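A sketch of such an adder in Verilog, built structurally as a chain of full adders, might look like the following; the module and signal names are illustrative.

// One-bit full adder: sum and carry out from two data bits and a carry in.
module full_adder (input a, b, cin, output sum, cout);
  assign sum  = a ^ b ^ cin;
  assign cout = (a & b) | (a & cin) | (b & cin);
endmodule

// 16-bit ripple carry adder: sixteen full adders in a chain.
// The carry must 'ripple' from bit 0 up to bit 15.
module ripple_adder16 (
    input  [15:0] a, b,
    input         cin,
    output [15:0] sum,
    output        cout
);
  wire [16:0] carry;
  assign carry[0] = cin;
  assign cout     = carry[16];

  genvar i;
  generate
    for (i = 0; i < 16; i = i + 1) begin : bit_slice
      full_adder fa (.a(a[i]), .b(b[i]), .cin(carry[i]),
                     .sum(sum[i]), .cout(carry[i+1]));
    end
  endgenerate
endmodule

Two characteristics of this structure stand out: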
o Simple – just a string of full adders o Slow – long critical path 27 Adders There are many ways to build a single bit adder; two of these are shown below. The first design (which should, by now, be familiar) comprises two half adders joined into a full adder. The second design is the result of minimising the function and is a 'direct' approach to the logic. Although the first design is slower as a single bit adder (count the gates in the worst case path) the designs are comparable when used in larger adders because the critical path is the carry propagation and the path from Cin to Cout is 2 gates in both designs. When several single bit adders are wired together the time for an addition is always dominated by the speed with which the carries can be generated. This is because, under certain circumstances (which?) the carry into the most significant bit depends on the data into the least significant bit, which is many gates away. 27 COMP12111 Fundamentals of Computer Engineering School of Computer Science ALU Structure Although the ALU is based on an adder this is not all that it does. The input buses have some preconditioning function applied first. o The X bus can be zeroed o The Y bus can be zeroed or inverted 28 All the necessary functions can be supported by additions, providing the input buses are conditioned as follows: Note that, for example, "X-Y" is now expressed as "X+(-Y)". Some of these functions are relatively easy; others are harder. For example producing a value of one on the Y input is awkward if only because the bit values are dissimilar (i.e. binary 0000000000000001). However a general purpose adder also has a carry input to the least significant bit. If we consider this, things become easier: 28 COMP12111 Fundamentals of Computer Engineering School of Computer Science General ALUs In a more general ALU it is often useful to be able to provide: o True data – the data as supplied o Complement data – the data with all bits inverted (NOT) o Zero – all data bits zero o One – all data bits one 29 The carry in is a single bit value which adopts the appropriate value. Now the input buses are transformed by bitwise operations (the -Y of the previous table has gone too). Note the transformation: X - Y = X + (-Y) = X + (~Y + 1) = X + ~Y + 1 Exercise Prove this to yourself by working through the following examples. In each case note that a "carry out" is generated (and ignored). 29 COMP12111 Fundamentals of Computer Engineering School of Computer Science Operand Select Logic The inputs to the 16-bit adder are X'[15:0] and Y'[15:0]. The outputs are Z15 - Z0.

module xprecon(sx, x, xprime);
  output [15:0] xprime;
  input  [15:0] x;
  input         sx;
  assign xprime = sx ? x : 0;
endmodule

module yprecon(sy, siy, y, yprime);
  output [15:0] yprime;
  input  [15:0] y;
  input         sy, siy;
  assign yprime = sy ? (siy ? y : ~y) : 16'h0001;
endmodule

[Multiplexer diagrams: xprime selected from X[15:0] or 0000h by sx; yprime selected from Y[15:0], its inverse or 0001h by sy and siy] 30 Alternative select logic could be as shown below: What would the Verilog code look like to produce these on synthesis? 30 COMP12111 Fundamentals of Computer Engineering School of Computer Science ALU Function Decoder The three function control bits have been assigned in a way that allows relatively easy decoding in the ALU. o The choice of coding is arbitrary. o There are 24 possible choices. o Any choice can be decoded. o However some choices simplify the logic. o Finding a good solution takes intuition & practice. 31 Function Decoder Design The ALU control bits could be generated directly from the decoder.
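As a concrete (if simplified) picture of what those control bits actually drive, here is one possible arrangement of the whole conditioning-plus-adder path in Verilog. The control inputs sx, zy, iy and cin are individual wires chosen for this sketch; they follow the idea of the xprecon/yprecon blocks above but need not match the encoding finally chosen in the lectures.

// One possible MU0 ALU: operand conditioning in front of a single adder.
module mu0_alu (
    input  [15:0] x, y,
    input         sx,    // 1: pass X,  0: force X to zero
    input         zy,    // 1: force Y to zero
    input         iy,    // 1: invert Y (bitwise NOT)
    input         cin,   // carry into the least significant bit
    output [15:0] z
);
  wire [15:0] xp = sx ? x : 16'h0000;
  wire [15:0] yp = zy ? 16'h0000 : (iy ? ~y : y);

  // The four required functions then become:
  //   Z = X + Y     sx=1 zy=0 iy=0 cin=0   (ADD)
  //   Z = X - Y     sx=1 zy=0 iy=1 cin=1   (SUB, as X + ~Y + 1)
  //   Z = X + 1     sx=1 zy=1 iy=0 cin=1   (PC increment)
  //   Z = Y         sx=0 zy=0 iy=0 cin=0   (LDA / 'move')
  assign z = xp + yp + cin;
endmodule

In this sketch the four control wires are simply supplied 'raw', one signal per choice, exactly as a decoder might produce them.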
However, providing that it is inexpensive, it is sensible to compress the function code into the fewest possible bits. In this design we require four ALU functions, so this requires a 2-bit function code. The mapping of an ALU function code to the ALU function is quite arbitrary; however sensible choices can simplify the logic design, as the two assignments below attempt to show. There is no easy way to find the ‘best’ assignment in this form of logic optimization; practice is the only way. In a design such as this simple inspection can reveal some of the optimisations. For example SY and Cin are the inverse of each other and split 50/50 between “0”s and “1”s; matching these to one of the input bit therefore gives half the decoder outputs for the price of one inverter … We will choose the right hand side code because it produces simpler logic. 31 COMP12111 Fundamentals of Computer Engineering School of Computer Science Controlling the processor o o We have considered the design of most of the datapath of the very simple computer, MU0. We now consider the logic to control the sequence of actions necessary in the execution of an instruction. Data Out Address X_MUX ACC PC ALU Data In ADDR_MUX IR Note all registers have clock and enable signals and the multiplexers have select lines. Also the ALU has function select inputs. Y_MUX These all need to be provided by the timing and control circuit. Timing and Control 32 Processor Control and Sequencing Using our generic registers, each register has two signals, CE and CLK. We need not consider the clock (CLK) here because it is distributed to all registers in the same way. We also have a two bit code to generate to specify the ALU function. There are also some signals to control the memory which are not shown explicitly. For control purposes the memory can be regarded as just another latch which can be told to store or output a value. Which value it stores/outputs is controlled by the address, so this is not a control problem. Note that the Instruction Register (IR) is always enabled. We have added tristate buffers to control its access to the “Y” bus. The lower 12 bits are ‘padded’ with four zero bits whereas the top four bits (the S field) are fed into the Timing/Control unit. Summary of Control Signals Required o Address source control Asel o Clock enables for registers AccEn, PCEn, IREn o ALU operation selectors M[1:0] o X-bus source Xsel o Y-bus source Ysel o Memory control signals Wen, Ren The control signals have been given (reasonably appropriate) abbreviated names. 32 COMP12111 Fundamentals of Computer Engineering School of Computer Science Status & Decisions The ability to make decisions according to their calculated state distinguishes computers from simple automata. A computer is capable of the action: IF <condition> THEN <something> Although it may be hidden beneath layers of language syntax in almost all processors this is implemented as a conditional branch. This diverts the processor to code (program) in another part of the memory IF its condition is fulfilled. 33 Flags MU0 evaluates conditions purely on the state of its accumulator. Some other processors work in this way, using a ‘condition’ evaluated and stored in a register. Others, such as ARM, evaluate and store the results of comparisons in a separate condition code register. These results are usually known as “flags” and typically represent the result of the last ALU operation. 
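As a rough sketch of how such flags might be produced (this is illustrative, not the ARM's actual circuit, and all the names are invented), a condition code register can simply capture a few properties of the value leaving the adder:

// Illustrative condition-code (flag) register for a 32-bit ALU.
// a and b are the adder operands, result its output, carry_out the
// carry from the most significant bit.
module flags (
    input         clk,
    input         set_flags,   // this instruction wants the flags updated
    input  [31:0] a, b, result,
    input         carry_out,
    output reg    n, z, c, v   // Negative, Zero, Carry, oVerflow
);
  always @ (posedge clk)
    if (set_flags) begin
      n <= result[31];                                // sign bit of the result
      z <= (result == 32'd0);                         // every bit zero
      c <= carry_out;                                 // carry out of bit 31
      v <= (a[31] == b[31]) && (result[31] != a[31]); // signed overflow, for an addition
    end
endmodule

The flag register therefore just records properties of the most recent result.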
They are independent of the destination of the result and, indeed, it is usually possible to affect the flags without any other destination. For example the “CMP” (CoMPare) operation will perform a subtraction but throws the result away. Common flags include: o Sign (Negative) o Zero o Carry o Overflow o Parity The ARM contains the first four of these. 33 COMP12111 Fundamentals of Computer Engineering School of Computer Science Conditional Branching MU0 has two conditional branches: • JGE Jump if Acc is positive • JNE Jump if Acc is not zero which test properties of the Accumulator • Positive if the most significant bit is zero • Non - zero if any bit is a “1” (OR of all bits) NB In Verilog can be performed using the reduction form of the logic operator e.g. Z <= |Acc; //performs bit by bit OR on Acc from left to right If a specific test is required an operation such as SUB can be performed first to ‘compare’ with a known value. 34 Sign A copy of the most significant bit of the result. Set for a (two’s complement) negative result. Zero Set if the result was zero, otherwise clear. Carry Used to store the carry out of the most significant bit of the adder; in a 32-bit processor this would be the 33rd bit of the result of an addition. If the addition was two unsigned numbers the carry will be set if the result was too big to represent in the word length. The carry can also be used as an input to further additions, thus a 32-bit processor can perform a 64-bit addition by adding the two lower (less significant) words and then adding the higher (more significant) words together with the carry. Subtraction can also be done following similar rules. Overflow The overflow flag will be set if a two’s complement operation produced a result which was not representable, for example if adding two numbers produced an answer so large that the sign bit was set producing an (apparently) negative result. Note that this applies to signed numbers only; the carry flag performs a similar function. The CPU is not aware of whether the programmer thinks numbers are signed or not. It therefore will evaluate both carry and overflow and allow the program to use one or the other. Parity Every word will have a number of “0” and a number of “1” bits. If a word has an even number of “1”s it is said to have even parity; if not is has odd parity. One or other of these states is sometimes indicated by the state of a flag bit. Parity is primarily used for detecting errors in transmitted date where a bit may have been corrupted (“dropped”); any single bit change in a word changes its parity. 34 COMP12111 Fundamentals of Computer Engineering School of Computer Science Description of Operations o All instructions execute in two cycles o The instruction fetch is common to all operations 35 Possible Control Sequences All the possible instruction execution sequences are summarised in the slide, opposite. A key to the meaning of the various functions is given here. This picture can be regarded as a state diagram, although it contains more information. All instructions (except STP) execute in two (clock) cycles: the first fetches the instruction and increments the PC, the second decodes and executes the instruction itself. This leads to a very simple, two state FSM. Let’s label these states “fetch” and “execute”. 
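In Verilog the state element itself might be nothing more than the sketch below; the names are illustrative, and the only subtlety (anticipating the description that follows) is that an STP instruction simply stays in the execute state.

// Minimal MU0 sequencer state: alternate between fetch (0) and execute (1).
module sequencer (
    input      clk, reset,
    input      is_stp,   // decode logic indicates the current instruction is STP
    output reg state     // 0 = fetch, 1 = execute
);
  always @ (posedge clk or posedge reset)
    if (reset)
      state <= 1'b0;                  // start by fetching an instruction
    else if (state == 1'b0)
      state <= 1'b1;                  // a fetch is always followed by an execute
    else
      state <= is_stp ? 1'b1 : 1'b0;  // STP stays put; anything else fetches again
endmodule

The control signals are then just combinational functions of this state bit and the fetched instruction (plus, for the conditional jumps, the accumulator).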
If the processor is in the "fetch" state it performs an instruction fetch (a memory read from the address in the PC with the data being placed in IR); it also increments the PC so the next instruction will be fetched from a different address. It does this irrespective of what might happen next. (If the instruction is a JMP the PC increment will be wasted, but the processor doesn't know that yet!). It then moves, inevitably, to the "execute" state. When the processor is in the "execute" state its behaviour is influenced by the "F" field of the fetched instruction; it follows one of eight possible paths. Unless it has encountered an "STP" it will then return to fetching the next instruction. When executing instructions other than STP the control signals are all derived from the "F" field with the exception of the PC enable. Here this may be influenced by the contents of the Accumulator to allow conditional branches. Notice that all branches behave in the same way except for the decision to latch the new PC value or not; this will simplify the logic by reducing the number of cases which need to be designed. This picture can be translated into a state transition table which includes all the control signals. 35 COMP12111 Fundamentals of Computer Engineering School of Computer Science State Transition Table

state  F[2:0]  Next state  IREn  PCEn  AccEn  M[1:0]  Xsel  Ysel  Asel  Ren  Wen
  0     xxx        1         1     1     0      10     1     x     0     1    0
  1     000        0         0     0     1      00     0     0     1     1    0
  1     001        0         0     0     0      xx     0     x     1     0    1
  1     010        0         0     0     1      01     0     0     1     1    0
  1     011        0         0     0     1      11     0     0     1     1    0
  1     100        0         0     1     0      00     0     1     x     0    0
  1     101        0         0     N     0      00     0     1     x     0    0
  1     110        0         0     Z     0      00     0     1     x     0    0
  1     111        1         0     0     0      xx     0     x     x     0    0

36 The state transition table describes the operation of the state machine that controls the various control lines to the multiplexers and registers. The inputs are the instruction codes and the state and the outputs are the next state and the control lines (IREn, PCEn, AccEn, M[1:0], Xsel, Ysel, Asel, Ren and Wen). Thus we can design the logic required to provide the correct transition to the next state and the correct outputs. Notes: o N and Z are the Negative and Zero state of the Accumulator, respectively (used to reduce the size of the table, as drawn) o If a value is not going to be latched it doesn't matter what it is! (e.g. ALU output for STO) o STP operates by remaining in its evaluation state. Observations: o Many control bits are trivial to derive (e.g. IREn = State) o "Don't cares" give added freedom (e.g. Asel = State, Ysel = F[2]) o In conditional jumps {JGE, JNE} the jump target is always available (for simplicity) 36 COMP12111 Fundamentals of Computer Engineering School of Computer Science FSM Implementation Firstly define the combinatorial control logic. [Block diagram: inputs state and IR[15:12]; outputs Asel, AccEn, PCEn, IREn, M[1:0], Xsel, Ysel, Ren, Wen]

always @ (state, pc, ir)
  if (state == 0)
    begin
      Asel = 0;   // sel pc
      .
      .
      Ren = 1;
      Wen = 0;
    end
  else
    begin         // state must be 1
      Ren = 0;
      Wen = 0;
      // now control depends on instruction
      case (ir[15:12])
        0: begin  // LDA
             Ren = 1;
             .
        etc.

37 We can implement the state machine using a Verilog always block as shown. This block is only triggered if the state, pc or ir change. 37 COMP12111 Fundamentals of Computer Engineering School of Computer Science State Transition [Diagram: the state flip-flop (D, clk, Q, CLR) with halt and reset inputs] always @ (posedge clk or posedge reset) if (reset) begin pc <= 12'h000; // set pc to zero state <= 0; // start with fetch end else case (state) 0: begin // fetch state Ren <= 1; . .
end 1: begin // decode/execute state case(ir[15:12]); //action depends on instruction 0: begin Ren <=1; . . end …… 38 To manage the state transition on system clock we use another always block triggered by a positive clock transition or a reset. 38 COMP12111 Fundamentals of Computer Engineering School of Computer Science Timing An important aspect of this design is that it is fully synchronous. All state changes happen ‘at the same time’ in: o Registers {PC, Acc, IR} o The controlling FSM o Memory – more of which in a later lecture No state changes at any times other than an active clock edge. For example the various control signals begin to be calculated when the IR is latched and have a complete clock period to settle before they are used. This allowed the analysis of the system in a static manner. Assumptions o The clock distribution is good enough that the signal arrives ‘simultaneously’ at every flip - flop o The clock is slow enough to accommodate the slowest possible set of logic changes 39 Timing Our MU0 implementation fetches, decodes and executes instructions at a rate of two clock cycles per instruction (2 CPI). The majority of these cycles include a memory operation. The clock is a regular square wave; its period is set by the worst case critical path. In order to find this the operation of each cycle should be examined. Some examples are given below, although only the major operations {memory, ALU} are accounted for – bus switching times are ignored for simplicity. The time taken to decode the instruction is also neglected here because the instruction set is so small/simple; note that in a ‘real’ modern machine this is definitely not the case! Instruction fetch A memory cycle is performed with the result routed to IR. An ALU cycle is also performed, but this is in parallel with the memory operation. The critical path for this cycle will be whichever is the longer time. ADD execution This clearly requires an ALU operation (the addition), but it first requires one of its operands to be fetched from memory. The critical path is therefore the sum of the memory and the ALU cycle times. STO execution Only a memory cycle is performed. JMP execution The S field of the IR is transferred (via the ALU) to the PC; this can be counted as an ALU operation. The memory is not used here. From this (incomplete) analysis it appears that the critical path is the sum of the memory and ALU cycle times. This would be used to set the clock period. The unused time in other operations would be wasted. 39 COMP12111 Fundamentals of Computer Engineering School of Computer Science Fetch MEMORY 0C0 00F1 Data Out Address 0X_MUX1 ACC PC Data In ADDR_MUX 1 0 IR Memory 0C0: 00F1 //LD 0F1H : : 0F1: 0C50 // data 0C50H Y_MUX 1 0 ALU Timing and Control 40 40 COMP12111 Fundamentals of Computer Engineering School of Computer Science Decode/Execute MEMORY 0F1 0C50 Data Out Address 0X_MUX1 ACC PC Data In ADDR_MUX 1 0 IR Memory 0C0: 00F1 //LD 0F1H : : 0F1: 0C50 // data 0C50H Y_MUX 1 0 ALU Timing and Control 41 41 COMP12111 Fundamentals of Computer Engineering School of Computer Science Timing An important aspect of this design is that it is fully synchronous. All state changes happen ‘at the same time’ in: o Registers {PC, Acc, IR} o The controlling FSM o Memory – more of which in a later lecture No state changes at any times other than an active clock edge. For example the various control signals begin to be calculated when the IR is latched and have a complete clock period to settle before they are used. 
This allowed the analysis of the system in a static manner. Assumptions o The clock distribution is good enough that the signal arrives ‘simultaneously’ at every flip - flop o The clock is slow enough to accommodate the slowest possible set of logic changes 42 Timing Our MU0 implementation fetches, decodes and executes instructions at a rate of two clock cycles per instruction (2 CPI). The majority of these cycles include a memory operation. The clock is a regular square wave; its period is set by the worst case critical path. In order to find this the operation of each cycle should be examined. Some examples are given below, although only the major operations {memory, ALU} are accounted for – bus switching times are ignored for simplicity. The time taken to decode the instruction is also neglected here because the instruction set is so small/simple; note that in a ‘real’ modern machine this is definitely not the case! Instruction fetch A memory cycle is performed with the result routed to IR. An ALU cycle is also performed, but this is in parallel with the memory operation. The critical path for this cycle will be whichever is the longer time. ADD execution This clearly requires an ALU operation (the addition), but it first requires one of its operands to be fetched from memory. The critical path is therefore the sum of the memory and the ALU cycle times. STO execution Only a memory cycle is performed. JMP execution The S field of the IR is transferred (via the ALU) to the PC; this can be counted as an ALU operation. The memory is not used here. From this (incomplete) analysis it appears that the critical path is the sum of the memory and ALU cycle times. This would be used to set the clock period. The unused time in other operations would be wasted. 42 COMP12111 Fundamentals of Computer Engineering School of Computer Science Optimisations How do we make our computer go faster? o Improve the technology Make smaller transistors and put more on a chip o Improve the implementation Speed up the clock by shrinking the critical path o Change the architecture Restructure the design to do more in a given period This leads to: o Faster clock o Fewer clocks per instruction (CPI) The following slides introduce some examples of these techniques. 43 Optimisations Technology Since the introduction of integrated circuits (circa 1970) the size of the manufactured features (transistors, wires, etc.) has been shrinking steadily. Reduced feature size leads to larger number of components on a device and faster operation of those components. The (empirically derived) Moore’s Law observes that the number of components available (and the overall processing speed available) approximately doubles every 18 months. This is equivalent to a 10x improvement every 5 years, or about 1 000 000 improvement from 1970 to 2000. Implementation A computer engineer has little influence on where the technology leads. However it is important both to exploit the available technology and to design efficient circuits with short critical paths. The implementation will specify such things as the type of latches, flip-flops and registers used, the internal design of the ALU, the type of multiplexers etc. Note also that implementations of a given function (such as a processor instruction set) may be optimises towards different goals. For example a high-speed implementation may be different from a low-power one. Architecture The architecture of the computer is where the designer has, perhaps, the most impact in its success. 
The architecture includes all aspects of the hardware from the instruction set design to the RTL layout of the blocks. At RTL – which is the aspect which concerns us most here – the objective is to achieve maximum unit occupancy so that all parts of the system are kept as busy as possible. This is often achieved through parallelism. An example of parallelism already introduced here is the MU0 instruction fetch operation, where the PC is used as a memory address and is incremented at the same time. These operations could be done in series, but then two cycles would be required for every instruction fetch, considerably slowing the processor’s operation. In general adding parallelism increases performance. Often, however, extra resources (buses, multiplexers, registers, functional units, …) are required and the cost can outweigh the benefit. 43 COMP12111 Fundamentals of Computer Engineering School of Computer Science Reducing CPI One method of speeding up the processor is to reduce the average number of clocks per instruction. There are improvement opportunities in (for example) JMP … Data Out Address Data In IR ACC PC ALU Note: o Uses existing datapaths o Requires more complex control/sequencing o Requires an additional ALU operation (Y + 1) Timing and Control 44 Reducing CPI In a simple processor such as MU0 there are not many methods of speeding up the design. However there are some … For example the memory is not used when executing a JMP. Old Fetch Data Out Address ACC PC Jump Data In IR ALU Data Out Address ACC PC Data In IR ALU Timing Timing and and Control Control Timing Timing and and Control Control 44 New Fetch/Jump Data Out Address ACC PC Data In IR ALU Timing Timing and and Control Control This reduces the number of cycles for a JMP instruction to one, thus reducing the average number of CPI. Similar optimisations can be applied to conditional jumps, with different behaviours depending on whether the jump is/is not taken. The disadvantage of this method is that it makes the control and sequencing more complicated. Also note that another ALU operation is required: the value sent to the PC is not the JMP destination (which is already being used) – it is the following address – thus an extra increment is required using the other ALU input. (The ‘move’ operation is still needed for LDA.) Note that this particular modification can be made solely by changing the control logic; the RTL picture is the same as before. 45 COMP12111 Fundamentals of Computer Engineering School of Computer Science Go Faster … Data Out Address Data In In our MU0 the critical path includes both the memory and the ALU. Adding a register (Din) can break the critical path (roughly) in half. IR DIN ACC PC o The cost is an extra clock cycle. o The benefit is that the clock can be (nearly) twice as fast. An ADD takes three cycles (instead of two) at ~twice the speed. ALU Timing and Control 46 MU0 Timing Analysis As a simplification let’s assume that the only blocks that impose significant delays are the memory and the ALU. This is a reasonable approximation for this design. Furthermore let us apply some values to these, say: o The memory (read or write) takes 10ns o The ALU (any function) takes 8ns By examining all the possible cycles the critical path, and hence the clock speed, can be determined. 
o Instruction fetch uses memory and ALU in parallel, therefore requires 10ns
o LDA/ADD/SUB use memory and ALU in series, requiring a total of 18ns
o STO uses only memory and so requires 10ns
o JMP uses only the ALU (original architecture) for a critical path of 8ns; the 'improved' architecture has a parallel memory cycle, increasing this to 10ns
o JGE/JNE are analogous to JMP, or faster if the jump is not taken

The worst case is therefore 18ns, which sets the clock period (the frequency would therefore be about 56 MHz). Executing an ADD instruction requires two clock cycles or 36ns (fetch, then execute) as the clock is a constant frequency. Note that a lot of time is wasted in some cycles!

46

A Faster Implementation
If an extra latch (Din) was added to the RTL design adjacent to the IR, the operand read from memory could be stored temporarily on its way to the ALU (see slide). This would require an extra cycle to execute an ADD operation {instruction fetch, operand fetch (to Din), ADD (from Din)}, which doesn't sound sensible in terms of accelerating execution! However the critical path is now reduced to the memory cycle time (10ns) so the clock can run faster (100 MHz). The ADD operation as a whole can therefore complete in 30ns – roughly a 17% reduction in execution time. Furthermore not all operations require operands from memory, so a STO (for example) could still be done in two cycles or 20ns – nearly twice as fast as before!

Although not possible with the existing architecture, with some additions it would be possible to execute an LDA in two cycles too; what added buses/multiplexers would be required? The disadvantage with such modifications is, of course, added complexity (and, hence, development cost).

Reducing Execution Time
Although it is beneficial to make instructions go faster, what a user wants is for a program to go faster; this is not quite the same thing. The program is made up of instructions in some mixture; for instance LDAs might be more common than JMPs (or vice versa). In reducing execution time it is therefore more important to optimise the more common operations. In this context "more common" means those encountered most dynamically (i.e. as the program runs) rather than statically (counting through a program listing). This, of course, varies considerably, depending on the application program …

47 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Implementation
Consider the full adder: the carry output is either a copy of the carry input or it is independent of the carry input (and therefore available at once).

48

Consider a two-bit adder: note here that Cout can 'ripple' across two bit positions at a time.

48 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Look-Ahead Carry
(Almost) as soon as the A and B inputs arrive it can be predicted that Cout will be:
o Zero (carry "killed")
o Cin (carry "propagated")
o One (carry "generated")
This can be extended across more than one bit (see notes). This scheme is called carry look-ahead.

49

Furthermore this reasoning can be applied recursively to bigger and bigger blocks … Verify that, in this circuit, the maximum logic depth is six 'blocks'. What would the maximum depth be for a 16-bit adder? As the carry has fewer logic blocks (therefore gates) to negotiate it can reach the more significant bits more rapidly (the benefits improve as the width increases).
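To make the generate/propagate/kill idea concrete, here is a minimal Verilog sketch of a 4-bit carry look-ahead adder. The module name, width and signal names are illustrative assumptions, not part of the MU0 design; each carry is computed directly from the generate (g) and propagate (p) terms rather than rippling through the previous stage.

module cla4(input [3:0] a, b, input cin, output [3:0] sum, output cout);
  wire [3:0] g = a & b;   // generate: a carry is produced regardless of cin
  wire [3:0] p = a ^ b;   // propagate: an incoming carry is passed on
  wire [4:0] c;
  assign c[0] = cin;
  assign c[1] = g[0] | (p[0] & c[0]);
  assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
  assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
  assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c[0]);
  assign sum  = p ^ c[3:0];   // same sum logic as four 1-bit full adders
  assign cout = c[4];
endmodule

Every carry here is a two-level AND-OR function of the operands and cin, so the logic depth no longer grows linearly with the adder width (at the cost of wider gates).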
This reduces the critical path and makes the whole adder faster …

49 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Parallelism
Parallelism involves executing more than one instruction at once. It is not the purpose of this course to discuss parallel computer architectures. Suffice it to say that there are two possible means of achieving parallel execution:
o Starting one instruction before the previous one is complete ("pipelining")
o Starting several instructions at the same time ("superscalar" & multi-processor)

50

Parallelism
Parallelism is something exploited extensively and at all levels in hardware design. For example adding carry look-ahead logic increases the number of gates in the adder but these decrease the overall addition time because more gates are switching in parallel. We have also applied parallelism at RTL by incrementing the PC in parallel with the instruction fetch and, later, fetching an instruction while still executing a JMP. However usually the word "parallelism" is applied more explicitly at the architectural level.

A naive example of this would be using two complete processors to go "twice as fast". This may work if we have two independent tasks but, in general, we haven't. A system with twice the cost would therefore give less than twice the performance. Much of the art of designing parallel systems is finding a 'sensible' balance between the hardware investment and the performance return. Two commonly employed techniques are given below.

50 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Parallelism cont.

51

Pipelining
A common analogue of a processor pipeline is the process of washing clothes. When there are several loads of washing, the second load can go into the washing machine as the first load goes into the drier. This means that two loads can be at different stages of 'processing' at the same time – an example of parallelism. This is a relatively 'cheap' solution if you were going to use both machines anyway. However notice that our MU0 uses the same hardware for both fetching and executing instructions (e.g. the same ALU increments the PC and ADDs the data) and could not be pipelined without adding extra hardware; it is more like a combined 'washer-drier'!

Multiple issue
By adding extra hardware it is possible to execute more than one instruction at once. With two decoders and two ALUs two instructions may be fetched and decoded together. This, potentially, doubles the processor speed for roughly twice the hardware cost. In practice things are not so simple because it is not always possible to issue two instructions concurrently1; for instance if the 'first' instruction was a JMP then the other instruction would be wasted anyway. There can also be dependencies where the second instruction needs the result from the first and therefore has to wait (and hardware has to be added to detect this). Trying to issue two instructions at once therefore gives less than twice the speed at more than twice the cost. Nevertheless attempting to issue two, four, or even more instructions together is quite common in high-performance processors.

1. "Concurrently" – at the same time.

51 COMP12111 Fundamentals of Computer Engineering School of Computer Science

ALU Enhancements
o MU0 has simple instructions involving only ADD and SUBTRACT arithmetic operations.
o The range of operations could easily be expanded to include Boolean logic operations, shift operations and extra arithmetic operations such as multiply.
(Division is too complicated for consideration on this course.)
o The ALU would require more control bits (e.g. M[3:0])
Such operations could use some of the spare instruction codes.

52

ALU Enhancements
Bitwise Logic Operations
These are Boolean operations applied to each of the bits of the two values presented to the ALU. Operand bits are paired with others at the same position (significance) in the words, hence the expression "bitwise" operation. Unlike addition, which propagates a carry, each set of bits is independent. e.g. the AND operation, for all values of i between 0 and 15, would yield: Z-bus[i] = X-bus[i] & Y-bus[i]

For example an AND operation:
    0011 0101 1001 1110
AND 0101 0110 1111 0010
  = 0001 0100 1001 0010

Relatively simple changes to the ALU are needed to implement AND, OR, XOR etc. These are normally done by selecting different functions using a multiplexer.

Signal Preconditioning
It is usual to locate the logic functions after the input preconditioning (the T/C and 0/1 selection logic). This allows (for example) operations such as the ARM's Bit Clear (BIC) instruction: Result = A AND NOT(B). This can also provide alternative codings for operations such as MOV, which may simplify the decode logic. For example in the initial MU0 design a move operation was coded as 0+Y; it could also be coded as 0 OR Y, -1 AND Y, etc. (remembering -1 = 1111111111111111).

52 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Bit-Wise Logic Operations
These are easy to add to an ALU

53

A possible optimisation
The bit-wise XOR function could be implemented by disabling the carry between the 1-bit full adders. The "SUM" outputs will then be the XOR of the X and Y bits presented to the adder. Review the full adder circuit to see how this could be done.

53 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Shift Operations
In decimal it is very easy to multiply or divide numbers by ten.
e.g. 123 x 10 = 1230
The above operation has shifted the input left by one place.
In binary it is very easy to multiply or divide numbers by two.
e.g. 11011010₂ ÷ 10₂ = 01101101₂
The above operation has shifted the input right by one place.

54

Shift Operations
What are shift operations?
Shift operations are movements of the bits within a word, for example a shift left by one place. A left shift moves all the bits to a more significant position; thus left shifting a number by one place is equivalent to multiplication by two. Similarly shifting left two places is multiplication by four (assuming no bits are 'lost' at the most significant end). Contrariwise shifting right is equivalent to dividing by powers of two.

When shifting left it is normal to fill the 'vacant' position(s) in the least significant bit(s) with zero(s). When shifting right this rule can also be obeyed with the most significant bits; this will divide correctly (subject to remainders) for positive or unsigned numbers, but not for two's complement negative numbers where shifting a zero in makes the number positive. To avoid this there are often two forms of right shift provided:
o Logical shift right (LSR) – shift in zero
o Arithmetic shift right (ASR) – shift in copies of the existing MSB
A couple of minutes running through a couple of examples of left and the various right shifts is probably time well spent!
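As a concrete illustration (a minimal sketch only – the module and signal names are invented for the example), single-place shifts and a rotate can be written in Verilog purely as rewiring with concatenation:

module shift1(input [15:0] a,
              output [15:0] lsl, lsr, asr, ror);
  assign lsl = {a[14:0], 1'b0};    // logical shift left: multiply by 2, zero into LSB
  assign lsr = {1'b0, a[15:1]};    // logical shift right: zero into MSB
  assign asr = {a[15], a[15:1]};   // arithmetic shift right: copy the sign bit
  assign ror = {a[0], a[15:1]};    // rotate right: the 'lost' bit wraps back in
endmodule

For example, asr applied to 1111 1111 1111 0110 (-10) gives 1111 1111 1111 1011 (-5), whereas lsr would give a large positive number.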
54 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Practical Shift Operations
In practice there are some limitations due to operands being finite:
o a left shift loses its Most Significant Bit (MSB) – the answer will be 'wrong' if this bit was "1"
o a right shift loses its Least Significant Bit (LSB) – the answer will be 'wrong' if this bit was "1", i.e. if the number was odd.
As well as losing a bit, a bit must be shifted in:
o Left shifts shift in zero
o Right shifts either shift in zero or copy the MSB – the latter case preserves the two's complement sign bit
o A rotation (either way) shifts the 'lost bit' back in

55

When are they useful?
Primarily in multiplication and division algorithms. They are also used in graphics, cryptography – in fact lots of "bit fiddling" operations.

What else can I do?
Rotation is another common 'shift' operation. Rotating left and right is just like shifting except the bits that 'fall off' one end of the operation 'wrap back' onto the other end.

Barrel shifting
Many processors provide shift operations which move the bits only one place per instruction. A "barrel" shifter is a device which shifts bits an arbitrary number of places. The ARM processor used in CS1031 has a barrel shifter.

55 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Implementing Shifts
It is usual to treat a shift as an ALU-type function. This can be a single operand (one place) shift or shift a specified number of places (two operands). A shift is merely a rearrangement of the bits; it requires no logic! A single place shift can be done purely with multiplexers.
o The input selection on the multiplexers is common to all; these are decoded from the selected function
o In practice these may be part of a larger multiplexer (see slide on bitwise operations)
o The inputs at each 'end' of the row are wired appropriately for the selected shift

56

Implementing Shifts
The slide shows part of the 'middle' of a one place left or right shifter. Clearly there are some 'dangling' connections at each end. The bit shifted left into the LSB is always "0". The bit shifted right into the MSB is a "0" for logical shift right, but for arithmetic shifts the MSB or sign bit is copied here as well as into bit 14.

Shifts of more than a single bit position are also possible. This is sometimes done by the control logic repeating a one-place shift the correct number of times, a solution which requires little extra hardware but takes several cycles to complete. A true barrel shifter can shift any number of places in a single cycle. This requires considerable extra hardware because all the multiplexers get much larger (the wiring is more complex too) and so it exacts a significant cost1.

1. In practice there are methods of implementing such multiplexers more cheaply, but they are still costly.

56 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Shift Registers
[Slide diagram: a chain of D-type flip-flops (bits A15, A14, A13, …) with a multiplexer on each D input]
Note registers can be parallel loaded or implement a shift depending on the state of the multiplexers.
A shift can be implemented in Verilog using concatenation { }:
A <= {C, A[15:1]};    //Shift right A
Q <= {A[0], Q[15:1]}; //Shift right Q

57

Shift registers
A shift register is similar to a 'normal' register, which always inputs and outputs all its bits in parallel. A shift register has the added feature of being able to move bits in or out one at a time, typically by 'shuffling' all the bits one way so that one 'falls off the end' on each successive cycle.
Of course an extra control bit is also needed to enable this function. Although it would be possible to use one, shift operations are not normally implemented using shift registers; a shift register contains more circuitry than an ordinary register and is therefore more expensive. While not significant in our MU0 this would be a considerable overhead on all the registers in a processor like an ARM. Shift registers are primarily used in interfacing to I/O devices; some examples will appear later in the course. They are occasionally useful in other tasks, such as multiplication and division.

57 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Multiplication
Multiplication is a more complex arithmetic operation than addition.
Multiplication is repeated addition:
o To multiply two numbers, N and M, start with zero and add M to it N times
This works, but is very slow for big numbers.
A short cut – long multiplication

58

An algorithm
1. Start with zero in an accumulator
2. Make X the least significant digit of N
3. Multiply M by X – add the result to the accumulator
4. Multiply M by 10
5. Make X the next least significant digit of N
6. If unfinished then repeat from step 3.
7. Done

Worked example – 237 x 863 (M = 237, N = 863):

Accumulator   X   M
000000        3   000237   (start; X = least significant digit of N)
000711        3   000237   (accumulator + 237 x 3)
000711        6   002370   (M x 10; X = next digit of N)
014931        6   002370   (accumulator + 2370 x 6)
014931        8   023700   (M x 10; X = next digit of N)
204531        8   023700   (accumulator + 23700 x 8 = result)

Points to note
o This loops three times (the number of digits in N), not 863 times (the value of N).
o Only multiplications by a single digit (or by ten) are required. Multiplying by 10 is trivial.
o Only an addition of two numbers (new partial product and accumulator) is needed in any one step.
o The result has more digits than either of the operands (in general as many digits as both the operands combined, or six in this case).

Binary multiplication
The same algorithm can be used for binary multiplication. The only differences are:
o The digits are single bits.
o Multiplication is only ever by 0 (easy) or 1 (also easy) – the multiplication outcome is therefore either 0 or M.
o M is multiplied by 2 rather than 10 – this is a one place left shift (and trivial). (Equivalently, a hardware implementation can keep M fixed and shift the partial product right instead.)
Note also that X can remain the least significant bit of N if N is right-shifted at each step.
The algorithm now becomes
1. Start with zero in an accumulator
2. Make X the least significant bit of N
3. Multiply M by X – add the result to the accumulator
4. Multiply M by 2 (shift left one place)
5. Make X the next least significant bit of N
6. If unfinished then repeat from step 3.
7. Done

58 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Long Multiplication
This multiplication algorithm is (in general) much faster. No doubt it is already familiar! The same algorithm can be applied to binary numbers. With binary digits the only multiplications needed are x0 and x1. This is easy since x0 gives 0 and x1 leaves the original value – so we need to add in 0 or the original multiplicand depending on the value of the digit we are looking at.

59

More Aspects of Multiplication
Termination
The loop described above iterates for every digit (bit) in the N operand. Thus a counter can be used to control and terminate the operation. However using the procedure described for the binary operation N is continually being divided by two (and integer division discards any fractions) and so will eventually become zero. This can be used to indicate completion because any subsequent cycles can only ever add zero to the total – and we might as well not bother.
This is sometimes known as early termination because, in many cases, fewer cycles are performed than there are bits in the operands. This gives shorter multiplication times as well as easier control (no counter needed).

Modulo arithmetic
A general case addition of two 32-bit operands can require up to 33 bits to hold its result, because of the carry out. This is true whether the addition is unsigned or two's complement. A general case multiplication of two 32-bit operands can require up to 64 bits to hold its result. Often such results are mapped back into a register (variable) of the same width as the operands. This results in modulo arithmetic, where bits may be 'lost' off the left-hand (most significant) end of the number. In the examples above it is quite likely that the operations would be performed modulo 2³². This is the same as dividing the result by 2³² and keeping the remainder, hence the name.

Negative numbers
The multiplication algorithm described only works with positive or unsigned numbers. A simple extension to cope with signed numbers would be to convert the operands to positive numbers and then make the result negative if the operands had different signs. This is the normal method with decimal numbers. However if modulo arithmetic truncates the result to the same length as the operands then the algorithm will work anyway. Any potential errors occur in the high-order bits which are truncated. (Try it!) If the full answer is required this 'trick' can be exploited by first sign extending1 the operands to the full word length and truncating at the end.

1. Extending the number to the left with copies of the existing MSB, so the sign bit is preserved.

59 COMP12111 Fundamentals of Computer Engineering School of Computer Science

A Sequential Multiplier
The implementation of this multiplication algorithm may be done in software or hardware. A hardware implementation is shown here. A software implementation may be coded simply by following the steps outlined in the notes.
[Block diagram: multiplicand register B and register A feeding the ADDER; the product is assembled in {Carry out, A, Q}, with the multiplier loaded into Q and its least significant bit Q0 tested; a counter P (initialised to 'n') with zero-detect output Z; an FSM controller with start input S and a Done output]

60

Other multipliers
The algorithm described is not the only way to build a multiplier. A number of other schemes employing the same basic 'shift and add' approach exist, but different operands may be shifted in different directions. You may meet a different implementation in a reference book; however the principle will be the same.

60 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Multiplier
This is a serial multiplier:
o A number of steps (clocks) are performed
o Only one adder is required
Simple FSM used for control. Almost a processor datapath in itself!
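Before the full RTL listing which follows, the shift-and-add algorithm can also be expressed behaviourally. The sketch below is illustrative only (the module and signal names are invented, and it describes the algorithm rather than the clocked datapath/FSM implementation that comes next):

module mult_behav(input [7:0] m, n, output reg [15:0] product);
  integer i;
  always @(*) begin
    product = 16'd0;
    for (i = 0; i < 8; i = i + 1)
      if (n[i])                               // examine each bit of the multiplier
        product = product + ({8'd0, m} << i); // add the suitably shifted multiplicand
  end
endmodule

A quick check: with m = 23 and n = 19 the loop adds 23, 46 and 368, giving product = 437.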
61

Verilog listing for 8x8 bit multiplier

//HDL COMP 10211 Multiply Example
//--------------------------------------
//RTL description of binary multiplier
//Block diagram in notes
//n = 8 to halt after all bits done
module mltp(S,CLK,Clr,Binput,Qinput,C,A,Q,P,Done);
  input        S,CLK,Clr;
  input  [7:0] Binput,Qinput;   //Data inputs
  output       C, Done;
  output [7:0] A,Q;
  output [3:0] P;
  //System registers
  reg       C, Done;
  reg [7:0] A,Q,B;
  reg [3:0] P;
  reg [1:0] pstate, nstate;     //control register
  parameter T0=2'b00, T1=2'b01, T2=2'b10, T3=2'b11;
  //Combinational circuit
  wire Z;
  assign Z = ~|P;               //Check for zero
  //State transition for control
  //See state diagram in notes
  always @(negedge CLK or negedge Clr)
    if (~Clr) pstate <= T0;
    else      pstate <= nstate;
  always @(S or Z or pstate)
    case (pstate)
      T0: if (S) nstate = T1; else nstate = T0;
      T1: nstate = T2;
      T2: nstate = T3;
      T3: if (Z) nstate = T0; else nstate = T2;
    endcase
  //Register transfer operations
  //See register operation Fig.8-15(b)
  always @(negedge CLK) begin
    Done <= Z;                  // Z = 1 when done (counter has reached zero)
    case (pstate)
      T0: B <= Binput;          //Input multiplicand
      T1: begin
            A <= 8'b00000000;
            C <= 1'b0;
            P <= 4'b1000;       //Initialize counter to n=8
            Q <= Qinput;        //Input multiplier
            Done <= 1'b0;       // Not Done
          end
      T2: begin
            P <= P - 4'b0001;   //Decrement counter
            if (Q[0]) {C,A} <= A + B;   //Add multiplicand
          end
      T3: begin
            C <= 1'b0;          //Clear C
            A <= {C,A[7:1]};    //Shift right A
            Q <= {A[0],Q[7:1]}; //Shift right Q
          end
    endcase
  end
endmodule

61

// Testbench for HDL Multiply COMP 10211
//---------------------------------------
//Testing binary multiplier
module test_mltp;
  //Inputs for multiplier
  reg       S,CLK,Clr;
  reg [7:0] Binput,Qinput;
  //Data for display
  wire       C;
  wire [7:0] A,Q;
  wire [3:0] P;
  wire       Done;
  //Instantiate multiplier
  mltp mp(S,CLK,Clr,Binput,Qinput,C,A,Q,P,Done);
  initial
    begin
      S=0; CLK=0; Clr=0;
      #5 S=1; Clr=1;
      Binput = 8'b00010111;     // multiplicand = 23
      Qinput = 8'b00010011;     // multiplier = 19; expected product 23 x 19 = 437
      #15 S = 0;
    end
  initial
    begin
      repeat (46)
        #5 CLK = ~CLK;
    end
endmodule

62 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Processor Design – a Summary
• A processor can be quite simple to design
  – an entire processor can be described down to gate level in a few lectures
• A processor has a datapath which does the processing
  – an RTL (Register Transfer Level) design
  – many (16-, 32-, …) bits wide, but regular structures
  – the datapath may account for 90%+ of the gates
  – therefore it is designed and optimised first
• The datapath needs control logic
  – an FSM (Finite State Machine)
  – the control provides steering and timing for the datapath
  – relatively few gates, but more complex structures
• All CPUs are built this way – it's just that the instruction set gets bigger and the number of optimisations increases.

63

Multiplication Exercises
Have a go at these:
0010 x 0010
0110 x 0101
0011 x 1110 (unsigned)
0011 x 1110 (signed)

63 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory
"Thanks for the memories"

64

Memory
When a computer program is operating it needs to keep its data somewhere. There may be some registers (such as "Acc" in MU0) but these are not usually enough and a larger memory (or "store") is required. The computer program itself must also be kept somewhere.

The earliest programmable devices were weaving looms. In 1804 Joseph Marie Jacquard invented the Jacquard Loom in Lyon. This used a string of punched cards as a program; a binary hole/no hole system allowed complex patterns to be woven with the cards being advanced as the loom ran.
Many later devices have also used this concept, notably the pianola or player piano (1895).

64 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory
• The CPU is the most complex unit in a computer.
• The memory is the largest.
Producing an adequately sized, adequately fast memory has always been a serious challenge in computer design.
The Manchester Small Scale Experimental Machine (SSEM) (1948) or 'Baby' was the world's first stored program computer. This was built specifically as a test for the memory devices.
A stored program computer is one where the program resides within the memory and therefore can also be treated as data. This means the memory has a shared function:
• It contains data
• It contains programs
and memory cycles are allocated to each function. This is often known as the von Neumann architecture.

65

The stored program concept
The concept of a 'stored program' is attributed to John von Neumann. Put simply it says: "Instructions can be represented by numbers and stored in the same way as data."
Thus a bit pattern 01000101 might represent the number 45₁₆ (69 in decimal) or the ASCII code for the letter "E" as data, but it could also be used to tell a processor to perform a multiplication. This has led to the so-called "von Neumann" architecture which is followed by almost all modern computers, where a single memory holds values which can be interpreted as data or as instructions by the processor.
Whilst it is rare that the same memory locations are used as instructions and data it does happen. The most notable case is when a program is loaded and executed: the loader fetches words from an I/O device (e.g. disc) which it treats as data and puts into memory; the same values are interpreted as instructions when execution starts.

65 COMP12111 Fundamentals of Computer Engineering School of Computer Science

The von Neumann Architecture
One memory suits all requirements. This is the model we already assumed for our MU0 processor.

66

Johann ("John") von Neumann (1903-1957)
• 1903: born, Budapest, 28th December, son of a banking family.
• 1910: could divide 8-digit numbers in his head.
• 1921: entered University of Budapest to study Chemistry. Published first paper.
• 1928: completed doctorate in Mathematics.
• 1930: moved to Princeton.
• 1933: one of six original professors when the Institute for Advanced Study (IAS) at Princeton was founded. (Alan Turing studied mathematics here 1936-8.)
• Engaged in war work on several national committees.
• 1945: wrote "The First Draft of a Report on the EDVAC", which introduced the stored program concept.
• 1945+: worked with Los Alamos on H-bomb issues.
• 1950s: consultant to IBM.
• 1952: designed MANIAC I.
• 1954: appointed to the U.S. Atomic Energy Commission.
• 1956: won the Enrico Fermi Award for outstanding contributions to the theory and design of electronic computers.
• 1957: died 8th February.

66 COMP12111 Fundamentals of Computer Engineering School of Computer Science

von Neumann Architecture
Note that here:
• the term "memory" is applied to the store which is directly addressable by the processor
• other forms of store (such as discs) are not considered
This is (currently) the most common computer architecture. PCs, workstations, etc. all work with this model. (Detailed implementations vary though!)
Note that there are other architectures, e.g. the "Harvard Architecture" where data and instruction storage are separated.

67

RAM
The address space of a computer such as our MU0 will normally contain Random Access Memory or RAM.
RAM is memory where any location can be used at any time. MU0 has a 12-bit address bus and so can address up to 4 Kwords of memory, each word being 16 bits wide. As this is a small memory by modern standards it is likely (now) that all the words would be implemented (although a few locations must be reserved for I/O or there would be no way to communicate with the computer). Back in 1948 4 Kwords (64 Kbits) would have seemed a very large memory which would require many memory devices to fill. By the end of the 20th century the largest RAM devices reached 256 Mbits, so one device could provide for 4000 MU0s!

To come more up to date we shall use the ARM address model instead. ARM produces byte addresses and has a 32-bit address space, which allows the addressing of 2³² separate bytes. However as instructions and most data are 32 bits (4 bytes) wide it is normal to read or write four bytes in parallel. We will therefore regard the ARM as having a 30-bit address space (the last 2 bits can specify one of the four bytes). Thirty address bits allow the addressing of 2³⁰ separate words or 1 Gword. This is larger than contemporary devices (256 Mbits ⇒ 8 Mwords); it is therefore necessary to be able to map several memory devices into the address space. Not all these devices may be fitted in every system; whether the memory space is 'fully populated' or not depends on the needs and the budget of the owner.

Definitions and usage
o RAM – Random Access Memory; by convention (& slightly incorrectly) used for memory which is readable and writeable. Most modern RAM 'forgets' when the power is turned off.
o ROM – Read Only Memory; usually a variation of RAM which cannot be written to; used to hold fixed programs. As it cannot be written to its contents must be permanent.
In addition there are forms of serial access memory (such as magnetic tape, disc etc.).

67 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory Devices – in Principle
Basically a memory is a large number of flip-flops. Here four 4-bit memory words are shown as a bank of registers. The address is used either:
o to enable the data into a specific register, or
o to output enable a specific register
With appropriate (external) control the data bus could be shared and bidirectional.

68

Addressing
Within the CPU it is common for several things to happen in parallel; the memory only performs one operation at once. This operation requires the answers to the questions:
• Do what? – Control (read or write)
• With what? – Data
• Where? – Address
Because only one operation is happening at a time the control signals and the data bus can be shared over the whole memory. The address bus provides a code to specify which location is being used ("addressed").

Some definitions:
• Byte – now standardised as eight bits.
• Word – the 'natural' size of operands, which varies from processor to processor (16 bits in MU0, 32 bits in ARM). Usually the width of the data bus.
• Nibble – four bits or half a byte (sometimes "nybble").
• Width – the number of bits in a bus, register or other RTL block.
• Address range – the number of elements which can be addressed.
• Type – what the data represents. This is really a software concept in that the hardware (usually) does not care whether a word is to be interpreted as an instruction, an integer, a 'float', an address (pointer) etc. This may, however, influence the size of the transfer (byte, word, etc.).
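The bank-of-registers memory on the slide (described in more detail below) can be sketched in Verilog. This is a minimal illustration only – the module name, the 4 x 4-bit size and the single shared bidirectional data bus are assumptions for the example, not a description of any particular device:

module reg_bank_mem(
    input        clk,
    input  [1:0] addr,          // selects one of four words
    input        wr_en, rd_en,  // write enable / output (read) enable
    inout  [3:0] data);         // shared bidirectional data bus
  reg [3:0] word [0:3];         // four 4-bit words built from flip-flops
  // Write: the addressed word captures the bus on the clock edge
  always @(posedge clk)
    if (wr_en) word[addr] <= data;
  // Read: a tristate buffer drives the bus only when reading
  assign data = rd_en ? word[addr] : 4'bz;
endmodule

The indexing implied by word[addr] is exactly the 1-of-N select discussed under address decoding below.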
The figure shows part of a memory; four words of four bits each are depicted (although the decoders imply that another four words are omitted). The bits in each word are stacked vertically; note that the write enables and the read enables (to the tristate buffers) are common across each word. The words can be made as wide as required in this way. The width of the memory is normally the same as the width of the CPU's datapath, but it may not always be so; for example some high-performance processors use wider memory so that they can fetch two (four, …) instructions simultaneously.

Questions: What are the advantages of fetching several instructions in a single cycle? What are the disadvantages?

68 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Tri-State Devices and Bidirectional Busses
Tri-state devices have 3 states – '0', '1' and 'off' – and can replace multiplexers.
[Slide diagrams: a basic tri-state buffer (In, Out, En); two buffers with enables EnA and EnB which can select the A or B input onto a shared bus wire (replaces a MUX); a bidirectional bus controlled by EnRead and EnWrite]

69

Tristate signals
Tristate signals and gates are introduced here. These will be referred to in the following lectures. Tristate signals are used as a convenient method of controlling and switching buses. At this point it is enough to know that they exist. As a good, general rule tristate signals should not be used in control circuits.

The switching of a tristate output is digitally controlled – another input signal is used as an enable. If the enable is true then the output is enabled ('on'), if the enable is false then the output is disabled ('off'). The enable is usually drawn entering the side of the gate so that it can easily be distinguished. The enable may be active high or active low; an active low signal is usually drawn with a 'bubble' on the connection.

A 'normal' buffer does nothing to the logic signal; the output is always the same as the input. Such buffers are used to match electrical properties of the circuits and are an implementation issue which does not concern this course.

Unlike 'normal' outputs, tristate outputs may be connected together. In general a signal should be driven, so tristate outputs are used for multiplexing two or more signals together. No more than one tristate output should be enabled onto a net at any time. The usual designation for the third state of an output is "high-impedance" or simply "tristate"; it is usually abbreviated to "Z".

69 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Address Decoding
The CPU 'addresses' (talks to) one memory location at a time. It has a large number to choose from. It specifies which location with a single number on the address bus. Eventually this number must be coded into a true/false select for every possible location. Either all selects are false, or one select is true and all the others are false.

70

Address Decoding
An address is coded as a binary number to minimise the number of bits/wires required. The memory requires a word select as a "1-of-N" code. The conversion is performed by an address decoder.

70 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Decoders
The address selects which output may become active. The enable(s) allow that output to be active.
o Multiple enables are usually ANDed together

71

A simple three to eight decoder described in Verilog:

module three_to_eight(addr_in, enable, sel_out);
  input  [2:0] addr_in;
  input        enable;
  output [7:0] sel_out;
  wire   [7:0] sel_out;

  // nested conditional operator (?:) used here
  assign sel_out = enable ? (
                   (addr_in == 0) ? 8'b00000001 :
                   (addr_in == 1) ? 8'b00000010 :
                   (addr_in == 2) ? 8'b00000100 :
                   (addr_in == 3) ? 8'b00001000 :
                   (addr_in == 4) ? 8'b00010000 :
                   (addr_in == 5) ? 8'b00100000 :
                   (addr_in == 6) ? 8'b01000000 :
                                    8'b10000000 ) : 0;
endmodule

71 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Decoder in Verilog

module three_to_eight(addr_in, enable, sel_out);
  input  [2:0] addr_in;
  input        enable;
  output [7:0] sel_out;
  reg    [7:0] sel_out;

  always @ (addr_in or enable)
    if (enable)
      case (addr_in)
        0: sel_out = 8'b00000001;
        1: sel_out = 8'b00000010;
        2: sel_out = 8'b00000100;
        3: sel_out = 8'b00001000;
        4: sel_out = 8'b00010000;
        5: sel_out = 8'b00100000;
        6: sel_out = 8'b01000000;
        7: sel_out = 8'b10000000;
      endcase
    else
      sel_out = 0;
endmodule

72

Address Decoding
It is often both inconvenient and impractical to decode the entire address bus in a single decoder. Instead a hierarchical approach is used: here the first decoder is used to enable one of the next set of decoders to give a 6-to-64 decoder (not all of which is shown). This can be extended further if required.

In practice the decoders need to be very large, but the last stage of decoding (which could be decoding around 20 address lines!) is built into the memory device. The designer only needs to produce the equivalent of the first level of the address decoder which selects which memory device is active. This is described in more detail later.

72 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Real Memory Devices
Using edge-triggered D-type flip-flops:
o is often fine for registers (e.g. 3 or 4 in MU0, <50 in ARM)
o is too expensive for 'main' memory (millions of locations)
Memories use special design techniques to squash as many bits together as possible.
The 'densest' RAM currently in use is Dynamic RAM or DRAM – this has a number of awkward characteristics.
The 'easiest' RAM currently in use is Static RAM or SRAM – fewer bits/chip than DRAM, faster, simpler.
We will therefore only examine SRAM in detail here!

73

Commodity Memories
All von Neumann computers need memory. Sometimes their needs are small – an embedded controller operating a central heating system probably needs only a few bytes of RAM – but others need many megabytes. Even a heating controller may need a kilobyte or so of program memory. Small memories (a few Kbytes) are often constructed on the same chip as the processor, I/O etc. Large memories will need one or more separate, dedicated devices.

The figure below illustrates why D-type flip-flops are not used for mass storage! In practice both the SRAM and DRAM need other circuits (such as amplifiers) to interface them to computational circuits. However the overhead is small because a few amplifiers can be shared by many thousands of bits of store. A bit of ROM will be roughly the same size as a bit of DRAM or SRAM, depending on the technology employed.

When building a system the cost is related to the number of silicon chips and their size. Thus if D-type flip-flops were used the memory which could be implemented at a given price would be much smaller (i.e. fewer bits). Cost is extremely important in system design! The reason several types of memory exist is that the cost trade-offs vary according to the system requirement. For example DRAM is the most area efficient but it is slower than SRAM, requires more support logic, and can be more expensive for memories below a certain size.
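Before looking at a real device, the externally visible behaviour of a simple SRAM can be sketched in Verilog. This is a simulation-only sketch under stated assumptions – the module name, parameters and the purely level-sensitive write are illustrative, not the datasheet behaviour of any particular part:

module sram_model #(parameter AW = 19, DW = 8) (  // 512K x 8 by default
    input  [AW-1:0] addr,
    inout  [DW-1:0] data,                         // bidirectional data pins
    input           cs_n, we_n, oe_n);            // all control signals active low
  reg [DW-1:0] store [0:(1<<AW)-1];
  // Read: drive the data pins only when selected, output-enabled and not writing
  assign data = (!cs_n && !oe_n && we_n) ? store[addr] : {DW{1'bz}};
  // Write: behaves like a transparent latch while both CS and WE are low
  always @(*)
    if (!cs_n && !we_n) store[addr] = data;
endmodule

Note how the effective write 'strobe' is the AND of the (active low) chip select and write enable, and how the data pins go high-impedance whenever the chip is deselected.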
73 COMP12111 Fundamentals of Computer Engineering School of Computer Science

A Real Memory Device
The figure illustrates one of the packaging options of a typical commodity SRAM chip.
[Diagram: 512K x 8 SRAM package pinout, showing the 'Power' and 'Ground' pins and the Write Enable, Output Enable and Chip Select control pins]

74

Using Memory Chips
The memory device shown is a 628512. This is a 4 Mbit SRAM chip (memory sizes are normally quoted in bits) organised as 512 Kwords of 8 bits each. It therefore requires nineteen address lines and eight data lines; together with its power supplies {Vdd, Vss} and three control signals these occupy all the pins on a 32-pin DIL (Dual In-Line) package.

The following table defines the memory chip's behaviour:

CS | WE | OE | Operation
H  | X  | X  | Not selected – data pins high-impedance
L  | L  | X  | Write (data pins are inputs)
L  | H  | L  | Read (data pins driven with the addressed contents)
L  | H  | H  | Selected, but data pins high-impedance

Points to note:
o All the control signals are active low
o If the chip is not selected (CS = H), nothing happens
o Write enable overrides read operations
o The data bus is bidirectional (either read or write – saves pins)

The CS signal is used to indicate that this particular device is in use. If a larger memory is required an external decoder can be fitted so that only one memory chip is enabled at once. In this way several memory chips can be wired with all their pins in parallel except for CS.

Observation
Because it uses the same pins for read and write operations it does not actually matter what order the address and data pins are wired in. The user does not care what the address of any particular location is as long as its address does not vary.

74 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Timing
A simple memory device acts as an array of transparent latches. Only the required latch(es) must be enabled when writing. The address must be stable all the time that CS and WE are asserted.

75

Timing
Memory holds state. It can be written to. When writing it is important that the correct data is written to the correct location; it is also important to ensure that no other memory locations are corrupted. In the MU0 model described earlier the memory was controlled by read and write control signals and it was assumed that the processor clock would control state changes. A real memory device often has no clock input!

A simple SRAM has the same timing characteristics as a transparent latch. If the chip is selected (CS=L) and write enabled (WE=L) then the data inputs will be copied into the addressed location. It is important that the address is stable during the write operation; if it is not, other locations may also be affected. There are set-up and hold time requirements for the address and data values around the write cycle. (The set-up time is normally greater to allow for the address to be decoded.)

The actual write strobe is a logical AND of the write enable and chip select signals; both must be active for data to be written. The timing diagram shown above is therefore only one possible approach to strobing the memory. Another approach could use WE as the timing signal. Different processors (& different implementations) encode timing differently. That's okay, as long as it's included somewhere. Note that this is not essential for read operations, because they do not change the state of the memory; it does no harm though.

75 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Wider Memory
A typical 32-bit processor will be able to address different sized data in memory. The memory will usually be built 32 bits wide so that words are fetched efficiently (i.e. in one cycle). The addresses of the memory words are therefore spaced four locations apart {0, 4, 8, C, …}.
Notes:
Addressing a 32-bit quantity at address 00000003 (say) may not work because the bytes are located in different memory words. Some processors (e.g. x86) do allow this, but they need to perform two memory operations, one for each affected word.

76

The Reality of Memory Decoding
In the foregoing it is assumed that each address corresponds to one memory 'location'. In a 'real' memory system this is often not the case. For example an ARM processor can address memory in 32-bit words or 8-bit bytes (or 16-bit "halfwords") and the memory system must be able to support all access sizes.

Addresses are decoded to the minimum addressable size (in this case bytes). Addressing a word requires fewer address bits. Thus the least significant bit used by the address decoder is A[2]; A[1] and A[0] act as byte selects, which will be ignored when performing word-wide operations. Of course the bus must also carry signals to specify the transfer size.

Byte accesses
Notice that when the processor reads word 00000000 it receives data on all its data lines (D[31:0]). When the processor reads byte 00000000 it receives data only on one quarter of the data bus (D[7:0]); furthermore if the processor reads byte 00000001 it uses a different subset of the data bus (D[15:8]). The processor resolves this internally by shifting the byte to the required place (an ARM always moves the byte to the eight least significant bits when loading a register). The same is true when writing quantities of less than a full word – the data must be copied onto the appropriate data lines.

When reading a byte it is possible to 'cheat' by reading an entire word from memory and ignoring the bits that are unwanted. This works because reading memory does not affect its contents. However when writing it is essential that only the byte(s) to be modified receive a WE signal or other bytes in the same word will be corrupted. This would be a Bad Thing.

76 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory Subsystems
Here is part of the memory subsystem for an ARM-based system.

77

The ARM processor:
o has a 32-bit word length
o produces a 32-bit byte address
o can perform read and write operations with 32-, 16- and 8-bit data.
The normal design for the memory system would therefore be a space of 2³⁰ words (byte addressing, remember) of 32 bits each. Let's see how this could be populated, using the RAM chips described above.

The RAMs are 8 bits wide, therefore four devices are required to make a 32-bit word. This then gives 512 Kwords of memory. We can then repeat this arrangement another 2048 (= 2¹¹) times to fill the address space, using the appropriate decoder circuits. Of course the 8192 RAM chips required will be expensive, will occupy a large volume and use a lot of power (thus generating unwanted heat), and it is unlikely that we really need 4 gigabytes of memory!

The usual alternative with a large address space is to make it sparsely populated; this saves on memory chips and also simplifies the decoder circuits. Let's say we need only 1 Mword of RAM, as in the figure above.

77 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Points to Note
o Most signals to the RAM chips are shared
o The total memory space is coarsely divided (top left)
o One address line (A[21]) is used to select between the banks of RAM
o There is some byte selection logic using A[1:0]
o Some address lines are ignored!

78

(In the figure the ability to perform 16-bit 'halfword' transfers has been omitted.)
Here the memory is 32 bits wide, which requires four 8-bit wide chips per 'bank'. Two banks of memory provide 1 Mword.

The two least significant address lines (A[1:0]) are used as a byte select. These are ignored if a word transfer is performed. The next nineteen address lines (i.e. A[20:2]) are connected to all the RAM chips. Note that the signal A[2] will be wired to the pin A0 on the RAMs (and so forth) because the RAM address is the word address, not the byte address; A[1:0] are used as a byte address within the word.

A[21] is used here to select between the two banks of RAM. The last stage of the decoder is shown as explicit gates which drive the individual chip selects. (NAND gates provide for the fact that CS is active low.) The chip selects are the only signals which are distributed on a 'one-per-chip' basis; other signals can be broadcast across many/all devices. This simplifies the wiring on the PCB1.

A[29:22] are ignored!

The RAM region select signal is produced from the most significant address bits; here RAM is selected when A[31:30] = 01. This means RAM occupies addresses 40000000-7FFFFFFF inclusive. The lowest region is reserved for ROM because the ARM starts executing at address 00000000 when the power is switched on.

Exercise: Understand the decoder's operation.

1. Printed Circuit Board

78 COMP12111 Fundamentals of Computer Engineering School of Computer Science

The Memory Map
The memory is divided into areas. Some areas may not contain anything. Ignored address lines mean that memory is aliased into multiple locations.

79

Memory Map Details
The memory map described (which is just one possible example) shows many of the basic properties found in real systems.
o The memory is coarsely divided into areas with different functions.
  – Areas may contain different types of memory or different technologies that run at different speeds. For example the I/O area may be designed to cycle more slowly (i.e. more clock cycles) than the RAM.
  – Some integrated CPU devices may provide such decoders 'on-board'.
o Some areas are left 'blank'.
  – The previously described decoder does not use one of the area selects.
  – Writing to such areas has no effect.
  – Reading from such areas could return any value (i.e. it is undefined).
o Some physical devices can appear at several different addresses.
  – This is due to ignoring some address lines when decoding.
  – Fewer levels of decoding reduces cost and increases speed.
  – This is known as aliasing.
o The I/O space is unlikely to be full.
  – There will be both undecoded and aliased locations.
  – In most cases peripheral devices will be byte-wide so not all bits in the word will be defined. When reading peripherals it is important to mask out the undefined bits.
  – Peripheral devices are sometimes unlike memory in that reading an address will not return the same value that was written to that address.
  – Input ports, by definition, are volatile in that their value can change without processor intervention.

79 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Separate I/O
Use special instructions IN and OUT to refer to I/O.
[Diagram: the memory address map (ROM, RAM) selected when the M/IO line is true, and a separate I/O address map selected when the M/IO line is false]

80

A separate I/O address space
The memory map shown here includes space for ROM, RAM and I/O peripherals. I/O access patterns are somewhat different from memory accesses in that they are much rarer and often come individually (as opposed, for example, to instruction fetches which run in long sequences).
Some processor families, a notable example being the x86 architecture, provide a completely separate address space which is intended for I/O. If this is used it leaves a 'cleaner' address space just for 'true' memory. The programmer can get at this space by using different instructions (e.g. "IN" and "OUT" replace "LOAD" and "STORE") which usually provide only limited addressing modes and, possibly, a smaller address range. The hardware view typically uses the same bus (with an added address line M/IO). The hardware may also slow down bus cycles automatically in the expectation that peripheral devices are slower than memory. Note that the system designer is not compelled to use these spaces in this way!

80 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Little Endian
The bytes in a word are numbered with the least significant having the lowest number (i.e. 0).
The bits in a byte are numbered with the least significant having the lowest number (i.e. 0).

81

Endianness
Generically "endianness" refers to the way sub-elements are numbered within an element, for example the way that bytes are numbered in a word. By convention the bytes-in-a-word definition tends to dominate, thus a "big-endian" processor will typically still number its bits in a little-endian fashion (see slide). This can get pretty confusing. If it's any consolation the numbering schemes used to be worse!

Little endian addressing
Pick a word address, say 00001000, in a 32-bit byte-addressable address space. Let's store a word (say, 12345678) at this address.
Address 1000 contains byte 78
Address 1001 contains byte 56
Address 1002 contains byte 34
Address 1003 contains byte 12
i.e. the least significant byte is at the lowest address. This has the effect that, if displayed as bytes, a memory dump would look like:
00001000  78 56 34 12
i.e. the bytes appear reversed (because higher addresses appear further to the right). If a byte load was performed on the same address the result would be: 00000078

81 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Big Endian
The bytes in a word are numbered with the most significant having the lowest number (i.e. 0).
The bits in a byte are still numbered with the least significant having the lowest number (i.e. 0).
o This is inconsistent, but frequently encountered

82

Big endian addressing
Using the same word address (00001000) for the same word (12345678):
Address 1000 contains byte 12
Address 1001 contains byte 34
Address 1002 contains byte 56
Address 1003 contains byte 78
i.e. the most significant byte is at the lowest address. This has the effect that, if displayed as bytes, a memory dump would look like:
00001000  12 34 56 78
If a byte load was performed on the same address the result would be: 00000012

Choice of endianness
Some processors are designed to be little endian (x86, ARM, …), others to be big endian (68k, MIPS, …). There is no particular rationale behind this. Most modern workstation processors allow their endianness to be programmed at the memory interface.
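Endianness shows up in hardware as the mapping between byte addresses and byte lanes on the data bus. The following is a minimal Verilog sketch (module and signal names invented for the example) of how a little-endian byte load could pick the correct lane from a 32-bit word using A[1:0]:

module byte_lane_le(
    input  [31:0] rdata,     // 32-bit word read from memory
    input  [1:0]  addr_lo,   // A[1:0] of the byte address
    output [7:0]  byte_out); // byte moved to the least significant lane
  assign byte_out = (addr_lo == 2'b00) ? rdata[7:0]   :
                    (addr_lo == 2'b01) ? rdata[15:8]  :
                    (addr_lo == 2'b10) ? rdata[23:16] :
                                         rdata[31:24];
endmodule

A big-endian mapping would simply reverse the order of the four cases; the memory itself is unchanged.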
82 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory Hierarchy
Processors are fast. Programmes are big. Big memories are slow.
A hierarchical memory alleviates some of the penalty:
[Diagram: CPU and Register File at the top, then Levels 1 to 4 below; speed and cost per bit are HIGH at the top of the hierarchy and LOW at the bottom]

83

Memory Hierarchy
Bottom line: for a given price
o big memory = slow memory
o small memory = fast memory
If a programme has to run from 'main' memory it will only run at the speed at which its instructions can be read – maybe 10x slower than the processor can go. However in reality typical programmes show a great deal of locality, i.e. they spend maybe 90% of their time using perhaps only 10% of the code. If the critical 10% of the code is placed in a small, fast memory then the performance of the overall programme can be significantly increased without the expense of filling the address space with fast memory.

This is exploited extensively in high performance systems. Depending on the implementation it may be known as caching or virtual memory1; the principle is the same in each case.

A typical PC will have several levels in its memory hierarchy:
o The internal registers
o An on-chip cache, integrated onto the processor chip (SRAM)
o A much larger secondary cache on the motherboard (SRAM) (sometimes erroneously referred to as "the cache")
o The 'main' memory – usually many megabytes of some cost-effective DRAM
o Some ROM or EEPROM to store information required on power up
o The virtual memory space which is kept on a hard disc (magnetic)
There may be more levels than this though!

1. … and was an invention from the Atlas machine built by this department.

83 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Memory Hierarchy Cont.
Providing that the first level of the memory can keep up with the CPU, full speed is achieved most of the time.
The first level of memory will usually be a cache.
The last level of memory will often be (part of) a hard disc.
The register file can be thought of as Level 0, i.e. the top of the hierarchy.
[Diagram: CPU with register file and cache at the top levels, then RAM and ROM, then disc at the lowest level]

84

Although the principle of locality is used at each level of the hierarchy the process of choosing the "working set" (the elements to store) is often implemented differently: it is sometimes done by hardware and sometimes by software and may be static or dynamic. In the future it is likely that the technology will evolve but it is unlikely that memory hierarchies will disappear.

The Register Bank
Unlike MU0, modern processors usually have a significant number (ARM has 16, MIPS has 32, …) of registers forming a register bank (sometimes called a register 'file'). These registers are used for operands for, and results from, the current set of calculations. Although they are not addressed in the same way as memory, the registers can be regarded as the topmost level of the memory hierarchy (level 0). The management of what is stored in the registers is done 'manually' by the compiler or the programmer directly and is – of course – specified explicitly in the object code.
No one has said that the memory has to be homogeneous; it is quite possible to have memories of different speeds at different addresses. If you can organise things so that the 10% of the address space which is frequently used is in fast memory then you can get startling improvements at relatively small cost.

In some circumstances this is possible. In embedded controllers the software is fixed and the programmer can profile and arrange the code to exploit different memory speeds. In general purpose machines (e.g. PCs) the code is dynamic (a posh way of saying you run lots of different programs) and those programs are designed to run on different machine configurations. Profiling is not a great help here.

A cache memory adapts itself to prevailing conditions by allowing the addresses it occupies to change as the program runs. It relies on:
o Spatial locality – guessing that if an address is used, others nearby are likely to be wanted.
o Temporal locality – guessing that if an address has been used it is likely to be used again in the near future.
Two examples illustrate why this often works:
o Instructions are (usually) fetched from successive addresses and loops repeat sections of code many times.
o Many data are held on a stack which uses a fairly small area of RAM repeatedly.
Suffice it to say this works very well. In 'typical' code a cache will probably satisfy ~99% of memory transactions. Detailed cache design is beyond the scope of this course; further information can be found in books such as Clements.

85 COMP12111 Fundamentals of Computer Engineering School of Computer Science

Cache Properties
The cache intercepts many memory references and services them quickly. If we knew in advance which memory locations were going to be busy this would be easy. Caches adapt dynamically to the changing needs of the system.

86

Cache Hierarchies
Caches work so well that it is now common practice to have a cache of the cache. This introduces several levels of cache or a cache hierarchy. The first level (or "L1") cache will be integrated with the processor silicon ("on-chip"). There will be a second level of cache ("L2"); this may be on the PCB, on the CPU chip or somewhere in between such as an integrated processor module. Further cache levels are also possible; "L3" is increasingly common in high-performance systems.

86 COMP12111 Fundamentals of Computer Engineering School of Computer Science

A Different Architecture
The von Neumann architecture has a single memory shared between code and data. The Harvard architecture separates instruction and data memories.

87

Harvard Architecture
The term "Harvard architecture" is normally used for stored program computers which separate instruction and data buses. This separation may apply to the entire memory architecture (as shown on the slide) or may be limited to the cache architecture (below).
Bandwidth is the quantity of data (number of bits) which can be transferred in a given time. In a von Neumann architecture instruction fetches and data references share the same bus and so compete for resources. In a Harvard architecture there is no competition so instruction fetches and data reads/writes can take place in parallel; this means that the overall processing speed is increased. The disadvantages of Harvard architecture are: the available memory is pre-divided into code and data areas; in a von Neumann machine the memory can be allocated differently according to the needs of a particular program it is hard/impossible for the code to modify itself (not often a problem, but can make loading programs difficult!) more wiring (pins, etc.) Note: with a Harvard architecture the main memory may be completely divided in two. The parts need not have the same width or address range. For example a processor could have 32- bit wide data memory and 24-bit wide instruction memory. Many DSPs (Digital Signal Processors) have more ‘unusual’ Harvard architectures. 88 COMP12111 Fundamentals of Computer Engineering School of Computer Science Read-Only Memory (ROM) Interface similar to SRAM Extra pins for programming (write) support 89 Read-Only Memory (ROM) ROMs are usually random-access memory devices. They use a similar IC technology to RAMs, with lower cost/bit than RAM. They are: read-only which means their contents cannot be corrupted by ‘accidents’ such as bugs or crashes. non-volatile (i.e. they retain their information when power is removed). Uses ‘Bootstrap’ programs. ‘Fixed’ operating system and application code. Logic functions (e.g. microcode, finite state machines). 89 COMP12111 Fundamentals of Computer Engineering School of Computer Science Types of ROM Mask programmed ROMs are programmed during chip manufacture. PROMs are ‘Programmable’ after manufacture, using programming equipment. EPROMs are Erasable and Programmable (usually by exposure to strong ultraviolet light). EEPROMs are Electrically Erasable. 90 Types of ROM Mask programmed ROMs are programmed during chip manufacture. ¾ Cheap for large quantities. ¾ Used in ASIC1 applications PROMs are ‘Programmable’ after manufacture, using programming equipment. ¾ Each individual IC is separately programmed (a manual operation). ¾ Contents cannot be changed after programming. EPROMs are Erasable and Programmable (usually by exposure to strong ultraviolet light). ¾ A technology in decline. EEPROMs are Electrically Erasable. ¾ Currently one of the most popular ROM technologies. ¾ Many can be altered ‘in-circuit’, i.e. without removal from the PCB. ¾ They differ from RAM in that they require considerable time to alter a location (writes take >100x the read time). ¾ Many devices also require ‘bulk’ erasure so that all or a large portion of the chip is ‘blanked’ before new values can be written. ¾ Widely used for non-volatile store in consumer applications such as telephones, TV remote controls, digital cameras et al. ¾ “Flash Memory” falls into this category. 90 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory Technology The challenge in building computer memory is to achieve: maximum density adequate speed minimum cost Bonus points are awarded if the technology is: easy to use non - volatile The history of computer memory has been a struggle to find the ‘ideal’ storage at any given technology level. What can be used to store data bits? 
91 Other Memories So far we have treated "memory" as simply the directly addressable memory space (which is the usual interpretation of the term). However there are a number of other storage devices in use in a modern stored program computer. One other form of store is the processor registers. In RISC processors it is usually clear that these form a separate, addressable 'memory' space: e.g. in an ARM "R7" means the "Register store with address 7" (not to be confused with the memory location with address 7). Perhaps more obvious are magnetic storage devices such as discs. The primary function of a disc store is to act as a filing system. In a filing system each file is a separate, addressable entity where the 'address' is the name of the file. File handling is beyond the scope of the processor hardware and is performed by specialist software, usually as part of an operating system. Files may be stored on local discs (i.e. on the machine which is using them) or elsewhere (e.g. on a networked fileserver); this should be transparent to the user. Memory used as file storage has the following characteristics: o Addressed in variable size elements ("files") o Addresses ("filenames") variable length o Address decoding done by software ("filing system") It is possible – with some extra hardware support – to make disc storage 'stand in' for areas of the address space not populated with semiconductor memory. This is a virtual memory scheme and will be described more fully in later courses. Another type of addressable store uses addresses of the form: "http://www.cs.man.ac.uk/" 91 COMP12111 Fundamentals of Computer Engineering School of Computer Science Current Technologies Current favoured memory technologies: SRAM – Static RAM (flip - flops) DRAM – Dynamic RAM (Needs refreshing) Flash Memory – Block erase can be byte or block read Magnetic disc – Hard Disc (serial data block read) Optical disc – DVD or CDROM (serial data block read) 92 Current Memory Technology SRAM fast truly random access relatively expensive per bit DRAM significantly slower than a fast processor faster if addressed in 'bursts' of addresses medium cost per bit Flash Slower than DRAM Block erasable Readable in byte or block form Very cheap per bit in block readable form (e.g. USB pendrive) Magnetic storage very slow (compared to processor speeds) variable in their access times (think of the mechanics involved) read/writeable only in blocks very cheap per bit (e.g. Hard Disk) Optical storage very slow (compared to processor speeds) variable in their access times (think of the mechanics involved) primarily (but not exclusively) read only extremely cheap per bit (e.g. CDROM or DVD) 92 COMP12111 Fundamentals of Computer Engineering School of Computer Science Magnetic Memory Track Sector (Contains block of data) Read/Write Head Discs stacked on a single spindle with separate heads on each surface. Moves in and out across disc surface 93 Magnetic Discs Discs in one form or another should be familiar and need little further description. They come in several forms but can loosely be classified into hard discs which use a metal substrate and flexible ("floppy") discs which use plastic. Hard disc drives often contain several platters on a single spindle, with surfaces on each side of a platter. The heads are linked to the same mechanical structure ('arm'). A set of tracks at the same radius is referred to as a cylinder. Hard discs can store data more densely than floppies because the heads can approach more closely, more reliably; the discs can also rotate faster without distortion.
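This geometry (platters on a spindle, a head per surface, tracks divided into sectors) means a block's 'address' is really a (cylinder, head, sector) triple. The C sketch below shows one conventional way of flattening such a triple into a single block number; the geometry figures are invented for illustration, not taken from any particular drive:

#include <stdio.h>

/* Illustrative disc geometry -- the numbers are invented, not from the notes. */
#define HEADS             4    /* recording surfaces (two per platter)         */
#define SECTORS_PER_TRACK 32   /* sectors on each track                        */

/* Map a (cylinder, head, sector) triple onto a single linear block number.
 * Sectors are conventionally numbered from 1, cylinders and heads from 0.    */
unsigned chs_to_block(unsigned cylinder, unsigned head, unsigned sector)
{
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1);
}

int main(void)
{
    /* Third sector, under head 1, on cylinder 2 */
    printf("block %u\n", chs_to_block(2, 1, 3));   /* (2*4 + 1)*32 + 2 = 290 */
    return 0;
}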
Hard disc heads literally fly over the surface on a thin layer of entrained air. They are enclosed to prevent dust particles disrupting their operation. The price/bit of disc storage is declining rapidly as the density increases. Future disc technologies may become more exotic. For example to store a bit in a very small area the magnetic material needs to be quite 'stiff'; it may then need zapping with a laser to warm and 'soften' it each time it is written. Research is ongoing … Unlike semiconductor RAM the access time of a disc memory depends on its mechanical configuration and will vary depending on circumstances. 93 COMP12111 Fundamentals of Computer Engineering School of Computer Science Optical Memories The most significant memory technology popularised in the 1990s was optical store, as CD - ROM and, later, DVD (Digital Versatile Disc). CDs offer: High bit density Cheap manufacture (read only discs) Interchangeable medium Limited write capability Holographic Uses non - linear optical media Promises very high storage density ¾ Three dimensional storage ¾ Many bits in same volume (different viewing angles and different laser wavelengths) This is an active and ongoing field of research. 94 Optical Memories CD-ROM uses the presence/absence of pits in a foil disc to represent bits. The disc is read with a laser and optical sensor but the transport is otherwise largely similar to a magnetic disc. A CDROM holds up to 650 Mbytes of data. DVD is simply an extension of CD technologies, with smaller (denser) bits. The only significant development is that there are two planes of bits on (each side of) the disc which are separated by 'focus pulling'. A DVD can contain 4.7 Gbytes of data. Other, similar optical storage formats are possible. Particularly attractive are those which dispense with the spinning disc and tracking head (and hence the large motors with their associated power consumption). Instead of moving the medium the laser can be scanned, using a smaller, lighter mechanism. The medium can also be made smaller (e.g. credit card sized). Such optical memory cards are under investigation and development. Being physically smaller than a CDROM, an optical memory card holds 2.8 Mbytes. Holographic storage In theory the storage of data in some sort of transparent 'crystal' could be very space efficient, not least because it offers 3D storage. Such storage was hinted at in, for example, the film "2001: A Space Odyssey" (1968) as the basis of the "HAL 9000" computer. Sadly this prediction proved somewhat optimistic. Another potential of holographic memory is the ease of construction of associative or 'Content Addressable' Memory (CAM). This is used in (for example) parts of cache memories but optical CAM may be useful for more elaborate tasks such as pattern recognition. This is beyond the scope of this course, but it is an active and ongoing field of research. Searching the WWW will give more up-to-date details than can be included here. 94 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory Lane!! Now to look at some of the history of storage technology. 95 Punch Cards, etc. Early references in weaving 1725: M. Bouchon used a pierced band of paper pressed against horizontal wires. 1728: M. Falcon suggested a chain of cards and a square prism instead of the paper. 1745: Jacques de Vaucanson automated the process using pierced paper over a moving, pierced cylinder.
1790: Joseph-Marie Jacquard developed the mechanism which still bears his name. Uses in computing Analytical Engine (1837) Certainly the earliest 'use' of punched cards was in Charles Babbage's (1792-1871) design of the Analytical Engine (a mechanical digital computer). This was never built but was, in most respects, a modern computer architecture with a processor (called the "mill"), memory and I/O. The analytical engine had, in fact, three types of card decks with "operation cards" (instructions), "cards of the variables" (addresses), and "cards of numbers" (immediates). 95 COMP12111 Fundamentals of Computer Engineering School of Computer Science Punch cards and other mechanical storage Jacquard loom (1790) Read mechanically 96 Joseph-Marie Jacquard (1752-1834) 1752: born July 7 in Lyon, France; parents were silk weavers. Tried book-binding, type-founding and cutlery. 1772: father died leaving him inventing and accumulating debts. 1790: developed first loom – release delayed by French Revolution. 1792: joined revolutionists; son killed at his side in defence of Lyons. 1801: revealed ideas for loom. 1803: summoned to Paris to demonstrate machine; given a patent and a medal. 1804: returned to Lyon to run the workhouse (and perfect his machine). 1806: loom declared public property – Jacquard granted annuity & royalties. 1806-10: much opposition from machine breakers; fled Lyon in fear of life. 1812: 11 000 Jacquard looms in use in France. 1834: died Aug. 7 in Oullins, near Lyon. At this time >30 000 Jacquard machines operating in this city. 96 COMP12111 Fundamentals of Computer Engineering School of Computer Science Hollerith cards (1887) Punched physically (slow) Read by electrical contact (or not) 97 Hollerith machine (1884) Not truly 'computing' as much as a counting (& accounting) machine, the Hollerith machine revolutionised record keeping. With information on punched cards a machine could be 'programmed' to count all cards with certain sets of punched holes. This was first used for applications such as the US census in 1890; Hollerith cards were used extensively in early electronic computers and for other systems – they were familiar, everyday objects from the 1950s to the 1970s – and in use for some applications (such as voting) into the late 20th century. In 1928 the standard card size increased from 45 to 80 columns (960 bits). In computing this was adopted as a line of text/program and was used as the width of a Visual Display Unit (VDU). This survives as the 'standard' page width. Herman Hollerith (1860-1929) 1860: born 29 Feb in Buffalo, New York, USA, child of German immigrants. Unable to spell as a schoolboy! 1875: entered the City College of New York. 1879: graduated from the Columbia School of Mines with distinction. 1880: worked on the US census. 1882: joined MIT (Mechanical Engineering); began experimenting with paper tape, then punched cards, read by a wire contacting with mercury and triggering mechanical counters. 1884: moved to U.S. Patent Office – to avoid teaching duties. 1884: applied for his own first (of over 30) patents. 1887: Hollerith Electric Tabulating System tested. 1890: US census saves $5 million and two years' work (pop. 62,622,250). 1896: founded the Tabulating Machine Company, which later became the Computer Tabulating Recording Company (CTR). 1921: retired. 1924: CTR was renamed International Business Machines Corporation (IBM). 1929: died 17 Nov. in Washington D.C., USA, of a heart attack.
97 COMP12111 Fundamentals of Computer Engineering School of Computer Science Paper Tape Read electrically (or optically) Typically seven bits wide 98 Paper tape Paper tape uses the same principle as punch card to store data. It has an advantage in density and is faster to feed through a reader. It is also not as easy to get muddled (or 'hacked') as a deck of cards because it cannot be shuffled. Conversely the ability to edit programs by adding, deleting and substituting punched cards could be very useful. Editing paper tape is difficult. One of these difficulties has left a trace in the ASCII character set where the character 7F (DEL or 'delete') is separated from the other 'control' characters; this code is used because it was represented by all the holes punched out (ASCII is a 7-bit code) and so could be used to overwrite mistakes. Similarly the character 00 (NUL) is used as a 'no operation' in order to allow an indefinite length of unpunched "leader" on the reel of tape. 98 COMP12111 Fundamentals of Computer Engineering School of Computer Science IBM Millipede (1999) Data stored as pits in plastic surface Fixed medium Bit densities 10x magnetic disc (~500Gbits/in2) 99 'Millipede' A possible new technology, using a microscopic punched card; hunt out your own references. 99 COMP12111 Fundamentals of Computer Engineering School of Computer Science Mask programmed ROM Bit values are indicated by the presence or absence of physical wire connections Fixed rather than interchangeable medium Bits programmed in during manufacture Often used on - chip to store secure manufacturer's data 100 ROM/PROM Some ROMs retain data as physical wire connections. In mask programmed ROMs these wires are fixed at manufacture. In the (currently defunct) fuse PROM technology the wires were fuses which – if not required – were overloaded and 'blown' during programming. Older technologies used matrices of diodes on PCBs for a similar effect. 100 COMP12111 Fundamentals of Computer Engineering School of Computer Science Delay lines The principle … 101 Delay lines A delay line is a device which exploits the 'time of flight' of bits in transit. It would be possible, for instance, to do this optically but sound – which travels more slowly – gets more bits into a short space. Delay lines are dynamic store in that data must be read, regenerated and rewritten continuously. Clearly random access is not possible as data can only be read or written as the required 'location' circulates through the electronics. Access to a given 'memory' is, of course, strictly bit-serial. Many early electronic computers – e.g. ENIAC (1946), EDSAC (1949) – used mercury delay lines (or "tanks") as their main store. A typical 5 foot (~1.5 m) delay could hold about 1 Kbit. It was folded up for convenience (rather like a bassoon). Mercury delay lines were originally developed in the 1940s for radar applications. Mercury is a good acoustic conductor but is rather expensive (and heavy). A more convenient system was sought. The solution was the magnetostrictive delay line. Magnetostriction is a stress induced in a magnetostrictive material (such as nickel) when it is subjected to a magnetic field. (A magnetic field is, of course, generated by a flowing electric current.) This was translated into torsional (twisting) waves on a long rod. The process is reversible, so the bit stream can be detected again at the far end and neoprene buffers damp out any excess energy.
As the system runs at high frequency (~1Mbit/s) the 'rod' could really be quite a light wire which could be loosely coiled onto a circuit board. Single lines of up to 100 feet (~30 m) were made which could store up to 10 Kbits. 101 COMP12111 Fundamentals of Computer Engineering School of Computer Science Mercury Delay Line (1940s) ~1.5 m tube Capacity ~1 Kbit Magnetostrictive Delay Line (1960s) Up to 30 m wire Capacity 10 Kbits at ~1MHz bit rate NB. Delay lines do not give random access. 102 102 COMP12111 Fundamentals of Computer Engineering School of Computer Science Electrostatic Memories Williams Tube (1948) Charge stored on a glass/phosphor screen Written by an electron beam Read by displacing charge onto sensor mesh 103 Electrostatic Memories Williams Tube (1948) The Williams Tube (more correctly the Williams-Kilburn Tube) was an early all-electrical storage device developed in Manchester. Its basis is a Cathode Ray Tube (CRT) similar to those used in televisions and computer monitors. Bits were stored as charge patterns on the phosphor screen. In effect some electrical charge was or was not planted at each point on the screen using an electron beam. The bits were read back by displacing these charges with another electron beam which caused a discharge into the screen; the discharge was picked up by a wire mesh across the screen's front. The first Williams tubes could store 2Kbits – perhaps twice the contents of a mercury 'tank'. They offered the added advantage that the data could be viewed by the operator (although a second, parallel tube was needed because the actual store was enclosed). Reading the data was destructive, so it was necessary to regenerate the charge and refresh the display. In any case, as charge tended to leak away, regular refreshing was necessary; the store was therefore 'dynamic'. This was the store technology employed in the Manchester 'Baby', a computer which was really built as a memory test unit. 103 COMP12111 Fundamentals of Computer Engineering School of Computer Science Dynamic RAM (DRAM) (1970s-present) Stores bits as charged/uncharged capacitors Dense and therefore cheap Volatile and needs refreshing 104 Dynamic RAM (DRAM) (1970s-present) Instead of a glass and phosphor screen it is possible to store charge in a large array of capacitors. However making such an array was very expensive until it could be done on a single silicon chip. This is the principle behind DRAM. The capacitors are accessed via a matrix of wires and switches which allow individual capacitors to be charged or discharged. Opening these switches (which are really transistors) isolates the cells. Closing the switches again allows the charge to escape, which can be sensed and amplified as a read operation. Read operations are destructive and therefore any data which are read must be rewritten afterwards. Also the capacitors are not perfect so charge gradually leaks away, therefore periodic refreshing is required – hence the name dynamic RAM. Each bit store comprises one capacitor and one switch (transistor) and these can be made very small. It is therefore possible to fit many megabits on a single chip. This is why DRAM has remained the cost-effective choice for large addressable memories for several decades. DRAM is customised and marketed in a number of guises such as EDO-RAM, SDRAM, Rambus etc.; all these use the same basic technology.
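The two properties just described – reads are destructive, and charge leaks away unless refreshed – can be illustrated with a toy model. The C sketch below is not a real device model; the 'charge' values, sensing threshold and refresh interval are invented purely to show why a rewrite after every read, plus periodic refresh, keeps the stored bit alive:

#include <stdio.h>

/* Toy model of a group of DRAM cells: each cell holds some 'charge'.
 * Reading a cell drains it (destructive), so the value must be written
 * back; charge also leaks over time, so every cell must be rewritten
 * periodically ("refresh"). Purely illustrative -- invented numbers.  */
#define CELLS 8

static int charge[CELLS];          /* stored charge, 0..100 (arbitrary units) */

static void leak(void)             /* charge slowly escapes                   */
{
    for (int i = 0; i < CELLS; i++)
        if (charge[i] > 0) charge[i]--;
}

static int read_cell(int i)        /* destructive read...                     */
{
    int bit = charge[i] > 50;      /* sense: 'enough' charge means a 1        */
    charge[i] = 0;                 /* reading drains the capacitor            */
    charge[i] = bit ? 100 : 0;     /* ...so the value must be written back    */
    return bit;
}

static void refresh(void)          /* periodic refresh: read/rewrite each cell */
{
    for (int i = 0; i < CELLS; i++) read_cell(i);
}

int main(void)
{
    charge[3] = 100;                        /* store a single 1 bit            */
    for (int t = 0; t < 200; t++) {
        leak();
        if (t % 40 == 0) refresh();         /* without this the 1 would decay  */
    }
    printf("cell 3 reads back as %d\n", read_cell(3));   /* still 1            */
    return 0;
}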
The Decatron Not an electrostatic memory – indeed related to very little else – the decatron was a neon discharge tube with 10 anodes; the discharge could be jumped from one to another where it would remain following the ionisation path. This was a decimal memory cell which also acted as a display. Now a historical curiosity. 104 COMP12111 Fundamentals of Computer Engineering School of Computer Science More Electrostatic Memories EPROM EPROM uses a special ‘floating gate’ process ¾ special ‘isolated’ charge store Requires special programming and erasure EEPROM Electrically Erasable ¾ In situ programming/reprogramming Bulk erased Slow to write (~100x read time) FLASH Block write/erase Block read with serial interface 105 EPROM (1970s-1990s) If the charge leakage can be (effectively) eliminated and a DRAM read can be made nondestructive then the store would be even more useful. This is the principle behind EPROM (Erasable Programmable Read Only Memory) which is non-volatile – i.e. it retains its data indefinitely (even when the power is off). This is done by adding a ‘floating gates’ to the memory transistors; these are ‘islands’ where charge can be stored which are insulated by (relatively) thick glass (SiO2). Charge was driven through the glass by (relatively) high voltages after which it stayed in a stable state (discharge times >10 years). To erase the chip the charge was drained by a short (~10 minute) exposure to powerful ultraviolet light which lent enough energy to the electrons so they could escape. EPROM devices therefore required a quartz1 window in their package so they could be erased. Programming required a special programmer and the device had to be removed from the circuit; it was therefore important that an EPROM was socketed on the PCB. The socket, expensive windowed package and programming procedure makes EPROMs relatively unattractive if there is an alternative. (Sometimes a saving was made by using OTP or “One Time Programmable” EPROMs – the same devices but without the windows.) EEPROM (1990s-present) An EEPROM (Electrically Erasable Programmable Read Only Memory) uses EPROM technology but erasure may be done electrically. The devices may now be programmed ‘in situ’. Sadly eliminating the charge leakage adds so much ‘insulation’ that the cell becomes difficult (slow) to write to. In addition erasure is still a ‘bulk’ erasure rather than the ability to modify single bits. Thus EEPROM is a complementary rather than a replacement technology for DRAM. 105 COMP12111 Fundamentals of Computer Engineering School of Computer Science Magnetic Memories Magnetism has highly desirable properties for storing data. It has a distinct polarity (think of a compass) It is (relatively) permanent, if undisturbed It can be manipulated using electric currents Two distinct classes of magnetic storage have been attempted: those using fixed and travelling magnetic media. 106 Magnetic Memories Core The memory element in core store was a small torus1 (“core”) of ferrite. This could be magnetised in either direction. This can be set (written) by passing a current through a wire threading the core. To read the device the core was probed and – if it switched – a characteristic pulse was returned. (The read was destructive, so the data has to be written back.) Because it requires a current over a certain threshold to switch the polarisation of a core it was possible to produce dense, 2D arrays. 
These use two ‘address’ wires running at right angles; the current in each was kept below the switching threshold but where they crossed the sum of the resultant magnetic fields was great enough to affect just this one bit. The legacy of core memory still exists in some terminology: a computer’s main memory is still sometimes called “the core”, and “core dump” for an output of a memory image is still in common usage. Core store was followed by “plated wire” as a miniaturisation step. Magnetic core technology was in use in specialist applications (such as space shuttles) in the 1980s because it is both non-volatile and radiation resistant (“rad-hard”). Bubble Memory (1970s+) Now a historical curiosity ‘bubble memory’ was once thought to be the technology for light, portable equipment. Functionally it is the precursor of EEPROM, but works in an entirely different way. Bits are stored in a thin film of a medium such as gadolinium gallium garnet which is magnetisable only along a single axis (across the film). A magnetic field (from a flowing electric current) can be used to generate or destroy magnetically polarised ‘bubbles’ in the film which represent the two states of a bit. These bubbles are non-volatile. Perhaps unfortunately – if only for the name – bubble memory devices proved more expensive than other technologies. 106 COMP12111 Fundamentals of Computer Engineering School of Computer Science Ferrite Cores (1950s-70s) Ferrite ‘cores’ polarised to bit state The 1970s memory technology ~100 Kbits in two stacks of 2D arrays Speeds from 60 kHz to a few MHz 107 107 COMP12111 Fundamentals of Computer Engineering School of Computer Science Moving Magnetic Media The main examples are: Magnetic Drum (1950s) Magnetic Tape (1940s- present) Magnetic Disc (1950s- present) Two of these are still in widespread use today. 108 Moving Magnetic Media Rather than providing wires to each memory element the memory density can be increased – and the cost decreased – by providing a thin magnetic coating on a substrate material and moving this to the read/write element. Drums Drums were the earliest magnetic stores and often acted as directly addressable memory where each CPU generated address corresponds to a particular place on the drum’s surface. (This is in contrast to the modern use of – for example – discs, which form secondary storage and is managed by a layer of software such as a filing system). Drums were used as both primary and secondary store on many early machines; however they proved bulkier and less convenient than discs and were gradually superseded as secondary store. Core memory proved significantly better as main store. If you want to know more about drums – and the sort of programmers who used them – look up “The Story of Mel”. 108 COMP12111 Fundamentals of Computer Engineering School of Computer Science Magnetic drums Magnetic tape Originally large reel- to - reel drives Compact cassettes once used for home computers Now large capacity tapes used for backups & archives 109 Magnetic Tape Magnetic tape uses the same storage technology as disc but the magnetic medium is carried on a flexible plastic tape rather than a plastic or metal disc. The tape is dragged past the read/write head(s) by capstan. The heyday of tape storage was the 1950s & 1960s where science fiction films always showed computers as banks of spinning tape drives. In fact the engineering required for a tape transport to allow heavy reels of tape to start, stop and reverse rapidly is quite complex. 
However modern systems have relegated tape to archival storage (such as backups) where large volumes of data are streamed onto tape in handy-sized cartridges. Here the slow, serial access is not a significant problem and the thin tape wound onto a spool packs a lot of bits into a small volume. 109 COMP12111 Fundamentals of Computer Engineering School of Computer Science Memory Summary • The processor has a memory map which is filled with RAM, ROM • Addresses are decoded to select the appropriate memory block. • Memory has a hierarchy with fast but small memory at the top and slow but large at the bottom • Caches may be used to provide higher bandwidth. • Many different types of memory systems have been tried in the past since memory performance it is one of the critical factors affecting the overall performance of a computer. 110 110 COMP12111 Fundamentals of Computer Engineering School of Computer Science Input/Output and Communications 111 111 COMP12111 Fundamentals of Computer Engineering School of Computer Science Computers need to communicate A computer which can process information at incredible speed is still useless unless it can: get input operands to work on output its results This requires some sort of Input/Output (I/O) system Remember the Amdahl/Case Rule A balanced computer system needs about 1 megabyte of main memory capacity and 1 megabit per second of I/O per MIPS of CPU performance. Desktop machines today go at about 30,000 MIPS so would need 30,000 megabit/second comms (30 GHz on a serial line!!!) 112 112 COMP12111 Fundamentals of Computer Engineering School of Computer Science Batch vs Interactive processing Many early computers were used for batch processing where the input(s) and output(s) were data files stored on magnetic tape, punched cards etc. Examples: Processing census data Calculating and generating a payroll Forecasting the weather Most modern systems are interactive Examples: Word processor Computer game Mobile ’phone In both cases I/O is required. 113 113 COMP12111 Fundamentals of Computer Engineering School of Computer Science Communication Speed First let’s set the scale Start with the speed of light (in vacuum) ¾ 3 x 108 ms-1 ¾ 186 000 miles/s ¾ one foot per nanosecond This is the best case! Information cannot travel faster than this. On a PCB signals propagate (at best) at ~60% of this speed. One nanosecond (1 ns) is 10-9s or the period of a signal of frequency one gigahertz (1 GHz). 114 Digital Communications We are concerned primarily with digital communications. Digital transmissions use a binary coding system and so have the same advantages that binary signals have inside the computer. Two widely separated signal levels are easy to tell apart There is no ‘intermediate’ state which could be confused Discriminating a received signal digitally means that noise can be rejected. A binary signal is representable by a voltage (e.g. ‘high’/‘low’), current, light level (‘on’/’off’) etc. Clearly with a single wire the voltage can only represent one bit at any given time. To send a large number of bits therefore requires some sequencing, separating different elements of the message in time. If only a single bit is conveyed at a given time then the transmission is said to be ‘serial’. If more than one bit is sent at the same time the transmission is said to be ‘parallel’, although it is likely that some time sequencing will be involved too. 
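As a concrete illustration of sending a multi-bit value one bit at a time, the C sketch below serialises a byte least-significant-bit first (a common convention, discussed further below) and reassembles it at the 'receiving' end; the array simply stands in for the successive bit periods on the wire:

#include <stdio.h>

/* Serialise a byte, least significant bit first, and reassemble it.
 * On a real link each bit would be placed on the wire for one bit
 * period; here the "wire" is just a small array.                    */
int main(void)
{
    unsigned char tx = 0x41;      /* the byte to send (ASCII 'A')             */
    int wire[8];                  /* one entry per bit period                 */

    for (int i = 0; i < 8; i++)              /* transmitter: shift bits out   */
        wire[i] = (tx >> i) & 1;             /* lsb goes first                */

    unsigned char rx = 0;
    for (int i = 0; i < 8; i++)              /* receiver: shift bits back in  */
        rx |= (unsigned char)(wire[i] << i);

    printf("sent 0x%02X, received 0x%02X\n", tx, rx);
    return 0;
}

On a real link the two ends must also agree on how time is delineated so the receiver knows when to sample each bit; that is the subject of the following sections.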
114 COMP12111 Fundamentals of Computer Engineering School of Computer Science Examples of coding data Voltage (high/low) on a wire ¾ e.g. 'RS232' serial interface Current (on/off or forwards/backwards) in a loop of wire ¾ e.g. MIDI Audio frequency (high/low) ¾ e.g. tones for a modem or fax machine (FSK – 'Frequency Shift Key') Light (on/off) ¾ e.g. optic fibre, Aldis lamp (signal lamp often used at sea) 115 Some different ways of encoding binary data Voltage (high/low) on a wire ¾ e.g. 'RS232' serial interface Current (on/off or forwards/backwards) in a loop of wire ¾ e.g. MIDI Audio frequency (high/low) ¾ e.g. tones for a modem or fax machine (FSK – 'Frequency Shift Key') Light (on/off) ¾ e.g. optic fibre, Aldis lamp 115 COMP12111 Fundamentals of Computer Engineering School of Computer Science Bandwidth and Latency Latency is journey time ¾ Measured in seconds ¾ Set by journey length and transmission speed Bandwidth is traffic capacity ¾ Measured in bits per second ¾ Set by channel 'width' and transmission speed 116 Bandwidth and Latency Two important terms: The latency is the time taken from sending a signal until it is received. The bandwidth of a communications channel is the amount of information which can be sent in a given time. These two are only indirectly related. Think of latency as journey time, being influenced by the length of the trip, the quality of the road and the speed limit. Bandwidth is the number of cars which can pass a point over a given time. Bandwidth is not affected by the length of the road. Furthermore bandwidth can be increased by adding more lanes even though this will not (in principle) shorten an individual journey. 116 COMP12111 Fundamentals of Computer Engineering School of Computer Science Example An 8 - bit wide memory is cycled at 1 MHz The latency is 1µs The bandwidth is 1 Mbyte/s (or 8 Mbits/s) If the bus was 32 bits wide … The latency would be 1µs The bandwidth would be 4 Mbyte/s (or 32 Mbits/s) 117 Examples Latency Approximate journey times for a signal travelling at the speed of light (tabulated on the slide) represent the single trip times; a return trip (such as in a 'phone conversation) will double this. (There may also be added delays due to switching, signal translation etc.) Bandwidth (Data rate) Deep-space or ELF1 submarine communications may use a few bits per second. For old-fashioned telex links, about 50 bps is used. For links between printers and computers, between (about) 100 and 20 000 bps. For Ethernet networks, about 10 Mbps are available. High speed optical fibre networks reach into gigabits per second. 1. Extremely Low Frequency – necessary to penetrate the overlying sea water 117 COMP12111 Fundamentals of Computer Engineering School of Computer Science Serial Communications "Serial" means that data elements are communicated in a series. i.e. different elements are distinguished by being sent at different times. "Serial" normally refers to communicating one bit at a time. Disadvantages: 'Narrow' channel means potentially low bandwidth Data elements are normally 8, 16, 32, 64, … bits wide ¾ Serial disassembly/reassembly required Advantages: Suitable for telephone or radio transmission Single bit reduces cabling requirement Small number (one!) of interfacing circuits ¾ Allows extra investment in increasing data rate 118 Serial Communications 'Communications' involves sending a signal over a physical medium. As digital computers are primarily electronic this medium is frequently metal wire.
Wires are convenient to interface and cheap at the small scale such as on-chip or on a PCB. As the distance increases the wires get longer and the cost goes up; it is likely that the wires will also need connectors at intervals, again increasing the cost. One way of reducing the cost is to reduce the number of wires. Instead of communicating a 32- bit word along thirty two wires it can be serialised and sent as thirty two messages each one bit long. This will take a longer time but only requires a single wire and can use simpler (cheaper) connectors. In fact at least one extra wire is required in both examples to act as a reference ground (or earth) so that the communicating systems have some common agreement as to the logic ‘high’ and ‘low’ levels. Serial transmission is also well suited to other communications mediums. The electric telegraph (think of the old man with the moustache and Morse tapper in many old Westerns) is serial with symbols encoded as ‘mark’ and ‘space’ on a wire. The modern counterparts use radio, but they normally use only a single radio frequency. In computer communications optical fibres are now very common. The reasons for this are that the available bandwidth is higher and (persuasively) the glass/plastic fibres are much cheaper than copper wires. Data is transmitted digitally down fibres using an on/off binary code, usually with a single frequency (colour) for each transmission. There is some added cost in converting the electronic signal into an optical one and back again, but this is offset by other savings. Lastly there are media which are less obviously used for ‘communications’ but are inherently serial. Good examples are storage media such as CDs or magnetic tape. Data can only be written or read one bit at a time to the medium and so serial conversion must be included in the interfacing process. Notes When serialising a signal it is important to observe a know convention as to which bit is transmitted first. This is the ‘big endian/little endian’ debate again. Little endian (i.e. least significant bit first) is more common. When sending a series of data separated only in time it is essential that both parties agree on how time is delineated so that the receiver knows when to sample the incoming signal. 118 COMP12111 Fundamentals of Computer Engineering School of Computer Science Parallel Communications “Parallel” means that data elements are communicated at the same time. i.e. different elements are distinguished by being sent in different places. example: the seven-segment display interface used in the accompanying laboratory. Disadvantages: More wiring required More interface circuits Bigger connectors Unsuitable for many transmission media (e.g. radio) Advantages: Potential for high bandwidth Data sizes can match ‘internal’ data types In practice it is also common to use a series of several bits in parallel 119 Parallel Communications Parallel communication involves sending several bits of data at the same time. This requires the use of more data channels (usually wires!) and is therefore likely to impose a higher cost. Parallel communications is useful for two different reasons: Interface simplicity Parallel interfacing is very easy at the computer end. A parallel output is simply a latch that the processor can write to; a parallel input is a buffer which allows an external set of bits to reach the CPU. For simple parallel interfaces this is all that is required and software can provide any required control. 
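In C this 'latch plus buffer' view is usually expressed as a pair of memory-mapped locations. The sketch below is illustrative only: the addresses and register names are invented, and a real board's memory map and glue logic would determine where the latch and buffer actually appear.

#include <stdint.h>

/* A parallel output is just a latch the processor can write to, and a
 * parallel input is just a buffer it can read; both appear as ordinary
 * memory locations. The addresses below are invented for illustration.
 * 'volatile' stops the compiler optimising the accesses away.          */
#define PARALLEL_OUT (*(volatile uint8_t *)0xA0000000)  /* hypothetical latch  */
#define PARALLEL_IN  (*(volatile uint8_t *)0xA0000004)  /* hypothetical buffer */

void set_outputs(uint8_t pattern)
{
    PARALLEL_OUT = pattern;          /* each bit drives one external output    */
}

int input_bit(int n)
{
    return (PARALLEL_IN >> n) & 1;   /* each bit senses one external input     */
}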
The interface is also simple to other devices such as switches, lamps etc. If each lamp has a dedicated parallel output bit then there needs to be no ‘intelligence’ external to the computer which may represent a considerable cost saving. High bandwidth Clearly (in principle) increasing the number of wires increases the number of bits per second which can cross the interface. When communicating at a distance this is often offset because the some of cost saving in going to serial interconnection can be reinvested in fancy interfaces which increase the bit (or ‘symbol’) rate. However in local communication there is no competition and parallelism is a Good Thing. Examples occur as the physical scale shrinks and we look at communications around a PCB or on a single chip. Perhaps the most obvious example is the CPU’s own bus connecting it with memory; the bandwidth demands on this are very high (possibly many gigabytes per second) and it would simply not be feasible to approach this serially. In high performance processors a common technique to improve data bus bandwidth is to double (or even quadruple) the bus width from the processor’s ‘natural’ size and fetch two (or four) instructions in parallel. Of course all the bits in the instruction are also in parallel … Serial buses Whilst (almost) all processor buses have separate parallel wires for every address and data bit it should be noted that it is possible to provide such a bus serially (albeit at very low performance). For an example interested parties should look up buses such as I2C 119 COMP12111 Fundamentals of Computer Engineering School of Computer Science Synchronous Communications #1 A synchronous communication occurs if the transmitter and receiver are using the same clock. One method is to send the clock between the systems. Data validity indicated/sampled with rising clock edge Two wires required Potential for clock skew Suitable for transmission under software control/timing Note: with this system the clock period can be varied. (It is still synchronous though.) Example: a PC keyboard serial line 120 Synchronous Communications #1 Synchronous communications are those which use the same clock for both the transmitter and the receiver. This can be a very convenient method because the system can be designed as a straightforward, synchronous FSM. It is a method which works well, for example, when a processor communicates with its memory. No two clocks run at the same speed. However accurately they are made there will always be some discrepancy. Therefore the only way to maintain synchronous communications is to use the same clock. When a processor communicates with its memory the CPU is the master and can dictate the system timing. However this is harder when considering communications over a distance, perhaps between two separate computers. In this case to maintain synchronisation the clock information must be sent across the communications link. Note that this could be set by either the transmitter or the receiver (although the first choice is more intuitive). The problem is then that the clock information is being sent in parallel with the data, thus implying that an extra wire (or similar) is required. This is a small overhead for a parallel interface but doubles the number of connections on a serial interface. Furthermore there can be a problem of clock skew; this happens if the path lengths (delays) of the two wires are different. 
Skew cannot change the frequency of transmission but the different latency can shift the relative phase of the signals. Clearly any phase shift must be significantly less than a bit period. 120 COMP12111 Fundamentals of Computer Engineering School of Computer Science Synchronous Communications #2 Another method of synchronous communications is to encode the clock with the data. Example: Manchester encoding There is always a transition at a predictable time Receiver uses signal and knowledge of (approximate) bit rate to regenerate the clock Only need one signal wire + return (Gnd) Example: Ethernet Similar approaches are used for other ‘communications’ media. Examples: Magnetic disc CD - ROM 121 Synchronous Communications #2 One solution to both these problems is to encode the clock and data onto the same signal. There are several ways to do this: one example is Manchester encoding which is used for Ethernet transmissions. This encodes a “0” as a falling edge and a “1” as a rising edge. Some other transitions may have to be inserted to make this work (see figure opposite). The ‘clock’ information can be recovered because there is a transition (rising or falling) for every bit and the receiver, knowing the approximate bit period, can lock to the exact period using a Phase-Locked Loop (PLL). . Phase-Locked Loops A phase-locked loop is an oscillator which can adjust itself to match an external frequency. An example would be a system of you and your watch; the watch is a good time reference for most purposes but, every so often, it is necessary to ‘re-synchronise’ with a reference clock such as a radio time signal Exercise When using Manchester encoding a synchronising preamble is required; why is the sequence 10101010 chosen? (Hint: try encoding this sequence.) 121 Some other self-clocking encodings: Another common application for self-clocking encodings are magnetic recordings. A magnetic disc’s data rate depends on its rotation speed (which may not be quite constant). This is exacerbated with interchangeable discs, which may have been written on a different drive. The data stream must therefore be selfclocking. Some codes which are/have been used are: Frequency Modulation (FM1) once used for floppy discs Modified Frequency Modulation (MFM) used for floppy discs & early hard discs Run Length Limited (RLL) used for hard discs FM encoding uses a transition to indicate the start of a bit period; data is encoded by the presence or absence of another transition within the bit period. The other codes mentioned give denser recordings by omitting some transitions; basically it is possible to survive without re-synchronising the clock for every bit, just as a watch does not have to be reset every hour. CD recordings (etc.) make use of similar data recovery techniques. 1. Nothing to do with the radio! 122 COMP12111 Fundamentals of Computer Engineering School of Computer Science Asynchronous Communications #1 The Asynchronous Serial Line Asynchronous – no clock transmitted However Does rely on an agreement on the (approximate) transmission frequency All ‘symbols’ have the same length but there can be an arbitrary time between them 123 Asynchronous Communications #1 Asynchronous communication – i.e. communication without a common clock – is often more convenient than shipping the clock across the interface. There are two different techniques which are referred to as asynchronous communication; these are exemplified below. 
The Asynchronous Serial Line In an asynchronous serial line no clock information is transmitted but the transmitter and receiver have already agreed on (and fixed) the period of transmission of a data element. Because no two clocks run at exactly the same rate there is necessarily some mismatch, but this can be minimised by resynchronising the receiver with the transmitted stream every so often. A typical asynchronous serial line used (for example) as a modem1 will synchronise on every byte of a message. As long as the transmitter and receiver frequencies are 'close' they will not drift too far apart before they are synchronised again. The need for this resynchronisation imposes an overhead on the transmission which inserts 'extra' bits that are not in the message. This is the reason that a serial line at (say) 9600 baud will only transmit around 960 bytes per second rather than the 1200 you might have expected. 1. MOdulator/DEModulator 123 COMP12111 Fundamentals of Computer Engineering School of Computer Science Symbol transmission The 'start bit' is used for synchronisation The 'stop bit' guarantees the line goes 'idle' ¾ ensures the next start bit will cause a transition The bit time at the receiver is similar to the transmitter ¾ no synchronisation is lost in ~10 bit times In this example sending 8 bits takes (at least) 10 bit times. 124 124 COMP12111 Fundamentals of Computer Engineering School of Computer Science RS232 serial interface The most common asynchronous serial interface (RS232) uses a protocol as follows: (The slide shows the waveform: an idle line, a start bit, the data bits sent lsb first – 1 0 0 0 0 0 1 0, the ASCII character 'A' – and finally a stop bit.) 125 Parity Parity is used for consistency checking of data. It can usually be checked by the UART which will indicate a parity mismatch. It can be programmed to be odd, even or none and may be ignored by the receiver in any case. 125 COMP12111 Fundamentals of Computer Engineering School of Computer Science Typical set-up for RS232 There are many variations on the details, but most interfaces can be programmed to send or receive all of them: • A start bit • 7 or 8 data bits • A parity bit (optional) • 1 or 2 stop bits The standard data rates are 75, 110, 300, 600, 1200, 2400, 4800, 9600, 19k2 and 38k4 bits/s. 126 126 COMP12111 Fundamentals of Computer Engineering School of Computer Science Asynchronous Communications #2 Handshaking Two handshake transfers are shown. The data is set up The transmitter asserts Request The receiver latches the data and asserts Acknowledge The transmitter removes Request The receiver removes Acknowledge The process may then repeat. Note that the duration of each transaction can be varied by either participant and there can be an arbitrary time between transactions. This is a truly asynchronous communication. 127 Asynchronous Communications #2 Handshaking Handshaking is frequently used in communications buses but the classic example of handshaking is the parallel printer interface usually known as the "Centronics" interface. The principle of a handshake interface is that the data is set up and then a request signal asserted by the sender. The sender can take no further action until the receiver has acknowledged receipt; the receiver will not do this until it has secured the data and knows that it can continue. Transactions performed by this handshaking process – where the initiative is passed backwards and forwards – are truly asynchronous. The transmitter may wait indefinitely before transmitting and then the receiver can take as much time as it needs before acknowledging.
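A minimal C sketch of one such transaction is given below, stepping through the five events in the order listed above. Plain variables stand in for the Request, Acknowledge and data wires; in real hardware the two participants are independent devices, each waiting for the other's signal before taking its next step.

#include <stdio.h>

/* One handshake transaction, stepped through in the order the notes give.
 * Plain variables stand in for the Request, Acknowledge and data wires.  */
int main(void)
{
    int request = 0, acknowledge = 0;
    int data_wires = 0, latched = 0;

    data_wires = 0x5A;                /* transmitter: set up the data          */
    request = 1;                      /* transmitter: assert Request           */

    latched = data_wires;             /* receiver: latch the data...           */
    acknowledge = 1;                  /* ...and assert Acknowledge             */

    request = 0;                      /* transmitter: remove Request           */
    acknowledge = 0;                  /* receiver: remove Acknowledge          */

    printf("received 0x%02X\n", latched);   /* the transaction is complete     */
    return 0;
}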
Transmission of each individual symbol can take a different time. Clearly this mechanism requires several parallel channels: one for request, one for acknowledge and at least one for data – it would be usual to have, say, eight data bits in parallel for a total of eleven wires (don’t forget a ground signal which sets the level the other signals are compared to). Asynchronous Processors Traditionally on-chip communication has been done synchronously; the small size of a silicon chip relative to the clock ‘wavelength’ (a 100 MHz clock will be over a metre of wiring) meant that the assumption of synchronicity was a good one. With increasing speeds it is increasingly attractive to consider asynchronous communication between devices and even within devices. Asynchronous processing is therefore making a comeback in some areas. This is a particular area of interest in – among other places – Manchester. 127 COMP12111 Fundamentals of Computer Engineering School of Computer Science TDM If there’s spare bandwidth (capacity) on a wire it can be shared amongst several channels. This is known as Time Domain (or ‘Division’) Multiplexing (TDM). 128 TDM Sometimes a communications channel will have more bandwidth available than an application requires. A common example is a telephone wire. Most telephone connections are used for voice conversations. Humans can hear frequencies up to about 20 kHz (at best) but a typical voice is recognisable with a much lower frequency range; in analogue terms around 3 kHz bandwidth is sufficient. Because the signal is analogue this can represent rather more than 3000 bits per second (bps); in fact ~56 000 bps is a reasonable guide1. When sending messages between cities or between countries rather more sophisticated connection technologies are used and the bandwidths are commensurately higher. It would clearly be a huge waste to use such a channel for a single telephone chat! Instead the available bandwidth can be partitioned amongst a number of conversations which can share the same wire (or fibre). To do this each conversation is stored, broken into small excerpts and squeezed onto the wire much more rapidly. 1. Think of a typical computer modem. 128 COMP12111 Fundamentals of Computer Engineering School of Computer Science Uses of TDM TDM is a common technique in communications. It can be used in other applications ¾ e.g. a seven segment decoder in the laboratory If the channel is being used for real time communications (such as a telephone line or television programme) there is also a limit on the latency of each ‘packet’. 129 Optical fibres An electrical signal can be converted into light using an LED and back to an electrical signal using a photodiode or phototransistor. In between the light can be carried along an optical fibre or “light guide” with very little loss. An optical fibre conducts light by total internal reflection from its inner walls. The light (and hence the energy) cannot escape because the incident angle is greater than the critical angle of the fibre. 129 COMP12111 Fundamentals of Computer Engineering School of Computer Science Layered Communications Protocol 130 ISO 7498 International Standards Organisation model for Open Systems Interconnection (OSI). When you send a letter – the old fashioned kind with paper and a stamp – all you want is for your message to get to the correct place; you don’t really know (or care) how this happens. To facilitate this you place the letter in an envelope with an address on the outside and deposit it in a postbox. 
It will be emptied into a sack and the sack will be put in a van with other sacks. The sack will be transported to an sorting office where the van will be opened and the sack removed and emptied. The letter will then be enclosed in a different sack for transport to another city. This may be put in a train; the train driver knows where to take the train (usually!) but may be unaware of the mail sack and certainly won’t have seen your letter. The process continues until, in the end, the postman takes letters bundled per street, splits these, delivers them to a house where the recipients check the individual names and the recipient receives the missive. Computer communications happens in much the same way. A number of ‘layers’ are defined. When an application (e.g. Netscape) on one machine wants to communicate with an application on another it uses a virtual link. 130 COMP12111 Fundamentals of Computer Engineering School of Computer Science Layered Comms. Protocol Cont. Most layers correspond via a virtual link Only the physical layers have a real link Some layers are implemented in hardware, others in software Not being a communications course we shan’t delve into this, just observe that it exists. The layered protocols allow a convenient degree of abstraction at each level, so a word processor does not need to know that its document file is on a remote file server accessed by ,for example, a fibre optic token ring … 131 ISO 7498 defines the following layers: Application - the user’s view of the system Presentation - format conversion Session - session management, security et al. Transport - connections and channels (e.g. TCP – Transmission Control Protocol) Network - packeting and routeing (e.g. IP – Internet Protocol) Data link - error checking and retransmission if necessary Physical - connection and switching (e.g. Ethernet) All of these can communicate with their counterparts but only the physical layer has a physical link (wire, optic fibre etc.); all the others use virtual links. The upper levels will be implemented in software, the lower ones in hardware and some mixture will be used in between 131 COMP12111 Fundamentals of Computer Engineering School of Computer Science Interfacing There is a plethora of possible I/O devices which may be used. 132 Interfacing A user does not (or, at least, should not) care how an I/O device works. From Unix a ‘file’ can be transferred to a disc, network or modem without knowing what these actual devices are similarly a file can be sent to a printer without regard for whether the printer’s interface is parallel or serial. The upper layers of the communications protocol are held in the device driver. This is a set of operating system routines which has a common virtual interface to applications but is specific to a particular peripheral device. We don’t do software in this course, so we will concentrate on the peripheral device, usually just called a ‘peripheral’. The peripheral implements the lower levels of the comms. protocol in hardware. These provide the signals on physical wires (etc.) with the correct voltage levels and the correct timing. A peripheral may be a very simple interface (leaving much of the function in software) or it can be highly sophisticated. The most sophisticated devices are complete peripheral processors which are computers in their own right, complete with their own, embedded software. The devices at the other side of the interface (outside the box) may also have considerable intelligence. 
A typical printer, hard disc drive or even a keyboard will often contain its own processor. 132 COMP12111 Fundamentals of Computer Engineering School of Computer Science Interfacing Cont. Can’t consider all possibilities when specifying a CPU. Typically I/O devices are ‘mapped’ into memory space. The specific requirements of a particular interface are provided by: a peripheral device (hardware) a device driver (software) 133 Interfacing examples When typing on a PC keyboard the matrix of 100+ keys is scanned by a microcontroller (single chip computer). This uses simple, digital parallel inputs to detect if a key is pressed or released and will also note the time when a change happens with its on-board timer. It then runs software which ensures that the key is debounced. When it is sure that a key state has changed it identifies the key and action (pressed or released) by a code which is sent to the main computer via a synchronous serial line. The serial transmission is received by another microcontroller which records the key information and translates it into an internal key code such as an ASCII character. Note that the translation may depend on the state of other keys, such as ‘control’ and ‘shift’. It then interrupts the CPU and allows this to read the key state via the bus. If a key is pressed the CPU records this in a buffer in memory. This can later be read by an application program asking for keyboard input. In reading data from a hard disc drive the magnetic transitions induce tiny electric currents in the read head. These are amplified to digital levels which are used by the data recovery circuit. This passes the serial data to a shift register where it is assembled and passed to the drive’s on board processor. It is then assembled into a buffer in the drive’s memory. This can then be read out across a parallel interface by the CPU or – more likely – DMA into the main computer’s memory space. 133 COMP12111 Fundamentals of Computer Engineering School of Computer Science An Example Peripheral A good example of a peripheral device is a serial interface. The generic example is the UART. A UART is basically a pair of shift registers (one for input, one for output) controlled by a finite state machine. The processor can follow the states of the FSM via a status register. To use the transmitter: Wait until the transmitter is free ¾ indicated in the status register Write byte to transmitter register Repeat To use the receiver: Wait until the receiver is full ¾ indicated in the status register Read byte from receiver register Repeat The peripheral deals with the serialisation etc. 134 A UART o Universal Programmable to match user’s requirement o Asynchronous Can use asynchronous serial communications o Receiver Can do input … o Transmitter … and output Using a UART We will save the detailed construction of a UART for later. However let’s look at the interfaces starting with the transmitter. N.B. This is a simplification of the actual operation. It would be possible for the CPU to break down words into individual bits for serial transmission. However this is a tedious process for software but is relatively easy in hardware. The UART therefore provides a parallel interface for the CPU (often only 8 bits wide though). The CPU can write a byte to this interface and the peripheral does the rest, serialising it by shifting the bits out one at a time. Shifting occurs at the rate expected by the interface, which may well be different from the processor’s operating speed. 
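The transmit and receive recipes above amount to a few lines of polled C. The sketch below is illustrative only: the register addresses and status-bit positions are chosen to be consistent with the fictitious UART described in the next section, but a real device's data sheet would define them.

#include <stdint.h>

/* Polled use of a UART: wait until the transmitter is free, then write
 * the byte; wait until the receiver is full, then read the byte. The
 * addresses and bit positions below are illustrative assumptions.      */
#define UART_DATA    (*(volatile uint8_t *)0xC0000000)  /* tx/rx data register */
#define UART_STATUS  (*(volatile uint8_t *)0xC0000004)  /* status register     */
#define TX_FREE      0x10     /* set when the transmitter can accept a byte    */
#define RX_FULL      0x01     /* set when a received byte is waiting           */

void uart_send(uint8_t byte)
{
    while (!(UART_STATUS & TX_FREE)) ;   /* busy-wait on the status register   */
    UART_DATA = byte;                    /* the UART serialises it for us      */
}

uint8_t uart_receive(void)
{
    while (!(UART_STATUS & RX_FULL)) ;   /* wait for a byte to arrive          */
    return UART_DATA;                    /* reading typically clears RX_FULL   */
}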
134

COMP12111 Fundamentals of Computer Engineering School of Computer Science

A 'Real' UART
The table (not reproduced in this text) shows the register definitions for a (fictitious) UART, including its status register. 135

Because the serial interface and the processor are running at different speeds it is necessary to have a means of indicating that all the bits have been shifted out; if the processor tried to send the subsequent byte too early it would corrupt the previous transmission. The whole process is controlled by an FSM which keeps track of the operation of the device. The UART idles until a byte is sent; it then becomes 'busy'. The CPU must not send another byte while the UART is busy. This is prevented by software, but the software must find the FSM's status; the 'busy' bit is therefore made available in a second register, the status register. The status register must reside in a different place from the address used for the data output, so our UART must occupy at least two addresses. In practice it will occupy more, because it will be programmable ('universal') with characteristics such as transmission speed.

The operation of the receiver is similar, with bits being shifted in serially and the assembled byte presented in parallel. Another status bit (it will fit in the same register at the same address) is used to indicate when a byte has arrived. 135

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Real UART Cont.
Here some of the features common in such peripheral interface devices are shown.
Some functions are programmable (e.g. baud rate)
Several registers needed to support one serial interface
Not all the registers read back the same value that was written to them
Reading some registers can cause 'side effects' – e.g. reading the data input clears the 'data ready' bit
A timer has been included which shares the interface 136

A UART in Action!
A UART is a reasonably complex peripheral device. The one depicted here is a (gross) simplification, but it illustrates most of the features of a typical peripheral device as viewed from the software side.

The UART interface comprises several registers at different addresses. Typically these are 8 bits wide. If the UART is used with a 32-bit processor (e.g. ARM) then all these registers will be connected to the same data lines (see the section of notes on memory) and thus will not be at contiguous addresses. The UART itself will also be mapped into memory space somewhere, so the registers may be at (for example) {C0000000, C0000004, C0000008, C000000C}.

When the UART is reset (e.g. at switch-on) it will disable its receiver and transmitter and inactivate any possible interrupt signals. The user must program any set-up options (such as the baud rate) before enabling the peripheral's function.

A byte to be transmitted is written to register 0. Before transmitting a subsequent byte the software needs to ensure that the first one has gone (serial transmission is usually a lot slower than the software). This can be done by testing bit 4 in the status register.

The receiver operates in a similar fashion; here the software must wait for the receiver to be ready (status register bit 0 set) before reading the byte. Typically the act of reading register 0 will reset this bit, so you can only read the character once. A flag has been included to indicate if an error has occurred (e.g. if reception has been corrupted by noise). It is up to the software to check (or ignore!) this information.
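A minimal sketch, in C, of the software side just described. The register addresses follow the example given above ({C0000000, C0000004, …}) and the status bits (bit 4 = transmitter free, bit 0 = receiver ready) are those used in the notes; the position of the error flag and all the names are assumptions made for illustration only, not taken from any real data sheet.

#include <stdint.h>

/* Example register map: 8-bit registers on a 32-bit bus, so word-spaced addresses. */
#define UART_DATA    (*(volatile uint8_t *)0xC0000000u)  /* register 0: write = TX data, read = RX data */
#define UART_STATUS  (*(volatile uint8_t *)0xC0000004u)  /* status register                              */

#define STAT_RX_READY  (1u << 0)   /* bit 0: a received byte is waiting               */
#define STAT_TX_FREE   (1u << 4)   /* bit 4: the previous byte has been shifted out   */
#define STAT_RX_ERROR  (1u << 2)   /* assumed position of the reception error flag    */

/* Transmit one byte: poll until the transmitter is free, then write register 0. */
void uart_send(uint8_t byte)
{
    while ((UART_STATUS & STAT_TX_FREE) == 0)
        ;                           /* the serial line is far slower than this loop */
    UART_DATA = byte;
}

/* Receive one byte; returns -1 if the error flag was set.
   Note register 0 is read exactly once: reading it clears the 'data ready' bit. */
int uart_receive(void)
{
    while ((UART_STATUS & STAT_RX_READY) == 0)
        ;                           /* wait for a byte to be assembled */
    uint8_t status = UART_STATUS;   /* sample the error flag first     */
    uint8_t byte   = UART_DATA;     /* this read clears the ready bit  */
    return (status & STAT_RX_ERROR) ? -1 : (int)byte;
}

Before either routine is useful the set-up options (baud rate, enables) would, of course, have to be programmed, since reset leaves the receiver and transmitter disabled.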
If you want to see real UART data sheets there are plenty available on the WWW.

Timers
This UART also includes a timer. This is a device which measures the elapsed time as the system runs. It is necessary to use hardware to get an accurate timing measurement because software timing is difficult to calibrate and notoriously inaccurate1. The timer is not logically part of the UART's function. However, timers are often included on other peripherals because they only require one extra signal (their clock) over the many bus interface and address decoding signals already present.
1. Think of the effect of running the same program on a faster computer! 136

COMP12111 Fundamentals of Computer Engineering School of Computer Science

DMA
Our 'three box' model has considered the CPU, memory and I/O.
So far the CPU has always been the bus master (i.e. in control).
Most I/O is data being moved into or out of memory, e.g. programs, data files, video display.
The CPU has to perform the transfer. It is much more efficient if the transfer can be performed without CPU intervention. The CPU can then be used for something else. 137

Direct Memory Access (DMA)
DMA is Direct Memory Access (by a peripheral device). The concept is quite straightforward but the implementation details can be troublesome; we shall therefore stick to the general idea here.

Most I/O – especially most high bandwidth I/O – involves moving the data to/from memory in significant blocks. For example, loading a program involves moving a large block of words from a disc or network interface into a contiguous address space. This is a pretty mundane process:
wait for a peripheral to become ready
fetch a byte (or word) from I/O
store the byte (or word) into memory
increment the memory address (ready for the next transfer)
count off the transfer (to detect when to finish)
repeat
Hardly a taxing program! (A sketch of such a loop is given below, after the introduction to interrupts.) The CPU's time can be better spent doing more difficult work.

DMA allows the peripheral itself direct access to the memory. The peripheral, with the addition of some simple hardware, can then move the data into (or out of) the memory without bothering the CPU. This introduces some extra concurrency (parallelism) into the system.

The hardware cost of DMA is fairly small but it does add complexity to the overall system. In order to get at the memory the DMA transfer needs the bus (see picture, opposite) and the CPU may want it at the same time. There is therefore an issue of arbitration for bus mastery to resolve any contention. (Even if conflicts occur the DMA process is more efficient at moving data than the CPU – it needs no instruction fetches.)

DMA is primarily used for high bandwidth transfers. A good, but perhaps not obvious, example is the display output on a typical workstation. A frame buffer contains about 1 Mpixel and each pixel may use a 32-bit representation. As the display is refreshed at (say) 70 Hz the required bandwidth will be ~300 Mbytes per second. Certainly hardware assistance is required for transfers at this rate! 137

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Hardware Interrupt
This is a hardware signal which, when active, causes the processor to change the program context. It allows the processor to have a response to an external event which is closer to real time. 138

Interrupts
Interrupts fall into a category of occurrences usually classified as "exceptions". Exceptions are events which occur 'unexpectedly' during the execution of a program, rather than as a direct result of the execution of instructions.
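As a brief aside on the DMA discussion above, the 'mundane process' can be made concrete with a short C sketch of the programmed-I/O loop that a DMA controller replaces. The device registers and the ready bit are hypothetical; the point is only that the CPU spends its time polling, loading, storing and counting.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical peripheral data and status registers. */
#define DEV_DATA    (*(volatile uint8_t *)0xC0001000u)
#define DEV_STATUS  (*(volatile uint8_t *)0xC0001004u)
#define DEV_READY   (1u << 0)

/* Copy 'count' bytes from the peripheral into memory, one byte at a time. */
void pio_read_block(uint8_t *dest, size_t count)
{
    while (count != 0) {                       /* count off the transfer   */
        while ((DEV_STATUS & DEV_READY) == 0)
            ;                                  /* wait for the peripheral  */
        *dest++ = DEV_DATA;                    /* fetch from I/O, store to memory,
                                                  increment the memory address */
        count--;
    }
}

A DMA controller does exactly this with a couple of counters and registers of its own, leaving the CPU free (subject to bus arbitration) to execute something else.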
An example of an exception could be an integer DIVision instruction which tries to divide by zero; the answer "infinity" is not representable by an integer and so an unexpected error has occurred. Some processors will generate an exception (or "trap") if this occurs. (ARM does not do this one because it has no specific divide instruction.) This causes execution to branch to an exception handler (a.k.a. "trap handler") which can take remedial action.

What is an interrupt?
An interrupt is another class of exception. It is initiated by a hardware signal to the processor which causes the processor to jump to an interrupt handler or interrupt service routine (ISR). When the exception is resolved the processor needs to jump back to the original code and proceed as if nothing had happened. The interrupt service routine can therefore be regarded as a software procedure; the only difference is that it is called by a hardware signal rather than a software instruction. 138

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Interrupts
(Figure: timing comparison of polling and interrupt-driven I/O for a program with two ports, showing the input arriving, the service latency and the time saved by using interrupt hardware rather than polling in software.)
1. Interrupts are more efficient
2. Interrupts are less complex (honest!) 139

Why use interrupts?
Program execution in the processor is a serial operation. However the computer often wants to do many tasks 'at once'. Most computers give the illusion of multi-tasking by doing a bit of one job, then a bit of another, etc. and cycling these different tasks fast enough to deceive the eye! This is another form of time domain multiplexing.

Many tasks are very simple. For example the job of inputting characters from a keyboard involves a lot of waiting (characters arrive at <10 per second, instructions may execute at >1 000 000 000 per second) and it is a waste of processor time for the CPU to repeatedly poll the keyboard and wait. A simple, cheap hardware circuit can do nothing just as well as the expensive processor. This can be tasked to wait for the keyboard input and do something with it. This action may, for example, involve DMAing the character into memory; in such a system we have (cheap) independent, parallel processing between the CPU and the keyboard, and parallelism is typically good for performance.

Alternatively we may wish to do something more complicated with the character. We could spend more on hardware but this is often uneconomic. Instead the input can wait until the character is ready and then 'borrow' a short burst of CPU time. This is done by requesting service via an interrupt. The interrupt service routine may require a few hundred operations, but these are relatively infrequent and only requested as needed so the overhead is small. 139

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Servicing Interrupts
(Figure: the division of labour when an interrupt is serviced.
Hardware: interrupt occurs; save PC & status; disable interrupt; context switch; … finally restore PC & status.
Software: save working registers; service interrupt; clear interrupt; restore registers.) 140

Servicing Interrupts
What happens on an interrupt? When an interrupt is serviced the processor is seconded to run a different thread. This interruption of the current program preempts 'normal' execution. The point to be emphasised here is that the interrupt service routine (ISR) is called at 'random' positions in the user's code. At the time the interrupt is serviced any or all of the processor's registers may be holding useful data.
Some of these registers may be needed to run the ISR but, if they are altered, it is essential that all the state of the processor is restored before normal service is resumed. (The consequence of this not happening has been likened to walking down a street when, suddenly in mid stride, you find all your clothes have changed – or worse!)

Clearly there are some parts of the processor state which the ISR is unable to preserve because they must be changed before the ISR is entered. The most obvious value is the program counter (PC), which indicates the position at which the interrupt occurred and therefore defines the return address. The hardware therefore has to cooperate to save some of the processor's state. 140

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Interrupt Implementation
Many devices, one interrupt.
On-chip, where there is a fixed number of interrupting devices …
On a PCB (etc.), where there is a variable number of interrupting devices …
(Figure: in each case several I/O devices share a single INT signal into the CPU.)
The latter uses open-drain outputs and forms an expandable (if relatively slow) 'gate' from the wire itself. 141

Sometimes a 'peripheral' device is used to collect the various interrupt signals and simplify the interrupt response software.
(Figure: an interrupt peripheral between the interrupt lines and the processor, comprising a latch and selective enables, connected to the data bus via address decode.)
It concentrates the interrupts for the processor and will usually allow the state of the individual signals to be read in software. This alleviates the need to read each potential interrupting device. Another common facility (implemented here) is to allow the various interrupt sources to be enabled selectively; this is in addition to any such facility provided by the peripherals themselves.

More sophisticated devices may contain a priority encoder which allows the highest priority active interrupt to be identified by a single read operation. Such devices may also support the ability to change the interrupt priority 'on the fly'. For example it may be desirable for the last serviced interrupt to be given the lowest priority; this is difficult to do in software (i.e. it adds significantly to the interrupt latency). 141

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Interrupt Priority
Some interrupts are more 'important' than others.
Some interrupts require servicing more urgently than others.
Some interrupts take longer to service than others.
Sometimes it is desirable to service one interrupt whilst in the middle of another's ISR.
Occasionally two or more interrupts may happen simultaneously; which is chosen?
Prioritisation can be done by:
Software, choosing the order in which potential interrupts are checked
An interrupt controller with several separate interrupt inputs – some CPUs include this, together with a number of operating priorities
Daisy chaining
A combination of all the above 142

Exercise
By making reasonable assumptions estimate how much of a processor's time is needed to deal with a mouse which interrupts every time it moves a 'step'.
Number of steps/cm? Number of steps/s?
What happens on each step? Roughly how many instructions per interrupt?
How many dimensions? What about button clicks? 142

COMP12111 Fundamentals of Computer Engineering School of Computer Science

Daisy Chain
The daisy chain is a mechanism which can be used to prioritise any number of interrupting devices.
(One possible method)
(Figure: the CPU's single Int input is shared by all the peripherals' Int outputs. The CPU's IAck output feeds the first peripheral's IAckIn; each peripheral's IAckOut feeds the next peripheral's IAckIn, and the final IAckOut is left unconnected (N/C).)
The processor cooperates by 'acknowledging' (accepting) an interrupt request.
* Used in collaboration with vectored interrupts
* Requires little extra hardware
* Usable with single interrupt signal
* Cheap in pins
* 'Infinitely' expandable
(Sequential process, so beware of the time required though.) 143

143
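To tie the prioritisation options together, here is a sketch of the simplest of them: software checking the potential sources in a fixed order, using an interrupt 'peripheral' whose latched request bits can be read in a single operation. The register address, bit assignments and handler names are all invented for the example; checking in a fixed order is what gives the sources their relative priority.

#include <stdint.h>

/* Hypothetical interrupt-peripheral latch: one bit per pending request. */
#define IRQ_PENDING  (*(volatile uint32_t *)0xC0002000u)

#define IRQ_DISC   (1u << 0)   /* assumed bit assignments, highest priority first */
#define IRQ_UART   (1u << 1)
#define IRQ_MOUSE  (1u << 2)

extern void service_disc(void);    /* device-specific service routines, defined elsewhere */
extern void service_uart(void);
extern void service_mouse(void);

/* Called from the single hardware interrupt entry point, after the working
   registers have been saved as described in the servicing-interrupts notes. */
void dispatch_interrupts(void)
{
    uint32_t pending = IRQ_PENDING;   /* one read shows all the pending sources */

    if (pending & IRQ_DISC)   service_disc();
    if (pending & IRQ_UART)   service_uart();
    if (pending & IRQ_MOUSE)  service_mouse();
}

A hardware priority encoder or a daisy chain achieves the same ordering without the repeated tests, at the cost of a little extra logic.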