Insight To Processor Architecture and Design Metrics, Analysis, and Comparison Between Ten Designs

Daniel Rosabal
Department of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816-2362

Abstract: This paper focuses on different types of processor architectures. It begins by describing the five classic components of a processor. Then it explains how the address bus and data bus interact to make a connection between memory and processor possible. It continues by stating the metrics that will be analyzed, such as CPU clock rate, memory capacity, word width, Moore's Law, etc. A brief explanation of processor performance is provided along with Amdahl's Law and the concept of parallelism. Finally, a list of 10 processor architectures is provided, with one paragraph describing each design.

Keywords: Amdahl's Law, Clock rate, CMOS technology, CPI, Memory Capacity, Moore's Law, Word width

I. OVERVIEW OF PROCESSOR ARCHITECTURE

1) Classic Components: A computer system consists of the following components: I/O devices, memory, and CPU. I/O stands for input/output; examples of such devices are a keyboard or mouse for input and a printer or display for output. CPU stands for central processing unit, also known as the processor, and it consists of a control unit and an arithmetic logic unit (ALU). The control unit decodes the instruction bits to control instruction execution, and the ALU performs arithmetic and logic operations on the data bits. Memory is in charge of storing these bits.

2) Processor Buses and Bit Width: Two buses connect the processor and memory: the data bus and the address bus. These two buses make interaction between the processor and memory possible. Memory stores data as bits at different addresses, and the processor accesses this data by providing the address of the desired data via the address bus.
Once the memory receives this address, it sends the data stored at the specified address back via the data bus.

3) Metrics Studied: The CPU clock rate, also known as CPU clock speed, is a measure of how many clock cycles a CPU performs per second. Clock rate is measured in hertz (Hz), kilohertz (kHz), megahertz (MHz), or gigahertz (GHz). These four units differ only by a multiplication factor: 1 kHz = $10^{3}$ Hz, 1 MHz = $10^{6}$ Hz, and 1 GHz = $10^{9}$ Hz.

The unit utilized to measure memory capacity is the byte. However, typical measures of memory capacity are kilobytes, megabytes, or gigabytes. Like kHz, MHz, and GHz, these units differ only by a multiplication factor: 1 KB = $2^{10}$ bytes = 1024 bytes, 1 MB = $2^{20}$ bytes, and 1 GB = $2^{30}$ bytes. Another unit related to memory is the bit, which is the smallest unit of information: 1 byte = 8 bits.

4) Significance: According to Moore's Law, the number of transistors per die doubles approximately every eighteen months to two years.

5) Processor Performance Equation: The Processor Performance Equation is the following:

\[ \text{Execution Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \]

The instruction count refers to the dynamic count of instructions executed at runtime, not the number of instructions in the assembly code. The CPI refers to the average number of clock cycles required per instruction in the dynamic instruction count. Finally, the clock rate refers to the number of clock cycles performed per second. Multiplying the dynamic instruction count by the CPI and dividing by the clock rate gives the execution time.

6) Parallelism: Parallelism is utilized, ideally, to make a program run faster. The concept of parallelism consists of simultaneously using more than one processor, or CPU, to execute a program. Amdahl's Law calculates the maximum speedup a program can achieve when parallel cores are utilized with the following equation:

\[ \text{Speedup} = \frac{\text{Execution Time on a Single Processor}}{\text{Execution Time on } N \text{ Parallel Processors}} = \frac{1}{(1-f) + \dfrac{f}{N}} \]

where f refers to the fraction of code that is infinitely parallelizable with no scheduling overhead, (1-f) refers to the fraction of code that is inherently serial, and N refers to the number of parallel processors used.

In Section 2, ten processor architectures spanning from the 1990s until today are reviewed. Section 3 consists of a table with some of the features of each processor mentioned in the previous section. Based on this table, several plots were created to compare the different types of processors in order to identify trends that have occurred over the past two and a half decades. Section 4 provides a brief conclusion analyzing the results obtained in this report. Finally, Section 5 lists the references that were utilized to obtain some of the information necessary to complete Table 1.

II. LITERATURE REVIEW

§ 1990s

In 1992, the SNAP-1 Parallel AI Prototype was developed at the University of Southern California [4]. It contained 144 CPUs organized onto 8 large circuit boards; each CPU was a Texas Instruments TMS320C30 32-bit DSP chip clocked at 25 MHz with 256KB of memory. The SNAP-1 ISA had 20 instructions for the special-purpose computation of marker-passing. It bridged the semantic gap by providing complex, powerful instructions close to the needs of Natural Language Understanding (NLU) applications.

The DECmpp/Sx-1208 parallel processor was utilized for the implementation of Fuzzy ARTMAP. This processor consists of a DEC RISC Workstation Front-End (FE) and a MasPar MP-1 Back-End (BE). The RISC Workstation is a 32-bit processor with 128KB of data memory that is necessary to control and transfer data to and from the processor elements. The Back-End machine, as described by MasPar, is a SIMD massively parallel machine consisting of 512 4x4 clusters of processor elements (PE) arranged in a 16 x 32 cluster array.
[3] Therefore, the Back-End contains 8,192 processor elements (512 clusters of 16 PEs each). After implementation, the design was tested on the Letters benchmark developed by Frey and Slate.

In 1999, Intel launched the Pentium III microprocessor, whose process feature size ranged from 250nm down to 130nm. This single-core microprocessor, with a memory capacity of 512MB and a 32-bit word width, was evaluated in a benchmarking environment known as the One Semi-Automated Forces Testbed (OTBSAF). The purpose of this evaluation was to test various microprocessors for potential use in embedded simulation. Host "bahrd" was a Dell Inspiron 8000 laptop using a Pentium III processor running at 1GHz. It utilized 0.5 GB of memory and was running the Linux 2.4.20 kernel. [1]

§ 2000-2009

With the goal of improving the power awareness of pipelined array multipliers, a FIR filter was designed. To achieve this, a two-dimensional pipeline-gating scheme was proposed. This technique gates the clock to registers in both the vertical direction (the data flow direction in the pipeline) and the horizontal direction (within each pipeline stage). [2] The design consisted of 16 registers, and the technology used in the synthesis process was 240nm CMOS logic. The designed FIR filter was able to work at a 1250MHz clock rate.

In 2003, the Athlon XP 3000+ microprocessor was introduced. Some of its features include a 32-bit data bus word width, 1024MB of memory, and 130nm CMOS technology. This microprocessor was also evaluated in the One Semi-Automated Forces Testbed (OTBSAF) in order to test for potential use in embedded simulation. [1] This microprocessor runs at a 2000MHz clock speed.

In 2009, the architecture for a MIPS processor capable of power reduction was presented. The processor was successfully designed in Verilog HDL, simulated with ModelSim, and synthesized onto a Xilinx Spartan-3E FPGA. [5] The design of this MIPS processor consisted of 1890 four-input LUTs and 397 flip-flops. A maximum frequency of 205.7MHz was used, and the results indicated an average power consumption of 1139mW for the modified pipeline compared to 1359mW for the normal pipeline. Therefore, this design achieved a power reduction of 220mW, or about 16%.

In 2009, Intel Corporation introduced the next-generation Xeon® EX microprocessor under the codename Nehalem-EX. This microprocessor featured 8 dual-threaded 64-bit cores inside a single chip. The processor has 2.3 billion transistors and is implemented in a 45nm CMOS technology using metal-gate high-K dielectric transistors that reduce the gate leakage by a factor of 25x for nMOS devices and 1000x for pMOS devices, compared to the 65nm process generation. [7]

§ 2010-present

In 2011, a 16-bit non-pipelined RISC processor was proposed. The processor consists of the following blocks: program counter, clock control unit, ALU, IDU, and registers. [6] This processor was used for signal processing applications. Some of its features are a 65012mm² die area, 90nm CMOS technology, and a 200MHz clock rate. This processor was designed to execute an instruction set with a total of 27 instructions, based on the user requirements.

The 48-core IA-32 processor was presented in a 45nm Hi-K CMOS process and utilized a 2D-mesh network and 4 DDR3 channels. [8] It was utilized for performance and power scaling purposes. It consisted of 1.3 billion transistors in a total die area of 567mm². This architecture design ran at 1000MHz and had a memory capacity of 18MB.

In 2014, a multicore embedded processor with a reconfigurable same-instruction multiple process (RSIMP) architecture design was presented. The main goal was to reduce the power consumption of the instruction memory (IM), thus reducing the total power of the processor. [9] This design runs at an 800MHz clock rate, contains 16 processors, has a memory capacity of 0.5MB, and utilizes 65nm CMOS technology.

III. DATA ANALYSIS

Metrics analyzed in this paper:
• CPU clock rate (MHz) vs. Year
• Memory Capacity (MB) vs. Year
• Number of Processors or Cores vs. Year
• Data bus Word Width (bits) vs. Year

[Figure 1 panels: CPU Clock Rate vs Year; Memory Capacity vs Year; Word Width vs Year]

Fig 1. Metrics from Table 1 compared on different years from the past decade and a half.

Three different metrics from Table 1 have been plotted using line or bar graphs. However, no particular trends have been observed, because each type of architecture presented was designed for a particular purpose.

IV. CONCLUSION

According to Moore's Law, the number of transistors per die doubles approximately every eighteen months to two years. By observing some of the metrics depicted in Table 1, one can conclude that the information provided agrees with Moore's Law. For some designs the process feature size was provided, and it is clear that this metric tends to decrease as the years go by. Also, CPU clock rate, memory capacity, and word width do not follow a linear trend over time, as observed in the plots provided.

V. REFERENCES

[1] H. A. Bahr and R. F. DeMara, "OTBSAF Scalability on Pentium III/4 and Athlon 64/XP3000 Architectures," in MSIAC Modeling and Simulation Journal, February 9, 2005, Vol. 6, No. 3, March 2005, pp. 1-4.
[2] J. Di, J. S. Yuan, and R. F. DeMara, "Improving Power-awareness of Pipelined Array Multipliers using 2-Dimensional Pipeline Gating and its Application to FIR Design," Integration, the VLSI Journal, Vol. 39, No. 2, March 2006, pp. 90-112.
[3] H. Bahr, R. F. DeMara, and M. Georgiopoulos, "Integer-Encoded Massively Parallel Processing of Fast-Learning ARTMAP Networks," in Proceedings of the 1997 SPIE AeroSense Symposium (AeroSense-97), pp. 678-689, Orlando, Florida, U.S.A., April 21-24, 1997.
[4] R. F. DeMara and D. I. Moldovan, "The SNAP-1 Parallel AI Prototype," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 8, August 1993, pp. 841-854.
[5] P. Gautham, R. Parthasarathy, and K. Balasubramanian, "Low-power pipelined MIPS processor design," in Proceedings of the 2009 12th International Symposium on Integrated Circuits (ISIC '09), pp. 462-465, 14-16 Dec. 2009.
[6] S. Sakthikumaran, S. Salivahanan, and V. S. K. Bhaaskaran, "16-Bit RISC processor design for convolution application," in 2011 International Conference on Recent Trends in Information Technology (ICRTIT), pp. 394-397, 3-5 June 2011.
[7] S. Rusu, Simon Tam, H. Muljono, J. Stinson, D. Ayers, Jonathan Chang, R. Varada, M. Ratta, S. Kottapalli, and S. Vora, "A 45 nm 8-Core Enterprise Xeon® Processor," IEEE Journal of Solid-State Circuits, vol. 45, no. 1, pp. 7-14, Jan. 2010.
[8] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. Van Der Wijngaart, "A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 173-183, Jan. 2011.
[9] Zheng Yu, Zhiyi Yu, Xueqiu Yu, Ningxi Liu, and Xiaoyang Zeng, "Low-Power Multicore Processor Design With Reconfigurable Same-Instruction Multiple Process," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 6, pp. 423-427, June 2014.

TABLE I. PROCESSOR ARCHITECTURES FROM 1990 TO PRESENT AND THEIR FEATURES

| Name of Architecture [reference] | Purpose: Application-Specific or General-Purpose Computation | Die Area, Number of Transistors, or Number of Chips/Boards/etc. | CPU Clock Rate (MHz) | Memory Capacity (MB) | Data Bus Word Width (bits) | Number of Cores or CPUs | Ideal Speedup for 99% Parallel Code (ignoring overheads) |
|---|---|---|---|---|---|---|---|
| SNAP-1 Parallel AI Prototype [4] | NLU: Special Purpose | 144 DSP chips on 8 large circuit boards | 25 | 256KB/CPU × 144 CPUs = 36MB | 32 | 144 single-core CPUs = 144 cores | Told/Tnew = 1/[0.01 + (0.99/144)] = 59.26-fold |
| DECmpp/Sx-1208 [3] | Implementation of Fuzzy ARTMAP networks | 8192 processors | N/A | 128KB/CPU × 8192 CPUs = 1024MB | 32 | 8192 processors | Told/Tnew = 1/[0.01 + (0.99/8192)] = 98.8-fold |
| Pentium III [1] | OTBSAF Scalability | 130nm-250nm CMOS technology | 1000 | 512MB | 32 | 1 | - |
| FIR filter design [2] | Improve Power Awareness of Pipelined Array Multipliers | 16 registers, 240nm static CMOS logic | 1250 | N/A | 16 | N/A | - |
| XP 3000+ [1] | OTBSAF Scalability | 130nm CMOS technology | 2000 | 1024MB | 32 | 1 | - |
| MIPS processor [5] | Power Reduction | 1890 four-input LUTs and 397 flip-flops | 205.7 | 256B + 1024B + 32B = 1312B ≈ 0.0013MB | 32 | 1 | - |
| Nehalem-EX [7] | | 45nm CMOS technology, 2.3 billion transistors | N/A | 192MB | 64 | 8 processors | Told/Tnew = 1/[0.01 + (0.99/8)] = 7.48-fold |
| RISC processor design [6] | Signal Processing Applications | 65012mm² die area, 90nm CMOS technology | 200 | N/A | 16 | 1 | - |
| 48-core IA-32 processor [8] | Performance and Power Scaling | 567mm² die area, 45nm CMOS technology, 1.3 billion transistors | 1000 | 384KB/CPU × 48 CPUs = 18MB | 32 | 48 cores | Told/Tnew = 1/[0.01 + (0.99/48)] = 32.65-fold |
| RSIMP [9] | Reduce Power Consumption of IM | 65nm CMOS technology | 800 | 32KB/CPU × 16 CPUs = 0.5MB | N/A | 16 processors | Told/Tnew = 1/[0.01 + (0.99/16)] = 13.9-fold |
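The ideal speedup values in the last column of Table I follow from Amdahl's Law with f = 0.99 and N set to each design's core count. The following is a minimal Python sketch that recomputes them; the function name `amdahl_speedup`, the dictionary `CORE_COUNTS`, and the helper `table_speedups` are illustrative names introduced here and are not part of the original report.

```python
def amdahl_speedup(parallel_fraction: float, num_processors: int) -> float:
    """Ideal speedup 1 / ((1 - f) + f / N), ignoring scheduling overhead."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


# Core counts as listed in Table I for the multi-processor designs (illustrative container).
CORE_COUNTS = {
    "SNAP-1 Parallel AI Prototype": 144,
    "DECmpp/Sx-1208": 8192,
    "Nehalem-EX": 8,
    "48-core IA-32 processor": 48,
    "RSIMP": 16,
}


def table_speedups(parallel_fraction: float = 0.99) -> dict:
    """Recompute the ideal speedup for every multi-processor design in Table I."""
    return {
        name: amdahl_speedup(parallel_fraction, cores)
        for name, cores in CORE_COUNTS.items()
    }


if __name__ == "__main__":
    for name, speedup in table_speedups().items():
        print(f"{name}: {speedup:.2f}-fold")
```

With f = 0.99 the serial fraction is 0.01, so the speedup can never exceed 1/0.01 = 100-fold no matter how many processors are added. This is why the 8192-processor DECmpp/Sx-1208 reaches only about 98.8-fold.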