Insight To Processor Architecture and Design
Metrics, Analysis, and Comparison Between Ten
Designs
Daniel Rosabal
Department of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816-2362
Abstract— This paper focuses on different types of processor
architectures. It begins by describing the five classic components of a
processor. Then it explains how the address bus and data bus interact
to make a connection between memory and processor possible. It
continues by stating the metrics that will be analyzed such as CPU
clock rate, memory capacity, and word width, together with Moore’s
Law. A brief explanation of processor performance is provided along
with Amdahl’s Law and the concept of parallelism. Finally, ten
processor architectures are listed, each described in its own
paragraph.
Keywords—Amdahl’s Law, Clock rate, CMOS technology, CPI,
Memory Capacity, Moore’s Law, Word width
I. OVERVIEW OF PROCESSOR ARCHITECTURE
1) Classic Components: A computer system consists of the
following components: I/O devices, memory, and the CPU.
I/O stands for input/output; examples of such devices are
a keyboard or mouse for input and a printer or display for
output. CPU stands for central processing unit, also known
as the processor, and it consists of a control unit and an
arithmetic logic unit (ALU). The control unit decodes the
instruction bits to direct instruction execution, and the
ALU performs arithmetic and logic operations on the data
bits. Memory is in charge of storing these bits.
2) Processor Buses and Bit Width: Two buses connect the
processor and memory: the data bus and the address bus.
Together they make interaction between the processor and
memory possible. Memory stores data as bits at different
addresses, and the processor accesses this data by placing
the address of the desired data on the address bus. Once
the memory receives the address, it returns the data stored
at that address via the data bus.
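The read sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Memory` class, its method names, and the 8-bit address and 16-bit data widths are hypothetical and do not describe any processor reviewed in this paper.

```python
class Memory:
    """Toy memory reached through an address bus and a data bus (illustrative only)."""

    def __init__(self, address_bits: int, data_bits: int) -> None:
        self.address_bits = address_bits
        self.data_bits = data_bits
        # Hypothetical backing store: one word per address, all zero at start.
        self.cells = [0] * (2 ** address_bits)

    def _check_address(self, address: int) -> None:
        # An address is valid only if it can be expressed on the address bus.
        if not 0 <= address < len(self.cells):
            raise ValueError("address does not fit on the address bus")

    def write(self, address: int, value: int) -> None:
        self._check_address(address)
        # Keep only the bits that fit on the data bus.
        self.cells[address] = value & ((1 << self.data_bits) - 1)

    def read(self, address: int) -> int:
        # The processor supplies the address; the memory returns the stored word.
        self._check_address(address)
        return self.cells[address]


memory = Memory(address_bits=8, data_bits=16)
memory.write(0x2A, 0x1234)
word = memory.read(0x2A)
print(hex(word))
```

In this sketch the address width fixes how many locations can be selected (2^8 = 256 here), while the data width fixes how many bits are transferred per access.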
3) Metrics Studied: The CPU clock rate, also known as
CPU clock speed, is a measure of how many clock cycles
a CPU performs per second. Clock rate is measured in
hertz (Hz), kilohertz (kHz), megahertz (MHz), or
gigahertz (GHz). The difference between these four units
is simply a multiplication factor: 1kHz = 10^3 Hz,
1MHz = 10^6 Hz, and 1GHz = 10^9 Hz. The unit used to
measure memory capacity is the byte. However, typical
measures of memory capacity are kilobytes, megabytes, or
gigabytes. As with kHz, MHz, and GHz, the only
difference between these units is the multiplication
factor: 1KB = 2^10 bytes = 1024 bytes, 1MB = 2^20 bytes,
and 1GB = 2^30 bytes. Another unit related to memory is
the bit, the smallest unit of information: 1 byte = 8 bits.
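The unit factors above can be written down as a small conversion sketch. The constant and function names are assumptions chosen for illustration; the two sample values (a 1 GHz clock and 16 memories of 32 KB each) are figures that appear elsewhere in this paper.

```python
# Decimal factors for clock rate and binary factors for memory capacity.
KHZ, MHZ, GHZ = 10**3, 10**6, 10**9
KB, MB, GB = 2**10, 2**20, 2**30
BITS_PER_BYTE = 8


def hz_to_mhz(hertz: float) -> float:
    """Convert a clock rate in Hz to MHz."""
    return hertz / MHZ


def bytes_to_mb(num_bytes: float) -> float:
    """Convert a capacity in bytes to MB, with 1 MB = 2**20 bytes."""
    return num_bytes / MB


clock_mhz = hz_to_mhz(1 * GHZ)          # a 1 GHz clock expressed in MHz
total_mb = bytes_to_mb(16 * 32 * KB)    # 16 memories of 32 KB each, in MB
bits_in_kb = 1 * KB * BITS_PER_BYTE     # number of bits in one kilobyte

print(clock_mhz, total_mb, bits_in_kb)
```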
4) Significance: According to Moore’s Law, the number of
transistors per die doubles approximately every eighteen
months to two years.
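As a rough illustration of what this doubling rate implies, the sketch below computes the growth factor over a given span. The function name and the six-year span are hypothetical choices for illustration; only the eighteen-month and two-year doubling periods come from the statement above.

```python
def transistor_growth(years: float, doubling_period_years: float) -> float:
    """Growth factor in transistors per die after `years`, assuming steady doubling."""
    return 2.0 ** (years / doubling_period_years)


# Compare the two ends of the stated range over a hypothetical six-year span.
fast = transistor_growth(6.0, 1.5)   # doubling every eighteen months
slow = transistor_growth(6.0, 2.0)   # doubling every two years
print(fast, slow)
```

Over the same span the two doubling periods give noticeably different growth factors, which is why the law is usually quoted as a range.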
5) Processor Performance Equation: The Processor
Performance Equation is the following:
\[ \text{Execution Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \]
The instruction count refers to the dynamic count of
instructions executed at runtime, not the number of
instructions in the assembly code. The CPI (cycles per
instruction) refers to the average number of clock cycles
required per instruction in the dynamic instruction count.
Finally, the clock rate refers to the number of clock
cycles performed per second. Multiplying the dynamic
instruction count by the CPI and dividing by the clock rate
gives the execution time.
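The equation can be checked with a short sketch. The workload figures below (two million instructions, a CPI of 1.5, a 1 GHz clock) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def execution_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds: dynamic instruction count x CPI / clock rate."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: 2,000,000 instructions at 1.5 cycles each on a 1 GHz clock.
seconds = execution_time(2_000_000, 1.5, 1_000_000_000)
print(f"{seconds * 1000:.1f} ms")
```

Halving the CPI or doubling the clock rate halves the execution time in this model, provided the other two factors stay fixed.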
6) Parallelism: Parallelism is utilized, ideally, to make a
program run faster. The concept of parallelism consists of
simultaneously using more than one processor, or CPU, to
execute a program. Amdahl’s Law calculates the
maximum speedup a program can achieve when parallel
cores are utilized with the following equation:
\[ \text{Speedup} = \frac{\text{Execution Time on 1 Serial Processor}}{\text{Execution Time on } N \text{ Parallel Processors}} = \frac{1}{(1-f) + \frac{f}{N}}, \]
where f is the fraction of code that is infinitely
parallelizable with no scheduling overhead, (1-f) is
the fraction of code that is inherently serial, and N is
the number of parallel processors used.
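A minimal sketch of Amdahl’s Law as stated above follows. The function name is an assumption; the processor counts are those of the multiprocessor designs listed in Table I, and f = 0.99 matches the 99% parallel code assumed in that table.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup 1 / ((1 - f) + f / N), ignoring scheduling overhead."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("at least one processor is required")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Processor counts taken from Table I, with f = 0.99.
for n in (8, 16, 48, 144, 8192):
    print(n, round(amdahl_speedup(0.99, n), 2))
```

As N grows, the speedup approaches 1/(1-f), which is 100 for f = 0.99, so even 8,192 processors cannot exceed a 100-fold ideal speedup on such code.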
Section II reviews ten processor architectures spanning the
1990s to the present. Section III contains a table with some
of the features of each processor from the previous section.
Based on this table, several plots were created to compare the
processors and to identify trends that have occurred over the
past two and a half decades. Section IV provides a brief
conclusion analyzing the results of this report. Finally,
Section V lists the reports that were used to obtain some of
the information needed to complete Table I.
II. LITERATURE REVIEW
• 1990s
In 1992, the SNAP-1 Parallel AI Prototype was developed at
the University of Southern California [4]. It contained 144
CPUs, each a Texas Instruments TMS320C30 32-bit DSP chip
clocked at 25 MHz with 256KB of memory, organized onto 8
large circuit boards. The SNAP-1 ISA had 20 instructions for
the special-purpose computation of marker-passing. It bridged
the semantic gap by providing complex, powerful instructions
close to the needs of Natural Language Understanding (NLU)
applications.
The DECmpp/Sx-1208 parallel processor was utilized for the
implementation of Fuzzy ARTMAP. This processor consists
of a DEC RISC Workstation Front-End (FE) and a MasPar
MP-1 Back-End (BE). The RISC Workstation is a 32-bit
processor with 128KB of data memory that is necessary to
control and transfer data to and from the processor elements.
The back-end machine, as described by MasPar, is a SIMD
massively parallel machine consisting of 512 4x4 clusters of
processor elements (PEs) arranged in a 16 x 32 cluster array
[3]; it therefore contains 8,192 processor elements. After
implementation, the design was tested on the Letters
benchmark developed by Frey and Slate.
In 1999, Intel launched the Pentium III microprocessor, whose
feature size ranged from 130nm to 250nm. This single-core
microprocessor, with a memory capacity of 512MB and a 32-bit
word width, was evaluated in a benchmarking environment
known as One Semi-Automated Forces Testbed (OTBSAF).
The purpose of this evaluation was to test various
microprocessors for potential use in embedded simulation.
The host, “bahrd”, was a Dell Inspiron 8000 laptop with a
Pentium III processor running at 1GHz. It had 0.5GB of
memory and ran the Linux 2.4.20 kernel [1].
• 2000-2009
A FIR filter was designed with the goal of improving the power
awareness of pipelined array multipliers. To achieve this, a
two-dimensional pipeline-gating scheme was proposed. This
technique gates the clock to the registers in both the vertical
direction (the data flow direction in the pipeline) and the
horizontal direction (within each pipeline stage) [2]. The
design consisted of 16 registers and was synthesized in 240nm
CMOS logic. The designed FIR filter was able to operate at a
1250MHz clock rate.
In 2003, the Athlon XP 3000+ microprocessor was
introduced. Its features include a 32-bit data bus word width,
1024MB of memory, and 130nm CMOS technology.
This microprocessor was also evaluated in the One
Semi-Automated Forces Testbed (OTBSAF) to test its potential
use in embedded simulation [1]. It runs at a 2000MHz clock
speed.
In 2009, the architecture for a MIPS processor capable of
power reduction was presented. The processor was
successfully designed in Verilog HDL, simulated with
ModelSim, and synthesized onto a Xilinx Spartan-3E FPGA
[5]. The design of this MIPS processor consisted of 1890
four-input LUTs and 397 flip-flops. A maximum frequency of
205.7MHz was reported, and the results indicated an average
power consumption of 1139mW for the modified pipeline
compared to 1359mW for the normal pipeline, a reduction of
220mW (about 16%). Therefore, power reduction was achieved
with this design.
In 2009, Intel Corporation introduced the Xeon® EX
next-generation microprocessor under the codename Nehalem-EX.
This microprocessor featured 8 dual-threaded 64-bit cores
on a single chip. The processor has 2.3 billion transistors and
is implemented in a 45nm CMOS technology using metal-gate
high-K dielectric transistors that reduce the gate leakage by a
factor of 25x for nMOS devices and 1000x for pMOS devices,
compared to the 65nm process generation [7].
• 2010-present
In 2011, a 16-bit non-pipelined RISC processor was proposed
for signal processing applications. The processor consists of
the following blocks: program counter, clock control unit,
ALU, IDU, and registers [6]. Its features include a 65012mm²
die area and 90nm CMOS technology, and it ran at 200MHz.
This processor was designed to execute an instruction set with
a total of 27 instructions, based on the user requirements.
The 48-core IA-32 processor was implemented in a 45nm Hi-K
CMOS process and utilized a 2D-mesh network and 4 DDR3
channels [8]. It was designed for performance and power
scaling. It consisted of 1.3 billion transistors on a total die
area of 567mm². This design ran at 1000MHz and had a
memory capacity of 18MB.
In 2014, a multicore embedded processor with a reconfigurable
same-instruction multiple process (RSIMP) architecture
was presented. The main goal was to reduce the power
consumption of the instruction memory (IM), thus reducing the
total power of the processor [9]. This design runs at an 800MHz
clock rate, contains 16 processors, has a memory capacity of
0.5MB, and utilizes 65nm CMOS technology.
III. DATA ANALYSIS
Metrics analyzed in this paper:
• CPU clock rate (MHz) vs. Year
• Memory Capacity (MB) vs. Year
• Number of Processors or Cores vs. Year
• Data bus Word Width (bits) vs. Year
[Figure 1 panels: CPU Clock Rate vs. Year; Memory Capacity vs. Year; Word Width vs. Year]
Fig 1. Metrics from Table I compared across years from 1992 to 2014.
Three different metrics from Table I have been plotted
using line or bar graphs. However, no particular trends
have been observed because each architecture presented
was designed for a particular purpose.
IV. CONCLUSION
According to Moore’s Law, the number of transistors per
die doubles approximately every eighteen months to two years.
By observing some of the metrics listed in Table I, one can
conclude that the information provided is consistent with
Moore’s Law. For some designs the process feature size was
provided, and it is clear that this metric tends to decrease as
the years go by. Also, CPU clock rate, memory capacity, and
word width do not vary linearly with year, as observed in the
plots provided.
V. REFERENCES
[1] H. A. Bahr and R. F. DeMara, "OTBSAF Scalability on Pentium III/4
and Athlon 64/XP3000 Architectures," in MSIAC Modeling and
Simulation Journal, February 9, 2005, Vol. 6, No. 3, March 2005,
pp. 1-4.
[2] J. Di, J. S. Yuan, and R. F. DeMara, "Improving Power-awareness of
Pipelined Array Multipliers using 2-Dimensional Pipeline Gating and its
Application to FIR Design," Integration, the VLSI Journal, Vol. 39,
No. 2, March 2006, pp. 90-112.
[3] H. Bahr, R. F. DeMara, and M. Georgiopoulos, "Integer-Encoded
Massively Parallel Processing of Fast-Learning ARTMAP Networks," in
Proceedings of the 1997 SPIE AeroSense Symposium (AeroSense-97),
pp. 678-689, Orlando, Florida, U.S.A., April 21-24, 1997.
[4] R. F. DeMara and D. I. Moldovan, "The SNAP-1 Parallel AI Prototype,"
IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 8,
August 1993, pp. 841-854.
[5] P. Gautham, R. Parthasarathy, and K. Balasubramanian, "Low-power
pipelined MIPS processor design," in Proceedings of the 2009 12th
International Symposium on Integrated Circuits (ISIC '09),
pp. 462-465, December 14-16, 2009.
[6] S. Sakthikumaran, S. Salivahanan, and V. S. K. Bhaaskaran, "16-Bit RISC
processor design for convolution application," in 2011 International
Conference on Recent Trends in Information Technology (ICRTIT),
pp. 394-397, June 3-5, 2011.
[7] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang,
R. Varada, M. Ratta, S. Kottapalli, and S. Vora, "A 45 nm 8-Core
Enterprise Xeon® Processor," IEEE Journal of Solid-State Circuits,
Vol. 45, No. 1, January 2010, pp. 7-14.
[8] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain,
V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege,
T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. Van Der
Wijngaart, "A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die
Message-Passing and DVFS for Performance and Power Scaling," IEEE
Journal of Solid-State Circuits, Vol. 46, No. 1, January 2011,
pp. 173-183.
[9] Zheng Yu, Zhiyi Yu, Xueqiu Yu, Ningxi Liu, and Xiaoyang Zeng,
"Low-Power Multicore Processor Design With Reconfigurable
Same-Instruction Multiple Process," IEEE Transactions on Circuits and
Systems II: Express Briefs, Vol. 61, No. 6, June 2014, pp. 423-427.
TABLE I.
PROCESSOR ARCHITECTURES FROM 1990 TO PRESENT AND THEIR FEATURES
| Name of Architecture [reference] | Purpose: Application-Specific or General-purpose Computation | Die Area, Number of Transistors, or Number of Chips/Boards/etc. | CPU Clock Rate (MHz) | Memory Capacity (MB) | Data Bus Word Width (bits) | Number of Cores or CPUs | Ideal Speedup for 99% parallel code (ignoring overheads) |
|---|---|---|---|---|---|---|---|
| SNAP-1 Parallel AI Prototype [4] | NLU: Special Purpose | 144 DSP chips on 8 large circuit boards | 25 | 256KB/CPU * 144 CPUs = 36MB | 32 | 144 single-core CPUs = 144 cores | Told/Tnew = 1/[0.01+(0.99/144)] = 59.26-fold |
| DECmpp/Sx-1208 [3] | Implementation of Fuzzy ARTMAP networks | 8192 processors | N/A | 128KB/CPU * 8192 CPUs = 1024MB | 32 | 8192 processors | Told/Tnew = 1/[0.01+(0.99/8192)] = 98.8-fold |
| Pentium III [1] | OTBSAF Scalability | 130nm-250nm CMOS technology | 1000 | 512MB | 32 | 1 | - |
| FIR filter design [2] | Improve Power Awareness of Pipelined Array Multipliers | 16 registers, 240nm static CMOS logic | 1250 | N/A | 16 | N/A | - |
| XP 3000+ [1] | OTBSAF Scalability | 130nm CMOS technology | 2000 | 1024MB | 32 | 1 | - |
| MIPS processor [5] | Power Reduction | 1890 four-input LUTs and 397 flip-flops | 205.7 | 256B + 1024B + 32B = 1312B, about 0.00125MB | 32 | 1 | - |
| Nehalem-EX [7] | | 45nm CMOS technology, 2.3 billion transistors | N/A | 192MB | 64 | 8 processors | Told/Tnew = 1/[0.01+(0.99/8)] = 7.48-fold |
| RISC processor design [6] | Signal Processing Applications | 65012mm² die area, 90nm CMOS technology | 200 | N/A | 16 | 1 | - |
| 48-core IA-32 processor [8] | Performance and Power Scaling | 567mm² die area, 45nm CMOS technology, 1.3 billion transistors | 1000 | 384KB/CPU * 48 CPUs = 18MB | 32 | 48 cores | Told/Tnew = 1/[0.01+(0.99/48)] = 32.65-fold |
| RSIMP [9] | Reduce Power Consumption of IM | 65nm CMOS technology | 800 | 32KB/CPU * 16 CPUs = 0.5MB | N/A | 16 processors | Told/Tnew = 1/[0.01+(0.99/16)] = 13.9-fold |