1. Computer Abstractions and Technology

Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3
Emil Sekerinski, McMaster University, Fall Term 2015/16
Classes of Computers
Personal Computers
•
General purpose, variety of (third-party) software
•
Good performance at reasonable cost
Server Computers
•
Accessed via a network
•
Large workloads, single (science or engineering) application or many small jobs (web
server), customized applications (database, simulation)
•
Range from small (“mini”) servers to supercomputers (e.g. IBM BlueGene/Q) with
terabytes of main memory
Embedded Computers
•
Hidden as components of systems, e.g. car, TV, router
•
Cost and power consumption critical for required performance
Personal Mobile Device (PMD)
•
PostPC Era
•
User interface, power consumption, network, cost critical: phones, tablets, glasses, bands
Cloud Computing
•
Warehouse scale computers with 100,000 servers, geographically distributed
•
Running Software as a Service (SaaS), with portion on PMD and portion in the Cloud
Five Main Classes of Computers
(as of 2012)
Question: What is the largest class (most manufactured)?
A. Personal mobile device
B. Desktop
C. Server
D. Clusters/warehouse-scale computer
E. Embedded
Manufactured Units (as of 2010)
Personal mobile device: 1.8 billion PMDs (90% phones)
Desktop: 350 million
Server: 20 million
Embedded: 19 billion (6.1 billion ARM-based chips)
What You Will (Not) Learn
•
How programs are translated into the machine language and how the hardware
executes them
•
The hardware/software interface
•
What determines program performance and how it can be improved
•
How hardware designers improve performance
•
Techniques hardware designers use to improve energy efficiency and how
programmers can support that
•
Why there is a shift from sequential to parallel (“multi-core”) processing and
what consequences it has for programmers
Eight “Great Ideas” in Computer Architecture
•
Design for Moore’s Law
•
Use abstraction to simplify design
•
Make the common case fast
•
Performance via parallelism
•
Performance via pipelining
•
Performance via prediction
•
Hierarchy of memories
•
Dependability via redundancy
Levels of Program Code
High-level language
•
Level of abstraction closer to problem
domain
•
Provides for productivity and portability
Assembly language
•
Textual notation of instructions
•
Directly represents hardware
Machine language
•
Binary digits (bits)
•
Encoded instructions and data
Instruction Set Architecture
Instruction set architecture (ISA)
•
is the hardware/software interface
•
the specification hardware designers implement
Application binary interface (ABI)
•
is the ISA plus system software interface
•
application programmers work with the ABI
Question: Which of the following is true for ISAs in general?
A. Many models of processors can support one ISA.
B. An ISA is unique to one model of processor.
C. Every processor supports multiple ISAs.
D. Each processor manufacturer has its own unique ISA.
E. None of the above.
Components of a Computer
The five classic components are:
•
input
•
output
•
memory
•
datapath
•
control
Datapath and control together form the processor.
Inside a Computer
Capacitive multitouch LCD screen
Computer board
Inside a Processor
The processor integrated
circuit inside the A5 package:
Defining Performance
Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph]
If we define performance as top cruising speed, Concorde is the fastest.
If we want to transport 450 passengers, 747 has the highest throughput.
Two Questions: Response Time vs Throughput
Response time (execution time): total time required to complete a task
Throughput (bandwidth): total work (number of tasks) done per unit time
1. Consider replacing a processor with a faster one. Does this:
A. increase throughput,
B. decrease response time,
C. both?
2. Consider adding additional processors to a system that uses multiple processors
for separate tasks (e.g. serving http requests). Does this:
A. increase throughput,
B. decrease response time, or
C. both?
Measuring Execution Time
Elapsed time (wall clock time, response time):
•
Total response time, including disk access, I/O (e.g. network), OS overhead
•
Determines system performance
CPU time
•
Time spent processing a given task
•
Discounts I/O time, other jobs’ shares
•
Comprises user CPU time and system CPU time
Hence we refer to system performance and CPU performance
CPU Clocking
Operation of digital hardware governed by a constant-rate clock:
[Figure: clock waveform marking the clock period; in each clock cycle, data transfer and computation are followed by a state update]
Clock period: duration of a clock cycle
•
e.g., $1000\,\text{ps} = 1\,\text{ns} = 0.001\,\mu\text{s} = 10^{-6}\,\text{ms} = 10^{-9}\,\text{s}$
Clock frequency (rate): cycles per second
•
e.g. $1\,\text{GHz} = 1000\,\text{MHz} = 10^{6}\,\text{kHz} = 10^{9}\,\text{Hz}$
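Clock period and clock frequency are reciprocals of each other. A minimal Python sketch of the conversion follows; the function names are chosen here for illustration and do not come from the course material.

```python
def clock_period_s(frequency_hz: float) -> float:
    """Return the clock period in seconds for a clock frequency in hertz."""
    return 1.0 / frequency_hz


def clock_frequency_hz(period_s: float) -> float:
    """Return the clock frequency in hertz for a clock period in seconds."""
    return 1.0 / period_s


if __name__ == "__main__":
    period = clock_period_s(1e9)  # a 1 GHz clock
    # Convert seconds to picoseconds for display (1 s = 10^12 ps).
    print(f"1 GHz clock period: {period * 1e12:.0f} ps")
```

Running the script reports the 1 GHz period in picoseconds, which matches the 1000 ps example above.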
CPU performance for a given program:
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
How to improve CPU Time?
Instruction Count and Cycles Per Instruction (CPI)
Instruction count for a program
•
Determined by program, ISA and compiler
Average cycles per instruction
•
Determined by CPU hardware
•
If different instructions have different CPI
•
Average CPI affected by instruction mix
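$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$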
CPI Example
•
Computer A: Cycle Time = 250ps, CPI = 2.0
•
Computer B: Cycle Time = 500ps, CPI = 1.2
Assuming same ISA (same programs), which is faster, and by how much?
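$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
A is faster, by this much:
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$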
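The comparison can be checked numerically. The sketch below uses the cycle times and CPIs given for Computers A and B; because both run the same program, the instruction count I cancels, so the code works with time per instruction. The helper name is an assumption made for this illustration.

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction in picoseconds: CPI x clock cycle time."""
    return cpi * cycle_time_ps


# Values from the example: A has CPI 2.0 at 250 ps, B has CPI 1.2 at 500 ps.
time_a = time_per_instruction_ps(cpi=2.0, cycle_time_ps=250.0)
time_b = time_per_instruction_ps(cpi=1.2, cycle_time_ps=500.0)

# The same instruction count I multiplies both times, so it cancels in the ratio.
ratio = time_b / time_a

if __name__ == "__main__":
    print(f"A: I x {time_a:.0f} ps, B: I x {time_b:.0f} ps")
    print(f"CPU Time B / CPU Time A = {ratio:.2f}")
```

Although B needs fewer cycles per instruction, its longer cycle time outweighs that advantage.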
Factors influencing CPU Performance
In general, different instructions can take a different number of cycles. In that case,
the weighted average of the CPIs, weighted by instruction frequency, has to be taken.
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
CPU performance depends on
•
Algorithm: affects IC, possibly CPI
•
Programming language: affects IC, CPI
•
Compiler: affects IC, CPI
•
Instruction set architecture: affects IC, CPI, clock cycle time
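To illustrate the weighted average of CPIs, the following sketch computes the average CPI and CPU time for an instruction mix. The three instruction classes, their counts, and the 1 GHz clock rate are hypothetical values chosen for the example, not data from the course.

```python
def total_cycles(mix):
    """Total clock cycles for a mix given as (instruction_count, cpi) pairs."""
    return sum(count * cpi for count, cpi in mix)


def average_cpi(mix):
    """Weighted average CPI: total cycles divided by total instruction count."""
    instructions = sum(count for count, _ in mix)
    return total_cycles(mix) / instructions


def cpu_time_s(mix, clock_rate_hz):
    """CPU time in seconds: total clock cycles divided by the clock rate."""
    return total_cycles(mix) / clock_rate_hz


# Hypothetical mix: one (instruction_count, cpi) pair per instruction class.
hypothetical_mix = [(2_000_000, 1), (1_000_000, 2), (2_000_000, 3)]

if __name__ == "__main__":
    print(f"Average CPI: {average_cpi(hypothetical_mix):.2f}")
    print(f"CPU time at 1 GHz: {cpu_time_s(hypothetical_mix, 1e9) * 1e3:.1f} ms")
```

Changing the counts while keeping the per-class CPIs fixed shows how the instruction mix alone shifts the average CPI.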
Power Trends
In CMOS (Complementary Metal Oxide Semiconductor) technology:
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
Trend: power ×30, voltage 5V → 1V, frequency ×1000
Power consumption is at its practical maximum (battery, cooling); voltage cannot be reliably
decreased further; capacitive load depends on “fanout” and technology: the Power Wall
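The power relation can be explored with a small calculation. The sketch below assumes that the annotations ×30, 5V → 1V, and ×1000 describe the change in power, voltage, and frequency respectively, and it additionally assumes an unchanged capacitive load, which is a simplification made only for this illustration.

```python
def relative_power(load_factor: float, voltage_factor: float, frequency_factor: float) -> float:
    """Relative change in dynamic power: load x voltage^2 x frequency, all given as ratios."""
    return load_factor * voltage_factor ** 2 * frequency_factor


# Voltage drops from 5 V to 1 V while frequency grows by a factor of 1000.
# Assumption: capacitive load stays the same (load_factor = 1.0).
growth = relative_power(load_factor=1.0, voltage_factor=1.0 / 5.0, frequency_factor=1000.0)

if __name__ == "__main__":
    print(f"Power growth with unchanged capacitive load: about x{growth:.0f}")
```

Under these assumptions the quadratic voltage term absorbs most of the frequency increase, giving a growth of about ×40, the same order of magnitude as the ×30 noted above. Once voltage cannot drop further, any frequency increase translates directly into more power.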
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
Multiprocessors
Multicore microprocessors
•
More than one processor per chip
•
Requires explicitly parallel programming
Compare with instruction-level parallelism
•
Hardware executes multiple instructions at once
•
Hidden from the programmer
Multicore programming is hard
•
Programming for performance
•
Load balancing
•
Optimizing communication and synchronization
SPEC CPU Benchmark
Programs used to measure performance
•
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
•
Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
•
Elapsed time to execute a selection of programs
•
Negligible I/O, so focuses on CPU performance
•
Normalize relative to reference machine
•
Summarize as geometric mean of performance ratios
•
CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
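The geometric mean of performance ratios can be sketched in Python as follows. The ratios in the usage example are hypothetical and are not SPEC results; the function computes the n-th root of the product via logarithms, which is equivalent for positive ratios and avoids overflow for long lists.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios: the n-th root of their product."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # exp(mean(log r)) equals (r_1 * ... * r_n) ** (1/n) for positive r_i.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    hypothetical_ratios = [12.0, 20.0, 31.5, 9.8]  # made-up performance ratios
    print(f"Geometric mean: {geometric_mean(hypothetical_ratios):.2f}")
```

One property that makes the geometric mean suitable for normalized results is that the ratio of two machines' geometric means does not depend on which reference machine was used for normalization.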
CINT2006 for Intel Core i7 920
SPEC Power Benchmark
Power consumption of server at different workload levels
•
Performance: ssj_ops/sec
•
Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
SPECpower_ssj2008 for Xeon X5650:
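As a sketch of how the overall ssj_ops per Watt metric is formed, the code below sums performance and power over eleven workload levels (100% down to 0% in steps of 10%) and divides the sums. The numbers are hypothetical placeholders generated by a simple linear model; they are not the Xeon X5650 measurements.

```python
def overall_ssj_ops_per_watt(ssj_ops, power_watts):
    """Sum of ssj_ops over all load levels divided by the sum of power over the same levels."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("one power reading is required per load level")
    return sum(ssj_ops) / sum(power_watts)


# Hypothetical readings for load levels 100%, 90%, ..., 0% (k = 10 down to 0).
hypothetical_ops = [100_000 * k for k in range(10, -1, -1)]
hypothetical_watts = [100 + 15 * k for k in range(10, -1, -1)]

if __name__ == "__main__":
    result = overall_ssj_ops_per_watt(hypothetical_ops, hypothetical_watts)
    print(f"Overall ssj_ops per Watt: {result:.0f}")
```

Because the idle level contributes power but no operations, a machine that draws a lot of power when idle is penalized by this metric.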
Fallacy: Low Power at Idle
Look back at the power benchmark (SPECpower_ssj2008 for Xeon X5650)
•
At 100% load: 258W
•
At 50% load: 170W (66%)
•
At 10% load: 121W (47%)
Google data center
•
Mostly operates at 10% – 50% load
•
At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
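The percentages above, and the gap to power that is proportional to load, can be reproduced with a short calculation. The wattages are the ones quoted on this slide; the comparison with an ideal energy-proportional machine is an illustration added here, with the ideal defined as peak power scaled by the load.

```python
# Measured power in watts at each load level in percent (values from this slide).
measured_watts = {100: 258.0, 50: 170.0, 10: 121.0}
peak_watts = measured_watts[100]


def fraction_of_peak(watts: float) -> float:
    """Measured power as a fraction of the power drawn at 100% load."""
    return watts / peak_watts


def proportional_watts(load_percent: float) -> float:
    """Power an ideal energy-proportional machine would draw at the given load."""
    return peak_watts * load_percent / 100.0


if __name__ == "__main__":
    for load, watts in measured_watts.items():
        print(
            f"{load:3d}% load: {watts:.0f} W "
            f"({fraction_of_peak(watts):.0%} of peak), "
            f"proportional ideal {proportional_watts(load):.1f} W"
        )
```

At 10% load the measured machine still draws close to half of its peak power, far more than the ideal, which is why the load range where data centers mostly operate matters so much.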