Simultaneous Multi-Threading

ECE 485/585
Microprocessors
Chapter 10
Threads and Multiprocessing,
Using the Intel® Core™ i7 as Sample
Herbert G. Mayer, PSU
Status 11/29/2016
1
Syllabus

 Introduction
 Definitions
 Threads & Hyper-Threads
 Hyper-Threading Anomalies
 Appendix
 Bibliography
2
Introduction
 We’ll study threads and hyper-threads
 And analyze the Intel Core i7, a sample hyper-threaded MP μP
 Thread here is understood as a process subset; see Definitions below
 Key decision for the Electrical Engineer: when designing a CPU, should it incorporate Hyper-Threads in silicon?
 Or is it better to use that silicon space for another complete μP?
3
Definitions
(Logical order, not alphabetical)
4
Definitions
Core
 Core: synonymous with a full silicon μP; includes instructions, registers, condition codes
 If a core also includes the ALU, it is a full μP
 If not, it is a hyper-threaded core, missing the ALU plus a few other non-essential modules
 Actually, hyper-thread (super thread) is a misnomer: it should be named hypo-thread (sub thread)
 A sample hyper-threaded product is the Intel Core i7
 When a processor has multiple cores – the i7 has 4 – the total number of μP for that product is 8: 4 of them full μPs, and 4 partial μPs, the silicon hyper-threads
5
Definitions
Concurrent
 Let process p be composed of 2 threads, t1 and t2
 Let t1 and t2 have no dependency on one another
 Then t1 and t2 are executable concurrently; they may, but do not have to, run at the same time
 Yet they may execute simultaneously: if 2 real μP are available, true parallel execution is possible
 If we do not know or care whether this happens on 1 or more μP, we call this concurrent execution
 Concurrent doesn’t mean running at the same time; it only means it is logically OK to run at the same time
6
Definitions
Parallel
 Let process p be composed of 2 threads, t1 and t2
 Let t1 and t2 have no dependence on one another
 Then t1 and t2 can be executed at the same time
 If we know that this execution in fact happens on 2
(or more) μP, we call this parallel execution
 Synonym: simultaneous execution
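A minimal sketch of two such independent threads in C with POSIX threads (the worker functions and their computations are hypothetical, not from the slides); whether they run interleaved on one μP (concurrent) or at the same time on two μP (parallel) is left to the OS scheduler and the hardware:

  #include <pthread.h>
  #include <stdio.h>

  /* Two workers with no dependence on one another:
   * legal to run before, after, or at the same time as each other. */
  static void *t1_work(void *arg) {
      long sum = 0;
      for (long i = 0; i < 1000000; i++) sum += i;   /* independent computation */
      printf("t1 done, sum = %ld\n", sum);
      return NULL;
  }

  static void *t2_work(void *arg) {
      long long prod = 1;
      for (long long i = 1; i <= 20; i++) prod *= i; /* independent computation */
      printf("t2 done, 20! = %lld\n", prod);
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, t1_work, NULL);      /* both threads are now runnable */
      pthread_create(&t2, NULL, t2_work, NULL);
      pthread_join(t1, NULL);                        /* wait for both to finish */
      pthread_join(t2, NULL);
      return 0;
  }

Because t1_work and t2_work share no data, either schedule is correct; parallel execution on 2 μP just finishes sooner.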
7
Definitions
Process
 A process is a program in execution; the two are almost synonymous
 A process owns and uses actual μP resources & time
 Multiple processes can execute on one single μP
 if so, they can execute in an interleaved fashion (overhead)
 creating the illusion of simultaneousness
 but at the expense of added time to completion
 Multiple processes can run on a multi-core μP
 If so, they can actually run in parallel, AKA simultaneously
 Provided that sufficient μP (cores) are available
 EEs build multi-core μPs for parallel execution
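A minimal sketch, assuming a POSIX system, of two processes created with fork(); the printed messages are hypothetical. On one μP the OS interleaves them; on a multi-core μP they may truly run in parallel:

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t pid = fork();                 /* create a second process */
      if (pid == 0) {
          /* child process: own address space, own resources */
          printf("child  pid %d working\n", (int)getpid());
          _exit(0);
      }
      /* parent process continues; on one μP the two are interleaved,
         on a multi-core μP they can truly run in parallel */
      printf("parent pid %d working\n", (int)getpid());
      waitpid(pid, NULL, 0);              /* wait for the child to finish */
      return 0;
  }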
8
Definitions
Thread
 A thread is a process subset, not necessarily a proper subset
 A thread has full control over all code and all global data of the program it runs in
 A thread allows a process subset to run concurrently with the rest of the process, and concurrently with other threads
 Concurrent means: to run independent of the timing of any other thread belonging to that same process!
 Independent means, at any time: OK to run before, after, or simultaneously with other threads
 Simultaneously implies that there is another separate execution engine, another μP
9
Definitions
Thread, Cont’d
 Threads run concurrently, each having its own stack, register set, condition codes, and control over the full static address space: data and code
 This does not mean that each thread actually has its own complete μP execution engine; yet it may!
 If it does have its own full μP engine, then true parallel execution is possible; we refer to this as multi-processing, or parallel execution
 If it does not, concurrent execution is still possible and can result in speedup vs. uni-processor execution
 Concurrent execution can be beneficial if some threads would be blocked anyway, e.g. due to IO
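A minimal sketch in C with POSIX threads (names are hypothetical) of what is shared versus private: the global counter is visible to both threads and needs synchronization, while each thread's local variable lives on its own private stack:

  #include <pthread.h>
  #include <stdio.h>

  int shared_counter = 0;                     /* global data: shared by all threads */
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      int local = 0;                          /* lives on this thread's private stack */
      for (int i = 0; i < 100000; i++) {
          local++;                            /* private: no synchronization needed */
          pthread_mutex_lock(&lock);          /* shared global needs synchronization */
          shared_counter++;
          pthread_mutex_unlock(&lock);
      }
      printf("local = %d\n", local);
      return NULL;
  }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, worker, NULL);
      pthread_create(&b, NULL, worker, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      printf("shared_counter = %d\n", shared_counter);  /* 200000 */
      return 0;
  }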
10
Definitions
Multi-Threading
 Breaking one process into multiple threads (SW threads) is named threading
 All n threads of some process p are free of data dependence, allowing arbitrary execution order! That is key!
 Having multiple threads execute within one process can speed up execution:
 If there are multiple μPs to run such threads
 Even if some of these μPs are limited, e.g. hyper-threaded
 This scheduling is named multi-threading
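A minimal multi-threading sketch in C (names are hypothetical): the process's work, summing an array, is split into two SW threads that touch disjoint halves of the data, so they are free of data dependence and may run in either order or at the same time:

  #include <pthread.h>
  #include <stdio.h>

  #define N 1000000
  static double a[N];
  static double partial[2];                   /* one slot per thread: no shared writes */

  static void *sum_half(void *arg) {
      long id = (long)arg;                    /* 0 sums the low half, 1 the high half */
      long lo = id * (N / 2), hi = lo + N / 2;
      double s = 0.0;
      for (long i = lo; i < hi; i++) s += a[i];
      partial[id] = s;
      return NULL;
  }

  int main(void) {
      for (long i = 0; i < N; i++) a[i] = 1.0;
      pthread_t t[2];
      for (long id = 0; id < 2; id++)
          pthread_create(&t[id], NULL, sum_half, (void *)id);
      for (long id = 0; id < 2; id++)
          pthread_join(t[id], NULL);
      printf("sum = %f\n", partial[0] + partial[1]);   /* 1000000.0 */
      return 0;
  }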
11
Definitions
Hyper-Thread (HT) Quick Intro
 A silicon hyper-thread is a partial μP lacking the ALU execution engine, but it has the rest of a full μP
 A silicon hyper-thread consumes ~25% of the silicon space of a full μP
 Why execute on a partial μP?
1. When SW thread A is executing on a full μP, but A is blocked, and another thread B is ready to run, yet only a silicon hyper-thread is available, run B on the hyper-threaded part of the μP
2. Reusing the ALU part of the blocked μP, with thread A making no progress, thread B continues to run on A's hyper-threaded twin, using all of its own registers without the need to load and restore registers
 The context switch to thread B is fast, because all registers etc. are already set, except for the very first time
 In the end, there is no need to save registers; keep them until the next activation, where thread B continues
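A minimal sketch in C with POSIX threads of the scenario above (the input file name and the work done are hypothetical): thread A spends most of its time blocked on IO, so a second execution context, even a partial hyper-threaded one, can keep the core busy running thread B:

  #include <pthread.h>
  #include <stdio.h>

  /* Thread A: spends most of its time blocked on IO */
  static void *io_thread(void *arg) {
      char buf[4096];
      FILE *f = fopen("input.dat", "rb");      /* hypothetical input file */
      if (f) {
          while (fread(buf, 1, sizeof buf, f) > 0)
              ;                                /* blocked in the kernel much of the time */
          fclose(f);
      }
      return NULL;
  }

  /* Thread B: pure computation, ready to run whenever A is blocked */
  static void *compute_thread(void *arg) {
      double x = 0.0;
      for (long i = 1; i < 50000000; i++) x += 1.0 / (double)i;
      printf("partial harmonic sum = %f\n", x);
      return NULL;
  }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, io_thread, NULL);
      pthread_create(&b, NULL, compute_thread, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
  }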
12
Definitions
Hyper-Thread
 AKA simultaneous multi-threading
 Hyper-threading allows multiple threads to run concurrently (not simultaneously):
 faster than on a single μP that is shared concurrently
 slower than on a genuine multi-core μP
 By having almost multiple μP on a single core
 Thus a context switch within a process is fast on an HT μP, as register saving & restoring can be avoided!
13
Definitions
Hyper-Thread
 A hyper-threaded core has:
 at least one real, complete μP, with all ALU instructions, registers, condition codes in silicon & access to memory
 and at least one partial μP, consisting of a full set of registers, own stack, sp etc., but no silicon for a full ALU
 If thread A is interrupted, e.g. due to IO, but another thread B of the same process is ready, then execution on the hyper-threaded core is as fast as on a second real, complete core
 In that case, hyper-threading is cheap (in silicon), benefitting from concurrency!
14
Threads & Hyper-Threads
15
Hyper-Threading
 Core i7 is a four-core μP based on the older Nehalem microarchitecture
 Each of the 4 cores constitutes one complete μP and one hyper-threaded μP; each can be viewed as 2 cores – with restrictions
 Yet the second is only a partial core! Hyper here actually means less
 Replicated on all 8 cores, full or hyper, are:
1. Architectural register state – that is the key: all registers
2. Return stack buffer
3. Large page ITLB – instruction translation look-aside buffer
 Other structures are statically partitioned among threads, when 2 or more are running
 In the picture below: QPI is Quick Path Interconnect, a fast point-to-point processor interconnect, functioning like a fast bus
16
Hyper-Threading
[Block diagram of the Core i7 die: four full cores (Core 0–3), each with one hyper-thread; shared Last Level Cache; IMC (integrated memory controller) with DDR3 DRAM interface; QPI links; Pwr & Clk; the cores plus the Uncore; frequency & voltage independent per domain]
Core i7 with 4 Real μP, Each Having 1 Hyper-Thread
17
Caches on Core i7
 Execution units, caches, etc. service a request regardless of which thread initiates it
 Due to minimal replication, Hyper-Threading is cost-efficient; it takes only a few mils (a mil is 1/1000 of an inch) of μP silicon
 Each core consists of an execution pipeline, a 32 KB I-cache, a 32 KB D-cache, and a 256 KB unified mid-level cache, shared by instructions and data
 Each core is connected to the large shared 8 MB cache in the Uncore, i.e. the rest of the μP
 The Uncore contains the memory controller etc.
18
Hyper-Threading
 The total number of cores on the Core i7 is 8 as viewed by a simple scheduler, i.e. one not differentiating between a full μP and a hyper-threaded μP
 When a 2nd thread runs on the first μP, it competes for resources (the ALU) even though another 6 μPs are idle, 3 of which are full μPs
 When only 1 thread runs on a core, it has all of that core's resources, i.e. it can starve its hyper-thread companion
 Hyper-Threading launched on the Intel Xeon™
 First appearance of Hyper-Threading on the desktop was in the Pentium® 4, code name: Northwood
 The next picture shows the Intel Hyper-Threading timeline
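A minimal sketch of that scheduler view, assuming Linux/glibc where sysconf(_SC_NPROCESSORS_ONLN) reports logical processors; on a 4-core hyper-threaded i7 it reports 8:

  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      /* Logical processors the OS can schedule on: full cores plus
         their hyper-threads; an i7 with 4 cores + HT reports 8 */
      long online = sysconf(_SC_NPROCESSORS_ONLN);
      printf("logical processors online: %ld\n", online);
      return 0;
  }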
19
Hyper-Threading Timeline
History of Hyper-Threading on Intel μP
20
Hyper-Threading
 The following table lists common desktop platforms:
 which processors offer more than
 one core, or
 more than one thread
 It also shows for each product the total number of logical cores
 Years ~2002 to ~2010
21
Hyper-Threading
Year        Processor                             Cores               HTs   Max speedup with HT
2002        Pentium® 4 (Northwood)                1                   1     2, no benefit to single-threaded apps
2005        Pentium® D (Smithfield)               2                   2     2, no benefit to apps with < 3 threads
2006-2008   Core™ 2 Duo, Core™ 2 Quad (Conroe)    2 (Duo), 4 (Quad)   0     N/A
2009        Intel® Core™ i7 (Nehalem)             4                   4     2, but no benefit to apps with < 5 threads
22
Hyper-Threading: Speedups
 Can 8 cores (4 real, 4 hyper) speed up execution by a factor of 8? Clearly not! Reasons for less than 8 are:
1. Data dependences in the original program, written with sequential execution in the programmer's mind
2. Control dependences of the original SW
3. OS overhead to schedule 1..8 threads or subtasks
4. Need for message passing for synchronization
 Yet some compilers (e.g. SUIF [3]) perform inter-procedural analysis and find parallel threads
 Running such automatically found threads on hyper-threaded hardware, named simultaneous multi-threading (SMT), yields further parallelism
 In the following sample, all dependence relations are maintained, yet wait time is decreased on a hyper-threaded CPU!
23
Hyper-Threading
[Diagram: execution with SMT vs. without SMT; 4 separate execution units, 1 box each; without SMT, 1 core takes a longer time; with SMT, 1 core + HT takes a shorter time]
24
Hyper-Threading Anomalies
 Strange things can happen when all 8 μP are viewed as equal
 That is an incorrect view, as the hyper-threads have severe limitations: the ALU is missing from each!
 For example, round-robin thread allocation would be non-optimal across hyper-threaded cores when fewer SW threads run than the total μP available
 A sample is shown next, on a dual-core hyper-threaded processor, with all 4 logical cores viewed as equal
 Scenario: execute just 2 SW threads A and B, then
 Allocate thread A on μP 0's full core and thread B on μP 0's hyper-thread (HW) core
25
Hyper-Threading Anomalies
[Diagram: one full core used together with its hyper-thread; the 2nd full core sits idle. Dumb use of the resource!]
26
Hyper-Threading Anomalies
 Problem? Thread B is deprived of the ALU, since thread A is already using the ALU on the first core, their only ALU
 The second μP sits idle, full μP as well as hyper-thread, wasting the full ALU it provides
 So threads A and B are competing for one ALU, though a whole ALU sits idle on the other core! Silly!
 A better way to allocate the free cores is shown next
27
Hyper-Threading Anomalies
[Diagram: both full cores are used; only the hyper-threaded parts sit idle]
28
Hyper-Threading Anomalies
 Here the hyper-threads sit idle, while both
full cores are progressing, working on
threads A and B simultaneously
 Not just concurrently, but simultaneously
 But 2 hyper-threads also sit idle
 Their silicon could have been used for
(almost) a 3rd full μP
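A minimal sketch of the better allocation, assuming Linux/glibc, where pthread_setaffinity_np pins each SW thread to a chosen logical CPU. The mapping of logical CPU numbers to full cores versus hyper-thread twins is machine-specific (here logical CPUs 0 and 1 are assumed to be two distinct full cores), so real code would first read the topology, e.g. from /proc/cpuinfo:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *thread_a(void *arg) { /* ... work for A ... */ return NULL; }
  static void *thread_b(void *arg) { /* ... work for B ... */ return NULL; }

  /* Pin a thread to one logical CPU so the scheduler cannot place
     both threads on the same core and its hyper-thread twin. */
  static void pin_to_cpu(pthread_t t, int cpu) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      pthread_setaffinity_np(t, sizeof set, &set);
  }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, thread_a, NULL);
      pthread_create(&b, NULL, thread_b, NULL);
      pin_to_cpu(a, 0);   /* assumed: logical CPU 0 = full core 0 */
      pin_to_cpu(b, 1);   /* assumed: logical CPU 1 = full core 1, not core 0's twin */
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
  }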
29
Bibliography
1. Mattwandel, Markus et al.: Performance Gains on Intel® Multi-Core, Multi-Threaded Core™ i7, ICS 2009
2. QPI: https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
3. Hall, Mary W.: Maximizing Multiprocessor Performance with the SUIF Compiler, Computer, Vol. 29, No. 12, pp. 84-89, December 1996
30