ECE 485/585 Microprocessors
Chapter 10
Threads and Multiprocessing, By Sample of Intel® Core™ i7
Herbert G. Mayer, PSU
Status 11/29/2016

Syllabus
  Introduction
  Definitions
  Threads & Hyper-Threads
  Hyper-Threading Anomalies
  Appendix
  Bibliography

Introduction
  We’ll study threads and hyper-threads
  And analyze the Intel Core i7, a sample hyper-threaded multiprocessor (MP) μP
  A thread here is understood as a process subset; see Definitions below
  Key decision for the Electrical Engineer: when designing a CPU, should it incorporate hyper-threads in silicon? Or is that silicon space better used for another complete μP?

Definitions
  (Logical order, not alphabetical)

Definitions: Core
  Core: synonymous with a full silicon μP; includes instructions, registers, condition codes
  If a core also includes an ALU, it is a full μP
  If not, it is a hyper-threaded core, missing the ALU plus a few other non-essential modules
  Actually hyper-thread (super thread) is a misnomer: it should be named hypo-thread (sub thread)
  A sample hyper-threaded product is the Intel Core i7
  The i7 has 4 cores, so the total number of μPs for that product is 8: 4 full μPs and 4 partial μPs, the silicon hyper-threads

Definitions: Concurrent
  Let process p be composed of 2 threads, t1 and t2
  Let t1 and t2 have no dependency on one another
  Then t1 and t2 are executable concurrently; they may, but do not have to, run at the same time
  They may execute simultaneously if 2 real μPs are available; then true parallel execution is possible
  If we do not know or care whether this happens on 1 or more μPs, we call this concurrent execution
  Concurrent doesn’t mean running at the same time.
  It only means that it is logically OK to run at the same time

Definitions: Parallel
  Let process p be composed of 2 threads, t1 and t2
  Let t1 and t2 have no dependence on one another
  Then t1 and t2 can be executed at the same time
  If we know that this execution in fact happens on 2 (or more) μPs, we call this parallel execution
  Synonym: simultaneous execution

Definitions: Process
  A process is a program in execution; the two terms are almost synonymous
  A process owns and uses actual μP resources & time
  Multiple processes can execute on one single μP
    If so, they execute in an interleaved fashion (with overhead)
    creating the illusion of simultaneity
    but at the expense of added time to completion
  Multiple processes can run on a multi-core μP
    If so, they can actually run in parallel, AKA simultaneously
    Provided that sufficient μPs (cores) are available
  EEs build multi-core μPs for parallel execution

Definitions: Thread
  A thread is a process subset, not necessarily a proper subset
  A thread has full control over all code and all global data of the program it runs in
  A thread allows a process subset to run concurrently with the rest of the process, and concurrently with other threads
  Concurrent means: to run independently of the timing of any other thread belonging to that same process!
  Independent means, at any time: OK to run before, after, or simultaneously with other threads
  Simultaneously implies that there is another separate execution engine, another μP

Definitions: Thread, Cont’d
  Threads run concurrently, each having its own stack, register set, and condition codes, and control over the full static address space: data and code
  This does not mean that each thread actually has its own complete μP execution engine; yet it may!
  If it does have its own full μP engine, then true parallel execution is possible; we refer to this as multi-processing, or parallel execution
  If it does not, concurrent execution is still possible and could result in a speedup vs. uni-processor execution
  Concurrent execution can be beneficial if some threads would be blocked anyway, e.g. due to IO

Definitions: Multi-Threading
  Breaking one process into multiple threads (SW threads) is named threading
  All n threads of some process p are free of data dependence, allowing arbitrary execution order! That is key!
  Having multiple threads execute within one process can speed up execution:
    If there are multiple μPs to run such threads
    Even if some of these μPs are limited, e.g. hyper-threaded
  This scheduling is named multi-threading

Definitions: Hyper-Thread (HT) Quick Intro
  A silicon hyper-thread is a partial μP that lacks the ALU execution engine but has the rest of a full μP
  A silicon hyper-thread consumes ~25% of the silicon space of a full μP
  Why execute on a partial μP?
    1. When SW thread A is executing on a full μP but is blocked, and another thread B is ready to run, yet only a silicon hyper-thread is available: run B on the hyper-threaded part of the μP
    2. This reuses the ALU of the blocked μP: thread A makes no progress, while thread B continues on the hyper-threaded twin, using all of that twin's registers without any need to load and restore registers
  The context switch to thread B is fast, because all registers etc. are already set, except for the very first time
  In the end, there is no need to save registers; they are kept until the next activation, when thread B continues

Definitions: Hyper-Thread
  AKA simultaneous multi-threading
  Hyper-threading allows multiple threads to run concurrently (not simultaneously):
    faster than on a single μP that is shared concurrently
    slower than on a genuine multi-core μP
  By having almost multiple μPs on a single core
  Thus a context switch within a process is fast on an HT μP, as register saving & restoring can be skipped!
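
The thread definitions above can be illustrated with a minimal sketch in Python. The function names `square_sum` and `run_threads` and the sample data are assumptions made for illustration only; they are not part of the lecture material. Two SW threads, t1 and t2, share the process's global data but write to separate result slots, so they have no data dependence and may be started in any order.

```python
import threading

def square_sum(values, results, slot):
    """Sum the squares of `values` and store the result in results[slot]."""
    results[slot] = sum(v * v for v in values)

def run_threads(order):
    """Run two independent threads, started in the given order.

    `order` is a list such as ["t1", "t2"]. Each thread writes only to
    its own slot of `results`, so there is no data dependence between them.
    """
    results = [None, None]  # one private output slot per thread
    work = {
        "t1": threading.Thread(target=square_sum, args=([1, 2, 3], results, 0)),
        "t2": threading.Thread(target=square_sum, args=([4, 5, 6], results, 1)),
    }
    for name in order:           # start order is arbitrary
        work[name].start()
    for thread in work.values():  # wait until both threads have finished
        thread.join()
    return results

if __name__ == "__main__":
    print(run_threads(["t1", "t2"]))
    print(run_threads(["t2", "t1"]))
```

Both start orders give the same result, which is what "free of data dependence" buys. Note that standard CPython builds serialize bytecode execution with a global interpreter lock, so these two threads run concurrently but usually not simultaneously: a software instance of the concurrent-versus-parallel distinction defined earlier.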

Definitions: Hyper-Thread
  A hyper-threaded core has:
    at least one real, complete μP, with all ALU instructions, registers, and condition codes in silicon & access to memory
    and at least one partial μP, consisting of a full set of registers, its own stack, sp etc., but no silicon for a full ALU
  If thread A is interrupted, e.g. due to IO, but another thread B of the same process is ready, then execution on a hyper-threaded core is as fast as on a second real, complete core
  In that case, hyper-threading is cheap (in silicon), benefitting from concurrency!

Threads & Hyper-Threads

Hyper-Threading
  Core i7 is a four-core μP based on the older Nehalem
  Each of the 4 cores constitutes one complete μP and one hyper-threaded μP; each can be viewed as 2 cores, with restrictions
  Yet the second is only a partial core! Hyper here actually means less
  Replicated on all 8 cores, full or hyper, are:
    1. Architectural register state; that is the key: all registers
    2. Return stack buffer
    3. Large page ITLB (instruction translation look-aside buffer)
  Other structures are statically partitioned among threads when 2 or more are running
  In the picture below: QPI is Quick Path Interconnect, a fast point-to-point processor interconnect, functioning like a fast bus

Hyper-Threading
  [Figure: block diagram with labels: Frequency & Voltage Independent Interface; DRAMs (DDR3); CORE 0, CORE 1, CORE 2, CORE 3; Last Level Cache; IMC; QPI (three times); Pwr & Clk; CORES; UNCORE]
  Core i7 with 4 Real μP, Each Having 1 Hyper-Thread

Caches on Core i7
  Execution units, caches, etc. service a request regardless of which thread initiates it
  Due to minimal replication, Hyper-Threading is cost-efficient; few mils (a mil is 1/1000 of an inch) of μP silicon
  Each core consists of an execution pipeline, a 32 KB I-cache, a 32 KB D-cache, and a shared 256 KB mid-level unified cache
  Each core is connected to a large shared 8 MB cache in the Uncore, i.e. the rest of the μP
  The Uncore contains the memory controller etc.
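
The count of 8 follows directly from 4 full cores with one hyper-thread each. A minimal sketch of that arithmetic is given below; the helper name `logical_processors` and its parameters are assumptions for illustration. The standard-library call `os.cpu_count()` reports the number of logical CPUs the operating system sees (or `None` if it cannot be determined), which on a hyper-threaded part is typically the total, not the number of full cores.

```python
import os

def logical_processors(full_cores, hyper_threads_per_core=1):
    """Return (full, hyper, total) under the chapter's counting model.

    Each full core contributes one complete uP plus
    `hyper_threads_per_core` partial uPs (silicon hyper-threads).
    """
    hyper = full_cores * hyper_threads_per_core
    return full_cores, hyper, full_cores + hyper

if __name__ == "__main__":
    # Core i7 as described above: 4 full cores, 1 hyper-thread each
    print(logical_processors(4))
    # What the OS reports on the machine running this script; the value
    # depends entirely on that machine and may be None.
    print(os.cpu_count())
```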

Hyper-Threading
  The total number of cores on Core i7 is 8 as viewed by a simple scheduler, i.e. one not differentiating between a μP and a hyper-threaded μP
  When a 2nd thread runs on the first μP, it competes for resources (ALU) though another 6 μPs are idle, 3 of which are full μPs
  When 1 thread runs on a core, it has all that core’s resources, i.e. it can starve its hyper-thread companion
  Hyper-Threading launched on Intel Xeon™
  The first appearance of Hyper-Threading on the desktop was in the Pentium® 4, name: Northwood
  The next picture shows the Intel Hyper-Threading timeline

Hyper-Threading Timeline
  [Figure: History of Hyper-Threading on Intel μP]

Hyper-Threading
  The following table lists common desktop platforms:
    Which processor offers more than one core, or more than one thread
    Also shows for each product the total number of logical cores
    Years ~2002 to ~2010

Hyper-Threading
  2002: Pentium® 4 (Northwood)
    Number of cores: 1
    Total number of HTs: 1
    Max speedup with HT: 2; no benefit to single-threaded apps
  2005: Pentium D (Smithfield)
    Number of cores: 2
    Total number of HTs: 2
    Max speedup with HT: 2; no benefit to apps with < 3 threads
  2006-2008: Core 2 Duo, Core 2 Quad (Conroe)
    Number of cores: 2 (Duo), 4 (Quad)
    Total number of HTs: 0
    Max speedup with HT: N/A
  2009: Intel® Core i7 (Nehalem)
    Number of cores: 4
    Total number of HTs: 4
    Max speedup with HT: 2; no benefit to apps with < 5 threads

Hyper-Threading: Speedups
  Can 8 cores (4 real, 4 hyper) speed up execution by a factor of 8?
  Clearly not! Reasons for less than 8 are:
    1. Data dependences in the original program, written with sequential execution in the mind of the programmer
    2. Control dependences of the original SW
    3. OS overhead to schedule 1..8 threads or subtasks
    4. Need for message-passing for synchronization
  Yet some compilers (e.g. SUIF [3]) perform interprocedural analysis and find parallel threads
  Such analysis, named simultaneous multi-threading (SMT), yields further parallelism
  In the following sample, all dependence relations are maintained; the wait is decreased on a hyper-threaded CPU!

Hyper-Threading with SMT
  [Figure labels: 1 core, longer time; 1 core + HT, shorter time; without SMT; 4 separate execution units, 1 box each]

Hyper-Threading Anomalies
  Strange things can happen when all 8 μPs are viewed as equal
  Which is an incorrect view, as the hyper-threads have severe limitations: the ALU is missing for each!
  For example, round-robin thread allocation would be non-optimal within hyper-threaded cores when fewer SW threads run than the total number of μPs available
  A sample is shown next, on a dual hyper-threaded processor, with all 4 cores viewed as equal
  Scenario: execute just 2 SW threads A and B, allocating thread A on μP 0’s full core and thread B on μP 0’s hyper-thread (HW) core

Hyper-Threading Anomalies
  [Figure: 1 full core used with HT. 2nd full core sits idle. Dumb use of resource!]

Hyper-Threading Anomalies
  Problem? Thread B is deprived of the ALU, since thread A is already using the only ALU on the first core
  The second μP sits idle, full μP as well as hyper-thread, wasting its full ALU
  So threads A and B are competing for an ALU, though a whole ALU sits idle on the other core! Silly!
  A better way to allocate free cores is shown next

Hyper-Threading Anomalies
  [Figure: Both full cores are used. Only hyper-threaded parts idle]

Hyper-Threading Anomalies
  Here the hyper-threads sit idle, while both full cores are progressing, working on threads A and B simultaneously
  Not just concurrently, but simultaneously
  But 2 hyper-threads also sit idle
  Their silicon could have been used for (almost) a 3rd full μP

Bibliography
  1. Mattwandel, Markus et al.: Performance Gains on Intel® Multi-Core, Multi-Threaded Core™ i7, ICS 2009
  2. QPI: https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
  3. Hall, Mary W.: Maximizing Multiprocessor Performance with the SUIF Compiler, Computer, Vol. 29, No. 12, pp. 84-89, December 1996
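
Appendix sketch: thread allocation
  Returning to the scenario under Hyper-Threading Anomalies, the two allocation policies can be contrasted in a few lines of Python. This is only a sketch under stated assumptions: the function names `allocate_naive` and `allocate_full_first`, the slot labels "full" and "hyper", and the dictionary result are all hypothetical and do not describe any real OS scheduler. The naive policy treats all logical μPs as equal and fills them in numbering order; the better policy hands out all full cores before any hyper-thread.

```python
def allocate_naive(threads, num_cores):
    """Fill logical uPs in numbering order: core 0 full, core 0 hyper, ..."""
    slots = [(core, unit) for core in range(num_cores)
             for unit in ("full", "hyper")]
    if len(threads) > len(slots):
        raise ValueError("more threads than logical processors")
    return dict(zip(threads, slots))

def allocate_full_first(threads, num_cores):
    """Hand out every full core before using any hyper-thread."""
    slots = [(core, "full") for core in range(num_cores)]
    slots += [(core, "hyper") for core in range(num_cores)]
    if len(threads) > len(slots):
        raise ValueError("more threads than logical processors")
    return dict(zip(threads, slots))

if __name__ == "__main__":
    # Dual hyper-threaded processor, 2 SW threads A and B
    print(allocate_naive(["A", "B"], 2))       # A and B share core 0's ALU
    print(allocate_full_first(["A", "B"], 2))  # A and B each get a full core
```

  With 2 threads on 2 cores, the naive policy places B on core 0's hyper-thread while core 1 idles; the full-first policy gives each thread its own ALU, matching the better allocation described in the slides.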