Intel Multimedia Extensions and Hyper-Threading
Michele Co
CS451

Outline
• Evolution of Intel multimedia extensions
  – x87 (386)
  – MMX (Pentium MMX, Pentium II)
  – SSE (Pentium III)
  – SSE2 (Pentium 4 – Willamette)
  – SSE3 (Pentium 4 – Prescott)
• Hyper-Threading

x87 FPU
• 8 80-bit data registers (double extended precision floating point)
• Data registers treated as a stack
• Control register – FP precision, rounding, …
• Status register – FPU busy, TOS, CC, error, exception, …
• Tag register – 2 bits per data register: valid, zero, special, empty
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register

x87 FPU State (figure)

x87 Data Types (figure)

x87 Instructions
• Data transfer (load, store, move)
• Basic arithmetic
• Comparison
• Transcendental (trigonometric, log, exp)
• Load constant
• x87 FPU control

MMX
• SIMD execution
• 8 64-bit data registers (MMX)
  – Aliased to x87 FPU registers
• Randomly accessible (unlike the x87 register stack)

SIMD Execution (figure)

MMX State (figure)

MMX Registers (figure)

MMX Data Types (figure)

MMX Instructions
• Data transfer
• Arithmetic
• Comparison
• Conversion
• Unpacking
• Logical
• Shift
• Empty MMX state

SSE
• Pentium III
• 8 128-bit data registers (XMM)
  – Independent of x87 FPU and MMX registers
• SSE instructions can be executed in parallel with MMX/x87
• MXCSR register – control and status for XMM registers (similar to x87 status register)
• EFLAGS register – results of compare ops
• 128-bit packed single-precision fp data type
• Prefetching, cacheability, store ordering control instructions

SSE State (figure)

XMM Registers (figure)

SSE Data Type (figure)

SSE Instructions
• Packed and scalar single-precision floating point
• Logical
• Conversion
• 64-bit SIMD integer
• MXCSR management
• State management
• Cacheability control, prefetch, memory ordering
  – SFENCE (store fence)
• FXSAVE, FXRSTOR – extend the fast save and restore of x87 and MMX registers to also save/restore the XMM and MXCSR registers

Packed Single-Precision FP Operation (figure)

Scalar Single-Precision FP Operation (figure)

Shuffle (figure)

Unpack and Interleave (figure)
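The four figure slides above (packed operation, scalar operation, shuffle, unpack and interleave) can be approximated with a small plain-C model of the lane semantics. This is only a sketch: the type `xmm_t` and the function names are hypothetical, and the code models what ADDPS, ADDSS, SHUFPS and UNPCKLPS do to the four lanes rather than issuing the instructions themselves.

```c
#include <assert.h>

/* Hypothetical model of one 128-bit XMM register holding four packed
 * single-precision values. lane[0] is the lowest-order element. */
typedef struct { float lane[4]; } xmm_t;

/* Packed add (models ADDPS): all four lanes are added independently. */
xmm_t packed_add(xmm_t a, xmm_t b) {
    xmm_t r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

/* Scalar add (models ADDSS): only lane 0 is added; lanes 1..3 are
 * passed through unchanged from the first operand. */
xmm_t scalar_add(xmm_t a, xmm_t b) {
    xmm_t r = a;
    r.lane[0] = a.lane[0] + b.lane[0];
    return r;
}

/* Shuffle (models SHUFPS): the two low result lanes are picked from a,
 * the two high result lanes from b. Each 2-bit field of imm8 selects
 * the source lane for one result lane. */
xmm_t shuffle(xmm_t a, xmm_t b, unsigned imm8) {
    assert(imm8 <= 0xFFu);
    xmm_t r;
    r.lane[0] = a.lane[imm8 & 3u];
    r.lane[1] = a.lane[(imm8 >> 2) & 3u];
    r.lane[2] = b.lane[(imm8 >> 4) & 3u];
    r.lane[3] = b.lane[(imm8 >> 6) & 3u];
    return r;
}

/* Unpack and interleave low (models UNPCKLPS): result is a0, b0, a1, b1. */
xmm_t unpack_low(xmm_t a, xmm_t b) {
    xmm_t r = {{ a.lane[0], b.lane[0], a.lane[1], b.lane[1] }};
    return r;
}
```

For instance, with a = (1, 2, 3, 4) and b = (10, 20, 30, 40), listed from lane 0 upward, `packed_add` gives (11, 22, 33, 44) while `scalar_add` gives (11, 2, 3, 4). In real code the same operations are available as the intrinsics `_mm_add_ps`, `_mm_add_ss`, `_mm_shuffle_ps` and `_mm_unpacklo_ps` from `<xmmintrin.h>`.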

SSE2
• Pentium 4
• More data types
• More instructions to support new data types

SSE2 State (figure)

SSE2 Data Types (figure)

SSE2 Instructions
• Support for additional types
• CLFLUSH (cache line flush)
• LFENCE (load fence)
• MFENCE (load + store fence)

Packed Double-Precision FP Operations (figure)

Scalar Double-Precision FP Operations (figure)

SSE3
• Pentium 4 (Prescott)
  – Support for Hyper-Threading
• 13 new instructions
  – 10 SIMD support instructions
  – 1 x87 accelerating instruction (fp to int conversion)
  – Synchronization of threads
    • MONITOR (monitor write-back stores)
    • MWAIT (wait for write-back store)
• No new state

Asymmetric Processing (figure)

Horizontal Data Movement (figure)

Hyper-Threading

Terminology
• Process
  – Program associated with a context (state: registers, program counter, flags, etc.)
  – Consists of one or more threads
• Thread
  – “lightweight process” (less state)

Hyper-Threading
• Single physical processor appears as 2 logical processors
• Thread Level Parallelism (TLP)
  – Many applications have software threads that can be executed simultaneously
    • Online transaction processing
    • Web services
• Latency can leave execution units idle
  – Cache misses
  – Branch mispredictions
  – Waiting for loads/stores

Techniques for Minimizing Effect of Long Latency
• Chip multiprocessing (CMP)
  – 2 processors on single die
  – Larger than single core chip, manufacture more expensive
• Time-slice or switch-on-event multithreading
  – Switch threads after fixed time period or on long latency events like cache misses
  – Doesn’t take advantage of other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
• Simultaneous multithreading (SMT)
  – Multiple threads execute on single processor without switching
  – Hyper-Threading is Intel’s implementation

Intel Hyper-Threading Demo

Resource Requirements for HT
Need to maintain 2 contexts
• Replicated
  – Register renaming logic (RAT)
  – Instruction Pointer
  – ITLB
  – Return stack predictor
  – Various other architectural registers (GP, control, APIC, machine state)
• Partitioned
  – Re-order buffers (ROBs)
  – Load/Store buffers
  – Various queues, like the scheduling queues, uop queue, etc.
• Shared
  – Caches: trace cache, L1, L2, L3, microcode ROM
  – Microarchitectural registers
  – Execution Units

Hyper-Threading Goals
• Minimize die area cost for implementing
• Ensure forward progress by at least one logical processor
• Maintain single-threaded performance

Frontend Changes
• 2 PCs
• Arbitration for shared resource access
  – Trace cache, microcode ROM, caches
  – One logical processor at a time per structure
• Thread tags per trace cache entry
• Microcode ROM – 2 microcode instruction pointers
• Wider pipeline latches to hold state for 2 contexts
• Branch prediction
  – RAS and branch history buffer duplicated
  – Global history shared, but tagged with logical processor ID

Trace Cache Hit (figure)

Trace Cache Miss (figure)

Hyper-Threaded Execution (figure)

Execution Modes
• Single-task (ST), Multi-task (MT)
  – ST has two variants, ST0 and ST1, named for the logical processor that remains active
  – HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it
  – Interrupt sent to the halted logical processor transitions back to MT

HT Performance – OLTP (figure)

HT Performance – Web Server (figure)
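
The mode transitions listed under Execution Modes can be sketched as a small state machine. The code below is an illustrative model only, not Intel's implementation: the names `ht_mode_t`, `on_halt` and `on_interrupt` are hypothetical, and the `MODE_BOTH_HALTED` state (both logical processors halted) is an assumption added to make the model complete; the slides do not describe it.

```c
#include <assert.h>

/* Hypothetical encoding: bit i is set when logical processor i is active. */
typedef enum {
    MODE_BOTH_HALTED = 0, /* assumption: not covered by the slides */
    MODE_ST0 = 1,         /* single-task, only logical processor 0 active */
    MODE_ST1 = 2,         /* single-task, only logical processor 1 active */
    MODE_MT  = 3          /* multi-task, both logical processors active */
} ht_mode_t;

/* Logical processor lp (0 or 1) executes HALT and becomes inactive.
 * From MT this leaves the other logical processor running alone, so
 * HALT on processor 0 gives ST1 and HALT on processor 1 gives ST0. */
ht_mode_t on_halt(ht_mode_t mode, int lp) {
    assert(lp == 0 || lp == 1);
    return (ht_mode_t)((unsigned)mode & ~(1u << lp));
}

/* An interrupt is delivered to logical processor lp. If it was halted
 * it becomes active again; from ST0/ST1 this returns the model to MT. */
ht_mode_t on_interrupt(ht_mode_t mode, int lp) {
    assert(lp == 0 || lp == 1);
    return (ht_mode_t)((unsigned)mode | (1u << lp));
}
```

Starting from `MODE_MT`, calling `on_halt(mode, 0)` yields `MODE_ST1`, and a following `on_interrupt(mode, 0)` returns to `MODE_MT`. Encoding the mode as a bitmask of active logical processors keeps both transitions to a single bit operation.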