Michele Co's Multimedia Extensions and Hyper-Threading

Intel Multimedia Extensions and Hyper-Threading
Michele Co
CS451
Outline
• Evolution of Intel multimedia extensions
– x87 (386)
– MMX (Pentium MMX, Pentium II)
– SSE (Pentium III)
– SSE2 (Pentium 4 – Willamette)
– SSE3 (Pentium 4 – Prescott)
• Hyper-Threading
x87 FPU
• 8 80-bit data registers (double extended-precision floating point)
• Data registers treated as a stack
• Control register – FP precision, rounding, …
• Status register – FPU busy, TOS, CC, error, exception, …
• Tag register – 2 bits per data register (valid, zero, special, empty)
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register
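The stack addressing described above can be sketched as a small Python model. This is only an illustration under simplifying assumptions: the class name `X87Stack` and its methods are hypothetical, values are ordinary Python floats rather than 80-bit encodings, and an "empty" tag is represented by `None`.

```python
class X87Stack:
    """Toy model of the x87 data-register stack (values only, no 80-bit encoding)."""

    def __init__(self):
        self.regs = [None] * 8      # 8 physical data registers; None models an "empty" tag
        self.top = 0                # TOS field of the status register (3 bits)

    def push(self, value):
        # A load (e.g. FLD) decrements TOS, then writes the new top register.
        self.top = (self.top - 1) % 8
        if self.regs[self.top] is not None:
            raise OverflowError("stack overflow: target register not empty")
        self.regs[self.top] = value

    def st(self, i):
        # ST(i) is addressed relative to TOS, not by physical register number.
        value = self.regs[(self.top + i) % 8]
        if value is None:
            raise IndexError("stack underflow: ST(%d) is empty" % i)
        return value

    def pop(self):
        # A store-and-pop (e.g. FSTP) reads the top, marks it empty, increments TOS.
        value = self.st(0)
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def add_pop(self):
        # In the style of FADDP: ST(1) = ST(1) + ST(0), then pop.
        total = self.st(1) + self.st(0)
        self.regs[(self.top + 1) % 8] = total
        self.pop()
```

Pushing two values and calling `add_pop()` leaves their sum in ST(0), which is how expression evaluation typically proceeds on a stack-based FPU.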
x87 FPU State
x87 Data Types
x87 Instructions
• Data transfer (load, store, move)
• Basic arithmetic
• Comparison
• Transcendental (trigonometric, log, exp)
• Load constant
• x87 FPU control
MMX
• SIMD execution
• 8 64-bit data registers (MMX)
– Aliased to x87 FPU registers
• Randomly accessible (registers are addressed directly, unlike the x87 stack)
SIMD Execution
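SIMD execution applies one operation to every element packed into a register. A minimal Python sketch of two packed byte additions on a 64-bit value follows, modeled on the behavior of PADDB (wraparound) and PADDUSB (unsigned saturation). The helper names are hypothetical and the code models the semantics only; it does not execute MMX instructions.

```python
def unpack_bytes(word64):
    """Split a 64-bit integer into 8 unsigned byte lanes, lane 0 = least significant."""
    return [(word64 >> (8 * i)) & 0xFF for i in range(8)]

def pack_bytes(lanes):
    """Inverse of unpack_bytes."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFF) << (8 * i)
    return word

def paddb(a, b):
    """Packed add with wraparound: each lane is added modulo 256."""
    return pack_bytes([(x + y) & 0xFF
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

def paddusb(a, b):
    """Packed add with unsigned saturation: each lane is clamped at 255."""
    return pack_bytes([min(x + y, 0xFF)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
```

The key difference from an ordinary 64-bit add is that no carry crosses a lane boundary, so an overflow in one byte never disturbs its neighbor.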
MMX State
MMX Registers
MMX Data Types
MMX Instructions
• Data transfer
• Arithmetic
• Comparison
• Conversion
• Unpacking
• Logical
• Shift
• Empty MMX state
SSE
• Pentium III
• 8 128-bit data registers (XMM)
– Independent of x87 FPU and MMX registers
• SSE instructions can be executed in parallel with MMX/x87
• MXCSR register – control and status for XMM registers (similar to the x87 control and status registers)
• EFLAGS register – results of compare ops
• 128-bit packed single-precision fp data type
• Prefetching, cacheability, store ordering control instructions
SSE State
XMM Registers
SSE Data Type
SSE Instructions
• Packed and scalar single-precision floating point
• Logical
• Conversion
• 64-bit SIMD integer
• MXCSR management
• State management
• Cacheability control, prefetch, memory ordering
– SFENCE (store fence)
• FXSAVE, FXRSTOR
– Fast save and restore of x87 and MMX state, extended to also save/restore the XMM and MXCSR registers
Packed Single-Precision FP Operation
Scalar Single-Precision FP Operation
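The difference between a packed and a scalar operation can be sketched in Python, modeled on ADDPS and ADDSS. An XMM register is represented here as a list of four floats with lane 0 first; the helper `f32` is a hypothetical name for rounding a Python float to single precision. This models the semantics only.

```python
import struct

def f32(x):
    """Round a Python float to single precision, as stored in one XMM lane."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def addps(a, b):
    """Packed add: all four lanes are added independently."""
    return [f32(x + y) for x, y in zip(a, b)]

def addss(a, b):
    """Scalar add: only lane 0 is added; lanes 1-3 pass through
    unchanged from the first (destination) operand."""
    return [f32(a[0] + b[0])] + list(a[1:])
```

With `a = [1.0, 2.0, 3.0, 4.0]` and `b = [0.5, 0.5, 0.5, 0.5]`, the packed form changes every lane while the scalar form changes only the first.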
Shuffle
Unpack and Interleave
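Shuffle and unpack rearrange lanes rather than compute on them. The sketch below models SHUFPS, UNPCKLPS, and UNPCKHPS on four-element lists (lane 0 first). The function names mirror the mnemonics for readability, but this is a Python model of the lane selection, not the instructions themselves.

```python
def shufps(dest, src, imm8):
    """Shuffle: the low two result lanes are picked from dest and the
    high two from src, each selected by a 2-bit field of imm8."""
    return [
        dest[imm8 & 3],
        dest[(imm8 >> 2) & 3],
        src[(imm8 >> 4) & 3],
        src[(imm8 >> 6) & 3],
    ]

def unpcklps(dest, src):
    """Unpack low: interleave the low two lanes of dest and src."""
    return [dest[0], src[0], dest[1], src[1]]

def unpckhps(dest, src):
    """Unpack high: interleave the high two lanes of dest and src."""
    return [dest[2], src[2], dest[3], src[3]]
```

For example, `shufps(a, a, 0x1B)` reverses the four lanes of `a`, because the 2-bit fields of `0x1B` select lanes 3, 2, 1, 0 in that order.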
SSE2
• Pentium 4
• More data types: packed double-precision floating point and 128-bit packed integers
• More instructions to support the new data types
SSE2 State
SSE2 Data Types
SSE2 Instructions
• Support for additional types
• CLFLUSH (cache line flush)
• LFENCE (load fence)
• MFENCE (load + store fence)
Packed Double-Precision FP Operations
Scalar Double-Precision FP Operations
SSE3
• Pentium 4 (Prescott)
– Support for Hyper-Threading
• 13 new instructions
– 10 SIMD support instructions
– 1 x87 accelerating instruction (FISTTP, fp to int conversion)
– 2 thread synchronization instructions
• MONITOR (set up an address range to monitor for write-back stores)
• MWAIT (wait for a write-back store to the monitored range)
• No new state
Asymmetric Processing
Horizontal Data Movement
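Both ideas can be sketched in Python on four-element lists (lane 0 first), modeled on ADDSUBPS for asymmetric processing and HADDPS for horizontal data movement. This is a model of the lane arithmetic only, and it assumes values that are exactly representable so that single-precision rounding can be ignored.

```python
def addsubps(dest, src):
    """Asymmetric processing: even lanes are subtracted, odd lanes are added."""
    return [
        dest[0] - src[0],
        dest[1] + src[1],
        dest[2] - src[2],
        dest[3] + src[3],
    ]

def haddps(dest, src):
    """Horizontal add: adjacent lanes within each operand are summed,
    instead of matching lanes across the two operands."""
    return [
        dest[0] + dest[1],
        dest[2] + dest[3],
        src[0] + src[1],
        src[2] + src[3],
    ]
```

The mixed subtract/add pattern is what a complex multiply needs, and the horizontal form is useful for reductions such as dot products, where the values to be combined sit in the same register.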
Hyper-Threading
Terminology
• Process
– Program associated with a context (state:
registers, program counter, flags, etc.)
– Consists of one or more threads
• Thread
– “lightweight process” (less state)
Hyper-Threading
• Single physical processor appears as 2 logical processors
• Thread Level Parallelism (TLP)
– Many applications have software threads that can be
executed simultaneously
• Online transaction processing
• Web services
• Latency can leave execution units idle
– Cache misses
– Branch mispredictions
– Waiting for loads/stores
Techniques for Minimizing Effect of Long Latency
• Chip multiprocessing (CMP)
– 2 processors on a single die
– Larger die than a single-core chip, so more expensive to manufacture
• Time-slice or switch-on-event multithreading
– Switch threads after a fixed time period or on long-latency events like cache misses
– Doesn’t address other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
• Simultaneous multithreading (SMT)
– Multiple threads execute on single processor without switching
– Hyper-Threading is Intel’s implementation
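To illustrate why a second thread helps hide latency, here is a toy cycle-by-cycle simulation in Python. It is a deliberately simplified sketch: the core has a single issue slot per cycle, so it is closer to fine-grained interleaving than to a real superscalar SMT core, and the function name `run` and the stall-list encoding are assumptions made for this example. The point is only that one thread's stall cycles can be filled with another thread's work.

```python
def run(threads):
    """Simulate a core with one issue slot per cycle shared by `threads`.

    Each thread is a list of stall lengths: entry n means "issue one
    instruction, then this thread cannot issue for n cycles" (n > 0 stands
    in for a cache miss or similar long-latency event).
    Returns (cycles until the last instruction issues, idle cycles).
    """
    n = len(threads)
    pc = [0] * n            # next instruction index per thread
    ready_at = [0] * n      # first cycle each thread may issue again
    cycle = idle = 0
    last = -1               # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        issued = False
        # Round-robin: start looking from the thread after the last issuer.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                last = t
                issued = True
                break
        if not issued:
            idle += 1       # no thread could issue: the slot is wasted
        cycle += 1
    return cycle, idle
```

Running two four-instruction threads one after the other, each with one 3-cycle stall, takes 7 cycles apiece with 3 idle cycles each; running them together in this model takes 9 cycles with 1 idle cycle.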
Intel Hyper-Threading Demo
Resource Requirements for HT
Need to maintain 2 contexts
• Replicated
– Register renaming logic (RAT)
– Instruction Pointer
– ITLB
– Return stack predictor
– Various other architectural registers (GP, control, APIC, machine state)
• Partitioned
– Re-order buffers (ROBs)
– Load/Store buffers
– Various queues, like the scheduling queues, uop queue, etc.
• Shared
– Caches: trace cache, L1, L2, L3, microcode ROM
– Microarchitectural registers
– Execution Units
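A partitioned resource can be sketched as a buffer in which each logical processor may use only its own share of the entries. The class below is a hypothetical Python model (the name `PartitionedBuffer`, the even split, and the entry count in the usage note are assumptions for illustration, not figures from the slides).

```python
class PartitionedBuffer:
    """Toy buffer whose entries are split evenly between logical processors."""

    def __init__(self, entries, logical_processors=2):
        self.limit = entries // logical_processors   # per-processor share
        self.used = [0] * logical_processors

    def allocate(self, lp):
        """Try to take one entry for logical processor `lp`.
        Returns False if that processor's share is already full."""
        if self.used[lp] >= self.limit:
            return False
        self.used[lp] += 1
        return True

    def release(self, lp):
        """Return one entry, for example when an instruction retires."""
        if self.used[lp] == 0:
            raise ValueError("logical processor %d holds no entries" % lp)
        self.used[lp] -= 1
```

With an 8-entry buffer, logical processor 0 can hold at most 4 entries, so even if it stalls with its share full, logical processor 1 can still allocate. A fully shared structure would not give that guarantee, which is the trade-off between the partitioned and shared categories.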
Hyper-Threading Goals
• Minimize the die area cost of implementing Hyper-Threading
• Ensure forward progress: if one logical processor stalls, the other can still make progress
• Maintain single-threaded performance
Frontend Changes
• 2 PCs
• Arbitration for shared resource access
– Trace cache, microcode ROM, caches
– One logical processor at a time per structure
• Thread tags per trace cache entry
• Microcode ROM – 2 microcode instruction pointers
• Wider pipeline latches to hold state for 2 contexts
• Branch prediction
– RAS and branch history buffer duplicated
– Global history shared, but tagged with logical processor ID
Trace Cache Hit
Trace Cache Miss
Hyper-threaded Execution
Execution Modes
• Single-task (ST), Multi-task (MT)
– ST0, ST1: only logical processor 0 (or 1) is active
– HALT: moves the processor from MT to ST0 or ST1, depending on which logical processor executes it
– An interrupt sent to the halted logical processor returns the processor to MT
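The mode transitions can be written as a small state function. This Python sketch covers only the two transitions named on the slide (HALT leaving MT, and an interrupt returning to MT); the function name `next_mode` and the string encoding of modes and events are hypothetical, and any case the slide does not describe is simply left unchanged here.

```python
def next_mode(mode, event, lp):
    """Return the new execution mode after `event` on logical processor `lp` (0 or 1).

    Modes: "MT" (both active), "ST0" (only LP0 active), "ST1" (only LP1 active).
    Events: "HALT" executed by `lp`, or "INTERRUPT" delivered to `lp`.
    """
    if mode == "MT" and event == "HALT":
        # The halting processor drops out; the other one stays active.
        return "ST1" if lp == 0 else "ST0"
    halted = {"ST0": 1, "ST1": 0}.get(mode)   # which LP is halted, if any
    if event == "INTERRUPT" and lp == halted:
        # Waking the halted processor brings both back: multi-task mode.
        return "MT"
    # Cases the slides do not describe are left unchanged in this sketch.
    return mode
```

For instance, `next_mode("MT", "HALT", 0)` gives `"ST1"`, and a later `next_mode("ST1", "INTERRUPT", 0)` returns to `"MT"`.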
HT Performance – OLTP
HT Performance – Web Server