Multithreaded architecture

Other Architectures & Examples
Multithreaded architectures
Dataflow architectures
Multiprocessor examples
1st May, 2006
Anshul Kumar, CSE IITD
Context switching
• Delays and poor resource utilization due to
– data/control hazards
– cache misses
– waiting for some event
• Solution –
– context switch to another thread
• Context switch mechanism –
– operating system - slow
– hardware - fast
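
The benefit of fast hardware context switching can be illustrated with a toy simulation. This is a minimal sketch, not part of the lecture: the function name `utilization`, the fixed run/stall pattern, and the zero-cost switch are assumptions made for illustration. Each thread issues `run_cycles` instructions and then waits `stall_cycles` (for example, on a cache miss); when a thread stalls, the processor switches to the next ready thread.

```python
def utilization(num_threads, run_cycles, stall_cycles, total_cycles=1600):
    """Fraction of cycles in which some thread issues an instruction.

    Assumed model (hypothetical): every thread repeats `run_cycles` busy
    cycles followed by `stall_cycles` waiting cycles, and switching
    between threads costs nothing (hardware context switch).
    """
    ready_at = [0] * num_threads            # cycle at which each thread can run again
    remaining = [run_cycles] * num_threads  # instructions left before the next stall
    busy = 0
    current = 0                             # thread that currently owns the pipeline

    for cycle in range(total_cycles):
        # Look for a ready thread, starting with the current one.
        for k in range(num_threads):
            t = (current + k) % num_threads
            if ready_at[t] > cycle:
                continue                    # thread t is still waiting
            busy += 1
            remaining[t] -= 1
            if remaining[t] == 0:
                # Thread t stalls; hand the pipeline to the next thread.
                ready_at[t] = cycle + 1 + stall_cycles
                remaining[t] = run_cycles
                current = (t + 1) % num_threads
            else:
                current = t
            break
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", utilization(n, run_cycles=4, stall_cycles=12))
```

Under these assumptions, a single thread with 4 run cycles and 12 stall cycles keeps the processor busy a quarter of the time, while four such threads hide the stall completely.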
Multithreaded architecture
• Hardware context switching
• Models
– control flow or hybrid (control flow, data flow)
• Granularity
– fine grain or coarse grain
• Memory organization
– shared?, distributed?, cache coherent?
• No. of threads
– small, medium, large
ILP and Multithreading
[Figure: comparison of ILP, Coarse MT, Fine MT, and SMT. Source: Hennessy and Patterson]
Wikipedia
Chip level multithreading
Executing instructions from multiple threads within
one processor chip at the same time.
• Multithreading: Interleaved issue of multiple
instructions from different threads
• Simultaneous multithreading (SMT): Issue
multiple instructions from multiple threads in one
cycle.
• Chip-level multiprocessing (CMP or Multicore):
integrate two or more superscalar processors into
one chip, each executing one thread independently
• Any combination of multithreading/SMT/CMP
Historical Examples
Machine          Granularity  Procs              Threads/proc        Memory               Year
HEP (Denelcor)   fine         max 16             8 active, 64 max    shared, centralized  1978
Tera             fine         max 256            128                 distributed shared   1990
Alewife (MIT)    coarse       max 512 (Sparcle)  1 active, 3 loaded  CC                   1990
Modern examples
• Pentium 4: Hyperthreading
• MIPS MT: fine grained multithreading
• IBM Power 5: dual core, 2 threads each
• Ultrasparc T1: 8 cores with 4 threads each
HEP
[Figure: HEP processor organization. Labels: control loop, 8-stage pipeline, scheduler function unit (SFU), PSW queue, matching unit, program memory, increment control, operand fetch, registers, function units FU1, FU2, ..., FUn, to/from data memory.]
Control Flow & Data Flow models
• Control Flow (von Neumann)
– control flows through a sequence of
instructions, branches can alter the flow
– instructions get data from or put data in
memory
– explicit parallelism through control operators –
fork/join
• Data Flow
– instructions are triggered by availability of data
– data flows from instruction to instruction
– explicit parallelism
Dataflow Model
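[Figure: dataflow graph for R=(A-B)*(B+1). A and B feed a "-" node producing A-B; B and the constant 1 feed a "+" node producing B+1; both results feed a "*" node producing R.]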
Dataflow Program
Label  Instruction  Operands   Result  Destination(s)
L1:    Compute B               B       L2/2, L3/1
L2:    -            A, B       A-B     L4/1
L3:    +            B, 1       B+1     L4/2
L4:    *            A-B, B+1   R       L6/1
R=(A-B)*(B+1)
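
The firing rule of this program can be sketched in a few lines of Python. This is an illustrative interpreter, not the lecture's own code: the names `PROGRAM` and `run` are hypothetical, and a destination such as L4/1 is read as "operand slot 1 of instruction L4". L1 (Compute B) is modelled simply by delivering B to L2/2 and L3/1, and the token sent to L6/1 is treated as the program output.

```python
import operator

# Instruction label -> (operation, list of (destination label, operand slot)).
PROGRAM = {
    "L2": (operator.sub, [("L4", 1)]),  # A - B
    "L3": (operator.add, [("L4", 2)]),  # B + 1
    "L4": (operator.mul, [("L6", 1)]),  # (A - B) * (B + 1)
}


def run(a, b):
    """Execute the dataflow program; an instruction fires when both operands are present."""
    operands = {label: {} for label in PROGRAM}
    outputs = {}
    # Initial tokens: A to L2/1, B to L2/2 and L3/1, the constant 1 to L3/2.
    tokens = [("L2", 1, a), ("L2", 2, b), ("L3", 1, b), ("L3", 2, 1)]

    while tokens:
        label, slot, value = tokens.pop()   # processing order does not affect the result
        if label not in PROGRAM:
            outputs[(label, slot)] = value  # token leaves the program (here: L6/1)
            continue
        op, destinations = PROGRAM[label]
        operands[label][slot] = value
        if len(operands[label]) == 2:       # both operands available: fire
            result = op(operands[label][1], operands[label][2])
            operands[label].clear()
            tokens.extend((dest, s, result) for dest, s in destinations)
    return outputs


if __name__ == "__main__":
    print(run(7, 3))  # the value delivered to L6/1 is (7-3)*(3+1)
```

No program counter appears anywhere: the order in which L2 and L3 fire is determined only by token arrival.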
Static Dataflow Architecture
[Figure: static dataflow processing element. Labels: fetch unit, instruction queue, function units FU1, FU2, ..., FUn, update unit, activity store, to/from other PEs.]
Tagged-token dataflow architecture
[Figure: tagged-token dataflow processing element. Labels: matching unit, matching store, fetch unit, token queue, instruction/data memory, function units FU1, FU2, ..., FUn, form token unit, to/from other PEs.]
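
The role of the matching unit and matching store can be sketched as follows. This is a simplified model under stated assumptions, not a description of any particular machine: tokens are assumed to carry a tag (for example, an iteration number), a destination instruction, an operand slot, and a value, and all names (`PROGRAM`, `run_tagged`) are hypothetical. Two tokens match when they have the same tag and the same destination, which lets several instances of R=(A-B)*(B+1) be in flight at once.

```python
import operator
from collections import deque

# Instruction label -> (operation, list of (destination label, operand slot)).
PROGRAM = {
    "L2": (operator.sub, [("L4", 1)]),
    "L3": (operator.add, [("L4", 2)]),
    "L4": (operator.mul, [("L6", 1)]),
}


def run_tagged(inputs):
    """inputs maps a tag to an (a, b) pair; returns a dict mapping tag to result."""
    token_queue = deque()
    for tag, (a, b) in inputs.items():
        token_queue.extend([
            (tag, "L2", 1, a), (tag, "L2", 2, b),
            (tag, "L3", 1, b), (tag, "L3", 2, 1),
        ])

    matching_store = {}  # (tag, label) -> (slot, value) of the token waiting for a partner
    results = {}

    while token_queue:
        tag, label, slot, value = token_queue.popleft()
        if label not in PROGRAM:
            results[tag] = value             # token addressed outside the program
            continue
        partner = matching_store.pop((tag, label), None)
        if partner is None:
            # First operand to arrive: park it until its partner shows up.
            matching_store[(tag, label)] = (slot, value)
            continue
        # Matching pair found: fetch the instruction, execute, form new tokens.
        operands = dict([partner, (slot, value)])
        op, destinations = PROGRAM[label]
        result = op(operands[1], operands[2])
        token_queue.extend((tag, dest, s, result) for dest, s in destinations)
    return results


if __name__ == "__main__":
    print(run_tagged({0: (7, 3), 1: (10, 4)}))
```

Tokens with different tags never match each other, so the two instances proceed independently even though they share the same instructions.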
UMA Examples
• Earlier approach: large number of
processors (e.g. Denelcor HEP, NYU
Ultracomputer)
• Now realized: good only for a small number
of processors (e.g. Encore Multimax, 1980s; SGI Power Challenge, 1990s)
SGI Power Challenge
• 18 MIPS R8000 processors
• 16 GB RAM, 8-way interleaved
• 4 POWERchannel-2 I/O buses, each 320 MB/s
• POWERpath-2: split-transaction shared bus
(256-bit data, 40-bit address)
• Snoopy cache coherence protocol
NUMA Examples
• BBN TC2000
• IBM RP3
• Hector
• Cray T3D
Hector
• Hierarchical structure
– global ring
– local rings
– stations
– Proc module (P+C+M)
– I/O module
Hector
[Figure: Hector organization. Stations attach to local rings, and local rings attach to the global ring. Within a station, a station controller and a station bus connect the Proc modules and an I/O module.]
Cray T3D
• Alpha 21064 Proc
• Cray Y-MP host
• up to 128 GB memory
• 4x4x4 3D torus, configurable up to 8x8x8
• 2 PEs in each node
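
The wrap-around links of a 3D torus are easy to express in code. The helpers below are a generic sketch of torus addressing, not the T3D's actual routing hardware; the function names and the (x, y, z) coordinate convention are assumptions.

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """Return the six neighbours of `node` in a 3D torus with the given dimensions."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wrap-around link at the edge of each ring.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def torus_distance(a, b, dims=(8, 8, 8)):
    """Minimum hop count between nodes a and b, going either way round each ring."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


def pe_count(dims=(8, 8, 8), pes_per_node=2):
    """Number of processing elements, with 2 PEs per node as on the slide."""
    return dims[0] * dims[1] * dims[2] * pes_per_node


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0)))
    print(torus_distance((0, 0, 0), (7, 7, 7)))
    print(pe_count())
```

Because of the wrap-around links, opposite corners of the 8x8x8 torus are only 3 hops apart, and the largest distance between any two nodes is 12 hops.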
CC-NUMA examples
Machine              Nodes                           Mem             Cache               Net
Wisconsin Multicube  single proc                     per col bus     snoopy              bus grid
Aquarius Multimulti  single proc                     per node        snoopy + directory  bus grid
Stanford Dash        cluster: 4 R3000 + FPU on bus   per cluster     snoopy + directory  pair of meshes
Stanford Flash       single proc: T5 + magic chip    per node        directory           2D mesh
Convex Exemplar      hyper node: 8 PA-RISC           per hyper node  SCI                 X bar (hyper node), multi rings
Magic chip: memory + I/O + network controller
COMA examples
• DDM (Data Diffusion Machine)
– single bus (split transaction)
– can be made hierarchical
• KSR 1
– hierarchical rings
– distributed directory is a matrix :
rows for pages, columns for caches
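
The matrix view of the directory can be sketched as a table of presence bits. This is only an illustration of the rows-for-pages, columns-for-caches idea; it does not model the actual KSR 1 directory entries or coherence states, and the class and method names are hypothetical.

```python
class MatrixDirectory:
    """Presence matrix: one row per page, one column per cache."""

    def __init__(self, num_pages, num_caches):
        self.present = [[False] * num_caches for _ in range(num_pages)]

    def add_copy(self, page, cache):
        """Record that `cache` now holds a copy of `page`."""
        self.present[page][cache] = True

    def drop_copy(self, page, cache):
        """Record that `cache` no longer holds `page`."""
        self.present[page][cache] = False

    def holders(self, page):
        """Scan the page's row to find every cache holding a copy."""
        return [cache for cache, bit in enumerate(self.present[page]) if bit]


if __name__ == "__main__":
    directory = MatrixDirectory(num_pages=4, num_caches=3)
    directory.add_copy(page=2, cache=0)
    directory.add_copy(page=2, cache=2)
    print(directory.holders(2))
```

Reading a row answers "which caches hold this page", which is the question a COMA machine has to answer because data has no fixed home location.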
Distributed Memory Architecture Examples
Machine   Comp. proc   Comm. proc   Vec. proc   Switch   Topology
nCUBE2
iPSC2
Intel Paragon
Genesis
Manna
Parsytec
Transtech Paramid
IBM SP2
Meiko CS-2
Parsys SN9800
custom
i386
i860
i860
i870
i860
P.PC601
i860
i870
i860
T805
T805
custom
Power2
SPARC
i860
custom
custom
Fujitsu custom
T900
T900
yes
custom
yes
custom
hyper cube
hyper cube
2D mesh
2 level X bar
16x16 X bar hierarch.
C004
3D mesh
C004
variable
C104
fat tree
fat tree
hierarch sw
References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.