Other Architectures & Examples
Multithreaded architectures
Dataflow architectures
Multiprocessor examples
1st May, 2006
Anshul Kumar, CSE IITD

Context switching
• Delays and poor resource utilization due to
  – data/control hazards
  – cache misses
  – waiting for some event
• Solution
  – context switch to another thread
• Context switch mechanism
  – operating system: slow
  – hardware: fast

Multithreaded architecture
• Hardware context switching
• Models
  – control flow or hybrid (control flow, data flow)
• Granularity
  – fine grain or coarse grain
• Memory organization
  – shared?, distributed?, cache coherent?
• No. of threads
  – small, medium, large

ILP and Multithreading
[Figure: ILP, Coarse MT, Fine MT, SMT. Sources: Hennessy and Patterson; Wikipedia]

Chip level multithreading
Executing instructions from multiple threads within one processor chip at the same time.
• Multithreading: interleaved issue of multiple instructions from different threads
• Simultaneous multithreading (SMT): issue of multiple instructions from multiple threads in one cycle
• Chip-level multiprocessing (CMP or multicore): integrates two or more superscalar processors into one chip, each executing one thread independently
• Any combination of multithreading/SMT/CMP

Historical Examples
Machine              Granularity  Procs            Threads/proc        Memory               Year
HEP from Denelcor    fine         max 16           8 active, 64 max    shared, centralized  1978
Tera                 fine         max 256          128                 shared, distributed  1990
Alewife (MIT)        coarse       max 512 Sparcle  1 active, 3 loaded  CC                   1990

Modern examples
• Pentium 4: Hyperthreading
• MIPS MT: fine grained multithreading
• IBM Power 5: dual core, 2 threads each
• Ultrasparc T1: 8 cores with 4 threads each

HEP
[Figure: HEP organization. Labels: control loop, 8 stage pipeline, scheduler function unit, PSW queue, matching unit, program memory, increment control, operand fetch, registers, SFU, FU1, FU2, ..., FUn, to/from data memory]

Control Flow & Data Flow models
• Control Flow (von Neumann)
  – control flows through a sequence of instructions; branches can alter the flow
  – instructions get data from or put data in memory
  – explicit parallelism through control operators (fork/join)
• Data Flow
  – instructions are triggered by availability of data
  – data flows from instruction to instruction
  – explicit parallelism

Dataflow Model
[Figure: dataflow graph for R=(A-B)*(B+1). Inputs A, B and constant 1; node "-" produces A-B, node "+" produces B+1, node "*" produces R]

Dataflow Program
[Figure: dataflow program for R=(A-B)*(B+1)]
• L1: Compute B; result sent to L2/2 and L3/1
• L2: A - B; result sent to L4/1
• L3: B + 1; result sent to L4/2
• L4: (A-B) * (B+1); result R sent to L6/1

Static Dataflow Architecture
[Figure: processing element. Labels: activity store, fetch unit, instruction queue, FU1, FU2, ..., FUn, update unit, to/from other PEs]

Tagged-token dataflow architecture
[Figure: processing element. Labels: matching unit, matching store, fetch unit, token queue, instruction/data memory, FU1, FU2, ..., FUn, form token unit, to/from other PEs]

UMA Examples
• Earlier approach: large number of processors (e.g. Denelcor HEP, NYU Ultracomputer)
• Now realized: good only for a small number of processors (e.g. Encore Multimax, 1980s; SGI Power Challenge, 1990s)

SGI Power Challenge
• 18 MIPS R8000
• 16 GB RAM, 8-way interleaved
• 4 Power channel-2 (I/O bus), each 320 MB/s
• Power path-2: split transaction shared bus (256 bit data, 40 bit address)
• Snoopy cache coherence protocol

NUMA Examples
• BBN TC2000
• IBM RP3
• Hector
• Cray T3D

Hector
• Hierarchical structure
  – global ring
  – local rings
  – stations
  – Proc module (P+C+M)
  – I/O module

Hector
[Figure: a global ring connects local rings; each local ring connects stations; each station has a station controller and a station bus with Proc modules and an I/O module]

Cray T3D
• Alpha 21064 Proc
• Cray Y-MP host
• up to 128 GB memory
• 3D torus: 4x4x4, configurable up to 8x8x8
• 2 PEs in each node

CC-NUMA examples
Machine              Nodes                                Mem             Cache              Net
Wisconsin Multicube  single proc                          per col bus     snoopy             bus grid
Aquarius Multimulti  single proc                          per node        snoopy+directory   bus grid
Stanford Dash        cluster, 4 R3000 + FPU on bus        per cluster     snoopy+directory   pair of meshes
Stanford Flash       single proc, T5 + magic chip         per node        directory          2D mesh
Convex Exemplar      hyper node, 8 PA-RISC                per hyper node  SCI                X bar (hyper node), multi rings
Magic chip: memory + I/O + network controller

COMA examples
• DDM (Data Diffusion Machine)
  – single bus (split transaction)
  – can be made hierarchical
• KSR 1
  – hierarchical rings
  – distributed directory is a matrix: rows for pages, columns for caches

Distr Mem Arch Examples
[Table: columns are Machine, Comp. proc, Comm. proc, Vec. proc, Switch, Topology]
Machines: nCUBE2, iPSC2, Intel Paragon, Genesis, Manna, Parsytec, Transtech Paramid, IBM SP2, Meiko C32, Parsys SN9800
Processor entries: custom i386 i860 i860 i870 i860 P.PC601 i860 i870 i860 T805 T805 custom Power2 SPARC i860 custom custom Fujitsu custom T900 T900
Vec./switch/topology entries: yes custom yes custom hyper cube hyper cube 2D mesh 2 level X bar 16x16 X bar hierarch. C004 3D mesh C004 variable C104 fat tree fat tree hierarch sw

References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
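
Example: firing rule for the dataflow program
The dataflow program slide (L1 to L4 computing R=(A-B)*(B+1)) can be tried out with a small token-driven interpreter. What follows is only a sketch under stated assumptions, not the mechanism of any particular machine: the `PROGRAM` table, the `run_dataflow` function and the treatment of L1 ("Compute B") as a node that simply forwards B are all hypothetical choices made for illustration. A node fires once every operand port holds a token, and its result is sent on as new tokens to the destinations shown on the slide (for example L2/2 means port 2 of L2).

```python
import operator
from collections import deque

# Hypothetical encoding of the slide's program:
# label -> (operation, number of operand ports, destinations as (label, port)).
PROGRAM = {
    "L1": (lambda b: b, 1, [("L2", 2), ("L3", 1)]),  # "Compute B", modeled as identity (assumption)
    "L2": (operator.sub, 2, [("L4", 1)]),            # A - B
    "L3": (operator.add, 2, [("L4", 2)]),            # B + 1
    "L4": (operator.mul, 2, [("L6", 1)]),            # (A-B) * (B+1)
}


def run_dataflow(a, b):
    """Fire each node as soon as all of its operand tokens have arrived."""
    operands = {label: {} for label in PROGRAM}  # label -> {port: value}
    results = {}       # tokens addressed to labels outside PROGRAM (here L6/1)
    firing_order = []  # labels in the order they fired
    # Initial tokens: (destination label, port, value). The constant 1 is
    # injected as a token on L3/2 (assumption).
    tokens = deque([("L2", 1, a), ("L1", 1, b), ("L3", 2, 1)])

    while tokens:
        label, port, value = tokens.popleft()
        if label not in PROGRAM:
            results[(label, port)] = value
            continue
        op, arity, dests = PROGRAM[label]
        operands[label][port] = value
        if len(operands[label]) == arity:  # all operands present: the node is enabled
            args = [operands[label][p] for p in range(1, arity + 1)]
            out = op(*args)
            operands[label].clear()
            firing_order.append(label)
            for dest_label, dest_port in dests:
                tokens.append((dest_label, dest_port, out))
    return results[("L6", 1)], firing_order


if __name__ == "__main__":
    r, order = run_dataflow(a=7, b=3)
    print(r, order)  # prints: 16 ['L1', 'L2', 'L3', 'L4']
```

No program counter appears anywhere in the sketch. L2 receives A first but has to wait for B from L1, and L4 fires only after both L2 and L3 have delivered their results, which is the data-driven triggering described on the Control Flow & Data Flow models slide.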