Extensible Message Layers for Multimedia Cluster Computers
Dr. Craig Ulmer
Center for Experimental Research in Computer Systems

Outline
Background
- Evolution of cluster computers
- "Resource-rich" cluster computers for multimedia
Design of extensible message layers
- GRIM: General-purpose Reliable In-order Messages
Extensions
- Integrating peripheral devices
- Streaming computations
Host-to-host performance
Concluding remarks

Background: An Evolution of Cluster Computers

Cluster Computers
Cost-effective alternative to supercomputers
- Built from a number of commodity workstations
- Specialized network hardware and software
Result: a large pool of host processors
[Diagram: hosts (CPU, memory, I/O bus, network interface) connected by a system area network]

Improving Cluster Computers
- Adding more host CPUs
- Adding intelligent peripheral devices

Peripheral Device Trends
Increasingly independent, intelligent peripheral devices
- Feature on-card processing and memory facilities
- Examples: SAN NI, Ethernet, media capture, storage
Migration of computing power and bandwidth requirements to peripherals

Resource-Rich Cluster Computers
Inclusion of diverse peripheral devices
- Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
Processing takes place in both host CPUs and peripherals
[Diagram: cluster of hosts with Ethernet, video capture, storage, and FPGA cards attached to SAN NIs]

Benefits of Resource-Rich Clusters
Employ cluster computing in new applications
- Real-time constraints
- I/O intensive
- Network
Example: digital libraries
- Enormous amounts of data
- Large number of network users
Example: multimedia
- Capture and process large streams of multimedia data
- CAVE or visualization clusters

Extensible Message Layers: Supporting Resource-Rich Cluster Computers

Problem: Utilizing Distributed Cluster Resources
- How is efficient intra-cluster communication provided?
- How can applications make use of the resources?
[Diagram: host CPUs and FPGA, RAID, video capture, and Ethernet resources with an unresolved interconnection]

Answer: A Flexible "Message Layer" (Communication Software)
Message layers are an enabling technology for clusters
- Enable a cluster to function as a single-image multiprocessor system
Current message layers
- Optimized for transmissions between host CPUs
- Peripheral devices only available in the context of the local host
What is needed
- Support efficient communication with both host CPUs and peripherals
- Ability to harness peripheral devices as a pool of resources

GRIM: An Implementation
General-purpose Reliable In-order Message Layer (GRIM): a message layer for resource-rich clusters
- Myrinet SAN backbone
- Both host CPUs and peripheral devices are endpoints
- Communication core implemented in the NI
[Diagram: CPU, FPGA card, network interface card, and storage card endpoints attached to the GRIM core over the system area network]

Per-hop Flow Control
End-to-end flow control is necessary for reliable delivery
- Prevents buffer overflows in the communication path
Endpoint-managed schemes
- Impractical for peripheral devices
Per-hop flow control scheme
- Transfer data as soon as the next stage can accept it
- Optimistic approach
[Diagram: DATA messages and ACKs exchanged hop by hop between sending endpoint, NIs, and receiving endpoint across the PCI bus and SAN]

Logical Channels
Multiple endpoints in a host share the NI
Employ multiple logical channels in the NI
- Each endpoint owns one or more logical channels
- A logical channel provides a virtual interface to the network
[Diagram: endpoints 1..n mapped to logical channels and scheduled onto the network by the NI]

Programming Interfaces: Active Messages
Message specifies a function to be executed at the receiver
- Similar to remote procedure calls, but lightweight
- Invoke operations at remote resources
Useful for constructing device-specific APIs
Example: interactions with a remote storage controller via AM_fetch_file() and AM_return_file() (see the sketch below)
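The slides name the handlers AM_fetch_file() and AM_return_file() but do not show GRIM's host API, so the following is only a minimal sketch of the active-message style: the grim_* functions, the message struct, and storage_read() are hypothetical names assumed for illustration.

```c
/* Hedged sketch of an active-message exchange with a remote storage endpoint.
 * Only the AM_fetch_file/AM_return_file handler names come from the slides;
 * every grim_* call and storage_read() is an assumed placeholder. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t src_endpoint;   /* endpoint that sent the message            */
    uint32_t handler_id;     /* which registered handler to run           */
    void    *payload;        /* reliably delivered, in-order payload data */
    uint32_t length;
} grim_am_t;

typedef void (*grim_am_handler_t)(const grim_am_t *msg);

/* Assumed API: register a handler ID and send an AM to a remote endpoint. */
int grim_am_register(uint32_t handler_id, grim_am_handler_t fn);
int grim_am_send(uint32_t dest_endpoint, uint32_t handler_id,
                 const void *payload, uint32_t length);

/* Assumed card-local helper on the storage controller. */
uint32_t storage_read(const char *name, void *buf, uint32_t max);

enum { AM_FETCH_FILE = 1, AM_RETURN_FILE = 2 };

/* Handler run at the storage-controller endpoint: read the requested file
 * and return its contents to the requester with a second active message. */
static void AM_fetch_file(const grim_am_t *msg)
{
    const char *filename = (const char *)msg->payload;
    char file_data[4096];
    uint32_t n = storage_read(filename, file_data, sizeof file_data);
    grim_am_send(msg->src_endpoint, AM_RETURN_FILE, file_data, n);
}

/* Storage endpoint startup: make the handler reachable by ID. */
void storage_endpoint_init(void)
{
    grim_am_register(AM_FETCH_FILE, AM_fetch_file);
}

/* Host-side usage: ask the remote storage endpoint for a file. */
void request_file(uint32_t storage_endpoint, const char *name)
{
    grim_am_send(storage_endpoint, AM_FETCH_FILE, name, (uint32_t)(strlen(name) + 1));
}
```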
Programming Interfaces: Remote Memory
Transfer blocks of data from one host to another
- The receiving NI executes the transfer directly
Read and write operations
- NI interacts with a kernel driver to translate virtual addresses
- Optional notification mechanisms
[Diagram: source CPU and memory, source NI, SAN, destination NI, destination memory]

Integrating Peripheral Devices (Hardware Extensibility)

Peripheral Device Overview
In GRIM, peripherals are endpoints
Intelligent peripherals
- Operate autonomously
- On-card message queues
- Process incoming active messages
- Eject outgoing active messages
Legacy peripherals
- Managed by a host application, or
- Driven by remote memory operations

Peripheral Device Examples
Server adaptor card
- Networked host on a PCI card (i960, Ethernet, SCSI)
- AM handlers implement a LAN-SAN bridge
Video capture card
- Specialized DMA engine (A/D capture into host memory)
- AM handlers capture data
Video display card
- Manipulate the frame buffer (AGP, D/A)
- Remote memory writes

Celoxica RC-1000 FPGA Card
FPGAs provide acceleration
- Load with application-specific circuits
Celoxica RC-1000 FPGA card
- Xilinx Virtex-1000 FPGA
- 8 MB SRAM (four banks, 0-3, with control and switching logic)
Hardware implementation
- Endpoint implemented as state machines
- AM handlers are circuits

FPGA Endpoint Organization
- FPGA card memory holds input queues, output queues, and application data
- Communication library API frame surrounds a circuit canvas
- User circuits 1..n plug into the user-circuit API

Example FPGA Configuration
Cryptography configuration
- DES, RC6, MD5, and an ALU
- FPGA area: control block 20%, MD5 26%, RC6 13%, DES 6%, ALU 5%, unused 30%
Occupies 70% of the FPGA
- Newer FPGAs are 8x the size
Operates with a 20 MHz clock
- Newer FPGAs are 6x faster
- 4 KB payload => 55 μs (73 MB/s)

Expansion: Sharing the FPGA
The FPGA has limited space for hardware circuits
- Host reconfigures the FPGA on demand
- An "FPGA function fault" occurs when a message requests a circuit that is not in the loaded configuration (e.g., "use circuit F" while configuration A holds circuits X and Y); the host CPU then loads the configuration that contains the requested circuit (about 150 ms)
- Circuit state is kept in SRAM bank 0 across reconfigurations
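A minimal host-side sketch of the function-fault handling described above follows. The slides only state the behavior (detect a missing circuit, reload the matching configuration in roughly 150 ms, keep state in SRAM bank 0); the config table, the fpga_* helpers, and the save/restore calls are all assumptions for illustration.

```c
/* Hedged sketch of host-driven FPGA reconfiguration on a "function fault".
 * All names below are hypothetical; only the overall flow is from the slides. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const char     *bitstream_path;  /* configuration image kept on the host   */
    const uint32_t *circuit_ids;     /* circuits this configuration provides   */
    size_t          num_circuits;
} fpga_config_t;

/* Assumed card-access helpers. */
int  fpga_load_bitstream(const char *path);   /* ~150 ms reconfiguration        */
void fpga_save_state_to_sram0(void);          /* preserve endpoint state         */
void fpga_restore_state_from_sram0(void);

static const fpga_config_t *current_config;

static int config_has_circuit(const fpga_config_t *cfg, uint32_t circuit_id)
{
    for (size_t i = 0; i < cfg->num_circuits; i++)
        if (cfg->circuit_ids[i] == circuit_id)
            return 1;
    return 0;
}

/* Called when an incoming message names a circuit that may not be loaded. */
int handle_function_fault(uint32_t circuit_id,
                          const fpga_config_t *configs, size_t num_configs)
{
    if (current_config && config_has_circuit(current_config, circuit_id))
        return 0;                            /* no fault: circuit already present */

    for (size_t i = 0; i < num_configs; i++) {
        if (config_has_circuit(&configs[i], circuit_id)) {
            fpga_save_state_to_sram0();      /* keep state across the reload      */
            if (fpga_load_bitstream(configs[i].bitstream_path) != 0)
                return -1;
            fpga_restore_state_from_sram0();
            current_config = &configs[i];
            return 0;
        }
    }
    return -1;                               /* no configuration offers the circuit */
}
```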
Extension: Streaming Computations (Software Extensibility)

Streaming Computation Overview
Programming method for distributed resources
- Establish a pipeline for streaming operations
- Example: multimedia processing
[Diagram: video capture card feeding media processors (host CPUs and a Celoxica RC-1000 FPGA endpoint) through NIs on the system area network]

Streaming Fundamentals
Computation: how is a computation performed?
- Active message approach
Forwarding: where are results transmitted?
- Programmable forwarding directory
Example: an incoming message (destination: FPGA, forward entry: X, AM: perform FFT) is handled by a computational circuit (FFT); the forwarding directory produces the outgoing message (destination: host, forward entry: X, AM: receive FFT)

Host-to-Host Performance: Transferring Data Between Two Host-Level Endpoints

Host-to-Host Communication Performance
Host-to-host transfers are the standard benchmark
Three phases of data transfer
- Injection is the most challenging
Overall communication path: source memory, source NI, SAN, destination NI, destination memory (active messages and remote memory operations)

Host-NI: Data Injections
Host-NI transfers are challenging
- The host lacks a DMA engine
- Options: programmed I/O by the CPU, or DMA across the memory controller and PCI bus
Multiple transfer methods
- Automatically select the best method
Result: Tunable PCI Injection Library (TPIL)

TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host
[Chart: injection bandwidth (MB/s) vs. injection size (bytes) for DMA 0-Copy, DMA 1-Copy DB, DMA 1-Copy, PIO SSE, PIO MMX, and PIO Memcpy]
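The slides state only that TPIL selects among several injection methods automatically. The sketch below shows one plausible size-based selection scheme; the thresholds, the tpil_* names, and the low-level primitives are assumptions, not TPIL's actual heuristics.

```c
/* Hedged sketch of size-based injection-method selection in the spirit of
 * TPIL. Real thresholds would be tuned per host/NI pairing. */
#include <stddef.h>

typedef enum {
    TPIL_PIO_MEMCPY,    /* CPU store loop                     */
    TPIL_PIO_SSE,       /* CPU stores using wide SSE moves    */
    TPIL_DMA_ONE_COPY,  /* DMA from a pinned bounce buffer    */
    TPIL_DMA_ZERO_COPY  /* DMA directly from user pages       */
} tpil_method_t;

/* Assumed low-level injection primitives (PIO and DMA paths). */
void pio_copy_to_ni(volatile void *ni_window, const void *src, size_t len);
void dma_inject(const void *src, size_t len, int zero_copy);

static tpil_method_t tpil_select_method(size_t len)
{
    /* Small messages: programmed I/O avoids DMA setup cost.
     * Large messages: DMA frees the CPU and uses PCI burst transfers. */
    if (len <= 256)        return TPIL_PIO_MEMCPY;
    else if (len <= 4096)  return TPIL_PIO_SSE;
    else if (len <= 65536) return TPIL_DMA_ONE_COPY;
    else                   return TPIL_DMA_ZERO_COPY;
}

void tpil_inject(volatile void *ni_window, const void *src, size_t len)
{
    switch (tpil_select_method(len)) {
    case TPIL_PIO_MEMCPY:
    case TPIL_PIO_SSE:
        pio_copy_to_ni(ni_window, src, len);
        break;
    case TPIL_DMA_ONE_COPY:
        dma_inject(src, len, /*zero_copy=*/0);  /* stage through a pinned buffer */
        break;
    case TPIL_DMA_ZERO_COPY:
        dma_inject(src, len, /*zero_copy=*/1);  /* pin and translate user pages  */
        break;
    }
}
```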
Overall Communication Pipeline
Three phases of transmission: sending host-NI, NI-NI, receiving NI-host
- Optimization: use fragmentation to increase utilization
- Optimization: allow cut-through transmissions
[Diagram: messages 1-3 overlapping across the host-NI, NI-NI, and NI-host stages over time, shortening the overall transmission time]

Overall Host-to-Host Performance
[Chart: bandwidth (MB/s) vs. message size (bytes) for P4 and P3 hosts with LANai 9 and LANai 4 NIs]
Host        NI       Latency (μs)  Bandwidth (MB/s)
P4-1.7GHz   LANai 9  8             146
P4-1.7GHz   LANai 4  14.5          108
P3-550MHz   LANai 9  9.5           116
P3-550MHz   LANai 4  14            96

Comparison to Existing Message Layers
[Charts: latency (μs) and bandwidth (MB/s) for AM, AM-II, BIP, FM, LFC, Trapeze, and GM, alongside GM and GRIM on LANai 4 and LANai 9]

Concluding Remarks

Key Contributions
Framework for communication in resource-rich clusters
- Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces
- Comparable performance to state-of-the-art message layers
Extensible for peripheral devices
- Suitable for intelligent and legacy peripherals
- Methods for managing card resources
Extensible for higher-level programming abstractions
- Endpoint-level: streaming computations and sockets emulation
- NI-level: multicast support

Future Directions
Continued work with GRIM
- Video card vendors opening cards to developers
- Myrinet-connected embedded devices
Adaptation to other network substrates
- Gigabit Ethernet appealing because of cost
- Modification to transmission protocols
- InfiniBand technology promising
Active system area networks
- FPGA chips beginning to feature gigabit transceivers
- Use FPGA chips as networked processing devices

Additional Research Projects
Wireless sensor networks (NASA JPL research)
- In-situ WSNs for the exploration of Mars
Communication
- Self-organization
- Routing
SensorSim
- Java simulator
- Evaluate protocols
PeZ: Pole-Zero Editor for MATLAB

Related Publications
- A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
- Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
- A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
- An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.
Papers and software available at http://www.CraigUlmer.com/research

Backup Slides

Performance: FPGA Computations
Clock speed: 20 MHz; operation latency: 55 μs for a 4 KB payload (73 MB/s)
Stage                Clocks
Acquire SRAM         8
Detect new message   4
Fetch header         7
Lookup forwarding    5
Fetch payload        1024
Computation          1024
Store results        1024
Store header         16
Update queues        3
Release SRAM         1
[Diagram: SRAM banks 0-3 (incoming queues, user pages 0 and 1, outgoing queues), control/status port, fetch/decode, scratchpad controllers, message generator, results cache, ports A-C, built-in ALU ops]

Expansion: Sharing On-Card Memory
Limited on-card memory for storing application data
- Construct a virtual memory system for on-card memory
- Swap space is host memory
[Diagram: page frames in SRAM 1 and SRAM 2 backed by pages in host memory; a page fault brings the needed page onto the card for the user-defined circuits]

RC-1000 Challenges
Hardware implementation
- Queue state machines
Memory locking
- SRAM is single-ported
- Arbitrate for use
CPU / NI contention
- NI manages the FPGA lock

Example: Autonomous Spaceborne Clusters
NASA Remote Exploration and Experimentation
- Spaceborne vehicle processes data locally
- "Clusters in the sky"
Number of peripheral devices
- Data sensors
- FPGAs and DSPs
Adaptive hardware
- Modify functionality after deployment

Performance: Card Interactions
Acquire FPGA SRAM
- CPU-NI: 20 μs; NI: 8 μs
Inject a 4 KB message to the FPGA
- CPU: 58 μs (70 MB/s); NI: 32 μs (128 MB/s)
Release FPGA SRAM
- CPU-NI: 8 μs; NI: 5 μs

Example: Digital Libraries
Enormous amount of data and users
- Intelligent LAN and storage cards to manage requests
[Diagram: clients connect through intelligent LAN adaptors; storage adaptors hold files A-H, I-R, and S-Z across hosts on the SAN backbone]

Cyclone Systems I2O Server Adaptor Card
Networked host on a PCI card
- i960 Rx processor, DMA engines, dual 10/100 Ethernet, SCSI, ROM, and DRAM on a local bus, with primary and secondary PCI interfaces
Integration with GRIM
- Interacts directly with the NI
- Ported host-level endpoint software
Utilized as a LAN-SAN bridge

GRIM Multicast Extensions
Distribute the same message to multiple receivers
- Tree-based distributions
- Replicate the message at the NI
- Messages are recycled back into the network
Extensions to the NI's core communication operations
- Recycled messages use a separate logical channel
- Utilize per-hop flow control for reliable delivery
[Diagram: a multicast tree rooted at endpoint A fanning out to endpoints B-E through NI-level replication]

Multicast Performance (8 Hosts)
[Chart: multicast and unicast round-trip time and injection overhead (μs) vs. message size; LANai 4, P4-1.7 GHz hosts]

Multicast Observations
Beneficial: reduces sending overhead
Performance loss for large messages
- Dependent on NI memory copy bandwidth
On-card memory copy benchmark
- LANai 4: 19 MB/s
- LANai 9: 66 MB/s

Extension: Sockets Emulation
Berkeley sockets is a communication standard
- Utilized in numerous distributed applications
GRIM provides sockets API emulation
- Functions for intercepting socket calls
- AM handler functions for buffering connection data
A write() on the sender is intercepted and generates an "append socket data" active message; the receiver's AM handler appends the payload to the matching socket's buffer, where read() extracts it (see the sketch below)
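A minimal sender-side sketch of the interception just described follows. The grim_* names, the handler ID, and the per-descriptor lookup table are assumptions; only the intercept-write()/append-AM flow comes from the slides, and a real build might interpose write() with LD_PRELOAD instead of calling a wrapper directly.

```c
/* Hedged sketch of sender-side socket emulation over GRIM active messages. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum { AM_APPEND_SOCKET_DATA = 10 };   /* hypothetical handler ID */

typedef struct {
    int      fd;              /* local socket descriptor             */
    uint32_t remote_endpoint; /* GRIM endpoint holding the other end */
    uint32_t remote_socket;   /* socket identifier at that endpoint  */
} grim_socket_t;

/* Assumed GRIM primitive and socket-table lookup. */
int grim_am_send(uint32_t endpoint, uint32_t handler_id, const void *buf, uint32_t len);
grim_socket_t *grim_socket_lookup(int fd);

/* Intercepted write(): if the descriptor maps to a GRIM-emulated socket,
 * forward the data as an "append socket data" active message; otherwise
 * fall through to the normal system call. */
ssize_t grim_write(int fd, const void *buf, size_t count)
{
    grim_socket_t *s = grim_socket_lookup(fd);
    if (s == NULL)
        return write(fd, buf, count);        /* not ours: use the kernel path */

    /* Prepend the remote socket ID so the receiver's AM handler can append
     * the payload to the right connection buffer. */
    uint8_t msg[4096];
    size_t  chunk = count < sizeof msg - 4 ? count : sizeof msg - 4;
    memcpy(msg, &s->remote_socket, 4);
    memcpy(msg + 4, buf, chunk);

    if (grim_am_send(s->remote_endpoint, AM_APPEND_SOCKET_DATA,
                     msg, (uint32_t)(chunk + 4)) != 0)
        return -1;
    return (ssize_t)chunk;                   /* caller retries for remaining bytes */
}
```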
Sockets Emulation Performance
[Chart: bandwidth (MB/s) vs. transfer size (bytes) for GRIM sockets over LANai 4 vs. 100 Mb/s Ethernet; P4-1.7 GHz hosts]

Overall Performance: Store-and-Forward
Approach: single message, no overlap
- Three transmission stages: sending host-NI over PCI (132 MB/s), NI-NI over Myrinet (160 MB/s), receiving NI-host over PCI (132 MB/s)
- Expect roughly 1/3 of the bandwidth of an individual stage
[Chart: store-and-forward bandwidth vs. message size for LANai 9 and LANai 4; P3-550 MHz hosts]

Enhancement: Message Pipelining
Allow overlap with multiple in-flight messages
- GRIM uses AM and RM fragmentation/reassembly
- Performance depends on fragment size
[Chart: pipelined bandwidth vs. message size for 256 B, 1 KB, 4 KB, 16 KB, and 64 KB fragments; LANai 9, P3-550 MHz hosts]

Enhancement: Cut-through Transfers
Forward data as soon as it begins to arrive
- Cut-through at the sending and receiving NIs
[Chart: bandwidth vs. message size for sending & receiving cut-through, receiving-only, sending-only, and no cut-through; LANai 9, P3-550 MHz hosts]
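The pipelining enhancement above works by fragmenting large AM/RM transfers so the host-NI, NI-NI, and NI-host stages can overlap. A minimal sender-side sketch is below; the header layout, the grim_send_fragment() primitive, and the fixed 4 KB fragment size are assumptions chosen from the fragment sizes shown in the chart.

```c
/* Hedged sketch of sender-side fragmentation for message pipelining. */
#include <stdint.h>

#define GRIM_FRAGMENT_SIZE 4096u   /* one of the fragment sizes from the chart */

typedef struct {
    uint32_t message_id;    /* lets the receiver reassemble in order          */
    uint32_t offset;        /* byte offset of this fragment within the message */
    uint32_t total_length;  /* full message length                            */
    uint32_t frag_length;   /* bytes carried by this fragment                 */
} grim_frag_header_t;

/* Assumed primitive: hand one fragment to the NI. Per-hop flow control then
 * forwards it as soon as the next stage can accept it. */
int grim_send_fragment(uint32_t dest_endpoint, const grim_frag_header_t *hdr,
                       const void *payload);

int grim_send_pipelined(uint32_t dest_endpoint, uint32_t message_id,
                        const void *data, uint32_t total_length)
{
    const uint8_t *bytes = data;
    for (uint32_t off = 0; off < total_length; off += GRIM_FRAGMENT_SIZE) {
        grim_frag_header_t hdr = {
            .message_id   = message_id,
            .offset       = off,
            .total_length = total_length,
            .frag_length  = (total_length - off < GRIM_FRAGMENT_SIZE)
                                ? total_length - off : GRIM_FRAGMENT_SIZE,
        };
        /* Each fragment can cross the NI-NI stage while the next fragment is
         * still being injected, overlapping the three pipeline stages. */
        if (grim_send_fragment(dest_endpoint, &hdr, bytes + off) != 0)
            return -1;
    }
    return 0;
}
```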