As presented at Sandia (more intro material)

Extensible Message Layers for
Multimedia Cluster Computers
Dr. Craig Ulmer
Center for Experimental Research in Computer Systems
Outline
• Background
- Evolution of cluster computers
- Multimedia and "resource-rich" cluster computers
• Design of extensible message layers
- GRIM: General-purpose Reliable In-order Messages
• Extensions
- Integrating peripheral devices
- Streaming computations
• Host-to-host performance
• Concluding remarks
Background
An Evolution of Cluster Computers
Cluster Computers
• Cost-effective alternative to supercomputers
- Number of commodity workstations
- Specialized network hardware and software
• Result: Large pool of host processors
[Figure: several hosts, each a CPU and memory with a network interface on the I/O bus, connected by a system area network]
Improving Cluster Computers
• Adding more host CPUs
• Adding intelligent peripheral devices
[Figure: cluster with both host CPUs and peripheral devices]
Peripheral Device Trends
• Increasingly independent, intelligent peripheral devices
• Feature on-card processing and memory facilities
• Migration of computing power and bandwidth requirements to the peripherals
[Figure: host with CPU and peripherals: SAN NI, Ethernet, media capture, and storage]
Resource-Rich Cluster Computers
• Inclusion of diverse peripheral devices
- Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
• Processing takes place in host CPUs and peripherals
[Figure: cluster of hosts on a system area network; individual hosts attach Ethernet, video capture, storage, and FPGA cards through their SAN NIs]
Benefits of Resource-Rich Clusters
• Employ cluster computing in new applications
- Real-time constraints
- I/O intensive
- Network intensive
• Example: Digital libraries
- Enormous amounts of data
- Large number of network users
• Example: Multimedia
- Capture and process large streams of multimedia data
- CAVE or visualization clusters
Extensible Message Layers
Supporting Resource-Rich Cluster Computers
Problem: Utilizing distributed cluster resources
• How is efficient intra-cluster communication provided?
• How can applications make use of resources?
[Figure: hosts with CPUs and peripherals (FPGA, RAID, video capture, Ethernet) scattered across the cluster, with the interconnection between them left as a question mark]
Answer: Flexible “Message Layer”
Communication Software
• Message layers are an enabling technology for clusters
- Enable the cluster to function as a single-image multiprocessor system
• Current message layers
- Optimized for transmissions between host CPUs
- Peripheral devices available only in the context of the local host
• What is needed
- Support efficient communication with host CPUs and peripherals
- Ability to harness peripheral devices as a pool of resources
GRIM: An Implementation
A message layer for resource-rich clusters
General-purpose Reliable In-order Message Layer (GRIM)
• Message layer for resource-rich clusters
- Myrinet SAN backbone
- Both host CPUs and peripheral devices are endpoints
- Communication core implemented in the NI
[Figure: the GRIM core runs on the network interface card, connecting the host CPU, FPGA card, and storage card to the system area network]
Per-hop Flow Control
• End-to-end flow control is necessary for reliable delivery
- Prevents buffer overflows in the communication path
• Endpoint-managed schemes
- Impractical for peripheral devices
• Per-hop flow control scheme
- Transfer data as soon as the next stage can accept it
- Optimistic approach (sketch below)
[Figure: a message crosses the sending endpoint, PCI, sending NI, SAN, receiving NI, PCI, and receiving endpoint; each hop returns an ACK, and the reply travels back the same way]
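A minimal C sketch of the per-hop scheme described above: each stage forwards a packet as soon as the downstream stage has buffer space, and a per-hop ACK returns the buffer. The structure and names (hop_send, hop_ack, credit counts) are illustrative, not the actual GRIM implementation.

/* Per-hop flow control sketch: forward optimistically, retire on ACK. */
#include <stdint.h>
#include <string.h>

#define STAGE_BUFS 8            /* buffers the downstream stage exposes    */
#define PKT_BYTES  4096

struct hop {
    volatile int credits;       /* free buffers at the downstream stage    */
    uint8_t buf[STAGE_BUFS][PKT_BYTES];
    int     next_slot;
};

/* Called when the downstream stage acknowledges a packet (per-hop ACK). */
void hop_ack(struct hop *h)
{
    h->credits++;               /* that buffer is free again               */
}

/* Optimistic forwarding: send as soon as the next stage can accept. */
int hop_send(struct hop *h, const void *pkt, size_t len)
{
    if (h->credits == 0 || len > PKT_BYTES)
        return -1;              /* caller retries later                    */
    memcpy(h->buf[h->next_slot], pkt, len);  /* stand-in for a PCI/SAN DMA */
    h->next_slot = (h->next_slot + 1) % STAGE_BUFS;
    h->credits--;
    return 0;
}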
Logical Channels
• Multiple endpoints in a host share the NI
• Employ multiple logical channels in the NI
- Each endpoint owns one or more logical channels
- A logical channel provides a virtual interface to the network (sketch below)
[Figure: endpoints 1..n each map to a logical channel inside the network interface; a scheduler multiplexes the channels onto the network]
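A minimal sketch of the logical-channel idea, assuming a simple round-robin scheduler on the NI; the structures and names below are illustrative, not GRIM's actual data layout.

/* Logical channels sketch: per-endpoint queue pairs, round-robin service. */
#include <stddef.h>

#define MAX_CHANNELS 16
#define QUEUE_DEPTH  32

struct message { size_t len; /* payload descriptor would follow */ };

struct logical_channel {
    struct message sendq[QUEUE_DEPTH];
    struct message recvq[QUEUE_DEPTH];
    int send_head, send_tail;
    int in_use;
};

static struct logical_channel channels[MAX_CHANNELS];

/* One scheduler pass: pick the next channel with pending work. */
int schedule_next(int last)
{
    for (int i = 1; i <= MAX_CHANNELS; i++) {
        int c = (last + i) % MAX_CHANNELS;
        if (channels[c].in_use &&
            channels[c].send_head != channels[c].send_tail)
            return c;           /* this channel gets the network next      */
    }
    return -1;                  /* nothing to send                         */
}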
Programming Interfaces: Active Messages
• Message specifies a function to be executed at the receiver
- Similar to remote procedure calls, but lightweight
- Invoke operations at remote resources
• Useful for constructing device-specific APIs
• Example: Interactions with a remote storage controller (dispatch sketch below)
[Figure: the host CPU sends AM_fetch_file() across the SAN to a storage controller, which replies with AM_return_file()]
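A sketch of active-message dispatch in this style: the header carries a handler index that the receiving endpoint looks up and invokes. The AM_fetch_file()/AM_return_file() names come from the slide; the registration and dispatch routines below are hypothetical, not the GRIM API.

/* Active-message dispatch sketch: header names the handler to run. */
#include <stdio.h>
#include <stddef.h>

typedef void (*am_handler_t)(void *payload, size_t len);

#define MAX_HANDLERS 64
static am_handler_t handler_table[MAX_HANDLERS];

struct am_msg {
    int    handler_id;          /* which handler to run at the receiver    */
    size_t len;
    void  *payload;
};

void am_register(int id, am_handler_t fn) { handler_table[id] = fn; }

/* Receiver side: invoked by the endpoint for each arriving message. */
void am_dispatch(struct am_msg *m)
{
    if (m->handler_id < MAX_HANDLERS && handler_table[m->handler_id])
        handler_table[m->handler_id](m->payload, m->len);
}

/* Example handler in the spirit of the storage interaction on this slide. */
void am_return_file(void *payload, size_t len)
{
    (void)payload;
    printf("received %zu bytes of file data\n", len);
}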
Programming Interfaces: Remote Memory
• Transfer blocks of data from one host to another
- Receiving NI executes the transfer directly (sketch below)
• Read and write operations
- NI interacts with a kernel driver to translate virtual addresses
- Optional notification mechanisms
[Figure: data moves from one host's memory, through its NI, across the SAN, and directly into the remote host's memory]
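A sketch of the receiving-NI side of a remote-memory write, assuming a driver-supplied address-translation hook and an optional notification callback; the request layout and function names are illustrative only.

/* Remote-memory write sketch: translate the destination address and copy. */
#include <stdint.h>
#include <string.h>

struct rm_write_req {
    uint32_t dest_node;
    uint64_t dest_vaddr;        /* virtual address at the destination      */
    uint32_t len;
    int      notify;            /* optional completion notification        */
};

/* Receiving-NI side: translate, place the data, then notify if requested. */
int rm_write_execute(const struct rm_write_req *req, const void *data,
                     void *(*translate)(uint64_t vaddr),   /* driver hook  */
                     void (*notify_host)(void))
{
    void *dst = translate(req->dest_vaddr);
    if (!dst)
        return -1;              /* address not pinned or not translatable  */
    memcpy(dst, data, req->len);
    if (req->notify && notify_host)
        notify_host();
    return 0;
}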
Integrating Peripheral Devices
Hardware Extensibility
Peripheral Device Overview
• In GRIM, peripherals are endpoints
• Intelligent peripherals
- Operate autonomously
- On-card message queues
- Process incoming active messages
- Eject outgoing active messages
• Legacy peripherals
- Managed by the host application or the NI
- Remote memory operations
[Figure: an intelligent peripheral with its own CPU and on-card queues, next to a legacy peripheral that is driven through the host CPU and NI]
Peripheral Devices Examples
• Server adaptor card
- Networked host on a PCI card
- AM handlers for a LAN-SAN bridge
• Video capture card
- Specialized DMA engine
- AM handlers capture data
• Video display card
- Manipulate the frame buffer
- Remote memory writes
[Figure: block diagrams of the three cards: a server adaptor (i960, Ethernet, SCSI on PCI), a video capture card (A/D converter and DMA engine feeding host memory and a frame buffer), and an AGP video display card (frame buffer and D/A converter)]
Celoxica RC-1000 FPGA Card
• FPGAs provide acceleration
- Load with application-specific circuits
• Celoxica RC-1000 FPGA card
- Xilinx Virtex-1000 FPGA
- 8 MB SRAM (four banks)
• Hardware implementation
- Endpoint implemented as state machines
- AM handlers are circuits
[Figure: the RC-1000 card: four SRAM banks (0-3) attached to the FPGA's control and switching logic, with a PCI interface to the host]
FPGA Endpoint Organization
[Figure: FPGA endpoint organization: input and output message queues live in FPGA card memory; on the FPGA, a communication library API and a memory API surround a circuit canvas hosting user circuits 1..n, which exchange frames and application data through the user circuit API]
Example FPGA Configuration
• Cryptography configuration
- DES, RC6, MD5, and ALU circuits
• Occupies 70% of the FPGA
- Newer FPGAs are 8x larger
• Operates with a 20 MHz clock
- Newer FPGAs are 6x faster
- 4 KB payload => 55 μs (73 MB/s)
[Figure: FPGA area breakdown: control block 20%, MD5 26%, RC6 13%, DES 6%, ALU 5%, unused 30%]
Expansion: Sharing the FPGA
• FPGA has limited space for hardware circuits
- Host reconfigures the FPGA on demand
- "FPGA function fault" (sketch below)
[Figure: the FPGA holds one of several configurations (A, B, C), each containing different circuits (X, Y, E, F, G); a message requesting Circuit F triggers a function fault, the host CPU loads the configuration containing Circuit F (about 150 ms), and circuit state is preserved in SRAM 0]
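A sketch of the function-fault flow, assuming the host tracks which configuration is loaded and which circuits each configuration provides; all names and the circuit-to-configuration mapping below are hypothetical.

/* FPGA "function fault" sketch: reload the configuration that provides
 * the requested circuit, then retry the message. */
#include <stdio.h>

enum circuit { CIRCUIT_X, CIRCUIT_Y, CIRCUIT_E, CIRCUIT_F, NUM_CIRCUITS };

static int loaded_config = -1;
static const int config_of[NUM_CIRCUITS] = { 0, 0, 1, 1 };  /* circuit -> config */

static void load_configuration(int cfg)
{
    /* Stand-in for reprogramming the Virtex-1000 (~150 ms on this card)
     * and restoring any saved circuit state from SRAM. */
    printf("loading configuration %d\n", cfg);
    loaded_config = cfg;
}

void handle_message(enum circuit wanted)
{
    if (config_of[wanted] != loaded_config)
        load_configuration(config_of[wanted]);   /* the "function fault"   */
    /* ...hand the message to the now-resident circuit... */
}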
Extension: Streaming Computations
Software extensibility
Streaming Computation Overview
• Programming method for distributed resources
- Establish a pipeline of streaming operations
- Example: Multimedia processing
• Celoxica RC-1000 FPGA endpoint
[Figure: a video capture card feeds a pipeline of media processors across several hosts, each endpoint attached to the system area network through its NI]
Streaming Fundamentals
• Computation: How is a computation performed?
- Active message approach
• Forwarding: Where are results transmitted?
- Programmable forwarding directory (sketch below)
[Figure: an incoming message (destination: FPGA, forward entry: X, AM: perform FFT) passes through the FPGA's forwarding directory and computational circuits (FFT, ..., encrypt) and leaves as an outgoing message (destination: host, forward entry: X, AM: receive FFT)]
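A sketch of a programmable forwarding directory: each stream maps to the next endpoint and the active-message handler to invoke there, so a stage's output is re-injected as a new message. Names and layout are illustrative only.

/* Forwarding-directory sketch: route a circuit's output downstream. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct forward_entry {
    uint32_t next_endpoint;     /* where results of this stream go         */
    uint32_t next_handler;      /* active-message handler to run there     */
};

#define MAX_STREAMS 32
static struct forward_entry fwd_dir[MAX_STREAMS];

/* Stand-in for injecting a new active message into the SAN. */
static void send_am(uint32_t endpoint, uint32_t handler,
                    const void *data, size_t len)
{
    (void)data;
    printf("forward %zu bytes to endpoint %u, handler %u\n",
           len, endpoint, handler);
}

/* After a circuit (e.g. the FFT) finishes, push its output downstream. */
void forward_result(uint32_t stream, const void *result, size_t len)
{
    struct forward_entry *e = &fwd_dir[stream % MAX_STREAMS];
    send_am(e->next_endpoint, e->next_handler, result, len);
}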
Host-to-Host Performance
Transferring data between two host-level endpoints
Host-to-Host Communication Performance
• Host-to-host transfers are the standard benchmark
• Three phases of data transfer
- Injection is the most challenging
• Overall communication path
[Figure: the path used for both active messages and remote memory operations: (1) the source host injects data from memory into its NI, (2) the message crosses the SAN between NIs, (3) the destination NI delivers the data into the receiving host's memory]
Host-NI: Data Injections
• Host-NI transfers are challenging
- The host CPU lacks a general-purpose DMA engine
• Multiple transfer methods
- Programmed I/O
- DMA
• Automatically select the method for each injection
• Result: Tunable PCI Injection Library (TPIL); selection sketch below
[Figure: the injection path from the host CPU and cache through the memory controller and PCI bus into peripheral device memory, via either programmed I/O or PCI DMA]
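A sketch of size-based method selection in the spirit of TPIL. The crossover thresholds and function names are placeholders, not values or interfaces from the library; a real implementation would issue wide (MMX/SSE) stores or program the NI's DMA engine instead of the plain memcpy stand-ins below.

/* TPIL-style injection sketch: pick PIO or DMA based on transfer size. */
#include <stdint.h>
#include <string.h>

enum inject_method { INJ_PIO_MEMCPY, INJ_PIO_SSE, INJ_DMA };

#define PIO_CUTOFF   256        /* placeholder crossover points (bytes)    */
#define SSE_CUTOFF  4096

static enum inject_method pick_method(size_t len)
{
    if (len <= PIO_CUTOFF)  return INJ_PIO_MEMCPY;
    if (len <= SSE_CUTOFF)  return INJ_PIO_SSE;
    return INJ_DMA;
}

/* Inject 'len' bytes from host memory into NI memory at 'ni_dst'. */
int tpil_inject(volatile void *ni_dst, const void *src, size_t len)
{
    switch (pick_method(len)) {
    case INJ_PIO_MEMCPY:
        memcpy((void *)ni_dst, src, len);   /* CPU stores over the PCI bus */
        return 0;
    case INJ_PIO_SSE:
        memcpy((void *)ni_dst, src, len);   /* would use wide SSE stores   */
        return 0;
    case INJ_DMA:
        memcpy((void *)ni_dst, src, len);   /* would program the NI's DMA  */
        return 0;
    }
    return -1;
}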
TPIL Performance:
LANai 9 NI with Pentium III-550 MHz Host
[Figure: TPIL injection bandwidth versus injection size (10 B to 1 MB); curves for DMA 0-copy, DMA 1-copy DB, DMA 1-copy, PIO SSE, PIO MMX, and PIO memcpy; y-axis up to 140 MB/s]
Overall Communication Pipeline
• Three phases of transmission: Host-NI, NI-NI, NI-Host
- Optimization: Use fragmentation to increase pipeline utilization
- Optimization: Allow cut-through transmissions
[Figure: timeline of messages 1-3 moving through the sending Host-NI, NI-NI, and receiving NI-Host stages, showing how overlapping the stages shortens the overall transmission time]
Overall Host-to-Host Performance
[Figure: host-to-host bandwidth versus message size (1 B to 1 MB) for four host/NI combinations: P4 with LANai 9, P4 with LANai 4, P3 with LANai 9, P3 with LANai 4; y-axis up to 150 MB/s]

Host          NI        Latency (μs)   Bandwidth (MB/s)
P4-1.7 GHz    LANai 9   8              146
P4-1.7 GHz    LANai 4   14.5           108
P3-550 MHz    LANai 9   9.5            116
P3-550 MHz    LANai 4   14             96
Comparison to Existing Message Layers
[Figure: one-way latency (0-30 μs) and bandwidth (0-250 MB/s) bars for existing message layers (AM, AM-II, BIP, FM, LFC, Trapeze, GM) alongside GM and GRIM on LANai 4 and LANai 9 hardware]
Concluding Remarks
Key Contributions
• Framework for communication in resource-rich clusters
- Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces
- Comparable performance to state-of-the-art message layers
• Extensible for peripheral devices
- Suitable for intelligent and legacy peripherals
- Methods for managing card resources
• Extensible for higher-level programming abstractions
- Endpoint-level: Streaming computations and sockets emulation
- NI-level: Multicast support
Future Directions
• Continued work with GRIM
- Video card vendors opening cards to developers
- Myrinet-connected embedded devices
• Adaptation to other network substrates
- Gigabit Ethernet appealing because of cost
- Modifications to the transmission protocols
- InfiniBand technology promising
• Active system area networks
- FPGA chips beginning to feature gigabit transceivers
- Use FPGA chips as networked processing devices
Additional Research Projects
Wireless Sensor Networks
• NASA JPL research
- In-situ WSNs
- Exploration of Mars
• Communication
- Self-organization
- Routing
• SensorSim
- Java simulator
- Evaluate protocols
PeZ: Pole-Zero Editor for MATLAB
Related Publications
• A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
• Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
• A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
• An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.
Papers and Software Available at
http://www.CraigUlmer.com/research
Backup Slides
Performance: FPGA Computations
Pipeline stage          Clocks
Acquire SRAM            8
Detect new message      4
Fetch header            7
Lookup forwarding       5
Fetch payload           1024
Computation             1024
Store results           1024
Store header            16
Update queues           3
Release SRAM            1

Clock speed: 20 MHz
Operation latency: 55 μs for a 4 KB payload (73 MB/s)

[Figure: endpoint datapath on the FPGA: a fetch/decode unit with control/status port, scratchpad controllers, message generator, results cache, and built-in ALU operations, connected through ports A-C to the four SRAM banks: SRAM 0 (incoming queues), SRAM 1 (user page 0), SRAM 2 (user page 1), SRAM 3 (outgoing queues)]
Expansion: Sharing On-Card Memory
• Limited on-card memory for storing application data
- Construct a virtual memory system for on-card memory
- Swap space is host memory (sketch below)
[Figure: user-defined circuits on the FPGA access page frames in SRAM 1 and SRAM 2; a page fault causes the host CPU to swap pages between card SRAM and host memory]
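A sketch of the on-card virtual-memory scheme, assuming a couple of SRAM-resident page frames with host memory as swap and round-robin eviction; sizes and names are illustrative only.

/* On-card virtual memory sketch: SRAM page frames, host memory as swap. */
#include <string.h>
#include <stdint.h>

#define PAGE_BYTES  4096
#define CARD_FRAMES 2                     /* e.g. SRAM 1 and SRAM 2        */
#define NUM_PAGES   64

static uint8_t card_sram[CARD_FRAMES][PAGE_BYTES];
static uint8_t host_swap[NUM_PAGES][PAGE_BYTES];
static int frame_page[CARD_FRAMES] = { -1, -1 };
static int victim;                        /* simple round-robin eviction   */

/* Ensure 'page' is resident in card SRAM and return its frame. */
uint8_t *card_page_fault(int page)
{
    for (int f = 0; f < CARD_FRAMES; f++)
        if (frame_page[f] == page)
            return card_sram[f];          /* already resident              */

    int f = victim;
    victim = (victim + 1) % CARD_FRAMES;
    if (frame_page[f] >= 0)               /* write old page back to host   */
        memcpy(host_swap[frame_page[f]], card_sram[f], PAGE_BYTES);
    memcpy(card_sram[f], host_swap[page], PAGE_BYTES);
    frame_page[f] = page;
    return card_sram[f];
}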
RC-1000 Challenges
• Hardware implementation
- Queue state machines
• Memory locking
- SRAM is single-ported
- Must arbitrate for its use
• CPU / NI contention
- NI manages the FPGA memory lock
[Figure: the host CPU, the NI, and the user circuits on the FPGA all contend for the single-ported SRAM, guarded by a memory lock]
Example: Autonomous Spaceborne Clusters
• NASA Remote Exploration and Experimentation
- Spaceborne vehicle processes data locally
- "Clusters in the sky"
• Number of peripheral devices
- Data sensors
- FPGAs and DSPs
• Adaptive hardware
- Modify functionality after deployment
Performance: Card Interactions
• Acquire FPGA SRAM
- CPU-NI: 20 μs
- NI: 8 μs
• Inject a 4 KB message to the FPGA
- CPU: 58 μs (70 MB/s)
- NI: 32 μs (128 MB/s)
• Release FPGA SRAM
- CPU-NI: 8 μs
- NI: 5 μs
[Figure: the CPU and NI interact with the FPGA's user circuits, SRAM, and memory lock]
Example: Digital Libraries
• Enormous amount of data and users
- Intelligent LAN and storage cards manage requests
[Figure: clients connect through intelligent LAN adaptors; hosts on the SAN backbone pair the LAN adaptors with storage adaptors serving files A-H, I-R, and S-Z]
Cyclone Systems I2O Server Adaptor Card
• Networked host on a PCI card
• Integration with GRIM
- Interacts directly with the NI
- Ported the host-level endpoint software
• Utilized as a LAN-SAN bridge
[Figure: card block diagram: an i960 Rx processor with DMA engines, ROM, and DRAM on a local bus; a primary PCI interface to the host system and a secondary PCI interface on the daughter card connecting two 10/100 Ethernet ports and SCSI]
GRIM Multicast Extensions
• Distribute the same message to multiple receivers
- Tree-based distribution
- Message replicated at the NI
- Messages are recycled back into the network (sketch below)
• Extensions to the NI's core communication operations
- Recycled messages use a separate logical channel
- Per-hop flow control provides reliable delivery
[Figure: endpoint A multicasts to endpoints B-E; each NI delivers the message locally and recycles copies down the distribution tree]
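A sketch of NI-level replication for the tree-based distribution described above, assuming the message header carries its children in the tree; the header layout and helper names are hypothetical.

/* NI multicast sketch: deliver locally, recycle a copy toward each child. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_FANOUT 4

struct mcast_hdr {
    uint32_t dest;                         /* endpoint at this node         */
    uint32_t nchildren;
    uint32_t child[MAX_FANOUT];            /* next hops in the tree         */
};

static void deliver_local(uint32_t dest, const void *p, size_t n)
{
    (void)p;
    printf("deliver %zu bytes to endpoint %u\n", n, dest);
}

static void recycle_to_network(uint32_t node, const void *p, size_t n)
{
    (void)p;                               /* would re-inject on a separate logical channel */
    printf("recycle %zu bytes toward node %u\n", n, node);
}

void ni_multicast(const struct mcast_hdr *h, const void *payload, size_t len)
{
    deliver_local(h->dest, payload, len);
    for (uint32_t i = 0; i < h->nchildren; i++)
        recycle_to_network(h->child[i], payload, len);
}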
Multicast Performance
[Figure: multicast to 8 hosts on LANai 4 NIs with P4-1.7 GHz hosts: multicast versus unicast round-trip time and injection overhead as a function of multicast message size (1 B to 1 MB); time axis in μs, log-log scale]
Multicast Observations
• Beneficial: reduces sending overhead
• Performance loss for large messages
- Dependent on the NI's memory-copy bandwidth
• On-card memory copy benchmark:
- LANai 4: 19 MB/s
- LANai 9: 66 MB/s
Extension: Sockets Emulation
• Berkeley sockets is a communication standard
- Utilized in numerous distributed applications
• GRIM provides sockets API emulation (sketch below)
- Functions for intercepting socket calls
- AM handler functions buffer connection data
[Figure: a sender's read()/write() calls are intercepted and turned into active messages; at the receiver, an "append socket" AM handler adds the data to socket X's buffer, where the intercepted calls extract it]
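A sketch of the emulation path, assuming an intercepted write() is turned into an "append to socket" active message handled at the receiver; the structure and names are illustrative, not GRIM's sockets layer.

/* Sockets emulation sketch: write() becomes an AM that appends to the
 * peer's receive buffer. */
#include <string.h>
#include <stdio.h>
#include <sys/types.h>

#define SOCK_BUF 65536

struct emu_socket {
    char   rxbuf[SOCK_BUF];
    size_t rxlen;
    int    remote_endpoint;
};

/* Stand-in for injecting an "append to socket X" active message. */
static void send_am_append(int endpoint, int sock_id,
                           const void *data, size_t len)
{
    (void)data;
    printf("AM append: %zu bytes for socket %d on endpoint %d\n",
           len, sock_id, endpoint);
}

/* Sender side: an intercepted write() becomes an active message. */
ssize_t emu_write(struct emu_socket *s, int sock_id,
                  const void *buf, size_t len)
{
    send_am_append(s->remote_endpoint, sock_id, buf, len);
    return (ssize_t)len;
}

/* Receiver side: the AM handler appends data to the socket's buffer. */
void am_append_socket(struct emu_socket *s, const void *data, size_t len)
{
    if (s->rxlen + len > SOCK_BUF)
        len = SOCK_BUF - s->rxlen;     /* real code would apply flow control */
    memcpy(s->rxbuf + s->rxlen, data, len);
    s->rxlen += len;
}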
Sockets Emulation Performance
[Figure: sockets emulation bandwidth versus transfer size (1 B to 10 MB) on P4-1.7 GHz hosts: GRIM sockets over LANai 4 compared with 100 Mb/s Ethernet; y-axis up to 120 MB/s]
Overall Performance: Store-and-Forward
• Approach: Single message, no overlap
- Three transmission stages
- Expect roughly 1/3 of the bandwidth of an individual stage (rough estimate below)
[Figure: store-and-forward bandwidth versus message size (1 B to 100 KB) for LANai 9 and LANai 4 NIs on P3-550 MHz hosts, y-axis up to 50 MB/s; inset timeline shows one message traversing the Host-NI (PCI, 132 MB/s), NI-NI (Myrinet, 160 MB/s), and NI-Host (PCI, 132 MB/s) stages back-to-back]
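A rough check of the one-third expectation, assuming a single large message crosses the stage bandwidths shown in the figure (PCI at 132 MB/s, Myrinet at 160 MB/s) with no overlap:

    B ≈ 1 / (1/132 + 1/160 + 1/132) ≈ 47 MB/s

which is indeed roughly a third of an individual stage's bandwidth.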
Enhancement: Message Pipelining
• Allow overlap with multiple in-flight messages
- GRIM uses AM and RM fragmentation/reassembly (sketch below)
- Performance depends on the fragment size
[Figure: pipelined bandwidth versus message size (1 B to 1 MB) for fragment sizes of 256 B, 1 KB, 4 KB, 16 KB, and 64 KB on a LANai 9 NI with P3-550 MHz hosts, y-axis up to 80 MB/s; inset timeline shows fragments of messages 1-3 overlapping across the Host-NI, NI-NI, and NI-Host stages]
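A sketch of sender-side fragmentation, assuming a fixed fragment size and offset-based reassembly at the receiver; the fragment header and function names are illustrative only.

/* Fragmentation sketch: slice a message so the pipeline stages overlap. */
#include <stddef.h>
#include <stdint.h>

struct fragment {
    uint32_t msg_id;
    uint32_t offset;            /* for reassembly at the receiver          */
    uint32_t len;
    const void *data;
};

/* Stand-in for the per-fragment injection primitive. */
static void inject_fragment(const struct fragment *f) { (void)f; }

void send_fragmented(uint32_t msg_id, const void *buf, size_t len,
                     size_t frag_size /* e.g. 4 KB, per the figure */)
{
    const uint8_t *p = buf;
    for (size_t off = 0; off < len; off += frag_size) {
        struct fragment f = {
            .msg_id = msg_id,
            .offset = (uint32_t)off,
            .len    = (uint32_t)(len - off < frag_size ? len - off : frag_size),
            .data   = p + off,
        };
        inject_fragment(&f);    /* next stage starts as soon as it arrives */
    }
}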
Enhancement: Cut-through Transfers
• Forward data as soon as it begins to arrive
- Cut-through at both the sending and receiving NIs
[Figure: cut-through bandwidth versus message size (1 B to 1 MB) on a LANai 9 NI with P3-550 MHz hosts, y-axis up to 100 MB/s; curves for no cut-through, sending-side only, receiving-side only, and cut-through at both NIs; inset timeline shows the stages of messages 1 and 2 overlapping further when cut-through is enabled]