Introduction to Parallel Computing
Chris Hines
iVEC
Supercomputing in a nutshell
Combine resources to:
I Solve more problems sooner
I Solve a larger problem
I Solve a single problem faster
Solve more problems
Example
I Identify radio sources in each data cube
I Render each frame in a movie
I Compare a gene sequence against the genome of each related species
I Run a weather simulation for each input state
Solutions
I Master process allocates work to Worker processes (see the sketch below)
I Queue each work task and let the scheduler take care of it
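
A minimal sketch of the master/worker approach in C with MPI. The task IDs, NTASKS, and do_work() are placeholders for a real workload, not something from the slides:

/* master/worker task farm sketch; compile with mpicc */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(int task) {            /* placeholder for a real task */
    return task * 2.0;
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master: hand out tasks */
        int next = 0, active = 0;
        MPI_Status st;
        /* seed every worker with one task */
        for (int w = 1; w < size && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        /* hand out a new task whenever a result comes back */
        while (active > 0) {
            double result;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            active--;                         /* result is discarded in this sketch */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++; active++;
            }
        }
        /* tell all workers to stop */
        for (int w = 1; w < size; w++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                  /* worker: loop until told to stop */
        while (1) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

The second option in the list often needs no code at all: submit each task as a separate job and let the batch scheduler act as the master.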
Solve a larger problem
Example
I Data cube is too large to fit in RAM
I Simulation has too many particles to fit in RAM
I Simulation needs higher resolution.
Solutions
I Use a computer with more RAM/CPUs (expensive)
I Use more computers (difficult)
Solve a problem faster
Example
I Identify interesting radio signals in real time
I Run a weather simulation for predictions
Solutions
I Buy a faster CPU
I Split your task and use more CPUs
Bandwidth vs Latency
I Bandwidth: How many things you can do in a second
I Latency: How long it takes you to do one thing
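
A simple model of the two (my notation, not from the slides): moving n bytes over a link with latency L and bandwidth B takes roughly

T(n) ≈ L + n / B

For example, with L = 1 µs and B = 1 GB/s, a 1 kB message takes about 2 µs (latency dominates), while a 1 GB message takes about 1 s (bandwidth dominates).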
Strategy
I Decompose task into smaller tasks
I Add bandwidth to process more small tasks each second
I Stop when adding more bandwidth adds too much latency
Reducing latency
I Can't be done. Electronics are only so fast.
I Can be hidden. Predict the next task, and start it before it's necessary
Latency Hiding
I Disk cache: reads ahead, assuming that if you need one byte you will need the next
I RAM cache: same idea as the disk cache
I CPU pipeline: starts one instruction before the previous one is complete. Fails if an "if" statement changes which instruction comes next.
I MPI Isend/Irecv: send data to another computer before you need it, and keep working while it's being sent (see the sketch below)
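
A minimal sketch of the MPI Isend/Irecv form of latency hiding, assuming exactly two ranks and a placeholder local_work() function (both are illustrative assumptions):

/* overlap communication with computation; compile with mpicc */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double local_work(void) {              /* placeholder computation */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += i * 1e-6;
    return s;
}

int main(int argc, char **argv) {
    int rank, size;
    static double sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* sketch assumes 2 ranks */
    int other = 1 - rank;

    /* start the exchange, but do not wait for it yet */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* useful work overlaps with the communication */
    double s = local_work();

    /* only block when the remote data is actually needed */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d: local work %f, first remote value %f\n",
           rank, s, recvbuf[0]);

    MPI_Finalize();
    return 0;
}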
Increasing bandwidth
I Just add more X, for X ∈ [CPUs, disks, researchers, PhD students]
I Difficult to coordinate
I Each extra X means more coordination overhead, which means more latency
Amdahl's Law
[Figure: walltime and speedup versus Ncpus (1, 2, 4, 8) for a program with a serial part and a parallel part]
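
In symbols (the standard form of the law, with p the fraction of the work that parallelises and N the number of CPUs):

Speedup(N) = 1 / ((1 − p) + p / N)

so the speedup can never exceed 1 / (1 − p); with p = 0.9, for example, no number of CPUs gives more than a 10× speedup.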
Amdahl's Law
[Figure: as above, but including parallel overhead; walltime and speedup versus Ncpus (1, 2, 4, 8) for serial work, parallel work, and overhead]
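
One way to model the overhead (an assumption of mine, not a formula from the slides) is to add a term o(N) that grows with the number of CPUs:

Speedup(N) = 1 / ((1 − p) + p / N + o(N))

Once o(N) grows faster than p / N shrinks, adding CPUs increases the walltime instead of reducing it.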
Decomposing Problems
I Most tasks can be broken into smaller tasks
I Frequently each task depends on the output of a previous task
I Sometimes the dependency is on a task done on a different computer
I Classify problems as either Loosely or Tightly coupled
Coupling
Tightly Coupled
I Problem is made up of tasks which depend on the output of other tasks
I Need to run all tasks at the same time because of dependency
I Minimise dependency on tasks on different CPUs
Loosely coupled
I Problem is made up of independent tasks
I Don't require all tasks to run at the same time
I Sometimes called "Embarrassingly Parallel"
Loose Coupling
I Use as many CPUs as possible to solve all problems as fast as possible
I Tasks don't need to run at the same time.
Shared Facilities
I Most supercomputing CPUs are used by one program at a time
I Many users
Tightly Coupled Problems
I Particle methods
I Grid methods
I Fourier Transforms
I Linear Algebra
Particle Methods
Many Body Problems
I Each CPU holds a different region of space
I Strong forces come from particles in adjacent regions
I Forces from distant regions can be approximated
Particle Methods
Many Body Problems
Algorithm
I Each time step (see the sketch after this list):
  I Send positions to adjacent CPUs
  I Calculate forces from local particles
  I Calculate forces from adjacent CPUs' particles
  I Calculate long range forces
  I Update velocity for each particle
  I Update position for each particle
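
A structural sketch of one time step in C. Every function here is a hypothetical stub (not from the slides) standing in for the real physics and the real MPI halo exchange; the point is only the ordering, which lets the slow position exchange overlap with the local-force work:

#include <stddef.h>

typedef struct { double x[3], v[3], f[3]; } particle;

/* stubs: a real code would post MPI_Isend/MPI_Irecv here, compute forces, etc. */
static void start_position_exchange(particle *p, size_t n) { (void)p; (void)n; }
static void finish_position_exchange(void) {}
static void add_local_forces(particle *p, size_t n) { (void)p; (void)n; }
static void add_neighbour_forces(particle *p, size_t n) { (void)p; (void)n; }
static void add_long_range_forces(particle *p, size_t n) { (void)p; (void)n; }

void time_step(particle *p, size_t n, double dt) {
    start_position_exchange(p, n);   /* slow communication begins ...      */
    add_local_forces(p, n);          /* ... while local work proceeds      */
    finish_position_exchange();      /* block only when remote data needed */
    add_neighbour_forces(p, n);
    add_long_range_forces(p, n);     /* FFT or multipole: more communication */
    for (size_t i = 0; i < n; i++)
        for (int d = 0; d < 3; d++) {
            p[i].v[d] += dt * p[i].f[d];   /* update velocity (unit mass assumed) */
            p[i].x[d] += dt * p[i].v[d];   /* update position */
        }
}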
Particle Methods
Many Body Problems
Note
I Communicating positions (slow) happens while the CPU is doing other work (local forces)
I I haven't specified the long range forces. They might involve an FFT or a multipole expansion. Either way, more communication.
Particle Methods
Many Body Problems
Scaling limits
I 2 × Ncpus → 1/2 the work per CPU and s × the communication size, with s ∈ [1/2 . . . 1]
I 1/2 the communication size ≠ 1/2 the communication time
I Network latency limits scaling
Grid methods
PDEs
I Solution discretised to constant values on a grid
I Derivatives approximated by finite differences or DFTs
PDE algorithm
I Guess a solution
I Communicate adjacent elements you don't have (the halo)
I Calculate derivatives for the elements you do have
I Calculate derivatives that depend on the elements you just received
I Improve the solution, iterate (see the sketch below)
Minimise surface area to volume per CPU: communication scales with the surface, computation with the volume
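
A minimal halo-exchange sketch of that loop, reduced to a 1D Jacobi iteration for brevity; the grid size, iteration count, and boundary values are arbitrary illustrative choices:

/* 1D Jacobi iteration with halo exchange; compile with mpicc */
#include <mpi.h>
#include <string.h>

#define NLOCAL 1000   /* interior points per rank */
#define NITER  100

int main(int argc, char **argv) {
    int rank, size;
    double u[NLOCAL + 2] = {0}, unew[NLOCAL + 2] = {0};  /* +2 halo cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    if (rank == size - 1) u[NLOCAL + 1] = 1.0;   /* boundary condition */

    for (int it = 0; it < NITER; it++) {
        /* exchange halo cells with the neighbouring ranks */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Jacobi update on the points this rank owns */
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        memcpy(&u[1], &unew[1], NLOCAL * sizeof(double));
    }
    MPI_Finalize();
    return 0;
}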
Grid methods
Cluster identification/Source finding
I Divide up the data cube
I Identify and label sources in each subcube
I Relabel sources to unique labels
I Communicate the edges of each subcube
I Make a network/graph of which source touches which
I Walk the graph working out new labels (see the sketch below)
I Relabel the subcubes
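
A small sketch of the "walk the graph" step using union-find: every pair of labels found to touch across a subcube edge is merged, and each label is then replaced by its set's representative. The label count and the touching pairs below are made up for illustration:

#include <stdio.h>

#define MAXLABELS 1000

static int parent[MAXLABELS];

static int find(int a) {                 /* find the set representative */
    while (parent[a] != a) {
        parent[a] = parent[parent[a]];   /* path halving */
        a = parent[a];
    }
    return a;
}

static void merge(int a, int b) {        /* union the two sets */
    parent[find(a)] = find(b);
}

int main(void) {
    for (int i = 0; i < MAXLABELS; i++) parent[i] = i;

    /* pairs of labels that touch along subcube edges (illustrative) */
    int touches[][2] = { {3, 17}, {17, 42}, {8, 9} };
    for (int i = 0; i < 3; i++) merge(touches[i][0], touches[i][1]);

    /* any label can now be mapped to a unique representative */
    printf("3 -> %d, 42 -> %d, 8 -> %d\n", find(3), find(42), find(8));
    return 0;
}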
Multidimensional Fourier Transforms
I Multiple 1D FFTs in each dimension.
I Each 1D FFT is done at the same time
I Each CPU holds all X values for a given Y and Z
I Rearrange data
I Each CPU holds all Y values for a given X and Z
I ...
The communication cost of rearranging the data is heavy, but it works
Linear Algebra
I Use ScaLAPACK
I Matrices are distributed block-cyclically (see the sketch below)
I This ensures load balance when reducing to upper triangular form
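
A tiny sketch of the block-cyclic idea (illustrative numbers only, not ScaLAPACK calls): with block size NB on a Pr × Pc process grid, the block containing element (i, j) lives on process ((i/NB) mod Pr, (j/NB) mod Pc), so the still-active part of the matrix stays spread over every process as the factorisation proceeds.

#include <stdio.h>

int main(void) {
    int NB = 64, Pr = 2, Pc = 3;          /* block size, process grid */
    int i = 500, j = 1000;                /* a global matrix element  */
    int owner_row = (i / NB) % Pr;        /* process row owning it    */
    int owner_col = (j / NB) % Pc;        /* process column owning it */
    printf("element (%d,%d) lives on process (%d,%d)\n",
           i, j, owner_row, owner_col);
    return 0;
}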
Parallel Computation Technologies
I Hardware
  I Shared Memory
  I Clusters
  I GPUs
I Programming Tools
  I MPI
  I OpenMP
  I CUDA
  I OpenCL
  I Hybrids
Parallel Hardware
Shared Memory
I Single Node Multiple CPUs/cores
I Fast access to remote data
I Expensive hardware
I Non-Uniform RAM latency
Parallel Hardware
Clusters
I Computers kludged together
I How to access remote data?
I Buy a good network
I Not a kludge anymore
Parallel Hardware
GPGPUs
I Perform the same action on every item in a list
I If statements are not so good
I Very fast access to GPU RAM
I Slow access to CPU RAM
I Low power per FLOP
Parallel Programming
OpenMP
I Shared Memory Only
I Incremental route from serial to parallel (see the example below)
I Lack of control over RAM latency
I Limited scalability
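
A minimal example of that incremental route: one pragma turns a serial loop into a shared-memory parallel loop. The array and its contents are placeholders:

/* compile with e.g. gcc -fopenmp */
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* the loop body is unchanged; the pragma distributes iterations
       across the threads of one node and combines the partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }
    printf("sum = %f with %d threads available\n", sum, omp_get_max_threads());
    return 0;
}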
Parallel Programming
MPI
I Clusters of Shared memory
I Control over RAM latency
I Fast software
I Tedious programming
Parallel Programming
CUDA/OpenCL
I Code the Kernel in CUDA
I Good fine grained performance
I Explicit memory management
Parallel Programming
OpenMP+MPI
I Share Read-only data between threads
I MPI to span nodes
I Theoretically a performance gain intra-node
I Watch for thread/process placement (see the sketch below)
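
A minimal sketch of the hybrid layout, where MPI spans the nodes and OpenMP threads share data inside each node. The shared_table array is a placeholder for real read-only data, and MPI_THREAD_FUNNELED is used because only the master thread calls MPI here:

/* compile with e.g. mpicc -fopenmp */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared, read-only inside the node: every thread reads the same
       copy instead of each MPI process holding its own */
    double shared_table[4] = {1.0, 2.0, 3.0, 4.0};

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        printf("rank %d thread %d reads %f\n", rank, t, shared_table[t % 4]);
    }

    MPI_Finalize();
    return 0;
}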
Parallel Programming
GPGPU+MPI
I Use many GPUs
I Add another layer of complexity
Parallel IO Hardware
I RAID
I Lustre
RAID
I Writing to many disks increases bandwidth
I Doesn't help latency
I May add failure protection
I Single computer
Lustre
Clients
The computers which run the calculation (the nodes). Many clients.
OSS
Object Storage Servers. Write/Read from RAID arrays
OST
Object Storage Targets. The actual RAID arrays
MDS
Metadata server. Tells clients which objects make up a file.
Lustre Redundancy
I RAID. Protects against single disks failing
I Redundant controllers. Each OSS has two connections to each disk
I Active-Active OSS failover.
I Active-Passive MDS failover
Lustre Performance
I A single node is limited by its own network bandwidth
I Many nodes are limited by the sum of the OSS network bandwidth
I Latency worse than RAID
I Big file → Many OSSs
I Small file → One OSS
Big Data
I tape
I MAID
Tape Archive
Advantages
I Cheap
I Low power
I Long shelf life (archival properties)
Disadvantages
I Long seek times (rewind the tape to the correct position)
Hierarchical Storage Management (HSM)
I Put files on fast disk if you use them a lot
I Slow disk if you use them less frequently
I Tape if you're hardly using them at all
I Automate.
HSM Performance
Problems
I Latency to get a file off tape is long (12 seconds best case)
I Regular programs request one file at a time
I Lots of space = lots of files
I Lots of files = big database = poor performance
HSM Performance
Mitigation Strategies
I Request all files at once
I Put files used together in an archive (.zip/.tar.gz)
MAID et al
Massive Array of Idle Disks.
Problem
Disks use power all the time
Solution
I Spin down disks
I Turn off controllers
Unknowns
I New technology
I Power cycles reduce lifetime