Introduction to Parallel Computing
Chris Hines, iVEC

Supercomputing in a nutshell
Combine resources to:
- Solve more problems sooner
- Solve a larger problem
- Solve a single problem faster

Solve more problems
Example
- Identify radio sources in each data cube
- Render each frame in a movie
- Compare a gene sequence against the genome of each related species
- Run a weather simulation for each input state
Solutions
- A Master process allocates work to Worker processes (see the task-farm sketch at the end of this part)
- Queue each work task and let the scheduler take care of it

Solve a larger problem
Example
- Data cube is too large to fit in RAM
- Simulation has too many particles to fit in RAM
- Simulation needs higher resolution
Solutions
- Use a computer with more RAM/CPUs (expensive)
- Use more computers (difficult)

Solve a problem faster
Example
- Identify interesting radio signals in real time
- Run a weather simulation for predictions
Solutions
- Buy a faster CPU
- Split your task and use more CPUs

Bandwidth vs. Latency
- Bandwidth: how many things you can do in a second
- Latency: how long it takes you to do one thing

Strategy
- Decompose the task into smaller tasks
- Add bandwidth to process more small tasks each second
- Stop when adding more bandwidth adds too much latency

Reducing latency
- Can't be done. Electronics are only so fast.
- Can be hidden. Predict the next task and start it before it's needed.

Latency Hiding
- Disk cache: reads ahead on the assumption that if you need one byte you will need the next
- RAM cache: same idea as the disk cache
- CPU pipeline: starts one instruction before the previous one is complete. Fails if an "if" statement changes which instruction comes next.
- MPI Isend/Irecv: send data to another computer before it is needed, and keep working while it is being sent

Increasing bandwidth
- Just add more X, for X ∈ {CPUs, disks, researchers, PhD students}
- Difficult to coordinate
- Each extra X means more coordination overhead, which means more latency

Amdahl's Law
[figure: speedup and walltime versus number of CPUs (1, 2, 4, 8) for the serial and parallel parts of a task]

Amdahl's Law
[figure: as above, with the parallel overhead added; the standard formula is sketched at the end of this part]

Decomposing Problems
- Most tasks can be broken into smaller tasks
- Frequently each task depends on the output of a previous task
- Sometimes the dependency is on a task done on a different computer
- Classify problems as either loosely or tightly coupled

Coupling
Tightly coupled
- The problem is made up of tasks which depend on the output of other tasks
- All tasks need to run at the same time because of the dependencies
- Minimise dependencies on tasks running on other CPUs
Loosely coupled
- The problem is made up of independent tasks
- Tasks don't all have to run at the same time
- Sometimes called "embarrassingly parallel"

Loose Coupling
- Use as many CPUs as possible to solve all the problems as fast as possible
- Tasks don't need to run at the same time
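The "master allocates work to workers" idea above, applied to the loosely coupled problems just described, can be sketched as an MPI task farm. This is only a minimal illustration, not code from the original slides: the task count, the tags, and the do_task() placeholder (which just squares its argument) all stand in for real work such as searching one data cube or rendering one frame.

/* Minimal sketch of a master/worker ("task farm") for loosely coupled work.
 * Assumes MPI; compile with mpicc, run with e.g. mpirun -np 4 ./taskfarm   */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   20      /* made-up number of independent tasks */
#define TAG_WORK  1
#define TAG_STOP  2

/* Stand-in for the real work, e.g. source-finding in one data cube. */
static int do_task(int task_id)
{
    return task_id * task_id;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand a new task to whichever worker reports a result,
         * then release all workers once the queue is empty.             */
        int next = 0, busy = 0, result;
        MPI_Status st;

        /* Prime each worker with one task (spare workers just wait). */
        for (int w = 1; w < size && next < NTASKS; w++, next++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            busy++;
        }
        while (busy > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            busy--;
            printf("result %d from rank %d\n", result, st.MPI_SOURCE);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
                busy++;
            }
        }
        int dummy = 0;
        for (int w = 1; w < size; w++)
            MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {
        /* Worker: receive a task id, do the work, send the result back. */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = do_task(task);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Because the tasks are independent, the same job could equally be run as a queue of separate scheduler jobs, as the "Solutions" bullets suggest; the task farm is just the in-program version of that idea.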
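The two Amdahl's Law figures earlier in this part survive only as axis labels, so here is the standard statement of the law as a hedged reconstruction of what they plot. The symbols p (parallelisable fraction of the work), N (number of CPUs), T_1 (single-CPU walltime) and o(N) (parallel overhead) are introduced here, not taken from the slides.

    S(N) = \frac{1}{(1 - p) + p/N}

    T(N) = T_1\Bigl((1 - p) + \frac{p}{N}\Bigr) + o(N)

The first expression flattens towards 1/(1 - p) as N grows, which is the levelling-off in the speedup curve; the overhead term in the second is why, in the second figure, the walltime eventually turns back upwards as more CPUs are added.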
Shared Facilities
- Most supercomputing CPUs are used by one program at a time
- Many users

Tightly Coupled Problems
- Particle methods
- Grid methods
- Fourier transforms
- Linear algebra

Particle Methods: Many-Body Problems
- Each CPU holds a different region of space
- Strong forces come from particles in adjacent regions
- Forces from distant regions can be approximated

Particle Methods: Many-Body Problems
Algorithm: each time step
- Send positions to adjacent CPUs
- Calculate forces from local particles
- Calculate forces from adjacent CPUs
- Calculate long-range forces
- Update the velocity of each particle
- Update the position of each particle

Particle Methods: Many-Body Problems
Note
- Communicating positions (slow) happens while the CPU is doing other work (local forces)
- The long-range forces are not specified here; they might involve an FFT or a multipole expansion. Either way, more communication.

Particle Methods: Many-Body Problems
Scaling limits
- 2 × Ncpus → 1/2 the work per CPU, but s × the communication size, with s ∈ [1/2 ... 1]
- 1/2 the communication size ≠ 1/2 the communication time
- Network latency sets the limit

Grid methods: PDEs
- The solution is discretised to constant values on a grid
- Derivatives are approximated by finite differences or DFTs
PDE algorithm
- Guess a solution
- Communicate the adjacent elements you don't have (the halo) -- see the halo-exchange sketch at the end of this part
- Calculate derivatives for the elements you do have
- Calculate derivatives that depend on the elements you just received
- Improve the solution and iterate
- Maximise the volume-to-surface-area ratio of each CPU's subdomain: computation scales with the volume, communication with the halo surface

Grid methods: cluster identification / source finding
- Divide up the data cube
- Identify and label sources in each subcube
- Relabel sources with globally unique labels
- Communicate the edges of each subcube
- Build a graph of which source touches which
- Walk the graph working out the new labels
- Relabel the subcubes

Multidimensional Fourier Transforms
- Multiple 1D FFTs in each dimension
- The 1D FFTs for one dimension are all done at the same time
- Each CPU holds all X values for a given Y and Z
- Rearrange the data
- Each CPU holds all Y values for a given X and Z
- ...
- The communication cost of rearranging the data is heavy, but it works

Linear Algebra
- Use ScaLAPACK
- Matrices are distributed block-cyclically
- This ensures load balance when reducing to upper triangular form

Parallel Computation Technologies
Hardware
- Shared memory
- Clusters
- GPUs
Programming tools
- MPI
- OpenMP
- CUDA
- OpenCL
- Hybrids

Parallel Hardware: Shared Memory
- Single node, multiple CPUs/cores
- Fast access to remote data
- Expensive hardware
- Non-uniform RAM latency

Parallel Hardware: Clusters
- Kludged together
- How do you access remote data?
- Buy a good network
- Not a kludge any more
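The halo communication in the PDE algorithm above pairs naturally with the Isend/Irecv latency hiding from the first part, so here is a minimal sketch of a 1-D halo exchange, assuming MPI. The domain size, the number of time steps, and the 3-point averaging update are placeholders; a real solver would apply its own stencil and a 2-D or 3-D decomposition.

/* 1-D halo exchange: each rank owns N points plus one ghost cell at
 * each end.  Non-blocking Isend/Irecv let the interior update overlap
 * the halo communication (latency hiding).                            */
#include <mpi.h>
#include <stdlib.h>

#define N 1000                 /* interior points per rank (made up) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[N+1] are the ghost (halo) cells. */
    double *u    = calloc(N + 2, sizeof(double));
    double *unew = calloc(N + 2, sizeof(double));
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; step++) {
        MPI_Request req[4];

        /* 1. Start the halo exchange (non-blocking). */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* 2. Update the interior while the halo is in flight. */
        for (int i = 2; i <= N - 1; i++)
            unew[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;

        /* 3. Wait for the halo, then update the two boundary points. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        unew[1] = (u[0]     + u[1] + u[2])     / 3.0;
        unew[N] = (u[N - 1] + u[N] + u[N + 1]) / 3.0;

        double *tmp = u; u = unew; unew = tmp;   /* swap time levels */
    }

    free(u);
    free(unew);
    MPI_Finalize();
    return 0;
}

The same structure underlies the particle-method algorithm above: start the (slow) exchange of positions with neighbouring ranks, compute the local forces while it is in flight, and only then compute the contributions that needed the remote data.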
Parallel Hardware: GPGPUs
- Perform the same action on every item in a list
- If statements are not so good
- Very fast access to GPU RAM
- Slow access to CPU RAM
- Low power per FLOP

Parallel Programming: OpenMP
- Shared memory only
- Incremental route from serial to parallel (a minimal example appears at the end of this part)
- Lack of control over RAM latency
- Limited scalability

Parallel Programming: MPI
- Clusters of shared-memory nodes
- Control over RAM latency
- Fast software
- Tedious programming

Parallel Programming: CUDA/OpenCL
- Code the kernel in CUDA
- Good fine-grained performance
- Explicit memory management

Parallel Programming: OpenMP + MPI
- Share read-only data between threads
- MPI to span nodes
- In theory, a performance gain within a node
- Watch thread/process placement

Parallel Programming: GPGPU + MPI
- Use many GPUs
- Adds another layer of complexity

Parallel IO Hardware
- RAID
- Lustre

RAID
- Writing to many disks increases bandwidth
- Doesn't help latency
- May add failure protection
- Single computer

Lustre
- Clients: the computers which run the calculation (the compute nodes); many clients
- OSS: Object Storage Servers; read from and write to the RAID arrays
- OST: Object Storage Targets; the actual RAID arrays
- MDS: Metadata Server; tells clients which objects make up a file

Lustre Redundancy
- RAID protects against single disks failing
- Redundant controllers: each OSS has two connections to each disk
- Active-active OSS failover
- Active-passive MDS failover

Lustre Performance
- A single node is limited by its own network bandwidth
- Many nodes are limited by the sum of the OSS network bandwidth
- Latency is worse than local RAID
- Big file → many OSSs
- Small file → one OSS

Big Data
- Tape
- MAID

Tape Archive
Advantages
- Cheap
- Low power
- Long shelf life (good archival properties)
Disadvantages
- Long seek times (rewind the tape to the correct position)

Hierarchical Storage Manager (HSM)
- Put files on fast disk if you use them a lot
- Slow disk if you use them less frequently
- Tape if you hardly use them at all
- Automate it

HSM Performance Problems
- The latency to get a file off tape is long (12 seconds best case)
- Regular programs request one file at a time
- Lots of space = lots of files
- Lots of files = big database = poor performance

HSM Performance: Mitigation Strategies
- Request all the files at once
- Put files that are used together into an archive (.zip / .tar.gz)

MAID et al.
Massive Array of Idle Disks
Problem
- Disks use power all the time
Solution
- Spin down the disks
- Turn off the controllers
Unknowns
- New technology
- Power cycles reduce disk lifetime
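As promised in the OpenMP slides above, a minimal sketch of the "incremental route from serial to parallel": the loops are plain C, and a single pragma per loop spreads the iterations over the cores of one shared-memory node. The array size and the sum-of-squares work are made up for illustration.

/* Incremental OpenMP parallelism on a shared-memory node.
 * Compile with, e.g.,  gcc -fopenmp sum.c                 */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Fill the array in parallel; each thread gets a block of i. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* The reduction clause gives each thread a private partial sum
     * and combines them at the end, avoiding a shared-memory race. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * a[i];

    printf("sum of squares = %g using up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}

Removing the two pragmas gives back the original serial program, which is exactly the low-effort migration path the OpenMP slide describes; the trade-off, as noted there, is that the programmer gets little control over where the data lives (RAM latency) and the approach stops at a single node unless combined with MPI.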