GPU Basics - Types of Parallelism

S. Sundar & M. Panchatcharam
August 9, 2014

Outline
1 Task-based Parallelism
2 Data-based Parallelism
3 Flynn's Taxonomy
4 Other Parallel Patterns
5 Classes of Parallel Computers
6 Simple Coding

Pipeline Parallelism
A typical OS exploits a type of parallelism called task-based parallelism.
Example: a user reads an article on a website while playing music from a media player.
In parallel-programming terms, the Linux pipe operator chains commands so that they execute concurrently; for example, cat access.log | grep error | sort runs three cooperating processes, each stage feeding the next.

Coarse/Fine-Grained Parallelism
Applications are often classified according to how often their subtasks need to synchronize or communicate with each other.
If the subtasks communicate many times per second, the parallelism is called fine-grained.
If they do not communicate many times per second, it is called coarse-grained.
If they rarely or never have to communicate, the application is called embarrassingly parallel.
Embarrassingly parallel applications are the easiest to parallelize.

Task Parallelism
In a multiprocessor system, task parallelism is achieved when each processor executes a different thread on the same or different data.
Threads may execute the same or different code.
Communication between threads takes place by passing data from one thread to the next, as in the sketch below.

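A minimal sketch of task parallelism using OpenMP sections: two independent tasks run on different threads. The task bodies and names here are illustrative, not from the original slides. Compile with an OpenMP flag, e.g. gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("task A (e.g. decode audio) on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { printf("task B (e.g. render page) on thread %d\n", omp_get_thread_num()); }
    }
    return 0;
}
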
Data Parallelism
In a multiprocessor system, each processor performs the same task on different pieces of distributed data.
Consider adding two matrices: addition is the single task, applied to different pieces of data (each element); see the sketch below.
This is fine-grained parallelism.

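A minimal sketch of the matrix-addition example in C with OpenMP (the fixed size N is an assumption for illustration):

#include <omp.h>
#define N 512

void matrix_add(const double A[N][N], const double B[N][N], double C[N][N]) {
    /* Every element is independent, so all N*N additions can proceed in parallel. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}
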
Instruction-Level Parallelism
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.
At the hardware level, ILP is exploited dynamically (the processor schedules instructions at run time).
At the software level, it is exploited statically (the compiler schedules instructions).

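A small illustration: in the function below the first two additions are independent and can be issued in the same cycle on a superscalar processor, while the multiply depends on both results (the names are illustrative):

/* The first two additions are independent and can overlap;
   the multiply must wait for both of their results. */
int ilp_demo(int a, int b, int c, int d) {
    int e = a + b;
    int f = c + d;
    return e * f;
}
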
Flynn's Taxonomy
Flynn's taxonomy is a classification of different computer architectures. The four classes are as follows:
SISD - Single Instruction, Single Data
SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data
MIMD - Multiple Instruction, Multiple Data

SISD
Standard serial programming: a single instruction stream operates on a single data stream.
A single-core CPU is enough.

MIMD
Today's dual- or quad-core desktop machines are MIMD.
Work is allocated to one of the N CPU cores.
Each thread has an independent stream of instructions.
The hardware contains the control logic for decoding the separate instruction streams.

SIMD
A form of data parallelism.
There is a single instruction stream at any one point in time.
A single set of logic decodes and executes the instruction stream, but each instruction operates on multiple data elements at once; see the sketch below.

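A minimal sketch of SIMD in C using SSE intrinsics (assumes an x86 CPU with SSE and an array length divisible by 4):

#include <xmmintrin.h>  /* SSE intrinsics */

void add_floats(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats           */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   /* one instruction,
                                                       four additions          */
    }
}
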
MISD
Many functional units perform different operations on the same data.
Pipeline architectures belong to this type.
Example: Horner's rule for polynomial evaluation,

y = (...(((a_n*x + a_{n-1})*x + a_{n-2})*x + a_{n-3})*x + ... + a_1)*x + a_0

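A minimal sketch of Horner's rule in C (coefficients are passed highest degree first; the names are illustrative):

/* Evaluate a_n*x^n + ... + a_1*x + a_0, where a[0] holds a_n. */
double horner(const double *a, int n, double x) {
    double y = a[0];
    for (int i = 1; i <= n; i++)
        y = y * x + a[i];   /* every stage reuses the same data item x */
    return y;
}
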
Loop-Based Pattern
Are you familiar with loop structures?
Types of loop:
Entry-controlled loops (the condition is tested before the body, e.g. for, while)
Exit-controlled loops (the condition is tested after the body, e.g. do-while)

Loop-Based Pattern
An easy pattern to parallelize: once inter-loop dependencies are removed, decide how to split or partition the work between the available processors.
Optimize communication between processors and the use of on-chip resources.
Communication overhead is the bottleneck.
Decompose based on the number of logical hardware threads available.

Loop-Based Pattern
However, oversubscribing the number of threads leads to poor performance.
Reason: context switching is performed in software by the OS.
Be aware of hidden dependencies inherited from the existing serial implementation.
Concentrate on the inner loops and one or more outer loops; the best approach is to parallelize only the outer loops, as sketched below.

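A minimal sketch of parallelizing only the outer loop with OpenMP (the function and names are illustrative):

#include <omp.h>

void scale_matrix(double *m, int rows, int cols, double s) {
    /* Only the outer loop is parallelized: each thread gets whole rows,
       keeping the per-thread work coarse. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)   /* inner loop stays serial */
            m[i * cols + j] *= s;
}
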
Loop-Based Pattern
Note: most loop nests can be flattened.
Reduce an inner loop and an outer loop to a single loop, if possible.
Example: an image-processing algorithm with the X pixel axis in the inner loop and the Y axis in the outer loop.
Flatten this loop nest by treating all pixels as a single one-dimensional array and iterating over image coordinates, as in the sketch below.

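A minimal sketch of the flattening, assuming a row-major image buffer (the per-pixel operation is a placeholder):

/* Nested form: Y axis in the outer loop, X axis in the inner loop. */
void brighten_nested(unsigned char *img, int width, int height) {
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            img[y * width + x] += 10;
}

/* Flattened form: all pixels as one 1-D array, a single loop to partition. */
void brighten_flat(unsigned char *img, int width, int height) {
    for (int p = 0; p < width * height; p++)
        img[p] += 10;
}
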
Fork/Join Pattern
A common pattern, even in serial programming, where there are synchronization points and only certain parts of the program are parallel.
The serial code reaches work that can be distributed to P processors in some manner.
It then forks, or spawns, N threads/processes that perform the calculation in parallel.
These threads execute independently and finally converge, or join, once all the calculations are complete.
This is the typical approach found in OpenMP; see the sketch below.

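A minimal sketch of fork/join with an OpenMP parallel region (the work function is a placeholder):

#include <omp.h>
#include <stdio.h>

static void work(int thread_id) {        /* placeholder parallel work */
    printf("thread %d working\n", thread_id);
}

void fork_join_example(void) {
    /* ... serial code ... */
    #pragma omp parallel                 /* fork: a team of threads is spawned  */
    work(omp_get_thread_num());          /* each thread computes independently  */
                                         /* join: implicit barrier at region end */
    /* ... serial code resumes on a single thread ... */
}
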
Fork/Join Pattern
Code splits into N threads and later converges to a single thread again.
In the accompanying figure (omitted here), a queue of data items is split across three processing cores.
Each data item is processed independently and later written to the appropriate destination.

Fork/Join Pattern
Typically implemented with static partitioning of the data: the serial code launches N threads and divides the dataset equally between them.
This works well if each packet of data takes the same time to process.
If one thread takes too long, it becomes the single factor determining the total time.
Why did we choose three threads instead of six? In reality, with millions of data items, attempting to fork millions of threads would cause almost any OS to fail, because the OS applies a "fair scheduling policy"; see the sketch below.

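A minimal sketch of the equal-partition approach in OpenMP (the per-item function is a placeholder); schedule(static) divides the items evenly up front, while schedule(dynamic) would hand items out as threads become free, mitigating the case where one slow item determines the total time:

#include <omp.h>
#include <math.h>

static double process(double item) {   /* placeholder per-item work */
    return sqrt(item);
}

void process_all(double *items, int n) {
    /* static: the dataset is divided equally between the threads up front */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        items[i] = process(items[i]);
}
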
Fork/Join Pattern
Programmers, and many multithreaded libraries, use the number of logical processor threads available as the number of threads to fork.
CPU threads are expensive to create, destroy, and utilize.
The fork/join pattern is useful when there is an unknown amount of concurrency available in a problem: traversing a tree structure, for example, may fork additional threads when it encounters another node.

Divide and Conquer Pattern
A pattern for breaking down (dividing) large problems into smaller sections, each of which can be conquered.
Useful with recursion.
Example: quicksort, which recursively partitions the data into two sets, above the pivot point and below the pivot point; see the sketch below.

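A minimal sketch of quicksort in C; the two recursive calls work on disjoint parts of the array, so they could be forked as parallel tasks (e.g. OpenMP tasks):

static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[(lo + hi) / 2];
    int i = lo, j = hi;
    while (i <= j) {                 /* partition around the pivot       */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    quicksort(a, lo, j);             /* below the pivot: divide ...      */
    quicksort(a, i, hi);             /* above the pivot: ... and conquer */
}
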
Implicit Parallelism
Implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.
In other words, it is compiler-level parallelism, enabled for example by OpenMP flags or -O3 optimization flags.

Implicit Parallelism
Advantages:
No need to worry about task division or communication.
Focus on the problem rather than on the parallelization.
Substantial improvement in programmer productivity.
Adds simplicity and clarity to the program.

Implicit Parallelism
Disadvantages:
The programmer loses control over the parallel execution of the program.
It usually gives less than optimal parallel efficiency.
Debugging is difficult.
Every program has some parallel and some serial logic.

Explicit Parallelism
Explicit parallelism is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls.
These primitives relate to process synchronization, communication, and partitioning.
The programmer has absolute control over the parallel execution; a skilled programmer takes advantage of explicit parallelism to produce very efficient code.
However, explicit parallelism is difficult: there is extra work in planning the decomposition/partitioning and the synchronization of the concurrent processes.

Classes of Parallel Computers
Multicore computing
Symmetric multiprocessing
Distributed computing
Cluster computing
Massively parallel computing
Grid computing
Reconfigurable computing with field-programmable gate arrays (FPGAs)
General-purpose computing on graphics processing units (GPGPU)
Application-specific integrated circuits (ASICs)
Vector processors

Multicore Computing
A multicore processor contains multiple execution units (cores) on the same chip and issues multiple instructions per cycle from multiple instruction streams.
Example: Intel quad-core processors.
Advantages:
Placing multiple CPU cores on the same die allows the cache-coherency logic to operate at a much higher clock rate.
It allows higher performance at lower energy.
Disadvantages:
Thermal power is difficult to manage.
Processing cores share the same bus and memory, limiting real-world performance.

Symmetric Multiprocessing (SMP)
An SMP is a system with multiple identical processors that share memory and connect via a bus.
SMPs do not usually comprise more than 32 processors.
SMPs are cost-effective if a sufficient amount of memory bandwidth exists.
Example: Intel Xeon machines.

Distributed Computing
A distributed-memory computer system in which the processing elements are connected by a network.
Examples:
Telephone networks
World Wide Web / Internet
Aircraft control systems
Multiplayer online games
Grid computing

Cluster Computing
A cluster is a group of loosely coupled computers that work together so closely that they can be regarded as a single computer.
It consists of multiple standalone machines connected by a network.
Load balancing is difficult if the cluster machines are not symmetric.
Example: a Beowulf cluster (commodity machines on a TCP/IP LAN).
Most of the Top500 supercomputers are clusters.

Massively Parallel Computing (MPP)
An MPP is a single computer with many networked processors.
Similar to clusters, but with specialized interconnect networks, and larger than clusters: typically more than 100 processors.
Each CPU has its own memory and its own copy of the OS and applications.
Processors communicate via a high-speed interconnect.
Example: Blue Gene/L, the fifth-fastest supercomputer at the time of writing.

NUMA
NUMA stands for Non-Uniform Memory Access: a special type of shared-memory architecture in which the access times to different memory locations by one processor may vary, as may the access times to the same memory location by different processors.
It is known as a tightly coupled form of cluster computing.
NUMA influences memory-access performance.
Most current operating systems support NUMA: Windows 7 and 8, and the Linux kernel since version 2.5.

NUMA
Modern CPUs work faster than the main memory they use.
In the uniform-access design shown in the accompanying figure (omitted here), all of the processors have equal access to the memory and I/O in the system; under NUMA, each processor instead has faster access to its own local memory.
Special attention is required to write software that runs well on NUMA machines.

Parallel Random Access Machine (PRAM)
The PRAM is a shared-memory abstract machine.
Read/write conflicts when accessing the same shared memory location simultaneously are resolved by one of the following strategies:
EREW - Exclusive Read, Exclusive Write: every memory cell can be read or written by only one processor at a time.
CREW - Concurrent Read, Exclusive Write: multiple processors can read a memory cell, but only one can write at a time.
CRCW - Concurrent Read, Concurrent Write: multiple processors can read and write simultaneously.

FAQ on Parallel Computing

FAQ
Shared-memory architecture: a single address space is visible to all execution threads.
Task latency: the time taken for a task to complete after a request for it is made.
Task throughput: the number of tasks completed in a given time.
Speedup: the ratio of a performance metric obtained using a single processor to that obtained using a set of parallel processors, S = T_1 / T_P.
Parallel efficiency: the speedup per processor, E = S / P.

FAQ
Maximum speedup possible according to Amdahl's law: 1/f_s, where f_s is the inherently sequential fraction of the time taken by the best sequential implementation of the task (prove it!); a sketch follows below.
Cache coherence: different processors maintain their own local caches, so multiple copies of the same data exist. Coherence implies that accesses to the local copies behave similarly to accesses to the original location, apart from the access time.
False sharing: the sharing of a cache line by distinct variables. If such variables are not accessed together, the unaccessed variable is unnecessarily brought into the cache along with the accessed variable.

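A sketch of the requested proof, assuming the sequential runtime is T_1 and the parallel fraction scales perfectly across P processors:

S(P) = T_1 / T_P = T_1 / (f_s T_1 + (1 - f_s) T_1 / P) = 1 / (f_s + (1 - f_s) / P)

This is increasing in P and tends to 1/f_s as P goes to infinity, so no number of processors can achieve a speedup better than 1/f_s.
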
NC = P Problem
Finally, let us look at an unsolved (open) problem in parallel computing.

NC = P?
The class NC is the set of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
That is, a problem is in NC if there exist constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors.

NC = P Problem
P: the tractable, or efficiently solvable, decision problems; those that can be solved by a deterministic Turing machine in polynomial time.
NC: the problems that can be efficiently solved on a parallel computer.
NC ⊆ P, since polylogarithmic-time parallel computations can be simulated by polynomial-time sequential ones.
Open problem: does NC = P?

NC = P Problem
NC is known to include many problems, among them:
Integer addition, multiplication, and division
Matrix multiplication, determinant, inverse, and rank
Polynomial GCD (e.g. via the Sylvester matrix)

Simple Coding

Simple Vector Addition

Vector addition:

for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

Explanation:
Step 1: c[0] = a[0] + b[0]
Step 2: c[1] = a[1] + b[1]
...
Step N: c[N-1] = a[N-1] + b[N-1]

Each addition is independent of the others.

Parallelization in CPU

Assume you have 4 processors and N is divisible by 4.

Vector addition, with processor j handling elements j, j+4, j+8, ...:

/* 4 is the number of processors */
for (j = 0; j < 4; j++)        /* conceptually, each j runs on its own processor */
    for (i = j; i < N; i += 4)
        c[i] = a[i] + b[i];

Explanation:
Step 1: c[0:3] = a[0:3] + b[0:3]
Step 2: c[4:7] = a[4:7] + b[4:7]
...
Step N/4: c[N-4:N-1] = a[N-4:N-1] + b[N-4:N-1]

Four elements are added simultaneously.

Parallelization in Intel OpenMP

OpenMP code:

#include <omp.h>

int chunk = 64;  /* iterations handed to each thread at a time */
#pragma omp parallel for schedule(static, chunk)
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

The OpenMP runtime takes care of the parallelization (compile with an OpenMP flag, e.g. gcc -fopenmp).
The Intel compiler can also auto-parallelize such loops.

THANK YOU