GPU Basics
Types of Parallelism
S. Sundar & M. Panchatcharam
August 9, 2014

Outline
1 Task-based parallelism
2 Data-based parallelism
3 Flynn's taxonomy
4 Other parallel patterns
5 Classes of parallel computers
6 Simple coding

Pipeline Parallelism
- A typical OS exploits a type of parallelism called task-based parallelism.
- Example: a user reads an article on a website while playing music in a media player.
- In terms of parallel programming, Linux exposes this through the pipe operator, which chains commands into a pipeline.

Coarse/Fine-Grained Parallelism
- Applications are often classified according to how often their subtasks need to synchronize or communicate with each other.
- If subtasks communicate many times per second, the parallelism is fine-grained.
- If they communicate less often than that, it is coarse-grained.
- If they rarely or never have to communicate, the application is embarrassingly parallel.
- Embarrassingly parallel applications are the easiest to parallelize.

Task Parallelism
- In a multiprocessor system, task parallelism is achieved when each processor executes a different thread on the same or different data.
- The threads may execute the same or different code.
- Communication between threads takes place by passing data from one thread to the next.

Data Parallelism
- In a multiprocessor system, each processor performs the same task on different pieces of distributed data.
- Consider adding two matrices: addition is the task, carried out on different pieces of data (each element).
- This is fine-grained parallelism.
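To make the task/data contrast concrete, here is a minimal OpenMP sketch (not from the original slides; the function names, array sizes, and the doubling of work are illustrative): the sections construct runs two different tasks concurrently (task parallelism), while the parallel for applies one operation to every matrix element (data parallelism).

    #include <omp.h>
    #include <stdio.h>

    #define N 4

    /* Two unrelated jobs, standing in for "read an article" and
       "play music": different code, run concurrently. */
    static void decode_audio(void) { printf("decoding audio\n"); }
    static void render_page(void)  { printf("rendering page\n"); }

    int main(void) {
        /* Task parallelism: each section is a different task. */
        #pragma omp parallel sections
        {
            #pragma omp section
            decode_audio();
            #pragma omp section
            render_page();
        }

        /* Data parallelism: the same task (addition) on every element. */
        double a[N][N], b[N][N], c[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i;
                b[i][j] = j;
            }
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
        printf("c[1][2] = %g\n", c[1][2]);
        return 0;
    }

Compiled with an OpenMP flag such as gcc's -fopenmp, the two sections run on different threads while the element-wise additions are divided among the thread team.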
Instruction-Level Parallelism
- Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously; the potential overlap among instructions is called instruction-level parallelism.
- At the hardware level ILP is exploited dynamically; at the software level it is exploited statically, by the compiler.

Flynn's Taxonomy
- Flynn's taxonomy is a classification of different computer architectures. The classes are:
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

SISD
- Standard serial programming: a single instruction stream operates on a single data stream.
- A single-core CPU is enough.

MIMD
- Today's dual- or quad-core desktop machines.
- Work is allocated to one of the N CPU cores.
- Each thread has an independent stream of instructions.
- The hardware contains the control logic for decoding the separate instruction streams.

SIMD
- A form of data parallelism.
- There is a single instruction stream at any one point in time.
- A single set of logic decodes and executes the instruction stream.

MISD
- Many functional units perform different operations on the same data.
- Pipeline architectures belong to this type.
- Example: Horner's rule for evaluating a polynomial,

      y = (...(((a_n*x + a_{n-1})*x + a_{n-2})*x + a_{n-3})*x + ... + a_1)*x + a_0
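A sketch of Horner's rule in C (illustrative, not from the slides): each iteration updates the accumulator from its previous value, so the evaluation is a chain of multiply-add stages, the kind of staged computation a pipeline can stream many evaluation points through.

    #include <stdio.h>

    /* Evaluate y = a[n]*x^n + ... + a[1]*x + a[0] by Horner's rule.
       Each step depends on the previous value of y, forming a chain
       of multiply-add stages. */
    double horner(const double a[], int n, double x) {
        double y = a[n];
        for (int i = n - 1; i >= 0; i--)
            y = y * x + a[i];
        return y;
    }

    int main(void) {
        double a[] = {1.0, 2.0, 3.0};      /* 3x^2 + 2x + 1 */
        printf("%g\n", horner(a, 2, 2.0)); /* prints 17 */
        return 0;
    }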
Loop-Based Pattern
- Are you familiar with loop structures? There are two types: entry-controlled loops and exit-controlled loops.
- Loops are an easy pattern to parallelize: with inter-loop dependencies removed, decide how to split, or partition, the work between the available processors.
- Optimize the communication between processors and the use of on-chip resources; communication overhead is the usual bottleneck.
- Decompose based on the number of logical hardware threads available. Oversubscribing that number leads to poor performance, because the resulting context switching is performed in software by the OS.
- Be aware of hidden dependencies in the existing serial implementation.
- Concentrate on the inner loops and one or more outer loops; the best approach is to parallelize only the outer loops.
- Note: most loops can be flattened, reducing an inner and an outer loop to a single loop where possible.
- Example: an image-processing algorithm with the X pixel axis in the inner loop and the Y axis in the outer loop can be flattened by treating all pixels as a single one-dimensional array and iterating over image coordinates.
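A sketch of the flattening idea in C (illustrative; the row-major image layout and the brighten operation are assumptions): the nested Y/X loops collapse into a single loop over all pixels, giving one large iteration space that splits evenly among any number of threads.

    #include <stddef.h>

    /* Nested form: X pixels in the inner loop, Y rows in the outer loop. */
    void brighten_2d(unsigned char *img, size_t width, size_t height) {
        for (size_t y = 0; y < height; y++)
            for (size_t x = 0; x < width; x++)
                img[y * width + x] += 10;
    }

    /* Flattened form: one loop over all pixels, a single iteration
       space that is easy to partition among threads. */
    void brighten_flat(unsigned char *img, size_t width, size_t height) {
        for (size_t i = 0; i < width * height; i++)
            img[i] += 10;
    }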
Fork/Join Pattern
- A common pattern in otherwise serial programming where there are synchronization points and only certain parts of the program are parallel.
- The serial code runs until it reaches work that can be distributed to P processors in some manner.
- It then forks, spawning N threads/processes that perform the calculation in parallel; these execute independently and finally converge, or join, once all the calculations are complete.
- This is the typical approach found in OpenMP: the code splits into N threads and later converges to a single thread again.
- Picture a queue of data items split across three processing cores: each data item is processed independently, and the result is then written to the appropriate destination.
- The pattern is typically implemented by partitioning the data: the serial code launches N threads and divides the dataset equally between them.
- This works well if each packet of data takes the same time to process; if one thread takes much longer than the rest, that thread alone determines the total time.
- Why choose three threads rather than six? In reality there may be millions of data items, and attempting to fork a million threads would make almost any OS fail; the OS instead applies a "fair scheduling" policy across a bounded number of threads.
- Programmers, and many multithreading libraries, therefore fork as many threads as there are logical processor threads available, since CPU threads are expensive to create, destroy, and manage.
- Fork/join is also useful when the amount of concurrency in a problem is unknown in advance: traversing a tree structure, for example, may fork an additional thread whenever another node is encountered.
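A minimal fork/join sketch with OpenMP (illustrative; the doubling operation and array size are assumptions): the serial code forks a team sized to the available logical processors, each thread takes an equal share of the data, and execution joins back into a single thread at the end of the parallel region.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double data[N], result[N];
        for (int i = 0; i < N; i++) data[i] = i;

        /* Fork: spawn one thread per logical processor (the default
           team size), splitting the dataset evenly among them. */
        #pragma omp parallel
        {
            int t  = omp_get_thread_num();
            int nt = omp_get_num_threads();
            for (int i = t; i < N; i += nt)   /* this thread's share */
                result[i] = 2.0 * data[i];
        }   /* Join: implicit barrier; execution is serial again here. */

        printf("result[10] = %g\n", result[10]);
        return 0;
    }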
Divide and Conquer Pattern
- A pattern for breaking down (dividing) large problems into smaller sections, each of which can be conquered independently.
- Useful with recursion.
- Example: quicksort, which recursively partitions the data into two sets, the elements above the pivot point and those below it.
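A sketch of how the divide step maps to parallel work, using OpenMP tasks (illustrative; the cutoff-free recursion and simple Lomuto partition are simplifications): each recursive half becomes a task that any idle thread in the team can pick up.

    #include <omp.h>
    #include <stdio.h>

    /* Partition around a pivot: smaller elements end up to its left. */
    static int partition(int a[], int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    /* Divide and conquer: the two halves are independent, so each
       recursive call can become an OpenMP task. */
    static void quicksort(int a[], int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        #pragma omp task shared(a)
        quicksort(a, lo, p - 1);
        #pragma omp task shared(a)
        quicksort(a, p + 1, hi);
        #pragma omp taskwait
    }

    int main(void) {
        int a[] = {5, 2, 8, 1, 9, 3};
        int n = sizeof a / sizeof a[0];
        #pragma omp parallel
        #pragma omp single          /* one thread seeds the task tree */
        quicksort(a, 0, n - 1);
        for (int i = 0; i < n; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }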
Implicit Parallelism
- Implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.
- In other words, it is compiler-level parallelism, enabled for instance by OpenMP flags or the -O3 optimization flag.
- Advantages: there is no need to worry about task division or communication; the programmer can focus on the problem rather than on parallelization; productivity improves substantially; and the program gains simplicity and clarity.
- Disadvantages: the programmer loses control over the parallel execution of the program; the parallel efficiency is usually less than optimal; debugging is difficult; and every program contains a mix of parallel and serial logic.

Explicit Parallelism
- Explicit parallelism is the expression of concurrency through primitives in the form of special-purpose directives or function calls, covering process synchronization, communication, and partitioning.
- It gives the programmer absolute control over parallel execution, and a skilled programmer can exploit it to produce very efficient code.
- However, explicit parallelism is difficult: it requires extra work to plan the decomposition/partitioning and the synchronization of the concurrent processes.

Classes of Parallel Computers
- Multicore computing
- Symmetric multiprocessing
- Distributed computing
- Cluster computing
- Massively parallel computing
- Grid computing
- Reconfigurable computing with field-programmable gate arrays (FPGAs)
- General-purpose computing on graphics processing units (GPGPU)
- Application-specific integrated circuits (ASICs)
- Vector processors

Multicore Computing
- A multicore processor contains multiple execution units on the same chip and issues multiple instructions per cycle from multiple instruction streams.
- Example: Intel quad-core processors.
- Advantages: placing multiple CPU cores on the same die allows the cache-coherency circuitry to operate at a much higher clock rate, giving higher performance at lower energy.
- Disadvantages: thermal power is difficult to manage, and cores sharing the same bus and memory bandwidth limit real-world performance.

Symmetric Multiprocessing (SMP)
- An SMP is a system with multiple identical processors that share memory and connect via a bus.
- SMPs do not usually comprise more than 32 processors.
- SMPs are cost-effective if a sufficient amount of memory bandwidth exists.
- Example: Intel Xeon machines.

Distributed Computing
- A distributed-memory computer system in which the processing elements are connected by a network.
- Examples: telephone networks, the World Wide Web/Internet, aircraft control systems, multiplayer online games, grid computing.

Cluster Computing
- A cluster is a group of loosely coupled computers that work together closely enough to be regarded as a single computer.
- It consists of multiple standalone machines connected by a network.
- Load balancing is difficult if the cluster machines are not symmetric.
- Examples: Beowulf clusters over a TCP/IP LAN. Most of the Top500 supercomputers are clusters.

Massively Parallel Computing (MPP)
- An MPP system is a single computer with many networked processors.
- It is similar to a cluster, but with specialized interconnect networks, and larger: typically more than 100 processors.
- Each CPU has its own memory and its own copy of the OS and applications; the CPUs communicate via a high-speed interconnect.
- Example: Blue Gene/L, at the time the fifth-fastest supercomputer.

NUMA
- NUMA stands for Non-Uniform Memory Access: a special type of shared-memory architecture in which the access times to different memory locations from one processor may vary, as may the access times to the same memory location from different processors.
- It is known as a tightly coupled form of cluster computing, and it strongly influences memory-access performance.
- Most current operating systems, such as Windows 7 and 8, support NUMA; Linux has supported it since kernel 2.5.
- Modern CPUs work much faster than the main memory they use. In a uniform (SMP/UMA) design all processors have equal access to memory and I/O, whereas on a NUMA machine the access cost depends on where the memory lives, so software needs special attention to run well on NUMA systems.

Parallel Random Access Machine (PRAM)
- The PRAM is a shared-memory abstract machine.
- Read/write conflicts when accessing the same shared memory location simultaneously are resolved by one of the following policies:
- EREW (Exclusive Read, Exclusive Write): every memory cell can be read or written by only one processor at a time.
- CREW (Concurrent Read, Exclusive Write): multiple processors can read a memory cell, but only one can write at a time.
- CRCW (Concurrent Read, Concurrent Write): multiple processors can both read and write.

FAQ on Parallel Computing
- Shared-memory architecture: a single address space is visible to all execution threads.
- Task latency: the time taken for a task to complete after a request for it is made.
- Task throughput: the number of tasks completed in a given time.
- Speed-up: the ratio of some performance metric obtained using a single processor to that obtained using a set of parallel processors.
- Parallel efficiency: the speed-up per processor.
- Maximum speed-up possible according to Amdahl's law: 1/f_s, where f_s is the inherently sequential fraction of the time taken by the best sequential implementation of the task. (Prove!)
- Cache coherence: different processors maintain their own local caches, so multiple copies of the same data can exist. Coherence means that accesses to these copies behave like accesses to a single copy, apart from the access time.
- False sharing: distinct variables sharing one cache line. If such variables are not accessed together, the unaccessed variable is needlessly brought into the cache along with the accessed one.
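The Amdahl bound invited by the "Prove!" follows in two lines (a standard argument, not spelled out on the slide). Let T be the best sequential time and f_s the inherently sequential fraction; with P processors only the remaining work speeds up:

    T_P >= f_s*T + (1 - f_s)*T/P

so the speed-up satisfies

    S(P) = T/T_P <= 1 / (f_s + (1 - f_s)/P)  ->  1/f_s  as P -> infinity.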
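A sketch of false sharing in C with OpenMP (illustrative; the counter loop, thread cap, and 64-byte line size are assumptions): per-thread counters packed into one array share a cache line, so every increment by one thread invalidates that line in the other threads' caches, while padding gives each counter a line of its own.

    #include <omp.h>
    #include <stdio.h>

    #define MAX_THREADS 64          /* assumes at most 64 threads */
    #define LINE 64                 /* typical cache-line size in bytes */

    /* Padded so each thread's counter lives in its own cache line. */
    struct padded { long value; char pad[LINE - sizeof(long)]; };

    int main(void) {
        long packed[MAX_THREADS] = {0};          /* counters share lines */
        struct padded apart[MAX_THREADS] = {{0}};

        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < 10000000; i++) {
                packed[t]++;        /* false sharing: neighbours' writes
                                       keep invalidating this cache line */
                apart[t].value++;   /* no sharing: private cache line */
            }
        }
        printf("packed[0]=%ld apart[0]=%ld\n", packed[0], apart[0].value);
        return 0;
    }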
The NC = P Problem
- Finally, let us look at an unsolved (open) problem in parallel computing.
- NC is the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors: a problem is in NC if there exist constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors.
- P is the class of tractable, or efficiently solvable, decision problems: those solvable by a deterministic Turing machine in polynomial time.
- NC ⊆ P, since polylogarithmic-time parallel computations can be simulated by polynomial-time sequential ones.
- Open problem: does NC = P?
- NC is known to include many problems, among them integer addition, multiplication, and division; matrix multiplication, determinant, inverse, and rank; and polynomial GCD (via the Sylvester matrix).

Simple Coding

Simple Vector Addition
Vector addition in C:

    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

Executed serially, this performs

    Step 1: c[0] = a[0] + b[0]
    Step 2: c[1] = a[1] + b[1]
    ...
    Step N: c[N-1] = a[N-1] + b[N-1]

Each addition is independent of the others.

Parallelization on the CPU
Assume you have 4 processors and N is divisible by 4. Split the work so that processor j handles every fourth element:

    /* 4 is the number of processors; conceptually, one j per processor */
    for (j = 0; j < 4; j++)
        for (i = j; i < N; i += 4)   /* processor j takes i = j, j+4, ... */
            c[i] = a[i] + b[i];

With the four j-iterations running on the four processors, the additions proceed as

    Step 1:   c[0:3] = a[0:3] + b[0:3]
    Step 2:   c[4:7] = a[4:7] + b[4:7]
    ...
    Step N/4: c[N-4:N-1] = a[N-4:N-1] + b[N-4:N-1]

so four elements are added simultaneously.

Parallelization with Intel OpenMP
OpenMP code:

    #include <omp.h>

    /* chunk is the block of iterations handed to each thread */
    #pragma omp parallel for schedule(static, chunk)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

The OpenMP runtime takes care of the parallelization; Intel compilers can also auto-parallelize such loops.

THANK YOU