Introduction to OpenMP
For a more detailed tutorial, see:
http://www.openmp.org (look at the presentations)
Also see:
https://computing.llnl.gov/tutorials/openMP/
Concepts
• An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
• Directive-based programming
– declare properties of language structures
• sections of code, loops
– scope variables
• A few service routines
– get information
• Compiler options
• Environment variables
OpenMP Programming Model
• Directive syntax
– C/C++: #pragma omp directive [clause list]
– Fortran: C$OMP construct [clause [clause] … ]
• A program executes serially until it encounters a parallel directive
– #pragma omp parallel [clause list]
– /* structured block of code */
• The clause list specifies conditions
– Conditional parallelism: if (cond)
– Degree of concurrency: num_threads(int)
– Data handling: private(vlist), firstprivate(vlist), shared(vlist)
OpenMP Programming Model
• fork-join parallelism
• Master thread spawns a team of threads as
needed.
Typical OpenMP Use
• Generally used to parallelize loops
– Find most time consuming loops
– Split iterations up between threads
Serial version:

void main()
{
  double Res[1000];
  for (int i = 0; i < 1000; i++) {
    do_huge_comp(Res[i]);
  }
}

Parallel version:

void main()
{
  double Res[1000];
  #pragma omp parallel for
  for (int i = 0; i < 1000; i++) {
    do_huge_comp(Res[i]);
  }
}
Thread Interaction
• OpenMP operates using shared memory
– Threads communicate via shared variables
• Unintended sharing can lead to race
conditions
– output changes due to thread scheduling
• Control race conditions using synchronization
– synchronization is expensive
– change the way data is stored to minimize the need
for synchronization
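As a minimal sketch of the problem (the counter example is hypothetical): each thread increments a shared counter, and without the critical section the read-modify-write races, so the final value varies from run to run.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int counter = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < 100000; i++)
        {
            /* Without this critical section the increment races
               and the final count is unpredictable (and usually low). */
            #pragma omp critical
            counter++;
        }
    }
    printf("counter = %d\n", counter);  /* num_threads * 100000 */
    return 0;
}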
Syntax format
• Compiler directives
– C/C++
• #pragma omp construct [clause [clause] …]
– Fortran
• C$OMP construct [clause [clause] … ]
• !$OMP construct [clause [clause] … ]
• *$OMP construct [clause [clause] … ]
• Because OpenMP uses directives, a compiler that doesn't support OpenMP simply ignores them; no changes need to be made to the program, which still builds and runs serially
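One portability idiom worth noting (a sketch, not from the slides): an OpenMP-aware compiler defines the standard _OPENMP macro, so runtime calls can be guarded and the same source still compiles serially.

#ifdef _OPENMP
#include <omp.h>
#endif

int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();  /* real thread id under OpenMP */
#else
    return 0;                     /* serial build: act as the lone master thread */
#endif
}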
Using OpenMP
• Some compilers can automatically place directives with option
– -qsmp=auto (IBM xlc)
– some loops may speed up, some may slow down
• Compiler option required when you use directives
– icc -openmp (Linux – Intel compiler)
– gcc -fopenmp (gcc versions >= 4.2)
• Scoping variables is the hard part!
– shared variables, thread private variables
OpenMP Example

#include <stdio.h>
#include <omp.h>

main ()
{
  int nthreads, tid;

  /* Fork a team of threads giving them their own copies of variables */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}

[snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello
[snell@atropos openmp]$ ./hello
Hello World from thread = 0
Hello World from thread = 1
Number of threads = 2
[snell@atropos openmp]$
OpenMP Directives
• 5 categories
– Parallel Regions
– Worksharing
– Data Environment
– Synchronization
– Runtime functions / environment variables
• Basically the same between C/C++ and
Fortran
Parallel Regions
• Create threads with omp parallel
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  dosomething(ID, A);
}

• Threads share A (the default behavior)
• The master thread creates the threads
• All threads start at the same time, then synchronize at an implicit barrier at the end before the serial code continues
Parallel Regions
#pragma omp parallel [clause ...] newline
structured_block
Clauses
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
num_threads (integer-expression)
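A sketch combining several of these clauses on one directive (n, scale, and sum are hypothetical variables):

int i, n = 100000;
double scale = 2.0;
double sum = 0.0;

/* Run in parallel only if n is large enough, with at most 4 threads;
   each thread gets its own copy of scale initialized from the value above,
   and per-thread partial sums are combined into sum at the end. */
#pragma omp parallel for num_threads(4) if (n > 5000) \
        default(none) shared(n) firstprivate(scale) reduction(+:sum)
for (i = 0; i < n; i++)
    sum += scale * i;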
Parallel Regions
• The number of threads in a parallel region is determined by the following factors, in order of precedence:
1. Evaluation of the IF clause
2. Setting of the NUM_THREADS clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default: usually the number of CPUs on a node, though it could be dynamic
• Threads are numbered from 0 (master thread) to N-1
Sections construct
• The sections construct gives a different
structured block to each thread
• By default there is a barrier at the end. Use
the nowait clause to turn off.
#pragma omp parallel
#pragma omp sections
{
  #pragma omp section
  x_calculation();
  #pragma omp section
  y_calculation();
  #pragma omp section
  z_calculation();
}
Work-sharing constructs
• the for construct splits up loop iterations
• By default, there is a barrier at the end of the
“omp for”. Use the “nowait” clause to turn off the
barrier.
#pragma omp parallel
#pragma omp for
for (I=0;I<N;I++)
{
NEAT_STUFF(I);
}
Short-hand notation
• Can combine parallel and work sharing
constructs
#pragma omp parallel for
for (I=0;I<N;I++){
NEAT_STUFF(I);
}
• There is also a “parallel sections”
construct
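A sketch of that combined form, reusing the calculation functions from the sections example above:

#pragma omp parallel sections
{
    #pragma omp section
    x_calculation();   /* one thread executes this block */
    #pragma omp section
    y_calculation();   /* another thread executes this one concurrently */
    #pragma omp section
    z_calculation();
}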
A Rule
• In order to be made parallel, a loop must have canonical "shape":

for (index = start; index op end; incr)

where op is one of <, <=, >, or >=, and incr is one of:

index++;   ++index;   index--;   --index;
index += inc;   index -= inc;
index = index + inc;   index = inc + index;   index = index - inc;
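For example, a while loop has no canonical shape, so it cannot be parallelized as written; rewriting it as an equivalent counted for loop makes it eligible (process() and data are hypothetical):

/* Not canonical: the runtime cannot compute the trip count up front */
i = 0;
while (i < n) {
    process(data[i]);
    i++;
}

/* Canonical: the trip count is known at loop entry, so it can be split */
#pragma omp parallel for
for (i = 0; i < n; i++)
    process(data[i]);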
An example
#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
  for (j = 0; j < n; j++)
    a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

By definition, private variable values are undefined at loop entry and exit. To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses.

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++)
    x[j] = g(i, x[j-1]);
  answer[i] = x[j] - x[i];
}
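lastprivate works the other way: it copies the value from the sequentially last iteration back to the original variable after the loop. A minimal sketch (f() is a hypothetical per-iteration computation):

double x;

#pragma omp parallel for lastprivate(x)
for (i = 0; i < n; i++)
    x = f(i);

/* x now holds the value from iteration i == n-1, matching the serial loop */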
Scheduling Iterations
• The schedule clause affects how loop iterations are mapped onto threads
• schedule(static [,chunk])
– Deal out blocks of iterations of size "chunk" to each thread.
• schedule(dynamic[,chunk])
– Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
• schedule(guided[,chunk])
– Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
• schedule(runtime)
– Schedule and chunk size are taken from the OMP_SCHEDULE environment variable.
An example
#pragma omp parallel for private(j) schedule(static, 2)
for (i = 0; i < n; i++)
  for (j = 0; j < m; j++)
    x[j] = g(i, x[j-1]);

You can tune the chunk size to address load balancing issues, etc.
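With schedule(runtime), the policy can be chosen without recompiling; a sketch, assuming the user sets OMP_SCHEDULE (e.g., export OMP_SCHEDULE="dynamic,4") before running (work() is a hypothetical loop body):

/* Schedule and chunk size are read from OMP_SCHEDULE at run time,
   so different policies can be compared on the same binary. */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
    work(i);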
Synchronization Directives
• BARRIER
– inside PARALLEL, all threads synchronize
• CRITICAL (lock) / END CRITICAL (lock)
– a section that can be executed by only one thread at a time
– lock is an optional name to distinguish several critical constructs from each other
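A sketch of named critical sections (the names and functions are hypothetical): threads contending for different structures do not block each other, while access to each structure stays serialized.

#pragma omp parallel
{
    #pragma omp critical (queue_lock)
    enqueue(work_queue, make_item());   /* one thread at a time here */

    #pragma omp critical (stats_lock)
    total_items++;   /* serialized separately from queue_lock above */
}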
An example
double area, pi, x;
int i, n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++)
{
x = (i + 0.5)/n;
#pragma omp critical
area += 4.0/(1.0 + x*x);
}
pi = area / n;
Reductions
• Sometimes you want each thread to
calculate part of a value then collapse
all that into a single value
• Done with reduction clause
area = 0.0;
#pragma omp parallel for private(x) reduction (+:area)
for (i = 0; i < n; i++)
{
x = (i + 0.5)/n;
area += 4.0/(1.0 + x*x);
}
pi = area / n;
Another Example

/* A Monte Carlo algorithm for calculating pi */
int count;            /* points inside the unit quarter circle */
unsigned short xi[3]; /* random number seed */
int i;                /* loop index */
int samples;          /* number of points to generate */
double x, y;          /* coordinates of points */
double pi;            /* estimate of pi */

xi[0] = 1;            /* these statements set up the random seed */
xi[1] = 1;
xi[2] = 0;
count = 0;
for (i = 0; i < samples; i++)
{
  x = erand48(xi);
  y = erand48(xi);
  if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

OpenMP Issues
• Each thread needs different random number seeds
• count is shared; we need the aggregate across threads
OpenMP Version

/* A Monte Carlo algorithm for calculating pi */
int count;            /* points inside the unit quarter circle */
unsigned short xi[3]; /* random number seed */
int i;                /* loop index */
int samples;          /* number of points to generate */
double x, y;          /* coordinates of points */
double pi;            /* estimate of pi */

omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1;
xi[2] = omp_get_thread_num();
count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
  x = erand48(xi);
  y = erand48(xi);
  if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

What is wrong with this?
Corrected Version

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

main(int argc, char *argv[])
{
  /* A Monte Carlo algorithm for calculating pi */
  int count;          /* points inside the unit quarter circle */
  int i;              /* loop index */
  int samples;        /* number of points to generate */
  double x, y;        /* coordinates of points */
  double pi;          /* estimate of pi */

  samples = atoi(argv[1]);
  count = 0;

  #pragma omp parallel
  {
    unsigned short xi[3];   /* random number seed */
    xi[0] = 1;              /* these statements set up the random seed */
    xi[1] = 1;
    xi[2] = omp_get_thread_num();
    printf("I am thread %d\n", xi[2]);
    #pragma omp for firstprivate(xi) private(x,y) reduction(+:count)
    for (i = 0; i < samples; i++)
    {
      x = erand48(xi);
      y = erand48(xi);
      if (x*x + y*y <= 1.0) count++;
    }
  }
  pi = 4.0 * (double)count / (double)samples;
  printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);
}

[snell@atropos openmp]$ time ./montecarlopi.single 50000000
I am thread 0
Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149
real  0m4.628s
user  0m4.568s
sys   0m0.030s
[snell@atropos openmp]$ time ./montecarlopi 50000000
I am thread 1
I am thread 0
Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179
real  0m2.480s
user  0m4.620s
sys   0m0.030s
[snell@atropos openmp]$
…
An alternate version

#pragma omp parallel private(xi, t, x, y, local_count)
{
  xi[0] = 1; xi[1] = 1;
  xi[2] = tid = omp_get_thread_num();
  t = omp_get_num_threads();
  local_count = 0;
  for (i = tid; i < samples; i += t)
  {
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) local_count++;
  }
  #pragma omp critical
  count += local_count;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

[snell@atropos openmp]$ time ./montecarlopi2 50000000
I am thread 0
I am thread 1
1: local_count is 16461003
1: local_count is 16392925
Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831
real  0m5.053s
user  0m9.697s
sys   0m0.053s
[snell@atropos openmp]$

Problems!
Corrected Version

Corrections:
• i should be private!
• the random number seed array (xi) should also be private
• use reduction

/* A Monte Carlo algorithm for calculating pi */
int count;          /* points inside the unit quarter circle */
int local_count;    /* points inside the unit quarter circle */
int t, tid;
int i;              /* loop index */
int samples;        /* number of points to generate */
double x, y;        /* coordinates of points */
double pi;          /* estimate of pi */

samples = atoi(argv[1]);

#pragma omp parallel private(i,t,tid,x,y,local_count) reduction(+:count)
{
  unsigned short xi[3];   /* random number seed */
  xi[0] = 1;              /* these statements set up the random seed */
  xi[1] = 1;
  xi[2] = tid = omp_get_thread_num();
  t = omp_get_num_threads();
  count = 0;
  //printf("I am thread %d\n", xi[2]);
  for (i = tid; i < samples; i += t)
  {
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;
  }
  //printf("%d: count is %d\n", tid, count);
}
pi = 4.0 * (double)count / (double)samples;
printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);

[snell@atropos openmp]$ time ./montecarlopi3 50000000
Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156
real  0m2.321s
user  0m4.559s
sys   0m0.014s
Serial Directives
• MASTER / END MASTER
– executed by master thread only
• DO SERIAL / END DO SERIAL
– loop immediately following should not be
parallelized
– useful with -qsmp=omp:auto
• SINGLE
– only one thread executes the block
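A sketch of the C forms of the master and single constructs:

#pragma omp parallel
{
    #pragma omp master
    printf("printed by the master thread only\n");   /* no implied barrier */

    #pragma omp barrier

    #pragma omp single
    printf("printed by exactly one (any) thread\n"); /* implied barrier at the end */
}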
Example Serial Execution
/* A Monte Carlo algorithm for calculating pi */
…
omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
  x = erand48(xi);
  y = erand48(xi);
  if (x*x + y*y <= 1.0) count++;
  #pragma omp single
  {
    printf("Loop Iteration: %d\n", i);
  }
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);
Conditional Execution
• The overhead of fork/join is high
• If a loop is small, you don't want to parallelize it
• But you may not know how big it is until runtime
• Conditional clause for parallel execution:
– if ( expression )
area = 0.0;
#pragma omp parallel for private(x) reduction (+:area) if (n > 5000)
for (i = 0; i < n; i++)
{
x = (i + 0.5)/n;
area += 4.0/(1.0 + x*x);
}
pi = area / n;
Scope Rules
• Shared-memory programming model
– most variables are shared by default
• Global variables are shared
• But not everything is shared:
– loop index variables
– stack variables in functions called from a parallel region
• A variable that is set and then used within the for-loop is PRIVATE
• An array whose subscript is constant with respect to the PARALLEL for-loop, and that is set and then used within the for-loop, is PRIVATE
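A sketch illustrating these defaults (helper(), compute(), and global_data are hypothetical):

double global_data[1000];       /* global: shared by all threads */

void helper(void)
{
    double temp = 0.0;          /* stack variable in a called function: private */
    (void)temp;
}

void compute(int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)     /* loop index of the parallel for: private */
    {
        double t = global_data[i];  /* declared inside the region: private */
        helper();
        global_data[i] = t * t;     /* writes to distinct elements: safe */
    }
}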
Scope Clauses
• The for/DO directive has extra clauses, the most important being:
– PRIVATE (variable list)
– REDUCTION (op: variable list)
• op is sum, min, or max
• the variable is a scalar; XLF also allows arrays
Scope Clauses (2)
• PARALLEL, PARALLEL for/DO, and PARALLEL SECTIONS also have:
– DEFAULT (PRIVATE | SHARED | NONE)
• scope determined by the chosen rule
– SHARED (variable list)
– IF (scalar logical expression)
• Directives are like a programming-language extension, not a compiler option
integer i, j, n
real*8 a(n,n), b(n)

read (1) b

!$OMP PARALLEL DO
!$OMP& PRIVATE (i,j) SHARED (a,b,n)
do j = 1, n
  do i = 1, n
    a(i,j) = sqrt(1.d0 + b(j)*i)
  end do
end do
!$OMP END PARALLEL DO
Matrix Multiply
!$OMP PARALLEL DO PRIVATE(i,j,k)
do j = 1, n
  do i = 1, n
    do k = 1, n
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do
Analysis
• Outer loop is parallel: columns of c
• Not optimal for cache use
• Could put more directives on each loop
• But then the granularity might be too fine
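For comparison, a C (row-major) sketch of the same multiply: parallelizing the outer loop with the i-k-j ordering keeps the innermost accesses contiguous in memory, which is friendlier to the cache (this assumes n x n matrices and that c is zeroed beforehand):

/* c, a, b are n x n row-major matrices; c starts zeroed */
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++)
        for (j = 0; j < n; j++)
            c[i][j] += a[i][k] * b[k][j];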
OMP Functions
• int omp_get_num_procs()
• int omp_get_num_threads()
• int omp_get_thread_num()
• void omp_set_num_threads(int)
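A short sketch exercising all four routines:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Request one thread per processor */
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();       /* this thread's id, 0..N-1 */
        int nthreads = omp_get_num_threads(); /* team size inside the region */
        printf("thread %d of %d\n", tid, nthreads);
    }
    return 0;
}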
OpenMP Environment
Variables
• OpenMP parallelism may be controlled via
environment variables
– OMP_NUM_THREADS
• Sets number of threads in parallel sections
– OMP_DYNAMIC
• When set to TRUE, allows the runtime to dynamically adjust the number of threads
– OMP_NESTED
• When set to TRUE, enables nested parallelism
– OMP_SCHEDULE
• Controls the scheduling assignment
• Example: export OMP_SCHEDULE="static,4"
Fortran Parallel Directives
• PARALLEL / END PARALLEL
• PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS
• DO / END DO
– work-sharing directive for the DO loop immediately following
• PARALLEL DO / END PARALLEL DO
– combined parallel region and work sharing
Pthread Translation