4b. Data Parallel Languages Part B

DATA PARALLEL
LANGUAGES
(Chapter 4b)
multiC,
Fortran 90, and HPF
The MultiC Language
References:
[20] “The multiC Programming Language”, Preliminary Documentation, WaveTracer, PUB-00001-001-00.80, Jan. 1991.
[21] “The multiC Programming Language”, User Documentation, WaveTracer, PUB-00001-001-1.02, June 1992.
Note: This presentation is based on the 1991 manual unless otherwise noted; “manuals” refers to both versions.
• MultiC is the language used on the WaveTracer and the Zephyr SIMD computers.
• The Zephyr is a second generation WaveTracer, but was
never commercially available.
– We were given 10 Zephyrs and several other incomplete Zephyrs to use for spare parts.
• A MultiC++ was designed for their third generation computer, but neither was released.
• Both MultiC and a parallel language designed for the
MasPar are fairly similar to an earlier parallel language
called C*.
– C* was designed by Guy Steele for the Connection
Machine.
– All are data parallel and extensions of the C
language
• An assembler was also written for the WaveTracer (and
probably the Zephyr).
– It was intended for use only by company
technicians.
– Information about the assembler was released to WaveTracer customers on a “need to know” basis.
– No manual was distributed, but some details were recorded in a short report.
– Professor Potter was given some details needed to put the ASC language on the WaveTracer.
• MultiC is an extension of ANSI C, as documented in the following book:
– The C Programming Language, Second Edition, 1988, Kernighan & Ritchie.
• The WaveTracer computer is called a Data Transport Computer (DTC) in the manuals
– a large amount of data can be moved in parallel
using interprocessor communications.
• Primary expected uses for the WaveTracer were scientific modeling and scientific computation:
– acoustic waves
– heat flow
– fluid flow
– medical imaging
– molecular modeling
– neural networks
• The 3-D applications are supported by a 3D mesh on the WaveTracer
– Done by sampling a finite set of points (nodes)
in space.
WaveTracer Architecture Background
• The architecture of the Zephyr is fairly similar
– Exceptions will be mentioned whenever known
• Each board has 4096 bit-serial processors, which
can be connected in any of the following ways:
– 16x16x16 cube in 3D space
– 64x64 square in 2D space
– 4096 array in 1D space
• The 3D architecture is native on the WaveTracer; the other networks are supported primarily using the 3D hardware
– The Zephyr probably has a 2D network and
only simulates the more expensive 3D network
using system software.
• WaveTracer was available in 1, 2, or 4 boards,
arranged as follows:
– 2 boards were arranged as a 16x32x16 cube
• one cube stacked on the top of another cube
• 8192 processors overall
WaveTracer Architecture (Cont)
– Four boards are arranged as a 32x32x16 cube
• 16,384 processors
• Arranged as two columns of stacked cubes
• The computer supports automatic creation of virtual processors and of the network connections that connect these virtual processors.
– If each processor supports k nodes, this slows
down execution speed by a factor of k
• Each processor performs each operation k
times.
• Limited by the amount of memory required
for each virtual node
• In practice, slowdown is usually less than k
• The set of virtual processors supported by a
physical processor is called its territory.
Specifiers for MultiC Variables
• Any datatype in C except pointers can be declared to be a parallel variable using the multi specifier.
• This replicates the data object for each processor to produce a 1-, 2-, or 3-dimensional data object.
• In a parallel execution, all multi objects must have
the same dimension.
• The multi declaration follows the same format as ANSI C, e.g.,
multi int imag, buffer;
• The uni declaration is used to declare a scalar
variable
– It is the default and need not be written.
– The following are equivalent:
uni int ptr;
int ptr;
• Bit-Length Variables
– Can be of type uni or multi
– Allow the user to save memory
– All operations can be performed on these bit-length values
– Example: A 2 color image can be declared by
multi unsigned int image :1;
and an 8 color image by
multi unsigned int picture:3;
Some Control Flow Commands
• For uni type data structures, control flow in MultiC
is identical to that in ANSI C.
• The parallel IF-ELSE Statement
– As in ASC, both the IF and ELSE portions of the code are executed.
– As with ASC, the IF is a mask-setting operation
rather than a branching command
– FORMAT: Same as for C
– WARNING: In contrast to ASC, both sets of statements are executed.
• Even if no responders are active in one part, the sequential (uni) commands in that part are still executed.
• Example: count = count + 1;
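• A minimal sketch of these semantics (the variable names are illustrative, not from the manuals):
      multi int x;
      uni int count = 0;
      if (x > 0)             /* mask-setting: PEs with x > 0 respond */
          x = x - 1;         /* executed by those responders */
      else                   /* the remaining PEs become active */
          count = count + 1; /* uni statement: executes even if the
                                else part has no responders */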
• The parallel WHILE statement
– The format used is
while(condition)
– The repetition continues as long as ‘condition’ is satisfied by one or more responders.
– Only those responders (i.e., the ones that satisfy ‘condition’ prior to this pass through the body of the ‘while’) are active during the execution of the body.
Other Commands
• Jump Statements
– goto, return, continue, break
– These commands are in conflict with structured
programming and should be used with restraint.
• Parallel Reduction Operators
      *=     Accumulative Product
      /=     Reciprocal Accumulative Product
      +=     Accumulative Sum
      -=     Negate & then Accumulative Sum
      &=     Accumulative bitwise AND
      |=     Accumulative bitwise OR
      >?=    Accumulative Maximum
      <?=    Accumulative Minimum
• Each of the above reduction operators returns a uni value and provides a powerful arithmetic operation.
– Each accumulative operation would otherwise
require one or more ANSI C loop constructs.
– Example: If A is a multi data type,
      largest_value = >?= A;
      smallest_value = <?= A;
• Data Replication
– Example:
      multi int A = 0;
      ...
      A = 2;
– The first statement stores 0 in every A field (at compile time).
– The last statement stores 2 in the A field of every active PE.
– [Figure: coordinate axes, showing the X, Y, and Z directions from the origin O]
• Interprocessor Communications
– Operators have the form
      [dx; dy; dz]m
– This operator can shift the components of the multi variable m of all active processors along one or more coordinate dimensions.
– Example:
      A = [-1; 2; 1]B
• Causes each active processor to move the data in its B field to the A field of the processor at the following location:
– one unit in the negative X direction
– two units in the positive Y direction
– one unit in the positive Z direction
– Conventions:
• If the value of the dz operator is not specified, it is assumed to be 0.
• If the values of the dy and dz operators are not specified, both are assumed to be 0.
• Example: [x; y]V is the same as [x; y; 0]V
– Inactive processor actions:
• An inactive processor does not send its data to another processor.
• It does participate in moving the data from other processors along.
– Transmission of data occurs in lock step (SIMD fashion) without congestion or buffering.
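• As a hedged illustration of the shift operator (the variables are assumed, not from the manuals), a three-point average along the X axis can be written as:
      multi float u, avg;
      /* each active PE combines its own value of u with the
         values held by its two neighbors along the X axis */
      avg = ([-1; 0; 0]u + u + [1; 0; 0]u) / 3.0;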
• Coordinate Functions
– Used to return a coordinate for each active
virtual processor.
– Format: multi_x(), multi_y(), and multi_z()
– Example:
if (multi_x() == 0 && multi_y() == 2 && multi_z() == 1)
    u = += A;
• Note that all processors except the one at (0,2,1) are inactive within the body of the IF.
• The accumulated sum of the active components of the multi variable A is just the value of the component of A at processor (0,2,1).
• The effect of this example is to store the value of A at (0,2,1) in the uni variable u.
• If the second command in the example is changed to
A = u;
the effect is to store the contents of the uni variable u
into multi variable A at location (0,2,1).
• (see manual pages 11-13 and 11-14 for more details)
• Arrays
– Multi-pointers are not supported.
• Cannot have a parallel variable containing a pointer in each component of the array.
– uni pointers to multi-variables are allowed.
– Array Examples:
int array_1 [10];
int array_2 [5][5];
multi int array_3 [5];
• array_1 is a 1 dimensional standard C array
• array_2 is a 2 dimensional standard C array
• array_3 is a 1-dimensional array of multi variables
• MULTI_PERFORM Command
– Command gives the size of each dimension of
all multi-values prior to calling for a parallel
execution.
– Format:
multi_perform(func, xsize, ysize, zsize)
• Here, “func” is the function being executed.
• “xsize”, “ysize”, “zsize” are positive integers
specifying the DTC network configuration.
• If “zsize” is 1, then multi_perform creates a 2D grid of size “xsize × ysize”
– multi_perform is normally called within the main program.
• It usually calls a subroutine that includes all of the
– parallel work
– parallel I/O
– The main program usually includes
• opening and closing of files
• some of the scalar I/O
• define and include statements
– When multi_perform is called, it initializes any
extern and static multi objects
– In the previous example, multi_perform calls
func. After func returns, the multi space created
for it becomes undefined.
– The perror function is extended to print error
messages corresponding to errno numbers
resulting from the execution of MultiC
extensions.
• Has the following format
if(multi_perform(func,x,y,z)) perror(argv[0]);
• See usage in the examples in Appendix A
• More information on page 11-2 of manual
• Examples in Manual
– Many examples in the manual
– 17 in appendices alone
– Also stored under exname.mc in the MultiC package
• They can be compiled and executed.
The AnyResponder Function
• Code Segment for Tallying Responders
      unsigned int num_short, num_tall; /* "short" is a C keyword,
                                           so other names are used */
      multi float height;
      load_height();  /* assigns value in inches to height */
      if (height >= 70)
          num_tall = += (multi int)1;
      else
          num_short = += (multi int)1;
      printf("There are %d tall people\n", num_tall);
• Comments on Code Segment
– Note that the construct
      += (multi int)1
counts the active PEs (i.e., responders).
– This technique avoids setting up a bit field to use to tally active PEs.
• Instead, it sets up a temporary multi variable.
– Can be used to see whether there is at least one responder at any given time.
• Check whether the resulting sum is positive.
– Provides a technique to define the AnyResponder function needed for associative programming, as sketched below.
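• Following this recipe, a hedged sketch of such a function (its name and exact form are assumptions, not taken from the manuals):
      /* returns a nonzero uni value iff at least one PE is active */
      uni int AnyResponder(void)
      {   uni int n;
          n = += (multi int)1;   /* count the active PEs */
          return (n > 0);
      }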
Accessing Components from Multi Variables
• Code from pages 11-13 and 11-14 of the MultiC manuals
#include <multi.h> /* includes multi library */
#include <stdlib.h>
#include <stdio.h>
void work(void)
{   uni int a, b, c, u;
    multi int n;
    /* Code goes here to assign values to n */
    /* Code goes here to assign values to a, b, c */
    if (multi_x() == a && multi_y() == b
        && multi_z() == c)
        u = += n;
        /* Assigns the value of n at PE(a,b,c) to u */
}
int main(int argc, char *argv[])
{   if (multi_perform(work, 7, 7, 7))
        perror(argv[0]);
    exit(EXIT_SUCCESS);
}
• To place a value of 5 into the selected location, replace the line “u = += n” with the line
      n = 5;
• The capability to read or place a value in a parallel variable at a selected position is essential for MultiC to execute associative programs.
The oneof and next Functions
• Function oneof provides a way of selecting one out of several active processors
– Defined in the Multi Struct program (A.15) in the manual
– The procedure is essential for associative programming.
• Code for oneof:
multi unsigned oneof(void):1
{   /* Store the coordinate values in multi
       variables x and y */
    multi unsigned x = multi_x(),
                   y = multi_y(),
                   uno:1 = 0;
    /* Next select the processor with the highest
       coordinate value */
    if (x == >?= x)
        if (y == >?= y)
            uno = 1;
    return uno;
}
• Note that the multi variable uno stores a 1 for exactly one processor, and all the other components of uno store a 0.
• The function oneof can be used by another procedure which is called by multi_perform.
– An example of oneof being called by another
procedure is given on pages A46-50 of the manuals.
– Should be usable in the form
      if (oneof()) /* Check to see if an active responder exists */
• If we assign
      a = >?= x; b = >?= y; c = >?= z;
then (a, b, c) stores the location of the PE selected by oneof.
• The preceding procedure assumed a 2D configuration of processors with z = 1.
– If configuration is 3D, the process of selecting
the coordinates can be continued by also
selecting the highest z-coordinate.
• Stepping through the active PEs (i.e., next)
– Provides the MultiC equivalent of the ASC next
command
– An additional one-bit multi integer variable
called bi (for “busy-idle”) is needed.
– First set bi to zero
– Activate the PEs you wish to step through.
– Next, have the active PEs write a 1 into bi.
– Use
if(oneof())
to restrict the mask to one of the active PEs.
– Perform all desired operations with the active PE.
– Have the active PE set its bi value to 0 and then exit the preceding if statement.
– Use the += (accumulative sum) operator to see if any PEs remain to be processed.
• If so, return to the step above that calls oneof.
• This return can be implemented using a while loop, as in the sketch below.
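• A hedged sketch of the above recipe (the selection condition and the per-PE work are placeholders):
      multi int key;            /* illustrative data field */
      multi unsigned bi:1 = 0;  /* the busy-idle bit */
      if (key > 0)              /* activate the PEs to step through */
          bi = 1;
      while (bi)                /* the parallel while itself tests
                                   whether any PEs remain */
      {   if (oneof())          /* restrict the mask to one active PE */
          {   /* ... perform all desired operations here ... */
              bi = 0;           /* mark the selected PE as done */
          }
      }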
Sequential Printing of Multi Variable Values
• Example: Print a block of the image 2D bit array.
– A function select_int is used which returns the value of image at the specified (x,y,z) coordinate.
– The printing occurs in two nested loops which
• increment the value of x from 0 to some specified constant.
• increment the value of y from 0 to some specified constant.
– This example is from page 8-1 of the manuals
and is used in an example on pgs A16-18 of
1991 manual and pgs A12-14 of 1992 manual.
– The select_int function
int select_int(multi int *mptr, int x, int y, int z)
/* Here, mptr is a uni pointer to a multi variable */
{   int r;
    if (multi_x() == x &&
        multi_y() == y &&
        multi_z() == z)
        /* Restricts scope to the one PE at (x,y,z) */
        r = |= *mptr;
        /* The OR reduction operator transfers the binary value
           of the multi variable at (x,y,z) to the uni variable */
    return r;
}
– The two loops to print a block of values of the image multi variable:
for (y = 0; y < ysize; y++)
{   for (x = 0; x < xsize; x++)
        printf("%d", select_int(&image, x, y, z));
    printf("\n");
}
• The above technique can be adapted to print or read multi variables or parts of multi variables.
– Efficient as long as the number of locations accessed is small.
• If I/O operations involve a large amount of data, the more efficient data transfer functions described in the manuals (Chapter 8 and Sections 11.2 and 11.13) should be used.
• The functions multi_fread and multi_fwrite are analogous to fread and fwrite in C. Information about them is given on pages 11-1 to 11-4 of the manuals.
Moving Data between Uni Arrays and Multi
Variables
• The following functions allow the user to move
data between “uni” arrays and multi variables:
multi_from_uni ...
multi_to_uni ...
– The above “…” may be replaced with a data
type such as
• char
• short
• int
• long
• float
• double
• cfloat
• cdouble
• These functions are illustrated in several of the
examples.
Compiling and Executing Programs on the Zephyr
• A 4k Zephyr machine is available for use in the
Parallel and Associative Computing Lab.
• It is presently connected to a Windows 2003 Server, which supports remote desktop for interactive use. However, you may also use the computer directly at the console while the lab is open.
• Visual Studio 2002 has been installed on the server.
The MultiC language uses a “compiler wrapper” to
translate MultiC code into Visual C code.
• Programming the Zephyr on a Windows 2003 system is similar to using command-line programming tools in UNIX.
– You can edit your program using “Edit” or
“Notepad”
– You can compile and create an executable using
“nmake”
– You can execute your program using the Visual
Studio Command Shell
• This is a special DOS shell that has extra
path, include, and library environment
variables used by the compiler and linker.
Compiling and Executing Programs on the Zephyr (Cont)
• Login or use Remote Desktop Connection to
zserver.cs.kent.edu
– From Windows XP choose Start | Programs |
Accessories | Communications | Remote
Desktop Connection
– Enter your login name and password and click
on “OK”
• Open a command window and run the DTC Monitor program
– Type dtcmonitor at the command prompt.
– This is a “daemon” program that serializes and
controls executables using the Zephyr.
– When this is 100% complete, you can then execute programs on the Zephyr.
– You can minimize this command shell.
– Important: When you are finished, enter CTRL-C to end the dtcmonitor.
• Create a folder on your desktop for programs. You
can copy the example Zephyr MultiC program from
D:\Common\zephyrtest to your local folder and
rename it for your programming assignment.
Compiling and Executing Programs on the Zephyr (Cont)
• Create or edit your MultiC program using DOS edit
or Windows Notepad.
– From the Visual Studio Command Shell type
• edit anyprog.mc
• notepad anyprog.mc
– Make sure that the file extension is .mc
– Save your work before compiling
• Modify the makefile template and change the
names of the MultiC file and object file to those
used in your programming assignment.
• Compile and link your program by typing
– nmake /f anyprog.mak
– nmake (for the default Makefile)
• Execute your program by typing the name of your
executable at the command prompt.
• When you are finished enter CTRL-C to end the
dtcmonitor.
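• In summary, a typical session looks like the following (program names are placeholders):
      dtcmonitor               (start the daemon; leave it running)
      notepad anyprog.mc       (create or edit the MultiC source)
      nmake /f anyprog.mak     (compile and link)
      anyprog                  (run the executable on the Zephyr)
      CTRL-C                   (in the dtcmonitor window, when finished)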
OMIT FOR PRESENT
(Multi-C Recursion)
• It is possible to write recursive “multi” functions in multiC, but you have to test whether there are active PEs still working.
• Consider the following multiC function
multi int factorial( multi int n )
{
    multi int r;
    if( n != 1 )
        r = factorial( n-1 ) * n;
    else
        r = 1;
    return( r );
}
• What happens? Since both branches of a parallel IF are executed, factorial calls itself even after every PE is finished, so the recursion never terminates. The corrected version on the next slide first tests whether any PEs are still active.
OMIT FOR PRESENT
(MultiC Recursion Example)
• Recursion
multi int factorial( multi int n )
{
    multi int r;
    /* stop calculating if every component has
       been computed */
    if( ! |= (multi int) 1 )
        return( (multi int) 0 );
    /* otherwise, continue calculating */
    if( n > 1 )
        r = factorial( n-1 ) * n;
    else
        r = 1;
    return( r );
}
Fortran 90 and HPF
(High Performance Fortran)
A de facto standard for
scientific and engineering
computations
Fortran 90 AND HPF
• References:
– [19] Ian Foster, Designing and Building Parallel
Programs, (online copy), chapter 7.
– [8] Jordan and Alaghband, Fundamentals of
Parallel Processing, Section 3.6.
• Recall that data parallelism refers to the concurrency that occurs when the same operation is executed on some or all elements of a data set.
• A data parallel program is a sequence of such
operations.
• Fortran 90 (or F90) is a data-parallel programming language.
– Some job control algorithms cannot be expressed in a data parallel language.
• F90’s array assignment statement and array
functions can be used to specify certain types of
data parallel computation.
• F90 forms the basis of HPF (High Performance
Fortran) which augments F90 with a small set of
extensions.
• In F90 and HPF, the (parallel) data structures operated on are restricted to arrays.
– E.g., data types such as trees, sets, etc. are not
supported.
– All array elements must be of the same type.
– Fortran arrays can have up to 7 dimensions.
• Parallelism in F90 can be expressed explicitly, as in
the array assignment statement
A = B*C ! A, B, C are arrays
• Compilers may be able to detect implicit
parallelism, as in the following example
– do i = 1,m
      do j = 1,n
          A(i,j) = B(i,j) * C(i,j)
      enddo
  enddo
– Parallel execution of the above code depends on the fact that the loop iterations are independent
• i.e., one iteration does not write a variable that another iteration reads or writes.
• Compilation can also introduce communications
operations when the computation mapped to one PE
requires data mapped to another PE.
• Communication operations in F90 (and HPF) are
inferred by the compiler and do not need to be
specified by the programmer.
– These are derived by the compiler from the data
decomposition specified by the programmer.
• F90 allows a variety of scalar operations (i.e.,
defined on a single value) to be applied to an entire
array.
• All of F90’s unary and binary operations can be applied to arrays as well, as illustrated in the examples below:
real A(10,20), B(10,20), c
logical L(10,20)
A=B+c
A = A + 1.0
A = sqrt(A)
L = A .EQ. B
• The function of the mask is handled in F90 by the
where statement, which has two forms.
– The first form uses the where to restrict the array elements on which an assignment is performed
– For example, the following replaces each nonzero entry of X with its reciprocal:
      where (X /= 0) X = 1.0/X
– The second form of the where is block
structured and has the form
where (mask-expression)
array_assignment
elsewhere
array_assignment
end where
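– For example, the following hedged sketch (assuming conformable real arrays A and B) takes the square root of the positive entries of A and zeros the rest:
      where (A > 0.0)
          B = sqrt(A)
      elsewhere
          B = 0.0
      end where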
Some F90 Array Intrinsic Functions
• The array intrinsic functions below assume a vector version of an array is formed using “column major” ordering
• Some F90 array intrinsic functions:
– RESHAPE(A,...) converts array A into a new array
with specified shape and “fill”
– PACK(A, MASK, FILL) forms a vector from
masked elements of A, using “fill” as needed.
– UNPACK(A,MASK, FILL) replaces masked
elements with elements from FILL vector
– MERGE(A, B, MASK) returns array of masked A
entries and unmasked entries of B
– SPREAD(A, DIM, N) replicates array A, using N copies, to form a new array of one larger dimension
– CSHIFT(A, SHIFT, DIM) circular shift of the elements of A along the given dimension
– EOSHIFT(A,...) elements of A are shifted off the end along the specified dimension, with vacated positions filled from either a specified scalar or an array of dimension one less than A
– TRANSPOSE(A) returns transpose of array A.
• Some array intrinsic functions that perform computation:
– MAXVAL(A) returns the maximum value of A
– MINVAL(A) returns the minimum value of A
– SUM(A) returns the sum of the elements of A
– PRODUCT(A) product of elements of A
– MAXLOC(A) indices of the max value in A
– MINLOC(A) indices of the min value in A
– MATMUL(A,B) matrix multiplication A*B
– DOT_PRODUCT(A,B) vector dot product
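• A small worked illustration of a few of these intrinsics (the values in the comments follow from column-major ordering):
      real    :: A(2,3)
      integer :: loc(2)
      A   = reshape((/1., 2., 3., 4., 5., 6./), (/2, 3/))
      print *, SUM(A)        ! prints 21.0
      loc = MAXLOC(A)        ! loc = (/2, 3/), the indices of 6.0
      print *, TRANSPOSE(A)  ! the 3x2 transpose of A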
The HPF Data Distribution Extension
• Reference: [19] Ian Foster, Designing and Building
Parallel Programs, (online copy), chapter 7.
• F90 array expressions specify opportunities for parallel execution, but provide no control over how these are performed so that communication is minimized.
• HPF handling of data distribution involves three
directives
– The PROCESSORS directive specifies the shape and size of the array of abstract processors.
– The ALIGN directive is used to align elements
of different arrays with each other, indicating
that they should be distributed in the same
manner.
– The DISTRIBUTE directive is used to
distribute an object (and all objects aligned with
it) onto an abstract processor array.
• The data distribution directives can have a major
impact on a program’s performance (but not on the
results computed), affecting
– Partitioning of data to processors
– Agglomeration – considering the value of combining tasks to produce fewer, larger tasks.
– Communications required to coordinate task
execution.
– Mapping of tasks to processors
HPF Data Distribution (Cont.)
• Data distribution directives are recommendations to an HPF compiler, not instructions.
– Compiler can ignore them if it determines that
this will improve performance.
• PROCESSORS directive
– Creates an arrangement of abstract processors and gives this arrangement a name.
– Example: !HPF$ PROCESSORS P(4,8)
– Normally one abstract processor is created for
each physical processor.
• There could be more abstract processors
than physical ones.
• However, HPF does not specify a way of
mapping abstract to physical processors.
• ALIGN Directive
– Specifies array elements that should, if possible, be mapped to the same processor.
– Operations involving aligned data objects are likely to be more efficient, due to reduced communication costs when they are on the same PE.
• EXAMPLE:
real B(50), C(50)
!HPF$ ALIGN C(:) WITH B(:)
HPF Data Distribution (Cont.)
• ALIGN Directive (cont.)
– A “*” can be used to collapse dimensions (i.e., to match one element with many elements).
– Considerable flexibility is allowed in specifying which array elements are to be aligned.
which array elements are to be aligned.
• Dummy variables can be used for
dimensions
• Integer formulas to specify offsets.
– An ALIGN statement can be used to specify that elements of an array should be replicated over certain processors, as in the sketch below.
• Costly if replicated arrays are updated often.
• Replication increases communication or redundant computation.
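• For example, the following hedged sketch replicates each element A(i) wherever any element of row i of B resides:
      real A(100), B(100,100)
      !HPF$ ALIGN A(i) WITH B(i,*)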
• DISTRIBUTE Directive
– Indicates how data are to be distributed among
processor memories.
– Specifies for each dimension of an array one of
three ways that the array elements will be
distributed among the processors
      *          No distribution
      BLOCK(n)   Block distribution (default n = N/P)
      CYCLIC(n)  Cyclic distribution (default n = 1)
HPF Data Distribution (Cont.)
• DISTRIBUTE Directive (cont.)
– Block distribution divides the items/indices in
that dimension into equal-sized blocks of size
N/P.
– Cyclic distribution maps every Pth index to the
same processor.
– Applies not only to the named array but also to
any array that is aligned to it.
– The following DISTRIBUTE directive specifies a mapping for all three arrays.
!HPF$ PROCESSORS p(20)
real A(100,100), B(100,100), C(100,100)
!HPF$ ALIGN B(:,:) with A(:,:)
!HPF$ DISTRIBUTE A(BLOCK,*) ONTO p
HPF Concurrency
• The F90 array assignment statements provide a convenient way of specifying data parallel operations.
• However, this does not apply to all data parallel operations, as the array on the right-hand side must have the same shape as the one on the left-hand side.
• HPF provides two other constructs to exploit data parallelism, namely the FORALL statement and the INDEPENDENT directive.
• The FORALL Statement
– Allows more general assignments to sections of an array.
– The general form is
      FORALL (triplet, ..., triplet, mask) assignment
– Examples:
      FORALL (i=1:m, j=1:n) X(i,j) = i+j
      FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
• The INDEPENDENT Directive and Do-Loops
– The INDEPENDENT directive can be used to assert that the iterations of a do-loop can be performed independently, that is,
• they can be performed in any order
• they can be performed concurrently
– The INDEPENDENT directive must immediately precede the do-loop that it applies to.
– Examples of independent and non-independent do-loops are given in [19, Foster, pp. 258-9]; a sketch follows below.
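• A hedged sketch in that spirit: the assertion holds only under a condition the compiler cannot verify on its own:
      !HPF$ INDEPENDENT
      do i = 1, n
          A(idx(i)) = B(i)   ! independent only if idx(1:n)
                             ! contains no repeated values
      enddo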
Additional HPF Comments
• An HPF program typically consists of a sequence of calls to subroutines and functions.
• The data distribution that is best for a subroutine may be different from the data distribution used in the calling program.
• Two possible strategies for handling this situation are
a. Specify a local distribution using DISTRIBUTE and ALIGN, even if this requires expensive data movement on entering.
• The cost normally occurs on return as well.
b. Use whatever data distribution is used in the calling program, even if it is not optimal. This requires use of the INHERIT directive.
• Both F90 and HPF intrinsic functions (e.g., SUM, MAXVAL) combine data from entire arrays and involve considerable communication.
• Some other F90/HPF intrinsic functions, such as DOT_PRODUCT, involve communication cost only if their arguments are not aligned.
• Array operations involving the FORALL statement can result in communication if the computation of a value for an element A(i) requires data values that are not on the same processor (e.g., B(j)).