Programming Parallel Algorithms
- NESL
Guy E. Blelloch
Presented by:
Michael Sirivianos
Barbara Theodorides
Problem Statement
Why design a new language specifically for programming
parallel algorithms?
In the past 20 years there has been tremendous progress in
developing and analyzing parallel algorithms
Over the same period there has been less success in developing good
languages for programming parallel algorithms
There is a large gap between languages that are too low level
(details that obscure the meaning of the algorithm) and languages
that are too high level (making performance implications unclear)
NESL
Nested Data Parallel Language
Useful for teaching and implementing parallel algorithms.
Bridges the gap: allows high-level descriptions of parallel
algorithms but also has a straightforward mapping onto a
performance model.
Goals when designing NESL:
A language-based performance model that uses work
and depth rather than a machine-based model that uses
running time
Support for nested data-parallel constructs (ability to
nest parallel calls)
Analyzing performance
Processor-based models: Performance is calculated in terms of the
number of instruction cycles a computation takes (its running time), which is
a function of input size and number of processors
Virtual models: Higher level models that can be mapped onto various
real machines (e.g. PRAM - Parallel Random Access Machine)
Can be mapped efficiently onto more realistic machines by
simulating multiple processors of the PRAM on a single processor
of a host machine. Virtual models are easier to program.
Measuring performance:
Work & Depth
Work: the total number of operations executed by a computation
specifies the running time on a sequential processor
Depth: the longest chain of sequential dependencies in the
computation.
represents the best possible running time assuming an ideal machine with
an unlimited number of processors
Example: Summing 16 numbers using a balanced binary tree
How can work & depth be incorporated
into a computational model?
Circuit model
Designing a circuit of logic gates
In previous example, design a circuit in which the inputs
are at the top, each “+” is an adder circuit, and each of the
lines between adders is a bundle of wires.
Work = circuit size (number of gates)
Depth = longest path from an input to an output
How can work & depth be incorporated
into a computational model? (cont)
Vector Machine Models
VRAM is a sequential RAM extended with a set of
instructions that operate on vectors.
Each location in memory contains a whole vector
Vectors can vary in size during the computation
Vector instructions include element-wise operations
(e.g., adding corresponding elements)
Depth = number of vector instructions executed by the machine
Work = sum of the lengths of the vectors operated on
How can work & depth be incorporated
into a computational model? (cont)
Vector Machine Models Example
Summation tree code
Work = O ( n + n/2 + … ) = O (n)
Depth = O (log n)
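A sketch of the summation tree in NESL notation, following the recursive
formulation in Blelloch's paper (assuming the input length #a is a power
of two):
function sum(a) =
  if #a == 1 then a[0]
  else sum({a[2*i] + a[2*i+1] : i in [0:#a/2]});  % pairwise add, then recurse on the half-length sequence %
Each step halves the vector, so the vector lengths sum to n + n/2 + ... = O(n)
work, with one step per halving for O(log n) depth.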
How can work & depth be incorporated
into a computational model? (cont)
Language-Based Models
Specify the costs of the primitive instructions and a set of rules for
composing costs across program expressions.
Discuss the running time of the algorithms without introducing a
specific machine model.
Using work & depth: work & depth costs are assigned to each
function and scalar primitive of a language and rules are specified
for combining parallel and sequential expressions.
Roughly speaking, when executing a set of tasks in parallel:
work = sum of work of the tasks
depth = maximum of the depth of the tasks
Why Work & Depth?
Work & Depth: used informally for many years to describe the
performance of parallel algorithms
easier to describe
easier to think about
easier to analyze algorithms in terms of work & depth than in
terms of running time and number of processors (processor-based
model)
Why are models based on work & depth better than processor-based
models for programming and analyzing parallel algorithms?
Performance analysis is closely related to the code and code
provides a clear abstraction of parallelism.
Why Work & Depth? (cont)
To support this claim they consider Quicksort.
Sequential algorithm:
Average case: running time = O(n log n), depth of recursive calls = O(log n)
Parallel algorithm:
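A sketch following the NESL version in the paper, in which the subselections
and the two recursive calls all run in parallel:
function quicksort(a) =
  if (#a < 2) then a
  else
    let pivot   = a[#a/2];
        lesser  = {e in a | e < pivot};
        equal   = {e in a | e == pivot};
        greater = {e in a | e > pivot};
        result  = {quicksort(v) : v in [lesser, greater]};  % both calls execute in parallel %
    in result[0] ++ equal ++ result[1];
In the language-based model each level of recursion contributes constant
depth, so the expected depth is proportional to the O(log n) expected depth
of the recursion, while the expected work remains O(n log n).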
Quicksort (cont.)
Code and analysis based on a processor-based model:
Code will have to specify how the sequence is partitioned across
processors
how the subselection is implemented in parallel
how the recursive calls get partitioned among the processors
how the subcalls are synchronized
In the case of Quicksort, this gets even more complicated:
the recursive calls are not of equal sizes.
Work & Depth and running time
Running time T at the two limits:
Single processor: T = W (the work)
Unlimited number of processors: T = D (the depth)
For a given number of processors P we can place upper and lower bounds:
W/P <= T <= W/P + D
valid under assumptions about communication and scheduling costs.
e.g., given memory latency L:
W/P <= T <= W/P + L*D
Communication among processors is not unit-time, so D is multiplied by a
latency factor. Bandwidth is not taken into account; in the case of
significantly different bandwidth, W should be divided by a large bandwidth
factor and D by a small one.
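As a hypothetical illustration of the basic bound (the numbers are invented
for the example): take W = 10^8 operations, D = 10^3, and P = 10^4 processors.
Then
W/P = 10^4, so 10^4 <= T <= 10^4 + 10^3 = 1.1 * 10^4
i.e., the depth term adds at most 10% overhead, and it stays negligible as
long as W/P is much larger than D (enough parallel slackness per processor).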
Work & Depth and running time (cont)
Communication Bounds
Work & depth do not take into account communication costs:
latency: time between making a remote request and receiving the reply
bandwidth: rate at which a processor can access memory
Latency can be hidden.
Each processor has multiple parallel tasks (threads) to execute
and therefore has plenty to do while waiting for replies
Bandwidth cannot be hidden. While a processor is waiting for a data
transfer to complete, it cannot perform other operations, and
therefore remains idle.
Nested Data-Parallelism and NESL
Data-Parallelism: the ability to operate in parallel over sets of data
Data-Parallel Languages or Collection-Oriented Languages:
languages based on data-parallelism. Can be either flat or nested
Importance of nested parallelism:
Used to implement nested loops and divide-and-conquer algorithms in
parallel
Existing languages, such as C, do not have direct support for such nesting!
NESL
Is a nested data-parallel language.
Designed to express nested parallelism in a simple way with a
minimal set of constructs
NESL
Supports data-parallelism by means of operations on sequences
Apply-to-each construct which uses a set-like notation
e.g. {a * a : a in [3, -4, -9, 5]};
Used over multiple sequences. {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};
Ability to subselect elements of a sequence based on a filter.
e.g. {a * a : a in [3, -4, -9, 5] | a > 0};
Any function may be applied to each element of a sequence
e.g. {factorial(i) : i in [3, 1, 7]};
Provides a set of functions on sequences, each of which can be
implemented in parallel (sum, reverse, write)
e.g. write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);
Nested parallelism: allows sequences to be nested and parallel
functions to be used in an apply-to-each.
e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]};
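Here the outer apply-to-each over the three subsequences and each inner
sum all run in parallel; the expression evaluates to [5, 20, 7].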
The Performance Model
Defines work & depth in terms of the work and depth of the primitive
operations, and rules for composing the measures across expressions.
In most cases: W(e1 + e2) = 1 + W(e1) + W(e2), where the ei are expressions.
A similar rule is used for the depth.
Rules
apply-to-each expression:
if expression:
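For these two constructs, the composition rules (reconstructed here from the
paper; W(e) and D(e) denote the work and depth of evaluating expression e) are:
W({e1 : a in e2}) = 1 + W(e2) + (sum over the elements a of e2 of W(e1))
D({e1 : a in e2}) = 1 + D(e2) + (max over the elements a of e2 of D(e1))
W(if e1 then e2 else e3) = 1 + W(e1) + W(taken branch)
D(if e1 then e2 else e3) = 1 + D(e1) + D(taken branch)
i.e., parallel work adds, parallel depth takes the maximum, and sequential
composition adds both.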
The Performance Model (cont)
Example: Factorial
Consider the evaluation of the expression:
e = {factorial(n) : n in a} where a = [3, 1, 5, 2].
function factorial(n) =
if (n == 1) then 1
else n*factorial(n-1);
Using the rules for work and depth:
where W==, W*, and W- each have cost 1.
The two unit constants come from the cost of the function call and the
if-then-else statement.
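A rough reconstruction of the resulting analysis (the exact constants are
assumptions): each call to factorial(n) performs one ==, one *, one -, plus
the two unit constants, so W(factorial(n)) and D(factorial(n)) are both O(n),
the recursive calls being sequential. Applying the apply-to-each rule to e:
W(e) = 1 + W(a) + W(factorial(3)) + W(factorial(1)) + W(factorial(5)) + W(factorial(2))
D(e) = 1 + D(a) + max(D(factorial(3)), D(factorial(1)), D(factorial(5)), D(factorial(2)))
so the work is proportional to the sum of the elements of a (11) and the
depth to the largest element (5).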
Examples of Parallel Algorithms in
NESL
Principles:
An important aspect of developing a good parallel algorithm is
designing one whose work is close to the time for a good sequential
algorithm that solves the same problem.
Work-efficient: Parallel algorithms are referred to as work-efficient
relative to a sequential algorithm if their work is within a constant
factor of the time of the sequential algorithm.
Examples of Parallel Algorithms in
NESL (cont)
Primes
Sieve of Eratosthenes:
1 procedure PRIMES(n):
2   let A be an array of length n
3   set all but the first element of A to TRUE
4   for i from 2 to sqrt(n)
5   begin
6     if A[i] is TRUE
7     then set all multiples of i up to n to FALSE
8   end
Line 7 is implemented by looping over the multiples, so the
algorithm takes O(n log log n) time.
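The O(n log log n) bound follows from counting the crossed-out multiples (a
standard fact, stated here for completeness):
sum over primes p <= sqrt(n) of n/p = O(n log log n)
since, by Mertens' theorem, the sum of the reciprocals of the primes up to m
grows as log log m.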
Examples of Parallel Algorithms in
NESL (cont)
Primes (parallelized)
Parallelize the line “set all multiples of i up to n to FALSE”
multiples of a value i can be generated in parallel by [2*i:n:i]
and can be written into the array A in parallel with the write function
The depth of this algorithm is O (sqrt(n)), since each iteration of the
loop has constant depth and there are sqrt(n) iterations.
The total number of multiples written is the same as in the sequential
version.
Since it performs the same number of operations, the work is the same:
O(n log log n).
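A minimal sketch of one iteration of the parallelized step, assuming flags is
the length-n boolean array playing the role of A:
flags = write(flags, {(m, false) : m in [2*i:n:i]});  % mark all multiples of i composite at once %
The range [2*i:n:i] generates 2i, 3i, ... up to n in parallel, and write marks
them all composite in constant depth.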
Examples of Parallel Algorithms in
NESL (cont)
Primes: Improving depth
If we are given all the primes from 2 up to sqrt(n), we can generate
all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}
function primes(n) =
  if n == 2 then ([] int)
  else
    let sqr_primes = primes(isqrt(n));
        composites = {[2*p:n:p] : p in sqr_primes};
        flat_comps = flatten(composites);
        flags      = write(dist(true, n), {(i, false) : i in flat_comps});
        indices    = {i in [0:n]; fl in flags | fl}
    in drop(indices, 2);
Examples of Parallel Algorithms in
NESL (cont)
Primes: Improving depth
Analysis of Work & Depth:
Work: clearly most of the work is done at the top level of recursion,
which does O (n log log n) work, and therefore the total work is
O (n log log n)
Depth: since each recursion level has constant depth, the total depth is
proportional to the number of levels. The number of levels is O(log log n)
(the size of the problem at the ith level of recursion is n^(1/2^i)), and
therefore the depth is O(log log n).
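Solving for the number of levels d explicitly (a small check of the claim):
n^(1/2^d) = 2  =>  (log2 n) / 2^d = 1  =>  2^d = log2 n  =>  d = log2 log2 n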
This algorithm remains work-efficient and greatly improves the depth.
Examples of Parallel Algorithms in
NESL (cont)
Sparse Matrix Multiplication
Sparse matrices: most of their elements are zero
Representation in NESL:
A = [  2.0  -1.0     0     0
      -1.0   2.0  -1.0     0
         0  -1.0   2.0  -1.0
         0     0  -1.0   2.0 ]
A = [[(0, 2.0), (1, -1.0)],
[(0, -1.0), (1, 2.0), (2, -1.0)],
[(1, -1.0), (2, 2.0), (3, -1.0)],
[(2, -1.0), (3, 2.0)]]
E.g., multiply a sparse matrix A with a dense vector x.
The product Ax in NESL is: {sum({v * x[i] : (i,v) in row}) : row in A};
Let n be the number of nonzero elements in a row; then for that row
the depth of the computation = the depth of the sum = O(log n)
the work = the sum of the work across the elements = O(n)
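As a usage sketch (the vector is invented for illustration), multiplying the
example matrix A above by an all-ones vector:
x = [1.0, 1.0, 1.0, 1.0];
{sum({v * x[i] : (i,v) in row}) : row in A};  % one dot product per row, all rows in parallel %
which evaluates to [1.0, 0.0, 0.0, 1.0], matching Ax computed by hand.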
Examples of Parallel Algorithms in
NESL (cont)
Planar Convex Hull
Problem: Given n points in the plane, find which of them
lie on the perimeter of the smallest convex region that
contains all points.
An example of nested parallelism for divide-and-conquer
algorithms.
Quickhull algorithm (similar to Quicksort):
The strategy is to pick a pivot element, split the data based
on the pivot, and recurse on each of the split sets.
Worst-case performance is O(n^2) and the worst-case depth
is O(n).
Examples of Parallel Algorithms in
NESL (cont)
hsplit(set, A, P) and hsplit(set, P, A): split the points along the line
through the two extreme points A and P
cross_product(p, (A, P)): determines which side of the line A-P each
point p lies on
pm: the point farthest from the line A-P, used as the new pivot
Recursively: hsplit(set', A, pm) and hsplit(set', pm, P)
Elements below the line are ignored (dropped from the recursion)
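A sketch of hsplit in NESL, reconstructed from the version in the paper
(cross_product, plusp, max_index, and flatten are the helpers the paper uses;
treat this as an approximation rather than verbatim code):
function hsplit(points, p1, p2) =
  let cross  = {cross_product(p, (p1, p2)) : p in points};
      packed = {p in points; c in cross | plusp(c)};  % keep points on the hull side of the line %
  in if (#packed < 2) then [p1] ++ packed
     else
       let pm = points[max_index(cross)];  % farthest point from the line p1-p2 %
       in flatten({hsplit(packed, fr, to) : (fr, to) in [(p1, pm), (pm, p2)]});
The filter drops points that cannot be on the hull, and the two recursive
calls in the apply-to-each run in parallel.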
Examples of Parallel Algorithms in
NESL (cont)
Performance analysis of Quickhull:
Each recursive call has constant depth and O(n) work.
However, since many points might be deleted on each step,
the work could be significantly less.
As in Quicksort, the worst-case performance is O(n^2) and the
worst-case depth is O(n).
For m hull points the best case times are O (n) work and
O( log m ) depth.
Summary
They formalize a clear-cut, language-based model
for analyzing performance
Work & depth based model is directly defined through a
programming language, rather than a specific machine
It can be applied to various classes of machines using
mappings that account for the number of processors, processing
costs, and communication costs.
NESL allows simple descriptions of parallel algorithms and
makes use of data-parallel constructs and the ability to nest
such constructs.
Summary
NESL hides CPU/memory allocation and interprocessor
communication details by providing an abstraction of parallelism.
The current NESL implementation is based on an
intermediate language (VCODE) and a library of low-level
vector routines (CVL).
For more information on how the NESL compiler is
implemented, see:
“Implementation of a Portable Nested Data-Parallel
Language” Guy E. Blelloch, Siddhartha Chatterjee,
Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.
Discussion
Parallel Processing - Sensor Network Analogy:
Local processing -> aggregation. Work corresponds to the total
aggregation cost.
Moving levels up -> collecting aggregated results from child
nodes.
Depth -> depth of the routing tree in the sensor network; implies
communication cost.
Latency -> cost to transmit data between motes.
In parallel computation the goal is to reduce execution time;
sensor networks aim to reduce power consumption by minimizing
communication. Execution time is also an issue when real-time
requirements are imposed.
Discussion
NESL and TAG queries?
Can latency be hidden by assigning multiple tasks to
motes?
Can you perform different operations on an array's
elements in parallel? Is it hard to add one more parallelism
mechanism besides apply-to-each and parallel functions?