PPT

Lecture 5
Chapter4
Partitioning and Divide-and-Conquer
Strategies
划分和分治的并行技术
1
Partitioning
划分就是简单地将问题分成几部分,然
后计算每部分,最后组合各部分得结果。
2
Divide and Conquer分治
Characterized by dividing problem into sub-problems of
same form as larger problem. Further divisions into still
smaller sub-problems, usually done by recursion.
通过递归,将问题分成与该问题形式一样的子问题,然后
继续将子问题继续分成更小规模的子问题。
递归的分治可以并行,因为每个进程可以对子问题进行处理。
3
Partitioning/Divide and Conquer
Examples
Many possibilities.
• Operations on sequences of number such as
simply adding them together
• Several sorting algorithms can often be
partitioned or constructed in a recursive fashion
• Numerical integration
• N-body problem
4
Partitioning a sequence of numbers
into parts and adding the parts
5
方法
• Send+recv (for loop send, for loop recv)
• Bcast+recv (master: bcast all to slaves
+ for loop recv, slaves:receive all from
master+ send result to master)
• Scatter+reduce
6
Tree construction
二叉树
7
Dividing a list into parts
send
recv
8
Partial summation
9
进程P0
/*division phase*/
Divide(s1,s1,s2);/*divide s1 into two, s1 and s2*/
Send(s2,P4);
Divide(s1,s1,s2);
Send(s2,P2);
Divide(s1,s1,s2);
Send(s2,P1);
/*combining phase*/
part_sum=*s1;
Recv(&part_sum1,P1);
part_sum=part_sum+part_sum1;
Recv(&part_sum1,P2);
part_sum=part_sum+part_sum1;
Recv(&part_sum1,P4);
part_sum=part_sum+part_sum1;
10
进程P4
/*division phase*/
Recv(s1,p0);
Divide(s1,s1,s2);
Send(s2,P6);
Divide(s1,s1,s2);
Send(s2,P5);
/*combining phase*/
part_sum=*s1;
Recv(&part_sum1,P5);
part_sum=part_sum+part_sum1;
Recv(&part_sum1,P6);
part_sum=part_sum+part_sum1;
Send(&part_sum1,P0);
11
M路分治
12
Many Sorting algorithms can be parallelized
by partitioning and by divide and conquer.
Example
Bucket sort(桶排序)
13
Bucket sort(桶排序)
One “bucket” assigned to hold numbers that fall within each region.
Numbers in each bucket sorted using a sequential sorting algorithm.
Sequential sorting time complexity: O(nlog(n/m).
Works well if the original numbers uniformly distributed across a
known interval, say 0 to a - 1.
14
Parallel version of bucket sort
Simple approach
Assign one processor for each bucket.
15
big_array = malloc(n*sizeof(long int));
Make_numbers(big_array, n, n/p, p);
n_bar = n/p;
local_array = malloc(n_bar*sizeof(long int));
MPI_Scatter(big_array,
n_bar, MPI_LONG,
local_array, n_bar, MPI_LONG, 0, MPI_COMM_WORLD);
Sequential_sort(local_array, n_bar);
MPI_Gather(local_array, n_bar, MPI_LONG,
big_array,
n_bar, MPI_LONG, 0, MPI_COMM_WORLD);
16
17
18
Further Parallelization
• Partition sequence into m regions.
• Each processor assigned one region.
(Hence number of processors, p, equals m.)
• Each processor maintains one “big” bucket for its region.
• Each processor maintains m “small” buckets, one for each
region.
• Each processor separates numbers in its region into its own
small buckets.
• All small buckets emptied into p big buckets
• Each big bucket sorted by its processor
19
Another parallel version of bucket sort
20
all-to-all broadcast
• An MPI function that can be used to advantage
here.
• Sends one data element from each process to
every other process.
• Corresponds to multiple scatter operations, but
implemented together.
• Should be called all-to-all scatter?
21
22
From: http://www.mhpcc.edu/training/workshop/parallel_intro/MAIN.html
Applying “all-to-all” broadcast to emptying
small buckets into big buckets
Suppose 4 regions and 4 processors
Big bucket
Small bucket
4.13
23
“all-to-all” routine actually transfers rows of an array to columns:
Transposes a matrix.
24
Numerical Integration
• Computing the “area under the curve.”
• Parallelized by divided area into smaller areas
(partitioning).
• Can also apply divide and conquer -- repeatedly
divide the areas recursively.
25
Numerical integration using rectangles
Each region calculated using an approximation given by
rectangles:
Aligning the rectangles:
26
Numerical integration using
trapezoidal method
May not be better!
27
Applying Divide and Conquer
28
Adaptive Quadrature
Solution adapts to shape of curve. Use three areas, A, B,
and C. Computation terminated when largest of A and B
sufficiently close to sum of remain two areas .
29
Adaptive quadrature with
false termination.
Some care might be needed in choosing when to terminate.
Might cause us to terminate early, as two large regions are
the same (i.e., C = 0).
30
Barnes-Hut Algorithm
Start with whole space in which one cube contains
the bodies (or particles).
• First, this cube is divided into eight subcubes.
• If a subcube contains no particles, subcube deleted
from further consideration.
• If a subcube contains one body, subcube retained.
• If a subcube contains more than one body, it is
recursively divided until every subcube contains one body.
31
Creates an octtree - a tree with up to eight edges
from each node.
The leaves represent cells each containing one
body.
After the tree has been constructed, the total
mass and center of mass of the subcube is stored
at each node.
32
Constructing tree requires a time of O(nlogn), and
so does computing all the forces, so that overall
time complexity of method is O(nlogn).
33
Recursive division of 2-dimensional space
34
Orthogonal Recursive Bisection
(For 2-dimensional area) First, a vertical line found that divides
area into two areas each with equal number of bodies. For
each area, a horizontal line found that divides it into two areas
each with equal number of bodies. Repeated as required.
35