
CIS6930
Parallel Computing
Fall 2006
Exam # 1
Name: __________________________________________
UFID: ____________ - ____________
E-mail: _________________________________________
Instructions:
1. Write neatly and legibly.
2. While grading, not only your final answer but also your
approach to the problem will be evaluated.
3. You have to attempt all the problems (100 points).
4. Total time for the exam is 100 minutes.
5. When deriving expressions for runtime, you should detail all the
appropriate steps; otherwise, no partial credit will be awarded for incorrect
expressions.
I have read carefully, and have understood the above
instructions. On my honor, I have neither given nor
received unauthorized aid on this examination.
Signature: _____________________________________
Date: ____ (MM) / ____ (DD) / ___________ (YYYY)
Question 1 (5 × 7 = 35 points)
State whether the following statements are true or false. Give a brief explanation (2-3
lines) supporting your answer.
1. It is possible to embed a hypercube into a mesh such that each link of the
hypercube is mapped onto exactly one link of the mesh.
FALSE in general. A hypercube has O(p log p) links while a mesh has only O(p) links. Hence,
some links of the mesh will necessarily have many hypercube links mapped onto them. The
statement is true only when a d-dimensional hypercube is mapped onto a d-dimensional (or
higher-dimensional) mesh. (A quick numeric check of the link counts appears after statement 7.)
2. Data locality is critical only for message passing computers and not for shared
address space computers because shared address space computers have
additional hardware that makes the entire memory globally accessible.
FALSE. Even though the entire memory is globally addressable in shared address space
computers, access to non-local memory is usually much slower than access to local memory.
3. Effective concurrency can always be obtained by decomposing the output data.
FALSE. In many cases, output data is very small (e.g., computing the sum of an array of
numbers) and cannot be decomposed. In many other cases, different parts of the output
cannot be computed independently (e.g., sorting).
4. Crossbar-based parallel computers provide uniform access to the entire memory
to all processors, and thus are good approximations of the EREW PRAM model.
FALSE. Accesses to distinct memory locations within a single memory block cannot take
place concurrently in crossbar-based parallel computers. In the EREW PRAM model, all
accesses to distinct memory locations can take place concurrently.
5. It is often easier to get good load balancing if concurrency is derived using data
decomposition as opposed to task decomposition.
TRUE. This is especially true if the amount of computation is uniform for each data
point. Task decomposition often (but not always) leads to tasks of different sizes, making
it harder to load balance.
6. Dynamic load balancing is not suited for message-passing computers.
FALSE. On message-passing computers, dynamic load balancing requires moving tasks
(and their associated data) from one processor to another. If the size of the data associated
with a task is relatively small (in comparison to the associated computation), then dynamic
load balancing can be performed quite effectively on message-passing computers.
7. 2-D partitioning of the data always leads to a higher degree of concurrency than
1-D partitioning and is thus always preferred.
FALSE. In many problems (e.g., Problem 8 in Homework 3), 2-D partitioning leads to a
similar level of concurrency as 1-D partitioning but requires many more processors. In
such cases, 1-D partitioning is better.
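
Referring back to statement 1, here is a quick numeric check of the link counts (my own
illustration, in Python; it assumes p is an even power of two so that the mesh side √p is an
integer):

    # A d-dimensional hypercube with p = 2**d nodes has d * p / 2 links, while a
    # sqrt(p) x sqrt(p) mesh (without wrap-around) has 2 * sqrt(p) * (sqrt(p) - 1) links.
    for d in (4, 6, 8, 10):
        p = 2 ** d
        hypercube_links = d * p // 2
        side = int(round(p ** 0.5))
        mesh_links = 2 * side * (side - 1)
        print(f"p={p:5d}: hypercube links={hypercube_links:6d}, mesh links={mesh_links:6d}")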
Question 2 (15 points)
In image dithering, the color of each pixel in the n × n image is determined as the
weighted average of its original value and the values of its neighboring pixels. We can
decompose this computation by breaking the image into square regions and using a
different task to dither each of these regions. Note that each task needs to access the
pixel values of the region assigned to it as well as the values of the image surrounding its
region. This 2-D block decomposition is illustrated in the figure below. The image can also
be decomposed using a 1-D block or 1-D cyclic mapping. Answer the following
questions briefly:
1. (7 points) Compare a 1-D block decomposition against a 2-D block
decomposition for this problem. Which is better and why?
Answer: A 1-D block decomposition has less concurrency (O(n)) than a 2-D block
decomposition (which has O(n²) concurrency). A 2-D block decomposition also incurs
lower communication costs than a 1-D block decomposition (O(n/√p) versus O(n) per
task). Hence, a 2-D block decomposition will be better.
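
A small sketch in Python (not part of the original solution; the function names and the
sample values of n and p are hypothetical) that makes the per-task communication volumes
concrete:

    # Remote pixel values a single task needs per iteration for an n x n image
    # decomposed among p tasks.
    def remote_pixels_1d_block(n, p):
        # Each task owns a strip of n/p full rows and needs the two rows just
        # outside its strip: about 2 * n remote pixels.
        return 2 * n

    def remote_pixels_2d_block(n, p):
        # Each task owns an (n/sqrt(p)) x (n/sqrt(p)) square and needs the four
        # one-pixel-wide strips around it: about 4 * n / sqrt(p) remote pixels.
        return 4 * n / p ** 0.5

    n, p = 4096, 64
    print(remote_pixels_1d_block(n, p))   # 8192    -> O(n)
    print(remote_pixels_2d_block(n, p))   # 2048.0  -> O(n / sqrt(p))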
2. (8 points) Compare a 1-D block decomposition against a 1-D cyclic
decomposition. Which is better and why?
Answer: A 1-D block mapping incurs lower communication costs than a 1-D cyclic
mapping, since many interactions are handled locally in the 1-D block case (only O(n)
interactions per processor for 1-D block, whereas O(n²/p) for 1-D cyclic).
Concurrency in both cases is the same: O(p). Hence, the 1-D block mapping is preferred.
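
A similar sketch for this comparison (again my own illustration, with hypothetical n and p):

    # Remote pixel values one processor needs per iteration when the rows of an
    # n x n image are distributed over p processors.
    def remote_pixels_block_rows(n, p):
        # Contiguous strip of n/p rows: only the top and bottom rows of the strip
        # have neighbours on other processors -> about 2 * n remote pixels.
        return 2 * n

    def remote_pixels_cyclic_rows(n, p):
        # Rows dealt out round-robin: every one of the n/p owned rows has both of
        # its neighbouring rows on other processors -> about 2 * (n/p) * n pixels.
        return 2 * (n // p) * n

    n, p = 4096, 64
    print(remote_pixels_block_rows(n, p))    # 8192    -> O(n)
    print(remote_pixels_cyclic_rows(n, p))   # 524288  -> O(n^2 / p)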
Question 3 (20 points)
Consider an iterative parallel algorithm A that performs, in each iteration, computations
on the elements of an n × n grid. The computation for a grid element needs the values of the
four neighboring elements. The communication time in each iteration of the parallel
algorithm for solving this problem on an n × n-processor hypercube is T when the tasks
for neighboring elements are mapped onto adjacent processors of the hypercube.
1. (8 points) What is the least possible communication time taken by each iteration
of this algorithm on an n × n-processor mesh, assuming the best-case mapping of
tasks to processors?
Answer: T, the same as on the hypercube. With a proper mapping, each communication
still happens only between neighboring processors.
2. (12 points) What is the expected communication time taken by each iteration of
this algorithm on a p-processor mesh if tasks are mapped randomly to the nodes
of the mesh?
Answer: O(n²T/√p). Since O(n²) interactions will cross the bisection of the mesh, which
has only O(√p) links, the expected congestion will be O(n²/√p).
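
A small numeric illustration of the congestion argument (my own sketch; the counts are
order-of-magnitude estimates and the values of n and p are hypothetical):

    # Theta(n^2) neighbour interactions must cross the bisection under a random
    # mapping, but a sqrt(p) x sqrt(p) mesh has a bisection width of only sqrt(p)
    # links, so some link carries about n^2 / sqrt(p) of them.
    def expected_congestion(n, p):
        crossing_interactions = n * n     # order of magnitude
        bisection_links = p ** 0.5        # bisection width of the mesh
        return crossing_interactions / bisection_links

    print(expected_congestion(n=1024, p=64))   # 131072.0 -> O(n^2 / sqrt(p)) per link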
Question 4 (30 points)
Consider the problem of multiplying two n × n matrices A and B to obtain matrix C.
Assume that in this problem, matrices A, B, and C are mapped onto a 2-D grid of
√p × √p processors via a standard 2-D block mapping.
1. (10 points) What basic communication operation(s) are needed to ensure that each
processor has all the portions of matrices A, B, and C that it needs to execute the
tasks assigned to it?
Answer: Processors in each row need to perform an all-to-all broadcast for the elements
of Matrix A that they contain. Processors in each column need to perform an all-to-all
broadcast for the elements of Matrix B that they contain.
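
A minimal sketch of this communication pattern using mpi4py (my own illustration, not part
of the exam). It assumes p is a perfect square and uses tiny dummy blocks in place of the
real n²/p-element sub-matrices:

    # Sketch: all-to-all broadcast of A blocks within grid rows and of B blocks
    # within grid columns, using mpi4py.  Run with a square number of processes,
    # e.g.:  mpiexec -n 4 python this_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p = comm.Get_size()
    rank = comm.Get_rank()
    q = int(round(p ** 0.5))                  # the grid is q x q with q = sqrt(p)
    assert q * q == p, "number of processes must be a perfect square"

    row, col = divmod(rank, q)                # position of this process in the grid

    # Dummy local blocks standing in for the real (n/sqrt(p)) x (n/sqrt(p)) blocks.
    A_block = np.full((2, 2), float(rank))
    B_block = np.full((2, 2), float(-rank))

    # One communicator per grid row and one per grid column.
    row_comm = comm.Split(color=row, key=col)
    col_comm = comm.Split(color=col, key=row)

    # All-to-all broadcast of A within the row and of B within the column:
    # afterwards this process holds every block it needs for its block of C.
    A_row_blocks = row_comm.allgather(A_block)    # A[row][0..q-1]
    B_col_blocks = col_comm.allgather(B_block)    # B[0..q-1][col]

    # C[row][col] = sum over k of A[row][k] * B[k][col]
    C_block = sum(a @ b for a, b in zip(A_row_blocks, B_col_blocks))

After the two gathers, each process computes its block of C locally, which is exactly the
data requirement described in the answer above.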
2. (10 points) What is the communication time of this operation on the mesh with
wrap-around (i.e., torus) architecture (give the expression in terms of t_s, t_w, p, and n)?
Answer: The time for an all-to-all broadcast on a ring of p processors with message size
m is (t_s + t_w m)(p − 1).
In this question, the number of processors involved in each all-to-all broadcast is √p and
the message size at each processor is n²/p. Each of the all-to-all broadcasts will thus take
the following time: (t_s + t_w n²/p)(√p − 1)
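
A quick numeric evaluation of this expression (the values of t_s, t_w, n, and p below are
hypothetical, chosen only for illustration):

    def row_or_column_broadcast_time(ts, tw, n, p):
        # All-to-all broadcast on a ring of sqrt(p) processors, message size n^2/p.
        return (ts + tw * n * n / p) * (p ** 0.5 - 1)

    # Example: ts = 10, tw = 1, n = 1024, p = 64 (an 8 x 8 torus):
    print(row_or_column_broadcast_time(10.0, 1.0, 1024, 64))
    # (10 + 1024*1024/64) * (8 - 1) = 16394 * 7 = 114758.0 time units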
3. (10 points) What is the iso-efficiency of your algorithm?
Answer: T_p = n³/p + 2(t_s + t_w n²/p)(√p − 1)

T_0 = pT_p − T_s = 2p(t_s + t_w n²/p)(√p − 1), where we assume T_s = Θ(n³).

So the isoefficiency function can be written as

W = 2p(t_s + t_w W^(2/3)/p)(√p − 1)

The overall asymptotic isoefficiency function is determined by the component that
requires the problem size to grow at the highest rate with respect to p. Therefore, we have

W = Θ(p^(3/2))
Summary of communication times of various operations discussed in the textbook on an
interconnection network with cut-through routing.

Operation                                  | Ring                     | Mesh                         | Hypercube
One-to-all broadcast, All-to-one reduction | (t_s + t_w m) log p      | (t_s + t_w m) log p          | min((t_s + t_w m) log p, 2(t_s log p + t_w m))
All-to-all broadcast, All-to-all reduction | (t_s + t_w m)(p − 1)     | 2t_s(√p − 1) + t_w m(p − 1)  | t_s log p + t_w m(p − 1)
All-reduce                                 | n/a                      | n/a                          | min((t_s + t_w m) log p, 2(t_s log p + t_w m))
Scatter, Gather                            | (t_s + t_w m)(p − 1)     | n/a                          | t_s log p + t_w m(p − 1)
All-to-all personalized                    | (t_s + t_w m p/2)(p − 1) | (2t_s + t_w m p)(√p − 1)     | (t_s + t_w m)(p − 1)
Circular shift                             | n/a                      | n/a                          | t_s + t_w m