CIS6930 Parallel Computing, Fall 2006
Exam #1

Name: __________________________________________
UFID: ____________ - ____________
E-mail: _________________________________________

Instructions:
1. Write neatly and legibly.
2. While grading, not only your final answer but also your approach to the problem will be evaluated.
3. You must attempt all the problems (100 points).
4. The total time for the exam is 100 minutes.
5. When deriving expressions for runtime, you should detail all the appropriate steps. Otherwise, no partial credit will be awarded for incorrect expressions.

I have read carefully, and have understood, the above instructions. On my honor, I have neither given nor received unauthorized aid on this examination.

Signature: _____________________________________
Date: ____ (MM) / ____ (DD) / ___________ (YYYY)

Question 1 (5 × 7 = 35 points)

State whether the following statements are true or false. Give a brief explanation (2-3 lines) supporting your answer.

1. It is possible to embed a hypercube into a mesh such that each link of the hypercube is mapped onto exactly one link of the mesh.

FALSE in general. A hypercube has O(p log p) links while a mesh has only O(p) links, so some links of the mesh must have many links of the hypercube mapped onto them. The statement is true only when a d-dimensional hypercube is mapped onto a mesh of d or more dimensions.

2. Data locality is critical only for message-passing computers and not for shared-address-space computers, because shared-address-space computers have additional hardware that makes the entire memory globally accessible.

FALSE. Even though the entire memory is globally addressable in shared-address-space computers, access to non-local memory is usually much slower than access to local memory.

3. Effective concurrency can always be obtained by decomposing the output data.

FALSE. In many cases the output data is very small (e.g., computing the sum of an array of numbers) and cannot be decomposed.
In many other cases, different parts of the output cannot be computed independently (e.g., sorting).

4. Crossbar-based parallel computers provide uniform access to the entire memory to all processors, and are thus good approximations of the EREW PRAM model.

FALSE. In a crossbar-based parallel computer, accesses to distinct memory locations inside the same memory block cannot take place concurrently. In the EREW PRAM model, all accesses to distinct memory locations can take place concurrently.

5. It is often easier to get good load balancing if concurrency is derived using data decomposition as opposed to task decomposition.

TRUE. This is especially true if the amount of computation is uniform for each data point. Task decomposition often (but not always) leads to tasks of different sizes, making it harder to balance the load.

6. Dynamic load balancing is not suited for message-passing computers.

FALSE. On message-passing computers, dynamic load balancing requires moving tasks (and their associated data) from one processor to another. If the size of the data associated with a task is relatively small compared to the associated computation, then dynamic load balancing can be performed on message-passing computers quite effectively.

7. 2-D partitioning of the data always leads to a higher degree of concurrency than 1-D partitioning and is thus always preferred.

FALSE. In many problems (e.g., Problem 8 in Homework 3), 2-D partitioning leads to a similar level of concurrency as 1-D partitioning but requires many more processors. In such cases, 1-D partitioning is better.

Question 2 (15 points)

In image dithering, the color of each pixel in an n × n image is determined as the weighted average of its original value and the values of its neighboring pixels. We can decompose this computation by breaking the image into square regions and using a different task to dither each of these regions.
Note that each task needs to access the pixel values of the region assigned to it as well as the values of the image surrounding its region. This 2-D block decomposition is illustrated in the figure below. The image can also be decomposed using a 1-D block or 1-D cyclic mapping. Answer the following questions briefly:

1. (7 points) Compare a 1-D block decomposition against a 2-D block decomposition for this problem. Which is better and why?

Answer: A 1-D block decomposition has less concurrency (O(n)) than a 2-D block decomposition (which has O(n²) concurrency). A 2-D block decomposition also incurs lower communication costs than a 1-D block decomposition (O(n/√p) versus O(n) per processor). Hence a 2-D block decomposition is better.

2. (8 points) Compare a 1-D block decomposition against a 1-D cyclic decomposition. Which is better and why?

Answer: A 1-D block mapping incurs lower communication costs than a 1-D cyclic mapping, since many interactions are handled locally under a 1-D block mapping (only O(n) interactions per processor for 1-D block, versus O(n²/p) for 1-D cyclic). Concurrency in both cases is the same: O(p). Hence the 1-D block mapping is preferred.

Question 3 (20 points)

Consider an iterative parallel algorithm A that performs, in each iteration, computations on the elements of an n × n grid. The computation for a grid element needs the values of its four neighboring elements. The communication time in each iteration of the parallel algorithm for solving this problem on an (n × n)-processor hypercube is T when the tasks for neighboring elements are mapped onto adjacent processors of the hypercube.

1. (8 points) What is the least possible communication time taken by each iteration of this algorithm on an (n × n)-processor mesh, assuming the best-case mapping of tasks to processors?

Answer: T, the same as on the hypercube. The reason is that with a proper mapping, each communication still happens only between neighboring processors.
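The decomposition comparison in Question 2 can be sanity-checked numerically. The sketch below (illustrative, not part of the exam; the grid size, processor count, and owner functions are all assumptions) counts, for each mapping of an n × n grid onto p processors, how many 4-neighbor interactions cross a processor boundary:

```python
import math

def boundary_interactions(n, p, owner):
    """Count 4-neighbor pairs whose two cells live on different processors."""
    crossing = 0
    for i in range(n):
        for j in range(n):
            for di, dj in ((0, 1), (1, 0)):      # count each undirected pair once
                ni, nj = i + di, j + dj
                if ni < n and nj < n and owner(i, j) != owner(ni, nj):
                    crossing += 1
    return crossing

n, p = 64, 16
rp = math.isqrt(p)                               # sqrt(p) processors per mesh side

block_1d  = lambda i, j: i // (n // p)           # 1-D block: contiguous strips of rows
cyclic_1d = lambda i, j: i % p                   # 1-D cyclic: rows dealt round-robin
block_2d  = lambda i, j: (i // (n // rp)) * rp + j // (n // rp)  # 2-D block: squares

for name, owner in [("1-D block", block_1d), ("1-D cyclic", cyclic_1d),
                    ("2-D block", block_2d)]:
    print(name, boundary_interactions(n, p, owner))
# For n=64, p=16: 1-D block 960, 1-D cyclic 4032, 2-D block 384
```

The 2-D block mapping has the fewest crossing interactions and the 1-D cyclic mapping the most, consistent with the O(n/√p), O(n), and O(n²/p) per-processor estimates in the answers above.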
2. (12 points) What is the expected communication time taken by each iteration of this algorithm on a p-processor mesh if tasks are mapped randomly to the nodes of the mesh?

Answer: O(n²T/√p). Since O(n²) interactions will cross the bisection of the mesh, which has only O(√p) links, the expected congestion is O(n²/√p).

Question 4 (30 points)

Consider the problem of multiplying two n × n matrices A and B to obtain matrix C. Assume that in this problem, matrices A, B, and C are mapped onto a 2-D grid of √p × √p processors via a standard 2-D block mapping.

1. (10 points) What basic communication operation(s) are needed to ensure that each processor has all the portions of matrices A, B, and C that it needs to execute the tasks assigned to it?

Answer: Processors in each row need to perform an all-to-all broadcast of the elements of matrix A that they contain. Processors in each column need to perform an all-to-all broadcast of the elements of matrix B that they contain.

2. (10 points) What is the communication time of this operation on the mesh with wrap-around (i.e., torus) architecture (give an expression in terms of t_s, t_w, p, and n)?

Answer: The time for an all-to-all broadcast on a ring is (t_s + t_w m)(p - 1). In this question, the number of processors involved in each all-to-all broadcast is √p and the message size at each processor is m = n²/p. Each all-to-all broadcast thus takes time

    (t_s + t_w n²/p)(√p - 1)

3. (10 points) What is the isoefficiency of your algorithm?

Answer: The parallel runtime is

    T_P = n³/p + 2(t_s + t_w n²/p)(√p - 1),

where we assume T_S = Θ(n³). So the total overhead is

    T_o = p T_P - T_S = 2p(t_s + t_w n²/p)(√p - 1).

Writing W = n³ (so that n² = W^(2/3)), the isoefficiency relation W = K T_o becomes

    W = 2Kp(t_s + t_w W^(2/3)/p)(√p - 1).

Balancing W against the t_s term (≈ 2K t_s p√p) gives W = Θ(p^(3/2)); balancing it against the t_w term (≈ 2K t_w W^(2/3) √p) gives W^(1/3) = Θ(√p), i.e., W = Θ(p^(3/2)) as well. The overall asymptotic isoefficiency function is determined by the component that requires the problem size to grow at the highest rate with respect to p.
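The isoefficiency claim can also be checked numerically. In this illustrative sketch (the constants t_s, t_w and the proportionality constant are assumptions, not from the exam), the problem size W = n³ is grown in proportion to p^(3/2), and the efficiency E = W / (W + T_o) settles toward a constant rather than decaying:

```python
from math import sqrt

ts, tw = 10.0, 1.0          # assumed startup and per-word costs (illustrative)

def efficiency(n, p):
    W = n ** 3                                           # serial work, T_S = Theta(n^3)
    To = 2 * p * (ts + tw * n * n / p) * (sqrt(p) - 1)   # total overhead p*T_P - T_S
    return W / (W + To)

for p in [16, 64, 256, 1024]:
    n = round((8 * p ** 1.5) ** (1 / 3))   # keep W = n^3 proportional to p^(3/2)
    print(p, n, round(efficiency(n, p), 3))
```

As p grows 64-fold, the efficiency converges toward a fixed value (about 0.22 with these constants), whereas a fixed W would drive it toward zero.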
Therefore, we have

    W = Θ(p^(3/2))

Summary of communication times of various operations discussed in the textbook on an interconnection network with cut-through routing (m is the message size, p the number of processors; "n/a" marks entries not covered):

Operation                                  | Ring                    | Mesh                        | Hypercube
-------------------------------------------+-------------------------+-----------------------------+-----------------------------------------------
One-to-all broadcast, All-to-one reduction | (t_s + t_w m) log p     | (t_s + t_w m) log p         | min((t_s + t_w m) log p, 2(t_s log p + t_w m))
All-to-all broadcast, All-to-all reduction | (t_s + t_w m)(p - 1)    | 2t_s(√p - 1) + t_w m(p - 1) | t_s log p + t_w m(p - 1)
All-reduce                                 | n/a                     | n/a                         | min((t_s + t_w m) log p, 2(t_s log p + t_w m))
Scatter, Gather                            | n/a                     | n/a                         | t_s log p + t_w m(p - 1)
All-to-all personalized                    | (t_s + t_w mp/2)(p - 1) | (2t_s + t_w mp)(√p - 1)     | (t_s + t_w m)(p - 1)
Circular shift                             | n/a                     | (t_s + t_w m)(√p + 1)       | t_s + t_w m
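For convenience, the hypercube column of the summary table above can be encoded as small helper functions for plugging in parameters (a sketch; t_s, t_w, m, and p follow the textbook's cost model, and the function names themselves are our own):

```python
from math import log2

def one_to_all_broadcast(ts, tw, m, p):
    # min of the naive recursive-doubling time and the scatter/all-gather variant
    return min((ts + tw * m) * log2(p), 2 * (ts * log2(p) + tw * m))

def all_to_all_broadcast(ts, tw, m, p):
    return ts * log2(p) + tw * m * (p - 1)

def all_to_all_personalized(ts, tw, m, p):
    return (ts + tw * m) * (p - 1)

def circular_shift(ts, tw, m, p):
    # constant time on a hypercube with cut-through routing
    return ts + tw * m

# Example: p = 16 processors, m = 1024 words, ts = 50, tw = 1.
print(one_to_all_broadcast(50, 1, 1024, 16))   # min(4296.0, 2448.0) -> 2448.0
```

With these numbers the scatter/all-gather variant of one-to-all broadcast wins, which is why the table carries the min(...) expression rather than a single formula.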