Lab 2: Parallel Processing Using Nios II Processors
CEG 4131 Computer Architecture III
Miodrag Bolic
1
Overview
You will learn how to:
• Design multiprocessing systems that use shared
memories
• Partition a sequential program so that it can be
implemented on multiple processors
• Synchronize a multiprocessing system
• Time: 3 weeks
• Points: 115 (there is an optional task)
2
Overview
• Part 1
– Design a multiprocessing system by following the steps from the
tutorial. Run and debug the program that comes with the tutorial.
• Part 2
– Use the same hardware designed in part 1
– Develop a program for parallel matrix multiplication and run it on
the multiprocessing system
– Compute the speedup by comparing the run time of the program on a
single processor with its run time on the multiprocessing system
3
Part 1
• Copy the project
C:\altera\kits\nios2\examples\vhdl\niosII_stratix_1s10\standard
to your home directory
• Go through the steps of the “Creating Multiprocessor Nios
II Systems Tutorial”. The tutorial is available as
tt_nios2_multiprocessor_tutorial.pdf, and the accompanying program is at
http://www.altera.com/literature/tt/hello_world_multi.c
• Modification: On page 30 of the tutorial, choose the Nios II/s core for
CPU3 instead of Nios II/e. All three cores have to be Nios II/s.
Change the instruction cache size of all three cores to 4 KB.
• Before generating and compiling on page 36 of the tutorial, do the
following:
– Add a performance counter in the same way as in Lab 1. Connect
performance_counter only to the data master of CPU1.
– Add an on-chip memory block and configure it as shown on the next page.
Connect the s1 port to cpu1/data_master and cpu2/data_master. Connect
the s2 port to cpu3/data_master.
• Continue with the tutorial.
4
On-chip memory configuration
5
Task 1 – Demonstration and Questions
• Show the TA that the program is working (20 points)
• Questions:
1. Describe the program in detail.
2. Why do we need a mutex?
3. If processor 1 gets the mutex for the memory
message_buffer_ram, can processor 2 write to this
memory before processor 1 releases the mutex?
4. Can processor 1 store two messages in the buffer?
6
Part 2
• In this part, the same hardware configuration will be
used.
• You will design a program for parallel matrix
multiplication.
• Problem:
There is an input/output module that receives data and
stores it in matrices M1 and M2. We will simulate this
module using the shared_memory module that we added in
the first part of the lab. Our program multiplies these two
matrices and stores the result C in the same module
(memory).
7
Sequential solution
• Program the Altera chip using the same configuration
from part 1.
• Modify the matrix_performance.c file so that matrices
M1, M2 and C are transferred to the shared_memory.
Do this step before activating the performance counter.
Change the number of iterations of the matrix multiplication
from 100 to 1000.
• Change the C/C++ build options of your project and its syslib
project from Debug to Release.
• Run the code and report the performance counter result
and the matrix C obtained in iteration 1000 (a sketch of the
performance-counter instrumentation follows this list).
• Demonstration: show the result to the TA.
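The sketch below shows one way to instrument the sequential multiplication with the performance counter. The macro names PERFORMANCE_COUNTER_BASE and SHARED_MEMORY_BASE, the matrix layout in the shared memory, and the test initialization of M1 and M2 are assumptions; adjust them to match your own system.h and matrix_performance.c.

    /* Minimal sketch of the sequential measurement, assuming the
     * performance counter is named performance_counter in SOPC Builder
     * (so system.h defines PERFORMANCE_COUNTER_BASE) and the shared
     * on-chip memory is named shared_memory (SHARED_MEMORY_BASE). */
    #include <stdio.h>
    #include "system.h"
    #include "altera_avalon_performance_counter.h"

    #define N          10
    #define ITERATIONS 1000

    int main(void)
    {
        /* Place M1, M2 and C in the shared on-chip memory. */
        volatile int (*M1)[N] = (volatile int (*)[N])(SHARED_MEMORY_BASE);
        volatile int (*M2)[N] = (volatile int (*)[N])(SHARED_MEMORY_BASE + N * N * sizeof(int));
        volatile int (*C)[N]  = (volatile int (*)[N])(SHARED_MEMORY_BASE + 2 * N * N * sizeof(int));
        int i, j, k, it;

        /* Fill M1 and M2 with test data before starting the counter. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                M1[i][j] = i + j;
                M2[i][j] = i - j;
            }

        PERF_RESET(PERFORMANCE_COUNTER_BASE);
        PERF_START_MEASURING(PERFORMANCE_COUNTER_BASE);

        for (it = 0; it < ITERATIONS; it++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++) {
                    C[i][j] = 0;
                    for (k = 0; k < N; k++)
                        C[i][j] += M1[i][k] * M2[k][j];
                }

        PERF_STOP_MEASURING(PERFORMANCE_COUNTER_BASE);

        /* The low 32 bits of the cycle count are enough for this run. */
        printf("Clock cycles: %lu\n",
               (unsigned long)perf_get_total_time((void*)PERFORMANCE_COUNTER_BASE));
        return 0;
    }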
8
Parallel solution
• CPU1 will be used for synchronization and for I/O
operations, while CPU2 and CPU3 are used for the
multiplication. CPU2 and CPU3 operate in single-program
multiple-data (SPMD) mode: they start each iteration at the
same time and execute the same code, but on different data.
After they finish the multiplication, they signal CPU1. The
program repeats the matrix multiplication 1000 times. (A
small sketch of the SPMD code selection is shown below.)
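One simple way to realize SPMD here is to build the same worker source for both CPU2 and CPU3 and let a compile-time constant select the rows each processor handles. The CPU_ID define below is hypothetical (set it per software project, e.g. with -DCPU_ID=2 or -DCPU_ID=3):

    #if CPU_ID == 2            /* CPU2's software project */
      #define ROW_FIRST 0
      #define ROW_LAST  4
    #else                      /* CPU3's software project */
      #define ROW_FIRST 5
      #define ROW_LAST  9
    #endif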
9
Parallel matrix multiplication
• CPU1 transfers M1 and M2 to the shared_memory.
• Algorithm
The sequential program is shown below. In the parallel implementation,
CPU2 will execute the i loop for i = 0 to 4, and CPU3 will execute it
for i = 5 to 9. CPU2 and CPU3 will perform their operations at the
same time; a sketch of the partitioned worker loop follows the listing.
for (i = 0; i <= 9; i++) {
    for (j = 0; j <= 9; j++) {
        C[i][j] = 0;
        for (k = 0; k <= 9; k++) {
            C[i][j] += M1[i][k] * M2[k][j];
        }
    }
}
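A possible worker routine for CPU2 and CPU3 is sketched below; the function name and the row_first/row_last parameters (0 and 4 on CPU2, 5 and 9 on CPU3) are illustrative, and M1, M2 and C are assumed to point into the shared on-chip memory as in the sequential version.

    static void multiply_rows(volatile int (*M1)[10], volatile int (*M2)[10],
                              volatile int (*C)[10], int row_first, int row_last)
    {
        int i, j, k;
        for (i = row_first; i <= row_last; i++) {   /* only this CPU's rows */
            for (j = 0; j <= 9; j++) {
                C[i][j] = 0;
                for (k = 0; k <= 9; k++)
                    C[i][j] += M1[i][k] * M2[k][j];
            }
        }
    }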
10
Synchronization
• The variables status_start and status_done will be shared variables used
for synchronization. All three processors will access these variables
through the mutex. They will be stored in the message_buffer_ram
memory.
• It is extremely important that both CPU2 and CPU3 start the matrix
multiplication at the same time. This will not happen automatically
since they are booted from the same memory. So, CPU1 has to
ensure that both CPU2 and CPU3 start at the same time. The shared
variable status_start will be used for that. CPU1 sets this variable
to 1, and CPU2 and CPU3 each increment it before they start the
matrix multiplication. When status_start is 3, CPU2 and CPU3 start
the matrix multiplication and CPU1 initiates the time measurement
using the performance_counter.
• At the beginning, CPU1 sets status_done to 1. After CPU2 and
CPU3 finish 1000 iterations of the 10x10 matrix multiplication, they
each increment status_done. CPU1 periodically reads status_done,
and when it is 3, the program is over. CPU1 then stops the
performance counter and prints the performance counter result and
the matrix C from the 1000th iteration on the terminal. (A sketch of
this handshake, as seen from CPU2/CPU3, follows this list.)
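A minimal sketch of the handshake on CPU2/CPU3 is shown below. The offsets of status_start and status_done inside message_buffer_ram, the mutex device name /dev/message_buffer_mutex, and the macro MESSAGE_BUFFER_RAM_BASE are assumptions made by analogy with the tutorial; adjust them to your own system.h.

    #include "system.h"
    #include "altera_avalon_mutex.h"

    /* Assumed layout of the shared variables in message_buffer_ram. */
    #define STATUS_START ((volatile int*)(MESSAGE_BUFFER_RAM_BASE + 0x00))
    #define STATUS_DONE  ((volatile int*)(MESSAGE_BUFFER_RAM_BASE + 0x04))

    int main(void)
    {
        alt_mutex_dev* mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");

        /* Check in: CPU1 has already set status_start = 1. */
        altera_avalon_mutex_lock(mutex, 1);
        (*STATUS_START)++;
        altera_avalon_mutex_unlock(mutex);

        /* Wait until CPU1 and both workers have checked in (value 3). */
        while (*STATUS_START < 3)
            ;

        /* ... 1000 iterations of this CPU's half of the multiplication ... */

        /* Report completion to CPU1. */
        altera_avalon_mutex_lock(mutex, 1);
        (*STATUS_DONE)++;
        altera_avalon_mutex_unlock(mutex);

        while (1)
            ;   /* nothing more to do on this CPU */
    }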
11
Task 2 - Questions
• What is the speedup when comparing the sequential and parallel
implementations? Comment on the speedup result.
• Why can we design the matrix multiplication program
without using mutexes (except for synchronization)?
12
Demonstration (40)
• Send the matrix C from the 1000th iteration of the matrix
multiplication algorithm to the terminal through the JTAG
UART. Also send the number of clock cycles from the
performance counter.
• Show this result to the TA. Explain to the TA how your
parallel matrix multiplication program works and how you
achieved synchronization. You will lose 10 points if the
speedup is less than 1.
13
Optional part- Synchronization
• If our program emulates a real system, then CPU1 should synchronize
CPU2 and CPU3 after each iteration of the 10x10 matrix multiplication,
not after all 1000 of them. So, in a real program, after each 10x10
matrix multiplication, CPU1 would perform some operations on the
computed matrix C and initiate a new iteration of the 10x10 matrix
multiplication once matrices M1 and M2 are ready.
• In this part of the lab, you will use the iteration_done variable to notify
CPU1 that one iteration of the 10x10 matrix multiplication is done.
An additional shared variable is needed for the start of the next
iteration. Let’s call it start_next_iteration.
• The program works as follows. At the beginning, CPU1 sets
start_next_iteration. After the 10x10 multiplication iteration starts, CPU2
and CPU3 reset this variable. After CPU2 and CPU3 are done with
their part of the 10x10 matrix multiplication, they increment
iteration_done and wait for start_next_iteration to be set. CPU1 checks
whether iteration_done equals 3, and if it does, CPU1 sets
start_next_iteration. The new iteration of the 10x10 matrix multiplication
can then start. (A sketch of CPU1's side of this handshake follows this list.)
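The sketch below shows CPU1's side of this per-iteration handshake. The variable offsets in message_buffer_ram are assumptions, and, by analogy with status_done in Part 2, CPU1 is assumed to reinitialize iteration_done to 1 before each iteration so that the two worker increments bring it to 3.

    /* CPU1's supervision loop for the optional part (mutex opened as before). */
    #define ITERATION_DONE       ((volatile int*)(MESSAGE_BUFFER_RAM_BASE + 0x08))
    #define START_NEXT_ITERATION ((volatile int*)(MESSAGE_BUFFER_RAM_BASE + 0x0C))

    static void supervise(alt_mutex_dev* mutex, int iterations)
    {
        int it;
        for (it = 0; it < iterations; it++) {
            /* Let CPU2 and CPU3 begin one 10x10 multiplication. */
            altera_avalon_mutex_lock(mutex, 1);
            *ITERATION_DONE = 1;          /* each worker adds 1, giving 3        */
            *START_NEXT_ITERATION = 1;    /* workers clear this when they start  */
            altera_avalon_mutex_unlock(mutex);

            /* Wait until both workers report that the iteration is done. */
            while (*ITERATION_DONE < 3)
                ;

            /* ... here CPU1 can use matrix C (sum its elements, do I/O, ...). */
        }
    }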
14
Optional part – Demonstration and Questions
Question
• What is the speedup of this program?
Demonstration (10 optional points)
• Send the sum of the elements of matrix C for each
iteration of the 10x10 matrix multiplication algorithm to the
terminal through the JTAG UART. Also send the number of
clock cycles from the performance counter.
• Show this result to the TA. Explain to the TA how you
achieved synchronization.
15
What to submit
The report contains the following (30 points):
• Title page
• Description of your system with a picture of the SOPC Builder system
components
• Detailed description of your solution: the algorithm for parallel
matrix multiplication and the synchronization.
• Answers to the questions from Tasks 1 and 2.
• Conclusions
• Page 17 of this document signed by the TA.
• Soft copies of the report and of the source code of the programs for
sequential and parallel multiplication with basic comments (*.c files),
and the Quartus II files *.sof and *.ptf (10 points).
• Optional: Description of the synchronization method and the speedup
for the optional part as a part of the report. Soft copy of the program
for matrix multiplication. (5 points)
16
Lab 2 – Signature page

Student name:
Student name:

Part                 Performance_counter result (time)   Demonstrated (TA's signature)   Points
Part 1                                                                                    ____/20
Part 2 sequential                                                                         ____
Part 2 parallel                                                                           ____/40
Part 2 optional                                                                           ____/10
Total                                                                                     ____
17