Parallel and Distributed Computing (ICTS 6301)
Programming Assignment #1
Strassen Matrix Multiplication Report
Lamiya El_Saedi 220093158
1. Describe the theoretical problem setting, the algorithm, the performance
figures and expectations.
Algorithm:
We want to calculate the matrix product C = A*B, where the matrices A and B are of size 2^n x 2^n. We partition A, B and C into equally sized block matrices

A = | A11 A12 |    B = | B11 B12 |    C = | C11 C12 |
    | A21 A22 |        | B21 B22 |        | C21 C22 |

with blocks of size 2^(n-1) x 2^(n-1), so that Ci,j = Ai,1*B1,j + Ai,2*B2,j.
With this construction we have not reduced the number of multiplications. We still
need 8 multiplications to calculate the Ci,j matrices, the same number of
multiplications we need when using standard matrix multiplication.
Now comes the important part. We define new matrices

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22)B11
M3 = A11(B12 - B22)
M4 = A22(B21 - B11)
M5 = (A11 + A12)B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

which are then used to express the Ci,j in terms of the Mk. Because of our definition of the Mk we can eliminate one matrix multiplication and reduce the number of multiplications to 7 (one multiplication for each Mk) and express the Ci,j as

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
We iterate this division process n times until the submatrices degenerate into numbers
(elements of the ring R).
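In terms of expected performance, each level of the recursion replaces 8 block multiplications by 7 plus a constant number of block additions, so the running time satisfies T(n) = 7T(n/2) + O(n^2), which solves to T(n) = O(n^log2(7)) ≈ O(n^2.807), compared with O(n^3) for the standard algorithm.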
Practical implementations of Strassen's algorithm switch to standard methods of
matrix multiplication for small enough submatrices, for which they are more efficient.
The particular crossover point for which Strassen's algorithm is more efficient
depends on the specific implementation and hardware. It has been estimated that
Strassen's algorithm is faster for matrices with widths from 32 to 128 for optimized
implementations.[1]
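To make the recursion concrete, the following is a minimal serial sketch with a crossover to the standard algorithm. This is not the assignment's code; the helper names add, sub and standardMultiply, and the crossover value 64, are illustrative assumptions.

#include <vector>

using Matrix = std::vector<std::vector<int>>;

// elementwise sum of two n x n matrices
static Matrix add(const Matrix &X, const Matrix &Y)
{
    int n = X.size();
    Matrix Z(n, std::vector<int>(n));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i][j] = X[i][j] + Y[i][j];
    return Z;
}

// elementwise difference of two n x n matrices
static Matrix sub(const Matrix &X, const Matrix &Y)
{
    int n = X.size();
    Matrix Z(n, std::vector<int>(n));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i][j] = X[i][j] - Y[i][j];
    return Z;
}

// standard O(n^3) multiplication, used below the crossover point
static Matrix standardMultiply(const Matrix &A, const Matrix &B)
{
    int n = A.size();
    Matrix C(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Matrix strassen(const Matrix &A, const Matrix &B)
{
    int n = A.size();
    if (n <= 64)                      // assumed crossover point
        return standardMultiply(A, B);
    int h = n / 2;
    // split A and B into four h x h blocks each
    Matrix A11(h, std::vector<int>(h)), A12 = A11, A21 = A11, A22 = A11;
    Matrix B11 = A11, B12 = A11, B21 = A11, B22 = A11;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            A11[i][j] = A[i][j];      A12[i][j] = A[i][j + h];
            A21[i][j] = A[i + h][j];  A22[i][j] = A[i + h][j + h];
            B11[i][j] = B[i][j];      B12[i][j] = B[i][j + h];
            B21[i][j] = B[i + h][j];  B22[i][j] = B[i + h][j + h];
        }
    // the seven products M1..M7 from the equations above
    Matrix M1 = strassen(add(A11, A22), add(B11, B22));
    Matrix M2 = strassen(add(A21, A22), B11);
    Matrix M3 = strassen(A11, sub(B12, B22));
    Matrix M4 = strassen(A22, sub(B21, B11));
    Matrix M5 = strassen(add(A11, A12), B22);
    Matrix M6 = strassen(sub(A21, A11), add(B11, B12));
    Matrix M7 = strassen(sub(A12, A22), add(B21, B22));
    // recombine the blocks of C
    Matrix C(n, std::vector<int>(n));
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            C[i][j]         = M1[i][j] + M4[i][j] - M5[i][j] + M7[i][j]; // C11
            C[i][j + h]     = M3[i][j] + M5[i][j];                       // C12
            C[i + h][j]     = M2[i][j] + M4[i][j];                       // C21
            C[i + h][j + h] = M1[i][j] - M2[i][j] + M3[i][j] + M6[i][j]; // C22
        }
    return C;
}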
2. Describe the parallel implementation - the logical structure of the MPI machine
you have in mind, data structures and other related issues.
Logical serial structure:
In the beginning I implemented the above algorithm serially, using the following structure:
Step 1: Store the result of each addition operation appearing in the M equations in another n x n matrix TA, at fixed positions.
I have this original matrix:

A = | A11 A12 |
    | A21 A22 |

The TA matrix stores the block sums:

TA = | A11+A22  A11+A12 |
     | A21+A22          |

and I store the subtractions in another n x n matrix SA, also at fixed positions:

SA = | A12-A22 |
     | A21-A11 |
(Note: the same construction is applied to B, giving TB and SB.)
Step 2: I then use the four new matrices TA, TB, SA and SB to evaluate the M equations.
(Note: I divide the original matrix into four blocks using the above technique at each level and apply Strassen's equations at each level, until I reach a 2x2 matrix, at which point Strassen's equations multiply single elements.)
Step 3: To form the result matrix C I do the following.
For example, given a 4x4 matrix, I divide it into four 2x2 blocks and apply the above structure to obtain the matrices TA, TB, SA and SB. Now I need to multiply (A11+A22)*(B11+B22), but both sides of this equation are 2x2 matrices, so I again apply Strassen's equations, which produce a 2x2 result matrix that I store in another matrix. (Note: I apply this to all seven products.)
Finally, I have seven 2x2 matrices, which I combine with the function finalc() using the formulas

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Therefore, I have a 4x4 result matrix C.
I apply the idea from this example to a matrix of 1024x1024.
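As a small illustration of Step 1, the following sketch packs the sum and difference blocks for one 2h x 2h level in the layout described above (the function name packSumsAndDiffs is illustrative, not taken from the program):

// Packs the Step-1 sums into TA and differences into SA for one 2h x 2h level.
// Layout follows the report: TA(0,0)=A11+A22, TA(0,1)=A11+A12, TA(1,0)=A21+A22;
// SA(0,0)=A12-A22, SA(1,0)=A21-A11.
void packSumsAndDiffs(int h, int **A, int **TA, int **SA)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
        {
            TA[i][j]     = A[i][j]     + A[i + h][j + h]; // A11 + A22
            TA[i][j + h] = A[i][j]     + A[i][j + h];     // A11 + A12
            TA[i + h][j] = A[i + h][j] + A[i + h][j + h]; // A21 + A22
            SA[i][j]     = A[i][j + h] - A[i + h][j + h]; // A12 - A22
            SA[i + h][j] = A[i + h][j] - A[i][j];         // A21 - A11
        }
}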
Parallel structure idea:
Since there is no dependency between the M computations, I can apply the same idea in parallel. I inserted some modifications to broadcast the result matrices: I wrote one function to convert a two-dimensional array into a one-dimensional array, and another function to convert a one-dimensional array back into a two-dimensional array. These two functions operate on 64x64 matrices.
In short, I distribute the processes over the M computations at the last level. With one process, it computes all seven equations and no broadcast of the result matrices is needed. With two processes, one takes three equations and the other takes four; each process applies the owner-computes rule for all computation in its equations.
With four processes, three of them take two equations each and the last takes one. With six processes, five take one equation each and the last takes two.
Note: with 2 to 6 processes, process zero collects the results and computes the final C matrix.
With eight processes, seven processes take one equation each and the last one collects the results and computes the final C.
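A minimal sketch of this distribution (numbering the seven M equations 0..6; the function name myEquations is illustrative, not part of the program):

#include <vector>

// Assigns the 7 Strassen equations round-robin over the available ranks.
// With 8 ranks only 7 multiply and the last acts as the collector,
// matching the distribution described above.
std::vector<int> myEquations(int rank, int nprocs)
{
    int workers = (nprocs >= 8) ? 7 : nprocs; // at most 7 ranks multiply
    std::vector<int> mine;
    for (int eq = 0; eq < 7; eq++)
        if (rank < workers && eq % workers == rank)
            mine.push_back(eq);
    return mine;
}

For example, with two processes this gives rank 0 four equations and rank 1 three; with six processes, rank 0 gets two equations and the rest one each, as described above.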
3. Table(s) of results for the serial program, then varying number of processors p (p = 1, 2, 4, 6, and 8), corresponding speedup and efficiency figures.
These results were taken by running on the cluster.

Results for a 128x128 matrix (times in seconds):

                           p = 1      p = 2      p = 4      p = 6      p = 8
Serial time Ts             0.109      0.109      0.109      0.109      0.109
Parallel time Tp           0.085248   0.048349   0.013826   0.0136     0.001687
Speedup S = Ts/Tp          1.278624   2.254453   7.883735   8.014588   64.59629
Efficiency E = S/p         1.278624   1.127226   1.970934   1.335765   8.074537
Cost pTp                   0.085137   0.096558   0.057201   0.081622   0.013405
Overhead To = pTp - Ts     -0.02386   -0.01244   -0.0518    -0.02738   -0.0956

Table (1): parallel run time, serial run time and derived figures for a 128x128 matrix
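The derived rows follow directly from the definitions. For example, at p = 2: S = Ts/Tp = 0.109/0.048349 ≈ 2.2545, E = S/p = 2.2545/2 ≈ 1.1272, and To = pTp - Ts = 0.096558 - 0.109 ≈ -0.01244.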
4. Present plots of the speedup (show also the ideal speedup on the plot).
Graph (1): parallel time Tp versus number of processes P
Graph (2): speedup S versus number of processes P
Graph (3): overhead To versus number of processes P
Graph (4): efficiency E versus number of processes P
Graph (5): cost pTp and serial time Ts versus number of processes P
5. Analyze the speedup and efficiency results in comparison with the theoretical expectations.
1) The speedup increases as the number of processes increases.
2) The efficiency also increases with the number of processes.
3) The highest speedup is at P = 8 and the lowest at P = 1.
4) The finest useful grain is 8 processes, because beyond this number the efficiency decreases.
5) The smallest overhead is at P = 8 and the largest at P = 2, because there is no dependency and no interaction or communication between the processes; at P = 2 the overhead may be largest because each process has to perform a large amount of computation.
6) Every configuration is cost-optimal, because pTp is less than Ts.
6. Add observations, comments how the particular computer architecture influences the parallel performance. Pay attention how to balance properly the workload per workstation and the number of processors used.
1) Observation: I think the best way to turn a serial problem into a parallel one is to separate the computation, wherever possible, so that there is no dependency or interaction between the parts. You can then parallelize safely by giving every process one or more computations (or a group of related computations), so that each process is responsible for its own computations; finally, the results are gathered in one process to produce the final result.
2) You don't need a cluster to test your work: you can simulate the parallel situation on your own PC, which is a natural way to think in parallel.
3) When running a parallel program that uses more than one process, you should try to make all processes stop at the same time rather than leaving some processes still working. You can do this by giving process zero the largest amount of computation, or by giving it the last computation and assigning the collection of the results to the last process.
7. Conclusions, ideas for possible optimizations
1) For good parallel run times you must disable the screen saver and not run any other job on the computer during the measurement, because otherwise execution may be interrupted before it completes.
2) Don't print computation results to the output screen: printing takes a long time and affects the measured execution time.
3) The best platform for parallel programming is a multicore computer, or running on a cluster of computers.
4) The safe way to handle large matrices is to allocate them dynamically with pointers rather than statically, to avoid stack overflow problems (see the sketch after this list).
5) From graphs (1)-(5) and Table (1), the best way to run the serial Strassen algorithm in parallel is to map it onto 8 processes, because this gives the minimum interaction, idle time and communication (overhead).
6) I think a suitable way to convert a serial algorithm to a parallel one without overhead is to give the whole serial algorithm to one process and change the problem size, or to fix the efficiency and change the problem size and number of processes.
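As a sketch of point 4 (illustrative only, not the assignment's code):

#include <vector>

int main()
{
    const int n = 1024;
    // A local array "int A[1024][1024]" (about 4 MB) can overflow the default
    // stack; a std::vector keeps the elements on the heap instead.
    std::vector<int> A(n * n, 0); // flat, heap-allocated n x n matrix
    A[0 * n + 0] = 1;             // element (0,0) via row-major indexing
    return 0;
}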
Appendix
1- Main program in 8 processes:
// Headers and globals (A128, B128, C128, TA128..SB128, g1..g7, gp18..gp78,
// rank, numtasks, source, sendcount, recvcount, size128, size64, SIZE) are
// declared elsewhere in the source; only the main flow is shown here.
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    if (numtasks == 8) {
        double start, end;
        start = MPI_Wtime();
        // initialize the input matrices and the work matrices
        for (int i = 0; i < size128; i++)
            for (int j = 0; j < size128; j++) {
                A128[i][j] = i + 1;
                B128[i][j] = i + 1;
                C128[i][j] = 0;
                TA128[i][j] = 0;
                TB128[i][j] = 0;
                SA128[i][j] = 0;
                SB128[i][j] = 0;
            }
        source = 0;
        sendcount = 1;
        recvcount = 1;
        // every rank computes its own M equation(s)
        strass128(A128, B128, 0, 0, size128, size128);
        // Broadcast each flattened 64x64 result from its owner rank.
        // MPI_Bcast is a collective: every rank must make the matching call,
        // with the owner of each buffer as root, and the whole buffer is
        // sent in one call rather than element by element.
        MPI_Bcast(g1, size64 * size64, MPI_INT, 1, MPI_COMM_WORLD);
        MPI_Bcast(g2, size64 * size64, MPI_INT, 2, MPI_COMM_WORLD);
        MPI_Bcast(g3, size64 * size64, MPI_INT, 3, MPI_COMM_WORLD);
        MPI_Bcast(g4, size64 * size64, MPI_INT, 4, MPI_COMM_WORLD);
        MPI_Bcast(g5, size64 * size64, MPI_INT, 5, MPI_COMM_WORLD);
        MPI_Bcast(g6, size64 * size64, MPI_INT, 6, MPI_COMM_WORLD);
        MPI_Bcast(g7, size64 * size64, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 7) {
            // summation: unflatten the seven M results and combine them into C
            conv1281to2(gp18, g1, 64);
            conv1281to2(gp28, g2, 64);
            conv1281to2(gp38, g3, 64);
            conv1281to2(gp48, g4, 64);
            conv1281to2(gp58, g5, 64);
            conv1281to2(gp68, g6, 64);
            conv1281to2(gp78, g7, 64); // finalgc128 reads gp78, so g7 must be unflattened too
            finalgc128(C128);
            cout << endl << "C=\n";
            // wrc128(C128);
        }
        end = MPI_Wtime();
        double ftime = end - start;
        cout << endl << "parallel8 time = " << ftime << endl;
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE); // SIZE is declared elsewhere (expected to be 8)
    MPI_Finalize();
}
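With eight processes required, the program would be launched with, for example (assuming the executable is named strassen): mpirun -np 8 ./strassen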
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
2- strass128( ) in parallel:
void strass128(int A[size128][size128], int B[size128][size128],
               int ai, int aj, int bi, int bj)
{
    // build the sum matrices TA/TB and difference matrices SA/SB (Step 1)
    add128A11A22(A, TA128, 0, 0, size128 / 2, size128 / 2);
    add128A11A12(A, TA128, 0, size128 / 2);
    add128A21A22(A, TA128, 0, size128 / 2);
    add128A11A22(B, TB128, 0, 0, size128 / 2, size128 / 2);
    add128A11A12(B, TB128, 0, size128 / 2);
    add128A21A22(B, TB128, 0, size128 / 2);
    sub128A12A22(A, SA128, 0, size128);
    sub128A21A11(A, SA128, 0, size128 / 2);
    sub128A12A22(B, SB128, 0, size128);
    sub128A21A11(B, SB128, 0, size128 / 2);
    // each rank computes one of the seven M equations on 64x64 blocks
    if (rank == 1) { // M1 = (A11 + A22)(B11 + B22)
        strass64(TA128, TB128, 0, 0, 63, 63, 0, 0, 63, 63);
        finalgc64(gp18);
        conv1282to1(g1, gp18);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 2) { // M2 = (A21 + A22)B11
        strass64(TA128, B128, 64, 0, 127, 63, 0, 0, 63, 63);
        finalgc64(gp28);
        conv1282to1(g2, gp28);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 3) { // M3 = A11(B12 - B22)
        strass64(A128, SB128, 0, 0, 63, 63, 0, 0, 63, 63);
        finalgc64(gp38);
        conv1282to1(g3, gp38);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 4) { // M4 = A22(B21 - B11)
        strass64(A128, SB128, 64, 64, 127, 127, 64, 0, 127, 63);
        finalgc64(gp48);
        conv1282to1(g4, gp48);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 5) { // M5 = (A11 + A12)B22
        strass64(TA128, B128, 0, 64, 63, 127, 64, 64, 127, 127);
        finalgc64(gp58);
        conv1282to1(g5, gp58);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 6) { // M6 = (A21 - A11)(B11 + B12)
        strass64(SA128, TB128, 64, 0, 127, 63, 0, 64, 63, 127);
        finalgc64(gp68);
        conv1282to1(g6, gp68);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 0) { // M7 = (A12 - A22)(B21 + B22)
        strass64(SA128, TB128, 0, 0, 63, 63, 64, 0, 127, 63);
        finalgc64(gp78);
        conv1282to1(g7, gp78);
        tim++;
        cout << "\ntim= " << tim;
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
3- convert from 2 dim to 1 dim:
void conv642to1(int send[size64 * size64], int A[][size64])
{
    // flatten the 64x64 matrix A row by row into the 1-D buffer send
    int v = 0;
    for (int i = 0; i < size64; i++)
        for (int j = 0; j < size64; j++)
            send[v++] = A[i][j];
}
/////////////////////////////////////////////////////////////////////
4- convert from 1 dim to 2 dim:
void conv641to2(int A[][size64], int res[size64 * size64], int si)
{
    // unflatten the 1-D buffer res into the si x si matrix A (row-major)
    for (int i = 0; i < si * si; i++)
        A[i / si][i % si] = res[i];
}
/////////////////////////////////////////////////////////////////////
5- the computation of 2x2 matrix:
void processall(int A[][s], int B[][s], int ai, int aj, int aii, int ajj,
                int bi, int bj, int bii, int bjj)
{
    // Strassen's seven products on a 2x2 block, written with explicit indices:
    P[0] = (A[ai][aj] + A[aii][ajj]) * (B[bi][bj] + B[bii][bjj]);   // M1 = (A11+A22)(B11+B22)
    P[1] = (A[aii][aj] + A[aii][ajj]) * B[bi][bj];                  // M2 = (A21+A22)B11
    P[2] = (B[bi][bjj] - B[bii][bjj]) * A[ai][aj];                  // M3 = A11(B12-B22)
    P[3] = A[aii][ajj] * (B[bii][bj] - B[bi][bj]);                  // M4 = A22(B21-B11)
    P[4] = (A[ai][aj] + A[ai][ajj]) * B[bii][bjj];                  // M5 = (A11+A12)B22
    P[5] = (A[aii][aj] - A[ai][aj]) * (B[bi][bj] + B[bi][bjj]);     // M6 = (A21-A11)(B11+B12)
    P[6] = (A[ai][ajj] - A[aii][ajj]) * (B[bii][bj] + B[bii][bjj]); // M7 = (A12-A22)(B21+B22)
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
6- finalc( ) function:
void finalgc128(int c[][size128])
{
    int i, j, k, r;
    // C11 = M1 + M4 - M5 + M7
    k = 0;
    for (i = 0; i < 64; i++) {
        r = 0;
        for (j = 0; j < 64; j++) {
            c[i][j] = gp18[k][r] + gp48[k][r] - gp58[k][r] + gp78[k][r];
            r++;
        }
        k++;
    }
    // C12 = M3 + M5
    for (i = 0; i < 64; i++) {
        k = 0;
        for (j = 64; j < size128; j++) {
            c[i][j] = gp38[i][k] + gp58[i][k];
            k++;
        }
    }
    // C21 = M2 + M4
    k = 0;
    for (i = 64; i < 128; i++) {
        for (j = 0; j < 64; j++)
            c[i][j] = gp28[k][j] + gp48[k][j];
        k++;
    }
    // C22 = M1 - M2 + M3 + M6
    k = 0;
    for (i = 64; i < 128; i++) {
        r = 0;
        for (j = 64; j < 128; j++) {
            c[i][j] = gp18[k][r] + gp38[k][r] - gp28[k][r] + gp68[k][r];
            r++;
        }
        k++;
    }
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
7- some of the output screen:
a) part of the C matrix 128x128 and the parallel time at 8 processes
Problem 1: whenever I came to do something, the program had already finished, and the same result occurred.
Problem 2: there is a huge difference in execution time between running with printing of the output and without it.
On the cluster: