MPI vs POSIX Threads
A COMPARISON
Overview
 MPI allows you to run multiple processes on a single host.

 How does running MPI on one host compare with an equivalent POSIX threads solution?
 Attempting to compare MPI vs POSIX run times
 Hardware
   Dual 6-core Intel Xeon E5-2667 (2 threads per core), 12 physical cores (see schematic)
   2.96 GHz
   15 MB shared L3 cache per socket (2.5 MB per core)
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
 All code / output / analysis available here:
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/
About the Time Trials

 Going to compare run times of code written in MPI vs code written using POSIX threads and shared memory.
   Try to make the code as similar as possible, so we're comparing apples with oranges and not apples with monkeys.
   Since we are on one machine, the bus is carrying all of the communication traffic; that should make the POSIX and MPI versions comparable (i.e. network latency isn't the weak link).
   So this analysis only makes sense on one machine.
 Use the matrix-matrix multiply code we developed over the semester.
   Everyone is familiar with the code and can make observations.
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/pthread_matrix_21.c
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_3.c
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_no_mp.c
 Use square matrices.
   Not necessary, but it made things more convenient.
 Vary matrix sizes from 500 -> 10,000 elements square (plus a couple of bigger ones).
 Matrix A will be filled with 1-n, left to right and top down.
 Matrix B will be the identity matrix.
   We can then check our results easily, since A*B = A when B is the identity matrix (a minimal sketch of this fill-and-verify step follows this list).
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
 Ran all steps (compile / output result / parsing) many times and checked them before writing the final scripts to do the processing.
 Set up the test bed.
   Try each step individually, check results, then automate.
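For reference, a minimal sketch of the fill-and-verify step (this is not the course code linked above; a naive triple loop stands in for it):

/* Sketch: fill A with 1..n*n row-major, set B to the identity,
 * multiply, and verify that C == A.                             */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 500;                        /* matrix dimension */
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = calloc(n * n, sizeof *C);

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i*n + j] = (double)(i*n + j + 1); /* 1..n*n, left to right, top down */
            B[i*n + j] = (i == j) ? 1.0 : 0.0;  /* identity */
        }

    for (int i = 0; i < n; i++)               /* naive triple loop: C = A*B */
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];

    int ok = 1;
    for (int i = 0; i < n * n && ok; i++)
        if (C[i] != A[i]) ok = 0;             /* A*B must equal A when B = I */
    printf("check %s\n", ok ? "passed" : "FAILED");

    free(A); free(B); free(C);
    return 0;
}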
Specifics cont.

About the runs
 For each matrix size (500 -> 3000, then 4000, 5000, 6000, 7000, 8000, 9000, 10000):
   Vary thread count 2-12 (POSIX)
   Vary process count 2-12 (MPI)
 Run 10 trials of each and take the average (the machine was mostly idle when not running tests, but I wanted to smooth spikes in run times caused by the system doing routine tasks).
 With later runs I ran 12 trials, dropped the high and low, then took the average (sketched below).
 Make observations about anomalies in the run times where appropriate.
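A sketch of that drop-high/low averaging; the helper name and trial values below are made up for illustration and are not from the project scripts:

#include <stdio.h>

/* Hypothetical helper: average a set of trial run times after
 * dropping the single highest and single lowest values.        */
static double trimmed_mean(const double *times, int n)
{
    double lo = times[0], hi = times[0], sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (times[i] < lo) lo = times[i];
        if (times[i] > hi) hi = times[i];
        sum += times[i];
    }
    return (sum - lo - hi) / (n - 2);   /* needs n > 2 */
}

int main(void)
{
    /* dummy values for illustration only */
    double trials[12] = {3.1, 3.0, 3.2, 2.9, 3.0, 3.1,
                         3.3, 3.0, 2.8, 3.1, 3.0, 9.9};
    printf("average of 12 trials, high/low dropped: %.3f secs\n",
           trimmed_mean(trials, 12));
    return 0;
}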
 Caveats
   All initial runs were done with no optimization, just for testing; but hey, this is a class about performance.
   A second set of runs was done with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference).
     First-level optimization made a huge difference: > 3x improvement.
     The GNU optimization options are explained here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
     Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated).
     Not all optimizations are flag controlled.
   Regardless of whether the code is written in the most efficient fashion (and it's not), because of the similarity between the versions we can make some runs and observations.
Oh No moment **
 Huge improvement in performance with optimized code. Why?
   Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was?
   Came back and made matrix B non-identity: same performance. Whew.
   I now believe the main performance improvement came from loop unrolling (illustrated below).
 OK - ready to make the runs.
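For reference, this is the kind of transformation loop unrolling refers to; it is a hand-written illustration of the idea applied to an inner dot product, not a claim about exactly what GCC emits for the project code at -O1:

/* Illustration only: manual 4-way unrolling of an inner product loop. */
double dot_unrolled(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int k = 0;
    for (; k + 4 <= n; k += 4) {      /* four partial sums per iteration */
        s0 += a[k]     * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    for (; k < n; k++)                /* remainder loop */
        s0 += a[k] * b[k];
    return s0 + s1 + s2 + s3;
}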
Discussion
Please chime in as questions come up.
 Process Explanation (after initial testing and verification)
 Settled on 12 processes / threads because of the number of cores available.
   Do you get enhanced or degraded performance by exceeding that number?
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
 Attempted a 25,000 x 25,000 matrix.
   Compile error for the MPI version (exceeded the 2 GB MPI_Bcast limit on the matrices); a back-of-the-envelope check follows below.
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
   Not an issue for the POSIX threads version (until you run out of memory on the machine and hit swap).
 Example of process space / top output (10,000 x 10,000):
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
   top -d .1 (tap "1" to show the CPU list, tap "H" to show threads)
 Early testing, before the runs started, pre-optimization:
   http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
   Use >> top -d t (t in floating-point secs; Linux), and hit the "1" key to see the list of cores.
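A rough size check of why 25,000 x 25,000 is trouble; the assumption here is that the practical ceiling is roughly INT_MAX (~2 GB) bytes per broadcast, since classic MPI counts are 32-bit ints:

/* Back-of-the-envelope check: a 25,000 x 25,000 matrix of doubles is
 * ~5 GB and exceeds a ~2 GB per-broadcast limit; 10,000 x 10,000
 * (~800 MB) does not.                                                */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long sizes[] = {10000, 25000};
    for (int i = 0; i < 2; i++) {
        long long n = sizes[i];
        long long bytes = n * n * (long long)sizeof(double);
        printf("%lld x %lld doubles = %.2f GB -> %s the ~2 GB limit\n",
               n, n, bytes / 1e9,
               bytes > (long long)INT_MAX ? "exceeds" : "fits under");
    }
    return 0;
}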
Take a look at some numbers:
 http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optmized-400-3000_ave.xlsx
 http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optimized-4000-10000_ave.xlsx
 http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optmized-400-3000_ave.xlsx
 http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optimized-4000-8000_ave.xlsx
Time Comparison
[Chart: "POSIX Threads Matrix Matrix Multiply, Matrix Size 4000 x 4000"; y-axis: number of POSIX threads (6-12), x-axis: time (secs), 20-45.]
[Chart: "MPI Matrix Matrix Multiply, Matrix Size 4000 x 4000"; y-axis: number of MPI processes (6-12), x-axis: time (secs), 20-45.]
Time Comparison
In all of these cases the times for 5, 4, 3, and 2 processes were much longer than for 6, so they are left off for comparison.
[Chart: "POSIX Threads Matrix Matrix Multiply, Matrix Size 5000 x 5000"; y-axis: number of POSIX threads (6-12), x-axis: time (secs), 40-100. POSIX doesn't "catch" back up until 9 threads.]
[Chart: "MPI Matrix Matrix Multiply, Matrix Size 5000 x 5000"; y-axis: number of MPI processes (6-12), x-axis: time (secs), 40-100. MPI doesn't "catch" back up until 11 processes.]
MPI Time Curve
[Chart: "MPI Matrix Sizes 2400x2400 - 3000x3000"; one series per size from 2400 x 2400 to 3000 x 3000, y-axis: number of MPI processes (2-12), x-axis: time (secs), 0-70. Includes a reference point for the optimized 1-processor 3000 x 3000 run (straight C, no MPI). Note: 3000 x 3000 performs better than 2900 x 2900.]
POSIX Time Curve
[Chart: "POSIX Matrix Sizes 2400x2400 - 3000x3000"; one series per size from 2400 x 2400 to 3000 x 3000, y-axis: number of POSIX threads (2-12), x-axis: time (secs), 3-21. Note: up to this point, 3000 x 3000 performs better than 2900 x 2900.]
POSIX Threads vs MPI Processes Run Times
Matrix Sizes 4000x4000 - 10,000 x 10,000
[Chart: "POSIX Threads 4000 x 4000 - 10,000 x 10,000"; one series per matrix size, y-axis: number of POSIX threads, x-axis: time (secs), 0-1000.]
[Chart: "MPI Processes 4000 x 4000 - 10,000 x 10,000"; one series per matrix size, y-axis: number of MPI processes, x-axis: time (secs), 0-1000.]
POSIX Threads 1500 x 1500 - 2500 x 2500
[Chart: "POSIX Threads Matrix Sizes 1500 x 1500 - 2500 x 2500"; one series per matrix size, y-axis: number of POSIX threads (2-12), x-axis: time (secs), 0-5.]
MPI 1500 x 1500 - 1800 x 1800
Notice that MPI didn't exhibit the same problem at size 1600 that the POSIX and no-MP cases did.
[Chart: "MPI Matrix Matrix Multiply 1500 x 1500 - 1800 x 1800"; one series per matrix size, y-axis: number of MPI processes, x-axis: time (secs), 2-10.]
POSIX & NO MP 1600 x 1600 case
 The threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix.
 The straight C version runs long enough to watch the top output (here I can see the memory usage).
   The process fits entirely in the shared L3 cache (15 MB x 2 = 30 MB).
 Suspect some kind of boundary issue here, possibly "false sharing"?
 Do the same number of calculations, but make the initial array allocations larger (shown below):
[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? End
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs
[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS ( 1 2 3 4 5 )
foreach? ./a.out
foreach? End
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
[rahnbj@rage ~/SUNY]$
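A minimal sketch of the padding experiment above; the structure and the LDIM padding scheme are assumptions about how the real code (matmat_no_mp.c, linked earlier) was modified:

/* Sketch: do the same 1600x1600 multiply, but allocate the arrays with a
 * padded leading dimension (e.g. 1601) so rows no longer land on the same
 * cache-conflicting boundaries.                                           */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     1600          /* logical matrix size               */
#define LDIM  1601          /* allocated leading dimension (pad) */

int main(void)
{
    double *A = calloc((size_t)LDIM * LDIM, sizeof *A);
    double *B = calloc((size_t)LDIM * LDIM, sizeof *B);
    double *C = calloc((size_t)LDIM * LDIM, sizeof *C);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i*LDIM + j] = (double)(i*N + j + 1);
            B[i*LDIM + j] = (i == j) ? 1.0 : 0.0;
        }

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)            /* same number of calculations */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i*LDIM + j] += A[i*LDIM + k] * B[k*LDIM + j];

    printf("Matrices (%dx%d) Size Allocated (%d x %d) : Run Time %f secs\n",
           N, N, LDIM, LDIM, (double)(clock() - t0) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}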
Notes / Future Directions
 Start the MPI timer after the communication phase: is communication the sole source of the difference? <- TESTED: NO
 Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
   Reran with a non-identity B matrix and compared times <- DONE
 At the boundary conditions the driving force is the amount of memory allocated on the heap, not the number of calculations being performed.
 Intel had a nice article about false sharing:
   https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
   It links to a product they sell for detecting false sharing on their processors.
 Found this paper on OpenMP vs direct POSIX programming (similar tests):
   http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
 Couldn't get MPE running with MPICH (would like to re-investigate why).
 Investigate optimization techniques.
 Combo of MPI and POSIX threads? MPI to multiple machines, then POSIX threads on each?
   http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
 Try different languages, e.g. Chapel.
 Try different algorithms.
 For < 6 processes, look at thread affinity and the assignment of threads to a physical processor.
   There is no guarantee that with 6 or fewer processes they will all reside on the same physical processor.
   Noticed CPU switching occasionally.
   Setting the affinity can mitigate this: a thread is assigned to a core and not "allowed" to move (a minimal sketch follows).
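A minimal sketch of pinning worker threads with pthread_setaffinity_np (Linux-specific; the thread-to-core numbering is an assumption about this machine):

/* Sketch: pin each worker thread to one core so the scheduler cannot
 * migrate it between sockets.                                        */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 6

static void *worker(void *arg)
{
    long id = (long)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)id, &set);                 /* thread i -> core i (assumed numbering) */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    printf("thread %ld pinned to core %ld\n", id, id);
    /* ... matrix-multiply work for this thread's block of rows ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}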
Notes / Future Directions cont.
 Notice the shape of the curves for both the MPI and POSIX solutions: there is definitely a point of diminishing returns, around 6 in this particular case.
 Instead of using 12 cores, could we cut the problem set in half and launch 2 independent 6-process solutions by declaring thread affinity?
   Would this produce better results?
   How would we merge the 2 process spaces?