Processes Distribution of Homogeneous
Parallel Linear Algebra Routines on
Heterogeneous Clusters
Javier Cuenca
Luis Pedro García
Domingo Giménez
Scientific Computation Researching Group, University of Murcia, Spain
Antonio Javier Cuenca Muñoz
Dpto. Ingeniería y Tecnología de Computadores
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee, USA
Introduction
Automatically Optimised Linear Algebra Software
Objective
Software capable of tuning itself according to the execution environment
Motivation
Non-expert users have to take decisions about the computation
Software should adapt to the continuous evolution of hardware
Developing efficient code by hand consumes a large quantity of resources
System computation capabilities are very variable
Some examples of auto-tuning software:
ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...
Automatic Optimisation on Heterogeneous Parallel Systems
Two possibilities on heterogeneous systems:
HoHe: Heterogeneous algorithms (heterogeneous distribution
of data).
HeHo: Homogeneous algorithms and heterogeneous
assignation of processes:
A variable number of processes is assigned to each processor, depending on the
relative speeds
A mapping of processes to processors must be made, without spending a large
execution time on taking the decision
Theoretical models: parameters which represent the characteristics of the
system
The general assignment problem is NP-hard, so heuristic approximations are used
Our previous HoHo methodology
Routine model:
TEXEC = f(n, SP, AP)
n: problem size
SP: system parameters
Computation and communication characteristics of the system
AP: algorithm parameters
Block size, number of processors to use, logical configurations of the
processors, ... (with one process per processor)
Values are chosen when the routine begins to run
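The run-time choice of the AP values can be sketched as a plain minimisation of the modelled time over a candidate set. The model function, the SP values and the candidate ranges below are illustrative stand-ins, not the real routine model:

```python
# Minimal sketch of run-time AP selection: evaluate a modelled execution time
# T_EXEC = f(n, SP, AP) for every candidate AP and keep the cheapest.
# The model and all numeric values here are illustrative, not the real ones.
from itertools import product

def t_exec(n, sp, ap):
    """Toy model: a cubic arithmetic term plus a blocked communication term."""
    b, p = ap                    # AP: block size and number of processors
    k3, ts, tw = sp              # SP: flop cost, start-up time, word-sending time
    return (2 * n**3 / (3 * p)) * k3 + (n / b) * (ts + b * n * tw)

def choose_ap(n, sp, blocks=(32, 64, 128), procs=(1, 2, 4, 8)):
    """Return the (b, p) pair with the lowest modelled time."""
    return min(product(blocks, procs), key=lambda ap: t_exec(n, sp, ap))

print(choose_ap(4096, (1e-9, 1e-5, 1e-8)))   # -> (128, 8)
```

Because the selection only evaluates a closed-form model, it adds almost nothing to the execution time of the routine itself.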
From our previous HoHo methodology to our HeHo methodology
Modifications in the routine model:
New AP:
Number of processes to generate
Mapping processes to processors
Changes in the SP values:
With more than one process per processor, each SPi in processor i becomes di
(the number of processes assigned to processor i) times higher
Due to the implicit synchronisation, the global value of each SPi is taken as
the maximum value over all the processors:
the slowest process forces the others to reduce their speed, waiting for it
at the different synchronisation points of the routine
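The SP adjustment just described can be sketched as follows (the cost values and the mapping are made up for illustration):

```python
# Sketch of the SP adjustment described above: d[i] processes on processor i
# make its per-process cost SP_i count d[i] times higher, and the implicit
# synchronisation makes the routine advance at the pace of the slowest
# processor, so the global value is the maximum. Cost values are made up.

def effective_sp(sp, d):
    """sp[i]: value of one SP on processor i when running a single process;
    d[i]: number of processes assigned to processor i."""
    return max(sp_i * d_i for sp_i, d_i in zip(sp, d) if d_i > 0)

# A SUNEt-like example: the last processor is faster, so it receives three
# processes, as in the (1,1,1,1,1,3) mappings of the experiments.
sp = [1.0, 1.0, 1.0, 1.0, 1.0, 0.4]   # illustrative per-process costs
d = [1, 1, 1, 1, 1, 3]
print(round(effective_sp(sp, d), 3))   # -> 1.2
```

Note that loading the fast processor with three processes makes it the new bottleneck (1.2 instead of 1.0), which is exactly the trade-off the search over mappings has to resolve.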
Our HeHo methodology: an example of routine model
LU factorisation, parallel version. Model:
TEXEC = TARI + TCOM

TARI = (2n³/(3p)) k3_DGEMM + (n²b/2)(1/r + 1/c) k3_DTRSM + (nb²/3) k2_DGETF2
TCOM = (2nd/b) ts + (2n²d/p) tw
SP: system parameters
k3_DGEMM, k3_DTRSM, k2_DGETF2
ts, tw
AP: algorithm parameters
b: block size
P: number of processors
p: number of processes
Mapping of the p processes on the P processors
p = r × c: logical configuration of the processes: 2D mesh
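Putting the model together, AP selection for the parallel LU routine amounts to evaluating TARI + TCOM for every candidate (b, r, c). The cost expressions below reconstruct the slide's model and the term d = r + c is an assumption, so treat this as an illustrative sketch; the SP values are made up:

```python
# Sketch: evaluate the LU routine model for each candidate (b, r, c) and keep
# the cheapest. Cost expressions reconstruct the slide's model; d = r + c and
# all SP values are assumptions made for illustration.
from itertools import product

def t_lu(n, b, r, c, sp):
    """Modelled time for block size b on an r x c process mesh (p = r*c)."""
    k3_dgemm, k3_dtrsm, k2_dgetf2, ts, tw = sp
    p = r * c
    d = r + c                            # assumption: mesh "perimeter" term
    t_ari = (2 * n**3 / (3 * p)) * k3_dgemm \
        + (n**2 * b / 2) * (1 / r + 1 / c) * k3_dtrsm \
        + (n * b**2 / 3) * k2_dgetf2
    t_com = (2 * n * d / b) * ts + (2 * n**2 * d / p) * tw
    return t_ari + t_com

def best_ap(n, sp, blocks=(32, 64), meshes=((1, 8), (2, 4), (4, 2), (8, 1))):
    """Exhaustive search over the small AP candidate set."""
    return min(((b, r, c) for b, (r, c) in product(blocks, meshes)),
               key=lambda a: t_lu(n, a[0], a[1], a[2], sp))

print(best_ap(2048, (1e-9, 1e-9, 1e-9, 1e-5, 1e-8)))
```

With communication parameters of this order, balanced meshes such as 2×4 win over 1×8, which matches the preference for near-square logical topologies in the experiments.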
Our HeHo methodology: an example of routine model
Platforms:
SUNEt:
Five SUN Ultra 1
One SUN Ultra 5
Interconnection network: Ethernet
TORC (Innovative Computing Laboratory):
21 nodes of different types:
dual- and single-processor
Pentium II, III and 4
AMD Athlon
Interconnection networks:
FastEthernet
Giganet
Myrinet
Our HeHo methodology: an example of routine model
Mappings of the 8 processes on the 6 processors of SUNEt, with their logical topology and block size:

AP     Mapping of the 8 processes    Logical topology     Block size
       on the 6 processors           of the 8 processes
AP 1   (1,1,1,1,1,3)                 2×4                  32
AP 2   (2,1,1,1,1,2)                 2×4                  32
AP 3   (2,2,1,1,1,1)                 2×4                  32
AP 4   (1,1,1,1,1,3)                 2×4                  64
AP 5   (2,1,1,1,1,2)                 2×4                  64
AP 6   (2,2,1,1,1,1)                 2×4                  64
AP 7   (1,1,1,1,1,3)                 1×8                  32
AP 8   (2,1,1,1,1,2)                 1×8                  32
AP 9   (2,2,1,1,1,1)                 1×8                  32
AP 10  (1,1,1,1,1,3)                 1×8                  64
AP 11  (2,1,1,1,1,2)                 1×8                  64
AP 12  (2,2,1,1,1,1)                 1×8                  64

[Figure: theoretical vs. experimental execution time on SUNEt for AP 1 to AP 12, n = 2048]
Our HeHo methodology: an example of routine model
Mappings of the 8 processes on 19 processors of TORC, with their logical topology and block size:

AP    Mapping of the 8 processes on the 19 processors   Logical topology   Block size
AP 1  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)          4×2                32
AP 2  (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)          8×1                32
AP 3  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)          4×2                32
AP 4  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)          8×1                32
AP 5  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)          4×2                32
AP 6  (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)          8×1                32
AP 7  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)          4×2                32
AP 8  (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)          8×1                32

[Figure: theoretical vs. experimental execution time on TORC for AP 1 to AP 8, n = 4096]
Our HeHo methodology
Our approach: Assignment tree
[Figure: assignment tree — each level assigns one more process to one of the P processors, down to a height of p processes]

A limit on the height of the tree (the number of processes) is necessary
Each node represents a possible solution: a mapping of processes to processors
The other APs (block size, logical topology) are chosen at each node
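The node expansion just described can be sketched as follows; representing a node as a non-decreasing tuple of processor indices avoids generating the same assignment twice:

```python
# Sketch of the assignment tree: a node is the multiset of processors already
# holding one process each, stored as a non-decreasing tuple of processor
# indices so that the same assignment is never generated twice.

def children(node, P):
    """Children of a node: one more process, on processor last-used..P."""
    first = node[-1] if node else 1
    return [node + (i,) for i in range(first, P + 1)]

def tree_nodes(P, max_processes):
    """All nodes up to the height limit; the limit is necessary because the
    full tree has C(P + p - 1, p) nodes at level p."""
    level, all_nodes = [()], []
    for _ in range(max_processes):
        level = [child for node in level for child in children(node, P)]
        all_nodes.extend(level)
    return all_nodes

print(len(tree_nodes(3, 2)))   # 3 nodes at level 1 + 6 at level 2 -> 9
```

The combinatorial growth of C(P + p - 1, p) is the reason both the height limit and the pruning bounds of the next slides are needed.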
Our HeHo methodology
For each node:
EET(node): Estimated Execution Time
Optimisation problem: finding the node with the lowest EET
LET(node): Lowest Execution Time
GET(node): Greatest Execution Time
LET and GET are lower and upper bounds on the optimum solution of the
subtree below the node
LET and GET limit the number of nodes evaluated:
MEET = min over the evaluated nodes of GET(node)
If LET(node) > MEET, do not search below this node
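The bounding rule can be sketched as a stack-based search; the EET, LET and GET functions are placeholders supplied by the caller, and the toy tree and costs below are made up:

```python
# Sketch of the bounding rule above: MEET is the lowest GET seen so far, and
# a subtree is discarded when its LET exceeds MEET. The EET/LET/GET functions
# are placeholders supplied by the caller; the toy tree and costs are made up.

def search(root, children, eet, let, get_):
    """Depth-first search of the assignment tree with LET/GET pruning."""
    best_node, best_eet = None, float("inf")
    meet = float("inf")                 # MEET = min over evaluated nodes of GET
    stack = [root]
    while stack:
        node = stack.pop()
        if let(node) > meet:
            continue                    # no solution below the node can win
        meet = min(meet, get_(node))
        if eet(node) < best_eet:
            best_node, best_eet = node, eet(node)
        stack.extend(children(node))
    return best_node, best_eet

# Toy tree: up to 2 processes on 2 processors, nodes as non-decreasing tuples.
def kids(n):
    return [n + (i,) for i in range(n[-1] if n else 1, 3)] if len(n) < 2 else []

costs = {(1,): 5.0, (2,): 6.0, (1, 1): 4.0, (1, 2): 2.0, (2, 2): 3.0}
eet = lambda n: costs.get(n, 9.0)
print(search((), kids, eet, lambda n: eet(n) - 1, lambda n: eet(n) + 1))
# -> ((1, 2), 2.0)
```

The tighter LET and GET are, the more subtrees are skipped, which is the trade-off explored by methods 1 to 3 below.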
Our HeHo methodology
Automatic searching strategies in the assignment tree:
Method 1:
Backtracking
GET = EET
Method 2:
Backtracking
GET obtained with a greedy approach
Method 3:
Backtracking
GET obtained with a greedy approach
LET obtained with a greedy approach
Method 4:
Greedy method on the current assignment tree
(a combinatorial tree with repetitions)
Method 5:
Greedy method on a permutational tree with repetitions
Our HeHo methodology
Automatic searching strategies in the assignment tree:
Method 1:
Backtracking
GET = EET
LET = LETari + LETcom
LETari = sequential time divided by the maximum achievable speed-up
when using all the processors not yet discarded
LETcom = assuming the best logical topology of processes that can be
obtained from this node
Our HeHo methodology
Automatic searching strategies in the assignment tree:
Method 2:
Backtracking
GET = a greedy approach: the EET for each of the children of the node is
calculated, and the node with the lowest EET is included in the solution
LET = LETari + LETcom
LETari = sequential time divided by the maximum achievable speed-up
when using all the processors not yet discarded
LETcom = assuming the best logical topology of processes that can be
obtained from this node.
Fewer nodes are analysed, but the evaluation cost per node increases
Our HeHo methodology
Automatic searching strategies in the assignment tree:
Method 3:
Backtracking
GET = a greedy approach: the EET for each of the children of the node is
calculated, and the node with the lowest EET is included in the solution
LET = LETari + LETcom
LETari = A greedy approach is used:
For each node, the child that least increases the cost of
arithmetic operations is included in the solution to obtain the
lowest bound
LETcom = assuming the best logical topology of processes that can be
obtained from this node.
It is possible that a branch leading to an optimal solution will be discarded
Our HeHo methodology
Automatic searching strategies in the assignment tree:
Method 4:
Greedy method on the current assignment tree
(a combinatorial tree with repetitions)
Method 5:
Greedy method on a permutational tree with repetitions
In both methods 4 and 5, to obtain better logical topologies of the
processes, the traversal continues (through the best child of each node)
until the established maximum level is reached.
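The greedy descent of methods 4 and 5 can be sketched as follows (the toy tree and costs are made up for illustration):

```python
# Sketch of the greedy descent of methods 4 and 5: at every level only the
# child with the lowest EET is expanded, and the walk continues down to the
# established maximum level in case a better logical topology appears deeper.

def greedy(root, children, eet, max_level):
    node = root
    best = (eet(root), root)
    for _ in range(max_level):
        kids = children(node)
        if not kids:
            break
        node = min(kids, key=eet)          # best child at this level
        best = min(best, (eet(node), node))
    return best[1], best[0]

# Toy tree with the combinatorial-with-repetitions shape (2 processors,
# up to 2 processes); the costs are made up.
def kids(n):
    return [n + (i,) for i in range(n[-1] if n else 1, 3)] if len(n) < 2 else []

cost = {(1,): 5.0, (2,): 6.0, (1, 1): 4.0, (1, 2): 2.0, (2, 2): 3.0}
eet = lambda n: cost.get(n, 9.0)
print(greedy((), kids, eet, max_level=2))   # -> ((1, 2), 2.0)
```

Only one child is expanded per level, so the cost is linear in the height limit, which explains the very low tuning times of these two methods in the result tables.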
Experimental Results
Human searching strategies in the assignment tree:
Greedy User (GU)
Use ALL the available processors
One process per processor
Conservative User (CU)
Use HALF of the available processors
One process per processor
Expert User (EU):
Use 1 processor, HALF or ALL the processors depending on the problem
size
One process per processor
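The three hand-made strategies can be sketched as follows; the problem-size thresholds for the expert user and the particular processors selected are assumptions made for illustration:

```python
# Sketch of the three hand-made strategies, one process per processor.
# The expert-user thresholds and the choice of which processors to use
# (the first ones) are assumptions made for illustration.

def user_mapping(strategy, P, n, threshold=1024):
    """Return one entry per processor: 1 = one process, 0 = idle."""
    if strategy == "GU":                   # greedy user: ALL the processors
        used = P
    elif strategy == "CU":                 # conservative user: HALF of them
        used = P // 2
    else:                                  # expert user: 1, HALF or ALL
        used = 1 if n < threshold else (P // 2 if n < 4 * threshold else P)
    return tuple(1 if i < used else 0 for i in range(P))

print(user_mapping("CU", 6, 7680))   # -> (1, 1, 1, 0, 0, 0)
```

These fixed rules ignore the relative speeds of the processors, which is why the automatic methods beat them on the heterogeneous platforms below.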
Experimental Results
Automatic decisions vs. Users, on SUNEt (n = 7680)
Method   Processes mapping   b     Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1)       64    2×3                718.94     0.02     25
2        (1,1,1,1,1,1)       64    2×3                718.94     0.04     25
3        (1,1,1,1,1,1)       64    2×3                718.94     0.02     25
4        (1,1,0,0,0,1)       128   1×3                887.85     0.0001   25
5        (1,1,0,0,0,1)       128   1×3                887.85     0.0005   25
CU       (1,1,0,0,0,1)       128   1×3                1047.13
GU       (1,1,1,1,1,1)       64    2×3                887.85
EU       (1,1,1,1,1,1)       64    2×3                887.85
Experimental Results
Automatic decisions vs. Users, on TORC (n = 2048)
Method   Processes mapping                          b    Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)   64   3×5                17.91      3.08     15
2        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)   64   3×5                17.91      3.08     15
3        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)   64   4×4                15.27      0.06     25
4        (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0)   64   1×1                43.16      0.0012   30
5        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)   64   4×4                15.27      0.01     30
CU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)   64   3×3                23.77
GU       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)   32   1×19               33.57
EU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)   64   3×3                23.77
Simulations
Virtual platforms: variations and/or extensions of the real platforms:
mTORC-01:
the quantity of 17P4 nodes is increased to 11
Number of processors: 29. Types of processors: 4
mTORC-02:
the quantities of DPIII, SPIII, Ath and 17P4 nodes are increased to 10, 10,
10 and 20, respectively
Number of processors: 50. Types of processors: 4
mTORC-03:
the quantities of DPIII, SPIII, Ath and 17P4 nodes are increased to 10, 15,
5 and 10, respectively
additional processors have been included
Number of processors: 100. Types of processors: 10
Simulations
Automatic decisions vs. Users
On virtual platform: mTORC01 (n = 20000)
the quantity of 17P4 is increased to 11
Number of processors: 29. Types of processors: 4
          Met. 1   Met. 2   Met. 3   Met. 4   Met. 5   CU        GU        EU
Solution  666.44   818.82   666.44   666.44   666.44   1322.23   1145.09   1145.09
t.t.t.    20.39    59.45    0.68     0.0007   0.0122
Level     15       15       20       25       25
Experimental Results
Automatic decisions vs. Users
On virtual platform: mTORC02 (n = 20000)
the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and
20 respectively
Number of processors: 50. Types of processors: 4
          Met. 1    Met. 2    Met. 3    Met. 4    Met. 5    CU        GU        EU
Solution  3721.98   3791.98   2439.43   1958.43   1500.24   2249.70   2748.36   2249.70
t.t.t.    259.44    792.32    7.46      0.01      0.07
Level     15        15        25        30        30
Experimental Results
Automatic decisions vs. Users
On virtual platform: mTORC03 (n = 20000)
the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and
10, respectively
additional processors have been included
Number of processors: 100. Types of processors: 10
          Met. 1     Met. 2     Met. 3     Met. 4     Met. 5    CU        GU        EU
Solution  10712.55   14532.45   10712.55   10712.55   4333.23   7405.34   5422.87   5422.87
t.t.t.    109.24     169.72     1274.34    0.08       2.34
Level     10         10         5          25         40
Conclusions
Extension of our previous self-optimisation methodology for
homogeneous systems
On heterogeneous systems, new decisions:
Number of processes
Mapping of processes to processors
Good results with the parallel LU factorisation
The same methodology could be applied to other linear algebra
routines