
Processes Distribution of Homogeneous
Parallel Linear Algebra Routines on
Heterogeneous Clusters
Javier Cuenca
Luis Pedro García
Domingo Giménez
Scientific Computation Researching Group, University of Murcia, Spain
Antonio Javier Cuenca Muñoz
Dpto. Ingeniería y Tecnología de Computadores
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee, USA
Introduction
 Automatically Optimised Linear Algebra Software

Objective
 Software capable of tuning itself according to the execution environment

Motivation
 Non-expert users take decisions about computation
 Software should adapt to the continuous evolution of hardware
 Developing efficient code by hand consumes a large quantity of resources
 System computation capabilities are very variable
Some examples of auto-tuning software:
 ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...
Automatic Optimisation on Heterogeneous Parallel Systems
 Two possibilities on heterogeneous systems:

HoHe: heterogeneous algorithms (heterogeneous distribution of data).
HeHo: homogeneous algorithms and heterogeneous assignment of processes:
 A variable number of processes is assigned to each processor, depending on the relative speeds
 A mapping processes → processors must be made, without spending a large execution time on the decision
 Theoretical models: parameters which represent the characteristics of the system
 The general assignment problem is NP-hard ⇒ use of heuristic approximations
Our previous HoHo methodology
 Routines model
$$T_{EXEC} = f(n, SP, AP)$$

n: problem size

SP: system parameters
 Computation and communication characteristics of the system

AP: algorithm parameters
 Block size, number of processors to use, logical configurations of the
processors, ... (with one process per processor)
 Values are chosen when the routine begins to run
Our previous HoHo methodology → Our HeHo methodology
 Modifications in the routine model:

New AP:
 Number of processes to generate
 Mapping processes to processors

SP values change:
 More than one process per processor: each SPi on processor i becomes di times higher, where di is the number of processes assigned to processor i
 Implicit synchronization ⇒ the global value of each SPi is taken as the maximum value over all the processors
 The slowest process forces the other ones to reduce their speed, since they wait for it at the different synchronization points of the routine
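The two adjustments to the SP values can be sketched in a few lines. This is only an illustration under stated assumptions: each SP is treated as a cost (time per operation), and the function name and the numeric values are hypothetical, not taken from the slides.

```python
def effective_sp(sp_per_processor, processes_per_processor):
    """Global value of one system parameter under a mapping d.

    sp_per_processor[i]: cost of the parameter on processor i with one process
    processes_per_processor[i]: d_i, number of processes mapped to processor i
    """
    scaled = [
        sp * d  # d_i processes sharing processor i make the cost d_i times higher
        for sp, d in zip(sp_per_processor, processes_per_processor)
        if d > 0  # processors without processes do not take part
    ]
    if not scaled:
        raise ValueError("the mapping assigns no process to any processor")
    return max(scaled)  # the slowest process sets the pace for all the others


# Hypothetical DGEMM costs (seconds per flop): five equal processors, one faster.
k3_dgemm = [4e-9, 4e-9, 4e-9, 4e-9, 4e-9, 1e-9]
print(effective_sp(k3_dgemm, [1, 1, 1, 1, 1, 3]))
print(effective_sp(k3_dgemm, [2, 1, 1, 1, 1, 2]))
```

With these made-up costs, giving three processes to the fast processor keeps the global cost at the level of the slow processors, while doubling up on a slow processor doubles it.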
Our HeHo methodology: an example of routine model
 LU factorisation, parallel version. Model:
$$T_{EXEC} = T_{ARI} + T_{COM}$$
$$T_{ARI} = \frac{2}{3}\,\frac{n^3}{p}\,k_{3\_DGEMM} + \frac{r+c}{p}\,b\,k_{3\_DTRSM}\,n^2 + \frac{1}{3}\,b^2\,k_{2\_DGETF2}\,n$$
$$T_{COM} = t_s\,\frac{2nd}{b} + t_w\,\frac{2n^2 d}{p}$$
SP: system parameters
 k3_DGEMM, k3_DTRSM, k2_DGETF2
 ts, tw

AP: algorithm parameters





b: block size
P: number of processors
p: number of processes
Mapping p processes on the P processors
p = r x c: logical configuration of the processes: 2D mesh
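As a rough illustration of how such a model can drive the choice of AP, the sketch below evaluates the LU model for a few candidate configurations and keeps the cheapest one. It assumes the model reads T_ARI = (2/3)(n^3/p) k3_DGEMM + ((r+c)/p) n^2 b k3_DTRSM + (1/3) b^2 n k2_DGETF2 and T_COM = ts (2nd/b) + tw (2n^2 d/p). The factor d is taken as a plain input because it is not defined on the slide, and all SP values are hypothetical.

```python
def lu_exec_time(n, b, r, c, d, k3_dgemm, k3_dtrsm, k2_dgetf2, ts, tw):
    """Modelled execution time of the parallel LU on an r x c mesh of processes."""
    p = r * c  # number of processes
    t_ari = (
        (2 / 3) * n**3 / p * k3_dgemm
        + (r + c) / p * n**2 * b * k3_dtrsm
        + (1 / 3) * b**2 * n * k2_dgetf2
    )
    t_com = ts * 2 * n * d / b + tw * 2 * n**2 * d / p
    return t_ari + t_com


# Hypothetical system parameters (seconds per operation, start-up, per-word time).
sp = dict(k3_dgemm=5e-9, k3_dtrsm=8e-9, k2_dgetf2=2e-8, ts=1e-4, tw=1e-7)

# Candidate AP values: logical topology (r, c) and block size b for 8 processes.
candidates = [(r, c, b) for (r, c) in ((2, 4), (1, 8)) for b in (32, 64)]
best = min(
    candidates,
    key=lambda ap: lu_exec_time(2048, ap[2], ap[0], ap[1], d=1, **sp),  # d=1 is a placeholder
)
print(best)
```

The selection is a plain minimum over the modelled times, which is cheap enough to run when the routine starts.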
Our HeHo methodology: an example of routine model
 Platforms:

SUNEt:
 Five SUN Ultra 1
 One SUN Ultra 5
 Interconnection network: Ethernet

TORC (Innovative Computing Laboratory):
 21 nodes of different types
 dual and single processors
 Pentium II, III and 4
 AMD Athlon
 Interconnection networks:
 FastEthernet
 Giganet
 Myrinet
Our HeHo methodology: an example of routine model
 Theoretical vs. experimental time on SUNEt, n = 2048

AP      Mapping of the 8 processes on the 6 processors   Logical topology of the 8 processes   Block size
AP 1    (1,1,1,1,1,3)                                    2×4                                   32
AP 2    (2,1,1,1,1,2)                                    2×4                                   32
AP 3    (2,2,1,1,1,1)                                    2×4                                   32
AP 4    (1,1,1,1,1,3)                                    2×4                                   64
AP 5    (2,1,1,1,1,2)                                    2×4                                   64
AP 6    (2,2,1,1,1,1)                                    2×4                                   64
AP 7    (1,1,1,1,1,3)                                    1×8                                   32
AP 8    (2,1,1,1,1,2)                                    1×8                                   32
AP 9    (2,2,1,1,1,1)                                    1×8                                   32
AP 10   (1,1,1,1,1,3)                                    1×8                                   64
AP 11   (2,1,1,1,1,2)                                    1×8                                   64
AP 12   (2,2,1,1,1,1)                                    1×8                                   64

[Figure: theoretical time and experimental time for AP 1 to AP 12; vertical axis from 0 to 250]
Our HeHo methodology: an example of routine model
 Theoretical vs. experimental time on TORC, n = 4096

AP     Mapping of the 8 processes on the 19 processors   Logical topology of the 8 processes   Block size
AP 1   (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)           4×2                                   32
AP 2   (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)           8×1                                   32
AP 3   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)           4×2                                   32
AP 4   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)           8×1                                   32
AP 5   (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)           4×2                                   32
AP 6   (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)           8×1                                   32
AP 7   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)           4×2                                   32
AP 8   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)           8×1                                   32

[Figure: theoretical time and experimental time for each AP; vertical axis from 0 to 70]
Our HeHo methodology
 Our approach: Assignment tree
[Figure: assignment tree. Levels correspond to processes (up to p), branches to processors (1, 2, 3, ..., P); the children of a node labelled i are labelled i, ..., P]
 A limit in the height of the tree (number of processes) is necessary
 Each node represents a possible solution: a mapping processes → processors
 The other APs (block size, logical topology) are chosen at each
node
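One way to picture the nodes of this tree is as non-decreasing sequences of processor indices, one index per process. The sketch below enumerates them with the standard `itertools.combinations_with_replacement` and converts a node into a vector with the number of processes per processor. The function names and the 0-based numbering are choices made for the example, and the tree shape is assumed to be the combinatorial tree with repetitions.

```python
from itertools import combinations_with_replacement


def tree_nodes(num_processors, max_level):
    """Yield every node of the assignment tree down to max_level.

    A node at level l is a non-decreasing tuple of l processor indices:
    the processor chosen for each of the l processes.
    """
    for level in range(1, max_level + 1):
        yield from combinations_with_replacement(range(num_processors), level)


def to_mapping(node, num_processors):
    """Turn a node into the vector d: number of processes on each processor."""
    d = [0] * num_processors
    for proc in node:
        d[proc] += 1
    return tuple(d)


nodes = list(tree_nodes(num_processors=3, max_level=2))
print(len(nodes))
print([to_mapping(node, 3) for node in nodes])
```

The number of nodes grows quickly with the number of processors and the height, which is why a limit on the height and a pruning strategy are needed.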
Our HeHo methodology
 For each node:

EET(node): Estimated Execution Time
 Optimization problem: finding the node with the lowest EET


LET(node): Lowest Execution Time
GET(node): Greatest Execution Time
 LET and GET are lower and upper bounds of the optimum solution of the
subtree below the node

LET and GET are used to limit the number of nodes evaluated:
 MEET = min_{evaluated nodes} {GET(node)}
 If LET(node) > MEET ⇒ do not work below this node
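The pruning rule can be sketched as a small backtracking search. Everything below is illustrative: the cost model (work split evenly among processes plus a per-process overhead), the bound used for LET and all names are assumptions made for the example, not the models of the LU routine.

```python
def make_toy_model(speeds, work, comm):
    """Hypothetical cost model used only to exercise the search."""

    def eet(node):
        p = len(node)
        load = max(node.count(i) / s for i, s in enumerate(speeds))
        return work / p * load + comm * p

    def let(node):
        # Processors reachable below this node: those already used and
        # those whose index is not lower than the last one chosen.
        usable = set(node) | set(range(node[-1], len(speeds)))
        return work / sum(speeds[i] for i in usable) + comm * len(node)

    return eet, let


def branch_and_bound(num_processors, max_level, eet, let, get):
    best_node, best_time = None, float("inf")
    meet = float("inf")  # minimum GET over the nodes evaluated so far
    evaluated = 0

    def visit(node):
        nonlocal best_node, best_time, meet, evaluated
        evaluated += 1
        t = eet(node)
        if t < best_time:
            best_node, best_time = node, t
        meet = min(meet, get(node))
        if len(node) == max_level or let(node) > meet:
            return  # height limit reached, or nothing below can beat MEET
        for proc in range(node[-1], num_processors):
            visit(node + (proc,))

    for proc in range(num_processors):
        visit((proc,))
    return best_node, best_time, evaluated


eet, let = make_toy_model(speeds=(1.0, 1.0, 4.0), work=100.0, comm=1.0)
pruned = branch_and_bound(3, 6, eet, let, get=eet)  # GET = EET
exhaustive = branch_and_bound(3, 6, eet, lambda node: 0.0, get=eet)  # LET = 0: no pruning
print(pruned)
print(exhaustive)
```

Because the toy LET is a true lower bound for every node below the current one, the pruned search returns the same best time as the exhaustive one while evaluating fewer nodes.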
Our HeHo methodology
 Automatic searching strategies in the assignment tree:

Method 1:
 Backtracking
 GET = EET.

Method 2:
 Backtracking
 GET obtained with a greedy approach

Method 3:
 Backtracking
 GET obtained with a greedy approach
 LET obtained with a greedy approach

Method 4:
 Greedy method on the current assignment tree
 (a combinatorial tree with repetitions)

Method 5:
 Greedy method on a permutational tree with repetitions
Our HeHo methodology
 Automatic searching strategies in the assignment tree:

Method 1:
 Backtracking
 GET = EET
 LET = LETari + LETcom

LETari = sequential time divided by the maximum achievable speed-up
when using all the processors not yet discarded

LETcom = assuming the best logical topology of processes that can be
obtained from this node
Our HeHo methodology
 Automatic searching strategies in the assignment tree:

Method 2:
 Backtracking
 GET = a greedy approach: the EET for each of the children of the node is
calculated, and the node with the lowest EET is included in the solution
 LET = LETari + LETcom

LETari = sequential time divided by the maximum achievable speed-up
when using all the processors not yet discarded

LETcom = assuming the best logical topology of processes that can be
obtained from this node.
 Fewer nodes are analyzed, but the evaluated cost per node increases
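A possible reading of the greedy GET is sketched below: starting at a node, the child with the lowest EET is followed level by level, and the lowest EET seen on that path is used as the upper bound. Taking the minimum along the path is an interpretation, and the cost function and names are hypothetical. The function also counts EET evaluations to show where the extra cost per node comes from.

```python
def toy_eet(node, speeds=(1.0, 1.0, 4.0), work=100.0, comm=1.0):
    """Hypothetical cost: even work split plus a per-process overhead."""
    p = len(node)
    load = max(node.count(i) / s for i, s in enumerate(speeds))
    return work / p * load + comm * p


def greedy_get(node, num_processors, max_level, eet):
    """Upper bound for the subtree of a non-empty node, with the EET calls it costs."""
    best = eet(node)
    evaluations = 1
    current = node
    while len(current) < max_level:
        kids = [current + (proc,) for proc in range(current[-1], num_processors)]
        times = [eet(kid) for kid in kids]
        evaluations += len(kids)
        best_index = min(range(len(kids)), key=times.__getitem__)
        current = kids[best_index]  # follow the child with the lowest EET
        best = min(best, times[best_index])
    return best, evaluations


bound, cost = greedy_get((0,), num_processors=3, max_level=6, eet=toy_eet)
print(toy_eet((0,)), bound, cost)  # GET = EET versus the greedy GET and its cost
```

On this toy model the greedy bound is much tighter than the EET of the node itself, at the price of several EET evaluations instead of one.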
Our HeHo methodology
 Automatic searching strategies in the assignment tree:

Method 3:
 Backtracking
 GET = a greedy approach: the EET for each of the children of the node is
calculated, and the node with the lowest EET is included in the solution
 LET = LETari + LETcom
 LETari = obtained with a greedy approach:
 For each node, the child that least increases the cost of arithmetic operations is included in the solution to obtain the lower bound

LETcom = assuming the best logical topology of processes that can be
obtained from this node.
 A branch leading to an optimal solution may be discarded, because the greedy LETari is not guaranteed to be a true lower bound
Our HeHo methodology
 Automatic searching strategies in the assignment tree:

Method 4:
 Greedy method on the current assignment tree
 (a combinatorial tree with repetitions)

Method 5:
 Greedy method on a permutational tree with repetitions

Both methods 4 and 5:
 To obtain better logical topologies of the processes:
 traversal searching continues (through the best child for each node)
until the established maximum level is reached.
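The difference between the two trees can be sketched through the way children are generated: in the combinatorial tree a node can only be extended with processors whose index is not lower than the last one chosen, while in the permutational tree any processor can follow. The cost function, the names and the 0-based numbering below are assumptions made for the example.

```python
def toy_eet(node, speeds=(1.0, 1.0, 4.0), work=100.0, comm=1.0):
    """Hypothetical cost: even work split plus a per-process overhead."""
    p = len(node)
    load = max(node.count(i) / s for i, s in enumerate(speeds))
    return work / p * load + comm * p


def children(node, num_processors, permutational):
    first = 0 if permutational or not node else node[-1]
    return [node + (proc,) for proc in range(first, num_processors)]


def greedy_search(num_processors, max_level, eet, permutational=False):
    """Follow the best child at each level down to max_level; keep the best node seen."""
    current, best_node, best_time = (), None, float("inf")
    while len(current) < max_level:
        current = min(children(current, num_processors, permutational), key=eet)
        t = eet(current)
        if t < best_time:
            best_node, best_time = current, t
    return best_node, best_time


print(greedy_search(3, 6, toy_eet, permutational=False))  # in the style of Method 4
print(greedy_search(3, 6, toy_eet, permutational=True))   # in the style of Method 5
```

On this toy model the combinatorial greedy search commits to the fastest processor first and can never return to the slower ones, while the permutational one can still add them later and ends with a lower estimated time.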
Experimental Results
 Human searching strategies in the assignment tree:

Greedy User (GU)
 Use ALL the available processors
 One process per processor

Conservative User (CU)
 Use HALF of the available processors
 One process per processor

Expert User (EU):
 Use 1 processor, HALF or ALL the processors depending on the problem
size
 One process per processor
Experimental Results
 Automatic decisions vs. Users, on SUNEt (n = 7680)
Method   Processes mapping   b     Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1)       64    2×3                718.94     0.02     25
2        (1,1,1,1,1,1)       64    2×3                718.94     0.04     25
3        (1,1,1,1,1,1)       64    2×3                718.94     0.02     25
4        (1,1,0,0,0,1)       128   1×3                887.85     0.0001   25
5        (1,1,0,0,0,1)       128   1×3                887.85     0.0005   25
CU       (1,1,0,0,0,1)       128   1×3                1047.13
GU       (1,1,1,1,1,1)       64    2×3                887.85
EU       (1,1,1,1,1,1)       64    2×3                887.85
Experimental Results
 Automatic decisions vs. Users, on TORC (n = 2048)
Method   Processes mapping                          b    Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)    64   3×5                17.91      3.08     15
2        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)    64   3×5                17.91      3.08     15
3        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)    64   4×4                15.27      0.06     25
4        (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0)    64   1×1                43.16      0.0012   30
5        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)    64   4×4                15.27      0.01     30
CU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)    64   3×3                23.77
GU       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)    32   1×19               33.57
EU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)    64   3×3                23.77
Simulations
 Virtual Platforms: variations and/or increases of the
real platforms:

mTORC-01
 the quantity of 17P4 is increased to 11
 Number of processors: 29. Types of processors: 4

mTORC-02
 the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and
20 respectively. Number of processors: 50. Types of processors: 4

mTORC-03
 the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and
10, respectively
 additional processors have been included
 Number of processors: 100. Types of processors: 10
Simulations
 Automatic decisions vs. Users
 On virtual platform: mTORC-01 (n = 20000)
 the quantity of 17P4 is increased to 11
 Number of processors: 29. Types of processors: 4
           Met. 1   Met. 2   Met. 3   Met. 4   Met. 5   CU        GU        EU
Solution   666.44   818.82   666.44   666.44   666.44   1322.23   1145.09   1145.09
t.t.t.     20.39    59.45    0.68     0.0007   0.0122
Level      15       15       20       25       25
Experimental Results
 Automatic decisions vs. Users
 On virtual platform: mTORC-02 (n = 20000)
 the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and
20 respectively
 Number of processors: 50. Types of processors: 4
           Met. 1    Met. 2    Met. 3    Met. 4    Met. 5    CU        GU        EU
Solution   3721.98   3791.98   2439.43   1958.43   1500.24   2249.70   2748.36   2249.70
t.t.t.     259.44    792.32    7.46      0.01      0.07
Level      15        15        25        30        30
Experimental Results
 Automatic decisions vs. Users
 On virtual platform: mTORC-03 (n = 20000)
 the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and
10, respectively
 additional processors have been included
 Number of processors: 100. Types of processors: 10
           Met. 1     Met. 2     Met. 3     Met. 4     Met. 5    CU        GU        EU
Solution   10712.55   14532.45   10712.55   10712.55   4333.23   7405.34   5422.87   5422.87
t.t.t.     109.24     169.72     1274.34    0.08       2.34
Level      10         10         5          25         40
Conclusions

Extension of our previous self-optimisation methodology for
homogeneous systems

On heterogeneous systems, new decisions:
 Number of processes
 Mapping processes → processors

Good results with parallel LU factorisation

Same methodology could be applied to other linear algebra
routines