Performance Tests on the GSI Batch Farm + Lustre

Goals
Nodes, Data, Jobs
Data Traffic over the Network
Performance Test Results
Silvia Masciocchi
GSI Darmstadt
January 28, 2009
Goals
• Test stability and robustness of the whole system:
batch farm + lustre file system + supporting network
• Verify the data traffic over the involved network
• Verify definition and use of the batch queues
and interplay with PROOF
• Measure analysis speed in terms of events/sec
and relate to the requirements for 2009
[email protected]
Performance tests, January 28, 2009
Nodes, Data and Jobs
Nodes:
- ≈ 150 nodes
- ≈ 1200 cores
Data:
- 30 million MC events on /lustre/
Jobs:
• MC Jobs: pp collisions
event generation + transport through detector + reconstruction
100% CPU usage, 3-4 hours/100 events, up to 1.3 GB memory used
• Analysis Train Jobs:
10 realistic ALICE analyses in the GSI train (see next page)
I/O bound (reading data on lustre), CPU/total time ≈ 0.90
≈ 1 hour to process 40,000 events, up to 700 MB memory used
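As a rough cross-check, the job durations quoted above can be converted into per-job event rates. The sketch below is illustrative arithmetic only: the function name and the use of 3.5 hours as the midpoint of "3-4 hours" are assumptions, not part of the original tests.

```python
def events_per_second(n_events: float, wall_hours: float) -> float:
    """Average event rate of one job from events processed and wall-clock hours."""
    return n_events / (wall_hours * 3600.0)

# Analysis train job: about 40,000 events in about 1 hour (figures from the slide).
analysis_rate = events_per_second(40_000, 1.0)

# MC job: 100 events in 3-4 hours; 3.5 h is an assumed midpoint.
mc_rate = events_per_second(100, 3.5)

print(f"analysis train: {analysis_rate:.1f} events/s per job")
print(f"MC production:  {mc_rate:.4f} events/s per job")
```

With these inputs an analysis job runs at roughly 11 events/s, about three orders of magnitude faster per event than MC production.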
The GSI analysis train
Runs in ROOT 522-00, plus AliRoot and analysis PAR files
task           total time [sec]   CPU time [sec]   fraction
task1                1941               889          0.458
task2                2045              1021          0.499
task3                5339              4324          0.810
task4                7079              6197          0.875
task5                8511              7520          0.884
task6                7287              6388          0.877
task7                4017              3105          0.773
task8                4872              4012          0.823
task9                1980               933          0.471
task10               8468              7588          0.896
               ----------------   --------------
sum                 51539             41977
TRAIN OF 10          8400              7600          0.905
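The "fraction" column is CPU time divided by total time. A minimal sketch that recomputes it from the table values and compares the train with the sum of the ten tasks run separately is given below. The variable names are illustrative, and reading the ratio as the wall-clock gain from running the tasks as one train is an interpretation the slide does not state explicitly.

```python
# (total time [s], CPU time [s]) per task, copied from the table above.
tasks = {
    "task1": (1941, 889),   "task2": (2045, 1021),
    "task3": (5339, 4324),  "task4": (7079, 6197),
    "task5": (8511, 7520),  "task6": (7287, 6388),
    "task7": (4017, 3105),  "task8": (4872, 4012),
    "task9": (1980, 933),   "task10": (8468, 7588),
}
TRAIN_TOTAL, TRAIN_CPU = 8400, 7600  # "TRAIN OF 10" row

sum_total = sum(total for total, _ in tasks.values())
sum_cpu = sum(cpu for _, cpu in tasks.values())

# CPU/total fraction for each task run on its own.
for name, (total, cpu) in tasks.items():
    print(f"{name:7s} {cpu / total:.3f}")

print(f"sum     total={sum_total} s  cpu={sum_cpu} s")
print(f"train   {TRAIN_CPU / TRAIN_TOTAL:.3f}")
print(f"separate/train wall-clock ratio: {sum_total / TRAIN_TOTAL:.1f}")
```

The recomputed sums match the "sum" row, and the ten tasks run separately take about six times the wall-clock time of the single train.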
Tests done
• 1 analysis job/node (155 jobs)
• 2 analysis jobs/node (310 jobs)
• 4 analysis jobs/node (620 jobs)
• 6 analysis jobs/node (930 jobs)
• 8 analysis jobs/node (1240 jobs)
• 6 ana jobs/node + 2 MC jobs/node (930 + 310 jobs)
• 4 ana jobs/node + 4 MC jobs/node (620 + 620 jobs)
• 4 ana jobs/node + 4 MC jobs/node + PROOF (Jacek) (620 + 620 + 144)
• 2 ana jobs/node + 6 MC jobs/node (310 + 930 jobs)
• 8 MC jobs/node + 1 analysis/node (HP) (1240 + 155 jobs)
• 8 MC jobs/node + 4 analysis/node (HP) (1240 + 620 jobs)
In total:
- ≈ 20,000 analysis jobs run
- ≈ 10^9 events analyzed
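The job counts in the list follow from the number of nodes, and the totals can be checked with simple arithmetic. The sketch below assumes 155 nodes (implied by "1 analysis job/node (155 jobs)") and about 40,000 events per analysis job (the figure quoted for the analysis train jobs). Both constants are assumptions for illustration.

```python
NODES = 155              # implied by "1 analysis job/node (155 jobs)"
EVENTS_PER_JOB = 40_000  # assumed: events processed by one analysis train job
TOTAL_ANALYSIS_JOBS = 20_000

# Number of concurrent jobs for each jobs-per-node setting.
jobs = {n: n * NODES for n in (1, 2, 4, 6, 8)}
for per_node, total in jobs.items():
    print(f"{per_node} jobs/node -> {total} jobs")

# Order-of-magnitude estimate of events analysed over all tests.
total_events = TOTAL_ANALYSIS_JOBS * EVENTS_PER_JOB
print(f"total events: {total_events:.1e}")
```

Under these assumptions, 20,000 jobs correspond to about 8 × 10^8 events, i.e. of order 10^9.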
GSI farm + lustre
lustre 1 GB link
A switch
Another switch
A switch for the blades
1 analysis job/node
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 153, Mean 8.567, RMS 0.1997
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
Nodes and Switches
2 analysis jobs/node
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 304, Mean 8.825, RMS 0.1852
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
4 analysis jobs/node
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 612, Mean 9.313, RMS 0.1963
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
6 analysis jobs/node
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 917, Mean 9.936, RMS 0.2089
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
8 analysis jobs/node
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 1223, Mean 10.7, RMS 0.2958
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
4 analysis jobs/node + PROOF
[Figure: four panels]
- Histogram of total time for 100 events, (realt)/evn: Entries 612, Mean 9.935, RMS 1.471
- Total time for 100 events vs node number: (realt)/evn vs lxb
- CPU time for 100 events vs node number: (cput)/evn vs lxb
- CPU/total time for 100 events vs node number: (cput/realt) vs lxb
Time to analyze 100 events
[Figure: summary plot; legend entries:]
• ONLY analysis jobs, other processors free
• remaining processors used for MC
• PROOF on top of 4 analysis jobs and 4 MC
• analysis jobs in high priority queue, 8 MC jobs/node (some suspended)
N. events per second per job
[Figure: summary plot; legend entries:]
• ONLY analysis jobs, other processors free
• remaining processors used for MC
• PROOF on top of 4 analysis jobs and 4 MC
• analysis jobs in high priority queue, 8 MC jobs/node (some suspended)
Total n. events per second
Current farm: 150 nodes x 8 cores = 1200 CPUs
[Figure: summary plot; legend entries:]
• ONLY analysis jobs, other processors free
• remaining processors used for MC
• PROOF on top of 4 analysis jobs and 4 MC
• analysis jobs in high priority queue, 8 MC jobs/node (some suspended)
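For the analysis-only configurations, the per-job and aggregate event rates can be estimated from the histogram means of the total time per 100 events quoted on the per-configuration slides. The sketch below assumes those means are in seconds and that 155 nodes ran the jobs (as implied by the job counts in "Tests done"). The function and variable names are illustrative, and the result is a back-of-the-envelope estimate, not a reproduction of the plotted points.

```python
NODES = 155  # assumed from "1 analysis job/node (155 jobs)"

# Mean total time per 100 events (assumed seconds), keyed by analysis jobs per node.
mean_time_per_100 = {1: 8.567, 2: 8.825, 4: 9.313, 6: 9.936, 8: 10.7}

def rates(jobs_per_node: int, t100: float, nodes: int = NODES):
    """Return (events/s per job, events/s summed over all concurrent jobs)."""
    per_job = 100.0 / t100
    return per_job, per_job * jobs_per_node * nodes

for n, t100 in mean_time_per_100.items():
    per_job, total = rates(n, t100)
    print(f"{n} jobs/node: {per_job:5.2f} events/s per job, "
          f"{total:7.0f} events/s in total")
```

In this estimate the per-job rate falls slowly as more jobs share a node, while the aggregate rate keeps rising up to 8 jobs/node.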
Conclusions
• Network
• Lustre
• Queues + PROOF
• Data rate in analysis