Performance Tests on the GSI Batch Farm + Lustre

Silvia Masciocchi
GSI Darmstadt
January 28, 2009

Outline:
• Goals
• Nodes, Data, Jobs
• Data Traffic over the Network
• Performance Test Results

[email protected] Performance tests, January 28, 2009


Goals

• Test stability and robustness of the whole system: batch farm + Lustre file system + supporting network
• Verify the data traffic over the involved network
• Verify definition and use of the batch queues and interplay with PROOF
• Measure analysis speed in terms of events/sec and relate to the requirements for 2009


Nodes, Data and Jobs

Nodes:
- ≈ 150 nodes
- ≈ 1200 cores

Data:
- 30 million MC events on /lustre/

Jobs:
• MC Jobs: pp collisions event generation + transport through detector + reconstruction
  - 100% CPU usage
  - 3-4 hours/100 events
  - up to 1.3 GB memory used
• Analysis Train Jobs: 10 realistic ALICE analyses in the GSI train (see next page)
  - I/O bound (reading data on Lustre), CPU/total time ≈ 0.90
  - ≈ 1 hour to process 40,000 events
  - up to 700 MB memory used


The GSI analysis train

Runs in ROOT 522-00, plus AliRoot and analysis PAR files

task          total time [sec]   CPU time [sec]   fraction
task1               1941               889          0.458
task2               2045              1021          0.499
task3               5339              4324          0.810
task4               7079              6197          0.875
task5               8511              7520          0.884
task6               7287              6388          0.877
task7               4017              3105          0.773
task8               4872              4012          0.823
task9               1980               933          0.471
task10              8468              7588          0.896
-----------------------------------------------------------
sum                51539             41977
TRAIN OF 10         8400              7600          0.905


Tests done

• 1 analysis job/node (155 jobs)
• 2 analysis jobs/node (310 jobs)
• 4 analysis jobs/node (620 jobs)
• 6 analysis jobs/node (930 jobs)
• 8 analysis jobs/node (1240 jobs)
• 6 ana jobs/node + 2 MC jobs/node (930 + 310 jobs)
• 4 ana jobs/node + 4 MC jobs/node (620 + 620 jobs)
• 4 ana jobs/node + 4 MC jobs/node + PROOF (Jacek) (620 + 620 + 144)
• 2 ana jobs/node + 6 MC jobs/node (310 + 930 jobs)
• 8 MC jobs/node + 1 analysis/node (HP) (1240 + 155 jobs)
• 8 MC jobs/node + 4 analysis/node (HP) (1240 + 620 jobs)

In total:
- ≈ 20,000 analysis jobs run
- ≈ 10^9 events analyzed


GSI farm + Lustre

[Figure]


Lustre 1 GB link

[Figure]


A switch

[Figure]


Another switch

[Figure]


A switch for the blades

[Figure]


1 analysis job/node

[Figure, four panels:
 - Total time for 100 events (histogram of (realt)/evn)
 - Total time for 100 events vs node number ((realt)/evn vs lxb)
 - CPU time for 100 events vs node number ((cput)/evn vs lxb)
 - CPU/total time for 100 events vs node number ((cput/realt) vs lxb)]

Total time for 100 events: Entries 153, Mean 8.567, RMS 0.1997


Nodes and Switches

[Figure]


2 analysis jobs/node

[Figure: same four panels as for 1 analysis job/node]

Total time for 100 events: Entries 304, Mean 8.825, RMS 0.1852


4 analysis jobs/node

[Figure: same four panels as for 1 analysis job/node]

Total time for 100 events: Entries 612, Mean 9.313, RMS 0.1963


6 analysis jobs/node

[Figure: same four panels as for 1 analysis job/node]

Total time for 100 events: Entries 917, Mean 9.936, RMS 0.2089


8 analysis jobs/node

[Figure: same four panels as for 1 analysis job/node]

Total time for 100 events: Entries 1223, Mean 10.7, RMS 0.2958


4 analysis jobs/node + PROOF

[Figure: same four panels as for 1 analysis job/node]

Total time for 100 events: Entries 612, Mean 9.935, RMS 1.471


Time to analyze 100 events

[Figure, with four data series:
 - ONLY analysis jobs, other processors free
 - remaining processors used for MC
 - PROOF on top of 4 analysis jobs and 4 MC
 - analysis jobs in high priority queue, 8 MC jobs/node (some suspended)]


N. events per second per job

[Figure, with four data series:
 - ONLY analysis jobs, other processors free
 - remaining processors used for MC
 - PROOF on top of 4 analysis jobs and 4 MC
 - analysis jobs in high priority queue, 8 MC jobs/node (some suspended)]


Total n. events per second

Current farm: 150 nodes x 8 cores = 1200 CPUs

[Figure, with four data series:
 - ONLY analysis jobs, other processors free
 - remaining processors used for MC
 - PROOF on top of 4 analysis jobs and 4 MC
 - analysis jobs in high priority queue, 8 MC jobs/node (some suspended)]


Conclusions

• Network
• Lustre
• Queues + PROOF
• Data rate in analysis
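
The quantities plotted on the last three result slides (time per 100 events, events per second per job, total events per second) can be related to the histogram statistics quoted on the analysis-only slides. The following is a minimal sketch, not the analysis actually used for the slides. It assumes that the quoted means are in seconds (the slides give no unit, but ≈ 1 hour per 40,000 events corresponds to about 9 s per 100 events), that all jobs of a test ran concurrently, and that the number of histogram entries equals the number of jobs contributing. The function and variable names are illustrative only.

```python
# Histogram statistics quoted on the slides for the analysis-only tests:
# jobs per node -> (histogram entries, mean total time per 100 events).
# Assumption: times are in seconds; the slides do not state the unit.
ANALYSIS_ONLY = {
    1: (153, 8.567),
    2: (304, 8.825),
    4: (612, 9.313),
    6: (917, 9.936),
    8: (1223, 10.7),
}


def events_per_second_per_job(mean_time_per_100_events):
    """Events processed per second by a single job."""
    return 100.0 / mean_time_per_100_events


def total_events_per_second(n_jobs, mean_time_per_100_events):
    """Aggregate rate, assuming all n_jobs run at the same time."""
    return n_jobs * events_per_second_per_job(mean_time_per_100_events)


if __name__ == "__main__":
    print("jobs/node  jobs  events/s/job  total events/s")
    for jobs_per_node, (n_jobs, mean_t) in sorted(ANALYSIS_ONLY.items()):
        per_job = events_per_second_per_job(mean_t)
        total = total_events_per_second(n_jobs, mean_t)
        print(f"{jobs_per_node:9d}  {n_jobs:4d}  {per_job:12.2f}  {total:14.0f}")
```

Under these assumptions the per-job rate falls as more analysis jobs share a node, while the aggregate rate keeps rising up to 8 jobs/node.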
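
The "fraction" column of the GSI analysis train table is the ratio of CPU time to total time. As a cross-check, the short sketch below recomputes it from the quoted numbers and compares the ten tasks run separately with the train of 10. The dictionary layout and function name are illustrative choices, not something defined in the slides.

```python
# (total time [sec], CPU time [sec]) per task, as quoted in the train table.
TASKS = {
    "task1": (1941, 889),
    "task2": (2045, 1021),
    "task3": (5339, 4324),
    "task4": (7079, 6197),
    "task5": (8511, 7520),
    "task6": (7287, 6388),
    "task7": (4017, 3105),
    "task8": (4872, 4012),
    "task9": (1980, 933),
    "task10": (8468, 7588),
}
TRAIN_OF_10 = (8400, 7600)  # all ten tasks run together in one train


def cpu_fraction(total_time, cpu_time):
    """Fraction of the total (wall-clock) time spent on the CPU."""
    return cpu_time / total_time


SUM_TOTAL = sum(total for total, _ in TASKS.values())
SUM_CPU = sum(cpu for _, cpu in TASKS.values())

if __name__ == "__main__":
    for name, (total, cpu) in TASKS.items():
        print(f"{name:8s} {cpu_fraction(total, cpu):.3f}")
    print(f"sum      total={SUM_TOTAL} s  cpu={SUM_CPU} s")
    print(f"train    {cpu_fraction(*TRAIN_OF_10):.3f}")
    # Ratio of total time: tasks run one by one vs. combined in the train.
    print(f"separate/train total time: {SUM_TOTAL / TRAIN_OF_10[0]:.2f}")
```

Tasks with a low fraction spend most of their time waiting on I/O. Combining them in one train means the input is processed once for all ten tasks, which is consistent with the higher CPU/total fraction quoted for the train.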