Monitoring of a distributed computing system: the AliEn Grid
Marco MEONI
Alice Offline weekly meeting
Thursday 3rd February 2005
CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18

Content
• Document I’ve been working on since mid Dec 2004
• ~100 pages up to now
• Not too far from the final version
• Available on http://... (let me discuss the thesis first)
1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusion and Outlooks
~65 pages / ~35 pages

Section I
Grid Concepts and Monitoring

Grid, ALICE, AliEn
• Grid Computing overview: “coordinated use of large sets of different, geographically distributed resources in order to allow high-performance computation”
• ALICE experiment and ALICE Off-line
• AliEn
  • PULL rather than PUSH architecture,
  • scheduling service does not need to know the status of all other resources in the system,
  • robust and fault tolerant system where resources can come and go at any point in time.
  • possible to interface an entire foreign Grid as a large Computing and Storage Element (LCG)

Grid Monitoring
• GMA architecture
• R-GMA: an example of implementation
• Jini (Sun) provides the technical basis
[Diagram: GMA components. Producer stores its location in the Registry; Consumer looks up the location in the Registry; Producer transfers data to the Consumer]

Section II
MonALISA Adaptations and Extensions

MonALISA Adaptations
• A WEB Repository as a front-end
  • Stores history of the monitored data
  • Plots any kind of chart
  • Interfaces to user code (custom consumers, config modules, new charts, distributions)
• Farms monitoring
  • User Java class to interface MonALISA and bash script to monitor the site
[Diagram: CE, Bash monitoring script, Monitored data, Java interface class, MonALISA Agent, WNs, ALICE’s resources; split into User code and MonALISA framework]

AliEn Jobs Monitoring
• If the Grid executes jobs then it works!
• Centralized or distributed?
• AliEn native APIs to retrieve job status snapshots
[Diagram: job status flow starting at "Job is submitted", with error states Error_I, Error_A, Error_S, Error_E, Error_R, Error_V/VT/VN, Error_SV and timeouts >1h and >3h]
• Additional Java thread to feed the repository directly
[Diagram: Monitored data, Ad hoc Java thread, Repository, TOMCAT JSP/servlets]

Repository DataBase(s)
• 7.5 GB of monitored information, 52M records
• During DCs, data from ~2K monitored parameters arrive every 2-3 min
[Diagram: averaging process. FIFO with 60 bins for each basic information, at 1 min, 10 min and 100 min]
• Data Replication:
  • MASTER DB: alimonitor.cern.ch (data collecting and Grid monitoring)
  • SPARE DB: aliweb01.cern.ch (Grid analysis)
  • Online replication between the two

Monitored parameters
Source             Category               Number  Examples
AliEn API          CE load factors            63  Run load, queue load
                   SE occupancy               62  Used space, free space, number of files
                   Job information           557  Running, saving, done, failed
Soap calls         CERN Network traffic       29  MBs, files
                   LCG CPU – Jobs             48  Free CPUs, jobs running and waiting
ML services on MQ  Job summary                34  Running, saving, done, failed
                   AliEn parameters           15  MySQL load, Perl processes
ML services        Sites info               1060  Paging, threads, I/O, processes
Total                                       1868

Derived classes…
Job execution efficiency  Successfully done jobs / all submitted jobs
System efficiency         Error (CE) free jobs / all submitted jobs
AliRoot efficiency        Error (AliROOT) free jobs / all submitted jobs
Resource efficiency       Running (queued) jobs / max_running (queued)

Extensions
• Job monitoring by user
  • AliEn “ps –xxx” commands
  • Job’s JDL
  • Results presented in the same web front end
• Repository Web Services
• Application Monitoring (ApMon) at WNs
• Grid Analysis
  • Repository interfaced to ROOT and Carrot

Section III
PDC 2004 Monitoring and Results

Phase 1 (simulation)
[Plots, sum of all sites: successfully done jobs / all submitted jobs; Error (CE) free jobs / all submitted jobs; Error (AliROOT) free jobs / all submitted jobs]
• Start 10/03, end 29/05 (58 days active)
• Maximum jobs running in parallel: 1450
• Average during active period: 430

Phase 2 (merging)
• as in the 1st phase, general equilibrium in CPU contribution
• no single site dominating the production
• jobs successfully done: 76% AliEn, 24% LCG

Jobs failure         Reason                                            Rate
Submission           CE scheduler not responding                         1%
Loading input data   Remote SE not responding                            3%
During execution     Job aborted, not started, killed, WN malfunction   10%
Saving output data   Local SE not responding                             2%

Phase 3 (analysis)
• Occupancy changes with respect to the number of queued jobs in the local batch system

Salutations…

Credits
• Federico, Predrag and Peter: they could pick up another TS
• Latchezar: continuous help and suggestions, review of my thesis
• MonALISA team: collaborative anytime I needed
• Guenter: very useful integrations
• my fiancee: moral support: “did they hire you just to look at some plots?”

…thanks to all
…and all the others I couldn’t find a pic!
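
The derived efficiency classes on the "Monitored parameters" slide are plain ratios, so they can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the names JobCounts, efficiencies and resource_efficiency are hypothetical and not part of AliEn or MonALISA, and "Error (CE) free jobs" is read here as "submitted jobs minus jobs that ended with a CE error", which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass
class JobCounts:
    """Hypothetical container for job counters taken from the repository."""
    submitted: int      # all submitted jobs
    done: int           # successfully done jobs
    error_ce: int       # jobs that failed with a CE-related error (assumed meaning)
    error_aliroot: int  # jobs that failed with an AliROOT error (assumed meaning)


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def efficiencies(c: JobCounts) -> dict:
    """Compute the three job-based derived classes named on the slide."""
    return {
        # Successfully done jobs / all submitted jobs
        "job_execution": ratio(c.done, c.submitted),
        # Error (CE) free jobs / all submitted jobs
        "system": ratio(c.submitted - c.error_ce, c.submitted),
        # Error (AliROOT) free jobs / all submitted jobs
        "aliroot": ratio(c.submitted - c.error_aliroot, c.submitted),
    }


def resource_efficiency(current: int, maximum: int) -> float:
    """Running (or queued) jobs / max_running (or max queued)."""
    return ratio(current, maximum)


if __name__ == "__main__":
    counts = JobCounts(submitted=100, done=80, error_ce=5, error_aliroot=10)
    print(efficiencies(counts))
    print(resource_efficiency(430, 1450))
```

The numbers in the usage lines are made up for illustration, apart from 430 and 1450, which are the Phase 1 average and maximum of jobs running in parallel; using the maximum as max_running is itself an assumption.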
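
The averaging process on the "Repository DataBase(s)" slide (FIFO, 60 bins per basic information, at 1 min, 10 min and 100 min) can be pictured with the sketch below. It is not the repository's actual code: the class name BinnedHistory is hypothetical, and the cascade rule (every 10 values of one level are averaged into one value of the next level) is an assumption inferred from the 1/10/100 min spacing.

```python
from collections import deque


class BinnedHistory:
    """Fixed-size FIFOs at increasing time granularity (assumed cascade)."""

    def __init__(self, levels: int = 3, bins: int = 60, factor: int = 10):
        self.factor = factor
        # One FIFO per granularity (e.g. 1 min, 10 min, 100 min); when a
        # FIFO is full, appending drops its oldest bin automatically.
        self.fifos = [deque(maxlen=bins) for _ in range(levels)]
        # Values waiting to be averaged into the next, coarser level.
        self.pending = [[] for _ in range(levels - 1)]

    def add(self, value: float, level: int = 0) -> None:
        """Store a value at `level` and propagate averages upward."""
        self.fifos[level].append(value)
        if level + 1 < len(self.fifos):
            buf = self.pending[level]
            buf.append(value)
            if len(buf) == self.factor:
                average = sum(buf) / self.factor
                buf.clear()
                self.add(average, level + 1)


if __name__ == "__main__":
    history = BinnedHistory()
    for minute in range(100):          # 100 one-minute samples
        history.add(float(minute))
    print(len(history.fifos[0]))       # bins kept at 1 min granularity
    print(list(history.fifos[1]))      # 10 min averages
    print(list(history.fifos[2]))      # 100 min averages
```

With this layout each monitored parameter keeps a bounded amount of history (at most 60 bins per granularity), trading resolution for time span as data gets older.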