Monitoring of a distributed
computing system:
the AliEn Grid
Marco MEONI
Alice Offline weekly meeting
Thursday 3rd February 2005
CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18
Content
• Document I’ve been working on since mid Dec 2004
• ~100 pages up to now
• Not too far from the final version
• Available on http://... (let me discuss the thesis first)
1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusion and Outlooks
~ 65 pages
~ 35 pages
Section I
Grid Concepts and Monitoring
Grid, ALICE, AliEn
• Grid Computing overview
“coordinated use of large sets of different, geographically distributed resources in
order to allow high-performance computation”
• ALICE experiment and ALICE Off-line
• AliEn
• PULL rather than PUSH architecture,
• scheduling service does not need to know the status of all other resources in
the system,
• robust and fault tolerant system where resources can come and go at any
point in time.
• possible to interface an entire foreign Grid as a large Computing and
Storage Element (LCG)
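The PULL model above can be illustrated with a minimal sketch, in which each computing element asks a central queue for work only when it has free capacity, so the queue never needs a global view of the resources. All class names, method names and job identifiers here are hypothetical and are not AliEn APIs.

```python
from collections import deque


class TaskQueue:
    """Central queue (hypothetical): holds waiting jobs, hands them out on request."""

    def __init__(self, jobs):
        self._waiting = deque(jobs)

    def request_job(self, free_cpus):
        # The queue only reacts to requests; it keeps no record of site status.
        if free_cpus > 0 and self._waiting:
            return self._waiting.popleft()
        return None


class ComputingElement:
    """A site that pulls work when it has free CPUs (hypothetical model)."""

    def __init__(self, name, free_cpus):
        self.name = name
        self.free_cpus = free_cpus
        self.running = []

    def pull(self, queue):
        job = queue.request_job(self.free_cpus)
        if job is not None:
            self.running.append(job)
            self.free_cpus -= 1
        return job


queue = TaskQueue(["job-1", "job-2", "job-3"])
sites = [ComputingElement("site-a", 2), ComputingElement("site-b", 1)]

# Each site keeps pulling until it is full or the queue is empty.
for ce in sites:
    while ce.pull(queue) is not None:
        pass

assignments = {ce.name: ce.running for ce in sites}
```

A site that disappears simply stops pulling, which is one way to read the "resources can come and go" point: nothing in the queue has to be updated.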
Grid Monitoring
• GMA architecture
• R-GMA: an example of implementation
• Jini (Sun) provides the technical basis
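[Figure: GMA components. The Producer stores its location in the Registry; the Consumer looks up that location in the Registry; data is then transferred from Producer to Consumer.]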
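The producer, registry and consumer roles of GMA can be sketched in a few lines. This is only an in-memory illustration of the store/lookup/transfer pattern; the class names, the `cpu_load` topic and the sample readings are assumptions, and no R-GMA or Jini API is used.

```python
class Registry:
    """Directory that maps a topic to the producer publishing it."""

    def __init__(self):
        self._locations = {}

    def store(self, topic, producer):
        self._locations[topic] = producer

    def lookup(self, topic):
        return self._locations.get(topic)


class Producer:
    """Publishes monitoring data and advertises itself in the registry."""

    def __init__(self, topic, readings):
        self.topic = topic
        self._readings = list(readings)

    def register(self, registry):
        registry.store(self.topic, self)

    def transfer(self):
        # Data goes directly from producer to consumer, not via the registry.
        return list(self._readings)


class Consumer:
    """Finds a producer through the registry, then fetches data from it."""

    def fetch(self, registry, topic):
        producer = registry.lookup(topic)
        if producer is None:
            return []
        return producer.transfer()


registry = Registry()
Producer("cpu_load", [0.4, 0.7]).register(registry)
received = Consumer().fetch(registry, "cpu_load")
```

The registry holds only locations, so the monitored data itself never passes through it.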
Section II
MonALISA Adaptations and Extensions
MonALISA Adaptations
• A WEB Repository as a front-end
• Stores history of the monitored data
• Plots any kind of chart
• Interfaces to user code
(custom consumers, config modules, new charts, distributions)
• Farms monitoring
• A user Java class interfaces MonALISA with a bash script that monitors the site
[Figure: farm monitoring chain. On ALICE's resources (CE and WNs), a bash monitoring script produces the monitored data; a Java interface class (user code) passes it to the MonALISA agent (MonALISA framework).]
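As a rough illustration of the interface step, the sketch below parses the text output of a site monitoring script into parameter/value pairs. The `name=value` output format, the parameter names and the function name are all assumptions made for this example; the actual script output and the MonALISA module interface are not shown in the slides.

```python
def parse_monitoring_output(text):
    """Turn 'name=value' lines (assumed format) into a dict of floats."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, raw = line.partition("=")
        try:
            values[name.strip()] = float(raw)
        except ValueError:
            continue  # ignore malformed lines instead of failing the whole run
    return values


# Hypothetical output of the site monitoring script.
sample_output = """
# site monitoring snapshot
run_load=0.82
queue_load=0.35
free_space_gb=1200
broken line
"""

parsed = parse_monitoring_output(sample_output)
```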
AliEn Jobs Monitoring
• If the Grid executes jobs then it works!
• Centralized or distributed?
• AliEn native APIs to retrieve job status snapshots
[Figure: AliEn job status flow, starting at "Job is submitted", with error states Error_I, Error_A, Error_S, Error_E, Error_R, Error_V/VT/VN and Error_SV, and timeouts labeled >1h and >3h.]
• Additional Java thread to feed the repository directly
[Figure: monitored data flows through an ad hoc Java thread into the Repository, which is served by TOMCAT (JSP/servlets).]
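The idea of turning a job status snapshot into monitored values can be sketched as follows. The snapshot layout (a list of job id and status pairs), the status spellings and the grouping of every `ERROR_*` state under "failed" are assumptions for illustration; the AliEn API calls that produce the snapshot are not reproduced here.

```python
from collections import Counter


def summarize_snapshot(snapshot):
    """Count jobs per status; all ERROR_* states are grouped as 'FAILED'."""
    counts = Counter()
    for _job_id, status in snapshot:
        key = "FAILED" if status.startswith("ERROR_") else status
        counts[key] += 1
    return counts


# Hypothetical snapshot of (job id, status) pairs.
snapshot = [
    (101, "RUNNING"),
    (102, "RUNNING"),
    (103, "SAVING"),
    (104, "DONE"),
    (105, "ERROR_E"),
    (106, "ERROR_SV"),
]

counts = summarize_snapshot(snapshot)
```

A feeder thread could compute such counts at each polling interval and store them as one record per status.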
Repository DataBase(s)
• 7.5 GB of monitored information, 52M records
• During the Data Challenges (DCs), data from ~2K monitored parameters arrive every 2-3 minutes
[Figure: averaging process. Each basic piece of information is kept in FIFO buffers of 60 bins at 1 min, 10 min and 100 min resolution.]
• Data Replication (online replication from master to spare):
• MASTER DB: alimonitor.cern.ch, data collecting and Grid Monitoring
• SPARE DB: aliweb01.cern.ch, Grid Analysis
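The averaging process on this slide (60 bins per resolution, FIFO) might look like the sketch below. It assumes that every 10 values at one resolution are averaged into a single value at the next coarser one, which matches the 1 min / 10 min / 100 min steps but is not confirmed by the slides; the class and attribute names are hypothetical.

```python
from collections import deque


class MultiResolutionHistory:
    """Fixed-size history at several resolutions (1, 10, 100 min by default)."""

    def __init__(self, bins=60, factor=10, levels=3):
        self.factor = factor
        # One FIFO per resolution; deque(maxlen=...) drops the oldest bin.
        self.levels = [deque(maxlen=bins) for _ in range(levels)]
        # Values waiting to be averaged into the next coarser level.
        self._pending = [[] for _ in range(levels - 1)]

    def add(self, value, level=0):
        self.levels[level].append(value)
        if level + 1 < len(self.levels):
            pending = self._pending[level]
            pending.append(value)
            if len(pending) == self.factor:
                average = sum(pending) / self.factor
                pending.clear()
                self.add(average, level + 1)


history = MultiResolutionHistory()
for minute in range(100):  # 100 one-minute samples with values 0..99
    history.add(float(minute))
```

With this layout the storage per parameter stays bounded (at most 180 values here) while still covering roughly 1 hour, 10 hours and 100 hours of history.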
Monitored parameters
Source            | Category             | Number | Examples
AliEn API         | CE load factors      | 63     | Run load, queue load
AliEn API         | SE occupancy         | 62     | Used space, free space, files number
AliEn API         | Job information      | 557    | Running, saving, done, failed
Soap calls        | CERN Network traffic | 29     | MBs, files
LCG               | CPU – Jobs           | 48     | Free CPUs, job running and waiting
ML services on MQ | Job summary          | 34     | Running, saving, done, failed
ML services on MQ | AliEn parameters     | 15     | MySQL load, Perl processes
ML services       | Sites info           | 1060   | Paging, threads, I/O, processes
Total             |                      | 1868   |
Derived classes…
Job execution efficiency = successfully done jobs / all submitted jobs
System efficiency = error (CE) free jobs / all submitted jobs
AliRoot efficiency = error (AliROOT) free jobs / all submitted jobs
Resource efficiency = running (queued) jobs / max_running (queued)
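The derived ratios can be computed from basic job counters as in this sketch. It assumes that "error free jobs" means submitted jobs minus jobs that hit that error class; the function name, argument names and the sample numbers are hypothetical and are not PDC 2004 data.

```python
def ratio(numerator, denominator):
    """Safe division: an empty denominator gives 0.0 instead of an exception."""
    return numerator / denominator if denominator else 0.0


def derived_efficiencies(submitted, done, ce_errors, aliroot_errors,
                         running, max_running):
    return {
        "job_execution": ratio(done, submitted),
        # Assumption: error-free jobs = submitted jobs minus jobs with that error.
        "system": ratio(submitted - ce_errors, submitted),
        "aliroot": ratio(submitted - aliroot_errors, submitted),
        "resource": ratio(running, max_running),
    }


# Made-up counters, for illustration only.
example = derived_efficiencies(
    submitted=1000, done=800, ce_errors=100, aliroot_errors=50,
    running=300, max_running=400,
)
```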
Extensions
• Job monitoring by user
  • AliEn “ps –xxx” commands
  • Job’s JDL
  • Results presented in the same web front end
• Repository Web Services
• Application Monitoring (ApMon) at WNs
• Grid Analysis
  • Repository interfaced to ROOT and Carrot
Section III
PDC 2004 Monitoring and Results
Phase 1 (simulation)
[Figure: sum of all sites. Plots of successfully done jobs / all submitted jobs, error (CE) free jobs / all submitted jobs, and error (AliROOT) free jobs / all submitted jobs.]
• Start 10/03, end 29/05 (58 days active)
• Maximum jobs running in parallel: 1450
• Average during active period: 430
Phase 2 (merging)
• As in the 1st phase, general equilibrium in CPU contribution
• No single site dominating the production
• Jobs successfully done: 76% AliEn, 24% LCG
Jobs failure       | Reason                                           | Rate
Submission         | CE scheduler not responding                      | 1%
Loading input data | Remote SE not responding                         | 3%
During execution   | Job aborted, not started, killed, WN malfunction | 10%
Saving output data | Local SE not responding                          | 2%
Phase 3 (analysis)
• Occupancy changes with respect to the number of queued jobs in the local batch system
Salutations…
Credits
• Federico, Predrag and Peter
they could pick up another TS
• Latchezar
continuous help and suggestions, review of my thesis
• MonALISA team
collaborative whenever I needed it
• Guenter
very useful integrations
• my fiancee
moral support: “did they hire you just to look at some plots?”
…thanks to all
…and all the others I couldn’t find a pic of!