
GRID Architecture part 2:
the Workload Management
System
Alessandro Paolini (INFN-CNAF)
2nd INFN training course for GRID site administrators
ICTP, Trieste
24–28 November 2008
1
The GRID
2
Summary
• Overview of the Workload Management System
• Job lifecycle
• A somewhat more exhaustive view… spiced with an attempt at troubleshooting
3
Glossary
• BDII: Berkeley Database Information Index
• CE: Computing Element
• GASS: Global Access to Secondary Storage
• GIIS: Grid Information Index Service
• GRAM: Globus Resource Allocation Manager
• GRIS: Grid Resource Information Service
• GSI: Grid Security Infrastructure
• GUID: Globally Unique Identifier
• IS: Information System
• JC: Job Controller
• LB: Logging and Bookkeeping
• LCAS: Local Centre Authorisation System
• LCMAPS: Local Credential MAPping Service
• LM: Log Monitor
• MDS: Metadata Directory Service
• NS: Network Server
• PRS: Proxy Renewal Service
• PS: Proxy Server
• RB: Resource Broker
• UI: User Interface
• VO: Virtual Organisation
• WM: Workload Manager
• WMS: Workload Management System or Server
• WN: Worker Node
4
WMS architecture
The gLite WMS is deployed on five kinds of machines:
• User Interface (UI)
• Workload Management Service (WMS)
• Computing Element (CE)
• Worker Node (WN)
• Proxy Server (PS)
5
User Interface
• Allows users to access the functionalities of the WMS
• It provides an interface to the WMS:
  – a command line interface
  – a graphical interface
  – a C++ programming interface
• The basic functionalities (a sample session is sketched below):
  – list the compatible resources
    • given a set of job requirements
  – submit a job
  – get the job status
  – cancel a job
  – retrieve logging information of a job
  – retrieve the output of a job
6
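As a reference, a minimal UI command-line session covering these functionalities. This is a sketch: the JDL content and the file name test.jdl are illustrative; the commands themselves are the standard gLite 3.1 WMProxy ones.

# a trivial JDL file (hypothetical example)
cat > test.jdl <<EOF
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out","std.err"};
EOF
glite-wms-job-list-match -a test.jdl   # list the compatible resources
glite-wms-job-submit -a test.jdl       # submit the job (-a: automatic delegation)
glite-wms-job-status <jobid>           # get the job status
glite-wms-job-logging-info <jobid>     # retrieve logging information
glite-wms-job-output <jobid>           # retrieve the output
glite-wms-job-cancel <jobid>           # cancel the job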
WMS architecture
• A collection of services
  – Usually (but not necessarily) running on the same node
• The WMProxy
  – Based on web services
  – Accepts incoming requests from a UI
  – Authenticates the user to the pool accounts
  – Copies the Input and Output Sandbox between WMS and UI
  – Registers the user proxy for periodic renewal
    • See Proxy Renewal Service
  – Forwards the requests to the WM
• The Workload Manager (WM)
  – Core component of the WMS
  – Takes appropriate actions to satisfy requests
  – Finds the resources that best match the request
    • Interacts with the IS and the File Catalog
  – Calls the JC
7
Submitting models
From the initial design, two submission models are provided:
• eager scheduling (“push” model)
  a job is bound to a resource as soon as possible. Once the decision has been taken, the job is passed to the selected resource for execution
• lazy scheduling (“pull” model)
  the job is held by the WM until a resource becomes available. When this happens, the resource is matched against the submitted jobs
At the moment only the push model is used, for backward compatibility: the pull model requires the gLite CE, which has been abandoned.
8
WMS architecture
• The Job Controller (JC)
  – Submits the job to Condor-G
• The Condor Gridmanager component
  – Responsible for performing the actual job management operations
    • job submission to the CE
    • job removal
  – Submits an extra job (the grid monitor) to monitor the user jobs
    • One grid monitor per CE per user
• The Log Monitor (LM) component:
  – is responsible for
    • watching the CondorG log file
    • intercepting interesting events concerning active jobs
  – In case of job failure, the LM informs the WM for (optional) resubmission to another CE
9
WMS architecture
• The Proxy Renewal Service is responsible for ensuring that:
  – for the whole lifetime of a job, a valid user proxy exists within the WMS
  – the MyProxy Server is contacted in order to renew the user's credentials
• The LB PROXY (only on the gLite WMS)
  – A local cache of the LB
• Logging & Bookkeeping (LB) is responsible for:
  – storing events generated by the various Grid components (UI, WMS, WN…)
  – providing this information on user request (job-status, job-logging-info)
• ICE is the component responsible for submitting jobs to the new (just released) CREAM CE. It will replace Condor completely.
10
Computing Element
• The Grid interface to a computing cluster. The word “CE” refers to both:
  – The host where the Grid services run
  – The Grid identifier of a local batch system queue
    • <hostname>:<port>/<bsys>-<qname>
• The CE runs a gatekeeper
  – Accepts jobs from Condor-G
  – Creates a job manager (JM) per job
    • Generic interface to the batch system
  – The JM only submits or cancels a job
  – The grid monitor queries the status of the jobs
    • One instance per CE per user
• The local batch system
  – Last element of the chain
  – Often a server runs on the CE node
11
Worker Node
• It is the host executing the job
• A set of WNs managed by a CE constitutes a computing cluster
• A cluster MUST (should) be homogeneous
  – Similar hardware
  – Same OS, configuration …
• The gLite WN does NOT run any service
  – Requires a minimal amount of Grid middleware
• The WN runs the WP1 job wrapper
  – Wrapper around the user executable
  – Transports the input/output sandbox from/to the WMS
12
Proxy server
• In gLite users authenticate to services through proxy certificates
  – The shorter the proxy lifetime, the more secure the mechanism
  – The default is a 12-hour proxy
• For long jobs a proxy renewal mechanism is employed (the user-side commands are sketched below):
  – A Proxy Server (PS)
    • Usually runs on a separate host
    • Stores long-lived user proxies
    • Generates short-lived user proxies starting from the long-lived one
  – A Proxy Renewal Service (PRS)
    • Runs on the WMS
    • Contacts the PS to “refresh” the short-lived proxies before their expiration
13
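A sketch of the user-side sequence that enables renewal. The MyProxy server name is an example; -d uses the certificate subject as the MyProxy username, and -n stores the credential without a passphrase so that the PRS can renew it unattended.

voms-proxy-init --voms compchem              # create the short-lived (12 h) VOMS proxy
myproxy-init -s myproxy.cnaf.infn.it -d -n   # store a long-lived proxy on the PS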
The lifecycle of a job in gLite
[Diagram: after authorisation and authentication, the user submits the JDL and the Input “sandbox” from the UI to the WMS; the WMS queries the Information Service and the File Catalogue (DataSets info), expands the JDL, and submits the job as Globus RSL to the Computing Element through the Job Submission Service; job submit events and job status are published to Logging & Book-keeping, which the user can query; the job accesses data on a Storage Element, and the Output “sandbox” is returned to the user.]
14
Jobs State Machine (1/2)
• Submitted: job submitted but not yet transferred to WMProxy for processing
• Waiting: job accepted by WMProxy and waiting for Workload Manager processing
• Ready: job processed by the WM but not yet transferred to the CE
• Scheduled: job waiting in the queue on the CE
• Running: job is running on a WN
15
Jobs State Machine (2/2)
• Done: job exited or is considered to be in a terminal state by Condor (e.g., submission to the CE has failed in an unrecoverable way)
• Aborted: job processing was aborted by the WMS (waiting in the WM queue or on the CE for too long, expiration of user credentials)
• Cancelled: job has been successfully cancelled on user request
• Cleared: output sandbox was transferred to the user or removed due to a timeout
16
Installing/configuring a WMS/LB
• Using IG YAIM is certainly the easiest way to install a WMS
• ig-yaim supports the following profiles:
  • ig_WMS
  • ig_LB
  • ig_WMSLB (now obsolete, not to be used)
  • ig_RB (no longer supported)
• If the machine is expected to be heavily loaded, it is strongly advised to use the separate profiles on different machines
Important site-info.def parameters (a sketch follows below):
• WMS_HOST (RB_HOST)
• LB_HOST
• PX_HOST
• BDII_HOST
• MYSQL_PASSWORD
• all the usual users/groups.conf, java, domain, VO… settings
17
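A sketch of the corresponding site-info.def fragment. All hostnames are illustrative (mostly reused from examples later in these slides); the BDII host in particular is hypothetical.

WMS_HOST=gridit-wms-01.cnaf.infn.it
LB_HOST=albalonga.cnaf.infn.it
PX_HOST=myproxy.cnaf.infn.it
BDII_HOST=egee-bdii.cnaf.infn.it   # hypothetical top BDII
MYSQL_PASSWORD=<a strong password>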
WMS/LB Installation with YAIM
Installing with yum on i386 32-bit SL(C)4:
yum clean all
yum install ig_WMS [ig_LB]
/opt/glite/yaim/bin/ig_yaim -c -s <site.def> -n ig_WMS [-n ig_LB]
A note for the separated LB
• To have the system working you need to create the file
  "/opt/glite/etc/LB-super-users" to allow the WMS to use the LB
• You can allow more than one WMS
[root@albalonga root]# less /opt/glite/etc/LB-super-users
/C=IT/O=INFN/OU=Host/L=CNAF/CN=glite-rb-00.cnaf.infn.it
/C=IT/O=INFN/OU=Host/L=CNAF/CN=gridit-wms-01.cnaf.infn.it
/C=IT/O=INFN/OU=Host/L=CNAF/CN=egee-wms-01.cnaf.infn.it
18
Running services 1
•
On a WMS node:
[root@gridit-wms-01 ~]# service gLite status
*** globus-gridftp:
globus-gridftp-server (pid 26470) is running...
*** glite-wms-wmproxy:
WMProxy httpd listening on port 7443
httpd (pid 26486 25459 25458 25457 25456 25455 25454 25419) is running ....
===
WMProxy Server running instances:
UID        PID  PPID  C STIME TTY          TIME CMD
*** glite-wms-wm:
/opt/glite/bin/glite-wms-workload_manager (pid 25246) is running...
*** glite-wms-lm:
Logmonitor running...
*** glite-wms-jc:
JobController running in pid: 25288
CondorG master running in pid: 25320
CondorG schedd running in pid: 25325
*** glite-proxy-renewald:
glite-proxy-renewd running
*** glite-lb-proxy:
glite-lb-proxy running as 25738
*** glite-lb-locallogger:
glite-lb-logd running
glite-lb-interlogd running
19
Running services 2
• On an LB node:
[root@albalonga root]# service gLite status
*** glite-lb-locallogger:
glite-lb-logd running
glite-lb-interlogd running
*** glite-lb-bkserverd:
glite-lb-notif-interlogd running
glite-lb-bkserverd running as 6156

You need mysql running:
mysql> show databases;
+------------+
| Database   |
+------------+
| lbserver20 |
| mysql      |
| test       |
+------------+

PAY ATTENTION: you need mysql also on a WMS-only node, because of lbproxy:
mysql> show databases;
+----------+
| Database |
+----------+
| lbproxy  |
| mysql    |
| test     |
+----------+
3 rows in set (0.00 sec)
20
Authentication on WMS
The WMProxy:
• Performs authentication and authorization
• All the classical GSI/PKI handshaking problems can arise:
  – Missing CA
  – Expired CRL
  – VOMS server certificates not present or expired
  – NTP misconfigured
  – Host certificates missing or expired
  – Pool accounts run out
• Uses the new-style grid-mapfile for user mapping (plain user DNs are no longer used):
"/compchem/Role=SoftwareManager/Capability=NULL" .sgmcompchem
"/compchem/Role=SoftwareManager" .sgmcompchem
"/compchem/Role=NULL/Capability=NULL" .compchem
"/compchem" .compchem
21
Authentication on WMS
The WMProxy:
• Transfers the Input Sandbox into a purposely created directory:
  /var/glite/SandboxDir/<job initials>/https_<some string with jobID>/input
• Obtains a delegated full proxy from the user proxy
  – stores it into /var/glite/SandboxDir/<…>/<…>/ (configurable)
  – registers it for renewal with the ProxyRenewal service
    • under /var/glite/spool/glite-renewd
-rw------- 1 glite glite 169 Nov 7 13:52 fbb6db39f3181918cd8a0bc245861fa7.data
-rw------- 1 glite glite 6145 Nov 7 13:52 fbb6db39f3181918cd8a0bc245861fa7.0
[root@gridit-wms-01 ~]# openssl x509 -in /var/glite/spool/glite-renewd/fbb6db39f3181918cd8a0bc245861fa7.0 -noout -subject
subject= /C=IT/O=INFN/OU=Personal Certificate/L=Pisa/CN=XXX/CN=proxy/CN=proxy
[root@gridit-wms-01 ~]# less /var/glite/spool/glite-renewd/fbb6db39f3181918cd8a0bc245861fa7.data
suffix=0, unique=0, voms_exts=1, server=myproxy.cnaf.infn.it, next_renewal=1229576004, end_time=1229585004, jobid=https://lb009.cnaf.infn.it:9000/UcXF9ost_yGM9Qlan-8NDg
• NOTE: this happens even if a long-lived proxy has never been registered in the Proxy Server
• If everything goes fine, the job is enqueued to the WM, triggering the WAITING status.
22
Authentication on WMS
The WMProxy:
• Allows the submission of new types of jobs: collection, parametric, checkpointable:
  – https://edms.cern.ch/file/590869/1/EGEE-JRA1-TEC-590869-JDLAttributes-v0-8.pdf
• Logs to:
  – /var/log/glite/wmproxy.log
  – /var/log/glite/glite-wms-wmproxy-purge-proxycache.log
  – /var/log/glite/httpd-wmproxy-errors.log
  – /var/log/glite/httpd-wmproxy-access.log
• Accepts ONLY submissions done with a VOMS proxy
• A limiter prevents submission if the machine load is too high
• Authorization is done using an XML file:
  – /opt/glite/etc/glite_wms_wmproxy.gacl (SUPPORTED VOs MUST BE THERE; quick checks are sketched below)
• Also uses a dedicated conf file (in addition to glite_wms.conf):
  – /opt/glite/etc/glite_wms_wmproxy_httpd.conf (DO NOT MODIFY)
23
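Two quick checks when a VO cannot submit, a sketch using the paths listed above (replace compchem with the VO in question):

grep compchem /opt/glite/etc/glite_wms_wmproxy.gacl   # is the VO authorized at all?
tail -f /var/log/glite/wmproxy.log                    # watch the requests coming in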
Workload manager (WM)
• The Workload Manager component performs the matchmaking
  – The gLite WMS has an IS snapshot cached locally: the Information Supermarket (ISM): /var/glite/workload_manager/ismdump.fl
  – A series of purchasers collect information from various Information Systems (e.g. the BDII) and put it into the ISM
  – By default the ISM is updated every 5 minutes
  – If the ISM is empty the job is Aborted (“No matching resource”) → probably there are problems contacting the IS (see the next slides and the quick check below)
  – The WM contacts the File Catalog in case of input-data job requirements
24
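A quick sanity check of the ISM, as a sketch: it assumes the dump file contains one GlueCEUniqueID attribute per known CE, which is how the cached classads are usually keyed. A count of 0 suggests the ISM is empty.

grep -c GlueCEUniqueID /var/glite/workload_manager/ismdump.fl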
Workload manager (WM)
• Calculates the ranking of all the matched resources
  – It selects the resource with the best ranking
  – In case of resources with equal ranking it selects a random one among them
• Passes the job to the Job Controller for submission to Condor, triggering the READY status
  – The job is enqueued in a configurable text file
• In case of failure the job goes back to the WM
  – It submits the job to a different resource
  – If the number of resubmissions exceeds the maximum number of possible attempts (defined by the user), the WM gives up and the job ends up FAILED
• Logs to /var/log/glite/workload_manager_events.log
• Input file (DO NOT MODIFY if you don't know what you are doing):
  – /var/glite/workload_manager/input.fl
25
The IS kills the WMS …
• The WM needs to contact the IS to discover resources and obtain values for external attributes in requirements and ranking
  – The BDII could be down (or the ISM is empty)
  – The requirements could restrict to a single site which is temporarily missing from the BDII (Information System glitch)
  – The queue required by the user is closed:
    • GlueCEStateStatus: Draining
  – FCR removed the site (removal of the GlueCEAccessControlBaseRule parameter)
  – Your site could be down
• In such cases, the job fails with the message
  “Brokerhelper: Cannot plan. No compatible resources”
  (a direct BDII query is sketched below)
26
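To verify what the IS really publishes, you can query the BDII directly. The hostname is an example; port 2170 and base o=grid are the standard top-BDII settings.

ldapsearch -x -H ldap://egee-bdii.cnaf.infn.it:2170 -b o=grid \
  '(GlueCEUniqueID=<your CE id>)' GlueCEStateStatus GlueCEAccessControlBaseRule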
The Job Controller
From glite_wms.conf:
JobController = [
[…]
CondorSubmitDag = "${CONDORG_INSTALL_PATH}/bin/condor_submit_dag";
CondorRelease = "${CONDORG_INSTALL_PATH}/bin/condor_release";
SubmitFileDir = "${EDG_WL_TMP}/jobcontrol/submit";
OutputFileDir = "${EDG_WL_TMP}/jobcontrol/cond";
Input = "${EDG_WL_TMP}/jobcontrol/queue.fl";
LockFile = "${EDG_WL_TMP}/jobcontrol/lock";
LogFile = "${EDG_WL_TMP}/jobcontrol/log/events.log";
[…]
];
• Creates the directory for the condor job
  – This is where Condor stdout and stderr will be stored
• Creates the job wrapper
  – A shell script around the user executable
• Creates the condor submit file
  – From the JDL string representation
• Converts the condor submit file into a ClassAd
  – Understood by Condor
• Submits the job to the CondorG cluster (as a job of type “Grid”)
  – via the condor scheduler
27
Condor-G
• CondorG consists of two elements:
  – The condor_gridmanager process
  – The Globus Ascii Helper Protocol (gahp) server
• The condor_gridmanager process
  – One single process per user
    • Handles all the jobs of the same user
  – Interprets the ClassAd description and translates it into RSL
    • Understood by Globus
  – Passes the job description to the gahp server
• The gahp server (a very complicated object)
  – One single process per user
  – It is a GRAM client used to contact the globus-gatekeeper (on the CE)
  – There is a GASS server to receive/distribute messages/files from/to processes running on the CE
(a quick way to inspect the CondorG queue is sketched below)
28
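The CondorG queue on the WMS can be inspected directly. A sketch: the binary location is assumed to match the /opt/condor-c paths visible in the process listing on the next slide; -globus adds the grid-specific status columns.

/opt/condor-c/bin/condor_q -globus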
Condor-G
[root@egee-wms-01 ~]# ps auxfwww | grep condor
root  17481 0.0 0.0  3740   636 pts/0 S+ 11:26   0:00 \_ grep condor
glite   967 0.0 0.0  7756  2988 ?     Ss Nov05   5:09 /opt/condor-c/sbin/condor_master
glite   968 0.0 0.0  8268  3152 ?     Ss Nov05   3:37 \_ condor_collector -f
glite   970 1.0 0.8 40896 35224 ?     Ss Nov05 120:14 \_ condor_schedd -f
glite  1102 1.4 0.8 40556 33864 ?     S  Nov05 166:26 |  \_ condor_gridmanager -f -C (Owner=?="glite"&&JobUniverse==9) -S /tmp/condor_g_scratch.0xaa93958.970
glite  1110 0.0 0.0  7520  1648 ?     S  Nov05   0:10 |  \_ perl /var/local/condor/spool/cluster130147.ickpt.subproc0
glite  1118 0.0 0.0  7540  1700 ?     S  Nov05   0:02 |  \_ perl /var/local/condor/spool/cluster130148.ickpt.subproc0
glite  1122 0.0 0.0  6356  1644 ?     S  Nov05   0:00 |  \_ perl /var/local/condor/spool/cluster130149.ickpt.subproc0
glite   971 0.0 0.0  8012  2980 ?     Ss Nov05   0:38 \_ condor_negotiator -f
• The GRAM client sends the Condor job to the CE for execution
  – Contacts the globus-gatekeeper
  – The GRAM sandbox is shipped to the CE
    • Contains the Job Wrapper, a delegation of the user proxy, info for the GASS server, etc.
29
globus-gatekeeper
• Listens on port 2119
• Logs by default to /var/log/globus-gatekeeper.log
• Grants access to the Computing Element (a quick authentication test is sketched below)
  – Authentication is performed by loading the LCAS module
    • Basically does nothing …
    • Checks whether the user is banned (/opt/glite/etc/lcas/ban_users.db)
  – Authorization is performed by loading the LCMAPS module
    • Checks the VOMS credentials of the user
    • Maps the user to a unix account on the CE
NOTE: the CE grid-mapfile is a sum of old and new style; users without VOMS credentials are still allowed access
30
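A low-level authentication test against the gatekeeper, runnable from the WMS or a UI. globusrun -a performs authentication only, without running any job; the CE contact string is an example.

globusrun -a -r gridit-ce001.cnaf.infn.it:2119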
Authentication again…
All the GSI problems seen for job submission can also happen on the CE.
10 data transfer to the server failed
• Usually it means the Globus job manager on the CE cannot call back the RB/WMS (or the UI in tests)
• It can also occur when the proxy is not acceptable to LCAS/LCMAPS on the CE, e.g. because it is a plain grid proxy instead of a VOMS proxy (currently the lcg-CE should still accept both), or when the VOMS extensions have expired
  – Check the LCAS/LCMAPS configuration on the CE. For all VOMS servers of each supported VO the public key must be present in /etc/grid-security/vomsdir, otherwise there may be LCAS failures reported in /var/log/globus-gatekeeper.log:
LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509():
VOMS Signature error (failure)!
31
Authentication again…
7 authentication failed: GSS Major Status:…
• CRL? Checked! Host certificate? Checked! What's wrong? There might be something nastier …
• The clocks of the UI, WMS, CE and WN should be in sync …
  – if not, all sorts of authentication problems can come out
  – The error message depends on the lifetime of the proxy at job submission, how badly things are skewed, who is skewed with respect to whom, and in which direction …
• This is a nice one: the job gets submitted but then aborts almost immediately:
Current Status: Aborted
Status Reason: cannot retrieve previous matches for
https://lxb0704.cern.ch:9000/wW6q2yudAcAfRds2qwL7gQ
• But the more instructive WM log says
SSL Error (1413) - sslv3 alert bad certificate
  – The conclusion? Whenever you have an authentication problem you cannot explain, start by checking the clocks of all the involved services you have access to … (a quick check is sketched below)
32
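A quick way to compare clocks without changing anything: run the following on every host you can reach (ntpdate -q only queries the server, it does not set the clock).

ntpdate -q pool.ntp.org   # prints the local clock offset from the NTP server
date -u                   # compare the UTC timestamps across UI, WMS, CE and WN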
Jobmanager .. meet GRAM
• Once the user is mapped to a local account, the gatekeeper forks the globus-jobmanager
  – Offers an interface to the local batch system
    • It is batch-system specific, i.e. there is a JM for PBS, one for LSF …
  – After the authentication through the gatekeeper, the GRAM client communicates directly with the globus-jobmanager
    • The JM-GRAM communication is complicated! But it certainly needs some ports to be open on both WMS and CE
See https://twiki.cern.ch/twiki/bin/view/LCG/LCGPortTable
33
Jobmanager .. meet GRAM
• Watch out for the firewall
  – Don't forget that the CE sits at one site (most likely behind a firewall) and the WMS sits at some other site (most likely behind a firewall), and only occasionally do the two locations coincide …
  – If the WMS is not allowed to connect to the CE in the CE port range, one typically gets the following error for jobs submitted through that WMS:
Got a job held event, reason: Globus error 79: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...
    which is not really exhaustive ..
  – NOTE: a direct globus-job-run will work however, because it does not use the two-phase commit feature of GRAM
  – Ensure outgoing connections are allowed from the CE to the RB/WMS
  – Check on all the grid elements GLOBUS_TCP_PORT_RANGE="20000,25000" (a minimal check is sketched below)
34
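A minimal check from the WMS side. This is a sketch: 20000 is just one port inside the example range, and the telnet connection will only succeed while a jobmanager is actually listening on that port.

echo $GLOBUS_TCP_PORT_RANGE             # verify the range is set at all
telnet gridit-ce001.cnaf.infn.it 20000  # test whether the CE port range is reachable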
globus-jobmanager
• The globus-jobmanager creates and runs the jobmanager perl script
  – It gets executed to perform the submission to the local batch system
    • It prepares a batch-system-specific perl script around the job wrapper, which is the executable to be submitted to the WN
    • It submits the job to the batch system
  – After such submission, the globus-jobmanager is told to exit
• The CE is usually also a master of the batch system
• The globus-jobmanager is used to submit and cancel a job
  – The status of the job submitted to the BS is queried by a special process: the grid-monitor
  – A workaround invented by LCG to overcome overload problems of the Computing Element
prdlhcb 18154 0.0 0.0 7136 4432 ? S 13:24 0:00 globus-job-manager -conf /opt/globus/etc/globus-jobmanager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
prdlhcb 18194 0.0 0.0 6188 2788 ? S 13:24 0:00 perl /home/prdlhcb/.globus/.gass_cache/local/md5/ef/e08dc5e0533373cfbae685b3ace104/md5/14/049ed082194491ee9875e6fd1260a3/data –desturl=https://lcgwms02.gridpp.rl.ac.uk:50171/tmp/condor_g_scratch.0xc806a60.31915/grid-monitor.gridit-ce001.cnaf.infn.it:2119.2734/grid-monitor-job-status
35
grid-monitor
• It is submitted together with the user job by the condor_gridmanager
• Uses libraries common to the jobmanager perl script
  – This is how it becomes “aware” of the specific batch system calls
• It runs on the CE and it is “one process per user”
  – It gathers information about all the jobs of the same user on that CE
• Note: the globus-job-manager-marshal daemon on the gLite 3.1 lcg-CE will only allow a limited number of requests to run in parallel (5 by default), putting the rest into a queue that can be seen with "ps afuxwww"; if any account with pending requests has a problem, it can also cause jobs for other accounts to fail with the “10 data transfer” error!
Important!! On the CE check the running status of (see the sketch below):
• globus-job-manager-marshal
  – If down, nobody can be authorized on the CE
• globus-gass-cache-marshal
  – If down, nobody can be authorized on the WNs
36
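On the CE these two daemons can be checked with their init scripts. A sketch, assuming the script names match the daemon names, as on the gLite 3.1 lcg-CE:

/etc/init.d/globus-job-manager-marshal status
/etc/init.d/globus-gass-cache-marshal status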
Job on the Worker Node
• What gets submitted to the WN is a perl script
  – Sets up a bit of environment
  – Fetches via gridftp a tarball from the CE containing several GASS state files, GRAM stuff etc …
  – Unpacks such tarball
  – Stuff … stuff … stuff ..
  – Runs the Job Wrapper
• Together with the job come a delegated limited proxy and the Job Wrapper itself
  – It is through this limited proxy that the GRAM sandbox can be downloaded
  – In case of proxy expiration at runtime on the WN, it is the duty of the gahp server to provide a fresh proxy to the job
    • proxy obtained from the proxy server through the proxy-renewal service
37
A delicate phase…
• Submitting a job to the WN is a delicate phase:
  – If for some reason the WN cannot contact the gridftp server on the CE, the tarball cannot be downloaded and you get:
    “submit-helper script running on host lxb1761 gave error: cache_export_dir (<some dir>) on gatekeeper did not contain a cache_export_dir.tar archive”
  – Possible causes:
    • Some CRLs on the WN or CE are out of date. Run the cron job manually and check for errors
    • Check whether all of the latest CA rpms have been installed
    • The CE is not running a gridftp daemon. Check on the CE: /etc/init.d/globus-gridftp status
    • CE and WN are not time synchronized. Even a difference of less than 1 minute can cause a problem
38
A delicate phase…
• Submitting a job to the WN is a delicate phase:
  – On PBS, the sandbox is transferred from the CE to the WN via scp (initiated from the WN)
    • If this is not possible the batch submission fails, the job is put on hold and the error message will be:
      Unspecified gridmanager error
    • scp from the WN to the CE must work without a password (a quick test is sketched below)
    • Possible problem: duplicate entries for the WNs in the CE ssh configuration
    • Remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE and WNs
    • Re-run the following scripts on the CE (they are usually also cron jobs):
      – /opt/edg/sbin/edg-pbs-knownhosts and /opt/edg/sbin/edg-pbs-shostsequiv
    • Re-run the following script on the WN (it is usually also a cron job):
      – /opt/edg/sbin/edg-pbs-knownhosts
  – The last error is quite generic and applies whenever job submission at the batch system level fails
  – More often that message comes together with:
    Job got an error while in the CondorG queue.
    • the user has no permission to submit to the given queue;
    • the batch system is in some bad state (at least for some grid users);
    • there is a bad WN refusing or failing jobs, e.g. with a full partition;
39
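To verify the passwordless scp path, impersonate a pool account on a WN and copy a file to the CE. A sketch: the account and hostnames are examples; any password prompt or host-key question here means that job transfers will fail the same way.

su - compchem001 -c "scp /etc/hostname gridit-ce001.cnaf.infn.it:/tmp/scp-test"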
False Friends …
• Some error messages do not reflect the real cause of the trouble. Example:
  – A job fails with the status reason
    “Got a job held event, reason: Globus error 3: an I/O operation failed”
  – You might think you are having a network problem or a communication problem between grid elements.
  – Not necessarily. This error is mostly due to a shortage of memory on the WMS or CE or WN. From the ROLLOUT mailing list:
    • “The problem was that memory was very low. queue_submit() in Helper.pm of GRAM checks for memory and returns a NORESOURCES error if the free memory is less than 2% of the total. NORESOURCES is GRAM error 3, not necessarily I/O. The reason for that was that edg-wl-interlogd was using 717MB of RAM, so I restarted it with: /etc/init.d/edg-wl-locallogger restart”
  – The problem can also be due to a lack of disk space or quota, or a permission problem with the pool account home directory
  – I have to admit it: it could also be a hardware I/O error.
40
Job on the Worker Node
• Once the Job Wrapper is finally able to run on the WN:
  – It performs some basic operations
  – It downloads, via the gridftp server on the WMS, the user InputSandbox and the .Brokerinfo file
  – It starts running the user executable process
  – It logs various information (successes and failures) to the LB server
  – It writes the exit status of the user executable, together with any messages (in case of error), in a special .maradonaxxx file
• The stdout and stderr of the user executable are redirected to the files specified in the JDL
  – Mandatory attributes for a normal (i.e. non-interactive) job
• Once the user executable concludes, the Job Wrapper:
  – uploads the OutputSandbox to the WMS via gridftp
  – uploads the .maradona file to the WMS, also via gridftp
  – exits (end of the batch job)
41
Maradona strikes back…
Cannot read JobWrapper output, both from Condor and from Maradona
• This error means the user job exit status failed to be delivered to the WMS, even though two independent methods should have been tried:
  – The job wrapper script writes the user job exit status to stdout, which is supposed to be sent back to the WMS by Globus.
  – The user job exit status is written into an extra "Maradona" file that is copied to the WMS with globus-url-copy.
• Such failures are really “expensive”, because the job might have finished correctly but you just cannot retrieve the exit status
  – Consequently, the job is considered FAILED and you cannot retrieve the output
• When both methods fail, it usually means that the job did not run to completion!
  – Several causes: batch system problems, WN disk full, problems in home directories, time not synchronized between CE and WN, CRLs out of date, some CAs missing, …
42
Job termination
• Once the batch job terminates:
  – The grid-monitor communicates the events to the WMS
  – The WMS contacts the CE again on port 2119 to restart the globus-jobmanager, which
    • cleans things up
    • sends back the stderr and stdout of the condor job to the WMS
      – Stdout contains a single line with the exit code of the User Job
      – Stderr should be empty
  – The LogMonitor needs to figure out the User Job exit code
    • It looks first in the condor output file
    • If not present, it looks in the .maradona file
    • If neither of the two is present … well, it gives up (the Maradona error…)
  – In case the Job Wrapper completed successfully:
    • The job is declared done
    • Several files are removed
    • The proxy is unregistered from the proxy renewal service
  – Otherwise the LM passes the job to the WM for resubmission to a different resource, in case some RetryCount has been set
    • If the max number of attempts is reached, the job is marked as failed.
43
Troubleshooting
It is a really difficult task; there are no rules of thumb, since problems can be on the WMS, on the LB and finally on the CE.
A very rough grouping of the symptoms of a problem:
• The user cannot submit
  – 90% of the time there are authentication problems → check the networkserver or wmproxy logs
  – 10% of the time the networkserver or wmproxy daemons are not running
• The user can submit, but all jobs abort
  – See the next slides
• The user can submit but the job hangs forever (or for a very long time) in a non-final status
  – Probably there are problems with the upload of the output Sandbox (gridftpd running?)
  – There could be some certificate problems that cause the authentication to fail with the CEs but not with the user
  – Check whether the jobs all went to the same site: in this case the problem could be site specific
44
Troubleshooting
• If the status of the job does not evolve, whichever the status is, a possible problem is that the LB server daemons are dead → check the status of the LB server daemons
• If the LB server is OK and…
  – …the status is always “Submitted”:
    • The WMS cannot communicate with the LB → check the LB daemons on the WMS
  – …the status is always “Waiting”:
    • The WM daemon is probably dead
45
Troubleshooting
  – …the status is always “Ready”:
    • There are probably authentication problems between the WMS and the CE
    • The gridmanager cannot contact its counterpart on the CE
  – …the status is always “Scheduled”:
    • The CE can be very busy
    • The LRMS on the CE can be misconfigured
    • The WN cannot communicate with the LB to log the “ReallyRunning” event
    • The LM is dead
  – …the status is always “Running”:
    • The LRMS on the CE can be misconfigured
    • The WN cannot communicate with the WMS to upload the output Sandbox
    • The LM is dead
46
WMS architecture again..
The gLite WMS is deployed on five kinds of machines:
• User Interface (UI)
• Workload Management Service (WMS)
• Computing Element (CE)
• Worker Node (WN)
• Proxy Server (PS)
47
(My personal) Conclusion on debugging
• Debugging a job failure is a complicated task
  – The WMS middleware is very complicated
  – Error messages often are not exhaustive
  – Most likely you will not have access to all the resources
• It is impossible to create an exhaustive list of failures
  – Too many different types
  – Too many different boundary conditions
• The advice
  – You should try to understand the architecture as much as you can
    • Understanding the job flow is essential in order to understand what went wrong, and when … (and why…)
  – Start debugging … you will get the expertise bit by bit …
48